Microsoft has unveiled a new foundation model known as Magma that is capable of executing agentic tasks. The artificial intelligence (AI) system is trained on a diverse array of datasets spanning text, images, videos, and spatial data. The Redmond-based tech giant has positioned Magma as an advancement over vision-language (VL) models, asserting that it can not only comprehend multimodal content but also plan and take actions based on that information. This makes Magma suitable for applications such as computer vision, user interface (UI) navigation, and robotic manipulation.
Microsoft Announces Magma Foundation Model
In a detailed GitHub post, Microsoft researchers described the functionalities of the Magma foundation model. Unlike many large language models (LLMs), which are fine-tuned or otherwise derived from existing models, a foundation model is trained from the ground up and serves as the base on which subsequent models are built. What sets Magma apart is its comprehensive pre-training across varied datasets.
The underlying architecture of Magma is based on the Llama 3 AI model. However, its capabilities extend beyond standard outputs typical of chatbots; Magma can plan and act within visual and spatial contexts. This unique feature allows it to function as a computer vision chatbot that interprets and provides insights about the environment it perceives through camera sensors. Additionally, Magma can facilitate UI control for devices and, notably, can manage robotic systems to perform complex tasks utilizing its agentic features.
The capabilities of Magma are attributed to its broad range of training data and two technical components introduced by the researchers: Set-of-Mark and Trace-of-Mark. The Set-of-Mark component grounds actions in images, videos, and spatial contexts by having the model predict numeric markers placed on actionable elements such as buttons or robotic appendages. The Trace-of-Mark component captures temporal dynamics from video, training the model to predict how those markers move across upcoming frames before it acts. Together, the two techniques significantly improve the model's spatial awareness.
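To make the idea more concrete, the sketch below mocks up how Set-of-Mark and Trace-of-Mark style supervision could be expressed in code. The Mark data structure, the prompt format, and the trajectory targets are illustrative assumptions for this article only, not Microsoft's actual implementation, which has not been released.

```python
# Illustrative sketch only: the Mark dataclass and the helper functions below
# are hypothetical and do not reflect Microsoft's unreleased Magma codebase.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mark:
    """A numeric marker overlaid on an actionable region, e.g. a UI button."""
    mark_id: int
    center: Tuple[int, int]  # pixel coordinates in the image


def set_of_mark_prompt(marks: List[Mark], instruction: str) -> str:
    """Ground an instruction against numbered marks instead of raw pixels
    (the idea behind Set-of-Mark): the model answers with a mark id."""
    listing = ", ".join(f"[{m.mark_id}] at {m.center}" for m in marks)
    return f"Marks: {listing}. Task: {instruction}. Answer with one mark id."


def trace_of_mark_targets(trajectory: List[Tuple[int, int]]) -> List[str]:
    """Turn a mark's future positions across video frames into prediction
    targets (the idea behind Trace-of-Mark), so the model learns temporal
    dynamics before it has to act."""
    return [f"frame_{t}: {pos}" for t, pos in enumerate(trajectory)]


if __name__ == "__main__":
    # Two hypothetical UI elements tagged with numeric marks.
    ui_marks = [Mark(1, (120, 48)), Mark(2, (410, 300))]
    print(set_of_mark_prompt(ui_marks, "open the settings menu"))

    # A hypothetical gripper trace over four video frames.
    print(trace_of_mark_targets([(100, 200), (110, 205), (125, 212), (140, 220)]))
```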
According to internal benchmarking conducted by the researchers, Magma has demonstrated competitive performance across all evaluated agentic tasks, surpassing notable models from OpenAI, Alibaba, and Google. As of now, Microsoft has not made Magma publicly available.
Source: www.gadgets360.com