
Microsoft Unveils Magma Foundation Model for Executing Multimodal Agentic Tasks


Microsoft has unveiled Magma, a foundation model designed to carry out agentic tasks. The artificial intelligence (AI) system was trained on a diverse range of datasets spanning text, images, videos, and spatial data. The Redmond-based tech giant positions Magma as an evolution of vision-language (VL) models: beyond understanding multimodal input, it can also plan and take actions based on that information. This makes Magma applicable to areas such as computer vision, user interface (UI) navigation, and robotic manipulation.

Microsoft Announces Magma Foundation Model

In a detailed GitHub post, Microsoft researchers described the capabilities of the Magma foundation model. A foundation model is a large model pre-trained on broad data so that it can serve as the base for many downstream tasks and applications; large language models (LLMs) are one familiar example. What sets Magma apart is the breadth of its pre-training data, which covers text, images, videos, and spatial data rather than text alone.

Magma's underlying architecture is based on Meta's Llama 3 model. However, its capabilities go beyond the text output typical of chatbots: Magma can plan and act within visual and spatial contexts. This allows it to work as a computer vision assistant that interprets and answers questions about the environment it perceives through camera sensors. Magma can also control device UIs and, notably, direct robotic systems through complex tasks using its agentic capabilities.

Microsoft attributes Magma's capabilities to its broad training data and to two technical components: Set-of-Mark and Trace-of-Mark. Set-of-Mark grounds actions in images, videos, and spatial scenes by overlaying numeric markers on actionable elements, such as clickable buttons in a UI or the parts of a robotic arm, so the model can act by predicting the right marker. Trace-of-Mark adds temporal video dynamics by training the model to predict how those marks will move across upcoming frames before it acts. Together, the two techniques significantly improve the model's spatial awareness.
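To make the two mechanisms more concrete, here is a minimal Python sketch of the general idea, assuming candidate elements are already available as bounding boxes and tracked point positions. All names (`Box`, `assign_marks`, `mark_to_action`, `trace_targets`) are hypothetical illustrations, not Microsoft's code or Magma's API. Set-of-Mark is modeled as numbering candidate regions and mapping a predicted number back to a click or grasp point; Trace-of-Mark is modeled as slicing each mark's future positions to form a prediction target.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box of an actionable element (button, gripper, object)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def center(self) -> tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Set-of-Mark style labelling: give each candidate region a numeric mark starting at 1."""
    return {i: box for i, box in enumerate(boxes, start=1)}


def mark_to_action(marks: dict[int, Box], predicted_mark: int) -> tuple[float, float]:
    """Ground the model's predicted mark back to a concrete point (the box centre)."""
    if predicted_mark not in marks:
        raise KeyError(f"model predicted unknown mark {predicted_mark}")
    return marks[predicted_mark].center()


def trace_targets(
    tracks: dict[int, list[tuple[float, float]]], t: int, horizon: int
) -> dict[int, list[tuple[float, float]]]:
    """Trace-of-Mark style target: each mark's positions in the `horizon` frames after frame t."""
    return {mark: points[t + 1 : t + 1 + horizon] for mark, points in tracks.items()}


if __name__ == "__main__":
    buttons = [Box(0, 0, 100, 40), Box(120, 0, 220, 40)]
    marks = assign_marks(buttons)
    print(mark_to_action(marks, 2))  # (170.0, 20.0)
    print(trace_targets({1: [(0, 0), (1, 1), (2, 2), (3, 3)]}, 0, 2))  # {1: [(1, 1), (2, 2)]}
```

The appeal of this style of design is that the model only has to output a small integer (or a short list of future coordinates) rather than raw pixel positions, which keeps its actions tied to elements it can actually see.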

According to the researchers' internal benchmarks, Magma performed competitively across all of the agentic tasks they evaluated, outperforming notable models from OpenAI, Alibaba, and Google. Microsoft has not yet made Magma publicly available.

Source
www.gadgets360.com
