
Exploring the Visual Comprehension of Language Models | MIT News

Photo credit: news.mit.edu

Exploring the Visual Capabilities of Language Models

You’ve probably heard the adage that a picture is worth a thousand words. But can a large language model (LLM) comprehend visual elements if it has never directly encountered images? Recent research suggests that LLMs trained solely on text possess a surprising comprehension of the visual domain.

Researchers from the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) have discovered that these models can generate image-rendering code capable of creating intricate scenes filled with various objects and artistic compositions. Remarkably, even when initial outputs are simplistic, LLMs can enhance their images upon further prompting. This observation was made when the team encouraged language models to self-correct their code for different visuals, resulting in progressively more complex drawings.
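The self-correction loop described above can be sketched in a few lines. Everything here is illustrative rather than the researchers' actual method: `revise` is a hypothetical stand-in for re-prompting the model with its own rendering code, and the "image" is a simple character grid instead of a real raster.

```python
def revise(code: str) -> str:
    """Stand-in for re-prompting the LLM with its own code (hypothetical).

    A real system would ask the model to improve its drawing; here we
    simulate each round by appending one extra drawing statement.
    """
    step = code.count("canvas[")  # crude measure of current detail
    return code + f"canvas[{step % 8}][{step % 16}] = '*'\n"

def draw_iteratively(initial_code: str, rounds: int = 3) -> list:
    """Run several self-correction rounds, then execute the final code."""
    code = initial_code
    for _ in range(rounds):
        code = revise(code)       # model refines its own rendering code
    scope: dict = {}
    exec(code, scope)             # run the accumulated rendering code
    return scope["canvas"]

seed = "canvas = [[' '] * 16 for _ in range(8)]\n"
picture = draw_iteratively(seed)
detail = sum(cell == '*' for row in picture for cell in row)
```

Each round adds detail to the drawing, mirroring how the team observed progressively more complex images as the models revised their code.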

The visual knowledge embedded in these language models comes from the wealth of textual descriptions of shapes, colors, and objects available across the internet. When given a prompt like “draw a parrot in the jungle,” a model draws on this textual knowledge to construct a visual representation. To gauge how much visual knowledge LLMs hold, the CSAIL team designed a “vision checkup” built on their “Visual Aptitude Dataset,” which tests the models’ abilities to draw, recognize, and refine visual concepts. The final versions of these illustrations were then used to train a computer vision system capable of identifying the content of real images.

“We essentially train a vision system without directly using any visual data,” said Tamar Rott Shaham, co-lead author of the study and a postdoctoral researcher at MIT’s EECS department. “Our team prompted language models to generate image-rendering codes, which we then used to train the vision system for the evaluation of natural images. We were motivated by the inquiry of how visual concepts can be represented through alternative mediums, such as text. LLMs utilize code as a bridge between textual and visual understanding.”

The team began building their dataset by asking models to generate code for various shapes, objects, and scenes. They then compiled these programs into simple digital illustrations: a row of bicycles, for example, showed that the models understand spatial relationships, while a cake shaped like a car showed that they can combine unrelated concepts creatively. A glowing light bulb demonstrated their ability to produce visual effects.
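The draw-then-execute step can be sketched minimally. Here `query_llm` is a hypothetical stand-in for a real model API call; it returns canned rendering code so the example runs offline, and the "image" is a character grid rather than a true raster.

```python
def query_llm(prompt: str) -> str:
    """Stand-in for a language-model call (hypothetical).

    A real system would send `prompt` to an LLM and receive
    image-rendering code back; we return a canned response so
    the sketch is self-contained.
    """
    return (
        "canvas = [[' '] * 16 for _ in range(8)]\n"
        "canvas[6][4] = canvas[6][11] = 'o'   # two wheels in a row\n"
        "for x in range(5, 11):\n"
        "    canvas[5][x] = '-'               # frame connecting them\n"
    )

def render(prompt: str) -> list:
    """Execute the model-written rendering code and return the drawing."""
    scope: dict = {}
    exec(query_llm(prompt), scope)  # run the generated code
    return scope["canvas"]

bike = render("draw a bicycle")
print("\n".join("".join(row) for row in bike))
```

The key idea is that the model never emits pixels directly; it emits a program, and executing that program produces the image.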

“Our findings indicate that when you prompt an LLM to create an image, it has a deeper understanding than it may appear at first glance,” remarked co-lead author Pratyusha Sharma, an EECS PhD student and CSAIL member. “For instance, if you ask it to draw a chair, the model draws upon its knowledge about the object beyond its immediate representation, allowing users to refine the visual outputs with subsequent queries. It’s remarkable that the model can iteratively enhance its drawings by improving the rendering code significantly.”

The collection of illustrations was then used to train a computer vision system to recognize objects in real photographs, even though the system had never seen an actual image. This synthetic, text-generated data proved more effective as a training source than several traditional datasets of genuine photographs.
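The final step, training a recognition system purely on code-rendered images, can be illustrated with toy stand-ins. The actual work used LLM-generated renders and a neural vision model; as an assumed substitute, this sketch uses a nearest-centroid classifier over tiny character grids, keeping only the core idea of learning from synthetic drawings and testing on an unseen image.

```python
def make_grid(coords, h=8, w=8):
    """Build a tiny 'image' with '#' marks at the given (row, col) cells."""
    g = [[' '] * w for _ in range(h)]
    for r, c in coords:
        g[r][c] = '#'
    return g

def flatten(grid):
    """Turn a character grid into a binary feature vector."""
    return [1.0 if c != ' ' else 0.0 for row in grid for c in row]

# Synthetic training renders, as if produced by LLM-written code.
renders = {
    "horizontal bar": [make_grid([(4, c) for c in range(1, 7)]),
                       make_grid([(3, c) for c in range(0, 6)])],
    "vertical bar":   [make_grid([(r, 4) for r in range(1, 7)]),
                       make_grid([(r, 3) for r in range(0, 6)])],
}

def centroid(vectors):
    return [sum(vals) / len(vals) for vals in zip(*vectors)]

# "Training": one mean feature vector per concept.
model = {label: centroid([flatten(g) for g in grids])
         for label, grids in renders.items()}

def classify(grid):
    """Assign the label whose centroid is closest in squared distance."""
    v = flatten(grid)
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(v, model[label]))
    return min(model, key=dist)

# A held-out "photo": a horizontal stroke never seen during training.
test_image = make_grid([(4, c) for c in range(2, 8)])
print(classify(test_image))
```

The system never sees the test drawing during training, mirroring (in miniature) how the CSAIL classifier trained on synthetic renders was evaluated on natural images.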

The research team posits that merging the implicit visual understanding of LLMs with the artistic features of other AI technologies, like diffusion models, could yield valuable outcomes. Tools like Midjourney, while creative, sometimes struggle with intricate edits, such as modifying the quantity of objects in a scene or repositioning elements spatially. If an LLM outlines the necessary changes in advance, the diffusion model could produce superior edits in response.

Notably, as Rott Shaham and Sharma have observed, LLMs occasionally struggle to recognize concepts that they are capable of drawing. This became evident when models inaccurately identified human renderings of images from their dataset, highlighting the complexity of diverse visual interpretations. Despite these challenges, the models displayed creativity, producing various representations of similar concepts, such as strawberries and arcades, from multiple angles, colors, and shapes. This suggests that the models may possess a form of internal imagery tied to the concepts they draw, rather than merely replicating prior examples.

The CSAIL team’s research could serve as a foundational approach for assessing the capability of generative AI models to train computer vision systems effectively. Looking ahead, the researchers aim to expand their explorations and enhance the tasks assigned to language models. They acknowledge the limitation of not having access to the training set of the LLMs involved, which complicates their ability to further dissect the origins of the models’ visual knowledge. Future endeavors may involve training a more advanced vision model by fostering direct collaboration with LLMs.

Sharma and Rott Shaham, along with their colleagues, presented their findings at the IEEE/CVF Computer Vision and Pattern Recognition Conference, highlighting the potential of bridging language and visual understanding in AI development.

Source
news.mit.edu
