The main purpose of training an image-free computer vision system on illustrations generated by language models, as discussed in the MIT research, is to assess how much visual knowledge language models hold and to build a vision system that can recognize the content of real photos without ever training on visual data. The researchers constructed a "vision checkup" for language models and used their "Visual Aptitude Dataset" to test the models' abilities to draw, recognize, and self-correct visual concepts. The illustrations generated by the language models were then used to train a computer vision system that can recognize objects in real photos, despite never having seen a photo before. This approach allowed the system to outperform vision systems trained on other procedurally generated image datasets, even though no authentic photos were used in training.
Language models trained solely on text acquire a solid understanding of the visual world by learning from the descriptions of visual concepts available on the internet. These descriptions can be found in natural language or code, and they help the language models work out how to generate images or refine existing ones. For example, when given a direction like "draw a parrot in the jungle," a language model draws on the descriptions of parrots and jungles it has read before to write code that renders such a scene. Researchers have also found that language models can use code as a common ground between text and vision, allowing them to express their visual knowledge more effectively.
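Concretely, the bridge from a text-only model to a picture runs through generated rendering code. The sketch below shows one way that loop could be wired up in Python; it is a minimal illustration, not the researchers' pipeline. The query_llm helper is a hypothetical stand-in for whichever chat API is actually used, and executing model-generated code is assumed to happen in a sandbox in any real setting.

```python
import os
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display is needed


def render_concept(concept: str, out_path: str, query_llm) -> None:
    """Ask a text-only language model for matplotlib code that depicts
    `concept`, then execute that code to produce an image file.

    `query_llm(prompt) -> str` is a hypothetical stand-in for any
    chat-completion call that returns the model's reply as plain text.
    """
    prompt = (
        f"Write self-contained matplotlib code that draws: {concept}. "
        "Save the figure to 'out.png' and do not call plt.show()."
    )
    code = query_llm(prompt)

    # Run the generated program in its own namespace. Generated code is
    # untrusted, so a real system would sandbox this step.
    exec(code, {})

    # The generated code wrote out.png; move it to the requested location.
    os.replace("out.png", out_path)
```

With a concrete chat client plugged in, calling render_concept("a parrot in the jungle", "parrot.png", query_llm) would yield an image that the language model itself never sees.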
The "vision checkup" conducted by the CSAIL team on language models was a systematic evaluation of the models' abilities to generate and recognize a variety of visual concepts of increasing complexity5. The evaluation involved three main tasks:
Generation/Drawing with text: This tested the ability of the language models to generate image-rendering code corresponding to a specific concept. The team used their "Visual Aptitude Dataset" to prompt the models and assessed their ability to create accurate and diverse images.
Recognition/Seeing with text: This tested the models' performance in recognizing visual concepts and scenes represented as code. The models were evaluated on their ability to understand and interpret spatial relations and visual effects.
Correcting drawings with text feedback: This evaluated the models' ability to iteratively modify their generated code using natural language feedback. A self-generated feedback mechanism prompted the models to critique and improve their own drawings, and the team measured how systematically the models revised their representations in response (a simplified version of this loop is sketched below).
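Taken together, the three tasks form a generate-render-critique loop in which the model only ever reads and writes text. The outline below is an illustrative sketch under that assumption, not the paper's implementation: query_llm is the same hypothetical chat helper as above, and render is assumed to execute the generated code and return the path of the resulting image.

```python
def improve_drawing(concept: str, query_llm, render, rounds: int = 3) -> str:
    """Iteratively refine image-rendering code from text feedback alone.

    query_llm(prompt) -> str : hypothetical chat helper (assumption)
    render(code) -> str      : assumed to execute the code and return
                               the path of the rendered image
    """
    code = query_llm(f"Write self-contained Python code that draws: {concept}")
    for _ in range(rounds):
        render(code)  # produce the image; the model never sees the pixels
        # Self-generated feedback: the model critiques its own program in words.
        feedback = query_llm(
            f"The following code is meant to draw '{concept}':\n{code}\n"
            "Describe in words what is missing, wrong, or could look better."
        )
        # Ask for a revised program that addresses the critique.
        code = query_llm(
            f"Revise this code so it better depicts '{concept}'.\n"
            f"Critique: {feedback}\n\nCode:\n{code}\n\n"
            "Return only the updated code."
        )
    return code
```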
The results of the "vision checkup" showed that language models have a solid understanding of the visual world and can be trained to generate and recognize complex visual concepts without directly using any visual data15. The team's work highlights the potential of using language models to train vision systems and make semantic assessments of natural images1.