When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical 'whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves 0% accuracy, while whiteboard-of-thought enables up to 92% accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
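To make the loop concrete, below is a minimal sketch of how such a whiteboard loop could be wired up. It assumes the OpenAI Python SDK and Matplotlib are available; the prompt wording, the `extract_code` helper, and the fixed `whiteboard.png` filename are illustrative choices for exposition, not the authors' implementation.

```python
"""Minimal sketch of a whiteboard-of-thought loop (illustrative, not the authors' code).

Assumptions: the OpenAI Python SDK (>=1.0) and Matplotlib are installed, and
OPENAI_API_KEY is set in the environment. Prompt wording, helper names, and the
fixed 'whiteboard.png' filename are simplifications for exposition.
"""
import base64
import re
import subprocess
import sys
import tempfile

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

# System prompt giving the model its metaphorical whiteboard: it "draws" by
# writing plotting code rather than by emitting pixels directly.
WHITEBOARD_PROMPT = (
    "You have access to a whiteboard. To use it, write Python code with "
    "matplotlib that draws the visual or spatial structure needed to answer "
    "the question, and save the figure as 'whiteboard.png'. Reply with a "
    "single fenced Python code block and nothing else."
)

FENCE = "`" * 3  # markdown code fence delimiter


def extract_code(reply: str) -> str:
    """Pull the first fenced Python block out of the model's reply."""
    match = re.search(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, reply, re.DOTALL)
    return match.group(1) if match else reply


def whiteboard_of_thought(query: str) -> str:
    # Step 1: ask the model to draw its intermediate reasoning as code.
    draw = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": WHITEBOARD_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    code = extract_code(draw.choices[0].message.content)

    # Step 2: execute the generated code to render the whiteboard image.
    # (This runs model-written code; a real implementation would sandbox it.)
    with tempfile.TemporaryDirectory() as tmp:
        with open(f"{tmp}/draw.py", "w") as f:
            f.write(code)
        subprocess.run([sys.executable, "draw.py"], cwd=tmp, check=True)
        with open(f"{tmp}/whiteboard.png", "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()

    # Step 3: return the rendered image to the model for the final answer.
    answer = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Use the whiteboard you drew to answer: {query}"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    )
    return answer.choices[0].message.content
```

The key design point is that the model never produces pixels itself: it writes ordinary plotting code, and only the rendered image is passed back to it for the final answer.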
ASCII understanding is a clearly visual task with only text inputs. For humans, written text is typically processed with the same input modality as images (our eyes), allowing us to engage in visual thinking without any intermediate processing.
Consider how difficult it would be to understand ASCII art read aloud rather than seen. In some sense, this is similar to how LLMs process ASCII.
@article{menon2024whiteboard,
  title={Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities},
  author={Sachit Menon and Richard Zemel and Carl Vondrick},
  journal={arXiv},
  year={2024}
}