STRUDEL: Unrolling a Benchmark for Evaluating Vision-Language Models on Structured Diagram Understanding across Domains
Vision-Language Models (VLMs) have achieved impressive progress across diverse multimodal tasks, yet their ability to interpret structured diagrams remains underexplored. Such diagrams, including circuit schematics, molecular structures, musical notation, business process flow charts, and class diagrams, are central to scientific and engineering communication. We introduce STRUDEL (STRUctured Diagram EvaLuation), a benchmark for evaluating VLMs on structured diagram understanding across 8 domains and 20 image categories. STRUDEL leverages Large Language Models (LLMs) to synthesize code in domain-specific formal representation languages (FRLs) (e.g., circuit netlists, SMILES, ABC notation, BPMN, or PlantUML), which is rendered into valid diagrams and paired with generated tasks, functional descriptions, and captions. A multi-stage pipeline filters invalid, cluttered, or redundant samples and employs LLM-as-a-judge scoring to ensure correctness. Through targeted experiments, we evaluate the ability of LLMs to generate valid code in distinct FRLs and show that they can perform this task successfully. The resulting benchmark comprises diverse task types covering identification, quantification, structural analysis, image-text association, and image-to-code translation. Evaluating 35 VLMs using STRUDEL reveals that models excel at association tasks, demonstrating strong visual-textual alignment, yet struggle with quantification and identification, where precise structural understanding is required. Performance varies markedly in image-to-code translation, reflecting significant differences in how models connect visual inputs to formal representations. Overall, STRUDEL establishes a scalable foundation for assessing and advancing VLMs toward deeper and more systematic understanding of structured visual information across domains.