Harnessing Layered Graphic Designs with Real Intentions for Text-to-Design Generation

ShanghaiTech University
CVPR 2026 Findings

*Indicates Corresponding Author
Dataset Details

Left: Each design in our dataset is composed of basic elements of four types: text, image, shape, and background. Element types are highlighted with bounding boxes in different colors. Each element has a set of attributes that define its content and control its appearance and composition on the design. The detailed attributes of one element per type are shown in the dashed rectangle. Each design is accompanied by a detailed design intention, along with its title and tags. Right: The dataset encompasses various themes such as education, healthcare, arts, business, and festivals.
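As a concrete illustration of the layered structure described above, a single design record might be represented as follows. This is a hypothetical sketch for readability only; the field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical layered-design record (illustrative field names, not the
# dataset's real schema): a design pairs a real intention with a list of
# elements of the four types described in the caption.
design = {
    "intention": "A warm, minimalist poster inviting parents to a school open day.",
    "title": "Open Day Poster",
    "tags": ["education", "event"],
    "elements": [
        # Each element carries a "type" plus attributes that control its
        # content, appearance, and placement on the canvas.
        {"type": "background", "color": "#FDF6EC", "width": 1080, "height": 1350},
        {"type": "image", "src": "school.png", "x": 90, "y": 120, "width": 900, "height": 600},
        {"type": "shape", "kind": "rectangle", "fill": "#2A9D8F", "x": 0, "y": 1150,
         "width": 1080, "height": 200},
        {"type": "text", "content": "Join Our Open Day!", "font": "Serif", "size": 72,
         "x": 120, "y": 800},
    ],
}

# All four element types from the caption appear exactly once here.
element_types = {e["type"] for e in design["elements"]}
assert element_types == {"text", "image", "shape", "background"}
```

A flat record like this makes the text-to-design task concrete: the model conditions on the `intention` string and must predict the element list, i.e. both the content and the layout attributes of every layer.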

Abstract

Text-to-design generation, which synthesizes plausible and diverse graphic designs from textual design intentions, has recently emerged as a research area of growing interest. However, further progress in this area is hindered by the absence of public, high-quality paired intention-design data. To mitigate this issue, we introduce LADEREIN, a new benchmark of layered designs with real intentions for training and evaluating text-conditioned graphic design generation models. In contrast to the synthetic short intention prompts in prior datasets, the intentions in our dataset are real, long, and complex, spanning factors such as design purpose, visual style, feeling, and expected audience behavior. Such real intentions allow us to train text-to-design models that generalize to real design scenarios, and make our dataset a promising ground for validating progress in text-to-design generation. For objective assessment of model performance in intention-following, we contribute DesignCLIP, a new text-design alignment metric learned from our dataset. Moreover, we build a simple yet effective text-to-design generator, DesignDiff, as a baseline on our benchmark. We show that: 1) our DesignCLIP outperforms GPT-4V in judging the alignment between graphic designs and textual design intentions; 2) our dataset enhances the ability of text-to-design models to follow complex user intentions accurately; 3) our DesignDiff generates high-quality designs with strong text alignment.