Abstract: Pre-trained vision-language models (VLMs) and language models (LMs) have recently garnered significant attention due to their remarkable ability to represent textual concepts, opening up new ...
Abstract: In the era of digital technology, when material created by users is prevalent on online platforms, considerable difficulty is faced in analyzing large volumes of text in order to comprehend ...
* Equal contribution. † Co-corresponding author. Each image is paired with one or more text instances with polygon-level annotations. The dataset follows a consistent annotation format, detailed in ...