Speaker
Description
Vision Language Models (VLMs) represent a major advancement in multi-modal Artificial Intelligence, combining visual and textual data processing. However, their knowledge is largely limited to publicly available data. The Retrieval Augmented Generation (RAG) approach addresses this limitation by giving a model access to external information.
This talk presents a Visual RAG Pipeline that merges the RAG approach with VLMs. The pipeline involves the following main steps: Preprocessing, Vector Store, Retrieval, Classification and Relational Query, Prompt Generation, and Completion. The pipeline is evaluated on a custom dataset of images showing product advertisements as they appear in leaflets, together with the product and promotion information belonging to each advertisement. Promotion data includes the price, the regular price, and discounts, while product data covers attributes such as brand, weight, and Global Trade Item Number (GTIN). The GTIN serves as a standardized, unique identifier for products.
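The retrieval, relational query, prompt generation, and completion steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline presented in the talk: the embeddings are hand-written two-dimensional vectors standing in for real image embeddings, the GTINs and product attributes are made-up placeholders, and `complete` is a hypothetical stub where a call to a VLM back-end would go.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    gtin: str               # label of a reference advertisement image
    embedding: list[float]  # stand-in for a real image embedding


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store: list[Entry], query: list[float], k: int = 2) -> list[Entry]:
    """Retrieval: return the k stored entries most similar to the query."""
    ranked = sorted(store, key=lambda e: cosine(e.embedding, query), reverse=True)
    return ranked[:k]


def build_prompt(candidates: list[Entry], products: dict[str, dict]) -> str:
    """Relational query plus prompt generation: look up product data by GTIN."""
    lines = ["Which of these products does the advertisement image show?"]
    for entry in candidates:
        product = products[entry.gtin]
        lines.append(f"- GTIN {entry.gtin}: {product['brand']}, {product['weight']}")
    lines.append("Answer with exactly one GTIN.")
    return "\n".join(lines)


def complete(prompt: str) -> str:
    """Hypothetical stub for the VLM call: picks the first listed candidate."""
    for line in prompt.splitlines():
        if line.startswith("- GTIN "):
            return line.split()[2].rstrip(":")
    raise ValueError("prompt contains no candidates")


# Made-up placeholder data (assumption, not taken from the talk's dataset).
store = [
    Entry("0000000000001", [1.0, 0.0]),
    Entry("0000000000002", [0.0, 1.0]),
    Entry("0000000000003", [0.7, 0.7]),
]
products = {
    "0000000000001": {"brand": "BrandA", "weight": "500 g"},
    "0000000000002": {"brand": "BrandB", "weight": "1 kg"},
    "0000000000003": {"brand": "BrandA", "weight": "250 g"},
}

query_embedding = [0.9, 0.1]  # embedding of the image to classify
candidates = retrieve(store, query_embedding, k=2)
prompt = build_prompt(candidates, products)
predicted_gtin = complete(prompt)
```

In a real setting, the query image itself would be passed to the VLM alongside the prompt, and the candidate list would narrow the decision down to a few visually similar products.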
In the retail and supply chain domain, GTIN-related data is crucial for reporting and analysis. Because the range of traded products changes constantly and many products are highly similar to one another, Fine-Grained Classification (FGC) of these products is essential for effective analysis.
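Since the GTIN acts as the class label here, it can be useful to sanity-check a predicted identifier before using it in reports. The sketch below implements the standard GS1 check-digit rule (weights 3 and 1 alternating from the rightmost data digit). It is an illustrative addition and not a step described in the talk; the function names are hypothetical.

```python
def gtin_check_digit(body: str) -> int:
    """Compute the GS1 check digit for a GTIN given without its last digit."""
    total = sum(
        int(digit) * (3 if position % 2 == 0 else 1)
        for position, digit in enumerate(reversed(body))
    )
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """Return True if code is a GTIN-8/12/13/14 with a correct check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])
```

A check like this only catches malformed identifiers. It cannot tell whether a well-formed GTIN is the correct one for the image.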
The FGC task has been explored using the Visual RAG Pipeline. Several VLM back-ends, including GPT-4o, GPT-4o-mini, and Gemini 2.0 Flash, were compared within the pipeline, and the evaluation yielded an accuracy of 86.8%.