13–14 Nov 2024
Europe/Berlin timezone

Specification and Identification of Relationships between Products in the Food Segment

14 Nov 2024, 14:00
45m
Poster Main Track Main Conference

Description

Identifying and classifying similar or identical food products in e-commerce and retail is challenging because data formats are diverse, product descriptions are inconsistent, and the level of detail varies across sources. Precise product matching, however, can optimise databases, enhance the customer experience, enable personalised recommendations, and improve competitor price analyses. Existing work focuses primarily on recognising identical products or returning similar items. The relationship types defined in this work extend the conventional distinction between identical and similar products with a more nuanced categorisation of the similarities between products in the food sector.
For this purpose, general properties were taken from Schema.org and adapted to the food sector. The defined product relationships are:

• SameAs: describes identical products.
• IsVariantOf: refers to products of the same brand and product type, but which vary in certain characteristics, such as flavor or processing method.
• IsSimilarTo: classifies products of the same product type but different brands that are highly similar to one another.
• Predecessor/Successor: identifies successor or predecessor products that are the result of product updates.
• IsRelatedTo: covers products with further connections such as common areas of use.
• IsConsumableFor: refers to products that serve as refill packs for other products.
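
The relationship types listed above can be written down as a small data model. The following is a minimal sketch, not the authors' implementation: the names `ProductRelation` and `ProductLink` and the use of string product IDs are assumptions made for illustration, and only the relationship names themselves come from the list.

```python
from dataclasses import dataclass
from enum import Enum


class ProductRelation(Enum):
    """Relationship types named in the abstract (values match the listed names)."""
    SAME_AS = "SameAs"
    IS_VARIANT_OF = "IsVariantOf"
    IS_SIMILAR_TO = "IsSimilarTo"
    PREDECESSOR = "Predecessor"
    SUCCESSOR = "Successor"
    IS_RELATED_TO = "IsRelatedTo"
    IS_CONSUMABLE_FOR = "IsConsumableFor"


@dataclass(frozen=True)
class ProductLink:
    """One labelled pair of products (hypothetical structure for illustration)."""
    source_id: str
    target_id: str
    relation: ProductRelation


# Example: a refill pack linked to the product it refills (IDs are made up).
link = ProductLink("refill-001", "dispenser-001", ProductRelation.IS_CONSUMABLE_FOR)
```

Looking a relation up by its listed name, for example `ProductRelation("IsVariantOf")`, returns the matching enum member, which keeps labels from crawled or ERP data consistent with the list.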
For the automated determination of these relationships, a multi-stage process was developed and evaluated, using data extracted from various web shops by a web crawler and data from the internal ERP system of a retail company. Both text and image data were used.
The method developed to determine these product relationships comprises three main stages: data preparation, blocking and classification.
1. Data preparation: This step includes normalisation of the data to ensure consistency, enrichment with a Named Entity Recognition (NER) model that extracts product attributes (such as the brand) from the product name, and the creation of embeddings. BERT, SBERT and OpenAI embeddings for text and ResNet50 embeddings for images were tested to optimise the classification of the different relationships.
2. Blocking: To increase efficiency and limit the number of product comparisons, a dedicated procedure was implemented based on the classification of GPC bricks, the brand of the product and an ANN (Approximate Nearest Neighbour) approach. A BERT model trained for text classification is used to determine the GPC brick code of each product. The blocking procedure restricted the comparison set and retained 80 % of the potential product pairs.
3. Classification of the product relationships: Both an attribute-based approach and machine learning methods were used to determine the relationships SameAs, IsVariantOf and IsSimilarTo. For the other relationships, rule-based methods were used. The attribute-based approach analysed the similarities in product attributes (e.g. name, description, ingredients) with metrics such as cosine similarity and served as a baseline for the machine learning models. The machine learning approach tested random forest models (F1 score of 0.86) and Siamese neural networks (F1 score of 0.84) in different experimental settings for the classification of the product relationships.
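
To make the blocking and baseline steps more concrete, the sketch below groups products by GPC brick and scores the remaining candidate pairs with cosine similarity. It is an illustration under several assumptions, not the procedure described above: the brand and ANN stages are left out because the abstract does not say how they are combined, token counts of the product name stand in for the BERT, SBERT or OpenAI embeddings, and the product records, field names and brick labels are invented.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def candidate_pairs(products):
    """Yield only pairs of products that share a GPC brick (simplified blocking)."""
    blocks = defaultdict(list)
    for product in products:
        blocks[product["gpc_brick"]].append(product)
    for members in blocks.values():
        yield from combinations(members, 2)


def cosine_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words token counts (stand-in for embeddings)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical records; "brick-A" and "brick-B" are placeholders, not real GPC codes.
products = [
    {"id": 1, "name": "Acme Oat Drink Vanilla 1L", "gpc_brick": "brick-A"},
    {"id": 2, "name": "Acme Oat Drink Chocolate 1L", "gpc_brick": "brick-A"},
    {"id": 3, "name": "Bolt Energy Bar Peanut", "gpc_brick": "brick-B"},
]

pairs = list(candidate_pairs(products))
scores = {(a["id"], b["id"]): cosine_similarity(a["name"], b["name"]) for a, b in pairs}
```

With three products there are three possible pairs, but only the two oat drinks share a brick, so a single pair is scored. In a baseline of this kind, such scores per attribute (name, description, ingredients) would be compared against thresholds or passed to a classifier as features; the thresholds themselves are not given in the abstract.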
