The problem

Your product catalogue has 2 million items. Each has a brand field populated by sellers — "Nike", "NIKE", "Nike, Inc.", "nike official store", "NlKE" (that's a lowercase L, not an i). You need to map all of them to a canonical brand taxonomy of 50,000 entries.

Rule-based deduplication (fuzzy string matching, edit distance) handles the obvious cases but fails on semantic variants: "Apple Computer" should map to "Apple", but Levenshtein distance won't tell you that — the two strings are nine edits apart. You need semantic search.
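To make the failure concrete, here is a small sketch (`lev` is an illustrative helper written here, not a library function) showing that edit distance rewards surface similarity, not meaning:

```python
def lev(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Surface-level typos score well...
print(lev("Nike", "NlKE"))             # → 3 (three character substitutions)
# ...but semantically identical variants look far apart
print(lev("Apple Computer", "Apple"))  # → 9 (delete " Computer")
```

A threshold low enough to catch "Apple Computer" → "Apple" would also merge genuinely different brands, which is exactly why edit distance alone can't finish the job.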

The intuition

The approach: embed everything into the same vector space, then find nearest neighbours. Embed your 50,000 canonical brand names. Embed the noisy incoming strings. For each noisy string, find the nearest canonical embedding — that's your standardised match.

The key insight is that embeddings capture meaning, not just surface form. "Apple Computer" and "Apple Inc." land near each other in embedding space, even though they share only one word. "Nike" and "NIKE" are identical in meaning and will be very close.
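"Near each other" here means high cosine similarity. A minimal numpy sketch — the 4-d vectors are made-up stand-ins for real embeddings, chosen only to illustrate the geometry:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: dot product of the normalised vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-d vectors standing in for real embeddings (illustrative values only)
apple_computer = np.array([0.8, 0.1, 0.5, 0.2])
apple_inc      = np.array([0.7, 0.2, 0.6, 0.1])
nike           = np.array([0.1, 0.9, 0.0, 0.4])

print(cosine(apple_computer, apple_inc))  # high: near-synonyms cluster together
print(cosine(apple_computer, nike))       # low: unrelated brands sit far apart
```

With a real embedding model the vectors are a few hundred dimensions, but the nearest-neighbour logic is identical.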

FAISS (Facebook AI Similarity Search) makes nearest-neighbour lookup fast at scale. Instead of comparing a query against all 50,000 candidates one by one (slow), FAISS builds an index that retrieves the top-k matches in milliseconds using approximate methods.

Embeddings plus FAISS is the combination that makes semantic deduplication tractable. Neither alone is enough.

In practice

The pipeline I built at Amazon processed 300K noisy brand strings against a 50K canonical taxonomy in under 10 minutes on a single machine.

A few things that matter in practice:

Threshold tuning. Not every noisy string has a valid canonical match. Set a cosine similarity threshold below which you flag the result as "no match found" rather than forcing an incorrect mapping. I calibrated this on 2,000 labelled pairs — the threshold that maximised F1 was 0.82.
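Calibration can be a simple sweep over candidate thresholds on the labelled pairs, keeping the one that maximises F1. A sketch, where `sims` holds each pair's top-1 cosine score and `labels` the gold match/no-match judgement (both names are illustrative):

```python
import numpy as np

def best_threshold(sims: np.ndarray, labels: np.ndarray, steps: int = 101):
    """Sweep thresholds in [0, 1]; return the (threshold, F1) maximising F1."""
    best = (0.0, 0.0)
    for t in np.linspace(0.0, 1.0, steps):
        pred = sims >= t                       # accept the match above threshold
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (float(t), float(f1))
    return best
```

Run against the 2,000 labelled pairs, this is the whole calibration step; re-run it whenever the embedding model changes, since the score distribution shifts with the model.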

Two-stage retrieval. FAISS retrieves top-20 candidates. A reranker (a cross-encoder, or just GPT-4 with a short prompt) re-scores those 20 candidates and picks the best. This catches cases where the bi-encoder embedding is imprecise.
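The two-stage shape looks like this in outline — here with a trivial token-overlap scorer standing in for the cross-encoder (in production you would swap in e.g. `sentence_transformers.CrossEncoder`; `overlap_score` and `rerank` are illustrative names, not library APIs):

```python
def overlap_score(query: str, candidate: str) -> float:
    """Placeholder for a cross-encoder score: case-insensitive token overlap."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / max(len(q | c), 1)

def rerank(query: str, candidates: list[str], top_k: int = 1) -> list[str]:
    """Stage 2: re-score the stage-1 candidates and keep the best top_k."""
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_k]

# `candidates` would be the FAISS top-20 from stage 1
print(rerank("apple computer", ["Apple", "Applebee's", "Pineapple Co"]))  # → ['Apple']
```

The point of the split: the expensive scorer only ever sees 20 candidates per query, so its cost doesn't scale with the size of the taxonomy.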

Embedding model choice. General sentence-transformers work well for clean text but struggle with noisy, abbreviated brand names. Fine-tuning on a small set (1,000 labelled pairs) of brand-specific examples improved recall@5 by 14%.

Going deeper (optional)

FAISS supports several index types with different accuracy/speed tradeoffs. IndexFlatL2 is exact but slow for large indices. IndexIVFFlat partitions the space into Voronoi cells — fast but approximate. IndexHNSWFlat (hierarchical navigable small world graphs) is often the best balance for production.

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# In the real pipeline these hold the 50K taxonomy and the incoming
# seller strings; tiny lists here so the snippet runs as-is.
canonical_brands = ["Nike", "Apple", "Adidas", "Sony", "Samsung"]
noisy_strings = ["nike official store", "Apple Computer"]

# Build index from canonical brands
canonical_embeddings = model.encode(canonical_brands, normalize_embeddings=True)
index = faiss.IndexFlatIP(canonical_embeddings.shape[1])  # inner product = cosine for normalised vectors
index.add(canonical_embeddings)

# Query: retrieve the top-5 canonical candidates for each noisy string
query_embeddings = model.encode(noisy_strings, normalize_embeddings=True)
scores, indices = index.search(query_embeddings, k=5)