Embeddings are dense numerical vector representations of data — such as words, sentences, images, or users — that capture semantic meaning in a continuous, high-dimensional space. They are a foundational building block of modern machine learning and AI systems.
An embedding is a learned mapping from a discrete or high-dimensional input (like a word or a product ID) to a fixed-size vector of floating-point numbers, for example [0.23, -0.87, 0.51, ...]. These vectors are positioned so that semantically or contextually similar items end up geometrically close together. Unlike one-hot vectors, which are sparse and as long as the vocabulary, embeddings are dense and compact, commonly ranging from a few dozen to a few thousand dimensions. The values are not hand-crafted; they are learned automatically during model training.
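As a minimal sketch of the contrast with one-hot encoding, an embedding layer can be pictured as a lookup table with one dense row per item. The vocabulary, the dimension of 4, and the random values below are assumptions for illustration; in a real model the rows would be learned during training.

```python
import random

vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """Sparse representation: one slot per vocabulary item."""
    vec = [0.0] * len(vocab)
    vec[token_to_id[token]] = 1.0
    return vec

# Embedding table: one dense row per token. Random values stand in for
# weights that training would normally learn.
EMBED_DIM = 4
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

def embed(token):
    """Dense representation: look up the token's row in the table."""
    return embedding_table[token_to_id[token]]

def matvec_lookup(token):
    """The same lookup written as one_hot(token) multiplied by the table."""
    oh = one_hot(token)
    return [
        sum(oh[i] * embedding_table[i][d] for i in range(len(vocab)))
        for d in range(EMBED_DIM)
    ]
```

The one-hot vector grows with the vocabulary, while the embedding stays at a fixed size; `matvec_lookup` shows that the table lookup is equivalent to multiplying the one-hot vector by the table.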
Embeddings give machine learning models a way to work with categorical or unstructured data (text, images, audio, users, products) in a mathematically tractable form. They encode relational structure: the classic example from word embeddings is that vector('King') - vector('Man') + vector('Woman') ≈ vector('Queen'), meaning the nearest word vector to the result is typically 'Queen'. This property enables capabilities like semantic search, recommendation systems, clustering, and retrieval-augmented generation (RAG). Without embeddings, models would have to treat such items as unrelated symbols, with no built-in notion of similarity or analogy.
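The analogy arithmetic can be illustrated with a toy example. The 2-D vectors below are hand-made assumptions (one axis loosely standing for "royalty", the other for "gender"), not output from a trained model; real word vectors have far more dimensions and the relationship only holds approximately.

```python
import math

# Hand-made 2-D toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")  # "queen" for these toy vectors
```

Excluding the input words from the candidates mirrors common practice, since the nearest vector to the result is otherwise often one of the inputs.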
Embeddings are typically produced by training a neural network on a large dataset with a self-supervised objective. Word2Vec trains a shallow network to predict surrounding words from a center word (skip-gram) or the center word from its surroundings (CBOW); transformer-based models like BERT are trained by masking tokens and predicting them, and they produce contextual embeddings in which the same word gets a different vector depending on its sentence. In practice, you can generate embeddings by passing data through a pre-trained encoder and extracting the output of a specific hidden layer, usually pooled into a single vector. Hosted APIs such as OpenAI's text-embedding-3-small and open-source libraries such as sentence-transformers make this straightforward without training from scratch.
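To make the "predict surrounding words" objective concrete, here is a simplified sketch of how skip-gram training pairs could be extracted from a sentence. The function name and the fixed window are assumptions for illustration; the actual Word2Vec training procedure adds refinements such as subsampling of frequent words and negative sampling, which are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
# First pairs: ("the", "cat"), ("cat", "the"), ("cat", "sat"), ...
```

A model trained to predict the second element of each pair from the first ends up assigning similar vectors to words that appear in similar contexts.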
Once data is embedded, similarity between items is measured with a similarity or distance function in the vector space. Cosine similarity is the most common choice because it depends only on the angle between vectors, making it scale-invariant and well-suited for semantic comparisons. Euclidean (L2) distance measures absolute spatial distance and works well when vector magnitudes carry meaning. For unit-length vectors the two produce the same ranking, since squared Euclidean distance equals 2 - 2 × cosine similarity. The choice of metric should match how the embedding model was trained, as some models are explicitly optimized for cosine similarity while others are not.
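The difference between the two measures shows up clearly on vectors that point the same way but differ in length. The following is a small pure-Python sketch; the example vectors are arbitrary.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b; sensitive to length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

cos_ab = cosine_similarity(a, b)    # approximately 1.0: the direction is identical
dist_ab = euclidean_distance(a, b)  # sqrt(14), about 3.74: the magnitudes differ
```

Cosine similarity treats `a` and `b` as essentially identical, while Euclidean distance reports a clear gap, which is why the metric has to match what the model's vector magnitudes are meant to convey.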
Storing millions of embedding vectors and searching them in real time requires a vector database or an approximate nearest-neighbor (ANN) index. Libraries like Faiss, vector databases like Pinecone, Weaviate, and Qdrant, and the pgvector extension for PostgreSQL enable low-latency similarity search, often within a few milliseconds, by using algorithms like HNSW (Hierarchical Navigable Small World) or IVF (inverted file index). These indexes trade a small amount of recall for large speed gains over brute-force search, which compares the query against every stored vector. This infrastructure is central to production RAG pipelines and semantic search applications.
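The recall-for-speed trade-off can be seen in a toy inverted-file index. This is a sketch of the general IVF idea only, not how Faiss or any listed product implements it: the class name `TinyIVF` is hypothetical, and the centroids are supplied by hand, whereas a real index would typically learn them with k-means.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force(query, vectors, k=1):
    """Exact search: compare the query against every stored vector."""
    return sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:k]

class TinyIVF:
    """Toy inverted-file index. Centroids are given, not learned (assumption)."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # one posting list per centroid
        self.vectors = []

    def _nearest_centroids(self, v, n):
        order = sorted(
            range(len(self.centroids)), key=lambda c: l2(v, self.centroids[c])
        )
        return order[:n]

    def add(self, v):
        """Store v and file its index under the closest centroid."""
        idx = len(self.vectors)
        self.vectors.append(v)
        self.lists[self._nearest_centroids(v, 1)[0]].append(idx)

    def search(self, query, k=1, nprobe=1):
        """Scan only the `nprobe` lists whose centroids are closest to the query."""
        probed = self._nearest_centroids(query, nprobe)
        candidates = [i for c in probed for i in self.lists[c]]
        return sorted(candidates, key=lambda i: l2(query, self.vectors[i]))[:k]

data = [[0.0, 0.0], [0.2, 0.1], [4.4, 0.0], [10.0, 10.0], [9.8, 10.1], [5.5, 5.0]]
index = TinyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
for v in data:
    index.add(v)

query = [5.0, 4.0]
exact = brute_force(query, data, k=1)             # [5]: the true nearest neighbor
approx_1 = index.search(query, k=1, nprobe=1)     # [2]: misses it, scans one list
approx_2 = index.search(query, k=1, nprobe=2)     # [5]: found after probing both
```

With `nprobe=1` the index scans half the data and misses the true neighbor, which sits just across the cluster boundary; raising `nprobe` recovers it at the cost of scanning more vectors.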
Never mix embeddings from different models or different versions of the same model: they live in incompatible vector spaces, and comparisons between them are meaningless. If you use a dot product or an inner-product index as a stand-in for cosine similarity, normalize vectors to unit length first unless your model or library already does so. Embedding quality degrades on out-of-distribution input, so when your domain shifts significantly, fine-tune the model or switch to a better-suited one and re-embed your data. For best retrieval performance, consider models that distinguish queries from documents (for example, Cohere's embedding models accept an input type for each) and are optimized for query-to-document matching.
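Two of these practices are easy to enforce in code. The sketch below tags each vector with the model that produced it and normalizes before taking a dot product; the `Embedding` class and the model identifiers are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Embedding:
    model: str     # hypothetical identifier, e.g. model name plus version
    values: tuple  # the raw vector

def normalize(values):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in values))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(x / norm for x in values)

def similarity(a, b):
    """Dot product of unit vectors, which equals their cosine similarity."""
    if a.model != b.model:
        raise ValueError(
            f"incompatible embedding spaces: {a.model!r} vs {b.model!r}"
        )
    return sum(x * y for x, y in zip(normalize(a.values), normalize(b.values)))

doc = Embedding("model-v1", (3.0, 4.0))
query = Embedding("model-v1", (6.0, 8.0))
other = Embedding("model-v2", (3.0, 4.0))

score = similarity(doc, query)  # close to 1.0: same direction, same model
# similarity(doc, other) raises ValueError because the models differ
```

Storing the model identifier next to every vector makes an accidental cross-model comparison fail loudly instead of returning a plausible-looking but meaningless score.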
© RM Full Stack & AI Engineer