Wednesday, October 29, 2025

Unlocking RAG with LangChain: Embeddings, Vector Databases, and Retrieval

  • Large language models are powerful, but their answers depend on the context you give them. RAG (Retrieval-Augmented Generation) fixes that by retrieving relevant pieces of real documents and feeding them to the model.
  • Every great AI system starts with one simple question: “How can my model remember and use knowledge that’s not in its training data?”
  • That’s where Retrieval-Augmented Generation (RAG) comes in — a method that lets Large Language Models (LLMs) retrieve real information from external sources before answering.
  • Think of RAG as giving your model a search engine for its memory.

In this post, we’ll walk together through each stop on the RAG journey:

  1. 🧩 Splitting raw text into meaningful pieces

  2. 🧠 Turning text into embeddings

  3. 📦 Storing those embeddings in a vector database

  4. 🔍 Retrieving the right pieces on demand

  5. 💬 Generating an accurate, grounded answer

🧩 Step 1: Text Splitters — Preparing Your Knowledge Base

  • Imagine you’re feeding a giant encyclopedia to ChatGPT.
  • You can’t just dump all 10,000 pages at once — it would choke! 🫣
  • That’s why we use Text Splitters — they break large documents into smaller, manageable chunks without losing meaning.

Why Split Text?

  • LLMs have token limits (e.g., GPT-4-turbo ≈ 128k tokens max).

  • Smaller chunks = faster, cheaper queries.

  • Each chunk gets stored separately, which improves retrieval precision.

Key Parameters

  • chunk_size — how large each chunk is (characters or tokens, depending on the splitter).
  • chunk_overlap — how much adjacent chunks share, so an idea isn’t cut off mid-thought.
  • separator — the character(s) the splitter prefers to break on (e.g., "\n").

Example:

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(document_text)

Pro Tip: 
  • Adjust chunk_size based on your content.
  • Technical docs = larger chunks.
  • Narrative text = smaller chunks for accuracy.
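For instance, here’s a hedged sketch of two splitter setups — larger chunks for technical docs, smaller ones for narrative text. The sizes are illustrative, not prescriptive, and technical_doc_text / narrative_text stand in for your own strings:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Larger chunks keep related technical details (code, tables, definitions) together
technical_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)

# Smaller chunks keep narrative passages focused on a single idea
narrative_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)

technical_chunks = technical_splitter.split_text(technical_doc_text)  # technical_doc_text: your own string
narrative_chunks = narrative_splitter.split_text(narrative_text)      # narrative_text: your own string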


🧠 Step 2: Embeddings — Teaching AI What “Meaning” Feels Like

Here’s the magic trick:

  • Computers can’t compare raw text by meaning directly — but they can measure similarity between ideas using embeddings.
  • An embedding is a numerical representation (a list of floating-point numbers) of text in a high-dimensional vector space.
  • In this space:
    • “Artificial Intelligence” and “Machine Learning” are close together.
    • “Dog” and “Banana” are far apart. 🍌🐕


How It Works

Text → Encoder (e.g., OpenAIEmbeddings) → Vector like [0.12, 0.43, -0.87, ...]


Example:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("What is Pinecone?")
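To see “close together vs. far apart” concretely, here’s a small hedged sketch that compares embeddings with cosine similarity (assumes your OPENAI_API_KEY is set; the exact scores will vary by embedding model):

import math
from langchain_openai import OpenAIEmbeddings

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 = identical direction, ~0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

embeddings = OpenAIEmbeddings()
v_ai = embeddings.embed_query("Artificial Intelligence")
v_ml = embeddings.embed_query("Machine Learning")
v_banana = embeddings.embed_query("Banana")

print(cosine_similarity(v_ai, v_ml))      # relatively high: related concepts sit close together
print(cosine_similarity(v_ai, v_banana))  # noticeably lower: unrelated concepts sit far apart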

📦 Step 3: Vector Databases — Where Memory Lives

Now that you have semantic vectors, you need somewhere to store them — that’s what vector databases are for.

Think of a vector store as the “memory palace” 🏰 of your AI system — it remembers embeddings and finds the most similar ones to any new query.

🔥 Popular Vector Stores

  • Pinecone — fully managed and persistent; built for production scale (used in the examples below).
  • FAISS — a lightweight library that runs locally; ideal for prototypes.
  • Other common choices: Chroma, Weaviate, Milvus, Qdrant.

Example:

from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(index_name="my-index", embedding=embeddings)

Tip:

Use Pinecone for production apps that need scale and persistence. Use FAISS for local prototypes.
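As a quick illustration of the FAISS route, here’s a minimal local sketch (assumes the faiss-cpu and langchain-community packages are installed and OPENAI_API_KEY is set; the sample texts are made up):

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings()

# Two toy documents, embedded and indexed entirely in memory
texts = [
    "Pinecone is a managed vector database service.",
    "FAISS is a library for local similarity search.",
]
vectorstore = FAISS.from_texts(texts, embeddings)

# Semantic search: returns the closest chunk by meaning, not a keyword match
results = vectorstore.similarity_search("Which option is a hosted service?", k=1)
print(results[0].page_content)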

🧱 Step 4: Building the Retrieval Pipeline

Now that your data is split, embedded, and stored — it’s time to retrieve!

  • Retrieval is how your AI searches its memory for the most relevant chunks based on a user’s query.

What is a Retriever?

  • It’s like a librarian 📚 — it doesn’t answer questions, it just fetches relevant pages.

In LangChain:

  • retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
  • Here, k = the number of most similar chunks to return for each query.
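Calling it is just as simple — a small hedged sketch that fetches the top matches for a query and previews each chunk (assumes the retriever above and the vectorstore from Step 3):

# The retriever returns Document objects, not a final answer
docs = retriever.invoke("What is Pinecone?")
for doc in docs:
    print(doc.page_content[:100], "...")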

💬 Step 5: RetrievalQA Chain — Connecting Retrieval and Generation

The RetrievalQA Chain is LangChain’s “brains + memory” combo.

It connects:

  • a retriever (memory searcher 🧠)

  • an LLM (responder 💬)

  • and a prompt template (question formatter 🧾)

When you call the chain, it:

  1. Takes your query

  2. Retrieves the top matching chunks

  3. Inserts them into a prompt template

  4. Sends it to the LLM

  5. Returns the final grounded answer

Example:

from langchain.chains.retrieval import create_retrieval_chain

# combine_docs_chain is built with create_stuff_documents_chain (see the full example below)
retrieval_chain = create_retrieval_chain(
    retriever=vectorstore.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
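Once the chain exists, calling it is a single invoke — a small sketch, assuming the chain above (the "answer" key holds the grounded response):

response = retrieval_chain.invoke({"input": "What is Pinecone?"})
print(response["answer"])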

⏱ Step 6: Token Limits in LLMs — Why Chunking Matters

  • Every LLM has a maximum context window — the number of tokens it can “see” at once.
  • If your input exceeds that, the extra tokens simply can’t be processed — the request errors out or the input gets truncated. 😅

             Model                Token Limit
             GPT-3.5              ~16K tokens
             GPT-4-turbo          ~128K tokens
             Claude 3 Sonnet      ~200K tokens

How to Manage It:

  • Use text splitters to stay under token limits (a token-counting sketch follows this list).

  • Use retrieval to dynamically bring only relevant chunks.

  • Use map-reduce chains if you need to process huge docs.
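To back up the first point, here’s a hedged sketch that counts tokens per chunk with tiktoken (assumes the tiktoken package is installed and that `chunks` comes from the Step 1 splitter; the 1,000-token budget is only an example):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-3.5/GPT-4 models

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

for i, chunk in enumerate(chunks):
    n = count_tokens(chunk)
    if n > 1000:  # example per-chunk budget; tune it to your model's context window
        print(f"Chunk {i} is large: {n} tokens")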

๐ŸŒ Step 7: Retrieval-Augmented Generation (RAG)

  • Now the final destination: RAG, or Retrieval-Augmented Generation. 🚀
  • It combines everything we’ve learned:

    • Retrieve relevant chunks from the vector DB.
    • Augment the user query with these chunks as context.
    • Generate a grounded, factually accurate answer.

  • You’re no longer asking the model to “remember” — you’re teaching it to look things up before answering. 🔎 (The sketch below writes this loop out by hand.)
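Here’s what that loop looks like written out by hand — a minimal sketch assuming a vectorstore like the one from Step 3 and a ChatOpenAI client (the retrieval chain in the next section automates exactly this):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
query = "What is Pinecone in machine learning?"

# 1. Retrieve: fetch the most relevant chunks from the vector DB (vectorstore from Step 3)
docs = vectorstore.as_retriever(search_kwargs={"k": 3}).invoke(query)

# 2. Augment: paste the retrieved chunks into the prompt as context
context = "\n\n".join(doc.page_content for doc in docs)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)

# 3. Generate: the LLM answers from the supplied context instead of memory alone
answer = llm.invoke(prompt)
print(answer.content)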

LangChain Example — End-to-End Code

Scalable Retrieval with Pinecone and LangChain

Let’s walk through a complete example that connects LangChain, OpenAI, and Pinecone into one seamless retrieval pipeline.

🧱 Example: Ingesting a Blog File into Pinecone
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

load_dotenv()

if __name__ == "__main__":
    print("📥 Ingesting...")

    # 1️⃣ Load your document
    loader = TextLoader("/Users/edenmarco/Desktop/intro-to-vector-dbs/mediumblog1.txt")
    document = loader.load()

    # 2️⃣ Split it into smaller chunks
    print("✂️ Splitting text...")
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(document)
    print(f"✅ Created {len(texts)} chunks")

    # 3️⃣ Create embeddings
    embeddings = OpenAIEmbeddings(openai_api_key=os.environ.get("OPENAI_API_KEY"))

    # 4️⃣ Store embeddings in Pinecone
    print("📦 Uploading vectors to Pinecone...")
    PineconeVectorStore.from_documents(texts, embeddings, index_name=os.environ["INDEX_NAME"])

    print("🎉 Ingestion complete!")
๐Ÿ” Let’s Break It Down

🧱 Example: Retrieving a Blog File from Pinecone

import os
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain import hub
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

load_dotenv()

if __name__ == "__main__":
    print("🔍 Retrieving...")

    embeddings = OpenAIEmbeddings()
    llm = ChatOpenAI(temperature=0)

    query = "What is Pinecone in machine learning?"

    vectorstore = PineconeVectorStore(
        index_name=os.environ["INDEX_NAME"], embedding=embeddings
    )

    retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
    combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
    retrieval_chain = create_retrieval_chain(
        retriever=vectorstore.as_retriever(), combine_docs_chain=combine_docs_chain
    )

    result = retrieval_chain.invoke(input={"input": query})
    print(result)

๐Ÿ” What This Code Does — Step by Step

1️⃣ load_dotenv()

  • Loads environment variables from your .env file — typically your API keys and Pinecone index name.
  • Example .env contents:

OPENAI_API_KEY=sk-xxx
PINECONE_API_KEY=xxx
INDEX_NAME=my-rag-index

This keeps credentials secure and out of your codebase. 🔐

2️⃣ embeddings = OpenAIEmbeddings()

  • Creates an embedding function — the same one you used during ingestion.
  • This ensures your query embedding matches the stored embeddings in Pinecone.
  • Remember:

    • Embedding model consistency is crucial — mismatched models = bad retrieval results.

3️⃣ llm = ChatOpenAI(temperature=0)

  • Initializes your Large Language Model (LLM) for generating answers.
  • Setting temperature=0 makes responses essentially deterministic (repeatable and focused on the supplied context).

  • Perfect for knowledge-based Q&A systems.
  • 🔧 “Temperature” controls creativity — higher = more varied responses, lower = precise, consistent ones.

4️⃣ query = "What is Pinecone in machine learning?"

  • Your user’s natural-language question — it’ll be embedded, matched against the Pinecone index, and used to generate the answer.

5️⃣ vectorstore = PineconeVectorStore(...)

  • Connects your LangChain app to your Pinecone vector index.
  • This index must already contain embeddings from your ingestion phase.

vectorstore = PineconeVectorStore(
    index_name=os.environ["INDEX_NAME"], embedding=embeddings
)

  • Input: embedding function + index name
  • Output: a vector store object that can search semantic similarity in Pinecone
    • Under the hood: when you later call .as_retriever(), LangChain uses the Pinecone API to fetch the most similar vectors to your query.

6️⃣ retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

  • LangChain’s Hub offers pre-built, optimized prompts — you’re pulling one designed for retrieval-augmented chat.
  • It automatically formats the question and retrieved docs in a way the LLM understands, reducing your prompt engineering workload.
  • Think of it as: “Give me the context from these docs, and answer the question clearly.”
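If you’d rather see (or customize) that prompt yourself, a simplified, roughly equivalent template looks something like this — an illustration, not the actual hub prompt; the only hard requirement is that create_stuff_documents_chain gets a {context} variable to fill with the retrieved docs:

from langchain_core.prompts import ChatPromptTemplate

retrieval_qa_chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user's question using only the following context:\n\n{context}"),
    ("human", "{input}"),
])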

7️⃣ combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)

  • This creates a chain that “stuffs” (inserts) retrieved documents into the prompt for the LLM.
  • Stuffing = simply concatenating docs into one prompt.

  • Great for small-to-medium context sizes.
  • LangChain offers other strategies too (e.g., map_reduce, refine) for larger document sets.

8️⃣ retrieval_chain = create_retrieval_chain(...)

  • Here’s the heart of the RAG pipeline 💡
  • This chain connects:
    • Retriever → fetches relevant documents from Pinecone
    • CombineDocsChain → merges docs + question into a single LLM prompt
    • LLM → generates the final answer
  • It automates the full flow:
  • Query → Retrieve → Combine → Generate Answer

9️⃣ result = retrieval_chain.invoke(input={"input": query})

  • Executes the full chain.
  • LangChain embeds the query, retrieves relevant docs, and sends them to the LLM for generation.
    • The result is a contextually grounded answer.
  • Example output:

{ 'answer': 'Pinecone is a vector database that enables efficient similarity search and retrieval for large-scale machine learning and AI applications.' }
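Beyond 'answer', the result dict also carries the original 'input' and the retrieved 'context' Documents (as I understand create_retrieval_chain's output), which you can use to cite sources — a small sketch:

print(result["answer"])

# Show where the answer came from (TextLoader stores the file path in doc.metadata["source"])
for doc in result["context"]:
    print(doc.metadata.get("source"), "→", doc.page_content[:80], "...")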
⚙️ Example 2 — PDF Retrieval with FAISS Vector Store
  • Here’s a simple, complete example that loads a PDF,
    • splits it into chunks,
    • embeds them,
    • saves them in FAISS (a lightweight local vector store),
    • then retrieves the relevant chunks to answer a question — all using LangChain.

import os # ๐Ÿ” Set your API key (you can also use .env for better practice) os.environ["OPENAI_API_KEY"] = "YOUR-APIKEY-HERE" from langchain_community.document_loaders import PyPDFLoader from langchain_text_splitters import CharacterTextSplitter from langchain_openai import OpenAIEmbeddings, OpenAI from langchain_community.vectorstores import FAISS from langchain.chains.retrieval import create_retrieval_chain from langchain.chains.combine_documents import create_stuff_documents_chain from langchain import hub if __name__ == "__main__": print("๐Ÿ“„ Loading PDF...") # 1️⃣ Load your PDF document pdf_path = "react.pdf" loader = PyPDFLoader(file_path=pdf_path) documents = loader.load() # 2️⃣ Split the document into smaller chunks text_splitter = CharacterTextSplitter( chunk_size=1000, chunk_overlap=30, separator="\n" ) docs = text_splitter.split_documents(documents=documents) # 3️⃣ Generate embeddings using OpenAI print("๐Ÿง  Creating embeddings...") embeddings = OpenAIEmbeddings() # 4️⃣ Store vectors locally using FAISS vectorstore = FAISS.from_documents(docs, embeddings) vectorstore.save_local("faiss_index_react") # 5️⃣ Load the saved vector store new_vectorstore = FAISS.load_local( "faiss_index_react", embeddings, allow_dangerous_deserialization=True ) # 6️⃣ Create Retrieval-QA Chain retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat") combine_docs_chain = create_stuff_documents_chain(OpenAI(), retrieval_qa_chat_prompt) retrieval_chain = create_retrieval_chain( new_vectorstore.as_retriever(), combine_docs_chain ) # 7️⃣ Ask your question print("๐Ÿค– Asking model: Give me the gist of ReAct in 3 sentences") res = retrieval_chain.invoke({"input": "Give me the gist of ReAct in 3 sentences"}) print("๐Ÿงพ Answer:", res["answer"])

๐Ÿ” What’s Happening Here


๐ŸŒ FAISS vs. Pinecone

🎯 The Takeaway

  • By now, you’ve seen how RAG is built — one layer at a time:
    • Split → Embed → Store → Retrieve → Generate
  • With LangChain, you don’t just get a chatbot — you get a system that can search, reason, and explain using your own data.

    • 🧩 Text Splitters help the model handle large documents
    • 🧠 Embeddings teach it semantic meaning
    • 📦 Vector Databases give it memory
    • 🔍 Retrieval Chains help it think contextually
    • 💬 RAG turns it all into a grounded, reliable answer engine

Key concepts & vocabulary (plain language)
  • Embedding: numeric representation (vector) of text where similar meanings are close in space.

  • Vector database: stores embeddings and lets you run nearest-neighbor searches (example: Pinecone).

  • Document loader: reads PDFs, txt, DOCX, Google Drive files, Notion exports; produces text strings for processing.

  • Text splitter: divides documents into chunks (e.g. 500 tokens) to avoid LLM token limits while preserving meaning.

  • chunk_size: how large each chunk is (words/tokens/characters depending on splitter).

  • chunk_overlap: how many tokens/characters overlap between adjacent chunks (helps context continuity).

  • Retriever: component that looks up top-k relevant chunks from the vector DB.

  • Combine docs chain (create_stuff_documents_chain): simple chain that stuffs retrieved docs into a prompt for the LLM.

  • Retrieval chain (create_retrieval_chain): connects retriever + combine chain into a single callable object.
