- Large language models are powerful, but their answers depend on the context you give them. RAG (Retrieval-Augmented Generation) fixes that by retrieving relevant pieces of real documents and feeding them to the model.
- Every great AI system starts with one simple question: “How can my model remember and use knowledge that’s not in its training data?”
- That’s where Retrieval-Augmented Generation (RAG) comes in — a method that lets Large Language Models (LLMs) retrieve real information from external sources before answering.
- Think of RAG as giving your model a search engine for its memory.
- 🧩 Splitting raw text into meaningful pieces
- 🧠 Turning text into embeddings
- 📦 Storing those embeddings in a vector database
- 🔍 Retrieving the right pieces on demand
- 💬 Generating an accurate, grounded answer
🧩 Step 1: Text Splitters — Preparing Your Knowledge Base
- Imagine you’re feeding a giant encyclopedia to ChatGPT.
- You can’t just dump all 10,000 pages at once — it would choke! 🫣
- That’s why we use Text Splitters — they break large documents into smaller, manageable chunks without losing meaning.
Why Split Text?
- LLMs have token limits (e.g., GPT-4-turbo ≈ 128k tokens max).
- Smaller chunks = faster, cheaper queries.
- Each chunk gets stored separately, which improves retrieval precision.
Key Parameters
- Adjust chunk_size based on your content (see the sketch below).
- Technical docs = larger chunks.
- Narrative text = smaller chunks for accuracy.
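To make this concrete, here's a minimal splitting sketch using LangChain's CharacterTextSplitter (the same splitter the end-to-end scripts below use); the sample text and parameter values are illustrative only.

from langchain_text_splitters import CharacterTextSplitter

# Illustrative sample -- in practice you'd load a real document first
sample_text = (
    "Pinecone is a managed vector database.\n"
    "It stores embeddings and answers nearest-neighbor queries.\n"
    "LangChain wraps it behind a simple vector store interface.\n"
)

# Small chunk_size just for the demo; tune it for your own content
text_splitter = CharacterTextSplitter(chunk_size=60, chunk_overlap=10, separator="\n")
chunks = text_splitter.split_text(sample_text)

print(f"Created {len(chunks)} chunks")
for chunk in chunks:
    print("-", chunk)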
🧠 Step 2: Embeddings — Teaching AI What “Meaning” Feels Like
Here’s the magic trick:
- LLMs can’t understand raw text in storage — but they can measure similarity between ideas using embeddings.
- An embedding is a numerical representation (a list of floating-point numbers) of text in a high-dimensional vector space.
- In this space:
- “Artificial Intelligence” and “Machine Learning” are close together.
- “Dog” and “Banana” are far apart.
How It Works
Text → Encoder (e.g., OpenAIEmbeddings) → Vector like [0.12, 0.43, -0.87, ...]
Example:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("What is Pinecone?")
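Embeddings only become useful when you compare them. The sketch below (assuming an OPENAI_API_KEY is set; the cosine helper is defined here, not part of LangChain) embeds three phrases and shows that related concepts score higher than unrelated ones.

import math
from langchain_openai import OpenAIEmbeddings

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: close to 1.0 means "similar meaning", near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

embeddings = OpenAIEmbeddings()
ai_vec, ml_vec, banana_vec = embeddings.embed_documents(
    ["Artificial Intelligence", "Machine Learning", "Banana"]
)

print("AI vs ML:    ", cosine(ai_vec, ml_vec))      # expected: higher similarity
print("AI vs Banana:", cosine(ai_vec, banana_vec))  # expected: noticeably lower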
📦 Step 3: Vector Databases — Where Memory Lives
Now that you have semantic vectors, you need somewhere to store them — that’s what vector databases are for.
Think of a vector store as the “memory palace” 🏰 of your AI system — it remembers embeddings and finds the most similar ones to any new query.
Popular Vector Stores

from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(index_name="my-index", embedding=embeddings)
Use Pinecone for production apps that need scale and persistence. Use FAISS for local prototypes.
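Since the post later prototypes locally with FAISS, here's a minimal sketch of that option (assuming the faiss-cpu and langchain-community packages are installed; the toy sentences are stand-ins for your real chunks):

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Toy corpus -- in practice, pass the chunks produced by your text splitter
vectorstore = FAISS.from_texts(
    ["Pinecone is a managed vector database.", "FAISS runs similarity search locally."],
    embeddings,
)

# Persist the index to disk and reload it later without re-embedding
vectorstore.save_local("faiss_index_demo")
reloaded = FAISS.load_local(
    "faiss_index_demo", embeddings, allow_dangerous_deserialization=True
)
print(reloaded.similarity_search("What is Pinecone?", k=1)[0].page_content)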
🧱 Step 4: Building the Retrieval Pipeline
Now that your data is split, embedded, and stored — it’s time to retrieve!
- Retrieval is how your AI searches its memory for the most relevant chunks based on a user’s query.
What is a Retriever?
- It’s like a librarian 📚 — it doesn’t answer questions, it just fetches relevant pages.
In LangChain:
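Here's a minimal sketch, assuming vectorstore is the Pinecone or FAISS store from the previous step:

# Turn the vector store into a retriever that returns the top-k most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# This only searches memory -- it fetches chunks, it does not generate an answer
docs = retriever.invoke("What is Pinecone?")
for doc in docs:
    print(doc.page_content[:100], "...")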
💬 Step 5: RetrievalQA Chain — Connecting Retrieval and Generation
The RetrievalQA Chain is LangChain’s “brains + memory” combo. It connects:

- a retriever (memory searcher 🧠)
- an LLM (responder 💬)
- and a prompt template (question formatter 🧾)
When you call the chain, it:
- Takes your query
- Retrieves the top matching chunks
- Inserts them into a prompt template
- Sends it to the LLM
- Returns the final grounded answer
from langchain.chains.retrieval import create_retrieval_chain

retrieval_chain = create_retrieval_chain(
    retriever=vectorstore.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
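The combine_docs_chain above has to be created first. Here's the pattern the full scripts below use: pull a retrieval-QA prompt from the LangChain hub and wrap an LLM in a "stuff documents" chain.

from langchain import hub
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

# A ready-made prompt with placeholders for the retrieved context and the user input
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

# "Stuff" = concatenate all retrieved documents into the prompt's context slot
combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)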
⏱ Step 6: Token Limits in LLMs — Why Chunking Matters
- Every LLM has a maximum context window — the number of tokens it can “see” at once.
- If your input exceeds that, the extra tokens are simply truncated and the model never sees them.
| Model | Token Limit |
|---|---|
| GPT-3.5 | ~16K tokens |
| GPT-4-turbo | ~128K tokens |
| Claude 3 Sonnet | ~200K tokens |
How to Manage It:
- Use text splitters to stay under token limits (a token-counting sketch follows this list).
- Use retrieval to dynamically bring only relevant chunks.
- Use map-reduce chains if you need to process huge docs.
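A practical habit is to count tokens before sending anything. Here's a sketch using the tiktoken tokenizer library; the model name and the prompt text are illustrative, not part of the scripts below.

import tiktoken

def count_tokens(text: str, model: str = "gpt-4-turbo") -> int:
    # Fall back to a generic encoding if tiktoken doesn't recognise the model name
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

context = "...retrieved chunks go here..."
question = "What is Pinecone in machine learning?"
print(count_tokens(context + "\n" + question), "tokens in this prompt")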
Step 7: Retrieval-Augmented Generation (RAG)
- Now the final destination: RAG, or Retrieval-Augmented Generation.
- It combines everything we’ve learned:
- Retrieve relevant chunks from the vector DB.
- Augment the user query with these chunks as context.
- Generate a grounded, factually accurate answer.
- You’re no longer asking the model to “remember” — you’re teaching it to look things up before answering.
LangChain Example — End-to-End Code
Scalable Retrieval with Pinecone and LangChain
First, the ingestion script: it loads a document, splits it into chunks, embeds them, and uploads the vectors to a Pinecone index.
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

load_dotenv()

if __name__ == "__main__":
    print("Ingesting...")

    # 1️⃣ Load your document
    loader = TextLoader("/Users/edenmarco/Desktop/intro-to-vector-dbs/mediumblog1.txt")
    document = loader.load()

    # 2️⃣ Split it into smaller chunks
    print("✂️ Splitting text...")
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(document)
    print(f"✅ Created {len(texts)} chunks")

    # 3️⃣ Create embeddings
    embeddings = OpenAIEmbeddings(openai_api_key=os.environ.get("OPENAI_API_KEY"))

    # 4️⃣ Store embeddings in Pinecone
    print("📦 Uploading vectors to Pinecone...")
    PineconeVectorStore.from_documents(texts, embeddings, index_name=os.environ["INDEX_NAME"])

    print("Ingestion complete!")
With the index populated, the retrieval script embeds the query, pulls the most similar chunks from Pinecone, and hands them to the LLM:

import os
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain import hub
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
load_dotenv()
if __name__ == "__main__":
    print("Retrieving...")

    embeddings = OpenAIEmbeddings()
    llm = ChatOpenAI(temperature=0)

    query = "What is Pinecone in machine learning?"

    vectorstore = PineconeVectorStore(
        index_name=os.environ["INDEX_NAME"], embedding=embeddings
    )

    retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
    combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)

    retrieval_chain = create_retrieval_chain(
        retriever=vectorstore.as_retriever(),
        combine_docs_chain=combine_docs_chain,
    )

    result = retrieval_chain.invoke(input={"input": query})
    print(result)
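The chain's output is a dict with "input", "context" (the retrieved documents), and "answer" keys, so in practice you'll usually print just the answer:

# "answer" holds the generated response; "context" holds the chunks it was grounded on
print(result["answer"])

Local Prototyping with FAISS
The same pipeline also works fully locally: the script below loads a PDF, stores the vectors in a FAISS index on disk, and queries it with the same retrieval chain.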
import os
# Set your API key (you can also use .env for better practice)
os.environ["OPENAI_API_KEY"] = "YOUR-APIKEY-HERE"
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain import hub
if __name__ == "__main__":
    print("Loading PDF...")

    # 1️⃣ Load your PDF document
    pdf_path = "react.pdf"
    loader = PyPDFLoader(file_path=pdf_path)
    documents = loader.load()

    # 2️⃣ Split the document into smaller chunks
    text_splitter = CharacterTextSplitter(
        chunk_size=1000, chunk_overlap=30, separator="\n"
    )
    docs = text_splitter.split_documents(documents=documents)

    # 3️⃣ Generate embeddings using OpenAI
    print("🧠 Creating embeddings...")
    embeddings = OpenAIEmbeddings()

    # 4️⃣ Store vectors locally using FAISS
    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstore.save_local("faiss_index_react")

    # 5️⃣ Load the saved vector store
    new_vectorstore = FAISS.load_local(
        "faiss_index_react", embeddings, allow_dangerous_deserialization=True
    )

    # 6️⃣ Create Retrieval-QA Chain
    retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
    combine_docs_chain = create_stuff_documents_chain(OpenAI(), retrieval_qa_chat_prompt)
    retrieval_chain = create_retrieval_chain(
        new_vectorstore.as_retriever(), combine_docs_chain
    )

    # 7️⃣ Ask your question
    print("Asking model: Give me the gist of ReAct in 3 sentences")
    res = retrieval_chain.invoke({"input": "Give me the gist of ReAct in 3 sentences"})
    print("🧾 Answer:", res["answer"])
What’s Happening Here

- The PDF is loaded and split into overlapping chunks (chunk_size=1000, chunk_overlap=30).
- Each chunk is embedded with OpenAI and stored in a local FAISS index, which is saved to disk and reloaded.
- The retrieval chain pulls the most relevant chunks and stuffs them into the retrieval-qa-chat prompt.
- The LLM answers using only that retrieved context, and the script prints res["answer"].
🎯 The Takeaway
- By now, you’ve seen how RAG is built — one layer at a time:
- Split → Embed → Store → Retrieve → Generate
- With LangChain, you don’t just get a chatbot — you get a system that can search, reason, and explain using your own data.
- 🧩 Text Splitters help the model handle large documents
- 🧠 Embeddings teach it semantic meaning
- 📦 Vector Databases give it memory
- 🔍 Retrieval Chains help it think contextually
- 💬 RAG turns it all into a grounded, reliable answer engine
Quick Glossary

- Embedding: numeric representation (vector) of text where similar meanings are close in space.
- Vector database: stores embeddings and lets you run nearest-neighbor searches (example: Pinecone).
- Document loader: reads PDFs, txt, DOCX, Google Drive files, Notion exports; produces text strings for processing.
- Text splitter: divides documents into chunks (e.g. 500 tokens) to avoid LLM token limits while preserving meaning.
- chunk_size: how large each chunk is (words/tokens/characters depending on splitter).
- chunk_overlap: how many tokens/characters overlap between adjacent chunks (helps context continuity).
- Retriever: component that looks up top-k relevant chunks from the vector DB.
- Combine docs chain (create_stuff_documents_chain): simple chain that stuffs retrieved docs into a prompt for the LLM.
- Retrieval chain (create_retrieval_chain): connects retriever + combine chain into a single callable object.