- Large language models are powerful, but their answers depend on the context you give them. RAG (Retrieval-Augmented Generation) fixes that by retrieving relevant pieces of real documents and feeding them to the model.
- Every great AI system starts with one simple question: “How can my model remember and use knowledge that’s not in its training data?”
- That’s where Retrieval-Augmented Generation (RAG) comes in — a method that lets Large Language Models (LLMs) retrieve real information from external sources before answering.
- Think of RAG as giving your model a search engine for its memory.
- 🧩 Splitting raw text into meaningful pieces
- 🧠 Turning text into embeddings
- 📦 Storing those embeddings in a vector database
- 🔍 Retrieving the right pieces on demand
- 💬 Generating an accurate, grounded answer
🧩 Step 1: Text Splitters — Preparing Your Knowledge Base
- Imagine you’re feeding a giant encyclopedia to ChatGPT.
- You can’t just dump all 10,000 pages at once — it would choke! 🫣
- That’s why we use Text Splitters — they break large documents into smaller, manageable chunks without losing meaning.
Why Split Text?
- LLMs have token limits (e.g., GPT-4-turbo ≈ 128k tokens max).
- Smaller chunks = faster, cheaper queries.
- Each chunk gets stored separately, which improves retrieval precision.
Key Parameters
- Adjust chunk_size based on your content (see the sketch below).
- Technical docs = larger chunks.
- Narrative text = smaller chunks for accuracy.
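To make this concrete, here's a minimal splitting sketch using LangChain's CharacterTextSplitter (the same splitter the end-to-end scripts below use); the sample text and parameter values are illustrative only.

from langchain_text_splitters import CharacterTextSplitter

# Illustrative sample -- in practice you'd load a real document first
sample_text = (
    "Pinecone is a managed vector database.\n"
    "It stores embeddings and answers nearest-neighbor queries.\n"
    "LangChain wraps it behind a simple vector store interface.\n"
)

# Small chunk_size just for the demo; tune it for your own content
text_splitter = CharacterTextSplitter(chunk_size=60, chunk_overlap=10, separator="\n")
chunks = text_splitter.split_text(sample_text)

print(f"Created {len(chunks)} chunks")
for chunk in chunks:
    print("-", chunk)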
🧠 Step 2: Embeddings — Teaching AI What “Meaning” Feels Like
Here’s the magic trick:
- LLMs can’t understand raw text in storage — but they can measure similarity between ideas using embeddings.
- An embedding is a numerical representation (a list of floating-point numbers) of text in a high-dimensional vector space.
- In this space:
- “Artificial Intelligence” and “Machine Learning” are close together.
- “Dog” and “Banana” are far apart.
How It Works
Text → Encoder (e.g., OpenAIEmbeddings) → Vector like [0.12, 0.43, -0.87, ...]
Example:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("What is Pinecone?")
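Embeddings only become useful when you compare them. The sketch below (assuming an OPENAI_API_KEY is set; the cosine helper is defined here, not part of LangChain) embeds three phrases and shows that related concepts score higher than unrelated ones.

import math
from langchain_openai import OpenAIEmbeddings

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: close to 1.0 means "similar meaning", near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

embeddings = OpenAIEmbeddings()
ai_vec, ml_vec, banana_vec = embeddings.embed_documents(
    ["Artificial Intelligence", "Machine Learning", "Banana"]
)

print("AI vs ML:    ", cosine(ai_vec, ml_vec))      # expected: higher similarity
print("AI vs Banana:", cosine(ai_vec, banana_vec))  # expected: noticeably lower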
📦 Step 3: Vector Databases — Where Memory Lives
Now that you have semantic vectors, you need somewhere to store them — that’s what vector databases are for.
Think of a vector store as the “memory palace” 🏰 of your AI system — it remembers embeddings and finds the most similar ones to any new query.
Popular Vector Stores

from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(index_name="my-index", embedding=embeddings)
Use Pinecone for production apps that need scale and persistence. Use FAISS for local prototypes.
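Since the post later prototypes locally with FAISS, here's a minimal sketch of that option (assuming the faiss-cpu and langchain-community packages are installed; the toy sentences are stand-ins for your real chunks):

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Toy corpus -- in practice, pass the chunks produced by your text splitter
vectorstore = FAISS.from_texts(
    ["Pinecone is a managed vector database.", "FAISS runs similarity search locally."],
    embeddings,
)

# Persist the index to disk and reload it later without re-embedding
vectorstore.save_local("faiss_index_demo")
reloaded = FAISS.load_local(
    "faiss_index_demo", embeddings, allow_dangerous_deserialization=True
)
print(reloaded.similarity_search("What is Pinecone?", k=1)[0].page_content)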
🧱 Step 4: Building the Retrieval Pipeline
Now that your data is split, embedded, and stored — it’s time to retrieve!
- Retrieval is how your AI searches its memory for the most relevant chunks based on a user’s query.
What is a Retriever?
- It’s like a librarian 📚 — it doesn’t answer questions, it just fetches relevant pages.
In LangChain:
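Here's a minimal sketch, assuming vectorstore is the Pinecone or FAISS store from the previous step:

# Turn the vector store into a retriever that returns the top-k most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# This only searches memory -- it fetches chunks, it does not generate an answer
docs = retriever.invoke("What is Pinecone?")
for doc in docs:
    print(doc.page_content[:100], "...")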
💬 Step 5: RetrievalQA Chain — Connecting Retrieval and Generation
The RetrievalQA Chain is LangChain’s “brains + memory” combo. It connects:

- a retriever (memory searcher 🧠)
- an LLM (responder 💬)
- and a prompt template (question formatter 🧾)
When you call the chain, it:
- Takes your query
- Retrieves the top matching chunks
- Inserts them into a prompt template
- Sends it to the LLM
- Returns the final grounded answer
from langchain.chains.retrieval import create_retrieval_chain

retrieval_chain = create_retrieval_chain(
    retriever=vectorstore.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
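The combine_docs_chain above has to be created first. Here's the pattern the full scripts below use: pull a retrieval-QA prompt from the LangChain hub and wrap an LLM in a "stuff documents" chain.

from langchain import hub
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

# A ready-made prompt with placeholders for the retrieved context and the user input
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

# "Stuff" = concatenate all retrieved documents into the prompt's context slot
combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)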
⏱ Step 6: Token Limits in LLMs — Why Chunking Matters
- Every LLM has a maximum context window — the number of tokens it can “see” at once.
- If your input exceeds that, the extra tokens are simply truncated and the model never sees them.
| Model | Token Limit |
|---|---|
| GPT-3.5 | ~16K tokens |
| GPT-4-turbo | ~128K tokens |
| Claude 3 Sonnet | ~200K tokens |
How to Manage It:
- Use text splitters to stay under token limits (a token-counting sketch follows this list).
- Use retrieval to dynamically bring only relevant chunks.
- Use map-reduce chains if you need to process huge docs.
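A practical habit is to count tokens before sending anything. Here's a sketch using the tiktoken tokenizer library; the model name and the prompt text are illustrative, not part of the scripts below.

import tiktoken

def count_tokens(text: str, model: str = "gpt-4-turbo") -> int:
    # Fall back to a generic encoding if tiktoken doesn't recognise the model name
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

context = "...retrieved chunks go here..."
question = "What is Pinecone in machine learning?"
print(count_tokens(context + "\n" + question), "tokens in this prompt")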
Step 7: Retrieval-Augmented Generation (RAG)
- Now the final destination: RAG, or Retrieval-Augmented Generation.
- It combines everything we’ve learned:
- Retrieve relevant chunks from the vector DB.
- Augment the user query with these chunks as context.
- Generate a grounded, factually accurate answer.
- You’re no longer asking the model to “remember” — you’re teaching it to look things up before answering.
LangChain Example — End-to-End Code
Scalable Retrieval with Pinecone and LangChain
First, the ingestion script: it loads a document, splits it into chunks, embeds them, and uploads the vectors to a Pinecone index.
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

load_dotenv()

if __name__ == "__main__":
    print("Ingesting...")

    # 1️⃣ Load your document
    loader = TextLoader("/Users/edenmarco/Desktop/intro-to-vector-dbs/mediumblog1.txt")
    document = loader.load()

    # 2️⃣ Split it into smaller chunks
    print("✂️ Splitting text...")
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(document)
    print(f"✅ Created {len(texts)} chunks")

    # 3️⃣ Create embeddings
    embeddings = OpenAIEmbeddings(openai_api_key=os.environ.get("OPENAI_API_KEY"))

    # 4️⃣ Store embeddings in Pinecone
    print("📦 Uploading vectors to Pinecone...")
    PineconeVectorStore.from_documents(texts, embeddings, index_name=os.environ["INDEX_NAME"])

    print("Ingestion complete!")
With the index populated, the retrieval script embeds the query, pulls the most similar chunks from Pinecone, and hands them to the LLM:

import os
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain import hub
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
load_dotenv()
if __name__ == "__main__":
    print("Retrieving...")

    embeddings = OpenAIEmbeddings()
    llm = ChatOpenAI(temperature=0)

    query = "What is Pinecone in machine learning?"

    vectorstore = PineconeVectorStore(
        index_name=os.environ["INDEX_NAME"], embedding=embeddings
    )

    retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
    combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)

    retrieval_chain = create_retrieval_chain(
        retriever=vectorstore.as_retriever(),
        combine_docs_chain=combine_docs_chain,
    )

    result = retrieval_chain.invoke(input={"input": query})
    print(result)
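The chain's output is a dict with "input", "context" (the retrieved documents), and "answer" keys, so in practice you'll usually print just the answer:

# "answer" holds the generated response; "context" holds the chunks it was grounded on
print(result["answer"])

Local Prototyping with FAISS
The same pipeline also works fully locally: the script below loads a PDF, stores the vectors in a FAISS index on disk, and queries it with the same retrieval chain.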
import os
# Set your API key (you can also use .env for better practice)
os.environ["OPENAI_API_KEY"] = "YOUR-APIKEY-HERE"
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain import hub
if __name__ == "__main__":
    print("Loading PDF...")

    # 1️⃣ Load your PDF document
    pdf_path = "react.pdf"
    loader = PyPDFLoader(file_path=pdf_path)
    documents = loader.load()

    # 2️⃣ Split the document into smaller chunks
    text_splitter = CharacterTextSplitter(
        chunk_size=1000, chunk_overlap=30, separator="\n"
    )
    docs = text_splitter.split_documents(documents=documents)

    # 3️⃣ Generate embeddings using OpenAI
    print("🧠 Creating embeddings...")
    embeddings = OpenAIEmbeddings()

    # 4️⃣ Store vectors locally using FAISS
    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstore.save_local("faiss_index_react")

    # 5️⃣ Load the saved vector store
    new_vectorstore = FAISS.load_local(
        "faiss_index_react", embeddings, allow_dangerous_deserialization=True
    )

    # 6️⃣ Create Retrieval-QA Chain
    retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
    combine_docs_chain = create_stuff_documents_chain(OpenAI(), retrieval_qa_chat_prompt)
    retrieval_chain = create_retrieval_chain(
        new_vectorstore.as_retriever(), combine_docs_chain
    )

    # 7️⃣ Ask your question
    print("Asking model: Give me the gist of ReAct in 3 sentences")
    res = retrieval_chain.invoke({"input": "Give me the gist of ReAct in 3 sentences"})
    print("🧾 Answer:", res["answer"])
What’s Happening Here

- The PDF is loaded and split into overlapping chunks (chunk_size=1000, chunk_overlap=30).
- Each chunk is embedded with OpenAI and stored in a local FAISS index, which is saved to disk and reloaded.
- The retrieval chain pulls the most relevant chunks and stuffs them into the retrieval-qa-chat prompt.
- The LLM answers using only that retrieved context, and the script prints res["answer"].
🎯 The Takeaway
- By now, you’ve seen how RAG is built — one layer at a time:
- Split → Embed → Store → Retrieve → Generate
- With LangChain, you don’t just get a chatbot — you get a system that can search, reason, and explain using your own data.
- 🧩 Text Splitters help the model handle large documents
- 🧠 Embeddings teach it semantic meaning
- 📦 Vector Databases give it memory
- 🔍 Retrieval Chains help it think contextually
- 💬 RAG turns it all into a grounded, reliable answer engine
Quick Glossary

- Embedding: numeric representation (vector) of text where similar meanings are close in space.
- Vector database: stores embeddings and lets you run nearest-neighbor searches (example: Pinecone).
- Document loader: reads PDFs, txt, DOCX, Google Drive files, Notion exports; produces text strings for processing.
- Text splitter: divides documents into chunks (e.g. 500 tokens) to avoid LLM token limits while preserving meaning.
- chunk_size: how large each chunk is (words/tokens/characters depending on splitter).
- chunk_overlap: how many tokens/characters overlap between adjacent chunks (helps context continuity).
- Retriever: component that looks up top-k relevant chunks from the vector DB.
- Combine docs chain (create_stuff_documents_chain): simple chain that stuffs retrieved docs into a prompt for the LLM.
- Retrieval chain (create_retrieval_chain): connects retriever + combine chain into a single callable object.