Build a Semantic Search Engine over a PDF Document Using LangChain
In this tutorial, we'll build a semantic search engine over a PDF document, running entirely locally with LangChain and llama-cpp-python.
This guide introduces core LangChain concepts like document loaders, embeddings, vector stores, and retrievers, which together allow us to fetch and reason over data in AI applications, and are especially useful for retrieval-augmented generation (RAG).
You can find the full code as a Colab notebook in my GitHub repository.
In this guide, we'll cover the following:
Load and chunk text from PDFs
Create embeddings to represent document meaning
Store and search documents in a vector store
Use retrievers to integrate semantic search into LLM workflows
Use PromptTemplate and RetrievalQA Chain for Q&A
Let's start by setting up the environment.
Install the dependencies using pip or conda:
pip install langchain langchain_community "langchain-chroma>=0.1.2" llama-cpp-python pdfplumber sentence-transformers numpy
We'll also need to download a Llama model from Hugging Face or any other local repository we prefer. To run it locally, we use llama-cpp-python, which lets us run Llama models on our machine without calling external APIs.
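For example, we can pull a quantized GGUF model with the huggingface_hub client (the repository and file name below are just one example; any GGUF chat model you prefer will work):
from huggingface_hub import hf_hub_download
# Download a quantized Llama GGUF file into the local Hugging Face cache
# (example repo/file; swap in whichever GGUF model you prefer)
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)
print(model_path)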
Step 1: Extract Text from the PDF
We will first extract the text from the PDF document. For this task, we'll use pdfplumber (through LangChain's PDFPlumberLoader), as it handles complex PDFs well, including those with tables and multi-column layouts.
from langchain_community.document_loaders import PDFPlumberLoader
file_path = "nke-10k-2023.pdf"
loader = PDFPlumberLoader(file_path)
docs = loader.load()  # one Document per PDF page
print(len(docs), docs[0].page_content[:200], docs[0].metadata)
Step 2: Split Text into Chunks with LangChain’s RecursiveCharacterTextSplitter
To improve retrieval, we can split the document into smaller chunks. This helps ensure that the search process captures meaningful context in smaller portions.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
# Split the loaded documents into chunks
all_splits = text_splitter.split_documents(docs)
print(f"Total chunks: {len(all_splits)}")
Step 3: Generate Embeddings Using SentenceTransformer
We’ll now use SentenceTransformers to generate an embedding for each chunk. SentenceTransformers provides a simple interface for loading pre-trained models and generating sentence-level embeddings.
from sentence_transformers import SentenceTransformer
# Load a pretrained Sentence Transformer model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# Generate an embedding vector for each chunk
chunk_texts = [chunk.page_content for chunk in all_splits]
embedding_vectors = embedding_model.encode(chunk_texts, convert_to_numpy=True)
print(f"Generated {len(embedding_vectors)} embeddings.")
Step 4: Store Embeddings in Chroma (Vector Store)
Next, we will use Chroma, an open-source vector store, to store and index the embeddings. Chroma makes it easy to work with embeddings and provides efficient search capabilities.
import chromadb
# Initialize a persistent Chroma client and create a collection for the embeddings
persist_directory = "chroma_db"
client = chromadb.PersistentClient(path=persist_directory)
collection = client.get_or_create_collection(name="document_embeddings")
# Add the chunk texts, embeddings, and metadata to Chroma
collection.add(
    documents=chunk_texts,
    embeddings=embedding_vectors.tolist(),
    metadatas=[{"start_index": chunk.metadata["start_index"]} for chunk in all_splits],
    ids=[str(i) for i in range(len(chunk_texts))]
)
print("Embeddings stored in the Chroma vector store.")
Step 5: Search Using Similarity Search with Scores and Retriever Integration
The similarity_search, similarity_search_with_score, and as_retriever methods used below come from LangChain's Chroma vector-store wrapper (the langchain-chroma package we installed), so we first wrap the collection created above in a Chroma vector store, passing the same embedding model so that queries are encoded consistently with the stored chunks.
The similarity_search method finds the most relevant documents for a query and returns the closest match(es) based on vector similarity.
The similarity_search_with_score method returns the documents along with their similarity scores, providing a ranking of relevance. This is helpful if we want to assess how closely each document matches the query.
The as_retriever method converts the vector store into a retriever, making it easier to integrate into more advanced workflows, such as a question-answering system built around an LLM (Large Language Model). The retriever searches by similarity and lets us adjust how many results to return (search_kwargs={"k": 3} means the top 3 results).
from langchain_chroma import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
# Wrap the existing Chroma collection in a LangChain vector store,
# reusing the same embedding model so queries are encoded like the chunks
embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = Chroma(
    client=client,
    collection_name="document_embeddings",
    embedding_function=embedding_function,
)
# Use similarity search with the vector store
results = vector_store.similarity_search("How many distribution centers does Nike have in the US?")
print(results[0].page_content)  # Print the content of the most relevant document
# Use similarity search with scores
doc, score = vector_store.similarity_search_with_score("Nike revenue in 2023")[0]
print(f"Score: {score}\nText: {doc.page_content}")
# Convert the vector store into a retriever
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
# Use the retriever to fetch relevant documents for a query
docs = retriever.invoke("When was Nike incorporated?")
print(docs)  # Display the retrieved documents
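As a side note, as_retriever supports other search strategies too; for example, maximal marginal relevance (MMR) trades off relevance against diversity among the returned chunks. A minimal sketch with illustrative settings:
# Optional: an MMR retriever that balances relevance with diversity
mmr_retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 10},
)
print(mmr_retriever.invoke("Nike's international markets"))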
Step 6: Use PromptTemplate and RetrievalQA Chain for Q&A
The PromptTemplate lets us create dynamic prompts for the LLM based on the retrieved context. This ensures the LLM is given relevant information (context) to answer questions accurately.
Finally, we use the RetrievalQA chain, which takes the LLM and the retriever to combine document retrieval and question-answering in one seamless process.
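The chain below needs a local LLM. One way to get one, assuming a GGUF model was downloaded during setup (the file name here is only an example), is LangChain's LlamaCpp wrapper around llama-cpp-python:
from langchain_community.llms import LlamaCpp
# Load the local GGUF model with llama-cpp-python
# (model_path and parameters are examples; point them at your own model)
llm = LlamaCpp(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=4096,       # context window size
    temperature=0.1,  # keep answers factual
    verbose=False,
)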
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA
# Create a prompt template for the Q&A system
template = """
You are a helpful assistant that answers questions based on the provided document context.
Context:
{context}
Question:
{question}
Answer in a concise, fact-based manner:
"""
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)
# Initialize the RetrievalQA chain (default "stuff" chain type)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,  # the local LlamaCpp model initialized above
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)
# Perform a question-answering task
response = qa_chain.invoke({"query": "Summarize Nike's financial highlights for 2023."})
print(response["result"])
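If we also want to see which chunks each answer was grounded in, RetrievalQA can return the retrieved documents alongside the result:
# Rebuild the chain so it also returns the source chunks used for each answer
qa_chain_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
output = qa_chain_with_sources.invoke({"query": "When was Nike incorporated?"})
print(output["result"])
for source in output["source_documents"]:
    print(source.metadata, source.page_content[:100])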
This approach offers a simple and effective way to build a local semantic search engine over PDFs, using SentenceTransformers for embedding creation, Chroma for storing and querying embeddings, and LangChain for combining the document-retrieval and question-answering components.
With this design, you can easily add semantic search, perform question answering, and scale to other kinds of documents or models as needed.

