Retrieving Wisdom from Coursera Courses: How I Built an AI Bot with RAG and LangChain
Recently, I completed the IBM AI Engineering course on the Coursera platform, and I can confidently say it was a rich and rewarding experience. The course provided a strong balance between theory and hands-on practice, offering deep insights into the field of AI.
But knowledge, no matter how powerful, is meaningless if it’s not applied. It’s like a sharp sword never taken to battle: full of potential, yet gathering dust. Leaving knowledge unused is not just a missed opportunity; it’s almost a disservice to ourselves, to humanity, and to the universe.
As we say in Sinhala: “යුද්දෙට නැති කඩුව කොස් කොටන්නද?” (roughly, “Should a sword that never goes to war be used to chop jackfruit?”)
So, I decided to put what I had learned into action. While exploring how to begin, I had a moment of insight: why not use this knowledge to improve my own learning process? If I could build a bot that retrieves information about the concepts taught in the course, much like a question-answering system, it would let me quickly recall the parts I’d forgotten. That way, I could reinforce and apply the concepts more effectively without rewatching course videos repeatedly.
Before diving into the implementation, it’s important to understand what RAG (Retrieval-Augmented Generation) and LangChain are, and how they can be used to build customized Large Language Model (LLM) applications tailored to specific needs.
What is RAG
RAG (Retrieval-Augmented Generation) is a technique that enhances the performance of Large Language Models (LLMs) by integrating external knowledge sources into the generation process. Instead of relying solely on the LLM’s pre-trained internal knowledge, RAG dynamically retrieves relevant information from a custom or external knowledge base.
Large Language Models (LLMs) are trained on vast amounts of publicly available and commonly used text data. While this makes them highly effective for general-purpose tasks, their knowledge is inherently limited to the data they were trained on.
To truly leverage the power of LLMs for specific domains or personalized use cases, we need to provide them with relevant contextual knowledge, drawn from our own custom knowledge bases. This is where a technique called Retrieval-Augmented Generation (RAG) comes into play. RAG enables LLMs to access external data sources at runtime, enhancing their responses with up-to-date or domain-specific information.
The typical workflow of RAG looks like this:
- Indexing phase: This phase starts with collecting custom knowledge from text files, PDFs, databases, or online sources. The data is then split into smaller, manageable chunks to ensure efficient retrieval without losing context.
- Embedding and storing phase: The text chunks are converted into numerical vector representations using an embedding model. These vectors, along with the original text chunks, are then stored in a vector database for efficient retrieval later.
- Retrieval phase: The user’s query is also converted into a vector embedding using the same embedding model, and the RAG system performs a similarity search in the vector database to retrieve the most relevant chunks.
- Augmentation phase: The retrieved chunks are combined with the user’s original query to form a prompt, which is then passed to the underlying LLM.
- Generation phase: The LLM receives the augmented prompt and generates a human-readable, contextually relevant response by leveraging its language understanding capabilities.
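To make these phases concrete, here is a deliberately tiny, self-contained sketch of the same flow. It is not the implementation described later in this article: the toy score function stands in for a real embedding model and the generate function stands in for an LLM, purely to show how the phases fit together.

# Toy RAG flow: index -> retrieve -> augment -> "generate"
documents = [
    "RAG retrieves relevant chunks from a knowledge base at query time.",
    "LangChain orchestrates loaders, embeddings, vector stores, and LLMs.",
    "Embeddings map text to vectors so similar meanings end up close together.",
]

def score(query: str, doc: str) -> int:
    # Stand-in for vector similarity: count shared words
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]

def generate(prompt: str) -> str:
    # Stand-in for the LLM call
    return f"[An LLM would answer here, based on]\n{prompt}"

query = "What does RAG retrieve?"
context = "\n".join(retrieve(query))                            # retrieval
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"   # augmentation
print(generate(prompt))                                         # generation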
What is LangChain
LangChain is an orchestration framework that ties all of these components together, managing and coordinating the different stages of LLM applications and agentic AI systems. In our specific use case, LangChain is responsible for:
- Data ingestion and indexing: LangChain provides dedicated loaders for various file types, which handle the processing and chunking of data automatically, letting it seamlessly manage the ingestion phase of our pipeline.
- Knowledge base management: LangChain offers tools for generating embeddings and integrating with vector databases. After chunking, it passes the segments to an embedding model and stores the resulting vectors efficiently in a connected vector database.
- Contextual retrieval: LangChain embeds the user’s query and retrieves the relevant context from the vector database.
- Context augmentation: LangChain supports dynamic prompt construction using templates and augments the prompt with the retrieved documents, improving the accuracy and relevance of the responses generated by the underlying language model.
- Final result generation: LangChain ties all the components together, sequencing them into a cohesive pipeline that orchestrates the flow from user input to the final response. It routes the constructed prompt through the LLM, and once the output is generated, LangChain can process and format the response using its built-in or custom-defined parsers.
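To illustrate how LangChain composes such stages, here is a minimal LCEL sketch built only from core components. The prompt text and the fake_llm stand-in are illustrative placeholders so the example runs without downloading a model; the real chain for our bot is assembled later in Step 7.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

# A toy "LLM" stand-in: ignores the prompt and returns a fixed string
fake_llm = RunnableLambda(lambda prompt_value: "This is a placeholder answer.")

toy_prompt = ChatPromptTemplate.from_template("Answer briefly: {question}")

# The | operator chains the stages: prompt -> model -> output parser
toy_chain = toy_prompt | fake_llm | StrOutputParser()

print(toy_chain.invoke({"question": "What is RAG?"}))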
How the AI Bot was Built
Now that we have a solid understanding of RAG and LangChain, it’s time to internalize these concepts through hands-on practice. Let’s walk through how I implemented the bot, step by step, to see these technologies applied in practice.
Step 1: Prepare the knowledge base
The first step is to prepare our knowledge base. We want our bot to respond to user queries using a specific, curated knowledge set rather than relying solely on the general knowledge the LLM was trained on. This knowledge can come from various sources such as text files, PDF documents, online resources, or external databases. For this project, I downloaded the transcripts of Coursera course videos as text files, which were then categorized and organized into a dedicated folder.
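The preprocessing code in the later steps refers to this folder through a folder_path variable, which the original snippets assume is already defined. A layout along the following lines is what I’ll assume here; the path and file names are purely illustrative.

# Location of the knowledge-base folder used by the preprocessing code below
folder_path = "./course_transcripts"

# course_transcripts/            (illustrative contents)
# ├── Module-1-Intro-to-Generative-AI.txt
# ├── Module-2-Transformers-and-Attention.txt
# └── Course-Glossary.pdf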
Step 2: Download required libraries
# Install Libraries
!pip install torch
!pip install transformers langchain_community langchain_text_splitters langchain_core
!pip install sentence-transformers
!pip install chromadb
!pip install huggingface_hub
!pip install accelerate
!pip install gradio
For this task, we require the following libraries. Here’s a brief overview of their roles:
- torch – The deep learning framework that performs the underlying tensor computations for the model.
- transformers – A high-level API for working with large language models (LLMs), which we’ll use for model integration.
- langchain_community – Provides various document loaders (e.g., for text and PDF files) and integrations for embedding models and vector databases used in the LangChain pipeline.
- langchain_text_splitters – Helps break down long documents into manageable chunks for better processing and embedding.
- langchain_core – Offers essential components like output parsers and prompt templates using LangChain Expression Language (LCEL).
- sentence-transformers – Supplies the sentence-embedding models we use to turn text chunks into vectors.
- chromadb – A vector database used to store and retrieve document embeddings efficiently.
- huggingface_hub – Handles downloading models and tokenizers from the Hugging Face Hub.
- accelerate – Optimizes model loading and execution by intelligently leveraging available hardware, such as CPUs and GPUs.
- gradio – Enables us to build interactive user interfaces, especially suited for showcasing machine learning models.
Step 3: Import the required packages from the downloaded libraries
# Module Imports
import os
import torch
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import gradio as gr
Step 4: Document Preprocessing Stage
In this stage, we write functions to prepare files for the chunking process, define file loaders based on file types, and then read and split large text data into manageable chunks.
This function scans the knowledge-base folder, renames the files to a standardized format for consistent processing, and groups them into separate lists of text and PDF files.
def preprocess_data_files():
    # folder_path points at the knowledge-base folder prepared in Step 1
    # Normalize file names: drop commas and replace spaces with dashes
    for filename in os.listdir(folder_path):
        full_old_path = os.path.join(folder_path, filename)
        new_filename = filename.replace(",", "").replace(" ", "-")
        full_new_path = os.path.join(folder_path, new_filename)
        os.rename(full_old_path, full_new_path)

    # Group the renamed files by extension
    txt_list: list[str] = [filename for filename in os.listdir(folder_path) if os.path.splitext(filename)[1] == '.txt']
    pdf_list: list[str] = [filename for filename in os.listdir(folder_path) if os.path.splitext(filename)[1] == '.pdf']
    return txt_list, pdf_list
Next, we define functions to load files based on their type. One function is designed specifically for loading plain text files, while another handles PDF files.
# TXT loader
def text_loader(txt_filepath: str):
    txt_loader = TextLoader(txt_filepath)
    return txt_loader.load()

# PDF loader
def pdf_file_loader(pdf_filepath: str):
    pdf_loader = PyPDFLoader(pdf_filepath, extract_images=False)
    return pdf_loader.load()
Next, we define a function to split the file content into smaller, manageable chunks. We use the RecursiveCharacterTextSplitter, which iteratively breaks down large text blocks based on a specified maximum character length while maintaining a fixed overlap with the previous chunk. This overlapping strategy helps preserve the continuity of context across chunks, allowing the embedding model to generate more coherent and meaningful vector representations.
def split_text_for_chunks(document):
    # 1,000-character chunks with 200 characters of overlap between neighbours
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    return text_splitter.split_documents(document)
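As a quick sanity check, you can split a single transcript and inspect the resulting chunks. The file name below is illustrative; substitute any file from your own knowledge-base folder.

# Split one transcript and look at the first chunk (file name is illustrative)
sample_doc = text_loader(os.path.join(folder_path, "Module-1-Intro-to-Generative-AI.txt"))
sample_chunks = split_text_for_chunks(sample_doc)

print(f"Number of chunks: {len(sample_chunks)}")
print(sample_chunks[0].page_content[:300])   # first 300 characters of the first chunk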
Step 5: Model Creation Stage
In this step, we define the LLM using Hugging Face’s transformers pipeline, which provides a high-level, unified interface for working with language models. The pipeline abstracts away the complexity of tasks such as tokenization, text generation, and contextual processing, allowing us to configure parameters like the model, tokenizer, maximum tokens, and temperature with ease. LangChain supports seamless integration with Hugging Face pipelines through its wrappers, making it straightforward to incorporate these models into LCEL (LangChain Expression Language) chains.
def define_LLM():
    model_name: str = 'mistralai/Mistral-7B-Instruct-v0.2'

    # Load the tokenizer and model weights from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory use
        device_map='auto'             # let accelerate place layers on the available hardware
    )

    # Wrap everything in a text-generation pipeline with our sampling settings
    pipe = pipeline(
        'text-generation',
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=1024,
        temperature=0.7,
        do_sample=True,
        top_p=0.95,
        repetition_penalty=1.1
    )

    # Expose the pipeline to LangChain as an LLM
    return HuggingFacePipeline(pipeline=pipe)
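Before wiring the model into the chain, it can be invoked directly to confirm that it loads and generates text. Keep in mind that this downloads several gigabytes of weights and needs a machine with enough memory, so it’s best done once and reused. The question is just an example.

llm = define_LLM()

# Quick smoke test: ask the raw model something before adding retrieval
print(llm.invoke("Explain retrieval-augmented generation in one sentence."))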
Step 6: Vector Database Creation Stage
In this stage, we initialize both the embedding model and the vector database. The embedding model is responsible for converting each text chunk into a high-dimensional vector representation. These vectors, along with their associated text or metadata, are then stored in the vector database. This setup enables efficient semantic search and retrieval based on the meaning of the user’s query.
This function initializes the embedding model using the BAAI/bge-small-en-v1.5 model, with customized arguments tailored to our use case.
def initialize_embedding_model():
    embedding_model_name = "BAAI/bge-small-en-v1.5"
    embeddings_model = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        model_kwargs={'device': 'cpu'},               # run the embedding model on the CPU
        encode_kwargs={'normalize_embeddings': True}  # unit-length vectors for similarity search
    )
    return embeddings_model
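A quick way to verify the embedding model is to embed a sample query and check the vector’s dimensionality; bge-small-en-v1.5 produces 384-dimensional vectors. The query text is only an example.

embeddings_model = initialize_embedding_model()

query_vector = embeddings_model.embed_query("What is a transformer?")
print(len(query_vector))    # 384 for bge-small-en-v1.5
print(query_vector[:5])     # first few components of the normalized vector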
This function initializes the vector database. If a database file already exists, it loads the existing data; otherwise, it creates a new database instance. At this stage, we don’t index any text chunks yet, as we handle indexing later in bulk due to the large volume of data to be stored.
def initialize_vector_DB(embeddings_model):
    vector_db_path = "./chroma_db_hf_embeddings"

    if os.path.exists(vector_db_path) and os.listdir(vector_db_path):
        # Reuse the existing persisted collection
        vector_db = Chroma(
            persist_directory=vector_db_path,
            embedding_function=embeddings_model
        )
    else:
        # Create a new persistent collection; chunks are added later in bulk
        vector_db = Chroma.from_documents(
            documents=[],
            embedding=embeddings_model,
            persist_directory=vector_db_path
        )
        vector_db.persist()
    return vector_db
Next, we define a function to add individual chunks of a document to the vector database and persist them. This function is necessary because, as mentioned earlier, we delayed adding chunks until after the vector database has been initialized, rather than during its initialization.
def load_chunks_to_vector_db(doc_chunks, vector_db):
    vector_db.add_documents(doc_chunks)
    vector_db.persist()
Next, we write a function that reads formatted file paths from arrays of text and PDF files, loads each file using the appropriate loader, splits the content into manageable chunks, and finally adds these chunks to the vector database.
def read_files_and_load(txt_list: list[str], pdf_list: list[str], vector_db):
    # Text files: load, chunk, and store
    for txt_filename in txt_list:
        full_path = os.path.join(folder_path, txt_filename)
        loaded_file = text_loader(full_path)
        chunks = split_text_for_chunks(loaded_file)
        load_chunks_to_vector_db(chunks, vector_db)

    # PDF files: same flow, using the PDF loader
    for pdf_filename in pdf_list:
        try:
            full_path = os.path.join(folder_path, pdf_filename)
            loaded_file = pdf_file_loader(full_path)
            chunks = split_text_for_chunks(loaded_file)
            load_chunks_to_vector_db(chunks, vector_db)
        except Exception as e:
            print(f'Exception occurred when reading {pdf_filename}: {e}')
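Once the helper functions above have been used to populate the database (as create_langchain does in the next step), a plain similarity search is a quick sanity check that retrieval works before involving the LLM at all. The query below is just an example.

# Assumes the database has already been built and populated with chunks
embedding_model = initialize_embedding_model()
vector_db = initialize_vector_DB(embedding_model)

hits = vector_db.similarity_search("What is gradient descent?", k=3)
for hit in hits:
    print(hit.metadata.get("source"), "->", hit.page_content[:120])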
Step 7: LangChain Creation Stage
Here, we assemble all the functions we created earlier into the final LangChain workflow by invoking them sequentially in the correct order. Once the chain is initialized, we take the user’s query, process it through the chain, and generate an answer by retrieving relevant knowledge from our knowledge base via the RAG process.
The following function builds the chain by assembling the previously created functions. Here, we also craft a prompt template designed to elicit accurate and concise responses from the LLM.
def create_langchain():
    # Build the knowledge base: collect files, embed them, and store the chunks
    txt_list, pdf_list = preprocess_data_files()
    embedding_model = initialize_embedding_model()
    vector_db = initialize_vector_DB(embedding_model)
    read_files_and_load(txt_list, pdf_list, vector_db)

    # Retrieve the 3 most similar chunks for each query
    retriever = vector_db.as_retriever(search_kwargs={"k": 3})

    llm = define_LLM()

    # 'template' is the prompt string shown below
    prompt = ChatPromptTemplate.from_template(template)

    # LCEL pipeline: retrieve context, fill the prompt, call the LLM, parse to a string
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()
    )
    return rag_chain

qa_rag_chain = create_langchain()
The prompt template we used is:
template = """You are an AI assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question concisely.
If you don't know the answer, just say that you don't know.
Keep the answer to three sentences maximum.
Question: {question}
Context: {context}
Answer:"""
The following function invokes the chain with the user’s question, which will be collected through the user interface created in the next step. It then post-processes the chain’s response to extract the final answer and returns it to the user.
def process_query(user_question):
    print(user_question)
    if not user_question:
        return "Please enter a query to get a response."
    try:
        response = qa_rag_chain.invoke(user_question)

        # The pipeline returns the prompt plus the completion, so keep only the
        # text that follows the "Answer:" marker from our template
        answer_prefix = "Answer:"
        answer_start_index = response.find(answer_prefix)
        if answer_start_index != -1:
            extracted_answer = response[answer_start_index + len(answer_prefix):].strip()
        else:
            print("Warning: 'Answer:' prefix not found in response. Returning full output.")
            extracted_answer = response
        return extracted_answer
    except Exception as e:
        print(f"An error occurred: {e}")
        return "Sorry, something went wrong while generating the response."
Step 8: Create a User Interface
In this step, we create an interactive user interface using Gradio, a Python library designed for building UIs for AI applications. We implement a simple interface consisting of a single text box for user input and an output box to display the model’s response.
demo = gr.Interface(
    fn=process_query,
    inputs=gr.Textbox(
        lines=5,
        label="Your Input Query:",
        placeholder="Type your question here..."
    ),
    outputs=gr.Textbox(
        lines=10,
        label="Output Response:",
        interactive=False
    ),
    title="Simple RAG Query Interface",
    description="Enter a query and get a response from the RAG system."
)

demo.launch()
Final Product Demo
This is a demo of the final product we built using the concepts and code discussed above. You may notice that response generation takes some time. This is due to the size of our knowledge base and the limitations of our computing resources. In the next article, we’ll explore the root causes behind this latency and discuss strategies to improve performance significantly.
Instead of just reading this article, I encourage you to apply what you’ve learned by building your own question-answering bot using a custom knowledge base. It’s the best way to reinforce the concepts in a practical, hands-on way.
You can find the GitHub repository for this project below — feel free to fork it, modify it, and make it your own!