From Web Scraping to Conversational AI

💡 A Step-by-Step Tutorial on how to make an AI chatbot for your website
💡 We will learn and use:
• Python uv: a blazing-fast pip replacement
• Scrapy: web scraping toolkit
• OpenAI: Embeddings API & LLM
• LangChain: semantic chunking
• Qdrant: vector DB to store embeddings & query them
Have you noticed lately an increase in the number of websites with an AI chatbot capability? One that lets you ask questions about pricing, documentation, advanced topics, and so on?
If you've noticed this trend, you're not alone. AI chatbots have become increasingly popular on websites across various industries, as they provide a convenient and efficient way for users to get the information they need quickly. These chatbots are powered by advanced technologies such as natural language processing (NLP), machine learning, and deep learning, which enable them to understand and respond to user queries in a human-like manner.
We will look into the process of creating your own AI chatbot for your website using Python and a combination of powerful tools and libraries. We'll cover web scraping using Scrapy to gather content from your site, creating embeddings from the scraped content using the OpenAI API, and utilizing Qdrant as a vector database to store and query the embeddings. Additionally, we'll explore how to use Langchain for semantic chunking and integrate the queried content with OpenAI's GPT-4 to provide accurate and context-aware responses to user queries.
Setup & Environment
We will add a twist and use uv as our blazing-fast Python package installer, so go ahead and install it if you don't have it:
# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows.
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Then, we will create a new directory and install all the needed libraries:
mkdir chatbot && cd chatbot
uv venv
uv pip install scrapy
uv pip install beautifulsoup4
uv pip install openai
uv pip install qdrant-client
uv pip install langchain
uv pip install langchain_experimental
uv pip install langchain_openai
uv pip install gradio
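One note: uv pip install installs into the .venv that uv venv created, so to have scrapy and python on your PATH for the commands that follow, activate the environment first (or prefix each command with uv run):
# On macOS and Linux.
source .venv/bin/activate
# On Windows.
.venv\Scripts\activate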
Website Scraping
Now let's create a new Scrapy project for our website. In our case, we will use yours truly, codereliant.io:
scrapy startproject codereliant
In the file items.py, modify the CodereliantItem class like below:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class CodereliantItem(scrapy.Item):
    url = scrapy.Field()
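Note that scrapy startproject does not generate the spider itself; the crawl command below assumes a spider named codereliant exists in spiders/codereliant.py. Here is a minimal sketch of what it could look like (assuming a CrawlSpider that follows every internal link; the exact rules in your project may differ):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import CodereliantItem

class CodereliantSpider(CrawlSpider):
    name = "codereliant"
    allowed_domains = ["codereliant.io"]
    start_urls = ["https://www.codereliant.io"]

    # Follow every internal link and hand each page to parse_item
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        # For now we only collect the URL; we'll add more fields shortly
        item = CodereliantItem()
        item["url"] = response.url
        return item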
Let's do a simple test to see if our scraping is working as intended:
cd codereliant
scrapy crawl codereliant -L ERROR -O output.json
From within the codereliant folder we run the scraping command, which results in an output.json file that contains all of our website URLs:
less output.json
[
{"url": "https://www.codereliant.io"},
{"url": "https://www.codereliant.io/sre-interview-prep-plan-week-6/"},
{"url": "https://www.codereliant.io/free-goodies/"},
{"url": "https://www.codereliant.io/about/"},
{"url": "https://www.codereliant.io/the-most-tragic-bug/"},
{"url": "https://www.codereliant.io/pod-doctor/"},
{"url": "https://www.codereliant.io/from-reactive-to-proactive-transforming-software-maintenance-with-auto-remediation/"},
{"url": "https://www.codereliant.io/topics/"},
{"url": "https://www.codereliant.io/the-2038-problem/"},
{"url": "https://www.codereliant.io/14-years-of-go/"},
......
We know that we will need more than just the URLs if we want to create the chatbot.
Data Extraction & Preprocessing
We have already collected the URL; now we need other interesting data: page title, description, and page body.
Let's add these fields to our CodereliantItem class in items.py:
class CodereliantItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    body = scrapy.Field()
Now that our item class has these fields defined, let's actually grab the data in the parse_item method defined in CodereliantSpider within spiders/codereliant.py:
def parse_item(self, response):
    item = CodereliantItem()
    item['url'] = response.url
    item['title'] = response.css('title::text').get()
    item['description'] = response.css(
        'meta[name="description"]::attr(content)').get()
    item['body'] = response.css('body').get()
    return item
💡 We used CSS selectors to collect the data we need. You can read more about Scrapy selectors here.
In order to preprocess or postprocess our data, we need to write pipelines. The first pipeline we will write converts our page body content from HTML to plain text.
First, let's add a field to CodereliantItem that will hold the plain body:
class CodereliantItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    body = scrapy.Field()
    plain_body = scrapy.Field()  # add this line
All of our pipelines live in pipelines.py. We will write our first simple pipeline, BodyTxtConverterPipeline:
from bs4 import BeautifulSoup

class BodyTxtConverterPipeline:
    # Use BeautifulSoup to strip HTML tags and keep only the text
    def process_item(self, item, spider):
        item['plain_body'] = BeautifulSoup(item['body'], 'html.parser').get_text()
        return item
Now that we have written our pipeline, we need to enable it in settings.py: uncomment the ITEM_PIPELINES variable and make it match:
ITEM_PIPELINES = {
    "codereliant.pipelines.BodyTxtConverterPipeline": 300,
}
To test that everything is working, we just run the scrapy command like above:
scrapy crawl codereliant -L ERROR -O output.json
By looking inside output.json, we can see that we have successfully collected the data we are interested in, plus the plain_body.
Semantic Chunking
Now for the more interesting part: we have collected all the data we need, so what's next?
💡 For this part, you will need an OpenAI API account with sufficient credits.
Let's add a field in CodereliantItem to hold the chunks:
class CodereliantItem(scrapy.Item):
    chunks = scrapy.Field()  # add this line
    ......
Our pipeline will leverage LangChain's abstractions, and it will also live in pipelines.py:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

class TextSplitterPipeline:
    def open_spider(self, spider):
        # Split text into semantically coherent chunks using OpenAI embeddings
        self.text_splitter = SemanticChunker(
            OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
        )

    def process_item(self, item, spider):
        docs = self.text_splitter.create_documents([item['plain_body']])
        item['chunks'] = []
        for doc in docs:
            text = doc.page_content.replace("\n", " ")
            item['chunks'].append(text)
        return item
Let's add this to settings.py like the previous pipeline:
ITEM_PIPELINES = {
    "codereliant.pipelines.BodyTxtConverterPipeline": 300,
    "codereliant.pipelines.TextSplitterPipeline": 400,
}
The number here signifies the order: pipelines run from the smallest number to the largest.
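If you want to see what SemanticChunker produces before running a full crawl, here is a quick standalone sketch you can try in a Python shell (the sample text is made up; it requires OPENAI_API_KEY to be set):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)
# A toy input mixing two unrelated topics, so a semantic break is likely
sample = (
    "Kubernetes restarts crashed pods automatically, which keeps services "
    "available without manual intervention. "
    "On a completely different note, our newsletter also covers SRE "
    "interview preparation plans."
)
for i, doc in enumerate(splitter.create_documents([sample])):
    print(i, doc.page_content)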
VectorDB Storage
Now we need to create embeddings from these chunks and store them in our database. Let's start a standalone Docker container of Qdrant in a new terminal:
docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
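Once the container is up, a quick way to check that Qdrant is listening (it returns a small JSON blob with its version):
curl http://localhost:6333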
Our pipeline this time is a bit involved, but we will explain the code below:
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient, models

class VectorDBPipeline:
    def open_spider(self, spider):
        # constants
        self.COLLECTION_NAME = "codereliant"
        self.EMBEDDING_MODEL = "text-embedding-3-small"
        self.EMBEDDING_MODEL_VECTOR_SIZE = 1536  # hardcoded, but can be obtained from the model

        # clients
        self.openai_client = OpenAI()
        self.client = QdrantClient(url="http://localhost:6333")

        # create collection in qdrant
        self.client.recreate_collection(
            collection_name=self.COLLECTION_NAME,
            vectors_config=models.VectorParams(
                size=self.EMBEDDING_MODEL_VECTOR_SIZE,
                distance=models.Distance.COSINE,
            ),
        )

    def process_item(self, item, spider):
        unique_ids = [str(uuid.uuid4()) for _ in item['chunks']]
        result = self.openai_client.embeddings.create(
            input=item["chunks"], model=self.EMBEDDING_MODEL
        )
        self.client.upload_points(
            collection_name=self.COLLECTION_NAME,
            points=[
                models.PointStruct(
                    id=unique_id,
                    vector=data.embedding,
                    payload={
                        "url": item['url'],
                        "title": item['title'],
                        "description": item['description'],
                        "chunk": chunk,
                    },
                )
                for unique_id, chunk, data in zip(unique_ids, item['chunks'], result.data)
            ],
        )
        return item
In summary, the open_spider method is responsible for initializing the necessary clients and creating the collection in the Qdrant database, while the process_item method handles the creation of embeddings for each chunk using the OpenAI API and uploads the embeddings and associated metadata to the Qdrant database. And here is the explanation in more detail:
open_spider(self, spider) method:
• This method is called when the spider is opened and is used for initialization purposes.
• It sets up some constants:
  • COLLECTION_NAME: the name of the collection in the Qdrant database.
  • EMBEDDING_MODEL: the name of the OpenAI embedding model to be used.
  • EMBEDDING_MODEL_VECTOR_SIZE: the vector size of the embedding model (hardcoded as 1536 in this case).
• It initializes two clients:
  • openai_client: an instance of the OpenAI client for interacting with the OpenAI API.
  • client: an instance of the QdrantClient for connecting to the Qdrant database.
• It recreates a collection in the Qdrant database using the recreate_collection method of the QdrantClient. The collection is created with the specified COLLECTION_NAME and a VectorParams configuration that includes the vector size (EMBEDDING_MODEL_VECTOR_SIZE) and the distance metric (models.Distance.COSINE).
process_item(self, item, spider) method:
• This method is called for each item yielded by the spider.
• It generates unique IDs for each chunk in the item['chunks'] list using uuid.uuid4().
• It creates embeddings for each chunk using the OpenAI API's embeddings.create method, specifying the chunks and the EMBEDDING_MODEL.
• It uploads the embeddings and associated metadata to the Qdrant database using the upload_points method of the QdrantClient.
• For each chunk, it creates a PointStruct object containing:
  • id: the unique ID generated for the chunk.
  • vector: the embedding vector obtained from the OpenAI API.
  • payload: a dictionary of metadata associated with the chunk, such as the URL, title, description, and the chunk itself.
Finally, we need to add the VectorDBPipeline pipeline to settings.py:
ITEM_PIPELINES = {
    "codereliant.pipelines.BodyTxtConverterPipeline": 300,
    "codereliant.pipelines.TextSplitterPipeline": 400,
    "codereliant.pipelines.VectorDBPipeline": 500,
}
Now, before we test it, we need to set our OpenAI API key:
export OPENAI_API_KEY='sk-...........'
scrapy crawl codereliant -L ERROR -O output.json
We can verify by looking at our Qdrant dashboard UI, which is located at http://localhost:6333/dashboard. From there we can see the collection info and some useful stats.
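If you prefer the terminal over the dashboard, a quick sanity check with the Python client works too (a sketch using the same collection name as above):
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
info = client.get_collection(collection_name="codereliant")
# points_count should roughly match the number of chunks we uploaded
print(info.status, info.points_count)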
Chatbot
Finally, we get to the actual chatbot. We will create a separate Python file outside the codereliant project, within our chatbot directory:
cd ../
touch chat.py
We just created the Python file that will contain our chat interface; the Gradio UI toolkit was already installed earlier.
Below is the code for chat.py:
import json

import gradio as gr
from openai import OpenAI
from qdrant_client import QdrantClient

# constants
COLLECTION_NAME = "codereliant"
EMBEDDING_MODEL = "text-embedding-3-small"

# clients
openai_client = OpenAI()
client = QdrantClient(url="http://localhost:6333")

def search_similar_text(text):
    # Embed the query and retrieve the 3 closest chunks from Qdrant
    hits = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=openai_client.embeddings.create(
            input=[text], model=EMBEDDING_MODEL
        )
        .data[0]
        .embedding,
        limit=3,
    )
    return hits

def generate_response(query, history):
    hits = search_similar_text(query)
    # create the context based on the hits
    context = ""
    for hit in hits:
        context += json.dumps(hit.payload) + "\n"
    prompt = generate_prompt(context, query)
    completion = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are the codereliant chatbot, skilled in answering questions about codereliant.io."},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content

def generate_prompt(context, query):
    # Ask the model to answer only from the retrieved context
    prompt = f"""
Context:
{context}
Query: {query}
Based on the provided context, answer the query. If the context does not contain enough information to answer the query, respond with "I'm sorry, but the provided context does not have sufficient information to answer your query." Also, answer like a customer service representative would.
Answer:
"""
    return prompt

demo = gr.ChatInterface(
    fn=generate_response,
    examples=[
        "what is this website about?",
        "what can I learn from codereliant?",
        "Do you have any SRE interview preparation posts?",
    ],
    title="CodeReliant Bot",
    multimodal=False,
)
demo.launch()
Let's go through each function and briefly explain what they do:
search_similar_text(text):
• This function takes a text parameter and searches for similar texts in the Qdrant database.
• It creates an embedding of the input text using the specified EMBEDDING_MODEL and the OpenAI API.
• It then searches the Qdrant collection (COLLECTION_NAME) for the top 3 (limit=3) most similar vectors to the input text's embedding.
• The function returns the search results (hits) containing the similar texts and their associated metadata.
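To poke at retrieval in isolation, you can call this function from a Python shell after running the setup code in chat.py; a small sketch (the query string is just an example):
# Print the score and source URL of each of the 3 nearest chunks
for hit in search_similar_text("how do I prepare for an SRE interview?"):
    print(round(hit.score, 3), hit.payload["url"])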
generate_response(query, history):
• This function generates a response to a given query based on the search results from the Qdrant database.
• It calls the search_similar_text(query) function to retrieve the most similar texts to the query.
• It then constructs a context string by concatenating the payload (metadata) of each hit retrieved from the database.
• It generates a prompt using the generate_prompt(context, query) function, which combines the context and the query.
• Finally, it uses the OpenAI API to create a chat completion using the gpt-4-turbo model, passing the generated prompt as the user's message.
• The function returns the generated response obtained from the chat completion.
generate_prompt(context, query):
• This function generates a prompt by combining the context and the query.
• The prompt includes the context, the query, and instructions on how to answer the query based on the provided context.
• If the context does not contain enough information to answer the query, it instructs the model to respond with a specific message indicating the lack of information, and to answer like a customer service representative.
• The function returns the generated prompt as a string.
The last part of the code sets up a Gradio ChatInterface for interacting with the chatbot:
• It creates an instance of gr.ChatInterface called demo.
• The fn parameter is set to the generate_response function, which will be called whenever a user sends a message.
• The examples parameter is set to a list of example queries to showcase the chatbot's capabilities.
• The title parameter sets the title of the chat interface to "CodeReliant Bot".
• The multimodal parameter is set to False, indicating that the chat interface will only handle text input and output.
• Finally, the launch() method is called on the demo instance to start the chat interface.
These functions work together to create a chatbot that can search for similar texts in our Qdrant database based on user queries, generate prompts combining the retrieved context and the query, and provide responses using the OpenAI API's chat completion feature. The chatbot is then made interactive using Gradio's ChatInterface.
To run it, make sure the OpenAI token is set, and just run:
python chat.py
# Running on local URL: http://127.0.0.1:7860
# To create a public link, set `share=True` in `launch()`.
Below is a screenshot of trying the examples with our chatbot and getting responses like the ones we see on other sites.
The GitHub link contains the full code for this tutorial!
Hey there, reader! 👋
You know what they say, sharing is caring! If you found this newsletter helpful please take a moment to share it with your friends, colleagues and social media followers. Thanks for being a part of this community! 🚀