From Web Scraping to Conversational AI

💡 A Step-by-Step Tutorial on how to make an AI chatbot for your website
💡 We will learn and use:
• Python uv: a blazing-fast pip replacement
• Scrapy: web scraping toolkit
• OpenAI: Embeddings API & LLM
• LangChain: semantic chunking
• Qdrant: vector DB to store embeddings & query them
Have you noticed lately an increase in the number of websites with an AI chatbot capability? One that lets you ask questions about pricing, documentation, advanced topics, and so on?
If you've noticed this trend, you're not alone. AI chatbots have become increasingly popular on websites across various industries, as they provide a convenient and efficient way for users to get the information they need quickly. These chatbots are powered by advanced technologies such as natural language processing (NLP), machine learning, and deep learning, which enable them to understand and respond to user queries in a human-like manner.
We will look into the process of creating your own AI chatbot for your website using Python and a combination of powerful tools and libraries. We'll cover web scraping using Scrapy to gather content from your site, creating embeddings from the scraped content using the OpenAI API, and utilizing Qdrant as a vector database to store and query the embeddings. Additionally, we'll explore how to use Langchain for semantic chunking and integrate the queried content with OpenAI's GPT-4 to provide accurate and context-aware responses to user queries.
Setup & Environment
We will add a twist and use uv as our blazing-fast Python package installer, so go ahead and install it if you don't have it:
# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows.
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Then, we will create a new directory and install all the needed libraries:
mkdir chatbot && cd chatbot
uv venv
uv pip install scrapy
uv pip install beautifulsoup4
uv pip install openai
uv pip install qdrant-client
uv pip install langchain
uv pip install langchain_experimental
uv pip install langchain_openai
uv pip install gradio
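One note: uv pip install installs into the .venv that uv venv created, so to have scrapy and python on your PATH for the commands that follow, activate the environment first (or prefix each command with uv run):
# On macOS and Linux.
source .venv/bin/activate
# On Windows.
.venv\Scripts\activate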
Website Scraping
Now let's create a new Scrapy project for our website. In our case, we will use yours truly, codereliant.io:
scrapy startproject codereliant
In the file items.py, modify the CodereliantItem class like below:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class CodereliantItem(scrapy.Item):
    url = scrapy.Field()
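Note that scrapy startproject does not generate the spider itself; the crawl command below assumes a spider named codereliant exists in spiders/codereliant.py. Here is a minimal sketch of what it could look like (assuming a CrawlSpider that follows every internal link; the exact rules in your project may differ):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import CodereliantItem

class CodereliantSpider(CrawlSpider):
    name = "codereliant"
    allowed_domains = ["codereliant.io"]
    start_urls = ["https://www.codereliant.io"]

    # Follow every internal link and hand each page to parse_item
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        # For now we only collect the URL; we'll add more fields shortly
        item = CodereliantItem()
        item["url"] = response.url
        return item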
Let's do a simple test to see if our scraping is working as intended:
cd codereliant
scrapy crawl codereliant -L ERROR -O output.json
From within the codereliant folder we run the scraping command, which results in an output.json file that contains all of our website URLs:
less output.json
[
{"url": "https://www.codereliant.io"},
{"url": "https://www.codereliant.io/sre-interview-prep-plan-week-6/"},
{"url": "https://www.codereliant.io/free-goodies/"},
{"url": "https://www.codereliant.io/about/"},
{"url": "https://www.codereliant.io/the-most-tragic-bug/"},
{"url": "https://www.codereliant.io/pod-doctor/"},
{"url": "https://www.codereliant.io/from-reactive-to-proactive-transforming-software-maintenance-with-auto-remediation/"},
{"url": "https://www.codereliant.io/topics/"},
{"url": "https://www.codereliant.io/the-2038-problem/"},
{"url": "https://www.codereliant.io/14-years-of-go/"},
......
We know that we will need more than just the URLs if we want to create the chatbot.
Data Extraction & Preprocessing
We have already collected the URL; now we need other interesting data: page title, description, and page body.
Let's add these fields to our CodereliantItem class in items.py:
class CodereliantItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    body = scrapy.Field()
Now that our item class has these fields defined, let's actually grab the data in the parse_item method defined in CodereliantSpider within spiders/codereliant.py:
def parse_item(self, response):
    item = CodereliantItem()
    item['url'] = response.url
    item['title'] = response.css('title::text').get()
    item['description'] = response.css(
        'meta[name="description"]::attr(content)').get()
    item['body'] = response.css('body').get()
    return item
💡 We used CSS selectors to collect the data we need. You can read more about Scrapy selectors here.
In order to preprocess or postprocess our data, we need to write pipelines. The first pipeline we will write converts our page body content from HTML to plain text.
First, let's add a field to CodereliantItem that will hold the plain body:
class CodereliantItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    body = scrapy.Field()
    plain_body = scrapy.Field()  # add this line
All of our pipelines live in pipelines.py. We will write our first simple pipeline, BodyTxtConverterPipeline:
from bs4 import BeautifulSoup

class BodyTxtConverterPipeline:
    # Use BeautifulSoup to strip HTML tags and keep only the text
    def process_item(self, item, spider):
        item['plain_body'] = BeautifulSoup(item['body'], 'html.parser').get_text()
        return item
Now that we have written our pipeline, we need to enable it in settings.py: uncomment the ITEM_PIPELINES variable and make it match:
ITEM_PIPELINES = {
    "codereliant.pipelines.BodyTxtConverterPipeline": 300,
}
To test that everything is working, we just run the scrapy command like above:
scrapy crawl codereliant -L ERROR -O output.json
By looking inside output.json, we can see that we have successfully collected the data we are interested in, plus the plain_body.
Semantic Chunking
Now for the more interesting part: we have collected all the data we need, so what's next?
💡 For this part, you will need an OpenAI API account with sufficient credits.
Let's add a field in CodereliantItem to hold the chunks:
class CodereliantItem(scrapy.Item):
    chunks = scrapy.Field()  # add this line
    ......
Our pipeline will leverage LangChain's abstractions, and it will also live in pipelines.py:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

class TextSplitterPipeline:
    def open_spider(self, spider):
        # Split text into semantically coherent chunks using OpenAI embeddings
        self.text_splitter = SemanticChunker(
            OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
        )

    def process_item(self, item, spider):
        docs = self.text_splitter.create_documents([item['plain_body']])
        item['chunks'] = []
        for doc in docs:
            text = doc.page_content.replace("\n", " ")
            item['chunks'].append(text)
        return item
Let's add this to settings.py like the previous pipeline:
ITEM_PIPELINES = {
    "codereliant.pipelines.BodyTxtConverterPipeline": 300,
    "codereliant.pipelines.TextSplitterPipeline": 400,
}
The number here signifies the order: pipelines run from the smallest number to the largest.
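If you want to see what SemanticChunker produces before running a full crawl, here is a quick standalone sketch you can try in a Python shell (the sample text is made up; it requires OPENAI_API_KEY to be set):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)
# A toy input mixing two unrelated topics, so a semantic break is likely
sample = (
    "Kubernetes restarts crashed pods automatically, which keeps services "
    "available without manual intervention. "
    "On a completely different note, our newsletter also covers SRE "
    "interview preparation plans."
)
for i, doc in enumerate(splitter.create_documents([sample])):
    print(i, doc.page_content)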
VectorDB Storage
Now we need to create embeddings from these chunks and store them in our database. Let's start a standalone Docker container of Qdrant in a new terminal:
docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
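Once the container is up, a quick way to check that Qdrant is listening (it returns a small JSON blob with its version):
curl http://localhost:6333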
Our pipeline this time is a bit involved, but we will explain the code below:
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient, models

class VectorDBPipeline:
    def open_spider(self, spider):
        # constants
        self.COLLECTION_NAME = "codereliant"
        self.EMBEDDING_MODEL = "text-embedding-3-small"
        self.EMBEDDING_MODEL_VECTOR_SIZE = 1536  # hardcoded, but can be obtained from the model

        # clients
        self.openai_client = OpenAI()
        self.client = QdrantClient(url="http://localhost:6333")

        # create collection in qdrant
        self.client.recreate_collection(
            collection_name=self.COLLECTION_NAME,
            vectors_config=models.VectorParams(
                size=self.EMBEDDING_MODEL_VECTOR_SIZE,
                distance=models.Distance.COSINE,
            ),
        )

    def process_item(self, item, spider):
        unique_ids = [str(uuid.uuid4()) for _ in item['chunks']]
        result = self.openai_client.embeddings.create(
            input=item["chunks"], model=self.EMBEDDING_MODEL
        )
        self.client.upload_points(
            collection_name=self.COLLECTION_NAME,
            points=[
                models.PointStruct(
                    id=unique_id,
                    vector=data.embedding,
                    payload={
                        "url": item['url'],
                        "title": item['title'],
                        "description": item['description'],
                        "chunk": chunk,
                    },
                )
                for unique_id, chunk, data in zip(unique_ids, item['chunks'], result.data)
            ],
        )
        return item
In summary, the open_spider method is responsible for initializing the necessary clients and creating the collection in the Qdrant database, while the process_item method handles the creation of embeddings for each chunk using the OpenAI API and uploads the embeddings and associated metadata to the Qdrant database. And here is the explanation in more detail:
open_spider(self, spider) method:
• This method is called when the spider is opened and is used for initialization purposes.
• It sets up some constants:
  • COLLECTION_NAME: the name of the collection in the Qdrant database.
  • EMBEDDING_MODEL: the name of the OpenAI embedding model to be used.
  • EMBEDDING_MODEL_VECTOR_SIZE: the vector size of the embedding model (hardcoded as 1536 in this case).
• It initializes two clients:
  • openai_client: an instance of the OpenAI client for interacting with the OpenAI API.
  • client: an instance of the QdrantClient for connecting to the Qdrant database.
• It recreates a collection in the Qdrant database using the recreate_collection method of the QdrantClient. The collection is created with the specified COLLECTION_NAME and a VectorParams configuration that includes the vector size (EMBEDDING_MODEL_VECTOR_SIZE) and the distance metric (models.Distance.COSINE).
process_item(self, item, spider) method:
• This method is called for each item yielded by the spider.
• It generates unique IDs for each chunk in the item['chunks'] list using uuid.uuid4().
• It creates embeddings for each chunk using the OpenAI API's embeddings.create method, specifying the chunks and the EMBEDDING_MODEL.
• It uploads the embeddings and associated metadata to the Qdrant database using the upload_points method of the QdrantClient.
• For each chunk, it creates a PointStruct object containing:
  • id: the unique ID generated for the chunk.
  • vector: the embedding vector obtained from the OpenAI API.
  • payload: a dictionary of metadata associated with the chunk, such as the URL, title, description, and the chunk itself.
Finally, we need to add the VectorDBPipeline pipeline to settings.py:
ITEM_PIPELINES = {
    "codereliant.pipelines.BodyTxtConverterPipeline": 300,
    "codereliant.pipelines.TextSplitterPipeline": 400,
    "codereliant.pipelines.VectorDBPipeline": 500,
}
Now, before we test it, we need to set our OpenAI API key:
export OPENAI_API_KEY='sk-...........'
scrapy crawl codereliant -L ERROR -O output.json
We can verify by looking at our Qdrant dashboard UI, which is located at http://localhost:6333/dashboard. From there we can see the collection info and some useful stats.
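If you prefer the terminal over the dashboard, a quick sanity check with the Python client works too (a sketch using the same collection name as above):
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
info = client.get_collection(collection_name="codereliant")
# points_count should roughly match the number of chunks we uploaded
print(info.status, info.points_count)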
Chatbot
Finally, we get to the actual chatbot. We will create a separate Python file outside the codereliant project, within our chatbot directory:
cd ../
touch chat.py
We just created the Python file that will contain our chat interface; the Gradio UI toolkit was already installed earlier.
Below is the code for chat.py:
import json

import gradio as gr
from openai import OpenAI
from qdrant_client import QdrantClient

# constants
COLLECTION_NAME = "codereliant"
EMBEDDING_MODEL = "text-embedding-3-small"

# clients
openai_client = OpenAI()
client = QdrantClient(url="http://localhost:6333")

def search_similar_text(text):
    # Embed the query and retrieve the 3 closest chunks from Qdrant
    hits = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=openai_client.embeddings.create(
            input=[text], model=EMBEDDING_MODEL
        )
        .data[0]
        .embedding,
        limit=3,
    )
    return hits

def generate_response(query, history):
    hits = search_similar_text(query)
    # create the context based on the hits
    context = ""
    for hit in hits:
        context += json.dumps(hit.payload) + "\n"
    prompt = generate_prompt(context, query)
    completion = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are the codereliant chatbot, skilled in answering questions about codereliant.io."},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content

def generate_prompt(context, query):
    # Ask the model to answer only from the retrieved context
    prompt = f"""
Context:
{context}
Query: {query}
Based on the provided context, answer the query. If the context does not contain enough information to answer the query, respond with "I'm sorry, but the provided context does not have sufficient information to answer your query." Also, answer like a customer service representative would.
Answer:
"""
    return prompt

demo = gr.ChatInterface(
    fn=generate_response,
    examples=[
        "what is this website about?",
        "what can I learn from codereliant?",
        "Do you have any SRE interview preparation posts?",
    ],
    title="CodeReliant Bot",
    multimodal=False,
)
demo.launch()
Let's go through each function and briefly explain what they do:
search_similar_text(text):
• This function takes a text parameter and searches for similar texts in the Qdrant database.
• It creates an embedding of the input text using the specified EMBEDDING_MODEL and the OpenAI API.
• It then searches the Qdrant collection (COLLECTION_NAME) for the top 3 (limit=3) most similar vectors to the input text's embedding.
• The function returns the search results (hits) containing the similar texts and their associated metadata.
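To poke at retrieval in isolation, you can call this function from a Python shell after running the setup code in chat.py; a small sketch (the query string is just an example):
# Print the score and source URL of each of the 3 nearest chunks
for hit in search_similar_text("how do I prepare for an SRE interview?"):
    print(round(hit.score, 3), hit.payload["url"])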
generate_response(query, history):
• This function generates a response to a given query based on the search results from the Qdrant database.
• It calls the search_similar_text(query) function to retrieve the most similar texts to the query.
• It then constructs a context string by concatenating the payload (metadata) of each hit retrieved from the database.
• It generates a prompt using the generate_prompt(context, query) function, which combines the context and the query.
• Finally, it uses the OpenAI API to create a chat completion using the gpt-4-turbo model, passing the generated prompt as the user's message.
• The function returns the generated response obtained from the chat completion.
generate_prompt(context, query):
• This function generates a prompt by combining the context and the query.
• The prompt includes the context, the query, and instructions on how to answer the query based on the provided context.
• If the context does not contain enough information to answer the query, it instructs the model to respond with a specific message indicating the lack of information, and to answer like a customer service representative.
• The function returns the generated prompt as a string.
The last part of the code sets up a Gradio ChatInterface for interacting with the chatbot:
• It creates an instance of gr.ChatInterface called demo.
• The fn parameter is set to the generate_response function, which will be called whenever a user sends a message.
• The examples parameter is set to a list of example queries to showcase the chatbot's capabilities.
• The title parameter sets the title of the chat interface to "CodeReliant Bot".
• The multimodal parameter is set to False, indicating that the chat interface will only handle text input and output.
• Finally, the launch() method is called on the demo instance to start the chat interface.
These functions work together to create a chatbot that can search for similar texts in our Qdrant database based on user queries, generate prompts combining the retrieved context and the query, and provide responses using the OpenAI API's chat completion feature. The chatbot is then made interactive using Gradio's ChatInterface.
To run it, make sure the OpenAI token is set, and just run:
python chat.py
# Running on local URL: http://127.0.0.1:7860
# To create a public link, set `share=True` in `launch()`.
Below is a screenshot of trying the examples with our chatbot and getting responses like the ones we see on other sites.
The GitHub link contains the full code for this tutorial!
Hey there, reader! 👋
You know what they say, sharing is caring! If you found this newsletter helpful please take a moment to share it with your friends, colleagues and social media followers. Thanks for being a part of this community! 🚀