From Web Scraping to Conversational AI
Have you noticed lately an increase in the number of websites with an AI chatbot capability? One that lets you ask questions about pricing, documentation, advanced topics, and more?
If you've noticed this trend, you're not alone. AI chatbots have become increasingly popular on websites across various industries, as they provide a convenient and efficient way for users to get the information they need quickly. These chatbots are powered by advanced technologies such as natural language processing (NLP), machine learning, and deep learning, which enable them to understand and respond to user queries in a human-like manner.
We will look into the process of creating your own AI chatbot for your website using Python and a combination of powerful tools and libraries. We'll cover web scraping using Scrapy to gather content from your site, creating embeddings from the scraped content using the OpenAI API, and utilizing Qdrant as a vector database to store and query the embeddings. Additionally, we'll explore how to use Langchain for semantic chunking and integrate the queried content with OpenAI's GPT-4 to provide accurate and context-aware responses to user queries.
Setup & Environment
We will add a twist and use uv as our blazing-fast Python package installer, so go ahead and install it if you don't have it:
# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows.
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Then, we will create a new directory and install all the needed libraries:
mkdir chatbot && cd chatbot
uv venv
uv pip install scrapy
uv pip install beautifulsoup4
uv pip install openai
uv pip install qdrant-client
uv pip install langchain
uv pip install langchain_experimental
uv pip install langchain_openai
uv pip install gradio
Website Scraping
Now let's create a new Scrapy project for our website; in our use case, we will use yours truly, codereliant.io:
scrapy startproject codereliant
In the file items.py, modify the CodereliantItem class like below:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class CodereliantItem(scrapy.Item):
url = scrapy.Field()
Let's do a simple test to see if our scraping is working as intended:
cd codereliant
scrapy crawl codereliant -L ERROR -O output.json
From within the codereliant folder we run the scraping command, which results in a file output.json that contains all of our website URLs:
less output.json
[
{"url": "https://www.codereliant.io"},
{"url": "https://www.codereliant.io/sre-interview-prep-plan-week-6/"},
{"url": "https://www.codereliant.io/free-goodies/"},
{"url": "https://www.codereliant.io/about/"},
{"url": "https://www.codereliant.io/the-most-tragic-bug/"},
{"url": "https://www.codereliant.io/pod-doctor/"},
{"url": "https://www.codereliant.io/from-reactive-to-proactive-transforming-software-maintenance-with-auto-remediation/"},
{"url": "https://www.codereliant.io/topics/"},
{"url": "https://www.codereliant.io/the-2038-problem/"},
{"url": "https://www.codereliant.io/14-years-of-go/"},
......
We know that we will need more than just the URLs if we want to create the chatbot.
Data Extraction & Preprocessing
We already collected the URL; now we need other interesting data: the page title, description, and page body.
Let's add these fields to our CodereliantItem class in items.py:
class CodereliantItem(scrapy.Item):
url = scrapy.Field()
title = scrapy.Field()
description = scrapy.Field()
body = scrapy.Field()
Now that our item class has these fields defined, let's actually grab the data in the parse_item method defined in CodereliantSpider within spiders/codereliant.py:
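One possible shape for it is a small BeautifulSoup-based helper (we installed beautifulsoup4 earlier) plus a thin parse_item wrapper. The extract_page_fields helper and the exact selectors here are illustrative assumptions, not the canonical implementation:

```python
# Sketch of parse_item plus an extraction helper; selectors are assumptions.
from bs4 import BeautifulSoup


def extract_page_fields(html: str, url: str) -> dict:
    """Extract url, title, meta description, and visible body text from a page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""
    # get_text flattens the page's visible text -- good enough for embeddings.
    body = soup.get_text(separator=" ", strip=True)
    return {"url": url, "title": title, "description": description, "body": body}


# Inside CodereliantSpider (CodereliantItem is imported from ..items there):
def parse_item(self, response):
    item = CodereliantItem()
    item.update(extract_page_fields(response.text, response.url))
    yield item
```

Keeping the extraction in a standalone helper means it can be tested against raw HTML without running a live crawl.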