The Poor Man's Google
2024-05-09
The code for this project is all on GitHub
It was a dark and stormy night... well, a dim and cloudy day, when I happened across someone talking about how search engines work. When I thought about it, I figured it couldn't be that hard to reproduce, and here we are today. And besides, no matter how bad the end result is, it can't possibly be worse than Bing.
Search engines are a lot simpler than you may think, consisting of only three major components: the crawler, the indexer, and the actual search page. The crawler is in charge of discovering new web pages, the indexer decides what each web page is about, and the search page does exactly what you think it does: it takes the query you give it and tries to figure out the most likely matches for it. For simplicity, I combined the first two, so the indexing happens right after a page is discovered.
All a web crawler has to do is visit a website, find every link on that page, and add them to a queue to be visited next. All it takes is a basic knowledge of HTML parsing, and you're off to the races. It seems like a very simple task, until it isn't. Before you even think of deleting your Google account, buying an RV and a shotgun, and living a life without any of them dang companies stealing your dadgurn data, there are a lot of edge cases to keep in mind. For example, what if a link you encounter leads to a page that no longer exists? What if the link has no specified destination at all, or if it leads to a website that is currently down? And what if that link isn't even a webpage? All of these cases have to be considered, and more. If you look at my crawler on GitHub, you will see enough try/except statements and input validations to last you a lifetime.
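Just to give a flavour of what that defensive code looks like, here's a rough sketch (fetch_page is a made-up name for this post, not a function from my actual crawler): it skips links with no destination, links that aren't web pages, and pages that are dead or unreachable.
from urllib.parse import urlsplit
from urllib.request import urlopen

def fetch_page(url):
    # Links with no destination aren't worth visiting
    if not url:
        return None
    # Only follow links that actually point at web pages
    if urlsplit(url).scheme not in ("http", "https"):
        return None
    try:
        response = urlopen(url, timeout=10)
        # Some links lead to images, PDFs, etc. rather than HTML pages
        if "text/html" not in response.headers.get("Content-Type", ""):
            return None
        return response.read().decode("utf-8", errors="replace")
    except (ValueError, OSError):
        # Malformed URL, dead page, or unreachable site: skip it and move on
        return None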
But that's not all!
It turns out that some web admins don't want their websites being hit by hundreds of requests a minute from some scraper that doesn't pay them their precious 0.01 cents in ad money. Robots just don't seem to have the same desire to get together with hot singles in their area that a human would. Or maybe, when humanity is about to be wiped out by robots, the admins don't want them to enjoy the many cat pictures of the internet. Either way, the humans of the year 1994 developed an ingenious method of robot repellent: robots.txt. The robots.txt file on any website tells the robots where they can and cannot play. Unfortunately for the cast of Terminator, it doesn't have to be followed, but if you want to remain un-blacklisted from certain websites, you probably should. So how do we handle that? Well, this may seem pretty anticlimactic, but all that really has to be done is to get someone else to do it for us.
from urllib import robotparser

# Set the object to read the robots.txt of a certain website
botparser = robotparser.RobotFileParser()
botparser.set_url("https://example.com/robots.txt")
botparser.read()

# Check if the program (identified here by the wildcard user agent "*") can read a certain url
if botparser.can_fetch("*", "https://example.com/kill-all-humans/"):
    # Get the webpage
    ...
Thankfully, we don't have to re-invent the wheel here. Some kind fellow has already saved us countless hours of staring at tracebacks and wondering why the code doesn't run. All we have to do is point this at the webpage we want to visit and we can check if our crawler is allowed to be snooping around in this neck of the woods.
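One small wrinkle: the real crawler can't hardcode example.com; it has to find the right robots.txt for whatever domain a link points at. Here's roughly how I'd wrap that up (is_allowed is a made-up helper name for this post, not something urllib provides):
from urllib import robotparser
from urllib.parse import urlsplit

def is_allowed(url, user_agent="*"):
    # Build the robots.txt address for whatever site this url lives on
    parts = urlsplit(url)
    botparser = robotparser.RobotFileParser()
    botparser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        botparser.read()
    except OSError:
        # If robots.txt can't be fetched at all, err on the side of politeness
        return False
    return botparser.can_fetch(user_agent, url)
In practice you'd also want to cache one parser per domain so the crawler isn't re-downloading robots.txt for every single link it finds.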
Now that all the pre-conditions are out of the way, let's actually parse the website. All we need to do at this point is grab the URL of every link on the webpage and run with it. Again, there is no need to re-invent the wheel here, because Python already has the amazing BeautifulSoup library. So let's put it to use. Here's a simple script to demonstrate.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Get the webpage
webpage = urlopen("https://example.com/")
# Parse the website
soup = BeautifulSoup(webpage.read().decode("utf-8"), 'html.parser')
for link in soup.find_all('a'):
    url = link.get("href")
    # Do something with the url
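Putting the pieces together, the crawler really is just a queue of URLs to visit. Here's a minimal sketch of the idea, assuming we start from a single seed URL (the robots.txt check and the pile of error handling from earlier are left out to keep it short):
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

queue = deque(["https://example.com/"])  # Seed the crawler with a starting page
visited = set()

while queue:
    page_url = queue.popleft()
    if page_url in visited:
        continue
    visited.add(page_url)
    # (robots.txt check and error handling would go here)
    webpage = urlopen(page_url)
    soup = BeautifulSoup(webpage.read().decode("utf-8"), 'html.parser')
    for link in soup.find_all('a'):
        href = link.get("href")
        if href:
            # Relative links like "/about" need to be resolved against the current page
            queue.append(urljoin(page_url, href))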
With a loop like that, every link on a page adds another website to be crawled, the bot watching its queue grow endlessly with each request. And there it is: you have a simple web crawler. But this on its own isn't very useful, unless all you want is a list of links. This is where the indexing process starts. Put simply, that means taking the website and finding a way to represent what it is about. The algorithm I used is very simple: it stores, for each unique word on a webpage, the fraction of the page's words that it accounts for. It's not perfect, but it works. Here's what the representation of example.com looks like:
{'example': 0.06666666666666667, 'domain': 0.13333333333333333, 'this': 0.06666666666666667, 'is': 0.03333333333333333, 'for': 0.06666666666666667, 'use': 0.06666666666666667, 'in': 0.1, 'illustrative': 0.03333333333333333, 'examples': 0.03333333333333333, 'documents': 0.03333333333333333, 'you': 0.03333333333333333, 'may': 0.03333333333333333, 'literature': 0.03333333333333333, 'without': 0.03333333333333333, 'prior': 0.03333333333333333, 'coordination': 0.03333333333333333, 'or': 0.03333333333333333, 'asking': 0.03333333333333333, 'permission': 0.03333333333333333, 'more': 0.03333333333333333, 'information': 0.03333333333333333}
It isn't too hard to recreate this. BeautifulSoup comes with a neat soup.get_text() function, which returns every piece of text in the document. Once we have that, we can split it on whitespace and get every word separately, like so:
# Get the text
page_text = soup.get_text()
# Turn it into a list of words
page_words = page_text.split()
for word in page_words:
    # The word can only contain lowercase letters
    real_word = ''.join(c for c in word.lower() if c in "abcdefghijklmnopqrstuvwxyz")
    # Do something with real_word
Now, once we have the words, we can easily store them in a dictionary based on how often they occur. Dividing each count by the total number of words is what gives us the fractions shown earlier.
# This goes inside the loop, in place of "Do something with real_word",
# with keywords = {} defined beforehand
if real_word in keywords.keys():
    keywords[real_word] += 1 / len(page_words)
else:
    keywords[real_word] = 1 / len(page_words)
And, finally, we can save the dictionary to a file. This part isn't that complicated, so I'm not going to talk about it much. The format you store the data in doesn't matter, as long as it can be written to and retrieved from a file on your file system.
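For what it's worth, Python's built-in json module is more than enough here. A quick sketch, with a made-up keywords_by_url variable standing in for whatever the crawler collected:
import json

# A toy index: each url maps to its keyword-fraction dictionary
keywords_by_url = {
    "https://example.com/": {"example": 0.0667, "domain": 0.1333},
}

# Write the index out...
with open("index.json", "w") as f:
    json.dump(keywords_by_url, f)

# ...and read it back in later, when it's time to search
with open("index.json") as f:
    keywords_by_url = json.load(f)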
Now that we've created the crawler/indexer, it's time to move on to the querying part of the engine. Again, there is much I won't cover about my own implementation, because the only thing that really matters is the search algorithm. The one I wrote is extremely simple: it takes every keyword in the query and sorts the websites by how much of each page's text is made up of those keywords.
# Assume the saved index has been loaded into the urls variable
# (urls[url][2] is that page's keyword-fraction dictionary)
# Get the query
query = input("Search the index: ")
# Get the keywords from the query
keywords = {}
for word in query.split(' '):
    real_word = ''.join(c for c in word if c.isalpha())
    if real_word in keywords.keys():
        keywords[real_word] += 1
    else:
        keywords[real_word] = 1
# Store relevant entries
results = {}
for url in urls.keys():
    relevance = 0
    for word in urls[url][2].keys():
        for keyword in keywords.keys():
            if word.lower() == keyword.lower():
                relevance += urls[url][2][word] * keywords[keyword]
    if relevance > 0:
        results[url] = relevance
# Sort the results by relevance, highest first
sorted_results = sorted(results, key=results.get, reverse=True)
# Print results to screen
for result in sorted_results:
    print(result)
And that is really all there is to it. If you want to see the full thing in action, you can check out the code on GitHub. There are still some things I want to add in the future, such as image search, that LLM integration all the cool search engines have these days, and rewriting the query engine in Rust so that it runs faster, but it is at a presentable state right now, so I will be working on other projects for the time being. Stay tuned!