Creating Wordlists from Scraped Web Data
2024-08-04
A long time ago, in a basement far, far away, some perfect storm of events left me fascinated with cracking hashes. I had just bought a decently powerful computer and was looking for some way to mess with cybersecurity tools, but did not yet understand many of the underlying principles. As chance would have it, I happened across a neat tool called pwnagotchi, a simple operating system that can be installed on a Raspberry Pi to passively capture Wi-Fi authentication packets, which can later be cracked on a more powerful machine. This led me down a massive rabbit hole where I learned about the nuances of how WPA authentication works and how to most effectively recover a Wi-Fi password.
More recently, I was reminded of this phase while thinking of various uses for JessesIndex, a basic search engine I created about two months ago. Because of the way JessesIndex searches for relevant content, its master database is essentially a treasure trove of English words. If you don't remember how it works (or simply didn't read about it in the first place), every page the JessesIndex bot crawls is indexed by the percentage of the text on that page made up by each word. For example, if a page's text were "I had had a chicken", the bot would remember that 40% of the page was the word 'had', and that every other word made up 20%.
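To make that concrete, here is a tiny sketch (not the actual indexer code, just a toy demonstration of the idea) that computes that representation for the example sentence:
# A toy demonstration of the indexing idea: for each unique word,
# store the fraction of the page's words that it accounts for.
text = "I had had a chicken"
words = text.lower().split()
frequencies = {}
for word in words:
    frequencies[word] = frequencies.get(word, 0) + 1 / len(words)
print(frequencies)
# {'i': 0.2, 'had': 0.4, 'a': 0.2, 'chicken': 0.2}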
This makes it very easy to look through the database and store every unique word, so I wrote a simple script to do just that.
# get-wordlist.py
from json import load

# Load the sites dictionary from JSON
with open("sites.json", "r") as f:
    sites = load(f)

# Get every word from the sites dictionary
words = []
for url in sites.keys():  # Every page crawled by the bot
    for word in sites[url][2].keys():  # Every word in that page
        if word not in words:
            words.append(word)
words.sort()

# Create wordlist file
with open("jessesenglish.txt", "w") as f:
    f.write("\n".join(words))
print("Done")
This script takes the sites.json file that the JessesIndex scraper/indexer generates and loops through every website that has been scraped, storing every word that is not already in the list. Put simply, it adds every unique word in sites.json to a wordlist.
Now we have a long list of words extracted from JessesIndex, but there's a problem. Due to an unfortunate bug in the BeautifulSoup4 library that JessesIndex uses (and a stupid mistake I made writing the indexer that will be fixed in the next commit), a lot of words blend together without spaces. This leaves a ton of extra entries in the wordlist that aren't real words, and they only serve to waste space and time when it comes time to use the wordlist to crack passwords. Unfortunately, it is very hard, if not impossible, to automatically remove all of these compound words, but there are some tells that indicate a word is likely not an actual English word, so I wrote another script to get rid of words that are clearly not real English words.
# prune-wordlist.py
# Read wordlist file
with open("jessesenglish.txt", "r") as f:
    words = f.read().split()

# Remove words according to certain rules
words_2 = []
for word in words:
    # Only allow certain lengths of word (5 to 14 characters)
    if not 4 < len(word) < 15:
        continue
    # Do not allow words with 3 or more repeating characters
    triple_chars = False
    for char in "abcdefghijklmnopqrstuvwxyz":
        if char * 3 in word:
            triple_chars = True
            break
    if triple_chars:
        continue
    words_2.append(word)

# Remove words that start with previous words in the wordlist
# (we assume that the wordlist is in alphabetical order from the last script)
words_3 = []
for word in words_2:
    starts_with_other_word = False
    for prefix in words_3:
        if word.startswith(prefix):
            starts_with_other_word = True
            break
    if not starts_with_other_word:
        words_3.append(word)

print(f"{len(words)} -> {len(words_3)} : {100 - (len(words_3) / len(words) * 100)}% Improvement!")

# Write wordlist
with open("jessesenglish.txt", "w") as f:
    f.write("\n".join(words_3))
The final thing we could do is run the wordlist through a translation API and only keep the words that are autodetected as English, but sadly I am not rich, so I don't have access to an API that can do this.
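If I did have access to one, the filtering step itself would be trivial. Here's a rough sketch, where detect_language() is a hypothetical stand-in for whatever API client would do the actual detection:
# Hypothetical sketch: detect_language() stands in for an API client I
# don't actually have; it would return a language code like "en".
def detect_language(word):
    raise NotImplementedError("plug a real translation/detection API in here")

with open("jessesenglish.txt", "r") as f:
    words = f.read().split()

# Keep only the words detected as English
english_words = [word for word in words if detect_language(word) == "en"]

with open("jessesenglish.txt", "w") as f:
    f.write("\n".join(english_words))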
These scripts leave us with a nice, (mostly) lean wordlist that can be used for various password cracking attacks. You could feed it into a password cracker like hashcat directly, but the main reason you would want a list of English words like this is to perform a combination attack.
One of the more recent suggestions that security folk will give you for a secure password is to pick a certain number of separate words (normally four or more) and use them as the base of your password, changing some letters around and mixing upper- and lowercase letters to make your passwords $uP3r 1337!! We can now use this wordlist with a tool such as combinatorX and a hashcat ruleset such as one rule to rule them all to go after passwords built that way.
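To illustrate what a combination attack actually does (dedicated tools do this far faster; this is just the idea in Python), you could glue every pair of words together to form candidate passphrases:
# Rough sketch of a combination attack: join every pair of words from
# the wordlist into a candidate password. Real tools like combinatorX
# do this much more efficiently, and a ruleset would then mangle each
# candidate (capitalization, leetspeak, appended digits, and so on).
from itertools import product

with open("jessesenglish.txt", "r") as f:
    words = f.read().split()

# candidates.txt is just an example output file name
with open("candidates.txt", "w") as f:
    for first, second in product(words, repeat=2):
        f.write(first + second + "\n")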
Sadly, this wordlist is not very efficient at all (you are probably better off just using something like the Oxford 5000), but I will likely write a new post about this journey in the future. This project was more of an experiment for me to come up with a strategy for creating my own wordlist, and it gave me an excuse to use hashcat again for the first time in a while.
Anyway, I have a few posts with more practical uses in the works, so please, stay tuned.
The Poor Man's Google
2024-05-09
The code for this project is all on GitHub
It was a dark and stormy night (well, a dim and cloudy day) when I happened across a person talking about how search engines work. When I thought about it, I figured it couldn't be that hard to reproduce, and here we are today. And besides, no matter how bad the end result is, it can't possibly be worse than Bing.
Search engines are a lot simpler than you may think, consisting of only three major components: the crawler, the indexer, and the actual search page. The crawler is in charge of discovering new web pages, the indexer decides what those pages are about, and the search page does exactly what you think it does: it takes the query you give it and tries to figure out the most likely matches for it. For simplicity, I combined the first two, so the indexing happens right after a page is discovered.
All a web crawler has to do is visit a website, find every link on that page, and add them to a queue to be visited next. All it takes is a basic knowledge of HTML parsing, and you're off to the races. It seems like a very simple task, until it isn't. Before you even think of deleting your Google account, buying an RV and a shotgun, and living a life without any of them dang companies stealing your dadgurn data, there are a lot of variables to keep in mind. For example, what if a link that you encounter leads to a page that no longer exists? What if the link has no specified destination at all, or if it leads to a website that is currently down? And what if that link isn't even a webpage? All of these cases have to be considered, and more. If you look at my crawler on GitHub, you will see enough try/except statements and input validations to last you a lifetime.
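To give you an idea of the shape of that loop, here is a heavily stripped-down sketch (not the actual crawler, which has far more validation than this):
# Minimal sketch of the crawl loop: visit a page, collect its links,
# and queue anything new. Error handling is heavily abbreviated here.
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

queue = ["https://example.com/"]
visited = set()

while queue:
    url = queue.pop(0)
    if url in visited:
        continue
    visited.add(url)
    try:
        webpage = urlopen(url, timeout=10)
        soup = BeautifulSoup(webpage.read().decode("utf-8", errors="ignore"), "html.parser")
    except Exception:
        continue  # dead link, non-HTML content, site is down, etc.
    for link in soup.find_all("a"):
        href = link.get("href")
        if href:  # some links have no destination at all
            queue.append(urljoin(url, href))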
But that's not all!
It turns out that some web admins don't want their websites being hit by hundreds of requests a minute from some scraper that doesn't pay them their precious 0.01 cents in ad money. Robots just don't seem to have the same desire to get together with hot singles in their area that a human would. Or maybe, when humanity is about to be wiped out by robots, they don't want them to enjoy the many cat pictures of the internet. Either way, the humans of the year 1994 developed an ingenious method of robot repellent: robots.txt. The robots.txt file on any website tells the robots where they can and cannot play. Unfortunately for the cast of Terminator, it doesn't need to be followed, but if you want to remain un-blacklisted from certain websites, you probably should follow it. So how do we handle that? Well, this may seem pretty anticlimactic, but all we really have to do is get someone else to do it for us.
from urllib import robotparser

# Set the object to read the robots.txt of a certain website
botparser = robotparser.RobotFileParser()
botparser.set_url("https://example.com/robots.txt")
botparser.read()

# Check if the program (identifying as any user agent, "*") can read a certain url
if botparser.can_fetch("*", "https://example.com/kill-all-humans/"):
    # Get the webpage
    ...
Thankfully, we don't have to re-invent the wheel here. Some kind fellow has already saved us countless hours of staring at tracebacks and wondering why the code doesn't run. All we have to do is point this at the webpage we want to visit and we can check if our crawler is allowed to be snooping around in this neck of the woods.
Now that all the preconditions are out of the way, let's actually parse the website. All we need to do at this point is grab the URL of every link on the webpage and run with it. Again, there is no need to re-invent the wheel here, because Python already has the amazing BeautifulSoup library written for it. So let's put it to use. Here's a simple script to demonstrate.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Get the webpage
webpage = urlopen("https://example.com/")

# Parse the website
soup = BeautifulSoup(webpage.read().decode("utf-8"), "html.parser")
for link in soup.find_all("a"):
    url = link.get("href")
    # Do something with the url
It isn't hard at this point to imagine a loop where every link on a page adds another website to be crawled, the bot watching its task list grow endlessly with each request. And there it is: you have a simple web crawler. But this on its own isn't very useful, unless all you want is a list of links. This is where the indexing process starts. Put simply, that means we take the website and find a way to represent what it is about. The algorithm I used is very simple: it stores the percentage of the page's words that each unique word accounts for. It's not perfect, but it works. Here's what a representation of example.com would look like:
{'example': 0.06666666666666667, 'domain': 0.13333333333333333, 'this': 0.06666666666666667, 'is': 0.03333333333333333, 'for': 0.06666666666666667, 'use': 0.06666666666666667, 'in': 0.1, 'illustrative': 0.03333333333333333, 'examples': 0.03333333333333333, 'documents': 0.03333333333333333, 'you': 0.03333333333333333, 'may': 0.03333333333333333, 'literature': 0.03333333333333333, 'without': 0.03333333333333333, 'prior': 0.03333333333333333, 'coordination': 0.03333333333333333, 'or': 0.03333333333333333, 'asking': 0.03333333333333333, 'permission': 0.03333333333333333, 'more': 0.03333333333333333, 'information': 0.03333333333333333}
It isn't too hard to recreate this. BeautifulSoup comes with a neat soup.get_text() function that returns every piece of text in the document. Once we have that, we can split it on whitespace and get every word separately, like so:
# Get the text
page_text = soup.get_text()
# Turn it into a list of words
page_words = page_text.split()
for word in page_words:
    real_word = ''.join(c for c in word.lower() if c in "abcdefghijklmnopqrstuvwxyz")  # The word can only contain lowercase letters
    # Do something with real_word
Now once we have the words, we can easily store them in a dictionary based on how often they occur.
# Inside the loop above (with keywords = {} defined beforehand):
if real_word in keywords.keys():
    keywords[real_word] += 1 / len(page_words)
else:
    keywords[real_word] = 1 / len(page_words)
And, finally, we can save the dictionary to a file. This part isn't that complicated, so I'm not going to talk about it much. The format you store the data in doesn't matter, as long as it can be written to and retrieved from a file on your file system.
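As the wordlist post above already gave away, mine ends up in a sites.json file. A minimal sketch of that step (assuming the crawl loop has been filling a sites dictionary keyed by URL) might look like this:
# Minimal sketch: dump the crawled data to JSON so other scripts can
# load it later. `sites` is assumed to map each URL to its stored data,
# including the keyword percentages computed above.
from json import dump

sites = {}  # placeholder; in practice this is filled in during crawling

with open("sites.json", "w") as f:
    dump(sites, f)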
Now that we've created the crawler/indexer, it is time to move on to the querying part of the engine. Again, there is much I will not cover about my own implementation, because the only thing that really matters is the searching algorithm. The search algorithm that I wrote is extremely simple: it gets every keyword from the query and sorts the websites by how much of each page's text is made up of those keywords.
# Assume we've loaded the urls dictionary into the urls variable
# Get the query
query = input("Search the index: ")

# Get the keywords from the query
keywords = {}
for word in query.split(' '):
    real_word = ''.join(c for c in word if c.isalpha())
    if real_word in keywords.keys():
        keywords[real_word] += 1
    else:
        keywords[real_word] = 1

# Store relevant entries
results = {}
for url in urls.keys():
    relevance = 0
    for word in urls[url][2].keys():
        for keyword in keywords.keys():
            if word.lower() == keyword.lower():
                relevance += urls[url][2][word] * keywords[keyword]
    if relevance > 0:
        results[url] = relevance

# Sort the results by relevance
sorted_results = {k: v for k, v in sorted(results.items(), key=lambda item: item[1])}

# Print results to screen (most relevant first)
for result in list(reversed(sorted_results)):
    print(result)
And that is really all there is to it. If you want to see the full thing in action, you can check out the code on GitHub. There are still some things that I wish to add in the future, such as image search, that LLM integration all the cool search engines have these days, and rewriting the query engine in Rust so that it runs faster, but it is in a presentable state right now, so I will be working on other projects for now. Stay tuned!
Is This Thing On?
2024-04-09
I have been spending some time trying to jerry-rig a basic content management system with Python and Bash. The resulting code is not very nice to look at, but since I only need something that works, it's good enough (at least, it's good enough for now). I will revel in the fact that it works now, and spend time making it pretty later -- at least, that's what I tell myself.
So this blog is now fully working, and I have no real excuse to not use it. But I will probably make one up anyway.