5 Ducks Blog

Creating Wordlists from Scraped Web Data

2024-08-04

A long time ago, in a basement far, far away, a perfect storm of events left me fascinated with cracking hashes. I had just bought a decently powerful computer and was looking for some way to mess with cybersecurity tools, but did not yet understand many of the underlying principles. As chance would have it, I happened across a neat tool called pwnagotchi, a simple operating system that can be installed on a Raspberry Pi to passively capture Wi-Fi authentication packets, which can later be cracked on a powerful machine. This led me down a massive rabbit hole where I learned about the nuances of how WPA authentication works and how to most effectively recover a Wi-Fi password.

More recently, I was reminded of this phase while thinking of various uses for JessesIndex, a basic search engine I created about 2 months ago. Because of the way JessesIndex searches for relevant content, its master database is essentially a treasure trove of English words. If you don't remember how it works (or simply didn't read it in the first place), every page the JessesIndex bot crawls is ranked by what percentage of its text each word makes up. For example, if a page's text were "I had had a chicken", the bot would remember that the word 'had' made up 40% of the page and that every other word made up 20%.
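
To make that concrete, here is roughly the shape of one entry in the sites.json database, based only on how the script below reads it. The URL and percentages are made-up placeholders, and everything other than index 2 (the word-to-percentage mapping, which is the only part the script touches) is an assumption on my part.

# Assumed shape of one entry in sites.json (placeholder values)
sites = {
    "https://example.com/chicken": [
        ...,  # whatever else the indexer stores (not used by the script below)
        ...,
        {"i": 20.0, "had": 40.0, "a": 20.0, "chicken": 20.0},  # index 2: word -> % of page text
    ],
}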

This makes it very easy to look through the database and store every unique word, so I wrote a simple script to do just that.

# get-wordlist.py

from json import load

# Load the sites dictionary from JSON
with open("sites.json", "r") as f:
    sites = load(f)

# Get every word from the sites dictionary
words = []
for url in sites.keys(): # Every page crawled by the bot
    for word in sites[url][2].keys(): # Every word in that page
        if word not in words:
            words.append(word)

words.sort()

# Create wordlist file
with open("jessesenglish.txt", "w") as f:
    f.write("\n".join(words))

print("Done")

This script takes the sites.json file that the JessesIndex scraper/indexer generates and loops through every website that has been scraped, storing each word that is not already in the list. Put simply, it adds every unique word in sites.json to a single wordlist.

Now we have a long list of words extracted from JessesIndex, but there's a problem. Due to an unfortunate bug in the BeautifulSoup4 library that JessesIndex uses (and a stupid mistake I made writing the indexer that will be fixed in the next commit), there are a lot of entries where words blend together without spaces. This leaves a ton of junk in the wordlist that isn't made of real words and only wastes space and time when it comes time to use the wordlist to crack passwords. Unfortunately, it is very hard, if not impossible, to automatically remove all of these blended-together words, but there are some tells that indicate an entry is probably not an actual English word, so I wrote another script to prune the obvious offenders.

# prune-wordlist.py

# Read wordlist file
with open("jessesenglish.txt", "r") as f:
    words = f.read().split()

# Remove words according to certain rules
words_2 = []
for word in words:
    
    # Only keep words between 5 and 14 characters long
    if not 4 < len(word) < 15:
        continue
    
    # Do not allow words where the same letter appears 3 or more times in a row
    triple_chars = False
    for char in "abcdefghijklmnopqrstuvwxyz":
        if char * 3 in word:
            triple_chars = True
            break
    if triple_chars:
        continue
    
    words_2.append(word)

# Remove words that start with previous words in the wordlist
# (this relies on the wordlist being in alphabetical order from the last script,
# so shorter prefixes are always processed before the longer words that start with them)
words_3 = []
for word in words_2:
    starts_with_other_word = False
    for prefix in words_3:
        if word.startswith(prefix):
            starts_with_other_word = True
            break
    if not starts_with_other_word:
        words_3.append(word)

print(f"{len(words)} -> {len(words_3)} : {100 - len(words_3) / len(words) * 100:.1f}% Improvement!")

# Write wordlist
with open("jessesenglish.txt", "w") as f:
    f.write("\n".join(words_3))

The final thing we could do is run the wordlist through a translation API and only keep the words that are autodetected as English, but sadly I am not rich, so I don't have access to an API that can do this.

These scripts now leave us with a nice, (mostly) lean wordlist that can be used for various password-cracking attacks. You could feed it into a password cracker like hashcat directly, but the main reason you would want a list of English words like this is to perform a combination attack.

One of the more recent suggestions that security folk will give you for a secure password is to pick a certain number of separate words (normally 4 or more) and use them as the base of your password, changing some letters around and mixing upper- and lowercase letters to make your password $uP3r 1337!! We can now use this wordlist with a tool such as combinatorX and a hashcat ruleset such as one rule to rule them all.
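
To illustrate what a combination attack actually does (this is just a sketch, not a replacement for combinatorX or hashcat's combinator attack mode, which are far faster), here is a minimal Python version that glues pairs of words from the wordlist together with a few separators. The output filename and separator list are just assumptions for the example.

# combine-sketch.py

from itertools import product

# Read the pruned wordlist
with open("jessesenglish.txt", "r") as f:
    words = f.read().split()

separators = ["", "-", "_"]  # assumed separators for the example

# Write every two-word combination to a candidate file
with open("candidates.txt", "w") as f:
    for first, second in product(words, repeat=2):
        for sep in separators:
            f.write(f"{first}{sep}{second}\n")

Even with just two words, the number of candidates grows with the square of the wordlist size, which is exactly why the pruning above matters and why dedicated tools and rulesets exist for this job.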

Sadly, this wordlist is not very efficient at all (you are probably better off just using something like the Oxford 5000), but I will likely create a new post about this journey in the future. This project was more of an experiment for me to come up with a strategy for creating my own wordlist, and it gave me an excuse to use hashcat again for the first time in a while.

Anyway, I have a few posts with more practical uses in the works, so please, stay tuned.