Web Scraping and Data Mining for GMs, Part 1
A Brief Introduction
I'm planning to run an AD&D 2nd edition campaign, and I'd like to be able to programmatically access information about the monsters in my world. TSR released a huge number of monsters for 2nd edition in their Monstrous Compendium series. I own most of these books as PDFs and could theoretically extract all the monster information from them, but there is an easier way. These days, web scraping is an essential skill for programmers, and sometimes for Game Masters. If, like me, you're a GM with some basic programming skill, you can scrape a lot of useful data.
At one time, lomion.de was the place to go if you wanted to see AD&D 2nd edition monster information online. Unfortunately, that site no longer exists. Enter the Wayback Machine: an archive.org project that makes older websites accessible. They have an archive of lomion.de, and a very liberal scraping policy. This guide will show you how to grab that information and store it on your hard drive. Once it's there, we can use Python to parse the files and put them in a more useful form (like a SQLite database); that will be part 2. Let's get scraping.
Just a note about the legality of this: given that this is all copyrighted material of Wizards of the Coast, it was likely illegal for it to be online like this in the first place. Archive.org is on good legal standing, as this appears to be fair use: they are providing information about what the site was like when it was extant. My demonstrating how to scrape it is also informational, and is likely covered under fair use. You using this data in the privacy of your home is on murkier ground. But never use this data for any monetary gain; that is very illegal. I am not advocating copyright infringement, nor will I ever advocate copyright infringement. Many people at TSR worked very hard on the Monstrous Compendiums. You can purchase them on DriveThruRPG, and I encourage you to do so if you're going to be using this data.
What You Need
- A Python interpreter. I'm using Python 3.11, but this should work for most recent versions of Python.
- The following Python modules: Beautiful Soup 4, Requests, loguru (optional), lxml (optional)
- Jupyter Notebook (optional, but recommended)
- wget
You could use a text editor rather than Jupyter Notebook, but I much prefer the interactive nature of Jupyter for this type of work: I can test things before I commit to a strategy. wget is available for most platforms and might already be installed if you're using Linux; for everyone else, installation instructions are easy to find on the web. To avoid cluttering up my main Python environment, I used a Python virtual environment for this project, and I recommend you do the same. In most cases it's as easy as:
python3 -m venv [path/to/environment]
Once created, you have to activate it. I'm working on Windows, and my virtual environment path is .venv, so to activate it I type:
.venv\Scripts\activate.bat
(On Linux or macOS, the equivalent is source .venv/bin/activate.) Now I'm in a clean virtual environment and can install the modules I need.
pip install requests
pip install beautifulsoup4
pip install lxml
pip install loguru
pip install notebook
You can find more about virtual environments in Python here.
Why these modules
Requests is the most commonly used library for working with HTTP requests in Python; it's the de facto standard. Similarly, Beautiful Soup is the de facto standard for parsing HTML (though there are many others). I install lxml because that's the parsing engine I use with Beautiful Soup, but you don't have to use it if you prefer a different parser. Loguru is a powerful logging library for Python. I spent most of my development years using print to debug programs, but these days I use logging, and loguru is my preferred logging library. If you don't want to log, then don't install loguru.
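If you haven't used loguru before, here is a minimal sketch of the kind of logging the script below relies on (the log file name and messages are just examples of my own):
from loguru import logger

logger.add("example.log", rotation="10 MB")  # also write to a file, rotating at 10 MB
logger.debug("Starting download")
logger.success("Download complete")
logger.error("Something went wrong")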
[Image: loguru in action]
Why wget
wget is one of many ways to download web pages and web sites. I could use Requests and Beautiful Soup to do this, but wget already exists, and I see no reason to reinvent the wheel. Why download at all when I can parse the HTML right on the server? Because I make mistakes, and I'll have to try different strategies to properly parse the HTML on the site. That means I'd keep hitting the site over and over to get the information right. That's a waste of my bandwidth and a waste of archive.org's server resources. I'm going to have to download the information anyhow; I might as well get it all in one fell swoop and then parse it at my leisure later on.
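For comparison, downloading a single page with Requests alone would look something like this (the local file name _index.html is my own choice); wget adds link rewriting and page requisites on top of this for free:
import requests

url = "https://web.archive.org/web/20180818101608if_/http://lomion.de/cmm/_index.php"
response = requests.get(url)
response.raise_for_status()  # stop on HTTP errors
with open("_index.html", "w", encoding="utf-8") as f:
    f.write(response.text)  # keep a local copy to parse at leisure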
How to start
The Basic Steps
We want to:
- Get a list of the monsters
- Create the URLs to the pages we want to download
- Download the pages and store them on our local storage, including all relevant files like images, JavaScript, and CSS
I'll be the first to admit that I rarely use the subprocess module in Python. In fact, I can't remember the last time I did. But a little research led me to this site, which describes how to call wget from Python, which is what we'll need. Like I said, I don't like reinventing the wheel, so I'll just use their technique (and their nifty runcmd function). If you're unfamiliar, the subprocess module lets you work with operating system subprocesses. It can start programs, send them input via standard input, and receive their output, amongst other things. I'll be using it to call wget.
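If you've never used subprocess, a minimal sketch looks like this (wget --version is just a harmless test command; wget needs to be on your PATH):
import subprocess

# Run a command and capture whatever it prints
result = subprocess.run(["wget", "--version"], capture_output=True, text=True)
print(result.stdout.splitlines()[0])  # first line of wget's version banner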
If you're unfamiliar with wget, it can seem daunting; there are a lot of options. I used the wget dirty manual to figure out which options I wanted. In my case, those options are as follows (a full example command appears after the list):
- -nc - No clobber. If the transfer gets interrupted and I have to start over, this prevents wget from re-downloading and overwriting files that are already on disk.
- -k - Converts the links in the downloaded documents so that they work when stored locally.
- --user-agent="PythonGrabMonsterScript" - Not necessary, but this identifies the script to the server it's talking to. I just consider it polite.
- -p - Downloads page requisites: things like CSS and JavaScript files, along with images.
- --restrict-file-names=windows - Makes sure any downloaded filenames work in Windows.
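Put together, the command the script below builds for a single page looks like this (shown here with the archived index URL):
wget -nc -k --user-agent="PythonGrabMonsterScript" -p --restrict-file-names=windows https://web.archive.org/web/20180818101608if_/http://lomion.de/cmm/_index.php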
And now we have all the pieces to build the program. The program I've built is below. It is under the MIT license (except the runcmd function, which isn't mine). I am not responsible for what you do with this, but please don't use it to break the law in any way. Note that wget will recreate the original directory structure of the server to the best of its ability, so finding the files might be a pain. The main file you need to find after this is done is _index.php. On my machine, it ends up under web.archive.org\web\20180818101608if_\http%3A\lomion.de\cmm. Keep in mind that this will take a long time (it took about 12 hours on my machine) and uses about 200 MB of disk space.
import requests
import time
import sys
from loguru import logger
from bs4 import BeautifulSoup
import subprocess
def runcmd(cmd, verbose=False, *args, **kwargs):
    """Run a shell command from within Python. Credit to Scrapingbee.com
    (https://www.scrapingbee.com/blog/python-wget/)
    """
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        shell=True
    )
    std_out, std_err = process.communicate()
    if verbose:
        logger.debug(std_out.strip())
# set up logging
logger.add("DnDLog.txt", rotation="100 MB")
# grab the index
indexURL='https://web.archive.org/web/20180818101608if_/http://lomion.de/cmm/_index.php'
try:
    response = requests.get(indexURL)
    logger.success("Grabbed {}".format(indexURL))
except requests.RequestException:
    logger.error('Cannot grab url {}'.format(indexURL))
    sys.exit("Exiting on error")
if response.status_code != 200:
    logger.error('Received status code {}. Cannot continue'.format(response.status_code))
    sys.exit("Exiting on error")
# Now that we have the index, let's grab the monster list from the index using Beautiful Soup.
# This will help us build the URLs to download.
theHTML = response.text
soup = BeautifulSoup(theHTML, 'lxml')
monsterURLS = []
for link in soup.find_all('a', href=True):
    monsterURL = link['href']
    if '.php' in monsterURL and '_' not in monsterURL:
        monsterURLS.append(monsterURL)
numMonsters = len(monsterURLS)
logger.debug('Retrieved {} monster names'.format(numMonsters))
# Download the index, this time for storage purposes
wgetCommand='wget -nc -k --user-agent="PythonGrabMonsterScript" -p --restrict-file-names=windows'
indexCommand='{0} {1}'.format(wgetCommand, indexURL)
logger.debug(indexCommand)
logger.debug("Downloading index")
runcmd(indexCommand, verbose = True)
logger.debug("Index downloaded")
# and finally, let's grab all the monsters
baseURL='https://web.archive.org/web/20180818101608if_/http://lomion.de/cmm/'
logger.debug ('Starting Monsters----')
monsterURLS.sort()
monsterCount=0
for i in monsterURLS:
    monsterCount += 1
    time.sleep(1)  # let's be nice
    monsterURL = "{0}{1}".format(baseURL, i)
    monsterCommand = "{0} {1}".format(wgetCommand, monsterURL)
    logger.debug("Grabbing {0} from {1}: {2}/{3}".format(i, monsterURL, monsterCount, numMonsters))
    runcmd(monsterCommand, verbose=True)
    logger.debug("{0} done".format(i))
logger.debug("\n\nProgram finished")
Happy scraping!