Running scrapers on GitHub to simplify your workflow

How the LAT Data and Graphics team uses GitHub Actions to keep code and data in one place, and track scraper history for free

Sometimes collecting the data once isn't enough. You might be monitoring your state's monthly WARN notices, or, like many of us, tracking changes in daily COVID-19 cases. Getting that data may require a fancy scraper or just a simple curl command that downloads a CSV, but either way, the task has to be done on a regular basis. That's where the challenge is, because who wants to get on their computer and save a CSV to a folder on a Sunday?

At the Los Angeles Times Data and Graphics desk, we use GitHub Actions for almost all of our scrapers that run on a schedule. Some run a couple times a day while others may run once a week. Many of our scrapers feed and update applications like our Coronavirus Tracker, State Elections Money Tracker and Drought Tracker.

We started using GitHub Actions for our COVID-19 scrapers about two years ago. In hindsight, it was an ideal choice for a few reasons. For one, our code—from scrapers to cleaners to aggregators—already lived in GitHub. Using GitHub Actions meant that we didn't have to upload our code to a different service every time we made changes. It's also free to use if your repository is public (and private repos get a limited number of free minutes each month). And it keeps the history of your scrapes—Simon Willison calls it "git scraping"—allowing you to go back in your git history if you want to see how the data changed over time.

What are GitHub Actions?

Simply put, it's GitHub's own continuous integration platform. If you've never used something like that before, think of it as renting a blank computer with an operating system of your choice (Ubuntu, macOS or Windows). To use this blank computer, you'll need to write a few instructions in YAML defining which libraries need to be installed, which script to run, and how often this needs to happen. Below, we'll use a very simple example to show how you can get your own scraper running on GitHub Actions. There's also an excellent tutorial for beginners on using GitHub Actions for scrapers on Ben Welsh's site.

The code

One of our daily tasks at the Data and Graphics department is updating the wildfire evacuation zones shown on our Wildfire Map. Most of our evacuation zones come from a statewide evacuation map hosted on a California Department of Technology site, which updates frequently.

The scraper that keeps our map updated needs to do two things: download the GeoJSON, then filter to the zones we need. This requires two terminal commands (we use curl and mapshaper for these tasks), which can be run from a Makefile. It looks like this:

download:
	curl "veryLongUrl" -o raw/zones.geojson

filter:
	mapshaper raw/zones.geojson \
		-filter '"Evacuation Order, Evacuation Warning".indexOf(STATUS) > -1' \
		-o processed/1-filtered.geojson

run:
	make download
	make filter

The Makefile, along with two folders for the GeoJSON files, is kept in a GitHub repository. These commands need to be run every day—and even multiple times a day during fire season. A typical workflow would require a reporter who's on shift to clone the repository, run the make command in their local terminal and push the changes back up.

By using GitHub Actions, we’ve allowed the reporters to complete the whole workflow with a single click of a button. And no one has to clone the repo unless it’s for development purposes!

Creating a GitHub Actions workflow

GitHub has great documentation on how to get started with Actions, and we highly recommend taking a look if you are thinking of creating one. Here’s an example of the process you can follow to move an existing project into GitHub Actions:

First, move your scripts to GitHub if you are not keeping them there already. For our example above, the repo consists of a single Makefile and some folders. In that repository create a .github/workflows directory. You can also click on the “Actions” tab of your repository on GitHub, and commit a simple workflow. It will be a .yml file; ours is called update.yml with about 20 lines of code.

name: Update evacuation zones

on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * 0,1'

jobs:
  download:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: '16'
      - name: npm install
        run: npm install -g mapshaper
      - name: Run Makefile
        run: make run
      - uses: EndBug/add-and-commit@v8
        with:
          message: 'Updated evacuation zones'
          add: '*.geojson'

Name: This is the name of your workflow. If you have more than one workflow, this is what will be displayed on your Actions tab.

On: This is when the action will run. There are many events you can configure here to trigger the workflow. This example uses a schedule of our choosing indicated on the cron, running at midnight every Sunday and Monday UTC time (5 p.m. Saturday and Sunday PST). We also added an option called workflow_dispatch, which creates a button on your Actions tab of the repository page—allowing anyone with access to the repository to run the action manually. In our case, this means a reporter can click a button to run the scraper manually each weekday, and on weekends, it runs automatically and saves data to our repository.
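Taken together, a trigger section like the one described above looks something like this (a sketch; adjust the cron string to your own schedule):

```yaml
on:
  workflow_dispatch:        # adds a manual "Run workflow" button to the Actions tab
  schedule:
    - cron: '0 0 * * 0,1'   # midnight UTC on Sundays and Mondays
```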

Jobs: A workflow can have more than one job, hence “jobs.” Our workflow has just one, called download.

GitHub Actions is mostly free to use. If your repository is public, you will not be charged for GitHub Actions (it’s why we have made many of our scrapers public). For private repos, like this evacuation zone downloader, there are some limits, and you can be charged for usage depending on storage, how long it takes for your job to run, and the operating system your job is run on.

Runs-on: This is where you configure the operating system of your job. Even though most of us use Macs, by default we like to run our GitHub Actions on Ubuntu, because it's the most affordable. But you can also run this on macOS or Windows.

Steps: Groups all the steps that need to run to complete the job you are creating. From this point on, you can write your own commands using run. You can also use actions that others have created, with uses.

To set up the steps of your workflow, think through the steps you’d take to do this scrape locally. For our example, first you would have to clone this repository. Then you’d install Node and any dependencies—in our case, just Mapshaper. And then you’d then run the actual command that triggers the scraper and commits updates back to the repository.

To automate that process in a GitHub Action, you write these instructions out in the .yml file under steps.

Uses: Many commonly used steps in a workflow can be found in the GitHub Actions Marketplace. The first two steps in our example download job are commonly used actions—checking out the repo and installing Node. So we don’t have to write the code for these steps ourselves; instead we can use a “checkout” action, which checks out our repository to the blank computer that’s about to run our scraper, and a “setup-node” action, which installs Node with a version of our choice. (If your scraper uses Python, you could use a “setup-python” action instead.)
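For a Python scraper, those first steps might look something like the following sketch (the requirements.txt file and Python version here are assumptions, not part of our workflow):

```yaml
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
  with:
    python-version: '3.10'
- name: Install dependencies
  run: pip install -r requirements.txt   # assumes your repo has a requirements.txt
```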

Run: After we've installed Node, our scraper needs to install Mapshaper. You can do this by listing the command you would use in your own terminal after run: npm install -g mapshaper. Then finally, one more run command runs our Makefile, which triggers the downloading and filtering: make run, just as you would on your terminal.

Once new files are downloaded, you’ll have to save your scraped results to the repository. You could write out your own commands, like:

git add .
git commit -m "Updated evacuation zones"
git push origin main

But what if there was no update on the evacuation zones? This action would fail on push because there’s nothing to commit. To account for this, we can use a pre-made “add & commit” action instead, which allows the action to complete even if there’s nothing to commit.
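If you'd rather stick with plain git commands, one common workaround (a sketch, not what our workflow does) is to check whether anything is staged before committing:

```yaml
- name: Commit and push if changed
  run: |
    # assumes git user.name and user.email were configured in an earlier step
    git add .
    git diff --cached --quiet || (git commit -m "Updated evacuation zones" && git push origin main)
```

Here git diff --cached --quiet exits successfully when there's nothing staged, so the commit and push only run when the scrape actually changed something.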

Other things to consider

YAML syntax can be tricky, and your action will immediately fail if the indentations are off. We’ve noticed that editing inside the GitHub interface 😱 works well because errors are easier to catch.

You can also run your actions locally—for example with the open-source act tool—if you want to test them before pushing them to GitHub and possibly wasting your precious minutes.

Some other cool features

Let’s say your repository is public, and you have a step where you upload the files to a different location that needs authentication. You can use secrets! These are encrypted variables that you can use as arguments for actions.

Let’s say instead of committing to GitHub, we want to upload the files from our scraper to AWS. Using a ready-made action, and hiding our credentials in secrets, the code would look something like this:

- name: Upload files
  uses: shallwefootball/s3-upload-action@master
  with:
    aws_key_id: ${{ secrets.AWS_KEY }}
    aws_secret_access_key: ${{ secrets.AWS_SECRET_KEY }}
    aws_bucket: myS3Bucket
    destination_dir: 'zones'

You can also link workflows so that a successful completion of one workflow triggers another one. This next code snippet is from a workflow that builds to our test site. Notice how it’s set to run once another workflow called “Process” is completed.

on:
  workflow_run:
    workflows: ["Process"]
    branches: [main]
    types:
      - completed

Would you like to be alerted every time your scrape completes/fails and annoy your co-workers? You can add Slack integrations.

You can even save the output of one step and use it in another step. For example, the EndBug/add-and-commit@v8 action has an output called "committed" that tells you whether a new commit was created. You could feed this output into a Slack integration to customize your message.
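As a sketch, giving the commit step an id lets a later step read that output (the id and the final step here are made up for illustration):

```yaml
- uses: EndBug/add-and-commit@v8
  id: commit
  with:
    message: 'Updated evacuation zones'
    add: '*.geojson'
- name: Report result
  if: steps.commit.outputs.committed == 'true'
  run: echo "New data was committed"
```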


  • Iris Lee

    Iris Lee is an assistant data and graphics editor, focused on news applications at the Los Angeles Times. She is also an adjunct professor at the University of Southern California.
