How We Made “Sending Even More Immigrants to Prison”
On scraping, sketching, and understanding The Dip: in-depth data analysis at the Marshall Project
Family separations along the southern border are just the latest enforcement policy from an administration and Justice Department that have prioritized immigration deterrence and criminalization. The foundation for these enforcement actions was laid earlier this year when Attorney General Jeff Sessions issued a memorandum setting a zero-tolerance policy prioritizing certain criminal immigration offenses. In another memo, he announced the addition of 35 federal prosecutors to the federal districts along the border who would focus on these prosecutions.
At the time of these announcements I had been cleaning and working with federal sentencing data from the United States Sentencing Commission. The commission publishes an annual roundup of statistics of sentencing in federal courts that is rich in data, breaking down prison sentences by federal district, offense, adherence to the guidelines, and much more.
At the Marshall Project, we published a story that showed that the Department of Justice’s latest focus on immigration offenses along the border was not new. Using data going back to 2001, we showed that immigration offenses had always made up a substantial portion of federal prison sentences in districts along the southern border, specifically in the Southern District of California, Southern District of Texas, Western District of Texas, District of New Mexico, and District of Arizona.
Here’s how we did it.
Getting the Data
I started looking into the sentencing data last summer, focusing on a collection of PDFs that broke down prison sentences by offense and length for all 94 federal districts. After finding out that data was uploaded going back to 2001, I thought of Manuel Villa, our summer intern at the time (and now data fellow). Manuel had many goals for the summer—one of them was to learn how to scrape.
While working on other projects, Manuel learned Python and wrote a script in a Jupyter notebook that scraped the PDFs for each year. Several weeks later, Manuel shared his work with me. We opened his exported CSV in Excel and started creating pivot tables for everything. But soon after making several time-based charts we saw The Dip, the telltale sign that something was wrong with our data.
For the years 2004 and 2005 we were getting roughly half as many prison sentences as in the other years. I continued taking notes but kept in mind that we were missing something. When the summer came to an end, Manuel left to join the International Consortium of Investigative Journalists as a fellow for the rest of the year. After his departure I spent several days exploring the data in Excel, but other assignments took over and I had to leave this data behind.
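A gap like The Dip can also be caught with a quick check before any charting. The following is a minimal sketch in Python, not the check we actually ran: it flags any year whose total falls well below the median of all years. The totals are made-up illustrative numbers, not the commission's figures, and the function name `flag_dips` and the 0.7 threshold are assumptions chosen for the example.

```python
from statistics import median


def flag_dips(totals_by_year, threshold=0.7):
    """Return the years whose total is below threshold * median of all totals."""
    typical = median(totals_by_year.values())
    return sorted(
        year for year, total in totals_by_year.items()
        if total < threshold * typical
    )


# Illustrative, made-up totals (NOT real sentencing figures).
example_totals = {
    2002: 1000, 2003: 1040, 2004: 520,
    2005: 510, 2006: 1010, 2007: 990,
}

print(flag_dips(example_totals))
```

Comparing against the median rather than the mean keeps the two low years from dragging down the baseline they are being compared against.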
Cleaning and Analysis
At the start of the year I blocked out time to figure out what was going on with The Dip. After some initial interviews and reading, I discovered that the scraping script had missed half the PDFs published in 2004 and 2005. In those years, the sentencing commission published two sets of data. Two cases, Blakely v. Washington in 2004 and United States v. Booker in 2005, changed the way the commission tracked prison sentences, since the federal sentencing guidelines went from being mandatory to advisory for judges.
This proved that I had real analysis and data cleaning ahead of me. I moved to R to make my work reproducible and easier to edit. With Google by my side, I combined the duplicate rows I got for 2004 and 2005, summing offenses to get totals for each full year. I also standardized federal district names, created a column that I could use to compare each sentencing category, and merged data tables so I could keep track of the state for each district. The Dip was gone, and I had a clean set of data I could start interrogating.
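The cleaning itself was done in R, but the core idea of combining the split-year rows fits in a few lines. Here is a rough sketch in Python (the language of the scraper) under stated assumptions: the field names `year`, `district`, `offense`, and `sentences` are hypothetical, the rows are made up, and uppercasing stands in for whatever name standardization the real data needed.

```python
from collections import defaultdict


def clean_district(name):
    """Collapse stray whitespace and uppercase so name variants match."""
    return " ".join(name.split()).upper()


def combine_rows(rows):
    """Sum sentence counts for rows that share (year, district, offense)."""
    totals = defaultdict(int)
    for row in rows:
        key = (row["year"], clean_district(row["district"]), row["offense"])
        totals[key] += row["sentences"]
    return [
        {"year": y, "district": d, "offense": o, "sentences": n}
        for (y, d, o), n in sorted(totals.items())
    ]


# Made-up rows: 2004 appears twice, once per published data set.
example_rows = [
    {"year": 2004, "district": "Arizona", "offense": "Immigration", "sentences": 100},
    {"year": 2004, "district": " arizona ", "offense": "Immigration", "sentences": 120},
    {"year": 2006, "district": "Arizona", "offense": "Immigration", "sentences": 230},
]

for row in combine_rows(example_rows):
    print(row)
```

Standardizing the district name before grouping matters here: without it, the two 2004 rows would not be recognized as the same district and the totals would stay split.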
Building the graphics was really fun. While doing the preliminary analysis in Excel, I had sketched ideas for graphics using paper and pencil. These early sketches ended up forming the basis for a lot of the questions I asked the data.
These hand-drawn sketches were also what I took to my editor when I pitched my initial story. This was followed up by a lot of prototypes built in Illustrator and D3. But as the focus of the story changed and narrowed through editing and reporting, those graphics ideas evolved.
I found all aspects of the data interesting since I had been working with it and thinking about it for months. I wanted to cover it all, and this is where my editors stepped in. They helped me narrow my focus to one aspect of the data. I had been looking through the immigration numbers, but once it was decided that those numbers would be the sole focus of the story, I felt a new surge of energy. This narrowed focus sped up the process of reporting, design, analysis, and development.
Making the Line Chart
After initial sketching in Illustrator and D3, I had many prototypes to work off. To build our main chart, I repurposed an earlier exploratory chart prototype to include only immigration offenses. Building a functional line chart took a day. Designing and making it usable took a week.
I had built a line chart that responded to hover with CSS, but the lines at the bottom of the chart were too close together. On hover, the whole chart felt jumpy and unintuitive, and it was never clear which district was being highlighted. To help, Anna Flagg, our interactive reporter (and graphics designer extraordinaire!), walked me through the steps of building a function that triggered on mousemove. It compared the cursor's x and y position with each line and selected the line at the shortest distance. On hover, the selected line was given an ID that easily triggered CSS changes. The selected line triggered a tooltip and remained highlighted, while all other lines received a reduced opacity. This was synced with the search box, which was built on Twitter's Typeahead library. The days of work were worth it though, as the districts are now easily distinguishable on hover and thus useful for readers.
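The chart itself runs in the browser with D3, but the selection logic is language-agnostic. Below is a simplified sketch in Python of one way to read that step, not the production code: each line is a list of plotted (x, y) pixel points, and the line with a point nearest the cursor wins. The function name `nearest_line`, the district labels, and the coordinates are all hypothetical.

```python
import math


def nearest_line(mouse, lines):
    """Return the name of the line that has a plotted point closest to the mouse."""
    mx, my = mouse
    return min(
        lines,
        key=lambda name: min(
            math.hypot(x - mx, y - my) for x, y in lines[name]
        ),
    )


# Hypothetical pixel coordinates for two lines that sit close together.
example_lines = {
    "district_a": [(0, 100), (50, 90), (100, 80)],
    "district_b": [(0, 110), (50, 108), (100, 105)],
}

print(nearest_line((50, 105), example_lines))
```

Measuring distance to plotted points is an approximation. It works when every line has a point at each year, as in a yearly time series, but a chart with sparse points would need distance to the segments between them instead.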
Recognizing the Power of the Script
Manuel had left behind a fully detailed repository, so I was able to easily clone it and start the new downloading process myself. With a newly edited script, I downloaded 17 years of data: 94 PDFs for each year, and 188 for each of the years 2004 and 2005. Each run took about four hours.
It’s a script I ended up running six times: the first when I originally cloned Manuel’s script; the second when I added code to capture the missing 2004 and 2005 PDFs; the third when new 2017 data was released; the fourth when I discovered I had missed the state of Minnesota; and the fifth and sixth time, days before publication, because I ran all downloading and analyzing scripts from scratch to double-check my work.
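The download counts above are worth checking against what actually lands on disk, since a short count is exactly how The Dip slipped in. This small Python sketch computes the expected number of PDFs per run from the figures in this piece (94 districts, 2001 through 2017, doubled for 2004 and 2005); the constant and function names are made up for illustration.

```python
DISTRICTS = 94
DOUBLE_YEARS = {2004, 2005}  # years with two published sets of PDFs


def expected_pdf_count(first_year=2001, last_year=2017):
    """Count the PDFs a full run should fetch, doubling the split years."""
    return sum(
        DISTRICTS * (2 if year in DOUBLE_YEARS else 1)
        for year in range(first_year, last_year + 1)
    )


print(expected_pdf_count())
```

Comparing this number with the count of files actually downloaded gives a cheap completeness check at the end of each four-hour run.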
What You Can Do with the Data
Being able to see the number of prison sentences for immigration offenses by local jurisdiction was always a priority for me. It is why I wanted to highlight the federal districts along the border, but also why I wanted to share the data on all districts. In my reporting I talked to many experts who emphasized how much local prosecutorial priorities are based on localized legal culture. These are local stories that we did not have the time or resources to pursue.
Two days before we published, I approached my editor to ask if we could share the clean immigration data. It was an immediate yes, and hours after publication I created a public repo on GitHub so others could look at the data themselves and hopefully follow through on these local stories.
Yolanda Martinez is a graphics producer at The Marshall Project. She was previously a digital producer at Pew Research Center, working for the Hispanic and Social Trends research teams. She was a homepage intern at the Los Angeles Times and also spent two lovely weeks in Tucson, Arizona taking part in the New York Times Student Journalism Institute. She earned a master’s degree in multimedia journalism from the UC Berkeley Graduate School of Journalism.