Huge amounts of data are publicly available but poorly organised. One way to unleash the power of the data is to scrape it from its original source and repackage it in a spreadsheet.
Here’s an example of how that works. “The Pentagon’s $2.2 Billion Soviet Arms Pipeline Flooding Syria” was almost entirely based on public information that the United States government makes available on websites such as USAspending.gov and the Federal Procurement Data System. But the only way to get at the data we needed was to scrape it.
Basically, scraping is downloading data from a web page or an online database and storing it locally, so you can play with it and organise it in a way that helps you work on your story.
Data are all over the web, but they are often organised to be readable to humans rather than machines. That’s “unstructured” data.
Your goal is to make it structured, to end up with a clean set of data stored in a table with only the information you need for the story.
However, it’s a long road to get there.
Scraping is not simple. There are plenty of tools, browser add-ons and apps that offer one-click solutions, but those often fail on larger sets of data and can’t handle multiple pages; the ones that can tend to be expensive.
Good scraping requires some coding skills. Even if you use free and open-source tools like Scrapy or BeautifulSoup, you will need to understand programming to follow online scraping tutorials. The good news is that there are plenty of free tutorials to help you get started.
Those tutorials will teach you how to extract information from a website and export it into a table that you can then manipulate.
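To make that concrete, here is a minimal sketch of the kind of extraction those tutorials teach, using BeautifulSoup. The HTML below is a hypothetical stand-in for a page you have downloaded, and the column names are invented for illustration; a real scraper would fetch the page first and deal with messier markup.

```python
# A minimal table-scraping sketch with BeautifulSoup (bs4).
# The HTML and column names here are hypothetical stand-ins,
# not taken from any real procurement database.
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Contractor</th><th>Amount</th></tr>
  <tr><td>Acme Arms</td><td>1,200,000</td></tr>
  <tr><td>Example Corp</td><td>350,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    # Collect the text of every header or data cell in this row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# 'rows' is now structured data you can write out as a CSV table
print(rows[0])  # header row: ['Contractor', 'Amount']
```

The point is the shape of the result: each table row becomes a list of values, which maps directly onto a spreadsheet row.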
Chrome extensions such as Scraper and Data Miner, for example, are excellent tools for simple scraping jobs.
More complex, and more expensive, solutions are websites such as import.io, which enable you to build scrapers for multiple pages.
If you do have some coding skills, several websites, like USAspending.gov, offer an application programming interface (API) that you can use to query the data directly.
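As a rough illustration of what querying such an API looks like, here is a hedged Python sketch. The endpoint, filter names and field names below are assumptions for illustration only; check the USAspending.gov API documentation for the real ones before using it.

```python
# A hedged sketch of querying a procurement API from Python.
# The endpoint, filters and fields are illustrative assumptions --
# consult the USAspending.gov API documentation for the real names.
import json
import urllib.request

API_URL = "https://api.usaspending.gov/api/v2/search/spending_by_award/"  # assumed endpoint

def build_request(keyword, fiscal_year):
    """Build a POST request asking for awards matching a keyword in one fiscal year."""
    payload = {
        "filters": {
            "keywords": [keyword],
            "time_period": [{"start_date": f"{fiscal_year}-10-01",
                             "end_date": f"{fiscal_year + 1}-09-30"}],
        },
        "fields": ["Award ID", "Recipient Name", "Award Amount"],  # hypothetical fields
    }
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(API_URL, data=data,
                                  headers={"Content-Type": "application/json"})

req = build_request("ammunition", 2015)
# urllib.request.urlopen(req) would send the query and return JSON
# that you can parse with json.load() into rows for your table.
print(req.get_full_url())
```

The design point: the API gives you structured JSON directly, so there is no HTML to untangle, only filters to get right.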
If not, there are ready-made services like enigma.io or data.world that have already done half the job for you: they have scraped public databases, so you can use them to dig for your own stories.
That’s what we did for the Pentagon story. We had spotted that for each public payment to a contractor, the origin of the goods was declared. This meant that it was possible to see where the US military was buying weapons from in Central and Eastern Europe.
Unfortunately, there was no easy way to search through the US federal procurement data system by “country of origin”, so we needed a Plan B.
We decided to turn to enigma.io, a website that offers a comprehensive repository of public data. They had already scraped the complete USAspending.gov database, so we didn’t have to, and could concentrate on filtering what was relevant to our story.
That sounds easy, but it was far from it. The dataset of US government contracts lists every dollar contracted by every US agency, so, even with all the filters applied, the first set of data we downloaded was well over 100MB in a single .csv file. CSV stands for Comma Separated Values: it’s just one big text file in which columns are separated by commas and each row sits on its own line. You import it into a program that handles tables, like MS Excel, LibreOffice Calc or Google Sheets.
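When a CSV is too big for a spreadsheet program to handle comfortably, a few lines of Python can cut it down first. Here is a small sketch of that kind of filtering; the column names and values are hypothetical stand-ins, not the real dataset's schema.

```python
# A small sketch of filtering a big CSV down to the rows you need.
# The column names and sample rows are hypothetical stand-ins.
import csv
import io

# Stand-in for a large downloaded .csv file
raw = io.StringIO(
    "contractor,country_of_origin,amount\n"
    "Acme Arms,Bulgaria,1200000\n"
    "Example Corp,United States,350000\n"
    "Sample Ltd,Serbia,500000\n"
)

# Hypothetical filter: keep only Central and Eastern European origins
wanted = {"Bulgaria", "Serbia"}

out = io.StringIO()  # in a real script, open a new .csv file for writing
writer = csv.writer(out)
writer.writerow(["contractor", "country_of_origin", "amount"])

kept = 0
for row in csv.DictReader(raw):
    if row["country_of_origin"] in wanted:
        writer.writerow([row["contractor"], row["country_of_origin"], row["amount"]])
        kept += 1

print(kept)  # 2 rows survive the filter
```

The same loop works unchanged on a file of any size, because it reads one row at a time instead of loading everything into memory.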
As we were facing this type of data for the first time, we had to become acquainted with it pretty fast.
The data covered all the weapons and ammunition procured by the Pentagon from 2011 to 2016. But to figure out what we needed, we had to understand the procurement processes at the US Department of Defense and decode the technical lexicon.
Without doing background research, the data would have been meaningless.
Since we were all new to this dataset, we had to go back and forth several times before we were satisfied with the gathered information.
We focused our attention only on so-called “non-standard” weapons procured from Central and Eastern European countries and intended, or likely to be intended, for Syria. At the same time, because the clerks filing this information were sometimes careless, we needed to go through all the other data and dig out valuable contracts hidden behind incomprehensible codes.
After removing all the data that was not relevant to our work, we reduced it to less than 1MB, which was much easier to work with. That shows why it’s important to know exactly what you’re looking for, what your story is about and what data you need to tell it. Click here for a copy of our final dataset.
It shows why it’s important to ask the right question. Only then can you expect data to give you answers.
However, to trust the gathered information, we had to verify it.
Luckily, US procurement data are stored in multiple databases.
To verify every contract we were interested in, we looked up the records in the US Federal Procurement Data System (FPDS). We also checked the most problematic contracts with officials at the Pentagon.
Thanks to all of that, we were able to catch the Pentagon’s attempt to hide some embarrassing data. We found out that, after we started to work on the story and began asking the Pentagon inconvenient questions, US Department of Defense staff went into the FPDS database and changed several mentions of Syria. As we had stored the originals, we were able to expose this attempt to rewrite history.
This article was originally published in BIRN Albania’s manual ‘Getting Started in Data Journalism’.