I loathe .PDFs of public records with the power of a thousand suns. They’re a tease. They’re full of data tables but useless to most data journalists in the .PDF format. And government officials love to share them with us because they know a .PDF doesn’t allow for sorting and filtering.
Converting .PDFs into spreadsheets can solve this problem.
One of the most popular tools for the job is Tabula, a free program you can download to your desktop. It’s popular among investigative reporters and editors because the software runs off your hard drive, which allows for more privacy than a browser-based tool that requires you to upload the .PDF to the web.
Tabula is available for PCs and Macs, and makes it easy to scrape a “native .PDF,” meaning a Word/Google doc or spreadsheet saved into .PDF format.
Once you download Tabula to your computer, click on the green icon in your applications folder to open the software in your default browser. It’s easy to use. Just click on the browse button, select a .PDF off your desktop or hard drive and hit the “Import” button. (Figure 1.1)
It will take a few seconds for the .PDF to load. Once it appears in your browser, you can highlight the table you want to scrape two ways:
- Hold down your mouse button and drag the cursor over the table, or
- Hit the “Auto Detect Tables” button at the top of the interface. This option works best if you want to scrape multiple tables from the .PDF. (Figure 1.2)
Next, hit the green “Preview and Extract Data” button in the upper right corner of the interface. This will convert the data from a .PDF to an HTML table.
You’re almost done. Just select “CSV” from the dropdown menu and hit the “Export” button. You will get a comma-separated values spreadsheet exported to your downloads folder. Then open the file in Google Sheets or Excel and begin your analysis.
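If you would rather script that first pass of sorting and filtering than do it in a spreadsheet, here is a minimal Python sketch using only the standard library. The column names, the sample rows and the file name in the comment are hypothetical stand-ins for whatever your own Tabula export contains, so treat this as a starting point rather than a finished workflow.

```python
import csv
import io


def load_rows(csv_file):
    """Read a CSV (header row first) into a list of dicts, one per row."""
    return list(csv.DictReader(csv_file))


def sort_rows(rows, column, descending=True):
    """Sort rows by a numeric column, tolerating commas and dollar signs."""
    def as_number(row):
        cleaned = row[column].replace(",", "").replace("$", "").strip()
        return float(cleaned) if cleaned else 0.0
    return sorted(rows, key=as_number, reverse=descending)


def filter_rows(rows, column, value):
    """Keep only rows whose column matches value (case-insensitive)."""
    return [r for r in rows if r[column].strip().lower() == value.lower()]


# Hypothetical sample standing in for a Tabula export; with a real file use:
#   with open("tabula-export.csv", newline="", encoding="utf-8") as f:
#       rows = load_rows(f)
sample = io.StringIO(
    "Department,Employee,Salary\n"
    "Police,A. Rivera,\"$82,500\"\n"
    "Parks,B. Chen,\"$54,000\"\n"
    "Police,C. Osei,\"$91,250\"\n"
)
rows = load_rows(sample)
top_paid = sort_rows(rows, "Salary")                 # highest salary first
police = filter_rows(rows, "Department", "police")   # one department only

print(top_paid[0]["Employee"])
print(len(police))
```

Stripping the dollar signs and commas before sorting matters because tables scraped from a .PDF usually arrive as text, and text sorts alphabetically rather than by value.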
For scanned .PDFs, Tabula requires a few extra steps, which can be found under the help menu at the top of the page.
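If you have a stack of native .PDFs and don’t want to click through each one, the same extraction can be scripted. The sketch below assumes you have installed tabula-py, a separate Python wrapper around Tabula’s extraction engine that also needs Java on your machine; it is not part of the desktop app described above. The `csv_path_for` and `extract_tables` helpers and the `budget.pdf` file name are made up for illustration. Like the desktop app, this runs on your own computer, so nothing is uploaded to the web.

```python
from pathlib import Path


def csv_path_for(pdf_path, table_number):
    """Build an output name such as 'budget-table-1.csv' next to the PDF."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}-table-{table_number}.csv")


def extract_tables(pdf_path):
    """Write every table tabula-py detects in the PDF to its own CSV file."""
    # Imported here so the helper above works without tabula-py installed.
    # Assumption: `pip install tabula-py` has been run and Java is available.
    import tabula

    # pages="all" scans the whole document; the result is a list of DataFrames.
    tables = tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)
    written = []
    for number, table in enumerate(tables, start=1):
        out = csv_path_for(pdf_path, number)
        table.to_csv(out, index=False)  # index=False drops pandas row numbers
        written.append(out)
    return written


# Example call ("budget.pdf" is a placeholder for your own native .PDF):
#   for path in extract_tables("budget.pdf"):
#       print(path)
```

Automatic detection can split or merge tables in ways you don’t expect, so check each CSV against the original .PDF before you start reporting from it.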
If you prefer a browser-based scraping tool, try CometDocs or PDFtoExcel, which are free web-based services for extracting data tables from regular and scanned PDF files into fully editable Excel spreadsheets. Both are good tools, but they require you to upload your document to their websites. That can create privacy problems for your source and your investigative story.
Video: Scraping .PDFs with Tabula
Quick tip: There are many audio editing tools on the market, but if you need a good audio clipping tool to use on the fly, give AudioTrimmer.com a try. The interface is simple: Just hit the upload button and use the clipping tool to trim any audio you don’t want to use. It supports several file formats, including .mp3 and .wav, and lets you make ringtones online, too. Another good option for a browser-based audio editing tool is Sodaphonic. Hokusai is an iOS app that edits audio on your smartphone. Here’s a video on how to use it:
Find more resources on JournalistsToolbox.org. Subscribe to our free, twice-monthly newsletter full of tips, tricks and tools. And subscribe to our free YouTube channel with more than 55 training videos. Follow Mike on Twitter @journtoolbox.