Government websites love to bury data in tables on web pages. Why? It satisfies legal requirements for making document public under sunshine laws, but it renders the data useless. You can’t sort or filter the data to look for trends, do math calculations to find rates and averages, and other things journalists need to find stories.
But a simple spreadsheet formula, a website link and a Google Sheet can scrape the data off a site and move it into a format you can use. Here’s a step-by-step breakdown on how to scrape data from the Federal Highway Administration’s bridge inspections study:
- Start with this formula: =IMPORTHTML(“URL”,“table”, 0)
- Add this URL to the formula. This is the page we’ll scrape: https://www.fhwa.dot.gov/bridge/nbi/no10/condition17.cfm
- Add this type of element to the formula: table:0
=IMPORTHTML(“https://www.fhwa.dot.gov/bridge/nbi/no10/condition17.cfm”,”table”,NUMBER OF ELEMENT ON PAGE)
- Now add the table number (in this case a 0 as you’re telling it to start at the top of the first table (table 0 is the first table on an HTML page): =IMPORTHTML(“https://www.fhwa.dot.gov/bridge/nbi/no10/condition17.cfm”,”table”,0)
- Final code you enter into cell A1 on the Google Sheet. Just COPY and paste this into a Google Sheet: =IMPORTHTML(“https://www.fhwa.dot.gov/bridge/nbi/no10/condition17.cfm”,”table”,0)
Note: If you get an error, just try it again and make sure you copy everything, including the equal sign and close parentheses in the formula.
- In the upper left corner, click on “Untitled Spreadsheet” and type: FHA bridge inspections.
Cleaning the Data and Working off a Copy
There’s one problem after scraping the data: If you go to type on the screen or sort/filter/edit the data, your data disappears. This is because the spreadsheet is linked to the web page you scraped. When the web page updates, so does the sheet.
- To remedy the issue, highlight all the data on the screen and copy it. Then click on the Plus Sign in the lower left corner of the Google Sheet to open a new tab in the spreadsheet. You always, always, always work off a COPY of your dataset in case you mess something up.
- To paste the data into the new sheet, click into cell A1, then go to the Edit > Paste Special > Paste values only menu at the top and click (See figure 1.1 below). Presto! Your data appears in the new sheet and is now editable. You still have your original scraped sheet in the first tab.
- Once you paste the data into row one, label the tab at the bottom EDITS. Clean up the top row and fill in the labels in Row 2, then delete Row 1 so your header labels start there. MAKE SURE that Alabama starts in Row 2 and that you have a single header row with State typed over cell A1. Delete out any extra rows at the top or the formulas below won’t work.
- Now that it’s in a spreadsheet, you can use formulas to analyze the data.
Exercise on your own
Here are some more pages you can practice scraping, using this formula for webpages. Just plug ONE of the web address into the area that says URL in one of the blank formulas below, copy and paste the entire formula into cell A1 on a Google Sheet and hit return:
- Franklin County, Ohio polling locations: https://vote.franklincountyohio.gov/voters/polling-locations.cfm
Video: If you’re a visual learner, here’s a training video walking you through the scraping formula steps and how to install a scraping plug-in to your browser:
Quick tip: Need a Google Doc or a Google Sheet quickly? Just type docs.new or sheets.new in your browser window and the doc will open. Be sure to download or save the document into Google Drive so you don’t lose track of it.
Find more resources on JournalistsToolbox.org. Subscribe to our free, twice-monthly newsletter full of tips, tricks and tools. And subscribe to our free YouTube channel with more than 55 training videos. Follow Mike on Twitter @journtoolbox.