At 10 AM the Saturday before inauguration day, on the sixth floor of the Van Pelt Library at the University of Pennsylvania, roughly 60 hackers, scientists, archivists, and librarians were hunched over laptops, drawing flow charts on whiteboards, and shouting opinions on computer scripts across the room. They had hundreds of government web pages and data sets to get through before the end of the day, all strategically chosen from the pages of the Environmental Protection Agency and the National Oceanic and Atmospheric Administration, any of which, they felt, might be deleted, altered, or removed from the public domain by the incoming Trump administration.
Their undertaking, at the time, was purely speculative, based on the travails of Canadian government scientists under the Stephen Harper administration, which muzzled them from speaking about climate change. Researchers watched as Harper officials threw thousands of books of aquatic data into dumpsters when federal environmental research libraries closed.
But three days later, speculation became reality as news broke that the incoming Trump administration’s EPA transition team does indeed intend to remove some climate data from the agency’s website. That will include references to President Barack Obama’s June 2013 Climate Action Plan and the strategies for 2014 and 2015 to cut methane, according to an unnamed source who spoke with
Inside EPA. “It’s entirely unsurprising,” said Bethany Wiggin, director of the environmental humanities program at Penn and one of the organizers of the data rescue event.
Back at the library, dozens of cups of coffee sat precariously close to electronics, and coders were passing around 32-gigabyte zip drives from the university bookshop like precious artifacts.
At Penn, a group of coders who called themselves “baggers” set upon the tougher data sets immediately, writing scripts to scrape the data and collect them in digital “bags” to be uploaded to DataRefuge.org, an Amazon Web Services-hosted site that will serve as an alternate repository for government climate and environmental research during the Trump administration. (A digital bag works like a safe: it alerts the user if anything inside it is changed.)
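The “bag” terminology comes from checksum-manifest packaging of exactly this kind. As a rough illustration, not necessarily the event’s own tooling, the Library of Congress’s bagit-python library wraps a directory of scraped files in tamper-evident manifests; the directory name and metadata below are hypothetical:

```python
# A minimal sketch of "bagging" scraped data with bagit-python
# (pip install bagit). Paths and metadata values are illustrative only.
import bagit

# Wrap a directory of scraped files in a bag; bagit writes checksum
# manifests alongside the payload.
bag = bagit.make_bag(
    "noaa_scrape/",
    {"Source-Organization": "NOAA", "Contact-Name": "DataRefuge volunteer"},
)

# Validation later recomputes the checksums; any altered or missing file
# raises bagit.BagValidationError -- the "safe" alerting its user.
bag.validate()
```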
“We’re yanking the data out of a page,” said Laurie Allen, the assistant director for digital scholarship in the Penn libraries and the technical lead on the data rescue event. Some of the most important federal data sets can’t be extracted with web crawlers: either they’re too big, or too complicated, or they’re hosted in aging software and their URLs no longer work, redirecting to error pages. “So we have to write custom code for that,” Allen says, which is where the improvised data-harvesting scripts that the baggers write will come in.
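What that custom code looks like varies page by page, but one recurring chore is catching “soft” failures, where an aging server answers a dead link by redirecting to an error page that still returns HTTP 200. A minimal sketch, with the redirect check itself an assumption about any given site’s behavior:

```python
# Hypothetical harvesting helper: fetch one data file and flag redirects
# that silently land on an error page instead of returning a 404.
import requests

def fetch_dataset(url):
    resp = requests.get(url, timeout=60, allow_redirects=True)
    resp.raise_for_status()
    # The final URL matters as much as the status code: many old servers
    # redirect broken links to an error page that returns 200 OK.
    if resp.history and "error" in resp.url.lower():
        raise RuntimeError(f"{url} redirected to an error page: {resp.url}")
    return resp.content
```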
But data, no matter how expertly it is harvested, isn’t useful divorced from its meaning. “It no longer has the beautiful context of being a website. It’s just a data set,” Allen says.
That’s where the librarians came in. In order to be used by future researchers, or possibly to repopulate the data libraries of a future, more science-friendly administration, the data would have to be untainted by suspicions of meddling. So the data must be meticulously kept under a secure chain of provenance. In one corner of the room, volunteers were busy matching data with descriptors: which agency the data came from, when it was retrieved, and who was handling it. Later, they hope, scientists can properly add a finer explanation of what the data actually describe.
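In code, each bag’s paper trail can be as simple as a small record carried alongside the data. The field names here are illustrative, not the DataRefuge schema:

```python
# A sketch of the provenance descriptors volunteers recorded for each
# data set. Field names and values are hypothetical.
from datetime import datetime, timezone

provenance = {
    "source_agency": "NOAA",
    "source_url": "https://www.example.noaa.gov/dataset",  # placeholder
    "retrieved_at": datetime.now(timezone.utc).isoformat(),
    "handled_by": "volunteer-042",   # who performed the capture
    "description": None,             # for a scientist to fill in later
}
```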
But for now, the priority was getting it downloaded before the new administration got the keys to the servers next week. Plus, they all had IT jobs and dinner plans and exams to get back to. There wouldn’t be another time.
Bag It Up
By noon, the team feeding web pages into the Internet Archive had set crawlers upon 635 NOAA data sets: everything from ice core samples to radar-derived coastal ocean current velocities. The baggers, meanwhile, were busy finding ways to rip data from the Department of Energy’s Atmospheric Radiation Measurement Climate Research Facility website.
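For the pages a standard crawler can handle, “feeding” the Internet Archive can be as simple as hitting the Wayback Machine’s public Save Page Now endpoint once per URL. The seed list below is a stand-in for the team’s actual nomination spreadsheet:

```python
# Ask the Wayback Machine to capture each seed URL. The list is a
# placeholder, not the event's real seed list.
import requests

seeds = [
    "https://www.esrl.noaa.gov/gmd/ccgg/trends/",  # example NOAA page
]

for url in seeds:
    resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
    print(url, "->", resp.status_code)
```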
In one corner, two coders were puzzling over how to download the Department of Transportation’s hazmat accidents database. “I don’t think there would be more than a hundred thousand hazmat accidents a year. Four years of data for fifty states, so 200 state-years, so...”
“Less than 100,000 in the last four years in every state. So that’s our upper limit.”
“It’s kind of a macabre activity to be doing here, sitting here downloading hazmat accidents.”
At the other end of the table, Nova Fallen, a Penn computer science grad student, was puzzling over an interactive EPA map of the US showing facilities that violated the EPA’s rules.
“There’s a 100,000 limit on downloading these. But it’s just a web form, so I’m trying to see if there’s a Python way to fill out the form programmatically,” said Fallen. Roughly 4 million violations filled the system. “This might take a few more hours,” she said.
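The approach she describes, sketched below with a made-up endpoint and field names, is to POST the form’s fields directly and page through the results in chunks under the export cap:

```python
# Hypothetical sketch of scripting a web form's export in 100,000-row
# chunks. The endpoint and form field names are invented for illustration.
import requests

EXPORT_URL = "https://echo.epa.gov/example/export"  # placeholder endpoint
CHUNK = 100_000
TOTAL = 4_000_000  # roughly 4 million violations in the system

for offset in range(0, TOTAL, CHUNK):  # 40 requests of 100,000 rows each
    form = {"output": "csv", "start": offset, "rows": CHUNK}
    resp = requests.post(EXPORT_URL, data=form, timeout=300)
    resp.raise_for_status()
    with open(f"violations_{offset}.csv", "wb") as f:
        f.write(resp.content)
```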
Brendan O’Brien, a coder who builds tools for open-source data, was deep into a more complicated task: downloading the EPA’s entire library of local air monitoring results from the last four years. The page didn’t seem very public. “It was so buried,” he said.
Each entry for each air sensor linked to another set of data, and clicking each link by hand would take weeks. So O’Brien wrote a script that could find each link and open it. Another script opened the link and copied what it found into a file. But inside those links were more links, so the process began again.
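In outline, that loop is a depth-limited recursive crawl. The start URL and depth cap below are assumptions for illustration, not O’Brien’s actual code:

```python
# A sketch of the link-chasing loop described above: save whatever each
# link returns, then recurse into any links found inside HTML responses.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

seen = set()

def crawl(url, depth=0, max_depth=3):
    if url in seen or depth > max_depth:
        return
    seen.add(url)
    resp = requests.get(url, timeout=60)
    # Copy what came back into a local file before going deeper.
    with open(f"page_{len(seen):06d}.dat", "wb") as f:
        f.write(resp.content)
    # Only HTML responses can contain further links to follow.
    if "html" in resp.headers.get("Content-Type", ""):
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            crawl(urljoin(url, a["href"]), depth + 1, max_depth)

crawl("https://www.epa.gov/example-air-data-index")  # placeholder start page
```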
Eventually, O’Brien was watching raw data, basically a text file, roll in. It was indecipherable at first, just a long string of words and numbers separated by commas. But they began to tell a story. One line contained an address in Phoenix, Arizona: 33 W Tamarisk Ave. This was air quality data from an air sensor at that spot. Beside the address were numeric values, then several types of volatile organic compounds: propylene, methyl methacrylate, acetonitrile, chloromethane, chloroform, carbon tetrachloride. Still, there was no way to tell whether any of those compounds were actually in the air in Phoenix; in another part of the file, numbers that presumably indicated levels of air pollution were sitting unpaired with whatever contaminant they corresponded to.
But O’Brien said they had reason to believe this data was particularly at risk, especially since the incoming EPA administrator, Scott Pruitt, has sued the EPA multiple times as Oklahoma’s attorney general to roll back the agency’s more blockbuster air pollution regulations. So he’d figure out a way to store the data anyway, and then go back and use a tool he built called qri.io to pull apart the files and try to arrange them into a more readable database.
By the end of the day, the group had collectively loaded 3,692 NOAA web pages onto the Internet Archive, and found ways to download 17 particularly hard-to-crack data sets from the EPA, NOAA, and the Department of Energy. Organizers have already laid plans for several more data rescue events in the coming weeks, and a professor from NYU was talking hopefully about hosting one at his university in February. But suddenly, their timeline became more urgent.
On the day that the Inside EPA report came out, an email from O’Brien popped up on my phone with “Red Fucking Alert” in the subject line.
“We’re archiving everything we can,” he wrote.