Developing a Data Harvester in the Amazon Cloud for the Automated Assimilation of Florida’s Healthy Beaches Reports into the GCOOS Data Portal

Robert Currier, Mote Marine Laboratory
Dr. Barbara Kirkpatrick, TAMU/GCOOS

Page 2: Overview

FL Department of Health (DOH) monitors 34 coastal counties
E. coli/Enterococcus samples taken weekly
DOH data publicly available, but no API
Original DOH website used standard HTML/CSS
Python “web scraping” app developed to harvest data
DOH outsourced website to commercial provider

Page 3: In today’s “Big Data” world this is becoming typical:

We had no access to DOH staff or an API for the data
What we built broke when the data format changed
This is the story of how we fixed the harvester

Page 4: Original Data Harvester

Written in Python
Used the ‘urllib’ library for web scraping
Data stored in MySQL database
Harvester ran nightly out of cron
App walked through list of counties and built url: http://esetappsdoh.doh.state.fl.us/irm00beachwater/beachresults.apx?county=’sarasota’
Data returned as Python text object
Text object fed to regular expression for matching
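The harvesting loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The row pattern `ROW_RE`, the assumed HTML layout, and the helper names are all hypothetical. The query string is built without the quotes shown in the slide's URL, and the MySQL step is left out.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as shown on the slide; everything else here is hypothetical.
BASE_URL = "http://esetappsdoh.doh.state.fl.us/irm00beachwater/beachresults.apx"

# Assumed row layout (the real page format is not given in the slides):
#   <td>beach name</td><td>MM/DD/YYYY</td><td>count</td>
ROW_RE = re.compile(
    r"<td>(?P<beach>[^<]+)</td>\s*"
    r"<td>(?P<date>\d{1,2}/\d{1,2}/\d{4})</td>\s*"
    r"<td>(?P<count>\d+)</td>"
)


def build_url(county):
    """Build the per-county results URL."""
    return BASE_URL + "?" + urlencode({"county": county})


def fetch(county, timeout=30):
    """Download the page for one county and return it as text."""
    with urlopen(build_url(county), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_results(html):
    """Return (beach, date, count) tuples matched by the regular expression."""
    return [
        (m.group("beach").strip(), m.group("date"), int(m.group("count")))
        for m in ROW_RE.finditer(html)
    ]


def harvest(counties, fetch_page=fetch):
    """Walk the county list and collect the parsed rows for each county."""
    return {county: parse_results(fetch_page(county)) for county in counties}
```

Passing `fetch_page` as a parameter keeps the parsing logic testable without network access. A regular expression tied this closely to the page markup is also the part most likely to break when the site changes.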

Page 5: Original Data Format

Page 6: And Then It Stopped Working…

FL DOH suddenly (to us) outsourced in early 2013
New website used proprietary JavaScript and Maps
Plain HTML no longer sent to the browser
Instead, custom JavaScript was loaded
The JavaScript used AJAX and DOM manipulation

Page 7: New Data Format

Page 8: The Solution

Emulating a browser with Selenium
Portable software test framework for web applications
Can act like Firefox, Chrome and IE
Typically used for building automated tests
We repurposed it and used it as a virtual browser
As a browser, Selenium can execute JavaScript
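Using Selenium as a virtual browser might look something like the sketch below. It is not the authors' code. The helper names and the fixed pause that lets the page's AJAX calls finish are assumptions. `webdriver.Firefox()`, `driver.get()`, `driver.page_source` and `driver.quit()` are standard Selenium WebDriver calls.

```python
import time


def make_firefox_driver():
    """Start a Firefox session through Selenium (needs selenium and Firefox)."""
    from selenium import webdriver  # third-party package, imported lazily
    return webdriver.Firefox()


def fetch_rendered_html(driver, url, settle_seconds=5.0):
    """Load url in the browser, let its JavaScript run, and return the DOM as HTML."""
    driver.get(url)
    # Crude fixed pause so AJAX calls can finish; the right delay is an assumption.
    time.sleep(settle_seconds)
    return driver.page_source
```

Typical use would be `driver = make_firefox_driver()`, then `html = fetch_rendered_html(driver, url)`, and finally `driver.quit()` to close the browser. Unlike a plain `urllib` download, `page_source` reflects the document after the JavaScript has modified it.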

Page 9: Soup’s On!

Selenium worked and we now had data available
But data was very unstructured and massively ugly
BeautifulSoup4 to the rescue…

Page 10: And The Soup Was Tasty!

BeautifulSoup4 gave us back our “structured” data
Some modification needed to data parsing code as…
Locations, variables and dates were not on same line
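One way to handle locations, variables and dates that arrive on separate lines is to flatten the rendered page to text lines with BeautifulSoup4 and then regroup them. The sketch below is only an illustration. The line order (location, then variable, then date), the date format, and the helper names are assumptions, since the slides do not show the real page layout.

```python
import re

# Assumed date format, e.g. 7/1/2013.
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")


def text_lines(html):
    """Flatten rendered HTML into a list of non-empty text lines."""
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4, imported lazily
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text("\n", strip=True)
    return [line for line in text.splitlines() if line.strip()]


def group_records(lines):
    """Regroup lines into records, treating each date line as the end of a record."""
    records = []
    pending = []
    for line in lines:
        if DATE_RE.match(line):
            # Only emit a record when both a location and a variable preceded the date.
            if len(pending) >= 2:
                records.append(
                    {"location": pending[0], "variable": pending[1], "date": line}
                )
            pending = []
        else:
            pending.append(line)
    return records
```

Calling `group_records(text_lines(html))` on the HTML returned by the browser would give one dictionary per sample under these assumptions.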

Page 11: The New Code Worked Perfectly In Our Development Environment

Page 12: But Failed Spectacularly When We Deployed

Page 13: What Happened?

Amazon EC2 instances are “headless” servers
No display hardware
No graphics libraries (GTK+)
Since no graphics libraries, no browsers
Without a browser, we crash and burn
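A small pre-flight check can surface this kind of failure before the harvester starts. The sketch below is hypothetical and was not part of the original harvester. It only looks for an X display in the `DISPLAY` environment variable and for a browser executable on the `PATH`. The function name and messages are illustrative.

```python
import os
import shutil


def headless_problems(env=None, which=shutil.which, browser="firefox"):
    """Return a list of reasons a browser-driven harvester could not run here."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("DISPLAY"):
        problems.append("no X display (DISPLAY is not set)")
    if which(browser) is None:
        problems.append("browser executable not found: " + browser)
    return problems
```

An empty list means both prerequisites were found. On a freshly launched headless server both messages would be expected.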

Page 14: Adding A Virtual Head

http://joekiller.com provided us with a script that pulled the source and built GTK+ on our cloud server in under two hours. Thanks, Joe Lawson!

Unfortunately, the script bombed and didn’t build Firefox. We had to download the source and build by hand.

Now we had a working browser, but no monitor on which to display our output…

Page 15: Getting A Head with XVFB

XVFB: The X virtual frame buffer
Performs all graphical operations in memory
Doesn’t show output
Primarily used for testing, but…
We repurposed it, just like Selenium
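Giving a headless server a virtual display could be sketched as below: start Xvfb on a display number and point `DISPLAY` at it, so that a browser launched afterwards renders into memory. The display number `:99`, the screen geometry, the startup pause, and the helper names are assumptions, not values from the presentation.

```python
import os
import subprocess
import time


def xvfb_command(display=":99", width=1280, height=1024, depth=24):
    """Build the Xvfb command line for one in-memory screen."""
    screen = "{}x{}x{}".format(width, height, depth)
    return ["Xvfb", display, "-screen", "0", screen]


def env_with_display(display=":99", base_env=None):
    """Return a copy of the environment with DISPLAY pointing at the virtual screen."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def start_xvfb(display=":99", startup_seconds=1.0):
    """Launch Xvfb (it must be installed) and export DISPLAY for later child processes."""
    proc = subprocess.Popen(xvfb_command(display))
    time.sleep(startup_seconds)  # assumed pause to let the X server come up
    os.environ["DISPLAY"] = display
    return proc
```

A harvester run would call `start_xvfb()` first, then start the Selenium browser, and call `terminate()` on the returned process when finished.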

Page 16: Automating The Process

Page 17: Conclusions

Don’t be afraid to use untraditional data sources
But be prepared for your code to break
We live in a data rich environment
But most of the data is very messy/unstructured
So tread lightly, and don’t lose your head!

Page 18: Thanks To:

Mote Marine Laboratory
Gulf of Mexico Coastal Ocean Observing System
Texas A&M Department of Oceanography
All the Free and Open Source Software developers

Page 19: In Remembrance Of

Seth Vidal, creator of ‘yum’, friend and FOSS guru
Killed while biking on July 8th 2013 in Durham, NC