A Tale of Dirt-Bags (Earth Scientists) and Propeller-Heads (Computer Scientists)
or
Notes on Geographic Information Systems, DBMS Technology, and SciDB

Dr. Paul G. Brown
Paradigm4 / SciDB




Overview of Talk
• How did we get here?
  • DBMS, GIS, Scientific and Statistical Data Management
  • Pioneers and Pilgrims
  • Dark Ages: Pre-Internet, Pre-Web, Sneaker-net and boxes full of CDs
  • Sequoia 2000
  • Jim Gray, and the Sloan Digital Sky Survey
  • XLDB Conferences
• Science and Its Methods
  • Why your skill sets will become lucrative (not just important).
• Quick Overview of SciDB

We are witnessing the rise of the Scientific Data Management System: a category of applications that draws on the lessons of traditional IT but focuses on the requirements and methods of scientific data management and analysis.


In the (very) beginning …
• In the (very) beginning was the application …
  • Small set of files with (semi-)standardized internal format.
  • Large and complex libraries for accessing file content.
  • Simple(?) scripting languages for glue.
  • Examples: IMS + COBOL + JCL, NetCDF/HDF/FITS + C/C++ + Perl/Python, etc.
• Commercial Data Management: Rapid adoption of RDBMS / SQL. Why?
  • Ad hoc (for each task) data model requirement (no industry standards).
  • More demanding quality of service guarantees (transactions, access control).
  • Enormous pent up demand for data sharing and collaboration.
  • Commercial data management was process oriented.
• Scientific Data Management: Went a different route. Why?
  • Data consumers and producers in different communities. (Sneaker-net).
  • Science organized into project teams: goal oriented.
  • Technical innovation (algorithm development) as important as scientific progress.


Sequoia 2000
• 5 Year Investigation into Scientific Data Management
  • 1991-1996 – University of California System, DARPA, Digital Equipment Corp.
  • Collaboration between Computer Science types (Mike Stonebraker at UC Berkeley, Jim Gray at DEC) and users of Geographic Data (Dozier/Frew at UC Santa Barbara, UCLA and UC San Diego Climate Modelers)
  • EOS-DIS Alternative Architecture Study
  • First wide area network (connecting UC Campuses) at “T3” bandwidth (about 45 Mbit/s)
  • Postgres 4.3 – R-Trees, spatial types etc. Eventually, PostgreSQL and PostGIS
• The Propeller-heads and the Dirt-bags
  • “Ignorance raised to the power of arrogance.” – James Frew
  • Computer Scientists – “What do you mean your data’s square?”
  • Earth Scientists – “What do you mean more than one person can read and write the same data at the same time?”


(Hard-won) Lessons
• Collaboration isn’t easy …
  • Different teams spoke different languages …
  • … even within related scientific disciplines.
• Dirt-bags had more to gain than Propeller-heads
  • Technology that enables collaboration.
  • Ask questions (queries), don’t write programs.
• Propeller-heads had more to learn than Dirt-bags
  • SQL might be a $10 billion market, but it doesn’t do:
    • Image processing, numerical analysis, time-series, HDF, etc.

Strong Claim # 1: Inter-disciplinary innovation is necessary for us to make significant progress.


XLDB Conferences – 2008
• Bring together Propeller-heads and Dirt-bags
• 2008 Thought: How to do next-gen Science?
  – Large Hadron Collider
  – Large Synoptic Survey Telescope
  – Initial survey of science requirements that informed the design and implementation of SciDB.
• 2009 Thought: What about Industrial Big Data Users?
  – Turned out, industrial data sizes will be 10x scientific!
  – “Internet of things”
• Now 3 Annual Conferences – US, Europe, Asia


Big Science – Research Systems
How can we use DBMS technology to help Scientific data management?
• Sloan Digital Sky Survey (http://www.sdss.org/)
  – A database of astronomical objects.
  – Query-centric interfaces, web-facing APIs.
• TerraServer (http://www.terraserver.com/)
  – Point the “big eye” down
  – Commercial application of remote sensing data.
• NIH – 1,000 Genomes Online
  – Powered by SciDB since 2010
  – 8 TB of data online, 3,000 analytic sessions per day.
  – Growing as fast as they can …


Where are the Propeller-Heads?
Playing Football like Seven Year Olds
• Documents! Hadoop! Triple-Stores! Graph-DBs!
  • Take a technology with proven value in a specific use-case …
  • … declare it to be the Next Big Thing (it will crush SQL/RDBMS!) …
  • … chase each new idea like seven-year-olds chase a ball.
• Roll the Clock Forward to 2014
  • Hadoop Providers are (Re-)Implementing SQL: Hive, Cloudera’s Impala, Hortonworks Stinger, YARN + Spark

Strong Claim # 2: One size (one technical architecture) does NOT fit all problem domains.


Jim Gray and the Fourth Paradigm
• Who was Jim?
  • Turing Award Winner - 1998
  • Architect of $erious $ystems (Ultimate Propeller-head)
• What is the “Fourth Paradigm”?
  • eScience

“Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.”


The “Big Idea” Slide
The methodologies used to analyze scientific data are central to how we understand our world.
• Ubiquitous networks of sensors will render much of the world an empirical or scientific phenomenon.
  • How to store all that data?
  • How do we share that data?
  • How will we reason about it?
  • How can such development be made to work economically?

“Increasingly, [scientific breakthroughs → life’s everyday decisions] will be powered by advanced computing capabilities that help [researchers → everyone] manipulate and explore massive datasets.”


Challenges
• Collaboration
  • Overcoming the “language challenges” inherent when attempting any inter-disciplinary project.
  • Sensitivity to legal and ethical issues: privacy.
• Information Integration
  • Technical standards for data communication.
  • Data cleanliness, and identifying common information.
• Visualization and Simple (but not too simple!) Interfaces
  • Nothing to add!
• Ubiquitous Availability

“We have to do better at producing tools to support the whole research cycle, from data capture and data curation to data analysis and data visualization.” (Gray’s Turing Lecture)


Ergo, SciDB
• How do you find out what Dirt-bags want? You ask them!
  • Arrays (or Matrices) as the basic structural building block
  • Algebra of array manipulation operations as API
  • Distributed computation (cloud or cluster) for scale
  • Integrated processing and storage platform
  • Extensible framework (to allow for algorithm innovation)
  • Provenance (track data through its life-cycle) and no-overwrite storage
  • Client languages of choice: ‘R’, Python, not 4GLs or C/C++
  • In-situ data access (as well as providing a data store)

M. Stonebraker, J. Becla, D.J. DeWitt, K. Lim, D. Maier, O. Ratzesberger, and S.B. Zdonik, "Requirements for Science Data Bases and SciDB", in Proc. CIDR, 2009
or http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf if you are reading this in the 21st Century.


Why SciDB?
[Diagram: “Big analytics without big hassles”. Labeled components: MPP storage and compute; array data model; complex analytics; commodity clusters or cloud; R, Python, Matlab, Julia, …]


SciDB – The (Very) Short Tour – 1.
• Array Data Model
  – “Data Management for Squares”

    CREATE ARRAY Example
    < data : float >
    [ X=0:*,1000,0, J=0:*,1000,0 ];

    CREATE ARRAY geodata
    < trackindex : int16,
      scanindex : int16,
      height : int16,
      sensorzenith : float,
      sensorazimuth : float,
      range : uint32,
      solarzenith : float,
      solarazimuth : float,
      landseamask : uint8 >
    [ longitude = -1800000 : 1800000, 50000, 0,
      latitude = -900000 : 900000, 50000, 0,
      start_time = 199900000000 : 201400000000, 1, 0,
      platformid = 0 : 1, 1, 0,
      resolutionid = 0 : 2, 1, 0 ];

Example Array from: Planthaber, Gary Lee, Jr. “MODBASE: a SciDB-powered system for large-scale distributed storage and analysis of MODIS earth remote sensing data”, MIT, 2012
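Each dimension in the declarations above carries four parts: a lower bound, an upper bound (or * for unbounded), a chunk length, and a chunk overlap. The following is a minimal Python sketch of how a cell coordinate could be mapped to the chunk that holds it under that reading. The names Dimension, chunk_origin and chunk_of are illustrative assumptions, not part of SciDB, and overlap is carried but not used.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Dimension:
    """One array dimension, read as name=low:high,chunk_length,overlap."""
    name: str
    low: int
    high: Optional[int]      # None stands for '*' (unbounded)
    chunk_length: int
    overlap: int = 0         # kept for completeness; unused in this sketch

    def chunk_origin(self, coord: int) -> int:
        """Return the first coordinate of the chunk that holds coord."""
        if coord < self.low or (self.high is not None and coord > self.high):
            raise ValueError(f"{coord} is outside dimension {self.name}")
        offset = (coord - self.low) // self.chunk_length
        return self.low + offset * self.chunk_length


# The two dimensions of the Example array: X=0:*,1000,0 and J=0:*,1000,0.
EXAMPLE_DIMS = (Dimension("X", 0, None, 1000), Dimension("J", 0, None, 1000))


def chunk_of(cell: Tuple[int, ...], dims=EXAMPLE_DIMS) -> Tuple[int, ...]:
    """Map a cell's coordinates to the origin of its chunk."""
    return tuple(d.chunk_origin(c) for d, c in zip(dims, cell))


print(chunk_of((1500, 999)))   # (1000, 0)
```

With a chunk length of 1000 in both dimensions, every chunk of Example holds up to one million cells, which is the unit a system like this can store and move around as a block.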


SciDB – The (Very) Short Tour – 2.

• Query Languages
  – High Level API – What not How

    SELECT SUM ( data ) AS Sum_Data
    FROM between ( Example, 500, 500, 1500, 1500 );

    SELECT MEDIAN( height ) AS Median_Height,
           AVG ( height ) AS Avg_Height
    FROM slice ( geodata, platformid, 3 )
    WHERE sensorzenith < 35.0
    REGRID AS ( PARTITION BY longitude 1000, latitude 1000,
                start_time geo_range ( '10 days' ) );

AQL looks a bit like SQL, but the underlying algebra is arrays, not sets.
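To make the array reading of these queries concrete, here is a rough plain-Python sketch of the two ideas they rely on: selecting a rectangular window by coordinates, and regridding by replacing fixed-size blocks of cells with an aggregate. It assumes the window corners are inclusive and uses a small in-memory list of lists in place of a stored array. The functions are stand-ins written for this illustration, not SciDB's implementation.

```python
from statistics import mean
from typing import List

Grid = List[List[float]]


def between(grid: Grid, lo_x: int, lo_j: int, hi_x: int, hi_j: int) -> Grid:
    """Keep the cells inside the box; both corners are treated as inclusive."""
    return [row[lo_j:hi_j + 1] for row in grid[lo_x:hi_x + 1]]


def total(grid: Grid) -> float:
    """SUM over every cell of the array."""
    return sum(sum(row) for row in grid)


def regrid(grid: Grid, block_x: int, block_j: int) -> Grid:
    """Replace each block_x-by-block_j block of cells with its average."""
    out = []
    for x in range(0, len(grid), block_x):
        out_row = []
        for j in range(0, len(grid[0]), block_j):
            block = [v for row in grid[x:x + block_x] for v in row[j:j + block_j]]
            out_row.append(mean(block))
        out.append(out_row)
    return out


# A 4x4 toy array whose cell value is 10*x + j.
toy = [[10 * x + j for j in range(4)] for x in range(4)]
print(total(between(toy, 1, 1, 2, 2)))   # sums cells 11, 12, 21, 22
print(regrid(toy, 2, 2))                 # one average per 2x2 block
```

The point of the sketch is that both operations are phrased in terms of coordinates and neighborhoods, which is what distinguishes an array algebra from a set-based one.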


SciDB – The (Very) Short Tour – 3.

• AFL
  – Functional, array level manipulation
  – Familiar to ‘R’ and Python Users

    project (
      apply (
        join (
          filter ( Masks, name = 'California' ),
          geodata
        ),
        height_color, calc_height_color(height)
      )
    );

    SELECT MEDIAN( height ) AS Median_Height,
           AVG ( height ) AS Avg_Height
    FROM slice ( geodata, platformid, 3 )
    WHERE sensorzenith < 35.0
    REGRID AS ( PARTITION BY longitude 1000, latitude 1000,
                start_time geo_range ( '10 days' ) );

Composable Query Languages allow you to build sophisticated programs by combining simple building blocks.
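The AFL expression on this slide nests four operators, each taking an array and returning an array. A small Python sketch of that composition style follows, assuming a sparse array is represented as a dictionary from coordinates to attribute records. The operator bodies, the toy data and calc_height_color are all hypothetical stand-ins for illustration; only the operator names come from the slide.

```python
from typing import Any, Callable, Dict, Tuple

Cell = Dict[str, Any]
Array = Dict[Tuple[int, int], Cell]   # (longitude, latitude) -> attributes


def filter_(arr: Array, pred: Callable[[Cell], bool]) -> Array:
    """Keep only the cells whose attributes satisfy pred."""
    return {pos: cell for pos, cell in arr.items() if pred(cell)}


def join(left: Array, right: Array) -> Array:
    """Combine attributes of cells that exist at the same position in both."""
    return {pos: {**left[pos], **right[pos]} for pos in left.keys() & right.keys()}


def apply(arr: Array, name: str, fn: Callable[[Cell], Any]) -> Array:
    """Add a computed attribute called name to every cell."""
    return {pos: {**cell, name: fn(cell)} for pos, cell in arr.items()}


def project(arr: Array, *names: str) -> Array:
    """Keep only the listed attributes."""
    return {pos: {n: cell[n] for n in names} for pos, cell in arr.items()}


def calc_height_color(height: int) -> str:
    """Hypothetical color rule, invented for this example."""
    return "green" if height < 1000 else "white"


masks = {(0, 0): {"name": "California"}, (0, 1): {"name": "California"},
         (1, 1): {"name": "Nevada"}}
geodata = {(0, 0): {"height": 120}, (0, 1): {"height": 2400},
           (1, 1): {"height": 1300}}

result = project(
    apply(
        join(filter_(masks, lambda c: c["name"] == "California"), geodata),
        "height_color",
        lambda c: calc_height_color(c["height"]),
    ),
    "height_color",
)
print(result)
```

Because every operator has the same array-in, array-out shape, any of them can be swapped, reordered or wrapped in another without changing the rest of the expression.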


SciDB Architecture
[Diagram. Labels: SciDB Client (iquery, ‘R’, Java, Python); SciDB Coordinator Node(s); SciDB Worker Nodes; SciDB Node boxes, each containing a SciDB Engine and a Local Store; PostgreSQL Persistent System Catalog Service; PostgreSQL Connection; SciDB Inter-Node Communication; numbered callouts 1 to 6.]
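The diagram shows several nodes, each with its own engine and local store, so chunks of an array have to be spread across them somehow. The slides do not say which placement policy SciDB uses. The Python sketch below simply assumes a hash of the chunk's origin coordinates, taken modulo the number of worker instances, to show what a shared-nothing placement could look like. The function name and the choice of CRC32 are assumptions made for the example.

```python
import zlib
from collections import Counter
from typing import Tuple


def instance_for_chunk(chunk_origin: Tuple[int, ...], n_instances: int) -> int:
    """Pick a worker instance for a chunk by hashing its origin coordinates."""
    key = ",".join(str(c) for c in chunk_origin).encode("ascii")
    return zlib.crc32(key) % n_instances


# Spread a 10x10 grid of chunks (chunk length 1000) over 4 workers.
placement = Counter(
    instance_for_chunk((x, j), 4)
    for x in range(0, 10_000, 1000)
    for j in range(0, 10_000, 1000)
)
print(sorted(placement.items()))   # chunk count per worker instance
```

A deterministic hash means any node can work out where a chunk lives without asking a central directory, which is one reason hash placement is a common choice in clustered stores.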


Scientific “Big Data”, “Big Analytics”
• Dark Matter Detector – LUX
  – 1 TB per day
  – 100 collaborators (research grants)
  – Find “interesting” particle collisions in a barrel holding 370 kg of liquid xenon, where interesting events are very rare.
• Metabolic Atlas – Mass Spectrometry DB
  – Genomics + Phenotype + Proteomics
  – “What is alive in this drop of sea-water?”
• Next Generation Genome Sequencing
  – Cost of sequencing a human genome is collapsing: 2000 - $1M, 2010 - $10K, 2015 - $1K
  – Data per sequencing process is growing: Ion Torrent Sequencing – 80 B reads of 400 bp / read @ $1 per M bp in 2 hours
  – Gene sequencing will be a routine part of medicine by the end of the decade.


Surprises Along the Way
• There is commercial demand for SciDB!
  • Image processing applications in Radiology, Bio-IT.
  • Remote sensing applications interesting to various Govt. agencies and some commercial entities (agriculture, logistics).
  • Geo-located sensors in vehicles; driver behavior for insurance.
• Arrays for more than just images
  • Genome database: 2D array [ sample x base_pair ]
  • Timeseries data: [ anything x time ]
  • Graph Analytics: [ calling_phone_# x called_phone_# ]
• Traction on the “Scientific Warehouse”
  • Cost savings by centralizing infrastructure
  • Productivity advantages from cross-team collaboration
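The graph analytics case, [ calling_phone_# x called_phone_# ], can be illustrated with a few lines of plain Python. The sketch assumes a made-up list of call records, builds the two-dimensional count array from it, and then answers two questions with array operations: how many calls each phone made (a row sum) and how many two-step call paths connect each pair (a matrix product). The data and helper names are invented for the example.

```python
from typing import List, Tuple

# Hypothetical call records: (calling phone, called phone).
calls: List[Tuple[str, str]] = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("A", "B"),
]

phones = sorted({p for call in calls for p in call})
index = {p: i for i, p in enumerate(phones)}
n = len(phones)

# counts[i][j] = number of calls from phones[i] to phones[j]
counts = [[0] * n for _ in range(n)]
for caller, called in calls:
    counts[index[caller]][index[called]] += 1


def matmul(a: List[List[int]], b: List[List[int]]) -> List[List[int]]:
    """Multiply two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


# Row sums: calls made by each phone.
calls_made = {p: sum(counts[index[p]]) for p in phones}

# two_hop[i][j] = number of ways phones[i] reaches phones[j] through one
# intermediate phone, counting repeated calls separately.
two_hop = matmul(counts, counts)

print(calls_made)
print(two_hop)
```

Once the call log is an array, questions about the graph turn into linear algebra, which is the kind of workload an array engine is built to distribute.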


Strong Claim # 3: Tools and methodologies that have traditionally been restricted to “scientific” research will become central to “commercial” and “industrial” data processing.


Conclusions
• Dirt-bags told the Propeller-heads what they wanted …
  • Scalable, flexible array storage and data processing.
  • Platform for collaborative analytics on machine-data.
• Propeller-heads responded …
  • Not just SciDB!
  • MonetDB, Rasdaman, InfluxDB – all array DBMSs
  • Struggling to fit scientific data processing into other paradigms – SQL + HDFS.
• And not a moment too soon!
  • Shift of management approach in Big Science towards shared infrastructure (cost saving, productivity).
  • Multiple “commercial” consumers who need to use scientific tools and methods in their analysis.