Data Science Languages and Industry Analytics


Wes McKinney, BIDS 2015-09-19


Me

• Serial creator of structured data tools / user interfaces
• Mathematician, MIT '07
• Professional SQL programmer 2007-2010 (@ AQR)
• Created pandas, April 2008
• Wrote Python for Data Analysis 2012
• Founder of DataPad -> Cloudera


A sample big data architecture

[Figure: architecture diagram. Components shown: application data, Kafka (four instances), S3 or HDFS, JSON, Spark/MapReduce, columnar storage, analytic SQL engine, SQL, user.]


Big data architectures are currently dominated by Java / JVM languages. Python/R/Julia don't have much of a "seat at the table".


Some simplistic generalizations

Industry Analytics
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines

Scientific Computing
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats
• Fewer physical machines


Many interactive-speed SQL engines

… and more


Ibis: not the direct subject of this talk

• http://blog.ibis-project.org
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL programming from user workflows
• Develop high-performance Python extension APIs

• Pythonic composable DSL designed to target SQL semantics
• Development roadmap targets the Impala (C++ / LLVM) query engine
• … but SQL compiler toolchain works well with other SQL dialects


Enabling interoperability with big data systems

• Distributed / MPP query engines: implemented in a host language
  • Typically C++, Java, or Scala

• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol

• External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)


What are UDFs good for?

• Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines

• Custom data transformations
• Custom domain logic (date / time / data types)
• Custom data types
• Custom aggregations (incl. machine learning / statistics expressible as reductions)
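A custom aggregation that is expressible as a reduction can be sketched roughly as follows. This is a minimal illustration in plain Python, not any query engine's UDF API; the names `MeanState`, `update`, `merge`, `finalize`, and `aggregate` are hypothetical. Each partition is folded into a small state, and the partial states are then merged.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class MeanState:
    """Partial aggregation state for a mean (hypothetical name)."""
    total: float = 0.0
    count: int = 0


def update(state, value):
    # Fold one input value into the partial state.
    return MeanState(state.total + value, state.count + 1)


def merge(a, b):
    # Combine two partial states computed on different partitions.
    return MeanState(a.total + b.total, a.count + b.count)


def finalize(state):
    # Turn the merged state into the final answer (None for empty input).
    return state.total / state.count if state.count else None


def aggregate(partitions):
    # Reduce each partition independently, then merge the partial states.
    partials = [reduce(update, part, MeanState()) for part in partitions]
    return finalize(reduce(merge, partials, MeanState()))
```

Because `merge` only needs the partial states, the per-partition work could in principle run on different machines before a final combine step.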


Why are external UDFs slow?

• Serialization / deserialization overhead
• Scalar vs vectorized computations
• RPC overhead
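The per-row cost can be illustrated with a toy model. This sketch assumes `pickle` as a stand-in for whatever wire format an engine actually uses, and the function names (`add_one`, `scalar_external_udf`, `batched_external_udf`) are hypothetical. It counts round trips rather than measuring time.

```python
import pickle


def add_one(x):
    # The user's scalar function (hypothetical example UDF).
    return x + 1


def scalar_external_udf(values):
    """One serialize / call / deserialize round trip per row."""
    out = []
    round_trips = 0
    for v in values:
        payload = pickle.dumps(v)                        # engine -> UDF process
        result = add_one(pickle.loads(payload))
        out.append(pickle.loads(pickle.dumps(result)))   # UDF -> engine
        round_trips += 1
    return out, round_trips


def batched_external_udf(values):
    """One round trip for the whole column of values."""
    payload = pickle.dumps(values)
    results = [add_one(v) for v in pickle.loads(payload)]
    return pickle.loads(pickle.dumps(results)), 1
```

Both functions return the same values; the scalar version pays the serialization and call overhead once per row, the batched version once per column.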


How to make them fast?

• Common runtime memory representation for tabular data
• Shared-memory (zero-copy or memcpy-only) external UDF protocol
• Vectorized UDF interface (for interpreted languages)
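As a rough sketch of the zero-copy idea, the standard-library `array` and `memoryview` types can stand in for engine-owned memory and a shared view of it. The function name `vectorized_udf` is hypothetical, and no real engine protocol is implied.

```python
from array import array


def vectorized_udf(column):
    """Hypothetical vectorized UDF: doubles every value in place.

    `column` is a memoryview over the caller's buffer, so the UDF
    receives the whole column at once and no copy is made.
    """
    for i in range(len(column)):
        column[i] = column[i] * 2


engine_buffer = array("q", [1, 2, 3, 4])   # stands in for engine-owned memory
view = memoryview(engine_buffer)            # zero-copy view of the same bytes
vectorized_udf(view)
view.release()
```

After the call, `engine_buffer` itself holds the doubled values. The loop here still runs in the interpreter; an array library would do the per-element work in native code.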


Memory representation

• Many query engines are standardizing on an in-memory columnar representation of materialized transient data
  • Apache Drill: https://drill.apache.org/faq/
  • Spark
  • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/

• Industry-standard serialization format: Apache Parquet
  • https://parquet.apache.org/


Serialization vs in-memory

• Serialization formats (e.g. Parquet)
  • Optimize for IO / DFS throughput at the expense of CPU / memory bus throughput
  • Do not consider random access or in-memory analytics as a goal

• No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)


One possible proposal

• Standardize on an augmented variant of the Apache Drill in-memory columnar memory layout
  • https://drill.apache.org/docs/value-vectors/

• Common / shared C implementation for R/Python/Julia
• Currently all languages have poor support for JSON-like data
• Make your needs known!
• Enumerate required data types and other requirements


More on the Drill layout

persons = [
  {
    name: 'wes',
    addresses: [
      {number: 2, street: 'a'},
      {number: 3, street: 'bb'},
    ]
  },
  {
    name: 'mark',
    addresses: [
      {number: 4, street: 'ccc'},
      {number: 5, street: 'dddd'},
      {number: 6, street: 'f'},
    ]
  },
]


Strings in Drill

person.name
  offset: 0 3
  data:   wesmark
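The offsets-plus-data idea for strings can be sketched in a few lines of Python. This is an illustration only, not Drill's implementation; `encode_strings` and `get_string` are hypothetical names. The sketch also assumes a trailing end offset (n + 1 offsets for n values), which the slide does not list, so that each value's length can be recovered.

```python
def encode_strings(values):
    """Pack strings into one contiguous byte buffer plus n + 1 offsets."""
    offsets = [0]
    data = bytearray()
    for s in values:
        data += s.encode("utf-8")
        offsets.append(len(data))   # end of this value = start of the next
    return offsets, bytes(data)


def get_string(offsets, data, i):
    # Value i occupies data[offsets[i]:offsets[i + 1]].
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


offsets, data = encode_strings(["wes", "mark"])
```

With this layout, looking up value `i` takes two offset reads and one slice, with no scan over the preceding strings.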


Array<Struct> example

person.addresses
  offset: 0 2 5

person.addresses.street
  offset: 0 1 3 6 10
  data:   abbcccddddf

person.addresses.number
  data:   2 3 4 5 6
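One way to picture how the nested `addresses` field is split into flat columns is the sketch below. It is not Drill's code; `shred_addresses` is a hypothetical name, and the string offsets include an assumed trailing end offset that the slide omits.

```python
def shred_addresses(persons):
    """Split a list-of-structs field into flat columns plus offsets."""
    list_offsets = [0]        # where each person's addresses start and end
    numbers = []              # person.addresses.number
    street_offsets = [0]      # person.addresses.street offsets
    street_data = bytearray() # person.addresses.street bytes
    for person in persons:
        for addr in person["addresses"]:
            numbers.append(addr["number"])
            street_data += addr["street"].encode("utf-8")
            street_offsets.append(len(street_data))
        list_offsets.append(len(numbers))
    return list_offsets, numbers, street_offsets, bytes(street_data)


persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]

list_offsets, numbers, street_offsets, street_data = shred_addresses(persons)
```

Each struct field becomes its own contiguous column, and a single offsets array records which run of entries belongs to which person.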


Array<Array<Int32>> example

persons = [
  {
    name: 'wes',
    fav_sequences: [
      [0, 1, 2],
      [2, 3]
    ]
  },
  {
    name: 'mark',
    fav_sequences: [
      [3],
      [4, 5],
      [6, 7]
    ]
  },
]

person.fav_sequences
  offset: 0 2 5

person.fav_sequences/values
  offset: 0 3 5 6 8
  data:   0 1 2 2 3 3 4 5 6 7
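The two levels of offsets for a list of lists can be sketched as follows. Again this is only an illustration with hypothetical names (`flatten_nested`, `get_row`), and it assumes n + 1 offsets at both levels, so the inner offsets end with a final 10 that the slide does not show.

```python
def flatten_nested(rows):
    """Flatten a list of lists of ints into two offset arrays and one values array."""
    outer_offsets = [0]   # which inner lists belong to each row
    inner_offsets = [0]   # which values belong to each inner list
    values = []
    for row in rows:
        for seq in row:
            values.extend(seq)
            inner_offsets.append(len(values))
        outer_offsets.append(len(inner_offsets) - 1)
    return outer_offsets, inner_offsets, values


def get_row(outer_offsets, inner_offsets, values, i):
    # Rebuild row i by slicing the flat values through both offset levels.
    return [
        values[inner_offsets[j]:inner_offsets[j + 1]]
        for j in range(outer_offsets[i], outer_offsets[i + 1])
    ]


rows = [
    [[0, 1, 2], [2, 3]],       # wes
    [[3], [4, 5], [6, 7]],     # mark
]
outer_offsets, inner_offsets, values = flatten_nested(rows)
```

All integers end up in one contiguous array, and any row can be rebuilt from the offsets without touching the other rows.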


Thank you

Wes McKinney
@wesmckinn
Views are my own
