1 © Cloudera, Inc. All rights reserved.
Data Science Languages and Industry Analytics
Wes McKinney, BIDS 2015-09-19
Me
• Serial creator of structured data tools / user interfaces
• Mathematician, MIT '07
• Professional SQL programmer 2007-2010 (@ AQR)
• Created pandas, April 2008
• Wrote Python for Data Analysis 2012
• Founder of DataPad -> Cloudera
A sample big data architecture
[Architecture diagram. Components shown: application data, Kafka (four instances), S3 or HDFS (JSON), Spark/MapReduce, columnar storage, analytic SQL engine, user (SQL).]
Big data architectures are currently dominated by Java / JVM languages
Python/R/Julia don't have much of a "seat at the table"
Some simplistic generalizations
Industry Analytics
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines
Scientific Computing
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats
• Fewer physical machines
Many interactive-speed SQL engines
… and more
Ibis: not the direct subject of this talk
• http://blog.ibis-project.org
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL programming from user workflows
• Develop high-performance Python extension APIs
• Pythonic composable DSL designed to target SQL semantics
• Development roadmap targets the Impala (C++ / LLVM) query engine
• … but the SQL compiler toolchain works well with other SQL dialects
Enabling interoperability with big data systems
• Distributed / MPP query engines: implemented in a host language
  • Typically C++, Java, or Scala
• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol
• External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
What are UDFs good for?
• Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
• Custom data transformations
• Custom domain logic (date / time / data types)
• Custom data types
• Custom aggregations (incl. machine learning / statistics expressible as reductions)
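To make "aggregations expressible as reductions" concrete, here is a minimal sketch of a variance aggregate split into update / merge / finalize steps, the shape that lets each worker aggregate its own partition before partial results are combined. The names `VarState`, `update`, `merge`, and `finalize` are hypothetical and are not taken from Hive, Impala, or any other engine's UDF API.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class VarState:
    """Partial aggregation state: count, sum, and sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0


def update(state, x):
    """Fold one value into a partial state (would run on each worker)."""
    return VarState(state.n + 1, state.total + x, state.total_sq + x * x)


def merge(a, b):
    """Combine two partial states computed on different partitions."""
    return VarState(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def finalize(state):
    """Turn the merged state into the population variance."""
    mean = state.total / state.n
    return state.total_sq / state.n - mean * mean


# Two partitions standing in for data held by two workers.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]
partials = [reduce(update, part, VarState()) for part in partitions]
variance = finalize(reduce(merge, partials))
```

The sum-of-squares formula is chosen for brevity; it is numerically fragile on large or badly scaled data, so a production aggregate would likely use a more stable merge rule.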
Why are external UDFs slow?
• Serialization / deserialization overhead
• Scalar vs vectorized computations
• RPC overhead
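The scalar versus vectorized point can be illustrated by counting how often control crosses into the UDF. This is a sketch only: the function names and the batch size of 1,024 are assumptions, and no real engine or RPC layer is involved. With an external UDF, each crossing would also pay the serialization and RPC costs listed above.

```python
# Count how many times the "engine" has to call into user code.
calls = {"scalar": 0, "vectorized": 0}


def add_one_scalar(x):
    """Hypothetical scalar UDF: invoked once per row."""
    calls["scalar"] += 1
    return x + 1


def add_one_vectorized(xs):
    """Hypothetical vectorized UDF: invoked once per batch of rows."""
    calls["vectorized"] += 1
    return [x + 1 for x in xs]


column = list(range(10_000))
batch_size = 1_024  # assumed batch size, chosen only for illustration

# Scalar interface: one call per value.
scalar_result = [add_one_scalar(x) for x in column]

# Vectorized interface: one call per batch.
vectorized_result = []
for start in range(0, len(column), batch_size):
    batch = column[start:start + batch_size]
    vectorized_result.extend(add_one_vectorized(batch))
```

Both paths compute the same column, but the scalar path makes 10,000 calls and the vectorized path makes 10.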
How to make them fast?
• Common runtime memory representation for tabular data
• Shared-memory (zero-copy or memcpy-only) external UDF protocol
• Vectorized UDF interface (for interpreted languages)
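A single-process sketch of the zero-copy idea, using Python's buffer protocol: the UDF receives a view onto the column's memory rather than a serialized copy. The function `double_in_place` is hypothetical, and a real cross-process protocol would additionally need something like a shared memory segment, which is not shown here.

```python
from array import array

# The "engine" side owns this contiguous buffer of float64 values.
column = array("d", [1.0, 2.0, 3.0, 4.0])


def double_in_place(view):
    """Hypothetical vectorized UDF that works directly on the shared buffer."""
    for i in range(len(view)):
        view[i] = view[i] * 2.0


# memoryview exposes the same memory without duplicating any bytes.
view = memoryview(column)
double_in_place(view)

# For contrast, constructing a new array copies the data (the memcpy-only case).
copied = array("d", column)
```

Because the view aliases the original buffer, the UDF's writes are visible in `column` with no serialization step in either direction.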
Memory representation
• Many query engines are standardizing on an in-memory columnar representation of materialized transient data
  • Apache Drill: https://drill.apache.org/faq/
  • Spark
  • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
• Industry-standard serialization format: Apache Parquet
  • https://parquet.apache.org/
Serialization vs In-memory
• Serialization formats (e.g. Parquet)
  • Optimize for IO / DFS throughput at the expense of CPU / memory bus throughput
  • Do not consider random access or in-memory analytics as a goal
• No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
One possible proposal
• Standardize on an augmented variant of the Apache Drill in-memory columnar layout
  • https://drill.apache.org/docs/value-vectors/
• Common / shared C implementation for R/Python/Julia
• Currently all languages have poor support for JSON-like data
• Make your needs known!
• Enumerate required data types and other requirements
More on the Drill layout
persons = [
  {
    name: 'wes',
    addresses: [
      {number: 2, street: 'a'},
      {number: 3, street: 'bb'},
    ]
  },
  {
    name: 'mark',
    addresses: [
      {number: 4, street: 'ccc'},
      {number: 5, street: 'dddd'},
      {number: 6, street: 'f'},
    ]
  },
Strings in Drill
person.name
offset: 0 3
data: wesmark
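The offset-plus-data layout for strings can be sketched in a few lines. This is an illustration of the idea, not Drill's implementation: `encode_strings` and `get_string` are hypothetical helpers, and the sketch assumes one extra trailing offset marking the end of the last string, whereas the slide shows only the start offsets.

```python
def encode_strings(values):
    """Pack strings into one contiguous byte buffer plus a list of offsets."""
    offsets = [0]
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))  # end of this string = start of the next
    return offsets, bytes(data)


def get_string(offsets, data, i):
    """Random access to string i by slicing between two adjacent offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


offsets, data = encode_strings(["wes", "mark"])
second_name = get_string(offsets, data, 1)
```

Here `offsets[:-1]` matches the start offsets on the slide (0 and 3), and the byte buffer is `wesmark`.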
Array<Struct> example
person.addresses
offset: 0 2 5
person.addresses.street
offset: 0 1 3 6 10
data: abbcccddddf
person.addresses.number
values: 2 3 4 5 6
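A sketch of how the persons example decomposes into these columns. It is not Drill code: the variable names and the helper `streets_of` are hypothetical, and the string offsets here carry one extra trailing entry for the end of the last street, which the slide does not show.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]

address_offsets = [0]      # person i owns addresses [offsets[i], offsets[i + 1])
street_offsets = [0]       # byte range of each street within street_data
street_data = bytearray()  # all street strings, concatenated
numbers = []               # one entry per address, across all persons

for person in persons:
    for address in person["addresses"]:
        numbers.append(address["number"])
        street_data += address["street"].encode("utf-8")
        street_offsets.append(len(street_data))
    address_offsets.append(len(numbers))


def streets_of(i):
    """Reassemble the street names belonging to person i."""
    return [
        street_data[street_offsets[j]:street_offsets[j + 1]].decode("utf-8")
        for j in range(address_offsets[i], address_offsets[i + 1])
    ]
```

Each struct field becomes its own column, and the single `address_offsets` list is enough to map any person to a slice of every child column.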
Array<Array<Int32>> example
persons = [
  {
    name: 'wes',
    fav_sequences: [
      [0, 1, 2],
      [2, 3]
    ]
  },
  {
    name: 'mark',
    fav_sequences: [
      [3],
      [4, 5],
      [6, 7]
    ]
  },
person.fav_sequences
offset: 0 2 5
person.fav_sequences/values
offset: 0 3 5 6 8
values: 0 1 2 2 3 3 4 5 6 7
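The nested-list case uses two levels of offsets over one flat values buffer. The sketch below assumes that layout and adds a trailing end offset at the inner level (the slide lists only the start offsets); `sequences_of` is a hypothetical helper, not part of any library.

```python
from array import array

persons = [
    {"name": "wes", "fav_sequences": [[0, 1, 2], [2, 3]]},
    {"name": "mark", "fav_sequences": [[3], [4, 5], [6, 7]]},
]

outer_offsets = [0]  # person i owns sequences [outer[i], outer[i + 1])
inner_offsets = [0]  # sequence j owns values [inner[j], inner[j + 1])
values = array("i")  # C int, which is 32 bits on common platforms

for person in persons:
    for seq in person["fav_sequences"]:
        values.extend(seq)
        inner_offsets.append(len(values))
    # Number of sequences written so far.
    outer_offsets.append(len(inner_offsets) - 1)


def sequences_of(i):
    """Rebuild the list of sequences for person i from the flat buffers."""
    return [
        values[inner_offsets[j]:inner_offsets[j + 1]].tolist()
        for j in range(outer_offsets[i], outer_offsets[i + 1])
    ]
```

All integers live in one contiguous buffer, so a scan over every value needs no pointer chasing, while the offsets still allow per-person access.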
Thank you
Wes McKinney @wesmckinn
Views are my own