© Cloudera, Inc. All rights reserved. Data Science Languages and Industry Analytics, Wes McKinney, BIDS 2015-09-19

Data Science Languages and Industry Analytics

Page 1: Data Science Languages and Industry Analytics

Data Science Languages and Industry Analytics
Wes McKinney, BIDS 2015-09-19

Page 2: Data Science Languages and Industry Analytics

Me

• Serial creator of structured data tools / user interfaces
• Mathematician, MIT '07
• Professional SQL programmer 2007-2010 (@ AQR)
• Created pandas, April 2008
• Wrote Python for Data Analysis 2012
• Founder of DataPad -> Cloudera

Page 3: Data Science Languages and Industry Analytics

A sample big data architecture

[Diagram; labels in order of appearance: Kafka (x4), Application data, S3 or HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, User, SQL]

Page 4: Data Science Languages and Industry Analytics

Big data architectures currently dominated by Java / JVM languages

Python/R/Julia don't have much of a "seat at the table"

Page 5: Data Science Languages and Industry Analytics

Some simplistic generalizations

Industry Analytics
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines

Scientific Computing
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats
• Fewer physical machines

Page 6: Data Science Languages and Industry Analytics

Many Interactive-speed SQL engines

… and more

Page 7: Data Science Languages and Industry Analytics

Ibis: not the direct subject of this talk

• http://blog.ibis-project.org
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL-programming from user workflows
• Develop high performance Python extension APIs
• Pythonic composable DSL designed to target SQL semantics
• Development roadmap targets Impala (C++ / LLVM) query engine
• … but SQL compiler toolchain works well with other SQL dialects

Page 8: Data Science Languages and Industry Analytics

Enabling interoperability with big data systems

• Distributed / MPP query engines: implemented in a host language
  • Typically C++, Java, or Scala
• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol
• External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)

Page 9: Data Science Languages and Industry Analytics

What are UDFs good for?

• Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
• Custom data transformations
• Custom domain logic (date / time / data types)
• Custom data types
• Custom aggregations (incl. machine learning / statistics expressible as reductions)
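
To make "expressible as reductions" concrete, here is a minimal sketch of a custom aggregation (a mean) written as update / merge / finalize steps. The names `MeanState`, `update`, `merge`, and `finalize` are illustrative assumptions, not the UDF API of Hive, Impala, or any other engine.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class MeanState:
    """Partial aggregation state: enough to merge results from any two partitions."""
    total: float = 0.0
    count: int = 0


def update(state: MeanState, value: float) -> MeanState:
    # Fold one input value into a partition-local state.
    return MeanState(state.total + value, state.count + 1)


def merge(left: MeanState, right: MeanState) -> MeanState:
    # Combine two partial states; associative, so partitions can merge in any order.
    return MeanState(left.total + right.total, left.count + right.count)


def finalize(state: MeanState) -> float:
    # Produce the final answer from the fully merged state.
    return state.total / state.count if state.count else float("nan")


# Hypothetical data split across three workers.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
partials = [reduce(update, part, MeanState()) for part in partitions]
mean = finalize(reduce(merge, partials, MeanState()))
```

Because `merge` is associative, each worker can aggregate its own partition and ship only a small state object, which is what makes this shape a good fit for distributed engines.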

Page 10: Data Science Languages and Industry Analytics

Why are external UDFs slow?

• Serialization / deserialization overhead
• Scalar vs vectorized computations
• RPC overhead
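
A rough sketch of the first two costs, assuming `pickle` as a stand-in for whatever wire format an engine uses and a hypothetical `udf` function. It runs in a single process, so it only counts simulated round trips rather than measuring real RPC latency.

```python
import pickle


def udf(x: float) -> float:
    # Stand-in for user logic written in the external language.
    return x * 2.0 + 1.0


def call_scalar(values):
    """One simulated round trip per value: serialize, call, deserialize."""
    round_trips = 0
    out = []
    for v in values:
        request = pickle.dumps(v)            # engine -> UDF process
        result = udf(pickle.loads(request))
        response = pickle.dumps(result)      # UDF process -> engine
        out.append(pickle.loads(response))
        round_trips += 1
    return out, round_trips


def call_batched(values):
    """One simulated round trip for the whole batch of values."""
    request = pickle.dumps(values)
    results = [udf(v) for v in pickle.loads(request)]
    response = pickle.dumps(results)
    return pickle.loads(response), 1


values = [float(i) for i in range(1000)]
scalar_out, scalar_trips = call_scalar(values)
batched_out, batched_trips = call_batched(values)
```

Both paths return the same results, but the scalar path pays the serialization and call overhead once per value. Batching amortizes that cost, although the data is still copied and re-encoded in both directions.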

Page 11: Data Science Languages and Industry Analytics

How to make them fast?

• Common runtime memory representation for tabular data
• Shared-memory (zero-copy or memcpy-only) external UDF protocol
• Vectorized UDF interface (for interpreted languages)
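
The idea of a vectorized, zero-copy interface can be sketched with the standard library alone. This assumes the engine hands the UDF whole column buffers; `vectorized_udf`, `column`, and `result` are hypothetical names, and both buffers live in one process here, whereas a real protocol would map them across process boundaries.

```python
from array import array


def vectorized_udf(inp: memoryview, out: memoryview) -> None:
    """Vectorized interface: receives whole column buffers, not single values."""
    for i in range(len(inp)):
        out[i] = inp[i] * 2.0 + 1.0


# Hypothetical engine-owned column buffers of float64 values.
column = array("d", [1.0, 2.0, 3.0, 4.0])
result = array("d", [0.0] * len(column))

# A memoryview exposes the same memory without copying it.
in_view = memoryview(column)
out_view = memoryview(result)
vectorized_udf(in_view, out_view)
```

The UDF is called once per column rather than once per value, and it writes straight into the output buffer. The Python-level loop is only for illustration; in practice the body would be a single call into a compiled array library.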

Page 12: Data Science Languages and Industry Analytics

Memory representation

• Many query engines are standardizing on in-memory columnar representation of materialized transient data
  • Apache Drill: https://drill.apache.org/faq/
  • Spark
  • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
• Industry-standard serialization format: Apache Parquet
  • https://parquet.apache.org/

Page 13: Data Science Languages and Industry Analytics

Serialization vs In-memory

• Serialization formats (e.g. Parquet)
  • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
  • Do not consider random access or in-memory analytics as a goal
• No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)

Page 14: Data Science Languages and Industry Analytics

One possible proposal

• Standardize on an augmented variant of the Apache Drill in-memory columnar memory layout
  • https://drill.apache.org/docs/value-vectors/
• Common / shared C impl for R/Python/Julia
• Currently all languages have poor support for JSON-like data
• Make your needs known!
• Enumerate required data types and other requirements

Page 15: Data Science Languages and Industry Analytics

More on the Drill layout

persons = [
  {
    name: 'wes',
    addresses: [
      {number: 2, street: 'a'},
      {number: 3, street: 'bb'},
    ]
  },
  {
    name: 'mark',
    addresses: [
      {number: 4, street: 'ccc'},
      {number: 5, street: 'dddd'},
      {number: 6, street: 'f'},
    ]
  },
]

Page 16: Data Science Languages and Industry Analytics

Strings in Drill

person.name
  offset: 0, 3
  data: wesmark
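
A minimal sketch of this offsets-plus-data layout for the two names on the slide. The helper names `encode_strings` and `get_string` are hypothetical, and storing a trailing end offset (7 here) is an assumption of the sketch; the slide shows only the start offsets 0 and 3.

```python
def encode_strings(strings):
    """Pack strings into one contiguous byte buffer plus a list of offsets."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))   # end of this string == start of the next
    return offsets, bytes(data)


def get_string(offsets, data, i):
    # Random access: slice the buffer between two consecutive offsets.
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


offsets, data = encode_strings(["wes", "mark"])
second_name = get_string(offsets, data, 1)
```

Any string can be read by slicing between two consecutive offsets, with no per-string objects and no scan of the preceding strings.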

Page 17: Data Science Languages and Industry Analytics

Array<Struct> example

person.addresses
  offset: 0, 2, 5

person.addresses.number
  values: 2, 3, 4, 5, 6

person.addresses.street
  offset: 0, 1, 3, 6, 10
  data: abbcccddddf
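
As an illustration of how these buffers fit together, the sketch below rebuilds each person's addresses from the columnar arrays. The buffer contents are transcribed from the slide; the final street offset (11) is an assumption added so every string has an explicit end, and `addresses_of` is a hypothetical helper.

```python
# person i owns the addresses in [address_offsets[i], address_offsets[i + 1])
address_offsets = [0, 2, 5]

# person.addresses.number: one value per address, stored contiguously
numbers = [2, 3, 4, 5, 6]

# person.addresses.street: offsets into one shared byte buffer.
# The trailing 11 is assumed; the slide lists 0, 1, 3, 6, 10.
street_offsets = [0, 1, 3, 6, 10, 11]
street_data = b"abbcccddddf"


def addresses_of(person_index):
    """Materialize one person's addresses from the columnar buffers."""
    start = address_offsets[person_index]
    stop = address_offsets[person_index + 1]
    return [
        {
            "number": numbers[j],
            "street": street_data[street_offsets[j]:street_offsets[j + 1]].decode("utf-8"),
        }
        for j in range(start, stop)
    ]


wes_addresses = addresses_of(0)
mark_addresses = addresses_of(1)
```

Each struct field is its own column, so an engine that only needs `number` never has to touch the street bytes.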

Page 18: Data Science Languages and Industry Analytics

Array<Array<Int32>> example

persons = [
  {
    name: 'wes',
    fav_sequences: [
      [0, 1, 2],
      [2, 3]
    ]
  },
  {
    name: 'mark',
    fav_sequences: [
      [3],
      [4, 5],
      [6, 7]
    ]
  },
]

person.fav_sequences
  offset: 0, 2, 5

person.fav_sequences/values
  offset: 0, 3, 5, 6, 8
  values: 0, 1, 2, 2, 3, 3, 4, 5, 6, 7
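
Going the other way, this sketch flattens the nested lists from the slide into two offset arrays and one values array. `encode_nested` is a hypothetical helper, and it appends a final end offset (10) to the inner offsets, which the slide does not show.

```python
def encode_nested(people_sequences):
    """Flatten a list of lists of ints per person into offsets and values."""
    outer_offsets = [0]   # person -> range of inner lists
    inner_offsets = [0]   # inner list -> range of values
    values = []
    for sequences in people_sequences:
        for seq in sequences:
            values.extend(seq)
            inner_offsets.append(len(values))
        # Number of inner lists seen so far marks the end of this person.
        outer_offsets.append(len(inner_offsets) - 1)
    return outer_offsets, inner_offsets, values


fav_sequences = [
    [[0, 1, 2], [2, 3]],        # wes
    [[3], [4, 5], [6, 7]],      # mark
]
outer_offsets, inner_offsets, values = encode_nested(fav_sequences)
```

Each extra level of nesting adds one more offsets array, while the leaf values stay in a single contiguous buffer.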

Page 19: Data Science Languages and Industry Analytics

Thank you
Wes McKinney
@wesmckinn
Views are my own