Raising the Tides: Open Source Analytics for Data Science
Wes McKinney @wesmckinn
NEWSWEEK AI & DATA SCIENCE CONFERENCE – CAPITAL MARKETS
2 MARCH 2017



Me


Important Legal Information

• The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time.

• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.

• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved


In the next 20 minutes

∞ Important trends in the industry
∞ Two Sigma involvement in open source
∞ Growing the community


WHAT I’M SEEING TODAY


Industry giants open source core AI and machine learning technology


Open source “disruption” in data science languages and supporting technologies


Observation #1:

User Mindshare is a Key Asset


Observation #2:

Tools may be less important than human capital and data


Two Sigma

Building a state-of-the-art, collaborative data science platform


Scaling data science in many dimensions

∞ Access to diverse data sets
∞ Enhancing individual productivity
∞ Computational capabilities: larger and more complex data sets
∞ Collaboration within and across teams


TOOLS AND THE “DATA SCIENTIST SHORTAGE”


WHY WE PARTICIPATE IN OPEN SOURCE


Why we participate in Open Source

1. Drive progress and innovation in foundational technologies
2. Increase the overall value, interoperability, and sustainability of our closed source systems
3. Raise awareness of problems faced at scale on real world data
4. Benefit sooner from open source innovations
5. Attract and retain the best engineering talent


Where we are investing

Collaboration and Publishing

Cluster Resource Management

Scalable / Distributed Computing

High Performance Data Processing


Core data infrastructure technologies

Apache Arrow
• Efficient columnar in-memory data processing
• High-speed, interoperable data messaging for Java, C++, Python

Apache Parquet
• Industry-standard columnar file format for distributed storage
• Efficient IO for Spark, Python, etc.
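To make the "columnar" idea on this slide concrete, here is a minimal sketch in standard-library Python. It does not use Apache Arrow or Apache Parquet and is not their API. The sample records, the field names, and the `to_columns` helper are all hypothetical, chosen only to illustrate how a column-oriented layout keeps each field in its own contiguous, typed buffer.

```python
from array import array

# Hypothetical sample data: three records stored row-wise (one dict per record).
rows = [
    {"symbol": "AAA", "price": 10.0, "size": 100},
    {"symbol": "BBB", "price": 20.5, "size": 200},
    {"symbol": "CCC", "price": 30.25, "size": 300},
]


def to_columns(records):
    """Pivot row-oriented records into one sequence per field.

    Numeric fields go into typed, contiguous buffers from the stdlib
    `array` module. This is an illustration only, not Arrow's format.
    """
    return {
        "symbol": [r["symbol"] for r in records],
        "price": array("d", (r["price"] for r in records)),  # 8-byte floats
        "size": array("q", (r["size"] for r in records)),    # 8-byte signed ints
    }


columns = to_columns(rows)

# A column-wise aggregate reads only the single buffer it needs,
# instead of walking every field of every record.
total_size = sum(columns["size"])
mean_price = sum(columns["price"]) / len(columns["price"])
```

In this sketch, an aggregate over `size` never touches `symbol` or `price`, which is the access pattern that columnar formats are designed around.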


Open source in-memory and distributed analytics

• Popular Python analytics library

• Powerful and easy-to-use data cleaning, analytics, and time series processing

• Flint: scalable time series analytics for Spark

• Enhanced Python integration


Cluster resource management

• Scalable cluster resource manager
• Native container support

cook
• Fair job scheduler for Mesos
• Managing multi-tenant Spark clusters


Collaboration and publishing

• Notebook “kernels” for polyglot research and development

• Inter-language data exchange

• Leading web notebook & reproducible research development platform

• Interactive widgets framework


TOWARD HIGH TIDE: Preserving competitive advantage and building common knowledge


Thank you
Wes McKinney @wesmckinn