Transcript

1

Boston Hadoop User Group Meetup, July 7, 2015

Kamil Bajda-Pawlikowski, Matt Fuller

2

•  History of Teradata Center for Hadoop

–  Formerly Hadapt, founded in July 2010 by Borgman, Bajda-Pawlikowski, and Abadi

–  Pioneered SQL-on-Hadoop market

–  Based on work done by database research group in Yale Computer Science Department

–  Hybrid of Hadoop scalability and DBMS performance

•  Today

–  Acquired by Teradata in July 2014 and renamed Teradata Center for Hadoop

–  30 developers with deep Hadoop and database expertise

–  Headquarters in Boston, MA

–  Contributors to open source project Presto

Who are we? - Teradata Center for Hadoop!

3

•  What is Presto?

•  What is Teradata doing?

•  Can I see a Demo?

•  How can I contribute?

Talk Agenda

4

•  100% open source distributed ANSI SQL engine for Big Data

–  Modern code base

–  Proven scalability

–  Optimized for low-latency, interactive querying

•  Cross-platform query capability, not only SQL on Hadoop

•  Distributed under the Apache license, now supported by Teradata

•  Used by a community of well-known, well-respected technology companies

What is Presto?

5

History of Presto

•  FALL 2008: Facebook open sources Hive

•  FALL 2012: 4 developers start Presto development

•  SPRING 2013: Presto rolled out within Facebook

•  FALL 2013: Facebook open sources Presto

•  FALL 2014: 88 releases, 41 contributors, 3943 commits

•  SPRING 2015: 98 releases, 65 contributors, 4587 commits

•  Teradata joins Presto community & offers support

Timeline image courtesy of Facebook

6

Presto Architecture

•  Client

•  Coordinator

–  Parser/analyzer

–  Planner

–  Scheduler

•  Workers

•  Pluggable APIs

–  Metadata API

–  Data location API

–  Data stream API

https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920

7

Presto Extensibility – connectors

•  Coordinator

–  Parser/analyzer

–  Planner

–  Scheduler

•  Worker

•  Metadata API: Hive, Cassandra, Kafka, MySQL

•  Data location API: Hive, Cassandra, Kafka, MySQL

•  Data stream API: Hive, Cassandra, Kafka, MySQL

https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
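
The connector idea on this slide can be sketched in a few lines. The following is a minimal illustration only, not Presto's actual connector interface: the names `Connector`, `Split`, `InMemoryConnector`, and `scan` are all hypothetical, and the three methods simply mirror the three APIs named on the slide (metadata, data location, data stream). It assumes the engine touches a backend only through those three calls.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol, Tuple

Row = Tuple[object, ...]


@dataclass(frozen=True)
class Split:
    """Hypothetical unit of work: one chunk of one table."""
    table: str
    chunk: int


class Connector(Protocol):
    """Hypothetical interface grouping the three APIs named on the slide."""

    def columns(self, table: str) -> List[str]:
        """Metadata API: describe a table."""
        ...

    def splits(self, table: str) -> List[Split]:
        """Data location API: say where the table's chunks live."""
        ...

    def read(self, split: Split) -> Iterator[Row]:
        """Data stream API: stream the rows of one chunk."""
        ...


class InMemoryConnector:
    """Toy connector backed by Python lists, standing in for a real backend."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[List[Row]]]]):
        # table name -> (column names, list of chunks, each chunk a list of rows)
        self._tables = tables

    def columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def splits(self, table: str) -> List[Split]:
        chunks = self._tables[table][1]
        return [Split(table, i) for i in range(len(chunks))]

    def read(self, split: Split) -> Iterator[Row]:
        yield from self._tables[split.table][1][split.chunk]


def scan(catalogs: Dict[str, Connector], catalog: str, table: str) -> List[Row]:
    """Engine-side code: it only uses the three APIs, never a backend directly."""
    connector = catalogs[catalog]
    rows: List[Row] = []
    for split in connector.splits(table):   # a scheduler would assign these to workers
        rows.extend(connector.read(split))  # a worker pulls the rows for its split
    return rows
```

Registering a second `InMemoryConnector` under another catalog name (say `"mysql"` next to `"hive"`) lets the same `scan` function read from both, which is the point of keeping the three APIs pluggable.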

8

•  Data stays in memory during execution and is pipelined across nodes MPP-style

•  Vectorized columnar processing

•  Presto is written in highly tuned Java

–  Efficient in-memory data structures

–  Very careful coding of inner loops

–  Bytecode generation

•  Optimized ORC reader

Presto = Performance
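
To make "pipelined" and "columnar" concrete, here is a small sketch of the processing pattern, assuming a table stored as one list per column. It is not Presto code and says nothing about Presto's real data structures; the names `Page`, `scan_pages`, `filter_pages`, `sum_column`, and `run_query` are invented for illustration. Each operator consumes and produces batches of column values ("pages"), and Python generators stand in for the pipeline, so no intermediate result is fully materialized.

```python
from typing import Dict, Iterator, List

Page = Dict[str, List[float]]  # column name -> values for one batch of rows


def scan_pages(columns: Dict[str, List[float]], page_size: int) -> Iterator[Page]:
    """Source operator: emit the table as fixed-size columnar pages."""
    row_count = len(next(iter(columns.values())))
    for start in range(0, row_count, page_size):
        yield {name: vals[start:start + page_size] for name, vals in columns.items()}


def filter_pages(pages: Iterator[Page], column: str, minimum: float) -> Iterator[Page]:
    """Filter operator: build one selection mask per page, then apply it per column."""
    for page in pages:
        keep = [value >= minimum for value in page[column]]
        yield {
            name: [v for v, k in zip(vals, keep) if k]
            for name, vals in page.items()
        }


def sum_column(pages: Iterator[Page], column: str) -> float:
    """Aggregation operator: a tight loop over a single column of each page."""
    total = 0.0
    for page in pages:
        total += sum(page[column])
    return total


def run_query(columns: Dict[str, List[float]], page_size: int = 2) -> float:
    """Roughly SELECT sum(price) WHERE qty >= 2, as a lazy operator pipeline."""
    return sum_column(filter_pages(scan_pages(columns, page_size), "qty", 2), "price")
```

Because the operators are chained generators, the aggregation pulls one page at a time through the filter and the scan. The sketch shows the shape of the technique only; pure Python will not show the speedups the slide refers to.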

9

•  Facebook

–  Multiple production clusters (100s of nodes total)

-  Including 300 PB Hadoop data warehouse

–  1000s of internal daily active users

–  Millions of queries each month

–  Multiple PBs scanned every day

–  Trillions of rows a day

•  Netflix

–  Over 200-node production cluster on EC2

–  Over 15 PB in S3 (Parquet format)

–  Over 300 users and 2.5K queries daily

Presto in Production

10

•  100% open source contributions to Presto to increase adoption in the enterprise

•  A multi-year roadmap commitment to phased enhancements of the open source code

•  The first-ever commercial support offering for Presto

What is Teradata Doing?

Teradata Certified Presto www.teradata.com/presto

11

•  Hadoop Distro Agnostic

•  Modern Code Base

–  Presto is well-designed open source software with proper database architecture

•  Strong Like-Minded Community

•  Push-down processing across multiple data platforms

•  Leverage Teradata expertise to make SQL for Hadoop viable

Why is Teradata Contributing to Presto?

12

Demo Time!

13

•  Phase 1 (June 8, 2015): Implement

–  Installer

–  Documentation

–  Monitoring & Support Tools

•  Phase 2 (Q4 2015): Integrate

–  Management Tool Integration

–  YARN Integration

•  Phase 3 (2016): Proliferate

–  ODBC / JDBC Drivers

–  BI Certification

–  Security

–  Connectors

•  Commercial Support

•  Expanding ANSI SQL Coverage

Teradata Contributions to Presto

14

•  Ease of install and management via Presto-Admin tool

–  www.github.com/prestodb/presto-admin

–  Packaging Presto as an RPM

•  Testing Framework for Presto

–  www.github.com/prestodb/tempto

–  Added large number of tests

•  Improvements to JDBC driver

–  To be open sourced on www.github.com/prestodb soon!

•  Various SQL improvements

Teradata’s Contributions

15

•  YARN Integration

•  Ambari Integration

•  ODBC & JDBC Drivers that actually work

•  Security – Authentication & Authorization

•  Continued SQL Improvements

•  BI tool certifications – e.g. Tableau

•  More Connectors – e.g. HBase

•  Open source our Docker-based dev environment

•  Open our Continuous Integration platform to the community

Teradata’s Contribution Product Roadmap

16

www.github.com/facebook/presto

www.github.com/prestodb

Certified Distro: www.teradata.com/presto

Website: www.prestodb.io

Presto User’s Group: www.groups.google.com/group/presto-users

Facebook Page: www.facebook.com/prestodb

Twitter: #prestodb

How can I contribute?

17

Available for Download

–  Presto 101t Server, CLI, JDBC

–  Presto-Admin 0.1

–  Documentation

–  HDP w/ Presto VM Sandbox

–  CDH w/ Presto VM Sandbox

www.teradata.com/presto

Presto 101t certified by Teradata
