Presto Meetup 2016 My Use case Small Start


Page 1: Presto Meetup 2016 Small Start

Presto Meetup 2016

My Use case

Small Start

Page 2: Presto Meetup 2016 Small Start

Self-introduction
• @toyama0919
• Analytics infrastructure.
• Lately working mostly on Embulk…
• Using Presto for one and a half years.

Page 3: Presto Meetup 2016 Small Start

Our development situation
• We commonly use SQL.
• Marketing staff don't write SQL.
• I often write complicated SQL of around 100 lines.
• We love OSS.
• We don't run UPDATE, INSERT, or DELETE through Presto.

Page 4: Presto Meetup 2016 Small Start

Our business situation
• We manage and operate a BtoB website.
• Our data lifecycle is long.
• The business side doesn't write SQL.
• They watch re:dash and Adobe Analytics.
• Sales have increased for 15 straight years.

Page 5: Presto Meetup 2016 Small Start

Analysts want data quickly

Page 6: Presto Meetup 2016 Small Start

[Pipeline diagram; labels: Ruby Batch, Collect, Batch, Visualize, Data Store, (Digdag)]

Page 7: Presto Meetup 2016 Small Start

Analytics priority
1. Direct SQL
2. Presto
3. ETL

Page 8: Presto Meetup 2016 Small Start

The cost differs greatly from 1 to 3

Page 9: Presto Meetup 2016 Small Start

Why use Presto?
• Cross-server join
• Window functions
• UDF

Page 10: Presto Meetup 2016 Small Start

Cross Server Join

Page 11: Presto Meetup 2016 Small Start

Join
• Cross-server and cross-database.
• A single Presto query can combine data from multiple sources.
• We use queries that join multiple sources.
• This reduces ETL pain.

Page 12: Presto Meetup 2016 Small Start

Collect data in one place?
• Being able to get the data with one query is equivalent.
• I don't want to hold duplicate data (master data, user data).
• Collecting the data in one place has a high development cost.

Page 13: Presto Meetup 2016 Small Start

with mysql_user as (
  select user_id, user_name
  from mysql.schema.users
),
redshift_user_log as (
  select user_id, log_time
  from redshift.schema.pageview
)
select mysql_user.user_id, mysql_user.user_name, count(*)
from mysql_user
inner join redshift_user_log
  on mysql_user.user_id = redshift_user_log.user_id
group by mysql_user.user_id, mysql_user.user_name

Parallel
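
The idea of one query spanning two stores can be tried locally without a Presto cluster. The sketch below is only an analogy: it uses Python's built-in sqlite3 module and two attached in-memory databases standing in for the MySQL and Redshift catalogs. The sample rows and the function name are made up for illustration.

```python
import sqlite3


def count_pageviews_per_user():
    """Join a table in one database with a table in another, in one query."""
    conn = sqlite3.connect(":memory:")  # "main" plays the role of the MySQL catalog
    conn.execute("ATTACH DATABASE ':memory:' AS logs")  # stands in for Redshift
    conn.execute("CREATE TABLE main.users (user_id INTEGER, user_name TEXT)")
    conn.execute("CREATE TABLE logs.pageview (user_id INTEGER, log_time TEXT)")
    # Hypothetical sample data.
    conn.executemany(
        "INSERT INTO main.users VALUES (?, ?)", [(1, "alice"), (2, "bob")]
    )
    conn.executemany(
        "INSERT INTO logs.pageview VALUES (?, ?)",
        [(1, "2016-01-01"), (1, "2016-01-02"), (2, "2016-01-03")],
    )
    rows = conn.execute(
        """
        SELECT u.user_id, u.user_name, COUNT(*)
        FROM main.users AS u
        INNER JOIN logs.pageview AS p ON u.user_id = p.user_id
        GROUP BY u.user_id, u.user_name
        ORDER BY u.user_id
        """
    ).fetchall()
    conn.close()
    return rows


print(count_pageviews_per_user())
```

In Presto the two sides would live on different servers and be addressed by catalog name, as in the query on this slide.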

Page 14: Presto Meetup 2016 Small Start

UDF

Page 15: Presto Meetup 2016 Small Start

Features MySQL does not support
• Window functions
• WITH queries
– Recursive WITH is not supported.
• URL functions
• Array data type
• CROSS JOIN UNNEST

Page 16: Presto Meetup 2016 Small Start

URL function
select url_encode('Presto最高');
=> Presto%E6%9C%80%E9%AB%98

select url_decode('Presto%E6%9C%80%E9%AB%98');
=> Presto最高
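
For readers without a Presto cluster at hand, the same round trip can be sketched in Python. This is a stand-in, not Presto itself: it assumes UTF-8 percent-encoding and uses urllib.parse from the standard library. The wrapper names simply mirror the SQL functions.

```python
from urllib.parse import quote_plus, unquote_plus


def url_encode(value: str) -> str:
    """Percent-encode the UTF-8 bytes of value; spaces become '+'."""
    return quote_plus(value, encoding="utf-8")


def url_decode(value: str) -> str:
    """Reverse url_encode, interpreting the decoded bytes as UTF-8."""
    return unquote_plus(value, encoding="utf-8")


encoded = url_encode("Presto最高")
print(encoded)
print(url_decode(encoded))
```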

Page 17: Presto Meetup 2016 Small Start

Regexp function
select regexp_extract_all('1a 2b 14m', '\d+')
=> [1, 2, 14]

select regexp_extract('超低床型自動梱包機  RQ-8LD', '([a-zA-Z0-9\-]+)')
=> RQ-8LD
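
The two patterns can be checked locally with Python's re module. This is a rough equivalent only: regex dialects differ in detail between Python and Presto, and the helper names below are hypothetical.

```python
import re


def extract_all_numbers(text: str) -> list:
    """Return every run of digits, like regexp_extract_all(text, '\\d+')."""
    return re.findall(r"\d+", text)


def extract_model_code(text: str):
    """Return the first run of ASCII letters, digits, or hyphens, or None."""
    match = re.search(r"([a-zA-Z0-9\-]+)", text)
    return match.group(1) if match else None


print(extract_all_numbers("1a 2b 14m"))
print(extract_model_code("超低床型自動梱包機  RQ-8LD"))
```

Note that the character class uses `A-Z`; a range written as `A-z` would also match punctuation such as `[` and `_`.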

Page 18: Presto Meetup 2016 Small Start

Things SQL is not good at
• Normalizing character strings.
• Splitting CSV strings.
• Morphological analysis.

Page 19: Presto Meetup 2016 Small Start

Normalization

select normalize(upper('hoge '), NFKC)
#=> HOGE
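
NFKC normalization matters most for mixed-width input, which is common in Japanese text. As a small sketch of what upper-casing followed by NFKC does, here is the same operation with Python's standard unicodedata module. The full-width sample string is an assumption chosen for illustration.

```python
import unicodedata


def normalize_keyword(text: str) -> str:
    """Upper-case the text, then fold compatibility forms with NFKC."""
    return unicodedata.normalize("NFKC", text.upper())


# Full-width Latin letters fold to their ASCII counterparts.
print(normalize_keyword("ｈｏｇｅ"))
print(normalize_keyword("hoge"))
```

Both calls yield the same string, so keywords typed in either width can be compared or grouped together.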

Page 20: Presto Meetup 2016 Small Start

Array type
select
  split(keywords, ',') as keywords
from
  mysql_keywords_table

Input:
keywords
----------------------------
keyword1,keyword2,keyword3

Output:
keywords
----------------------------
['keyword1','keyword2','keyword3']

Page 21: Presto Meetup 2016 Small Start

Horizontal to vertical
SELECT
  keyword
FROM
  mysql_keywords_table
CROSS JOIN
  UNNEST(split(keywords, ',')) AS t (keyword)

Page 22: Presto Meetup 2016 Small Start

Horizontal to vertical

keywords
----------------------------
keyword1,keyword2,keyword3
keyword4,keyword5
keyword6
keyword1

keyword
----------------------------
keyword1
keyword2
keyword3
keyword4
keyword5
keyword6
keyword1
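
The split-then-unnest step is essentially a flat map: each comma-separated row becomes an array, and each array element becomes its own row. A minimal Python sketch of that transformation, using the sample rows from this slide (the function names are hypothetical):

```python
def split_keywords(row: str) -> list:
    """Turn one comma-separated string into a list, like split(keywords, ',')."""
    return row.split(",")


def unnest_keywords(rows: list) -> list:
    """Emit one output row per keyword, like CROSS JOIN UNNEST(...)."""
    return [keyword for row in rows for keyword in split_keywords(row)]


rows = [
    "keyword1,keyword2,keyword3",
    "keyword4,keyword5",
    "keyword6",
    "keyword1",
]
for keyword in unnest_keywords(rows):
    print(keyword)
```

Duplicates are kept, so keyword1 appears twice, as in the result table on this slide.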

Page 23: Presto Meetup 2016 Small Start

Window function
• We use window functions on MySQL data (Presto on MySQL).
• The data source is MySQL, but everything in the Presto world is available.
• But MySQL's own functions cannot be used.

Page 24: Presto Meetup 2016 Small Start

Rank function on MySQL
select
  company_id,
  category_id,
  count(*),
  rank() over (
    partition by company_id
    order by count(*) desc
  )
from mysql.schema.mysql_table
group by company_id, category_id
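
To make the semantics of the query concrete, the sketch below reproduces it in plain Python: count rows per (company_id, category_id), then rank the categories within each company by that count, with ties sharing a rank and the next rank skipped. The input rows and the function name are invented for illustration.

```python
from collections import Counter, defaultdict


def rank_categories(rows):
    """rows: list of (company_id, category_id) tuples.

    Returns (company_id, category_id, count, rank) tuples, where rank
    follows rank() over (partition by company_id order by count desc).
    """
    counts = Counter(rows)  # group by company_id, category_id
    by_company = defaultdict(list)
    for (company_id, category_id), n in counts.items():
        by_company[company_id].append((category_id, n))

    result = []
    for company_id in sorted(by_company):
        ordered = sorted(by_company[company_id], key=lambda item: item[1], reverse=True)
        rank, previous = 0, None
        for position, (category_id, n) in enumerate(ordered, start=1):
            if n != previous:  # ties keep the rank of the first tied row
                rank, previous = position, n
            result.append((company_id, category_id, n, rank))
    return result


sample = [(1, "a"), (1, "a"), (1, "b"), (1, "c"), (2, "x")]
for row in rank_categories(sample):
    print(row)
```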

Page 25: Presto Meetup 2016 Small Start

Other window functions
• last_value
• first_value
• dense_rank
• percent_rank

Page 26: Presto Meetup 2016 Small Start

Prestogres
• PostgreSQL protocol gateway for Presto.
• Rewrites queries before sending them through PostgreSQL to Presto.
• Has password-based authentication and SSL.

Page 27: Presto Meetup 2016 Small Start

Why Prestogres?
• Connectivity with other applications.
– pgAdmin, the psql command.
– re:dash can connect to Presto over the PostgreSQL protocol.
– But it can also connect to Presto directly.
• Connecting to Presto otherwise needs a Presto client.
– I don't want to use the Java client.
• Presto's security is weak.
– Authentication is handled by Prestogres.

Page 28: Presto Meetup 2016 Small Start

Prestogres limitations
• Prepared statements.
– Presto does not support them either.
– So embulk-input-postgresql does not work.
• Can't fetch the schema with SQL.
• Temporary tables
• DROP TABLE

Page 29: Presto Meetup 2016 Small Start

re:dash
• Visualization platform, written in Python.
• Supports many data sources.
• Queries can be shared with team members.
• Queries can be scheduled (per day, per hour).
• Very active contributions.

Page 30: Presto Meetup 2016 Small Start

Presto queries increased rapidly with re:dash
• The number of Presto queries increased more than tenfold.
• That is no different from writing ETL on re:dash.
• re:dash has a good reputation internally.

Page 31: Presto Meetup 2016 Small Start

Okay, analytics problems all clear!

Page 32: Presto Meetup 2016 Small Start

No.. Can’t escape from ETL

Page 33: Presto Meetup 2016 Small Start

Embulk with Presto
• We use embulk-input-presto, a plugin of our own making.
– Supports the JSON type.
• Create point-in-time data.
• Create machine learning data.

Page 34: Presto Meetup 2016 Small Start

Why Embulk?
• Very active plugin ecosystem.
• Complicated string analysis cannot be done with SQL alone.
• The combination with Digdag is very powerful.
• We want to get there by the shortest route.
• Fluentd is overworked..

Page 35: Presto Meetup 2016 Small Start

Operation

Page 36: Presto Meetup 2016 Small Start

Install by RPM
• Presto has an RPM.
– It is not distributed.
– It needs a source build..
• It includes an init script.
• But it does not support OpenJDK..
– Pull request in progress..

Page 37: Presto Meetup 2016 Small Start

AWS integration
• We build Presto on EC2.
• We don't use EMR.
• Workers are spot instances of multiple instance types.
– This prevents them all going down at once.

Page 38: Presto Meetup 2016 Small Start

Networking
• Place the Presto cluster (coordinator and workers) in the same AZ.
• Across AZs, the traffic cost (and money) is very high.
– The cluster should not be multi-AZ.

Page 39: Presto Meetup 2016 Small Start

Networking on AWS

[Diagram: two Availability Zones, one coordinator, three workers, and the label "Not Cluster"]

Page 40: Presto Meetup 2016 Small Start

Problems
• Very huge repository.
• The coordinator is a SPOF.
• Running long-range queries causes OutOfMemory errors.

Page 41: Presto Meetup 2016 Small Start

Very huge repository
• Monolithic application.
– I want separate repositories.
• The first build takes 30 minutes.
• Builds after that take 10 minutes.
• All connectors are in the main repository.
– MongoDB, Kafka, Cassandra..
– Elasticsearch support is coming soon.
• Hard to contribute.

Page 42: Presto Meetup 2016 Small Start

Big change for JDBC
• Supports predicate pushdown for multiple data types.
• We had been applying a patch to Presto…
• MySQL people, let's try it.

Page 43: Presto Meetup 2016 Small Start

Impressions of Presto I have heard
• It's an extension of Hadoop technology.
=> I don't know Hadoop. Presto has many connectors.
• Parallel processing looks difficult.
=> Presto has no storage, so that has little influence.
• I don't have such big data.
=> I'm not such a big player either.

Page 44: Presto Meetup 2016 Small Start

Summary
• Presto is great software.
• It's not so difficult.
• Let's use it more.