Presto Meetup 2016 Small Start

  • Published on
    12-Apr-2017

Transcript

Presto Meetup 2016
My Use Case: Small Start

Self-introduction
  • @toyama0919
  • Analytics infrastructure.
  • Recently working mostly on Embulk.
  • Have been using Presto for a year and a half.

Our development situation
  • We commonly use SQL.
  • Marketing staff do not write SQL.
  • I often write complicated SQL, around 100 lines long.
  • We love OSS.
  • We do not use UPDATE, INSERT, or DELETE through Presto.

Our business situation
  • We manage and operate B2B websites.
  • Our data lifecycle is long.
  • The business side does not write SQL.
  • They watch re:dash and Adobe Analytics.
  • Sales have increased for 15 straight years.

Analysts want data quickly

[Diagram labels: Ruby Batch, Collect, Batch, Visualize, Data Store (Digdag)]

Analytics priority
  • 1. Direct SQL
  • 2. Presto
  • 3. ETL
  • The cost differs greatly from 1 to 3.

Why use Presto?
  • Cross-server join
  • Window functions
  • UDFs

Cross Server Join

Join
  • Cross-server and cross-database.
  • A single Presto query can combine data from multiple sources.
  • We use queries that join multiple sources.
  • This reduces ETL pain.

Collect data in one place?
  • Joining across sources is equivalent to getting the data with one query.
  • I do not want duplicated data (master data, user data).
  • Collecting the data in one place has a high development cost.

with mysql_user as (
  select user_id, user_name from mysql.schema.users
),
redshift_user_log as (
  select user_id, log_time from redshift.schema.pageview
)
select mysql_user.user_id, mysql_user.user_name, count(*)
from mysql_user
inner join redshift_user_log
  on mysql_user.user_id = redshift_user_log.user_id
group by mysql_user.user_id, mysql_user.user_name
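The shape of this cross-source join can be tried locally without a Presto cluster. The sketch below is only a stand-in: it attaches two in-memory SQLite databases to one connection so that a single query spans both, which mimics how one Presto query spans the mysql and redshift catalogs. The names mysql_db and redshift_db and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Local stand-in for the slide's two catalogs: two separate in-memory SQLite
# databases attached to one connection. The names mysql_db / redshift_db and
# the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mysql_db")
conn.execute("ATTACH DATABASE ':memory:' AS redshift_db")

conn.execute("CREATE TABLE mysql_db.users (user_id INTEGER, user_name TEXT)")
conn.execute("CREATE TABLE redshift_db.pageview (user_id INTEGER, log_time TEXT)")
conn.executemany(
    "INSERT INTO mysql_db.users VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
conn.executemany(
    "INSERT INTO redshift_db.pageview VALUES (?, ?)",
    [(1, "2016-01-01 10:00"), (1, "2016-01-01 10:05"), (2, "2016-01-02 09:00")],
)

# Same structure as the slide's query; columns are qualified with the CTE
# name because user_id exists on both sides of the join.
QUERY = """
with mysql_user as (
  select user_id, user_name from mysql_db.users
),
redshift_user_log as (
  select user_id, log_time from redshift_db.pageview
)
select mysql_user.user_id, mysql_user.user_name, count(*)
from mysql_user
inner join redshift_user_log
  on mysql_user.user_id = redshift_user_log.user_id
group by mysql_user.user_id, mysql_user.user_name
order by mysql_user.user_id
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

In Presto itself the two sources would be real catalogs configured on the cluster, not attached files; only the query shape carries over.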

Parallel

UDF

Mechanisms MySQL does not support
  • Window functions
  • WITH queries (recursive queries are not supported)
  • URL functions
  • Array data type
  • CROSS JOIN UNNEST

URL Functionselect url_encode('Presto');=> Presto%8d%c5%8d%82

select url_decode('Presto%8d%c5%8d%82');=> Presto

Regexp Function
select regexp_extract_all('1a 2b 14m', '\d+');
=> [1, 2, 14]

select regexp_extract('RQ-8LD', '([a-zA-Z0-9\-]+)');
=> RQ-8LD
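The two patterns can be checked quickly outside Presto. The sketch below uses Python's re module as a stand-in (Presto uses Java regular expression syntax, which agrees with Python for these simple patterns). It also shows why a character class written as A-z is broader than A-Z plus a-z.

```python
import re

# Counterpart of regexp_extract_all('1a 2b 14m', '\d+'): every match.
numbers = re.findall(r"\d+", "1a 2b 14m")

# Counterpart of regexp_extract(..., pattern): the first match only.
match = re.search(r"([a-zA-Z0-9\-]+)", "RQ-8LD")
code = match.group(1) if match else None

# Pitfall: the range A-z also covers the punctuation that sits between
# 'Z' and 'a' in ASCII, such as '_' and '^'.
loose = re.fullmatch(r"[A-z]", "_") is not None
strict = re.fullmatch(r"[A-Za-z]", "_") is not None

print(numbers, code, loose, strict)
```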

Things SQL is not good at
  • Normalizing character strings.
  • Splitting CSV strings.
  • Morphological analysis.

Normalization

select normalize(upper('hoｇｅ'), NFKC);
=> HOGE
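NFKC normalization folds compatibility characters, such as full-width Latin letters, into their plain forms. A minimal Python sketch of the same two steps (uppercase, then NFKC) follows. The mixed-width input string is an assumption for illustration.

```python
import unicodedata

raw = "hoｇｅ"  # hypothetical input mixing ASCII and full-width letters

# Step 1: uppercase. Full-width letters stay full-width.
upper = raw.upper()

# Step 2: NFKC folds the full-width letters into ASCII.
normalized = unicodedata.normalize("NFKC", upper)

print(upper, normalized)
```

Uppercasing alone leaves the full-width letters in place, so the two strings only compare equal after normalization.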

Array type
select split(keywords, ',') as keywords
from mysql_keywords_table

Before:
keywords
----------------------------
keyword1,keyword2,keyword3

After:
keywords
----------------------------
['keyword1','keyword2','keyword3']

Horizontal to vertical
SELECT keyword
FROM mysql_keywords_table
CROSS JOIN UNNEST(split(keywords, ',')) AS t (keyword)

Before:
keywords
----------------------------
keyword1,keyword2,keyword3
keyword4,keyword5
keyword6
keyword1

After:
keyword
----------------------------
keyword1
keyword2
keyword3
keyword4
keyword5
keyword6
keyword1
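The effect of split followed by CROSS JOIN UNNEST is to emit one output row per array element of each input row. A small Python sketch of that semantics, using the rows from the slide (the function name unnest_keywords is hypothetical):

```python
def unnest_keywords(rows):
    """Emit one keyword per output row, like split + CROSS JOIN UNNEST."""
    return [keyword for keywords in rows for keyword in keywords.split(",")]


rows = [
    "keyword1,keyword2,keyword3",
    "keyword4,keyword5",
    "keyword6",
    "keyword1",
]
flattened = unnest_keywords(rows)
print(flattened)
```

Duplicates are kept, so keyword1 appears twice in the result, as it does on the slide.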

Window function
  • We use window functions on MySQL data (Presto on MySQL).
  • The data source is MySQL, but the query runs in the Presto world, so window functions are available.
  • However, MySQL's own functions cannot be used.

Rank function on MySQL
select
  company_id,
  category_id,
  count(*),
  rank() over (partition by company_id order by count(*) desc)
from mysql.schema.mysql_table
group by company_id, category_id
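To make the semantics of the query above concrete, the sketch below reproduces it in plain Python: count rows per (company_id, category_id), then rank the categories within each company by descending count. Ties share a rank and the next rank is skipped, as with rank(). The function name and sample rows are assumptions for illustration.

```python
from collections import Counter
from itertools import groupby


def rank_categories(rows):
    """rows: iterable of (company_id, category_id) pairs.

    Returns (company_id, category_id, count, rank) tuples, ranked per company.
    """
    counts = Counter(rows)  # group by company_id, category_id
    ordered = sorted(
        counts.items(),
        key=lambda item: (item[0][0], -item[1], item[0][1]),
    )
    result = []
    # partition by company_id
    for company_id, group in groupby(ordered, key=lambda item: item[0][0]):
        previous_count, rank = None, 0
        for position, ((_, category_id), count) in enumerate(group, start=1):
            if count != previous_count:
                rank = position  # ties keep the earlier rank
                previous_count = count
            result.append((company_id, category_id, count, rank))
    return result


sample = [(1, "a"), (1, "a"), (1, "b"), (1, "b"), (1, "c"), (2, "a")]
ranked = rank_categories(sample)
print(ranked)
```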

Other window functions
  • last_value
  • first_value
  • dense_rank
  • percent_rank

Prestogres
  • PostgreSQL protocol gateway for Presto.
  • Rewrites queries before sending them from PostgreSQL to Presto.
  • Has password-based authentication and SSL.

Why Prestogres?
  • Connectivity from other applications: pgAdmin, the psql command.
  • re:dash connects to Presto over the PostgreSQL protocol. (It can also connect to Presto directly.)
  • Connecting to Presto directly requires a Presto client, and I do not want to use the Java client.
  • Presto's security is weak, so authentication is handled by Prestogres.

Prestogres limitations
  • Prepared statements. Presto does not support them either, so embulk-input-postgresql does not work.
  • Cannot fetch the schema with SQL.
  • Temporary tables
  • DROP TABLE

re:dash

  • Visualization platform, written in Python.
  • Supports many data sources.
  • Queries can be shared with team members.
  • Queries can be scheduled (per day, per hour).
  • Very active contributions.

Presto queries increased rapidly with re:dash
  • The number of Presto queries increased more than 10 times.
  • In effect, this is the same as writing ETL on re:dash.
  • re:dash has a good reputation internally.

Okay, all analytics problems solved!

No... We can't escape from ETL.

Embulk with Presto
  • We use embulk-input-presto, a plugin of our own making.
  • Supports the JSON type.
  • Creates point-in-time data.
  • Creates machine learning data.

Why Embulk?
  • Very active plugin ecosystem.
  • Complicated string analysis cannot be done with SQL alone.
  • The combination with Digdag is very powerful.
  • We want to get things done by the shortest path.
  • Fluentd is overworked...

Operation

Install by RPM
  • Presto has an RPM.
  • It is not distributed, so it needs a build from source.
  • It includes an init script.
  • But it does not support OpenJDK.
  • Pull request in progress.

AWS integration
  • We build Presto on EC2.
  • We do not use EMR.
  • Workers are spot instances of multiple instance types, to prevent them all going down at once.

Networking
  • Place the Presto cluster (coordinator and workers) in the same AZ.
  • Across AZs, the traffic is very high (and so is the cost).
  • Do not go multi-AZ.

Networking on AWS
[Diagram labels: Availability Zone, Availability Zone, coordinator, worker (x3), Not Cluster]

Problems
  • Very huge repository.
  • The coordinator is a SPOF.
  • Long-range queries cause OutOfMemory errors.

Very huge repository
  • Monolithic application. I want separate repositories.
  • The first build takes 30 minutes.
  • Subsequent builds take 10 minutes.
  • All connectors live in the main repository: MongoDB, Kafka, Cassandra... Elasticsearch will be supported soon.
  • It is hard to contribute.

Big change for JDBC
  • Support for predicate pushdown on multiple data types.
  • We use Presto with this patch applied.
  • MySQL people, give it a try.

Impressions of Presto I have heard
  • "It is an extension of Hadoop technology." => I don't know Hadoop. Presto has many connectors.
  • "Parallel processing looks difficult." => Presto has no storage, so this has little influence.
  • "I do not have such big data." => I am not such a big player either.

Summary
  • Presto is great software.
  • It is not so difficult.
  • Let's use it more.
