Apache Kylin: Balance between Space and Time
Debashis Saha | Luke Han
2015-06-09
http://kylin.io | @ApacheKylin

Apache Kylin Balance between Space and Time · 2017-12-14

About us

§ Debashis Saha (@debashis_saha)
  - VP, eBay Cloud Services (Platform, Infrastructure, Data)
§ Luke Han (@lukehq)
  - Sr. Product Manager, Analytics Data Infrastructure
  - Committer & PMC Member of Apache Kylin


Agenda

§ About Apache Kylin
§ Feature Highlights
§ Tech Highlights
§ Roadmap
§ Q&A


What

kylin /ˈkiːˈlɪn/ 麒麟 n. (in Chinese art) a mythical animal of composite form

Extreme OLAP Engine for Big Data

Kylin is an open source distributed analytics engine, contributed by eBay Inc., that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.


• Open sourced on Oct 1st, 2014
• Accepted as Apache Incubator Project on Nov 25th, 2014
• http://kylin.io (http://kylin.incubator.apache.org)


Who is using Kylin?

§ External
  - 25+ contributors in community
  - Adoption (from mailing list):
    - In production: Baidu Map
    - Evaluation: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau…
§ eBay internal cases
  - 90th percentile query latency < 5 seconds

Case                    Cube Size   Raw Records
User Session Analysis   26 TB       28+ billion rows
Traffic Analysis        21 TB       20+ billion rows
Behavior Analysis       560 GB      1.2+ billion rows


Why

[Figure: chart with axes labelled Happiness, Latency (10s marked) and size]


Balance Between Space and Time

OLAP Cube

[Figure: cuboid lattice over the dimensions time, item, location, supplier]

0-D (apex) cuboid
1-D cuboids: time | item | location | supplier
2-D cuboids: time, item | time, location | time, supplier | item, location | item, supplier | location, supplier
3-D cuboids: time, item, location | time, item, supplier | time, location, supplier | item, location, supplier
4-D (base) cuboid: time, item, location, supplier

• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
  2. (9/15, milk, Urbana, *) - <time, item, location>
  3. (*, milk, Urbana, *) - <item, location>
  4. (*, milk, Chicago, *) - <item, location>
  5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
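The lattice on this slide can be reproduced in a few lines. The following is a minimal sketch, not Kylin code: the function name `cuboids_by_level` is made up for illustration, and it only enumerates dimension combinations for the four dimensions named on the slide.

```python
from itertools import combinations

# The four dimensions used in the slide's lattice.
DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids_by_level(dimensions):
    """Return {k: list of k-dimension cuboids} for k = 0..len(dimensions)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboids_by_level(DIMENSIONS)
for k, level in lattice.items():
    print(f"{k}-D: {len(level)} cuboid(s)")

# The cube is the union of all cuboids: 2**4 combinations in total.
total = sum(len(level) for level in lattice.values())
print("total cuboids:", total)
```

The per-level counts (1, 4, 6, 4, 1) are the binomial coefficients, which is where the 2^N growth discussed later in the deck comes from.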


How

[Figure: BI Tools, Web App… query Kylin through ANSI SQL; Kylin runs on Map Reduce]


Feature Highlights

• Extremely Fast OLAP Engine at scale

• ANSI SQL Interface on Hadoop

• Seamless Integration with BI Tools, like Tableau

• Interactive Query Capability

• MOLAP Cube

• Incremental Build of Cubes

• Approximate Query Capability for Distinct Count (HyperLogLog)

• Leverage HBase Coprocessor for lower query latency

• Job Management and Monitoring

• User-friendly Web GUI to manage, build, monitor and query cubes

• Security capability to set ACL at Cube/Project Level

• Support LDAP Integration
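To illustrate the approximate distinct count bullet above, here is a minimal HyperLogLog sketch. It is a textbook-style illustration, not Kylin's implementation: the class name, the use of SHA-1 as the hash, and the precision `p=12` are all assumptions made for this example. The `merge` method shows why the structure suits pre-aggregation: two sketches combine by taking the register-wise maximum.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog counter with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash (an arbitrary choice).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        tail_bits = 64 - self.p
        index = h >> tail_bits                      # first p bits pick a register
        tail = h & ((1 << tail_bits) - 1)           # remaining bits
        rank = tail_bits - tail.bit_length() + 1    # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Sketches over the same p merge by register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


first = HyperLogLog()
second = HyperLogLog()
for user_id in range(30000):
    first.add(user_id)
for user_id in range(20000, 50000):
    second.add(user_id)

first.merge(second)  # union of the two sets has 50000 distinct ids
print(round(first.estimate()))
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), a little under 2%, in exchange for a fixed, small memory footprint per counter.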


Define Data Model


Manage Jobs


Explore the Data


Interacting with a BI Tool - Tableau


Kylin Architecture Overview

[Figure: architecture diagram with these components: 3rd Party App (Web App, Mobile…) via REST API; SQL-Based Tool (BI Tools: Tableau…) via JDBC/ODBC; REST Server; Query Engine; Routing (Low Latency - Seconds / Mid Latency - Minutes); Metadata; Cube Build Engine (MapReduce…); Hadoop Hive (Star Schema Data); OLAP Cube (HBase) (Key Value Data)]

➢ Online Analysis Data Flow
➢ Offline Data Flow
➢ Clients/Users interact with Kylin via SQL
➢ OLAP Cube is transparent to users
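As a rough illustration of a third-party app sending SQL through the REST API, the sketch below builds an HTTP request without sending it. Everything specific here is an assumption to check against your Kylin version: the `/kylin/api/query` path and payload fields follow Kylin's published query API as best I recall it, and the host, port, project name and credentials are placeholders.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, project, sql, user, password, limit=50000):
    """Build (but do not send) a POST request carrying one SQL query."""
    # Payload field names are assumed from Kylin's query API documentation.
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder host, project and credentials; replace with real values.
request = build_query_request(
    "http://localhost:7070",
    "learn_kylin",
    "SELECT COUNT(*) FROM test_kylin_fact",
    "ADMIN",
    "KYLIN",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

The client only ever supplies SQL; whether the answer comes from a cube or another path is decided on the server side, which matches the "transparent to users" point above.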


Data Modeling

[Figure: Cube Metadata (Cube: …, Fact Table: …, Dimensions: …, Measures: …, Storage (HBase): …) maps the Source Star Schema (one Fact table joined to Dim tables) onto the Target HBase Storage (rows A, B, C, each with a Row Key and a Column Family whose Columns hold Val 1, Val 2, Val 3). Roles shown: End User, Cube Modeler, Admin]


Cube Build Job Flow


How to Store Cube - HBase Schema


Kylin Query Engine - Explain Plan

    SELECT test_cal_dt.week_beg_dt,
           test_category.category_name,
           test_category.lvl2_name,
           test_category.lvl3_name,
           test_kylin_fact.lstg_format_name,
           test_sites.site_name,
           SUM(test_kylin_fact.price) AS GMV,
           COUNT(*) AS TRANS_CNT
    FROM test_kylin_fact
      LEFT JOIN test_cal_dt
        ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
      LEFT JOIN test_category
        ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
       AND test_kylin_fact.lstg_site_id = test_category.site_id
      LEFT JOIN test_sites
        ON test_kylin_fact.lstg_site_id = test_sites.site_id
    WHERE test_kylin_fact.seller_id = 123456
       OR test_kylin_fact.lstg_format_name = 'New'
    GROUP BY test_cal_dt.week_beg_dt,
             test_category.category_name,
             test_category.lvl2_name,
             test_category.lvl3_name,
             test_kylin_fact.lstg_format_name,
             test_sites.site_name

    OLAPToEnumerableConverter
      OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
        OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
          OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
            OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
              OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
                OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
                  OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                    OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                    OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
                  OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
                OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])


Cube Optimization

§ Curse of Dimensionality
  - N dimension cube has 2^N cuboids
  - Full Cube vs. Partial Cube
§ Huge Data Volume
  - Dictionary Encoding
  - Incremental Building
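The idea behind dictionary encoding can be sketched as follows. This is a generic illustration under my own assumptions, not Kylin's dictionary format: the helper names are hypothetical, and IDs are simply assigned in sorted order and stored as fixed-width big-endian bytes.

```python
def build_dictionary(values):
    """Assign each distinct value a small integer ID, in sorted order."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


def id_width(dictionary):
    """Number of bytes needed to store the largest ID (at least 1)."""
    largest_id = max(len(dictionary) - 1, 0)
    return max(1, (largest_id.bit_length() + 7) // 8)


def encode(value, dictionary):
    """Replace a dimension value with its fixed-width ID."""
    return dictionary[value].to_bytes(id_width(dictionary), "big")


cities = ["Urbana", "Chicago", "Urbana", "Chicago", "Urbana"]
city_dict = build_dictionary(cities)
encoded = [encode(city, city_dict) for city in cities]

print(city_dict)   # each distinct city gets one ID
print(encoded)     # one byte per row instead of the full string
```

Because IDs follow the sorted order of the values and have a fixed width, comparing the encoded bytes gives the same ordering as comparing the original values, which is convenient for sorted key-value storage.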


Full Cube vs. Partial Cube

§ Full Cube
  - Pre-aggregate all dimension combinations
  - "Curse of dimensionality": an N-dimension cube has 2^N cuboids
§ Partial Cube
  - To avoid dimension explosion, we divide the dimensions into different aggregation groups
  - 2^(N+M+L) → 2^N + 2^M + 2^L
  - For a cube with 30 dimensions, dividing the dimensions into 3 groups reduces the number of cuboids from about 1 billion to about 3 thousand
  - 2^30 → 2^10 + 2^10 + 2^10
  - Tradeoff between online aggregation and offline pre-aggregation
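The arithmetic on this slide is easy to check. The sketch below uses two hypothetical helper functions and counts cuboids exactly as the slide does, treating each aggregation group as an independent small cube.

```python
def full_cube_cuboids(num_dimensions):
    """Cuboids in a full cube: every subset of the dimensions."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes):
    """Cuboids when dimensions are split into independent aggregation groups."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])

print(f"full cube:    {full:,} cuboids")
print(f"partial cube: {partial:,} cuboids")
print(f"reduction:    {full // partial:,}x")
```

The price of the smaller cube is that a query combining dimensions from different groups is not pre-aggregated and has to be aggregated online, which is the tradeoff named in the last bullet.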


Partial Cube


Incremental Build


What's Next

§ Improve cube algorithm
  - Cube by segments, 30%-50% faster
  - Build delay down to tens of minutes
§ Streaming cubing
  - Analyze real-time data
  - Build delay down to seconds
§ Spark


Cube by Layer

§ The current algorithm
  - Many MR rounds, as many as the number of dimensions
  - Huge shuffles (100x of total cube size), with aggregation at the reduce side

[Figure: Full Data and the 0-D, 1-D, 2-D, 3-D and 4-D Cuboid layers, connected by five MR rounds]
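A layer-by-layer build can be sketched in plain Python, with each loop iteration standing in for one MR round. This is an illustration under assumptions, not Kylin's job code: the function names, the sample rows and the single summed measure are invented, and each lower cuboid is derived from one parent in the layer above by dropping a dimension.

```python
from collections import defaultdict

DIMS = ("time", "item", "location", "supplier")

# Sample fact rows: four dimension values followed by one measure.
ROWS = [
    ("9/15", "milk", "Urbana", "Dairy_land", 3),
    ("9/15", "milk", "Chicago", "Dairy_land", 2),
    ("9/16", "bread", "Urbana", "Bakers", 5),
]


def aggregate(cells, keep):
    """Roll cells up to the dimensions flagged in `keep`; others become '*'."""
    out = defaultdict(int)
    for key, measure in cells.items():
        rolled = tuple(v if k else "*" for v, k in zip(key, keep))
        out[rolled] += measure
    return dict(out)


def cube_by_layer(rows):
    """Return {level: {mask: cells}}, building each layer from the one above."""
    n = len(DIMS)
    base = defaultdict(int)
    for *dims, measure in rows:          # round 1: raw data -> base cuboid
        base[tuple(dims)] += measure
    layers = {n: {(True,) * n: dict(base)}}
    for level in range(n - 1, -1, -1):   # one further round per layer
        layer = {}
        for mask, cells in layers[level + 1].items():
            for i in range(n):
                if mask[i]:
                    child = mask[:i] + (False,) + mask[i + 1:]
                    if child not in layer:
                        layer[child] = aggregate(cells, child)
        layers[level] = layer
    return layers


layers = cube_by_layer(ROWS)
for level in sorted(layers, reverse=True):
    print(f"{level}-D layer: {len(layers[level])} cuboid(s)")
```

With four dimensions this takes five rounds, and every round re-reads the whole layer above it, which gives a feel for why the number of rounds and the shuffle volume grow with the dimension count.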


Cube by Segments

§ The to-be algorithm, 30%-50% faster
  - 1 round MR
  - Reduced shuffles, map side aggregation, 20x total cube size
  - Hourly incremental build done in tens of minutes

[Figure: each mapper turns one Data Split into a Cube Segment; the Cube Segments (……) go through Merge Sort (Shuffle) to produce the Final Cube]
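The single-round idea can be sketched as follows: each "mapper" builds a complete small cube for its own data split, and the merge step only sums cells that share a key. This is a simplified illustration with invented names and sample rows, not the planned Kylin algorithm itself.

```python
from collections import Counter
from itertools import product

# Sample fact rows: four dimension values followed by one measure.
ROWS = [
    ("9/15", "milk", "Urbana", "Dairy_land", 3),
    ("9/15", "milk", "Chicago", "Dairy_land", 2),
    ("9/16", "bread", "Urbana", "Bakers", 5),
]


def cube_segment(rows):
    """Map side: aggregate every cuboid cell for one data split in memory."""
    segment = Counter()
    for *dims, measure in rows:
        # Every keep/drop mask over the dimensions is one cuboid.
        for mask in product((True, False), repeat=len(dims)):
            key = tuple(v if keep else "*" for v, keep in zip(dims, mask))
            segment[key] += measure
    return segment


def merge_segments(segments):
    """Merge step: sum the cells that share a key across segments."""
    final = Counter()
    for segment in segments:
        final.update(segment)  # Counter.update adds counts
    return final


splits = [ROWS[:2], ROWS[2:]]
final_cube = merge_segments(cube_segment(split) for split in splits)
print(final_cube[("*", "milk", "*", "*")])   # milk total across both splits
print(final_cube == cube_segment(ROWS))      # same result as one big build
```

Since rows are already aggregated inside each split before anything is exchanged, only the per-segment cells need to be shuffled, which is the intuition behind the reduced shuffle size on the slide.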


Streaming Cubing

§ Cube is great but…
  - Cube takes time to build, how about real-time analysis?
  - Sometimes we want to drill down to row level information
§ Streaming cubing
  - Build micro cube segments from streaming data
  - Use inverted index to capture last minute data
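An inverted index keeps raw rows and lets them be filtered by dimension value without a build step, which is what makes it a fit for very recent data and row-level drill-down. The class below is a minimal in-memory sketch with hypothetical names, not Kylin's streaming storage.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps (dimension, value) to the ids of the rows containing it."""

    def __init__(self, dimensions):
        self.dimensions = dimensions
        self.rows = []                      # raw rows, kept for drill-down
        self.postings = defaultdict(set)    # (dimension, value) -> row ids

    def append(self, row):
        """Index one incoming row; it is queryable immediately."""
        row_id = len(self.rows)
        self.rows.append(row)
        for dim in self.dimensions:
            self.postings[(dim, row[dim])].add(row_id)

    def lookup(self, **filters):
        """Return rows matching all equality filters, in arrival order."""
        ids = None
        for dim, value in filters.items():
            matched = self.postings.get((dim, value), set())
            ids = matched if ids is None else ids & matched
        if ids is None:                     # no filter: every row matches
            ids = set(range(len(self.rows)))
        return [self.rows[i] for i in sorted(ids)]


index = InvertedIndex(dimensions=("item", "location"))
index.append({"item": "milk", "location": "Urbana", "price": 3})
index.append({"item": "milk", "location": "Chicago", "price": 2})
index.append({"item": "bread", "location": "Urbana", "price": 5})

recent_milk = index.lookup(item="milk")
print(sum(row["price"] for row in recent_milk))   # aggregate on the fly
print(index.lookup(item="milk", location="Urbana"))
```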


Kylin Lambda Architecture

[Figure: ANSI SQL → Query Engine → Hybrid Storage Interface, which reads the Last Hour from the Inverted Index (Streaming, seconds delay) and data Before Last Hour from the Cube (minutes delay)]
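One way to picture the hybrid storage interface is a query that splits its time range at the cube's build boundary, reads pre-aggregated totals for the older part and scans raw rows for the recent part. The sketch below is only that picture in code: the function name, the hourly granularity and the data layout are assumptions for illustration.

```python
def hybrid_sum(cube_cells, recent_rows, start_hour, end_hour, boundary_hour):
    """Sum a measure over [start_hour, end_hour), splitting at boundary_hour.

    cube_cells:  {hour: pre-aggregated total} for hours before boundary_hour
    recent_rows: [(hour, measure), ...] raw rows from boundary_hour onwards
    """
    total = 0
    # Historical part: answered from the pre-built cube.
    for hour in range(start_hour, min(end_hour, boundary_hour)):
        total += cube_cells.get(hour, 0)
    # Recent part: aggregated on the fly from rows not yet cubed.
    recent_start = max(start_hour, boundary_hour)
    for hour, measure in recent_rows:
        if recent_start <= hour < end_hour:
            total += measure
    return total


cube_cells = {0: 10, 1: 20, 2: 30}          # hours 0-2 are already in the cube
recent_rows = [(3, 1), (3, 2), (4, 5)]      # hours 3-4 only exist as raw rows

print(hybrid_sum(cube_cells, recent_rows, 1, 5, boundary_hour=3))
```

The caller sees a single answer and does not need to know where the boundary lies, which is the role the figure gives to the interface between the query engine and the two stores.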


Adding Spark Support

§ Cubing Efficiency
  - MR is not the optimal framework
  - Spark Cubing Engine
§ Source from SparkSQL
  - Read data from SparkSQL instead of Hive
§ Route to SparkSQL
  - Unsupported queries are covered by SparkSQL


Kylin Evolution Roadmap

Timeline: 2013 | 2014 | 2015 | 2016 | Future

§ Initial (Sep, 2013)
§ Prototype for MOLAP (Jan, 2014)
  • Basic end to end POC
§ MOLAP (Oct, 2014)
  • Incremental Refresh
  • ANSI SQL
  • ODBC Driver
  • Web GUI
  • Tableau
  • ACL
  • Open Source
§ StreamingOLAP (H1, 2015)
  • Streaming OLAP
  • JDBC Driver
  • New UI
  • Excel
  • SparkSQL
  • … more
§ HybridOLAP (TBD)
  • Lambda Arch
  • Automation
  • Capacity Management
  • Spark
  • … more
§ Next Gen
  • Adv OLAP Functions
  • In-Memory Analysis (TBD)
  • Mobile (TBD)
  • … more


Kylin Ecosystem

■ Kylin Core
  ■ Fundamental framework of Kylin OLAP Engine
■ Extension
  ■ Plugins to support additional functions and features
■ Integration
  ■ Lifecycle management support to integrate with other applications
■ Interface
  ■ Allows third-party users to build more features via user interface atop Kylin core

[Figure: Kylin OLAP Core surrounded by three groups]
Extension: Security, Redis Storage, Spark Engine, Docker
Interface: Web Console, Customized BI, Ambari/Hue Plugin
Integration: ODBC Driver, ETL, Drill, SparkSQL


If you want to go fast, go alone. If you want to go far, go together.
-- African Proverb

[email protected]