Building a Business on Open Source Distributed Computing


  • 8/14/2019 Building a Business on Open Source Distributed Computing

    Building a Business on Open Source Distributed Computing

    company: www.visibletechnologies.com

    blog: www.roadtofailure.com

    twitter: @lusciouspear

    Sunday, December 20, 2009

    Social Media and Scaling

    Scalability Matters Now.

    SM produces large, complex data

    Anyone can collect the web

    Make a Twitter in a few days

    Easy to get TBs of data

    Big Data enabling new fields for companies

    What Visible Does

    BI and Brand Management on Social Media

    Listen, Monitor, Engage

    Old Product: RDBMS

    A few MSSQL servers on boxes

    Lots of ETL

    Several TB, inserts slow, deletes impossible, random fail

    Why RDBMS Bad

    Nonlinear scale cost

    Used as a storage abstraction

    Mainly Select, Join, Group, Count

    Specialized Scale-Out ones meh

    Impedance Mismatch - Try to be High-Throughput, Low-Latency

    Swiss-army knife, unstable, transactions, advanced SQL, tuning

    Why OSS?

    Previously all MS

    It exists!

    Scaling + Licensing = No

    Can't build a platform without source

    It's Enterprise Now!

    Goals for New Platform

    Golden Timeline

    Search/Analyze *any* data

    Linear Cost

    Not Hacked Together

    Collect the Social Internet

    HOW TO SCALE

    What makes you special?

    What are you willing to sacrifice?

    How will you structure the data?

    Avoiding Impedance Mismatch

    Most problems can be divided into High or Low latency

    Get a lot of data eventually, or a little now

    MapReduce vs. Sharding/Indexing
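
The trade-off on this slide can be sketched in a few lines of plain Python. This is only an illustration, not code from the talk: `POSTS`, `scan_for`, `build_index`, and `lookup` are hypothetical names. The scan answers any question by touching every record (the MapReduce side), while the index pays that cost once so that each later query is a single probe (the indexing side).

```python
from collections import defaultdict

# Hypothetical sample data: post id -> body text.
POSTS = {1: "iron maiden rules", 2: "maiden tour dates", 3: "new guitar day"}

def scan_for(posts, word):
    """'A lot of data eventually': touch every record for each question."""
    return sorted(pid for pid, body in posts.items() if word in body.split())

def build_index(posts):
    """Pay the full scan once, ahead of time, to build word -> post ids."""
    index = defaultdict(list)
    for pid in sorted(posts):
        for w in set(posts[pid].split()):
            index[w].append(pid)
    return index

def lookup(index, word):
    """'A little now': one dictionary probe per query."""
    return index.get(word, [])
```

Both paths return the same ids for a given word; what differs is when the work happens and how much data each query touches.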

    Ecosystem

    [Diagram] Components: Hadoop DFS, HBase, Hive, MapReduce, Cascading, Pig, Katta/Applications, Zookeeper

    Layers: Unstructured Storage, Structured Storage, Raw Processing, Compiled Processing

    Simple Workflow

    [Diagram] Steps: Collect, Semantic Analysis, Unstructured Analysis, Store in HBase, Structured Analysis, Indexing, Store in Hadoop, Pull Indexes, Load/Replicate Shards, Search

    Technology labels: Hadoop; Hadoop + HBase; Lucene + Solr + Katta

    Unstructured Processing Cluster

    [Diagram] Steps: Collect, Semantic Analysis, Unstructured Analysis

    Data labels: Internet, XML, HTML, HBase Records, Structured Store

    Hadoop + MR

    Special: Crunch web-scale data fast

    Sacrifice: Low-Latency, Transactions, Random Access, Updates

    Structure: Chunked flat files
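
To make the "chunked flat files" model concrete, here is a toy word count written as explicit map, shuffle, and reduce phases. It is a single-process sketch in plain Python and does not use the Hadoop API; `CHUNKS` is a hypothetical stand-in for the blocks of a flat file, and all function names are illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_chunk(chunk):
    """Map phase: emit (word, 1) for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(pairs):
    """Shuffle phase: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(chunks):
    """Run map over every chunk independently, then shuffle and reduce."""
    mapped = chain.from_iterable(map_chunk(c) for c in chunks)
    return reduce_counts(shuffle(mapped))

# Two hypothetical chunks standing in for blocks of a flat file.
CHUNKS = [["Iron Maiden rules", "maiden tour"], ["iron sky"]]
```

Each chunk is mapped without reference to any other chunk, which is what lets the real system spread the work across machines. It also shows the sacrifices listed above: there is no way to update or look up a single record without another full pass.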

    Structured Processing Cluster

    [Diagram] Steps: Store in HBase, Structured Analysis, Indexing, Store in Hadoop

    Labels: HBase Records, Unstructured Cluster, Search Cluster, Lucene Index, Sharded Lucene Index, Enriched Data

    Document Structure

    ContentID: 00BAC189

    Title: Iron Maiden Rules

    Body: I think Janick Gers is an amazing guitarist blah blah

    PostDT: 20090718

    ParentID: 0FDEADBEEF

    Permalink: www.roadtofailure.com/post?=20
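
The record above could be modeled as a small typed structure. The sketch below is an assumption about types rather than the talk's actual schema: it treats every field as a string except `PostDT`, which it parses as a `YYYYMMDD` date because the sample value looks like one. `Document` and `parse_document` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass(frozen=True)
class Document:
    content_id: str
    title: str
    body: str
    post_dt: date
    parent_id: str
    permalink: str

def parse_document(fields):
    """Build a Document from a dict keyed by the slide's field names."""
    return Document(
        content_id=fields["ContentID"],
        title=fields["Title"],
        body=fields["Body"],
        # Assumption: PostDT is formatted as YYYYMMDD.
        post_dt=datetime.strptime(fields["PostDT"], "%Y%m%d").date(),
        parent_id=fields["ParentID"],
        permalink=fields["Permalink"],
    )

# The sample record from the slide.
SAMPLE = {
    "ContentID": "00BAC189",
    "Title": "Iron Maiden Rules",
    "Body": "I think Janick Gers is an amazing guitarist blah blah",
    "PostDT": "20090718",
    "ParentID": "0FDEADBEEF",
    "Permalink": "www.roadtofailure.com/post?=20",
}
doc = parse_document(SAMPLE)
```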

    HBase

    Special: Scalable random/sequential access almost as fast as RDBMS

    Sacrifice: Joins, Secondary Indexes, Transactions (kind of)

    Structure: BigTable - column oriented
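
A rough way to picture the BigTable-style layout is a map from row key to columns, with rows kept in key order. The toy class below is an in-memory illustration only and is not the HBase client API; `ToyTable`, the `post:` column family, and the sample keys are all hypothetical.

```python
import bisect

class ToyTable:
    """In-memory stand-in for a BigTable-style table (not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; rows may have different sets of columns."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access: fetch one row by its key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Sequential access: yield rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])

table = ToyTable()
table.put("0FDEADBEEF", "post:title", "Parent post")
table.put("00BAC189", "post:title", "Iron Maiden Rules")
table.put("00BAC189", "post:postdt", "20090718")
table.put("00BAC200", "post:title", "Another post")
```

Everything is addressed through the row key, which is why both point reads and range scans are cheap. The sacrifices on the slide show up as missing features: there is no join and no way to find a row by a column value without scanning.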

    Search Cluster

    [Diagram] Steps: Pull Indexes, Load/Replicate Shards, Search

    Labels: Lucene Indexes from HDFS, Lucene Indexes (shown twice)

    Search

    Katta + Solr

    Special: Sharded search

    Sacrifice: Consistency, high-throughput

    Structure: Reverse index
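
Sharded search over a reverse (often called inverted) index can be sketched as scatter and gather: each shard ranks its own hits, and a front end merges the partial results. The code below is a plain-Python illustration that scores by raw term frequency; it does not use the Katta, Solr, or Lucene APIs, and every name in it is hypothetical.

```python
import heapq
from collections import Counter

def build_shard(docs):
    """Build a reverse index for one shard: term -> {doc_id: term frequency}."""
    index = {}
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index.setdefault(term, {})[doc_id] = freq
    return index

def search_shard(index, term, k):
    """Return this shard's top-k (score, doc_id) hits for a single term."""
    postings = index.get(term.lower(), {})
    return heapq.nlargest(k, ((freq, doc_id) for doc_id, freq in postings.items()))

def search(shards, term, k=2):
    """Scatter the query to every shard, then merge the partial top-k lists."""
    partial = [hit for index in shards for hit in search_shard(index, term, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]

# Two hypothetical shards, each holding its own slice of the documents.
SHARDS = [
    build_shard({"a1": "maiden maiden maiden", "a2": "iron sky"}),
    build_shard({"b1": "maiden tour", "b2": "maiden maiden rules"}),
]
```

Because each shard is built and loaded independently, a shard that was refreshed later than its neighbors can return newer results, which is one way the consistency sacrifice shows up.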

    BI

    Group, Sort, Filter, Count, Sum

    Semi-additive (Avg) rare but not hard

    MapReduce Jobs

    Faceted Search
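
Counts and sums can be computed per partition and simply added together, which is why they fit MapReduce so well. An average needs one extra step, and that is probably what "not hard" refers to: carry (sum, count) pairs through the job and divide only at the end. The sketch below uses hypothetical partition data and function names.

```python
def partial_stats(values):
    """Per-partition step: return (sum, count) rather than the average itself."""
    return (sum(values), len(values))

def merge_stats(parts):
    """Combine step: sums and counts are additive, so add them component-wise."""
    total = sum(s for s, _ in parts)
    count = sum(c for _, c in parts)
    return total, count

def average(parts):
    """Final step: divide once, after all partitions have been merged."""
    total, count = merge_stats(parts)
    return total / count if count else None

# Hypothetical per-partition values, for example comment counts per post.
PARTITIONS = [[10, 20], [30], [40, 50, 60]]
parts = [partial_stats(p) for p in PARTITIONS]

# Averaging the per-partition averages weights small partitions too heavily.
naive = sum(sum(p) / len(p) for p in PARTITIONS) / len(PARTITIONS)
```

Here `average(parts)` is the true mean of all six values, while `naive` differs because the partitions have different sizes.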

    Examples

    Challenges

    Scaling Search

    Understanding Latency

    What do we need now? Can customers wait for big data?

    Monitoring

    Recap: Rules for Scaling

    RDBMS is not a Swiss-Army Knife

    Know your sacrifices

    Know your specialness

    Know your data structure

    Ponder Latency

    What Next?

    HBase Analytics?

    What would make a bank trust it?

    Teach people to think about data

    The End

    company: www.visibletechnologies.com

    blog: www.roadtofailure.com

    twitter: @lusciouspear

    [email protected]
