Apache Hadoop India Summit 2011 talk "Hadoop Avatar at eBay" by Srinivasan Rengarajan and Mohit Soni

Avatar at eBay

Srinivasan Rengarajan ([email protected])

Mohit Soni ([email protected])

Courtesy: Anil Madan ([email protected])

• 2007: Research team builds a 4-node cluster
– Subset of Click Stream and EDW data
– Innovation with Mobius Query Language
– Visualization and Click Path analysis

• September 2009: Search clusters
– Machine Learning Ranking cluster of 28 nodes
– Search relevance cluster of 10 nodes
– Subset of Click Stream and EDW data

• May 2010: Athena* exploratory cluster of 532 nodes
– Platform teams join hands with Search/Research to build a larger cluster
– Build it as a core competency for advanced insights into complex data
– Rapid build-out, with timelines pulled in by a couple of months

* Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice and skill in Greek mythology.

MIT's Athena ushered the world into a new era of distributed systems when it started in the mid-1980s.

Infrastructure

• Enterprise Nodes
– Sun 64-bit, Red Hat Linux
– 2 Quad Core Nehalem, 72GB RAM, 4TB
– Servers
    • NameNode(s)
    • JobTracker
    • ZooKeeper
    • HBase Master
    • Ganglia Server
    • eBay (Cloudera) HUE

• Data Nodes
– SGI-Rackables, CentOS, 1U, 5.3PB
– 2 Quad Core Nehalem, 36GB RAM, 10TB
– HBase on 20 nodes

• Network
– TOR 1Gbps
– Core switches uplink 40Gbps
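The data node figures can be cross-checked with simple arithmetic: if all 532 nodes of the Athena cluster mentioned earlier carry 10TB each, the raw capacity is about 5.3PB, which agrees with the slide. The sketch below does that calculation. Treating every node as a data node and using a replication factor of 3 (the HDFS default) are assumptions; the slides do not state either.

```python
# Figures taken from the slides.
NODES = 532              # size of the Athena cluster
DISK_PER_NODE_TB = 10    # disk per data node

# Assumption, not stated in the slides: HDFS default replication factor.
REPLICATION_FACTOR = 3


def raw_capacity_pb(nodes: int, disk_tb: float) -> float:
    """Raw disk capacity in PB, using decimal units (1PB = 1000TB)."""
    return nodes * disk_tb / 1000


def usable_capacity_pb(raw_pb: float, replication: int) -> float:
    """Capacity left for unique data once every block is stored `replication` times."""
    return raw_pb / replication


raw = raw_capacity_pb(NODES, DISK_PER_NODE_TB)
usable = usable_capacity_pb(raw, REPLICATION_FACTOR)
print(f"raw: {raw:.2f}PB, usable at {REPLICATION_FACTOR}x replication: {usable:.2f}PB")
```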

Ecosystem

Hadoop Core (HDFS, Common)

MapReduce (Java, Streaming, Pipes, Scala)

Data Access (HBase, Pig, Hive)

Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)

Monitoring & Alerting (Ganglia, Nagios)

• MapReduce
– Sourcing data: primarily Java
– Applications: using Perl, Scala, Python…

• Data Access Frameworks
– HBase: for EDW data
– Pig: data pipelines
– Hive: ad hoc queries
– MQL: Mobius Query Language

• Monitoring & Alerting
– Ganglia, Nagios

• Tools
– HUE/Mobius: lifecycle of user jobs
– UC4: scheduling
– Oozie: user workflow and data pipelines
– Mahout: data mining

Administration

• Groups
– Built to support multiple groups
– Job invocation uses the group name
– Fair Scheduler
    • Allocations based on investment
    • Weights
    • Minimum share of mappers and reducers
    • poolMaxJobsDefault
    • userMaxJobsDefault
    • defaultMinSharePreemptionTimeout
    • fairSharePreemptionTimeout

• Auth & Auth
– HUE: custom module to use corp. credentials
– CLI*: PAM custom module
– Security*: implement token interface to replace Kerberos with SAML

* Work in Progress
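To make the interplay of weights and minimum shares concrete, here is a minimal sketch of a weighted fair-share calculation. It is a simplification for illustration, not the Fair Scheduler's actual code: it ignores per-pool demand and preemption, and the pool names, weights and slot counts are hypothetical. Each pool receives at least its minimum share, and the remaining slots are split in proportion to weight.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    weight: float   # relative share, e.g. proportional to a group's investment
    min_share: int  # guaranteed number of slots


def fair_shares(pools, total_slots, iterations=100):
    """Find a ratio r such that sum(max(min_share, r * weight)) == total_slots.

    Weights must be positive. Demand is ignored, which is a simplification.
    """
    if sum(p.min_share for p in pools) > total_slots:
        raise ValueError("minimum shares exceed cluster capacity")

    def used(r):
        return sum(max(p.min_share, r * p.weight) for p in pools)

    # Grow the upper bound until it covers the cluster, then bisect on r.
    lo, hi = 0.0, 1.0
    while used(hi) < total_slots:
        hi *= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if used(mid) < total_slots:
            lo = mid
        else:
            hi = mid
    return {p.name: max(p.min_share, hi * p.weight) for p in pools}


# Hypothetical pools: names and numbers are illustrative only.
POOLS = [
    Pool("search", weight=2.0, min_share=100),
    Pool("research", weight=1.0, min_share=50),
    Pool("platform", weight=1.0, min_share=400),
]
shares = fair_shares(POOLS, total_slots=1000)
for name, slots in shares.items():
    print(f"{name}: {slots:.0f} slots")
```

In this example the platform pool is held at its minimum share because its weight alone would entitle it to less, while the other two pools split the remainder 2:1.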

Data Sourcing Patterns

[Diagram labels: Click Stream, EDW, Images, Search Indices, Analytics Reporting, Algorithmic Models]

Source: Click Stream
– Description: Session, Event, Session Container
– Acquisition: Session/Event streamed as LZO/Text
– Preparation/Format: SessionContainer, generate SequenceFiles
– Pattern (Session/Event data): build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
– Pattern (Session Container): ‘Value to Type Conversion’ pattern; secondary sort with reduce-side join
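The secondary sort with reduce-side join named above can be sketched in plain Python, without Hadoop. This is an illustration of the general pattern under assumptions, not eBay's job: the record layouts and field names are hypothetical. The mapper tags each record with its source, the shuffle sorts on the full composite key but groups on the join key only, and the reducer therefore sees the session record first, followed by that session's events in timestamp order.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records; field names are illustrative, not eBay's schema.
sessions = [("s1", {"user": "u42"}), ("s2", {"user": "u7"})]
events = [("s1", 30, "view_item"), ("s2", 5, "search"), ("s1", 10, "search")]


def map_phase(sessions, events):
    """Emit (composite_key, value), where the key is (join key, source tag, order)."""
    for sid, info in sessions:
        yield (sid, 0, 0), info       # tag 0 sorts the session record first
    for sid, ts, name in events:
        yield (sid, 1, ts), name      # tag 1, then ordered by timestamp


def shuffle(pairs):
    """Sort on the full composite key, but group on the join key only."""
    ordered = sorted(pairs, key=itemgetter(0))
    return groupby(ordered, key=lambda kv: kv[0][0])


def reduce_phase(grouped):
    """Join each session with its time-ordered events (inner join)."""
    for sid, group in grouped:
        session = None
        names = []
        for (_, tag, _), value in group:
            if tag == 0:
                session = value
            elif session is not None:  # events without a session are dropped
                names.append(value)
        if session is not None:
            yield sid, session["user"], names


joined = list(reduce_phase(shuffle(map_phase(sessions, events))))
for row in joined:
    print(row)
```

Because the session record always arrives first, the reducer never has to buffer events while waiting for it, which is the main reason to combine the join with a secondary sort.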

Source: EDW
– Description: Item, Transaction, User, Feedback, Bids
– Acquisition: streamed as GZIP/Text
– Preparation/Format: generate SequenceFile/HBase snapshot from the previous day's snapshot and the current day's data; Hive StorageHandlers point to the SequenceFile/HBase snapshot
– Pattern: TotalOrderPartitioner with RandomSamplers to identify partition ranges for reducers; create HBase regions using HFile; update RegionServers using the Ruby script loadtable.rb
– Concerns: HBase append performance, HFile flush (HBASE-1923)
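The idea behind combining a total-order partitioner with random sampling can be sketched as follows. This plain-Python illustration mirrors the concept only and does not use Hadoop's classes; the function names and the integer keys are hypothetical. A random sample of keys yields split points, and each key is then routed to the reducer whose range contains it, so that concatenating the sorted reducer outputs gives one globally sorted result.

```python
import bisect
import random


def sample_split_points(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 split points from a sorted random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def partition(key, split_points):
    """Index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


# Hypothetical input: 10,000 shuffled integer keys, 4 reducers.
keys = list(range(10_000))
random.Random(1).shuffle(keys)
splits = sample_split_points(keys, num_reducers=4, sample_size=500)

buckets = [[] for _ in range(4)]
for k in keys:
    buckets[partition(k, splits)].append(k)

# Each reducer sorts its own bucket; concatenation is then globally sorted.
merged = [k for bucket in buckets for k in sorted(bucket)]
print([len(b) for b in buckets])
```

Sampling keeps the reducer ranges roughly balanced without a full pass over the data, which matters when the sorted ranges are later turned into regions.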

Search Use Case – Machine Learned Ranking

[Diagram: ClickStream, Items, Users, Feedback → Classifiers → Ranking Function → Great Search Results]

• Goal
– Enhance search relevance for eBay’s items.

• Hadoop Usage
– Build a ranking function that takes multiple factors into account, like price, listing format, seller track record, relevance.
– Ability to add new factors to validate hypotheses.
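As a rough illustration of a multi-factor ranking function that can take on new factors, the sketch below uses a plain weighted sum. This is a deliberately simple stand-in, not the machine-learned function the slides refer to: the weights, the `free_shipping` factor, and the item values are all hypothetical, and factor values are assumed to be normalised to [0, 1].

```python
# Hypothetical weights; the slides name the factors but give no weights or formula.
WEIGHTS = {
    "relevance": 0.5,
    "seller_track_record": 0.3,
    "price": 0.15,
    "listing_format": 0.05,
}


def score(item, weights=WEIGHTS):
    """Weighted sum of factor values; a factor missing from the item counts as 0."""
    return sum(w * item.get(factor, 0.0) for factor, w in weights.items())


def rank(items, weights=WEIGHTS):
    """Return items ordered from highest to lowest score."""
    return sorted(items, key=lambda it: score(it, weights), reverse=True)


# Hypothetical items with normalised factor values.
items = [
    {"id": "a", "relevance": 0.9, "seller_track_record": 0.5, "price": 0.4,
     "listing_format": 0.0, "free_shipping": 1.0},
    {"id": "b", "relevance": 0.7, "seller_track_record": 0.9, "price": 0.8,
     "listing_format": 1.0, "free_shipping": 0.0},
]

baseline = [it["id"] for it in rank(items)]

# Testing a hypothesis: add a new factor without touching the scoring code.
with_shipping = {**WEIGHTS, "free_shipping": 0.2}
experiment = [it["id"] for it in rank(items, with_shipping)]
print(baseline, experiment)
```

Keeping the factors in a dictionary means a new hypothesis only needs a new entry, and the two orderings can be compared side by side.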

Research Use Case – Description Data Mining

• Goal
– Extend catalog coverage

• Hadoop Usage
– Leverage data mining/machine learning techniques to turn inventory descriptions into name-value pairs in a completely unsupervised way

Listing description:

BARBIE 1999 "PREMIERE NIGHT"
Home Shopping Special Edition
Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver
New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old
Free Shipping To US Only / Will Ship International / Please E-mail For Cost
Feel Free To Ask Me Any Questions Or Concerns
Smoke - Free Environment
Free Shipping

Extracted name-value pairs:

Year: 1999
Model: premiere night
Edition: home shopping special
Hair: blond
Gown: purple and silver
Condition: new / never removed from box / mint

Platform
• Metrics: Job Statistics, System/Disk Consumption, Utilization
• Infrastructure: Publish/Subscribe ETL tools, low latency data movement
• Development: Tools, Environment, IDE
• Architecture: Schemas, Metadata, Governance, Policies
• Operations: Administration, Configuration, Monitoring
• Reporting: Visualization, BI Generation, Information delivery
• Security: User & Group Management, Auth & Auth

Clusters
• Exploratory: Strategic investment, 1000-5000 nodes
• Production: Site facing, low latency, high availability
• Use Case Specific: Advertising, Trust & Safety, Merchandizing

Acknowledgments

• Athena Team

• Cloudera Inc.

• Community