"Building Data Foundations and Analytics Tools Across The Product" by Crystal Widjaja...


Building Data Foundations and Analytics Tools Across the Product

Who am I?

● Started at GO-JEK in July 2015 as the first “data” hire

● First day: Creating a Data Dictionary without any reference tables

● Yesterday: Discussions for a more advanced experimentation platform, prototyping Growth ROI formulas, QAing new data marts

Agenda

● Infrastructure for Scale

● Data Model Foundations

● Tools for Business Users

Infrastructure

GO-JEK Data Today

~27%* GROWING DATA VOLUME PER MONTH

>5000 METABASE CARDS AND TABLEAU SHEETS

>450 AVG DAILY BUSINESS USERS ON INTERNAL DATA TOOLS

4 FULL-TIME DATA WAREHOUSE DEVELOPERS

>30 BI DATA ANALYSTS

100s OF MICROSERVICES ACROSS GO-JEK

*This is only business metrics data collected by BI

GO-JEK Data Today

“The choices you made were the right choices given the facts that you had at the time.”

- Ajey Gore, CTO at GO-JEK

Storage


crontabs are fun

Data Modeling

More data to more people

Staging Layer: RAW dataset

Integration Layer: Fact/Dimension dataset

Access Layer: Summary and roll-up data

Datamart Layer: Product-specialized dataset

Current Data Architecture


Why?

1. Transparency

2. Standardization

“Can I get a list of all full-time drivers? I want to [give them a reward | put them on a beta group | interview them | …]”

What qualities make a driver a “full-time driver”?

● # of days the driver logs into the app in a week

● # of minutes a driver spends on a booking

● # of bookings a driver does per day on avg in the past X weeks

● # of minutes a driver spends logged into the app per day

● # of completed bookings a driver does in a particular service

● most common hour the driver logs into the app in the past month
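One way to see why the question has no single answer is to apply two candidate definitions to the same activity facts. The sketch below is illustrative only: the `DriverWeek` fields, both thresholds, and the sample values are assumptions made for this example, not GO-JEK definitions or data.

```python
from dataclasses import dataclass


@dataclass
class DriverWeek:
    """Hypothetical one-week summary of factual activity for a driver."""
    driver_id: int
    days_logged_in: int          # days with at least one app login this week
    avg_minutes_online: float    # average minutes logged in per day
    avg_bookings_per_day: float  # average completed bookings per day


def full_time_by_days(d: DriverWeek, min_days: int = 5) -> bool:
    # Definition A (assumed): logs in on most days of the week.
    return d.days_logged_in >= min_days


def full_time_by_minutes(d: DriverWeek, min_minutes: float = 360) -> bool:
    # Definition B (assumed): spends at least six hours online per day.
    return d.avg_minutes_online >= min_minutes


# Made-up sample values, reusing the driver ids from the slides.
drivers = [
    DriverWeek(456, days_logged_in=6, avg_minutes_online=120.0, avg_bookings_per_day=3.0),
    DriverWeek(457, days_logged_in=4, avg_minutes_online=480.0, avg_bookings_per_day=12.0),
    DriverWeek(458, days_logged_in=7, avg_minutes_online=400.0, avg_bookings_per_day=10.0),
]

by_days = [d.driver_id for d in drivers if full_time_by_days(d)]
by_minutes = [d.driver_id for d in drivers if full_time_by_minutes(d)]
```

The two lists overlap but are not identical, which is the argument for storing the underlying facts and making each definition explicit rather than baking one label into the data.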

Keep the First Data Layer Factual

● Star Schema

● Advantages

○ Clean and structured model

Merchant Dimension

id  nama           kategori_merchant
1   Warung Bu Iis  TRADISIONAL

Customer Dimension

id   nama  nomor_telepon
123  Jo    628112345678

Driver Dimension

id   nama  jenis_kelamin
456  Asep  M
457  Doni  M
458  Siti  F

Order Fact

id     id_customer  id_driver  id_merchant
10001  123          458        1

Item Fact

id   id_order  nama_item     harga
101  10001     Nasi Goreng   30000
102  10001     Es Teh Manis  5000

Driver Search Fact

id  id_driver  nama  status
1   456        Asep  Rejected
2   457        Doni  Rejected
3   458        Siti  Accepted

● Disadvantages

○ Difficult to do data discovery for non-technical users

○ Needs a lot of joins, resulting in high computational resource needs
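To make the join cost concrete, the sketch below loads the sample rows from the tables above into an in-memory SQLite database and answers one simple question: who ordered, who delivered, from which merchant, and for how much. The snake_case table names and the choice of SQLite are assumptions for illustration; the slides do not say which warehouse engine backs this schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE merchant_dim (id INTEGER PRIMARY KEY, nama TEXT, kategori_merchant TEXT);
CREATE TABLE customer_dim (id INTEGER PRIMARY KEY, nama TEXT, nomor_telepon TEXT);
CREATE TABLE driver_dim   (id INTEGER PRIMARY KEY, nama TEXT, jenis_kelamin TEXT);
CREATE TABLE order_fact   (id INTEGER PRIMARY KEY, id_customer INTEGER,
                           id_driver INTEGER, id_merchant INTEGER);
CREATE TABLE item_fact    (id INTEGER PRIMARY KEY, id_order INTEGER,
                           nama_item TEXT, harga INTEGER);

INSERT INTO merchant_dim VALUES (1, 'Warung Bu Iis', 'TRADISIONAL');
INSERT INTO customer_dim VALUES (123, 'Jo', '628112345678');
INSERT INTO driver_dim   VALUES (456, 'Asep', 'M'), (457, 'Doni', 'M'), (458, 'Siti', 'F');
INSERT INTO order_fact   VALUES (10001, 123, 458, 1);
INSERT INTO item_fact    VALUES (101, 10001, 'Nasi Goreng', 30000),
                                (102, 10001, 'Es Teh Manis', 5000);
""")

# One business question already needs four joins across five tables.
order_summary = conn.execute("""
SELECT o.id, c.nama, d.nama, m.nama, SUM(i.harga)
FROM order_fact o
JOIN customer_dim c ON c.id = o.id_customer
JOIN driver_dim   d ON d.id = o.id_driver
JOIN merchant_dim m ON m.id = o.id_merchant
JOIN item_fact    i ON i.id_order = o.id
GROUP BY o.id, c.nama, d.nama, m.nama
""").fetchone()
conn.close()
```

Here `order_summary` holds the order id, customer name, driver name, merchant name, and the summed item prices for order 10001. A non-technical user has to know all five tables and their keys to write this query, which is the discovery problem the slide describes.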

App Login Data, Bid Data, Completed Booking Data, Income Data, Driver Profile Data

Factual Activity Data

Daily Partition of Driver Activity and Profile Data in Denormalized & Nested Form

The Data Model

avg_minutes_online_past_3_days total_minutes_online_past_3_days

avg_minutes_online_past_7_days total_days_active_past_3_days

avg_minutes_online_past_30_days total_orders_completed_past_7_days

avg_income_past_3_days total_orders_completed_past_30_days

avg_income_past_7_days total_services_completed_past_7_days

total_completed_ride_past_7_days total_completed_send_past_7_days

for each driver_id...

…and +200 other data points
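Columns such as `avg_minutes_online_past_3_days` can be derived from daily factual rows with a trailing-window aggregation. The following is a minimal sketch, not the pipeline described in the talk: the tuple layout, the sample dates and values, the `window_features` helper, and the choice to divide the average by the window length (rather than by active days) are all assumptions.

```python
from datetime import date, timedelta

# Hypothetical daily facts: (driver_id, day, minutes_online, orders_completed)
daily_facts = [
    (458, date(2018, 1, 1), 300, 8),
    (458, date(2018, 1, 2), 420, 11),
    (458, date(2018, 1, 3), 0, 0),
    (458, date(2018, 1, 4), 360, 9),
]


def window_features(facts, driver_id, as_of, days):
    """Aggregate one driver's facts over the `days` days ending on `as_of` (inclusive)."""
    start = as_of - timedelta(days=days)  # exclusive lower bound
    rows = [f for f in facts if f[0] == driver_id and start < f[1] <= as_of]
    total_minutes = sum(f[2] for f in rows)
    return {
        f"total_minutes_online_past_{days}_days": total_minutes,
        # Assumption: average over the full window, including inactive days.
        f"avg_minutes_online_past_{days}_days": total_minutes / days,
        f"total_days_active_past_{days}_days": sum(1 for f in rows if f[2] > 0),
        f"total_orders_completed_past_{days}_days": sum(f[3] for f in rows),
    }


features = window_features(daily_facts, 458, date(2018, 1, 4), 3)
```

Running the same function for 3, 7, and 30 days and merging the dictionaries would give one wide, denormalized row per `driver_id` per day, so business users can filter on ready-made columns instead of writing joins.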

Tools for Scale

Lifecycle of a Data Point

One Week Old

One Month Old

3 Months Old

Let Analysts Define Events

Sample Events to Save on Costs

Better sample that data point...
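The slides do not say how events are sampled. One common approach, shown here purely as an assumed sketch, is deterministic hash-based sampling: an event is kept when a hash of its id falls below the sample rate, so the same id always gets the same decision. The `keep_event` helper and the id format are hypothetical.

```python
import hashlib


def keep_event(event_id: str, sample_rate: float) -> bool:
    """Return True if the event falls inside the sampled fraction.

    The decision depends only on the id, so it is stable across
    services and across reruns.
    """
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Keep roughly 10% of a batch of hypothetical event ids.
event_ids = [f"event-{n}" for n in range(10_000)]
kept = [e for e in event_ids if keep_event(e, 0.10)]
```

Counts computed on the kept events can be scaled by `1 / sample_rate` to estimate totals, at the price of some variance on rare events.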

Take Away

● Build for the infrastructure you have, not what you think you’ll have

● Build simple step-by-step data models with transparency

● Build tools that work for all the different stages of the company
