20
Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19) CS341: Project in Mining Massive Datasets

Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

  • Upload
    others

  • View
    3

  • Download
    4

Embed Size (px)

Citation preview

Page 1: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Advanced ML in Google Cloud (2)

Abhay Agarwal (MS Design ‘19)

CS341: Project in Mining Massive Datasets

Page 2: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Agenda

● ‘Productizing’ analytics

● Data wrangling

● Data fundamentals

● Data studio vs datalab vs colab

Page 3: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

‘Productizing’● What does it mean to ‘productize’ your ML?

Page 4: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Pitfalls in Productizing● My algorithm has a 95% accuracy -- is it ready for production?

● My algorithm has a 95% accuracy and 95% precision -- is it ready for

production?

● My algorithm has a 95% accuracy, 95% precision, and my training data is

roughly sampled from real examples -- is it ready for production?

● My algorithm has a 95% accuracy, 95% precision, training data sampled from

real examples, and my algorithm tests hypotheses that match the use cases --

is it ready for production?

Page 5: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Data wrangling

Page 6: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

DATA COLLECTION FUNDAMENTALS

6

Page 7: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Key Concepts

7

Freshness

Quality

Structure

Cost

Quantity

Page 8: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Quantity•Breadth

• Number of entities or observations• E.g., People, companies, stars, shopping trips,…• Ideally: comprehensive

•Depth• Data gathered on each entity or observation

8

Page 9: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Breadth and Depth

9

Depth

Bre

adt

h

World Bank Development Indicators

Page 10: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Structure

10

Structured Unstructured Semi-structured

Page 11: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Graph DataGraphs arise naturally in many settings

Many interesting techniques e.g., Page Rank, community detection

11Moz.com

Page 12: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Data Quality•Errors

• E.g., human labeling mistakes

•Missing data• E.g., missing addresses in customer records

•Bias• Sample bias, measurement bias, prejudice/stereotype

12

Page 13: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Data Quality: Sample Bias

13

Day Driving vs Night Driving

Tank recognition

Page 14: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Data Quality: Prejudice/Stereotype BiasAlgorithmic Law Enforcement

14The Economist, August 20, 2016

But what about perpetuating bias against minorities?

Page 15: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Data Quality: Measurement Bias

15

Page 16: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Data FreshnessRate of data collection must match rate of change of underlying phenomenon

16

Page 17: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Data manipulation in Google Cloud● Data Studio

● Datalab

● Colab

● (offline!)

Page 18: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Data Studio● Data Studio - glorified spreadsheets with a few integrations to Google Cloud

to pull data

● Use cases: excel-like functions, simple visualizations (e.g. geographic)

Page 19: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Datalab● Datalab - hosted Jupyter instance with preset libraries

● Use cases: python scripting, visualization, ML pipelining, some long-running

scripting, versioned scripts and models

Page 20: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)

Colab● Colab - Shared, no-setup version of Datalab that is designed around sharing

● Use cases: creating publicly accessible work, collaboration, but no

long-running scripting