Data Engineering Tools & Best Practices
Sriram Baskaran
Insight
Bachelor's in CS, Grad 2013
Machine Learning Engineer, 2013-2016
Insight, 2018
Master's in CS (Data Science), Grad 2018
Sriram Baskaran
Program Director, Data Engineer
linkedin.com/[email protected]
apply.insightdatascience.com
Some context
[Diagram: App, Backend]
Let’s take an example

Restaurants
id   rest_name      loc
1    Everest Momo   Sunnyvale
2    Cafe Centro    San Francisco
...  ...            ...

Customers
id   user_name  user_base_loc
101  James      San Jose
102  Mark       San Francisco
...  ...        ...
Why Relational?
● Rows of my tables are accessed together.
○ Single row, all columns
○ All relational databases follow this pattern: Postgres, MySQL, Oracle
○ Huge amount of planning is required to design good schemas!
■ No flexibility for schema changes
Reviews
id    cust_id  rest_id  rating
1001  101      1        3
1002  102      1        5
...   ...      ...      ...
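The three tables above can be sketched in code. This is a minimal illustration only: it uses Python's built-in sqlite3 as a stand-in for a backend database such as Postgres, and the function name `reviews_with_names` is hypothetical.

```python
import sqlite3

# In-memory SQLite as a stand-in for a backend database such as Postgres.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE restaurants (id INTEGER PRIMARY KEY, rest_name TEXT, loc TEXT);
CREATE TABLE customers (id INTEGER PRIMARY KEY, user_name TEXT, user_base_loc TEXT);
CREATE TABLE reviews (
    id INTEGER PRIMARY KEY,
    cust_id INTEGER REFERENCES customers(id),
    rest_id INTEGER REFERENCES restaurants(id),
    rating INTEGER
);
INSERT INTO restaurants VALUES
    (1, 'Everest Momo', 'Sunnyvale'), (2, 'Cafe Centro', 'San Francisco');
INSERT INTO customers VALUES
    (101, 'James', 'San Jose'), (102, 'Mark', 'San Francisco');
INSERT INTO reviews VALUES (1001, 101, 1, 3), (1002, 102, 1, 5);
""")


def reviews_with_names(conn):
    """Join the three tables: who rated which restaurant, and how?"""
    query = """
        SELECT c.user_name, r.rest_name, v.rating
        FROM reviews AS v
        JOIN customers AS c ON c.id = v.cust_id
        JOIN restaurants AS r ON r.id = v.rest_id
        ORDER BY v.id
    """
    return conn.execute(query).fetchall()


rows = reviews_with_names(conn)
```

Each review row stores only ids, so any readable answer needs a join back to Restaurants and Customers.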
Backend Databases
● Mostly relational: Postgres and MySQL are popular.
● Based on relational algebra and Codd’s model! It’s important to know this!
● Things to know: SQL, ER modeling.
○ Crow’s foot notation
● Most of the data for your data pipelines starts here
○ It is important to understand backend databases.
● Binary data such as images is stored separately
○ Caching and Content Delivery Networks
Data Engineering starts here
Data engineering
● Extensions and analytics on backend databases.
● Building pipelines to move data from A to B.
● Ingest and store data in efficient storage systems.
● Ability to handle large-scale data processing.
● Automating a large part of ETL work
Agenda
Storing / Ingesting Data
Processing Data
Visualizing Data
Scheduling and Monitoring!
Agenda - focus
Storing / Ingesting Data
Processing Data
Visualizing Data
Scheduling and Monitoring!
Storing Data
● Database and storage systems are the most underrated tools.
● Processing hinges on good storage of data
● Good storage removes additional transformations in the processing stage.
Normalized: Restaurants, Customers, Ratings
Joins happen every time.
Denormalized: All Data
Star Schema (but prod is not optimized; let’s fix that in a bit)
Joins don’t happen here
Load on the production database.
Build a warehouse that is independent of your prod database
Transactional Database
Some way to sync
Analytical Database
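A minimal sketch of the idea, assuming a simple full-refresh sync: the joins run once while copying from the transactional database into a separate analytical one, so analytical queries need no joins. Two in-memory SQLite databases stand in for the real systems, and the names `sync_to_warehouse` and `reviews_flat` are hypothetical.

```python
import sqlite3


def make_transactional_db():
    """Normalized production tables, using the sample rows from the slides."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, rest_name TEXT, loc TEXT);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, user_name TEXT, user_base_loc TEXT);
    CREATE TABLE reviews (id INTEGER PRIMARY KEY, cust_id INTEGER,
                          rest_id INTEGER, rating INTEGER);
    INSERT INTO restaurants VALUES
        (1, 'Everest Momo', 'Sunnyvale'), (2, 'Cafe Centro', 'San Francisco');
    INSERT INTO customers VALUES
        (101, 'James', 'San Jose'), (102, 'Mark', 'San Francisco');
    INSERT INTO reviews VALUES (1001, 101, 1, 3), (1002, 102, 1, 5);
    """)
    return db


def sync_to_warehouse(prod, warehouse):
    """Full-refresh sync: join once here so analysts never have to."""
    rows = prod.execute("""
        SELECT v.id, c.user_name, c.user_base_loc, r.rest_name, r.loc, v.rating
        FROM reviews v
        JOIN customers c ON c.id = v.cust_id
        JOIN restaurants r ON r.id = v.rest_id
    """).fetchall()
    warehouse.execute("DROP TABLE IF EXISTS reviews_flat")
    warehouse.execute("""
        CREATE TABLE reviews_flat (
            review_id INTEGER, user_name TEXT, user_base_loc TEXT,
            rest_name TEXT, rest_loc TEXT, rating INTEGER)
    """)
    warehouse.executemany(
        "INSERT INTO reviews_flat VALUES (?, ?, ?, ?, ?, ?)", rows)
    warehouse.commit()
    return len(rows)


prod = make_transactional_db()
warehouse = sqlite3.connect(":memory:")
synced = sync_to_warehouse(prod, warehouse)

# Analytical query on the denormalized table: no joins, no load on prod.
avg_by_restaurant = warehouse.execute(
    "SELECT rest_name, AVG(rating) FROM reviews_flat GROUP BY rest_name"
).fetchall()
```

A full refresh is the simplest possible sync; real pipelines usually move only the changes, which is harder to get right.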
What are our options?
● You will come across
○ Postgres
○ MySQL
○ Oracle
○ Druid
○ Redshift
○ Elasticsearch
○ Cassandra
○ Memcached
○ Redis
○ DynamoDB
○ Couchbase
○ Flat files (S3)
Pick a database after knowing the access patterns
Analytical in Relational
● OLAP is pretty powerful.
○ Use of ROLLUP and CUBE operations
○ Star schema and snowflake schema are pretty nice.
○ Examples: Postgres, Oracle, SQL Server, MySQL
● Good, but it will not scale well, mainly because of the way the data is stored (row by row).
● Schema is rigid, so changes are very hard.
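To make ROLLUP concrete, here is a small sketch of what a `GROUP BY ROLLUP (loc, rest_name)` computes: totals for every prefix of the grouping keys, down to a grand total. It is written in plain Python rather than SQL; the `rollup` helper is hypothetical, and the Cafe Centro review row is made up for illustration.

```python
from collections import defaultdict


def rollup(rows, keys, value):
    """Sum `value` at every prefix of `keys`, like GROUP BY ROLLUP(keys).

    Rolled-up key positions are reported as None (SQL reports NULL).
    """
    totals = defaultdict(int)
    for row in rows:
        for depth in range(len(keys), -1, -1):
            prefix = tuple(row[k] for k in keys[:depth])
            group = prefix + (None,) * (len(keys) - depth)
            totals[group] += row[value]
    return dict(totals)


reviews = [
    {"loc": "Sunnyvale", "rest_name": "Everest Momo", "rating": 3},
    {"loc": "Sunnyvale", "rest_name": "Everest Momo", "rating": 5},
    # Hypothetical extra row so that there is more than one location.
    {"loc": "San Francisco", "rest_name": "Cafe Centro", "rating": 4},
]

totals = rollup(reviews, ["loc", "rest_name"], "rating")
```

CUBE goes one step further and aggregates over every subset of the keys, not just the prefixes.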
Groupings and Aggregations
● Columnar
○ Druid
○ Redshift
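A rough sketch of why columnar layouts suit groupings and aggregations, using plain Python structures as a stand-in for real storage engines (it does not reflect how Druid or Redshift are implemented). The helper names are hypothetical.

```python
# Row-oriented layout: each record is stored together (good for "one row, all columns").
rows = [
    (1001, 101, 1, 3),
    (1002, 102, 1, 5),
]
column_names = ["id", "cust_id", "rest_id", "rating"]


def to_columnar(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def avg_rating_row_store(rows):
    """Has to walk every full record even though only one field is needed."""
    return sum(row[3] for row in rows) / len(rows)


def avg_rating_column_store(columns):
    """Reads just the 'rating' column; the other columns are never touched."""
    ratings = columns["rating"]
    return sum(ratings) / len(ratings)


columns = to_columnar(rows, column_names)
row_avg = avg_rating_row_store(rows)
col_avg = avg_rating_column_store(columns)
```

Both give the same answer; the difference is how much data has to be read to get it, which matters once the table no longer fits in memory.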
Search through unstructured text
● LIKE with % wildcards in SQL is not efficient, especially with a leading wildcard.
○ SELECT * FROM reviews WHERE review_text LIKE '%great%'
○ SELECT * FROM reviews WHERE review_text LIKE 'Loved%'
● Indexing of unstructured text needs to be really good
○ Elasticsearch
○ Solr
● E.g., searching the text in the review
● Each tool is built on an inverted index made of “postings lists”, which makes search faster.
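A toy version of the postings-list idea, as a sketch only: each term maps to a sorted list of the review ids that contain it, so a search looks up terms instead of scanning every review. The review texts and function names are made up for illustration, and real engines add much more (scoring, stemming, compression).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(docs):
    """Map each term to a sorted postings list of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [set(index.get(term, ())) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical review texts keyed by review id.
reviews = {
    1001: "Loved the momos, great service",
    1002: "Great food but slow service",
    1003: "Not great, not terrible",
}
index = build_index(reviews)
hits = search(index, "great service")
```

The index is built once at write time, which is the trade: slower ingestion in exchange for fast lookups.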
Caching
● Temporary in-memory storage
○ Redis
○ Memcached
● Optimized for fast storage and retrieval. Key-value store (not a document store)
● Use reasonable keys so the hashing algorithm is not a bottleneck
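One common way to use such a cache is the cache-aside pattern, sketched below under loud assumptions: a small in-process `TTLCache` class stands in for Redis or Memcached, and `load_restaurant_from_db` is a hypothetical placeholder for a real database query.

```python
import time


class TTLCache:
    """Tiny in-memory key-value store with expiry (a stand-in for Redis/Memcached)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave as a cache miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)


db_calls = []  # records how often the "database" is actually hit


def load_restaurant_from_db(rest_id):
    """Hypothetical placeholder for a slow backend query."""
    db_calls.append(rest_id)
    return {"id": rest_id, "rest_name": "Everest Momo", "loc": "Sunnyvale"}


def get_restaurant(cache, rest_id):
    """Cache-aside: try the cache first, fall back to the database on a miss."""
    key = f"restaurant:{rest_id}"  # short, predictable key
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_restaurant_from_db(rest_id)
    cache.set(key, value)
    return value


cache = TTLCache(ttl_seconds=60)
first = get_restaurant(cache, 1)
second = get_restaurant(cache, 1)  # served from the cache
```

The expiry keeps stale entries from living forever, since the cache is only temporary storage.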
How to pick one?
● Make educated and reasonable assumptions
○ Type of data
○ Access patterns
○ Scaling factor (most databases are designed to scale in their “domain”)
● Read a lot, and never stop reading.
● Use it in a project
○ There are hundreds of large open datasets available.
○ Start with GDELT (https://www.gdeltproject.org/data.html)
Complexities of communication
● The more tools you have, the harder it is to communicate between them
● Keeping databases in sync is one of the main challenges in the industry.
● Kafka may be a solution
○ Acts as a message bus
○ Use Kafka Connect to bridge systems
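The message-bus idea can be sketched without Kafka itself. The toy class below is not the Kafka API; it only illustrates the shape of the pattern: producers append to a topic log once, and each downstream system reads at its own pace through its own offset. All names, and the third review message, are hypothetical.

```python
from collections import defaultdict


class MessageBus:
    """Append-only topic logs; each consumer group tracks its own offset."""

    def __init__(self):
        self._topics = defaultdict(list)
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def poll(self, topic, group):
        """Return messages this group has not seen yet and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(topic, group)]
        self._offsets[(topic, group)] = len(log)
        return log[start:]


bus = MessageBus()
bus.publish("reviews", {"id": 1001, "rest_id": 1, "rating": 3})
bus.publish("reviews", {"id": 1002, "rest_id": 1, "rating": 5})

# Two independent consumers read the same log without knowing about each other.
warehouse_batch = bus.poll("reviews", group="warehouse-sync")
search_batch = bus.poll("reviews", group="search-indexer")

# Hypothetical later message: only the new entry is delivered next time.
bus.publish("reviews", {"id": 1003, "rest_id": 2, "rating": 4})
warehouse_next = bus.poll("reviews", group="warehouse-sync")
```

With a bus in the middle, adding one more database means adding one more consumer, not one more point-to-point integration.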
Remember our Denormalized issue?
Denormalized: All Data
Star Schema (but prod is not optimized; let’s fix that in a bit)
Joins don’t happen here
[Diagram: App, Backend]
Agenda - for completion
Storing / Ingesting Data
Processing Data
Visualizing Data
Scheduling and Monitoring!
We are talking about scale!
● Tackling two problems: time and space
○ Data size is greater than the size of your main memory
○ Data cannot fit entirely in memory.
○ It takes too long to compute
● Distributed computing is a popular solution
○ Hadoop, Spark, Presto, Hive
○ Kafka is gaining popularity in processing too
● Example: scrape menu items for each restaurant
○ Go to each restaurant’s website
○ Scrape it
○ Parse the website
○ Find the menu content and process it.
Yelp - update menu items
Yelp’s Database (Postgres)
1. Get URL
2. Get actual content from the internet
3. Process text and store results
Yelp - update menu items - 1 million urls!
1. Custom way to get URLs
2. Each script accesses the internet separately
3. Each script processes text and stores results
Yelp’s Database
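The three steps can be sketched as a fan-out over many workers. This is only an illustration on one machine with a thread pool, standing in for a real cluster framework such as Spark; `fetch` is a stub that returns fake HTML so the sketch runs offline, and the example.com URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stub for an HTTP GET; returns fake HTML so the sketch runs offline."""
    return f"<html><ul class='menu'><li>Item from {url}</li></ul></html>"


def parse_menu(html):
    """Very naive parse: pull the text between <li> and </li>."""
    items = []
    for chunk in html.split("<li>")[1:]:
        items.append(chunk.split("</li>")[0])
    return items


def update_menu(url):
    """Steps 2 and 3 for a single URL: get the content, then process it."""
    return url, parse_menu(fetch(url))


def update_all(urls, workers=8):
    """Fan the URLs out over a pool of workers and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(update_menu, urls))


# Placeholder URLs; in the slides these would come from the database.
urls = [f"https://example.com/restaurant/{i}" for i in range(100)]
menus = update_all(urls)
```

Each URL is processed independently of the others, which is what makes the job easy to spread across scripts or machines.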
ML Training at Scale
● Use distributed computing to scale your training.
● Compute weights in a fast and efficient manner.
○ Sparkling Water wrapper: https://github.com/h2oai/sparkling-water
○ H2O
What about Speed/Velocity?
● Data can be an unbounded stream of information
● Example: processing reviews for each restaurant, e.g. running POS tagging on each one.
[Diagram: stream of reviews (..., r50, r52, r53, ...); Reviews table; Batch Processing; POS Tagging Model]
What about Speed/Velocity?
● Data can be an unbounded stream of information
● Need a robust system
● Example: processing reviews
[Diagram: stream of reviews (..., r50, r52, r53, ...); Spark Streaming (Micro-batches); Reviews table; POS Tagging Model]
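Micro-batching can be sketched in a few lines: cut the unbounded stream into small batches and run the same batch logic on each. This is not the Spark Streaming API, and Spark cuts batches by time interval whereas this sketch cuts by count; `tag_batch` is a placeholder for the POS tagging model, and the review texts are made up.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to `batch_size` items until the stream is exhausted."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def tag_batch(batch):
    """Placeholder for the POS tagging model: counts tokens per review."""
    return [(review_id, len(text.split())) for review_id, text in batch]


def review_stream():
    """Hypothetical incoming reviews; a real stream would never end."""
    yield ("r50", "Loved the momos")
    yield ("r52", "Great food but slow service")
    yield ("r53", "Not great")


results = [tag_batch(batch)
           for batch in micro_batches(review_stream(), batch_size=2)]
```

Because `micro_batches` is a generator, it never needs the whole stream in memory at once.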
Visualize the output data
● It’s like building a software application
○ Consider end users
○ What is the most intuitive way to see this information?
● Your professor will have given even better examples
● Do not reinvent the wheel
○ Tableau (education edition)
○ Kibana (self-setup)
○ Mode (paid)
○ Looker (paid)
○ Plotly (open source, free)
○ Dash (abstraction around Plotly, free)
○ MATLAB (not used much in industry)
If you are not able to show it in a good way, there was no need to process it!
Putting together a pipeline
[Diagram, built up step by step: App, Backend, Transactional database, Event Store, Spark Streaming (Micro-batches), POS Tagging Model]
How to automate the tasks?
Scheduling & Monitoring
● Scheduling tasks in a sequence
● Easy to specify dependencies
● Code-based configuration
● Easy to deploy and manage
● Every batch pipeline needs a scheduler to automate tasks.
● Handling failure
● Also allows backfill.
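A minimal sketch of what a scheduler does with code-based dependencies: order the tasks so that each runs after the tasks it depends on, and retry on failure. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the task names and the `run_pipeline` helper are hypothetical, and real schedulers add timing, logging, and alerting on top.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task."""
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            # No attempt succeeded: stop so downstream tasks do not run.
            raise RuntimeError(
                f"task {name!r} failed after {max_retries + 1} attempts")
    return log


attempts = {"count": 0}


def flaky_transform():
    """Fails on the first call only, to exercise the retry path."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ValueError("transient failure")


tasks = {
    "extract": lambda: None,
    "transform": flaky_transform,
    "load": lambda: None,
}
# Each key lists the tasks that must finish before it can start.
dependencies = {"transform": {"extract"}, "load": {"transform"}}

log = run_pipeline(tasks, dependencies)
```

Keeping the dependencies as data rather than as call order is what lets a scheduler re-run just one failed step.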
Backfill
[Diagram: timeline of events in time with a gap marked "??", followed by the same timeline with the gap filled in by a backfill]
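A backfill re-runs the pipeline for the periods that were missed. The sketch below assumes a daily job and a set of dates that were already processed; the dates and the helper names are made up for illustration.

```python
from datetime import date, timedelta


def missing_days(start, end, processed):
    """List every day in [start, end] that has not been processed yet."""
    day = start
    gaps = []
    while day <= end:
        if day not in processed:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def backfill(start, end, processed, run_for_day):
    """Run the daily job for each missing day, oldest first."""
    for day in missing_days(start, end, processed):
        run_for_day(day)
        processed.add(day)


# Hypothetical state: January 3 and 4 were never processed.
processed = {date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 5)}
ran = []
backfill(date(2020, 1, 1), date(2020, 1, 5), processed, ran.append)
```

This only works if the job takes the day as a parameter, rather than always processing "today".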
Think ahead, Think smart
● Get all data into one place (know about data warehousing)
● Understand the why behind any tool choices
● Expect future requests from stakeholders
● Learn by collaborating; know all the different ways data can be stored, processed and visualized.
● Constantly learn; know the latest updates in a tool
○ Start with the basics of why the tool was built
● Learn these five: Kafka, Spark, Cassandra, Postgres (PostGIS), Redshift
● Managed: Lambda, Redshift, DynamoDB, S3
Start using cloud resources
● Students get $300 in credits on both AWS and GCP. Start using them.
● Spin up compute resources
● Try out labs for managed services.
● AWS for Students
○ AWS Lambda
○ AWS Redshift
○ AWS DynamoDB
More resources
● Data Engineering Tools (Visualized)
● Rise of a Data Engineer
● Preparing for Transition into a Data Engineer
● What’s Parquet?
● More blog posts from Insight
Or!
Insight
Insight Offerings - Which one to pick
Data Science Program
● PhD in quantitative fields.
● Have worked in analysing data.
● Good problem solving skills
Data Engineering Program
● Engineering background.
● Have built and maintained engineering systems.
● Java/Python
Health Data Science Program
● Postdoctoral researcher, medical doctors
● Interested in genome sequences, clinical trials.
Artificial Intelligence Program
● Engineering background.
● Have worked on training and deploying ML or NN.
DevOps Engineering Program
● Systems admin and Linux background.
● Problem solver and critical thinker.
● Can understand containerized systems.
New Programs - More focused domains
● Designing security measures
● Building secure applications.
● Blockchain technology
● Smart contract management
● Decentralized architectures
Where are we?
Seattle
Portland
San Francisco
Los Angeles
Austin
Chicago
New York
Boston
Toronto
In Person
Remote
Apply to Insight
● 3 sessions a year
● Apply when you are ready for full-time work
● Prepare a role-driven resume
● Read our blog posts
● Contact alumni
● Application process:
○ Resume + Application Form
○ Interview
Note: Data Engineering program has a Coding challenge before the interview.
Applications open for June 2020 Session!
apply.insightdatascience.com
Sign up for the notifications list