Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data

Data Engineering Tools & Best Practices
Sriram Baskaran
Insight

Sriram Baskaran
Program Director, Data Engineer

● Bachelors in CS (Grad 2013)
● Machine Learning Engineer (2013-2016)
● Masters in CS (Data Science) (Grad 2018)
● Insight (2018)

linkedin.com/[email protected]

apply.insightdatascience.com


https://apply.insightdatascience.com/?utm_source=info_session&utm_medium=inperson&utm_campaign=USC-Fall

Some context

Let’s take an example

[Diagram: App, Backend]

Restaurants
id   rest_name     loc
1    Everest Momo  Sunnyvale
2    Cafe Centro   San Francisco
...  ...           ...

Customers
id   user_name  user_base_loc
101  James      San Jose
102  Mark       San Francisco
...  ...        ...

Why Relational?

● Rows of my tables are accessed together.
  ○ Single row, all columns
  ○ All relational databases follow this pattern: Postgres, MySQL, Oracle
  ○ A huge amount of planning is required to design good schemas!
    ■ Little flexibility for schema changes

[Diagram: the Restaurants and Customers tables, plus a Reviews table]

Reviews
id    cust_id  rest_id  rating
1001  101      1        3
1002  102      1        5
...   ...      ...      ...

Backend Databases

● Mostly relational: Postgres and MySQL are popular.
● Based on relational algebra and Codd’s model. It’s important to know this!
● Things to know: SQL, ER modeling.
  ○ Crow’s foot notation
● Most of the data for your data pipelines starts here.
  ○ It is important to understand backend databases.
● Binary formats like images are stored separately.
  ○ Caching and Content Delivery Networks

https://www.keycdn.com/support/cdn-architecture
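A minimal sketch of the three example tables as a relational schema. It uses Python's built-in `sqlite3` as a stand-in for Postgres or MySQL (an assumption, chosen so the example runs anywhere). The `CHECK` on `rating` and the helper `insert_review` are hypothetical and not from the slides.

```python
import sqlite3

# In-memory SQLite stands in for Postgres/MySQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE restaurants (
    id        INTEGER PRIMARY KEY,
    rest_name TEXT NOT NULL,
    loc       TEXT NOT NULL
);
CREATE TABLE customers (
    id            INTEGER PRIMARY KEY,
    user_name     TEXT NOT NULL,
    user_base_loc TEXT NOT NULL
);
-- One customer writes many reviews; one restaurant receives many reviews.
CREATE TABLE reviews (
    id      INTEGER PRIMARY KEY,
    cust_id INTEGER NOT NULL REFERENCES customers(id),
    rest_id INTEGER NOT NULL REFERENCES restaurants(id),
    rating  INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5)  -- assumed 1-5 scale
);
""")

conn.executemany("INSERT INTO restaurants VALUES (?, ?, ?)",
                 [(1, "Everest Momo", "Sunnyvale"), (2, "Cafe Centro", "San Francisco")])
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(101, "James", "San Jose"), (102, "Mark", "San Francisco")])
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)",
                 [(1001, 101, 1, 3), (1002, 102, 1, 5)])


def insert_review(review_id, cust_id, rest_id, rating):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO reviews VALUES (?, ?, ?, ?)",
                     (review_id, cust_id, rest_id, rating))
        return True
    except sqlite3.IntegrityError:
        return False
```

The foreign keys are what an ER diagram's relationship lines turn into: a review that points at a customer or restaurant that does not exist is rejected by the database itself.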

Data Engineering starts here

Data engineering

● Extensions and analytics on backend databases.
● Building pipelines to move data from A to B.
● Ingesting and storing data in efficient storage systems.
● Ability to handle large-scale data processing.
● Automating a large part of ETL work.

Agenda

● Storing / Ingesting Data
● Processing Data
● Visualizing Data
● Scheduling and Monitoring!

Agenda - focus

● Storing / Ingesting Data
● Processing Data
● Visualizing Data


Storing Data

● Databases and storage systems are the most underrated tools.
● Processing hinges on good storage of data.
● Good storage removes additional transformations in the processing stage.


Storing Data


Normalized: Restaurants, Customers, Ratings

Joins happen every time.

Storing Data


Denormalized: All Data

Star Schema (but prod is not optimized; let’s fix that in some time)

Joins don’t happen here
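To make the normalized versus denormalized contrast concrete, here is a small sketch using `sqlite3` with the example rows from the earlier slides. The table name `reviews_wide` is hypothetical; it plays the role of the "All Data" table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE restaurants (id INTEGER PRIMARY KEY, rest_name TEXT, loc TEXT);
CREATE TABLE customers (id INTEGER PRIMARY KEY, user_name TEXT, user_base_loc TEXT);
CREATE TABLE reviews (id INTEGER PRIMARY KEY, cust_id INTEGER, rest_id INTEGER, rating INTEGER);
INSERT INTO restaurants VALUES (1, 'Everest Momo', 'Sunnyvale'), (2, 'Cafe Centro', 'San Francisco');
INSERT INTO customers VALUES (101, 'James', 'San Jose'), (102, 'Mark', 'San Francisco');
INSERT INTO reviews VALUES (1001, 101, 1, 3), (1002, 102, 1, 5);
""")

# Normalized: every read that needs names has to join three tables.
NORMALIZED_QUERY = """
SELECT c.user_name, r.rest_name, v.rating
FROM reviews v
JOIN customers c ON c.id = v.cust_id
JOIN restaurants r ON r.id = v.rest_id
ORDER BY v.id
"""
normalized_rows = conn.execute(NORMALIZED_QUERY).fetchall()

# Denormalized: pay for the joins once, when the wide table is built.
conn.execute("""
CREATE TABLE reviews_wide AS
SELECT v.id AS review_id, c.user_name, c.user_base_loc, r.rest_name, r.loc, v.rating
FROM reviews v
JOIN customers c ON c.id = v.cust_id
JOIN restaurants r ON r.id = v.rest_id
""")

# Reads against the wide table need no joins at all.
denormalized_rows = conn.execute(
    "SELECT user_name, rest_name, rating FROM reviews_wide ORDER BY review_id"
).fetchall()
```

Both queries return the same rows. The trade-off is that the wide table repeats customer and restaurant values and has to be rebuilt or kept in sync when the source tables change.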

Storing Data



● Querying prod directly puts load on the production database.
● Build a warehouse that is independent of your prod database.
● You need some way to sync the two.

[Diagram: Transactional Database, synced to an Analytical Database]

What are our options?

● You will come across
  ○ Postgres
  ○ MySQL
  ○ Oracle
  ○ Druid
  ○ Redshift
  ○ Elasticsearch
  ○ Cassandra
  ○ Memcached
  ○ Redis
  ○ Dynamo
  ○ Couchbase
  ○ Flat files (S3)

Pick a database after knowing the access patterns

Analytical in Relational

● OLAP is pretty powerful.
  ○ Use of ROLLUP and CUBE operations
  ○ Star schema and snowflake schema are pretty nice.
  ○ Examples: Postgres, Oracle, SQL Server, MySQL
● Good, but it will not scale well, mainly due to the way the data is stored.
● The schema is rigid, so changes are very hard.
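As a rough illustration of what ROLLUP computes, the sketch below reproduces `GROUP BY ROLLUP(loc, rest_name)` in plain Python over a few hypothetical rating rows. The data and the function name `rollup_avg` are made up for the example; a real warehouse would do this in SQL.

```python
from collections import defaultdict

# Hypothetical denormalized rating rows: (loc, rest_name, rating).
ROWS = [
    ("Sunnyvale", "Everest Momo", 3),
    ("Sunnyvale", "Everest Momo", 5),
    ("San Francisco", "Cafe Centro", 1),
]


def rollup_avg(rows):
    """Average rating for each prefix of (loc, rest_name), like GROUP BY ROLLUP(loc, rest_name)."""
    sums = defaultdict(lambda: [0, 0])  # key -> [total, count]
    for loc, rest_name, rating in rows:
        # ROLLUP aggregates at every prefix of the column list; None marks "all values".
        for key in ((loc, rest_name), (loc, None), (None, None)):
            sums[key][0] += rating
            sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


result = rollup_avg(ROWS)
```

CUBE goes one step further and aggregates over every subset of the columns, so it would also produce a `(None, rest_name)` level.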

Groupings and Aggregations

● Columnar
  ○ Druid
  ○ Redshift
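A toy sketch of why columnar layouts suit groupings and aggregations. It only models the layout with Python lists and dicts; real systems such as Druid or Redshift add compression, encoding and distribution that are not shown here. All names are hypothetical.

```python
# Row-oriented: each record is stored together (good for "fetch review 1001").
ROW_STORE = [
    {"id": 1001, "cust_id": 101, "rest_id": 1, "rating": 3},
    {"id": 1002, "cust_id": 102, "rest_id": 1, "rating": 5},
]


def to_columns(rows):
    """Pivot a list of records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


COLUMN_STORE = to_columns(ROW_STORE)


def avg_rating_rows(rows):
    # Walks every record, even though only one field is needed.
    return sum(row["rating"] for row in rows) / len(rows)


def avg_rating_columns(columns):
    # Reads only the "rating" column; the other columns are never touched.
    ratings = columns["rating"]
    return sum(ratings) / len(ratings)
```

On disk the difference matters more than in memory: an aggregate over one column only has to scan that column's data instead of every full row.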


Search through unstructured text

● LIKE with % in SQL is not efficient for searching text.
  ○ SELECT * FROM reviews WHERE review_text LIKE '%great%'
  ○ SELECT * FROM reviews WHERE review_text LIKE 'Loved%'
● Indexing of unstructured text needs to be really good.
  ○ Elasticsearch
  ○ Solr
● E.g., searching the text of a review.
● Each of these tools uses a data structure called a “postings list”, which makes search faster.
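The postings-list idea can be sketched in a few lines: map each term to the list of documents that contain it, then answer a query by intersecting those lists. The review texts below are hypothetical, and this is a teaching sketch, not how Elasticsearch or Solr are implemented internally.

```python
import re
from collections import defaultdict

# Hypothetical review texts keyed by review id.
REVIEWS = {
    1001: "Loved the momos, great service",
    1002: "Great food but slow service",
    1003: "Not great, not terrible",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of document ids containing it (its postings list)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, *terms):
    """Return ids of documents containing all the terms (intersection of postings lists)."""
    lists = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []


INDEX = build_index(REVIEWS)
```

A lookup such as `search(INDEX, "great", "service")` touches only two postings lists, whereas `LIKE '%great%'` has to scan the text of every row.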

Caching

● Temporary in-memory storage
  ○ Redis
  ○ Memcached
● Optimized for fast storage and retrieval. Key-value store (not a document store).
● Use reasonable keys so the hashing algorithm is not a bottleneck.
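A minimal sketch of the caching pattern, with a tiny in-memory class standing in for Redis or Memcached. `TTLCache`, `get_restaurant` and the key format are hypothetical names for illustration and are not the API of either tool.

```python
import time


class TTLCache:
    """Tiny in-memory key-value cache with per-entry expiry (a stand-in for Redis/Memcached)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)


def get_restaurant(cache, rest_id, load_from_db):
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    key = f"restaurant:{rest_id}"  # short, predictable key
    value = cache.get(key)
    if value is None:
        value = load_from_db(rest_id)
        cache.set(key, value)
    return value
```

The expiry keeps stale entries from living forever, which is why the storage is described as temporary.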

How to pick one?

● Make educated and reasonable assumptions about:
  ○ Type of data
  ○ Access patterns
  ○ Scaling factor (most databases are designed to scale in their “domain”)
● Read a lot, and never stop reading.
● Use it in a project.
  ○ There are hundreds of large open datasets available.
  ○ Start with GDELT (https://www.gdeltproject.org/data.html)

Complexities of communication

● The more tools you have, the harder it is to communicate between them.
● Keeping databases in sync is one of the main challenges in the industry.
● Kafka may be a solution.
  ○ Acts as a message bus
  ○ Use Kafka Connect to bridge systems
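The message-bus idea can be sketched without a broker: producers append to a shared log, and each downstream system reads at its own pace by tracking its own offset. The `Topic` class below is a hypothetical in-memory stand-in and is not the Kafka client API.

```python
class Topic:
    """Append-only log; each consumer tracks its own read offset (in-memory stand-in for a message bus)."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer name -> index of the next message to read

    def publish(self, message):
        self._log.append(message)

    def poll(self, consumer):
        """Return messages this consumer has not seen yet and advance its offset."""
        start = self._offsets.get(consumer, 0)
        messages = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return messages


reviews_topic = Topic()
reviews_topic.publish({"id": 1001, "cust_id": 101, "rest_id": 1, "rating": 3})

# Two independent consumers (say, a warehouse loader and a search indexer)
# each receive every message, without the producer knowing about either.
warehouse_batch = reviews_topic.poll("warehouse")
search_batch = reviews_topic.poll("search")
```

The backend writes each change once, and every database that needs a copy subscribes, instead of each pair of systems needing its own sync job.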

Remember our Denormalized issue?

Star Schema (but prod is not optimized; let’s fix that in some time)

[Diagram: App, Backend]

Agenda - for completion

● Storing / Ingesting Data
● Processing Data
● Visualizing Data


We are talking about scale!

● Tackling two problems: time and space
  ○ Data size is greater than the size of your “main memory”.
  ○ Data cannot fit entirely.
  ○ It takes too long to compute.
● Distributed computing is a popular solution.
  ○ Hadoop, Spark, Presto, Hive
  ○ Kafka is gaining popularity in processing too.
● Example: scrape menu items for each restaurant.
  ○ Go to each restaurant’s website.
  ○ Scrape it.
  ○ Parse the website.
  ○ Find the menu content and process it.
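The parse step of the scraping example might look like the sketch below, which uses only the standard library's `html.parser`. The page content and the `menu-item` class name are assumptions made up for the example; real restaurant sites will each need their own rules.

```python
from html.parser import HTMLParser


class MenuParser(HTMLParser):
    """Collect the text of every <li class="menu-item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "li" and ("class", "menu-item") in attrs:
            self._in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data


def extract_menu(html):
    """Return the menu item names found in one page."""
    parser = MenuParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items if item.strip()]


# Hypothetical page content; a real pipeline would download this first.
SAMPLE_PAGE = """
<html><body>
  <h1>Everest Momo</h1>
  <ul>
    <li class="menu-item">Steamed Momo</li>
    <li class="menu-item">Chow Mein</li>
    <li>Opening hours: 11-9</li>
  </ul>
</body></html>
"""

menu = extract_menu(SAMPLE_PAGE)
```

One page is cheap to process. The scale problem comes from repeating this for every restaurant.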

Yelp - update menu items

1. Get URL
2. Get actual content from the internet
3. Process text and store results

[Diagram: a script reading from and writing to Yelp’s Database (Postgres)]

Yelp - update menu items - 1 million URLs!

1. Custom way to get URLs
2. Each script accesses the content separately
3. Each script processes text and stores results

[Diagram: several copies of the script, each connected to Yelp’s Database]
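Because each URL is independent, the work splits cleanly across workers. The sketch below shows the idea on one machine with a thread pool. `fetch` is a hypothetical stand-in that fabricates page content instead of making HTTP requests, and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for an HTTP GET; returns fake page content."""
    return f"<html>menu for {url}</html>"


def process(url):
    """One unit of work: fetch a page and reduce it to a small result."""
    content = fetch(url)
    return url, len(content)


def run_all(urls, workers=8):
    # Each URL is independent, so the work can be spread over many workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process, urls))


URLS = [f"https://example.com/restaurant/{i}" for i in range(100)]
RESULTS = run_all(URLS)
```

A distributed framework applies the same split across many machines and also takes care of retries and partitioning, which hand-written scripts would have to reinvent.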

ML Training at Scale

● Use distributed computing to scale your training.
● Compute weights in a fast and efficient manner.
  ○ H2O
  ○ Sparkling Water wrapper: https://github.com/h2oai/sparkling-water

What about Speed/Velocity?

● Data can be an unbounded stream of information.
● Example: processing reviews for each restaurant, e.g. running POS tagging.

[Diagram: review stream (..., r50, r52, r53, ...); Reviews table (id, cust_id, rest_id, rating); Batch Processing; POS Tagging Model]

What about Speed/Velocity?

● Data can be an unbounded stream of information.
● We need a robust system.
● Example: processing reviews

[Diagram: review stream (..., r50, r52, r53, ...); Spark Streaming (Micro-batches); Reviews table (id, cust_id, rest_id, rating); POS Tagging Model]
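The micro-batch idea is to cut an unbounded stream into small finite chunks and run ordinary batch code on each chunk. The sketch below batches by count for simplicity, which is an assumption: Spark Streaming forms its micro-batches by time interval. `tag_batch` is a hypothetical placeholder for the POS tagging model and only counts words.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) iterator into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def tag_batch(reviews):
    """Hypothetical stand-in for the POS tagging model: counts words per review."""
    return [(review_id, len(text.split())) for review_id, text in reviews]


# A short, finite stand-in for the review stream.
STREAM = [
    (50, "great momo place"),
    (52, "slow service"),
    (53, "loved it"),
]

tagged = [tag_batch(batch) for batch in micro_batches(STREAM, batch_size=2)]
```

Each batch is processed as soon as it fills, so results arrive continuously instead of after one large nightly job.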




Visualize the output data

● It’s like building a software application.
  ○ Consider end users.
  ○ What is the most intuitive way to see this information?
● Your professor will have given even better examples.
● Do not reinvent the wheel.
  ○ Tableau (education edition)
  ○ Kibana (self-setup)
  ○ Mode (paid)
  ○ Looker (paid)
  ○ Plotly (open source, free)
  ○ Dash (abstraction around Plotly, free)
  ○ MATLAB (not used much in industry)

If you are not able to show it in a good way, there was no need to process it!




Putting together a pipeline

[Diagram, built up over several slides: App, Backend, Transactional database, Event Store, Spark Streaming (Micro-batches), POS Tagging Model]

How to automate the tasks?

Scheduling & Monitoring

● Scheduling tasks in a sequence
● Easy to specify dependencies
● Code-based configuration
● Easy to deploy and manage
● Every batch pipeline needs a scheduler to automate tasks.
● Handling failure
● Also allows backfill.
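Code-based dependency configuration can be sketched with the standard library's `graphlib` (Python 3.9+). The task names and the `run_pipeline` helper are hypothetical. A real scheduler adds timing, retries, alerting and a UI on top of this core idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_reviews": set(),
    "extract_restaurants": set(),
    "join_tables": {"extract_reviews", "extract_restaurants"},
    "pos_tagging": {"extract_reviews"},
    "load_warehouse": {"join_tables", "pos_tagging"},
}


def run_pipeline(dag, tasks):
    """Run tasks in dependency order; stop at the first failure and report what ran."""
    completed = []
    for name in TopologicalSorter(dag).static_order():
        try:
            tasks[name]()
        except Exception as exc:
            # Later tasks are skipped when any earlier task fails.
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None
```

Calling `run_pipeline(PIPELINE, tasks)` with a dict of callables returns the list of completed task names and an error message, or `None` when everything succeeded.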

Backfill

[Diagram: a timeline of events in time with a missing stretch marked “??”, shown over three slides]
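A backfill amounts to finding the time slots that have no completed run and re-running the job for each one. The sketch below assumes a daily job and a set of already completed dates. Both function names are hypothetical.

```python
from datetime import date, timedelta


def missing_runs(start, end, completed):
    """Return every date in [start, end] that has no completed run."""
    days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(days))
    return [day for day in all_days if day not in completed]


def backfill(start, end, completed, run_for_date):
    """Re-run the job for each missing date, oldest first, and record it as done."""
    filled = []
    for day in missing_runs(start, end, completed):
        run_for_date(day)
        completed.add(day)
        filled.append(day)
    return filled
```

This only works if the job takes the date as a parameter and produces the same output whenever it is run for that date, which is a design choice worth making early.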

Think ahead, Think smart

● Get all data into one place (know about data warehousing).
● Understand the why behind any tool choices.
● Expect future requests from stakeholders.
● Learn by collaborating; know all the different ways data can be stored, processed and visualized.
● Constantly learn; know the latest updates in a tool.
  ○ Start with the basics of why the tool was built.
● Learn these five: Kafka, Spark, Cassandra, Postgres (PostGIS), Redshift
● Managed: Lambdas, Redshift, Dynamo, S3

Start using cloud resources

● Students get $300 in credits in both AWS and GCP. Start using them.
● Spin up compute resources.
● Try out labs for managed services.
● AWS for Students (https://aws.amazon.com/about-aws/whats-new/2015/05/aws-educate-students-and-educators-can-access-aws-technology-cloud-courses-training-and-collaboration-tools/)
  ○ AWS Lambdas
  ○ AWS Redshift
  ○ AWS Dynamo

More resources

● Data Engineering Tools (Visualized): http://xyz.insightdataengineering.com/blog/pipeline_map/
● Rise of a Data Engineer: https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/
● Preparing for Transition into a Data Engineer: https://blog.insightdatascience.com/preparing-for-the-transition-to-data-engineering-bfb39d327316
● What’s Parquet?: https://www.youtube.com/watch?v=MZNjmfx4LMc
● More blogs on Insight: https://blog.insightdatascience.com/

Insight

Insight Offerings - Which one to pick

Data Science Program (https://www.insightdatascience.com/)
● PhD in quantitative fields.
● Have worked on analysing data.
● Good problem-solving skills.

Data Engineering Program (https://www.insightdataengineering.com)
● Engineering background.
● Worked on building and maintaining engineering systems.
● Java/Python

Health Data Science Program (https://www.insighthealthdata.com/)
● Postdoctoral researchers, medical doctors.
● Interested in genome sequences, clinical trials.

Artificial Intelligence Program (https://www.insightdata.ai/)
● Engineering background.
● Have worked on training and deploying ML or NN.

DevOps Engineering Program (http://insightdevops.com)
● Systems admin and Linux background.
● Problem solver, critical thinker.
● Can understand containerized systems.

New Programs - More focused domains

https://www.insightsecurityengineering.com/
● Designing security measures
● Building secure applications

https://www.insightconsensus.com/
● Blockchain technology
● Smart contract management
● Decentralized architectures

Where are we?

Seattle, Portland, San Francisco, Los Angeles, Austin, Chicago, New York, Boston, Toronto

[Map legend: In Person, Remote]

Apply to Insight

● 3 sessions a year
● Apply when you are ready for full-time
● Prepare a role-driven resume
● Read our blog posts
● Contact alumni
● Application process:
  ○ Resume + Application Form
  ○ Interview

Note: The Data Engineering program has a coding challenge before the interview.

Applications open for June 2020 Session!

● Apply: apply.insightdatascience.com
● Sign up for the notifications list: https://notify.insightdatascience.com/notify
