Streaming data mining

Streaming Data Mining

04/13/2023 Streaming Data Mining 1

Once upon a time.

• Life was easy.
  – E.g., an organization had only transaction data, and analysts were happy analyzing it.
  – There was less competition.
  – Customers had fewer options for reviewing products.

• Wait! Web 2.0
  – Customers who consumed data started generating data: tweets, blogs, Facebook comments, reviews...
  – Another burst came with the mobile era.
    • Apps recording customers' locations.
    • Actions on apps.
    • Patterns of app use.


[Diagram: one server backed by six databases (DB)]


It's All About the Numbers!


58M/day, 500 TB/day, 2.1M GB/hr, 4B views/day

So, it's GOOD to have data, right?

Digging Into the Data

• Analyze to understand the customer.
• Identify patterns.
• Machine learning
• Statistical model building
• Natural language processing
• ...


Usual Pipeline in Data Mining.


Data of Entire Population → Sample Population → Cleaning and Preprocessing → Training and Testing Models → Production Server

Why?

Huge Training Data Set - Volume

• Organizations these days have huge datasets that can be used to train their models.

• But main memory is limited.
  – Machine learning algorithms.
  – Batch processing.

• Why not sample instead?


Streams - Velocity

• Ubiquitous Computing, Mobile Devices, Social Media.

• Potentially infinite in length

• Usual Strategy – Batch Mode.


Contextual Trends.

• Trending topics on social media
• Weather
• Location
• Demographics
• Market dynamics

• Jargon Alert : Concept Drift
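One rough way to picture concept drift is to watch a model's error rate over time. The sketch below is an illustration only, not a named drift-detection algorithm: the class name `WindowDriftCheck`, the window size, and the threshold are all assumptions made for this example. It compares the error rate of the newest outcomes against the window that came just before them.

```python
from collections import deque


class WindowDriftCheck:
    """Flags drift when the recent error rate exceeds the older one by `threshold`."""

    def __init__(self, window=50, threshold=0.2):
        self.reference = deque(maxlen=window)  # older outcomes
        self.recent = deque(maxlen=window)     # newest outcomes
        self.threshold = threshold

    def add(self, was_error: bool) -> bool:
        """Record one prediction outcome; return True if drift is suspected."""
        if len(self.recent) == self.recent.maxlen:
            # The oldest "recent" outcome ages into the reference window.
            self.reference.append(self.recent[0])
        self.recent.append(was_error)
        if len(self.reference) < self.reference.maxlen:
            return False  # not enough history to compare yet
        old_rate = sum(self.reference) / len(self.reference)
        new_rate = sum(self.recent) / len(self.recent)
        return new_rate - old_rate > self.threshold


# Hypothetical usage: a model that is always right, then suddenly always wrong.
check = WindowDriftCheck(window=50, threshold=0.2)
stable = [check.add(False) for _ in range(100)]
after_change = [check.add(True) for _ in range(20)]
```

Memory stays bounded by the two windows no matter how long the stream runs.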


What do we want today?

Consume real-time data and extract insights.

Wait! Can I say analyze streams?

Streaming Data Mining!

Philosophy

• Continuous data records, aka data streams
• Bounded storage
• Single pass
• Real time
• Concept drift
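As a small illustration of "bounded storage" and "single pass", the sketch below keeps a running mean and variance using Welford's update. The `RunningStats` class and the `readings` generator are hypothetical names chosen for this example; a real stream would not end.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass mean and variance using Welford's update (constant memory)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 until two observations have arrived.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def readings():
    """Hypothetical finite stand-in for an unbounded stream."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for value in readings():
    stats.update(value)  # each record is seen once and then discarded
```

Only three numbers are stored regardless of how many records pass through, which is the property the bullets above ask for.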


So What… We have Hadoop…

• The big elephant doesn't fit in here.
• Hadoop – batch processing
• We need Storm
  – Storm is fast: a benchmark clocked it at over a million tuples processed per second per node.


Algorithms.

• Conventional machine learning algorithms were designed for batch processing.
  – The algorithm needs to load the entire dataset into memory.
  – It then computes the necessary statistics, for example entropy/information gain in decision trees.

• With streams?
  – Streams are of infinite length.
  – Storing everything, even if you could, would be costly in memory ($$$$).
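To make the contrast concrete: the information gain a decision tree needs can be computed from counts alone, so the raw records do not have to stay in memory. The sketch below is a minimal illustration under that assumption; `AttributeCounts` and the toy spam/ham records are hypothetical and not taken from any particular library.

```python
import math
from collections import Counter


def entropy(class_counts):
    """Shannon entropy (in bits) of a label distribution given as counts."""
    total = sum(class_counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in class_counts.values() if c
    )


class AttributeCounts:
    """Counts of (attribute value, label) pairs, updated one record at a time."""

    def __init__(self):
        self.by_value = {}       # attribute value -> Counter of labels
        self.labels = Counter()  # overall label counts

    def update(self, value, label):
        self.by_value.setdefault(value, Counter())[label] += 1
        self.labels[label] += 1

    def information_gain(self):
        """Entropy of the labels minus the weighted entropy after splitting."""
        total = sum(self.labels.values())
        remainder = sum(
            (sum(c.values()) / total) * entropy(c) for c in self.by_value.values()
        )
        return entropy(self.labels) - remainder


# Hypothetical records: (does the message contain a link?, label)
stream = [(True, "spam"), (True, "spam"), (False, "ham"), (False, "ham")]
counts = AttributeCounts()
for has_link, label in stream:
    counts.update(has_link, label)  # the record itself is not stored

gain = counts.information_gain()
```

Memory here grows with the number of distinct attribute values and labels, not with the length of the stream.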


Streaming Machine Learning

• When?
  – Data volume is high.
  – Data arrives at a high rate.
  – The stream is unbounded: data keeps arriving and will not fit in memory.

• Requirements to adhere to:
  – Each input element is processed at most once.
  – Bounded space.
  – Bounded time.
  – Start predicting from t0.


General Flow of Streaming Algorithms


Spam Detection

• Models trained in the past with a traditional data mining strategy become obsolete as spammers find ways around them.

• Solution: VFDT (Very Fast Decision Tree), a Hoeffding tree stream classifier
• Train the model in a streaming setup.
• When a new spam pattern appears, people mark the messages as spam.
• Use those labels to retrain the model in real time.
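For context, a Hoeffding tree decides when to split a leaf by comparing the gap between the two best attributes with the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The sketch below shows only that split test, not a full VFDT implementation; the function names and the example numbers are assumptions for illustration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability at least 1 - delta, the true mean
    lies within epsilon of the mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the observed gain gap exceeds the Hoeffding bound."""
    value_range = math.log2(n_classes)  # range of information gain, in bits
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical leaf: two classes, best attribute ahead by 0.05 bits of gain.
early = should_split(0.30, 0.25, n_classes=2, delta=1e-7, n=1_000)
later = should_split(0.30, 0.25, n_classes=2, delta=1e-7, n=10_000)
```

The same gain gap that is too small to trust after 1,000 records becomes enough after 10,000, which is how the tree grows without revisiting old data.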

Concept Drift! Win!


Answering Today's Big Data Needs

• Streaming Data Mining
  – Storm
  – MOA
  – SAMOA
  – Kafka
  – ...


Thank You!

Ankit Solanki

Neil Shah
