Data Stream Mining Tore Risch Dept. of information technology Uppsala University Sweden 2014-05-14


Page 1: Data Stream Mining

Data Stream Mining

Tore Risch
Dept. of Information Technology

Uppsala University, Sweden

2014-05-14

Page 2: Data Stream Mining

New applications

• Data comes as huge data streams, e.g.:
  – Satellite data
  – Scientific instruments, e.g. colliders
  – Social networks
  – Stock data
  – Process industry
  – Traffic control
  – Patient monitoring

Page 3: Data Stream Mining

Enormous data growth

• Read the landmark article in The Economist, 2010-02-27:
  http://www.economist.com/node/15557443/
• The traditional Moore's law:
  – Processor speed doubles every 1.5 years
• The current data growth rate is very high:
  – Data grows 10-fold every 5 years, which is about the same rate as Moore's law
• Major opportunities:
  – spot business trends
  – prevent diseases
  – combat crime
  – scientific discoveries, the 4th paradigm (http://research.microsoft.com/en-us/collaboration/fourthparadigm/)
  – a data-centered economy
• Major challenges:
  – Information overload
  – Scalable data processing, 'Big Data management'
  – Data security
  – Data privacy

Page 4: Data Stream Mining

Too much data to store on disk

Need to mine streaming data

Page 5: Data Stream Mining

Mining a swift data river

Cover of The Economist's 14-page thematic issue, 2010-02-27

Page 6: Data Stream Mining

Mining streams vs. databases

• Data streams are dynamic and infinite in size
  – Data is continuously generated and changing
  – Live streams may have no upper limit
  – A live stream can be read only once ('Just One Look')
  – The stream rate may be very high
• Traditional data mining does not work for streaming data:
  – Regular data mining is based on finite data collections stored in files or databases
  – Regular data mining is done in batch:
    • read (access) the data in the collection several times
    • analyze the accessed data
    • store the results in a database (or file)
      – For example, store for each data object which cluster it belongs to

Page 7: Data Stream Mining

Data stream mining vs. traditional data mining

• Live streams grow continuously
  – The system cannot read a stream many times, as in traditional data mining
• Live-stream mining must be done on-line
  – Not traditional batch processing
• Live streams require main-memory processing
  – To keep up with very high data flow rates
• Live streams must be mined with limited memory
  – Not load-and-analyze as in traditional data mining
  – Iterative processing of statistics
• Live-stream mining must keep up with very high data flow volumes
  – Approximate mining algorithms
  – Parallel on-line computations

Page 8: Data Stream Mining

Streams vs. regular data

• Data stream mining requires iterative aggregation:
  – Read data tuples iteratively from the input stream(s)
    • E.g. measured values
    • Do not store the read data in a database
  – Analyze the read tuples incrementally
  – Keep partial results in main memory
    • E.g. a sum and a count
  – Continuously emit incremental analysis results
    • E.g. a running average, by dividing the sum by the count
• The result of data stream mining is a derived stream
  – Continuously produced by emit
  – Can be stored as a continuously changing database
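The iterative aggregation above can be sketched as a small Python generator: it consumes the input stream one tuple at a time, keeps only a sum and a count in main memory, and emits a derived stream of running averages (a minimal sketch; the function name is illustrative):

```python
def running_average(stream):
    """Consume a stream of numbers tuple by tuple, keeping only a sum
    and a count as partial results in memory, and emit a derived
    stream of running averages (the input is never stored)."""
    total = 0.0
    count = 0
    for value in stream:
        total += value          # incremental partial result
        count += 1
        yield total / count     # continuously emit the running average

# Usage: the derived stream can itself be consumed incrementally.
averages = list(running_average(iter([4, 8, 6, 2])))
# averages == [4.0, 6.0, 6.0, 5.0]
```

Because only two scalars are kept per aggregate, the memory use is constant no matter how long the stream runs.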

Page 9: Data Stream Mining

Streams vs. regular data

• Stream processing should keep up with the data flow
  – Make the computations so fast that they keep up with the flow (on average)
  – It should not be catastrophic if the miner cannot keep up with the flow:
    • Drop input data values
    • Use approximate computations such as sampling
• Asynchronous logging in files is often possible
  – At least of processed (reduced) streaming data
  – At least during a limited time (log files)

Page 10: Data Stream Mining

Requirements for Stream Mining

• Single scan of the data,
  – because of the very large or infinite size of streams
  – because it may be impossible or very expensive to reread the stream for the computation
• Limited memory and CPU usage,
  – because the processing should be done in main memory despite the very large stream volume
• A compact, continuously evolving representation of the mined data
  – It is not possible to store the mined data in a database as with traditional data mining
  – A compact main-memory representation of the mined data is needed

Page 11: Data Stream Mining

Requirements for Stream Mining

• Allow for continuous evolution of the mined data
  – Traditional batch mining produces static mined data
  – Continuous mining makes the mined data into a stream too
    => Concept drift
• Often mining is done over different kinds of windows of streams
  – E.g. sliding or tumbling windows
  – Windows of limited size
  – Often only statistical summaries are needed (synopses, sketches)

Page 12: Data Stream Mining

Data Stream Management Systems (DSMS)

• DataBase Management System (DBMS)
  – General-purpose software to handle large volumes of persistent data (usually on disk)
  – An important tool for traditional data mining
• Data Stream Management System (DSMS)
  – General-purpose software to handle large-volume data streams (often transient data)
  – An important tool for data stream mining

Page 13: Data Stream Mining

[Figure: DBMS architecture. SQL queries enter the Query Processor; the Data Manager handles the stored data and metadata.]

Data Base Management System

Page 14: Data Stream Mining

[Figure: DSMS architecture. Continuous Queries (CQs) over incoming data streams enter the Query Processor; the Data & Stream Manager handles the data streams, the stored data, and metadata, producing output data streams.]

Data Stream Management System

Page 15: Data Stream Mining

Stream windows

• A limited-size section of the stream, stored temporarily in the DSMS
  – 'Regular' database queries can be made over these windows
• A window operator is needed to chop the stream into segments
• Window size (sz) can be based on:
  – The number of elements: a counting window
    • E.g. the last 10 elements, i.e. the window has a fixed size of 10 elements
  – Time: a time window
    • E.g. the elements of the last second, i.e. the window contains all events processed during the last second
  – A landmark: a landmark window
    • All events from time t0 are in the window, cf. a growing log file
  – Decay: a decaying window
    • Decrease the importance of a measurement by multiplying it with a factor
    • Remove it when its importance falls below a threshold

Page 16: Data Stream Mining

Stream windows

• Windows may also have a stride (str)
  – A rule for how fast the window moves forward:
    • E.g. 10 elements for a 10-element counting window: a tumbling window
    • E.g. 2 elements for a 10-element counting window: a sliding window
    • E.g. 100 elements for a 10-element counting window: a sampling window
• Windows need not always be materialized
  – E.g. it is often sufficient to keep statistics materialized
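Counting windows with a stride can be sketched as a Python generator over an in-memory buffer (a minimal sketch; the names are illustrative). With stride == sz the window tumbles, with stride < sz it slides, and with stride > sz it samples:

```python
from collections import deque

def counting_window(stream, sz, stride):
    """Chop a stream into counting windows of sz elements that move
    forward stride elements at a time: stride == sz tumbles,
    stride < sz slides (overlapping), stride > sz samples."""
    buf = deque(maxlen=sz)      # only the current window is kept in memory
    since_emit = 0
    for x in stream:
        buf.append(x)
        since_emit += 1
        if len(buf) == sz and since_emit >= stride:
            yield tuple(buf)
            since_emit = 0

# A 3-element tumbling window (stride == sz) over the stream 1..6:
tumbling = list(counting_window(range(1, 7), sz=3, stride=3))
# tumbling == [(1, 2, 3), (4, 5, 6)]

# A 3-element sliding window with stride 1 over the same stream:
sliding = list(counting_window(range(1, 7), sz=3, stride=1))
# sliding == [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
```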

Page 17: Data Stream Mining

Continuous (standing) queries over streams from expressways

Schema for stream CarLocStr of tuples:

CarLocStr(car_id,   /* unique car identifier */
          speed,    /* speed of the car */
          exp_way,  /* expressway: 0..10 */
          lane,     /* lane: 0,1,2,3 */
          dir,      /* direction: 0 (east), 1 (west) */
          x_pos);   /* coordinate in the expressway */

CQL query to continuously get the cars in a window of the stream every 30 seconds:

SELECT DISTINCT car_id
FROM CarLocStr [RANGE 30 SECONDS];

Get the average speed of vehicles per expressway, direction, and segment every 5 minutes:

SELECT exp_way, dir, seg, AVG(speed) AS speed
FROM CarSegStr [RANGE 5 MINUTES]
GROUP BY exp_way, dir, seg;

• http://www.cs.brandeis.edu/~linearroad/

Page 18: Data Stream Mining

Denstream

• Streamed DBSCAN
• Published at the 2006 SIAM Conf. on Data Mining (http://user.it.uu.se/~torer/DenStream.pdf)
• Regular DBSCAN:
  – Saves the cluster membership of each member object of a static database in the database, by scanning the database looking for pairs of objects close to each other
  – The database is accessed many times
  – For scalable processing, a spatial index must be used to index the points in hyperspace and answer nearest-neighbor queries

Page 19: Data Stream Mining

Denstream

• Denstream:
  – One-pass processing
  – Limited memory
  – Evolving clustering => no static cluster membership
  – Indefinite stream => cluster memberships are not stored in a database
  – No assumption about the number of clusters
    • Clusters fade in and fade out
  – Clusters of arbitrary shape are allowed
  – Good at handling outliers

Page 20: Data Stream Mining

Core micro-clusters

• Core point: an 'anchor' in a cluster of other points
• Core micro-cluster: an area covering the points close to (epsilon-similar to) a core point
• A cluster is defined as a set of micro-clusters

Page 21: Data Stream Mining

Potential micro-clusters

• Outlier o-micro-cluster
  – A new point not included in any micro-cluster
• Potential p-micro-cluster
  – Several clustered points, not yet large enough to form a core micro-cluster
• When a new data point arrives:
  1. Try to merge it with the nearest p-micro-cluster
  2. Otherwise, try to merge it with the nearest o-micro-cluster
     • If this succeeds, convert the o-micro-cluster to a p-micro-cluster once it has grown large enough
  3. Otherwise, make a new o-micro-cluster
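The three-step insertion can be sketched in Python, assuming micro-clusters are summarized by a center and a weight, with a fixed merge radius eps and a promotion threshold beta_mu; all names, the 1-D points, and the (omitted) center update are simplifications for illustration:

```python
def nearest(clusters, point, eps):
    """Return the cluster whose center is closest to point, if it is
    within the merge radius eps, else None (1-D points for brevity)."""
    best = min(clusters, key=lambda c: abs(c["center"] - point), default=None)
    if best is not None and abs(best["center"] - point) <= eps:
        return best
    return None

def insert(p_clusters, o_clusters, point, eps, beta_mu):
    """DenStream-style insertion sketch for a new data point."""
    # 1. Try to merge with the nearest p-micro-cluster.
    c = nearest(p_clusters, point, eps)
    if c is None:
        # 2. Otherwise try the nearest o-micro-cluster ...
        c = nearest(o_clusters, point, eps)
        if c is None:
            # 3. Otherwise start a new o-micro-cluster.
            o_clusters.append({"center": point, "weight": 1.0})
            return
    c["weight"] += 1.0          # merge (center update omitted for brevity)
    if c in o_clusters and c["weight"] >= beta_mu:
        # Promote: the o-micro-cluster has grown into a p-micro-cluster.
        o_clusters.remove(c)
        p_clusters.append(c)

p, o = [], []
insert(p, o, 5.0, eps=1.0, beta_mu=2.0)   # starts a new o-micro-cluster
insert(p, o, 5.4, eps=1.0, beta_mu=2.0)   # merges, then promotes it
# p == [{"center": 5.0, "weight": 2.0}] and o == []
```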

Page 22: Data Stream Mining

Decaying p-micro-cluster windows

• Maintain a weight per p-micro-cluster cp
• Periodically (each Tp time period), decrease the weight exponentially by multiplying the old weight with a decay factor 2^(-λ·Tp)
• When a weight falls below a threshold, delete the micro-cluster
• This gives a decaying window of micro-clusters
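The decay step can be sketched in Python, assuming the exponential fading function 2^(-λ·t) of the DenStream paper; the class, parameter values, and threshold are illustrative:

```python
class PMicroCluster:
    """Minimal sketch: only the decaying weight of a p-micro-cluster."""
    def __init__(self, weight=1.0):
        self.weight = weight

def decay_step(clusters, lam, Tp, threshold):
    """Every Tp time units, multiply each weight by the decay factor
    2^(-lam * Tp) and delete clusters whose weight falls below the
    threshold, realizing a decaying window over micro-clusters."""
    factor = 2.0 ** (-lam * Tp)
    for c in clusters:
        c.weight *= factor
    return [c for c in clusters if c.weight >= threshold]

# Usage: with lam = 0.25 and Tp = 4 the factor is 2^-1 = 0.5, so a
# cluster of weight 1.0 falls below a threshold of 0.3 in two steps.
clusters = [PMicroCluster(1.0)]
clusters = decay_step(clusters, lam=0.25, Tp=4, threshold=0.3)  # weight 0.5
clusters = decay_step(clusters, lam=0.25, Tp=4, threshold=0.3)  # deleted
# clusters == []
```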

Page 23: Data Stream Mining

Dealing with outliers

• o-micro-clusters are important for forming new, evolving p-micro-clusters
  => Keep o-micro-clusters around
• Keeping all o-micro-clusters may be expensive
  => Delete o-micro-clusters by a special exponential pruning rule (a decaying window)
• The decaying-window method is proven to make the number of micro-clusters grow logarithmically with the stream size
  – Good, but not sufficient for an indefinite stream
  – The number has been shown to grow very slowly, though

Page 24: Data Stream Mining

Growth of micro-clusters

Page 25: Data Stream Mining

Forming c-micro-cluster sets

• Regularly (e.g. each time period), on user demand, the current c-micro-clusters are formed from the current p-micro-clusters
• This is done by running regular DBSCAN over the p-micro-clusters
  – The center of each p-micro-cluster is regarded as a point
  – Two p-micro-clusters are close when they intersect
  => Clusters formed

Page 26: Data Stream Mining

Bloom filters

• Problem: testing membership in extremely large data sets
  – E.g. the set of all non-spam e-mail addresses
• No false negatives, i.e. if an address is in the set, it is guaranteed to pass
• Few false positives allowed, i.e. a small number of spams may sneak through
• See http://infolab.stanford.edu/~ullman/mmds/ch4.pdf, section 4.3.2

Page 27: Data Stream Mining

Bloom filters

• Main idea:
  – Assume a bitmap B of size s
  – Hash each object x to h in [1,s]
  – Set bit B[h]
• Smaller than a sorted table in memory:
  – 10^9 email addresses of 40 bytes => 40 GB if the set were stored sorted in memory
    • Would also be expensive to extend
  – The bitmap could instead be e.g. 10^9 bits / 8 = 125 MB
• May have false positives
  – Since the hash function is not perfect

Page 28: Data Stream Mining

Lowering false positives

• A small bitmap => many false positives
• Idea: hash with several independent hash functions h1(x), h2(x), ... and set the corresponding bits (logical OR)
• For each new x, check that all bits hi(x) are set
  – If so => match
  – The chance of a false positive decreases exponentially with the number of hash functions hi
• Assumes the hi(x) are independent
  – hi(x) and hj(x) have no common factors if i ≠ j
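The scheme can be sketched as a minimal Python Bloom filter, here assuming k salted SHA-1 digests as the independent hash functions (a real implementation would pack the bits and choose s and k from the expected set size and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Sketch of a Bloom filter: a bitmap of s bits and k independent
    hash functions, derived here from salted SHA-1 digests."""
    def __init__(self, s, k):
        self.s = s
        self.k = k
        self.bits = bytearray(s)   # one byte per bit, for clarity

    def _positions(self, x):
        # Derive k hash values h1(x)..hk(x) in [0, s) from salted digests.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.s

    def add(self, x):
        for h in self._positions(x):
            self.bits[h] = 1       # set bit B[h] (logical OR)

    def might_contain(self, x):
        # True only if every hi(x) bit is set: false positives are
        # possible, false negatives are not.
        return all(self.bits[h] for h in self._positions(x))

bf = BloomFilter(s=1024, k=3)
bf.add("alice@example.com")
assert bf.might_contain("alice@example.com")   # no false negatives
```

Membership tests touch only k bits, so the cost per stream element is constant regardless of how many elements have been added.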

Page 29: Data Stream Mining

Books

- Anand Rajaraman & Jeffrey D. Ullman: Mining of Massive Datasets, http://infolab.stanford.edu/~ullman/mmds.html
- L. Golab & M. T. Özsu: Issues in Data Stream Management, SIGMOD Record, 32(2), June 2003, http://www.acm.org/sigmod/record/issues/0306/1.golab-ozsu1.pdf