Datcracker Open data-mining platform connecting Rseslib and WEKA Marcin Wojnarski Warsaw University, Poland


Datcracker

Open data-mining platform connecting Rseslib and WEKA

Marcin Wojnarski

Warsaw University, Poland


Outline

Datcracker is …

Motivation

What is available in version 0.5

HOWTO …

Architecture

Future releases


Datcracker is…

…an open-source, extensible data-mining platform that provides a common architecture for data-processing algorithms of various types. The algorithms can be combined to build data-processing schemes of great complexity.


Main characteristics

Extensibility of algorithm pool through well-defined API

Extensibility of types of data that algorithms operate on

Stream-based data processing, for efficient handling of large volumes of data and for freedom of designing complex experiments

Language: Java

Licence: GPL

Download: www.datcracker.org


Motivation

To enable independent research groups to exchange and combine their algorithms

To simplify implementation of new algorithms


Available in version 0.5

Rseslib algorithms: classifiers (~20 algorithms)

Weka algorithms: ARFF reader, classifiers (~60), filters (47)

Datcracker algorithms: Train&Test evaluation scheme

Data types: vectors of numeric and/or symbolic features


HOWTO: Read ARFF file

Cell arff = new ArffReaderCell();

arff.set("filename", "data/iris.arff");

arff.set("labelIndex", "last");

arff.open();

System.out.println(arff.next());

System.out.println(arff.next());

arff.close();

Output:

[data:[5.1 3.5 1.4 0.2] label:[Iris-setosa]]

[data:[4.9 3.0 1.4 0.2] label:[Iris-setosa]]


HOWTO: Train classifier (Rseslib)

Cell learner = new RseslibClassifier("C45");

learner.set("pruning", "true");

learner.setSource(arff);

learner.build();

learner.setSource(arff_test);

learner.open();

System.out.println(learner.next());

learner.close();


HOWTO: Train classifier (Weka)

Cell learner = new WekaClassifier("J48");

learner.set("minNumObj", "2");

learner.setSource(arff);

learner.build();


HOWTO: Apply Weka filter

Cell filter = new WekaFilter("attribute.Remove");

filter.set("attributeIndices", "3-6");

filter.setSource(arff);

filter.open();

System.out.println(filter.next());

System.out.println(filter.next());

filter.close();


HOWTO: Set parameters

arff.set("filename", "data/iris.arff");

arff.set("labelIndex", "last");

...

OR

Parameters par = new Parameters();

par.set("filename", "data/iris.arff");

par.set("labelIndex", "last");

...

arff.setParameters(par);

par = arff.getParameters();


HOWTO: Train & Test

Cell learner = new RseslibClassifier("C45");

learner.set("pruning", "true");

TrainAndTest tt = new TrainAndTest(learner);

tt.set("trainPercent", "70");

tt.set("repetitions", "10");

tt.setSource(source);

tt.build();

System.out.println(tt.report());


Data Processing Chain

[Figure: cells (ARFF, Filter1, Filter2, Classifier, AnotherClassifier, NewARFF) linked by Cell.setSource(sourceCell); parameter settings shown in the figure: set("attributeIndices","0-3"), set("attributeIndices","5")]
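The idea of a Data Processing Chain can be illustrated without the Datcracker library. The sketch below is a minimal stand-in, not Datcracker code: MiniCell, ListSource and KeepColumns are hypothetical classes, and signalling the end of the stream by returning null from next() is an assumption. It shows two filters chained behind a source with setSource(), with each sample pulled through the whole chain one at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a cell: it can be given a source and pulled for samples.
abstract class MiniCell {
    protected MiniCell source;                 // upstream cell, if any

    void setSource(MiniCell source) { this.source = source; }

    abstract void open();
    abstract double[] next();                  // assumption: null marks the end of the stream
    abstract void close();
}

// Source cell that streams rows from an in-memory list.
class ListSource extends MiniCell {
    private final List<double[]> rows;
    private Iterator<double[]> it;

    ListSource(List<double[]> rows) { this.rows = rows; }

    @Override void open() { it = rows.iterator(); }
    @Override double[] next() { return it.hasNext() ? it.next() : null; }
    @Override void close() { it = null; }
}

// Filter cell that keeps only columns [from, to) of each incoming row.
class KeepColumns extends MiniCell {
    private final int from;
    private final int to;

    KeepColumns(int from, int to) { this.from = from; this.to = to; }

    @Override void open() { source.open(); }
    @Override double[] next() {
        double[] row = source.next();          // pull exactly one sample from upstream
        return row == null ? null : Arrays.copyOfRange(row, from, to);
    }
    @Override void close() { source.close(); }
}

class ChainDemo {
    // Builds source -> filter1 -> filter2 and drains the chain sample by sample.
    static List<double[]> run() {
        MiniCell src = new ListSource(Arrays.asList(
                new double[] {5.1, 3.5, 1.4, 0.2},
                new double[] {4.9, 3.0, 1.4, 0.2}));
        MiniCell filter1 = new KeepColumns(0, 3);
        filter1.setSource(src);
        MiniCell filter2 = new KeepColumns(1, 2);
        filter2.setSource(filter1);

        List<double[]> out = new ArrayList<>();
        filter2.open();
        for (double[] row = filter2.next(); row != null; row = filter2.next()) {
            out.add(row);
        }
        filter2.close();
        return out;
    }

    public static void main(String[] args) {
        for (double[] row : run()) System.out.println(Arrays.toString(row));
    }
}
```

No cell in this sketch buffers the data set: a row leaves the source only when the last cell in the chain asks for it.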


Architecture


Outline

Cell: interfaces, state, how to override

Data

MetaData


Cell

Main class of Datcracker architecture

Base class for all data-processing algorithms: classifiers, clusterers, filters, data loaders, data generators, …

Cells can be connected in a Data Processing Chain

Data transfer between cells takes the form of a stream of samples

The receiving cell may consume incoming samples immediately, so large volumes of data are processed efficiently


Cell’s interface

Cell can be: a data source, a data receiver, buildable, parameterized


Cell as a data source

Cell’s interface for data transfer:

open() : MetaSample opens communication session

next() : Sample retrieves next sample of data

close() closes communication session


Cell as a data receiver

Cell’s interface for receiving data:

setSource(Cell) set source cell


Buildable cells

Some cells may be buildable: they have to be built before use

Building a cell is implemented by subclasses and may mean different things: training a decision system, running an evaluation scheme (T&T, CV, …), buffering input data, …

Cell’s interface for building:

build() builds the cell

erase() erases the cell; it can be built again afterwards


Fixed cells

Cells that are not buildable are called fixed. They are usable just after construction or setting parameters. Examples: file reader, WEKA filter, …


Parameterized cells

Cell’s interface for parameterization:

set(String name, String value) sets a parameter

setParameters(Parameters) sets all parameters at once

getParameters() : Parameters returns all parameters that are set
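A possible shape for this interface is sketched below. It is a self-contained illustration, not the Datcracker implementation: ParamsSketch and ParameterizedSketch are hypothetical names, and two behaviours are assumptions (setParameters() replaces all current parameters, and getParameters() returns a copy).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parameter bag: an ordered map of name -> value strings.
class ParamsSketch {
    private final Map<String, String> values = new LinkedHashMap<>();

    ParamsSketch() {}

    // Copy constructor, used to keep cells from sharing one mutable bag.
    ParamsSketch(ParamsSketch other) { values.putAll(other.values); }

    void set(String name, String value) { values.put(name, value); }
    String get(String name) { return values.get(name); }

    @Override public String toString() { return values.toString(); }
}

// Hypothetical parameterized cell exposing the three methods listed above.
class ParameterizedSketch {
    private ParamsSketch params = new ParamsSketch();

    void set(String name, String value) { params.set(name, value); }

    // Assumption: setting all parameters at once replaces the current ones.
    void setParameters(ParamsSketch p) { params = new ParamsSketch(p); }

    // Assumption: a copy is returned, so callers cannot alter the cell indirectly.
    ParamsSketch getParameters() { return new ParamsSketch(params); }
}

class ParamsDemo {
    public static void main(String[] args) {
        ParameterizedSketch a = new ParameterizedSketch();
        a.set("filename", "data/iris.arff");
        a.set("labelIndex", "last");

        ParameterizedSketch b = new ParameterizedSketch();
        b.setParameters(a.getParameters());      // reuse a's configuration in b
        System.out.println(b.getParameters());
    }
}
```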


State of the cell

EMPTY: cell has no content, cannot be used

CLOSED: content has been built, cell ready to use

OPEN: cell is being used now (generating samples of data)

Transitions: EMPTY to CLOSED on build(); CLOSED to OPEN on open(); OPEN to OPEN on next(); OPEN to CLOSED on close(); CLOSED to EMPTY on erase()
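The three states and their transitions can be written down as a small lookup table. The sketch below is an illustration only: CellState and CellStateMachine are hypothetical names, and the transition set is the one read from the state diagram (in particular, erase() is assumed to be legal only in the CLOSED state).

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

enum CellState { EMPTY, CLOSED, OPEN }

class CellStateMachine {
    // For each state, the operations allowed in it and the state they lead to.
    private static final Map<CellState, Map<String, CellState>> TRANSITIONS =
            new EnumMap<>(CellState.class);

    static {
        for (CellState s : CellState.values()) TRANSITIONS.put(s, new HashMap<>());
        TRANSITIONS.get(CellState.EMPTY).put("build", CellState.CLOSED);
        TRANSITIONS.get(CellState.CLOSED).put("open", CellState.OPEN);
        TRANSITIONS.get(CellState.CLOSED).put("erase", CellState.EMPTY);
        TRANSITIONS.get(CellState.OPEN).put("next", CellState.OPEN);
        TRANSITIONS.get(CellState.OPEN).put("close", CellState.CLOSED);
    }

    // Returns the state after applying op, or throws if op is illegal in this state.
    static CellState apply(CellState state, String op) {
        CellState target = TRANSITIONS.get(state).get(op);
        if (target == null) {
            throw new IllegalStateException(op + "() is not allowed in state " + state);
        }
        return target;
    }

    public static void main(String[] args) {
        CellState s = CellState.EMPTY;
        for (String op : new String[] {"build", "open", "next", "next", "close", "erase"}) {
            s = apply(s, op);
            System.out.println(op + "() -> " + s);
        }
    }
}
```

Any call outside the table, such as open() on an EMPTY cell, fails at once with an exception instead of producing undefined behaviour later.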


…motivation

To check against access violations when the cell is accessed. Examples: two cells try to retrieve data from a given cell at the same time; someone tries to use an empty cell; someone tries to reconnect cells during their activity

To simplify implementation of subclasses (new algorithms): they may safely assume that access is correct (build() before open(), open() before next(), …)

To detect bugs early, which is important in a heterogeneous system!


How to override Cell

Methods to override: onBuild(), onErase(), onOpen(), onNext(), onClose()

Public methods build(), … cannot be overridden. They perform state checking and then call the corresponding on…() method

Like event handlers in event-driven programming

You do not have to override all of them! (e.g. a cell for reading data will not be buildable)

You can provide an additional interface in your subclass
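This is the template-method pattern: final public methods check the state and delegate to overridable hooks. The sketch below shows the pattern in isolation. HookedCell and CounterCell are hypothetical classes, only the open/next/close part is modelled, and making onNext() abstract is an assumption rather than something the slides state.

```java
// Hypothetical base class: public methods are final, subclasses override on...() hooks.
abstract class HookedCell {
    private boolean open = false;

    public final void open() {
        if (open) throw new IllegalStateException("cell is already open");
        onOpen();
        open = true;
    }

    public final Object next() {
        if (!open) throw new IllegalStateException("open() must be called before next()");
        return onNext();
    }

    public final void close() {
        if (!open) throw new IllegalStateException("cell is not open");
        onClose();
        open = false;
    }

    // Default hooks do nothing, so a subclass overrides only what it needs.
    protected void onOpen() {}
    protected void onClose() {}
    protected abstract Object onNext();
}

// A fixed (non-buildable) cell that produces consecutive integers.
class CounterCell extends HookedCell {
    private int counter;

    @Override protected void onOpen() { counter = 0; }
    @Override protected Object onNext() { return counter++; }
}

class HookDemo {
    public static void main(String[] args) {
        CounterCell cell = new CounterCell();
        cell.open();
        System.out.println(cell.next());
        System.out.println(cell.next());
        cell.close();
    }
}
```

CounterCell never checks whether it is open; the base class guarantees that onNext() runs only between open() and close().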


Data representation

Data set split into samples

Sample:

data : Data (input data)

label : Data (associated decision label)

Separation of data and label:

useful for complex types of data/labels, e.g. in image processing (like segmentation)

useful for meta-learning algorithms, which operate on labels alone

labelled / unlabelled / partially labelled samples are handled in the same way

Data: abstract base class. Cells downcast it to the type they expect

Currently available subclasses: NumericFeature, SymbolicFeature, DataVector

In the future: time series, images, special types of labels, ...
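A rough picture of this representation, under stated assumptions: the classes below are hypothetical stand-ins (DataSketch, NumericSketch, SymbolicSketch, SampleSketch), their fields are invented for the example, and representing a missing label as null is an assumption. The sketch shows a sample holding data and label separately, and a consumer downcasting Data to the subclass it expects.

```java
// Hypothetical stand-ins for the Data hierarchy; the fields are assumptions.
abstract class DataSketch {}

final class NumericSketch extends DataSketch {
    final double value;
    NumericSketch(double value) { this.value = value; }
}

final class SymbolicSketch extends DataSketch {
    final String value;
    SymbolicSketch(String value) { this.value = value; }
}

// A sample pairs input data with an optional decision label.
final class SampleSketch {
    final DataSketch data;
    final DataSketch label;   // assumption: null means the sample is unlabelled

    SampleSketch(DataSketch data, DataSketch label) {
        this.data = data;
        this.label = label;
    }
}

class SampleDemo {
    // A cell that expects numeric input downcasts Data to the subclass it needs.
    static double doubled(SampleSketch s) {
        if (!(s.data instanceof NumericSketch)) {
            throw new IllegalArgumentException("expected numeric data");
        }
        return 2 * ((NumericSketch) s.data).value;
    }

    // Label-only processing, as a meta-learning algorithm might do.
    static String labelOf(SampleSketch s) {
        return s.label == null ? "(unlabelled)" : ((SymbolicSketch) s.label).value;
    }

    public static void main(String[] args) {
        SampleSketch labelled =
                new SampleSketch(new NumericSketch(5.1), new SymbolicSketch("Iris-setosa"));
        SampleSketch unlabelled = new SampleSketch(new NumericSketch(4.9), null);
        System.out.println(doubled(labelled) + " " + labelOf(labelled));
        System.out.println(doubled(unlabelled) + " " + labelOf(unlabelled));
    }
}
```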


Immutability

Data objects are immutable: they cannot be modified after creation (like String class)

They can be freely shared among cells without risk of accidental modification: safety, simplicity, efficiency (no need to copy data between cells, no need for synchronization in multi-threaded execution)
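How such immutability is usually achieved in Java can be sketched as follows. ImmutableVector is a hypothetical class written for this example, not Datcracker's DataVector: the class is final, its only field is private and final, and arrays are copied on the way in.

```java
import java.util.Arrays;

// Hypothetical immutable vector: final class, private final field, defensive copies.
final class ImmutableVector {
    private final double[] values;

    ImmutableVector(double[] values) {
        this.values = values.clone();      // copy in, so later changes by the caller have no effect
    }

    int size() { return values.length; }
    double get(int i) { return values[i]; }

    // "Modification" returns a new object and leaves this one untouched.
    ImmutableVector with(int i, double v) {
        double[] copy = values.clone();
        copy[i] = v;
        return new ImmutableVector(copy);
    }

    @Override public String toString() { return Arrays.toString(values); }
}

class ImmutabilityDemo {
    public static void main(String[] args) {
        double[] raw = {5.1, 3.5};
        ImmutableVector v = new ImmutableVector(raw);
        raw[0] = -1;                               // does not affect v
        ImmutableVector w = v.with(1, 0.0);        // v is unchanged, w is a new object
        System.out.println(v + " " + w);
    }
}
```

Because no method can change an existing ImmutableVector, the same instance can be handed to several cells or threads without copying or locking.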


Metadata

Many algorithms have to know the "type" of input data in advance, before processing of data starts: this is the role of metadata

Separation of data and metadata: base class MetaData

Describes common properties of all Data objects generated in a given session: number and types of features in a DataVector, dictionary of possible values of a SymbolicFeature, …

Each Data subclass has an associated MetaData subclass

Immutable!
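The role of metadata can be shown with a small sketch. VectorMetaSketch and MetaDemo are hypothetical, and the content of the metadata (one feature type per vector position) is a simplification of what the slide lists. The point is that a cell can decide whether it can handle a stream before the first sample arrives.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical metadata for a vector: one feature type per position, fixed for a session.
final class VectorMetaSketch {
    enum FeatureType { NUMERIC, SYMBOLIC }

    private final List<FeatureType> types;

    VectorMetaSketch(FeatureType... types) {
        // Copy the array and wrap it read-only, so the metadata stays immutable.
        this.types = Collections.unmodifiableList(Arrays.asList(types.clone()));
    }

    int size() { return types.size(); }
    FeatureType typeAt(int i) { return types.get(i); }
}

class MetaDemo {
    // A cell that needs all-numeric input can check metadata before any data arrives.
    static boolean acceptsAllNumeric(VectorMetaSketch meta) {
        for (int i = 0; i < meta.size(); i++) {
            if (meta.typeAt(i) != VectorMetaSketch.FeatureType.NUMERIC) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        VectorMetaSketch numeric = new VectorMetaSketch(
                VectorMetaSketch.FeatureType.NUMERIC, VectorMetaSketch.FeatureType.NUMERIC);
        VectorMetaSketch mixed = new VectorMetaSketch(
                VectorMetaSketch.FeatureType.NUMERIC, VectorMetaSketch.FeatureType.SYMBOLIC);
        System.out.println(acceptsAllNumeric(numeric));
        System.out.println(acceptsAllNumeric(mixed));
    }
}
```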


Future releases

Architecture: multi-input and multi-output cells; composite cells (e.g. meta-learning); serialization and copying; progress info and suspension of cell building

Algorithms: cross-validation, data buffering, …

Data types: time series, …


www.datcracker.org
