Datcracker
Open data-mining platform connecting Rseslib and WEKA
Marcin Wojnarski
Warsaw University, Poland
2
Outline
Datcracker is …
Motivation
What is available in version 0.5
HOWTO …
Architecture
Future releases
3
Datcracker is…
…an open-source extensible data-mining platform which
provides common architecture for data processing algorithms
of various types. The algorithms can be combined together to
build data processing schemes of large complexity.
4
Main characteristics
Extensibility of algorithm pool through well-defined API
Extensibility of types of data that algorithms operate on
Stream-based data processing, for efficient handling of large volumes of data and for freedom in designing complex experiments
Language: Java
Licence: GPL
Download: www.datcracker.org
5
Motivation
To enable independent research groups to exchange and combine their algorithms
To simplify implementation of new algorithms
6
Available in version 0.5
Rseslib algorithms: classifiers (~20 algorithms)
Weka algorithms: ARFF reader, classifiers (~60), filters (47)
Datcracker algorithms: Train&Test evaluation scheme
Data types: vectors of numeric and/or symbolic features
7
HOWTO: Read ARFF file
Cell arff = new ArffReaderCell();
arff.set("filename", "data/iris.arff");
arff.set("labelIndex", "last");
arff.open();
System.out.println(arff.next());
System.out.println(arff.next());
arff.close();
Output:
[data:[5.1 3.5 1.4 0.2] label:[Iris-setosa]]
[data:[4.9 3.0 1.4 0.2] label:[Iris-setosa]]
8
HOWTO: Train classifier (Rseslib)
Cell learner = new RseslibClassifier("C45");
learner.set("pruning", "true");
learner.setSource(arff);
learner.build();
// arff_test: a second source cell holding the test data (its setup is not shown on this slide)
learner.setSource(arff_test);
learner.open();
System.out.println(learner.next());
learner.close();
9
HOWTO: Train classifier (Weka)
Cell learner = new WekaClassifier("J48");
learner.set("minNumObj", "2");
learner.setSource(arff);
learner.build();
10
HOWTO: Apply Weka filter
Cell filter = new WekaFilter("attribute.Remove");
filter.set("attributeIndices", "3-6");
filter.setSource(arff);
filter.open();
System.out.println(filter.next());
System.out.println(filter.next());
filter.close();
11
HOWTO: Set parameters
arff.set("filename", "data/iris.arff");
arff.set("labelIndex", "last");
...
OR
Parameters par = new Parameters();
par.set("filename", "data/iris.arff");
par.set("labelIndex", "last");
...
arff.setParameters(par);
par = arff.getParameters();
12
HOWTO: Train & Test
Cell learner = new RseslibClassifier("C45");
learner.set("pruning", "true");
TrainAndTest tt = new TrainAndTest(learner);
tt.set("trainPercent", "70");
tt.set("repetitions", "10");
tt.setSource(source);
tt.build();
System.out.println(tt.report());
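The slide does not say how TrainAndTest works internally. The sketch below shows one plausible reading of the "trainPercent" and "repetitions" parameters: repeated random splits whose accuracies are averaged. It is an assumption for illustration, not Datcracker code; the class TrainTestSketch, its methods and the trivial majority-class learner are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of a repeated train-and-test evaluation (not Datcracker's implementation).
class TrainTestSketch {
    // "Trains" a trivial classifier: it always predicts the most frequent training label.
    static String trainMajority(List<String> trainLabels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (String label : trainLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            if (best == null || c > counts.get(best)) best = label;
        }
        return best;
    }

    // Mean accuracy over `repetitions` random splits, with `trainPercent`% of samples used for training.
    static double evaluate(List<String> labels, int trainPercent, int repetitions, long seed) {
        Random rnd = new Random(seed);
        double sum = 0;
        for (int r = 0; r < repetitions; r++) {
            List<String> shuffled = new ArrayList<>(labels);
            Collections.shuffle(shuffled, rnd);
            int cut = shuffled.size() * trainPercent / 100;
            List<String> train = shuffled.subList(0, cut);
            List<String> test = shuffled.subList(cut, shuffled.size());
            String predicted = trainMajority(train);
            int hits = 0;
            for (String label : test) if (label.equals(predicted)) hits++;
            sum += (double) hits / test.size();
        }
        return sum / repetitions;
    }

    public static void main(String[] args) {
        List<String> labels = new ArrayList<>(Collections.nCopies(8, "yes"));
        labels.addAll(Collections.nCopies(2, "no"));
        System.out.println("mean accuracy: " + evaluate(labels, 70, 10, 1L));
    }
}
```

Only labels are used here to keep the sketch short; a real learner would also receive the input data of each sample.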
13
Data Processing Chain
[Diagram: cells ARFF, Filter1, Filter2, Classifier, AnotherClassifier and NewARFF linked into a processing chain with Cell.setSource(sourceCell); two filters are configured with set("attributeIndices","0-3") and set("attributeIndices","5").]
14
Architecture
15
Outline
Cell: interfaces, state, how to override
Data
MetaData
16
Cell
Main class of Datcracker architecture
Base class for all data-processing algorithms: classifiers, clusterers, filters, data loaders, data generators, …
Cells can be connected in a Data Processing Chain
Data transfer between cells has the form of a stream of samples
The receiving cell may immediately consume incoming samples, so large volumes of data are processed efficiently
17
Cell’s interface
A cell can be: a data source, a data receiver, buildable, parameterized
18
Cell as a data source
Cell’s interface for data transfer:
open() : MetaSample - opens a communication session
next() : Sample - retrieves the next sample of data
close() - closes the communication session
19
Cell as a data receiver
Cell’s interface for receiving data:
setSource(Cell) - sets the source cell
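The pull-based streaming described on the last two slides can be sketched with a few self-contained classes. This is a minimal illustration, not the Datcracker API: StreamCell, CountingCell, DoublingCell and StreamDemo are hypothetical names, samples are plain integers, open() returns nothing instead of a MetaSample, and signalling the end of the stream with null is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a cell that can act as a data source.
interface StreamCell {
    void open();     // starts a communication session
    Integer next();  // next sample, or null at end of stream (assumption)
    void close();    // ends the session
}

// Source cell: generates the integers 1..n on demand.
class CountingCell implements StreamCell {
    private final int n;
    private int current;
    CountingCell(int n) { this.n = n; }
    public void open() { current = 0; }
    public Integer next() { return current < n ? ++current : null; }
    public void close() { }
}

// Receiving cell: pulls one sample at a time from its source and doubles it.
class DoublingCell implements StreamCell {
    private StreamCell source;
    void setSource(StreamCell source) { this.source = source; }
    public void open() { source.open(); }
    public Integer next() {
        Integer s = source.next();
        return s == null ? null : 2 * s;
    }
    public void close() { source.close(); }
}

class StreamDemo {
    // Drains a cell into a list, using only open(), next() and close().
    static List<Integer> drain(StreamCell cell) {
        List<Integer> out = new ArrayList<>();
        cell.open();
        for (Integer s = cell.next(); s != null; s = cell.next()) out.add(s);
        cell.close();
        return out;
    }

    public static void main(String[] args) {
        DoublingCell doubler = new DoublingCell();
        doubler.setSource(new CountingCell(3));
        System.out.println(drain(doubler)); // prints [2, 4, 6]
    }
}
```

DoublingCell never stores more than one sample, which is the property that lets a chain of cells process data sets larger than memory.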
20
Buildable cells
Some cells may be buildable: they have to be built before use
Building a cell is implemented by subclasses and may mean different things:
- training a decision system
- running an evaluation scheme (T&T, CV, …)
- buffering input data
- …
Cell’s interface for building:
build() - builds the cell
erase() - erases the cell; it can be built again afterwards
21
Fixed cells
Cells that are not buildable are called fixed. They are usable just after construction or setting parameters: file reader, WEKA filter, …
22
Parameterized cells
Cell’s interface for parameterization:
set(String name, String value) - sets a parameter
setParameters(Parameters) - sets all parameters at once
getParameters() : Parameters - returns all parameters that are set
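A minimal sketch of how string-valued parameters like these could be stored. ParamSet and ParamCell are hypothetical stand-ins for Parameters and Cell; in particular, it is an assumption that setParameters replaces earlier settings rather than merging with them, and that getParameters hands out a copy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Parameters object: named, string-valued settings.
class ParamSet {
    private final Map<String, String> values = new LinkedHashMap<>();

    ParamSet() { }
    ParamSet(ParamSet other) { values.putAll(other.values); } // copy constructor

    void set(String name, String value) { values.put(name, value); }
    String get(String name) { return values.get(name); }
    // Typed access: the string is parsed only when the value is needed.
    int getInt(String name) { return Integer.parseInt(values.get(name)); }
    @Override public String toString() { return values.toString(); }
}

// Hypothetical parameterized cell.
class ParamCell {
    private ParamSet params = new ParamSet();

    void set(String name, String value) { params.set(name, value); }
    // Assumption: setting all parameters at once replaces the previous ones.
    void setParameters(ParamSet p) { params = new ParamSet(p); }
    // Returns a copy, so callers cannot change the cell's settings indirectly.
    ParamSet getParameters() { return new ParamSet(params); }
}

class ParamDemo {
    public static void main(String[] args) {
        ParamCell a = new ParamCell();
        a.set("trainPercent", "70");
        a.set("repetitions", "10");

        ParamSet p = new ParamSet();
        p.set("trainPercent", "70");
        p.set("repetitions", "10");
        ParamCell b = new ParamCell();
        b.setParameters(p);

        System.out.println(a.getParameters()); // prints {trainPercent=70, repetitions=10}
        System.out.println(b.getParameters().getInt("trainPercent") + 1); // prints 71
    }
}
```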
23
State of the cell
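EMPTY - cell has no content, cannot be used
CLOSED - content has been built, cell ready to use
OPEN - cell is being used now (generating samples of data)
Transitions:
EMPTY --build()--> CLOSED --open()--> OPEN
OPEN --close()--> CLOSED --erase()--> EMPTY
next() is called in the OPEN state and leaves the cell OPEN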
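The three states and their transitions can be written down as a small state machine. The sketch below is a hypothetical illustration of the idea, not Datcracker's code; CellState, LifeCycle and the choice of IllegalStateException for illegal calls are assumptions.

```java
// Hypothetical sketch of the three-state life cycle of a cell.
enum CellState { EMPTY, CLOSED, OPEN }

class LifeCycle {
    private CellState state = CellState.EMPTY;

    CellState state() { return state; }

    void build() { move(CellState.EMPTY, CellState.CLOSED, "build()"); }
    void open()  { move(CellState.CLOSED, CellState.OPEN, "open()"); }
    void close() { move(CellState.OPEN, CellState.CLOSED, "close()"); }
    void erase() { move(CellState.CLOSED, CellState.EMPTY, "erase()"); }

    // next() does not change the state, but is legal only while OPEN.
    void next() { require(CellState.OPEN, "next()"); }

    private void move(CellState from, CellState to, String op) {
        require(from, op);
        state = to;
    }

    private void require(CellState expected, String op) {
        if (state != expected)
            throw new IllegalStateException(op + " requires state " + expected + " but cell is " + state);
    }
}

class LifeCycleDemo {
    public static void main(String[] args) {
        LifeCycle cell = new LifeCycle();
        cell.build();
        cell.open();
        cell.next();
        cell.close();
        cell.erase();
        System.out.println(cell.state()); // prints EMPTY
        try {
            cell.open(); // illegal: the cell has not been built
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```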
24
…motivation
To check against access violations when the cell is accessed. Examples:
- two cells try to retrieve data from a given cell at the same time
- someone tries to use an empty cell
- someone tries to reconnect cells during their activity
To simplify implementation of subclasses (new algorithms): they may safely assume that access is correct (build() before open(), open() before next(), …)
To detect bugs early: important in a heterogeneous system!
25
How to override Cell
Methods to override: onBuild(), onErase(), onOpen(), onNext(), onClose()
Public methods build(), … cannot be overridden. They perform state checking and then call the corresponding on…() method
Like event handlers in event-driven programming
You do not have to override all of them! (e.g. a cell for reading data will not be buildable)
You can provide additional interface in your subclass
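The overriding scheme described on this slide is an instance of the template-method pattern. Below is a minimal sketch under stated assumptions: HookedCell and ListReaderCell are hypothetical, samples are plain strings, the state is reduced to an open/closed flag, and the default handlers do nothing (the real Cell may define them differently).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical base class: public methods are final, subclasses override on...() handlers.
abstract class HookedCell {
    private boolean open = false;

    // Public entry points check the state first, then delegate to the handler.
    public final void open() {
        if (open) throw new IllegalStateException("cell is already open");
        onOpen();
        open = true;
    }

    public final String next() {
        if (!open) throw new IllegalStateException("open() must be called before next()");
        return onNext();
    }

    public final void close() {
        if (!open) throw new IllegalStateException("cell is not open");
        onClose();
        open = false;
    }

    // Default handlers do nothing, so a subclass overrides only what it needs.
    protected void onOpen() { }
    protected String onNext() { return null; }
    protected void onClose() { }
}

// A fixed (non-buildable) cell that streams the lines of an in-memory list.
class ListReaderCell extends HookedCell {
    private final List<String> lines;
    private Iterator<String> it;

    ListReaderCell(List<String> lines) { this.lines = lines; }

    @Override protected void onOpen() { it = lines.iterator(); }
    // No state checks here: next() guarantees that onOpen() has already run.
    @Override protected String onNext() { return it.hasNext() ? it.next() : null; }
    // onClose() is not overridden: there is nothing to release.
}

class HookedCellDemo {
    public static void main(String[] args) {
        ListReaderCell reader = new ListReaderCell(Arrays.asList("first", "second"));
        reader.open();
        System.out.println(reader.next()); // prints first
        System.out.println(reader.next()); // prints second
        reader.close();
    }
}
```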
26
Data representation
Data set split into samples
Sample:
- data : Data - input data
- label : Data - associated decision label
Separation of data and label:
- useful for complex types of data/labels, e.g. in image processing (like segmentation)
- useful for meta-learning algorithms, which operate on labels alone
- labelled / unlabelled / partially labelled samples are handled in the same way
Data: abstract base class. Downcast by cells to the type they expect
Currently available subclasses: NumericFeature, SymbolicFeature, DataVector
In the future: time series, images, special types of labels, ...
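A miniature version of this design, to show how a cell might downcast the abstract data type to what it expects. All class names here (Datum, NumericDatum, SymbolicDatum, LabelledSample, ThresholdClassifier) are hypothetical and do not come from Datcracker; representing a missing label by null is also an assumption.

```java
// Hypothetical miniature of the Data / Sample design; names are illustrative only.
abstract class Datum { }

final class NumericDatum extends Datum {
    final double value;
    NumericDatum(double value) { this.value = value; }
}

final class SymbolicDatum extends Datum {
    final String value;
    SymbolicDatum(String value) { this.value = value; }
}

// A sample pairs input data with a label; the label is null when unknown (assumption).
final class LabelledSample {
    final Datum data;
    final Datum label;
    LabelledSample(Datum data, Datum label) { this.data = data; this.label = label; }
}

class ThresholdClassifier {
    private final double threshold;
    ThresholdClassifier(double threshold) { this.threshold = threshold; }

    // Accepts any sample, but downcasts the data to the type this cell expects.
    LabelledSample classify(LabelledSample in) {
        if (!(in.data instanceof NumericDatum))
            throw new IllegalArgumentException("numeric data expected");
        double x = ((NumericDatum) in.data).value;
        String decision = x > threshold ? "high" : "low";
        // The input data object is reused as-is; only the label is new.
        return new LabelledSample(in.data, new SymbolicDatum(decision));
    }
}

class SampleDemo {
    public static void main(String[] args) {
        ThresholdClassifier classifier = new ThresholdClassifier(3.0);
        LabelledSample unlabelled = new LabelledSample(new NumericDatum(5.1), null);
        LabelledSample labelled = classifier.classify(unlabelled);
        System.out.println(((SymbolicDatum) labelled.label).value); // prints high
    }
}
```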
27
Immutability
Data objects are immutable: they cannot be modified after creation (like String class)
They can be freely shared among cells without risk of accidental modification:
- safety
- simplicity
- efficiency: no need to copy data between cells, no need for synchronization in multi-threaded execution
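For readers unfamiliar with the idiom, here is a minimal sketch of an immutable vector in Java. ImmutableVector is a hypothetical class written for illustration; it is not Datcracker's DataVector.

```java
import java.util.Arrays;

// Hypothetical immutable vector of numeric features.
final class ImmutableVector {
    private final double[] values;

    // Defensive copy: later changes to the caller's array cannot leak in.
    ImmutableVector(double... values) { this.values = values.clone(); }

    int size() { return values.length; }
    double get(int i) { return values[i]; }

    // "Modification" returns a new object; the original stays untouched.
    ImmutableVector with(int i, double v) {
        double[] copy = values.clone();
        copy[i] = v;
        return new ImmutableVector(copy);
    }

    @Override public String toString() { return Arrays.toString(values); }
}

class ImmutableDemo {
    public static void main(String[] args) {
        double[] raw = {5.1, 3.5};
        ImmutableVector v = new ImmutableVector(raw);
        raw[0] = 0.0;                        // does not affect v
        ImmutableVector w = v.with(1, 9.0);  // v is unchanged
        System.out.println(v); // prints [5.1, 3.5]
        System.out.println(w); // prints [5.1, 9.0]
    }
}
```

Because no method can change an existing object, the same instance can be handed to several cells or threads without copying or locking.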
28
Metadata
Many algorithms have to know the "type" of input data in advance, before processing of data starts: this is the role of metadata
Separation of data and metadata: base class MetaData
Describes common properties of all Data objects generated in a given session:
- number and types of features in a DataVector
- dictionary of possible values of a SymbolicFeature
- …
Each Data subclass has an associated MetaData subclass
Immutable!
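One reason an algorithm needs metadata in advance is to allocate its internal structures before the first sample arrives. The sketch below illustrates this with a label histogram; SymbolicMeta and LabelHistogram are hypothetical names, and the real MetaData classes may look quite different.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical metadata for a symbolic feature: the dictionary of possible values.
final class SymbolicMeta {
    private final List<String> dictionary;

    SymbolicMeta(String... values) {
        // Copy and wrap, so the metadata object is immutable.
        dictionary = Collections.unmodifiableList(Arrays.asList(values.clone()));
    }

    int size() { return dictionary.size(); }
    int indexOf(String value) { return dictionary.indexOf(value); }
}

// A consumer that needs the dictionary before any sample is processed.
class LabelHistogram {
    private final SymbolicMeta meta;
    private final int[] counts;

    LabelHistogram(SymbolicMeta meta) {
        this.meta = meta;
        this.counts = new int[meta.size()]; // sized from metadata alone
    }

    void accept(String label) {
        int i = meta.indexOf(label);
        if (i < 0) throw new IllegalArgumentException("value not in dictionary: " + label);
        counts[i]++;
    }

    int count(String label) {
        int i = meta.indexOf(label);
        if (i < 0) throw new IllegalArgumentException("value not in dictionary: " + label);
        return counts[i];
    }
}

class MetaDemo {
    public static void main(String[] args) {
        SymbolicMeta meta = new SymbolicMeta("Iris-setosa", "Iris-versicolor", "Iris-virginica");
        LabelHistogram histogram = new LabelHistogram(meta);
        histogram.accept("Iris-setosa");
        histogram.accept("Iris-setosa");
        System.out.println(histogram.count("Iris-setosa"));    // prints 2
        System.out.println(histogram.count("Iris-virginica")); // prints 0
    }
}
```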
29
Future releases
Architecture: multi-input and multi-output cells, composite cells (e.g. meta-learning), serialization and copying, progress info and suspension of cell building
Algorithms: cross-validation, data buffering, …
Data types: time series, …
30
www.datcracker.org