
What Semantic Web researchers need to know about Machine Learning?

http://analytics.ijs.si/events/Tutorial-MachineLearningForSemanticWeb-ISWC2007-Busan-Nov2007/

Marko Grobelnik, Blaž Fortuna, Dunja Mladenić


How does the Semantic Web compare to Machine Learning? (1/2)

• Semantic Web is traditionally a non-analytic area of research…
– …it includes a lot of logic, some language technologies, some standardization, etc.
– Generally, manual encoding of knowledge is still appreciated the most…
– …where the main paradigm is top-down.
– Top-down approaches enable accuracy when designing knowledge bases, but can cost a lot and mostly do not scale


How does the Semantic Web compare to Machine Learning? (2/2)

• Machine Learning, on the other hand, is almost exclusively analytic and bottom-up…
– …where the main goal of the analytic techniques is to extract knowledge from the data…
– …with little or no human involvement.
– The cost per encoded piece of knowledge is lower and the achievable scale is high…
– …but the techniques are limited to what is computable in a reasonable time.


What are the stitching points between SW and ML?

• …SW can contribute to ML and vice versa

• Machine Learning for Semantic Web
– …decreasing costs when extracting knowledge and building ontologies, speeding up reasoning, scalability

• Semantic Web for Machine Learning
– …Semantic Web techniques can provide new insights into the data (e.g. via reasoning) which are unobservable by statistical means


Part 1: Understanding Machine Learning in simple terms

• Sub-areas of machine learning
• What are the constituents of a machine learning algorithm?
• What kinds of problems could one try to solve with ML methods?
• Why is ML easy and why is it hard?


Sub-areas of machine learning

• Data Mining
– Main events: ACM KDD, IEEE ICDM, SIAM SDM
– Books: Witten and Frank, 1999

• AI-style Machine Learning
– Main events: ICML, ECML/PKDD
– Books: Mitchell, 1997

• Statistical Machine Learning
– Main events: NIPS, UAI
– Books: Bishop 2007, Duda & Hart & Stork 2000

• Theoretical Machine Learning
– Main events: COLT, ALT
– Books: Vapnik 1999


Which areas are contributing?

Statistical data analysis

Theory on learning

Learning with different representations

Analysis of Data bases


What is “Learning” in Machine Learning?

• Herbert Simon: “Learning is any process by which a system improves performance from experience.”

• Improve on task T, with respect to performance metric P, based on experience E:

– T: Recognizing hand-written words
– P: Percentage of words correctly classified
– E: Database of human-labeled images of handwritten words

– T: Driving on four-lane highways using vision sensors
– P: Average distance traveled before a human-judged error
– E: A sequence of images and steering commands recorded while observing a human driver

– T: Categorizing email messages as spam or legitimate
– P: Percentage of email messages correctly classified
– E: Database of emails, some with human-given labels


What are the constituents of a machine learning algorithm?

• A typical ML algorithm includes:
– Input data in some kind of form (most often vectors)
– Input parameters (settings)
– Fitting the data into a model using some search algorithm
– Output of the model in the form y = f(x)
– Use of the model for classification of unseen data

• …the key elements are:
– Language of the model – determines the complexity of the model
  …a popular language for many algorithms is the linear model (used in SVM, Perceptron, Bayes, ...): y = a x1 + b x2 + c x3 + …
– Search algorithm – determines the quality of the result
  …most often it is some kind of local optimization


What kinds of problems could one try to solve with ML methods?

• Generally, machine learning is about finding patterns and regularities in the data…

• …the result is always a model, which can be understood as a summary of the data used to learn it

• The resulting model can be used for:
– Explaining existing phenomena in the data
– Predicting future situations


Why is ML easy and why is it hard?

• ML is easy:
– …because it seems that God does not play dice and Nature usually behaves nicely and is not chaotic
– With ML techniques we try to discover the rules by which nature generated the observed data.
– Simple algorithmic approaches can bring a lot of success in understanding the observed data.

• ML is hard:
– …because we don’t know the language God uses when programming the Universe
– There are many different ways to represent similar concepts…
– …we need to deal with scale, noise, dynamics, etc., and it is hard to find the right way to model the observed data


Part 2: Main approaches in Machine Learning

• Supervised learning
• Semi-supervised learning
• Unsupervised learning


When to apply each of the approaches?

• …let’s take an example from medicine:
– we have patients who have symptoms (inputs) and a diagnosis (output)

• Supervised learning (classification)
– …given symptoms and corresponding diagnoses for many patients, the goal is to find rules which map symptoms to a diagnosis and so predict the diagnosis for an unseen patient

• Semi-supervised learning (transduction, active learning)
– …given symptoms for all the patients but corresponding diagnoses for only a few of them, leverage these to find the most probable diagnoses for all the patients

• Unsupervised learning (clustering, decompositions)
– …given only symptoms for many patients, find groups of similar patients (e.g. with possibly similar diagnoses).


Supervised learning

Assign an object to one of a given finite set of classes:
• Medical diagnosis
– …assign a diagnosis to a patient
• Credit card applications or transactions
– …assign a credit score to an applicant
• Fraud detection in e-commerce
– …decide about a fraud or non-fraud event in a business process
• Financial investments
– …decide whether to buy, sell or hold on a stock exchange
• Spam filtering of e-mails
– …decide if an email is spam or a regular email
• Recommending articles in a newspaper
– …decide if an article fits the user profile
• Semantic/linguistic annotation
– …assign a semantic or linguistic annotation to a word or phrase


Recommending News Articles

[Figure: diagram with the elements "labeled articles (training examples)", "Machine Learning", "new article (test example)", "???" and "predicted article class (interesting or not)"]


Supervised learning

• Given is a set of labeled examples represented by feature vectors
• The goal is to build a model approximating the target function, which would automatically assign the right label to a new unlabeled example

• Feature values:
– discrete (e.g., eyes_color ∈ {brown, blue, green})
– continuous (e.g., age ∈ [0..200])
– ordered (e.g., size ∈ {small, medium, large})

• Values of the target function (labels):
– discrete (classification) or continuous (regression)
– exclude each other (e.g., medical diagnosis) or not (e.g., document content can be about arts and computer science)
– have some predefined relations (taxonomy of document categories, e.g., DMoz or Medline)

• The target function can be represented in different ways (storing examples, symbolic, numerical, graphical, …) and modeled by using different algorithms


Illustrative example

Recommending a cartoon for a 5-year-old boy

Title              Main characters   Duration
Bob the builder    vehicles, human   10 mins
Pixar-Locomotion   vehicles          5 mins
Ice age            animals           90 mins
Over the hedge     animals           60 mins
Cars               vehicles          90 mins
Anima              human             90 mins
South Park         human             30 mins
Simpson            human             20 mins

Main characters ∈ {vehicle, human, animal}
Duration ∈ [5..90]


Target function

There is a trade-off between the expressiveness of a representation and the ease of learning
– The more expressive a representation, the better it will be at approximating an arbitrary function; however, the more examples will be needed to learn an accurate function

Illustrative example
• Values of the target function: discrete labels (classification) that exclude each other

Interesting movie: yes / no


Possible data visualization

[Figure: examples arranged on a grid with columns "vehicles = yes" / "vehicles = no" and rows "human = yes" / "human = no"]

Possible model for not interesting: (vehicles = no) and (human = yes)


Generalization

• A model must generalize from the data to correctly classify yet unseen examples (the ones which don’t appear in the training data)

• A lookup table of training examples is a consistent model that does not generalize
– An example that was not in the training data cannot be classified

Occam’s razor:
– Finding a simple model helps ensure generalization


Ockham (Occam)’s Razor

• William of Ockham (1295-1349) was a Franciscan friar who applied the criterion to theology:
– “Entities should not be multiplied beyond necessity” (classical version, but not an actual quote)
– “The supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.” (Einstein)

• requires a precise definition of simplicity
• assumes that nature itself is simple


Algorithms for learning classification models

Storing examples
– Nearest Neighbour

Symbolic
– Decision trees
– Rules in propositional logic or first order logic

Numerical
– Perceptron algorithm
– Winnow algorithm
– Support Vector Machines
– Logistic Regression

Probabilistic graphical models
– Naive Bayesian classifier
– Hidden Markov Models


Nearest neighbor

• Storing training examples without generating any generalization
– Simple, requires efficient storage

• Classification by comparing the example to the stored training examples and estimating the class based on the classes of the most similar examples
– Similarity function is crucial

Also known as:
– Instance-based, Case-based, Exemplar-based, Memory-based, Lazy Learning


Similarity/Distance

• For continuous features use Euclidean distance

$$e_k = (f_{k,1}, f_{k,2}, \ldots, f_{k,n})$$

$$Dist(e_1, e_2) = \sqrt{\sum_{i=1}^{n} (f_{1,i} - f_{2,i})^2}$$

• For discrete features, assume the distance between two values is 0 if they are the same and 1 if they are different (e.g., Hamming distance for bit vectors).

To compensate for difference in units across features, scale all continuous values to the interval [0,1].
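The distance rules above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own code; the function names `min_max_scale`, `euclidean` and `mismatch_distance` and the sample durations are assumptions made for the example.

```python
import math

def min_max_scale(values):
    """Scale a list of numbers to the interval [0, 1] (constant lists map to 0.0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def euclidean(e1, e2):
    """Euclidean distance between two equally long vectors of continuous features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))

def mismatch_distance(e1, e2):
    """Distance for discrete features: each position adds 0 if equal, 1 if different."""
    return sum(1 for a, b in zip(e1, e2) if a != b)

# Hypothetical durations in minutes, rescaled so that units do not dominate.
scaled = min_max_scale([10, 5, 90, 60])
```

Scaling first matters when features such as duration (minutes) and age (years) are mixed; without it the feature with the largest numeric range dominates the Euclidean distance.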


Nearest neighbor


Nearest neighbor - example

Title              Main characters   Duration
Bob the builder    vehicles, human   10 mins
Pixar-Locomotion   vehicles          5 mins
Ice age            animals           90 mins
Over the hedge     animals           60 mins
Cars               vehicles          90 mins
Anima              human             90 mins
South Park         human             30 mins
Simpson            human             20 mins

Features:
– Vehicles (yes, no)
– Human (yes, no)
– Animals (yes, no)
– Duration (5, 10, 20, 30, 60, 90)

Model:

$$\|e_1 - e_2\|_1 = L_1(e_1, e_2) = \sum_{i=1}^{|F|} (f_{1,i} - f_{2,i}); \qquad (f_i - f_j) = 1 \text{ if } f_i \neq f_j, \text{ and } 0 \text{ otherwise}$$


Nearest neighbor - example

New example: Jungle book, main characters human, animals, duration 60 mins

Title              Main characters   Duration   Distance to Jungle book
Bob the builder    vehicles, human   10 mins    1+0+1+1 = 3
Pixar-Locomotion   vehicles          5 mins     1+1+1+1 = 4
Ice age            animals           90 mins    1+1+0+1 = 3
Over the hedge     animals           60 mins    1+1+0+0 = 2
Cars               vehicles          90 mins    1+1+1+1 = 4
Anima              human             90 mins    1+0+1+1 = 3
South Park         human             30 mins    1+0+1+1 = 3
Simpson            human             20 mins    1+0+1+1 = 3

The nearest neighbor is Over the hedge (distance 2).
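The nearest-neighbor lookup for Jungle book can be reproduced with a short Python sketch. It assumes the four features listed for the model (vehicles, human, animals, duration), each compared as a discrete value; the dictionary layout and the helper name `mismatches` are illustrative. Note that the sums on the slide have a first term of 1 in every row, whereas comparing the vehicles feature directly gives 0 for films without vehicles, so this sketch scores those films one lower. The nearest neighbor is the same either way.

```python
# Each film: (vehicles, human, animals, duration), all treated as discrete values.
films = {
    "Bob the builder":  ("yes", "yes", "no", 10),
    "Pixar-Locomotion": ("yes", "no", "no", 5),
    "Ice age":          ("no", "no", "yes", 90),
    "Over the hedge":   ("no", "no", "yes", 60),
    "Cars":             ("yes", "no", "no", 90),
    "Anima":            ("no", "yes", "no", 90),
    "South Park":       ("no", "yes", "no", 30),
    "Simpson":          ("no", "yes", "no", 20),
}
jungle_book = ("no", "yes", "yes", 60)

def mismatches(e1, e2):
    """L1 distance over discrete features: count the positions that differ."""
    return sum(1 for a, b in zip(e1, e2) if a != b)

distances = {title: mismatches(f, jungle_book) for title, f in films.items()}
nearest = min(distances, key=distances.get)
```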


Decision trees

• Recursively split the set of examples until (almost) all examples in the set have the same class value
– start with the set of all examples
– find the feature giving the best split
– split the examples according to the feature values
– repeat for each subset of examples

• Classification by “pushing” the example from the root to a leaf and assigning the class value of the leaf

$$Entropy(S) = -\sum_i p_i \log_2 p_i$$

$$InfGain(S, F) = Entropy(S) - \sum_{f \in Values(F)} \frac{|S_f|}{|S|} Entropy(S_f)$$


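Decision tree - example

Title              Main characters   Duration
Bob the builder    vehicles          10 mins
Pixar-Locomotion   vehicles          5 mins
Ice age            animals           90 mins
Over the hedge     animals           60 mins
Cars               vehicles          90 mins
Anima              human             90 mins
South Park         human             30 mins
Simpson            human             20 mins

Class distribution of all examples: {5, 3}

$$Entropy(\{5,3\}) = -\frac{5}{8}\log_2\frac{5}{8} - \frac{3}{8}\log_2\frac{3}{8} = 0.63 \cdot 0.68 + 0.38 \cdot 1.42 = 0.95$$

$$Entropy(\{3,0\}) = -\frac{3}{3}\log_2\frac{3}{3} - 0 = 0, \qquad Entropy(\{x,0\}) = 0$$

$$Entropy(\{1,2\}) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.33 \cdot 1.58 + 0.67 \cdot 0.58 = 0.91$$

Split on characters ({5, 3} at the root): vehicle 3, human 3, animals 2
InfGain = 0.95 - (0 + 0 + 0) = 0.95

Split on duration ({5, 3} at the root): up to 10 mins: 2; over 10 and up to 30 mins: 2; over 30 and under 90 mins: 1; 90 mins or more: {1, 2}
InfGain = 0.95 - (0 + 0 + 0.38*0.91 + 0) = 0.61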
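The two information-gain values can be checked with a small Python sketch of the entropy and InfGain formulas. The class labels ("yes" for interesting, "no" for not interesting) are an assumption read off the 5/3 class split and the rules derived from the tree (vehicles and animals interesting, human not); the helper names and bin labels are illustrative.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class frequencies in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, feature):
    """InfGain(S, F) = Entropy(S) - sum_f |S_f|/|S| * Entropy(S_f)."""
    labels = [label for _, label in examples]
    subsets = defaultdict(list)
    for feats, label in examples:
        subsets[feats[feature]].append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def duration_bin(minutes):
    """Duration intervals used by the duration split in the example."""
    if minutes >= 90:
        return "90 or more"
    if minutes > 30:
        return "over 30, under 90"
    if minutes > 10:
        return "over 10, up to 30"
    return "up to 10"

# (main characters, duration in minutes, assumed label)
raw = [("vehicles", 10, "yes"), ("vehicles", 5, "yes"), ("animals", 90, "yes"),
       ("animals", 60, "yes"), ("vehicles", 90, "yes"), ("human", 90, "no"),
       ("human", 30, "no"), ("human", 20, "no")]
examples = [({"characters": c, "duration": duration_bin(d)}, y) for c, d, y in raw]

gain_characters = info_gain(examples, "characters")
gain_duration = info_gain(examples, "duration")
```

Since the characters split yields pure subsets, its gain equals the full entropy of the root, which is why it is chosen over duration.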


Decision tree - example

[Figure: tree splitting on characters ({5, 3} at the root) into the branches vehicle (3), human (3) and animals (2)]

New example to classify: Jungle book, animals, 60 mins


Decision tree model


If-then Rules

Generating rules by adding conditions:
– ∧ – restricting the rule (fewer examples match)
– ∨ – generalizing the rule (more examples match)

• Maximize the quality of each rule (e.g., matching examples are of the same class) while aiming at describing all the examples with a set of rules


If-then Rules

[Figure: tree splitting on characters ({5, 3} at the root) into the branches vehicle (3), human (3) and animals (2)]

Converting a tree to rules:
• If (Main characters = vehicles) then interesting
• If (Main characters = human) then uninteresting
• If (Main characters = animals) then interesting
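One way to sketch the tree-to-rules conversion in Python is to walk every root-to-leaf path and emit its conditions. The nested-dict tree encoding and the function name `tree_to_rules` are assumptions made for this illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Yield (conditions, label) for every leaf of a nested-dict decision tree."""
    if not isinstance(tree, dict):  # a leaf holds just the class label
        yield list(conditions), tree
        return
    feature = tree["feature"]
    for value, subtree in tree["branches"].items():
        yield from tree_to_rules(subtree, conditions + ((feature, value),))

# The one-level tree from the example, encoded as a hypothetical nested dict.
tree = {
    "feature": "Main characters",
    "branches": {
        "vehicles": "interesting",
        "human": "uninteresting",
        "animals": "interesting",
    },
}
rules = list(tree_to_rules(tree))
```

For deeper trees each rule simply accumulates one condition per level, joined by "and".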


Support Vector Machine

• Learns a hyperplane in a higher dimensional space
– that separates the training data and
– gives the highest margin

• Implicit mapping of the original feature space into a higher dimensional space
– mapping using a so-called kernel function (e.g., linear, polynomial, …)

Regarded as state-of-the-art in text document classification

[Figure: positive class and negative class separated by a hyperplane with a margin]


Linear Model


Naïve Bayes

Determine the class of example e_k by estimating

$$P(c_i \mid e_k) = \frac{P(c_i)\,P(e_k \mid c_i)}{P(e_k)}, \qquad \arg\max_i P(c_i)\,P(e_k \mid c_i)$$

• P(c_i) – estimate from the data using frequency: no. of examples with class c_i / no. of all examples

• P(e_k|c_i) – too many possibilities (all combinations of feature values) – assume feature independence given the class

$$P(e_k \mid c_i) = \prod_{j=1}^{n} P(f_{k,j} \mid c_i)$$


Naïve Bayes - example

Title              Main characters   Duration
Bob the builder    vehicles          10 mins
Pixar-Locomotion   vehicles          5 mins
Ice age            animals           90 mins
Over the hedge     animals           60 mins
Cars               vehicles          90 mins
Anima              human             90 mins
South Park         human             30 mins
Simpson            human             20 mins

New example: Jungle book, animals, 60 mins

$$\arg\max_i P(c_i \mid e_k) = \arg\max_i P(c_i) \prod_j P(f_{k,j} \mid c_i)$$

Model:

                   interesting   not interesting
P(class)           5/8           3/8
P(vehicle|class)   0.6           0
P(human|class)     0             1
P(animal|class)    0.4           0
P(5|class)         0.2           0
P(10|class)        0.2           0
P(20|class)        0             0.33
P(30|class)        0             0.33
P(60|class)        0.2           0
P(90|class)        0.4           0.33

P(interesting | Jungle book) → 5/8*(0.4*0.2) = 0.05
P(not interesting | Jungle book) → 3/8*(0*0) = 0
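The scores for Jungle book can be recomputed with a frequency-based Naive Bayes sketch in Python. The class labels "interesting" and "not interesting" are assumed from the 5/8 and 3/8 class priors; the function names and data layout are illustrative, and no smoothing is applied, so unseen feature values give probability 0, as in the example.

```python
from collections import Counter, defaultdict

# ((main characters, duration), assumed class label)
training = [
    (("vehicles", 10), "interesting"), (("vehicles", 5), "interesting"),
    (("animals", 90), "interesting"), (("animals", 60), "interesting"),
    (("vehicles", 90), "interesting"), (("human", 90), "not interesting"),
    (("human", 30), "not interesting"), (("human", 20), "not interesting"),
]

def train(examples):
    """Count classes and, per (class, feature position), the feature values."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    for features, label in examples:
        for j, value in enumerate(features):
            feature_counts[(label, j)][value] += 1
    return class_counts, feature_counts

def score(features, label, class_counts, feature_counts):
    """P(c) * prod_j P(f_j | c), with probabilities estimated by frequency."""
    p = class_counts[label] / sum(class_counts.values())
    for j, value in enumerate(features):
        p *= feature_counts[(label, j)][value] / class_counts[label]
    return p

class_counts, feature_counts = train(training)
jungle_book = ("animals", 60)
scores = {c: score(jungle_book, c, class_counts, feature_counts) for c in class_counts}
prediction = max(scores, key=scores.get)
```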


Generative Probabilistic Models

• Assume a simple (usually unrealistic) probabilistic method by which the data was generated
• Each class value has a different parameterized generative model that characterizes it
• Training: Use the data for each category to estimate the parameters of the generative model for that category.
– Maximum Likelihood Estimation (MLE): Set parameters to maximize the probability that the model produced the given training data
– If M_λ denotes a model with parameter values λ and D_k is the training data for the kth class, find model parameters for class k (λ_k) that maximize the likelihood of D_k:

$$\lambda_k = \arg\max_{\lambda} P(D_k \mid M_\lambda)$$

• Testing: Use Bayesian analysis to determine the category model that most likely generated a specific test instance.


Semi-supervised learning

Similar to supervised learning except that
• we have examples and only some of them are labeled
• we may have a human available for a limited time to provide labels of examples

– …this corresponds to the situation where all the patients in the database have symptoms, but only a few have a diagnosis
– …and occasionally we have doctors available for a limited time to answer questions about the patients


Using unlabeled data (Nigam et al., 2000)

• Given: a small number of labeled examples and a large pool of unlabeled examples, no human available
– e.g., classifying news articles as interesting or not interesting

• Approach description (EM + Naive Bayes):
– train a classifier with only the labeled documents,
– assign probabilistically-weighted class labels to the unlabeled documents,
– train a new classifier using all the documents,
– iterate until the classifier remains unchanged


Using Unlabeled Data with Expectation-Maximization (EM)

• Initialize: learn a Naive Bayes classifier from the labeled documents only
• E-step: estimate labels of the unlabeled documents
• M-step: use all documents to rebuild the classifier

Guarantees local maximum a posteriori parameters


Co-training (Blum & Mitchell, 1998)

Theory behind co-training
• Possible to learn from unlabeled examples
• Value of unlabeled data depends on
– How (conditionally) independent the two representations of the same data are
  • The more the better
– The number of redundant inputs (features)
  • Expected error decreases exponentially with this number
• Disagreement on unlabeled data predicts true error

Better performance on labelling unlabeled data compared to the EM approach


Bootstrap Learning to Classify Web Pages

Given: a set of documents (few labeled and many unlabeled) where each document is described by two independent sets of features (e.g. document text + hyperlink anchor text)

[Figure: a Page Classifier applied to the document and a Link Classifier applied to the hyperlink to the document]


Active Learning


Active Learning

• We use this method whenever hand-labeled data are rare or expensive to obtain

• Interactive method

• Requests labeling of only the “interesting” objects

• Much less human work is needed for the same result compared to labeling arbitrary examples

[Figure: a teacher sends data & labels to a passive student; an active student sends a query to the teacher and receives a label]

[Figure: performance plotted against the number of questions for an active student asking smart questions and a passive student asking random questions]


Approaches to Active Learning

• Uncertainty sampling (efficient)
– select the example closest to the decision hyperplane (or the one with classification probability closest to P=0.5) [Tong & Koller 2000]
• Maximum margin ratio change
– select the example with the largest predicted impact on the margin size if selected [Tong & Koller 2000]
• Monte Carlo Estimation of Error Reduction
– select the example that reinforces our current beliefs [Roy & McCallum 2001]
• Random sampling as baseline

• Experimental evaluation (using F1-measure) of the four listed approaches shown on three categories from the Reuters-2000 dataset
– average over 10 random samples of 5000 training (out of 500k) and 10k testing (out of 300k) examples
– two of the methods are rather time consuming, thus we run them only for including the first 50 unlabeled examples
– experiments show that active learning is especially useful for unbalanced data
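Uncertainty sampling with a linear model can be sketched as follows: among the unlabeled examples, pick the one whose decision value w·x + b is closest to zero. The weights, bias and pool below are hypothetical values chosen only to make the sketch runnable.

```python
def decision_value(w, b, x):
    """Signed score of a linear model; its magnitude grows with distance from the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def most_uncertain(w, b, unlabeled):
    """Index of the unlabeled example closest to the decision hyperplane."""
    return min(range(len(unlabeled)),
               key=lambda i: abs(decision_value(w, b, unlabeled[i])))

w, b = [1.0, -1.0], 0.0                        # hypothetical linear model
pool = [[3.0, 0.5], [1.0, 0.9], [0.2, 2.0]]    # hypothetical unlabeled examples
query_index = most_uncertain(w, b, pool)       # example to send to the human for labeling
```

Because the norm of w is the same for every candidate, ranking by |w·x + b| gives the same choice as ranking by geometric distance to the hyperplane.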


Category with very unbalanced class distribution having 2.7% of positive examples

Uncertainty seems to outperform MarginRatio


Illustration of Active learning

• starting with one labeled example from each class (red and blue)
• select one example for labeling (green circle)
• request its label and re-generate the model using the extended labeled data

Illustration of a linear SVM model using
• arbitrary selection of unlabeled examples (random)
• active learning selecting the most uncertain examples (closest to the decision hyperplane)


Uncertainty sampling of unlabeled example


Unsupervised learning

• Given is a set of examples
• The goal is to cluster the examples into several groups based on some similarity measure
– examples inside a group should be similar, while examples from different groups should be different

Similarity measure plays a crucial role:

Cosine:

$$Cos(d_1, d_2) = \frac{\sum_i x_{1i}\, x_{2i}}{\sqrt{\sum_j x_{1j}^2}\,\sqrt{\sum_k x_{2k}^2}}$$

Jaccard:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Minkowski (norm k); Manhattan (k=1), Euclidean (k=2):

$$L_k(d_1, d_2) = \|d_1 - d_2\|_k = \left(\sum_{i=1}^{|W|} |w_{1i} - w_{2i}|^k\right)^{1/k}$$
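The three measures can be written directly from their definitions. The following Python sketch uses plain lists and sets; the function names are illustrative, and returning 0.0 for empty or all-zero inputs is a convention assumed here rather than something the slides specify.

```python
import math

def cosine(d1, d2):
    """Cosine similarity of two equally long numeric vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def minkowski(d1, d2, k):
    """Minkowski distance of order k (k=1 Manhattan, k=2 Euclidean)."""
    return sum(abs(a - b) ** k for a, b in zip(d1, d2)) ** (1.0 / k)
```

Cosine and Jaccard are similarities (larger means closer), while Minkowski is a distance (smaller means closer), so they cannot be swapped into a clustering routine without flipping the comparison.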


Clustering methods

• Hierarchical
– agglomerative – at each step merge two or more groups
– divisive – at each step break the selected group into two or more groups
• Non-hierarchical
– requires specification of the number of clusters
– optimization of the initial clustering (e.g., maximize similarity of examples inside the same group)
• Geometrical
– map multidimensional space into two- or three-dimensional space (e.g., principal component analysis)
• Graph-theoretical


K-Means clustering algorithm

• Given:
– set of examples (e.g., TFIDF vectors of documents),
– distance measure (e.g., cosine)
– K (number of groups)
• For each of the K groups initialize its centroid with a random document
• While not converged
– Assign each document to the nearest group (represented by its centroid)
– For each group calculate the new centroid (group mass point, average document in the group)


Example of k-means clustering

Examples (K=2):
• A: 1,0,1,0,1
• B: 1,0,0,0,1
• C: 1,0,1,0,0
• D: 0,0,0,1,0
• E: 0,1,0,1,0

1. Randomly select two examples, e.g., A and D, to be the representatives of two clusters
   I: A, II: D
2. Calculate the similarity of the other examples to them
   B,I = 0.82, B,II = 0, C,I = 0.82, C,II = 0, E,I = 0, E,II = 0.7
3. Assign examples to the most similar cluster
   I: (A,B,C), II: (D,E)
4. Calculate the cluster centroids
   I: 1,0,0.67,0,0.67, II: 0,0.5,0,1,0
5. Calculate the similarity of all the examples to the centroids
   A,I = 0.88, A,II = 0, B,I = 0.77, B,II = 0, C,I = 0.77, C,II = 0, D,I = 0, D,II = 0.82, E,I = 0, E,II = 0.87
6. Cluster the examples
   I: (A,B,C), II: (D,E)
7. Stop, as the clustering has stabilized
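The example can be replayed with a compact k-means sketch in Python that uses cosine similarity and the seeds A and D. The function names and the convergence test (stop when assignments no longer change) are assumptions of this sketch. Plain cosine similarity reproduces the step 2 values; the step 5 values on the slide may come from a different normalization, so the sketch only checks the cluster assignments and centroids.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def kmeans(examples, seeds, max_iter=100):
    """examples: {name: vector}; seeds: names used as the initial centroids."""
    centroids = [list(examples[s]) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign every example to its most similar centroid.
        new_assignment = {}
        for name, vec in examples.items():
            sims = [cosine(vec, c) for c in centroids]
            new_assignment[name] = sims.index(max(sims))
        if new_assignment == assignment:  # clustering has stabilized
            break
        assignment = new_assignment
        # Recompute each centroid as the average of its members.
        for c in range(len(centroids)):
            members = [examples[n] for n, a in assignment.items() if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

examples = {
    "A": [1, 0, 1, 0, 1], "B": [1, 0, 0, 0, 1], "C": [1, 0, 1, 0, 0],
    "D": [0, 0, 0, 1, 0], "E": [0, 1, 0, 1, 0],
}
assignment, centroids = kmeans(examples, seeds=["A", "D"])
```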


Example of hierarchical clustering (bisecting k-means)

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
  0, 1, 2, 4, 6, 7, 9, 10, 11
    0, 2, 4, 7, 10, 11
      0, 2, 4, 7, 11
        2, 4, 11
          4
          2, 11
            2
            11
        0, 7
          0
          7
      10
    1, 6, 9
      1, 9
        1
        9
      6
  3, 5, 8
    3, 8
      3
      8
    5


Latent Semantic Indexing

• LSI is a statistical technique that attempts to estimate the hidden content structure within documents:
– …it uses the linear algebra technique Singular Value Decomposition (SVD)
– …it discovers the statistically most significant co-occurrences of terms


LSI Example

Original document-term matrix:

            d1   d2   d3   d4   d5   d6
cosmonaut    1    0    1    0    0    0
astronaut    0    1    0    0    0    0
moon         1    1    0    0    0    0
car          1    0    0    1    1    0
truck        0    0    0    1    0    1

Rescaled document matrix, reduced into two dimensions:

        d1      d2      d3      d4      d5      d6
Dim1   -1.62   -0.60   -0.04   -0.97   -0.71   -0.26
Dim2   -0.46   -0.84   -0.30    1.00    0.35    0.65

Correlation matrix:

      d1     d2     d3     d4     d5     d6
d1    1.00
d2    0.8    1.00
d3    0.4    0.9    1.00
d4    0.5   -0.2   -0.6    1.00
d5    0.7    0.2   -0.3    0.9    1.00
d6    0.1   -0.5   -0.9    0.9    0.7    1.00

High correlation (0.9) between d2 and d3 although they don’t share any word


Data one can analyze with ML methods

• Structured, tabular data
• Textual data
• Multilingual data
• Social networks
• Images/videos
• Relational data
• Temporal data
• Cross-modal analysis
• Very large datasets


Structured, tabular data

Table, where each example is a row and each feature is a column with predefined values

Title              Main characters   Duration
Bob the builder    vehicles, human   10 mins
Pixar-Locomotion   vehicles          5 mins
Ice age            animals           90 mins
Over the hedge     animals           60 mins
Cars               vehicles          90 mins
Anima              human             90 mins
South Park         human             30 mins
Simpson            human             20 mins

Main characters ∈ {vehicle, human, animal}
Duration ∈ [5..90]


Textual data

Having a set of documents, represent each as a feature vector:
• divide text into units (e.g., words), remove punctuation, (remove stop-words, stemming, …)
• each unit becomes a feature having a numeric weight as its value (e.g., number of occurrences in the text, referred to as term frequency or TF)

Commonly used weight is TFIDF:

$$TFIDF(w) = tf(w) \cdot \log\left(\frac{N}{df(w)}\right)$$

• tf(w) – term frequency (no. of occurrences of word w in the document)
• df(w) – document frequency (no. of documents containing word w)
• N – no. of all documents
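The TFIDF weight can be computed with a few lines of Python. This sketch assumes documents are already tokenized into word lists and uses the natural logarithm, since the base is not specified; the function name `tfidf` and the toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists; returns one {word: weight} dict per document."""
    n = len(documents)
    # df(w): number of documents containing w (each document counted once).
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf(w): occurrences of w in this document
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

# Hypothetical, already tokenized documents.
docs = [["bob", "builder", "bob", "children"],
        ["children", "movie"],
        ["children", "family"]]
weights = tfidf(docs)
```

A word that occurs in every document, such as "children" here, gets weight 0 because log(N/df) = log(1) = 0, which is the intended down-weighting of uninformative terms.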


Textual data - example

Documents:
– Bob the builder is a children's animated movie about the character Bob and his friends, who include several vehicle characters. They face challenges and jointly solve them, such as repairing a roof or saving Bob’s cat from a tall tree…
– Pixar has several short animated movies suitable for children. Locomotion is one of them, showing a train engine and a train wagon as two characters that face the challenge of crossing a half-broken bridge…
– …
– …Simpson family provokes a smile on many adult and children faces showing everyday life of a family of four…

Feature vectors (one row per document):

bob   builder   children   animated   movie   character   friend   vehicle   …
3     1         1          1          1       2           1        1         …
0     0         1          1          1       1           0        0         …
…     …         …          …          …       …           …        …         …
0     0         1          0          0       0           0        0         …


Perceptron learning algorithm (document classification)

Input: set of examples (documents) in the form of vectors, each labeled as +1 or -1
Output: linear model w_i (one weight per word)

Algorithm:
• initialize the model w_i by setting all word weights to 0
• iterate through the documents N times
• for each document d from D
– // using the current model w_i, classify the document d
– if sum(d_i * w_i) >= 0 then classify the document as positive
– else classify the document as negative
– if the document classification is wrong then
  • // adjust the weights of all words occurring in the document
  • w_{t+1} = w_t + sign(true-class) * Beta (input parameter Beta > 0)
  • // where sign(positive) = 1 and sign(negative) = -1
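The algorithm above translates almost line by line into Python. This is a sketch under assumptions: documents are dictionaries of word counts, `epochs` plays the role of N, and the toy documents are invented for illustration. It follows the update as stated on the slide (adding sign * Beta to every word that occurs in a misclassified document), which differs from the textbook perceptron, where the update is also multiplied by the feature value.

```python
def train_perceptron(documents, labels, epochs=10, beta=1.0):
    """documents: list of {word: count} dicts; labels: +1 or -1 per document."""
    weights = {}  # missing words have weight 0
    for _ in range(epochs):
        for doc, label in zip(documents, labels):
            predicted = classify(weights, doc)
            if predicted != label:
                # Adjust the weights of all words occurring in the document.
                for word in doc:
                    weights[word] = weights.get(word, 0.0) + label * beta
    return weights

def classify(weights, doc):
    """Positive (+1) if the weighted sum is >= 0, negative (-1) otherwise."""
    score = sum(count * weights.get(word, 0.0) for word, count in doc.items())
    return 1 if score >= 0 else -1

# Hypothetical toy corpus: +1 = about cartoons, -1 = about finance.
documents = [{"cartoon": 2, "children": 1}, {"stock": 1, "market": 2},
             {"children": 1, "movie": 1}, {"market": 1, "bank": 1}]
labels = [1, -1, 1, -1]
weights = train_perceptron(documents, labels)
```

Because a score of exactly 0 counts as positive, only the negative documents trigger updates on this toy data, so the learned model consists of negative weights for "stock" and "market".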


Multilingual data

• Text in several natural languages
• Perform machine learning and retrieval on textual data regardless of the language differences

• Approach:
– Machine Translation (on sentence level)
– Multilingual lexicon (on word level)
– Mapping into semantic space (on word level, e.g., KCCA)

Page 79: What Semantic Web researchers need to know about Machine Learning?

KCCA to handle multilingual data

• KCCA enables representing documents in a “language neutral way”
• Intuition behind KCCA:
  1. Given a parallel corpus (such as Acquis)…
  2. …first, we automatically identify language-independent semantic concepts from the text,
  3. …then, we re-represent documents with the identified concepts,
  4. …finally, we are able to perform cross-language statistical operations (such as retrieval, classification, clustering…)

Page 80: What Semantic Web researchers need to know about Machine Learning?

[Figure: documents in English, German, French, Spanish, Italian, Slovenian, Slovak, Czech, Hungarian, Greek, Finnish, Swedish, Dutch, Lithuanian and Danish are mapped into a Language Independent Document Representation. A new document, represented as text in any of the above languages, is re-represented in a language neutral way.]

…enables cross-lingual retrieval, categorization, clustering, …

Page 81: What Semantic Web researchers need to know about Machine Learning?

Input for KCCA

• On input we have a set of aligned documents:
  – For each document we have a version in each language
• Documents are represented as bag-of-words vectors

[Figure: a pair of aligned documents, one in the bag-of-words space for the English language and one in the bag-of-words space for the German language]

Page 82: What Semantic Web researchers need to know about Machine Learning?

The Output from KCCA

• The goal: find pairs of semantic dimensions that co-appear in documents and their translations with high correlation
  – A semantic dimension is a weighted set of words.
• These pairs are pairs of vectors, one from e.g. the English bag-of-words space and one from the German bag-of-words space.

Examples of semantic dimension pairs (English / German):
• loss, income, company, quarter / verlust, einkommen, firma, viertel
• wage, payment, negotiations, union / zahlung, volle, gewerkschaft, verhandlungsrunde

Page 83: What Semantic Web researchers need to know about Machine Learning?

The Algorithm – Theory

Formally, KCCA solves:

$$\max_{x,\,y} \; \mathrm{Corr}\big(\langle x, d_{EN} \rangle,\; \langle y, d_{DE} \rangle\big)$$

– x, y – semantic directions for English and German
– $(d_{EN}, d_{DE})$ is a pair of aligned documents
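To make the objective concrete, the sketch below evaluates it for one candidate pair of directions: every aligned document pair is projected onto x and y, and the Pearson correlation of the two projection lists is returned. It only scores a given pair; it does not perform the maximization that KCCA does. The function names and the dense list representation are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equally long dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def pearson(a, b):
    """Pearson correlation of two equally long lists (assumes non-zero variance)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def direction_correlation(x, y, docs_en, docs_de):
    """Correlation between projections of aligned documents onto x and y.

    docs_en[i] and docs_de[i] are bag-of-words vectors of the same document
    in English and in German.
    """
    proj_en = [dot(x, d) for d in docs_en]
    proj_de = [dot(y, d) for d in docs_de]
    return pearson(proj_en, proj_de)
```

A pair of directions that separates the same topics in both languages yields a correlation close to 1, while a direction that ignores part of the vocabulary scores lower.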

Page 84: What Semantic Web researchers need to know about Machine Learning?

Examples of Semantic Dimensions from Acquis corpus: English-French (1/2)

Most important words from semantic dimensions automatically generated from 2000 documents:

DIRECTIVE, DECISION, VEHICLES, AGREEMENT, EC, VETERINARY, PRODUCTS, HEALTH, MEAT

DIRECTIVE, DECISION, VEHICULES, PRESENTE, RESIDUS, ACCORD, PRODUITS, ANIMAUX, SANITAIRE

NOMENCLATURE, COMBINED, COLUMN, GOODS, TARIFF, CLASSIFICATION, CUSTOMS

NOMENCLATURE, COMBINEE, COLONNE, MARCHANDISES, CLASSEMENT, TARIF, TARIFAIRES

EMBRYOS, ANIMALS, OVA, SEMEN, ANIMAL, CONVENTION, BOVINE, DECISION, FEEDINGSTUFFS

EMBRYONS, ANIMAUX, OVULES, CONVENTION, SPERME, EQUIDES, DECISION, BOVINE, ADDITIFS

SUGAR, CONVENTION, ADDITIVES, PIGMEAT, PRICE, PRICES, FEEDINGSTUFFS, SEED

SUCRE, CONVENTION, PORC, ADDITIFS, PRIX, ALIMENTATION, SEMENCES, DECISION

EXPORT, LICENCES, LICENCE, REFUND, VEHICLES, FISHERY, CONVENTION, CERTIFICATE, ISSUED

EXPORTATION, CERTIFICATS, CERTIFICAT, PECHE, VEHICULES, LAIT, CONVENTION

Topic labels shown next to the dimension pairs: Veterinary; Customs; Agriculture; Export Licences; Veterinary, Transport

Page 85: What Semantic Web researchers need to know about Machine Learning?

Examples of Semantic Dimensions from Acquis corpora: English-Slovene (2/2)

Most important words from semantic dimensions automatically generated from 2000 documents :

OLIVE, OIL, AID, SUGAR, PRICE, STATE, MILK, LICENCES, OR, EXPORT, INTERVENTION

OLJA, OLJCNEGA, POMOCI, SLADKORJA, POMOC, OLJK, SLADKOR, ALI, DOVOLJENJA, CE

NOMENCLATURE, COLUMN, COMBINED, GOODS, TARIFF, CLASSIFICATION, ST, ANNEXED, INVOKED

NOMENKLATURO, STOLPCU, NOMENKLATURE, KOMBINIRANO, KOMBINIRANE, CARINSKI, BLAGA

QUOTAS, TARIFF, SEED, CUSTOMS, COLUMN, ENERGY, INVOKED, ATOMIC, QUOTA, OPENING

KVOT, TARIFNE, SEMENA, KVOTE, TARIFNIH, CARINSKI, ATOMSKO, ENERGIJO, ODPRTJU

DESIGNATIONS, GEOGRAPHICAL, INDICATIONS, EURATOM, PROTECTED, ECSC, NAMES, ORIGIN

OZNACB, EURATOM, GEOGRAFSKIH, POREKLA, ESPJ, ZASCITENIH, OZNACBE, IMEN, REGISTER

WINE, WINES, ALCOHOL, DRINKS, DISTILLATION, POULTRYMEAT, ICEWINE, ANALYSIS

VINO, VINA, VIN, VINSKEM, VINSKEGA, ALKOHOL, NAMIZNEGA, DESTILACIJO, DESTILACIJE

Topic labels shown next to the dimension pairs: Energy; Customs; Agriculture protection; Wine; Agriculture

Page 86: What Semantic Web researchers need to know about Machine Learning?

Applications of KCCA

• Cross-lingual document retrieval: retrieved documents depend only on the meaning of the query and not its language.
• Automatic document categorization: only one classifier is learned and not a separate classifier for each language.
• Document clustering: documents should be grouped into clusters based on their content, not on the language they are written in.
• Cross-media information retrieval: in the same way we correlate two languages we can correlate text to images, text to video, text to sound, …
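As a rough illustration of cross-lingual retrieval, the sketch below projects an English query and German documents onto paired semantic dimensions and ranks the documents by cosine similarity in that shared space. The word lists reuse the English/German example pairs from the KCCA output slide, but the uniform weights, the whitespace tokenization and all function names are assumptions; real KCCA dimensions are learned, weighted vectors.

```python
import math

# Hypothetical paired semantic dimensions (word -> weight); index i in
# DIMS_EN corresponds to index i in DIMS_DE.
DIMS_EN = [
    {"loss": 1.0, "income": 1.0, "company": 1.0, "quarter": 1.0},
    {"wage": 1.0, "payment": 1.0, "negotiations": 1.0, "union": 1.0},
]
DIMS_DE = [
    {"verlust": 1.0, "einkommen": 1.0, "firma": 1.0, "viertel": 1.0},
    {"zahlung": 1.0, "volle": 1.0, "gewerkschaft": 1.0, "verhandlungsrunde": 1.0},
]


def project(text, dims):
    """Map a text to the language-neutral space: one coordinate per dimension."""
    words = text.lower().split()
    return [sum(dim.get(w, 0.0) for w in words) for dim in dims]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def retrieve(query, dims_query, docs, dims_docs):
    """Return the document most similar to the query in the shared space."""
    q = project(query, dims_query)
    return max(docs, key=lambda d: cosine(q, project(d, dims_docs)))
```

Because both languages are compared only through the shared coordinates, the same `retrieve` call works for any language pair for which paired dimensions are available.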

Page 87: What Semantic Web researchers need to know about Machine Learning?

‘Borse’ = ‘Stock Exchange’

Example of cross-lingual information retrieval on Reuters news corpus using KCCA

Page 88: What Semantic Web researchers need to know about Machine Learning?

Search engine

Page 89: What Semantic Web researchers need to know about Machine Learning?

Related approaches

• The usual approach for modelling cross-language Information Retrieval is Latent Semantic Indexing (LSI/SVD) on parallel corpora
  – …the measured performance of KCCA is significantly better than that of LSI [Vinokourov et al., 2002]

Page 90: What Semantic Web researchers need to know about Machine Learning?

Availability/Scalability

• KCCA is available within the Text-Garden text-mining software environment
  – …available at http://www.textmining.net
• The current version processes up to 10,000 documents
• The next version (incremental) will be able to process up to 100,000 documents

Page 91: What Semantic Web researchers need to know about Machine Learning?

Social networks

• Social networks can also be a potential source of data for machine learning and for building semantic structures
  – …conceptually, they share a similar underlying structure with text: namely, the underlying distribution follows a power law
• In the next slides we show how social networks can be modeled using unsupervised techniques

Page 92: What Semantic Web researchers need to know about Machine Learning?

Analysis of e-mail graph

• An e-mail graph can be analyzed in the following 5 major steps:
  1. We start with log files from an e-mail server, where the data include information about e-mail transactions with the fields: sender and the list of receivers.
  2. After cleaning, we get the data in the form of e-mail transactions, which include the e-mail addresses of sender and receiver.
  3. From the set of e-mail transactions we construct a graph where vertices are e-mail addresses, connected if there is a transaction between them.
  4. The e-mail graph is transformed into a sparse matrix, which allows data manipulation and analysis operations.
  5. The sparse matrix representation of the graph is analyzed with ontology learning tools, producing an ontological structure corresponding to the organizational structure of the institution the e-mails came from.

Page 93: What Semantic Web researchers need to know about Machine Learning?

Graph transformation into a sparse matrix

• A graph with N vertices is transformed into an N*N sparse matrix where:
  – …the Xth row represents information for the Xth vertex
  – …the Xth row has nonzero components for:
    • the Xth vertex itself and
    • the Xth vertex’s neighbors up to distance D (e.g. 1, 2, 3)
  – Intuitively, the Xth row numerically represents the “neighborhood” of the Xth vertex within the graph:
    • the Xth element in the Xth row has weight 1
    • …elements representing neighbors have lower weights relative to the distance (d) from the Xth vertex (1/(2^d))
      – (e.g. 1, 0.5, 0.25, 0.125, …)
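The transformation can be sketched with a breadth-first search from every vertex. This is an illustrative reading of the slide, not the original tool: the function name, the adjacency-dictionary input and the dictionary-of-rows output (a simple sparse format) are assumptions.

```python
from collections import deque


def neighborhood_rows(adj, max_dist=2):
    """Build the sparse neighborhood matrix of an undirected graph.

    adj      -- dict mapping each vertex to an iterable of its neighbors
    max_dist -- largest distance D that still gets a nonzero weight
    Returns a dict {vertex: {other_vertex: weight}} with weight 1 / 2**d,
    where d is the shortest-path distance between the two vertices.
    """
    rows = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if dist[v] == max_dist:          # do not expand beyond distance D
                continue
            for u in adj[v]:
                if u not in dist:            # first visit gives the shortest distance
                    dist[u] = dist[v] + 1
                    queue.append(u)
        rows[start] = {v: 1.0 / (2 ** d) for v, d in dist.items()}
    return rows
```

For the path graph 0-1-2-3 with D = 2, the row of vertex 0 holds the weights 1, 0.5 and 0.25 for vertices 0, 1 and 2, and no entry for vertex 3.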

Page 94: What Semantic Web researchers need to know about Machine Learning?

Graph transformation into sparse matrix (example)

[Figure: example graph with 12 vertices, numbered 0 to 11]

0 1 2 3 4 5 6 7 8 9 10 11

0 1 0.5 0.25 0.25 0.5 0.25 0.5 0.25 0.25

1 0.5 1 0.25 0.5 0.25 0.5 0.25

2 1 0.25 0.5 0.25

3 0.25 1 0.5 0.25

4 0.25 0.5 0.5 1 0.25 0.5 0.25

5 0.25 0.25 1 0.5 0.25

6 0.5 0.25 1 0.25

7 0.25 0.5 1 0.25

8 0.5 0.25 0.25 0.25 0.5 0.5 0.25 1 0.5

9 0.25 0.5 0.25 1 0.5

10 0.25 0.5 1

11 0.25 0.25 0.25 0.5 1

Transforming Graph into Matrix

Page 95: What Semantic Web researchers need to know about Machine Learning?

Data used for Experimentation

• The data is a collection of log files with e-mail transactions from the local e-mail spam filter software Amavis (http://www.amavis.org/):
  – Each line of the log files denotes one event at the spam filter software
  – We were interested in the events on successful e-mail transactions
    • ...having information on time, sender, and list of receivers
  – An example of a successful e-mail transaction is the following line:
    • 2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, [217.32.164.151] [193.113.30.29] <[email protected]> -> <[email protected]>, Message-ID: <21DA6754A9238B48B92F39637EF307FD0D4781C8@i2km41-ukdy.domain1.systemhost.net>, Hits: -1.668, 6389 ms
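A log line of this shape could be turned into an e-mail transaction roughly as follows. The regular expression is written only against the single example line above; the assumption that several receivers appear as a comma-separated list of `<address>` items, and the function name, are hypothetical and would need checking against real Amavis logs.

```python
import re

# Matches the timestamp, the sender and the receiver list of a "Passed CLEAN" event.
LINE_RE = re.compile(
    r"^(?P<time>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r".*?Passed CLEAN,.*? <(?P<sender>[^>]*)> -> "
    r"(?P<receivers><[^>]*>(?:,<[^>]*>)*)"
)


def parse_transaction(line):
    """Return (time, sender, [receivers]) for a successful transaction, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None                          # not a successful e-mail transaction
    receivers = re.findall(r"<([^>]*)>", match.group("receivers"))
    return match.group("time"), match.group("sender"), receivers
```

Each parsed transaction contributes one edge per receiver to the e-mail graph, connecting the sender's address with the receiver's address.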

Page 96: What Semantic Web researchers need to know about Machine Learning?

Some statistics about the data

• The log files include e-mails for 19 months:
  – …this sums up to 12.8 GB of data.
  – After extracting the successful e-mail transactions, 564 MB remains
    • …which contains approx. 2.7 million successful e-mail transactions used for further processing
  – The whole dataset contains references to approx. 45,000 e-mail addresses
    • …after the data cleaning phase the number is reduced to approx. 17,000 e-mail addresses
    • …out of which 770 e-mail addresses are internal to the home institution (with local domain name)

Page 97: What Semantic Web researchers need to know about Machine Learning?

Organizational structure of JSI produced from cleaned e-mail transactions with OntoGen in <5 minutes

Page 98: What Semantic Web researchers need to know about Machine Learning?

Organizational structure of JSI visualized from e-mail transactions with Document-Atlas

Page 99: What Semantic Web researchers need to know about Machine Learning?

Software Mining

[Diagram: software consists of structured data (= networks, analyzed with link analysis) and unstructured data (= textual documents, analyzed with text mining). Both are combined into a document network = a set of interlinked documents; each link has type and weight.]

[Grčar, Mladenić, Grobelnik, 2007]

Page 100: What Semantic Web researchers need to know about Machine Learning?

/** The format of Documents. Subclasses of DocumentFormat know about
  * particular MIME types and how to unpack the information in any
  * markup or formatting they contain into GATE annotations. Each MIME
  * type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
  * RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
  * with a static index residing here when they are constructed. Static
  * getDocumentFormat methods can then be used to get the appropriate
  * format class for a particular document.
  */
public abstract class DocumentFormat
extends AbstractLanguageResource implements LanguageResource{

  /** The MIME type of this format. */
  private MimeType mimeType = null;

  /**
    * Find a DocumentFormat implementation that deals with a particular
    * MIME type, given that type.
    * @param aGateDocument this document will receive as a feature
    * the associated Mime Type. The name of the feature is
    * MimeType and its value is in the format type/subtype
    * @param mimeType the mime type that is given as input
    */
  static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,
                                                 MimeType mimeType){
  } // getDocumentFormat(aGateDocument, MimeType)

} // class DocumentFormat

[Annotations on the code: class comment, class name, super-class (base class), implemented interface, a field (field type, field name, field comment), a method (return type, method name, method comment), comment references]

Extracting data

Page 101: What Semantic Web researchers need to know about Machine Learning?

Extracting data

[Figure: DocumentFormat, extracted from DocumentFormat.class]

Page 102: What Semantic Web researchers need to know about Machine Learning?

Extracting data

[Figure: graph extracted from DocumentFormat.class, linking DocumentFormat with AbstractLanguageResource, LanguageResource, MimeType, Document, XmlDocumentFormat, RtfDocumentFormat and MpegDocumentFormat; one link carries the weight 2]

Page 103: What Semantic Web researchers need to know about Machine Learning?

See next slide

Example graph

Page 104: What Semantic Web researchers need to know about Machine Learning?

Example graph - zoomed

Page 105: What Semantic Web researchers need to know about Machine Learning?

Structuring extracted knowledge

Source code → Preprocessing → Feature vectors → OntoGen → Ontology

Page 106: What Semantic Web researchers need to know about Machine Learning?

Semantic Web and Machine Learning

Page 107: What Semantic Web researchers need to know about Machine Learning?

Examples of Semantic Web with Machine Learning

• Ontology learning
• Detecting semantics via Google
• Visualization of semantic spaces
• Contextualized search
• User modeling
• Detecting bias in the news
• Semantic summarization

Page 108: What Semantic Web researchers need to know about Machine Learning?

ML view on Ontology

• An ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts
• An ontology can be seen as a graph/network structure consisting of:
  – a set of concepts (vertices in a graph),
  – a set of instances assigned to particular concepts (data records assigned to vertices in a graph),
  – a set of relationships connecting concepts (directed edges in a graph)

Page 109: What Semantic Web researchers need to know about Machine Learning?

Example of a Topic Ontology

Page 110: What Semantic Web researchers need to know about Machine Learning?

Ontology learning

• Define the ontology learning tasks in terms of mappings between ontology components, where some of the components are given, some are missing, and we want to induce the missing ones.
• Some typical scenarios in ontology learning are the following:
  – Inducing concepts/clustering of instances (given instances)
  – Inducing relations (given concepts and the associated instances)
  – Ontology population (given an ontology and relevant, but not associated instances)
  – Ontology generation (given instances and any other background information)
  – Ontology updating/extending (given an ontology and background information, such as new instances or the ontology usage patterns)

Page 111: What Semantic Web researchers need to know about Machine Learning?

Ontology Learning with OntoGen (developed on top of Text Garden)

• Semi-Automatic
  – provides suggestions and insights into the domain
  – the user interacts with the parameters of the methods
  – final decisions are taken by the user
• Data-Driven
  – most of the aid provided by the system is based on some underlying data
  – instances are described by features extracted from the data (e.g., word vectors)

[Fortuna, Mladenić, Grobelnik, 2005]

The installation package is publicly available as binaries at ontogen.ijs.si

Page 112: What Semantic Web researchers need to know about Machine Learning?

Basic idea behind OntoGen

[Figure: a text corpus from a domain is turned into an ontology with Concept A, Concept B and Concept C]

Page 113: What Semantic Web researchers need to know about Machine Learning?

Hierarchy of concepts

Ontology visualization

Root concept

Selected concept

Details about the selected concept

Concept name

Descriptive keywords

Distinctive keywords

These two views are synchronized

Page 114: What Semantic Web researchers need to know about Machine Learning?

Suggesting sub-concepts

Number of suggestions

List of suggestions

Page 115: What Semantic Web researchers need to know about Machine Learning?
Page 116: What Semantic Web researchers need to know about Machine Learning?

Inspection tool

Page 117: What Semantic Web researchers need to know about Machine Learning?

Oil

Mining

Real estate

Restaurants

Life insurance

Airlines

Pharmacy

Page 118: What Semantic Web researchers need to know about Machine Learning?

Description of the concept

Page 119: What Semantic Web researchers need to know about Machine Learning?

The system asks several “yes or no” questions

Page 120: What Semantic Web researchers need to know about Machine Learning?
Page 121: What Semantic Web researchers need to know about Machine Learning?
Page 122: What Semantic Web researchers need to know about Machine Learning?

List of documents

Document preview pane

Similarity graph

Checkboxes indicate whether a document

belongs to the concept or not

Red dots represent documents that

currently belong to the concept

Page 123: What Semantic Web researchers need to know about Machine Learning?

Google Meaning

Page 124: What Semantic Web researchers need to know about Machine Learning?

Google meaning

• Google offers insight into how words are used in a large corpus (the web)
• This can be used to estimate the distance between terms
  – Estimation is based on the co-occurrence of terms
  – Uses the frequency of each term and of their co-occurrences
• Examples:
  – GoogDist(“Oscar Wilde”, “The Picture of Dorian Gray”) = 0.2
  – GoogDist(“Oscar Wilde”, “A Midsummer Night’s Dream”) = 0.5
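One well-known way to turn term frequencies and co-occurrence counts into such a distance is the Normalized Google Distance of Cilibrasi and Vitányi. The slides do not state which formula GoogDist uses, so the sketch below is an assumption about the kind of measure meant; the function and parameter names are illustrative, and the page counts would have to come from a search engine.

```python
import math


def google_distance(fx, fy, fxy, n_pages):
    """Normalized Google Distance between two terms.

    fx, fy  -- number of pages containing term x, respectively term y
    fxy     -- number of pages containing both terms
    n_pages -- total number of indexed pages
    """
    if fxy == 0:
        return math.inf                      # terms never co-occur
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_x, log_y) - log_xy) / (math.log(n_pages) - min(log_x, log_y))
```

Terms that always appear together get distance 0, and the distance grows as the co-occurrence count shrinks relative to the individual term frequencies.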

Page 125: What Semantic Web researchers need to know about Machine Learning?

Google meaning

• Google offers insight into how words are used in a large corpus (the web)
• This can be used to estimate similarities between terms
  – Estimation is based on the co-occurrence of terms
  – Uses the frequency of each term and of their co-occurrences

Page 126: What Semantic Web researchers need to know about Machine Learning?

Text Visualization

Document-Atlas http://docatlas.ijs.si

Page 127: What Semantic Web researchers need to know about Machine Learning?

Document Atlas – visualization of document collections and their structure

http://docatlas.ijs.si

Page 128: What Semantic Web researchers need to know about Machine Learning?

Contextualized web-search

SearchPoint: http://searchpoint.ijs.si

Page 129: What Semantic Web researchers need to know about Machine Learning?

Visualization of search results

• Search engines generally work very well
• There are cases where it is difficult to specify a query
• Idea: help the user by clustering all the hits and visualizing the results space

Page 130: What Semantic Web researchers need to know about Machine Learning?

Example: jaguar

Relevant part of the ontology for the query “jaguar”

Page 131: What Semantic Web researchers need to know about Machine Learning?

Example: ISWC

Query ISWC has many meanings

Page 132: What Semantic Web researchers need to know about Machine Learning?

Example: Password

Page 133: What Semantic Web researchers need to know about Machine Learning?

SEKTbar – intelligent user profiling

Page 134: What Semantic Web researchers need to know about Machine Learning?

User focused browsing history in SEKTbar

• A Web-based user profile is automatically generated while the user is browsing the Web
  – The user profile is represented as a user-interest topic ontology (the root gives the user’s general interest, the leaves specific interests)
  – The topic ontology is generated using hierarchical clustering
  – Nodes of current interest are determined by comparing the profile node centroids to the centroid computed from the m most recently visited pages
• The user profile is visualized on the SEKTbar (Internet Explorer Toolbar)
  – the user can select a node in the hierarchy to see its specific keywords and associated pages
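The centroid comparison could look roughly like the sketch below. Cosine similarity is assumed as the comparison measure (the slide does not name one), and the function names, the dense vectors and the `top_k` parameter are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally long dense vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def nodes_of_interest(node_centroids, recent_pages, top_k=1):
    """Rank profile nodes by similarity to the centroid of the m most recent pages.

    node_centroids -- dict {node_name: centroid vector of the node}
    recent_pages   -- bag-of-words vectors of the m most recently visited pages
    """
    recent = centroid(recent_pages)
    ranked = sorted(node_centroids,
                    key=lambda name: cosine(node_centroids[name], recent),
                    reverse=True)
    return ranked[:top_k]
```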

[Grčar, Mladenić, Grobelnik, 2006]

Page 135: What Semantic Web researchers need to know about Machine Learning?
Page 136: What Semantic Web researchers need to know about Machine Learning?

Detecting the bias in media with statistical learning methods

Page 137: What Semantic Web researchers need to know about Machine Learning?

Media Bias

• The descriptions of the same event in the news can differ with respect to the news source and its bias.
  – Can we detect this using statistical methods?
  – How is this bias reflected in the vocabulary?
• To answer these questions we need:
  – A controlled collection of news over a longer time period
    • A one-year corpus of news articles on international events
  – Matching of news articles from different news sources based on the described event
    • We used the bag-of-words model and the “Best Reciprocal Hit” method from bioinformatics to match news articles describing similar events
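Best Reciprocal Hit matching keeps a pair of articles only if each one is the other's most similar article in the opposite source. The sketch below assumes a precomputed similarity matrix (for example, cosine similarity of bag-of-words vectors); the function name and input format are illustrative.

```python
def best_reciprocal_hits(sim):
    """Match articles of source A and source B by Best Reciprocal Hit.

    sim[i][j] is the similarity between article i of source A and
    article j of source B. Returns a list of matched (i, j) pairs.
    """
    if not sim:
        return []
    n_b = len(sim[0])
    # Best article in B for every article in A.
    best_for_a = [max(range(n_b), key=lambda j: row[j]) for row in sim]
    # Best article in A for every article in B.
    best_for_b = [max(range(len(sim)), key=lambda i: sim[i][j]) for j in range(n_b)]
    # Keep only the pairs that choose each other.
    return [(i, j) for i, j in enumerate(best_for_a) if best_for_b[j] == i]
```

Articles without a reciprocal partner stay unmatched, which trades coverage for fewer spurious matches.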

Page 138: What Semantic Web researchers need to know about Machine Learning?

Experimental setup

• Time period: March 31st 2005 – April 14th 2006
• Size of collections:
• Number of discovered matches:

Page 139: What Semantic Web researchers need to know about Machine Learning?

Prediction of news source

• The task: given a pair of news articles describing the same event, can we predict the news source for each?

• In this experiment we focused on CNN and Al Jazeera.

• A linear SVM classifier was used for prediction
  – Evaluation was done using 10-fold cross-validation
  – Significance of results was tested against random matches
• We used SVM feature selection [Brank et al.] to extract the most important classification keywords.

Page 140: What Semantic Web researchers need to know about Machine Learning?

Source prediction results

• News source can be predicted with 81% BEP (break-even point), which indicates a high bias in vocabulary

• News source specific keywords:

Page 141: What Semantic Web researchers need to know about Machine Learning?

Bias example

Page 142: What Semantic Web researchers need to know about Machine Learning?

Keywords differences on topic level

Page 143: What Semantic Web researchers need to know about Machine Learning?

News source similarity maps

Based on source vocabulary differences

Based on covered events

Page 144: What Semantic Web researchers need to know about Machine Learning?

Summarization of documents through semantic-graphs

Page 145: What Semantic Web researchers need to know about Machine Learning?

Approach Description

• Approach:
  – Learn a machine learning model for selecting sentences
  – Use information about the semantic structure of the document (concepts and relations among concepts)
• Results are promising
  – achieved 70% recall and 40% precision on extracted Subject-Predicate-Object triples on DUC (Document Understanding Conference) data
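Precision and recall over triples can be computed as below, treating the triples found in the human summary as the reference set. The function name and the exact-match comparison of triples are assumptions; the evaluation behind the reported numbers may have matched triples differently.

```python
def precision_recall(selected, reference):
    """Precision and recall of selected triples against reference triples.

    selected  -- triples chosen by the model, e.g. ("Tom", "go", "town")
    reference -- triples extracted from the human-built summary
    """
    selected, reference = set(selected), set(reference)
    hits = len(selected & reference)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```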

Page 146: What Semantic Web researchers need to know about Machine Learning?

Automatic summarization by selecting relevant triples

Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted.Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.

Cracks Appear in U.N. Trade Embargo Against Iraq.

Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on Tuesday night promised a joint session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest of Kuwait permanent. ``America must stand up to aggression, and we will,'' said Bush, who added that the U.S. military may remain in the Saudi Arabian desert indefinitely. ``I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait,'' Bush said. More than 150,000 U.S. troops have been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil _ in exchange for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U.N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U.N. sanctions. Romania denies the allegation. 
The report, made available to The Associated Press, said some Eastern European countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the first by a senior Iraqi official since the 1980-88 gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150,000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and the world will not be blackmailed.'' The president added: ``Vital issues of principle are at stake. 
Saddam Hussein is literally trying to wipe a country off the face of the Earth.'' In other developments: _A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of involvement in the resistance. ``There is no law and order,'' said Thuraya, 19, who would not give her last name. ``A soldier can rape a father's daughter in front of him and he can't do anything about it.'' _The State Department said Iraq had told U.S. officials that American males residing in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect. _A Pentagon spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities are imminent. Defense Secretary Dick Cheney said the cost of the U.S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost _ if no shooting war breaks out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers ``a significant increase'' in help from Arab nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq. 
``The pressure from abroad is getting so strong,'' said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the Philippines and Namibia _ said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil. Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq. 
Romania's ambassador to the United States, Virgil Constantinescu, denied that claim Tuesday, calling it ``absolutely false and without foundation.''.

Summarization (diagram)

• Original Document → Manual summarization → Human built document summary
• Creation of semantic network: each document is turned into a semantic net of Subj-Pred-Obj triples
• Mapping between graphs learned with ML methods
– 70% recall, 40% precision of selected triples according to human generated summaries
• Nat. Lang. Generation → Automatically built document summary (not done by us)

Human built document summary:

Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted. Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.
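The recall and precision figures quoted for the selected triples can be read as set overlap between the triples a model selects and the triples found in the human summary. The sketch below is only an illustration of that reading: the function name `precision_recall` and the example triples are hypothetical and are not taken from the tutorial's data or evaluation code.

```python
def precision_recall(selected, reference):
    """Compare automatically selected triples with triples from a human summary.

    precision = share of selected triples that also occur in the reference
    recall    = share of reference triples that were selected
    """
    selected, reference = set(selected), set(reference)
    hits = len(selected & reference)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical triples, for illustration only.
reference = {
    ("Japan", "promise", "aid"),
    ("Saddam", "offer", "oil"),
    ("Iran", "exchange", "food"),
}
selected = {
    ("Japan", "promise", "aid"),
    ("Saddam", "offer", "oil"),
    ("Bush", "visit", "Congress"),
    ("Cheney", "estimate", "cost"),
    ("Brady", "visit", "Tokyo"),
}

precision, recall = precision_recall(selected, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the five selected triples are in the reference set and two of the three reference triples are recovered, so a selector can have fairly high recall while still picking many triples that the human summary does not contain.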

Page 147: What Semantic Web researchers need to know about Machine Learning?

Detailed Summarization Procedure

• Linguistic analysis of the text
– Deep parsing of sentences
• Refinement of the text parse
– Named-entity consolidation: determine that ‘George Bush’ = ‘Bush’ = ‘U.S. president’
– Anaphora resolution: link pronouns with named entities
• Extract Subject–Predicate–Object triples
• Compose a graph from triples
• Describe each triple with a set of features for learning
• Learn a model to classify triples into the summary
• Generate a summary graph
• Use summary graph to generate textual document summary

Example:
– Input text: Tom Sawyer went to town. He met a friend. Tom was happy. …
– After refinement: Tom Sawyer went to town. He [Tom Sawyer] met a friend. Tom [Tom Sawyer] was happy. …
– Extracted triples: Tom go town; Tom meet friend; Tom is happy
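The graph, feature and learning steps of the procedure can be sketched in a few lines. This is a toy illustration under loud assumptions: the three features (subject degree, object degree, relative position), the labels, the fourth triple and the plain perceptron are all stand-ins chosen for brevity, not the features or learner used in the work described here.

```python
from collections import defaultdict

# Toy triples; the first three follow the Tom Sawyer example, the fourth is hypothetical.
TRIPLES = [
    ("Tom", "go", "town"),
    ("Tom", "meet", "friend"),
    ("Tom", "is", "happy"),
    ("friend", "own", "dog"),
]


def compose_graph(triples):
    """Build an adjacency map: node -> set of (predicate, neighbour)."""
    graph = defaultdict(set)
    for subj, pred, obj in triples:
        graph[subj].add((pred, obj))
        graph[obj]  # touch the object so it exists as a node
    return graph


def triple_features(triple, graph, position, total):
    """Assumed features: out-degree of subject, out-degree of object, relative position."""
    subj, _, obj = triple
    return [len(graph[subj]), len(graph[obj]), 1.0 - position / total]


def train_perceptron(xs, ys, epochs=20):
    """Plain perceptron as a stand-in learner; labels are +1 (in summary) or -1."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: move the boundary towards x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


def summary_graph(triples, labels):
    """Train on labelled triples and return those the model places in the summary."""
    graph = compose_graph(triples)
    xs = [triple_features(t, graph, i, len(triples)) for i, t in enumerate(triples)]
    model = train_perceptron(xs, labels)
    return [t for t, x in zip(triples, xs) if predict(model, x) == 1]


# Hypothetical labels: the triples about the central entity belong to the summary.
print(summary_graph(TRIPLES, [1, 1, 1, -1]))
```

The sketch trains and predicts on the same four triples only to keep it short; in a realistic setting the model would be trained on documents that have human summaries and then applied to triples from unseen documents.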

Page 148: What Semantic Web researchers need to know about Machine Learning?

Cracks Appear in U.N. Trade Embargo Against Iraq.

Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on Tuesday night promised a joint session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest of Kuwait permanent. ``America must stand up to aggression, and we will,'' said Bush, who added that the U.S. military may remain in the Saudi Arabian desert indefinitely. ``I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait,'' Bush said. More than 150,000 U.S. troops have been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil _ in exchange for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U.N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U.N. sanctions. Romania denies the allegation. 
The report, made available to The Associated Press, said some Eastern European countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the first by a senior Iraqi official since the 1980-88 gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150,000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and the world will not be blackmailed.'' The president added: ``Vital issues of principle are at stake. 
Saddam Hussein is literally trying to wipe a country off the face of the Earth.'' In other developments: _A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of involvement in the resistance. ``There is no law and order,'' said Thuraya, 19, who would not give her last name. ``A soldier can rape a father's daughter in front of him and he can't do anything about it.'' _The State Department said Iraq had told U.S. officials that American males residing in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect. _A Pentagon spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities are imminent. Defense Secretary Dick Cheney said the cost of the U.S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost _ if no shooting war breaks out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers ``a significant increase'' in help from Arab nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq. 
``The pressure from abroad is getting so strong,'' said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the Philippines and Namibia _ said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil. Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq. 
Romania's ambassador to the United States, Virgil Constantinescu, denied that claim Tuesday, calling it ``absolutely false and without foundation.''.

Example of summarization

Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted. Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.

7800 chars, 1300 words

Human written summary

Page 149: What Semantic Web researchers need to know about Machine Learning?

Automatically generated graph of summary triples

Page 150: What Semantic Web researchers need to know about Machine Learning?

Further online information

• Recorded tutorials, lectures, summer-schools available from http://videolectures.net
– Semantic Web: http://videolectures.net/Top/Computer_Science/Semantic_Web/
– Machine Learning: http://videolectures.net/Top/Computer_Science/Machine_Learning/

Page 151: What Semantic Web researchers need to know about Machine Learning?

ML Summer Schools @ Video Lectures

• ML summer schools online