Hadi Mohammadzadeh Text Mining by Examples Pages
By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm – 27 Jan. 2010
Seminar on
Text Mining
by Examples
Outline
1. New Terminologies
2. WordNet - A Large Lexical Database of English
3. Reuters-21578 … as a Text Collection
4. CMU Text Learning Group Data Archives
5. Text Mine Software - Web based algorithms
6. Text Mine Software - Command based algorithms
7. Useful Web sites
Part One
New Terminologies: Word and Meaning Relationships
Understanding Text
Hyponym and Hypernym
• In linguistics, a hyponym is a word or phrase whose semantic range is included within that of another word, its hypernym. For example, scarlet and crimson are both hyponyms of red (their hypernym), which is, in turn, a hyponym of colour.
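The hyponym/hypernym relation can be pictured as a small tree in which each word points to its more general term. The sketch below is illustrative only: the table `HYPERNYM` and the functions `hypernym_chain` and `is_hyponym_of` are hypothetical names, and the entries simply restate the colour example rather than real WordNet data.

```python
# Toy hypernym table built from the colour example; not WordNet data.
HYPERNYM = {
    "scarlet": "red",
    "crimson": "red",
    "red": "colour",
}

def hypernym_chain(word):
    """Return the increasingly general terms found above `word`."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_hyponym_of(word, candidate):
    """True if `candidate` appears anywhere above `word` in the table."""
    return candidate in hypernym_chain(word)

print(hypernym_chain("scarlet"))
print(is_hyponym_of("crimson", "colour"))
```

Because the relation is transitive, scarlet counts as a hyponym of colour as well as of red.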
Understanding Text
Meronym
• Meronymy is a semantic relation used in linguistics. A meronym denotes a constituent part of, or a member of, something. That is,
– X is a meronym of Y if Xs are parts of Y(s), or
– X is a meronym of Y if Xs are members of Y(s).
• For example, 'finger' is a meronym of 'hand' because a finger is part of a hand. Similarly 'wheel' is a meronym of 'automobile'.
Understanding Text
Holonym
• Holonymy defines the relationship between a term denoting the whole and a term denoting a part of the whole. That is,
– 'X' is a holonym of 'Y' if Ys are parts of Xs, or
– 'X' is a holonym of 'Y' if Ys are members of Xs.
• For example, 'tree' is a holonym of 'bark', of 'trunk' and of 'limb'.
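Meronymy and holonymy are two directions of the same part-whole relation, so one list of (part, whole) pairs is enough to answer both kinds of question. The following is a minimal sketch under that assumption; the names `PART_OF`, `meronyms` and `holonyms` are hypothetical, and the pairs are only the examples given in the slides.

```python
from collections import defaultdict

# (part, whole) pairs taken from the examples in the slides.
PART_OF = [
    ("finger", "hand"),
    ("wheel", "automobile"),
    ("bark", "tree"),
    ("trunk", "tree"),
    ("limb", "tree"),
]

holonyms = defaultdict(set)  # part  -> wholes it belongs to
meronyms = defaultdict(set)  # whole -> parts it contains
for part, whole in PART_OF:
    holonyms[part].add(whole)
    meronyms[whole].add(part)

print(sorted(meronyms["tree"]))    # parts of a tree
print(sorted(holonyms["finger"]))  # wholes a finger belongs to
```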
Part Two
WordNet: A Large Lexical Database of English
WordNet
• WordNet® is a large lexical database of English, developed under the direction of George A. Miller.
• Development of WordNet began in 1985, and it is now widely used in text-processing tools.
• WordNet is more than just a dictionary and thesaurus; it includes all kinds of relationships between words. WordNet version 2.0 contains roughly 150,000 content words.
WordNet cont.
• Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
• WordNet is also freely and publicly available for download.
• WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
Understanding Text – Polysemy
Number of Senses in WordNet
• A word can have more than one meaning, and the intended meaning is not always obvious from the sentence.
• In WordNet a word has an average of 1.4 senses.
Average Number of Senses per Word Class
Word Class   Average Number of Senses
Verb         2.1
Adjective    1.45
Adverb       1.25
Noun         1.24
Understanding Text – Polysemy
Number of Senses in WordNet
Words with the Highest Number of Senses from WordNet
Word Number of Senses
Break 74
Cut 73
Run 57
Play 52
Make 51
Understanding Text – Polysemy
Number of POS in WordNet
• Some words also have more than one part of speech (POS). For example, "still" has five different parts of speech.
Word    Number of POS
Out     5
Round   5
Still   5
Down    5
Over    4
Word Classifications in WordNet
• Words can be classified into word classes or POS.
• We refer to nouns, verbs, adjectives, and adverbs as content words.
• Conjunctions, determiners, pronouns, and prepositions are called function words.
Frequencies of Word Classes from WordNet
Type          Number
Noun          114,400 (75%)
Adjective     21,438 (14%)
Verb          11,341 (7.4%)
Adverb        4,662 (3%)
Preposition   133 (0.08%)
Pronoun       118 (0.07%)
Conjunction   89 (0.05%)
Determiner    14 (0.009%)
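The percentages in the table can be checked with a few lines of arithmetic. This sketch assumes each percentage is the class count divided by the sum of all eight counts; the slides do not state the denominator, so that is an assumption, and the variable names are illustrative.

```python
# Counts copied from the table above (WordNet word classes).
counts = {
    "Noun": 114400, "Adjective": 21438, "Verb": 11341, "Adverb": 4662,
    "Preposition": 133, "Pronoun": 118, "Conjunction": 89, "Determiner": 14,
}
total = sum(counts.values())
shares = {name: 100.0 * n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<12} {counts[name]:>7,}  {share:.3g}%")

# Content words are nouns, adjectives, verbs and adverbs.
content = sum(counts[c] for c in ("Noun", "Adjective", "Verb", "Adverb"))
print(f"content words: {100.0 * content / total:.2f}% of all entries")
```

Under this assumption the computed shares agree with the rounded figures in the table, and content words make up nearly all entries.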
WordNet Website and Developed Program
• WordNet Website
• WordNet Developed Program
Part Three
Reuters-21578
as a Text Collection
Reuters-21578 History
• The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987.
• Reuters-21578 is a test collection for evaluating automatic text categorization techniques. It is a classic benchmark for text categorization algorithms.
• The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files contains 1,000 documents, while the last contains 578 documents.
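The file layout explains the collection's name: 21 × 1,000 + 578 = 21,578 documents. The sketch below checks that sum and shows one way to split a file into documents. It assumes each document is wrapped in a `<REUTERS ...> ... </REUTERS>` element, which is how the distribution is usually described but is not stated in these slides. The two-document sample is invented for illustration, and `split_documents` is a hypothetical helper.

```python
import re

# Layout described above: 21 files of 1,000 documents plus one of 578.
docs_per_file = [1000] * 21 + [578]
assert sum(docs_per_file) == 21578

# Assumption: each document is wrapped in <REUTERS ...> ... </REUTERS>.
# The sample below is invented for illustration, not real newswire text.
sample = """
<REUTERS NEWID="1"><BODY>Cocoa prices rose.</BODY></REUTERS>
<REUTERS NEWID="2"><BODY>Coffee exports fell.</BODY></REUTERS>
"""

def split_documents(sgml_text):
    """Return the raw text of every <REUTERS> element in one file."""
    return re.findall(r"<REUTERS\b.*?</REUTERS>", sgml_text, flags=re.DOTALL)

docs = split_documents(sample)
print(len(docs))
```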
Reuters-21578
• Distribution 1.0 was released on 26 September 1997 by David D. Lewis, AT&T Labs - Research.
• The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.
Part Four
CMU Text Learning Group Data Archives
as a Text Collection
CMU Text Learning Group Data Archives
• This data set is a collection of 20,000 messages, collected from 20 different netnews newsgroups. One thousand messages from each of the twenty newsgroups were chosen at random and partitioned by newsgroup name.
• Link
• Sample Message
• Experiment Results
• Prof. Cho, Sam Houston State University
CMU Text Learning Group Data Archives
1. alt.atheism
2. talk.politics.guns
3. talk.politics.mideast
4. talk.politics.misc
5. talk.religion.misc
6. soc.religion.christian
7. comp.sys.ibm.pc.hardware
8. comp.graphics
9. comp.os.ms-windows.misc
10. comp.sys.mac.hardware
11. comp.windows.x
12. rec.autos
13. rec.motorcycles
14. rec.sport.baseball
15. rec.sport.hockey
16. sci.crypt
17. sci.electronics
18. sci.space
19. sci.med
20. misc.forsale
Part Five
Text Mine Software: Web-Based Algorithms
Text Mine Application
• The three scripts in the first row handle:
1. The creation of text statistics
   • Number of word types
   • Letter frequencies
   • Word frequencies
2. Entity extraction
3. Finding the POS tags for words
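The three text statistics listed above can be approximated in a few lines. This is a minimal sketch, not the Text Mine script: `text_statistics` is a hypothetical function, and it assumes a word is a run of the letters a to z after lower-casing.

```python
from collections import Counter
import re

def text_statistics(text):
    """Return (number of word types, letter frequencies, word frequencies)."""
    lowered = text.lower()
    words = re.findall(r"[a-z]+", lowered)
    word_freq = Counter(words)
    letter_freq = Counter(ch for ch in lowered if ch.isalpha())
    return len(word_freq), letter_freq, word_freq

n_types, letters, words = text_statistics("The cat sat on the mat. The end.")
print(n_types)
print(words.most_common(1))
print(letters.most_common(1))
```

A word type is a distinct word form, so the three occurrences of "the" count once towards the number of types but three times in the word frequencies.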
Text Mine Application
• As input, use a text file such as the Help file, or type text into the text box.
Part Six
Text Mine Software: Command-Based Algorithms
Zeroth Program: Tokens
• Name of program: tokens.pl
• Input: sample.
• Output: after running, the program generates a text file named tokens.txt
• Aim: generating tokens
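To make "generating tokens" concrete, here is a small regular-expression tokenizer. It is a sketch in Python, not a port of tokens.pl. The pattern `TOKEN_RE` and the function `tokenize` are hypothetical, and the rules (numbers with internal separators, words with internal apostrophes or hyphens, single punctuation marks) are assumptions chosen for illustration.

```python
import re

# Order matters: numbers first, then words, then any single punctuation mark.
TOKEN_RE = re.compile(
    r"\d+(?:[.,]\d+)*"      # numbers such as 2.0 or 150,000
    r"|\w+(?:['-]\w+)*"     # words, allowing internal apostrophes and hyphens
    r"|[^\w\s]"             # any other non-space character on its own
)

def tokenize(text):
    """Return the list of tokens found in `text`, in order."""
    return TOKEN_RE.findall(text)

for token in tokenize("Don't panic: WordNet 2.0 has 150,000 words."):
    print(token)
```

Printing one token per line mirrors the kind of listing a file such as tokens.txt might hold, although the actual format of that file is not shown in the slides.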
First Program: Part of Speech Tagger
• Name of program: pos-test.pl
• Input: contained in the Perl file
• Output: after running, the program generates a text file named pos_test_results.txt
• Aim: part of speech tagging
Second Program: Entity Extraction
• To generate named entities with associated types, we need dictionaries for categories such as
– person, place, organization, number, currency, dimension, time, technical term, or miscellaneous.
– For example, co_abbrev.dat contains a list of about 900 abbreviations, and the co_places table lists about 3,000 of the world's larger cities.
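Dictionary-based entity extraction amounts to looking up spans of text in per-category word lists, with simple patterns for categories such as currency. The sketch below illustrates the idea only. `GAZETTEER`, `CURRENCY_RE` and `extract_entities` are hypothetical names, and the tiny dictionaries stand in for the much larger tables mentioned above, whose contents are not reproduced here.

```python
import re

# Hypothetical miniature dictionaries; the real tables are far larger.
GAZETTEER = {
    "place": {"ulm", "new york", "london"},
    "organization": {"reuters", "at&t labs"},
}
CURRENCY_RE = re.compile(r"\$\d+(?:\.\d+)?")

def extract_entities(text):
    """Return a sorted list of (entity, type) pairs found in `text`."""
    found = set()
    lowered = text.lower()
    for etype, names in GAZETTEER.items():
        for name in names:
            # Match whole words only, so "ulm" is not found inside "Ulmer".
            if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", lowered):
                found.add((name, etype))
    for match in CURRENCY_RE.findall(text):
        found.add((match, "currency"))
    return sorted(found)

print(extract_entities("Reuters reported from London that cocoa cost $1200."))
```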
Second Program: Entity Extraction
• Name of program: test-ent.pl
• Input: contained in the Perl file
• Output: after running, the program generates a text file named test_ent_results.txt
• Aim: entity extraction
Third Program: Disambiguate words with multiple
• Name of program: sense.pl
• Input: contained in the Perl file
• Output: after running, the program generates a text file named sense.txt
Fourth Program: Random Text Generator
• Name of program: tgen.pl
• Input: contained in the Perl file
• Output: after running, the program generates a text file named tgen.txt
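The slides do not say how tgen.pl generates text. One common approach, shown here purely as an assumption, is a bigram (first-order Markov) model that repeatedly picks a random successor of the last word. The functions `build_bigram_model` and `generate`, the `seed` parameter and the toy corpus are all hypothetical.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words that follow it in `text`."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, length, seed=0):
    """Random walk over the bigram table, stopping early at a dead end."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    output = [start]
    while len(output) < length and followers.get(output[-1]):
        output.append(rng.choice(followers[output[-1]]))
    return " ".join(output)

corpus = "the cat sat on the mat and the dog sat on the cat"
model = build_bigram_model(corpus)
print(generate(model, "the", 8))
```

Every adjacent word pair in the output also occurs in the corpus, which is why such text looks locally plausible while drifting in meaning.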
Fifth Program: Splitting of Text into Sentences
• Name of program: tsplit.pl
• Input: contained in the Perl file
• Output: after running, the program generates a text file named tsplit.txt
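Sentence splitting is harder than breaking at every period, because abbreviations such as "Prof." also end in one. The sketch below shows a simple rule-based splitter. It is not the method used by tsplit.pl, which the slides do not describe; `ABBREVIATIONS` and `split_sentences` are hypothetical, and the abbreviation list is deliberately tiny.

```python
import re

# Hypothetical abbreviation list; a real splitter would use a much larger one.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "inc", "ltd", "jan"}

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter,
    unless the word before a period is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        prev_words = text[start:match.start()].split()
        last = prev_words[-1].lower() if prev_words else ""
        if text[match.start()] == "." and last in ABBREVIATIONS:
            continue  # e.g. "Prof. Cho" is not a sentence boundary
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

for sentence in split_sentences("Prof. Cho ran the test. It worked! Did it scale? Yes."):
    print(sentence)
```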
Sixth Program: Clustering
• Name of program: cluster.pl
• Input data: a collection of 55 Reuters documents from three topics
– Cocoa, 15 documents
– Sugar, 22 documents
– Coffee, 18 documents
The input file is included in cluster.pl.
• Input parameters: a similarity threshold, a linking parameter, and an indexing parameter.
• Output: a list of clusters and a similarity matrix (Cluster.txt)
• Method: this program is based on a genetic algorithm.
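To show how a similarity threshold turns a similarity matrix into clusters, the sketch below uses plain single-link grouping over cosine similarities. This is a simpler technique swapped in for illustration and is not the genetic algorithm that cluster.pl is based on. The linking and indexing parameters are not modelled, the toy documents are invented, and `cosine` and `cluster` are hypothetical names.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold):
    """Return (clusters, similarity matrix). Documents share a cluster when
    a chain of pairwise similarities >= threshold connects them."""
    vectors = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), sim

# Invented toy documents standing in for the Reuters articles.
docs = [
    "cocoa prices rise in ghana",
    "ghana cocoa harvest prices",
    "sugar output falls in brazil",
    "brazil sugar exports output",
    "coffee futures climb",
]
clusters, sim = cluster(docs, threshold=0.4)
print(clusters)
```

Lowering the threshold merges clusters (the cocoa and sugar groups share the word "in"), while raising it splits them, which is the trade-off a similarity threshold controls.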
Part Seven
Useful Web Sites
Talk to Ditto
• http://www.convo.co.uk/x02/?
How does it work?
• Bayesian Classification is used to teach Ditto the donkey the basics of the English language
• When Ditto receives a message, he evaluates it for niceness or nastiness, then responds emotionally on a scale of –100 to +100
• Ditto was trained using 5525 examples
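A naive Bayes classifier is one way to score a message for niceness or nastiness. The sketch below is an assumption about how such scoring could work, not Ditto's actual implementation. The four training sentences are invented, the names `TRAINING`, `log_posterior` and `mood` are hypothetical, and mapping P(nice) linearly onto the –100 to +100 scale is a guess at the response scale.

```python
import math
from collections import Counter

# Invented training examples; Ditto's real training set is not reproduced here.
TRAINING = [
    ("you are a lovely donkey", "nice"),
    ("what a kind and lovely friend", "nice"),
    ("you are a stupid donkey", "nasty"),
    ("what a horrible stupid animal", "nasty"),
]

word_counts = {"nice": Counter(), "nasty": Counter()}
class_counts = Counter()
for text, label in TRAINING:
    word_counts[label].update(text.split())
    class_counts[label] += 1
vocabulary = set(word_counts["nice"]) | set(word_counts["nasty"])

def log_posterior(message, label):
    """Unnormalised log P(label | message) with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in message.lower().split():
        if word in vocabulary:  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
    return score

def mood(message):
    """Map P(nice | message) onto a scale from -100 (nasty) to +100 (nice)."""
    nice = log_posterior(message, "nice")
    nasty = log_posterior(message, "nasty")
    p_nice = 1.0 / (1.0 + math.exp(nasty - nice))
    return round(200 * p_nice - 100)

print(mood("you are lovely"), mood("you are stupid"), mood("hello"))
```

A message made only of unseen words falls back on the class priors, which are equal here, so it scores as neutral.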
Dragon Toolkit
• Dragon Toolkit
Disp
• http://www.ltg.ed.ac.uk/disp/resources/