Copyright © 2014 KNIME.com AG Boston KNIME Users Text Processing Applications Kilian Thiel KNIME

Text Processing with KNIME


Boston KNIME Users Text Processing Applications

Kilian Thiel

KNIME


Agenda

• KNIME Crash Course

• Text Mining with KNIME: Mining Tripadvisor Data

• Text Mining with KNIME: Mining Amazon Reviews (Anil Tarachandani)

• Networking Apero


Text Mining with KNIME: Mining Tripadvisor Data

Agenda

• The KNIME Textprocessing Extension

– Preliminaries

– Philosophy & Usage

• Classification of Tripadvisor Reviews

– Tripadvisor data

– Classification of reviews


Resources

http://tech.knime.org/knime-text-processing

• Documentation

• Examples

• Forum

• White Papers


Installation

[Figure: installation screenshots, steps 1.) and 2.)]


Requirements

Requirements to import and run demo workflows

• KNIME 2.10

• Textprocessing (labs)

• Distance Matrix (KNIME)

• Palladian (Community)


Tips

• Settings (knime.ini)

– Set maximum memory for KNIME

– -Xmx3G
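
The memory setting can be scripted. The following is a minimal sketch, assuming knime.ini follows the usual Eclipse launcher layout in which JVM options are listed one per line after a `-vmargs` line. The helper name `set_max_heap` is hypothetical and not part of KNIME.

```python
def set_max_heap(ini_text: str, size: str = "3G") -> str:
    """Return ini_text with the JVM maximum heap option set to -Xmx<size>."""
    lines = ini_text.splitlines()
    new_option = f"-Xmx{size}"
    for i, line in enumerate(lines):
        if line.strip().startswith("-Xmx"):
            lines[i] = new_option  # replace the existing setting
            break
    else:
        # No -Xmx line yet: append it, adding -vmargs first if it is missing
        # (assumption: JVM options belong after -vmargs).
        if "-vmargs" not in (l.strip() for l in lines):
            lines.append("-vmargs")
        lines.append(new_option)
    return "\n".join(lines) + "\n"


# Illustrative file content, not a real knime.ini.
example = "-startup\nplugins/launcher.jar\n-vmargs\n-Xmx1024m\n"
updated = set_max_heap(example)
```

In practice the text would be read from and written back to the knime.ini file in the KNIME directory, with KNIME closed while the file is edited.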


Demo

Prepare KNIME

• Go to KNIME directory

• Change knime.ini file (optional)

– -Xmx3G

• Start KNIME

• Install Textprocessing Extension

– (or better have it already installed)


Philosophy

[Figure: raw text ("… perhaps your name is Rumpelstiltskin? …") is enriched with tags ("Rumpelstiltskin[Person]"), converted into a matrix of 0/1 values, and passed on to visualization, clustering, and classification.]


Additional Data Types

• Document Cell

– Encapsulates a document

  • Title, sentences, terms, words

  • Authors, category, source

  • Generic meta data (key, value pairs)

• Term Cell

– Encapsulates a term

  • Words, tags
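
To make the two cell types concrete, here is a rough Python sketch of what each one encapsulates. The class and field names are illustrative assumptions that mirror the bullets above; they are not the actual KNIME classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Term:
    """A term: one or more words plus the tags attached to it."""
    words: Tuple[str, ...]
    tags: Tuple[str, ...] = ()


@dataclass
class Document:
    """A document: title, sentences of terms, and descriptive fields."""
    title: str
    sentences: List[List[Term]]
    authors: List[str] = field(default_factory=list)
    category: str = ""
    source: str = ""
    meta: Dict[str, str] = field(default_factory=dict)  # generic key/value pairs

    def terms(self) -> List[Term]:
        """All terms of the document, in reading order."""
        return [term for sentence in self.sentences for term in sentence]


# Made-up example: a name tagged as a person.
name = Term(words=("Rumpelstiltskin",), tags=("Person",))
doc = Document(
    title="Example",
    sentences=[[Term(("perhaps",)), Term(("your",)), Term(("name",)),
                Term(("is",)), name]],
)
```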


Data Table Structures

• Document table

– List of documents

• Bag of words

– Tuples of documents and terms

• Document vectors

– Numerical representations of documents
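
The three structures can be illustrated on a tiny made-up corpus. This is a sketch in plain Python, assuming whitespace tokenization and binary (0/1) vector entries; KNIME tables hold document and term cells rather than strings.

```python
# Document table: a list of documents (plain strings here).
documents = ["great pasta great wine", "spicy noodles"]

# Bag of words: one (document index, term) tuple per distinct pair.
bag_of_words = [
    (doc_id, term)
    for doc_id, text in enumerate(documents)
    for term in sorted(set(text.split()))
]

# Document vectors: one row per document, one column per term.
vocabulary = sorted({term for _, term in bag_of_words})
pairs = set(bag_of_words)
vectors = [
    [1 if (doc_id, term) in pairs else 0 for term in vocabulary]
    for doc_id in range(len(documents))
]
```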


Philosophy and Data Table Structures

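[Figure: the processing steps and the table structure each one works on. Enrichment and preprocessing operate on documents; the later steps produce the bag of words (BoW) and the document vectors.]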


Tripadvisor Data

[Figure: a Tripadvisor review with its fields marked: title, author, rating, full text]


Tripadvisor Data

Reviews of Italian and Chinese restaurants in Boston

• Chinese: 272

• Italian: 268


Tripadvisor Data

Goal:

• Build a classifier that distinguishes between Chinese and Italian restaurants based on their reviews.

Is a review about an Italian or a Chinese restaurant?


1.) Reading

Read/Parse textual data


Demo

Reading

• Read Tripadvisor data (.table file)

• Filter rows with missing restaurant value

• Convert strings to documents

• Filter all but the document column
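
The reading steps can be sketched outside KNIME as follows. The .table file is a KNIME format, so this sketch starts from in-memory rows instead. The sample rows are made up, and the column names (`title`, `author`, `rating`, `fulltext`, `restaurant`) are assumptions based on the review fields and the restaurant value mentioned in the slides.

```python
# Made-up rows standing in for the Tripadvisor table.
rows = [
    {"title": "Lovely dinner", "author": "anna", "rating": 5,
     "fulltext": "Great pasta.", "restaurant": "italian"},
    {"title": "No label", "author": "bob", "rating": 3,
     "fulltext": "It was fine.", "restaurant": None},
    {"title": "Quick lunch", "author": "cy", "rating": 4,
     "fulltext": "Tasty noodles.", "restaurant": "chinese"},
]


def to_documents(rows):
    """Drop rows without a restaurant value and keep only a document per row."""
    documents = []
    for row in rows:
        if row.get("restaurant") is None:
            continue  # filter rows with a missing restaurant value
        documents.append({
            "title": row["title"],
            "text": row["fulltext"],
            "authors": [row["author"]],
            "category": row["restaurant"],  # assumption: used as the class later
        })
    return documents


docs = to_documents(rows)
```

Only the document objects are returned, which corresponds to filtering all columns but the document column.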


2.) Enrichment

Enrich documents with semantic information


Demo

Enrichment / Tagging

• Apply POS Tagger node

• Use Bag of Words node to inspect tagging result
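
As a rough illustration of what tagging adds to a document, the sketch below attaches a part-of-speech tag to every word by dictionary lookup. This is a toy stand-in, not the method of the POS Tagger node: the lexicon is hand-made, and the Penn Treebank style tag names (DT, NN, VBD, JJ) are an assumption.

```python
# Hand-made toy lexicon (assumption); a real tagger uses a trained model.
LEXICON = {"the": "DT", "pasta": "NN", "was": "VBD", "great": "JJ"}


def tag(text, lexicon=LEXICON, default="UNK"):
    """Return (word, tag) pairs; unknown words get the default tag."""
    return [(word, lexicon.get(word.lower(), default)) for word in text.split()]


tagged = tag("The pasta was great")

# Inspecting the result as distinct (term, tag) pairs, similar in spirit
# to looking at the bag of words after tagging.
distinct_terms = sorted(set(tagged))
```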


3.) Preprocessing

Preprocess documents and filter words


Demo

Preprocessing

• Filter

– Numbers

– Punctuation marks

– Stop Words

• Convert to lower case

• Stemming

• Keep only nouns, verbs, adjectives
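
The preprocessing steps can be sketched as one pass over tagged terms. Everything here is a simplification under stated assumptions: the stop word list is a small made-up set, the stemmer is a naive suffix stripper rather than a real stemming algorithm, and tags are assumed to be Penn Treebank style, where nouns, verbs, and adjectives start with NN, VB, and JJ.

```python
import re

STOP_WORDS = {"the", "was", "were", "and", "a", "we"}  # tiny illustrative list
KEEP_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives (assumed tag set)


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Filter and normalize a list of (word, tag) pairs."""
    result = []
    for word, tag in tagged_terms:
        if re.fullmatch(r"\d+(?:[.,]\d+)*", word):
            continue  # numbers
        if re.fullmatch(r"[^\w\s]+", word):
            continue  # punctuation marks
        lower = word.lower()  # lower-cased first so stop words match any casing
        if lower in STOP_WORDS:
            continue  # stop words
        if not tag.startswith(KEEP_TAG_PREFIXES):
            continue  # keep only nouns, verbs, adjectives
        result.append(naive_stem(lower))
    return result


tagged = [("We", "PRP"), ("ordered", "VBD"), ("2", "CD"), ("pizzas", "NNS"),
          (",", ","), ("the", "DT"), ("waiters", "NNS"), ("were", "VBD"),
          ("amazing", "JJ"), ("!", ".")]
terms = preprocess(tagged)
```

The tag filter runs before stemming in this sketch so that it sees the original tags; the order of the individual steps in a workflow may differ.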


4.) Transformation

Creation of numerical representation of documents


Demo

Transformation

• Transform to bag of words

• Compute TF value for terms

• Transform to document vectors

• Extract category (class) value
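
A compact sketch of the transformation steps on made-up, already preprocessed documents. It assumes relative term frequency (count of a term divided by the number of terms in the document); an absolute count would work the same way.

```python
from collections import Counter

# Made-up preprocessed documents with their category.
docs = [
    {"terms": ["pasta", "wine", "pasta"], "category": "italian"},
    {"terms": ["noodle", "tea"], "category": "chinese"},
]

# Bag of words with a TF value: (document index, term, relative frequency).
bow = []
for doc_id, doc in enumerate(docs):
    counts = Counter(doc["terms"])
    total = sum(counts.values())
    for term, count in counts.items():
        bow.append((doc_id, term, count / total))

# Document vectors: one column per term, TF as the cell value.
vocabulary = sorted({term for _, term, _ in bow})
column = {term: i for i, term in enumerate(vocabulary)}
vectors = [[0.0] * len(vocabulary) for _ in docs]
for doc_id, term, tf in bow:
    vectors[doc_id][column[term]] = tf

# Class value for each document, taken from its category.
classes = [doc["category"] for doc in docs]
```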


5.) Classification

Training of a model (decision tree) and scoring


Demo

Classification

• Append color based on class

• Partition data into training and test set

• Train decision tree model on training data

• Apply decision tree model on test data

• Score model, measure accuracy
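
The partition, train, apply, and score steps can be sketched in plain Python. The tree below is a small Gini-based learner that splits on whether a term is present in a document; it is a simplified stand-in, not the algorithm of the KNIME decision tree nodes. The vectors and classes are made-up toy data, and the coloring step is omitted because it only affects views.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def train_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree of (feature, present_branch, absent_branch) tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf: a class label
    best = None
    for f in range(len(rows[0])):
        present = [i for i, r in enumerate(rows) if r[f] > 0]
        absent = [i for i, r in enumerate(rows) if r[f] <= 0]
        if not present or not absent:
            continue  # this feature does not split the data
        score = sum(len(part) * gini([labels[i] for i in part])
                    for part in (present, absent))
        if best is None or score < best[0]:
            best = (score, f, present, absent)
    if best is None:
        return majority
    _, f, present, absent = best
    return (f,
            train_tree([rows[i] for i in present], [labels[i] for i in present],
                       depth + 1, max_depth),
            train_tree([rows[i] for i in absent], [labels[i] for i in absent],
                       depth + 1, max_depth))


def predict(tree, row):
    """Follow the branches until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        f, present, absent = tree
        tree = present if row[f] > 0 else absent
    return tree


# Toy document vectors; columns assumed to be: noodle, pasta, rice, wine.
vectors = [[0, 1, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 0],
           [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
classes = ["italian", "italian", "chinese", "chinese",
           "italian", "chinese", "italian", "chinese"]

# Partition into training (75%) and test (25%) data.
order = list(range(len(vectors)))
random.Random(0).shuffle(order)
split = int(0.75 * len(order))
train_ids, test_ids = order[:split], order[split:]

tree = train_tree([vectors[i] for i in train_ids], [classes[i] for i in train_ids])
predictions = [predict(tree, vectors[i]) for i in test_ids]
accuracy = sum(p == classes[i] for p, i in zip(predictions, test_ids)) / len(test_ids)
```

On this toy data a single term separates the classes perfectly, so the accuracy says nothing about real reviews; it only shows how the score is computed.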


Additional Workflows

• Multi Word Tagging

– Detection of frequent Ngrams

– Creation of dictionary from Ngrams

– Applying Dictionary Tagger

• Classification with Multi Words

• Clustering of documents
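
The multi word tagging idea can be sketched in three small steps: count n-grams, keep the frequent ones as a dictionary, and tag occurrences of dictionary entries. This is an illustration only; the function names, the `MULTI_WORD` tag, and the frequency threshold are assumptions, and the token lists are made up.

```python
from collections import Counter


def frequent_ngrams(token_lists, n=2, min_count=2):
    """Return the set of n-grams that occur at least min_count times."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram for gram, count in counts.items() if count >= min_count}


def tag_multi_words(tokens, dictionary, n=2, tag="MULTI_WORD"):
    """Merge dictionary n-grams into single tagged terms, left to right."""
    tagged = []
    i = 0
    while i < len(tokens):
        gram = tuple(tokens[i:i + n])
        if len(gram) == n and gram in dictionary:
            tagged.append((" ".join(gram), tag))
            i += n
        else:
            tagged.append((tokens[i], None))
            i += 1
    return tagged


reviews = [["best", "fried", "rice", "ever"], ["fried", "rice", "was", "cold"]]
dictionary = frequent_ngrams(reviews)
tagged = tag_multi_words(reviews[0], dictionary)
```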


Thank You


Questions

• http://tech.knime.org/forum

[email protected]

Follow us

• Twitter: @KNIME

• LinkedIn: https://www.linkedin.com/groups?gid=2212172

• KNIME Blog: http://www.knime.org/blog