Tokeniser

Francisco Miguel Pérez Romero

University of Sevilla

Roadmap

Introduction

Class Diagram

Libraries

Conclusions

Web Wrapping

[Diagram labels: Information retrieval, Verifier, Ontologiser, Extractor, Query, Navigator, Form Filler]

Tokeniser

• Tokenisation Rules
• Configuration File
• Web Page
• Parser

Tokeniser Usage

• Web Page Classification
• Information Extraction Learners
• Information Extraction

Example

[Diagram labels: Config File, Token List, Web Page, Tokeniser, XML File Token List]

Concepts

• Configuration File
• Token
• Tokenisation types

Example

3 Token Classes: Word Space Digit Space Digit
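The example above can be read as a tokeniser that assigns each piece of text to one of three token classes. The following is a minimal sketch of that idea using java.util.regex. The class names (SimpleTokeniser, Token), the three regular expressions and the rule order are assumptions made for illustration; the slides do not show the actual rules or the configuration file format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal rule-based tokeniser sketch; all names here are hypothetical. */
class SimpleTokeniser {

    /** A token: the class assigned by a rule plus the matched text. */
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    // Tokenisation rules in priority order: token class -> regular expression.
    // Assumption: a real tokeniser would load these from its configuration file.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    SimpleTokeniser() {
        rules.put("Word", Pattern.compile("\\p{L}+"));
        rules.put("Space", Pattern.compile("\\s+"));
        rules.put("Digit", Pattern.compile("\\d+"));
    }

    /** Splits the input into tokens; characters matched by no rule are skipped. */
    List<Token> tokenise(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt()) { // the rule must match exactly at pos
                    tokens.add(new Token(rule.getKey(), m.group()));
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                pos++; // no rule applies here: skip one character
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Word(Room), Space( ), Digit(12), Space( ), Digit(34)]
        System.out.println(new SimpleTokeniser().tokenise("Room 12 34"));
    }
}
```

With these assumed rules, the input "Room 12 34" yields the sequence Word, Space, Digit, Space, Digit.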

Class Diagram: Tokenisation

Tokenisation Example

Class Diagram: Tokeniser

Comparison Features 1

Comparison features:
• Javadoc documentation?
• Support for Unicode UTF-8
• Support for Unicode UTF-16
• Named groups
• Indexable groups > 9
• Negative groups
• Nested groups
• Lazy quantifiers?

Comparison Features 2

Comparison features:
• Fuzzy matching?
• POSIX support?
• Ignore-case support?
• New-line option support?
• Uses a state machine?
• Accent support?
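To make some of the criteria above concrete, here is a small sketch of how named groups, lazy quantifiers and the ignore-case option look in java.util.regex, one of the libraries compared. The class and method names are hypothetical, and named groups assume Java 7 or later. Other libraries in the comparison may use different syntax or lack these features.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrates a few of the compared features with java.util.regex (hypothetical helper class). */
class RegexFeatureDemo {

    /** Named group: returns the year captured from a "yyyy-mm" fragment, or null. */
    static String namedGroup(String text) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(text);
        return m.find() ? m.group("year") : null;
    }

    /** Returns the first match of the pattern, or null; used to contrast greedy and lazy quantifiers. */
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    /** Ignore-case option combined with Unicode-aware case folding, so accented letters also match. */
    static boolean ignoreCase(String regex, String text) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(namedGroup("released 2008-05"));             // 2008
        System.out.println(firstMatch("<.+>", "<b>bold</b>"));          // greedy: <b>bold</b>
        System.out.println(firstMatch("<.+?>", "<b>bold</b>"));         // lazy: <b>
        System.out.println(ignoreCase("p\u00e9rez", "P\u00c9REZ"));     // true
    }
}
```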

Libraries

• Table 1

Libraries

• Table 2

Libraries

• Table 3

Benchmark 1

• A list of regular expressions
• A list of strings
• Every regular expression is matched against every string
• Time measured in ms
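A benchmark of this shape could be sketched as follows for java.util.regex. The patterns, the input strings and the decision to compile the patterns once outside the timed loop are assumptions; the slides do not list the actual expressions or say how compilation time was counted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of an all-against-all matching benchmark (hypothetical class and data). */
class RegexBenchmark {

    /**
     * Matches every pattern against every string, repeated the given number of
     * times, and returns how many full matches succeeded.
     */
    static long matchAll(List<String> regexes, List<String> inputs, int iterations) {
        // Assumption: patterns are compiled once, before the repeated matching.
        List<Pattern> compiled = new ArrayList<>();
        for (String regex : regexes) {
            compiled.add(Pattern.compile(regex));
        }
        long matches = 0;
        for (int i = 0; i < iterations; i++) {
            for (Pattern p : compiled) {
                for (String s : inputs) {
                    if (p.matcher(s).matches()) {
                        matches++;
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> regexes = Arrays.asList("\\d+", "[a-z]+", "\\w+@\\w+\\.\\w+");
        List<String> inputs = Arrays.asList("12345", "tokeniser", "user@example.org", "no match!");

        long start = System.nanoTime();
        long matches = matchAll(regexes, inputs, 10000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(matches + " matches in " + elapsedMs + " ms");
    }
}
```

Returning the match count gives the loop an observable result and allows a quick sanity check that each library under test accepts the same strings.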

Benchmark 1: 10000 Iterations

• org.apache: 7078 ms
• com.stevesoft: 19782 ms
• kmy.regex: 781 ms
• java.util: 1266 ms
• jregex.Pattern: 1000 ms
• org.apache.oro: 2156 ms
• dk.brics.automaton: 265 ms
• com.karneim.util.collection: 407 ms

Benchmark 1: 20000 Iterations

• org.apache: 11796 ms
• com.stevesoft: 26641 ms
• kmy.regex: 906 ms
• java.util: 1891 ms
• jregex.Pattern: 1422 ms
• org.apache.oro: 3375 ms
• dk.brics.automaton: 312 ms
• com.karneim.util.collection: 610 ms

Benchmark 1: 50000 Iterations

• org.apache: 28656 ms
• com.stevesoft: 63297 ms
• kmy.regex: 1781 ms
• java.util: 4281 ms
• jregex.Pattern: 3219 ms
• org.apache.oro: 7641 ms
• dk.brics.automaton: 531 ms
• com.karneim.util.collection: 1312 ms

Diagram

[Chart: Benchmark 1 results per library (org.apache, com.stevesoft, kmy.regex, java.util, jregex.Pattern, org.apache.oro, dk.brics, com.karneim); axis from 0 to 70000; series: 10000, 20000 and 50000 iterations]

Benchmark 2

• Source Code
• Matching tags
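As an illustration of matching tags in page source code, the sketch below extracts tag names with java.util.regex and times the run. The tag pattern is a deliberately simplified assumption (it ignores comments, scripts and a '>' inside attribute values), and the class name and sample page are hypothetical; the slides do not show the expression that was actually used. System.nanoTime is used here because very short runs can fall below the resolution of a millisecond clock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of tag matching over page source (hypothetical class and pattern). */
class TagMatcher {

    // Simplified pattern for opening, closing and self-closing tags.
    // Group 1 captures the tag name.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    /** Returns the names of all tags found in the source, in order of appearance. */
    static List<String> tagNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(source);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"x.html\">link</a><br/></body></html>";

        long start = System.nanoTime();
        List<String> names = tagNames(page);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        // prints [html, body, a, a, br, body, html] followed by the elapsed time
        System.out.println(names + " in " + elapsedMicros + " us");
    }
}
```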

Benchmark 2: Amazon

• org.apache: 218 ms
• com.stevesoft: 63 ms
• kmy.regex: 94 ms
• java.util: 0 ms
• jregex.Pattern: 93 ms
• org.apache.oro: 32 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 47 ms

Benchmark 2: Marca

• org.apache: 62 ms
• com.stevesoft: 47 ms
• kmy.regex: 93 ms
• java.util: 0 ms
• jregex.Pattern: 94 ms
• org.apache.oro: 16 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 62 ms

Benchmark 2: Ebay

• org.apache: 31 ms
• com.stevesoft: 125 ms
• kmy.regex: 266 ms
• java.util: 0 ms
• jregex.Pattern: 156 ms
• org.apache.oro: 47 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 172 ms

Diagram

[Chart: Benchmark 2 results per library (org.apache, com.stevesoft, kmy.regex, java.util, jregex.Pattern, org.apache.oro, dk.brics, com.karneim); axis from 0 to 300; series: Amazon, Marca and Ebay]

To sum up…

• dk.brics.automaton is the fastest
• dk.brics.automaton and com.karneim.util.collection fail with URLs
• Remaining options: kmy.regex or java.util

Conclusions

• Tokenisation test
• Searching information
• A real project
• Experience

Thanks!
