:: DIAsDEM :: Seminar: Web Mining WS 2003/2004 Ingo Kampe Heiko Scharff


:: DIAsDEM ::

Seminar: Web Mining

WS 2003/2004

Ingo Kampe

Heiko Scharff


Content

Introduction and data mining context

DIAsDEM - functioning

New extensions


Introduction

:: problems ::


Introduction

known: data in databases (DB2, Oracle, ...) is easy to analyse, for example with SQL, custom programs or data mining tools

but in enterprises: 80% of data in text documents (MS Word, plain text files, text archives, ...)

knowledge there, but „useless“


Introduction

example (same meaning, other structure):

Mr. Schröder earns EUR 20.000 per month.

Mister Schröder earns 20000,- €/month.

What does it mean?

How to compare? How to analyse?

Do both sentences mean the same?


Introduction

:: data mining context ::


Introduction

necessary to make knowledge analysable

desirable:
– semantically structured knowledge
– queryable knowledge

possible solution: XML
– semantic tagging
– analysable (XPath, XQuery, Tamino, ...)
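Once text is semantically tagged, standard XML tooling can query it. The following is only a minimal sketch: the `<archive>` and `<sentence>` wrappers, the `currency` attribute, the second sentence and the function name are assumptions made for illustration, and Python's `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged archive; wrapper elements, attributes and the
# second sentence are made up for illustration only.
TAGGED = """
<archive>
  <sentence>
    <person>Mr. Schröder</person>
    <capital amount="20000" currency="EUR">earns EUR 20.000</capital>
    <period>per month</period>
  </sentence>
  <sentence>
    <person>Mrs. Meier</person>
    <capital amount="3500" currency="EUR">earns EUR 3.500</capital>
    <period>per month</period>
  </sentence>
</archive>
"""

def persons_earning_at_least(xml_text, minimum):
    """Return the text of every <person> whose sentence has amount >= minimum."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for sentence in root.findall("./sentence"):
        person = sentence.find("person")
        capital = sentence.find("capital")
        if person is None or capital is None:
            continue
        if int(capital.get("amount")) >= minimum:
            result.append(person.text)
    return result

print(persons_earning_at_least(TAGGED, 10000))
```

The same question asked of the untagged sentences would need text parsing for every query; with tags it becomes a simple tree lookup.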


Introduction

for humans, both sentences mean the same:

Mr. Schröder earns EUR 20.000 per month.

=

Mister Schröder earns 20000,- €/month.

„useless“ for computational analysis in this form; the only useful information:

– Mister Schröder
– 20000 Euro
– month
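How the two spellings could be reduced to the same useful information can be sketched with a few regular expressions. This is a toy illustration, not how DIAsDEM works: the patterns and the function name `extract_facts` are assumptions and cover only the two example sentences.

```python
import re

# Assumed patterns that cover only the two example sentences.
PERSON = re.compile(r"\b(?:Mr\.|Mister)\s+(\w+)")
AMOUNT = re.compile(r"\d[\d.]*\d")
CURRENCY = re.compile(r"EUR|€")
PERIOD = re.compile(r"(?:per\s+|/)(month|year)")

def extract_facts(sentence):
    """Return (person, amount, currency, period), or None if a part is missing."""
    person = PERSON.search(sentence)
    amount = AMOUNT.search(sentence)
    currency = CURRENCY.search(sentence)
    period = PERIOD.search(sentence)
    if not (person and amount and currency and period):
        return None
    # German notation uses "." as thousands separator: "20.000" -> 20000
    value = int(amount.group(0).replace(".", ""))
    # Both "EUR" and "€" are normalised to "EUR".
    return ("Mister " + person.group(1), value, "EUR", period.group(1))

first = extract_facts("Mr. Schröder earns EUR 20.000 per month.")
second = extract_facts("Mister Schröder earns 20000,- €/month.")
print(first == second)
```

Hand-written patterns like these break as soon as the wording changes, which is why a trainable, domain-aware approach is needed.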


Introduction

need to
– „find“ important information
– mark important information

<person>Mr. Schröder</person>

<capital amount="20000 EUR">earns EUR 20.000</capital>

<period>per month</period>.
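The marking step alone can be sketched as follows, assuming the important text spans have already been found. The function `mark` and its annotation format are hypothetical; `escape` and `quoteattr` from `xml.sax.saxutils` keep the output valid XML even if the text contains characters such as `&` or `<`.

```python
from xml.sax.saxutils import escape, quoteattr

def mark(sentence, annotations):
    """Wrap already-found spans in XML tags.

    annotations: list of (surface_text, tag, attributes) in reading order.
    Raises ValueError if a surface text does not occur in the sentence.
    """
    out = []
    pos = 0
    for surface, tag, attrs in annotations:
        start = sentence.index(surface, pos)
        out.append(escape(sentence[pos:start]))
        attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
        out.append(f"<{tag}{attr_text}>{escape(surface)}</{tag}>")
        pos = start + len(surface)
    out.append(escape(sentence[pos:]))
    return "".join(out)

tagged = mark(
    "Mr. Schröder earns EUR 20.000 per month.",
    [
        ("Mr. Schröder", "person", {}),
        ("earns EUR 20.000", "capital", {"amount": "20000 EUR"}),
        ("per month", "period", {}),
    ],
)
print(tagged)
```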


DIAsDEM

:: DIAsDEM ::


DIAsDEM

DIAsDEM: Datenintegration von Altlastdaten und semistrukturierten Dokumenten mit Mining-Verfahren (integration of legacy data and semi-structured documents with data mining techniques)

project of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)

necessary: domain-specific knowledge (!!!)


DIAsDEM

:: functioning ::


DIAsDEM

2-phase-model

1. knowledge discovery
– iterative process (with expert knowledge)
– training phase with a training text archive
– finding of segments (clusters) and semi-automatic annotation
– derivation of an unstructured XML DTD

2. semantic tagging
– application of the found clusters to new archives
– „intelligent“ tagging of new, unknown texts of the same domain
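The split into a training phase and a tagging phase can be illustrated with a strongly simplified sketch. It is not the DIAsDEM algorithm: the thesaurus, the tag names and all function names are assumptions, and the expert-assigned tags stand in for the clustering and renaming work of phase 1.

```python
from collections import Counter

# Hypothetical thesaurus mapping surface words to descriptors.
THESAURUS = {
    "earns": "income", "salary": "income", "paid": "income",
    "founded": "foundation", "established": "foundation",
}

def descriptors(segment):
    """Count the thesaurus descriptors that occur in a text segment."""
    words = segment.lower().replace(".", " ").replace(",", " ").split()
    return Counter(THESAURUS[word] for word in words if word in THESAURUS)

def learn_profiles(training):
    """Phase 1 (simplified): training is a list of (segment, expert_tag)."""
    profiles = {}
    for segment, tag in training:
        profiles.setdefault(tag, Counter()).update(descriptors(segment))
    return profiles

def tag_segment(segment, profiles):
    """Phase 2: return the tag with the largest descriptor overlap, or None."""
    found = descriptors(segment)
    best_tag, best_score = None, 0
    for tag, profile in profiles.items():
        score = sum(min(found[key], profile[key]) for key in found)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

profiles = learn_profiles([
    ("Mr. Schröder earns EUR 20.000 per month.", "Income"),
    ("The company was founded in 1990.", "Foundation"),
])
print(tag_segment("Her salary is paid monthly.", profiles))
```

Segments without any known descriptor stay untagged, which mirrors the need for domain-specific knowledge: the quality of the thesaurus limits what phase 2 can recognise.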


DIAsDEM

Fig.: Winkler 2003b, page 6


DIAsDEM

expert knowledge is necessary to achieve „good“ semantic tagging

What is needed?

<person>Mr. Schröder</person>

or

<title>Mr.</title>

<name>Schröder</name>


DIAsDEM

steps in DIAsDEM:

1. finding segments (for example sentences) in training texts by using thesauri and knowledge of named entities (persons, ...)

2. building an unstructured XML DTD

3. clustering of similar text elements (cluster name = the descriptors dominating the cluster)

4. renaming of clusters by experts

5. annotation of training texts

6. building a final XML DTD (for querying, XML-based databases like Tamino, data mining tools, ...)
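Two of the steps above lend themselves to a small sketch: naming a cluster after its dominating descriptors and writing a flat DTD from the resulting tag names. The function names, the `_` separator and the mixed-content form of the DTD are assumptions; the slides do not show the DTD that DIAsDEM actually generates.

```python
from collections import Counter

def cluster_name(descriptor_lists, top=2):
    """Name a cluster after its most frequent descriptors (ties sorted by name)."""
    counts = Counter(d for descriptors in descriptor_lists for d in descriptors)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "_".join(name for name, _ in ranked[:top])

def flat_dtd(root, tags):
    """Build a flat DTD: the root may mix text and any tag, in any order."""
    names = sorted(set(tags))
    if not names:
        return f"<!ELEMENT {root} (#PCDATA)>"
    lines = [f"<!ELEMENT {root} (#PCDATA | {' | '.join(names)})*>"]
    lines += [f"<!ELEMENT {name} (#PCDATA)>" for name in names]
    return "\n".join(lines)

name = cluster_name([
    ["income", "person"],
    ["income", "period"],
    ["income", "person"],
])
print(name)
print(flat_dtd("document", [name, "foundation"]))
```

Automatically derived names of this kind are only a starting point; step 4 exists because experts usually replace them with more meaningful tag names.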


Extensions

:: new extensions ::


Extensions

main goal:

– searching the internet for documents that match a user specification
– downloading hypertext documents
– extracting plain text from hypertext documents
– importing plain text into the DIAsDEM collection


Extensions

:: querying Google ::


Extensions - Google

1. user enters the search words (panel)

2. querying Google through the Google API with these search words

3. result: list of URLs (currently only 10, limited by Google), automatically exported as a list into a text file
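The export step can be sketched independently of the search service. In the sketch below the `search` parameter is a hypothetical callable that takes a query string and returns URLs; it stands in for the Google API call, whose interface is not shown in these slides. The file name and the function name `export_urls` are assumptions as well.

```python
def export_urls(search, words, path, limit=10):
    """Query a search backend and write at most `limit` URLs, one per line.

    `search` is a hypothetical callable: query string -> iterable of URLs.
    """
    query = " ".join(words)
    urls = list(search(query))[:limit]
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
    return urls
```

Passing the search function as a parameter keeps the export logic testable without network access and makes it easy to swap the search backend later.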


Extensions

:: processing and import ::


Extensions - Processing and Import

1. reading the URL list (exported text file)

2. downloading hypertext files into a directory and renaming the files (enumeration)

3. detagging the files
- cleaning hypertext documents
- deleting comments and tags
- replacing special characters (not yet implemented)

4. importing files into the DIAsDEM collection
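The detagging and renaming steps could look roughly like the sketch below, built on Python's standard `html.parser`. This is an illustration, not the extension's actual code: the names `detag` and `numbered_name` are assumptions, skipping `script` and `style` content is an added design choice, and the conversion of character references such as `&ouml;` goes beyond what the slides report as implemented.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text content; tags and comments are dropped."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns references like &ouml; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def detag(html_text):
    """Return the plain text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def numbered_name(index, extension=".txt"):
    """File name for the enumerated download, e.g. 3 -> '0003.txt'."""
    return f"{index:04d}{extension}"

page = ("<html><head><style>p{}</style></head><body><!-- note -->"
        "<p>Schr&ouml;der &amp; Co.</p><script>x=1</script></body></html>")
print(numbered_name(1), detag(page))
```

Text parts are joined with a space so that adjacent block elements do not run together; a word interrupted by an inline tag would be split by the same rule.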


Questions?



Literature

Graubitz, H., Spiliopoulou, M. & Winkler, K. (2001). „The DIAsDEM Framework for Converting Domain-Specific Texts into XML Documents with Data Mining Techniques“. In Proceedings of the First IEEE International Conference on Data Mining, pages 171-178, San Jose, CA, USA, November/December 2001. IEEE Computer Society, Los Alamitos.

Winkler, K. & Spiliopoulou, M. (2003a). „Text Mining in der Wettbewerberanalyse: Konvertierung von Textarchiven in XML-Dokumente“. In Proceedings der 6. Konferenz der SAS Anwender in Forschung und Entwicklung, pages 347-363. Shaker Verlag, Aachen, Germany.

Winkler, K. (2003b). „Getting Started with DIAsDEM Workbench 2.1: A Case-Based Approach“. Technical Report, 121 pages. HHL - Leipzig Graduate School of Management.