:: DIAsDEM ::

Seminar: Web Mining, WS 2003/2004

Ingo Kampe, Heiko Scharff

:: DIAsDEM :: 2/24

Content

Introduction and data mining context

DIAsDEM - functioning

New extensions

:: DIAsDEM :: 3/24

Introduction

:: problems ::

:: DIAsDEM :: 4/24

Introduction

known: data in databases (DB2, Oracle, ...) is easy to analyse, for example with SQL, home-grown programs or data mining tools

but in enterprises: 80% of the data is in text documents (MS Word, plain text files, text archives, ...)

the knowledge is there, but „useless“

:: DIAsDEM :: 5/24

Introduction

example (same meaning, different structure):

Mr. Schröder earns EUR 20.000 per month.

Mister Schröder earns 20000,- €/month.

What does it mean?

How to compare? How to analyse?

Do both sentences mean the same?

:: DIAsDEM :: 6/24

Introduction

:: data mining context ::

:: DIAsDEM :: 7/24

Introduction

it is necessary to make knowledge analysable

desirable:
– semantically structured knowledge
– queryable knowledge

possible solution: XML
– semantic tagging
– analysable (XPath, XQuery, Tamino, ...)

:: DIAsDEM :: 8/24

Introduction

for humans:

Mr. Schröder earns EUR 20.000 per month.

=

Mister Schröder earns 20000,- €/month.

the wording itself is „useless“ for computational analysis; the only useful information is:
– Mister Schröder
– 20000 Euro
– month
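To make the point concrete, here is a minimal sketch of how the two example sentences could be reduced to the same useful information. The regular expression is a hypothetical pattern written only for this one sentence shape; it assumes the German convention that "." is a thousands separator and simply hard-codes the currency as EUR. It is not part of DIAsDEM.

```python
import re

# Hypothetical pattern covering only the two example sentences from the slides.
PATTERN = re.compile(
    r"(?:Mr\.|Mister)\s+(?P<name>\w+)\s+earns\s+"
    r"(?:EUR\s*)?(?P<amount>\d[\d.]*)(?:,-)?\s*(?:€|EUR)?\s*"
    r"(?:per\s+|/)(?P<period>\w+)"
)


def extract(sentence):
    """Return (name, amount, currency, period), or None if the pattern does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    # Assumption: "." inside the number is a thousands separator (20.000 = 20000).
    amount = int(match.group("amount").replace(".", ""))
    # Assumption: the currency is always EUR in this toy example.
    return (match.group("name"), amount, "EUR", match.group("period"))


first = extract("Mr. Schröder earns EUR 20.000 per month.")
second = extract("Mister Schröder earns 20000,- €/month.")
```

Both calls yield the same tuple, so the sentences become comparable. A hand-written pattern like this does not scale to real text archives, which is the gap that semantic tagging is meant to close.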

:: DIAsDEM :: 9/24

Introduction

need to
– „find“ important information
– mark important information

<person>Mr. Schröder</person>

<capital amount="20000 EUR">earns EUR 20.000</capital>

<period>per month</period>.
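Once a sentence is tagged like this, the marked information can be read out with standard XML tools. The following sketch uses Python's xml.etree.ElementTree. The enclosing <sentence> root element is an assumption added here so that the fragment is well-formed; the slide shows only the inner tags.

```python
import xml.etree.ElementTree as ET

# The <sentence> wrapper is hypothetical; the three inner tags are from the slide.
TAGGED = (
    "<sentence>"
    "<person>Mr. Schröder</person> "
    '<capital amount="20000 EUR">earns EUR 20.000</capital> '
    "<period>per month</period>."
    "</sentence>"
)

root = ET.fromstring(TAGGED)

person = root.findtext("person")          # text content of <person>
amount = root.find("capital").get("amount")  # normalised value from the attribute
period = root.findtext("period")

# ElementTree supports a small XPath subset, e.g. "all <capital> with an amount".
capitals = root.findall(".//capital[@amount]")
```

The normalised amount lives in the attribute, so a query does not have to interpret the original wording "EUR 20.000" again.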

:: DIAsDEM :: 10/24

DIAsDEM

:: DIAsDEM ::

:: DIAsDEM :: 11/24

DIAsDEM

DIAsDEM: Datenintegration von Altlastdaten und semistrukturierten Dokumenten mit Mining-Verfahren (integration of legacy data and semi-structured documents with data mining techniques)

project of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)

necessary: domain-specific knowledge (!!!)

:: DIAsDEM :: 12/24

DIAsDEM

:: functioning ::

:: DIAsDEM :: 13/24

DIAsDEM

2-phase model

1. knowledge discovery
– iterative process (with expert knowledge)
– training phase with a training text archive
– finding segments (clusters) and semi-automatic annotation
– derivation of an unstructured XML DTD

2. semantic tagging
– applying the found clusters to new archives
– „intelligent“ tagging of new, unknown texts of the same domain
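As an illustration of what an "unstructured" DTD could look like, the sketch below generates a flat DTD in which the root element may contain plain text and any of the discovered tags in any order. This reading of "unstructured" is an assumption, and the function name flat_dtd and the root name "document" are hypothetical.

```python
def flat_dtd(root, tags):
    """Build a flat DTD: the root mixes text and any tag, in any order and number."""
    names = sorted(set(tags))
    content = " | ".join(["#PCDATA"] + names)
    lines = [f"<!ELEMENT {root} ({content})*>"]
    # Each tag holds plain text only; no nesting is declared (assumption).
    lines += [f"<!ELEMENT {name} (#PCDATA)>" for name in names]
    return "\n".join(lines)


dtd = flat_dtd("document", ["person", "capital", "period"])
print(dtd)
```

Such a DTD says which tags exist but nothing about their order or nesting, which matches the idea that the structure is only refined later.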

:: DIAsDEM :: 14/24

DIAsDEM

Fig.: Winkler 2003b, page 6

:: DIAsDEM :: 15/24

DIAsDEM

expert knowledge is necessary to achieve „good“ semantic tagging

What is needed?

<person>Mr. Schröder</person>

or

<title>Mr.</title>

<name>Schröder</name>

:: DIAsDEM :: 16/24

DIAsDEM

steps in DIAsDEM:

1. finding segments (for example sentences) in training texts by using thesauri and knowledge of named entities (persons, ...)

2. building an unstructured XML DTD

3. clustering of similar text elements (cluster name = the descriptors that dominate in the cluster)

4. renaming of clusters by experts

5. annotation of training texts

6. building a final XML DTD (for querying, XML-based databases like Tamino, data mining tools, ...)
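Steps 1 and 3 can be sketched roughly as follows. This is a toy stand-in, not the clustering used by DIAsDEM: the thesaurus entries, the Jaccard similarity, the greedy single-pass grouping and the threshold of 0.5 are all assumptions chosen to keep the example short.

```python
from collections import Counter

# Hypothetical thesaurus: surface word -> descriptor.
THESAURUS = {
    "earns": "income", "salary": "income", "paid": "income",
    "month": "period", "year": "period",
    "founded": "founding", "established": "founding",
}


def descriptors(segment):
    """Map a text segment to the set of thesaurus descriptors it contains."""
    words = (w.strip(".,").lower() for w in segment.split())
    return frozenset(THESAURUS[w] for w in words if w in THESAURUS)


def jaccard(a, b):
    """Set overlap between 0.0 and 1.0; two empty sets count as dissimilar."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(segments, threshold=0.5):
    """Greedy grouping: a segment joins the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed descriptor set, member segments)
    for segment in segments:
        found = descriptors(segment)
        for seed, members in clusters:
            if jaccard(seed, found) >= threshold:
                members.append(segment)
                break
        else:
            clusters.append((found, [segment]))
    return [members for _, members in clusters]


def default_name(members, top=2):
    """Name a cluster after its most frequent descriptors (ties broken alphabetically)."""
    counts = Counter(d for segment in members for d in descriptors(segment))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "_".join(name for name, _ in ranked[:top])


segments = [
    "Mr. Schröder earns EUR 20.000 per month.",
    "She is paid a salary every month.",
    "The company was founded in 1990.",
]
groups = cluster(segments)
names = [default_name(group) for group in groups]
```

The generated names are only defaults; in step 4 a domain expert would replace them with meaningful tag names.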

:: DIAsDEM :: 17/24

Extensions

:: new extensions ::

:: DIAsDEM :: 18/24

Extensions

main goal:

– searching for documents on the internet according to a user specification
– downloading hypertext documents
– extracting plain text from hypertext documents
– importing plain text into the DIAsDEM collection

:: DIAsDEM :: 19/24

Extensions

:: querying Google ::

:: DIAsDEM :: 20/24

Extensions - Google

1. the user enters the search words (panel)

2. Google is queried with the search words via the Google API

3. result: a list of URLs (currently only 10, limited by Google), automatically exported to a text file
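The export in step 3 might look like the sketch below. The search function is passed in as a parameter and is purely hypothetical, because the slides do not show how the Google API is called; only the limit of 10 results and the text file export are taken from the slide.

```python
from pathlib import Path

MAX_RESULTS = 10  # limit mentioned on the slide


def export_urls(search, words, path):
    """Call `search` (hypothetical: query string -> iterable of URLs), write one URL per line."""
    urls = list(search(" ".join(words)))[:MAX_RESULTS]
    Path(path).write_text("".join(url + "\n" for url in urls), encoding="utf-8")
    return urls
```

Keeping the search call behind a parameter means the export step can be tried out with a stub before a real search service is connected.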

:: DIAsDEM :: 21/24

Extensions

:: processing and import ::

:: DIAsDEM :: 22/24

Extensions - Processing and Import

1. reading the URL list (exported text file)

2. downloading hypertext files into a directory and renaming the files (enumeration)

3. detagging the files
– cleaning hypertext documents
– deleting comments and tags
– replacing special characters (not yet implemented)

4. importing files into the DIAsDEM collection
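A minimal sketch of the detagging in step 3, together with the enumerated file names from step 2, using Python's standard html.parser module. It is not the extension's actual code: skipping script and style content, decoding character references (one possible way to handle the special characters) and the four-digit naming scheme are assumptions. Downloading and the import into DIAsDEM are left out.

```python
from html.parser import HTMLParser


class Detagger(HTMLParser):
    """Collect text only; tags and comments are dropped, character references decoded."""

    SKIP = {"script", "style"}  # assumption: their content is not document text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def detag(html):
    """Return the plain text of an HTML string with whitespace collapsed."""
    parser = Detagger()
    parser.feed(html)
    parser.close()
    # Text chunks are joined with a space, so inline markup inside a word splits it.
    return " ".join(" ".join(parser.parts).split())


def numbered_name(index):
    """Hypothetical naming scheme for the enumerated files, e.g. 0003.txt."""
    return f"{index:04d}.txt"


page = (
    "<html><!-- note --><body><p>Mr. Schr&ouml;der earns 20000 &euro;</p>"
    "<script>var x = 1;</script></body></html>"
)
text = detag(page)
```

Comments disappear because the parser's default comment handler does nothing, so no extra code is needed for them.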

:: DIAsDEM :: 23/24

Questions?


:: DIAsDEM :: 24/24

Literature

Graubitz, H., Spiliopoulou, M. & Winkler, K. (2001). „The DIAsDEM Framework for Converting Domain-Specific Texts into XML Documents with Data Mining Techniques“. In Proceedings of the First IEEE International Conference on Data Mining, pages 171-178, San Jose, CA, USA, November / December 2001. IEEE Computer Society, Los Alamitos.

Winkler, K. & Spiliopoulou, M. (2003a). „Text Mining in der Wettbewerberanalyse: Konvertierung von Textarchiven in XML-Dokumente“. In Proceedings der 6. Konferenz der SAS Anwender in Forschung und Entwicklung, pages 347-363, Shaker Verlag, Aachen, Germany.

Winkler, K. (2003b). „Technical Report - Getting Started with DIAsDEM Workbench 2.1“. A Case-Based Approach Technical Report, 121 pages. HHL - Leipzig Graduate School of Management.