Leuven, 2007-05-22

Computer Aided Document Indexing System for Accessing Legislation

A Joint Venture of Flanders and Croatia

Bojana Dalbelo Bašić
Faculty of Electrical Engineering and Computing, University of Zagreb
[email protected]

Marko Tadić
Faculty of Humanities and Social Sciences, University of Zagreb
[email protected]

Marie-Francine Moens
Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven
[email protected]



Talk overview

document indexing and computer aided document indexing

project AIDE

CADIS workstation: features

project CADIAL

eCADIS workstation: additional features

machine learning techniques

future developments

conclusions

Computer Aided Document Indexing

document indexing

attachment of descriptors from a controlled thesaurus to a document

descriptors = labels representing the content of a document

necessary for document retrieval in many document collections

parliamentary documentation

legislation

technical documentation

usually done manually

tedious, error prone, slow (max. 30-40 documents/day)

could computers be of any help in this process?

if we build a Computer Aided Document Indexing System (CADIS)


Project AIDE in Croatia

idea for a project

September 2004

interdisciplinary collaboration of 3 institutions

Croatian Information Documentation Referral Agency (HIDRA)

Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb

Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

AIDE – collaborating institutions

HIDRA

collecting, processing, providing public access and promotion of the official documentation of the Republic of Croatia

coordinator Maja Cvitaš, M.A.

ZEMRIS

research in the field of artificial intelligence, neural networks, machine learning, data and text mining

coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc.

ZZL

computational linguistic research and building language technologies for Croatian

coordinator prof. Marko Tadić


AIDE – project objective

Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

AIDE – how?

AIDE = Automatic Indexing of Documents with Eurovoc

automatic indexing, how?

program which “learns to index” documents

conference in the Joint Research Centre of the EC (JRC), Ispra, Italy, 2004-09

at least 10,000 manually indexed documents

3-5 descriptors per document

10-15 documents per descriptor

indexed documents stored in XML format

Steinberger (2003)

compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors
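The corpus requirements listed above (at least 10,000 documents, 3-5 descriptors per document, 10-15 documents per descriptor) can be checked with a few lines of code. The sketch below is only an illustration: the function names, the in-memory corpus format and the reading of the lower bounds as minimum thresholds are assumptions, not part of AIDE or CADIS.

```python
from collections import Counter
from statistics import mean

def corpus_stats(corpus):
    """corpus maps a document id to the set of descriptor ids attached to it
    (hypothetical in-memory format; the real documents are stored in XML)."""
    per_descriptor = Counter(d for descs in corpus.values() for d in descs)
    return {
        "documents": len(corpus),
        "mean_descriptors_per_document": mean(len(d) for d in corpus.values()),
        "mean_documents_per_descriptor": mean(per_descriptor.values()),
    }

def meets_guidelines(stats, min_documents=10_000,
                     min_desc_per_doc=3, min_docs_per_desc=10):
    # Assumption: the lower bounds of the ranges quoted on the slide
    # are treated as minimum thresholds.
    return (stats["documents"] >= min_documents
            and stats["mean_descriptors_per_document"] >= min_desc_per_doc
            and stats["mean_documents_per_descriptor"] >= min_docs_per_desc)

# Toy corpus with made-up document and descriptor ids.
toy = {
    "doc1": {"a", "b", "c"},
    "doc2": {"b", "c", "d"},
    "doc3": {"c", "d", "e", "f"},
}
toy_stats = corpus_stats(toy)
print(toy_stats, meets_guidelines(toy_stats))
```

A corpus of a few hundred documents, as available in 2004-09, fails the first threshold regardless of how well each document is indexed.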

situation with Croatian documentation in 2004-09

only a few hundred documents had been indexed

manual indexing: painfully slow

how could we speed up the manual indexing?


AIDE – activities

investigate and develop algorithms in the field of computational linguistics/language technologies

include that knowledge into the Computer Aided Document Indexing System (CADIS)

demonstration of CADIS in the European Parliament (2006-03-10)

CADIS: two parallel windows

Document window

Eurovoc browser window


Document Window

CADIS features

Enhanced user interface

list of descriptors literally appearing in the document

Descriptors and non-descriptors marked in the document

Lists of n-grams

Integration of corpus analysis

greyed n-grams are statistically relevant in the corpus, i.e. collocations

Manual marking of significant n-grams

important step towards further refinement of automatic indexing
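Two of the features above, the list of descriptors appearing literally in the document and the lists of n-grams, can be approximated in a few lines. This is a minimal sketch, not the CADIS implementation: the tokenizer, the function names and the example terms are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lower-cased word tokens; \w also matches letters such as š, ć, ž.
    return re.findall(r"\w+", text.lower())

def find_terms(tokens, terms):
    """Return the thesaurus terms whose token sequence occurs literally."""
    found = []
    for term in terms:
        t = tokenize(term)
        n = len(t)
        if n and any(tokens[i:i + n] == t for i in range(len(tokens) - n + 1)):
            found.append(term)
    return found

def ngram_counts(tokens, n):
    """Count all word n-grams of length n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Illustrative text and terms, not taken from Eurovoc.
text = ("The fishing policy sets fishing quotas. "
        "The fishing policy is revised yearly.")
tokens = tokenize(text)
print(find_terms(tokens, ["fishing policy", "fishing quota", "customs union"]))
print(ngram_counts(tokens, 2).most_common(3))
```

Note that "fishing quota" is not found because the text contains the plural "quotas". Purely literal matching misses inflected forms, which is why a language-specific component that maps word forms to lemmas matters, especially for a highly inflected language such as Croatian.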

Eurovoc browser window

AIDE – activities (continued)

ca. 10,000 Croatian documents indexed in HIDRA using the CADIS workstation during 2006

joint project proposal with Katholieke Universiteit Leuven for the CADIAL project

CADIAL project

Computer Aided Document Indexing for Accessing Legislation

a joint Flemish-Croatian project

Department International Flanders, grant no. KRO/009/06

partners:

Katholieke Universiteit Leuven (prof. Marie-Francine Moens)

University of Zagreb, HIDRA (prof. Bojana Dalbelo Bašić)

started: 2007-03

duration: 2 years

web: www.cadial.org

the goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

a new version of CADIS (eCADIS) is one of the modules in this project, planned as a web-based service

CADIAL project 2

used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian

used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English

included that training data into the next version: eCADIS (-version)

eCADIS () features

Automatic suggestion of relevant descriptors, i.e. automatic indexing

application of machine learning techniques

Compare it to manually attached indexes…

Manual marking of inappropriate suggestions

another step in further refinement of automatic indexing
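Comparing the automatically suggested descriptors with the manually attached ones can be quantified with precision, recall and F1 over the two descriptor sets. The sketch below assumes this standard set-based comparison; the slides do not state which measure eCADIS reports, and the function name and descriptor labels are illustrative.

```python
def compare_indexing(suggested, manual):
    """Precision, recall and F1 of suggested descriptors against manual ones."""
    suggested, manual = set(suggested), set(manual)
    agreed = suggested & manual
    precision = len(agreed) / len(suggested) if suggested else 0.0
    recall = len(agreed) / len(manual) if manual else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if agreed else 0.0
    return precision, recall, f1

# Illustrative labels, not actual Eurovoc descriptors.
suggested = {"fisheries", "fishing fleet", "customs", "subsidy"}
manual = {"fishing fleet", "customs", "quota"}
print(compare_indexing(suggested, manual))
```

Suggestions that an indexer marks as inappropriate correspond to the descriptors in `suggested` but not in `manual`, so the same set difference can feed the refinement step mentioned above.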

eCADIS () on document in English

Automatic suggestion of relevant descriptors, i.e. automatic indexing

Compare it to manually attached indexes…

Training the classifiers

already existing classifiers

profile classifier (Steinberger 2003)

K-nearest neighbours

binary classifiers

SVM, Logistic Regression, Rocchio, Bayes, …

classifiers used for the preliminary training

ca 3500 independent binary classifiers

need to be further evaluated

Logistic Regression used for 10,000 documents in Croatian

SVM used for 20,000 documents in English
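The idea of one independent binary classifier per descriptor can be sketched as follows. To keep the example self-contained, it swaps in a simple Rocchio-style centroid classifier instead of the Logistic Regression and SVM models named above; the function names, the whitespace tokenizer and the training texts are assumptions for illustration only.

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term counts; real features could be lemmas or stems.
    return Counter(text.lower().split())

def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()} if vectors else {}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(docs):
    """docs: list of (text, set of descriptors).
    Builds one independent (positive, negative) centroid pair per descriptor."""
    vectors = [(vectorize(text), labels) for text, labels in docs]
    descriptors = set().union(*(labels for _, labels in docs))
    return {
        d: (centroid([v for v, labels in vectors if d in labels]),
            centroid([v for v, labels in vectors if d not in labels]))
        for d in descriptors
    }

def suggest(models, text):
    """Each binary classifier decides on its own descriptor independently."""
    v = vectorize(text)
    return sorted(d for d, (pos, neg) in models.items()
                  if cosine(v, pos) > cosine(v, neg))

# Made-up training data; the descriptor labels are illustrative.
docs = [
    ("fishing quota for cod fishing fleet", {"fisheries"}),
    ("customs tariff on imported fish", {"customs", "fisheries"}),
    ("customs duty and import tariff", {"customs"}),
    ("fishing fleet subsidy", {"fisheries"}),
]
models = train(docs)
print(suggest(models, "new fishing fleet quota"))
```

Because the classifiers are independent, a document can receive any number of descriptors, and adding a descriptor only requires training one more binary model.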

features

tokens, lemmas, stems, character n-grams

various feature selection methods and their combinations: χ², information gain (IG), mutual information (MI)…

Further development of eCADIS

training with new features and feature selection methods

collocations, word n-grams, chunks

new measures for evaluation of results

sensitive to thesaurus hierarchy

web-interface for eCADIS for inclusion into the CADIAL system

eCADIS for other languages

now only Croatian and English (-version) covered

usable for other languages as it is, but less efficient without the linguistic module

no list of lemmas, only types

poor statistics for n-grams

cooperation with language technology experts in different languages for development of linguistic modules
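One of the items planned above is an evaluation measure sensitive to the thesaurus hierarchy. The slides do not define such a measure, so the sketch below shows one common option as an assumption: hierarchical precision and recall, where both descriptor sets are extended with their broader terms before comparison. The `broader` mapping and all labels are illustrative, not taken from Eurovoc.

```python
def with_ancestors(descriptors, broader):
    """Extend a descriptor set with all of its broader terms.
    broader maps a descriptor to a list of broader descriptors
    (a descriptor may have more than one)."""
    expanded = set()
    stack = list(descriptors)
    while stack:
        d = stack.pop()
        if d not in expanded:
            expanded.add(d)
            stack.extend(broader.get(d, ()))
    return expanded

def hierarchical_scores(suggested, manual, broader):
    s = with_ancestors(suggested, broader)
    m = with_ancestors(manual, broader)
    overlap = len(s & m)
    precision = overlap / len(s) if s else 0.0
    recall = overlap / len(m) if m else 0.0
    return precision, recall

# Hypothetical fragment of a thesaurus hierarchy.
broader = {
    "sea fishing": ["fisheries"],
    "fish farming": ["fisheries"],
    "customs tariff": ["trade policy"],
}
print(hierarchical_scores({"sea fishing"}, {"fish farming"}, broader))
print(hierarchical_scores({"customs tariff"}, {"fish farming"}, broader))
```

A flat comparison scores both suggestions as complete misses, whereas the hierarchical variant gives partial credit to "sea fishing" because it shares the broader term "fisheries" with the manual descriptor.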

Further development of eCADIS …

eCADIS for other languages

training the automatic indexing system for other languages

enables automatic suggestions of relevant descriptors in new, unseen documents

analysis of manual markings: descriptors, word n-grams, suggestions

promote the use of eCADIS in other countries beyond the scope of the CADIAL project

e.g. Belgium (Flanders)

linguistic module for Dutch and French needed

computational linguistics expertise

training data from Acquis can be used to make an automatic indexing system for Dutch and French

machine learning expertise

Conclusion

CADIAL

a joint Flemish-Croatian project sponsored by the Flemish government

better public access to Croatian official documentation

faster and improved document indexing

automatic content metadata generation (Semantic Web)

easier document retrieval and exploration of legislation

multilingual access via standardized EU thesaurus Eurovoc

a test-case for the usage of such a system in Flanders

Web information on CADIAL project and eCADIS

www.cadial.org

contact:

[email protected]

[email protected]
