Ipw slides

EUROPEAN LEGISLATIVE RESPONSES TO INTERNATIONAL TERRORISM

A Database of Laws in German Plenary Protocols

Outline

1. Introduction

2. Xtract: software for extraction

3. Expected results

4. Discussion

1. Introduction

Linking Laws and Plenary Protocols

Extract agenda items and participants' information from the plenary protocols of terms 12 – 16

Use GESTA as an index of laws

Link laws to plenary speeches and vice versa

1 introduction

We have ...

Plenary protocol PDFs from electoral terms 12 – 16

1990-12-10 – present

120,655 pages in 1,162 documents

GESTA database of laws, terms 8 – 16

... and the ambition to deliver excellent results :)

We want to ...

Extract from 1990 up to the present time

For each plenary session

Session number, date, ...

For each item on the agenda

Descriptions

list of participants

printed matter references

speech texts

tables

Link the results with our database of laws
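
One of the extraction targets above, printed matter references, lends itself to simple pattern matching. The following is a minimal sketch in Python (not the C# used by Xtract), assuming references follow the usual Bundestag form "term/number", for example "Drucksache 16/1234". The function name and the sample sentence are hypothetical, and a pattern this loose would also match unrelated "n/m" strings, so real protocols would need stricter, term-specific rules.

```python
import re

# Assumption: a printed matter reference is "<electoral term>/<running number>".
PRINTED_MATTER = re.compile(r"\b(\d{1,2})/(\d{1,5})\b")


def find_printed_matter(text):
    """Return (term, number) pairs for every reference found in text."""
    return [(int(term), int(number)) for term, number in PRINTED_MATTER.findall(text)]


# Hypothetical agenda item description in the style of a plenary protocol.
sample = "Beratung des Antrags auf Drucksache 16/1234 sowie der Drucksachen 16/1300 und 16/1301"
references = find_printed_matter(sample)
```

Each (term, number) pair could then serve as a key when linking an agenda item to the database of laws.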

Challenges

Older electoral terms are not digitized

Each electoral term requires different pattern matching strategies

GESTA tables generated for the project

No consistent, direct links to plenary protocols

Course of legislation lacks detail

Quality difference between older and newer terms

OCR errors

GESTA Database – no improvements possible for older terms

2. Xtract

Xtract – software for data mining

a set of modern tools to annotate plenary protocols with relevant pieces of information

preserves document layout

uses multiple strategies to mark important text blocks

location, shape and internal structure of blocks

pattern matching

Euclidean distances

statistics

comes with its own document viewer

2 software

Xtract – implementation details

PDF access

pdftohtml (custom builds)

Acrobat Professional 9 Extended (older terms)

Data manipulation

C# 4.0: LINQ to XML

Visualization

C# 4.0: WPF (Windows Presentation Foundation)

Statistics

CORSIS: my personal open-source project for corpus analysis

Xtract – why XML?

Simple and highly 'liquid' file format

based on simple international standards

excellent APIs in many programming languages

converts easily into other formats

used in Microsoft Office, OpenOffice.org

Xtract – XML crash course

<event>
  <speaker id="12">
    <name>Franz Müntefering</name>
    <is>Bundesminister für Arbeit und Soziales</is>
  </speaker>
</event>

elements: event, speaker, name, is

attributes: id

hierarchical relations

children: event → speaker

parents: event ← speaker

descendants: event → speaker, name, is

siblings: name ↔ is
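
The relations from the crash course can be checked programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` module rather than the LINQ to XML API that Xtract itself is built on. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<event><speaker id="12"><name>Franz Müntefering</name>'
    '<is>Bundesminister für Arbeit und Soziales</is></speaker></event>'
)
event = ET.fromstring(xml_text)

# children: event -> speaker
children = [child.tag for child in event]

# attributes: the speaker element carries an id
speaker = event.find("speaker")
speaker_id = speaker.get("id")

# descendants: every element below event, in document order
descendants = [el.tag for el in event.iter() if el is not event]

# siblings: name and is share the same parent (speaker)
siblings = [child.tag for child in speaker]

speaker_name = speaker.findtext("name")
```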

Xtract – how does it work?

1. extracts texts from PDF files along with layout information

2. merges texts into proximity blocks

3. marks ambient constructs

4. marks agenda items

5. annotates blocks with the sections they belong to
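
The block-merging step can be illustrated with a small sketch. This is not Xtract's actual algorithm, which also considers the shape and internal structure of blocks. It assumes each text fragment is reduced to a single anchor point (x, y), and that two fragments belong together when their Euclidean distance stays below a hypothetical threshold `max_distance`. The function name and sample coordinates are invented for illustration.

```python
import math


def merge_into_blocks(fragments, max_distance):
    """Group (x, y, text) fragments that lie within max_distance of a block member."""
    blocks = []
    for fragment in fragments:
        x, y, _ = fragment
        near, far = [], []
        for block in blocks:
            # A block is "near" if any of its members is close enough to the fragment.
            close = any(math.dist((x, y), (ox, oy)) <= max_distance for ox, oy, _ in block)
            (near if close else far).append(block)
        # Fuse all near blocks with the new fragment into one block.
        merged = [member for block in near for member in block] + [fragment]
        blocks = far + [merged]
    return blocks


# Hypothetical fragments as they might come out of the PDF extraction step.
fragments = [
    (10, 10, "Franz"),
    (14, 10, "Müntefering"),
    (10, 14, "Bundesminister"),
    (200, 300, "(Beifall)"),
]
blocks = merge_into_blocks(fragments, max_distance=5)
block_texts = [[text for _, _, text in block] for block in blocks]
```

The three neighbouring fragments end up in one block, while the distant interjection stays in a block of its own.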

3. Expected Results

DIGESTA

Based on 'GESTA Gesamtausgaben': terms 14 – 16

Always up-to-date

Detailed course of legislation information

Direct links to plenary protocols

Can be complemented with keywords from MZES

http://corsis.sf.net/ipw/digesta/

3 results

Done!!

PLEDA – Plenary Protocols Database

Based on plenary protocols

Links agenda items multidirectionally with participants

Interesting for a range of linguistic and political research purposes
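
Multidirectional linking can be pictured as two indexes built from the same records. The following is a minimal sketch under assumed data: the record layout, the identifiers `item-1` and `item-2`, and the speaker labels are placeholders, not PLEDA's actual schema.

```python
from collections import defaultdict

# Hypothetical records: (agenda item id, participant) pairs.
appearances = [
    ("item-1", "Speaker A"),
    ("item-1", "Speaker B"),
    ("item-2", "Speaker A"),
]

participants_by_item = defaultdict(set)
items_by_participant = defaultdict(set)

for item, participant in appearances:
    # Store the link in both directions so either side can be the starting point.
    participants_by_item[item].add(participant)
    items_by_participant[participant].add(item)
```

A query can then start from an agenda item (who took part?) or from a participant (which items did they take part in?).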

PLEDA – Project Status

Electoral term: 12 13 14 15 16

OCR Run: X X - - -

Correction: - - -

XML Conversion: * * X X X

Division C./S.: X X X

Block Merging: * * X X X

Ambient Constructs: X X X

Page Sections: X X X

Interjections: * * X X X

Contents: * * X

Speeches: * * X

Contents-speech links: * * X

GLIT – German Legislative Resp ...

Laws

• .law files

• from GESTA

Protocols

• .pro files

• from BTP

GLIT

• German part of ELIT

4. Discussion

Open questions

Project hosting

Where can we host the results?

Initial GLIT interface

Web service?

Rich client-side app?

Any questions from your side?

4 discussion