Language Independent Collocation Extraction (LICE) Vidas Daudaravičius Andrius Utka (Vytautas Magnus University)

Page 1: Language Independent Collocation Extraction (LICE)

Vidas Daudaravičius, Andrius Utka (Vytautas Magnus University)

Page 2: Mutual Information

[Figure: MI (vertical axis, -10 to 30) against the sum of word frequencies in a word pair (horizontal axis, 0 to 2,900,000); series: max, avg, min]

• quotations in foreign languages
• specific noun phrases
• first names and surnames preceded by titles
• names of institutions and organisations

Midshipmen Abdulla Mohammed Al-Kaabi; Ahmed Suleman Al-Mamari; Ali Adam Al-Maimani; Ali Suleman Al-Rawahi; L P Chariandy; Feras Al-Kandari; Khalid Al-Moqbali; Khamis Ali Al-Sulaitni; Khamis Saeed Al-Mazrouei; Majed Al-Majed; Mansour Sultan Al-Ramyan; Mohammed A Al-Mazrouei; Mohammed Ali Al-Wahaibi; Naser Al-Mutairi; Osama Khaled Al-Ammar.

$$MI(x;y) = \log_2 \frac{N \cdot f(x,y)}{f(x)\, f(y)}$$
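As a rough illustration of the MI formula on this slide, the sketch below computes the score from raw counts. The function name `mutual_information` and the toy counts are illustrative assumptions, not taken from the slides, and N is assumed to be the corpus size in tokens.

```python
import math


def mutual_information(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """MI(x;y) = log2(N * f(x,y) / (f(x) * f(y))).

    f_xy: frequency of the word pair (x, y)
    f_x, f_y: frequencies of the individual words
    n: corpus size in tokens (assumed meaning of N)
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(n * f_xy / (f_x * f_y))


# Hypothetical counts: a pair seen 8 times, each word seen 16 times, 1024 tokens.
print(mutual_information(8, 16, 16, 1024))  # 1024 * 8 / 256 = 32, so log2 gives 5.0
```

With these definitions, two words that each occur once and occur together score log2(N), the largest value possible for a given corpus size, which is one way to see why very rare pairs can dominate the top of an MI ranking.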

Page 3: T-score

[Figure: Log*(T-score) (vertical axis, -6 to 12) against the sum of word frequencies in a word pair (horizontal axis, 1 to 10,000,000); series: max, avg, min]

• specific noun phrases
• proper nouns
• idioms
• verb phrases

“We think that there should be tighter safeguards with us being used as an example of what can go wrong. The Law Society has done the right thing but it was one of its members who did this, so it is bad it spent two years and two previous attempts denying us our compensation.”

$$T(x,y) = \frac{f(x,y) - \frac{f(x)\, f(y)}{N}}{\sqrt{f(x,y)}}$$
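A minimal sketch of the T-score in its standard form, observed pair frequency minus expected pair frequency, divided by the square root of the observed frequency. The function name `t_score` and the counts are hypothetical, and N is again assumed to be the corpus size in tokens.

```python
import math


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """T(x,y) = (f(x,y) - f(x) * f(y) / N) / sqrt(f(x,y))."""
    if f_xy <= 0 or n <= 0:
        raise ValueError("pair frequency and corpus size must be positive")
    expected = f_x * f_y / n  # pair frequency expected if x and y were independent
    return (f_xy - expected) / math.sqrt(f_xy)


# Hypothetical counts: expected = 32 * 64 / 1024 = 2.0, so (16 - 2) / 4 = 3.5
print(t_score(16, 32, 64, 1024))
```

Unlike MI, this score grows with the observed pair frequency, so a pair seen once can never score above 1.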

Page 4: Dice

[Figure: Dice (vertical axis, -25 to 5) against the sum of the word frequencies in a word pair (horizontal axis, 0 to 2,900,000); series: max, avg, min]

• quotations in foreign languages
• specific noun phrases
• first names and surnames preceded by titles
• names of organisations and institutions
• exclamations

Fade in theme music. Tum-ti-tum-ti-tum-ti-tum Tum-ti-tum-ti-tum tum etc (trad arr Snoop Doggy Dogg).

$$Dice(x;y) = \log_2 \frac{2\, f(x,y)}{f(x) + f(y)}$$
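The logarithmic Dice score can be sketched in the same way. The function name `dice_log2` and the counts are illustrative assumptions; note that this score needs no corpus size.

```python
import math


def dice_log2(f_xy: int, f_x: int, f_y: int) -> float:
    """Dice(x;y) = log2(2 * f(x,y) / (f(x) + f(y)))."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: 2 * 4 / (8 + 24) = 0.25, so log2 gives -2.0
print(dice_log2(4, 8, 24))
```

When the pair frequency is counted so that it cannot exceed either word frequency, the ratio is at most 1 and the score is at most 0; it reaches 0 only when the two words always occur together.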

Page 5: Gravity Counts

[Figure: Gravity Counts (vertical axis, -15 to 25) against the sum of the word frequencies in a word pair (horizontal axis, 0 to 2,900,000); series: max, avg, min]

• specific noun phrases
• proper nouns
• idioms
• verb phrases

… he replied: “The Conservative party wants to win the next election. I want to win the next election. I have the will to win the next election and I believe we will have a case to take to the British people that will encourage them to believe it’s right that we carry on the job we’ve been trying to do.

$$G(x,y) = \log \frac{f(x,y)\, n(x)}{f(x)} + \log \frac{f(x,y)\, n'(y)}{f(y)}$$
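The sketch below computes Gravity Counts for every adjacent word pair in a token list. The slide does not define n(x) and n'(y); here n(x) is assumed to be the number of distinct words found immediately after x, and n'(y) the number of distinct words found immediately before y. The natural logarithm, the function name `gravity_counts` and the toy token list are also assumptions.

```python
import math
from collections import Counter, defaultdict


def gravity_counts(tokens):
    """Return {(x, y): G(x, y)} for every adjacent pair in `tokens`."""
    f = Counter(tokens)                         # f(x): word frequency
    f_pair = Counter(zip(tokens, tokens[1:]))   # f(x, y): adjacent-pair frequency
    right = defaultdict(set)                    # distinct words seen right after x
    left = defaultdict(set)                     # distinct words seen right before y
    for x, y in f_pair:
        right[x].add(y)
        left[y].add(x)
    return {
        (x, y): math.log(c * len(right[x]) / f[x])
        + math.log(c * len(left[y]) / f[y])
        for (x, y), c in f_pair.items()
    }


# Toy token list (invented): f(a)=3, f(b)=2, f(a,b)=2, n(a)=2, n'(b)=1
scores = gravity_counts("a b a b a c".split())
print(scores[("a", "b")])  # log(2*2/3) + log(2*1/2) = log(4/3)
```

Under this reading, a pair scores higher when it is frequent relative to both words and when each word combines with many different neighbours.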

Page 6: Extraction of Collocational Strings

[Figure: values (vertical axis, -5 to 25) for each adjacent word pair, from HE WILL to INTERESTED IN, of the sentence "He will work for a new free trade area embracing North America and Europe, an idea President Clinton is interested in"]

• He will work for a new
• Free trade area
• North America and Europe, an idea
• President Clinton is interested in
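One simple way to turn per-pair scores into strings like those on this slide is to cut the sentence wherever the score of an adjacent pair falls below a threshold. The slide does not state the boundary rule, so the thresholding below is an assumption rather than the authors' method, and the scores in the example are invented for illustration (they are not the values from the figure).

```python
def split_into_strings(words, pair_scores, threshold):
    """Cut `words` wherever the score of an adjacent pair is below `threshold`.

    pair_scores[i] is the score of the pair (words[i], words[i + 1]).
    """
    if not words:
        return []
    if len(pair_scores) != len(words) - 1:
        raise ValueError("need exactly one score per adjacent word pair")
    strings, current = [], [words[0]]
    for word, score in zip(words[1:], pair_scores):
        if score < threshold:
            strings.append(" ".join(current))  # weak link: close the current string
            current = [word]
        else:
            current.append(word)               # strong link: extend the string
    strings.append(" ".join(current))
    return strings


words = "free trade area embracing North America".split()
scores = [9.0, 8.0, 1.0, 2.0, 12.0]  # invented scores, one per adjacent pair
print(split_into_strings(words, scores, threshold=5.0))
# ['free trade area', 'embracing', 'North America']
```

A fixed threshold is the simplest choice; a rule based on local minima of the score curve would be another option under the same assumptions.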

Page 7: Extraction of Nominal Phrases from Lithuanian Language Corpus (100m)

Page 8: AC (French)

[Figure: values (vertical axes, -10 to 30 and -10 to 25) for each adjacent word pair of the sentence "Chaque État membre compare sur une période d'au moins deux ans les indices de qualité des variétés de blé dur à ceux des variétés représentatives au niveau régional"; labels: GC, MI; series: Span = 1, Span = 3]
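The figures on this and the following slides compare Span = 1 with Span = 3. The slides do not define the span; the sketch below assumes it is the maximum distance, in words to the right, at which two words are still counted as a pair. The function name `pair_counts` is hypothetical.

```python
from collections import Counter


def pair_counts(tokens, span=1):
    """Count ordered pairs (x, y) where y occurs at most `span` words after x."""
    counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + span]:
            counts[(x, y)] += 1
    return counts


tokens = "chaque état membre compare".split()
print(pair_counts(tokens, span=1)[("chaque", "membre")])  # 0: not adjacent
print(pair_counts(tokens, span=3)[("chaque", "membre")])  # 1: within three words
```

Under this assumption a wider span only adds pairs, so pair frequencies, and any score computed from them, can change between the two settings.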

Page 9: AC (English)

[Figure: values (vertical axes, -10 to 25 and -10 to 30) for each adjacent word pair of the sentence "Each Member State shall compare over a period of at least two years the quality indexes of the durum wheat varieties with those of the representative varieties at regional level"; labels: GC, MI; series: Span = 1, Span = 3]

Page 10: AC (Italian)

[Figure: values (vertical axes, -15 to 20 and -10 to 30) for each adjacent word pair of the sentence "Ciascuno Stato membro raffronta nell'arco di un periodo di almeno due anni gli indici di qualità delle varietà di frumento duro con quelli delle varietà rappresentative a livello regionale"; labels: GC, MI; series: Span = 1, Span = 3]

Page 11: AC (Maltese)

[Figure: values (vertical axes, -10 to 25 and -10 to 30) for each adjacent word pair of the Maltese version of the same sentence, beginning "Kull Stat Membru għandu jqabbel"; labels: GC, MI; series: Span = 1, Span = 3]

Page 12: AC (Dutch)

[Figure: values (vertical axes, -10 to 25 and -10 to 30) for each adjacent word pair of the sentence "Elke lidstaat vergelijkt op regionaal niveau over een periode van ten minste twee jaar de kwaliteitsindex van de durumtarwerassen met die van de representatieve rassen"; labels: GC, MI; series: Span = 1, Span = 3]

Page 13: Phrase Alignment

FR                  EN                       IT                     MT                       NL
CHAQUE ÉTAT MEMBRE  EACH MEMBER STATE SHALL  CIASCUNO STATO MEMBRO  GĦANDU KULL STAT MEMBRU  ELKE LIDSTAAT
DE                  OF THE                   DI                     TA                       VAN DE
BLÉ DUR             DURUM WHEAT              FRUMENTO DURO          TAL QAMĦ AWSTRALJA       DURUMTARWERA
DES                 OF THE                   DELLE                  TAL                      VAN DE
AU NIVEAU RÉGIONAL  AT REGIONAL LEVEL        A LIVELLO REGIONALE    FUQ LIVELL REĠJONALI     OP REGIONAAL NIVEAU

Page 14: Language Independent Collocation Extraction (LICE)

http://donelaitis.vdu.lt/~vidas/celex/lice.php

Vidas Daudaravičius, Andrius Utka (Vytautas Magnus University)