Persian@CLEF Current and Future Research Directions University of Tehran Database Research Group 1 October 2009 Abolfazl AleAhmad, Ehsan Darrudi, Hadi Amiri, Azadeh Shakery, Farhad Oroumchian


Page 1: Persian@CLEF Current and Future Research Directions

Persian@CLEF Current and Future Research Directions

University of Tehran Database Research Group

1 October 2009

Abolfazl AleAhmad, Ehsan Darrudi, Hadi Amiri, Azadeh Shakery, Farhad Oroumchian

Page 2: Persian@CLEF Current and Future Research Directions

Outline

• Why Persian IR
• Language Resources for Persian
• Hamshahri at CLEF 2009
• Persian@CLEF 2009 participants
• Persian@CLEF 2009 results
• Persian@CLEF 2009 pool analysis
• Future work

Page 3: Persian@CLEF Current and Future Research Directions

Persian in the Middle East

[Figure: User Population Growth on the Web (2000-2009)]

Source: Internet World Stats, http://internetworldstats.com/

Page 4: Persian@CLEF Current and Future Research Directions

Why Persian IR

Updated in June 2009 from Internet World Stats

Page 5: Persian@CLEF Current and Future Research Directions

The Persian Language

• A branch of the Indo-European languages
• Official language of Iran, Afghanistan and Tajikistan
• Its morphological analysis is comparatively difficult

The word “خبر” has two plural forms:
• Persian rules: “خبرها”
• Arabic rules: “اخبار”

Writing style issues:
• e.g. “می شود” and “میشود” are the same
• e.g. “کتابها” and “کتاب ها” are the same
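The plural and writing-style variations on this slide can be illustrated with a minimal Python sketch. It is not the authors' method: the function names `normalize_spacing` and `strip_plural`, the one-entry `BROKEN_PLURALS` table, and the choice to treat a space and a zero-width non-joiner (U+200C) the same way are all assumptions made for illustration.

```python
import re

# Assumption: a tiny hand-made lookup for Arabic broken plurals.
# A real system would need a far larger lexicon.
BROKEN_PLURALS = {"اخبار": "خبر"}


def normalize_spacing(text: str) -> str:
    """Map spaced and unspaced spellings to one form.

    Joins the verbal prefix "می" to the following word and the plural
    suffix "ها" to the preceding word, whether they are separated by a
    space or by a zero-width non-joiner.
    """
    # "می شود" -> "میشود"
    text = re.sub(r"\bمی[ \u200c]+(?=\S)", "می", text)
    # "کتاب ها" -> "کتابها"
    text = re.sub(r"(?<=\S)[ \u200c]+ها\b", "ها", text)
    return text


def strip_plural(word: str) -> str:
    """Remove a plural marker from a single word.

    Arabic broken plurals cannot be handled by suffix stripping, so they
    are looked up; the Persian suffix "ها" is stripped directly.
    """
    if word in BROKEN_PLURALS:
        return BROKEN_PLURALS[word]
    # Length guard so that very short words ending in "ها" are left alone.
    if word.endswith("ها") and len(word) > 3:
        return word[:-2]
    return word


print(strip_plural(normalize_spacing("کتاب ها")))  # کتاب
```

The sketch is deliberately naive: it would also join a standalone word “می” to whatever follows it, which is one reason morphological analysis of Persian needs more than a pair of regular expressions.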

Page 6: Persian@CLEF Current and Future Research Directions

Persian Test Collections

Text IR Domain
• Ghavanin (domain specific)
• Hamshahri (news): http://ece.ut.ac.ir/dbrg/hamshahri
• Hamshahri 2 (recently developed, 50 topics)

Web IR Domain
• FWT1m (.ir Web), nearly 1 million docs

NLP Domain
• Bijankhan (2.7 million words): http://ece.ut.ac.ir/dbrg/bijankhan

Page 7: Persian@CLEF Current and Future Research Directions

Hamshahri at CLEF 2008 & 2009

• News articles of Hamshahri newspaper from 1996 to 2002
• 100 bilingual topics
• 166,000+ documents

Hamshahri 2
• News articles of Hamshahri newspaper from 1996 to 2008
• 50 bilingual topics
• 320,000 documents (2 times larger, ~1.5 GB)
• Richer document tags

Page 8: Persian@CLEF Current and Future Research Directions

Persian@CLEF 2009 - Participants

1. JHU-APL
• N-gram tokenization (skip n-grams for n=5)

2. Unine
• Developed “light” and “plural” stemmers and blind query expansion

3. Open Text
• Savoy’s Stemmer and 4-grams
• Pool analysis (with top 10,000 retrieved docs)

4. Quazvin IAU
• Perstem for monolingual runs (Prec +91%, Rec +43%)
• “Query Wikification” algorithm for bilingual runs
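Two of the participants above indexed character n-grams. A minimal sketch of plain character n-gram tokenization follows, assuming whitespace-separated words; the function names `char_ngrams` and `tokenize` are hypothetical, and the skip n-gram variant mentioned for JHU-APL is not reproduced here.

```python
def char_ngrams(word: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole, so short words still
    produce one indexing term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def tokenize(text: str, n: int = 4) -> list[str]:
    """Split text on whitespace and expand every word into n-grams."""
    grams: list[str] = []
    for word in text.split():
        grams.extend(char_ngrams(word, n))
    return grams


print(char_ngrams("خبرها", 4))  # two 4-grams from a five-letter word
```

Because n-grams need no language-specific rules, they are a common fallback when no reliable stemmer is available for a language.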

Page 9: Persian@CLEF Current and Future Research Directions

Persian@CLEF 2009 - Final Results

Page 10: Persian@CLEF Current and Future Research Directions

Persian@CLEF 2008 - Final Results

Page 11: Persian@CLEF Current and Future Research Directions

Pool of CLEF 2008

[Figure: bar chart of relevant and not relevant documents in the pool for each query, 551 to 600. Horizontal axis: Number of Documents (0 to 800). Vertical axis: Query Number.]

Page 12: Persian@CLEF Current and Future Research Directions

Pool of CLEF 2009

[Figure: bar chart of relevant and not relevant documents in the pool for each query, 601 to 650. Horizontal axis: Number of Documents (0 to 700). Vertical axis: Query Number.]
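The per-query counts plotted in the pool charts can be derived from the relevance judgments. The sketch below assumes the judgments are available as TREC-style qrels lines (`topic iteration docno judgment`), which is an assumption about the file format rather than something stated in the slides; the function name `pool_counts` and the document identifiers in the example are made up.

```python
from collections import Counter


def pool_counts(qrels_lines):
    """Count relevant and not relevant pooled documents per topic.

    Each line is expected to hold four whitespace-separated fields:
    topic, iteration, document id, judgment. Malformed lines are skipped.
    """
    relevant, not_relevant = Counter(), Counter()
    for line in qrels_lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        topic, _iteration, _docno, judgment = parts
        if int(judgment) > 0:
            relevant[topic] += 1
        else:
            not_relevant[topic] += 1
    return relevant, not_relevant


# Illustrative input only; these are not real judgments.
example = ["601 0 DOC-A 1", "601 0 DOC-B 0", "602 0 DOC-C 0"]
rel, nrel = pool_counts(example)
for topic in sorted(set(rel) | set(nrel)):
    print(topic, rel[topic], nrel[topic])
```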

Page 13: Persian@CLEF Current and Future Research Directions

Persian@CLEF - Pool Comparison

Quoted from: Stephen Tomlinson. German, French, English and Persian Retrieval Experiments at CLEF 2008 & 2009. Working Notes for the CLEF 2008 & 2009 Workshops.

Page 14: Persian@CLEF Current and Future Research Directions

Persian@CLEF - Pool Comparison

[Figure panels: 2009, 2008]

Quoted from: Stephen Tomlinson. German, French, English and Persian Retrieval Experiments at CLEF 2008 & 2009. Working Notes for the CLEF 2008 & 2009 Workshops.

Page 15: Persian@CLEF Current and Future Research Directions

Future Work

• Using Hamshahri 2 for CLEF 2010 (50 training topics)
• A campaign on the Persian WebIR collection
• Creation of an English-Persian parallel corpus
• Creation of a comparable corpus
• A stemmer for the Persian language

http://ece.ut.ac.ir/dbrg

Page 16: Persian@CLEF Current and Future Research Directions

Thanks

?

[email protected]
