LT4eL - WP1: Setting the scene WP leader: UAIC Univ. Al. I. Cuza of Iasi Faculty of Computer Science Dan Cristea, Corina Forăscu, Dan Tufiş, Ionuţ Pistol, Diana Trandabăţ, Adrian Iftene Contact: [email protected] Utrecht Review Meeting, February 1, 2007


Page 1: Dan Cristea, Corina Forăscu, Dan Tufiş,  Ionuţ Pistol, Diana Trandabăţ, Adrian Iftene

LT4eL - WP1: Setting the scene

WP leader: UAIC

Univ. Al. I. Cuza of Iasi

Faculty of Computer Science

Dan Cristea, Corina Forăscu, Dan Tufiş,

Ionuţ Pistol, Diana Trandabăţ, Adrian Iftene

Contact: [email protected]

Utrecht Review Meeting, February 1, 2007

Page 2

Objectives

1. inventory and classification of existing tools necessary for the development of the relevant functionalities (i.e. keyword extractor, glossary candidate detector);

2. collection and normalization of the learning material related to the use of the computer in education (Humanities, Social Sciences);

3. investigation of IPR issues;

4. adoption of relevant standards for linguistic annotation of learning objects;

5. dissemination of the results through a Web portal

Page 3

Partners in WP1

• Utrecht University (UU), The Netherlands
• University of Hamburg (UHH), Germany
• University of Lisbon (FFCUL), Portugal
• Charles University Prague (CUP), Czech Republic
• Institute for Parallel Processing, Bulgarian Academy of Sciences (IPP-BAS), Bulgaria
• University of Tübingen (UTU), Germany
• Institute of Computer Science, Polish Academy of Sciences (ICS-PAS), Poland
• Zürich University of Applied Sciences Winterthur (ZHW), Switzerland
• University of Malta (UOM), Malta

Page 4

[Diagram: LT4eL architecture. Components shown:]

• User documents (PDF, DOC, HTML, SCORM, XML)
• CONVERTOR 1 and CONVERTOR 2; Documents HTML; Documents SCORM, Pseudo-Struct.; Basic XML
• LING. PROCESSOR (Lemmatizer, POS, Partial Parser), with one lexicon per language: BG, CZ, DT, EN, GE, MT, PL, PT, RO
• Ling. Annot XML; Metadata (Keywords); Glossary
• Ontology; CROSSLINGUAL RETRIEVAL
• LMS; User Profile; REPOSITORY

Page 5

The Portal

• A working space:
– Repository for resources, tools, deliverables
– Exchange of information among participants
– Statistics

• Hosted by UAIC:
– January 2007: 1.15 Gb (without realTimeStat, searchForm, upload/updateForm)

• Address: http://consilr.info.uaic.ro/uploads_lt4el
– Username: guestLt4eL
– Password: elearning

Demo version on CD

Page 6

O1. Collection of language resources and tools (1)

• Inventory and classification of existing tools (http://consilr.info.uaic.ro/uploads_lt4el/tools/all.php?) relevant to:
– the integration of language technology resources in eLearning (WP2)
– the integration of semantic knowledge (WP3)

Page 7

O1. Collection of language resources and tools (2)

• Inventory and classification of existing language resources
– corpora and frequency lists: http://consilr.info.uaic.ro/uploads_lt4el/menu/all.php
– lexica: http://www.let.uu.nl/lt4el/wiki/index.php/Lexica_Joint_Table

Page 8

O2. Collection of LOs: the portal

Uploads, updates & real-time statistics at http://consilr.info.uaic.ro/uploads_lt4el/

Criteria (→ attributes):
- Subdomains relevant for beginners in IST & e-learning → Domain
- Multilingualism → Language
- Medium sized documents → Number of words
- IPR ~ clear → IPR
- Uniformity in topics → keywords selected initially
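The criteria-to-attributes mapping above can be pictured as one metadata record per learning object. The following is only a sketch: the class name, the field names and the word-count thresholds are illustrative assumptions, not the portal's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class LearningObject:
    """One portal entry; field names are hypothetical, chosen to mirror the attributes."""
    title: str
    domain: str            # e.g. "1.1.2.6 Using XML"
    language: str          # e.g. "RO"
    number_of_words: int
    ipr: str               # e.g. "free", "restricted", "for research"
    keywords: list[str] = field(default_factory=list)

    def is_medium_sized(self, low: int = 1_000, high: int = 50_000) -> bool:
        # Placeholder thresholds: the slides do not define "medium sized".
        return low <= self.number_of_words <= high


lo = LearningObject(
    title="Introduction to XML",
    domain="1.1.2.6 Using XML",
    language="RO",
    number_of_words=12_000,
    ipr="free",
    keywords=["XML", "tag", "DTD"],
)
print(lo.is_medium_sized())
```

Keeping the selection criteria as explicit fields makes it straightforward to filter or count LOs per domain and language, which is what the real-time statistics would need.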

Page 9

Collection of LOs: domains

1. Use of computers in education, with sub-domains:
1.1 Teaching academic skills, with sub-domains:
1.1.1 Academic skills
1.1.2 Relevant computer skills for the above tasks (MS Word, Excel, PowerPoint, LaTeX, Web pages, XML)
1.1.3 Basic skills (use of computer for beginners) (chats, e-mail, Internet)
1.2 e-Learning, e-Marketing
1.3 The I*Teach document (Leonardo project, http://i-teach.fmi.uni-sofia.bg/)
1.4 Impact of use of computers in society
1.5 Studies about use of computers in schools / high schools
1.6 Impact of e-Learning on education

2. Calimera documents (parallel corpus developed in the Calimera FP5 project, http://www.calimera.org/)

Page 10

Collection of LOs: domains coverage

| Domain | Total | Avg | # lang |
| 1.1 Teaching Academic skills | 66,507 | 13,301 | 4 |
| 1.1.1 Academic skills | 90,033 | 18,007 | 2 |
| 1.1.1.1 Writing a diploma paper | 119,335 | 23,867 | 7 |
| 1.1.1.2 Making a presentation | 41,303 | 8,261 | 5 |
| 1.1.1.3 Writing a scientific summary | 21,699 | 4,340 | 4 |
| 1.1.1.4 Making an interview | 22,798 | 4,560 | 3 |
| 1.1.1.3 Working out a small project | 5,450 | 1,090 | 2 |
| 1.1.2 Relevant computer skills for above tasks | 606,555 | 121,311 | 8 |
| 1.1.2.1 Using MS Word | 269,525 | 53,905 | 8 |
| 1.1.2.2 Using Excel | 136,803 | 27,361 | 7 |
| 1.1.2.3 Using Power Point | 70,403 | 14,081 | 7 |
| 1.1.2.4 Using Latex | 242,163 | 48,433 | 7 |
| 1.1.2.5 Creating Web pages | 549,233 | 109,847 | 8 |
| 1.1.2.6 Using XML | 259,120 | 51,824 | 8 |
| 1.1.3 Basic computer skills (use of computers for beginners) | 123,790 | 24,758 | 4 |
| 1.1.3.1 Using chats | 18,870 | 3,774 | 6 |
| 1.1.3.2 Using email | 102,023 | 20,405 | 8 |
| 1.1.3.3 Accessing the Internet | 189,499 | 37,900 | 8 |
| 1.2 eLearning, eMarketing | 320,537 | 64,107 | 8 |
| 1.3 The I*Teach document | 126,980 | 25,396 | 2 |
| 1.4 Impact of use of computers in society | 121,446 | 24,289 | 4 |
| 1.5 Studies about use of computers in schools / high schools | 559,362 | 111,872 | 4 |
| 1.6 Impact of eLearning on education | 215,453 | 43,091 | 7 |
| 2.1 Calimera full guidelines | 656,284 | 131,257 | 8 |
| 2.2 Calimera summaries | 36,815 | 7,363 | 5 |
| Total | 4,971,986 | | |
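The figures in the coverage table can be cross-checked with a few lines of Python. The sketch below copies the Total and Avg columns as they appear on the slide (including the repeated 1.1.1.3 numbering) and tests two things: that the 25 row totals add up to the reported grand total, and that each Avg value equals Total divided by 5, rounded. The divisor 5 is an observation from the numbers themselves; the slide does not say what it stands for.

```python
# (domain number, Total, Avg) copied from the coverage table
ROWS = [
    ("1.1", 66_507, 13_301), ("1.1.1", 90_033, 18_007),
    ("1.1.1.1", 119_335, 23_867), ("1.1.1.2", 41_303, 8_261),
    ("1.1.1.3", 21_699, 4_340), ("1.1.1.4", 22_798, 4_560),
    ("1.1.1.3", 5_450, 1_090),  # numbering repeated on the slide
    ("1.1.2", 606_555, 121_311), ("1.1.2.1", 269_525, 53_905),
    ("1.1.2.2", 136_803, 27_361), ("1.1.2.3", 70_403, 14_081),
    ("1.1.2.4", 242_163, 48_433), ("1.1.2.5", 549_233, 109_847),
    ("1.1.2.6", 259_120, 51_824), ("1.1.3", 123_790, 24_758),
    ("1.1.3.1", 18_870, 3_774), ("1.1.3.2", 102_023, 20_405),
    ("1.1.3.3", 189_499, 37_900), ("1.2", 320_537, 64_107),
    ("1.3", 126_980, 25_396), ("1.4", 121_446, 24_289),
    ("1.5", 559_362, 111_872), ("1.6", 215_453, 43_091),
    ("2.1", 656_284, 131_257), ("2.2", 36_815, 7_363),
]
REPORTED_TOTAL = 4_971_986

# Sum of all row totals, to compare with the figure at the bottom of the table
grand_total = sum(total for _, total, _ in ROWS)

# Rows whose Avg is not Total / 5 (rounded); empty if the pattern holds everywhere
avg_mismatches = [name for name, total, avg in ROWS if round(total / 5) != avg]

print(grand_total == REPORTED_TOTAL)
print(avg_mismatches)
```

Note that the grand total sums every row, parent domains and sub-domains alike, so the parent rows are apparently counts of their own documents and not aggregates of the rows below them.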

Page 11

The hierarchy of LOs’ formats

Page 12

Collection of LOs: annotation layers

1. Initial documents: doc, pdf, html, txt → Base-XML

2. Linguistic annotation: tokens, POS, lemma, chunks → WP2 XML format (LT4ELAna.dtd)

3. Keywords, definitions and ontology links annotations

Page 13

Level 1 conversions

[Diagram of the format hierarchy: doc, pdf, latex, other; html, plain text; Base-XML. Highlighted step: doc → html]

Page 14

Level 1 conversions: doc → html (UTF-8)

1. MS Office: Save As html

2. OpenOffice Writer SXW/ODT: Save As html

Page 15

Level 1 conversions

[Diagram of the format hierarchy: doc, pdf, latex, other; html, plain text; Base-XML. Highlighted step: pdf → html]

Page 16

Level 1 conversions: pdf → html (UTF-8)

1. Adobe on-line conversion tool

2. pdfbox (Windows)

3. pdftohtml (Linux)

4. OpenOffice

5. Adobe Acrobat Professional
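For the pdftohtml route, a batch conversion could be scripted roughly as follows. This is a sketch, assuming the poppler-utils build of pdftohtml is installed and on the PATH; the function names and the file names are hypothetical. The options used are `-enc` (output encoding), `-noframes` (a single HTML document instead of a frameset) and `-q` (suppress messages).

```python
import subprocess


def pdftohtml_command(pdf_path: str, output_base: str) -> list[str]:
    """Build the argument list for a UTF-8, single-file pdftohtml run."""
    return ["pdftohtml", "-enc", "UTF-8", "-noframes", "-q", pdf_path, output_base]


def convert_pdf(pdf_path: str, output_base: str) -> None:
    """Run the conversion; raises CalledProcessError if pdftohtml exits with an error."""
    subprocess.run(pdftohtml_command(pdf_path, output_base), check=True)


# Hypothetical file names, shown only to illustrate the resulting command line
print(" ".join(pdftohtml_command("lo_ro_001.pdf", "lo_ro_001")))
```

Building the command as a list (and not as a shell string) avoids quoting problems with file names that contain spaces or diacritics.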

Page 17

Level 1 conversions

[Diagram of the format hierarchy: doc, pdf, latex, other; html, plain text; Base-XML. Highlighted step: Base-XML convertor]

Page 18

Level 1 conversions: html → Base-XML

• The UAIC Java converter
– keeps a fixed set of potentially useful tags
– produces a log of all the removed tags/data

• The CUP html2xml.pl converter
– tags kept according to a DTD
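The idea shared by both converters, keep a chosen set of tags and record what was dropped, can be sketched with Python's standard HTML parser. This is not the UAIC or CUP code: the whitelist below is a hypothetical example, attributes are discarded, and the input is assumed to be well nested.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical whitelist; the real converters define their own tag sets
KEPT_TAGS = {"h1", "h2", "h3", "p", "ul", "ol", "li", "table", "tr", "td"}


class TagFilter(HTMLParser):
    """Copies whitelisted tags and all text; logs every removed start tag."""

    def __init__(self, kept=KEPT_TAGS):
        super().__init__(convert_charrefs=True)
        self.kept = kept
        self.out = []
        self.removed_log = []

    def handle_starttag(self, tag, attrs):
        if tag in self.kept:
            self.out.append(f"<{tag}>")    # attributes are dropped in this sketch
        else:
            self.removed_log.append(tag)

    def handle_endtag(self, tag):
        if tag in self.kept:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))      # re-escape &, <, > for XML output


def filter_html(html: str):
    """Return (filtered markup, list of removed tag names)."""
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.removed_log


markup, removed = filter_html("<div><p>Tom &amp; <b>Jerry</b></p></div>")
print(markup)
print(removed)
```

Text inside removed tags is kept, so only the markup is lost; a production converter would also need to skip script and style content and to repair badly nested input before the result can be treated as XML.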

Page 19

Collection of LOs: second level

[Diagram. Labels: morpho, tok, pos, lemma, NP; tok-pos-lemma; WP2 XML format. Highlighted: language specific tools]

Page 20

Collection of LOs: second level

[Diagram. Labels: morpho, tok, pos, lemma, NP; tok-pos-lemma; WP2 XML format. Highlighted: scripts]

Page 21

Collection of LOs: KW extractor

[Diagram. Labels: WP2 XML format; Man KD XML; Auto KD XML; Level 2; Level 3. Highlighted: KW extractor]

Page 22

Collection of LOs: KW extractor

[Diagram. Labels: WP2 XML format; Man KD XML; Auto KD XML; Level 2; Level 3. Highlighted: KW extractor evaluation]
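One common way to evaluate a keyword extractor against manual annotation is to compare the two keyword sets with precision, recall and F1. The slides do not state which measure the project uses, so the sketch below is an assumption, and the function name and sample keywords are hypothetical.

```python
def evaluate_keywords(manual, automatic):
    """Compare automatic keywords with manual ones; returns (precision, recall, f1)."""
    gold = {k.strip().lower() for k in manual}
    predicted = {k.strip().lower() for k in automatic}
    hits = gold & predicted

    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical keyword lists for one learning object
manual_kws = ["XML", "tag", "attribute", "DTD"]
auto_kws = ["xml", "tag", "browser"]
precision, recall, f1 = evaluate_keywords(manual_kws, auto_kws)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Matching here is exact after lowercasing; comparing lemmas instead of surface forms would be a natural refinement given the Level 2 annotation.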

Page 23

Collection of LOs: third level

[Diagram. Labels: Man KD XML (incl. km.xml, dm.xml); Auto KD XML (incl. akw, adef). Highlighted: def extractor]

km.xml: manually annotated kws

dm.xml: manually annotated defs

akw: automatically annotated kws

adef: automatically annotated defs

Page 24

Collection of LOs: third level

[Diagram. Labels: Man KD XML (incl. km.xml, dm.xml); Auto KD XML (incl. akw, adef). Highlighted: def extractor evaluation]

km.xml: manually annotated kws

dm.xml: manually annotated defs

akw: automatically annotated kws

adef: automatically annotated defs

Page 25

Open issues

• Convertors
– Tables, figures, page look…

• IPRs
– Clarify the IPR status
  • authors & EU + national legislation
– Define IPR categories for LOs:
  • usage (free, restricted, for research...)

Page 26

WP1 over time

[Timeline. Points marked: December 05; February 06; May 06; Now]

Labels on the timeline:

• Beginning of project
• Initial collection on Portal
• Structure & functionalities to the portal
– BaseXML convertors
– new LOs
• Levels 2&3 additions
– new tools
– grammars
– guides, docs
– ontology, TermLex
• D1.1
• Official end of WP1
• Evaluation

Page 27

Proposal: the hierarchy seen as a processing environment

[Diagram of the format hierarchy, by level]

Level 1: doc, pdf, latex, other; html, txt; sxml

Level 2: morpho, tok, pos, lemma, NP; tpl; wp2xml

Level 3: akw, adef; axml
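Seen as a processing environment, the hierarchy amounts to a chain of conversions that a document follows from its source format up to the fully annotated one. The sketch below illustrates this with a lookup table. The edges are assumptions read off the slides (only doc → html, pdf → html and html → Base-XML are stated explicitly), the intermediate tpl step is left out because its position is not clear from the diagram, and the function name is hypothetical.

```python
# Assumed "next step" for each format; labels follow the diagram
NEXT_FORMAT = {
    "doc": "html", "pdf": "html", "latex": "html", "other": "html",  # Level 1
    "html": "sxml", "txt": "sxml",                                   # Level 1 (Base-XML)
    "sxml": "wp2xml",                                                # Level 2
    "wp2xml": "axml",                                                # Level 3
}


def conversion_path(source: str, target: str = "axml") -> list[str]:
    """List the formats a document passes through from source to target."""
    path = [source]
    while path[-1] != target:
        nxt = NEXT_FORMAT.get(path[-1])
        if nxt is None:
            raise ValueError(f"no conversion from {path[-1]!r} towards {target!r}")
        path.append(nxt)
    return path


print(" -> ".join(conversion_path("pdf")))
print(" -> ".join(conversion_path("txt", target="wp2xml")))
```

With such a table the portal could decide, for any uploaded file, which converters and annotators still have to run, which is the sense in which a repository becomes a processing environment.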

Page 28

Conclusions

• LOs, resources and tools collected
• Initially: portal seen as a repository
• Now: portal potentially integrated with the LMS as a processing environment