Korea Terminology Research Center for Language and Knowledge Engineering Infrastructures for the Korean Language Key-Sun Choi


Page 1

Korea Terminology Research Center for Language and Knowledge Engineering

Infrastructures for the Korean Language

Key-Sun Choi

Page 2

• SIG-Korean Language Computing under Korea Information Science Society
  • 300 members
• Korea Information Society
  • linguistics oriented

Page 3

• Purpose:
  • To improve Korean Language Processing Technology
  • To promote Korean Software Industry
    • in the planning phase (1993), targeted to Hangul Wordprocessor, Machine Translation and Korean Linguistic Research
• 1995 - 1997 (Phase 1): “word”
  • Two-ministry joint project + Industry
    • Ministry of Science & Technology, Ministry of Culture
• 1998 - 2000 (Phase 2): “sentence”
  • Only by Ministry of Science & Technology + Industry
  • will be evaluated in October, 2000
• 2001 - 2003 (Phase 3): “discourse” - not decided
• http://kibs.kaist.ac.kr/

Page 4

• Purpose
  • To promote the Korean Language Research on the linguistics side
  • To prepare for the language planning
    • for Unification of South-/North-Korea
    • for International use of Korean
• Sponsor: Ministry of Culture
• Period: 1998 - 2007 (10 years)
• Items
  • corpus, dictionary, internationalization, terminology, education, font, old Korean, old Chinese characters
• http://www.sejong.or.kr/

Page 5

[Figure: layered system architecture diagram.
Users: End User, User (Dictionary), User (Programmer), User (Lexicographer).
Application Level: Word processor, MT system, Information Retrieval System, Automatic Speech Translation.
Engine Level: MT engine, IR engine, Spell checker, Style checker, UI engine.
Engine Module Level: MA1, MA2, TA1, TA2, PA1, PA2, WSD1, WSD2, DA1, DA2, RM1, RM2.
Knowledge Level: Ontology, Common Knowledge, Domain Knowledge.
Knowledge Source Level: Basic DB, corpus, MRD, Knowledge extractor.
System: Distributed Resource Management System, Master DB.]

Page 6

• Title of Project
  • KIBS I : Integrated Korean Information Base
  • KIBS II : On Development of Deep-Level Processing and Quality Management Technology for Very Large Korean Information Base
• Outline
  • Term : 1994.12.4 ~ 2004.9.30 (10 years)
  • Sponsor : Ministry of Science and Technology
  • Staff : 50 person/year

Page 7

• Standard Module Interface
• Corpus and Electronic Dictionary Development and Management System
• Korean Part-of-Speech Tagging System
• Korean Syntactic Tagging System
• Korean/English Alignment System

• Terminological Data Base Development and Management System
• Standard Korean Input/Output Environment
• Standardized Methodology for the Construction of a Balanced Corpus
• Part-Of-Speech Transfer Dictionary Rules and an Example Package

• Tree-Tagged Corpus
• Word-Level Narrative Speech Data Base
• Hand-written Hangul scripts of high frequency

Page 8

• Terminology Entries
• Domain-specific Corpus for Terminology Building
• Sublanguage Analysis and Extraction of Terminology

• Development/Management System for Information Base
• Development of Integrated Management System for Distributed Resources

• Syntactic Information Base for Syntactic Analysis/Generation
• Semantic Information Base for Semantic Analysis/Generation
• Additional Information on Language and GUI for Developing Applications

Page 9

• Korean Concordance Program (KCP)
• Compound Noun Browser
• Corpus Browser
• Corpus Browser by Category
• Automatic English-to-Korean Transliteration System (TLEK)
• KAIST Ontology Browser
• Korean Morphological Analyser
• Korean Tagger
• Korean Syntactic Analyser
• Editing Support Tools for Electronic Dictionary

Page 10

• Major Results
  • The first (KIBS I) : 1997.6. ~ present (80 sites)
    • Text corpus: 10 million word phrases
    • POS tagged corpus: 1 million word phrases
    • Syntactic structure tagged corpus: 10 thousand sentences
    • TDMS, Speech DB samples, Hand-written character DB samples
  • The second (KIBS II) : 1998.12. ~ present (140 sites)
    • Raw corpus: 10 million word phrases, POS tagged corpus: 200 thousand word phrases
  • The third (KIBS III) : 2000 (pending)
    • Proper noun: 10 thousand entries, Compound noun: 20 thousand entries, Verb sentence pattern dictionary: 3 thousand entries, ...
  • Plan to maintain and distribute ...

Page 11

• Dictionaries: total 420K entries (estimated now)
  • Machine Readable Dictionary (Hangul Society): 200K entries
  • Compound Noun, Proper Noun Classification, Internal Semantic Structure: 50K entries
  • Searched Compound Noun, Proper Noun: open
  • Verb Subcategorization: 10K frames (K-J comparison)
  • Thesaurus: Korean-Japanese-Chinese-English - not so good quality - 150K entries
  • Usage from corpus for each sense
  • Functional words
• Problem
  • Sense classification standardization
  • Character code: Korean, Japanese, Chinese, … (most important problem) - now under Unicode transfer
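The character-code problem listed above concerns Korean, Japanese and Chinese text held in separate legacy encodings and being moved to Unicode. A minimal sketch of such a transfer in Python follows. The slide does not name the legacy encodings, so EUC-KR, Shift_JIS and GB2312 are assumptions here, and `to_unicode` and `transfer` are hypothetical helper names.

```python
# Assumed legacy encodings per language; the slide does not specify them.
LEGACY_ENCODINGS = {"ko": "euc_kr", "ja": "shift_jis", "zh": "gb2312"}


def to_unicode(raw: bytes, lang: str) -> str:
    """Decode bytes stored in the legacy encoding assumed for `lang`."""
    return raw.decode(LEGACY_ENCODINGS[lang])


def transfer(raw: bytes, lang: str) -> bytes:
    """Convert one legacy-encoded dictionary entry to UTF-8 bytes."""
    return to_unicode(raw, lang).encode("utf-8")


if __name__ == "__main__":
    # Simulate a legacy Korean entry, then convert it.
    legacy_entry = "한글".encode("euc_kr")
    print(transfer(legacy_entry, "ko").decode("utf-8"))
```

Once every entry is decoded to a Unicode string, Korean, Japanese and Chinese headwords can sit in the same thesaurus record without per-language code switching.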

Page 12

• Corpus KWIC for Korean and Japanese
  • http://morph.kaist.ac.kr/kcp/
• Korean morphological analysis service
  • http://morph.kaist.ac.kr/
  • By email: send a text file and receive its POS tagging in reply
  • Graphic editor/debugger for Korean morphology
• Project Status
  • http://kibs.kaist.ac.kr/
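The KWIC (keyword in context) service on this slide shows each occurrence of a search word with its surrounding context. The sketch below illustrates the general technique only and is not the KCP implementation. It assumes the text is already split into tokens. For Korean that could be space-separated word phrases, while Japanese would first need a segmenter. The function name `kwic` and the `width` parameter are hypothetical.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    `width` is the number of tokens kept on each side of the keyword.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    # Illustrative sentence split on whitespace.
    tokens = "나는 학교에 간다 너는 학교에 온다".split()
    for left, key, right in kwic(tokens, "학교에", width=1):
        print(f"{left:>10} [{key}] {right}")
```

Exact token matching is the simplest choice. A concordancer for an agglutinative language would more likely match on morphemes produced by a morphological analyser.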

Page 14

• Through World-Wide Terminology Collection and Their Standardization and Harmonization in Local Society,
• Distribution, Publication and Application in Language and Knowledge Engineering are promoted.
• Through Education and Consultation of Terminology R&D Methodology for Each Subject Field,
• High-Quality, Highly Reliable Terminology and Its Infrastructure and System are achieved.

Center of Terminology and Knowledge Engineering

Page 15

Phase 1 (1998-2000): R&D Environment and Basic Data Collection
Integration of Working Terminology
• Terminology Collection (Basic S&T, Industry Standard, Economics)
• Electronic Terminology (Publication)
• R&D Environment (System Standardization)
• Terminology Theory and Education Infrastructure

Phase 2 (2001-2003): Value-Added Working System
Value-Added Terminology Integration
• Terminology Collection (Extended S&T)
• Extension & Maintenance (Industry Standards)
• High-Quality Terminology
• Application in Language Industry
• Verification for High-Reliability and Distribution

Phase 3 (2004-2007): Operation
Multi-lingual Terminology Integration
• Terminology Collection (Humanities and Social Science)
• Maintenance and Extension
• Large-Scale Knowledge Base for Terminology
• Terminology Education Curriculum Development
• Application Product Development

Phase 4 (2008 - ): Maintenance and Extension
Continuous Extension and Management
• Terminology Study Promotion
• Distribution of Terminology Information Base
• Continuous Terminology Extension and Management

Page 16

• Basic Data (Corpus)
  • Corpus for Each Subject Domain
• Electronic Dictionary for Basic Vocabulary
  • Everyday Vocabulary consists of General Vocabulary and Everyday Terminology
• Internationalization of Korean Language
  • South-North Korean Terminology Standardization, Korean Language Input Methods
• Korean Language Engineering
  • Standardized Term Use for Information Retrieval, Machine Translation and Document Classification

Page 17

• Language Engineering
  • Information Retrieval:
    • Effective Internet Information Creation and Information/Knowledge Acquisition
    • Multi-lingualism
  • Machine Translation:
    • Efficient Information Generation through Terminology and Vocabulary Collection and Standardization
  • Wordprocessor:
    • High Productivity by Spelling Correction, Summarization and Efficient Use.

Page 18

• Language, Information and Terminology
  • Language Education:
    • Technical Thinking and Technical Communication
    • Terminology-based Education
  • Language Study:
    • Domain-specific Language Study

Page 19

• Support from Government, Organization and Industry according to each specialty
  • Ministry of Culture and Tourism (KORTERM Center Operation)
  • Ministry of Science and Technology (R&D Fund)
  • Ministry of Information and Telecommunication (R&D Fund)
  • Ministry of Diplomacy and Trade
  • Ministry of Industry and Resource
  • Ministry of Education
  • Korea Science and Technology Foundation (Event Support)

Page 20

[Figure: terminology infrastructure diagram. Labels: Terminology Base (Collection), Non-standards, International Term Standard, Terminology Standard, Language & Knowledge Product, Language Education Environment, Terminology Information Environment, R&D Environment, Application Use, Terminology Symbolization, Terminology Access Standard Channel, Grid Size Controller, Application-Specific Dictionary, Language Education Adaptable to Student, R&D, Industry, Living, Communication, Standardization & Harmonization, Terminological Conceptual Space.]

Page 22

[Figure: diagram with labels Organization, Test Suite, Specification Standardization, Language, Speech, Image.]

Page 23

• Test Suites for IR/QA
  • Documents
    • 207,067 records (370MB)
    • Newspapers
  • Query Generation
    • 90 queries (through 300 quiz query analysis)
    • Queries for WH-question and other various types of answers
    • for NLP problem solving
    • relevant document set to include the answer
    • by using four kinds of commercialized IR systems by 16 kinds of methods
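The last items on this slide describe building the relevant-document set from the output of several IR systems and methods. One common way to do that is pooling: merge the top-ranked documents of every run and have assessors judge only the merged pool. The slide does not say that the project used exactly this procedure, so the sketch below is an illustration only. The run names, document IDs and function names are hypothetical.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` document IDs from each ranked run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that were judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


if __name__ == "__main__":
    # Two hypothetical runs for one query; a real pool would merge all runs.
    runs = {
        "system_a": ["d1", "d2", "d3"],
        "system_b": ["d2", "d4", "d5"],
    }
    pool = build_pool(runs, depth=2)
    print(sorted(pool))  # documents handed to the assessors
    judged_relevant = {"d2", "d4"}  # assumed assessor judgments
    print(precision_at_k(runs["system_a"], judged_relevant, k=2))
```

Pooling keeps the judging effort proportional to the pool size instead of the full 207,067-record collection, at the cost of missing relevant documents that no run retrieved.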

Page 24

• Type Classification: About 300 Kinds
• Test Sentences and Test Query: 5,000 Records
  • Extracted from Textbook and Grammar books (1999-2000)
  • will be extracted from the Real usage like web, newspapers (2000-2001)
  • Evaluation by Yes/No Question
  • Tested for 4 Commercialized English-Korean MT Systems
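The evaluation scheme on this slide pairs each test sentence with a yes/no question about the MT output and groups the sentences by type. A minimal sketch of how such judgments could be aggregated into a pass rate per type follows. The record layout, the type labels and the function name `pass_rate_by_type` are assumptions for illustration, not the project's actual format.

```python
from collections import defaultdict


def pass_rate_by_type(judgments):
    """Aggregate yes/no judgments into a pass rate per sentence type.

    `judgments` is an iterable of (type_label, passed) pairs, where
    `passed` is True when the yes/no question was answered "Yes".
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for type_label, passed in judgments:
        totals[type_label] += 1
        if passed:
            passes[type_label] += 1
    return {t: passes[t] / totals[t] for t in totals}


if __name__ == "__main__":
    # Hypothetical judgments for one MT system.
    judgments = [
        ("relative clause", True),
        ("relative clause", False),
        ("passive", True),
    ]
    for type_label, rate in pass_rate_by_type(judgments).items():
        print(f"{type_label}: {rate:.0%}")
```

Running the same aggregation for each of the tested systems gives a per-type comparison table rather than a single overall score.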

Page 26

Metadata Input Workbench by XML
