Using XSL and XQL For Efficient, Customised Access To Dictionary Information

Kevin Jansz (kjansz@sultry.arts.usyd.edu.au), Department of Linguistics, University of Sydney, Australia
Jim Sng Wee, School of Applied Science, Nanyang Technological University, Singapore
Christopher Manning, Departments of Computer Science and Linguistics, Stanford University, USA
Nitin Indurkhya, School of Applied Science, Nanyang Technological University, Singapore

Objectives
• Provide innovative ways of representing a dictionary through creative use of web technology
• Provide practical, educationally useful access to information that can be customised to suit the needs of many users (at low labour cost)
• Examine the richness of lexical structure

Initial target: the Warlpiri dictionary.


Research Program: Lexicon
• A language is more than individual words with a definition
– it is a vast network of associations between words, and within and across the concepts represented by words
• Aim to provide people with a better understanding of this conceptual map
• Traditional paper dictionaries offer very limited ways of making such networks visible
• There are no such limitations on a computer


Research: Computational Lexicography
• Dictionaries on computers are now commonplace
– Few utilise the potential of the new medium
– Many present a plain, search-oriented representation of the paper version
• Goal: fun dictionary tools that are effective for language learning and browsing
– Like flicking through the pages of a paper dictionary
– Words are grouped by their meaning and their association with each other
– Key to the effectiveness of this browsing is that the user has control over the way this is presented


Initial focus: Warlpiri
• Warlpiri is an Australian Aboriginal language spoken in the Tanami desert (NW of Alice Springs)
• There are a number of factors influencing this choice:
– One of the most comprehensive lexical databases for any Australian language (Laughren & Nash 1983)
– Relatively large community of people interested in learning their traditional language
– Until now, results haven’t been produced in a format usable by the community (only raw printouts)


Target user community


Kirrkirr: A Warlpiri dictionary browser
(Jansz 1998; Jansz, Manning and Indurkhya 1999)
• An environment for the interactive exploration of dictionaries
• Current work has been only with Warlpiri, but the design is general (Arrernte coming soon!)
• Attempts to more fully utilise graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information
• It can be run either over the web (high bandwidth) or locally (here Java’s main advantage is cross-platform support)


Overview
• Animated graph layout of word relationships
• Formatted entries
• A Notes facility for ‘jotting in the margin’
• Multimedia: audio, pictures
• Advanced searching interfaces
• Semantic domain browsing
• Others in planning: formatting (XSL) editing, figuration patterns

These attempt to cater to users with different interests and competence levels.


The lexical database
• Original materials stored in an ad hoc markup format using backslash codes, with some (rather odd) nesting of structural tags
• These were converted to XML using an error-correcting, stack-based parser (written in Perl)
– The inconsistency and flexibility of dictionary entries actually made this a surprisingly difficult task
– But the parser tries to impose data integrity
• Use of XML gives a clear structure to the lexical data and makes many (free) tools available
• Result remains a portable, tangible text file
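
The slides do not show the backslash format or the Perl parser, so the following is only a minimal Python sketch of the general idea of a stack-based converter that repairs unbalanced nesting. The code names (`\entry`, `\sense`, closers written as `\e` plus the name, and field codes such as `\hw`) are hypothetical and chosen purely for illustration; field codes are assumed to be valid XML names.

```python
from xml.sax.saxutils import escape

# Hypothetical structural codes; the real Warlpiri codes are not given in the slides.
STRUCTURAL = {"entry", "sense"}


def backslash_to_xml(lines):
    """Convert backslash-coded lines to XML, repairing unbalanced nesting."""
    out = ["<DICTIONARY>"]
    stack = []  # names of the structural elements that are currently open

    def close_top():
        out.append(f"</{stack.pop().upper()}>")

    for raw in lines:
        line = raw.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and stray text
        code, _, value = line[1:].partition(" ")
        if code in STRUCTURAL:
            if code == "entry":
                # A new entry implicitly closes anything left open.
                while stack:
                    close_top()
            stack.append(code)
            out.append(f"<{code.upper()}>")
        elif code.startswith("e") and code[1:] in STRUCTURAL:
            name = code[1:]
            if name in stack:
                # Close forgotten inner elements until the match is reached.
                while stack[-1] != name:
                    close_top()
                close_top()
            # A closer with no matching opener is silently dropped.
        else:
            tag = code.upper()
            out.append(f"<{tag}>{escape(value.strip())}</{tag}>")

    while stack:  # close whatever is still open at end of input
        close_top()
    out.append("</DICTIONARY>")
    return "\n".join(out)
```

The stack is what lets the converter impose well-formedness: every structural element that was opened is guaranteed to be closed, in the right order, even when the input omits or misplaces a closing code.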


XML indexing - challenges
• Few XML parsers make single entries retrievable from the file
• Typically, the entire XML document is put in memory
• This is not practical when parsing significant XML databases (e.g., the Warlpiri dictionary is approx. 10Mb)


XML Dictionary Indexing (XDI)
• Hierarchical structure of XML lends itself to indexing
– Each entry in the XML file can be considered as a separate entity
• To make the Warlpiri dictionary usable for Kirrkirr, an ad hoc indexing system was developed
– Uses a slightly modified Ælfred XML parser
– Entries indexed by headword in a separate index file
• The system returns an XML document object containing the single dictionary entry, facilitating:
– processing for related words (Graph layout)
– XSL processing to HTML
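
As a rough illustration of indexing entries by headword and file position, here is a minimal Python sketch. It is not the Kirrkirr implementation, which used a modified Ælfred parser in Java; a simple regular-expression scan is swapped in here. It assumes that entries are written as plain `<ENTRY>...</ENTRY>` elements without attributes and that the headword sits in an `<HW>` child; the function names `build_index` and `fetch_entry` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

ENTRY_RE = re.compile(rb"<ENTRY>.*?</ENTRY>", re.DOTALL)
HW_RE = re.compile(rb"<HW>(.*?)</HW>", re.DOTALL)


def build_index(path):
    """Scan the file once and map each headword to (byte offset, length)."""
    with open(path, "rb") as f:
        data = f.read()  # one-off scan; the result could be saved to an index file
    index = {}
    for match in ENTRY_RE.finditer(data):
        hw = HW_RE.search(match.group(0))
        if hw:
            headword = hw.group(1).decode("utf-8").strip()
            # Homographs would overwrite each other in this simple sketch.
            index[headword] = (match.start(), len(match.group(0)))
    return index


def fetch_entry(path, index, headword):
    """Seek to one entry and parse only that fragment into an element."""
    offset, length = index[headword]
    with open(path, "rb") as f:
        f.seek(offset)
        return ET.fromstring(f.read(length))
```

After the index is built, a lookup reads and parses only the bytes of the requested entry, so memory use no longer depends on the size of the whole dictionary file.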


Kirrkirr’s XML Index Process

[Figure: flow diagram of the indexing process]

XML-formatted Warlpiri dictionary file (accessed across the file system or web):

<DICTIONARY>
<ENTRY>...</ENTRY>
<ENTRY>...</ENTRY>
<ENTRY>...</ENTRY>
</DICTIONARY>

Index in memory: one (headword, file position) pair per entry

Flow: Kirrkirr dictionary browser -> XML parser -> XML document object + XSL file -> XSL processor -> HTML document


XDI in Kirrkirr
• The XML indexing process considerably improves efficiency, as only requested entries are parsed
• Parsed entries are kept temporarily in a cache
• Thus Kirrkirr uses XML as a middle ground between the structure and indexing of a relational database and the freedom and functionality of text
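
The slides do not say how the cache decides what to discard, so the sketch below assumes a least-recently-used policy with a fixed size limit. The class name `EntryCache` and the `load_entry` callable (standing in for whatever parses a single entry on demand) are hypothetical.

```python
from collections import OrderedDict


class EntryCache:
    """Keep the most recently used parsed entries, up to max_size."""

    def __init__(self, load_entry, max_size=100):
        self._load = load_entry      # callable: headword -> parsed entry
        self._max = max_size
        self._items = OrderedDict()  # ordered from least to most recently used

    def get(self, headword):
        if headword in self._items:
            self._items.move_to_end(headword)  # mark as most recently used
            return self._items[headword]
        entry = self._load(headword)           # cache miss: parse on demand
        self._items[headword] = entry
        if len(self._items) > self._max:
            self._items.popitem(last=False)    # evict the least recently used
        return entry
```

Repeated visits to the same word, which are common when browsing between related entries, are then served without touching the file or the parser again.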


XQL - Potential
• An alternative to investigate for the future is using a standard query language, such as XQL, to get material out of the XML dictionary, rather than using our ad hoc index
• At the moment this is not a huge issue, since most retrieval is focussed on components of a particular word


XQL - Optimizations
• Revamp data structure
– reduce redundancy and the amount to load at start-up
• PDOM (Persistent Document Object Model)
– represents the XML document as a collection of objects in a tree-like model
• XQL (XML Query Language)
– query language for XML
– e.g. /DICTIONARY/ENTRY[9]
– DICTIONARY/ENTRY[HW='jaja']
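
No XQL engine is assumed here. The two example queries happen to have the same surface form in the limited XPath subset of Python's `xml.etree.ElementTree`, so the sketch below uses that as a stand-in to show what such path queries select. The sample data is placeholder content; only the `DICTIONARY`, `ENTRY` and `HW` element names and the headword 'jaja' come from the slides.

```python
import xml.etree.ElementTree as ET

# Placeholder dictionary: eight dummy entries followed by the 'jaja' entry.
entries = "".join(f"<ENTRY><HW>word{i}</HW></ENTRY>" for i in range(1, 9))
root = ET.fromstring(
    f"<DICTIONARY>{entries}<ENTRY><HW>jaja</HW></ENTRY></DICTIONARY>"
)

# Counterpart of /DICTIONARY/ENTRY[9]. In ElementTree's XPath subset the
# position is 1-based, so this is the ninth ENTRY child; XQL's own
# subscript convention may differ.
ninth = root.find("ENTRY[9]")

# Counterpart of DICTIONARY/ENTRY[HW='jaja']: every ENTRY that has an HW
# child whose text is exactly 'jaja'.
matches = root.findall("ENTRY[HW='jaja']")
```

The paths are written relative to `root`, which is the `DICTIONARY` element itself, so the leading `/DICTIONARY` step is dropped.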


Performance - Startup time
Impact on startup time:

Method                    Size of file                   Startup time
XML + XDI                 2.13Mb                         7min
One-PDOM                  12.5Mb                         13min4s
One-PDOM + Index          PDOM 12.5Mb, Index 520Kb       3min30s
Segmented PDOM + Index    PDOM 12.5Mb, Index 454Kb       55.48s
Optimised XDI             978Kb                          46s


Customised Presentation of Dictionary Content
• Produced dynamically from the XML by using XSL (via James Clark’s XT)
• XSL allows easy modelling of some user preferences
• This is useful as many users find information overload quite confusing and demotivating
• Can produce a bilingual or monolingual dictionary
• Opportunities for various output styles and formats, such as RTF or TeX for printing
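
Kirrkirr does this step with XSL stylesheets run through XT; the stylesheets themselves are not shown in the slides. Purely as a stand-in, the Python sketch below illustrates the underlying idea of one entry rendered in different ways depending on a user preference. `HW` appears in the slides, while the `DEF` and `GLOSS` element names, the `bilingual` switch and the function name `render_entry` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_entry(entry, bilingual=True):
    """Render one ENTRY element as an HTML fragment.

    HW comes from the slides; DEF and GLOSS are hypothetical element names.
    """
    parts = [f"<h2>{escape(entry.findtext('HW', default=''))}</h2>"]
    for definition in entry.findall("DEF"):
        parts.append(f"<p>{escape(definition.text or '')}</p>")
    if bilingual:
        # The monolingual view simply leaves the glosses out.
        glosses = [escape(g.text or "") for g in entry.findall("GLOSS")]
        if glosses:
            parts.append(f"<p><i>{', '.join(glosses)}</i></p>")
    return "\n".join(parts)
```

Because the entry data stays untouched and only the rendering changes, switching a preference never requires regenerating or duplicating dictionary content.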


Performance - XSL Presentation
• Creates minimal load on the application
• Requires file creation permission for the applet
• Takes load off the file system (no need for 9000+ pre-generated files)
• Gives the user the opportunity to customise the formatting


Conclusions
• While we have focused our research on Warlpiri, the system can be easily applied to other languages
• The key to the effectiveness of the browsing interfaces is that the user can customise their functionality, thanks to the flexibility of XML and the Kirrkirr technology
• Throughout this research, the educational interests of the user have been the highest priority
• Hope to better understand the usefulness and practicality of innovative dictionary browsing environments


Links

• Kirrkirr homepage: http://www.sultry.arts.usyd.edu.au/kirrkirr
