Kirrkirr: Software for the Flexible and Interactive Visualization of a Structured Warlpiri Dictionary Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz Linguistics, University of Sydney Nitin Indurkhya Applied Science, Nanyang Technological University http://www.sultry.arts.usyd.edu.au/


Page 1: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Kirrkirr: Software for the Flexible and Interactive Visualization of a Structured Warlpiri Dictionary

Christopher Manning, Computer Science and Linguistics, Stanford University

Kevin Jansz, Linguistics, University of Sydney

Nitin Indurkhya, Applied Science, Nanyang Technological University

http://www.sultry.arts.usyd.edu.au/kirrkirr/

Page 2: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Research Program: Lexicon

A language is more than individual words with a definition
– it is a vast network of associations between words and within and across the concepts represented by words

The aim of this work is to provide a wide variety of users – not just linguists – with a better understanding of this conceptual map.

Traditional paper dictionaries offer very limited ways for making such networks visible

On a computer, there are no such limitations to the way information can be displayed.

Page 3: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Research: Computational Lexicography

Dictionaries on computers are now commonplace
– But there has been little attempt to utilise the potential of the new medium
– Most present a plain, search-oriented representation of the paper version

Goal: fun dictionary tools that are effective for browsing and language learning (cf. Kegl 1995)
– Like flicking through a paper dictionary, but better
– Innovative ways for representing and linking dictionary information, through creative use of computer software
– Should improve user supports and incidental learning

Focus: exploration/dissemination, not creation

Page 4: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Initial focus: Warlpiri

Warlpiri is an Australian Aboriginal language spoken in the Tanami desert (NW of Alice Springs)

There are a number of factors influencing this choice:

– Rich lexical materials have been collected by linguists over decades (Ken Hale, MIT, from 1950s, Simpson, Nash, Laughren, Hoogenraad) resulting in the most comprehensive lexical databases for any Australian language

– Warlpiri is the first language of a relatively large community of people. There is reasonable vernacular literacy

– Until now, results haven’t been produced in a format usable by the community (only raw printouts) – which is not really acceptable. Fixing this is also good science: for subtle linguistic judgments, one needs speaker involvement.

Page 5: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Educational goals

Dictionary structure and usability are often dictated by professional linguists, while the needs of others (speakers, semi-speakers, young users, second language learners) are not met. Focus: school kids.

The low level of literacy in the region makes an e-dictionary potentially more useful than a paper edition

• less dependent on good knowledge of spelling and alphabetical order
• builds on captivating qualities of computers
• multimedia content and the pronunciations of words are a considerable help as well

Page 6: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Kirrkirr: A Warlpiri dictionary browser

(Jansz 1998; Jansz, Manning and Indurkhya 1999)

An environment for the interactive exploration of dictionaries.

Although our current work has just been with Warlpiri, the design is general – any XML dictionary

Attempts to more fully utilise graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information

Written in Java, it can either be run over the web (needs bandwidth) or locally (here Java’s main advantage is cross-platform support: Win/Mac/Unix)
– originally JDK1.1.6+Swing, now Java 2

Page 7: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Overview

Kirrkirr provides various modules
– Animated network layout of word relationships
– Formatted dictionary entries
– Semantic domains display
– A notes facility for ‘jotting in the margin’ annotations
– Multimedia: audio, pictures
– Advanced searching interfaces

others in planning: formatting (XSL) editing, figuration patterns, semantic domain browsing, terminology sets

These attempt to cater to users with different interests and competence levels

Page 8: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz
Page 9: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

The lexical database

Original text materials are stored in an ad hoc markup format using backslash codes, with some (rather odd) nesting of structural tags [origin: runoff]

These are converted to XML using an error-correcting stack-based parser (written in Perl)
– The inconsistency and flexibility of dictionary entries actually made this a surprisingly difficult task.
– Innumerable structural errors/inconsistencies/typos from years of hand maintenance in text editors and via regexps
– Heuristic content-sensitive parser imposes data integrity

XML gives data an explicit, manipulable structure

Result remains a portable text file
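
The conversion step above can be illustrated with a minimal Python sketch of a stack-based, error-correcting converter. It is not the Perl parser described on the slide. The backslash codes (`\entry`, `\hw`, `\def`), the convention that `\eX` closes `\X`, and the function name `backslash_to_xml` are all assumptions made for illustration. The repair rules shown (auto-closing forgotten inner elements, dropping stray closers) are only one plausible heuristic.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical markup: "\tag" opens an element and "\etag" closes it.
TOKEN = re.compile(r"\\(\w+)")

def backslash_to_xml(text, known_tags):
    """Convert backslash-coded text to XML, repairing simple nesting errors."""
    out, stack, pos = [], [], 0
    for m in TOKEN.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            out.append(escape(chunk))      # plain text between codes
        pos = m.end()
        code = m.group(1)
        if code in known_tags:             # opening code
            stack.append(code)
            out.append(f"<{code.upper()}>")
        elif code.startswith("e") and code[1:] in known_tags:
            name = code[1:]
            if name in stack:
                # Repair: close any inner elements left open, then this one.
                while True:
                    top = stack.pop()
                    out.append(f"</{top.upper()}>")
                    if top == name:
                        break
            # else: stray closing code with no matching opener; drop it
        else:
            out.append(escape(m.group(0)))  # unknown code kept as literal text
    tail = text[pos:].strip()
    if tail:
        out.append(escape(tail))
    while stack:                            # Repair: close whatever is still open
        out.append(f"</{stack.pop().upper()}>")
    return "".join(out)
```

For example, `backslash_to_xml(r"\entry \hw HEADWORD \ehw \def gloss text \eentry \edef", {"entry", "hw", "def"})` closes the unterminated definition when the entry ends and discards the stray `\edef`.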

Page 10: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Kirrkirr’s XML Index Process

XML-formatted Warlpiri dictionary file:

<DICTIONARY>

<ENTRY>...</ENTRY>

<ENTRY>...</ENTRY>

<ENTRY>...</ENTRY>

</DICTIONARY>

[Diagram: the dictionary file is read across the file system or web and indexed in memory as (headword, file position) pairs. Components shown within the Kirrkirr dictionary browser: XML parser, XML document object, XSL file, XSL processor, HTML document.]

Page 11: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

XML Indexing

We are currently using ad hoc indexing of one large XML file

This gives adequate speed/memory use, but requires a modified XML parser to extract and parse 1 entry

We have also experimented with an XQL version using a PDOM (GMD-IPSI): more flexible, but slower

Parsed entries are cached
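
A rough sketch of the indexing idea, in Python rather than Kirrkirr's Java: scan the file once to record where each entry starts, then seek to that position and parse only the requested entry, caching the result. The `<HW>` headword element, the function names, and storing an entry length alongside the file position are assumptions for illustration. Reading the whole file to build the index is a simplification.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed layout: each <ENTRY> contains a <HW> element holding its headword.
ENTRY = re.compile(rb"<ENTRY>.*?</ENTRY>", re.DOTALL)
HEADWORD = re.compile(rb"<HW>(.*?)</HW>", re.DOTALL)

def build_index(f):
    """Map each headword to (byte offset, byte length) of its entry."""
    f.seek(0)
    data = f.read()
    index = {}
    for m in ENTRY.finditer(data):
        hw = HEADWORD.search(m.group(0))
        if hw:
            key = hw.group(1).decode("utf-8").strip()
            index[key] = (m.start(), m.end() - m.start())
    return index

def load_entry(f, index, headword, cache):
    """Seek to one entry, parse just that entry, and cache the parsed tree."""
    if headword not in cache:
        offset, length = index[headword]
        f.seek(offset)
        cache[headword] = ET.fromstring(f.read(length))
    return cache[headword]

# Usage with an in-memory stand-in for the dictionary file.
demo = io.BytesIO(
    b"<DICTIONARY>"
    b"<ENTRY><HW>aaa</HW><DEF>first</DEF></ENTRY>"
    b"<ENTRY><HW>bbb</HW><DEF>second</DEF></ENTRY>"
    b"</DICTIONARY>"
)
demo_index = build_index(demo)
demo_cache = {}
demo_entry = load_entry(demo, demo_index, "bbb", demo_cache)
```

Only the small index stays in memory. The full document is never parsed as a whole, which is the trade-off the slide describes.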

Page 12: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Performance - Startup time

Impact on startup time [200 MHz Pentium]:

Method                    Size of file                   Startup time
XML + index               Index: 2.13 MB                 7 min
One-PDOM                  PDOM: 12.5 MB                  13 min 4 s
One-PDOM + Index          PDOM: 12.5 MB, Index: 520 KB   3 min 30 s
Segmented PDOM + Index    PDOM: 12.5 MB, Index: 454 KB   55.48 s
XML + Optimised index     Index: 481 KB                  46 s

Page 13: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Visualization of dictionary information

For dictionaries with simple textual content behind them, there is little that can be done but an on-line reflection of a printed page

But we would like to be able to do more
– we want to know a word’s relationships to other words, and the patterning in these relationships

In a computational approach, the program can mediate between lexical data and the user

The interface can select from and choose how to present information (according to the user’s preferences and abilities) – in many different ways

Page 14: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Perils of visualisation

Page 15: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Graph-based visualisation

(Jansz 1998; Jansz, Manning and Indurkhya 1999)

Classic graph layout problem

Adapts work by Eades et al. (1998) and Huang et al. (1998) on visualisation and navigation of WWW document linkages

Uses the spring algorithm. Big advantage is that it is an iterative updating algorithm, and so gives an easy interactivity:
– it wiggles and people can play with it, clicking to sprout nodes

A major goal was clarity and simplicity of the graph: the software maintains a set of focus nodes to prevent overcrowding
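
To make the "iterative updating" point concrete, here is a minimal Python sketch of one update step of a generic spring layout. It is not Kirrkirr's code, and the force model (linear springs along edges, inverse-square repulsion between all node pairs) and every constant are assumptions chosen for illustration.

```python
import math

def spring_step(pos, edges, rest_length=1.0, k_spring=0.1, k_repel=0.05):
    """Return new positions after one small move along the net force on each node."""
    force = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)
    # Every pair of nodes pushes apart, which keeps the picture uncluttered.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k_repel / (d * d)
            force[a][0] += f * dx / d
            force[a][1] += f * dy / d
            force[b][0] -= f * dx / d
            force[b][1] -= f * dy / d
    # Linked words are pulled toward a preferred edge length.
    for a, b in edges:
        dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
        d = math.hypot(dx, dy) or 1e-9
        f = k_spring * (d - rest_length)
        force[a][0] += f * dx / d
        force[a][1] += f * dy / d
        force[b][0] -= f * dx / d
        force[b][1] -= f * dy / d
    return {n: (pos[n][0] + force[n][0], pos[n][1] + force[n][1]) for n in pos}
```

Because each call only nudges the nodes, a display can redraw after every step. A node that the user drags or sprouts simply changes `pos` or `edges` before the next call, and the layout keeps settling from there.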

Page 16: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Kirrkirr network display

Page 17: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Formatted dictionary entries

Are produced automatically and online from the XML by using XSLT – a tree transformation language

XSL allows easy modelling of some user preferences

One can leave out information such as part of speech, or detailed definitions, or rearrange it

We provide several stylesheets to choose from

This issue is surprisingly important: many users find information overload confusing and demotivating

Can produce a bilingual or monolingual dictionary

Can also use this for print dictionaries (via RTF or TeX). We have produced a couple of samples.

Page 18: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Formatted dictionary entries

Page 19: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Rich typology of link types

The semantic links present in a dictionary (synonym, antonym, hyponym, subentry, variant, coverbs, …) solve a major problem of the web: we have many link types each with a clear semantic interpretation

We use consistent colour-coding of text and network edges to show these link types

Gives a richer browsing experience

You can tell where you are going before clicking

Dictionary-given links are supplemented by links derived from collocational analysis of Warlpiri texts
– uses log-likelihood ratios (Dunning 1993)
– works reasonably successfully from 1/4 million words
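
As a hedged illustration of the collocation scoring mentioned above, the following Python sketch computes the log-likelihood ratio statistic (G²) for a 2×2 contingency table of word co-occurrence counts. The function name and the argument layout are assumptions. How Kirrkirr counts co-occurrences or thresholds the score is not stated in the slides.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 for a 2x2 table of counts.

    k11: contexts containing both words    k12: first word without the second
    k21: second word without the first     k22: contexts containing neither
    """
    total = k11 + k12 + k21 + k22
    row = (k11 + k12, k21 + k22)
    col = (k11 + k21, k12 + k22)
    cells = (
        (k11, row[0], col[0]),
        (k12, row[0], col[1]),
        (k21, row[1], col[0]),
        (k22, row[1], col[1]),
    )
    g2 = 0.0
    for observed, r, c in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = r * c / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

A score near zero means the two words co-occur about as often as chance predicts. Larger scores indicate a stronger association, so candidate links can be ranked by this value.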

Page 20: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Semantic domain browsing

A common request of teachers and users is to view words via semantic domains

Page 21: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Educational advantages/usability

Work (at PARC and elsewhere: Pirolli et al. 1996) has stressed the role for browsing as well as searching in information access

It provides a context for learning

A student can opportunistically explore words that are related in various ways

Important semantic relationships can be understood

People continually see alphabetical order and word spellings, but don’t need to know them to use Kirrkirr

Use of “fuzzy spelling” in searches supports users with poor spelling. It usually finds what you wanted.
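
The slides do not say how fuzzy spelling is implemented. Purely as an illustration, approximate headword lookup can be sketched in Python with the standard library's `difflib.get_close_matches`, which ranks candidates by string similarity. The function name, the cutoff value, and the sample headword strings are assumptions, not data from the dictionary.

```python
import difflib

def fuzzy_lookup(query, headwords, max_results=3, cutoff=0.6):
    """Return the headwords most similar to a possibly misspelled query."""
    return difflib.get_close_matches(
        query.lower(), headwords, n=max_results, cutoff=cutoff
    )

# Illustrative headword list (assumed lowercase strings, not real data).
sample_headwords = ["jarntu", "ngapa", "karnta", "wati"]
suggestions = fuzzy_lookup("jantu", sample_headwords)
```

A language-specific approach, for example one that treats commonly confused letter sequences as equivalent, would likely suit learners better than generic string similarity. This sketch only shows the general shape of the feature.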

Page 22: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Other components

Multimedia (currently pictures and audio)

– Can hear pronunciations – gives a much better understanding of pronunciation than phonetic symbols

– pictures of plants and animals are more intelligible than descriptions

– (future: videos of Warlpiri sign language …)

Advanced search page
– search various fields, regular expressions, fuzzy spelling, etc.

Notes:
– one can annotate dictionary entries (to correct or personalise)

Page 23: Christopher Manning Computer Science and Linguistics, Stanford University Kevin Jansz

Interim Conclusions

Kirrkirr is a prototype of what one can do to develop new ways to organize and visualize lexicons

We have addressed the challenge of making dictionary information accessible and usable in the creation of an application which mediates between well-structured data and users’ needs and insights in searching/browsing and presentation

The interface has this year started being regularly used in Warlpiri schools – one school at the moment, hopefully more to follow soon:
– “Look it up on that thing!”
