AVML: Advances in Visual Methods for Linguistics
September 5-7, 2012
Abstracts



Table of Contents

Visualizing Linguistic Data: From Principles to Toolkits for Doing it Yourself — Verena Lyding, Chris Culy

Visualizing Linguistic Data Using the Grammar of Graphics: ggplot2 — Josef Fruehwald

Discursis: An Interactive Visual Discourse Analysis System — Daniel Angus, Andrew Smith, Janet Wiles

Visualization of Linguistic Data Using Generalized Additive Models — R. Harald Baayen

Visualizing Language Features Together with their Genealogical and Areal Information — Thomas Mayer, Michael Hund, Christian Rohrdantz, Bernhard Wälchli

Dialectal Data Visualization and Statistical Manipulation with ggmap-package in R — Aki-Juhani Kyröläinen, Kristel Uiboaed

Visualization in Regional Dialectology using Scalable Vector Graphics — Jack Grieve

Visualizing Dialect Change as Such: Factoring out the Role of the Standard Language — Wilbert Heeringa, Frans Hinskens

Nothing Beats a Picture Except an Interface: Interactive Visualisation for Corpus Linguistics — Tanja Säily, Terttu Nevalainen, Harri Siirtola

Real Time Aesthetic Visualisation of NLP-driven Semantic Pathways — Christopher Rowland, John Anderson

Visual Exploration of the Lexical Resource Saldo Using Freely Available Visualization Tools — Daniela Oelke, Markus Forsberg, Lars Borin

Plotting Speakers' Vowel Systems in Real-Time Interaction: A First Approach — Anne Fabricius, Charlotte Vaughn, Tyler Kendall

Web-Based Visualisation of Multi-Dimensional Linguistic Annotation — Daniel Jettka

Visualizing Vowels: Restoring Some Lost Images — Michael Ashby

Speech as Visible Patterns of Sound — Mark Huckvale

Mapping the BBC Voices — John Holliday

Visualising Perceptual Dialectology Data Using Geographical Information Systems — Chris Montgomery, Philipp Stoeckle

Maps as a Central Linguistic Research Tool — Joel J. Priestley, Janne Bondi Johannessen, Kristin Hagen, Anders Nøklestad, André Lynum

Geographic Information System (GIS) and Perceptual Dialect Mapping — Betsy E. Evans, Matthew D. Dunbar

Generating Visual Insights into Effective Doctor-Patient Consultations — Daniel Angus, Janet Wiles, Andrew Smith

Transcripts Beyond Text: Tools and Techniques for Visualizing and Quantifying Discourse — Tyler Kendall

Visualizing Spoken Conversation Structure — Li-chiung Yang

Knowledge Visualization and the Depiction of Conceptual Relations in a Multimodal Terminological Database — Juan Antonio Prieto Velasco, Clara Inés López Rodríguez

Visual Methods for Figurative Meaning Explanation in Science — José Manuel Ureña Gómez-Moreno

Some Challenges and Directions for the Visualization of Language and Linguistic Data — Chris Culy

Visualisation of Prosody in English and Arabic Speech Corpora — Claire Brierley, Majdi Sawalha

Visualization of Corpus Composition for Machine Translation — Marco Brunello

Measuring the Optimization of Vowel-Spaces: A Method for Cross-Linguistic Analysis — Jon William Carr

Visual Methods for Understanding the Language of the Quran — Kais Dukes, Eric Atwell

Investigation into Terrorist Activity: VAST 2011 Challenge — Sharmin (Tinni) Choudhury, Chris Rooney, Eric Atwell, Claire Brierley, Kai Xu, Raymond Chen, William Wong

Men, Women and Gods: Distant Reading in Literary Collections – Combining Visual Analytics with Language Technology — Dimitrios Kokkinakis, Daniela Oelke

An Online Visual Articulatory Resource for Phonetics Teaching and Independent Study — Eleanor Lawson, Jane Stuart-Smith

Visualizing Collocational Environments in Interdisciplinary Discourse — David Oakey

Visualisation of Arabic Morphology — Majdi Sawalha, Eric Atwell

Visualising Spoken Language Transcriptions – Old Principles and New Opportunities — Thomas Schmidt


WORKSHOP

Visualizing Linguistic Data: From Principles to Toolkits for Doing it Yourself

Verena Lyding (EURAC) Chris Culy (University of Tübingen)

The goal of this workshop is to give a practical guide to visualizing linguistic data. We will start with some fundamental principles of visualization design and discuss issues in linking information visualization and language data. We will show how respecting properties of visual perception can enhance visualizations: the "do's and don'ts" of visualization design. This will lead to a discussion of some visualization tools and toolkits that are suited to linguistic visualizations, including some of our own tools. In addition, there will be hands-on practice visualizing linguistic data using these tools. Some sample data will be provided, but participants are encouraged to bring samples of their own data as well.



Visualizing Linguistic Data Using the Grammar of Graphics: ggplot2

Josef Fruehwald (University of Pennsylvania)

The goal of this workshop will be to provide participants with the necessary foundations to produce complex exploratory and expository data graphics using ggplot2, an implementation of the grammar of graphics (Wickham, 2009; Wilkinson, 2005). As an R library, ggplot2 is rapidly becoming the most popular environment for producing graphics, largely due to its flexible and intuitive nature.

The flexibility of ggplot2 is attributable to its underlying grammar. Many statistical graphics are uniquely named (scatter plot, histogram, bar chart, boxplot, etc.), and the strategy of some graphing environments is to provide a plotting option for each named type (e.g. from base R, plot(), hist(), barplot(), boxplot(), or the large menu of options in Microsoft Excel). The insight from Wilkinson (2005), which was implemented in ggplot2, was that all of these plot types can be described as the combination of a small set of primitive graphical elements, statistical functions, and coordinate systems. As with other grammatical systems participants are likely to be familiar with, this combinatorial system of graphical primitives allows for incredible power and flexibility in graph formulation. Moreover, the combinatorial system is defined in terms of layers in ggplot2, which proves to be very intuitive for the purpose of building complex graphics.
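The combinatorial idea can be sketched in a few lines of Python (a toy illustration only: the class and function names here are invented, and ggplot2's real grammar also includes scales, coordinate systems and facets):

```python
# Toy sketch of the "grammar" idea: a plot is data plus a stack of layers,
# each layer pairing a geometric primitive with a statistical transformation.

def identity(data):
    """Default statistic: pass the data through unchanged."""
    return data

def bin_count(data):
    """A simple statistic: count occurrences of each x value."""
    counts = {}
    for x, _ in data:
        counts[x] = counts.get(x, 0) + 1
    return sorted(counts.items())

class Layer:
    def __init__(self, geom, stat=identity):
        self.geom = geom      # e.g. "point", "bar", "line"
        self.stat = stat      # statistic applied before drawing

class Plot:
    def __init__(self, data):
        self.data = data
        self.layers = []
    def __add__(self, layer):          # mimic ggplot2's `+` composition
        self.layers.append(layer)
        return self
    def render(self):
        # "Rendering" here just reports what each layer would draw.
        return [(layer.geom, layer.stat(self.data)) for layer in self.layers]

data = [(1, 2.0), (1, 2.5), (2, 3.1)]
p = Plot(data) + Layer("point") + Layer("bar", stat=bin_count)
# One dataset, two "named plot types" (scatter and bar chart),
# expressed as combinations of the same primitives.
```

The point of the sketch is that a scatter plot and a bar chart are not separate plotting commands but the same data passed through different geom/stat pairs.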

Graphical exploration of data is a crucial first step in the process of statistical analysis. The majority of the workshop will therefore focus on the production of exploratory graphics using ggplot2. We will start with how to produce graphics with the simplest kinds of graphical elements (points, bars, lines, etc.). From there, we will add more than two dimensions to our graphics, by mapping data dimensions to the color, size, shape, and transparency of graphical elements, and by faceting (a.k.a. small multiples (Tufte, 1983)). Finally, we will review how to plot data summaries (such as histograms, boxplots, and means with confidence intervals), and regression summaries of various types.

Graphical presentation of data is also one of the most powerful tools for communication of research results. After covering the basics of the ggplot2 system, we will cover some basics of fine-tuning and polishing graphics for public display or publication. We will briefly discuss changing color schemes to match personal taste or style guidelines, retitling axis and legend labels, and other smaller aesthetic adjustments.

A modicum of knowledge of the R language will be assumed.

Figure 1: Rapidly produced exploratory graphic, utilizing a two dimensional data summary.

Figure 2: Graphic prepared for a black-and-white publication.

Figure 3: Rapidly produced exploratory graphic, overlaying a loess smoothing line.

Figure 4: Polished graphic with cubic regression splines overlaid.


References:

Tufte, Edward. 1983. The Visual Display of Quantitative Information. Graphics Press.

Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer. URL: http://had.co.nz/ggplot2/book.

Wilkinson, Leland. 2005. The Grammar of Graphics. Statistics and Computing. Springer, 2nd edition.

Figure 5: Rapidly produced exploratory graphic, displaying average rate of ING variation by speaker.

Figure 6: Polished graphic.


Discursis: An Interactive Visual Discourse Analysis System

Daniel Angus (School of Information Technology and Electrical Engineering, School of Journalism and Communication, The University of Queensland)

Andrew Smith (Institute for Social Science Research, The University of Queensland)

Janet Wiles (School of Information Technology and Electrical Engineering, The University of Queensland)

Discursis is a new computer-based tool for analysing human communication. Communication can be in the form of conversations, web forums, training scenarios, and many more. Discursis automatically processes transcribed text to show participant interactions around specific topics and over the time-course of the conversation. Discursis can assist practitioners in understanding the structure, information content, and inter-speaker relationships that are present within input data as reflected by the patterns of topic use.

Discursis works by processing input text (conversation transcript) data to determine the conceptual content of each conversation turn. Each turn in the conversation is coded with the concepts that are present in that turn, and turns containing similar concepts are then linked in an interactive visualisation by shading the vertically and horizontally adjacent elements below the diagonal according to the strength of match, and owner of the content (see example in Figure 1).
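The turn-matching step can be sketched as follows (a simplified stand-in: turns are reduced to word sets and compared with Jaccard overlap, whereas Discursis itself uses a richer concept model):

```python
# Sketch of a conceptual recurrence matrix: each turn is reduced to a set
# of terms, and cell (i, j) below the diagonal holds the overlap between
# turn i and the earlier turn j.

def jaccard(a, b):
    """Set overlap in [0, 1]; a crude stand-in for concept similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recurrence_matrix(turns):
    concepts = [set(t.lower().split()) for t in turns]
    n = len(turns)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):            # only cells below the diagonal
            m[i][j] = jaccard(concepts[i], concepts[j])
    return m

# Three invented turns from a doctor/patient exchange:
turns = [
    "how is the pain today",        # doctor
    "the pain is much better",      # patient
    "good any pain at night",       # doctor
]
m = recurrence_matrix(turns)
# m[1][0] reflects the vocabulary turns 0 and 1 share ("the", "pain", "is").
```

In the real system each below-diagonal cell would additionally be shaded by which speaker owns the matching content.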

Discursis has been used to perform an analysis of conversations from Australian television talk shows (Angus, Smith, & Wiles, 2011), to analyse topic usage patterns of children with autism as contrasted with typically developing children of a similar age (Lai, Reilly, Wiles, Angus, & Smith, 2011), and to look for patterns of topic use and turn-taking by doctors and their patients (Watson, Angus, Farmer, Wiles, & Smith, 2011). These studies indicate that discourse analysts could use the system as a decision support tool, to confirm pre-held hypotheses about the type and magnitude of interaction between conversation participants, and as a forensic tool, to discover patterns of interaction and interesting time periods where conversation participants demonstrated topic convergence. Further details are available at discursis.com

Workshop Details

In this workshop we will introduce participants to the fundamental theory behind Discursis, and offer a series of case studies illustrating the functionality of the Discursis system. Datasets are all English language transcripts and include:

■ Doctor/Patient consultations

■ Television interviews

■ Conversations with persons with language disorders (Autism, Dementia)

Workshop participants should expect to gain a basic understanding of the Discursis software and its intended use; and, in conjunction with further reading, the necessary skills to begin using the software on their own data.

Intended workshop duration: 2 hours.

References:

Angus, D., Smith, A., & Wiles, J. (2011). Conceptual recurrence plots: Revealing patterns in human communication. IEEE Transactions on Visualization and Computer Graphics (in press).

Lai, J., Reilly, J., Wiles, J., Angus, D., & Smith, A. E. (2011). Conversational Narratives in School-Age Children With High-Functioning Autism. Paper presented at the 2011 American Speech-Language-Hearing Association Convention.

Watson, B., Angus, D., Farmer, J., Wiles, J., & Smith, A. E. (2011). Evaluating Effective Open Disclosure Through Visualisation: What Works and for Whom? Paper presented at the 61st Annual Conference of the International Communication Association.

Figure 1: Conceptual recurrence plot of 13 utterances and 4 corresponding recurrence elements from a doctor/patient consultation. The patient is coloured red and the doctor is coloured blue. Recurrence between the patient and the doctor is indicated by a half/half coloured square, and self-recurrence is in the speaker’s own colour.


KEYNOTE

6 September (Conference Day 1)

Keynote 1

Visualization of Linguistic Data Using Generalized Additive Models

R. Harald Baayen (University of Tübingen & University of Alberta)

Generalized additive models (GAMs, see, e.g., Wood, 2006) provide a flexible toolkit for modeling complex prediction surfaces and hypersurfaces. For understanding such surfaces, visualization, for instance with contour plots or perspective plots, is essential. The goal of my presentation is twofold.

First, I will illustrate the potential of GAMs for linguistic research, using as examples data from a dialectometric study (Wieling et al., 2011), a study using evoked response potentials to auditory stimuli (Kryuchkova et al., 2012), and work in progress on the analysis of first fixation durations in an eye-tracking study of compound reading (Kuperman & Baayen, in progress).

Second, the complex regression surfaces revealed by GAMs can be quite difficult to interpret. I will argue that rather than ignoring unwelcome complexity, we should embrace it, and search for computational models that correctly predict these complex surfaces that GAMs detect in linguistic data. I will discuss one example data set concerning the lexical processing of compounds, contrasting two very different computational approaches (Baayen, 2010 and Baayen et al., 2011) to understanding the observed regression surfaces.

On the basis of these examples, I will argue that GAMs provide an analytic tool in which visualization and theory-building go hand in hand and have actually become inseparable.

References:

Baayen, R.H. (2010) The directed compound graph of English. An exploration of lexical connectivity and its processing consequences. In S. Olson (ed.), New impulses in word-formation (Linguistische Berichte Sonderheft 17), Buske, Hamburg, 383-402.

Baayen, R. H., Milin, P., Filipovic Durdevic, D., Hendrix, P. and Marelli, M. (2011), An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review. 118, 438-482.

Kryuchkova, T., Tucker, B. V., Wurm, L., and Baayen, R. H. (2012). Danger and usefulness in auditory lexical processing: evidence from electroencephalography. Brain and Language. 122, 81-91.

Wieling, M., Nerbonne, J. and Baayen, R. H. (2011). Quantitative Social Dialectology: Explaining Linguistic Variation Geographically and Socially. PLoS ONE 6(9): e23613. doi:10.1371/journal.pone.0023613.

Wood, S. (2006). Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.


PRESENTATION

Session 1: Mapping I

Visualizing Language Features Together with their Genealogical and Areal Information

Thomas Mayer (LMU München) Michael Hund (University of Konstanz)

Christian Rohrdantz (University of Konstanz) Bernhard Wälchli (University of Stockholm)

The field of areal typology investigates typological, genealogical and areal properties of languages with the goal of finding interesting patterns in the geographical distribution of language features. One of the major challenges in this enterprise is to separate those cases where a given geographical clustering of features is due to a shared genealogical relationship between the languages from those where an agglomeration of the same or similar feature values is a consequence of language contact.

This paper presents a novel visual analytics approach that helps researchers distinguish between these two cases and find potentially interesting contact situations, by visually presenting geographical and genealogical information about languages together with the feature values that have been selected by the user. To this end, we introduce an extended Sunburst visualization (cf. Stasko and Zhang, 2000) which allows for an at-a-glance cross-comparison of a set of features within the hierarchical context of the language genealogy. On top of that, we propose an integration of areal information into the Sunburst visualization by dividing the world into six macro-areas as described by Dryer (1992). Each macro-area is mapped to a different color and can also be inspected on an integrated world map to which all data points are linked.

The design of the visualization is thus inspired by insights from typological research and provides the functionalities in one single visual analytics environment. Figure 1 gives an overview of this environment whose main component is an extended Sunburst visualization. The Sunburst represents the genealogical hierarchy of the languages by plotting languages of the same branch in the hierarchy to the same segment of the circle. The saturation of the gray tone of an inner “node”, which is automatically computed according to a metric distance function of Swadesh list items in the languages, enables the user to differentiate more easily between the various language families. The rings of the circle are used to display the language features as well as the geographical information where the language is approximately spoken (innermost ring). The latter is represented by the respective color from the macro-area distribution. The former are visualized in the outer rings in different ways depending on the level of measurement of the individual features that are to be displayed. For nominal data, the different categories of the feature are mapped to different color hues; for ordinal data, the ranks are encoded by different saturation levels. Finally, quantitative feature rings show their values in a histogram. As a result, all three relevant components of an areal typological investigation (feature values, geographical and genealogical information) are combined in one visualization.
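The two encoding rules for the feature rings can be sketched as follows (hypothetical helper functions, using Python's standard colorsys module; the system's actual color assignments are not specified here):

```python
import colorsys

# Nominal features: each category gets a distinct, evenly spaced hue.
# Ordinal features: ranks share one hue and vary only in saturation.

def nominal_color(category, categories):
    """Map a category to an RGB triple with its own hue."""
    hue = categories.index(category) / len(categories)
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

def ordinal_color(rank, n_ranks, hue=0.6):
    """Map rank 1..n_ranks to increasing saturation of a fixed hue."""
    return colorsys.hsv_to_rgb(hue, rank / n_ranks, 1.0)

# Hypothetical word-order feature with three nominal values:
cats = ["SOV", "SVO", "VSO"]
rgb_svo = nominal_color("SVO", cats)   # distinct hue per category
rgb_low = ordinal_color(1, 4)          # pale: low rank
rgb_high = ordinal_color(4, 4)         # fully saturated: high rank
```

This mirrors the division of labour described above: hue distinguishes unordered categories, while saturation preserves the ordering of ranked values.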

Figure 1. Overview of the main components of the system including short descriptions.

In order to test the usefulness of our approach we automatically extracted typologically relevant features from parallel texts, which served as the input for the visualization method. We provide evidence of the good performance of the system with some examples where interesting patterns can be detected by merely looking at the Sunburst component, without reading and comparing feature values.


Dialectal Data Visualization and Statistical Manipulation with ggmap-package in R

Aki-Juhani Kyröläinen (University of Turku) Kristel Uiboaed (University of Tartu)

Cartographic visualization has always been a very important part of dialectological research and data representation, and contemporary technical possibilities have expanded the available visualization techniques. Larger datasets and corpora have also become more common in dialectological research, which has increased the demand for methodologies and techniques for dialectal data visualization and statistical manipulation.

There is already a variety of applications available for such purposes, such as VDM, developed in Salzburg (Goebl 2006), and Gabmap, developed in Groningen (Nerbonne et al. 2011), which have made many kinds of variational data analysis and visualization very easy. These are valuable tools for dialectal and variational language research. However, a linguist who starts to use them can quite easily come across problems: these applications might not provide the relevant possibilities for specific data manipulation tasks and research purposes. For instance, these implementations usually provide a limited number of statistical analysis techniques, and the possibility to use other statistical methods might not be provided. Map-drawing possibilities can also be restricted: a researcher might not be able to draw the exact dialect borders as needed.

The freeware statistical program R (R Development Core Team 2011) is increasingly used for statistical analyses in linguistics and provides a large number of statistical data analysis techniques. In our paper we show, based on Estonian dialect data, how any kind of statistical analysis available in R can be conducted and visualized at the same time with the ggmap package in R (Kahle & Wickham 2011). At the same time, Estonian dialect data is quite problematic. For instance, the traditional dialect borders do not follow the contemporary territorial subdivisions, which makes automatic map retrieval with Google Maps and similar applications impossible. In that case, a dialect map must be drawn by the researcher beforehand. The same problems occur when one wants to investigate (dialectal) variation among languages which are spoken on the territory of different countries.

In the second part of our work we show how this kind of spatial data and statistical inference can be straightforwardly combined and visualised on pre-drawn maps in R. Recently, R has become more popular in language research, allowing researchers to keep basic data manipulation operations and visualization techniques within the same framework. This makes the basic design of the research process considerably more convenient.

References:

Goebl, H. 2006, “Recent Advances in Salzburg Dialectometry”, Literary and Linguistic Computing: Journal of the Association for Literary and Linguistic Computing, vol. 21, no. 4, pp. 411-435.

Kahle, D. & Wickham, H. 2011, ggmap: A package for spatial visualization with Google Maps and OpenStreetMap, R package version 0.7, http://cran.r-project.org/web/packages/ggmap/ggmap.pdf.

Nerbonne, J., Colen, R., Gooskens, C., Kleiweg, P. & Leinonen, T. 2011, "Gabmap — A Web Application for Dialectology", Dialectologia, Special Issue II, pp. 65-89.

R Development Core Team 2011, R: A Language and Environment for Statistical Computing, R version 2.14.1, R Foundation for Statistical Computing, Vienna, Austria, http://www.r-project.org/.


Visualization in Regional Dialectology using Scalable Vector Graphics

Jack Grieve (Aston University)

This paper presents numerous visualization techniques for regional dialectology using Scalable Vector Graphics (SVG). SVG is an open-source, XML-based file format for graphics, including animation. Because SVG files are text files, they can be generated, modified and searched using a scripting language such as Perl. This is very useful in regional dialectology, where it is important to be able to quickly generate high-quality maps in order to make sense of complex patterns of regional linguistic variation.
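Because SVG is plain text, a map layer can be emitted by a few lines of script. The sketch below uses Python rather than the author's Perl, and the coordinates and measurement values are invented for illustration:

```python
# Emit a minimal SVG "dot map": one circle per survey site, shaded by a
# linguistic measurement scaled to a grey level.

def value_to_grey(value, lo, hi):
    """Scale a measurement in [lo, hi] to an SVG grey fill."""
    level = int(255 * (value - lo) / (hi - lo))
    return f"rgb({level},{level},{level})"

def dialect_map_svg(points, lo, hi, size=400):
    """points: (x, y, value) triples in SVG user units."""
    circles = [
        f'<circle cx="{x}" cy="{y}" r="4" fill="{value_to_grey(v, lo, hi)}"/>'
        for x, y, v in points
    ]
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{size}" height="{size}">\n  '
            + "\n  ".join(circles) + "\n</svg>")

# Three hypothetical cities, each with an F1 measurement in Hz:
svg = dialect_map_svg([(50, 60, 550.0), (120, 90, 700.0), (200, 40, 625.0)],
                      lo=550.0, hi=700.0)
```

Since the output is ordinary text, the same script could just as easily search or modify an existing SVG map, which is the property the paper exploits.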

In this presentation, the use of SVG graphics in regional dialectology is exemplified through two applications: the generation of dialect maps and the animation of regional linguistic simulations.

First, the mapping of individual quantitative linguistic variables using SVG and Perl is described. Maps are produced for individual grammatical, lexical and phonetic variables based on a variety of data sources, including acoustic data gathered through linguistic interviews (Labov et al, 2006) and lexical-grammatical data gathered through corpus methods (Grieve et al, 2011). In addition, aggregated patterns of regional linguistic variation are mapped based on the results of multivariate statistical analyses of these datasets. Two example SVG dialect maps generated by Perl scripts are presented below. Figure 1 plots the height of the /ae/ vowel across 236 cities in the United States based on the data from Labov et al (2006). Figure 2 maps an aggregated pattern of regional linguistic variation identified in this phonetic dataset by a factor analysis.

Second, the results of a linguistic simulation that models the development of regional linguistic variation in a speech community are animated using SVG. This simulation was designed to test the hypothesis that regional linguistic variation can emerge in a speech community where communication between speakers is limited only by physical distance. By using SVG animation, it is possible to visualize the results of this simulation: starting with a linguistic variable that is randomly distributed across a set of agents, the SVG animation allows for the development of regional patterns in the simulated speech community to be mapped over time.

Figure 1. /ae/ Formant 1.
Figure 2. Factor 2, Phonetic Data.

References:

Grieve J, Speelman D, Geeraerts D. 2011. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change. 23: 193-221.

Labov W, Ash S, Boberg C. 2006. Atlas of North American English: Phonetics, Phonology, and Sound Change. New York: Mouton de Gruyter.


Visualizing Dialect Change as Such: Factoring out the Role of the Standard Language

Wilbert Heeringa (University of Groningen, Faculty of Arts, Scandinavian department) Frans Hinskens (Meertens Instituut, Variationist linguistics (KNAW) / VU University Amsterdam)

Dialectometry focuses on measuring linguistic similarity or dissimilarity among dialect varieties. The first to develop a method of measuring dialect distances was Jean Séguy, assisted and inspired by Henri Guiter. Dissimilarity between two dialect varieties was measured as the percentage of items on which they disagree.

The measurements may be carried out on different linguistic levels, such as lexis, morphology, sound components, syntax, etc. In 1995 Kessler introduced the Levenshtein distance as a useful tool for measuring distances in the sound components, applying this measure to Irish Gaelic. The Levenshtein distance is a numerical value defined as the cost of the least expensive set of insertions, deletions and substitutions needed to transform one string into another. The algorithm may be applied directly to digitized phonetic transcriptions.
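The definition above translates directly into a dynamic-programming sketch (unit costs here; dialectometric applications often grade the substitution cost by phonetic similarity between segments):

```python
# Levenshtein distance: cost of the cheapest sequence of insertions,
# deletions and substitutions turning one string into another.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))        # distances from "" to b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

# Two hypothetical transcriptions of the same dialect item:
d = levenshtein("melk", "mjelk")   # one insertion
```

The three- and five-dimensional variants discussed below generalise this alignment to three or five realizations of an item at once.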

Most dialectometric studies are synchronic. In this paper we will report on a study of dialect change, using phonetic transcriptions of newly made recordings of a representative set of Dutch dialects spoken in 86 locations in the Netherlands and Flanders; the recordings were made in the period 2007-2011. For each site older male and younger female speakers were recorded, representing conservative and innovative dialect varieties, respectively. Change of both local dialect and relationships among dialects in the sound components will be measured with Levenshtein distance. The left map in Figure 1 shows dialect change for each of the 86 varieties.

Since standard Dutch has been shown to strongly influence dialects, we also distinguish between dialect change based on sound changes which make a dialect converge to standard Dutch (Figure 1, center) and dialect change based on sound changes which make a dialect diverge from standard Dutch (Figure 1, right). The two latter measures were obtained with a three-dimensional Levenshtein distance, in which, for each item, the older male realization, the younger female realization and the standard Dutch realization are aligned with each other.

Figure 2 shows change in the relationships between dialects. Likewise a distinction can be made between dialect change which is the result of sound changes which cause dialects to converge to standard Dutch and change which is the result of sound changes which cause dialects to diverge from standard Dutch.

In order to make this distinction and analyse the change of the relationship between dialect variety A and dialect variety B, we use a five-dimensional Levenshtein implementation, where the realizations of older male speakers of dialects A and B, the realizations of younger female speakers of dialects A and B, and the standard Dutch realizations are aligned with each other. The use of three- and five-dimensional Levenshtein distances is a novel step in dialectometry and enables us to visualize dialect change and change of dialect relationships while factoring out the influence of the overarching standard language.

Figure 1. Dialect differences between older male and younger female speakers obtained on the basis of all sound changes (left), sound changes which cause the dialect to converge to standard Dutch (center) and sound changes which cause the dialect to diverge from standard Dutch (right). The intensity of blue in a dot represents the extent to which a variety has changed.

Figure 2. Convergence and divergence among dialects. Red lines indicate convergence and blue lines indicate divergence.


PRESENTATION

Session 2: Visualising Properties of Linguistic Corpora

Nothing Beats a Picture Except an Interface: Interactive Visualisation for Corpus Linguistics

Tanja Säily (Department of Modern Languages, University of Helsinki) Terttu Nevalainen (Department of Modern Languages, University of Helsinki)

Harri Siirtola (TAUCHI, School of Information Sciences, University of Tampere)

When something is visualised in corpus linguistics, the prevailing mode of interaction is passive: something is computed from the text, a visualisation is created, and then inspected. While this approach is useful, a more direct mode is needed to create more effective visualisations and to support the discovery of insights (Pike et al. 2009).

We discuss the benefits and ramifications of an information visualisation approach to corpus linguistics and present a linguistically informed, interactive tool for exploratory analysis. Although general-purpose visualisation techniques provide a good starting point, techniques that dig deeper into the structure of the corpus, and work bottom-up from the texts, are needed to gain insight into linguistic variation and change.

In our view, corpus linguists need three kinds of visual analysis tools:

1. exploratory visualisation and analysis tools (which are our focus here);

2. tools for statistical, confirmatory analysis; and

3. explanatory tools for presenting the results.

The need for (1) and (3) is often ignored, and statistical tools are seen as sufficient for both exploration and presentation. However, there is a fundamental difference between (1) and (3). As Theus and Urbanek (2008:6) point out, the relation between the number of observers and the number of graphics in use is inverse. In exploration, a huge number of graphics is created for a single observer, while in presentation, a single graphic must serve a huge number of observers. An attempt to serve both purposes with the same graphic becomes a mediocre compromise.

The key difference between (1) and (2) is interaction. Statistical tools rarely support the continuous, direct-manipulation style of interaction that is highly valuable for pattern discovery and insight generation.

It is challenging to combine a large text corpus and its various measurements in an effective interactive visualisation. The connection between the text and the visualisation is usually lost, and only the visualisation is shown, as in the otherwise excellent Mondrian (Theus 2002). We are developing a tool called Text Variation Explorer (Figure 1), in which the text and the visualisation are shown side by side, and manipulation of either one will highlight the corresponding object in the other. Thus, interaction realises the missing connection between text and visualisation, and enhances the exploration of data for insight generation.

Figure 1. The user interface of Text Variation Explorer. James Joyce’s Ulysses is divided into 996-word samples (bottom left), which are visualised according to type/token ratio, proportion of hapax legomena and average word length in the line graph view (top), and clustered according to a list of personal pronouns in the principal components view (bottom right and a colour overlay at the top). All four measures indicate that the end of the novel is somehow different. The user has clicked one of the end samples in the principal components view, causing the sample to be highlighted in both the line graph view (red line) and the text view. Clearly, the different section is Molly Bloom’s famous soliloquy.
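The three line-graph measures mentioned in the caption are straightforward to compute. A minimal sketch, assuming the text has already been tokenized into a list of word tokens (the function name and sample size are illustrative, not Text Variation Explorer's actual interface):

```python
def sample_measures(tokens, size=996):
    """Split a token stream into fixed-size samples and compute, per
    sample, the three measures plotted in the line graph view:
    type/token ratio, proportion of hapax legomena, and average
    word length."""
    results = []
    for start in range(0, len(tokens) - size + 1, size):
        sample = tokens[start:start + size]
        counts = {}
        for t in sample:
            counts[t] = counts.get(t, 0) + 1
        types = len(counts)
        hapaxes = sum(1 for c in counts.values() if c == 1)
        results.append({
            "ttr": types / len(sample),
            "hapax": hapaxes / len(sample),
            "avg_len": sum(len(t) for t in sample) / len(sample),
        })
    return results
```

Each dictionary in the result corresponds to one point on each of the three lines in the top view.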

References:

Pike, William A., John Stasko, Remco Chang & Theresa A. O’Connell. 2009. “The science of interaction”. Information Visualization 8(4): 263–274.

Theus, Martin. 2002. “Interactive data visualization using Mondrian”. Journal of Statistical Software 7(11).

http://www.jstatsoft.org/v07/i11/

Theus, Martin & Simon Urbanek. 2008. Interactive Graphics for Data Analysis: Principles and Examples (Computer Science & Data Analysis). Boca Raton, FL: Chapman & Hall/CRC.


PRESENTATION

Session 2: Visualising Properties of Linguistic Corpora

Real Time Aesthetic Visualisation of NLP-driven Semantic Pathways

Christopher Rowland and John Anderson (3D Visualisation Research Lab, Duncan of Jordanstone College of Art & Design, University of Dundee)

This paper describes a method for utilising Natural Language Processing (NLP) and high-level Dataflow Programming in order to produce an interactive visualisation capable of revealing "semantic pathways" between documents in a given collection. The visualisation's purpose is two-fold. Firstly, it provides keyword "gists" of the collection, permitting the easy grasping of high-level themes and topics. Secondly, it encourages fluid, interactive querying of the collection, thus revealing a sense of its overall narrative. Our system is designed around a linguistic modeling layer, which performs NLP-related operations on data, and a visualisation layer (Figure 1), which presents results interactively to the viewer.

Figure 1. Visualisation Layer

With respect to linguistic modeling our method utilises functionalities of the Natural Language Toolkit (NLTK) such as frequency distributions, lexical dispersion, concordance and collocations together with the computation of significance metrics such as the log-likelihood statistic. Our method operates on unprocessed, non-tagged plain text to deliver keywords ranked by significance directly to our visualisation layer.
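NLTK supplies the frequency distributions, concordances and collocation finders mentioned above. The log-likelihood keyword statistic itself can be sketched in a few lines of plain Python (a hedged illustration of Dunning's G², comparing a document's word counts against a reference corpus; the helper names are ours, not NLTK's):

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning log-likelihood (G2) for a word occurring a times in a
    document of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected count in the document
    e2 = d * (a + b) / (c + d)   # expected count in the reference
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(doc_counts, ref_counts, top=10):
    """Rank the words of a document by significance against a
    reference corpus, returning the top-scoring keywords."""
    c = sum(doc_counts.values())
    d = sum(ref_counts.values())
    scored = [(log_likelihood(doc_counts.get(w, 0),
                              ref_counts.get(w, 0), c, d), w)
              for w in doc_counts]
    return [w for _, w in sorted(scored, reverse=True)[:top]]
```

A word that is frequent in the document but rare in the reference corpus scores high, which is what makes such keywords useful as a collection "gist".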

We utilise Dataflow Programming in order to model a visualisation and rendering pipeline which describes and executes the high level logic which enables a viewer to interactively query the collection gist by selecting any significant keyword. This action requests a subset of documents from the linguistic modeling layer in which the selected keyword is statistically significant, and a further set of corresponding significant keywords for each returned document in this subset. This data is then presented back to the viewer in the visualisation layer context. This query and response mechanism of interaction may then proceed in a recursive fashion enabling the viewer to explore the narrative themes of the collection.

Our method supports interactive querying of the collection in a number of useful ways. In addition to the recursive “keyword list to keyword list” method described above, the method supports concordances, lexical dispersion, collocations and full text document rendering. In every case, wherever a word is displayed in our visualisation, the viewer may initiate a query on that word. Thus we feel we have a highly flexible query system, yet one which is capable of returning high relevance gists due to the linguistic modeling techniques employed.

The design of the visualisation layer takes into account not only presentation of the semantic relationships between documents but also, and importantly, the aesthetics of how this information should be presented visually to the viewer. We therefore make extensive and subtle use of typographic design, spatial layout and ordering, transparency, animation and graphic design in order to present results in a clear, concise, visually ordered and engaging manner. We have found that these aesthetic considerations play an important role not only in aiding the viewer’s comprehension of the data but also in encouraging the viewer’s further fluid interaction with the collection.


PRESENTATION

Session 2: Visualising Properties of Linguistic Corpora

Visual Exploration of the Lexical Resource Saldo Using Freely Available Visualization Tools

Daniela Oelke (University of Konstanz, Germany, Department of Computer and Information Science) Markus Forsberg & Lars Borin (University of Gothenburg, Sweden, Department of Swedish)

1. Introduction to SALDO

SALDO is a Swedish electronic “associative thesaurus” under active development, with over 117,300 hierarchically organized word senses. Additionally, secondary connections link terms that are semantically close and differentiate senses with the same superordinate. SALDO resembles WordNet and Roget’s Thesaurus, but is also different in important respects (Borin & Forsberg 2009).

Since SALDO is being constructed with a local view, an advanced interface with a global view that allows browsing and visual exploration of the resource is important for both developers and users and can help to gain a deeper understanding of SALDO.

2. Overview Representation and Browsing of the Hierarchical Structure of SALDO

Providing a global view on the SALDO tree structure is challenging from a visualization perspective because of the large size of the tree. To display it, we use TreeViz,1 a freely available tool specifically designed for the visualization of large tree structures. The hyperbolic tree (Fig. 1) is a technique similar to the most common tree representation – a node-link diagram. Being a focus+context technique, it assigns more space to the node in focus and its local neighborhood at the expense of shrinking nodes that are further away. A disadvantage of the technique is that the overall structure of the tree becomes difficult to perceive. Furthermore, the available space is not used as efficiently as with space-filling techniques such as rectangular treemaps (Fig. 2) which assign space proportionally to some node attribute and nest the representation of the tree levels recursively. Variants of this technique are circular treemaps, icicle plots, and the sunburst representation (Figs. 3–5). The latter two come with the advantage that the overall tree structure is easier to perceive and intermediate nodes are shown as well.
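The space-proportional assignment that treemaps perform can be sketched concisely. Below is a minimal slice-and-dice treemap layout, an illustrative simplification: TreeViz implements more sophisticated variants such as the rectangular and circular treemaps discussed above.

```python
def treemap(node, x, y, w, h, depth=0, out=None):
    """Minimal slice-and-dice treemap layout: assign each node a
    rectangle whose area is proportional to its weight, alternating
    horizontal and vertical slicing by depth.
    A node is a (weight, children) pair; leaves have empty children."""
    if out is None:
        out = []
    weight, children = node
    out.append((x, y, w, h))
    if children:
        total = sum(c[0] for c in children)
        offset = 0.0
        for child in children:
            frac = child[0] / total
            if depth % 2 == 0:   # slice horizontally at even depths
                treemap(child, x + offset * w, y, w * frac, h,
                        depth + 1, out)
            else:                # slice vertically at odd depths
                treemap(child, x, y + offset * h, w, h * frac,
                        depth + 1, out)
            offset += frac
    return out
```

The returned rectangles fill the available space completely, which is exactly the efficiency advantage over node-link diagrams noted above.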

In such a large graph it is impossible to display all node labels. Therefore, the interaction capabilities of TreeViz are very important for navigating the resource. Hovering over a node displays its full path to the root of the tree (see inlay of Fig. 4). Furthermore, the tool supports zooming into the visualization and displaying a subtree in detail.

3. Visual Exploration of the Graph Structure of SALDO

Secondly, visone2 (Brandes & Wagner 2003) – a tool for the analysis and visualization of social networks – is used to visually explore subgraphs of SALDO in detail. Here the secondary connections can be displayed as well and an in-depth analysis of the relationship between different terms becomes possible (Figs. 6–7).

4. Conclusions and Future Work

The visualization of large lexical resources is a common problem (see, e.g., Katifori et al. 2007, Priss & Old 2011, Eckert et al. 2007, or Schulz n.d. for a survey on tree visualizations in general). In this work we experimented with two freely available tools for visualizing and analyzing the large lexical network SALDO. In the future we plan to additionally include the links among different lexical resources (e.g., in the ongoing Swedish FrameNet++ project; Borin et al. 2010) and to advance the interactive exploration facilities.

1. TreeViz, http://www.randelshofer.ch/treeviz/

2. visone, http://visone.info/

References:

L. Borin, D. Dannélls, M. Forsberg, M. T. Gronostaj, and D. Kokkinakis. The past meets the present in Swedish FrameNet++. In 14th EURALEX International Congress, pages 269–281. EURALEX, 2010.

L. Borin and M. Forsberg. All in the family: A comparison of SALDO and WordNet. In Proceedings of the Nodalida 2009 Workshop on WordNets and other Lexical Semantic Resources - between Lexical Semantics, Lexicography, Terminology and Formal Ontologies, pages 7–12. NEALT, 2009.

U. Brandes and D. Wagner. Visone: Analysis and visualization of social networks. In Graph Drawing Software, pages 321–340. Springer-Verlag, 2003.

K. Eckert, H. Stuckenschmidt, and M. Pfeffer. Interactive thesaurus assessment for automatic document annotation. In Proceedings of the 4th International Conference on Knowledge Capture, pages 103–110. ACM, 2007.

A. Katifori, C. Halatsis, G. Lepouras, C. Vassilakis, and E. Giannopoulou. Ontology visualization methods: A survey. ACM Computing Surveys, 39(4), 2007.

U. Priss and L. J. Old. Data weeding techniques applied to Roget's thesaurus. In Proceedings of the First International Conference on Knowledge Processing and Data Analysis, pages 150–163. Springer-Verlag, 2011.

H.-J. Schulz. TreeVis.net, http://www.informatik.uni-rostock.de/~hs162/treeposter/poster.html, last accessed January 2012.


PRESENTATION

Session 3: Developments in Visualisation Techniques

Plotting Speakers’ Vowel Systems in Real-Time Interaction: A First Approach

Anne Fabricius (Roskilde University, Denmark) Charlotte Vaughn (Northwestern University)

Tyler Kendall (Dept. of Linguistics, University of Oregon)

There is a long tradition in disciplines of linguistics and phonetics of visualizing speakers’ vowel systems through transformations of acoustic measurements of vowel formants (resonances) into two-dimensional x-y plots (since Joos 1948; see also the discussion in Watt, Fabricius and Kendall 2011). Within sociolinguistics, these plots have served not just as illustrations of vowel change processes, but have actually formed an integral part of the analytical process, as evidenced for example by Labov’s concept of peripherality which plays a central theoretical role in his typology of diachronic changes to high and mid vowels (Labov 1994).

Recent work in visualizing vowel systems has begun to move beyond the static geometric x-y plot and experimented with, for example, three-dimensional representations (Fridland and Kendall 2009) and with showing animated trajectories of vowel systems in communities over decades (Fruehwald 2011). New computational speech-processing possibilities, including forced alignment, open up this area for innovative analyses and presentation methods.

In this paper, we will present a proof-of-concept for the visualization of vowels and the vowel space in real-time speech, using data from a conversational interview. While a speech recording is played, the F1 and F2 values of vowels are plotted on two-dimensional vowel plots, allowing viewers to watch the unfolding of vocalic characteristics within a speech event over time, and enabling researchers to examine such speech data in new ways, both qualitatively and quantitatively. This type of representation has the potential to contribute new insights to our understanding of both intra- and inter-speaker variation in interaction, as well as long- and short-term speech accommodation under inter-variety linguistic contact of many kinds. In addition to its research implications, real-time vowel plotting provides a helpful visualization for phonetics and sociophonetics students to develop their understanding of the vowel space. With this preliminary work, we hope to encourage further advances in the visualization of vocalic production.
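One preparatory step for plots of this kind, discussed in Watt, Fabricius and Kendall (2011), is speaker normalization of the formant values. As a hedged illustration, here is Lobanov z-score normalization, one common choice, not necessarily the method used in this paper:

```python
def _mean_sd(xs):
    """Mean and standard deviation; sd falls back to 1.0 for a
    single observation so the division below stays defined."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return m, sd if sd > 0 else 1.0

def lobanov_normalize(tokens):
    """Lobanov z-score normalization: each speaker's F1/F2 values are
    centred and scaled by that speaker's own mean and standard
    deviation, making different speakers' vowel spaces comparable.
    tokens: list of (speaker, vowel, f1, f2) tuples."""
    by_speaker = {}
    for spk, _, f1, f2 in tokens:
        by_speaker.setdefault(spk, []).append((f1, f2))
    stats = {spk: (_mean_sd([v[0] for v in vals]),
                   _mean_sd([v[1] for v in vals]))
             for spk, vals in by_speaker.items()}
    out = []
    for spk, vowel, f1, f2 in tokens:
        (m1, s1), (m2, s2) = stats[spk]
        out.append((spk, vowel, (f1 - m1) / s1, (f2 - m2) / s2))
    return out
```

The normalized F1/F2 pairs can then be streamed to a two-dimensional plot as the recording plays.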

References:

Fridland, Valerie and Tyler Kendall. 2009. Mapping production and perception in regional vowel shifts: The effects of vowel duration and formant trajectories. Paper presented at NWAV 38. University of Ottawa.

Fruehwald, Josef. 2011. Philadelphia language in motion. http://www.ling.upenn.edu/~joseff/phillymotion.html

Joos, Martin. 1948. Acoustic phonetics. Language 24(2), Language Monograph 23: 5-136.

Labov, William. 1994. Principles of Linguistic Change, vol. 1: Internal Factors. Oxford: Blackwell.

Watt, Dominic, Anne Fabricius & Tyler Kendall. 2011. More on vowels: plotting and normalization. In Marianna di Paolo & Malcah Yaeger-Dror (eds.), Sociophonetics: A Student’s Guide. Routledge, pp. 107-118.


PRESENTATION

Session 3: Developments in Visualisation Techniques

Web-Based Visualisation of Multi-Dimensional Linguistic Annotation

Daniel Jettka (Hamburg Centre for Speech Corpora, University of Hamburg)

The present work relates to a research programme that deals with the comprehensive web-based visualisation of linguistic transcriptions and annotations in the context of a broader system of annotation tools and concepts. The visualisation method that will be presented is based on standard XML and web technologies including XSLT, HTML5, SVG, X3D, and JavaScript.

The prototypical implementation of the visualisation module at this stage handles two distinct XML input formats: (1) the generic XStandoff format (XSF, cf. http://www.xstandoff.net) which can store multi-dimensional, potentially overlapping or discontinuous annotations of texts (and in principle audio-visual data), and (2) EXMARaLDA basic transcriptions (EXB, cf. http://www.exmaralda.org), the storage and serialization format for transcripts of the EXMARaLDA Partitur Editor that is used for the transcription and annotation of audio and video data (and can in principle also handle textual data). In a later step, a conversion framework shall be created in order to support additional transcription and annotation formats as well as unrestricted inline XML annotations.

The two above-mentioned formats, XSF and EXB, serve as the input for an XSLT stylesheet which transforms the data into an HTML5 web page that contains several components: an audio player, text areas which show various information on metadata, transcriptions, and annotations, as well as 2D grid-like, and 3D multiple-rooted tree-like visualisations of the annotations. The latter is especially useful for multi-layer hierarchical annotations as will be demonstrated. The visualisations are directly embedded into the HTML5 page by the use of SVG and X3D. Interactive effects which can be triggered by user actions (mouse-over and click events, control forms) illustrate the alignment of primary data (text or audio) and the corresponding annotations. The HTML5 audio player can be used to play the recordings, which are referenced in the input data (in the case of EXB), or go through the textual primary data (in the case of XSF – here a certain time frame is assigned to each character; the speed can be adjusted by the user). During the playback, the corresponding annotations in the 2D and 3D visualisations as well as the transcriptions or textual primary data are highlighted and associated information is displayed. This effect can also be triggered directly by the user who for instance is able to freely navigate through the 3D scene and explore the annotations with the mouse cursor.
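The SVG grid view described above can be illustrated with a small generator. This is a hedged sketch, written in Python rather than XSLT for brevity; the tier representation and layout constants are our own assumptions, not the module's actual interface:

```python
def annotation_grid_svg(tiers, cell_w=60, cell_h=24):
    """Render a partitur-style annotation grid as an SVG string.
    tiers: list of (tier_name, [(start, end, label), ...]) where
    start/end are integer time-slot indices."""
    parts = []
    for row, (name, spans) in enumerate(tiers):
        y = row * cell_h
        # tier label on the left, one row per tier
        parts.append(f'<text x="0" y="{y + 16}">{name}</text>')
        for start, end, label in spans:
            x = 100 + start * cell_w
            w = (end - start) * cell_w
            parts.append(f'<rect x="{x}" y="{y}" width="{w}" '
                         f'height="{cell_h}" fill="none" stroke="black"/>')
            parts.append(f'<text x="{x + 4}" y="{y + 16}">{label}</text>')
    height = len(tiers) * cell_h
    return ('<svg xmlns="http://www.w3.org/2000/svg" '
            f'height="{height}">' + "".join(parts) + "</svg>")
```

In the actual system such markup is produced by the XSLT stylesheet and embedded directly into the HTML5 page, where JavaScript event handlers add the interactive highlighting.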


KEYNOTE

Keynote 2

Visualizing Vowels: Restoring Some Lost Images

Michael Ashby (University College London)

Visual representations are crucial to our understanding of vowels: the vocal-tract section, the speech waveform, the acoustic spectrum, and the multi-dimensional vowel space (phonetic or perceptual). None of these now-familiar representations were obtained easily, and the conventionalized versions which appear in every introductory textbook originally faced competition from forgotten rivals which still deserve investigation and understanding. This paper presents new historiographical findings on a range of topics: the discovery of lost X-ray data of Tsutomu Chiba (1883-1959), the complex story of the interplay between X-ray data and the Cardinal Vowel system, re-discovery and restoration of the first X-ray sound film from 1935, Robert Curry’s pioneering efforts to capture cathode-ray oscillograms on film, and the remarkable elliptical vowel space proposed by Sun-Gee Gim in 1937.

Conference Day 1 Ends


KEYNOTE

Keynote 3

Speech as Visible Patterns of Sound

Mark Huckvale (Speech, Hearing & Phonetic Sciences, University College London)

Spoken language is built on the reliable communication of information through tiny, rapid fluctuations of air pressure which are both invisible and ephemeral. Our species has learned to exploit its vocal apparatus to encode meaningful utterances into patterns of sound and to exploit its auditory apparatus to decode them. Insights into these encoding and decoding processes can be gained by making these patterns of sound visible and permanent. In this talk I will demonstrate a number of ways in which we can look at speech as patterns of sound. I will cover a broad range of linguistic levels: from pressure variations, spectral analysis and neural firing to phonetic properties, lexical contrast and dialogue structure. On the way I hope to show how visualisation can aid understanding of how speech communication works, and can also create a sense of wonder at the marvel that it works at all.

7 September (Conference Day 2 Begins)


PRESENTATION

Session 4: Mapping II

Mapping the BBC Voices

John Holliday (Information School, University of Sheffield)

The recent BBC Voices project (www.bbc.co.uk/voices) has resulted in a wealth of linguistic data which illustrate the diversity of language in Britain. The Voices project includes over three hundred recordings made by BBC radio journalists between October 2004 and June 2005, making it the largest survey of the English language ever undertaken. In addition, online surveys have been used to collect data on, amongst other things, the types of words that people use in everyday speech. The data collected help to illustrate the close relationship between the use of language and our geographical, social and cultural identity.

The Voices data have been analysed and characterised into three descriptor types:

■ everyday words used for a selection of 38 terms or ‘concepts’, such as ‘playing truant’;

■ a selection of features which characterise grammatical variation;

■ a selection of lexical features which characterise phonetic variation.

Techniques in Geographical Information Systems (GIS) have been used to further investigate the relationship between language and geographical identity in order to visually present and analyse this relationship. In particular, GIS software (ArcGIS 10.0) has been used to analyse the 38 concepts and to present these as maps in both static (Figure 1) and interactive web-based forms (MapServer), based on Postcode area and Postcode district.

In order to further investigate the consensus of all features, clustering has been carried out on each of the three descriptor types. Clustering is a classification scheme which groups objects so that members of one cluster, or group, are maximally similar to each other but maximally dissimilar from members of other groups. The similarity between objects is measured by applying a similarity coefficient to the descriptors, for instance the Euclidean distance. The clusters are then presented on a map in order to determine whether a relationship exists between geography and descriptor type (Figures 2 and 3) and, if so, whether areas of natural language in the UK can be identified.
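The figures use Ward's hierarchical agglomerative clustering. A minimal sketch of Ward's merge criterion over Euclidean distances follows (a centroid-based formulation; the study itself used ArcGIS rather than hand-rolled code):

```python
def ward_cluster(points, k):
    """Ward's agglomerative clustering: repeatedly merge the pair of
    clusters whose merger least increases the total within-cluster
    sum of squares, until k clusters remain.
    Returns lists of point indices, one list per cluster."""
    clusters = [[i] for i in range(len(points))]
    centroids = [list(p) for p in points]
    sizes = [1] * len(points)

    def merge_cost(i, j):
        # Ward's criterion: increase in error sum of squares when
        # merging clusters i and j.
        d2 = sum((a - b) ** 2 for a, b in zip(centroids[i], centroids[j]))
        return sizes[i] * sizes[j] / (sizes[i] + sizes[j]) * d2

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: merge_cost(*ij))
        n = sizes[i] + sizes[j]
        centroids[i] = [(sizes[i] * a + sizes[j] * b) / n
                        for a, b in zip(centroids[i], centroids[j])]
        sizes[i] = n
        clusters[i] += clusters[j]
        for lst in (clusters, centroids, sizes):
            del lst[j]
    return clusters
```

Each postcode area, described by its vector of descriptor values, would be one point; the resulting clusters are then drawn on the map.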

The study also analyses the co-correlation between features, examining which combinations of dependent linguistic features can be used to characterise each geographical area. Heat maps are used to illustrate the co-correlation.

Further work uses Bayesian Learning to build models for each of these areas of natural language. These models can then be tested in an attempt to identify the geographical identity of a test subject, essentially an automated ‘Professor Higgins’.

Figure 1: Two terms used for the concept ‘Playing Truant’, by Postcode area

Figure 2: Ward’s hierarchical agglomerative clustering of 38 concepts by Postcode area: 12 and 24 clusters

Figure 3: Ward’s hierarchical agglomerative clustering of phonology by Postcode area: 8 and 16 clusters using Thiessen Polygons


PRESENTATION

Session 4: Mapping II

Visualising Perceptual Dialectology Data Using Geographical Information Systems

Chris Montgomery (Sheffield Hallam University, Department of Humanities) Philipp Stoeckle (Albert-Ludwigs-Universität Freiburg, Deutsches Seminar)

Perceptual dialectology (PD) is the study of non-linguists’ conceptions of linguistic variation. It asks where and how such respondents delimit dialect areas. The main technique for studying the perception of dialects in this way is the ‘draw-a-map’ task (Preston 1982). The draw-a-map task asks respondents to draw lines on blank or minimally detailed maps indicating where they believe dialect areas to exist. One of the aims of this approach is to produce aggregate maps, as discussed by Preston and Howe (1987: 363). Such aggregate maps display grouped perceptions of the extent and placement of dialect areas. As such they are a valuable resource which can be used to directly compare perception data with those from other studies.

Previous computerised attempts to deal with the data gathered using draw-a-map tasks have been made by Preston and Howe (1987) and Onishi and Long (1997). Both of these techniques used a grid-based method to display the extent of agreement amongst groups of respondents relating to the placement and extent of dialect areas (see Figure 1).

Onishi and Long’s (1997) Perceptual Dialectology Quantifier (PDQ) for Windows produced improved visualisations that displayed percentage agreement on one map, such as those seen in Figure 2. PDQ’s interface also allowed users to query the data, and thus the programme held significant advantages over the one used by Preston and Howe.

The techniques developed by both Preston and Howe (1987) and Onishi and Long (1997) made the aim of producing aggregate maps more achievable. However, the age of the software (and the hardware on which it runs) means that widespread use is not now possible. In addition, both approaches essentially treated the data as graphical, as opposed to spatial. As a result, both programmes aggregated data as if the geographical space onto which the graphics were projected did not have independent characteristics of its own (cf. Britain 2010).

This paper will discuss a method to extract, process, query, and visualise PD data using a piece of off-the-shelf Geographical Information Systems (GIS) software (ArcGIS) using data from studies in England (Figure 3) and Germany. As well as underlining improvements in visualisation quality using a GIS, the paper will also discuss the huge advantages of treating data from draw-a-map tasks as geospatial as opposed to graphical. It will also demonstrate why using a GIS for processing data of this type confers numerous advantages over other processing techniques, such as the ability to overlay linguistic and non-linguistic datasets. By doing this, the paper will argue for the widespread adoption of GIS technology in order to answer questions in PD, as well as in the wider field of geolinguistics.

References:

Britain, D., 2010. Conceptualisations of Geographic Space in Linguistics. In A. Lameli, R. Kehrein & S. Rabanus, eds. Language and Space: An International Handbook of Linguistic Variation. Volume 2: Language Mapping. Berlin: Mouton de Gruyter, pp. 69-97.

Long, D., 1999. Geographical perception of Japanese dialect regions. In D. R. Preston, ed. Handbook of Perceptual Dialectology. Amsterdam: John Benjamins, pp. 177-198.

Montgomery, C. & Stoeckle, P., Forthcoming. Perceptual Dialectology and GIS.

Onishi, I. & Long, D., 1997. Perceptual Dialectology Quantifier (PDQ) for Windows.

Preston, D.R., 1982. Perceptual dialectology: Mental maps of United States dialects from a Hawaiian perspective. Hawaii Working Papers in Linguistics, 14(2), pp. 5-49.

Preston, D.R. & Howe, G.M., 1987. Computerized Studies of Mental Dialect Maps. In K. Denning et al., eds. Variation in Language: NWAV-XV at Stanford (Proceedings of the Fifteenth Annual Conference on New Ways of Analyzing Variation). Stanford, CA: Department of Linguistics, Stanford University, pp. 361-378.

Figure 1: Visualisation produced using Preston and Howe’s method (1987: 373)

Figure 2: Visualisation produced using PDQ (Long 1999: 183)

Figure 3: Visualisation displaying a composite perceptual map of dialect areas in Great Britain using a GIS (Montgomery & Stoeckle, Forthcoming)


PRESENTATION

Session 4: Mapping II

Maps as a Central Linguistic Research Tool

Joel J. Priestley, Janne Bondi Johannessen, Kristin Hagen, Anders Nøklestad and André Lynum (University of Oslo, ILN, The Text Laboratory)

Finding the geographical distribution of linguistic phenomena is difficult at the best of times. However, if a corpus of spoken language covering a large area exists, the task is simpler. If the corpus also has metadata with the relevant GIS information, that is, coordinates for the location of each informant, a map can be drawn automatically.

In this paper we will show exactly this kind of resource. The Nordic Dialect Corpus (Johannessen et al. 2009) is a corpus of dialects in six Nordic countries: Denmark, Finland, the Faroe Islands, Iceland, Norway and Sweden. The corpus contains the speech of 745 informants from 204 places, altogether more than 2.5 million words.

The corpus is searchable across a wide range of variables, including two types of transcription (for a couple of the languages), which makes possible detailed map visualizations of individual pronunciations or grammatical features, showing many new and unknown isoglosses. Although the corpus is very new (it was officially launched in December 2011), some research has already been conducted, e.g. Vangsnes and Johannessen (2011) and Johannessen (2012). The talk will present examples from both.

The map solution not only displays the places where a particular phenomenon occurs. It also gives a list of all the phonetic variants of a search word, with the possibility of choosing a separate colour for each variant, thus displaying isoglosses directly. We do not know of any comparable resource. We can mention WALS Online, but that is a static map display, where each map has been carefully crafted based on existing literature. New maps can be displayed only if a linguist takes on the task of assembling the necessary information and putting it into a map. Our map solution is dynamic: any hit in the corpus can be displayed automatically on the map, and further specifications can be semi-automatically displayed.

The coordinates for each recording location are stored in a table.

Using Google’s map interface enables us to plot these locations as needed. The next step will be to harness PostgreSQL and PostGIS and access their rich functionality for geographical computation.

An example is given below. The first picture shows the corpus hits for the negation ikke ‘not’ in Norway. This word, in its orthographic form, is distributed all over the country, of course. The second picture shows the phonetic variants of the negation (using a traditional Norwegian script). Each variant is accompanied by a clickable box, which in turn displays an array of colours to choose from for a better map display. The third picture displays the map with the additional markers displayed. The maps illustrate how easy it is to get a picture of isoglosses for a given phenomenon with our solution.
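The colour-per-variant workflow just described amounts to grouping hits by variant and attaching a display colour to each marker. A minimal sketch (the data shapes and function name are illustrative assumptions, not the corpus interface):

```python
def map_markers(hits, palette):
    """Group corpus hits by phonetic variant and attach a display
    colour to each, producing marker data for a web map layer.
    hits: list of (variant, lat, lon); palette: list of colour names."""
    variants = []
    for variant, _, _ in hits:
        if variant not in variants:
            variants.append(variant)
    # one colour per variant, cycling through the palette
    colour = {v: palette[i % len(palette)] for i, v in enumerate(variants)}
    return [{"variant": v, "lat": lat, "lon": lon, "colour": colour[v]}
            for v, lat, lon in hits]
```

Marker records of this shape are exactly what a web map API consumes when plotting each recording location in its chosen colour.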

Figure 1: Distribution of the negation ikke ‘not’ in Norway

Figure 2: Choosing colours for display of phonetic variant of negation.

Figure 3: Purple colour displays fricative/affricate pronunciation of negation word, while yellow displays a pronunciation with a velar stop.


PRESENTATION
Session 4: Mapping II

Geographic Information System (GIS) and Perceptual Dialect Mapping

Betsy E. Evans (University of Washington Department of Linguistics) Matthew D. Dunbar (University of Washington Center for Studies in Demography and Ecology)

The analysis of perceptual dialectology maps has largely been qualitative due to the nature of the hand-drawn map technique (Preston 1981). This technique asks respondents to indicate on a map of their community/state/country where they believe language differences exist, which requires the researcher to compare each map individually to arrive at an analysis of the data. In this presentation, we discuss a study which used a Geographic Information System (GIS) to aggregate and then query subsets of such perceptual maps collected in Washington State (WA). A total of 229 WA residents were given paper copies of maps provided by the researcher. Each respondent’s paper map was digitized into spatial data files, which were then combined in a GIS to create a composite map showing the spatial distribution of response frequency (Figure 1). In addition, maps of the labels frequently associated by respondents with particular regions on the WA map could also be generated. For example, Figure 2 shows the regions labeled as ‘country’.
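The aggregation step can be illustrated with a minimal sketch. This assumes each digitized respondent map has been rasterized to a binary grid (1 = the respondent drew a dialect area over that cell); summing the grids cell-wise yields a composite response-frequency surface of the kind shown in Figure 1. The grids below are invented toy data.

```python
# Cell-wise aggregation of equally sized binary respondent grids into a
# response-frequency map (toy data, invented for illustration).

def aggregate(maps):
    """Sum equally sized binary grids cell by cell."""
    rows, cols = len(maps[0]), len(maps[0][0])
    freq = [[0] * cols for _ in range(rows)]
    for m in maps:
        for r in range(rows):
            for c in range(cols):
                freq[r][c] += m[r][c]
    return freq

respondent_maps = [
    [[1, 1, 0], [0, 1, 0]],  # respondent 1
    [[1, 0, 0], [0, 1, 1]],  # respondent 2
    [[1, 1, 0], [0, 0, 1]],  # respondent 3
]
composite = aggregate(respondent_maps)
# composite[0][0] == 3: all three respondents marked that cell.
```

A GIS performs the same operation on georeferenced polygons rather than toy grids, but the principle of overlay-and-count is identical.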

So far, very few linguists have used digital techniques to carry out perceptual dialectology research (e.g. Preston & Howe 1987, Cramer 2010, Bounds 2010), owing to the difficulty of bringing together linguistic analysis and the recent technological advances in spatial analysis afforded by GIS. This paper will demonstrate some of the analyses made possible by such a collaboration, in addition to discussing the future prospects of GIS for perceptual data analysis.

The social evaluation of dialects from a geographic perspective is integral to defining cultural communities and the linguistic details of language varieties. A technique for the quantitative analysis of perceptual maps is therefore a crucial tool for the analysis and presentation of perceptual dialectology data.

Figure 1. Areas identified by All Respondents

Figure 2. Areas with labels relating to “Country”


PRESENTATION

Session 5: Visualising Interaction

Generating Visual Insights into Effective Doctor-Patient Consultations

Daniel Angus (School of Information Technology and Electrical Engineering, School of Journalism and Communication, The University of Queensland)

Janet Wiles (School of Information Technology and Electrical Engineering, The University of Queensland)

Andrew Smith (Institute for Social Science Research, The University of Queensland)

Effective communication between healthcare professionals and patients is critical to patient health outcomes. The doctor/patient dialogue has been extensively researched from different theoretical perspectives, with findings emphasising a range of communication behaviours that lead to effective communication. In this study we applied the Discursis visualisation technique [1] to analyse examples of training and clinical medical discourse transcripts. Discursis automatically builds an internal language model from a conversation transcript, tags conversation turns based on their conceptual content, and generates an interactive visual representation of the discourse under study. The analysis afforded by Discursis is useful in examining interactions at the scale of a whole consultation, and findings from this work are helping to highlight effective consultation techniques, including accommodative, engagement, and repetition behaviours. A significant advantage of Discursis over alternative visualisation techniques is the ability to visualise topic usage patterns across a range of time scales simultaneously. In a recent study, Angus et al. [1] analysed conversations from two Australian television talk shows to highlight topic usage patterns, including topic convergence between participants (on a turn-by-turn time scale) and whole-conversation call-back behaviours, where early-mentioned topics were revisited much later in the conversation. The study indicated that a discourse analyst could use the system both as a decision-support tool, to confirm pre-held hypotheses about the type and magnitude of interaction (in terms of topic reuse) between conversation participants, and as a forensic tool, to discover patterns of interaction and interesting time periods in which participants demonstrated topic convergence.
Findings from the application of Discursis analysis to doctor/patient consultation transcripts suggest that particular visual patterns of topic use are present in effective doctor/patient consultations. One feature of interest is what we term the ‘introductory stripe’: a band of topic recurrence, stemming from the patient, which occurs early in the consultation. An example of an introductory stripe is indicated in Fig. 1a. The introductory stripe is present because the patient mentioned a number of prominent topics in this single early utterance; the conceptual content of that utterance recurs with the patient’s own statements (red squares) and the doctor’s statements (red and blue squares) throughout the remainder of the consultation. The introductory stripe is highly important given that these concepts are repeated at multiple time-points throughout the remainder of the consultation: in essence, this early turn frames much of the later discussion. Platt and Gordon [2] note the importance of clarifying the patient’s agenda, and that the doctor should try early in the interview to obtain a complete list of the patient’s concerns. The ‘introductory stripe’ recurrence pattern is indicative of such a clarification. As a contrasting example, there is no introductory stripe in poor consultations (Fig. 1b); instead, the doctor’s early turns are the ones that recur throughout the remainder of the consultation, suggesting that the doctor, rather than the patient, provided the framing.
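The core of a conceptual-recurrence plot is a turn-by-turn similarity matrix. The sketch below is a deliberately simplified stand-in for the published Discursis algorithm (which builds a full language model from the transcript): here two turns simply "match" in proportion to their shared content words, giving the strength used to shade the square for that turn pair. Turns and stopword list are invented.

```python
# Simplified conceptual-recurrence matrix: shared-content-word overlap
# between conversation turns (not the published Discursis algorithm).

STOPWORDS = {"the", "a", "i", "you", "my", "is", "it", "and", "to", "me"}

def concepts(turn):
    """Crude concept set: content words of a turn."""
    return {w for w in turn.lower().split() if w not in STOPWORDS}

def recurrence(turns):
    """Upper-triangular matrix of shared-concept strengths in [0, 1]."""
    sets = [concepts(t) for t in turns]
    n = len(sets)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = sets[i] | sets[j]
            if union:
                m[i][j] = len(sets[i] & sets[j]) / len(union)
    return m

turns = [
    "My chest pain started last week",   # patient's opening turn
    "Tell me about the chest pain",      # doctor echoes patient concepts
    "It gets worse when walking",        # new topic, no overlap with turn 1
]
m = recurrence(turns)
# m[0][1] > m[0][2]: the doctor's turn echoes the patient's opening concepts.
```

An "introductory stripe" corresponds to a row of the matrix, for an early patient turn, that stays non-zero far down the conversation.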

References:

[1] Daniel Angus, Andrew Smith, and Janet Wiles. Conceptual recurrence plots: Revealing patterns in human communication. IEEE Transactions on Visualization and Computer Graphics, 2011 (in press).

[2] Frederic W. Platt and Geoffrey H. Gordon. Field guide to the difficult patient interview. Lippincott Williams & Wilkins, 2nd edition, 2004.

Figure 1: Example doctor/patient Discursis plots. Each conversation turn is represented by a square of colour, with time running down the diagonal. If any two turns share conceptual content, the horizontally and vertically adjacent square is shaded to indicate the strength of the match. Matches between different speakers are two-colour; matches by the same person are single-colour. The size of the squares represents the size of the conversation turns. (a) Features of a good doctor/patient consultation: strong engagement between the doctor (blue) and patient (red) is observed throughout the whole consultation, visible as many two-colour recurrence blocks. (b) Features of a poor doctor/patient consultation: good engagement between the doctor (blue) and patient (red) is seen early in the consultation; however, the consultation degrades over time as the patient (red) begins to repeat concepts because the doctor fails to engage with them.


PRESENTATION
Session 5: Visualising Interaction

Transcripts Beyond Text: Tools and Techniques for Visualizing and Quantifying Discourse

Tyler Kendall (Dept. of Linguistics, University of Oregon)

Both within and outside linguistics, the orthographic transcript is the primary representation used to present speech in a non-aural format. There is no doubt that transcripts – typically text-based representations of speech – are extremely valuable. At the same time, several scholars (e.g., Ochs 1979, Mishler 1991, Bucholtz 2000, Edwards 2001) have noted that the decisions made while creating transcripts influence and ultimately constrain their resulting possible readings and uses. Decisions as seemingly straightforward as how to lay out the text, to those more nuanced – like how much non-verbal information to include and how to encode minutiae such as pause length and utterance overlap – have far-reaching effects on the utility of a transcript. This paper discusses visualization strategies as they have developed from the approach to transcription implemented by the Sociolinguistic Archive and Analysis Project (SLAAP; http://ncslaap.lib.ncsu.edu/), a web-based preservation and research repository of sociolinguistic audio recordings housed at the North Carolina State University Libraries (Kendall 2007). In addition to providing password-protected online access to a growing collection of sociolinguistic interview recordings (currently, over 2,400 interviews), the project features a range of tools for exploring and analyzing the audio files and their associated transcripts. SLAAP’s transcripts are time-aligned at the phonetic utterance (pause-group) level and stored in a relational database. The transcripts are then queried, accessed, and manipulated through a range of dynamic interfaces available through the SLAAP website. Through the click of a mouse, users can change the presentation of the transcribed data, and, thus, easily gain multiple perspectives on the data (addressing some of the concerns of, e.g., Ochs 1979, Mishler 1991). Yet, even with dynamic formatting and presentation, text-based views alone remain limited in the possible perspectives they allow.

Visual methods – beyond text – provide opportunities to explore discourse, and in particular discourse timing, in powerful ways. In this talk, I focus on two visualization features developed in SLAAP which allow for new examinations of discourse data. First, I discuss a graphical approach (graphicalization in SLAAP’s terminology; Kendall 2007), which presents a tablature-like view of speech timing. In this view, each talker’s speech is presented as a row of shaded rectangles and blank spaces, which accurately depict features like speaker articulation rates, utterance lengths and timing, and pause locations and durations. This view allows users to quickly assess how several aspects of speech timing unfold within and across talkers in discourse. Second, I present SLAAP’s implementation of the Henderson graph (Henderson et al. 1966, Kendall 2009), a visualization method which graphs talk-time along the horizontal axis and pause-time along the vertical axis in a “stair-wise” portrayal of talk that depicts speech fluency and hesitancy and their changes over time. Slope lines and other quantitative measures can be generated over the stretches of talk depicted in a Henderson graph and examined as dependent/independent variables in analysis. In the talk, I discuss and illustrate both the utility of these visualization methods for substantive research on discourse and the technical implementation of the graphing software.
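The Henderson-graph construction can be sketched directly from a time-aligned transcript. This is a hedged reconstruction of the idea rather than SLAAP's implementation: cumulative talk time advances along x and cumulative pause time along y, so fluent stretches run flat and hesitant ones climb. The (talk, pause) durations below are invented.

```python
# Stair-wise Henderson-graph vertices from (talk, pause) durations
# (illustrative sketch; durations invented).

def henderson_points(segments):
    """segments: list of (talk_seconds, following_pause_seconds).
    Returns the stair-wise (x, y) vertices of the Henderson graph."""
    x = y = 0.0
    points = [(0.0, 0.0)]
    for talk, pause in segments:
        x += talk
        points.append((x, y))   # horizontal step: speaking
        y += pause
        points.append((x, y))   # vertical step: pausing
    return points

pts = henderson_points([(2.0, 0.5), (3.0, 1.5), (4.0, 0.0)])
# Final vertex (9.0, 2.0): 9 s of talk against 2 s of pause overall.
```

The slope of a line fitted to these vertices is one of the quantitative fluency measures the abstract mentions: steeper means more pausing per unit of talk.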


PRESENTATION

Session 5: Visualising Interaction

Visualizing Spoken Conversation Structure

Li-chiung Yang (Faculty of Arts, Tunghai University & Institute of Linguistics, Academia Sinica)

The representation of sound in graphical form has an ancient history, as exemplified in the association of aural pitch, rhythm, and duration to visual distinctions in musical notation systems passed down from our ancestors. Like music, conversational speech is particularly well-suited to visual representation because of the iconic association of pitch and duration with height and length. Both with music and with spoken conversation, variation in duration, pitch level and movement, and amplitude convey levels of meaning that enhance and go beyond a strictly lexically-based understanding of human emotional and cognitive states.

For the current study, we focus on the graphical representation of topic in spontaneous conversational dialogues. We emphasize how components of pitch and duration that have traditionally been used to explore syllable, word, and boundary forms can be extended to explore speech at the scope of phrases and entire conversations. In particular, we demonstrate how graphical representation of phrase-to-phrase pitch movement within conversations reveals the topic and interactive structure of the conversation. The visual representation obtained through identification of phrase peak-pitch points conveys key elements of both local and global structure in conversation, because it mirrors the contextual meanings of pitch variation that are inherent in spoken language. The visualization technique we present is thus a direct reflection of physical sound in visual form, and simultaneously a reflection of the underlying cognitive and interactive processes that occur in conversations.
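The peak-pitch extraction step can be sketched as follows. This is an assumption-laden toy (the pitch track and phrase boundaries are invented, and real pitch tracks come from an F0 tracker): for each phrase, take the maximum F0; plotting these peaks in sequence gives the phrase-to-phrase contour described above.

```python
# Per-phrase peak-pitch extraction from a (time, F0) track
# (illustrative sketch; times and Hz values invented).

def phrase_peaks(track, boundaries):
    """track: list of (time_s, f0_hz); boundaries: (start, end) per phrase.
    Returns the peak F0 of each phrase (None if a phrase has no samples)."""
    peaks = []
    for start, end in boundaries:
        in_phrase = [f0 for t, f0 in track if start <= t < end]
        peaks.append(max(in_phrase) if in_phrase else None)
    return peaks

track = [(0.0, 180), (0.2, 220), (0.4, 200),
         (0.6, 160), (0.8, 240), (1.0, 210)]
peaks = phrase_peaks(track, [(0.0, 0.6), (0.6, 1.2)])
# peaks == [220, 240]: one peak-pitch point per phrase.
```

Plotting one such peak sequence per speaker, stacked as in Figures 1a-b, exposes the rise-fall arches the abstract describes.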

Visualization using this method can either abstract away from lexical meaning to bring out elemental relationships among phrases and between the speech activities of participants, or enhance our understanding of lexical meaning by focusing attention on the contextual meanings that accompany the speech stream. We demonstrate how the graphical technique presented provides a greater understanding of overall topic flow; of local interrelationships between successive phrases, such as relative certainty or uncertainty and connected idea or logical sequences; and of the intensity of speaker involvement in a topic. In addition, graphical separation by speaker over the extent of a conversation is shown to provide important insights into how speakers coordinate their cooperative building of a conversational flow. In particular, we show that visualization increases our understanding of how participants coordinate topic flow through pitch-signalled indications of agreement on topic, and of how interruptions are introduced, distributed, and coordinated throughout conversations.

The great strength of the visualization method presented lies in its unification of local and global views of conversational speech. The visualization suggests that, like story-telling, musical performance, and dance, the prosody of conversational speech has an internal structure in which inner states are communicated in a cooperatively constructed flow of sound variations. These phrase-to-phrase variations systematically signal topic development as well as interactional, cognitive, and emotional aspects of speaker state at several different levels of prosodic form, and it is the direct, iconic link to underlying contextual expressiveness through sound that makes significant advances in our knowledge of the nature of spoken language possible.

Figures 1a-b: Plots of 600 consecutive pitch peaks of both speakers in an extended conversation. The upper plot represents speaker A, the lower plot speaker B. Three extended rise-fall arches can be seen in U100-U275, U275-U425, and U450-U575 in speaker A’s pitch height movement. Speaker B’s pitch movements in the corresponding sections also reflect speaker involvement with the topic development.


PRESENTATION
Session 6: Knowledge Visualisation

Knowledge Visualization and the Depiction of Conceptual Relations in a Multimodal Terminological Database

Juan Antonio Prieto Velasco (University Pablo de Olavide, Sevilla, Spain) Clara Inés López Rodríguez (University of Granada, Spain)

Computational and artificial intelligence approaches have paved the way for cognitive theories of Linguistics and Terminology. In fact, Knowledge Visualization has arisen as an interdisciplinary field with a cognitive orientation, which aims to “examine the use of visual representations to improve the creation and transfer of knowledge between at least two people” (Eppler and Burkhard 2004, 2007).

Among these theories, Frame-based Terminology (Faber 2011; Faber et al. 2006) explores how concepts are represented in multimodal environments using corpora, and establishes the premises for the design of terminological knowledge bases such as EcoLexicon. EcoLexicon is a visual thesaurus on the environment, developed with ThinkMap technology, in which each concept appears in the context of a specialized frame called the Environmental Event (Faber et al. 2006) that highlights its relations to other concepts.

We have contributed to the representation of the environmental domain and to the transfer of environmental knowledge by including a repertory of visual resources that facilitate knowledge-management tasks such as categorization, concept retrieval, and term activation. EcoLexicon offers access through a multimodal interface with modules devoted to different semiotic modes (conceptual, linguistic, and graphical resources), which must be selected in a principled way so that images represent the most salient attributes of concepts (Prieto 2008; Prieto and López 2009), as shown in the picture below.

To facilitate knowledge visualization, we need to study the arrangement of information in time and space in order to determine which of the concepts depicted in EcoLexicon are new or presupposed; which refer to ideal or real elements; and which are emphasized because they are central, as opposed to other, ancillary elements, as Kress and Van Leeuwen (2006: 197) suggest. A meaningful depiction of conceptual relations should therefore account for the semantic role of concepts, as well as the arrangement of entities/processes in a framed context, following the logics of time and space (Kress 2009: 56). In this way EcoLexicon can meet the needs of different user groups (experts, people interested in the environment, students, translators, technical writers, etc.) so as to guarantee knowledge transfer at all levels of cognition.

In this paper, we discuss the issue of knowledge visualization through the convergence of texts and images according to conceptual structures such as CENTRE-PERIPHERY, PART-WHOLE (part_of/has_part), GENERAL-TO-SPECIFIC (is_a/type_of), GIVEN-NEW, and IDEAL-REAL (Martinec and Van Leeuwen 2009; León 2009). To assess which images are suitable for inclusion in EcoLexicon, we analyze how information is foregrounded and organized in images, and how conceptual interrelations are depicted in definitions and highlighted in visual resources and category hierarchies, as in the example below.
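A typed-relation store of the kind such structures imply can be sketched minimally. This is an illustrative toy, not EcoLexicon's actual data model; the concepts and relations below are invented (though GROYNE is the concept shown in Figure 3).

```python
# Toy typed-relation store: concepts linked by labelled relations
# (is_a, part_of, ...) that can be queried per relation type.

relations = [
    ("groyne", "is_a", "coastal defence structure"),
    ("groyne", "part_of", "groyne field"),
    ("seawall", "is_a", "coastal defence structure"),
]

def related(concept, rel_type, rels):
    """All concepts linked to `concept` by relation `rel_type`."""
    return [tgt for src, rel, tgt in rels if src == concept and rel == rel_type]

hypernyms = related("groyne", "is_a", relations)
# hypernyms == ["coastal defence structure"]
```

A visual thesaurus renders exactly this kind of typed graph, with the relation label deciding how the edge is drawn.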

Figure 1. Environmental event structure

Figure 2. EcoLexicon multimodal interface


PRESENTATION

Session 6: Knowledge Visualisation

Finally, we assume that the depiction of specialized knowledge is essential to the design of multimodal terminological databases, and we share the foundations of Knowledge Visualization, which hold that images can be as effective as terms for the representation of specialized concepts.

References:

Eppler, Martin J. and Remo A. Burkhard (2004). “Knowledge Visualization: towards a new discipline and its field of application”. ICA Working Paper #2/2004, University of Lugano, Switzerland.

Eppler, Martin J. and Remo A. Burkhard (2007). “Visual representations in knowledge management: framework and cases”. Journal of Knowledge Management, vol. 11, no. 4, 112-122.

Faber, Pamela (2011). “The dynamics of specialized knowledge representation: simulational reconstruction or the perception–action interface”. Terminology 17, no. 1: 9-29.

Faber, Pamela; Pilar León Araúz; Juan Antonio Prieto Velasco; Arianne Reimerink (2007). “Linking Images and Words: the description of specialized concepts”. International Journal of Lexicography 20, no. 1: 39-65.

Faber, Pamela; Silvia Montero Martínez; María Rosa Castro Prieto; José Senso Ruiz; Juan Antonio Prieto Velasco; Pilar León Araúz; Carlos Márquez Linares; and Miguel Vega Expósito. 2006. “Process-oriented terminology management in the domain of Coastal Engineering”. Terminology 12, no. 2: 189-213.

Kress, Gunther R. (2009). Multimodality: A social semiotic approach to contemporary communication. London/New York: Routledge.

Kress, Gunther R. and Theo Van Leeuwen (2006). Reading images: the grammar of visual design. London/New York: Routledge.

León, Pilar (2009). Representación multidimensional del conocimiento especializado: el uso de marcos desde la macroestructura a la microestructura. Unpublished doctoral dissertation. University of Granada.

Martinec, Radan and Theo Van Leeuwen (2009). The language of new media design: theory and practice. London/New York: Routledge.

Prieto Velasco, Juan A. and Clara I. López Rodríguez (2009). “Managing graphic information in terminological knowledge bases”. Terminology 15, no. 2: 179-213.

Prieto, Juan A. (2008). Información gráfica y grados de especialidad en el discurso científico-técnico: un estudio de corpus. Unpublished doctoral dissertation. University of Granada.

Figure 3. Depictive scheme of conceptual relations in the terminological entry GROYNE


PRESENTATION
Session 6: Knowledge Visualisation

Visual Methods for Figurative Meaning Explanation in Science

José Manuel Ureña Gómez-Moreno (Lecturer at the University of Castilla La Mancha, Department of Modern Languages, Campus of Ciudad Real, Spain)

The relationship between multimodality and cognitive effects has become an inescapable trend in Cognitive Linguistics. Visual information is the dominant component of our conceptualisation system (Watt 1991), and thus, it plays a prominent role in the analysis of perceptual data, including non-verbal metaphor (Forceville and Urios-Aparisi 2009: 6).

Research in pictorial metaphor is scarce in science and even scarcer when it comes to explaining specialised concepts through dynamic images and videos. In the domain of coastal engineering, Prieto (2008) proposes a typology of pictorial devices according to their levels of iconicity (i.e. scale of isomorphism), abstraction (the recipient’s cognitive effort to interpret an image), and dynamicity (capacity of an image to represent motion). Nevertheless, none of these studies addresses metaphorical representations.

To fill this gap, this paper examines: (i) pictures extracted from a corpus of publications covering different biology subdomains; (ii) videoclips that feature animals and biological processes. Both (i) and (ii) include expert and science popularisation materials. It is shown that visuals often require or invite the construal of metaphors not only for pedagogical purposes, but also for theory-constitutive and heuristic purposes. Evidence is thus provided of the cognitive and semiotic potential of metaphor in this knowledge field, a potential which facilitates the understanding of specialised concepts and text comprehension. For instance, Figure 1, extracted from an academic journal article, explains the interaction between two wind drifts—which is crucial for specific sea organisms—through four conceptual metaphors:

(i) IMPORTANT IS CENTRAL: H stands for high pressure, which is a major cause for certain winds to occur. For this reason, H is placed at the centre of the wind drifts on the map.

(ii) IMPORTANT IS BIG: the prominent size of H on the map is hardly a coincidence. This is a common metaphor in everyday language too (Grady 1997).

(iii) COLD IS BLUE: blue arrows stand for cold currents, which helps the reader of the article easily identify the nature of the winds on the map.

(iv) HEAT IS RED: red arrows stand for hot currents. The same claim as in (iii) is in order.

Figure 1. Conceptual metaphors explaining wind drift interactions

An example involving dynamic images can be consulted at http://www.youtube.com/watch?v=fhBZ40jIo4Q. This videoclip explains to laymen why the archerfish receives its metaphorical name.

The biology pictures were then classified according to their levels of iconicity, abstraction, and dynamicity. This classification revealed the incidence of visual thinking (Caballero 2006) in pictorial metaphor. Accordingly, it was found that certain visual resemblance metaphors in the biology corpus combine imagistic and cognitive facets, which blend with each other to shape and communicate scientific knowledge. Visual metaphors were also described on the basis of the familiarity and accessibility of their source domains. In verbal communication, scientists sometimes go beyond familiar experiences to provide metaphorical models of particular phenomena (Semino 2008: 155). This paper shows that this also holds true for pictorial representations of biology concepts.

References:

Caballero, R. 2006. Re-viewing Space: Figurative Language in Architects’ Assessment of Built Space. Berlin and New York: Mouton de Gruyter.

Forceville, C. and E. Urios-Aparisi. 2009. Setting the scene: Introduction. In Forceville and Urios-Aparisi (Eds.), Multimodal Metaphor, 1–18. Berlin: Mouton de Gruyter.

Grady, J. (1997). Foundations of meaning: primary metaphors and primary scenes. Ph.D. dissertation, University of California.

Prieto, J.A. 2008. Información gráfica y grados de especialidad en el discurso científico-técnico: un estudio de corpus. Ph.D. dissertation, University of Granada.

Semino, E. 2008. Metaphor in Discourse. Cambridge University Press.

Watt, R. 1991. Understanding Vision. New York: Academic Press.


KEYNOTE

Keynote 4

Some Challenges and Directions for the Visualization of Language and Linguistic Data

Chris Culy (University of Tübingen)

While linguists and language professionals have long used visual representations of information (e.g. syntax diagrams, language family trees, lexical meaning diagrams), digital, interactive visualizations are becoming more popular due to their usefulness and flexibility.

At the same time, language and linguistic (L/L) data poses some interesting challenges for visualization, because L/L data differs significantly from other types of data. The most important difference is that language is not mappable: unlike, say, numbers, which can be represented by location or size, language cannot in general be represented in a more compact, humanly understandable way.

I will show why these properties of L/L data are a challenge for visualization and I will give some working examples of how they might be addressed. I will also discuss how the visualization of L/L data can contribute to further explorations of the connections between cognitive psychology and computer interfaces. Finally, I will present some concrete suggestions for future directions for the development of the field of visualization of L/L data.

Conference Day 2 Ends


POSTER

Visualisation of Prosody in English and Arabic Speech Corpora

Claire Brierley (School of Computing, University of Leeds, Leeds, UK)

Majdi Sawalha (Department of Computer Information Systems, King Abdullah II School for Information Technology, University of Jordan, Amman, Jordan)

1. Prosodic Boundary Mark-up for English

Prosodic-syntactic chunking is a language universal [1]: we process linguistic content by splitting it into meaningful stand-alone units, where each chunk includes at least one accented word with pitch variation on the stressed syllable(s). In English, punctuation in text, plus pauses and inflections in speech, signifies phrase breaks or boundaries between successive chunks. Boundary annotation schemes for British and American English, as in the Spoken English Corpus [2] and the Boston University Radio News Corpus [3], are an attempt to visualise perceived boundaries and to distinguish relative boundary strengths. The scheme for British English describes three levels of juncture between words: none, minor break (|), and major break (||); that for American English describes five, {0, 1, 2, 3, 4}, with break indices {3} and {4} mapped to the tone unit boundaries (|) and pauses (||) of the former scheme.
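The mapping between the two annotation schemes is mechanical and can be sketched in a couple of lines. The example sentence and break indices below are invented for illustration.

```python
# Map American five-level break indices {0..4} onto the British
# three-level scheme: 3 -> minor break (|), 4 -> major break (||),
# anything else -> no marked break. Example data is invented.

def to_british(break_index):
    """Convert a break index to the British boundary symbol."""
    return {3: "|", 4: "||"}.get(break_index, "")

annotated = [("the", 0), ("cat", 1), ("sat", 3), ("down", 4)]
marked = " ".join(w + to_british(b) for w, b in annotated)
# marked == "the cat sat| down||"
```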

2. Prosodic Boundary Mark-up for Arabic

We report on automated phrase break prediction for Arabic [4], [5]. Most editions of the Qur’an¹, intended to help both native and non-native Arabic speakers parse the text during oral recitation, carry fine-grained boundary annotations (Table 1), which we have adapted (cf. Fig. 1) in our machine-readable Boundary-Annotated Qur’an Corpus [6].

This prescriptive scheme of stops and starts is one component of the art of Tajwīd, or recitation, where boundary annotations (Fig. 1) project ‘implicit prosody’ onto the text (Fodor, 2002) to guide reading (whether silent or vocal) and comprehension.

Figure 1: We have mapped our boundary annotation scheme in the Boundary Annotated Qur’an Corpus to Tajwīd mark-up, as in this example from The Holy Qur’an: 29.45

Alternative phrasing given by Tajwīd mark-up in verse 29.45 economically subsumes two phrasing variants for the following extract, as translated into English and utilising our adapted minor/major boundary divide [6] (Fig. 2): (1) Recite [O Muhammad] what has been revealed to you of the Book and establish prayer indeed prayer prohibits immorality and wrongdoing | and the remembrance of Allah is greater | And Allah knows that which you do || (2) Recite [O Muhammad] what has been revealed to you of the Book and establish prayer | indeed prayer prohibits immorality and wrongdoing | and the remembrance of Allah is greater | And Allah knows that which you do ||

Figure 2: Two phrasing variants enabled by exploiting optional Tajwīd boundary mark-up

Table 1: Symbols are used to visualise fine-grained levels of juncture between words in Qur’anic verses


POSTER

In variant (1), there is an option to stop after the first occurrence of prayer: the speaker/turn-taker emphasises how the beneficial effects of regular prayer are immediately felt on its establishment. In variant (2), the speaker/turn-taker uses the optional stop to differentiate these two events.
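How optional stops generate the phrasing variants just discussed can be sketched combinatorially: each optional boundary may be realised or not, so n optional stops yield up to 2**n surface phrasings. The word list and boundary site below are invented, loosely echoing the extract above; this is not the corpus annotation format.

```python
# Enumerate surface phrasings from optional boundary sites: each optional
# site independently takes or omits a minor break (|). Data is invented.
from itertools import product

def phrasings(words, optional_after):
    """All variants obtained by realising or skipping each optional stop.
    optional_after: word indices after which a '|' may be inserted."""
    variants = []
    for choice in product([False, True], repeat=len(optional_after)):
        taken = {i for i, on in zip(optional_after, choice) if on}
        out = [w + (" |" if i in taken else "") for i, w in enumerate(words)]
        variants.append(" ".join(out))
    return variants

v = phrasings(["establish", "prayer", "indeed", "prayer", "prohibits"], [1])
# Two variants: with and without a stop after the first "prayer".
```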

3. Colour coded Arabic Phonetics and Phonology

Tajwīd editions of the Qur’an also use colour coding for guidance on how to intone the verse, and especially on accenting. Our corpus adopts the same widely-used recitation style (ḥafṣ bin ‘āṣim) as this e-book (Fig. 3) published by the Islamic Bulletin. Here, the use of orange and red fonts in the opening chapter of the Qur’an denotes permissible and necessary prolongation respectively on prominent syllables. For example, in verse 1.3, orange mark-up in the invocation to Allah as Master of the Day of Judgment highlights the syllable nucleus in the penultimate syllable: malīki yawmi al-dīni (verse transliteration). Similarly, red and orange mark-up on the last word within the final phrase in verse 1.7 ...walā al-dālīna (…and not of those who go astray) reinforces the (already considerable) sonority of the open vowel in l-dālīna and its immediate successor: the high vowel al-dālīna. We plan to extract this wealth of linguistic knowledge (i.e. morphology, syntax, phonetics, and phonology) in Qur’anic orthography, and to model it in chunking algorithms for Arabic, for example to supplement sparse punctuation in Modern Standard Arabic.

1. For example: http://tanzil.net/download

References:

[1] Ladd, R. 1996. Intonational Phonology. Cambridge: Cambridge University Press.

[2] Taylor, L.J. and Knowles, G. 1988. Manual of Information to Accompany the SEC Corpus: The Machine Readable Corpus of Spoken English. Online. Accessed: January 2012. http://khnt.hit.uib.no/icame/manuals/sec/INDEX.HTM

[3] Ostendorf, M., Price, P. and Shattuck-Hufnagel, S. 1996. Boston University Radio Speech Corpus. Philadelphia: Linguistic Data Consortium.

[4] Authors. 2012a. Boundary-Annotated Qur’an Corpus for Arabic Phrase Break Prediction. To appear in proceedings: Speech is where it’s at: IVACS Annual Symposium 2012. Cambridge.

[5] Authors. 2012b. Predicting Phrase Breaks in Classical and Modern Standard Arabic Text. Submitted to LREC 2012.

[6] Authors. 2012c. Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing. Submitted to LREC 2012.

Figure 3: The Holy Qur’an, Chapter 1: http://www.islamicbulletin.com/services/details.aspx?id=260

POSTER

Visualization of Corpus Composition for Machine Translation

Marco Brunello (School of Modern Languages and Cultures, Centre for Translation Studies, University of Leeds)

Automatically locating documents that are similar to each other can have several useful practical applications. For example, a classification into domain and genre “can be used for various purposes, such as improving the relevance of information retrieval or selecting more appropriate language models in POS tagging, parsing, machine translation, or in word sense disambiguation” (Sharoff, 2007). In the field of machine translation (MT), the capability of recognizing the text variety of documents is very important, not only for those MT paradigms that rely almost completely on corpus-based systems (statistical and example-based MT), but also for rule-based MT – where grammatical rules and dictionary entries can be extracted from documents belonging to the same text type as the texts we want to machine-translate – and computer-assisted translation – where it is useful to create translation memories of documents of the genre or domain of interest for fuzzy matches, as suggested in Eberle et al. (2012).

From this point of view, it is possible to maximise the benefits of parallel corpora that are large but restricted to narrow communicative situations, such as Europarl, by selecting the most suitable training texts for a specific translation task. This can be done in several ways, e.g. by employing document similarity measures, or document classification with topic modeling (Steyvers & Griffiths, 2006). In both cases, the output lends itself to an intuitive graphical representation, which helps us understand the composition of the corpus and decide which data should be selected for our purpose.

In this study, I experimented with the Europarl corpus, trying to understand what topics it contains (see Figures 1 and 2). Here the use of graphics was helpful in discovering and displaying the main topics discussed in Europarl. This knowledge, together with the cosine similarity measure, was then used to select the texts most similar to a test document, and these texts served as the training set for the Moses SMT system (Koehn et al., 2007), with successful results.
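As an illustrative sketch of the similarity-based selection step: given a test document, rank the corpus documents by cosine similarity and keep the top-ranked ones as training data. The toy documents below are invented for illustration, and a real system would use TF-IDF weighting rather than raw token counts.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented mini-corpus standing in for Europarl documents.
corpus = {
    "doc1": "debate on agricultural subsidies in the union",
    "doc2": "report on fisheries policy and fishing quotas",
    "doc3": "resolution on the common agricultural policy and rural subsidies",
}
test_doc = "amendment to the common agricultural policy"

test_bow = Counter(test_doc.split())
scores = {name: cosine(Counter(text.split()), test_bow) for name, text in corpus.items()}

# Documents ranked by similarity to the test document; the top-k
# would be selected as the SMT training set.
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these toy documents, `doc3` (which shares the most vocabulary with the test document) ranks first.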

References:

Eberle, K., Babych, B., Geiß, J., Ginestí-Rosell, M., Hartley, A., Rapp, R., Sharoff, S., et al. (2012). Design of a hybrid high quality machine translation system. Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra) (pp. 101-112). Avignon, France: Association for Computational Linguistics.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., et al. (2007). Moses: open source toolkit for statistical machine translation. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (pp. 177-180). Stroudsburg, PA, USA: Association for Computational Linguistics.

Sharoff, S. (2007). Classifying Web corpora into domain and genre using automatic feature identification. Proc. of Web as Corpus Workshop (Vol. 5, pp. 1-10). Louvain-la-Neuve.

Steyvers, M., & Griffiths, T. (2006). Probabilistic Topic Models. In T. Landauer, D. Mcnamara, S. Dennis, & W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum.

1. http://gephi.org

Figure 2: Highlighting of a specific topic and related keywords with the same program

Figure 1: Graphic representation of topics contained in Europarl, with keywords (lower-case) and topics (upper-case), made with the interactive visualization platform Gephi1


POSTER

Measuring the Optimization of Vowel-Spaces: A Method for Cross-Linguistic Analysis

Jon William Carr (Language Evolution and Computation Research Unit, School of Philosophy, Psychology and Language Sciences, University of Edinburgh)

Recently, a number of studies looking at the emergence and evolution of phonological systems have shown that, given sufficient time, organizations of the articulatory space emerge in which phonemes are maximally distinctive perceptually (e.g. Steels, 1997; de Boer, 2000; Oudeyer, 2005; de Boer & Zuidema, 2010). However, there has been little investigation into the typological description of articulatory optimization in the world’s languages. It is not known, for example, how optimized natural vowel-spaces actually are, or whether the vowels of, for example, English are more or less distinctive than those of, for example, Swahili. Peterson and Barney (1952) showed that it is possible to ‘see’ the spatial relationships between vowels by plotting their first and second formant frequencies in a plot with reversed logarithmic axes. This provides a useful means to visualize the space used by the tongue to produce vowels of differing qualities. By converting the formant frequencies to a linear psychoacoustical scale (e.g. the bark, mel, or ERB-rate scales), it is possible to calculate the perceptual distances between vowels, and obtain a measure of the degree to which a particular organization of vowels is optimized.

In this poster I introduce such a methodology, which (to my knowledge) has not been attempted previously. This is achieved in three steps: first, we measure the formant frequencies of the basic set of monophthongs in a given language; second, we plot these vowels in a perceptual space; third, we compare this vowel-space to 100,000 randomly simulated vowel-spaces. Since an optimized distribution of vowels should deviate greatly from distributions generated stochastically, this method provides a robust measure of the nonrandomness (i.e. optimization) of a vowel-space. I call this measure the vowel optimization quotient (VOQ).
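The three steps can be sketched as follows. This is not the poster's actual implementation, only a minimal illustration: the Hz-to-Bark formula (Traunmüller's), the formant sampling ranges, and the example formant values are all assumptions, and the score shown is a VOQ-style ratio of observed to random dispersion rather than the author's exact statistic.

```python
import math
import random

def hz_to_bark(f):
    """Traunmüller's Hz-to-Bark conversion (one common psychoacoustic scale)."""
    return 26.81 * f / (1960.0 + f) - 0.53

def mean_pairwise_distance(vowels):
    """Mean Euclidean distance between all vowel pairs in (F1, F2) Bark space."""
    pairs = [(a, b) for i, a in enumerate(vowels) for b in vowels[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def voq(vowels_hz, n_sim=2000, f1_range=(250, 900), f2_range=(600, 2800), seed=1):
    """VOQ-style score: observed dispersion divided by the mean dispersion
    of randomly simulated vowel-spaces of the same size. Values well above
    the random baseline indicate a more-than-random (optimized) spread."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(
        [(hz_to_bark(f1), hz_to_bark(f2)) for f1, f2 in vowels_hz])
    sims = []
    for _ in range(n_sim):
        rand = [(hz_to_bark(rng.uniform(*f1_range)), hz_to_bark(rng.uniform(*f2_range)))
                for _ in vowels_hz]
        sims.append(mean_pairwise_distance(rand))
    return observed / (sum(sims) / n_sim)

# Five peripheral vowels (rough, invented F1/F2 values in Hz) should score
# higher than five crowded central vowels.
peripheral = [(300, 2300), (300, 800), (700, 1300), (450, 1900), (450, 1000)]
crowded = [(450, 1400), (470, 1450), (500, 1500), (520, 1350), (480, 1550)]
```

A dispersed system like `peripheral` scores well above a crowded one, mirroring the Kamba-versus-Amharic contrast reported below.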

Using recordings from the UCLA Phonetics Lab Archive (2007), this method has been applied to 100 languages. The results suggest that there is a high level of variation in the extent to which vowel-spaces are optimized. The Kamba language, for example, uses a set of vowels that are highly distinctive from one another (fig. 1a), while the Amharic language uses a set of vowels that are not so perceptually distinctive (fig. 1b). The method has potential applications in phonology and linguistic typology, and, given its reliance on a number of visual methods, such as multidimensional vowel plots, spectrograms, and diagrams of the vocal tract, it also highlights the important role visual methods play in modern linguistics.

Figure 1. Plot (a) shows the vowel-space for Kamba, a Niger-Congo language, in which the distribution is highly optimized (VOQ = 3.8). Plot (b) shows the vowel-space for Amharic, an Afro-Asiatic language, in which the distribution is relatively unoptimized (VOQ = 1.6).

References:

de Boer, B. (2000). Self-organization in vowel systems. Journal of Phonetics, 28(4), 441–465.

de Boer, B., & Zuidema, W. (2010). Multi-agent simulations of the evolution of combinatorial phonology. Adaptive Behavior, 18(2), 141–154.

Oudeyer, P.-Y. (2005). The self-organization of speech sounds. Journal of Theoretical Biology, 233(3), 435–449.

Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184.

Steels, L. (1997). The synthetic modeling of language origins. Evolution of Communication, 1(1), 1–34.

UCLA Phonetics Lab Archive (2007). Los Angeles, CA: UCLA Department of Linguistics. http://archive.phonetics.ucla.edu/.

POSTER

Visual Methods for Understanding the Language of the Quran

Kais Dukes and Eric Atwell (I-AIBS Institute for Artificial Intelligence and Biological Systems, School of Computing, University of Leeds)

We present an interdisciplinary approach to understanding the language of the Quran, combining visual methods with ideas from Quranic Studies, traditional Arabic linguistics and computational linguistics research, and hopefully feeding back to all three areas.

The Quran is the last in a series of five major religious texts. Believers hold that God gave the message to the angel Gabriel to pass it on to Muhammad to learn by heart and pass on to all mankind.

It is written in Classical Arabic, from 1300 years ago, before the English language existed.

All believers are supposed to try and understand the original text rather than translations or interpretations. It has guided philosophy, science and other aspects of knowledge; particularly Arabic linguistics, which was initially developed to try and help believers understand the Quran.

Today, there are many websites where you can access English translations of the Quran and try keyword searches. However, verse-by-verse translations are “interpretations”, and Muslims should aim to access the “true” Classical Arabic source.

Readers trying to understand the Quran may want to go beyond keyword searches; e.g. to ask questions in plain English, like “How long should I breastfeed my child for?” and have a way to find the Quran verse which has relevant meaning to answer the question.

We augmented the text of the Quran with rich linguistic annotation, which could lead to new ways to access the underlying language and meaning of the text. The Classical Arabic of the Quran presents several challenges: complex orthography or script, highly inflected morphology or word structure, non-linear grammar combining elements of dependency and phrase structure, and lexical semantics built around entities and concepts referred to by pronouns and nouns.

All these levels had to be made accessible to the non-linguist general public looking to understand the source Quranic verses. So, we developed a visual approach to displaying the layers of linguistic information in the Quranic Arabic Corpus:

■■ As the base - the verified Uthmani script for each word; the Arabic is read from right-to-left.

■■ For non-Arabic speakers, there is also a phonetic transcription – not using the true International Phonetic Alphabet, but something like the standard Roman alphabet, so English speakers who have no phonetics training can probably work it out.

■■ Although most Muslims do not speak Arabic, many more speak English; so we added an interlinear word-by-word exact translation of the Arabic words and morphemes, enabling English speakers to access a more literal translation.


■■ There is a referencing system which allows you to locate any particular chapter, verse, word, and even morphological segment.

■■ Arabic word structure is quite complex: a typical word may have a root, for example a verb, with a conjunction at the start of it, and subject and object pronouns after it. So, we have divided the word into individual segments.

■■ We give detailed information on the grammatical categories of the individual segments. So this example is – reading from right-to-left – a conjunction, followed by a main verb, followed by subject pronoun, followed by an object pronoun.

■■ And for use by Arabic grammarians, there is also an automatically generated Arabic translation of the grammatical description.

■■ At the sentence level, there is a parse structure tree, or diagram showing the grammatical structure for each sentence, based on traditional Arabic grammar rather than modern theories.

■■ We also display a complex ontology, a set of all the entities or ‘things’. Every noun or pronoun refers to some ‘thing’ and this is linked to from the text, so you can find other instances of the entity from the ontology.

■■ Overarching the linguistic annotation, there is a complex architecture or framework for collaboration; including a message board, so that anybody finding anything wrong can point it out, and a large set of downloadable resources including the software and the data.

■■ This is used by researchers and members of the public worldwide. This map on the website shows where the users are: many in America and Britain, but also around the whole world. And these are not just lay people trying to read the Quran, but also researchers worldwide.

POSTER

Investigation into Terrorist Activity: VAST 2011 Challenge

Sharmin (Tinni) Choudhury, Chris Rooney, Kai Xu, Raymond Chen, William Wong (Middlesex University)

Eric Atwell, Claire Brierley (University of Leeds)

This research is an example of combined visualization and linguistic analysis, applied to a practical challenge.

The VAST’2011 Challenge was a contest aimed at the visualization research community. For the third part of the challenge, researchers were given a corpus of news reports from the imaginary city of Vastopolis; we had to analyze the corpus and identify the threads of a possible terrorist plot. The primary tool we used was Middlesex University’s INteractive VIsual Search and QUery Environment (INVISQUE). The INVISQUE user interface (UI), written in Adobe Flash, is supported by middleware written in Java that queries the MC3 dataset stored in a MySQL database. In addition, the University of Leeds used a Python-based implementation of their corpus analysis algorithm to generate log-likelihood statistics for words in the MC3 news article corpus; these were also stored in the MySQL database for ease of access. A simple Python script was used to transfer the MC3 data and the University of Leeds analysis results, which were in simple text files, into the MySQL database.

MC3 Potential Threats

The task was: “Identify any imminent terrorist threats in the Vastopolis metropolitan area. Provide detailed information on the threat or threats (e.g. who, what, where, when, and how) so that officials can conduct counterintelligence activities. Provide a list of the evidential documents supporting your answer.”

Analytics Process

We used the INteractive VIsual Search and QUery Environment (INVISQUE), a prototype visual analytics interface created at Middlesex University, to visually sift through the news corpus. INVISQUE uses index-card visualization to represent individual information items, in this case the news articles, and arranges them on screen on an X-Y axis. Figure 1 shows the search results for the keyword search “bomb” arranged on the X axis by significance and on the Y axis by date, so that news articles with a higher significance for the keyword “bomb” appear further to the left and newer articles appear higher up the Y axis.

The “significance” value was calculated by collaboration members from the University of Leeds, who performed keyword extraction on the news corpus. Keyword extraction is a standard corpus linguistics technique for genre classification which pinpoints statistically significant or “key” words for a genre via comparison with a general reference corpus. The significance calculation for the MC3 corpus entailed comparing word frequency distributions in each of the 4474 news article test sets with their distribution in the entire news article dataset as reference corpus. The Leeds program verifies apparent overuse of lexical items in each article by computing the difference between these observed frequencies and the norm, as represented by their expected frequency in the whole dataset, expressed as a log-likelihood (LL) statistic. Words with LL scores of 6.63 or above are statistically significant (p < 0.01).
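A minimal sketch of this kind of log-likelihood keyness calculation, in the common Rayson & Garside two-corpus formulation (the word counts and corpus sizes below are invented for illustration, not taken from the MC3 data):

```python
import math

def log_likelihood(freq_doc, doc_size, freq_ref, ref_size):
    """Log-likelihood keyness: how far a word's observed frequency in a
    document deviates from its expected frequency given the reference
    corpus. Scores of 6.63 or above are significant at p < 0.01."""
    total = doc_size + ref_size
    e_doc = doc_size * (freq_doc + freq_ref) / total  # expected count in document
    e_ref = ref_size * (freq_doc + freq_ref) / total  # expected count in reference
    ll = 0.0
    if freq_doc:
        ll += freq_doc * math.log(freq_doc / e_doc)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e_ref)
    return 2.0 * ll

# "bomb" occurring 12 times in a 500-word article but only 40 times in a
# 1,000,000-word reference corpus is strongly overused, hence a keyword.
score = log_likelihood(12, 500, 40, 1_000_000)
```

A word whose document frequency matches its reference rate scores near zero and falls below the 6.63 threshold.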

Single-word searches on the INVISQUE interface, e.g. bioterrorism, are applied against the word list generated

Figure 2: Table of Words in MySQL containing results of University of Leeds Keyword Extraction

Figure 1: INVISQUE index-card visualization arranged on X-Y axis


by the University of Leeds (see Figure 2), leading to the generation of index cards that have the matching keyword as the title and the significance of the keyword as the top-left value. Composite phrases, e.g. “Vast River”, are applied against the full article text; in the latter case, the title of the index card becomes the most significant keyword in the article, as calculated by Leeds. These are illustrated in Figure 3.

As shown in Figure 4, the cards also give a gist of the article by displaying its top three most significant keywords, the article title, the Vastopolis locations mentioned in the article (extracted and appended to the cards by the middleware from a pre-compiled list), and the date of the article. The cluster of returned cards can be filtered by any of the card features, and each card also has a shortcut to the full text of the article. As demonstrated in the accompanying video, the primary technique used to explore the corpus was visual searching and filtering. This technique allowed us to explore the corpus very thoroughly and very quickly, and we began to get a good picture of the happenings in Vastopolis within hours of beginning our exploration.

APPENDIX: Findings from Visual Analytic Investigation of the News Corpus

There were many interesting activities in Vastopolis, but most could not be considered “imminent terrorist” threats. However, what may pose an imminent threat involves equipment stolen from the labs of molecular biologist Professor Edward Patino. Prof. Patino has been harassed by the group Citizens for Ethical Treatment of Lab Mice, who in turn are affiliated with the Forever Brotherhood of Antarctica. The Professor himself has recently given lectures on the threat of bioterrorism; in addition, the Center for Disease Control (CDC) also released a recent report highlighting the dangers of bioterrorism. Since the robbery of the professor’s lab, the Brotherhood and the Citizens for Ethical Treatment of Lab Mice have shown an increased level of activity. Lastly, dead fish have turned up in the Vast River.

Therefore, we concluded that there may be an imminent threat to the Vastopolis metropolitan area from the Forever Brotherhood of Antarctica and their affiliates, the Citizens for Ethical Treatment of Lab Mice, involving some form of biological weapon created with the equipment stolen from Professor Patino’s lab.

Other events in Vastopolis, which were discounted as either being resolved or self-contained, include:

1. Military weapons went missing from Vastopolis Armed Forces on the 26-04-2011 and on the 30-04-2011, military grade weapons were used in a park shootout in Southville. However, the weapons were recovered at the Vastopolis airport on the 20-05-2011.

2. Two mental patients affiliated with the psychobrotherhood escaped the Vastopolis Center for the Criminally Insane on 27-04-2011 but were caught on the 12-05-2011 while trying to make a bomb. No further information was available in the corpus for psychobrotherhood.

3. An Antarctica Airlines plane crashed and traces of explosives were found in the wreckage but this is a past event. In addition, while there were articles about bad security at Vastopolis Airport, following the crash – security was increased.

4. A 60-year-old man built an improvised explosive device to kill his neighbour’s cat KeeKee, but that was a self-contained incident.

5. A man with a bomb concealed in a turkey was stopped at Vastopolis Airport, but that news article did not provide any hooks for further investigation.

6. The daughter of a military counter-intelligence agent was raped by another soldier and her identity exposed, but the article provided no course to follow.

7. F-Alliance, a group of hackers comprised of high-school drop-outs, was arrested; thus another resolved issue.

8. Anarchists for Freedom issue daily threats to Vastopolis officials, but there is no evidence they actually do more than bark.

9. Lastly, Vastopolis was included in a general threat issued by the overseas terror group Network of Dread.

Figure 3: Searching Using INVISQUE

Table 1: Timeline of News Articles

POSTER

Men, Women and Gods: Distant Reading in Literary Collections – Combining Visual Analytics with Language Technology

Dimitrios Kokkinakis (Center of Language Technology (CLT), Department of Swedish Language, University of Gothenburg, Sweden)

Daniela Oelke (University of Konstanz, Department of Computer and Information Science, Konstanz, Germany)

The volumes of digitized literary collections in various languages increase at a rapid pace, and so does the need to computationally support the analysis of such data. Literature can be studied in a number of different ways and from many different perspectives, and text analysis makes up a central component of literature studies. If such analysis can be integrated with advanced visual methods and fed back into the daily work of the literature researcher, it is likely to reveal useful and nuanced insights into the complex daily lives, ideas and beliefs of the main characters found in many literary works. In this paper we describe the combination of robust text analysis with visual analytics, bringing a new set of tools to literary analysis.

As a showcase, we analyzed a small subset (13 novels by a single author) of a large literary collection, the Swedish Literature Bank <http://litteraturbanken.se/#!om/inenglish>. The analysis is based upon two levels of inquiry: mentions of theistic beings (e.g. gods’ names) and mentions of persons’ names, including their gender and their normalized, linked variant forms, examining their appearance in sentences, paragraphs and chapters. The case study shows several successful applications of visual analytics methods to various problems in literature and demonstrates the advantages of visual literature fingerprinting (Keim & Oelke, 2007). Our work is inspired by the notion of distant reading, or macroanalysis, for the analysis of literature collections (Moretti, 2005).

We start by recognizing all characters in the novels using a mature language technology, named entity recognition, which can be turned into a tool in aid of text analysis in this field. We apply context cues and lists of animacy and gender markers, inspired by the document-centered approach and the labelled consistency principle: a form of on-line learning from the documents under processing which uses unambiguous usages of words or names to assign annotations to ambiguous ones. For instance, if in an unambiguous context with a strong gender indicator, such as ‘Mrs Alexander’, the name ‘Alexander’ is assigned feminine gender, then subsequent mentions of the same name in the same discourse will also be assigned feminine gender, unless there is a conflict with another person of the same name.
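A minimal sketch of this labelled-consistency propagation (the title cue lists, tokenisation, and example sentence are invented for illustration; the authors' system works on Swedish text with much richer cues, and conflict handling between same-named persons is omitted here):

```python
# Gender learned from an unambiguous title cue ("Mrs Alexander") is
# propagated to later, ambiguous mentions of the same name in the
# same document.
FEMALE_CUES = {"mrs", "miss", "ms", "lady"}
MALE_CUES = {"mr", "sir", "lord"}

def tag_genders(tokens):
    """Return {name: gender} learned from title cues, applied document-wide."""
    genders = {}
    for i, tok in enumerate(tokens[:-1]):
        cue = tok.lower().rstrip(".")
        name = tokens[i + 1]
        if cue in FEMALE_CUES:
            genders[name] = "feminine"
        elif cue in MALE_CUES:
            genders[name] = "masculine"
    return genders

tokens = "Mrs Alexander arrived ; later Alexander spoke to Mr Holm".split()
genders = tag_genders(tokens)
```

The second, bare mention of "Alexander" receives feminine gender purely by lookup in the document-level table built from the first, unambiguous mention.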

We argue that the integration of text analysis such as the one briefly outlined with visualization techniques, such as higher-resolution pixel-based fingerprinting, could be put to effective use in literature studies as well. We also see an opportunity to devise new ways of exploring the large volumes of literary texts being made available through national cultural heritage digitization projects, for instance by exploring the possibility of showing several literary texts (novels) at once. We will illustrate some of the applied techniques using several examples from our case study, such as summary plots based on all the characters in these novels, as well as fingerprints based on the distribution of characters across the novels (a simplified example can be seen in figure 1).

References:

Keim D.A. and Oelke D. 2007. Literature Fingerprinting: A New Method for Visual Literary Analysis. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (VAST ‘07), 115--122.

Moretti F. 2005. Graphs, maps, trees: abstract models for a literary history. R.R. Donnelley & Sons.


POSTER

An Online Visual Articulatory Resource for Phonetics Teaching and Independent Study

Eleanor Lawson (Queen Margaret University, Edinburgh) Jane Stuart-Smith (University of Glasgow, English Language)

Traditionally, phonetics teaching has focussed on aural input, with phonetics teachers modelling a set of reference sounds for students to learn by imitation. This technique is supplemented with verbal descriptions of the articulators and their movements, introspection and the use of diagrams or static physical models of the vocal tract. However, an understanding of speech organ movements is not automatically achieved with these techniques and many students still struggle to produce particular reference sounds. With these techniques there is also the danger that phonetics students are taught an over-simplified view of articulation, where secondary and tertiary articulations are overlooked and articulations that are auditorily covert, for example, due to masking, are missed. Rather than relying on idealised, audio-based teaching materials, there is a need for students to be able to see examples of articulatory movement in real speech.

Today, visual articulatory techniques such as ultrasound tongue imaging (UTI), electropalatography (EPG), magnetic resonance imaging (MRI) and electromagnetic midsagittal articulography (EMMA) are improving our understanding of how speech is produced. Researchers in the field of Phonetics find that their long-held assumptions about the production of particular speech sounds are overturned when they are able to see the tongue moving. Instrumentation and software that can record and visually represent articulatory movement during natural speech can greatly improve Phonetics teaching. We will discuss a project, already underway, that aims to provide students and teachers of Phonetics with online access to videos of modelled and spontaneous speech. This articulatory teaching website will make use of MRI and UTI technology and associated software to allow students and teachers of Phonetics to view articulatory movement during speech. The website will have two main sections: a clickable IPA chart showing videos of the articulatory movements of experienced phoneticians as they model the sounds of the IPA, and a corpus of videos of the articulatory movements of naïve talkers producing spontaneous speech. We will discuss the motivations for the creation of this resource, how the materials for it will be obtained, and how it is being constructed. We will also use this opportunity to obtain feedback from potential users of the resource.

POSTER

Visualizing Collocational Environments in Interdisciplinary Discourse

David Oakey (Department of English, Iowa State University, USA)

This paper presents a means of visually displaying collocational information in order to reveal underlying epistemologies in the discourse of interdisciplinary research. In the interdisciplinary field of Health and Social Care, practitioners have been made aware of the need to do more to communicate effectively across differing occupations and agencies and with the public. An understanding of the core knowledge, principles and practices of their potential collaborators allows practitioners, researchers, or management strategists to apply or adapt their own skills and knowledge more effectively.

The study uses collocations of closed-class keywords (Groom 2007) to display to collaborators the cognitive maps and foundational concepts and practices in a corpus of research articles in Health and Social Care and its contributory disciplines. Our talk focuses on the challenges in presenting these findings visually. Closed-class key words each occur several hundred times in the corpus, and there are hundreds more lexical words which co-occur with each key word. Table 1 below shows the twenty most frequent lexical words to occur to the left of among in each of the three journals in the corpus, and illustrates how little a conventional frequency table can reveal, as this is merely the left-hand collocational environment of a single key word.

Table 1: Frequencies of the 20 most frequent lexical words occurring to the left of among

Concordance lines are similarly difficult to study for clues to the epistemological signs we are looking for. Lines 189 to 206 in Table 2 below show that although re-sorting the lines alphabetically reveals mortality repeated 18 times immediately to the left of among, other collocates up to five words to the left remain dispersed throughout the lines. In this case, much more searching of the 800 lines for among is required to find occurrences of mortality in the left-hand environment beyond the sort limit, such as in lines 5 and 52, and then all the way down to lines 282 and 365.

Table 2: Concordance lines showing mortality in the left-hand collocational environments of among in the BMJ (reduced in width to fit on the page)


Our presentation suggests “word clouds” known as Wordles (Feinberg 2009) as an alternative means of visualizing lexical relations. Wordles are images displaying all the types (different individual words) in a text file in a font size proportional to their frequency. The wordle in figure 1 below shows the entire left-hand environment (up to five or six words away) of among in the British Medical Journal, and its contents can be compared to Tables 1 and 2 above.

It can be seen that mortality, risk, and smoking are frequent terms in the left-hand environments of among in the British Medical Journal, something also observable in Table 1 above, but we can also see at a glance the relative importance of words which are only slightly less frequent than birth but which are not as easily observable in the frequency table. A comparison of collocates of among in the three journals is shown in Figure 2 on the next page. Our presentation concludes that Wordles, while controversial, are an effective visual tool for revealing the frequencies of lexis in texts and, more specifically for this study, the collocational environments of closed-class key words. The presentation will also present results for between and within.
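As a rough illustration of the Wordle principle (font size proportional to frequency), the scaling could be sketched as below. The linear mapping and the point-size range are our assumptions for illustration, not Wordle's actual layout algorithm:

```python
def wordle_sizes(freqs, min_pt=10, max_pt=96):
    """Map raw word frequencies to font sizes, linearly scaled between min_pt and max_pt."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1  # avoid division by zero when all frequencies are equal
    return {w: round(min_pt + (f - lo) / span * (max_pt - min_pt))
            for w, f in freqs.items()}

# Toy frequencies standing in for left-hand collocates of "among":
sizes = wordle_sizes({"mortality": 120, "risk": 80, "smoking": 60, "birth": 15})
print(sizes)  # the most frequent word gets the largest font
```

The effect is that a reader perceives the frequency ranking at a glance, which is exactly what the flat frequency table in Table 1 fails to convey.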

Figure 1: Wordle of the left-hand collocational environments of among in the British Medical Journal

Figure 2: Wordles of the left- and right-hand collocational environments of within in the Health and Social Care Corpus

References:

Feinberg, J. (2009). Wordle – Beautiful Word Clouds. Retrieved 14–17 October 2011 from: http://www.wordle.net

Groom, N. W. (2007). Phraseology and Epistemology in Humanities Writing: A Corpus-Driven Study. Unpublished PhD thesis, University of Birmingham, Birmingham.


Visualisation of Arabic Morphology

Majdi Sawalha (Department of Computer Information Systems, King Abdullah II School for Information Technology, University of Jordan, Amman, Jordan)

Eric Atwell (School of Computing, University of Leeds, Leeds, UK)

1. The Complexity of Arabic Morphology

Arabic morphology has been addressed by many researchers. We want to PoS-tag our Arabic corpus, but tag assignment is significantly more complex for Arabic. An Arabic lemmatiser program can extract the stem or root, but this is not enough for full PoS-tagging; words should be decomposed into five parts: proclitics, prefixes, stem or root, suffixes and enclitics. The morphological analyser should then add the appropriate linguistic information to each of these parts of the word; in effect, instead of a tag for a word, we need a subtag for each part (and possibly multiple subtags if there are multiple proclitics, prefixes, suffixes or enclitics). Arabic has many morphological and grammatical features, including subcategories, person, number, gender, case and mood [1], and more fine-grained tag sets are often considered more appropriate. The additional information may also help to disambiguate the (base) part of speech [2]. Many challenges face the implementation of Arabic morphological analyzers, such as: the rich "root-and-pattern" nonconcatenative (or nonlinear) morphology; the highly complex word-formation process of roots and patterns; and the special orthographic issues of Arabic. The SALMA – Tagger [3] is a fine-grained morphological analyzer based on linguistic information extracted by text-analytics software from traditional Arabic grammar books and dictionaries [4]. The SALMA – Tag Set [5] is a proposed standard tag set which captures long-established, traditional fine-grained morphological features of Arabic in a notation intended to be compact yet transparent. The SALMA – Tagger uses it to encode the fine-grained morphological feature information for the automatically analyzed words.
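The five-part decomposition with a subtag per part might be represented as in the sketch below. This is purely illustrative: the example word's segmentation and the tag strings are invented for the demonstration and are not actual SALMA – Tag Set notation.

```python
from dataclasses import dataclass, field

@dataclass
class MorphemeAnalysis:
    """One analysed word: each of the five parts carries its own subtag.
    Lists allow multiple proclitics/prefixes/suffixes/enclitics per word."""
    word: str
    proclitics: list = field(default_factory=list)  # (morpheme, subtag) pairs
    prefixes:   list = field(default_factory=list)
    stem:       tuple = ("", "")
    suffixes:   list = field(default_factory=list)
    enclitics:  list = field(default_factory=list)

    def subtags(self):
        """Return the subtag sequence for the whole word, in surface order."""
        parts = (self.proclitics + self.prefixes + [self.stem]
                 + self.suffixes + self.enclitics)
        return [tag for _, tag in parts if tag]

# Hypothetical analysis of wabikitabihim ("and with their book"):
w = MorphemeAnalysis(
    word="wabikitabihim",
    proclitics=[("wa", "conjunction"), ("bi", "preposition")],
    stem=("kitab", "noun.masc.sg.gen"),
    enclitics=[("him", "pronoun.3pl")],
)
print(w.subtags())
```

The point of the structure is that a single orthographic word yields a sequence of subtags rather than one tag, which is what makes Arabic tag assignment harder than for languages with simpler word formation.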

2. Colour-coded Arabic Morphology

The SALMA – Tagger outputs include:

■■ The root (i.e. three or four letters origin of the word).

■■ The lemma (i.e. the dictionary form of the word, or headword).

■■ The pattern (i.e. templates of combinations of consonants and vowels).

■■ The full vowelized form (i.e. the word with diacritic marks added above or below letters to indicate correct pronunciation).

■■ The tokenization of the word into morphemes (i.e. the minimal meaning-bearing units constituting a word).

■■ A SALMA – Tag for each morpheme.

The SALMA – Tagger analyses can be output in alternative formats including tab-separated text, XML and HTML (Fig.1).

To visualize the analysis, the word's morphemes are colour-coded. The colour-coding scheme depends on the morphological information of the analyzed word (Fig. 2). The SALMA – Tokenizer and SALMA – Tagger modules specify each of the word's morphemes, its class (i.e. proclitic, prefix, stem, suffix or enclitic) and the part-of-speech category for each morpheme. The colour-coding module then visualizes this morphological information, with the word's morphemes and their parts of speech coded in colours. This colour-coded output format shows the complexity of Arabic words and the number and types of morphemes that form a single word. Each morpheme is coloured according to its type and part of speech. Fig. 3 shows an example of a colour-coded word. Fig. 4 shows the colour-coded visualization of a Quranic Arabic text and a Modern Standard Arabic (MSA) text, illustrating morpheme boundaries.
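A minimal sketch of colour-coding morphemes as HTML is shown below. The colour values and the class-to-colour assignments are our assumptions for illustration, not the actual SALMA colour scheme:

```python
# Hypothetical colour scheme: one colour per morpheme class.
CLASS_COLOURS = {
    "proclitic": "#1f77b4",
    "prefix":    "#2ca02c",
    "stem":      "#d62728",
    "suffix":    "#9467bd",
    "enclitic":  "#ff7f0e",
}

def colour_code(morphemes):
    """Render (morpheme, class) pairs as HTML spans coloured by morpheme class."""
    spans = ['<span style="color:{}">{}</span>'.format(CLASS_COLOURS[cls], m)
             for m, cls in morphemes]
    return "".join(spans)  # concatenation preserves morpheme boundaries visually

html = colour_code([("wa", "proclitic"), ("kitab", "stem"), ("him", "enclitic")])
print(html)
```

Because the spans are adjacent, the word reads as a single token while the colour changes mark the morpheme boundaries, which is the effect described for Figs. 3 and 4.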

Figure 1: SALMA – Tagger output formatted as an HTML file


References:

[1] Author 2. (2008). Development of tag sets for part-of-speech tagging. In A. Lüdeling & M. Kytö (eds.), Corpus Linguistics: An International Handbook, Volume 1, 501–526. Mouton de Gruyter.

[2] Schmid, H., & Laws, F. (2008). Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging. Paper presented at COLING '08, Manchester, UK.

[3] Author 1. (2009). Open-source Resources and Standards for Arabic Word Structure Analysis. PhD thesis, University of Leeds, Leeds.

[4] Authors. (2010). Constructing and Using Broad-Coverage Lexical Resource for Enhancing Morphological Analysis of Arabic. Paper presented at the Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta.

[5] Authors. (Under review). A Theory Standard Tag Set Expounding Traditional Morphological Features for Arabic Language Part-of-Speech Tagging. Word Structure, Edinburgh University Press.

Figure 2: Colour codes used to colour-code the morphemes of the analyzed words

Figure 3: Colour-coded example of a word from the Qur'an gold standard

Figure 4: Colour-coded output of the analyzed text samples of the Qur'an (top) and MSA (bottom)


Visualising Spoken Language Transcriptions – Old Principles and New Opportunities

Thomas Schmidt (Institute for the German Language (IDS), Pragmatics Department, Mannheim, Germany)

Transcription, the paradoxical task of representing spoken language in the written medium, has always been an interesting challenge for data visualisation. Long before the computer became the tool of choice for transcribing audio or video recordings, researchers from fields as diverse as conversation analysis, dialectology and phonology (to name just a few) developed means of capturing on paper not only the lexical form of the words spoken but also relevant properties of their timing (e.g. speaker overlap), prosody (e.g. emphatic stress, pauses) and accompanying nonverbal behaviour (e.g. smiling, gesturing). It is widely agreed that adequacy and detail of a transcription on the one hand, and accessibility and readability of its visual representation on the other, are conflicting requirements that are not easy to reconcile.

The computer has expanded the potential for visualising spoken language in several ways. First, the principle of separating content from form, on which all modern approaches to document processing (especially in the XML paradigm) are based, allows researchers to have several visualisations for one and the same base transcription, each of which can be optimized for a specific use. Second, the computer’s multimedia capacities make it possible to (re)bridge the gap between the spoken original and the written transcription by linking visualisations to the underlying recording. Third, by making non-linear structures navigable through hypertext technology, digital transcriptions can bypass some of the disadvantages that result from the need to map the multidimensional structure of interactional talk onto a one- or two-dimensional visual representation.

In my contribution to the conference, I will first give an overview of the visualisation challenges inherent in spoken language transcription and some “classical” techniques employed by linguists to meet these challenges. I will use the example of “Partitur” (musical score) notation (see figure 1 below) to illustrate some of the most salient problems.

Figure 1: Musical score visualisation of a spoken language transcription

I will then present ongoing work on new and improved visualisation methods developed in the context of a spoken corpus technology project. At the core of this project are a time-based data model for spoken language transcription, represented in an XML file format, and a set of software tools for inputting, editing and managing spoken language corpora in that format. I will demonstrate how a combination of stylesheet transformations (XSLT and CSS), HTML5 media integration and other technologies can help not only to improve the readability and usability of established visualisation forms, but also to generate completely new views on transcription data which can substantially alter the ways researchers look at and work with spoken language data.
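The separation of content from form that such a time-based XML model enables can be illustrated with a toy transformation. The element and attribute names below are invented for the example and are not the project's actual schema; the same source could equally be rendered as a musical score or any other view:

```python
import xml.etree.ElementTree as ET

# A toy time-based transcription: each event carries speaker and time offsets.
SAMPLE = """<transcript>
  <event speaker="A" start="0.0" end="1.2">so what do you think</event>
  <event speaker="B" start="1.0" end="2.4">well I'm not sure</event>
</transcript>"""

def to_html(xml_text):
    """Render each timed event as an HTML table row whose link carries the
    recording offset, so a media player can jump to the spoken original."""
    rows = []
    for ev in ET.fromstring(xml_text).iter("event"):
        rows.append('<tr><td>{}</td><td><a href="#t={}">{}</a></td></tr>'.format(
            ev.get("speaker"), ev.get("start"), ev.text))
    return "<table>{}</table>".format("".join(rows))

print(to_html(SAMPLE))
```

Note that the overlap between the two events (B starts at 1.0, before A ends at 1.2) survives in the data even though this particular view flattens it into successive rows; a different stylesheet could align the overlap visually instead.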