Timo Honkela: Semantic and pragmatic representations of large text corpora

Timo Honkela, FIN-CLARIN Jubilee seminar, 9.6.2016

Timo Honkela

FIN-CLARIN Jubilee Seminar and Nordic CLARIN Network Seminar, University of Helsinki, 9 Jun 2016

Semantic and pragmatic representations of large text corpora

timo.honkela@helsinki.fi

Agenda

● Digital humanities in Finland

● Strategic role of humanities and social sciences

● Research using text corpora

Digital humanities in Finland

● Research in humanities and social sciences is increasingly using digitally stored resources and computational analysis tools

Krister Lindén et al.

Varieng - Research Unit for the Study of Variation, Contacts and Change in English

Big Data, Rich Data, Uncharted Data, 19–22 October 2015, Helsinki, Finland

Terttu Nevalainen, Irma Taavitsainen, Tanja Säily et al.

http://www.helsinki.fi/varieng/

http://www.helsinki.fi/varieng/people/varieng_saily.html

Multilingual language technology

Jörg Tiedemann

Mathias Creutz et al.

Text mining historical newspapers

Mikko Tolonen

Kimmo Kettunen

Citizen Mindscapes

Analysis of large social media corpora in order to increase understanding of social and societal phenomena

Educational efforts: e.g. Digital Humanities Hackathon

In many such research efforts and educational activities, FIN-CLARIN serves as an essential resource and infrastructure.

Let's celebrate and have a moment of applause

http://375humanistia.helsinki.fi/en/humanists/kimmo-koskenniemi

Complexity associated with different areas of science

Biological phenomena

Physical phenomena

Cultural phenomena

Importance of humanities and social sciences

● As surprising as it may at first sound, one can claim that the humanities and social sciences are the most important sciences

● These disciplines deal with topics like language and communication, social conditions, historical developments, the economy, etc.

● Due to this complexity, research in these areas is challenging; generalizations that are commonplace in physics are rarely possible

Understanding the phenomena

Theory and knowledge formation

Qualitative

Quantitative

Open data: corpora

Open methods

Computational resources

Lars Borin

Linguistics has been the first e-science

Challenges:

“Language is BIG”

“Human INTERPRETATION is inherently involved”

Importance of language:

“Language is involved in most relevant human activities”

Example: Complexity of Finnish at the level of word forms

Kimmo Koskenniemi (2013): Johdatus kieliteknologiaan, sen merkitykseen ja sovelluksiin (Introduction to language technology, its significance and applications)

https://helda.helsinki.fi/bitstream/handle/10138/38503/kt-johd.pdf?sequence=1

Over 6000 languages, many more dialects

Billions of people

A large number of different cultures

A vast number of ways to relate language, concepts and the world to each other

Simulating processes of language emergence and communication

(Image sources: blogs.state.gov, en.wikipedia.org)

Language as a system

● Considering natural language as a signal and dynamic system at cognitive and social levels (also in its written form) rather than a symbolic and logical system

● Importance of embodiment (cf. e.g. Harnad) and embeddedness (cf. e.g. Edelman)

● Learning and pattern recognition processes are essential (as opposed to the theories presented e.g. by Chomsky, Fodor, Pinker); much of the learning is bound to be unsupervised

Complexity of language regarding different areas and levels

Structure: morphology and syntax

Meaning: semantics and pragmatics

What are the nature, granularity, type, metadata involved, etc. for different research purposes in different areas of linguistics and other areas of humanities and social sciences?

Need to harmonize, build shared terminologies, theories, frameworks, etc.

Need to model contextuality, ambiguity, vagueness, history-dependence, change, etc.

The same medium, language, is the object of study as well as the basis for theory formation, representing the ideas and resources, etc.

Philosophy of science is essential to understand what is going on...

Data-driven, inductive mode

Hypothesis-driven, deductive mode

An old research example:

Data-driven emergence of implicit word categories that match human syntactic and semantic intuitions

Classical example: Learning meaning from context:

Maps of words in Grimm fairy tales

Honkela, Pulkki & Kohonen 1995
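
The idea of learning meaning from context can be sketched in a few lines. The example below is only an illustration under stated assumptions: it represents each word by counts of its immediate left and right neighbours and compares words with cosine similarity, which is a simpler stand-in for the map-based analysis cited above and not a reproduction of it. The toy text, the window size, and the function names are invented for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens, window=1):
    """Map each word to counts of (offset, neighbour) pairs within the window."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                vectors[word][(offset, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy text invented for illustration; it is not taken from the Grimm corpus.
text = (
    "the king saw the queen . the queen saw the king . "
    "the wolf ate the goat . the goat ate the grass ."
)
tokens = text.split()
vectors = context_vectors(tokens)

# Words used in the same positions (two nouns) end up with similar vectors,
# while a noun and a verb share no (offset, neighbour) features here.
sim_nouns = cosine(vectors["king"], vectors["queen"])
sim_mixed = cosine(vectors["king"], vectors["saw"])
print(f"king~queen: {sim_nouns:.2f}, king~saw: {sim_mixed:.2f}")
```

In this toy setting the two nouns come out as more similar to each other than to the verb, without any category labels being given, which is the kind of data-driven emergence of word categories the slide refers to.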

Research example:

Multimodally grounded models of meaning

Labeling movements: Associating high-dimensional kinesthetic time series with linguistic labels

Förger & Honkela 2014

RUNNING

WALKING

LIMPING

JOGGING

Förger & Honkela 2014

Research example:

Tensor-based analysis of the subjective aspect of interpreting linguistic expressions

GICA: Grounded Intersubjective Concept Analysis

Honkela et al. 2012

Analysis of the word 'health'

Honkela et al. 2012
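
A tensor-based view of subjective interpretation can be illustrated with a small sketch. It assumes that each participant (subject) rates how strongly a word (object) relates to a set of context items, giving a subjects x objects x contexts array; this layout, the participant names, the context items, and all ratings below are invented for illustration and are not data or code from the cited study. The sketch unfolds the tensor into one context vector per (subject, object) pair and measures how far apart two subjects' readings of the same word are.

```python
from math import sqrt

# Hypothetical participants, words, and context items (assumptions).
subjects = ["s1", "s2", "s3"]
objects = ["health", "well-being"]
contexts = ["exercise", "medicine", "happiness"]

# ratings[s][o][c]: invented 1-5 ratings of how strongly subject s
# associates object o with context item c.
ratings = [
    [[5, 2, 4], [4, 1, 5]],  # s1
    [[5, 1, 4], [4, 2, 5]],  # s2
    [[1, 5, 2], [3, 3, 3]],  # s3
]


def unfold(tensor):
    """Flatten the S x O x C tensor into {(subject, object): context vector}."""
    return {
        (subjects[s], objects[o]): tensor[s][o]
        for s in range(len(subjects))
        for o in range(len(objects))
    }


def distance(u, v):
    """Euclidean distance between two context vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rows = unfold(ratings)

# How differently do two subjects interpret the same word?
d_12 = distance(rows[("s1", "health")], rows[("s2", "health")])
d_13 = distance(rows[("s1", "health")], rows[("s3", "health")])
print(f"s1 vs s2 on 'health': {d_12:.2f}")
print(f"s1 vs s3 on 'health': {d_13:.2f}")
```

With these invented ratings, s1 and s2 read 'health' almost identically while s3 diverges; the unfolded rows could equally be passed to a clustering or projection method to visualize such differences.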

Ideas for building corpora

● Expansion of the contextual framework

● Enriching metadata

● Increasing multimodal data sources that associate linguistic data with other modalities

● Involving a large number of people in labeling data to model variation

● Collecting data in real-world contexts
