Hypothesis Testing Hypothesis Generation

Corpus Linguistics: A Simple Introduction

Dr. Niko Schenk
n.schenk@em.uni-frankfurt.de

Applied Computational Linguistics Lab
Computer Science Department

Department of English- and American Studies
Goethe University Frankfurt, Germany

October 16, 2019

Dr. Niko Schenk Corpus Linguistics – Introduction 1 / 35

1 Hypothesis Testing

2 Hypothesis Generation

“Correct” vs. “Incorrect” Use of Language

Figure: Source: http://www.elektrojournal.at/bilder/d166/Saturn_Claim.jpg

Soo! muss Technik

Regarding the slogan, four things are striking:

Missing main verb “sein” (verbal ellipsis?/“ungrammatical”)

Orthographic variation “soo” vs. “so” (spelling “error”)

Misleading punctuation “soo! ...”

All letters capitalized...

Question: How would you objectively/formally test that “soo”, compared to “so”, is not “regular” language?

Dictionary...

Better: → www.google.com!

Comparing Frequencies of Contrastive Elements

(a) Google results for “so” (b) Google results for “soo”

Refining Comparisons of Contrastive Elements

(a) Google results for “so schon” (b) Google results for “soo schon”

Congratulations!

We’ve successfully performed our first corpus linguistic search.

How is that different from using a dictionary?

Answer:

1 Consult tons of real data instead of a single contemporary rule.

2 Use tendencies instead of an absolute true/false answer (the dictionary claims that “soo” is false or, even worse, that the phrase does not exist).

Language Change: An Example

Figure: “Thrived” vs. ...

Language Change: An Example (cont’d)

Figure: ... “throve”.

Language Change: An Example (cont’d)

Question: How would you objectively/formally show that “throve” is obsolete/no longer in use?

You could ask a native speaker...

(Much) better: → Google Books Ngram Viewer

Figure: Distributions of “thrived” vs. “throve” in the Google books corpus.

Possible explanation: Low-frequency words change to fit the main paradigm.

Again, how is that different from asking a native speaker?

Answer: Consult real data instead of the intuition of one individual speaker.

What is Corpus Linguistics? (I)

Starting with a linguistic phenomenon (see previous examples) and a hypothesis, you use

large textual resources (a corpus!) and software to objectively test (falsify/verify) the hypothesis.

Hypothesis testing is usually based on frequencies obtained through search.
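
A frequency-based test of this kind can be sketched in a few lines of Python. This is only an illustration: the function name `relative_frequencies` and the toy corpus are assumptions made up for this example, and a real study would use a large corpus instead of one short string.

```python
import re
from collections import Counter

def relative_frequencies(text, variants):
    """Return occurrences per 1,000 tokens for each word in `variants`."""
    # Lowercase the text and split it into word tokens.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {v: 0.0 for v in variants}
    counts = Counter(tokens)
    return {v: 1000 * counts[v] / len(tokens) for v in variants}

# Toy corpus, invented for illustration only.
toy_corpus = "So muss Technik sein. Das ist so gut. Soo schade! So ist es."
freqs = relative_frequencies(toy_corpus, ["so", "soo"])
print(freqs)
```

Normalizing by corpus size (here per 1,000 tokens) makes the numbers comparable across corpora of different lengths, which raw hit counts are not.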

1 Hypothesis Testing

2 Hypothesis Generation

Extracting Useful Information

Contrary to the previous examples, you don’t have to start with a concrete hypothesis.

You could just “do something” with the corpus itself:

e.g., compute various statistics and inspect the output.

Usually you count words, phrases, etc. This is done automatically with the help of computer programs.

→ As a result, you can come up with a hypothesis.
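
As a minimal sketch of such automatic counting, the following Python snippet builds a word frequency list. The function name `frequency_list` and the sample sentence are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

def frequency_list(text, top_n=5):
    """Return the `top_n` most frequent word tokens with their counts."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(top_n)

# Sample text, invented for illustration only.
toy_text = "the cat sat on the mat and the dog sat on the rug"
for word, count in frequency_list(toy_text, top_n=3):
    print(word, count)
```

Inspecting such a list is often the first exploratory step: unexpected items near the top can suggest a hypothesis worth testing.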

Motivation

Co-occurrence probabilities between words...
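
One simple way to make "co-occurrence probabilities" concrete is to estimate how likely a word is to follow another from bigram counts. The sketch below assumes this particular reading (the slide does not specify one); `bigram_probabilities` and the token sequence are made up for the example.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1 as first bigram word)."""
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    return {
        (w1, w2): count / first_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }

# Token sequence invented for illustration only.
toy_tokens = "new york is big and new york is old and new cars are fast".split()
probs = bigram_probabilities(toy_tokens)
print(probs[("new", "york")])
```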

How to Automatically Find Collocations

Example: Automatically collect words which co-occur more frequently than what would be expected:

Generated hypothesis: → These words are idiomatic expressions, proper names...
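
The slide does not name the association measure used. As one common choice, the sketch below assumes pointwise mutual information (PMI), which compares the observed probability of a word pair with the probability expected if the two words were independent. The function name `pmi_scores`, the `min_count` threshold, and the toy tokens are assumptions for illustration.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Return PMI for adjacent word pairs seen at least `min_count` times."""
    unigram_counts = Counter(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_pair = count / len(bigrams)
        p_w1 = unigram_counts[w1] / len(tokens)
        p_w2 = unigram_counts[w2] / len(tokens)
        # Observed pair probability relative to chance co-occurrence.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Token sequence invented for illustration only.
toy_tokens = "new york is big and new york is old and paris is big".split()
scores = pmi_scores(toy_tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Pairs with a high positive score co-occur more often than chance would predict and are candidates for collocations.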

How to Automatically Detect Similar Facebook Users

Based on the data, you could just count the words which are used by different people and compare the numbers.

Generated hypothesis: → Groups of people with the same opinion share a similar vocabulary.
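
Counting and comparing vocabulary can be sketched as follows. The slides do not state which comparison is used; cosine similarity over word counts is assumed here as one standard option. The user names and posts are invented for illustration.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count mappings (0.0 to 1.0)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical users and posts, invented for illustration only.
posts = {
    "user_a": "taxes are too high lower taxes now",
    "user_b": "lower taxes now taxes hurt everyone",
    "user_c": "my cat sleeps all day long",
}
vectors = {user: Counter(text.split()) for user, text in posts.items()}
sim_ab = cosine_similarity(vectors["user_a"], vectors["user_b"])
sim_ac = cosine_similarity(vectors["user_a"], vectors["user_c"])
print(sim_ab, sim_ac)
```

Users whose word counts point in a similar direction get a score close to 1; users with no shared vocabulary get 0.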

How to Automatically Identify the Author of a Text

The same technique can be applied to find the author of a document.

For each document (written by an author), count the number of distinct words which he/she uses.
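
Counting distinct words relative to text length gives the type/token ratio. The sketch below computes the standard (word-form) ratio only; a lemma-based ratio would additionally need a lemmatizer, which is not shown. The function name `type_token_ratio` and the two sample documents are assumptions for illustration.

```python
import re

def type_token_ratio(text):
    """Return distinct word forms (types) divided by running words (tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical documents, invented for illustration only.
documents = {
    "doc_a.txt": "the cat sat on the mat the cat slept",
    "doc_b.txt": "a quick brown fox jumps over a lazy dog",
}
ratios = {name: type_token_ratio(text) for name, text in documents.items()}
print(ratios)
```

Note that the ratio depends on text length, so it is most meaningful when the compared documents are of similar size.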

Figure: Scatter plot of the text files by standard type / token ratio (x-axis, ticks 0.30 to 0.50) and type (lemma) / token ratio (y-axis, ticks 0.25 to 0.50). Each point is labeled with its file name: texts by christopher, lisa, marie-luise, miriam, philipp, simon, thu, and viktor, plus instructions.txt.

Hypothesis Generation

Generated hypothesis: → Students appearing closer together in the visualization are similar in language use.

What is Corpus Linguistics? (II)

Starting with the corpus and no specific hypothesis, you use

large textual resources and statistics to detect contrastive (interesting) patterns, i.e. you generate a hypothesis.

Summary

Corpus Linguistics can be divided into two parts:

1 hypothesis testing

2 hypothesis generation

You usually use

large textual (linguistic) resources which are electronically available.

software to analyze (search) the data.

Frequencies are essential.

Homework Assignment

Corpus Linguistics—An Example

Google offers an exploratory search functionality as a corpus linguistic application.

Cf. Google Books Corpus [1] / Google Ngram Viewer [2]

[1] http://googlebooks.byu.edu/x.asp – AE: 155 billion words

[2] Cf. http://books.google.com/ngrams/
