Corpora, Blogs and Linguistic Variation (Paderborn)

Corpora, Blogs and Linguistic Variation: Arguments for Using Structured Web Data in Corpus Development

Cornelius Puschmann, University of [email protected]

University of Paderborn, 8 November 2007

Contents of this presentation

What counts as evidence in linguistics?

System, use and the individual

Using the Web for corpus investigations

Structured vs. unstructured language data

A research example

What counts as evidence in linguistics?

Four central questions a researcher must answer

What is my question?

What data can I use?

What methods are at my disposal?

What are my findings?

*0) What are my preliminary assumptions?

Different schools of thought

Generativism, Structuralism, CompLing, CogSci, SocioLing, Functionalism ... all have different questions and assumptions!

What role does corpus data play?

If my aim is to figure out what we know when we know language, then corpus data is just one type of evidence among many.

If my aim is to describe the social function of language in a speech community, then I'll be interested in some (specific) natural language data.

If my aim is to systematically describe, classify and manipulate natural language data for practical reasons, it is likely to be the only thing to qualify as data to me.

The relevance of corpus data can range from "somewhat interesting" to "what else is there?", depending on my perspective.

System, use and the individual

A totalizing view of language

[Diagram labels: system, social function, cognitive mechanism, genetic disposition, use, production, investigation, cultural transmission]

but... whose system? whose use?

Margin vs. center: what is shared vs. what varies

[Diagram labels: shared, recurring & patterned language; individual & varying language]

If we're interested in variation, corpus analysis is the way to go

A (slightly) different way of looking at language

While variation in language use is recognized by linguists, system is generally believed to be an abstract and essentially universal category.

Alternately, it is possible to regard system as the sum of all recurring and patterned instances of language use by a single speaker, some of which are shared with other speakers, while others are not.

How could similarities and differences between speakers be accurately captured and measured?

They could be captured with corpora that allow us to systematically compare different speakers.

Using the Web for corpus investigations

The ultimate corpus

the indexable Web has more than 11.5 billion pages (2005 study)

virtually all (written) languages are represented

established forms of writing (fiction, official documents, personal communication) and new genres (blogs, message boards) can be found on the Web

Web as Corpus (WaC)

WaC treats the vast amount of language data on the WWW as a corpus

search engines are used to query this corpus

they can be commercial (e.g. Google), or specialized tools for linguistic research (e.g. WebCorp, LSE)

but: specialized linguistic search engines limited to post-processing (?)

pros: large volume of data, no data acquisition necessary, easy to use

cons: no knowledge of the makeup of the dataset, no control over the dataset or the query engine, no tagging, parsing or other specialized linguistic tools (with commercial engines)

Broader issues with WaC

When we're looking for qualitative data, WaC is relatively unproblematic, but when we're comparing frequencies (i.e. taking a quantitative approach) it has serious issues.

The fact that Google controls the data means that both the data and how it is processed are unknown and may change at any time

...and most importantly: Google doesn't care!

Web for Corpus

Web for Corpus (WfC) extracts data from the Web and stores it locally

it is closer to traditional corpus development in the sense that data is consciously added by the researcher following certain criteria (randomness, balance etc)

pros: control over makeup of the dataset, precise knowledge of its size, ability to annotate, reuse, share and publish the dataset, ability to parse, tag ...

cons: corpus construction is technically more challenging, corpus is smaller than the Web in its entirety (though still larger than traditional corpora)

Constructing a corpus using web data

Pick a data source (Wikipedia, Project Gutenberg, blogger.com, the entire WWW as indexed by Google, ...)

Retrieve the raw HTML data (spidering or crawling)

Process the HTML data, i.e. separate natural language from formatting instructions. For example
Dear diary
I am really bored & tired right now...

becomes
Dear diary
I am really bored & tired right now

Tag, parse and store
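The processing step above (separating natural language from formatting instructions) can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the tooling used for this talk; the class name `TextExtractor`, the helper `html_to_text`, the list of block-level tags and the sample markup are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops tags; block-level tags become line breaks."""

    # Assumed set of tags that should start a new line (not exhaustive)
    BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Strip each line and drop the empty ones
        lines = (line.strip() for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical input, not the markup shown on the original slide
sample = (
    "<div class='post'><h3>Dear diary</h3>"
    "<p>I am <b>really</b> bored &amp; tired right now...</p></div>"
)
print(html_to_text(sample))
```

Real pages also contain navigation, scripts and advertising, so a production pipeline would need boilerplate removal on top of tag stripping.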

Things to consider

...not a whole lot of language data!

Things to consider

... better, but we need to take register into account

Structured vs. unstructured language data

A blog is...

blog (n): a website where entries are written in chronological order and displayed in reverse chronological order (web log).

blog (v): to maintain or add content to a blog.

Blogs are written by a variety of people (bloggers) with a variety of purposes. Every text in a blog can be directly linked to its author and usually has other meta-information (date, keywords etc).

A few facts

Technorati (blog search engine) tracks about 70 million blogs worldwide

120,000 new blogs are created each day; that's about 1.4 every second

1.5 million entries (posts) are published every day

virtually every (written) language on the planet is represented in blogs

blog data is well-structured in the sense that it doesn't contain visual markup

The markup of a blog is semantic, meaning it contains meta-information about the content that a machine can understand.
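To illustrate what machine-readable meta-information can look like, here is a minimal sketch that reads one entry from an Atom feed, a format many blogs publish. The feed content (author name, date, keywords) is invented for the example, and `read_entries` is a hypothetical helper; only the Atom element names and namespace are taken from the Atom format itself.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed entry; author, date and keywords are invented
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Dear diary</title>
    <author><name>Jane Doe</name></author>
    <published>2007-11-08T10:00:00Z</published>
    <category term="personal"/>
    <category term="boredom"/>
    <content type="text">I am really bored and tired right now...</content>
  </entry>
</feed>"""


def read_entries(feed_xml):
    """Return one dict per entry with its text and meta-information."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "author": entry.findtext("atom:author/atom:name", default="", namespaces=ATOM),
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
            "keywords": [c.get("term") for c in entry.findall("atom:category", ATOM)],
            "text": entry.findtext("atom:content", default="", namespaces=ATOM),
        })
    return entries


for e in read_entries(sample_feed):
    print(e["author"], e["published"], e["keywords"])
```

Because author and date arrive as labelled fields rather than as visual layout, each text can be assigned to a speaker without guesswork.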

An example

A research example

Expression of futurity in English: will vs. be going to

will
origin as a transitive lexical verb (OE willan) with a meaning similar to German wollen; has been grammaticalized to express futurity

be going to

can be combined with +NP (movement) or +Inf.V (future); movement away from the speaker, towards a goal (Perez)

The notion of satisfying a condition forms one of the major distinctions between the two future expressions will and be going to. A sentence with will relies on a condition evident in the context that will enable the proposition to take place. (Perez)

Distribution of will vs. be going to in three blogs

[Chart legend: light blue = will; dark blue = going to]
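A per-blogger comparison of this kind could be approximated as below. This is a rough sketch under several assumptions, not the procedure used for the chart: the regular expressions are crude (they do not separate be going to + NP from be going to + infinitive, nor the noun will from the auxiliary, and they expect straight apostrophes), and the blogger names and sample texts are hypothetical apart from one sentence borrowed from the examples in this talk.

```python
import re
from collections import Counter

# Crude patterns: auxiliary "will", "won't" and the contraction 'll
WILL = re.compile(r"\b(?:will|won't)\b|(?<=\w)'ll\b", re.IGNORECASE)
# Matches "going to" and "gonna" regardless of what follows
GOING_TO = re.compile(r"\b(?:going\s+to|gonna)\b", re.IGNORECASE)
TOKEN = re.compile(r"\w+(?:'\w+)?")


def future_counts(text):
    """Return raw counts and counts per 1,000 tokens for both constructions."""
    n_tokens = len(TOKEN.findall(text))
    counts = Counter(
        will=len(WILL.findall(text)),
        going_to=len(GOING_TO.findall(text)),
    )
    per_1000 = {
        k: (1000 * v / n_tokens if n_tokens else 0.0) for k, v in counts.items()
    }
    return counts, per_1000


# Hypothetical per-author texts
blogs = {
    "blogger_a": "I'll post again tomorrow. The exhibit is going to be in Seattle.",
    "blogger_b": "We will ship the update soon, and we won't delay it. It will be great.",
}

for author, text in blogs.items():
    counts, per_1000 = future_counts(text)
    print(author, dict(counts), per_1000)
```

Normalizing by token count matters because the blogs differ in size; on tagged data, the part-of-speech tag after going to would separate the movement reading from the future reading.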

Personal pronoun frequency in the same blogs

      THOMPSON            J.SCHWARTZ          H.HAMILTON
Rank  Word  Tag  Count    Word  Tag  Count    Word  Tag  Count
01    the   DT   2638     the   DT   3077     the   DT   3802
02    and   CC   1604     and   CC   2120     I     PP   3613
03    to    TO   1272     to    TO   2072     to    TO   2789
04    of    IN   1152     a     DT   1862     a     DT   2045
05    a     DT   1034     of    IN   1486     of    IN   1795
06    in    IN    821     in    IN   1008     and   CC   1788
07    For   IN    680     we    PP    988     It    PP   1519
08    on    IN    509     I     PP    818     you   PP   1205
09    is    VBZ   465     our   PP$   723     in    IN   1096
10    with  IN    375     It    PP    642     that  IN   1077
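Frequency lists like the ones above can be derived from tagged text with a few lines of code. The sketch below assumes the corpus is already available as (word, tag) pairs and that personal and possessive pronouns carry the tags PP and PP$, as in the lists shown; the function names and the sample tokens are hypothetical.

```python
from collections import Counter

# Assumed pronoun tags, matching the tags that appear in the lists above
PRONOUN_TAGS = {"PP", "PP$"}


def frequency_list(tagged_tokens, n=10):
    """Return the n most frequent (word, tag) pairs with their counts."""
    return Counter(tagged_tokens).most_common(n)


def pronoun_share(tagged_tokens):
    """Return the proportion of tokens tagged as pronouns."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _, tag in tagged_tokens if tag in PRONOUN_TAGS)
    return hits / len(tagged_tokens)


# Hypothetical tagged sample, not data from the three blogs
sample = [
    ("I", "PP"), ("think", "VBP"), ("we", "PP"), ("like", "VBP"),
    ("our", "PP$"), ("blog", "NN"), ("and", "CC"), ("I", "PP"),
    ("like", "VBP"), ("it", "PP"),
]

for (word, tag), count in frequency_list(sample, n=3):
    print(word, tag, count)
print(pronoun_share(sample))
```

Comparing `pronoun_share` across authors gives a single number per blogger, which is easier to test statistically than a top-ten list.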

Inanimate subjects with be going to

(1) my car is going to get pretty crusty

(2) the exhibit is going to be in Seattle

(3) the things that are left to do in this house are going to cost some money

(4) the work is still going to be there the next day

(5) my first post is going to be about troublesome managers

Observations

The assumption that be going to is more frequent with certain subjects is confirmed by the data (1st pers. -> 2nd pers. -> human -> animate -> inanimate).

This appears to be the strongest conditioning factor; degree of certainty seems less significant.

However, where subjects are inanimate, the course of action can always be described as certain.

These observations could be tested against a large number of sources (blogs), factoring in individual variation in addition to other factors.

Thanks for listening!
