Corpora, Blogs and Linguistic Variation: Arguments for Using Structured Web Data in Corpus Development
Cornelius Puschmann
University of [email protected]
University of Paderborn, 8 November 2007
Contents of this presentation
What counts as evidence in linguistics?
System, use and the individual
Using the Web for corpus investigations
Structured vs. unstructured language data
A research example
What counts as evidence in linguistics?
Four central questions a researcher must answer
What is my question?
What data can I use?
What methods are at my disposal?
What are my findings?
*0) What are my preliminary assumptions?
Different schools of thought
Generativism, Structuralism, CompLing, CogSci, SocioLing, Functionalism, ...
all have different questions and assumptions!
What role does corpus data play?
If my aim is to figure out what we know when we know language, then corpus data is just one type of evidence among many.
If my aim is to describe the social function of language in a speech community, then I'll be interested in some (specific) natural language data.
If my aim is to systematically describe, classify and manipulate natural language data for practical reasons, it is likely to be the only thing to qualify as data to me.
The relevance of corpus data can range from "somewhat interesting" to "what else is there?", depending on my perspective.
System, use and the individual
A totalizing view of language
system, social function, cognitive mechanism, genetic disposition, use, production, investigation, cultural transmission
but... whose system? whose use?
Margin vs. center: what is shared vs. what varies
shared, recurring & patterned language
individual & varying language
If we're interested in variation, corpus analysis is the way to go
A (slightly) different way of looking at language
While variation in language use is recognized by linguists, system is generally believed to be an abstract and essentially universal category.
Alternatively, it is possible to regard system as the sum of all recurring and patterned instances of language use by a single speaker, some of which are shared with other speakers, while others are not.
How could similarities and differences between speakers be accurately captured and measured?
They could be captured with corpora that allow us to systematically compare different speakers.
Using the Web for corpus investigations
The ultimate corpus
the indexable Web has more than 11.5 billion pages (2005 study)
virtually all (written) languages are represented
established forms of writing (fiction, official documents, personal communication) and new genres (blogs, message boards) can be found on the Web
Web as Corpus (WaC)
WaC treats the vast amount of language data on the WWW as a corpus
search engines are used to query this corpus
they can be commercial (e.g. Google), or specialized tools for linguistic research (e.g. WebCorp, LSE)
but: specialized linguistic search engines are limited to post-processing (?)
pros: large volume of data, no data acquisition necessary, easy to use
cons: no knowledge of the makeup of the dataset, no control over the dataset or the query engine, no tagging, parsing or other specialized linguistic tools (with commercial engines)
Broader issues with WaC
When we're looking for qualitative data, WaC is relatively unproblematic, but when we're comparing frequencies (i.e. taking a quantitative approach) it has serious issues.
The fact that Google controls the data means that both the data and how it is processed are unknown and may change at any time
...and most importantly: Google doesn't care!
Web for Corpus
Web for Corpus (WfC) extracts data from the Web and stores it locally
it is closer to traditional corpus development in the sense that data is consciously added by the researcher following certain criteria (randomness, balance etc.)
pros: control over makeup of the dataset, precise knowledge of its size, ability to annotate, reuse, share and publish the dataset, ability to parse, tag ...
cons: corpus construction is technically more challenging, corpus is smaller than the Web in its entirety (though still larger than traditional corpora)
Constructing a corpus using web data
Pick a data source (Wikipedia, Project Gutenberg, blogger.com, the entire WWW as indexed by Google, ...)
Retrieve the raw HTML data (spidering or crawling)
Process the HTML data, i.e. separate natural language from formatting instructions. For example:
Dear diary
I am really bored & tired right now...
becomes
Dear diary
I am really bored & tired right now
Tag, parse and store
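The processing step above (separating natural language from formatting instructions) can be sketched with Python's standard library. This is only a minimal illustration, not the tooling used for this talk: the class name `TextExtractor`, the function `strip_markup` and the sample HTML fragment are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps text nodes and discards tags (the formatting instructions)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into "&"
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.chunks.append(text)


def strip_markup(html):
    """Return the text content of an HTML fragment, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical raw fragment; the tags are invented for illustration.
raw = "<h2>Dear diary</h2>\n<p>I am really bored &amp; tired right now...</p>"
print(strip_markup(raw))
```

A sketch like this puts every text node on its own line, so inline tags (bold, links) would split a sentence; real pages also need boilerplate such as navigation and sidebars removed before tagging and parsing.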
Things to consider
...not a whole lot of language data!
Things to consider
... better, but we need to take register into account
Structured vs. unstructured language data
A blog is...
Blog (n): a website where entries are written in chronological order and displayed in reverse chronological order (web log).
blog (v): to maintain or add content to a blog.
Blogs are written by a variety of people (bloggers) with a variety of purposes. Every text in a blog can be directly linked to its author and usually has other meta-information (date, keywords etc.).
A few facts
Technorati (blog search engine) tracks about 70 million blogs worldwide
120,000 new blogs are created each day; that's about 1.4 every second
1.5 million entries (posts) are published every day
virtually every (written) language on the planet is represented in blogs
blog data is well-structured in the sense that it doesn't contain visual markup
The markup of a blog is semantic, meaning it contains meta-information about the content that a machine can understand.
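To illustrate what machine-readable meta-information can look like, here is a minimal sketch that reads author, date, keywords and text from a blog feed. It assumes the blog exposes an RSS 2.0 feed with a Dublin Core `dc:creator` element; the feed fragment, the author name and the function `read_posts` are invented for the example and do not come from the blogs discussed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 fragment; all names and values are invented.
FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Dear diary</title>
      <dc:creator>example_author</dc:creator>
      <pubDate>Thu, 08 Nov 2007 10:00:00 GMT</pubDate>
      <category>personal</category>
      <description>I am really bored &amp; tired right now...</description>
    </item>
  </channel>
</rss>"""

# Namespace prefix for Dublin Core elements in ElementTree's {uri}tag notation
DC = "{http://purl.org/dc/elements/1.1/}"


def read_posts(feed_xml):
    """Return one dict per post with its text and meta-information."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "author": item.findtext(DC + "creator"),
            "date": item.findtext("pubDate"),
            "keywords": [c.text for c in item.findall("category")],
            "text": item.findtext("description"),
        })
    return posts


posts = read_posts(FEED)
print(posts[0]["author"], "|", posts[0]["date"], "|", posts[0]["text"])
```

Because each post arrives already paired with its author and date, texts can be grouped by speaker without guessing from page layout.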
An example
A research example
Expression of futurity in English: will vs. be going to
will
origin as a transitive lexical verb (OE willan) with a meaning similar to German wollen; has been grammaticalized to express futurity
be going to
can be combined with +NP (movement) or +Inf.V (future); movement away from the speaker, towards a goal (Perez)
The notion of satisfying a condition forms one of the major distinctions between the two future expressions will and be going to. A sentence with will relies on a condition evident in the context that will enable the proposition to take place. (Perez)
Distribution of will vs. be going to in three blogs
light blue = will; dark blue = going to
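A per-author count of the two future expressions could be obtained roughly as follows. This is a simplified sketch, not the procedure behind the chart: the regular expressions, the function `count_futures` and the author labels are assumptions, and the sample sentences are partly invented. Plain pattern matching also counts nominal will ("my will") and going to + NP (movement), so a tagged corpus would be needed to separate these from the future uses.

```python
import re

# Rough surface patterns; they over-match (e.g. the noun "will", "going to Seattle").
WILL = re.compile(r"\bwill\b|\bwon't\b|'ll\b", re.IGNORECASE)
GOING_TO = re.compile(r"\bgoing\s+to\b|\bgonna\b", re.IGNORECASE)


def count_futures(posts_by_author):
    """Count will and be going to forms in each author's posts."""
    counts = {}
    for author, posts in posts_by_author.items():
        text = " ".join(posts)
        counts[author] = {
            "will": len(WILL.findall(text)),
            "going to": len(GOING_TO.findall(text)),
        }
    return counts


# Hypothetical input: author labels and some sentences are invented.
sample = {
    "author_a": ["I will post more tomorrow.", "We'll see how it goes."],
    "author_b": [
        "my car is going to get pretty crusty",
        "the work is still going to be there the next day",
        "It will rain.",
    ],
}

counts = count_futures(sample)
for author, c in counts.items():
    print(author, c)
```

Since blogs differ in size, raw counts like these should be turned into proportions or per-word rates before authors are compared.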
Personal pronoun frequency in the same blogs
THOMPSON
01 the DT 2638
02 and CC 1604
03 to TO 1272
04 of IN 1152
05 a DT 1034
06 in IN 821
07 For IN 680
08 on IN 509
09 is VBZ 465
10 with IN 375
J.SCHWARTZ
01 the DT 3077
02 and CC 2120
03 to TO 2072
04 a DT 1862
05 of IN 1486
06 in IN 1008
07 we PP 988
08 I PP 818
09 our PP$ 723
10 It PP 642
H.HAMILTON
01 the DT 3802
02 I PP 3613
03 to TO 2789
04 a DT 2045
05 of IN 1795
06 and CC 1788
07 It PP 1519
08 you PP 1205
09 in IN 1096
10 that IN 1077
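Frequency lists like the three above can be produced from tagger output with a simple counter. The sketch below assumes one token per line in the form token, tab, tag; that input format, the function names `top_tokens` and `pronoun_share`, and the sample tokens are assumptions for illustration. The tag names PP and PP$ are taken from the lists above.

```python
from collections import Counter


def top_tokens(tagged_lines, n=10):
    """Return the n most frequent (token, tag) pairs with their counts."""
    counts = Counter()
    for line in tagged_lines:
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip malformed or empty lines
        counts[(parts[0], parts[1])] += 1
    return counts.most_common(n)


def pronoun_share(tagged_lines, pronoun_tags=("PP", "PP$")):
    """Return the fraction of tokens tagged as personal or possessive pronouns."""
    tags = [line.split("\t")[1] for line in tagged_lines if "\t" in line]
    if not tags:
        return 0.0
    return sum(tag in pronoun_tags for tag in tags) / len(tags)


# Hypothetical tagged tokens, invented for the example.
lines = [
    "I\tPP", "read\tVBP", "the\tDT", "blog\tNN", "and\tCC",
    "I\tPP", "like\tVBP", "the\tDT", "posts\tNNS", "I\tPP", "write\tVBP",
]

top = top_tokens(lines, n=2)
for rank, ((token, tag), freq) in enumerate(top, start=1):
    print(f"{rank:02d} {token} {tag} {freq}")
print(round(pronoun_share(lines), 3))
```

A relative measure such as the pronoun share makes the blogs comparable even though their total token counts differ.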
Inanimate subjects with be going to
(1) my car is going to get pretty crusty
(2) the exhibit is going to be in Seattle
(3) the things that are left to do in this house are going to cost some money
(4) the work is still going to be there the next day
(5) my first post is going to be about troublesome managers
Observations
The assumption that be going to is more frequent with certain subjects is confirmed by the data (1st pers. -> 2nd pers. -> human -> animate -> inanimate).
This appears to be the strongest conditioning factor; degree of certainty seems less significant.
However, where subjects are inanimate, the course of action can always be described as certain.
These observations could be tested against a large number of sources (blogs), factoring in individual variation in addition to other factors.
Thanks for listening!