1 ACRH2 Lisbon Thursday 29th November 2012 Do we need annotated corpora in the era of the data deluge? Martin Wynne [email protected] [email protected] Oxford e-Research Centre & IT Services (formerly OUCS) & Faculty of Linguistics, Philology and Phonetics, University of Oxford

Annotated Corpora for Research in the Humanities


Page 1: Annotated Corpora for Research in the Humanities

1

ACRH2

Lisbon

Thursday 29th November 2012

Do we need annotated corpora in the era of the data deluge?

Martin Wynne

[email protected]

[email protected]

Oxford e-Research Centre &

IT Services (formerly OUCS) &

Faculty of Linguistics, Philology and Phonetics,

University of Oxford


Problems with annotation

It can:

• lead to circular reasoning

• be incorrect

• be inconsistent

• follow a particular theory

• have a specific level of granularity

• use a particular tag-set

• introduce subjective interpretations


The data deluge


The case for the corpus today (against “the web as corpus”)

• The spoken corpus: spoken, and other non-computer-mediated data

• The historical corpus: pre-internet data (beyond books)

• The specialised corpus: with integrity, provenance and controlled sampling and representativeness

• The annotated corpus: adding and sharing linguistic annotation

• The web corpus: filtering and organising the data deluge (aka "the web for corpus")


The case for the corpus today (against “the web as corpus”)

But we do need to go beyond the finite text corpus:

● speech

● video

● the language of the internet - new genres, new media, new modes

● capturing the context, especially other data streams

● engaging with the non-finite corpora (aka "the web as corpus")


Image by James Cridland from Flickr. Some rights reserved.


Annotation - Why?

• To perform identification, categorization and analysis of features of the text

• It enables certain types of search and analysis, especially beyond the word form (e.g. “search for all inflected forms of cause as a verb”)

• It can be the foundation for further automatic analysis of a corpus (e.g. POS tags can be used for parsing)

• Preserving the analysis, enabling replicability of research, and reusability of the annotated corpus
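
The "inflected forms of cause as a verb" search can be sketched in a few lines of Python. This is a minimal illustration, not a real corpus tool: the `Token` type, the hand-annotated toy corpus, the tag names (`VERB`, `NOUN`, and so on) and the `find_lemma` helper are all assumptions made for the example. It shows that a lemma and part-of-speech layer lets the query separate verbal "causes" and "caused" from the noun "cause", which a plain word-form search cannot do.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    lemma: str  # headword added by annotation
    pos: str    # part-of-speech tag added by annotation (illustrative tag-set)


# Hand-annotated toy corpus; sentences and tags are invented for illustration.
CORPUS = [
    Token("Smoking", "smoking", "NOUN"),
    Token("causes", "cause", "VERB"),
    Token("disease", "disease", "NOUN"),
    Token(".", ".", "PUNCT"),
    Token("The", "the", "DET"),
    Token("cause", "cause", "NOUN"),
    Token("was", "be", "VERB"),
    Token("unknown", "unknown", "ADJ"),
    Token(".", ".", "PUNCT"),
    Token("It", "it", "PRON"),
    Token("caused", "cause", "VERB"),
    Token("delays", "delay", "NOUN"),
    Token(".", ".", "PUNCT"),
]


def find_lemma(tokens, lemma, pos):
    """Return (index, word) for every token matching both the lemma and the POS tag."""
    return [
        (i, t.word)
        for i, t in enumerate(tokens)
        if t.lemma == lemma and t.pos == pos
    ]


# All inflected forms of "cause" used as a verb; the noun at index 5 is excluded.
verb_hits = find_lemma(CORPUS, "cause", "VERB")
print(verb_hits)
```

The same query with `"NOUN"` in place of `"VERB"` returns only the nominal use, which is the distinction the annotation layer makes searchable.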


Annotation: less than the text?

“Annotation of a text is a procedure which loses information. There is no point in arguing that the information is in the computer's memory somewhere - annotation is the substitution of a general category for a specific item, and with respect to that area of the classification, the item has lost its uniqueness.”

(John Sinclair, personal communication, 2001)


Annotation: how?

• Annotations should be separable

• Detailed and explicit documentation should be provided

• Annotation practices should be linguistically consensual

• Annotation should observe standards

(Leech 2005)

http://www.ota.ox.ac.uk/documents/creating/dlc/
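
One common way to keep annotations separable is stand-off annotation, where the text is stored untouched and each annotation points into it by character offset. The sketch below is an assumption-laden illustration rather than a description of any particular standard: the `Annotation` record, the `"pos"` layer name, the tag values and the `spans` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    layer: str  # name of the annotation layer (hypothetical, e.g. "pos")
    value: str  # tag assigned to the span (illustrative tag-set)


# The primary text is never modified by annotation.
TEXT = "What news, Borachio?"

# Annotations live in a separate structure and can be added, replaced or dropped.
ANNOTATIONS = [
    Annotation(0, 4, "pos", "DET"),
    Annotation(5, 9, "pos", "NOUN"),
    Annotation(11, 19, "pos", "PROPN"),
]


def spans(text, annotations, layer):
    """Resolve one annotation layer against the text as (substring, tag) pairs."""
    return [(text[a.start:a.end], a.value) for a in annotations if a.layer == layer]


print(spans(TEXT, ANNOTATIONS, "pos"))
```

Because `TEXT` is stored independently, a reader who distrusts a given layer can ignore it, and a competing analysis can be supplied as a second layer over the same offsets.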


Annotation standards?

Use of standards can help to ensure successful:

• interpretation,

• interchange,

• preservation,

• incorporation into other resources,

• processing by generic software.

Using standards is also a way of resolving tricky encoding decisions, and of justifying and documenting your decisions.


Potential problems with annotation

1. Annotation is liable to be subjective and inconsistent

2. Annotation is sometimes intellectual and painstaking, sometimes trivial and automatic

3. Annotation leads to digital silos

4. Annotation makes building a shared services infrastructure difficult


Interoperability and sustainability for digital textual scholarship

Well-known problems with digital resources in the humanities:

• fragmentation of communities, resources, tools;

• lack of connectedness and interoperability;

• sustainability of online services;

• lack of deployment of tools as reliable and available services

There is a potential solution in distributed, federated infrastructure services.


The CLARIN Vision

A researcher in Darmstadt, from his desktop computer, can do a single sign-on, with local authentication, and then:

• search for, find and obtain authorization to use corpora in Oxford, Prague and Berlin

• select the precise dataset to work on, and save that selection

• run semantic analysis tools from Budapest and statistical tools from Tübingen over the dataset

• use computational power from the local, national or other computing centre where necessary

• obtain advice and support for carrying out all technical and methodological procedures

• save the workflow and results of the analysis, and share those results with collaborators in Paris, Vienna and Zagreb

• discuss and iteratively adapt and re-run the analyses with collaborators


Silos or fishtanks?

Let's talk about fishtanks rather than silos...

There are lots of fishtanks out there, some very elaborate, big, pretty...

But they're all in different places and unconnected.

And if I want to keep a fish I have to build a fishtank (or put it in yours)...

And who's going to carry on feeding the fish?

Let's not all make our own fishtanks.


Wouldn't it be better to have an ecosystem where we can all set our fishes free?

You can access all of the riches of the deep and it's a lot easier to get into fish research


CLARIN

http://www.clarin.eu/

Infrastructure services for research in the humanities and social sciences using language resources and tools.

Services to include:

• Access and identity federation

• Network of service centres

• Concept and component metadata registries

• Federated resource discovery

• Federated search across resources

• SOA for connecting tools

• PID services

Bamboo

http://www.project-bamboo.org/

Project Bamboo is building applications and shared infrastructure for humanities research, principally:

• Research environments for humanities scholars

• Infrastructure allowing librarians and technologists to support humanities scholarship

• Evolution of shared applications for the curation and exploration of widely distributed content collections

• Build a community for uptake, expansion and sustainability

DARIAH

http://www.dariah.eu

Enhance and support digitally-enabled research across the humanities and arts.

DARIAH is working with communities of practice to:

• Explore and apply ICT-based methods and tools

• Improve research opportunities and outcomes through linking distributed digital source materials of many kinds

• Exchange knowledge, expertise, methodologies and practices across domains and disciplines


Corpus Linguistics


Player One (a man) Player Two (a woman)

[Enter two players] What news, Borachio?

[Don John, Much Ado About Nothing, I, 3]

I came yonder from a great supper: I can give you intelligence of an intended marriage.

[Borachio, Much Ado About Nothing, I, 3]

A married man! that's most intolerable.

[Earl of Warwick, Henry VI Part I, V, 4]

They say the lady is fair; 'tis a truth, I can bear them witness; and virtuous; 'tis so, I cannot reprove it

[Benedick, Much Ado About Nothing, II, 3]

Yet hasty marriage seldom proveth well.

[Richard III, Henry VI Part III, IV, 1]

Is the single man therefore blessed? No; as a wall'd town is more worthier than a village, so is the forehead of a married man more honourable than the bare brow of a bachelor

[Touchstone, As You Like It, III, 3]

Many a good hanging prevents a bad marriage

[Feste, Twelfth Night, I, 5]

By this marriage, All little jealousies, which now seem great, And all great fears, which now import their dangers, Would then be nothing

[Agrippa, Antony and Cleopatra, II, 2]

I may chance have some odd quirks and remnants of wit broken on me, because I have railed so long against marriage: but doth not the appetite alter? a man loves the meat in his youth that he cannot endure in his age.

[Benedick, Much Ado About Nothing, II, 3]

They are in the very wrath of love, and they will together. Clubs cannot part them.

[Rosalind, As You Like It, V, 2]

Speak low, if you speak love.

[Don Pedro, Much Ado About Nothing, II, 1]

I can be secret as a dumb man; I would have you think so; but, on my allegiance, mark you this, on my allegiance. He is in love.

[Benedick, Much Ado About Nothing, I, 1]

By this day! She's a fair lady: I do spy some marks of love in her.

[Benedick, Much Ado About Nothing, II, 3]

He has been, madam, a wicked creature, as you and all flesh and blood are; and, indeed, he does marry that he may repent.

[Clown, All's Well That Ends Well, I, 3]

She will keep no fool, sir, till she be married; and fools are as like husbands as pilchards are to herrings; the husband's the bigger

[Feste, Twelfth Night, III, 1]

Such a mad marriage never was before. Hark, hark! I hear the minstrels play.

[Gremio, Taming of the Shrew, III, 2]

If music be the food of love, play on

[Orsino, Twelfth Night, I, 1]

And what is music then? Such it is As are those dulcet sounds in break of day That creep into the dreaming bridegroom's ear, And summon him to marriage.

[Portia, Merchant of Venice, III, 2]

My lord, they stay for you to give your daughter to her husband.

[Messenger, Much Ado About Nothing, III, 5]


Data-intensive Humanities


Nature 474, 436-440 (2011) | doi:10.1038/474436a


"[There is] a monolithic conception of social space, according to which it would suffice to have the right information to make the right decisions. But in point of fact, information itself is far from homogenous and no purely quantitative approach is satisfying. Having ever greater amounts of information at our fingertips not only does not make us more virtuous, as Rousseau already predicted, but it does not even make us more knowledgeable."

[Tzvetan Todorov, In Defence of the Enlightenment, 2009]


The simple challenge then...

... to transform the Humanities by promoting shared digital services, facilities, resources and tools, without destroying the justification and arguments for the Humanities for the Humanities' sake, and thus accidentally contributing to the decline and eventual destruction of civilization


The 'take-home messages'

● in the era of the data deluge, web science and digital scholarship, we need to rethink the case for the corpus today, and the case for doing annotation

● we need an ecosystem, not separate 'fishtanks'

● annotation risks more fragmentation

● we need to follow the physical sciences in deciding priorities & adopting standards, reducing complexity and variety, to promote shared facilities and infrastructures

● but, at the same time, we need to avoid arguments for scientism and instrumentalism, and to defend the humanities
