Challenges in the linguistic exploitation of specialized republishable web corpora

Adrien Barbaresi

Berlin-Brandenburg Academy of Sciences and Humanities (BBAW)

RESAW conference 2015, Århus, June 10, 2015



Adrien Barbaresi (BBAW) Specialized republishable web corpora 2015-06-10 1 / 15

Outline

• Context
  • Specialized web corpora
  • Construction and availability

• Challenges
  • Metadata extraction
  • Quality assessment of content
  • Licensing and republishing


Context Specialized web corpora

Text corpora

Text collections

in German

gathered on the Web

used by linguists

available via a web interface


“Specialized” corpora

Definition

The corpora focus on a particular text genre or source.

Goal for linguists: better coverage of specific written text types and genres not found in “traditional” corpora.

Construction

1 Discovery and download: web crawling techniques

2 Stored in a processed version: linguistic corpus

3 Standardized formats: interoperability within the research community


Two cases of republishable corpora

“Standard” case: German political speeches

Chancellery | 1,831 speeches | 1998–2012
Presidency | 1,442 speeches | 1984–2012
https://adrien.barbaresi.eu/corpora/speeches/

“Borderline” case: German blogs under Creative Commons licenses

Blogs | 250,000 documents | ∼ 100 million tokens
https://kaskade.dwds.de/dstar/blogs/


Context Construction and availability

File formats

1 Web archives (HTML, no WARC to this day)

⇒ linguistic processing toolchain

2a XML TEI format (https://tei-c.org)

2b Browser-friendly HTML documents
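
As an illustration of step 2a, here is a minimal sketch of wrapping one processed text in a TEI document, using only the Python standard library. The function name `build_tei`, its parameters, and the placeholder publication statement are assumptions made for this example; they are not taken from the toolchain used for the corpora.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title, source_url, date, paragraphs):
    """Serialize one text as a minimal TEI document (hypothetical helper)."""
    # Serialize TEI as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None, **attrib):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrib)
        node.text = text
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    # fileDesc needs a title statement, a publication statement and a source.
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Placeholder publication statement")
    bibl = el(el(file_desc, "sourceDesc"), "bibl")
    el(bibl, "ptr", target=source_url)
    el(bibl, "date", date, when=date)
    # The body holds the text itself, one <p> per paragraph.
    body = el(el(root, "text"), "body")
    for paragraph in paragraphs:
        el(body, "p", paragraph)
    return ET.tostring(root, encoding="unicode")


print(build_tei("Rede", "https://example.org/rede", "2012-01-31",
                ["Erster Absatz.", "Zweiter Absatz."]))
```

Keeping the source URL and the date in the header is what lets the same file feed both the querying architecture and a browser-friendly HTML rendering.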


Interface to the political speeches: static HTML documents


Interface to the blogs: querying architecture @DWDS


Challenges Metadata extraction

Data quality

Even small or rare mistakes, in date encoding for instance, may cause the application to be disregarded or discarded by researchers in the humanities.

Potentially erroneous metadata in “one size fits all” web corpora may undermine the relevance of web texts for linguistic purposes.

→ “Hi-Fi” web corpora promote web sources and the modernization of research methodology
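
One way to catch date-encoding mistakes early is a plausibility check at extraction time. The sketch below is an assumption-laden illustration, not the procedure used for these corpora: the function name `check_date`, the two accepted input formats, and the default time window are all hypothetical choices.

```python
from datetime import date, datetime


def check_date(raw, earliest=date(1984, 1, 1), latest=date(2015, 6, 10)):
    """Return an ISO 8601 date string if `raw` parses and is plausible, else None.

    The accepted formats and the default window are assumptions for this sketch.
    """
    for fmt in ("%Y-%m-%d", "%d.%m.%Y"):
        try:
            parsed = datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # wrong format or impossible calendar date, try the next one
        # A well-formed date outside the window is treated as an extraction error.
        return parsed.isoformat() if earliest <= parsed <= latest else None
    return None


for candidate in ("31.01.2012", "2012-02-30", "2031-01-01"):
    print(candidate, "->", check_date(candidate))
```

Documents for which the check returns `None` can be set aside for manual inspection rather than entering the corpus with a wrong date.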


Examples: quality of metadata

Figure: Relative frequency of lemma “Google” in the blog corpus, classified by date

Figure: Relative frequency of lemma “Zuckerberg” in the blog corpus, classified by date

Querying and plotting software (DDC & DiaCollo): Bryan Jurish (BBAW)
http://odo.dwds.de/~moocow/software/
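
The quantity plotted in such figures can be sketched as follows. This is not DDC or DiaCollo code, only a minimal illustration under stated assumptions: the function name `relative_frequency_by_year` and the document layout (a `date` string in ISO format and a list of `lemmas`) are hypothetical.

```python
from collections import Counter


def relative_frequency_by_year(documents, lemma, per=1_000_000):
    """Occurrences of `lemma` per `per` tokens, grouped by publication year."""
    hits = Counter()
    totals = Counter()
    for doc in documents:
        year = doc["date"][:4]  # assumes ISO dates such as "2010-05-01"
        totals[year] += len(doc["lemmas"])
        hits[year] += sum(1 for item in doc["lemmas"] if item == lemma)
    # Years without any tokens are skipped to avoid dividing by zero.
    return {year: hits[year] / totals[year] * per
            for year in sorted(totals) if totals[year]}


sample = [
    {"date": "2010-05-01", "lemmas": ["Google", "sein", "gut", "Google"]},
    {"date": "2010-12-01", "lemmas": ["suchen", "suchen", "suchen", "suchen"]},
    {"date": "2011-01-01", "lemmas": ["Zuckerberg", "sagen"]},
]
print(relative_frequency_by_year(sample, "Google", per=100))
```

Because both the hits and the token totals are grouped by the extracted date, a wrongly dated document distorts the curve for two years at once, which is why metadata quality matters for this kind of plot.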


Challenges Quality assessment of content

Example: text quality (query: “document” in blog corpus)


Challenges Licensing and republishing

Last but not least: License issues

Different countries, different laws (public domain in the USA, political speeches in Germany etc.)

To be sure: check content and licenses


Manual content checks for the blogs

2727 blog candidates

1766 blogs can be used without restriction (65 %), since all the textual content qualifies for archiving:

• At least something on the website

• It is a blog

• Mostly written in German

• Under CC license


CC license terms (blog corpus)

Most frequent license types:

652 BY-NC-SA

532 BY-NC-ND

351 BY-SA

282 BY

129 BY-NC

58 BY-ND

Remarks

• Theoretically, the CC license cannot be overridden by another once the content has been published

• The use of *-ND (NoDerivatives) licenses might be a problem

• Differences between countries are not supposed to be a concern
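
To gauge how much material the *-ND remark concerns, the counts listed above can be tallied by clause. This is a small sketch; the names `LICENSE_COUNTS` and `share_with_clause` are chosen for the example, and the shares refer only to the six license types listed, not to the whole corpus.

```python
# Counts of the most frequent license types, as listed on the slide.
LICENSE_COUNTS = {
    "BY-NC-SA": 652,
    "BY-NC-ND": 532,
    "BY-SA": 351,
    "BY": 282,
    "BY-NC": 129,
    "BY-ND": 58,
}


def share_with_clause(counts, clause):
    """Fraction of the listed licenses whose name contains `clause` (e.g. "ND")."""
    total = sum(counts.values())
    matching = sum(n for name, n in counts.items() if clause in name.split("-"))
    return matching / total


for clause in ("NC", "ND", "SA"):
    print(f"{clause}: {share_with_clause(LICENSE_COUNTS, clause):.1%}")
```

Among the listed types, 590 of 2004 licenses carry an ND clause, so the question of whether a processed corpus counts as a derivative affects a substantial part of the collection.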


Thank you for your attention


@adbarbaresi

http://purl.org/adrien-barbaresi

Document under CC BY-SA 4.0 license
