Comparable Corpora BootCat (CCBC)

Adam Kilgarriff, Avinesh PVS
Lexical Computing Ltd

BootCaT

• Bootstrapping Corpora and Terms
• Translators
  – Know the language
  – Not domain experts
  – Can interpret domain terms but can’t guess them
• Instant domain corpus from the web
• Marco Baroni and Silvia Bernardini (2004)

BootCaT method

• Piggyback on a search engine
  – Google, Yahoo, Bing
• Set of seed terms
• Repeat
  – Take 3 random seeds
  – Send to search engine
  – Gather ‘search hits’ pages
• Remove duplicates, find terms
  – Can iterate
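
The query-building and page-gathering loop above can be sketched roughly as follows. This is a minimal illustration, not the BootCaT implementation: the function names `make_queries` and `collect_urls` are hypothetical, and `search` stands in for whatever search-engine client is available (assumed here to take a tuple of seed terms and return a list of URLs).

```python
import random


def make_queries(seeds, n_queries, tuple_size=3, rng=None):
    """Draw distinct random tuples of seed terms to use as search queries."""
    rng = rng or random.Random()
    queries = set()
    attempts = 0
    max_attempts = n_queries * 20  # guard against looping forever on tiny seed sets
    while len(queries) < n_queries and attempts < max_attempts:
        attempts += 1
        # Sort each tuple so the same three seeds in another order count as one query.
        queries.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    return sorted(queries)


def collect_urls(queries, search):
    """Send each query to `search` (hypothetical callable) and keep each URL once."""
    seen = set()
    urls = []
    for query in queries:
        for url in search(query):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


if __name__ == "__main__":
    seeds = ["volcanology", "tephra", "magma", "ablation", "rhyolitic", "seismographs"]
    for query in make_queries(seeds, n_queries=5, rng=random.Random(0)):
        print(" ".join(query))
```

Downloading the pages, cleaning them, and extracting new terms for the next iteration are left out; only the seed sampling and URL de-duplication steps are shown.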

WebBootCaT

• Web interface
• Improved cleaning, duplicate removal
• Integrated with corpus tool (Sketch Engine)

Going multilingual

• Google-translate
  – English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic
  – French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques

• And do the same thing for French
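
Seed lists like the ones above mix single words with double-quoted multi-word terms, so they need to be split with the quoted phrases kept intact before queries can be built for each language. A small sketch of one way to do that follows; `parse_seeds` and `format_query` are hypothetical helper names, and the translation step itself is assumed to have happened already.

```python
import re


def parse_seeds(seed_line):
    """Split a seed line into terms, keeping double-quoted phrases as single terms."""
    pairs = re.findall(r'"([^"]+)"|(\S+)', seed_line)
    return [phrase or word for phrase, word in pairs]


def format_query(terms):
    """Join terms into a query string, re-quoting the multi-word ones."""
    return " ".join(f'"{t}"' if " " in t else t for t in terms)


if __name__ == "__main__":
    seed_lines = {
        "English": 'volcanology "volcanic eruption" tephra magma',
        "French": 'volcanologie "éruption volcanique" tephra magma',
    }
    for language, line in seed_lines.items():
        print(language, parse_seeds(line))
```

The same parsed lists can then be fed to the tuple-sampling step separately for each language, giving one corpus per language on the same domain.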

• By July 2011
  – All steps integrated
  – Propose bilingual terminology
