"Tailor Made Concordancer: (Semi-) Big Data Corpora and Flexible Open Source Software",...

www.advisori.de

Some explanations – Writing Centres

Student Diversity

The Writing Centre Triangle

Student

Scientific Instructor

Writing Instructor

no communication

Missing knowledge: Content, Linguistics, Academic Traditions

A Mismatch in Communication

A Chinese student of mechanical engineering writing a bachelor's thesis in German

No idea of German metalanguage & German academic traditions

A German language instructor with a master's degree in social sciences

No idea of mechanical engineering in terms of content & academic traditions

An Example

• Which verb goes together with “regression”:

a. Fit
b. Estimate
c. Calculate
d. Predict
e. Compute
f. I-hope-it-is-not-contagious

Solution Strategies

• Ask a dictionary: There are no specialised dictionaries
• Ask Google: How would you?
• Ask the student: She/he does not know
• Ask someone else: Your colleagues know as much as you know
• Have a look at the respective literature: A good starting point

Old-Fashioned Knowledge Mining

Corpus Linguistics / Text Mining for Automated Knowledge Generation

The Task

Design, programme and implement a tool that helps language instructors working at writing centres to support students writing in a foreign language

Some Challenges

• No one wants to use a programme with such a syntax:
• [a-z]*\[vbp\]\s[a-z\s]*\sregression[a-z]
• Sentence boundaries need to be respected
• It needs to run online, offline, on Windows, Windows Server, Linux, Linux servers and Mac (hey, why not on a smartphone as well)
• It needs to be easily maintainable
• It needs to return high-quality results without being too techy regarding IT and linguistic special terms
• It needs to be cheap (i.e. free)
• It needs to work with German, English and Russian texts
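The first two challenges can be illustrated with a minimal sketch in Python, chosen here for illustration only and not the tool's actual implementation. It assumes a corpus that is already split into sentences and POS-tagged as (word, tag) pairs. The function name `verbs_before`, the Penn-style tag prefix "VB" and the sample sentences are all hypothetical. The user supplies only a keyword instead of a regular expression, and the search never crosses a sentence boundary.

```python
from collections import Counter


def verbs_before(sentences, keyword, verb_prefix="VB", window=4):
    """Count verbs found up to `window` tokens before `keyword`.

    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    Each sentence is searched on its own, so a match can never span
    a sentence boundary.
    """
    counts = Counter()
    for sentence in sentences:
        for i, (word, _tag) in enumerate(sentence):
            if word.lower() != keyword.lower():
                continue
            start = max(0, i - window)
            for prev_word, prev_tag in sentence[start:i]:
                if prev_tag.startswith(verb_prefix):
                    counts[prev_word.lower()] += 1
    return counts


# Hypothetical, hand-tagged sample corpus (assumption, not real data).
tagged = [
    [("We", "PRP"), ("fit", "VBP"), ("a", "DT"), ("linear", "JJ"),
     ("regression", "NN"), (".", ".")],
    [("They", "PRP"), ("estimate", "VBP"), ("the", "DT"),
     ("regression", "NN"), (".", ".")],
    [("We", "PRP"), ("fit", "VBP"), ("the", "DT"), ("regression", "NN"),
     ("again", "RB"), (".", ".")],
    [("Results", "NNS"), ("vary", "VBP"), (".", ".")],
    [("Regression", "NN"), ("is", "VBZ"), ("common", "JJ"), (".", ".")],
]

result = verbs_before(tagged, "regression")
for verb, count in result.most_common():
    print(verb, count)
```

In this sample, "vary" is not counted for the sentence-initial "Regression" that follows it, because the two words sit in different sentences.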

The Hannover Concordancer – A Joint Venture

The Architecture

Diagram components: Texts, Metadata, LSA, Database, Local / Remote Server, Client

Text Preparation Workflow

PDF → TXT → XML → RData → DB

Components: RData Index, Texts, Meta Information, Document Term Matrix

Diagram labels: Backend, Pre-Processing
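As a rough illustration of the last pre-processing product, a document term matrix can be sketched as follows. This is an assumption-laden toy in Python, not the tool's own code (the RData format suggests the original pipeline uses R). The helper names `tokenize` and `document_term_matrix` and the sample texts are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return runs of letters.

    The pattern matches Unicode letters, so German and Russian
    words are handled as well as English ones.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def document_term_matrix(texts):
    """Return (vocabulary, matrix) with one row per text, one column per term."""
    counts = [Counter(tokenize(text)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, matrix


# Hypothetical sample texts (assumption, not real corpus data).
texts = ["We fit a regression.", "We estimate a regression model."]
vocabulary, matrix = document_term_matrix(texts)

print(vocabulary)
for row in matrix:
    print(row)
```

Each cell holds the number of times a term occurs in a document. A real corpus would need a sparse representation, since most cells are zero.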

Query Input and Programme Output

Query Input: Words, Lemmata, POS Tags (of each up to 5); One Corpus, Two Corpora

Output: KWIC, Collocations, N-Grams, Readings, LSA Associations, Frequencies

Axis label: Complexity
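The simplest output type, KWIC (keyword in context), can be sketched as below. This is a minimal Python illustration under stated assumptions, not the concordancer's implementation: the function name `kwic`, the naive punctuation-based sentence splitter and the sample text are hypothetical.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    The text is first split into sentences with a naive rule
    (whitespace after '.', '!' or '?'), so the context window
    never reaches into a neighbouring sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        tokens = re.findall(r"[^\W\d_]+", sentence)
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits


# Hypothetical sample text (assumption, not real corpus data).
sample = ("We fit a linear regression to the data. "
          "The regression was significant.")
hits = kwic(sample, "regression")

for left, key, right in hits:
    print(f"{left:>20} | {key} | {right}")
```

A production splitter would need to handle abbreviations and decimal points, which this rule does not.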

Contact Details

Feel free to contact me:

Via E-Mail: tobias.gaertner@advisori.de
On Xing: https://www.xing.com/profile/Tobias_Gaertner35
On LinkedIn: https://www.linkedin.com/in/tobias-g%C3%A4rtner-b11205125/

Did you know we are hiring?
