Corpus Linguistics Workshop

  • 8/11/2019 Corpus Linguistics Workshop

    1/8

    Corpus Linguistics workshop

    By

    Dr. Wessam Ibrahim of Tanta University

    Setting: Faculty of Education

    Time: 8-11 Sept. 2014

    Corpus tools:

    1. CQPweb (Lancaster University). Advantage: easiest to use. Disadvantage: cannot upload my own corpus

    User name: azzaabdeen

    Password: azzaabdeen

    2. AntConc. Advantage: can upload my own corpus. Disadvantage: complicated, but useful for part-of-speech tagging. The only available one

    3. Wmatrix: semantic domains and tagging. Free only for very small corpora

    4. WordSmith

    Dr. Wessam's email: [email protected]

    Introduction

    What is a corpus?

    A methodology to save time

    Corpus (Latin for "body"; plural: corpora): a collection of texts saved as plain text files so that a computer can read them (machine readable)

    The text has to be saved in Word and then as a plain text file

    PDF files have to be converted to plain text. A PDF saved as an image only cannot be used as a corpus
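
    As a minimal sketch of what "machine readable" means in practice, the Python snippet below reads one plain text file and splits it into word tokens. The function names load_corpus and tokenize are hypothetical, and the UTF-8 encoding and the "runs of letters" tokenization rule are assumptions; real corpus tools apply their own rules.

```python
import re
from pathlib import Path


def load_corpus(path, encoding="utf-8"):
    """Read one plain text file and return its contents as a string.

    Assumption: the file was saved as UTF-8 plain text.
    """
    return Path(path).read_text(encoding=encoding)


def tokenize(text):
    """Lowercase the text and return runs of letters (any script) as tokens.

    Assumption: digits and punctuation are not counted as words.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


# Example with an in-memory string instead of a file.
sample_tokens = tokenize("Corpus linguistics, 2014!")
```

    A scanned PDF (image only) gives such a script nothing to read, which is why it has to be converted to real text first.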

    Why use a corpus:

    For research work:

    1. Testing a hypothesis

    2. Avoiding researcher bias: a large corpus helps the researcher avoid bias, because it can support or refute a hypothesis

    3. Helps spot common and rare phenomena, especially in large corpora

    4. Helps us make generalizations

    What is corpus linguistics?

    1. Empirical: based on authentic data that provides evidence

    What is empiricism:

    a. Scientific/philosophical

    b. symptoms of evidence

    It needs a computer: software for analysis

    2. Quantitative and qualitative analytical techniques: we can get frequencies from a corpus, but these frequencies have to be interpreted. Say something about the pattern and explain it

    How to analyze?

    1. Search for the frequency of a certain item so that we can make claims. For example, the word "wash" is more frequent in women's speech than in men's, so we could make claims about why women use it more.

    2. Keywords: what is the corpus about? There should be a reference corpus to compare the uploaded corpus to, i.e. corpus A and a corpus B that is more general and larger. For example, a corpus of Arabic newspapers is compared to a more representative corpus of Arabic. Statistics will work out the results and give the keywords: words which are statistically significant.

    3. Tagging a corpus: adding information to your corpus. We do the tagging while compiling. For example, to compare men and women, each part should be tagged for who is speaking (e.g. this part was produced by a 25-year-old, middle-class female). Tagging involves a lot of work.

    4. Concordance: the little context in which the query word is used, a certain number of words before and after the query (key word). It is a list of all the occurrences of a word or phrase in a corpus, given in the context of the sentence it occurs in, sometimes called KWIC (key word in context): before + key word + after.

    5. Collocation: words that tend to occur together. They can have ideological connotations. For example, the words "spinster" and "bachelor" have ideological connotations because of their collocates:

    Bachelor: unmarried, man, happy, gentleman

    Spinster: unmarried woman, elderly, cold-hearted, witch

    The generalization that comes out of this may be that the English language is sexist.

    Another example is the Muslim Brotherhood in Egyptian and British newspapers. In Egyptian newspapers, the collocates give the connotation of violence, crackdown, source of trouble.

    In British newspapers, the collocate is "sit-in", which implies peacefulness and gives a positive image of the Ikhwan as victims, whereas the Egyptian image is negative because of a certain agenda the media is following.
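
    The keyword step above (comparing a study corpus with a reference corpus) can be sketched in Python with the log-likelihood statistic that is commonly used for keyness. This is only an illustration under stated assumptions: the function names are hypothetical, both corpora are assumed to be non-empty token lists, and the default threshold of 3.84 (p < 0.05 with one degree of freedom) is an assumed cut-off, not one given in the workshop.

```python
import math
from collections import Counter


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Log-likelihood score for one word in two corpora of known size."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Frequencies expected if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Words that are unusually frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    results = []
    for word, freq in study.items():
        # Keep only words that are relatively MORE frequent in the study corpus.
        if freq / n_study <= ref[word] / n_ref:
            continue
        score = log_likelihood(freq, ref[word], n_study, n_ref)
        if score >= threshold:
            results.append((word, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy data (invented for illustration): "wash" is frequent in the study corpus.
study_corpus = ["wash"] * 20 + ["the"] * 80
reference_corpus = ["wash"] * 2 + ["the"] * 98
key = keywords(study_corpus, reference_corpus)
```

    With this invented data, "wash" comes out as the only keyword; a statistically significant score still has to be interpreted by the researcher.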

    Features of a corpus:

    1. Very large: size is related to the kind of research questions

    2. Representative: includes different genres in order to talk about general things. News reports should be in focus

    3. Machine readable

    4. Often annotated: tagging is extra information you add according to syntax (e.g. n, v) or according to semantic domain, for example categorizing words that belong to groups of semantic features: food, family, sports. The software will do it. Wmatrix is the only software for semantic annotation; most of the others give syntactic annotation

    5. Representative: corpora are big in order to be representative of language variation. A corpus should be large enough to establish norms/patterns and to reveal cases of usual use.

    6. Annotated: tagging the corpus with information such as the age, sex and class of the speakers

    Types of Corpora

    1. Specialized corpus: For example:

    Genre: the language of newspaper

    Time: 2005 till present

    Place: texts published in Egypt

    2. General corpus: needs to be larger. For example, the British National Corpus (BNC) has about 100 million words of spoken and written British English. In general corpora we can search for things such as discourse markers, transitive verbs, modals or any other grammatical feature.

    There are two corpora: LOB, the Lancaster-Oslo/Bergen Corpus (British English, 1961), and FLOB, the Freiburg-LOB Corpus (British English, 1991). The American corpus from 1961 is the Brown Corpus.

    3. Multilingual corpus: English and Arabic, or American English and British English

    4. Parallel corpus: two corpora in two different languages, e.g. English and Arabic

    5. Learner corpus: language produced by people learning a particular language, e.g. the International Corpus of Learner English. Example finding: the adjectives expressing feelings are the same as those used by Americans

    6. Historical or diachronic corpus, e.g. the Helsinki Corpus: 1.5 million words of texts from 700 AD to 1700 AD

    7. Monitor corpus: continually added to, e.g. the Bank of English (COCA: an American corpus available for free)

    - The size of a corpus is based on your purpose: what do you want to do with it? Specialized corpora do not have to be big; it depends on the purpose.

    Demographic data: everyday conversation

    Context-governed data: TV language

    Types of Searches

    A single word: book

    A phrase: book the hotel

    One word or another: clever, smart

    Wildcards in words: hat, hit, hot

    Wildcards as words: the * man

    Part of speech: love_NN1

    Headword searches: {list/lists/listing}

    Lemma search: word derivatives, {light/verb}, {lights/N}, {lit, lighted, lighting}

    Restricted searches: only the news genre or only female speakers; set the restrictions before embarking on the corpus search
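
    To make the first five search types concrete, here is a rough sketch using Python regular expressions. This is a swapped-in technique for illustration only: it is not the query syntax of CQPweb or any other corpus tool, and the sample text and the search helper are invented for the example.

```python
import re

# Invented sample text for illustration.
text = (
    "She will book the hotel. The old man wore a hat; "
    "the tall man was hit by a hot, clever idea. Smart!"
)

# One regular expression per search type from the list above.
patterns = {
    "single word": r"\bbook\b",
    "phrase": r"\bbook the hotel\b",
    "one word or another": r"\b(?:clever|smart)\b",
    "wildcard in a word": r"\bh\wt\b",         # one unknown letter: hat, hit, hot
    "wildcard as a word": r"\bthe \w+ man\b",  # one unknown word between the and man
}


def search(pattern, source):
    """Return every match of pattern in source, ignoring case."""
    return re.findall(pattern, source, flags=re.IGNORECASE)


results = {name: search(pattern, text) for name, pattern in patterns.items()}
```

    Part-of-speech, headword and lemma searches need an annotated corpus, so they cannot be reproduced with plain regular expressions over raw text.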

    Coping with too many concordance lines: thin the concordance, e.g. to 100 lines. Look at 30 lines, then another 30, until no new patterns appear

    - Use a small number of lines to form a hypothesis, then carry out other searches

    - Use collocation or keywords. Get the collocates of each keyword

    - All choices should be based on statistical significance
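
    The idea of a concordance (before + key word + after) and of thinning it to a manageable sample can be sketched as follows. The function names kwic and thin are hypothetical, and random sampling with a fixed seed is an assumption about how the thinning is done; corpus tools offer their own thinning options.

```python
import random


def kwic(tokens, node, span=5):
    """Return (left context, key word, right context) for every hit of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, token, right))
    return lines


def thin(lines, n=30, seed=0):
    """Return a random sample of n concordance lines (all lines if fewer than n)."""
    if len(lines) <= n:
        return list(lines)
    return random.Random(seed).sample(lines, n)


# Invented example sentence.
example_tokens = "tell me a story and I will tell you a story".split()
concordance = kwic(example_tokens, "story", span=2)
```

    Reading a thinned sample of 30 lines at a time, as suggested above, then amounts to calling thin with a different seed for each batch.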

    Collocation

    The systematic co-occurrence of words in use. The key word being searched is the node word.

    Fixed: e.g. telephone operator (a fixed relationship)

    Variable: tell me a story

    Story to tell

    Non-idiomatic: told a story

    Tell a story

    Telling a story

    Some collocates are based on a certain ideology

    Idiomatic collocation: kick the bucket

    The node word is the word whose collocates I want to search for. How large should the span be?

    AntConc: -1 to +1, and it can be changed to -5 to +5

    It is important to specify the span of the collocates before you do the statistics

    i.e. 5 words before the node word + 5 words after
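
    A minimal sketch of what the span means: the function below counts every word that falls within a chosen number of words to the left and right of the node word. The name collocates and its parameters are hypothetical, and the tokens are assumed to be already lowercased.

```python
from collections import Counter


def collocates(tokens, node, left=5, right=5):
    """Count the words found within the span around each occurrence of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Words before the node plus words after it; the node itself is excluded.
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


# Invented example sentence, searched with a narrow -1 to +1 span.
sentence = "he told a story and she told a joke".split()
narrow = collocates(sentence, "told", left=1, right=1)
```

    Widening the span from 1 to 5 adds more candidate collocates, which is why the span has to be fixed before any statistics are calculated.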

    Log-likelihood: the most frequently used

    Mutual information: in a small corpus we can use mutual information for statistical significance. It measures the strength of association between 2 words (collocates)

    Mutual information (MI) is mainly used to get at the ideology of the producer: the words through which the journalist imposed this ideology. It reveals subtle patterns that are statistically significant because they carry an ideology. It creates an entity. It has a cut-off, which tells us when to say a collocation is statistically significant. MI is a measure of effect size showing the strength or salience of a collocation

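    Cut-off: MI = 3. MI is on a base-2 logarithmic scale: co-occurring 2 times as often as expected gives MI = 1

    Co-occurring 8 times as often as expected gives MI = 3, which indicates a very strong association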
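
    As a worked illustration, MI can be computed from four counts. The sketch assumes the simple textbook definition, MI = log2(observed / expected), where expected = f(node) x f(collocate) / corpus size; corpus tools often also adjust for the span size, so their scores may differ. The function name and the counts in the example are invented.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """MI = log2(observed co-occurrences / co-occurrences expected by chance).

    Assumption: pair_freq is greater than zero (log2 of 0 is undefined).
    """
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


# Invented counts: node and collocate each occur 1,000 times in a
# 1,000,000-word corpus, so 1 co-occurrence is expected by chance.
mi_two = mutual_information(2, 1000, 1000, 1_000_000)
mi_eight = mutual_information(8, 1000, 1000, 1_000_000)
```

    With these counts, 2 co-occurrences give MI = 1 and 8 co-occurrences give MI = 3.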

    Colligation: a word collocates with a certain part of speech

    Semantic preference: the collocates belong to the same semantic domain:

    A glass of: water, lemon, juice (collocates from the domain of cold beverages)

    Semantic prosody: words used for a special feeling or effect, i.e. a negative or positive connotation

    Semantic preference is a common semantic field around a word, e.g.:

    Consequence + adjective related to logic/importance

    Semantic/discourse prosody: "cause" as a noun (rare): aim (positive)

    "Cause" as a verb: bad (negative)

    Semantic prosody explains connotation

    Corpus Software Tools

    WordSmith 5

    AntConc

    CQPweb

    Wmatrix

    WMatrix

    1. Word is----- T frequency

    2. Open File -> corpus text file

    Choose texts now -> OK

    If you want to change the text, choose the button to change the selection.

    3. Make a word list now

    WordSmith Tools

    File setting utilities window

    - word list

    - tick file

    -open

    - Choose texts from my computer

    - Browse

    - Choose my file from my computer

    Change the selection in case you need to change it (all books, one book, 2 books)

    Select

    -Tick on the ruler

    -Ok

    If I need to change after downloading -> highlight -> clear

    - Make a word list now

    - All words appear arranged by frequency, from most to least frequent (top down)

    - Frequency: number of occurrences

    - Percentage: the frequency of occurrences relative to the whole text
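
    What the word list shows (each word, its frequency, and its percentage of the text) can be reproduced in a few lines of Python. This is an independent sketch, not WordSmith's own procedure; the function name word_list and the sample sentence are invented.

```python
from collections import Counter


def word_list(tokens):
    """Return (word, frequency, percentage of all tokens), most frequent first."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, freq, 100 * freq / total) for word, freq in counts.most_common()]


# Invented sample text.
rows = word_list("the cat and the dog and the bird".split())
for word, freq, percent in rows:
    print(f"{word}\t{freq}\t{percent:.2f}%")
```

    Words with the same frequency are listed in the order in which they first appear in the text.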

    How to save a word list?

    - Click File -> Save as a word list

    - Save twice: as an Excel sheet and as a word list

    - Click Function -> Concord -> a window opens and the key word appears with its frequency (concordances highlighted)

    - Window -> Sort now -> Yes, with Concord