CLAN: Computerized Language Analysis

CLAN

A set of programmes that allows you to carry out different types of linguistic analysis.

Characteristics of the transcription

CHAT format
Speaker linguistic data
Researcher data (codes for general or particular analyses)

CLAN programmes allow you to analyse both types of data (linguistic data and codes).

Adding codes

Codes can be included in the speaker tiers: the guy bought el@s:spa libro@s:spa
Codes can also be added on separate tiers.
We can code any linguistic data using our own codes:
Syntactic data
Morphological data
Phonological data
Gesture data
Discourse data, etc.

CLAN offers some programmes that help with coding (e.g. MOR).

Example of a transcription with MOR

*CHI: oop I spilled it .
%mor: int|oop pro|I v|spill-PAST pro|it .

Most of this work has to be done by hand by the researcher, because it depends on your own research.

CLAN Programmes

freq: calculates the frequency of any item in the transcriptions. See also its variant:

statfreq: produces a summary of word or code frequencies.

combo: allows you to search for strings that match patterns of letters, words, or groups of words.

kwal: searches data for user-specified words and outputs those keywords in context.

mlu: used primarily to determine the mean length of utterance of a specified speaker. It also provides the total number of utterances and morphemes in a file.

gem, gemfreq

mor: a programme that allows automatic morphosyntactic coding.

Micromultimodal analysis for gesture : http://talkbank.org/CABank/frokost.zip.

phonfreq: tabulates all of the segments on the %pho line.

Micromultimodal Analysis for Gesture

The analysis of gestural, postural, and proxemic communication requires careful attention to detail. Because gesture uses many regions of the body, and because multiple participants often produce gestures in parallel, there is an enormous amount of concurrent or parallel activity going on across the various gestural and verbal streams, particularly in conversations involving more than two people. One approach to dealing with this multiple parallelism uses the Partitur or musical notation display found in ELAN or EXMARaLDA. CHAT files can be converted to these formats for the analysis of gesture. However, there are good reasons to consider working instead inside CLAN, even for these more complex analyses. In this section, we will describe how micromultimodal analysis for gesture can be constructed inside CLAN.

In CLAN, we have access to all the programmes through the option called Commands.

Setting CLAN

The files to be analysed must be placed where CLAN is located:
Where CLAN is
In LIB

FREQ

One of the most powerful programs

One of the easiest programs to use and a good program to start with when learning to use CLAN.

It constructs a frequency word count for user-specified files, that is, it calculates the number of times a word occurs in a file or set of files. It produces:
a list of all the words
their frequency counts
a type-token ratio (a rough measure of lexical diversity)

What FREQ ignores

The CHAT manual specifies two special symbols that are used when transcribing difficult material. The xxx symbol is used to indicate unintelligible speech and the www symbol is used to indicate speech that is untranscribable for technical reasons. FREQ ignores these symbols by default. Also excluded are all the words beginning with one of the following characters: 0, &, +, -, #. If you wish to include them in your analyses, list them, along with other words you are searching for, using the +s/-s option. The FREQ program also ignores header and code tiers by default. Use the +t option if you want to include headers or coding tiers.
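To make the defaults described above concrete, here is a minimal Python sketch of a FREQ-like count. It is not CLAN's implementation: the function names, the sample lines, and the lowercasing of words are assumptions made for illustration, and real FREQ handles many more CHAT conventions than this.

```python
from collections import Counter

# Symbols and prefixes that the slides say FREQ ignores by default.
IGNORED_WORDS = {"xxx", "www"}
IGNORED_PREFIXES = ("0", "&", "+", "-", "#")


def freq_like_count(chat_lines, speaker=None):
    """Count word tokens on main (speaker) tiers only.

    Hypothetical helper: header lines (@) and dependent tiers (%) are skipped,
    which mirrors the default described in the text.
    """
    counts = Counter()
    for line in chat_lines:
        if not line.startswith("*"):
            continue
        code, _, text = line.partition(":")
        if speaker is not None and code != "*" + speaker:
            continue
        for word in text.split():
            word = word.lower()  # assumption: counts are case-insensitive
            if word in IGNORED_WORDS or word.startswith(IGNORED_PREFIXES):
                continue
            if not any(ch.isalnum() for ch in word):
                continue  # bare punctuation such as "."
            counts[word] += 1
    return counts


def type_token_ratio(counts):
    """Number of distinct words (types) divided by total words (tokens)."""
    tokens = sum(counts.values())
    return len(counts) / tokens if tokens else 0.0


# Invented sample data in CHAT-like layout.
sample = [
    "@Begin",
    "*CHI:\toop I spilled it .",
    "%mor:\tint|oop pro|I v|spill-PAST pro|it .",
    "*MOT:\txxx you spilled it &um again .",
    "@End",
]

counts = freq_like_count(sample)
ttr = type_token_ratio(counts)
for word, n in sorted(counts.items()):
    print(n, word)
print("type/token ratio:", ttr)
```

Passing speaker="CHI" restricts the count to one speaker, which is roughly what the +t*CHI switch does in the commands shown on the following slides.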

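Example of a simple search in FREQ

In the Commands window we type a command line:

freq +t*XXX +f *.cha

The meaning of each part:
freq: the programme to run
+t*XXX: the speaker tier to analyse
+f: send the output to a file
*.cha: the files to analyse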

General options

When FREQ is run on a single file, output can be directed to an output file by using the +f option:

freq +f sample.cha

This results in the file sample.frq

General options

To produce output for a group of files rather than a single file, with the output for each file stored separately:

freq +f *.cha

To specify that the frequency analysis for each of these files be computed separately but stored in a single file, use the redirect symbol (>) and specify the name of the output file:

freq *.cha > freq.all

General options

You can analyse one or more specific speakers instead of all of them:

freq +t*LAU +f *.cha

NOTE: the spaces between switches are IMPORTANT.

General options

The +u option treats a whole group of files as a single database:

freq +u +f *.cha

Other useful switches:
+t*: the analysis is done only for the speaker you specify
+o: orders the output by frequency
+d4: outputs only the type/token ratio

freq +t*LAU +f +o *.cha

Example

You may want to look at a specific item for a specific speaker. The +s switch looks only for that item:

freq +t*LAU +sde +f *.cha > freq.all

> freq.all: the frequency analysis is done separately for each file but stored in a single file.

freq +t*JUL +sde +u +f *.cha > freq.all

+u: treats the whole group of files as a single database.

Be aware of the spaces.

Getting a list of frequencies

Studying Lexical Groups using the +s@file switch

The programme can be used to study the development and use of particular lexical groups in child and bilingual data. Examples:
To study how children use personal pronouns between the ages of 2 and 3 years, a frequency count of these forms would be helpful.
To study any lexical group among different bilingual speakers (late bilinguals versus early bilinguals).
Other lexical groups that might be interesting to track could be the set of all conjunctions, all prepositions, all morality words, names of foods, and so on.

Getting a list of frequencies

Put all the words you want to track into a text file, one word on each line by itself.

Save it as a .cut file.

Use the +s switch with the name of the file preceded by the @ sign:

freq +s@xxx.cut +f sample.cha

This command would conduct a frequency analysis on all the articles that you have put in the file called xxx.cut.

The file looks like this:
a
the
an

Other Searches in Bilingual Corpora

FREQ can also be used with the +l switch to track a variety of other multilingual patterns.

FREQ count of all Spanish-only words on the speaker tier:
freq +l +s*@s:spa *.cha

FREQ count of English stems with Spanish suffixes:
freq +l +s*@s:eng+spa *.cha

FREQ count of Spanish-only noun (n|*) words on the %mor tier with their corresponding words on the speaker tier:
freq +l +s*@s:spa *.cha +s@|-n +d7

FREQ count of Spanish-only noun (n|*) words on the %mor tier:
freq +l +s*@s:spa *.cha +s@|-n

FREQ count of all Spanish-only words on the %mor tier:
freq +l +s*@s:spa *.cha +t%mor

To count tiers with the [- spa] pre-code only:
freq +l +s"[- spa]" *.cha

To count only the English tiers:
freq +l +s"[- eng]" *.cha

The language pre-code has a space in it, so when you specify it you must include the space and use quotes.

Exercise

Create a list of prepositions and name the file preposition.cut

Then run the programme on a transcription file:

freq +s@preposition.cut sample.cha

It will look at the main tiers.

We could also restrict the analysis to one speaker:

freq +s@preposition.cut +t*NAME +f sample.cha

Using Wildcards with FREQ

Some of the most powerful uses of FREQ

Wildcards are particularly useful when you want to analyze the frequencies for various codes that you have entered into coding lines.

Example

One line of Hungarian data in sample2.cha has been coded on the %mor line for syntactic role and part of speech, as described in the CHAT manual. It includes these codes:

N:A|duck-ACC, N:I|plane-ACC, N:I|grape-ALL, and N:A|baby-ALL,

where the suffixes mark accusative and illative cases and N:A and N:I indicate animate and inanimate nouns.

If you want to obtain a frequency count of all the animate nouns (N:A) that occur in this file, use this command line:

freq +t%mor +s"N:A|*" sample2.cha

The output of this command will be:

1 n:a|baby-all
1 n:a|ball-acc
1 n:a|duck-acc

Example

To tabulate codes that require a taxonomic hierarchy:
Phrases: verbal (%vrb), nominal (%nom)
Verbal: tense, person
Tense: present (pre), past (pas)
Person: 1st (1), 2nd (2), 3rd (3)
(1): singular (sin), plural (plu)
(2): singular (sin), plural (plu)
(3): singular (sin), plural (plu)

I buy the house
%vrb: $PRE:1:SIN

Researchers have also made extensive use of FREQ to tabulate speech act and interactional codes. Often such codes are constructed using a taxonomic hierarchy. For example, a code like $NIA:RP:NV has a three-level hierarchy. In the INCA-A system discussed in the chapter on speech act coding in the CHAT manual, the first level codes the interchange type; the second level codes the speech act or illocutionary force type; and the third level codes the nature of the communicative channel.

As in the case of the morphological example cited earlier, we could use wildcards in the +s string to analyze at different levels. The following examples show what the different wildcards will produce when analyzing the %vrb tier.

Example

The basic command here is:

freq +s"$*" +t%vrb sample.cha

String        Output
+s$*          frequencies of all the three-level codes on the %vrb tier
+s$*:%        frequencies of the tenses
+s$%:*:%      frequencies of the persons
+s$PRE:*:%    frequencies of persons within the PRE tense category
+s$*:3:%      frequencies of the tenses of verbs that are in the 3rd person
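The idea behind these wildcard strings, where one wildcard keeps a level and the other erases it, can be sketched in Python as a tally that keeps only chosen positions of a colon-separated code. This is an illustration of the counting logic, not of CLAN's pattern matcher; the function name and the list of codes are invented for the example.

```python
from collections import Counter


def tally_levels(codes, keep):
    """Tally hierarchical codes, keeping only the level positions in `keep`.

    Levels are numbered from 0. For the hypothetical %vrb scheme used here,
    0 is tense, 1 is person, and 2 is number.
    """
    counts = Counter()
    for code in codes:
        levels = code.lstrip("$").split(":")
        kept = [level for i, level in enumerate(levels) if i in keep]
        counts["$" + ":".join(kept)] += 1
    return counts


# Invented codes following the tense:person:number hierarchy.
codes = ["$PRE:1:SIN", "$PRE:3:SIN", "$PAS:3:PLU", "$PRE:1:PLU"]

full = tally_levels(codes, keep={0, 1, 2})  # every three-level code
by_tense = tally_levels(codes, keep={0})    # first level only
by_person = tally_levels(codes, keep={1})   # second level only

print(by_tense)
print(by_person)
```

Keeping level 0 alone corresponds in spirit to a string that matches the first level and erases the rest, and keeping level 1 alone to one that erases the first and third levels.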

Example

If some of the codes have only two levels rather than the complete set of three levels, you need to use an additional % sign in the +s switch.

+s"$%:*:%%"

As another example of how to use wildcards in FREQ, consider the task of counting all the utterances from the various different speakers in a file. In this case, you count the three-letter header codes at the beginnings of utterances. To do this, you need the +y switch in order to make sure FREQ sees these headers. The command is:

freq +y +s\**: *.cha

Limiting in FREQ

An important analytic technique available in CLAN is the process of limiting, which allows you to focus your analysis on the part of your data files that is relevant by excluding all other sections. Limiting is based on use of the +s, +t, and +z switches. Limiting is available in most of the CLAN string search programs, but cannot be done within special-purpose programs such as CHSTRING or CHECK.

1. Limiting by including or excluding dependent tiers. Limiting can be used to select out particular dependent tiers. By using the +t and -t options, you can choose to include certain dependent tiers and ignore others. For example, if you select a particular main speaker tier, you will be able to choose the dependent tiers of only that particular speaker. Each type of tier has to be specifically selected by the user, otherwise the programs follow their default conditions for selecting tiers.

2. Limiting by including or excluding main tiers. When the -t* option is combined with a switch like +t*MOT, limiting first narrows the search to the utterances by MOT and then further excludes the main lines spoken by MOT. This switch functions in a different way from -t*CHI, which will simply exclude all of the utterances of CHI and the associated dependent tiers.

3. Limiting by including or excluding sequential regions of lines or words. The next level of limiting is performed when the +z option is used. At this level only the specified data region is chosen out of all the selected tiers.

Limiting in FREQ

The +s/-s options limit the data that is passed on to subsequent programs. Suppose you want to create a frequency count of all variations of the $PRE codes found on the %vrb dependent tiers of *CHL only, in the first 20 utterances.

freq +t*CHL +t%vrb +s"$PRE*" -t* +z20u sample.cha

The +t*CHL switch tells the program to select the main and dependent tiers associated only with the speaker *CHL.

The +t%vrb switch tells the program to further narrow the selection. It limits the analysis to the %vrb dependent tiers and the *CHL main speaker tiers.

The -t* option tells the program to eliminate data found on the main speaker tier for CHL from the analysis.

The +s option tells the program to eliminate all the words that do not match the $PRE* string from the analysis. Quotes are needed for this particular +s switch in order to guarantee correct interpretation of the asterisk.

The +z20u switch specifies the number of utterances you want to analyse, here the first 20.

Limiting by string inclusion and exclusion. The +s/-s options limit the data that is passed on to subsequent programs. Here is an example of the combined use of the first four limiting techniques. There are two speakers, *CHI and *MOT, in sample.cha. Suppose you want to create a frequency count of all variations of the $INI codes found on the %spa dependent tiers of *CHI only in the first 20 utterances. This analysis is accomplished by using this command:

freq +t*CHI +t%spa +s"$INI*" -t* +z20u sample.cha

The +t*CHI switch tells the program to select the main and dependent tiers associated only with the speaker *CHI. The +t%spa tells the program to further narrow the selection. It limits the analysis to the %spa dependent tiers and the *CHI main speaker tiers. The -t* option signals the program to eliminate data found on the main speaker tier for CHI from the analysis. The +s option tells the program to eliminate all the words that do not match the $INI* string from the analysis. Quotes are needed for this particular +s switch in order to guarantee correct interpretation of the asterisk. In general, it is safest to always enclose the +s search string in quotes.
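The combination of limiting steps described above (one speaker, one dependent tier, a code pattern, and a cap on the number of utterances) can be pictured with the following Python sketch. It is only an approximation under stated assumptions: the function name and sample lines are hypothetical, the utterance cap is counted over the selected speaker's utterances, and matching uses Python's fnmatch rather than CLAN's own wildcard rules.

```python
import fnmatch
from collections import Counter


def limited_code_count(chat_lines, speaker, dep_tier, pattern, max_utterances=None):
    """Count items on one dependent tier of one speaker that match `pattern`.

    Main-tier words are never counted, and only the first `max_utterances`
    utterances of the selected speaker are examined.
    """
    counts = Counter()
    utterances_seen = 0
    in_selected = False
    for line in chat_lines:
        if line.startswith("*"):
            in_selected = False
            if line.startswith("*" + speaker + ":"):
                if max_utterances is not None and utterances_seen >= max_utterances:
                    break
                utterances_seen += 1
                in_selected = True
        elif in_selected and line.startswith("%" + dep_tier + ":"):
            for item in line.split(":", 1)[1].split():
                if fnmatch.fnmatchcase(item.upper(), pattern.upper()):
                    counts[item] += 1
    return counts


# Invented sample data using the %vrb coding scheme from the slides.
lines = [
    "*CHL:\tI buy the house .",
    "%vrb:\t$PRE:1:SIN",
    "*MOT:\tshe bought it .",
    "%vrb:\t$PAS:3:SIN",
    "*CHL:\twe buy bread .",
    "%vrb:\t$PRE:1:PLU",
    "*CHL:\tthey bought it .",
    "%vrb:\t$PAS:3:PLU",
]

result = limited_code_count(lines, "CHL", "vrb", "$PRE*", max_utterances=2)
print(result)
```

Each argument plays the role of one switch in the command: the speaker for +t*CHL, the dependent tier for +t%vrb, the pattern for +s, and the cap for +z.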

STATFREQ

Produces a summary of word or code frequencies across a set of files.

This summary can be sent on as input to statistical analysis in programs such as Excel. It is an excellent tool for gathering data and having it ready for analysis.

Limitation: you can only look at one speaker at a time.

You can use the ID or *XXX (name abbreviation).

You get a summary of codes and the details of the speaker ID.

Steps:

1. Assign an @ID header line to each speaker:

@ID:es, en|bangor3|IRI||male|||Adult||

2. Run FREQ over the files using +t* and the speaker code.
+d2: produces a temporary file called stat.out
+t%: use it if you want to analyse a coding tier

freq +d2 +t*CHL -t* +t%vrb herring1.cha

3. The program will then tell you what to type. Add +f to generate the output in a file.

4. Then go to Excel: Open > Delimited > Space only > Finish (transposing is optional).

COMBO

Provides ways of composing Boolean search strings to match patterns of letters, words, or groups of words in the data files.

Particularly important for researchers who are interested in syntactic analysis.

The search strings are specified with either the +s/-s option or in a separate file.

The +s switch is obligatory.

When learning to use COMBO, the trickiest part is learning how to specify the correct search strings. COMBO uses regular expressions to define the search pattern.

Examples of searches

combo +swant^to *.cha
combo +swant^*^to *.cha
combo +sshe+to *.cha
combo +sneat^!toy* *.cha (neat when not directly followed by the words toy or toys)
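As a rough picture of what the first and last of these searches look for, the Python sketch below finds a word that is immediately followed, or not followed, by another word pattern. It is a simplification and not COMBO's search engine: the function name and the sentence are invented, and how COMBO treats a target word at the very end of an utterance is not modelled here.

```python
import fnmatch


def find_adjacent(words, first, second, negate_second=False):
    """Return positions where `first` is immediately followed by a word that
    matches `second` (or does not match it, when negate_second is True).

    Only positions that have a following word are examined.
    """
    hits = []
    for i in range(len(words) - 1):
        if not fnmatch.fnmatchcase(words[i], first):
            continue
        matches = fnmatch.fnmatchcase(words[i + 1], second)
        if matches != negate_second:
            hits.append(i)
    return hits


# Invented utterance.
utterance = "I want to see the neat toys and the neat car".split()

want_to = find_adjacent(utterance, "want", "to")
neat_not_toy = find_adjacent(utterance, "neat", "toy*", negate_second=True)

print(want_to)       # positions of "want" directly followed by "to"
print(neat_not_toy)  # positions of "neat" not directly followed by toy/toys
```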

Referring to Files in Search Strings

Inside the +s switch, one can include reference to one, two, or even more groups of words that are listed in separate files. For example, you can look for combinations of prepositions with articles by using this switch:

+s@prep.cut^@art.cut

To use this form, you first need to create a file of prepositions called prep.cut, with one preposition on each line, and a file of articles called art.cut, with one article on each line (cut files).

combo +s@prep.cut^@art.cut herring1.cha
combo [email protected]^do herring1.cha

By maintaining files of words for different parts of speech or different semantic fields, you can use COMBO to achieve a wide variety of syntactic and semantic analyses. Some suggestions for words to be grouped into files are given in the chapter of the CHAT manual on word lists. Some particularly easy lists to create would be those including all the modal verbs, all the articles, or all the prepositions. When building these lists, remember the possible existence of dialect and spelling variations such as dat for that.

Combo

There are more complex search options available:
Cross-tier Combo
Tracking Final Words
Tracking Initial Words
Limiting Combo
Adding Codes with COMBO

Cross-tier Combo

Particular dependent tiers can be included or excluded by using the +t option immediately followed by the tier code. By default, COMBO excludes the header and dependent code tiers from the search and output. However, when the dependent code tiers are included by using the +t option, they are combined with their speaker tiers into clusters. For example, if the search expression is the^*^kitten, the match would be found even if the is on the speaker tier and kitten is on one of the speaker's associated dependent tiers. This feature is useful if one wants to select for analyses only speaker tiers that contain specific word(s) on the main tier and some specific codes on the dependent code tier. For example, if one wants to produce a frequency count of the words want and to when either one of them is coded as an imitation on the %spa line, or neat when it is a continuation on the %spa line, the following two commands could be used:

combo +s(want^to^*^%spa:^*^$INI*)+(neat^*^%spa:^*^$CON*) +t%spa +f +d sample.cha
freq +swant +sto +sneat sample.cmb

In this example, the +s option specifies that the words want, to, and $INI may occur in any order on the selected tiers. The +t%spa option must be added in order to allow the program to look at the %spa tier when searching for a match. The +d option is used to specify that the information produced by the program, such as file name, line number and exact position of words on the tier, should be excluded from the output. This way the output is in a legal CHAT format and can be used as an input to another CLAN program, here FREQ.

KWAL

Outputs utterances that match certain user-specified search words.

The +s option allows you to search for either a single word or a whole group of words stored in a file.

It is possible to specify as many +s options on the command line as you like.

KWAL

Like COMBO, the KWAL program works not on lines, but on clusters. A cluster is a combination of the main tier and the selected dependent tiers relating to that line.

Each cluster is searched independently for the given keyword. The program lists all keywords that are found in a given cluster tier.

kwal +swhat sample.cha

Unique Options

+w: enlarges the context in which the keyword was found. The +w and -w options let you specify how many clusters after and before the target cluster are to be included in the output.

These options must be immediately followed by a number.

kwal +swant +w3 -w3 sample.cha
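The notion of a cluster and of a context window around a hit can be sketched in Python as follows. This is an illustrative approximation, not KWAL itself: the function names and sample lines are invented, and for simplicity the keyword is looked up on the main tier only, whereas the text says KWAL searches the whole cluster.

```python
def build_clusters(chat_lines):
    """Group each main tier (*) with the dependent tiers (%) that follow it."""
    clusters = []
    for line in chat_lines:
        if line.startswith("*"):
            clusters.append([line])
        elif line.startswith("%") and clusters:
            clusters[-1].append(line)
    return clusters


def kwal_like(chat_lines, keyword, after=0, before=0):
    """Return, for every cluster containing `keyword`, that cluster plus
    `before` clusters of preceding and `after` clusters of following context.
    """
    clusters = build_clusters(chat_lines)
    results = []
    for i, cluster in enumerate(clusters):
        main_words = cluster[0].split(":", 1)[1].lower().split()
        if keyword.lower() in main_words:
            start = max(0, i - before)
            end = min(len(clusters), i + after + 1)
            results.append(clusters[start:end])
    return results


# Invented sample data.
lines = [
    "@Begin",
    "*MOT:\twhat is that ?",
    "*CHI:\ta ball .",
    "%gpx:\tpoints to the ball",
    "*MOT:\tdo you want it ?",
    "*CHI:\tyes .",
    "@End",
]

hits = kwal_like(lines, "want", after=1, before=1)
for context in hits:
    for cluster in context:
        print("\n".join(cluster))
```

Here after=1 and before=1 play the role of +w1 and -w1.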

KWAL: exercises

kwal +sla@s:spa +w2 -w2 herring1.cha
kwal +l +s*@s:spa +w1 -w1 herring1.cha

CHECK

To run CHECK from inside the editor, type Esc-L. CHECK runs through the file looking for errors. It highlights each error and tells you what the nature of the error is. You need to fix the error in order to allow CHECK to move on.

The other way of running CHECK is to issue the command from the Commands window. This is the best method for large collections of files. If you want to examine several directories, use the +re option to run CHECK recursively across directories. If you send the output of CHECK to the CLAN Output window, you can locate errors in that window and then triple-click on the file name, and CLAN will take you right to the problem that needs to be fixed. This is an excellent way of working when you have many files and only a few errors.

MLU

The MLU program computes the mean length of utterance, which is the ratio of morphemes to utterances.

mlu +t*XXX +f +z50u *.cha

+z50u: the analysis is done on the first 50 utterances.

MLT (Mean Length of Turn)

MLT includes the symbols xxx and yyy in all counts. Thus, utterances that consist of only unintelligible vocal material still constitute turns, as do nonverbal turns indicated by the postcode [+ trn], as illustrated in the following example:

*CHI: 0. [+ trn]
%gpx: CHI points to picture in book

We can use a single command to run our complete analysis and put all the results into a single file:

mlt *.cha > allmlt.cex

In this case, all the speakers will be analyzed.

MLT (Mean Length of Turn): Output

MLT for Speaker: *MOT:
MLT (xxx and yyy are INCLUDED in the utterance and morpheme counts):
Number of: utterances = 331, turns = 227, words = 1398
Ratio of words over turns = 6.159
Ratio of utterances over turns = 1.458
Ratio of words over utterances = 4.224

GEM

Allows you to mark particular parts of a transcript for further analysis.

Separate header lines are used to mark the beginning and end of each interesting passage you want included in your gem output.

These header tiers may contain tags that will affect whether a given section is selected or excluded in the output. If no particular tag information is being coded, you should use the header form @bg with no colon. If you are using tags, you must use the colon, followed by a tab. If you do not follow these rules, check will complain.

By default, GEM looks for the beginning marker @bg without tags and the ending marker @eg, as in this example command:

gem sample.cha

GEM

If you want to be more selective in your retrieval of gems, you need to add code words or tags to both the @bg: and @eg: lines:

@bg: freetalk
@eg: freetalk

For example, you might wish to mark all cases of verbal interchange during the activity of reading. To do this, you must place the word reading on the @bg: line just before each reading episode, as well as on the @eg: line just after each reading episode. Then you can use the +sreading switch to retrieve only this type of gem, as in this example:

gem +sreading sample2.cha
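The retrieval described above can be pictured with a short Python sketch that collects the lines between matching @bg and @eg markers. It is a simplified illustration, not the GEM program: the function name and sample lines are invented, nested or overlapping gems are not handled, and a tag is matched as a plain substring of the marker's label.

```python
def extract_gems(chat_lines, tag=None):
    """Collect the lines found between @bg and @eg markers.

    With tag=None only untagged @bg markers are selected, which follows the
    default described in the text; otherwise the @bg label must contain `tag`.
    """
    gems = []
    current = []
    inside = False
    for line in chat_lines:
        lowered = line.lower()
        if lowered.startswith("@bg"):
            label = line.partition(":")[2].strip().lower()
            inside = (label == "") if tag is None else (tag.lower() in label)
            current = []
        elif lowered.startswith("@eg"):
            if inside:
                gems.append(current)
            inside = False
        elif inside:
            current.append(line)
    return gems


# Invented sample data with two tagged gems.
lines = [
    "@bg:\treading",
    "*MOT:\tlook at the truck .",
    "*CHI:\ttruck .",
    "@eg:\treading",
    "*MOT:\ttime for lunch .",
    "@bg:\tfreetalk",
    "*CHI:\tI want juice .",
    "@eg:\tfreetalk",
]

reading_gems = extract_gems(lines, "reading")
for gem in reading_gems:
    print("\n".join(gem))
```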

GEM

GEM can also be used to retrieve responses to particular questions or particular stimuli used in an elicited production task.

The @bg entry for this header can show the number and description of the stimulus.

@bg: Picture 53, truck

One can then search for all of the responses to picture 53 by using the +s"53" switch in GEM.

GEMFREQ

GEMFREQ combines the basic features of FREQ and GEM. It analyzes portions of the transcript that are marked off with @bg and @eg markers. For example, gems can mark off a section of book reading activity with @bg: bookreading and @eg: bookreading.

Once these markers are entered, you can then run GEMFREQ to retrieve a basic FREQ-type output for each of the various gem types you have marked.

GEMFREQ

For example, you can run this command:

gemfreq +sarriving sample2.cha

and you would get the following output:

GEMFREQ +sarriving sample2.cha
Wed May 12 15:54:35 1999
GEMFREQ (04-May-99) is conducting analyses on:
ALL speaker tiers
and ONLY header tiers matching: @BG:; @EG:;
****************************************
From file 2 tiers in gem " arriving":
1 are
1 fine
1 how
1 you

MOR

It tags each word with morphological information: its part of speech plus a morphological analysis of affixes.

It is necessary to construct a separate MOR grammar for each language. After analysis with MOR, users can then use the POST program to disambiguate the %mor line.

Running MOR involves a one-line command.

The initial morphological tagging of the CHILDES database relies on the application of the MOR program.

Running MOR on a CHAT file is easy. In the simplest case, it involves nothing much more than a one-line command. However, before discussing the mechanics of MOR, let us take a look at what it produces. To give an example of the results of a MOR analysis for English, consider this sentence from eve15.cha in Roger Brown's corpus for Eve.

*CHI: oop I spilled it .
%mor: int|oop pro|I v|spill-PAST pro|it .

Here, the main line gives the child's production and the %mor line gives the part of speech for each word, along with the morphological analysis of affixes, such as the past tense mark (-PAST) on the verb. The %mor lines in these files were not created by hand. To produce them, we ran the MOR program, using the MOR grammar for English, which can be downloaded from http://childes.psy.cmu.edu/morgrams/. The command for running MOR is nothing more in this case than mor *.cha. After running MOR, the file looks like this:

*CHI: oop I spilled it .
%mor: int|oop pro|I part|spill-PERF^v|spill-PAST pro|it .


Studying %mor codes using the +s@ switch

The +s@ switch can be used in a very different way to track stems, prefixes, suffixes, and so on, on the %mor line. You can get information about this by typing:

freq +s@

You will see that the +s@ switch works by using this syntax for the %mor line:

#  prefix marker
|  part-of-speech marker
r  stem of the word marker
-  suffix marker
&  nonconcatenated morpheme marker
=  English translation for the stem marker
@  replacement word preceding [: ...] code marker
*  error code inside [* ...] code marker

Examples with FREQ in MOR

Here are some example commands:

Find all stems and erase all other markers:
+s"@r-*,o-%"
Find all stems of all "adv" and erase all other markers:
+s"@r*,|adv,o%"
Find all forms of the verb "be":
+s"@r-be"
Find all stems and parts-of-speech and erase other markers:
+s"@r*,|*,o%"

Other Searches in Bilingual Corpora

When searching on a language pre-code, include the space and use quotes:

+s"[- spa]" +s"[- eng]"

Note: this information can be presented without including it in the practical exercises.

Studying Multilingual Mixtures using the +s@s switch

The +s@s switch allows you to search for words with specific language sources in multilingual corpora, for example:

swallowni@s:eng+hun

for a word with an English stem and a Hungarian suffix. The use of this function is explained if you type this command:
    freq +s@s
The program then lists the following information:
+s@s Followed by search pattern
    r   word
    &   stem language marker
    +   suffix language marker
    $   part-of-speech marker
    o   all other elements not specified by user
  followed by - or + and/or the following
    *   find any match
    %   erase any match
Examples (to be tried out in practice to check that they work):
Find all words with Spanish stems:
    +s"@r-*,&-spa"
Find all words with English stems and erase all other markers:
    +s"@r-*,&-en,o-%"
Find all words with Spanish stems and an English suffix:
    +s"@r*,&spa,+en"
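The language tags that this switch searches on can be illustrated outside CLAN. The following Python sketch is not part of CLAN: it only assumes the tag layout shown above (word, then @s:, then the stem language, optionally followed by + and the suffix language). The function names are hypothetical.

```python
def parse_language_tag(token):
    """Split a token such as 'swallowni@s:eng+hun' into its parts.

    Returns None for tokens that carry no @s: tag.
    """
    word, separator, languages = token.partition("@s:")
    if not separator:
        return None
    stem_language, _, suffix_language = languages.partition("+")
    return {
        "word": word,
        "stem_language": stem_language,
        # None when the tag names only one language, as in 'libro@s:spa'
        "suffix_language": suffix_language or None,
    }


def words_with_stem_language(tokens, language):
    """Keep only the tokens whose stem is tagged with the given language."""
    selected = []
    for token in tokens:
        tag = parse_language_tag(token)
        if tag is not None and tag["stem_language"] == language:
            selected.append(token)
    return selected


utterance = "the guy bought el@s:spa libro@s:spa".split()
spanish_words = words_with_stem_language(utterance, "spa")
```

In this sketch, an untagged word such as "guy" is simply skipped, which is a simplification: in CLAN the +l switch first adds an explicit language tag to every word.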

Using Wildcards with FREQ
Some of the most powerful uses of FREQ involve wildcards.

Wildcards are particularly useful when you want to analyze the frequencies for various codes that you have entered into coding lines.
Example
One line of Hungarian data in sample2.cha has been coded on the %mor line for syntactic role and part of speech, as described in the CHAT manual. It includes these codes:

N:A|duck-ACC, N:I|plane-ACC, N:I|grape-ALL, and N:A|baby-ALL,

where the suffixes mark accusative and illative cases, and N:A and N:I indicate animate and inanimate nouns.

If you want to obtain a frequency count of all the animate nouns (N:A) that occur in this file, use this command line:
    freq +t%mor +s"N:A|*" sample2.cha

The output of this command will be:
    1 n:a|baby-all
    1 n:a|ball-acc
    1 n:a|duck-acc
Note that material after the +s switch is enclosed in double quotation marks to guarantee that wildcards will be correctly interpreted.
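To make the matching behaviour concrete, here is a minimal Python sketch that imitates this kind of search. It is not how FREQ is implemented. It assumes that * stands for any run of characters, that _ stands for exactly one character, and that matching ignores case. The function names are hypothetical, and the list of codes is taken from the outputs shown for the sample file.

```python
import re
from collections import Counter

# The five %mor codes that appear in the sample outputs.
CODES = [
    "n:a|baby-all",
    "n:a|ball-acc",
    "n:a|duck-acc",
    "n:i|grape-all",
    "n:i|plane-acc",
]


def wildcard_to_regex(pattern):
    """Translate a search string with * and _ into a regular expression."""
    parts = []
    for character in pattern:
        if character == "*":
            parts.append(".*")   # any run of characters
        elif character == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(character))
    return re.compile("".join(parts), re.IGNORECASE)


def count_matches(pattern, codes):
    """Count every code that matches the whole search string."""
    regex = wildcard_to_regex(pattern)
    return Counter(code for code in codes if regex.fullmatch(code))


animate_nouns = count_matches("N:A|*", CODES)
accusatives = count_matches("*-acc", CODES)
```

Under these assumptions, animate_nouns holds the three n:a codes with a count of 1 each, in line with the output listed above.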

The next examples give additional search strings with asterisks and the output they will yield when run on the sample file. Note that what may appear to be a single underline in the second example is actually two underline characters.

String        Output
*-acc         1 n:a|ball-acc
              1 n:a|duck-acc
              1 n:i|plane-acc
*-a__         1 n:a|baby-all
              1 n:a|ball-acc
              1 n:a|duck-acc
              1 n:i|grape-all
              1 n:i|plane-acc
N:*|*-all     1 N:A|baby-all
              1 N:I|grape-all

These examples show the use of the asterisk as a wildcard. When the asterisk is used, FREQ gives a full output of each of the specific code types that match. If you do not want to see the specific instances of the matches, you can use the percentage wildcard, as in the following examples:

String        Output
N:A|%         3 N:A|
%-ACC         3 -ACC
%-A__         3 -ACC
              2 -ALL
N:%|%-ACC     3 N:|-ACC
N:%|%         5 N:|

It is also possible to combine the use of the two types of wildcards, as in these examples:

String        Output
N:%|*-ACC     1 N:|ball-acc
              1 N:|duck-acc
              1 N:|plane-acc
N:*|%         3 N:A|
              2 N:I|

Example
To tabulate codes that require a taxonomic hierarchy.
Phrases: verbal (vrb), nominal (nom)
Verbal: tense, person
    Tense: present (pre), past (pas)
    Person: 1st (1), 2nd (2), 3rd (3)
        (1): singular (sin), plural (plu)
        (2): singular (sin), plural (plu)
        (3): singular (sin), plural (plu)
Nominal: with determiner, without determiner
    With determiner (wde)
    Without determiner (wtd)

$VRB:PRE:1:SIN $NOM:WDE

We could use wildcards in the +s string to analyze at different levels. The following examples show what the different wildcards will produce when analyzing the %spa tier.

Researchers have also made extensive use of FREQ to tabulate speech act and interactional codes. Often such codes are constructed using a taxonomic hierarchy. For example, a code like $NIA:RP:NV has a three-level hierarchy. In the INCA-A system discussed in the chapter on speech act coding in the CHAT manual, the first level codes the interchange type; the second level codes the speech act or illocutionary force type; and the third level codes the nature of the communicative channel. As in the case of the morphological example cited earlier, one could use wildcards in the +s string to analyze at different levels. The following examples show what the different wildcards will produce when analyzing the %spa tier.

(To do: adapt these codes to an example.)
The basic command here is:
    freq +s"$*" +t%spa sample.cha

String        Output
$*            frequencies of all the three-level codes in the %spa tier
$*:%          frequencies of the interchange types
$%:*:%        frequencies of the speech act codes
$RES:*:%      frequencies of speech acts within the RES category
$*:sel:%      frequencies of the interchange types that have SEL speech acts

If some of the codes have only two levels rather than the complete set of three levels, you need to use an additional % sign in the +s switch.

+s"$%:*:%%"

This switch will find all speech act codes, including both those with the third level coded and those with only two levels coded.
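The level-by-level tabulation of hierarchical codes can be sketched in Python as follows. This is an illustration under stated assumptions, not FREQ's actual algorithm: * is taken to match the shortest possible run of characters and keep it, % to match the same way but erase what it matched, and _ to match one character. The function names are hypothetical, and the example codes are built from the phrase taxonomy given earlier ($VRB:PRE:1:SIN, $NOM:WDE and their siblings). The additional % needed for codes with missing levels is not modelled here.

```python
import re
from collections import Counter


def compile_search(pattern):
    """Turn a search string into a regex plus a keep/erase flag per group."""
    pieces, actions = [], []
    for character in pattern:
        if character == "*":
            pieces.append("(.*?)")   # shortest run of characters, kept
            actions.append("keep")
        elif character == "%":
            pieces.append("(.*?)")   # shortest run of characters, erased
            actions.append("erase")
        elif character == "_":
            pieces.append("(.)")     # exactly one character, kept
            actions.append("keep")
        else:
            pieces.append("(" + re.escape(character) + ")")
            actions.append("keep")
    return re.compile("".join(pieces), re.IGNORECASE), actions


def tabulate(pattern, codes):
    """Count matching codes after erasing whatever % matched."""
    regex, actions = compile_search(pattern)
    counts = Counter()
    for code in codes:
        match = regex.fullmatch(code)
        if match:
            kept = "".join(
                text for text, action in zip(match.groups(), actions)
                if action == "keep"
            )
            counts[kept] += 1
    return counts


# Hypothetical tier contents built from the taxonomy above.
codes = ["$VRB:PRE:1:SIN", "$VRB:PRE:3:SIN", "$VRB:PAS:3:PLU",
         "$NOM:WDE", "$NOM:WTD"]

phrase_types = tabulate("$*:%", codes)       # first level only
verbal_tenses = tabulate("$VRB:*:%", codes)  # tense within verbal phrases
```

Under these assumptions the sketch also reproduces the earlier table row for N:%|%-ACC, which collapses every accusative noun code to N:|-ACC.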

As another example of how to use wildcards in FREQ, consider the task of counting all the utterances from the various different speakers in a file. In this case, you count the three-letter header codes at the beginnings of utterances. To do this, you need the +y switch in order to make sure FREQ sees these headers. The command is:
    freq +y +s\**: *.cha
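The idea behind this command, counting one header per utterance, can be sketched in Python. The sketch is not CLAN code; it assumes that each main tier starts with an asterisk, a three-character speaker code, and a colon, and that dependent tiers start with %. The *MOT line in the sample data is hypothetical.

```python
import re
from collections import Counter

# A main-tier header: asterisk, three-character speaker code, colon.
SPEAKER_HEADER = re.compile(r"^\*([A-Z0-9]{3}):")


def count_utterances_by_speaker(lines):
    """Count main tiers per speaker, ignoring dependent tiers such as %mor."""
    counts = Counter()
    for line in lines:
        match = SPEAKER_HEADER.match(line)
        if match:
            counts["*" + match.group(1) + ":"] += 1
    return counts


transcript = [
    "*CHI:\toop I spilled it .",
    "%mor:\tint|oop pro|I v|spill-PAST pro|it .",
    "*MOT:\twhat happened ?",   # hypothetical line for illustration
    "*CHI:\tI spilled it .",
]
utterance_counts = count_utterances_by_speaker(transcript)
```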

FREQ with the %mor line
The previous section illustrates the fact that searches for material on the %mor line can be difficult to compose. To help in this regard, we have composed a variant of FREQ searches that takes advantage of the syntax of the %mor tier. To see the available options here, you can type freq +s@ on the command line and you will see the following information:

+s@ Followed by file name or morpho-syntax
Morpho-syntax search pattern and markers
    #   prefix marker
    |   part-of-speech marker
    r   stem of the word marker
    -   suffix marker
    &   nonconcatenated morpheme marker
    =   English translation for the stem marker
    @   replacement word preceding [: ...] code marker
    *   error code inside [* ...] code marker
    o   all other elements not specified by user
  followed by - or + and/or the following
    *   find any match
    %   erase any match
    string   find "string"
Exception:
    ~   postclitic OR $ preclitic
  followed by - or + and
    *   keep words with preclitic or postclitic intact
    %   exclude the whole word if preclitic or postclitic found

For example:
    +t%mor -t* +s"@r-*,o-%"
        find all stems and erase all other markers
    +t%mor -t* +s"@r*,|adv,o%"
        find all stems of all "adv" and erase all other markers
    +t%mor -t* +s"@r-be"
        find all forms of "be" verb
    +t%mor -t* +s"@r*,|*,o%"
        find all stems and parts-of-speech and erase other markers
    +t%mor -t* +s"@r*,|*,-*,o%"
        find only stems and parts-of-speech that have suffixes and erase other markers
    +t%mor -t* +s"@r-*,|-*,-+*,o-%"
        find all stems, parts-of-speech and distinguish those with suffix and erase other markers
    +t%mor -t* -s@"-*" +s"@r*,|*,o%"
        find only stems and parts-of-speech that do not have suffixes and erase other markers

Studying Unique Words and Shared Words
With a few simple manipulations, FREQ can be used to study the extent to which words are shared between the parents and the child. For example, we may be interested in understanding the nature of words that are used by the child and not used by the mother as a way of understanding the ways in which the child's social and conceptual world is structured by forces outside of the immediate family. In order to isolate shared and unique words, you can go through three steps. To illustrate these steps, we will use the sample.cha file.

1. Run freq on the child's and the mother's utterances using these two commands:
    freq +d1 +t*MOT +f sample.cha
    freq +d1 +t*CHI +f sample.cha
The first command will produce a sample.frq.cex file with the mother's words and the second will produce a sample.fr0.cex file with the child's words.

2. Next you should run freq on the output files:
    freq +y +o +u sample.f*
The output of this command is a list of words with frequencies that are either 1 or 2. All words with frequencies of 2 are shared between the two files and all words with frequencies of 1 are unique to either the mother or the child.

3. In order to determine whether a word is unique to the child or the mother, you can run the previous command through a second filter that uses the COMBO program. Commands that automate this procedure are:
    freq +y +o +u sample.f* | combo +y +s"2" +d | freq +y +d1 >shared.frq
    freq +y +o +u *.frq
The first command has three parts. The first FREQ segment tags all shared words as having a frequency of 2 and all non-shared words as having a frequency of 1. The COMBO segment extracts the shared words. The second FREQ segment strips off the numbers and writes to a file. Then you compare this file with your other files from the mother using a variant of the command given in the second step. In the output from this final command, words with a frequency of 2 are shared and words with a frequency of 1 are unique to the mother. A parallel analysis can be conducted to determine the words unique to the child. This same procedure can be run on collections of files in which both speakers participate, as long as the speaker ID codes are consistent.
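The logic of the three steps can be sketched in Python with plain word lists. This is not a replacement for the CLAN commands; it only shows why a count of 2 means "shared" once each speaker's words have been reduced to one entry per word. The function name and the two word lists are hypothetical.

```python
from collections import Counter


def split_vocabulary(mother_words, child_words):
    """Separate shared words from words unique to each speaker."""
    # Step 1: one entry per word for each speaker (like freq +d1).
    mother_types = set(mother_words)
    child_types = set(child_words)

    # Step 2: count in how many of the two lists each word appears.
    appearances = Counter(list(mother_types) + list(child_types))
    shared = {word for word, count in appearances.items() if count == 2}

    # Step 3: compare the shared list with each speaker's own list.
    return {
        "shared": sorted(shared),
        "mother_only": sorted(mother_types - shared),
        "child_only": sorted(child_types - shared),
    }


# Hypothetical word lists for illustration.
mother = ["you", "spilled", "it", "again", "it"]
child = ["oop", "I", "spilled", "it"]
vocabulary = split_vocabulary(mother, child)
```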

9.8.14 Unique Options
+c  Find capitalized words only.
+d  Perform a particular level of data analysis. By default the output consists of all selected words found in the input data file(s) and their corresponding frequencies. The +d option can be used to change the output format. Try these commands:
    freq sample.cha +d0
    freq sample.cha +d1
    freq sample.cha +d2 +tCHI
Each of these three commands produces a different output.
+d0  When the +d0 option is used, the output provides a concordance with the frequencies of each word, the files and line numbers where each word occurs, and the text in the line that matches.
+d1  This option outputs each of the words found in the input data file(s) one word per line with no further information about frequency. Later this output could be used as a word list file for KWAL or COMBO programs to locate the context in which those words or codes are used.
+d2  With this option, the output is sent to a file in a very specific form that is useful for input to STATFREQ. This option also creates a stat.out file to keep track of multiple .frq.cex output files. You do not need to use the +f option with +d2, because this is assumed. Note that you must include a +t specification in order to tell the +d2 option which speaker to track for the STATFREQ analysis. You may also wish to use the +t@ID= method to select your target speakers, as in this example for the ne20 sample files from the New England corpus:
    freq +d2 +t@ID=*|Target_Child|* *.cha
+d3  This output is essentially the same as that for +d2, but with only the statistics on types, tokens, and the type-token ratio. This option also creates a stat.out file to keep track of multiple .frq.cex output files. Word frequencies are not placed into the output. You do not need to use the +f option with +d3, since this is assumed.
+d4  This switch allows you to output just the type-token information.
+d5  This switch will output all words you are searching for, including those that occur with zero frequency. This could happen, for example, if you use the +s switch to search for a specific word and that word does not occur in the transcript.
+d6  This switch allows you to search for a specific segment of a larger complex word, particularly on the %mor tier, and then see how that word appears in various morphological contexts.
+o  Normally, the output from FREQ is sorted alphabetically. This option can be used to sort the output in descending frequency. The +o1 level will sort to create a reverse concordance.
+xN  The +xN version of this switch allows you to include only utterances which are N words in length or longer. The -xN version allows you to include only utterances which are N words in length or shorter.

FREQ also uses several options that are shared with other commands. For a complete list of options for a command, type the name of the command followed by a carriage return in the Commands window. Information regarding the additional options shared across commands can be found in the chapter on Options.
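The quantities mentioned in these options (types, tokens, the type-token ratio, and the two sort orders) can be illustrated with a short Python sketch. It does not reproduce FREQ's output format or its rules for what counts as a word; it assumes simple case-insensitive counting. The function name and the word list are hypothetical.

```python
from collections import Counter


def type_token_summary(words):
    """Compute types, tokens, type-token ratio, and two sort orders."""
    counts = Counter(word.lower() for word in words)
    types = len(counts)             # number of different words
    tokens = sum(counts.values())   # total number of words
    ratio = types / tokens if tokens else 0.0
    return {
        "types": types,
        "tokens": tokens,
        "ttr": ratio,
        # Alphabetical order, as in the default output.
        "alphabetical": sorted(counts.items()),
        # Descending frequency, as with the +o option (ties alphabetical).
        "by_frequency": sorted(counts.items(), key=lambda item: (-item[1], item[0])),
    }


# Hypothetical words for illustration.
summary = type_token_summary(["oop", "I", "spilled", "it", "I", "spilled", "it", "it"])
```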

Studying Multilingual Mixtures using the +s@s switch
When you use this switch:

You must always also add the +l switch to FREQ to create explicit language tags on each word.

These tags are written to memory and can appear in the output, but the file on the disk is not changed.

Meaning                    Symbol
Immediately FOLLOWED by    ^
Inclusive OR               +
Logical NOT                !
Repeated character         *
Single character           _
Quoting                    \
