The Encyclopedia of Applied Linguistics || Corpus Analysis In Forensic Linguistics

Corpus Analysis in Forensic Linguistics

JANET COTTERILL

This article looks in two directions with respect to the use of corpora within forensic linguistics: first, the notion of small, genre-specific forensic corpora made up of both questioned or disputed texts and candidate texts; and second, the use of existing large corpora of language in a variety of specialized/individual genres which may assist the forensic linguist in his or her task. Each category of analysis is illustrated with contemporary, real-world examples, and the advantages and disadvantages of using a corpus approach are considered.

Definitions and Delineations of Forensic Linguistics

The growing discipline of forensic linguistics may be broadly defined as the study of language and law. Forensic linguistics is coupled with its sister discipline forensic phonetics, which, as the name suggests, deals with speaker and voice identification in data such as telephone calls, answerphone messages, and other recorded speech samples. Forensic linguistics has two principal foci of attention: the descriptive aspect and the investigative dimension.

Descriptive Forensic Linguistics

Within this domain, the linguistic characteristics of a variety of different genres and text types are described. This broadly encompasses the language of the legal process, from arrest to trial and beyond, and includes the following texts and registers:

• The police caution/Miranda and its equivalents, a text read aloud to suspects on arrest outlining rights and responsibilities (Shuy, 1997; Cotterill, 2000; Rock, 2007).

• The police interview (Heydon, 2005; Haworth, 2006).

• The language of civil and criminal trials, including examination-in-chief (direct) and cross-examination; opening statements and closing arguments; jury deliberation and verdicts (Stygall, 1994).

• Judicial language: Supreme Court judgments, reports, and appeals (Solan, 1993).

• Prison language (Mayr, 2004).

Within the written aspect of language and law a wide spectrum of text types is also analyzed:

• contracts/written laws (Tiersma, 1999),

• trademarks (Shuy, 2002),

• warning labels/signs (Dumas, 1990; Tiersma, 1999).

This descriptive aspect of the field is an obvious one for the development of genre and text-specific, specialized corpora, although in some countries, including the UK, there are restrictions placed on the availability of data from courtrooms, with many of the proceedings and resulting transcripts/recordings either difficult to obtain or nonexistent. Despite this, the descriptive area is undoubtedly one where corpus linguistics can prove extremely useful in exploring linguistic characteristics of legal genres and is likely to grow in the future. Perhaps indicative of this trend, the most recent book on courtroom and judicial language (Heffer, 2006) carries the subtitle A Corpus-Aided Linguistic Analysis of Legal–Lay Discourse.

The Encyclopedia of Applied Linguistics, Edited by Carol A. Chapelle. © 2013 Blackwell Publishing Ltd. Published 2013 by Blackwell Publishing Ltd. DOI: 10.1002/9781405198431.wbeal0238

Investigative Forensic Linguistics

The second area is perhaps the best known and longest established area of corpus analysis in forensic linguistics: This involves the analysis of texts as part of criminal investigations. Texts include the following:

• threat letters,

• suicide notes,

• fabricated confessions,

• blackmail/extortion letters,

• terrorist/bomb threats,

• ransom demands,

• e-mails,

• text messages.

The use of corpora in investigative forensic linguistics is of increasing importance. Not only is such a methodology useful for analyzing sets of texts collected by scenes-of-crime officers at the early crime investigation stage, but the resulting evidence benefits from a scientific presentation in court. This includes the use of graphs and charts, tables and statistics, which juries have been shown to find more convincing than testimony delivered in a qualitative form (Conley & O’Barr, 1990). In addition to clearly defined forensic contexts such as those detailed above, there are other, more peripheral areas where corpus linguistics is becoming an important method of investigation. As O’Keeffe, McCarthy, and Carter (2007, p. 20) attest, “authorship and plagiarism are growing concerns within forensic linguistics, for which corpora can prove a useful instrument of investigation.” The Joint Information Systems Committee (JISC) Electronic Plagiarism Detection project created TurnitinUK plagiarism detection software in 2000, for example, which is used widely by UK universities, colleges, and schools.

One of the major challenges within the forensic field is usually not present in the same way for corpus constructors of more general and widely available texts such as newspaper articles or casual conversation (e.g., the Bank of English, the Brown Corpus, or the British National Corpus). As O’Keeffe, McCarthy, and Carter (2007) point out, the construction of corpora and the notion of sampling can be crucial. As they put it, “any old collection of texts does not make a corpus.” Unfortunately, and frustratingly for the forensic linguist, any old collection of texts is precisely what is provided by either the police or defense solicitors, who are unaware of genre/register differences, variations in text size, and temporal factors, all of which may influence the potential of texts to be analyzed.

All the text types listed above pose a number of challenges for the analyst. The first concerns the length of the text(s) to be analyzed. In general, all of the above are typically short pieces of writing with very few extending beyond a single page or several pages. In fact, texts such as e-mails and in particular text messages frequently consist of fewer than ten words and as such represent a dilemma for the analyst. Without very much language to work on, it is difficult to establish authorship, either in terms of identification or exclusion of candidate authors. As Olsson (2004, p. 14) notes:

many forensic inquiries present too little data to justify a full-scale statistical study. Often there are no more than two candidates, as few as three attested texts and perhaps no more than one or two questioned texts. This is why most methods described as authorship detection do not work in the forensic context.

This kind of work also involves the notion of idiolect and uniqueness of expression (see Coulthard, 2004), both of which can very effectively involve the use of corpora. However, there are additional problems with work of this kind. First, the linguist has to attempt to deal with the thorny issue of genre. Work in general linguistics that has attempted to define the characteristics of particular genres (e.g., Biber, 1988, 1995; Stubbs, 1996) has resulted so far in only partial descriptions of stylistic characteristics associated with certain text types. Perhaps the most problematic genre of analysis in this area is that of text messaging, where samples of data are both brief and highly idiolectal.

The Use of Specialized Individual Corpora in the Authorship of Text Messages

In addition to the text types outlined above, an increasing number of crimes involve the analysis of small, specialized corpora of text messages or SMS. Here, very small features including the use of punctuation, capitalization, and abbreviation, as well as openings and closings to messages, contained in very short texts nevertheless appear to give a sense of the author’s idiolect or texting style. The brevity of such texts was a focus of the work of Eugene Winter, who attempted to establish the threshold of text size below which analysis is no longer feasible (Winter, 1996). As Coulthard discussed as early as 1993:

no one really knows how small a sample one can reliably work with, at what size significant irregularities begin to emerge, whether two samples by the same person can be treated as one larger sample and to what extent regularities vary across registers. (Coulthard, 1993, p. 93)

To illustrate this principle, he cites Stubbs (1996, pp. 81–100) who, comparing two short letters of an average of only 400 words each, was able to show repeated instances of common word usage. In addition to the issue of genre or register, there is a diachronic dimension to the issue of authorship attribution; as Tomblin (2009) has also noted, underpinning authorship identification must be an assumption that an individual’s idiolect (even assuming this can be established as a valid concept) remains constant across the output of an author over time.

Because text messaging is a relatively new genre, people’s texting styles tend to show considerable variation and idiosyncrasy. In general, older people tend to write messages in full form with traditional openings and closings, as they would in e-mails or letters, whereas younger individuals are more likely to send more frequent and shorter text messages, using abbreviated forms and features such as emoticons, and tending to omit openings to their messages as well as parts of speech such as the definite and indefinite article (Crystal, 2009). These differences in style, whether related to gender, age, or other sociolinguistic variables, are becoming increasingly significant in forensic linguistic corpus evidence. In 2003, the BBC News website even went as far as to call mobile phones “the new fingerprints” in their potential as corpora of an individual’s idiolectal texting style.
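The kinds of surface features mentioned above (openings, abbreviations, articles, emoticons, capitalization, and punctuation) can be counted mechanically. The following is a minimal sketch only, not a method described in the sources cited: the function name `style_profile`, the small marker lists, and the emoticon pattern are all hypothetical choices made for illustration, and a real analysis would derive its feature set from the case data.

```python
import re

# Hypothetical marker lists, chosen only for illustration.
ABBREVIATIONS = {"u", "r", "2", "4", "b4", "gr8", "l8r", "thx", "pls", "txt", "msg"}
OPENINGS = {"hi", "hello", "hey", "dear"}
ARTICLES = {"a", "an", "the"}
# A deliberately small emoticon pattern such as :) ;-) =D :P
EMOTICON = re.compile(r"[:;=]-?[)(DPp]")


def style_profile(message: str) -> dict:
    """Count a few surface features of a single text message."""
    tokens = re.findall(r"[A-Za-z0-9']+", message)
    lowered = [t.lower() for t in tokens]
    # Rough sentence split on ., ! and ?; empty pieces are dropped.
    sentences = [s.strip() for s in re.split(r"[.!?]+", message) if s.strip()]
    return {
        "n_tokens": len(tokens),
        "has_opening": bool(lowered) and lowered[0] in OPENINGS,
        "n_abbreviations": sum(t in ABBREVIATIONS for t in lowered),
        "n_articles": sum(t in ARTICLES for t in lowered),
        "n_emoticons": len(EMOTICON.findall(message)),
        "lowercase_sentence_starts": sum(s[0].islower() for s in sentences),
        "ends_with_full_stop": message.rstrip().endswith("."),
    }


if __name__ == "__main__":
    # An invented message, not taken from any case.
    print(style_profile("hi u r gr8 :)"))
```

Profiles of this kind would be computed for every message in a questioned set and in each candidate author's set and then compared. With texts this short, any single count is weak evidence on its own.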

In a 2005 case with texting as a key evidential criterion, parallel corpora of the text messages of murder victim Julie Turner and her friend Howard Simmerson were compared. This was central to the timeline of the crime, since on the afternoon following her disappearance her partner had received the following mobile phone text which the local newspaper reported at the time: “Stopping at jills, back later need to sort my head out.”


He (the perpetrator) allegedly sent this text message to Miss Turner’s partner and to his own mobile phone in an apparent attempt to convince people she was still alive and to give himself an alibi.

The following day, two days after Julie had gone missing, another apparently reassuring text was received: “Tell kids not to worry. sorting my life out.[sic] be in touch to get some things” (Worksop Guardian, November 4, 2005). In a corpus of letters which Simmerson had written (and which therefore provided a comparative corpus), analysts found similarities to the language of the mobile-phone texts, as well as several unusual orthographic and punctuation features which were indicative of a text-messaging idiolect. Simmerson was arrested and charged with her murder. He was subsequently convicted of the murder in 2005 and was sentenced to life imprisonment. The role of his individual corpus of text messages in comparison to Julie’s was pivotal in the trial.

The Use of Pre-Existing Linguistic Corpora in Forensic Linguistics

As well as the construction of sometimes necessarily opportunistic and unprincipled, but bespoke, case-specific corpora as in the example discussed above, large, pre-existing corpora of English are extremely useful for the forensic linguist in a variety of case types. This may be in an investigative role where the linguist is asked to comment on idiolectal, dialectal, or regional features of texts which may aid police officers in identifying and locating a potential perpetrator. It is also possible to use corpora to illustrate the common meaning of a term (the difference between a legal meaning and a lay person’s plain understanding) in cases, usually civil, that turn on the disputed comprehensibility of texts. These include texts such as patient information leaflets included in packs of medication, instructions, and warnings. Failure to fully comprehend these types of texts can lead to frustration at best and even injury or death at worst.

Another sector of work for the forensic linguist is that of trademark disputes, where both usage type and frequency become crucial (see Shuy, 2002 for a detailed discussion of forensic linguistic analysis of trademarks). In cases such as this, access to large corpora can be invaluable since it allows the linguist to gauge a sense of the common usage of such terms and by implication the common understanding of them. Arguably the largest corpus of all is the Web, which has been used in a number of authorship attribution cases, most famously that of the Unabomber.

Using the Web as a Reference Corpus to Catch a Killer

The case of the Unabomber in the USA spanned the 1970s, 1980s, and early 1990s and is the archetypal illustration of a hybrid approach to forensic data analysis, combining individual intuition and idiolectal features of language, with the Web acting as a reference super-corpus.

Theodore Kaczynski, a disenchanted ex-university professor of mathematics in the USA, sent 16 letter bombs to a variety of targeted individuals and institutions. His attacks increased in the level of violence used and resulted in three fatalities and 23 injuries of varying severity. He was named by the FBI as the Unabomber because of his choice of targets: universities and airlines. In 1995, a textual dimension was added to the case. Kaczynski contacted the New York Times with an ultimatum: he would stop his campaign if either the New York Times or the Washington Post agreed to publish a 35,000-word manifesto he had written, entitled “Industrial Society and its Future.” One of the hopes of the Post and the Times, which both agreed to publish the manifesto, was that someone might recognize its stylistic characteristics.


The manifesto contained a number of distinctive features indicative of the writer’s idiolectal style. In addition to consistent self-references using the plural form we or FC (standing for Freedom Club), or both, the author also capitalized whole words as a means of indicating emphasis. Although the text displayed an idiosyncratic use of hyphenation, it was otherwise error-free, in terms of both spelling and grammar, despite the fact that the author had used an old-fashioned typewriter rather than a word processor with a spelling and grammar checker.

A short time after the manifesto was published, the FBI was contacted by David Kaczynski, who claimed that the document had been authored by Theodore, his long-estranged brother. The manifesto contained certain unusual expressions, such as a self-description of the writer as a “cool-headed logician.” David Kaczynski found a set of letters dating back to the 1970s which appeared to contain similar phrasing to that of the manifesto. This is a clear and fascinating example of the power of an individual’s recognition of an idiosyncratic and distinctive writing style. Following a search of Kaczynski’s property, he was arrested and charged with manufacturing and sending the incendiary devices. With a further set of documents seized from the property in its possession, the FBI carried out an analysis (discussed in Fitzgerald, 2004, whose author is an FBI agent and a Georgetown linguistics graduate), which determined that there were multiple similarities between the manifesto and, in particular, a lengthy letter sent to a newspaper on a similar topic.

Following this individual analysis and discovery process, a more detailed analysis of the manifesto was subsequently conducted. The first linguist brought in by the defense argued that although the two documents did indeed share lexical and grammatical similarities, this was not surprising as they were written in roughly the same genre and on the same topic. In response, the FBI carried out an analysis of the Web, which, although significantly smaller in the mid-1990s than even 10 years later, still represented an enormous reference corpus. The FBI used the same set of 12 expressions that had been used in an analysis previously carried out for the defense. The 12 generic and non-subject-specific words and phrases were the following:

Single Lexical Items    Phrases              Lemmas
thereabouts             at any rate          argu*
gotten                  more or less         propos*
propaganda              on the other hand
presumably              in practice
moreover
clearly

* This includes lexical items such as argument, argumentative, proposition, proposal, etc.

At first glance these items and phrases do not appear at all remarkable. As the defense suggested, one would expect them to appear in any argumentative text of this type. However, when the 12 words and expressions were entered into the search engine Google, a total of approximately 3 million Web hits were returned which contained one or more of them. This is an unsurprising result but was clearly extremely disappointing for the FBI’s analyst (confirmed in personal communication with the special agent concerned and published by him in Fitzgerald, 2004). The query was then refined to include only those documents which contained all 12 of the search terms. This produced a remarkable result: only 69 documents on the entire Web contained all 12 of these words and phrases. Perhaps more remarkably, each hit consisted of a version of Kaczynski’s 35,000-word manifesto. As Coulthard points out in his discussion of the methodology employed in the Unabomber case:


This was a massive rejection of the defence expert’s view of text creation as purely open choice [see Sinclair, 1991], as well as a powerful example of the idiolectal habit of co-selection and an illustration of the consequent forensic possibilities that idiolectal co-selection affords for authorship attribution. (Coulthard, 2004, p. 433)
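The contrast between documents containing one or more of the 12 items and documents containing all of them can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the FBI's procedure: a small in-memory list of strings stands in for the Web, the function names (`item_pattern`, `items_found`, `any_vs_all`) are hypothetical, and treating the two starred lemmas as prefix matches is an assumption based on the note beneath the list of items.

```python
import re
from typing import List, Tuple

# The 12 items listed in the article; starred items are treated as prefixes.
SEARCH_ITEMS = [
    "thereabouts", "gotten", "propaganda", "presumably", "moreover", "clearly",
    "at any rate", "more or less", "on the other hand", "in practice",
    "argu*", "propos*",
]


def item_pattern(item: str) -> "re.Pattern[str]":
    """Build a case-insensitive, whole-word pattern for one search item."""
    if item.endswith("*"):
        # Prefix match: argu* matches argument, argumentative, ...
        body = re.escape(item[:-1]) + r"\w*"
    else:
        # Allow any run of whitespace between the words of a phrase.
        body = r"\s+".join(re.escape(word) for word in item.split())
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


PATTERNS = [item_pattern(item) for item in SEARCH_ITEMS]


def items_found(text: str) -> int:
    """Number of distinct search items that occur at least once in text."""
    return sum(1 for pattern in PATTERNS if pattern.search(text))


def any_vs_all(documents: List[str]) -> Tuple[int, int]:
    """Return (documents with >= 1 item, documents with all items)."""
    counts = [items_found(doc) for doc in documents]
    any_hits = sum(count >= 1 for count in counts)
    all_hits = sum(count == len(PATTERNS) for count in counts)
    return any_hits, all_hits
```

Individually, each item is common, so the first count is large for almost any collection of argumentative prose. It is the requirement that all items co-occur in one document that narrows the set so sharply.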

In addition to the corpus evidence produced from these items and phrases, one further fascinating issue emerged which relates to the concept of keyness, that is, words or phrases which occur with “unusual frequency” (Scott, 1999). In his manifesto, the Unabomber reversed the order of an expression, from the more expected “have your cake and eat it too” to “eat your cake and have it too.” This alternative formulation was also found in other writings by Kaczynski, but a search of the Web again found that this was a unique form not found in any other text. The concept of idiolectal features, as illustrated in the case of the Unabomber, has considerable potential in forensic casework, as Kredens (2002) discusses in detail.
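Keyness is usually quantified by comparing an item's frequency in the text under study with its frequency in a reference corpus. The sketch below uses the log-likelihood statistic, one commonly used keyness measure; the article does not say which statistic, if any, was applied in the Unabomber case, so the choice of measure, the function name `log_likelihood`, and the figures in the usage lines are all assumptions made for illustration.

```python
import math


def log_likelihood(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Log-likelihood keyness of one item.

    freq_study / freq_ref: occurrences of the item in each corpus.
    size_study / size_ref: total number of tokens in each corpus.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected counts if the item were equally frequent in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


if __name__ == "__main__":
    # Invented counts: same relative frequency, then an over-represented item.
    print(log_likelihood(10, 1000, 100, 10000))
    print(log_likelihood(20, 1000, 100, 10000))
```

The score is zero when the relative frequencies are identical and grows as they diverge. It does not by itself show in which corpus the item is over-represented, so the two relative frequencies should be inspected alongside it.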

Shortly before his trial in 1998, Kaczynski pleaded guilty to all charges and received a commuted sentence of life imprisonment without the possibility of parole. The Unabomber case is a powerful illustration of the potential power of using the Web as a reference corpus, albeit with many caveats and cautions.

Coulthard and Johnson (2007) discuss the case of Robert Brown in the UK, who was convicted of the murder of Annie Walsh in 1977. The alleged confession was the only substantive evidence presented by the prosecution. Brown consistently claimed the confession had been produced by the police and that incriminating sections had been inserted into his police interview. Coulthard analyzed several significant strings which occurred in both disputed texts, for example, “I asked her if I could carry her bags and she said yes.” In a convincing analysis using the Web again as a reference corpus, Coulthard showed that these apparently common strings do not occur at all on the Web. The appeal court found Brown’s conviction unsafe, mainly due to this evidence, and he was released after spending 25 years in prison. Coulthard labels this comparative method a “corpus-assisted” approach combining expert qualitative analysis with corpus linguistic methodology (Coulthard, 2004).
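The first step of such a comparison, finding word strings shared by two texts, can be sketched as follows. This is a minimal illustration and not Coulthard's own procedure: the function names (`word_ngrams`, `shared_strings`), the tokenization rule, and the default string length of six words are assumptions made for the example, and the second step (checking how often each shared string occurs in a reference corpus such as the Web) is left out.

```python
import re
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All distinct runs of n consecutive words, lowercased, punctuation dropped."""
    words = re.findall(r"[a-z']+", text.lower())
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_strings(text_a: str, text_b: str, n: int = 6) -> List[str]:
    """Word strings of length n that occur in both texts, in sorted order."""
    common = word_ngrams(text_a, n) & word_ngrams(text_b, n)
    return sorted(" ".join(gram) for gram in common)


if __name__ == "__main__":
    # The example string quoted in the article, embedded in two invented contexts.
    statement = "I asked her if I could carry her bags and she said yes."
    interview = "Then I asked her if I could carry her bags and she said yes to me."
    for string in shared_strings(statement, interview, n=13):
        print(string)
```

Longer shared strings are the more telling ones, since the chance of two independent texts containing the same run of words falls quickly as the run lengthens.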

For Solan and Tiersma (2004), such an approach would be admissible even under the rigorous Daubert criteria employed in American courtrooms. Daubert requires that a method be subjected to peer review and publication, undergo empirical testing, and be generally accepted by its relevant academic community. Corpus linguistics, Solan and Tiersma conclude, represents the most promising methodology currently available to linguists attempting authorship attribution.

Conclusion

With the development of large corpora of various forensic genres apparently on the horizon, the future for forensic linguistics, particularly in terms of authorship analysis in all its forms, seems to hang on computerized analysis and the development of robust statistical measures. Many forensic linguists advocate the creation of these genre-specific corpora of authentic data (in the example below, police statements), both to permit:

statistically valid statements in court and to protect [oneself] against the suggestion by hostile cross-examiners that, as any reasonable person will agree, a corpus of general conversation is irrelevant for comparative and normative purposes, because the linguistic behaviour of witnesses and suspects must change when they are making statements under oath. (Coulthard, 1993, p. 89)


In terms of effective data analysis and the fulfillment of evidential criteria, it seems clear that corpus linguistics will have an increasingly important role to play in the future detection of crimes involving language. Whether corpus-aided, corpus-assisted, or corpus-based, much of the work of the forensic linguist will involve some degree of corpus linguistic methodology and principles.

SEE ALSO: Forensic Linguistics: Overview; Language of Courtroom Interaction; Language of Jury Instructions; Language of Police Interviews; Legal Language; Linguistic Analysis of Disputed Meanings: Trademarks; Prison Language

References

Biber, D. (1988). Variation across speech and writing. Cambridge, England: Cambridge University Press.

Biber, D. (1995). Dimensions of register variation. Cambridge, England: Cambridge University Press.

Conley, J., & O’Barr, M. (1990). Rules versus relationships: The ethnography of legal discourse. Chicago, IL: University of Chicago Press.

Cotterill, J. (2000). Reading the rights: A cautionary tale of comprehension and comprehensibility. Forensic Linguistics, 7(1), 4–25.

Coulthard, R. M. (1993). Beginning the study of forensic texts: Corpus, concordance, collocation. In M. P. Hoey (Ed.), Data, description, discourse (pp. 86–97). London, England: HarperCollins.

Coulthard, R. M. (2004). Author identification, idiolect, and linguistic uniqueness. Applied Linguistics, 25(4), 431–47.

Coulthard, R. M., & Johnson, A. (2007). An introduction to forensic linguistics: Language in evidence. London, England: Routledge.

Crystal, D. (2009). Txtng: the Gr8 Db8. Oxford, England: Oxford University Press.

Dumas, B. (1990). An analysis of the adequacy of federally mandated cigarette package warnings. In J. N. Levi & A. G. Walker (Eds.), Language in the judicial process (pp. 309–52). New York, NY: Plenum.

Fitzgerald, J. R. (2004). Using a forensic linguistic approach to track the Unabomber. In J. H. Campbell & D. DeNevi (Eds.), Profilers. New York, NY: Prometheus Books.

Haworth, K. (2006). The dynamics of power and resistance in police interview discourse. Discourse & Society, 17, 739–59.

Heffer, C. (2006). The language of jury trial: A corpus-aided linguistic analysis of legal–lay discourse. Basingstoke, England: Palgrave Macmillan.

Heydon, G. (2005). The language of police interviewing: A critical analysis. Basingstoke, England: Palgrave Macmillan.

Kredens, K. (2002). Idiolect in forensic authorship attribution. In P. Stalmaszczyk (Ed.), Folia Linguistica Anglica (Vol. 4). Lodz, Poland: Lodz University Press.

Mayr, A. (2004). Prison discourse: Language as a means of control and resistance. Basingstoke, England: Palgrave Macmillan.

O’Keeffe, A., McCarthy, M., & Carter, R. (2007). From corpus to classroom: Language use and language teaching. Cambridge, England: Cambridge University Press.

Olsson, J. (2004). Forensic linguistics: An introduction to language, crime and the law. London, England: Continuum.

Rock, F. (2007). Communicating rights: The language of arrest and detention. Basingstoke, England: Palgrave Macmillan.

Scott, M. (1999). Wordsmith tools. Software. Oxford, England: Oxford University Press.

Shuy, R. W. (1997). Ten unanswered language questions about Miranda. Forensic Linguistics, 4(2), 175–96.

Shuy, R. W. (2002). Linguistic battles in trademark disputes. New York, NY: Palgrave Macmillan.

Sinclair, J. (1991). Corpus, concordance, collocation. Oxford, England: Oxford University Press.

Solan, L. (1993). The language of judges. Chicago, IL: University of Chicago Press.

Solan, L., & Tiersma, P. (2004). Author identification in American courts. Applied Linguistics, 25(4), 448–65.

Stubbs, M. (1996). Text and corpus analysis. Oxford, England: Blackwell.

Stygall, G. (1994). Trial language. Amsterdam, Netherlands: John Benjamins.

Tiersma, P. (1999). Legal language. Chicago, IL: University of Chicago Press.

Tomblin, S. D. (2009). Future directions in forensic authorship analysis: Evaluating formulaicity as a marker of authorship (Unpublished doctoral dissertation). Cardiff University.

Winter, E. (1996). The statistics of analysing very short texts in a criminal context. In H. Kniffka (Ed.), Recent developments in forensic linguistics (pp. 141–79). Frankfurt am Main, Germany: Peter Lang.

Suggested Readings

Cotterill, J. (2003). Language in court: Power and persuasion in the OJ Simpson trial. Basingstoke, England: Palgrave Macmillan.

Coulthard, R. M. (1994). On the use of corpora in the analysis of forensic texts. Forensic Linguistics: International Journal of Speech, Language and the Law, 1, 27–43.

Coulthard, R. M. (2000). Whose text is it? On the linguistic investigation of authorship. In S. Sarangi & R. M. Coulthard (Eds.), Discourse and social life (pp. 270–89). London, England: Longman.

Eades, D. (2010). Sociolinguistics and the legal process. Bristol, England: Channel View Publications/Multilingual Matters.

Finegan, E. (2010). Corpus linguistic approaches to “legal language”: adverbial expression of attitude and emphasis in Supreme Court opinions. In R. M. Coulthard & A. Johnson (Eds.), The Routledge handbook of forensic linguistics (pp. 65–77). London, England: Routledge.

Fox, G. (1993). A comparison of “policespeak” and “normalspeak”: A preliminary study. In J. Sinclair, M. Hoey, & G. Fox (Eds.), Techniques of description: Spoken and written discourse (pp. 183–95). London, England: Routledge.

Gibbons, J. (2001). Revising the language of New South Wales police procedures: Applied linguistics in action. Applied Linguistics, 22(4), 439–69.

Hänlein, H. (1998). Studies in authorship recognition: A corpus-based approach. Frankfurt am Main, Germany: Peter Lang.

Sarangi, S., & Coulthard, R. M. (Eds.). (2000). Discourse and social life. London, England: Longman.

Newspaper/Online Resources

Julie Turner

Mum’s body found in bin. www.worksopguardian.co.uk/news/MUM39S-BODY-FOUND-IN-BIN.1243062.jp Worksop Guardian, November 4, 2005.

Text messaging

Humphrys (2007). I h8 txt msgs: How texting is wrecking our language. Daily Mail, September 24, 2007. Available at www.dailymail.co.uk/news/article-483511/I-h8-txt-msgs-How-texting-wrecking-language.html.

Text messages could solve crimes, August 10, 2006. news.bbc.co.uk/1/hi/england/leicestershire/4779981.stm.

Mobile phones - the new fingerprints. BBC News Online: news.bbc.co.uk/1/hi/uk/3303637.stm, December 18, 2003.
