Collocation and Knowledge Production in an Academic Discourse

Collocation and knowledge production in an academic discourse community

Keith Stuart

Ana Botella Trelis

Universidad Politécnica de Valencia (Spain)

Abstract

This paper analyses the discourse of science and technology through the study of lexical and grammatical co-selection in research articles. The corpus comprises 1,376 articles from leading specialist journals (6,104,323 tokens and 71,516 types, giving a type/token ratio of 1.17%). The main criterion for including these articles in the corpus is that they were written by members of the academic discourse community of the Universidad Politécnica de Valencia.

The research follows on from work by Gledhill (2000), who analysed the most frequent collocations in a corpus of pharmaceutical research articles and proposed possible functions for these collocations. However, this paper takes a more lexico-grammatical approach to collocations as a system of preferred expressions of knowledge in scientific research. The concept of collocation used here follows the Hallidayan tradition, which treats it as an intermediate level between syntax and lexis and focuses on recurrent word patterns (Hunston & Francis, 2000). The paper's ultimate objective is to show that collocations, as a system of preferred expressions of knowledge in scientific research, can help us to analyse the knowledge produced in our academic discourse community.

Key words: corpus linguistics, co-selection, collocation, colligation, research articles

Introduction

An important discovery of corpus linguistics has been that there is a level of syntagmatic phrasal organisation which had been largely ignored. The units at this level may be described as n-grams, meaning recurrent strings of uninterrupted word-forms, or, as in Scott (1997), as word clusters. They form part of what has been called collocation in the Firthian tradition. Evidence from corpora was needed because these syntagmatic structures did not fit into either lexis or grammar, and because they involve facts about frequency that depend on computer technology. As Leech (1992: 106) envisaged, the computer was going to do more than just act as a research tool; it was going to open up new ways of thinking about language by providing more data and better counting. Clear (1993: 274) pointed out that the use of computational (algorithmic and statistical) methods has led to a difference of scale in the corpus data that can be analysed, and this in turn has led to a qualitative difference in observations about language based on corpus evidence.

This paper explores corpus evidence about collocations as a first step towards establishing, through collocational networks, a conceptual map of the knowledge being produced in our academic discourse community. It analyses the discourse of science and technology through the study of lexical and grammatical co-selection in research articles, using a corpus of 1,376 articles (a total of 6,104,323 tokens). The main criteria for choosing the journals for our corpus were that they are cited in the Science Citation Index (SCI), that they are read by our university lecturers and students, and that they are where our lecturers and postgraduate students try to publish their research. All the articles were written by our lecturers and therefore represent the work of a single academic discourse community, in this case that of the Universidad Politécnica de Valencia.

The research follows on from work by Gledhill (2000), who analyses the most frequent collocations in a corpus of pharmaceutical research articles and proposes possible functions for these collocations. However, this paper takes a more lexico-grammatical approach to collocations as a system of preferred expressions of knowledge in scientific research. We do not, though, restrict the analysis to a strict collocational approach, but rather investigate the notion of co-selection as describing the general phenomenon of words that habitually keep company, to paraphrase Firth. A syntagmatic view of language takes account of the contribution of sense and syntax to meaning. The argument that sense and syntax (Sinclair, 1991), or meaning and pattern (Hunston & Francis, 2000), are associated is based on two pieces of evidence. Firstly, meanings tend to be distinguished by differing patterns, and secondly, words with the same pattern sometimes share aspects of meaning.

Sinclair (1991: 170) refers to collocation as "the occurrence of two or more words within a short space of each other in a text"; this could logically refer to co-selection between lexical or grammatical items. Some authors (Firth, 1957; Hoey, 2005: 43) draw a distinction between collocation and colligation, using the former to refer to the co-occurrence of lexical items and the latter to the interrelationship of words and grammatical items (the grammatical company a word or word sequence keeps). Sinclair himself refers to colligation within a collocation context, in terms of collocational frameworks, which are units based on a grammatical, as opposed to a lexical, core (e.g., the/an...of) (Renouf & Sinclair, 1991: 128-143). Analysis of lexical and grammatical co-selection in our corpus of research articles proceeded by asking three questions:

1.- What are the collocations of X word or words in the corpus?

2.- What meanings do X word or words tend to associate with?

3.- What grammatical constructions (colligation) do X word or words tend to enter into?

The paper's ultimate objective is to analyse collocations as a system of preferred expressions of knowledge in scientific research that can help us to analyse the knowledge produced in our academic discourse community.

Method

Once the corpus had been designed and implemented, we proceeded to analyse the data by creating wordlists of technical and semi-technical terms through frequency counts and keyword identification. This process initially involved comparing a general English wordlist (from the 100-million-word British National Corpus, BNC) with a wordlist from our corpus. Frequencies were compared and a keyword list was created from our corpus. To compute the "key-ness" of an item, the software used (WordSmith) takes the following four counts and cross-tabulates them (Scott, 2004):

1.- its frequency in the smaller wordlist (our corpus)

2.- the number of running words in the smaller wordlist (our corpus)

3.- its frequency in the reference corpus (BNC)

4.- the number of running words in the reference corpus (BNC)
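The four counts listed above form a 2x2 contingency table for each word. The following is a minimal sketch of how a key-ness score could be derived from such a table, assuming the log-likelihood statistic; WordSmith offers both chi-square and log-likelihood, and the source does not state which was used. The function name and the BNC frequency in the example call are illustrative assumptions, not figures from the study.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) for one word from the four counts listed above."""
    # 2x2 table: rows = corpus (study, reference), columns = (word, all other words).
    observed = [
        [freq_study, size_study - freq_study],
        [freq_ref, size_ref - freq_ref],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if observed[i][j] > 0:  # a zero cell contributes nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g2

# "results" occurs 9,730 times in the 6,104,323-token corpus (from the paper);
# the reference frequency 20_000 is a made-up placeholder for the BNC count.
score = log_likelihood(9_730, 6_104_323, 20_000, 100_000_000)
print(round(score, 1))
```

A word whose relative frequency is the same in both corpora scores 0; the larger the score, the stronger the evidence that the word is unusually frequent (or infrequent) in the smaller corpus.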

Once we had established the candidate terms to be analysed, our software extracted the collocations for these terms and exported them to an Excel spreadsheet. Collocates were extracted from the entire corpus within a span of five words on either side of the node term.
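The span-based extraction described above can be sketched as follows. This is an illustration only, not the software used in the study: it assumes the corpus is already tokenised into a flat list of lower-cased word-forms, and the function name `collocates` is hypothetical. Counts are kept separately for collocates found to the left and to the right of the node.

```python
from collections import Counter

def collocates(tokens, node, span=5):
    """Count words occurring within `span` tokens to the left and right of `node`."""
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Up to `span` tokens before the node (fewer at the start of the text).
        left.update(tokens[max(0, i - span):i])
        # Up to `span` tokens after the node (fewer at the end of the text).
        right.update(tokens[i + 1:i + 1 + span])
    return left, right

# Toy example text, invented for illustration.
text = "the results obtained with the model and the results obtained for the system"
left, right = collocates(text.split(), "results", span=5)
total = left + right
print(total["obtained"], left["obtained"], right["obtained"])
```

Note that this sketch lets the window cross sentence boundaries; whether the original software did so is not stated in the source.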

The candidate terms selected for this analysis were the following:

1.- Semi-technical words which are very frequent in the corpus and constitute significant examples of both lexical and grammatical co-occurrences (collocations and colligations), for example, results, system, model, etc.

2.- Semi-technical and technical words which tend to appear next to or near certain terms, producing relevant semantic content which represents knowledge generated at our institution.

Results

The first example we would like to present is the term results, as it is the most frequent semi-technical term in the UPV corpus (9,730 occurrences). Moreover, this term gives us clear examples of lexical associations, not only in three-word recurrent patterns (clusters) but also in longer strings. It is worth noting that the most frequent collocation found, the results obtained, is followed by different prepositions (depending on the noun group that follows the preposition).

TABLE 1. Obtained as a collocate of results

Cluster                       Frequency
the results obtained          636
the results obtained for      98
the results obtained with     97
the results obtained by       82
the results obtained in       82
the results obtained from     56

Collocates for the term results fall into three categories: evaluative adjectives (experimental, similar, good, different, previous), past participle adjectives/passive structures (obtained, shown, presented, compared) and active verbs (show, indicate, present). The position of a collocate with respect to the node, before or after it, is clearly fixed in some of the examples and is therefore relevant in those cases.

TABLE 2. Most frequent collocates of results

Collocate       With       Total    Total Left    Total Right
obtained        results    1457     191           1266
experimental    results    627      558           69
show            results    591      90            501
discussion      results    526      66            460
shown           results    426      82            344
presented       results    284      48            236
similar         results    277      199           78
good            results    251      170           81
different       results    236      104           132
simulation      results    217      174           43
compared        results    201      68            133
between         results    194      136           58
indicate        results    186      8             178
agreement       results    185      112           73
analysis        results    176      96            80
observed        results    156      96            60
given           results    149      37            112
present         results    149      101           48
previous        results    149      88            61

It is also worth mentioning that the 5-word clusters with results are different from those shown above.

TABLE 3. 5-word clusters with results

Cluster                                  Frequency
in agreement with the results            20
often leads to misleading results        12
according to the results obtained        10
basic notions and preliminary results    9
the basis of the results                 9
taking into account the results          9
on the basis of the results              9
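Cluster counts like those reported for results can be approximated by counting every uninterrupted n-word string that contains the node word. The sketch below assumes a pre-tokenised, lower-cased corpus and uses a hypothetical function name; it is not the procedure of the software used in the study, which may apply additional thresholds or sentence-boundary rules.

```python
from collections import Counter

def clusters(tokens, node, n):
    """Count all n-word strings (n-grams) that contain the node word."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if node in gram:
            counts[" ".join(gram)] += 1
    return counts

# Toy example text, invented for illustration.
tokens = "as a result of this and as a result of that".split()
for cluster, freq in clusters(tokens, "result", 3).most_common(2):
    print(cluster, freq)
```

Changing `n` from 3 to 5 yields longer clusters from the same text, which is why the longer strings need not simply extend the most frequent short ones.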

The collocates and clusters found for the same word in the singular (result) differ

substantially from those found in the plural form.

TABLE 4. 3-word clusters for result

Cluster                 Frequency
as a result             329
the following result    282
the result of           250
a result of             228

System is the second most frequent semi-technical word in the corpus (8,205). The results obtained when analysing the term show two facts we would like to mention.



Other noun groups, although less frequent, are formed with this semi-technical word: the traditional pile salting method, the discrete analytical stiffness derivative method, proposed shape restricted snake method, the split step Fourier method, etc.

Another term which shows a fixed pattern of use is performance, with the performance of being the most frequent cluster found (494). In our corpus this term usually collocates with words that have a positive meaning: evaluative adjectives such as best, better than, high, good, and verbs with positive connotations such as improve, boost, achieve.

The semi-technical term temperature tends to colligate with the preposition at (the pattern at room temperature is the most frequent, with 484 occurrences). Clusters with semantic content are also found with this term: glass transition temperature, the annealing temperature, cooling water temperature, burnt products temperature.

Other examples of semi-technical words in the corpus which show fixed lexical and grammatical patterns are samples and values. In the case of samples, we find a repeated use of passive structures with verbs indicating actions performed by scientists in this context.

TABLE 8. Passive structures with sample

Cluster                  Frequency
samples were taken       50
samples were prepared    31
samples were analysed    23
samples stored at        23
samples treated with     20
samples were dried       19
samples were analyzed    18

Both value and values are used in recurrent combinations with prepositions. We find patterns such as: of a/the X value of, for a/the value of, with the value/s of, with different values of, in the value of, from the values of.

These examples with value are similar to Sinclair's collocational frameworks (Renouf & Sinclair, 1991). One of the most common collocational frameworks in our corpus is the string preposition + the + x + of + y. The most frequent examples found for the preposition in are: in the case of (1,256), in the presence of (957), in the absence of (268), in the range of (141), etc. The results for on are: on the basis of (219), on the use of (124), on the surface of (95), etc. With at we have: at the beginning of (94), at the bottom of (92), at the centre of (90), at the end of (80). Another collocational framework which is very frequent in the scientific writing of our corpus is under x conditions. Examples found in the corpus are: under these conditions, under the conditions, under certain conditions, under different conditions, under non-cavitating conditions, under super critical conditions, etc. This kind of analysis can constitute a useful resource for scientists writing in English as an L2 and for ESP teachers and their students.
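A collocational framework such as preposition + the + x + of fixes the grammatical words and leaves one slot open, so its fillers can be counted by scanning for the fixed positions. The following sketch assumes a pre-tokenised, lower-cased corpus; the function name `framework_fillers` and the example sentence are illustrative and do not come from the study.

```python
from collections import Counter

def framework_fillers(tokens, preposition):
    """Count the words x that fill the frame: preposition + the + x + of."""
    fillers = Counter()
    for i in range(len(tokens) - 3):
        if (tokens[i] == preposition
                and tokens[i + 1] == "the"
                and tokens[i + 3] == "of"):
            fillers[tokens[i + 2]] += 1
    return fillers

# Toy example text, invented for illustration.
tokens = ("in the case of steel and in the presence of water "
          "or in the case of iron").split()
print(framework_fillers(tokens, "in").most_common())
```

The same scan with a different set of fixed positions (for example, under + x + conditions) would cover the second framework mentioned above.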



Semi-technical terms with a high frequency in the corpus provide us with information about the knowledge in our community. The association of these terms with other technical terms and the relationships established between some of them will help the linguist to represent knowledge in terms of more or less complex semantic networks.

FIGURE 1. First step towards a collocational network with acid

Node: ACID (4,083). Collocates: Sites (416), Amino (379), Concentration (211), Citric (185), Groups (136), Strength (129), Solution (118), Coumaric (114), pH (104), Acetic (97), Catalysts (94), Membrane (88), Cinnamic (77).

FIGURE 2. First step towards a collocational network with sites

Node: SITES (1,460). Collocates: Acid (416), Active (174), Number (110), Binding (91), Strength (63), Surface (60), Strong (56), Frameworks (50), Zeolites (50), Concentration (49).
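One possible way to represent these two partial networks is as a mapping from each node to its collocates, which makes it easy to see where the networks connect. The sketch below uses the labels and numbers shown in Figures 1 and 2; reading the bracketed numbers as co-occurrence frequencies, lower-casing the labels, and the function name `shared_collocates` are assumptions made for illustration.

```python
# Collocates of each node, with the numbers shown in Figures 1 and 2.
network = {
    "acid": {
        "sites": 416, "amino": 379, "concentration": 211, "citric": 185,
        "groups": 136, "strength": 129, "solution": 118, "coumaric": 114,
        "ph": 104, "acetic": 97, "catalysts": 94, "membrane": 88,
        "cinnamic": 77,
    },
    "sites": {
        "acid": 416, "active": 174, "number": 110, "binding": 91,
        "strength": 63, "surface": 60, "strong": 56, "frameworks": 50,
        "zeolites": 50, "concentration": 49,
    },
}

def shared_collocates(network, node_a, node_b):
    """Return the collocates that two nodes have in common, in alphabetical order."""
    return sorted(set(network[node_a]) & set(network[node_b]))

# The two nodes are linked directly (each is a collocate of the other)
# and indirectly through the collocates they share.
print(network["acid"]["sites"], network["sites"]["acid"])
print(shared_collocates(network, "acid", "sites"))
```

Adding further nodes to the mapping in the same way would extend the two figures towards the larger network that the paper describes as its goal.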

