
Metrics versus Peer Review?

Albert Weale, University of Essex

Political Studies Review: 2009, Vol 7, 39–49
© 2009 The Author. Journal compilation © 2009 Political Studies Association

At a time when research evaluation policy in the UK is in some flux, and when there are a multitude of proposals for change but little hard-headed analysis, Political Studies Review is to be congratulated on setting up this symposium. Of course, as one of the co-editors of the British Journal of Political Science, which shows up extremely well in the McLean et al. study, as well as the Chair of the 2001 Research Assessment Exercise (RAE) panel, whose decisions are analysed by Butler and McAllister, I am hardly the most disinterested commentator. But the two articles under discussion are important in their own right and in each case grow out of work that has been conducted over many years.

It may be useful if I clarify the terminology I propose to use. Research evaluation is the process of arriving at a grading of research performance. Such gradings normally take two forms: categorical and rank order. Categorical gradings involve units of assessment being assigned a category such as ‘good, average or poor’ or a numerical category, such as the seven-point grades used in the 2001 RAE. Rank orders are where it is possible to list research performance from high to low, assigning each unit a place in the ranking (a ‘league table’). Rank orders will certainly be produced officially or unofficially as a result of the 2008 RAE.

Research evaluation can be by peer review or by metrics or by some combination of the two. Peer review, as I use it here, involves a panel, along with ad hoc expert advisers, whose members make a direct judgement of the quality of the work to be assessed, just as the referees for a journal article do. Metrics are counts of supposed indicators or proxies for research quality (citations, research income, graduate students, etc.). Where reputational measures, like those in McLean et al., are based on surveys, they are a form of metrics.

Anybody using metrics has to decide what metrics are appropriate, and so there has to be expert judgement at least in that phase of evaluation. More generally, there is a significant difference between expecting an expert panel to make its own judgement informed by metrics on the one hand and taking the results of metrics as the baseline to be moderated only in particular cases by peer review on the other.

Butler and McAllister

Linda Butler and Ian McAllister (2009) say that their purpose is to consider whether citations could ‘wholly or partly replace peer review for evaluating research quality?’ (p. 3). In pursuing this goal, they produce a parsimonious model that throws up some interesting issues. Before taking up some specific points, let me first comment on the overall approach and argument.

A strong point of the approach is to move away from the assumption that only journal citations can be a proxy for assessment of research output in general. Faute de mieux, this assumption has been made in previous attempts to use citation measures in politics and international studies (e.g. Hix, 2004), but it is obviously flawed. Different sub-disciplines publish in different ways, and there is no reason to think that these differences reflect research quality. Limiting oneself to one form of publication produces sample truncation bias.

However, the extension by Butler and McAllister, though significant, is limited. Conventional citation analyses are usually taken from Thomson ISI and consist of analysing articles indexed in the ISI database. Since Thomson ISI does not include all major journals published, there are well-known problems of data bias. (Political theorists reading this might like to reflect on the fact that, among others, Contemporary Political Theory, Critical Review of International Social and Political Philosophy and the Journal of Political Ideologies are not included.) Building on earlier work done by Butler (Butler, 2006; CHASS, 2006), Butler and McAllister augment the conventional cited sources by including citations in ISI-indexed journals of books, book chapters and those journals not included in the ISI index. Table 1 gives a summary. In that table, journals indexed in ISI are the conventional citing and cited sources used in ISI-based analysis; the other included cited sources are the extensions added by Butler and McAllister. Excluded from the analysis are citing publications in the form of books, book chapters and journals not listed in ISI.

To note the exclusions is not to criticise Butler and McAllister – after all, they are building on excellent work that has sought to extend the conventional measures. The required data are not available in a form to be analysed (though we can hope that as books become digitised they will be at some point). However, if we think that these exclusions introduce some truncation bias, then we should be correspondingly cautious about the inferences we draw.

Table 1: Scope of Citations in Butler and McAllister

Citing publications
Included: Journals indexed in ISI
Excluded: Journals not indexed in ISI; books; chapters in books; other research output, e.g. databases on CD ROM

Cited publications
Included: Journals indexed in ISI; journals not indexed in ISI; books; chapters in books
Excluded: Other research output, e.g. databases on CD ROM

In relation to policy inferences, I note some slippage in the logic of the argument. Butler and McAllister state their purpose as being to test whether citation analysis can wholly or partly replace peer review. One of the arguments they give for conducting the exercise on UK data is that the UK’s long experience with RAE peer-review assessment means that there are ‘objective’ rankings against which to test the citation approach (p. 5). So at this point of the article the RAE scores are used to provide some way of benchmarking the efficacy of the citation approach. By the end of the article Butler and McAllister claim that their analysis has shown that citation analysis will ‘obviate the need for a large peer-review committee and the consequent indirect biases that it introduces in the system’ (p. 15). In short, the argument has moved from one which says that metrics are desirable because they can replicate peer review to saying that they are desirable because they can replace and improve on peer review. Peer review as benchmark has been disposed of.

One reason for this slippage in the argument – I conjecture – is that Butler and McAllister make certain unwarranted assumptions about what is an ‘objective’ and what is a ‘subjective’ measure of quality. With the exception of the passage on page 5 that introduces the notion of benchmarking from peer review, peer review is regarded as a ‘subjective’ assessment and citation analysis as an ‘objective’ assessment. This distinction is muddled. Citation analysis measures decisions to cite. A decision to cite is a judgement by someone of some kind (whether of the high value of the work or its familiarity to other members of the academic community or something else). In that sense, it is just as subjective as the judgements of a peer-review RAE panel. (As a long-standing journal editor, I am still surprised when I see citation data on articles that I have published, and in particular note the lack of citations to quite outstanding articles.)

The distinction, then, is not between subjective and objective, but between collectively deliberated judgements by a relatively small number of scholars on an expert panel and disaggregated and individual judgements by a relatively large number of scholars. There are some reasons for preferring the latter to the former, a point to which I return later. But even if large-scale averaging of judgement is better than small-scale deliberated judgement in research evaluation, that is how the choice should be described. It is simply confusing to introduce the terms ‘objective’ and ‘subjective’.

Having commented on the general approach, let me now pick up some important individual matters.

Determining Grades

The first stage in Butler and McAllister’s argument is to test how far a suitable citation analysis can replicate the gradings of the 2001 RAE panel. In their article, the panel grading is regressed on a number of variables including citations (logged), Office of Science and Technology and the British Academy (OST/BA) income, national academy membership, the number of academic staff, the number of graduate students and whether or not a department contained a member of the panel. The only statistically significant coefficients are those related to citations, departmental member on the panel and the number of graduate students (see their Table 2, p. 9). Together these variables explain 62 per cent of the variance in the RAE panel scores.
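For readers who want to see the shape of such a model, the following is a minimal sketch of an OLS regression of this kind in Python using statsmodels. It is not the authors’ code: the data file and the column names (rae_grade, log_citations, ost_ba_income, academy_members, staff, grad_students, panel_member) are hypothetical stand-ins for the variables listed above.

```python
# Hypothetical sketch of the kind of OLS model described in the text,
# not Butler and McAllister's own analysis.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical department-level data set with one row per submission.
departments = pd.read_csv("rae2001_departments.csv")

model = smf.ols(
    "rae_grade ~ log_citations + ost_ba_income + academy_members"
    " + staff + grad_students + panel_member",
    data=departments,
).fit()

print(model.summary())   # coefficients and p-values for each variable
print(model.rsquared)    # proportion of variance explained (about 0.62 in the article)
```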

This is one of those cases where one does not know whether to be more impressed by the similarities or the differences. The replication argument requires stressing the similarities, but in some ways the differences are most striking. Yet there are good reasons why the regression equation that Butler and McAllister use would not coincide with the panel RAE scores. I have summarised these in Table 2, but they deserve some detailed commentary.

Of these differences, the most important relates to the non-linear process of aggregation on the panel as distinct from the linear process integral to a regression analysis. The 2001 panel had to work to the Higher Education Funding Council for England (HEFCE) grading scheme, the building block of which was an assessment of the output of individuals as being of international, national or sub-national quality. The panel then had to aggregate these judgements of individuals’ output into an overall department score, according to the HEFCE rules. This aggregation was non-linear in at least three important respects.

On the HEFCE aggregation scheme, a department score was not responsive to quite substantial changes in the proportion of work that was judged to be of a certain quality. The clearest example of this non-linearity was in relation to the award of a 5. A department would secure a 5 if between 10 per cent and 50 per cent of its work were international. For departments of average size (17), this means that two departments would get the same grade if one had international work for 2 of its members and the other had international work for 8 of its members. In the Butler and McAllister approach, by contrast, every increment of quality will count towards an improvement of grade.

Secondly, the HEFCE grading system did not allow the considerable strengths of particular individuals to offset the weaknesses of others. To secure a 5 or 5-star, a department was not allowed to have any individual’s work judged below national level. In the Butler and McAllister approach, by contrast, this is precisely what can happen. A large number of citations by one or two individuals raises the whole team.
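To make the contrast concrete, here is a deliberately simplified sketch of the two non-linearities just described: the 10 to 50 per cent band for a grade 5 and the rule that no individual’s work may fall below national level. It illustrates the logic only and is not the actual HEFCE algorithm; the grade labels, the treatment of work above the 50 per cent band and the fallback grade are my own assumptions.

```python
# Simplified illustration of the non-linear HEFCE-style aggregation described
# in the text. NOT the real grading scheme: labels and thresholds outside the
# two rules quoted above are placeholder assumptions.

def department_grade(ratings):
    """ratings: per-person quality labels, each one of
    'international', 'national' or 'sub-national'."""
    share_international = sum(r == "international" for r in ratings) / len(ratings)

    # Weakest-link rule: no 5 (or 5*) if any individual falls below national level.
    if any(r == "sub-national" for r in ratings):
        return "4 or below"      # placeholder for the lower grades

    # Assumed placeholder: a majority of international work lifts the grade further.
    if share_international > 0.50:
        return "5*"

    # Band rule: anywhere from 10 to 50 per cent international work yields the same 5.
    if share_international >= 0.10:
        return "5"
    return "4 or below"

# Two departments of average size (17), one with 2 and one with 8 members rated
# 'international', receive the same grade under the band rule.
dept_a = ["international"] * 2 + ["national"] * 15
dept_b = ["international"] * 8 + ["national"] * 9
assert department_grade(dept_a) == department_grade(dept_b) == "5"
```

A regression, by contrast, would give dept_b a strictly higher predicted score, which is precisely the divergence discussed above.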

In coming to a judgement on the research output of individuals, the panel also took a non-linear approach. Suppose a submission had a very good book, but looked rather thin in other respects. Was the good book enough to offset the weaknesses? A judgement could be made that ‘a book was enough’. In the Butler and McAllister approach, by contrast, every item counts for one and no one for more than one. The large number of citations that one significant item receives is averaged over all items.

Table 2: Differences between Panel Process and OLS Regression Analysis

Panel approach versus OLS regression analysis:

- Panel: non-linear criteria of grading, based on the proportion of work falling into different classes. Regression: grading given by the best linear combination of independent variables.
- Panel: whole of the RAE submission, including those parts, like research culture, that are not assignable to individuals. Regression: partial RAE submission, based principally on cited output, as well as grant income and number of students.
- Panel: user input. Regression: no user input.

To note this process of non-linear aggregation might be taken to imply that the panel scores were simply a function of cited outputs recorded in RA2, but this would be a mistake. Indeed, HEFCE rules made it plain that it was necessary to take into account information on research culture in a department, information that cannot of course be expressed at the individual level. Butler and McAllister’s proxies for research culture – OST/BA income and national academy fellowships – do not capture what HEFCE intended, and indeed they appeared in different parts of the form from the RA5, which was the part devoted to research culture.

One can of course hold to the view that the only thing that should count is the aggregated evaluation of the work of the individuals in a department. (I am not a long way from this position myself.) However, that is clearly a distinct matter from seeking to replicate what the panel actually did. I cannot verify my memory, since HEFCE rules require all relevant papers to be destroyed, but to the best of my recollection there was at least one case of a department where the collective element did determine a grade when on other indicators it seemed to straddle two grades.

The final missing element from the Butler and McAllister assessment is the input of users. Politics and international studies had a separate user panel, which met on more than one occasion. The users were certainly important in confirming grades, and (again to the best of my recollection) there were two departments which were substantially helped by users’ advice.

The above offers some reasons as to why the Butler and McAllister statistical approach would not be expected a priori to match exactly the 2001 panel’s approach. One point where there is an interesting match, however, is on departmental size, the coefficient of which shows up as statistically insignificant in their model. In common with its predecessors, the 2001 panel sought to avoid rewarding departments simply because they were large (rightly in my view), so it is interesting to see the lack of a size effect.

Simulating Peer Review

One of the most interesting aspects of Butler and McAllister’s analysis is their simulation of what the RAE outcome would have been had it been based upon their model rather than the judgement of the panel. It is worth noting that their simulation does not simply use citations as a predictor but includes all the variables of their original regression model less the dummy variable indicating panel membership. The article notes (p. 12) that this reduced model still explains 55 per cent of the variance in predicting the panel’s scores.

Since most of the variables in the original model showed up as being statistically insignificant, in some ways I am puzzled by their inclusion in the simulation, even as controls. It may be that the model using only citations explains such a low proportion of the variance (38 per cent) that Butler and McAllister think that it would be unreliable on its own. However, that is not a point I want to pick up in the present context.

The simulation works by inserting the actual values of the independent variables into an equation whose coefficients are taken from the reduced model. When this is done, Butler and McAllister find that their model over-predicts eleven RAE departmental scores (that is, gives a higher score than that given by the panel) and under-predicts eighteen departments (that is, it gives the departments a lower score than the panel).

One difficulty with the simulation is that the predicted scores are not whole integers, since the predictive variables and their corresponding coefficients are not whole integers. There has then to be a step in the calculation by which a number with decimal places is transformed into a whole integer in order to replicate an RAE-type score. I understand (Ian McAllister, personal communication) that the cutting rule used was that of conventional rounding. That is to say, those with scores equal to or less than X.49 were rounded down and those with scores of X.50 or greater were rounded up. There is of course a certain artificiality in transforming decimals into whole integers, and this highlights a further reason why a procedure based on a linear and continuous functional transformation of data would not replicate the 2001 RAE procedure.
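Continuing the hypothetical sketch given earlier, the simulation and rounding step might look like the following. The reduced model simply drops the panel-membership dummy; the rounding line implements the conventional rule just described (numpy’s own round function uses banker’s rounding, hence the floor-plus-0.5 idiom). The data frame and column names remain assumptions, not the authors’ code.

```python
# Sketch of the simulation step: continuous predictions from the reduced model
# are rounded to whole grades before comparison with the panel's scores.
# Builds on the hypothetical 'departments' data frame and 'smf' import above.
import numpy as np

reduced_model = smf.ols(
    "rae_grade ~ log_citations + ost_ba_income + academy_members"
    " + staff + grad_students",
    data=departments,
).fit()

predicted = reduced_model.predict(departments)

# Conventional rounding: X.49 and below rounds down, X.50 and above rounds up.
simulated_grade = np.floor(predicted + 0.5).astype(int)

over_predicted = (simulated_grade > departments["rae_grade"]).sum()
under_predicted = (simulated_grade < departments["rae_grade"]).sum()
print(over_predicted, under_predicted)  # the article reports 11 and 18
```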

Panel Membership

No doubt one finding that will spark a lot of interest is the claimed effect on departmental grading that comes from having a member of a department on the panel. In this regard, Butler and McAllister’s argument should really be understood as falling into two parts. Firstly, in the regression equation modelling the panel’s decision making, panel membership is coded as a dummy variable with a statistically significant coefficient. Although the additional variance in scores explained by its inclusion is only 7 per cent, the statistically significant coefficient reveals a substantively large marginal contribution. The second part of the argument is that four of the five 5-star departments with panel members would have dropped a grade if their scores had been determined by the reduced model rather than panel judgement.
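In terms of the hypothetical sketches above, the ‘added variance’ and the marginal contribution can be read off directly; the figures in the comments are those reported in the article, not computed values.

```python
# Reusing the hypothetical 'model' (full) and 'reduced_model' fits sketched
# earlier: the added variance attributed to panel membership is the gap in
# R-squared, and the dummy's coefficient is its marginal contribution to the
# predicted grade.
print(model.rsquared - reduced_model.rsquared)  # article reports roughly 0.62 - 0.55 = 0.07
print(model.params["panel_member"])             # marginal effect of having a panel member
```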

This is potentially an important finding, but how far it is genuine rather than a statistical artefact is difficult to judge, given the problems noted above about how a continuous variable is converted into a discrete one. More importantly, there is also the question, hinted at in the article but not addressed statistically, of how far there is an endogeneity problem. All the members of the panel bar two were either previous panel members or nominations of the two major professional associations, the Political Studies Association and the British International Studies Association – and of course previous members were themselves nominees at some point. (The two exceptions were myself, appointed as a result of the decision of panel members serving in 1996, and the one ‘free’ slot I was given, which I used to cover what I perceived as a major gap in sub-disciplinary coverage left by the professional associations.) Given this background, panel members were therefore likely to be from departments widely esteemed in the profession, and so the association between membership and grade might run from rating to membership. Departments were not rated highly because they had a panel member; rather, panel members were disproportionately drawn from departments that were likely to be rated highly.

However, let us suppose that the causal arrow does run from panel membership to grade rather than the other way around. How might we account for this finding (apart from conspiracy or manipulation, explanations which I hope no one would entertain)? Two possibilities have occurred to me, though no doubt there are others.

The first is that mechanisms of social relationships come into play. Panel members are doing a hard job, often in rather uncongenial circumstances (there was an extremely shoddy hotel in Manchester that sticks in the memory) and under great pressure of time. There is bound to be some social bonding, and there may be a reluctance to embarrass a colleague, especially in a context in which a high proportion of panellists’ departments are getting good scores. This cannot be the whole story, of course, since one department with a panellist failed to get the score that it might have felt it deserved. But a good explanation does not have to be a complete explanation.

The second explanation is the one I favour. In my experience of three RAEs, the principal benefit of being on the panel is to realise how unforgiving the exercise is, particularly for departments that have a ‘tail’, that is to say a relatively high proportion of work judged as sub-national. (Remember the non-linear character of the aggregation.) Participants in the process may well have been involved in hard decisions in their own institutions, and while they will want to be fair to departments, they are not inclined to clemency. Hence the submitted staff in departments with panel members could have already met a more demanding threshold test than other departments. (This will not necessarily show in the proportion of staff submitted, since some of the effects will have taken the form of early retirement before the census date.)

This explanation is only a putative one, but it is capable of some empirical testing. It would be interesting to see the effects not only of panel membership, but also of panel membership across more than one RAE cycle, as well as of never having had a panel member. If my explanation is correct, then the effects of these variables should be as strong as – if not stronger than – that of concurrent panel membership.

McLean et al.

The most striking finding from the McLean et al. (2009) study is the extent to which there is international agreement between the UK, the US and Canada on the ranking of politics and international studies journals. In some ways this is not surprising. There is a great deal of international mobility in the form of conferences and the like, certainly between the UK and the US, and recruitment into the UK profession from those trained abroad has become quite high.

For journal editors like myself it is always interesting to see where one’s journal stands in the league tables, and it is naturally gratifying if the results are good. But what contribution can such analyses make to research evaluation? In terms of the distinction between peer review and metrics, Iain McLean et al. suggest that expert reputational surveys can help counterbalance the biases implicit in metrics drawn from citation analysis (p. 20). But how far would this correction of bias, if it could be achieved, enable metrics to substitute for peer review in research evaluation?

One thing is certain. The method used by McLean et al. has a validity that is lacking in some other attempts to rank journals. The most egregious example of poor methodology at present in this regard is the European Reference Index for the Humanities (ERIH). Drawn up by the European Science Foundation in apparent ignorance of existing ways in which journal gradings might be reliably assessed, the ERIH seeks to place journals into three categories, depending on their degree of general interest, based on the decision of a small committee of experts (details available from: http://www.esf.org/research-areas/humanities/research-infrastructures-including-erih.html). Political theorists will be familiar with many of the journals on the philosophy list, and I would be surprised if they were not surprised by some of the categorisations. The chief difficulty is that the pool of expertise on which the relevant committees draw is too small to be reliable, and there are other difficulties noted by a 2007 British Academy working group report on ‘Peer Review’ (British Academy, 2007, pp. 34–7, though I declare an interest, since I chaired that working group).

However, to say that a method has some validity as a reputational assessment is not to say that it can be used in a metrics-based form of research assessment. There are at least three issues in this regard. Firstly, even supposing that the rankings gave reliable evaluations of the quality of journals, one would still not be able to infer merely from publication in a particular journal the quality of research that it represents. Within-journal variation of quality undoubtedly exists. To be sure, with highly ranked journals there is likely to be competition for publication and, to the extent that competition is a quality filter, the journal name will provide some evidence of quality. However, it is unlikely that a research assessment that relied on such an indirect measure of the quality of any one piece of work would command legitimacy.

Secondly, even if one could read off the quality of an article published in a particular journal from the standing of that journal, the rankings provide no clear cut-off points. One could say that the ability to get a piece in one of the top ten was a mark of quality, but almost by definition that test is only going to be passed by a small proportion of the profession. A national research evaluation needs to operate across the full range of output.

Thirdly, there is the problem of how to balance the quality associated with general, and therefore widely known, journals and the quality associated with specialist journals. Consider the case of Philosophy and Public Affairs. In the UK sample this is rated top on pure quality by respondents, and many political theorists are likely to agree with this assessment. However, on the overall ranking it comes out only in twentieth position, because its specialist nature makes it less well known to the profession at large. A similar point could be made about History of Political Thought, ranked fifth on quality but twenty-seventh overall. The point here is not that these discrepancies are a fault with the method; they are precisely the discrepancies one would expect from an analysis that combines a judgement of quality with a judgement of familiarity. It is simply that such varied rankings by journal mean that the approach cannot serve as a simple metric replacing peer review.

How then might we imagine peer-review assessment using the results of studies like those of McLean et al.? There are at least two possibilities. Peer-review panels may want to check their own assessments of the quality of journals against a broader sample of professional judgement. They may also wish to consider whether those who are working in applied areas, and therefore likely to be producing in highly specialist outlets or grey literature, can nonetheless on occasion secure publication in journals that are generally highly rated. In this context, having evidence from a wide sample is useful.

Policy Implications

One question has guided my remarks so far: can metrics replace peer review or must they remain an adjunct to peer review? In the case of the reputational assessment of journals, I have suggested that the data will always remain an adjunct to peer review. What of citation analysis?

One advantage of citation analysis is that in place of the judgements of a relatively small number of people, judgements are drawn from a relatively large number of people. Of course, the benefits of size can be overstated. Even a well-cited journal in politics and international relations may only receive some one hundred citations to all its articles in a given year, say an average of three citations per article. Yet one could still argue that there was merit in the wider sampling of judgements in the scholarly community that metrics allow, in contrast to the sample of a small expert panel.

Expert panels are made up of sub-disciplinary experts. One aspect that has troubled me in the three RAE panels on which I have served is the influence that any particular sub-disciplinary expert can have. This is not a matter of prejudice or self-interest on their part. It is simply that there is inevitably sub-disciplinary deference on a panel. If one’s expert on political iconography in Iceland (I hope this is a sufficiently fictionalised example) says that a particular study of iconography is good or not so good, there is little that other members of the panel can do to challenge the judgement. Historically, given the non-linear nature of the RAE scoring system, such judgements have been important.

Another possible advantage of citation analysis is related to the ‘light touch’ that HEFCE has said it is aiming for with research assessment. When I first started as a lecturer in the 1970s, universities used to collect details of publications each year for inclusion in the vice-chancellor’s report (these were the days when vice-chancellors were expected to be interested in scholarly matters). It is not unreasonable to ask academics once a year to upload to a centralised database the details of their publications, and this database could form the basis of metrics assessment. (Whether HEFCE is capable of purchasing user-friendly software for such an enterprise is another matter.)

One advantage such a database would have is that it would ease the burden of those policy discussions on such questions as to whether four or any other number of publications should be cited. With a relatively complete database and citation analysis, it would be possible to see how much of a difference counting different numbers of publications made. Moreover, under present arrangements a great deal of anxiety is created by the selection of items to be submitted. Much of that anxiety could be lifted.

All of this presupposes that a research evaluation has access to good data that are an adequate sample of what is being measured. The exclusions listed in Table 1 still need to be overcome. In my experience, research administrators stare blankly when you mention problems of sample truncation bias. So there would be a large job to be done of educating our masters, as well as data to be gathered.

There is also a problem about timing. An expert panel can read and assess very recent material. Citation analysis depends upon the window of citation being right. This is less a lagged-citation issue (the claim that citations in the social sciences take longer to appear) than a policy issue. The Butler and McAllister study took its window of citation up to 2007 (Linda Butler, personal communication), and they note in their article that citation rates did not vary significantly between those published early in the RAE cycle and those published late. Is it right, however, for a research assessment devoted to allocating research funding for a five-year period from the point of assessment to be using data that reflect performance that is more than five years old? It would of course be possible to avoid this problem of time lag by conducting rolling metrics-based analysis, and there might be advantages if metrics allowed for more continuous assessment, not least getting away from the big push associated with current research evaluation. How far and in what ways that is desirable needs to be considered.

What behavioural incentives will be set up by citation analysis? To be sure, not all such incentives are undesirable. If citation analysis leads people to focus on producing more highly cited work, that might be a worthwhile aim. However, we should be alert to possible dangers. One possible shift is away from specific case studies towards more general comparative work. There is of course a large methodological discussion in politics and international studies on the relative merits of idiographic versus law-like approaches, a debate that goes back at least as far as the controversy between James Mill and Thomas Macaulay. It is not difficult to see how metrics assessment could lead to a greater emphasis upon the comparative and general and less emphasis upon the specific and historical. Different people will no doubt have differing views as to whether that would be desirable, and it is not my intention to take sides in that debate. However, it would be unfortunate if such a shift arose as an incidental and unintended by-product of a particular way of measuring research quality.

Another effect might be to exacerbate the conflict between policy-relevant work and work that will secure wider international recognition. Again there are large questions here. My own view is that there is a confusion in the minds of those thinking about research funding who tend to equate policy relevance with work on the policy process. That notwithstanding, there is no doubt that implementation studies, for example, are important forms of policy-relevant work. However, such studies may be of limited interest and so not get cited widely. The trade-offs between a metrics assessment of quality and a disincentive to do certain types of policy-relevant work need to be considered.

The premise of the analysis by McLean et al. is that there is enough concern about the validity of citation analysis – the most promising form of pure metrics assessment – for a reliable research evaluation to need to supplement citation measures with other measures like those drawn from reputational analysis. If this is so, then any research assessment will need a group of people to decide how to balance the evidence that comes from distinct and not always consistent sources, like those of citation and reputational analyses. To my mind, that looks like a continuation of the centrality of peer review for some time to come.

(Accepted: 18 August 2008)

About the Author

Albert Weale FBA, Department of Government, University of Essex, Colchester CO4 3SQ, UK; email: [email protected]

References

British Academy (2007) Peer Review: The Challenge for the Humanities and Social Sciences. London: The British Academy. Available from: http://www.britac.ac.uk/reports/peer-review/index.html [Accessed 14 August 2008].

Butler, L. (2006) ‘RQF Pilot Study – History and Political Science Methodology for Citation Analysis’ (mimeo).

Butler, L. and McAllister, I. (2009) ‘Metrics or Peer Review? Evaluating the 2001 UK Research Assessment Exercise in Political Science’, Political Studies Review, 7 (1), 3–17.

CHASS (2006) ‘CHASS Bibliometrics Project Political Science and History Panels: Report on Recommendations and Major Issues’ (mimeo).

Hix, S. (2004) ‘A Global Ranking of Political Science Departments’, Political Studies Review, 3 (2), 293–313.

McLean, I., Blais, A., Garand, J. C. and Giles, M. (2009) ‘Comparative Journal Rankings: A Survey Report’, Political Studies Review, 7 (1), 18–38.
