
    MPHIL IN STATISTICAL SCIENCE

    2004-2005

    Applied Project Titles 2005

    (as summarised by their authors)

Paul Baines .......................... Election Forecasting Techniques for the UK General Election
Nicolas Casini ....................... Applying QDA Analysis to Internet Traffic Classification
Loic Chenevier ....................... Different Numerical Methods in Option Pricing
Andrew Fisher ........................ Parametrizing Gaussian Interest Rate Models
Hayley Jones ......................... Measuring Effect Size in Area-Based Crime Prevention Research
Li Li ................................ Conditional Correlations of Short-time-scale Stock Prices
Yin Lim .............................. An Analysis of Travellers' Preferred Time of Departure
Carr-Yii Loh ......................... Quantitative Analysis of Hedge Fund Data
Urvish Patel ......................... Portfolio Construction - Theory and Practice
Christiana Spyrou .................... Modelling the Performance of Health Institutions and Identifying Unusual Performance using MCMC Methods
Alfred Truong ........................ Large Deviations in the Problem of Optimal Choice of New Business
Wiebke Werft ......................... Travel Time Prediction in Road Networks
Mary-Rose Wilkinson .................. Analysing the Frequencies of Loop Lengths of Genomic G-Quadruplex Structures
Shujing Zhang ........................ Comparative Mortality Models

    Paul Baines

    Election Forecasting Techniques for the UK General Election

    This project was supervised by Dr. S. Brooks.

General elections are subject to ever-increasing media coverage and scrutiny. An important aspect of this is the prediction of the result throughout the night as more and more results are announced. This project begins by analysing some of the techniques used to predict the outcome, focusing upon ridge regression. Limitations and practical issues are discussed and the motivation behind the model design is emphasised. Extensions to the model are then proposed to incorporate confidence intervals, with the whole process implemented using R.

Simulation studies were then used to select the optimal combination of model inputs before executing the chosen prediction method on the May 5th general election. Performance is analysed, with alternative models then proposed in light of the successes and failures of the technique.
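The project's own R code is not reproduced here. Purely as an illustration of the kind of ridge-regression step involved, a minimal sketch might look as follows; the data frames, variable names and the use of MASS::lm.ridge with GCV selection are all my assumptions, not the project's actual model.

    ## Illustrative only: ridge regression for seat-level swing prediction.
    ## 'declared' (seats already declared) and 'pending' (undeclared seats)
    ## are invented stand-ins for the real election-night inputs.
    library(MASS)

    set.seed(1)
    declared <- data.frame(swing  = rnorm(60, -3, 2),    # observed swing (%)
                           lab01  = rnorm(60, 45, 8),    # previous Labour share
                           con01  = rnorm(60, 35, 8),    # previous Conservative share
                           region = factor(sample(1:5, 60, TRUE)))
    pending  <- declared[1:10, ]                         # pretend these are undeclared

    ## Fit ridge regression of swing on the covariates over a grid of
    ## penalties and keep the value minimising generalised cross-validation.
    fit  <- lm.ridge(swing ~ lab01 + con01 + region, data = declared,
                     lambda = seq(0, 20, by = 0.1))
    beta <- coef(fit)[which.min(fit$GCV), ]

    ## Predict swing in the undeclared seats (manual matrix product,
    ## since lm.ridge has no predict method).
    X    <- model.matrix(~ lab01 + con01 + region, data = pending)
    drop(X %*% beta)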

    Nicolas Casini

    Applying QDA Analysis to Internet Traffic Classification

    This project was supervised by Dr. Andrew Moore of the Computer Laboratory.

There is nowadays a growing interest in traffic classification using machine learning techniques. Dr. Moore has previously studied the efficiency of the Naive Bayes classifier. We aim to see if a generalised version of Naive Bayes can significantly improve the traffic classification results.


The data made available to us consist of 10 datasets of traffic flows. Each traffic flow has been previously hand-classified into one of 10 categories, and is described by 248 attributes. Techniques such as Naive Bayes and Bayesian Neural Networks have already been applied to these data. We are now interested in examining and comparing the results obtained using Linear and Quadratic Discriminant Analysis.

A certain number of simplifications need to be applied to the original framework in order to train our new classifiers. We reduce the set of attributes used and the number of categories that we want to discriminate. The results obtained using Quadratic Discriminant Analysis are significantly better than those from the Naive Bayes classifier tested in the same context.

Future work would include finding a way to use all the attributes, and using datasets with a homogeneous number of flows from each category, so as to see the behaviour of Discriminant Analysis in a general context including all of them. This would enable a more accurate comparison with the results obtained from previous work.

This project required the use of version 3.4 of the WEKA software suite and of R on Windows to perform our classifications.
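As a rough illustration of the comparison being made, a small R sketch on simulated two-class "flow" data is given below; the feature names, the two-class setting and the use of the MASS and e1071 packages are my assumptions (the real study used 248 attributes and 10 categories).

    ## Illustration only: naive Bayes versus LDA/QDA on simulated flow data.
    library(MASS)     # lda(), qda()
    library(e1071)    # naiveBayes()

    set.seed(2)
    n     <- 400
    class <- factor(sample(c("WWW", "MAIL"), n, TRUE))
    flows <- data.frame(class    = class,
                        pkt_size = rnorm(n, ifelse(class == "WWW", 800, 300), 150),
                        duration = rnorm(n, ifelse(class == "WWW", 2, 10), 3))

    train <- flows[1:300, ]; test <- flows[301:400, ]
    acc   <- function(pred) mean(pred == test$class)   # test-set accuracy

    nb <- naiveBayes(class ~ ., data = train)
    ld <- lda(class ~ ., data = train)
    qd <- qda(class ~ ., data = train)

    c(naive_bayes = acc(predict(nb, test[, -1])),
      lda         = acc(predict(ld, test)$class),
      qda         = acc(predict(qd, test)$class))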

    Loic Chenevier

    Different Numerical Methods in Option Pricing

    This project was supervised by Prof L.C.G. Rogers.

In this report, we give a general overview of different numerical methods commonly used in option pricing. We first describe our framework, the Black-Scholes model, and the options that will be studied in this paper. Then, we discuss three numerical methods to price them: Binomial Trees, Finite Difference Methods and Monte Carlo valuation.

We will point out that the binomial tree method and the finite difference method are quite similar: both use a backward algorithm based either on a pricing formula or on the Black-Scholes equation, whereas the last method is intrinsically different in that it uses martingales and Monte Carlo simulation. This is a recent approach drawn from a paper by my supervisor, Professor L.C.G. Rogers, University of Cambridge. To compare these methods we will examine the following criteria:

Generality: a numerical method is said to be general if it can be applied to many problems.

Difficulty and speed: usually, numerical methods are coded on computers or workstations in parallel. The complexity of the algorithms used in the code is directly related to the number of computers necessary to solve the problem, and it has a direct impact on speed. The calculations are performed throughout in Scilab and could be expected to speed up considerably if written in compiled code; typically one expects a speed-up of 5-10 times when going to compiled code, more if there is a lot of looping in the Scilab code.

Accuracy: numerical methods are only approximations of the true solutions of problems. Accuracy and speed of convergence are the main criteria determining the efficiency of a method. A numerical method is accurate if the approximation is close enough to the true solution.

In a nutshell, the aim of this paper is to show the drawbacks and advantages of the different numerical techniques in terms of these criteria. We will focus on the case of the standard American put option.
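The report's numerical work was carried out in Scilab and is not reproduced here. As a small sketch of just the first of the three methods, a Cox-Ross-Rubinstein binomial tree for the American put is shown below in R (for consistency with the rest of this document); the parameter values are chosen purely for illustration.

    ## Illustration only: CRR binomial tree price of an American put.
    ## S0 spot, K strike, r rate, sigma volatility, T maturity, n steps.
    american_put_crr <- function(S0, K, r, sigma, T, n) {
      dt <- T / n
      u  <- exp(sigma * sqrt(dt)); d <- 1 / u
      p  <- (exp(r * dt) - d) / (u - d)      # risk-neutral up probability
      disc <- exp(-r * dt)

      ## terminal stock prices and payoffs
      S <- S0 * u^(0:n) * d^(n:0)
      V <- pmax(K - S, 0)

      ## backward induction, allowing early exercise at every node
      for (i in n:1) {
        S <- S0 * u^(0:(i - 1)) * d^((i - 1):0)
        V <- pmax(K - S, disc * (p * V[2:(i + 1)] + (1 - p) * V[1:i]))
      }
      V
    }

    american_put_crr(S0 = 100, K = 100, r = 0.05, sigma = 0.2, T = 1, n = 500)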


    Andrew Fisher

    Parametrizing Gaussian Interest Rate Models

    This project was supervised by Dr. D.P. Kennedy.

This project examines interest rate models using Treasury Constant Maturity rates from 1996. The first part analyses the Vasicek model; the second part investigates the Kennedy model. Traditional attempts to parametrize these models rely heavily on curve interpolation methods. This project considers ways of making greater use of the observed yields themselves.
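The project's data and estimation procedure are not reproduced here. As a minimal illustration of the Vasicek short-rate model itself, the R sketch below simulates the model from its exact Gaussian transition and recovers the parameters from the implied AR(1) regression; this regression-based estimate is my own illustrative choice, not necessarily the method used in the project, and the parameter values are invented.

    ## Illustration only: simulate a Vasicek short rate
    ##   dr_t = kappa * (theta - r_t) dt + sigma dW_t
    ## via its exact transition, then recover (kappa, theta, sigma)
    ## from the AR(1) regression r_{t+dt} = a + b r_t + eps.
    set.seed(3)
    kappa <- 0.5; theta <- 0.05; sigma <- 0.01
    dt <- 1 / 252; n <- 5000
    b_true <- exp(-kappa * dt)
    sd_eps <- sigma * sqrt((1 - b_true^2) / (2 * kappa))

    r <- numeric(n); r[1] <- 0.04
    for (t in 2:n)
      r[t] <- theta + (r[t - 1] - theta) * b_true + rnorm(1, 0, sd_eps)

    fit   <- lm(r[-1] ~ r[-n])
    a_hat <- unname(coef(fit)[1]); b_hat <- unname(coef(fit)[2])
    k_hat <- -log(b_hat) / dt
    c(kappa = k_hat,
      theta = a_hat / (1 - b_hat),
      sigma = summary(fit)$sigma * sqrt(2 * k_hat / (1 - b_hat^2)))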

    Hayley Jones

    Measuring Effect Size in Area-Based Crime Prevention Research

    This project was supervised by Prof D.P. Farrington (Criminology).

In this report, I consider two data sets which were collated and originally analysed by Farrington and Welsh (2002a, 2002b). Their aim was to systematically review the results of a range of previous studies examining the effects of two area-based crime prevention strategies on crime rates. The statistical procedure involved is called meta-analysis.

    The two strategies or interventions evaluated were:

    Introduction of CCTV surveillance

    Improved street lighting

The data consist of before and after numbers of crimes (mostly according to police records) in experimental and comparable control areas from 19 CCTV studies and 11 street lighting studies. Data of a slightly different type from a further two street lighting studies were also provided.

Essentially, then, all of the CCTV data and the majority of the street lighting data can be presented as a series of 2 x 2 contingency tables, where the table for the jth study is of the form:

                     Before   After
    Experimental      a_j      b_j
    Control           c_j      d_j

where a_j is the number of crimes in the experimental area in the before period for the jth study, etc., and we have a total of k studies.

The aim in Chapter 2 is to measure the effect size (i.e. estimate the effectiveness of the intervention) for each individual study, and to estimate the distribution of this measure in order to test the hypothesis of no effect. The measure used throughout this report is an estimate of the odds-ratio, which is defined in Section 2.2.

The remaining chapters consist of attempts to pool the results from the k studies, in order to produce an estimated population odds-ratio (and an estimate of its variance). This should give an indication of the overall effect of each intervention on crime rates.


In Chapter 3, I detail the methods used by Farrington and Welsh (2002a, 2002b) to estimate a population odds-ratio and its variance. This involves a weighted means approach. However, it becomes evident that one of the assumptions made does not hold. Sections 3.5 and 3.6 outline two alternative approaches to dealing with the problem of overdispersion within the weighted means framework.
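The exact definition and weights are given in Section 2.2 and Chapter 3 of the report. Purely as an illustration of the standard calculation on which such an analysis rests, an R sketch follows; the counts are invented, the log odds-ratio variance is the usual large-sample formula, and the pooling shown is ordinary fixed-effect inverse-variance weighting, which may differ in detail from Farrington and Welsh's weighted means.

    ## Illustration only: per-study odds ratios from 2 x 2 before/after tables
    ## and a fixed-effect (inverse-variance weighted) pooled estimate.
    ## Columns a, b, c, d follow the table above; the counts are invented.
    studies <- data.frame(a = c(120, 85, 40),   # experimental, before
                          b = c( 90, 80, 25),   # experimental, after
                          c = c(100, 60, 35),   # control, before
                          d = c(110, 70, 38))   # control, after

    log_or  <- with(studies, log((a * d) / (b * c)))
    var_log <- with(studies, 1/a + 1/b + 1/c + 1/d)   # large-sample variance
    w       <- 1 / var_log

    pooled     <- sum(w * log_or) / sum(w)
    pooled_var <- 1 / sum(w)
    exp(pooled + c(estimate = 0, lower = -1.96, upper = 1.96) * sqrt(pooled_var))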

In Chapter 4, I show how a population odds-ratio can be estimated almost equivalently by fitting a Binomial generalised linear model (GLM). As expected, the same problem of overdispersion emerges. I outline a method of dealing with this in Section 4.7.

I then introduce a generalised linear mixed model (GLMM) in Chapter 5. This is an extension of the GLM approach which inherently takes the overdispersion into account, by modelling the study effect as random. New estimates of a population odds-ratio and its variance can be obtained by fitting this model.

The methods are implemented on both data sets in each chapter and the results are then summarised in Chapter 6.

Overall, I conclude from my analysis that there is evidence that improved street lighting is effective in reducing crime rates. CCTV, however, is not found to be significantly effective when the results of all studies are pooled.

    Statistical analysis is carried out using R Version 2.0.1 and also (in Chapter 5) S-Plus 6.

    Li Li

    Conditional Correlations of Short-time-scale Stock Prices

    This project was supervised by Dr. Sean Blanchflower of Autonomy Systems Ltd.

It is a common belief among traders that the movements of certain share prices are strongly correlated with those of other shares on very small time scales. The aim of this project is to assess the volatility of stock price returns and their conditional correlations; the investigation of equity tick data is intended to derive statistical relationships between pairs or bundles of stocks on very short time scales. The time-dependent nature of any correlation is particularly emphasized. Moreover, the correlation results can be used to decide when traders should act to make money: for example, if Nokia is going up but Ericsson hasn't yet, they should buy Ericsson now. Statistical techniques are then used to arrive at dynamic confidence levels for the conclusions reached.

In this project, I adopt a vector autoregressive (VAR) model to derive the mean function of stock returns, and multivariate generalized autoregressive conditional heteroskedasticity (M-GARCH) models (developed by Bollerslev, Engle and Wooldridge (1988)), the CCC-MGARCH model of Bollerslev (1990), and the DCC-GARCH model of Engle (2001) and Engle and Sheppard (2002) to model the time-variation in conditional correlation and volatility. The multivariate models capture time variation in volatility and the influence of stock prices on one another. In addition, in the hour after the market opens and the hour before it closes, market makers and day traders tend to balance their positions, which can make stocks move in opposite directions for a very short period of time. It is therefore also important to report evidence concerning opening- and closing-hour effects on price movements and volatility. From these analyses we find that the conditional correlations between the returns can be approximately constant over several consecutive days, except during the opening and closing hours.

    The software used in this project is S+Finmetrics.
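The project's S+Finmetrics code is not reproduced here. A rough modern R analogue of the DCC step, using the rugarch and rmgarch packages on simulated returns, might look like the sketch below; the series names, the AR(1)+GARCH(1,1) specification and the packages themselves are my assumptions, not the project's code.

    ## Illustration only: a DCC-GARCH(1,1) fit on two simulated return series.
    library(rugarch)
    library(rmgarch)

    set.seed(4)
    ret <- matrix(rnorm(2 * 1000, 0, 0.01), ncol = 2,
                  dimnames = list(NULL, c("NOK", "ERIC")))   # stand-in returns

    uspec <- multispec(replicate(2, ugarchspec(
      variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
      mean.model     = list(armaOrder = c(1, 0)))))          # AR(1) mean per series

    spec <- dccspec(uspec, dccOrder = c(1, 1), distribution = "mvnorm")
    fit  <- dccfit(spec, data = ret)

    ## time-varying conditional correlation between the two series
    cor_t <- rcor(fit)[1, 2, ]
    plot(cor_t, type = "l", ylab = "conditional correlation")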


    Yin Lim

An Analysis of Travellers' Preferred Time of Departure

    This project was supervised by Dr. James Fox (RAND Europe) and Hugh Gunn (HGA Ltd).

This project takes a dataset on travel patterns provided by an external company (RAND Europe) and analyses it, finding patterns and relationships between the variables. It is exploratory and open-ended in nature, involving extensive use of the R statistical software, but the complexity of the programs written is small.

Interest in selection of time of departure has increased recently with the recognition that a change in departure time is the second most common response, after a change in route, to changes in traffic conditions. This project is concerned with finding variables which are correlated with time of departure in some manner, in order to understand departure time choices better.

In this project the departure times of two distinct groups of travellers are analysed: commuters and the elderly. In each group, variables which might be thought of as having an effect on departure time are analysed to observe the actual effect they have on departure time.

    Carr-Yii Loh

Quantitative Analysis of Hedge Fund Data

    This project was supervised by Dr. Meena Lakshmanan of Atlas Capital Group.

This project is divided into five sections. The first section attempts to fit the distribution of hedge fund returns based on the Anderson-Darling test, as well as to explore the relationship between the different funds and the various market indices. The second and third sections attempt to estimate the covariance matrix, a key input for mean-variance analysis, based on the exponentially weighted moving average methodology and the GARCH (generalized autoregressive conditional heteroskedasticity) model. The fourth section discusses copulas in detail, a tool that enables us to model dependence more effectively and hence allows us to measure the risk of a portfolio of funds more accurately. The last section introduces cluster analysis, which enables the classification of the many different funds in a quantitative manner.
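As an illustration of the covariance step only, the R sketch below computes a RiskMetrics-style exponentially weighted moving-average covariance matrix; the decay factor 0.94, the simulated fund returns and the initialisation at the sample covariance are my assumptions, not the project's choices.

    ## Illustration only: EWMA covariance matrix of fund returns,
    ##   S_t = lambda * S_{t-1} + (1 - lambda) * r_t r_t'
    ewma_cov <- function(returns, lambda = 0.94) {
      S <- cov(returns)                    # initialise at the sample covariance
      for (t in seq_len(nrow(returns))) {
        r <- as.numeric(returns[t, ])
        S <- lambda * S + (1 - lambda) * tcrossprod(r)
      }
      S
    }

    set.seed(5)
    fund_returns <- matrix(rnorm(250 * 3, 0, 0.02), ncol = 3,
                           dimnames = list(NULL, c("fundA", "fundB", "fundC")))
    ewma_cov(fund_returns)                 # latest EWMA covariance estimate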

    Urvish Patel

Portfolio Construction - Theory and Practice

    This project was supervised by Dr. Antony Ledford of Man Investment Products.

During the past decade, portfolio management has emerged as one of the most exciting and important topics within finance and applied economics. Curricula within schools of business administration have been expanded to include courses in investments and security analysis as well as courses in portfolio management. The technical journals of economics and finance contain a growing number of articles dealing with portfolio theory and management.

The work of both academics and practitioners within the field of investment can be traced to a concern with individuals. The teaching of investment prepares students for careers in the securities business or familiarises them with the investment environment in which they may eventually participate.


Research within the investment field is intended to provide a better understanding of the investment process, to solve particular problems of the securities industry, or to develop guidelines for improving investment decisions. The practice of investment management by financial institutions is part of the total package of services provided for individuals.

Traditionally, investors focused on assessing the risks and rewards of individual securities in constructing their portfolios. Standard investment advice was to identify those securities that offered the best opportunities for gain with the least risk and then construct a portfolio from these.

It was not until the work of Harry M. Markowitz that the investment world changed. He introduced the idea of diversification, which revolutionised the way people invested. He proposed that investors focus on selecting portfolios based on their overall risk-reward characteristics instead of merely compiling portfolios from securities that each individually have attractive risk-reward characteristics. In a nutshell, investors should select portfolios, not individual securities.

This project aims to use computational methods to construct an optimal portfolio and to assess some of the underlying assumptions of the mean-variance methods introduced by Harry M. Markowitz.
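As a small illustration of the mean-variance idea (not the project's own computations), the R sketch below forms the closed-form minimum-variance and tangency portfolios from estimated means and covariances; the simulated returns and the zero risk-free rate are assumptions made purely for the example.

    ## Illustration only: minimum-variance and tangency (maximum Sharpe)
    ## portfolios from sample moments, using the closed-form Markowitz
    ## solutions with a zero risk-free rate.
    set.seed(6)
    R     <- matrix(rnorm(250 * 4, 0.002, 0.01), ncol = 4,
                    dimnames = list(NULL, paste0("asset", 1:4)))
    mu    <- colMeans(R)
    Sigma <- cov(R)
    ones  <- rep(1, length(mu))

    w_minvar <- solve(Sigma, ones); w_minvar <- w_minvar / sum(w_minvar)
    w_tan    <- solve(Sigma, mu);   w_tan    <- w_tan    / sum(w_tan)

    rbind(min_variance = w_minvar, tangency = w_tan)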

    Christiana Spyrou

Modelling the Performance of Health Institutions and Identifying Unusual Performance using MCMC Methods

    This project was supervised by Dr. David Ohlssen of the MRC Biostatistics Unit.

In order to improve the quality of healthcare, hospital outcome data are statistically analysed to produce measurements of performance, known as performance indicators. A careful presentation is needed that does not result in unfair criticism or unjustified praise. Using Bayesian hierarchical models and Markov Chain Monte Carlo methodology we carry out a cross-sectional analysis of routinely collected data in order to model the performance of health institutions and highlight possible high and low performers. We have mortality rates as our main outcome measure and we compare 100 hospitals in the United Kingdom.

In the General Linear Models framework we want to compare the most commonly used approaches with more flexible distributions that may be more suitable for our data. The statistical program WinBUGS gives us the opportunity to work with non-parametric and more complex distributional assumptions and provides model diagnostics to compare them.

We are interested in identifying unusual performance by centres that appear to diverge significantly from what is expected and would require further analysis, using more information, to classify them as outliers. To do so we use cross-validation, a technique that compares each centre with the expected outcome predicted from the distribution formed by the rest of the data, and a ranking approach, which compares the rank of each centre with those of the other observations.
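The project itself is implemented in WinBUGS with richer (including non-parametric) hierarchical models. Purely to illustrate the shape of a Bayesian hierarchical analysis of centre-level mortality, the R sketch below fits a toy normal-normal model on the logit scale with a hand-written Gibbs sampler; the data, priors and ranking step are all invented for the example.

    ## Illustration only: toy hierarchical model for hospital logit mortality,
    ##   y_i ~ N(theta_i, v_i),  theta_i ~ N(mu, tau2),  fitted by Gibbs sampling.
    set.seed(7)
    n_hosp <- 100
    deaths <- rbinom(n_hosp, 500, plogis(rnorm(n_hosp, -3, 0.3)))
    y      <- qlogis(deaths / 500)                 # observed logit mortality
    v      <- 1 / deaths + 1 / (500 - deaths)      # approximate sampling variance

    n_iter <- 5000
    theta  <- y; mu <- mean(y); tau2 <- var(y)
    theta_draws <- matrix(NA, n_iter, n_hosp)
    for (s in 1:n_iter) {
      prec  <- 1 / v + 1 / tau2                    # hospital effects
      theta <- rnorm(n_hosp, (y / v + mu / tau2) / prec, sqrt(1 / prec))
      mu    <- rnorm(1, mean(theta), sqrt(tau2 / n_hosp))
      tau2  <- 1 / rgamma(1, 0.001 + n_hosp / 2,
                          rate = 0.001 + sum((theta - mu)^2) / 2)
      theta_draws[s, ] <- theta
    }

    ## rank hospitals by posterior mean effect; extreme ranks are candidates
    ## for the kind of unusual performance that would be investigated further
    order(colMeans(theta_draws[-(1:1000), ]))[1:5]   # five lowest estimated rates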


    Alfred Truong

    Large Deviations in the Problem of Optimal Choice of New Business

    This project was supervised by Prof Y. Suhov and Dr. M. Kelbert.

The conclusion of my project is that, given a second business with known premium rate, incoming claims rate and claim sizes, we can decide whether or not we should open it in the large deviations regime. Due to lack of time, I was unable to confirm the results numerically, which would have been very satisfying.

    Wiebke Werft

    Travel Time Prediction in Road Networks

    This project was supervised by Dr. R.J. Gibbens of the Computer Laboratory.

Travel time prediction has been an interesting research area for decades, during which various prediction models have been developed. Forecast travel time information is useful for drivers' pre-trip planning: drivers can decide their departure time and mode choice based on the forecast travel time.

The Motorway Incident Detection and Automatic Signalling (MIDAS) systems provide data summaries of traffic numbers, speeds and vehicle types from loop detectors which sense traffic passing over them. The loop sensors are sited at nominal 500-metre intervals along the carriageway and produce traffic counts per minute per lane by vehicle type (lorries, vans, cars, motorbikes), and record average speeds, average headway (distance between adjacent vehicles), average occupancy (percentage of time the detector is covered by a vehicle) and flow for each lane, also per minute.

This project focuses on how to produce a journey time predictor for individual motorists planning journeys between two points on the strategic highway network. The predictions are based on MIDAS data collected at the south-west quadrant of the M25 near London. The statistical methodology used for the travel time prediction was developed by Rice & van Zwet (2004) for a loop sensor area in California and applies linear models with varying coefficients. These are linear models whose parameters are themselves functions of the time of day and of the lag between the time at which the data are available and the time at which the journey is to start. The model is fitted using weighted least squares regression, producing smooth estimates of the regression parameters. The main achievement of this project is the implementation of this weighted least squares regression methodology and its application to these data.
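A heavily simplified R sketch of the weighted least squares step is given below, in the spirit of Rice & van Zwet (2004): for a chosen time of day, the regression of journey time on a "current status" travel time is fitted with kernel weights that decay away from that time. The data frame, the Gaussian kernel, the bandwidth and the omission of the lag dimension are all my simplifying assumptions, not the project's implementation.

    ## Illustration only: a varying-coefficient travel-time predictor.
    ## 'hist' is an invented stand-in for the MIDAS-derived training data.
    set.seed(8)
    hist <- data.frame(time_of_day = runif(2000, 6, 22),        # hours
                       status_time = rnorm(2000, 35, 8))        # current estimate (min)
    hist$journey_time <- 5 + 0.9 * hist$status_time +
                         3 * sin(pi * (hist$time_of_day - 6) / 16) + rnorm(2000, 0, 3)

    predict_travel_time <- function(t0, status_now, bandwidth = 1) {
      w   <- dnorm(hist$time_of_day - t0, sd = bandwidth)       # kernel weights
      fit <- lm(journey_time ~ status_time, data = hist, weights = w)
      predict(fit, newdata = data.frame(status_time = status_now))
    }

    predict_travel_time(t0 = 17.5, status_now = 42)   # forecast for a 17:30 start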

By examining the root mean squared errors it can be observed that a predictor of travel time based on this varying-coefficient model outperforms naive predictors such as the historical mean or the frozen-field predictor.

The statistical analysis was completed using R; see the appendix for the code.


    Mary-Rose Wilkinson

    Analysing the Frequencies of Loop Lengths of Genomic G-Quadruplex Structures

This project was supervised by Dr. Julian Huppert of Trinity College and the Cambridge University Chemical Laboratory.

In this project I analyse data collected by Dr Julian Huppert on the loop lengths of putative G-quadruplex structures identified in the human genome. This analysis shows that there are statistical structures present in the data which indicate that at least some proportion of these putative G-quadruplex structures actually do form G-quadruplex structures under certain physiological conditions.

DNA is a long polymer made up of a series of units called nucleotides. Each nucleotide consists of a sugar, a phosphate group and an organic molecule (a heterocycle) called a base, which is one of adenine (A), cytosine (C), guanine (G) or thymine (T). It is well known that the usual structure of DNA is the double helix (like a spiral staircase, with the sugar-phosphate backbone forming the railing and the bases of the nucleotides being the steps). However, sequences of bases of certain patterns can form other structures, and one structure which can be formed by guanine-rich sequences is the G-quadruplex.

The G-quadruplex structure has a core of stacked tetrads linked by three loops. These loops are sequences of between one and seven bases, and the combination of lengths of the three loops affects the stability and shape of the G-quadruplex. These loops are the focus of this dissertation.

Dr Huppert developed a quadruplex folding rule which says that sequences of bases of a particular form will form a G-quadruplex structure under certain physiological conditions. Dr Huppert also developed an algorithm called Quadparser which he used to search the whole of the human genome for sequences which satisfy this rule. Although Dr Huppert had identified that sequences of this form could form G-quadruplexes, it was not known how many of them actually do form G-quadruplex structures physiologically in the genome.

For each putative G-quadruplex structure identified by Quadparser, the three lengths of the loops that would be formed if the sequence formed a G-quadruplex physiologically were recorded. As these loops are between one and seven bases in length, this gave a 7 x 7 x 7 contingency table, each dimension of the contingency table giving the length of one of the loops.

The aim of this project was to analyse the contingency table, investigating whether there were statistical structures present. Presence of statistical structures would indicate that the sequences identified show evidence of evolutionary selection, and hence that they may actually form G-quadruplexes physiologically in the genome.

A test of independence between the three loop lengths was rejected. Further analysis of the three-way contingency table did not initially suggest a particular direction for further work, so I collapsed the table by summing over the lengths of loop two to give a 7 x 7 contingency table showing the lengths of loops one and three. It was natural to consider loops one and three together as they have a similar position in the G-quadruplex structure.

This two-way contingency table had three interesting features. Firstly, it was almost exactly symmetric; secondly, there were a large number of counts where one or both of loops one and three were only one base long; and thirdly, there were a large number of counts on the diagonal, where the lengths of the two loops are the same. As the first row and column of this table had such a dominating effect, I excluded them from the next section of the analysis and fitted a quasi-independence model to the remaining 6 x 6 table, which would model the excess of counts on the diagonal and the symmetry. The quasi-independence model is a probability mixture model, which says that there is some probability that the row and column classifications are the same, and that with the remaining probability the row and column classifications are independent.
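As an illustration of this kind of fit, the R sketch below fits the standard loglinear quasi-independence model (independence plus separate diagonal terms) to an invented 6 x 6 table of loop-1 by loop-3 counts; this is a close relative of, but not identical to, the mixture parameterisation used in the report, and the counts are simulated, not Dr Huppert's data.

    ## Illustration only: loglinear quasi-independence on an invented 6 x 6 table
    ## of loop lengths 2-7 (length-1 loops excluded, as in the report's analysis).
    set.seed(9)
    counts <- matrix(rpois(36, 200), 6, 6)
    diag(counts) <- diag(counts) + rpois(6, 150)        # inflate the diagonal

    tab <- data.frame(count = as.vector(counts),
                      loop1 = factor(rep(2:7, times = 6)),
                      loop3 = factor(rep(2:7, each  = 6)))
    tab$diag <- factor(ifelse(tab$loop1 == tab$loop3, as.character(tab$loop1), "off"))

    indep <- glm(count ~ loop1 + loop3,        family = poisson, data = tab)
    quasi <- glm(count ~ loop1 + loop3 + diag, family = poisson, data = tab)
    anova(indep, quasi, test = "Chisq")    # evidence for excess diagonal counts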


The parameter of interest is the probability that the row and column classifications are the same, that is, that the lengths of loops one and three are the same. The main section of this project concerns the estimation of this probability, which is done in three different ways: by approximating it by Cohen's Kappa statistic, by numerically maximising the profile log-likelihood, and by Markov Chain Monte Carlo.

The three estimates obtained agree very closely, with the latter two methods both giving an estimate of 0.098 to three decimal places, with a very tight confidence interval of (0.096, 0.100). The estimate obtained by approximation by Cohen's Kappa statistic was 0.095, but this estimate required an additional assumption, which I later investigated and found was not strictly true. I also modified my programs to run on the 7 x 7 contingency table, to include the counts where one or both of the loops were just one base long; this gave slightly higher estimates (those obtained through the profile log-likelihood and Markov Chain Monte Carlo methods were both 0.129, with confidence interval (0.128, 0.131)), but all six estimates obtained agreed to give 0.1 to one decimal place.

I then investigated including further terms in the quasi-independence model to try to improve the fit of the model to the data. These models indicated that there were further statistical structures which remained in the data even after conditioning on the event that the lengths of loops one and three differed. I also tried fitting various independence models, symmetry models and a quasi-independence model to the original three-way contingency table, but this did not give any results of interest.

The analysis of the two-way contingency table does nevertheless give strong results. The quasi-independence between the numbers of bases in loops one and three cannot be explained by the duplex structure of DNA, and is most easily explained by the quadruplex structure. This indicates that it is likely that a proportion of the counted putative quadruplex structures (the proportion being about 0.1) do form quadruplexes physiologically. The statistical structure that remained in the data even after conditioning on the event that the lengths of loops one and three differed indicates that the proportion of the putative G-quadruplex structures that do form G-quadruplexes physiologically may in fact be even higher than 0.1.

The programming for this project was performed in R 2.0.1 and WinBUGS 1.4.

    Shujing Zhang

    Comparative Mortality Models

    This project was provided by Dr. M. Orszag of Watson Wyatt LLP.

The purpose of this project is to compare different models currently used in forecasting human mortality in the United Kingdom and to try to find the most appropriate ones. Three models are chosen to fit and forecast the England and Wales mortality experience at ages 55 to 84; most of the work concerns the third model, the Lee-Carter model. This paper is divided into five chapters. The first chapter introduces the mortality changes in the U.K. over the 20th century, mortality patterns and existing methods for forecasting mortality. In the second chapter, the procedure for constructing the quadratic Gompertz model is discussed and the model is fitted and used to project to the year 2046. The third chapter is about the smooth spline model. The fourth chapter implements the Lee-Carter model, in which the procedure of fitting a Poisson log-bilinear regression model as a GLM and forecasting the time-varying index with an ARIMA model is discussed in detail. The last chapter compares all the fitted and projected mortalities from the three models to the Government Actuary's Department (GAD) 2002-based projections.
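To illustrate the structure of the Lee-Carter model only, the R sketch below uses the classical SVD fit of log m(x, t) = a_x + b_x k_t on simulated rates and projects the period index with a random walk with drift; the report instead fits the Poisson log-bilinear version as a GLM and uses an ARIMA forecast, so the data, the SVD fit and the drift projection here are all illustrative assumptions.

    ## Illustration only: SVD fit of the Lee-Carter model on simulated rates.
    set.seed(10)
    ages  <- 55:84; years <- 1961:2000
    logm  <- outer(-8 + 0.09 * (ages - 55), rep(1, length(years))) +
             outer(0.02 + 0.001 * (ages - 55), -(0:39)) +
             matrix(rnorm(length(ages) * length(years), 0, 0.02),
                    length(ages), length(years))

    a_x <- rowMeans(logm)
    sv  <- svd(sweep(logm, 1, a_x))               # centre, then rank-1 fit
    b_x <- sv$u[, 1] / sum(sv$u[, 1])             # usual normalisation sum(b_x) = 1
    k_t <- sv$d[1] * sv$v[, 1] * sum(sv$u[, 1])

    drift <- mean(diff(k_t))                      # random walk with drift
    k_fut <- k_t[length(k_t)] + drift * (1:46)    # project from 2000 to 2046
    proj  <- exp(outer(a_x, rep(1, 46)) + outer(b_x, k_fut))   # projected rates
    proj[1:3, 1:3]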

All modelling was completed in S-Plus 6. LaTeX was the primary software for typesetting.
