
UCSC Qualifying Exam Proposal 2012


FINDING BIOMARKERS IN PARKINSON’S DISEASE USING FEATURE SELECTION

ELINOR VELASQUEZ

1. Introduction

Parkinson’s disease (PD), a progressive neurodegenerative disorder, affects 3% of people above 65 years of age [29], [4]. Extensive loss of neurons that contain dopamine (dopaminergic neurons) in the basal ganglia region of the brain, Lewy bodies (dense aggregates of proteins) and abnormal motor skills are the hallmarks commonly associated with PD [29], although recent research indicates a more complex picture [8]. Idiopathic PD, meaning PD that is not caused solely by hereditary elements, nonetheless has both “genetic and environmental factors” [8] that create the disease state. Because of this complexity, an early correct diagnosis of PD at first appears futile. Yet a patient with late stage PD typically has approximately 70% midbrain dopaminergic neuronal loss. Thus, it is essential to provide a means of diagnosis in the early stages of the disease.

The problem of proper diagnosis may be alleviated by the use of biomarkers for the disease. A biomarker is an “indicator of a pathological molecular process,” such as dopaminergic neuron apoptosis [17]. A pressing question for PD is: Can diagnostic biomarkers be found, using genetic, genomic or proteomic information encoded in a patient’s tissues, preferably tissue that can be obtained noninvasively? At present, there are no confirmed biomarkers of any kind for PD [3], [11].

Biomarkers are needed for the earlier stages of idiopathic PD (IPD), in which the disease arises not solely from hereditary elements but from a combination of “genetic and environmental factors” [8]. Several studies have, however, confirmed a number of single nucleotide polymorphisms (SNPs) that are biomarkers for familial PD (FPD), i.e. PD in patients with a genetic predisposition, such as a genetic mutation [19]. Recent genome-wide association studies have indicated SNPs in the SNCA and LRRK2 genes for FPD: SNPs in the SNCA, MAPT, PARK16 and LRRK2 genes (footnote 1) have been found in PD [25], [23], [18], and a larger list of SNPs is given in [9], although these newer studies do not specify the proportion of patients with FPD and IPD. FPD accounts for 10% of all cases of PD; most cases of PD are of the idiopathic type. Additionally, other genomic studies have used postmortem brain tissue to study PD [24], but these studies have the disadvantage that the tissue comes from dopaminergic cells in the late stage of PD, not in an early stage when an ideal diagnosis would occur. There is a need for non-familial biomarkers taken from midbrain tissues in the earlier stages of IPD.


The recent popularity of studying “diseases in a dish” [30] not only hastens the advent of personalized medicine, but also permits an advantageous model of human diseases that have not been well investigated by existing animal models, such as early stage IPD. The ‘dish’ represents specified differentiated cells generated from induced pluripotent stem cells (iPSCs), which in turn originate from reprogramming certain cell types of a patient with a specified disease, such as PD. In the case of PD, the goal is to study dopaminergic neurons that arise from differentiation of iPSCs, which were previously reprogrammed from dermal fibroblast cells of a PD patient [21]. Data from ‘PD in a dish’ is scarce, primarily because the biotechnological protocols for producing PD midbrain dopaminergic neuronal cells from iPSCs are still in the developmental phase [21]. The Soldner group has generated iPSCs from fibroblast samples of PD patients [28]. Their protocol is not efficient enough for routine personalized medical diagnoses, yet it is a step in the right direction: Once such a protocol for iPSCs becomes efficient, gene therapy, drug development and more research can start to take place on patient-specific models. In this rapidly developing research field, Devine et al. were able to produce gene expression data from PD patient-derived iPSC-derived dopaminergic neurons [5]. One problem is that PD has a long latency period, variability in progression and also epigenetic influences that are effectively erased during reprogramming [5]. Devine et al. have considered these issues: their idiopathic PD patient carries four copies of the SNCA gene (footnote 2), resulting in a doubling of the expression of α-synuclein, a protein contained in Lewy bodies. This reduces the epigenetic influence [5] but models an earlier stage PD midbrain dopaminergic neuron. Therefore, there is a possibility of finding non-familial biomarkers in the earlier stages of IPD from the data of Devine et al.

Although no confirmed biomarkers for PD exist [3], [11], I will use Devine et al.’s IPD patient-derived iPSC-derived dopaminergic neuronal data to find informative genes (potential biomarkers) for IPD. I am interested in the developmental stages of IPD, i.e. differential gene expression data for potential biomarkers, not the genetic predisposition for IPD, i.e. SNP data. I will therefore study Devine et al.’s gene expression data from iPSC-derived dopaminergic neurons (which were reprogrammed from fibroblast cells of an IPD patient). I will use machine learning feature (gene) selection algorithms as a novel way to find a subset of potential biomarkers, alongside traditional methods. I will test the veracity of the biomarkers using a machine learning classifier to classify IPD. I will also apply the same techniques to a later stage PD dataset in order to compare earlier stage iPSC-derived PD dopaminergic neurons to post-mortem PD midbrain dopaminergic neurons, using additional data from Cantuti-Castelvetri et al. [4].

2. Specific aims

Since we have no confirmed biomarkers of any kind for PD [3], [11], it is fruitful to search for biomarkers using computational means. I focus on differential gene expression biomarkers for IPD in this study because the work by Simon-Sanchez et al., Satake et al. and others already indicates informative SNPs (potential biomarkers) for FPD [25], [23]. I will compare early stage and late stage biomarkers, which will help in understanding the development of PD. I am using machine learning techniques to find potential biomarkers because machine learning feature (gene) selection offers a better approach to finding informative genes than a simple fold change or the moderated t-statistic method. Fold change is known to be an imprecise measure because its upper and lower thresholds are chosen arbitrarily, whereas the thresholds of a t-statistic are chosen to control the Type I error [6]. However, classical data analysis, such as with a t-statistic, assumes that the genes are independent [6]. This assumption is not reasonable for microarray data. Another problem is that the t-statistic is not optimal for small sample sets [6]. The software SAM tries to fix this problem by applying a ‘fudge’ factor [6]. A statistic that addresses this problem more accurately is the moderated t-statistic [6], in which the standard deviation in the denominator of the t-statistic is replaced by a “posterior residual standard deviation” [6]. Yet both SAM and the moderated t-statistic are categorized as ‘filter feature selection methods’, with ‘feature’ meaning an attribute of the set of samples, specifically a gene (footnote 3). A ‘feature selection method’ is a method that selects genes. ‘Filter’ feature selection methods are independent of learning algorithms and thus do not take advantage of any learning process [12], [13]. ‘Wrapper’ methods are intertwined with a learning algorithm: subset selection provides feedback to the classification process, leading to a more accurate feature (biomarker) selection [12], [13]. To summarize, filter methods are known to perform less well than wrapper feature selection methods [12], [13], and since the moderated t-statistic selects genes without a learner, it is a filter feature (gene) selector. I will therefore use a wrapper feature (gene) selector in this project, in addition to a second filter feature (gene) selector for comparison with the moderated t-statistic.

Aim I. I will create potential biomarker subsets for IPD using fold change, the moderated t-statistic (a filter gene selection method), an information-theoretic method (a filter gene selection method) and a recursive feature elimination ridge regression (RFE-RR) algorithm (a wrapper gene selection algorithm), and I will computationally test the veracity of the potential biomarker subsets using a machine learning algorithm, namely the Naive Bayes algorithm. The assumption is that the subset of genes which most accurately separates Devine et al.’s IPD data from Devine et al.’s control data is the best potential biomarker subset for IPD. The classifier to be used for all gene subsets is the Naive Bayes classifier, a classifier known in machine learning to give good results [20]. The control data in Devine et al. comes from the iPSC-derived dopaminergic neurons of an unaffected first degree relative.

There are no confirmed biomarkers of any kind for IPD [3], [11]; biomarkers for IPD could be useful for a diagnosis of early stage IPD. I will compute the fold change, rank the genes using the moderated t-statistic and select the subset of genes using a p-value threshold of 0.05. I will also rank the genes using an information-theoretic measure to compare with the moderated t-statistic. The information-theoretic measure is still a ‘filter’ method, so I will also construct an informative gene (potential biomarker) subset with the RFE-RR wrapper method, known to be more accurate than filter methods [12], [13], [34]. I will compare the results by classification of PD versus control, using a Naive Bayes algorithm as the classifier, presuming that the gene subset which classifies PD most accurately is the best potential biomarker subset. I use Devine et al.’s dopaminergic neuronal gene expression data because it originated from an earlier stage IPD patient (earlier than a post-mortem late stage PD patient, such as one from the work of Cantuti-Castelvetri et al. [4]).

Aim II. I will compare the potential biomarker results of Aim I with similar results from a late stage PD gene expression dataset, such as Cantuti-Castelvetri et al.’s dataset [4]. I will first perform the same analysis as before, this time using Cantuti-Castelvetri et al.’s microarray data from post-mortem unaffected (control) midbrain dopaminergic neurons and post-mortem late stage PD (diseased) midbrain dopaminergic neurons. I will then compare the subsets of potential biomarkers to see whether iPSC-derived earlier stage PD neurons and post-mortem late stage PD neurons are similar or different. I will use the Database for Annotation, Visualization and Integrated Discovery (DAVID; footnote 4) to aid in the comparison. We do not yet know if there is a significant difference between living earlier stage iPSC-derived PD dopaminergic neurons and post-mortem later stage dopaminergic neurons. This aim may serve to highlight the differences and similarities between the two types of cells. This will be the first comparative study of PD dopaminergic iPSC-derived neuronal microarray data and post-mortem PD dopaminergic neuronal microarray data. I hope to establish the similarity or dissimilarity of biomarkers of late stage post-mortem PD dopaminergic neurons and earlier stage iPSC-derived PD dopaminergic neurons, using the PD midbrain dopaminergic neuronal data of Cantuti-Castelvetri et al. [4] for the later stage PD patients’ dopaminergic neurons.

3. Methods

My goals are to create an informative gene subset, i.e. a subset of potential biomarkers, for both early stage IPD and late stage PD patient midbrain dopaminergic neurons. I will search for an informative gene subset using fold change and the moderated t-statistic (filter) method [6]. I will compare the known method, the moderated t-statistic, with a novel filter method suggested by the NIPS Feature Selection 2003 Challenge [13]: an information-theoretic method of feature (biomarker) selection. I will also compare these results with a novel wrapper feature (biomarker) selection method, a methodology known to be highly accurate for detecting features [12], namely the RFE-RR method [34]. RFE-RR has been shown to be more accurate than RFE-SVM [34], and RFE-SVM has been shown to be effective in searching for cancer biomarkers [35]. I plan to assess the quality of each array to determine acceptability (quality control), and then to preprocess and normalize the data so that arrays can be compared (data preprocessing and normalization). I will apply gene selection using fold change, the moderated t-statistic and its p-value to Devine et al.’s data to construct a potential biomarker subset, then use an information-theoretic measure on the original set of genes (preprocessed and normalized) to find an independent subset of potential biomarkers. I will use a wrapper feature (gene) selection algorithm (RFE-RR) to determine, again independently, a subset of potential biomarkers from the original set of genes (preprocessed and normalized). I will compare all these subsets of potential biomarkers by seeing how well they classify PD, through the use of the Naive Bayes learner algorithm. This machine learning methodology for selecting potential biomarkers is not novel: It has been used to select biomarkers for cancer [35]. To summarize, my project is to find a subset of informative genes (potential biomarkers) for an earlier stage IPD patient’s midbrain dopaminergic neurons and then to see if those potential biomarkers are the same for later stage post-mortem PD patients’ midbrain dopaminergic neurons.

3.1. Quality control. This step is the first of several to detect potential biomarkers for earlier stage IPD and late stage PD. Using the dataset of Devine et al., I will perform a quality control analysis on the gene expression data in order to detect abnormal arrays and discard them from the set of all arrays. The data come from Affymetrix GeneChip arrays, so the ‘qc’ function from the simpleaffy package in the Bioconductor suite of packages suffices for assessing the quality of each array [6]. The function qc can be used to assess the average background of each array and the number of genes called ‘present’, among other metrics; all these metrics help to determine the consistency within each array.

3.2. Data preprocessing and normalizing data. I will perform several steps in order to be able to compare the gene expression values across arrays. I am concerned with spot intensity saturation because this spot condition can negatively influence the results of my analysis. On the other hand, throwing out highly expressed genes is not desirable either; thus I need to be careful with regard to spot analysis. Ideally, the saturation of individual pixels would be corrected or eliminated altogether in order to avoid eliminating target genes which are highly expressed [33]. Such a procedure would require data in TIFF files; however, the available data is in CEL files, which have only the means, medians and variances of the pixel intensities for each probeset, not the individual pixel intensities. To compensate for saturated points, I will cut off the highest 2% of the values [6]. I will also cut off the lowest 2% to keep the data from being biased toward the lowest intensity values. I will use the ‘rma’ function from the R simpleaffy package to perform a “robust multichip average” [2], which corrects the background, normalizes the data and summarizes the data to gene expression intensity values [15], [16]. The background needs to be corrected for noise and non-specific hybridization [10]. The data needs to be normalized in order to compare data from different arrays [6]. Summarization combines multiple probe-level intensities into a single gene expression value [10].
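As a minimal sketch of what the normalization step does, the following Python (standard library only, used here purely for illustration in place of the R packages named above) implements quantile normalization, the normalization compared in [2]. It is not the ‘rma’ function: background correction and summarization are left out, ties are not averaged, and the toy input layout (one list of probe intensities per array) is an assumption made for the example.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution of intensities.

    `arrays` is a list of equal-length lists, one list of probe
    intensities per array (a hypothetical toy layout, not the CEL format).
    """
    n_probes = len(arrays[0])
    # Sort each array, then average across arrays rank by rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays)
        for r in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # order[r] is the index of the probe that has rank r in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for r, i in enumerate(order):
            out[i] = rank_means[r]
        normalized.append(out)
    return normalized


# Two toy arrays with three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

After normalization each array contains the same set of values, so differences between arrays reflect only the ranking of probes within each array.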

3.3. Moderated t statistic. The goal of the project is to find potential biomarkers for late stage PD and earlier stage IPD. I will use the moderated t statistic [6], the best approach according to [27], for ranking genes (potential biomarkers). The moderated t statistic is available through the eBayes function from the R limma package. The moderated t statistic improves upon the ordinary t statistic and SAM values by not outputting genes that have very small log base 2 fold changes with small p-values [6]. The function ‘topTable’ in the R limma package gives the p-value for the moderated statistic [6]. I will also rank the genes by fold change using the ‘sort’ and ‘rank’ functions in R in order to provide convincing evidence that fold change is not useful for ranking genes when distinguishing PD from a control.
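The effect of moderation can be illustrated with a short sketch. The Python below (standard library only, for illustration rather than the limma workflow) computes the log2 fold change and a two-sample t statistic in which the pooled variance is shrunk toward a prior variance, which is the general form of the moderated t statistic. The prior degrees of freedom `d0` and prior variance `s0_sq` are hypothetical inputs chosen for the example; limma estimates them from all genes by empirical Bayes.

```python
import math
from statistics import mean, variance


def fold_change_and_t(case, control, d0=0.0, s0_sq=0.0):
    """Return (log2 fold change, t) for one gene from log2 intensities.

    With d0 = 0 this is the ordinary pooled-variance t statistic.
    With d0 > 0 the pooled variance is shrunk toward s0_sq.
    """
    n1, n2 = len(case), len(control)
    d = n1 + n2 - 2  # residual degrees of freedom
    pooled = ((n1 - 1) * variance(case) + (n2 - 1) * variance(control)) / d
    shrunk = (d0 * s0_sq + d * pooled) / (d0 + d)
    log_fc = mean(case) - mean(control)
    t = log_fc / math.sqrt(shrunk * (1.0 / n1 + 1.0 / n2))
    return log_fc, t


# A gene with a tiny fold change and an even tinier variance.
case = [8.10, 8.11, 8.12]
control = [8.00, 8.01, 8.02]
log_fc, t_ordinary = fold_change_and_t(case, control)
_, t_moderated = fold_change_and_t(case, control, d0=4.0, s0_sq=0.05)
print(round(log_fc, 3), round(t_ordinary, 2), round(t_moderated, 2))
```

In this toy case the ordinary t statistic is large only because the sample variance happens to be very small; shrinking the variance toward the prior brings the statistic back in line with the small fold change.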

3.4. Filtering approach: Using an information gain-type measure to create an informative subset of genes. The information-theoretic paradigm has changed the way we look at model selection and inference: Biomarker (feature) selection can now be accomplished using information theory (devoid of p-values). Here filter-style biomarker selection is accomplished by selecting biomarkers (features) prior to classification. Biomarker (feature) selection by correlation produces features that suffer from not being invariant under transformations, e.g. rescaling [1] or taking the logarithm. So instead, I focus on the concept of mutual information, which is invariant under such transformations. First, I define the entropy, $H$, of an object, $Z$, as the expectation of the log of the inverse of the probability density of $Z$. Given a feature (gene) $X$ and a class $Y$, such as PD or control, we define the mutual information, $I$, as $I(Y, X) = H(Y) - H(Y \mid X)$, with $H(Y)$ the entropy of $Y$ and $H(Y \mid X)$ the conditional entropy of $Y$ given $X$. To select a feature $X$, I find the maximum, over the set of all features, of

$$I(Y, X) - \beta \sum_{X_j \in S} I(X, X_j),$$

with $S$ the subset of informative biomarkers (features) [1], [13]. That particular $X$ is then added to $S$ and the process is repeated; $\beta$ represents the “balance between maximizing relevance and minimizing redundancy” [13]. Since the process is computationally expensive, I employ the technique of Qiu et al. to speed up the calculation of $I$ [22]. I arrive at an informative subset of genes which may be considered as potential biomarkers. I will need to implement this particular form of the mutual information feature selection algorithm (using R).
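The greedy selection described above can be sketched as follows. This is a minimal Python illustration (standard library only), not the planned R implementation: it assumes the expression values have already been discretized into a few levels so that mutual information can be estimated from counts, it names the redundancy weight `beta`, and it omits the speed-up of Qiu et al. [22]. The gene names and values are invented for the example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X, Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mifs(features, labels, k, beta=0.5):
    """Greedily pick k features maximizing relevance minus beta * redundancy."""
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], labels)
            redundancy = sum(
                mutual_information(features[name], features[s])
                for s in selected
            )
            return relevance - beta * redundancy
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 0, 1, 1, 1, 1]          # 0 = control, 1 = PD
features = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 1],        # fully informative
    "g2": [0, 0, 0, 0, 1, 1, 1, 1],        # exact copy of g1 (redundant)
    "g3": [0, 0, 0, 1, 0, 1, 1, 1],        # weaker, but not a copy
}
without_penalty = mifs(features, labels, k=2, beta=0.0)
with_penalty = mifs(features, labels, k=2, beta=2.0)
print(without_penalty, with_penalty)
```

With no redundancy penalty the duplicate gene is selected second; with a large enough penalty the weaker but less redundant gene is preferred, which is the trade-off the weight controls.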

3.5. Wrapper-style feature selection: Recursive feature elimination to obtain subsets using ridge regression. Xiong et al. gave the first indications that biomarkers could be identified by feature selection [32]. They proposed using feature selection wrappers, which interact with a given classifier in a search for feature subsets and are more computationally intensive yet more accurate [32]. In Zhang et al., a type of recursive feature elimination with support vector machines (SVMs) was proposed and studied in order to select biomarkers [35]. They applied their algorithm to simulated data, breast cancer microarray data and mass spectrometry data [35]. Yang et al. demonstrated that RFE-RR was more accurate than the popular SVM algorithm when performing recursive feature elimination to select features [34].

Wrapper style (biomarker) feature selection has been found to be more accurate than filter feature selection [13]. I will generate scores for the biomarkers (features) using the given classifier, which in this case is ridge regression. Recursive feature elimination (RFE) relies on ‘weights’ and their optimization (training the classifier): it computes a ranking criterion from the weights and removes the feature with the smallest ranking criterion [12]. RFE is thus an iterative method that trains a given classifier, computes the ranking criterion and removes the feature (gene) that ranks lowest [12]. The classifier is then retrained on the subset of features that remain, and the iteration repeats [12]. What remains is a subset of features that trains the classifier well.


Ridge regression is a well known classifier that is a standard in machine learning. RFE-RR has been used to rank features for cancer classification [34]. The RFE-RR algorithm starts with an initial set of features and a ranking of those features: The ridge regression is trained and the weight vector $w$ is computed [34]. The ranking criterion is then computed, $c_i = w_i^2$, for $i = 1, \ldots, N$, with $N$ the number of remaining features and $w_i$ the ridge regression weight components. The feature which ranks lowest is eliminated. Each iteration starts from the subset of features that survived the previous iteration. The iteration runs until all features in the subset have been eliminated. The weights in ridge regression are determined by minimizing a loss function that is the residual sum of squares plus a ‘regularization’ term, with the regularization term defined to offset the possible ill-posedness of the least squares problem [14]. The regularization term is the sum of the squared weights times $\lambda$, a “complexity parameter” [14]. For prostate cancer data, the degrees of freedom of $\lambda$, defined as

$$\mathrm{df}(\lambda) = \sum_j \frac{d_j^2}{d_j^2 + \lambda},$$

with $d_j$ the singular values of the data matrix, has its minimum at 4.16 [14]. Thus we need to vary $\lambda$ to achieve a good subset of informative genes. There is no available implementation of RFE-RR, so I will need to implement it (using R).
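The elimination loop can be sketched as below. This is a minimal Python illustration (standard library only) under stated assumptions, not the planned R implementation or the code of [34]: the class labels are coded as +1 and -1, the features are taken to be centered so that no intercept is fitted, the regularization parameter is called `lam`, and the ridge weights are obtained by solving the regularized normal equations with a small Gaussian elimination routine. The gene names and values are invented for the example.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_weights(rows, y, lam):
    """Minimize sum_i (y_i - x_i . w)^2 + lam * sum_j w_j^2 (no intercept)."""
    p = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(gram, xty)


def rfe_ridge(rows, y, names, lam=1.0):
    """Return feature names in elimination order (last one ranks highest)."""
    active = list(range(len(names)))
    eliminated = []
    while active:
        sub = [[r[j] for j in active] for r in rows]
        w = ridge_weights(sub, y, lam)
        # Ranking criterion c_i = w_i^2; drop the feature with the smallest.
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        eliminated.append(names[active.pop(worst)])
    return eliminated


rows = [
    [1.0, 0.5, 0.1], [0.9, -0.4, -0.2], [1.1, 0.3, 0.2],      # PD samples
    [-1.0, 0.4, 0.1], [-0.9, -0.5, -0.1], [-1.1, -0.3, -0.1],  # controls
]
y = [1, 1, 1, -1, -1, -1]
order = rfe_ridge(rows, y, ["geneA", "geneB", "geneC"], lam=1.0)
print(order)
```

Reading the elimination order backwards gives the ranking, so a subset of any size can be taken from the end of the list. Removing one gene per iteration is expensive for full microarrays; removing a fraction of the genes per iteration would be a possible variation.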

3.6. Validation of gene subsets using a Naive Bayes algorithm. I will test all subsets of features by use of a Naive Bayes classification algorithm. Naive Bayes is useful when the number of features (genes) is large [14]. It is a well known and effective standard classifier [20]. Classifiers learn from previous examples to predict a new outcome for a given example. The Naive Bayes classifier assigns the outcome that maximizes, over all outcomes (two in my case), the probability of the outcome multiplied by the conditional likelihood of the observed features given that outcome. I will use Weka (footnote 5), a well known software package, to run the Naive Bayes classification. For numerical features (genes), the Weka Naive Bayes implementation assumes a Gaussian distribution to compute the conditional probabilities. Weka outputs a confusion matrix, which is a two-by-two matrix with true positives and true negatives along the diagonal and false positives and false negatives on the anti-diagonal.
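The decision rule can be made concrete with a small sketch. The Python below (standard library only) is a hypothetical stand-in for illustration, not Weka’s implementation: it fits one Gaussian per gene and class, and predicts the class with the largest log prior plus summed log likelihood. The variance floor of 1e-6 and the toy expression values are assumptions made for the example.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-gene Gaussian parameters for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, l in zip(rows, labels) if l == cls]
        columns = list(zip(*members))
        model[cls] = {
            "prior": len(members) / len(rows),
            "mean": [mean(c) for c in columns],
            # A small floor keeps the variance positive for constant genes.
            "var": [max(pvariance(c), 1e-6) for c in columns],
        }
    return model


def predict(model, row):
    """Return the class maximizing log prior + sum of log Gaussian densities."""
    def log_posterior(cls):
        p = model[cls]
        total = math.log(p["prior"])
        for x, mu, var in zip(row, p["mean"], p["var"]):
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (x - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Two genes, three PD samples and three controls (toy log2 intensities).
rows = [[8.1, 5.0], [8.3, 5.2], [7.9, 4.9],
        [6.0, 5.1], [6.2, 5.0], [5.9, 4.8]]
labels = ["PD", "PD", "PD", "control", "control", "control"]
model = fit_gaussian_nb(rows, labels)
prediction_a = predict(model, [8.0, 5.0])
prediction_b = predict(model, [6.1, 5.0])
print(prediction_a, prediction_b)
```

Working with log probabilities avoids numerical underflow when the product runs over many genes.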

Since the number of data samples is small, I will perform five-fold cross validation for all the subsets. For the NIPS 2003 Feature Selection Challenge, the criterion for judging is the balanced error rate (BER) [13]. Ideally, the BER equals zero for a given feature selection method. To compute the error, with error = 1 - accuracy, I will use the following equation [13]:

$$\mathrm{BER} = \frac{1}{2}\left(\frac{\#\text{ of positive instances predicted wrong}}{\#\text{ of positive instances}} + \frac{\#\text{ of negative instances predicted wrong}}{\#\text{ of negative instances}}\right).$$

To compute the classifier performance, i.e. the area under the ROC curve (AUC), I will use the following equations: $\mathrm{AUC} = \frac{G + 1}{2}$, $G = 2[p_0(1 - p_0) + p_1(1 - p_1)]$, with $G$ equaling the Gini index, $p_0$ the relative frequency of the class ‘control’ in the set of examples and $p_1$ the relative frequency of the class ‘PD’ in the set of examples. The AUC variable measures the area under the ROC curve: When the classifier performs perfectly, the AUC measure equals one. I will compute $p_0$ and $p_1$ via the confusion matrices associated with the specified classification.
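As a brief sketch of the evaluation step, the Python below (standard library only) computes the BER from the four cells of a two-by-two confusion matrix and builds a simple five-fold partition of sample indices. The round-robin fold assignment and the example counts are assumptions made for illustration; Weka performs its own fold assignment.

```python
def balanced_error_rate(tp, fn, tn, fp):
    """BER: the mean of the error rates on the positive and negative classes."""
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))


def five_fold_indices(n_samples, n_folds=5):
    """Assign sample indices to test folds round-robin."""
    return [list(range(k, n_samples, n_folds)) for k in range(n_folds)]


# Unbalanced toy result: 10 PD samples (2 missed), 30 controls (3 missed).
ber = balanced_error_rate(tp=8, fn=2, tn=27, fp=3)
plain_error = (2 + 3) / 40
folds = five_fold_indices(12)
print(ber, plain_error, folds)
```

In this example the plain error rate is 0.125 while the BER is about 0.15, because the BER weights the two classes equally regardless of their sizes.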

4. Conclusion

To summarize, I will use different techniques to rank features (biomarkers) and test their reliability using a classification scheme. My plan is to select features by more than one method and then test the ability of the features to classify PD, i.e. to construct potential biomarkers by computational means, specifically through classification methods. The biomarkers found in this project will require validation by biological laboratory experiments [21]. In short, I plan to implement filter and wrapper feature selection for the two sets of data, namely those of Devine et al. and Cantuti-Castelvetri et al., compare them to the standard methods of gene selection and test the results by the classification performance of the selected subsets. The best performing subset will be the final subset of selected genes (potential biomarkers).

5. Footnotes

1. The SNCA gene encodes α-synuclein, a protein found in Lewy bodies. The MAPT gene encodes the tau protein, which when defective is associated with dementia, a symptom sometimes found in PD patients [8]. Current research shows PARK16 to have an association with IPD, but its functional relationship is still unknown [31]. Mutations in the LRRK2 gene are well known to be associated with neuronal apoptosis [26].

2. Devine et al. say that a variation at the SNCA locus results in a significant “genetic risk factor” for IPD [5], and the α-synuclein protein, produced from the SNCA gene and subject to “mitochondrial dysfunction mediated α-synuclein aggregation,” is an aggravating factor for all types of PD [7].

3. A ‘feature’ in this setting is an attribute of the set of samples, specifically a gene. Feature selection algorithms are of three types: ensemble, wrapper and filter methods [12], [13]. Filter methods are independent of the machine learning algorithm and thus do not take advantage of the learning process [12], [13]. Wrapper methods are intertwined with the learning algorithm: subset selection provides feedback to the classification process, leading to a more accurate selection [12], [13]. Ensemble systems are adaptive systems for classification and outperform other learner systems [12], [13]. It is possible to rank features with ensemble learning methods [12], [13]. I use both filter and wrapper methods for feature selection in this project and leave ensemble methodology to a later project.

4. DAVID is available online (http://david.abcc.ncifcrf.gov/home.jsp).


5. Weka is available for downloading through its website (www.cs.waikato.ac.nz/ml/weka/).

6. References

1. Battiti, R. Using Mutual Information for Selecting Features in Supervised Neural Net Learning. IEEE Trans. on Neural Networks 5(4), 537-550 (1994).

2. Bolstad, B. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003).

3. Breen, D. P. et al. Parkinson’s disease - the continuing search for biomarkers. Clin. Chem. Lab. Med. 49(3), 393-401 (2011).

4. Cantuti-Castelvetri, I. et al. Effects of gender on nigral gene expression and Parkinson disease. Neurobiol. Dis. 26(3), 606-614 (2007).

5. Devine, M. et al. Parkinson’s disease induced pluripotent stem cells with triplication of the α-synuclein locus. Nature Comm. 2(23 Aug), 440-450 (2011).

6. Draghici, S. Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Ed. (CRC Press, New York, NY) (2012).

7. Esteves, A.R. et al. Mitochondrial dysfunction: the road to α-synuclein oligomerization in Parkinson’s disease. Parkinson’s Disease 2011(Article 693761), 1-20 (2011).

8. Fisher, A. et al. Advances in Alzheimer’s and Parkinson’s Disease. (Springer, New York NY) (2008).

9. Fung, H. et al. Genome-wide genotyping in Parkinson’s disease and neurologically normal controls: first stage analysis and public release of data. The Lancet Neuro. 5(11), 911-916 (2006).

10. Gentleman, R. et al. Bioinformatics and Computational Biology Solutions using R and Bioconductor. (Springer, New York NY) (2005).

11. Gerlach, M. et al. Biomarker candidates of neurodegeneration in Parkinson’s disease for the evaluation of disease-modifying therapeutics. J. Neural Transm. 119(1), 39-52 (2012).

12. Guyon, I. et al. Gene selection for cancer classification using support vector machines. Machine Learning 46, 389-422 (2002).

13. Guyon, I. et al. Feature Extraction: Foundations and Applications. (Springer, New York NY) (2006).

14. Hastie, T. et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. (Springer, New York NY) (2001).

15. Irizarry, R.A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264 (2003).

16. Irizarry, R.A. et al. Summaries of Affymetrix GeneChip probe level data. Nucl. Acids Res. 31(4), e15 (2003).

17. Jain, K. The Handbook of Biomarkers. (Humana Press, New York NY) (2010).

18. Liu, X. et al. Genome-wide association study identifies candidate genes for Parkinson’s disease in an Ashkenazi Jewish population. BMC Med. Genet. 12, 104-120 (2011).

19. Maraganore, D. M. et al. High-resolution whole-genome association study of Parkinson’s disease. Amer. J. Hum. Genet. 77, 685-693 (2005).

20. Mitchell, T.M. Machine Learning. (McGraw-Hill, San Francisco CA) (1997).

21. Pal, G. Private Communication, Parkinson’s Institute (2011).

22. Qiu, P. et al. Fast calculation of pairwise mutual information for gene regulatory network reconstruction. Comput. Methods Programs Biomed. 94(2), 177-180 (2009).

23. Satake, W. et al. Genome-wide association study identifies common variants at four loci as genetic risk factors for Parkinson’s disease. Nature Genet. 41, 1303-1307 (2009).

24. Scherzer, C. K. Molecular markers of early Parkinson’s disease based on gene expression in blood. P.N.A.S. 104(3), 955-960 (2007).

25. Simon-Sanchez, J. et al. Genome-wide association study reveals genetic risk underlying Parkinson’s disease. Nature Genet. 41(12), 1308-1312 (2009).

26. Smith, W.W. et al. Leucine-rich repeat kinase 2 (LRRK2) interacts with parkin, and mutant LRRK2 induces neuronal degeneration. P.N.A.S. 102(51), 18676-18681 (2005).

27. Smyth, G. et al. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3(1), 1027-1052 (2004).

28. Soldner, F. et al. Generation of isogenic pluripotent stem cells differing exclusively at two early onset Parkinson point mutations. Cell 146, 318-331 (2011).

29. Squire, L. et al. Fundamentals of Neuroscience. (Academic Press, New York NY) (2008).

30. Tiscornia, G. et al. Diseases in a dish: Modeling human genetic disorders using induced pluripotent cells. Nature Med. 17(12), 1570-1576 (2011).

31. Tucci, A. et al. Genetic variability at the PARK16 locus. Euro. J. Hum. Genet. 18, 1356-1359 (2010).

32. Xiong, M. et al. Biomarker identification by feature wrappers. Genome Res. 11, 1878-1887 (2001).

33. Yang, Y. et al. Segmentation and intensity estimation for microarray images with saturated pixels. BMC Bioinformatics 12, 462-486 (2011).

34. Yang, Y. & Li, F. Analysis of recursive gene selection approaches from microarray data. Bioinformatics 12, 462-469 (2005).

35. Zhang, X. et al. Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics 7, 197-210 (2006).