REVIEW National Science Review 1: 293–314, 2014

doi: 10.1093/nsr/nwt032
Advance access publication 6 February 2014

COMPUTER SCIENCE

Challenges of Big Data analysis

Jianqing Fan1,∗, Fang Han2 and Han Liu1

1Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA and 2Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA

∗Corresponding author. E-mail: [email protected]

Received 3 August 2013; Accepted 15 October 2013

ABSTRACT
Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This paper gives an overview of the salient features of Big Data and of how these features drive paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. Violations of these assumptions can lead to wrong statistical inferences and consequently wrong scientific conclusions.

Keywords: Big Data, noise accumulation, spurious correlation, incidental endogeneity, data storage, scalability

INTRODUCTION
Big Data promise new levels of scientific discovery and economic value. What is new about Big Data, and how do they differ from traditional small- or medium-scale data? This paper gives an overview of the opportunities and challenges brought by Big Data, with emphasis on the distinguishing features of Big Data and on the statistical and computational methods, as well as the computing architecture, needed to deal with them.

BACKGROUND
We are entering the era of Big Data, a term that refers to the explosion of available information. This Big Data movement is driven by the fact that massive amounts of very high-dimensional or unstructured data are continuously produced and stored at a much lower cost than before. For example, in genomics we have seen a dramatic drop in the price of whole genome sequencing [1]. This is also true in other areas such as social media analysis, biomedical imaging, high-frequency finance, analysis of surveillance videos and retail sales. The existing trend that data can be produced and stored more massively and cheaply is likely to continue or even accelerate in the future [2]. This trend will have a deep impact on science, engineering and business. For example, scientific advances are becoming more and more data-driven, and researchers will increasingly think of themselves as consumers of data. The massive amounts of high-dimensional data bring both opportunities and new challenges to data analysis. Valid statistical analysis for Big Data is becoming increasingly important.

GOALS AND CHALLENGES OF ANALYZING BIG DATA
What are the goals of analyzing Big Data? According to [3], two main goals of high-dimensional data analysis are to develop effective methods that can accurately predict future observations and, at the same time, to gain insight into the relationship between the features and the response for scientific purposes. Furthermore, due to the large sample size, Big Data give rise to two additional goals: to understand heterogeneity and commonality across different subpopulations. In other words, Big Data hold promise for: (i) exploring the hidden structures of each subpopulation of the data, which is traditionally not feasible and which might even be treated as ‘outliers’ when the sample size is small; (ii) extracting important common features across many subpopulations even when there are large individual variations.

© The Author 2014. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd. All rights reserved. For Permissions, please email: [email protected]

Downloaded from http://nsr.oxfordjournals.org/ by guest on December 15, 2014

arXiv:1308.1479v2 [stat.ML] 15 Dec 2014

294 National Science Review, 2014, Vol. 1, No. 2 REVIEW

What are the challenges of analyzing Big Data? Big Data are characterized by high dimensionality and large sample size. These two features raise three unique challenges: (i) high dimensionality brings noise accumulation, spurious correlations and incidental endogeneity; (ii) high dimensionality combined with large sample size creates issues such as heavy computational cost and algorithmic instability; (iii) the massive samples in Big Data are typically aggregated from multiple sources at different time points using different technologies. This creates issues of heterogeneity, experimental variations and statistical biases, and requires us to develop more adaptive and robust procedures.

PARADIGM SHIFTS
To handle the challenges of Big Data, we need new statistical thinking and computational methods. For example, many traditional methods that perform well for moderate sample sizes do not scale to massive data. Similarly, many statistical methods that perform well for low-dimensional data face significant challenges in analyzing high-dimensional data. To design effective statistical procedures for exploring and predicting Big Data, we need to address Big Data problems such as heterogeneity, noise accumulation, spurious correlations and incidental endogeneity, in addition to balancing statistical accuracy and computational efficiency.

In terms of statistical accuracy, dimension reduction and variable selection play pivotal roles in analyzing high-dimensional data. They are designed to address noise accumulation issues. For example, in high-dimensional classification, [4] and [5] showed that conventional classification rules using all features perform no better than random guessing due to noise accumulation. This motivates new regularization methods [6–10] and sure independence screening [11–13]. Furthermore, high dimensionality introduces spurious correlations between the response and unrelated covariates, which may lead to wrong statistical inference and false scientific conclusions [14]. High dimensionality also gives rise to incidental endogeneity, a phenomenon in which many unrelated covariates may incidentally be correlated with the residual noise. This endogeneity creates statistical biases and causes model selection inconsistency that leads to wrong scientific discoveries [15,16]. Yet, most statistical procedures are based on unrealistic exogeneity assumptions that cannot be validated by data (see the ‘Incidental endogeneity’ section and [17]). New statistical procedures with these issues in mind are crucially needed.

In terms of computational efficiency, Big Data motivate the development of new computational infrastructure and data-storage methods. Optimization is often a tool, not a goal, of Big Data analysis. Such a paradigm change has led to significant progress in the development of fast algorithms that are scalable to massive data with high dimensionality. This forges cross-fertilization among different fields including statistics, optimization and applied mathematics. For example, the authors of [18] showed that the non-deterministic polynomial-time hard (NP-hard) best subset regression can be recast as an L1-norm penalized least-squares problem, which can be solved by the interior point method. Alternative algorithms to accelerate the solution of this L1-norm penalized least-squares problem, such as least angle regression [19], threshold gradient descent [20], coordinate descent [21,22] and iterative shrinkage-thresholding algorithms [23,24], have been proposed. Besides large-scale optimization algorithms, Big Data also motivate the development of majorization–minimization algorithms [25–27], the ‘large-scale screening and small-scale optimization’ framework [28], parallel computing methods [29–31] and approximate algorithms that are scalable to large sample sizes.

ORGANIZATION OF THIS PAPER
The rest of this paper is organized as follows. The ‘Rise of Big Data’ section overviews the rise of Big Data problems in science, engineering and social science. The ‘Salient Features of Big Data’ section explains some unique features of Big Data and their impacts on statistical inference. Statistical methods that tackle these Big Data problems are given in the ‘Impact on statistical thinking’ section. The ‘Impact on computing infrastructure’ section gives an overview of scalable computing infrastructure for Big Data storage and processing. The ‘Impact on computational methods’ section discusses the computational aspects of Big Data and introduces some recent progress. The ‘Conclusions and future perspectives’ section concludes the paper.

RISE OF BIG DATA
Massive sample size and high dimensionality characterize many contemporary datasets. For example, in genomics, there are more than 500 000 microarrays that are publicly available, with each array containing tens of thousands of expression values of molecules; in biomedical engineering, there are tens of thousands of terabytes of functional magnetic resonance images (fMRIs), with each image containing more than 50 000 voxel values. Other examples of massive and high-dimensional data include unstructured text corpora, social media, financial time series, e-commerce data, retail transaction records and surveillance videos. We now briefly illustrate some of these Big Data problems.

Genomics
Many new technologies have been developed in genomics that enable inexpensive and high-throughput measurement of the whole genome and transcriptome. These technologies allow biologists to generate hundreds of thousands of datasets and have shifted their primary interests from the acquisition of biological sequences to the study of biological function. The availability of massive datasets sheds light on new scientific discoveries. For example, the large amount of genome sequencing data now makes it possible to uncover the genetic markers of rare disorders [32,33] and find associations between diseases and rare sequence variants [34,35]. The breakthroughs in biomedical imaging technology allow scientists to simultaneously monitor many gene and protein functions, permitting us to study interactions in regulatory processes and neuron activities. Moreover, the emergence of publicly available genomic databases enables integrative analysis, which combines information from many sources to draw scientific conclusions. These research studies give rise to many computational methods as well as new statistical thinking and challenges [36].

One of the important steps in genomic data analysis is to remove systematic biases (e.g. intensity effect, batch effect, dye effect and block effect, among others). Such systematic biases are due to experimental variations, such as environmental, demographic and other technical factors, and can be more severe when we combine different data sources. They have been shown to have substantial effects on gene expression levels, and failing to take them into consideration may lead to wrong scientific conclusions [37]. When the data are aggregated from multiple sources, the best normalization practice remains an open problem.

Even with the systematic biases removed, another challenge is to conduct large-scale tests to pick important genes, proteins or single-nucleotide polymorphisms (SNPs). In testing the significance of thousands of genes, classical methods of controlling the probability of making one falsely discovered gene are no longer suitable, and alternative procedures have been designed to control the false discovery rates [38–42] and to improve the power of the tests [42]. These technologies, though high-throughput in measuring the expression levels of tens of thousands of genes, remain low-throughput in surveying biological contexts (e.g. novel cell types, tissues, diseases, etc.).

An additional challenge in genomic data analysis is to model and explore the underlying heterogeneity of the aggregated datasets. Due to technology limitations and resource constraints, a single lab usually can only afford to perform experiments for no more than a few cell types. This creates a major barrier to comprehensively characterizing gene regulation in all biological contexts, which is a fundamental goal of functional genomics. On the other hand, the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) [43] and other public databases have accumulated more than 500 000 gene expression profiles, including microarray, exon array and ribonucleic acid-sequencing (RNA-seq) samples from thousands of biological contexts. Public ChIP–chip and ChIP–seq data generated by different labs for different proteins and in different contexts are also steadily growing. Together, these public data contain enormous amounts of information that have not been fully exploited so far. Massive data aggregated from these public databases make it possible to study many biological contexts systematically in a high-throughput way. However, how to systematically explore the underlying heterogeneity and unveil the commonality across different subpopulations remains an active research area.

Neuroscience
Many important diseases, including Alzheimer’s disease, schizophrenia, attention deficit hyperactivity disorder, depression and anxiety, have been shown to be related to brain connectivity networks. Understanding the hierarchical, complex, functional network organization of the brain is a necessary first step to explore how the brain changes with disease. Rapid advances in neuroimaging techniques, such as fMRI and positron emission tomography, as well as electrophysiology, provide great potential for the study of functional brain networks, i.e. the coherence of the activities among different brain regions [44].

Take fMRI for example. It is a non-invasive technique for determining the neural correlates of mental processes in humans. During the past decade, this technique has become a leading method in the fields of cognitive and physiological neuroscience and has kept producing massive amounts of high-resolution brain images. These images enable us to explore the association between brain connectivity and potential responses such as disease or psychological status. The fMRI data are massive and very high dimensional. Because the technique is non-invasive, many fMRI machines scan different subjects every day and constantly produce new imaging data. For each data point, the subject’s brain is scanned hundreds of times. Therefore, it is a 3D time-course image that contains hundreds of thousands of voxels. At the same time, the fMRI images are noisy due to technological limits and possible head motion of the subjects. Analyzing such high-dimensional and noisy data poses great challenges to statisticians and neuroscientists.

Similar to the field of genomics, an important Big Data problem in neuroscience is to aggregate datasets from multiple sources. Brain imaging data sharing is becoming more and more frequent nowadays [45]. Primary sources of fMRI data include the International Data Sharing Initiative and the 1000 Functional Connectomes Project [46], the Autism Brain Imaging Data Exchange (ABIDE) [47] and the ADHD-200 [48] datasets. These international efforts have compiled thousands of resting-state fMRI scans along with complementary structural scans. The largest of the datasets is the 1000 Functional Connectomes Project, which focuses on healthy adults and includes limited covariate information on age, gender, handedness and image quality. The ADHD-200 dataset is similarly structured; yet, it includes diagnostic information on disease status such as human IQ. The ABIDE dataset is similar to the ADHD-200 dataset, with diagnostic autism and symptom severity information. However, it has a greater balance between diseased and non-diseased subjects. These large datasets pose great opportunities as well as new challenges.

One of the main challenges, as in the area of genomics, is to remove the systematic biases caused by experimental variations and data aggregation. Moreover, statistically controlled inclusion of a subject in a group study, i.e. testing whether a person should be rejected as outlier data, is often poorly conducted [49], and voxels cannot be perfectly aligned across different experiments in different laboratories. Therefore, the collected data contain many outliers and missing values. These issues make data preprocessing and analysis significantly more complicated. Many traditional statistical procedures are not well suited to these noisy high-dimensional settings, and new statistical thinking is crucially needed.

Economics and finance
Over the past decade, more and more corporations have adopted the data-driven approach to conduct more targeted services, reduce risks and improve performance. They are implementing specialized data analytics programs to collect, store, manage and analyze large datasets from a range of sources to identify key business insights that can be exploited to support better decision making. For example, available financial data sources include stock prices, currency and derivative trades, transaction records, high-frequency trades, unstructured news and texts, and consumers’ confidence and business sentiments buried in social media and the internet, among others. Analyzing these massive datasets helps measure firms’ risks as well as systematic risks. It requires professionals who are familiar with sophisticated statistical techniques in portfolio management, securities regulation, proprietary trading, financial consulting and risk management.

Analyzing a large panel of economic and financial data is challenging. For example, as an important tool in analyzing the joint evolution of macroeconomic time series, the conventional vector autoregressive (VAR) model usually includes no more than 10 variables, given the fact that the number of parameters grows quadratically with the size of the model. However, nowadays econometricians need to analyze multivariate time series with hundreds of variables or more. Incorporating all information into the VAR model will cause severe overfitting and poor prediction performance. One solution is to resort to sparsity assumptions, under which new statistical tools have been developed [50,51].

Another example is portfolio optimization and risk management [52,53]. In this problem, estimating the covariance and inverse covariance matrices of the returns of the assets in the portfolio plays an important role. Suppose that we have 1000 stocks to be managed. There are 500 500 covariance parameters to be estimated. Even if we could estimate each individual parameter accurately, the accumulated error of the whole matrix estimation can be large under matrix norms. This requires new statistical procedures. See, for example, [54–66] on estimating large covariance matrices and their inverses.
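The two parameter counts quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the VAR count assumes a plain VAR(p) with k series and no intercepts or error-covariance terms, so that it has p·k² autoregressive coefficients.

```python
def var_coefficient_count(k: int, p: int = 1) -> int:
    """Autoregressive coefficients in a VAR(p) with k series (assumed: no intercepts)."""
    return p * k * k


def covariance_parameter_count(d: int) -> int:
    """Free parameters in a symmetric d x d covariance matrix: d variances plus d(d-1)/2 covariances."""
    return d * (d + 1) // 2


if __name__ == "__main__":
    for k in (10, 100, 500):
        print(f"VAR(1) with {k} series: {var_coefficient_count(k)} coefficients")
    print(f"Covariance matrix of 1000 assets: {covariance_parameter_count(1000)} parameters")
```

Going from 10 to 100 series multiplies the VAR coefficient count by 100, which is the quadratic growth referred to in the text.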

Other applications
Big Data have numerous other applications. Taking social network data analysis as an example, massive amounts of social network data are being produced by Twitter, Facebook, LinkedIn and YouTube. These data reveal numerous characteristics of individuals and have been exploited in various fields. For example, the authors of [67] used these data to predict influenza epidemics; those of [68] used these data to predict stock market trends; and the authors of [69] used social network data to predict box-office revenues for movies. In addition, social media and the Internet contain massive amounts of information on consumer preferences and confidence, leading economic indicators, business cycles, political attitudes, and the economic and social states of a society. It is anticipated that social network data will continue to explode and be exploited for many new applications.

Several other new applications that are becoming possible in the Big Data era include:

(i) Personalized services. With more personal data collected, commercial enterprises are able to provide personalized services adapted to individual preferences. For example, Target (a retailing company in the United States) is able to predict a customer’s needs by analyzing the collected transaction records.

(ii) Internet security. When a network-based attack takes place, historical data on network traffic may allow us to efficiently identify the source and targets of the attack.

(iii) Personalized medicine. More and more health-related metrics, such as an individual’s molecular characteristics, activities, habits and environmental factors, are now available. Using these pieces of information, it is possible to diagnose an individual’s disease and select individualized treatments.

(iv) Digital humanities. Nowadays many archives are being digitized. For example, Google has scanned millions of books and identified nearly every word in every one of those books. This produces massive amounts of data and enables addressing topics in the humanities, such as mapping the transportation system in ancient Rome, visualizing the economic connections of ancient China, studying how natural languages evolve over time, or analyzing historical events.

SALIENT FEATURES OF BIG DATA
Big Data create unique features that are not shared by traditional datasets. These features pose significant challenges to data analysis and motivate the development of new statistical methods. Unlike traditional datasets where the sample size is typically larger than the dimension, Big Data are characterized by massive sample size and high dimensionality. First, we will discuss the impact of large sample size on understanding heterogeneity: on the one hand, massive sample size allows us to unveil hidden patterns associated with small subpopulations and weak commonality across the whole population. On the other hand, modeling the intrinsic heterogeneity of Big Data requires more sophisticated statistical methods. Secondly, we discuss several unique phenomena associated with high dimensionality, including noise accumulation, spurious correlation and incidental endogeneity. These unique features make traditional statistical procedures inappropriate. Unfortunately, most high-dimensional statistical techniques address only noise accumulation and spurious correlation issues, but not incidental endogeneity. They are based on exogeneity assumptions that often cannot be validated by collected data, due to incidental endogeneity.

Heterogeneity
Big Data are often created by aggregating many data sources corresponding to different subpopulations. Each subpopulation might exhibit some unique features not shared by others. In classical settings where the sample size is small or moderate, data points from small subpopulations are generally categorized as ‘outliers’, and it is hard to systematically model them due to insufficient observations. However, in the Big Data era, the large sample size enables us to better understand heterogeneity, shedding light on studies such as exploring the association between certain covariates (e.g. genes or SNPs) and rare outcomes (e.g. rare diseases or diseases in small populations) and understanding why certain treatments (e.g. chemotherapy) benefit one subpopulation and harm another. To better illustrate this point, we introduce the following mixture model for the population:

$$\lambda_1 p_1\bigl(y; \theta_1(x)\bigr) + \cdots + \lambda_m p_m\bigl(y; \theta_m(x)\bigr), \tag{1}$$

where $\lambda_j \ge 0$ represents the proportion of the $j$th subpopulation, and $p_j\bigl(y; \theta_j(x)\bigr)$ is the probability distribution of the response of the $j$th subpopulation given the covariates $x$, with $\theta_j(x)$ as the parameter vector. In practice, many subpopulations are rarely observed, i.e. $\lambda_j$ is very small. When the sample size $n$ is moderate, $n\lambda_j$ can be small, making it infeasible to infer the covariate-dependent parameters $\theta_j(x)$ due to the lack of information. However, because Big Data are characterized by large sample size $n$, the sample size $n\lambda_j$ for the $j$th subpopulation can be moderately large even if $\lambda_j$ is very small. This enables us to infer the subpopulation parameters $\theta_j(\cdot)$ more accurately. In short, the main advantage brought by Big Data is the ability to understand the heterogeneity of subpopulations, such as the benefits of certain personalized treatments, which is infeasible when the sample size is small or moderate.
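As a rough numerical illustration of this point, consider a subpopulation that makes up 0.1% of the population. The values of $\lambda_j$ and $n$ below are hypothetical and chosen only to make the arithmetic concrete:

```latex
\[
\lambda_j = 10^{-3}: \qquad
n = 10^{3} \;\Longrightarrow\; n\lambda_j = 1,
\qquad
n = 10^{7} \;\Longrightarrow\; n\lambda_j = 10^{4}.
\]
```

With about one expected observation, nothing can be learned about $\theta_j(\cdot)$; with about ten thousand, the subpopulation can be modeled in its own right.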

Big Data also allow us to unveil weak commonality across the whole population, thanks to large sample sizes. For example, the benefit to the heart of one drink of red wine per night can be difficult to assess without a large sample size. Similarly, the health risks of exposure to certain environmental factors can only be convincingly evaluated when the sample sizes are sufficiently large.

Besides the aforementioned advantages, the heterogeneity of Big Data also poses significant challenges to statistical inference. Inferring the mixture model in (1) for large datasets requires sophisticated statistical and computational methods. In low dimensions, standard techniques such as the expectation–maximization algorithm for finite mixture models can be applied. In high dimensions, however, we need to carefully regularize the estimating procedure to avoid overfitting or noise accumulation and to devise good computation algorithms [70,71].

Noise accumulation
Analyzing Big Data requires us to simultaneously estimate or test many parameters. These estimation errors accumulate when a decision or prediction rule depends on a large number of such parameters. Such a noise accumulation effect is especially severe in high dimensions and may even dominate the true signals. It is usually handled by the sparsity assumption [2,72,73].

Take high-dimensional classification for instance. Poor classification is due to the existence of many weak features that do not contribute to the reduction of classification error [4]. As an example, we consider a classification problem where the data come from two classes:

$$X_1, \ldots, X_n \sim N_d(\mu_1, I_d) \quad \text{and} \quad Y_1, \ldots, Y_n \sim N_d(\mu_2, I_d). \tag{2}$$

We want to construct a classification rule which classifies a new observation $Z \in \mathbb{R}^d$ into either the first or the second class. To illustrate the impact of noise accumulation in classification, we set $n = 100$ and $d = 1000$. We set $\mu_1 = 0$ and $\mu_2$ to be sparse, i.e. only the first 10 entries of $\mu_2$ are nonzero with value 3, and all the other entries are zero. Figure 1 plots the first two principal components obtained by using the first $m$ = 2, 40, 200 features and the whole 1000 features. As illustrated in these plots, when $m = 2$ we obtain high discriminative power. However, the discriminative power becomes very low when $m$ is too large due to noise accumulation. The first 10 features contribute to classification and the remaining features do not. Therefore, when $m > 10$, procedures do not obtain any additional signal, but accumulate noise: the larger the $m$, the more the noise accumulates, which deteriorates the classification procedure with dimensionality. For $m = 40$, the accumulated signals compensate for the accumulated noise, so that the first two principal components still have good discriminative power. When $m = 200$, the accumulated noise exceeds the signal gains.
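The effect of adding uninformative features can also be seen with a rule simpler than the principal-component plots of Figure 1. The sketch below is not the authors' experiment. It swaps in a nearest-centroid classifier on the first m features of model (2) and, as an assumption made only so that the deterioration is visible for this particular rule, sets the 10 nonzero entries of μ2 to 1 rather than 3. The function name and all defaults are hypothetical.

```python
import numpy as np


def centroid_test_error(m, n=100, d=1000, n_test=2000, signal=1.0, seed=0):
    """Test error of a nearest-centroid rule that uses only the first m features.

    Data follow model (2) with mu_1 = 0 and mu_2 nonzero in its first 10 entries.
    `signal` (the nonzero value) is an assumption of this sketch, not the paper's value.
    """
    rng = np.random.default_rng(seed)
    mu2 = np.zeros(d)
    mu2[:10] = signal

    # Training and test samples for both classes.
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal((n, d)) + mu2
    Zx = rng.standard_normal((n_test, d))
    Zy = rng.standard_normal((n_test, d)) + mu2

    # Estimated centroids on the first m features; each coordinate adds estimation noise.
    c1 = X[:, :m].mean(axis=0)
    c2 = Y[:, :m].mean(axis=0)
    w = c2 - c1
    mid = 0.5 * (c1 + c2)

    # A point is assigned to class 2 when it lies on the c2 side of the midpoint.
    wrong_x = ((Zx[:, :m] - mid) @ w > 0).mean()
    wrong_y = ((Zy[:, :m] - mid) @ w <= 0).mean()
    return 0.5 * (wrong_x + wrong_y)


if __name__ == "__main__":
    for m in (10, 40, 200, 1000):
        print(m, centroid_test_error(m))
```

Beyond m = 10 the true mean difference gains nothing, while each extra coordinate adds estimation error to both centroids, so the test error should drift upward as m grows.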

The above discussion motivates the use of sparse models and variable selection to overcome the effect of noise accumulation. For example, in the classification model (2), instead of using all the features, we could select a subset of features which attains the best signal-to-noise ratio. Such a sparse model provides improved classification performance [72,73]. In other words, variable selection plays a pivotal role in overcoming noise accumulation in classification and regression prediction. However, variable selection in high dimensions is challenging due to spurious correlation, incidental endogeneity, heterogeneity and measurement errors.

Spurious correlation
High dimensionality also brings spurious correlation, referring to the fact that many uncorrelated random variables may have high sample correlations in high dimensions. Spurious correlation may cause false scientific discoveries and wrong statistical inferences.

Consider the problem of estimating the coefficient vector $\beta$ of a linear model

$$y = X\beta + \varepsilon, \qquad \mathrm{Var}(\varepsilon) = \sigma^2 I_n, \tag{3}$$

where $y \in \mathbb{R}^n$ represents the response vector, $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ represents the design matrix, $\varepsilon \in \mathbb{R}^n$ represents an independent random noise vector and $I_n$ is the $n \times n$ identity matrix. To cope with the noise accumulation issue, when the dimension $d$ is comparable to or larger than the sample size $n$, it is popular to assume that only a small number of variables contribute to the response, i.e. $\beta$ is a sparse vector. Under this sparsity assumption, variable selection can be conducted to avoid noise accumulation, improve the performance of prediction, and enhance the interpretability of the model with a parsimonious representation.

Figure 1. Scatter plots of projections of the observed data (n = 100 from each class) onto the first two principal components of the best m-dimensional selected feature space. Projected data marked with filled circles belong to the first class, and those marked with filled triangles belong to the second class.

In high dimensions, even for a model as simple as (3), variable selection is challenging due to the presence of spurious correlation. In particular, [11] showed that, when the dimensionality is high, the important variables can be highly correlated with several spurious variables which are scientifically unrelated. We consider a simple example to illustrate this phenomenon. Let $x_1, \ldots, x_n$ be $n$ independent observations of a $d$-dimensional Gaussian random vector $X = (X_1, \ldots, X_d)^T \sim N_d(0, I_d)$. We repeatedly simulate the data with $n = 60$ and $d$ = 800 and 6400 for 1000 times. Figure 2a shows the empirical distribution of the maximum absolute sample correlation coefficient between the first variable and the remaining ones, defined as

$$r = \max_{j \ge 2} \bigl|\mathrm{Corr}(X_1, X_j)\bigr|, \tag{4}$$


Figure 2. Illustration of spurious correlation. (a) Distribution of the maximum absolute sample correlation coefficients between $X_1$ and $\{X_j\}_{j \ne 1}$. (b) Distribution of the maximum absolute sample correlation coefficients between $X_1$ and the closest linear projections of any four members of $\{X_j\}_{j \ne 1}$ to $X_1$. Here the dimension d is 800 and 6400, and the sample size n is 60. The result is based on 1000 simulations.

where $\mathrm{Corr}(X_1, X_j)$ is the sample correlation between the variables $X_1$ and $X_j$. We see that the maximum absolute sample correlation becomes higher as dimensionality increases.
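The simulation behind (4) is straightforward to reproduce in outline. The following sketch uses NumPy and draws independent standard Gaussian data as described above. The function names and the reduced default number of repetitions (50 instead of 1000, to keep the run short) are choices of this sketch, not of the paper.

```python
import numpy as np


def max_abs_sample_corr(n, d, rng):
    """r in (4): largest |sample correlation| between column 1 and columns 2..d."""
    X = rng.standard_normal((n, d))
    Xc = X - X.mean(axis=0)              # center each variable
    Xc /= np.linalg.norm(Xc, axis=0)     # scale to unit norm, so inner products are correlations
    return float(np.max(np.abs(Xc[:, 1:].T @ Xc[:, 0])))


def simulate_r(n=60, d=800, reps=50, seed=0):
    """Repeat the experiment `reps` times and return the simulated values of r."""
    rng = np.random.default_rng(seed)
    return np.array([max_abs_sample_corr(n, d, rng) for _ in range(reps)])


if __name__ == "__main__":
    for d in (800, 6400):
        r = simulate_r(d=d)
        print(f"d = {d}: mean of r over {len(r)} runs = {r.mean():.3f}")
```

Although every pair of variables is independent by construction, the maximum is taken over thousands of candidates, so the simulated values of r should sit well away from zero and shift upward as d grows.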

Furthermore, we can compute the maximum absolute multiple correlation between $X_1$ and linear combinations of several irrelevant spurious variables:

$$R = \max_{|S| = 4} \; \max_{\{\beta_j\}_{j=1}^{4}} \left| \mathrm{Corr}\Bigl(X_1, \sum_{j \in S} \beta_j X_j\Bigr) \right|. \tag{5}$$

Using the same configuration as in Fig. 2a, Fig. 2b plots the empirical distribution of the maximum absolute sample correlation coefficient between $X_1$ and $\sum_{j \in S} \beta_j X_j$, where $S$ is any size-four subset of $\{2, \ldots, d\}$ and $\beta_j$ is the least-squares regression coefficient of $X_j$ when regressing $X_1$ on $\{X_j\}_{j \in S}$. Again, we see that even though $X_1$ is utterly independent of $X_2, \ldots, X_d$, the correlation between $X_1$ and the closest linear combination of any four variables of $\{X_j\}_{j \ne 1}$ to $X_1$ can be very high. We refer to [14] and [74] for more theoretical results on characterizing the orders of $r$.
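Evaluating (5) exactly would require a search over all size-four subsets, which is infeasible for d in the thousands. The text does not say how the maximum was computed, so the sketch below swaps in a greedy forward selection, which yields only a lower bound on R. The function name is hypothetical and the approach is an assumption of this example.

```python
import numpy as np


def greedy_multiple_corr(X, k=4):
    """Lower bound on R in (5): greedily pick k columns that best explain column 0.

    Returns the sample multiple correlation between column 0 and the least-squares
    fit (with intercept) on the selected columns.
    """
    Xc = X - X.mean(axis=0)
    y = Xc[:, 0].copy()
    Z = Xc[:, 1:].copy()        # candidates, kept orthogonal to the already-selected columns
    tss = y @ y
    resid = y.copy()
    for _ in range(k):
        norms = np.linalg.norm(Z, axis=0)
        norms[norms < 1e-12] = np.inf            # already-selected columns are (numerically) zero
        scores = np.abs(Z.T @ resid) / norms     # proportional to |partial correlation| with the residual
        j = int(np.argmax(scores))
        q = Z[:, j] / norms[j]
        resid = resid - (q @ resid) * q          # update the residual of column 0
        Z = Z - np.outer(q, q @ Z)               # orthogonalize the remaining candidates against q
    return float(np.sqrt(1.0 - (resid @ resid) / tss))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 800))
    print(greedy_multiple_corr(X, k=4))
```

Because the first greedy step picks the variable with the largest absolute sample correlation, the returned value is never smaller than r in (4) computed on the same data.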

The spurious correlation has a significant impact on variable selection and may lead to false scientific discoveries. Let $X_S = (X_j)_{j \in S}$ be the sub-random vector indexed by $S$ and let $\hat S$ be the selected set that has the highest spurious correlation with $X_1$ as in Fig. 2. For example, when $n = 60$ and $d = 6400$, we see that $X_1$ is practically indistinguishable from $X_{\hat S}$ for a set $\hat S$ with $|\hat S| = 4$. If $X_1$ represents the expression level of a gene that is responsible for a disease, we cannot distinguish it from the other four genes in $\hat S$ that have a similar predictive power although they are scientifically irrelevant.

Besides variable selection, spurious correlation may also lead to wrong statistical inference. We explain this by considering again the same linear model as in (3). Here we would like to estimate the standard error $\sigma$ of the residual, which is prominently featured in statistical inferences of regression coefficients, model selection, goodness-of-fit tests and marginal regression. Let $\hat S$ be a set of selected variables and $P_{\hat S}$ be the projection matrix onto the column space of $X_{\hat S}$. The standard residual variance estimator, based on the selected variables, is

$$\hat\sigma^2 = \frac{y^T (I_n - P_{\hat S})\, y}{n - |\hat S|}. \tag{6}$$

The estimator (6) is unbiased when the variables are not selected by data and the model is correct. However, the situation is completely different when the variables are selected based on data. In particular, the authors of [14] showed that when there are many spurious variables, $\sigma^2$ is seriously underestimated, which leads further to wrong statistical inferences, including model selection or significance tests, and to false scientific discoveries such as finding the wrong genes for molecular mechanisms. They also proposed a refitted cross-validation method to attenuate the problem.
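The downward bias of (6) under data-driven selection can be seen in a toy setting. The sketch below is an illustration under simplifying assumptions that are not in the paper: the response is pure noise with $\sigma^2 = 1$, exactly one column is used ($|\hat S| = 1$, no intercept), and the "selected" column is simply the one that fits best. All function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_variance(y, x):
    """Estimator (6) for a single column x, i.e. |S| = 1:
    y^T (I_n - P_S) y / (n - 1), with P_S the projection onto span{x}."""
    n = len(y)
    rss = dot(y, y) - dot(x, y) ** 2 / dot(x, x)
    return rss / (n - 1)


def compare(n=60, d=500, reps=20, seed=1):
    """Average variance estimate for a pre-specified column versus the
    column chosen by the data (the one with the smallest residual variance)."""
    rng = random.Random(seed)
    fixed_sum = selected_sum = 0.0
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # pure noise, true sigma^2 = 1
        cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]
        estimates = [residual_variance(y, x) for x in cols]
        fixed_sum += estimates[0]       # variable fixed before seeing the data
        selected_sum += min(estimates)  # variable picked because it fits best
    return fixed_sum / reps, selected_sum / reps
```

In this setup `compare()` returns a pair `(fixed, selected)`. The first entry estimates $\sigma^2$ without selection; the second is never larger, because the spurious variable that best explains the noise absorbs part of it.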


Incidental endogeneity

Incidental endogeneity is another subtle issue raised by high dimensionality. In a regression setting $Y = \sum_{j=1}^d \beta_j X_j + \varepsilon$, the term ‘endogeneity’ [75] means that some predictors $\{X_j\}$ correlate with the residual noise $\varepsilon$. The conventional sparse model assumes

$$Y = \sum_j \beta_j X_j + \varepsilon, \quad \text{and} \quad E(\varepsilon X_j) = 0 \ \text{ for } j = 1, \ldots, d, \tag{7}$$

with a small set $S = \{j : \beta_j \ne 0\}$. The exogenous assumption in (7), that the residual noise $\varepsilon$ is uncorrelated with all the predictors, is crucial for the validity of most existing statistical procedures, including variable selection consistency. Though this assumption looks innocent, it is easily violated in high dimensions, as some of the variables $\{X_j\}$ are incidentally correlated with $\varepsilon$, making most high-dimensional procedures statistically invalid.

To explain the endogeneity problem in more detail, suppose that, unknown to us, the response $Y$ is related to three covariates as follows:

$$Y = X_1 + X_2 + X_3 + \varepsilon, \quad \text{with } E\varepsilon X_j = 0 \ \text{ for } j = 1, 2, 3.$$

In the data-collection stage, we do not know the true model, and therefore collect as many covariates that are potentially related to $Y$ as possible, in the hope of including all members of $S$ in (7). Incidentally, some of those $X_j$s (for $j \ne 1, 2, 3$) might be correlated with the residual noise $\varepsilon$. This invalidates the exogenous modeling assumption in (7). In fact, the more covariates are collected or measured, the harder it is for this assumption to be satisfied.

Unlike spurious correlation, incidental endogeneity refers to the genuine existence of unintended correlations between variables; both phenomena are due to high dimensionality. The former is analogous to finding two persons who look alike but have no genetic relation, whereas the latter is similar to bumping into an acquaintance, both easily occurring in a big city. More generally, endogeneity occurs as a result of selection biases, measurement errors and omitted variables. These phenomena arise frequently in the analysis of Big Data, mainly due to two reasons:

• With the benefit of new high-throughput measurement techniques, scientists are able to and tend to collect as many features as possible. This accordingly increases the possibility that some of them might be correlated with the residual noise, incidentally.
• Big Data are usually aggregated from multiple sources with potentially different data-generating schemes. This increases the possibility of selection bias and measurement errors, which also cause potential incidental endogeneity.

Does incidental endogeneity appear in real datasets, and how can we test for it in practice? We consider a genomics study in which 148 microarray samples are downloaded from the GEO database and ArrayExpress [76]. These samples are created under the Affymetrix HGU133a platform for human subjects with prostate cancer. The obtained dataset contains 22 283 probes, corresponding to 12 719 genes. In this example, we are interested in the gene named ‘Discoidin domain receptor family, member 1’ (abbreviated as DDR1). DDR1 encodes receptor tyrosine kinases, which play an important role in the communication of cells with their microenvironment. DDR1 is known to be highly related to prostate cancer [77] and we wish to study its association with other genes in patients with prostate cancer. We took the gene expressions of DDR1 as the response variable $Y$ and the expressions of all the remaining 12 718 genes as predictors. The left panel of Fig. 3 draws the empirical distribution of the correlations between the response and individual predictors.

To illustrate the existence of endogeneity, we fit an $L_1$-penalized least-squares regression (Lasso) on the data, with the penalty selected automatically via 10-fold cross validation (37 genes are selected). We then refit an ordinary least-squares regression on the selected model to calculate the residual vector. In the right panel of Fig. 3, we plot the empirical distribution of the correlations between the predictors and the residuals. We see that the residual noise is highly correlated with many predictors. To make sure these correlations are not purely caused by spurious correlation, we introduce a ‘null distribution’ of the spurious correlations by randomly permuting the order of the rows in the design matrix, such that the predictors are indeed independent of the residual noise. By comparing the two distributions, we see that the distribution of correlations between predictors and residual noise on the raw data (labeled ‘raw data’) has a heavier tail than that on the permuted data (labeled ‘permuted data’). This result provides stark evidence of endogeneity in these data.
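The permutation diagnostic described above can be illustrated on synthetic data. The sketch below is a toy version under assumptions that differ from the paper's analysis: the data are simulated rather than gene expressions, the fitted model is a one-variable ordinary least-squares regression instead of a Lasso fit followed by a refit, and the null distribution is obtained by shuffling the residuals, which is equivalent to permuting the rows of the design matrix. The function names are hypothetical.

```python
import math
import random


def corr(u, v):
    """Pearson sample correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def endogeneity_check(n=100, d=50, n_endogenous=5, seed=2):
    """Return (largest |corr| with residuals on raw data,
               largest |corr| with residuals after permutation)."""
    rng = random.Random(seed)
    eps = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    y = [a + e for a, e in zip(x1, eps)]  # true model: Y = X1 + eps
    # Extra covariates; the first n_endogenous are correlated with eps.
    extra = []
    for j in range(d):
        z = [rng.gauss(0, 1) for _ in range(n)]
        if j < n_endogenous:
            z = [0.8 * e + 0.6 * zi for e, zi in zip(eps, z)]
        extra.append(z)
    # OLS of Y on X1 (no intercept) and its residuals.
    b = sum(a * c for a, c in zip(x1, y)) / sum(a * a for a in x1)
    resid = [c - b * a for a, c in zip(x1, y)]
    raw = [abs(corr(x, resid)) for x in extra]
    # Null distribution: break the pairing between covariates and residuals.
    shuffled = resid[:]
    rng.shuffle(shuffled)
    null = [abs(corr(x, shuffled)) for x in extra]
    return max(raw), max(null)
```

When some covariates are genuinely correlated with the noise, the raw correlations have a heavier tail than the permuted ones, which is the pattern the right panel of Fig. 3 is reported to show.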

The above discussion shows that incidental endogeneity is likely to be present in Big Data. The problem of dealing with endogenous variables is not well understood in high-dimensional statistics. What is the consequence of this endogeneity? The authors of [16] showed that endogeneity causes inconsistency in model selection. In particular, they provided a thorough analysis to illustrate the impact of endogeneity on high-dimensional statistical inference and proposed alternative methods to conduct linear regression with consistency guarantees under weaker conditions. See also the following section.

Figure 3. Illustration of incidental endogeneity on microarray gene expression data. Left panel: the distribution of the sample correlation $\mathrm{Corr}(X_j, Y)$ ($j = 1, \ldots, 12\,718$). Right panel: the distribution of the sample correlation $\mathrm{Corr}(X_j, \varepsilon)$. Here $\varepsilon$ represents the residual noise after the Lasso fit. We provide the distributions of the sample correlations using both the raw data and permuted data.

IMPACT ON STATISTICAL THINKING

As has been shown in the previous section, massive sample size and high dimensionality bring heterogeneity, noise accumulation, spurious correlation and incidental endogeneity. These features of Big Data make traditional statistical methods invalid. In this section, we introduce new statistical methods that can handle these challenges. For an overview, see [72] and [73].

Penalized quasi-likelihood

To handle the noise-accumulation issue, we assume that the model parameter $\beta$ as in (3) is sparse. The classical model selection theory, according to [78], suggests choosing a parameter vector $\beta$ that minimizes the negative penalized quasi-likelihood:

$$-\mathrm{QL}(\beta) + \lambda \|\beta\|_0, \tag{8}$$

where $\mathrm{QL}(\beta)$ is the quasi-likelihood of $\beta$ and $\|\cdot\|_0$ represents the $L_0$ pseudo-norm (i.e. the number of nonzero entries in a vector). Here $\lambda > 0$ is a regularization parameter that controls the bias-variance tradeoff. The solution to the optimization problem in (8) has nice statistical properties [79]. However, it is essentially a combinatorial optimization problem and does not scale to large-scale problems.

The estimator in (8) can be extended to a more general form

$$\ell_n(\beta) + \sum_{j=1}^d P_{\lambda,\gamma}(\beta_j), \tag{9}$$

where the term $\ell_n(\beta)$ measures the goodness of fit of the model with parameter $\beta$ and $\sum_{j=1}^d P_{\lambda,\gamma}(\beta_j)$ is a sparsity-inducing penalty, in which $\lambda$ is again the tuning parameter that controls the bias-variance tradeoff and $\gamma$ is a possible fine-tuning parameter which controls the degree of concavity of the penalty function [8]. Popular choices of the penalty function $P_{\lambda,\gamma}(\cdot)$ include the hard-thresholding penalty [80,81], the soft-thresholding penalty [6,82], the smoothly clipped absolute deviation (SCAD, [8]) and the minimax concavity penalty (MCP, [10]). Figure 4 visualizes these penalty functions for $\lambda = 1$. We see that all penalty functions are folded concave, but the soft-thresholding ($L_1$-)penalty is also convex. The parameter $\gamma$ in SCAD and MCP controls the degree of concavity. From Fig. 4c and d, we see that a smaller value of $\gamma$ results in more concave penalties. When $\gamma$ becomes larger, SCAD and MCP converge to the soft-thresholding penalty. MCP is a generalization of the hard-thresholding penalty, which corresponds to $\gamma = 1$.
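For readers who want to reproduce curves like those in Fig. 4, the penalties can be written out explicitly. The formulas below are the standard textbook forms of SCAD and MCP and are stated here as an assumption, since the section itself does not spell them out; in particular, the hard-thresholding penalty is scaled so that it coincides with MCP at $\gamma = 1$, as the text indicates, and other references use a different constant factor. The default values of `gamma` are conventional choices, not values taken from the paper.

```python
def l1_penalty(t, lam):
    """Soft-thresholding (L1) penalty: lam * |t|."""
    return lam * abs(t)


def scad_penalty(t, lam, gamma=3.7):
    """SCAD penalty (requires gamma > 2): linear near zero, quadratic
    in the middle, constant beyond gamma * lam."""
    a = abs(t)
    if a <= lam:
        return lam * a
    if a <= gamma * lam:
        return (2 * gamma * lam * a - a * a - lam * lam) / (2 * (gamma - 1))
    return lam * lam * (gamma + 1) / 2


def mcp_penalty(t, lam, gamma=3.0):
    """Minimax concave penalty (requires gamma > 0): quadratic up to
    gamma * lam, constant afterwards."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2 * gamma)
    return gamma * lam * lam / 2


def hard_penalty(t, lam):
    """Hard-thresholding penalty, scaled to equal MCP with gamma = 1."""
    return mcp_penalty(t, lam, gamma=1.0)
```

Evaluating these functions on a grid of `t` values with `lam = 1` and several values of `gamma` shows the behavior described above: small `gamma` gives strongly concave penalties, and very large `gamma` makes both SCAD and MCP approach `l1_penalty`.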


Figure 4. Visualization of the penalty functions. In all cases, $\lambda = 1$. For SCAD and MCP, different values of $\gamma$ are chosen as shown in the graphs.

How shall we choose among these penalty functions? In applications, we recommend using either SCAD or MCP thresholding, since they combine the advantages of both hard- and soft-thresholding operators. Many efficient algorithms have been proposed for solving the optimization problem in (9) with the above four penalties. See the ‘Impact on computational methods’ section.

Sparsest solution in high-confidence set

The penalized quasi-likelihood estimator (9) is somewhat mysterious. A closely related method is the sparsest solution in high-confidence set, introduced in the recent book chapter [17], which has much better statistical intuition. It is a generally applicable principle that separates the data information and the sparsity assumption.

Suppose that the data information is summarized by the function $\ell_n(\beta)$ in (9). This can be a likelihood, quasi-likelihood or loss function. The underlying parameter vector $\beta_0$ usually satisfies $\ell'(\beta_0) = 0$, where $\ell'(\cdot)$ is the gradient vector of the expected loss function $\ell(\beta) = E\ell_n(\beta)$. Thus, a natural confidence set for $\beta_0$ is

$$C_n = \{\beta \in \mathbb{R}^d : \|\ell_n'(\beta)\|_\infty \le \gamma_n\}, \tag{10}$$


where $\|\cdot\|_\infty$ is the $L_\infty$-norm of a vector and $\gamma_n$ is chosen so that we have confidence level at least $1 - \delta_n$, namely

$$P(\beta_0 \in C_n) = P\{\|\ell_n'(\beta_0)\|_\infty \le \gamma_n\} \ge 1 - \delta_n. \tag{11}$$

The confidence set $C_n$ is called a high-confidence set since $\delta_n \to 0$. In theory, we can take any norm in constructing the high-confidence set. We opt for the $L_\infty$ norm, as it produces a convex confidence set $C_n$ when $\ell_n(\cdot)$ is convex.

The high-confidence set is a summary of the information we have for the parameter vector $\beta_0$. On its own, it is not informative in high-dimensional space. Take, for example, the linear model (3) with the quadratic loss $\ell_n(\beta) = \|y - X\beta\|_2^2$. The high-confidence set is then

$$C_n = \{\beta \in \mathbb{R}^d : \|X^T(y - X\beta)\|_\infty \le \gamma_n\},$$

where we take $\gamma_n \ge \|X^T\varepsilon\|_\infty$, so that $\delta_n = 0$. If in addition $\beta_0$ is assumed to be sparse, then a natural solution is the intersection of these two pieces of information, namely, finding the sparsest solution in the high-confidence set:

$$\min_{\beta \in C_n} \|\beta\|_1 = \min_{\|\ell_n'(\beta)\|_\infty \le \gamma_n} \|\beta\|_1. \tag{12}$$

This is a convex optimization problem when $\ell_n(\cdot)$ is convex. For the linear model with the quadratic loss, it reduces to the Dantzig selector [9].
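The constraint that defines the high-confidence set for the linear model is easy to check numerically. The sketch below does not solve (12), which would require a linear-programming solver; it only tests membership in $C_n$ on simulated data. The helper names and the toy design are assumptions made for illustration, and $\gamma_n$ is set just above $\|X^T\varepsilon\|_\infty$, which is possible here only because the simulated noise is known.

```python
import random


def xt_times(X, v):
    """Return X^T v for X stored as a list of n rows of length d."""
    d = len(X[0])
    return [sum(X[i][j] * v[i] for i in range(len(X))) for j in range(d)]


def sup_norm_score(X, y, beta):
    """|| X^T (y - X beta) ||_inf, the quantity bounded by gamma_n in C_n."""
    resid = [yi - sum(a * b for a, b in zip(row, beta)) for row, yi in zip(X, y)]
    return max(abs(g) for g in xt_times(X, resid))


def in_confidence_set(X, y, beta, gamma_n):
    return sup_norm_score(X, y, beta) <= gamma_n


def toy_example(n=30, d=100, seed=3):
    """Simulate a sparse linear model with d > n and a valid gamma_n."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    beta0 = [2.0, -1.5, 1.0] + [0.0] * (d - 3)  # sparse truth
    eps = [rng.gauss(0, 1) for _ in range(n)]
    y = [sum(a * b for a, b in zip(row, beta0)) + e for row, e in zip(X, eps)]
    # Slightly inflate ||X^T eps||_inf to absorb floating-point rounding.
    gamma_n = 1.001 * max(abs(g) for g in xt_times(X, eps))
    return X, y, beta0, gamma_n
```

With this choice of $\gamma_n$, the true $\beta_0$ lies in $C_n$ by construction, because $X^T(y - X\beta_0) = X^T\varepsilon$. The sparsity assumption is then what singles out a useful element of the set.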

There is much flexibility in defining the sparsest solution in high-confidence set. First of all, we have a choice of the loss function $\ell_n(\cdot)$. We can regard $\ell_n'(\beta) = 0$ as the estimating equations [83] and define the high-confidence set (10) directly from the estimating equations. Secondly, we have many ways to measure the sparsity. For example, we can use a weighted $L_1$-norm to measure the sparsity of $\beta$ in (12). By proper choices of the estimating equations in (10) and the measure of sparsity in (12), the authors of [17] showed that many useful procedures can be regarded as the sparsest solution in the high-confidence set. For example, CLIME [84], for estimating a sparse precision matrix in the Gaussian graphical model, and the linear programming discriminant rule [85], for sparse high-dimensional classification, are both sparsest solutions in the high-confidence set. The authors of [17] also provided a general convergence theory for such a procedure under a condition similar to the restricted eigenvalue condition in [86]. Finally, the idea is applicable to problems with measurement errors or even endogeneity. In this case, the high-confidence set will be defined accordingly to accommodate the measurement errors or endogeneity. See, for example, [87].

Independence screening

An effective variable screening technique based on marginal screening has been proposed by the authors of [11]. They aim at handling ultra-high-dimensional data for which the aforementioned penalized quasi-likelihood estimators become computationally infeasible. For such cases, the authors of [11] proposed to first use marginal regression to screen variables, reducing the original large-scale problem to a moderate-scale statistical problem, so that more sophisticated methods for variable selection can be applied. The proposed method, named sure independence screening, is computationally very attractive. It has been shown to possess the sure screening property and to have some theoretical advantages over Lasso [13,88].

There are two main ideas behind sure independence screening: (i) it uses the marginal contribution of a covariate to probe its importance in the joint model; and (ii) instead of selecting the most important variables, it aims at removing variables that are not important. For example, assuming each covariate has been standardized, we denote by $\hat\beta^M_j$ the estimated regression coefficient in a univariate regression model. The set of covariates that survive the marginal screening is defined as

$$\hat S = \{j : |\hat\beta^M_j| \ge \delta\} \tag{13}$$

for a given threshold $\delta$. One can also measure the importance of a covariate $X_j$ by using its deviance reduction. For the least-squares problem, both methods reduce to ranking the importance of predictors by the magnitudes of their marginal correlations with the response $Y$. The authors of [11] and [88] gave conditions under which the sure screening property can be established and false selection rates are controlled.
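For the least-squares case, the screening rule (13) amounts to ranking covariates by absolute marginal correlation. The following is a minimal sketch under stated assumptions: it keeps a fixed number `keep` of top-ranked covariates, which corresponds to choosing the threshold $\delta$ as the `keep`-th largest absolute correlation, and it uses a simulated design with three active variables. The function names and the toy data are hypothetical.

```python
import math
import random


def abs_corr(u, v):
    """Absolute Pearson sample correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return abs(suv) / math.sqrt(suu * svv)


def sure_independence_screening(columns, y, keep):
    """Indices of the `keep` covariates with the largest |marginal correlation|."""
    scores = [abs_corr(x, y) for x in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:keep])


def toy_data(n=200, d=300, seed=4):
    """Y depends on the first three of d independent covariates."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
    y = [3 * columns[0][i] + 3 * columns[1][i] + 3 * columns[2][i] + rng.gauss(0, 1)
         for i in range(n)]
    return columns, y
```

Each covariate is touched once, so the cost grows linearly with the number of covariates. A more refined selection method can then be run on the much smaller surviving set.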

Since the computational complexity of sure screening scales linearly with the problem size, the idea of sure screening is very effective in dramatically reducing the computational burden of Big Data analysis. It has been extended in various directions. For example, generalized correlation screening was used in [12], nonparametric screening was proposed by [89] and principled sure independence screening was introduced in [90]. In addition, the authors of [91] utilized the distance correlation to conduct screening, [92] employed rank correlation and [28] proposed an iterative screening and selection method.

Independence screening never examines the multivariate effect of variables on the response variable, nor does it use the covariance matrix of the variables. An extension of this is to use multivariate screening, which examines the contributions of small groups of variables together. This allows us to examine the synergy of small groups of variables with respect to the response variable. However, even bivariate screening already involves $O(d^2)$ submodels, which can be computationally prohibitive. Covariance-assisted screening and estimation in [93] can be adapted here to avoid examining all bivariate or multivariate submodels. Another possible extension is to develop conditional screening techniques, which rank variables according to their conditional contributions given a set of variables.

Dealing with incidental endogeneity

Big Data are prone to incidental endogeneity that makes the most popular regularization methods invalid. It is accordingly important to develop methods that can handle endogeneity in high dimensions. More specifically, let us consider the high-dimensional linear regression model (7). The authors of [16] showed that for any penalized estimators to be variable selection consistent, a necessary condition is

$$E(\varepsilon X_j) = 0 \quad \text{for } j = 1, \ldots, d. \tag{14}$$

As discussed in the ‘Salient features of Big Data’ section, the condition in (14) is too restrictive for real-world applications. Letting $S = \{j : \beta_j \ne 0\}$ be the set of important variables, with non-vanishing components in $\beta$, a more realistic model assumption should be

$$E(\varepsilon \mid \{X_j\}_{j \in S}) = E\Big(Y - \sum_{j \in S} \beta_j X_j \,\Big|\, \{X_j\}_{j \in S}\Big) = 0. \tag{15}$$

The authors of [16] considered an even weaker version of Equation (15), called the ‘over-identification’ condition:

$$E\varepsilon X_j = 0 \quad \text{and} \quad E\varepsilon X_j^2 = 0 \quad \text{for } j \in S. \tag{16}$$

Under condition (16), the authors of [16] showed that the classical penalized least-squares methods, such as Lasso, SCAD and MCP, are no longer consistent. Instead, they introduced the focused generalized method of moments (FGMM), which utilizes the over-identification conditions, and proved that the FGMM consistently selects the set of variables $S$. We do not go into the technical details here but illustrate this by an example.

We continue to explore the gene expression data in the ‘Incidental endogeneity’ section. We again treat gene DDR1 as the response and the other genes as predictors, and apply the FGMM instead of Lasso. By cross validation, the FGMM selects 18 genes. The left panel of Fig. 5 shows the distribution of the sample correlations between the genes $X_j$ ($j = 1, \ldots, 12\,718$) and the residuals $\varepsilon$ after the FGMM fit. Here we find that many correlations are nonzero, but this does not matter, because we require only (16). To verify (16), the right panel of Fig. 5 shows the distribution of the sample correlations between the 18 selected genes (and their squares) and the residuals. The sample correlations between the selected genes and residuals are zero, and the sample correlations between the squared covariates and residuals are small. Therefore, the modeling assumption is consistent with our model diagnostics.

IMPACT ON COMPUTING INFRASTRUCTURE

The massive sample size of Big Data fundamentally challenges the traditional computing infrastructure. In many applications, we need to analyze internet-scale data containing billions or even trillions of data points, which makes even a linear pass over the whole dataset unaffordable. In addition, such data could be highly dynamic and infeasible to store in a centralized database. The fundamental approach to storing and processing such data is to divide and conquer. The idea is to partition a large problem into more tractable and independent subproblems. Each subproblem is tackled in parallel by different processing units. Intermediate results from each individual worker are then combined to yield the final output. On a small scale, such a divide-and-conquer paradigm can be implemented by either multi-core computing or grid computing. However, on a very large scale, it poses fundamental challenges to computing infrastructure. For example, when millions of computers are connected to scale out to large computing tasks, it is quite likely that some computers will fail during the computation. In addition, given a large computing task, we want to distribute it evenly to many computers and keep the workload balanced. Designing very large scale, highly adaptive and fault-tolerant computing systems is extremely challenging and motivates the development of new and reliable computing infrastructure that supports massively parallel data storage and processing. In this section, we take Hadoop as an example to introduce basic software and programming infrastructure for Big Data processing.


Figure 5. Diagnostics of the modeling assumptions of the FGMM on microarray gene expression data. Left panel: distribution of the sample correlations $\mathrm{Corr}(X_j, \varepsilon)$ ($j = 1, \ldots, 12\,718$). Right panel: distribution of the sample correlations $\mathrm{Corr}(X_j, \varepsilon)$ and $\mathrm{Corr}(X_j^2, \varepsilon)$ for only the 18 selected genes. Here $\varepsilon$ represents the residual noise after the FGMM fit.

Hadoop is a Java-based software framework for distributed data management and processing. It contains a set of open-source libraries for distributed computing using the MapReduce programming model and its own distributed file system called HDFS. Hadoop automatically facilitates scalability and takes care of detecting and handling failures. Core Hadoop has two key components:

Core Hadoop = Hadoop distributed file system (HDFS) + MapReduce

• HDFS is a self-healing, high-bandwidth, clustered storage file system, and
• MapReduce is a distributed programming model developed by Google.

We start by explaining HDFS and MapReduce in the following two subsections. Besides these two key components, a typical Hadoop release contains many other components. For example, as is shown in Fig. 6, Cloudera’s open-source Hadoop distribution also includes HBase, Hive, Pig, Oozie, Flume and Sqoop. More details about these extra components are provided in the online Cloudera technical documents. After introducing Hadoop, we also briefly explain the concepts of cloud computing in the ‘Cloud computing’ section.

Hadoop distributed file system

HDFS is a distributed file system designed to host and provide high-throughput access to large datasets which are redundantly stored across multiple machines. In particular, it ensures Big Data’s durability to failure and high availability to parallel applications.

Figure 6. An illustration of Cloudera’s open-source Hadoop distribution (source: Cloudera website).

Figure 7. An illustration of the HDFS architecture.

As a motivating application, suppose we have a large data file containing billions of records, and we want to query this file frequently. If many queries are submitted simultaneously (e.g. the Google search engine), the usual file system is not suitable due to the I/O limit. HDFS solves this problem by dividing a large file into small blocks and storing them on different machines. Each machine is called a DataNode. Unlike most block-structured file systems, which use a block size on the order of 4 or 8 KB, the default block size in HDFS is 64 MB, which allows HDFS to reduce the amount of metadata storage required per file. Furthermore, HDFS allows for fast streaming reads of data by keeping large amounts of data sequentially laid out on the hard disk. The main tradeoff of this decision is that HDFS expects the data to be read sequentially (instead of being read in a random access fashion).
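The effect of the block size on the amount of metadata can be made concrete with a little arithmetic. The snippet below is only a back-of-the-envelope sketch: the 1 TiB file size is an arbitrary example, the function name is hypothetical, and block counts are used as a rough proxy for the number of block locations that must be tracked.

```python
KIB, MIB, TIB = 2 ** 10, 2 ** 20, 2 ** 40


def num_blocks(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


blocks_64mb = num_blocks(TIB, 64 * MIB)  # HDFS default block size quoted in the text
blocks_4kb = num_blocks(TIB, 4 * KIB)    # typical block size of a local file system
ratio = blocks_4kb // blocks_64mb        # how many times more blocks to track
```

For the same file, the 4 KB layout requires `ratio` times as many block entries as the 64 MB layout, which is why the larger block size keeps the per-file metadata small.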

The data in HDFS can be accessed via a ‘write once and read many’ approach. The metadata structures (e.g. the file names and directories) are allowed to be simultaneously modified by many clients. It is important that this meta information is always synchronized and stored reliably. All the metadata are maintained by a single machine, called the NameNode. Because of the relatively low amount of metadata per file (it only tracks file names, permissions and the locations of each block of each file), all such information can be stored in the main memory of the NameNode machine, allowing fast access to the metadata. An illustration of the whole HDFS architecture is provided in Fig. 7.

To access or manipulate a data file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file. These locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel. The NameNode is not directly involved in this bulk data transfer, keeping its workload to a minimum. HDFS has a built-in redundancy and replication feature which ensures that any failure of individual machines can be recovered without any loss of data (e.g. each block is stored in three copies by default). HDFS automatically balances its load whenever a new DataNode is added to the cluster. We also need to safely store the NameNode information by creating multiple redundant systems, which allows the important metadata of the file system to be recovered even if the NameNode itself crashes.

MapReduce

MapReduce is a programming model for processing large datasets in a parallel fashion. We use an example to explain how MapReduce works. Suppose we are given a symbol sequence (e.g. ‘ATGCCAATCGATGGGACTCC’), and the task is to write a program that counts the number of occurrences of each symbol. The simplest idea is to read the symbols one at a time and maintain a hash table whose keys are the symbols and whose values are their numbers of occurrences. If the symbol is not in the hash table yet, then add the symbol as a new key to the hash table and set the corresponding value to 1. If the symbol is already in the hash table, then increase the value by 1. This program runs in a serial fashion and the time complexity scales linearly with the length of the symbol sequence. Everything looks simple so far. However, imagine that instead of a simple sequence, we need to count the number of symbols in the whole genomes of many biological subjects. Serial processing of such a huge amount of information is time consuming. So, the question is how we can use parallel processing units to speed up the computation.
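The serial procedure just described is a single pass over the sequence with a hash table. A minimal sketch follows; the function name is hypothetical and a Python `dict` plays the role of the hash table.

```python
def count_symbols(sequence):
    """Serial symbol counting with a hash table (one pass, linear time)."""
    counts = {}
    for symbol in sequence:
        if symbol in counts:
            counts[symbol] += 1  # symbol already seen: increase its value by 1
        else:
            counts[symbol] = 1   # new symbol: add it as a key with value 1
    return counts


counts = count_symbols("ATGCCAATCGATGGGACTCC")
```

The running time is proportional to the length of the input, which is exactly what becomes a bottleneck when the input is many whole genomes rather than twenty symbols.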


Figure 8. An illustration of the MapReduce paradigm for the symbol counting task. Mappers are applied to every element of the input sequences and emit intermediate (key, value)-pairs. Reducers are applied to all values associated with the same key. Between the map and reduce stages are some intermediate steps involving distributed sorting and grouping.

The idea of MapReduce is illustrated in Fig. 8. We initially split the original sequence into several files (e.g. two files in this case). We further split each file into several subsequences (e.g. two subsequences in this case) and ‘map’ the number of each symbol in each subsequence. The outputs of the mapper are (key, value)-pairs. We then gather together all output pairs of the mappers with the same key. Finally, we use a ‘reduce’ function to combine the values for each key. This gives the desired output:

#A = 5, #T = 4, #G = 5, #C = 6.

The Hadoop MapReduce contains three stages, which are listed as follows.

First stage: mapping. The first stage of a MapReduce program is called mapping. In this stage, a list of data elements is provided to a ‘mapper’ function to be transformed into (key, value)-pairs. For example, in the above symbol-counting problem, the mapper function simply transforms each symbol into the pair (symbol, 1). The mapper function does not modify the input data, but simply returns a new output list.

Intermediate stages: shuffling and sorting. After the mapping stage, the program exchanges the intermediate outputs from the mapping stage to different ‘reducers’. This process is called shuffling. A different subset of the intermediate key space is assigned to each reduce node. These subsets (known as ‘partitions’) are the inputs to the next reducing step. Each map task may send (key, value)-pairs to any partition. All pairs with the same key are always grouped together on the same reducer regardless of which mappers they are coming from. Each reducer may process several sets of pairs with different keys. In this case, different keys on a single node are automatically sorted before they are fed into the next reducing step.

Final stage: reducing. In the final reducing stage, an instance of user-provided code is called for each key in the partition assigned to a reducer. The inputs are a key and an iterator over all the values associated with the key. The values returned by the iterator may be in an undefined order. In particular, we have one output file per executed reduce task.
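The three stages can be mimicked in a single process to make the data flow explicit. The sketch below is not Hadoop code and does not use any Hadoop API: the function names, the number of chunks and reducers, and the rule that assigns a key to a partition (`ord(key) % n_reducers`) are all illustrative assumptions, and the mappers run one after another instead of in parallel.

```python
from collections import defaultdict


def mapper(chunk):
    """Map stage: emit one (symbol, 1) pair per input element."""
    return [(symbol, 1) for symbol in chunk]


def shuffle(mapped_outputs, n_reducers):
    """Shuffle/sort stage: route every pair to the partition that owns its
    key, group the values by key and sort the keys within each partition."""
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for pairs in mapped_outputs:
        for key, value in pairs:
            partitions[ord(key) % n_reducers][key].append(value)
    return [dict(sorted(p.items())) for p in partitions]


def reducer(key, values):
    """Reduce stage: combine all values that share the same key."""
    return key, sum(values)


def map_reduce(sequence, n_chunks=4, n_reducers=2):
    size = max(1, -(-len(sequence) // n_chunks))  # ceiling division
    chunks = [sequence[i:i + size] for i in range(0, len(sequence), size)]
    mapped = [mapper(c) for c in chunks]  # on a cluster, these run in parallel
    partitions = shuffle(mapped, n_reducers)
    return dict(reducer(k, v) for p in partitions for k, v in p.items())


result = map_reduce("ATGCCAATCGATGGGACTCC")
```

Because every pair with a given key ends up in the same partition, each reducer can sum its values without coordinating with the others, and `result` agrees with the serial count.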

The Hadoop MapReduce builds on HDFS and inherits all the fault-tolerance properties of HDFS. In general, Hadoop is deployed on very large scale clusters. One example is shown in Fig. 9.

Cloud computing

Cloud computing revolutionizes the modern computing paradigm. It allows everything, from hardware resources and software infrastructure to datasets, to be delivered to data analysts as a service wherever and whenever needed. Figure 10 illustrates different building components of cloud computing. The most striking feature of cloud computing is its elasticity and ability to scale up and down, which makes it suitable for storing and processing Big Data.

Figure 9. A typical Hadoop cluster (source: Wikipedia).

IMPACT ON COMPUTATIONAL METHODS

Big Data are massive and very high dimensional, which poses significant challenges for computing and calls for paradigm shifts in large-scale optimization [29,94]. On the one hand, the direct application of penalized quasi-likelihood estimators to high-dimensional data requires us to solve very large scale optimization problems. Optimization with a large number of variables is not only expensive but also suffers from slow numerical rates of convergence and instability. Such large-scale optimization is generally regarded as a means, not the goal, of Big Data analysis. Scalable implementations of large-scale nonsmooth optimization procedures are crucially needed. On the other hand, the massive sample size of Big Data, which can be in the order of millions or even billions as in genomics, neuroinformatics, marketing and online social media, also gives rise to intensive computation on data management and queries. Parallel computing, randomized algorithms, approximate algorithms and simplified implementations should be sought. Therefore, the scalability of statistical methods to both high dimensionality and large sample size should be seriously considered in the development of statistical procedures.

In this section, we explain some new progress on developing computational methods that are scalable to Big Data. To balance statistical accuracy and computational efficiency, several penalized estimators such as Lasso, SCAD and MCP have been described in the ‘Impact on statistical thinking’ section. We will introduce scalable first-order algorithms for solving these estimators in the ‘First-order methods for nonsmooth optimization’ section. We also note that the volumes of modern datasets are exploding and it is often computationally infeasible to directly make inference based on the raw data. Accordingly, to effectively handle Big Data from both statistical and computational perspectives, dimension reduction as an important data pre-processing step is advocated and exploited in many applications [95]. We will explain some effective dimension reduction methods in the ‘Dimension reduction and random projection’ section.

First-order methods for nonsmooth optimization

In this subsection, we introduce several first-order optimization algorithms for solving the penalized quasi-likelihood estimators in (9). For most loss functions $\ell_n(\cdot)$, this optimization problem has no closed-form solution. Iterative procedures are needed to solve it.

When the penalty function Pλ, γ (·) is convex(e.g. the L1-penalty), so is the objective functionin (9) when �n(·) is convex. Accordingly, sophis-ticated convex optimization algorithms can be ap-plied.Themost widely used convex optimization al-gorithm is gradient descent [96], which finds a so-lution sequence converging to the optimum β bycalculating the gradient of the objective function ateach point. However, calculating the gradient canbe very time consuming when the dimensionalityis high. Instead, the authors of [97] proposed tocalculate the penalized pseudo-likelihood estimatorusing the pathwise coordinate descent algorithm,

Figure 10. An illustration of the cloud computing paradigm.


which can be viewed as a special case of the gradient descent algorithm. Instead of optimizing along the direction of the full gradient, it only calculates the gradient direction along one coordinate at each time. A beautiful feature of this is that even though the whole optimization problem does not have a closed-form solution, there exist simple closed-form solutions to all the univariate subproblems. Coordinate descent is computationally easy and has numerical convergence properties similar to those of gradient descent [98]. Alternative first-order algorithms to coordinate descent have also been proposed and widely used, resulting in iterative shrinkage-thresholding algorithms [23,24]. Prior to the coordinate descent algorithm, the authors of [19] proposed the least angle regression (LARS) algorithm for the L1-penalized least-squares problem.
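To make the closed-form univariate updates concrete, the following minimal sketch applies cyclic coordinate descent to the L1-penalized least-squares problem. It assumes the squared-error loss $\ell_n(\beta) = (2n)^{-1}\|y - X\beta\|^2$, for which each one-dimensional subproblem is solved by soft-thresholding. The function names, the stopping rule and the default values are illustrative choices and are not taken from [97].

```python
import numpy as np


def soft_threshold(z, t):
    """Closed-form minimizer of 0.5 * (b - z)**2 + t * |b| over b."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_sweeps=500, tol=1e-8):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes (for illustration) a squared-error loss and no all-zero column in X.
    """
    n, d = X.shape
    beta = np.zeros(d)
    residual = y.astype(float).copy()      # equals y - X @ beta since beta = 0
    col_scale = (X ** 2).sum(axis=0) / n   # (1/n) * ||x_j||^2 for each column
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(d):
            # Correlation of column j with the partial residual (coordinate j removed).
            rho = X[:, j] @ residual / n + col_scale[j] * beta[j]
            # Univariate subproblem has the closed-form soft-thresholding solution.
            new_bj = soft_threshold(rho, lam) / col_scale[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual -= delta * X[:, j]   # keep the residual in sync with beta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta
```

Each sweep costs O(nd) operations because the residual is updated incrementally rather than recomputed, which is one reason this scheme is attractive when d is large.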

When the penalty function $P_{\lambda,\gamma}(\cdot)$ is nonconvex (e.g. SCAD and MCP), the objective function in (9) is no longer convex. Many algorithms have been proposed to solve this optimization problem. For example, the authors of [8] proposed a local quadratic approximation (LQA) algorithm for optimizing nonconcave penalized likelihood. Their idea is to approximate the penalty term piece by piece using a quadratic function, which can be thought of as a convex relaxation (majorization) of the nonconvex objective function. With the quadratic approximation, a closed-form solution can be obtained. This idea is further improved by using a linear instead of a quadratic function to approximate the penalty term, which leads to the local linear approximation (LLA) algorithm [27]. More specifically, given the current estimate $\beta^{(k)} = (\beta_1^{(k)}, \ldots, \beta_d^{(k)})^T$ at the $k$th iteration for problem (9), by Taylor’s expansion,

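$$P_{\lambda,\gamma}(\beta_j) \approx P_{\lambda,\gamma}\bigl(\beta_j^{(k)}\bigr) + P'_{\lambda,\gamma}\bigl(\beta_j^{(k)}\bigr)\bigl(|\beta_j| - |\beta_j^{(k)}|\bigr). \tag{17}$$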

Thus, at the $(k+1)$th iteration, we solve

$$\min_{\beta}\Bigl\{\ell_n(\beta) + \sum_{j=1}^{d} w_{k,j}|\beta_j|\Bigr\}, \tag{18}$$

where $w_{k,j} = P'_{\lambda,\gamma}(\beta_j^{(k)})$. Note that problem (18) is convex when $\ell_n(\cdot)$ is convex, so that a convex solver can be used. The authors of [58] suggested using the initial value $\beta^{(0)} = 0$, which corresponds to the unweighted L1 penalty. This algorithm shares a very similar idea with that in [99], and both can be regarded as implementations of the minimization of the folded-concave penalized quasi-likelihood [8] problem (9). If one further approximates the goodness-of-fit measure $\ell_n(\beta)$ in (18) by a quadratic function via the Taylor expansion, then the LARS algorithm [19] and the pathwise coordinate descent algorithm [97] can be used.
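The LLA iteration in (17) and (18) can be sketched as a sequence of weighted L1 problems. The sketch below is illustrative only: it assumes the squared-error loss $(2n)^{-1}\|y - X\beta\|^2$, uses the SCAD derivative with the conventional constant a = 3.7 from [8], and solves each weighted problem with a plain coordinate descent loop. All function names and defaults are hypothetical.

```python
import numpy as np


def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty evaluated at |t| (a = 3.7 is the usual default)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))


def weighted_lasso_cd(X, y, weights, beta0, n_sweeps=500, tol=1e-8):
    """Minimize (1/(2n)) * ||y - X b||^2 + sum_j weights[j] * |b_j|, i.e. problem (18)
    under an assumed squared-error loss. Assumes no all-zero column in X."""
    n, d = X.shape
    beta = np.asarray(beta0, dtype=float).copy()
    residual = y - X @ beta
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(d):
            rho = X[:, j] @ residual / n + col_scale[j] * beta[j]
            # Soft-thresholding with a coordinate-specific threshold w_{k,j}.
            new_bj = np.sign(rho) * max(abs(rho) - weights[j], 0.0) / col_scale[j]
            delta = new_bj - beta[j]
            residual -= delta * X[:, j]
            beta[j] = new_bj
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def lla_scad(X, y, lam, n_lla_steps=3):
    """Local linear approximation: repeatedly re-weight and solve problem (18)."""
    beta = np.zeros(X.shape[1])   # beta^(0) = 0, so the first step is the plain Lasso
    for _ in range(n_lla_steps):
        weights = scad_derivative(beta, lam)   # w_{k,j} = P'(beta_j^(k))
        beta = weighted_lasso_cd(X, y, weights, beta)
    return beta
```

Starting from zero makes every weight equal to the tuning parameter, so the first step coincides with the unweighted L1 problem. In later steps, coordinates whose current magnitude exceeds a times the tuning parameter receive zero weight and are no longer shrunk.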

For the more general settings where the loss function $\ell_n(\cdot)$ may not be convex, the authors of [100] proposed an approximate regularization path following algorithm for solving the optimization problem in (9). By integrating statistical analysis with computational algorithms, they provided explicit statistical and computational rates of convergence of any local solution obtained by the algorithm. Computationally, the approximate regularization path following algorithm attains a global geometric rate of convergence for calculating the full regularization path, which is the fastest possible among all first-order algorithms in terms of iteration complexity. Statistically, they show that any local solution obtained by the algorithm attains the oracle properties with the optimal rates of convergence. The idea of studying statistical properties based on computational algorithms, which combines both computational and statistical analysis, represents an interesting future direction for Big Data. We also refer to [101] and [102] for research studies in this direction.

Dimension reduction and random projection

We introduce several dimension (data) reduction procedures in this section. Why do we need dimension reduction? Let us consider a dataset represented as an n × d real-valued matrix D, which encodes information about n observations of d variables. In the Big Data era, it is in general computationally intractable to directly make inference on the raw data matrix. Therefore, an important data-preprocessing procedure is to conduct dimension reduction, which finds a compressed representation of D that is of lower dimension but preserves as much information in D as possible.

Principal component analysis (PCA) is the most well-known dimension reduction method. It aims at projecting the data onto a low-dimensional orthogonal subspace that captures as much of the data variation as possible. Empirically, it calculates the leading eigenvectors of the sample covariance matrix to form a matrix $U_k \in \mathbb{R}^{d \times k}$ whose columns span the subspace. We then project the n × d data matrix D to this linear subspace to obtain an n × k data matrix $DU_k$. This procedure is optimal among all the linear projection methods in minimizing the squared error introduced by the projection. However, conducting the eigenspace decomposition on the sample covariance matrix is computationally challenging when both n and d are large. The computational complexity of PCA is $O(d^2 n + d^3)$ [103], which is infeasible for very large datasets.
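The two cost terms can be seen directly in a small sketch of PCA through the sample covariance matrix. This is a minimal illustration rather than a scalable implementation. It centers the columns of D before projecting, which is an assumption made here, and the function name is hypothetical.

```python
import numpy as np


def pca_project(D, k):
    """Project the n x d matrix D onto its k leading principal directions.

    Returns the n x k matrix of scores and the d x k matrix U_k.
    Column centering is an assumption of this sketch.
    """
    n = D.shape[0]
    centered = D - D.mean(axis=0)
    cov = centered.T @ centered / (n - 1)    # sample covariance: O(d^2 n) operations
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition: O(d^3)
    order = np.argsort(eigvals)[::-1][:k]    # eigh returns eigenvalues in ascending order
    U_k = eigvecs[:, order]
    return centered @ U_k, U_k
```

Forming the covariance matrix and decomposing it are exactly the steps that become prohibitive when n and d are both large, which motivates the cheaper projection discussed next.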


Figure 11. Plots of the median errors in preserving the distances between pairs of data points versus the reduced dimension k in large-scale microarray data. Here ‘RP’ stands for the random projection and ‘PCA’ stands for the principal component analysis.

To handle the computational challenge raised by massive and high-dimensional datasets, we need to develop methods that preserve the data structure as much as possible and are computationally efficient for handling high dimensionality. Random projection (RP) [104] is an efficient dimension reduction technique for this purpose, and is closely related to the celebrated idea of compressed sensing [105–109]. More specifically, RP aims at finding a k-dimensional subspace of D, such that the distances between all pairs of data points are approximately preserved. It achieves this goal by projecting the original data D onto a k-dimensional subspace using an RP matrix with unit column norms. Formally, let $R \in \mathbb{R}^{d \times k}$ be a random matrix with all the column Euclidean norms equal to 1. We reduce the dimensionality of D from d to k by calculating the matrix multiplication

$$D^{R} = DR,$$

where $D^{R}$ denotes the reduced n × k data matrix. This procedure is very simple and the computational complexity of the RP procedure is of order O(ndk), which scales only linearly with the problem size.
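The RP step amounts to one matrix product, as the following sketch shows. It draws Gaussian entries and rescales each column to unit Euclidean norm. The Gaussian choice, the seed handling and the function name are assumptions made for illustration, since the text only requires unit column norms.

```python
import numpy as np


def random_projection(D, k, seed=0):
    """Reduce the n x d matrix D to n x k with a random matrix of unit column norms."""
    rng = np.random.default_rng(seed)
    d = D.shape[1]
    R = rng.standard_normal((d, k))    # Gaussian entries: an illustrative choice
    R /= np.linalg.norm(R, axis=0)     # enforce unit Euclidean norm on every column
    return D @ R, R                    # the product costs O(ndk) operations
```

No eigendecomposition or orthogonalization is involved, so the cost is dominated by the single product of D with R.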

Theoretical justifications of RP are based on two results. The authors of [104] showed that if points in a vector space are projected onto a randomly selected subspace of suitable dimension, then the distances between the points are approximately preserved. This justifies RP when R is indeed a projection matrix. However, enforcing R to be orthogonal requires the Gram–Schmidt algorithm, which is computationally expensive. In practice, the authors of [110] showed that in high dimensions we do not need to enforce the matrix to be orthogonal. In fact, any finite number of high-dimensional random vectors are almost orthogonal to each other. This result guarantees that $R^T R$ can be sufficiently close to the identity matrix. The authors of [111] further simplified the RP procedure by removing the unit column length constraint.
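The near-orthogonality claim is easy to probe numerically. The sketch below measures how far $R^T R$ is from the identity for a Gaussian matrix with normalized columns, and also generates a sparse matrix with entries $\sqrt{3}\times\{+1, 0, -1\}$ drawn with probabilities 1/6, 2/3 and 1/6, a three-point distribution commonly attributed to [111]. Function names, seeds and the Gaussian choice are assumptions made for illustration.

```python
import numpy as np


def max_gram_deviation(d, k, seed=0):
    """Largest absolute entry of R^T R - I for a d x k Gaussian matrix with unit columns."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d, k))
    R /= np.linalg.norm(R, axis=0)
    gram = R.T @ R
    return np.max(np.abs(gram - np.eye(k)))


def sparse_sign_matrix(d, k, seed=0):
    """d x k matrix with i.i.d. entries sqrt(3) * {+1, 0, -1} w.p. {1/6, 2/3, 1/6}.

    Entries have mean 0 and variance 1; no column normalization is applied.
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0) * signs
```

For fixed k, the deviation returned by `max_gram_deviation` shrinks as d grows, since inner products between independent unit vectors in dimension d have standard deviation of order $d^{-1/2}$. With the sparse matrix, about two thirds of the entries are zero, so the product with D needs correspondingly fewer arithmetic operations.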

To illustrate the usefulness of RP, we use the gene expression data in the ‘Incidental endogeneity’ section to compare the performance of PCA and RP in preserving the relative distances between pairwise data points. We extract the top 100, 500 and 2500 genes with the highest marginal standard deviations, and then apply PCA and RP to reduce the dimensionality of the raw data to a small number k. Figure 11 shows the median errors in the distance between members across all pairs of data vectors. We see that, when dimensionality increases, RPs have more and more advantages over PCA in preserving the distances between sample pairs.
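A comparison of this kind can be set up along the following lines. The sketch does not reproduce Figure 11 or use the gene expression data. It only shows one way to compute a median distance error for any reduced representation. The relative-error metric, the $\sqrt{d/k}$ rescaling of the RP output (which compensates for unit-norm columns shrinking squared lengths by a factor of about k/d) and all function names are assumptions.

```python
import numpy as np


def pairwise_distances(A):
    """Euclidean distances between all pairs of rows of A (upper triangle, flattened)."""
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    upper = np.triu_indices(A.shape[0], k=1)
    return np.sqrt(np.maximum(d2[upper], 0.0))   # clip tiny negatives from rounding


def median_distance_error(D, D_reduced):
    """Median relative error of pairwise distances; assumes the rows of D are distinct."""
    original = pairwise_distances(D)
    reduced = pairwise_distances(D_reduced)
    return np.median(np.abs(reduced - original) / original)


def reduce_rp(D, k, rng):
    """Random projection with unit-norm Gaussian columns, rescaled by sqrt(d / k)."""
    d = D.shape[1]
    R = rng.standard_normal((d, k))
    R /= np.linalg.norm(R, axis=0)
    return np.sqrt(d / k) * (D @ R)


def reduce_pca(D, k):
    """Scores on the k leading principal directions, computed through an SVD."""
    centered = D - D.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k].T
```

Calling `median_distance_error(D, reduce_rp(D, k, rng))` and `median_distance_error(D, reduce_pca(D, k))` over a grid of k values gives two error curves that can be plotted against k.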

One thing to note is that RP is not the ‘optimal’ procedure for traditional small-scale problems. Accordingly, the popularity of this dimension reduction procedure indicates a new understanding of Big Data. To balance the statistical accuracy and computational complexity, procedures that are suboptimal in small- or medium-scale problems can be ‘optimal’ in large scale. Moreover, the theory of RP depends on the high dimensionality feature of Big Data. This can be viewed as a blessing of dimensionality.

Besides PCA and RP, there are many other dimension-reduction methods, including latent semantic indexing (LSI) [112], discrete cosine transform [113] and CUR decomposition [114]. These methods have been widely used in analyzing large text and image datasets.

CONCLUSIONS AND FUTURE PERSPECTIVES

This paper discusses statistical and computational aspects of Big Data analysis. We selectively overview several unique features brought by Big Data and discuss some solutions. Besides the challenge of massive sample size and high dimensionality, there are several other important features of Big Data worth equal attention. These include

(1) Complex data challenge: due to the fact that Big Data are in general aggregated from multiple sources, they sometimes exhibit heavy tail behaviors with nontrivial tail dependence.

(2) Noisy data challenge: Big Data usually contain various types of measurement errors, outliers and missing values.

(3) Dependent data challenge: in various types of modern data, such as financial time series, fMRI and time course microarray data, the samples are dependent with relatively weak signals.

To handle these challenges, it is urgent to develop statistical methods that are robust to data complexity (see, for example, [115–117]), noise [62–119] and data dependence [51,120–122].

ACKNOWLEDGEMENTS

The authors gratefully acknowledge Dr Emre Barut for his kind assistance on producing Fig. 5. The authors thank the associate editor and referees for helpful comments.

FUNDING

This work was supported by the National Science Foundation [DMS-1206464 to JQF, III-1116730 and III-1332109 to HL] and the National Institutes of Health [R01-GM100474 and R01-GM072611 to JQF].

REFERENCES

1. Stein, L. The case for cloud computing in genome informatics. Genome Biol 2010; 11: 207.

2. Donoho, D. High-dimensional data analysis: the curses and blessings of dimensionality. In: The American Mathematical Society Conference, Los Angeles, CA, United States, 7–12 August 2000.

3. Bickel, P. Discussion on the paper ‘Sure independence screening for ultrahigh dimensional feature space’ by Fan and Lv. J Roy Stat Soc B 2008; 70: 883–4.

4. Fan, J and Fan, Y. High dimensional classification using features annealed independence rules. Ann Stat 2008; 36: 2605–37.

5. Pittelkow, PH and Ghosh, M. Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. J Roy Stat Soc B 2008; 70: 159–73.

6. Tibshirani, R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B 1996; 58: 267–88.

7. Chen, S, Donoho, D and Saunders, M. Atomic decomposition by basis pursuit. SIAM J Sci Comput 1998; 20: 33–61.

8. Fan, J and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001; 96: 1348–60.

9. Candes, E and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 2007; 35: 2313–51.

10. Zhang, C-H. Nearly unbiased variable selection under minimax concave penalty. Ann Stat 2010; 38: 894–942.

11. Fan, J and Lv, J. Sure independence screening for ultrahigh dimensional feature space (with discussion). J Roy Stat Soc B 2008; 70: 849–911.

12. Hall, P and Miller, H. Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graph Stat 2009; 18: 533–50.

13. Genovese, C, Jin, J and Wasserman, L et al. A comparison of the lasso and marginal regression. J Mach Learn Res 2012; 13: 2107–43.

14. Fan, J, Guo, S and Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J Roy Stat Soc B 2012; 74: 37–65.

15. Liao, Y and Jiang, W. Posterior consistency of nonparametric conditional moment restricted models. Ann Stat 2011; 39: 3003–31.

16. Fan, J and Liao, Y. Endogeneity in ultrahigh dimension. Technical report. Princeton University, 2012.

17. Fan, J. Features of big data and sparsest solution in high confidence set. Technical report. Princeton University, 2013.

18. Donoho, D and Elad, M. Optimally sparse representation in general (nonorthogonal) dictionaries via L1 minimization. Proc Natl Acad Sci USA 2003; 100: 2197–202.

19. Efron, B, Hastie, T and Johnstone, I et al. Least angle regression. Ann Stat 2004; 32: 407–99.

20. Friedman, J and Popescu, B. Gradient directed regularization for linear regression and classification. Technical report. Stanford University, 2003.

21. Fu, WJ. Penalized regressions: the bridge versus the lasso. J Comput Graph Stat 1998; 7: 397–416.

22. Wu, T and Lange, K. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2008; 2: 224–44.

23. Daubechies, I, Defrise, M and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun Pur Appl Math 2004; 57: 1413–57.

24. Beck, A and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sciences 2009; 2: 183–202.

25. Lange, K, Hunter, D and Yang, I. Optimization transfer using surrogate objective functions. J Comput Graph Stat 2000; 9: 1–20.

26. Hunter, D and Li, R. Variable selection using MM algorithms. Ann Stat 2005; 33: 1617–42.

27. Zou, H and Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Ann Stat 2008; 36: 1509–33.

28. Fan, J, Samworth, R and Wu, Y. Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 2009; 10: 2013–38.


29. Boyd, S, Parikh, N and Chu, E et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 2011; 3: 1–122.

30. Bradley, J, Kyrola, A and Bickson, D et al. Parallel coordinate descent for L1-regularized loss minimization. arXiv:1105.5379, 2011.

31. Low, Y, Bickson, D and Gonzalez, J et al. Distributed graphlab: a framework for machine learning and data mining in the cloud. Proc Int Conf VLDB Endowment 2012; 5: 716–27.

32. Worthey, E, Mayer, A and Syverson, G et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med 2010; 13: 255–62.

33. Chen, R, Mias, G and Li-Pook-Than, J et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012; 148: 1293–307.

34. Cohen, J, Kiss, R and Pertsemlidis, A et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 2004; 305: 869–72.

35. Han, F and Pan, W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered 2010; 70: 42–54.

36. Bickel, P, Brown, J and Huang, H et al. An overview of recent developments in genomics and associated statistical methods. Philos T R Soc A 2009; 367: 4313–37.

37. Leek, J and Storey, J. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 2007; 3: e161.

38. Benjamini, Y and Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B 1995; 57: 289–300.

39. Storey, J. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat 2003; 31: 2013–35.

40. Schwartzman, A, Dougherty, R and Lee, J et al. Empirical null and false discovery rate analysis in neuroimaging. Neuroimage 2009; 44: 71–82.

41. Efron, B. Correlated z-values and the accuracy of large-scale statistical estimates. J Am Stat Assoc 2010; 105: 1042–55.

42. Fan, J, Han, X and Gu, W. Control of the false discovery rate under arbitrary covariance dependence. J Am Stat Assoc 2012; 107: 1019–45.

43. Edgar, R, Domrachev, M and Lash, AE. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002; 30: 207–10.

44. Jonides, J, Nee, D and Berman, M. What has functional neuroimaging told us about the mind? So many examples little space. Cortex 2006; 42: 414–7.

45. Visscher, K and Weissman, D. Would the field of cognitive neuroscience be advanced by sharing functional MRI data? BMC Med 2011; 9: 34.

46. Milham, M, Mennes, M and Gutman, D et al. The International Neuroimaging Data-sharing Initiative (INDI) and the Functional Connectomes Project. 17th Annual Meeting of the Organization for Human Brain Mapping, Quebec City, 2011.

47. Di Martino, A, Yan, CG and Li, Q et al. The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol Psychiatry 2013, doi:10.1038/mp.2013.78.

48. The ADHD-200 Consortium. The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Front Syst Neurosci 2012; 6: 62.

49. Fritsch, V, Varoquaux, G and Thyreau, B et al. Detecting outliers in high-dimensional neuroimaging datasets with robust covariance estimators. Med Image Anal 2012; 16: 1359–70.

50. Song, S and Bickel, P. Large vector auto regressions. arXiv:1106.3915, 2011.

51. Han, F and Liu, H. Transition matrix estimation in high dimensional time series. In: The 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June, 2013.

52. Cochrane, J. Asset Pricing. Princeton, NJ: Princeton University Press, 2001.

53. Dempster, M. Risk Management: Value at Risk and Beyond. Cambridge: Cambridge University Press, 2002.

54. Stock, J and Watson, M. Forecasting using principal components from a large number of predictors. J Am Stat Assoc 2002; 97: 1167–79.

55. Bai, J and Ng, S. Determining the number of factors in approximate factor models. Econometrica 2002; 70: 191–221.

56. Bai, J. Inferential theory for factor models of large dimensions. Econometrica 2003; 71: 135–71.

57. Forni, M, Hallin, M and Lippi, M et al. The generalized dynamic factor model: one-sided estimation and forecasting. J Am Stat Assoc 2005; 100: 830–40.

58. Fan, J, Fan, Y and Lv, J. High dimensional covariance matrix estimation using a factor model. J Econometrics 2008; 147: 186–97.

59. Bickel, P and Levina, E. Covariance regularization by thresholding. Ann Stat 2008; 36: 2577–604.

60. Cai, T and Liu, W. Adaptive thresholding for sparse covariance matrix estimation. J Am Stat Assoc 2011; 106: 672–84.

61. Agarwal, A, Negahban, S and Wainwright, M. Noisy matrix decomposition via convex relaxation: optimal rates in high dimensions. Ann Stat 2012; 40: 1171–97.

62. Liu, H, Han, F and Yuan, M et al. High-dimensional semiparametric Gaussian copula graphical models. Ann Stat 2012; 40: 2293–326.

63. Xue, L and Zou, H. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann Stat 2012; 40: 2541–71.

64. Liu, H, Han, F and Zhang, C-H. Transelliptical graphical models. In: The 25th Conference in Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December, 2012.

65. Fan, J, Liao, Y and Mincheva, M. Large covariance estimation by thresholding principal orthogonal complements. J Roy Stat Soc B 2013; 75: 603–80.

66. Pourahmadi, M. Modern Methods to Covariance Estimation: With High-Dimensional Data. New York: Wiley, 2013.

67. Aramaki, E, Maskawa, S and Morita, M. Twitter catches the flu: detecting influenza epidemics using twitter. In: The Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–29 July, 2011.

68. Bollen, J, Mao, H and Zeng, X. Twitter mood predicts the stock market. J Comput Sci 2011; 2: 1–8.

69. Asur, S and Huberman, B. Predicting the future with social media. In: The IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Toronto, Canada, 31 August–3 September, 2010.

70. Khalili, A and Chen, J. Variable selection in finite mixture of regression models. J Am Stat Assoc 2007; 102: 1025–38.

71. Stadler, N, Buhlmann, P and van de Geer, S. ℓ1-penalization for mixture regression models. Test 2010; 19: 209–56.

72. Hastie, T, Tibshirani, R and Friedman, J. The Elements of Statistical Learning. Berlin: Springer, 2009.

73. Buhlmann, P and van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Berlin: Springer, 2011.

74. Cai, T and Jiang, T. Phase transition in limiting distributions of coherence of high-dimensional random matrices. J Multivariate Anal 2012; 107: 24–39.

75. Engle, R, Hendry, D and Richard, J-F. Exogeneity. Econometrica 1983; 51: 277–304.

76. Brazma, A, Parkinson, H and Sarkans, U et al. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2003; 31: 68–71.

77. Valiathan, R, Marco, M and Leitinger, B et al. Discoidin domain receptor tyrosine kinases: new players in cancer progression. Cancer Metastasis Rev 2012; 31: 295–321.


78. Akaike, H. A new look at the statistical model identification. IEEE Trans Automat Control 1974; 19: 716–23.

79. Barron, A, Birge, L and Massart, P. Risk bounds for model selection via penalization. Probab Theory Related Fields 1999; 113: 301–413.

80. Antoniadis, A. Wavelets in statistics: a review. J Ital Stat Soc 1997; 6: 97–130.

81. Antoniadis, A and Fan, J. Regularization of wavelet approximations. J Am Stat Assoc 2001; 96: 939–55.

82. Donoho, D and Johnstone, J. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994; 81: 425–55.

83. Liang, K-Y and Zeger, S. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73: 13–22.

84. Cai, T, Liu, W and Luo, X. A constrained L1 minimization approach to sparse precision matrix estimation. J Am Stat Assoc 2011; 106: 594–607.

85. Cai, T and Liu, W. A direct estimation approach to sparse linear discriminant analysis. J Am Stat Assoc 2011; 106: 1566–77.

86. Bickel, P, Ritov, Y and Tsybakov, A. Simultaneous analysis of lasso and Dantzig selector. Ann Stat 2009; 37: 1705–32.

87. Gautier, E and Tsybakov, A. High-dimensional instrumental variables regression and confidence sets. arXiv:1105.2454, 2011.

88. Fan, J and Song, R. Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 2010; 38: 3567–604.

89. Fan, J, Feng, Y and Song, R. Nonparametric independence screening in sparse ultra-high dimensional additive models. J Am Stat Assoc 2011; 106: 544–57.

90. Zhao, S and Li, Y. Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivariate Anal 2012; 105: 397–411.

91. Li, R, Zhong, W and Zhu, L. Feature screening via distance correlation learning. J Am Stat Assoc 2012; 107: 1129–39.

92. Li, G, Peng, H and Zhang, J et al. Robust rank correlation based screening. Ann Stat 2012; 40: 1846–77.

93. Ke, T, Jin, J and Fan, J. Covariance assisted screening and estimation. arXiv:1205.4645, 2012.

94. Boyd, S and Vandenberghe, L. Convex Optimization. Cambridge: Cambridge University Press, 2004.

95. Fodor, I. A survey of dimension reduction techniques. Technical report. US Department of Energy, 2002.

96. Avriel, M. Nonlinear Programming: Analysis and Methods. New York: Courier Dover, 2003.

97. Friedman, J, Hastie, T and Hofling, H et al. Pathwise coordinate optimization. Ann Appl Stat 2007; 1: 302–32.

98. Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J Optim 2012; 22: 341–62.

99. Candes, E, Wakin, M and Boyd, S. Enhancing sparsity by reweighted L1 minimization. J Fourier Anal Appl 2008; 14: 877–905.

100. Wang, Z, Liu, H and Zhang, T. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. arXiv:1306.4960, 2013.

101. Agarwal, A, Negahban, S and Wainwright, M. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann Stat 2012; 40: 2452–82.

102. Loh, P-L and Wainwright, M. Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. arXiv:1305.2436, 2013.

103. Golub, G and Van Loan, C. Matrix Computations. Baltimore, MD: The Johns Hopkins University Press, 2012.

104. Johnson, W and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 1984; 26: 189–206.

105. Donoho, D. Compressed sensing. IEEE Trans Inform Theory 2006; 52: 1289–306.

106. Tsaig, Y and Donoho, D. Extensions of compressed sensing. Signal Process 2006; 86: 549–71.

107. Lustig, M, Donoho, D and Pauly, J. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn Reson Med 2007; 58: 1182–95.

108. Figueiredo, M, Nowak, R and Wright, S. Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J Sel Top Signal Process 2007; 1: 586–97.

109. Candes, E and Wakin, M. An introduction to compressive sampling. Signal Process Magazine 2008; 25: 21–30.

110. Marks, R and Zurada, J. Computational Intelligence: Imitating Life. Piscataway, NJ: IEEE, 1994.

111. Achlioptas, D. Database-friendly random projections. In: The 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Dallas, TX, USA, 16–18 May, 2001.

112. Deerwester, S, Dumais, S and Furnas, GR. Indexing by latent semantic analysis. J Assn Inf Sci 1990; 41: 391–407.

113. Rao, K, Yip, P and Britanak, V. Discrete Cosine Transform: Algorithms, Advantages, Applications. New York: Academic, 2007.

114. Mahoney, M and Drineas, P. CUR matrix decompositions for improved data analysis. Proc Natl Acad Sci USA 2009; 106: 697–702.

115. Owen, J and Rabinovitch, R. On the class of elliptical distributions and their applications to the theory of portfolio choice. J Finance 1983; 38: 745–52.

116. Blanchard, G, Kawanabe, M and Sugiyama, M et al. In search of non-Gaussian components of a high-dimensional distribution. J Mach Learn Res 2006; 7: 247–82.

117. Han, F and Liu, H. Scale-invariant sparse PCA on high dimensional meta-elliptical data. J Am Stat Assoc, doi:10.1080/01621459.2013.844699.

118. Candes, E, Li, X and Ma, Y et al. Robust principal component analysis? J ACM 2011; 58: 11: 1–37.

119. Loh, P-L and Wainwright, M. High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann Stat 2012; 40: 1637–64.

120. Lam, C and Yao, Q. Factor modeling for high-dimensional time series: inference for the number of factors. Ann Stat 2012; 40: 694–726.

121. Han, F and Liu, H. Principal component analysis on non-Gaussian dependent data. In: The 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June, 2013.

122. Huang, J, Sun, T and Ying, Z et al. Oracle inequalities for the lasso in the Cox model. Ann Stat 2013; 41: 1142–65.
