Cytometric Fingerprinting: Quantitative Characterization of Multivariate Distributions

Wade T. Rogers,1,2,3* Allan R. Moser,1,2 Herbert A. Holyst,1,3 Andrew Bantly,3 Emile R. Mohler III,4 George Scangas,3 Jonni S. Moore3

Abstract

Recent technological advances in flow cytometry instrumentation provide the basis for high-dimensionality and high-throughput biological experimentation in a heterogeneous cellular context. Concomitant advances in scalable computational algorithms are necessary to better utilize the information that is contained in these high-complexity experiments. The development of such tools has the potential to expand the utility of flow cytometric analysis from a predominantly hypothesis-driven mode to one of discovery, or hypothesis-generating research. A new method of analysis of flow cytometric data called Cytometric Fingerprinting (CF) has been developed. CF captures the set of multivariate probability distribution functions corresponding to list-mode data and then "flattens" them into a computationally efficient fingerprint representation that facilitates quantitative comparisons of samples. Experimental and synthetic data were generated to act as reference sets for evaluating CF. Without the introduction of prior knowledge, CF was able to "discover" the location and concentration of spiked cells in ungated analyses over a concentration range covering four orders of magnitude, to a lower limit on the order of 10 spiked events in a background of 100,000 events. We describe a new method for quantitative analysis of list-mode cytometric data. CF includes a novel algorithm for space subdivision that improves estimation of the probability density function by dividing space into nonrectangular polytopes. Additionally, it renders a multidimensional distribution in the form of a one-dimensional multiresolution hierarchical fingerprint that creates a computationally efficient representation of high-dimensionality distribution functions. CF supports both the generation and testing of hypotheses, eliminates sources of operator bias, and provides an increased level of automation of data analysis. © 2008 International Society for Advancement of Cytometry

Key terms

flow cytometry; cytomics; bioinformatics; computational biology; machine learning; empirical modeling

1 Cira Discovery Sciences, Inc., Philadelphia, Pennsylvania

2 Perinatology Research Branch, Intramural Division, NICHD/NIH/DHHS, Hutzel Women’s Hospital, Detroit, Michigan

3 Department of Pathology and Laboratory Medicine and the Flow Cytometry and Cell Sorting Resource Laboratory, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania

4 Vascular Medicine Section, Cardiovascular Division, Department of Medicine, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania

Received 25 September 2007; Revision Received 5 December 2007; Accepted 16 January 2008

Grant sponsor: National Institute of Child Health and Human Development, NIH, DHHS.

*Correspondence to: Wade T. Rogers, University of Pennsylvania, 207 John Morgan Bldg., Philadelphia, PA 19104-6082.

Email: [email protected]

Published online 27 March 2008 in Wiley InterScience (www.interscience.wiley.com)

DOI: 10.1002/cyto.a.20545

© 2008 International Society for Advancement of Cytometry

Original Article

Cytometry Part A 73A: 430–441, 2008

In the past decade, flow cytometry, along with other high-information-content technologies, has experienced exponential growth. The power of flow cytometry to allow the simultaneous measurement of multiple cellular components of individual cells in a heterogeneous environment makes it an ideal technology for high-dimensional biological studies. Along with other "-omics" technologies (such as genomics and proteomics), cytomics, facilitated by high-dimensional flow cytometry, has shifted scientific investigation from a purely hypothesis-driven mode to a more discovery-oriented, or hypothesis-generating, mode. With such approaches, complex cellular phenotypes and interactions can be elucidated and can contribute to the understanding of such important clinical questions as successful vaccination strategies (1–3), and the nature and role of stem cells in health and disease (4–6). The availability of advanced instrumentation coupled with expansion in the selection of probes has extended the capabilities of multivariate analyses of cells and cell systems to the point that conventional data-analytical paradigms have become the rate-limiting step in the overall process of collecting, analyzing, and interpreting complex data sets (7–10). Twenty-first century flow cytometric technology will thus benefit from the development of new approaches to the analysis of data, especially those that facilitate the application of machine learning and data mining methodologies.

Conventional flow cytometry typically measures events that are variable in only a few dimensions. In the best cases the data are quantal; that is, cells express particular antigens according to tightly regulated processes causing the number of such molecules per cell to fall within a narrow range. Thus, a two-dimensional projection of the data for two cell-surface markers characterizing a population of cells can be divided into four quadrants into which individual clusters are segregated. Variation in the choice of thresholds to separate positive events from negative events on each axis has a negligible influence on estimates of the density of events in each quadrant. More complex multiparameter distributions are usually simplified into a series of two-dimensional distributions by sequential gating (2) or, in some cases, more sophisticated principal component analysis (PCA) (8), to represent the high-dimensionality data in a lower-dimensionality projection which retains most of the variance in the dataset (11). However, these dimensionality-reducing approaches possess a number of drawbacks that limit their suitability for the analysis of complex populations. First, serial pairwise analysis decouples parameter correlations that may be important in defining complex populations. Additionally, analytical bias may be introduced through multiple subjective operator interventions, which is of particular relevance in the analysis of complex, nonquantal distributions. Finally, sequential gating involves the imposition of selection criteria on the data that do not allow for new interpretations of poorly characterized populations.

Analysis via gating remains the most effective method for filtering data using well-established criteria (e.g., removing data associated with dead cells, or counting populations in highly optimized and quantal assays). In some cases the application of expert judgment in the visual design of gating strategies may be able to isolate events of biological interest even in the presence of confounding experimental (or biological) variation that will be difficult to account for automatically. Nevertheless there is a strong need for tools to transform and represent multiparameter flow cytometric data in a form efficiently amenable to methods of machine learning and data mining.

We describe here a new algorithm for the analyses of high-dimensional cytometric data termed Cytometric Fingerprinting (CF). CF extends the approach of Probability Binning (PB) (7) and its companion method Frequency Difference Gating (12). The objective of CF is to represent the information contained in cytometric list-mode files in a form that enables quantitative comparison among samples. These fingerprints are capable of capturing and encoding the full multivariate correlations of complex, high-dimensional cytometric data. This representation is particularly useful when cell populations are not clearly delineated by optimized assays and the distribution of events in the multiparameter space is not bimodal.

MATERIALS AND METHODS

Cell Staining and Flow Cytometric Analysis

Peripheral blood mononuclear cells (PBMCs) were isolated from 60 ml of fresh blood from seven healthy donors as described in Ref. 6, collected under the guidelines of the Institutional Review Board of the University of Pennsylvania School of Medicine. Cells were counted and split into two pools, Pool no. 1 and Pool no. 2, plus a compensation control group. The compensation control set contained $10^7$ cells, and the Pool no. 1 and Pool no. 2 sets each contained $2 \times 10^7$ cells. The two pools of cells were each stained with a different antibody combination to create a known variable subset of events embedded in a background distribution. Antibody labeling was performed according to standard procedures. Briefly, all cells were blocked with murine IgG (10 µg per $10^7$ cells) (Sigma, St. Louis, MO) for 10 min at room temperature, and then either stained with 60 µl anti-human CD45 FITC (BD Biosciences Pharmingen, San Jose, CA), 15 µl anti-human CD4 PE (Invitrogen/Caltag Laboratories, Carlsbad, CA), and 15 µl anti-human CD3 APC (Invitrogen/Caltag Laboratories, Carlsbad, CA) for Pool no. 1, or 60 µl anti-human CD45 FITC, 15 µl anti-human CD4 PE, and 15 µl anti-human CD64 APC (Invitrogen/Caltag Laboratories, Carlsbad, CA) for Pool no. 2. Three compensation control tubes were labeled with single antibodies as in Pool no. 1. A fourth, unstained control tube had 100 µl of staining buffer (DPBS + 1.0% BSA) added to it. After labeling, cells were incubated for 30 min at 4°C in the dark, washed twice with 2 ml of staining buffer, and finally resuspended in 2 ml of staining buffer. A series of collection tubes was made by aliquoting equal numbers ($1 \times 10^6$) of Pool no. 2 cells at a final volume of 1 ml.

Cell Sorting

A BD FACS Aria was used to sort Pool no. 1 or Pool no. 2 cells into the tubes containing 1 million Pool no. 2 cells. Sorted population gating was based on light scatter for lymphocytes and monocytes [P1 gate], singlets [P2 gate], CD45(+) FITC-A vs. SSC-A [P3 gate], and CD4(+) PE-A vs. CD3(+) APC-A [P4 gate], after compensation as shown in Figure 1. A series of increasing numbers (10, 32, 100, 316, 1,000, 3,162, 10,000, 31,623, or 100,000 cells) of Pool no. 1 cells were sorted into the collection tubes containing $1 \times 10^6$ Pool no. 2 cells. The resulting composite distributions are shown in the lower part of Figure 1. Three tubes in which Pool no. 2 cells (10, 100, or 1,000 cells) were sorted into Pool no. 2 collection tubes were used as controls.
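The sort counts listed above are spaced in half-decade steps, so each one is $10^{1 + k/2}$ rounded to the nearest whole cell. The following minimal Python sketch reproduces the series and the corresponding nominal concentrations relative to the $1 \times 10^6$ reservoir cells. The variable names are illustrative only and are not taken from the original protocol.

```python
# Half-decade spike series: 10^(1 + k/2) cells for k = 0..8.
RESERVOIR_CELLS = 1_000_000  # Pool no. 2 cells per collection tube

spike_counts = [round(10 ** (1 + k / 2)) for k in range(9)]

# Nominal concentration, expressed relative to the reservoir count.
nominal_percent = [100 * n / RESERVOIR_CELLS for n in spike_counts]

for n, pct in zip(spike_counts, nominal_percent):
    print(f"{n:>7,d} cells  ->  {pct:.4g}% nominal")
```

The nominal values run from 0.001% to 10%. The realized concentrations differ from these values, as noted in the Results section.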

Flow Cytometry Data Acquisition

Data were acquired on a BD FACS Canto immediately after sorting for each experiment. PMT voltages were determined using the compensation control tubes. Once voltages were established, during the first experiment, Sphero Rainbow Calibration Particles (eight peaks) (Spherotech, Lake Forest, IL) were used to standardize fluorescence measurements across experiments. To do this, $10^5$ beads were collected, a tight gate was set around the three brightest peaks for the FITC, PE, and APC channels, and the median intensity for each peak was recorded. PMT voltages were then adjusted if necessary to produce the same median intensities for each parameter compared with the first experiment. For cellular events, the threshold was set at FSC 30,000. One hundred thousand ungated events for each sample were collected. For the Pool no. 2 cells sorted into Pool no. 2, five tubes were collected of each sample to be used as controls. In between each collection of a tube, buffer only was run until no events were detectable. After all tubes were collected, the Spherotech 8-peak beads were run a second time to see if there were any changes in the median intensities of the three bright peak gates for each channel. When performing the experiments that followed, the PMT voltages for the samples were determined by using the predrawn gates used to monitor possible shifts in the median intensity in the first experiment.

Flow Cytometry Data Analysis

FCS3.0 files were analyzed using FlowJo Software version 6.4.7 (TreeStar Software, Ashland, OR). Population gating, with gates corresponding to those described above for the sort, was used in order to establish reference numbers of Pool no. 1 cells in each tube. These numbers were used to determine the accuracy of event counts obtained with the CF method described below. Cytometric Fingerprints were generated and analyzed using custom-written software in C, Perl, MatLab version 6.5.0 (The Mathworks, Natick, MA) and R version 2.5.1.

Cytometric Fingerprints

CF aims to capture the set of multivariate probability distribution functions corresponding to a set of list-mode files and then "flatten" them into a computationally efficient fingerprint representation that facilitates quantitative comparisons of samples. The CF algorithm starts by specifying the number of hierarchical levels $L$ for the representation. The total number of bins $n_B$ into which the space is to be divided is related to the number of hierarchies by the expression $n_B = 2^L - 1$. (Note that the undivided space is considered to be the first level in the hierarchy. Thus, the total number of bins is an odd number.) The next step is to determine boundaries that recursively subdivide the multidimensional space into regions of equal probability. The subdivision of space is done in such a way that at each division the parent bin is divided by a hyperplane perpendicular to the direction of maximum variance of the data within the bin. Thus, the variance of data within a bin is maximally reduced at each subdivision subject to the constraint that the data are divided into approximately equal portions. PCA is utilized to find the direction of maximum variance (11,13–17). The procedure, termed Oblique Recursive Subdivision, is outlined in Figure 2. It is noteworthy that while the bin boundaries consist of the intersection of hyperplanes in a p-dimensional space, these boundaries need not be explicitly stored. All of the information necessary to bin new data is contained in the hierarchy of rotation matrices $R_n$ and median split values described in Figure 2.

Figure 1. Gating strategy and event distribution in spiking experiment. Upper panel: Gating strategy for spiking Pool no. 1 cells via cell sorting into aliquots of Pool no. 2 cells. Lower panel: Distributions of spiked samples at nine different spike concentrations (corresponding to the concentrations in Table 1) are shown. The boxes show the location of the spiked events.

Once bin boundaries have been determined from a data set (denoted $D_{\text{template}}$), the procedure for binning another data set, $D_{\text{new}}$, utilizing this bin description is straightforward. The procedure is identical to that described earlier except that the rotation matrices and median values from $D_{\text{template}}$ are used rather than being recalculated from the new data set.
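The fitting and re-binning steps described above can be sketched in Python with NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation (which was written in C, Perl, MatLab, and R). It assumes at least two parameters per event, it stores only the first principal direction and the bin mean at each split rather than a full rotation matrix $R_n$, and the function names `fit_ors` and `bin_counts` and the heap-style bin numbering are hypothetical choices made for the example.

```python
import numpy as np

def fit_ors(template, n_levels):
    """Fit oblique recursive subdivision on a template data set.

    Returns {node_id: (mean, direction, median)}. Bins are numbered
    heap-style: the undivided space is node 1 and the children of node n
    are 2n and 2n + 1, giving 2**n_levels - 1 bins in total.
    """
    splits = {}

    def recurse(data, node, level):
        # Stop at the highest resolution level, or when a bin is too
        # small to estimate a covariance matrix.
        if level == n_levels - 1 or len(data) < 2:
            return
        mean = data.mean(axis=0)
        # eigh returns eigenvalues in ascending order, so the last
        # eigenvector is the direction of maximum variance.
        _, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
        direction = eigvecs[:, -1]
        proj = (data - mean) @ direction
        median = np.median(proj)
        splits[node] = (mean, direction, median)
        left = proj <= median
        recurse(data[left], 2 * node, level + 1)
        recurse(data[~left], 2 * node + 1, level + 1)

    recurse(np.asarray(template, dtype=float), 1, 0)
    return splits


def bin_counts(events, splits, n_levels):
    """Bin events with stored splits; return counts for bins 1..2**n_levels - 1."""
    events = np.asarray(events, dtype=float)
    node = np.ones(len(events), dtype=np.int64)
    for _ in range(n_levels - 1):
        for n in np.unique(node):
            n = int(n)
            if n not in splits:
                continue
            mean, direction, median = splits[n]
            idx = np.flatnonzero(node == n)
            proj = (events[idx] - mean) @ direction
            node[idx] = np.where(proj <= median, 2 * n, 2 * n + 1)
    counts = np.bincount(node, minlength=2 ** n_levels)
    # Propagate counts from children to parents to fill the coarser levels.
    for n in range(2 ** n_levels - 1, 1, -1):
        counts[n // 2] += counts[n]
    return counts[1:]


rng = np.random.default_rng(0)
template = rng.normal(size=(1024, 3))      # synthetic stand-in for list-mode data
splits = fit_ors(template, n_levels=4)
counts = bin_counts(template, splits, n_levels=4)
new_counts = bin_counts(rng.normal(size=(500, 3)), splits, n_levels=4)
print(len(counts))                         # 2**4 - 1 bins
```

Binning the template against its own splits gives approximately equal counts within each level, whereas a new data set binned against the same splits generally does not. That difference is what the fingerprint records.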

The binning method utilized by oblique recursive subdivision is similar to that employed by PB (7). Both recursively subdivide space into bins containing equal numbers of events. For multivariate distributions, both use variance as the splitting criterion. The primary difference is that PB is constrained to split along the coordinate axes, whereas oblique recursive subdivision splits the data in the actual direction of maximum variance. As stated in Ref. 7, "By choosing the parameter with the largest variance, the algorithm is weighted towards assigning distinct clusters of events into distinct bins (or sets of bins)." Thus, oblique recursive subdivision may be regarded as a generalization of PB that more accurately achieves this goal. A simple pictorial example showing the difference between the binning achieved by PB versus that of oblique recursive subdivision is shown in Figure 3. This example shows two distributions of events in a two-dimensional space. Splitting in the direction of maximum variance, as is done in oblique recursive subdivision, places the events from these two distributions in separate bins starting from the first level of recursion. Constraining the split direction to a coordinate axis, as is done in PB, creates bins that mix events from both distributions, even in the higher resolution bins. Of course, there is no way of knowing whether such distributions exist in real data. Additionally, if enough data exist to create sufficiently high-resolution bins, the event distributions will eventually be separated by either binning method. However, distributions such as this could, in principle, exist at any resolution within data.
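The contrast between the two first-level splits can be reproduced on synthetic data. The sketch below is an assumption-laden illustration rather than a reconstruction of Figure 3. The cluster positions, spreads, seed, and the `purity` helper are all invented for the example. It builds two elongated clusters offset along a diagonal, then compares a median split on the single parameter with the largest variance (the PB-style choice) against a median split on the projection onto the first principal component (the oblique choice).

```python
import numpy as np

rng = np.random.default_rng(0)

def elongated_cluster(center, n):
    """Gaussian cluster, wide along (1, -1) and narrow along (1, 1)."""
    long_axis = np.array([1.0, -1.0]) / np.sqrt(2)
    short_axis = np.array([1.0, 1.0]) / np.sqrt(2)
    return (np.asarray(center, dtype=float)
            + rng.normal(0, 3.0, (n, 1)) * long_axis
            + rng.normal(0, 0.5, (n, 1)) * short_axis)

red = elongated_cluster((-3, -3), 500)
blue = elongated_cluster((3, 3), 500)
events = np.vstack([red, blue])
is_red = np.arange(len(events)) < 500

def purity(left):
    """Fraction of events lying in a bin dominated by their own cluster."""
    total = 0
    for mask in (left, ~left):
        reds = np.count_nonzero(is_red & mask)
        blues = np.count_nonzero(~is_red & mask)
        total += max(reds, blues)
    return total / len(left)

# PB-style split: median of the single parameter with the largest variance.
axis = int(np.argmax(events.var(axis=0)))
pb_left = events[:, axis] <= np.median(events[:, axis])

# Oblique split: median of the projection on the first principal component.
centered = events - events.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
proj = centered @ eigvecs[:, -1]
ors_left = proj <= np.median(proj)

pb_purity = purity(pb_left)
ors_purity = purity(ors_left)
print(f"axis-aligned split purity: {pb_purity:.3f}")
print(f"oblique split purity:      {ors_purity:.3f}")
```

With this geometry the oblique split separates the two clusters completely, while the axis-aligned split leaves some events from each cluster on the wrong side.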

As discussed in Ref. 7, there are potential disadvantages to splitting in the direction of maximum variance. Distributions that are spread out in one dimension will be split through the middle, making the method sensitive to subtle shifts in the position of the bulk of the event population. Note that while this may present a problem for rare event detection, it can also be exploited as a means of quality control (18). Additionally, care must be taken that all parameters are transformed such that their variances have approximately equal scale. Importantly, log-normally distributed values (as is typical of fluorescence parameters) should be log-transformed prior to binning. All of the data presented in the "Results" section were treated in this way.
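One possible way to prepare data along these lines is sketched below. The text specifies the log transform for fluorescence parameters but does not say how the variances were brought to a common scale, so dividing each column by its standard deviation is an assumption here, as are the `floor` value, the function name, and the synthetic data.

```python
import numpy as np

def prepare_for_binning(events, fluorescence_cols, floor=1.0):
    """Log-transform fluorescence columns, then scale every column to unit variance.

    `floor` is an assumed value used to clip non-positive intensities
    before taking the logarithm.
    """
    out = np.asarray(events, dtype=float).copy()
    cols = list(fluorescence_cols)
    out[:, cols] = np.log10(np.clip(out[:, cols], floor, None))
    std = out.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged
    return out / std

rng = np.random.default_rng(1)
raw = np.column_stack([
    rng.normal(50_000, 10_000, 5_000),               # scatter-like, linear scale
    rng.lognormal(mean=6.0, sigma=1.0, size=5_000),  # fluorescence-like
])
ready = prepare_for_binning(raw, fluorescence_cols=[1])
print("variance before:", raw.var(axis=0))
print("variance after: ", ready.var(axis=0))
```

Without such a step, the parameter with the largest numerical range would dominate the principal component at every split regardless of where the biologically relevant structure lies.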

Generation of Cytometric Fingerprints

A simple one-dimensional representation of the density variations relative to a template may be obtained by recording the densities for the bins in the form of a list. We call this representation a "fingerprint" because it distinguishes individual samples (referred to as "instances") represented by list-mode data. Figure 4 schematically illustrates this procedure. Fingerprints are derived as follows:

i. For a set of $M$ instances $\{S_1, S_2, \ldots, S_M\}$, aggregate the events from all of the instances to form a composite denoted $S$.

ii. Find the bins for $S$.

iii. Bin the data for each instance $S_i$ ($i = 1, 2, \ldots, M$), using the bins from Step ii.

iv. Convert the event counts in each bin into an event fraction by dividing each count by the total number of events in the data set.

The lists of binned event densities for the set of instances, $S_1, S_2, \ldots, S_M$, form a set of fingerprints for these data relative to the probability density estimated from the composite data set, and may be used, for example, to detect and quantify deviations from a norm (18).

Figure 2. Schematic outline for oblique recursive subdivision. The method starts with a list-mode file for the template dataset D. After initialization, the algorithm recursively processes events in parent bins at a given level of resolution, forming children bins at the next higher level of resolution. The algorithm terminates when the specified number of bins has been generated.

Figure 3. Comparison of rectangular and oblique binning. The figure illustrates the differences between the binning method utilized by probability binning and oblique recursive subdivision. One thousand two-dimensional events were generated such that they fall into the two elliptical clusters shown in red and blue. The figure on the left shows the result for probability binning. The figure on the right shows the result for oblique recursive subdivision. The thick brown line indicates the location of the first subdivision. Probability binning splits the events along the Parameter 2 axis creating two bins having mixed blue-cluster and red-cluster events, whereas oblique recursive subdivision creates two bins which are each pure in red and blue events. Subsequent bin divisions are indicated by successively thinner lines in green (second level recursion), purple (third level), light brown (fourth level), and black (fifth level). Even at the fifth level of recursion, two bins remain "mixed" with probability binning, although at successively higher levels of resolution, few events are in the mixed bins. For oblique recursive subdivision, because clusters are separated at the first level of recursion, bins formed at every level of recursion remain pure.

Figure 4. Schematic representation of fingerprint formation. This diagram shows two instances binned with respect to template bins found using oblique recursive subdivision. In this schematic example, there are a total of 63 bins. The thickest brown diagonal line separates the data into two level 1 bins; the thick blue lines separate each of these bins into the four level 2 bins; and so on. The fingerprints to the right of the dot plots display raw event counts (or P values obtained from Poisson statistics) for each bin, with the color intensity encoding the amplitude of the feature. Fingerprints are organized in resolution hierarchies, with low resolution features to the left progressing to successively higher resolution features to the right. There are $2^L$ features within a resolution hierarchy where $L$ is the level of resolution. Thus, the rightmost 32 features indicate the fingerprint values at the highest level of resolution. The dot plot for the "disease" instance has a well-defined cluster of cells in the lower right-hand bin which is not present in the normal instance. The arrows indicate the effect of this cluster in the disease instance fingerprint at multiple levels of resolution.
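Steps i to iv can be sketched as follows. To keep the example short and self-contained, the binning step uses a deliberately simplified stand-in: equal-probability bins along the first parameter only, found from quantiles of the composite. This is an assumption made for illustration and is not oblique recursive subdivision; the function names and the synthetic instances are likewise hypothetical.

```python
import numpy as np

def fit_bins(composite, n_bins):
    """Stand-in for the binning step: equal-probability bin edges along
    the first parameter of the composite (NOT oblique recursive subdivision)."""
    inner_quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(composite[:, 0], inner_quantiles)

def assign_bins(events, edges):
    """Return the bin index (0..n_bins-1) of every event."""
    return np.searchsorted(edges, events[:, 0], side="left")

def fingerprints(instances, n_bins):
    composite = np.vstack(instances)                  # step i: aggregate
    edges = fit_bins(composite, n_bins)               # step ii: find bins
    rows = []
    for events in instances:
        labels = assign_bins(events, edges)           # step iii: bin instance
        counts = np.bincount(labels, minlength=n_bins)
        rows.append(counts / len(events))             # step iv: event fractions
    return np.vstack(rows)

rng = np.random.default_rng(2)
normal_a = rng.normal(0, 1, (2000, 2))
normal_b = rng.normal(0, 1, (2000, 2))
# Third instance: same background plus a localized extra cluster.
spiked = np.vstack([rng.normal(0, 1, (1700, 2)),
                    rng.normal([4, 0], 0.3, (300, 2))])

fp = fingerprints([normal_a, normal_b, spiked], n_bins=8)
print(np.round(fp, 3))
```

Each row is one instance's fingerprint. The instance containing the extra cluster shows an elevated fraction in the bin covering that region, which is the kind of deviation from a norm that the text refers to.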

Another variation of fingerprinting is particularly useful for classification. The goal is to emphasize differences between samples belonging to different classes. For classification problems, one typically has a set of training instances for which the class identity is known and a set of test or validation instances for which the class identity is unknown. The training data may be aggregated to construct class-specific templates. Fingerprints for individual instances, from both the training set and test or validation sets, can be computed utilizing bins derived from the templates. Fingerprints from the training instances may then be used to construct supervised classifiers with which to classify fingerprints for test or validation instances.

RESULTS

Spiked Sample Experiment and Analysis

A series of "spiking" experiments was carried out to demonstrate the use of cytometric fingerprints for quantitative analysis of cell populations. The basis for these experiments was to simulate a potential phenotypically unique population of cells present at varying concentrations depending on some condition or treatment. This was done by creating a series of concentrations of minority cell populations (Pool no. 1) spiked into a majority cell population (Pool no. 2). Fifteen list-mode files were obtained from tubes with nominal spiked cell concentrations increasing in half-decade increments from 0.001% to 10%. Additionally, 14 list-mode files were acquired for control samples having only Pool no. 2 cells. The actual concentrations of spiked cells varied from the nominal values because of uncertainties in the sorting process and the number of reservoir Pool no. 2 cells into which the Pool no. 1 cells were spiked. For the population gating analysis, a set of sequential gates corresponding to that used to sort Pool no. 1 cells into the Pool no. 2 reservoir was used to determine the reference number of spike cells in each sample. By contrast, the CF method did not have access to knowledge of the location and distribution of the spiked events. Rather, the goal was to automatically determine the location and concentration of spiked cells for each concentration experiment independently by direct comparison with the Pool no. 2 reference set. To do this, the only assumption made by CF was that some events may be present in one distribution that were absent in the other. Thus, the objective of the analysis in this section was to demonstrate the ability of CF to generate hypotheses (i.e., automatically determine unique cell population locations and concentrations) and to determine the concentration limits for which such hypothesis generation is possible.

As discussed previously, cytometric fingerprints are one-dimensional lists of event counts in bins formed with respect to a template instance. Event counts for list-mode files binned with respect to this template show density variations for that sample with respect to the template. In general, to compare two sets of samples, one could choose to aggregate all instances to form a template; choose one of the two sets to form the template; or form templates from each set and create two sets of fingerprints for each instance. For the spiking experiments, we chose to form template bins based on each spiked experimental sample and then create fingerprints for the control samples with respect to each of these templates. Qualitatively, what one expects to see with this approach is that a cluster of Pool no. 1 cells in the spiked sample will cause some number of bins to be localized in the region of parameter space occupied by this cluster. The number of bins formed in the region of Pool no. 1 cells depends on the concentration and degree of localization of the spiked cells. Binning unspiked samples with respect to this template will result in a fingerprint with low event counts in the bins associated with the spiked cell cluster. We note that in a setting where it is not known in advance in which group of samples a cluster is likely to occur, samples from all groups can independently be used to generate templates and each sample can then be binned against each template, a method we call differential fingerprinting. For the sake of simplicity we do not illustrate this here; however, the results would be equivalent. We next describe the quantitative estimation of the number of spiked cells.

The total number of events in a spiked sample is the sum of the spiked Pool no. 1 events and background Pool no. 2 events.

$$N = N_1 + N_2. \tag{1}$$

This superposition of event counts holds for every subregion of cytometric parameter space. In particular, if the spiked sample is used as the template and bins are formed using oblique recursive subdivision, for each bin $i$, the total number of events $T_i$ is

$$T_i = X_i + Y_i \tag{2}$$

where $X_i$ and $Y_i$ are, respectively, the numbers of Pool no. 1 and Pool no. 2 events in bin $i$. By design, the number of events in a bin (at a given resolution) for the template instance is $T_i = N/n_B$, where $n_B$ is the number of bins at the given resolution. From Eq. (2), the number of Pool no. 1 events is

$$X_i = \frac{N}{n_B} - Y_i \tag{3}$$

The total number of spiked events in a sample can be found

by summing over the set of bins S containing spiked cells.

$$N_1 = \sum_{i \in S} X_i = \sum_{i \in S} \left( \frac{N}{n_B} - Y_i \right). \qquad (4)$$

The number of Pool no. 2 cells in each bin can be estimated

from the cytometric fingerprints of the control samples

because these contain only Pool no. 2 events. Aggregating the

control samples and binning relative to the spiked sample

template, the fraction of Pool no. 2 cells in each bin is given by

$$f_i = \frac{k_i}{mN}, \qquad (5)$$

where ki is the event count value for bin i obtained from the

fingerprint of the aggregated control samples. The denominator in this expression is the total number of events in the

aggregated control sample which is m (the number of aggre-

gated Pool no. 2 tubes) times the number of events, N, in

each tube. Using Eqs. (1) and (5), the number of Pool no. 2

events in bin i is

$$Y_i = f_i N_2 = f_i (N - N_1) = \frac{k_i}{mN}(N - N_1) \qquad (6)$$

Combining Eqs. (4) and (6), the fraction of spiked events in

an experimental sample is

$$\frac{N_1}{N} = \frac{mN\,|S|/n_B - \sum_{i \in S} k_i}{mN - \sum_{i \in S} k_i} \qquad (7)$$

where |S| is the number of bins containing spiked cells.
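As a minimal sketch of how Eq. (7) might be evaluated in practice, the Python function below takes the aggregated control fingerprint and a chosen set of bins and returns the estimated spiked fraction. The function and argument names are hypothetical and are not taken from the authors' software.

```python
def spiked_fraction(control_counts, spike_bins, n_bins, n_events, n_controls):
    """Estimate N1/N from Eq. (7).

    control_counts: k_i, event counts of the aggregated control sample per bin
    spike_bins:     indices of the bins in the set S
    n_bins:         n_B, number of bins at this resolution
    n_events:       N, events per tube
    n_controls:     m, number of aggregated control tubes
    """
    k_sum = sum(control_counts[i] for i in spike_bins)
    m_n = n_controls * n_events
    return (m_n * len(spike_bins) / n_bins - k_sum) / (m_n - k_sum)
```

When the control sample has no events at all in the selected bins, the expression reduces to |S|/n_B, the fraction of template bins occupied by the spike.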

The remaining task is to determine the set of bins S encom-

passing the region of localization of spiked cells. Because the

template bins were created using spiked samples, if there is a

sufficient concentration of Pool no. 1 cells, one expects the

Pool no. 2 control sample fingerprints to have abnormally low

event counts in these bins. For example, in the experiment

with 0.1% spike concentration, there are 100 Pool no. 1 cells

out of the 100,000 events acquired from the tube. At a resolu-

tion of 8,192 bins, there will be on average 12.2 events per bin.

Thus, if the Pool no. 1 cells are well localized, one would

expect on the order of eight template bins to correspond to

this region in parameter space and that the control sample fin-

gerprints should have low event counts in these bins. These

low event count bins show up most clearly if the control sam-

ple fingerprint values are sorted in order of ascending bin

event count. For spiked samples with a sufficient spike popu-

lation, the sorted control sample fingerprint will have several

zero values followed by a rapid rise corresponding to bins

associated with the background of Pool no. 2 events. The posi-

tion of this change in slope is most easily determined using

the derivative of the sorted fingerprint values. The first maxi-

mum in the derivative of the sorted fingerprint signal indicates

the point where bins dominated by spiked events transition to

bins dominated by background events.
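The bin selection step described above can be sketched as follows. This is an illustrative Python reading of the text, not the authors' code: the derivative is taken as the first difference of the sorted counts, and "first maximum" is interpreted as the first local maximum of that difference. Real fingerprints are noisy, so some smoothing would probably be needed before applying it; that is omitted here.

```python
def select_spike_bins(control_counts):
    """Return indices of bins lying before the first jump in the sorted counts."""
    # Sort bin indices by ascending control event count.
    order = sorted(range(len(control_counts)), key=lambda i: control_counts[i])
    sorted_counts = [control_counts[i] for i in order]
    # Discrete derivative of the sorted fingerprint.
    diffs = [b - a for a, b in zip(sorted_counts, sorted_counts[1:])]
    for j, step in enumerate(diffs):
        prev_step = diffs[j - 1] if j > 0 else 0
        next_step = diffs[j + 1] if j + 1 < len(diffs) else 0
        # First local maximum of the derivative marks the transition.
        if step > prev_step and step >= next_step:
            return sorted(order[: j + 1])
    return []
```

Bins up to and including the position of the jump are returned; everything after it is treated as background.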

The selection of bins by the method outlined earlier is

somewhat conservative in that we pick the set of bins clearly

dominated by Pool no. 1 events, excluding some bins that contain a mixture of a small number of spiked events and a larger

number of background events. This results in an underestima-

tion of the spiked event concentration. However, this under-

estimation can be corrected by taking advantage of the multi-

resolution nature of CF. At higher resolutions a larger number

of low event-count bins pack more tightly around the cluster

of spiked events and thus are able to more clearly delineate

between spiked and background events. Lower resolution bins

contain more events but are broader and may overlap the

spiked and background regions. Our bin selection strategy is

to use the highest resolution bins to determine the cutoff

threshold for the number of Pool no. 2 events allowed in a bin

and then apply this threshold to the lower resolutions. This is

analogous to using spheres of successively smaller radii to esti-

mate the volume of a container. Utilizing the threshold deter-

mined from the highest resolution corresponds to considering

only spheres (at any resolution) that are contained within the

volume to the same degree as the smallest spheres. Each

resolution of binning provides an estimate of the spiked cell

concentration.

The form of a function useful for estimating the behavior

of concentration as a function of resolution can be obtained

by geometrical arguments. As the binning resolution increases,

the fraction of bins overlapping spiked and background

regions decreases. Consider the fraction of bins that lie on the

boundary of the spiked region relative to the number of bins

interior to the region. Letting h represent the characteristic

linear size (scale) of the bins, the number of bins at the

boundary is proportional to $S/h^{d-1}$ while the number of bins interior to the spike region is $V/h^{d}$, where S and V represent the surface area and volume of the region and d is the dimensionality of the space. The fraction of bins on the boundary of the spike region is given by

$$f = \frac{S/h^{d-1}}{V/h^{d}} \propto h$$

which is proportional to the scale of the bins. The scale of the

bins decreases by a factor of 2 for each increase in resolution

level. Thus, the fraction of bins on the boundary decreases

geometrically with resolution hierarchy index L and can be represented as $f \sim e^{-kL}$. Therefore, the estimated concentration

level asymptotically approaches the true value and can be rep-

resented as

$$C(L) \propto (1 - af) = C_0\left(1 - a e^{-kL}\right) \qquad (8)$$

C0, a, and k are parameters that can be fit from the series of

multiresolution estimates. The parameters a and k are propor-

tionality factors that depend on the geometry of the spike dis-

tribution and dimensionality of the space while C0 may be

interpreted as the asymptotic concentration as the resolution

approaches infinity.
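One way the three parameters of Eq. (8) could be fit from a handful of multiresolution estimates is sketched below in Python. The fitting procedure is an assumption, since the text does not say how the fit was done: for each trial value of k on a coarse grid the model is linear in the remaining parameters, so an ordinary least-squares line is fit and the k with the smallest residual is kept. The function name and the grid range are hypothetical.

```python
import math

def fit_asymptote(levels, estimates):
    """Fit C(L) = C0 * (1 - a * exp(-k * L)); return (C0, a, k)."""
    l0 = min(levels)                      # shift L for numerical conditioning
    n = len(levels)
    y_mean = sum(estimates) / n
    best = None
    for step in range(1, 301):            # trial k from 0.01 to 3.00
        k = step / 100
        x = [math.exp(-k * (lv - l0)) for lv in levels]
        x_mean = sum(x) / n
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, estimates))
        slope = sxy / sxx                 # equals -C0 * a * exp(-k * l0)
        c0 = y_mean - slope * x_mean      # intercept is the asymptote C0
        sse = sum((c0 + slope * xi - yi) ** 2 for xi, yi in zip(x, estimates))
        if best is None or sse < best[0]:
            best = (sse, c0, slope, k)
    _, c0, slope, k = best
    a = -slope * math.exp(k * l0) / c0
    return c0, a, k
```

With only four resolutions and three free parameters the fit has little redundancy, so the extrapolated value should be read as indicative rather than precise.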

High-resolution (8,192 bins) fingerprints for the aggre-

gate of the control Pool no. 2 cells binned relative to each spike

concentration sample are shown in Figure 5. Low event count

bins are clearly visible for the high spike concentration experi-

ments. As the spike concentration decreases, the number of

bins with low event density falls. The limit of detection is at

the 0.01% level corresponding to 10 cells out of 100,000. This

is an expected result because, for the highest resolution, the

best one could do is to form a bin around a cluster of 12

spiked events. Table 1 summarizes the results of the spiking

experiments for levels of resolution corresponding to 1,024,

2,048, 4,096, and 8,192 bins. It also shows the asymptotic concentration using the method described earlier. Spike concentrations less than 0.03% were detectable at only the highest

level of resolution and so could not be extrapolated. The table

shows good agreement between the event counts from popula-

tion gating and those obtained by CF.


Analysis of Synthetic Data: Detection of Rare Events

CF can be used for a variety of objectives in the analysis

of flow cytometric data. Here we show that CF is able to detect

and identify the presence of a small number of events against a

large overlapping background distribution, and that it can do

so in the presence of significant ‘‘biological’’ variability. This is

similar to the application of PB termed Frequency Difference

Gating (12).

Using the R Statistical Computing Environment, we cre-

ated a series of two synthetic populations consisting of 100

instances each of simulated flow cytometric data. The two

populations are termed ‘‘normal’’ and ‘‘disease.’’ Each instance

in each population consists of 100,000 events described in a

four-dimensional space. Each event is generated from one of

five equally represented 4D hyperellipsoidal distribution func-

tions whose centroids (Ci) and standard deviations (Si) were

assigned using a random number generator within predefined

limits corresponding to a typical log-linear parameter range of

0–1,024 log fluorescence intensity. For the disease instances, a

sixth cluster was added with n events. (An equal number of n

events were removed at random from the other five clusters

such that the total number of events in each data set remained

unaltered.) A series of disease populations was generated,

varying n from a minimum of 10 events (corresponding to

0.01% disease events) to a maximum of 1,000 events (corre-

sponding to 1% disease events). The actual centroids for each

hyperellipsoid (including the disease hyperellipsoid) for each

instance were randomly varied with a standard deviation of

0.08 log10 units. The standard deviations were randomly varied for each instance with a standard deviation of 0.04 log10 units. Thus, no two instances had the same distributions, and

there were marked visual differences in the distribution as

judged from pairwise comparisons of the dotplots (data not

shown). The locations of the hyperellipsoidal functions were

such that the disease population was significantly intermingled

with the normal population along all parameter axes. Thus,

with no prior knowledge of the location of the disease-indicat-

ing events, at the lower disease event concentrations (0.1%

and lower) it is not possible to visually detect their presence.

Consequently, conventional sequential gating analysis in such

a situation cannot be applied.
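The construction of one synthetic instance can be sketched roughly as follows. The authors worked in R; this Python version is only an illustration, and several details are assumptions: clusters are taken to be axis-aligned Gaussians, the 1,024-channel range is taken to span four decades (256 channels per decade), the width jitter is applied additively, and background events are assigned to the five clusters uniformly at random. All names are hypothetical.

```python
import random

CHANNELS_PER_DECADE = 256  # assumption: 1,024 channels spanning four decades

def make_instance(centroids, sds, n_events, n_disease, rng):
    """Return (events, labels); the last centroid/sd entry is the disease cluster."""
    jittered = []
    for centre, widths in zip(centroids, sds):
        # Per-instance variation: 0.08 log10 units on centroids, 0.04 on widths.
        c = [x + rng.gauss(0, 0.08 * CHANNELS_PER_DECADE) for x in centre]
        s = [abs(w + rng.gauss(0, 0.04 * CHANNELS_PER_DECADE)) for w in widths]
        jittered.append((c, s))
    background, disease = jittered[:-1], jittered[-1]
    events, labels = [], []
    for i in range(n_events):
        is_disease = i < n_disease      # disease events replace background events
        c, s = disease if is_disease else rng.choice(background)
        events.append([rng.gauss(m, w) for m, w in zip(c, s)])
        labels.append(is_disease)
    return events, labels
```

Passing `n_disease=0` yields a "normal" instance with the same total event count, which mirrors the way the total was held fixed in the text.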

For a given population of 100 instances each of disease

and normal, we further divided the instances randomly into

equal training and testing sets. The events in the normal train-

ing instances were aggregated, and CF template bins were

determined from this aggregate to a resolution of 8,192 leaf

bins (16,383 bins in total). Cytometric Fingerprints were then

individually computed for all 200 instances in the population

with respect to this template. Fingerprints were computed

from the aggregated disease training instances and a procedure

similar to that described in the previous section was used to

automatically select significant features.

Figure 6 shows a set of dotplots for one of the disease

training instances (0.1% disease events) selected at random.

Disease events are shown in red. The circles show the known

centroid of the disease cluster. The events that fall within the

informative fingerprint bins group closely about the location

where the disease events were located in this instance, demon-

strating that CF indeed identifies the multiparameter locus of

significance in this data set. Moreover, for the 0.1% disease

case, on average about 50% of the events that fall within the

informative leaf bins are actual disease events, which can be

determined with certainty in synthetic data. This represents an

enrichment factor of about 500×, showing that CF can be an

effective means for sorting rare events.
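As a quick check of the enrichment figure, take the roughly 50% purity inside the informative bins and the 0.1% disease fraction in the whole sample:

$$\text{enrichment} \approx \frac{0.50}{0.001} = 500$$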

Figure 5. Cytometric fingerprints from spiking experiment. Fingerprints corresponding to two different spike concentrations are shown for illustrative purposes. The spiked samples were used as the template instances to determine bins, and the fingerprints represent the aggregate of the unspiked pure Pool no. 2 population (as described in the text). (a) Fingerprints for a nominal spike concentration of 1%. The upper graph shows the 8,192 high-resolution bins of the fingerprint. The bins shown in black represent the location corresponding to the spiked events. The lower graph shows the same data, zoomed to show better that there are many bins corresponding to the location of spiked events in which the reference instances lack events. (b) As in (a) but for 0.1% spiking concentration. Note that the number of black bins is much smaller than that at the higher 1% spike concentration.

The detection task is equivalent to determining which instances are normal and which are disease by means of their Cytometric Fingerprints alone. For this purpose, we utilized

the selected features for the set of simulated instances at 0.1%

disease event concentration to perform supervised classifica-

tion. We chose linear discriminant analysis (LDA) (function

lda in package MASS of the R statistical computing environ-

ment) because it is perhaps the simplest method of classifica-

tion and its use here is intended to illustrate that the

cytometric fingerprints are in fact informative with respect to

phenotype.

Table 1. Estimate of spiked cell concentration by CF

% spiked cell concentration (a)    Estimated % spiked cell concentration (b)
Nominal    Gating     1,024 bins    2,048 bins    4,096 bins    8,192 bins    Asymptotic
                      (L = 10)      (L = 11)      (L = 12)      (L = 13)      (L → ∞)
0.001      0.002      –             –             –             –             –
0.003      0.003      –             –             –             –             –
0.010      0.019      –             –             –             0.012         0.012
0.032      0.041      –             –             0.024         0.036         0.036
0.100      0.170      0.097         0.145         0.145         0.157         0.157
0.316      0.530      0.390         0.488         0.488         0.512         0.513
1.000      1.950      1.757         1.806         1.830         1.866         1.981
3.162      4.720      4.491         4.539         4.588         4.610         4.697
10.00      13.80      13.379        13.525        13.622        13.695        13.854

(a) Column 1 = nominal spike concentration; Column 2 = actual spiked cell concentration determined by population gating with sequential gates corresponding to those used for sorting.

(b) Columns 3 through 6 give the results of cytometric fingerprint analysis at four binning resolutions; Column 7 shows the concentration estimated by using the four fingerprint resolutions to extrapolate to infinite resolution.

Figure 6. Synthetic data. (a) Example of one of the ‘‘normal’’ instances. (b, c) Example of one of the ‘‘disease’’ instances, with 50 disease cluster events spiked into 100,000 events. In (b) the disease events are present but not marked, and in (c) they are marked in red. (d) Same instance as in (b, c), but all six projections are shown to indicate that the 2–3 projection is the one where the disease cluster is best separated from the background. (e, f) Another example of a disease instance. The regions encircled in (b, c) and (e, f) indicate the general location of the disease clusters. Note that the disease cluster is intermingled with the background. Comparing (a), (b), and (e) it is possible to discern the fact that no two instances are exactly the same, as described in the text.

Utilizing the training set only, an LDA classifier was trained. This model was then used to predict the remaining 100 instances in the test set in a blinded fashion. The class membership of these instances was then unblinded to evaluate the classification accuracy. This procedure was repeated for 100 random draws (a random draw is a random selection of 50 of the 100 instances of each class to act as the training set, the remaining 50 being the test set), each time building a fresh LDA classifier based upon only the training instances in the draw and evaluating its accuracy using only the test instances

for that draw. The average classification accuracy at 100/

100,000 disease event concentration was 97.74%/98.5% (sensi-

tivity/specificity). This accuracy degraded to 94.02%/93.98%

and 67.30%/67.68% at disease event concentrations 50/

100,000 and 20/100,000, respectively. (At concentrations above

100/100,000 the classifier was perfectly accurate.) At 10/

100,000 disease event concentration the LDA classifier per-

formed only slightly better than the 50% null hypothesis,

demonstrating that an effective lower limit of detection had

been reached for this hypothetical ‘‘disease.’’ That this per-

formance is achieved with a very simple classifier suggests that

CF is an efficient method of representing differences among a

set of multivariate probability density distributions.
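The repeated random-draw evaluation can be illustrated with a deliberately reduced example. The authors used `lda` from the R package MASS on the selected fingerprint features; the Python sketch below instead assumes a single scalar feature per instance, for which LDA with equal priors reduces to a threshold halfway between the two class means. It shows only the resampling and the sensitivity/specificity bookkeeping, and the function name is hypothetical.

```python
import random
from statistics import mean

def evaluate_draws(normal, disease, n_draws=100, seed=0):
    """Average (sensitivity, specificity) over random 50/50 train/test splits."""
    rng = random.Random(seed)
    sens, spec = [], []
    for _ in range(n_draws):
        n_shuf = rng.sample(normal, len(normal))
        d_shuf = rng.sample(disease, len(disease))
        half_n, half_d = len(normal) // 2, len(disease) // 2
        train_n, test_n = n_shuf[:half_n], n_shuf[half_n:]
        train_d, test_d = d_shuf[:half_d], d_shuf[half_d:]
        # One-feature LDA with equal priors: threshold at the midpoint of means.
        mu_n, mu_d = mean(train_n), mean(train_d)
        threshold = (mu_n + mu_d) / 2
        sign = 1 if mu_d >= mu_n else -1
        called_d = [sign * (x - threshold) > 0 for x in test_d]
        called_n = [sign * (x - threshold) > 0 for x in test_n]
        sens.append(sum(called_d) / len(called_d))            # true positive rate
        spec.append(1 - sum(called_n) / len(called_n))        # true negative rate
    return mean(sens), mean(spec)
```

Because the classifier is rebuilt from the training half on every draw, the test instances never influence the threshold, which is the property the blinded procedure in the text is designed to guarantee.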

How well does the simplest possible detection method

work on these data? To answer this question we divided the

50/100,000 concentration training data into uniform bins at a

resolution of 10 divisions on each of the four dimensions. We

then computed Mann–Whitney P values for each bin location

and found that three bins had P values less than $10^{-4}$. Counting events in these bins and averaging for the disease and normal groups separately, we detected on average only 12.97

(25.9%) of the 50 spiked events. CF detected on average 27.04

(54.1%) disease events in two high-resolution bins, indicating

that CF creates a more efficient representation than uniform

binning, in that fewer bins capture more information. PB by

comparison detected on average 25.1 (50.2%) disease events

in one lower resolution bin. Thus, for this comparison, CF

and PB perform comparably, and both perform significantly

better than uniform binning.

Quality Control Based on CF

CF has been used to quantitatively assess the consistency

of gating across data derived from a panel of tubes. Here we

illustrate CF’s ability to complement rather than replace con-

ventional methods of sequential gating data analysis.

Often, clinical flow cytometric assays involve the acquisi-

tion of data from several tubes in order to accommodate

many markers or stimulation conditions. Analysis of multi-

tube data frequently involves the establishment of gates (based

on a subset of parameters common to all tubes) which can be

uniformly applied across the entire panel, thus providing a

basis for consistent population identification and minimizing

time and effort in the data analysis process. The underlying

assumption is that events are distributed the same way for

each tube across these gating parameters. Violations of this

assumption due to unanticipated and undetected shifts or

other dissimilarities of the gating data may result in inaccurate

phenotyping.

CF-based quality control assessment was applied in the

context of a 96-well plate format flow cytometric experiment.

In this study (19), fresh PBMCs were added to an array of an-

tigenic stimuli and a set of controls in a 96-well plate format.

Cells were allowed to respond to the stimuli for 6 h. Rows A–D of the plate contained IFN-γ FITC, CD69 PE, CD4 perCP-Cy5.5, and CD3 APC, whereas rows E–H contained TNF-α FITC, IL-2 PE, CD4 perCP-Cy5.5, and CD3 APC. Responses

and the cell types of the responders were then measured by

gating on the common parameters including Side scatter, CD4

perCP-Cy5.5, and CD3 APC.

To compare the gating distributions, cytometric finger-

prints were computed on the three dimensions common to all

96 wells at a resolution of 32 bins by binning on the aggre-

gated gating data from all 96 wells, projecting the distribution

from each well onto the resulting template bins, and expres-

sing the PDF for each well as a vector of the base-2 logarithm

of the fractional deviation from the norm. Figure 7 shows

examples of these fingerprints for each well, along with the

maximum deviation for each well. Log2 (maximum deviation)

values less than 2.0 are color-coded green, between 2.0 and 3.0

in yellow, and greater than 3.0 in red. Bins coded yellow and

red clearly indicate disruptions of the characteristic CD3/CD4

distributions as shown in Figure 7c. This type of CF-based

analysis provides a quick visual triage of possible data acquisi-

tion artifacts that can help the investigator recognize and

correct defects.
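A rough Python sketch of the per-well triage follows. The exact definition of "fractional deviation from the norm" is not spelled out in the text, so this version makes assumptions: the norm for each template bin is taken to be 1/n_bins (the bins are equal-probability for the aggregate), the deviation is the base-2 logarithm of the ratio of a well's bin fraction to that norm, and a small pseudocount keeps empty bins finite. The function name is hypothetical; the colour thresholds are those quoted above.

```python
import math

def qc_flag(counts, pseudocount=0.5):
    """Return (log2 deviations, maximum absolute deviation, colour) for one well."""
    n_bins = len(counts)
    total = sum(counts) + pseudocount * n_bins
    expected = 1.0 / n_bins                      # assumed norm per template bin
    devs = [math.log2(((c + pseudocount) / total) / expected) for c in counts]
    worst = max(abs(d) for d in devs)
    if worst < 2.0:
        colour = "green"
    elif worst <= 3.0:
        colour = "yellow"
    else:
        colour = "red"
    return devs, worst, colour
```

A well whose events are spread evenly over the 32 template bins scores close to zero, while a well with a strongly over- or under-populated bin is pushed into the yellow or red range.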

DISCUSSION

In this report we describe CF, a new method for unbiased

quantitative analysis of list-mode cytometric data. This tech-

nique relies on a method of subdividing a multidimensional

space which extends that of PB. CF includes a novel algorithm

for space subdivision that eliminates the constraint in PB of

subdivision along the coordinate axes by dividing multidi-

mensional space into nonrectangular polytopes. Additionally,

it renders a multidimensional distribution in the form of a

one-dimensional multiresolution hierarchical fingerprint that

creates a computationally efficient representation of high-

dimensionality distribution functions.

Each event in list-mode data is described by a vector of

coordinates in a multidimensional space. Thus, a complete

mathematical description of a sample is the multivariate prob-

ability distribution function defining the density of events in

this space. This distribution may be approximated by dividing

the space into small volume elements, counting the number of

events in each volume element, and normalizing the count by

the total number of events in the sample. In the limit of an in-

finite number of events, the regions may be made infinitesi-

mally small yielding the true probability distribution function.

Of course, it is impossible to collect an infinite number of

events, and so the question of interest is how does one accu-

rately estimate the true probability distribution from a finite

sample of events? Equally importantly, how does one represent

this approximation of the multivariate probability density

function in a form amenable to comparing disparate samples?

The most common nonparametric means of estimating a

PDF is a histogram where space is divided into equal width

bins (20). For a complex (rapidly changing) PDF, one would

like to choose small bins in order to accurately track the varia-

tion with respect to independent variables (low bias). On the

other hand, one would like to choose bins of sufficient size to

contain a large number of events in order to estimate the value

of the density within a bin with high accuracy (low variance).

This trade-off between number of bins and bin size is the


classic bias-variance dilemma (21). For one independent vari-

able and reasonably sized datasets, it is not difficult to balance

the bias-variance requirements. For multidimensional data the balance is harder to strike, and one approach is to control the variance by choosing bins that contain equal numbers of events. In the case of

univariate data, there is a unique set of bin boundaries that

accomplishes this goal (22). For multivariate data, however,

there is not a unique solution. While this indeterminacy might

seem like a disadvantage, in fact, it creates an opportunity to

find a specific set of bin boundaries that does a superior job of

reducing bias.
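For the univariate case mentioned above, equal-count bin boundaries are simply quantiles of the data. The short Python sketch below places each boundary midway between the two neighbouring sorted values; the function names are illustrative only.

```python
import bisect

def equal_count_edges(values, n_bins):
    """Interior bin boundaries that put (nearly) equal numbers of events in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for b in range(1, n_bins):
        cut = b * n // n_bins
        # Place the boundary midway between the neighbouring events.
        edges.append((ordered[cut - 1] + ordered[cut]) / 2)
    return edges

def bin_counts(values, edges):
    """Count events per bin for the given interior boundaries."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_left(edges, v)] += 1
    return counts
```

In one dimension the ordering of events fixes the boundaries (up to the gap between neighbouring events); in several dimensions there is no single ordering, which is the indeterminacy the text refers to.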

Other methods of representing and analyzing multidi-

mensional flow cytometry data have been developed (8,23–

25). One that is most closely related to the present work is PB

(7,10,12,22). PB represents a multidimensional probability

distribution as a set of bins defining regions of the multidi-

mensional space. The boundaries of these bins are chosen so

that approximately equal numbers of events lie in each bin.

Bins are found by selecting a coordinate dimension, determin-

ing the median in that coordinate, and dividing the data at the

median value. In PB, the axis selection is made by calculating

the variance of the data in the parent bin for each of the origi-

nal coordinate dimensions and choosing the one dimension

having the largest variance. Although the decision is made on

the basis of the variance in each dimension, the split is not

necessarily along the optimal direction because the direction

of maximum variance may not coincide with one of the coor-

dinate axes. The method of binning utilized in CF differs from

that of PB in that bins are formed by splitting the data in the

direction of maximum variance rather than along the original

coordinate axes. A primary objective of CF is the development

of a general and efficient representation of the multivariate

probability density function embodied in list-mode data that

can be utilized with machine learning methods to analyze

cytometric data. Key features of this representation are that it

retains information at multiple resolutions and is in the form

of a one-dimensional feature vector.
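To make the contrast with axis-aligned splitting concrete, here is a minimal Python sketch of recursive median splitting along the direction of maximum variance, followed by projection of a sample onto the resulting template. It is not the authors' implementation: the principal direction is approximated by power iteration on the covariance, only the leaf-level counts are returned, and all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def principal_direction(points, iterations=50):
    """Approximate leading eigenvector of the sample covariance (power iteration)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0 / d ** 0.5] * d
    for _ in range(iterations):
        w = [0.0] * d
        for x in centered:                 # w = (X^T X) v, accumulated row by row
            proj = dot(x, v)
            for j in range(d):
                w[j] += proj * x[j]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

def build_template(points, levels):
    """Binary tree of (direction, threshold, low, high); None marks a leaf bin."""
    if levels == 0:
        return None
    if len(points) < 2:
        raise ValueError("too few events for the requested resolution")
    v = principal_direction(points)
    proj = sorted(dot(p, v) for p in points)
    half = len(points) // 2
    threshold = (proj[half - 1] + proj[half]) / 2    # median split
    low = [p for p in points if dot(p, v) <= threshold]
    high = [p for p in points if dot(p, v) > threshold]
    return (v, threshold,
            build_template(low, levels - 1), build_template(high, levels - 1))

def fingerprint(template, sample, levels):
    """Leaf-level event counts of a sample binned against a template."""
    counts = [0] * (2 ** levels)
    for p in sample:
        node, index = template, 0
        while node is not None:
            v, threshold, low, high = node
            go_high = dot(p, v) > threshold
            index = 2 * index + int(go_high)
            node = high if go_high else low
        counts[index] += 1
    return counts
```

Binning the template sample against its own template gives equal counts in every leaf; a different sample binned against the same template gives the count vector that serves as its fingerprint at that resolution. Coarser resolutions can be obtained by summing sibling leaves.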

As evidenced from the controlled spiking experiments,

CF compares well with proven methods of data analysis (e.g.,

sequential gating) in cases where the target cell population is

discrete and easily recognized. In a more challenging setting

illustrated by the synthetic data example, CF is able to detect

and identify events in complex, overlapping distributions that

are not visible to the human eye in any 2D dotplot and are

thus not amenable to methods of analysis that are graphically/

visually guided. Furthermore, even at a level below 0.05%

anomalous events in an ungated population, in which the

anomalous events are not even detectable via conventional

methods, cytometric fingerprints were able to efficiently repre-

sent the data such that a simple classifier (in this case LDA)

was able to accurately and predictively classify instances into

distinct ‘‘phenotypes’’ based upon the raw list-mode data

alone. Importantly, CF does not depend upon arbitrary

choices of gating or analytical thresholds, which collectively

represent a priori hypotheses. Rather, CF provides a means to

generate new hypotheses directly from the data, as well as to

test existing hypotheses by direct comparison with traditional,

proven methods of data analysis.

Figure 7. Quality control assessment for gating data. (a) Fingerprints for 96-well plate data; (b) upper-left corner of the plate enlarged; (c) individual dotplots for CD3 vs. CD4 for three selected wells, superimposed on contours for the aggregated event density distribution (blue contours). White arrows in (c) for the dotplots corresponding to wells B-1 and C-12 indicate locations where event density distributions deviate from the plate norm.

It is important to note that as the number of parameters involved in flow cytometric studies increases, so will the complexity of the resulting distributions of events. Thus, there will be a concomitant demand for increasing the number of events collected. This is a fundamental requirement for multiparameter flow cytometry in general, and not just for the application of fingerprinting or other advanced methods of quantitative analysis. It is a result of the geometric dilution of the significance of any individual dimension into a ‘‘sea’’ of increasing dimensionality. CF seeks to maximize the utility of data

characterized by a given number of events at a given dimen-

sionality by optimizing the trade-off between bias and var-

iance. Thus, although it might be argued that CF allows for

the collection of the minimum number of events for a given

dimensionality, we emphasize the need to collect an adequate

sampling of events given the dimensionality, and then suggest

the use of CF to maximize the utility of those data.

CF demands adherence to strict, standardized operating

procedures in the collection of flow cytometric data, and it

can help in the development of such procedures. CF, like other

quantitative methods of analysis of flow cytometric data, is

sensitive to variations that exist in the data whether due to a

true biological effect or artifact. In fact, as shown earlier and

in Ref. 18, CF can be used to detect and correct artificial

sources of variability. Improved procedures for collecting data

will enable CF to be more sensitive to true biologic variations,

thus deriving phenotype hypotheses that can then be tested. It

is important that the quantitation of each parameter represent

as accurately as possible its corresponding physical counter-

part (e.g., the CD45-FITC fluorescence signal for an event

should reproducibly represent the copy number of CD45

molecules on the surface of the corresponding cell). Methods

for collection, storage, and transport of samples, variations in

staining and other sample preparation variables, and instru-

ment performance parameters can all contribute to small but

important variations in the distribution of events in spaces of

ever-increasing dimensionality. In difficult applications, for

example, the characterization of rare event distributions, small

systematic errors may completely mask small but biologically

significant effects. Therefore, standardized procedures, includ-

ing appropriate controls, for each step in the chain of events

leading from donor/patient to list-mode file are essential.

While the need for such rigor may seem to be a limitation of

CF, it is equally significant in any other method of quantitative analysis of list-mode data.

The CF method together with the adoption of such stan-

dardized protocols will result in a number of tangible and far-

reaching benefits. First, the application of quantitative analysis

methods such as CF reduces or eliminates tedious manual

analysis procedures. This not only increases analysis through-

put via automation but also removes an important source of

analytical bias; namely, subjective decisions (such as choosing

gates) made by a human analyst. Second, the incorporation of

additional parameters into an assay, which potentially leads to

a large increase in the number of steps involved in manual

analysis, does not increase the apparent difficulty of analysis

via CF, because data of any dimensionality are computation-

ally reduced to a one-dimensional fingerprint. Therefore, stu-

dies with larger dimensionality are made more practical. Third

and perhaps most significant, data that are collected according

to standardized and controlled protocols can be integrated

together with their fingerprints into a large-scale cytometry

database. Such an archive will be a valuable resource for retro-

spective data mining for the creation and testing of new

hypotheses for disease associations or biological mechanisms.

ACKNOWLEDGMENTS

Flow cytometry was performed at the Abramson Cancer

Center Flow Cytometry and Cell Sorting Shared Resource of

the University of Pennsylvania School of Medicine, which was

established in part by equipment grants from the NIH Shared

Instrument Program, and receives support from NIH P30

CA016520 from the National Cancer Institute. We thank the

authors of Ref. 19 for generously making their 96-well plate

data publicly available.

LITERATURE CITED

1. Wille-Reece U, Flynn BJ, Lore K, Koup RA, Miles AP, Saul A, Kedl RM, Mattapallil JJ,Weiss WR, Roederer M, Seder RA. Toll-like receptor agonists influence the magnitudeand quality of memory T cell responses after prime-boost immunization in non-human primates. J Exp Med 2006;203:1249–1258.

2. Neeson P, Paterson Y. A new multi-parameter flow cytometric assay for monitoringlymphoma growth and spread in a pre-clinical murine model for human lymphoma.Cytometry Part A 2004;60A:8–20.

3. Precopio ML, Betts MR, Parrino J, Price DA, Gostick E, Ambrozak DR, Asher TE,Douek DC, Harari A, Pantaleo G, Bailer R, Graham BS, Roederer M, Koup RA. Im-munization with vaccinia virus induces polyfunctional and phenotypically distinctiveCD8(1) T cell responses. J Exp Med 2007;204:1405–1416.

4. Wu KH, Zhou B, Mo XM, Cui B, Yu CT, Lu SH, Han ZC, Liu YL. Therapeutic poten-tial of human umbilical cord-derived stem cells in ischemic diseases. Transplant Proc2007;39:1620–1622.

5. Zhao X, Huang L, Yin Y, Fang Y, Zhou Y. Autologous endothelial progenitor cells trans-plantation promoting endothelial recovery in mice. Transplant Int 2007;20:712–721.

6. Shaffer RG, Greene S, Arshi A, Supple G, Bantly A, Moore JS, Parmacek MS, MohlerER3. Effect of acute exercise on endothelial progenitor cells in patients with periph-eral arterial disease. Vasc Med 2006;11:219–226.

7. Roederer M, Moore W, Treister A, Hardy RR, Herzenberg LA. Probability binningcomparison: a metric for quantitating multivariate distribution differences. Cytome-try 2001;45:47–55.

8. Lugli E, Pinti M, Nasi M, Troiano L, Ferraresi R, Mussi C, Salvioli G, Patsekin V,Robinson JP, Durante C, Cocchi M, Cossarizza A. Subject classification obtained bycluster analysis and principal component analysis applied to flow cytometric data.Cytometry Part A 2007;71A:334–344.

9. Lizard G. Flow cytometry analyses and bioinformatics: interest in new softwares to optimize novel technologies and to favor the emergence of innovative concepts in cell research. Cytometry Part A 2007;71A:646–647.

10. De Rosa SC, Brenchley JM, Roederer M. Beyond six colors: a new era in flow cytometry. Nat Med 2003;9:112–117.

11. Jackson J. A User’s Guide to Principal Components. New York: Wiley-Interscience; 1991.

12. Roederer M, Hardy RR. Frequency difference gating: a multivariate method for identifying subsets that differ between samples. Cytometry 2001;45:56–64.

13. O’Connell M. Search program for significant variables. Comput Phys Commun 1974;8:49–55.

14. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Mag 1901;2:559–572.

15. Jolliffe I. Principal Component Analysis. New York: Springer; 2002.

16. Hotelling H. Analysis of a complex statistical variable into principal components. J Educ Psych 1933;26:417–441, 498–520.

17. Arnold G, Collins A. Interpretation of transformed axes in multivariate analysis. Appl Statist 1993;42:381–400.

18. Rogers W, Holyst H, Moore J, Schretzenmair R, Moser A. Qtube: automated quality assessment of flow cytometric data based on cytometric fingerprinting. Cytometry Part B 2007;72B:490.

19. Inokuma M, dela Rosa C, Schmitt C, Haaland P, Siebert J, Petry D, Tang M, Suni MA, Ghanekar SA, Gladding D, Dunne JF, Maino VC, Disis ML, Maecker HT. Functional T cell responses to tumor antigens in breast cancer patients have a distinct phenotype and cytokine signature. J Immunol 2007;179:2627–2633.

20. Sturges HA. The choice of a class interval. J Am Statist Assoc 1926;21:65–66.

21. Liu RC, Brown LD. Nonexistence of informative unbiased estimators in singular problems. Ann Statist 1993;21:1–13.

22. Roederer M, Treister A, Moore W, Herzenberg LA. Probability binning comparison: a metric for quantitating univariate distribution differences. Cytometry 2001;45:37–46.

23. Robinson JP, Ragheb K, Lawler G, Kelley S, Durack G. Rapid multivariate analysis and display of cross-reacting antibodies on human leukocytes. Cytometry 1992;13:75–82.

24. Robinson JP, Durack G, Kelley S. An innovation in flow cytometry data collection and analysis producing a correlated multiple sample analysis in a single file. Cytometry 1991;12:82–90.

25. Murphy RF. Automated identification of subpopulations in flow cytometric list mode data using cluster analysis. Cytometry 1985;6:302–309.


Cytometry Part A 73A: 430–441, 2008