Cytometric Fingerprinting: Quantitative
Characterization of Multivariate Distributions
Wade T. Rogers,1,2,3* Allan R. Moser,1,2 Herbert A. Holyst,1,3 Andrew Bantly,3
Emile R. Mohler III,4 George Scangas,3 Jonni S. Moore3
Abstract
Recent technological advances in flow cytometry instrumentation provide the basis for high-dimensionality and high-throughput biological experimentation in a heterogeneous cellular context. Concomitant advances in scalable computational algorithms are necessary to better utilize the information that is contained in these high-complexity experiments. The development of such tools has the potential to expand the utility of flow cytometric analysis from a predominantly hypothesis-driven mode to one of discovery, or hypothesis-generating research. A new method of analysis of flow cytometric data called Cytometric Fingerprinting (CF) has been developed. CF captures the set of multivariate probability distribution functions corresponding to list-mode data and then “flattens” them into a computationally efficient fingerprint representation that facilitates quantitative comparisons of samples. Experimental and synthetic data were generated to act as reference sets for evaluating CF. Without the introduction of prior knowledge, CF was able to “discover” the location and concentration of spiked cells in ungated analyses over a concentration range covering four orders of magnitude, to a lower limit on the order of 10 spiked events in a background of 100,000 events. We describe a new method for quantitative analysis of list-mode cytometric data. CF includes a novel algorithm for space subdivision that improves estimation of the probability density function by dividing space into nonrectangular polytopes. Additionally, it renders a multidimensional distribution in the form of a one-dimensional multiresolution hierarchical fingerprint that creates a computationally efficient representation of high-dimensionality distribution functions. CF supports both the generation and testing of hypotheses, eliminates sources of operator bias, and provides an increased level of automation of data analysis. © 2008 International Society for Advancement of Cytometry
Key terms
flow cytometry; cytomics; bioinformatics; computational biology; machine learning; empirical modeling
1 Cira Discovery Sciences, Inc., Philadelphia, Pennsylvania
2 Perinatology Research Branch, Intramural Division, NICHD/NIH/DHHS, Hutzel Women’s Hospital, Detroit, Michigan
3 Department of Pathology and Laboratory Medicine and the Flow Cytometry and Cell Sorting Resource Laboratory, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
4 Vascular Medicine Section, Cardiovascular Division, Department of Medicine, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
Received 25 September 2007; Revision Received 5 December 2007; Accepted 16 January 2008
Grant sponsor: National Institute of Child Health and Human Development, NIH, DHHS.
*Correspondence to: Wade T. Rogers, University of Pennsylvania, 207 John Morgan Bldg., Philadelphia, PA 19104-6082.
Email: [email protected]
Published online 27 March 2008 in Wiley InterScience (www.interscience.wiley.com)
DOI: 10.1002/cyto.a.20545
© 2008 International Society for Advancement of Cytometry
Original Article
Cytometry Part A • 73A: 430–441, 2008
In the past decade, flow cytometry, along with other high information content
technologies, has experienced exponential growth. The power of flow cytometry to allow
the simultaneous measurement of multiple cellular components of individual cells in
a heterogeneous environment makes it an ideal technology for high-dimensional
biological studies. Along with other “-omics” technologies (such as genomics and
proteomics), cytomics, facilitated by high-dimensional flow cytometry, has shifted
scientific investigation from purely hypothesis-driven to a more discovery, or
hypothesis-generating, mode. With such approaches, complex cellular phenotypes and
interactions can be elucidated and can contribute to the understanding of such
important clinical questions as successful vaccination strategies (1–3), and the nature
and role of stem cells in health and disease (4–6). The availability of advanced
instrumentation coupled with expansion in the selection of probes has extended the
capabilities of multivariate analyses of cells and cell systems to the point that
conventional data-analytical paradigms have become the rate-limiting
step in the overall process of collecting, analyzing, and
interpreting complex data sets (7–10). Twenty-first century
flow-cytometric technology will thus benefit from the development
of new approaches to the analysis of data, especially those that
facilitate the application of machine learning and data mining
methodologies.
Conventional flow cytometry typically measures events
that are variable in only a few dimensions. In the best cases the
data are quantal; that is, cells express particular antigens
according to tightly regulated processes causing the number of
such molecules per cell to fall within a narrow range. Thus, a
two-dimensional projection of the data for two cell-surface
markers characterizing a population of cells can be divided
into four quadrants into which individual clusters are segre-
gated. Variation in the choice of thresholds to separate positive
events from negative events on each axis has a negligible influ-
ence on estimates of the density of events in each quadrant.
More complex multiparameter distributions are usually simpli-
fied into a series of two-dimensional distributions by sequen-
tial gating (2) or, in some cases, more sophisticated principal
component analysis (PCA) (8), to represent the high-dimen-
sionality data in a lower-dimensionality projection which
retains most of the variance in the dataset (11). However, these
dimensionality-reducing approaches possess a number of
drawbacks that limit their suitability for the analysis of com-
plex populations. First, serial pairwise analysis decouples
parameter correlations that may be important in defining
complex populations. Additionally, analytical bias may be
introduced through multiple subjective operator interventions,
which is of particular relevance in the analysis of complex,
nonquantal distributions. Finally, sequential gating involves the
imposition of selection criteria on the data that do not allow
for new interpretations of poorly characterized populations.
Analysis via gating remains the most effective method for
filtering data using well-established criteria (e.g., removing
data associated with dead cells, or counting populations in
highly optimized and quantal assays). In some cases the appli-
cation of expert judgment in the visual design of gating strate-
gies may be able to isolate events of biological interest even in
the presence of confounding experimental (or biological) vari-
ation that will be difficult to account for automatically. Never-
theless there is a strong need for tools to transform and repre-
sent multiparameter flow cytometric data in a form efficiently
amenable to methods of machine learning and data mining.
We describe here a new algorithm for the analysis of
high-dimensional cytometric data termed Cytometric Finger-
printing (CF). CF extends the approach of Probability Binning
(PB) (7) and its companion method Frequency Difference Gat-
ing (12). The objective of CF is to represent the information
contained in cytometric list-mode files in a form that enables
quantitative comparison among samples. These fingerprints
are capable of capturing and encoding the full multivariate cor-
relations of complex, high-dimensional cytometric data. This
representation is particularly useful when cell populations are
not clearly delineated by optimized assays and the distribution
of events in the multiparameter space is not bimodal.
MATERIALS AND METHODS
Cell Staining and Flow Cytometric Analysis
Peripheral blood mononuclear cells (PBMCs) were iso-
lated from 60 ml of fresh blood from seven healthy donors as
described in Ref. 6, collected under the guidelines of the Insti-
tutional Review Board of the University of Pennsylvania
School of Medicine. Cells were counted and split into two
groups: Pool no. 1 and Pool no. 2, as well as a compensation
control group. The compensation control set contained 10⁷
cells, and the Pool no. 1 and Pool no. 2 sets each contained
2 × 10⁷ cells. The two pools of cells were each stained with a
different antibody combination to create a known variable subset
of events embedded in a background distribution. Antibody
labeling was performed according to standard procedures.
Briefly, all cells were blocked with murine IgG (10 μg per 10⁷
cells) (Sigma, St. Louis, MO) for 10 min at room temperature,
and then either stained with 60 μl anti-human CD45 FITC
(BD Biosciences Pharmingen, San Jose, CA), 15 μl anti-human
CD4 PE (Invitrogen/Caltag Laboratories, Carlsbad, CA), and
15 μl anti-human CD3 APC (Invitrogen/Caltag Laboratories,
Carlsbad, CA) for Pool no. 1, or 60 μl anti-human CD45
FITC, 15 μl anti-human CD4 PE, and 15 μl anti-human CD64
APC (Invitrogen/Caltag Laboratories, Carlsbad, CA) for Pool
no. 2. Three compensation control tubes were labeled with
single antibodies as in Pool no. 1. A fourth, unstained control
tube had 100 μl of staining buffer (DPBS + 1.0% BSA) added
to it. After labeling, cells were incubated for 30 min at 4°C in
the dark, washed twice with 2 ml of staining buffer, and finally
resuspended in 2 ml of staining buffer. A series of collection
tubes was made by aliquoting equal numbers (1 × 10⁶) of
Pool no. 2 cells at a final volume of 1 ml.
Cell Sorting
A BD FACS Aria was used to sort Pool no. 1 or Pool no.
2 cells into the tubes containing 1 million Pool no. 2 cells.
Sorted population gating was based on light scatter for
lymphocytes and monocytes [P1 gate], singlets [P2 gate],
CD45(+) FITC-A vs. SSC-A [P3 gate], and CD4(+) PE-A vs.
CD3(+) APC-A [P4 gate], after compensation as shown in
Figure 1. A series of increasing numbers (10, 32, 100, 316,
1,000, 3,162, 10,000, 31,623, or 100,000 cells) of Pool no. 1
cells was sorted into the collection tubes containing 1 × 10⁶
Pool no. 2 cells. The resulting composite distributions are
shown in the lower part of Figure 1. Three collection tubes into
which Pool no. 2 cells (10, 100, or 1,000 cells) were sorted
instead were used as controls.
Flow Cytometry Data Acquisition
Data were acquired on a BD FACS Canto immediately
after sorting for each experiment. PMT voltages were deter-
mined using the compensation control tubes. Once voltages
were established, during the first experiment, Sphero Rainbow
Calibration Particles (eight peaks) (Spherotech, Lake Forest,
IL) were used to standardize fluorescence measurements
across experiments. To do this, 10⁵ beads were collected, a
tight gate was set around the three brightest peaks for the
FITC, PE, and APC channels, and the median intensity for
each peak was recorded. PMT voltages were then adjusted if
necessary to produce the same median intensities for each
parameter compared with the first experiment. For cellular
events, the threshold was set at FSC 30,000. One hundred
thousand ungated events for each sample were collected. For
the Pool no. 2 cells sorted into Pool no. 2, five tubes were col-
lected of each sample to be used as controls. In between each
collection of a tube, buffer only was run until no events were
detectable. After all tubes were collected, the Spherotech
8-peak beads were run a second time to see if there were any
changes in the median intensities of the three bright peak gates
for each channel. When performing the experiments that
followed, the PMT voltages for the samples were determined
by using the predrawn gates used to monitor possible shifts in
the median intensity in the first experiment.
Flow Cytometry Data Analysis
FCS3.0 files were analyzed using FlowJo Software version
6.4.7 (TreeStar Software, Ashland, OR). Population gating,
with gates corresponding to those described above for the
sort, was used in order to establish reference numbers of Pool
no. 1 cells in each tube. These numbers were used to deter-
mine the accuracy of event counts obtained with the CF
method described below. Cytometric Fingerprints were gener-
ated and analyzed using custom-written software in C, Perl,
MatLab version 6.5.0 (The Mathworks, Natick, MA) and R
version 2.5.1.
Cytometric Fingerprints
CF aims to capture the set of multivariate probability dis-
tribution functions corresponding to a set of list-mode files
and then ‘‘flatten’’ them into a computationally efficient fin-
gerprint representation that facilitates quantitative compari-
sons of samples. The CF algorithm starts by specifying the
number of hierarchical levels L for the representation. The
total number of bins $n_B$ into which the space is to be divided
is related to the number of hierarchies by the expression
$n_B = 2^L - 1$. (Note that the undivided space is considered to be the
first level in the hierarchy. Thus, the total number of bins is an
odd number.) The next step is to determine boundaries that
recursively subdivide the multidimensional space into regions
of equal probability. The subdivision of space is done in such
a way that at each division the parent bin is divided by a
hyperplane perpendicular to the direction of maximum variance
of the data within the bin. Thus, the variance of data
within a bin is maximally reduced at each subdivision subject
to the constraint that the data are divided into approximately
equal portions. PCA is utilized to find the direction of maximum
variance (11,13–17). The procedure, termed Oblique Recursive
Subdivision, is outlined in Figure 2. It is noteworthy
that while the bin boundaries consist of the intersection of
hyperplanes in a p-dimensional space, these boundaries need
not be explicitly stored. All of the information necessary to
bin new data is contained in the hierarchy of rotation matrices
$R_n$ and median split values described in Figure 2.
Figure 1. Gating strategy and event distribution in spiking experiment. Upper panel: Gating strategy for spiking Pool no. 1 cells via cell
sorting into aliquots of Pool no. 2 cells. Lower panel: Distributions of spiked samples at nine different spike concentrations (corresponding
to the concentrations in Table 1) are shown. The boxes show the location of the spiked events.
Once bin boundaries have been determined from a data
set (denoted $D_{\mathrm{template}}$), the procedure for binning another
data set, $D_{\mathrm{new}}$, utilizing this bin description is straightforward.
The procedure is identical to that described earlier except that
the rotation matrices and median values from $D_{\mathrm{template}}$ are
used rather than being recalculated from the new data set.
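The subdivision and re-binning steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (which was written in C, Perl, MatLab, and R): the function names are hypothetical, power iteration stands in for a full PCA, only the first principal direction is stored per bin rather than a full rotation matrix, and bins are numbered so that bin 0 is the undivided space and bin k has children 2k+1 and 2k+2.

```python
import math
import random
from statistics import median


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def first_principal_direction(points, iters=200):
    """Approximate direction of maximum variance via power iteration."""
    n, p = len(points), len(points[0])
    mean = [sum(row[j] for row in points) / n for j in range(p)]
    centered = [[row[j] - mean[j] for j in range(p)] for row in points]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-symmetric start (assumption)
    for _ in range(iters):
        w = [dot(cov_row, v) for cov_row in cov]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # degenerate bin with zero variance
            break
        v = [x / norm for x in w]
    return v


def build_template(points, levels):
    """Map bin id -> (direction, median cut) for every bin that is subdivided."""
    rules = {}
    n_internal = 2 ** (levels - 1) - 1  # bins above the finest level
    stack = [(0, points)]
    while stack:
        node, pts = stack.pop()
        if node >= n_internal or len(pts) < 2:
            continue
        direction = first_principal_direction(pts)
        proj = [dot(direction, row) for row in pts]
        cut = median(proj)
        rules[node] = (direction, cut)
        stack.append((2 * node + 1, [r for r, s in zip(pts, proj) if s <= cut]))
        stack.append((2 * node + 2, [r for r, s in zip(pts, proj) if s > cut]))
    return rules


def bin_counts(rules, points, levels):
    """Event counts for all 2**levels - 1 bins, using stored template rules."""
    counts = [0] * (2 ** levels - 1)
    for row in points:
        node = 0
        for _ in range(levels):
            counts[node] += 1
            if node not in rules:
                break
            direction, cut = rules[node]
            node = 2 * node + 1 if dot(direction, row) <= cut else 2 * node + 2
    return counts


random.seed(0)
template_data = [[random.gauss(0, 1), random.gauss(0, 2)] for _ in range(1024)]
new_data = [[random.gauss(0.5, 1), random.gauss(0, 2)] for _ in range(300)]
LEVELS = 4  # 2**4 - 1 = 15 bins in total
rules = build_template(template_data, LEVELS)
template_counts = bin_counts(rules, template_data, LEVELS)
new_counts = bin_counts(rules, new_data, LEVELS)
```

Binning the template against its own rules gives equal counts within each level, while `new_counts` reflects how the second data set is distributed over the same bins. Storing only a direction and a median per bin is enough to place new events, in line with the remark that bin boundaries need not be stored explicitly.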
The binning method utilized by oblique recursive subdi-
vision is similar to that employed by PB (7). Both recursively
subdivide space into bins containing equal numbers of events.
For multivariate distributions, both use variance as the splitting
criterion. The primary difference is that PB is constrained to
split along the coordinate axes, whereas oblique recursive
subdivision splits the data in the actual direction of maximum
variance. As stated in Ref. 7, “By choosing the parameter with
the largest variance, the algorithm is weighted towards assigning
distinct clusters of events into distinct bins (or sets of
bins).” Thus, oblique recursive subdivision may be regarded as
a generalization of PB that more accurately achieves this goal.
A simple pictorial example showing the difference between the
binning achieved by PB versus that of oblique recursive subdi-
vision is shown in Figure 3. This example shows two distribu-
tions of events in a two-dimensional space. Splitting in the
direction of maximum variance, as is done in oblique recur-
sive subdivision, places the events from these two distributions
in separate bins starting from the first level of recursion. Con-
straining the split direction to a coordinate axis, as is done in
PB, creates bins that mix events from both distributions, even
in the higher resolution bins. Of course, there is no way of
knowing whether such distributions exist in real data. Addi-
tionally, if enough data exist to create sufficiently high-resolu-
tion bins, the event distributions will eventually be separated
by either binning method. However, distributions such as this
could, in principle, exist at any resolution within data.
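The difference between the two first-level splits can be illustrated with a small synthetic example in the spirit of Figure 3. The cluster geometry below (two elongated clusters of 500 events each, offset along the diagonal) is an assumption chosen for illustration and is not the data behind the figure; the helper names are hypothetical, and the closed-form principal axis applies to two dimensions only.

```python
import math
import random
from statistics import median, pvariance

random.seed(1)
ROOT_HALF = math.sqrt(0.5)


def make_cluster(offset, n):
    """Cluster elongated along (1, 1) and offset along (1, -1)."""
    points = []
    for _ in range(n):
        u = random.gauss(0.0, 1.5)      # long axis of the cluster
        v = random.gauss(offset, 0.2)   # short axis, carries the separation
        points.append((ROOT_HALF * (u + v), ROOT_HALF * (u - v)))
    return points


def split_purity(scores, labels):
    """Median-split the scores; fraction of events in their bin's majority cluster."""
    cut = median(scores)
    majority_total = 0
    for side in (True, False):
        in_bin = [lab for s, lab in zip(scores, labels) if (s <= cut) == side]
        majority_total += max(in_bin.count("red"), in_bin.count("blue"))
    return majority_total / len(labels)


events = make_cluster(2.5, 500) + make_cluster(-2.5, 500)
labels = ["red"] * 500 + ["blue"] * 500
xs = [e[0] for e in events]
ys = [e[1] for e in events]

# Axis-aligned split (PB-like): use the coordinate with the larger variance.
axis_scores = xs if pvariance(xs) >= pvariance(ys) else ys
axis_purity = split_purity(axis_scores, labels)

# Oblique split: principal axis of the 2 x 2 covariance matrix (closed form).
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cxx = sum((x - mx) ** 2 for x in xs)
cyy = sum((y - my) ** 2 for y in ys)
cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
oblique_scores = [x * math.cos(theta) + y * math.sin(theta) for x, y in events]
oblique_purity = split_purity(oblique_scores, labels)
```

With this geometry the oblique split separates the two clusters completely at the first level, whereas the axis-aligned split leaves some events of each cluster on the wrong side of the cut.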
As discussed in Ref. 7, there are potential disadvantages
to splitting in the direction of maximum variance. Distribu-
tions that are spread out in one dimension will be split
through the middle making the method sensitive to subtle
shifts in the position of the bulk of the event population. Note
that while this may present a problem for rare event detection,
it can also be exploited as a means of quality control (18).
Additionally, care must be taken that all parameters are trans-
formed such that their variances have approximately equal
scale. Importantly, log-normally distributed values (as is
typical of fluorescence parameters) should be log-transformed
prior to binning. All of the data presented in the
“Results” section below were treated in this way.
Generation of Cytometric Fingerprints
A simple one-dimensional representation of the density
variations relative to a template may be obtained by recording
the densities for the bins in the form of a list. We call this
representation a “fingerprint” because it distinguishes individual
samples (referred to as “instances”) represented by list-mode
data. Figure 4 schematically illustrates this procedure. Fingerprints
are derived as follows:
i. For a set of M instances {$S_1, S_2, \ldots, S_M$}, aggregate the
events from all of the instances to form a composite
denoted $S$.
ii. Find the bins for $S$.
iii. Bin the data for each instance $S_i$ ($i = 1, 2, \ldots, M$), using
the bins from Step ii.
iv. Convert the event counts in each bin into an event fraction
by dividing each count by the total number of events in
the data set.
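Steps i to iv can be sketched as follows. To keep the example short, a one-dimensional equal-probability (quantile) binner is used as a stand-in for oblique recursive subdivision; that substitution, the function names, and the synthetic instances are all assumptions made for illustration.

```python
import random
from bisect import bisect_right


def find_bins(events, n_bins):
    """Stand-in template: cut points giving equal-probability 1-D bins."""
    ordered = sorted(events)
    return [ordered[(k * len(ordered)) // n_bins] for k in range(1, n_bins)]


def fingerprint(events, cuts):
    """Event fraction per bin for one instance (Steps iii and iv)."""
    counts = [0] * (len(cuts) + 1)
    for x in events:
        counts[bisect_right(cuts, x)] += 1  # number of cut points <= x
    return [c / len(events) for c in counts]


random.seed(2)
instances = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(2)]
instances.append([random.gauss(0.5, 1.0) for _ in range(1000)])  # shifted instance

composite = [x for inst in instances for x in inst]              # Step i
cuts = find_bins(composite, 8)                                   # Step ii
fingerprints = [fingerprint(inst, cuts) for inst in instances]   # Steps iii and iv
composite_fp = fingerprint(composite, cuts)
```

By construction the composite has the same event fraction in every bin, so each instance fingerprint can be read as a set of density deviations from that flat reference.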
The lists of binned event densities for the set of instances, $S_1$,
$S_2, \ldots, S_M$, form a set of fingerprints for these data relative to
the probability density estimated from the composite data set,
and may be used, for example, to detect and quantify deviations
from a norm (18).
Figure 2. Schematic outline for oblique recursive subdivision.
The method starts with a list-mode file for the template dataset D. After initialization, the algorithm recursively processes events in
parent bins at a given level of resolution, forming children bins at
the next higher level of resolution. The algorithm terminates
when the specified number of bins has been generated.
Figure 3. Comparison of rectangular and oblique binning. The figure illustrates the differences between the binning method utilized by
probability binning and oblique recursive subdivision. One thousand two-dimensional events were generated such that they fall into the
two elliptical clusters shown in red and blue. The figure on the left shows the result for probability binning. The figure on the right shows
the result for oblique recursive subdivision. The thick brown line indicates the location of the first subdivision. Probability binning splits
the events along the Parameter 2 axis creating two bins having mixed blue-cluster and red-cluster events, whereas oblique recursive
subdivision creates two bins which are each pure in red and blue events. Subsequent bin divisions are indicated by successively thinner lines
in green (second level recursion), purple (third level), light brown (fourth level), and black (fifth level). Even at the fifth level of recursion,
two bins remain “mixed” with probability binning, although at successively higher levels of resolution, few events are in the mixed bins.
For oblique recursive subdivision, because clusters are separated at the first level of recursion, bins formed at every level of recursion
remain pure.
Figure 4. Schematic representation of fingerprint formation. This diagram shows two instances binned with respect to template bins
found using oblique recursive subdivision. In this schematic example, there are a total of 63 bins. The thickest brown diagonal line
separates the data into two level 1 bins; the thick blue lines separate each of these bins into the four level 2 bins; and so on. The fingerprints to
the right of the dot plots display raw event counts (or P values obtained from Poisson statistics) for each bin, with the color intensity
encoding the amplitude of the feature. Fingerprints are organized in resolution hierarchies, with low resolution features to the left
progressing to successively higher resolution features to the right. There are $2^L$ features within a resolution hierarchy where $L$ is the level of
resolution. Thus, the rightmost 32 features indicate the fingerprint values at the highest level of resolution. The dot plot for the “disease”
instance has a well-defined cluster of cells in the lower right-hand bin which is not present in the normal instance. The arrows indicate the
effect of this cluster in the disease instance fingerprint at multiple levels of resolution.
Another variation of fingerprinting is particularly useful
for classification. The goal is to emphasize differences between
samples belonging to different classes. For classification pro-
blems, one typically has a set of training instances for which the
class identity is known and a set of test or validation instances
for which the class identity is unknown. The training data may
be aggregated to construct class-specific templates. Fingerprints
for individual instances, from both the training set and test or
validation sets can be computed utilizing bins derived from the
templates. Fingerprints from the training instances may then be
used to construct supervised classifiers with which to classify
fingerprints for test or validation instances.
RESULTS
Spiked Sample Experiment and Analysis
A series of “spiking” experiments was carried out to
demonstrate the use of cytometric fingerprints for quantitative
analysis of cell populations. The basis for these experiments
was to simulate a potential phenotypically unique population
of cells present at varying concentrations depending on some
condition or treatment. This was done by creating a series of
concentrations of minority cell populations (Pool no. 1)
spiked into a majority cell population (Pool no. 2). Fifteen
list-mode files were obtained from tubes with nominal spiked
cell concentrations increasing in half-decade increments from
0.001% to 10%. Additionally, 14 list-mode files were acquired
for control samples having only Pool no. 2 cells. The actual
concentrations of spiked cells varied from the nominal values
because of uncertainties in the sorting process and the number
of reservoir Pool no. 2 cells into which the Pool no. 1 cells
were spiked. For the population gating analysis, a set of se-
quential gates corresponding to that used to sort Pool no. 1
cells into the Pool no. 2 reservoir was used to determine the
reference number of spike cells in each sample. By contrast,
the CF method did not have access to knowledge of the loca-
tion and distribution of the spiked events. Rather, the goal was
to automatically determine the location and concentration of
spiked cells for each concentration experiment independently
by direct comparison with the Pool no. 2 reference set. To do
this, the only assumption made by CF was that some events
may be present in one distribution that were absent in the
other. Thus, the objective of the analysis in this section was to
demonstrate the ability of CF to generate hypotheses (i.e.,
automatically determine unique cell population locations and
concentrations) and to determine the concentration limits for
which such hypothesis generation is possible.
As discussed previously, cytometric fingerprints are one-
dimensional lists of event counts in bins formed with respect to
a template instance. Event counts for list-mode files binned with
respect to this template show density variations for that sample
with respect to the template. In general, to compare two sets of
samples, one could choose to aggregate all instances to form a
template; choose one of the two sets to form the template; or
form templates from each set and create two sets of fingerprints
for each instance. For the spiking experiments, we chose to form
template bins based on each spiked experimental sample and
then create fingerprints for the control samples with respect to
each of these templates. Qualitatively, what one expects to see
with this approach is that a cluster of Pool no. 1 cells in the
spiked sample will cause some number of bins to be localized in
the region of parameter space occupied by this cluster. The num-
ber of bins formed in the region of Pool no. 1 cells depends on
the concentration and degree of localization of the spiked cells.
Binning unspiked samples with respect to this template will
result in a fingerprint with low event counts in the bins asso-
ciated with the spiked cell cluster. We note that in a setting where
it is not known in advance in which group of samples a cluster is
likely to occur, samples from all groups can independently be
used to generate templates and each sample can then be binned
against each template—a method we call differential fingerprint-
ing. For the sake of simplicity we do not illustrate this here; how-
ever, the results would be equivalent. We next describe the quan-
titative estimation of the number of spiked cells.
The total number of events in a spiked sample is the sum
of the spiked Pool no. 1 events and background Pool no. 2
events.
$$N = N_1 + N_2. \tag{1}$$
This superposition of event counts holds for every subregion
of cytometric parameter space. In particular, if the spiked sample
is used as the template and bins are formed using oblique
recursive subdivision, for each bin $i$, the total number of
events $T_i$ is
$$T_i = X_i + Y_i, \tag{2}$$
where $X_i$ and $Y_i$ are, respectively, the numbers of Pool no. 1
and Pool no. 2 events in bin $i$. By design, the number of events
in a bin (at a given resolution) for the template instance is
$T_i = N/n_B$, where $n_B$ is the number of bins at the given resolution.
From Eq. (2), the number of Pool no. 1 events is
$$X_i = \frac{N}{n_B} - Y_i. \tag{3}$$
The total number of spiked events in a sample can be found
by summing over the set of bins $S$ containing spiked cells.
$$N_1 = \sum_{i \in S} X_i = \sum_{i \in S} \left( \frac{N}{n_B} - Y_i \right). \tag{4}$$
The number of Pool no. 2 cells in each bin can be estimated
from the cytometric fingerprints of the control samples
because these contain only Pool no. 2 events. Aggregating the
control samples and binning relative to the spiked sample
template, the fraction of Pool no. 2 cells in each bin is given by
$$f_i = \frac{k_i}{mN}, \tag{5}$$
where $k_i$ is the event count value for bin $i$ obtained from the
fingerprint of the aggregated control samples. The denominator
in this expression is the total number of events in the
aggregated control sample which is $m$ (the number of aggregated
Pool no. 2 tubes) times the number of events, $N$, in
each tube. Using Eqs. (1) and (5), the number of Pool no. 2
events in bin $i$ is
$$Y_i = f_i N_2 = f_i (N - N_1) = \frac{k_i}{mN} (N - N_1). \tag{6}$$
Combining Eqs. (4) and (6), the fraction of spiked events in
an experimental sample is
$$\frac{N_1}{N} = \frac{mN\,|S|/n_B - \sum_{i \in S} k_i}{mN - \sum_{i \in S} k_i}, \tag{7}$$
where $|S|$ is the number of bins containing spiked cells.
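As a check on the algebra, Eq. (7) can be written as a short function and tested against numbers constructed from Eqs. (4) and (6). The function name, argument names, and numerical values below are hypothetical and chosen only for illustration; they are not taken from Table 1.

```python
def spiked_fraction(control_counts, spiked_bins, m, n_events, n_bins):
    """Eq. (7): N1/N from aggregated control counts k_i over the bin set S."""
    k_sum = sum(control_counts[i] for i in spiked_bins)
    total_control = m * n_events  # mN, events in the aggregated control
    return (total_control * len(spiked_bins) / n_bins - k_sum) / (total_control - k_sum)


N, N_BINS, M = 100_000, 8192, 14

# Limiting case: no control events in 8 selected bins gives |S| / n_B.
zero_case = spiked_fraction([0] * 8, range(8), M, N, N_BINS)

# Round trip: choose N1, derive Y_i from Eq. (4) and k_i from Eq. (6),
# then confirm that Eq. (7) recovers N1 / N.
true_n1, n_selected = 100, 10
y_each = (n_selected * N / N_BINS - true_n1) / n_selected  # Pool no. 2 events per bin
k_each = M * N * y_each / (N - true_n1)                    # Eq. (6) solved for k_i
round_trip = spiked_fraction([k_each] * n_selected, range(n_selected), M, N, N_BINS)
```

Here `zero_case` equals 8/8192 and `round_trip` recovers the chosen fraction of 0.001 to within floating-point error.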
The remaining task is to determine the set of bins S encom-
passing the region of localization of spiked cells. Because the
template bins were created using spiked samples, if there is a
sufficient concentration of Pool no. 1 cells, one expects the
Pool no. 2 control sample fingerprints to have abnormally low
event counts in these bins. For example, in the experiment
with 0.1% spike concentration, there are 100 Pool no. 1 cells
out of the 100,000 events acquired from the tube. At a resolu-
tion of 8,192 bins, there will be on average 12.2 events per bin.
Thus, if the Pool no. 1 cells are well localized, one would
expect on the order of eight template bins to correspond to
this region in parameter space and that the control sample fin-
gerprints should have low event counts in these bins. These
low event count bins show up most clearly if the control sam-
ple fingerprint values are sorted in order of ascending bin
event count. For spiked samples with a sufficient spike popu-
lation, the sorted control sample fingerprint will have several
zero values followed by a rapid rise corresponding to bins
associated with the background of Pool no. 2 events. The posi-
tion of this change in slope is most easily determined using
the derivative of the sorted fingerprint values. The first maxi-
mum in the derivative of the sorted fingerprint signal indicates
the point where bins dominated by spiked events transition to
bins dominated by background events.
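The sorted-fingerprint heuristic can be sketched as follows. This is a simplified reading of the text: the derivative is taken as the first difference of the sorted counts with no smoothing, and the `min_jump` parameter (a hypothetical addition) ignores small steps among the low-count bins. The counts are invented for illustration.

```python
def select_spiked_bins(control_counts, min_jump=1):
    """Bins below the first local maximum of the sorted-count first difference."""
    order = sorted(range(len(control_counts)), key=lambda i: control_counts[i])
    s = [control_counts[i] for i in order]
    d = [b - a for a, b in zip(s, s[1:])]  # discrete derivative of sorted counts
    for j, step in enumerate(d):
        left = d[j - 1] if j > 0 else 0
        right = d[j + 1] if j + 1 < len(d) else 0
        if step >= min_jump and step > left and step >= right:
            return order[: j + 1]  # bins ranked at or below the jump
    return []


# Hypothetical aggregated control fingerprint: eight empty bins among
# background bins holding roughly 160 to 180 events each.
control = [170, 0, 165, 182, 0, 0, 158, 175, 0, 169, 0, 0, 0, 180, 0, 172]
selected = select_spiked_bins(control)
```

For this input the eight empty bins are selected. On real fingerprints the difference signal is noisy, so some smoothing or a larger `min_jump` would likely be needed, and a sample with no spiked population would still return a small, spurious set of bins.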
The selection of bins by the method outlined earlier is
somewhat conservative in that we pick the set of bins clearly
dominated by Pool no. 1 events, excluding some bins that contain
a mixture of a small number of spiked events and a larger
number of background events. This results in an underestima-
tion of the spiked event concentration. However, this under-
estimation can be corrected by taking advantage of the multi-
resolution nature of CF. At higher resolutions a larger number
of low event-count bins pack more tightly around the cluster
of spiked events and thus are able to more clearly delineate
between spiked and background events. Lower resolution bins
contain more events but are broader and may overlap the
spiked and background regions. Our bin selection strategy is
to use the highest resolution bins to determine the cutoff
threshold for the number of Pool no. 2 events allowed in a bin
and then apply this threshold to the lower resolutions. This is
analogous to using spheres of successively smaller radii to esti-
mate the volume of a container. Utilizing the threshold deter-
mined from the highest resolution corresponds to considering
only spheres (at any resolution) that are contained within the
volume to the same degree as the smallest spheres. Each
resolution of binning provides an estimate of the spiked cell
concentration.
The form of a function useful for estimating the behavior
of concentration as a function of resolution can be obtained
by geometrical arguments. As the binning resolution increases,
the fraction of bins overlapping spiked and background
regions decreases. Consider the fraction of bins that lie on the
boundary of the spiked region relative to the number of bins
interior to the region. Letting h represent the characteristic
linear size (scale) of the bins, the number of bins at the
boundary is proportional to S/h(d21) while the number of bins
interior to the spike region is V/hd, where S and V represent
the surface area and volume of the region and d is the dimen-
sionality of the space. The fraction of bins on the boundary of
the spike region is given by
f ¼ S=hd�1
V=hd/ h
which is proportional to the scale of the bins. The scale of the
bins decreases by a factor of 2 for each increase in resolution
level. Thus, the fraction of bins on the boundary decreases
geometrically with resolution hierarchy index and can be rep-
resented as, f � e2cL. Therefore, the estimated concentration
level asymptotically approaches the true value and can be rep-
resented as
CðLÞ / ð1� af Þ ¼ C0ð1� ae�kLÞ ð8Þ
C0, a, and k are parameters that can be fit from the series of
multiresolution estimates. The parameters a and k are propor-
tionality factors that depend on the geometry of the spike dis-
tribution and dimensionality of the space while C0 may be
interpreted as the asymptotic concentration as the resolution
approaches infinity.
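
As an illustration of how the three parameters of Eq. 8 could be recovered, the sketch below solves for them exactly from three estimates taken at consecutive resolution levels. This is a minimal sketch under stated assumptions, not the fitting procedure used for Table 1 (which draws on four resolutions and is not specified in detail here). The function name `fit_saturating_curve` and the demonstration values are hypothetical.

```python
import math

def fit_saturating_curve(levels, concentrations):
    """Solve C(L) = C0 * (1 - a * exp(-k * L)) from three estimates
    taken at consecutive resolution levels (L, L + 1, L + 2)."""
    l1, l2, l3 = levels
    c1, c2, c3 = concentrations
    if not (l2 - l1 == 1 and l3 - l2 == 1):
        raise ValueError("levels must be three consecutive integers")
    d1 = c2 - c1  # change between the first two resolutions
    d2 = c3 - c2  # change between the last two resolutions
    if d1 == 0 or not (0 < d2 / d1 < 1):
        raise ValueError("estimates do not approach a limit geometrically")
    k = -math.log(d2 / d1)           # successive differences shrink by exp(-k)
    c0 = c3 - d2 * d2 / (d2 - d1)    # Aitken delta-squared estimate of the limit
    a = (1.0 - c3 / c0) * math.exp(k * l3)
    return c0, a, k

# Demonstration on noise-free synthetic values (hypothetical parameters).
demo_levels = (11, 12, 13)
demo_values = [2.0 * (1 - 50.0 * math.exp(-0.5 * L)) for L in demo_levels]
demo_c0, demo_a, demo_k = fit_saturating_curve(demo_levels, demo_values)
```

With noisy estimates the successive differences need not shrink monotonically, in which case the function raises an error; a least-squares fit over all available resolutions would be the more robust choice.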
High-resolution (8,192 bins) fingerprints for the aggre-
gate of the control Pool no. 2 cells binned relative to each spike
concentration sample are shown in Figure 5. Low event count
bins are clearly visible for the high spike concentration experi-
ments. As the spike concentration decreases, the number of
bins with low event density falls. The limit of detection is at
the 0.01% level corresponding to 10 cells out of 100,000. This
is an expected result because, for the highest resolution, the
best one could do is to form a bin around a cluster of 12
spiked events. Table 1 summarizes the results of the spiking
experiments for levels of resolution corresponding to 1,024,
2,048, 4,096, and 8,192 bins. It also shows the asymptotic
concentration using the method described earlier. Spike
concentrations less than 0.03% were detectable at only the highest
level of resolution and so could not be extrapolated. The table
shows good agreement between the event counts from popula-
tion gating and those obtained by CF.
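
As a rough check on the stated limit of detection, and assuming that events are spread approximately evenly over the bins at the highest resolution, the average bin occupancy is

$$\frac{100{,}000 \text{ events}}{8{,}192 \text{ bins}} \approx 12.2 \text{ events per bin},$$

so a spike of about 10 events is close to the smallest cluster that could fill a bin of its own.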
Analysis of Synthetic Data: Detection of Rare Events
CF can be used for a variety of objectives in the analysis
of flow cytometric data. Here we show that CF is able to detect
and identify the presence of a small number of events against a
large overlapping background distribution, and that it can do
so in the presence of significant ‘‘biological’’ variability. This is
similar to the application of PB termed Frequency Difference
Gating (12).
Using the R Statistical Computing Environment, we cre-
ated a series of two synthetic populations consisting of 100
instances each of simulated flow cytometric data. The two
populations are termed ‘‘normal’’ and ‘‘disease.’’ Each instance
in each population consists of 100,000 events described in a
four-dimensional space. Each event is generated from one of
five equally represented 4D hyperellipsoidal distribution func-
tions whose centroids (Ci) and standard deviations (Si) were
assigned using a random number generator within predefined
limits corresponding to a typical log-linear parameter range of
0–1,024 log fluorescence intensity. For the disease instances, a
sixth cluster was added with n events. (An equal number, n, of
events was removed at random from the other five clusters
such that the total number of events in each data set remained
unaltered.) A series of disease populations was generated,
varying n from a minimum of 10 events (corresponding to
0.01% disease events) to a maximum of 1,000 events (corre-
sponding to 1% disease events). The actual centroids for each
hyperellipsoid (including the disease hyperellipsoid) for each
instance were randomly varied with a standard deviation of
0.08 log10 units. The standard deviations were randomly varied
for each instance with a standard deviation of 0.04 log10 units.
Thus, no two instances had the same distributions, and
there were marked visual differences in the distribution as
judged from pairwise comparisons of the dotplots (data not
shown). The locations of the hyperellipsoidal functions were
such that the disease population was significantly intermingled
with the normal population along all parameter axes. Thus,
with no prior knowledge of the location of the disease-indicat-
ing events, at the lower disease event concentrations (0.1%
and lower) it is not possible to visually detect their presence.
Consequently, conventional sequential gating analysis in such
a situation cannot be applied.
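
The data generation described above can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study. It assumes axis-aligned Gaussian clusters as a stand-in for the hyperellipsoidal distribution functions, and every name (`make_instance`, `jitter`) and numeric value in the demonstration is hypothetical. Drawing the background from `n_events - n_disease` events is used here in place of generating all events and then removing n at random.

```python
import random

def make_instance(centroids, sds, n_events=100_000, n_disease=0,
                  disease_centroid=None, disease_sd=None, rng=None):
    """Return a list of (label, coordinates) pairs for one simulated instance.

    Labels 0 .. len(centroids) - 1 are background clusters; the label
    len(centroids) marks events from the extra "disease" cluster.
    """
    rng = rng or random.Random()
    n_clusters = len(centroids)
    events = []
    # Background: each event picks one of the clusters with equal probability.
    for _ in range(n_events - n_disease):
        c = rng.randrange(n_clusters)
        coords = tuple(rng.gauss(m, s) for m, s in zip(centroids[c], sds[c]))
        events.append((c, coords))
    # Disease events take the place of background events, so the total is unchanged.
    for _ in range(n_disease):
        coords = tuple(rng.gauss(m, s)
                       for m, s in zip(disease_centroid, disease_sd))
        events.append((n_clusters, coords))
    rng.shuffle(events)
    return events

def jitter(rows, sd, rng):
    """Perturb every value with Gaussian noise to mimic instance-to-instance variability."""
    return [tuple(rng.gauss(v, sd) for v in row) for row in rows]

# Small demonstration with assumed (hypothetical) centroids and widths.
demo_rng = random.Random(0)
base_centroids = [tuple(demo_rng.uniform(200, 800) for _ in range(4)) for _ in range(5)]
base_sds = [tuple(demo_rng.uniform(30, 80) for _ in range(4)) for _ in range(5)]
demo_instance = make_instance(jitter(base_centroids, 20.0, demo_rng), base_sds,
                              n_events=10_000, n_disease=10,
                              disease_centroid=(500.0,) * 4,
                              disease_sd=(25.0,) * 4, rng=demo_rng)
```

The conversion between log10 units and the 0–1,024 scale is not given in the text, so the jitter width above is expressed directly in scale units as a placeholder.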
For a given population of 100 instances each of disease
and normal, we further divided the instances randomly into
equal training and testing sets. The events in the normal train-
ing instances were aggregated, and CF template bins were
determined from this aggregate to a resolution of 8,192 leaf
bins (16,383 bins in total). Cytometric Fingerprints were then
individually computed for all 200 instances in the population
with respect to this template. Fingerprints were computed
from the aggregated disease training instances and a procedure
similar to that described in the previous section was used to
automatically select significant features.
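
The two bin counts quoted above are consistent with a binary hierarchy in which every bin is split in two at each level. Assuming that levels are indexed $l = 0, \dots, L$ with $2^{l}$ bins at level $l$ (an indexing inferred from Table 1, where $L = 13$ corresponds to 8,192 bins), the total number of bins over all resolutions is

$$\sum_{l=0}^{L} 2^{l} = 2^{L+1} - 1, \qquad \text{so for } L = 13: \quad 2^{14} - 1 = 16{,}383.$$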
Figure 6 shows a set of dotplots for one of the disease
training instances (0.1% disease events) selected at random.
Disease events are shown in red. The circles show the known
centroid of the disease cluster. The events that fall within the
informative fingerprint bins group closely about the location
where the disease events were located in this instance, demon-
strating that CF indeed identifies the multiparameter locus of
significance in this data set. Moreover, for the 0.1% disease
case, on average about 50% of the events that fall within the
informative leaf bins are actual disease events, which can be
determined with certainty in synthetic data. This represents an
enrichment factor of about 500×, showing that CF can be an
effective means for sorting rare events.
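
The enrichment factor follows from comparing the fraction of disease events inside the informative bins with their fraction in the whole sample:

$$\text{enrichment} \approx \frac{0.50}{0.001} = 500.$$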
Figure 5. Cytometric fingerprints from spiking experiment. Fingerprints corresponding to two different spike concentrations are shown for illustrative purposes. The spiked samples were used as the template instances to determine bins, and the fingerprints represent the aggregate of the unspiked pure Pool no. 2 population (as described in the text). (a) Fingerprints for a nominal spike concentration of 1%. The upper graph shows the 8,192 high-resolution bins of the fingerprint. The bins shown in black represent the location corresponding to the spiked events. The lower graph shows the same data, zoomed to show better that there are many bins corresponding to the location of spiked events in which the reference instances lack events. (b) As in (a) but for 0.1% spiking concentration. Note that the number of black bins is much smaller than that at the higher 1% spike concentration.

Cytometry Part A 73A: 430–441, 2008

The detection task is equivalent to determining which
instances are normal and which are disease by means of their
Cytometric Fingerprints alone. For this purpose, we utilized
the selected features for the set of simulated instances at 0.1%
disease event concentration to perform supervised classifica-
tion. We chose linear discriminant analysis (LDA) (function
lda in package MASS of the R statistical computing environ-
ment) because it is perhaps the simplest method of classifica-
tion and its use here is intended to illustrate that the
cytometric fingerprints are in fact informative with respect to
phenotype.
Utilizing the training set only, an LDA classifier was
trained. This model was then used to predict the remaining
100 instances in the test set in a blinded fashion. The class
membership of these instances was then unblinded to evaluate
the classification accuracy. This procedure was repeated for
100 random draws (a random draw is a random selection of
50 of the 100 instances of each class to act as the training set,
the remaining 50 being the test set), each time building a fresh
LDA classifier based upon only the training instances in the
draw and evaluating its accuracy using only the test instances
for that draw. The average classification accuracy at 100/100,000
disease event concentration was 97.74%/98.5% (sensitivity/specificity).
This accuracy degraded to 94.02%/93.98% and 67.30%/67.68% at disease
event concentrations 50/100,000 and 20/100,000, respectively.
(At concentrations above 100/100,000 the classifier was perfectly
accurate.) At 10/100,000 disease event concentration the LDA classifier
performed only slightly better than the 50% expected under the null
hypothesis, demonstrating that an effective lower limit of detection had
been reached for this hypothetical “disease.” That this performance is
achieved with a very simple classifier suggests that CF is an efficient
method of representing differences among a set of multivariate
probability density distributions.

Table 1. Estimate of spiked cell concentration by CF

% SPIKED CELL CONCENTRATION (a) ESTIMATED % SPIKED CELL CONCENTRATION (b)
NOMINAL         GATING          1,024 BINS   2,048 BINS   4,096 BINS   8,192 BINS   ASYMPTOTIC
                                (L = 10)     (L = 11)     (L = 12)     (L = 13)     (L → ∞)
0.001           0.002           –            –            –            –            –
0.003           0.003           –            –            –            –            –
0.010           0.019           –            –            –            0.012        0.012
0.032           0.041           –            –            0.024        0.036        0.036
0.100           0.170           0.097        0.145        0.145        0.157        0.157
0.316           0.530           0.390        0.488        0.488        0.512        0.513
1.000           1.950           1.757        1.806        1.830        1.866        1.981
3.162           4.720           4.491        4.539        4.588        4.610        4.697
10.00           13.80           13.379       13.525       13.622       13.695       13.854

(a) Column 1 = nominal spike concentration; Column 2 = actual spiked cell concentration determined by population gating with sequential gates corresponding to those used for sorting.
(b) Columns 3 through 6 give the results of cytometric fingerprint analysis at four binning resolutions; Column 7 shows the concentration estimated by using the four fingerprint resolutions to extrapolate to infinite resolution.

Figure 6. Synthetic data. (a) Example of one of the “normal” instances. (b, c) Example of one of the “disease” instances, with 50 disease cluster events spiked into 100,000 events. In (b) the disease events are present but not marked, and in (c) they are marked in red. (d) Same instance as in (b, c), but all six projections are shown to indicate that the 2–3 projection is the one where the disease cluster is best separated from the background. (e, f) Another example of a disease instance. The regions encircled in (b, c) and (e, f) indicate the general location of the disease clusters. Note that the disease cluster is intermingled with the background. Comparing (a), (b), and (e) it is possible to discern the fact that no two instances are exactly the same, as described in the text.
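
The repeated random-draw evaluation described in the text can be sketched as below. This is a minimal Python sketch rather than the R workflow used in the study: the classifier is passed in as a function, and the nearest-class-mean rule included here is only a stand-in for LDA so that the example stays self-contained. All function names are hypothetical.

```python
import random

def sensitivity_specificity(truth, predicted, positive="disease"):
    """Fraction of positives and of negatives that were classified correctly."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(truth, predicted) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(truth, predicted) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(truth, predicted) if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

def nearest_mean_trainer(train_x, train_y):
    """Stand-in classifier (not LDA): assign to the class with the closest mean vector."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    means = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict(x):
        return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, means[y])))
    return predict

def repeated_holdout(features, labels, train_fn, n_draws=100, rng=None):
    """Average sensitivity and specificity over random half/half splits per class."""
    rng = rng or random.Random()
    sens, spec = [], []
    for _ in range(n_draws):
        train_idx = set()
        for c in sorted(set(labels)):
            members = [i for i, y in enumerate(labels) if y == c]
            train_idx.update(rng.sample(members, len(members) // 2))
        test_idx = [i for i in range(len(labels)) if i not in train_idx]
        ordered_train = sorted(train_idx)
        # A fresh classifier is built from the training half of every draw.
        predict = train_fn([features[i] for i in ordered_train],
                           [labels[i] for i in ordered_train])
        truth = [labels[i] for i in test_idx]
        guess = [predict(features[i]) for i in test_idx]
        s, p = sensitivity_specificity(truth, guess)
        sens.append(s)
        spec.append(p)
    return sum(sens) / n_draws, sum(spec) / n_draws
```

Here `features` would hold the selected fingerprint features for each instance, and `labels` the corresponding “normal” or “disease” class.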
How well does the simplest possible detection method
work on these data? To answer this question we divided the
50/100,000 concentration training data into uniform bins at a
resolution of 10 divisions on each of the four dimensions. We
then computed Mann–Whitney P values for each bin location
and found that three bins had P values less than $10^{-4}$.
Counting events in these bins and averaging for the disease and
normal groups separately, we detected on average only 12.97
(25.9%) of the 50 spiked events. CF detected on average 27.04
(54.1%) disease events in two high-resolution bins, indicating
that CF creates a more efficient representation than uniform
binning, in that fewer bins capture more information. PB by
comparison detected on average 25.1 (50.2%) disease events
in one lower resolution bin. Thus, for this comparison, CF
and PB perform comparably, and both perform significantly
better than uniform binning.
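
The uniform-binning baseline can be sketched in a few lines. The Python below is an illustrative sketch, not the analysis code used in the study: it assumes a 0–1,024 range on every axis, and the Mann–Whitney P value uses the normal approximation without tie or continuity corrections. Function names are hypothetical.

```python
import math

def uniform_bin_index(event, n_div=10, lo=0.0, hi=1024.0):
    """Map an event (sequence of coordinates) to a single bin number."""
    index = 0
    for x in event:
        k = int((x - lo) / (hi - lo) * n_div)
        k = min(max(k, 0), n_div - 1)   # clamp values on or outside the range
        index = index * n_div + k
    return index

def bin_counts(events, n_dim=4, n_div=10):
    """Event count per uniform bin for one instance (n_div ** n_dim bins)."""
    counts = [0] * (n_div ** n_dim)
    for e in events:
        counts[uniform_bin_index(e, n_div)] += 1
    return counts

def mann_whitney_p(x, y):
    """U statistic for x and a two-sided P value (normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0    # tied values share the mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return u, math.erfc(abs(z) / math.sqrt(2.0))
```

For each bin, `mann_whitney_p` would be applied to the per-instance counts of the disease training instances versus those of the normal training instances, and bins with small P values retained.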
Quality Control Based on CF
CF has been used to quantitatively assess the consistency
of gating across data derived from a panel of tubes. Here we
illustrate CF’s ability to complement rather than replace con-
ventional methods of sequential gating data analysis.
Often, clinical flow cytometric assays involve the acquisi-
tion of data from several tubes in order to accommodate
many markers or stimulation conditions. Analysis of multi-
tube data frequently involves the establishment of gates (based
on a subset of parameters common to all tubes) which can be
uniformly applied across the entire panel, thus providing a
basis for consistent population identification and minimizing
time and effort in the data analysis process. The underlying
assumption is that events are distributed the same way for
each tube across these gating parameters. Violations of this
assumption due to unanticipated and undetected shifts or
other dissimilarities of the gating data may result in inaccurate
phenotyping.
CF-based quality control assessment was applied in the
context of a 96-well plate format flow cytometric experiment.
In this study (19), fresh PBMCs were added to an array of an-
tigenic stimuli and a set of controls in a 96-well plate format.
Cells were allowed to respond to the stimuli for 6 h. Rows A–D
of the plate contained IFN-γ FITC, CD69 PE, CD4 perCP-Cy5.5,
and CD3 APC, whereas rows E–H contained TNF-α FITC, IL-2 PE,
CD4 perCP-Cy5.5, and CD3 APC. Responses
and the cell types of the responders were then measured by
gating on the common parameters including Side scatter, CD4
perCP-Cy5.5, and CD3 APC.
To compare the gating distributions, cytometric finger-
prints were computed on the three dimensions common to all
96 wells at a resolution of 32 bins by binning on the aggre-
gated gating data from all 96 wells, projecting the distribution
from each well onto the resulting template bins, and expres-
sing the PDF for each well as a vector of the base-2 logarithm
of the fractional deviation from the norm. Figure 7 shows
examples of these fingerprints for each well, along with the
maximum deviation for each well. Log2 (maximum deviation)
values less than 2.0 are color-coded green, between 2.0 and 3.0
in yellow, and greater than 3.0 in red. Bins coded yellow and
red clearly indicate disruptions of the characteristic CD3/CD4
distributions as shown in Figure 7c. This type of CF-based
analysis provides a quick visual triage of possible data acquisi-
tion artifacts that can help the investigator recognize and
correct defects.
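
One possible reading of this quality-control score is sketched below. The exact definition of “fractional deviation from the norm” is not spelled out in the text, so the sketch assumes it is the ratio of a well's per-bin event fraction to the fraction expected from the aggregate, reported as an absolute base-2 logarithm. The pseudocount and both function names are additional assumptions introduced only to keep the example well defined for empty bins.

```python
import math

def log2_deviation(well_counts, norm_fractions, pseudocount=0.5):
    """Per-bin |log2(observed fraction / expected fraction)| for one well.

    Assumed definition: the source does not give the exact formula.
    """
    total = sum(well_counts)
    n_bins = len(well_counts)
    deviations = []
    for count, expected in zip(well_counts, norm_fractions):
        observed = (count + pseudocount) / (total + pseudocount * n_bins)
        deviations.append(abs(math.log2(observed / expected)))
    return deviations

def triage_color(max_deviation):
    """Color code from the text: below 2.0 green, 2.0 to 3.0 yellow, above 3.0 red."""
    if max_deviation < 2.0:
        return "green"
    if max_deviation <= 3.0:
        return "yellow"
    return "red"
```

With 32 template bins built from the aggregate so that each holds about the same share of events, `norm_fractions` would be close to 1/32 for every bin, and `triage_color(max(log2_deviation(...)))` would give the per-well flag.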
DISCUSSION
In this report we describe CF, a new method for unbiased
quantitative analysis of list-mode cytometric data. This tech-
nique relies on a method of subdividing a multidimensional
space which extends that of PB. CF includes a novel algorithm
for space subdivision that eliminates the constraint in PB of
subdivision along the coordinate axes by dividing multidi-
mensional space into nonrectangular polytopes. Additionally,
it renders a multidimensional distribution in the form of a
one-dimensional multiresolution hierarchical fingerprint that
creates a computationally efficient representation of high-
dimensionality distribution functions.
Each event in list-mode data is described by a vector of
coordinates in a multidimensional space. Thus, a complete
mathematical description of a sample is the multivariate prob-
ability distribution function defining the density of events in
this space. This distribution may be approximated by dividing
the space into small volume elements, counting the number of
events in each volume element, and normalizing the count by
the total number of events in the sample. In the limit of an in-
finite number of events, the regions may be made infinitesi-
mally small yielding the true probability distribution function.
Of course, it is impossible to collect an infinite number of
events, and so the question of interest is how does one accu-
rately estimate the true probability distribution from a finite
sample of events? Equally importantly, how does one represent
this approximation of the multivariate probability density
function in a form amenable to comparing disparate samples?
The most common nonparametric means of estimating a
PDF is a histogram where space is divided into equal width
bins (20). For a complex (rapidly changing) PDF, one would
like to choose small bins in order to accurately track the varia-
tion with respect to independent variables (low bias). On the
other hand, one would like to choose bins of sufficient size to
contain a large number of events in order to estimate the value
of the density within a bin with high accuracy (low variance).
This trade-off between number of bins and bin size is the
classic bias-variance dilemma (21). For one independent vari-
able and reasonably sized datasets, it is not difficult to balance
the bias-variance requirements. For multidimensional data, however,
the balance is harder to strike; one approach is to control the
variance by choosing bins that contain equal numbers of events. In the case of
univariate data, there is a unique set of bin boundaries that
accomplishes this goal (22). For multivariate data, however,
there is not a unique solution. While this indeterminacy might
seem like a disadvantage, in fact, it creates an opportunity to
find a specific set of bin boundaries that does a superior job of
reducing bias.
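
For the univariate case, equal-count bin boundaries can be read directly off the sorted data. The sketch below is a hedged illustration: placing each boundary midway between neighbouring sorted values is a convention chosen here, and the function names are hypothetical.

```python
def equal_count_edges(values, n_bins):
    """Interior bin boundaries that put (nearly) equal numbers of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    if n_bins < 1 or n < n_bins:
        raise ValueError("need at least one value per bin")
    edges = []
    for b in range(1, n_bins):
        i = b * n // n_bins     # index of the first value belonging to bin b
        edges.append((ordered[i - 1] + ordered[i]) / 2.0)
    return edges

def bin_of(x, edges):
    """Index of the bin containing x (a value equal to an edge goes to the upper bin)."""
    return sum(1 for e in edges if x >= e)
```

In more than one dimension the same counting constraint can be met by many different partitions, which is the freedom that the choice of bin boundaries exploits.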
Other methods of representing and analyzing multidi-
mensional flow cytometry data have been developed (8,23–
25). One that is most closely related to the present work is PB
(7,10,12,22). PB represents a multidimensional probability
distribution as a set of bins defining regions of the multidi-
mensional space. The boundaries of these bins are chosen so
that approximately equal numbers of events lie in each bin.
Bins are found by selecting a coordinate dimension, determin-
ing the median in that coordinate, and dividing the data at the
median value. In PB, the axis selection is made by calculating
the variance of the data in the parent bin for each of the origi-
nal coordinate dimensions and choosing the one dimension
having the largest variance. Although the decision is made on
the basis of the variance in each dimension, the split is not
necessarily along the optimal direction because the direction
of maximum variance may not coincide with one of the coor-
dinate axes. The method of binning utilized in CF differs from
that of PB in that bins are formed by splitting the data in the
direction of maximum variance rather than along the original
coordinate axes. A primary objective of CF is the development
of a general and efficient representation of the multivariate
probability density function embodied in list-mode data that
can be utilized with machine learning methods to analyze
cytometric data. Key features of this representation are that it
retains information at multiple resolutions and is in the form
of a one-dimensional feature vector.
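
The difference between the two splitting rules can be illustrated with a single split. The Python sketch below is not the published implementation: it assumes that the CF split is a median cut of the events projected onto the direction of maximum variance, and it finds that direction by power iteration on the covariance matrix. All function names are hypothetical.

```python
import math
import statistics

def pb_split(events):
    """PB-style split: median cut along the coordinate axis with the largest variance."""
    n_dim = len(events[0])
    variances = [statistics.pvariance([e[d] for e in events]) for d in range(n_dim)]
    axis = variances.index(max(variances))
    cut = statistics.median([e[axis] for e in events])
    lower = [e for e in events if e[axis] < cut]
    upper = [e for e in events if e[axis] >= cut]
    return lower, upper

def principal_direction(events, n_iter=200):
    """Mean vector and direction of maximum variance, found by power iteration."""
    n, n_dim = len(events), len(events[0])
    mean = [sum(e[d] for e in events) / n for d in range(n_dim)]
    centered = [[e[d] - mean[d] for d in range(n_dim)] for e in events]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(n_dim)]
           for i in range(n_dim)]
    # Arbitrary start vector; it must not be orthogonal to the sought direction.
    v = [1.0 / (d + 1) for d in range(n_dim)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(n_dim)) for i in range(n_dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

def cf_split(events):
    """CF-style split (as assumed here): median cut along the principal direction."""
    mean, v = principal_direction(events)
    scores = [sum((e[d] - mean[d]) * v[d] for d in range(len(v))) for e in events]
    cut = statistics.median(scores)
    lower = [e for e, s in zip(events, scores) if s < cut]
    upper = [e for e, s in zip(events, scores) if s >= cut]
    return lower, upper
```

Applying either function recursively to the two halves yields a binary hierarchy of bins with approximately equal event counts at each level; only the orientation of the cutting planes differs.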
As evidenced from the controlled spiking experiments,
CF compares well with proven methods of data analysis (e.g.,
sequential gating) in cases where the target cell population is
discrete and easily recognized. In a more challenging setting
illustrated by the synthetic data example, CF is able to detect
and identify events in complex, overlapping distributions that
are not visible to the human eye in any 2D dotplot and are
thus not amenable to methods of analysis that are graphically/
visually guided. Furthermore, even at a level below 0.05%
anomalous events in an ungated population, in which the
anomalous events are not even detectable via conventional
methods, cytometric fingerprints were able to efficiently repre-
sent the data such that a simple classifier (in this case LDA)
was able to accurately and predictively classify instances into
distinct ‘‘phenotypes’’ based upon the raw list-mode data
alone. Importantly, CF does not depend upon arbitrary
choices of gating or analytical thresholds, which collectively
represent a priori hypotheses. Rather, CF provides a means to
generate new hypotheses directly from the data, as well as to
test existing hypotheses by direct comparison with traditional,
proven methods of data analysis.
It is important to note that as the number of parameters
involved in flow cytometric studies increases, so will the
complexity of the resulting distributions of events. Thus, there will
be a concomitant demand for increasing the number of events
collected. This is a fundamental requirement for multiparameter
flow cytometry in general, and not just for the application
of fingerprinting or other advanced methods of quantitative
analysis. It is a result of the geometric dilution of the
significance of any individual dimension into a “sea” of increasing
dimensionality. CF seeks to maximize the utility of data
characterized by a given number of events at a given dimensionality
by optimizing the trade-off between bias and variance.
Thus, although it might be argued that CF allows for
the collection of the minimum number of events for a given
dimensionality, we emphasize the need to collect an adequate
sampling of events given the dimensionality, and then suggest
the use of CF to maximize the utility of those data.

Figure 7. Quality control assessment for gating data. (a) Fingerprints for 96-well plate data; (b) upper-left corner of the plate enlarged; (c) individual dotplots for CD3 vs. CD4 for three selected wells, superimposed on contours for the aggregated event density distribution (blue contours). White arrows in (c) for the dotplots corresponding to wells B-1 and C-12 indicate locations where event density distributions deviate from the plate norm.

CF demands adherence to strict, standardized operating
procedures in the collection of flow cytometric data, and it
can help in the development of such procedures. CF, like other
quantitative methods of analysis of flow cytometric data, is
sensitive to variations that exist in the data whether due to a
true biological effect or artifact. In fact, as shown earlier and
in Ref. 18, CF can be used to detect and correct artificial
sources of variability. Improved procedures for collecting data
will enable CF to be more sensitive to true biologic variations,
thus deriving phenotype hypotheses that can then be tested. It
is important that the quantitation of each parameter represent
as accurately as possible its corresponding physical counter-
part (e.g., the CD45-FITC fluorescence signal for an event
should reproducibly represent the copy number of CD45
molecules on the surface of the corresponding cell). Methods
for collection, storage, and transport of samples, variations in
staining and other sample preparation variables, and instru-
ment performance parameters can all contribute to small but
important variations in the distribution of events in spaces of
ever-increasing dimensionality. In difficult applications, for
example, the characterization of rare event distributions, small
systematic errors may completely mask small but biologically
significant effects. Therefore, standardized procedures, includ-
ing appropriate controls, for each step in the chain of events
leading from donor/patient to list-mode file are essential.
While the need for such rigor may seem to be a limitation of
CF, it is equally significant in any other methods of quantita-
tive analysis of list-mode data.
The CF method together with the adoption of such stan-
dardized protocols will result in a number of tangible and far-
reaching benefits. First, the application of quantitative analysis
methods such as CF reduces or eliminates tedious manual
analysis procedures. This not only increases analysis through-
put via automation but also removes an important source of
analytical bias; namely, subjective decisions (such as choosing
gates) made by a human analyst. Second, the incorporation of
additional parameters into an assay, which potentially leads to
a large increase in the number of steps involved in manual
analysis, does not increase the apparent difficulty of analysis
via CF, because data of any dimensionality are computation-
ally reduced to a one-dimensional fingerprint. Therefore, stu-
dies with larger dimensionality are made more practical. Third
and perhaps most significant, data that are collected according
to standardized and controlled protocols can be integrated
together with their fingerprints into a large-scale cytometry
database. Such an archive will be a valuable resource for retro-
spective data mining for the creation and testing of new
hypotheses for disease associations or biological mechanisms.
ACKNOWLEDGMENTS
Flow cytometry was performed at the Abramson Cancer
Center Flow Cytometry and Cell Sorting Shared Resource of
the University of Pennsylvania School of Medicine, which was
established in part by equipment grants from the NIH Shared
Instrument Program, and receives support from NIH P30
CA016520 from the National Cancer Institute. We thank the
authors of Ref. 19 for generously making their 96-well plate
data publicly available.
LITERATURE CITED
1. Wille-Reece U, Flynn BJ, Lore K, Koup RA, Miles AP, Saul A, Kedl RM, Mattapallil JJ, Weiss WR, Roederer M, Seder RA. Toll-like receptor agonists influence the magnitude and quality of memory T cell responses after prime-boost immunization in non-human primates. J Exp Med 2006;203:1249–1258.
2. Neeson P, Paterson Y. A new multi-parameter flow cytometric assay for monitoring lymphoma growth and spread in a pre-clinical murine model for human lymphoma. Cytometry Part A 2004;60A:8–20.
3. Precopio ML, Betts MR, Parrino J, Price DA, Gostick E, Ambrozak DR, Asher TE, Douek DC, Harari A, Pantaleo G, Bailer R, Graham BS, Roederer M, Koup RA. Immunization with vaccinia virus induces polyfunctional and phenotypically distinctive CD8(+) T cell responses. J Exp Med 2007;204:1405–1416.
4. Wu KH, Zhou B, Mo XM, Cui B, Yu CT, Lu SH, Han ZC, Liu YL. Therapeutic potential of human umbilical cord-derived stem cells in ischemic diseases. Transplant Proc 2007;39:1620–1622.
5. Zhao X, Huang L, Yin Y, Fang Y, Zhou Y. Autologous endothelial progenitor cells transplantation promoting endothelial recovery in mice. Transplant Int 2007;20:712–721.
6. Shaffer RG, Greene S, Arshi A, Supple G, Bantly A, Moore JS, Parmacek MS, Mohler ER III. Effect of acute exercise on endothelial progenitor cells in patients with peripheral arterial disease. Vasc Med 2006;11:219–226.
7. Roederer M, Moore W, Treister A, Hardy RR, Herzenberg LA. Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry 2001;45:47–55.
8. Lugli E, Pinti M, Nasi M, Troiano L, Ferraresi R, Mussi C, Salvioli G, Patsekin V, Robinson JP, Durante C, Cocchi M, Cossarizza A. Subject classification obtained by cluster analysis and principal component analysis applied to flow cytometric data. Cytometry Part A 2007;71A:334–344.
9. Lizard G. Flow cytometry analyses and bioinformatics: interest in new softwares to optimize novel technologies and to favor the emergence of innovative concepts in cell research. Cytometry Part A 2007;71A:646–647.
10. De Rosa SC, Brenchley JM, Roederer M. Beyond six colors: a new era in flow cytometry. Nat Med 2003;9:112–117.
11. Jackson J. A User’s Guide to Principal Components. New York: Wiley-Interscience; 1991.
12. Roederer M, Hardy RR. Frequency difference gating: a multivariate method for identifying subsets that differ between samples. Cytometry 2001;45:56–64.
13. O’Connell M. Search program for significant variables. Comput Phys Commun 1974;8:49–55.
14. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Mag 1901;2:559–572.
15. Jolliffe I. Principal Component Analysis. New York: Springer; 2002.
16. Hotelling H. Analysis of a complex statistical variable into principal components. J Educ Psych 1933;26:417–441, 498–520.
17. Arnold G, Collins A. Interpretation of transformed axes in multivariate analysis. Appl Statist 1993;42:381–400.
18. Rogers W, Holyst H, Moore J, Schretzenmair R, Moser A. Qtube: automated quality assessment of flow cytometric data based on cytometric fingerprinting. Cytometry Part B 2007;72B:490.
19. Inokuma M, dela Rosa C, Schmitt C, Haaland P, Siebert J, Petry D, Tang M, Suni MA, Ghanekar SA, Gladding D, Dunne JF, Maino VC, Disis ML, Maecker HT. Functional T cell responses to tumor antigens in breast cancer patients have a distinct phenotype and cytokine signature. J Immunol 2007;179:2627–2633.
20. Sturges HA. The choice of a class interval. J Am Statist Assoc 1926;21:65–66.
21. Liu RC, Brown LD. Nonexistence of informative unbiased estimators in singular problems. Ann Statist 1993;21:1–13.
22. Roederer M, Treister A, Moore W, Herzenberg LA. Probability binning comparison: a metric for quantitating univariate distribution differences. Cytometry 2001;45:37–46.
23. Robinson JP, Ragheb K, Lawler G, Kelley S, Durack G. Rapid multivariate analysis and display of cross-reacting antibodies on human leukocytes. Cytometry 1992;13:75–82.
24. Robinson JP, Durack G, Kelley S. An innovation in flow cytometry data collection and analysis producing a correlated multiple sample analysis in a single file. Cytometry 1991;12:82–90.
25. Murphy RF. Automated identification of subpopulations in flow cytometric list mode data using cluster analysis. Cytometry 1985;6:302–309.