Biological response to environmental stress. Environmental
similarity and hierarchical, scale-dependant segregation of
biotic signatures for prediction purposes
A Dissertation Presented
by
David Bedoya Ribó
to
The Department of Civil & Environmental Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Civil Engineering
in the field of
Environmental Engineering
Northeastern University
Boston, Massachusetts
(October 2008)
Abstract
Biological response to environmental stress. Environmental similarity and
hierarchical, scale-dependant segregation of biotic signatures for prediction
purposes
David Bedoya Ribó
In the hierarchical river system, any deviation from the pristine state will be translated
into disturbances that propagate and eventually reach its endpoints (i.e. the biologic
community). Endpoints are indicative of the overall health or integrity of a water body.
Integrity is usually measured with multi-metric indices that compare actual observations
to reference scenarios. Despite strong agreement among experts about the importance of
biological indicators, development of numeric biological standards similar to those used
for water quality remains uncertain for several reasons: (1) the natural system is
composed of highly intertwined and cross-correlated variables, so simple
stress-response relationships can rarely be identified; (2) the natural system is organized in a
nested hierarchy of suitable habitats with very different geographic scales; (3) many
environmental variables are evaluated categorically, which introduces subjectivity and
relativity into the system; (4) true reference conditions may no longer exist; and (5)
natural randomness.
In order to address these issues, an attempt to predict or characterize biologic integrity
was performed. In the first section, fish Indices of Biologic Integrity (IBI) were predicted
using the K-nearest neighbor concept (KNN). This methodology was used because it
allows a fast, step-wise approach that is easily implemented with high-dimensional
environmental vectors. The KNN concept was tested with databases in Maryland, Ohio,
and Minnesota. Subsequently, a slightly modified version of the algorithm was tested
with a new database in Ohio that combined instream and offstream features, which improved
the results significantly.
The second section consisted of a progressive, hierarchical separation of biological
responses using Self-Organizing Maps (SOM) and subsequent clustering of sites using
one environmental variable at a time in decreasing order of importance. This
methodology attempted to replicate the nested hierarchy of habitats in nature. The
biologic responses were characterized using a Gaussian probabilistic curve because it was
assumed that IBI was a projection of the log-normal distribution of species onto an
arithmetic scale. The best sites in each group were considered true reference
conditions and compared to the remaining sites within the group. This was applied in
Ohio (with only instream or only offstream data) and Maryland (instream and offstream
data combined).
Acknowledgments
I would like to especially thank my wife: Tonya L. Berenson. Her affection, empathy,
sense of humor, and always positive attitude during very hard periods at the academic
and personal levels have been crucial to me in order to achieve this goal.
I would like to thank my advisor: Professor Vladimir Novotny. His guidance and broad
experience in the water resources field were critical for the successful completion of this
research. I am also very grateful to the committee members and especially to Professor
Elias Manolakos. His experience and advice with complex data patterning techniques, a
field that was completely new to me, were extremely helpful.
I would like to thank all my family and friends in Spain and the U.S.A. for their
unconditional support and understanding. Just knowing they were there has been an
endless source of energy and joy.
I would like to thank all my friends in the Civil & Environmental Engineering
Department at Northeastern University for their support and the good times we spent
together all these years.
Finally, I would like to express my gratitude towards Mr. Ed Rankin and Dennis Mishne
from Ohio EPA for their help with the environmental databases and valuable advice.
This research was partially funded by a USEPA STAR watershed research grant to
Northeastern University, Boston, MA.
Table of contents
Research summary
Introduction
1. Chapter 1: Comparison of IBI predictions using regression and the environmental similarity concept
  1.1. Methodology
    1.1.1. Self-Organizing feature maps
    1.1.2. k-nearest neighbor concept
    1.1.3. Description of the databases
    1.1.4. IBI prediction methodology using kNN
    1.1.5. IBI prediction using regression and SOM + regression
    1.1.6. Chronic and acute toxic chemical effects
  1.2. Results and discussion
    1.2.1. IBI predictions using kNN (k = 1 or k = 2)
    1.2.2. IBI predictions using kNN (with k = 5 or k = 10)
    1.2.3. Regression models
  1.3. Conclusions
2. Chapter 2: Large-scale biologic integrity prediction based on environmental similarity using instream data and regional and local offstream characteristics
  2.1. Methodology
    2.1.1. Data and study area
    2.1.2. Variable sorting based on IBI prediction power using a leave-one-out, hierarchical approach
    2.1.3. Step-wise IBI prediction using a leave-one-out, hierarchical approach
    2.1.4. Analysis of observations with a significant impact from local variables
  2.2. Results
    2.2.1. Step-wise IBI predictions
    2.2.2. Analysis of sites with significant local-scale stressors
  2.3. Discussion
    2.3.1. Land use
    2.3.2. Fragmentation
    2.3.3. Point sources and instream water quality
    2.3.4. Instream Habitat
    2.3.5. Mispredictions due to local effects
  2.4. Conclusions
3. Chapter 3: Probabilistic, Hierarchical, Biologic Integrity Discrimination
  3.1. Methodology
    3.1.1. Ohio: instream data and study area
    3.1.2. Ohio: offstream data and study area
    3.1.3. Maryland data and study area
    3.1.4. Self-Organizing Feature Maps (SOM)
    3.1.5. Initial data clustering and SOM neuron analysis
    3.1.6. Second SOM data clustering
    3.1.7. Site patterning based on ‘large-scale’ variables and associated biotic responses
    3.1.8. Site patterning based on ‘small-scale’ variables and associated biotic response
    3.1.9. IBI response curve development for different levels of watershed characterization
    3.1.10. Development of biotic response reference curves
  3.2. Results and discussion
    3.2.1. Ohio: instream data
    3.2.2. Ohio offstream data
    3.2.3. Coastal Maryland
    3.2.4. Piedmont Maryland
    3.2.5. Highland Maryland
  3.3. Conclusions
    3.3.1. Ohio with instream data
    3.3.2. Ohio with offstream data
    3.3.3. Maryland
4. Main conclusions
5. Future research and work
6. References
Appendices
Appendix I: group statistics
Appendix II: computer code
List of Figures
Figure 1. Hierarchical stressor-risk-endpoint propagation model based on Karr et al. (1986) integrity concept and Novotny (2003) concept of risk propagation
Figure 1-1. Flow-chart of the step-wise kNN prediction method. Dashed arrow lines represent the steps followed when the environmental variables are sorted with k = 1. Dotted arrow lines represent the steps followed when the variable sorting is performed with k = 10. Solid arrow lines depict common steps for both cases
Figure 1-2. Flow-chart of the step-wise multiple regression method. Dashed lines indicate steps for the cluster-based model only. Dotted lines indicate steps for the whole database model only. Solid lines are common steps for both methods
Figure 1-3. Top, site cluster distribution in Minnesota (left), Maryland (Piedmont sites) (center), and Ohio (right). In Minnesota, cluster 1 is concentrated in Southern watersheds. In Maryland clusters 4 and 5 are concentrated in a specific region and, in Ohio, sites located in the same watershed usually belong to the same cluster. Bottom, Self-organizing Map neuron lattice and box plots with the cluster-based IBI values. The red line in the boxplots represents median cluster value, the top line is 75 percentile, and bottom line is 25 percentile
Figure 2-1. From left to right and top to bottom. (1) Upstream stream network carrying waste water; (2) upstream stream network fragmentation; (3) basin-scale dams in the downstream main channel; (4) basin-scale stream network fragmentation
Figure 2-2. Hierarchical tree with different clustering levels to which the test site (Xi1, Xi2, …, Xin) is being compared against. i indicates the observation number, n indicates the environmental variable within the environmental vector
Figure 2-3. Diagram showing the order with which the variable groups were merged. Orange rectangles indicate instream variables. Green rectangles indicate offstream variables. Blue indicates a mix of both
Figure 2-4. IBI predictions with the best offstream variables (top), best instream variables (middle), and best variables overall (bottom). Dashed red lines indicate perfect fit line (center) and ± 1.5×RMSE (sides). Dot size is proportional to the number of hits in a specific point.
Figure 3-1. Distribution of observations used in the analysis and basins. On the left, groups after the 2nd SOM. On the right groups after clustering using SITE_Con (groups from the same parent group are segregated by basin)
Figure 3-2. 1995-1997 MBSS monitoring stations in the state of Maryland and strata distribution
Figure 3-3. Example of a hierarchical tree of the 2nd SOM neurons (left) and analysis of differences among group biologic responses (right). On the right, example of MRT analysis. Overlapping indicate not significant differences in group IBI means. Non-overlapping indicates significantly different group IBI means. In this case, Level 4 partition would be chosen because it yields the largest number of different biotic responses (5) with less overlapping than Level 5 (Figure for clarification purposes only).
Figure 3-4. Flow chart summarizing the methodology used to characterize response of the biologic community to similar environmental characteristics and stressors (Maryland and Ohio with instream data)
Figure 3-5. Flow chart summarizing the methodology used to characterize response of the biologic community to similar environmental characteristics and stressors (for Ohio with offstream data)
Figure 3-6. Correlation matrix of the variable neuron-based weights and neuron-based average IBI values in the trained SOM.
Figure 3-7. Groups and subgroups with different biological responses after clustering with large and small-scale environmental filters. Red color marks groups that did not pass normality tests. Blue color indicates groups that passed the normality tests.
Figure 3-8. Normal distribution probability plots for groups 1 through 6. Red line indicates 75th IBI percentile. Points to the right of the red line were considered as reference observations for the respective group of sites and separated.
Figure 3-9. Normal probability plots for the reference (green) and impaired (red) conditions for the six groups obtained after clustering the SOM neurons with environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group
Figure 3-10. Correlation matrix of the variable neuron-based weights and neuron-based average IBI scores in the trained SOM. Color bar on the right indicates absolute value of the absolute correlation coefficient. Plus and minus signs indicate positive or negative correlation.
Figure 3-11. Hierarchical diagram of habitats with significantly different biotic responses. On the right, list of environmental variables used to segregate biotic signatures at each step. Rectangles in blue indicate groups that passed normality test. Rectangles in red indicate groups that did not pass normality test.
Figure 3-12. Normal distribution probability plots for the biologic signatures after clustering sites with SITE_Con. Group 212 did not pass the Jarque-Bera test of normality at the 95% confidence level (see Figure 3-11). Group 221 was not plotted because it only had 4 observations
Figure 3-13. Example of biologic response separation by segregation of sites with environmental variables. Group 222 splits in groups 2221 and 2222 (group 2222 not-normally distributed) after clustering with RDA_Urban. Group 2222 splits in groups 22221 and 22222 (both normally distributed) after clustering with R30_Agri.
Figure 3-14. Normal probability plots for the reference (green) and impaired (red) conditions for the groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution (Group 212 was fitted to a Gaussian distribution only for demonstration purposes)
Figure 3-15. Groups of sampling sites in a watershed located in the Muskingum River Basin. On the left, groups after partition with regional watershed land use and fragmentation metrics. On the right, groups after partitions with land use in the local 100-meter buffer
Figure 3-16. Correlation matrix of the variable neuron-based weights and neuron, average IBI values in the trained SOM. Color bar on the right indicates color code for the absolute correlation coefficients among variables
Figure 3-17. Groups and subgroups with different biological response after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed the normality tests
Figure 3-18. Normal probability plots for the IBI responses found after the 2nd SOM clustering
Figure 3-19. Normal probability plots for the reference (green) and impaired (red) conditions for the two groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution
Figure 3-20. Correlation matrix of the variable neuron-based weights and neuron, average IBI values in the trained SOM. Color bar on the right indicates color code for the absolute correlation coefficients among variables
Figure 3-21. Groups and subgroups with different biological responses after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed normality tests
Figure 3-22. Normal probability plots for the IBI responses identified by the 2nd SOM clustering in Piedmont sites (Group 4 didn’t pass the normality test)
Figure 3-23. Normal probability plots for the reference (green) and impaired (red) conditions for the two groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution (Group 4 was fitted to a Gaussian distribution only for demonstration purposes)
Figure 3-24. Correlation matrix of the variable neuron-based weights and neuron, average IBI values in the trained SOM. Color bar on the right indicates color code for the absolute correlation coefficients among variables
Figure 3-25. Biological response hierarchical structure after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed normality tests
Figure 3-26. Normal probability plots for the IBI responses the 2nd SOM clustering in Highland sites (groups 1 and 3 didn’t pass normality tests)
Figure 3-27. Normal probability plots for the reference (green) and impaired (red) conditions for the three groups obtained using environmental gradients in Highland sites. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group in order to describe its Gaussian distribution (Groups 1 and 3 fitted to a Gaussian distribution only for demonstration purposes)
List of Tables
Table 1-1. Description of the environmental variables, scores and indices available for each state and their units
Table 1-2. Summary of IBI predictions using the kNN methodology. The different functions (Mh = Mahalanobis; Eu = Euclidean) and selected number of closest neighbors (k) are specified. Final selected variables in each case are also listed
Table 1-3. Summary of the step-wise regressions for IBI prediction for the development and validation sets. The variables used in each case are listed together with their coefficients and curve type (in parentheses). Variables in italics in the whole database regressions indicate variables also used in some of the kNN predictions. Results in Ohio after including metal toxicity penalties
Table 2-1. Description, percentage quartiles, and individual IBI predicting power for the different NLCD land use categories present in the Ohio database
Table 2-2. Description, quartile values, and individual IBI predicting power for the water quality, habitat, point source, and stream fragmentation metrics
Table 2-3. List of variables with significant differences between over-predicted sites and sites with a prediction within the ±1.5×RMSE intervals
Table 2-4. List of variables with significant differences between under-predicted sites and observations with a prediction within the ±1.5×RMSE intervals
Table 2-5. Step-wise IBI predictions. R² indicate the variability explained after adding a new variable to the model. All results were achieved using a hierarchical tree with 423 branches. For an explanation of variables refer to Table 2-1 and Table 2-2
Table 3-1. List of water quality, habitat, and biologic integrity parameters used in the research
Table 3-2. Land use categories and quartiles at the watershed (R) and the local (L) scales
Table 3-3. Fragmentation (top) and point source density and intensity metrics (middle), units, and quartiles
Table 3-4. Description, quartiles, and units for the available regional environmental variables
Table 3-5. Neuron-based correlation coefficients between variables and IBI.
Table 3-6. ANOVA (top) and MRT (bottom) analyses for the IBI means in groups after 2nd SOM patterning with environmental gradients shown in Figure 3-7. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups.
Table 3-7. 95% confidence intervals for the environmental variable means in reference and impaired sites. Text in bold indicates statistically significant differences for that variable and group according to the t-tests
Table 3-8. Correlation coefficients between the neuron-based regional environmental variables and the neuron-based average IBI scores (left and mid columns) and raw local variables and IBI scores (left column). Variables in bold were capable of separating significantly different biological responses in the hierarchical structure
Table 3-9. ANOVA (top) and MRT (bottom) analyses to detect significant differences in IBI means between 2nd SOM groups of neurons. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups.
Table 3-10. 95% confidence intervals and ANOVA test between reference and non-reference sites in variables used in the separation of biotic responses
Table 3-11. Average group values after clustering with basin/watershed scale variables
Table 3-12. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups
Table 3-13. 95% confidence intervals and ANOVA test between reference and non-reference sites with variables used in the separation of biotic responses in coastal sites
Table 3-14. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups
Table 3-15. 95% confidence intervals and ANOVA test between reference and non-reference sites with variables used in the separation of biotic responses in piedmont sites
Table 3-16. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses in highland sites. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups
Table 3-17. 95% confidence intervals and ANOVA test between reference and non-reference sites in variables used in the separation of biotic responses in highland sites
Research summary
The research presented in this thesis is an attempt to predict or characterize biological integrity
using data patterning techniques. In the initial stages, a comparison of traditional and more
advanced prediction methods was performed at the state-level and presented in Chapter 1. The
results showed how predictions based on evaluation of environmental similarity outperformed
predictions based on more traditional techniques (i.e. non-linear regression). Moreover, this
methodology was much faster computationally and allowed a leave-one-out validation procedure
that other methods could not afford because of time constraints. This methodology was tested using
databases compiled by public agencies in Ohio, Maryland, and Minnesota.
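The environmental similarity idea summarized above can be illustrated with a minimal sketch. It assumes that each site is described by an already standardized environmental vector, that similarity is measured with the Euclidean distance (one of the two distance functions listed for Table 1-2), and that the predicted IBI is the plain mean of the k most similar sites. The function names and the four-site data set are hypothetical and are not taken from the thesis code in Appendix II; the step-wise variable selection is not shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two environmental vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k):
    """Predict IBI at `query` as the mean IBI of its k most similar sites."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def leave_one_out_rmse(x, y, k):
    """Hold out each site in turn, predict it from the rest, and return the RMSE."""
    squared_error = 0.0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        squared_error += (knn_predict(rest_x, rest_y, x[i], k) - y[i]) ** 2
    return math.sqrt(squared_error / len(x))

# Hypothetical, already standardized environmental vectors and IBI scores.
sites = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
ibi = [20.0, 22.0, 50.0, 52.0]
rmse_k1 = leave_one_out_rmse(sites, ibi, k=1)
```

Because each held-out prediction only requires sorting distances to the remaining sites, the leave-one-out loop stays cheap, which is consistent with the computational advantage described above.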
After these initial results, I realized that the predictions could potentially be improved because
none of the available databases had complete instream (i.e. water and habitat quality) and
offstream (i.e. regional and local land use, point source, and fragmentation) information. A new
database was created for Ohio using Geographic Information Systems (GIS) in order to obtain
accurate land use, point source, and fragmentation metrics for each existing site in the original
database. IBI was predicted using the merged databases. An improved algorithm was used which
assessed environmental variability at different levels of a hierarchical tree of homogeneous
groups. The results improved the previous prediction by almost 10% using only offstream data
(regional land use and stream fragmentation). Mispredicted sites were separated and their
differences from the remaining observations were analyzed. Significant differences in upstream
fragmentation, local land use, and water quality were detected. These results are presented in
Chapter 2.
A literature review as well as the results from Chapter 2 led to the development of a
methodology able to characterize biological responses at different levels of environmental
characterization or description. The methodology developed was named PROHIBID
(PRObabilistic HIerarchical Biologic Integrity Discrimination). This consisted of a top-down
hierarchical classification of environmental stressors based on their overall effect on IBI. This
started with a separation of major biologic signatures by identification of environmental
gradients using Self-Organizing Maps (SOM). Subsequently, distinct biotic responses due to
more localized environmental stressors were progressively segregated using one variable at a
time in decreasing order of importance. Therefore, as the system characterization increased,
group environmental and biologic homogeneity was increased as well. Biotic responses in each
group were represented using a Gaussian distribution. This function was chosen under the
hypothesis that the IBI is a projection of the observed log-normal distribution of species onto an
arithmetic axis. This hypothesis worked very well and groups usually reached
normality if they were homogeneous enough. Some groups did not achieve normality but this
was most likely due to lack of a representative sample.
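As a rough illustration of representing a group's biotic response with a Gaussian distribution, the sketch below fits a normal distribution to the IBI scores of one hypothetical homogeneous group and reads off the probability of meeting a target score. The scores, the target of 40, and the function names are invented for illustration, and Python's `statistics.NormalDist` stands in for whatever fitting routine was actually used.

```python
from statistics import NormalDist

def fit_group_response(ibi_scores):
    """Fit a Gaussian (sample mean and standard deviation) to one group's IBI scores."""
    return NormalDist.from_samples(ibi_scores)

def probability_of_attaining(dist, target_ibi):
    """Probability that a site in the group scores at or above target_ibi."""
    return 1.0 - dist.cdf(target_ibi)

# Hypothetical IBI scores (12-60 scale) for one group of environmentally similar sites.
group_ibi = [34, 38, 40, 42, 36, 44, 40, 38, 42, 46]
group_dist = fit_group_response(group_ibi)
p_attain = probability_of_attaining(group_dist, target_ibi=40)
```

Here the target equals the group mean, so the probability of attainment is one half; a probabilistic standard would compare such probabilities across groups.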
The best observations in each group were considered true reference conditions because they
belonged to a highly environmentally homogeneous cluster. Differences between reference and
non-reference sites were evaluated and indicated the main issues to be addressed as well as their
scale in order to achieve reference conditions. For example, when offstream data was used,
PROHIBID identified regional land use and fragmentation as environmental gradients (i.e. large-
scale variables responsible for background integrity). Local buffer land uses usually explained
the fluctuations within these groups.
This methodology could easily be implemented to establish probabilistic biological standards
similar to those in water quality. Furthermore, reference or realistically achievable conditions are
easily identified because we ‘let the data speak’ with no a-priori assumptions of what reference
conditions should be. Moreover, the scale at which the problem is analyzed is flexible because
with this method differences can be analyzed at any level of the hierarchical structure. Therefore,
the scale issue is no longer a problem. PROHIBID was implemented in Ohio (with instream and
offstream data) and Maryland (combination of both) and is described in detail in Chapter 3.
Introduction
Biologic integrity represents the highest point of the hierarchy in the natural system. It is a direct
measure of the ecological status in a water body and considered a response indicator (Novotny et
al. 2005). Environmental stressors and fauna’s exposure risk to stress propagate through the
hierarchical structure and ultimately impact the biologic community (Figure 1). For
this reason, integrity is considered a true indicator of the overall health of a water body,
sensitive to any departure from pristine conditions due to anthropogenic modifications at any
scale.
Figure 1. Hierarchical stressor-risk-endpoint propagation model based on the Karr et al. (1986) integrity concept and the Novotny (2003) concept of risk propagation
Biological integrity in fresh water systems is usually evaluated with indices.
The use of indices to monitor the biological integrity of surface waters has been common
practice since the last quarter of the 20th century, but started almost a century ago (Novotny
2003). One of the most widely used indices in the United States is the Index of Biologic Integrity
(IBI) developed by Karr et al. (1986). Many public agencies have adopted it as a framework for
their own calibrated version at the state or region scales (Bode 1988; Lyons 2006; Lyons et al.
2001; Ohio EPA 1987; Roth et al. 1998). The IBI is a multi-metric (12 metrics), comparative
index in which the fish samples obtained from a particular water body are compared against the
fish abundances and community composition in reference watersheds. Fish samples are obtained
by electrofishing. The sum of the 12 metrics constitutes the final IBI score, which is a
discrete number ranging from 12 (essentially no fish) to 60 (healthy fish community). Even
though the index developed by Karr et al. (1986) is based on fish, numerous IBIs based on the
macroinvertebrate community also exist and are currently used (Barbour et al. 1999; Hilsenhoff
1987; Southerland et al. 2005; Stribling et al. 1998; Wright et al. 1988). For convenience, from
this point forward, fish IBI will be referred to as IBI.
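The scoring arithmetic can be sketched as follows, assuming the usual convention in Karr-type indices that each of the 12 metrics is scored 1, 3, or 5 against reference expectations. The function name and the example scores are hypothetical.

```python
VALID_METRIC_SCORES = {1, 3, 5}  # assumed per-metric scoring classes

def ibi_score(metric_scores):
    """Sum 12 metric scores into a final IBI value between 12 and 60."""
    if len(metric_scores) != 12:
        raise ValueError("the IBI described here uses exactly 12 metrics")
    if any(score not in VALID_METRIC_SCORES for score in metric_scores):
        raise ValueError("each metric score must be 1, 3, or 5")
    return sum(metric_scores)

# Hypothetical site with a mix of poor, fair, and good metric scores.
example_site = [5, 5, 3, 3, 1, 5, 3, 3, 5, 1, 3, 5]
example_ibi = ibi_score(example_site)
```

With these bounds, a site scoring 1 on every metric receives 12 and a site scoring 5 on every metric receives 60, which matches the range given above.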
Valid environmental and biodiversity indicators should be sensitive enough to track changes
from reference conditions, applicable in large geographical areas, capable of providing a
continuous assessment over a wide range of stress, and differentiate between natural cycles or
trends and anthropogenic stress (Ott 1978). This is not an easy task because an index must be
able to reflect changes in the community produced by stressors at different hierarchical levels of
the ecosystem and at different geographic scales. The IBI developed by Karr et al. (1986) is
currently accepted as an index with the desirable characteristics and has been applied
successfully to aquatic communities (Noss 1990). Numerous authors have confirmed that
indices based on Karr’s IBI are sensitive to man-induced environmental stresses (Dyer et al.
2000; Dyer et al. 1998a; Lammert and Allan 1999; Manolakos et al. 2007; Richards et al. 1996;
Roth et al. 1996; Wang et al. 2001; Yuan and Norton 2004).
Species distribution in a pristine, lotic system is determined by natural inputs such as
meteorology, geography, latitude, elevation, stream or lake morphology, habitat quality, and
water chemistry (Novotny 2003). However, finding completely pristine environments is difficult
if not impossible. Therefore, identifying a pre-existing state (i.e. actual state) is of major
importance in order to set a reference against which comparisons can be made (Rykiel 1985).
Deviations from the natural state are a consequence of introducing some disturbance in the
reference system that will cause a perturbation (at the system level) and/or stress (at the
physiological and functional level). This can be quantified by looking at the departure of the
biological and ecological features of the modified system (Rykiel 1985). The literature contains
ongoing discussion of the concept of disturbance and its expression in the ecosystem. Most authors agree
that disturbances should not be approached as monolithic inputs that will translate into a specific
change in the whole reference structure. Instead, system disturbance will be expressed in
different ways at different levels of biological organization (Noss 1990). Disturbances should be
analyzed in the context of a highly hierarchical system (i.e. ecosystem) in which the scale at
which a disturbance is manifested will determine its consequences (Pickett et al. 1989).
The hierarchy theory in ecosystems suggests that higher levels of organization incorporate and
determine the response of lower levels (Allen and Starr 1982; O'Neill et al. 1986). Four different
levels of organization of the biological community exist (from high to low hierarchy): regional
landscape, community-ecosystem, population species, and genetics. Each one of these levels has,
at the same time, three different dimensions that define them: functional, structural, and
compositional (Noss 1990; Novotny et al. 2007). The relevance of higher order constraints
should not mean that monitoring be limited to these levels (e.g. landscape patterns). It is in the
lower levels of the hierarchy where the most detailed information (e.g. species abundances) and
the mechanistic basis for higher levels can be found (Noss 1990). According to the concept of
hierarchical structure, one should be able to distinguish between disturbances that cause stress in
the higher hierarchies (regional or community level) and lower ones (population and genetic
level). Disturbances that translate into some sort of stress at high hierarchy levels are also known
as environmental gradients or large-scale stressors. Environmental gradients usually occur
when normal ecological stimuli and processes in the system, which constitute a continuum, go
beyond normal limits and constitute an axis of continuous change in frequency (Allen and Starr
1982; Rykiel 1985). Environmental gradients are usually ubiquitous, meaning that they will
always be there at different levels or configurations (e.g. landscape patterns, background water
quality). Since they are usually related to large-scale patterns, deviations from their normal,
natural boundaries affect the biological community in its higher hierarchies, producing an overall
shift of the natural species distribution. Large-scale variables determine the background quality
of the biotic community.
On the other hand, environmental stressors that affect lower levels of the biologic hierarchy will,
for consistency, be called small-scale variables. These variables usually have a marginal effect on
the whole community structure distributed over a large geographic area, but can severely affect
the biologic community at the regional or local scale (e.g. point sources). Two types of small-
scale variables might exist. The first one would be when some element foreign to the natural
system is introduced in sufficient amount as to negatively affect the biologic quality (e.g.
introduction of metals from point sources). The second type would be when localized extreme
values of already existing elements or gradients are reached due to human activity (e.g. high
levels of siltation due to presence of construction sites).
Many studies that try to link IBI to stressors focus on a specific scale and therefore find that the
relevant variables for IBI are those that affect the biologic community at its highest
possible level of hierarchy at the given scale. Thus, impacts from small-scale stressors are
obscured by these. For example, Manolakos et al. (2007) (also summarized in Novotny et al.
2007) used the whole state of Ohio as their system of study. In their analysis, they found that
habitat characteristics together with conductivity and hardness were the main descriptors of the
three identified clusters with different IBI qualities. These variables are large-scale variables
with great effect on the overall IBI variability at the state level. Other variables such as metal
concentration showed a weaker overall effect on IBI. In another study (Dyer et al. 2000), the
main IBI predictors were identified through multiple linear regression in the Great Miami River
watershed in Ohio. When they analyzed the entire area they found that the Qualitative Habitat
Evaluation Index (QHEI), the percentage of municipal effluent flow in average stream
conditions, gradient, and hardness were the best predictors for IBI. These are all large-scale
variables. When they analyzed the lower portion of the watershed, hardness, total suspended
solids, concentrations of selenium, lead, zinc, and ammonium together with pool and channel
qualities were the best predictors. Roth et al. (1996) found that regional land use was a better
predictor for IBI than local land use in a study at the River Raisin watershed in Michigan. Their
study comprised multiple samples in streams of different order and different biologic integrity.
However, another study in the same watershed found exactly the opposite results (Lammert and
Allan 1999). Their study focused only on three first-order warm water tributaries to the River
Raisin. The discrepancies between Roth et al. (1996) and Lammert and Allan (1999) were due to
the scale at which the problem was approached (Allan et al. 1997).
Therefore, average biologic integrity observed in a specific area (which will be referred to as
background integrity) is mainly determined by environmental stressors that are ubiquitous at the
specific scale. This doesn’t necessarily imply that these stressors are the best biologic integrity
predictors. For example, in a pristine environment, species distribution within homogeneous
geographical regions (e.g. ecoregions) is mostly determined by natural inputs such as
meteorology, geology, geography, latitude, and altitude (Novotny 2003). Species presence or
absence and species abundance within smaller, pristine environmental units is mostly determined
by other variables such as local habitat quality, stream morphology, or natural water quality.
Therefore, at very large scales, variables such as geology will be better predictors. However,
when the scale is reduced, local variables become better predictors because larger-scale
variables are homogeneous within the study area (i.e. they determine the average background
quality but not the fluctuations). This concept is transferable to areas undergoing disturbance.
Some stressors act as big disruptors of the ecological hierarchy affecting it at its highest levels
(e.g. climate change at the Earth’s scale or extensive land use changes at the basin, sub-basin or
watershed scales). Other anthropogenic stressors may only affect species distribution in small
areas or localized points and affect the ecosystem at lower levels of the ecological hierarchy (e.g.
channelization of a stream section). Therefore, anthropogenic disturbances may alter the
ecological system at different levels of the hierarchical system depending on their geographic
extent. One stressor may only be local if it is highly localized (e.g. a point source) but can
become a major disruptor if its intensity and extent are severe enough (e.g. extensive water
quality degradation in the U.S. before passage of the Clean Water Act of 1972).
Because of the scale dependence of environmental stressors, a correct sampling design is
paramount in order to obtain reliable results and identify relationships between response
indicators and environmental stressors. Targeted environmental stressors need to be ubiquitous
and diverse in the area of study in order to draw reliable predictions and identify clear patterns. If
biologic integrity is to be evaluated in multiple watersheds, these need to have significantly
different regional characteristics in order to identify reliable stressor-response relationships. If
evaluation of biologic integrity is to be performed at smaller scales (e.g. subcatchment level),
environmental stressors with a scale larger than the study area must be highly homogeneous in
order to reveal the effect of more localized stressors on IBI. Background quality is still
determined by the regional variables; however fluctuations within a homogeneous unit are due to
variables that are local and diverse at the given scale unless extraordinary stressors exist such as
toxic spills.
In summary, the physical structure of the aquatic system is organized in a hierarchical manner
(Allan et al. 1997; Frissell et al. 1986). Therefore, the distribution of species within this
hierarchical structure is also a nested hierarchy of suitable habitats to which species have adapted
(Kolasa and Biesiadka 1984; Kolasa and Strayer 1988; Sugihara 1980; 1983). Due to the
correspondence in hierarchies, it is logical to think that disturbances in high habitat hierarchical
levels will affect high levels of the biological hierarchy. Stresses in the higher hierarchies of
habitat (e.g. regional scale land use) will propagate and directly alter instream fish habitat
conditions (e.g. sediment retention, instream habitat quality, or organic matter input) (Allan et al.
1997). As a consequence and due to the habitat-biological hierarchy correspondence, high levels
and all subsequent lower levels of the biotic community structure will be affected as well. Since
these major shifts in community structure are produced by environmental gradients, IBI is better
predicted with high level environmental variables in large geographic scales. At smaller scales, a
combination of large and small scale variables predicts IBI more accurately.
Despite the clear theoretical relationship between environmental stressors and response
indicators, identification of stress-response relationships remains challenging for several reasons:
(1) the natural system is composed of highly intertwined and cross-correlated environmental
stressors, (2) the natural system is organized in a nested hierarchy of suitable habitats that are
adequate to different types of species and organisms and may have very different scales (Allan et
al. 1997; Frissell et al. 1986; Kolasa 1989), (3) categorical evaluation of environmental variables
such as habitat quality may bring some degree of subjectivity or relativity into the system and
lead to misleading results, data errors, or poor numerical relationships (i.e. lower coefficients of
multiple determination due to discrete nature of the data), (4) true reference conditions may no
longer exist. Thus, selection of a representative, reference actual state is crucial in order to have
reliable, non-arbitrary results (Rykiel 1985). Reference conditions are also linked to scale
(Pickett et al. 1989), and (5) presence of natural randomness.
In my opinion, most of the current research efforts to predict or characterize biologic integrity
have three main issues that need to be addressed. First, many of the numerical analysis
techniques used for IBI prediction purposes are performed with traditional methods that have
limited capabilities to truly reflect the high non-linearity of the natural system (e.g. linear
regression or canonical correspondence analysis). Also, since many environmental variables may
be responsible for a fraction of the IBI variability, easy-to-apply numerical methodologies that
allow easy validation become crucial. Second, and most importantly, the results of any research
effort trying to predict or characterize biological integrity are bound to the scale and design of
the sampling strategy. Many examples exist in which different (and even opposite) results have
been found in the same region of study due to scale issues (Allan et al. 1997; Dyer et al. 2000).
Therefore, development of a methodology able to segregate different biologic responses to
stressors acting at different geographic scales is a link missing in current research. Third,
identification of true reference or realistically achievable conditions to compare new
observations against with no a-priori assumptions is also paramount in order to set future
strategies for standard development and set priorities for future restoration efforts.
1. Chapter 1: Comparison of IBI predictions using regression and the
environmental similarity concept
1.1. Methodology
In the present chapter, two different methodologies to predict biotic integrity were tested. For the
analyses, three large state or region-wide databases of indices of biotic integrity and their metrics
as well as accompanying land use, habitat, and chemical parameters were obtained. The first IBI
prediction methodology consisted of using the k-nearest neighbor concept (kNN) with the entire
databases. This method was used first because it is usually considered as a benchmark for
subsequent, more elaborate techniques to be compared against. The kNN concept is based on
assessing proximity among observations by measuring their dissimilarity. The Euclidean and
Mahalanobis distance functions were used for this purpose. A detailed description of the kNN
methodology can be found in Jain et al. (1999) and Jain and Dubes (1988). It was used as a first
step because it is a very fast, computationally efficient technique that easily allows good model
validation by using a leave-one-out approach without drastically increasing the computation
time. Since it was performed using the entire databases, it was expected to reveal the main
environmental parameters with a significant impact on biotic integrity at larger scales.
Once the kNN predictions were performed and a prediction benchmark was obtained, another
methodology was tested. It consisted of a step-wise multiple regression using the best fitting
function (linear or non-linear) at each step. This was performed using two different data scales.
The first scale was the entire database (same as with kNN). The second scale was clusters of sites
obtained using Self Organizing Maps (SOM) (Kohonen 2001; Manolakos et al. 2007) followed
by SOM-neuron clustering with the k-means method (Duda et al. 2000).
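The second step, grouping SOM neurons with k-means, can be sketched as below. This is a minimal Lloyd-type k-means written for illustration, not the implementation used in the research. Seeding the centroids with the first k vectors is an assumption made to keep the example deterministic, and the neuron weight vectors are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iter=100):
    """Cluster vectors into k groups; returns (labels, centroids)."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: seed with the first k vectors
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical weight vectors of six trained SOM neurons (two scaled variables).
neuron_weights = [[0.10, 0.20], [0.15, 0.25], [0.80, 0.90],
                  [0.85, 0.95], [0.12, 0.22], [0.90, 0.88]]
labels, centroids = kmeans(neuron_weights, k=2)
```

Each sampling site then inherits the cluster label of its best matching neuron.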
The first goal of the research was to compare the performance of more traditional approaches
(regressions) in identifying critical environmental variables to a simpler and more time-efficient
technique based on site similarities (kNN). The second goal was to demonstrate the importance
of data scale in biotic integrity prediction and develop a methodology able to identify relevant
variables at different scales. This was done by running the regression model first with the entire
database (state or region level) and then on a cluster (of sampling sites) basis.
The SOM and kNN techniques, as well as the different databases and their
parameters, are described briefly in this section before the methodology followed in
each case is presented.
1.1.1. Self-Organizing feature maps
SOM are considered a type of unsupervised Artificial Neural Network (ANN). The SOM consist
of a topologically ordered mapping of the input space (in our case vectors of environmental
variables) onto a two-dimensional space according to a meaningful order (Kohonen 2001). SOM
are composed of multiple units called cells or neurons, which represent a homogeneous unit in
the SOM environment. Neurons can be grouped into clusters using similarity functions among
the neuron centroids.
A SOM is usually composed of a two-dimensional lattice that represents the SOM cells. In an
initialization process, each neuron in the SOM is associated with a random weight vector
($m_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{in}]$), which has the same dimension ($n$) as the input environmental vectors
($x_b = [x_{b1}, \ldots, x_{bs}, \ldots, x_{bn}]$). Using a dissimilarity function (Euclidean distance), each environmental
vector (corresponding to a sampling site) is associated with the most similar SOM neuron, called
the Best Matching Unit (BMU). Thus, an initial environmental vector SOM-layout is obtained.
Subsequently, the initial neuron-allocated weights ($m_i$) are updated using a neighborhood
function. This function minimizes the overall distance between the neuron itself and its
neighbors. The new updated neuron weight is called the generalized median ($\varepsilon_i$). This process is
iterated several times (epochs) until convergence or until a certain criterion is met
(usually $m_i \cong \varepsilon_i$). After convergence, similar SOM neurons can be further grouped according to
their similarity. Grouped SOM neurons constitute the clusters. SOM have been used for
environmental modeling on several occasions (Cereghino et al. 2001; Manolakos et al. 2007).
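The training loop described in this section can be sketched as a batch SOM. The sketch below is illustrative only and is not the toolbox used in the research. It assumes a rectangular lattice, a Gaussian neighborhood function whose radius shrinks linearly, and a fixed number of epochs in place of a convergence criterion; all names and the example data are hypothetical.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest (Euclidean) to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_batch_som(data, rows, cols, epochs=20,
                    sigma_start=1.5, sigma_end=0.3, seed=0):
    """Train a rows x cols SOM; returns (weights, grid positions)."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialization: one random weight vector per neuron.
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    for epoch in range(epochs):
        # Neighborhood radius shrinks linearly from sigma_start to sigma_end.
        sigma = sigma_start + (sigma_end - sigma_start) * epoch / max(epochs - 1, 1)
        bmus = [best_matching_unit(weights, x) for x in data]
        updated = []
        for ri, ci in grid:
            total = [0.0] * dim
            norm = 0.0
            for x, b in zip(data, bmus):
                rb, cb = grid[b]
                # Gaussian weight based on lattice distance to the vector's BMU.
                h = math.exp(-((ri - rb) ** 2 + (ci - cb) ** 2) / (2.0 * sigma ** 2))
                norm += h
                for d in range(dim):
                    total[d] += h * x[d]
            # New weight: neighborhood-weighted mean of the data vectors.
            updated.append([t / norm for t in total])
        weights = updated
    return weights, grid

# Hypothetical sites described by two variables already scaled to [0, 1].
sites = [[0.1, 0.2], [0.15, 0.1], [0.9, 0.8], [0.85, 0.95], [0.5, 0.5]]
som_weights, som_grid = train_batch_som(sites, rows=2, cols=3)
site_to_neuron = [best_matching_unit(som_weights, x) for x in sites]
```

Because each updated weight is a weighted mean of the data vectors, the trained weights always stay within the range of the input data.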
1.1.2. k-nearest neighbor concept
The kNN technique consists of a simple algorithm in which one observation point (which is
composed of multiple physical and chemical environmental variables measured at a specific site)
is compared against a set of observations with the exact same attributes. The objective is to find a
specified number of most similar observations (k) to the one being tested.
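A minimal sketch of this neighbor search is given below. The function names and the example sites (three variables already scaled to [0, 1]) are hypothetical, and the Euclidean distance is used as the default dissimilarity function.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(query, observations, k, distance=euclidean):
    """Indices of the k observations most similar to query, closest first."""
    order = sorted(range(len(observations)),
                   key=lambda i: distance(query, observations[i]))
    return order[:k]

# Hypothetical sites: (conductivity, dissolved oxygen, habitat score), scaled.
sites = [[0.20, 0.80, 0.70], [0.25, 0.75, 0.65], [0.90, 0.30, 0.20],
         [0.85, 0.35, 0.25], [0.50, 0.50, 0.50]]
neighbors = k_nearest([0.22, 0.78, 0.68], sites, k=2)
```

The IBI of the tested site can then be predicted from the IBI values recorded at the returned neighbors.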
In order to measure the degree of dissimilarity, there exist numerous distance metrics. Some
common metrics are the Minkowski distance, the Euclidean distance (which is a particular case
of the Minkowski distance), the cosine distance, or the Mahalanobis distance. The latter is
particularly interesting because it applies a whitening transformation to the data that avoids or
reduces linear correlation distortion among features. Detailed information on these functions can
be found in Jain and Dubes (1988) and Jain et al. (1999). The Euclidean (Eq.1) and Mahalanobis
(Eq.2) distances were used in the research with a customized application developed with
MATLAB® .
$$ED(X_i, X_j) = \left( \sum_{k=1}^{n} (X_{i,k} - X_{j,k})^2 \right)^{1/2}$$ (Equation 1)
$$MhD(X_i, X_j) = (X_i - X_j)^{T} \times \Sigma^{-1} \times (X_i - X_j)$$ (Equation 2)
In the above equations, n is the dimension of the data vectors (number of environmental
variables) in the database, Xi and Xj are the pair of vectors being compared. Matrix Σ in Equation
2 is the covariance matrix of the observed data vectors using the selected features.
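The two distance functions can be sketched in plain Python as follows. This is not the MATLAB application used in the research; the helper names, the small Gauss-Jordan inversion routine, and the example data are assumptions made for illustration. The Mahalanobis function returns the quadratic form of Equation 2; taking its square root, as some definitions do, would not change the ranking of neighbors.

```python
import math

def euclidean(xi, xj):
    """Equation 1: Euclidean distance between two data vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def covariance(rows):
    """Sample covariance matrix of the observed data vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]

def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    d = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(d):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]

def mahalanobis(xi, xj, cov_inv):
    """Equation 2: quadratic form (Xi - Xj)^T * inverse covariance * (Xi - Xj)."""
    diff = [a - b for a, b in zip(xi, xj)]
    weighted = [sum(cov_inv[r][c] * diff[c] for c in range(len(diff)))
                for r in range(len(diff))]
    return sum(d * w for d, w in zip(diff, weighted))

# Hypothetical observations of two correlated environmental variables.
observations = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
cov_inv = invert(covariance(observations))
mh = mahalanobis(observations[0], observations[3], cov_inv)
```

With an identity covariance matrix the quadratic form reduces to the squared Euclidean distance, which is a convenient check on any implementation.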
1.1.3. Description of the databases
Environmental databases compiled by the Minnesota Pollution Control Agency (MNPCA), Ohio
EPA and Maryland Biological Stream Survey (MBSS) were obtained. The databases contained
multiple observations of chemical, physical and biological parameters at different sites.
Unfortunately the type and format of data available, especially for physical variables, were quite
different among the three states. A summary of the environmental variables recorded for each
observation is provided in Table 1-1. The variables at each site were collected within a one-week
window. Therefore, IBI observations at that specific time can be considered as the outcome
of the recorded physical and chemical characteristics at an observation site. The number of
sites reported for each state is the number of observations with no missing data in any field,
which is also the number of observations used in the analysis.
OH (429 sites)
Water chemistry: Conductivity (Cond) (µmho/cm); Dissolved oxygen (DO) (mg/L); pH (standard units); Total Suspended Solids (TSS) (mg/L); Total Phosphorus (P) (mg/L); Ammonia as N (NH4) (mg/L); Nitrite as N (NO2) (mg/L); Nitrogen Kjeldahl (TKN) (mg/L); Nitrate as N (NO3) (mg/L); Hardness as CaCO3 (Hard) (mg/L); Biological Oxygen Demand (BOD) (mg/L); Total Calcium (Ca) (mg/L); Total Magnesium (Mg) (mg/L); Chloride (Cl) (mg/L); Sulfate (SO4) (mg/L); Total Arsenic (As) (µg/L); Total Cadmium (Cd) (µg/L); Total Copper (Cu) (µg/L); Total Iron (Fe) (µg/L); Total Lead (Pb) (µg/L); Total Zinc (Zn) (µg/L)
Habitat and morphology: Substrate score (Subs) (0-20); Embeddedness score (Embed) (0-4); Riparian score (Rip) (0-10); Instream cover score (Cov) (0-20); Riffle score (Riffle) (0-8); Pool score (Pool) (0-12); Channel score (Chan) (0-20); Gradient score (Grad) (0-10); Habitat index (QHEI) (0-100)
Land use (beyond 100 m buffer area): % Agriculture (Agri) (25% increments); % Forest-wetland (Forwet) (25% increments); % Urban (Urban) (25% increments)
Biological indices: Fish IBI (12-60); ICI (0-60)

MN (125 sites)
Water chemistry: Conductivity (Cond) (µmho/cm); Dissolved oxygen (DO) (mg/L); pH (standard units); Total Suspended Solids (TSS) (mg/L); Total Phosphorus (P) (mg/L); Ammonia as N (NH4) (mg/L); Total Nitrogen (TN) (mg/L); Temperature (Temp) (deg C); Turbidity (Turb) (NTU)
Habitat and morphology: Substrate, channel, and cover scores; Habitat index (QHEI) (0-100); Buffer width (MBufWid) (m); Mean bank erosion (MBankEros) (m); % undercut (PctUndercut); % woody (PctWoody); % over vegetat. (PctOverVeg); % emerging macrophytes (PctEmermac); % submerged macrophytes (PctSubMac); % other cover (PctOtherCov); % vegetal cover (PctCov); % pool (PctPool); % run (PctRun); % riffle (PctRiffle); % pool+run (PctPoolRun); Mean width (MWidth) (m); Thalweg depth (MThalDep) (cm); Mean depth (MDepth) (cm); Width-depth ratio (WDRatio); Sinuosity ratio (Sin); Slope (Sl) (m/km); % boulder (PctBould); % rock (PctRock); % fines (PctFine); % embeddedness (PctEmbed); Mean fines’ depth (MFineDep) (cm)
Land use (in riparian area): Land use (0-5) and riparian (0-15) scores; % disturbed LU in 100 m buffer (PctDistLU); % undisturbed LU in 100 m buffer (PctUnDistLU); % disturbed LU in 30 m buffer (PctDistLU30); % undisturbed LU in 30 m buffer (PctUnDistLU30)
Biological indices: Fish IBI (0-100)

MD Piedmont sites (246 sites)
Water chemistry: Conductivity (Cond) (µmho/cm); Dissolved oxygen (DO) (mg/L); pH (standard units); Nitrate as N (NO3) (mg/L); Temperature (Temp) (deg C); Sulfate (SO4) (mg/L); Alkalinity (ANC) (µEq/L); Dissolved Organic Carbon (DOC) (mg/L)
Habitat and morphology: Remoteness score (Remote) (0-20); Instream habitat (Instrhab) (0-20); Epifaunal substrate (EpiSub) (0-20); Velocity-depth variability (Vel-dpth) (0-20); Pool quality (Pool) (0-20); Riffle quality (Riffle) (0-20); Channel alteration (Chan) (0-20); Bank stability (BankStab) (0-20); % embeddedness (PctEmbed); % channel with flow (Ch_flow); % shading in channel (Shading); Buffer width (MBufWid) (m); Aesthetic quality (Aesthet) (0-20); Habitat index (PHI) (0-100); Thalweg depth (MThalDep) (cm); Mean width (MWidth) (m); Maximum depth (MaxDepth) (cm); Slope (Sl) (%); Average flow velocity (m/s); Woody debris count; Root count
Land use (in drainage area): % urban land uses (Urban); % agriculture + barren (Agribarr); % forest + wetland + water (Forwetwat)
Biological indices: Fish IBI (1-5); Benthic IBI (1-5); Hilsenhoff Index (0-10)

Table 1-1. Description of the environmental variables, scores and indices available for each state and their units.
1.1.4. IBI prediction methodology using kNN
Due to the small computation time required, a leave-one-out cross validation procedure was
used. Thus, each individual observation was taken out of the database and compared against the
rest of the remaining observations one at a time. Once the first observation was compared to the
rest of the database, it was reintroduced into the database and the next observation was taken out
to repeat the process until all the observations were tested. With this method, there was no need
to separate a validation set because each point was validated against the remaining sites in the
database. Two different similarity functions were used: the Euclidean and the Mahalanobis
distances. Prior to the analysis with the Euclidean distance, the data were log transformed and
scaled to the range [0, 1]. The steps followed are described below (also see Figure 1-1).
1. Best metric selection (using 1 and 10 closest neighbors): this step evaluated the prediction
capability of each environmental variable alone by comparing the IBI value of the site being
tested (the one left out) with the average IBI of the identified closest site or sites (1 and 10). The variables
were then sorted for both cases separately (for k = 1 and k = 10) in decreasing order. The r² of the
linear regression between IBI scores and each environmental variable determined the variable
sorting. One closest neighbor (k = 1) was used because few observations existed in the very low and
upper IBI ranges, so the extreme values would be predicted more accurately from the single
closest observation. With k = 10, observations in the mid IBI range (where observations were more
numerous) would be predicted more reliably, but not the extremes. Therefore, two lists of
sorted variables were obtained (with k = 1 and k = 10).
2. Step-wise predictions using variables from the k=1 sorted list: Following the variable
sorting obtained in step 1 (with k =1) a new variable at a time was introduced. The similarity
function was computed with the selected variables. The prediction was performed by finding the
IBI value (with k=1) or the mean IBI value (with k =2) of the most similar sites at each step. If
the IBI prediction with the new added variable (with either k=1 or k =2) improved the previous
one, the new variable would be kept, otherwise it would not. When a new variable was added,
backtracking was performed. Therefore, previously included variables were excluded one at a
time to see how the predictions were affected. If the exclusion of an old variable improved the
prediction, then this would be eliminated from the model. The reason for backtracking was to
minimize the effect of the order with which the variables were included in the model, as
suggested by Jain and Dubes (1988).
3. Step-wise predictions using the variables from the k=10 sorted list: This step was implemented
in the same way as step 2, except that the average IBI value from the 5 or 10 closest neighbors (k=5 or
k=10) was used for prediction.
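The three steps above can be sketched as a leave-one-out kNN prediction wrapped in a step-wise variable search with backtracking. This is a simplified illustration, not the MATLAB application used in the research. It assumes a single value of k per run (the procedure above tries two at each step), the Euclidean distance on data already scaled to [0, 1], and r² computed as the squared correlation between observed and leave-one-out predicted IBI. Function names and data are hypothetical.

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def loo_knn_r2(X, ibi, features, k):
    """Leave each site out, predict its IBI from its k nearest remaining sites."""
    predictions = []
    for i, xi in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in features))
        predictions.append(sum(ibi[j] for j in others[:k]) / k)
    return r_squared(ibi, predictions)

def stepwise_knn(X, ibi, k):
    """Forward selection with backtracking; returns (selected variables, r2)."""
    n_features = len(X[0])
    # Step 1: rank variables by their individual prediction capability.
    ranked = sorted(range(n_features),
                    key=lambda f: loo_knn_r2(X, ibi, [f], k), reverse=True)
    selected, best = [], 0.0
    for f in ranked:
        score = loo_knn_r2(X, ibi, selected + [f], k)
        if score <= best:
            continue  # discard the variable: no improvement
        selected, best = selected + [f], score
        # Backtracking: drop an older variable if its removal improves the prediction.
        for old in selected[:-1]:
            trial = [g for g in selected if g != old]
            trial_score = loo_knn_r2(X, ibi, trial, k)
            if trial_score > best:
                selected, best = trial, trial_score
    return selected, best

# Hypothetical data: variable 0 tracks IBI closely, variable 1 is unrelated noise.
X = [[0.0, 0.9], [0.1, 0.1], [0.2, 0.5], [0.3, 0.3], [0.4, 0.8],
     [0.5, 0.2], [0.6, 0.6], [0.7, 0.0], [0.8, 0.7], [0.9, 0.4]]
ibi = [14, 18, 22, 26, 30, 34, 38, 42, 46, 50]
selected, best_r2 = stepwise_knn(X, ibi, k=1)
```

In this toy example the informative variable is ranked first and retained, while the noise variable is kept only if it raises the leave-one-out r².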
Figure 1-1. Flow-chart of the step-wise kNN prediction method. Dashed arrow lines represent the steps followed when the environmental variables are sorted with k =1. Dotted arrow lines represent the steps followed when the variable sorting is performed with k =10. Solid arrow lines depict common steps for both cases
1.1.5. IBI prediction using regression and SOM + regression
Prediction of IBI using multiple regression was performed at the state (or region in Maryland)
and the cluster (of sites) scales. The regression equations were obtained following a step-wise
methodology and using 75 percent of all the available observations in each database for model
development. The remaining 25 percent was kept for model validation. The observation subsets
were selected randomly. A diagram summarizing the different steps is presented in Figure 1-2
and described as follows.
1. Database clustering using the SOM (only in cluster-based predictions): Each of the
databases was clustered using all the available chemical and physical environmental variables
shown in Table 1-1. In Ohio, land use data were not used for clustering purposes because this
variable was measured on a very coarse scale (25 percent increments). Land use data were kept out
of the clustering as a precaution because they could distort the SOM site
distribution. The environmental data were converted to their natural logarithms and rescaled to the range [0, 1]
before training the SOM. The number of SOM neurons was determined based on the topographic
and quantization errors. The quantization error is the average (Euclidean) distance between each
data vector and its best-matching unit (BMU), while the topographic error is the proportion of data vectors for which
the first and second closest SOM cells are not adjacent in the grid of neurons (Kiviluoto 1996).
A compromise between the two errors had to be made because the quantization error usually
tends to decrease as the number of SOM neurons increases, and a very large map size was
undesirable given the available data. Hence, the maximum number of SOM neurons was limited
to 100. In Ohio, a SOM with 60 (6 × 10) neurons was used. For Minnesota and Maryland, SOMs
with 63 (7 × 9) and 54 (6 × 9) neurons were used, respectively.
The next step consisted of finding the optimum number of neuron clusters. The k-means
algorithm was used for this purpose (Manolakos et al., 2007). The optimal number of clusters
found using the Davies-Bouldin index (Davies and Bouldin 1979) was 3 in Ohio and Minnesota,
and 5 in Maryland.
2. Selection of a validation set: 25 percent of randomly selected observations in each cluster
were kept aside for validation. The remaining 75 percent was used to develop the regression
models. The validation sets used for the cluster-based and the state-based regressions were the
same in all cases.
3. Best metric selection (at state and cluster level): In the regression development datasets,
each one of the environmental variables was regressed linearly against the fish IBI score. The
environmental variables were then sorted in decreasing order based on the coefficient of multiple
determination (r2). An F-test at the 95% confidence level was performed in each case to check
the statistical significance of the regressions. Only variables that showed statistical significance
(p ≤ 0.05) were included in the model.
4. Linear correlation checking: The correlation coefficient (r) was calculated for each pair of
significant variables selected and sorted in the previous step. In cases in which the variable-variable
r ≥ 0.85, the least discriminant variable (i.e. the one with the smaller IBI-variable r2) was
removed because it was considered not to bring any new relevant information to the system.
5. Step-wise regression and backtracking: This was done by starting the regressions with the
best variable from step 2 and adding the next best one at each step. If the new added variable
increased the previous r2 it was kept, otherwise it was discarded and the next variable was tested.
When a variable was introduced, linear and non-linear regression equations were evaluated. The
function that yielded the highest r2 was selected. Quadratic, logarithmic, exponential, inverse, S-
curve, and power functions were the non-linear model forms tested. Backtracking was also
performed in this case. Steps 2 through 5 were performed using the statistical software SPSS
Version 15® for Windows.
6. Model validation: the equations obtained in step 5 were tested with the validation sets and
the IBI predictions plotted.
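The variable ranking and correlation check in the best metric selection and linear correlation checking steps can be illustrated with the sketch below. It is a simplified, hypothetical example (the function name `rank_and_filter` and the synthetic data are not from this study): the F-test for significance is omitted, and the absolute value of the correlation coefficient is compared with the 0.85 threshold, which is an assumption.

```python
import numpy as np

def rank_and_filter(X, y, names, r_max=0.85):
    """Sort variables by univariate r2 against IBI, then drop redundant ones."""
    # Univariate r2 of each variable against the IBI (squared Pearson r).
    r2 = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])])
    order = np.argsort(-r2)                      # best predictor first
    corr = np.abs(np.corrcoef(X, rowvar=False))  # variable-variable |r|
    kept = []
    for j in order:
        # Keep j only if it is not highly correlated with a better variable.
        if all(corr[j, i] < r_max for i in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic example: "b" is a near-duplicate of "a"; "c" is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = a + 0.01 * rng.normal(size=200)
c = rng.normal(size=200)
y = 2 * a + c + 0.1 * rng.normal(size=200)
kept = rank_and_filter(np.column_stack([a, b, c]), y, ["a", "b", "c"])
```

Of the two nearly identical variables, only the one with the higher r2 against the response survives, mirroring the removal of the least discriminant variable of a correlated pair.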
Figure 1-2. Flow-chart of the step-wise multiple regression method. Dashed lines indicate steps for the cluster-based model only. Dotted lines indicate steps for the whole database model only. Solid lines are common steps for both methods
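The two SOM quality measures used to choose the map size in the clustering step (quantization and topographic errors) can be computed as in the following sketch. It assumes a trained codebook stored row by row over a rectangular lattice and treats the eight surrounding cells as adjacent; the lattice type used in this work is not stated here, so both choices are assumptions, and `som_errors` is a hypothetical helper name.

```python
import numpy as np

def som_errors(data, codebook, grid_shape):
    """Return (quantization error, topographic error) for a trained SOM.

    codebook: array of shape (rows * cols, n_features), row-major over the grid.
    """
    rows, cols = grid_shape
    # Euclidean distance from every data vector to every SOM neuron.
    d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    bmu, second = order[:, 0], order[:, 1]
    quantization = d[np.arange(len(data)), bmu].mean()
    # Grid coordinates of the first and second closest neurons.
    r1, c1 = np.divmod(bmu, cols)
    r2, c2 = np.divmod(second, cols)
    # Assumption: rectangular lattice, 8-neighborhood counts as adjacent.
    adjacent = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2)) <= 1
    topographic = 1.0 - adjacent.mean()
    return quantization, topographic

# Toy 1 x 3 map with one feature.
codebook = np.array([[0.0], [1.0], [2.0]])
data = np.array([[0.1], [1.9]])
qe, te = som_errors(data, codebook, (1, 3))
```

For the toy map both data vectors have adjacent first and second closest neurons, so the topographic error is zero while the quantization error is the mean distance to the closest neuron.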
1.1.6. Chronic and acute toxic chemical effects
After the prediction models were developed, a further fine-tuning was performed by adding a
penalty on those sites in which the reported metal concentrations (only available in Ohio) were
higher than the chronic exposure limit (CCC). Chemical toxicity does not act as a gradient along
concentration. An effect on the biotic community would only be observed once a specific threshold
is reached. Regression and kNN are unlikely to identify the effects of variables that do not act as
gradients in large-scale models. Variables acting as environmental gradients have a greater
overall impact on the biotic community and are more likely to be selected in the predictive
model. For this reason, and since it was deemed important to account for chemical toxicity, a
penalty was included in the calculated IBI when the CCC for some of the available metals was
exceeded. The penalty followed an exponential curve (see Equations 3 and 4). Since no literature
relating IBI change to chemical toxicity was found, the penalty was set empirically by finding the
penalty value that yielded the best fit. The chronic (CCC) and acute (CMC) concentrations for each
metal were obtained using the EPA Water Quality Criteria (EPA 2008a).
$$P = \sum_{i=1}^{n} \left( e^{\alpha_i \times (CONC_i - CCC_i)} - 1 \right)$$ (Equation 3)
$$\alpha_i = \frac{\ln\left(P_{CMC_i} + 1\right)}{CMC_i - CCC_i}$$ (Equation 4)
where P is the final penalty, n is the number of available metal concentration measurements,
CONC_i is the observed concentration of metal i, CCC_i and CMC_i are its chronic and acute criteria, P_CMC is the penalty set for
when the CMC concentration is reached (determined in each case), and α_i is a coefficient calculated from the boundary conditions of the
equation (P = 0 for CONC ≤ CCC, and P = P_CMC for CONC = CMC).
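As a worked illustration of Equations 3 and 4, the sketch below computes the penalty for a single metal and sums it over several metals. The function names and all numeric values (criteria, concentrations, and the penalty at the CMC) are hypothetical and chosen only to make the boundary conditions visible; they are not values used in this work.

```python
import math

def metal_penalty(conc, ccc, cmc, p_cmc):
    """Exponential penalty for one metal (one term of Equation 3)."""
    if conc <= ccc:
        return 0.0                                  # boundary condition: P = 0 at or below the CCC
    alpha = math.log(p_cmc + 1.0) / (cmc - ccc)     # Equation 4
    return math.exp(alpha * (conc - ccc)) - 1.0

def total_penalty(metals):
    """Sum the penalties over (conc, ccc, cmc, p_cmc) tuples, as in Equation 3."""
    return sum(metal_penalty(*m) for m in metals)

# Hypothetical metal: CCC = 9, CMC = 13, penalty of 10 IBI points at the CMC.
below = metal_penalty(5.0, 9.0, 13.0, 10.0)     # below the CCC: no penalty
midway = metal_penalty(11.0, 9.0, 13.0, 10.0)   # halfway: sqrt(11) - 1
at_cmc = metal_penalty(13.0, 9.0, 13.0, 10.0)   # at the CMC: the set penalty
```

Because the curve is exponential, a concentration halfway between the two criteria receives well under half of the penalty assigned at the CMC.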
1.2. Results and discussion
1.2.1. IBI predictions using kNN (k =1 or k= 2)
Minnesota
In this state, the Euclidean distance performed better than the Mahalanobis (r2 = 0.53 and 0.42
respectively with k=1). In both cases, total nitrogen (TN) and percent disturbed land use in the
riparian buffer (LU) were among the most significant variables, although they were not necessarily included in
the final prediction model. With the Mahalanobis function, these variables were discarded in the
backtracking process, and the model required only two variables to yield its best possible prediction (see
Table 1-2). The variables used in the best model (Euclidean distance) were related to nutrient
loads, land use patterns, stream variability, substrate quality, and channel morphology, which
agreed strongly with the variables obtained in the regression model for the whole state (Table
1-3).
Maryland
Both proximity functions performed very similarly, achieving equal final results (r2 = 0.54 with
k = 2 in both cases) and identifying very similar significant environmental variables (see Table
1-2). Land use patterns in the drainage area and alkalinity were key parameters in both cases
(as in the whole dataset regression model), and so was the PHI (unlike in the regression model).
The rest of the selected variables were different. The Mahalanobis function identified aesthetic
quality (Aesthet) as an important parameter, as did the regression model. Even though this
is a qualitative parameter, it seems to have high predictive capability in Piedmont regions.
Ohio
The Mahalanobis function outperformed the Euclidean (r2 = 0.51 and 0.47 respectively with k
=1 in both cases). Habitat parameters (Substrate, Riffle, and Cover) were able to explain a very
large portion of the total biologic variability in both cases. Land use in the riparian corridor was
also found important in both cases. The significant habitat variables found with the distance
functions in Ohio agree again with those found relevant with the regression approach (see Table
1-3). No chemical parameters were selected in this case, with the exception of copper using the
Euclidean distance function.
1.2.2. IBI predictions using kNN (with k =5 or k = 10)
Minnesota
Again, the Euclidean proximity function performed better than the Mahalanobis (r2 = 0.54 and
0.48 respectively with k = 5). Land use patterns in the riparian corridor again explained a large
percentage of the total biotic variability (around 40 percent in both cases). However, land use
related variables were removed from the prediction with the Euclidean function after
backtracking, which might indicate that other included variables (TN and Cond.) could be
strongly related to land use patterns and also account for new information from other non-land
use related stressors (e.g. point sources). This suggested that water quality is the main stressor in
Minnesota’s dataset, especially for heavily degraded sites (Southern watersheds).
With the Mahalanobis function, the land use score (LU) was included and the top chemical
variables (TN, P, TSS) were removed. The use of the covariance matrix (see Eq. 2), which acts as a
whitening transformation that removes the correlation among variables, could explain this
difference between the two distance functions. Since LU was the top metric in the step-wise variable
sorting process, the subsequently added variables that were highly correlated with it were eliminated
accordingly.
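The whitening interpretation of the Mahalanobis distance can be checked numerically. The sketch below uses synthetic, strongly correlated data (not data from this study) and shows that the Mahalanobis distance between two observations equals the Euclidean distance between the same observations after the data are decorrelated with the Cholesky factor of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated synthetic variables (illustrative only).
x1 = rng.normal(size=500)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=500)])

cov = np.cov(X, rowvar=False)
L = np.linalg.cholesky(cov)                          # cov = L @ L.T
Z = np.linalg.solve(L, (X - X.mean(axis=0)).T).T     # whitened data

# Mahalanobis distance between the first two observations ...
diff = X[0] - X[1]
d_mahal = np.sqrt(diff @ np.linalg.inv(cov) @ diff)
# ... equals the Euclidean distance between their whitened counterparts.
d_white = np.linalg.norm(Z[0] - Z[1])

whitened_cov = np.cov(Z, rowvar=False)               # approximately the identity
```

After whitening, a variable that is almost a copy of another contributes little independent distance, which is consistent with the smaller variable sets selected with the Mahalanobis function.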
Maryland
As in the previous case, both functions performed almost identically in terms of predictive
capability (r2 = 0.59 in both cases with k = 10). However, it is remarkable that the Mahalanobis
function needed only two variables (agricultural/barren land and velocity-depth variability) to
achieve this result, while the Euclidean needed six. Both variables used with the Mahalanobis
function were also used in the regression model and in the Euclidean distance function. Other
variables used in the regression model (e.g. ANC) were also included with the Euclidean
distance.
Ohio
Unlike in the previous case, the Euclidean distance obtained better overall results (r2 = 0.53 versus
0.47 with k = 5). The Mahalanobis distance needed only five variables versus the nine needed by the
Euclidean, which matched the variables selected with the regression model very well. The
Mahalanobis function needed only three physical habitat parameters (embeddedness, riffle and
pool quality) to explain a large part of the total variability (44 percent). The remaining variability
was explained with the inclusion of cadmium and copper concentrations. Cadmium was also
identified as an important variable in the cluster-based regression predictions (cluster number 2,
see Table 1-3) and in research conducted by Dyer et al. (2000). The inclusion of metal toxicity
penalties helped slightly improve the overall model performance in Ohio (best previous model r2
= 0.51).
| Location | Similarity function | k | RMSE* | r2 | Variables used |
|---|---|---|---|---|---|
| OH | Mh | 1 | 7.28 | 0.51 | Subs, Riffle, Cov, Embed, Forwet, Urban |
| OH | Mh | 5 | 7.06 | 0.47 | Riffle, Pool, Embed, Cd, Cu |
| OH | Eu | 1 | 7.77 | 0.47 | Pool, Riffle, Cov, Subs, Rip, Cu, Urban |
| OH | Eu | 5 | 6.76 | 0.53 | Riffle, Pool, Subs, Rip, Embed, SO4, TKN, pH, Urban |
| MN | Mh | 1 | 21.74 | 0.41 | TN, PctWoody |
| MN | Mh | 5 | 20.52 | 0.48 | LU, Rip, Cond, MFineDep, PctCov |
| MN | Eu | 1 | 19.43 | 0.53 | LU, TN, Channel, PctBoulder, PctRun, PctRiffle |
| MN | Eu | 5 | 19.42 | 0.54 | TN, Cond, TSS, PctUnderCut, MDepth |
| MD | Mh | 2 | 0.66 | 0.54 | Agribarr, Urban, PHI, ANC, Aesthet, SO4, DOC |
| MD | Mh | 10 | 0.63 | 0.59 | Agribarr, Vel-dpth |
| MD | Eu | 2 | 0.67 | 0.54 | Agribarr, ANC, Urban, PHI, Sl, DO |
| MD | Eu | 10 | 0.63 | 0.59 | Agribarr, Urban, ANC, PHI, Vel-dpth, DOC |

* RMSE for different IBI predictions. IBI scales: 12 to 60 in Ohio, 0 to 100 in Minnesota, 1 to 5 in Maryland.
Table 1-2. Summary of IBI predictions using the kNN methodology. The different functions (Mh = Mahalanobis; Eu = Euclidean) and selected number of closest neighbors (k) are specified. Final selected variables in each case are also listed.
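Because the three states use different IBI scales, the RMSE values in Table 1-2 are not directly comparable. One possible way to put them on a common footing, shown here only as an illustrative sketch and not as part of the original analysis, is to divide each RMSE by the width of the corresponding IBI scale. The lowest RMSE per state from Table 1-2 is used below; the normalization itself is an assumption.

```python
# IBI scale ranges from the Table 1-2 footnote.
scales = {"OH": (12, 60), "MN": (0, 100), "MD": (1, 5)}
# Lowest RMSE reported for each state in Table 1-2.
best_rmse = {"OH": 6.76, "MN": 19.42, "MD": 0.63}

def normalized_rmse(rmse, lo, hi):
    """RMSE expressed as a fraction of the IBI scale width (an assumed normalization)."""
    return rmse / (hi - lo)

nrmse = {state: normalized_rmse(best_rmse[state], *scales[state]) for state in scales}
for state, value in nrmse.items():
    print(f"{state}: {100 * value:.1f}% of the IBI range")
```

On this basis the best errors in all three states fall between roughly 14 and 20 percent of the respective IBI range.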
1.2.3. Regression models
Minnesota
Whole state predictions
The metrics able to explain the largest part of the total variability were conductivity, total
nitrogen, TSS, land use score, mean width and mean thalweg depth. These variables explained
68 percent of the total variability in the equation development set and 52 percent in the validation
set. In the latter, the main source of error was the sites with very poor IBI scores (IBI = 0).
These observations corresponded to highly urbanized areas located in the Southern part of the
state (see cluster 1 in Figure 1-3). Water chemical quality, rather than the existing physical
conditions, is the main cause of the severely impaired biotic integrity at those sites. Due to the
lack of metal concentrations in Minnesota’s database, a toxicity penalty could not be
included. Other selected variables, such as stream width and thalweg depth, proved relevant in
less degraded areas (Northern watersheds) and contributed to an overall better fit of the
regression in the development dataset, but led to misprediction in highly urbanized areas during
validation (Table 1-3).
Cluster-based predictions
The SOM yielded three clusters that were very clearly separated geographically (see Figure 1-3).
The observations located in the Southern areas (cluster 1) showed significantly lower IBI and
habitat quality, with higher accumulations of fine sediment in the substrate and larger percentage
of disturbed land use in the riparian strip. Those areas were also associated with higher total
nitrogen and TSS concentrations as well as higher conductivity. Clusters 2 and 3 (located in the
Upper Mississippi and St. Croix River Basins) had similar IBI scores and chemical quality. The
main differences between them were due to physical habitat quality, especially substrate and
channel characteristics.
The IBI cluster-based prediction significantly improved on the results obtained using the entire state
database. The variables selected (different for each cluster) were able to explain 80 percent of the
total variability in the regression development set and 59 percent in the validation set.
Identification of sites with similar environmental characteristics helped improve biotic integrity
prediction. Effects of more local stressors could be better identified using sub-datasets with
similar environmental properties. At larger scales, their effects are blurred by other more
ubiquitous variables. Stressors like nutrient loads or land use patterns still explain a large part of
the biotic variability at the cluster level. However, segregation of the points that were
mispredicted using the whole set (mainly those in cluster 1) led to significant improvement
despite the lack of metal toxicity data. The variables used in the prediction model in each cluster
are shown in Table 1-3.
Maryland, Piedmont Areas
Whole dataset predictions
In this case, the largest part of biotic variability was explained with five variables: percentage of
agriculture/barren lands in the drainage area, alkalinity, aesthetic quality, velocity-depth
variability, and mean width. The total variability explained was 66 percent in the regression
development dataset and 59 percent in the validation set. Other chemical variables found of
relevance in previous models (mainly total nitrogen and conductivity) were within the top seven
best metrics in Maryland and had a high degree of correlation with ANC (r=0.75) and
agricultural land uses (r=0.69) respectively.
Cluster-based predictions
Five clusters of SOM neurons were identified. As shown in Figure 1-3, clusters 4 and 5 were
concentrated in a specific region. Cluster 1 IBI scores were significantly higher than the rest,
while cluster 5 scores were significantly lower. Clusters 2, 3 and 4 showed similar median IBI
values, with wide ranges and overlapping among them. For this reason, and because a minimum
number of observations is required for each cluster to develop the equations, clusters 2, 3 and 4
were merged. In this case, the regression explained up to 71 percent of the total
variability in the development dataset, and 62 percent in the validation dataset.
The best predictive parameters for cluster 1 were (in decreasing order): ANC, velocity-depth
variability, urban land use, agricultural/barren land uses, and shading. These results strongly
agree with the metrics found with the whole dataset, which are large-scale variables with the
exception of shading. Shading is a variable considered important only at the local scale. The best
metrics in clusters 2, 3 and 4 were (in decreasing order): agricultural and barren land uses, ANC,
PHI, percentage of channel covered by flow, and, again, shading. The best metrics in cluster 5
were aesthetic quality, urban land use, conductivity, and epifaunal substrate quality.
Ohio
Whole state predictions
Forty-seven percent of the total variability was explained with the regression development set
and 41 percent with the validation set. The top metrics used in the regressions were (in
descending order): embeddedness, substrate quality, pool quality, sulfate concentration, hardness
and TKN. The top seven variables were habitat-related, while sulfate concentration, BOD,
conductivity, arsenic concentration, hardness and TKN were the top chemicals. The results
obtained in this case are very similar to those obtained by Dyer et al. (2000), who identified
habitat and morphological parameters as the most important variables affecting biotic integrity,
with chemical variables playing a secondary role at the state scale.
Cluster-based predictions
Three very prominent clusters of SOM neurons were found. The IBI distribution was
significantly different in all three, with cluster 1 having the highest IBI scores and cluster 3 the lowest
(see Figure 1-3). The sites within one watershed usually belonged to only one cluster, with few
exceptions. The cluster-based predictions significantly outperformed the non-clustered model in
the model development set (r2 = 0.62). However, the validation results were very similar to those of the
non-clustered model (r2 = 0.44).
Cluster 1 was dominated by habitat metrics, especially those related to substrate quality
(embeddedness and substrate quality were within the top variables). Sulfate concentration, pH
and conductivity were the top chemical variables and in the top ten overall. The variables
included in the regression model for cluster 1 are shown in Table 1-3.
Cluster 2 biotic quality was again dominated by habitat-related variables (riffle, pool, and
substrate qualities). However, the chemical variables differed significantly, with zinc, nitrate,
nitrite, and cadmium concentrations as the most relevant. Unlike in cluster 1, hardness, pH,
conductivity, and sulfate concentrations were not among the top chemical variables.
In cluster 3, biotic integrity was clearly driven by water quality. Eutrophication problems seemed
to be the main environmental impact, with BOD and nutrient input among the top predictors. Zinc
was again an important stressor. Conductivity and sulfate were also important variables
in the model. Riparian quality was the only non-chemical parameter that placed in the top ten in
the variable selection process, which might be an indication of the importance of functional
stream buffers in severely impaired areas. The importance of the riparian buffer quality in
heavily degraded sites was also identified in Minnesota’s cluster 1 (see Table 1-3). Interestingly,
water quality was also the most important biotic integrity driver in that case. A functional
riparian corridor is of capital importance in regulating sediment and chemical delivery from the
surrounding lands, especially in severely impaired sites.
Figure 1-3. Top: site cluster distribution in Minnesota (left), Maryland (Piedmont sites) (center), and Ohio (right). In Minnesota, cluster 1 is concentrated in Southern watersheds. In Maryland, clusters 4 and 5 are concentrated in a specific region and, in Ohio, sites located in the same watershed usually belong to the same cluster. Bottom: Self-organizing Map neuron lattice and box plots with the cluster-based IBI values. The red line in the box plots represents the median cluster value, the top line is the 75th percentile, and the bottom line is the 25th percentile.
| Scale | State | Variables | State | Variables | State | Variables |
|---|---|---|---|---|---|---|
| Whole database | MN | Cond (-3.1 × 10^-5, Q) | MD | Agribarr/10 (-.035, .488, Q) | OH | Embed (2.397, -17.843, Q) |
| | | TN (.066, -2.58, Q) | | ANC (-.00048, L) | | Subs (.273, L) |
| | | TSS (-.235, L) | | Aesthet (-.006, .15, Q) | | Pool (.563, L) |
| | | MWidth (-27.6, I) | | Vel-dpth (.049, L) | | SO4 (-3.957, Lg) |
| | | MThalDep (-496.62, I) | | MWidth (.02, L) | | Hard/100 (-.264, 3.031, Q) |
| | | PctRun (-.007, .77, Q) | | | | TKN (.846, I) |
| | | LU+1 (9.5, Lg) | | | | Cu (-.087, .749, Q) |
| | | Constant (56.81) | | Constant (.597) | | Constant (66.260) |
| | | r2 = .68 | | r2 = .66 | | r2 = .47 |
| | | r2 validation = .52 | | r2 validation = .58 | | r2 validation = .41 |
| | | RMSE* = 27.16 | | RMSE* = 0.71 | | RMSE* = 7.23 |
| Cluster 1 | MN | PctDistLU (0.0015, Q) | MD [2] | ANC (-0.069, Q) | OH | Embed (-4.541, L) |
| | | TSS (-11.5, Lg) | | Vel-Dpth (.182, L) | | SO4 (-.031, L) |
| | | PctOverVeg+1 (50.97, I) | | Agribarr (-.081, Q) | | QHEI (-967.42, I) |
| | | Rip (0.273, Q) | | Urban (-.046, .1, Q) | | pH (-13.80, 219.31, Q) |
| | | | | Shading (.121, Q) | | TKN (.686, -5.342, Q) |
| | | | | | | Hard (.018, L) |
| | | | | | | Chan (-.098, 2.742, Q) |
| | | | | | | NO3 (.01, -.610, Q) |
| | | | | | | NO2 (9.887, L) |
| | | Constant (10.66) | | Constant (3.84) | | Constant (-824.408) |
| Cluster 2 [1] | MN | TN (-0.398, Q) | MD [2] | Agribarr (-.178, .159, Q) | OH | Riffle (1.703, Q) |
| | | PctEmbed (-0.367, L) | | ANC (-.223, L) | | Pool (.649, L) |
| | | MThalDep (-314.2, I) | | PHI (.293, L) | | Zn (-.85, .019, E) |
| | | MDepth (1075.9, I) | | CH_Flow (.188, L) | | Subs (.497, L) |
| | | PctPool+1 (-145.7, I) | | Shading (-.181, L) | | NO3 (.101, -2.074, Q) |
| | | | | | | NO2 (-293.3, 60.71, Q) |
| | | | | | | Cd (4.097, L) |
| | | | | | | SO4 (-1.897, Lg) |
| | | Constant (113.81) | | Constant (3.51) | | Constant (29.256) |
| Cluster 3 [3] | MN | Temp (-822.4, S) | MD | Aesthet (.075, 1.05, P) | OH | BOD (23.295, I) |
| | | Sl (-1.48, L) | | Urban (-.67, Lg) | | Cond (-.001, L) |
| | | MThalDep (-623.04, I) | | Cond (-7.08 × 10^-7, Q) | | Zn (.001, -.124, Q) |
| | | TN (-6.69, E) | | EpiSub (.007, .23, E) | | TKN (117.23, .021, P) |
| | | | | | | SO4 (291.707, I) |
| | | Constant (957.72) | | Constant (4.075) | | Constant (-98.829) |
| | | r2 = .81 | | r2 = .71 | | r2 = .62 |
| | | r2 validation = .59 | | r2 validation = .62 | | r2 validation = .44 |
| | | RMSE* = 25.81 | | RMSE* = 0.66 | | RMSE* = 6.78 |

[1] Clusters 2, 3, and 4 in Maryland; [2] regressions performed with standardized values; [3] Cluster 5 in Maryland.
* RMSE for different IBI predictions. IBI scales: 12 to 60 in Ohio, 0 to 100 in Minnesota, 1 to 5 in Maryland.
L = linear; E = exponential; Lg = natural log; I = inverse; Q = quadratic; P = power; S = S-curve.
Table 1-3. Summary of the step-wise regressions for IBI prediction for the development and validation sets. The variables used in each case are listed together with their coefficients and curve type (in parentheses). Variables in italics in the whole database regressions indicate variables also used in some of the kNN predictions. Results in Ohio after including metal toxicity penalties
1.3. Conclusions
• Biotic integrity is the result of both offstream and instream factors that generally behave in a
highly non-linear manner. Offstream (allochthonous) variables, such as land use patterns in the
drainage area or the riparian corridor, are the variables able to account for the largest part of the
total variability, especially when predictions are performed at larger scales. Other instream
environmental parameters are also able to explain a large part of the variability. Habitat
parameters, nutrient and sulfate concentrations, conductivity, alkalinity, hardness and some
morphologic variables are equally important and potentially good indicators of stream health by
themselves or when combined. Most of these variables depend on regional (catchment)
conditions. Some local parameters (e.g. instream shading, high metal concentrations) showed an
important effect when the scale was reduced or some variations were made to account for their
impacts. Hence, the models confirmed that biotic integrity is sensitive to different stressors
acting at different geographic scales.
• SOM-based clustering of sites successfully improved biotic integrity predictions in all three
states. The improvement was generally more noticeable in the model development datasets than
in the validation datasets. Finding the optimum degree of similarity among sites, while having
enough observations to develop and validate a regression model is paramount to extract the
maximum possible knowledge out of the available data.
• Since very few sites had metal concentrations above the chronic toxicity threshold,
the models were generally unable to identify their effect on biotic integrity because the overall
effect was small. Metals have an important effect only when a threshold is met. Other variables
such as habitat parameters or nutrient concentration (especially nitrogen) behaved more as
gradients along their measurement ranges.
• In Ohio, the inclusion of a penalty to account for toxic effects on IBI improved the model
predictions. The inclusion of such a penalty in Minnesota and Maryland might also improve the
IBI predictions, but metal concentration data were not available. In Minnesota, the most
significant mispredictions in both the clustered and non-clustered models occurred when extremely
poor IBI observations were present. The clustered model helped improve the predictions, but
those sites were still the main source of error. Chemical toxicity is suspected to be the cause of
those extremely poor IBI values.
• An attempt to predict biotic integrity using similarity functions was performed. The results
were quite encouraging. One of the main advantages of using such techniques is that they need
very little computational time. For this reason, the predictions could be easily validated with a
leave-one-out approach instead of splitting the data into a model development and a validation
dataset. This methodology proved to be robust with crudely-scaled or qualitative data such as
Ohio’s land use and habitat parameters. In all three states, more than 50 percent of the total biotic
variability was explained without need of previous clustering.
• Two different types of proximity functions were tested for IBI prediction purposes: the
Mahalanobis and the Euclidean functions. In general, the Euclidean distance function obtained
better results (higher r2). However, to achieve such results it used a larger number of variables
than the Mahalanobis function. The Mahalanobis function achieved similar or equal results
using fewer variables. The covariance matrix used in the calculation of this function acts as
a whitening transformation that removes correlation. Therefore, variables that did
not bring any new relevant information to the system were discarded.
• With the kNN technique, the optimum number of closest neighbors for prediction purposes
depends on the available data distribution. The number of existing observations along the IBI
scoring scale determines the model prediction performance in each IBI range. When very few
observations exist (i.e. for the extreme IBI values), IBI is predicted more accurately using lower
k values, assuming that some other observation in the database has similar environmental
conditions. If a high k value is used in such cases, a “smoothing effect” will occur because some
of the selected neighbors will be quite different from the target site. Determining a Maximum
Allowable Distance Threshold between the site being predicted and its neighbors could be a
solution to finding the optimum k in each case. The larger the number of observations in each
data range and the more relevant environmental variables to be compared against, the more
reliable the predictions will be.
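The Maximum Allowable Distance Threshold suggested in the last conclusion could look like the sketch below. This is only one possible reading of the idea, not a method implemented in this work: the function name, the default threshold value, and the fallback to the single nearest site are all assumptions.

```python
import numpy as np

def knn_predict_with_threshold(x, X, y, k_max=10, max_dist=0.5):
    """Predict IBI from up to k_max neighbors that lie within max_dist of x."""
    d = np.linalg.norm(X - x, axis=1)      # Euclidean distance to every site
    order = np.argsort(d)[:k_max]          # the k_max closest sites
    within = order[d[order] <= max_dist]   # keep only sufficiently similar ones
    if within.size == 0:
        within = order[:1]                 # assumed fallback: nearest site only
    return y[within].mean(), within.size

# Toy database: two similar sites and one very different site.
X = np.array([[0.0], [0.1], [5.0]])
y = np.array([10.0, 20.0, 50.0])
pred_a, k_a = knn_predict_with_threshold(np.array([0.05]), X, y, k_max=3)
pred_b, k_b = knn_predict_with_threshold(np.array([4.0]), X, y, k_max=3)
```

The effective k then varies from site to site: densely sampled conditions are averaged over several neighbors, while isolated sites rely on fewer neighbors and avoid the smoothing effect described above.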
2. Chapter 2: Large-scale biologic integrity prediction based on
environmental similarity using instream data and regional and local offstream
characteristics
2.1. Methodology
2.1.1. Data and study area
The research was based on 429 observations in the state of Ohio. Each observation had biologic
integrity measurements along with instream and offstream environmental characteristics. The
biologic and instream data were collected and compiled by the Ohio Environmental Protection
Agency between years 1996 and 2000. The offstream data were obtained using a Geographic
Information System (GIS). The biological and instream environmental data were collected in the
same stream segment with no more than a 5-day time difference. To our knowledge, all data
were collected in base-flow conditions and extreme events (e.g. a spill) were not reported.
Biologic integrity was measured with the fish Index of Biologic Integrity. This is a discrete
score ranging from 12 (very poor biologic integrity) to 60 (excellent biologic integrity). The IBI
is composed of 12 metrics that describe the species richness and composition, the trophic
composition, and the fish abundance and condition of the fish community (Karr et al. 1986;
Ohio_EPA 1987).
Instream variables consisted of water quality and habitat quality metrics (Table 2-2). The habitat
parameters consisted of the metrics from the Qualitative Habitat Evaluation Index (QHEI)
(Rankin 1989). The QHEI and its metrics are discrete scores with different ranges of
measurement (see Table 2-2). Another score quantifying the percentage of fine sediment in the
river bed (embeddedness) was also available (embeddedness is not used as a QHEI metric itself,
but as a penalizing factor for the QHEI’s substrate and channel quality metrics).
Offstream environmental variables were grouped in three main categories: upstream land use,
stream fragmentation, and point source information. In order to obtain the upstream land use,
each site’s watershed was delineated using a 30-meter resolution Digital Elevation Map (DEM)
with ArcGIS Spatial Analyst. Subsequently, the percentage of each upstream land use was
calculated at two different scales: the regional scale referred to the whole upstream segment,
while the local scale referred only to the 2 miles upstream of the observation site. Land uses for the
whole catchment and the 100 and 30-meter buffers were obtained for both scales. A 30-meter
width was chosen because this was the minimum possible distance given the data resolution and
is beyond the minimum recommended 15-meter width, which is effective under most conditions (Castelle et
al. 1994). A 100-meter width was chosen because this is an intermediate value between 3 and
200 meters, the minimum and maximum effective widths depending on site-specific conditions
according to Castelle et al. (1994). Land use percentages were obtained using the Thematic
Raster Summary function within Hawth’s Analysis Tools for ArcGIS (Beyer 2004). The land
cover categories defined in the 2001 National Land Cover Dataset (NLCD) were used (USGS
2008b). Sixteen different land use categories existed in the area of study and are listed in Table 2-1.
The Open Water (OW) land use category was only calculated for the drainage and catchment
areas, not for the buffers. Drainage areas (DA) for each site were also obtained. The
fragmentation and point source metrics (Table 2-2) were calculated using the National
Hydrography Datasets (NHD) (USGS 2008a). The ArcGIS Utility Network Analyst was used to
trace upstream or downstream of a specific site. Major dams (only with DA ≥ 2.59 km2) and point
sources (major and minor waste water treatment plants and major industrial dischargers) were
obtained from the National Inventory of Dams (USACE 2005) and the Permit Compliance
System database (EPA 2008c).
The observations were distributed in the whole state of Ohio. Most of the observations belonged
to either the Eastern Corn Belt Plains (ECBP), the Huron/Erie Lake Plains (HELP), or the
Erie/Ontario Lake Plains (EOLP) ecoregions with 180, 73, and 100 observations respectively.
The Western Allegheny Plateau (WAP) and the Interior Plateau (IP) ecoregions only had 36 and
40 observations respectively. The HELP and ECBP ecoregions have the highest nutrient
background concentrations, the EOLP and IP ecoregions have intermediate levels of nutrients,
while the WAP ecoregion has the lowest levels (Rankin et al. 1999).
Drainage areas were also very diverse, ranging from 1.55 to 16,428 Km2. In this research, sites
were not subdivided by ecoregion or stream size and were introduced into the model all at once.
Since the model was based on assessing environmental similarities, groups with higher/lower
concentrations of nutrients and/or ionic strength would tend to group together if chemical
stresses were a major source of overall biologic variability. Moreover, we wanted to “let the data
speak”.
Figure 2-1. From left to right and top to bottom: (1) upstream stream network carrying waste water; (2) upstream stream network fragmentation; (3) basin-scale dams in the downstream main channel; (4) basin-scale stream network fragmentation
Table 2-1. Description, percentage quartiles, and individual IBI predicting power for the different NLCD land use categories present in the Ohio database. RDA = drainage area; R100 = regional 100-meter buffer; R30 = regional 30-meter buffer; LDA = local catchment area; L100 = local 100-meter buffer; L30 = local 30-meter buffer area; a = best prediction at 423 branches; b = best prediction at 328 branches; c = best prediction at 233 branches
Name | Description | Quartiles | R2 | Name | Quartiles | R2 | Name | Quartiles | R2
RDA_OW | Open water | 0.10-0.25-0.60 | 0.292a | ---- | ---- | ---- | ---- | ---- | ----
RDA_DevO | Urban Open Space | 5.23-6.25-9.76 | 0.212a | R100_DevO | 5.26-6.84-10.33 | 0.300a | R30_DevO | 4.44-6.26-10.01 | 0.234a
RDA_DevL | Urban low intens. | 1.22-2.39-6.06 | 0.293a | R100_DevL | 0.94-1.85-3.79 | 0.274a | R30_DevL | 0.79-1.51-3.48 | 0.245a
RDA_DevM | Urban medium intens. | 0.24-0.64-1.64 | 0.253a | R100_DevM | 0.10-0.40-0.87 | 0.275a | R30_DevM | 0.07-0.26-0.62 | 0.259a
RDA_DevH | Urban high intens. | 0.07-0.26-0.75 | 0.223a | R100_DevH | 0.00-0.13-0.38 | 0.320a | R30_DevH | 0.00-0.07-0.20 | 0.198a
RDA_Bar | Barren | 0.00-0.01-0.05 | 0.239a | R100_Bar | 0.00-0.00-0.02 | 0.184b | R30_Bar | 0.00-0.00-0.01 | 0.182a
RDA_ForD | Deciduous forest | 5.29-9.45-18.60 | 0.281a | R100_ForD | 8.22-16.06-30.56 | 0.292a | R30_ForD | 9.82-20.90-38.01 | 0.368a
RDA_ForE | Evergreen forest | 0.01-0.08-0.25 | 0.243a | R100_ForE | 0.00-0.09-0.24 | 0.285a | R30_ForE | 0.00-0.05-0.20 | 0.225a
RDA_ForM | Mixed forest | 0.00-0.00-0.03 | 0.294b | R100_ForM | 0.00-0.00-0.05 | 0.261a | R30_ForM | 0.00-0.00-0.04 | 0.232a
RDA_Shr | Shrub/scrub | 0.00-0.00-0.03 | 0.221a | R100_Shr | 0.00-0.00-0.05 | 0.223a | R30_Shr | 0.00-0.00-0.02 | 0.208a
RDA_Herb | Herbaceous | 0.35-0.89-1.38 | 0.225a | R100_Herb | 0.29-1.00-1.81 | 0.296a | R30_Herb | 0.23-1.00-2.08 | 0.312b
RDA_Hay | Hay/pasture | 3.16-7.67-14.60 | 0.385a | R100_Hay | 2.97-8.20-13.60 | 0.322a | R30_Hay | 3.13-7.35-12.57 | 0.312a
RDA_Crop | Crops | 40.49-60.84-75.58 | 0.257a | R100_Crop | 34.65-54.32-70.27 | 0.231a | R30_Crop | 30.69-50.23-64.82 | 0.304a
RDA_WetW | Woody wetlands | 0.00-0.04-0.20 | 0.223a | R100_WetW | 0.00-0.13-0.58 | 0.270a | R30_WetW | 0.00-0.26-0.95 | 0.242a
RDA_WetH | Herbaceous wetlands | 0.00-0.02-0.09 | 0.261b | R100_WetH | 0.00-0.03-0.20 | 0.236b | R30_WetH | 0.00-0.01-0.36 | 0.172b
RDA_Oth | Other | 0.00-0.00-0.00 | 0.012a | R100_Oth | 0.00-0.00-0.00 | 0.012a | R30_Oth | 0.00-0.00-0.00 | 0.012a
LDA_OW | Open water | 0.00-0.19-0.96 | 0.200b | ---- | ---- | ---- | ---- | ---- | ----
LDA_DevO | Urban Open Space | 4.97-7.22-13.73 | 0.289a | L100_DevO | 4.17-7.80-14.68 | 0.208a | L30_DevO | 3.00-6.42-15.23 | 0.158a
LDA_DevL | Urban low intens. | 0.42-2.42-11.19 | 0.183a | L100_DevL | 0.30-2.06-6.99 | 0.272b | L30_DevL | 0.00-1.20-6.59 | 0.186a
LDA_DevM | Urban medium intens. | 0.00-0.27-2.01 | 0.159a | L100_DevM | 0.00-0.00-1.78 | 0.107b | L30_DevM | 0.00-0.00-1.12 | 0.100a
LDA_DevH | Urban high intens. | 0.00-0.00-0.80 | 0.140b | L100_DevH | 0.00-0.00-0.16 | 0.048b | L30_DevH | 0.00-0.00-0.00 | 0.071c
LDA_Bar | Barren | 0.00-0.00-0.00 | 0.020a | L100_Bar | 0.00-0.00-0.00 | 0.005b | L30_Bar | 0.00-0.00-0.00 | 0.004b
LDA_ForD | Deciduous forest | 4.47-13.59-28.73 | 0.285b | L100_ForD | 7.39-24.43-46.14 | 0.335a | L30_ForD | 7.33-29.34-53.65 | 0.334a
LDA_ForE | Evergreen forest | 0.00-0.00-0.25 | 0.098a | L100_ForE | 0.00-0.00-0.15 | 0.064a | L30_ForE | 0.00-0.00-0.00 | 0.066a
LDA_ForM | Mixed forest | 0.00-0.00-0.00 | 0.077a | L100_ForM | 0.00-0.00-0.00 | 0.042a | L30_ForM | 0.00-0.00-0.00 | 0.036c
LDA_Shr | Shrub/scrub | 0.00-0.00-0.00 | 0.117a | L100_Shr | 0.00-0.00-0.00 | 0.069c | L30_Shr | 0.00-0.00-0.00 | 0.032c
LDA_Herb | Herbaceous | 0.00-0.65-1.68 | 0.130a | L100_Herb | 0.00-0.34-1.84 | 0.117b | L30_Herb | 0.00-0.00-1.61 | 0.154b
LDA_Hay | Hay/pasture | 0.00-5.51-12.92 | 0.214a | L100_Hay | 0.00-3.97-10.85 | 0.196a | L30_Hay | 0.00-1.67-9.45 | 0.161a
LDA_Crop | Crops | 17.20-44.71-69.59 | 0.152a | L100_Crop | 8.55-30.06-59.61 | 0.190b | L30_Crop | 6.33-25.24-53.97 | 0.202a
LDA_WetW | Woody wetlands | 0.00-0.00-0.47 | 0.128a | L100_WetW | 0.00-0.00-1.28 | 0.113a | L30_WetW | 0.00-0.00-2.24 | 0.092a
LDA_WetH | Herbaceous wetlands | 0.00-0.00-0.19 | 0.124b | L100_WetH | 0.00-0.00-0.63 | 0.051a | L30_WetH | 0.00-0.00-2.24 | 0.087a
LDA_Oth | Other | 0.00-0.00-0.00 | 0.000 | L100_Oth | 0.00-0.00-0.00 | 0.000 | L30_Oth | 0.00-0.00-0.00 | 0.000
Table 2-2. Description, quartile values, and individual IBI predicting power for the water quality, habitat, point source, and stream fragmentation metrics
a = best prediction at 423 branches; b = best prediction at 328 branches; c = best prediction at 233 branches; * Downstream distances until basin outlet; ** Values were multiplied by 1,000
Water quality metrics
Name | Units | Type | Quartiles | R2
Cond | μmho/cm | | 575.00-722.00-948.75 | 0.02a
DO | mg/L | | 6.50-7.70-8.70 | 0.02c
pH | SU | | 7.68-7.86-8.05 | 0.01b
TSS | mg/L | | 5.00-13.00-32.00 | 0.01b
TP | mg/L | Total | 0.11-0.22-0.42 | 0.00a
NH4 | mg/L | As N | 0.05-0.05-0.13 | 0.07b
NO2 | mg/L | As N | 0.02-0.02-0.05 | 0.06b
TKN | mg/L | | 0.40-0.60-1.00 | 0.09c
NO3 | mg/L | As N | 0.30-1.09-2.94 | 0.01c
Hard | mg/L | As CaCO3 | 239.75-304.00-374.25 | 0.02a
BOD | mg/L | | 2.00-2.00-3.23 | 0.12b
Ca | mg/L | Total | 62.00-77.00-91.00 | 0.01b
Mg | mg/L | Total | 19.00-26.00-36.00 | 0.04b
Cl | mg/L | Total | 26.00-42.00-84.00 | 0.02a
SO4 | mg/L | Total | 52.00-81.00-153.25 | 0.03a
As | μg/L | Total | 2.00-2.00-3.00 | 0.08c
Cd | μg/L | Total | 0.20-0.20-0.20 | 0.01a
Cu | μg/L | Total | 10.00-10.00-10.00 | 0.01c
Fe | μg/L | Total | 281.75-602.00-1212.50 | 0.01b
Pb | μg/L | Total | 2.00-2.00-2.00 | 0.00a
Zn | μg/L | Total | 10.00-10.00-19.00 | 0.01b
Temp | Deg C | In water | 19.70-21.70-23.43 | 0.06b
Habitat, stream fragmentation, and point source metrics
Name | Units | Description | Quartiles | R2
Subs | 0-20 | Substrate quality | 10.0-13.0-16.0 | 0.23a
Embed | 0-4 | Embeddedness | 2.5-3.0-4.0 | 0.28a
Rip | 0-10 | Riparian and bank qualities | 3.5-5.0-7.0 | 0.17b
Cover | 0-20 | Instream vegetal cover | 7.0-11.0-14.0 | 0.13a
Riffle | 0-8 | Riffle and run quality | 0.0-2.0-4.5 | 0.24a
Pool | 0-12 | Pool and glide quality | 5.0-8.0-10.0 | 0.23a
Chan | 0-20 | Channel morphology score | 9.0-13.0-16.0 | 0.21a
Grad | 0-10 | Gradient score | 6.0-8.0-10.0 | 0.09c
DA | Km2 | Drainage area | 32.89-103.60-344.73 | 0.19c
UPS_Con | % | Upstream connected length / total upstream length | 94.5-100.0-100.0 | 0.19a
SITE_Con* | % | Total connected length / basin network length | 3.3-18.9-34.5 | 0.47a
DW_MainDf* | Km | Main channel downstream length / # downst. dams | 28.2-42.9-179.3 | 0.38a
U_Df | Km | Upstream network length / # upstream dams | 26.3-83.2-226.2 | 0.25a
Avg_Df* | Km | Average of D_MainDf and U_Df | 54.4-110.2-198.7 | 0.26a
Uflood_len | m2/km | Upstream flooded area / upstream network length | 0.0-0.0-630.2 | 0.17a
UPS_stor_DA | m3/Km2 | Upstream dam storage / DA | 0.0-0.0-1459.5 | 0.15a
UPS_stor_len | m3/km | Upstream dam storage / upstream network length | 0.0-0.0-1,202.4 | 0.16a
UPS_Flooded | % | Upstream flooded area / DA | 0.00-0.00-0.05 | 0.17a
Dfl_MainLen* | m2/km | Downstream flooded area / downstream main channel length | 38,171.0-68,393.2-137,272.0 | 0.46a
Dsto_MLen* | m3/km | Downstream dam storage / main channel length | 231,248.8-370,196.2-1,487,654 | 0.44b
Flow_PS | % | % of upstream network carrying wastewater | 0.0-3.7-11.5 | 0.20b
PSDisch_LT | m3/d/Km | Point source discharge / upstream network length | 0.0-1.5-32.6 | 0.20b
PSDisch_LPS | m3/d/Km | Point source discharge / distance from site to all point sources | 0.0-23.8-320.6 | 0.21b
PS_LPS | #/km | # point sources / distance to all point sources | 0.0-43.5-90.1** | 0.21a
PS_LTOT | #/km | # point sources / upstream network length | 0.0-3.7-10.7** | 0.26b
PSDisch_DA | m3/d/Km2 | Point source discharge / DA | 0.0-4.5-70.4 | 0.21b
LPS-DA | Km/Km2 | Distance to all point sources / DA | 0.0-27.9-98.8** | 0.18b
2.1.2. Variable sorting based on IBI prediction power using a leave-one-out,
hierarchical approach
The environmental variables were kept separated in two different categories: offstream and
instream variables. Each category was composed of different groups. Thus, the offstream
category was composed of four groups: local land uses and regional land uses (both in catchment and 30 and
100-meter buffer areas), fragmentation metrics, and point source metrics. The instream category
was composed of two groups: water and habitat qualities.
A leave-one-out method was used to assess the IBI predictive power (R2) of each
environmental variable individually. Following this method, one observation at a time was left
out (test site). The remaining 428 observations were then clustered using an agglomerative
hierarchical structure with the average linkage method and the standardized Euclidean distance
(Jain et al. 1999). The hierarchical clustering (HC) of the remaining 428 observations was
performed using only the environmental variable being tested. Subsequently, the closest branch
in the hierarchical tree (i.e. with the smallest standardized Euclidean distance between the branch
mean value and the test site value) was selected. The calculated IBI was the average IBI of the
sites located in the selected branch. The observed IBI was the test site’s IBI. The prediction
capability was tested for all 429 observations at three different levels of the hierarchical
structure: with 233, 328, and 423 branches (see Figure 2-2). The highest R2 was used to sort the
variables.
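The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this research: it relies on SciPy's `linkage` and `fcluster`, implements the standardized Euclidean distance by dividing each variable by its standard deviation in the training sites, cuts the tree into at most `n_branches` clusters, and takes R2 to be the squared Pearson correlation between observed and calculated IBI. The function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def loo_branch_predictions(X, ibi, n_branches):
    """Leave-one-out IBI prediction from the closest hierarchical branch.

    X: (n_sites,) or (n_sites, n_variables) array of environmental variables.
    ibi: (n_sites,) array of observed IBI scores.
    n_branches: maximum number of branches at which the tree is cut.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    ibi = np.asarray(ibi, dtype=float)
    n_sites = len(ibi)
    predictions = np.empty(n_sites)
    for i in range(n_sites):
        keep = np.arange(n_sites) != i          # leave the test site out
        X_train, ibi_train = X[keep], ibi[keep]
        # Standardized Euclidean distance = Euclidean distance on scaled variables.
        scale = X_train.std(axis=0, ddof=1)
        scale[scale == 0] = 1.0                 # guard against constant variables
        tree = linkage(X_train / scale, method="average", metric="euclidean")
        labels = fcluster(tree, t=n_branches, criterion="maxclust")
        branch_ids = np.unique(labels)
        centers = np.array([X_train[labels == b].mean(axis=0) for b in branch_ids])
        # Closest branch: smallest distance between branch mean and test site.
        distances = np.linalg.norm((centers - X[i]) / scale, axis=1)
        closest = branch_ids[np.argmin(distances)]
        predictions[i] = ibi_train[labels == closest].mean()
    return predictions


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and calculated IBI."""
    return float(np.corrcoef(observed, predicted)[0, 1] ** 2)


if __name__ == "__main__":
    # Synthetic data for illustration only (not from the Ohio database).
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=60)
    ibi = 12.0 + 48.0 * x + rng.normal(0.0, 4.0, size=60)
    for k in (10, 20, 40):
        print(k, round(r_squared(ibi, loo_branch_predictions(x, ibi, k)), 3))
```

Running the function once per variable and per clustering level, and keeping the highest R2 of the three levels, would reproduce the variable sorting described in this section.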
2.1.3. Step-wise IBI prediction using a leave-one-out, hierarchical approach
Using the variable sorting obtained in point 2.1.2, the same methodology was repeated using a
step-wise approach. In each group, the best variable from the previous step was selected and the
next best group variable was included to form an array of two-dimensional environmental
vectors. The test site two-dimensional environmental vector was compared to the hierarchical
tree branches obtained with the two-dimensional array of remaining sites. If the prediction
improved on the previous one, the new variable was kept; otherwise it was discarded. This process
was repeated in each group until no more variables were available. Therefore, the IBI prediction
capability for each group of variables was revealed.
Subsequently, the different groups were progressively merged together using the selected
variables at each step (Figure 2-3). The IBI prediction methodology used when two groups were
merged was the same. Again, the variables were sorted according to their individual predicting
power from point 2.1.2 and introduced one at a time. The order of the group mergers is shown in
Figure 2-3. At the end of this process, the best variables and the best possible IBI prediction at the
state scale were obtained.
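The step-wise selection within a group can be sketched as a greedy loop over the ranked variables. The sketch below is illustrative only: `stepwise_select` and the `score` callable are hypothetical names, and `score` stands for any function that returns the leave-one-out R2 obtained with a given list of variables (for instance, the hierarchical prediction of point 2.1.2). The toy scores in the usage example are invented.

```python
def stepwise_select(ranked_variables, score):
    """Greedy step-wise selection over variables sorted by individual R2.

    ranked_variables: variable names, best individual predictor first.
    score: callable taking a list of variable names and returning the R2
           of the prediction made with those variables together.
    Returns the selected variables, the final R2, and the (variable, R2) history.
    """
    if not ranked_variables:
        raise ValueError("at least one variable is required")
    selected = [ranked_variables[0]]
    best = score(selected)
    history = [(ranked_variables[0], best)]
    for variable in ranked_variables[1:]:
        trial = score(selected + [variable])
        if trial > best:                 # keep only variables that improve the prediction
            selected.append(variable)
            best = trial
            history.append((variable, best))
    return selected, best, history


if __name__ == "__main__":
    # Invented scores for illustration: variable "B" adds no new information.
    toy_gain = {"A": 0.30, "B": 0.00, "C": 0.10}
    result = stepwise_select(["A", "B", "C"], lambda vs: sum(toy_gain[v] for v in vs))
    print(result)
```

Merging two groups would amount to calling the same function again on the union of their selected variables, re-sorted by individual R2.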
Figure 2-2. Hierarchical tree with different clustering levels to which the test site (Xi1,Xi2,…,Xin) is being compared against. i indicates the observation number, n indicates the environmental variable within the environmental vector
Figure 2-3. Diagram showing the order with which the variable groups were merged. Orange rectangles indicate instream variables. Green rectangles indicate offstream variables. Blue indicates a mix of both
2.1.4. Analysis of observations with a significant impact from local variables
Predictions obtained with this methodology represent the best possible prediction using the
selected variables at the scale of the study. However, some sites still showed poor predictions.
We hypothesized that most of the variability not accounted for with the selected variables was
due to local stressors able to explain an important part of the biotic quality in one or few specific
sites. Thus, sites where regional scale environmental factors were the most significant stressors
to the biologic community should be well predicted with the selected variables in the presented
model. Significant mispredictions would occur when some local scale variable had a significant
impact in one or several sites.
Observations whose predictions fell beyond the interval of ±1.5 times the root mean square error
(RMSE) were considered as mispredictions due to significant impacts from local effects. These
observations were separated from the rest and tested for differences against the group of sites
whose predictions fell within the ±1.5×RMSE interval. Differences in water quality (in sites
affected by point sources), point source and fragmentation density and intensity, and local and
regional land uses were tested. These were evaluated using a Student t-test at the 95% confidence
level.
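The misprediction screening and the group comparison can be sketched as follows. This is an illustration under assumptions: the residual is taken as calculated minus observed IBI (so positive residuals are overpredictions), the t-test is assumed to be a two-sided, two-sample Student t-test with equal variances as implemented in `scipy.stats.ttest_ind`, and the function names are hypothetical.

```python
import numpy as np
from scipy import stats


def classify_predictions(observed, predicted, k=1.5):
    """Label each site as 'over', 'under', or 'well' predicted.

    A site is mispredicted when its residual (predicted - observed)
    falls outside the interval of +/- k times the RMSE.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = predicted - observed
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    labels = np.where(residuals > k * rmse, "over",
                      np.where(residuals < -k * rmse, "under", "well"))
    return labels, rmse


def compare_to_well_predicted(values, labels, group, alpha=0.05):
    """Student t-test of a variable between a mispredicted group and well-predicted sites."""
    values = np.asarray(values, dtype=float)
    _, p_value = stats.ttest_ind(values[labels == group], values[labels == "well"])
    return float(p_value), bool(p_value < alpha)


if __name__ == "__main__":
    # Invented numbers for illustration only.
    observed = np.zeros(12)
    predicted = np.array([0, 0, 0, 0, 0, 0, 0, 0, 6, 6, -6, -6], dtype=float)
    labels, rmse = classify_predictions(observed, predicted)
    forest = np.array([10, 11, 9, 10, 12, 10, 11, 9, 40, 42, 35, 37], dtype=float)
    print(labels, rmse, compare_to_well_predicted(forest, labels, "over"))
```

With `alpha=0.05` the comparison corresponds to the 95% confidence level used in this section.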
2.2. Results
2.2.1. Step-wise IBI predictions
Local land use was able to predict 49% of the total IBI variability. In the 30-meter buffer,
presence of agricultural land (cropland, pasture), herbaceous, and developed land (medium and
highly developed) were the most relevant to IBI. In the 100-meter buffer, presence of deciduous
forest and developed lands (low and medium intensity) were the most important. In the whole
local catchment area, presence of developed open space, evergreen forest, and herbaceous
wetlands were the land uses that had a relevant impact on IBI (Table 2-5).
Regional land use explained 58% of the total biologic variability. The variables selected in the
step-wise procedure were deciduous forest and woody wetlands in the 30-meter buffer;
developed (low intensity and open space) and other land uses in the 100-meter buffer; and
hay/pasture, deciduous forest, herbaceous, and woody wetlands in the drainage area.
Combining local and regional land uses resulted in the selection of mostly regional
variables. Only medium intensity developed lands in the local buffers were able to bring new
information to the system (see All LU in Table 2-5).
Point source density and intensity was the group of offstream variables that explained the least
overall variability. The ratio between number of upstream point sources versus upstream network
length explained 26% of the overall variability and was the first and only metric selected in the
step-wise algorithm.
Stream fragmentation explained 54% of the overall variability. One metric alone (SITE_Con) was
able to explain 47% of the overall variability. Site percentage of connected network, downstream
dam frequency, average dam frequency, and percentage of upstream connected network were the
variables selected in the fragmentation model (Table 2-5).
Combination of all the best offstream variables resulted in the selection of fragmentation metrics
(SITE_Con and DW_MainDf) and regional land use variables. None of the local land use
variables was selected. The best prediction was marginally better than the prediction with
combined local and regional land uses and explained around 60% of the total variability with just
eight variables instead of ten.
Combination of the best habitat parameters plus drainage area explained 49% of the overall
biologic variability. Six variables were selected in the best predicting model: embeddedness,
riffle, substrate, and pool qualities, drainage area, and instream vegetal cover. Water quality was
the group with the smallest overall prediction capability (R2 = 0.13). BOD, NO2-N, and Cd were
the variables that yielded the best prediction.
The variables selected after merging the best offstream and instream variables were all the
previously selected offstream variables plus two habitat parameters (riffle and cover). However,
the improvement in the overall IBI prediction was very modest compared to the best offstream
and land use models (R2 = 0.606 versus R2 = 0.597 and 0.596 respectively). New information
from instream parameters was minimal. IBI prediction plots for best offstream and instream
variables alone and combined are shown in Figure 2-4. The step-wise metric selection is shown
in Table 2-5.
2.2.2. Analysis of sites with significant local-scale stressors
A total of 28 sites were above the +1.5×RMSE threshold (overpredicted), while 27 were below the -1.5×RMSE threshold (underpredicted). Some of the
chemical concentrations in overpredicted sites had unusually high values, well beyond the
reported background concentrations (e.g. NO3-N, NH4, Cu, Zn, Cond, TP, BOD, or NO2-N)
(Rankin et al. 1999) . The concentrations of most elements were not significantly higher than the
rest of the database because not all overpredicted sites had consistently higher levels of a
particular element. Only Cu and Zn concentrations and some point source metrics (LPS_DA and
PSDisch_DA) were significantly higher in sites with reported point sources. However, this was
due to the existence of two sites with values one order of magnitude larger than the rest (one site
had very high Cu and Zn concentrations and the other one had very high point source density and
intensity). Since the source of impairment on these two sites was evident, and their presence
would have a significant impact on the t-tests results, they were disregarded from subsequent
point source and water quality analyses. After this, only upstream fragmentation, and land use-
related metrics had significant differences between groups as shown in Table 2-3.
Table 2-3. List of variables with significant differences between over-predicted sites and sites with a prediction within the ±1.5 ×RMSE intervals
Variable name | # of over/well predicted sites | Type of sites | Value in over-predicted sites (95% conf. interval) | Value in well-predicted sites (95% conf. interval) | p
Uflood_len | 11/19 | NPS+UF | 14.2 ± 8.8 | 2.2 ± 1.3 | 0.000
UPS_Con | 11/19 | NPS+UF | 40.6 ± 25.7 | 75.6 ± 10.2 | 0.003
UPS_stor_len | 11/19 | NPS+UF | 142.5 ± 82.2 | 17.2 ± 13.4 | 0.000
Ups_stor_DA | 11/19 | NPS+UF | 0.115 ± 0.078 | 0.021 ± 0.018 | 0.003
UPS_Con | 28/374 | ALL | 76.6 ± 14.6 | 89.4 ± 2.5 | 0.011
L30M_ForD | 28/374 | ALL | 44.2 ± 9.1 | 30.6 ± 2.72 | 0.009
L100M_ForD | 28/374 | ALL | 39.8 ± 8.0 | 26.4 ± 2.4 | 0.003
L100M_DevM | 28/374 | ALL | 0.24 ± 0.17 | 2.5 ± 0.58 | 0.041
LDA_ForD | 28/374 | ALL | 26.5 ± 7.1 | 17.8 ± 1.8 | 0.012
LDA_ForE | 28/374 | ALL | 1.1 ± 1.2 | 0.4 ± 0.1 | 0.014
LDA_Hay | 28/374 | ALL | 12.6 ± 4.8 | 8.2 ± 1.1 | 0.034
R30M_ForD | 28/374 | ALL | 36.5 ± 7.5 | 23.2 ± 1.8 | 0.000
R100M_ForD | 28/374 | ALL | 29.0 ± 5.7 | 19.3 ± 1.5 | 0.001
RDA_ForD | 28/374 | ALL | 18.1 ± 4.6 | 12.9 ± 1.1 | 0.019
NPS = sites without point sources; UF = sites with upstream fragmentation; ALL = all sites
Sites with predictions below the ±1.5×RMSE interval (i.e. underpredicted) consistently showed
lower hardness and hardness-related parameters in sites with reported point sources. Other
variables such as unit flow (discharge flow/area), Pb, Cu, and Zn were significantly higher in
underpredicted sites. However, these results were highly influenced by one site in particular
with extremely high concentrations and point source density compared to the remaining 11 sites
affected by point source pollution. For this reason, final water quality and point source density
analyses were run disregarding this particular observation. These results are presented in Table
2-4.
Table 2-4. List of variables with significant differences between under-predicted sites and observations with a prediction within the ±1.5 ×RMSE intervals
Variable name | # of under/well predicted sites | Type of sites | Value in under-predicted sites (95% conf. interval) | Value in well-predicted sites (95% conf. interval) | p
Hard | 11/213 | PS | 247.0 ± 42.1 | 313.8 ± 13.8 | 0.033
Mg | 11/213 | PS | 20.9 ± 4.7 | 28.2 ± 1.6 | 0.046
SO4 | 11/213 | PS | 135.2 ± 15.5 | 64.1 ± 25.1 | 0.042
Dsto_MLen | 22/331 | DF | 1920.8 ± 711.9 | 1194.8 ± 157.5 | 0.025
L100M_ForD | 27/374 | ALL | 39.5 ± 8.5 | 26.4 ± 2.4 | 0.005
L100M_ForE | 27/374 | ALL | 1.8 ± 2.0 | 0.4 ± 0.2 | 0.002
L30M_ForD | 27/374 | ALL | 47.4 ± 9.9 | 30.6 ± 2.7 | 0.002
L30DevL | 27/374 | ALL | 2.0 ± 1.7 | 6.0 ± 1.0 | 0.038
L30M_ForE | 27/374 | ALL | 3.8 ± 5.8 | 0.4 ± 0.2 | 0.000
LDA_ForD | 27/374 | ALL | 29.9 ± 7.8 | 17.8 ± 1.8 | 0.000
LDA_ForE | 27/374 | ALL | 1.4 ± 1.3 | 0.4 ± 0.1 | 0.000
R30_Crop | 27/374 | ALL | 38.1 ± 1.3 | 49.1 ± 2.5 | 0.025
R30_ForD | 27/374 | ALL | 36.3 ± 7.2 | 23.2 ± 1.8 | 0.000
R100M_ForD | 27/374 | ALL | 29.4 ± 5.9 | 19.3 ± 1.5 | 0.000
RDA_ForD | 27/374 | ALL | 20.0 ± 4.7 | 12.9 ± 1.1 | 0.002
DF = sites with downstream fragmentation; PS = sites with point sources; ALL = all sites
Table 2-5. Step-wise IBI predictions. R2 indicates the variability explained after adding a new variable to the model. All results were achieved using a hierarchical tree with 423 branches. For an explanation of variables refer to Table 2-1 and Table 2-2
Model | Variables in order of selection (R2 after adding each variable)
Local LU | L100M_ForD (0.335), LDA_DevO (0.363), L100M_DevL (0.363), L30M_Crops (0.373), L30M_Hay (0.392), L30M_Herb (0.427), L100M_WetW (0.458), L100M_DevM (0.465), L30M_DevM (0.472), LDA_ForE (0.476), L30M_DevH (0.480), L30M_ForM (0.484), LDA_Bar (0.492)
Regional LU | RDA_Hay (0.385), R30M_ForD (0.509), R100M_DevO (0.546), RDA_ForD (0.558), R100M_DevL (0.562), R30M_WetW (0.569), RDA_Herb (0.570), RDA_WetW (0.572), R100M_Other (0.577)
All LU | RDA_Hay (0.385), R30M_ForD (0.509), R100M_DevO (0.546), RDA_ForD (0.558), R100M_DevL (0.562), R30M_WetW (0.569), RDA_Herb (0.570), RDA_WetW (0.572), L100M_DevM (0.593), L30M_DevM (0.596)
Fragmentation | SITE_Con (0.467), DW_MainDf (0.499), Avg_Df (0.541), UPS_Con (0.542)
Point sources | PS_LTOT (0.260)
Offstream variables | SITE_Con (0.467), RDA_Hay (0.512), DW_MainDf (0.535), R30M_ForD (0.537), R100M_DevO (0.563), RDA_ForD (0.592), R100M_DevL (0.596), R30M_WetW (0.597)
Water Quality | BOD (0.116), NO2-N (0.124), Cd (0.130)
Habitat | Embedded (0.281), Riffle (0.326), Substrate (0.403), Pool (0.431), Area (0.442), Cover (0.491)
Instream variables | Embedded (0.281), Riffle (0.326), Substrate (0.403), Pool (0.431), Area (0.442), Cover (0.491)
Overall | SITE_Con (0.469), RDA_Hay (0.512), DW_MainDf (0.535), R30M_ForD (0.537), R100M_DevO (0.563), RDA_ForD (0.592), R100M_DevL (0.596), R30M_WetW (0.597), Riffle (0.605), Cover (0.606)
Figure 2-4. IBI predictions with the best offstream variables (top), best instream variables (middle), and best variables overall (bottom). Dashed red lines indicate perfect fit line (center) and ± 1.5×RMSE (sides). Dot size is proportional to the number of hits in a specific point.
2.3. Discussion
The model confirmed that biological integrity is the result of the impact on the biotic community
by many stressors of different nature acting at different scales. Out of the five components of
biotic integrity (energy sources, water quality, habitat structure, flow regime, and biotic
interactions) (Karr 1991; Karr et al. 1986; Karr and Kerans 1981), at least the first four were
totally or partially accounted for in the model. Even though numerous stressors existed in our
database, biologic integrity could be best characterized with only two groups of stressors:
regional land use and stream fragmentation. Only larger scale metrics (basin or watershed scales)
were selected in the final model (i.e. regional land use, percentage of connected stream network
and downstream dam frequency at the basin scale). This result is not surprising because the scale
of the research performed was large (state scale). In consequence, regional or basin scale
variables were able to explain a greater part of the total variability than local ones such as
instream habitat quality. The influence of land use on stream integrity is scale-dependent (Allan
et al. 1997), and the significance of scale becomes evident depending on the sampling design and
distribution. If the research is based on a wide array of streams with different characteristics and
different environmental conditions (e.g. substantially different upstream land uses or stream
order), regional characteristics will prevail as main contributors to biotic variability (Roth et al.
1996). Alternatively, if the study is based on similar types of observations with little
environmental variability (e.g. same order streams in one watershed), local variables will emerge
as the most significant (Lammert and Allan 1999).
2.3.1. Land use
Our model identified combinations of drainage area’s extent of hay/pasture and deciduous forest
(this was the dominant type of forest in our area of study) as the most critical to IBI. Herbaceous
and woody wetlands were also identified as important types of regional land uses in the drainage
area for IBI in Ohio (see All LU model in Table 2-5). These results strongly agree with research
by Roth et al. (1996) and Wang et al. (1997). They identified agriculture and forest in the
drainage area as the main contributors to IBI variability in Ohio and Wisconsin respectively.
However, in their research cropland and pasture lands were lumped into one single agriculture
category, while forest included deciduous, mixed and evergreen categories (Anderson et al.
1976). Stewart et al. (2001) identified a positive correlation of fish diversity, intolerant fish, and
EPT species with increased forest cover. Richards et al. (1996) linked non-row crop agricultural
lands with increased woody debris, flood ratio and shallows. In our research, land uses were not
merged according to the recommendations of Anderson et al. (1976); instead, the different
subcategories defined in the NLCD were kept (USGS 2008b). The result was that the extent of hay/pasture
in the drainage area turned out to have the greatest individual prediction power among land uses. This was
significantly greater than cropland’s (R2 = 0.385 vs. 0.257) despite the great dominance of cropland
in the area of study (average cropland coverage equal to 56.1% versus 9.1% for hay/pasture).
Presence of pasture lands in the drainage area has been associated with reduced vegetal cover,
increased water temperature, nitrate, biomass concentrations, photosynthetic rates, and total
suspended solids as well as with an increase of fine sediments deposited in the river bed. A major
shift in the composition of the macroinvertebrate species was also associated with pasture lands
(Quinn et al. 1997). It has been found that presence of rangeland is particularly harmful to
aquatic fauna, especially in sites with poor riparian quality (Meador and Goldstein 2003).
Woody wetlands in the drainage area, and especially in the regional buffers, were also deemed
important for IBI. Even though little new variability was explained after the introduction of this
metric in the final model (most likely due to cross-correlation with other land uses such as
deciduous forest or developed lands), its presence is remarkable because of its small extent (mean
percentages equal to 0.33, 0.70, and 1.07% in the drainage area, 100 and 30-meter regional
buffers respectively). Woody wetlands seemed to gain importance with proximity to the stream
(its individual predictive power ranked 12th out of 16 land uses in the drainage area, 9th
out of 15 land uses in the 100-meter regional buffer, and 7th out of 15 land uses in the regional
30-meter buffer). In the final model, only presence of woody wetlands in the 30-meter buffer
introduced some new information to the model. A similar result was reported by Richards et al.
(1996), who linked a small presence of forested wetlands (mean extent of 10% in drainage area)
with increased presence of woody debris and some channel characteristics such as bankfull
depth. Wetlands are known to act as regulators between surface water flow and hydrology
(Mitsch and Gosselink 1986). Their presence is associated with decreased sediment input,
nutrients, temperature, ionic strength, and increased resilience to disturbances (Detenbeck et al.
2000; Richards et al. 1996). Of special importance is the presence of wetlands near the receiving
waterbody as the model indicated (30-meter buffer was selected over drainage area). A decreased
wetland-stream distance has been positively correlated to reduced levels of nutrients, ions, and
bacteria. Wetland extent has been correlated to decreased lead and high color in downstream
lakes. This was found to be especially true in areas with highly fragmented riparian
corridors (Detenbeck et al. 2000; Detenbeck et al. 1993; Johnston et al. 1990).
Presence of developed lands in the regional and local buffers also provided significant new
information. At the regional level, open and low intensity urban lands (the dominant urban
categories with mean extents equal to 9.1 and 4.3% respectively) in the 100-meter buffer were
the urban land uses selected in the final predicting model. Therefore, it seems from the obtained
results that the extent of developed lands (represented mainly by low intensity and open space) in
the regional 100-meter buffer plays an important role in biotic degradation (Morley and Karr
2002; Stewart et al. 2001; Wang et al. 2001). This is also true for local land use in buffers even
though open and low intensity developed lands at this scale were not selected in any of the
models, most likely due to high correlation with their regional homologues (r = 0.60 and 0.57 for
open space and r = 0.57 and 0.59 for low development in the 30 and 100-meter buffers
respectively).
At the regional level, presence of deciduous forest and woody wetlands in the most immediate
lands (30-meter buffer) seemed to counteract the effect of urban land uses within or beyond that
buffer (100-meter). It has been reported that buffer fragmentation and patchiness data can provide
substantial information beyond traditional land use percentages (Allan 2004; Detenbeck et al.
2000; Stewart et al. 2001). Presence of medium intensity development at local scales in the
All LU model may be an indication that urban intensity is also important, especially in the
immediate surroundings of a water body (local-scale 30 and 100-meter buffers) (Morley and
Karr 2002; Wang et al. 2001). Medium intensity development was not the dominant urban land
use in local buffers (2.21 and 1.76% in the 100 and 30-meter buffers respectively, versus 12.3
and 11.6%, and 5.9 and 5.51% of open space and low intensity urban lands respectively). Around
10-12% of connected imperviousness is considered the threshold beyond which biologic quality
declines rapidly in watersheds with small or no riparian buffers (Schueler 1994; Wang et al.
2001; Wang et al. 2000). Presence of medium intensity development in the local buffer as a
significant variable in our model may indicate that this threshold has been reached.
2.3.2. Fragmentation
Fragmentation and flow regulation affect a large percentage of the streams worldwide,
especially in developed countries (Dynesius and Nilsson 1994; Nilsson et al. 2005). Stream
fragmentation by dams has serious consequences for the biologic community, preventing fish
from reaching upstream habitats, and isolating trapped upstream populations. Decreased species
richness and risk of extinction of native fauna through demographic, environmental, and genetic
stochasticity are some of the consequences fragmented populations face (Morita and Yamamoto
2002). The negative effects of physical separation of stream segments on aquatic species have
been widely studied (Morita and Yamamoto 2002; Morita and Yokota 2002; ReyesGavilan et al.
1996). Moreover, physical barriers are not the only consequence of dams. Usually, hydrologic
changes are also associated with impoundments. Alteration of the natural flow regime affects
fauna by eliminating or modifying natural habitat conditions, which in turn, produces a shift in
species composition and, therefore, biologic integrity (Fischer and Kummer 2000; Freeman et al.
2001; Gilvear et al. 2002; Poff and Allan 1995; Poff et al. 1997; Richter et al. 1996).
In this research, the site percentage of connected stream network (SITE_Con) and downstream
fragmentation metrics (Dfl_MainLen, Dsto_MLen, and DW_MainDf) had the largest individual
predicting powers overall. These were able to explain around 40% of the total IBI variability by
themselves. Upstream fragmentation metrics had far less predictive power and were important
only at some specific sites, as shown in Table 2-3. Most of the sites were located well
inland and far from the basin outlet (average stream distance to basin outlet = 284.3 km,
minimum distance = 18.35 km, maximum distance = 833 km), which could have influenced the
results. However, no significant differences in drainage area were found between fragmented sites in the
overpredicted and well-predicted groups. Furthermore, the only fragmentation metric that included
upstream and downstream fragmentation (SITE_Con) had the greatest predictive power overall.
2.3.3. Point sources and instream water quality
The prediction power of the individual instream water quality variables was clearly sorted in
three main groups. The first one was related to nutrient concentration, especially nitrogen (BOD,
TKN, NO2-N, and NH4). Nitrate and TP concentrations were not ranked among the top
chemical predictors. In fact, TP was the worst chemical predictor, which could be due to
elevated concentrations beyond the biomass limiting-nutrient condition (Rankin et al. 1999).
Rankin et al. (1999) did not find a clear relationship between NO3-N and IBI in Ohio either;
only concentrations beyond 3-4 mg/L had consistently negative effects on IBI. Ionic strength
parameters (Mg, Hard, Cl, Cond, SO4) were the second group. These elements affect the toxicity
of some components such as metals. Metal concentrations came in last (Zn, Cd, Fe, Cu, Pb) with
the exception of arsenic which had the third highest individual predicting power of all available
chemicals. Other variables such as DO, TSS, or pH had low prediction capabilities.
The first two variables selected in the step-wise model (BOD, NO2-N) showed that nutrient
input is the main water quality contributor to biologic degradation in Ohio. BOD has been
identified as a source of degradation in Ohio streams (Dyer et al. 2000; Norton et al. 2000;
Norton et al. 2002) and is an indicator of biologic degradation due to highly eutrophic
conditions. The third selected variable in the model was cadmium concentration, which provided
marginal improvement (see Table 2-5). Metal toxicity is indeed a powerful agent of biologic
degradation. However, it is only able to explain a significant part of the overall IBI variability at
smaller scales such as the upper or lower parts of a watershed (Dyer et al. 2000). This is most
likely a consequence of its highly localized nature (i.e. coming from point sources or legacy
pollution). None of the chemical variables were present in the Instream Variables model. Habitat
and water qualities (especially if related to nutrient input) are highly influenced by changes in
local and regional land uses. Therefore, in severely impaired habitats (e.g. with a high level of
fine sediment due to accelerated denudation processes) poor water quality is likely due to
increased non-point source inputs (chemicals attached to flushed particles in runoff). Therefore,
water quality did not provide any further improvement in the subsequent models at this particular
scale.
Point source metrics had significant effects only at the local scale, as expected. When extreme
cases were removed, no significant differences in water quality and point source density and
intensity were found between sites with reported point sources. Only significantly lower ionic
strength in under-predicted sites was found, which could be an indication of less intense waste
water discharges in these sites. Therefore, point sources have a small overall impact on biotic
integrity compared to other more ubiquitous stressors directly or indirectly linked to land use
changes. At the sub-basin scale or smaller, point sources can play a significant role if
they have a substantial presence (Dyer et al. 2000; Dyer et al. 1998a). However, as the scale
expands other factors take over for one simple reason: they are more ubiquitous, hence they act
as gradients in all the available observations. Thus, point sources explain a significant amount of
variability in specific clusters of sites but little when all are considered as a whole (Allan et al.
1997; Manolakos et al. 2007).
2.3.4. Instream Habitat
Instream habitat and drainage area were able to explain 49% of the overall IBI variability.
Substrate-related metrics (embeddedness and substrate quality), stream variability (pool and
riffle qualities), as well as vegetal cover were the most relevant QHEI metrics. Habitat
parameters have been identified as the main instream sources of IBI variability (Dyer et al.
1998a; Hall et al. 1996; Manolakos et al. 2007). At larger scales, a significant part of the
variability due to water quality is accounted for with habitat quality for the reasons mentioned
above. Our model confirmed this, and the Habitat model selected exactly the same variables as
the Instream one (Table 2-5). Riffle and Cover qualities were selected in the Overall model but
contributed very little to the final result. Stream variability, substrate quality, and/or instream
cover have been identified as significant contributors to biotic quality in Ohio (Dyer et al. 1998a;
Dyer et al. 1998b; Manolakos et al. 2007; Yuan and Norton 2004) and elsewhere (Minshall
1984; Quinn and Hickey 1990; Rabeni and Smale 1995; Richards et al. 1993). Drainage area was
positively correlated to IBI, which strongly agreed with the findings by Dyer et al. (1998a) in
Ohio.
2.3.5. Mispredictions due to local effects
The main cause for IBI overprediction was the presence of either one extreme factor (e.g. very
high levels of point source pollution), or a combination of two or more variables with
significantly different values from the remaining well-predicted sites (upstream fragmentation
and/or local land use differences). Overpredicted observations had significantly higher levels of
deciduous forest at all levels at the regional scale, which contributed to high calculated IBI
scores. Surprisingly, the extent of forested areas at all local levels was also significantly higher and
urban extent was lower in overpredicted sites, which was counter-intuitive given the lower
observed IBI scores in sites with such good ‘land use quality’. The only local land use metric
that could contribute to lowering the IBI expectancy was the presence of significantly higher
percentages of hay/pasture lands in the local catchment area. No significant differences existed
with this land use at regional scales. Hay/pasture was identified as the best land use predictor and
negatively correlated to IBI at the regional scale.
Overprediction was also due to increased upstream fragmentation (average upstream connected
network equal to 76.6% versus 89.4% in overpredicted and well-predicted sites, respectively).
Therefore, high point source density and intensity, highly fragmented upstream networks, and
larger extents of hay/pasture in the local catchment area lowered the observed IBI. These metrics
were not part of the final model. On the other hand, significantly higher percentages of forested
areas in the regional buffer yielded higher than expected predicted IBI scores.
Underpredicted sites also had significantly better ‘land use quality’ at all
scales. Therefore, high quality of the local land use (higher levels of forested areas followed by
reduced extents of urban and crop lands, and absence of other significant differences) is the most
likely cause of underprediction because the model does not take into account exceptional local
conditions. Furthermore, lower concentrations of ionic strength parameters might be an
indication of reduced sediment and chemical input from non-point sources at the regional and
local scales and also absence of a substantial impact from point sources.
2.4. Conclusions
• The presented prediction model was based on evaluation of environmental similarities among
sites with the same environmental variables. It successfully identified the most significant
variables to IBI at the state-scale with a very fast and easy-to-apply technique. Selected variables
at each step strongly agreed with published research.
• At the state-scale, regional land use and stream fragmentation are the main predictors of
biotic integrity. Habitat variables only contribute marginally to model improvement, while
instream water quality and point source intensity and density were not able to improve the final
model at all. Most of the information from instream water and habitat qualities is introduced into
the model by regional land use, which acts as a surrogate variable.
• Sixty-one percent of the total variability was explained with regional land use and
fragmentation metrics and sixty percent with just local and regional land use. Overpredictions
mainly came from a combination of higher upstream fragmentation, extreme point source density
and intensity, and high levels of hay/pasture in the local catchment area. Underpredictions
mainly came from sites with an extraordinary local land use quality which was not accounted for
in the model, and less harmful effects from disposed waste water.
• If the 55 sites with significant local effects were to be disregarded, the model could explain
86% of the overall IBI variability. Therefore, at the state-scale local stressors account for 25% of
the variability beyond the one explained by land use and fragmentation. The remaining 14% may
be due to sampling errors, data quality issues, or natural randomness (for example, a site with
BOD = 24 mg/L; TKN = 3.1 mg/L; TP = 1.29 mg/L; Zn = 180 µg/L; Cu = 39 µg/L; Fe = 19,700
µg/L; or NO2-N = 0.19 mg/L had one of the highest observed IBI scores (50)).
• The results show that water quality issues from point sources have small overall impact on
biotic integrity in Ohio. This may indicate successful control of point sources through the
NPDES permits and Total Maximum Daily Load (TMDL) projects, which have been a top
priority for surface waters since the Clean Water Act of 1972. Our model showed that the current
most significant stressors are related to stream fragmentation and land use change, especially in
the regional buffers. Habitat degradation and nutrient input are the most direct consequences
from this. In order to achieve the intended physical, chemical, and biological integrity of the
Nation’s waters, protection and enforcement policies have to refocus towards a more holistic view
beyond point source control. The ecosystem continuum must be maintained, and watershed-level land use
planning is necessary to attain such goals, especially in the lands closest to any water
body.
3. Chapter 3: Probabilistic, Hierarchical, Biologic Integrity Discrimination
3.1. Methodology
3.1.1. Ohio: instream data and study area
The data used consisted of 429 observations. An observation consisted of an array of instream
habitat and water quality parameter measurements, and the corresponding value of the fish Index
of Biologic Integrity (IBI). This data set was extracted from a larger database
and was selected because the complete set of input and output variables was available. Other
observations in the database were incomplete (e.g., only biological parameter values were
available). The data were collected between 1996 and 2000 by Ohio EPA. Habitat
observations consisted of discrete scores for each of the metrics in the Qualitative Habitat
Evaluation Index (QHEI): in-stream cover score (Cov), gradient score (Grad), and substrate
(Subs), riparian (Rip), pool, riffle, and channel (Chan) qualities. Furthermore, discrete scores
quantifying the site’s embeddedness extent due to fine sediment deposition (Embed) were also
available (embeddedness is not a metric in itself but a penalizing factor for substrate and channel
qualities). A detailed description of each QHEI metric and its scoring criteria can be found in
Rankin (1989). Drainage area (DA) was also available. Water temperature (Temp), conductivity
(Cond), dissolved oxygen (DO), biologic oxygen demand (BOD), pH, total suspended solids
(TSS), ammonia (NH4-N), nitrite (NO2-N), nitrate (NO3-N), total Kjeldahl nitrogen (TKN),
total phosphorus (TP), hardness (Hard), total calcium (Ca), total magnesium (Mg), chloride (Cl),
sulfate (SO4), total arsenic (As), total cadmium (Cd), total copper (Cu), total iron (Fe), total lead
(Pb), and total zinc (Zn) were the available physical and chemical values for water quality. The
units for both habitat and water quality parameters are shown in Table 3-1. The IBI scores
consisted of discrete scores ranging from 12 (essentially no fish) to 60 (healthy fish community).
A description of how IBI was developed and implemented for biological assessment of streams
in Ohio can be found in Ohio EPA (1987). Habitat, water quality and IBI data collection was
performed in the same stream segment. Habitat, water quality, and biologic sampling dates did
not differ by more than 5 days in any of the observations. To our knowledge, the monitoring was
performed in base-flow conditions during summer time or early fall. No extreme events such as
chemical spills were reported.
Table 3-1. List of water quality, habitat, and biologic integrity parameters used in the research
Variable | Symbol | Units | Variable | Symbol | Units | Metric | Symbol | Scale
Conductivity | Cond | µmho/cm | Total calcium | Ca | mg/L | Substrate quality | Subs | 0-20
Dissolved oxygen | DO | mg/L | Total magnesium | Mg | mg/L | Embeddedness | Embed | 0-4
pH | pH | 0-14 | Chloride | Cl | mg/L | Riparian quality | Rip | 0-10
Total susp. solids | TSS | mg/L | Sulfate | SO4 | mg/L | Instream cover | Cov | 0-20
Total phosphorus | TP | mg/L | Total arsenic | As | µg/L | Riffle quality | Riffle | 0-8
Ammonia as N | NH4 | mg/L | Total cadmium | Cd | µg/L | Pool quality | Pool | 0-12
Nitrite as N | NO2 | mg/L | Total copper | Cu | µg/L | Channel quality | Chan | 0-20
Total Kjeldahl nitrogen | TKN | mg/L | Total iron | Fe | µg/L | Gradient | Grad | 0-10
Nitrate as N | NO3 | mg/L | Total lead | Pb | µg/L | Qualitative Habitat Evaluation Index | QHEI | 0-100
Hardness as CaCO3 | Hard | mg/L | Total zinc | Zn | µg/L | Drainage area | DA | km2
Biologic Oxygen Demand | BOD | mg/L | Water temperature | Temp | deg C | Fish Index of Biologic Integrity | IBI | 12-60
The observations were distributed across the entire state. The majority was collected in the
Eastern Corn Belt Plains (ECBP), the Huron/Erie Lake Plains (HELP), and the Erie/Ontario
Lake Plains (EOLP) ecoregions with 180, 73, and 100 observations respectively. The Western
Allegheny Plateau (WAP) and the Interior Plateau (IP) ecoregions only had 36 and 40
observations respectively. The HELP and ECBP ecoregions have the highest nutrient
background concentrations, the EOLP and IP ecoregions have intermediate levels of nutrients,
while the WAP ecoregion has the lowest levels (Rankin et al. 1999). The watershed areas were
also very diverse, ranging from 1.55 km2 to 16,420 km2. In our research the sites were not
subdivided by ecoregion or stream size and were introduced into the model all at once. A small
number of observations per subgroup would have limited the progressive partitioning process, which requires
a large number of sites. Moreover, we wanted to ‘let the data speak’. Ecoregional or stream size
trends in nutrient concentration would be captured by the different patterning techniques used in
the research if they were significant enough.
3.1.2. Ohio: offstream data and study area
The data used consisted of 429 observations, where an observation here consists of an array of
basin, watershed and local-scale offstream variables along with the fish Index of Biologic
Integrity (IBI). The biological data were collected between years 1996 and 2000 by Ohio EPA.
Basin-scale observations consisted of fragmentation metrics. Watershed-scale metrics consisted
of percentages of different types of land use in the drainage area and the 100 and 30-meter buffer
areas around the stream network in the entire watershed. Watershed-based point source density
and intensity were also watershed-scale variables. Local metrics consisted of percentages of land
use in the catchment area and the 100 and 30-meter buffers only 2 miles upstream of the
sampling point. A 30-meter buffer width was chosen because it was the smallest width allowed
by the data resolution and exceeds the minimum recommended width of 15 meters, which is effective
under most conditions (Castelle et al. 1994). A 100-meter width was chosen because it is an
intermediate value between 3 and 200 meters, the minimum and maximum effective widths
depending on site-specific conditions according to Castelle et al. (1994). A description of the
different variables is available in Table 3-1, Table 3-2, and Table 3-3.
In order to obtain the upstream land uses, each site’s watershed was delineated using a 30-meter
resolution Digital Elevation Model (DEM) with ArcGIS Spatial Analyst. Subsequently, the
percentage of each upstream land use was calculated at two different scales: the watershed scale
and the local scales. Land use percentages were obtained using the Thematic Raster Summary
function within Hawth’s Analysis Tools for ArcGIS (Beyer 2004). Eight different broad land
use categories were used for each scale: urban, agricultural, non-forested, forested, surface
water, wetland, barren, and other (Anderson et al. 1976). These were calculated from the sixteen
land cover categories defined in the 2001 National Land Cover Dataset (NLCD) (USGS 2008b).
The surface water land use category was only calculated for the drainage and catchment areas,
not for the buffers because we felt that including it would heavily affect the final percentages of
narrow buffers. The fragmentation and point source metrics were calculated using the National
Hydrography Datasets (NHD) (USGS 2008a). The ArcGIS Utility Network Analyst was used to
trace upstream or downstream from a specific site. Major dams (with DA ≥ 2.59 km2) and point
sources (major and minor waste water treatment plants and major industrial dischargers) were
obtained from the National Inventory of Dams (USACE 2005) and the Permit Compliance
System database (EPA 2008c).
The IBI scores consisted of discrete scores ranging from 12 (very poor biotic integrity) to 60
(excellent biotic integrity). A description of how IBI was developed and implemented for
biological assessment of streams in Ohio can be found in Ohio EPA (1987).
The observations were located in five different basins: the Western Lake Erie Basin, the Muskingum River
Basin, the Scioto River Basin, the Middle Ohio and Little Miami River Basin, and the Wabash
River Basin (Figure 3-1). The drainage areas of the sampling points were also calculated and
were very diverse, ranging from 1.55 km2 to 16,420 km2. In our research, sites were not
subdivided by ecoregion or stream size and were introduced into the model all at once. A smaller
number of observations would have limited the progressive partitioning process, which requires
a large number of observations. Moreover, we wanted to ‘let the data speak’ and not make any
preconceived assumptions.
Figure 3-1. Distribution of observations used in the analysis and basins. On the left, groups after the 2nd SOM. On the right groups after clustering using SITE_Con (groups from the same parent group are segregated by basin)
Table 3-2. Land use categories and quartiles at the watershed (R) and the local (L) scales
Name | Units | Quartiles | Name | Units | Quartiles
RDA_Water | % | 0.10-0.25-0.60 | LDA_Water | % | 0.00-0.19-0.96
RDA_Forest | % | 5.54-9.56-19.72 | LDA_Forest | % | 4.47-13.60-29.77
RDA_NonForest | % | 0.48-1.04-1.42 | LDA_NonForest | % | 0.16-0.73-1.89
RDA_Barren | % | 0.00-0.01-0.05 | LDA_Barren | % | 0.00-0.00-0.00
RDA_Agric | % | 57.01-70.64-81.93 | LDA_Agric | % | 32.39-57.97-78.67
RDA_Urban | % | 6.84-9.48-18.58 | LDA_Urban | % | 5.99-11.08-30.99
RDA_Wetlands | % | 0.03-0.15-0.29 | LDA_Wetlands | % | 0.00-0.22-1.13
RDA_Other | % | 0.00-0.00-0.00 | LDA_Other | % | 0.00-0.00-0.00
R100_Forest | % | 8.22-16.22-31.24 | L100_Forest | % | 7.66-24.62-46.09
R100_NonForest | % | 0.39-1.11-1.89 | L100_NonForest | % | 0.00-0.49-2.02
R100_Barren | % | 0.00-0.00-0.02 | L100_Barren | % | 0.00-0.00-0.00
R100_Agric | % | 48.75-65.64-77.14 | L100_Agric | % | 19.06-43.99-70.17
R100_Urban | % | 6.60-9.50-15.73 | L100_Urban | % | 5.58-11.13-28.01
R100_Wetlands | % | 0.02-0.37-1.01 | L100_Wetlands | % | 0.00-0.41-3.50
R100_Other | % | 0.00-0.00-0.00 | L100_Other | % | 0.00-0.00-0.00
R30_Forest | % | 9.83-21.20-39.77 | L30_Forest | % | 7.38-29.34-55.19
R30_NonForest | % | 0.26-1.06-2.25 | L30_NonForest | % | 0.00-0.00-1.77
R30_Barren | % | 0.00-0.00-0.01 | L30_Barren | % | 0.00-0.00-0.00
R30_Agric | % | 42.05-60.88-76.07 | L30_Agric | % | 11.54-33.58-67.21
R30_Urban | % | 5.42-8.40-15.23 | L30_Urban | % | 3.74-8.77-24.89
R30_Wetlands | % | 0.00-0.58-1.65 | L30_Wetlands | % | 0.00-0.35-6.67
R30_Other | % | 0.00-0.00-0.00 | L30_Other | % | 0.00-0.00-0.00
DA = drainage or catchment area; 100 = 100-meter buffer; 30 = 30-meter buffer
Table 3-3. Fragmentation (top) and point source density and intensity metrics (middle), units, and quartiles
Name | Description | Units | Quartile
UPS_Floodarea | Percentage of flooded drainage area | % | 0.00-0.00-0.05
UPS_Con | Percentage of upstream connected network | % | 94.63-100.00-100.00
SITE_Con | Percent of total connected network | % | 3.26-18.93-34.52
DW_MainDF | Downstream channel length / # of dams on channel | km | 28.19-42.86-179.26
UPS_DF | Upstream network length / number of upstream dams | km | 26.44-83.2-225.28
Avg_DF | Average of DW_MainDF and UPS_DF | km | 54.43-110.23-198.40
UPS_floodlen | Upstream flooded area / upstream network length | m2/km | 0.0-0.0-630.2
UPS_storDA | Upstream dam storage capacity / drainage area | m3/km2 | 0.0-0.0-1459.5
UPS_storlength | Upstream dam storage / upstream network length | m3/km | 0.0-0.0-1,202.4
DW_floodMainlen | Downstream flooded area / main channel length | m2/km | 38,171.0-68,393.2-137,272.0
DW_storMainlen | Downstream dam storage / main channel length | m3/km | 231,248.8-370,196.2-1,487,654
Name | Description | Units | Quartile
Flow_PS | % of upstream network carrying wastewater | % | 0.00-3.70-11.52
PSDisch_LT | Point source discharge / upstream network length | m3/d/km | 0.0-1.5-32.6
PSDisch_LPS | Point source discharge / distance from site to all point sources | m3/d/km | 0.0-23.8-320.6
PS_LPS | # point sources / distance to all point sources | #/km | 0.0-43.5-90.1**
PS_LTOT | # point sources / upstream network length | #/km | 0.0-3.7-10.7**
PSDisch_DA | Point source discharge / DA | m3/d/km2 | 0.0-4.5-70.4
LPS-DA | Distance to all point sources / DA | km/km2 | 0.0-27.9-98.8**
DA | Drainage area | km2 | 32.89-103.60-344.73
** Values were multiplied by 1,000
3.1.3. Maryland data and study area
A total of 774 observations were used for the present research. These were grouped in three
geographic strata: coastal, piedmont, and highland regions. Piedmont and highland regions
represent non-coastal areas, and have significant differences in soil and land use history. Also,
the metrics used to calculate the Physical Habitat Index (PHI) are different for each region (Paul
et al. 2002). Coastal areas had a total of 225 observations, highland had 196 observations, while
piedmont regions had 252 sites. Figure 3-2 shows the distribution of the observation sites within
the state of Maryland. The data were obtained from the 1995-1997 Maryland Biological Stream
Survey (MBSS) (DNR 2008). The data consisted of biologic, habitat, and water qualities, stream
morphology, and watershed land use information.
The available habitat information in the MBSS database corresponded to Maryland’s
Provisional PHI metrics (Hall et al. 1999). These metrics were recalculated to obtain the “new”
PHI following the guidelines by Paul et al. (2002). The old metrics not used in the calculation of
the “new” PHI for a particular stratum were kept. Therefore, each site had its regional “new” PHI
and corresponding metrics and the remaining “old” habitat metrics not included in the new
regional PHI. Table 3-4 shows a list of all the different environmental variables available in each
of the observations and strata.
Land use information consisted of percentages of each category in the drainage area. The
original land use dataset contained fifteen different categories. MBSS used the land use/ land
cover information from the Federal Region III Multi-Resolution Land Characterization (MRLC)
digital data set, Version 2 (EPA 2008b). The MRLC was developed by a federal agency
consortium, using data primarily from Landsat 1991-1993 Thematic Mapper satellite images at a
resolution of 30 meters. In the present research, the fifteen MRLC land use categories were
grouped in three land cover classes: urban (which included low and high intensity development),
agriculture and barren (hay/pasture/grass, row crops, quarries, coal mines, beach areas, and
transitional), and natural lands (forest, wetlands, and open water).
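The grouping described above can be sketched as a simple lookup followed by a sum. This is only an illustration: the dictionary keys reuse the category names as they are written in the text (not the exact MRLC class labels, and fewer than the fifteen original categories), and the site percentages are hypothetical.

```python
# Map each land use category named in the text to one of the three land cover classes.
# Keys are illustrative labels, not the official MRLC class names.
LAND_COVER_CLASS = {
    "low intensity development": "urban",
    "high intensity development": "urban",
    "hay/pasture/grass": "agriculture and barren",
    "row crops": "agriculture and barren",
    "quarries": "agriculture and barren",
    "coal mines": "agriculture and barren",
    "beach areas": "agriculture and barren",
    "transitional": "agriculture and barren",
    "forest": "natural",
    "wetlands": "natural",
    "open water": "natural",
}


def group_land_use(percentages):
    """Sum category percentages of a drainage area into the three classes."""
    totals = {"urban": 0.0, "agriculture and barren": 0.0, "natural": 0.0}
    for category, pct in percentages.items():
        totals[LAND_COVER_CLASS[category]] += pct
    return totals


# Hypothetical drainage area (percent of area per category).
site = {
    "low intensity development": 2.0,
    "high intensity development": 0.5,
    "hay/pasture/grass": 30.0,
    "row crops": 25.0,
    "transitional": 1.5,
    "forest": 38.0,
    "wetlands": 2.0,
    "open water": 1.0,
}
grouped = group_land_use(site)
```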
The biological data consisted of fish IBI. This is based on the comparison of observed fish
assemblages at each site to those found at reference sites (Roth et al. 1998). Reference sites exist
for each of the strata: coastal, piedmont, and highland regions. The final IBI scores are the mean
values of the individual metrics, which are discrete scores (1, 3, or 5, where 5 indicates
little or no departure from reference conditions and 1 indicates the opposite). The IBI is
composed of eight metrics in coastal areas, nine metrics in piedmont regions, and seven metrics in highland sites
(Roth et al. 2000).
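As a small worked illustration of the scoring rule described above, the final IBI of a site is the arithmetic mean of its discrete metric scores. The function name and the example scores are hypothetical; the individual metrics themselves are defined in Roth et al. (2000) and are not reproduced here.

```python
def fish_ibi(metric_scores):
    """Return the mean of discrete metric scores, each of which must be 1, 3, or 5."""
    if not metric_scores:
        raise ValueError("at least one metric score is required")
    if any(score not in (1, 3, 5) for score in metric_scores):
        raise ValueError("metric scores must be 1, 3, or 5")
    return sum(metric_scores) / len(metric_scores)


# Hypothetical scores for the eight metrics of a coastal site.
coastal_site = [5, 3, 3, 5, 1, 3, 5, 3]
ibi = fish_ibi(coastal_site)
```

Because every metric is bounded by 1 and 5, the resulting IBI also lies on the 1-5 scale.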
Figure 3-2. 1995-1997 MBSS monitoring stations in the state of Maryland and strata distribution
Table 3-4. Description, quartiles, and units for the available regional environmental variables
COASTAL | Quartiles | Units | PIEDMONT | Quartiles | Units | HIGHLAND | Quartiles | Units | Description
Cond | 106-154-209.2 | μmho/cm | Cond | 136-174-223 | μmho/cm | Cond | 87.2-150-247 | μmho/cm | Conductivity
DO | 6.1-7.1-8.4 | mg/L | DO | 8.4-9.2-9.8 | mg/L | DO | 7.5-8.2-9.1 | mg/L | Dissolved oxygen
pH | 6.6-6.9-7.2 | SU | pH | 7.1-7.4-7.6 | SU | pH | 6.8-7.1-7.4 | SU | pH
NO3 | 0.7-1.1-2.9 | mg/L | NO3 | 1.8-2.6-4.1 | mg/L | NO3 | 0.46-0.94-3.04 | mg/L | Nitrate as N
Temp | 18.8-20.8-23.1 | deg C | Temp | 17.4-19.2-21.1 | deg C | Temp | 16.3-18.0-20.2 | deg C | Water temperature
SO4 | 10.8-14.2-17.8 | mg/L | SO4 | 6.0-9.0-13.1 | mg/L | SO4 | 9.5-13.5-21.5 | mg/L | Sulfate
ANC | 165.07-262.7-453.0 | μEq/L | ANC | 352.1-515.2-845.5 | μEq/L | ANC | 168.9-343.1-700.0 | μEq/L | Alkalinity
DOC | 3.2-5-7.35 | mg/L | DOC | 1.2-1.9-2.5 | mg/L | DOC | 1.2-2.0-2.2 | mg/L | Dissolved organic carbon
CRemote | 37.7-64.6-86.2 | 0-100 | Remote | 31.2-50-81.2 | 0-20 | HRemote | 25-45-75 | 0-100 | Remoteness score
CShade | 58.9-73.3-84.6 | 0-100 | PShade | 69.1-80.1-89.7 | 0-100 | HShade | 52.0-75.2-87.4 | 0-100 | Shading score
CEpiSub | 35.6-58.3-77.9 | 0-100 | PEpiSub | 58.8-76.5-88.2 | 0-100 | HEpiSub | 27.8-61.1-83.3 | 0-100 | Epifaunal substrate score
CInstrHab | 42.2-58.1-80.8 | 0-100 | PInstrHab | 64.4-79.3-87.6 | 0-100 | InstrHab | 10-14-16 | 0-20 | Instream habitat score
CWood | 40.8-57.9-69.1 | 0-100 | PWood | 8.3-25-41.8 | 0-100 | Wood | 0-1-3 | Count | Instream wood score or count
CBank | 59.2-74.2-86.6 | 0-100 | PBank | 50.8-66.7-84.5 | 0-100 | HBank | 62.3-82.7-90.0 | 0-100 | Bank stability score
Root | In CWood | ---- | Root | In PWood | ---- | Root | 0.0-0.0-1.0 | Count | # of instream rootwads
Pool | 8-13-15 | 0-20 | Pool | 12-15-16 | 0-20 | Pool | 10-14-16 | 0-20 | Pool quality score
Riffle | 6-11-14 | 0-20 | PRiffle | 74.9-85.0-92.2 | 0-100 | Riffle | 7-12-15 | 0-20 | Riffle quality score
Chan | 5-8-11 | 0-20 | Chan | 8-12-15 | 0-20 | Chan | 7-15-16 | 0-20 | Channel alteration score
Vel_dep | 6-10-13 | 0-20 | Vel_dep | 11-14-16 | 0-20 | Vel_dep | 8-11-14 | 0-20 | Velocity-depth variability score
Aesthet | 11-15-17 | 0-20 | Aesthet | 10-15-16 | 0-20 | Aesthet | 11.2-16-18 | 0-20 | Aesthetic quality score
PHI | 52.2-62.9-73.2 | 0-100 | PHI | 59.2-67.3-74.4 | 0-100 | PHI | 43.6-59.1-75.7 | 0-100 | Physical Habitat Index
ThalDep | 19.4-29.5-46.8 | cm | ThalDep | 22.7-32.2-43.9 | cm | ThalDep | 13.5-22.7-35.7 | cm | Mean thalweg depth
Wid | 2.4-3.7-5.9 | m | Wid | 3.4-5.5-8.8 | m | Wid | 2.2-4.0-7.0 | m | Mean stream width
MaxDep | 42-63-88 | cm | MaxDep | 52.2-71-90.1 | cm | MaxDep | 36-51-75.7 | cm | Maximum stream depth
Sl | 0.2-0.3-0.7 | % | Sl | 0.5-1.0-1.5 | % | Sl | 0.7-1.3-2.5 | % | Average slope
FlowVel | 0.05-0.09-0.16 | m/s | FlowVel | 0.14-0.22-0.32 | m/s | FlowVel | 0.08-0.17-0.30 | m/s | Average flow velocity
DA | 5.8-15.24-41.21 | km2 | DA | 4.9-14.5-38.7 | km2 | DA | 3.13-11.2-28.7 | km2 | Drainage area
Ch_flow | 70-81-90 | % | Ch_flow | 75-90-95 | % | Ch_flow | 70-90-96 | % | % of channel covered by water
RipWid | 20-50-50 | m | RipWid | 0-20-50 | m | HRipWid | 0-28-100 | 0-100 | Riparian score or width (up to 50 m)
Agribarr | 23.9-39.8-57.8 | % | Agribarr | 43.3-65.7-73.1 | % | Agribarr | 13.6-33.0-67.0 | % | Agricultural land use in DA
Forwetwat | 35.0-41.3-59.2 | % | Forwetwat | 22.9-28.3-37.7 | % | Forwetwat | 29.1-64.4-83.8 | % | Forest + wetland + water in DA
Urban | 0.49-2.7-9.9 | % | Urban | 0.6-2.1-8.2 | % | Urban | 0.0-0.2-1.2 | % | Urban land use in DA
Embed | 44-85-100 | % | PEmbed | 55.5-77.8-88.9 | 0-100 | Embed | 20-35-50 | % | % fine sediment or score
IBI | 3.0-3.5-4.25 | 1-5 | IBI | 2.8-3.7-4.1 | 1-5 | IBI | 2.1-3.3-4.1 | 1-5 | Fish Index of Biotic Integrity
Variables starting with C = metrics used to calculate the new PHI in coastal sites; starting with H = highland sites; starting with P = piedmont sites
3.1.4. Self-Organizing Feature Maps (SOM)
The SOM consists of an unsupervised Artificial Neural Network (ANN) model, whose operation
is inspired by the way the human brain is organized when new data is presented to it (Kohonen
2001). SOMs consist of a nonlinear projection of multidimensional data vectors on a 2D grid
with a meaningful order. The SOM grid is composed of individual units, called cells or neurons,
that compete with each other in order to identify the closest, or most similar cell, to the new data
vector being presented to the system. One neuron in a trained SOM will represent a specific
number of observations that have similar characteristics. Therefore, SOM neurons can be
considered as clusters of similar observations.
The allocation of observations to the SOM starts by assigning random weights to
each of the SOM neurons ($w_i = [w_1, w_2, \ldots, w_n]$). These weights have the same dimension as
the environmental variable input vectors ($x_j = [x_1, \ldots, x_n]$). One at a time, each observation is
presented to the SOM and compared to the neuron-based weights. The observation is then
associated with the most similar SOM neuron, which is called the Best Matching Unit (BMU).
Similarities between pairs of data and weight vectors are measured using the Euclidean distance.
Therefore, each unit in the input layer (i.e. observations of environmental vectors) is linked to
one unit in the output layer (i.e. SOM neurons).
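The BMU search described above amounts to a nearest-neighbor lookup under the Euclidean distance. The following is a minimal sketch with hypothetical two-variable weights and a single observation; the function name is illustrative and not taken from any SOM package.

```python
import numpy as np


def best_matching_unit(weights, x):
    """Return the index of the neuron whose weight vector is closest to x."""
    # Euclidean distance between the observation and every neuron weight vector.
    distances = np.linalg.norm(weights - x, axis=1)
    return int(np.argmin(distances))


# Three hypothetical neurons with two scaled environmental variables each.
weights = np.array([[0.1, 0.1],
                    [0.5, 0.5],
                    [0.9, 0.8]])
observation = np.array([0.8, 0.9])
bmu = best_matching_unit(weights, observation)
```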
Subsequently, this same process is iterated for better organization of the input space. The weights
are updated using a neighborhood function. This function looks at the observations placed in a
specific neuron and the surrounding ones within a specified radius. The initial random weight is
then replaced by another vector called the generalized median ($\varepsilon_i$), which is the ‘middlemost’
vector that minimizes the sum of distances between the data observations in the neuron itself and
the surrounding ones within the used neighborhood radius (Kohonen 2001). The process is then
repeated until convergence (i.e. until a certain criterion is met [usually $w_i \cong \varepsilon_i$]), or a certain
number of iterations is completed.
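The iterative procedure above can be sketched as a batch SOM. This is an illustration under stated assumptions, not the implementation used in this research: the generalized median is replaced by a Gaussian-neighborhood weighted mean (the usual batch update), and the grid size, radius, tolerance, and input data are hypothetical.

```python
import numpy as np


def train_batch_som(data, rows, cols, n_iter=20, radius=1.0, seed=0):
    """Batch SOM sketch: find each observation's BMU, then update all weights."""
    rng = np.random.default_rng(seed)
    n_var = data.shape[1]
    weights = rng.random((rows * cols, n_var))  # random initial weights
    # Grid coordinates of every neuron and squared grid distances between neurons.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    for _ in range(n_iter):
        # Euclidean distance from every observation to every neuron.
        dist = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
        bmu = dist.argmin(axis=1)
        # Gaussian neighborhood: influence of each observation on each neuron.
        h = np.exp(-grid_d2[bmu] / (2.0 * radius ** 2))
        # Assumption: weighted mean used in place of the generalized median.
        new_weights = (h.T @ data) / h.sum(axis=0)[:, None]
        converged = np.allclose(new_weights, weights, atol=1e-6)
        weights = new_weights
        if converged:  # weights no longer change between iterations
            break
    dist = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return weights, dist.argmin(axis=1)


# Hypothetical scaled data: 50 observations of 3 variables in [0, 1].
data = np.random.default_rng(1).random((50, 3))
weights, labels = train_batch_som(data, rows=3, cols=4)
```

Each entry of `labels` is the index of the neuron (cluster) to which the corresponding observation is assigned after training.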
One of the properties of the SOM is that the nonlinear projection of the multidimensional input
vectors xj on the neuron grid can be considered to approximate the probability density function
p(x) of the high dimensional input data. Therefore, relevant information can be retrieved by
observing the distribution of the final neuron weights. Also, since the weight
learning process creates a smoothing effect on the weight vectors of the neurons, correlations
among variables become clearer. This is especially important for the understanding of highly
complex, natural systems in which one observation can be the outcome of multiple variable
combinations. The smoothing effect is also important to identify correlations in discrete or
crudely scaled data because the final trained neuron weights have a more continuous nature than
the initial input data.
SOMs have been used in several environmental applications, usually for data exploration
purposes in combination with more conventional techniques (Manolakos et al. 2007; Tran et al.
2003), spatial analysis and site classification and characterization (Cereghino et al. 2001; Tran et
al. 2003), identification of the main traits of the biotic community (Chon et al. 1996), or
prediction of the probability of presence/absence of fish species in specific sites after some
anthropogenic change in the study area took place (Park et al. 2003).
3.1.5. Initial data clustering and SOM neuron analysis
In the case of Maryland and Ohio (instream data), all the available physical and chemical
environmental variables were used to train the SOM. In the case of Ohio with offstream data,
only regional land use and fragmentation metrics were used to train the SOM because these
variables were deemed responsible for the background quality of the biologic integrity in a
specific area. Point source density and intensity and local land use were deemed too local or
non-ubiquitous to have a significant effect on the overall IBI variability and were therefore not used in the
SOM training. The raw data for each variable were log-transformed (natural log) and rescaled to the range
[0,1]. This step was necessary to equalize the effect of each input variable on the final
SOM output, since the variables were measured on different scales.
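The preprocessing step can be sketched as follows. This is only an illustration: it assumes strictly positive raw values (the text does not state how zeros were handled) and a min-max rescaling of the logged values, and the function name is hypothetical.

```python
import math

def log_minmax(values):
    """Natural-log transform followed by min-max rescaling to [0, 1].

    Assumes every value is strictly positive; a constant variable maps to 0.0.
    """
    logged = [math.log(v) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]

# Hypothetical raw concentrations spanning two orders of magnitude
scaled = log_minmax([1.0, 10.0, 100.0])
print(scaled)
```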
The size of the SOM (number of neurons) was mainly determined by the topographic error,
although the quantization error was also checked. The topographic error is defined as the
proportion of input data vectors for which the first and second most similar SOM neurons are not
adjacent in the grid of neurons (Kiviluoto 1996). The quantization error is defined as the average
distance (Euclidean) between each input data vector and its BMU. In our research, the optimum
number of neurons was found by choosing the number that had the minimum topographic error.
The quantization error usually decreases monotonically with SOM size. Since a very large map
size was undesirable given the available data set size, it was deemed less important and was not
used to determine the optimum map size. The maximum number of SOM neurons was limited to
100. SOMs with 60 (6 × 10) and 72 (8 × 9) neurons were used for the initial SOM training in
Ohio with instream and offstream data respectively. SOMs with 48 (6 × 8), 54 (6 × 9), and 54
(6 × 9) neurons were used for the coastal, highland and piedmont regions in Maryland
respectively. The SOM training consisted of 20 and 100 epochs for the coarse and fine-tuning
map training respectively.
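The two map-quality measures defined above can be computed along the following lines. This is a minimal sketch, assuming a rectangular grid in which neurons are adjacent when their row and column indices differ by at most one; a hexagonal lattice would need a different adjacency rule. All names and the toy map are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def map_errors(data, weights, coords):
    """Return (quantization error, topographic error) for a trained map.

    weights[i] is the weight vector of neuron i and coords[i] its (row, col)
    position on a rectangular grid (an assumption of this sketch).
    """
    qe_sum, te_count = 0.0, 0
    for x in data:
        order = sorted(range(len(weights)), key=lambda i: dist(x, weights[i]))
        bmu, second = order[0], order[1]
        qe_sum += dist(x, weights[bmu])
        (r1, c1), (r2, c2) = coords[bmu], coords[second]
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:  # two best units not adjacent
            te_count += 1
    return qe_sum / len(data), te_count / len(data)

# Toy 1 x 3 map whose weights are deliberately not topologically ordered
weights = [(0.0,), (1.0,), (0.1,)]
coords = [(0, 0), (0, 1), (0, 2)]
qe, te = map_errors([(0.02,), (0.98,)], weights, coords)
print(qe, te)
```

Repeating this calculation for several candidate grid sizes and keeping the size with the smallest topographic error mirrors the selection rule described in the text.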
There exists one element in each weight vector corresponding to each one of the environmental
variables included in the input data vectors used for the SOM training. Therefore, a vector of
SOM neuron weights could be extracted for each environmental variable used in the SOM
patterning. Also, the IBI values of the patterned observations in each SOM neuron were
averaged. Hence, a neuron-based average IBI value was determined for each SOM neuron. The
correlation matrix among the environmental weight vectors and the neuron-based mean IBI
vector was computed. The goal was to evaluate the effect of each environmental variable over
IBI and also reveal relationships among environmental variables. The absolute values of the
neuron-based IBI-variable correlation coefficients were sorted in descending order. Variables
with a higher variable-IBI absolute correlation coefficient were considered to have a greater
overall impact on the biological community and vice versa.
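The ranking procedure can be sketched as below. The sketch assumes that neurons without any patterned observation are left out of the correlation (the text does not say how empty neurons were treated); the variable names, weights, and IBI values are hypothetical.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_variables(weights, var_names, bmu, ibi):
    """Rank variables by |r| between neuron weights and neuron-mean IBI.

    weights[k][j] is the weight of variable j in neuron k; bmu[i] is the
    neuron to which observation i was mapped; ibi[i] is its IBI score.
    """
    by_neuron = defaultdict(list)
    for k, score in zip(bmu, ibi):
        by_neuron[k].append(score)
    hit = sorted(by_neuron)  # neurons holding at least one observation
    mean_ibi = [sum(by_neuron[k]) / len(by_neuron[k]) for k in hit]
    ranked = [(name, pearson([weights[k][j] for k in hit], mean_ibi))
              for j, name in enumerate(var_names)]
    return sorted(ranked, key=lambda t: abs(t[1]), reverse=True)

# Hypothetical 3-neuron map with two variables
ranked = rank_variables(
    weights=[[0.9, 0.2], [0.5, 0.6], [0.1, 0.3]],
    var_names=["embeddedness", "pH"],
    bmu=[0, 0, 1, 2, 2],
    ibi=[20, 24, 34, 44, 48],
)
print(ranked)
```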
3.1.6. Second SOM data clustering
A second SOM training was performed using variables that showed a significant impact on IBI
(neuron-based IBI-variable |r| ≥ 0.5 in Maryland and Ohio with instream data and |r| ≥ 0.4 in
Ohio with offstream data) and were not highly correlated to a more relevant variable
(variable-variable |r| < 0.8). In the case of Ohio with offstream data, the large-scale variable correlation
coefficient criterion was relaxed because, had the initial criterion been kept, only one variable would
have been available for the 2nd SOM training (see Table 3-8), since the other large-scale variables
(i.e. with IBI-variable |r| ≥ 0.5) were discarded due to cross-correlation. For this reason, the
criterion was relaxed so that more than one variable could be used for the 2nd SOM patterning.
Therefore, the initial dataset was reduced to a smaller one that included only variables with great
overall effect on IBI (large-scale variables or environmental gradients). The number of SOM
neurons was again determined with the topographic error. SOMs with 72 (6 × 12) and 78
(6 × 13) neurons were used in Ohio with instream and offstream data respectively. SOMs with 48
(6 × 8), 70 (7 × 10), and 45 (5 × 9) neurons were used in Maryland’s coastal, highland, and piedmont
sites respectively.
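The two selection criteria can be expressed as a simple greedy filter. This is a sketch under one stated assumption: a candidate is dropped when it is highly correlated with a more relevant variable that has already been retained. The correlation values in the example are taken from the Ohio instream results reported in Section 3.2.1; pairs not listed are treated as uncorrelated purely for illustration.

```python
def select_gradients(ibi_r, cross_r, ibi_cut=0.5, cross_cut=0.8):
    """Keep variables with |r_IBI| >= ibi_cut that are not cross-correlated
    (|r| >= cross_cut) with an already retained, more relevant variable.

    ibi_r maps variable name -> r with IBI; cross_r maps a frozenset of two
    variable names -> their variable-variable r (missing pairs count as 0).
    """
    candidates = sorted((v for v in ibi_r if abs(ibi_r[v]) >= ibi_cut),
                        key=lambda v: abs(ibi_r[v]), reverse=True)
    kept = []
    for v in candidates:
        if all(abs(cross_r.get(frozenset((v, k)), 0.0)) < cross_cut for k in kept):
            kept.append(v)
    return kept

ibi_r = {"embeddedness": -0.861, "riffle quality": 0.815, "gradient score": 0.711,
         "DO": 0.664, "total iron": -0.52, "pH": 0.382}
cross_r = {frozenset(("embeddedness", "riffle quality")): -0.965,
           frozenset(("DO", "total iron")): -0.850}
kept = select_gradients(ibi_r, cross_r)
print(kept)
```

Passing `ibi_cut=0.4` reproduces the relaxed criterion used for Ohio with offstream data.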
3.1.7. Site patterning based on ‘large-scale’ variables and associated biotic
responses
The neurons from the 2nd SOM patterning were grouped into different clusters of similar units
with an agglomerative Hierarchical Clustering (HC) using the average linkage method and the
standardized Euclidean distance (Jain et al. 1999). The neuron-based SOM weights for each of
the variables used in the 2nd SOM patterning were used for this purpose. Therefore, groups with
different environmental characteristics were obtained and the corresponding IBI observations in
each one of these groups retrieved for analysis. The final number of groups in the hierarchical
structure was the maximum number of statistically different biotic responses (determined by the
group IBI) these variables were able to segregate with no or little overlapping among groups (see
Figure 3-3). The process started with 2 groups. If these two groups of observations yielded 2
different biotic responses, three groups were tested and so forth. An ANOVA F-test at the 95%
confidence level was performed to test the null hypothesis that the groups’ IBI means were
equal. If the null hypothesis was rejected (p<0.05), Multiple Range Tests (MRT) using the
Fisher’s Least Significant Difference (LSD) method at the 95% confidence level were
performed. This consists of a pair-wise comparison of the group IBI means. Thus, statistically
different biologic qualities corresponding to different environmental conditions were separated.
The number of groups that yielded the clearest separation of biologic responses (i.e. with the
largest possible number of IBI categories with no or little overlap among groups) was selected.
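The F-ratio behind the ANOVA step can be computed directly from the group IBI observations. The sketch below uses only the standard library and returns the statistic with its degrees of freedom; obtaining the p-value requires the F distribution, which a statistics package would supply (for example `scipy.stats.f_oneway`). The IBI values are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA: return (F-ratio, df_between, df_within)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

# Hypothetical IBI scores for three candidate groups of sites
f_ratio, df_b, df_w = anova_f([[20, 24, 28], [30, 34, 38], [40, 44, 48]])
print(f_ratio, df_b, df_w)
```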
Figure 3-3. Example of a hierarchical tree of the 2nd SOM neurons (left) and analysis of differences among group biologic responses (right). On the right, an example of the MRT analysis is shown: overlap indicates non-significant differences in group IBI means, and no overlap indicates significantly different group IBI means. In this case, the Level 4 partition would be chosen because it yields the largest number of different biotic responses (5) with less overlap than Level 5 (figure for clarification purposes only).
3.1.8. Site patterning based on ‘small-scale’ variables and associated biotic
response
To account for the potential effect of variables acting at a local scale, each group obtained in the
previous step was subdivided using small-scale variables one at a time. Small-scale variables
were those with an absolute neuron-based variable-IBI correlation coefficient smaller than 0.5 (or
0.4 in the case of Ohio with offstream data, where local land use and point source metrics were also included).
Again, this process was executed in a hierarchical manner. The order with which the different
variables were tested was determined by the variable ranking from the neuron-based variable-IBI
correlation coefficient in the initial SOM analysis. If the subgroups’ biologic responses were
statistically different according to ANOVA, the main group was split into new subgroups,
otherwise it was not split. This procedure was repeated with all the available variables that were
not used in the 2nd SOM training and were not highly correlated to other variables. Figure 3-4
and Figure 3-5 show a flow chart summarizing this methodology.
3.1.9. IBI response curve development for different levels of watershed
characterization
Each group obtained at each level of clustering represents a separation of the IBI responses given
different environmental conditions. An assumption made in the present research is that the biotic
community response would follow a Gaussian distribution if the environmental characteristics of
the groups were homogeneous enough. Normality can be achieved at different levels of group
characterization. This condition would be achieved depending on what part of the overall
biologic variability is explained with the identified group stressors. Departure from normality
would mean that the current level of characterization is not enough because heterogeneous
conditions produce the existence of different populations. Groups that follow a Gaussian
distribution are indicative of more homogeneous conditions and further subdivisions would lead
to at least one new, narrower normal distribution because the system is defined in greater detail.
To confirm the normality condition in the different groups, the group cumulative distribution
function (CDF) was plotted in a normal probability plot. A straight CDF would be an indication
of normality (Chambers et al. 1983). Important deviations from the straight pattern would
indicate that the group was not homogeneous enough to guarantee this condition. Moreover, a
Jarque-Bera statistical test for normality at the 95% confidence level was also performed in each
group (Jarque and Bera 1987). This test was chosen over more traditional ones such as the
Kolmogorov-Smirnov test because the group distribution was unknown. The Lilliefors test for
normality was also rejected because it required large amounts of data in order to be performed.
The Jarque-Bera test is considered more robust and is based on the sample skewness and
kurtosis. Some authors recommend this test over the rest (Gujarati 2003; Judge et al. 1985).
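The Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = n/6 · (S² + (K − 3)²/4) and is compared with a chi-square distribution with two degrees of freedom. A minimal sketch follows; it relies on the asymptotic chi-square approximation, which can be poor for small groups, and the sample values are hypothetical.

```python
import math

def jarque_bera(sample):
    """Return (JB statistic, asymptotic p-value) for a sample.

    Uses population (biased) central moments; the p-value is the chi-square
    survival function with 2 degrees of freedom, which equals exp(-JB / 2).
    """
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    return jb, math.exp(-jb / 2)

jb, p_value = jarque_bera([1, 2, 3, 4, 5])
print(jb, p_value)  # normality is rejected at the 95% level when p_value < 0.05
```

Library implementations such as `scipy.stats.jarque_bera` provide the same statistic.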
For the sake of brevity, only the normal probability plots at one level of system characterization
were plotted in each case. This level corresponded to the main biologic signatures identified after
clustering the 2nd SOM neurons (in Maryland [Figure 3-17, Figure 3-21, and Figure 3-25] and
Ohio using instream data [Figure 3-7]) or the biologic responses found after clustering with site’s
percentage of fragmented stream network (Ohio using offstream data [Figure 3-11]).
Figure 3-4. Flow chart summarizing the methodology used to characterize response of the biologic community to similar environmental characteristics and stressors (Maryland and Ohio with instream data)
Figure 3-5. Flow chart summarizing the methodology used to characterize response of the biologic community to similar environmental characteristics and stressors (for Ohio with offstream data)
3.1.10. Development of biotic response reference curves
The IBI observations above the 75th percentile in each group (at the selected level of system
characterization) were separated and considered as group reference conditions. The IBI 75th
percentile was selected arbitrarily; another reference percentile could be selected if
more or less stringent criteria were to be met. The IBI response in sites above the 75th percentile
was considered to resemble pristine or realistically achievable conditions, and these sites were therefore
considered reference sites. New CDF curves for the reference and impaired scenarios were
developed.
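The reference/impaired split can be sketched as follows. The percentile convention (linear interpolation between order statistics) is an assumption, since the text does not specify one, and observations exactly at the cut-off are treated here as impaired. The scores are hypothetical.

```python
def percentile(values, q):
    """Percentile by linear interpolation between order statistics (q in [0, 100])."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def split_reference(ibi_scores, q=75):
    """Return (cut-off, reference scores, impaired scores) for one group."""
    cut = percentile(ibi_scores, q)
    reference = [v for v in ibi_scores if v > cut]
    impaired = [v for v in ibi_scores if v <= cut]
    return cut, reference, impaired

# Hypothetical IBI scores within one homogeneous group
cut, reference, impaired = split_reference([20, 24, 28, 30, 32, 36, 40, 44, 48])
print(cut, reference, impaired)
```

Changing `q` gives the more or less stringent reference criteria mentioned above.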
Departure from reference conditions was evaluated in each group. Student’s t-tests at the 95%
confidence level were performed to test the null hypothesis that the reference and impaired group
means for the different environmental variables were equal.
3.2. Results and discussion
3.2.1. Ohio: instream data
3.2.1.1. Biotic response separation
The correlation matrix of the neuron-based environmental weight vectors and the neuron-based
average IBI vector after the initial SOM training is shown in Figure 3-6.
Figure 3-6. Correlation matrix of the variable neuron-based weights and neuron-based average IBI values in the trained SOM.
The variables that showed a relevant influence on IBI (|r| ≥ 0.5) were, in decreasing order:
embeddedness (r = -0.861), riffle quality (r = 0.815), substrate quality (r = 0.81), channel quality
(r = 0.789), QHEI (r = 0.789), cover quality (r = 0.732), pool quality (r = 0.722), gradient score
(r = 0.711), DO (r = 0.664), TKN (r = - 0.63), riparian quality (r = 0.625), ammonia (r = - 0.62),
total arsenic (r = -0.61), BOD (r = -0.61), nitrite (r = -0.57), sulfate (r = -0.54), drainage area (r =
0.54), and total iron (r = -0.52).
The significant variables that were subsequently eliminated due to cross-correlation with more
relevant variables (|r| ≥ 0.8) were riffle, substrate, channel, cover, and pool qualities, and QHEI,
which were highly correlated to embeddedness (r = -0.965, -0.926, -0.920, -0.856, -0.823, and
-0.890 respectively). Ammonia, total arsenic, and BOD were correlated to TKN (r = 0.961, 0.893,
0.963 respectively). Total iron was negatively correlated to DO (r = -0.850). Thus, the remaining
variables for the second SOM patterning were embeddedness, gradient score, DO, TKN, riparian
quality, nitrite, sulfate, and drainage area. Of these, nitrite (NO2-N) was disregarded because we wanted to analyze
the effect of Zn on IBI, and Zn was highly correlated to NO2-N (r = 0.806). A summary of the
variables used is presented in Table 3-5.
The variables used in the second SOM training were considered as environmental gradients or
large-scale variables responsible for the largest part of the biotic variability. Hierarchical
clustering of neurons from the second SOM yielded six groups with different environmental
conditions and five significantly different biologic responses according to ANOVA, as shown in
Table 3-6 and Figure 3-7.
In order to be able to account for environmental variables at the local scale, each of the groups
was clustered using the variables that were deemed not relevant (i.e. neuron-based variable-IBI
|r| < 0.5). The clustering was performed using one variable at a time. New subgroups were
created only if their biologic qualities were significantly different according to ANOVA. The
small-scale variables able to separate sites with different levels of IBI were (in the same order in
which they were patterned): total zinc concentration, pH, and nitrate concentration. Total copper,
TSS, and total cadmium and lead concentrations did not bring any further separation of IBI
responses (see Figure 3-7).
Within the available observations, very large watersheds (group 5 average DA = 2,303.1 km2)
had the best IBI scores (μ = 42.82, σ = 5.81). Headwater streams (DA < 51.8 km2) mainly
belonged to group 3 (average DA = 41.15 km2) and had the worst IBI scores (μ = 24.08, σ =
7.08). Group 3 had the highest values of embeddedness, TKN and sulfate concentrations, and the
second lowest DO and riparian quality. Sites from group 1, with the second smallest average DA
(84.72 km2), also had the second poorest IBI scores after group 3 (μ = 27.84, σ = 8.41). This
might be an indication of greater resilience to degradation of larger watersheds, since Ohio’s IBI
is calibrated with drainage area (Ohio EPA 1987). Positive correlation between IBI and drainage area
in Ohio was also found by Dyer et al. (2000). In Ohio, high levels of total phosphorus (which
was not used due to high correlation to TKN) were associated with poor IBI (Rankin et al. 1999).
The two groups with highest levels of TKN (groups 3 and 1 with mean TKN equal to 2.43 and
1.37 mg/L respectively) had the poorest IBI scores. A summary of the average environmental
variables at each level of characterization is included in Appendix I.
Table 3-5. Neuron-based correlation coefficients between variables and IBI. Variables in bold were able to separate significantly different biotic responses
Variable         IBI-Variable r    Variable         IBI-Variable r
Embed (L)            -0.86         DA (L)                0.54
Riffle (L, C)         0.815        Fe (L, C)            -0.52
Subs (L, C)           0.81         TP (S, C)            -0.48
Chan (L, C)           0.789        Mg (S, C)            -0.47
QHEI (L, C)           0.789        Cu (S, C)            -0.46
Cov (L, C)            0.732        Zn (S)               -0.42
Pool (L, C)           0.722        Cond (S, C)          -0.39
Grad (L)              0.711        pH (S)                0.382
DO (L)                0.664        Cl (S, C)            -0.37
TKN (L)              -0.63         Hard (S, C)          -0.37
Rip (L)               0.625        TSS (S, C)           -0.36
NH4 (L, C)           -0.62         NO3 (S)               0.312
As (L, C)            -0.61         Cd (S)               -0.27
BOD (L, C)           -0.61         Ca (S, C)            -0.25
NO2 (L, D)           -0.57         Temp (S)              0.222
SO4 (L)              -0.54         Pb (S)               -0.08
L = large-scale variables or environmental gradients; S = small-scale variables; C = variables cross-correlated to higher hierarchy variables; D = disregarded variable
Table 3-6. ANOVA (top) and MRT (bottom) analyses for the IBI means in groups after 2nd SOM patterning with environmental gradients shown in Figure 3-7. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups.
ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares    Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups        15825.0         5     3165.0        53.95      0.0000
Within groups         24816.5       423       58.6677
-----------------------------------------------------------------------------
Total (Corr.)         40641.5       428

Multiple Range Tests
--------------------------------------------------------------------------------
          Count     Mean        Homogeneous Groups
--------------------------------------------------------------------------------
IBI3        35      24.0        X
IBI1        87      27.8391      X
IBI6        69      28.4348      X
IBI4        71      31.0423       X
IBI2       111      38.6667        X
IBI5        56      42.8214         X
--------------------------------------------------------------------------------
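As a check on the ANOVA entries in Table 3-6, the F-ratio follows from the reported sums of squares and degrees of freedom. Only values printed in the table are used.

```latex
\[
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{15825.0 / 5}{24816.5 / 423}
  = \frac{3165.0}{58.6677}
  \approx 53.95
\]
```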
Clustering with small-scale variables resulted in the creation of 10 new subgroups. Clustering
with total Zn resulted in the separation of a few sites (3 in group 11 and 4 in group 32) in which
zinc seemed to be the cause of degradation (mean concentration equal to 178.67 and 44.75 µg/L
in groups 11 and 32 respectively versus 21.26 and 15.80 in groups 12 and 31 respectively [see
Figure 3-7 and Appendix I]). Clustering with pH values resulted in the creation of four new
subgroups after dividing groups 12 and 31; subgroup 121 (mean pH = 7.82 with 74 sites) versus
subgroup 122 (mean pH = 8.84 with 10 sites), and subgroup 311 (mean pH = 7.62 with 25 sites)
versus 312 (mean pH = 8.30 with 5 sites). Finally, nitrate concentration yielded two new groups
out of group 6; subgroup 61 (mean [NO3] = 8.33 mg/L and 32 sites) versus subgroup 62 (mean
[NO3] = 0.62 mg/L and 37 sites). Even though nitrate is a nutrient-related parameter and one
might think it should be more an environmental gradient than a small-scale variable, a clear
relationship between IBI and NO3-N concentration has not been found in Ohio (Rankin et al.
1999). High concentrations of NO3-N can be associated with the presence of wastewater treatment
plants (WWTP) or intensive agriculture tile drainage. Negative effects should not be observed
until the median NO3-N concentration is greater than 3-4 mg/L (Rankin et al. 1999). Average
NO3-N concentrations in groups 11 and 61 surpassed this threshold and had significantly poorer
IBI than their homologues (groups 12 and 62) (see Appendix I). All the new subgroups obtained
with small-scale variables had statistically significant differences in IBI means in the ANOVA
test. A summary of the results after clustering with large and small-scale variables is shown in
Figure 3-7 and Appendix I.
One of the assumptions of our research was that the biological response would follow a normal
distribution if biota’s exposure to environmental conditions is homogeneous enough after the
system reaches a steady state (i.e. even if a specific group is far from reference conditions, its
biologic community has adapted to that level of stress by changing its structure). This
hypothesis was confirmed when the tests for normality were performed. With the full database
(i.e. with highly heterogeneous conditions), the tests for normality indicated that the IBI scores
did not follow a normal distribution. However, when the data were divided into more
homogeneous groups using large and small scale variables, they followed a normal distribution.
Groups 11 and 12 were the only two exceptions. However, group 11 included only three
observations and therefore, the result was most likely due to a non-representative sample of the
group distribution. Subdivision of group 12 resulted in two subgroups that followed normal
distributions.
Figure 3-7. Groups and subgroups with different biological responses after clustering with large and small-scale environmental filters. Red color marks groups that did not pass normality tests. Blue color indicates groups that passed the normality tests.
The normal probability plots of the six groups from large-scale variables are shown in Figure
3-8. Similar plots could be obtained for each subgroup obtained with small-scale variables if
enough data were available. Unfortunately, this was not the case for some groups (only 3
observations in group 11, 4 in group 32, 10 in group 122, and 5 in group 312). These plots are a
characterization of the biotic response after passing through the specified environmental filters
(Figure 3-8 shows responses after large-scale filters are passed). With this methodology it is
possible to isolate the response of the biotic community to a specific stressor in a hierarchical
manner (i.e. the specific effect of a stressor will be revealed together with other relevant stressors
with a higher hierarchy in the tree as shown in Figure 3-7).
3.2.1.2. Reference conditions for similar environmental sites and potential causes for departure
Reference sites represent the environmental conditions that could potentially be met by other
sites with similar characteristics. They could also be a potential framework for development of
biologic standards, similar to water quality standards in which a maximum probability of
exceedance is set with a log-normal probability plot. Exceedance probabilities beyond a set
threshold would represent a violation of the standard (Novotny 2004). Analysis of significant
differences between reference and impaired sites may indicate the likely causes of departure
from reference conditions. However, impairment is usually not the result of a single, isolated
stressor, but of a highly intertwined combination of environmental factors structured in a
hierarchical manner that propagate through the hierarchy producing a response of the biotic
endpoint (Novotny 2003).
The 75th IBI percentile was identified in each group and sites beyond this threshold were
considered as reference sites. This corresponded to IBI scores equal to 32, 44, 28, 34, 48, and 34
for groups 1 through 6 respectively. In some of these groups the IBI was far from being
considered as good (48 ≤ IBI ≤ 52) (Karr et al. 1986; Rankin et al. 1990). However, reference
conditions do not necessarily refer to pristine environments but to least impacted watersheds
within a highly homogeneous group. Pristine or undisturbed streams do not really exist in Ohio
anymore. The character of the reference sites should reflect the reasonably attainable biological
conditions within a particular homogeneous group given the prevailing background conditions
(Ohio EPA 1987). The 75th IBI percentile was selected because departure from normality was
observed at that point in groups 1, 3, and 4 (Figure 3-8). This was interpreted as an abrupt shift in
all or some of the environmental gradients from impaired to reference conditions (i.e. a break or
gap in the environmental gradient ‘continuum’). Therefore, the biologic community responded
somewhat differently in these sites. Groups 2 and 6 showed the best overall goodness-of-fit in
the plots. Group 2 did not undergo any further division with new environmental filters (see Figure
3-7), which could indicate a highly continuous biotic response to the environmental gradients
and absence of significant local stressors. Group 6 was divided into two groups of similar size using nitrate
concentration (32 and 37 observations in groups 61 and 62 respectively). This was interpreted as
a shift in the biologic response due to increased nutrient loading. When a threshold concentration
was surpassed, the biologic community response was somewhat different.
However, the shift in behavior was not sudden enough as to be perceived in the normal
probability plots for level I groups (see group 6 in Figure 3-8).
Figure 3-9 shows the curves for reference and impaired sites in each one of the six groups. The
t-tests found significant differences between reference and impaired sites in groups 1, 2, and 5. No
significant differences were observed within the rest of the groups (Table 3-7). TKN was the
only environmental gradient that showed consistently better results in reference sites, and can be
an indication of progressive degradation due to increased nutrient input with changing land uses.
Embeddedness was also better in all reference sites with the exception of group 6. Sediment (and
therefore, all substrate-related habitat parameters) and nutrient input have been identified as the
most relevant factors for biotic degradation and are intimately related to land use and hydrologic
changes in the drainage and buffer areas, especially at the catchment scale (Allan 2004; Archer
and Newson 2002; Dyer et al. 1998a; Gilvear et al. 2002; Hall and Killen 2005; Manolakos et al.
2007; Richards et al. 1996; Shields et al. 2006; Yuan and Norton 2004). All the environmental
gradients identified in the research are directly or indirectly related to substrate quality (i.e.
embeddedness, DA, gradient, riparian quality) and nutrient input (DO, TKN, and sulfate
concentration). Even though the t-tests did not show statistically significant differences among
most of the variables, TKN and embeddedness seem to be the most consistent variables in the
differentiation of reference and impaired environmental qualities in Ohio.
Figure 3-8. Normal distribution probability plots for groups 1 through 6. Red line indicates 75th IBI percentile. Points to the right of the red line were considered as reference observations for the respective group of sites and separated.
Table 3-7. 95% confidence intervals for the environmental variable means in reference and impaired sites. Text in bold indicates statistically significant differences for that variable and group according to the t-tests ( p = 0.05)
Sites       Variable            Group 1            Group 2            Group 3
Reference   Drainage area       16.32 ± 6.65       65.20 ± 34.49      9.28 ± 3.639
Reference   Dissolved Oxygen    8.68 ± 1.04        7.69 ± 0.59        5.98 ± 2.15
Reference   Embeddedness        3.28 ± 0.35        2.15 ± 0.17        3.67 ± 0.33
Reference   Gradient score      6.36 ± 1.08        8.79 ± 0.57        8.44 ± 1.28
Reference   Riparian score      3.29 ± 0.29        6.58 ± 0.47        4.11 ± 0.50
Reference   Sulfate concent.    90.92 ± 25.39      80.82 ± 17.07      300.33 ± 144.78
Reference   TKN                 1.31 ± 0.97        0.48 ± 0.21        2.18 ± 1.61
Impaired    Drainage area       38.26 ± 10.29      65.98 ± 19.01      18.18 ± 9.05
Impaired    Dissolved Oxygen    8.53 ± 0.75        8.08 ± 0.40        7.02 ± 1.24
Impaired    Embeddedness        3.69 ± 0.12        2.36 ± 0.11        3.85 ± 0.15
Impaired    Gradient score      4.92 ± 0.35        9.16 ± 0.30        7.61 ± 0.60
Impaired    Riparian score      3.61 ± 0.46        5.73 ± 0.38        3.61 ± 0.47
Impaired    Sulfate concent.    155.14 ± 30.10     102.88 ± 13.93     254.58 ± 72.65
Impaired    TKN                 1.38 ± 0.43        0.56 ± 0.08        2.52 ± 1.69

Sites       Variable            Group 4            Group 5            Group 6
Reference   Drainage area       29.79 ± 9.13       703.75 ± 435.57    152.98 ± 65.21
Reference   Dissolved Oxygen    7.26 ± 0.78        8.63 ± 1.14        7.69 ± 0.78
Reference   Embeddedness        3.58 ± 0.21        2.21 ± 0.19        3.56 ± 0.24
Reference   Gradient score      5.89 ± 0.23        9.57 ± 0.49        8.82 ± 0.52
Reference   Riparian score      5.17 ± 0.75        5.82 ± 0.75        5.71 ± 1.03
Reference   Sulfate concent.    67.22 ± 22.10      62.36 ± 24.18      138.42 ± 40.55
Reference   TKN                 0.55 ± 0.21        0.45 ± 0.10        0.76 ± 0.28
Impaired    Drainage area       465.99 ± 387.76    992.23 ± 541.09    90.61 ± 19.14
Impaired    Dissolved Oxygen    6.43 ± 0.48        8.41 ± 0.58        7.72 ± 0.48
Impaired    Embeddedness        3.70 ± 0.11        2.30 ± 0.14        3.43 ± 0.16
Impaired    Gradient score      6.04 ± 0.27        9.29 ± 0.30        9.15 ± 0.30
Impaired    Riparian score      6.48 ± 0.59        6.18 ± 0.37        5.87 ± 0.56
Impaired    Sulfate concent.    64.63 ± 9.22       56.90 ± 14.72      207.19 ± 42.88
Impaired    TKN                 0.78 ± 0.12        0.82 ± 0.34        0.93 ± 0.17
The rest of the environmental gradients did not show a clear pattern between reference and
non-reference sites. However, the differences among them were never large. Hence, reference sites
can be initially screened based on their substrate-related parameters (e.g. degree of
embeddedness compared to a reference site) and nutrient inputs (e.g. TKN and phosphorus
levels). Different combinations of the remaining gradients would determine the final biotic integrity.
Land use data in the drainage area and the riparian buffer at different scales would most likely
have helped refine the watershed classification. Different combinations of local and regional land uses
in the drainage area and the riparian buffer are the main regulators of sediment and nutrient
input. Morphologic characteristics (e.g. gradient) can also play a significant role. Unfortunately,
these data were not available at the time this analysis was performed.
Figure 3-9. Normal probability plots for the reference (green) and impaired (red) conditions for the six groups obtained after clustering the SOM neurons with environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group
3.2.2. Ohio: offstream data
3.2.2.1. Biotic response separation
The correlation matrix of the neuron-based regional environmental variable vectors and the
neuron-based average IBI is presented in Figure 3-10.
Figure 3-10. Correlation matrix of the variable neuron-based weights and neuron-based average IBI scores in the trained SOM. Color bar on the right indicates the absolute value of the correlation coefficient. Plus and minus signs indicate positive or negative correlation.
The regional variables that showed a strong effect on biotic integrity (neuron-based IBI-variable
|r| ≥ 0.4) were (in descending order): R30_Forest, R100_Forest, RDA_Forest (r = 0.662, 0.646,
0.579 respectively), R30_Agric (r = -0.483), R30_Barren (r = 0.460), R100_Agric (r = -0.436), and
DW_storMainlen (r = 0.430). Some of these variables were eliminated due to strong correlation
with other more significant variables. R100_Forest, RDA_Forest, and R30_Agric were strongly
correlated to R30_Forest (r = 0.996, 0.910, and -0.826 respectively). Therefore, the variables that
should theoretically have been used for the 2nd SOM patterning were R30_Forest, R30_Barren,
R100_Agric, and DW_storMainlen. However, we decided to disregard the variable
DW_storMainlen because the IBI-variable correlation coefficient sign seemed counter-intuitive
(the correlation was positive, which would mean an IBI improvement with greater dam water
storage capacity per main channel length unit). A negative correlation was expected in this case.
In the present thesis (see Chapter 2) and elsewhere (Dyer et al. 2000) it has been reported that
Ohio’s IBI is positively correlated to drainage area. Hence, fragmentation metrics whose final
units had some element directly or indirectly related to drainage area (e.g. UPS_floodlen
[m2/km]) were deemed biased. For this reason, only unitless fragmentation metrics were kept
for further analysis (i.e. UPS_Floodarea, UPS_Con and SITE_Con). The rest were disregarded.
As a consequence, the only variables used in the 2nd SOM patterning were R30_Forest,
R30_Barren, and R100_Agric.
The remaining variables were considered as regional variables with a local effect (neuron-based
IBI-variable |r| < 0.4) and used for individual, progressive clustering along with local variables
(i.e. local land use and point source metrics). A list of both the large and small-scale variables
and their respective correlation coefficients with IBI is shown in Table 3-8. Strongly cross-correlated
variables, which were therefore discarded, are also identified in Table 3-8.
Clustering of the 2nd SOM neurons using the three-dimensional neuron-based environmental
vectors (R30_Forest, R30_Barren, R100_Agric) segregated two significantly different biologic
responses as indicated by the ANOVA and MRT analysis (Table 3-9).
Table 3-8. Correlation coefficients between the neuron-based regional environmental variables and the neuron-based average IBI scores (left and mid columns) and between raw local variables and IBI scores (right column). Variables in bold were capable of separating significantly different biological responses in the hierarchical structure
Regional variables with widespread impact r Regional variables with
localized impact r Local variables r
R30_Forest 0.662 R30_NonForest -0.391 L100_Forest 0.458 R100_ForestC 0.646 RDA_BarrenC 0.378 L30_ForestC 0.456 RDA_ForestC 0.579 R100_BarrenC 0.365 LDA_ForestC 0.380 R30_AgricC -0.483 RDA_AgricC -0.347 L30_Agri -0.271 R30_Barren 0.460 R100_NonForestC -0.338 L30_NonForest -0.223 R100_Agric -0.436 UPS_Con -0.320 L100_AgriC -0.217
DW_storlengthD 0.430 RDA_NonForestC -0.271 L100_Urban -0.175 DW_MainDFD -0.257 L100_NonForestC -0.159 UPS_storDAC 0.242 LDA_UrbanC -0.153 DW_floodMainlenC 0.226 L30_UrbanC -0.139 DAC 0.225 LDA_NonForestC -0.112 RDA_WaterC 0.179 LDA_AgriC -0.112 SITE_Con 0.170 UPS_floodareaC 0.139 Avg_DFD -0.120 UPS_DFD 0.110 UPS_floodlenD 0.109 RDA_Urban 0.105
C = strongly cross-correlated with a higher hierarchy variable; D = disregarded
Table 3-9. ANOVA (top) and MRT (bottom) analyses to detect significant differences in IBI means between 2nd SOM groups of neurons. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups.
ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares     Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups           1181.57      1        1181.57      12.79     0.0004
Within groups            39539.1    428         92.381
-----------------------------------------------------------------------------
Total (Corr.)            40720.6    429

Multiple Range Tests
--------------------------------------------------------------------------------
Method: 95.0 percent LSD
                  Count      Mean       Homogeneous Groups
--------------------------------------------------------------------------------
IBI2                413      32.5521    X
IBI1                 17      41.0588     X
--------------------------------------------------------------------------------
Contrast                     Difference     +/- Limits
--------------------------------------------------------------------------------
IBI1 - IBI2                  *8.50677       4.67525
--------------------------------------------------------------------------------
* denotes a statistically significant difference.
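The entries of Table 3-9 can be cross-checked with a few lines of arithmetic. The sketch below recomputes the F-ratio from the sums of squares and the Fisher LSD limit from the within-group mean square and the two group sizes. The Student-t critical value is an assumption taken from standard tables (it is not reported in the text), so the result should be read as an approximate check rather than a reproduction of the statistical package output.

```python
import math

# Values transcribed from Table 3-9 (Ohio, 2nd SOM groups of neurons).
ss_between, df_between = 1181.57, 1
ss_within, df_within = 39539.1, 428
n1, n2 = 17, 413                      # site counts in IBI1 and IBI2

ms_between = ss_between / df_between
ms_within = ss_within / df_within     # within-group mean square
f_ratio = ms_between / ms_within

# ASSUMPTION: two-sided 95% Student-t critical value for 428 degrees of
# freedom, read from standard tables (about 1.9655); not given in the text.
T_CRIT_428 = 1.9655

# Fisher's LSD half-width for the difference between the two group means.
lsd_limit = T_CRIT_428 * math.sqrt(ms_within * (1 / n1 + 1 / n2))

mean_diff = 41.0588 - 32.5521
significant = abs(mean_diff) > lsd_limit

print(f"F = {f_ratio:.2f}, LSD limit = {lsd_limit:.3f}, significant = {significant}")
```

Because the difference between the group means exceeds the LSD limit, the two groups are flagged as significantly different, in agreement with the non-overlapping X's in the table.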
Observations in each of the two main groups obtained with the most significant regional
variables were clustered with the remaining regional and local variables, one at a time, in the order
shown in Table 3-8. Figure 3-11 shows a diagram of the final hierarchical tree, in which the
initial two main groups were progressively split using the remaining variables. The subgroups
shown are those with statistically different biologic responses, and the variables at each level are
those responsible for the differences and used to partition the groups.
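The progressive partitioning described above can be sketched as follows. This is only an illustration under stated assumptions: the one-dimensional two-cluster split (`two_means_split`), the fixed critical F value supplied by the caller, the minimum group size, and the restriction to binary splits are all hypothetical simplifications, since this section does not specify the clustering algorithm and some groups in Figure 3-11 were split into three subgroups.

```python
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F-ratio for a list of groups of IBI scores."""
    scores = [x for g in groups for x in g]
    grand = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def two_means_split(values, iterations=50):
    """Hypothetical 1-D two-cluster split: threshold midway between centroids."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        cut = (lo + hi) / 2
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        if not low or not high:
            break
        lo, hi = mean(low), mean(high)
    return (lo + hi) / 2

def progressive_split(sites, variables, f_crit, min_size=3):
    """Split groups one variable at a time, keeping only significant splits."""
    groups = {"": sites}
    for var in variables:
        next_groups = {}
        for label, members in groups.items():
            cut = two_means_split([s[var] for s in members])
            low = [s for s in members if s[var] <= cut]
            high = [s for s in members if s[var] > cut]
            accepted = (
                len(low) >= min_size
                and len(high) >= min_size
                and f_ratio([[s["IBI"] for s in low],
                             [s["IBI"] for s in high]]) > f_crit
            )
            if accepted:
                # Child labels extend the parent label, e.g. "2" -> "21", "22".
                next_groups[label + "1"] = low
                next_groups[label + "2"] = high
            else:
                next_groups[label] = members
        groups = next_groups
    return groups
```

Each site is assumed to be a dictionary holding an `"IBI"` score and one entry per environmental variable. A split is retained only when the IBI means of the two subgroups differ according to the F-ratio, which mirrors the rule that only subgroups with statistically different biologic responses appear in the tree.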
Figure 3-11. Hierarchical diagram of habitats with significantly different biotic responses. On the right, list of environmental variables used to segregate biotic signatures at each step. Rectangles in blue indicate groups that passed normality test. Rectangles in red indicate groups that did not pass normality test.
The biologic responses of the groups obtained after clustering with the fragmentation metric “site
percentage of connected network” (SITE_Con) were plotted in a normal probability plot
(Figure 3-12). Such a plot could have been created for each level of clustering; only one level is
characterized here for the sake of brevity.
Figure 3-12. Normal distribution probability plots for the biologic signatures after clustering sites with SITE_Con. Group 212 did not pass the Jarque-Bera test of normality at the 95% confidence level (see Figure 3-11). Group 221 was not plotted because it had only 4 observations.
Figure 3-13. Example of biologic response separation by segregation of sites with environmental variables. Group 222 splits into groups 2221 and 2222 (group 2222 not normally distributed) after clustering with RDA_Urban. Group 2222 splits into groups 22221 and 22222 (both normally distributed) after clustering with R30_Agri.
3.2.2.2. Reference conditions for similar environmental sites and potential causes for departure
The IBI 75th percentile in groups 1, 211, 212, 222, and 223 was 44, 46, 40, 40, and 32 respectively.
Values above and below the 75th percentile were considered reference and non-reference
respectively for each of the groups and are plotted in Figure 3-14. The analysis of group differences is
presented in Table 3-10.
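The reference/non-reference separation within a group can be sketched as below. Two details are assumptions, because the text does not state them: the percentile interpolation rule (the "inclusive" method of the Python standard library is used here) and the treatment of sites scoring exactly at the threshold (assigned here to the non-reference set). The function name `split_reference` is hypothetical.

```python
from statistics import quantiles

def split_reference(ibi_scores, percentile=75):
    """Return (threshold, reference, non_reference) for one homogeneous group."""
    # quantiles(..., n=100) returns the 1st to 99th percentile cut points.
    # ASSUMPTION: 'inclusive' interpolation; the source does not say which
    # percentile definition was used.
    threshold = quantiles(ibi_scores, n=100, method="inclusive")[percentile - 1]
    # ASSUMPTION: scores equal to the threshold count as non-reference.
    reference = [x for x in ibi_scores if x > threshold]
    non_reference = [x for x in ibi_scores if x <= threshold]
    return threshold, reference, non_reference

# Illustrative scores only (not data from the study).
threshold, reference, non_reference = split_reference(
    [12, 20, 28, 32, 36, 40, 44, 48, 52]
)
print(threshold, reference, non_reference)
```

Applying the function separately to each group keeps the reference condition relative to sites with similar environmental settings, which is the purpose of the percentile thresholds quoted above.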
Figure 3-14. Normal probability plots for the reference (green) and impaired (red) conditions for the groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution (Group 212 was fitted to a Gaussian distribution only for demonstration purposes)
Table 3-10. 95% confidence intervals and ANOVA test between reference and non-reference sites in variables used in the separation of biotic responses
                 Group 1                                      Group 211
Variable         Reference        Non-reference    p          Reference        Non-reference    p
R30_Forest       39.13 ± 7.28     39.93 ± 10.26    0.882      41.99 ± 7.62     40.13 ± 4.34     0.655
R30_Barren       0.22 ± 0.09      0.26 ± 0.26      0.755      0.00 ± 0.00      0.00 ± 0.00      0.367
R100_Agri        61.36 ± 6.38     59.09 ± 6.02     0.600      56.67 ± 6.73     60.71 ± 4.05     0.296
R30_NonForest    0.41 ± 0.40      0.37 ± 0.38      0.881      0.33 ± 0.20      0.38 ± 0.17      0.752
SITE_Con         20.49 ± 17.53    49.71 ± 33.82    0.082      80.13 ± 10.97    90.39 ± 3.49     0.016*
RDA_Urban        9.42 ± 4.25      7.79 ± 3.58      0.511      14.40 ± 8.27     9.82 ± 3.16      0.188
L100_Forest      50.62 ± 9.85     48.69 ± 17.73    0.821      62.89 ± 10.84    44.89 ± 6.23     0.004*
L30_Agri         17.87 ± 9.66     25.93 ± 18.51    0.363      15.69 ± 6.25     30.03 ± 7.22     0.026*
L30_NonForest    0.45 ± 0.80      2.16 ± 2.60      0.138      0.34 ± 0.50      0.48 ± 0.43      0.701
L100_Urban       11.13 ± 5.70     12.97 ± 10.43    0.713      10.19 ± 3.68     10.56 ± 4.00     0.915

                 Group 212                                    Group 222
Variable         Reference        Non-reference    p          Reference        Non-reference    p
R30_Forest       38.13 ± 5.04     19.70 ± 2.58     0.000*     28.49 ± 9.40     19.20 ± 2.05     0.003*
R30_Barren       0.00 ± 0.00      0.01 ± 0.01      0.016*     0.01 ± 0.01      0.01 ± 0.00      0.294
R100_Agri        52.13 ± 5.46     61.80 ± 3.78     0.007*     65.03 ± 9.50     66.76 ± 3.10     0.640
R30_NonForest    0.80 ± 0.14      0.87 ± 0.10      0.478      3.28 ± 0.34      3.00 ± 0.17      0.114
SITE_Con         12.42 ± 2.37     15.79 ± 2.11     0.080      37.52 ± 8.09     33.62 ± 0.91     0.092
RDA_Urban        16.45 ± 3.53     19.52 ± 3.17     0.286      7.99 ± 1.93      18.78 ± 2.99     0.000*
L100_Forest      42.79 ± 5.82     24.22 ± 3.37     0.000*     33.13 ± 11.95    14.59 ± 3.59     0.000*
L30_Agri         30.52 ± 6.50     44.94 ± 4.54     0.000*     42.08 ± 14.62    36.27 ± 10.54    0.559
L30_NonForest    0.42 ± 0.19      1.41 ± 0.38      0.003*     2.26 ± 2.31      2.70 ± 1.26      0.728
L100_Urban       18.12 ± 3.84     21.91 ± 3.45     0.226      12.00 ± 4.84     37.80 ± 10.19    0.005*

                 Group 223
Variable         Reference        Non-reference    p
R30_Forest       20.18 ± 7.80     13.57 ± 4.69     0.134
R30_Barren       0.00 ± 0.00      0.00 ± 0.00      0.884
R100_Agri        55.98 ± 16.64    67.94 ± 7.97     0.138
R30_NonForest    25.91 ± 8.83     26.60 ± 4.80     0.882
SITE_Con         3.83 ± 1.47      3.29 ± 0.82      0.491
RDA_Urban        30.48 ± 18.01    16.27 ± 5.82     0.043*
L100_Forest      6.76 ± 2.41      12.97 ± 5.46     0.173
L30_Agri         45.35 ± 23.33    58.89 ± 10.69    0.216
L30_NonForest    3.10 ± 2.70      5.56 ± 2.08      0.190
L100_Urban       44.45 ± 24.66    21.20 ± 7.68     0.015*
* Indicates a statistically significant difference at the 95% confidence level (p < 0.05)
The hypothesis of normality in environmentally homogeneous groups worked well: most of
the groups’ IBI scores followed a Gaussian distribution at some point of the hierarchical
partitioning process (Figure 3-11). As expected, the IBI scores in the full database did not follow a
Gaussian pattern, because environmental heterogeneity mixed different biologic
signatures. Only one group (group 21221) out of the fifteen groups obtained after the last partition did not pass
the normality test. This group was composed of 98 sites and still had a wide IBI range (minimum
and maximum IBI equal to 12 and 52 respectively). Therefore, homogeneity could not be
achieved within this group using offstream variables. Use of other stressor types (i.e. instream
features) or a mix of instream and offstream variables would most likely solve this issue.
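The normality check applied to each group (the Jarque-Bera test at the 95% confidence level, as noted in the caption of Figure 3-12) can be sketched in a few lines. The statistic combines sample skewness and kurtosis and is compared with the chi-square critical value for two degrees of freedom. This is a minimal sketch that relies on the asymptotic chi-square approximation, which is rough for small groups; the function names are hypothetical and the original analysis may have used a statistical package with small-sample corrections.

```python
from statistics import mean

# 95th percentile of the chi-square distribution with 2 degrees of freedom.
CHI2_CRIT_95_DF2 = 5.991

def jarque_bera(values):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n   # biased variance
    m3 = sum((x - mu) ** 3 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)

def passes_normality(values):
    """True when normality is not rejected at the 95% confidence level."""
    return jarque_bera(values) < CHI2_CRIT_95_DF2
```

A group with a strongly skewed or heavy-tailed IBI distribution, such as a mixture of two distinct biologic signatures, produces a large statistic and fails the test, which is the behavior used above as evidence of remaining environmental heterogeneity.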
Separation of reference and non-reference sites within each group targeted the main issues that
need to be addressed in order to achieve realistic integrity goals within homogeneous groups.
Some of the groups had problems at the local scale (e.g. group 211 in Table 3-10), but most of
them had significant differences at both the regional and local scales (groups 212, 222, and 223).
Groups 1 and 221 (group 221 is not included in Table 3-10 because it had only 4 observations) showed no
differences because they were highly homogeneous with good biotic integrity (average IBI equal to 41 and
48.5 respectively). Results at the level of partition shown need to be read carefully. For example,
in group 223 the only significant differences were found in the regional and local urban land
uses, which were surprisingly higher in reference sites. However, the percentage of forest in the
regional buffer was higher in reference sites, and agriculture in the local and regional
buffers was lower in reference sites, although these differences were not statistically significant.
Selection of a more stringent reference threshold would most likely identify these differences as
significant.
Regional land use
The model identified two main groups using the most important regional variables (i.e. forest and
barren land in the 30-meter buffer and agriculture in the 100-meter buffer). Group 1 was
composed of only 17 sites. Eleven of them were located on the main stem of the East Fork of the Little
Miami River, three on the Scioto River’s main stem, and the remaining three
in the Huron River watershed (see Figure 3-1). The remaining 412 sites that
composed group 2 were distributed among the Western Lake Erie, Wabash River, Middle-Ohio
and Little Miami Rivers, Muskingum River, and Scioto River basins. The main differences
between the two groups were the percentage of forested area in the regional 30-meter buffer (average
forested land equal to 39.5 versus 24.9 percent in groups 1 and 2 respectively) and the
percentage of barren land in the regional 30-meter buffer (0.24 versus 0.01% in groups 1 and 2
respectively). The percentage of agriculture within the regional 100-meter buffer was very
similar in both groups (average percentage of agriculture equal to 60.3 and 60.9 in groups 1 and
2 respectively). As expected, the average IBI was higher in group 1 than in group 2 (41.1 versus
32.6 respectively). The importance of protective vegetated buffers of at least 15 meters in order
to preserve wetlands, streams, and other aquatic resources is widely recognized (Castelle et al.
1994). Proper management of vegetated buffers is particularly important in order to avoid the
negative effects of sedimentation on the fish community (Rabeni and Smale 1995). Moreover, the
results showed that proximity to the water body is also a significant factor affecting integrity. At
the regional level, the most relevant land uses consistently showed slightly better predictions
with the 30-meter buffer than with the 100-meter buffer.
Group 1 didn’t undergo any further segregation of biologic responses with any of the subsequent
partitions using localized environmental variables. Group 1 IBI scores were rather homogeneous
(25th, 50th and 75th IBI percentiles equal to 36, 44, and 44 respectively). On the other hand, group
2 was subdivided due to localized regional effects (Figure 3-11). Group 2 had a much greater
variability of offstream features than group 1 did and a smaller part of the biotic variability could
be explained with large-scale variables only. Subsequent divisions with watershed and watershed
buffer land uses (R30_NonForest, RDA_Urban) successfully separated sites with different
regional features and therefore, different biologic responses were identified.
Immediate non-forested lands (which included herbaceous and shrub/scrub lands) were able to
separate group 2 into two main groups. Increased presence of non-forest translated into reduced
biotic quality (mean IBI equal to 28.5 in group 21 versus 34.0 in group 22). This was most likely
related to smaller presence of forested land in the 30-meter buffer and larger agricultural
coverage in group 21 (average forested and agricultural land was 27 and 59 percent and 19 and
66 percent in groups 21 and 22 respectively).
Watershed urbanization segregated 4 groups nested in group 22 (group 2221 and 2222, and
group 2231 and 2232). The biological responses to watershed urbanization were diverse. Groups
with similar percentages of forested and agricultural lands in the regional buffer (i.e. groups 2221
and 2222) showed a negative biologic response to increased watershed urbanization as expected
(Table 3-11). However, the opposite response was observed between groups 2231 and 2232. This
suggested that highly urbanized watersheds (i.e. group 2232) can achieve better integrity than
less urbanized ones if the regional buffer keeps its protective functions intact. Of special
importance are its vegetated areas. Importance of vegetated regional buffers was also revealed in
groups 221 and 1, which had the top two mean IBI scores and also had the highest percentage of
forest in the 30-meter regional buffer. Group 211 had significantly better average IBI than group
212. The regional buffer characteristics were quite similar in both groups, but the percentage of
forested land was higher in group 211 (Table 3-11). Therefore, the importance of regional
buffers as the main regulator of biologic integrity was revealed. However, when similar buffer
characteristics exist at two different sites, IBI fluctuations are determined by land use beyond the buffer,
as observed in groups 2221 and 2222. In general, sites within the same stream order belonged to
only one group due to homogeneous regional characteristics (Figure 3-15).
The results from the regional land use analysis agreed strongly with current research. Regional
buffers were found to be better predictors of sediment-related habitat variables than the whole
catchment area (Richards et al. 1996). Degradation of habitat, and especially of substrate quality, has
been strongly associated with negative effects on aquatic fauna in Ohio (Dyer et al. 2000; Dyer et
al. 1998a; Manolakos et al. 2007; Norton et al. 2000; Norton et al. 2002; Yuan and Norton 2004)
and elsewhere (Richards et al. 1993; Shields et al. 2006). Richards et al. (1996) also found that
land use in the whole catchment area has a stronger effect than regional buffers on variables related
to the hydraulic regime, such as channel dimensions. The results for the regional
variables also agreed well with similar studies evaluating the impact of land use at a
similar scale. Stewart et al. (2001) linked an increased presence of intolerant species and a greater
total number of fish species to increased forested wetland in the 20-30 meter buffer. Percent
tolerant species and percent insectivorous fish decreased as the percent of forest in the 20-30
meter buffer increased. These relationships indicate a positive correlation between forested regional
stream buffers and biologic integrity. Moreover, the percentage of grasslands (equivalent to
non-forested lands in our research) in the 20-30 meter regional buffer has been negatively associated
with the health of fish communities (Stewart et al. 2001). Also, several authors have identified
urban land use in the whole catchment area as a good indicator of biological integrity
degradation (Morley and Karr 2002; Stewart et al. 2001). Therefore, the model confirmed the
disproportionate importance of regional buffers compared to its total land area (Johnson et al.
1997). However, good quality of regional stream buffers alone does not guarantee good
biological integrity in highly urbanized basins (Roth et al. 1996).
Fragmentation
Of the two fragmentation metrics used in the hierarchical separation of biologic signatures
(percentage of upstream connected network [UPS_Con] and percentage of basin connected
network [SITE_Con]), only the basin-based metric was able to separate different biological
responses. Because the observations within a basin were highly concentrated in specific areas or
river systems and because the metric was calculated at the basin-scale, clustering with this metric
functioned as a basin-filter. Observations within the same basin (or within the same watershed in
basins with multiple outlets) were grouped together. Therefore, biologic integrity responses were
segregated at the basin level (Figure 3-1).
The results could suggest that separation of different biotic qualities in this case was more due to
regional characteristics than the effect of fragmentation itself. However, a clear pattern was
observed. The two groups with the lowest biological integrity (i.e. group 223 and 212 with an
average IBI equal to 26.3 and 32.9 respectively) had the lowest average site network connectivity
(3.43 and 14.9 percent respectively). The three groups with the highest average IBI (i.e. groups
221,1, and 211 with average IBI equal to 48.5, 41.1, and 40.1 respectively) had much larger
mean connectivity values (59.7, 34.2, and 87.7 percent respectively). Therefore, fragmentation at
the basin-level seems to play an important role in biologic integrity. Connectivity thresholds to
guarantee species survival and persistence may exist. Physically fragmented networks tend to
isolate small populations, which become non-viable and are condemned to disappear within
30 to 100 years (Morita and Yokota 2002). Some studies suggest that the risk of
species disappearance due to stream damming is positively correlated with the length of the population’s
isolation from the rest of the river network and with stream gradient, and is
negatively correlated with watershed area (i.e. habitat size) (Morita and Yamamoto 2002).
Moreover, fragmentation not only represents a physical barrier to fauna but is also associated
with flow regulation. Hydraulic intermittency due to flow regulation or abstraction is usually
associated with the presence of dams or other flow-regulation infrastructure. One of the main
consequences is the longitudinal and lateral dispersion of species due to habitat fragmentation or
disappearance triggered by the new hydrologic regime (Fischer and Kummer 2000; Freeman et
al. 2001).
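The pattern between basin connectivity and biologic integrity noted above can be summarized with a simple correlation over the five groups whose averages are quoted in the text. The sketch below is only a descriptive illustration: five group means are far too few for inference, group sizes are ignored, and the helper `pearson_r` is a hypothetical name introduced here.

```python
from statistics import mean

# (group, average IBI, average SITE_Con in percent), as quoted in the text.
group_means = [
    ("223", 26.3, 3.43),
    ("212", 32.9, 14.9),
    ("221", 48.5, 59.7),
    ("1", 41.1, 34.2),
    ("211", 40.1, 87.7),
]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

connectivity = [g[2] for g in group_means]
ibi = [g[1] for g in group_means]
r = pearson_r(connectivity, ibi)
print(f"Pearson r between group-mean SITE_Con and IBI: {r:.2f}")
```

A positive coefficient is consistent with the observation that the least connected groups also had the lowest average IBI, although it does not separate the effect of fragmentation from other basin-level characteristics.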
Upstream connectivity did not seem to be as relevant as site connectivity at the basin level. One
possible reason why this metric was not as relevant as basin connectivity is that
many of the observations were very far from the basin outlet. Therefore, the largest part of the
network available to fauna was located in the downstream section (the average distance to the basin
outlet following the main channel was 284.3 km; the average total upstream network
distance was 488 km). Results from Chapter 2 indicated that increased upstream
fragmentation in the same study area was responsible for IBI over-predictions at some sites (see
Table 2-3). Upstream fragmentation (which was not a predictor variable in the model) was
significantly higher at some over-predicted sites. Therefore, given the available observation
points, upstream fragmentation is more a local than a regional stressor because of the generally
large distances to the basin outlet (i.e. the upstream network section represents a small fraction of
the existing fish habitat).
Local variables
Only some local land uses were considered to have a significant correlation with IBI (Table 3-8).
None of the point source density and intensity metrics had an absolute raw IBI-variable
correlation coefficient larger than 0.1 and, therefore, these were disregarded. On the other hand,
the order of the correlations between local land use and IBI was almost exactly the same as
that observed at the regional scale. Forest was once again the most significant variable and was positively
correlated with IBI, while agriculture, non-forested lands, and urbanization were negatively
correlated with IBI. Once again, land use in the buffer strip was more strongly correlated with IBI
than land use in the whole local catchment area. Urban land was no exception, which was a
significant difference from the regional land use variable selection (percent of urban land use in
the whole drainage area was more relevant at the regional level than in the buffer zone). Also,
the percent of barren land in the local buffer had a very weak correlation with IBI and was disregarded
for further analysis.
The importance of forested local buffers was revealed in groups 211 and 212, which were split
into two and three subgroups respectively. The pattern was very clear. Increased vegetation in
the local buffer corresponded to improved biologic integrity within the limits established by the
group’s regional and basin characteristics (i.e. background integrity). Average IBI for subgroups
2111 and 2112 were equal to 37.6 and 43.6 respectively. Percentage of forested land in the local,
100-meter buffer in the same sub-groups was 34.6 and 67.4 respectively. Average IBI in groups
2121, 2122, and 2123 were 39.4, 35.8, and 28.9 and corresponded to an average percentage of
forested land in the local buffer of 77.5, 41.7, and 7.82 respectively.
Groups 221, 2221, 2222, 2231, and 2232 did not undergo any further subdivision with this
variable. This was most likely due to high homogeneity in groups 221, 2231, and 2232 (25th,
50th, and 75th percentiles equal to 42.5, 44.6, and 45.6% in group 221; 0.8, 8.2, and 11.6% in group 2231;
and 4.3, 8.2, and 15.9% in group 2232). However, groups 2221 and 2222 had more variability
(25th, 50th, and 75th percentiles equal to 3.38, 29.2, and 35.4% in group 2221; 2.5, 12.5, and 25.3% in group
2222). We believe that, despite this variability, differences in biological responses may not be
statistically significant once forested land cover falls below a lower limit. For example, groups
212 and 211 were divided into three and two subgroups respectively. Subgroups 2121, 2122,
and 2123 had average percentages of forest equal to 77.5, 41.7, and 7.82 respectively, while
subgroups 2111 and 2112 had average percentages equal to 34.6 and 67.4 respectively. Different
biotic responses were not observed between the 10-30% range of forest and less than 10% of
forested lands. Since most of the observations in groups 2221 and 2222 were below the 30%
limit, we believe local buffer functionality was degraded to the point that it could no longer
generate different biotic responses.
The strong influence of forest in the local 100-meter buffer was also identified in a study of the
River Raisin in Michigan (Lammert and Allan 1999). Thirty percent of the total fish IBI
variability was explained by this variable. Local forest cover was also an important factor
positively affecting the Benthic IBI (B-IBI). In another study, local urban land use showed the
strongest correlation with B-IBI when compared to other land uses. However, since the watersheds
under study were mostly dominated by either urbanization or forest, forest cover was excluded
from the analysis due to its almost perfect correlation with urban land use (Morley and Karr 2002).
As in our model, Morley and Karr (2002) identified watershed urban land use as a better
predictor of B-IBI than local urban land use. However, they found that local urban land use in
highly urbanized watersheds was strongly correlated to B-IBI in watersheds with little vegetal
cover continuity in the immediate stream buffer (1 Km upstream).
Agriculture, non-forested land uses, and urbanization in the local buffers were able to
separate further biotic responses. However, the biologic responses obtained in each new
subdivision were not always the expected ones. Groups 22221 and 22312 had worse biotic
integrity than groups 22222 and 22311 despite having smaller percentages of agriculture in the
local buffer. Also, group 21112 had better integrity than group 21111 despite having a larger
percentage of urban land use (average of 26.6 versus 6.4 respectively). Only the percentage of
non-forested land in the local buffer yielded the expected outcome. Group 21221 had better integrity
than group 21222, and its percentage of non-forested land was significantly lower (0.34 versus 25.4
percent respectively).
Even though some of these results were counter-intuitive, they may reflect truly extraordinary
local conditions. The signs of the overall IBI-local variable correlation coefficients when the
whole database was used were as expected (Table 3-8). We believe the discrepancies in
some groups were due to data resolution (e.g. agriculture was the sum of hay pasture, range, and
croplands, which may behave differently). For example, group 22312 had far less local
agricultural coverage than group 22311 (average 36.3 versus 81.1% respectively). However, the
average percentage of local pasture and rangeland (included in the agricultural category) was
almost twice as large in group 22312 as in group 22311 (10.3 versus 5.9% respectively). In
Chapter 2, this land use type was identified as the most deleterious to biotic integrity in Ohio.
Data resolution problems were not observed in groups 22221 and 22222, or in groups 21111 and 21112.
Even though a clear explanation for the observed pattern was not found, dominance of regional
influence could be a possible cause. It has been documented that some instream features such as
shade, channel width and stability, epilithon biomass, or water clarity improve rapidly with
improved local buffer quality. However, other processes that can severely affect biotic integrity
such as water chemistry, nutrient input, surficial fine sediment, or fecal contamination are highly
dependent on regional characteristics (Parkyn et al. 2003; Scarsbrook and Halliday 1999).
Figure 3-15. Groups of sampling sites in a watershed located in the Muskingum River Basin. On the left, groups after partition with regional watershed land use and fragmentation metrics. On the right, groups after partitions with land use in the local 100-meter buffer.
Table 3-11. Average group values after clustering with basin/watershed scale variables
Group   R30_FOREST   R30_BARREN   R100_AGRI   R30_NONFOREST   SITE_CON   RDA_URB   IBI
1       39.50        0.24         60.29       0.39            34.24      8.65      41.06
211     40.62        0.00         59.66       0.37            87.71      11.01     40.09
212     24.63        0.01         59.21       0.85            14.89      18.70     32.88
221     42.43        0.02         59.28       4.41            59.67      6.08      48.50
2221    21.78        0.01         71.47       3.15            35.16      7.54      38.08
2222    21.33        0.01         62.40       19.41           34.17      22.61     24.71
2231    11.71        0.01         79.09       18.91           2.22       7.68      24.30
2232    23.39        0.00         33.04       9.91            6.10       47.47     30.67
3.2.3. Coastal Maryland
3.2.3.1. Biologic response separation
The correlation matrix of the neuron weights and the neuron-based average IBI scores is shown
in Figure 3-16.
Figure 3-16. Correlation matrix of the variable neuron-based weights and neuron, average IBI values in the trained SOM. Color bar on the right indicates color code for the absolute correlation coefficients among variables
The variables that showed a relevant overall impact on IBI (variable-IBI r ≥ 0.5) were: pool
quality (r = 0.730), average thalweg (r = 0.675), average width (r = 0.671), velocity-depth
variability (r = 0.665), percentage of channel covered by flow (r = 0.659), maximum depth (r =
0.654), drainage area (r = 0.620), wood score (r = 0.577), flow velocity (r = 0.529), and riffle
quality (r = 0.517).
Many of these significant variables’ neuron weights were strongly cross-correlated (variable-
variable r ≥ 0.8), and therefore disregarded for subsequent analysis. Average thalweg, velocity-
depth variability, percentage of channel covered by flow, and maximum depth were strongly
correlated to pool quality (r = 0.912, 0.876, 0.841, 0.968 respectively). Drainage area and wood
score were strongly correlated to average stream width (r = 0.955, -0.864 respectively). Average
flow velocity was also disregarded for further analysis because we considered this variable could
be influenced by local conditions (e.g. channelization). Thus, the remaining variables for the 2nd
SOM patterning were pool and riffle qualities, and average width.
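The variable screening described in the last three paragraphs can be sketched as a greedy filter: variables are ranked by the strength of their correlation with IBI, and a variable is dropped when it is strongly cross-correlated with one already kept. The coefficients below are those reported in the text. The short variable labels, the function name `screen`, the greedy strongest-first order, and the treatment of unreported variable pairs as uncorrelated are assumptions made for illustration only.

```python
# Variable-IBI correlation coefficients reported in the text.
# The short labels are hypothetical names chosen for this sketch.
ibi_r = {
    "Pool": 0.730, "ThalDep": 0.675, "Wid": 0.671, "VelDep": 0.665,
    "ChFlow": 0.659, "MaxDep": 0.654, "DA": 0.620, "Wood": 0.577,
    "Vel": 0.529, "Riffle": 0.517,
}

# Strong variable-variable correlations reported in the text.
# ASSUMPTION: pairs not listed here are treated as below the 0.8 threshold.
cross_r = {
    ("ThalDep", "Pool"): 0.912, ("VelDep", "Pool"): 0.876,
    ("ChFlow", "Pool"): 0.841, ("MaxDep", "Pool"): 0.968,
    ("DA", "Wid"): 0.955, ("Wood", "Wid"): -0.864,
}

def screen(ibi_r, cross_r, ibi_min=0.5, cross_max=0.8, excluded=()):
    """Keep variables with |r_IBI| >= ibi_min that are not strongly
    cross-correlated with a higher-ranked variable already kept."""
    def pair_r(a, b):
        return cross_r.get((a, b), cross_r.get((b, a), 0.0))

    kept = []
    for var in sorted(ibi_r, key=lambda v: abs(ibi_r[v]), reverse=True):
        if abs(ibi_r[var]) < ibi_min or var in excluded:
            continue
        if any(abs(pair_r(var, k)) >= cross_max for k in kept):
            continue
        kept.append(var)
    return kept

# Flow velocity was set aside by judgment (possible local influence),
# not by the correlation rule, so it is passed as an explicit exclusion.
kept = screen(ibi_r, cross_r, excluded={"Vel"})
print(kept)
```

Under these assumptions the filter retains pool quality, average width, and riffle quality, the three gradients used for the 2nd SOM patterning.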
The remaining variables were considered as small-scale variables. DO and ANC were eliminated
due to high correlation with pH (r = 0.85 and 0.84 respectively), Agribarr and CShade were
correlated to NO3 ( r = 0.95, -0.81 respectively), Embed and CEpiSub were correlated to
CInstrHab (r = -0.86 and 0.87 respectively), and CRemote was correlated to Aesthet ( r = 0.95).
Temperature was also discarded from further analysis due to its variability. Thus, the remaining
small-scale variables were: pH, NO3, Forwetwat, CInstrHab, Aesthet, RipWid, CBank, Chan,
Cond, SO4, Sl, DOC, and Urban.
The 2nd SOM was run with the identified environmental gradients and the subsequent SOM-
neuron clustering yielded two groups with significantly different biologic responses according to
ANOVA (Table 3-12).
Table 3-12. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups
ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares     Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups            34.965      1         34.965      50.28     0.0000
Within groups             155.07    223       0.695381
-----------------------------------------------------------------------------
Total (Corr.)            190.035    224

Multiple Range Tests
--------------------------------------------------------------------------------
Method: 95.0 percent LSD
                  Count      Mean       Homogeneous Groups
--------------------------------------------------------------------------------
IBI1                103      3.05097    X
IBI2                122      3.84221     X
--------------------------------------------------------------------------------
Contrast                     Difference     +/- Limits
--------------------------------------------------------------------------------
IBI1 - IBI2                  *-0.791242     0.219896
--------------------------------------------------------------------------------
* denotes a statistically significant difference.
Figure 3-17. Groups and subgroups with different biological response after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed the normality tests
Subsequent separation of biologic responses, based on clustering sites with individual variables
that were not strongly cross-correlated, yielded eleven different groups, as shown in Figure 3-17.
Group and subgroup variable statistics are included in Appendix I.
The normal probability plots for the two main biologic responses after the 2nd SOM clustering are
shown in Figure 3-18.
Figure 3-18. Normal probability plots for the IBI responses found after the 2nd SOM clustering
Reference conditions for similar environmental sites and potential causes for departure
The IBI 75th percentiles for groups 1 and 2 were 3.75 and 4.25 respectively. Sites with scores above these
values were arbitrarily designated as reference sites for each biological response at the given level of
partition. Reference site curves, along with curves for the remaining non-reference sites, are
shown in Figure 3-19. Significant differences among variables between reference and non-
reference conditions are presented in Table 3-13.
Figure 3-19. Normal probability plots for the reference (green) and impaired (red) conditions for the two groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution
Table 3-13. 95% confidence intervals and ANOVA test between reference and non-reference sites with variables used in the separation of biotic responses in coastal sites
             Group 1                                       Group 2
Variable     Reference         Non-reference     p         Reference         Non-reference     p
Aesthet      14.14 ± 1.65      12.65 ± 1.14      0.132     15.00 ± 0.99      13.49 ± 1.03      0.054
Wid          3.73 ± 0.95       3.12 ± 0.78       0.350     6.62 ± 1.24       5.62 ± 0.65       0.118
CBank        66.8 ± 5.85       71.12 ± 4.28      0.236     73.73 ± 4.49      73.14 ± 3.35      0.831
Chan         7.31 ± 1.41       7.44 ± 1.04       0.886     8.73 ± 1.06       9.93 ± 0.85       0.083
Cond         161.57 ± 25.10    182.51 ± 26.58    0.311     180.24 ± 41.25    163.96 ± 15.92    0.388
DOC          5.73 ± 1.09       7.75 ± 1.55       0.081     5.30 ± 1.01       5.71 ± 0.71       0.499
Forwetwat    48.1 ± 6.64       54.63 ± 5.49      0.148     48.04 ± 4.76      45.04 ± 3.06      0.269
CInstrhab    56.59 ± 7.77      53.37 ± 5.89      0.517     64.53 ± 6.56      63.60 ± 5.30      0.828
NO3          1.83 ± 0.55       1.74 ± 0.67       0.864     2.71 ± 0.79       2.35 ± 0.54       0.440
pH           6.89 ± 0.14       6.67 ± 0.16       0.075     6.93 ± 0.15       6.83 ± 0.10       0.252
Pool         11.66 ± 1.35      8.66 ± 1.07       0.001*    13.93 ± 0.87      13.95 ± 0.74      0.980
Riffle       5.20 ± 1.314      5.41 ± 1.01       0.803     12.82 ± 1.07      13.38 ± 0.65      0.346
RipWid       31.6 ± 6.81       34.96 ± 4.18      0.377     37.83 ± 5.30      38.62 ± 3.91      0.683
Sl           0.459 ± 0.112     0.447 ± 0.117     0.895     0.46 ± 0.14       0.54 ± 0.09       0.318
SO4          14.00 ± 1.91      14.67 ± 1.81      0.634     14.46 ± 1.50      15.47 ± 1.29      0.323
Urban        6.07 ± 2.78       6.09 ± 2.71       0.992     6.81 ± 2.74       12.93 ± 3.55      0.018*
* Indicates a statistically significant difference at the 95% confidence level (p < 0.05)
3.2.4. Piedmont Maryland
3.2.4.1. Biologic response separation
The correlation matrix of the neuron weights and the neuron-based average IBI after the initial
SOM training is shown in Figure 3-20.
Figure 3-20. Correlation matrix of the variable neuron-based weights and neuron, average IBI values in the trained SOM. Color bar on the right indicates color code for the absolute correlation coefficients among variables
The variables with a significant impact on IBI were (in decreasing order of importance): Ch_flow
(r = 0.749), Chan (r = 0.747), Urban (r = -0.746), Agribarr (r = 0.726), SO4 (r = 0.690), Aesthet
(r = 0.689), Veldep (r = 0.673), Pool (r = 0.672), DOC (r = -0.662), DO (r = 0.659), NO3 (r =
0.655), ANC (r = -0.650), Cond (r = -0.632), ThalDep (r = 0.605), PRemote (r = 0.585),
PEmbed (r = 0.531), MaxDep (r = 0.505). Agribarr, SO4, Aesthet, DOC, DO, ANC, and Cond
were highly correlated to Urban and were therefore disregarded (r = -0.910, 0.925, -0.930, 0.871,
-0.914, 0.925, 0.943 respectively). Agribarr, Veldep, and NO3 were strongly correlated to
Ch_flow (r = 0.816, 0.837, 0.835 respectively). ThalDep and MaxDep were strongly correlated
to Pool (r = 0.942 and 0.934 respectively). Hence, the remaining large-scale variables for the 2nd
SOM patterning were: Ch_flow, Chan, Urban, Pool, PRemote, and PEmbed.
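The variable screening described above can be read as a ranked filter: keep variables whose correlation with IBI is strong enough, then drop any variable that is strongly correlated with one already kept. The sketch below is one possible reading, not the author's stated algorithm. The function name `select_uncorrelated`, the cross-correlation cutoff of 0.8, and the treatment of unreported pairs as uncorrelated are all assumptions; the example correlations are a subset of the piedmont values quoted in the text.

```python
def select_uncorrelated(ibi_corr, cross_corr, min_ibi_r=0.5, max_cross_r=0.8):
    """Rank variables by |r| with IBI and keep those not strongly
    correlated with a higher-ranked variable that was already kept."""

    def pair_r(a, b):
        # Pairs absent from the table are treated as uncorrelated (assumption)
        return cross_corr.get((a, b), cross_corr.get((b, a), 0.0))

    ranked = sorted(
        (v for v, r in ibi_corr.items() if abs(r) >= min_ibi_r),
        key=lambda v: abs(ibi_corr[v]),
        reverse=True,
    )
    kept = []
    for var in ranked:
        if all(abs(pair_r(var, k)) < max_cross_r for k in kept):
            kept.append(var)
    return kept

# Subset of the neuron-based correlations reported for piedmont sites
ibi_corr = {"Ch_flow": 0.749, "Urban": -0.746, "Veldep": 0.673, "Pool": 0.672,
            "Cond": -0.632, "ThalDep": 0.605, "MaxDep": 0.505}
cross_corr = {("Veldep", "Ch_flow"): 0.837, ("Cond", "Urban"): 0.943,
              ("ThalDep", "Pool"): 0.942, ("MaxDep", "Pool"): 0.934}
large_scale = select_uncorrelated(ibi_corr, cross_corr)
```

With this subset the filter retains Ch_flow, Urban, and Pool, which agrees with the corresponding entries in the list of large-scale variables above.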
The remaining variables (IBI-variable |r| < 0.5) were considered as small-scale variables.
However, some variables were again disregarded due to strong cross-correlation. Wid was
strongly correlated to Pool (r = 0.845); PRiffle, PInstrHab, PShade, and PBank were strongly
correlated to PEpiSub (r = 0.936, 0.957, 0.835, and 0.840 respectively). Therefore, the remaining
small-scale variables were: PEpiSub, DA, PWood, pH, RipWid, Sl, Forwetwat, and PHI. Again,
Flow_vel and Temp were disregarded from further analyses for the reasons mentioned
previously.
Clustering of the 2nd SOM neurons using the six most significant, non-correlated variables
yielded five groups with significantly different biologic responses as indicated by the ANOVA
and MRT analyses (Table 3-14). Subsequent separation of biological responses due to small-
scale stressors resulted in two more levels of IBI segregation. Stream gradient and percentage of
forest, wetlands, and water in the drainage area were the variables responsible for the
significantly different biological signatures (Figure 3-21).
Table 3-14. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups
ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares     Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups           119.017      4        29.7543      59.14     0.0000
Within groups            124.261    247       0.503083
-----------------------------------------------------------------------------
Total (Corr.)            243.279    251

Multiple Range Tests
--------------------------------------------------------------------------------
Method: 95.0 percent LSD
          Count      Mean       Homogeneous Groups
--------------------------------------------------------------------------------
IBI2         29      1.80483    X
IBI5         17      2.71353     X
IBI1         33      3.24818      X
IBI3         47      3.59149       X
IBI4        126      3.92992        X
--------------------------------------------------------------------------------
Contrast          Difference     +/- Limits
--------------------------------------------------------------------------------
IBI1 - IBI2       *1.44335       0.355584
IBI1 - IBI3       *-0.343308     0.317279
IBI1 - IBI4       *-0.681739     0.273186
IBI1 - IBI5       *0.534652      0.417067
IBI2 - IBI3       *-1.78666      0.329884
IBI2 - IBI4       *-2.12509      0.287729
IBI2 - IBI5       *-0.908702     0.426734
IBI3 - IBI4       *-0.338431     0.238776
IBI3 - IBI5       *0.87796       0.395383
IBI4 - IBI5       *1.21639       0.360961
--------------------------------------------------------------------------------
* denotes a statistically significant difference.
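As a reading aid, the entries of Table 3-14 can be cross-checked by hand. The worked arithmetic below assumes the usual one-way ANOVA definitions and Fisher's LSD formula; the critical value t ≈ 1.97 for 247 degrees of freedom is an approximation supplied here, not a figure taken from the original output.

```latex
\begin{align*}
MS_{between} &= \frac{SS_{between}}{df_{between}} = \frac{119.017}{4} \approx 29.754 \\
MS_{within}  &= \frac{SS_{within}}{df_{within}} = \frac{124.261}{247} \approx 0.5031 \\
F &= \frac{MS_{between}}{MS_{within}} \approx \frac{29.754}{0.5031} \approx 59.14 \\
LSD_{1,2} &= t_{0.975,\,247}\sqrt{MS_{within}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}
           \approx 1.97\sqrt{0.5031\left(\frac{1}{33}+\frac{1}{29}\right)} \approx 0.356
\end{align*}
```

The last value is consistent with the tabulated limit of 0.355584 for the IBI1 - IBI2 contrast; since the absolute difference in means (1.44335) exceeds that limit, the pair is flagged as significantly different.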
Figure 3-21. Groups and subgroups with different biological responses after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed normality tests
Figure 3-22 shows the normal probability plot for the five main groups after clustering the SOM
neurons using the identified environmental gradients in piedmont regions.
Figure 3-22. Normal probability plots for the IBI responses identified by the 2nd SOM clustering in Piedmont sites (Group 4 didn’t pass the normality test)
Reference conditions for similar environmental sites and potential causes for departure
The IBI 75th percentiles for groups 1 through 5 were 3.89, 2.33, 4.11, 4.33, and 3.00 respectively.
Group reference and non-reference curves for the values above and below the 75th IBI percentile
are presented in Figure 3-23. Differences between reference and non-reference sites are shown in
Table 3-15.
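Curves of this kind are described in the figure captions as points generated randomly from each group's mean and standard deviation. A minimal sketch of that idea follows, assuming a plotting position of (i + 0.5)/n for the normal scores and using invented IBI values; the function name `gaussian_curve_points` is hypothetical and the original plotting software is not specified here.

```python
from statistics import NormalDist, mean, stdev

def gaussian_curve_points(ibi_scores, n_points=200, seed=0):
    """Return (normal score, simulated IBI) pairs for a normal probability plot.

    Points are drawn from a Gaussian with the sample mean and standard
    deviation of the observed scores, then sorted and paired with the
    standard normal quantile of their plotting position.
    """
    fitted = NormalDist(mean(ibi_scores), stdev(ibi_scores))
    draws = sorted(fitted.samples(n_points, seed=seed))
    std_normal = NormalDist()
    # (i + 0.5) / n is one common plotting-position convention (assumption)
    z = [std_normal.inv_cdf((i + 0.5) / n_points) for i in range(n_points)]
    return list(zip(z, draws))

# Hypothetical reference-site IBI scores for one group
points = gaussian_curve_points([3.0, 3.5, 4.0, 4.5, 5.0], n_points=50)
```

Plotting the pairs for reference and non-reference sites on the same axes would give two roughly straight lines whose offset reflects the difference in means and whose slopes reflect the standard deviations.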
Figure 3-23. Normal probability plots for the reference (green) and impaired (red) conditions for the five groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution (Group 4 was fitted to a Gaussian distribution only for demonstration purposes)
Table 3-15. 95% confidence intervals and ANOVA test between reference and non-reference sites with variables used in the separation of biotic responses in piedmont sites
              Group 1                                          Group 2
Variable      Reference          Non-reference      p          Reference          Non-reference      p
----------------------------------------------------------------------------------------------------------
Ch_flow       84.84 +/- 8.82     80.43 +/- 7.72     0.451      70.62 +/- 19.41    69.0 +/- 11.32     0.881
Chan          13.58 +/- 1.92     11.67 +/- 1.69     0.139      8.50 +/- 3.10      7.90 +/- 2.33      0.767
Urban         1.12 +/- 0.61      0.37 +/- 0.27      0.008*     49.34 +/- 16.23    63.17 +/- 7.93     0.075
Pool          10.17 +/- 0.97     9.86 +/- 0.90      0.644      13.50 +/- 2.86     12.14 +/- 1.69     0.378
PRemote       54.17 +/- 19.91    57.44 +/- 15.93    0.790      15.62 +/- 9.26     13.39 +/- 5.34     0.646
PEmbed        76.57 +/- 14.40    66.24 +/- 10.10    0.211      68.06 +/- 26.13    72.43 +/- 12.33    0.712
Sl            1.12 +/- 0.372     1.2 +/- 0.23       0.699      0.79 +/- 0.35      1.53 +/- 0.51      0.082
Forwetwat     28.41 +/- 5.08     30.02 +/- 8.02     0.767      30.62 +/- 9.97     22.75 +/- 4.98     0.102

              Group 3                                          Group 4
Variable      Reference          Non-reference      p          Reference          Non-reference      p
----------------------------------------------------------------------------------------------------------
Ch_flow       85.87 +/- 6.71     76.97 +/- 5.40     0.050*     85.27 +/- 4.17     86.62 +/- 2.92     0.590
Chan          7.40 +/- 1.63      6.50 +/- 0.80      0.250      13.49 +/- 0.76     13.41 +/- 0.63     0.873
Urban         3.46 +/- 3.30      5.66 +/- 2.28      0.263      3.12 +/- 1.17      2.43 +/- 0.54      0.228
Pool          14.27 +/- 1.21     13.56 +/- 1.01     0.398      15.56 +/- 0.56     15.33 +/- 0.46     0.553
PRemote       41.25 +/- 14.2587  52.54 +/- 9.84     0.185      63.33 +/- 8.35     67.98 +/- 6.26     0.376
PEmbed        50.00 +/- 11.57    62.57 +/- 8.21     0.077      80.76 +/- 5.47     75.86 +/- 4.66     0.191
Sl            1.04 +/- 0.27      0.92 +/- 0.27      0.562      1.17 +/- 0.27      1.30 +/- 0.31      0.567
Forwetwat     29.44 +/- 4.25     37.97 +/- 5.23     0.038*     31.30 +/- 2.77     30.91 +/- 2.72     0.855

              Group 5
Variable      Reference          Non-reference      p
-------------------------------------------------------------
Ch_flow       88.83 +/- 13.53    79.36 +/- 9.89     0.206
Chan          14.00 +/- 1.88     14.45 +/- 1.54     0.682
Urban         23.82 +/- 4.50     44.018 +/- 13.35   0.028*
Pool          15.17 +/- 3.28     14.54 +/- 2.11     0.702
PRemote       50.00 +/- 24.89    43.18 +/- 9.47     0.464
PEmbed        77.78 +/- 15.20    67.8 +/- 21.23     0.474
Sl            0.90 +/- 0.43      1.24 +/- 0.83      0.533
Forwetwat     28.97 +/- 5.42     30.29 +/- 9.24     0.823
*Indicates statistically significant difference at the 95% confidence level (p<0.05)
3.2.5. Highland Maryland
3.2.5.1. Biologic response separation
The correlation matrix of the neuron weights and the neuron-based average IBI after the initial
SOM training is shown in Figure 3-24.
Figure 3-24. Correlation matrix of the variable neuron-based weights and neuron, average IBI values in the trained SOM. Color bar on the right indicates color code for the absolute correlation coefficients among variables
In this case, the variables with a significant impact on IBI were: HEpiSub (r = 0.588), Riffle (r =
0.582), Veldep (r = 0.567), Wid (r = 0.555), FlowVel (r =0.545), InstrHab (r = 0.536), MaxDep
(r = 0.534), Pool (r = 0.521), DA (r = 0.517), ThalDep (r = 0.510). Riffle, VelDep, and InstrHab
were strongly correlated to HEpiSub (r = 0.901, 0.800, and 0.890 respectively); Instrhab,
MaxDep, Pool, DA, and ThalDep were strongly correlated to Wid (r = 0.907, 0.977, 0.964,
0.971, and 0.968 respectively). Therefore, the variables left for the 2nd SOM patterning were only
HEpiSub and Wid.
The remaining, non-correlated variables were considered small-scale variables. These were (in
decreasing order of importance): Root, DO, Sl, SO4, Wood, Agribarr, Aesthet, Chan, Embed,
HShade. Other variables were disregarded due to strong correlation: DOC with HEpiSub (r =
-0.824), Ch_Flow with DO (r = 0.824), pH with Sl (r = -0.911), HRipWid and HRemote with Aesthet
(r = 0.851 and 0.867 respectively), and Urban, HBank, ANC, Cond, NO3, and Forwetwat with
Agribarr (r = 0.861, -0.898, 0.909, 0.899, 0.971, and -0.939 respectively). Again, Flow_vel and
Temp were disregarded from further analyses for the reasons mentioned previously.
Clustering of the 2nd SOM neurons using the two identified large-scale variables resulted in three
groups with significantly different IBI responses according to ANOVA and MRT analyses
(Table 3-16). Subsequent separation of biological responses with small-scale variables resulted
in 9 different levels (Figure 3-25).
Table 3-16. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses in highland sites. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups
ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares     Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups           39.0566      2        19.5283      16.76     0.0000
Within groups            341.399    293        1.16518
-----------------------------------------------------------------------------
Total (Corr.)            380.455    295

Multiple Range Tests
--------------------------------------------------------------------------------
Method: 95.0 percent LSD
          Count      Mean       Homogeneous Groups
--------------------------------------------------------------------------------
IBI2        111      2.67324    X
IBI1        153      3.25        X
IBI3         32      3.79406      X
--------------------------------------------------------------------------------
Contrast          Difference     +/- Limits
--------------------------------------------------------------------------------
IBI1 - IBI2       *0.576757      0.264873
IBI1 - IBI3       *-0.544062     0.412961
IBI2 - IBI3       *-1.12082      0.426261
--------------------------------------------------------------------------------
* denotes a statistically significant difference.
Figure 3-25. Biological response hierarchical structure after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed normality tests
Figure 3-26 shows the three different biological responses after separation with environmental
gradients (see Figure 3-25).
Figure 3-26. Normal probability plots for the IBI responses identified by the 2nd SOM clustering in Highland sites (groups 1 and 3 didn’t pass normality tests)
Reference conditions for similar environmental sites and potential causes for departure
The IBI 75th percentiles for groups 1 through 3 were 4.14, 3.57, and 4.43 respectively. Reference
vs. non-reference curves are presented in Figure 3-27. Differences among variables between
reference and non-reference sites are presented in Table 3-17.
Figure 3-27. Normal probability plots for the reference (green) and impaired (red) conditions for the three groups obtained using environmental gradients in Highland sites. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group in order to describe its Gaussian distribution (Groups 1 and 3 fitted to a Gaussian distribution only for demonstration purposes)
The model identified a rather homogeneous system (piedmont areas), a moderately heterogeneous
one (coastal plains), and a highly heterogeneous one (highland areas).
In coastal sites, the variables with the greatest overall impact on IBI were related either to stream
variability/complexity (i.e. pool and riffle qualities) or to stream size (i.e. average stream width).
Pool quality as well as stream size-related parameters (i.e. maximum depth) were part of the
provisional PHI and were also found in another research effort to have a high IBI discriminatory
power (Hall et al. 1999). Another stream variability metric with high discriminatory power,
eliminated due to cross-correlation and also included in the provisional PHI, was velocity-depth
variability. None of the new metrics included in the final PHI index by Paul et al. (2002) was
used in the 2nd SOM patterning. Only the PHI metric CWood was among the top predictors, but it was
eliminated due to strong cross-correlation with stream size parameters.
Table 3-17. 95% confidence intervals and ANOVA test between reference and non-reference sites in variables used in the separation of biotic responses in highland sites
            GROUP 1                                       GROUP 2                                       GROUP 3
Variable    REF               NON-REF           p         REF               NON-REF           p         REF               NON-REF           p
------------------------------------------------------------------------------------------------------------------------------------------------
HEpiSub     85.99 +/- 3.88    72.96 +/- 3.36    0.000*    30.37 +/- 5.78    22.56 +/- 2.44    0.004*    73.46 +/- 17.58   75.84 +/- 9.09    0.780
Wid         5.34 +/- 0.67     4.45 +/- 0.43     0.026*    5.16 +/- 0.96     3.98 +/- 0.53     0.027*    13.96 +/- 2.89    12.42 +/- 1.08    0.185
Root        1.56 +/- 0.54     0.86 +/- 0.27     0.011*    1.13 +/- 0.59     0.52 +/- 0.22     0.017*    1.89 +/- 1.05     1.65 +/- 1.50     0.845
Sl          2.01 +/- 0.41     2.11 +/- 0.47     0.803     1.18 +/- 0.30     1.72 +/- 0.36     0.090     0.52 +/- 0.33     0.84 +/- 0.24     0.132
SO4         10.96 +/- 1.65    19.22 +/- 3.35    0.001*    15.89 +/- 4.39    41.26 +/- 17.38   0.082*    10.20 +/- 1.84    38.14 +/- 37.10   0.341
Wood        1.96 +/- 0.84     1.78 +/- 0.51     0.710     1.37 +/- 0.77     2.16 +/- 0.79     0.252     2.22 +/- 1.32     2.52 +/- 1.37     0.791
Agribarr    40.74 +/- 7.35    39.30 +/- 5.88    0.774     37.86 +/- 9.62    39.13 +/- 6.17    0.828     50.56 +/- 15.47   36.00 +/- 11.06   0.137
Chan        13.85 +/- 1.03    12.29 +/- 0.94    0.049*    12.77 +/- 1.93    9.44 +/- 1.25     0.006*    12.78 +/- 3.19    13.78 +/- 2.28    0.613
Embed       25.23 +/- 4.28    29.98 +/- 3.65    0.125     38.60 +/- 10.43   47.04 +/- 6.58    0.180     34.33 +/- 13.23   38.04 +/- 9.18    0.644
HShade      77.36 +/- 5.02    67.37 +/- 4.84    0.013*    65.89 +/- 8.88    62.06 +/- 6.56    0.526     62.91 +/- 13.27   57.43 +/- 9.16    0.496
*Indicates statistically significant difference at the 95% confidence level (p<0.05)
Coastal sites were mainly dominated by a rather constant combination of agricultural and forest
land covers (41.2 and 49.7 percent on average respectively). The agricultural coverage in these
areas was well above the Mid-Atlantic regional average, estimated at 20 percent (Herlihy
et al. 1998). Inappropriate habitat metrics and/or widespread human disturbance may explain the
weak relationships between IBI and PHI and land use parameters (Pirhalla 2004), which were
mostly evaluated as small-scale variables. Despite the weak correlation, agriculture remains the
main source of impairment in coastal sites, with high nutrient loading (Pirhalla 2004) as well as
widespread fine sediment deposition and embeddedness (Paul et al. 2002). However,
agriculture and related variables were not the main source of IBI variability in this stratum and,
therefore, the model was not sensitive to them. The detrimental effects of agriculture on biologic
integrity have been widely reported (EPA 2000; Hall and Killen 2005; Lammert and Allan 1999;
Meador and Goldstein 2003; Shields et al. 2006; Stewart et al. 2001).
Because regional/catchment environmental characteristics were quite constant, the model
showed how other variables, more important at the reach-system level (related to stream variability),
took over as main predictors. Even though the background biotic integrity is mostly determined
by regional characteristics, fluctuations of IBI within the region were mostly determined by
variables relevant at a smaller geographic scale due to regional homogeneity. Nutrient and
sediment input, hydrologic regime and channel morphology are processes mostly controlled at
the regional scale. Other factors such as organic matter inputs, site habitat quality, as well as
shade are controlled at more local scales (Allan et al. 1997; Frissell et al. 1986).
Comparison of impaired and reference sites in coastal plains’ group 1 revealed pool quality as
the most critical issue to address in order to achieve realistic reference conditions as shown in
Figure 3-19. Pool quality was the only significantly different variable within group 1 between
reference and non-reference sites (Table 3-13). This might be an indication of a substantial gap
in habitat diversity (depth, current, and substrate or DCS) and habitat volume between reference
and non-reference sites. It has been reported that in headwater streams (group 1 is composed
mainly of small streams with an average DA equal to 3,600 acres), pools have a greater DCS
diversity and habitat volume than riffles (Schlosser 1982). Therefore, pool quality in small
streams is critical in order to achieve good biotic integrity. Differences in pH and DOC were
close to being statistically significant (Table 3-13). DOC and pH have not been found to have a
strong relationship with land use patterns in the Mid-Atlantic region (Herlihy et al. 1998). DOC
has been linked to nutrient enrichment (Leland and Porter 2000). However, differences in
nutrient concentrations were not observed in our research.
Biologic quality in piedmont sites was mainly determined by different configurations of
agricultural and urban land uses (average land use percentages in the drainage area equal to 57.5
and 11.7 percent respectively). As in coastal and highland sites, agricultural and barren land
uses were positively correlated to IBI (neuron-based r = 0.73). The opposite occurred with urban
land uses, which had a strong negative correlation (neuron-based r = -0.75) in piedmont regions.
Correlations between IBI and agriculture and urban land uses were much weaker in the two other
strata (neuron-based agriculture-IBI r = 0.16 and 0.41, and urban-IBI r = 0.121 and 0.01 in
highland and coastal sites respectively). Weaker correlations in these strata were possibly due to
the existence of a third regulating factor: larger percentages of forest and wetlands (see Table
3-4). Forest and wetlands only had a significant positive effect in a few sites in the piedmont
stratum (Figure 3-21).
Even though a positive correlation between agriculture and IBI might seem counter-intuitive,
this was the consequence of the agricultural-urban dominance. Wang et al. (2000)
demonstrated how expansion of urban land uses in traditionally agricultural watersheds led
to a decrease in fish species, fish density and IBI. Therefore, agriculture was positively
correlated to IBI despite its evident negative effects if compared to pristine conditions. Impact of
urbanization becomes critical to the biologic community when a threshold ranging from 8 to
15% of connected imperviousness is reached (Schueler 1994; Wang et al. 2001). In Maryland, a
10% increase in urban land use has been linked to a doubled likelihood of failing biocriteria
(Volstad et al. 2003).
In piedmont sites, a steady channel flow and good channel quality seemed to be critical in order
to achieve good biologic integrity. Percentage of channel covered by flow showed a strong
positive correlation to agriculture, but was negatively correlated to urban land uses. This
association could be the consequence of greater percentages of imperviousness in developed
lands and its subsequent increase in storm runoff with shorter residence times. Fast conveyance
channeling is common practice in order to deal with increased urban runoff which, in turn, would
degrade channel quality (Novotny 2003). Remoteness was also significant and is an indication of
the proximity of human activity to the sampling site (Mercurio et al. 1999). Again, pool quality
was identified as a very significant variable to IBI. Degree of embeddedness due to fine
sediment deposition was the last main variable affecting biologic integrity. Substrate degradation
has very negative consequences to aquatic fauna (Manolakos et al. 2007; Quinn and Hickey
1990; Rabeni and Smale 1995; Richards et al. 1993). However, in this region fine sediment
embeddedness was positively correlated to IBI, most likely due to its linkage with agriculture,
whose effects are less negative than those of urban sprawl.
In piedmont sites, comparison of reference and non-reference sites showed how urbanization was
more extensive in non-reference sites (differences were statistically significant in groups 1 and
5). Group 4 was an exception to this. Percentage of channel covered by flow, and channel and
pool quality were consistently higher in reference sites with the exception, again, of group 4.
Group 4 had the best biologic integrity overall and therefore, sources of impairment may differ
from the rest of the groups. The remaining metrics varied more between groups.
A high degree of heterogeneity was found in highland regions, as shown in Figure 3-25. The high
level of heterogeneity of highland observations in the MBSS database had already been identified
by Southerland et al. (2005). In their study, a cluster analysis of the fish assemblages separated
highland observations into two main groups: sites with a drainage area smaller than 3,000 acres
(12.14 km²) and sites with a larger drainage area. Our model successfully identified this pattern.
One of the selected environmental variables in the 2nd SOM training in highland sites was
strongly correlated to drainage area (i.e. average stream width). Many of the variables related to
drainage area were also among the top IBI predictors in highland areas, although most of them
were disregarded due to cross-correlation (i.e. average thalweg depth, drainage area, and
maximum depth). The separation of observations in highland regions matched the results by
Southerland et al. (2005) quite well. Groups 1 and 2 had a median drainage area equal to 3,108
and 3,393 acres respectively, while in group 3 it was equal to 16,765 acres. IBI seemed to be
positively correlated to drainage area. This pattern was also observed in coastal sites. Even
though the available IBI scores in the MBSS database are calibrated with drainage area (Roth et
al. 2000), these don’t capture actual reference conditions for smaller streams (Southerland et al.
2005). Our model was able to successfully detect this trend. The second significant variable to
IBI was Epifaunal substrate quality, which also showed a strong positive correlation to IBI as
expected.
The high level of heterogeneity was more evident in small streams (groups 1 and 2), maybe due
to the IBI bias mentioned above. Local habitat conditions (instream woody debris and rootwads,
shade, and channel quality), water quality (SO4), or channel morphology (slope) explained the
remaining variability. In larger streams, different biological responses were only found due to
significantly different water qualities (i.e. differences in SO4 concentration) (Figure 3-25).
Comparison of reference and non-reference sites consistently confirmed stream size, substrate
quality, and some local variables as key issues to address if reference conditions were to be met
in small highland streams (Table 3-17). No statistically significant differences were observed
in larger streams (group 3). However, average SO4 concentrations were almost four times higher
in non-reference sites (10.2 versus 38.1 in reference and non-reference sites respectively), which
might be an indication of chemical degradation as shown in Figure 3-25. Most likely, selection
of a more stringent reference percentile to set reference conditions (e.g. 80th instead of 75th)
would have identified this difference in mean SO4 concentration as statistically significant.
3.3. Conclusions
3.3.1. Ohio with instream data
• Instream substrate parameters (i.e. embeddedness), nutrient input (i.e. TKN), and stream size
were the variables with the clearest relationship to IBI, which strongly agreed with current
literature. The rest of the environmental gradients acted more like ‘moderators’ of the final IBI.
The effects of variables acting at a local scale (i.e. total zinc and nitrate concentrations and pH)
were also successfully identified and separated.
• The model was sensitive to stream size. Headwaters (DA < 51.8 km²) and wadeable streams
(51.8 ≤ DA ≤ 518 km²) were mainly contained within groups 1, 2, 3, and 6. Small (518 < DA ≤
2,590 km²) and large rivers (DA > 2,590 km²) were mainly in group 6. Group 4 showed greater
variability in stream size. In this group, however, most of the observations belonged to wadeable
streams, but very large streams (maximum DA = 15,672.1 km²) and small rivers were also
present. This was important because no a priori assumptions were made with the data. Stream
size is known to play an important role on biologic integrity, having a positive correlation to IBI
in Ohio. The model successfully identified this trend.
• Development of reference curves using this methodology gives an indication of the expected
probability of violation if a biotic standard needs to be achieved. For example, in group 2, the
WWH biologic standard (IBI = 40) would be violated 10% of the time in reference conditions.
With this methodology, the reference sites can be determined at will by watershed managers
depending on the realistic goals that must be met for different watershed types. In the present
research the reference threshold was set to the 75th percentile.
• The same methodology would yield more accurate results with an a priori separation of sites
in different ecoregions. In the present paper, the sites were not separated because not enough
observations were available for all of them to perform the partitions and subsequent curve
development. However, most of the data belonged to the ECBP and HELP ecoregions, with
higher natural nutrient concentrations, and the IP and EOLP with medium levels. Only 36
observations out of 429 belonged to the WAP ecoregion, with the lowest nutrient background
concentration.
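The violation probability mentioned in the reference-curve conclusion above follows directly from a Gaussian reference curve: it is the cumulative probability of the curve at the standard. The sketch below assumes a normal reference distribution; the mean and standard deviation are hypothetical values chosen only so that the result lands near the 10% quoted for group 2, and `violation_probability` is an illustrative name.

```python
from statistics import NormalDist

def violation_probability(ref_mean, ref_sd, standard):
    """Probability that a reference site scores below the biotic standard,
    assuming reference IBI scores follow a Gaussian curve."""
    return NormalDist(ref_mean, ref_sd).cdf(standard)

# Hypothetical reference curve; 40 is the WWH standard cited in the text
p_violation = violation_probability(ref_mean=46.4, ref_sd=5.0, standard=40.0)
```

Raising the reference percentile shifts the reference curve upward, which lowers the expected violation probability for the same standard.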
3.3.2. Ohio with offstream data
• The model partitioning corresponded very well to a progressive reduction of geographic
scale. Basin-scale variables (i.e. basin connectivity) segregated different biotic responses in
different basins or watersheds within a basin. Upstream regional buffer and watershed land use
segregated biologic responses at the stream segment level. Local land use separated different
responses due to different local conditions but within a watershed and stream context. With the
presented methodology we believe the scale-issue in the analysis of biologic integrity has been
resolved. The model developed is able to zoom in and out of geographic scale and identify
responses at each level of watershed characterization.
• Regional land uses, and particularly percentage of forest and agriculture in the 30 and 100-
meter regional buffers, were the most important variables to biotic integrity. Watershed
urbanization was also significant, especially in watersheds with degraded or poorly vegetated
stream buffers. These variables were responsible for the background integrity in the different
groups.
3.3.3. Maryland
• The methodology successfully divided biotic integrity responses in the three different strata
in Maryland. Variables affecting IBI at larger geographic scales were successfully identified and
strongly agreed with current literature. Potential biases of the available IBI in the database were
successfully recognized by the model. Because of this, conclusions from the research have to be
drawn with caution, especially in coastal and highland regions. However, the methodology can
be replicated easily when this bias is addressed.
• The normality hypothesis for the environmentally homogeneous groups was confirmed by
the model. IBI didn’t follow a Gaussian curve in any of the full strata databases. When different
biologic responses were separated, most of the groups became normal at some level of group
characterization. Nevertheless, some of the groups still didn’t follow a normal distribution. In
most cases, this was most likely due to the lack of a truly representative population sample because
few observations belonged to that group (fewer than 15). Only group 41 in the piedmont region and
group 111112 in the highland region (116 and 27 observations respectively) were exceptions.
The existence of relevant, non-identified stressors could explain why populations with different
responses weren’t separated.
• Coastal and highland regions were very heterogeneous natural systems and the models
successfully identified this. Many different biological responses due to local effects were
identified. However, it remains unclear with the available data from the MBSS if the highly
diverse biologic responses are due to strata’s variability, or presence of non-sensitive habitat
metrics in coastal areas and an IBI bias in highland sites. These issues are reported in current
literature.
• Biological integrity in piedmont areas was mainly dominated by a combination of
agricultural and urban land uses. Agriculture had a strong positive correlation to IBI and urban
land use had a strong negative correlation. Even though agriculture is negatively associated to
biological integrity, urban impact has more acute detrimental effects when a threshold is reached.
• Comparison of differences between reference and non-reference sites helped identify the
most critical issues to be addressed in order to achieve realistic goals for improvement in each
group. In small streams in coastal sites (group 1, average DA = 3,600 acres), pool quality is
critical in order to achieve such conditions. In larger streams, urbanization is the main problem.
In piedmont sites, urbanization and channel quality are the key issues to be addressed. In
highland areas, improvement of substrate quality combined with other local instream habitat and
chemical characteristics such as woody debris presence, shading, channel quality, or sulfate
concentration are paramount for IBI improvement. In larger streams, water quality is the major
issue in highland regions.
4. Main conclusions
Two main outcomes have been achieved with the work presented in this thesis. First, it was
demonstrated that IBI is predicted more accurately using data patterning techniques based on
environmental similarities than with traditional methods. Second, a new methodology that allows
evaluation of biologic response to environmental stressors at multiple scales was developed. This
methodology was named PROHIBID (Probabilistic Hierarchical Biologic Integrity
Discrimination).
Since biological integrity is at the top of the natural system hierarchy, it is impossible to find
simple mechanistic processes and mathematical equations able to link changes in the biological
community to one or several environmental variables. Biological integrity is the result of many
natural existing conditions and anthropogenic stressors that are highly intertwined and explain a
larger or smaller portion of the final outcome. Because of the high dimensionality of the
problem, traditional prediction or evaluation techniques have great limitations. A simple
comparison of the IBI predictions between the k-nearest neighbor concept (kNN) and traditional
linear and non-linear regressions clearly showed that the former was superior in performance and
computational efficiency. Moreover, prediction based on finding the most similar
environmental observations proved much more flexible because it was easily validated using a
leave-one-out approach without drastically increasing the computation time. Such an approach
was not possible when IBI was predicted using regression. In that case, a validation dataset had to
be set aside, because a leave-one-out approach would have meant
developing new equations each time one observation was taken out of the database.
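The leave-one-out validation of a kNN prediction described above can be illustrated with a minimal sketch. It assumes Euclidean distance on already standardized attributes and an unweighted mean of the neighbors' IBI; the site values, the attribute choice, and the function names are hypothetical and are not taken from the databases used in this thesis.

```python
import math

def knn_predict(train_x, train_y, target, k):
    """Mean response of the k observations closest to `target` (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], target))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def leave_one_out_rmse(x, y, k):
    """Predict every observation from all the others and return the RMSE."""
    squared_errors = []
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        squared_errors.append((knn_predict(rest_x, rest_y, x[i], k) - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical standardized attributes (e.g. urban land use, habitat score) and IBI scores.
sites = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85], [0.8, 0.2], [0.9, 0.1], [0.85, 0.3]]
ibi = [48.0, 46.0, 50.0, 22.0, 18.0, 24.0]

# No model is refitted when an observation is left out; only distances are recomputed.
for k in (1, 2, 3):
    print(k, round(leave_one_out_rmse(sites, ibi, k), 2))
```

Because nothing is fitted in advance, leaving one observation out only changes which rows are searched, which is why the validation adds little computation time.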
One of the main problems encountered using kNN was determining the optimum number of
closest neighbors that yielded the best possible prediction. Since extreme values in nature are by
definition rare (or at least much less frequent than non-extreme values), these were usually
predicted more accurately with lower numbers of k nearest neighbors (i.e. 1 or 2) because few
observations were truly environmentally similar. The opposite happened with observations with
no extreme values in any of the fields. Such observations had many other observations falling
within a smaller distance radius. Therefore, determining the optimum number of closest
neighbors to obtain the best possible prediction was a challenge because, depending on the type
of site and available observations in the databases, the ideal number of closest neighbors was
different in each case.
This issue was partially solved in Chapter 2, where the kNN technique was used to find the closest
branch of a hierarchical tree calculated with the observations being compared against. The
importance of using such a structure lies in the fact that each branch of the hierarchical structure
is composed of a group of observations that are very close to the remaining members of the same
branch. If the difference between two observations is larger than a specific threshold, they are
placed in different branches. Therefore, the closest branch to the target site being predicted is
composed only of truly similar observations. In Chapter 1, when a specified number
of k-nearest neighbors (i.e. 5) was selected arbitrarily, it was not guaranteed that all the closest
observations were truly similar (for example, for observations with extreme values, perhaps only
one or two sites were truly similar while the remaining three could be quite different and lead the
model to poor predictions). Another clear advantage of using such a hierarchical structure instead
of direct kNN prediction is the possibility of moving up and down the hierarchical structure
to find the number of branches that optimizes the prediction (the number of tree
branches can range from two to the number of available observations). The prediction techniques
used in Chapters 1 and 2 can easily be implemented in many other scenarios and
used to evaluate the effect of anthropogenic stressors on the biologic community (or any other
endpoint) if enough historical data are available.
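The closest-branch idea can be sketched as follows. This is only an illustration under stated assumptions: complete-linkage agglomerative clustering cut at a distance threshold stands in for the hierarchical tree, the closest branch is taken to be the one with the nearest centroid, and the data and function names are hypothetical. The tree actually built in Chapter 2 may differ in all of these respects.

```python
import math

def complete_linkage_branches(points, threshold):
    """Merge clusters until the closest pair is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two most distant members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        distance, a, b = min(pairs)
        if distance > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def centroid(points, members):
    dims = len(points[0])
    return [sum(points[i][d] for i in members) / len(members) for d in range(dims)]

def predict_from_closest_branch(points, responses, target, threshold):
    """Return the mean response of the branch nearest to `target` and its members."""
    branches = complete_linkage_branches(points, threshold)
    closest = min(branches, key=lambda m: math.dist(centroid(points, m), target))
    return sum(responses[i] for i in closest) / len(closest), sorted(closest)

# Hypothetical standardized sites; the last one is an extreme observation.
sites = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85], [0.8, 0.2], [0.9, 0.1], [2.0, 2.0]]
ibi = [48.0, 46.0, 50.0, 22.0, 18.0, 60.0]

print(predict_from_closest_branch(sites, ibi, [0.12, 0.88], threshold=0.3))
print(predict_from_closest_branch(sites, ibi, [1.9, 2.1], threshold=0.3))
```

In this sketch the number of neighbors is not fixed: the extreme site forms a branch of its own, while the common sites are averaged over a larger branch. Raising or lowering `threshold` corresponds to moving up or down the tree.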
The results from Chapter 1 and Chapter 2 also revealed the importance of scale in the prediction
of system endpoints. Background biologic integrity is determined by variables that are
ubiquitous at the scale of the study and they were named environmental gradients or large-scale
variables. However, this does not imply that non-ubiquitous variables do not play an
important role. Point source pollution, channelization, or other localized variables can have a big
impact in specific sites but little impact on the overall integrity of a region. Ubiquitous
stressors, in contrast, are capable of causing major shifts in species populations and, therefore, major changes in biotic
integrity. As a consequence, ubiquitous stressors affect the higher levels of the
species’ suitable habitat hierarchy. On the other hand, localized stressors only modify habitat
suitability at lower levels of the hierarchy and are only identified as significant variables when
the scale is small enough.
In order to address the scale issue, the PROHIBID methodology was developed. It was a
successful attempt to replicate the nested hierarchy of suitable habitats existing in nature.
Offstream environmental gradients in Ohio (i.e. large-scale variables) were mainly associated with
regional land use patterns, as expected. When instream variables were analyzed in Ohio, large-
scale variables were mainly related to nutrient input and habitat quality (which are directly
related to land use). PROHIBID successfully separated different biologic signatures that resulted
from different levels of stress at the local level.
The assumption of normality of the IBI distribution within a highly homogeneous environmental
group was supported by the results. Most of the groups resulting from the progressive segregation of biologic
responses followed a Gaussian distribution when the system was described in greater detail.
None of the initial databases followed such a distribution because they were highly heterogeneous
and different biologic signals were mixed.
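A normality check of this kind can be sketched with the Jarque-Bera statistic. Using this particular test here is an assumption made for illustration, since the paragraph above does not name the test applied; the function name and the example values are hypothetical. The p-value relies on the asymptotic chi-square distribution with two degrees of freedom and is unreliable for small groups.

```python
import math

def jarque_bera(values):
    """Jarque-Bera statistic and asymptotic p-value for a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # For a chi-square variable with 2 degrees of freedom, P(X > x) = exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

# Hypothetical IBI scores: a fairly symmetric group and a group mixing two signals.
homogeneous_group = [36.0, 38.0, 40.0, 41.0, 42.0, 43.0, 44.0, 46.0, 48.0]
mixed_database = [12.0, 13.0, 14.0, 14.0, 15.0, 15.0, 16.0, 17.0, 56.0]

print(jarque_bera(homogeneous_group))
print(jarque_bera(mixed_database))
```

A small p-value suggests departure from normality, as would be expected for a heterogeneous database in which different biologic signals are mixed.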
Because IBI can be easily characterized with a normal distribution and because the
environmental observations within a group are similar, realistic and achievable reference conditions
can be identified within each group and represented again with a normal curve. The importance
of this lies in the fact that it allows comparison between a group’s reference and non-reference
sites and helps target potential issues that must be addressed in order to achieve reference
conditions. Moreover, such methodology can be applied at different levels of system
characterization (in this thesis, analyses were performed at one specific level for the sake of
brevity). This is important because the effect of the different variables at one specific level of
system characterization is always analyzed in the environmental background context of each
group (i.e. the effect of a specific local variable is only revealed when the effect of other
stressors with a larger overall impact on biologic integrity has been segregated previously). If a
PROHIBID scheme has been developed in a specific region, watershed managers can easily find
actual reference conditions for targeted sites by identifying the most similar group at the level the
available environmental variables allow.
PROHIBID could easily be implemented for the establishment of biological standards based on
probability of exceedance similar to those used in water quality. In this thesis, the group
reference conditions were set arbitrarily at the 75th IBI percentile. However, reference conditions
can be more or less stringent depending on the designated use of a specific water body.
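A percentile-based reference condition and the corresponding probability of exceedance can be sketched as below, assuming the IBI scores of a group are described by a normal curve fitted with the sample mean and standard deviation. The group values and function names are hypothetical; only the 75th percentile comes from the text.

```python
from statistics import NormalDist, mean, stdev

def reference_threshold(group_ibi, percentile=0.75):
    """IBI value at the chosen percentile of the normal curve fitted to the group."""
    fitted = NormalDist(mean(group_ibi), stdev(group_ibi))
    return fitted.inv_cdf(percentile)

def probability_of_exceedance(group_ibi, score):
    """Probability that a site of this group exceeds `score` under the fitted curve."""
    fitted = NormalDist(mean(group_ibi), stdev(group_ibi))
    return 1.0 - fitted.cdf(score)

# Hypothetical IBI scores for one environmentally homogeneous group.
group = [30.0, 34.0, 36.0, 38.0, 40.0, 42.0, 44.0, 46.0, 50.0]

threshold = reference_threshold(group)        # 75th percentile, as in this thesis
stricter = reference_threshold(group, 0.90)   # a more stringent designated use
print(round(threshold, 1), round(stricter, 1))
print(round(probability_of_exceedance(group, threshold), 2))
```

Changing `percentile` is the only adjustment needed to make the reference condition more or less stringent for a given designated use.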
5. Future research and work
Research to further understand the relationship between biologic integrity and different types of
stressors acting at different scales should be performed. Moreover, implementation of scale-
sensitive methodologies to frame and segregate biologic responses is a real possibility with the
readily available historic data some environmental agencies have collected. In my opinion, some
of the most critical issues that need to be addressed before enforcing biologic integrity as a
standard for stream health preservation are the following.
1. Development of a stand-alone, fully-integrated model. The PROHIBID methodology
presented in the current thesis is the result of multiple steps that use diverse data patterning
techniques combined with statistical analysis, which may prove complex for potential users. If
such a methodology were to be applied, it would be necessary to develop a user-friendly framework in
which the user is only required to enter the model inputs in a spreadsheet.
2. Data sampling strategy: one of the main problems encountered when different states
were modeled was the lack of consistency in the sampled environmental data. For example, habitat
quality is evaluated using multiple habitat quality indices, but these and their corresponding
metrics differ substantially among states. Physical and water quality parameters are not always
consistent either. For example, Ohio was the only state in which metal concentrations were
measured; in Maryland, Dissolved Organic Carbon (DOC) concentrations were available but not
in the other states; and in Minnesota, stream channel morphology data were available but not in the
other databases.
While it is understandable that each region has its particular environmental challenges, a
minimum consensus in the sampling needs to be achieved. In my opinion, this consensus should
be achieved not at the state level (as it currently happens) but at the ecoregional level.
Ecoregions are defined as “areas within which there is spatial coincidence in characteristics of
geographical phenomena associated with differences in the quality, health, and integrity of
ecosystems” (Omernik 2004). “Characteristics of geographical phenomena” may include
geology, physiography, vegetation, climate, hydrology, terrestrial and aquatic fauna, and soils,
and may or may not include the impacts of human activity (e.g. land use patterns, vegetation
changes).
Sampling of potential large-scale anthropogenic disruptors should be homogeneous within an
ecoregion (and its basins) rather than at the state level. Targeted large-scale variables should at least
include the following disturbances: stream fragmentation (at the basin level or larger),
regional land use (in the drainage area and regional stream buffer and preferably using the
sixteen land use types defined in the NLCD), water quality parameters (mainly parameters
related to nutrient loading such as BOD, TN, TP or TKN, or ionic strength such as conductivity,
hardness, or SO4), habitat quality (preferably continuous measurements instead of discrete
metrics and mostly related to substrate quality and stream variability because they reflect
regional hydrologic conditions), and point source density and intensity (if point source impact is
significant in the region).
Because of the large number of potential small-scale disruptors, these should be evaluated at
smaller scales (e.g. the watershed level), targeting only those that are most likely to occur in a
specific area because of its particular environmental conditions. However, since there is a need to
compare impaired and non-impaired sites, several watersheds (impaired and non-impaired) with
similar large-scale environmental features should be sampled.
3. Holistic approach to improve stream health: the Clean Water Act of 1972 has been an
extraordinary tool to resolve the severe water quality problems U.S. streams faced at the end of the
last century. However, many research efforts agree that the main threat to U.S. stream health is
related not so much to water quality alone as to habitat degradation. Habitat degradation relates not only
to physical changes in habitat structure, but also to hydrologic and hydraulic modifications,
fragmentation, or siltation. Current disturbances are mostly related to non-point source pollution and
fragmentation of available habitat (physical, chemical, or hydraulic fragmentation). Non-point
source pollution is mainly driven by changes in the regional and local land use. Therefore, future
research evaluating the integrity of waters needs to be approached in this context and potential
solutions need to take this river ‘continuum’ concept into account.
4. Development of progressive biological standards: biologic integrity is a direct measure
of stream health. Its importance lies in the fact that it is an indication of disturbance in any part
of the natural system, not just water quality as explained in point 3. Therefore, setting biologic
standards is important to guarantee a minimal ecosystem functionality of a specific region. I
believe a statistical approach such as the one presented in Section 2 of this thesis should be
implemented because it allows easy identification of reference sites within a specific region.
Biological standards should be developed in a two-tier fashion. In the first phase, larger regions
(i.e. basins, sub-basins, or watersheds) within an environmentally homogeneous unit (i.e.
ecoregions) should be targeted to guarantee good background integrity for subsequent, more
stringent standards. In a second phase, and after the standards in phase one have been met, more
local standards can be developed (i.e. at the watershed or sub-watershed level) targeting small-
scale stressors present in the region of study.
5. Use of information from observations with missing attribute values:
The results presented in this thesis were obtained by selecting complete datasets with no missing
data in either the response variables (i.e. IBI) or the explanatory variables (i.e. instream and/or
offstream environmental attributes). However, it is important to realize that the initial databases
were composed of a larger number of observations. Many of these observations were not used in
the work presented in this thesis because they had one or several missing explanatory variables
and were therefore discarded. Dealing with observations with missing attributes is a common problem
when large databases are used. Research from many different disciplines has focused on
extracting the potentially valuable information contained in incomplete observations. Some
common scientific disciplines dealing extensively with such problems are genetics (Ouyang et al.
2004; Troyanskaya et al. 2001), political and social sciences (Fessant and Midenet 2002; King et
al. 2001; Wang 2003), neural computing and machine learning (Batista and Monard 2003), or
more recently, environmental sciences (De'ath and Fabricius 2000; Dickson and Giblin 2007;
Junninen et al. 2004).
The first step in order to adopt a methodology to estimate missing attribute values is to determine
their degree of randomness because this will affect subsequent missing data treatment. Three
commonly accepted categories for missing data randomness are the following (Little and Rubin
1987):
1. Missing completely at random (MCAR): this is the highest level of randomness. It occurs
when the probability of an observation having a missing value for an attribute does not depend
on either the known values or the missing data. At this level, any missing data treatment can be
applied without risk of introducing bias into the data. The missing data in the presented research
qualify as MCAR.
2. Missing at random (MAR): when the probability of an observation having a missing
value for an attribute may depend on the known values but not on the value of the missing data
itself.
3. Not missing at random (NMAR): when the probability of an observation having a
missing value for an attribute could depend on the value of that attribute.
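The difference between the three categories can be made concrete with a small simulation. This is a hypothetical sketch: each row is a pair (x, y) in which x is always known and y may be deleted, and the median cut-offs, the deletion rates, and the function name are arbitrary choices made for illustration.

```python
import random
from statistics import median

def mask_response(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of `rows` in which some y values are replaced by None."""
    rng = random.Random(seed)
    x_cut = median(x for x, _ in rows)
    y_cut = median(y for _, y in rows)
    masked = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate                               # independent of x and y
        elif mechanism == "MAR":
            p = 2 * rate if x > x_cut else 0.0     # depends only on the known value x
        elif mechanism == "NMAR":
            p = 2 * rate if y > y_cut else 0.0     # depends on the missing value itself
        else:
            raise ValueError(mechanism)
        masked.append((x, None if rng.random() < p else y))
    return masked

# Hypothetical data: x is a fully observed attribute, y an attribute with gaps.
rows = [(float(i), float((i * 7) % 20)) for i in range(20)]

for mechanism in ("MCAR", "MAR", "NMAR"):
    missing = [x for x, y in mask_response(rows, mechanism) if y is None]
    print(mechanism, missing)
```

Under MAR the gaps can be predicted from the known attribute, whereas under NMAR they cannot be explained without the deleted values themselves, which is why the category determines the subsequent treatment.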
Several different methods have been proposed in the literature to treat missing data. These
methodologies can be divided in three main categories (Little and Rubin 1987):
1. Ignoring and discarding data: this consists of discarding observations and/or attributes
with missing entries. This methodology was adopted in the present thesis.
2. Parameter estimation: this category includes all those methods that involve the
calculation of parameters of a maximum likelihood function using a complete set of data.
Probably the most widely used methodology falling into this category is the Multiple
Imputation (MI) method. The widely implemented Expectation-Maximization (EM) algorithm,
which can handle parameter estimation in the presence of missing data
(Dempster et al. 1977), is one algorithm used to implement it.
3. Imputation: this category refers to those procedures that aim to fill missing values with
estimated ones. Information from known relationships identified with the valid observations is
used to estimate the missing entries. Examples of imputation methods are kNN, SOM,
Multi-Layer Perceptron structures (MLP), or hierarchical trees (De'ath and Fabricius 2000;
Junninen et al. 2004). Other very commonly used, although rather naïve, imputation methods are
row or column average, or imputation of zeroes. Other simple imputation
methodologies are univariate linear, spline, or nearest neighbor interpolation, and multivariate
regression-based imputation (Junninen et al. 2004). Hybrid methods combine different
imputation methodologies depending on the ‘length of the gap’ in the missing data (e.g. in time-
series data) or the percentage of missing data.
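Two of the simple imputation methods mentioned above, column averaging and linear interpolation, can be sketched as follows. Missing entries are represented by None, and the function names and values are hypothetical.

```python
def impute_column_mean(column):
    """Replace every missing entry by the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]

def impute_linear(series):
    """Fill interior gaps by linear interpolation between the nearest observed values.

    Gaps at the start or end of the series are left as None because there is
    no observed value on one side to interpolate from.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled

# Hypothetical time series of a water quality parameter with a gap of three samples.
series = [2.0, None, None, None, 10.0, None]

print(impute_column_mean(series))
print(impute_linear(series))
```

The column mean ignores the position of the gap, while interpolation uses it. A hybrid method in the sense described above would choose between such procedures according to the length of the gap.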
The MI approach involves imputing m values for each missing item in an observation and
creating m complete data sets. Therefore, the observed values within each data set are the same,
but the imputed values are different to reflect uncertainty (King et al. 2001). Hence, each of the m
data sets can be analyzed as a complete data set, and a procedure is then used to combine the m results.
One MI model that has proven useful in many situations assumes that the variables are jointly
multivariate normal. Even though the normal distribution is just an approximation (few data sets
have variables that are all continuous and unbounded), many researchers have found that it
works as well as other more complicated functions especially designed for categorical or mixed
data (Schafer 1997; Schafer and Olsen 1998).
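The combination of the m results can be sketched with the pooling rules commonly attributed to Rubin. Applying these rules here is an assumption, since the text above does not name the combination procedure; the numbers and the function name are hypothetical. The pooled estimate is the average of the m estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m).

```python
from statistics import mean, variance

def pool_estimates(estimates, variances):
    """Combine m point estimates and their variances into one estimate and variance."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average sampling variance inside each data set
    between = variance(estimates)     # sample variance across data sets (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical mean IBI estimated from m = 3 completed data sets.
estimates = [10.0, 12.0, 11.0]
variances = [1.0, 1.0, 1.0]

print(pool_estimates(estimates, variances))
```

The between-imputation term is what carries the uncertainty due to the missing values: if all m data sets gave the same estimate, the total variance would reduce to the within-imputation variance.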
In MI, the missing attribute values are usually imputed with a linear or multinomial
regression function of the remaining known attributes within the same observation. The regression
coefficients (vector β) are then estimated and uncertainty is introduced using a random parameter
(ε). Therefore, m different estimated values of the missing attribute are obtained. Subsequently,
using the normality condition, a likelihood function can be calculated with the vector of variable
means (µ) and the variance matrix (Σ) of the p variables (dependent and independent attributes)
of the full observations. Within the m generated versions of the incomplete observation, the one
which maximizes the likelihood function is selected and its calculated missing value is chosen.
Even though MI approaches seem to be a very reliable way of imputing data, comparable to other
methods such as the SOM (Dickson and Giblin 2007), computing the data likelihood function
can be infeasible with classical methods. In response to such difficulties, different algorithms
have been developed, such as Imputation Posterior (IP), which is based on Markov Chain
Monte Carlo methods and requires a high level of expertise, or Expectation-Maximization
(EM), which is deterministic. IP draws random simulations from the multivariate normal
observed data posterior, P(Dmis | Dobs), while EM calculates the posterior means
deterministically. EM has the advantage that it is much faster in finding the maximum of the
likelihood function but the drawback that it does not yield the rest of the distribution (King et al.
2001). For detailed information on the IP algorithm refer to Schafer (1997) and for the EM
algorithm refer to Dempster et al. (1977) and McLachlan and Krishnan (1997). MI is considered
the most accurate and reliable way to infer missing parameters in time-series air quality data sets.
Its main drawback is the computation speed (Junninen et al. 2004).
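The deterministic character of EM can be illustrated with the simplest non-trivial case. The sketch below assumes two jointly normal attributes, x fully observed and y partly missing (None); the data and the function name are hypothetical, and the general multivariate algorithm is described in the references cited above. The E-step replaces each missing y and y squared by their conditional expectations given x, and the M-step recomputes the means and (co)variances from the completed sums.

```python
def em_bivariate(xs, ys, iterations=200):
    """Maximum likelihood means and (co)variances when some ys are None."""
    n = len(xs)
    observed = [y for y in ys if y is not None]
    # x is complete, so its mean and variance are estimated directly.
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Starting values from the observed y values, with no x-y association.
    mu_y = sum(observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iterations):
        beta = s_xy / s_xx
        residual_variance = s_yy - beta * s_xy
        # E-step: expected y and y squared for every observation.
        e_y, e_yy = [], []
        for x, y in zip(xs, ys):
            if y is None:
                conditional_mean = mu_y + beta * (x - mu_x)
                e_y.append(conditional_mean)
                e_yy.append(conditional_mean ** 2 + residual_variance)
            else:
                e_y.append(y)
                e_yy.append(y * y)
        # M-step: update the parameters from the expected sufficient statistics.
        mu_y = sum(e_y) / n
        s_yy = sum(e_yy) / n - mu_y ** 2
        s_xy = sum(x * ey for x, ey in zip(xs, e_y)) / n - mu_x * mu_y
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_yy": s_yy, "s_xy": s_xy}

# Hypothetical attribute pairs; every second y value is missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, None, 6.0, None, 9.9, None]

print(em_bivariate(xs, ys))
```

No random numbers are involved, so repeated runs give identical estimates. The result is a single set of parameter estimates rather than a distribution, which is the drawback noted above.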
In the present thesis, techniques such as kNN or SOM were used for prediction and data
classification purposes respectively, and their principles were explained in previous chapters. These
same techniques could be easily implemented for missing data imputation. An MLP structure,
which was not used in this thesis, is probably the most widely known and successful neural
network. These networks employ a feed-forward architecture and are typically trained using a
procedure called error back-propagation (Junninen et al. 2004). However, the MLP appears to be
a viable alternative to classical imputation models only when calibration data are
sufficient, and it does not solve the practical difficulties encountered in the treatment of real-size surveys.
The MLP has a fixed architecture for imputing a pre-defined set of variables from another
pre-defined set of variables. In real applications, the combination of missing items
varies among observations, which makes MLP implementation difficult in most cases (Fessant
and Midenet 2002).
Junninen et al. (2004) compared the data imputation performance in air quality data sets of
different techniques such as row averaging, linear interpolation, multivariate regressions, kNN,
SOM, MLP, and MI, along with hybrid methods of these. In all cases, the hybrid MI model was the
most accurate and reliable data imputation model. The hybrid SOM, kNN, and MLP had the
second best performances and their results were very similar. Non-hybrid SOM, kNN, and MLP
had the third best performances (again with very similar results among all three). All these models
significantly outperformed linear interpolation, multivariate regression, and row averaging
methods.
Another study attempted to calculate missing trace metal concentrations in groundwater. In
this case the performance of the EM algorithm was compared against the SOM. In all cases (with 25%
and 50% of data missing) the SOM outperformed the EM algorithm, whose missing value
estimates tended to be more scattered (Dickson and Giblin 2007). SOM can also be designed to
include uncertainty like the IP multiple imputation models do. This can be done by calculating a
fuzzy-SOM trained with complete and incomplete observations. Incomplete observations are
called fuzzy because different possible values for the missing attribute are introduced in the
model by estimating membership functions of the missing attribute (Wang 2003).
In another study by Troyanskaya et al. (2001), the kNN imputation method was compared to
other methods such as the row average method and Singular Value Decomposition (SVD,
which is based on principal component decomposition) in gene expression databases. Both kNN
and SVD significantly outperformed the row average method and both methods were robust to
an increasing fraction of missing data. However, kNN was less sensitive to the type of data used
and data noise and was able to provide accurate estimations for missing values in genes that
belonged to small tight expression clusters. SVD only predicted well in dominant clusters. A
similar conclusion for kNN imputation was reached by Batista and Monard (2003). In their work,
kNN outperformed other methods such as mean/mode imputation or no imputation at all.
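A minimal sketch of kNN imputation is given below. It assumes that neighbors are searched only among complete observations, that the distance is Euclidean over the attributes known in the incomplete observation, and that the missing value is filled with the unweighted mean of the neighbors. The studies cited above may use weighted averages or other distances, and the data and function name are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that attribute over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        known = [j for j, v in enumerate(row) if v is not None]

        def distance(other, row=row, known=known):
            # Only attributes observed in the incomplete row enter the distance.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in known))

        neighbors = sorted(complete, key=distance)[:k]
        imputed.append([
            v if v is not None else sum(n[j] for n in neighbors) / len(neighbors)
            for j, v in enumerate(row)
        ])
    return imputed

# Hypothetical observations with two attributes; the last one has a missing value.
rows = [[1.0, 10.0], [1.2, 12.0], [5.0, 50.0], [5.2, 52.0], [1.1, None]]

print(knn_impute(rows, k=2))
```

Because the estimate comes only from the most similar observations, values belonging to small, tight groups can still be recovered, which is consistent with the behavior reported for kNN above.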
6. References
Allan, J. D. (2004). "Influence of land use and landscape setting on the ecological status of rivers." Limnetica, 23(3-4), 187-198.
Allan, J. D., Erickson, D. L., and Fay, J. (1997). "The influence of catchment land use on stream integrity across multiple spatial scales." Freshwater Biology, 37(1), 149-161.
Allen, T. F. H., and Starr, T. B. (1982). Hierarchy : perspectives for ecological complexity, University of Chicago Press, Chicago.
Anderson, J. R., Harvey, E. H., Roach, J. T., and Whitman, R. E. (1976). "A land use and land cover classification system for use with remote sensor data." Geological Survey Professional Paper 964, U.S. Government Printing Office, Washington D.C.
Archer, D., and Newson, M. (2002). "The use of indices of flow variability in assessing the hydrological and instream habitat impacts of upland afforestation and drainage." Journal of Hydrology, 268(1-4), 244-258.
Barbour, M. T., Gerritsen, J., Snyder, B. D., and Stribling, J. B. (1999). "Rapid bioassessment protocols for use in streams and wadeable rivers: periphyton, benthic macroinvertebrates, and fish, second ed. EPA 841-B-99-002." US Environmental Protection Agency, Washington, DC.
Batista, G. E. A. P. A., and Monard, M. C. (2003). "An analysis of four missing data treatment methods for supervised learning." Applied Artificial Intelligence, 17(5-6), 519-533.
Beyer, H. L. (2004). "Hawth's Analysis Tools for ArcGIS." (2008).
Bode, R. W. (1988). "Methods for Rapid Biological Assessment of Streams." New York State Department of Environmental Conservation, Albany, NY.
Castelle, A. J., Johnson, A. W., and Conolly, C. (1994). "Wetland and Stream Buffer Size Requirements - a Review." Journal of Environmental Quality, 23(5), 878-882.
Cereghino, R., Giraudel, J. L., and Compin, A. (2001). "Spatial analysis of stream invertebrates distribution in the Adour-Garonne drainage basin (France), using Kohonen self organizing maps." Ecological Modelling, 146(1-3), 167-180.
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). Graphical methods for data analysis, Wadsworth & Brooks/Cole, Pacific Grove, CA.
Chon, T. S., Park, Y. S., Moon, K. H., and Cha, E. Y. (1996). "Patternizing communities by using an artificial neural network." Ecological Modelling, 90(1), 69-78.
Davies, D. L., and Bouldin, D. W. (1979). "A cluster separation measure." IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227.
De'ath, G., and Fabricius, K. E. (2000). "Classification and regression trees: A powerful yet simple technique for ecological data analysis." Ecology, 81(11), 3178-3192.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum likelihood from incomplete data via the EM algorithm (with discussion)." Journal of the Royal Statistical Society, B39, 1-38.
Detenbeck, N. E., Batterman, S. L., Brady, V. J., Brazner, J. C., Snarski, V. M., Taylor, D. L., Thompson, J. A., and Arthur, J. W. (2000). "A test of watershed classification systems for ecological risk assessment." Environmental Toxicology and Chemistry, 19(4(2)), 1174-1181.
Detenbeck, N. E., Johnston, C. A., and Niemi, G. J. (1993). "Wetland Effects on Lake Water-Quality in the Minneapolis St-Paul Metropolitan-Area." Landscape Ecology, 8(1), 39-61.
Dickson, B. L., and Giblin, A. M. (2007). "An evaluation of methods for imputation of missing trace element data in groundwaters." Geochemistry-Exploration Environment Analysis, 7, 173-178.
DNR, M. (2008). "Maryland Biological Stream Survey. Available at: http://www.dnr.state.md.us/streams/mbss/." <http://www.dnr.state.md.us/streams/mbss/>.
Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern classification. 2nd edition, Wiley, New York, NY.
Dyer, S. D., White-Hull, C., Carr, G. J., Smith, E. P., and Wang, X. H. (2000). "Bottom-up and top-down approaches to assess multiple stressors over large geographic areas." Environmental Toxicology and Chemistry, 19(4), 1066-1075.
Dyer, S. D., White-Hull, C., Wang, X., Johnson, T. D., and Carr, G. J. (1998a). "Determining the influence of habitat and chemical factors on instream biotic integrity for a Southern Ohio watershed." Journal of Aquatic Ecosystem Stress and Recovery, 6, 91-110.
Dyer, S. D., White-Hull, C. E., Johnson, T. D., Carr, G. J., and Wang, X. (1998b). "The importance of space in understanding the risk of multiple stressors on the biological integrity of receiving waters." Journal of Hazardous Materials, 61(1-3), 37-41.
Dynesius, M., and Nilsson, C. (1994). "Fragmentation and Flow Regulation of River Systems in the Northern 3rd of the World." Science, 266(5186), 753-762.
EPA. (2000). "The quality of our nation's waters. EPA 841-S-00-001." USEPA Office of Water, Washington, DC.
EPA. (2008a). "Current national recommended water quality criteria. Available at: http://www.epa.gov./waterscience/criteria/wqcriteria.html. Last time visited: April 2008."
EPA. (2008b). "Multi-resolution land characteristics consortium (MRLC). Available at: http://www.epa.gov/mrlc/."
EPA. (2008c). "Permit Compliance System Database." Available at: http://epa.gov/enviro/html/pcs/pcs_query_java.html.
Fessant, F., and Midenet, S. (2002). "Self-organising map for data imputation and correction in surveys." Neural Computing & Applications, 10(4), 300-310.
Fischer, S., and Kummer, H. (2000). "Effects of residual flow and habitat fragmentation on distribution and movement of bullhead (Cottus gobio L.) in an alpine stream." Hydrobiologia, 422, 305-317.
Freeman, M. C., Bowen, Z. H., Bovee, K. D., and Irwin, E. R. (2001). "Flow and habitat effects on juvenile fish abundance in natural and altered flow regimes." Ecological Applications, 11(1), 179-190.
Frissell, C. A., Liss, W. J., Warren, C. E., and Hurley, M. D. (1986). "A hierarchical framework for stream habitat classification: viewing streams in a watershed context." Environmental Management, 10, 199-214.
Gilvear, D. J., Heal, K. V., and Stephen, A. (2002). "Hydrology and the ecological quality of Scottish river ecosystems." Science of the Total Environment, 294(1-3), 131-159.
Gujarati, D. N. (2003). Basic econometrics, McGraw-Hill, NY.
Hall, L. W., and Killen, W. D. (2005). "Temporal and spatial assessment of water quality, physical habitat, and benthic communities in an impaired agricultural stream in California's San Joaquin Valley." Journal of Environmental Science and Health Part A-Toxic/Hazardous Substances & Environmental Engineering, 40(5), 959-989.
Hall, L. W., Morgan, R. P., Perry, E. S., and Waltz, A. (1999). "Development of a provisional physical habitat index for Maryland freshwater streams." Maryland Department of Natural Resources. Chesapeake Bay and watershed programs. Monitoring and non-tidal assessment., Annapolis, MD.
Hall, L. W., Scott, M. C., Killen, W. D., and Anderson, R. D. (1996). "The effects of land-use characteristics and acid sensitivity on the ecological status of Maryland coastal plain streams." Environmental Toxicology and Chemistry, 15(3), 384-394.
Herlihy, A., Stoddard, J. L., and Johnson, C. B. (1998). "The relationship between stream chemistry and watershed land cover data in the Mid-Atlantic region, USA." Water, Air, and Soil Pollution, 105, 377-386.
Hilsenhoff, W. L. (1987). "An improved biotic index of organic stream pollution." Great Lakes Entomologist, 20(1), 31-39.
Jain, A. K., and Dubes, R. C. (1988). Algorithms for clustering data., Prentice Hall Inc., Saddle River, NJ.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). "Data clustering: a review." ACM Computing Surveys, 31(3), 264-323.
Jarque, C. M., and Bera, A. K. (1987). "A test of normality of observations and regression residuals." International Statistical Review, 55(2), 163-172.
Johnson, L. B., Richards, C., Host, G. E., and Arthur, J. W. (1997). "Landscape influences on water chemistry in Midwestern stream ecosystems." Freshwater Biology, 37(1), 193-&.
Johnston, C. A., Detenbeck, N. E., and Niemi, G. J. (1990). "The cumulative effect of wetlands on stream water quality and quantity. A landscape approach." Biogeochemistry, 10, 105-141.
Judge, G. G., Hill, R. C., Griffiths, W. E., Lutkepohl, H., and Lee, T.-C. (1985). The theory and practice of econometrics, 2nd Ed., Wiley, NY.
Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., and Kolehmainen, M. (2004). "Methods for imputation of missing values in air quality data sets." Atmospheric Environment, 38(18), 2895-2907.
Karr, J. R. (1991). "Biological Integrity: a long-neglected aspect of water resource management." Ecological Applications, 1(1), 66-84.
Karr, J. R., Fausch, K. D., Angermeier, P. L., Yant, P. R., and Schlosser, I. J. (1986). "Assessing biological integrity of running waters: a method and its rationale." Illinois Natural History Survey, Champaign, IL.
Karr, J. R., and Kerans, B. L. (1981). "Components of biological integrity: their definition and use in development of an invertebrate IBI." 1991 MidWest Pollution Control Biologists Meeting. Environmental Indicators: measurement and assessment endpoints. U.S. Environmental Protection Agency, Lincolnwood, IL.
King, G., Honaker, J., Joseph, A., and Scheve, K. (2001). "Analyzing incomplete political science data: An alternative algorithm for multiple imputation." American Political Science Review, 95(1), 49-69.
Kiviluoto, K. (Year). "Topology preservation in Self-Organizing Maps." IEEE International Conference in Neural Networks, 294-299.
Kohonen, T. (2001). Self-Organizing Maps, 3 Ed., Springer-Verlag, Berlin.
Kolasa, J. (1989). "Ecological systems in hierarchical perspective: breaks in community structure and other consequences." Ecology, 70(1), 36-47.
Kolasa, J., and Biesiadka, E. (1984). "Diversity Concept in Ecology." Acta Biotheoretica, 33, 145-162.
Kolasa, J., and Strayer, D. (1988). "Patterns of the abundance of species: a comparison of two hierarchical models." OIKOS, 53, 235-241.
Lammert, M., and Allan, J. D. (1999). "Assessing biotic integrity of streams: Effects of scale in measuring the influence of land use/cover and habitat structure on fish and macroinvertebrates." Environmental Management, 23(2), 257-270.
Leland, H. V., and Porter, S. D. (2000). "Distribution of benthic algae in the upper Illinois River basin in relation to geology and land use." Freshwater Biology, 44(2), 279-301.
Little, R. J., and Rubin, D. B. (1987). Statistical analysis with missing data, John Wiley and Sons, New York.
Lyons, J. (2006). "A fish-based index of biotic integrity to assess intermittent headwater streams in Wisconsin, USA." Environmental Monitoring and Assessment, 122(1-3), 239-258.
Lyons, J., Piette, R. R., and Niermeyer, K. W. (2001). "Development, validation, and application of a fish-based index of biotic integrity for Wisconsin's large warmwater rivers." Transactions of the American Fisheries Society, 130(6), 1077-1094.
Manolakos, E., Virani, H., and Novotny, V. (2007). "Extracting knowledge on the links between the water body stressors and biotic integrity." Water Research, 41(18), 4041-4050.
McLachlan, G. J., and Krishnan, T. (1997). The EM algorithm and extensions, Wiley, New York.
Meador, M. R., and Goldstein, R. M. (2003). "Assessing water quality at large geographic scales: Relations among land use, water physicochemistry, riparian condition, and fish community structure." Environmental Management, 31(4), 504-517.
Mercurio, G., Chaillou, J. C., and Roth, N. E. (1999). "Guide to using the 1995-1997 Maryland Biological Stream Survey data." Maryland Department of Natural Resources, Annapolis, MD.
Minshall, G. W. (1984). "Aquatic-insect substratum relationships." In: The Ecology Of Aquatic Insects, V. H. Resh and D. M. Rosenberg, eds., Praeger Scientific, New York NY, 358-400.
Mitsch, W. J., and Gosselink, J. G. (1986). Wetlands, Van Nostrand Reinhold, New York, NY.
Morita, K., and Yamamoto, S. (2002). "Effects of habitat fragmentation by damming on the persistence of stream-dwelling charr populations." Conservation Biology, 16(5), 1318-1323.
Morita, K., and Yokota, A. (2002). "Population viability of stream-resident salmonids after habitat fragmentation: a case study with white-spotted charr (Salvelinus leucomaenis) by an individual based model." Ecological Modelling, 155(1), 85-94.
Morley, S. A., and Karr, J. R. (2002). "Assessing and restoring the health of urban streams in the Puget Sound basin." Conservation Biology, 16(6), 1498-1509.
Nilsson, C., Reidy, C. A., Dynesius, M., and Revenga, C. (2005). "Fragmentation and flow regulation of the world's large river systems." Science, 308(5720), 405-408.
Norton, S. B., Cormier, S. M., Smith, M., and Jones, R. C. (2000). "Can biological assessments discriminate among types of stress? A case study from the Eastern Corn Belt Plains ecoregion." Environmental Toxicology and Chemistry, 19(4), 1113-1119.
Norton, S. B., Cormier, S. M., Smith, M., Jones, R. C., and Schubauer-Berigan, M. (2002). "Predicting levels of stress from biological assessment data: Empirical models from the Eastern Corn Belt Plains, Ohio, USA." Environmental Toxicology and Chemistry, 21(6), 1168-1175.
Noss, R. F. (1990). "Indicators for monitoring biodiversity: a hierarchical approach." Conservation Biology, 4(4), 355-364.
Novotny, V. (2003). Water Quality. Diffuse Pollution and Watershed Management, 2 Ed., John Wiley & Sons, New York.
Novotny, V. (2004). "Simplified Databased Total Maximum Daily Loads, or the World is Log-Normal." Journal of Environmental Engineering, June 2004, 674-683.
Novotny, V., Bartosova, A., O'Reilly, N., and Ehlinger, T. (2005). "Unlocking the relationship of biotic integrity of impaired waters to anthropogenic stresses." Water Research, 39(1), 184-198.
Novotny, V., Manolakos, E., Ehlinger, T., Bartosova, A., O'Reilly, N., Bedoya, D., McGarvey, K., Brooks, J., Beach, D., Farah, J., and Shaker, R. (2007). "Developing a risk propagation model for estimating ecological responses of streams to anthropogenic watershed stresses and stream modifications. Final Report." Center for Urban and Environmental Studies, Northeastern University, Boston, MA. Available at: http://www.coe.neu.edu/environment/WebReports/EPA_final_Report2.pdf.
O'Neill, R. V., DeAngelis, D. L., Waide, J. B., and Allen, T. F. H. (1986). A hierarchical concept of ecosystems, Princeton University Press, Princeton, NJ.
Ohio_EPA. (1987). "Biological Criteria for the Protection of Aquatic Life: Volume I-III. Standardized field and laboratory methods for assessing fish and macroinvertebrate communities." Division of Water Quality Monitoring and Assessment, Surface Water Section, Columbus, OH.
Omernik, J. M. (2004). "Perspectives on the nature and definition of ecological regions." Environmental Management, 34, S27-S38.
Ott, W. R. (1978). Environmental Indices: theory and practice, Ann Arbor Science, Ann Arbor, MI.
Ouyang, M., Welsh, W. J., and Georgopoulos, P. (2004). "Gaussian mixture clustering and imputation of microarray data." Bioinformatics, 20(6), 917-923.
Park, Y. S., Chang, J. B., Lek, S., Cao, W. X., and Brosse, S. (2003). "Conservation strategies for endemic fish species threatened by the Three Gorges Dam." Conservation Biology, 17(6), 1748-1758.
Parkyn, S. M., Davies-Colley, R. J., Halliday, N. J., Costley, K. J., and Croker, G. F. (2003). "Planted riparian buffer zones in New Zealand: Do they live up to expectations?" Restoration Ecology, 11(4), 436-447.
Paul, M. J., Stribling, J. B., Klauda, R. J., Kazyak, P. F., Southerland, M. T., and Roth, N. E. (2002). "A physical habitat index for freshwater wadeable streams in Maryland. Final report." Maryland Department of Natural Resources. Chesapeake Bay and watershed programs. Monitoring and non-tidal assessment, Annapolis, MD.
Pickett, S. T. A., Kolasa, J., Armesto, J. J., and Collins, S. L. (1989). "The ecological concept of disturbance and its expression at various hierarchical levels." OIKOS, 54, 129-136.
Pirhalla, D. E. (2004). "Evaluating fish-habitat relationships for refining regional indexes of biotic integrity: Development of a tolerance index of habitat degradation for Maryland stream fishes." Transactions of the American Fisheries Society, 133(1), 144-159.
Poff, N. L., and Allan, J. D. (1995). "Functional-Organization of Stream Fish Assemblages in Relation to Hydrological Variability." Ecology, 76(2), 606-627.
Poff, N. L., Allan, J. D., Bain, M. B., Karr, J. R., Prestegaard, K. L., Richter, B. D., Sparks, R. E., and Stromberg, J. C. (1997). "The natural flow regime." Bioscience, 47(11), 769-784.
Quinn, J. M., Cooper, A. B., Davies-Colley, R. J., Rutherford, J. C., and Williamson, R. B. (1997). "Land use effects on habitat, water quality, periphyton, and benthic invertebrates in Waikato, New Zealand, hill-country streams." New Zealand Journal of Marine and Freshwater Research, 31(5), 579-597.
Quinn, J. M., and Hickey, C. W. (1990). "Magnitude of effects of substrate particle size, recent flooding, and watershed development on benthic invertebrates in 88 New Zealand rivers." New Zealand Journal of Marine and Freshwater Research, 24, 411-428.
Rabeni, C. F., and Smale, M. A. (1995). "Effects of Siltation on Stream Fishes and the Potential Mitigating Role of the Buffering Riparian Zone." Hydrobiologia, 303(1-3), 211-219.
Rankin, E. T. (1989). "The Qualitative Habitat Evaluation Index (QHEI): rationale, methods, and application." Ecological Assessment Section, Division of Water Quality, Planning, and Assessment. Ohio Environmental Protection Agency, Columbus, OH.
Rankin, E. T., Miltner, B., Yoder, C. O., and Mishne, D. (1999). "Association between nutrients, habitat, and the aquatic biota in Ohio rivers and streams." Ohio EPA Technical Bulletin MAS/1999-1-1, Columbus, OH.
Rankin, E. T., Yoder, C. O., and Mishne, D. (1990). "Ohio Water Resources Inventory: Executive Summary and Volume 1." Ohio Environmental Protection Agency, Columbus, OH.
ReyesGavilan, F. G., Garrido, R., Nicieza, A. G., Toledo, M. M., and Brana, F. (1996). "Fish community variation along physical gradients in short streams of northern Spain and the disruptive effect of dams." Hydrobiologia, 321(2), 155-163.
Richards, C., Host, G. E., and Arthur, J. W. (1993). "Identification of Predominant Environmental-Factors Structuring Stream Macroinvertebrate Communities within a Large Agricultural Catchment." Freshwater Biology, 29(2), 285-294.
Richards, C., Johnson, L. B., and Host, G. E. (1996). "Landscape-scale influences on stream habitats and biota." Canadian Journal of Fisheries and Aquatic Sciences, 53, 295-311.
Richter, B. D., Baumgartner, J. V., Powell, J., and Braun, D. P. (1996). "A method for assessing hydrologic alteration within ecosystems." Conservation Biology, 10(4), 1163-1174.
Roth, N., Southerland, M., Chaillou, J., Klauda, R., Kazyak, P., Stranko, S., Weisberg, S., Hall, L., and Morgan, R. (1998). "Maryland biological stream survey: Development of a fish Index of Biotic Integrity." Environmental Monitoring and Assessment, 51(1-2), 89-106.
Roth, N. E., Allan, J. D., and Erickson, D. L. (1996). "Landscape influences on stream biotic integrity assessed at multiple spatial scales." Landscape Ecology, 11(3), 141-156.
Roth, N. E., Southerland, M. T., Chaillou, J. C., Kazyak, P. F., and Stranko, S. A. (2000). "Refinement and validation of a fish index of biotic integrity for Maryland streams." Prepared by Versar Inc. for Maryland Department of Natural Resources, Columbia, MD.
Rykiel, E. J. (1985). "Towards a definition of ecological disturbance." Australian Journal of Ecology, 10, 361-365.
Scarsbrook, M. R., and Halliday, J. (1999). "Transition from pasture to native forest land-use along stream continua: effects on stream ecosystems and implications for restoration." New Zealand Journal of Marine and Freshwater Research, 33(2), 293-310.
Schafer, J. L. (1997). Analysis of incomplete multivariate data, Chapman and Hall, London.
Schafer, J. L., and Olsen, M. K. (1998). "Multiple imputation for multivariate missing-data problems: A data analyst's perspective." Multivariate Behavioral Research, 33(4), 545-571.
Schlosser, I. J. (1982). "Fish community structure and function along two habitat gradients in a headwater stream." Ecological Monographs, 52(4), 395-414.
Schueler, T. (1994). "The importance of imperviousness." Watershed Protection Techniques, 1, 100-111.
Shields, F. D., Langendoen, E. J., and Doyle, M. W. (2006). "Adapting existing models to examine effects of agricultural conservation programs on stream habitat quality." Journal of the American Water Resources Association, 42(1), 25-33.
Southerland, M. T., Rogers, G. M., Kline, M. J., Morgan, R. P., Boward, D. M., Kazyak, P. F., Klauda, R. J., and Stranko, S. A. (2005). "New biological indicators to better assess the condition of Maryland streams." Maryland Department of Natural Resources. Monitoring and non-tidal assessment division. DNR-12-0305-0100, Annapolis, MD.
Stewart, J. S., Wang, L. Z., Lyons, J., Horwatich, J. A., and Bannerman, R. (2001). "Influences of watershed, riparian-corridor, and reach-scale characteristics on aquatic biota in agricultural watersheds." Journal of the American Water Resources Association, 37(6), 1475-1487.
Stribling, J. B., Kessup, K. J., and White, J. S. (1998). "Development of a benthic index of biotic integrity for Maryland streams." Maryland Department of Natural Resources. Monitoring and non-tidal assessment division. CBWP-EA-98-3, Annapolis, MD.
Sugihara, G. (1980). "Minimal community structure: an explanation of species abundance patterns." Am. Nat., 116, 770-787.
Sugihara, G. (1983). "Niche hierarchy: structure, organization and assembly in natural communities," Princeton University, Princeton, N.J.
Tran, L. T., Knight, C. G., O'Neill, R. V., Smith, E. R., and O'Connell, M. (2003). "Self-organizing maps for integrated environmental." Environmental Management, 31(6), 822-835.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001). "Missing value estimation methods for DNA microarrays." Bioinformatics, 17(6), 520-525.
USACE. (2005). "National Inventory of Dams." Available at: http://crunch.tec.army.mil/nidpublic/webpages/nid.cfm.
USGS. (2008a). "National Hydrography Dataset." U.S. Department of the Interior. Available at: http://nhd.usgs.gov/index.html.
USGS. (2008b). "National Land Cover Database." Multi-resolution Land Characteristics Consortium U.S. Department of the Interior. Available at: http://www.mrlc.gov/index.php.
Volstad, J. H., Roth, N. E., Mercurio, G., Southerland, M. T., and Strebel, D. E. (2003). "Using environmental stressor information to predict the ecological status of Maryland non-tidal streams as measured by biological indicators." Environmental Monitoring and Assessment, 84(3), 219-242.
Wang, L., Lyons, J., Kanehl, P., and Bannerman, R. (2001). "Impacts of Urbanization on stream habitat and fish across multiple spatial scales." Environmental Management, 28(2), 255-266.
Wang, L. Z., Lyons, J., Kanehl, P., Bannerman, R., and Emmons, E. (2000). "Watershed urbanization and changes in fish communities in southeastern Wisconsin streams." Journal of the American Water Resources Association, 36(5), 1173-1189.
Wang, S. H. (2003). "Application of self-organising maps for data mining with incomplete data sets." Neural Computing & Applications, 12(1), 42-48.
Wright, J. F., Armitage, P. D., Furse, M. T., and Moss, D. (1988). "A new approach to the biological surveillance of river quality using macroinvertebrates." Verh. International Verein. Limnol., 23, 1548-1552.
Yuan, L. L., and Norton, S. B. (2004). "Assessing the relative severity of stressors at a watershed scale." Environmental Monitoring and Assessment, 98(1-3), 323-349.
Appendices
Appendix I: group statistics
APPENDIX I: GROUP STATISTICS
GROUP STATISTICS IN OHIO USING INSTREAM DATA
GROUPS AFTER ENVIRONMENTAL GRADIENTS
GROUP  # OBS  STAT  DO  TKN  SO4  EMBED  RIP  GRAD  DA  IBI
G1  87  μ  8.57  1.37  138.90  3.59  3.53  5.29  32.71  27.84
G1  87  σ  2.86  1.85  112.12  0.61  1.89  1.83  37.82  8.41
G2  111  μ  7.98  0.54  97.32  2.31  5.95  9.06  65.78  38.67
G2  111  σ  1.75  0.42  60.04  0.49  1.65  1.40  87.13  7.75
G3  35  μ  6.75  2.43  266.34  3.80  3.74  7.83  15.89  24.00
G3  35  σ  3.00  3.74  180.41  0.39  1.07  1.56  19.74  7.08
G4  71  μ  6.64  0.72  65.28  3.67  6.15  6.00  355.40  31.04
G4  71  σ  1.73  0.45  36.22  0.41  2.08  0.89  1227.50  8.48
G5  56  μ  8.46  0.73  58.27  2.28  6.09  9.36  920.11  42.82
G5  56  σ  1.88  0.95  45.65  0.41  1.21  0.94  1548.52  5.81
G6  69  μ  7.71  0.89  190.25  3.46  5.83  9.07  105.98  28.43
G6  69  σ  1.66  0.61  141.95  0.54  2.00  1.06  89.79  7.21
GROUPS AFTER TOTAL ZINC CONCENTRATION
GROUP  # OBS  STAT  DO  TKN  SO4  EMBED  RIP  GRAD  DA  ZN  IBI
G11  3  μ  6.00  3.21  114.33  3.17  3.33  6.00  3.83  178.67  18.00
G11  3  σ  2.72  4.17  33.20  0.76  1.53  3.46  2.63  43.00  10.39
G12  84  μ  8.66  1.30  139.78  3.60  3.54  5.26  33.74  21.26  28.19
G12  84  σ  2.84  1.74  113.91  0.60  1.90  1.78  38.09  17.91  8.19
G2  111  μ  7.98  0.54  97.32  2.31  5.95  9.06  65.78  15.74  38.67
G2  111  σ  1.75  0.42  60.04  0.49  1.65  1.40  87.13  18.62  7.75
G31  30  μ  6.79  2.40  257.23  3.88  3.68  7.60  17.21  15.80  22.67
G31  30  σ  3.02  3.94  178.08  0.31  1.13  1.52  21.01  6.79  5.23
G32  4  μ  5.78  3.03  319.25  3.38  4.13  9.00  8.93  44.75  37.00
G32  4  σ  3.16  2.64  237.68  0.48  0.75  1.15  4.70  9.71  3.46
G4  71  μ  6.64  0.72  65.28  3.67  6.15  6.00  355.40  13.96  31.04
G4  71  σ  1.73  0.45  36.22  0.41  2.08  0.89  1227.50  5.88  8.48
G5  56  μ  8.46  0.73  58.27  2.28  6.09  9.36  920.11  17.73  42.82
G5  56  σ  1.88  0.95  45.65  0.41  1.21  0.94  1548.52  21.69  5.81
G6  69  μ  7.71  0.89  190.25  3.46  5.83  9.07  105.98  30.19  28.43
G6  69  σ  1.66  0.61  141.95  0.54  2.00  1.06  89.79  66.65  7.21
GROUPS AFTER pH
GROUP  # OBS  STAT  DO  TKN  SO4  EMBED  RIP  GRAD  DA  ZN  PH  IBI
G11  3  μ  6.00  3.21  114.33  3.17  3.33  6.00  3.83  178.67  7.00  18.00
G11  3  σ  2.72  4.17  33.20  0.76  1.53  3.46  2.63  43.00  0.25  10.39
G121  74  μ  8.07  1.14  139.52  3.55  3.61  5.24  27.04  20.70  7.82  28.92
G121  74  σ  2.05  1.75  116.54  0.62  2.01  1.83  28.67  16.61  0.30  8.32
G122  10  μ  13.04  2.48  141.70  3.95  3.00  5.40  83.30  25.40  8.84  22.80
G122  10  σ  4.00  1.10  97.50  0.16  0.47  1.35  60.13  26.44  0.42  4.44
G2  111  μ  7.98  0.54  97.32  2.31  5.95  9.06  65.78  15.74  7.93  38.67
G2  111  σ  1.75  0.42  60.04  0.49  1.65  1.40  87.13  18.62  0.30  7.75
G311  25  μ  5.98  1.95  260.36  3.90  3.72  7.44  16.27  16.36  7.62  23.52
G311  25  σ  2.23  2.74  190.34  0.29  1.16  1.58  19.34  6.81  0.21  5.11
G312  5  μ  10.86  4.68  241.60  3.80  3.50  8.40  21.90  13.00  8.30  18.40
G312  5  σ  3.38  7.74  110.32  0.45  1.00  0.89  30.38  6.71  0.24  3.85
G32  4  μ  5.78  3.03  319.25  3.38  4.13  9.00  8.93  44.75  7.63  37.00
G32  4  σ  3.16  2.64  237.68  0.48  0.75  1.15  4.70  9.71  0.31  3.46
G4  71  μ  6.64  0.72  65.28  3.67  6.15  6.00  355.40  13.96  7.81  31.04
G4  71  σ  1.73  0.45  36.22  0.41  2.08  0.89  1227.50  5.88  0.23  8.48
G5  56  μ  8.46  0.73  58.27  2.28  6.09  9.36  920.11  17.73  7.94  42.82
G5  56  σ  1.88  0.95  45.65  0.41  1.21  0.94  1548.52  21.69  0.21  5.81
G6  69  μ  7.71  0.89  190.25  3.46  5.83  9.07  105.98  30.19  7.72  28.43
G6  69  σ  1.66  0.61  141.95  0.54  2.00  1.06  89.79  66.65  0.62  7.21
GROUPS AFTER NITRATE CONCENTRATION
GROUP  # OBS  STAT  DO  TKN  SO4  EMBED  RIP  GRAD  DA  ZN  PH  NO3  IBI
G11  3  μ  6.00  3.21  114.33  3.17  3.33  6.00  3.83  178.67  7.00  18.30  18.00
G11  3  σ  2.72  4.17  33.20  0.76  1.53  3.46  2.63  43.00  0.25  8.92  10.39
G121  74  μ  8.07  1.14  139.52  3.55  3.61  5.24  27.04  20.70  7.82  2.56  28.92
G121  74  σ  2.05  1.75  116.54  0.62  2.01  1.83  28.67  16.61  0.30  3.92  8.32
G122  10  μ  13.04  2.48  141.70  3.95  3.00  5.40  83.30  25.40  8.84  0.29  22.80
G122  10  σ  4.00  1.10  97.50  0.16  0.47  1.35  60.13  26.44  0.42  0.23  4.44
G2  111  μ  7.98  0.54  97.32  2.31  5.95  9.06  65.78  15.74  7.93  2.12  38.67
G2  111  σ  1.75  0.42  60.04  0.49  1.65  1.40  87.13  18.62  0.30  3.36  7.75
G311  25  μ  5.98  1.95  260.36  3.90  3.72  7.44  16.27  16.36  7.62  1.30  23.52
G311  25  σ  2.23  2.74  190.34  0.29  1.16  1.58  19.34  6.81  0.21  1.83  5.11
G312  5  μ  10.86  4.68  241.60  3.80  3.50  8.40  21.90  13.00  8.30  0.84  18.40
G312  5  σ  3.38  7.74  110.32  0.45  1.00  0.89  30.38  6.71  0.24  0.99  3.85
G32  4  μ  5.78  3.03  319.25  3.38  4.13  9.00  8.93  44.75  7.63  2.33  37.00
G32  4  σ  3.16  2.64  237.68  0.48  0.75  1.15  4.70  9.71  0.31  3.29  3.46
G4  71  μ  6.64  0.72  65.28  3.67  6.15  6.00  355.40  13.96  7.81  1.70  31.04
G4  71  σ  1.73  0.45  36.22  0.41  2.08  0.89  1227.50  5.88  0.23  1.77  8.48
G5  56  μ  8.46  0.73  58.27  2.28  6.09  9.36  920.11  17.73  7.94  1.93  42.82
G5  56  σ  1.88  0.95  45.65  0.41  1.21  0.94  1548.52  21.69  0.21  1.89  5.81
G61  32  μ  8.03  0.98  225.66  3.34  6.13  9.50  105.27  20.31  7.88  8.33  25.63
G61  32  σ  1.71  0.43  129.65  0.47  1.63  0.88  61.88  19.45  0.40  10.91  6.84
G62  37  μ  7.43  0.81  159.62  3.57  5.58  8.70  106.59  38.73  7.59  0.68  30.86
G62  37  σ  1.58  0.73  146.66  0.58  2.26  1.08  109.23  88.90  0.74  0.60  6.69
GROUP STATISTICS IN OHIO USING OFFSTREAM DATA
GROUPS AFTER ENVIRONMENTAL GRADIENTS
GROUP  # OBS  STAT  R30_FOREST  R100_AGRI  R30_BARREN  IBI
G1  17  μ  39.50  60.29  0.24  41.06
G1  17  σ  10.53  7.65  0.23  5.44
G2  412  μ  24.88  60.94  0.01  32.57
G2  412  σ  18.90  22.98  0.03  9.74
GROUPS AFTER R30_NONFOREST
GROUP  # OBS  STAT  R30_FOREST  R100_AGRI  R30_BARREN  R30_NONFOR  IBI
G1  17  μ  39.50  60.29  0.24  0.39  41.06
G1  17  σ  10.53  7.65  0.23  0.47  5.44
G21  304  μ  27.05  59.28  0.01  0.77  33.97
G21  304  σ  20.28  24.11  0.03  0.65  9.34
G22  108  μ  18.78  65.60  0.01  3.63  28.63
G22  108  σ  12.54  18.76  0.01  1.99  9.81
GROUPS AFTER SITE_CON
GROUP  # OBS  STAT  R30_FOREST  R100_AGRI  R30_BARREN  R30_NONFOR  SITE_CON  IBI
G1  17  μ  39.50  0.24  60.29  0.39  34.24  41.06
G1  17  σ  10.53  0.23  7.65  0.47  34.67  5.44
G211  46  μ  40.62  0.00  59.66  0.37  87.71  40.09
G211  46  σ  12.22  0.01  11.37  0.44  12.93  7.83
G212  258  μ  24.63  0.01  59.21  0.85  14.89  32.88
G212  258  σ  20.49  0.03  25.74  0.66  13.65  9.18
G221  4  μ  42.43  0.02  59.28  4.41  59.67  48.50
G221  4  σ  0.49  0.00  0.38  1.29  0.33  1.91
G222  60  μ  21.52  0.01  66.33  3.06  34.60  30.50
G222  60  σ  10.93  0.01  12.24  0.59  7.77  10.87
G223  48  μ  15.36  0.00  64.70  26.41  3.43  26.29
G223  48  σ  13.65  0.01  24.73  14.00  2.39  7.80
GROUPS AFTER RDA_URBAN
GROUP  # OBS  STAT  R30_FOREST  R100_AGRI  R30_BARREN  R30_NONFOR  SITE_CON  RDA_URBAN  IBI
G1  17  μ  39.50  0.24  60.29  0.39  34.24  8.65  41.06
G1  17  σ  10.53  0.23  7.65  0.47  34.67  4.90  5.44
G211  46  μ  40.62  0.00  59.66  0.37  87.71  11.01  40.09
G211  46  σ  12.22  0.01  11.37  0.44  12.93  10.29  7.83
G212  258  μ  24.63  0.01  59.21  0.85  14.89  18.70  32.88
G212  258  σ  20.49  0.03  25.74  0.66  13.65  20.40  9.18
G221  4  μ  42.43  0.02  59.28  4.41  59.67  6.08  48.50
G221  4  σ  0.49  0.00  0.38  1.29  0.33  0.15  1.91
G2221  26  μ  21.78  0.01  71.47  3.15  35.16  7.54  38.08
G2221  26  σ  14.80  0.01  13.87  0.67  11.73  1.77  9.98
G2222  34  μ  21.33  0.01  62.40  19.41  34.17  22.61  24.71
G2222  34  σ  6.91  0.01  9.24  9.96  1.84  8.56  7.47
G2231  33  μ  11.71  0.01  79.09  18.91  2.22  7.68  24.30
G2231  33  σ  10.69  0.01  10.89  9.67  0.47  2.72  7.88
G2232  15  μ  23.39  0.00  33.04  9.91  6.10  47.47  30.67
G2232  15  σ  16.23  0.00  14.84  4.47  2.77  20.31  5.69
GROUPS AFTER L100_FOREST
GROUP  # OBS  STAT  R30_FOREST  R100_AGRI  R30_BARREN  R30_NONFOR  SITE_CON  RDA_URBAN  L100_FOREST  IBI
G1  17  μ  39.50  0.24  60.29  0.39  34.24  8.65  49.71  41.06
G1  17  σ  10.53  0.23  7.65  0.47  34.67  4.90  16.73  5.44
G2111  25  μ  38.42  0.01  61.36  0.47  87.39  10.71  34.61  37.60
G2111  25  σ  14.16  0.01  13.42  0.55  12.62  10.50  10.61  7.92
G2112  21  μ  43.23  0.00  57.62  0.24  88.10  11.37  67.42  43.05
G2112  21  σ  9.06  0.00  8.19  0.18  13.59  10.29  8.97  6.77
G2121  28  μ  51.75  0.00  34.02  0.80  14.48  25.40  77.50  39.43
G2121  28  σ  17.48  0.00  23.77  0.62  9.71  23.61  6.73  8.18
G2122  105  μ  34.80  0.01  49.57  0.91  12.29  22.08  41.74  35.81
G2122  105  σ  16.70  0.03  22.41  0.61  10.33  21.24  11.91  9.71
G2123  125  μ  10.01  0.01  72.95  0.81  17.17  14.36  7.82  28.96
G2123  125  σ  10.04  0.04  20.42  0.70  16.29  18.00  6.71  6.95
G221  4  μ  42.43  0.02  59.28  4.41  59.67  6.08  43.56  48.50
G221  4  σ  0.49  0.00  0.38  1.29  0.33  0.15  4.93  1.91
G2221  26  μ  21.78  0.01  71.47  3.15  35.16  7.54  25.05  38.08
G2221  26  σ  14.80  0.01  13.87  0.67  11.73  1.77  19.78  9.98
G2222  34  μ  21.33  0.01  62.40  19.41  34.17  22.61  14.77  24.71
G2222  34  σ  6.91  0.01  9.24  9.96  1.84  8.56  12.74  7.47
G2231  33  μ  11.71  0.01  79.09  18.91  2.22  7.68  9.79  24.30
G2231  33  σ  10.69  0.01  10.89  9.67  0.47  2.72  11.02  7.88
G2232  15  μ  23.39  0.00  33.04  9.91  6.10  47.47  14.58  30.67
G2232  15  σ  16.23  0.00  14.84  4.47  2.77  20.31  18.94  5.69
GROUPS AFTER L30_AGRI
GROUP  # OBSERV  STAT  R30_FOREST  R100_AGRI  R30_BARREN  R30_NONFOREST  SITE_CON  RDA_URBAN  L100_FOREST  L30_AGRI  IBI
G1  17  μ  39.50  0.24  60.29  0.39  34.24  8.65  49.71  21.66  41.06
G1  17  σ  10.53  0.23  7.65  0.47  34.67  4.90  16.73  17.63  5.44
G2111  25  μ  38.42  0.01  61.36  0.47  87.39  10.71  34.61  35.66  37.60
G2111  25  σ  14.16  0.01  13.42  0.55  12.62  10.50  10.61  20.81  7.92
G2112  21  μ  43.23  0.00  57.62  0.24  88.10  11.37  67.42  15.13  43.05
G2112  21  σ  9.06  0.00  8.19  0.18  13.59  10.29  8.97  9.54  6.77
G2121  28  μ  51.75  0.00  34.02  0.80  14.48  25.40  77.50  11.22  39.43
G2121  28  σ  17.48  0.00  23.77  0.62  9.71  23.61  6.73  10.44  8.18
G2122  105  μ  34.80  0.01  49.57  0.91  12.29  22.08  41.74  24.93  35.81
G2122  105  σ  16.70  0.03  22.41  0.61  10.33  21.24  11.91  18.33  9.71
G2123  125  μ  10.01  0.01  72.95  0.81  17.17  14.36  7.82  61.34  28.96
G2123  125  σ  10.04  0.04  20.42  0.70  16.29  18.00  6.71  29.34  6.95
G221  4  μ  42.43  0.02  59.28  4.41  59.67  6.08  43.56  35.67  48.50
G221  4  σ  0.49  0.00  0.38  1.29  0.33  0.15  4.93  16.67  1.91
G2221  26  μ  21.78  0.01  71.47  3.15  35.16  7.54  25.05  49.25  38.08
G2221  26  σ  14.80  0.01  13.87  0.67  11.73  1.77  19.78  27.47  9.98
G22221  18  μ  22.35  0.02  63.80  11.42  33.87  19.42  19.55  0.18  22.11
G22221  18  σ  9.26  0.00  8.84  5.32  2.52  3.58  13.82  0.76  6.34
G22222  16  μ  20.17  0.01  60.82  10.41  34.50  26.21  9.39  61.23  27.63
G22222  16  σ  2.24  0.01  9.70  4.76  0.06  10.99  9.07  22.79  7.74
G22311  26  μ  8.87  0.01  81.72  15.41  2.13  7.48  6.39  81.08  25.85
G22311  26  σ  7.00  0.01  7.55  7.65  0.38  2.82  6.88  10.03  7.20
G22312  7  μ  22.24  0.01  69.33  5.91  2.59  8.45  22.45  46.28  18.57
G22312  7  σ  15.55  0.01  15.91  2.16  0.62  2.34  14.59  7.49  8.14
G2232  15  μ  23.39  0.00  33.04  9.91  6.10  47.47  14.58  14.57  30.67
G2232  15  σ  16.23  0.00  14.84  4.47  2.77  20.31  18.94  22.35  5.69
GROUPS AFTER L30_NONFOREST
GROUP  # OBSERV  STAT  R30_FOREST  R100_AGRI  R30_BARREN  R30_NONFOREST  SITE_CON  RDA_URBAN  L100_FOREST  L30_AGRI  L30_NONFOR  IBI
G1  17  μ  39.50  0.24  60.29  0.39  34.24  8.65  49.71  21.66  1.26  41.06
G1  17  σ  10.53  0.23  7.65  0.47  34.67  4.90  16.73  17.63  2.36  5.44
G2111  25  μ  38.42  0.01  61.36  0.47  87.39  10.71  34.61  35.66  0.73  37.60
G2111  25  σ  14.16  0.01  13.42  0.55  12.62  10.50  10.61  20.81  1.43  7.92
G2112  21  μ  43.23  0.00  57.62  0.24  88.10  11.37  67.42  15.13  0.11  43.05
G2112  21  σ  9.06  0.00  8.19  0.18  13.59  10.29  8.97  9.54  0.42  6.77
G2121  28  μ  51.75  0.00  34.02  0.80  14.48  25.40  77.50  11.22  1.01  39.43
G2121  28  σ  17.48  0.00  23.77  0.62  9.71  23.61  6.73  10.44  1.36  8.18
G21221  98  μ  35.94  0.01  47.95  0.88  12.30  23.03  42.09  24.44  0.45  36.55
G21221  98  σ  16.06  0.03  21.94  0.61  10.48  21.67  11.68  18.52  0.67  9.44
G21222  7  μ  18.84  0.01  72.38  1.29  12.09  8.76  36.77  31.67  7.70  25.43
G21222  7  σ  18.54  0.03  16.40  0.33  8.61  2.45  14.89  14.97  4.58  7.81
G2123  125  μ  10.01  0.01  72.95  0.81  17.17  14.36  7.82  61.34  1.35  28.96
G2123  125  σ  10.04  0.04  20.42  0.70  16.29  18.00  6.71  29.34  2.65  6.95
G221  4  μ  42.43  0.02  59.28  4.41  59.67  6.08  43.56  35.67  0.91  48.50
G221  4  σ  0.49  0.00  0.38  1.29  0.33  0.15  4.93  16.67  1.19  1.91
G2221  26  μ  21.78  0.01  71.47  3.15  35.16  7.54  25.05  49.25  3.01  38.08
G2221  26  σ  14.80  0.01  13.87  0.67  11.73  1.77  19.78  27.47  4.27  9.98
G22221  18  μ  22.35  0.02  63.80  11.42  33.87  19.42  19.55  0.18  0.00  22.11
G22221  18  σ  9.26  0.00  8.84  5.32  2.52  3.58  13.82  0.76  0.00  6.34
G22222  16  μ  20.17  0.01  60.82  10.41  34.50  26.21  9.39  61.23  4.80  27.63
G22222  16  σ  2.24  0.01  9.70  4.76  0.06  10.99  9.07  22.79  4.93  7.74
G22311  26  μ  8.87  0.01  81.72  15.41  2.13  7.48  6.39  81.08  3.52  25.85
G22311  26  σ  7.00  0.01  7.55  7.65  0.38  2.82  6.88  10.03  3.20  7.20
G22312  7  μ  22.24  0.01  69.33  5.91  2.59  8.45  22.45  46.28  8.04  18.57
G22312  7  σ  15.55  0.01  15.91  2.16  0.62  2.34  14.59  7.49  4.25  8.14
G2232  15  μ  23.39  0.00  33.04  9.91  6.10  47.47  14.58  14.57  5.81  30.67
G2232  15  σ  16.23  0.00  14.84  4.47  2.77  20.31  18.94  22.35  8.65  5.69
GROUPS AFTER L100_URBAN
GROUP  # OBS  STAT  R30_FOREST  R100_AGRI  R30_BARREN  R30_NONFOREST  SITE_CON  RDA_URBAN  L100_FOREST  L30_AGRI  L30_NONFOR  L100_URBAN  IBI
G1  17  μ  39.50  0.24  60.29  0.39  34.24  8.65  49.71  21.66  1.26  12.00  41.06
G1  17  σ  10.53  0.23  7.65  0.47  34.67  4.90  16.73  17.63  2.36  9.83  5.44
G21111  17  μ  35.97  0.00  65.69  0.43  92.44  9.05  35.77  38.88  0.66  6.42  35.18
G21111  17  σ  15.51  0.01  12.40  0.60  3.05  11.39  11.71  22.64  1.46  4.14  6.33
G21112  8  μ  43.62  0.01  52.17  0.57  76.65  14.23  32.13  28.83  0.89  26.57  42.75
G21112  8  σ  9.60  0.02  11.14  0.45  18.19  7.78  7.91  15.32  1.44  14.67  8.88
G2112  21  μ  43.23  0.00  57.62  0.24  88.10  11.37  67.42  15.13  0.11  7.61  43.05
G2112  21  σ  9.06  0.00  8.19  0.18  13.59  10.29  8.97  9.54  0.42  4.54  6.77
G2121  28  μ  51.75  0.00  34.02  0.80  14.48  25.40  77.50  11.22  1.01  8.24  39.43
G2121  28  σ  17.48  0.00  23.77  0.62  9.71  23.61  6.73  10.44  1.36  5.96  8.18
G21221  98  μ  35.94  0.01  47.95  0.88  12.30  23.03  42.09  24.44  0.45  22.80  36.55
G21221  98  σ  16.06  0.03  21.94  0.61  10.48  21.67  11.68  18.52  0.67  19.70  9.44
G21222  7  μ  18.84  0.01  72.38  1.29  12.09  8.76  36.77  31.67  7.70  13.97  25.43
G21222  7  σ  18.54  0.03  16.40  0.33  8.61  2.45  14.89  14.97  4.58  4.57  7.81
G2123  125  μ  10.01  0.01  72.95  0.81  17.17  14.36  7.82  61.34  1.35  22.62  28.96
G2123  125  σ  10.04  0.04  20.42  0.70  16.29  18.00  6.71  29.34  2.65  25.81  6.95
G221  4  μ  42.43  0.02  59.28  4.41  59.67  6.08  43.56  35.67  0.91  13.61  48.50
G221  4  σ  0.49  0.00  0.38  1.29  0.33  0.15  4.93  16.67  1.19  8.88  1.91
G2221  26  μ  21.78  0.01  71.47  3.15  35.16  7.54  25.05  49.25  3.01  10.91  38.08
G2221  26  σ  14.80  0.01  13.87  0.67  11.73  1.77  19.78  27.47  4.27  9.87  9.98
G22221  18  μ  22.35  0.02  63.80  11.42  33.87  19.42  19.55  0.18  0.00  75.47  22.11
G22221  18  σ  9.26  0.00  8.84  5.32  2.52  3.58  13.82  0.76  0.00  16.58  6.34
G22222  16  μ  20.17  0.01  60.82  10.41  34.50  26.21  9.39  61.23  4.80  14.24  27.63
G22222  16  σ  2.24  0.01  9.70  4.76  0.06  10.99  9.07  22.79  4.93  11.69  7.74
G22311  26  μ  8.87  0.01  81.72  15.41  2.13  7.48  6.39  81.08  3.52  8.76  25.85
G22311  26  σ  7.00  0.01  7.55  7.65  0.38  2.82  6.88  10.03  3.20  4.67  7.20
G22312  7  μ  22.24  0.01  69.33  5.91  2.59  8.45  22.45  46.28  8.04  20.63  18.57
G22312  7  σ  15.55  0.01  15.91  2.16  0.62  2.34  14.59  7.49  4.25  7.20  8.14
G2232  15  μ  23.39  0.00  33.04  9.91  6.10  47.47  14.58  14.57  5.81  63.18  30.67
G2232  15  σ  16.23  0.00  14.84  4.47  2.77  20.31  18.94  22.35  8.65  30.09  5.69
GROUP STATISTICS IN COASTAL MARYLAND
GROUPS AFTER ENVIRONMENTAL GRADIENTS
GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  IBI
G1  103  μ  9.68  3.32  5.34  3.05
G1  103  σ  4.48  3.08  4.05  1.00
G2  122  μ  13.94  5.99  13.17  3.84
G2  122  σ  3.12  3.41  3.12  0.66
GROUPS AFTER pH
GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  PH  IBI
G11  94  μ  9.94  3.29  5.62  6.88  3.16
G11  94  σ  4.35  3.03  4.03  0.42  0.94
G12  9  μ  7.00  3.73  2.44  5.41  1.89
G12  9  σ  5.20  3.67  3.09  0.57  0.88
G2  122  μ  13.94  5.99  13.17  6.86  3.84
G2  122  σ  3.12  3.41  3.12  0.47  0.66
GROUPS AFTER FORWET
GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  PH  FORWET  IBI
G111  52  μ  9.17  3.03  6.13  6.80  66.38  2.98
G111  52  σ  4.37  2.65  4.28  0.46  13.34  0.96
G112  42  μ  10.88  3.61  4.98  6.97  30.66  3.39
G112  42  σ  4.19  3.46  3.65  0.35  10.40  0.87
G12  9  μ  7.00  3.73  2.44  5.41  73.11  1.89
G12  9  σ  5.20  3.67  3.09  0.57  10.11  0.88
G21  105  μ  13.84  6.31  13.04  6.92  41.81  3.79
G21  105  σ  3.26  3.51  3.17  0.38  9.98  0.67
G22  17  μ  14.59  4.01  14.00  6.50  72.95  4.15
G22  17  σ  1.91  1.77  2.78  0.73  6.12  0.56
GROUPS AFTER INSTRHAB
GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  PH  FORWET  INSTRHAB  IBI
G111  52  μ  9.17  3.03  6.13  6.80  66.38  54.29  2.98
G111  52  σ  4.37  2.65  4.28  0.46  13.34  24.71  0.96
G112  42  μ  10.88  3.61  4.98  6.97  30.66  55.08  3.39
G112  42  σ  4.19  3.46  3.65  0.35  10.40  22.17  0.87
G121  6  μ  4.33  4.50  2.67  5.44  69.97  36.13  1.46
G121  6  σ  2.66  4.39  3.20  0.49  10.96  11.88  0.53
G122  3  μ  12.33  2.20  2.00  5.36  79.40  85.50  2.75
G122  3  σ  5.13  0.82  3.46  0.83  4.48  14.38  0.87
G211  63  μ  15.06  5.56  13.81  6.91  40.14  78.49  3.91
G211  63  σ  2.30  3.22  3.03  0.36  10.13  11.79  0.62
G212  42  μ  12.00  7.44  11.88  6.94  44.32  39.23  3.62
G212  42  σ  3.64  3.65  3.04  0.42  9.31  12.50  0.71
G22  17  μ  14.59  4.01  14.00  6.50  72.95  71.08  4.15
G22  17  σ  1.91  1.77  2.78  0.73  6.12  21.76  0.56
GROUPS AFTER AESTHET
GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  PH  FORWET  INSTRHAB  AESTHET  IBI
G111  52  μ  9.17  3.03  6.13  6.80  66.38  54.29  14.08  2.98
G111  52  σ  4.37  2.65  4.28  0.46  13.34  24.71  4.55  0.96
G112  42  μ  10.88  3.61  4.98  6.97  30.66  55.08  11.88  3.39
G112  42  σ  4.19  3.46  3.65  0.35  10.40  22.17  5.02  0.87
G121  6  μ  4.33  4.50  2.67  5.44  69.97  36.13  15.17  1.46
G121  6  σ  2.66  4.39  3.20  0.49  10.96  11.88  3.06  0.53
G122  3  μ  12.33  2.20  2.00  5.36  79.40  85.50  11.00  2.75
G122  3  σ  5.13  0.82  3.46  0.83  4.48  14.38  3.61  0.87
G211  63  μ  15.06  5.56  13.81  6.91  40.14  78.49  14.76  3.91
G211  63  σ  2.30  3.22  3.03  0.36  10.13  11.79  3.88  0.62
G2121  21  μ  12.76  6.66  11.24  7.02  43.88  40.08  9.67  3.39
G2121  21  σ  3.48  3.79  3.08  0.46  8.73  12.06  3.60  0.66
G2122  21  μ  11.24  8.21  12.52  6.87  44.75  38.37  16.43  3.85
G2122  21  σ  3.73  3.43  2.93  0.37  10.05  13.16  1.29  0.69
G221  3  μ  14.33  4.85  15.33  6.32  72.93  72.89  6.00  3.50
G221  3  σ  3.06  4.03  2.08  0.31  2.56  17.09  1.00  0.50
G222  14  μ  14.64  3.83  13.71  6.54  72.95  70.70  15.57  4.29
G222  14  σ  1.74  1.07  2.89  0.79  6.72  23.18  2.85  0.48
GROUPS AFTER COND
GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  PH  FORWET  INSTRHAB  AESTHET  COND  IBI
G111  52  μ  9.17  3.03  6.13  6.80  66.38  54.29  14.08  165.13  2.98
G111  52  σ  4.37  2.65  4.28  0.46  13.34  24.71  4.55  82.47  0.96
G112  42  μ  10.88  3.61  4.98  6.97  30.66  55.08  11.88  204.79  3.39
G112  42  σ  4.19  3.46  3.65  0.35  10.40  22.17  5.02  114.84  0.87
G121  6  μ  4.33  4.50  2.67  5.44  69.97  36.13  15.17  90.00  1.46
G121  6  σ  2.66  4.39  3.20  0.49  10.96  11.88  3.06  23.89  0.53
G122  3  μ  12.33  2.20  2.00  5.36  79.40  85.50  11.00  112.67  2.75
G122  3  σ  5.13  0.82  3.46  0.83  4.48  14.38  3.61  76.56  0.87
G211  63  μ  15.06  5.56  13.81  6.91  40.14  78.49  14.76  192.08  3.91
G211  63  σ  2.30  3.22  3.03  0.36  10.13  11.79  3.88  111.78  0.62
G2121  21  μ  12.76  6.66  11.24  7.02  43.88  40.08  9.67  199.81  3.39
G2121  21  σ  3.48  3.79  3.08  0.46  8.73  12.06  3.60  98.32  0.66
G21221  12  μ  12.00  7.48  12.75  7.00  39.14  43.45  16.92  172.08  4.19
G21221  12  σ  3.93  3.93  2.63  0.38  8.38  6.99  1.08  26.98  0.65
G21222  9  μ  10.22  9.19  12.22  6.69  52.22  31.61  15.78  102.67  3.39
G21222  9  σ  3.38  2.50  3.42  0.28  6.77  16.61  1.30  11.94  0.45
G221  3  μ  14.33  4.85  15.33  6.32  72.93  72.89  6.00  107.33  3.50
G221  3  σ  3.06  4.03  2.08  0.31  2.56  17.09  1.00  6.11  0.50
G222  14  μ  14.64  3.83  13.71  6.54  72.95  70.70  15.57  80.57  4.29
G222  14  σ  1.74  1.07  2.89  0.79  6.72  23.18  2.85  38.97  0.48
GROUPS AFTER URBAN
GROUP   # OBS    POOLQUAL   AVGWID RIFFQUAL       PH   FORWET INSTRHAB  AESTHET     COND    URBAN      IBI
G111       52  μ     9.17     3.03     6.13     6.80    66.38    54.29    14.08   165.13     4.91     2.98
               σ     4.37     2.65     4.28     0.46    13.34    24.71     4.55    82.47     5.73     0.96
G112       42  μ    10.88     3.61     4.98     6.97    30.66    55.08    11.88   204.79     7.76     3.39
               σ     4.19     3.46     3.65     0.35    10.40    22.17     5.02   114.84    14.02     0.87
G121        6  μ     4.33     4.50     2.67     5.44    69.97    36.13    15.17    90.00     7.57     1.46
               σ     2.66     4.39     3.20     0.49    10.96    11.88     3.06    23.89    10.69     0.53
G122        3  μ    12.33     2.20     2.00     5.36    79.40    85.50    11.00   112.67     0.01     2.75
               σ     5.13     0.82     3.46     0.83     4.48    14.38     3.61    76.56     0.02     0.87
G211       63  μ    15.06     5.56    13.81     6.91    40.14    78.49    14.76   192.08    10.97     3.91
               σ     2.30     3.22     3.03     0.36    10.13    11.79     3.88   111.78    14.89     0.62
G2121      21  μ    12.76     6.66    11.24     7.02    43.88    40.08     9.67   199.81    19.53     3.39
               σ     3.48     3.79     3.08     0.46     8.73    12.06     3.60    98.32    16.48     0.66
G212211     3  μ    17.50     5.53    14.00     6.98    50.39    44.29    17.50   156.00    23.66     3.13
               σ     3.54     2.23     4.24     0.36     5.42     0.82     0.71    15.56     9.11     0.18
G212212    10  μ    10.90     7.87    12.50     7.00    36.89    43.28    16.80   175.30     1.05     4.40
               σ     3.07     4.16     2.46     0.40     6.99     7.72     1.14    28.17     0.62     0.46
G21222      9  μ    10.22     9.19    12.22     6.69    52.22    31.61    15.78   102.67     2.90     3.39
               σ     3.38     2.50     3.42     0.28     6.77    16.61     1.30    11.94     3.51     0.45
G221        3  μ    14.33     4.85    15.33     6.32    72.93    72.89     6.00   107.33    12.58     3.50
               σ     3.06     4.03     2.08     0.31     2.56    17.09     1.00     6.11     4.46     0.50
G222       14  μ    14.64     3.83    13.71     6.54    72.95    70.70    15.57    80.57     5.72     4.29
               σ     1.74     1.07     2.89     0.79     6.72    23.18     2.85    38.97     2.35     0.48
GROUP STATISTICS IN PIEDMONT MARYLAND
GROUPS AFTER ENVIRONMENTAL GRADIENTS
GROUP   # OBS     CH_FLOW CHAN_ALT    URBAN POOLQUAL   REMOTE EMBEDDED      IBI
1          24  μ    82.03    12.36     0.64     9.97    56.25    70.00     3.25
               σ    15.84     3.56     0.82     1.81    33.26    22.58     0.84
2          29  μ    69.52     8.07    59.36    12.52    14.01    71.23     1.80
               σ    24.02     4.71    18.73     3.63    11.41    27.80     0.67
3          48  μ    79.81     6.79     4.96    13.79    48.94    58.56     3.59
               σ    14.60     2.48     6.23     2.63    27.07    22.75     0.71
4         126  μ    86.13    13.44     2.68    15.41    66.32    77.61     3.93
               σ    13.41     2.73     3.05     2.01    28.11    20.16     0.64
5          17  μ    82.71    14.29    36.89    14.76    45.59    71.24     2.71
               σ    14.46     2.08    18.75     3.05    17.65    26.73     0.97
GROUPS AFTER GRAD
GROUP   # OBS     CH_FLOW CHAN_ALT    URBAN POOLQUAL   REMOTE EMBEDDED     GRAD      IBI
1          24  μ    82.03    12.36     0.64     9.97    56.25    70.00     1.17     3.25
               σ    15.84     3.56     0.82     1.81    33.26    22.58     0.52     0.84
21         20  μ    78.30     7.35    54.07    13.10    12.81    70.78     0.74     2.08
               σ    19.39     4.38    18.20     3.95    11.38    26.26     0.43     0.61
22          9  μ    50.00     9.67    71.10    11.22    16.67    72.22     2.63     1.20
               σ    22.50     5.29    14.69     2.54    11.69    32.63     0.64     0.28
3          48  μ    79.81     6.79     4.96    13.79    48.94    58.56     0.96     3.59
               σ    14.60     2.48     6.23     2.63    27.07    22.75     0.68     0.71
4         126  μ    86.13    13.44     2.68    15.41    66.32    77.61     1.25     3.93
               σ    13.41     2.73     3.05     2.01    28.11    20.16     1.24     0.64
5          17  μ    82.71    14.29    36.89    14.76    45.59    71.24     1.12     2.71
               σ    14.46     2.08    18.75     3.05    17.65    26.73     1.02     0.97
GROUPS AFTER FORWET
GROUP   # OBS     CH_FLOW CHAN_ALT    URBAN POOLQUAL   REMOTE EMBEDDED     GRAD   FORWET      IBI
1          24  μ    82.03    12.36     0.64     9.97    56.25    70.00     1.17    29.44     3.25
               σ    15.84     3.56     0.82     1.81    33.26    22.58     0.52    14.72     0.84
211         7  μ    75.43     8.57    35.58    12.86    11.61    81.59     0.43    41.45     2.46
               σ    27.17     4.58    10.26     4.10    11.08    10.86     0.38     5.66     0.65
212        13  μ    79.85     6.69    64.03    13.23    13.46    64.96     0.90    20.44     1.87
               σ    14.79     4.31    12.84     4.02    11.93    30.46     0.36     5.72     0.50
22          9  μ    50.00     9.67    71.10    11.22    16.67    72.22     2.63    18.55     1.20
               σ    22.50     5.29    14.69     2.54    11.69    32.63     0.64     8.83     0.28
31         38  μ    80.82     6.89     4.92    13.95    48.68    58.98     0.90    30.24     3.76
               σ    13.74     2.72     6.60     2.54    25.71    23.90     0.63     9.07     0.64
32          9  μ    75.56     6.33     5.12    13.11    50.00    56.79     1.18    56.40     2.88
               σ    18.10     0.87     4.63     3.06    33.95    18.17     0.86     2.91     0.50
41        116  μ    87.03    13.52     2.59    15.47    66.49    77.02     1.23    28.93     3.99
               σ    12.46     2.76     3.08     1.97    28.59    20.37     1.28     8.94     0.58
42         10  μ    75.70    12.50     3.66    14.80    64.38    84.44     1.55    55.63     3.27
               σ    19.53     2.27     2.62     2.39    22.83    16.93     0.71     3.71     0.89
5          17  μ    82.71    14.29    36.89    14.76    45.59    71.24     1.12    29.82     2.71
               σ    14.46     2.08    18.75     3.05    17.65    26.73     1.02    11.27     0.97
GROUP STATISTICS IN HIGHLAND MARYLAND
GROUPS AFTER ENVIRONMENTAL GRADIENTS
GROUP   # OBS      EPISUB   AVGWID      IBI
1         153  μ    77.05     4.73     3.25
               σ    17.26     2.29     1.12
2         111  μ    24.67     4.30     2.67
               σ    12.82     2.49     1.07
3          32  μ    75.17    12.85     3.79
               σ    21.21     2.94     0.90
GROUPS AFTER NUMROOT
GROUP   # OBS      EPISUB   AVGWID  NUMROOT      IBI
11        132  μ    76.64     4.63     0.56     3.16
               σ    17.58     2.29     0.76     1.12
12         21  μ    79.63     5.37     4.33     3.80
               σ    15.25     2.23     1.62     0.96
21         90  μ    24.88     3.96     0.20     2.51
               σ    12.19     2.35     0.40     1.05
22         21  μ    23.81     5.74     2.76     3.38
               σ    15.53     2.63     1.34     0.87
3          32  μ    75.17    12.85     1.72     3.79
               σ    21.21     2.94     3.01     0.90
GROUPS AFTER GRAD
GROUP   # OBS      EPISUB   AVGWID  NUMROOT     GRAD      IBI
111       115  μ    75.56     4.79     0.61     1.57     3.24
               σ    17.39     2.33     0.78     0.95     1.09
112        17  μ    83.99     3.55     0.24     6.92     2.61
               σ    17.56     1.67     0.56     3.02     1.24
12         21  μ    79.63     5.37     4.33     0.93     3.80
               σ    15.25     2.23     1.62     0.74     0.96
211        79  μ    24.61     3.92     0.23     1.22     2.60
               σ    11.49     2.36     0.42     0.78     1.01
212        11  μ    26.77     4.30     0.00     5.18     1.83
               σ    17.00     2.29     0.00     1.62     1.07
22         21  μ    23.81     5.74     2.76     0.99     3.38
               σ    15.53     2.63     1.34     0.59     0.87
3          32  μ    75.17    12.85     1.72     0.75     3.79
               σ    21.21     2.94     3.01     0.53     0.90
GROUPS AFTER SO4
GROUP   # OBS      EPISUB   AVGWID  NUMROOT     GRAD      SO4      IBI
111       115  μ    75.56     4.79     0.61     1.57    17.22     3.24
               σ    17.39     2.33     0.78     0.95    16.22     1.09
112        17  μ    83.99     3.55     0.24     6.92    17.64     2.61
               σ    17.56     1.67     0.56     3.02    14.32     1.24
12         21  μ    79.63     5.37     4.33     0.93    12.59     3.80
               σ    15.25     2.23     1.62     0.74     7.69     0.96
211        79  μ    24.61     3.92     0.23     1.22    27.02     2.60
               σ    11.49     2.36     0.42     0.78    47.47     1.01
212        11  μ    26.77     4.30     0.00     5.18    95.16     1.83
               σ    17.00     2.29     0.00     1.62   146.69     1.07
22         21  μ    23.81     5.74     2.76     0.99    30.38     3.38
               σ    15.53     2.63     1.34     0.59    61.80     0.87
31          3  μ    53.70    12.43     0.00     1.50   213.42     1.67
               σ     8.49     5.08     0.00     1.00   166.63     0.43
32         29  μ    77.39    12.89     1.90     0.67    11.34     4.01
               σ    20.94     2.77     3.11     0.42     4.11     0.59
GROUPS AFTER WOOD
GROUP   # OBS      EPISUB   AVGWID  NUMROOT     GRAD      SO4     WOOD      IBI
111       115  μ    75.56     4.79     0.61     1.57    17.22     1.28     3.24
               σ    17.39     2.33     0.78     0.95    16.22     1.86     1.09
1121       12  μ    84.72     3.34     0.17     6.82    20.21     0.83     3.09
               σ    14.24     1.30     0.39     3.51    16.37     0.72     1.08
1122        5  μ    82.22     4.04     0.40     7.17    11.49     3.40     1.46
               σ    25.88     2.45     0.89     1.60     4.04     0.55     0.74
12         21  μ    79.63     5.37     4.33     0.93    12.59     5.10     3.80
               σ    15.25     2.23     1.62     0.74     7.69     4.71     0.96
211        79  μ    24.61     3.92     0.23     1.22    27.02     1.91     2.60
               σ    11.49     2.36     0.42     0.78    47.47     3.35     1.01
212        11  μ    26.77     4.30     0.00     5.18    95.16     0.36     1.83
               σ    17.00     2.29     0.00     1.62   146.69     0.67     1.07
22         21  μ    23.81     5.74     2.76     0.99    30.38     2.90     3.38
               σ    15.53     2.63     1.34     0.59    61.80     3.33     0.87
31          3  μ    53.70    12.43     0.00     1.50   213.42     2.00     1.67
               σ     8.49     5.08     0.00     1.00   166.63     2.65     0.43
32         29  μ    77.39    12.89     1.90     0.67    11.34     2.48     4.01
               σ    20.94     2.77     3.11     0.42     4.11     2.86     0.59
GROUPS AFTER AGRIBARR
GROUP   # OBS      EPISUB   AVGWID  NUMROOT     GRAD      SO4     WOOD AGRIBARR      IBI
1111       82  μ    75.84     5.06     0.66     1.36    17.77     1.36    51.10     3.47
               σ    17.90     2.25     0.80     0.90    18.03     1.94    23.29     0.87
1112       32  μ    74.83     4.08     0.47     2.11    15.79     1.06     7.91     2.66
               σ    16.27     2.41     0.72     0.84    10.24     1.66     4.28     1.36
1121       12  μ    84.72     3.34     0.17     6.82    20.21     0.83    11.25     3.09
               σ    14.24     1.30     0.39     3.51    16.37     0.72     7.14     1.08
1122        5  μ    82.22     4.04     0.40     7.17    11.49     3.40     8.68     1.46
               σ    25.88     2.45     0.89     1.60     4.04     0.55    10.83     0.74
12         21  μ    79.63     5.37     4.33     0.93    12.59     5.10    67.10     3.80
               σ    15.25     2.23     1.62     0.74     7.69     4.71    18.14     0.96
211        79  μ    24.61     3.92     0.23     1.22    27.02     1.91    38.88     2.60
               σ    11.49     2.36     0.42     0.78    47.47     3.35    27.43     1.01
212        11  μ    26.77     4.30     0.00     5.18    95.16     0.36    14.39     1.83
               σ    17.00     2.29     0.00     1.62   146.69     0.67     9.43     1.07
22         21  μ    23.81     5.74     2.76     0.99    30.38     2.90    51.21     3.38
               σ    15.53     2.63     1.34     0.59    61.80     3.33    24.84     0.87
31          3  μ    53.70    12.43     0.00     1.50   213.42     2.00    18.80     1.67
               σ     8.49     5.08     0.00     1.00   166.63     2.65     7.36     0.43
32         29  μ    77.39    12.89     1.90     0.67    11.34     2.48    42.30     4.01
               σ    20.94     2.77     3.11     0.42     4.11     2.86    24.93     0.59
GROUPS AFTER CHAN
GROUP   # OBS      EPISUB   AVGWID  NUMROOT     GRAD      SO4     WOOD AGRIBARR     CHAN      IBI
11111      38  μ    80.85     5.12     0.58     1.50    14.37     0.76    49.47    16.63     3.77
               σ    18.21     2.40     0.76     0.88     8.12     1.05    23.13     1.44     0.77
11112      45  μ    71.60     5.02     0.73     1.25    20.63     1.87    52.48     8.78     3.22
               σ    16.68     2.14     0.84     0.92    23.06     2.34    23.60     2.41     0.87
1112       32  μ    74.83     4.08     0.47     2.11    15.79     1.06     7.91    13.53     2.66
               σ    16.27     2.41     0.72     0.84    10.24     1.66     4.28     4.59     1.36
1121       12  μ    84.72     3.34     0.17     6.82    20.21     0.83    11.25    15.33     3.09
               σ    14.24     1.30     0.39     3.51    16.37     0.72     7.14     3.23     1.08
1122        5  μ    82.22     4.04     0.40     7.17    11.49     3.40     8.68    17.80     1.46
               σ    25.88     2.45     0.89     1.60     4.04     0.55    10.83     1.92     0.74
12         21  μ    79.63     5.37     4.33     0.93    12.59     5.10    67.10    10.62     3.80
               σ    15.25     2.23     1.62     0.74     7.69     4.71    18.14     4.57     0.96
2111       46  μ    23.43     3.81     0.26     1.11    29.06     2.70    43.62     5.22     2.39
               σ    10.20     2.46     0.44     0.75    58.92     4.07    26.59     1.81     0.95
2112       33  μ    26.26     4.07     0.18     1.39    24.17     0.82    32.27    15.36     2.91
               σ    13.05     2.26     0.39     0.82    24.41     1.42    27.61     2.93     1.04
212        11  μ    26.77     4.30     0.00     5.18    95.16     0.36    14.39    14.73     1.83
               σ    17.00     2.29     0.00     1.62   146.69     0.67     9.43     3.26     1.07
22         21  μ    23.81     5.74     2.76     0.99    30.38     2.90    51.21    11.38     3.38
               σ    15.53     2.63     1.34     0.59    61.80     3.33    24.84     6.22     0.87
31          3  μ    53.70    12.43     0.00     1.50   213.42     2.00    18.80     9.33     1.67
               σ     8.49     5.08     0.00     1.00   166.63     2.65     7.36     7.64     0.43
32         29  μ    77.39    12.89     1.90     0.67    11.34     2.48    42.30    13.93     4.01
               σ    20.94     2.77     3.11     0.42     4.11     2.86    24.93     4.56     0.59
GROUPS AFTER EMBED
GROUP   # OBS      EPISUB   AVGWID  NUMROOT     GRAD      SO4     WOOD AGRIBARR     CHAN    EMBED      IBI
111111     11  μ    70.20     4.36     0.45     1.30    13.44     0.73    67.13    16.09    37.73     3.18
               σ    17.26     2.80     0.69     1.30     7.18     1.01    17.38     1.14    10.09     1.02
111112     27  μ    85.19     5.43     0.63     1.58    14.75     0.78    42.27    16.85    17.85     4.01
               σ    17.02     2.20     0.79     0.66     8.57     1.09    21.44     1.51     7.18     0.50
11112      45  μ    71.60     5.02     0.73     1.25    20.63     1.87    52.48     8.78    32.78     3.22
               σ    16.68     2.14     0.84     0.92    23.06     2.34    23.60     2.41    15.76     0.87
1112       32  μ    74.83     4.08     0.47     2.11    15.79     1.06     7.91    13.53    29.63     2.66
               σ    16.27     2.41     0.72     0.84    10.24     1.66     4.28     4.59    24.24     1.36
1121       12  μ    84.72     3.34     0.17     6.82    20.21     0.83    11.25    15.33    29.58     3.09
               σ    14.24     1.30     0.39     3.51    16.37     0.72     7.14     3.23    16.85     1.08
1122        5  μ    82.22     4.04     0.40     7.17    11.49     3.40     8.68    17.80    33.00     1.46
               σ    25.88     2.45     0.89     1.60     4.04     0.55    10.83     1.92    27.75     0.74
12         21  μ    79.63     5.37     4.33     0.93    12.59     5.10    67.10    10.62    24.71     3.80
               σ    15.25     2.23     1.62     0.74     7.69     4.71    18.14     4.57    15.65     0.96
2111       46  μ    23.43     3.81     0.26     1.11    29.06     2.70    43.62     5.22    51.52     2.39
               σ    10.20     2.46     0.44     0.75    58.92     4.07    26.59     1.81    28.34     0.95
21121      20  μ    32.50     4.63     0.15     1.34    18.25     0.65    27.65    15.20    15.50     3.34
               σ    10.86     2.09     0.37     0.80    14.80     0.88    25.43     2.61    15.89     0.93
21122      13  μ    16.67     3.22     0.23     1.46    33.28     1.08    39.38    15.62    69.62     2.23
               σ    10.14     2.32     0.44     0.88    33.07     2.02    30.31     3.48    17.73     0.83
212        11  μ    26.77     4.30     0.00     5.18    95.16     0.36    14.39    14.73    28.18     1.83
               σ    17.00     2.29     0.00     1.62   146.69     0.67     9.43     3.26    29.69     1.07
22         21  μ    23.81     5.74     2.76     0.99    30.38     2.90    51.21    11.38    51.10     3.38
               σ    15.53     2.63     1.34     0.59    61.80     3.33    24.84     6.22    22.83     0.87
31          3  μ    53.70    12.43     0.00     1.50   213.42     2.00    18.80     9.33    56.67     1.67
               σ     8.49     5.08     0.00     1.00   166.63     2.65     7.36     7.64    22.55     0.43
32         29  μ    77.39    12.89     1.90     0.67    11.34     2.48    42.30    13.93    34.97     4.01
               σ    20.94     2.77     3.11     0.42     4.11     2.86    24.93     4.56    18.97     0.59
GROUPS AFTER SHADE
GROUP   # OBS      EPISUB   AVGWID  NUMROOT     GRAD      SO4     WOOD AGRIBARR     CHAN    EMBED    SHADE      IBI
1111111     8  μ    74.07     4.27     0.44     1.52    12.16     0.56    63.70    16.11    35.00    73.43     3.57
               σ    16.20     2.89     0.73     1.34     6.97     0.73    17.40     1.27     5.00    16.49     0.61
1111112     3  μ    52.78     4.77     0.50     0.30    19.24     1.50    82.59    16.00    50.00    32.48     1.43
               σ    11.79     3.34     0.71     0.00     6.76     2.12     3.90     0.00    21.21     3.23     0.20
111112     27  μ    85.19     5.43     0.63     1.58    14.75     0.78    42.27    16.85    17.85    73.78     4.01
               σ    17.02     2.20     0.79     0.66     8.57     1.09    21.44     1.51     7.18    19.05     0.50
11112      45  μ    71.60     5.02     0.73     1.25    20.63     1.87    52.48     8.78    32.78    57.81     3.22
               σ    16.68     2.14     0.84     0.92    23.06     2.34    23.60     2.41    15.76    26.90     0.87
1112       32  μ    74.83     4.08     0.47     2.11    15.79     1.06     7.91    13.53    29.63    75.79     2.66
               σ    16.27     2.41     0.72     0.84    10.24     1.66     4.28     4.59    24.24    20.14     1.36
1121       12  μ    84.72     3.34     0.17     6.82    20.21     0.83    11.25    15.33    29.58    89.44     3.09
               σ    14.24     1.30     0.39     3.51    16.37     0.72     7.14     3.23    16.85    13.97     1.08
1122        5  μ    82.22     4.04     0.40     7.17    11.49     3.40     8.68    17.80    33.00    92.84     1.46
               σ    25.88     2.45     0.89     1.60     4.04     0.55    10.83     1.92    27.75     8.68     0.74
12         21  μ    79.63     5.37     4.33     0.93    12.59     5.10    67.10    10.62    24.71    71.67     3.80
               σ    15.25     2.23     1.62     0.74     7.69     4.71    18.14     4.57    15.65    16.87     0.96
2111       46  μ    23.43     3.81     0.26     1.11    29.06     2.70    43.62     5.22    51.52    58.11     2.39
               σ    10.20     2.46     0.44     0.75    58.92     4.07    26.59     1.81    28.34    31.78     0.95
211211     12  μ    35.65     5.29     0.17     1.28    17.12     0.58    26.53    15.25    14.58    43.06     3.79
               σ    12.64     2.12     0.39     0.86     8.06     1.00    20.46     2.56    16.85    16.28     0.64
211212      8  μ    27.78     3.64     0.13     1.42    19.96     0.75    29.33    15.13    16.88    89.74     2.68
               σ     5.14     1.70     0.35     0.73    22.06     0.71    33.06     2.85    15.34     9.36     0.95
21122      13  μ    16.67     3.22     0.23     1.46    33.28     1.08    39.38    15.62    69.62    54.72     2.23
               σ    10.14     2.32     0.44     0.88    33.07     2.02    30.31     3.48    17.73    28.11     0.83
212        11  μ    26.77     4.30     0.00     5.18    95.16     0.36    14.39    14.73    28.18    78.50     1.83
               σ    17.00     2.29     0.00     1.62   146.69     0.67     9.43     3.26    29.69    15.76     1.07
22         21  μ    23.81     5.74     2.76     0.99    30.38     2.90    51.21    11.38    51.10    72.43     3.38
               σ    15.53     2.63     1.34     0.59    61.80     3.33    24.84     6.22    22.83    21.44     0.87
31          3  μ    53.70    12.43     0.00     1.50   213.42     2.00    18.80     9.33    56.67    29.28     1.67
               σ     8.49     5.08     0.00     1.00   166.63     2.65     7.36     7.64    22.55    16.89     0.43
32         29  μ    77.39    12.89     1.90     0.67    11.34     2.48    42.30    13.93    34.97    62.05     4.01
               σ    20.94     2.77     3.11     0.42     4.11     2.86    24.93     4.56    18.97    17.88     0.59
Appendix II: computer code
Code for the Self-Organizing Maps and the raw and neuron-based
correlation matrices
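The script that follows repeats one gap-filling step for every variable matrix it builds: a neuron that received no sampling sites (and therefore has a NaN mean) is assigned the mean of its immediate neighbours on the map. As a minimal sketch of that step only, it could be factored into a helper saved in its own file. The function name fill_empty_neurons is hypothetical and does not appear in the original script; the sketch also assumes that som_unit_neighs returns a units-by-units matrix that is nonzero for 1-neighbours, as the listing's use of Ne1 suggests.

```matlab
function v = fill_empty_neurons(v, Ne1)
% FILL_EMPTY_NEURONS  Hypothetical helper, not part of the original script.
%   v   : column vector of per-neuron means (NaN where no site was mapped)
%   Ne1 : neighbourhood matrix, e.g. Ne1 = som_unit_neighs(sM); assumed to
%         be nonzero where two map units are immediate neighbours
v0 = v;                              % read only from the unfilled values
for k = find(isnan(v0))'             % loop over the empty neurons
    vals = v0(find(Ne1(k,:)));       % values of the neighbouring units
    vals = vals(~isnan(vals));       % ignore neighbours that are also empty
    if ~isempty(vals)
        v(k) = mean(vals);           % otherwise v(k) is left as NaN
    end
end
end
```

With such a helper, each of the repeated blocks would reduce to a single call, for example QHEI_SOM = fill_empty_neurons(QHEI_SOM, Ne1);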
clear all
close all
clc
fig_handle = [];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Read the datasets (in .csv format)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Database = readtexttocells('REGLU AND FRAGMENT.csv');
fields = Database(1,:);
warning off MATLAB:divideByZero
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Forming the environmental variable matrix - input to the SOM
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
MTC = Database(:,[find(strcmp(fields,'RDA_WATER')):find(strcmp(fields,'AREA'))]);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Creating the struct for SOM after normalizing the input metric data
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sD1 = som_data_struct(str2double(MTC(2:end,:)),'comp_names',MTC(1,:),'labels',...
    Database(2:end,find(strcmp(fields,'IDX'))));
sD2 = som_normalize(sD1,'log');
sD2 = som_normalize(sD2,'range');
clc
clc
% Finding the optimal # of SOM map units based on the quantization and
% topographic errors
qea = [];
tea = [];
for m = 10:5:100
    clear sM
    sM = som_make(sD2,'munits',m,'algorithm','seq');
    [qe,te] = som_quality(sM, sD2);
    qea = [qea qe];
    tea = [tea te];
end
m = 10:5:100;
fig_handle(end+1) = figure;
gca;
[AX,H1,H2] = plotyy(m,qea,m,tea);
set(AX(1),'Ycolor','k')
set(AX(2),'Ycolor','k')
set(get(AX(1),'Ylabel'),'String','Quantization error')
set(get(AX(2),'Ylabel'),'String','Topographic error')
set(H1,'LineStyle','-.')
set(H2,'LineStyle','-')
xlabel('No of map units')
title('Finding optimal no of map units')
legend([H1 H2],'Quantization error','Topographic error')
grid
set(gca,'xtick',[0:10:100])
saveas(gcf,'No_neurons.fig')
saveas(gcf,'No_neurons.jpg')
clear AX H1 H2
clc
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SOM training after selecting the number of map units
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
mu = input('Enter optimal no of map units : ');
close(gcf);
sM = som_make(sD2,'munits',mu,'algorithm','seq','name','','training',[20 100]);
[qe,te] = som_quality(sM, sD2);
SOM_cells = prod(sM.topol.msize);
[tempX, tempY] = meshgrid(1:sM.topol.msize(2),1:sM.topol.msize(1));
L1 = (flipud(tempY)-1)*sM.topol.msize(2)+tempX;
L1 = L1(:);
clear tempX tempY
clc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% U matrix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fig_handle(end+1) = figure;
som_show(sM,'umat',[])
hold on
som_cplane('hexa',sM.topol.msize,'none');
som_show_add('label',cellstr(int2str(L1)),'Textsize',8);
colormap(1-gray);
som_recolorbar
saveas(gcf,'U_matrix.fig')
saveas(gcf,'U_matrix.jpg')
close(gcf);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% k means clustering of the SOM neurons
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[c, p, err, ind] = kmeans_clusters(sM,[],100); % find clusterings
fig_handle(end+1) = figure;
set(gcf,'Color',[1 1 1]);
set(gca,'XColor',[0 0 0],'YColor',[0 0 0])
hold on
plot(ind,'k')
xlabel('No of clusters');
ylabel('Davies - Bouldin index');
title('Optimal no of clusters','Color',[0 0 0]);
grid;
saveas(gcf,'No_clusters.fig')
saveas(gcf,'No_clusters.jpg')
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Setting number of desired clusters and sorting the cluster labels starting
% from the lowest at the bottom of the SOM map
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
no_clusters = input('Enter no. of clusters : ');
close(gcf);
temp = sortrows([L1 p{no_clusters}],[2 1]);
lookup = sort(temp([0; find(diff(temp(:,2))==1)]+1,:),1);
clear c1
for id = 1:no_clusters
    c1(temp(temp(:,2)==temp(find(temp(:,1)==lookup(id,1)),2)),:) = lookup(id,2);
end
Cluster_label = c1(L1);
clear c1
Color_map = jet(64);
Color_map = Color_map(ceil(linspace(1,55,no_clusters))',:);
SOMcolors = (repmat(Cluster_label,[1, no_clusters]) == repmat([1:no_clusters],[length(Cluster_label),1]));
SOMcolors = (linspace(0,1,no_clusters) * SOMcolors')';
fig_handle(end+1) = figure;
som_show(sM,'empty',sprintf('%d clusters',no_clusters))
hold on
som_cplane('hexa',sM.topol.msize,SOMcolors);
som_show_add('label',cellstr(int2str(L1)),'Textsize',8);
colormap(Color_map);
h = colorbar;
set(h,'YTick',linspace(min(get(h,'YTick')),max(get(h,'YTick')),no_clusters),...
    'YTickLabel',[1:no_clusters])
sM = som_label(sM,'clear','all');
sM = som_autolabel(sM,sD2);
saveas(gcf,'SOM_neurons.fig')
saveas(gcf,'SOM_neurons.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Forming the matrices based on neuron site distribution
% 1) Habitat index
% 2) Environmental variables
% 3) Fish metrics
% 4) Indices of integrity i.e. IBI/ICI
% 5) Fish counts
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[tf loc]= ismember(sM.labels,Database(2:end,find(strcmp(fields,'IDX'))));
Ne1 = som_unit_neighs(sM);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% HABITAT INDEX MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
QHEI_data = str2double(Database(:,find(strcmp(fields,'QHEI'))));
var_cluster = nan(size(sM.labels));
var_cluster(loc~=0) = (QHEI_data(loc(loc~=0)));
QHEI_SOM = nanmean(var_cluster')';
if length(find(isnan(QHEI_SOM)))>0
    Coord = find(isnan(QHEI_SOM))';
    Ne2 = Ne1(Coord,:);
    b=repmat(nan,size(Ne2,1),6);
    ix=find(Ne2);
    [dum,iy]=find(Ne2);
    ix=rem(ix-1,numel(b))+1;
    b(ix)=iy;
    b=sort(b,2);
    c = repmat(nan,size(b));
    c(~isnan(b)) = QHEI_SOM(b(~isnan(b)));
    QHEI_SOM(isnan(QHEI_SOM)) = nanmean(c')';
    clear b c
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ENVIRONMENTAL VARIABLES MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
index = [find(strcmp(fields,'RDA_WATER')):find(strcmp(fields,'AREA'))];
Env_var = fields(index);
No_env = length(index);
ENV_MTX = [];
for var_no = 1:No_env
    index = find(strcmp(fields,Env_var(var_no)));
    var_data = str2double(Database(2:end,index));
    var_cluster = repmat(nan,size(sM.labels));
    var_cluster(loc~=0) = var_data(loc(loc~=0));
    Env_SOM = nanmean(var_cluster')';
    if length(find(isnan(Env_SOM)))>0
        Coord = find(isnan(Env_SOM))';
        Ne2 = Ne1(Coord,:);
        b=repmat(nan,size(Ne2,1),6);
        ix=find(Ne2);
        [dum,iy]=find(Ne2);
        ix=rem(ix-1,numel(b))+1;
        b(ix)=iy;
        b=sort(b,2);
        c = repmat(nan,size(b));
        c(~isnan(b)) = Env_SOM(b(~isnan(b)));
        Env_SOM(isnan(Env_SOM)) = nanmean(c')';
        clear c b
    end
    ENV_MTX = [ENV_MTX Env_SOM];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FISH METRICS MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
index = [find(strcmp(fields,'SPSCORE')):find(strcmp(fields,'SPWNSCORE'))];
Fish_var = fields(index);
No_fish = length(index);
FISH_MTX = [];
for var_no = 1:No_fish
    index = find(strcmp(fields,Fish_var(var_no)));
    var_data = str2double(Database(2:end,index));
    var_data = log(var_data+1);
    var_cluster = repmat(nan,size(sM.labels));
    var_cluster(loc~=0) = var_data(loc(loc~=0));
    Fish_SOM = nanmean(var_cluster')';
    if length(find(isnan(Fish_SOM)))>0
        Coord = find(isnan(Fish_SOM))';
        Ne2 = Ne1(Coord,:);
        b=repmat(nan,size(Ne2,1),6);
        ix=find(Ne2);
        [dum,iy]=find(Ne2);
        ix=rem(ix-1,numel(b))+1;
        b(ix)=iy;
        b=sort(b,2);
        c = repmat(nan,size(b));
        c(~isnan(b)) = Fish_SOM(b(~isnan(b)));
        Fish_SOM(isnan(Fish_SOM)) = nanmean(c')';
        clear c b
    end
    FISH_MTX = [FISH_MTX Fish_SOM];
end
FISH_MTX = round(exp(FISH_MTX)-1);
fish_removed = find(sum(FISH_MTX)==0);
Fish_var(find(sum(FISH_MTX)==0))=[];
FISH_MTX(:,find(sum(FISH_MTX)==0))=[];
FISH_MTX(find(sum(FISH_MTX,2)==0),:) = eps;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% BIOTIC INDICES MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
index = [find(strcmp(fields,'IBI')) find(strcmp(fields,'ICI'))];
Indices_var = fields(index);
No_indices = length(index);
INDICES_MTX = [];
for var_no = 1:No_indices
    index = find(strcmp(fields,Indices_var(var_no)));
    var_data = str2double(Database(2:end,index));
    var_cluster = repmat(nan,size(sM.labels));
    var_cluster(loc~=0) = var_data(loc(loc~=0));
    Indices_SOM = nanmean(var_cluster')';
    if length(find(isnan(Indices_SOM)))>0
        Coord = find(isnan(Indices_SOM))';
        Ne2 = Ne1(Coord,:);
        b=repmat(nan,size(Ne2,1),6);
        ix=find(Ne2);
        [dum,iy]=find(Ne2);
        ix=rem(ix-1,numel(b))+1;
        b(ix)=iy;
        b=sort(b,2);
        c = repmat(nan,size(b));
        c(~isnan(b)) = Indices_SOM(b(~isnan(b)));
        Indices_SOM(isnan(Indices_SOM)) = nanmean(c')';
        clear c b
    end
    INDICES_MTX = [INDICES_MTX Indices_SOM];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FISH COUNTS MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
index = [find(strcmp(fields,'NUMINDSP')):find(strcmp(fields,'NUMSPAWN'))];
Count_var = fields(index);
No_counts = length(index);
FISHCOUNTS_MTX = [];
for var_no = 1:No_counts
    index = find(strcmp(fields,Count_var(var_no)));
    var_data = str2double(Database(2:end,index));
    var_cluster = repmat(nan,size(sM.labels));
    var_cluster(loc~=0) = var_data(loc(loc~=0));
    Counts_SOM = nanmean(var_cluster')';
    if length(find(isnan(Counts_SOM)))>0
        Coord = find(isnan(Counts_SOM))';
        Ne2 = Ne1(Coord,:);
        b=repmat(nan,size(Ne2,1),6);
        ix=find(Ne2);
        [dum,iy]=find(Ne2);
        ix=rem(ix-1,numel(b))+1;
        b(ix)=iy;
        b=sort(b,2);
        c = repmat(nan,size(b));
        c(~isnan(b)) = Counts_SOM(b(~isnan(b)));
        Counts_SOM(isnan(Counts_SOM)) = nanmean(c')';
        clear c b
    end
    FISHCOUNTS_MTX = [FISHCOUNTS_MTX Counts_SOM];
end
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SPATIAL DISTRIBUTION OF THE CLUSTERS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Cluster_symbol = 'x^o+*.+';
sM = som_label(sM,'clear','all');
sM = som_autolabel(sM,sD2);
L = sM.labels';
L = L(:);
L(cellfun('isempty',L))=[];
[tf loc]= ismember(L,Database(2:end,find(strcmp(fields,'IDX'))));
% Reading the latitude and longitudes from the dataset
lat = str2double(Database(2:end,find(strcmp(fields,'LAT'))));
lat_site = lat(loc);
long = str2double(Database(2:end,find(strcmp(fields,'LONG'))));
long_site = long(loc);
% Calculate the # of sampling sites in each SOM neuron
hits = som_hits(sM,sD2);
hits_idx=hits>0;
temp_hits=hits(hits_idx);
SOM_color_map = [];
SOM_label = [];
Cluster_id = 1:length(unique(Cluster_label));
temp_cluster_label=Cluster_label(hits_idx);
Site_label=zeros(sum(temp_hits),1);
Site_label([1; 1+cumsum(temp_hits(1:end-1))])=[temp_cluster_label(1); diff(temp_cluster_label)];
Site_label = cumsum(Site_label);
clear temp_hits temp_cluster_label
Site_selected = find(ismember(Site_label,Cluster_id));
fig_handle(end+1) = figure;
gscatter(long_site(Site_selected),lat_site(Site_selected),Site_label(Site_selected),...
    Color_map(Cluster_id,:),Cluster_symbol(Cluster_id),[],0)
hold on
xlabel('Longitude');
ylabel('Latitude');
legend(cellstr([repmat('Cluster ',length(Cluster_id),1) num2str(Cluster_id')])','Location','Best')
title('Clustered Spatial representation of sites');
box on;
saveas(gcf,'Lat_longdist.fig')
saveas(gcf,'Lat_longdist.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% EXPORT THE CLUSTERED SITE DATA TO EXCEL
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
site_cluster=[lat_site(Site_selected),long_site(Site_selected),Site_label(Site_selected)];
xlswrite('Site_cluster.xls',site_cluster,'Site_cluster');
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CREATING THE HABITAT INDEX CLUSTER DISTRIBUTION FIGURE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Cluster_ids = ones(prod(sM.topol.msize),1);
for idx = 1:no_clusters
    Cluster_ids = [Cluster_ids ~cellfun('isempty',regexp(cellstr(num2str(Cluster_label)),cellstr(num2str(idx))))];
end
Cluster_ids(Cluster_ids==0) = nan;
QHEIVar_label = [];
QHEI_diff = [];
MSE = [];
notch = 1;
scale = ~isnan(Cluster_ids(:,2:end))*flipud(linspace(0.5,1,no_clusters)');
% SOM visualization and Clustered Boxplots for Habitat Index
f = figure;
som_show(sM,'empty','','subplots',[1 2])
hold on
som_cplane('hexa',sM.topol.msize,QHEI_SOM,scale);
som_show_add('label',cellstr(int2str(L1)),'Textsize',6);
set(gca,'Position',[0.05 0.1 0.35 0.9])
colormap(flipud(jet))
h = colorbar;
set(h,'Position', [0.43 0.23 0.025 0.64],'Fontsize',8)
subplot(122)
boxplot(repmat(QHEI_SOM,1,size(Cluster_ids,2)).* Cluster_ids,notch)
set(gca,'XTicklabel',[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
set(gca,'FontSize',8,'Position', [0.6 0.1 0.35 0.8])
xticklabel_rotate([],90,[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
set(gca,'YGrid','on');
ylabel('');
xlabel('');
h = title('SOM visualization and Clustered Boxplots for QHEI');
set(h,'Position',get(h,'Position')-[0.75 0 0],'FontSize',12)
saveas(gcf,'Habitat_index_dist.fig')
saveas(gcf,'Habitat_index_dist.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CREATING THE ENVIRONMENTAL VARIABLES CLUSTER DISTRIBUTION FIGURES
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sM1 = som_denormalize(sM);
notch = 1;
Metric_names = 1:size(ENV_MTX,2);
y = 4;
x = 4;
fig_handle(end+1) = figure;
% METRICS 1 TO 8
for var_no = 1:8
    h1 = subplot(x,y,((var_no-1)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 1:8
    subplot(x,y,((var_no-1)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 1 to 8.jpg')
saveas(gcf,'Metrics 1 to 8.fig')
close(gcf);
%%
% METRICS 9 TO 16
fig_handle(end+1) = figure;
for var_no = 9:16
    h1 = subplot(x,y,((var_no-9)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 9:16
    subplot(x,y,((var_no-9)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 9 to 16.jpg')
saveas(gcf,'Metrics 9 to 16.fig')
close(gcf);
%%
% METRICS 17 TO 24
fig_handle(end+1) = figure;
for var_no = 17:24
    h1 = subplot(x,y,((var_no-17)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 17:24
    subplot(x,y,((var_no-17)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 17 to 24 .jpg')
saveas(gcf,'Metrics 17 to 24.fig')
close(gcf);
%%
% METRICS 25 TO 32
fig_handle(end+1) = figure;
for var_no = 25:32
    h1 = subplot(x,y,((var_no-25)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 25:32
    subplot(x,y,((var_no-25)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 25 to 32 .jpg')
saveas(gcf,'Metrics 25 to 32.fig')
close(gcf);
%%
% METRICS 33 AND 34
fig_handle(end+1) = figure;
for var_no = 33:34
    h1 = subplot(x,y,((var_no-33)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 33:34
    subplot(x,y,((var_no-33)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 33 to 34 .jpg')
saveas(gcf,'Metrics 33 to 34.fig')
close(gcf);
196
%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %CLUSTER DISTRIBUTION OF THE FISH COUNTS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Metric_names = 1:size(FISHCOUNTS_MTX,2); y = 2; x = 4; fig_handle(end+1) = figure; %FISH COUNTS FOR METRICS 1 TO 4 for var_no = 1:4 h1 = subplot(x,y,((var_no-1)*2)+1); temp_pos = get(h1,'Position'); set(h1,'Position',[temp_pos(1:2) 0.09 0.1]); h = som_cplane('hexa',sM.topol.msize,FISHCOUNTS_MTX(:,var_no)); set(h,'EdgeColor','none') h = colorbar; set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6) title(fields(52+var_no),'Interpreter','none','Fontsize',7,'Position',[4 0]) end set(findobj(gcf,'Tag','Colorbar'),'FontSize',6) suptitle_withpatch('SOM visualization and clustered boxplots for Fish Counts') for var_no = 1:4 subplot(x,y,((var_no-1)*2)+2) boxplot(repmat(FISHCOUNTS_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch) set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])']) set(gca,'FontSize',6) ylabel(''); xlabel(''); end saveas(gcf,'Fishcounts 1 to 4.jpg') saveas(gcf,'Fishcounts 1 to 4.fig') close(gcf); %% %FISH COUNTS FOR METRICS 5 TO 8 for var_no = 5:8 h1 = subplot(x,y,((var_no-5)*2)+1); temp_pos = get(h1,'Position'); set(h1,'Position',[temp_pos(1:2) 0.09 0.1]); h = som_cplane('hexa',sM.topol.msize,FISHCOUNTS_MTX(:,var_no)); set(h,'EdgeColor','none') h = colorbar; set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6) title(fields(52+var_no),'Interpreter','none','Fontsize',7,'Position',[4 0]) end set(findobj(gcf,'Tag','Colorbar'),'FontSize',6) suptitle_withpatch('SOM visualization and clustered boxplots for Fish Counts') for var_no = 5:8 subplot(x,y,((var_no-5)*2)+2)
    boxplot(repmat(FISHCOUNTS_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel(''); xlabel('');
end
saveas(gcf,'Fishcounts 5 to 8.jpg')
saveas(gcf,'Fishcounts 5 to 8.fig')
close(gcf);
%%
%FISH COUNTS FOR METRICS 9 TO 11
for var_no = 9:11
    h1 = subplot(x,y,((var_no-9)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,FISHCOUNTS_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(fields(52+var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Fish Counts')
for var_no = 9:11
    subplot(x,y,((var_no-9)*2)+2)
    boxplot(repmat(FISHCOUNTS_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel(''); xlabel('');
end
saveas(gcf,'Fishcounts 9 to 11.jpg')
saveas(gcf,'Fishcounts 9 to 11.fig')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FISH METRICS CLUSTER DISTRIBUTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sM1 = som_denormalize(sM);
notch = 1;
% SOM FOR THE DIFFERENT FISH METRICS
Fish_metrics = 1:size(FISH_MTX,2);
y = ceil(sqrt(length(Fish_metrics)));
x = ceil(length(Fish_metrics)/y);
fig_handle(end+1) = figure;
for var_no = Fish_metrics
    h1 = subplot(x,y,find(Fish_metrics==var_no));
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1])
    h = som_cplane('hexa',sM.topol.msize,FISH_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.012 -0.008 0.003 0.015])
    title(Fish_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization for Fish metrics')
saveas(gcf,'SOM_fishmetrics1.fig')
saveas(gcf,'SOM_fishmetrics1.jpg')
close(gcf);
%%
% BOXPLOTS FOR THE DIFFERENT FISH METRICS
y = ceil(sqrt(length(Fish_metrics)));
x = ceil(length(Fish_metrics)/y);
fig_handle(end+1) = figure;
for var_no = Fish_metrics
    subplot(x,y,find(Fish_metrics==var_no))
    boxplot(repmat(FISH_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    title(Fish_var(var_no),'Interpreter','none','Fontsize',7)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel(''); xlabel('');
end
suptitle_withpatch('Clustered Boxplots for Fish Metrics')
% NOTE: the original listing saved this figure as 'SOM_fishmetrics1', which
% overwrites the SOM figure saved above; a separate file name is used here.
saveas(gcf,'Boxplots_fishmetrics1.jpg')
saveas(gcf,'Boxplots_fishmetrics1.fig')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CLUSTER DISTRIBUTION OF INDICES OF BIOTIC INTEGRITY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% BIOTIC INDEX #1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fig_handle(end+1) = figure;
som_show(sM,'empty','','subplots',[1 2])
hold on
som_cplane('hexa',sM.topol.msize,INDICES_MTX(:,1),scale);
som_show_add('label',cellstr(int2str(L1)),'Textsize',6);
set(gca,'Position',[0.05 0.1 0.35 0.9])
colormap(flipud(jet));
h = colorbar;
set(h,'Position', [0.43 0.23 0.025 0.64],'Fontsize',8)
subplot(122)
boxplot(repmat(INDICES_MTX(:,1),1,size(Cluster_ids,2)).* Cluster_ids,notch)
set(gca,'XTicklabel',[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
set(gca,'FontSize',8,'Position', [0.6 0.1 0.35 0.8])
set(gca,'YGrid','on');
ylabel(''); xlabel('');
h = title('SOM visualization and Clustered Boxplots for Biotic index 1');
set(h,'Position',get(h,'Position')-[0.75 0 0],'FontSize',12)
xticklabel_rotate([],90,[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
saveas(gcf,'BIOINDEX1_dist.fig')
saveas(gcf,'BIOINDEX1_dist.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% BIOTIC INDEX #2
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fig_handle(end+1) = figure;
som_show(sM,'empty','','subplots',[1 2])
hold on
som_cplane('hexa',sM.topol.msize,INDICES_MTX(:,2),scale);
som_show_add('label',cellstr(int2str(L1)),'Textsize',6);
set(gca,'Position',[0.05 0.1 0.35 0.9])
colormap(flipud(jet));
h = colorbar;
set(h,'Position', [0.43 0.23 0.025 0.64],'Fontsize',8)
subplot(122)
boxplot(repmat(INDICES_MTX(:,2),1,size(Cluster_ids,2)).* Cluster_ids,notch)
set(gca,'XTicklabel',[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
set(gca,'FontSize',8,'Position', [0.6 0.1 0.35 0.8])
set(gca,'YGrid','on');
ylabel(''); xlabel('');
h = title('SOM visualization and Clustered Boxplots for Biotic index 2');
set(h,'Position',get(h,'Position')-[0.75 0 0],'FontSize',12)
xticklabel_rotate([],90,[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
saveas(gcf,'BIOINDEX2_dist.fig')
saveas(gcf,'BIOINDEX2_dist.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ANALYSIS BASED ON THE SOM (MAX-MIN METRICS AND ENVIRONMENTAL VARIABLES IN NEURONS)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
colors = (repmat(Cluster_label,[1, no_clusters]) == repmat([1:no_clusters],[length(Cluster_label),1]));
colors = (linspace(0.4,1,no_clusters) * colors')';
% Forming the per-cluster median for the Environmental variables
t1 = repmat(ENV_MTX,[1 1 no_clusters]);
t2 = repmat(reshape(Cluster_ids(:,2:end),[prod(sM.topol.msize) 1 no_clusters]),[1 No_env 1]);
Env_median = reshape(nanmedian(t1 .* t2,1),[No_env no_clusters]);
clear t1 t2
%Maximal and minimal median values of the Environmental variables
[Env_max Envmaxidx] = max(Env_median');
[Env_max Envmaxidx] = max(ENV_MTX.*...
    (repmat(Cluster_label,[1,length(Envmaxidx)]) == repmat(Envmaxidx,[length(Cluster_label),1])));
[Env_min Envminidx] = min(Env_median');
H2 = double(repmat(Cluster_label,[1,length(Envminidx)]) == repmat(Envminidx,[length(Cluster_label),1]));
H2(H2==0) = nan;
[Env_min Envminidx] = nanmin(ENV_MTX.*H2);
clear H2
sM = som_label(sM,'clear','all');
sM = som_label(sM,'add',[1:prod(sM.topol.msize)],cellstr(int2str(L1)));
sM = som_label(sM,'add',Envmaxidx,Env_var');
fig_handle(end+1) = figure;
som_show(sM,'empty','Maximal Environmental variables','empty','Minimal Environmental variables','subplots',[1 2])
subplot(121)
hold on
som_cplane('hexa',sM.topol.msize,colors);
colormap((1-0.3*gray(no_clusters)));
hold on
h = som_show_add('label',sM,'Textsize',6,'subplot',1);
set(h,'Interpreter','none')
sM = som_label(sM,'clear','all');
sM = som_label(sM,'add',[1:prod(sM.topol.msize)],cellstr(int2str(L1)));
sM = som_label(sM,'add',Envminidx,Env_var');
subplot(122)
hold on
som_cplane('hexa',sM.topol.msize,colors);
colormap((1-0.3*gray(no_clusters)));
hold on
h = som_show_add('label',sM,'Textsize',6,'subplot',2);
set(h,'Interpreter','none')
saveas(gcf,'Maxmin_envvar.fig')
saveas(gcf,'Maxmin_envvar.jpg')
close(gcf);
clc;
sM = som_label(sM,'clear','all');
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CORRELATION MATRIX OF THE RAW DATA
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
X1 = corrcoef([ENV_MTX INDICES_MTX]);
X2 = [Env_var,'IBI'];
figure; imagesc(abs(X1))
set(gca,'XTick',1:size(X2,2),'XTickLabel',X2,'FontSize',6)
set(gca,'YTick',1:size(X2,2),'YTickLabel', X2','FontSize',6)
title('Correlation Matrix','FontSize',10)
% Overlay the sign of each correlation on the absolute-value image
X3 = sign(X1);
[ir,ic] = find(X3==-1);
th = text(ic,ir,'-');
set(th,'horizontalalignment','center');
hold on;
[ir,ic] = find(X3==1);
th = text(ic,ir,'+');
set(th,'horizontalalignment','center');
caxis([0 1]); colorbar
colormap(jet)
xticklabel_rotate([],90,X2)
saveas(gcf,'Corrmatrix.fig')
saveas(gcf,'Corrmatrix.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%CORRELATION MATRIX OF THE NEURON WEIGHTS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
X4 = corrcoef([sM1.codebook, INDICES_MTX(:,1)]);
X5 = [sM.comp_names','IBI'];
figure; imagesc(abs(X4))
set(gca,'XTick',1:size(X5,2),'XTickLabel',X5,'FontSize',6)
set(gca,'YTick',1:size(X5,2),'YTickLabel', X5','FontSize',6)
title('SOM Neuron Weights Correlation Matrix','FontSize',10)
X6 = sign(X4);
[ir,ic] = find(X6==-1);
th = text(ic,ir,'-');
set(th,'horizontalalignment','center');
hold on;
[ir,ic] = find(X6==1);
th = text(ic,ir,'+');
set(th,'horizontalalignment','center');
caxis([0 1]); colorbar
colormap(jet)
xticklabel_rotate([],90,X5)
saveas(gcf,'Neuron_Corrmatrix.fig')
saveas(gcf,'Neuron_Corrmatrix.jpg')
close(gcf);
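The clustered boxplots in the listing above all rely on multiplying a data column by `Cluster_ids`, whose construction is not shown in this part of the appendix. The following is a minimal sketch of how such a mask could work, assuming `Cluster_ids` holds a first column of ones ('Overall') followed by one column per cluster with 1 for member neurons and NaN elsewhere. The toy values of `Cluster_label` and `v` are hypothetical, and only base MATLAB functions are used.

```matlab
% Hypothetical toy input: cluster number of each of 5 SOM neurons and one variable.
Cluster_label = [1; 1; 2; 2; 3];
v             = [5; 7; 1; 3; 9];
no_clusters   = max(Cluster_label);
n             = numel(Cluster_label);

% Membership mask: 1 where the neuron belongs to the cluster, NaN elsewhere
% (assumed layout; NaN entries are skipped by boxplot and nanmedian).
mask = double(repmat(Cluster_label,1,no_clusters) == repmat(1:no_clusters,n,1));
mask(mask==0) = NaN;
Cluster_ids = [ones(n,1) mask];          % first column = 'Overall'

% Same expression as in the boxplot calls: one column per group
grouped = repmat(v,1,size(Cluster_ids,2)) .* Cluster_ids;

% Per-group mean that ignores NaN, written without toolbox functions
colMeans = zeros(1,size(grouped,2));
for j = 1:size(grouped,2)
    col = grouped(:,j);
    colMeans(j) = mean(col(~isnan(col)));
end
```

Each column of `grouped` then contains only the values of one group, which is why a single `boxplot` call can draw the overall distribution next to the per-cluster distributions.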
Code for the KNN variable sorting and step-wise predictions
clear all
close all
clc
close(gcf);
fig_handle = [];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% READ THE DATASETS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Database = readtexttocells('C:\Program Files\MATLAB\R2006a\toolbox\somtoolbox\SOM\DATABASES\MD_COASTAL_FINAL_NOZEROS.csv');
fields = Database(1,:);
warning off MATLAB:divideByZero
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ENTERING THE VARIABLES NAMES USED IN EACH STEP
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Env_var = [find(strcmp(fields,'SO4_LAB')) find(strcmp(fields,'TEMP_FLD')) ...
    find(strcmp(fields,'ST_GRAD')) find(strcmp(fields,'NO3_LAB')) ...
    find(strcmp(fields,'ACREAGE')) find(strcmp(fields,'MAXDEPTH')) ...
    find(strcmp(fields,'PHI'))];
Env_var_name = 'STEP 25B';
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ENTERING THE NUMBER OF DESIRED CLOSEST NEIGHBORS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
No_hits = 10;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%EXTRACT ONE OBSERVATION (SITE) AT A TIME FROM THE DATASET, AND CALCULATE
%DISTANCES WITH ALL THE REMAINING POINTS IN THE DATABASE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for Row_no =2:size(Database,1)
    MTC = Database([1:(Row_no-1), (Row_no+1):end],Env_var);
    Val_IBI = str2double(Database([2:(Row_no-1), (Row_no+1):end],find(strcmp(fields,'FIBI_98'))));
    sD1 = som_data_struct(str2double(MTC(2:end,:)),'comp_names',MTC(1,:),'labels',...
        Database([2:(Row_no-1),(Row_no+1):end],find(strcmp(fields,'IDX'))));
    sD2 = som_normalize(sD1,'log');
    sD2 = som_normalize(sD2,'range');
    %IDENTIFY ROW THAT IS BEING PREDICTED
    VAL_DATA = str2double(Database(Row_no,Env_var));
    % MERGING THE DATABASE AND TARGET ROW, LOGGING AND RANGING
    DATA2 = [];
    for i = 1: size (VAL_DATA,1)
        sD3 = som_data_struct([str2double(MTC(2:end,:));VAL_DATA(i,:)],'comp_names',MTC(1,:));
        sD4 = som_normalize(sD3,'log');
        sD4 = som_normalize(sD4,'range');
        LOG_VALUE = sD4.data(size(sD4.data,1),:)';
        DATA2 = [DATA2 LOG_VALUE];
        clear LOG_VALUE sD3 sD4;
    end
    % CALCULATING THE EUCLIDEAN DISTANCES AND FINDING THE K SITES THAT HAVE THE
    % SMALLEST DISTANCES
    EUCDIST = dist(sD2.data,DATA2)';
    for i =1:size(EUCDIST,1)
        [Sort index] = sort(EUCDIST(i,:));
        Calc_IBI = mean(Val_IBI(index(1:No_hits)));
        CALC_IBI((Row_no-1),1)= Calc_IBI;
        clear Calc_IBI;
    end
end
%WITHDRAWING THE OBSERVED IBI
Obs_IBI = str2double(Database(2:end,find(strcmp(fields,'FIBI_98'))));
%REGRESSION STATISTICS
R2 = regstats(Obs_IBI, CALC_IBI,'linear', 'rsquare');
R2 = R2.rsquare;
R2text = num2str(R2);
MSE = regstats(Obs_IBI, CALC_IBI,'linear', 'mse');
RMSE = sqrt(MSE.mse);
RMSEtext = num2str(RMSE);
%PLOT THE RESULTS
h = figure;
scatter (Obs_IBI,CALC_IBI,15,'b','filled');
box on;
xlabel ('Observed IBI','Color',[0 0 0]);
ylabel('Predicted IBI','Color',[0 0 0]);
title ('IBI prediction using SOM','Color',[0 0 0]);
set(h, 'Color', [1 1 1]);
set(gca, 'XColor', [0 0 0], 'YColor', [0 0 0],'ZColor', [0 0 0]);
axis ([0 5 0 5]);
text(0.5,4, ['RMSE =' RMSEtext],'Color',[0 0 0]);
hold on
text(0.5,4.5,['R2=' R2text],'Color',[0 0 0]);
%DRAWING THE 1:1 LINE
hold on
FAKEDATA1 = 0:2:100;
FAKEDATA2 = 0:2:100;
plot(FAKEDATA1,FAKEDATA2,'r--');
saveas (gcf, sprintf('Direct pred using %s_%dsites.jpg',Env_var_name,No_hits));
saveas (gcf, sprintf('Direct pred using %s_%dsites.fig',Env_var_name,No_hits));
save(sprintf('Direct pred using %s_%dsites',Env_var_name, No_hits));
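The leave-one-out KNN procedure above can be illustrated without the SOM Toolbox. The sketch below is not the dissertation's code: it uses hypothetical toy data, replaces `som_normalize` with a plain min-max (range) scaling computed from the training rows, omits the log transform, and derives R2 as the squared Pearson correlation, which for a simple linear regression with intercept equals the `rsquare` value that `regstats` reports. The variable `k` plays the role of `No_hits`.

```matlab
% Hypothetical toy data: 6 sites, 2 environmental variables, one IBI per site.
X   = [1 10; 2 11; 3 12; 10 1; 11 2; 12 3];
ibi = [4.0; 4.2; 4.4; 1.0; 1.2; 1.4];
k   = 2;                                  % number of closest neighbors

n    = size(X,1);
pred = zeros(n,1);
for r = 1:n
    train = [1:r-1, r+1:n];               % leave site r out
    m     = numel(train);
    % Range-normalise using the training rows only (assumed simplification)
    lo   = min(X(train,:),[],1);
    hi   = max(X(train,:),[],1);
    span = hi - lo;
    span(span==0) = 1;                    % guard against constant columns
    Z  = (X(train,:) - repmat(lo,m,1)) ./ repmat(span,m,1);
    z0 = (X(r,:) - lo) ./ span;
    % Euclidean distance from the target site to every training site
    d = sqrt(sum((Z - repmat(z0,m,1)).^2, 2));
    [dSorted, order] = sort(d);
    % Prediction = mean IBI of the k closest training sites
    pred(r) = mean(ibi(train(order(1:k))));
end

% Performance statistics with base MATLAB functions
c    = corrcoef(ibi, pred);
R2   = c(1,2)^2;
rmse = sqrt(mean((ibi - pred).^2));
```

With real data the choice of scaling matters, because Euclidean distance is dominated by whichever variable has the largest spread after normalisation.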
Code for the step-wise variable sorting and prediction using a hierarchical tree
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% READING DATABASES
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all;
[Database Fields] = xlsread('DATABASE.xls','A1:N56');
[BIODATA BioFields] = xlsread('DATABASE.xls','P1:P56');
EnvData = Database(:,2:end);
Fields_EnvData = Fields (2:end);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% EXTRACT ONE VARIABLE AT A TIME AND CHECK PREDICTION CAPABILITIES WITH
%% DIFFERENT NUMBER OF HOMOGENEOUS GROUPS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
VARREG_STAT =[];
OBS_IBI = BIODATA(:,1);
%SELECTING ONE VARIABLE AT A TIME FROM THE ENVIRONMENTAL DATABASE
for var_no =1:size(EnvData,2)
    REG_STAT =[];
    % SELECTING THE DIFFERENT NUMBER OF HOMOGENEOUS GROUPS WITH WHICH EACH
    % VARIABLE IS TESTED
    for max_sc= round(size(EnvData,1)/10):190:size(EnvData,1)
        CALC_IBI=[];
        %LEAVE-ONE OBSERVATION OUT METHODOLOGY
        for Row_no = 2:(size(Database,1)+1)
            %ISOLATING SITE OF INTEREST
            TARGET_VAR = Database(Row_no-1,[1 var_no+1]);
            TARGET_BIO = BIODATA(Row_no-1,:);
            TARGET_IDX = Database(Row_no-1,1);
            %OBTAIN REST OF THE DATABASE EXCLUDING THAT OBSERVATION
            INDIDX = find (Database(:,1)~=TARGET_IDX);
            EnvDataTemp = EnvData(INDIDX,var_no);
            BIOTemp = BIODATA(INDIDX,:);
            clear INDIDX
            %STANDARDIZE, CALCULATE DISTANCES, LINK, AND BUILD DENDROGRAM WITH THE
            %REMAINING OBSERVATIONS (ALL EXCEPT TARGET SITE)
            ZEnvDataTemp = zscore(EnvDataTemp);
            DIST = pdist(ZEnvDataTemp,'euclidean');
            LINK =linkage(DIST,'average');
            [D T] = dendrogram(LINK,max_sc, 'colorthreshold','default');
            close(gcf);
            %FIND AVERAGE VALUES FOR EACH ENVIRONMENTAL VARIABLE IN HOMOGENEOUS GROUP IN
            %DENDROGRAM (DETERMINED WITH VECTOR 'T')
            clear AVG_EnvData AVG_BIODATA
            for i =1:max(T)
                INDEX = find(T==i);
                SUB_EnvData = EnvDataTemp(INDEX,:);
                AVG_EnvData(i,:) = mean(SUB_EnvData,1);
                clear INDEX SUB_EnvData;
            end
            % FIND AVERAGE BIOTIC VALUES
            for i =1:max(T)
                INDEX = find(T==i);
                SUB_BIODATA = BIOTemp(INDEX,:);
                AVG_BIODATA(i,:) = mean(SUB_BIODATA,1);
                clear INDEX SUB_BIODATA;
            end
            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            %FIND DISTANCES BETWEEN TARGET SITE AND THE REST OF THE DATABASE
            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            %Merging target site to homogeneous group data
            Merge = [AVG_EnvData;TARGET_VAR(1,2:end)];
            Targ_HG_dist = squareform(pdist(Merge,'euclidean'));
            Targ_HG_dist = Targ_HG_dist((size(AVG_EnvData,1)+1),1:(size(AVG_EnvData,1)));
            %Calculating the average IBI of the closest site/s
            pdistMin = min(Targ_HG_dist,[],2);
            index = find(Targ_HG_dist==pdistMin);
            CALC_IBI = [CALC_IBI, mean(AVG_BIODATA(index,1))];
            clear SUB_EnvData SUB_BIODATA INDEX EnvDataTemp ZEnvDataTemp BIOTemp pdistMin
        end
        %CALCULATING PREDICTION PERFORMANCE FOR THAT ENVIRONMENTAL VARIABLE AFTER
        %TESTING ALL THE SITES AVAILABLE IN OUR DATABASE
        CALC_IBI=CALC_IBI';
        R2= regstats(OBS_IBI, CALC_IBI,'linear', 'rsquare');
        R2=R2.rsquare;
        RMSE = sqrt(mean((OBS_IBI-CALC_IBI).^2));
        STATemp = [R2; RMSE];
        REG_STAT = [REG_STAT STATemp];
        clear STATemp R2 RMSE
    end
    VARREG_STAT = [VARREG_STAT;REG_STAT];
end
% PLOT NUMBER OF HOMOGENEOUS GROUPS VERSUS R2 FOR EACH VARIABLE
HG = round(size(EnvData,1)/10):190:size(EnvData,1);
ax1 = axes ('Xlim',[min(HG) max(HG)],'XTick',HG);
xlabel ('NUMBER OF HOMOGENEOUS GROUPS','Color',[0 0 0]);
ylabel('R2','Color',[0 0 0]);
title ('OPTIMUM NUMBER OF HOMOGENEOUS GROUPS','Color',[0 0 0]);
box on;
%SELECTING AND PLOTTING R2 FIELD FOR EACH VARIABLE IN THE REGRESSION STATISTICS FILE
%(ODD ROWS OF VARREG_STAT HOLD R2, EVEN ROWS HOLD RMSE)
for var_no = 1:2:(size(EnvData,2)*2)
    hold on
    line(HG,VARREG_STAT(var_no,:),'Parent',ax1);
end
%SAVE FIGURES
saveas (gcf,'VAR_SEL_PLOT.fig');
saveas (gcf,'VAR_SEL_PLOT.jpg');
close (gcf);
save('MAT_FILES');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% SORT THE DATA AND START THE STEP-WISE PREDICTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SORT DATA ACCORDING TO PREDICTION CAPABILITIES
VARREG_STAT = VARREG_STAT';
MAXR2= max(VARREG_STAT(:,1:2:size(VARREG_STAT,2)),[],1);
[SortR2 indR2]= sort(MAXR2,'descend');
R2Comp = SortR2(1);
% STEP-WISE PREDICTION FOLLOWING THE ORDER DETERMINED BY THE OBTAINED R2
SLTD_VAR = EnvData(:,indR2(1,1));
VAR_Names = Fields_EnvData(:,indR2);
indSel_Var = 1;
PROGR2 = [R2Comp];
R2ALL =[];
for sel_var=2:size(indR2,2)
    REG_STAT=[];
    SLTD_VAR = [SLTD_VAR EnvData(:,indR2(sel_var))];
    for max_sc= 233:95:423
        CALC_IBI=[];
        %LEAVE-ONE-OUT PROCEDURE
        for Row_no = 2:(size(Database,1)+1)
            %ISOLATING SITE OF INTEREST
            TARGET_VAR = SLTD_VAR(Row_no-1,:);
            TARGET_BIO = BIODATA(Row_no-1,:);
            TARGET_IDX = Database(Row_no-1,1);
            %OBTAIN REST OF THE DATABASE EXCLUDING THAT OBSERVATION
            INDIDX = find (Database(:,1)~=TARGET_IDX);
            SLTD_VARTemp = SLTD_VAR(INDIDX,:);
            BIOTemp = BIODATA(INDIDX,:);
            clear INDIDX
            %STANDARDIZE, CALCULATE DISTANCES, LINK, AND BUILD DENDROGRAM WITH THE
            %REMAINING OBSERVATIONS (ALL EXCEPT TARGET SITE)
            ZSLTD_VARTemp = zscore(SLTD_VARTemp);
            DIST = pdist(ZSLTD_VARTemp,'euclidean');
            LINK =linkage(DIST,'average');
            [D T] = dendrogram(LINK,max_sc, 'colorthreshold','default');
            close (gcf);
            %FIND AVERAGE VALUES FOR EACH ENVIRONMENTAL VARIABLE IN HOMOGENEOUS GROUP IN
            %DENDROGRAM AND DETERMINED WITH VECTOR 'T'
            for i =1:max(T)
                INDEX = find(T==i);
                SUB_SLTDVAR = SLTD_VARTemp(INDEX,:);
                AVG_SLTDVAR(i,:) = mean(SUB_SLTDVAR,1);
                clear INDEX SUB_SLTDVAR;
            end
            % FIND AVERAGE BIOTIC VALUES
            for i =1:max(T)
                INDEX = find(T==i);
                SUB_BIODATA = BIOTemp(INDEX,:);
                AVG_BIODATA(i,:) = mean(SUB_BIODATA,1);
                clear INDEX SUB_BIODATA;
            end
            %FIND DISTANCES BETWEEN TARGET SITE AND THE REST OF THE DATABASE
            %Merging target site to homogeneous group data
            Merge = [AVG_SLTDVAR;TARGET_VAR];
            Targ_HG_dist = squareform(pdist(Merge,'euclidean'));
            Targ_HG_dist = Targ_HG_dist(size(Targ_HG_dist,1),1:(size(Targ_HG_dist,1)-1));
            %Calculating the average IBI of the closest site/s
            pdistMin = min(Targ_HG_dist,[],2);
            index = find(Targ_HG_dist==pdistMin);
            CALC_IBI = [CALC_IBI, mean(AVG_BIODATA(index,1))];
            clear SUB_SLTDVAR SUB_BIODATA INDEX SLTD_VARTemp ZSLTD_VARTemp BIOTemp pdistMin AVG_SLTDVAR AVG_BIODATA
        end
        %CALCULATING PREDICTION PERFORMANCE FOR THAT ENVIRONMENTAL VARIABLE AFTER
        %TESTING ALL THE SITES AVAILABLE IN OUR DATABASE
        CALC_IBI=CALC_IBI';
        R2= regstats(OBS_IBI, CALC_IBI,'linear', 'rsquare');
        R2=R2.rsquare;
        REG_STAT=[REG_STAT R2];
        clear R2
    end
    R2ALL = [R2ALL;REG_STAT];
    R2=max(REG_STAT,[],2);
    % KEEP THE NEW VARIABLE ONLY IF IT IMPROVES THE BEST R2 OBTAINED SO FAR
    if R2>R2Comp
        R2Comp = R2;
        PROGR2 = [PROGR2 R2];
        indSel_Var =[indSel_Var sel_var];
    else
        SLTD_VAR = SLTD_VAR(:,1:(size(SLTD_VAR,2)-1));
    end
end
%OBTAINING THE NAMES OF THE VARIABLES SELECTED
Sort_Fields = Fields_EnvData(indR2);
Sel_fields = Sort_Fields(indSel_Var);
%SAVE MATLAB FILES
save('MAT_FILES');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% PLOTTING THE BEST PREDICTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%ENTER NAMES OF BEST VARIABLES AS THEY APPEAR IN Fields (USE QUOTES)
Ind_Bestvar = [find(strcmp(Fields,'EMBEDDED')),find(strcmp(Fields,'RIFFLE')),find(strcmp(Fields,'SUBSTRATE'))...
    find(strcmp(Fields,'POOL')),find(strcmp(Fields,'AREA')),find(strcmp(Fields,'COVER'))];
REG_STAT =[];
% SELECTING THE DIFFERENT NUMBER OF HOMOGENEOUS GROUPS WITH WHICH EACH
% VARIABLE IS TESTED
MaxGroups = input('Enter number of desired groups in hierarchical tree');
for max_sc=MaxGroups
    CALC_IBI=[];
    %LEAVE-ONE-OUT PROCEDURE
    for Row_no = 2:(size(Database,1)+1)
        %ISOLATING SITE OF INTEREST
        TARGET_VAR = Database(Row_no-1,[1 Ind_Bestvar]);
        TARGET_BIO = BIODATA(Row_no-1,:);
        TARGET_IDX = Database(Row_no-1,1);
        %OBTAIN REST OF THE DATABASE EXCLUDING THAT OBSERVATION
        INDIDX = find (Database(:,1)~=TARGET_IDX);
        EnvDataTemp = Database(INDIDX,Ind_Bestvar);
        BIOTemp = BIODATA(INDIDX,:);
        clear INDIDX
        %STANDARDIZE, CALCULATE DISTANCES, LINK, AND BUILD DENDROGRAM WITH THE
        %REMAINING OBSERVATIONS (ALL EXCEPT TARGET SITE)
        ZEnvDataTemp = zscore(EnvDataTemp);
        DIST = pdist(ZEnvDataTemp,'euclidean');
        LINK =linkage(DIST,'average');
        [D T] = dendrogram(LINK,max_sc, 'colorthreshold','default');
        close(gcf);
        %FIND AVERAGE VALUES FOR EACH ENVIRONMENTAL VARIABLE IN HOMOGENEOUS GROUP IN
        %DENDROGRAM AND DETERMINED WITH VECTOR 'T'
        clear AVG_EnvData AVG_BIODATA
        for i =1:max(T)
            INDEX = find(T==i);
            SUB_EnvData = EnvDataTemp(INDEX,:);
            AVG_EnvData(i,:) = mean(SUB_EnvData,1);
            clear INDEX SUB_EnvData;
        end
        % FIND AVERAGE BIOTIC VALUES
        for i =1:max(T)
            INDEX = find(T==i);
            SUB_BIODATA = BIOTemp(INDEX,:);
            AVG_BIODATA(i,:) = mean(SUB_BIODATA,1);
            clear INDEX SUB_BIODATA;
        end
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        %FIND DISTANCES BETWEEN TARGET SITE AND THE REST OF THE DATABASE
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        %MERGING TARGET SITE TO HOMOGENEOUS GROUP DATA
        Merge = [AVG_EnvData;TARGET_VAR(1,2:end)];
        Targ_HG_dist = squareform(pdist(Merge,'euclidean'));
        Targ_HG_dist = Targ_HG_dist((size(AVG_EnvData,1)+1),1:(size(AVG_EnvData,1)));
        %CALCULATE AVERAGE IBI OF THE CLOSEST SITE/S
        pdistMin = min(Targ_HG_dist,[],2);
        index = find(Targ_HG_dist==pdistMin);
        CALC_IBI = [CALC_IBI, mean(AVG_BIODATA(index,1))];
        clear SUB_EnvData SUB_BIODATA INDEX EnvDataTemp ZEnvDataTemp BIOTemp pdistMin
    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %CALCULATING PREDICTION PERFORMANCE FOR THAT ENVIRONMENTAL VARIABLE AFTER
    %TESTING ALL THE SITES AVAILABLE IN OUR DATABASE
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    CALC_IBI=CALC_IBI';
    R2= regstats(OBS_IBI, CALC_IBI,'linear', 'rsquare');
    R2=R2.rsquare;
    R2text = num2str(R2);
    RMSE = sqrt(mean((abs(OBS_IBI-CALC_IBI)).^2));
    RMSEtext = num2str(RMSE);
    STATemp = [max_sc R2 RMSE];
    REG_STAT = [REG_STAT; STATemp];
end
scatter(OBS_IBI, CALC_IBI);
xlabel ('Observed IBI','Color',[0 0 0]);
ylabel('Predicted IBI','Color',[0 0 0]);
title ('IBI prediction using a hierarchical approach','Color',[0 0 0]);
axis ([12 60 12 60]);
text(13,55, ['RMSE =' RMSEtext],'Color',[0 0 0]);
hold on
text(13,57,['R2=' R2text],'Color',[0 0 0]);
%DRAWING THE 1:1 LINE
hold on
FAKEDATA1 = 12:2:60;
FAKEDATA2 = 12:2:60;
plot(FAKEDATA1,FAKEDATA2,'r--');
%PLOTTING 1.5xRMSE INTERVALS
FAKEDATA_21 = 12+1.5*RMSE:2:60+1.5*RMSE;
FAKEDATA_22 = 12-1.5*RMSE:2:60-1.5*RMSE;
hold on
plot(FAKEDATA1,FAKEDATA_21,'r--');
plot(FAKEDATA1,FAKEDATA_22,'r--');
saveas (gcf, 'Best prediction Instream variables.jpg');
saveas (gcf, 'Best prediction Instream variables.fig');
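The core of the hierarchical prediction (group the remaining sites, average each group, assign the target site the mean IBI of its nearest group) can be condensed as follows. This is only a sketch under stated assumptions: the data are hypothetical, `nGroups` stands in for `max_sc`, and group membership is obtained with `cluster(...,'maxclust',...)` instead of the `dendrogram` output `T` used in the listing. That substitution avoids opening a figure on every iteration and is assumed, not verified here, to give a comparable grouping. The Statistics Toolbox is required for `zscore`, `pdist`, `linkage` and `cluster`.

```matlab
% Hypothetical toy data: one environmental variable and one IBI score per site.
X       = [1; 2; 3; 10; 11; 12; 20; 21];
ibi     = [50; 48; 52; 30; 32; 28; 15; 17];
nGroups = 3;                               % number of homogeneous groups

n    = size(X,1);
pred = zeros(n,1);
for r = 1:n
    train = [1:r-1, r+1:n];                % leave site r out
    Xt = X(train,:);
    bt = ibi(train);
    % Standardize, compute distances, link, and cut the tree into groups
    LINK = linkage(pdist(zscore(Xt),'euclidean'),'average');
    T    = cluster(LINK,'maxclust',nGroups);
    g    = max(T);
    % Group averages of the raw environmental data and of the IBI
    grpEnv = zeros(g,size(Xt,2));
    grpIBI = zeros(g,1);
    for i = 1:g
        grpEnv(i,:) = mean(Xt(T==i,:),1);
        grpIBI(i)   = mean(bt(T==i));
    end
    % Distance from the target site to each group centroid
    d = sqrt(sum((grpEnv - repmat(X(r,:),g,1)).^2, 2));
    % Prediction = mean IBI of the closest group(s)
    pred(r) = mean(grpIBI(d == min(d)));
end
rmse = sqrt(mean((ibi - pred).^2));
```

As in the listing, the tree is built on standardized data while the centroids and the target distance use the raw values; with several variables on different scales that choice affects which group is nearest.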