Biological response to environmental stress. Environmental

similarity and hierarchical, scale-dependent segregation of

biotic signatures for prediction purposes

A Dissertation Presented

by

David Bedoya Ribó

to

The Department of Civil & Environmental Engineering

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in

Civil Engineering

in the field of

Environmental Engineering

Northeastern University

Boston, Massachusetts

(October 2008)

Abstract

Biological response to environmental stress. Environmental similarity and

hierarchical, scale-dependent segregation of biotic signatures for prediction

purposes

David Bedoya Ribó

In the hierarchical river system, any deviation from the pristine state will be translated

into disturbances that propagate and eventually reach its endpoints (i.e. the biologic

community). Endpoints are indicative of the overall health or integrity of a water body.

Integrity is usually measured with multi-metric indices that compare actual observations

to reference scenarios. Despite strong agreement among experts about the importance of

biological indicators, development of numeric biological standards similar to those used

for water quality remains uncertain for several reasons: (1) the natural system is

composed of highly intertwined and cross-correlated variables, so simple

stress-response relationships can rarely be identified; (2) the natural system is organized in a

nested hierarchy of suitable habitats with very different geographic scales; (3) many

environmental variables have a categorical evaluation, which introduces subjectivity and

relativity into the system; (4) true reference conditions may no longer exist; and (5)

natural randomness.

In order to address these issues, an attempt to predict or characterize biologic integrity

was performed. In the first section, fish Indices of Biologic Integrity (IBI) were predicted

using the K-nearest neighbor concept (KNN). This methodology was used because it

allows a fast, step-wise approach easily implemented with highly dimensional

environmental vectors. The KNN concept was tested with databases in Maryland, Ohio,

and Minnesota. Subsequently, a slightly modified version of the algorithm was tested

with a new database in Ohio that combined instream and offstream features, which improved

the results significantly.

The second section consisted of a progressive, hierarchical separation of biological

responses using Self-Organizing Maps (SOM) and subsequent clustering of sites using

one environmental variable at a time in decreasing order of importance. This

methodology attempted to replicate the nested hierarchy of habitats in nature. The

biologic responses were characterized using a Gaussian probabilistic curve because it was

assumed that IBI was a projection of the log-normal distribution of species onto an

arithmetic scale. The best sites in each group were considered true reference

conditions and compared to the remaining sites within the group. This was applied in

Ohio (with only instream or only offstream data) and Maryland (instream and offstream

data combined).

Acknowledgments

I would like to especially thank my wife: Tonya L. Berenson. Her affection, empathy,

sense of humor, and always positive attitude during very hard periods at the academic

and personal levels have been crucial to me in order to achieve this goal.

I would like to thank my advisor: Professor Vladimir Novotny. His guidance and broad

experience in the water resources field were critical for the successful completion of this

research. I am also very grateful to the committee members and especially to Professor

Elias Manolakos. His experience and advice with complex data patterning techniques, a

field that was completely new to me, were extremely helpful.

I would like to thank all my family and friends in Spain and the U.S.A. for their

unconditional support and understanding. Just knowing they were there has been an

endless source of energy and joy.

I would like to thank all my friends in the Civil & Environmental Engineering

Department at Northeastern University for their support and the good times we spent

together all these years.

Finally, I would like to express my gratitude towards Mr. Ed Rankin and Dennis Mishne

from Ohio EPA for their help with the environmental databases and valuable advice.

This research was partially funded by a USEPA STAR watershed research grant to

Northeastern University, Boston, MA.

Table of contents

Research summary .... 1
Introduction .... 4
1. Chapter 1: Comparison of IBI predictions using regression and the environmental similarity concept .... 13
    1.1. Methodology .... 13
        1.1.1. Self-Organizing feature maps .... 14
        1.1.2. k-nearest neighbor concept .... 15
        1.1.3. Description of the databases .... 16
        1.1.4. IBI prediction methodology using kNN .... 18
        1.1.5. IBI prediction using regression and SOM + regression .... 20
        1.1.6. Chronic and acute toxic chemical effects .... 23
    1.2. Results and discussion .... 25
        1.2.1. IBI predictions using kNN (k = 1 or k = 2) .... 25
        1.2.2. IBI predictions using kNN (with k = 5 or k = 10) .... 26
        1.2.3. Regression models .... 28
    1.3. Conclusions .... 36
2. Chapter 2: Large-scale biologic integrity prediction based on environmental similarity using instream data and regional and local offstream characteristics .... 39
    2.1. Methodology .... 39
        2.1.1. Data and study area .... 39
        2.1.2. Variable sorting based on IBI prediction power using a leave-one-out, hierarchical approach .... 45
        2.1.3. Step-wise IBI prediction using a leave-one-out, hierarchical approach .... 46
        2.1.4. Analysis of observations with a significant impact from local variables .... 48
    2.2. Results .... 48
        2.2.1. Step-wise IBI predictions .... 48
        2.2.2. Analysis of sites with significant local-scale stressors .... 50
    2.3. Discussion .... 55
        2.3.1. Land use .... 56
        2.3.2. Fragmentation .... 59
        2.3.3. Point sources and instream water quality .... 60
        2.3.4. Instream Habitat .... 62
        2.3.5. Mispredictions due to local effects .... 62
    2.4. Conclusions .... 64
3. Chapter 3: Probabilistic, Hierarchical, Biologic Integrity Discrimination .... 66
    3.1. Methodology .... 66
        3.1.1. Ohio: instream data and study area .... 66
        3.1.2. Ohio: offstream data and study area .... 68
        3.1.3. Maryland data and study area .... 72
        3.1.4. Self-Organizing Feature Maps (SOM) .... 75
        3.1.5. Initial data clustering and SOM neuron analysis .... 77
        3.1.6. Second SOM data clustering .... 78
        3.1.7. Site patterning based on ‘large-scale’ variables and associated biotic responses .... 79
        3.1.8. Site patterning based on ‘small-scale’ variables and associated biotic response .... 82
        3.1.9. IBI response curve development for different levels of watershed characterization .... 82
        3.1.10. Development of biotic response reference curves .... 86
    3.2. Results and discussion .... 87
        3.2.1. Ohio: instream data .... 87
        3.2.2. Ohio offstream data .... 99
        3.2.3. Coastal Maryland .... 117
        3.2.4. Piedmont Maryland .... 122
        3.2.5. Highland Maryland .... 128
    3.3. Conclusions .... 138
        3.3.1. Ohio with instream data .... 138
        3.3.2. Ohio with offstream data .... 140
        3.3.3. Maryland .... 140
4. Main conclusions .... 143
5. Future research and work .... 148
6. References .... 157
Appendices .... 165
Appendix I: group statistics .... 166
Appendix II: computer code .... 186

List of Figures

Figure 1. Hierarchical stressor-risk-endpoint propagation model based on Karr et al. (1986) integrity concept and Novotny (2003) concept of risk propagation .... 4

Figure 1-1. Flow-chart of the step-wise kNN prediction method. Dashed arrow lines represent the steps followed when the environmental variables are sorted with k = 1. Dotted arrow lines represent the steps followed when the variable sorting is performed with k = 10. Solid arrow lines depict common steps for both cases .... 20

Figure 1-2. Flow-chart of the step-wise multiple regression method. Dashed lines indicate steps for the cluster-based model only. Dotted lines indicate steps for the whole database model only. Solid lines are common steps for both methods .... 23

Figure 1-3. Top, site cluster distribution in Minnesota (left), Maryland (Piedmont sites) (center), and Ohio (right). In Minnesota, cluster 1 is concentrated in Southern watersheds. In Maryland clusters 4 and 5 are concentrated in a specific region and, in Ohio, sites located in the same watershed usually belong to the same cluster. Bottom, Self-organizing Map neuron lattice and box plots with the cluster-based IBI values. The red line in the boxplots represents median cluster value, the top line is 75 percentile, and bottom line is 25 percentile .... 34

Figure 2-1. From left to right and top to bottom. (1) Upstream stream network carrying waste water; (2) upstream stream network fragmentation; (3) basin-scale dams in the downstream main channel; (4) basin-scale stream network fragmentation .... 42

Figure 2-2. Hierarchical tree with different clustering levels to which the test site (Xi1, Xi2, …, Xin) is being compared against. i indicates the observation number, n indicates the environmental variable within the environmental vector .... 47

Figure 2-3. Diagram showing the order with which the variable groups were merged. Orange rectangles indicate instream variables. Green rectangles indicate offstream variables. Blue indicates a mix of both .... 47

Figure 2-4. IBI predictions with the best offstream variables (top), best instream variables (middle), and best variables overall (bottom). Dashed red lines indicate perfect fit line (center) and ±1.5×RMSE (sides). Dot size is proportional to the number of hits in a specific point .... 54

Figure 3-1. Distribution of observations used in the analysis and basins. On the left, groups after the 2nd SOM. On the right groups after clustering using SITE_Con (groups from the same parent group are segregated by basin) .... 70

Figure 3-2. 1995-1997 MBSS monitoring stations in the state of Maryland and strata distribution .... 73

Figure 3-3. Example of a hierarchical tree of the 2nd SOM neurons (left) and analysis of differences among group biologic responses (right). On the right, example of MRT analysis. Overlapping indicate not significant differences in group IBI means. Non-overlapping indicates significantly different group IBI means. In this case, Level 4 partition would be chosen because it yields the largest number of different biotic responses (5) with less overlapping than Level 5 (Figure for clarification purposes only) .... 81

Figure 3-4. Flow chart summarizing the methodology used to characterize response of the biologic community to similar environmental characteristics and stressors (Maryland and Ohio with instream data) .... 84

Figure 3-5. Flow chart summarizing the methodology used to characterize response of the biologic community to similar environmental characteristics and stressors (for Ohio with offstream data) .... 85

Figure 3-6. Correlation matrix of the variable neuron-based weights and neuron-based average IBI values in the trained SOM .... 87

Figure 3-7. Groups and subgroups with different biological responses after clustering with large and small-scale environmental filters. Red color marks groups that did not pass normality tests. Blue color indicates groups that passed the normality tests .... 92

Figure 3-8. Normal distribution probability plots for groups 1 through 6. Red line indicates 75th IBI percentile. Points to the right of the red line were considered as reference observations for the respective group of sites and separated .... 96

Figure 3-9. Normal probability plots for the reference (green) and impaired (red) conditions for the six groups obtained after clustering the SOM neurons with environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group .... 98

Figure 3-10. Correlation matrix of the variable neuron-based weights and neuron-based average IBI scores in the trained SOM. Color bar on the right indicates absolute value of the absolute correlation coefficient. Plus and minus signs indicate positive or negative correlation .... 99

Figure 3-11. Hierarchical diagram of habitats with significantly different biotic responses. On the right, list of environmental variables used to segregate biotic signatures at each step. Rectangles in blue indicate groups that passed normality test. Rectangles in red indicate groups that did not pass normality test .... 102

Figure 3-12. Normal distribution probability plots for the biologic signatures after clustering sites with SITE_Con. Group 212 did not pass the Jarque-Bera test of normality at the 95% confidence level (see Figure 3-11). Group 221 was not plotted because it only had 4 observations .... 103

Figure 3-13. Example of biologic response separation by segregation of sites with environmental variables. Group 222 splits in groups 2221 and 2222 (group 2222 not-normally distributed) after clustering with RDA_Urban. Group 2222 splits in groups 22221 and 22222 (both normally distributed) after clustering with R30_Agri .... 104

Figure 3-14. Normal probability plots for the reference (green) and impaired (red) conditions for the groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution (Group 212 was fitted to a Gaussian distribution only for demonstration purposes) .... 105

Figure 3-15. Groups of sampling sites in a watershed located in the Muskingum River Basin. On the left, groups after partition with regional watershed land use and fragmentation metrics. On the right, groups after partitions with land use in the local 100-meter buffer .... 116

Figure 3-16. Correlation matrix of the variable neuron-based weights and neuron, average IBI values in the trained SOM. Color bar on the right indicates color code for the absolute correlation coefficients among variables .... 117

Figure 3-17. Groups and subgroups with different biological response after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed the normality tests .... 119

Figure 3-18. Normal probability plots for the IBI responses found after the 2nd SOM clustering .... 120

Figure 3-19. Normal probability plots for the reference (green) and impaired (red) conditions for the two groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution .... 121

Figure 3-21. Groups and subgroups with different biological responses after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed normality tests .... 124

Figure 3-22. Normal probability plots for the IBI responses identified by the 2nd SOM clustering in Piedmont sites (Group 4 didn’t pass the normality test) .... 125

Figure 3-23. Normal probability plots for the reference (green) and impaired (red) conditions for the two groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution (Group 4 was fitted to a Gaussian distribution only for demonstration purposes) .... 126

Figure 3-25. Biological response hierarchical structure after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed normality tests .... 130

Figure 3-26. Normal probability plots for the IBI responses the 2nd SOM clustering in Highland sites (groups 1 and 3 didn’t pass normality tests) .... 131

Figure 3-27. Normal probability plots for the reference (green) and impaired (red) conditions for the three groups obtained using environmental gradients in Highland sites. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group in order to describe its Gaussian distribution (Groups 1 and 3 fitted to a Gaussian distribution only for demonstration purposes) .... 132

List of Tables

Table 1-1. Description of the environmental variables, scores and indices available for each state and their units .... 17

Table 1-2. Summary of IBI predictions using the kNN methodology. The different functions (Mh = Mahalanobis; Eu = Euclidean) and selected number of closest neighbors (k) are specified. Final selected variables in each case are also listed .... 28

Table 1-3. Summary of the step-wise regressions for IBI prediction for the development and validation sets. The variables used in each case are listed together with their coefficients and curve type (in parentheses). Variables in italics in the whole database regressions indicate variables also used in some of the kNN predictions. Results in Ohio after including metal toxicity penalties .... 35

Table 2-1. Description, percentage quartiles, and individual IBI predicting power for the different NLCD land use categories present in the Ohio database .... 43

Table 2-2. Description, quartile values, and individual IBI predicting power for the water quality, habitat, point source, and stream fragmentation metrics .... 44

Table 2-3. List of variables with significant differences between over-predicted sites and sites with a prediction within the ±1.5×RMSE intervals .... 51

Table 2-4. List of variables with significant differences between under-predicted sites and observations with a prediction within the ±1.5×RMSE intervals .... 52

Table 2-5. Step-wise IBI predictions. R2 indicate the variability explained after adding a new variable to the model. All results were achieved using a hierarchical tree with 423 branches. For an explanation of variables refer to Table 2-1 and Table 2-2 .... 53

Table 3-1. List of water quality, habitat, and biologic integrity parameters used in the research .... 67

Table 3-2. Land use categories and quartiles at the watershed (R) and the local (L) scales .... 71

Table 3-3. Fragmentation (top) and point source density and intensity metrics (middle), units, and quartiles .... 71

Table 3-4. Description, quartiles, and units for the available regional environmental variables .... 74

Table 3-5. Neuron-based correlation coefficients between variables and IBI .... 90

Table 3-6. ANOVA (top) and MRT (bottom) analyses for the IBI means in groups after 2nd SOM patterning with environmental gradients shown in Figure 3-7. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups .... 90

Table 3-7. 95% confidence intervals for the environmental variable means in reference and impaired sites. Text in bold indicates statistically significant differences for that variable and group according to the t-tests .... 97

Table 3-8. Correlation coefficients between the neuron-based regional environmental variables and the neuron-based average IBI scores (left and mid columns) and raw local variables and IBI scores (left column). Variables in bold were capable of separating significantly different biological responses in the hierarchical structure .... 101

Table 3-9. ANOVA (top) and MRT (bottom) analyses to detect significant differences in IBI means between 2nd SOM groups of neurons. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups .... 101

Table 3-10. 95% confidence intervals and ANOVA test between reference and non-reference sites in variables used in the separation of biotic responses .... 106

Table 3-11. Average group values after clustering with basin/watershed scale variables .... 116

Table 3-12. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups .... 119

sites with variables used in the separation of biotic responses in coastal sites .... 121

Table 3-14. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups .... 124

sites with variables used in the separation of biotic responses in piedmont sites .... 127

Table 3-16. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses in highland sites. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups .... 130

sites in variables used in the separation of biotic responses in highland sites .... 133

Research summary

The research presented in this thesis is an attempt to predict or characterize biological integrity

using data patterning techniques. In the initial stages, a comparison of traditional and more

advanced prediction methods was performed at the state-level and presented in Chapter 1. The

results showed how predictions based on evaluation of environmental similarity outperformed

predictions based on more traditional techniques (i.e. non-linear regression). Moreover, this

methodology was much faster computationally and allowed a leave-one-out validation procedure

that other methods could not afford due to time constraints. This methodology was tested using

databases compiled by public agencies in Ohio, Maryland, and Minnesota.
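
The leave-one-out evaluation of a similarity-based prediction can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code used in this research: the function names, the toy site vectors and IBI scores, the Euclidean distance, and the choice k = 1 are all hypothetical and were picked to keep the example short.

```python
import math


def knn_predict_loo(features, ibi, k=2):
    """Predict each site's IBI from its k most similar sites (leave-one-out)."""
    predictions = []
    for i, target in enumerate(features):
        # Distance from the held-out site to every other site in the database.
        neighbors = [
            (math.dist(target, other), ibi[j])
            for j, other in enumerate(features)
            if j != i
        ]
        neighbors.sort(key=lambda pair: pair[0])
        nearest = neighbors[:k]
        # The prediction is the mean IBI of the k nearest neighbors.
        predictions.append(sum(score for _, score in nearest) / len(nearest))
    return predictions


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted scores."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical, already-standardized environmental vectors and IBI scores.
sites = [(0.0, 0.1), (0.1, 0.0), (1.0, 1.1), (1.1, 1.0), (2.0, 2.1), (2.1, 2.0)]
scores = [50.0, 52.0, 40.0, 38.0, 20.0, 22.0]

predicted = knn_predict_loo(sites, scores, k=1)
error = rmse(scores, predicted)
```

Because every site is predicted only from the remaining sites, no separate validation set is needed, which is what makes the leave-one-out procedure affordable when the predictor itself is cheap to evaluate.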

After these initial results, I realized the prediction results could potentially be improved because

none of the available databases had complete instream (i.e., water and habitat quality) and

offstream (i.e., regional and local land use, point source, and fragmentation) information. A new

database was created for Ohio using Geographic Information Systems (GIS) in order to obtain

accurate land use, point source, and fragmentation metrics for each existing site in the original

database. IBI was predicted using the merged databases. An improved algorithm was used which

assessed environmental variability at different levels of a hierarchical tree of homogeneous

groups. The results improved the previous prediction by almost 10% using only offstream data

(regional land use and stream fragmentation). Mispredicted sites were separated and their

differences from the remaining observations were analyzed. Significant differences in upstream

fragmentation, local land use, and water quality were detected. These results are presented in

Chapter 2.

A literature review as well as the results from Chapter 2 led to the development of a

methodology able to characterize biological responses at different levels of environmental

characterization or description. The methodology developed was named PROHIBID

(PRObabilistic HIerarchical Biologic Integrity Discrimination). This consisted of a top-down

hierarchical classification of environmental stressors based on their overall effect on IBI. This

started with a separation of major biologic signatures by identification of environmental

gradients using Self-Organizing Maps (SOM). Subsequently, distinct biotic responses due to

more localized environmental stressors were progressively segregated using one variable at a

time in decreasing order of importance. Therefore, as the system characterization increased,

group environmental and biologic homogeneity increased as well. Biotic responses in each

group were represented using a Gaussian distribution. This function was chosen under the

hypothesis that the IBI is a projection of the observed log-normal distribution of species onto an

arithmetic axis. This hypothesis worked very well, and groups usually reached

normality if they were homogeneous enough. Some groups did not achieve normality, but this

was most likely due to the lack of a representative sample.

The best observations in each group were considered true reference conditions because they

belonged to a highly environmentally homogeneous cluster. Differences between reference and

non-reference sites were evaluated and indicated the main issues to be addressed as well as their

scale in order to achieve reference conditions. For example, when offstream data was used,

PROHIBID identified regional land use and fragmentation as environmental gradients (i.e. large-

scale variables responsible for background integrity). Local buffer land uses usually explained

the fluctuations within these groups.


This methodology could easily be implemented to establish probabilistic biological standards

similar to those in water quality. Furthermore, reference or realistically achievable conditions are

easily identified because we ‘let the data speak’ with no a-priori assumptions of what reference

conditions should be. Moreover, the scale at which the problem is analyzed is flexible because

with this method differences can be analyzed at any level of the hierarchical structure. Therefore,

the scale issue is no longer a problem. PROHIBID was implemented in Ohio (with instream and

offstream data) and Maryland (combination of both) and is described in detail in Chapter 3.


Introduction

Biologic integrity represents the highest point of the hierarchy in the natural system. It is a direct

measure of the ecological status in a water body and is considered a response indicator (Novotny et

al. 2005). Environmental stressors and fauna’s exposure risk to stress propagate through the

hierarchical structure and in the final outcome impact the biologic community (Figure 1). For

this reason, integrity is considered as a true indicator of the overall health of a water body and

sensitive to any departure from the pristine conditions due to anthropogenic modifications at any

scale.

Figure 1. Hierarchical stressor-risk-endpoint propagation model based on the integrity concept of Karr et al. (1986) and the risk propagation concept of Novotny (2003)


Biological integrity in fresh water systems is usually evaluated with indices.

The use of indices to monitor the biological integrity of surface waters has been common

practice since the last quarter of the 20th century, but started almost a century ago (Novotny

2003). One of the most widely used indices in the United States is the Index of Biologic Integrity

(IBI) developed by Karr et al. (1986). Many public agencies have adopted it as a framework for

their own calibrated version at the state or region scales (Bode 1988; Lyons 2006; Lyons et al.

2001; Ohio EPA 1987; Roth et al. 1998). The IBI is a multi-metric (12 metrics), comparative

index in which the fish samples obtained from a particular water body are compared against the

fish abundances and community composition in reference watersheds. Fish samples are obtained

by electro-shocking. The sum of the 12 metrics constitutes the final IBI score, which is a

discrete number ranging from 12 (essentially no fish) to 60 (healthy fish community). Even

though the index developed by Karr et al. (1986) is based on fish, numerous IBIs based on the

macroinvertebrate community also exist and are currently used (Barbour et al. 1999; Hilsenhoff

1987; Southerland et al. 2005; Stribling et al. 1998; Wright et al. 1988). For convenience, from

this point forward, the fish IBI will be referred to as IBI.

Valid environmental and biodiversity indicators should be sensitive enough to track changes

from reference conditions, applicable in large geographical areas, capable of providing a

continuous assessment over a wide range of stress, and differentiate between natural cycles or

trends and anthropogenic stress (Ott 1978). This is not an easy task because an index must be

able to reflect changes in the community produced by stressors at different hierarchical levels of

the ecosystem and at different geographic scales. The IBI developed by Karr et al. (1986) is

currently accepted as an index with the desirable characteristics and has been applied


successfully to aquatic communities (Noss 1990). Numerous authors have confirmed that

indices based on Karr’s IBI are sensitive to man-induced environmental stresses (Dyer et al.

2000; Dyer et al. 1998a; Lammert and Allan 1999; Manolakos et al. 2007; Richards et al. 1996;

Roth et al. 1996; Wang et al. 2001; Yuan and Norton 2004).

Species distribution in a pristine, lotic system is determined by natural inputs such as

meteorology, geography, latitude, elevation, stream or lake morphology, habitat quality, and

water chemistry (Novotny 2003). However, finding completely pristine environments is difficult

if not impossible. Therefore, identifying a pre-existing state (i.e. actual state) is of major

importance in order to set a reference against which to compare (Rykiel 1985).

Deviations from the natural state are a consequence of introducing some disturbance in the

reference system that will cause a perturbation (at the system level) and/or stress (at the

physiological and functional level). This can be quantified by looking at the departure of the

biological and ecological features of the modified system (Rykiel 1985). In the literature, there is

discussion on the concept of disturbance and its expression in the ecosystem. Most authors agree

that disturbances should not be approached as monolithic inputs that will translate into a specific

change in the whole reference structure. Instead, system disturbance will be expressed in

different ways at different levels of biological organization (Noss 1990). Disturbances should be

analyzed in the context of a highly hierarchical system (i.e. ecosystem) in which the scale at

which a disturbance is manifested will determine its consequences (Pickett et al. 1989).

The hierarchy theory in ecosystems suggests that higher levels of organization incorporate and

determine the response of lower levels (Allen and Starr 1982; O'Neill et al. 1986). Four different


levels of organization of the biological community exist (from high to low hierarchy): regional

landscape, community-ecosystem, population-species, and genetic. Each one of these levels has,

at the same time, three different dimensions that define them: functional, structural, and

compositional (Noss 1990; Novotny et al. 2007). The relevance of higher order constraints

should not mean that monitoring be limited to these levels (e.g. landscape patterns). It is in the

lower levels of the hierarchy where the most detailed information (e.g. species abundances) and

the mechanistic basis for higher levels can be found (Noss 1990). According to the concept of

hierarchical structure, one should be able to distinguish between disturbances that cause stress in

the higher hierarchies (regional or community level) and lower ones (population and genetic

level). Disturbances that translate into some sort of stress at high hierarchy levels are also known

as environmental gradients or large-scale stressors. Environmental gradients usually occur

when normal ecological stimuli and processes in the system, which constitute a continuum, go

beyond normal limits and constitute an axis of continuous change in frequency (Allen and Starr

1982; Rykiel 1985). Environmental gradients are usually ubiquitous, meaning that they will

always be there at different levels or configurations (e.g. landscape patterns, background water

quality). Since they are usually related to large-scale patterns, deviations from their normal,

natural boundaries affect the biological community in its higher hierarchies, producing an overall

shift of the natural species distribution. Large-scale variables determine the background quality

of the biotic community.

On the other hand, environmental stressors that affect lower levels of the biologic hierarchy will

be called small-scale variables for consistency. These variables usually have a marginal effect on

the whole community structure distributed over a large geographic area, but can severely affect


the biologic community at the regional or local scale (e.g. point sources). Two types of small-

scale variables might exist. The first one would be when some element foreign to the natural

system is introduced in sufficient amount as to negatively affect the biologic quality (e.g.

introduction of metals from point sources). The second type would be when localized extreme

values of already existing elements or gradients are reached due to human activity (e.g. high

levels of siltation due to presence of construction sites).

Many studies trying to link IBI to stressors focus on a specific scale and therefore find that the

variables relevant to IBI are those that affect the biologic community at its highest

possible level of hierarchy at the given scale. Thus, impacts from small-scale stressors are

blurred by these. For example, Manolakos et al. (2007) (also summarized in Novotny et al.

2007) used the whole state of Ohio as their system of study. In their analysis, they found that

habitat characteristics together with conductivity and hardness were the main descriptors of the

three identified clusters with different IBI qualities. These variables are large-scale variables

with great effect on the overall IBI variability at the state level. Other variables such as metal

concentration showed a weaker overall effect on IBI. In another study by Dyer et al. (2000), the

main IBI predictors were identified through multiple linear regression in the Great Miami River

watershed in Ohio. When they analyzed the entire area they found that the Qualitative Habitat

Evaluation Index (QHEI), the percentage of municipal effluent flow in average stream

conditions, gradient, and hardness were the best predictors for IBI. These are all large-scale

variables. When they analyzed the lower portion of the watershed, hardness, total suspended

solids, concentrations of selenium, lead, zinc, and ammonium together with pool and channel

qualities were the best predictors. Roth et al. (1996) found that regional land use was a better


predictor for IBI than local land use in a study at the River Raisin watershed in Michigan. Their

study comprised multiple samples in streams of different order and different biologic integrity.

However, another study in the same watershed found exactly the opposite results (Lammert and

Allan 1999). Their study focused only on three first-order warm water tributaries to the River

Raisin. The discrepancies between Roth et al. (1996) and Lammert and Allan (1999) were due to

the scale at which the problem was approached (Allan et al. 1997).

Therefore, average biologic integrity observed in a specific area (which will be referred to as

background integrity) is mainly determined by environmental stressors that are ubiquitous at the

specific scale. This does not necessarily imply that these stressors are the best biologic integrity

predictors. For example, in a pristine environment, species distribution within homogeneous

geographical regions (e.g. ecoregions) is mostly determined by natural inputs such as

meteorology, geology, geography, latitude, and altitude (Novotny 2003). Species presence or

absence and species abundance within smaller, pristine environmental units is mostly determined

by other variables such as local habitat quality, stream morphology, or natural water quality.

Therefore, at very large scales, variables such as geology will be better predictors. However,

when the scale is reduced, local variables become better predictors because larger-scale

variables are homogeneous within the study area (i.e. they determine the average background

quality but not the fluctuations). This concept is transferable to areas undergoing disturbance.

Some stressors act as big disruptors of the ecological hierarchy affecting it at its highest levels

(e.g. climate change at the Earth’s scale or extensive land use changes at the basin, sub-basin or

watershed scales). Other anthropogenic stressors may only affect species distribution in small

areas or localized points and affect the ecosystem at lower levels of the ecological hierarchy (e.g.


channelization of a stream section). Therefore, anthropogenic disturbances may alter the

ecological system at different levels of the hierarchical system depending on their geographic

extent. One stressor may only be local if it is highly localized (e.g. a point source) but can

become a major disruptor if its intensity and extent are severe enough (e.g. extensive water

quality degradation in the U.S. before passage of the Clean Water Act of 1972).

Because of the scale dependence of environmental stressors, a correct sampling design is

paramount in order to obtain reliable results and identify relationships between response

indicators and environmental stressors. Targeted environmental stressors need to be ubiquitous

and diverse in the area of study in order to draw reliable predictions and identify clear patterns. If

biologic integrity is to be evaluated in multiple watersheds, these need to have significantly

different regional characteristics in order to identify reliable stressor-response relationships. If

evaluation of biologic integrity is to be performed at smaller scales (e.g. subcatchment level),

environmental stressors with a scale larger than the study area must be highly homogeneous in

order to reveal the effect of more localized stressors on IBI. Background quality is still

determined by the regional variables; however, fluctuations within a homogeneous unit are due to

variables that are local and diverse at the given scale unless extraordinary stressors exist such as

toxic spills.

In summary, the physical structure of the aquatic system is organized in a hierarchical manner

(Allan et al. 1997; Frissell et al. 1986). Therefore, the distribution of species within this

hierarchical structure is also a nested hierarchy of suitable habitats to which species have adapted

(Kolasa and Biesiadka 1984; Kolasa and Strayer 1988; Sugihara 1980; 1983). Due to the


correspondence in hierarchies, it is logical to think that disturbances in high habitat hierarchical

levels will affect high levels of the biological hierarchy. Stresses in the higher hierarchies of

habitat (e.g. regional scale land use) will propagate and directly alter instream fish habitat

conditions (e.g. sediment retention, instream habitat quality, or organic matter input) (Allan et al.

1997). As a consequence and due to the habitat-biological hierarchy correspondence, high levels

and all subsequent lower levels of the biotic community structure will be affected as well. Since

these major shifts in community structure are produced by environmental gradients, IBI is better

predicted with high level environmental variables in large geographic scales. At smaller scales, a

combination of large and small scale variables predicts IBI more accurately.

Despite the clear theoretical relationship between environmental stressors and response

indicators, identification of stress-response relationships remains challenging for several reasons:

(1) the natural system is composed of highly intertwined and cross-correlated environmental

stressors, (2) the natural system is organized in a nested hierarchy of suitable habitats that are

adequate to different types of species and organisms and may have very different scales (Allan et

al. 1997; Frissell et al. 1986; Kolasa 1989), (3) categorical evaluation of environmental variables

such as habitat quality may bring some degree of subjectivity or relativity into the system and

lead to misleading results, data errors, or poor numerical relationships (i.e. lower coefficients of

multiple determination due to the discrete nature of the data), (4) true reference conditions may no

longer exist. Thus, selection of a representative, reference actual state is crucial in order to have

reliable, non-arbitrary results (Rykiel 1985). Reference conditions are also linked to scale

(Pickett et al. 1989), and (5) presence of natural randomness.


In my opinion, most of the current research efforts to predict or characterize biologic integrity

have three main issues that need to be addressed. First, many of the numerical analyses

used for IBI prediction rely on traditional methods that have

limited capability to reflect the high non-linearity of the natural system (e.g. linear

regression or canonical correspondence analysis). Also, since many environmental variables may

be responsible for a fraction of the IBI variability, easy-to-apply numerical methodologies that

allow easy validation become crucial. Second, and most importantly, the results of any research

effort trying to predict or characterize biological integrity are bound to the scale and design of

the sampling strategy. Many examples exist in which different (and even opposite) results have

been found in the same region of study due to scale issues (Allan et al. 1997; Dyer et al. 2000).

Therefore, development of a methodology able to segregate different biologic responses to

stressors acting at different geographic scales is a link missing in current research. Third,

identification of true reference or realistically achievable conditions to compare new

observations against with no a-priori assumptions is also paramount in order to set future

strategies for standard development and set priorities for future restoration efforts.


1. Chapter 1: Comparison of IBI predictions using regression and the environmental similarity concept

1.1. Methodology

In the present chapter, two different methodologies to predict biotic integrity were tested. For the

analyses, three large state or region-wide databases of indices of biotic integrity and their metrics

as well as accompanying land use, habitat, and chemical parameters were obtained. The first IBI

prediction methodology consisted of using the k-nearest neighbor concept (kNN) with the entire

databases. This method was used first because it is usually considered as a benchmark for

subsequent, more elaborate techniques to be compared against. The kNN concept is based on

assessing proximity among observations by measuring their dissimilarity. The Euclidean and

Mahalanobis distance functions were used for this purpose. A detailed description of the kNN

methodology can be found in Jain et al. (1999) and Jain and Dubes (1988). It was used as a first

step because it is a very fast, computationally efficient technique that easily allows good model

validation by using a leave-one-out approach without drastically increasing the computation

time. Since it was performed using the entire databases, it was expected to reveal the main

environmental parameters with a significant impact on biotic integrity at larger scales.

Once the kNN predictions were performed and a prediction benchmark was obtained, another

methodology was tested. It consisted of a step-wise multiple regression using the best fitting

function (linear or non-linear) at each step. This was performed using two different data scales.

The first scale was the entire database (same as with kNN). The second scale was clusters of sites


obtained using Self Organizing Maps (SOM) (Kohonen 2001; Manolakos et al. 2007) followed

by SOM-neuron clustering with the k-means method (Duda et al. 2000).

The first goal of the research was to compare the performance of more traditional approaches

(regressions) in identifying critical environmental variables to a simpler and more time-efficient

technique based on site similarities (kNN). The second goal was to demonstrate the importance

of data scale in biotic integrity prediction and develop a methodology able to identify relevant

variables at different scales. This was done by running the regression model first with the entire

database (state or region level) and then on a cluster (of sampling sites) basis.

The SOM and kNN techniques, as well as the different databases and their

parameters, are presented briefly in this section before describing the methodology followed in

each case.

1.1.1. Self-Organizing feature maps

SOM are considered a type of unsupervised Artificial Neural Network (ANN). A SOM consists

of a topologically ordered mapping of the input space (in our case vectors of environmental

variables) onto a two-dimensional space according to a meaningful order (Kohonen 2001). SOM

are composed of multiple units called cells or neurons, which represent a homogeneous unit in

the SOM environment. Neurons can be grouped into clusters using similarity functions among

the neuron centroids.

A SOM is usually composed of a two-dimensional lattice that represents the SOM cells. In an

initialization process, each neuron in the SOM is associated with a random weight vector

($m_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{in}]$), which has the same dimension (n) as the input environmental vectors

($x_b = [x_{b1}, x_{b2}, \ldots, x_{bn}]$). Using a dissimilarity function (Euclidean distance), each environmental

vector (corresponding to a sampling site) is associated with the most similar SOM neuron, called

the Best Matching Unit (BMU). Thus, an initial environmental vector SOM-layout is obtained.

Subsequently, the initial neuron-allocated weights ($m_i$) are updated using a neighborhood

function. This function minimizes the overall distance between the neuron itself and its

neighbors. The new updated neuron weight is called the generalized median ($\varepsilon_i$). This process is

iterated several times (epochs) until convergence or until a certain criterion is met

(usually $m_i \cong \varepsilon_i$). After convergence, similar SOM neurons can be further grouped according to

their similarity. Grouped SOM neurons constitute the clusters. SOM have been used for

environmental modeling on several occasions (Cereghino et al. 2001; Manolakos et al. 2007).
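The training loop described above can be illustrated with a minimal sketch. It assumes a batch update in which each neuron moves to the neighborhood-weighted mean of the data, a Gaussian neighborhood on a rectangular lattice, and a linearly shrinking radius; none of these choices is specified in the text, and the names `train_som`, `find_bmu`, `sigma_start` and `sigma_end` are hypothetical. It is not the MATLAB implementation used in this research.

```python
import numpy as np


def find_bmu(X, weights):
    """Return, for each row of X, the index of its Best Matching Unit (BMU)."""
    # Squared Euclidean distance between every data vector and every neuron.
    d2 = ((X[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def train_som(X, rows, cols, epochs=20, sigma_start=2.0, sigma_end=0.5, seed=0):
    """Batch-train a rows x cols SOM on data X (assumed already scaled to [0, 1])."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, X.shape[1]))  # random initial weight vectors
    # Lattice coordinates of each neuron and squared lattice distances between neurons.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    for epoch in range(epochs):
        # Neighborhood radius shrinks linearly over the epochs (assumed schedule).
        sigma = sigma_start + (sigma_end - sigma_start) * epoch / max(epochs - 1, 1)
        bmu = find_bmu(X, weights)
        # h[i, j]: influence of sample j on neuron i, based on lattice distance to j's BMU.
        h = np.exp(-grid_d2[:, bmu] / (2.0 * sigma ** 2))
        # Each neuron moves to the neighborhood-weighted mean of the data.
        weights = (h @ X) / h.sum(axis=1, keepdims=True)
    return weights


rng = np.random.default_rng(1)
X = rng.random((200, 4))                # toy stand-in for scaled environmental vectors
weights = train_som(X, rows=3, cols=4)  # 12-neuron map
bmu = find_bmu(X, weights)              # SOM cell assigned to each sampling site
```

The array `bmu` plays the role of the site-to-neuron layout; grouping the rows of `weights` (for example with k-means) would then give the clusters of neurons.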

1.1.2. k-nearest neighbor concept

The kNN technique consists of a simple algorithm in which one observation point (which is

composed of multiple physical and chemical environmental variables measured at a specific site)

is compared against a set of observations with the exact same attributes. The objective is to find a

specified number of most similar observations (k) to the one being tested.

In order to measure the degree of dissimilarity, there exist numerous distance metrics. Some

common metrics are the Minkowski distance, the Euclidean distance (which is a particular case

of the Minkowski distance), the cosine distance, or the Mahalanobis distance. The latter is

particularly interesting because it applies a whitening transformation to the data that avoids or

reduces linear correlation distortion among features. Detailed information on these functions can


be found in Jain and Dubes (1988) and Jain et al. (1999). The Euclidean (Eq.1) and Mahalanobis

(Eq.2) distances were used in the research with a customized application developed with

MATLAB® .

$$ED(X_i, X_j) = \left( \sum_{k=1}^{n} (X_{i,k} - X_{j,k})^2 \right)^{1/2}$$ (Equation 1)

$$MhD(X_i, X_j) = (X_i - X_j)^T \times \Sigma^{-1} \times (X_i - X_j)$$ (Equation 2)

In the above equations, n is the dimension of the data vectors (number of environmental

variables) in the database, Xi and Xj are the pair of vectors being compared. Matrix Σ in Equation

2 is the covariance matrix of the observed data vectors using the selected features.
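As an illustration only, the two distance functions can be sketched in Python with NumPy (the research itself used a customized MATLAB application, which is not reproduced here). The function names and the toy data are hypothetical. Equation 2 is coded in the quadratic form in which it appears to be written; the more common definition takes the square root of this quantity, which does not change the ordering of neighbors.

```python
import numpy as np


def euclidean_distance(xi, xj):
    """Equation 1: square root of the sum of squared differences."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


def mahalanobis_distance(xi, xj, cov_inv):
    """Equation 2 in quadratic form: (Xi - Xj)^T * Sigma^-1 * (Xi - Xj)."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(diff @ cov_inv @ diff)


# Toy data: 100 observations of 3 variables, two of them linearly correlated.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
data[:, 1] += 0.9 * data[:, 0]

# Sigma is the covariance matrix of the observed data vectors (columns = variables).
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

d_e = euclidean_distance(data[0], data[1])
d_m = mahalanobis_distance(data[0], data[1], cov_inv)
```

With an identity covariance matrix the quadratic form reduces to the squared Euclidean distance, which is a quick way to check an implementation.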

1.1.3. Description of the databases

Environmental databases compiled by the Minnesota Pollution Control Agency (MNPCA), Ohio

EPA and Maryland Biological Stream Survey (MBSS) were obtained. The databases contained

multiple observations of chemical, physical and biological parameters at different sites.

Unfortunately the type and format of data available, especially for physical variables, were quite

different among the three states. A summary of the environmental variables recorded for each

observation is provided in Table 1-1. The variables at each site were collected within a

one-week window. Therefore, IBI observations at that specific time can be considered as the outcome

of the recorded physical and chemical characteristics from an observation site. The number of

sites in each case is the total number of observations with no missing data in any field, which is also the

number of observations used in the analysis.


Table 1-1. Description of the environmental variables, scores and indices available for each state and their units.

OH (429 sites)

Water chemistry: Conductivity (Cond) (µmho/cm); Dissolved oxygen (DO) (mg/L); pH (standard units); Total Suspended Solids (TSS) (mg/L); Total Phosphorus (P) (mg/L); Ammonia as N (NH4) (mg/L); Nitrite as N (NO2) (mg/L); Nitrogen Kjeldahl (TKN) (mg/L); Nitrate as N (NO3) (mg/L); Hardness as CaCO3 (Hard) (mg/L); Biological Oxygen Demand (BOD) (mg/L); Total Calcium (Ca) (mg/L); Total Magnesium (Mg) (mg/L); Chloride (Cl) (mg/L); Sulfate (SO4) (mg/L); Total Arsenic (As) (µg/L); Total Cadmium (Cd) (µg/L); Total Copper (Cu) (µg/L); Total Iron (Fe) (µg/L); Total Lead (Pb) (µg/L); Total Zinc (Zn) (µg/L)

Habitat and morphology: Substrate score (Subs) (0-20); Embeddedness score (Embed) (0-4); Riparian score (Rip) (0-10); Instream cover score (Cov) (0-20); Riffle score (Riffle) (0-8); Pool score (Pool) (0-12); Channel score (Chan) (0-20); Gradient score (Grad) (0-10); Habitat index (QHEI) (0-100)

Land use (beyond 100 m buffer area): % Agriculture (Agri) (25% increments); % Forest-wetland (Forwet) (25% increments); % Urban (Urban) (25% increments)

Biological indices: Fish IBI (12-60); ICI (0-60)

MN (125 sites)

Water chemistry: Conductivity (Cond) (µmho/cm); Dissolved oxygen (DO) (mg/L); pH (standard units); Total Suspended Solids (TSS) (mg/L); Total Phosphorus (P) (mg/L); Ammonia as N (NH4) (mg/L); Total Nitrogen (TN) (mg/L); Temperature (Temp) (deg C); Turbidity (Turb) (NTU)

Habitat and morphology: Substrate, channel, and cover scores; Habitat index (QHEI) (0-100); Buffer width (MBufWid) (m); Mean bank erosion (MBankEros) (m); % undercut (PctUndercut); % woody (PctWoody); % over vegetat. (PctOverVeg); % emerging macrophytes (PctEmermac); % submerged macrophytes (PctSubMac); % other cover (PctOtherCov); % vegetal cover (PctCov); % pool (PctPool); % run (PctRun); % riffle (PctRiffle); % pool+run (PctPoolRun); Mean width (MWidth) (m); Thalweg depth (MThalDep) (cm); Mean depth (MDepth) (cm); Width-depth ratio (WDRatio); Sinuosity ratio (Sin); Slope (Sl) (m/km); % boulder (PctBould); % rock (PctRock); % fines (PctFine); % embeddedness (PctEmbed); Mean fines’ depth (MFineDep) (cm)

Land use (in riparian area): Land use (0-5), riparian (0-15) scores; % disturbed LU in 100-meter buffer (PctDistLU); % undisturbed LU in 100-meter buffer (PctUnDistLU); % disturbed LU in 30-meter buffer (PctDistLU30); % undisturbed LU in 30-meter buffer (PctUnDistLU30)

Biological indices: Fish IBI (0-100)

MD Piedmont sites (246 sites)

Water chemistry: Conductivity (Cond) (µmho/cm); Dissolved oxygen (DO) (mg/L); pH (standard units); Nitrate as N (NO3) (mg/L); Temperature (Temp) (deg C); Sulfate (SO4) (mg/L); Alkalinity (ANC) (µEq/L); Diss. Organic Carbon (DOC) (mg/L)

Habitat and morphology: Remoteness score (Remote) (0-20); Instream habitat (Instrhab) (0-20); Epifaunal substrate (EpiSub) (0-20); Velocity-depth variability (Vel-dpth) (0-20); Pool quality (Pool) (0-20); Riffle quality (Riffle) (0-20); Channel alteration (Chan) (0-20); Bank stability (BankStab) (0-20); % embeddedness (PctEmbed); % channel with flow (Ch_flow); % shading in channel (Shading); Buffer width (MBufWid) (m); Aesthetic quality (Aesthet) (0-20); Habitat index (PHI) (0-100); Thalweg depth (MThalDep) (cm); Mean width (MWidth) (m); Maximum depth (MaxDepth) (cm); Slope (Sl) (%); Average flow velocity (m/s); Woody debris count; Root count

Land use (in drainage area): % urban land uses (Urban); % agriculture + barren (Agribarr); % forest+wetland+water (Forwetwat)

Biological indices: Fish IBI (1-5); Benthic IBI (1-5); Hilsenhoff Index (0-10)


1.1.4. IBI prediction methodology using kNN

Due to the small computation time required, a leave-one-out cross validation procedure was

used. Thus, each individual observation was taken out of the database and compared against the

rest of the remaining observations one at a time. Once the first observation was compared to the

rest of the database, it was reintroduced into the database and the next observation was taken out

to repeat the process until all the observations were tested. With this method, there was no need

to separate a validation set because each point was validated against the remaining sites in the

database. Two different similarity functions were used: the Euclidean and the Mahalanobis

distances. Prior to the analysis with the Euclidean distance, the data were log-transformed and

scaled to the range [0, 1]. The steps followed are described below (also see Figure 1-1).

1. Best metric selection (using 1 and 10 closest neighbors): this step evaluated prediction

capability of each environmental variable alone by comparing the IBI value of the site being

tested (one-out) with the average IBI in the identified closest site/s (1 and 10). The variables

were then sorted for both cases separately (for k=1 and k=10) in decreasing order. The r² of the

linear regression between IBI scores and each environmental variable determined the variable

sorting. One (k =1) closest neighbor was used because by using the closest observation, the

extreme values would be predicted more accurately since few observations in the very low and

upper IBI ranges existed. With k = 10 , observations in the mid IBI range (with larger number of

observations) would be predicted more reliably, but not the extremes. Therefore, two lists of

sorted variables were obtained (with k=1 and k=10).


2. Step-wise predictions using variables from the k=1 sorted list: Following the variable

sorting obtained in step 1 (with k =1) a new variable at a time was introduced. The similarity

function was computed with the selected variables. The prediction was performed by finding the

IBI value (with k=1) or the mean IBI value (with k =2) of the most similar sites at each step. If

the IBI prediction with the newly added variable (with either k=1 or k=2) improved on the previous

one, the new variable was kept; otherwise it was discarded. When a new variable was added,

backtracking was performed. Therefore, previously included variables were excluded one at a

time to see how the predictions were affected. If the exclusion of an old variable improved the

prediction, that variable was eliminated from the model. The reason for backtracking was to

minimize the effect of the order with which the variables were included in the model, as

suggested by Jain and Dubes (1988).

3. Step-wise predictions using the variables from the k=10 sorted list: This step was implemented

in the same way as step 2, except that in this case the average IBI value from the 5 or 10 closest neighbors (k=5 or

k=10) was used for prediction.
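The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the customized application used in the research: a single value of k is used throughout (the text uses pairs such as k=1 and k=2), only the Euclidean distance is shown, and the score used to rank variables and to compare models is assumed to be the r² between observed and leave-one-out predicted IBI. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def loo_knn_predict(X, y, k):
    """Leave-one-out kNN: predict each site from the mean y of its k nearest other sites."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)  # the tested site cannot be its own neighbor
    nearest = np.argsort(d, axis=1)[:, :k]
    return y[nearest].mean(axis=1)


def loo_r2(X, y, k):
    """Assumed score: squared correlation between observed and predicted values."""
    pred = loo_knn_predict(X, y, k)
    if np.std(pred) == 0 or np.std(y) == 0:
        return 0.0
    return float(np.corrcoef(y, pred)[0, 1] ** 2)


def stepwise_knn(X, y, k):
    """Forward selection with backtracking over the columns of X."""
    # Step 1: score each variable alone and sort in decreasing order.
    single = [loo_r2(X[:, [j]], y, k) for j in range(X.shape[1])]
    order = [int(j) for j in np.argsort(single)[::-1]]
    selected, best = [order[0]], single[order[0]]
    # Steps 2-3: add one variable at a time, keeping it only if the score improves.
    for j in order[1:]:
        trial = selected + [j]
        score = loo_r2(X[:, trial], y, k)
        if not score > best:
            continue  # discard the new variable
        selected, best = trial, score
        # Backtracking: drop any previously included variable whose removal helps.
        for old in list(selected[:-1]):
            reduced = [v for v in selected if v != old]
            reduced_score = loo_r2(X[:, reduced], y, k)
            if reduced_score > best:
                selected, best = reduced, reduced_score
    return selected, best


# Synthetic example: only the first variable drives the response.
rng = np.random.default_rng(0)
X = rng.random((120, 4))  # variables already scaled to [0, 1]
y = 12 + 48 * X[:, 0] + rng.normal(scale=1.0, size=120)
selected, best = stepwise_knn(X, y, k=1)
```

Setting the diagonal of the distance matrix to infinity is what implements the leave-one-out validation: every site is predicted only from the remaining sites, so no separate validation set is needed.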


Figure 1-1. Flow-chart of the step-wise kNN prediction method. Dashed arrow lines represent the steps followed when the environmental variables are sorted with k =1. Dotted arrow lines represent the steps followed when the variable sorting is performed with k =10. Solid arrow lines depict common steps for both cases

1.1.5. IBI prediction using regression and SOM + regression

Prediction of IBI using multiple regression was performed at the state (or region in Maryland)

and the cluster (of sites) scales. The regression equations were obtained following a step-wise

methodology and using 75 percent of all the available observations in each database for model

development. The remaining 25 percent was kept for model validation. The observation subsets

were selected randomly. A diagram summarizing the different steps is presented in Figure 1-2

and described as follows.

No more variables

No

No

Yes

Yes

Environmental data

Variable selection and sorting using k = 1

Best variable

Improves model?

Analysis of the best environmental variables

Variable selection and sorting using k = 10

Add next variable

Predict with k= 1 and k =2

Predict with k =5 and k = 10

Improves previous model?

Discard variable Keep variable Backtrack

Discard oldvariable

Plot IBI predictions

21

1. Database clustering using the SOM (only in cluster-based predictions): Each of the

databases was clustered using all the available chemical and physical environmental variables

shown in Table 1-1. In Ohio, land use data was not used for clustering purposes because this

variable was measured on a very coarse scale (25 percent increments). Land use data was kept out of the clustering as a cautionary measure because it could negatively alter the SOM site distribution. The environmental data were converted to their natural logarithms and rescaled to the range [0, 1]

before training the SOM. The number of SOM neurons was determined based on the topographic

and quantization errors. The quantization error is the average distance (Euclidean) between each

data vector and its BMU, while the topographic error is the proportion of data vectors for which

the first and second closest SOM cells are not adjacent in the grid of neurons (Kiviluoto 1996).

A compromise between the two errors had to be made because the quantization error usually

tends to decrease as the number of SOM neurons increases, and a very large map size was undesirable given the available data. Hence, the maximum number of SOM neurons was limited to 100. In Ohio, a SOM with 60 (10 × 6) neurons was used. For Minnesota and Maryland, SOMs with 63 (9 × 7) and 54 (9 × 6) neurons were used, respectively.

The next step consisted of finding the optimum number of neuron clusters. The k-means

algorithm was used for this purpose (Manolakos et al., 2007). The optimal number of clusters

found using the Davies-Bouldin index (Davies and Bouldin 1979) was 3 in Ohio and Minnesota,

and 5 in Maryland.

2. Selection of a validation set: 25 percent of randomly selected observations in each cluster

were kept aside for validation. The remaining 75 percent was used to develop the regression


models. The validation sets used for the cluster-based and the state-based regressions were the

same in all cases.

3. Best metric selection (at state and cluster level): In the regression development datasets,

each one of the environmental variables was regressed linearly against the fish IBI score. The

environmental variables were then sorted in decreasing order based on the coefficient of multiple

determination (r2). An F-test at the 95% confidence level was performed in each case to check

the statistical significance of the regressions. Only variables that showed statistical significance

(p ≤ 0.05) were included in the model.

4. Linear correlation checking: The correlation coefficient (r) was calculated for each pair of

significant variables selected and sorted in the previous step. In cases in which the variable-variable r ≥ 0.85, the least discriminant variable (i.e., the one with the smaller IBI-variable r2) was removed because it was considered not to bring any new relevant information to the system.

5. Step-wise regression and backtracking: The regressions started with the best variable retained after step 4, and the next best one was added at each step. If the newly added variable increased the previous r2 it was kept; otherwise it was discarded and the next variable was tested. When a variable was introduced, linear and non-linear regression equations were evaluated, and the function that yielded the highest r2 was selected. Quadratic, logarithmic, exponential, inverse, S-curve, and power functions were the non-linear model forms tested. Backtracking was also performed in this case. Steps 2 through 5 were performed using the statistical software SPSS Version 15® for Windows.


6. Model validation: the equations obtained in step 5 were tested with the validation sets and

the IBI predictions plotted.
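The variable sorting and correlation check of steps 3 and 4 can be illustrated with a short sketch. It is not the SPSS procedure used in this research: the function names and toy values are hypothetical, the F-test for significance is omitted, and the absolute value of r is compared against the 0.85 threshold, which is an assumption. For a single-predictor linear regression, r2 equals the squared Pearson correlation, which is what the ranking uses.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / (s_aa * s_bb) ** 0.5  # undefined for constant sequences

def rank_and_filter(variables, ibi, r_max=0.85):
    """Sort variables by their r2 against IBI, then drop the weaker member
    of every pair whose mutual correlation reaches r_max."""
    ranked = sorted(
        variables,
        key=lambda name: pearson_r(variables[name], ibi) ** 2,
        reverse=True,
    )
    kept = []
    for name in ranked:
        # A variable survives only if it is not highly correlated with a
        # better-ranked variable that has already been kept
        if all(abs(pearson_r(variables[name], variables[k])) < r_max for k in kept):
            kept.append(name)
    return kept

# Toy values (assumed): Embed and Subs are nearly collinear, pH is not
variables = {
    "Embed": [1, 2, 3, 4, 5],
    "Subs": [1.1, 2.0, 3.2, 3.9, 5.1],
    "pH": [7, 6, 8, 6, 7],
}
ibi = [12, 20, 31, 39, 52]
print(rank_and_filter(variables, ibi))
```

Only one of the two collinear habitat variables survives the filter, so the step-wise regression in step 5 starts from a list without redundant predictors.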

Figure 1-2. Flow-chart of the step-wise multiple regression method. Dashed lines indicate steps for the cluster-based model only. Dotted lines indicate steps for the whole database model only. Solid lines are common steps for both methods.

[Flow-chart boxes: Whole environmental database; Data processing and normalization; Clusters: SOM + k-means; Variable sorting; Variable selection (selected variables: r ≤ 0.85 and p ≤ 0.05; discarded variables: r ≥ 0.85 and/or p > 0.05); Best variable; Improves model?; Add next variable; Improves model?; Keep variable; Discard variable; Backtracking; Discard old variable; No more variables; Analysis of significant variables at different scales; Plot IBI prediction; Plot validation set]

1.1.6. Chronic and acute toxic chemical effects

After the prediction models were developed, further fine-tuning was performed by adding a penalty on those sites in which the reported metal concentrations (only available in Ohio) were higher than the chronic exposure limit (CCC). Chemical toxicity does not act as a gradient along concentration. An effect on the biotic community is observed only if a specific threshold is reached. Regression and kNN are unlikely to identify the effects of variables that do not act as

gradients in large scale models. Variables acting as environmental gradients have a greater

overall impact on the biotic community and are more likely to be selected in the predictive

model. For this reason, and since it was deemed important to account for chemical toxicity, a

penalty was included in the calculated IBI when the CCC for some of the available metals was

reached. The penalty followed an exponential curve (see equations 3 and 4). Since no literature

relating IBI change to chemical toxicity was found, the penalty was arbitrarily set by finding the

penalty value that yielded a better fit. The chronic and acute (CMC) concentrations for each

metal were obtained using the EPA Water Quality Criteria (EPA 2008a).

$$P = \sum_{i=1}^{n} \left( e^{\alpha_i \times (CONC_i - CCC_i)} - 1 \right)$$ (Equation 3)

$$\alpha_i = \frac{\ln\left(1 + P_{CMC_i}\right)}{CMC_i - CCC_i}$$ (Equation 4)

Where P is the final penalty, n is the number of available metal concentration measurements, CONC_i is the observed concentration for metal i, P_CMC_i is the set penalty when the CMC concentration is reached, and α_i is a coefficient calculated given the boundary conditions of the equation (the penalty is 0 when CONC_i ≤ CCC_i, and it equals P_CMC_i, which was determined in each case, when CONC_i = CMC_i).
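A short numerical sketch of Equations 3 and 4 may help to check the boundary conditions. The function names and the concentrations below are hypothetical (they are not EPA criteria values), and metals at or below their CCC are assumed to contribute no penalty, as the first boundary condition states.

```python
import math

def alpha(ccc, cmc, p_cmc):
    """Equation 4: coefficient that makes the penalty equal p_cmc at CONC = CMC."""
    return math.log(1.0 + p_cmc) / (cmc - ccc)

def toxicity_penalty(metals):
    """Equation 3: sum of exponential penalties over the measured metals.

    `metals` is a list of (conc, ccc, cmc, p_cmc) tuples, one per metal.
    """
    total = 0.0
    for conc, ccc, cmc, p_cmc in metals:
        if conc <= ccc:
            continue  # boundary condition: no penalty at or below the CCC
        total += math.exp(alpha(ccc, cmc, p_cmc) * (conc - ccc)) - 1.0
    return total

# Hypothetical metal with CCC = 10, CMC = 20 and a set penalty of 4 IBI units at the CMC
for conc in (5.0, 10.0, 15.0, 20.0):
    print(conc, round(toxicity_penalty([(conc, 10.0, 20.0, 4.0)]), 3))
```

With these assumed values the penalty is 0 up to the CCC, grows exponentially between the two criteria, and reaches the set value of 4 exactly at the CMC.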


1.2. Results and discussion

1.2.1. IBI predictions using kNN (k =1 or k= 2)

Minnesota

In this state, the Euclidean distance performed better than the Mahalanobis (r2 = 0.53 and 0.42

respectively with k=1 ). In both cases, total nitrogen (TN) and percent disturbed land use in the

riparian buffer (LU) were among the most significant variables, but not necessarily included in

the final prediction model. With the Mahalanobis function these variables were discarded in the backtracking process, and the function required only two variables to yield its best possible prediction (see

Table 1-2). The variables used in the best model (Euclidean distance) were related to nutrient

loads, land use patterns, stream variability, substrate quality, and channel morphology, which

agreed strongly with the variables obtained in the regression model for the whole state (Table

1-3).

Maryland

Both proximity functions performed very similarly, achieving equal final results (r2 = 0.54 with

k = 2 in both cases) and identified very similar significant environmental variables (see Table

1-2). Land use patterns in the drainage area and alkalinity were key parameters in both cases (as in the whole dataset regression model), and so was the PHI (unlike the regression model). The rest of the selected variables were different. The Mahalanobis function identified aesthetic quality (Aesthet) as an important parameter, as did the regression model. Even though this is a qualitative parameter, it seems to have high predictive capability in Piedmont regions.


Ohio

The Mahalanobis function outperformed the Euclidean (r2 = 0.51 and 0.47 respectively with k

=1 in both cases). Habitat parameters (Substrate, Riffle, and Cover) were able to explain a very

large portion of the total biologic variability in both cases. Land use in the riparian corridor was

also found important in both cases. The significant habitat variables found with the distance

functions in Ohio agree again with those found relevant with the regression approach (see Table

1-3). No chemical parameters were selected in this case, with the exception of copper using the

Euclidean distance function.

1.2.2. IBI predictions using kNN (with k =5 or k = 10)

Minnesota

Again, the Euclidean proximity function performed better than the Mahalanobis (r2 = 0.54 and 0.48 respectively with k = 5). Land use patterns in the riparian corridor also explained a large percentage of the total biotic variability (around 40 percent in both cases). However, land use related variables were removed from the prediction with the Euclidean function after backtracking, which might indicate that other included variables (TN and Cond.) could be strongly related to land use patterns and also account for new information from other non-land use related stressors (i.e., point sources). This suggests that water quality is the main stressor in Minnesota’s dataset, especially for heavily degraded sites (Southern watersheds).

With the Mahalanobis function, the land use score (LU) was included and the top chemical variables (TN, P, TSS) were removed. This difference between the two distance functions could be explained by the use of the covariance matrix (see Eq. 2), which acts as a whitening transformation by eliminating parameters with high correlation. Since LU was the top metric in the step-wise variable sorting process, the subsequently added variables that were highly correlated with it were eliminated accordingly.
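The effect of the covariance matrix can be illustrated with a two-variable sketch. The helper names and site values are assumptions made for illustration; the distance follows the standard Mahalanobis definition with the sample covariance, which is assumed to correspond to Eq. 2.

```python
import math

def covariance_2d(points):
    """Sample variances and covariance (sxx, sxy, syy) of two variables."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(a, b, cov):
    """sqrt(d^T S^-1 d) using the closed-form inverse of the 2x2 covariance S."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

def euclidean_2d(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Two strongly correlated variables (assumed values), e.g. a land use score
# and a nutrient concentration that rises with it
sites = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
cov = covariance_2d(sites)

along = ((2, 2), (4, 4))    # sites differ along the shared trend
across = ((2, 4), (4, 2))   # sites differ against the shared trend
for a, b in (along, across):
    print(euclidean_2d(a, b), mahalanobis_2d(a, b, cov))
```

Both pairs are equally far apart in Euclidean terms, but the Mahalanobis distance is much smaller for the pair that differs along the shared trend. A second variable that mostly repeats the first therefore adds little to the Mahalanobis similarity, which is consistent with the correlated chemical variables being dropped once LU was in the model.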

Maryland

As in the previous case, both functions performed almost identically in terms of predicting

capability (r2 = 0.59 in both cases with k = 10). However, it is remarkable that the Mahalanobis function needed only two variables (agricultural/barren land and velocity-depth variability) to achieve such a result, while the Euclidean needed six. Both variables used with the Mahalanobis

function were also used in the regression model and in the Euclidean distance function. Other

variables used in the regression model (i.e. ANC) were also included with the Euclidean

distance.

Ohio

Unlike in the previous case, the Euclidean distance obtained better overall results (r2 = 0.53 versus

0.47 with k = 5). The Mahalanobis distance needed only five variables versus nine needed by the

Euclidean, which matched the variables selected with the regression model very well. The

Mahalanobis function only needed three physical habitat parameters (embeddedness, riffle and

pool quality) to explain a large part of the total variability (44 percent). The remaining variability

was explained with the inclusion of cadmium and copper concentrations. Cadmium was also

identified as an important variable in the cluster-based regression predictions (cluster number 2,

see Table 1-3) and in research conducted by Dyer et al. (2000). The inclusion of metal toxicity

penalties helped slightly improve the overall model performance in Ohio (best previous model r2

= 0.51).


Location  Similarity function  k   RMSE*  r2    Variables used
OH        Mh                   1   7.28   0.51  Subs, Riffle, Cov, Embed, Forwet, Urban
OH        Mh                   5   7.06   0.47  Riffle, Pool, Embed, Cd, Cu
OH        Eu                   1   7.77   0.47  Pool, Riffle, Cov, Subs, Rip, Cu, Urban
OH        Eu                   5   6.76   0.53  Riffle, Pool, Subs, Rip, Embed, SO4, TKN, pH, Urban
MN        Mh                   1   21.74  0.41  TN, PctWoody
MN        Mh                   5   20.52  0.48  LU, Rip, Cond, MFineDep, PctCov
MN        Eu                   1   19.43  0.53  LU, TN, Channel, PctBoulder, PctRun, PctRiffle
MN        Eu                   5   19.42  0.54  TN, Cond, TSS, PctUnderCut, MDepth
MD        Mh                   2   0.66   0.54  Agribarr, Urban, PHI, ANC, Aesthet, SO4, DOC
MD        Mh                   10  0.63   0.59  Agribarr, Vel-dpth
MD        Eu                   2   0.67   0.54  Agribarr, ANC, Urban, PHI, Sl, DO
MD        Eu                   10  0.63   0.59  Agribarr, Urban, ANC, PHI, Vel-dpth, DOC

* RMSE for different IBI predictions. IBI scales: 12 to 60 in Ohio, 0 to 100 in Minnesota, 1 to 5 in Maryland.

Table 1-2. Summary of IBI predictions using the kNN methodology. The different functions (Mh = Mahalanobis; Eu = Euclidean) and selected number of closest neighbors (k) are specified. Final selected variables in each case are also listed.
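Because the three IBI scales differ, the RMSE values in Table 1-2 are not directly comparable between states. One possible way to put them on a common footing, shown here only as an illustrative calculation that was not part of the original analysis, is to divide the lowest RMSE reported for each state by the width of its IBI scale. The dictionary and function names are hypothetical; the numbers are taken from Table 1-2 and its footnote.

```python
# IBI scoring ranges from the Table 1-2 footnote
ibi_range = {"OH": (12, 60), "MN": (0, 100), "MD": (1, 5)}

# Lowest kNN RMSE reported for each state in Table 1-2
best_rmse = {"OH": 6.76, "MN": 19.42, "MD": 0.63}

def relative_rmse(rmse, low, high):
    """RMSE expressed as a fraction of the width of the IBI scale."""
    return rmse / (high - low)

relative = {
    state: relative_rmse(best_rmse[state], *ibi_range[state])
    for state in best_rmse
}
for state, value in relative.items():
    print(f"{state}: {100 * value:.1f} percent of the IBI range")
```

Under this normalization the best kNN errors fall between roughly 14 and 19 percent of the respective IBI ranges, with Ohio lowest and Minnesota highest.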

1.2.3. Regression models

Minnesota

Whole state predictions

The metrics able to explain the largest part of the total variability were conductivity, total

nitrogen, TSS, land use score, mean width and mean thalweg depth. These variables explained

68 percent of the total variability in the equation development set and 52 percent in the validation

one. In the latter, the main source of error was those sites with very poor IBI scores (IBI = 0). These observations corresponded to highly urbanized areas located in the Southern part of the state (see cluster 1 in Figure 1-3). Water chemical quality, rather than the existing physical conditions, is the main cause of the severely impaired biotic integrity in those sites. Because Minnesota’s database lacks metal concentrations, a toxicity penalty could not be included. Other selected variables, such as stream width and thalweg depth, were deemed relevant in less degraded areas (Northern watersheds) and contributed to an overall better fit of the regression in the development dataset, but led to misprediction in highly urbanized areas during validation (Table 1-3).

Cluster-based predictions

The SOM yielded three clusters that were very clearly separated geographically (see Figure 1-3).

The observations located in the Southern areas (cluster 1) showed significantly lower IBI and

habitat quality, with higher accumulations of fine sediment in the substrate and larger percentage

of disturbed land use in the riparian strip. Those areas were also associated with higher total

nitrogen and TSS concentrations as well as higher conductivity. Clusters 2 and 3 (located in the

Upper Mississippi and St. Croix River Basins) had similar IBI scores and chemical quality. The

main differences between them were due to physical habitat quality, especially substrate and

channel characteristics.

The IBI cluster-based prediction significantly improved on the results obtained using the entire state

database. The variables selected (different for each cluster) were able to explain 80 percent of the

total variability in the regression development set and 59 percent in the validation one.

Identification of sites with similar environmental characteristics helped improve biotic integrity

prediction. Effects of more local stressors could be better identified using sub-datasets with

similar environmental properties. In larger scales, their effects are blurred by other more

ubiquitous variables. Stressors like nutrient loads or land use patterns still explain a big part of

the biotic variability at the cluster-level. However, segregation of the points that were


mispredicted using the whole set (mainly those in cluster 1) led to significant improvement

despite the lack of metal toxicity data. The variables used in the prediction model in each cluster

are shown in Table 1-3.


Maryland, Piedmont Areas

Whole dataset predictions

In this case, the largest part of biotic variability was explained with five variables: percentage of

agriculture/barren lands in the drainage area, alkalinity, aesthetic quality, velocity-depth

variability, and mean width. The total variability explained was 66 percent in the regression

development dataset and 59 percent in the validation set. Other chemical variables found of

relevance in previous models (mainly total nitrogen and conductivity) were within the top seven

best metrics in Maryland and had a high degree of correlation with ANC (r=0.75) and

agricultural land uses (r=0.69) respectively.

Cluster-based predictions


Five clusters of SOM neurons were identified. As shown in Figure 1-3, clusters 4 and 5 were

concentrated in a specific region. Cluster 1 IBI scores were significantly higher than the rest,

while cluster 5 scores were significantly lower. Clusters 2, 3 and 4 showed similar median IBI

values, with wide ranges and overlapping among them. For this reason, and because a minimum

number of observations is required for each cluster to develop the equations, clusters 2, 3 and 4

were merged. In this case, the regression explained up to 71 percent of the total variability in the development dataset and 62 percent in the validation dataset.


The best predictive parameters for cluster 1 were (in decreasing order): ANC, velocity-depth variability, urban land use, agricultural/barren land uses, and shading. These results strongly agree with the metrics found with the whole dataset, which are large-scale variables with the exception of shading. Shading is a variable considered important only at the local scale. The best metrics in clusters 2, 3 and 4 were (in decreasing order): agricultural and barren land uses, ANC,

PHI, percentage of channel covered by flow, and, again, shading. The best metrics in cluster 5

were aesthetic quality, urban land use, conductivity, and epifaunal substrate quality.

Ohio

Whole state predictions

Forty-seven percent of the total variability was explained with the regression development set

and 41 percent with the validation set. The top metrics used in the regressions were (in

descending order): embeddedness, substrate quality, pool quality, sulfate concentration, hardness

and TKN. The top seven variables were habitat-related, while sulfate concentration, BOD,

conductivity, arsenic concentration, hardness and TKN were the top chemicals. The results

obtained in this case are very similar to those obtained by Dyer et al. (2000) who identified

habitat and morphological parameters as the most important variables affecting biotic integrity

with chemical variables playing a more secondary role at the state scale.

Cluster-based predictions


Three very prominent clusters of SOM neurons were found. The IBI distribution was

significantly different in all three, with cluster 1 having the highest IBI scores and cluster 3 the lowest (see Figure 1-3). The sites within one watershed usually belonged to only one cluster, with few exceptions. The cluster-based predictions significantly outperformed the non-clustered model in the model development set (r2 = 0.62). However, the validation results (r2 = 0.44) were very similar to those of the non-clustered model.

Cluster 1 was dominated by habitat metrics, especially those related to substrate quality

(embeddedness and substrate quality were within the top variables). Sulfate concentration, pH

and conductivity were the top chemical variables and in the top ten overall. The variables

included in the regression model for cluster 1 are shown in Table 1-3.

Cluster 2 biotic quality was again dominated by habitat-related variables (riffle, pool, and

substrate qualities). However, the chemical variables differed significantly with zinc, nitrate,

nitrite, and cadmium concentrations as the most relevant. Hardness, pH, conductivity, and sulfate

concentrations were not among the top chemical variables unlike cluster 1.

In cluster 3, biotic integrity was clearly driven by water quality. Eutrophication problems seemed

to be the main environmental impact, with BOD and nutrient input among the top predictors. Zinc was again an important stressor. Conductivity and sulfate were also important variables in the model. Riparian quality was the only non-chemical parameter that placed in the top ten in the variable selection process, which might be an indication of the importance of functional stream buffers in severely impaired areas. The importance of the riparian buffer quality in heavily degraded sites was also identified in Minnesota’s cluster 1 (see Table 1-3). Interestingly, water quality was also the most important biotic integrity driver in that case. A functional riparian corridor is of capital importance in regulating sediment and chemical delivery from the surrounding lands, especially in severely impaired sites.

Figure 1-3. Top, site cluster distribution in Minnesota (left), Maryland (Piedmont sites) (center), and Ohio (right). In Minnesota, cluster 1 is concentrated in Southern watersheds. In Maryland, clusters 4 and 5 are concentrated in a specific region and, in Ohio, sites located in the same watershed usually belong to the same cluster. Bottom, Self-organizing Map neuron lattice and box plots with the cluster-based IBI values. The red line in the box plots represents the median cluster value, the top line is the 75th percentile, and the bottom line is the 25th percentile.

Scale                  MN                            MD                            OH

Whole database         Cond (-3.1 × 10^-5, Q)        Agribarr/10 (-.035, .488, Q)  Embed (2.397, -17.843, Q)
                       TN (.066, -2.58, Q)           ANC (-.00048, L)              Subs (.273, L)
                       TSS (-.235, L)                Aesthet (-.006, .15, Q)       Pool (.563, L)
                       MWidth (-27.6, I)             Vel-dpth (.049, L)            SO4 (-3.957, Lg)
                       MThalDep (-496.62, I)         MWidth (.02, L)               Hard/100 (-.264, 3.031, Q)
                       PctRun (-.007, .77, Q)                                      TKN (.846, I)
                       LU+1 (9.5, Lg)                                              Cu (-.087, .749, Q)
                       Constant (56.81)              Constant (.597)               Constant (66.260)
                       r2 = .68                      r2 = .66                      r2 = .47
                       r2 validation = .52           r2 validation = .58           r2 validation = .41
                       RMSE* = 27.16                 RMSE* = 0.71                  RMSE* = 7.23

Cluster 1              MN                            MD²                           OH
                       PctDistLU (0.0015, Q)         ANC (-0.069, Q)               Embed (-4.541, L)
                       TSS (-11.5, Lg)               Vel-Dpth (.182, L)            SO4 (-.031, L)
                       PctOverVeg+1 (50.97, I)       Agribarr (-.081, Q)           QHEI (-967.42, I)
                       Rip (0.273, Q)                Urban (-.046, .1, Q)          pH (-13.80, 219.31, Q)
                                                     Shading (.121, Q)             TKN (.686, -5.342, Q)
                                                                                   Hard (.018, L)
                                                                                   Chan (-.098, 2.742, Q)
                                                                                   NO3 (.01, -.610, Q)
                                                                                   NO2 (9.887, L)
                       Constant (10.66)              Constant (3.84)               Constant (-824.408)

Cluster 2¹             MN                            MD²                           OH
                       TN (-0.398, Q)                Agribarr (-.178, .159, Q)     Riffle (1.703, Q)
                       PctEmbed (-0.367, L)          ANC (-.223, L)                Pool (.649, L)
                       MThalDep (-314.2, I)          PHI (.293, L)                 Zn (-.85, .019, E)
                       MDepth (1075.9, I)            CH_Flow (.188, L)             Subs (.497, L)
                       PctPool+1 (-145.7, I)         Shading (-.181, L)            NO3 (.101, -2.074, Q)
                                                                                   NO2 (-293.3, 60.71, Q)
                                                                                   Cd (4.097, L)
                                                                                   SO4 (-1.897, Lg)
                       Constant (113.81)             Constant (3.51)               Constant (29.256)

Cluster 3³             MN                            MD                            OH
                       Temp (-822.4, S)              Aesthet (.075, 1.05, P)       BOD (23.295, I)
                       Sl (-1.48, L)                 Urban (-.67, Lg)              Cond (-.001, L)
                       MThalDep (-623.04, I)         Cond (-7.08 × 10^-7, Q)       Zn (.001, -.124, Q)
                       TN (-6.69, E)                 EpiSub (.007, .23, E)         TKN (117.23, .021, P)
                                                                                   SO4 (291.707, I)
                       Constant (957.72)             Constant (4.075)              Constant (-98.829)

Cluster-based model    r2 = .81                      r2 = .71                      r2 = .62
(all clusters)         r2 validation = .59           r2 validation = .62           r2 validation = .44
                       RMSE* = 25.81                 RMSE* = 0.66                  RMSE* = 6.78

¹ Clusters 2, 3, and 4 in Maryland; ² regressions performed with standardized values; ³ Cluster 5 in Maryland.
* RMSE for different IBI predictions. IBI scales: 12 to 60 in Ohio, 0 to 100 in Minnesota, 1 to 5 in Maryland.
L = linear; E = exponential; Lg = natural log; I = inverse; Q = quadratic; P = power; S = S-curve.

Table 1-3. Summary of the step-wise regressions for IBI prediction for the development and validation sets. The variables used in each case are listed together with their coefficients and curve type (in parentheses). Variables in italics in the whole database regressions indicate variables also used in some of the kNN predictions. Results in Ohio after including metal toxicity penalties.


1.3. Conclusions

• Biotic integrity is the result of both offstream and instream factors that generally behave in a highly non-linear manner. Offstream (allochthonous) variables, such as land use patterns in the drainage area or the riparian corridor, account for the largest part of the total variability, especially when predictions are performed at larger scales. Instream environmental parameters are also able to explain a large part of the variability. Habitat parameters, nutrient and sulfate concentrations, conductivity, alkalinity, hardness and some morphologic variables are equally important and potentially good indicators of stream health by themselves or when combined. Most of these variables depend on regional (catchment) conditions. Some local parameters (e.g. instream shading, high metal concentrations) showed an important effect when the scale was reduced or some variations were made to account for their impacts. Hence, the models confirmed that biotic integrity is sensitive to different stressors acting at different geographic scales.

• SOM-based clustering of sites successfully improved biotic integrity predictions in all three

states. The improvement was generally more noticeable in the model development datasets than

in the validation datasets. Finding the optimum degree of similarity among sites, while having

enough observations to develop and validate a regression model is paramount to extract the

maximum possible knowledge out of the available data.

• Since very few sites had concentrations of some metal above the chronic toxicity threshold,

the models were generally unable to identify their effect on biotic integrity because the overall

effect was small. Metals have an important effect only when a threshold is met. Other variables


such as habitat parameters or nutrient concentration (especially nitrogen) behaved more as

gradients along their measurement ranges.

• In Ohio, the inclusion of a penalty to account for toxic effects on IBI improved the model

predictions. The inclusion of such penalty in Minnesota and Maryland might also improve the

IBI predictions, but metal concentration data was not available. In Minnesota, the most

significant mispredictions in both the clustered and non-clustered models occurred when extremely

poor IBI observations were present. The clustered model helped improve the predictions but

those sites were still the main source of error. Chemical toxicity is suspected to be the cause for

those extremely poor IBI values.

• An attempt to predict biotic integrity using similarity functions was performed. The results

were quite encouraging. One of the main advantages of using such techniques is that they need

very little computational time. For this reason, the predictions could be easily validated with a

leave-one-out approach instead of splitting the data into a model development and a validation

dataset. This methodology proved to be robust with crudely-scaled or qualitative data such as

Ohio’s land use and habitat parameters. In all three states, more than 50 percent of the total biotic

variability was explained without need of previous clustering.

• Two different types of proximity functions were tested for IBI prediction purposes: the

Mahalanobis and the Euclidean functions. In general, the Euclidean distance function obtained

better results (higher r2). However, to achieve such results it used a larger number of variables

than the Mahalanobis function. The Mahalanobis function achieved similar or equal results

using fewer variables. The covariance matrix used in the calculation of this function is

considered a whitening transformation that eliminates correlation. Therefore, variables that did

not bring any new relevant information to the system were discarded.


• With the kNN technique, the optimum number of closest neighbors for prediction purposes

depends on the available data distribution. The number of existing observations along the IBI

scoring scale determines the model prediction performance in each IBI range. When very few

observations exist (i.e. for the extreme IBI values), IBI is predicted more accurately using lower

k values, assuming that some other observation in the database has similar environmental

conditions. If a high k value is used in such cases, a “smoothing effect” will occur because some

of the selected neighbors will be quite different from the target site. Determining a Maximum

Allowable Distance Threshold between the site being predicted and its neighbors could be a

solution to finding the optimum k in each case. The larger the number of observations in each

data range and the more relevant environmental variables to be compared against, the more

reliable the predictions will be.
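The Maximum Allowable Distance Threshold suggested in the last conclusion could take the following form. This is only a sketch of one possible implementation, not a method that was tested in this research: the function name, the fallback to the single closest site, and the toy values are assumptions.

```python
import math

def knn_with_threshold(X, y, query, k_max, max_dist):
    """Average the IBI of up to k_max neighbors that lie within max_dist.

    Returns the prediction and the number of neighbors actually used.
    """
    # Euclidean distance from the query site to every site in the database
    dists = sorted((math.dist(row, query), i) for i, row in enumerate(X))
    nearest = dists[:k_max]
    # Drop neighbors that are too dissimilar from the target site
    kept = [(d, i) for d, i in nearest if d <= max_dist]
    if not kept:
        kept = nearest[:1]  # assumed fallback: use the single closest site
    prediction = sum(y[i] for _, i in kept) / len(kept)
    return prediction, len(kept)

# Toy database (assumed): three degraded sites and two high-quality sites
X = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [0.9, 0.9], [1.0, 1.0]]
y = [12, 14, 16, 50, 56]

print(knn_with_threshold(X, y, [0.05, 0.0], k_max=5, max_dist=0.3))
print(knn_with_threshold(X, y, [1.2, 1.2], k_max=5, max_dist=0.1))
```

In the first call only the three similar sites are averaged even though k_max is 5, which avoids the smoothing effect described above. In the second call no site lies within the threshold, so the effective k drops to 1.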


2. Chapter 2: Large-scale biologic integrity prediction based on

environmental similarity using instream data and regional and local offstream

characteristics

2.1. Methodology

2.1.1. Data and study area

The research was based on 429 observations in the state of Ohio. Each observation had biologic

integrity measurements along with instream and offstream environmental characteristics. The

biologic and instream data were collected and compiled by the Ohio Environmental Protection

Agency between years 1996 and 2000. The offstream data were obtained using a Geographic

Information System (GIS). The biological and instream environmental data were collected in the

same stream segment with no more than a 5-day time difference. To our knowledge, all data

were collected in base-flow conditions and extreme events (e.g., a spill) were not reported.

Biologic integrity was measured with the fish Index of Biologic Integrity. This is a discrete

score ranging from 12 (very poor biologic integrity) to 60 (excellent biologic integrity). The IBI

is composed of 12 metrics that describe the species richness and composition, the trophic

composition, and the fish abundance and condition of the fish community (Karr et al. 1986;

Ohio_EPA 1987).

Instream variables consisted of water quality and habitat quality metrics (Table 2-2). The habitat

parameters consisted of the metrics from the Qualitative Habitat Evaluation Index (QHEI)


(Rankin 1989). The QHEI and its metrics are discrete scores with different ranges of

measurement (see Table 2-2). Another score quantifying the percentage of fine sediment in the

river bed (embeddedness) was also available (embeddedness is not used as a QHEI metric itself,

but as a penalizing factor for the QHEI’s substrate and channel quality metrics).

Offstream environmental variables were grouped in three main categories: upstream land use,

stream fragmentation, and point source information. In order to obtain the upstream land use,

each site’s watershed was delineated using a 30-meter resolution Digital Elevation Map (DEM)

with ArcGIS Spatial Analyst. Subsequently, the percentage of each upstream land use was

calculated at two different scales: the regional scale referred to the whole upstream segment, while the local scale referred only to the 2 miles upstream of the observation site. Land uses for the whole catchment and the 100 and 30-meter buffers were obtained for both scales. A 30-meter width was chosen because this was the minimum possible distance given the data resolution, and it is beyond the minimum recommended width of 15 meters, which is effective under most conditions (Castelle et al. 1994). A 100-meter width was chosen because this is an intermediate value between 3 and 200 meters, the minimum and maximum effective widths depending on site-specific conditions according to Castelle et al. (1994). Land use percentages were obtained using the Thematic Raster Summary function within Hawth’s Analysis Tools for ArcGIS (Beyer 2004). The land cover categories defined in the 2001 National Land Cover Dataset (NLCD) were used (USGS 2008b). Sixteen different land use categories existed in the area of study and are listed in Table 2-1.

The Open Water (OW) land use category was only calculated for the drainage and catchment

areas, not for the buffers. Drainage areas (DA) for each site were also obtained. The

fragmentation and point source metrics (Table 2-2) were calculated using the National

41

Hydrography Datasets (NHD) (USGS 2008a). The ArcGIS Utility Network Analyst was used to

trace upstream or downstream of a specific site. Major dams (only with DA ≥ 2.59 Km2) and point

sources (major and minor waste water treatment plants and major industrial dischargers) were

obtained from the National Inventory of Dams (USACE 2005) and the Permit Compliance

System database (EPA 2008c).
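The land use percentages described above amount to counting, for each zone (catchment or buffer), the raster cells that fall in each land cover class. The following is a minimal sketch of that tabulation on a toy grid. It is not the Hawth's Analysis Tools implementation; the function name, the boolean mask layout, and the small grid are assumptions made for illustration only. The class codes are a subset of the NLCD 2001 legend.

```python
from collections import Counter

# NLCD 2001 class codes (a subset) mapped to the short names used in Table 2-1.
NLCD_NAMES = {11: "OW", 21: "DevO", 22: "DevL", 41: "ForD", 81: "Hay", 82: "Crop"}


def land_use_percentages(grid, mask):
    """Percentage of each land cover class among the cells where mask is True.

    grid: 2-D list of NLCD class codes (one code per 30-meter cell).
    mask: 2-D list of booleans marking the zone (catchment or buffer).
    """
    counts = Counter(
        code
        for grid_row, mask_row in zip(grid, mask)
        for code, inside in zip(grid_row, mask_row)
        if inside
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {NLCD_NAMES.get(code, str(code)): 100.0 * n / total
            for code, n in counts.items()}


# Toy 3 x 4 raster and a hypothetical buffer mask covering the two left columns.
grid = [
    [82, 82, 41, 41],
    [82, 81, 41, 11],
    [82, 21, 82, 82],
]
buffer_mask = [[col < 2 for col in range(4)] for _ in range(3)]

whole = land_use_percentages(grid, [[True] * 4 for _ in range(3)])
near = land_use_percentages(grid, buffer_mask)
```

Running the same function with a catchment mask and with 100- and 30-meter buffer masks would give the three sets of percentages per scale that Table 2-1 reports.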

The observations were distributed in the whole state of Ohio. Most of the observations belonged

to either the Eastern Corn Belt Plains (ECBP), the Huron/Erie Lake Plains (HELP), or the

Erie/Ontario Lake Plains (EOLP) ecoregions with 180, 73, and 100 observations respectively.

The Western Allegheny Plateau (WAP) and the Interior Plateau (IP) ecoregions only had 36 and

40 observations respectively. The HELP and ECBP ecoregions have the highest nutrient

background concentrations, the EOLP and IP ecoregions have intermediate levels of nutrients,

while the WAP ecoregion has the lowest levels (Rankin et al. 1999).

Drainage areas were also very diverse, ranging from 1.55 to 16,428 Km2. In this research, sites

were not subdivided by ecoregion or stream size and were introduced into the model all at once.

Since the model was based on assessing environmental similarities, groups with higher/lower

concentrations of nutrients and/or ionic strength would tend to group together if chemical

stresses were a major source of overall biologic variability. Moreover, we wanted to “let the data

speak”.

Figure 2-1. From left to right and top to bottom: (1) upstream stream network carrying waste water; (2) upstream stream network fragmentation; (3) basin-scale dams in the downstream main channel; (4) basin-scale stream network fragmentation

Table 2-1. Description, percentage quartiles, and individual IBI predicting power for the different NLCD land use categories present in the Ohio database. RDA = drainage area; R100 = regional 100-meter buffer; R30 = regional 30-meter buffer; LDA = local catchment area; L100 = local 100-meter buffer; L30 = local 30-meter buffer area; a = best prediction at 423 branches; b = best prediction at 328 branches; c = best prediction at 233 branches

Name | Description | Quartiles | R2 | Name | Quartiles | R2 | Name | Quartiles | R2
RDA_OW | Open water | 0.10 - 0.25 - 0.60 | 0.292a | ---- | ---- | ---- | ---- | ---- | ----
RDA_DevO | Urban Open Space | 5.23 - 6.25 - 9.76 | 0.212a | R100_DevO | 5.26 - 6.84 - 10.33 | 0.300a | R30_DevO | 4.44 - 6.26 - 10.01 | 0.234a
RDA_DevL | Urban low intens. | 1.22 - 2.39 - 6.06 | 0.293a | R100_DevL | 0.94 - 1.85 - 3.79 | 0.274a | R30_DevL | 0.79 - 1.51 - 3.48 | 0.245a
RDA_DevM | Urban medium intens. | 0.24 - 0.64 - 1.64 | 0.253a | R100_DevM | 0.10 - 0.40 - 0.87 | 0.275a | R30_DevM | 0.07 - 0.26 - 0.62 | 0.259a
RDA_DevH | Urban high intens. | 0.07 - 0.26 - 0.75 | 0.223a | R100_DevH | 0.00 - 0.13 - 0.38 | 0.320a | R30_DevH | 0.00 - 0.07 - 0.20 | 0.198a
RDA_Bar | Barren | 0.00 - 0.01 - 0.05 | 0.239a | R100_Bar | 0.00 - 0.00 - 0.02 | 0.184b | R30_Bar | 0.00 - 0.00 - 0.01 | 0.182a
RDA_ForD | Deciduous forest | 5.29 - 9.45 - 18.60 | 0.281a | R100_ForD | 8.22 - 16.06 - 30.56 | 0.292a | R30_ForD | 9.82 - 20.90 - 38.01 | 0.368a
RDA_ForE | Evergreen forest | 0.01 - 0.08 - 0.25 | 0.243a | R100_ForE | 0.00 - 0.09 - 0.24 | 0.285a | R30_ForE | 0.00 - 0.05 - 0.20 | 0.225a
RDA_ForM | Mixed forest | 0.00 - 0.00 - 0.03 | 0.294b | R100_ForM | 0.00 - 0.00 - 0.05 | 0.261a | R30_ForM | 0.00 - 0.00 - 0.04 | 0.232a
RDA_Shr | Shrub/scrub | 0.00 - 0.00 - 0.03 | 0.221a | R100_Shr | 0.00 - 0.00 - 0.05 | 0.223a | R30_Shr | 0.00 - 0.00 - 0.02 | 0.208a
RDA_Herb | Herbaceous | 0.35 - 0.89 - 1.38 | 0.225a | R100_Herb | 0.29 - 1.00 - 1.81 | 0.296a | R30_Herb | 0.23 - 1.00 - 2.08 | 0.312b
RDA_Hay | Hay/pasture | 3.16 - 7.67 - 14.60 | 0.385a | R100_Hay | 2.97 - 8.20 - 13.60 | 0.322a | R30_Hay | 3.13 - 7.35 - 12.57 | 0.312a
RDA_Crop | Crops | 40.49 - 60.84 - 75.58 | 0.257a | R100_Crop | 34.65 - 54.32 - 70.27 | 0.231a | R30_Crop | 30.69 - 50.23 - 64.82 | 0.304a
RDA_WetW | Woody wetlands | 0.00 - 0.04 - 0.20 | 0.223a | R100_WetW | 0.00 - 0.13 - 0.58 | 0.270a | R30_WetW | 0.00 - 0.26 - 0.95 | 0.242a
RDA_WetH | Herbaceous wetlands | 0.00 - 0.02 - 0.09 | 0.261b | R100_WetH | 0.00 - 0.03 - 0.20 | 0.236b | R30_WetH | 0.00 - 0.01 - 0.36 | 0.172b
RDA_Oth | Other | 0.00 - 0.00 - 0.00 | 0.012a | R100_Oth | 0.00 - 0.00 - 0.00 | 0.012a | R30_Oth | 0.00 - 0.00 - 0.00 | 0.012a
LDA_OW | Open water | 0.00 - 0.19 - 0.96 | 0.200b | ---- | ---- | ---- | ---- | ---- | ----
LDA_DevO | Urban Open Space | 4.97 - 7.22 - 13.73 | 0.289a | L100_DevO | 4.17 - 7.80 - 14.68 | 0.208a | L30_DevO | 3.00 - 6.42 - 15.23 | 0.158a
LDA_DevL | Urban low intens. | 0.42 - 2.42 - 11.19 | 0.183a | L100_DevL | 0.30 - 2.06 - 6.99 | 0.272b | L30_DevL | 0.00 - 1.20 - 6.59 | 0.186a
LDA_DevM | Urban medium intens. | 0.00 - 0.27 - 2.01 | 0.159a | L100_DevM | 0.00 - 0.00 - 1.78 | 0.107b | L30_DevM | 0.00 - 0.00 - 1.12 | 0.100a
LDA_DevH | Urban high intens. | 0.00 - 0.00 - 0.80 | 0.140b | L100_DevH | 0.00 - 0.00 - 0.16 | 0.048b | L30_DevH | 0.00 - 0.00 - 0.00 | 0.071c
LDA_Bar | Barren | 0.00 - 0.00 - 0.00 | 0.020a | L100_Bar | 0.00 - 0.00 - 0.00 | 0.005b | L30_Bar | 0.00 - 0.00 - 0.00 | 0.004b
LDA_ForD | Deciduous forest | 4.47 - 13.59 - 28.73 | 0.285b | L100_ForD | 7.39 - 24.43 - 46.14 | 0.335a | L30_ForD | 7.33 - 29.34 - 53.65 | 0.334a
LDA_ForE | Evergreen forest | 0.00 - 0.00 - 0.25 | 0.098a | L100_ForE | 0.00 - 0.00 - 0.15 | 0.064a | L30_ForE | 0.00 - 0.00 - 0.00 | 0.066a
LDA_ForM | Mixed forest | 0.00 - 0.00 - 0.00 | 0.077a | L100_ForM | 0.00 - 0.00 - 0.00 | 0.042a | L30_ForM | 0.00 - 0.00 - 0.00 | 0.036c
LDA_Shr | Shrub/scrub | 0.00 - 0.00 - 0.00 | 0.117a | L100_Shr | 0.00 - 0.00 - 0.00 | 0.069c | L30_Shr | 0.00 - 0.00 - 0.00 | 0.032c
LDA_Herb | Herbaceous | 0.00 - 0.65 - 1.68 | 0.130a | L100_Herb | 0.00 - 0.34 - 1.84 | 0.117b | L30_Herb | 0.00 - 0.00 - 1.61 | 0.154b
LDA_Hay | Hay/pasture | 0.00 - 5.51 - 12.92 | 0.214a | L100_Hay | 0.00 - 3.97 - 10.85 | 0.196a | L30_Hay | 0.00 - 1.67 - 9.45 | 0.161a
LDA_Crop | Crops | 17.20 - 44.71 - 69.59 | 0.152a | L100_Crop | 8.55 - 30.06 - 59.61 | 0.190b | L30_Crop | 6.33 - 25.24 - 53.97 | 0.202a
LDA_WetW | Woody wetlands | 0.00 - 0.00 - 0.47 | 0.128a | L100_WetW | 0.00 - 0.00 - 1.28 | 0.113a | L30_WetW | 0.00 - 0.00 - 2.24 | 0.092a
LDA_WetH | Herbaceous wetlands | 0.00 - 0.00 - 0.19 | 0.124b | L100_WetH | 0.00 - 0.00 - 0.63 | 0.051a | L30_WetH | 0.00 - 0.00 - 2.24 | 0.087a
LDA_Oth | Other | 0.00 - 0.00 - 0.00 | 0.000 | L100_Oth | 0.00 - 0.00 - 0.00 | 0.000 | L30_Oth | 0.00 - 0.00 - 0.00 | 0.000

Table 2-2. Description, quartile values, and individual IBI predicting power for the water quality, habitat, point source, and stream fragmentation metrics

a = best prediction at 423 branches; b = best prediction at 328 branches; c = best prediction at 233 branches; * Downstream distances until basin outlet; ** Values were multiplied by 1,000

Name | Units | Type | Quartiles | R2
Cond | μmho/cm | | 575.00-722.00-948.75 | 0.02a
DO | mg/L | | 6.50-7.70-8.70 | 0.02c
pH | SU | | 7.68-7.86-8.05 | 0.01b
TSS | mg/L | | 5.00-13.00-32.00 | 0.01b
TP | mg/L | Total | 0.11-0.22-0.42 | 0.00a
NH4 | mg/L | As N | 0.05-0.05-0.13 | 0.07b
NO2 | mg/L | As N | 0.02-0.02-0.05 | 0.06b
TKN | mg/L | | 0.40-0.60-1.00 | 0.09c
NO3 | mg/L | As N | 0.30-1.09-2.94 | 0.01c
Hard | mg/L | As CaCO3 | 239.75-304.00-374.25 | 0.02a
BOD | mg/L | | 2.00-2.00-3.23 | 0.12b
Ca | mg/L | Total | 62.00-77.00-91.00 | 0.01b
Mg | mg/L | Total | 19.00-26.00-36.00 | 0.04b
Cl | mg/L | Total | 26.00-42.00-84.00 | 0.02a
SO4 | mg/L | Total | 52.00-81.00-153.25 | 0.03a
As | μg/L | Total | 2.00-2.00-3.00 | 0.08c
Cd | μg/L | Total | 0.20-0.20-0.20 | 0.01a
Cu | μg/L | Total | 10.00-10.00-10.00 | 0.01c
Fe | μg/L | Total | 281.75-602.00-1212.50 | 0.01b
Pb | μg/L | Total | 2.00-2.00-2.00 | 0.00a
Zn | μg/L | Total | 10.00-10.00-19.00 | 0.01b
Temp | Deg C | In water | 19.70-21.70-23.43 | 0.06b

Name | Units | Description | Quartiles | R2
Subs | 0-20 | Substrate quality | 10.0-13.0-16.0 | 0.23a
Embed | 0-4 | Embeddedness | 2.5-3.0-4.0 | 0.28a
Rip | 0-10 | Riparian and bank qualities | 3.5-5.0-7.0 | 0.17b
Cover | 0-20 | Instream vegetal cover | 7.0-11.0-14.0 | 0.13a
Riffle | 0-8 | Riffle and run quality | 0.0-2.0-4.5 | 0.24a
Pool | 0-12 | Pool and glide quality | 5.0-8.0-10.0 | 0.23a
Chan | 0-20 | Channel morphology score | 9.0-13.0-16.0 | 0.21a
Grad | 0-10 | Gradient score | 6.0-8.0-10.0 | 0.09c
DA | Km2 | Drainage area | 32.89-103.60-344.73 | 0.19c
UPS_Con | % | Upstream connected length / total upstream length | 94.5-100.0-100.0 | 0.19a
SITE_Con* | % | Total connected length / basin network length | 3.3-18.9-34.5 | 0.47a
DW_MainDf* | Km | Main channel downstream length / # downstream dams | 28.2-42.9-179.3 | 0.38a
U_Df | Km | Upstream network length / # upstream dams | 26.3-83.2-226.2 | 0.25a
Avg_Df* | Km | Average of D_MainDf and U_Df | 54.4-110.2-198.7 | 0.26a
Uflood_len | m2/km | Upstream flooded area / upstream network length | 0.0-0.0-630.2 | 0.17a
UPS_stor_DA | m3/Km2 | Upstream dam storage / DA | 0.0-0.0-1459.5 | 0.15a
UPS_stor_len | m3/km | Upstream dam storage / upstream network length | 0.0-0.0-1,202.4 | 0.16a
UPS_Flooded | % | Upstream flooded area / DA | 0.00-0.00-0.05 | 0.17a
Dfl_MainLen* | m2/km | Downstream flooded area / downstream main channel length | 38,171.0-68,393.2-137,272.0 | 0.46a
Dsto_MLen* | m3/km | Downstream dam storage / main channel length | 231,248.8-370,196.2-1,487,654 | 0.44b
Flow_PS | % | % of upstream network carrying wastewater | 0.0-3.7-11.5 | 0.20b
PSDisch_LT | m3/d/Km | Point source discharge / upstream network length | 0.0-1.5-32.6 | 0.20b
PSDisch_LPS | m3/d/Km | Point source discharge / distance from site to all point sources | 0.0-23.8-320.6 | 0.21b
PS_LPS | #/km | # point sources / distance to all point sources | 0.0-43.5-90.1** | 0.21a
PS_LTOT | #/km | # point sources / upstream network length | 0.0-3.7-10.7** | 0.26b
PSDisch_DA | m3/d/Km2 | Point source discharge / DA | 0.0-4.5-70.4 | 0.21b
LPS-DA | Km/Km2 | Distance to all point sources / DA | 0.0-27.9-98.8** | 0.18b

2.1.2. Variable sorting based on IBI prediction power using a leave-one-out,

hierarchical approach

The environmental variables were kept separated in two different categories: offstream and

instream variables. Each category was composed of different groups. Thus, the offstream

category was composed of four groups: local and regional land uses (in catchment and 30 and

100-meter buffer areas), fragmentation metrics, and point source metrics. The instream category

was composed of two groups: water and habitat qualities.

A leave-one-out method was used in order to assess the IBI predictive power (R2) of each

environmental variable individually. Following this method, one observation at a time was left

out (test site). The remaining 428 observations were then clustered using an agglomerative

hierarchical structure with the average linkage method and the standardized Euclidean distance

(Jain et al. 1999). The hierarchical clustering (HC) of the remaining 428 observations was

performed using only the environmental variable being tested. Subsequently, the closest branch

in the hierarchical tree (i.e. with the smallest standardized Euclidean distance between the branch

mean value and the test site value) was selected. The calculated IBI was the average IBI of the

sites located in the selected branch. The observed IBI was the test site’s IBI. The prediction

capability was tested for all 429 observations at three different levels of the hierarchical

structure: with 233, 328, and 423 branches (see Figure 2-2). The highest R2 was used to sort the

variables.
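The leave-one-out procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this research: a "branch" is taken to be a cluster obtained by cutting the average-linkage tree at a fixed number of clusters, R2 is assumed to be the squared Pearson correlation between observed and calculated IBI, and all function names are hypothetical. The clustering is a naive implementation intended only for small examples.

```python
from itertools import combinations
from statistics import mean, pstdev


def standardize(rows):
    """Divide each column by its standard deviation (standardized Euclidean)."""
    sds = [pstdev(col) or 1.0 for col in zip(*rows)]
    return [[v / s for v, s in zip(r, sds)] for r in rows], sds


def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for (i, a), (j, b) in combinations(enumerate(clusters), 2):
            d = mean(dist(points[p], points[q]) for p in a for q in b)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


def loo_predict(env, ibi, n_clusters):
    """Leave each site out, cluster the rest, predict from the closest branch."""
    preds = []
    for t in range(len(env)):
        train = [r for k, r in enumerate(env) if k != t]
        train_ibi = [v for k, v in enumerate(ibi) if k != t]
        scaled, sds = standardize(train)
        test = [v / s for v, s in zip(env[t], sds)]
        clusters = average_linkage(scaled, n_clusters)

        def centroid(c):
            return [mean(col) for col in zip(*(scaled[p] for p in c))]

        nearest = min(clusters, key=lambda c: dist(centroid(c), test))
        preds.append(mean(train_ibi[p] for p in nearest))
    return preds


def r_squared(obs, pred):
    """Squared Pearson correlation (assumed definition of R2)."""
    mo, mp = mean(obs), mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    vo = sum((o - mo) ** 2 for o in obs)
    vp = sum((p - mp) ** 2 for p in pred)
    return 0.0 if vo == 0 or vp == 0 else cov * cov / (vo * vp)
```

Calling `loo_predict` once per candidate variable (with `env` holding a single column) and once per clustering level would give the individual R2 values used to sort the variables.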

2.1.3. Step-wise IBI prediction using a leave-one-out, hierarchical approach

Using the variable sorting obtained in point 2.1.2, the same methodology was repeated using a

step-wise approach. In each group, the best variable from the previous step was selected and the

next best group variable was included to form an array of two-dimensional environmental

vectors. The test site two-dimensional environmental vector was compared to the hierarchical

tree branches obtained with the two-dimensional array of remaining sites. If the prediction

improved on the previous one, the new variable was kept; otherwise it was discarded. This process

was repeated in each group until no more variables were available. Therefore, the IBI prediction

capability for each group of variables was revealed.
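The step-wise procedure within one group is a greedy forward selection, which can be sketched as below. The function name, the `score` callable, and the toy gains are hypothetical; in practice `score` would be the leave-one-out hierarchical prediction run on the candidate set of variables.

```python
def stepwise_select(ranked_vars, score):
    """Greedy forward selection over variables pre-sorted by individual R2.

    ranked_vars: variable names, best individual predictor first.
    score: callable taking a list of names and returning the model R2.
    """
    selected = [ranked_vars[0]]
    best = score(selected)
    history = [(ranked_vars[0], best)]
    for var in ranked_vars[1:]:
        trial = score(selected + [var])
        if trial > best:  # keep the variable only if the prediction improves
            selected.append(var)
            best = trial
            history.append((var, best))
    return selected, history


# Hypothetical scoring function standing in for the real prediction step.
TOY_GAIN = {"A": 0.30, "B": 0.00, "C": 0.10, "D": -0.05}


def toy_score(names):
    return round(sum(TOY_GAIN[n] for n in names), 3)


selected, history = stepwise_select(["A", "B", "C", "D"], toy_score)
```

In this toy run, variables B and D are discarded because they do not raise the score, which mirrors the keep-or-discard rule described above. The `history` list has the same shape as the cumulative R2 columns of Table 2-5.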

Subsequently, the different groups were progressively merged together using the selected

variables at each step (Figure 2-3). The IBI prediction methodology used when two groups were

merged was the same. Again, the variables were sorted according to their individual predicting

power from point 2.1.2 and introduced one at a time. The order of the group mergers is shown in

Figure 2-3. At the end of this process, the best variables and the best possible IBI prediction at the

state scale were obtained.

Figure 2-2. Hierarchical tree with different clustering levels against which the test site (Xi1, Xi2, …, Xin) is compared. i indicates the observation number; n indicates the environmental variable within the environmental vector

Figure 2-3. Diagram showing the order with which the variable groups were merged. Orange rectangles indicate instream variables. Green rectangles indicate offstream variables. Blue indicates a mix of both

2.1.4. Analysis of observations with a significant impact from local variables

Predictions obtained with this methodology represent the best possible prediction using the

selected variables at the scale of the study. However, some sites still showed poor predictions.

We hypothesized that most of the variability not accounted for with the selected variables was

due to local stressors able to explain an important part of the biotic quality in one or few specific

sites. Thus, sites where regional scale environmental factors were the most significant stressors

to the biologic community should be well predicted with the selected variables in the presented

model. Significant mispredictions would occur when some local scale variable had a significant

impact in one or several sites.

Observations whose predictions fell beyond the interval of ±1.5 times the root mean square error

(RMSE) were considered as mispredictions due to significant impacts from local effects. These

observations were separated from the rest and tested for differences against the group of sites

whose predictions fell within the ±1.5×RMSE interval. Differences in water quality (in sites

affected by point sources), point source and fragmentation density and intensity, and local and

regional land uses were tested. These were evaluated using a Student t-test at the 95% confidence

level.
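The screening rule and the group comparison described above can be sketched as follows. This is an illustration under assumptions: the residual is taken as calculated minus observed IBI (so positive residuals are over-predictions), the t statistic uses a pooled variance, and the function names are hypothetical. The p-value lookup against a t distribution with n_a + n_b - 2 degrees of freedom is left out because it is not available in the Python standard library.

```python
from math import sqrt
from statistics import mean, variance


def split_by_rmse(observed, predicted, k=1.5):
    """Split site indices into over-, under-, and well-predicted groups."""
    residuals = [p - o for o, p in zip(observed, predicted)]
    rmse = sqrt(mean(r * r for r in residuals))
    over = [i for i, r in enumerate(residuals) if r > k * rmse]
    under = [i for i, r in enumerate(residuals) if r < -k * rmse]
    well = [i for i, r in enumerate(residuals) if abs(r) <= k * rmse]
    return over, under, well, rmse


def pooled_t(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
```

A variable would then be compared between, for example, the `over` and `well` groups by passing its values for each group to `pooled_t`.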

2.2. Results

2.2.1. Step-wise IBI predictions

Local land use was able to predict 49% of the total IBI variability. In the 30-meter buffer,

presence of agricultural land (cropland, pasture), herbaceous, and developed land (medium and

highly developed) were the most relevant to IBI. In the 100-meter buffer, presence of deciduous

forest and developed lands (low and medium intensity) were the most important. In the whole

local catchment area, presence of developed open space, evergreen forest, and herbaceous

wetlands were the land uses that had a relevant impact on IBI (Table 2-5).

Regional land use explained 58% of the total biologic variability. The variables selected in the

step-wise procedure were deciduous forest and woody wetlands in the 30-meter buffer;

developed (low intensity and open space) and other land uses in the 100-meter buffer; and

hay/pasture, deciduous forest, herbaceous, and woody wetlands in the drainage area.

Combination of both, local and regional land uses, resulted in the selection of mostly regional

variables. Only medium intensity developed lands in the local buffers were able to bring new

information to the system (see All LU in Table 2-5).

Point source density and intensity was the group of offstream variables that explained the least

overall variability. The ratio between number of upstream point sources versus upstream network

length explained 26% of the overall variability and was the first and only metric selected in the

step-wise algorithm.

Stream fragmentation explained 54% of the overall variability. One metric alone (Site_Con) was

able to explain 47% of the overall variability. Site percentage of connected network, downstream

dam frequency, average dam frequency, and percentage of upstream connected network were the

variables selected in the fragmentation model (Table 2-5).

Combination of all the best offstream variables resulted in the selection of fragmentation metrics

(SITE_Con and DW_MainDf) and regional land use variables. None of the local land use

variables was selected. The best prediction was marginally better than the prediction with

combined local and regional land uses and explained around 60% of the total variability with just

eight variables instead of ten.

Combination of the best habitat parameters plus drainage area explained 49% of the overall

biologic variability. Six variables were selected in the best predicting model: embeddedness,

riffle, substrate, and pool qualities, drainage area, and instream vegetal cover. Water quality was

the group with the smallest overall prediction capability (R2 = 0.13). BOD, NO2-N, and Cd were

the variables that yielded the best prediction.

The variables selected after merging the best offstream and instream variables were all the

previously selected offstream variables plus two habitat parameters (riffle and cover). However,

the improvement in the overall IBI prediction was very modest compared to the best offstream

and land use models (R2 = 0.606 versus R2 = 0.597 and 0.596 respectively). New information

from instream parameters was minimal. IBI prediction plots for best offstream and instream

variables alone and combined are shown in Figure 2-4. The step-wise metric selection is shown

in Table 2-5.

2.2.2. Analysis of sites with significant local-scale stressors

A total of 28 sites were above the 1.5×RMSE threshold, while 27 were below. Some of the

chemical concentrations in overpredicted sites had unusually high values, well beyond the

reported background concentrations (e.g. NO3-N, NH4, Cu, Zn, Cond, TP, BOD, or NO2-N)

(Rankin et al. 1999). The concentrations of most elements were not significantly higher than the

rest of the database because not all overpredicted sites had consistently higher levels of a

particular element. Only Cu and Zn concentrations and some point source metrics (LPS_DA and

PSDisch_DA) were significantly higher in sites with reported point sources. However, this was

due to the existence of two sites with values one order of magnitude larger than the rest (one site

had very high Cu and Zn concentrations and the other one had very high point source density and

intensity). Since the source of impairment on these two sites was evident, and their presence

would have a significant impact on the t-tests results, they were disregarded from subsequent

point source and water quality analyses. After this, only upstream fragmentation, and land use-

related metrics had significant differences between groups as shown in Table 2-3.

Table 2-3. List of variables with significant differences between over-predicted sites and sites with a prediction within the ±1.5 ×RMSE intervals

Variable Name | # of over/well predicted sites | Type of sites | Value in over-predicted sites (95% conf. interval) | Value in well-predicted sites (95% conf. interval) | p
Uflood_len | 11/19 | NPS+UF | 14.2 ± 8.8 | 2.2 ± 1.3 | 0.000
UPS_Con | 11/19 | NPS+UF | 40.6 ± 25.7 | 75.6 ± 10.2 | 0.003
UPS_stor_len | 11/19 | NPS+UF | 142.5 ± 82.2 | 17.2 ± 13.4 | 0.000
Ups_stor_DA | 11/19 | NPS+UF | 0.115 ± 0.078 | 0.021 ± 0.018 | 0.003
UPS_Con | 28/374 | ALL | 76.6 ± 14.6 | 89.4 ± 2.5 | 0.011
L30M_ForD | 28/374 | ALL | 44.2 ± 9.1 | 30.6 ± 2.72 | 0.009
L100M_ForD | 28/374 | ALL | 39.8 ± 8.0 | 26.4 ± 2.4 | 0.003
L100M_DevM | 28/374 | ALL | 0.24 ± 0.17 | 2.5 ± 0.58 | 0.041
LDA_ForD | 28/374 | ALL | 26.5 ± 7.1 | 17.8 ± 1.8 | 0.012
LDA_ForE | 28/374 | ALL | 1.1 ± 1.2 | 0.4 ± 0.1 | 0.014
LDA_Hay | 28/374 | ALL | 12.6 ± 4.8 | 8.2 ± 1.1 | 0.034
R30M_ForD | 28/374 | ALL | 36.5 ± 7.5 | 23.2 ± 1.8 | 0.000
R100M_ForD | 28/374 | ALL | 29.0 ± 5.7 | 19.3 ± 1.5 | 0.001
RDA_ForD | 28/374 | ALL | 18.1 ± 4.6 | 12.9 ± 1.1 | 0.019

NPS = sites without point sources; UF = sites with upstream fragmentation; ALL = all sites

Sites with predictions below the 1.5×RMSE interval (i.e. underpredicted) consistently showed

lower hardness and hardness-related parameters in sites with reported point sources. Other

variables such as unit flow (discharge flow/area), Pb, Cu, and Zn were significantly higher in

underpredicted sites. However, these results were highly influenced by one site in particular

with extremely high concentrations and point source density compared to the remaining 11 sites

affected by point source pollution. For this reason, final water quality and point source density

analyses were run disregarding this particular observation. These results are presented in Table

2-4.

Table 2-4. List of variables with significant differences between under-predicted sites and observations with a prediction within the ±1.5 ×RMSE intervals

Variable Name | # of under/well predicted sites | Type of sites | Value in under-predicted sites (95% conf. interval) | Value in well-predicted sites (95% conf. interval) | p
Hard | 11/213 | PS | 247.0 ± 42.1 | 313.8 ± 13.8 | 0.033
Mg | 11/213 | PS | 20.9 ± 4.7 | 28.2 ± 1.6 | 0.046
SO4 | 11/213 | PS | 135.2 ± 15.5 | 64.1 ± 25.1 | 0.042
Dsto_MLen | 22/331 | DF | 1920.8 ± 711.9 | 1194.8 ± 157.5 | 0.025
L100M_ForD | 27/374 | ALL | 39.5 ± 8.5 | 26.4 ± 2.4 | 0.005
L100M_ForE | 27/374 | ALL | 1.8 ± 2.0 | 0.4 ± 0.2 | 0.002
L30M_ForD | 27/374 | ALL | 47.4 ± 9.9 | 30.6 ± 2.7 | 0.002
L30DevL | 27/374 | ALL | 2.0 ± 1.7 | 6.0 ± 1.0 | 0.038
L30M_ForE | 27/374 | ALL | 3.8 ± 5.8 | 0.4 ± 0.2 | 0.000
LDA_ForD | 27/374 | ALL | 29.9 ± 7.8 | 17.8 ± 1.8 | 0.000
LDA_ForE | 27/374 | ALL | 1.4 ± 1.3 | 0.4 ± 0.1 | 0.000
R30_Crop | 27/374 | ALL | 38.1 ± 1.3 | 49.1 ± 2.5 | 0.025
R30_ForD | 27/374 | ALL | 36.3 ± 7.2 | 23.2 ± 1.8 | 0.000
R100M_ForD | 27/374 | ALL | 29.4 ± 5.9 | 19.3 ± 1.5 | 0.000
RDA_ForD | 27/374 | ALL | 20.0 ± 4.7 | 12.9 ± 1.1 | 0.002

DF = sites with downstream fragmentation; PS = sites with point sources; ALL = all sites

Table 2-5. Step-wise IBI predictions. R2 indicate the variability explained after adding a new variable to the model. All results were achieved using a hierarchical tree with 423 branches. For an explanation of variables refer to Table 2-1 and Table 2-2

Model | Variable (cumulative R2), in order of selection
Local LU | L100M_ForD (0.335); LDA_DevO (0.363); L100M_DevL (0.363); L30M_Crops (0.373); L30M_Hay (0.392); L30M_Herb (0.427); L100M_WetW (0.458); L100M_DevM (0.465); L30M_DevM (0.472); LDA_ForE (0.476); L30M_DevH (0.480); L30M_ForM (0.484); LDA_Bar (0.492)
Regional LU | RDA_Hay (0.385); R30M_ForD (0.509); R100M_DevO (0.546); RDA_ForD (0.558); R100M_DevL (0.562); R30M_WetW (0.569); RDA_Herb (0.570); RDA_WetW (0.572); R100M_Other (0.577)
All LU | RDA_Hay (0.385); R30M_ForD (0.509); R100M_DevO (0.546); RDA_ForD (0.558); R100M_DevL (0.562); R30M_WetW (0.569); RDA_Herb (0.570); RDA_WetW (0.572); L100M_DevM (0.593); L30M_DevM (0.596)
Fragmentation | SITE_Con (0.467); DW_MainDf (0.499); Avg_Df (0.541); UPS_Con (0.542)
Point sources | PS_LTOT (0.260)
Offstream variables | SITE_Con (0.467); RDA_Hay (0.512); DW_MainDf (0.535); R30M_ForD (0.537); R100M_DevO (0.563); RDA_ForD (0.592); R100M_DevL (0.596); R30M_WetW (0.597)
Water Quality | BOD (0.116); NO2-N (0.124); Cd (0.130)
Habitat | Embedded (0.281); Riffle (0.326); Substrate (0.403); Pool (0.431); Area (0.442); Cover (0.491)
Instream variables | Embedded (0.281); Riffle (0.326); Substrate (0.403); Pool (0.431); Area (0.442); Cover (0.491)
Overall | SITE_Con (0.469); RDA_Hay (0.512); DW_MainDf (0.535); R30M_ForD (0.537); R100M_DevO (0.563); RDA_ForD (0.592); R100M_DevL (0.596); R30M_WetW (0.597); Riffle (0.605); Cover (0.606)

Figure 2-4. IBI predictions with the best offstream variables (top), best instream variables (middle), and best variables overall (bottom). Dashed red lines indicate perfect fit line (center) and ± 1.5×RMSE (sides). Dot size is proportional to the number of hits in a specific point.

2.3. Discussion

The model confirmed that biological integrity is the result of the impact on the biotic community

by many stressors of different nature acting at different scales. Out of the five components of

biotic integrity (energy sources, water quality, habitat structure, flow regime, and biotic

interactions) (Karr 1991; Karr et al. 1986; Karr and Kerans 1981), at least the first four were

totally or partially accounted for in the model. Even though numerous stressors existed in our

database, biologic integrity could be best characterized with only two groups of stressors:

regional land use and stream fragmentation. Only larger scale metrics (basin or watershed scales)

were selected in the final model (i.e. regional land use, percentage of connected stream network

and downstream dam frequency at the basin scale). This result is not surprising because the scale

of the research performed was large (state scale). In consequence, regional or basin scale

variables were able to explain a greater part of the total variability than local ones such as

instream habitat quality. The influence of land use on stream integrity is scale-dependent (Allan

et al. 1997), and the significance of scale becomes evident depending on the sampling design and

distribution. If the research is based on a wide array of streams with different characteristics and

different environmental conditions (e.g. substantially different upstream land uses or stream

order), regional characteristics will prevail as main contributors to biotic variability (Roth et al.

1996). Alternatively, if the study is based on similar types of observations with little

environmental variability (e.g. same order streams in one watershed), local variables will emerge

as the most significant (Lammert and Allan 1999).

2.3.1. Land use

Our model identified combinations of drainage area’s extent of hay/pasture and deciduous forest

(this was the dominant type of forest in our area of study) as the most critical to IBI. Herbaceous

and woody wetlands were also identified as important types of regional land uses in the drainage

area for IBI in Ohio (see All LU model in Table 2-5). These results strongly agree with research

by Roth et al. (1996) and Wang et al. (1997). They identified agriculture and forest in the

drainage area as the main contributors to IBI variability in Ohio and Wisconsin respectively.

However, in their research cropland and pasture lands were lumped into one single agriculture

category, while forest included deciduous, mixed and evergreen categories (Anderson et al.

1976). Stewart et al. (2001) identified a positive correlation of fish diversity, intolerant fish, and

EPT species with increased forest cover. Richards et al. (1996) linked non-row crop agricultural

lands with increased woody debris, flood ratio and shallows. In our research, land uses were not

merged according to the Anderson et al. (1976) recommendations; instead, the different sub-categories

defined in the NLCD (USGS 2008b) were kept. The result was that the extent of hay/pasture

in the drainage area turned out to have the greatest prediction power overall. This was

significantly greater than cropland’s (R2 = 0.385 vs. 0.257) despite the great dominance of this

type of land use (average cropland coverage equal to 56.1% versus 9.1% for hay/pasture).

Presence of pasture lands in the drainage area has been associated with reduced vegetal cover,

increased water temperature, nitrate, biomass concentrations, photosynthetic rates, and total

suspended solids as well as with an increase of fine sediments deposited in the river bed. A major

shift in the composition of the macroinvertebrate species was also associated with pasture lands

(Quinn et al. 1997). It has been found that presence of rangeland is particularly harmful to

aquatic fauna, especially in sites with poor riparian quality (Meador and Goldstein 2003).

Woody wetlands in the drainage area, and especially in the regional buffers, were also deemed

important for IBI. Even though little new variability was explained after the introduction of this

metric in the final model (most likely due to cross-correlation with other land uses such as

deciduous forest or developed lands), its presence is remarkable because of its small extent (mean

percentages equal to 0.33, 0.70, and 1.07% in the drainage area and the 100- and 30-meter regional

buffers, respectively). Woody wetlands seemed to gain importance with proximity to the stream

(its individual-based predictive power ranked 12th out of 16 land uses in the drainage area, 9th

out of 15 land uses in the 100-meter regional buffer, and 7th out of 15 land uses in the regional

30-meter buffer). In the final model, only presence of woody wetlands in the 30-meter buffer

introduced some new information to the model. A similar result was reported by Richards et al.

(1996), who linked small presence of forested wetlands (mean extent of 10% in drainage area)

with increased presence of woody debris and some channel characteristics such as bankfull

depth. Wetlands are known to act as regulators between surface water flow and hydrology

(Mitsch and Gosselink 1986). Their presence is associated with decreased sediment input,

nutrients, temperature, ionic strength, and increased resilience to disturbances (Detenbeck et al.

2000; Richards et al. 1996). Of special importance is the presence of wetlands near the receiving

waterbody as the model indicated (30-meter buffer was selected over drainage area). A decreased

wetland-stream distance has been positively correlated to reduced levels of nutrients, ions, and

bacteria. Wetland extent has been correlated to decreased lead and high color in downstream

lakes. This was found to be especially true in areas with highly fragmented riparian

corridors (Detenbeck et al. 2000; Detenbeck et al. 1993; Johnston et al. 1990).

Presence of developed lands in the regional and local buffers also provided significant new

information. At the regional level, open and low intensity urban lands (the dominant urban

categories with mean extents equal to 9.1 and 4.3% respectively) in the 100-meter buffer were

the urban land uses selected in the final predicting model. Therefore, it seems from the obtained

results that the extent of developed lands (represented mainly by low intensity and open space) in

the regional 100-meter buffer plays an important role in biotic degradation (Morley and Karr

2002; Stewart et al. 2001; Wang et al. 2001). This is also true for local land use in buffers even

though open and low intensity developed lands at this scale were not selected in any of the

models, most likely due to high correlation with their regional homologues (r = 0.60 and 0.57 for

open space and r = 0.57 and 0.59 for low development in the 30 and 100-meter buffers

respectively).

At the regional level, presence of deciduous forest and woody wetlands in the most immediate

lands (30-meter buffer) seemed to counter-act the effect of urban land uses within or beyond that

buffer (100-meter). It has been reported that buffer fragmentation and patchiness data can provide

substantial information beyond traditional land use percentages (Allan 2004; Detenbeck et al.

2000; Stewart et al. 2001). Presence of medium intensity development at local scales in the

Overall LU model may be an indication that urban intensity is also important, especially in the

immediate surroundings of a water body (local-scale 30 and 100-meter buffers) (Morley and

Karr 2002; Wang et al. 2001). Medium intensity development was not the dominant urban land

use in local buffers (2.21 and 1.76% in the 100 and 30-meter buffers respectively, versus 12.3

and 11.6%, and 5.9 and 5.51% of open space and low intensity urban lands respectively). Around

10-12% of connected imperviousness is considered the threshold beyond which biologic quality

declines rapidly in watersheds with small or no riparian buffers (Schueler 1994; Wang et al.

2001; Wang et al. 2000). Presence of medium intensity development in the local buffer as a

significant variable in our model may indicate that this threshold has been reached.

2.3.2. Fragmentation

Fragmentation and flow regulation affect a large percentage of the streams worldwide,

especially in developed countries (Dynesius and Nilsson 1994; Nilsson et al. 2005). Stream

fragmentation by dams has serious consequences for the biologic community, preventing fish

from reaching upstream habitats, and isolating trapped upstream populations. Decreased species

richness and risk of extinction of native fauna through demographic, environmental, and genetic

stochasticity are some of the consequences fragmented populations face (Morita and Yamamoto

2002). The negative effects of physical separation of stream segments on aquatic species have

been widely studied (Morita and Yamamoto 2002; Morita and Yokota 2002; ReyesGavilan et al.

1996). Moreover, physical barriers are not the only consequence of dams. Usually, hydrologic

changes are also associated with impoundments. Alteration of the natural flow regime affects

fauna by eliminating or modifying natural habitat conditions, which in turn, produces a shift in

species composition and, therefore, biologic integrity (Fischer and Kummer 2000; Freeman et al.

2001; Gilvear et al. 2002; Poff and Allan 1995; Poff et al. 1997; Richter et al. 1996).

In this research, the site percentage of connected stream network (SITE_Con) and downstream

fragmentation metrics (Dfl_MainLen, Dsto_MLen, and DW_MainDf) had the largest individual

predicting powers overall. These were able to explain around 40% of the total IBI variability by

themselves. Upstream fragmentation metrics had far less prediction power and showed

importance only in some specific sites as shown in Table 2-3. Most of the sites were located well

inland and far from the basin outlet (average stream distance to basin outlet = 284.3 km,

minimum distance = 18.35 km, maximum distance = 833 km), which could have influenced the

results. However, no significant differences in drainage area were found between fragmented

sites in the overpredicted and well-predicted groups. Furthermore, the only fragmentation metric that included

upstream and downstream fragmentation (SITE_Con) had the greatest predictive power overall.

2.3.3. Point sources and instream water quality

The prediction power of the individual instream water quality variables was clearly sorted in

three main groups. The first one was related to nutrient concentration, especially nitrogen (BOD,

TKN, NO2-N, and NH4). Nitrate and TP concentrations were not ranked among the top

chemical predictors. In fact, TP was the worst chemical predictor, which could be due to

elevated concentrations beyond the biomass limiting-nutrient condition (Rankin et al. 1999).

Rankin et al. (1999) didn’t find a clear relationship between NO3-N and IBI in Ohio either, and

only concentrations beyond 3-4 mg/L had consistently negative effects on IBI. Ionic strength

parameters (Mg, Hard, Cl, Cond, SO4) were the second group. These elements affect the toxicity

of some components such as metals. Metal concentrations came in last (Zn, Cd, Fe, Cu, Pb) with

the exception of arsenic which had the third highest individual predicting power of all available

chemicals. Other variables such as DO, TSS, or pH had low prediction capabilities.

The first two variables selected in the step-wise model (BOD, NO2-N) showed that nutrient

input is the main water quality contributor to biologic degradation in Ohio. BOD has been

identified as a source of degradation in Ohio streams (Dyer et al. 2000; Norton et al. 2000;

Norton et al. 2002) and is an indicator of biologic degradation due to highly eutrophic

conditions. The third selected variable in the model was cadmium concentration, which provided

marginal improvement (see Table 2-5). Metal toxicity is indeed a powerful agent of biologic

degradation. However, it is only able to explain a significant part of the overall IBI variability at

smaller scales such as the upper or lower parts of a watershed (Dyer et al. 2000). This is most

likely a consequence of its highly localized nature (i.e. coming from point sources or legacy

pollution). None of the chemical variables were present in the Instream Variables model. Habitat

and water qualities (especially if related to nutrient input) are highly influenced by changes in

local and regional land uses. Therefore, in severely impaired habitats (e.g. with a high level of

fine sediment due to accelerated denudation processes) poor water quality is likely due to

increased non-point source inputs (chemicals attached to flushed particles in runoff). Therefore,

water quality did not provide any further improvement in the subsequent models at this particular

scale.

Point source metrics had significant effects only at the local scale, as expected. When extreme

cases were removed, no significant differences in water quality and point source density and

intensity were found between sites with reported point sources. Only significantly lower ionic

strength in under-predicted sites was found, which could be an indication of less intense waste

water discharges in these sites. Therefore, point sources have a small overall impact on biotic

integrity compared to other more ubiquitous stressors directly or indirectly linked to land use

changes. At the sub-basin scale or smaller, point source pollution can play a significant role if

it has a substantial presence (Dyer et al. 2000; Dyer et al. 1998a). However, as the scale

expands, other factors take over for one simple reason: they are more ubiquitous, hence they act

as gradients in all the available observations. Thus, point sources explain a significant amount of

variability in specific clusters of sites but little when all are considered as a whole (Allan et al.

1997; Manolakos et al. 2007).

2.3.4. Instream Habitat

Instream habitat and drainage area were able to explain 49% of the overall IBI variability.

Substrate-related metrics (embeddedness and substrate quality), stream variability (pool and

riffle qualities), as well as vegetal cover were the most relevant QHEI metrics. Habitat

parameters have been identified as the main instream sources of IBI variability (Dyer et al.

1998a; Hall et al. 1996; Manolakos et al. 2007). At larger scales, a significant part of the

variability due to water quality is accounted for with habitat quality for the reasons mentioned

above. Our model confirmed this, and the Habitat model selected exactly the same variables as

the Instream one (Table 2-5). Riffle and Cover qualities were selected in the Overall model but

contributed very little to the final result. Stream variability, substrate quality, and/or instream

cover have been identified as significant contributors to biotic quality in Ohio (Dyer et al. 1998a;

Dyer et al. 1998b; Manolakos et al. 2007; Yuan and Norton 2004) and elsewhere (Minshall

1984; Quinn and Hickey 1990; Rabeni and Smale 1995; Richards et al. 1993). Drainage area was

positively correlated to IBI, which strongly agreed with the findings by Dyer et al. (1998a) in

Ohio.

2.3.5. Mispredictions due to local effects

The main cause for IBI overprediction was the presence of either one extreme factor (e.g. very

high levels of point source pollution), or a combination of two or more variables with

significantly different values from the remaining well-predicted sites (upstream fragmentation

and/or local land use differences). Overpredicted observations had significantly higher levels of

deciduous forest at all levels at the regional scale, which contributed to high calculated IBI

scores. Surprisingly, the extent of forested areas at all local levels was also significantly higher and

urban extent was lower in overpredicted sites, which was counter-intuitive given the lower

observed IBI scores in sites with such good ‘land use quality’. The only local land use metric

that could contribute to lowering the IBI expectancy was the presence of significantly higher

percentages of hay/pasture lands in the local catchment area. No significant differences existed

with this land use at regional scales. Hay/pasture was identified as the best land use predictor and

negatively correlated to IBI at the regional scale.

Overprediction was also due to increased upstream fragmentation (average upstream connected

network equal to 76.6% versus 89.4% in overpredicted and well-predicted sites, respectively).

Therefore, high point source density and intensity, highly fragmented upstream networks, and

larger extents of hay/pasture in the local catchment area lowered the observed IBI. These metrics

were not part of the final model. On the other hand, significantly higher percentages of forested

areas in the regional buffer yielded higher than expected predicted IBI scores.

Underpredicted sites also had significantly better ‘land use quality’ at all

scales. Therefore, high quality of the local land use (higher levels of forested areas followed by

reduced extents of urban and crop lands, and absence of other significant differences) is the most

likely cause of underprediction because the model doesn’t take into account exceptional local

conditions. Furthermore, lower concentrations of ionic strength parameters might be an

indication of reduced sediment and chemical input from non-point sources at the regional and

local scales and also absence of a substantial impact from point sources.

2.4. Conclusions

• The presented prediction model was based on evaluation of environmental similarities among

sites with the same environmental variables. It successfully identified the most significant

variables to IBI at the state-scale with a very fast and easy-to-apply technique. Selected variables

at each step strongly agreed with published research.

• At the state-scale, regional land use and stream fragmentation are the main predictors of

biotic integrity. Habitat variables only contribute marginally to model improvement, while

instream water quality and point source intensity and density were not able to improve the final

model at all. Most of the information from instream water and habitat qualities is introduced into

the model by regional land use, which acts as a surrogate variable.

• Sixty-one percent of the total variability was explained with regional land use and

fragmentation metrics and sixty percent with just local and regional land use. Overpredictions

mainly came from a combination of higher upstream fragmentation, extreme point source density

and intensity, and high levels of hay/pasture in the local catchment area. Underpredictions

mainly came from sites with an extraordinary local land use quality which was not accounted for

in the model, and less harmful effects from disposed waste water.

• If the 55 sites with significant local effects were to be disregarded, the model could explain

86% of the overall IBI variability. Therefore, at the state-scale local stressors account for 25% of

the variability beyond the one explained by land use and fragmentation. The remaining 14% may

be due to sampling errors, data quality issues, or natural randomness (for example, a site with

BOD = 24 mg/L; TKN = 3.1 mg/L; TP = 1.29 mg/L; Zn = 180 µg/L; Cu = 39 µg/L; Fe = 19,700

µg/L; and NO2-N = 0.19 mg/L had one of the highest observed IBI scores (50)).

• The results show that water quality issues from point sources have small overall impact on

biotic integrity in Ohio. This may indicate a successful control of point sources through the

NPDES permits and Total Maximum Daily Load (TMDL) projects, which have been top

priority for surface waters since the Clean Water Act of 1972. Our model showed that the most

significant current stressors are related to stream fragmentation and land use change, especially in

the regional buffers. Habitat degradation and nutrient input are the most direct consequences

from this. In order to achieve the aimed physical, chemical, and biological integrity of the

Nation’s waters, protection and enforcing policies have to refocus towards a more holistic view

beyond point source control. Ecosystem continuity must be maintained, and watershed-level land use

planning is necessary to attain such goals, especially in the lands most immediate to any water

body.
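
The variance shares quoted in the conclusions above can be checked with simple arithmetic. The following is only a worked restatement of the reported percentages (61% explained by regional land use and fragmentation, 86% once the 55 sites with significant local effects are set aside); it introduces no new results.

```latex
\begin{align*}
\text{share attributed to local stressors} &= 86\% - 61\% = 25\%,\\
\text{share left unexplained} &= 100\% - 86\% = 14\%.
\end{align*}
```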

3. Chapter 3: Probabilistic, Hierarchical, Biologic Integrity Discrimination

3.1. Methodology

3.1.1. Ohio: instream data and study area

The data used consisted of 429 observations. An observation consisted of an array of instream

habitat and water quality parameter measurements, and the corresponding value of the fish Index

of Biologic Integrity (IBI). This data set of observations was extracted from a larger data base

and was selected because the complete set of input and output variables was available. Other

observations in the data base were incomplete (e.g., only biological parameter values were

available). The data were collected between years 1996 and 2000 by Ohio EPA. Habitat

observations consisted of discrete scores for each of the metrics in the Qualitative Habitat

Evaluation Index (QHEI): in-stream cover score (Cov), gradient score (Grad), and substrate

(Subs), riparian (Rip), pool, riffle, and channel (Chan) qualities. Furthermore, discrete scores

quantifying the site’s embeddedness extent due to fine sediment deposition (Embed) were also

available (embeddedness is not a metric in itself but a penalizing factor for substrate and channel

qualities). A detailed description of each QHEI metric and scoring criteria can be found in

Rankin (1989). Drainage area (DA) was also available. Water temperature (Temp), conductivity

(Cond), dissolved oxygen (DO), biologic oxygen demand (BOD), pH, total suspended solids

(TSS), ammonia (NH4-N), nitrite (NO2-N), nitrate (NO3-N), total Kjeldahl nitrogen (TKN),

total phosphorus (TP), hardness (Hard), total calcium (Ca), total magnesium (Mg), chloride (Cl),

sulfate (SO4), total arsenic (As), total cadmium (Cd), total copper (Cu), total iron (Fe), total lead

(Pb), and total zinc (Zn) were the available physical and chemical values for water quality. The

units for both habitat and water quality parameters are shown in Table 3-1. The IBI scores

consisted of discrete scores ranging from 12 (essentially no fish) to 60 (healthy fish community).

A description of how IBI was developed and implemented for biological assessment of streams

in Ohio can be found in Ohio EPA (1987). Habitat, water quality and IBI data collection was

performed in the same stream segment. Habitat, water quality, and biologic sampling dates did

not differ more than 5 days in any of the observations. To our knowledge, the monitoring was

performed in base-flow conditions during summer time or early fall. No extreme events such as

chemical spills were reported.

Table 3-1. List of water quality, habitat, and biologic integrity parameters used in the research

Variable                 Symbol  Units     Variable           Symbol  Units   Metric                               Symbol  Scale
Conductivity             Cond    µmho/cm   Total calcium      Ca      mg/L    Substrate quality                    Subs    0-20
Dissolved oxygen         DO      mg/L      Total magnesium    Mg      mg/L    Embeddedness                         Embed   0-4
pH                       pH      0-14      Chloride           Cl      mg/L    Riparian quality                     Rip     0-10
Total susp. solids       TSS     mg/L      Sulfate            SO4     mg/L    Instream cover                       Cov     0-20
Total phosphorus         TP      mg/L      Total arsenic      As      µg/L    Riffle quality                       Riffle  0-8
Ammonia as N             NH4     mg/L      Total cadmium      Cd      µg/L    Pool quality                         Pool    0-12
Nitrite as N             NO2     mg/L      Total copper       Cu      µg/L    Channel quality                      Chan    0-20
Total Kjeldahl nitrogen  TKN     mg/L      Total iron         Fe      µg/L    Gradient                             Grad    0-10
Nitrate as N             NO3     mg/L      Total lead         Pb      µg/L    Qualitative Habitat Evaluation Index QHEI    0-100
Hardness as CaCO3        Hard    mg/L      Total zinc         Zn      µg/L    Drainage area                        DA      km2
Biologic Oxygen Demand   BOD     mg/L      Water temperature  Temp    deg C   Fish Index of Biologic Integrity     IBI     12-60

The observations were distributed across the entire state. The majority was collected in the

Eastern Corn Belt Plains (ECBP), the Huron/Erie Lake Plains (HELP), and the Erie/Ontario

Lake Plains (EOLP) ecoregions with 180, 73, and 100 observations respectively. The Western

Allegheny Plateau (WAP) and the Interior Plateau (IP) ecoregions only had 36 and 40

observations respectively. The HELP and ECBP ecoregions have the highest nutrient

background concentrations, the EOLP and IP ecoregions have intermediate levels of nutrients,

while the WAP ecoregion has the lowest levels (Rankin et al. 1999). The watershed areas were

also very diverse, ranging from 1.55 km2 to 16,420 km2. In our research the sites were not

subdivided by ecoregion or stream size and were introduced into the model all at once. A small

number of observations would have limited the progressive partitioning process, which requires

a large number of sites. Moreover, we wanted to ‘let the data speak’. Ecoregional or stream size

trends in nutrient concentration would be captured by the different patterning techniques used in

the research if they were significant enough.

3.1.2. Ohio: offstream data and study area

The data used consisted of 429 observations, where an observation here consists of an array of

basin, watershed and local-scale offstream variables along with the fish Index of Biologic

Integrity (IBI). The biological data were collected between years 1996 and 2000 by Ohio EPA.

Basin-scale observations consisted of fragmentation metrics. Watershed-scale metrics consisted

of percentages of different types of land use in the drainage area and the 100 and 30-meter buffer

areas around the stream network in the entire watershed. Watershed-based point source density

and intensity were also watershed-scale variables. Local metrics consisted of percentages of land

use in the catchment area and the 100 and 30-meter buffers only 2 miles upstream of the

sampling point. A 30-meter buffer width was chosen because it was the smallest width the data

resolution allowed and it exceeds the minimum recommended width of 15 meters, which is effective

under most conditions (Castelle et al. 1994). A 100-meter width was chosen because this is an

intermediate value between 3 and 200 meters, minimum and maximum effective widths

depending on site-specific conditions according to Castelle et al. (1994). A description of the

different variables is available in Table 3-1, Table 3-2, and Table 3-3.

In order to obtain the upstream land uses, each site’s watershed was delineated using a 30-meter

resolution Digital Elevation Model (DEM) with ArcGIS Spatial Analyst. Subsequently, the

percentage of each upstream land use was calculated at two different scales: the watershed scale

and the local scales. Land use percentages were obtained using the Thematic Raster Summary

function within Hawth’s Analysis Tools for ArcGIS (Beyer 2004). Eight different broad land

use categories were used for each scale: urban, agricultural, non-forested, forested, surface

water, wetland, barren, and other (Anderson et al. 1976). These were calculated from the sixteen

land cover categories defined in the 2001 National Land Cover Dataset (NLCD) (USGS 2008b).

The surface water land use category was only calculated for the drainage and catchment areas,

not for the buffers because we felt that including it would heavily affect the final percentages of

narrow buffers. The fragmentation and point source metrics were calculated using the National

Hydrography Datasets (NHD) (USGS 2008a). The ArcGIS Utility Network Analyst was used to

trace upstream or downstream of a specific site. Major dams (with DA ≥ 2.59 km2) and point

sources (major and minor waste water treatment plants and major industrial dischargers) were

obtained from the National Inventory of Dams (USACE 2005) and the Permit Compliance

System database (EPA 2008c).
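
The aggregation of NLCD classes into broad land use percentages described above can be sketched as follows. This is a minimal illustration, not the Hawth's Tools workflow used in the study: the class-to-category mapping `NLCD_TO_BROAD`, the function `land_use_percentages`, and the cell counts are all hypothetical, and dropping surface water from the buffer totals before normalizing is an assumed reading of the text.

```python
# Hypothetical mapping from NLCD 2001 class codes to the eight broad
# categories named in the text; the grouping itself is an assumption.
NLCD_TO_BROAD = {
    11: "water", 12: "other",
    21: "urban", 22: "urban", 23: "urban", 24: "urban",
    31: "barren",
    41: "forested", 42: "forested", 43: "forested",
    52: "non-forested", 71: "non-forested",
    81: "agricultural", 82: "agricultural",
    90: "wetland", 95: "wetland",
}


def land_use_percentages(cell_counts, exclude=()):
    """Convert per-class raster cell counts into broad-category percentages.

    cell_counts: dict of NLCD class code -> number of 30-meter cells.
    exclude: broad categories dropped before normalizing (for example
             "water" when summarizing the narrow stream buffers).
    """
    totals = {}
    for code, count in cell_counts.items():
        category = NLCD_TO_BROAD[code]
        if category in exclude:
            continue
        totals[category] = totals.get(category, 0) + count
    area = sum(totals.values())
    if area == 0:
        return {}
    return {cat: 100.0 * n / area for cat, n in totals.items()}


# Made-up cell counts for one 100-meter buffer, not data from the study.
buffer_counts = {11: 50, 22: 100, 41: 300, 82: 600}
print(land_use_percentages(buffer_counts, exclude=("water",)))
```

For drainage and catchment areas the same function would be called without `exclude`, so that surface water keeps its own percentage.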

The IBI scores consisted of discrete scores ranging from 12 (very poor biotic integrity) to 60

(excellent biotic integrity). A description of how IBI was developed and implemented for

biological assessment of streams in Ohio can be found in Ohio EPA (1987).

The observations were located in five different basins: the Western Lake Erie Basin, the Muskingum River

Basin, the Scioto River Basin, the Middle Ohio and Little Miami River Basin, and the Wabash

River Basin (Figure 3-1). The drainage areas of the sampling points were also calculated and

were very diverse, ranging from 1.55 km2 to 16,420 km2. In our research, sites were not

subdivided by ecoregion or stream size and were introduced into the model all at once. A smaller

number of observations would have limited the progressive partitioning process, which requires

a large number of observations. Moreover, we wanted to ‘let the data speak’ and not make any

preconceived assumptions.

Figure 3-1. Distribution of observations used in the analysis and basins. On the left, groups after the 2nd SOM. On the right, groups after clustering using SITE_Con (groups from the same parent group are segregated by basin)

Table 3-2. Land use categories and quartiles at the watershed (R) and the local (L) scales

Name            Units  Quartiles          Name            Units  Quartiles
RDA_Water       %      0.10-0.25-0.60     LDA_Water       %      0.00-0.19-0.96
RDA_Forest      %      5.54-9.56-19.72    LDA_Forest      %      4.47-13.60-29.77
RDA_NonForest   %      0.48-1.04-1.42     LDA_NonForest   %      0.16-0.73-1.89
RDA_Barren      %      0.00-0.01-0.05     LDA_Barren      %      0.00-0.00-0.00
RDA_Agric       %      57.01-70.64-81.93  LDA_Agric       %      32.39-57.97-78.67
RDA_Urban       %      6.84-9.48-18.58    LDA_Urban       %      5.99-11.08-30.99
RDA_Wetlands    %      0.03-0.15-0.29     LDA_Wetlands    %      0.00-0.22-1.13
RDA_Other       %      0.00-0.00-0.00     LDA_Other       %      0.00-0.00-0.00
R100_Forest     %      8.22-16.22-31.24   L100_Forest     %      7.66-24.62-46.09
R100_NonForest  %      0.39-1.11-1.89     L100_NonForest  %      0.00-0.49-2.02
R100_Barren     %      0.00-0.00-0.02     L100_Barren     %      0.00-0.00-0.00
R100_Agric      %      48.75-65.64-77.14  L100_Agric      %      19.06-43.99-70.17
R100_Urban      %      6.60-9.50-15.73    L100_Urban      %      5.58-11.13-28.01
R100_Wetlands   %      0.02-0.37-1.01     L100_Wetlands   %      0.00-0.41-3.50
R100_Other      %      0.00-0.00-0.00     L100_Other      %      0.00-0.00-0.00
R30_Forest      %      9.83-21.20-39.77   L30_Forest      %      7.38-29.34-55.19
R30_NonForest   %      0.26-1.06-2.25     L30_NonForest   %      0.00-0.00-1.77
R30_Barren      %      0.00-0.00-0.01     L30_Barren      %      0.00-0.00-0.00
R30_Agric       %      42.05-60.88-76.07  L30_Agric       %      11.54-33.58-67.21
R30_Urban       %      5.42-8.40-15.23    L30_Urban       %      3.74-8.77-24.89
R30_Wetlands    %      0.00-0.58-1.65     L30_Wetlands    %      0.00-0.35-6.67
R30_Other       %      0.00-0.00-0.00     L30_Other       %      0.00-0.00-0.00

DA = drainage or catchment area; 100 = 100-meter buffer; 30 = 30-meter buffer

Table 3-3. Fragmentation (top) and point source density and intensity metrics (middle), units, and quartiles

Name             Description                                                     Units     Quartiles
UPS_Floodarea    Percentage of flooded drainage area                             %         0.00-0.00-0.05
UPS_Con          Percentage of upstream connected network                        %         94.63-100.00-100.00
SITE_Con         Percent of total connected network                              %         3.26-18.93-34.52
DW_MainDF        Downstream channel length / # of dams on channel                Km        28.19-42.86-179.26
UPS_DF           Upstream network length / number of upstream dams               Km        26.44-83.2-225.28
Avg_DF           Average of DW_MainDF and UPS_DF                                 Km        54.43-110.23-198.40
UPS_floodlen     Upstream flooded area / upstream network length                 m2/Km     0.0-0.0-630.2
UPS_storDA       Upstream dam storage capacity / drainage area                   m3/Km2    0.0-0.0-1459.5
UPS_storlength   Upstream dam storage / upstream network length                  m3/km     0.0-0.0-1,202.4
DW_floodMainlen  Downstream flooded area / main channel length                   m2/km     38,171.0-68,393.2-137,272.0
DW_storMainlen   Downstream dam storage / main channel length                    m3/km     231,248.8-370,196.2-1,487,654

Name             Description                                                     Units     Quartiles
Flow_PS          % of upstream network carrying wastewater                       %         0.00-3.70-11.52
PSDisch_LT       Point source discharge / upstream network length                m3/d/Km   0.0-1.5-32.6
PSDisch_LPS      Point source discharge / distance from site to all point sources m3/d/Km  0.0-23.8-320.6
PS_LPS           # point sources / distance to all point sources                 #/km      0.0-43.5-90.1**
PS_LTOT          # point sources / upstream network length                       #/km      0.0-3.7-10.7**
PSDisch_DA       Point source discharge / DA                                     m3/d/Km2  0.0-4.5-70.4
LPS-DA           Distance to all point sources / DA                              Km/Km2    0.0-27.9-98.8**

DA               Drainage area                                                   Km2       32.89-103.60-344.73

** Values were multiplied by 1,000
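
Several of the fragmentation metrics in Table 3-3 are simple ratios. The sketch below shows how three of them could be computed from traced network lengths and dam counts. The function names, the input values, and the treatment of sites without dams are assumptions made for illustration; the study's own ArcGIS procedure is not reproduced here.

```python
def percent_connected(connected_km, total_km):
    """Share of a stream network (in %) reachable without crossing a dam."""
    if total_km <= 0:
        raise ValueError("total network length must be positive")
    return 100.0 * connected_km / total_km


def length_per_dam(length_km, n_dams):
    """Channel or network length per dam (km), as in DW_MainDF and UPS_DF.

    Returns None when there are no dams; how the study treated that case
    is not stated, so this is an assumption.
    """
    if n_dams == 0:
        return None
    return length_km / n_dams


def average_df(dw_main_df, ups_df):
    """Avg_DF: mean of the downstream and upstream length-per-dam values."""
    return (dw_main_df + ups_df) / 2.0


# Made-up lengths and dam counts for one hypothetical site.
ups_con = percent_connected(connected_km=75.0, total_km=100.0)
dw_main_df = length_per_dam(length_km=120.0, n_dams=3)
ups_df = length_per_dam(length_km=100.0, n_dams=2)
print(ups_con, dw_main_df, ups_df, average_df(dw_main_df, ups_df))
```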

3.1.3. Maryland data and study area

A total of 774 observations were used for the present research. These were grouped in three

geographic strata: coastal, piedmont, and highland regions. Piedmont and highland regions

represent non-coastal areas, and have significant differences in soil and land use history. Also,

the metrics used to calculate the Physical Habitat Index (PHI) are different for each region (Paul

et al. 2002). Coastal areas had a total of 225 observations, highland had 196 observations, while

piedmont regions had 252 sites. Figure 3-2 shows the distribution of the observation sites within

the state of Maryland. The data were obtained from the 1995-1997 Maryland Biological Stream

Survey (MBSS) (DNR 2008). The data consisted of biologic, habitat, and water qualities, stream

morphology, and watershed land use information.

The available habitat information in the MBSS database corresponded to Maryland’s

Provisional PHI metrics (Hall et al. 1999). These metrics were recalculated to obtain the “new”

PHI following the guidelines by Paul et al. (2002). The old metrics not used in the calculation of

the “new” PHI for a particular stratum were kept. Therefore, each site had its regional “new” PHI

and corresponding metrics and the remaining “old” habitat metrics not included in the new

regional PHI. Table 3-4 shows a list of all the different environmental variables available in each

of the observations and strata.

Land use information consisted of percentages of each category in the drainage area. The

original land use dataset contained fifteen different categories. MBSS used the land use/ land

cover information from the Federal Region III Multi-Resolution Land Characterization (MRLC)

digital data set, Version 2 (EPA 2008b). The MRLC was developed by a federal agency

consortium, using data primarily from Landsat 1991-1993 Thematic Mapper satellite images at a

resolution of 30 meters. In the present research, the fifteen MRLC land use categories were

grouped in three land cover classes: urban (which included low and high intensity development),

agriculture and barren (hay/pasture/grass, row crops, quarries, coal mines, beach areas, and

transitional), and natural lands (forest, wetlands, and open water).

The biological data consisted of fish IBI. This is based on the comparison of observed fish

assemblages at each site to those found at reference sites (Roth et al. 1998). Reference sites exist

for each of the strata: coastal, piedmont, and highland regions. The final IBI scores are the mean

values of the individual metrics, which are discrete scores (1, 3, or 5, where 5 indicates

little or no departure from reference conditions and 1 the greatest departure). The IBI is

composed of eight metrics in coastal areas, nine in piedmont regions, and seven in highland sites

(Roth et al. 2000).
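
The scoring rule described above (mean of discrete 1/3/5 metric scores, with a stratum-specific number of metrics) can be written as a short sketch. The function name `fish_ibi` and the example scores are hypothetical; the individual MBSS metrics are not listed in this section and are not modeled here.

```python
# Number of IBI metrics per stratum, as stated in the text.
N_METRICS = {"coastal": 8, "piedmont": 9, "highland": 7}
VALID_SCORES = {1, 3, 5}


def fish_ibi(metric_scores, stratum):
    """Mean of the discrete metric scores for one site in a given stratum."""
    expected = N_METRICS[stratum]
    if len(metric_scores) != expected:
        raise ValueError(f"{stratum} IBI needs {expected} metric scores")
    if any(score not in VALID_SCORES for score in metric_scores):
        raise ValueError("each metric score must be 1, 3, or 5")
    return sum(metric_scores) / len(metric_scores)


# Made-up scores for a hypothetical highland site (seven metrics).
print(fish_ibi([5, 3, 3, 1, 5, 3, 1], "highland"))
```

Because the result is an average of 1, 3, and 5, the final IBI falls on the continuous 1-5 scale reported for Maryland in Table 3-4.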

Figure 3-2. 1995-1997 MBSS monitoring stations in the state of Maryland and strata distribution

Table 3-4. Description, quartiles, and units for the available regional environmental variables

COASTAL    Quartiles           Units    PIEDMONT   Quartiles           Units    HIGHLAND   Quartiles           Units    Description
Cond       106-154-209.2       μmho/cm  Cond       136-174-223         μmho/cm  Cond       87.2-150-247        μmho/cm  Conductivity
DO         6.1-7.1-8.4         mg/L     DO         8.4-9.2-9.8         mg/L     DO         7.5-8.2-9.1         mg/L     Dissolved oxygen
pH         6.6-6.9-7.2         SU       pH         7.1-7.4-7.6         SU       pH         6.8-7.1-7.4         SU       pH
NO3        0.7-1.1-2.9         mg/L     NO3        1.8-2.6-4.1         mg/L     NO3        0.46-0.94-3.04      mg/L     Nitrate as N
Temp       18.8-20.8-23.1      deg C    Temp       17.4-19.2-21.1      deg C    Temp       16.3-18.0-20.2      deg C    Water temperature
SO4        10.8-14.2-17.8      mg/L     SO4        6.0-9.0-13.1        mg/L     SO4        9.5-13.5-21.5       mg/L     Sulfate
ANC        165.07-262.7-453.0  μEq/L    ANC        352.1-515.2-845.5   μEq/L    ANC        168.9-343.1-700.0   μEq/L    Alkalinity
DOC        3.2-5-7.35          mg/L     DOC        1.2-1.9-2.5         mg/L     DOC        1.2-2.0-2.2         mg/L     Dissolved organic carbon
CRemote    37.7-64.6-86.2      0-100    Remote     31.2-50-81.2        0-20     HRemote    25-45-75            0-100    Remoteness score
CShade     58.9-73.3-84.6      0-100    PShade     69.1-80.1-89.7      0-100    HShade     52.0-75.2-87.4      0-100    Shading score
CEpiSub    35.6-58.3-77.9      0-100    PEpiSub    58.8-76.5-88.2      0-100    HEpiSub    27.8-61.1-83.3      0-100    Epifaunal substrate score
CInstrHab  42.2-58.1-80.8      0-100    PInstrHab  64.4-79.3-87.6      0-100    InstrHab   10-14-16            0-20     Instream habitat score
CWood      40.8-57.9-69.1      0-100    PWood      8.3-25-41.8         0-100    Wood       0-1-3               Count    Instream wood score or count
CBank      59.2-74.2-86.6      0-100    PBank      50.8-66.7-84.5      0-100    HBank      62.3-82.7-90.0      0-100    Bank stability score
Root       In CWood            ----     Root       In PWood            ----     Root       0.0-0.0-1.0         Count    # of instream rootwads
Pool       8-13-15             0-20     Pool       12-15-16            0-20     Pool       10-14-16            0-20     Pool quality score
Riffle     6-11-14             0-20     PRiffle    74.9-85.0-92.2      0-100    Riffle     7-12-15             0-20     Riffle quality score
Chan       5-8-11              0-20     Chan       8-12-15             0-20     Chan       7-15-16             0-20     Channel alteration score
Vel_dep    6-10-13             0-20     Vel_dep    11-14-16            0-20     Vel_dep    8-11-14             0-20     Veloc.-depth variability score
Aesthet    11-15-17            0-20     Aesthet    10-15-16            0-20     Aesthet    11.2-16-18          0-20     Aesthetic quality score
PHI        52.2-62.9-73.2      0-100    PHI        59.2-67.3-74.4      0-100    PHI        43.6-59.1-75.7      0-100    Physical Habitat Index
ThalDep    19.4-29.5-46.8      cm       ThalDep    22.7-32.2-43.9      cm       ThalDep    13.5-22.7-35.7      cm       Mean thalweg depth
Wid        2.4-3.7-5.9         m        Wid        3.4-5.5-8.8         m        Wid        2.2-4.0-7.0         m        Mean stream width
MaxDep     42-63-88            cm       MaxDep     52.2-71-90.1        cm       MaxDep     36-51-75.7          cm       Maximum stream depth
Sl         0.2-0.3-0.7         %        Sl         0.5-1.0-1.5         %        Sl         0.7-1.3-2.5         %        Average slope
FlowVel    0.05-0.09-0.16      m/s      FlowVel    0.14-0.22-0.32      m/s      FlowVel    0.08-0.17-0.30      m/s      Average flow velocity
DA         5.8-15.24-41.21     Km2      DA         4.9-14.5-38.7       Km2      DA         3.13-11.2-28.7      Km2      Drainage area
Ch_flow    70-81-90            %        Ch_flow    75-90-95            %        Ch_flow    70-90-96            %        % of channel covered by water
RipWid     20-50-50            m        RipWid     0-20-50             m        HRipWid    0-28-100            0-100    Riparian score or width (up to 50 m)
Agribarr   23.9-39.8-57.8      %        Agribarr   43.3-65.7-73.1      %        Agribarr   13.6-33.0-67.0      %        Agricultural land use in DA
Forwetwat  35.0-41.3-59.2      %        Forwetwat  22.9-28.3-37.7      %        Forwetwat  29.1-64.4-83.8      %        Forest + wetland + water in DA
Urban      0.49-2.7-9.9        %        Urban      0.6-2.1-8.2         %        Urban      0.0-0.2-1.2         %        Urban land use in DA
Embed      44-85-100           %        PEmbed     55.5-77.8-88.9      0-100    Embed      20-35-50            %        % fine sediment or score
IBI        3.0-3.5-4.25        1-5      IBI        2.8-3.7-4.1         1-5      IBI        2.1-3.3-4.1         1-5      Fish Index of Biotic Integrity

Variables starting with a C = metrics used to calculate new PHI in coastal sites; starting with a H = highland sites; starting with a P = piedmont sites

3.1.4. Self-Organizing Feature Maps (SOM)

The SOM consists of an unsupervised Artificial Neural Network (ANN) model, whose operation

is inspired by the way the human brain is organized when new data is presented to it (Kohonen

2001). SOMs consist of a nonlinear projection of multidimensional data vectors on a 2D grid

with a meaningful order. The SOM grid is composed of individual units, called cells or neurons,

that compete with each other in order to identify the closest, or most similar cell, to the new data

vector being presented to the system. One neuron in a trained SOM will represent a specific

number of observations that have similar characteristics. Therefore, SOM neurons can be

considered as clusters of similar observations.

The allocation of observations on the SOM starts by assigning random weights to

each one of the SOM neurons ($w_i = [w_1, w_2, \ldots, w_n]$). These weights have the same dimension as

the environmental variable input vectors ($x_j = [x_1, \ldots, x_n]$). One at a time, each observation is

presented to the SOM and compared to the neuron-based weights. The observation is then

associated with the most similar SOM neuron, which is called the Best Matching Unit (BMU).

Similarities between pairs of data and weight vectors are measured using the Euclidean distance.

Therefore, each unit in the input layer (i.e. observations of environmental vectors) is linked to

one unit in the output layer (i.e. SOM neurons).
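
The matching step described above, in which an observation is assigned to the neuron with the smallest Euclidean distance, can be sketched as follows. The function names and the three example weight vectors are illustrative assumptions; a real SOM would hold one weight vector per grid cell.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def best_matching_unit(x, weights):
    """Index of the neuron whose weight vector is closest to observation x."""
    return min(range(len(weights)), key=lambda i: euclidean(x, weights[i]))


# Three hypothetical neurons with two-dimensional weights.
weights = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
observation = [0.8, 0.2]
print(best_matching_unit(observation, weights))
```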

Subsequently, this same process is iterated for better organization of the input space. The weights

are updated using a neighborhood function. This function looks at the observations placed in a

specific neuron and the surrounding ones within a specified radius. The initial random weight is

then replaced by another vector called the generalized median (εi), which is the ‘middlemost’

vector that minimizes the sum of distances to the data observations in the neuron itself and

the surrounding ones within the used neighborhood radius (Kohonen 2001). The process is then

repeated until convergence (i.e. until a certain criterion is met [usually $w_i \cong \varepsilon_i$]) or a certain

number of iterations is completed.
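
The weight update described above can be illustrated with a simplified sketch. It pools the observations mapped to a neuron and to its grid neighbors within a radius, and then picks the pooled observation with the smallest summed distance to the others. Restricting the candidate to the observations themselves (a set median) is an assumption made here for brevity; the names `set_median` and `updated_weight`, the grid layout, and the data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def set_median(vectors):
    """Vector in `vectors` with the smallest summed distance to all others."""
    return min(vectors, key=lambda c: sum(euclidean(c, v) for v in vectors))


def updated_weight(neuron, grid, assigned, radius):
    """New weight for `neuron` from observations in its grid neighborhood.

    grid: neuron id -> (row, col) position on the SOM grid.
    assigned: neuron id -> list of observations whose BMU is that neuron.
    Returns None when the neighborhood holds no observations, in which
    case the caller would keep the current weight (an assumption).
    """
    pool = [x for other, xs in assigned.items()
            if euclidean(grid[neuron], grid[other]) <= radius
            for x in xs]
    return set_median(pool) if pool else None


# Three neurons in a row; made-up observations already matched to them.
grid = {0: (0, 0), 1: (0, 1), 2: (0, 2)}
assigned = {0: [[0.0, 0.0], [0.2, 0.1]], 1: [[0.3, 0.2]], 2: [[1.0, 1.0]]}
print(updated_weight(0, grid, assigned, radius=1))
```

Iterating the matching and update steps until the weights stop changing corresponds to the convergence criterion mentioned in the text.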

One of the properties of the SOM is that the nonlinear projection of the multidimensional input

vectors xj on the neuron grid can be considered to approximate the probability density function

p(x) of the high dimensional input data. Therefore, relevant information can be retrieved by

observing the distribution of the final neuron-based weights. Also, since the weight

learning process creates a smoothing effect on the weight vectors of the neurons, correlations

among variables become clearer. This is especially important for the understanding of highly

complex, natural systems in which one observation can be the outcome of multiple variable

combinations. The smoothing effect is also important to identify correlations in discrete or

crudely scaled data because the final trained neuron weights have a more continuous nature than

the initial input data.

SOMs have been used in several environmental applications, usually for data exploration

purposes in combination with more conventional techniques (Manolakos et al. 2007; Tran et al.

2003), spatial analysis and site classification and characterization (Cereghino et al. 2001; Tran et

al. 2003), identification of the main traits of the biotic community (Chon et al. 1996), or

prediction of the probability of presence/absence of fish species in specific sites after some

anthropogenic change in the study area took place (Park et al. 2003).


3.1.5. Initial data clustering and SOM neuron analysis

In the case of Maryland and Ohio (instream data), all the available physical and chemical

environmental variables were used to train the SOM. In the case of Ohio with offstream data,

only regional land use and fragmentation metrics were used to train the SOM because these

variables were deemed responsible for the background quality of the biologic integrity in a

specific area. Point source density and intensity and local land use were deemed too local or

non-ubiquitous to have a significant effect on the overall IBI variability and were therefore not used in the

SOM training. Unprocessed data for each variable were log-transformed (natural logarithm) and rescaled to the

range [0, 1]. This step was necessary so that differences in scale among the input variables did not give

some variables more weight than others in the final SOM output.
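A minimal sketch of this preprocessing step is given below. The function name `log_minmax` is hypothetical, and the sketch assumes that every raw value is strictly positive; the text does not state how zeros or non-detects were handled before the logarithm was taken.

```python
import math

def log_minmax(values):
    """Natural-log transform a variable, then rescale it linearly to [0, 1]."""
    logged = [math.log(v) for v in values]  # assumes every value is > 0
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # A constant variable carries no information; map it to 0.
        return [0.0 for _ in logged]
    return [(v - lo) / (hi - lo) for v in logged]

# Made-up concentrations spanning two orders of magnitude.
scaled = log_minmax([1.0, 10.0, 100.0])
print(scaled)
```

After this step every variable spans the same range, so no variable dominates the Euclidean distances used to find the BMU.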

The size of the SOM (number of neurons) was mainly determined by the topographic error,

although the quantization error was also checked. The topographic error is defined as the

proportion of input data vectors for which the first and second most similar SOM neurons are not

adjacent in the grid of neurons (Kiviluoto 1996). The quantization error is defined as the average

distance (Euclidean) between each input data vector and its BMU. In our research, the optimum

number of neurons was found by choosing the number that had the minimum topographic error.

The quantization error usually decreases monotonically with SOM size. Since a very large map

size was undesirable given the available data set size, it was deemed less important and was not

used to determine the optimum map size. The maximum number of SOM neurons was limited to

100. SOMs with 60 (6 × 10) and 72 (8 × 9) neurons were used for the initial SOM training in

Ohio with instream and offstream data respectively. SOMs with 48 (6 × 8), 54 (6 × 9), and 54

(6 × 9) neurons were used for the coastal, highland and piedmont regions in Maryland


respectively. The SOM training consisted of 20 and 100 epochs for the coarse and fine-tuning

map training respectively.
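The two map-quality measures defined above can be computed as in the following sketch. It is an illustration under stated assumptions, not the software used in this research: the function name `som_errors` is hypothetical, the grid is assumed to be rectangular with neurons stored in row-major order, and two neurons are treated as adjacent when their grid coordinates differ by at most one step in each direction.

```python
import math

def som_errors(data, weights, n_cols):
    """Return (quantization_error, topographic_error) for a rectangular SOM.

    weights: row-major list of neuron weight vectors; n_cols: grid width.
    Adjacency rule (an assumption): grid coordinates differ by at most 1.
    """
    total_dist = 0.0
    n_bad = 0
    for x in data:
        dists = [math.dist(x, w) for w in weights]
        order = sorted(range(len(weights)), key=dists.__getitem__)
        first, second = order[0], order[1]
        total_dist += dists[first]           # distance to the BMU
        r1, c1 = divmod(first, n_cols)
        r2, c2 = divmod(second, n_cols)
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:
            n_bad += 1                       # BMU and runner-up not adjacent
    return total_dist / len(data), n_bad / len(data)

# Made-up 1 x 3 map with one variable and two observations.
qe, te = som_errors([[0.0], [1.0]], [[0.0], [1.0], [0.1]], n_cols=3)
print(qe, te)
```

Candidate map sizes could then be compared by training one SOM per size and keeping the size with the smallest topographic error, as described in the text.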

There exists one element in each weight vector corresponding to each one of the environmental

variables included in the input data vectors used for the SOM training. Therefore, a vector of

SOM neuron weights could be extracted for each environmental variable used in the SOM

patterning. Also, the IBI values of the patterned observations in each SOM neuron were

averaged. Hence, a neuron-based average IBI value was determined for each SOM neuron. The

correlation matrix among the environmental weight vectors and the neuron-based mean IBI

vector was computed. The goal was to evaluate the effect of each environmental variable over

IBI and also reveal relationships among environmental variables. The absolute values of the

neuron-based IBI-variable correlation coefficients were sorted in descending order. Variables

with a higher variable-IBI absolute correlation coefficient were considered to have a greater

overall impact on the biological community and vice versa.
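The ranking step can be illustrated with the sketch below. The helper names `pearson_r` and `rank_by_abs_correlation` are hypothetical, and the four-neuron weights and mean IBI values are made up for the example; they are not data from this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_by_abs_correlation(weights_by_variable, neuron_mean_ibi):
    """Sort variables by |r| between their neuron weights and neuron mean IBI."""
    r = {name: pearson_r(w, neuron_mean_ibi)
         for name, w in weights_by_variable.items()}
    return sorted(r.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up neuron weights (one value per neuron) and neuron-based mean IBI.
neuron_weights = {
    "embeddedness": [0.9, 0.7, 0.4, 0.1],
    "pH": [0.5, 0.4, 0.6, 0.5],
}
ranked = rank_by_abs_correlation(neuron_weights, [20.0, 28.0, 36.0, 44.0])
print(ranked)
```

Sorting on the absolute value keeps strongly negative relationships, such as embeddedness in this example, at the top of the ranking.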

3.1.6. Second SOM data clustering

A second SOM training was performed using variables that showed a significant impact on IBI

(neuron-based IBI-variable |r| ≥ 0.5 in Maryland and Ohio with instream data and |r| ≥ 0.4 in

Ohio with offstream data), and were not highly correlated to a more relevant variable

(variable-variable |r| < 0.8). In the case of Ohio with offstream data, the large-scale variable correlation

coefficient criterion was relaxed: had the initial criterion been kept, only one variable would

have been available for the 2nd SOM training (see Table 3-8), because the other large-scale variables

(i.e. with IBI-variable |r| ≥ 0.5) were discarded due to cross-correlation. Relaxing the

criterion allowed more than one variable to be used for the 2nd SOM patterning.

Therefore, the initial dataset was reduced to a smaller one that included only variables with great

overall effect on IBI (large-scale variables or environmental gradients). The number of SOM

neurons was again determined with the topographic error. SOMs with 72 (6 × 12) and 78

(6 × 13) neurons were used in Ohio with instream and offstream data respectively. SOMs with 48

(6 × 8), 70 (7 × 10), and 45 (5 × 9) neurons were used in Maryland’s coastal, highland, and piedmont

sites respectively.
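The two selection criteria can be read as a greedy filter, sketched below. The function name `select_large_scale` is hypothetical, and comparing each candidate only against the variables already retained is an assumption about how "a more relevant variable" was applied. The coefficients in the example are a subset of those reported for Ohio instream data in Section 3.2.1.1 and Table 3-5; pairs that are not listed are treated as uncorrelated purely for illustration.

```python
def select_large_scale(ibi_r, cross_r, ibi_cut=0.5, cross_cut=0.8):
    """Keep variables with |r_IBI| >= ibi_cut that are not correlated
    (|r| >= cross_cut) with an already retained, more relevant variable."""
    def pair_r(a, b):
        # Look the pair up in either order; unlisted pairs default to 0.
        return cross_r.get((a, b), cross_r.get((b, a), 0.0))

    kept = []
    for name in sorted(ibi_r, key=lambda v: abs(ibi_r[v]), reverse=True):
        if abs(ibi_r[name]) < ibi_cut:
            continue
        if all(abs(pair_r(name, k)) < cross_cut for k in kept):
            kept.append(name)
    return kept

ibi_r = {"embeddedness": -0.861, "riffle": 0.815, "gradient": 0.711,
         "DO": 0.664, "TKN": -0.63, "ammonia": -0.62,
         "total iron": -0.52, "total zinc": -0.42}
cross_r = {("riffle", "embeddedness"): -0.965,
           ("ammonia", "TKN"): 0.961,
           ("total iron", "DO"): -0.850}
selected = select_large_scale(ibi_r, cross_r)
print(selected)
```

For Ohio with offstream data the same sketch would be called with `ibi_cut=0.4`.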

3.1.7. Site patterning based on ‘large-scale’ variables and associated biotic

responses

The neurons from the 2nd SOM patterning were grouped into different clusters of similar units

with an agglomerative Hierarchical Clustering (HC) using the average linkage method and the

standardized Euclidean distance (Jain et al. 1999). The neuron-based SOM weights for each of

the variables used in the 2nd SOM patterning were used for this purpose. Therefore, groups with

different environmental characteristics were obtained and the corresponding IBI observations in

each one of these groups retrieved for analysis. The final number of groups in the hierarchical

structure was the maximum number of statistically different biotic responses (determined by the

group IBI) these variables were able to segregate with no or little overlapping among groups (see

Figure 3-3). The process started with 2 groups. If these two groups of observations yielded 2

different biotic responses, three groups were tested and so forth. An ANOVA F-test at the 95%

confidence level was performed to test the null hypothesis that the groups’ IBI means were

equal. If the null hypothesis was rejected (p < 0.05), Multiple Range Tests (MRT) using

Fisher’s Least Significant Difference (LSD) method at the 95% confidence level were

performed. This consists of a pair-wise comparison of the group IBI means. Thus, statistically

different biologic qualities corresponding to different environmental conditions were separated.

The number of groups that yielded the clearest separation of biologic responses (i.e. with the

largest possible number of IBI categories with no or little overlap among groups) was selected.
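The grouping of neurons can be sketched as a naive agglomerative procedure. The code below is an illustration only, not the implementation used in this research: the function name `average_linkage` is hypothetical, the neuron weights are made up, and the standardized Euclidean distance is taken here to mean that each squared difference is divided by the sample variance of that variable (every variable is assumed to have non-zero variance).

```python
import math
import statistics

def average_linkage(points, n_groups):
    """Agglomerative clustering with average linkage on standardized
    Euclidean distances; returns clusters as lists of row indices."""
    n_vars = len(points[0])
    # Sample variance of each variable, used to standardize the distance.
    var = [statistics.variance([p[k] for p in points]) for k in range(n_vars)]

    def dist(i, j):
        return math.sqrt(sum((points[i][k] - points[j][k]) ** 2 / var[k]
                             for k in range(n_vars)))

    def linkage(a, b):
        # Mean of all pairwise distances between members of a and b.
        return sum(dist(i, j) for i in a for j in b) / (len(a) * len(b))

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_groups:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = min(pairs,
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    return clusters

# Made-up weights of four neurons for two large-scale variables.
neuron_weights = [[0.10, 0.20], [0.12, 0.22], [0.90, 0.80], [0.88, 0.83]]
groups = average_linkage(neuron_weights, n_groups=2)
print(groups)
```

In the procedure described above, `n_groups` would be increased from 2 upward, and after each cut the IBI observations of the resulting groups would be compared with the ANOVA and LSD tests.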


Figure 3-3. Example of a hierarchical tree of the 2nd SOM neurons (left) and analysis of differences among group biologic responses (right). On the right, example of MRT analysis. Overlap indicates non-significant differences in group IBI means. No overlap indicates significantly different group IBI means. In this case, the Level 4 partition would be chosen because it yields the largest number of different biotic responses (5) with less overlap than Level 5 (figure for clarification purposes only).


3.1.8. Site patterning based on ‘small-scale’ variables and associated biotic

response

To account for the potential effect of variables acting at a local scale, each group obtained in the

previous step was subdivided using small-scale variables one at a time. Small-scale variables

were those with an absolute neuron based variable-IBI correlation coefficient smaller than 0.5 (or

0.4 in the case of Ohio with offstream data along with local land use and point source metrics).

Again, this process was executed in a hierarchical manner. The order with which the different

variables were tested was determined by the variable ranking from the neuron-based variable-IBI

correlation coefficient in the initial SOM analysis. If the subgroups’ biologic responses were

statistically different according to ANOVA, the main group was split into new subgroups,

otherwise it was not split. This procedure was repeated with all the available variables that were

not used in the 2nd SOM training and were not highly correlated to other variables. Figure 3-4

and Figure 3-5 show a flow chart summarizing this methodology.

3.1.9. IBI response curve development for different levels of watershed

characterization

Each group obtained at each level of clustering represents a separation of the IBI responses given

different environmental conditions. An assumption made in the present research is that the biotic

community response would follow a Gaussian distribution if the environmental characteristics of

the groups were homogeneous enough. Normality can be achieved at different levels of group

characterization. This condition would be achieved depending on what part of the overall

biologic variability is explained with the identified group stressors. Departure from normality

would mean that the current level of characterization is not sufficient because heterogeneous

conditions give rise to different populations. Groups that follow a Gaussian

distribution are indicative of more homogeneous conditions and further subdivisions would lead

to at least one new, narrower normal distribution because the system is defined in greater detail.

To confirm the normality condition in the different groups, the group cumulative distribution

function (CDF) was plotted in a normal probability plot. A straight CDF would be an indication

of normality (Chambers et al. 1983). Large deviations from the straight pattern would

indicate that the group was not homogeneous enough to guarantee normality. Moreover, a

Jarque-Bera statistical test for normality at the 95% confidence level was also performed in each

group (Jarque and Bera 1987). This test was chosen over more traditional ones such as the

Kolmogorov-Smirnov test because the group distribution was unknown. The Lilliefors test for

normality was also rejected because it required large amounts of data in order to be performed.

The Jarque-Bera test is considered more robust and is based on the sample skewness and

kurtosis. Some authors recommend this test over the rest (Gujarati 2003; Judge et al. 1985).
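For reference, the Jarque-Bera statistic combines the sample skewness $S$ and kurtosis $K$ as $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$. A minimal sketch follows; the function name `jarque_bera` is hypothetical, and the p-value uses the asymptotic chi-square distribution with 2 degrees of freedom, which is an assumption that may be inaccurate for the small groups discussed in this chapter.

```python
import math

def jarque_bera(sample):
    """Return (JB statistic, asymptotic p-value) for a sample."""
    n = len(sample)
    mean = sum(sample) / n
    # Biased (divide-by-n) central moments.
    m2 = sum((v - mean) ** 2 for v in sample) / n
    m3 = sum((v - mean) ** 3 for v in sample) / n
    m4 = sum((v - mean) ** 4 for v in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

# Made-up IBI scores for one group.
jb, p = jarque_bera([24.0, 28.0, 30.0, 31.0, 33.0, 36.0, 40.0])
print(jb, p)
```

At the 95% confidence level, normality would be rejected when the p-value falls below 0.05.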

For the sake of brevity, only the normal probability plots at one level of system characterization

were plotted in each case. This level corresponded to the main biologic signatures identified after

clustering the 2nd SOM neurons (in Maryland [Figure 3-17, Figure 3-21, and Figure 3-25] and

Ohio using instream data [Figure 3-7]) or the biologic responses found after clustering with site’s

percentage of fragmented stream network (Ohio using offstream data [Figure 3-11]).


Figure 3-4. Flow chart summarizing the methodology used to characterize response of the biologic community to similar environmental characteristics and stressors (Maryland and Ohio with instream data)


Figure 3-5. Flow chart summarizing the methodology used to characterize response of the biologic community to similar environmental characteristics and stressors (for Ohio with offstream data)


3.1.10. Development of biotic response reference curves

The IBI observations above the 75th percentile in each group (at the selected level of system

characterization) were separated and considered as group reference conditions. The IBI 75th

percentile was selected arbitrarily. However, another reference percentile could be selected if

more/less stringent criteria were to be met. The IBI response in sites above the 75th percentile

was considered to resemble pristine or realistically achievable conditions, and these sites were therefore

considered reference sites. New CDF curves for the reference and impaired scenarios were

developed.

Departure from reference conditions was evaluated in each group. Student’s t-tests at the 95%

confidence level were performed to test the null hypothesis that the reference and impaired group

means for the different environmental variables were equal.
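The reference/impaired split and the comparison of variable means can be sketched as follows. The function names `split_reference` and `pooled_t` are hypothetical, the site values are made up, and the percentile convention (`method="inclusive"`) is an assumption, since several definitions of a sample percentile exist and the text does not state which one was used. The sketch returns the pooled-variance t statistic and its degrees of freedom; the comparison with a critical value is left out.

```python
import math
import statistics

def split_reference(ibi, values):
    """Split a variable into reference (IBI above the 75th percentile)
    and impaired observations; returns (threshold, reference, impaired)."""
    threshold = statistics.quantiles(ibi, n=4, method="inclusive")[2]
    reference = [v for s, v in zip(ibi, values) if s > threshold]
    impaired = [v for s, v in zip(ibi, values) if s <= threshold]
    return threshold, reference, impaired

def pooled_t(a, b):
    """Student's t statistic (equal variances) and degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

# Made-up IBI scores and TKN concentrations for eight sites in one group.
ibi = [20, 24, 26, 28, 30, 34, 38, 44]
tkn = [2.1, 1.8, 1.9, 1.5, 1.6, 1.2, 0.6, 0.5]
threshold, ref_tkn, imp_tkn = split_reference(ibi, tkn)
t_stat, df = pooled_t(ref_tkn, imp_tkn)
print(threshold, t_stat, df)
```

A negative t statistic here means that the reference sites have a lower mean TKN than the impaired sites.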


3.2. Results and discussion

3.2.1. Ohio: instream data

3.2.1.1. Biotic response separation

The correlation matrix of the neuron-based environmental weight vectors and the neuron-based

average IBI vector after the initial SOM training is shown in Figure 3-6.

Figure 3-6. Correlation matrix of the variable neuron-based weights and neuron-based average IBI values in the trained SOM.

The variables that showed a relevant influence on IBI (|r| ≥ 0.5) were, in decreasing order:

embeddedness (r = -0.861), riffle quality (r = 0.815), substrate quality (r = 0.81), channel quality

(r = 0.789), QHEI (r = 0.789), cover quality (r = 0.732), pool quality (r = 0.722), gradient score

(r = 0.711), DO (r = 0.664), TKN (r = - 0.63), riparian quality (r = 0.625), ammonia (r = - 0.62),


total arsenic (r = -0.61), BOD (r = -0.61), nitrite (r = -0.57), sulfate (r = -0.54), drainage area (r =

0.54), and total iron (r = -0.52).

The significant variables that were subsequently eliminated due to cross-correlation with more

relevant variables (|r| ≥ 0.8) were: riffle, substrate, channel, cover, and pool qualities, and QHEI,

which were highly correlated to embeddedness (r = -0.965, -0.926, -0.920, -0.856, -0.823, and

-0.890 respectively). Ammonia, total arsenic, and BOD were correlated to TKN (r = 0.961, 0.893,

0.963 respectively). Total iron was negatively correlated to DO (r = -0.850). Thus, the remaining

variables for the second SOM patterning were embeddedness, gradient score, DO, TKN, riparian

quality, nitrite, sulfate, and drainage area. Nitrite (NO2-N) was then disregarded because we wanted to analyze

the effect of Zn on IBI, and Zn was highly correlated to NO2-N (r = 0.806). A summary of the

variables used is presented in Table 3-5.

The variables used in the second SOM training were considered as environmental gradients or

large-scale variables responsible for the largest part of the biotic variability. Hierarchical

clustering of neurons from the second SOM yielded six groups with different environmental

conditions and five significantly different biologic responses according to ANOVA, as shown in

Table 3-6 and Figure 3-7.

In order to be able to account for environmental variables at the local scale, each of the groups

was clustered using the variables that were deemed not relevant (i.e. neuron-based variable-IBI

|r| < 0.5). The clustering was performed using one variable at a time. New subgroups were

created only if their biologic qualities were significantly different according to ANOVA. The


small-scale variables able to separate sites with different levels of IBI were (in the same order in

which they were patterned): total zinc concentration, pH, and nitrate concentration. Total copper,

TSS, and total cadmium and lead concentrations did not bring any further separation of IBI

responses (see Figure 3-7).

Within the available observations, very large watersheds (group 5 average DA = 2,303.1 km2)

had the best IBI scores (μ = 42.82, σ = 5.81). Headwater streams (DA < 51.8 km2) mainly

belonged to group 3 (average DA = 41.15 km2) and had the worst IBI scores (μ = 24.08, σ =

7.08). Group 3 had the highest values of embeddedness, TKN and sulfate concentrations, and the

second lowest DO and riparian quality. Sites from group 1, with the second smallest average DA

(84.72 km2), also had the second poorest IBI scores after group 3 (μ = 27.84, σ = 8.41). This

might be an indication of greater resilience to degradation of larger watersheds, since Ohio’s IBI

is calibrated with drainage area (Ohio_EPA 1987). A positive correlation between IBI and drainage area

in Ohio was also found by Dyer et al. (2000). In Ohio, high levels of total phosphorus (which

was not used due to high correlation to TKN) were associated with poor IBI (Rankin et al. 1999).

The two groups with highest levels of TKN (groups 3 and 1 with mean TKN equal to 2.43 and

1.37 mg/L respectively) had the poorest IBI scores. A summary of the average environmental

variables at each level of characterization is included in Appendix I.


Table 3-5. Neuron-based correlation coefficients between variables and IBI. Variables in bold were able to separate significantly different biotic responses

Variable        IBI-Variable r      Variable        IBI-Variable r
Embed (L)           -0.86           DA (L)               0.54
Riffle (LC)          0.815          Fe (LC)             -0.52
Subs (LC)            0.81           TP (SC)             -0.48
Chan (LC)            0.789          Mg (SC)             -0.47
QHEI (LC)            0.789          Cu (SC)             -0.46
Cov (LC)             0.732          Zn (S)              -0.42
Pool (LC)            0.722          Cond (SC)           -0.39
Grad (L)             0.711          pH (S)               0.382
DO (L)               0.664          Cl (SC)             -0.37
TKN (L)             -0.63           Hard (SC)           -0.37
Rip (L)              0.625          TSS (SC)            -0.36
NH4 (LC)            -0.62           NO3 (S)              0.312
As (LC)             -0.61           Cd (S)              -0.27
BOD (LC)            -0.61           Ca (SC)             -0.25
NO2 (LD)            -0.57           Temp (S)             0.222
SO4 (L)             -0.54           Pb (S)              -0.08

L = large-scale variables or environmental gradients; S = small-scale variables; C = variables cross-correlated to higher hierarchy variables; D = disregarded variable

Table 3-6. ANOVA (top) and MRT (bottom) analyses for the IBI means in groups after 2nd SOM patterning with environmental gradients shown in Figure 3-7. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups.

ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source             Sum of Squares     Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups            15825.0      5         3165.0      53.95     0.0000
Within groups             24816.5    423        58.6677
-----------------------------------------------------------------------------
Total (Corr.)             40641.5    428

Multiple Range Tests
--------------------------------------------------------------------------------
          Count     Mean        Homogeneous Groups
--------------------------------------------------------------------------------
IBI3         35     24.0        X
IBI1         87     27.8391      X
IBI6         69     28.4348      X
IBI4         71     31.0423       X
IBI2        111     38.6667        X
IBI5         56     42.8214         X
--------------------------------------------------------------------------------
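The arithmetic behind Table 3-6 can be checked with a short sketch. The sums of squares, degrees of freedom, counts and means are copied from the table; the helper names `lsd` and `differ` are hypothetical, and the critical value `T_CRIT` is an approximate two-sided 95% Student's t quantile for 423 degrees of freedom supplied here as an assumption.

```python
import math

# Values copied from Table 3-6.
ss_between, df_between = 15825.0, 5
ss_within, df_within = 24816.5, 423

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # mean square within groups
f_ratio = ms_between / ms_within

def lsd(n_a, n_b, ms_error, t_crit):
    """Fisher's least significant difference between two group means."""
    return t_crit * math.sqrt(ms_error * (1.0 / n_a + 1.0 / n_b))

T_CRIT = 1.966  # assumed approximate t quantile (two-sided 95%, 423 df)
groups = {"IBI3": (35, 24.0), "IBI1": (87, 27.8391), "IBI6": (69, 28.4348)}

def differ(a, b):
    """True when two group means differ by more than the LSD."""
    (n_a, m_a), (n_b, m_b) = groups[a], groups[b]
    return abs(m_a - m_b) > lsd(n_a, n_b, ms_within, T_CRIT)

print(round(f_ratio, 2))
print(differ("IBI3", "IBI1"), differ("IBI1", "IBI6"))
```

Under these assumptions the means of groups 3 and 1 differ by more than the LSD, whereas those of groups 1 and 6 do not, which is consistent with six groups yielding five distinct biotic responses.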


Clustering with small-scale variables resulted in the creation of 10 new subgroups. Clustering

with total Zn resulted in the separation of few sites (3 in group 11 and 4 in group 32) in which

zinc seemed to be the cause of degradation (mean concentration equal to 178.67 and 44.75 µg/L

in groups 11 and 32 respectively versus 21.26 and 15.80 in groups 12 and 31 respectively [see

Figure 3-7 and Appendix I]). Clustering with pH values resulted in the creation of four new

subgroups after dividing groups 12 and 31: subgroup 121 (mean pH = 7.82 with 74 sites) versus

subgroup 122 (mean pH = 8.84 with 10 sites), and subgroup 311 (mean pH = 7.62 with 25 sites)

versus subgroup 312 (mean pH = 8.30 with 5 sites). Finally, nitrate concentration yielded two new groups

out of group 6: subgroup 61 (mean [NO3] = 8.33 mg/L and 32 sites) versus subgroup 62 (mean

[NO3] = 0.62 mg/L and 37 sites). Even though nitrate is a nutrient-related parameter and one

might think it should be more an environmental gradient than a small-scale variable, a clear

relationship between IBI and NO3-N concentration has not been found in Ohio (Rankin et al.

1999). High concentrations of NO3-N can be associated with the presence of waste water treatment

plants (WWTP) or intensive agriculture tile drainage. Negative effects should not be observed

until the median NO3-N concentration is greater than 3-4 mg/L (Rankin et al. 1999). Average

NO3-N concentrations in groups 11 and 61 surpassed this threshold and had significantly poorer

IBI than their homologues (groups 12 and 62) (see Appendix I). All the new subgroups obtained

with small-scale variables had statistically significant differences in IBI means in the ANOVA

test. A summary of the results after clustering with large and small-scale variables is shown in

Figure 3-7 and Appendix I.

One of the assumptions of our research was that the biological response would follow a normal

distribution if biota’s exposure to environmental conditions is homogeneous enough after the

system reaches a steady state (i.e. even if a specific group is far from reference conditions, its


biologic community has adapted to that level of stress by changing its structure). This

hypothesis was confirmed when the tests for normality were performed. With the full database

(i.e. with highly heterogeneous conditions), the tests for normality indicated that the IBI scores

did not follow a normal distribution. However, when the data were divided into more

homogeneous groups using large and small scale variables, they followed a normal distribution.

Group 11 and 12 were the only two exceptions. However, group 11 included only three

observations and therefore, the result was most likely due to a non-representative sample of the

group distribution. Subdivision of group 12 resulted in two subgroups that followed normal

distributions.

Figure 3-7. Groups and subgroups with different biological responses after clustering with large and small-scale environmental filters. Red color marks groups that did not pass normality tests. Blue color indicates groups that passed the normality tests.

The normal probability plots of the six groups from large-scale variables are shown in Figure

3-8. Similar plots could be obtained for each subgroup obtained with small-scale variables if

enough data were available. Unfortunately, this was not the case for some groups (only 3

observations in group 11, 4 in group 32, 10 in group 122, and 5 in group 312). These plots are a

characterization of the biotic response after passing through the specified environmental filters

(Figure 3-8 shows responses after large-scale filters are passed). With this methodology it is


possible to isolate the response of the biotic community to a specific stressor in a hierarchical

manner (i.e. the specific effect of a stressor will be revealed together with other relevant stressors

with a higher hierarchy in the tree as shown in Figure 3-7).

3.2.1.2. Reference conditions for similar environmental sites and potential causes for departure

Reference sites represent the environmental conditions that could potentially be met by other

sites with similar characteristics. They could also be a potential framework for development of

biologic standards, similar to water quality standards in which a maximum probability of

exceedance is set with a log-normal probability plot. Exceedance probabilities beyond a set

threshold would represent a violation of the standard (Novotny 2004). Analysis of significant

differences between reference and impaired sites may indicate the likely causes of departure

from reference conditions. However, impairment is usually not the result of a single, isolated

stressor, but a highly intertwined combination of environmental factors structured in a

hierarchical manner that propagate through the hierarchy producing a response of the biotic

endpoint (Novotny 2003).

The 75th IBI percentile was identified in each group and sites beyond this threshold were

considered as reference sites. This corresponded to IBI scores equal to 32, 44, 28, 34, 48, and 34

for groups 1 through 6 respectively. In some of these groups the IBI was far from being

considered as good (48≤ IBI ≤ 52) (Karr et al. 1986; Rankin et al. 1990). However, reference

conditions do not necessarily refer to pristine environments but to least impacted watersheds

within a highly homogeneous group. Pristine or undisturbed streams do not really exist in Ohio

anymore. The character of the reference sites should reflect the reasonably attainable biological

conditions within a particular homogeneous group given the prevailing background conditions


(Ohio_EPA 1987). The 75th IBI percentile was selected because departure from normality was

observed at that point in groups 1, 3, and 4 (Figure 3-8). This was interpreted as an abrupt shift in

all or some of the environmental gradients from impaired to reference conditions (i.e. a break or

gap in the environmental gradient ‘continuum’). Therefore, the biologic community responded

somewhat differently in these sites. Groups 2 and 6 showed the best overall goodness-of-fit in

the plots. Group 2 did not undergo any further division with new environmental filters (see Figure

3-7), which could indicate a highly continuous biotic response to the environmental gradients

and absence of significant local stressors. Group 6 was divided into two groups of similar size using nitrate

concentration (32 and 37 observations in groups 61 and 62 respectively). This was interpreted as

a shift in the biologic response due to increased nutrient loading. When a threshold concentration

was surpassed, the biologic community response was somewhat different.

However, the shift in behavior was not abrupt enough to be perceived in the normal

probability plots for level I groups (see group 6 in Figure 3-8).

Figure 3-9 shows the curves for reference and impaired sites in each one of the six groups. The

t-tests found significant differences between reference and impaired sites in groups 1, 2, and 5. No

significant differences were observed within the rest of the groups (Table 3-7). TKN was the

only environmental gradient that showed consistently better results in reference sites, and can be

an indication of progressive degradation due to increased nutrient input with changing land uses.

Embeddedness was also better in all reference sites with the exception of group 6. Sediment (and

therefore, all substrate-related habitat parameters) and nutrient input have been identified as the

most relevant factors for biotic degradation and are intimately related to land use and hydrologic

changes in the drainage and buffer areas, especially at the catchment scale (Allan 2004; Archer

and Newson 2002; Dyer et al. 1998a; Gilvear et al. 2002; Hall and Killen 2005; Manolakos et al.


2007; Richards et al. 1996; Shields et al. 2006; Yuan and Norton 2004). All the environmental

gradients identified in the research are directly or indirectly related to substrate quality (i.e.

embeddedness, DA, gradient, riparian quality) and nutrient input (DO, TKN, and sulfate

concentration). Even though the t-tests did not show statistically significant differences among

most of the variables, TKN and embeddedness seem to be the most consistent variables in the

differentiation of reference and impaired environmental qualities in Ohio.


Figure 3-8. Normal distribution probability plots for groups 1 through 6. Red line indicates 75th IBI percentile. Points to the right of the red line were considered as reference observations for the respective group of sites and separated.


Table 3-7. 95% confidence intervals for the environmental variable means in reference and impaired sites. Text in bold indicates statistically significant differences for that variable and group according to the t-tests ( p = 0.05)

             Variable            Group 1            Group 2            Group 3
Reference    Drainage area       16.32 ± 6.65       65.20 ± 34.49      9.28 ± 3.639
             Dissolved Oxygen    8.68 ± 1.04        7.69 ± 0.59        5.98 ± 2.15
             Embeddedness        3.28 ± 0.35        2.15 ± 0.17        3.67 ± 0.33
             Gradient score      6.36 ± 1.08        8.79 ± 0.57        8.44 ± 1.28
             Riparian score      3.29 ± 0.29        6.58 ± 0.47        4.11 ± 0.50
             Sulfate concent.    90.92 ± 25.39      80.82 ± 17.07      300.33 ± 144.78
             TKN                 1.31 ± 0.97        0.48 ± 0.21        2.18 ± 1.61
Impaired     Drainage area       38.26 ± 10.29      65.98 ± 19.01      18.18 ± 9.05
             Dissolved Oxygen    8.53 ± 0.75        8.08 ± 0.40        7.02 ± 1.24
             Embeddedness        3.69 ± 0.12        2.36 ± 0.11        3.85 ± 0.15
             Gradient score      4.92 ± 0.35        9.16 ± 0.30        7.61 ± 0.60
             Riparian score      3.61 ± 0.46        5.73 ± 0.38        3.61 ± 0.47
             Sulfate concent.    155.14 ± 30.10     102.88 ± 13.93     254.58 ± 72.65
             TKN                 1.38 ± 0.43        0.56 ± 0.08        2.52 ± 1.69

             Variable            Group 4            Group 5            Group 6
Reference    Drainage area       29.79 ± 9.13       703.75 ± 435.57    152.98 ± 65.21
             Dissolved Oxygen    7.26 ± 0.78        8.63 ± 1.14        7.69 ± 0.78
             Embeddedness        3.58 ± 0.21        2.21 ± 0.19        3.56 ± 0.24
             Gradient score      5.89 ± 0.23        9.57 ± 0.49        8.82 ± 0.52
             Riparian score      5.17 ± 0.75        5.82 ± 0.75        5.71 ± 1.03
             Sulfate concent.    67.22 ± 22.10      62.36 ± 24.18      138.42 ± 40.55
             TKN                 0.55 ± 0.21        0.45 ± 0.10        0.76 ± 0.28
Impaired     Drainage area       465.99 ± 387.76    992.23 ± 541.09    90.61 ± 19.14
             Dissolved Oxygen    6.43 ± 0.48        8.41 ± 0.58        7.72 ± 0.48
             Embeddedness        3.70 ± 0.11        2.30 ± 0.14        3.43 ± 0.16
             Gradient score      6.04 ± 0.27        9.29 ± 0.30        9.15 ± 0.30
             Riparian score      6.48 ± 0.59        6.18 ± 0.37        5.87 ± 0.56
             Sulfate concent.    64.63 ± 9.22       56.90 ± 14.72      207.19 ± 42.88
             TKN                 0.78 ± 0.12        0.82 ± 0.34        0.93 ± 0.17

The rest of the environmental gradients did not show a clear pattern between reference and non-

reference sites. However, the differences among them were never large. Hence, reference sites

can be initially screened based on their substrate-related parameters (e.g. degree of

embeddedness compared to a reference site) and nutrient inputs (e.g. TKN and phosphorus

levels). Different combinations of the rest of the gradients would determine the final biotic integrity.

Land use data in the drainage area and the riparian buffer at different scales would, most likely,

have helped refine the watershed classification. Different combinations of local and regional land uses


in the drainage area and the riparian buffer are the main regulators of sediment and nutrient

input. Morphologic characteristics (e.g. gradient) can also play a significant role. Unfortunately,

these data were not available at the time this analysis was performed.

Figure 3-9. Normal probability plots for the reference (green) and impaired (red) conditions for the six groups obtained after clustering the SOM neurons with environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group


3.2.2. Ohio offstream data

3.2.2.1. Biotic response separation

The correlation matrix of the neuron-based regional environmental variable vectors and the

neuron-based average IBI is presented in Figure 3-10.

Figure 3-10. Correlation matrix of the variable neuron-based weights and neuron-based average IBI scores in the trained SOM. Color bar on the right indicates the absolute value of the correlation coefficient. Plus and minus signs indicate positive or negative correlation.

The regional variables that showed a strong effect on biotic integrity (neuron-based IBI-variable

|r| ≥ 0.4) were (in descending order): R30_Forest, R100_Forest, RDA_Forest (r = 0.662, 0.646,

0.579 respectively), R30_Agric (r = -0.483), R30_Barren (r = 0.460), R100_Agric (r = -0.436),

DW_storMainlen (r = 0.430). Some of these variables were eliminated due to strong correlation


with other more significant variables. R100_Forest, RDA_Forest, and R30_Agric were strongly

correlated to R30_Forest (r = 0.996, 0.910, and -0.826 respectively). Therefore, the variables that

should theoretically have been used for the 2nd SOM patterning were R30_Forest, R30_Barren,

R100_Agric, and DW_storMainlen. However, we decided to disregard the variable

DW_storMainlen because the IBI-variable correlation coefficient sign seemed counter-intuitive

(the correlation was positive, which would mean an IBI improvement with greater dam water

storage capacity per main channel length unit). A negative correlation was expected in this case.

In the present thesis (see Chapter 2) and elsewhere (Dyer et al. 2000) it has been reported that

Ohio’s IBI is positively correlated to drainage area. Hence, fragmentation metrics whose final

units had some element directly or indirectly related to drainage area (e.g. UPS_floodlen

[m2/Km]) were deemed biased. For this reason, only unitless fragmentation metrics were kept

for further analysis (i.e. UPS_Floodarea, UPS_Con, and SITE_Con). The rest were disregarded.

As a consequence, the only variables used in the 2nd SOM patterning were R30_Forest,

R30_Barren, and R100_Agric.
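The screening steps described above (keep variables with |r| ≥ 0.4 against IBI, then drop any variable strongly correlated with a more significant one) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy data, and the cross-correlation cutoff of 0.8 are hypothetical (the text gives no explicit cutoff for the Ohio data), and the manual exclusions made on interpretive grounds (e.g. DW_storMainlen) are not reproduced.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_variables(weights, ibi, r_min=0.4, r_cross=0.8):
    """Return the variable names kept for the next patterning step.

    weights: dict mapping variable name -> list of neuron-based weights
    ibi:     list of neuron-based average IBI scores (same neuron order)
    r_min:   minimum |r| with IBI (0.4 in the text)
    r_cross: |r| at which two variables count as cross-correlated
             (assumed value, not taken from the Ohio analysis)
    """
    # Rank candidates by the strength of their correlation with IBI.
    strength = {name: abs(pearson(w, ibi)) for name, w in weights.items()}
    candidates = sorted(
        (name for name in strength if strength[name] >= r_min),
        key=lambda name: strength[name],
        reverse=True,
    )
    kept = []
    for name in candidates:
        # Drop a variable that duplicates a more significant one already kept.
        if all(abs(pearson(weights[name], weights[k])) < r_cross for k in kept):
            kept.append(name)
    return kept


# Toy data only (hypothetical variables, not values from the thesis).
toy_ibi = [1, 2, 3, 4, 5, 6]
toy_weights = {
    "var_a": [1, 2, 3, 4, 5, 7],     # strong, but nearly a copy of var_b
    "var_b": [2, 4, 6, 8, 10, 13],   # strongest correlation with IBI
    "var_c": [3, 1, 4, 1, 5, 2],     # weak correlation, screened out
    "var_d": [1, 3, 2, 6, 2, 5],     # moderate and not redundant
}
kept = select_variables(toy_weights, toy_ibi)
print(kept)
```

Ranking by |r| before the redundancy check mirrors the idea that a variable is discarded only when a more significant variable already carries the same signal.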

The remaining variables were considered as regional variables with a local effect (neuron-based

IBI-variable |r| < 0.4) and used for individual, progressive clustering along with local variables

(i.e. local land use and point source metrics). The large- and small-scale variables and their

respective correlation coefficients with IBI are listed in Table 3-8. Strongly cross-correlated

variables, which were therefore discarded, are also identified in Table 3-8.

Clustering of the 2nd SOM neurons using the three-dimensional neuron-based environmental

vectors (R30_Forest, R30_Barren, R100_Agric) segregated two significantly different biologic

responses as indicated by the ANOVA and MRT analysis (Table 3-9).


Table 3-8. Correlation coefficients between the neuron-based regional environmental variables and the neuron-based average IBI scores (left and middle columns) and between raw local variables and IBI scores (right column). Variables in bold were capable of separating significantly different biological responses in the hierarchical structure

Regional variables with     Regional variables with
widespread impact       r   localized impact        r   Local variables         r
---------------------------------------------------------------------------------
R30_Forest          0.662   R30_NonForest      -0.391   L100_Forest         0.458
R100_ForestC        0.646   RDA_BarrenC         0.378   L30_ForestC         0.456
RDA_ForestC         0.579   R100_BarrenC        0.365   LDA_ForestC         0.380
R30_AgricC         -0.483   RDA_AgricC         -0.347   L30_Agri           -0.271
R30_Barren          0.460   R100_NonForestC    -0.338   L30_NonForest      -0.223
R100_Agric         -0.436   UPS_Con            -0.320   L100_AgriC         -0.217
DW_storlengthD      0.430   RDA_NonForestC     -0.271   L100_Urban         -0.175
                            DW_MainDFD         -0.257   L100_NonForestC    -0.159
                            UPS_storDAC         0.242   LDA_UrbanC         -0.153
                            DW_floodMainlenC    0.226   L30_UrbanC         -0.139
                            DAC                 0.225   LDA_NonForestC     -0.112
                            RDA_WaterC          0.179   LDA_AgriC          -0.112
                            SITE_Con            0.170
                            UPS_floodareaC      0.139
                            Avg_DFD            -0.120
                            UPS_DFD             0.110
                            UPS_floodlenD       0.109
                            RDA_Urban           0.105
---------------------------------------------------------------------------------
C = strongly cross-correlated with a higher hierarchy variable; D = disregarded (both markers appear as a trailing letter on the variable name)

Table 3-9. ANOVA (top) and MRT (bottom) analyses to detect significant differences in IBI means between 2nd SOM groups of neurons. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups.

ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares     Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups           1181.57      1        1181.57      12.79     0.0004
Within groups            39539.1    428         92.381
-----------------------------------------------------------------------------
Total (Corr.)            40720.6    429

Multiple Range Tests
--------------------------------------------------------------------------------
Method: 95.0 percent LSD
                  Count        Mean          Homogeneous Groups
--------------------------------------------------------------------------------
IBI2                413        32.5521       X
IBI1                 17        41.0588        X
--------------------------------------------------------------------------------
Contrast                       Difference    +/- Limits
--------------------------------------------------------------------------------
IBI1 - IBI2                    *8.50677      4.67525
--------------------------------------------------------------------------------
* denotes a statistically significant difference.
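As a consistency check on Table 3-9, the between-group sum of squares and the F-ratio of a two-group one-way ANOVA can be recovered from the reported group sizes, group means, and within-group sum of squares alone. The sketch below is an illustration, not the statistical package used in the thesis; the function name is hypothetical and the raw site scores are not needed.

```python
def one_way_anova_two_groups(n1, mean1, n2, mean2, ss_within):
    """Between-group SS, within-group mean square, and F-ratio for two groups."""
    n = n1 + n2
    # With only two groups, the between-group sum of squares reduces to
    # n1 * n2 / (n1 + n2) * (difference of the group means) squared.
    ss_between = n1 * n2 / n * (mean1 - mean2) ** 2
    df_between = 1
    df_within = n - 2
    ms_within = ss_within / df_within
    f_ratio = (ss_between / df_between) / ms_within
    return ss_between, ms_within, f_ratio


# Summary values as reported in Table 3-9 (IBI1: 17 sites, IBI2: 413 sites).
ss_b, ms_w, f = one_way_anova_two_groups(17, 41.0588, 413, 32.5521, 39539.1)
print(round(ss_b, 2), round(ms_w, 3), round(f, 2))
```

The recomputed values agree with the table to within the rounding of the reported means.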


Observations in each of the two main groups obtained with the most significant regional

variables were clustered with the remaining regional and local variables one at a time in the order

shown in Table 3-8. Figure 3-11 shows a diagram of the final hierarchical tree in which the

initial two main groups were progressively split using the remaining variables. The subgroups

shown are those with statistically different biologic responses and the variables at each level are

those responsible for the differences and used in the group partitioning.

Figure 3-11. Hierarchical diagram of habitats with significantly different biotic responses. On the right, list of environmental variables used to segregate biotic signatures at each step. Rectangles in blue indicate groups that passed normality test. Rectangles in red indicate groups that did not pass normality test.

The biologic responses of the groups obtained after clustering with the fragmentation metric “site

percentage of connected network” (SITE_Con) were plotted in a normal probability plot

(Figure 3-12). Such a plot could have been created for each level of clustering. Only one level was

characterized in the present work for the sake of brevity.


Figure 3-12. Normal distribution probability plots for the biologic signatures after clustering sites with SITE_Con. Group 212 did not pass the Jarque-Bera test of normality at the 95% confidence level (see Figure 3-11). Group 221 was not plotted because it only had 4 observations


Figure 3-13. Example of biologic response separation by segregation of sites with environmental variables. Group 222 splits into groups 2221 and 2222 (group 2222 not normally distributed) after clustering with RDA_Urban. Group 2222 splits into groups 22221 and 22222 (both normally distributed) after clustering with R30_Agri.

3.2.2.2. Reference conditions for similar environmental sites and potential causes for departure

The 75th percentiles in groups 1, 211, 212, 222, and 223 were 44, 46, 40, 40, and 32 respectively.

Values above and below the 75th percentile were considered as reference and non-reference

respectively for each of the groups and plotted in Figure 3-14. Analysis of group differences is

presented in Table 3-10.
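The reference/non-reference split described above can be sketched in a few lines. This is a minimal illustration with assumptions stated plainly: the function name and the toy scores are hypothetical, the "inclusive" percentile interpolation is one of several common conventions (the text does not say which was used), and sites exactly at the 75th percentile are assigned to the reference set here, which the text does not specify.

```python
from statistics import quantiles


def split_reference(ibi_scores):
    """Split a group's IBI scores at its own 75th percentile."""
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is
    # the 75th percentile. "inclusive" treats the scores as the full
    # population of sites in the group (an assumption of this sketch).
    q75 = quantiles(ibi_scores, n=4, method="inclusive")[2]
    # Ties at the percentile go to the reference set (assumed convention).
    reference = [s for s in ibi_scores if s >= q75]
    non_reference = [s for s in ibi_scores if s < q75]
    return q75, reference, non_reference


# Toy scores only (not data from the thesis).
toy_scores = [12, 20, 28, 32, 36, 40, 44, 44, 52]
q75, ref, non_ref = split_reference(toy_scores)
print(q75, ref, non_ref)
```

Because the threshold is computed within each environmentally homogeneous group, the same IBI score may be reference quality in one group and non-reference in another.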


Figure 3-14. Normal probability plots for the reference (green) and impaired (red) conditions for the groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution (Group 212 was fitted to a Gaussian distribution only for demonstration purposes)


Table 3-10. 95% confidence intervals and ANOVA test between reference and non-reference sites in variables used in the separation of biotic responses

                Group 1                                  Group 211
Variable        Reference       Non-reference   p        Reference       Non-reference   p
R30_Forest      39.13 ± 7.28    39.93 ± 10.26   0.882    41.99 ± 7.62    40.13 ± 4.34    0.655
R30_Barren      0.22 ± 0.09     0.26 ± 0.26     0.755    0.00 ± 0.00     0.00 ± 0.00     0.367
R100_Agri       61.36 ± 6.38    59.09 ± 6.02    0.600    56.67 ± 6.73    60.71 ± 4.05    0.296
R30_NonForest   0.41 ± 0.40     0.37 ± 0.38     0.881    0.33 ± 0.20     0.38 ± 0.17     0.752
SITE_Con        20.49 ± 17.53   49.71 ± 33.82   0.082    80.13 ± 10.97   90.39 ± 3.49    0.016*
RDA_Urban       9.42 ± 4.25     7.79 ± 3.58     0.511    14.40 ± 8.27    9.82 ± 3.16     0.188
L100_Forest     50.62 ± 9.85    48.69 ± 17.73   0.821    62.89 ± 10.84   44.89 ± 6.23    0.004*
L30_Agri        17.87 ± 9.66    25.93 ± 18.51   0.363    15.69 ± 6.25    30.03 ± 7.22    0.026*
L30_NonForest   0.45 ± 0.80     2.16 ± 2.60     0.138    0.34 ± 0.50     0.48 ± 0.43     0.701
L100_Urban      11.13 ± 5.70    12.97 ± 10.43   0.713    10.19 ± 3.68    10.56 ± 4.00    0.915

                Group 212                                Group 222
Variable        Reference       Non-reference   p        Reference       Non-reference   p
R30_Forest      38.13 ± 5.04    19.70 ± 2.58    0.000*   28.49 ± 9.40    19.20 ± 2.05    0.003*
R30_Barren      0.00 ± 0.00     0.01 ± 0.01     0.016*   0.01 ± 0.01     0.01 ± 0.00     0.294
R100_Agri       52.13 ± 5.46    61.80 ± 3.78    0.007*   65.03 ± 9.50    66.76 ± 3.10    0.640
R30_NonForest   0.80 ± 0.14     0.87 ± 0.10     0.478    3.28 ± 0.34     3.00 ± 0.17     0.114
SITE_Con        12.42 ± 2.37    15.79 ± 2.11    0.080    37.52 ± 8.09    33.62 ± 0.91    0.092
RDA_Urban       16.45 ± 3.53    19.52 ± 3.17    0.286    7.99 ± 1.93     18.78 ± 2.99    0.000*
L100_Forest     42.79 ± 5.82    24.22 ± 3.37    0.000*   33.13 ± 11.95   14.59 ± 3.59    0.000*
L30_Agri        30.52 ± 6.50    44.94 ± 4.54    0.000*   42.08 ± 14.62   36.27 ± 10.54   0.559
L30_NonForest   0.42 ± 0.19     1.41 ± 0.38     0.003*   2.26 ± 2.31     2.70 ± 1.26     0.728
L100_Urban      18.12 ± 3.84    21.91 ± 3.45    0.226    12.00 ± 4.84    37.80 ± 10.19   0.005*

                Group 223
Variable        Reference       Non-reference   p
R30_Forest      20.18 ± 7.80    13.57 ± 4.69    0.134
R30_Barren      0.00 ± 0.00     0.00 ± 0.00     0.884
R100_Agri       55.98 ± 16.64   67.94 ± 7.97    0.138
R30_NonForest   25.91 ± 8.83    26.60 ± 4.80    0.882
SITE_Con        3.83 ± 1.47     3.29 ± 0.82     0.491
RDA_Urban       30.48 ± 18.01   16.27 ± 5.82    0.043*
L100_Forest     6.76 ± 2.41     12.97 ± 5.46    0.173
L30_Agri        45.35 ± 23.33   58.89 ± 10.69   0.216
L30_NonForest   3.10 ± 2.70     5.56 ± 2.08     0.190
L100_Urban      44.45 ± 24.66   21.20 ± 7.68    0.015*

* Indicates a statistically significant difference at the 95% confidence level (p < 0.05)


The hypothesis of normality in environmentally homogeneous groups worked well and most of

the groups’ IBI scores followed a Gaussian distribution at some point of the hierarchical

partitioning process (Figure 3-11). As expected, the IBI scores in the full database did not follow

a Gaussian pattern because environmental heterogeneity caused a mix of different biologic

signatures. Only one group (group 21221) out of fifteen groups after the last partition did not pass

the normality test. This group was composed of 98 sites and still had a wide IBI range (minimum

and maximum IBI equal to 12 and 52 respectively). Therefore, homogeneity could not be

achieved within this group using offstream variables. Use of other stressor types (e.g. instream

features) or a mix of instream and offstream variables would most likely solve this issue.
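The normality check applied to each group (the Jarque-Bera test named in the caption of Figure 3-12) can be sketched as follows. This is an illustration under assumptions: the function name and toy sample are hypothetical, and the p-value uses the asymptotic chi-square distribution with two degrees of freedom, which may differ from the small-sample critical values of whatever software was actually used.

```python
from math import exp


def jarque_bera(x):
    """Jarque-Bera statistic and asymptotic p-value for a sample."""
    n = len(x)
    mean = sum(x) / n
    # Central moments (population form, dividing by n).
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # For a chi-square variable with 2 degrees of freedom, the upper-tail
    # probability has the closed form exp(-x / 2).
    p_value = exp(-jb / 2)
    return jb, p_value


# Toy sample only (not data from the thesis).
jb, p = jarque_bera([1, 2, 3, 4, 5])
normal_at_95 = p > 0.05  # fail to reject normality at the 95% level
print(jb, p, normal_at_95)
```

A group "passes" in the sense used above when the hypothesis of normality is not rejected, so a large p-value supports treating the group's IBI scores as a single Gaussian signature.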

Separation of reference and non-reference sites within each group targeted the main issues that

need to be addressed in order to achieve realistic integrity goals within homogeneous groups.

Some of the groups had problems at the local scale (i.e. group 211 in Table 3-10), but most of

them had significant differences at the regional and local scale (groups 212, 222, and 223).

Groups 1 and 221 (group 221 not included in Table 3-10, only 4 observations) had no differences

because they were highly homogeneous with good biotic integrity (average IBI equal to 41 and

48.5 respectively). Results at the shown level of partition need to be read carefully. For example,

in group 223 the only significant differences were found with the regional and local urban land

uses, which were surprisingly higher in reference sites. However, the percentage of forest in the

regional buffer was higher in reference sites, and agriculture in the local and regional

buffers was lower in reference sites, although these differences were not statistically significant.

Selection of a more stringent reference threshold would most likely identify these differences as

significant.


Regional land use

The model identified two main groups using the most important regional variables (i.e. forest and

barren land in the 30-meter buffer and agriculture in the 100-meter buffer). Group 1 was

composed of only 17 sites. Eleven of them were located on the East Fork of the Little Miami River’s

main stem, three of them were located on the Scioto River’s main stem, and the remaining three

were located in the Huron River watershed (see Figure 3-1). The remaining 413 sites that

composed group 2 were distributed among the Western Lake Erie, Wabash River, Middle-Ohio

and Little Miami Rivers, Muskingum River, and Scioto River basins. The main difference

between the two groups was the percentage of forested area in the regional 30-meter buffer (average

forested land equal to 39.5 versus 24.9 percent in groups 1 and 2 respectively) as well as the

percentage of barren land in the regional 30-meter buffer (0.24 versus 0.01% in groups 1 and 2

respectively). The percentage of agriculture within the regional 100-meter buffer was very

similar in both groups (average percentage of agriculture equal to 60.3 and 60.9 in groups 1 and

2 respectively). The average IBI was higher in group 1 than in group 2 as expected (41.1 versus

32.6 respectively). The importance of protective vegetated buffers of at least 15 meters in order

to preserve wetlands, streams, and other aquatic resources is widely recognized (Castelle et al.

1994). Proper management of vegetated buffers is particularly important in order to avoid the

negative effects of sedimentation on the fish community (Rabeni and Smale 1995). Moreover, the

results showed how proximity to the water body is also a significant factor affecting integrity. At

the regional level, the most relevant land uses consistently showed slightly better predictions

with the 30-meter buffer than with the 100-meter buffer.

Group 1 did not undergo any further segregation of biologic responses with any of the subsequent

partitions using localized environmental variables. Group 1 IBI scores were rather homogeneous


(25th, 50th and 75th IBI percentiles equal to 36, 44, and 44 respectively). On the other hand, group

2 was subdivided due to localized regional effects (Figure 3-11). Group 2 had a much greater

variability of offstream features than group 1 did and a smaller part of the biotic variability could

be explained with large-scale variables only. Subsequent divisions with watershed and watershed

buffer land uses (R30_NonForest, RDA_Urban) successfully separated sites with different

regional features and therefore, different biologic responses were identified.

Immediate non-forested lands (which included herbaceous and shrub/scrub lands) were able to

separate group 2 into two main groups. Increased presence of non-forest translated into reduced

biotic quality (mean IBI equal to 28.5 in group 21 versus 34.0 in group 22). This was most likely

related to smaller presence of forested land in the 30-meter buffer and larger agricultural

coverage in group 21 (average forested and agricultural land was 27 and 59 percent and 19 and

66 percent in groups 21 and 22 respectively).

Watershed urbanization segregated 4 groups nested in group 22 (group 2221 and 2222, and

group 2231 and 2232). The biological responses to watershed urbanization were diverse. Groups

with similar percentages of forested and agricultural lands in the regional buffer (i.e. groups 2221

and 2222) showed a negative biologic response to increased watershed urbanization as expected

(Table 3-11). However, the opposite response was observed between groups 2231 and 2232. This

suggested that highly urbanized watersheds (i.e. group 2232) can achieve better integrity than

less urbanized ones if the regional buffer keeps its protective functions intact. Of special

importance are its vegetated areas. Importance of vegetated regional buffers was also revealed in

groups 221 and 1, which had the top two mean IBI scores and also had the highest percentage of

forest in the 30-meter regional buffer. Group 211 had significantly better average IBI than group


212. The regional buffer characteristics were quite similar in both groups but the percentage of

forested land was higher in group 211 (Table 3-11). Therefore, the importance of regional

buffers as the main regulator of biologic integrity was revealed. However, when similar buffer

characteristics exist in two different sites, IBI fluctuations are determined by land use beyond it

as observed in groups 2221 and 2222. In general, sites within a stream of the same order belonged to

only one group due to homogeneous regional characteristics (Figure 3-15).

The results from the regional land use analysis strongly agreed with current research. Regional

buffers were found to be better predictors of sediment-related habitat variables than the whole

catchment area (Richards et al. 1996). Habitat degradation, and especially substrate quality degradation, has

been strongly associated with negative effects on aquatic fauna in Ohio (Dyer et al. 2000; Dyer et

al. 1998a; Manolakos et al. 2007; Norton et al. 2000; Norton et al. 2002; Yuan and Norton 2004)

and elsewhere (Richards et al. 1993; Shields et al. 2006). Richards et al. (1996) also found that

land use in the whole catchment area has a stronger effect on variables related to hydraulic

regime such as channel dimensions than regional buffers do. The results with the regional

variables agreed extremely well with similar studies evaluating the impact of land use at a

similar scale. Stewart et al. (2001) linked increased presence of intolerant species and

total number of fish species to increased forested wetland in the 20-30 meter buffer. Percent

tolerant species and percent insectivorous fish decreased as the percent of forest in the 20-30

meter buffer increased. These relationships indicate a positive correlation between regional

stream buffer and biologic integrity. Moreover, percentage of grasslands (equivalent to non

forested lands in our research) in the 20-30 meter regional buffer has been negatively associated

with the health of fish communities (Stewart et al. 2001). Also, several authors have identified

urban land use in the whole catchment area as a good indicator of biological integrity


degradation (Morley and Karr 2002; Stewart et al. 2001). Therefore, the model confirmed the

disproportionate importance of regional buffers relative to their total land area (Johnson et al.

1997). However, good quality of regional stream buffers alone does not guarantee good

biological integrity in highly urbanized basins (Roth et al. 1996).

Fragmentation

Of the two fragmentation metrics used in the hierarchical separation of biologic signatures

(percentage of upstream connected network [UPS_Con] and percentage of basin connected

network [SITE_Con]), only the basin-based metric was able to separate different biological

responses. Because the observations within a basin were highly concentrated in specific areas or

river systems and because the metric was calculated at the basin-scale, clustering with this metric

functioned as a basin-filter. Observations within the same basin (or within the same watershed in

basins with multiple outlets) were grouped together. Therefore, biologic integrity responses were

segregated on a basin level (Figure 3-1).

The results could suggest that separation of different biotic qualities in this case was more due to

regional characteristics than the effect of fragmentation itself. However, a clear pattern was

observed. The two groups with the lowest biological integrity (i.e. group 223 and 212 with an

average IBI equal to 26.3 and 32.9 respectively) had the lowest average site network connectivity

(3.43 and 14.9 percent respectively). The three groups with the highest average IBI (i.e. groups

221, 1, and 211 with average IBI equal to 48.5, 41.1, and 40.1 respectively) had much larger

mean connectivity values (59.7, 34.2, and 87.7 percent respectively). Therefore, fragmentation at

the basin-level seems to play an important role in biologic integrity. Connectivity thresholds to

guarantee species survival and persistence may exist. Physically fragmented networks tend to


isolate small populations, which become non-viable and are expected to disappear within a

period ranging from 30 to 100 years (Morita and Yokota 2002). Some studies suggest that the risk of

species disappearance due to stream damming is positively correlated with the length of the population’s

isolation from the rest of the river network and with stream gradient, and is

negatively correlated with watershed area (i.e. habitat size) (Morita and Yamamoto 2002).

Moreover, fragmentation not only represents a physical barrier to fauna but is also associated

with flow regulation. Hydraulic intermittency due to flow regulation or abstraction is usually

associated with the presence of dams or other infrastructure for flow regulation. One of the main

consequences is the longitudinal and lateral dispersion of species due to habitat fragmentation or

disappearance triggered by the new hydrologic regime (Fischer and Kummer 2000; Freeman et

al. 2001).

Upstream connectivity did not seem to be as relevant as site connectivity at the basin-level. One

of the possible reasons why this metric was not as relevant as the basin-connectivity is that

many of the observations were very far from the basin outlet. Therefore, the largest part of the

network available for fauna was located in the downstream section (average distance to basin

outlet following the main channel was equal to 284.3 Km, average total upstream network

distance was equal to 488 Km). Results from Chapter 2 indicated that increased upstream

fragmentation in the same study area was responsible for IBI over-predictions in some sites (see

Table 2-3). Upstream fragmentation (which was not a predicting variable in the model) was

significantly higher in some over-predicted sites. Therefore, given the available observation

points, upstream fragmentation is more a local than a regional stressor because of the generally

large distances to the basin outlet (i.e. the upstream network section represents a small fraction of

the existing fish habitat).


Local variables

Only some local land uses were considered to have a significant correlation to IBI (Table 3-8).

None of the point source density and intensity metrics had an absolute raw IBI-variable

correlation coefficient larger than 0.1 and therefore, these were disregarded. On the other hand,

the order of the correlations between local land use and IBI was almost exactly the same as

that observed at the regional scale. Forest was once again the most significant variable and was positively

correlated to IBI, while agriculture, non-forested lands, and urbanization were negatively

correlated to IBI. Once again, land use in the buffer strip was more strongly correlated to IBI

than land use in the whole local catchment area. Urban land was no exception, which was a

significant difference from the regional land use variable selection (percent of urban land use in

the whole drainage area was more relevant at the regional level than in the buffer zone). Also,

percent of barren lands in the local buffer had a very weak correlation with IBI and was disregarded

for further analysis.

The importance of forested local buffers was revealed in groups 211 and 212, which were split

into two and three subgroups respectively. The pattern was very clear. Increased vegetation in

the local buffer corresponded to improved biologic integrity within the limits established by the

group’s regional and basin characteristics (i.e. background integrity). Average IBI for subgroups

2111 and 2112 was 37.6 and 43.6 respectively. Percentage of forested land in the local,

100-meter buffer in the same sub-groups was 34.6 and 67.4 respectively. Average IBI in groups

2121, 2122, and 2123 was 39.4, 35.8, and 28.9 and corresponded to an average percentage of

forested land in the local buffer of 77.5, 41.7, and 7.82 respectively.


Groups 221, 2221, 2222, 2231, and 2232 did not undergo any further subdivision with this

variable. This was most likely due to high homogeneity in groups 221, 2231, and 2232 (25th,

50th, and 75th percentiles equal to 42.5, 44.6, 45.6% in group 221; 0.8, 8.2, 11.6% in group 2231;

and 4.3, 8.2, 15.9% in group 2232). However, groups 2221 and 2222 had more variability

(25th, 50th, and 75th percentiles equal to 3.38, 29.2, 35.4% in group 2221; 2.5, 12.5, 25.3% in group

2222). We believe that despite variability, differences in biological responses may not be

statistically significant when a lower limit of forested land cover is reached. For example, group

212 and 211 were divided into three and two subgroups respectively. Sub-groups 2121, 2122,

and 2123 had average percentages of forest equal to 77.5, 41.7, and 7.82 respectively, while

subgroups 2111 and 2112 had average percentages equal to 34.6 and 67.4 respectively. Different

biotic responses were not observed between the 10-30% range of forest and less than 10% of

forested lands. Since most of the observations in groups 2221 and 2222 were below the 30%

limit, we believe local buffer functionality was degraded enough as to not be able to further

generate different biotic responses.

The strong influence of forest in the local 100-meter buffer was also identified in a study in the

River Raisin in Michigan (Lammert and Allan 1999). Thirty percent of the total fish IBI

variability was explained with this variable. Local forest cover was also an important factor

positively affecting the Benthic IBI (B-IBI). In another study, local urban land use showed the

strongest correlation to B-IBI when compared to other land uses. However, since the watersheds

under study were mostly dominated by either urbanization or forest, forest cover was excluded

from the analysis due to almost perfect correlation with urban land use (Morley and Karr 2002).

As in our model, Morley and Karr (2002) identified watershed urban land use as a better

predictor for B-IBI than local urban land use. However, they found that local urban land use in


highly urbanized watersheds was strongly correlated to B-IBI in watersheds with little vegetal

cover continuity in the immediate stream buffer (1 Km upstream).

Agriculture, non-forested land uses, and urbanization in the local buffers were able to further

separate more biotic responses. However, the biologic responses obtained in each new sub-

division were not always the expected ones. Groups 22221 and 22312 had worse biotic

integrities than groups 22222 and 22311 despite having smaller percentages of agriculture in the

local buffer. Also, group 21112 had better integrity than group 21111 despite having a larger

percentage of urban land use (average of 26.6 versus 6.4 respectively). Only the percentage of

non-forested land in the local buffer yielded the expected outcome. Group 21221 had better integrity

than 21222 and its percentage of non-forested land was significantly lower (0.34 versus 25.4

percent respectively).

Even though some of these results were counter-intuitive, they may reflect truly extraordinary

local conditions. The signs of the overall IBI-local variable correlation coefficients when the

whole database was used were as expected (Table 3-8). We believe the discrepancies in

some groups were due to data resolution (e.g. agriculture was the sum of hay pasture, range and

croplands which may have different behaviors). For example, group 22312 had far less local

agricultural coverage than group 22311 (average 36.3 versus 81.1% respectively). However, the

average percentage of local pasture and rangeland (included in the agricultural category) was

almost double in group 22312 with respect to group 22311 (10.3 versus 5.9% respectively). In

Chapter 2, this land use type was identified as the most deleterious to biotic integrity in Ohio.

Data resolution problems were not observed in groups 22221 and 22222, and 21111 and 21112.

Even though a clear explanation for the observed pattern was not found, dominance of regional


influence could be a possible cause. It has been documented that some instream features such as

shade, channel width and stability, epilithon biomass, or water clarity improve rapidly with

improved local buffer quality. However, other processes that can severely affect biotic integrity

such as water chemistry, nutrient input, surficial fine sediment, or fecal contamination are highly

dependent on regional characteristics (Parkyn et al. 2003; Scarsbrook and Halliday 1999).

Figure 3-15. Groups of sampling sites in a watershed located in the Muskingum River Basin. On the left, groups after partition with regional watershed land use and fragmentation metrics. On the right, groups after partitions with land use in the local 100-meter buffer

Table 3-11. Average group values after clustering with basin/watershed scale variables

Group   R30_FOREST  R30_BARREN  R100_AGRI  R30_NONFOREST  SITE_CON  RDA_URB    IBI
1            39.50        0.24      60.29           0.39     34.24     8.65  41.06
211          40.62        0.00      59.66           0.37     87.71    11.01  40.09
212          24.63        0.01      59.21           0.85     14.89    18.70  32.88
221          42.43        0.02      59.28           4.41     59.67     6.08  48.50
2221         21.78        0.01      71.47           3.15     35.16     7.54  38.08
2222         21.33        0.01      62.40          19.41     34.17    22.61  24.71
2231         11.71        0.01      79.09          18.91      2.22     7.68  24.30
2232         23.39        0.00      33.04           9.91      6.10    47.47  30.67


3.2.3. Coastal Maryland

3.2.3.1. Biologic response separation

The correlation matrix of the neuron weights and the neuron-based average IBI scores is shown

in Figure 3-16.

Figure 3-16. Correlation matrix of the variable neuron-based weights and neuron-based average IBI values in the trained SOM. Color bar on the right indicates the color code for the absolute correlation coefficients among variables

The variables that showed a relevant overall impact on IBI (variable-IBI r ≥ 0.5) were: pool

quality (r = 0.730), average thalweg (r = 0.675), average width (r = 0.671), velocity-depth

variability (r = 0.665), percentage of channel covered by flow (r = 0.659), maximum depth (r =


0.654), drainage area (r = 0.620), wood score (r = 0.577), flow velocity (r = 0.529), and riffle

quality (r = 0.517).

Many of these significant variables’ neuron weights were strongly cross-correlated (variable-variable

|r| ≥ 0.8) and were therefore disregarded in subsequent analysis. Average thalweg, velocity-depth

variability, percentage of channel covered by flow, and maximum depth were strongly

correlated to pool quality (r = 0.912, 0.876, 0.841, 0.968 respectively). Drainage area and wood

score were strongly correlated to average stream width (r = 0.955, -0.864 respectively). Average

flow velocity was also disregarded for further analysis because we considered this variable could

be influenced by local conditions (e.g. channelization). Thus, the remaining variables for the 2nd

SOM patterning were pool and riffle qualities, and average width.

The remaining variables were considered as small-scale variables. DO and ANC were eliminated

due to high correlation with pH (r = 0.85 and 0.84 respectively), Agribarr and CShade were

correlated to NO3 (r = 0.95, -0.81 respectively), Embed and CEpiSub were correlated to

CInstrHab (r = -0.86 and 0.87 respectively), and CRemote was correlated to Aesthet (r = 0.95).

Temperature was also discarded from further analysis due to its variability. Thus, the remaining

small-scale variables were: pH, NO3, Forwetwat, CInstrHab, Aesthet, RipWid, CBank, Chan,

Cond, SO4, Sl, DOC, and Urban.

The 2nd SOM was run with the identified environmental gradients and the subsequent SOM-

neuron clustering yielded two groups with significantly different biologic responses according to

ANOVA (Table 3-12).


Table 3-12. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups

ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares     Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups            34.965      1         34.965      50.28     0.0000
Within groups             155.07    223       0.695381
-----------------------------------------------------------------------------
Total (Corr.)            190.035    224

Multiple Range Tests
--------------------------------------------------------------------------------
Method: 95.0 percent LSD
                  Count        Mean          Homogeneous Groups
--------------------------------------------------------------------------------
IBI1                103        3.05097       X
IBI2                122        3.84221        X
--------------------------------------------------------------------------------
Contrast                       Difference    +/- Limits
--------------------------------------------------------------------------------
IBI1 - IBI2                    *-0.791242    0.219896
--------------------------------------------------------------------------------
* denotes a statistically significant difference.

Figure 3-17. Groups and subgroups with different biological response after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed the normality tests

Subsequent biologic response separation based on individual, not strongly correlated variable

site-clustering yielded eleven different groups as shown in Figure 3-17. Group and sub-group

variable statistics are included in Appendix I.

The normal probability plots for the two main biologic responses after the 2nd SOM clustering are

shown in Figure 3-18.

Figure 3-18. Normal probability plots for the IBI responses found after the 2nd SOM clustering

Reference conditions for similar environmental sites and potential causes for departure

The IBI 75th percentiles for groups 1 and 2 were 3.75 and 4.25 respectively. Values beyond these

scores were arbitrarily set as reference sites for each biological response at the given level of

partition. Reference site curves along with curves from the remaining non-reference sites are

shown in Figure 3-19. Significant differences among variables between reference and non-

reference conditions are presented in Table 3-13.
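The reference/non-reference split described above can be sketched as a percentile cut on the IBI scores of a group. This is a minimal illustration, not the procedure used to produce the reported thresholds: the percentile interpolation method (`inclusive`) is an assumption, the function name is hypothetical, and the scores are made-up values rather than MBSS data.

```python
import statistics

def split_reference(ibi_scores, percentile_rank=75):
    """Split sites into reference and non-reference by an IBI percentile threshold."""
    # statistics.quantiles with n=100 returns the 1st..99th percentiles.
    cut_points = statistics.quantiles(ibi_scores, n=100, method="inclusive")
    threshold = cut_points[percentile_rank - 1]
    # "Values beyond these scores" is read here as strictly greater than the threshold.
    reference = [s for s in ibi_scores if s > threshold]
    non_reference = [s for s in ibi_scores if s <= threshold]
    return threshold, reference, non_reference

# Hypothetical IBI scores for one group (not taken from the MBSS database).
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
threshold, reference, non_reference = split_reference(scores)
print(threshold, reference, non_reference)
```

Passing a different `percentile_rank` (for example 80) gives a more stringent reference set, which is the kind of adjustment a watershed manager could make.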

Figure 3-19. Normal probability plots for the reference (green) and impaired (red) conditions for the two groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution
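The way such curves are built (random Gaussian points drawn from a group's mean and standard deviation, then placed on a normal probability plot) can be sketched as follows. The means and standard deviations are hypothetical, and the plotting-position convention (i - 0.5)/n is an assumption; the text does not state which one was used.

```python
import random
from statistics import NormalDist

def normal_probability_points(mean, stdev, n_points, seed=0):
    """Draw Gaussian points and pair each sorted value with its theoretical z-score."""
    rng = random.Random(seed)
    values = sorted(rng.gauss(mean, stdev) for _ in range(n_points))
    # Plotting positions (i - 0.5) / n are one common convention, assumed here.
    probs = [(i - 0.5) / n_points for i in range(1, n_points + 1)]
    z_scores = [NormalDist().inv_cdf(p) for p in probs]
    return list(zip(z_scores, values))

# Hypothetical mean and standard deviation for a reference and an impaired subset.
reference_curve = normal_probability_points(mean=4.3, stdev=0.3, n_points=200, seed=1)
impaired_curve = normal_probability_points(mean=3.2, stdev=0.6, n_points=200, seed=2)
```

Plotting the value against the z-score for each pair gives an approximately straight line when the points are Gaussian, with the slope reflecting the standard deviation and the intercept the mean.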

Table 3-13. 95% confidence intervals and ANOVA test between reference and non-reference sites with variables used in the separation of biotic responses in coastal sites

           Group 1                                      Group 2
Variable   Reference         Non-reference     p        Reference         Non-reference     p
Aesthet    14.14 ± 1.65      12.65 ± 1.14      0.132    15.00 ± 0.99      13.49 ± 1.03      0.054
Wid        3.73 ± 0.95       3.12 ± 0.78       0.350    6.62 ± 1.24       5.62 ± 0.65       0.118
CBank      66.8 ± 5.85       71.12 ± 4.28      0.236    73.73 ± 4.49      73.14 ± 3.35      0.831
Chan       7.31 ± 1.41       7.44 ± 1.04       0.886    8.73 ± 1.06       9.93 ± 0.85       0.083
Cond       161.57 ± 25.10    182.51 ± 26.58    0.311    180.24 ± 41.25    163.96 ± 15.92    0.388
DOC        5.73 ± 1.09       7.75 ± 1.55       0.081    5.30 ± 1.01       5.71 ± 0.71       0.499
Forwetwat  48.1 ± 6.64       54.63 ± 5.49      0.148    48.04 ± 4.76      45.04 ± 3.06      0.269
CInstrhab  56.59 ± 7.77      53.37 ± 5.89      0.517    64.53 ± 6.56      63.60 ± 5.30      0.828
NO3        1.83 ± 0.55       1.74 ± 0.67       0.864    2.71 ± 0.79       2.35 ± 0.54       0.440
pH         6.89 ± 0.14       6.67 ± 0.16       0.075    6.93 ± 0.15       6.83 ± 0.10       0.252
Pool       11.66 ± 1.35      8.66 ± 1.07       0.001*   13.93 ± 0.87      13.95 ± 0.74      0.980
Riffle     5.20 ± 1.314      5.41 ± 1.01       0.803    12.82 ± 1.07      13.38 ± 0.65      0.346
RipWid     31.6 ± 6.81       34.96 ± 4.18      0.377    37.83 ± 5.30      38.62 ± 3.91      0.683
Sl         0.459 ± 0.112     0.447 ± 0.117     0.895    0.46 ± 0.14       0.54 ± 0.09       0.318
SO4        14.00 ± 1.91      14.67 ± 1.81      0.634    14.46 ± 1.50      15.47 ± 1.29      0.323
Urban      6.07 ± 2.78       6.09 ± 2.71       0.992    6.81 ± 2.74       12.93 ± 3.55      0.018*

*Indicates statistically significant difference at the 95% confidence level (p<0.05)

3.2.4. Piedmont Maryland

3.2.4.1. Biologic response separation

The correlation matrix of the neuron weights and the neuron-based average IBI after the initial

SOM training is shown in Figure 3-20.


The variables with a significant impact on IBI were (in decreasing order of importance): Ch_flow

(r = 0.749), Chan (r = 0.747), Urban (r = - 0.746), Agribarr (r = 0.726), SO4 (r = 0.690), Aesthet

(r = 0.689), Veldep (r = 0.673), Pool ( r = 0.672), DOC (r = -0.662), DO (r = 0.659), NO3 (r =

0.655), ANC (r = -0.650), Cond (r = -0.632), ThalDep (r = 0.605), PRemote (r = 0.585),

PEmbed (r = 0.531), MaxDep (r = 0.505). Agribarr, SO4, Aesthet, DOC, DO, ANC, and Cond

were highly correlated to Urban and were therefore disregarded (r = -0.910, 0.925, -0.930, 0.871,

-0.914, 0.925, 0.943 respectively). Agribarr, Veldep, and NO3 were strongly correlated to

Ch_flow (r = 0.816, 0.837, 0.835 respectively). ThalDep and MaxDep were strongly correlated

to Pool (r = 0.942 and 0.934 respectively). Hence, the remaining large-scale variables for the 2nd

SOM patterning were: Ch_Flow, Chan, Urban, Pool, PRemote, and PEmbed.
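The selection above can be read as a greedy screen: variables are ranked by the absolute value of their correlation with IBI, and a variable is kept only if it is not strongly correlated with a variable already kept. The sketch below applies that reading to the correlations quoted in this section. The cutoff of 0.8 is a hypothetical value (the text does not state one), pairs whose cross-correlation is not quoted are assumed to be weak, and the greedy rule itself is an interpretation rather than a documented procedure.

```python
# Neuron-based correlations with IBI quoted in the text (piedmont sites).
IBI_R = {
    "Ch_flow": 0.749, "Chan": 0.747, "Urban": -0.746, "Agribarr": 0.726,
    "SO4": 0.690, "Aesthet": 0.689, "Veldep": 0.673, "Pool": 0.672,
    "DOC": -0.662, "DO": 0.659, "NO3": 0.655, "ANC": -0.650,
    "Cond": -0.632, "ThalDep": 0.605, "PRemote": 0.585, "PEmbed": 0.531,
    "MaxDep": 0.505,
}

# Cross-correlations quoted in the text. Pairs not quoted are assumed weak (0.0).
CROSS_R = {
    ("Agribarr", "Urban"): -0.910, ("SO4", "Urban"): 0.925,
    ("Aesthet", "Urban"): -0.930, ("DOC", "Urban"): 0.871,
    ("DO", "Urban"): -0.914, ("ANC", "Urban"): 0.925, ("Cond", "Urban"): 0.943,
    ("Agribarr", "Ch_flow"): 0.816, ("Veldep", "Ch_flow"): 0.837,
    ("NO3", "Ch_flow"): 0.835,
    ("ThalDep", "Pool"): 0.942, ("MaxDep", "Pool"): 0.934,
}

CUTOFF = 0.8  # Hypothetical cross-correlation cutoff, not stated in the text.

def pair_r(a, b):
    """Look up a cross-correlation regardless of the order of the pair."""
    return CROSS_R.get((a, b), CROSS_R.get((b, a), 0.0))

def screen(ibi_r, cutoff=CUTOFF):
    """Keep variables in decreasing |r| with IBI, skipping those tied to a kept one."""
    kept = []
    for var in sorted(ibi_r, key=lambda v: abs(ibi_r[v]), reverse=True):
        if all(abs(pair_r(var, k)) < cutoff for k in kept):
            kept.append(var)
    return kept

selected = screen(IBI_R)
print(selected)
```

Under these assumptions the screen returns the same six large-scale variables named in the text.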

The remaining variables (IBI-variable |r| < 0.5) were considered small-scale variables.

However, some variables were again disregarded due to strong cross-correlation. Wid was

strongly correlated to Pool (r = 0.845); PRiffle, PInstrHab, PShade, and PBank were strongly

correlated to PEpiSub (r = 0.936, 0.957, 0.835, and 0.840 respectively). Therefore, the remaining

small-scale variables were: PEpiSub, DA, PWood, pH, RipWid, Sl, Forwetwat, and PHI. Again,

Flow_vel and Temp were disregarded from further analyses for the reasons mentioned

previously.

Clustering of the 2nd SOM neurons using the six most significant, non-correlated variables

yielded five groups with significantly different biologic responses as indicated by the ANOVA

and MRT analyses (Table 3-14). Subsequent separation of biological responses due to small-

scale stressors resulted in two more levels of IBI segregation. Stream gradient and percentage of

forest, wetlands, and water in the drainage area were the variables responsible for the

significantly different biological signatures (Figure 3-21).

Table 3-14. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups

ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source             Sum of Squares    Df     Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups     119.017           4      29.7543        59.14      0.0000
Within groups      124.261           247    0.503083
-----------------------------------------------------------------------------
Total (Corr.)      243.279           251

Multiple Range Tests
--------------------------------------------------------------------------------
Method: 95.0 percent LSD
         Count    Mean       Homogeneous Groups
--------------------------------------------------------------------------------
IBI2     29       1.80483    X
IBI5     17       2.71353     X
IBI1     33       3.24818      X
IBI3     47       3.59149       X
IBI4     126      3.92992        X
--------------------------------------------------------------------------------
Contrast         Difference     +/- Limits
--------------------------------------------------------------------------------
IBI1 - IBI2      *1.44335       0.355584
IBI1 - IBI3      *-0.343308     0.317279
IBI1 - IBI4      *-0.681739     0.273186
IBI1 - IBI5      *0.534652      0.417067
IBI2 - IBI3      *-1.78666      0.329884
IBI2 - IBI4      *-2.12509      0.287729
IBI2 - IBI5      *-0.908702     0.426734
IBI3 - IBI4      *-0.338431     0.238776
IBI3 - IBI5      *0.87796       0.395383
IBI4 - IBI5      *1.21639       0.360961
--------------------------------------------------------------------------------
* denotes a statistically significant difference.

Figure 3-21. Groups and subgroups with different biological responses after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed normality tests

Figure 3-22 shows the normal probability plot for the five main groups after clustering the SOM

neurons using the identified environmental gradients in piedmont regions.

Figure 3-22. Normal probability plots for the IBI responses identified by the 2nd SOM clustering in Piedmont sites (Group 4 didn’t pass the normality test)


The IBI 75th percentiles for groups 1 through 5 were 3.89, 2.33, 4.11, 4.33, and 3.00 respectively.

Group reference and non-reference curves for the values above and below the 75th IBI percentile

are presented in Figure 3-23. Differences between reference and non-reference sites are shown in

Table 3-15.

Figure 3-23. Normal probability plots for the reference (green) and impaired (red) conditions for the five groups obtained after clustering the SOM neurons using environmental gradients. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group to describe its Gaussian distribution (Group 4 was fitted to a Gaussian distribution only for demonstration purposes)

Table 3-15. 95% confidence intervals and ANOVA test between reference and non-reference sites with variables used in the separation of biotic responses in piedmont sites

           Group 1                                          Group 2
Variable   Reference           Non-reference       p        Reference           Non-reference       p
Ch_flow    84.84 +/- 8.82      80.43 +/- 7.72      0.451    70.62 +/- 19.41     69.0 +/- 11.32      0.881
Chan       13.58 +/- 1.92      11.67 +/- 1.69      0.139    8.50 +/- 3.10       7.90 +/- 2.33       0.767
Urban      1.12 +/- 0.61       0.37 +/- 0.27       0.008*   49.34 +/- 16.23     63.17 +/- 7.93      0.075
Pool       10.17 +/- 0.97      9.86 +/- 0.90       0.644    13.50 +/- 2.86      12.14 +/- 1.69      0.378
PRemote    54.17 +/- 19.91     57.44 +/- 15.93     0.790    15.62 +/- 9.26      13.39 +/- 5.34      0.646
PEmbed     76.57 +/- 14.40     66.24 +/- 10.10     0.211    68.06 +/- 26.13     72.43 +/- 12.33     0.712
Sl         1.12 +/- 0.372      1.2 +/- 0.23        0.699    0.79 +/- 0.35       1.53 +/- 0.51       0.082
Forwetwat  28.41 +/- 5.08      30.02 +/- 8.02      0.767    30.62 +/- 9.97      22.75 +/- 4.98      0.102

           Group 3                                          Group 4
Variable   Reference           Non-reference       p        Reference           Non-reference       p
Ch_flow    85.87 +/- 6.71      76.97 +/- 5.40      0.050*   85.27 +/- 4.17      86.62 +/- 2.92      0.590
Chan       7.40 +/- 1.63       6.50 +/- 0.80       0.250    13.49 +/- 0.76      13.41 +/- 0.63      0.873
Urban      3.46 +/- 3.30       5.66 +/- 2.28       0.263    3.12 +/- 1.17       2.43 +/- 0.54       0.228
Pool       14.27 +/- 1.21      13.56 +/- 1.01      0.398    15.56 +/- 0.56      15.33 +/- 0.46      0.553
PRemote    41.25 +/- 14.2587   52.54 +/- 9.84      0.185    63.33 +/- 8.35      67.98 +/- 6.26      0.376
PEmbed     50.00 +/- 11.57     62.57 +/- 8.21      0.077    80.76 +/- 5.47      75.86 +/- 4.66      0.191
Sl         1.04 +/- 0.27       0.92 +/- 0.27       0.562    1.17 +/- 0.27       1.30 +/- 0.31       0.567
Forwetwat  29.44 +/- 4.25      37.97 +/- 5.23      0.038*   31.30 +/- 2.77      30.91 +/- 2.72      0.855

           Group 5
Variable   Reference           Non-reference       p
Ch_flow    88.83 +/- 13.53     79.36 +/- 9.89      0.206
Chan       14.00 +/- 1.88      14.45 +/- 1.54      0.682
Urban      23.82 +/- 4.50      44.018 +/- 13.35    0.028*
Pool       15.17 +/- 3.28      14.54 +/- 2.11      0.702
PRemote    50.00 +/- 24.89     43.18 +/- 9.47      0.464
PEmbed     77.78 +/- 15.20     67.8 +/- 21.23      0.474
Sl         0.90 +/- 0.43       1.24 +/- 0.83       0.533
Forwetwat  28.97 +/- 5.42      30.29 +/- 9.24      0.823

*Indicates statistically significant difference at the 95% confidence level (p<0.05)

3.2.5. Highland Maryland

3.2.5.1. Biologic response separation

The correlation matrix of the neuron weights and the neuron-based average IBI after the initial

SOM training is shown in Figure 3-24.


In this case, the variables with a significant impact on IBI were: HEpiSub (r = 0.588), Riffle (r =

0.582), Veldep (r = 0.567), Wid (r = 0.555), FlowVel (r =0.545), InstrHab (r = 0.536), MaxDep

(r = 0.534), Pool (r = 0.521), DA (r = 0.517), ThalDep (r = 0.510). Riffle, VelDep, and InstrHab

were strongly correlated to HEpiSub (r = 0.901, 0.800, and 0.890 respectively); Instrhab,

MaxDep, Pool, DA, and ThalDep were strongly correlated to Wid (r = 0.907, 0.977, 0.964,

0.971, and 0.968 respectively). Therefore, the variables left for the 2nd SOM patterning were only

HEpiSub and Wid.

The remaining, non-correlated variables were considered small-scale variables. These were (in

decreasing order of importance): Root, DO, Sl, SO4, Wood, Agribarr, Aesthet, Chan, Embed,

HShade. Other variables were disregarded due to strong correlation: DOC with HEpiSub (r =

-0.824), Ch_Flow with DO (r = 0.824), pH with Sl (r = -0.911), HRipWid and HRemote with Aesthet

(r = 0.851 and 0.867 respectively), and Urban, HBank, ANC, Cond, NO3, and Forwetwat with

Agribarr (r = 0.861, -0.898, 0.909, 0.899, 0.971, and -0.939 respectively). Again, Flow_vel and

Temp were disregarded from further analyses for the reasons mentioned previously.

Clustering of the 2nd SOM neurons using the two identified large-scale variables resulted in three

groups with significantly different IBI responses according to ANOVA and MRT analyses

(Table 3-16). Subsequent separation of biological responses with small-scale variables resulted

in 9 different levels (Figure 3-25).

Table 3-16. SOM-neuron group IBI means ANOVA (top) and MRT (bottom) analyses in highland sites. In the MRT, overlapping X’s indicate non significant differences. Non-overlapping X’s indicate statistically significant differences between pairs of groups

ANOVA Table
Analysis of Variance
-----------------------------------------------------------------------------
Source             Sum of Squares    Df     Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups     39.0566           2      19.5283        16.76      0.0000
Within groups      341.399           293    1.16518
-----------------------------------------------------------------------------
Total (Corr.)      380.455           295

Multiple Range Tests
--------------------------------------------------------------------------------
Method: 95.0 percent LSD
         Count    Mean       Homogeneous Groups
--------------------------------------------------------------------------------
IBI2     111      2.67324    X
IBI1     153      3.25        X
IBI3     32       3.79406      X
--------------------------------------------------------------------------------
Contrast         Difference     +/- Limits
--------------------------------------------------------------------------------
IBI1 - IBI2      *0.576757      0.264873
IBI1 - IBI3      *-0.544062     0.412961
IBI2 - IBI3      *-1.12082      0.426261
--------------------------------------------------------------------------------
* denotes a statistically significant difference.

Figure 3-25. Biological response hierarchical structure after clustering with large and small-scale environmental filters. Red color indicates groups that did not pass normality tests. Blue color indicates groups that passed normality tests

Figure 3-26 shows the three different biological responses after separation with environmental

gradients (see Figure 3-25).

Figure 3-26. Normal probability plots for the IBI responses identified by the 2nd SOM clustering in Highland sites (groups 1 and 3 didn’t pass normality tests)


The IBI 75th percentiles for groups 1 through 3 were 4.14, 3.57, and 4.43 respectively. Reference

vs. non-reference curves are presented in Figure 3-27. Differences among variables between

reference and non-reference sites are presented in Table 3-17.

Figure 3-27. Normal probability plots for the reference (green) and impaired (red) conditions for the three groups obtained using environmental gradients in Highland sites. Points were randomly generated using the reference and impaired sites’ mean and standard deviation in each group in order to describe its Gaussian distribution (Groups 1 and 3 fitted to a Gaussian distribution only for demonstration purposes)

The model identified a rather homogeneous system (piedmont areas), a moderately heterogeneous

one (coastal plains), and a highly heterogeneous one (highland areas).

In coastal sites, the variables with the greatest overall impact on IBI were either related to stream

variability/complexity (i.e. pool and riffle qualities) or stream size (i.e. average stream width).

Pool quality as well as stream size-related parameters (i.e. maximum depth) were part of the

provisional PHI and were also found in another research effort to have a high IBI discriminatory

power (Hall et al. 1999). Another stream variability metric with high discriminatory power

eliminated due to cross-correlation and also included in the provisional PHI was velocity-depth

variability. None of the new metrics included in the final PHI index by Paul et al. (2002) was

used in the 2nd SOM patterning. Only the PHI metric CWood was among the top predictors but

eliminated due to strong cross-correlation with stream size parameters.

Table 3-17. 95% confidence intervals and ANOVA test between reference and non-reference sites in variables used in the separation of biotic responses in highland sites

          GROUP 1                                       GROUP 2                                       GROUP 3
Variable  REF                NON-REF            p       REF                NON-REF            p       REF                NON-REF            p
HEpiSub   85.99 +/- 3.88     72.96 +/- 3.36     0.000*  30.37 +/- 5.78     22.56 +/- 2.44     0.004*  73.46 +/- 17.58    75.84 +/- 9.09     0.780
Wid       5.34 +/- 0.67      4.45 +/- 0.43      0.026*  5.16 +/- 0.96      3.98 +/- 0.53      0.027*  13.96 +/- 2.89     12.42 +/- 1.08     0.185
Root      1.56 +/- 0.54      0.86 +/- 0.27      0.011*  1.13 +/- 0.59      0.52 +/- 0.22      0.017*  1.89 +/- 1.05      1.65 +/- 1.50      0.845
Sl        2.01 +/- 0.41      2.11 +/- 0.47      0.803   1.18 +/- 0.30      1.72 +/- 0.36      0.090   0.52 +/- 0.33      0.84 +/- 0.24      0.132
SO4       10.96 +/- 1.65     19.22 +/- 3.35     0.001*  15.89 +/- 4.39     41.26 +/- 17.38    0.082*  10.20 +/- 1.84     38.14 +/- 37.10    0.341
Wood      1.96 +/- 0.84      1.78 +/- 0.51      0.710   1.37 +/- 0.77      2.16 +/- 0.79      0.252   2.22 +/- 1.32      2.52 +/- 1.37      0.791
Agribarr  40.74 +/- 7.35     39.30 +/- 5.88     0.774   37.86 +/- 9.62     39.13 +/- 6.17     0.828   50.56 +/- 15.47    36.00 +/- 11.06    0.137
Chan      13.85 +/- 1.03     12.29 +/- 0.94     0.049*  12.77 +/- 1.93     9.44 +/- 1.25      0.006*  12.78 +/- 3.19     13.78 +/- 2.28     0.613
Embed     25.23 +/- 4.28     29.98 +/- 3.65     0.125   38.60 +/- 10.43    47.04 +/- 6.58     0.180   34.33 +/- 13.23    38.04 +/- 9.18     0.644
HShade    77.36 +/- 5.02     67.37 +/- 4.84     0.013*  65.89 +/- 8.88     62.06 +/- 6.56     0.526   62.91 +/- 13.27    57.43 +/- 9.16     0.496

*Indicates statistically significant difference at the 95% confidence level (p<0.05)

Coastal sites were mainly dominated by a rather constant combination of agricultural and forest

land covers (41.2 and 49.7 average percentages respectively). The agricultural coverage in these

areas was well beyond the Mid-Atlantic regional average, estimated at 20 percent (Herlihy

et al. 1998). Inappropriate habitat metrics and/or widespread human disturbance may explain the

weak relationships between IBI and PHI and land use parameters (Pirhalla 2004), which were

mostly evaluated as small-scale variables. Despite weak correlation, agriculture still remains the

main source of impairment in coastal sites with high nutrient loading (Pirhalla 2004) as well as

widespread levels of fine sediment deposition and embeddedness (Paul et al. 2002). However,

agriculture and related variables were not the main source of IBI variability in this stratum and

therefore, the model wasn’t sensitive to it. The detrimental effects of agriculture on biologic

integrity have been widely reported (EPA 2000; Hall and Killen 2005; Lammert and Allan 1999;

Meador and Goldstein 2003; Shields et al. 2006; Stewart et al. 2001).

Because regional/catchment environmental characteristics were quite constant, the model

showed how other variables more important at the reach system (related to stream variability)

took over as main predictors. Even though the background biotic integrity is mostly determined

by regional characteristics, fluctuations of IBI within the region were mostly determined by

variables relevant at a smaller geographic scale due to regional homogeneity. Nutrient and

sediment input, hydrologic regime and channel morphology are processes mostly controlled at

the regional scale. Other factors such as organic matter inputs, site habitat quality, as well as

shade are controlled at more local scales (Allan et al. 1997; Frissell et al. 1986).

Comparison of impaired and reference sites in coastal plains’ group 1 revealed pool quality as

the most critical issue to address in order to achieve realistic reference conditions as shown in

Figure 3-19. Pool quality was the only significantly different variable within group 1 between

reference and non-reference sites (Table 3-13). This might be an indication of a substantial gap

in habitat diversity (depth, current, and substrate or DCS) and habitat volume between reference

and non-reference sites. It has been reported that in headwater streams (group 1 is composed

mainly of small streams with an average DA equal to 3,600 acres), pools have a greater DCS

diversity and habitat volume than riffles (Schlosser 1982). Therefore, pool quality in small

streams is critical in order to achieve good biotic integrity. Differences in pH and DOC were

close to being statistically significant (Table 3-13). DOC and pH have not been found to have a

strong relationship with land use patterns in the Mid-Atlantic region (Herlihy et al. 1998). DOC

has been linked to nutrient enrichment (Leland and Porter 2000). However, differences in

nutrient concentrations were not observed in our research.

Biologic quality in piedmont sites was mainly determined by different configurations of

agricultural and urban land uses (average land use percentages in the drainage area equal to 57.5

and 11.7 percent respectively). As in coastal and highland sites, agricultural and barren land

uses were positively correlated to IBI (neuron-based r = 0.73). The opposite occurred with urban

land uses which had a strong negative correlation (neuron-based r = -0.75) in piedmont regions.

Correlations between IBI and agriculture and urban land uses were much weaker in the two other

strata (neuron-based agriculture-IBI r = 0.16 and 0.41, and urban-IBI r = 0.121 and 0.01 in

highland and coastal sites respectively). Weaker correlations in these strata were possibly due to

the existence of a third regulating factor: larger percentages of forest and wetlands (see Table

3-4). Forest and wetlands only had a significant positive effect in a few sites in the piedmont

stratum (Figure 3-21).

Even though a positive correlation between agriculture and IBI might seem counterintuitive,

this was the consequence of the agricultural-urban dominance. A study by Wang et al. (2000)

demonstrated how expansion of urban land uses in traditionally agricultural watersheds led

to a decrease in fish species, fish density and IBI. Therefore, agriculture was positively

correlated to IBI despite its evident negative effects if compared to pristine conditions. Impact of

urbanization becomes critical to the biologic community when a threshold ranging from 8 to

15% of connected imperviousness is reached (Schueler 1994; Wang et al. 2001). In Maryland, a

10% increase in urban land use has been linked to a doubled likelihood of failing biocriteria

(Volstad et al. 2003).

In piedmont sites, a steady channel flow and good channel quality seemed to be critical in order

to achieve good biologic integrity. Percentage of channel covered by flow showed a strong

positive correlation to agriculture, but was negatively correlated to urban land uses. This

association could be the consequence of greater percentages of imperviousness in developed

lands and its subsequent increase in storm runoff with shorter residence times. Fast conveyance

channeling is common practice in order to deal with increased urban runoff which, in turn, would

degrade channel quality (Novotny 2003). Remoteness was also significant and is an indication of

the proximity of human activity to the sampling site (Mercurio et al. 1999). Again, pool quality

was identified as a very significant variable to IBI. Degree of embeddedness due to fine

sediment deposition was the last main variable affecting biologic integrity. Substrate degradation

has very negative consequences to aquatic fauna (Manolakos et al. 2007; Quinn and Hickey

1990; Rabeni and Smale 1995; Richards et al. 1993). However, in this region fine sediment

embeddedness was positively correlated to IBI, most likely due to its linkage with agriculture

and its less severe effects compared to urban sprawl.

In piedmont sites, comparison of reference and non-reference sites showed how urbanization was

more extensive in non-reference sites (differences were statistically significant in groups 1 and

5). Group 4 was an exception to this. Percentage of channel covered by flow, and channel and

pool quality were consistently higher in reference sites with the exception, again, of group 4.

Group 4 had the best biologic integrity overall and therefore, sources of impairment may differ

from the rest of the groups. The remaining metrics varied more between groups.

A high degree of heterogeneity was found in highland regions as shown in Figure 3-25. The high

level of heterogeneity of highland observations in the MBSS database was already identified

(Southerland et al. 2005). In their study, a cluster analysis of the fish assemblages separated

highland observations into two main groups: sites with a drainage area smaller than 3,000 acres

(12.14 km2) and sites with a larger drainage area. Our model successfully identified this pattern.

One of the selected environmental variables in the 2nd SOM training in highland sites was

strongly correlated to drainage area (i.e. average stream width). Many of the variables related to

drainage area were also among the top IBI predictors in highland areas, although most of them

were disregarded due to cross-correlation (i.e. average thalweg depth, drainage area, and

maximum depth). The separation of observations in highland regions matched the results by

Southerland et al. (2005) quite well. Groups 1 and 2 had a median drainage area equal to 3,108

and 3,393 acres respectively, while in group 3 it was equal to 16,765 acres. IBI seemed to be

positively correlated to drainage area. This pattern was also observed in coastal sites. Even

though the available IBI scores in the MBSS database are calibrated with drainage area (Roth et

al. 2000), these don’t capture actual reference conditions for smaller streams (Southerland et al.

2005). Our model was able to successfully detect this trend. The second significant variable to

IBI was Epifaunal substrate quality, which also showed a strong positive correlation to IBI as

expected.

The high level of heterogeneity was more evident in small streams (groups 1 and 2), maybe due

to the IBI bias mentioned above. Local habitat conditions (instream woody debris and rootwads,

shade, and channel quality), water quality (SO4), or channel morphology (slope) explained the

remaining variability. In larger streams, different biological responses were only found due to

significantly different water qualities (i.e. differences in SO4 concentration) (Figure 3-25).

Comparison of reference and non-reference sites consistently confirmed stream size, substrate

quality, and some local variables as key issues to address if reference conditions were to be met

in highland small streams (Table 3-17). No statistically significant differences were observed

in larger streams (group 3). However, average SO4 concentrations were almost four times higher

in non-reference sites (10.2 versus 38.1 in reference and non-reference sites respectively), which

might be an indication of chemical degradation as shown in Figure 3-25. Most likely, selection

of a more stringent reference percentile to set reference conditions (e.g. 80th instead of 75th)

would have identified this difference in mean SO4 concentration as statistically significant.

3.3. Conclusions

3.3.1. Ohio with instream data

• Instream substrate parameters (i.e. embeddedness), nutrient input (i.e. TKN), and stream size

were the variables with the clearest relationship to IBI, which strongly agreed with current

literature. The rest of the environmental gradients acted more like ‘moderators’ of the final IBI.

The effects of variables acting at a local scale (i.e. total zinc and nitrate concentrations and pH)

were also successfully identified and separated.

• The model was sensitive to stream size. Headwaters (DA < 51.8 km2) and wadeable streams

(51.8 ≤ DA ≤ 518 km2) were mainly contained within groups 1, 2, 3, and 6. Small (518 < DA ≤

2,590 km2) and large rivers (DA > 2,590 km2) were mainly in group 6. Group 4 showed greater

variability in stream size. In this group, however, most of the observations belonged to wadeable

streams, but very large streams (maximum DA = 15,672.1 km2) and small rivers were also

present. This was important because no a priori assumptions were made with the data. Stream

size is known to play an important role on biologic integrity, having a positive correlation to IBI

in Ohio. The model successfully identified this trend.

• Development of reference curves using this methodology gives an indication of the expected

probability of violation if a biotic standard needs to be achieved. For example, in group 2, the

WWH biologic standard (IBI = 40) would be violated 10% of the time in reference conditions.

With this methodology, the reference sites can be determined at will by watershed managers

depending on the realistic goals that must be met for different watershed types. In the present

research the reference threshold was set to 75th percentile.

• The same methodology would yield more accurate results with an a priori separation of sites

in different ecoregions. In the present paper, the sites were not separated because not enough

observations were available for all of them to perform the partitions and subsequent curve

development. However, most of the data belonged to the ECBP and HELP ecoregions, with

higher natural nutrient concentrations, and the IP and EOLP with medium levels. Only 36

observations out of 429 belonged to the WAP ecoregion, with the lowest nutrient background

concentration.
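The probability-of-violation idea in the reference-curve point of the list above can be sketched by treating the reference-site IBI of a group as Gaussian and evaluating its cumulative distribution at the standard. Only the standard (IBI = 40) comes from the text. The mean and standard deviation below are hypothetical values chosen so that the result lands near the 10% quoted for group 2, and the function name is illustrative.

```python
from statistics import NormalDist

def violation_probability(reference_mean, reference_stdev, standard):
    """Probability that a reference-condition IBI falls below a numeric standard."""
    return NormalDist(reference_mean, reference_stdev).cdf(standard)

# Hypothetical reference-curve parameters; only the standard (IBI = 40) is from the text.
p_violate = violation_probability(reference_mean=50.0, reference_stdev=7.8, standard=40.0)
print(round(p_violate, 3))
```

Raising or lowering the percentile used to define reference sites changes the fitted mean and standard deviation, and therefore the expected violation rate for the same standard.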

3.3.2. Ohio with offstream data

• The model partitioning corresponded very well to a progressive reduction of geographic

scale. Basin-scale variables (i.e. basin connectivity) segregated different biotic responses in

different basins or watersheds within a basin. Upstream regional buffer and watershed land use

segregated biologic responses at the stream segment level. Local land use separated different

responses due to different local conditions but within a watershed and stream context. With the

presented methodology we believe the scale-issue in the analysis of biologic integrity has been

resolved. The model developed is able to zoom in and out of geographic scale and identify

responses at each level of watershed characterization.

• Regional land uses, and particularly percentage of forest and agriculture in the 30 and 100-

meter regional buffers, were the most important variables to biotic integrity. Watershed

urbanization was also significant, especially in watersheds with degraded or poorly vegetated

stream buffers. These variables were responsible for the background integrity in the different

groups.

3.3.3. Maryland

• The methodology successfully divided biotic integrity responses in the three different strata

in Maryland. Variables affecting IBI at larger geographic scales were successfully identified and

strongly agreed with current literature. Potential biases of the available IBI in the database were

successfully recognized by the model. Because of this, conclusions from the research have to be

drawn with caution, especially in coastal and highland regions. However, the methodology can

be replicated easily when this bias is addressed.

• The normality hypothesis for the environmentally homogeneous groups was confirmed by

the model. IBI didn’t follow a Gaussian curve in any of the full strata databases. When different

biologic responses were separated, most of the groups became normal at some level of group

characterization. Nevertheless, some of the groups still didn’t follow a normal distribution. In

most cases, it was most likely due to lack of a truly representative population sample because

few observations belonged to that group (fewer than 15). Only group 41 in piedmont region and

group 111112 in highland region (116 and 27 observations respectively) were an exception.

Existence of relevant, non-identified stressors could be a reason why populations with different

responses weren’t separated.

• Coastal and highland regions were very heterogeneous natural systems and the models

successfully identified this. Many different biological responses due to local effects were

identified. However, it remains unclear with the available data from the MBSS if the highly

diverse biologic responses are due to strata’s variability, or presence of non-sensitive habitat

metrics in coastal areas and an IBI bias in highland sites. These issues are reported in current

literature.

• Biological integrity in piedmont areas was mainly dominated by a combination of

agricultural and urban land uses. Agriculture had a strong positive correlation with IBI and urban

land use had a strong negative correlation. Even though agriculture is generally negatively associated with

biological integrity, urban impact has more acute detrimental effects once a threshold is reached.

• Comparison of differences between reference and non-reference sites helped identify the

most critical issues to be addressed in order to achieve realistic goals for improvement in each

group. In small streams in coastal sites (group 1, average DA = 3,600 acres), pool quality is

critical in order to achieve such conditions. In larger streams, urbanization is the main problem.

In piedmont sites, urbanization and channel quality are the key issues to be addressed. In

highland areas, improvement of substrate quality combined with other local instream habitat and

chemical characteristics such as woody debris presence, shading, channel quality, or sulfate

concentration is paramount for IBI improvement. In larger streams, water quality is the major

issue in highland regions.

4. Main conclusions

Two main outcomes have been achieved with the work presented in this thesis. First, it was

demonstrated that IBI is predicted more accurately using data patterning techniques based on

environmental similarities than with traditional methods. Second, a new methodology that allows

evaluation of biologic response to environmental stressors at multiple scales was developed. This

methodology was named PROHIBID (Probabilistic Hierarchical Biologic Integrity

Discrimination).

Since biological integrity is at the top of the natural system hierarchy, it is impossible to find

simple mechanistic processes and mathematical equations able to link changes in the biological

community to one or several environmental variables. Biological integrity is the result of many

natural existing conditions and anthropogenic stressors that are highly intertwined and explain a

larger or smaller portion of the final outcome. Because of the high dimensionality of the

problem, traditional prediction or evaluation techniques have great limitations. A simple

comparison of the IBI predictions between the k-nearest neighbor concept (kNN) and traditional

linear and non-linear regressions clearly showed that the first was superior in performance and

computation capabilities. Moreover, prediction performed by finding the most similar

environmental observations proved much more dynamic because it was easily validated using a

leave-one-out approach without drastically increasing the computation time. Such an approach

wasn’t possible when IBI was predicted using regression. In this case, a validation dataset had to

be separated. A leave-one-out approach was not possible because that would have meant

developing new equations each time one observation was taken out of the database.
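
The leave-one-out validation described above can be sketched in a few lines. The following is a minimal illustration in Python, not the code used in this thesis: the function names, the Euclidean distance, the unweighted mean of the neighbors' scores, and the six sample sites are all assumptions made for the example.

```python
import math

def knn_predict(target, sites, scores, k):
    """Predict a score as the mean score of the k most similar sites."""
    # Rank candidate sites by Euclidean distance in environmental space.
    order = sorted(range(len(sites)), key=lambda i: math.dist(target, sites[i]))
    nearest = order[:k]
    return sum(scores[i] for i in nearest) / len(nearest)

def leave_one_out_rmse(sites, scores, k):
    """Hold out each site in turn and predict it from the remaining ones."""
    squared_errors = []
    for i in range(len(sites)):
        other_sites = sites[:i] + sites[i + 1:]
        other_scores = scores[:i] + scores[i + 1:]
        predicted = knn_predict(sites[i], other_sites, other_scores, k)
        squared_errors.append((predicted - scores[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical standardized attributes per site (e.g. urban land use, habitat quality).
sites = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85], [0.8, 0.2], [0.9, 0.1], [0.85, 0.3]]
ibi = [48.0, 46.0, 50.0, 22.0, 18.0, 24.0]  # hypothetical IBI scores

for k in (1, 2, 3):
    print(k, round(leave_one_out_rmse(sites, ibi, k), 2))
```

Because kNN has no fitted equation, holding out a site only removes one row from the comparison set. A regression model would instead have to be refitted for every held-out observation.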

One of the main problems encountered using kNN was determining the optimum number of

closest neighbors that yielded the best possible prediction. Since extreme values in nature are by

definition rare (or at least much less frequent than non-extreme values), these were usually

predicted more accurately with lower numbers of k nearest neighbors (i.e. 1 or 2) because few

observations were truly environmentally similar. The opposite happened with observations with

no extreme values in any of the fields. Such observations had many other observations falling

within a smaller distance radius. Therefore, determining the optimum number of closest

neighbors to obtain the best possible prediction was a challenge because, depending on the type

of site and available observations in the databases, the ideal number of closest neighbors was

different in each case.

This issue was partially solved in Chapter 2 when the kNN technique was used to find the closest

branch of a hierarchical tree calculated with the observations being compared against. The

importance of using such a structure lies in the fact that the branches of the hierarchical structure

are composed of groups of observations that are very close to the remaining members of the same

branch. If the difference between two observations is larger than a specific threshold, these are

placed in different branches. Therefore, the closest branch to the target site being predicted is

only composed of a group of truly similar observations. In Chapter 1, when a specified number

of k-nearest neighbors (i.e. 5) was selected arbitrarily, it was not guaranteed that all the closest

observations were truly similar (for example, for observations with extreme values, maybe only

one or two sites were truly similar but the remaining three could be quite different and lead the

model to poor predictions). Another clear advantage of using such a hierarchical structure instead

of direct kNN prediction is the possibility of moving up and down the hierarchical structure

and finding the number of branches that optimizes the prediction (the number of tree

branches can range from two to the number of available observations). Prediction techniques

used in Chapters 1 and 2 can easily be implemented in many other scenarios and can easily be

used to evaluate the effect of anthropogenic stressors on the biologic community (or any other

endpoints) if enough historical data are available.
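
The idea of predicting from the closest branch rather than from a fixed number of neighbors can be illustrated with a small sketch. It is not the implementation used in Chapter 2: complete linkage is assumed here because it guarantees that no two members of a branch are farther apart than the threshold, and the function names, the threshold of 0.25, and the sample data are hypothetical.

```python
import math

def complete_linkage_branches(sites, threshold):
    """Merge branches while every pair of members stays within the threshold."""
    branches = [[i] for i in range(len(sites))]
    while True:
        best = None
        for a in range(len(branches)):
            for b in range(a + 1, len(branches)):
                # Complete linkage: distance between the two farthest members.
                d = max(math.dist(sites[i], sites[j])
                        for i in branches[a] for j in branches[b])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return branches
        _, a, b = best
        branches[a] = branches[a] + branches[b]
        del branches[b]

def centroid(sites, members):
    """Mean attribute vector of the sites in one branch."""
    n_attributes = len(sites[0])
    return [sum(sites[i][c] for i in members) / len(members)
            for c in range(n_attributes)]

def predict_from_closest_branch(target, sites, scores, threshold):
    """Average the scores of the branch whose centroid is nearest to the target."""
    branches = complete_linkage_branches(sites, threshold)
    closest = min(branches, key=lambda m: math.dist(target, centroid(sites, m)))
    return sum(scores[i] for i in closest) / len(closest), closest

# Hypothetical standardized attributes; the last site is an isolated, extreme case.
sites = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85],
         [0.8, 0.2], [0.9, 0.1], [0.85, 0.3],
         [0.5, 0.0]]
ibi = [48.0, 46.0, 50.0, 22.0, 18.0, 24.0, 35.0]

branches = complete_linkage_branches(sites, threshold=0.25)
prediction, members = predict_from_closest_branch([0.55, 0.05], sites, ibi, threshold=0.25)
print(branches, prediction, members)
```

In this toy data set the target resembles only the isolated site, so the closest branch has a single member. A fixed k of 3 would also have averaged in two sites from a different branch. Raising or lowering the threshold corresponds to moving up or down the tree.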

The results from Chapter 1 and Chapter 2 also revealed the importance of scale in the prediction

of system endpoints. Background biologic integrity is determined by variables that are

ubiquitous at the scale of the study and they were named environmental gradients or large-scale

variables. However, this doesn’t imply that variables that are non-ubiquitous don’t play an

important role. Point source pollution, channelization, or other localized variables can have a big

impact in specific sites but little impact on the overall integrity of a region. Ubiquitous

stressors, in contrast, can cause major shifts in species populations and, therefore, major changes in biotic

integrity. As a consequence, ubiquitous stressors affect the higher levels of the

species’ suitable habitat hierarchy. On the other hand, localized stressors only modify habitat

suitability at lower levels of the hierarchy and are only identified as significant variables when

the scale is small enough.

In order to address the scale issue, the PROHIBID methodology was developed. It was a

successful attempt to replicate the nested hierarchy of suitable habitats existing in nature.

Offstream environmental gradients in Ohio (i.e. large-scale variables) were mainly associated with

regional land use patterns, as expected. When instream variables were analyzed in Ohio,

large-scale variables were mainly related to nutrient input and habitat quality (which are directly

related to land use). PROHIBID successfully separated different biologic signatures that resulted

from different levels of stress at the local level.

The assumption of normality of the IBI distribution within a highly homogeneous environmental

group was proven true. Most of the resulting groups from the progressive segregation of biologic

responses followed a Gaussian distribution when the system was described in greater detail.

None of the initial databases followed such distribution because they were highly heterogeneous

and different biologic signals were mixed.
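
The contrast between a mixed database and a homogeneous group can be illustrated with a normality test. The sketch below uses the Jarque-Bera statistic, which appears in the reference list (Jarque and Bera 1987). Whether the thesis used exactly this statistic and this p-value approximation is an assumption, and the IBI values are invented for the example. The p-value relies on the asymptotic chi-square distribution with two degrees of freedom, so it is unreliable for small groups such as those with fewer than 15 observations.

```python
import math

def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    mean = sum(values) / n
    # Central moments (dividing by n, as in the usual definition of the test).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

# Two hypothetical biologic signatures, first mixed and then tested separately.
group_a = [18, 19, 20, 20, 21, 22] * 5
group_b = [48, 49, 50, 50, 51, 52] * 5
mixed = group_a + group_b

for name, data in (("mixed", mixed), ("group_a", group_a)):
    statistic, p_value = jarque_bera(data)
    print(name, round(statistic, 2), p_value < 0.05)
```

The mixed sample is bimodal, so normality is rejected at the 5% level. For the single group the test gives no evidence against normality.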

Because IBI can be easily characterized with a normal distribution, and because the

environmental observations within a group are similar, realistic and achievable reference conditions

can be identified within each group and represented again with a normal curve. The importance

of this lies in the fact that it allows comparison between a group’s reference and non-reference

sites and helps target potential issues that must be addressed in order to achieve reference

conditions. Moreover, such methodology can be applied at different levels of system

characterization (in this thesis, analyses were performed at one specific level for the sake of

brevity). This is important because the effect of the different variables at one specific level of

system characterization is always analyzed in the environmental background context of each

group (i.e. the effect of a specific local variable is only revealed when the effect of other

stressors with a larger overall impact on biologic integrity has been segregated previously). If a

PROHIBID scheme has been developed in a specific region, watershed managers can easily find

actual reference conditions for targeted sites by identifying the most similar group at the level the

available environmental variables allow.

PROHIBID could easily be implemented for the establishment of biological standards based on

probability of exceedance similar to those used in water quality. In this thesis, the group

reference conditions were set arbitrarily at the 75th IBI percentile. However, reference conditions

can be more or less stringent depending on the designated use of a specific water body.
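
As a rough illustration of a standard based on probability of exceedance, the sketch below fits a normal curve to the IBI scores of one group, reads the reference condition off the 75th percentile, and evaluates the probability of exceeding a given threshold. The scores, the variable names, and the 90th-percentile alternative are hypothetical, and the calculation assumes the group is adequately described by a normal distribution.

```python
from statistics import NormalDist

# Hypothetical IBI scores for one environmentally homogeneous group.
group_ibi = [28, 31, 33, 35, 36, 38, 40, 41, 43, 45]

# Fit a normal curve to the group (sample mean and sample standard deviation).
group = NormalDist.from_samples(group_ibi)

# Reference condition set at the 75th percentile of the fitted curve.
reference_ibi = group.inv_cdf(0.75)

def exceedance_probability(dist, threshold):
    """Probability that a site drawn from the group scores above the threshold."""
    return 1.0 - dist.cdf(threshold)

# A more stringent standard, e.g. for a water body with a more demanding designated use.
stringent_ibi = group.inv_cdf(0.90)

print(round(reference_ibi, 1), round(exceedance_probability(group, reference_ibi), 2))
print(round(stringent_ibi, 1), round(exceedance_probability(group, stringent_ibi), 2))
```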

5. Future research and work

Research to further understand the relationship between biologic integrity and different types of

stressors acting at different scales should be performed. Moreover, implementation of scale-

sensitive methodologies to frame and segregate biologic responses is a real possibility with the

readily available historic data some environmental agencies have collected. In my opinion, some

of the most critical issues that need to be addressed before enforcing biologic integrity as a

standard for preserving stream health are the following.

1. Development of a stand-alone, fully-integrated model. The PROHIBID methodology

presented in the current thesis is the result of multiple steps that use diverse data patterning

techniques combined with statistical analysis that might prove complex for potential users. If

such a methodology were to be applied, it would be necessary to develop a user-friendly framework in

which the user is only required to enter the model inputs in a spreadsheet.

2. Data sampling strategy: one of the main problems encountered when different states

were modeled is the lack of consistency in the sampled environmental data. For example, habitat

quality is evaluated using multiple habitat quality indices but these and their corresponding

metrics differ substantially among states. Physical and water quality parameters are not always

consistent either. For example, Ohio was the only state in which metal concentrations were

measured; Dissolved Organic Carbon (DOC) concentrations were available in Maryland but not

in the other states; and stream channel morphology data were available in Minnesota but not in the

other databases.

While it is understandable that each region has its particular environmental challenges, a

minimum consensus in the sampling needs to be achieved. In my opinion, this consensus should

be achieved not at the state level (as currently happens) but at the ecoregional level.

Ecoregions are defined as “areas within which there is spatial coincidence in characteristics of

geographical phenomena associated with differences in the quality, health, and integrity of

ecosystems” (Omernik 2004). “Characteristics of geographical phenomena” may include

geology, physiography, vegetation, climate, hydrology, terrestrial and aquatic fauna, and soils,

and may or may not include the impacts of human activity (e.g. land use patterns, vegetation

changes).

Sampling of potential large-scale anthropogenic disruptors should be homogeneous within an

ecoregion (and its basins) rather than at the state level. Targeted large-scale variables should at least

cover the following disturbances: stream fragmentation (at the basin level or larger),

regional land use (in the drainage area and regional stream buffer and preferably using the

sixteen land use types defined in the NLCD), water quality parameters (mainly parameters

related to nutrient loading such as BOD, TN, TP or TKN, or ionic strength such as conductivity,

hardness, or SO4), habitat quality (preferably continuous measurements instead of discrete

metrics and mostly related to substrate quality and stream variability because they reflect

regional hydrologic conditions), and point source density and intensity (if point source impact is

significant in the region).

Because of the large number of potential small-scale disruptors, these should be evaluated at

smaller scales (e.g. the watershed level), and sampling should target only those that are most likely to occur in a

specific area because of its particular environmental conditions. However, since there is a need to

compare impaired and non-impaired sites, several watersheds (impaired and non-impaired) with

similar large-scale environmental features should be sampled.

3. Holistic approach to improve stream health: the Clean Water Act of 1972 has been an

extraordinary tool to resolve the severe water quality problems U.S. streams faced at the end of the

last century. However, many research efforts agree that the main threat to U.S. stream health is

related less to water quality alone than to habitat degradation. Habitat degradation relates not only

to physical changes in habitat structure but also to hydrologic and hydraulic modifications,

fragmentation, or siltation. Current disturbances are mostly related to non-point source system

fragmentation of available habitat (physical, chemical, or hydraulic fragmentation). Non-point

source pollution is mainly driven by changes in the regional and local land use. Therefore, future

research evaluating the integrity of waters needs to be approached in this context and potential

solutions need to take this river ‘continuum’ concept into account.

4. Development of progressive biological standards: biologic integrity is a direct measure

of stream health. Its importance lies in the fact that it is an indication of disturbance in any part

of the natural system, not just water quality as explained in point 3. Therefore, setting biologic

standards is important to guarantee minimal ecosystem functionality in a specific region. I

believe a statistical approach such as the one presented in Section 2 of this thesis should be

implemented because it allows easy identification of reference sites within a specific region.

Biological standards should be developed in a two-tier fashion. In the first phase, larger regions

(i.e. basins, sub-basins, or watersheds) within an environmentally homogeneous unit (i.e.

ecoregions) should be targeted to guarantee good background integrity for subsequent, more

stringent standards. In a second phase, and after the standards in phase one have been met, more

local standards can be developed (i.e. at the watershed or sub-watershed level) targeting small-

scale stressors present in the region of study.

5. Use of information from observations with missing attribute values:

The results presented in this thesis were obtained by selecting complete datasets with no missing

data in either the response variables (i.e. IBI) or the explanatory variables (i.e. instream and/or

offstream environmental attributes). However, it is important to realize that the initial databases

were composed of a larger number of observations. Many of these observations were not used in

the work presented in this thesis because they had one or several missing explanatory variables

and were therefore discarded. Dealing with observations with missing attributes is a common problem

when large databases are used. Research from many different disciplines has focused on

extracting the potentially valuable information contained in incomplete observations. Some

common scientific disciplines dealing extensively with such problems are genetics (Ouyang et al.

2004; Troyanskaya et al. 2001), political and social sciences (Fessant and Midenet 2002; King et

al. 2001; Wang 2003), neural computing and machine learning (Batista and Monard 2003), or

more recently, environmental sciences (De'ath and Fabricius 2000; Dickson and Giblin 2007;

Junninen et al. 2004).

The first step in order to adopt a methodology to estimate missing attribute values is to determine

their degree of randomness because this will affect subsequent missing data treatment. Three

commonly accepted categories for missing data randomness are the following (Little and Rubin

1987):

1. Missing completely at random (MCAR): this is the highest level of randomness. It occurs

when the probability of an observation having a missing value for an attribute does not depend

on either the known values or the missing data. At this level, any missing data treatment can be

applied without risk of introducing bias into the data. The missing data in the presented research

qualifies as MCAR.

2. Missing at random (MAR): when the probability of an observation having a missing

value for an attribute may depend on the known values but not on the value of the missing data

itself

3. Not missing at random (NMAR): when the probability of an observation having a

missing value for an attribute could depend on the value of that attribute
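
The practical difference between the three categories can be seen in a small simulation. This is only an illustrative sketch: the data are synthetic, the missingness probabilities (0.3, 0.6x, 0.6y) are arbitrary choices, and `simulate` is a hypothetical helper. It compares the true mean of an attribute with the mean obtained after discarding the observations in which that attribute is missing.

```python
import random

def simulate(mechanism, n=20000, seed=0):
    """Return (true mean of y, mean of y over observations where y was recorded)."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.random()                  # attribute that is always recorded
        y = 0.5 * x + 0.5 * rng.random()  # attribute that may go missing, correlated with x
        if mechanism == "MCAR":
            p_missing = 0.3               # independent of both x and y
        elif mechanism == "MAR":
            p_missing = 0.6 * x           # depends only on the recorded attribute
        elif mechanism == "NMAR":
            p_missing = 0.6 * y           # depends on the missing value itself
        else:
            raise ValueError(mechanism)
        all_y.append(y)
        if rng.random() >= p_missing:
            observed_y.append(y)
    return sum(all_y) / n, sum(observed_y) / len(observed_y)

results = {m: simulate(m) for m in ("MCAR", "MAR", "NMAR")}
for mechanism, (true_mean, observed_mean) in results.items():
    print(mechanism, round(true_mean, 3), round(observed_mean, 3))
```

Under MCAR the mean of the retained observations stays close to the true mean, which is why discarding incomplete observations is defensible in that case. Under the other two mechanisms of this simulation the retained observations under-represent high values, so their mean is shifted downward.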

Several different methods have been proposed in the literature to treat missing data. These

methodologies can be divided into three main categories (Little and Rubin 1987):

1. Ignoring and discarding data: this consists of discarding observations and/or attributes

with missing entries. This methodology was adopted in the present thesis.

2. Parameter estimation: this category includes all those methods that involve the

calculation of parameters of a maximum likelihood function using a complete set of data.

Probably the most widely used methodology falling into this category is the Multiple

Imputation (MI) method. The widely implemented Expectation-Maximization (EM) algorithm is

one algorithm used within MI that can handle parameter estimation in the presence of missing data

(Dempster et al. 1977).

3. Imputation: this category refers to those procedures that aim to fill missing values with

estimated ones. Information from known relationships identified with the valid observations is

used to estimate the missing entries. Examples of imputation methods are kNN, SOM,

Multi-Layer Perceptron structures (MLP), or hierarchical trees (De'ath and Fabricius 2000;

Junninen et al. 2004). Other very commonly used, although rather naïve, imputation methods are

row or column averaging and imputation of zeroes. Other simple imputation methodologies are

univariate linear, spline, or nearest neighbor interpolation, and multivariate

regression-based imputation (Junninen et al. 2004). Hybrid methods combine different

imputation methodologies depending on the ‘length of the gap’ in the missing data (e.g. in time-

series data) or the percentage of missing data.
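
Two of the imputation methods listed above, column averaging and kNN, can be sketched as follows. This is a minimal illustration under stated assumptions rather than any of the cited implementations: missing entries are encoded as `None`, neighbors are drawn only from complete observations, distances use only the attributes known for the incomplete observation, and the attribute values are hypothetical and already standardized.

```python
import math

def column_mean_impute(rows):
    """Replace each None with the mean of the known values in its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        known = [r[c] for r in rows if r[c] is not None]
        means.append(sum(known) / len(known))
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)] for r in rows]

def knn_impute(rows, k):
    """Fill each None with the column mean over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        known_cols = [c for c, v in enumerate(r) if v is not None]
        # Distance uses only the attributes that are known for this row.
        nearest = sorted(
            complete,
            key=lambda other: math.dist([r[c] for c in known_cols],
                                        [other[c] for c in known_cols]),
        )[:k]
        filled.append([
            sum(n[c] for n in nearest) / len(nearest) if v is None else v
            for c, v in enumerate(r)
        ])
    return filled

# Hypothetical standardized attributes; the last observation lacks its second attribute.
rows = [
    [0.10, 0.90, 0.80],
    [0.20, 0.80, 0.70],
    [0.80, 0.20, 0.30],
    [0.90, 0.10, 0.20],
    [0.85, None, 0.25],
]

print(column_mean_impute(rows)[4])
print(knn_impute(rows, k=2)[4])
```

The column mean ignores which sites the incomplete observation resembles, whereas kNN borrows the value from its two most similar complete observations.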

The MI approach involves imputing m values for each missing item in an observation and

creating m complete data sets. Therefore, the observed values within each data set are the same,

but the imputed values differ to reflect uncertainty (King et al. 2001). Hence, each of the m

data sets can be analyzed as a complete data set, and a procedure is then used to combine the m results.

One MI model that has proven useful in many situations assumes that the variables are jointly

multivariate normal. Even though the normal distribution is just an approximation (few data sets

have variables that are all continuous and unbounded), many researchers have found that it

works as well as other more complicated functions especially designed for categorical or mixed

data (Schafer 1997; Schafer and Olsen 1998).
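
The text above does not name the procedure used to combine the m results. A common choice, assumed here, is the set of pooling rules usually attributed to Rubin: the pooled estimate is the mean of the m estimates, and its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name and the numbers below are hypothetical.

```python
from statistics import mean, variance

def pool_estimates(estimates, variances):
    """Combine m complete-data estimates and their variances into one result."""
    m = len(estimates)
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 3 imputed data sets, e.g. a mean IBI and its variance.
estimates = [10.0, 12.0, 11.0]
variances = [1.0, 1.0, 1.0]

pooled, total = pool_estimates(estimates, variances)
print(pooled, round(total, 3))
```

The between-imputation term is what carries the uncertainty caused by the missing values. A single imputation would report only the within-imputation variance and therefore overstate precision.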

In MI, the missing attribute values are usually imputed with a linear or multinomial

regression function of the remaining known attributes within the same observation. The regression

coefficients (vector β) are then estimated and uncertainty is introduced using a random parameter

(ε). Therefore, m different estimated values of the missing attribute are obtained. Subsequently,

using the normality condition, a likelihood function can be calculated with the vector of variable

means (µ) and the variance-covariance matrix (Σ) of the p variables (dependent and independent attributes)

of the full observations. Within the m generated versions of the incomplete observation, the one

which maximizes the likelihood function is selected and its calculated missing value chosen.

Even though MI approaches seem to be a very reliable way of imputing data comparable to other

methods such as the SOM (Dickson and Giblin 2007), computing the data likelihood function

can be unfeasible with classical methods. In response to such difficulties, different algorithms

have been developed such as the Imputation Posterior (IP), which is based on Markov Chain

Monte Carlo methods and requires a high level of expertise, or the Expectation-Maximization

(EM), which is deterministic. IP draws random simulations from the multivariate normal

observed data posterior (P(Dmis | Dobs)), while EM calculates the posterior means

deterministically. EM has the advantage that it is much faster in finding the maximum of the

likelihood function but the drawback that it does not yield the rest of the distribution (King et al.

2001). For detailed information on the IP algorithm, refer to Schafer (1997), and for the EM

algorithm, refer to Dempster et al. (1977) and McLachlan and Krishnan (1997). MI is considered

the most accurate and reliable way to infer missing parameters in time-series air quality data sets.

Its main drawback is the computation speed (Junninen et al. 2004).

In the present thesis, techniques such as kNN or SOM were used for prediction and data

classification purposes, respectively, and their principles were explained in previous chapters. These

same techniques could be easily implemented for missing data imputation. An MLP structure,

which was not used in this thesis, is probably the most widely known and successful neural

network. These networks employ a feed-forward architecture and are typically trained using a

procedure called error back-propagation (Junninen et al. 2004). However, the MLP appears to be

a viable alternative to classical imputation models only when calibration data are

sufficient, and it does not solve the practical difficulties encountered when treating real-size surveys.

The MLP has a fixed architecture for imputing a pre-defined set of variables from another

pre-defined set of variables. In real applications, the combinations of missing items

vary among observations, which makes MLP implementation difficult in most cases (Fessant

and Midenet 2002).

Junninen et al. (2004) compared the imputation performance on air quality data sets of

different techniques such as row averaging, linear interpolation, multivariate regressions, kNN,

SOM, MLP, and MI, along with hybrid methods of these. In all cases, the hybrid MI model was the

most accurate and reliable data imputation model. The hybrid SOM, kNN, and MLP had the

second best performances and their results were very similar. Non-hybrid SOM, kNN, and MLP

had the third best performances (again with very similar results among all three). All these models

significantly outperformed linear interpolation, multivariate regression, and row averaging

methods.

Another study attempted to estimate missing trace metal concentrations in groundwaters. In

this case, the performance of the EM algorithm was compared against the SOM. In all cases (with 25%

and 50% of data missing), the SOM outperformed the EM algorithm, whose estimates of the missing

values tended to be more scattered (Dickson and Giblin 2007). SOM can also be designed to

include uncertainty like the IP multiple imputation models do. This can be done by calculating a

fuzzy-SOM trained with complete and incomplete observations. Incomplete observations are

called fuzzy because different possible values for the missing attribute are introduced in the

model by estimating membership functions of the missing attribute (Wang 2003).

In another study by Troyanskaya et al. (2001), the kNN imputation method was compared to

other methods such as the row average method and Singular Value Decomposition (SVD,

which is based on principal component decomposition) in gene expression databases. Both kNN

and SVD significantly outperformed the row average method and both methods were robust to

an increasing fraction of missing data. However, kNN was less sensitive to the type of data used

and data noise and was able to provide accurate estimations for missing values in genes that

belonged to small tight expression clusters. SVD only predicted well in dominant clusters. A

similar conclusion for kNN imputation was reached by Batista and Monard (2003). In their work,

kNN outperformed other methods such as mean/mode imputation or no imputation at all.

6. References

Allan, J. D. (2004). "Influence of land use and landscape setting on the ecological status of rivers." Limnetica, 23(3-4), 187-198.

Allan, J. D., Erickson, D. L., and Fay, J. (1997). "The influence of catchment land use on stream integrity across multiple spatial scales." Freshwater Biology, 37(1), 149-161.

Allen, T. F. H., and Starr, T. B. (1982). Hierarchy: perspectives for ecological complexity, University of Chicago Press, Chicago.

Anderson, J. R., Hardy, E. E., Roach, J. T., and Witmer, R. E. (1976). "A land use and land cover classification system for use with remote sensor data." Geological Survey Professional Paper 964, U.S. Government Printing Office, Washington D.C.

Archer, D., and Newson, M. (2002). "The use of indices of flow variability in assessing the hydrological and instream habitat impacts of upland afforestation and drainage." Journal of Hydrology, 268(1-4), 244-258.

Barbour, M. T., Gerritsen, J., Snyder, B. D., and Stribling, J. B. (1999). "Rapid bioassessment protocols for use in streams and wadeable rivers: periphyton, benthic macroinvertebrates, and fish, second edition. EPA 841-B-99-002." US Environmental Protection Agency, Washington, DC.

Batista, G. E. A. P. A., and Monard, M. C. (2003). "An analysis of four missing data treatment methods for supervised learning." Applied Artificial Intelligence, 17(5-6), 519-533.

Beyer, H. L. (2004). "Hawth's Analysis Tools for ArcGIS." (2008).

Bode, R. W. (1988). "Methods for Rapid Biological Assessment of Streams." New York State Department of Environmental Conservation, Albany, NY.

Castelle, A. J., Johnson, A. W., and Conolly, C. (1994). "Wetland and Stream Buffer Size Requirements - a Review." Journal of Environmental Quality, 23(5), 878-882.

Cereghino, R., Giraudel, J. L., and Compin, A. (2001). "Spatial analysis of stream invertebrates distribution in the Adour-Garonne drainage basin (France), using Kohonen self organizing maps." Ecological Modelling, 146(1-3), 167-180.

Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). Graphical methods for data analysis, Wadsworth & Brooks/Cole, Pacific Grove, CA.

Chon, T. S., Park, Y. S., Moon, K. H., and Cha, E. Y. (1996). "Patternizing communities by using an artificial neural network." Ecological Modelling, 90(1), 69-78.

Davies, D. L., and Bouldin, D. W. (1979). "A cluster separation measure." IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227.

De'ath, G., and Fabricius, K. E. (2000). "Classification and regression trees: A powerful yet simple technique for ecological data analysis." Ecology, 81(11), 3178-3192.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum likelihood from incomplete data via the EM algorithm (with discussion)." Journal of the Royal Statistical Society, B39, 1-38.

Detenbeck, N. E., Batterman, S. L., Brady, V. J., Brazner, J. C., Snarski, V. M., Taylor, D. L., Thompson, J. A., and Arthur, J. W. (2000). "A test of watershed classification systems for ecological risk assessment." Environmental Toxicology and Chemistry, 19(4(2)), 1174-1181.

Detenbeck, N. E., Johnston, C. A., and Niemi, G. J. (1993). "Wetland Effects on Lake Water-Quality in the Minneapolis St-Paul Metropolitan-Area." Landscape Ecology, 8(1), 39-61.

Dickson, B. L., and Giblin, A. M. (2007). "An evaluation of methods for imputation of missing trace element data in groundwaters." Geochemistry-Exploration Environment Analysis, 7, 173-178.

DNR, M. (2008). "Maryland Biological Stream Survey." Available at: http://www.dnr.state.md.us/streams/mbss/.

Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern classification. 2nd edition, Wiley, New York, NY.

Dyer, S. D., White-Hull, C., Carr, G. J., Smith, E. P., and Wang, X. H. (2000). "Bottom-up and top-down approaches to assess multiple stressors over large geographic areas." Environmental Toxicology and Chemistry, 19(4), 1066-1075.

Dyer, S. D., White-Hull, C., Wang, X., Johnson, T. D., and Carr, G. J. (1998a). "Determining the influence of habitat and chemical factors on instream biotic integrity for a Southern Ohio watershed." Journal of Aquatic Ecosystem Stress and Recovery, 6, 91-110.

Dyer, S. D., White-Hull, C. E., Johnson, T. D., Carr, G. J., and Wang, X. (1998b). "The importance of space in understanding the risk of multiple stressors on the biological integrity of receiving waters." Journal of Hazardous Materials, 61(1-3), 37-41.

Dynesius, M., and Nilsson, C. (1994). "Fragmentation and Flow Regulation of River Systems in the Northern 3rd of the World." Science, 266(5186), 753-762.

EPA. (2000). "The quality of our nation's waters. EPA 841-S-00-001." USEPA Office of Water, Washington, DC.

EPA. (2008a). "Current national recommended water quality criteria. Available at: http://www.epa.gov./waterscience/criteria/wqcriteria.html. Last time visited: April 2008."

EPA. (2008b). "Multi-resolution land characteristics consortium (MRLC). Available at: http://www.epa.gov/mrlc/."

EPA. (2008c). "Permit Compliance System Database." Available at: http://epa.gov/enviro/html/pcs/pcs_query_java.html.

Fessant, F., and Midenet, S. (2002). "Self-organising map for data imputation and correction in surveys." Neural Computing & Applications, 10(4), 300-310.

Fischer, S., and Kummer, H. (2000). "Effects of residual flow and habitat fragmentation on distribution and movement of bullhead (Cottus gobio L.) in an alpine stream." Hydrobiologia, 422, 305-317.

Freeman, M. C., Bowen, Z. H., Bovee, K. D., and Irwin, E. R. (2001). "Flow and habitat effects on juvenile fish abundance in natural and altered flow regimes." Ecological Applications, 11(1), 179-190.

Frissell, C. A., Liss, W. J., Warren, C. E., and Hurley, M. D. (1986). "A hierarchical framework for stream habitat classification: viewing streams in a watershed context." Environmental Management, 10, 199-214.

Gilvear, D. J., Heal, K. V., and Stephen, A. (2002). "Hydrology and the ecological quality of Scottish river ecosystems." Science of the Total Environment, 294(1-3), 131-159.

Gujarati, D. N. (2003). Basic econometrics, McGraw-Hill, NY.

Hall, L. W., and Killen, W. D. (2005). "Temporal and spatial assessment of water quality, physical habitat, and benthic communities in an impaired agricultural stream in California's San Joaquin Valley." Journal of Environmental Science and Health Part a-Toxic/Hazardous Substances & Environmental Engineering, 40(5), 959-989.

Hall, L. W., Morgan, R. P., Perry, E. S., and Waltz, A. (1999). "Development of a provisional physical habitat index for Maryland freshwater streams." Maryland Department of Natural Resources. Chesapeake Bay and watershed programs. Monitoring and non-tidal assessment, Annapolis, MD.

Hall, L. W., Scott, M. C., Killen, W. D., and Anderson, R. D. (1996). "The effects of land-use characteristics and acid sensitivity on the ecological status of Maryland coastal plain streams." Environmental Toxicology and Chemistry, 15(3), 384-394.

Herlihy, A., Stoddard, J. L., and Johnson, C. B. (1998). "The relationship between stream chemistry and watershed land cover data in the Mid-Atlantic region, USA." Water, Air, and Soil Pollution, 105, 377-386.

Hilsenhoff, W. L. (1987). "An improved biotic index of organic stream pollution." Great Lakes Entomologist, 20(1), 31-39.

Jain, A. K., and Dubes, R. C. (1988). Algorithms for clustering data, Prentice Hall Inc., Saddle River, NJ.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). "Data clustering: a review." ACM Computing Surveys, 31(3), 264-323.

Jarque, C. M., and Bera, A. K. (1987). "A test of normality of observations and regression residuals." International Statistical Review, 55(2), 163-172.

Johnson, L. B., Richards, C., Host, G. E., and Arthur, J. W. (1997). "Landscape influences on water chemistry in Midwestern stream ecosystems." Freshwater Biology, 37(1), 193-&.

Johnston, C. A., Detenbeck, N. E., and Niemi, G. J. (1990). "The cumulative effect of wetlands on stream water quality and quantity. A landscape approach." Biogeochemistry, 10, 105-141.

Judge, G. G., Hill, R. C., Griffiths, W. E., Lutkepohl, H., and Lee, T.-C. (1985). The theory and practice of econometrics, 2nd Ed., Wiley, NY.

Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., and Kolehmainen, M. (2004). "Methods for imputation of missing values in air quality data sets." Atmospheric Environment, 38(18), 2895-2907.

Karr, J. R. (1991). "Biological Integrity: a long-neglected aspect of water resource management." Ecological Applications, 1(1), 66-84.

Karr, J. R., Fausch, K. D., Angermeier, P. L., Yant, P. R., and Schlosser, I. J. (1986). "Assessing biological integrity of running waters: a method and its rationale." Illinois Natural History Survey, Champaign, IL.

Karr, J. R., and Kerans, B. L. (1981). "Components of biological integrity: their definition and use in development of an invertebrate IBI." 1991 MidWest Pollution Control Biologists Meeting. Environmental Indicators: measurement and assessment endpoints. U.S. Environmental Protection Agency, Lincolnwood, IL.

King, G., Honaker, J., Joseph, A., and Scheve, K. (2001). "Analyzing incomplete political science data: An alternative algorithm for multiple imputation." American Political Science Review, 95(1), 49-69.

Kiviluoto, K. (Year). "Topology preservation in Self-Organizing Maps." IEEE International Conference in Neural Networks, 294-299.

Kohonen, T. (2001). Self-Organizing Maps, 3 Ed., Springer-Verlag, Berlin.

Kolasa, J. (1989). "Ecological systems in hierarchical perspective: breaks in community structure and other consequences." Ecology, 70(1), 36-47.

Kolasa, J., and Biesiadka, E. (1984). "Diversity Concept in Ecology." Acta Biotheoretica, 33, 145-162.

Kolasa, J., and Strayer, D. (1988). "Patterns of the abundance of species: a comparison of two hierarchical models." OIKOS, 53, 235-241.

Lammert, M., and Allan, J. D. (1999). "Assessing biotic integrity of streams: Effects of scale in measuring the influence of land use/cover and habitat structure on fish and macroinvertebrates." Environmental Management, 23(2), 257-270.

Leland, H. V., and Porter, S. D. (2000). "Distribution of benthic algae in the upper Illinois River basin in relation to geology and land use." Freshwater Biology, 44(2), 279-301.

Little, R. J., and Rubin, D. B. (1987). Statistical analysis with missing data, John Wiley and Sons, New York.

Lyons, J. (2006). "A fish-based index of biotic integrity to assess intermittent headwater streams in Wisconsin, USA." Environmental Monitoring and Assessment, 122(1-3), 239-258.

Lyons, J., Piette, R. R., and Niermeyer, K. W. (2001). "Development, validation, and application of a fish-based index of biotic integrity for Wisconsin's large warmwater rivers." Transactions of the American Fisheries Society, 130(6), 1077-1094.

Manolakos, E., Virani, H., and Novotny, V. (2007). "Extracting knowledge on the links between the water body stressors and biotic integrity." Water Research, 41(18), 4041-4050.

McLachlan, G. J., and Krishnan, T. (1997). The EM algorithm and extensions, Wiley, New York.

Meador, M. R., and Goldstein, R. M. (2003). "Assessing water quality at large geographic scales: Relations among land use, water physicochemistry, riparian condition, and fish community structure." Environmental Management, 31(4), 504-517.

Mercurio, G., Chaillou, J. C., and Roth, N. E. (1999). "Guide to using the 1995-1997 Maryland Biological stream survey data." Maryland Department of Natural Resources, Annapolis, MD.

Minshall, G. W. (1984). "Aquatic-insect substratum relationships." In: The Ecology Of Aquatic Insects, V. H. Resh and D. M. Rosenberg, eds., Praeger Scientific, New York NY, 358-400.

Mitsch, W. J., and Gosselink, J. G. (1986). Wetlands, Van Nostrand Reinhold, New York, NY.

Morita, K., and Yamamoto, S. (2002). "Effects of habitat fragmentation by damming on the persistence of stream-dwelling charr populations." Conservation Biology, 16(5), 1318-1323.

Morita, K., and Yokota, A. (2002). "Population viability of stream-resident salmonids after habitat fragmentation: a case study with white-spotted charr (Salvelinus leucomaenis) by an individual based model." Ecological Modelling, 155(1), 85-94.

Morley, S. A., and Karr, J. R. (2002). "Assessing and restoring the health of urban streams in the Puget Sound basin." Conservation Biology, 16(6), 1498-1509.

Nilsson, C., Reidy, C. A., Dynesius, M., and Revenga, C. (2005). "Fragmentation and flow regulation of the world's large river systems." Science, 308(5720), 405-408.

Norton, S. B., Cormier, S. M., Smith, M., and Jones, R. C. (2000). "Can biological assessments discriminate among types of stress? A case study from the Eastern Corn Belt Plains ecoregion." Environmental Toxicology and Chemistry, 19(4), 1113-1119.

Norton, S. B., Cormier, S. M., Smith, M., Jones, R. C., and Schubauer-Berigan, M. (2002). "Predicting levels of stress from biological assessment data: Empirical models from the Eastern Corn Belt Plains, Ohio, USA." Environmental Toxicology and Chemistry, 21(6), 1168-1175.

Noss, R. F. (1990). "Indicators for monitoring biodiversity: a hierarchical approach." Conservation Biology, 4(4), 355-364.

Novotny, V. (2003). Water Quality. Diffuse Pollution and Watershed Management, 2 Ed., John Wiley & Sons, New York.

Novotny, V. (2004). "Simplified Databased Total Maximum Daily Loads, or the World is Log-Normal." Journal of Environmental Engineering, June 2004, 674-683.

Novotny, V., Bartosova, A., O'Reilly, N., and Ehlinger, T. (2005). "Unlocking the relationship of biotic integrity of impaired waters to anthropogenic stresses." Water Research, 39(1), 184-198.

Novotny, V., Manolakos, E., Ehlinger, T., Bartosova, A., O'Reilly, N., Bedoya, D., McGarvey, K., Brooks, J., Beach, D., Farah, J., and Shaker, R. (2007). "Developing a risk propagation model for estimating ecological responses of streams to anthropogenic watershed stresses and stream modifications. Final Report." Center for Urban and Environmental Studies, Northeastern University, Boston, MA. Available at: http://www.coe.neu.edu/environment/WebReports/EPA_final_Report2.pdf.

O'Neill, R. V., DeAngelis, D. L., Waide, J. B., and Allen, T. F. H. (1986). A hierarchical concept of ecosystems, Princeton University Press, Princeton, NJ.

Ohio_EPA. (1987). "Biological Criteria for the Protection of Aquatic Life: Volume I-III. Standardized Field and laboratory methods for assessing fish and macroinvertebrate communities", Division of Water Quality Monitoring and Assessment, Surface Water Section, Columbus, OH.

Omernik, J. M. (2004). "Perspectives on the nature and definition of ecological regions." Environmental Management, 34, S27-S38.

Ott, W. R. (1978). Environmental Indices: theory and practice, Ann Arbor Science, Ann Arbor, MI.

Ouyang, M., Welsh, W. J., and Georgopoulos, P. (2004). "Gaussian mixture clustering and imputation of microarray data." Bioinformatics, 20(6), 917-923.

Park, Y. S., Chang, J. B., Lek, S., Cao, W. X., and Brosse, S. (2003). "Conservation strategies for endemic fish species threatened by the Three Gorges Dam." Conservation Biology, 17(6), 1748-1758.

Parkyn, S. M., Davies-Colley, R. J., Halliday, N. J., Costley, K. J., and Croker, G. F. (2003). "Planted riparian buffer zones in New Zealand: Do they live up to expectations?" Restoration Ecology, 11(4), 436-447.

Paul, M. J., Stribling, J. B., Klauda, R. J., Kazyak, P. F., Southerland, M. T., and Roth, N. E. (2002). "A physical habitat index for freshwater wadeable streams in Maryland. Final report", Maryland Department of Natural Resources. Chesapeake Bay and watershed programs. Monitoring and non-tidal assessment., Annapolis, MD.

Pickett, S. T. A., Kolasa, J., Armesto, J. J., and Collins, S. L. (1989). "The ecological concept of disturbance and its expression at various hierarchical levels." OIKOS, 54, 129-136.

Pirhalla, D. E. (2004). "Evaluating fish-habitat relationships for refining regional indexes of biotic integrity: Development of a tolerance index of habitat degradation for Maryland stream fishes." Transactions of the American Fisheries Society, 133(1), 144-159.

Poff, N. L., and Allan, J. D. (1995). "Functional-Organization of Stream Fish Assemblages in Relation to Hydrological Variability." Ecology, 76(2), 606-627.

Poff, N. L., Allan, J. D., Bain, M. B., Karr, J. R., Prestegaard, K. L., Richter, B. D., Sparks, R. E., and Stromberg, J. C. (1997). "The natural flow regime." Bioscience, 47(11), 769-784.

Quinn, J. M., Cooper, A. B., Davies-Colley, R. J., Rutherford, J. C., and Williamson, R. B. (1997). "Land use effects on habitat, water quality, periphyton, and benthic invertebrates in Waikato, New Zealand, hill-country streams." New Zealand Journal of Marine and Freshwater Research, 31(5), 579-597.

Quinn, J. M., and Hickey, C. W. (1990). "Magnitude of effects of substrate particle size, recent flooding, and watershed development on benthic invertebrates in 88 New Zealand rivers." New Zealand Journal of Marine and Freshwater Research, 24, 411-428.

Rabeni, C. F., and Smale, M. A. (1995). "Effects of Siltation on Stream Fishes and the Potential Mitigating Role of the Buffering Riparian Zone." Hydrobiologia, 303(1-3), 211-219.

Rankin, E. T. (1989). "The Qualitative Habitat Evaluation Index (QHEI): rationale, methods, and application." Ecological Assessment Section, Division of Water Quality, Planning, and Assessment. Ohio Environmental Protection Agency, Columbus, OH.

Rankin, E. T., Miltner, B., Yoder, C. O., and Mishne, D. (1999). "Association between nutrients, habitat, and the aquatic biota in Ohio rivers and streams." Ohio EPA Technical Bulletin MAS/1999-1-1, Columbus, OH.

Rankin, E. T., Yoder, C. O., and Mishne, D. (1990). "Ohio Water Resources Inventory: Executive Summary and Volume 1." Ohio Environmental Protection Agency, Columbus, Ohio.

ReyesGavilan, F. G., Garrido, R., Nicieza, A. G., Toledo, M. M., and Brana, F. (1996). "Fish community variation along physical gradients in short streams of northern Spain and the disruptive effect of dams." Hydrobiologia, 321(2), 155-163.

Richards, C., Host, G. E., and Arthur, J. W. (1993). "Identification of Predominant Environmental-Factors Structuring Stream Macroinvertebrate Communities within a Large Agricultural Catchment." Freshwater Biology, 29(2), 285-294.

Richards, C., Johnson, L. B., and Host, G. E. (1996). "Landscape-scale influences on stream habitats and biota." Canadian Journal of Fisheries and Aquatic Sciences, 53, 295-311.

Richter, B. D., Baumgartner, J. V., Powell, J., and Braun, D. P. (1996). "A method for assessing hydrologic alteration within ecosystems." Conservation Biology, 10(4), 1163-1174.

Roth, N., Southerland, M., Chaillou, J., Klauda, R., Kazyak, P., Stranko, S., Weisberg, S., Hall, L., and Morgan, R. (1998). "Maryland biological stream survey: Development of a fish Index of Biotic Integrity." Environmental Monitoring and Assessment, 51(1-2), 89-106.

Roth, N. E., Allan, J. D., and Erickson, D. L. (1996). "Landscape influences on stream biotic integrity assessed at multiple spatial scales." Landscape Ecology, 11(3), 141-156.

Roth, N. E., Southerland, M. T., Chaillou, J. C., Kazyak, P. F., and Stranko, S. A. (2000). "Refinement and validation of a fish index of biotic integrity for Maryland streams." Prepared by Versar Inc. for Maryland Department of Natural Resources, Columbia, MD.

Rykiel, E. J. (1985). "Towards a definition of ecological disturbance." Australian Journal of Ecology, 10, 361-365.

Scarsbrook, M. R., and Halliday, J. (1999). "Transition from pasture to native forest land-use along stream continua: effects on stream ecosystems and implications for restoration." New Zealand Journal of Marine and Freshwater Research, 33(2), 293-310.

Schafer, J. L. (1997). Analysis of incomplete multivariate data, Chapman and Hall, London.

Schafer, J. L., and Olsen, M. K. (1998). "Multiple imputation for multivariate missing-data problems: A data analyst's perspective." Multivariate Behavioral Research, 33(4), 545-571.

Schlosser, I. J. (1982). "Fish community structure and function along two habitat gradients in a headwater stream." Ecological Monographs, 52(4), 395-414.

Schueler, T. (1994). "The importance of imperviousness." Watershed Protection Techniques, 1, 100-111.

Shields, F. D., Langendoen, E. J., and Doyle, M. W. (2006). "Adapting existing models to examine effects of agricultural conservation programs on stream habitat quality." Journal of the American Water Resources Association, 42(1), 25-33.

Southerland, M. T., Rogers, G. M., Kline, M. J., Morgan, R. P., Boward, D. M., Kazyak, P. F., Klauda, R. J., and Stranko, S. A. (2005). "New biological indicators to better assess the condition of Maryland streams." Maryland Department of Natural Resources. Monitoring and non-tidal assessment division. DNR-12-0305-0100, Annapolis, MD.

Stewart, J. S., Wang, L. Z., Lyons, J., Horwatich, J. A., and Bannerman, R. (2001). "Influences of watershed, riparian-corridor, and reach-scale characteristics on aquatic biota in agricultural watersheds." Journal of the American Water Resources Association, 37(6), 1475-1487.

Stribling, J. B., Kessup, K. J., and White, J. S. (1998). "Development of a benthic index of biotic integrity for Maryland streams." Maryland Department of Natural Resources. Monitoring and non-tidal assessment division. CBWP-EA-98-3, Annapolis, MD.

Sugihara, G. (1980). "Minimal community structure: an explanation of species abundance patterns." Am. Nat., 116, 770-787.

Sugihara, G. (1983). "Niche hierarchy: structure, organization and assembly in natural communities," Princeton University, Princeton, N.J.

Tran, L. T., Knight, C. G., O'Neill, R. V., Smith, E. R., and O'Connell, M. (2003). "Self-organizing maps for integrated environmental." Environmental Management, 31(6), 822-835.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001). "Missing value estimation methods for DNA microarrays." Bioinformatics, 17(6), 520-525.

USACE. (2005). "National Inventory of Dams." Available at: http://crunch.tec.army.mil/nidpublic/webpages/nid.cfm.

USGS. (2008a). "National Hydrography Dataset." U.S. Department of the Interior. Available at: http://nhd.usgs.gov/index.html.

USGS. (2008b). "National Land Cover Database." Multi-resolution Land Characteristics Consortium U.S. Department of the Interior. Available at: http://www.mrlc.gov/index.php.

Volstad, J. H., Roth, N. E., Mercurio, G., Southerland, M. T., and Strebel, D. E. (2003). "Using environmental stressor information to predict the ecological status of Maryland non-tidal streams as measured by biological indicators." Environmental Monitoring and Assessment, 84(3), 219-242.

Wang, L., Lyons, J., Kanehl, P., and Bannerman, R. (2001). "Impacts of Urbanization on stream habitat and fish across multiple spatial scales." Environmental Management, 28(2), 255-266.

Wang, L. Z., Lyons, J., Kanehl, P., Bannerman, R., and Emmons, E. (2000). "Watershed urbanization and changes in fish communities in southeastern Wisconsin streams." Journal of the American Water Resources Association, 36(5), 1173-1189.

Wang, S. H. (2003). "Application of self-organising maps for data mining with incomplete data sets." Neural Computing & Applications, 12(1), 42-48.

Wright, J. F., Armitage, P. D., Furse, M. T., and Moss, D. (1988). "A new approach to the biological surveillance of river quality using macroinvertebrates." Verh. International Verein. Limnol., 23, 1548-1552.

Yuan, L. L., and Norton, S. B. (2004). "Assessing the relative severity of stressors at a watershed scale." Environmental Monitoring and Assessment, 98(1-3), 323-349.

Appendices

Appendix I: group statistics

APPENDIX I: GROUP STATISTICS

GROUP STATISTICS IN OHIO USING INSTREAM DATA

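GROUPS AFTER ENVIRONMENTAL GRADIENTS

| GROUP | # OBS | | DO | TKN | SO4 | EMBED | RIP | GRAD | DA | IBI |
|---|---|---|---|---|---|---|---|---|---|---|
| G1 | 87 | μ | 8.57 | 1.37 | 138.90 | 3.59 | 3.53 | 5.29 | 32.71 | 27.84 |
| | | σ | 2.86 | 1.85 | 112.12 | 0.61 | 1.89 | 1.83 | 37.82 | 8.41 |
| G2 | 111 | μ | 7.98 | 0.54 | 97.32 | 2.31 | 5.95 | 9.06 | 65.78 | 38.67 |
| | | σ | 1.75 | 0.42 | 60.04 | 0.49 | 1.65 | 1.40 | 87.13 | 7.75 |
| G3 | 35 | μ | 6.75 | 2.43 | 266.34 | 3.80 | 3.74 | 7.83 | 15.89 | 24.00 |
| | | σ | 3.00 | 3.74 | 180.41 | 0.39 | 1.07 | 1.56 | 19.74 | 7.08 |
| G4 | 71 | μ | 6.64 | 0.72 | 65.28 | 3.67 | 6.15 | 6.00 | 355.40 | 31.04 |
| | | σ | 1.73 | 0.45 | 36.22 | 0.41 | 2.08 | 0.89 | 1227.50 | 8.48 |
| G5 | 56 | μ | 8.46 | 0.73 | 58.27 | 2.28 | 6.09 | 9.36 | 920.11 | 42.82 |
| | | σ | 1.88 | 0.95 | 45.65 | 0.41 | 1.21 | 0.94 | 1548.52 | 5.81 |
| G6 | 69 | μ | 7.71 | 0.89 | 190.25 | 3.46 | 5.83 | 9.07 | 105.98 | 28.43 |
| | | σ | 1.66 | 0.61 | 141.95 | 0.54 | 2.00 | 1.06 | 89.79 | 7.21 |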

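GROUPS AFTER TOTAL ZINC CONCENTRATION

| GROUP | # OBS | | DO | TKN | SO4 | EMBED | RIP | GRAD | DA | ZN | IBI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| G11 | 3 | μ | 6.00 | 3.21 | 114.33 | 3.17 | 3.33 | 6.00 | 3.83 | 178.67 | 18.00 |
| | | σ | 2.72 | 4.17 | 33.20 | 0.76 | 1.53 | 3.46 | 2.63 | 43.00 | 10.39 |
| G12 | 84 | μ | 8.66 | 1.30 | 139.78 | 3.60 | 3.54 | 5.26 | 33.74 | 21.26 | 28.19 |
| | | σ | 2.84 | 1.74 | 113.91 | 0.60 | 1.90 | 1.78 | 38.09 | 17.91 | 8.19 |
| G2 | 111 | μ | 7.98 | 0.54 | 97.32 | 2.31 | 5.95 | 9.06 | 65.78 | 15.74 | 38.67 |
| | | σ | 1.75 | 0.42 | 60.04 | 0.49 | 1.65 | 1.40 | 87.13 | 18.62 | 7.75 |
| G31 | 30 | μ | 6.79 | 2.40 | 257.23 | 3.88 | 3.68 | 7.60 | 17.21 | 15.80 | 22.67 |
| | | σ | 3.02 | 3.94 | 178.08 | 0.31 | 1.13 | 1.52 | 21.01 | 6.79 | 5.23 |
| G32 | 4 | μ | 5.78 | 3.03 | 319.25 | 3.38 | 4.13 | 9.00 | 8.93 | 44.75 | 37.00 |
| | | σ | 3.16 | 2.64 | 237.68 | 0.48 | 0.75 | 1.15 | 4.70 | 9.71 | 3.46 |
| G4 | 71 | μ | 6.64 | 0.72 | 65.28 | 3.67 | 6.15 | 6.00 | 355.40 | 13.96 | 31.04 |
| | | σ | 1.73 | 0.45 | 36.22 | 0.41 | 2.08 | 0.89 | 1227.50 | 5.88 | 8.48 |
| G5 | 56 | μ | 8.46 | 0.73 | 58.27 | 2.28 | 6.09 | 9.36 | 920.11 | 17.73 | 42.82 |
| | | σ | 1.88 | 0.95 | 45.65 | 0.41 | 1.21 | 0.94 | 1548.52 | 21.69 | 5.81 |
| G6 | 69 | μ | 7.71 | 0.89 | 190.25 | 3.46 | 5.83 | 9.07 | 105.98 | 30.19 | 28.43 |
| | | σ | 1.66 | 0.61 | 141.95 | 0.54 | 2.00 | 1.06 | 89.79 | 66.65 | 7.21 |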
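As a consistency check on the split of G1 into G11 and G12, the subgroup statistics can be recombined into the parent group's mean and standard deviation. The sketch below does this for the IBI column. It assumes the tabulated σ values are sample standard deviations (n - 1 denominator), which the appendix does not state; the function name `combine` is illustrative only.

```python
import math

def combine(groups):
    """Merge (n, mean, sd) triples into the n, mean and sd of the pooled sample.

    Assumption: each sd is a sample standard deviation (n - 1 denominator).
    """
    n_total = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / n_total
    # Sum of squares within each subgroup, recovered from its sd
    within = sum((n - 1) * s ** 2 for n, _, s in groups)
    # Sum of squares between subgroup means and the pooled mean
    between = sum(n * (m - mean) ** 2 for n, m, _ in groups)
    sd = math.sqrt((within + between) / (n_total - 1))
    return n_total, mean, sd

# IBI rows of G11 and G12 from the zinc table: (n, mean, sd)
n, mean, sd = combine([(3, 18.00, 10.39), (84, 28.19, 8.19)])
print(n, round(mean, 2), round(sd, 2))
```

Under that assumption the recombined values agree, to two decimals, with the G1 row of the environmental-gradient table (87 observations, IBI μ 27.84, σ 8.41).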

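GROUPS AFTER pH

| GROUP | # OBS | | DO | TKN | SO4 | EMBED | RIP | GRAD | DA | ZN | PH | IBI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G11 | 3 | μ | 6.00 | 3.21 | 114.33 | 3.17 | 3.33 | 6.00 | 3.83 | 178.67 | 7.00 | 18.00 |
| | | σ | 2.72 | 4.17 | 33.20 | 0.76 | 1.53 | 3.46 | 2.63 | 43.00 | 0.25 | 10.39 |
| G121 | 74 | μ | 8.07 | 1.14 | 139.52 | 3.55 | 3.61 | 5.24 | 27.04 | 20.70 | 7.82 | 28.92 |
| | | σ | 2.05 | 1.75 | 116.54 | 0.62 | 2.01 | 1.83 | 28.67 | 16.61 | 0.30 | 8.32 |
| G122 | 10 | μ | 13.04 | 2.48 | 141.70 | 3.95 | 3.00 | 5.40 | 83.30 | 25.40 | 8.84 | 22.80 |
| | | σ | 4.00 | 1.10 | 97.50 | 0.16 | 0.47 | 1.35 | 60.13 | 26.44 | 0.42 | 4.44 |
| G2 | 111 | μ | 7.98 | 0.54 | 97.32 | 2.31 | 5.95 | 9.06 | 65.78 | 15.74 | 7.93 | 38.67 |
| | | σ | 1.75 | 0.42 | 60.04 | 0.49 | 1.65 | 1.40 | 87.13 | 18.62 | 0.30 | 7.75 |
| G311 | 25 | μ | 5.98 | 1.95 | 260.36 | 3.90 | 3.72 | 7.44 | 16.27 | 16.36 | 7.62 | 23.52 |
| | | σ | 2.23 | 2.74 | 190.34 | 0.29 | 1.16 | 1.58 | 19.34 | 6.81 | 0.21 | 5.11 |
| G312 | 5 | μ | 10.86 | 4.68 | 241.60 | 3.80 | 3.50 | 8.40 | 21.90 | 13.00 | 8.30 | 18.40 |
| | | σ | 3.38 | 7.74 | 110.32 | 0.45 | 1.00 | 0.89 | 30.38 | 6.71 | 0.24 | 3.85 |
| G32 | 4 | μ | 5.78 | 3.03 | 319.25 | 3.38 | 4.13 | 9.00 | 8.93 | 44.75 | 7.63 | 37.00 |
| | | σ | 3.16 | 2.64 | 237.68 | 0.48 | 0.75 | 1.15 | 4.70 | 9.71 | 0.31 | 3.46 |
| G4 | 71 | μ | 6.64 | 0.72 | 65.28 | 3.67 | 6.15 | 6.00 | 355.40 | 13.96 | 7.81 | 31.04 |
| | | σ | 1.73 | 0.45 | 36.22 | 0.41 | 2.08 | 0.89 | 1227.50 | 5.88 | 0.23 | 8.48 |
| G5 | 56 | μ | 8.46 | 0.73 | 58.27 | 2.28 | 6.09 | 9.36 | 920.11 | 17.73 | 7.94 | 42.82 |
| | | σ | 1.88 | 0.95 | 45.65 | 0.41 | 1.21 | 0.94 | 1548.52 | 21.69 | 0.21 | 5.81 |
| G6 | 69 | μ | 7.71 | 0.89 | 190.25 | 3.46 | 5.83 | 9.07 | 105.98 | 30.19 | 7.72 | 28.43 |
| | | σ | 1.66 | 0.61 | 141.95 | 0.54 | 2.00 | 1.06 | 89.79 | 66.65 | 0.62 | 7.21 |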

GROUPS AFTER NITRATE CONCENTRATION

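| GROUP | # OBS | | DO | TKN | SO4 | EMBED | RIP | GRAD | DA | ZN | PH | NO3 | IBI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G11 | 3 | μ | 6.00 | 3.21 | 114.33 | 3.17 | 3.33 | 6.00 | 3.83 | 178.67 | 7.00 | 18.30 | 18.00 |
| | | σ | 2.72 | 4.17 | 33.20 | 0.76 | 1.53 | 3.46 | 2.63 | 43.00 | 0.25 | 8.92 | 10.39 |
| G121 | 74 | μ | 8.07 | 1.14 | 139.52 | 3.55 | 3.61 | 5.24 | 27.04 | 20.70 | 7.82 | 2.56 | 28.92 |
| | | σ | 2.05 | 1.75 | 116.54 | 0.62 | 2.01 | 1.83 | 28.67 | 16.61 | 0.30 | 3.92 | 8.32 |
| G122 | 10 | μ | 13.04 | 2.48 | 141.70 | 3.95 | 3.00 | 5.40 | 83.30 | 25.40 | 8.84 | 0.29 | 22.80 |
| | | σ | 4.00 | 1.10 | 97.50 | 0.16 | 0.47 | 1.35 | 60.13 | 26.44 | 0.42 | 0.23 | 4.44 |
| G2 | 111 | μ | 7.98 | 0.54 | 97.32 | 2.31 | 5.95 | 9.06 | 65.78 | 15.74 | 7.93 | 2.12 | 38.67 |
| | | σ | 1.75 | 0.42 | 60.04 | 0.49 | 1.65 | 1.40 | 87.13 | 18.62 | 0.30 | 3.36 | 7.75 |
| G311 | 25 | μ | 5.98 | 1.95 | 260.36 | 3.90 | 3.72 | 7.44 | 16.27 | 16.36 | 7.62 | 1.30 | 23.52 |
| | | σ | 2.23 | 2.74 | 190.34 | 0.29 | 1.16 | 1.58 | 19.34 | 6.81 | 0.21 | 1.83 | 5.11 |
| G312 | 5 | μ | 10.86 | 4.68 | 241.60 | 3.80 | 3.50 | 8.40 | 21.90 | 13.00 | 8.30 | 0.84 | 18.40 |
| | | σ | 3.38 | 7.74 | 110.32 | 0.45 | 1.00 | 0.89 | 30.38 | 6.71 | 0.24 | 0.99 | 3.85 |
| G32 | 4 | μ | 5.78 | 3.03 | 319.25 | 3.38 | 4.13 | 9.00 | 8.93 | 44.75 | 7.63 | 2.33 | 37.00 |
| | | σ | 3.16 | 2.64 | 237.68 | 0.48 | 0.75 | 1.15 | 4.70 | 9.71 | 0.31 | 3.29 | 3.46 |
| G4 | 71 | μ | 6.64 | 0.72 | 65.28 | 3.67 | 6.15 | 6.00 | 355.40 | 13.96 | 7.81 | 1.70 | 31.04 |
| | | σ | 1.73 | 0.45 | 36.22 | 0.41 | 2.08 | 0.89 | 1227.50 | 5.88 | 0.23 | 1.77 | 8.48 |
| G5 | 56 | μ | 8.46 | 0.73 | 58.27 | 2.28 | 6.09 | 9.36 | 920.11 | 17.73 | 7.94 | 1.93 | 42.82 |
| | | σ | 1.88 | 0.95 | 45.65 | 0.41 | 1.21 | 0.94 | 1548.52 | 21.69 | 0.21 | 1.89 | 5.81 |
| G61 | 32 | μ | 8.03 | 0.98 | 225.66 | 3.34 | 6.13 | 9.50 | 105.27 | 20.31 | 7.88 | 8.33 | 25.63 |
| | | σ | 1.71 | 0.43 | 129.65 | 0.47 | 1.63 | 0.88 | 61.88 | 19.45 | 0.40 | 10.91 | 6.84 |
| G62 | 37 | μ | 7.43 | 0.81 | 159.62 | 3.57 | 5.58 | 8.70 | 106.59 | 38.73 | 7.59 | 0.68 | 30.86 |
| | | σ | 1.58 | 0.73 | 146.66 | 0.58 | 2.26 | 1.08 | 109.23 | 88.90 | 0.74 | 0.60 | 6.69 |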

GROUP STATISTICS IN OHIO USING OFFSTREAM DATA

GROUPS AFTER ENVIRONMENTAL GRADIENTS

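| GROUP | # OBS | | R30_FOREST | R100_AGRI | R30_BARREN | IBI |
|---|---|---|---|---|---|---|
| G1 | 17 | μ | 39.50 | 60.29 | 0.24 | 41.06 |
| | | σ | 10.53 | 7.65 | 0.23 | 5.44 |
| G2 | 412 | μ | 24.88 | 60.94 | 0.01 | 32.57 |
| | | σ | 18.90 | 22.98 | 0.03 | 9.74 |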

GROUPS AFTER R30_NONFOREST

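| GROUP | # OBS | | R30_FOREST | R100_AGRI | R30_BARREN | R30_NONFOR | IBI |
|---|---|---|---|---|---|---|---|
| G1 | 17 | μ | 39.50 | 60.29 | 0.24 | 0.39 | 41.06 |
| | | σ | 10.53 | 7.65 | 0.23 | 0.47 | 5.44 |
| G21 | 304 | μ | 27.05 | 59.28 | 0.01 | 0.77 | 33.97 |
| | | σ | 20.28 | 24.11 | 0.03 | 0.65 | 9.34 |
| G22 | 108 | μ | 18.78 | 65.60 | 0.01 | 3.63 | 28.63 |
| | | σ | 12.54 | 18.76 | 0.01 | 1.99 | 9.81 |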

GROUPS AFTER SITE_CON

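| GROUP | # OBS | | R30_FOREST | R100_AGRI | R30_BARREN | R30_NONFOR | SITE_CON | IBI |
|---|---|---|---|---|---|---|---|---|
| G1 | 17 | μ | 39.50 | 0.24 | 60.29 | 0.39 | 34.24 | 41.06 |
| | | σ | 10.53 | 0.23 | 7.65 | 0.47 | 34.67 | 5.44 |
| G211 | 46 | μ | 40.62 | 0.00 | 59.66 | 0.37 | 87.71 | 40.09 |
| | | σ | 12.22 | 0.01 | 11.37 | 0.44 | 12.93 | 7.83 |
| G212 | 258 | μ | 24.63 | 0.01 | 59.21 | 0.85 | 14.89 | 32.88 |
| | | σ | 20.49 | 0.03 | 25.74 | 0.66 | 13.65 | 9.18 |
| G221 | 4 | μ | 42.43 | 0.02 | 59.28 | 4.41 | 59.67 | 48.50 |
| | | σ | 0.49 | 0.00 | 0.38 | 1.29 | 0.33 | 1.91 |
| G222 | 60 | μ | 21.52 | 0.01 | 66.33 | 3.06 | 34.60 | 30.50 |
| | | σ | 10.93 | 0.01 | 12.24 | 0.59 | 7.77 | 10.87 |
| G223 | 48 | μ | 15.36 | 0.00 | 64.70 | 26.41 | 3.43 | 26.29 |
| | | σ | 13.65 | 0.01 | 24.73 | 14.00 | 2.39 | 7.80 |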

GROUPS AFTER RDA_URBAN

| GROUP | # OBS | | R30_FOREST | R100_AGRI | R30_BARREN | R30_NONFOR | SITE_CON | RDA_URBAN | IBI |
|---|---|---|---|---|---|---|---|---|---|
| G1 | 17 | μ | 39.50 | 0.24 | 60.29 | 0.39 | 34.24 | 8.65 | 41.06 |
| | | σ | 10.53 | 0.23 | 7.65 | 0.47 | 34.67 | 4.90 | 5.44 |
| G211 | 46 | μ | 40.62 | 0.00 | 59.66 | 0.37 | 87.71 | 11.01 | 40.09 |
| | | σ | 12.22 | 0.01 | 11.37 | 0.44 | 12.93 | 10.29 | 7.83 |
| G212 | 258 | μ | 24.63 | 0.01 | 59.21 | 0.85 | 14.89 | 18.70 | 32.88 |
| | | σ | 20.49 | 0.03 | 25.74 | 0.66 | 13.65 | 20.40 | 9.18 |
| G221 | 4 | μ | 42.43 | 0.02 | 59.28 | 4.41 | 59.67 | 6.08 | 48.50 |
| | | σ | 0.49 | 0.00 | 0.38 | 1.29 | 0.33 | 0.15 | 1.91 |
| G2221 | 26 | μ | 21.78 | 0.01 | 71.47 | 3.15 | 35.16 | 7.54 | 38.08 |
| | | σ | 14.80 | 0.01 | 13.87 | 0.67 | 11.73 | 1.77 | 9.98 |
| G2222 | 34 | μ | 21.33 | 0.01 | 62.40 | 19.41 | 34.17 | 22.61 | 24.71 |
| | | σ | 6.91 | 0.01 | 9.24 | 9.96 | 1.84 | 8.56 | 7.47 |
| G2231 | 33 | μ | 11.71 | 0.01 | 79.09 | 18.91 | 2.22 | 7.68 | 24.30 |
| | | σ | 10.69 | 0.01 | 10.89 | 9.67 | 0.47 | 2.72 | 7.88 |
| G2232 | 15 | μ | 23.39 | 0.00 | 33.04 | 9.91 | 6.10 | 47.47 | 30.67 |
| | | σ | 16.23 | 0.00 | 14.84 | 4.47 | 2.77 | 20.31 | 5.69 |

GROUPS AFTER L100_FOREST

| GROUP | # OBS | | R30_FOREST | R100_AGRI | R30_BARREN | R30_NONFOR | SITE_CON | RDA_URBAN | L100_FOREST | IBI |
|---|---|---|---|---|---|---|---|---|---|---|
| G1 | 17 | μ | 39.50 | 0.24 | 60.29 | 0.39 | 34.24 | 8.65 | 49.71 | 41.06 |
| | | σ | 10.53 | 0.23 | 7.65 | 0.47 | 34.67 | 4.90 | 16.73 | 5.44 |
| G2111 | 25 | μ | 38.42 | 0.01 | 61.36 | 0.47 | 87.39 | 10.71 | 34.61 | 37.60 |
| | | σ | 14.16 | 0.01 | 13.42 | 0.55 | 12.62 | 10.50 | 10.61 | 7.92 |
| G2112 | 21 | μ | 43.23 | 0.00 | 57.62 | 0.24 | 88.10 | 11.37 | 67.42 | 43.05 |
| | | σ | 9.06 | 0.00 | 8.19 | 0.18 | 13.59 | 10.29 | 8.97 | 6.77 |
| G2121 | 28 | μ | 51.75 | 0.00 | 34.02 | 0.80 | 14.48 | 25.40 | 77.50 | 39.43 |
| | | σ | 17.48 | 0.00 | 23.77 | 0.62 | 9.71 | 23.61 | 6.73 | 8.18 |
| G2122 | 105 | μ | 34.80 | 0.01 | 49.57 | 0.91 | 12.29 | 22.08 | 41.74 | 35.81 |
| | | σ | 16.70 | 0.03 | 22.41 | 0.61 | 10.33 | 21.24 | 11.91 | 9.71 |
| G2123 | 125 | μ | 10.01 | 0.01 | 72.95 | 0.81 | 17.17 | 14.36 | 7.82 | 28.96 |
| | | σ | 10.04 | 0.04 | 20.42 | 0.70 | 16.29 | 18.00 | 6.71 | 6.95 |
| G221 | 4 | μ | 42.43 | 0.02 | 59.28 | 4.41 | 59.67 | 6.08 | 43.56 | 48.50 |
| | | σ | 0.49 | 0.00 | 0.38 | 1.29 | 0.33 | 0.15 | 4.93 | 1.91 |
| G2221 | 26 | μ | 21.78 | 0.01 | 71.47 | 3.15 | 35.16 | 7.54 | 25.05 | 38.08 |
| | | σ | 14.80 | 0.01 | 13.87 | 0.67 | 11.73 | 1.77 | 19.78 | 9.98 |
| G2222 | 34 | μ | 21.33 | 0.01 | 62.40 | 19.41 | 34.17 | 22.61 | 14.77 | 24.71 |
| | | σ | 6.91 | 0.01 | 9.24 | 9.96 | 1.84 | 8.56 | 12.74 | 7.47 |
| G2231 | 33 | μ | 11.71 | 0.01 | 79.09 | 18.91 | 2.22 | 7.68 | 9.79 | 24.30 |
| | | σ | 10.69 | 0.01 | 10.89 | 9.67 | 0.47 | 2.72 | 11.02 | 7.88 |
| G2232 | 15 | μ | 23.39 | 0.00 | 33.04 | 9.91 | 6.10 | 47.47 | 14.58 | 30.67 |
| | | σ | 16.23 | 0.00 | 14.84 | 4.47 | 2.77 | 20.31 | 18.94 | 5.69 |

GROUPS AFTER L30_AGRI

| GROUP | # OBSERV | | R30_FOREST | R100_AGRI | R30_BARREN | R30_NONFOREST | SITE_CON | RDA_URBAN | L100_FOREST | L30_AGRI | IBI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| G1 | 17 | μ | 39.50 | 0.24 | 60.29 | 0.39 | 34.24 | 8.65 | 49.71 | 21.66 | 41.06 |
| | | σ | 10.53 | 0.23 | 7.65 | 0.47 | 34.67 | 4.90 | 16.73 | 17.63 | 5.44 |
| G2111 | 25 | μ | 38.42 | 0.01 | 61.36 | 0.47 | 87.39 | 10.71 | 34.61 | 35.66 | 37.60 |
| | | σ | 14.16 | 0.01 | 13.42 | 0.55 | 12.62 | 10.50 | 10.61 | 20.81 | 7.92 |
| G2112 | 21 | μ | 43.23 | 0.00 | 57.62 | 0.24 | 88.10 | 11.37 | 67.42 | 15.13 | 43.05 |
| | | σ | 9.06 | 0.00 | 8.19 | 0.18 | 13.59 | 10.29 | 8.97 | 9.54 | 6.77 |
| G2121 | 28 | μ | 51.75 | 0.00 | 34.02 | 0.80 | 14.48 | 25.40 | 77.50 | 11.22 | 39.43 |
| | | σ | 17.48 | 0.00 | 23.77 | 0.62 | 9.71 | 23.61 | 6.73 | 10.44 | 8.18 |
| G2122 | 105 | μ | 34.80 | 0.01 | 49.57 | 0.91 | 12.29 | 22.08 | 41.74 | 24.93 | 35.81 |
| | | σ | 16.70 | 0.03 | 22.41 | 0.61 | 10.33 | 21.24 | 11.91 | 18.33 | 9.71 |
| G2123 | 125 | μ | 10.01 | 0.01 | 72.95 | 0.81 | 17.17 | 14.36 | 7.82 | 61.34 | 28.96 |
| | | σ | 10.04 | 0.04 | 20.42 | 0.70 | 16.29 | 18.00 | 6.71 | 29.34 | 6.95 |
| G221 | 4 | μ | 42.43 | 0.02 | 59.28 | 4.41 | 59.67 | 6.08 | 43.56 | 35.67 | 48.50 |
| | | σ | 0.49 | 0.00 | 0.38 | 1.29 | 0.33 | 0.15 | 4.93 | 16.67 | 1.91 |
| G2221 | 26 | μ | 21.78 | 0.01 | 71.47 | 3.15 | 35.16 | 7.54 | 25.05 | 49.25 | 38.08 |
| | | σ | 14.80 | 0.01 | 13.87 | 0.67 | 11.73 | 1.77 | 19.78 | 27.47 | 9.98 |
| G22221 | 18 | μ | 22.35 | 0.02 | 63.80 | 11.42 | 33.87 | 19.42 | 19.55 | 0.18 | 22.11 |
| | | σ | 9.26 | 0.00 | 8.84 | 5.32 | 2.52 | 3.58 | 13.82 | 0.76 | 6.34 |
| G22222 | 16 | μ | 20.17 | 0.01 | 60.82 | 10.41 | 34.50 | 26.21 | 9.39 | 61.23 | 27.63 |
| | | σ | 2.24 | 0.01 | 9.70 | 4.76 | 0.06 | 10.99 | 9.07 | 22.79 | 7.74 |
| G22311 | 26 | μ | 8.87 | 0.01 | 81.72 | 15.41 | 2.13 | 7.48 | 6.39 | 81.08 | 25.85 |
| | | σ | 7.00 | 0.01 | 7.55 | 7.65 | 0.38 | 2.82 | 6.88 | 10.03 | 7.20 |
| G22312 | 7 | μ | 22.24 | 0.01 | 69.33 | 5.91 | 2.59 | 8.45 | 22.45 | 46.28 | 18.57 |
| | | σ | 15.55 | 0.01 | 15.91 | 2.16 | 0.62 | 2.34 | 14.59 | 7.49 | 8.14 |
| G2232 | 15 | μ | 23.39 | 0.00 | 33.04 | 9.91 | 6.10 | 47.47 | 14.58 | 14.57 | 30.67 |
| | | σ | 16.23 | 0.00 | 14.84 | 4.47 | 2.77 | 20.31 | 18.94 | 22.35 | 5.69 |

GROUPS AFTER L30_NONFOREST

| GROUP | # OBSERV | | R30_FOREST | R100_AGRI | R30_BARREN | R30_NONFOREST | SITE_CON | RDA_URBAN | L100_FOREST | L30_AGRI | L30_NONFOR | IBI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G1 | 17 | μ | 39.50 | 0.24 | 60.29 | 0.39 | 34.24 | 8.65 | 49.71 | 21.66 | 1.26 | 41.06 |
| | | σ | 10.53 | 0.23 | 7.65 | 0.47 | 34.67 | 4.90 | 16.73 | 17.63 | 2.36 | 5.44 |
| G2111 | 25 | μ | 38.42 | 0.01 | 61.36 | 0.47 | 87.39 | 10.71 | 34.61 | 35.66 | 0.73 | 37.60 |
| | | σ | 14.16 | 0.01 | 13.42 | 0.55 | 12.62 | 10.50 | 10.61 | 20.81 | 1.43 | 7.92 |
| G2112 | 21 | μ | 43.23 | 0.00 | 57.62 | 0.24 | 88.10 | 11.37 | 67.42 | 15.13 | 0.11 | 43.05 |
| | | σ | 9.06 | 0.00 | 8.19 | 0.18 | 13.59 | 10.29 | 8.97 | 9.54 | 0.42 | 6.77 |
| G2121 | 28 | μ | 51.75 | 0.00 | 34.02 | 0.80 | 14.48 | 25.40 | 77.50 | 11.22 | 1.01 | 39.43 |
| | | σ | 17.48 | 0.00 | 23.77 | 0.62 | 9.71 | 23.61 | 6.73 | 10.44 | 1.36 | 8.18 |
| G21221 | 98 | μ | 35.94 | 0.01 | 47.95 | 0.88 | 12.30 | 23.03 | 42.09 | 24.44 | 0.45 | 36.55 |
| | | σ | 16.06 | 0.03 | 21.94 | 0.61 | 10.48 | 21.67 | 11.68 | 18.52 | 0.67 | 9.44 |
| G21222 | 7 | μ | 18.84 | 0.01 | 72.38 | 1.29 | 12.09 | 8.76 | 36.77 | 31.67 | 7.70 | 25.43 |
| | | σ | 18.54 | 0.03 | 16.40 | 0.33 | 8.61 | 2.45 | 14.89 | 14.97 | 4.58 | 7.81 |
| G2123 | 125 | μ | 10.01 | 0.01 | 72.95 | 0.81 | 17.17 | 14.36 | 7.82 | 61.34 | 1.35 | 28.96 |
| | | σ | 10.04 | 0.04 | 20.42 | 0.70 | 16.29 | 18.00 | 6.71 | 29.34 | 2.65 | 6.95 |
| G221 | 4 | μ | 42.43 | 0.02 | 59.28 | 4.41 | 59.67 | 6.08 | 43.56 | 35.67 | 0.91 | 48.50 |
| | | σ | 0.49 | 0.00 | 0.38 | 1.29 | 0.33 | 0.15 | 4.93 | 16.67 | 1.19 | 1.91 |
| G2221 | 26 | μ | 21.78 | 0.01 | 71.47 | 3.15 | 35.16 | 7.54 | 25.05 | 49.25 | 3.01 | 38.08 |
| | | σ | 14.80 | 0.01 | 13.87 | 0.67 | 11.73 | 1.77 | 19.78 | 27.47 | 4.27 | 9.98 |
| G22221 | 18 | μ | 22.35 | 0.02 | 63.80 | 11.42 | 33.87 | 19.42 | 19.55 | 0.18 | 0.00 | 22.11 |
| | | σ | 9.26 | 0.00 | 8.84 | 5.32 | 2.52 | 3.58 | 13.82 | 0.76 | 0.00 | 6.34 |
| G22222 | 16 | μ | 20.17 | 0.01 | 60.82 | 10.41 | 34.50 | 26.21 | 9.39 | 61.23 | 4.80 | 27.63 |
| | | σ | 2.24 | 0.01 | 9.70 | 4.76 | 0.06 | 10.99 | 9.07 | 22.79 | 4.93 | 7.74 |
| G22311 | 26 | μ | 8.87 | 0.01 | 81.72 | 15.41 | 2.13 | 7.48 | 6.39 | 81.08 | 3.52 | 25.85 |
| | | σ | 7.00 | 0.01 | 7.55 | 7.65 | 0.38 | 2.82 | 6.88 | 10.03 | 3.20 | 7.20 |
| G22312 | 7 | μ | 22.24 | 0.01 | 69.33 | 5.91 | 2.59 | 8.45 | 22.45 | 46.28 | 8.04 | 18.57 |
| | | σ | 15.55 | 0.01 | 15.91 | 2.16 | 0.62 | 2.34 | 14.59 | 7.49 | 4.25 | 8.14 |
| G2232 | 15 | μ | 23.39 | 0.00 | 33.04 | 9.91 | 6.10 | 47.47 | 14.58 | 14.57 | 5.81 | 30.67 |
| | | σ | 16.23 | 0.00 | 14.84 | 4.47 | 2.77 | 20.31 | 18.94 | 22.35 | 8.65 | 5.69 |

GROUPS AFTER L100_URBAN

| GROUP | # OBS | | R30_FOREST | R100_AGRI | R30_BARREN | R30_NONFOREST | SITE_CON | RDA_URBAN | L100_FOREST | L30_AGRI | L30_NONFOR | L100_URBAN | IBI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G1 | 17 | μ | 39.50 | 0.24 | 60.29 | 0.39 | 34.24 | 8.65 | 49.71 | 21.66 | 1.26 | 12.00 | 41.06 |
| | | σ | 10.53 | 0.23 | 7.65 | 0.47 | 34.67 | 4.90 | 16.73 | 17.63 | 2.36 | 9.83 | 5.44 |
| G21111 | 17 | μ | 35.97 | 0.00 | 65.69 | 0.43 | 92.44 | 9.05 | 35.77 | 38.88 | 0.66 | 6.42 | 35.18 |
| | | σ | 15.51 | 0.01 | 12.40 | 0.60 | 3.05 | 11.39 | 11.71 | 22.64 | 1.46 | 4.14 | 6.33 |
| G21112 | 8 | μ | 43.62 | 0.01 | 52.17 | 0.57 | 76.65 | 14.23 | 32.13 | 28.83 | 0.89 | 26.57 | 42.75 |
| | | σ | 9.60 | 0.02 | 11.14 | 0.45 | 18.19 | 7.78 | 7.91 | 15.32 | 1.44 | 14.67 | 8.88 |
| G2112 | 21 | μ | 43.23 | 0.00 | 57.62 | 0.24 | 88.10 | 11.37 | 67.42 | 15.13 | 0.11 | 7.61 | 43.05 |
| | | σ | 9.06 | 0.00 | 8.19 | 0.18 | 13.59 | 10.29 | 8.97 | 9.54 | 0.42 | 4.54 | 6.77 |
| G2121 | 28 | μ | 51.75 | 0.00 | 34.02 | 0.80 | 14.48 | 25.40 | 77.50 | 11.22 | 1.01 | 8.24 | 39.43 |
| | | σ | 17.48 | 0.00 | 23.77 | 0.62 | 9.71 | 23.61 | 6.73 | 10.44 | 1.36 | 5.96 | 8.18 |
| G21221 | 98 | μ | 35.94 | 0.01 | 47.95 | 0.88 | 12.30 | 23.03 | 42.09 | 24.44 | 0.45 | 22.80 | 36.55 |
| | | σ | 16.06 | 0.03 | 21.94 | 0.61 | 10.48 | 21.67 | 11.68 | 18.52 | 0.67 | 19.70 | 9.44 |
| G21222 | 7 | μ | 18.84 | 0.01 | 72.38 | 1.29 | 12.09 | 8.76 | 36.77 | 31.67 | 7.70 | 13.97 | 25.43 |
| | | σ | 18.54 | 0.03 | 16.40 | 0.33 | 8.61 | 2.45 | 14.89 | 14.97 | 4.58 | 4.57 | 7.81 |
| G2123 | 125 | μ | 10.01 | 0.01 | 72.95 | 0.81 | 17.17 | 14.36 | 7.82 | 61.34 | 1.35 | 22.62 | 28.96 |
| | | σ | 10.04 | 0.04 | 20.42 | 0.70 | 16.29 | 18.00 | 6.71 | 29.34 | 2.65 | 25.81 | 6.95 |
| G221 | 4 | μ | 42.43 | 0.02 | 59.28 | 4.41 | 59.67 | 6.08 | 43.56 | 35.67 | 0.91 | 13.61 | 48.50 |
| | | σ | 0.49 | 0.00 | 0.38 | 1.29 | 0.33 | 0.15 | 4.93 | 16.67 | 1.19 | 8.88 | 1.91 |
| G2221 | 26 | μ | 21.78 | 0.01 | 71.47 | 3.15 | 35.16 | 7.54 | 25.05 | 49.25 | 3.01 | 10.91 | 38.08 |
| | | σ | 14.80 | 0.01 | 13.87 | 0.67 | 11.73 | 1.77 | 19.78 | 27.47 | 4.27 | 9.87 | 9.98 |
| G22221 | 18 | μ | 22.35 | 0.02 | 63.80 | 11.42 | 33.87 | 19.42 | 19.55 | 0.18 | 0.00 | 75.47 | 22.11 |
| | | σ | 9.26 | 0.00 | 8.84 | 5.32 | 2.52 | 3.58 | 13.82 | 0.76 | 0.00 | 16.58 | 6.34 |
| G22222 | 16 | μ | 20.17 | 0.01 | 60.82 | 10.41 | 34.50 | 26.21 | 9.39 | 61.23 | 4.80 | 14.24 | 27.63 |
| | | σ | 2.24 | 0.01 | 9.70 | 4.76 | 0.06 | 10.99 | 9.07 | 22.79 | 4.93 | 11.69 | 7.74 |
| G22311 | 26 | μ | 8.87 | 0.01 | 81.72 | 15.41 | 2.13 | 7.48 | 6.39 | 81.08 | 3.52 | 8.76 | 25.85 |
| | | σ | 7.00 | 0.01 | 7.55 | 7.65 | 0.38 | 2.82 | 6.88 | 10.03 | 3.20 | 4.67 | 7.20 |
| G22312 | 7 | μ | 22.24 | 0.01 | 69.33 | 5.91 | 2.59 | 8.45 | 22.45 | 46.28 | 8.04 | 20.63 | 18.57 |
| | | σ | 15.55 | 0.01 | 15.91 | 2.16 | 0.62 | 2.34 | 14.59 | 7.49 | 4.25 | 7.20 | 8.14 |
| G2232 | 15 | μ | 23.39 | 0.00 | 33.04 | 9.91 | 6.10 | 47.47 | 14.58 | 14.57 | 5.81 | 63.18 | 30.67 |
| | | σ | 16.23 | 0.00 | 14.84 | 4.47 | 2.77 | 20.31 | 18.94 | 22.35 | 8.65 | 30.09 | 5.69 |
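The group labels in these tables appear to encode the split hierarchy: each successive split appends one digit to the parent label (for example G2111 is divided into G21111 and G21112). Under that reading, which is an assumption inferred from the labels rather than a stated rule, observation counts of the terminal groups can be rolled up to any ancestor. The helper names `parent` and `roll_up` below are hypothetical, and the sketch covers only the G21 branch of the L100_URBAN table.

```python
from collections import defaultdict

def parent(label):
    """Return the parent label (one digit shorter), or None for a top-level group such as G2."""
    return label[:-1] if len(label) > 2 else None

def roll_up(leaf_counts):
    """Add each terminal group's observation count to the group itself and all of its ancestors."""
    totals = defaultdict(int)
    for label, n in leaf_counts.items():
        node = label
        while node is not None:
            totals[node] += n
            node = parent(node)
    return dict(totals)

# Terminal groups under G21 and their # OBS, taken from the L100_URBAN table
leaves = {
    "G21111": 17, "G21112": 8, "G2112": 21,
    "G2121": 28, "G21221": 98, "G21222": 7, "G2123": 125,
}
totals = roll_up(leaves)
print(totals["G2111"], totals["G211"], totals["G212"], totals["G21"])
```

The rolled-up counts for G2111, G211, G212 and G21 can then be compared with the # OBS reported for those groups in the earlier tables.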

GROUP STATISTICS IN COASTAL MARYLAND

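GROUPS AFTER ENVIRONMENTAL GRADIENTS

| GROUP | # OBS | | POOLQUAL | AVGWID | RIFFQUAL | IBI |
|---|---|---|---|---|---|---|
| G1 | 103 | μ | 9.68 | 3.32 | 5.34 | 3.05 |
| | | σ | 4.48 | 3.08 | 4.05 | 1.00 |
| G2 | 122 | μ | 13.94 | 5.99 | 13.17 | 3.84 |
| | | σ | 3.12 | 3.41 | 3.12 | 0.66 |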

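GROUPS AFTER pH

| GROUP | # OBS | | POOLQUAL | AVGWID | RIFFQUAL | PH | IBI |
|---|---|---|---|---|---|---|---|
| G11 | 94 | μ | 9.94 | 3.29 | 5.62 | 6.88 | 3.16 |
| | | σ | 4.35 | 3.03 | 4.03 | 0.42 | 0.94 |
| G12 | 9 | μ | 7.00 | 3.73 | 2.44 | 5.41 | 1.89 |
| | | σ | 5.20 | 3.67 | 3.09 | 0.57 | 0.88 |
| G2 | 122 | μ | 13.94 | 5.99 | 13.17 | 6.86 | 3.84 |
| | | σ | 3.12 | 3.41 | 3.12 | 0.47 | 0.66 |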

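GROUPS AFTER FORWET

| GROUP | # OBS | | POOLQUAL | AVGWID | RIFFQUAL | PH | FORWET | IBI |
|---|---|---|---|---|---|---|---|---|
| G111 | 52 | μ | 9.17 | 3.03 | 6.13 | 6.80 | 66.38 | 2.98 |
| | | σ | 4.37 | 2.65 | 4.28 | 0.46 | 13.34 | 0.96 |
| G112 | 42 | μ | 10.88 | 3.61 | 4.98 | 6.97 | 30.66 | 3.39 |
| | | σ | 4.19 | 3.46 | 3.65 | 0.35 | 10.40 | 0.87 |
| G12 | 9 | μ | 7.00 | 3.73 | 2.44 | 5.41 | 73.11 | 1.89 |
| | | σ | 5.20 | 3.67 | 3.09 | 0.57 | 10.11 | 0.88 |
| G21 | 105 | μ | 13.84 | 6.31 | 13.04 | 6.92 | 41.81 | 3.79 |
| | | σ | 3.26 | 3.51 | 3.17 | 0.38 | 9.98 | 0.67 |
| G22 | 17 | μ | 14.59 | 4.01 | 14.00 | 6.50 | 72.95 | 4.15 |
| | | σ | 1.91 | 1.77 | 2.78 | 0.73 | 6.12 | 0.56 |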

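GROUPS AFTER INSTRHAB

| GROUP | # OBS | | POOLQUAL | AVGWID | RIFFQUAL | PH | FORWET | INSTRHAB | IBI |
|---|---|---|---|---|---|---|---|---|---|
| G111 | 52 | μ | 9.17 | 3.03 | 6.13 | 6.80 | 66.38 | 54.29 | 2.98 |
| | | σ | 4.37 | 2.65 | 4.28 | 0.46 | 13.34 | 24.71 | 0.96 |
| G112 | 42 | μ | 10.88 | 3.61 | 4.98 | 6.97 | 30.66 | 55.08 | 3.39 |
| | | σ | 4.19 | 3.46 | 3.65 | 0.35 | 10.40 | 22.17 | 0.87 |
| G121 | 6 | μ | 4.33 | 4.50 | 2.67 | 5.44 | 69.97 | 36.13 | 1.46 |
| | | σ | 2.66 | 4.39 | 3.20 | 0.49 | 10.96 | 11.88 | 0.53 |
| G122 | 3 | μ | 12.33 | 2.20 | 2.00 | 5.36 | 79.40 | 85.50 | 2.75 |
| | | σ | 5.13 | 0.82 | 3.46 | 0.83 | 4.48 | 14.38 | 0.87 |
| G211 | 63 | μ | 15.06 | 5.56 | 13.81 | 6.91 | 40.14 | 78.49 | 3.91 |
| | | σ | 2.30 | 3.22 | 3.03 | 0.36 | 10.13 | 11.79 | 0.62 |
| G212 | 42 | μ | 12.00 | 7.44 | 11.88 | 6.94 | 44.32 | 39.23 | 3.62 |
| | | σ | 3.64 | 3.65 | 3.04 | 0.42 | 9.31 | 12.50 | 0.71 |
| G22 | 17 | μ | 14.59 | 4.01 | 14.00 | 6.50 | 72.95 | 71.08 | 4.15 |
| | | σ | 1.91 | 1.77 | 2.78 | 0.73 | 6.12 | 21.76 | 0.56 |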

GROUPS AFTER AESTHET

GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  PH  FORWET  INSTRHAB  AESTHET  IBI
G111  52  μ  9.17  3.03  6.13  6.80  66.38  54.29  14.08  2.98
G111  52  σ  4.37  2.65  4.28  0.46  13.34  24.71  4.55  0.96
G112  42  μ  10.88  3.61  4.98  6.97  30.66  55.08  11.88  3.39
G112  42  σ  4.19  3.46  3.65  0.35  10.40  22.17  5.02  0.87
G121  6  μ  4.33  4.50  2.67  5.44  69.97  36.13  15.17  1.46
G121  6  σ  2.66  4.39  3.20  0.49  10.96  11.88  3.06  0.53
G122  3  μ  12.33  2.20  2.00  5.36  79.40  85.50  11.00  2.75
G122  3  σ  5.13  0.82  3.46  0.83  4.48  14.38  3.61  0.87
G211  63  μ  15.06  5.56  13.81  6.91  40.14  78.49  14.76  3.91
G211  63  σ  2.30  3.22  3.03  0.36  10.13  11.79  3.88  0.62
G2121  21  μ  12.76  6.66  11.24  7.02  43.88  40.08  9.67  3.39
G2121  21  σ  3.48  3.79  3.08  0.46  8.73  12.06  3.60  0.66
G2122  21  μ  11.24  8.21  12.52  6.87  44.75  38.37  16.43  3.85
G2122  21  σ  3.73  3.43  2.93  0.37  10.05  13.16  1.29  0.69
G221  3  μ  14.33  4.85  15.33  6.32  72.93  72.89  6.00  3.50
G221  3  σ  3.06  4.03  2.08  0.31  2.56  17.09  1.00  0.50
G222  14  μ  14.64  3.83  13.71  6.54  72.95  70.70  15.57  4.29
G222  14  σ  1.74  1.07  2.89  0.79  6.72  23.18  2.85  0.48


GROUPS AFTER COND

GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  PH  FORWET  INSTRHAB  AESTHET  COND  IBI
G111  52  μ  9.17  3.03  6.13  6.80  66.38  54.29  14.08  165.13  2.98
G111  52  σ  4.37  2.65  4.28  0.46  13.34  24.71  4.55  82.47  0.96
G112  42  μ  10.88  3.61  4.98  6.97  30.66  55.08  11.88  204.79  3.39
G112  42  σ  4.19  3.46  3.65  0.35  10.40  22.17  5.02  114.84  0.87
G121  6  μ  4.33  4.50  2.67  5.44  69.97  36.13  15.17  90.00  1.46
G121  6  σ  2.66  4.39  3.20  0.49  10.96  11.88  3.06  23.89  0.53
G122  3  μ  12.33  2.20  2.00  5.36  79.40  85.50  11.00  112.67  2.75
G122  3  σ  5.13  0.82  3.46  0.83  4.48  14.38  3.61  76.56  0.87
G211  63  μ  15.06  5.56  13.81  6.91  40.14  78.49  14.76  192.08  3.91
G211  63  σ  2.30  3.22  3.03  0.36  10.13  11.79  3.88  111.78  0.62
G2121  21  μ  12.76  6.66  11.24  7.02  43.88  40.08  9.67  199.81  3.39
G2121  21  σ  3.48  3.79  3.08  0.46  8.73  12.06  3.60  98.32  0.66
G21221  12  μ  12.00  7.48  12.75  7.00  39.14  43.45  16.92  172.08  4.19
G21221  12  σ  3.93  3.93  2.63  0.38  8.38  6.99  1.08  26.98  0.65
G21222  9  μ  10.22  9.19  12.22  6.69  52.22  31.61  15.78  102.67  3.39
G21222  9  σ  3.38  2.50  3.42  0.28  6.77  16.61  1.30  11.94  0.45
G221  3  μ  14.33  4.85  15.33  6.32  72.93  72.89  6.00  107.33  3.50
G221  3  σ  3.06  4.03  2.08  0.31  2.56  17.09  1.00  6.11  0.50
G222  14  μ  14.64  3.83  13.71  6.54  72.95  70.70  15.57  80.57  4.29
G222  14  σ  1.74  1.07  2.89  0.79  6.72  23.18  2.85  38.97  0.48


GROUPS AFTER URBAN

GROUP  # OBS  STAT  POOLQUAL  AVGWID  RIFFQUAL  PH  FORWET  INSTRHAB  AESTHET  COND  URBAN  IBI
G111  52  μ  9.17  3.03  6.13  6.80  66.38  54.29  14.08  165.13  4.91  2.98
G111  52  σ  4.37  2.65  4.28  0.46  13.34  24.71  4.55  82.47  5.73  0.96
G112  42  μ  10.88  3.61  4.98  6.97  30.66  55.08  11.88  204.79  7.76  3.39
G112  42  σ  4.19  3.46  3.65  0.35  10.40  22.17  5.02  114.84  14.02  0.87
G121  6  μ  4.33  4.50  2.67  5.44  69.97  36.13  15.17  90.00  7.57  1.46
G121  6  σ  2.66  4.39  3.20  0.49  10.96  11.88  3.06  23.89  10.69  0.53
G122  3  μ  12.33  2.20  2.00  5.36  79.40  85.50  11.00  112.67  0.01  2.75
G122  3  σ  5.13  0.82  3.46  0.83  4.48  14.38  3.61  76.56  0.02  0.87
G211  63  μ  15.06  5.56  13.81  6.91  40.14  78.49  14.76  192.08  10.97  3.91
G211  63  σ  2.30  3.22  3.03  0.36  10.13  11.79  3.88  111.78  14.89  0.62
G2121  21  μ  12.76  6.66  11.24  7.02  43.88  40.08  9.67  199.81  19.53  3.39
G2121  21  σ  3.48  3.79  3.08  0.46  8.73  12.06  3.60  98.32  16.48  0.66
G212211  3  μ  17.50  5.53  14.00  6.98  50.39  44.29  17.50  156.00  23.66  3.13
G212211  3  σ  3.54  2.23  4.24  0.36  5.42  0.82  0.71  15.56  9.11  0.18
G212212  10  μ  10.90  7.87  12.50  7.00  36.89  43.28  16.80  175.30  1.05  4.40
G212212  10  σ  3.07  4.16  2.46  0.40  6.99  7.72  1.14  28.17  0.62  0.46
G21222  9  μ  10.22  9.19  12.22  6.69  52.22  31.61  15.78  102.67  2.90  3.39
G21222  9  σ  3.38  2.50  3.42  0.28  6.77  16.61  1.30  11.94  3.51  0.45
G221  3  μ  14.33  4.85  15.33  6.32  72.93  72.89  6.00  107.33  12.58  3.50
G221  3  σ  3.06  4.03  2.08  0.31  2.56  17.09  1.00  6.11  4.46  0.50
G222  14  μ  14.64  3.83  13.71  6.54  72.95  70.70  15.57  80.57  5.72  4.29
G222  14  σ  1.74  1.07  2.89  0.79  6.72  23.18  2.85  38.97  2.35  0.48


GROUP STATISTICS IN PIEDMONT MARYLAND

GROUPS AFTER ENVIRONMENTAL GRADIENTS

GROUP  # OBS  STAT  CH_FLOW  CHAN_ALT  URBAN  POOLQUAL  REMOTE  EMBEDDED  IBI
1  24  μ  82.03  12.36  0.64  9.97  56.25  70.00  3.25
1  24  σ  15.84  3.56  0.82  1.81  33.26  22.58  0.84
2  29  μ  69.52  8.07  59.36  12.52  14.01  71.23  1.80
2  29  σ  24.02  4.71  18.73  3.63  11.41  27.80  0.67
3  48  μ  79.81  6.79  4.96  13.79  48.94  58.56  3.59
3  48  σ  14.60  2.48  6.23  2.63  27.07  22.75  0.71
4  126  μ  86.13  13.44  2.68  15.41  66.32  77.61  3.93
4  126  σ  13.41  2.73  3.05  2.01  28.11  20.16  0.64
5  17  μ  82.71  14.29  36.89  14.76  45.59  71.24  2.71
5  17  σ  14.46  2.08  18.75  3.05  17.65  26.73  0.97

GROUPS AFTER GRAD

GROUP  # OBS  STAT  CH_FLOW  CHAN_ALT  URBAN  POOLQUAL  REMOTE  EMBEDDED  GRAD  IBI
1  24  μ  82.03  12.36  0.64  9.97  56.25  70.00  1.17  3.25
1  24  σ  15.84  3.56  0.82  1.81  33.26  22.58  0.52  0.84
21  20  μ  78.30  7.35  54.07  13.10  12.81  70.78  0.74  2.08
21  20  σ  19.39  4.38  18.20  3.95  11.38  26.26  0.43  0.61
22  9  μ  50.00  9.67  71.10  11.22  16.67  72.22  2.63  1.20
22  9  σ  22.50  5.29  14.69  2.54  11.69  32.63  0.64  0.28
3  48  μ  79.81  6.79  4.96  13.79  48.94  58.56  0.96  3.59
3  48  σ  14.60  2.48  6.23  2.63  27.07  22.75  0.68  0.71
4  126  μ  86.13  13.44  2.68  15.41  66.32  77.61  1.25  3.93
4  126  σ  13.41  2.73  3.05  2.01  28.11  20.16  1.24  0.64
5  17  μ  82.71  14.29  36.89  14.76  45.59  71.24  1.12  2.71
5  17  σ  14.46  2.08  18.75  3.05  17.65  26.73  1.02  0.97


GROUPS AFTER FORWET

GROUP  # OBS  STAT  CH_FLOW  CHAN_ALT  URBAN  POOLQUAL  REMOTE  EMBEDDED  GRAD  FORWET  IBI
1  24  μ  82.03  12.36  0.64  9.97  56.25  70.00  1.17  29.44  3.25
1  24  σ  15.84  3.56  0.82  1.81  33.26  22.58  0.52  14.72  0.84
211  7  μ  75.43  8.57  35.58  12.86  11.61  81.59  0.43  41.45  2.46
211  7  σ  27.17  4.58  10.26  4.10  11.08  10.86  0.38  5.66  0.65
212  13  μ  79.85  6.69  64.03  13.23  13.46  64.96  0.90  20.44  1.87
212  13  σ  14.79  4.31  12.84  4.02  11.93  30.46  0.36  5.72  0.50
22  9  μ  50.00  9.67  71.10  11.22  16.67  72.22  2.63  18.55  1.20
22  9  σ  22.50  5.29  14.69  2.54  11.69  32.63  0.64  8.83  0.28
31  38  μ  80.82  6.89  4.92  13.95  48.68  58.98  0.90  30.24  3.76
31  38  σ  13.74  2.72  6.60  2.54  25.71  23.90  0.63  9.07  0.64
32  9  μ  75.56  6.33  5.12  13.11  50.00  56.79  1.18  56.40  2.88
32  9  σ  18.10  0.87  4.63  3.06  33.95  18.17  0.86  2.91  0.50
41  116  μ  87.03  13.52  2.59  15.47  66.49  77.02  1.23  28.93  3.99
41  116  σ  12.46  2.76  3.08  1.97  28.59  20.37  1.28  8.94  0.58
42  10  μ  75.70  12.50  3.66  14.80  64.38  84.44  1.55  55.63  3.27
42  10  σ  19.53  2.27  2.62  2.39  22.83  16.93  0.71  3.71  0.89
5  17  μ  82.71  14.29  36.89  14.76  45.59  71.24  1.12  29.82  2.71
5  17  σ  14.46  2.08  18.75  3.05  17.65  26.73  1.02  11.27  0.97


GROUP STATISTICS IN HIGHLAND MARYLAND

GROUPS AFTER ENVIRONMENTAL GRADIENTS

GROUP  # OBS  STAT  EPISUB  AVGWID  IBI
1  153  μ  77.05  4.73  3.25
1  153  σ  17.26  2.29  1.12
2  111  μ  24.67  4.30  2.67
2  111  σ  12.82  2.49  1.07
3  32  μ  75.17  12.85  3.79
3  32  σ  21.21  2.94  0.90

GROUPS AFTER NUMROOT

GROUP  # OBS  STAT  EPISUB  AVGWID  NUMROOT  IBI
11  132  μ  76.64  4.63  0.56  3.16
11  132  σ  17.58  2.29  0.76  1.12
12  21  μ  79.63  5.37  4.33  3.80
12  21  σ  15.25  2.23  1.62  0.96
21  90  μ  24.88  3.96  0.20  2.51
21  90  σ  12.19  2.35  0.40  1.05
22  21  μ  23.81  5.74  2.76  3.38
22  21  σ  15.53  2.63  1.34  0.87
3  32  μ  75.17  12.85  1.72  3.79
3  32  σ  21.21  2.94  3.01  0.90


GROUPS AFTER GRAD

GROUP  # OBS  STAT  EPISUB  AVGWID  NUMROOT  GRAD  IBI
111  115  μ  75.56  4.79  0.61  1.57  3.24
111  115  σ  17.39  2.33  0.78  0.95  1.09
112  17  μ  83.99  3.55  0.24  6.92  2.61
112  17  σ  17.56  1.67  0.56  3.02  1.24
12  21  μ  79.63  5.37  4.33  0.93  3.80
12  21  σ  15.25  2.23  1.62  0.74  0.96
211  79  μ  24.61  3.92  0.23  1.22  2.60
211  79  σ  11.49  2.36  0.42  0.78  1.01
212  11  μ  26.77  4.30  0.00  5.18  1.83
212  11  σ  17.00  2.29  0.00  1.62  1.07
22  21  μ  23.81  5.74  2.76  0.99  3.38
22  21  σ  15.53  2.63  1.34  0.59  0.87
3  32  μ  75.17  12.85  1.72  0.75  3.79
3  32  σ  21.21  2.94  3.01  0.53  0.90

GROUPS AFTER SO4

GROUP  # OBS  STAT  EPISUB  AVGWID  NUMROOT  GRAD  SO4  IBI
111  115  μ  75.56  4.79  0.61  1.57  17.22  3.24
111  115  σ  17.39  2.33  0.78  0.95  16.22  1.09
112  17  μ  83.99  3.55  0.24  6.92  17.64  2.61
112  17  σ  17.56  1.67  0.56  3.02  14.32  1.24
12  21  μ  79.63  5.37  4.33  0.93  12.59  3.80
12  21  σ  15.25  2.23  1.62  0.74  7.69  0.96
211  79  μ  24.61  3.92  0.23  1.22  27.02  2.60
211  79  σ  11.49  2.36  0.42  0.78  47.47  1.01
212  11  μ  26.77  4.30  0.00  5.18  95.16  1.83
212  11  σ  17.00  2.29  0.00  1.62  146.69  1.07
22  21  μ  23.81  5.74  2.76  0.99  30.38  3.38
22  21  σ  15.53  2.63  1.34  0.59  61.80  0.87
31  3  μ  53.70  12.43  0.00  1.50  213.42  1.67
31  3  σ  8.49  5.08  0.00  1.00  166.63  0.43
32  29  μ  77.39  12.89  1.90  0.67  11.34  4.01
32  29  σ  20.94  2.77  3.11  0.42  4.11  0.59


GROUPS AFTER WOOD

GROUP  # OBS  STAT  EPISUB  AVGWID  NUMROOT  GRAD  SO4  WOOD  IBI
111  115  μ  75.56  4.79  0.61  1.57  17.22  1.28  3.24
111  115  σ  17.39  2.33  0.78  0.95  16.22  1.86  1.09
1121  12  μ  84.72  3.34  0.17  6.82  20.21  0.83  3.09
1121  12  σ  14.24  1.30  0.39  3.51  16.37  0.72  1.08
1122  5  μ  82.22  4.04  0.40  7.17  11.49  3.40  1.46
1122  5  σ  25.88  2.45  0.89  1.60  4.04  0.55  0.74
12  21  μ  79.63  5.37  4.33  0.93  12.59  5.10  3.80
12  21  σ  15.25  2.23  1.62  0.74  7.69  4.71  0.96
211  79  μ  24.61  3.92  0.23  1.22  27.02  1.91  2.60
211  79  σ  11.49  2.36  0.42  0.78  47.47  3.35  1.01
212  11  μ  26.77  4.30  0.00  5.18  95.16  0.36  1.83
212  11  σ  17.00  2.29  0.00  1.62  146.69  0.67  1.07
22  21  μ  23.81  5.74  2.76  0.99  30.38  2.90  3.38
22  21  σ  15.53  2.63  1.34  0.59  61.80  3.33  0.87
31  3  μ  53.70  12.43  0.00  1.50  213.42  2.00  1.67
31  3  σ  8.49  5.08  0.00  1.00  166.63  2.65  0.43
32  29  μ  77.39  12.89  1.90  0.67  11.34  2.48  4.01
32  29  σ  20.94  2.77  3.11  0.42  4.11  2.86  0.59


GROUPS AFTER AGRIBARR

GROUP  # OBS  STAT  EPISUB  AVGWID  NUMROOT  GRAD  SO4  WOOD  AGRIBARR  IBI
1111  82  μ  75.84  5.06  0.66  1.36  17.77  1.36  51.10  3.47
1111  82  σ  17.90  2.25  0.80  0.90  18.03  1.94  23.29  0.87
1112  32  μ  74.83  4.08  0.47  2.11  15.79  1.06  7.91  2.66
1112  32  σ  16.27  2.41  0.72  0.84  10.24  1.66  4.28  1.36
1121  12  μ  84.72  3.34  0.17  6.82  20.21  0.83  11.25  3.09
1121  12  σ  14.24  1.30  0.39  3.51  16.37  0.72  7.14  1.08
1122  5  μ  82.22  4.04  0.40  7.17  11.49  3.40  8.68  1.46
1122  5  σ  25.88  2.45  0.89  1.60  4.04  0.55  10.83  0.74
12  21  μ  79.63  5.37  4.33  0.93  12.59  5.10  67.10  3.80
12  21  σ  15.25  2.23  1.62  0.74  7.69  4.71  18.14  0.96
211  79  μ  24.61  3.92  0.23  1.22  27.02  1.91  38.88  2.60
211  79  σ  11.49  2.36  0.42  0.78  47.47  3.35  27.43  1.01
212  11  μ  26.77  4.30  0.00  5.18  95.16  0.36  14.39  1.83
212  11  σ  17.00  2.29  0.00  1.62  146.69  0.67  9.43  1.07
22  21  μ  23.81  5.74  2.76  0.99  30.38  2.90  51.21  3.38
22  21  σ  15.53  2.63  1.34  0.59  61.80  3.33  24.84  0.87
31  3  μ  53.70  12.43  0.00  1.50  213.42  2.00  18.80  1.67
31  3  σ  8.49  5.08  0.00  1.00  166.63  2.65  7.36  0.43
32  29  μ  77.39  12.89  1.90  0.67  11.34  2.48  42.30  4.01
32  29  σ  20.94  2.77  3.11  0.42  4.11  2.86  24.93  0.59


GROUPS AFTER CHAN

GROUP  # OBS  STAT  EPISUB  AVGWID  NUMROOT  GRAD  SO4  WOOD  AGRIBARR  CHAN  IBI
11111  38  μ  80.85  5.12  0.58  1.50  14.37  0.76  49.47  16.63  3.77
11111  38  σ  18.21  2.40  0.76  0.88  8.12  1.05  23.13  1.44  0.77
11112  45  μ  71.60  5.02  0.73  1.25  20.63  1.87  52.48  8.78  3.22
11112  45  σ  16.68  2.14  0.84  0.92  23.06  2.34  23.60  2.41  0.87
1112  32  μ  74.83  4.08  0.47  2.11  15.79  1.06  7.91  13.53  2.66
1112  32  σ  16.27  2.41  0.72  0.84  10.24  1.66  4.28  4.59  1.36
1121  12  μ  84.72  3.34  0.17  6.82  20.21  0.83  11.25  15.33  3.09
1121  12  σ  14.24  1.30  0.39  3.51  16.37  0.72  7.14  3.23  1.08
1122  5  μ  82.22  4.04  0.40  7.17  11.49  3.40  8.68  17.80  1.46
1122  5  σ  25.88  2.45  0.89  1.60  4.04  0.55  10.83  1.92  0.74
12  21  μ  79.63  5.37  4.33  0.93  12.59  5.10  67.10  10.62  3.80
12  21  σ  15.25  2.23  1.62  0.74  7.69  4.71  18.14  4.57  0.96
2111  46  μ  23.43  3.81  0.26  1.11  29.06  2.70  43.62  5.22  2.39
2111  46  σ  10.20  2.46  0.44  0.75  58.92  4.07  26.59  1.81  0.95
2112  33  μ  26.26  4.07  0.18  1.39  24.17  0.82  32.27  15.36  2.91
2112  33  σ  13.05  2.26  0.39  0.82  24.41  1.42  27.61  2.93  1.04
212  11  μ  26.77  4.30  0.00  5.18  95.16  0.36  14.39  14.73  1.83
212  11  σ  17.00  2.29  0.00  1.62  146.69  0.67  9.43  3.26  1.07
22  21  μ  23.81  5.74  2.76  0.99  30.38  2.90  51.21  11.38  3.38
22  21  σ  15.53  2.63  1.34  0.59  61.80  3.33  24.84  6.22  0.87
31  3  μ  53.70  12.43  0.00  1.50  213.42  2.00  18.80  9.33  1.67
31  3  σ  8.49  5.08  0.00  1.00  166.63  2.65  7.36  7.64  0.43
32  29  μ  77.39  12.89  1.90  0.67  11.34  2.48  42.30  13.93  4.01
32  29  σ  20.94  2.77  3.11  0.42  4.11  2.86  24.93  4.56  0.59


GROUPS AFTER EMBED

GROUP  # OBS  STAT  EPISUB  AVGWID  NUMROOT  GRAD  SO4  WOOD  AGRIBARR  CHAN  EMBED  IBI
111111  11  μ  70.20  4.36  0.45  1.30  13.44  0.73  67.13  16.09  37.73  3.18
111111  11  σ  17.26  2.80  0.69  1.30  7.18  1.01  17.38  1.14  10.09  1.02
111112  27  μ  85.19  5.43  0.63  1.58  14.75  0.78  42.27  16.85  17.85  4.01
111112  27  σ  17.02  2.20  0.79  0.66  8.57  1.09  21.44  1.51  7.18  0.50
11112  45  μ  71.60  5.02  0.73  1.25  20.63  1.87  52.48  8.78  32.78  3.22
11112  45  σ  16.68  2.14  0.84  0.92  23.06  2.34  23.60  2.41  15.76  0.87
1112  32  μ  74.83  4.08  0.47  2.11  15.79  1.06  7.91  13.53  29.63  2.66
1112  32  σ  16.27  2.41  0.72  0.84  10.24  1.66  4.28  4.59  24.24  1.36
1121  12  μ  84.72  3.34  0.17  6.82  20.21  0.83  11.25  15.33  29.58  3.09
1121  12  σ  14.24  1.30  0.39  3.51  16.37  0.72  7.14  3.23  16.85  1.08
1122  5  μ  82.22  4.04  0.40  7.17  11.49  3.40  8.68  17.80  33.00  1.46
1122  5  σ  25.88  2.45  0.89  1.60  4.04  0.55  10.83  1.92  27.75  0.74
12  21  μ  79.63  5.37  4.33  0.93  12.59  5.10  67.10  10.62  24.71  3.80
12  21  σ  15.25  2.23  1.62  0.74  7.69  4.71  18.14  4.57  15.65  0.96
2111  46  μ  23.43  3.81  0.26  1.11  29.06  2.70  43.62  5.22  51.52  2.39
2111  46  σ  10.20  2.46  0.44  0.75  58.92  4.07  26.59  1.81  28.34  0.95
21121  20  μ  32.50  4.63  0.15  1.34  18.25  0.65  27.65  15.20  15.50  3.34
21121  20  σ  10.86  2.09  0.37  0.80  14.80  0.88  25.43  2.61  15.89  0.93
21122  13  μ  16.67  3.22  0.23  1.46  33.28  1.08  39.38  15.62  69.62  2.23
21122  13  σ  10.14  2.32  0.44  0.88  33.07  2.02  30.31  3.48  17.73  0.83
212  11  μ  26.77  4.30  0.00  5.18  95.16  0.36  14.39  14.73  28.18  1.83
212  11  σ  17.00  2.29  0.00  1.62  146.69  0.67  9.43  3.26  29.69  1.07
22  21  μ  23.81  5.74  2.76  0.99  30.38  2.90  51.21  11.38  51.10  3.38
22  21  σ  15.53  2.63  1.34  0.59  61.80  3.33  24.84  6.22  22.83  0.87
31  3  μ  53.70  12.43  0.00  1.50  213.42  2.00  18.80  9.33  56.67  1.67
31  3  σ  8.49  5.08  0.00  1.00  166.63  2.65  7.36  7.64  22.55  0.43
32  29  μ  77.39  12.89  1.90  0.67  11.34  2.48  42.30  13.93  34.97  4.01
32  29  σ  20.94  2.77  3.11  0.42  4.11  2.86  24.93  4.56  18.97  0.59


GROUPS AFTER SHADE

GROUP  # OBS  STAT  EPISUB  AVGWID  NUMROOT  GRAD  SO4  WOOD  AGRIBARR  CHAN  EMBED  SHADE  IBI
1111111  8  μ  74.07  4.27  0.44  1.52  12.16  0.56  63.70  16.11  35.00  73.43  3.57
1111111  8  σ  16.20  2.89  0.73  1.34  6.97  0.73  17.40  1.27  5.00  16.49  0.61
1111112  3  μ  52.78  4.77  0.50  0.30  19.24  1.50  82.59  16.00  50.00  32.48  1.43
1111112  3  σ  11.79  3.34  0.71  0.00  6.76  2.12  3.90  0.00  21.21  3.23  0.20
111112  27  μ  85.19  5.43  0.63  1.58  14.75  0.78  42.27  16.85  17.85  73.78  4.01
111112  27  σ  17.02  2.20  0.79  0.66  8.57  1.09  21.44  1.51  7.18  19.05  0.50
11112  45  μ  71.60  5.02  0.73  1.25  20.63  1.87  52.48  8.78  32.78  57.81  3.22
11112  45  σ  16.68  2.14  0.84  0.92  23.06  2.34  23.60  2.41  15.76  26.90  0.87
1112  32  μ  74.83  4.08  0.47  2.11  15.79  1.06  7.91  13.53  29.63  75.79  2.66
1112  32  σ  16.27  2.41  0.72  0.84  10.24  1.66  4.28  4.59  24.24  20.14  1.36
1121  12  μ  84.72  3.34  0.17  6.82  20.21  0.83  11.25  15.33  29.58  89.44  3.09
1121  12  σ  14.24  1.30  0.39  3.51  16.37  0.72  7.14  3.23  16.85  13.97  1.08
1122  5  μ  82.22  4.04  0.40  7.17  11.49  3.40  8.68  17.80  33.00  92.84  1.46
1122  5  σ  25.88  2.45  0.89  1.60  4.04  0.55  10.83  1.92  27.75  8.68  0.74
12  21  μ  79.63  5.37  4.33  0.93  12.59  5.10  67.10  10.62  24.71  71.67  3.80
12  21  σ  15.25  2.23  1.62  0.74  7.69  4.71  18.14  4.57  15.65  16.87  0.96
2111  46  μ  23.43  3.81  0.26  1.11  29.06  2.70  43.62  5.22  51.52  58.11  2.39
2111  46  σ  10.20  2.46  0.44  0.75  58.92  4.07  26.59  1.81  28.34  31.78  0.95
211211  12  μ  35.65  5.29  0.17  1.28  17.12  0.58  26.53  15.25  14.58  43.06  3.79
211211  12  σ  12.64  2.12  0.39  0.86  8.06  1.00  20.46  2.56  16.85  16.28  0.64
211212  8  μ  27.78  3.64  0.13  1.42  19.96  0.75  29.33  15.13  16.88  89.74  2.68
211212  8  σ  5.14  1.70  0.35  0.73  22.06  0.71  33.06  2.85  15.34  9.36  0.95
21122  13  μ  16.67  3.22  0.23  1.46  33.28  1.08  39.38  15.62  69.62  54.72  2.23
21122  13  σ  10.14  2.32  0.44  0.88  33.07  2.02  30.31  3.48  17.73  28.11  0.83
212  11  μ  26.77  4.30  0.00  5.18  95.16  0.36  14.39  14.73  28.18  78.50  1.83
212  11  σ  17.00  2.29  0.00  1.62  146.69  0.67  9.43  3.26  29.69  15.76  1.07
22  21  μ  23.81  5.74  2.76  0.99  30.38  2.90  51.21  11.38  51.10  72.43  3.38
22  21  σ  15.53  2.63  1.34  0.59  61.80  3.33  24.84  6.22  22.83  21.44  0.87
31  3  μ  53.70  12.43  0.00  1.50  213.42  2.00  18.80  9.33  56.67  29.28  1.67
31  3  σ  8.49  5.08  0.00  1.00  166.63  2.65  7.36  7.64  22.55  16.89  0.43
32  29  μ  77.39  12.89  1.90  0.67  11.34  2.48  42.30  13.93  34.97  62.05  4.01
32  29  σ  20.94  2.77  3.11  0.42  4.11  2.86  24.93  4.56  18.97  17.88  0.59
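
Each group table in this appendix lists, per group of sites, the number of observations together with the mean (μ) and standard deviation (σ) of every variable. A minimal MATLAB sketch of how such a summary could be computed is given here. The group labels and IBI values are invented for illustration and are not taken from the Maryland data, and the sample standard deviation is an assumption, since the tables do not state which normalization was used.

```matlab
% Hypothetical data: group label and IBI score for six sites (illustrative only)
group = [1; 1; 1; 2; 2; 2];
ibi   = [3.0; 3.5; 4.0; 1.5; 2.0; 2.5];

% Per-group number of observations, mean and sample standard deviation
n_obs = accumarray(group, 1);
mu    = accumarray(group, ibi, [], @mean);
sigma = accumarray(group, ibi, [], @std);

% One row per group: GROUP, # OBS, mean, standard deviation
summary = [unique(group) n_obs mu sigma];
disp(summary)
```

Repeating the `accumarray` calls for each variable column would give one μ row and one σ row per group, in the layout used by the tables.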

Appendix II: computer code


Code for the Self-Organizing Maps and the raw and neuron-based correlation matrices
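
The listing that follows repeatedly averages a variable over the sites assigned to each SOM neuron and then fills neurons that received no sites with the mean of their neighboring neurons. A minimal standalone sketch of that gap-filling step is shown here. It assumes a hypothetical chain of four neurons and a hand-written adjacency matrix in place of the SOM Toolbox neighborhood matrix used in the listing, and it relies only on base MATLAB functions.

```matlab
% Hypothetical example: four neurons in a chain (1-2-3-4); neuron 2 has no sites
neuron_mean = [2; NaN; 4; 6];      % per-neuron mean of some variable
adjacency   = [0 1 0 0;            % adjacency(i,j) = 1 if neurons i and j are neighbors
               1 0 1 0;
               0 1 0 1;
               0 0 1 0];

filled = neuron_mean;
for k = find(isnan(neuron_mean))'
    nb = neuron_mean(adjacency(k,:) == 1);   % values of the neighboring neurons
    nb = nb(~isnan(nb));                     % ignore neighbors that are empty too
    if ~isempty(nb)
        filled(k) = mean(nb);                % replace the gap by the neighbor mean
    end
end
disp(filled')
```

As in the listing, the neighbor means are taken from the values before filling, so the order in which empty neurons are visited does not affect the result.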

clear all
close all
clc
fig_handle = [];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Read the datasets (in .csv format)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Database = readtexttocells('REGLU AND FRAGMENT.csv');
fields = Database(1,:);
warning off MATLAB:divideByZero
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Forming the environmental variable matrix - input to the SOM
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
MTC = Database(:,[find(strcmp(fields,'RDA_WATER')):find(strcmp(fields,'AREA'))]);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Creating the struct for SOM after normalizing the input metric data
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sD1 = som_data_struct(str2double(MTC(2:end,:)),'comp_names',MTC(1,:),'labels',...
    Database(2:end,find(strcmp(fields,'IDX'))));
sD2 = som_normalize(sD1,'log');
sD2 = som_normalize(sD2,'range');
clc
% Finding the optimal # of SOM map units based on the quantization and
% topographic errors
qea = [];
tea = [];
for m = 10:5:100
    clear sM
    sM = som_make(sD2,'munits',m,'algorithm','seq');
    [qe,te] = som_quality(sM, sD2);
    qea = [qea qe];
    tea = [tea te];
end
m = 10:5:100;
fig_handle(end+1) = figure;
gca;
[AX,H1,H2] = plotyy(m,qea,m,tea);
set(AX(1),'Ycolor','k')
set(AX(2),'Ycolor','k')
set(get(AX(1),'Ylabel'),'String','Quantization error')
set(get(AX(2),'Ylabel'),'String','Topographic error')
set(H1,'LineStyle','-.')
set(H2,'LineStyle','-')
xlabel('No of map units')


title('Finding optimal no of map units')
legend([H1 H2],'Quantization error','Topographic error')
grid
set(gca,'xtick',[0:10:100])
saveas(gcf,'No_neurons.fig')
saveas(gcf,'No_neurons.jpg')
clear AX H1 H2
clc
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SOM training after selecting the number of map units
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
mu = input('Enter optimal no of map units : ');
close(gcf);
sM = som_make(sD2,'munits',mu,'algorithm','seq','name','','training',[20 100]);
[qe,te] = som_quality(sM, sD2);
SOM_cells = prod(sM.topol.msize);
[tempX, tempY] = meshgrid(1:sM.topol.msize(2),1:sM.topol.msize(1));
L1 = (flipud(tempY)-1)*sM.topol.msize(2)+tempX;
L1 = L1(:);
clear tempX tempY
clc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% U matrix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fig_handle(end+1) = figure;
som_show(sM,'umat',[])
hold on
som_cplane('hexa',sM.topol.msize,'none');
som_show_add('label',cellstr(int2str(L1)),'Textsize',8);
colormap(1-gray);
som_recolorbar
saveas(gcf,'U_matrix.fig')
saveas(gcf,'U_matrix.jpg')
close(gcf);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% k means clustering of the SOM neurons
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[c, p, err, ind] = kmeans_clusters(sM,[],100); % find clusterings
fig_handle(end+1) = figure;
set(gcf,'Color',[1 1 1]);
set(gca,'XColor',[0 0 0],'YColor',[0 0 0])
hold on
plot(ind,'k')
xlabel('No of clusters');
ylabel('Davies - Bouldin index');
title('Optimal no of clusters','Color',[0 0 0]);
grid;
saveas(gcf,'No_clusters.fig')
saveas(gcf,'No_clusters.jpg')


%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Setting number of desired clusters and sorting the cluster labels starting
% from the lowest at the bottom of the SOM map
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
no_clusters = input('Enter no. of clusters : ');
close(gcf);
temp = sortrows([L1 p{no_clusters}],[2 1]);
lookup = sort(temp([0; find(diff(temp(:,2))==1)]+1,:),1);
clear c1
for id = 1:no_clusters
    c1(temp(temp(:,2)==temp(find(temp(:,1)==lookup(id,1)),2)),:) = lookup(id,2);
end
Cluster_label = c1(L1);
clear c1
Color_map = jet(64);
Color_map = Color_map(ceil(linspace(1,55,no_clusters))',:);
SOMcolors = (repmat(Cluster_label,[1, no_clusters]) == repmat([1:no_clusters],[length(Cluster_label),1]));
SOMcolors = (linspace(0,1,no_clusters) * SOMcolors')';
fig_handle(end+1) = figure;
som_show(sM,'empty',sprintf('%d clusters',no_clusters))
hold on
som_cplane('hexa',sM.topol.msize,SOMcolors);
som_show_add('label',cellstr(int2str(L1)),'Textsize',8);
colormap(Color_map);
h = colorbar;
set(h,'YTick',linspace(min(get(h,'YTick')),max(get(h,'YTick')),no_clusters),...
    'YTickLabel',[1:no_clusters])
sM = som_label(sM,'clear','all');
sM = som_autolabel(sM,sD2);
saveas(gcf,'SOM_neurons.fig')
saveas(gcf,'SOM_neurons.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Forming the matrices based on neuron site distribution
% 1) Habitat index
% 2) Environmental variables
% 3) Fish metrics
% 4) Indices of integrity i.e. IBI/ICI
% 5) Fish counts
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[tf loc]= ismember(sM.labels,Database(2:end,find(strcmp(fields,'IDX'))));
Ne1 = som_unit_neighs(sM);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% HABITAT INDEX MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Data rows only (2:end skips the header row), so that loc, which indexes
% Database(2:end,IDX), points at the matching QHEI values as in the blocks below
QHEI_data = str2double(Database(2:end,find(strcmp(fields,'QHEI'))));
var_cluster = nan(size(sM.labels));
var_cluster(loc~=0) = (QHEI_data(loc(loc~=0)));
QHEI_SOM = nanmean(var_cluster')';


if length(find(isnan(QHEI_SOM)))>0
    Coord = find(isnan(QHEI_SOM))';
    Ne2 = Ne1(Coord,:);
    b=repmat(nan,size(Ne2,1),6);
    ix=find(Ne2);
    [dum,iy]=find(Ne2);
    ix=rem(ix-1,numel(b))+1;
    b(ix)=iy;
    b=sort(b,2);
    c = repmat(nan,size(b));
    c(~isnan(b)) = QHEI_SOM(b(~isnan(b)));
    QHEI_SOM(isnan(QHEI_SOM)) = nanmean(c')';
    clear b c
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ENVIRONMENTAL VARIABLES MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
index = [find(strcmp(fields,'RDA_WATER')):find(strcmp(fields,'AREA'))];
Env_var = fields(index);
No_env = length(index);
ENV_MTX = [];
for var_no = 1:No_env
    index = find(strcmp(fields,Env_var(var_no)));
    var_data = str2double(Database(2:end,index));
    var_cluster = repmat(nan,size(sM.labels));
    var_cluster(loc~=0) = var_data(loc(loc~=0));
    Env_SOM = nanmean(var_cluster')';
    if length(find(isnan(Env_SOM)))>0
        Coord = find(isnan(Env_SOM))';
        Ne2 = Ne1(Coord,:);
        b=repmat(nan,size(Ne2,1),6);
        ix=find(Ne2);
        [dum,iy]=find(Ne2);
        ix=rem(ix-1,numel(b))+1;
        b(ix)=iy;
        b=sort(b,2);
        c = repmat(nan,size(b));
        c(~isnan(b)) = Env_SOM(b(~isnan(b)));
        Env_SOM(isnan(Env_SOM)) = nanmean(c')';
        clear c b
    end
    ENV_MTX = [ENV_MTX Env_SOM];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FISH METRICS MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
index = [find(strcmp(fields,'SPSCORE')):find(strcmp(fields,'SPWNSCORE'))];
Fish_var = fields(index);
No_fish = length(index);
FISH_MTX = [];
for var_no = 1:No_fish
    index = find(strcmp(fields,Fish_var(var_no)));
    var_data = str2double(Database(2:end,index));
    var_data = log(var_data+1);
    var_cluster = repmat(nan,size(sM.labels));
    var_cluster(loc~=0) = var_data(loc(loc~=0));


    Fish_SOM = nanmean(var_cluster')';
    if length(find(isnan(Fish_SOM)))>0
        Coord = find(isnan(Fish_SOM))';
        Ne2 = Ne1(Coord,:);
        b=repmat(nan,size(Ne2,1),6);
        ix=find(Ne2);
        [dum,iy]=find(Ne2);
        ix=rem(ix-1,numel(b))+1;
        b(ix)=iy;
        b=sort(b,2);
        c = repmat(nan,size(b));
        c(~isnan(b)) = Fish_SOM(b(~isnan(b)));
        Fish_SOM(isnan(Fish_SOM)) = nanmean(c')';
        clear c b
    end
    FISH_MTX = [FISH_MTX Fish_SOM];
end
FISH_MTX = round(exp(FISH_MTX)-1);
fish_removed = find(sum(FISH_MTX)==0);
Fish_var(find(sum(FISH_MTX)==0))=[];
FISH_MTX(:,find(sum(FISH_MTX)==0))=[];
FISH_MTX(find(sum(FISH_MTX,2)==0),:) = eps;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% BIOTIC INDICES MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
index = [find(strcmp(fields,'IBI')) find(strcmp(fields,'ICI'))];
Indices_var = fields(index);
No_indices = length(index);
INDICES_MTX = [];
for var_no = 1:No_indices
    index = find(strcmp(fields,Indices_var(var_no)));
    var_data = str2double(Database(2:end,index));
    var_cluster = repmat(nan,size(sM.labels));
    var_cluster(loc~=0) = var_data(loc(loc~=0));
    Indices_SOM = nanmean(var_cluster')';
    if length(find(isnan(Indices_SOM)))>0
        Coord = find(isnan(Indices_SOM))';
        Ne2 = Ne1(Coord,:);
        b=repmat(nan,size(Ne2,1),6);
        ix=find(Ne2);
        [dum,iy]=find(Ne2);
        ix=rem(ix-1,numel(b))+1;
        b(ix)=iy;
        b=sort(b,2);
        c = repmat(nan,size(b));
        c(~isnan(b)) = Indices_SOM(b(~isnan(b)));
        Indices_SOM(isnan(Indices_SOM)) = nanmean(c')';
        clear c b
    end
    INDICES_MTX = [INDICES_MTX Indices_SOM];
end


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FISH COUNTS MATRIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
index = [find(strcmp(fields,'NUMINDSP')):find(strcmp(fields,'NUMSPAWN'))];
Count_var = fields(index);
No_counts = length(index);
FISHCOUNTS_MTX = [];
for var_no = 1:No_counts
    index = find(strcmp(fields,Count_var(var_no)));
    var_data = str2double(Database(2:end,index));
    var_cluster = repmat(nan,size(sM.labels));
    var_cluster(loc~=0) = var_data(loc(loc~=0));
    Counts_SOM = nanmean(var_cluster')';
    if length(find(isnan(Counts_SOM)))>0
        Coord = find(isnan(Counts_SOM))';
        Ne2 = Ne1(Coord,:);
        b=repmat(nan,size(Ne2,1),6);
        ix=find(Ne2);
        [dum,iy]=find(Ne2);
        ix=rem(ix-1,numel(b))+1;
        b(ix)=iy;
        b=sort(b,2);
        c = repmat(nan,size(b));
        c(~isnan(b)) = Counts_SOM(b(~isnan(b)));
        Counts_SOM(isnan(Counts_SOM)) = nanmean(c')';
        clear c b
    end
    FISHCOUNTS_MTX = [FISHCOUNTS_MTX Counts_SOM];
end
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SPATIAL DISTRIBUTION OF THE CLUSTERS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Cluster_symbol = 'x^o+*.+';
sM = som_label(sM,'clear','all');
sM = som_autolabel(sM,sD2);
L = sM.labels';
L = L(:);
L(cellfun('isempty',L))=[];
[tf loc]= ismember(L,Database(2:end,find(strcmp(fields,'IDX'))));
% Reading the latitude and longitudes from the dataset
lat = str2double(Database(2:end,find(strcmp(fields,'LAT'))));
lat_site = lat(loc);
long = str2double(Database(2:end,find(strcmp(fields,'LONG'))));
long_site = long(loc);
% Calculate the # of sampling sites in each SOM neuron
hits = som_hits(sM,sD2);
hits_idx=hits>0;
temp_hits=hits(hits_idx);
SOM_color_map = [];
SOM_label = [];
Cluster_id = 1:length(unique(Cluster_label));
temp_cluster_label=Cluster_label(hits_idx);
Site_label=zeros(sum(temp_hits),1);


Site_label([1; 1+cumsum(temp_hits(1:end-1))])=[temp_cluster_label(1); diff(temp_cluster_label)];
Site_label = cumsum(Site_label);
clear temp_hits temp_cluster_label
Site_selected = find(ismember(Site_label,Cluster_id));
fig_handle(end+1) = figure;
gscatter(long_site(Site_selected),lat_site(Site_selected),Site_label(Site_selected),...
    Color_map(Cluster_id,:),Cluster_symbol(Cluster_id),[],0)
hold on
xlabel('Longitude');
ylabel('Latitude');
legend(cellstr([repmat('Cluster ',length(Cluster_id),1) num2str(Cluster_id')])','Location','Best')
title('Clustered Spatial representation of sites');
box on;
saveas(gcf,'Lat_longdist.fig')
saveas(gcf,'Lat_longdist.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% EXPORT THE CLUSTERED SITE DATA TO EXCEL
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
site_cluster=[lat_site(Site_selected),long_site(Site_selected),Site_label(Site_selected)];
xlswrite('Site_cluster.xls',site_cluster,'Site_cluster');
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CREATING THE HABITAT INDEX CLUSTER DISTRIBUTION FIGURE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Cluster_ids = ones(prod(sM.topol.msize),1);
for idx = 1:no_clusters
    Cluster_ids = [Cluster_ids ~cellfun('isempty',regexp(cellstr(num2str(Cluster_label)),cellstr(num2str(idx))))];
end
Cluster_ids(Cluster_ids==0) = nan;
QHEIVar_label = [];
QHEI_diff = [];
MSE = [];
notch = 1;
scale = ~isnan(Cluster_ids(:,2:end))*flipud(linspace(0.5,1,no_clusters)');
% SOM visualization and Clustered Boxplots for Habitat Index
f = figure;
som_show(sM,'empty','','subplots',[1 2])
hold on
som_cplane('hexa',sM.topol.msize,QHEI_SOM,scale);
som_show_add('label',cellstr(int2str(L1)),'Textsize',6);
set(gca,'Position',[0.05 0.1 0.35 0.9])
colormap(flipud(jet))
h = colorbar;
set(h,'Position', [0.43 0.23 0.025 0.64],'Fontsize',8)
subplot(122)
boxplot(repmat(QHEI_SOM,1,size(Cluster_ids,2)).* Cluster_ids,notch)
set(gca,'XTicklabel',[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])


set(gca,'FontSize',8,'Position', [0.6 0.1 0.35 0.8])
xticklabel_rotate([],90,[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
set(gca,'YGrid','on');
ylabel('');
xlabel('');
h = title('SOM visualization and Clustered Boxplots for QHEI');
set(h,'Position',get(h,'Position')-[0.75 0 0],'FontSize',12)
saveas(gcf,'Habitat_index_dist.fig')
saveas(gcf,'Habitat_index_dist.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CREATING THE ENVIRONMENTAL VARIABLES CLUSTER DISTRIBUTION FIGURES
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sM1 = som_denormalize(sM);
notch = 1;
Metric_names = 1:size(ENV_MTX,2);
y = 4;
x = 4;
fig_handle(end+1) = figure;
% METRICS 1 TO 8
for var_no = 1:8
    h1 = subplot(x,y,((var_no-1)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 1:8
    subplot(x,y,((var_no-1)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 1 to 8.jpg')
saveas(gcf,'Metrics 1 to 8.fig')
close(gcf);
%%
% METRICS 9 TO 16
fig_handle(end+1) = figure;
for var_no = 9:16
    h1 = subplot(x,y,((var_no-9)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));


    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 9:16
    subplot(x,y,((var_no-9)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 9 to 16.jpg')
saveas(gcf,'Metrics 9 to 16.fig')
close(gcf);
%%
% METRICS 17 TO 24
fig_handle(end+1) = figure;
for var_no = 17:24
    h1 = subplot(x,y,((var_no-17)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 17:24
    subplot(x,y,((var_no-17)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 17 to 24 .jpg')
saveas(gcf,'Metrics 17 to 24.fig')
close(gcf);
%%
% METRICS 25 TO 32
fig_handle(end+1) = figure;
for var_no = 25:32
    h1 = subplot(x,y,((var_no-25)*2)+1);
    temp_pos = get(h1,'Position');


    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 25:32
    subplot(x,y,((var_no-25)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 25 to 32 .jpg')
saveas(gcf,'Metrics 25 to 32.fig')
close(gcf);
%%
% METRICS 33 AND 34
fig_handle(end+1) = figure;
for var_no = 33:34
    h1 = subplot(x,y,((var_no-33)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,ENV_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(Env_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Environmental variables')
for var_no = 33:34
    subplot(x,y,((var_no-33)*2)+2)
    boxplot(repmat(ENV_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel('');
    xlabel('');
end
saveas(gcf,'Metrics 33 to 34 .jpg')
saveas(gcf,'Metrics 33 to 34.fig')
close(gcf);

%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%CLUSTER DISTRIBUTION OF THE FISH COUNTS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Metric_names = 1:size(FISHCOUNTS_MTX,2);
y = 2;
x = 4;
fig_handle(end+1) = figure;
%FISH COUNTS FOR METRICS 1 TO 4
for var_no = 1:4
    h1 = subplot(x,y,((var_no-1)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,FISHCOUNTS_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(fields(52+var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Fish Counts')
for var_no = 1:4
    subplot(x,y,((var_no-1)*2)+2)
    boxplot(repmat(FISHCOUNTS_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel(''); xlabel('');
end
saveas(gcf,'Fishcounts 1 to 4.jpg')
saveas(gcf,'Fishcounts 1 to 4.fig')
close(gcf);
%%
%FISH COUNTS FOR METRICS 5 TO 8
for var_no = 5:8
    h1 = subplot(x,y,((var_no-5)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,FISHCOUNTS_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(fields(52+var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Fish Counts')
for var_no = 5:8
    subplot(x,y,((var_no-5)*2)+2)

    boxplot(repmat(FISHCOUNTS_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel(''); xlabel('');
end
saveas(gcf,'Fishcounts 5 to 8.jpg')
saveas(gcf,'Fishcounts 5 to 8.fig')
close(gcf);
%%
%FISH COUNTS FOR METRICS NINE TO ELEVEN
for var_no = 9:11
    h1 = subplot(x,y,((var_no-9)*2)+1);
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1]);
    h = som_cplane('hexa',sM.topol.msize,FISHCOUNTS_MTX(:,var_no));
    set(h,'EdgeColor','none')
    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.005 -0.01 0.005 0.0325], 'Fontsize',6)
    title(fields(52+var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization and clustered boxplots for Fish Counts')
for var_no = 9:11
    subplot(x,y,((var_no-9)*2)+2)
    boxplot(repmat(FISHCOUNTS_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel(''); xlabel('');
end
saveas(gcf,'Fishcounts 9 to 11.jpg')
saveas(gcf,'Fishcounts 9 to 11.fig')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FISH METRICS CLUSTER DISTRIBUTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sM1 = som_denormalize(sM);
notch = 1;
% SOM FOR THE DIFFERENT FISH METRICS
Fish_metrics = 1:size(FISH_MTX,2);
y = ceil(sqrt(length(Fish_metrics)));
x = ceil(length(Fish_metrics)/y);
fig_handle(end+1) = figure;
for var_no = Fish_metrics
    h1 = subplot(x,y,find(Fish_metrics==var_no));
    temp_pos = get(h1,'Position');
    set(h1,'Position',[temp_pos(1:2) 0.09 0.1])
    h = som_cplane('hexa',sM.topol.msize,FISH_MTX(:,var_no));
    set(h,'EdgeColor','none')

    h = colorbar;
    set(h,'Position',get(h,'Position')+[0.012 -0.008 0.003 0.015])
    title(Fish_var(var_no),'Interpreter','none','Fontsize',7,'Position',[4 0])
end
set(findobj(gcf,'Tag','Colorbar'),'FontSize',6)
suptitle_withpatch('SOM visualization for Fish metrics')
saveas(gcf,'SOM_fishmetrics1.fig')
saveas(gcf,'SOM_fishmetrics1.jpg')
close(gcf);
%%
% BOXPLOTS FOR THE DIFFERENT FISH METRICS
y = ceil(sqrt(length(Fish_metrics)));
x = ceil(length(Fish_metrics)/y);
fig_handle(end+1) = figure;
for var_no = Fish_metrics
    subplot(x,y,find(Fish_metrics==var_no))
    boxplot(repmat(FISH_MTX(:,var_no),1,size(Cluster_ids,2)).* Cluster_ids,notch)
    title(Fish_var(var_no),'Interpreter','none','Fontsize',7)
    set(gca,'XTicklabel',[cellstr('Overall') cellstr([num2str((1:no_clusters)')])'])
    set(gca,'FontSize',6)
    ylabel(''); xlabel('');
end
suptitle_withpatch('Clustered Boxplots for Fish Metrics')
% NOTE: same file names as the SOM figure saved above, so these calls overwrite it
saveas(gcf,'SOM_fishmetrics1.jpg')
saveas(gcf,'SOM_fishmetrics1.fig')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CLUSTER DISTRIBUTION OF INDICES OF BIOTIC INTEGRITY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% BIOTIC INDEX #1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fig_handle(end+1) = figure;
som_show(sM,'empty','','subplots',[1 2])
hold on
som_cplane('hexa',sM.topol.msize,INDICES_MTX(:,1),scale);
som_show_add('label',cellstr(int2str(L1)),'Textsize',6);
set(gca,'Position',[0.05 0.1 0.35 0.9])
colormap(flipud(jet));
h = colorbar;
set(h,'Position', [0.43 0.23 0.025 0.64],'Fontsize',8)
subplot(122)
boxplot(repmat(INDICES_MTX(:,1),1,size(Cluster_ids,2)).* Cluster_ids,notch)
set(gca,'XTicklabel',[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
set(gca,'FontSize',8,'Position', [0.6 0.1 0.35 0.8])
set(gca,'YGrid','on');
ylabel(''); xlabel('');
h = title('SOM visualization and Clustered Boxplots for Biotic index 1');
set(h,'Position',get(h,'Position')-[0.75 0 0],'FontSize',12)
xticklabel_rotate([],90,[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])

saveas(gcf,'BIOINDEX1_dist.fig')
saveas(gcf,'BIOINDEX1_dist.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% BIOTIC INDEX #2
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fig_handle(end+1) = figure;
som_show(sM,'empty','','subplots',[1 2])
hold on
som_cplane('hexa',sM.topol.msize,INDICES_MTX(:,2),scale);
som_show_add('label',cellstr(int2str(L1)),'Textsize',6);
set(gca,'Position',[0.05 0.1 0.35 0.9])
colormap(flipud(jet));
h = colorbar;
set(h,'Position', [0.43 0.23 0.025 0.64],'Fontsize',8)
subplot(122)
boxplot(repmat(INDICES_MTX(:,2),1,size(Cluster_ids,2)).* Cluster_ids,notch)
set(gca,'XTicklabel',[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
set(gca,'FontSize',8,'Position', [0.6 0.1 0.35 0.8])
set(gca,'YGrid','on');
ylabel(''); xlabel('');
h = title('SOM visualization and Clustered Boxplots for Biotic index 2');
set(h,'Position',get(h,'Position')-[0.75 0 0],'FontSize',12)
xticklabel_rotate([],90,[cellstr('Overall') cellstr([repmat('Cluster ',no_clusters,1) num2str((1:no_clusters)')])'])
saveas(gcf,'BIOINDEX2_dist.fig')
saveas(gcf,'BIOINDEX2_dist.jpg')
close(gcf);
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ANALYSIS BASED ON THE SOM (MAX-MIN METRICS AND ENVIRONMENTAL VARIABLES IN NEURONS)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
colors = (repmat(Cluster_label,[1, no_clusters]) == repmat([1:no_clusters],[length(Cluster_label),1]));
colors = (linspace(0.4,1,no_clusters) * colors')';
% Forming the per-cluster median for the Environmental variables
t1 = repmat(ENV_MTX,[1 1 no_clusters]);
t2 = repmat(reshape(Cluster_ids(:,2:end),[prod(sM.topol.msize) 1 no_clusters]),[1 No_env 1]);
Env_median = reshape(nanmedian(t1 .* t2,1),[No_env no_clusters]);
clear t1 t2
%Maximal and minimal median values of the Environmental variables
[Env_max Envmaxidx] = max(Env_median');
[Env_max Envmaxidx] = max(ENV_MTX.*...
    (repmat(Cluster_label,[1,length(Envmaxidx)]) == repmat(Envmaxidx,[length(Cluster_label),1])));
[Env_min Envminidx] = min(Env_median');
H2 = double(repmat(Cluster_label,[1,length(Envminidx)]) == repmat(Envminidx,[length(Cluster_label),1]));
H2(H2==0) = nan;

[Env_min Envminidx] = nanmin(ENV_MTX.*H2);
clear H2
sM = som_label(sM,'clear','all');
sM = som_label(sM,'add',[1:prod(sM.topol.msize)],cellstr(int2str(L1)));
sM = som_label(sM,'add',Envmaxidx,Env_var');
fig_handle(end+1) = figure;
som_show(sM,'empty','Maximal Environmental variables','empty','Minimal Environmental variables','subplots',[1 2])
subplot(121)
hold on
som_cplane('hexa',sM.topol.msize,colors);
colormap((1-0.3*gray(no_clusters)));
hold on
h = som_show_add('label',sM,'Textsize',6,'subplot',1);
set(h,'Interpreter','none')
sM = som_label(sM,'clear','all');
sM = som_label(sM,'add',[1:prod(sM.topol.msize)],cellstr(int2str(L1)));
sM = som_label(sM,'add',Envminidx,Env_var');
subplot(122)
hold on
som_cplane('hexa',sM.topol.msize,colors);
colormap((1-0.3*gray(no_clusters)));
hold on
h = som_show_add('label',sM,'Textsize',6,'subplot',2);
set(h,'Interpreter','none')
saveas(gcf,'Maxmin_envvar.fig')
saveas(gcf,'Maxmin_envvar.jpg')
close(gcf);
clc;
sM = som_label(sM,'clear','all');
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CORRELATION MATRIX OF THE RAW DATA
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
X1 = corrcoef([ENV_MTX INDICES_MTX]);
X2 = [Env_var,'IBI']
figure;imagesc(abs(X1))
set(gca,'XTick',1:size(X2,2),'XTickLabel',X2,'FontSize',6)
set(gca,'YTick',1:size(X2,2),'YTickLabel', X2','FontSize',6)
title('Correlation Matrix','FontSize',10)
X3 = sign(X1);
[ir,ic] = find(X3==-1);
th=text(ic,ir,'-');
set(th,'horizontalalignment','center');
hold on;
[ir,ic] = find(X3==1);
th=text(ic,ir,'+');
set(th,'horizontalalignment','center');
caxis([0 1]);colorbar
colormap(jet)
xticklabel_rotate([],90,X2)
saveas(gcf,'Corrmatrix.fig')
saveas(gcf,'Corrmatrix.jpg')
close(gcf);
%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%CORRELATION MATRIX OF THE NEURON WEIGHTS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
X4 = corrcoef([sM1.codebook, INDICES_MTX(:,1)]);
X5 = [sM.comp_names','IBI'];
figure;imagesc(abs(X4))
set(gca,'XTick',1:size(X5,2),'XTickLabel',X5,'FontSize',6)
set(gca,'YTick',1:size(X5,2),'YTickLabel', X5','FontSize',6)
title('SOM Neuron Weights Correlation Matrix','FontSize',10)
X6 = sign(X4);
[ir,ic] = find(X6==-1);
th=text(ic,ir,'-');
set(th,'horizontalalignment','center');
hold on;
[ir,ic] = find(X6==1);
th=text(ic,ir,'+');
set(th,'horizontalalignment','center');
caxis([0 1]);colorbar
colormap(jet)
xticklabel_rotate([],90,X5)
saveas(gcf,'Neuron_Corrmatrix.fig')
saveas(gcf,'Neuron_Corrmatrix.jpg')
close(gcf);
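
The per-cluster medians formed earlier in this script (Env_median) rely on a three-dimensional repmat and on nanmedian. As a minimal sketch of the same idea, assuming a plain vector of cluster labels instead of the NaN-coded Cluster_ids mask, the medians can be obtained with an explicit loop. The toy values and the names X, labels, med, maxClust and minClust are illustrative assumptions and do not appear in the original listing.

```matlab
% Hypothetical toy data: 5 map units x 2 variables, with one missing value
X      = [1 10; 3 NaN; 5 30; 7 40; 9 50];
labels = [1; 1; 1; 2; 2];                 % cluster label of each map unit
nClust = max(labels);

med = nan(size(X,2), nClust);             % variables x clusters, like Env_median
for c = 1:nClust
    for v = 1:size(X,2)
        vals = X(labels == c, v);
        vals = vals(~isnan(vals));        % drop missing values, as nanmedian does
        if ~isempty(vals)
            med(v,c) = median(vals);
        end
    end
end

% Cluster holding the highest and the lowest median of each variable
[maxVal, maxClust] = max(med, [], 2);
[minVal, minClust] = min(med, [], 2);
```

The loop is slower than the vectorized form but uses only base functions, which may make the intermediate values easier to inspect.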

Code for the KNN variable sorting and step-wise predictions

clear all
close all
clc
close(gcf);
fig_handle = [];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% READ THE DATASETS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Database = readtexttocells('C:\Program Files\MATLAB\R2006a\toolbox\somtoolbox\SOM\DATABASES\MD_COASTAL_FINAL_NOZEROS.csv');
fields = Database(1,:);
warning off MATLAB:divideByZero
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ENTERING THE VARIABLES NAMES USED IN EACH STEP
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Env_var = [find(strcmp(fields,'SO4_LAB')) find(strcmp(fields,'TEMP_FLD')) find(strcmp(fields,'ST_GRAD')) ...
    find(strcmp(fields,'NO3_LAB')) find(strcmp(fields,'ACREAGE')) find(strcmp(fields,'MAXDEPTH')) ...
    find(strcmp(fields,'PHI'))];
Env_var_name = 'STEP 25B';
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ENTERING THE NUMBER OF DESIRED CLOSEST NEIGHBORS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
No_hits = 10;
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%EXTRACT ONE OBSERVATION AT A TIME FROM THE DATASET, AND CALCULATE DISTANCES
%WITH ALL THE REMAINING POINTS IN THE DATABASE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for Row_no =2:size(Database,1)
    MTC =Database([1:(Row_no-1), (Row_no+1):end],Env_var);
    Val_IBI = str2double(Database([2:(Row_no-1), (Row_no+1):end],find(strcmp(fields,'FIBI_98'))));
    sD1 = som_data_struct(str2double(MTC(2:end,:)),'comp_names',MTC(1,:),'labels',...
        Database([2:(Row_no-1),(Row_no+1):end],find(strcmp(fields,'IDX'))));
    sD2 = som_normalize(sD1,'log');
    sD2 = som_normalize(sD2,'range');
    %IDENTIFY ROW THAT IS BEING PREDICTED
    VAL_DATA = str2double(Database(Row_no,Env_var));
    % MERGING THE DATABASE AND TARGET ROW, LOGGING AND RANGING
    DATA2 = [];
    for i = 1: size (VAL_DATA,1)
        sD3 = som_data_struct([str2double(MTC(2:end,:));VAL_DATA(i,:)],'comp_names',MTC(1,:));
        sD4 = som_normalize(sD3,'log');
        sD4 = som_normalize(sD4,'range');

        LOG_VALUE = sD4.data(size(sD4.data,1),:)';
        DATA2 = [DATA2 LOG_VALUE];
        clear LOG_VALUE sD3 sD4;
    end
    % CALCULATING THE EUCLIDEAN DISTANCES AND FINDING THE K SITES THAT HAVE THE
    % SMALLEST DISTANCES
    EUCDIST = dist(sD2.data,DATA2)';
    for i =1:size(EUCDIST,1)
        [Sort index] = sort(EUCDIST(i,:));
        Calc_IBI = mean(Val_IBI(index(1:No_hits)));
        CALC_IBI((Row_no-1),1)= Calc_IBI;
        clear Calc_IBI;
    end
end
%WITHDRAWING THE OBSERVED IBI
Obs_IBI = str2double(Database(2:end,find(strcmp(fields,'FIBI_98'))));
%REGRESSION STATISTICS
R2= regstats(Obs_IBI, CALC_IBI,'linear', 'rsquare');
R2= R2.rsquare;
R2text=num2str(R2);
MSE = regstats(Obs_IBI, CALC_IBI,'linear', 'mse');
RMSE= sqrt(MSE.mse);
RMSEtext = num2str(RMSE);
%PLOT THE RESULTS
h=figure;
scatter (Obs_IBI,CALC_IBI,15,'b','filled');
box on;
xlabel ('Observed IBI','Color',[0 0 0]);
ylabel('Predicted IBI','Color',[0 0 0]);
title ('IBI prediction using SOM','Color',[0 0 0]);
set(h, 'Color', [1 1 1]);
set(gca, 'XColor', [0 0 0], 'YColor', [0 0 0],'ZColor', [0 0 0]);
axis ([0 5 0 5]);
text(0.5,4, ['RMSE =' RMSEtext],'Color',[0 0 0]);
hold on
text(0.5,4.5,['R2=' R2text],'Color',[0 0 0]);
%DRAWING THE LINE
hold on
FAKEDATA1 = 0:2:100;
FAKEDATA2 = 0:2:100;
plot(FAKEDATA1,FAKEDATA2,'r--');
saveas (gcf, sprintf('Direct pred using %s_%dsites.jpg',Env_var_name,No_hits));
saveas (gcf, sprintf('Direct pred using %s_%dsites.fig',Env_var_name,No_hits));
save(sprintf('Direct pred using %s_%dsites',Env_var_name, No_hits));
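
The leave-one-out K-nearest-neighbour loop in the listing above depends on the SOM Toolbox (som_data_struct, som_normalize) and on dist. A minimal sketch of the same procedure using only base functions is given here for orientation. It assumes range normalization only (the listing applies a log transform first) and derives the scaling from the remaining sites alone, whereas the listing merges the target row before normalizing. The toy data and the names X, ibi, k, pred and rmse are illustrative assumptions.

```matlab
% Hypothetical toy data: 6 sites, 2 environmental variables, one IBI score each
X   = [1 10; 2 11; 1.5 10.5; 8 30; 9 31; 8.5 30.5];
ibi = [4.0; 4.2; 4.1; 1.5; 1.7; 1.6];
k   = 2;                               % number of neighbours (No_hits in the listing)

n    = size(X,1);
pred = zeros(n,1);
for r = 1:n
    keep    = true(n,1);
    keep(r) = false;                   % leave the target site out
    Xtrain  = X(keep,:);
    ytrain  = ibi(keep);

    % Range-normalize with the remaining sites, then scale the target the same way
    lo   = min(Xtrain,[],1);
    span = max(Xtrain,[],1) - lo;
    span(span == 0) = 1;               % guard against constant columns
    Ztrain  = (Xtrain - repmat(lo,n-1,1)) ./ repmat(span,n-1,1);
    ztarget = (X(r,:) - lo) ./ span;

    % Euclidean distance from the target to every remaining site
    d = sqrt(sum((Ztrain - repmat(ztarget,n-1,1)).^2, 2));
    [dSorted, idx] = sort(d);
    pred(r) = mean(ytrain(idx(1:k)));  % average IBI of the k closest sites
end

rmse = sqrt(mean((ibi - pred).^2));    % prediction error over all left-out sites
```

Here rmse is computed directly from the prediction errors, which is not the same quantity as the residual MSE of the regression of observed on predicted values returned by regstats in the listing.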

Code for the step-wise variable sorting and prediction using a hierarchical tree

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% READING DATABASES
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all;
[Database Fields] = xlsread('DATABASE.xls','A1:N56');
[BIODATA BioFields] = xlsread('DATABASE.xls','P1:P56');
EnvData = Database(:,2:end);
Fields_EnvData = Fields (2:end);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% EXTRACT ONE VARIABLE AT A TIME AND CHECK PREDICTION CAPABILITIES WITH DIFFERENT NUMBER OF HOMOGENEOUS GROUPS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
VARREG_STAT =[];
OBS_IBI = BIODATA(:,1);
%SELECTING ONE VARIABLE AT A TIME FROM THE ENVIRONMENTAL DATABASE
for var_no =1:size(EnvData,2)
    REG_STAT =[];
    % SELECTING THE DIFFERENT NUMBER OF HOMOGENEOUS GROUPS WITH WHICH EACH
    % VARIABLE IS TESTED
    for max_sc= round(size(EnvData,1)/10):190:size(EnvData,1)
        CALC_IBI=[];
        %LEAVE-ONE OBSERVATION OUT METHODOLOGY
        for Row_no = 2:(size(Database,1)+1)
            %ISOLATING SITE OF INTEREST
            TARGET_VAR = Database(Row_no-1,[1 var_no+1]);
            TARGET_BIO = BIODATA(Row_no-1,:);
            TARGET_IDX = Database(Row_no-1,1);
            %OBTAIN REST OF THE DATABASE EXCLUDING THAT OBSERVATION
            INDIDX = find (Database(:,1)~=TARGET_IDX);
            EnvDataTemp = EnvData(INDIDX,var_no);
            BIOTemp = BIODATA(INDIDX,:);
            clear INDIDX
            %STANDARDIZE, CALCULATE DISTANCES, LINK, AND BUILD DENDROGRAM WITH THE
            %REMAINING OBSERVATIONS (ALL EXCEPT TARGET SITE)
            ZEnvDataTemp = zscore(EnvDataTemp);
            DIST = pdist(ZEnvDataTemp,'euclidean');
            LINK =linkage(DIST,'average');
            [D T] = dendrogram(LINK,max_sc, 'colorthreshold','default');
            close(gcf);
            %FIND AVERAGE VALUES FOR EACH ENVIRONMENTAL VARIABLE IN HOMOGENEOUS GROUP IN
            %DENDROGRAM (DETERMINED WITH VECTOR 'T')
            clear AVG_EnvData AVG_BIODATA

            for i =1:max(T)
                INDEX = find(T==i);
                SUB_EnvData = EnvDataTemp(INDEX,:);
                AVG_EnvData(i,:) = mean(SUB_EnvData,1);
                clear INDEX SUB_EnvData;
            end
            % FIND AVERAGE BIOTIC VALUES
            for i =1:max(T)
                INDEX = find(T==i);
                SUB_BIODATA = BIOTemp(INDEX,:);
                AVG_BIODATA(i,:) = mean(SUB_BIODATA,1);
                clear INDEX SUB_BIODATA;
            end
            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            %FIND DISTANCES BETWEEN TARGET SITE AND THE REST OF THE DATABASE
            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            %Merging target site to homogeneous group data
            Merge = [AVG_EnvData;TARGET_VAR(1,2:end)];
            Targ_HG_dist = squareform(pdist(Merge,'euclidean'));
            Targ_HG_dist = Targ_HG_dist((size(AVG_EnvData,1)+1),1:(size(AVG_EnvData,1)));
            %Calculating the average IBI of the closest site/s
            pdistMin = min(Targ_HG_dist,[],2);
            index = find(Targ_HG_dist==pdistMin);
            CALC_IBI = [CALC_IBI, mean(AVG_BIODATA(index,1))];
            clear SUB_EnvData SUB_BIODATA INDEX EnvDataTemp ZEnvDataTemp BIOTemp pdistMin
        end
        %CALCULATING PREDICTION PERFORMANCE FOR THAT ENVIRONMENTAL VARIABLE AFTER
        %TESTING ALL THE SITES AVAILABLE IN OUR DATABASE
        CALC_IBI=CALC_IBI';
        R2= regstats(OBS_IBI, CALC_IBI,'linear', 'rsquare');
        R2=R2.rsquare;
        RMSE = sqrt(mean((OBS_IBI-CALC_IBI).^2));
        STATemp = [R2; RMSE];
        REG_STAT = [REG_STAT STATemp];
        clear STATemp R2 RMSE
    end
    VARREG_STAT = [VARREG_STAT;REG_STAT];
end
% PLOT NUMBER OF HOMOGENEOUS GROUPS VERSUS R2 FOR EACH VARIABLE
HG = round(size(EnvData,1)/10):190:size(EnvData,1);
ax1 = axes ('Xlim',[min(HG) max(HG)],'XTick',HG);
xlabel ('NUMBER OF HOMOGENEOUS GROUPS','Color',[0 0 0]);
ylabel('R2','Color',[0 0 0]);
title ('OPTIMUM NUMBER OF HOMOGENEOUS GROUPS','Color',[0 0 0]);
box on;
%SELECTING AND PLOTTING R2 FIELD FOR EACH VARIABLE IN THE REGRESSION STATISTICS FILE
for var_no = 1:2:(size(EnvData,2)*2)
    hold on
    line(HG,VARREG_STAT(var_no,:),'Parent',ax1);
end

%SAVE FIGURES
saveas (gcf,'VAR_SEL_PLOT.fig');
saveas (gcf,'VAR_SEL_PLOT.jpg');
close (gcf);
save('MAT_FILES');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% SORT THE DATA AND START THE STEP-WISE PREDICTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SORT DATA ACCORDING TO PREDICTION CAPABILITIES
VARREG_STAT = VARREG_STAT';
MAXR2= max(VARREG_STAT(:,1:2:size(VARREG_STAT,2)),[],1);
[SortR2 indR2]= sort(MAXR2,'descend');
R2Comp = SortR2(1);
% STEP-WISE PREDICTION FOLLOWING THE ORDER DETERMINED BY THE OBTAINED R2
SLTD_VAR = EnvData(:,indR2(1,1));
VAR_Names = Fields_EnvData(:,indR2);
indSel_Var = 1;
PROGR2 = [R2Comp];
R2ALL =[];
for sel_var=2:size(indR2,2)
    REG_STAT=[];
    SLTD_VAR = [SLTD_VAR EnvData(:,indR2(sel_var))];
    for max_sc= 233:95:423
        CALC_IBI=[];
        %LEAVE-ONE-OUT PROCEDURE
        for Row_no = 2:(size(Database,1)+1)
            %ISOLATING SITE OF INTEREST
            TARGET_VAR = SLTD_VAR(Row_no-1,:);
            TARGET_BIO = BIODATA(Row_no-1,:);
            TARGET_IDX = Database(Row_no-1,1);
            %OBTAIN REST OF THE DATABASE EXCLUDING THAT OBSERVATION
            INDIDX = find (Database(:,1)~=TARGET_IDX);
            SLTD_VARTemp = SLTD_VAR(INDIDX,:);
            BIOTemp = BIODATA(INDIDX,:);
            clear INDIDX
            %STANDARDIZE, CALCULATE DISTANCES, LINK, AND BUILD DENDROGRAM WITH THE
            %REMAINING OBSERVATIONS (ALL EXCEPT TARGET SITE)
            ZSLTD_VARTemp = zscore(SLTD_VARTemp);
            DIST = pdist(ZSLTD_VARTemp,'euclidean');
            LINK =linkage(DIST,'average');
            [D T] = dendrogram(LINK,max_sc, 'colorthreshold','default');
            close (gcf);
            %FIND AVERAGE VALUES FOR EACH ENVIRONMENTAL VARIABLE IN HOMOGENEOUS GROUP IN
            %DENDROGRAM AND DETERMINED WITH VECTOR 'T'
            for i =1:max(T)
                INDEX = find(T==i);
                SUB_SLTDVAR = SLTD_VARTemp(INDEX,:);
                AVG_SLTDVAR(i,:) = mean(SUB_SLTDVAR,1);
                clear INDEX SUB_SLTDVAR;
            end

            % FIND AVERAGE BIOTIC VALUES
            for i =1:max(T)
                INDEX = find(T==i);
                SUB_BIODATA = BIOTemp(INDEX,:);
                AVG_BIODATA(i,:) = mean(SUB_BIODATA,1);
                clear INDEX SUB_BIODATA;
            end
            %FIND DISTANCES BETWEEN TARGET SITE AND THE REST OF THE DATABASE
            %Merging target site to homogeneous group data
            Merge = [AVG_SLTDVAR;TARGET_VAR];
            Targ_HG_dist = squareform(pdist(Merge,'euclidean'));
            Targ_HG_dist = Targ_HG_dist(size(Targ_HG_dist,1),1:(size(Targ_HG_dist,1)-1));
            %Calculating the average IBI of the closest site/s
            pdistMin = min(Targ_HG_dist,[],2);
            index = find(Targ_HG_dist==pdistMin);
            CALC_IBI = [CALC_IBI, mean(AVG_BIODATA(index,1))];
            clear SUB_SLTDVAR SUB_BIODATA INDEX SLTD_VARTemp ZSLTD_VARTemp BIOTemp pdistMin AVG_SLTDVAR AVG_BIODATA
        end
        %CALCULATING PREDICTION PERFORMANCE FOR THAT ENVIRONMENTAL VARIABLE AFTER
        %TESTING ALL THE SITES AVAILABLE IN OUR DATABASE
        CALC_IBI=CALC_IBI';
        R2= regstats(OBS_IBI, CALC_IBI,'linear', 'rsquare');
        R2=R2.rsquare;
        REG_STAT=[REG_STAT R2];
        clear R2
    end
    R2ALL = [R2ALL;REG_STAT];
    R2=max(REG_STAT,[],2);
    if R2>R2Comp
        R2Comp = R2;
        PROGR2 = [PROGR2 R2];
        indSel_Var =[indSel_Var sel_var];
    else
        SLTD_VAR = SLTD_VAR(:,1:(size(SLTD_VAR,2)-1));
    end
end
%OBTAINING THE NAMES OF THE VARIABLES SELECTED
Sort_Fields = Fields_EnvData(indR2);
Sel_fields = Sort_Fields(indSel_Var);
%SAVE MATLAB FILES
save('MAT_FILES');

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% PLOTTING THE BEST PREDICTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%ENTER NAMES OF BEST VARIABLES AS THEY APPEAR IN Fields (USE QUOTES)
Ind_Bestvar = [find(strcmp(Fields,'EMBEDDED')),find(strcmp(Fields,'RIFFLE')),find(strcmp(Fields,'SUBSTRATE')),...
    find(strcmp(Fields,'POOL')),find(strcmp(Fields,'AREA')),find(strcmp(Fields,'COVER'))];
REG_STAT =[];
% SELECTING THE DIFFERENT NUMBER OF HOMOGENEOUS GROUPS WITH WHICH EACH
% VARIABLE IS TESTED
MaxGroups = input('Enter number of desired groups in hierarchical tree');
for max_sc=MaxGroups
    CALC_IBI=[];
    %LEAVE-ONE-OUT PROCEDURE
    for Row_no = 2:(size(Database,1)+1)
        %ISOLATING SITE OF INTEREST
        TARGET_VAR = Database(Row_no-1,[1 Ind_Bestvar]);
        TARGET_BIO = BIODATA(Row_no-1,:);
        TARGET_IDX = Database(Row_no-1,1);
        %OBTAIN REST OF THE DATABASE EXCLUDING THAT OBSERVATION
        INDIDX = find (Database(:,1)~=TARGET_IDX);
        EnvDataTemp = Database(INDIDX,Ind_Bestvar);
        BIOTemp = BIODATA(INDIDX,:);
        clear INDIDX
        %STANDARDIZE, CALCULATE DISTANCES, LINK, AND BUILD DENDROGRAM WITH THE
        %REMAINING OBSERVATIONS (ALL EXCEPT TARGET SITE)
        ZEnvDataTemp = zscore(EnvDataTemp);
        DIST = pdist(ZEnvDataTemp,'euclidean');
        LINK =linkage(DIST,'average');
        [D T] = dendrogram(LINK,max_sc, 'colorthreshold','default');
        close(gcf);
        %FIND AVERAGE VALUES FOR EACH ENVIRONMENTAL VARIABLE IN HOMOGENEOUS GROUP IN
        %DENDROGRAM AND DETERMINED WITH VECTOR 'T'
        clear AVG_EnvData AVG_BIODATA
        for i =1:max(T)
            INDEX = find(T==i);
            SUB_EnvData = EnvDataTemp(INDEX,:);
            AVG_EnvData(i,:) = mean(SUB_EnvData,1);
            clear INDEX SUB_EnvData;
        end
        % FIND AVERAGE BIOTIC VALUES
        for i =1:max(T)
            INDEX = find(T==i);
            SUB_BIODATA = BIOTemp(INDEX,:);
            AVG_BIODATA(i,:) = mean(SUB_BIODATA,1);
            clear INDEX SUB_BIODATA;
        end

        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        %FIND DISTANCES BETWEEN TARGET SITE AND THE REST OF THE DATABASE
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        %MERGING TARGET SITE TO HOMOGENEOUS GROUP DATA
        Merge = [AVG_EnvData;TARGET_VAR(1,2:end)];
        Targ_HG_dist = squareform(pdist(Merge,'euclidean'));
        Targ_HG_dist = Targ_HG_dist((size(AVG_EnvData,1)+1),1:(size(AVG_EnvData,1)));
        %CALCULATE AVERAGE IBI OF THE CLOSEST SITE/S
        pdistMin = min(Targ_HG_dist,[],2);
        index = find(Targ_HG_dist==pdistMin);
        CALC_IBI = [CALC_IBI, mean(AVG_BIODATA(index,1))];
        clear SUB_EnvData SUB_BIODATA INDEX EnvDataTemp ZEnvDataTemp BIOTemp pdistMin
    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %CALCULATING PREDICTION PERFORMANCE FOR THAT ENVIRONMENTAL VARIABLE AFTER
    %TESTING ALL THE SITES AVAILABLE IN OUR DATABASE
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    CALC_IBI=CALC_IBI';
    R2= regstats(OBS_IBI, CALC_IBI,'linear', 'rsquare');
    R2=R2.rsquare;
    R2text = num2str(R2);
    RMSE = sqrt(mean((abs(OBS_IBI-CALC_IBI)).^2));
    RMSEtext = num2str(RMSE);
    STATemp = [max_sc R2 RMSE];
    REG_STAT = [REG_STAT; STATemp];
end
scatter(OBS_IBI, CALC_IBI);
xlabel ('Observed IBI','Color',[0 0 0]);
ylabel('Predicted IBI','Color',[0 0 0]);
title ('IBI prediction using a hierarchical approach','Color',[0 0 0]);
axis ([12 60 12 60]);
text(13,55, ['RMSE =' RMSEtext],'Color',[0 0 0]);
hold on
text(13,57,['R2=' R2text],'Color',[0 0 0]);
%DRAWING THE LINE
hold on
FAKEDATA1 = 12:2:60;
FAKEDATA2 = 12:2:60;
plot(FAKEDATA1,FAKEDATA2,'r--');
%PLOTTING 1.5xRMSE INTERVALS
FAKEDATA_21 = 12+1.5*RMSE:2:60+1.5*RMSE;
FAKEDATA_22 = 12-1.5*RMSE:2:60-1.5*RMSE;
hold on
plot(FAKEDATA1,FAKEDATA_21,'r--');
plot(FAKEDATA1,FAKEDATA_22,'r--');
saveas (gcf, 'Best prediction Instream variables.jpg');
saveas (gcf, 'Best prediction Instream variables.fig');
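
The prediction step repeated throughout the listing above (average each homogeneous group, find the group closest to the left-out site, and take that group's mean IBI) can be isolated as a short sketch. It assumes the group labels are already available, as produced by the second output of dendrogram in the listing, and so needs no toolbox functions. As in the listing, the group averages and the target are compared in raw units, not z-scores. The toy values and the names Xtrain, ytrain, target, avgX, avgY, nearest and predIBI are illustrative assumptions.

```matlab
% Hypothetical toy inputs
Xtrain = [1 1; 1.2 0.9; 5 5; 5.1 4.8; 9 1];   % remaining sites (rows) x variables
ytrain = [40; 42; 20; 22; 30];                % IBI of the remaining sites
T      = [1; 1; 2; 2; 3];                     % homogeneous-group label of each site
target = [5.2 5.1];                           % environmental values of the left-out site

% Average environmental values and average IBI of each homogeneous group
nGroups = max(T);
avgX = zeros(nGroups, size(Xtrain,2));
avgY = zeros(nGroups, 1);
for g = 1:nGroups
    members   = (T == g);
    avgX(g,:) = mean(Xtrain(members,:), 1);
    avgY(g)   = mean(ytrain(members));
end

% Euclidean distance from the target site to every group average
d = sqrt(sum((avgX - repmat(target,nGroups,1)).^2, 2));

% Predicted IBI: mean over the closest group(s), ties included
nearest = find(d == min(d));
predIBI = mean(avgY(nearest));
```

Keeping ties in nearest mirrors the find(Targ_HG_dist==pdistMin) call in the listing, which averages over every group at the minimum distance.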
