1. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK
BLUK154-Kottegoda April 13, 2008 12:36 APPLIED STATISTICS FOR CIVIL
AND ENVIRONMENTAL ENGINEERS i
APPLIED STATISTICS FOR CIVIL AND ENVIRONMENTAL ENGINEERS
Second Edition

Nathabandu T. Kottegoda
Department of Hydraulic, Environmental, and Surveying Engineering
Politecnico di Milano, Italy

Renzo Rosso
Department of Hydraulic, Environmental, and Surveying Engineering
Politecnico di Milano, Italy
This edition first published 2008
© 2008 by Blackwell Publishing Ltd and 1997 by The McGraw-Hill
Companies, Inc.

Blackwell Publishing was acquired by John Wiley & Sons in February
2007. Blackwell's publishing programme has been merged with Wiley's
global Scientific, Technical, and Medical business to form
Wiley-Blackwell.

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West
Sussex, PO19 8SQ, United Kingdom

Editorial office
9600 Garsington Road, Oxford, OX4 2DQ, United Kingdom

For details of our global editorial offices, for customer services
and for information about how to apply for permission to reuse the
copyright material in this book please see our website at
www.wiley.com/wiley-blackwell.

The right of the author to be identified as the author of this work
has been asserted
in accordance with the Copyright, Designs and Patents Act 1988. All
rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any
means, electronic, mechanical, photocopying, recording or
otherwise, except as permitted by the UK Copyright, Designs and
Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats.
Some content that appears in print may not be available in
electronic books. Designations used by companies to distinguish
their products are often claimed as trademarks. All brand names and
product names used in this book are trade names, service marks,
trademarks or registered trademarks of their respective owners. The
publisher is not associated with any product or vendor mentioned in
this book. This publication is designed to provide accurate and
authoritative information in regard to the subject matter covered.
It is sold on the understanding that the publisher is not engaged
in rendering professional services. If professional advice or other
expert assistance is required, the services of a competent
professional should be sought.

ISBN: 978-1-4051-7917-1

Library of Congress Cataloging-in-Publication Data

Kottegoda, N. T.
Applied statistics for civil and environmental engineers /
Nathabandu T. Kottegoda, Renzo Rosso. -- 2nd ed.
p. cm.
Prev. ed. published as: Statistics, probability, and reliability
for civil and environmental engineers. New York : McGraw-Hill,
c1997.
Includes bibliographical references and index.
ISBN-13: 978-1-4051-7917-1 (hardback : alk. paper)
ISBN-10: 1-4051-7917-1 (hardback : alk. paper)
1. Civil engineering--Statistical methods. 2. Environmental
engineering--Statistical methods. 3. Probabilities. I. Rosso, Renzo.
II. Kottegoda, N. T. Statistics, probability, and reliability for
civil and environmental engineers. III. Title.
TA340.K67 2008
519.5'02'4624--dc22
2007047496

A catalogue record for this book is available from the British
Library.

Set in 10/12pt Times by Aptara Inc., New Delhi, India
Printed in Singapore by Utopia Press Pte Ltd

1 2008
Contents

Dedication
Preface to the First Edition
Preface to the Second Edition
Introduction

1 Preliminary Data Analysis
  1.1 Graphical Representation
    1.1.1 Line diagram or bar chart
    1.1.2 Dot diagram
    1.1.3 Histogram
    1.1.4 Frequency polygon
    1.1.5 Cumulative relative frequency diagram
    1.1.6 Duration curves
    1.1.7 Summary of Section 1.1
  1.2 Numerical Summaries of Data
    1.2.1 Measures of central tendency
    1.2.2 Measures of dispersion
    1.2.3 Measure of asymmetry
    1.2.4 Measure of peakedness
    1.2.5 Summary of Section 1.2
  1.3 Exploratory Methods
    1.3.1 Stem-and-leaf plot
    1.3.2 Box plot
    1.3.3 Summary of Section 1.3
  1.4 Data Observed in Pairs
    1.4.1 Correlation and graphical plots
    1.4.2 Covariance and the correlation coefficient
    1.4.3 Q-Q plots
    1.4.4 Summary of Section 1.4
  1.5 Summary for Chapter 1
  References
  Problems

2 Basic Probability Concepts
  2.1 Random Events
    2.1.1 Sample space and events
    2.1.2 The null event, intersection, and union
    2.1.3 Venn diagram and event space
    2.1.4 Summary of Section 2.1
  2.2 Measures of Probability
    2.2.1 Interpretations of probability
    2.2.2 Probability axioms
    2.2.3 Addition rule
    2.2.4 Further properties of probability functions
    2.2.5 Conditional probability and multiplication rule
    2.2.6 Stochastic independence
    2.2.7 Total probability and Bayes' theorems
    2.2.8 Summary of Section 2.2
  2.3 Summary for Chapter 2
  References
  Problems

3 Random Variables and Their Properties
  3.1 Random Variables and Probability Distributions
    3.1.1 Random variables
    3.1.2 Probability mass function
    3.1.3 Cumulative distribution function of a discrete random variable
    3.1.4 Probability density function
    3.1.5 Cumulative distribution function of a continuous random variable
    3.1.6 Summary of Section 3.1
  3.2 Descriptors of Random Variables
    3.2.1 Expectation and other population measures
    3.2.2 Generating functions
    3.2.3 Estimation of parameters
    3.2.4 Summary of Section 3.2
  3.3 Multiple Random Variables
    3.3.1 Joint probability distributions of discrete variables
    3.3.2 Joint probability distributions of continuous variables
    3.3.3 Properties of multiple variables
    3.3.4 Summary of Section 3.3
  3.4 Associated Random Variables and Probabilities
    3.4.1 Functions of a random variable
    3.4.2 Functions of two or more variables
    3.4.3 Properties of derived variables
    3.4.4 Compound variables
    3.4.5 Summary of Section 3.4
  3.5 Copulas
  3.6 Summary for Chapter 3
  References
  Problems

4 Probability Distributions
  4.1 Discrete Distributions
    4.1.1 Bernoulli distribution
    4.1.2 Binomial distribution
    4.1.3 Poisson distribution
    4.1.4 Geometric and negative binomial distributions
    4.1.5 Log-series distribution
    4.1.6 Multinomial distribution
    4.1.7 Hypergeometric distribution
    4.1.8 Summary of Section 4.1
  4.2 Continuous Distributions
    4.2.1 Uniform distribution
    4.2.2 Exponential distribution
    4.2.3 Erlang and gamma distribution
    4.2.4 Beta distribution
    4.2.5 Weibull distribution
    4.2.6 Normal distribution
    4.2.7 Lognormal distribution
    4.2.8 Summary of Section 4.2
  4.3 Multivariate Distributions
    4.3.1 Bivariate normal distribution
    4.3.2 Other bivariate distributions
  4.4 Summary for Chapter 4
  References
  Problems

5 Model Estimation and Testing
  5.1 A Review of Terms Related to Random Sampling
  5.2 Properties of Estimators
    5.2.1 Unbiasedness
    5.2.2 Consistency
    5.2.3 Minimum variance
    5.2.4 Efficiency
    5.2.5 Sufficiency
    5.2.6 Summary of Section 5.2
  5.3 Estimation of Confidence Intervals
    5.3.1 Confidence interval estimation of the mean when the standard deviation is known
    5.3.2 Confidence interval estimation of the mean when the standard deviation is unknown
    5.3.3 Confidence interval for a proportion
    5.3.4 Sampling distribution of differences and sums of statistics
    5.3.5 Interval estimation for the variance: chi-squared distribution
    5.3.6 Summary of Section 5.3
  5.4 Hypothesis Testing
    5.4.1 Procedure for testing
    5.4.2 Probabilities of Type I and Type II errors and the power function
    5.4.3 Neyman-Pearson lemma
    5.4.4 Tests of hypotheses involving the variance
    5.4.5 The F distribution and its use
    5.4.6 Summary of Section 5.4
  5.5 Nonparametric Methods
    5.5.1 Sign test applied to the median
    5.5.2 Wilcoxon signed-rank test for association of paired observations
    5.5.3 Kruskal-Wallis test for paired observations in k samples
    5.5.4 Tests on randomness: runs test
    5.5.5 Spearman's rank correlation coefficient
    5.5.6 Summary of Section 5.5
  5.6 Goodness-of-Fit Tests
    5.6.1 Chi-squared goodness-of-fit test
    5.6.2 Kolmogorov-Smirnov goodness-of-fit test
    5.6.3 Kolmogorov-Smirnov two-sample test
    5.6.4 Anderson-Darling goodness-of-fit test
    5.6.5 Other methods for testing the goodness-of-fit to a normal distribution
    5.6.6 Summary of Section 5.6
  5.7 Analysis of Variance
    5.7.1 One-way analysis of variance
    5.7.2 Two-way analysis of variance
    5.7.3 Summary of Section 5.7
  5.8 Probability Plotting Methods and Visual Aids
    5.8.1 Probability plotting for uniform distribution
    5.8.2 Probability plotting for normal distribution
    5.8.3 Probability plotting for Gumbel or EV1 distribution
    5.8.4 Probability plotting of other distributions
    5.8.5 Visual fitting methods based on the histogram
    5.8.6 Summary of Section 5.8
  5.9 Identification and Accommodation of Outliers
    5.9.1 Hypothesis tests
    5.9.2 Test statistics for detection of outliers
    5.9.3 Dealing with nonnormal data
    5.9.4 Estimation of probabilities of extreme events when outliers are present
    5.9.5 Summary of Section 5.9
  5.10 Summary of Chapter 5
  References
  Problems

6 Methods of Regression and Multivariate Analysis
  6.1 Simple Linear Regression
    6.1.1 Estimates of the parameters
    6.1.2 Properties of the estimators and errors
    6.1.3 Tests of significance and confidence intervals
    6.1.4 The bivariate normal model and correlation
    6.1.5 Summary of Section 6.1
  6.2 Multiple Linear Regression
    6.2.1 Formulation of the model
    6.2.2 Linear least squares solutions using the matrix method
    6.2.3 Properties of least squares estimators and error variance
    6.2.4 Model testing
    6.2.5 Model adequacy
    6.2.6 Residual plots
    6.2.7 Influential observations and outliers in regression
    6.2.8 Transformations
    6.2.9 Confidence intervals on mean response and prediction
    6.2.10 Ridge regression
    6.2.11 Other methods and discussion of Section 6.2
  6.3 Multivariate Analysis
    6.3.1 Principal components analysis
    6.3.2 Factor analysis
    6.3.3 Cluster analysis
    6.3.4 Other methods and summary of Section 6.3
  6.4 Spatial Correlation
    6.4.1 The estimation problem
    6.4.2 Spatial correlation and the semivariogram
    6.4.3 Some semivariogram models and physical aspects
    6.4.4 Spatial interpolations and Kriging
    6.4.5 Summary of Section 6.4
  6.5 Summary of Chapter 6
  References
  Problems

7 Frequency Analysis of Extreme Events
  7.1 Order Statistics
    7.1.1 Definitions and distributions
    7.1.2 Functions of order statistics
    7.1.3 Expected value and variance of order statistics
    7.1.4 Summary of Section 7.1
  7.2 Extreme Value Distributions
    7.2.1 Basic concepts of extreme value theory
    7.2.2 Gumbel distribution
    7.2.3 Frechet distribution
    7.2.4 Weibull distribution as an extreme value model
    7.2.5 General extreme value distribution
    7.2.6 Contagious extreme value distributions
    7.2.7 Use of other distributions as extreme value models
    7.2.8 Summary of Section 7.2
  7.3 Analysis of Natural Hazards
    7.3.1 Floods, storms, and droughts
    7.3.2 Earthquakes and volcanic eruptions
    7.3.3 Winds
    7.3.4 Sea levels and highest sea waves
    7.3.5 Summary of Section 7.3
  7.4 Summary of Chapter 7
  References
  Problems

8 Simulation Techniques for Design
  8.1 Monte Carlo Simulation
    8.1.1 Statistical experiments
    8.1.2 Probability integral transform
    8.1.3 Sample size and accuracy of Monte Carlo experiments
    8.1.4 Summary for Section 8.1
  8.2 Generation of Random Numbers
    8.2.1 Random outcomes from standard uniform variates
    8.2.2 Random outcomes from continuous variates
    8.2.3 Random outcomes from discrete variates
    8.2.4 Random outcomes from jointly distributed variates
    8.2.5 Summary of Section 8.2
  8.3 Use of Simulation
    8.3.1 Distributions of derived design variates
    8.3.2 Sampling statistics
    8.3.3 Simulation of time- or space-varying systems
    8.3.4 Design alternatives and optimal design
    8.3.5 Summary of Section 8.3
  8.4 Sensitivity and Uncertainty Analysis
  8.5 Summary and Discussion of Chapter 8
  References
  Problems

9 Risk and Reliability Analysis
  9.1 Measures of Reliability
    9.1.1 Factors of safety
    9.1.2 Safety margin
    9.1.3 Reliability index
    9.1.4 Performance function and limiting state
    9.1.5 Further practical solutions
    9.1.6 Summary of Section 9.1
  9.2 Multiple Failure Modes
    9.2.1 Independent failure modes
    9.2.2 Mutually dependent failure modes
    9.2.3 Summary of Section 9.2
  9.3 Uncertainty in Reliability Assessments
    9.3.1 Reliability limits
    9.3.2 Bayesian revision of reliability
    9.3.3 Summary of Section 9.3
  9.4 Temporal Reliability
    9.4.1 Failure process and survival time
    9.4.2 Hazard function
    9.4.3 Reliable life
    9.4.4 Summary of Section 9.4
  9.5 Reliability-Based Design
  9.6 Summary for Chapter 9
  References
  Problems

10 Bayesian Decision Methods and Parameter Uncertainty
  10.1 Basic Decision Theory
    10.1.1 Bayes' rules
    10.1.2 Decision trees
    10.1.3 The minimax solution
    10.1.4 Summary of Section 10.1
  10.2 Posterior Bayesian Decision Analysis
    10.2.1 Subjective probabilities
    10.2.2 Loss and utility functions
    10.2.3 The discrete case
    10.2.4 Inference with conditional binomial and prior beta
    10.2.5 Poisson hazards and gamma prior
    10.2.6 Inferences with normal distribution
    10.2.7 Likelihood ratio testing
    10.2.8 Summary of Section 10.2
  10.3 Markov Chain Monte Carlo Methods
  10.4 James-Stein Estimators
  10.5 Summary and Discussion of Chapter 10
  References
  Problems

Appendix A: Further mathematics
  A.1 Chebyshev Inequality
  A.2 Convex Function and Jensen Inequality
  A.3 Derivation of the Poisson distribution
  A.4 Derivation of the normal distribution
  A.5 MGF of the normal distribution
  A.6 Central limit theorem
  A.7 Pdf of Student's T distribution
  A.8 Pdf of the F distribution
  A.9 Wilcoxon signed-rank test: mean and variance of the test statistic
  A.10 Spearman's rank correlation coefficient
Appendix B: Glossary of Symbols
Appendix C: Tables of Selected Distributions
Appendix D: Brief Answers to Selected Problems
Appendix E: Data Lists
Index
Dedication

To my parents. To estimate the debt I owe them requires a lifespan
of nibbanic extent.
To Mali, Shani, Siraj, and Natasha.
N.T.K.

To my mother Aria, to Donatella, to the two Riccardos in my life,
and to our unforgettable Rufus.
R.R.
Preface to the First Edition

Statistics, probability, and reliability are subject areas that are
not commonly easy for students of civil and environmental
engineering. Such difficulties notwithstanding, greater emphasis is
currently being placed on the teaching of these methods throughout
institutions of higher learning. Many professors with whom we have
spoken have expressed the need for a single textbook of sufficient
breadth and clarity to cover these topics.

One might ask why it is
necessary to write a new book specifically for civil and
environmental engineers. Firstly, we see a particular importance of
statistical and associated methods in our disciplines. For example,
some modes of failure, interactions, probability distributions,
outliers, and spatial relationships that one encounters are unique
and require different approaches. Secondly, colleagues have said
that existing books are either old and outdated or omit
particularly important engineering problems, emphasizing instead
areas that may not be directly relevant to the practitioner.

We set
ourselves several objectives in writing this book. First, it was
necessary to update much of the older material, which has rightly
stood for decades, indeed even centuries. Second, we had to look
at the engineer's structures, waterways, and the like and bring in
as much material as possible for the tasks at hand. We felt an
urgent need to modernize, incorporate new concepts throughout, and
reduce or eliminate the impact of some topics. We aimed to order
the material in a logical sequence. In particular, we tried to adopt
a writing style and method of presentation that are lively and
without overrigorous drudgery. These had to be accomplished without
compromising a deep and thorough treatment of fundamentals.

The
layout of the book is straightforward, so it can be used to suit
one's personal needs. We apologize to any readers who think we have
strayed from the path of simplicity in certain parts, such as the
associated variables and contagious distributions of Chapter 3 and
the order statistics of Chapter 7. One might wish to omit these
sections on a first reading. The introductions to the chapters will
be helpful for this purpose.

The explanation of the theory is accompanied by the assumptions
made. Definitions are separately highlighted. In many places we
point out the limitations and pitfalls or violations. There are
warnings of possible misuses, misunderstandings, and
misinterpretations. We provide guidance on the proper
interpretation of statistical results.

The numerous examples, for which we have for the most part used
recorded observations, will be helpful to beginners as well as to
mature students who will consult the text as a reference. We hope
these examples will lead to a better understanding of the material
and design variabilities, a prelude to the making of sound
decisions.
Each chapter concludes with extensive homework problems. In many
instances, as in Chapter 1, they are based on real data not used
elsewhere in the text. We have not used cards or dice or coins or
black and red balls in any of the problems and examples. Answers to
selected problems are summarized in Appendix D. A detailed manual
of solutions is available.

Computers are continuously becoming cheaper and more powerful.
Newer ways of handling data are being devised. At the inception, we
seriously considered the use of commercial software packages to
enhance the scope of the book. However, the problem of choosing one
from the many suitable packages acted as a deterrent. Our concern
was the serious limitations imposed by utilizing a source that
necessitates a corresponding purchase by an adopting school or by
individual engineers. Besides, the calculations illustrated in the
book can be made using worksheets available as standard software
for personal computers. As an aid, the data in Appendix E will be
placed on the Internet. We have
utilized the space saved (from jargon and notation of a particular
software, output, graphs, and tables) to widen the scope, make our
explanations more thorough, and insert additional illustrations and
problems.

Readers also have an almost all-inclusive index, a comprehensive
glossary of notation, additional mathematical explanations, and
other material in the appendixes. Furthermore, we hope that the
extensive, annotated bibliographies at the end of each chapter,
together with numerous citations and tables, will make this a
useful reference source.

The book is written for use by students, practicing engineers,
teachers, and researchers in civil and environmental engineering
and applied statistics; female readers will find no hint of male
chauvinism here. It is designed for a one- or two-semester course
and is suitable for final-year undergraduate and first-year
graduate students. The text is self-contained for study by
engineers. A background of elementary calculus and matrix algebra
is assumed.

ACKNOWLEDGMENTS

We acknowledge with thanks the
work of the staff at Publication Services, Inc., in Champaign,
IL. Gianfausto Salvadori gave his time generously in reviewing the
manuscript and providing solutions to some homework problems.
Thanks are due again to Adri Buishand for his elaborate and
painstaking reviews. Our publisher solicited other reviewers whose
reports were useful. Howard Tillotson and colleagues at the
University of Birmingham, England, provided data and some student
problems. Discussions with Tony Lawrance at lunch in the University
Staff House and the example problem he solved at Helsinki Airport
are appreciated. Valuable assistance was provided by Giovanni
Solari and Giulio Ballio in wind and steel engineering,
respectively. In addition, Giovanni Vannuchi was consulted on
geotechnical engineering. Research staff and doctoral students at
the Politecnico di Milano helped with the homework problems and
the preparation of the index. Dora Tartaglia worked diligently on
revisions to the manuscript.

We thank the publishers, companies, and individuals who gave us
permission to use their material, data, and tables; some of the
tables were obtained through our own resources. We shall be pleased
to have any omissions brought to our notice. The support and
hospitality provided at the Università degli Studi di Pavia by
Luigi Natale and others are acknowledged with thanks. Most
importantly, without the patience and tolerance of our families
this book could not have been completed.

N. T. Kottegoda
R. Rosso
Milano, Italy
1 July 1996
Preface to the Second Edition

Last year a senior European professor, who uses our book, was
visiting us in Milano. When told of the revisions underway he
expressed some surprise. "There is nothing to revise," he said. But
all books need revision sooner or later, especially a
multidimensional one. The equations, examples, problems, figures,
tables, references, and footnotes are all subject to inevitable
human fallibilities: typographical errors and errors of fact. Our
first objective was to bring the text as close to the ideal state as
possible.

The second priority was to modernize. In Chapter 10, a
new section is added on Markov chain Monte Carlo modeling, which
has popularized Bayesian methods in recent years; there is a full
description and case study on Gibbs sampling. In Chapter 8 on
simulation, we include a new section on sensitivity analysis and
uncertainty analysis; a clear and detailed distinction is made
between epistemic and aleatory uncertainties, and their
implications in decision-making are discussed. In Chapter 7 on
Frequency Analysis of Extreme Events, natural hazards and flood
hydrology are updated. In Chapter 6 on regression analysis, further
considerations have been made on the diagnostics of regression;
there are new discussions on general and generalized linear models.
In Chapter 5 on Model Estimation and Testing we give special
importance to the Anderson-Darling goodness-of-fit test because of
its sensitivity to departures in the tail areas of a probability
density function; we make applications to nonnormal distributions
using the same data as in the estimation of parameters. In Chapter
3 a section is added on the novel method of copulas with particular
emphasis on bivariate distributions. We have revised the problems
following Basic Probability Concepts in Chapter 2. Other chapters
are also revised and modernized and the annotated references are
updated.

As before, we have kept in mind the scientific method of
Claude Bernard, the French medical researcher of the nineteenth
century. This had three essential parts: observation of phenomena
in nature (seen in Appendix E, and in the examples and problems),
observation of experiments (as reported in each chapter), and the
theoretical part (clear enough for the audience in mind, but
without over-simplification).

Nobody trusts a model except the one who originated it; everybody
trusts data except those who record it. Models and data are subject
to uncertainty. There is still a gap between models and data. We
attempt to bridge this gap.

The title of the book has been abridged from Statistics,
Probability, and Reliability for Civil and Environmental Engineers
to Applied Statistics for Civil and Environmental Engineers. The
applications and problems pertain almost equally to both
disciplines and all areas are included.

Another aspect we emphasized before was that
the calculations illustrated in the book can be made using
worksheets available as standard software for personal computers.
Alternatively, R, which is now commonplace, can be downloaded free
of charge and adopted to run some of the homework problems, if one
so prefers. Our decision not to recommend the use of particular
commercial software packages, by giving details of jargon,
notation, and so on, seems to be justified. We find that a specific
version soon becomes obsolete with the advent of a new version. A
limited-access solutions manual is available with the data from
Appendix E on the Wiley-Blackwell website
[www.blackwellpublishing.com/kottegoda].

We are grateful for the encouragement given by many users of the
first edition, and to the few who pointed out some discrepancies.
We thank the anonymous reviewers for their useful comments.
Gianfausto Salvadori, Carlo De Michele, Adri Buishand, and Tony
Lawrance assisted us again in the revisions. Julia Burden and Lucy
Alexander of Blackwell Publishing supported us throughout the
project. Università degli Studi di Pavia is thanked for continued
hospitality. The help provided by Fabrizio Borsa and Enrico Raiteri
in the preparation of some figures is acknowledged.

N. T. Kottegoda
R. Rosso
Milano, Italy
14 September 2007
Introduction

As a wide-ranging discipline, statistics concerns numerous
procedures for deriving information from data that have been
affected by chance variations. On the basis of scientific
experiments, one may record and make summaries of observations,
quantify variations or other changes of significance, and compare
data sequences by means of some numbers or characteristics. The use
of statistics in this way is for descriptive purposes. At a more
sophisticated level of analysis and interpretation, one can, for
instance, test hypotheses using the inferential approach developed
during the twentieth century. Thus it may be ascertained, for
instance, whether the change of an ingredient affects the
properties of a concrete or whether a particular method of
surfacing produces a longer-lasting road; this approach often
includes the estimation by means of observations of the parameters
of a statistical model. Then inferences can be drawn from data and
predictions made or decisions taken. When faced with uncertainty,
this last phase is the principal aim of a civil or environmental
engineer acting as an applied statistician.

In all activities, engineers have to cope
with possible uncertainties. Observations of soil pressures,
tensile strengths of concrete, yield strengths of steel, traffic
densities, rainfalls, river flows, and pollution loads in streams
vary from one case to the next for apparently unknown reasons or on
account of factors that cannot be assessed to any degree of
accuracy. However, designs need to be completed and structures,
highways, water supply, and sewerage schemes constructed. Sound
engineering judgment, in fact, springs from physical and
mathematical theories, but it goes far beyond that. Randomness in
nature must be taken into account. Thus the onus of dealing with
the uncertainties lies with the engineer.

The appropriate methods of tackling the uncertainty vary with
different circumstances. The
key is often the dispersion that is commonly evidenced in available
data sets. Some phenomena may have negligible or low variability.
In such a case, the mean of past observations may be used as a
descriptor, for example, the elastic constant of a steel.
Nevertheless, the consequences of a possible change in the mean
should also be considered. Frequently, the variability in
observations is found to be quite substantial. In such situations,
an engineer sometimes uses, rather conservatively, a design value
such as the peak storm runoff or the compressive strength of a
concrete. Alternatively, it has been the practice to express the
ability of a component in a structure to withstand a specified
loading without failure or a permissible deflection by a so-called
factor of safety; this is in effect a blanket to cover all possible
contingencies.

However, we envisage some problems here in following a purely
deterministic approach because there are doubts concerning the
consistency of specified strengths, flows, loads, or factors from
one case to another. These cannot be lightly dismissed or easily
compounded when the consequences of ignoring variability are
detrimental or, in general, if the decision is sensitive to a
particular uncertainty. (Often there are crucial economic
considerations in these matters.) This obstacle strongly suggests
that the way forward is by treating statistics and probability as
necessary aids in decision making, thus coping with uncertainty
through the engineering process.

Note that statistical methods are
in no way intended to replace the physical knowledge and
experience of the engineer and his or her skills in
experimentation. The engineer should know how the measurements are
made and recorded and how errors may arise from possible
limitations in the equipment. There should be readiness to make
changes and improvements so that the data-gathering process is as
reliable and representative as possible.
On this basis, statistics can be a complementary and valuable aid
to technology. In prudent hands it can lead to the best practical
assessment of what is partially known or uncertain.

The quantification of uncertainty and the assessment of its effects
on design and implementation must include concepts and methods of
probability, because statistics is built on the foundation of
probability theory. In addition, decision making under risk
involves the use of applied probability.
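
As a rough illustration of how probability quantifies the effect of
variability on a design, the following sketch compares two
hypothetical components that share the same central factor of safety
(mean resistance divided by mean load) but differ in dispersion. It
is a minimal sketch under assumptions that are ours and not taken
from the text: resistance and load are modeled as independent normal
variables, and all numerical values and function names are invented
for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def failure_probability(mean_r: float, sd_r: float,
                        mean_s: float, sd_s: float) -> float:
    """P(R < S) for independent normal resistance R and load S.

    The normal model and the independence of R and S are
    illustrative assumptions, not a recommendation from the text.
    """
    margin_mean = mean_r - mean_s          # mean of the margin R - S
    margin_sd = math.hypot(sd_r, sd_s)     # sd of R - S under independence
    return normal_cdf(-margin_mean / margin_sd)


# Hypothetical designs: both have a central factor of safety 300/200 = 1.5.
low_var = failure_probability(300.0, 15.0, 200.0, 20.0)
high_var = failure_probability(300.0, 45.0, 200.0, 60.0)

print(f"factor of safety (both designs): {300.0 / 200.0:.2f}")
print(f"failure probability, low dispersion:  {low_var:.2e}")
print(f"failure probability, high dispersion: {high_var:.2e}")
print(f"reliability, high dispersion: {1.0 - high_var:.4f}")
```

Under these assumptions both designs carry the same factor of safety,
yet their computed failure probabilities differ by several orders of
magnitude, which is the kind of inconsistency a single blanket factor
cannot reveal.
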
Historically, probability theory arose as a branch of mathematics
concerned with the analysis of certain games of chance; it
consequently found applications in the measurement and
understanding of uncertainty in innumerable natural phenomena and
human activities.

The fundamental interrelationship between statistics and
probability is clearly evident in practice. As seen in past
decades, there has been an irreversible change in emphasis from
descriptive to inferential statistics. In this respect we must note
that statistical inferences and the risk and reliability of
decision making under uncertainty are evaluated through applied
probability, using frequentist or Bayesian estimation. This applies
to the most widely used methods. Alternatives that come under
generalized information theory are now available.

The reliability
of a system, structure, or component is the complement of its
probability of failure. Risk and reliability analysis, however,
entail many activities. The survival probability of a system is
usually stated in terms of the reliabilities of its components. The
modeling process is an essential part of the analysis, and time can
be an important factor. Also, the risk factor that one computes may
be inherent, additional, or composite. All these points show that
reliability design deserves special emphasis. Methods of reducing
data, reviewed in Chapter 1, begin with tabulation and graphical
representation, which are necessary first steps in understanding the
uncertainty in data and the inherent variability. Numerical
summaries provide descriptions for further analysis. Exploratory
methods are followed by relationships between data observed in
pairs. Thus the investigation begins. The route is long and
diverse, because statistics is the science and art of
experimenting, collecting, analyzing, and making inferences from
data. This opening chapter provides a route map of what is to
follow so that one can gain insight into the numerous tools
statistics offers and realize the variety of problems that can be
tackled. In Chapters 2 and 3, we develop a background in
probability theory for coping with uncertainty in engineering.
Using basic concepts, we then discuss the total probability and
Bayes' theorems and define statistical properties of distributions
used for estimation purposes. Chapter 4 examines various
mathematical models of random processes. There is a wide-ranging
discussion of discrete and continuous distributions; joint and
derived types are also given in Chapters 3 and 4; we introduce
copulas that can effectively model joint distributions. Model
estimation and testing methods, such as confidence intervals,
hypothesis testing, analysis of variance, probability plotting, and
identification of outliers, are treated in Chapter 5. The estimation
and testing are based on the principle that all suppositions need
to be carefully examined in light of experimentation and
observation. Details of regression and multivariate statistical
methods are provided in Chapter 6, along with principal component
analysis and associated methods and spatial correlation. Extreme
value analysis applied to floods, droughts, winds, earthquakes, and
other natural hazards is found in Chapter 7; some special types of
models are included. Simulation is the subject of Chapter 8, which
comprises the use of simulation in design and for other practical
purposes;
also, we discuss sensitivity analysis and uncertainty analysis of
the aleatory and epistemic types. In Chapter 9, risk and
reliability analysis and reliability design are developed in
detail. Chapter 10 is devoted to Bayesian and other types of
economic decision making, used when the engineer faces uncertainty;
we include here Markov chain Monte Carlo methods that have recently
popularized the Bayesian approach.

Chapter 1 Preliminary Data Analysis

All natural processes, as well as those devised by humans,
are subject to variability. Civil engineers are aware, for example,
that crushing strengths of concrete, soil pressures, strengths of
welds, traffic flow, floods, and pollution loads in streams have wide
variations. These may arise on account of natural changes in
properties, differences in interactions between the ingredients of
a material, environmental factors, or other causes. To cope with
uncertainty, the engineer must first obtain and investigate a sample
of data, such as a set of flow data or triaxial test results. The
sample is used in applying statistics and probability at the
descriptive stage. For inferential purposes, however, one needs to
make decisions regarding the population from which the sample is
drawn. By this we mean the total or aggregate, which, for most
physical processes, is the virtually unlimited universe of all
possible measurements. The main interest of the statistician is in
the aggregation; the individual items provide the hints, clues, and
evidence. A data set comprises a number of measurements of a
phenomenon such as the failure load of a structural component. The
quantities measured are termed variables, each of which may take
any one of a specified set of values. Because of its inherent
randomness and hence unpredictability, a phenomenon that an
engineer or scientist usually encounters is referred to as a random
variable, a name given to any quantity whose value depends on
chance.1 Random variables are usually denoted by capital letters.
These are classified by the form that their values can possibly take
(or are assumed to take). The pattern of variability is called a
distribution. A continuous variable can have any value on a
continuous scale between two limits, such as the volume of water flowing
in a river per second or the amount of daily rainfall measured in
some city. A discrete variable, on the contrary, can only assume
countable isolated numbers like integers, such as the number of
vehicles turning left at an intersection, or other distinct values.
Having obtained a sample of data, the first step is its presentation.
Consider, for example, the modulus of rupture data for a certain
type of timber shown in Table E.1.1, in Appendix E. The initial
problem facing the civil engineer is that such an array of data by
itself does not give a clear idea of the underlying characteristics
of the stress values in this natural type of construction material.
To extract the salient features and the particular types of
information one needs, one must summarize the data and present them
in some readily comprehensible forms. There are several methods of
presentation, organization, and reduction of data. Graphical
methods constitute the first approach.

1 The term will be formally defined in Section 3.1.

1.1 GRAPHICAL REPRESENTATION

If a picture is worth a thousand words, then graphical techniques
provide an excellent method to visualize the variability and other
properties of a set of data. To the powerful interactive system of
one's brain and eyes, graphical displays provide insight into the
form and shape of the data and lead to
a preliminary concept of the generating process. We proceed by
assembling the data into graphs, scanning the details, and noting
the important characteristics. There are numerous types of graphs.
Line and dot diagrams, histograms, relative frequency polygons, and
cumulative frequency curves are given in this section.
Subsequently, exploratory methods, such as stem-and-leaf plots and
box diagrams and graphs depicting a possible association between
two variables, are presented in Sections 1.3 and 1.4. We begin with
the simple task of counting.

1.1.1 Line diagram or bar chart

The occurrences of a discrete variable can be classified on a line
diagram or bar chart. In this type of graph, the horizontal axis
gives the values of the discrete variable and the occurrences are
represented by the heights of vertical lines. The horizontal spread
of these lines and their relative heights indicate the variability
and other characteristics of the data.

Example 1.1. Flood occurrences. Consider the annual number of floods
of the Magra River at Calamazza, situated between Pisa and Genoa in
northwestern Italy, over a 34-year period, as shown in Table 1.1.1.
A flood in the river at the point of measurement means the river has
risen above a specified level, beyond which the river poses a threat
to lives and property. The data are plotted in Fig. 1.1.1 as a line
diagram. The data suggest a symmetrical distribution with a
midlocation of four floods per year. In some other river basins,
there is a nonlinear decrease in the occurrences for increasing
numbers of floods in a year commencing at zero, showing a negative
exponential type of
variation.

Table 1.1.1 Number of flood occurrences per year from 1939 to 1972
at the gauging station of Calamazza on the Magra River, between Pisa
and Genoa in northwestern Italy (a)

Number of floods in a year    Number of occurrences
0                             0
1                             2
2                             6
3                             7
4                             9
5                             4
6                             1
7                             4
8                             1
9                             0
Total                         34

(a) A flood occurrence is defined as river discharge exceeding 300 m3/s.

Fig. 1.1.1 Line diagram for flood occurrences in the Magra River at
Calamazza between Genoa and Pisa in northwestern Italy (horizontal
axis: number of floods; vertical axis: number of occurrences).

1.1.2 Dot diagram

A different type of graph is required to present continuous data. If
the data are few (say, fewer than 25 items) a dot diagram is a useful
visual aid. Consider the possibility that only the first 15 items of
data in Table E.1.1, which shows the modulus of rupture in N/mm2 for
50 mm × 150 mm Swedish redwood and whitewood, are available. The
abridged data are ranked in ascending order and are given in Table
1.1.2 and plotted in Fig. 1.1.2. The reader can see that the
midlocation is close to 40 N/mm2 but the wide spread makes this
location difficult to discern. A larger sample should certainly be
helpful.

1.1.3 Histogram

If there are at least,
say, 25 observations, one of the most common graphical forms is a
block diagram called the histogram. For this purpose, the data are
divided into groups according to their magnitudes. The horizontal
axis of the graph gives the magnitudes. Blocks are drawn to
represent the groups, each of which has a distinct upper and lower
limit. The area of a block is proportional to the number of
occurrences in the group. The variability of the data is shown by
the horizontal spread of the blocks, and the most common values are
found in blocks with the largest areas. Other features such as the
symmetry of the data or lack of it are also shown. The first step is
to take into account the range r of the observations, that is, the
difference between the largest and smallest values.

Example 1.2. Timber strength. We go back to the timber strength data
given in Table E.1.1. They are arranged in order of magnitude in
Table 1.1.3. There are n = 165 observations with somewhat high
variability, as expected, because timber is a naturally variable
material. Here the range r = 70.22 - 0.00 = 70.22 N/mm2. To draw a
histogram, one divides the range into a number of classes or cells,
nc. The number of occurrences in each class is counted and
tabulated. These are called frequencies.

Table 1.1.2 The first 15 items of modulus of rupture data measuring
timber strengths in N/mm2, from Table E.1.1 (commencing with the top
row), ranked in increasing order

29.11  29.93  32.02  32.40  33.06
34.12  35.58  39.34  40.53  41.64
45.54  48.37  48.78  50.98  65.35
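The counting of class frequencies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 15 values are those of Table 1.1.2, the class width of 8 N/mm2 with the lowest boundary at 0 is an arbitrary choice made here, and the function name `class_frequencies` is hypothetical. A sample this small is used purely to keep the example short.

```python
from collections import Counter

# First 15 modulus-of-rupture values (N/mm2), as listed in Table 1.1.2.
sample = [29.11, 29.93, 32.02, 32.40, 33.06, 34.12, 35.58, 39.34,
          40.53, 41.64, 45.54, 48.37, 48.78, 50.98, 65.35]


def class_frequencies(values, width, start=0.0):
    """Count observations in classes [start + k*width, start + (k+1)*width)."""
    counts = Counter(int((v - start) // width) for v in values)
    n_classes = max(counts) + 1
    return [counts.get(k, 0) for k in range(n_classes)]


data_range = max(sample) - min(sample)   # the range r of the sample
width = 8.0                              # illustrative class width (assumption)
freqs = class_frequencies(sample, width)

for k, f in enumerate(freqs):
    lower, upper = k * width, (k + 1) * width
    print(f"{lower:5.1f} to under {upper:5.1f}: {f}")
print("range r =", round(data_range, 2), "N/mm2")
```

Using half-open classes avoids any ambiguity about which class a boundary value belongs to.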

Fig. 1.1.2 Dot diagram for a short sample of timber strengths from
Table 1.1.3 (horizontal axis: modulus of rupture, N/mm2, from 25 to 70).

The width of the classes is
usually made equal to facilitate interpretation. For some work such
as the fitting of a theoretical function to observed frequencies,
however, unequal class widths are used. Care should be exercised in
the choice of the number of classes, nc. Too few will cause an
omission of some important features of the data; too many will not
give a clear overall picture because there may be high fluctuations
in the frequencies. A rule of thumb is to make $n_c = \sqrt{n}$ or an
integer close to this, but it should be at least 5 and not greater
than 25. Thus, histograms based on fewer than 25 items may not be
meaningful. Sturges (1926) suggested the approximation

$$ n_c = 1 + 3.3 \log_{10} n. \qquad (1.1.1) $$
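As a quick numerical check, the square-root rule and Sturges' approximation, Eq. (1.1.1), can be evaluated for the timber strength sample of n = 165. The helper `clamp_classes` is a hypothetical name; rounding to the nearest integer is an assumption, since in practice one may round to whichever integer gives convenient class widths.

```python
import math

n = 165  # sample size of the timber strength data (Example 1.2)

nc_sqrt = math.sqrt(n)                  # square-root rule of thumb
nc_sturges = 1 + 3.3 * math.log10(n)    # Sturges' approximation, Eq. (1.1.1)


def clamp_classes(nc, lowest=5, highest=25):
    """Round to the nearest integer and keep within the suggested limits."""
    return min(max(round(nc), lowest), highest)


print(f"square-root rule: {nc_sqrt:.2f} -> {clamp_classes(nc_sqrt)} classes")
print(f"Sturges:          {nc_sturges:.2f} -> {clamp_classes(nc_sturges)} classes")
```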
A more theoretically based alternative follows the work of Freedman
and Diaconis (1981):2

$$ n_c = \frac{r\, n^{1/3}}{2\, \mathrm{iqr}}. \qquad (1.1.2) $$

Here iqr is the interquartile range. To
clarify this term, we must define Q2, or the median. This denotes the
middle term of a set of data when the values are arranged in
ascending order, or the average of the two middle terms if n is an
even number. The first or lower quartile, Q1, is the median of the
lower half of the data, and likewise the third or upper quartile, Q3,
is the median of the upper half of the data. This definition will be
used throughout.3 Thus,

$$ \mathrm{iqr} = Q_3 - Q_1. \qquad (1.1.3) $$

2 See also Scott (1979).

Table 1.1.3 Ranked modulus of rupture data for timber strengths in
N/mm2, in ascending order (a)

 0.00  28.00  31.60  34.44  36.84  39.21  41.75  44.30  47.25  53.99
17.98  28.13  32.02  34.49  36.85  39.33  41.78  44.36  47.42  54.04
22.67  28.46  32.03  34.56  36.88  39.34  41.85  44.36  47.61  54.71
22.74  28.69  32.40  34.63  36.92  39.60  42.31  44.51  47.74  55.23
22.75  28.71  32.48  35.03  37.51  39.62  42.47  44.54  47.83  56.60
23.14  28.76  32.68  35.17  37.65  39.77  43.07  44.59  48.37  56.80
23.16  28.83  32.76  35.30  37.69  39.93  43.12  44.78  48.39  57.99
23.19  28.97  33.06  35.43  37.78  39.97  43.26  44.78  48.78  58.34
24.09  28.98  33.14  35.58  38.00  40.20  43.33  45.19  49.57  65.35
24.25  29.11  33.18  35.67  38.05  40.27  43.33  45.54  49.59  65.61
24.84  29.90  33.19  35.88  38.16  40.39  43.41  45.92  49.65  69.07
25.39  29.93  33.47  35.89  38.64  40.53  43.48  45.97  50.91  70.22
25.98  30.02  33.61  36.00  38.71  40.71  43.48  46.01  50.98
26.63  30.05  33.71  36.38  38.81  40.85  43.64  46.33  51.39
27.31  30.33  33.92  36.47  39.05  40.85  43.99  46.50  51.90
27.90  30.53  34.12  36.53  39.15  41.64  44.00  46.86  53.00
27.93  31.33  34.40  36.81  39.20  41.72  44.07  46.99  53.63

(a) The original data set is given in Table E.1.1; n = 165. The ranked
values read down each column. The median, 39.05, is underlined in the
original.

Table 1.1.4 Frequency computations for the modulus of rupture data
ranked in Table 1.1.3 (a)

Class upper    Class center   Absolute    Relative    Cumulative relative
limit (N/mm2)  (N/mm2)        frequency   frequency   frequency (%)
 5              2.5             1         0.006         0.61
10              7.5             0         0.000         0.61
15             12.5             0         0.000         0.61
20             17.5             1         0.006         1.21
25             22.5             9         0.055         6.67
30             27.5            18         0.109        17.58
35             32.5            26         0.158        33.33
40             37.5            38         0.230        56.36
45             42.5            34         0.206        76.97
50             47.5            20         0.121        89.09
55             52.5             9         0.055        94.55
60             57.5             5         0.030        97.58
65             62.5             0         0.000        97.58
70             67.5             3         0.018        99.39
75             72.5             1         0.006       100.00

(a) The width of each class is 5 N/mm2 in this example.
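The quartile definition and Eqs. (1.1.2) and (1.1.3) can be checked numerically with the sketch below, which types in the 165 values of Table 1.1.3 row by row. Two points are assumptions made here rather than statements from the text: the function name `quartiles` is hypothetical, and when n is odd the median itself is left out of both halves before their medians are taken.

```python
import statistics

# Modulus-of-rupture data (N/mm2), typed row by row from Table 1.1.3.
rows = """
 0.00 28.00 31.60 34.44 36.84 39.21 41.75 44.30 47.25 53.99
17.98 28.13 32.02 34.49 36.85 39.33 41.78 44.36 47.42 54.04
22.67 28.46 32.03 34.56 36.88 39.34 41.85 44.36 47.61 54.71
22.74 28.69 32.40 34.63 36.92 39.60 42.31 44.51 47.74 55.23
22.75 28.71 32.48 35.03 37.51 39.62 42.47 44.54 47.83 56.60
23.14 28.76 32.68 35.17 37.65 39.77 43.07 44.59 48.37 56.80
23.16 28.83 32.76 35.30 37.69 39.93 43.12 44.78 48.39 57.99
23.19 28.97 33.06 35.43 37.78 39.97 43.26 44.78 48.78 58.34
24.09 28.98 33.14 35.58 38.00 40.20 43.33 45.19 49.57 65.35
24.25 29.11 33.18 35.67 38.05 40.27 43.33 45.54 49.59 65.61
24.84 29.90 33.19 35.88 38.16 40.39 43.41 45.92 49.65 69.07
25.39 29.93 33.47 35.89 38.64 40.53 43.48 45.97 50.91 70.22
25.98 30.02 33.61 36.00 38.71 40.71 43.48 46.01 50.98
26.63 30.05 33.71 36.38 38.81 40.85 43.64 46.33 51.39
27.31 30.33 33.92 36.47 39.05 40.85 43.99 46.50 51.90
27.90 30.53 34.12 36.53 39.15 41.64 44.00 46.86 53.00
27.93 31.33 34.40 36.81 39.20 41.72 44.07 46.99 53.63
"""
data = sorted(float(v) for v in rows.split())
n = len(data)


def quartiles(sorted_values):
    """Return (Q1, Q2, Q3); Q1 and Q3 are medians of the lower and upper halves."""
    m = len(sorted_values)
    half = m // 2
    lower = sorted_values[:half]
    # Assumption: for odd m the middle value is excluded from both halves.
    upper = sorted_values[half + 1:] if m % 2 else sorted_values[half:]
    return (statistics.median(lower),
            statistics.median(sorted_values),
            statistics.median(upper))


q1, q2, q3 = quartiles(data)
iqr = q3 - q1                           # Eq. (1.1.3)
r = data[-1] - data[0]                  # range of the observations
nc_fd = r * n ** (1 / 3) / (2 * iqr)    # Eq. (1.1.2)

print(f"n = {n}, Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.3f}")
print(f"iqr = {iqr:.3f}, r = {r:.2f}, nc = {nc_fd:.2f}")
```

Under these assumptions the sketch gives Q1 = 32.91 and Q3 = 44.565 N/mm2, in line with the values used for these data in Example 1.3.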
Example 1.3. Timber strength. For the timber strength data of Table
E.1.1, the median, that is, Q2, is 39.05 N/mm2. Also Q3 and Q1 are
44.57 and 32.91 N/mm2, respectively, and hence iqr = 11.66 N/mm2.
From the simple square-root rule, the number of classes, nc =
12.84. However, by using Eqs. (1.1.1) and (1.1.2), the numbers of
classes are 8.32 and 16.52, respectively. If these are rounded to 9
and 15 and the range is extended to 72 and 75 N/mm2 for graphical
purposes, the equal class widths become 8 and 5 N/mm2,
respectively. Let us use these widths. It is important to specify
the class boundaries without ambiguity for the counting of
frequencies; for example, in the first case, these should be from 0
to 7.99, 8.00 to 15.99, and so on. As already mentioned, the
vertical axis of a histogram is made to represent the frequency and
the horizontal axis is used as a measurement scale on which the
class boundaries are marked. For each of these class widths, 8 and
5 N/mm2, class boundaries are made and counting of frequencies is
completed using Table 1.1.3; the lowest boundary is at 0 and the
highest boundaries are at 72 and 75 N/mm2, respectively. Table
1.1.4 gives the absolute and relative frequencies for class widths
of 5 N/mm2. Rectangles are then erected over each of the classes,
proportional in area to the class frequencies. When equal class
widths are used, as shown here, the heights of the rectangles
represent the frequencies. Thus, Figs. 1.1.3 and 1.1.4 are
obtained. The information conveyed by the two histograms seems to
be similar. The diagrams are almost symmetrical with a peak in the
class below 40 N/mm2 and a steady decrease on either side. This
type of diagram usually brings out any possible imperfections in
the data, such as the gaps at the ends. Further investigations are
required to understand the true nature of the population. More on
these aspects will follow in this and subsequent chapters.

3 There are alternatives, such as rounding (n + 1)/4 and (n + 1)(3/4)
to the nearest integers to calculate the locations of Q1 and Q3,
respectively. The rounding is upward or downward, respectively, when
the numbers fall exactly between two integers.

Fig. 1.1.3 Histogram for timber strength data with class width of
8 N/mm2 (horizontal axis: modulus of rupture, N/mm2, in classes 0 to
7.99, 8 to 15.99, and so on up to 72 to 79.99; vertical axis: relative
frequency).

1.1.4 Frequency polygon

A frequency
polygon is a useful diagnostic tool to determine the distribution
of a variable. It can be drawn by joining the midpoints of the tops
of the rectangles of a histogram after extending the diagram by one
class on both sides. We assume that equal class widths are used. If
the ordinates of a histogram are divided by the total number of
observations, then a relative frequency histogram is obtained.
Thus, the ordinates for each class denote the probabilities bounded
by 0 and 1, by which we simply mean the chances of occurrence. The
resulting diagram is called the relative frequency polygon.

Example 1.4. Timber strength. Corresponding to the histogram of Fig.
1.1.4, the values of class center are computed and a relative
frequency polygon is obtained; this is shown in Fig. 1.1.5.

Fig. 1.1.4 Histogram for timber strength data with class width of
5 N/mm2 (horizontal axis: modulus of rupture, N/mm2; vertical axis:
relative frequency).
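The relative and cumulative relative frequencies of Table 1.1.4, and the vertices of the corresponding relative frequency polygon, follow directly from the absolute frequencies. The sketch below is one way to compute them; the variable names are illustrative, and the polygon is closed by adding one empty class on each side as described in the text.

```python
from itertools import accumulate

# Absolute frequencies from Table 1.1.4 (classes of width 5 N/mm2 starting at 0).
width = 5.0
absolute = [1, 0, 0, 1, 9, 18, 26, 38, 34, 20, 9, 5, 0, 3, 1]
n = sum(absolute)

centers = [width * (k + 0.5) for k in range(len(absolute))]
relative = [f / n for f in absolute]
cumulative_pct = [100 * c / n for c in accumulate(absolute)]

# Polygon vertices: class centers joined in order, with one empty class
# added at each end so that the polygon returns to zero.
polygon = ([(centers[0] - width, 0.0)]
           + list(zip(centers, relative))
           + [(centers[-1] + width, 0.0)])

for c, rel, cum in zip(centers, relative, cumulative_pct):
    print(f"{c:5.1f}  {rel:.3f}  {cum:6.2f}")
```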

Fig. 1.1.5 Relative frequency polygon for timber strength data with
class width of 5 N/mm2 (horizontal axis: modulus of rupture, N/mm2;
vertical axis: relative frequency).

As the number of
observations becomes large, the class widths theoretically tend to
decrease and, in the limiting case of an infinite sample, a
relative frequency polygon becomes a frequency curve. This is in
fact a probability curve, which represents a mathematical
probability density function, abbreviated as pdf, of the
population.4

1.1.5 Cumulative relative frequency diagram

If a
cumulative sum is taken of the relative frequencies step by step
from the smallest class to the largest, then the line joining the
ordinates (cumulative relative frequencies) at the ends of the
class boundaries forms a cumulative relative frequency or
probability diagram. On the vertical axis of the graph, this line
gives the probabilities of nonexceedance of values shown on the
horizontal axis. In practice, this plot is made by utilizing and
displaying every item of data distinctly, without the necessity of
proceeding via a histogram and the restrictive categories that it
entails. For this purpose, one may simply determine (e.g., from the
ranked data of Table 1.1.3) the number of observations less than or
equal to each value and divide these numbers by the total number of
observations. This procedure is adopted here.5 Thus, the
probability diagram, as represented by the cumulative relative
frequency diagram, becomes an important practical tool. This
diagram yields the median and other quartiles directly. Also, one
can find the 9 values that divide the total frequency into 10 equal
parts called deciles and the so-called percentiles, where the pth
percentile is the value that is greater than p percent of the
observations. In general, it is possible to obtain the (n - 1) values
that divide the total frequency into n equal parts called the
quantiles. Hence a cumulative frequency polygon is also called a
quantile or Q-plot; a Q-plot, though, has quantiles on the vertical
axis, unlike a cumulative frequency diagram.

Example 1.5. Timber
strength. Figure 1.1.6 is the cumulative frequency diagram obtained
from the ranked timber strength data of Table 1.1.3 using each item
of data as just described.

4 This function is discussed in Chapter 3. One of the first tasks in
applying inferential statistics, as presented in Chapters 4 and 5,
will be to estimate the mathematical function from a finite sample
and examine its closeness to the histogram.

5 Further aspects of this subject, as related to
probability plots, are described in Chapter 5.
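The counting procedure behind a cumulative relative frequency diagram (the number of observations less than or equal to each value, divided by the total) can be sketched as follows. The 15 values are those of Table 1.1.2, used only to keep the example short. The function names are hypothetical, and `percentile_value` implements one possible convention (the smallest observation whose cumulative relative frequency reaches p percent), which is an assumption rather than the definition adopted in the text.

```python
import math
from bisect import bisect_right

# First 15 modulus-of-rupture values (N/mm2) from Table 1.1.2, in ascending order.
sample = sorted([29.11, 29.93, 32.02, 32.40, 33.06, 34.12, 35.58, 39.34,
                 40.53, 41.64, 45.54, 48.37, 48.78, 50.98, 65.35])


def nonexceedance(sorted_values, x):
    """Fraction of observations less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def percentile_value(sorted_values, p):
    """Smallest observation with cumulative relative frequency >= p percent."""
    m = len(sorted_values)
    k = max(1, math.ceil(p * m / 100))
    return sorted_values[k - 1]


# Ordinates of the cumulative relative frequency diagram, one per observation.
diagram = [(x, nonexceedance(sample, x)) for x in sample]

print(nonexceedance(sample, 40.0))     # probability of nonexceedance of 40 N/mm2
print(percentile_value(sample, 50))    # median of this short sample
```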

Fig. 1.1.6 Cumulative relative frequency diagram for timber strength
data (horizontal axis: modulus of rupture, N/mm2; vertical axis:
cumulative relative frequency).

The deciles and percentiles can be abstracted. By convention
a vertical probability or proportionality scale is used rather than
one giving percentages (except in duration curves, discussed
shortly). The 90th percentile, for instance, is 51 N/mm2
approximately and the value 40 N/mm2 has a probability of
nonexceedance of approximately 0.56. If the sample size increases
indefinitely, the cumulative relative frequency diagram will become a
distribution curve in the limit. This represents the population by
means of a (mathematical) distribution function, usually called a
cumulative distribution function, abbreviated to cdf, just as a
relative frequency polygon leads to a probability density function.
As a graphical method of ascertaining the distribution of the
population, the quantile plot can be drawn using a modified nonlinear
scale for the probabilities, which represents one of several types
of theoretical distributions.6 Also, as shown in Section 1.4, two
distributions can be compared using a Q-Q plot.

1.1.6 Duration curves

For the assessment of water resources and for associated
design and planning purposes, engineers find it useful to draw
duration curves. When dealing with flows in rivers, this type of
graph is known as a flow duration curve. It is in effect a cumulative
frequency diagram with specific time scales. The vertical axis can
represent, for example, the percentage of the time a flow is
exceeded; and in addition, the number of days per year or season
during which the flow is exceeded (or not) may be given. The volume
of flow per day is given on the horizontal axis. For some purposes,
the vertical and horizontal axes are interchanged as in a Q-plot.
One example of a practical use is the scaled area enclosed by the
curve, a horizontal line representing 100% of the time, and a
vertical line drawn at a minimum value of flow that it is desirable
to maintain in the river. This area represents the estimated
supplementary volume of water that should be diverted to the river
on an annual basis to meet such an objective.

Example 1.6. Streamflow
duration. Figure 1.1.7 gives the flow duration curve of the Dora
Riparia River in the Alpine region of northern Italy, calculated
over a period of 47 years from the records at Salbertrand gauging
station. This figure is drawn using the same procedure adopted
for a cumulative relative frequency diagram, such as Fig. 1.1.6.

6 This method is demonstrated in Section 5.8.

Fig. 1.1.7 Flow duration curve of Dora Riparia River at Salbertrand
in the Alpine region of Italy (horizontal axis: daily streamflow,
m3/s; left vertical axis: duration, days per year flow is exceeded;
right vertical axis: duration, percentage).
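A flow duration curve of this kind, and the area between two discharges that is used to estimate volumes of water, can be sketched as below. The daily flows are hypothetical values invented for illustration (they are not the Salbertrand record), and the function names are assumptions. The area under the empirical curve between two discharges is obtained here by averaging, over all days, the part of each flow that lies between those discharges.

```python
SECONDS_PER_DAY = 86_400


def days_exceeded(daily_flows, q, days_per_year=365):
    """Average number of days per year on which the flow exceeds q."""
    exceed = sum(1 for flow in daily_flows if flow > q)
    return days_per_year * exceed / len(daily_flows)


def volume_between(daily_flows, q_low, q_high, days_per_year=365):
    """Area under the duration curve between q_low and q_high, in m3 per year."""
    band = q_high - q_low
    mean_part = sum(min(max(flow - q_low, 0.0), band)
                    for flow in daily_flows) / len(daily_flows)
    return mean_part * days_per_year * SECONDS_PER_DAY


# Hypothetical daily streamflows in m3/s, for illustration only.
flows = [4.2, 5.1, 6.8, 7.5, 9.0, 10.4, 12.7, 15.3, 18.9, 24.6, 31.0, 42.5]

curve = [(q, days_exceeded(flows, q)) for q in range(0, 60, 10)]
for q, d in curve:
    print(f"{q:3d} m3/s exceeded on {d:6.1f} days per year")
print(f"volume between 10 and 20 m3/s: {volume_between(flows, 10, 20):.3e} m3/year")
```

With a real record, `flows` would hold every daily value over the full period of observation.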
For instance, suppose it is decided to divert a proportion of the
discharges above 10 m3/s and below 20 m3/s from the river. Then
the area bounded by the curve and the vertical lines drawn at these
discharges, using the vertical scale on the left-hand side, will
give the estimated maximum amount available for diversion during
the year in m3 after multiplication by the number of seconds in a
day. This area is hatched in Fig. 1.1.7. If such a decision were to
be implemented on a long-term basis, it would be essential to
use a long series of data and to estimate the distribution
function.

1.1.7 Summary of Section 1.1

In this section we have
introduced some of the basic graphical methods. Other procedures
such as stem-and-leaf plots and scatter diagrams are presented in
Sections 1.3 and 1.4, respectively. More advanced plots are
introduced in Chapters 5 and 6. In the next section we discuss
associated numerical methods.

1.2 NUMERICAL SUMMARIES OF DATA

Useful graphical procedures for presenting data and extracting
knowledge on variability and other properties were shown in
Section 1.1. There is a complementary method through which much of
the information contained in a data set can be represented
economically and conveyed or transmitted with greater precision. This
method utilizes a set of characteristic numbers to summarize the
data and highlight their main features. These numerical summaries
represent several important properties of the histogram and the
relative frequency polygon. The most important purpose of these
descriptive measures is for statistical inference, a role that
graphs cannot fulfill. Basically, there are three distinctive types:
measures of central tendency, of dispersion, and of asymmetry, all
of which can be visualized through the histogram as discussed in
Section 1.1. The additional measure of peakedness, that is, the
relative height of the peak, requires a large sample for its
estimation and is mainly relevant in the case of symmetric
distributions.

1.2.1 Measures of central tendency

Generally data from many natural systems, as well as those
devised by humans, tend to cluster around some values of variables.
A particular value, known as the central value, can be taken as a
representative of the sample. This feature is called central
tendency because the spread seems to take place about a center. The
definition of the central value is flexible, and its magnitude is
obtained through one of the measures of its location. There are
three such well-known measures: the mean, the mode, and the median.
The choice depends on the use or application of the central value.
The sample arithmetic mean is estimated from a sample of
observations, $x_1, x_2, \ldots, x_n$, as

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (1.2.1) $$

If
one uses a single number to represent the data, the sample mean
seems ideal for the purpose. After counting, this calculation is
the next basic step in statistics. For theoretical purposes the
mean is the most important numerical measure of location. As stated
in Section 1.1, if the sample size increases indefinitely a curve is
obtained from a frequency polygon; the mean is the centroid of the
area between this curve and the horizontal axis and it is thus the
balance point of the frequency curve. The population value of the
mean is denoted by $\mu$. We reiterate our definition of population with
reference to a phenomenon such as that represented by the timber
strength data of Table E.1.1. A population is the aggregate of
observations that might result by making an experiment in a
particular manner. The sample mean has a disadvantage because it
may sometimes be affected by unexpectedly high or low values,
called outliers. Such values do not seem to conform to the
distribution of the rest of the data. There may be physical reasons
for outliers. Their presence may be attributed to conditions that
have perhaps changed from what were assumed, or because the data
are generated by more than one process. On the other hand, they may
arise on account of errors of faulty instrumentation, measurement,
observation, or recording. The engineer must examine any visible
outliers and ascertain whether they are erroneous or whether their
inclusion is justifiable. The occurrence of any improbable value
requires careful scrutiny in practice, and this should be followed
by rectification or elimination if there are valid reasons for doing
so.

Example 1.7. Timber strength. A case in point is the value of
zero in the timber strength data of Table E.1.1. This value is
retained here for comparative purposes. The mean of the 165 items,
which is 39.09 N/mm2, becomes 39.33 N/mm2 without the value of
zero.

Example 1.8. Concrete test. Table E.1.2 is a list of the
densities and compressive strengths at 28 days from the results of
40 concrete cube test records conducted in Barton-on-Trent,
England, during the period 8 July 1991 to 21 September 1992, and
arranged in reverse chronological order. These have sample means of
2445 kg/m3 and 60.14 N/mm2, respectively. The two numbers are
measures of location representing the density and compressive
strength of concrete. With many discordant values at the extremes,
a trimmed mean, such as a 5% trimmed mean, may be calculated. For
this purpose, the data are ranked and the mean is obtained after
ignoring 5% of the observations from each of the two extremities
(see Problem 1.16).
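A trimmed mean of this kind might be computed as in the sketch below. The function name `trimmed_mean` is hypothetical, rounding the number of discarded observations downward is an assumption, and the strength values are invented for illustration (they are not taken from Tables E.1.1 or E.1.2).

```python
import math
import statistics


def trimmed_mean(values, fraction=0.05):
    """Mean after discarding `fraction` of the ranked observations at each end."""
    ranked = sorted(values)
    # Number discarded at each extremity; rounding down is an assumption.
    k = math.floor(fraction * len(ranked))
    kept = ranked[k:len(ranked) - k]
    return statistics.fmean(kept)


# Hypothetical strengths (N/mm2) with one discordant value at each extreme.
strengths = [0.0, 31.2, 33.5, 34.1, 35.8, 36.4, 37.0, 37.9, 38.3, 38.8,
             39.1, 39.6, 40.2, 40.9, 41.5, 42.3, 43.0, 44.7, 46.1, 70.2]

print(f"sample mean:     {statistics.fmean(strengths):.2f}")
print(f"5% trimmed mean: {trimmed_mean(strengths, 0.05):.2f}")
```

With 20 observations, a 5% trim removes exactly one value from each end.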
The technique of coding is sometimes used to facilitate
calculations when the data are given to several significant figures but
the digits are constant except for the last few. For example, the
densities in Table E.1.2 are higher than 2400 kg/m3 and less than
2500 kg/m3, so that the number 2400 can be subtracted from the
densities. The remainders will retain the essential characteristics
of the original set (apart from the enforced shift in the mean),
thus simplifying the arithmetic. In considering the entire data
set, a weighted mean is obtained if the variables of a sample are
multiplied by numbers called weights and then divided by the sum of
the weights. It is used if some variables should contribute more
(or less) to the average than others. The median is the central
value in an ordered set or the average of the two central values if
the number of values, n, is even, as specified in Section 1.1.
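The coding technique and the weighted mean described above can be illustrated with a short sketch. The function names are hypothetical, the five densities are the lowest values listed in Table 1.2.1, and the weights are invented purely for illustration.

```python
def coded_mean(values, offset):
    """Mean computed on values shifted by a constant, then shifted back."""
    coded = [x - offset for x in values]
    return offset + sum(coded) / len(coded)


def weighted_mean(values, weights):
    """Sum of weight times value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Five lowest densities (kg/m3) from Table 1.2.1; coding subtracts 2400.
densities = [2411, 2415, 2425, 2427, 2427]
print(coded_mean(densities, 2400))

# Hypothetical weights giving the last observation twice the influence.
print(weighted_mean([1.0, 2.0, 3.0], [1, 1, 2]))
```

Subtracting the constant changes only the location of the data, so the mean is recovered by adding the constant back.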
Example 1.9. Concrete test. The calculation of the median and other
measures of location will be greatly facilitated if the data are
arranged in order of magnitude. For example, the compressive
strengths of concrete given in Table E.1.2 are rewritten in
ascending order in Table 1.2.1. The median of these data is 60.1
N/mm2, which is the average of 60.0 and 60.2 N/mm2. The median of
the timber strength data of Table 1.1.3 is 39.05 N/mm2 , as noted
in the table. The median has an advantage over the mean. It is
relatively unaffected by outliers and is thus often referred to as
a resistant measure. For instance, the exclusion of the zero value
in Table 1.1.3 results only in a minor change of the median from
39.05 to 39.10 N/mm2 . One of the countless practical uses of the
median is the application of a disinfectant to many samples of
bacteria. Here, one seeks an association between the proportion of
bacteria destroyed and the strength of the disinfectant. The
concentration that kills 50% of the bacteria is the median dose.
This is termed LD50 (lethal dose for 50%) and provides an excellent
measure. The mode is the value that occurs most frequently. Quite
often the mode is not unique because two or more sets of values
have equal status. For this reason and for convenience, the mode is
often taken from the histogram or frequency polygon.

Example 1.10. Concrete test. For the ranked compressive strengths
of concrete in Table 1.2.1, the mode is 60.5 N/mm2.

Example 1.11. Timber strength. From Fig. 1.1.4, for example, the
mode of the timber strength data is 37.5 N/mm2, which corresponds
to the midpoint of the class with the highest frequency. However,
there is ambiguity in the choice of the class widths as already
noted. On the other hand, in Table 1.1.3 there are nine values in
the range 38.64 to 39.34 N/mm2, and thus 39 N/mm2 seems a more
representative value, but this problem can only be resolved
theoretically. As the sample size becomes indefinitely large, the
modal value will correspond to the peak of the relative frequency
curve on a theoretical basis.

The mode may often have greater practical significance than the
mean and the median. It becomes more useful as the asymmetry of the
distribution increases. For instance, if an engineer were to ask a
person who sits habitually on the banks of a river fishing to
indicate the mean level of the river, he or she is inclined to
point out the modal level. It is the value most likely to occur and
it is not affected by exceptionally high or low values.

Table 1.2.1 Ordered data of density and compressive strength of
concrete [a]

Order   Density (kg/m3)   Compressive strength (N/mm2)
  1         2411                  49.9
  2         2415                  50.7
  3         2425                  52.5
  4         2427                  53.2
  5         2427                  53.4
  6         2428                  54.4
  7         2429                  54.6
  8         2433                  55.8
  9         2435                  56.3
 10         2435                  56.7
 11         2436                  56.9
 12         2436                  57.8
 13         2436                  57.9
 14         2436                  58.8
 15         2437                  58.9
 16         2437                  59.0
 17         2441                  59.6
 18         2441                  59.8
 19         2444                  59.8
 20         2445                  60.0
 21         2445                  60.2
 22         2446                  60.5
 23         2447                  60.5
 24         2447                  60.5
 25         2448                  60.9
 26         2448                  60.9
 27         2449                  61.1
 28         2450                  61.5
 29         2454                  61.9
 30         2454                  63.3
 31         2455                  63.4
 32         2456                  64.9
 33         2456                  64.9
 34         2457                  65.7
 35         2458                  67.2
 36         2469                  67.3
 37         2471                  68.1
 38         2472                  68.3
 39         2473                  68.9
 40         2488                  69.5

[a] The original data sets are given in Table E.1.2.

Clearly, the deletion of the zero value from Table 1.1.3 does not
alter the mode, as we have also seen in the case of the median.
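The median and mode quoted for the concrete strengths, and the resistance of the median to an outlying value, can be reproduced with Python's standard `statistics` module. The sketch below uses the ranked strengths of Table 1.2.1; the added reading of 0.0 is a hypothetical faulty observation introduced only to show the effect.

```python
from statistics import median, multimode

# Compressive strengths (N/mm2) as ranked in Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]

med = median(strengths)        # average of the 20th and 21st values
modes = multimode(strengths)   # every value tied for the highest count

# Hypothetical faulty reading of 0.0 added to the sample.
with_outlier = [0.0] + strengths
med_out = median(with_outlier)
mean_shift = (sum(strengths) / len(strengths)
              - sum(with_outlier) / len(with_outlier))

print(med, modes, med_out, round(mean_shift, 2))
```

With the extra zero, the median moves by about 0.1 N/mm2 whereas the mean drops by well over 1 N/mm2, which illustrates why the median is called a resistant measure.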
These positive attributes of the mode and median notwithstanding,
the mean is indispensable for many theoretical purposes. Also in
the same class as the sample arithmetic
mean, there are two other measures of location that are used in
special situations. These are the harmonic and geometric means. The
harmonic mean is the reciprocal of the mean of the reciprocals.
Thus the harmonic mean for a sample of observations,
$x_1, x_2, \ldots, x_n$, is defined as

\[ \bar{x}_h = \frac{1}{(1/n)\left[(1/x_1) + (1/x_2) + \cdots + (1/x_n)\right]} . \tag{1.2.2} \]

It is applied in situations where the reciprocal of a variable is
averaged.

Example 1.12. Stream flow velocity. A practical example of
the harmonic mean is the determination of the mean velocity of a
stream based on measurements of travel times over a given reach of
the stream using a floating device. For instance, if three
velocities are calculated as 0.20, 0.24, and 0.16 m/s, then the
sample harmonic mean is

\[ \bar{x}_h = \frac{1}{(1/3)\left[(1/0.20) + (1/0.24) + (1/0.16)\right]} = 0.19 \text{ m/s}. \]

The geometric mean is used in averaging values that
represent a rate of change. Here the variable follows an
exponential, that is, a logarithmic law. For a sample of
observations, $x_1, x_2, \ldots, x_n$, the geometric mean is the
positive nth root of the product of the n values. This is the same
as the antilog of the mean of the logarithms:

\[ \bar{x}_g = (x_1 x_2 \cdots x_n)^{1/n} = \exp\left[\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right] = \prod_{i=1}^{n} x_i^{1/n} . \tag{1.2.3} \]

Example 1.13. Population growth. Consider the case of populations
of towns and cities that increase geometrically, which means that a
future increase is expected that is proportional to the current
population. Such information is invaluable for planning and
designing urban water supplies and sewerage systems. Suppose, for
example, that according to a census conducted in 1970 and again in
1990 the population of a city had increased from 230,000 to
310,000. An engineer needs to verify, for purposes of design, the
per capita consumption of water in the intermediate period and
hence tries to estimate the population in 1980. The central value
to use in this situation is the geometric mean of the two numbers,
which is

\[ \bar{x}_g = (230{,}000 \times 310{,}000)^{1/2} = 267{,}021. \]

(Note that the sample arithmetic mean $\bar{x}$ = 270,000.) As we
see in Example 1.13, the geometric mean is less than the arithmetic
mean.[7]

1.2.2 Measures of dispersion

Whereas a measure of central tendency is obtained by locating a
central or representative value, a measure of dispersion represents
the degree of scatter shown by observations or the inherent
variability in a phenomenon under observation. Dispersion also
indicates the precision of the data. One method of quantification
is through an order statistic, that is, one of ranked data.[8] The
simplest in the category is the range, which is the difference
between the largest and smallest values, as defined in Section 1.1.

[7] This theoretical property is demonstrated in Example 3.10.
[8] We shall discuss order statistics formally in Chapter 7; see
also Chapter 5.
Example 1.14. Timber strength. As noted before, the range of the
timber strength data of Table 1.1.3 is 70.22 - 0.00 = 70.22 N/mm2.

Example 1.15. Concrete test. For the compressive strengths of
concrete given in Table E.1.2 and ranked in Table 1.2.1, the range
is r = 69.5 - 49.9 = 19.6 N/mm2; the range of the concrete
densities is 2488 - 2411 = 77 kg/m3. These numbers provide a
measure of the spread of the data in each case.

The range, however, is a nondecreasing function of the sample size
and thus characterizes the population poorly. Moreover, the range is
unduly affected by high and low values that may be somewhat
incompatible with the rest of the data even though they may not
always be classified as outliers. For this reason, the
interquartile range, iqr, which is a relatively resistant measure,
is preferable. As defined in Section 1.1, in a ranked set of data
this is the difference between the median of the top half and the
median of the bottom half.

Example 1.16. Concrete test. For the compressive strengths of
concrete, the iqr is 6.55 N/mm2.

Example 1.17. Timber strength. The timber strength data in Table
1.1.3 have an iqr of 11.66 and 11.47 N/mm2, respectively, with or
without the zero value.

A similar and more general measure is given by the interval
between two symmetrical percentiles. For example, the 90-10
percentile range for the timber strength data is approximately
52 - 28 = 24 N/mm2 from Fig. 1.1.6.

The aforementioned measures of dispersion can be easily obtained.
However, their shortcoming is that, apart from two values or
numbers equivalent to them, the vast information usually found in a
sample of data is ignored. This criticism is not applicable if one
determines the average deviation about some central value, thus
including all the observations.
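The two order-based measures discussed so far, the range and the interquartile range, can be computed directly from ranked data. The following sketch applies the median-of-halves definition to the concrete strengths of Table 1.2.1. The function names are hypothetical, and leaving the central value out of both halves when n is odd is an assumed convention, not one taken from the text.

```python
from statistics import median


def sample_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)


def iqr_of_halves(data):
    """Median of the top half minus median of the bottom half."""
    x = sorted(data)
    if len(x) < 2:
        raise ValueError("need at least two observations")
    half = len(x) // 2
    lower = x[:half]
    upper = x[-half:]  # assumed: for odd n the central value is in neither half
    return median(upper) - median(lower)


# Compressive strengths (N/mm2) as ranked in Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]

r = sample_range(strengths)
iqr = iqr_of_halves(strengths)
print(round(r, 2), round(iqr, 2))
```

Both results depend on only two or four of the 40 observations, which is the shortcoming noted above.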
For example, the mean absolute deviation, denoted by d, measures
the average absolute deviation from the sample mean. For a sample
of observations, $x_1, x_2, \ldots, x_n$, it is defined as

\[ d = \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}|}{n} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} . \tag{1.2.4} \]

Example 1.18. Annual rainfall. If the annual rainfalls in a city
are 50, 56, 42, 53, and 49 cm over a 5-year period, the mean
absolute deviation with respect to the sample mean of 50 cm is
given by

\[ d = \tfrac{1}{5}\left(|50 - 50| + |56 - 50| + |42 - 50| + |53 - 50| + |49 - 50|\right) = 3.6 \text{ cm}. \]

This measure of dispersion is easily understood and practically
useful. However, it is valid only if the large and small deviations
are as significant as the average deviations. There are strong
theoretical reasons (as seen in Chapters 3, 4, and 5), on the other
hand, for using the sample standard deviation, denoted by s, which
is the root mean square deviation about the mean. Indeed, this is
the principal measure of dispersion (although the interquartile
range is meaningful and expedient). For a sample of observations,
$x_1, x_2, \ldots, x_n$, it is defined by

\[ s = \sqrt{\frac{1}{n}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} . \tag{1.2.5} \]

By expanding and summarizing the terms on the extreme right-hand
side,

\[ s = \sqrt{\frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2\right)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} . \tag{1.2.6} \]

Engineers will recognize that this measure is analogous to the
radius of gyration of a structural cross section. In contrast to
the mean absolute deviation, it is highly influenced by the largest
and smallest values. The standard deviation of the population is
denoted by $\sigma$. It is common practice to replace the divisor n
of Eq. (1.2.5) by (n - 1) and denote the left-hand side by
$\hat{s}$. Consequently, the estimate of the standard deviation is,
on average, closer to the population value because it is said to
have smaller bias. Therefore, Eq. (1.2.5) will, on average, give an
underestimate of $\sigma$ except in the rare case in which the
population mean $\mu$ is known.[9] The required modification to
Eq. (1.2.6) is as follows:

\[ \hat{s} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} x_i^2 - \frac{n}{n-1}\bar{x}^2} . \tag{1.2.7} \]

This
reduction in n can be justified by means of the concept of degrees
of freedom. It is a consequence of the fact that the sum of the n
deviations $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$
is zero, which follows from Eq. (1.2.1) for the mean. Hence,
regardless of the arrangement of the data, if any (n - 1) terms are
specified the remaining term is fixed or known, because

\[ x_n - \bar{x} = -\sum_{i=1}^{n-1} (x_i - \bar{x}). \]

It follows from this equation that one degree of freedom is lost in
defining the sample standard deviation. The concept of degrees of
freedom was introduced by the English statistician R. A. Fisher on
the analogy of a dynamical system in which the term denotes the
number of independent coordinate values necessary to determine the
system.
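The loss of one degree of freedom can be seen numerically. The sketch below uses the annual rainfall values of Example 1.18 and checks that the deviations from the sample mean sum to zero, so that the last deviation is fixed once the other n - 1 are known. The variable names are illustrative only.

```python
# Annual rainfalls (cm) from Example 1.18.
rainfall = [50, 56, 42, 53, 49]

n = len(rainfall)
x_bar = sum(rainfall) / n
deviations = [x - x_bar for x in rainfall]

# Knowing the first n - 1 deviations determines the last one.
first_n_minus_1 = deviations[:-1]
implied_last = -sum(first_n_minus_1)

print(deviations, sum(deviations), implied_last)
```

Only four of the five deviations are free to vary, which is the sense in which the divisor (n - 1) counts degrees of freedom.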
Example 1.19. Annual rainfall. From the annual rainfall data in
Example 1.18 (50, 56, 42, 53, and 49 cm), one can estimate the
standard deviation by using Eq. (1.2.5), as follows:

\[ s = \sqrt{\tfrac{1}{5}\left[(50 - 50)^2 + (56 - 50)^2 + (42 - 50)^2 + (53 - 50)^2 + (49 - 50)^2\right]} = \sqrt{\tfrac{1}{5}\left(0^2 + 6^2 + 8^2 + 3^2 + 1^2\right)} = \sqrt{\frac{110}{5}} = 4.69 \text{ cm}. \]

An alternative estimate of $\sigma$ (which is, on average, less
biased) is obtained using Eq. (1.2.7) as follows:

\[ \hat{s} = \sqrt{\frac{110}{4}} = 5.24 \text{ cm}. \]

[9] Terms such as bias are discussed formally in Section 5.2. It is
shown in Example 5.1 that $\hat{s}^2$ is unbiased; however,
$\hat{s}$ is known to have bias, though less than s on average.
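The hand calculations of Examples 1.18 and 1.19 can be reproduced as follows. This is a minimal sketch with hypothetical function names; the cross-check against `statistics.pstdev` (divisor n) and `statistics.stdev` (divisor n - 1) uses the Python standard library and is not part of the original examples.

```python
from math import sqrt
from statistics import pstdev, stdev


def mean_abs_deviation(x):
    """Eq. (1.2.4): average absolute deviation from the sample mean."""
    m = sum(x) / len(x)
    return sum(abs(xi - m) for xi in x) / len(x)


def std_divisor_n(x):
    """Eq. (1.2.5): root mean square deviation about the mean."""
    m = sum(x) / len(x)
    return sqrt(sum((xi - m) ** 2 for xi in x) / len(x))


def std_divisor_n_minus_1(x):
    """Eq. (1.2.7) in deviation form: divisor n - 1 instead of n."""
    m = sum(x) / len(x)
    return sqrt(sum((xi - m) ** 2 for xi in x) / (len(x) - 1))


rainfall = [50, 56, 42, 53, 49]  # cm, Example 1.18

d = mean_abs_deviation(rainfall)
s = std_divisor_n(rainfall)
s_hat = std_divisor_n_minus_1(rainfall)

print(round(d, 2), round(s, 2), round(s_hat, 2))
```

The standard deviation exceeds the mean absolute deviation here because squaring gives more weight to the two largest deviations (6 and 8 cm).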
Example 1.20. Timber strength. By using Eq. (1.2.7), the sample
standard deviation of the timber strength data of Table E.1.1 is
9.92 N/mm2 (or 9.46 N/mm2 if the zero value is excluded).

Example 1.21. Concrete test. By using Eq. (1.2.7), the sample
standard deviations for the density and compressive strength of
concrete in Table E.1.2 are 15.99 kg/m3 and 5.02 N/mm2,
respectively.

Dividing the standard deviation by the mean gives the dimensionless
measure of dispersion called the sample coefficient of variation, v:

\[ v = \frac{s}{\bar{x}} . \tag{1.2.8} \]

This is usually expressed as a percentage. The coefficient of
variation is useful in comparing different data sets with respect
to central location and dispersion.

Example 1.22. Comparison of timber and concrete strength data. From
the values of mean and standard deviation in Examples 1.7 and 1.20,
the sample coefficient of variation of the timber strength data is
25.3% (or 24.0% without the value of zero). Similarly, from
Examples 1.8 and 1.21 the density and compressive strength of
concrete data have sample coefficients of variation of 0.65 and
8.24%, respectively. The higher variation in the timber strength
data is a reflection of the variability of the natural material,
whereas the low variation in the density of the concrete is
evidence of a uniform quality in the constituents and a high
standard of workmanship, including care taken in mixing. The
variation in the compressive strength of concrete is higher than
that of its density. This can be attributed to random factors that
influence strength, such as some subtle changes in the
effectiveness of the concrete that do not alter its density.

From the square of the sample standard deviation one obtains the
sample variance, $s^2$, which is the mean of the squared deviations
from the mean. The population variance is denoted by $\sigma^2$.
The variance, like the mean, is important in theoretical
distributions. By squaring Eqs. (1.2.6) and (1.2.7), two estimators
of the population variance are found.
Here estimator refers to a method of estimating a constant in a
parent population. As in all the foregoing equations, this term
means the random variable of which the estimate is a realization.
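The distinction between an estimator and a single estimate can be made concrete by repeated sampling. The sketch below is a simulation under stated assumptions: a normal population with standard deviation 2 (so variance 4), samples of size 5, and 20,000 repetitions, all of which are hypothetical choices. It averages the two variance estimators obtained by squaring Eqs. (1.2.6) and (1.2.7).

```python
import random


def var_divisor_n(x):
    """Square of Eq. (1.2.6): divisor n."""
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)


def var_divisor_n_minus_1(x):
    """Square of Eq. (1.2.7): divisor n - 1."""
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)


rng = random.Random(12345)   # fixed seed so the run is repeatable
sigma = 2.0                  # assumed population standard deviation
n, trials = 5, 20000         # assumed sample size and number of samples

sum_n = 0.0
sum_n1 = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    sum_n += var_divisor_n(sample)
    sum_n1 += var_divisor_n_minus_1(sample)

avg_n = sum_n / trials       # expected to settle near (n - 1) / n * sigma**2
avg_n1 = sum_n1 / trials     # expected to settle near sigma**2

print(round(avg_n, 2), round(avg_n1, 2))
```

Under these assumptions the divisor-n average settles near 3.2 and the divisor-(n - 1) average near 4, the population variance.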
An unbiased estimator is obtained from Eq. (1.2.7) because on
average (that is, by repeated sampling) the estimator tends to the
population variance $\sigma^2$. In other words, the expectation E,
which is in effect the average from an infinite number of
observations, of the square of the right-hand side of Eq. (1.2.7)
is equal to $\sigma^2$. There
are also measures of dispersion pertaining to the mean of the
deviations between the observations. Gini's mean difference, for
example, is a long-standing method.[10] This is given by

\[ g = \frac{2}{n(n-1)} \sum_{i>j} \sum_{j=1}^{n} \left[x_{(i)} - x_{(j)}\right], \tag{1.2.9} \]

in which the observations $x_1, x_2, \ldots, x_n$ are arranged in
ascending order.

[10] See, for example, Stuart and Ord (1994, p. 58) for more
details of this method originated by the Italian mathematician,
Gini. See also Problem 1.7.
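Equation (1.2.9) averages the differences over all pairs of ordered observations. A short sketch, using the rainfall data of Example 1.18 as an illustration (the text does not work this case), computes it by direct enumeration of pairs and cross-checks the result against an equivalent single-pass form based on the ranks. Both function names are hypothetical.

```python
from itertools import combinations


def gini_mean_difference(data):
    """Eq. (1.2.9): mean of x(i) - x(j) over all pairs with i > j."""
    x = sorted(data)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    # combinations() on a sorted list yields (smaller, larger) pairs.
    total = sum(b - a for a, b in combinations(x, 2))
    return 2.0 * total / (n * (n - 1))


def gini_mean_difference_ranked(data):
    """Same quantity via the weights (2i - n - 1) on the ordered values."""
    x = sorted(data)
    n = len(x)
    total = sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1))
    return 2.0 * total / (n * (n - 1))


rainfall = [50, 56, 42, 53, 49]  # cm, Example 1.18
g = gini_mean_difference(rainfall)
g_ranked = gini_mean_difference_ranked(rainfall)
print(g, g_ranked)
```

The pairwise form follows the equation literally; the ranked form avoids the double loop and may be preferable for large samples.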
1.2.3 Measure of asymmetry

Another important property of the histogram or frequency polygon is
its shape with respect to symmetry (on either side of the mode).
The sample coefficient of skewness measures the asymmetry of a set
of data about its mean. For a sample of observations,
$x_1, x_2, \ldots, x_n$, it is defined as

\[ g_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3} . \tag{1.2.10} \]

Division by the cube of the sample standard deviation gives a
dimensionless measure. A histogram is said to have positive
skewness if it has a longer tail on the right, which is toward
increasing values, than on the left. In this case the number of
values less than the mean is greater than the number that exceeds
the mean. Many natural phenomena tend to have this property. For a
positively skewed histogram, mode < median < mean. This inequality
is reversed if skewness is negative. A symmetrical histogram
suggests zero skewness.

Example 1.23. Comparison of timber and concrete strength data. The
coefficients of skewness of the timber strength data of Table E.1.1
and the compressive strength data of Table E.1.2 are 0.15 (or 0.53
after excluding the zero value) and 0.03, respectively. These
indicate a small skewness in the first case and a symmetrical
distribution in the second case. The example indicates that this
measure of
skewness is sensitive to the tails of the distribution.

1.2.4 Measure of peakedness

The extent of the relative steepness of ascent in the vicinity and
on either side of the mode in a histogram or frequency polygon is
said to be a measure of its peakedness or tail weight. This is
quantified by the dimensionless sample coefficient of kurtosis,
which is defined for a sample of observations,
$x_1, x_2, \ldots, x_n$, by

\[ g_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} . \tag{1.2.11} \]

Example 1.24. Comparison of timber and concrete strength data. The
kurtosis of the timber strength data of Table E.1.1 is 4.46 (or
3.57 without the zero value) and that of the compressive strengths
of Table E.1.2 is 2.33. One can easily see from Eq. (1.2.11) that
even a small variation in one of the items of data may influence
the kurtosis significantly. This observation warrants a large
sample size, perhaps 200 or greater, for the estimation of the
kurtosis. Small sample sizes, particularly in the second case with
n = 40, preclude the attachment of any special significance to
these estimates.

1.2.5 Summary of Section 1.2

Of the numerical summaries listed here, the mean, standard
deviation, and coefficient of skewness are the best representative
measures of the histogram or frequency polygon, from both visual
and theoretical aspects. These provide economical measures for
summarizing the information in a data set. Sample estimates for the
data we have been discussing here, including the coefficients of
variation and kurtosis, are given in Table 1.2.2.
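The coefficients of skewness and kurtosis of Eqs. (1.2.10) and (1.2.11) can be sketched in terms of central moments. The code below takes s to be the divisor-n standard deviation of Eq. (1.2.5), which is an assumption about how the symbol is to be read, and the function names are hypothetical. The rainfall data of Example 1.18 serve only as a small illustration; with five values the estimates carry no special significance.

```python
from math import sqrt


def central_moment(x, k):
    """Average of the k-th powers of the deviations from the mean."""
    m = sum(x) / len(x)
    return sum((xi - m) ** k for xi in x) / len(x)


def skewness(x):
    """Eq. (1.2.10), with s assumed to use the divisor n."""
    s = sqrt(central_moment(x, 2))
    return central_moment(x, 3) / s ** 3


def kurtosis(x):
    """Eq. (1.2.11), with s assumed to use the divisor n."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2


rainfall = [50, 56, 42, 53, 49]   # cm, Example 1.18
symmetric = [1, 2, 3, 4, 5]       # hypothetical symmetric sample

g1 = skewness(rainfall)
g2 = kurtosis(rainfall)
print(round(g1, 3), round(g2, 3), skewness(symmetric), kurtosis(symmetric))
```

The single low value of 42 cm is enough to make the rainfall skewness negative, which echoes the remark that these measures are sensitive to the tails.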

Table 1.2.2 Sample estimates of numerical summaries of the timber
strength data of Table 1.1.3 and the concrete strength and density
data of Table 1.2.1

Data set                                Sample  Mean [a]  Standard       Coefficient of  Coefficient  Coefficient
                                        size              deviation [a]  variation (%)   of skewness  of kurtosis
Estimated by equation                           1.2.1     1.2.7          1.2.8           1.2.10       1.2.11
Timber strength, full sample            165     39.09     9.92           25.3            0.15         4.46
Timber strength without the zero value  164     39.33     9.46           24.0            0.53         3.57
Compressive strength of concrete        40      60.14     5.02           8.35            0.03         2.33
Density of concrete                     40      2445      15.99          0.65            0.38         3.15

[a] Units for strength are N/mm2; units for density are kg/m3.

1.3 EXPLORATORY METHODS

Some graphical displays
are used when one does not have any specific questions in mind
before examining a data set. These methods were appropriately
called exploratory data analysis by Tukey (1977). Among such
procedures the box plot is advantageous, and the stem-and-leaf plot
is also a valuable tool.

1.3.1 Stem-and-leaf plot

The histogram is a highly effective graphical procedure for showing
various characteristics of data as seen in Section 1.1. However,
for smaller samples, less than, say, 40 in size, it may not give a
clear indication of the variability and other properties of the
data. The stem-and-leaf plot, which resembles a histogram turned
through a right angle, is a useful procedure in such cases. Its
advantage is that the data are grouped without loss of information
because the magnitudes of all the values are presented.
Furthermore, its intrinsic tabular form highlights extreme values
and other characteristics that a histogram may obscure. As in a
histogram, the data are initially ranked in ascending order but a
different approach is adopted in finding the number of classes. The
class widths are almost invariably equal. For the increments or
class intervals (and hence class widths) one uses 0.5, 1, or 2
multiplied by a power of 10, which means that the intervals are in
units such as 0.1 or 200 or 10,000, which are more tractable than,
say, 0.13 or 140 or 12,000. The terminology is best explained
through the following worked example.
Example 1.25. Concrete test. For the concrete strength data of
Table E.1.2, the maximum and minimum values are 69.5 and 49.9
N/mm2, respectively. As a first choice, the data can be divided
into 21 classes in intervals of 1 N/mm2 with lower boundaries at
49, 50, 51 N/mm2, and so on, up to 69 N/mm2. For the ordered
stem-and-leaf plot of Fig. 1.3.1, a vertical line is drawn with the
class boundaries marked in increasing order immediately to its
left. The boundary values are called the leading digits and,
together with the vertical line, constitute the stem. The trailing
digits on the right represent the items of data in increasing order
when read jointly with the leading digits using the indicated
units. They are termed leaves, and their counts are the class
frequencies. Thus the digits 49 (stem) and 9 (leaf) constitute
49.9. It is useful to provide an additional column at the extreme
left, as shown here, giving the cumulative frequencies (called
depths) up to each class. This is completed firstly by starting at
the top and totaling downward to the line containing the median,
for which the individual frequency is given in parentheses, and
secondly by starting at the bottom and totaling upward to the line
containing the median.

   1   49 | 9
   2   50 | 7
   2   51 |
   3   52 | 5
   5   53 | 2 4
   7   54 | 4 6
   8   55 | 8
  11   56 | 3 7 9
  13   57 | 8 9
  15   58 | 8 9
  19   59 | 0 6 8 8
  (7)  60 | 0 2 5 5 5 9 9
  14   61 | 1 5 9
  11   62 |
  11   63 | 3 4
   9   64 | 9 9
   7   65 | 7
   6   66 |
   6   67 | 2 3
   4   68 | 1 3 9
   1   69 | 5

Fig. 1.3.1 Stem-and-leaf plot for compressive strengths of concrete
in Table E.1.2; units for stem: 1 N/mm2; units for leaves: 0.1
N/mm2.

The diagram gives all the information in the data, which is its
main advantage. Furthermore, the range, median, symmetry, or gaps
in the data, frequently occurring values, and any possible outliers
can be highlighted. In this example, a symmetrical distribution is
indicated. The plot may
be redrawn with a smaller number of classes, perhaps for greater
clarity, using the guidelines for choosing the intervals stipulated
previously. The units of data in a plot can be rounded to any
number of significant figures as necessary. Also, the number of
stems in a plot can be doubled by dividing each stem into two
lines. When 1 multiplied by a power of 10 is used as an interval,
for example, the first line, which is denoted by an asterisk (*),
will thus have leaves 0 to 4, and the leaves of the second,
represented by a period (.), will be from 5 to 9. Likewise, one may
divide a stem into five lines. The stem-and-leaf plot is best
suited for small to moderate sample sizes, say, less than 200.
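The construction of an ordered stem-and-leaf plot lends itself to a short program. The sketch below rebuilds the stems and leaves of Fig. 1.3.1 from the concrete strengths of Table 1.2.1, using a stem unit of ten leaf units. The function name `stem_and_leaf` and its `leaf_unit` parameter are hypothetical, the data are assumed non-negative, and the depth column is omitted for brevity.

```python
from collections import defaultdict


def stem_and_leaf(data, leaf_unit=0.1):
    """Return the rows of an ordered stem-and-leaf plot as strings.

    Each value is rounded to the nearest leaf unit; the last digit is
    the leaf and the remaining leading digits form the stem.
    """
    scaled = sorted(round(x / leaf_unit) for x in data)
    leaves = defaultdict(list)
    for v in scaled:
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(scaled[0] // 10, scaled[-1] // 10 + 1):
        digits = " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}".rstrip())
    return rows


# Compressive strengths (N/mm2) as ranked in Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]

rows = stem_and_leaf(strengths)
print("\n".join(rows))
```

Keeping the empty stems (51, 62, and 66 here) is a deliberate choice, since gaps in the data are one of the features the plot is meant to reveal.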

[Figure 1.3.2: box plots on a vertical strength axis (N/mm2) from
10 to 80. Left, timber strength excluding the zero value: minimum
17.98, quartiles 33.10, 39.10, and 44.57, maximum 70.22, critical
values for detecting outliers 15.89 and 61.78, other high values
65.35, 65.61, and 69.07. Right, compressive strength of concrete:
minimum 49.9, quartiles 56.8, 60.1, and 63.4, maximum 69.5,
critical values for detecting outliers 46.97 and 73.23.]

Fig. 1.3.2 Box plots for timber strength and compressive strength
of concrete data from Tables 1.1.3 and 1.2.1.

1.3.2 Box plot

Another plot that is highly useful in data presentation is the
plot Another plot that is highly useful in data presentation is the
box plot, which displays the three quartiles, Q1, Q2, Q3, on a
rectangular box aligned either horizontally or vertically. The box,
together with the minimum and maximum values, which are shown at
the ends of lines extended on either side of the box from the
midpoints of its extremities, constitutes the box-and-whiskers
plot, as it is sometimes called. The numerical signposts are
arranged as follows from top to bottom: minimum, Q1, Q2, Q3, and
maximum. Together they constitute a five-number summary. The
minimum and
maximum values may be replaced by the 5th and 95th (or other
extreme) percentiles or supplemented by these and additional
extreme values. These plots play an important role in comparing two
or more samples. The width of the box is made proportional to the
sample size in such cases, if they are different.

Example 1.26. Comparison of timber and concrete strength data. Let
us use a box plot to compare the strengths of two representative
materials used by civil engineers. Figure 1.3.2 shows the timber
strength data ranked in Table 1.1.3, with the zero value excluded,
and the compressive strength of concrete data that were ranked in
Table 1.2.1. The box plot of compressive strengths of concrete
shown on the right strongly indicates symmetry in their
distribution. In the case of the timber strength data, the box is
less symmetrical, and there are clear signs of positive skewness
because the line connecting the highest value to the box is longer
than that connecting the lowest value to the box.

Empirical rules have been devised to detect outliers by means of
box plots. As previously stated, this term signifies an excessive
discordance with reference to an assumed distribution to which the
majority of observations belong. One such procedure identifies
outliers as those values lying more than 1.5 iqr above the third
quartile or more than 1.5 iqr below the first quartile.

Example 1.27. Comparison of timber and concrete strength data. The
iqrs for the timber strength and compressive strength of concrete
data are 11.47 and 6.55 N/mm2, and thus the two critical distances
for detecting outliers are 17.21 and 9.83 N/mm2, respectively.
These distances are set out on either side from the extremities of
the boxes and are shown by thick horizontal lines in Fig. 1.3.2. By
this rule, the concrete data do not have any outliers, whereas
there are four outliers beyond the demarcating line for high
outliers in the timber strength data of Table 1.1.3. These are the
values 65.35, 65.61, 69.07, and 70.22 N/mm2. At the other
extremity, there is the zero value that was discarded before the
diagram was drawn. When such an observation is recorded one should
verify whether it stems from a faulty calibration or other source
of error; it is clearly an outlier by the method described
here.[11]

1.3.3 Summary of Section 1.3

In general, box plots are
helpful in highlighting distributional features, including the
range and many of