737

Applied statistics

Embed Size (px)

Citation preview

  1. 1. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 APPLIED STATISTICS FOR CIVIL AND ENVIRONMENTAL ENGINEERS i
  2. 2. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 ii
  3. 3. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 APPLIED STATISTICS FOR CIVIL AND ENVIRONMENTAL ENGINEERS Second Edition Nathabandu T. Kottegoda Department of Hydraulic, Environmental, and Surveying Engineering Politecnico di Milano, Italy Renzo Rosso Department of Hydraulic, Environmental, and Surveying Engineering Politecnico di Milano, Italy iii
  4. 4. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 This edition rst published 2008 C 2008 by Blackwell Publishing Ltd and 1997 by The McGraw-Hill Companies, Inc. Blackwell Publishing was acquired by John Wiley & Sons in February 2007. Blackwells publishing programme has been merged with Wileys global Scientic, Technical, and Medical business to form Wiley-Blackwell. Registered ofce John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom Editorial ofce 9600 Garsington Road, Oxford, OX4 2DQ, United Kingdom For details of our global editorial ofces, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell. The right of the author to be identied as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought. ISBN: 978-1-4051-7917-1 Library of Congress Cataloging-in-Publication Data Kottegoda, N. T. Applied statistics for civil and environmental engineers / Nathabandu T. Kottegoda, Renzo Rosso. 2nd ed. p. cm. Prev. ed. published as: Statistics, probability, and reliability for civil and environmental engineers. New York : McGraw-Hill, c1997. Includes bibliographical references and index. ISBN-13: 978-1-4051-7917-1 (hardback : alk. paper) ISBN-10: 1-4051-7917-1 (hardback : alk. paper) 1. Civil engineeringStatistical methods. 2. Environmental engineeringStatistical methods. 3. Probabilities. I. Rosso, Renzo. II. Kottegoda, N. T. Statistics, probability, and reliability for civil and environmental engineers. III. Title. TA340.K67 2008 519.502 4624dc22 2007047496 A catalogue record for this book is available from the British Library. Set in 10/12pt Times by Aptara Inc., New Delhi, India Printed in Singapore by Utopia Press Pte Ltd 1 2008 iv
  5. 5. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Contents Dedication xiii Preface to the First Edition xiv Preface to the Second Edition xvi Introduction 1 1 Preliminary Data Analysis 3 1.1 Graphical Representation 3 1.1.1 Line diagram or bar chart 4 1.1.2 Dot diagram 4 1.1.3 Histogram 5 1.1.4 Frequency polygon 8 1.1.5 Cumulative relative frequency diagram 9 1.1.6 Duration curves 10 1.1.7 Summary of Section 1.1 11 1.2 Numerical Summaries of Data 11 1.2.1 Measures of central tendency 12 1.2.2 Measures of dispersion 15 1.2.3 Measure of asymmetry 19 1.2.4 Measure of peakedness 19 1.2.5 Summary of Section 1.2 19 1.3 Exploratory Methods 20 1.3.1 Stem-and-leaf plot 20 1.3.2 Box plot 22 1.3.3 Summary of Section 1.3 23 1.4 Data Observed in Pairs 23 1.4.1 Correlation and graphical plots 23 1.4.2 Covariance and the correlation coefcient 24 1.4.3 Q-Q plots 26 1.4.4 Summary of Section 1.4 27 1.5 Summary for Chapter 1 27 References 28 Problems 29 2 Basic Probability Concepts 38 2.1 Random Events 39 2.1.1 Sample space and events 39 2.1.2 The null event, intersection, and union 41 2.1.3 Venn diagram and event space 43 2.1.4 Summary of Section 2.1 49 v
  6. 6. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 vi Contents 2.2 Measures of Probability 50 2.2.1 Interpretations of probability 50 2.2.2 Probability axioms 52 2.2.3 Addition rule 53 2.2.4 Further properties of probability functions 55 2.2.5 Conditional probability and multiplication rule 56 2.2.6 Stochastic independence 61 2.2.7 Total probability and Bayes theorems 65 2.2.8 Summary of Section 2.2 72 2.3 Summary for Chapter 2 72 References 73 Problems 74 3 Random Variables and Their Properties 83 3.1 Random Variables and Probability Distributions 83 3.1.1 Random variables 83 3.1.2 Probability mass function 84 3.1.3 Cumulative distribution function of a discrete random variable 85 3.1.4 Probability density function 86 3.1.5 Cumulative distribution function of a continuous random variable 88 3.1.6 Summary of Section 3.1 90 3.2 Descriptors of Random Variables 90 3.2.1 Expectation and other population measures 90 3.2.2 Generating functions 99 3.2.3 Estimation of parameters 103 3.2.4 Summary of Section 3.2 112 3.3 Multiple Random Variables 112 3.3.1 Joint probability distributions of discrete variables 113 3.3.2 Joint probability distributions of continuous variables 118 3.3.3 Properties of multiple variables 124 3.3.4 Summary of Section 3.3 132 3.4 Associated Random Variables and Probabilities 132 3.4.1 Functions of a random variable 133 3.4.2 Functions of two or more variables 135 3.4.3 Properties of derived variables 143 3.4.4 Compound variables 151 3.4.5 Summary of Section 3.4 154 3.5 Copulas 154 3.6 Summary for Chapter 3 157 References 157 Problems 160 4 Probability Distributions 165 4.1 Discrete Distributions 165 4.1.1 Bernoulli distribution 166 4.1.2 Binomial distribution 167 4.1.3 Poisson distribution 171 4.1.4 Geometric and negative binomial distributions 181
  7. 7. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Contents vii 4.1.5 Log-series distribution 185 4.1.6 Multinomial distribution 187 4.1.7 Hypergeometric distribution 189 4.1.8 Summary of Section 4.1 192 4.2 Continuous Distributions 194 4.2.1 Uniform distribution 194 4.2.2 Exponential distribution 196 4.2.3 Erlang and gamma distribution 200 4.2.4 Beta distribution 203 4.2.5 Weibull distribution 205 4.2.6 Normal distribution 209 4.2.7 Lognormal distribution 215 4.2.8 Summary of Section 4.2 217 4.3 Multivariate Distributions 217 4.3.1 Bivariate normal distribution 219 4.3.2 Other bivariate distributions 222 4.4 Summary for Chapter 4 222 References 223 Problems 224 5 Model Estimation and Testing 230 5.1 A Review of Terms Related to Random Sampling 230 5.2 Properties of Estimators 231 5.2.1 Unbiasedness 231 5.2.2 Consistency 232 5.2.3 Minimum variance 232 5.2.4 Efciency 234 5.2.5 Sufciency 234 5.2.6 Summary of Section 5.2 235 5.3 Estimation of Condence Intervals 236 5.3.1 Condence interval estimation of the mean when the standard deviation is known 236 5.3.2 Condence interval estimation of the mean when the standard deviation is unknown 239 5.3.3 Condence interval for a proportion 242 5.3.4 Sampling distribution of differences and sums of statistics 242 5.3.5 Interval estimation for the variance: chi-squared distribution 243 5.3.6 Summary of Section 5.3 247 5.4 Hypothesis Testing 247 5.4.1 Procedure for testing 248 5.4.2 Probabilities of Type I and Type II errors and the power function 254 5.4.3 Neyman-Pearson lemma 256 5.4.4 Tests of hypotheses involving the variance 257 5.4.5 The F distribution and its use 258 5.4.6 Summary of Section 5.4 259 5.5 Nonparametric Methods 260 5.5.1 Sign test applied to the median 261 5.5.2 Wilcoxon signed-rank test for association of paired observations 262
  8. 8. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 viii Contents 5.5.3 Kruskal-Wallis test for paired observations in k samples 264 5.5.4 Tests on randomness: runs test 267 5.5.5 Spearmans rank correlation coefcient 268 5.5.6 Summary of Section 5.5 269 5.6 Goodness-of-Fit Tests 270 5.6.1 Chi-squared goodness-of-t test 271 5.6.2 Kolmogorov-Smirnov goodness-of-t test 273 5.6.3 Kolmogorov-Smirnov two-sample test 274 5.6.4 Anderson-Darling goodness-of-t test 277 5.6.5 Other methods for testing the goodness-of-t to a normal distribution 281 5.6.6 Summary of Section 5.6 282 5.7 Analysis of Variance 283 5.7.1 One-way analysis of variance 284 5.7.2 Two-way analysis of variance 288 5.7.3 Summary of Section 5.7 294 5.8 Probability Plotting Methods and Visual Aids 295 5.8.1 Probability plotting for uniform distribution 296 5.8.2 Probability plotting for normal distribution 297 5.8.3 Probability plotting for Gumbel or EV1 distribution 300 5.8.4 Probability plotting of other distributions 301 5.8.5 Visual tting methods based on the histogram 303 5.8.6 Summary of Section 5.8 305 5.9 Identication and Accommodation of Outliers 305 5.9.1 Hypothesis tests 306 5.9.2 Test statistics for detection of outliers 307 5.9.3 Dealing with nonnormal data 309 5.9.4 Estimation of probabilities of extreme events when outliers are present 311 5.9.5 Summary of Section 5.9 312 5.10 Summary of Chapter 5 312 References 313 Problems 316 6 Methods of Regression and Multivariate Analysis 326 6.1 Simple Linear Regression 327 6.1.1 Estimates of the parameters 328 6.1.2 Properties of the estimators and errors 332 6.1.3 Tests of signicance and condence intervals 337 6.1.4 The bivariate normal model and correlation 339 6.1.5 Summary of Section 6.1 342 6.2 Multiple Linear Regression 342 6.2.1 Formulation of the model 343 6.2.2 Linear least squares solutions using the matrix method 343 6.2.3 Properties of least squares estimators and error variance 346 6.2.4 Model testing 350 6.2.5 Model adequacy 355 6.2.6 Residual plots 356 6.2.7 Inuential observations and outliers in regression 358 6.2.8 Transformations 365
  9. 9. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Contents ix 6.2.9 Condence intervals on mean response and prediction 366 6.2.10 Ridge regression 368 6.2.11 Other methods and discussion of Section 6.2 370 6.3 Multivariate Analysis 373 6.3.1 Principal components analysis 373 6.3.2 Factor analysis 379 6.3.3 Cluster analysis 383 6.3.4 Other methods and summary of Section 6.3 385 6.4 Spatial Correlation 386 6.4.1 The estimation problem 387 6.4.2 Spatial correlation and the semivariogram 387 6.4.3 Some semivariogram models and physical aspects 389 6.4.4 Spatial interpolations and Kriging 391 6.4.5 Summary of Section 6.4 394 6.5 Summary of Chapter 6 394 References 395 Problems 398 7 Frequency Analysis of Extreme Events 405 7.1 Order Statistics 406 7.1.1 Denitions and distributions 406 7.1.2 Functions of order statistics 409 7.1.3 Expected value and variance of order statistics 411 7.1.4 Summary of Section 7.1 415 7.2 Extreme Value Distributions 415 7.2.1 Basic concepts of extreme value theory 415 7.2.2 Gumbel distribution 422 7.2.3 Frechet distribution 429 7.2.4 Weibull distribution as an extreme value model 432 7.2.5 General extreme value distribution 435 7.2.6 Contagious extreme value distributions 439 7.2.7 Use of other distributions as extreme value models 445 7.2.8 Summary of Section 7.2 450 7.3 Analysis of Natural Hazards 453 7.3.1 Floods, storms, and droughts 453 7.3.2 Earthquakes and volcanic eruptions 461 7.3.3 Winds 465 7.3.4 Sea levels and highest sea waves 470 7.3.5 Summary of Section 7.3 473 7.4 Summary of Chapter 7 474 References 474 Problems 478 8 Simulation Techniques for Design 487 8.1 Monte Carlo Simulation 488 8.1.1 Statistical experiments 488 8.1.2 Probability integral transform 493 8.1.3 Sample size and accuracy of Monte Carlo experiments 495 8.1.4 Summary for Section 8.1 501 8.2 Generation of Random Numbers 501
  10. 10. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 x Contents 8.2.1 Random outcomes from standard uniform variates 501 8.2.2 Random outcomes from continuous variates 506 8.2.3 Random outcomes from discrete variates 511 8.2.4 Random outcomes from jointly distributed variates 513 8.2.5 Summary of Section 8.2 514 8.3 Use of Simulation 514 8.3.1 Distributions of derived design variates 514 8.3.2 Sampling statistics 517 8.3.3 Simulation of time- or space-varying systems 519 8.3.4 Design alternatives and optimal design 524 8.3.5 Summary of Section 8.3 530 8.4 Sensitivity and Uncertainty Analysis 530 8.5 Summary and Discussion of Chapter 8 531 References 531 Problems 533 9 Risk and Reliability Analysis 541 9.1 Measures of Reliability 542 9.1.1 Factors of safety 542 9.1.2 Safety margin 547 9.1.3 Reliability index 550 9.1.4 Performance function and limiting state 558 9.1.5 Further practical solutions 568 9.1.6 Summary of Section 9.1 577 9.2 Multiple Failure Modes 577 9.2.1 Independent failure modes 578 9.2.2 Mutually dependent failure modes 584 9.2.3 Summary of Section 9.2 592 9.3 Uncertainty in Reliability Assessments 592 9.3.1 Reliability limits 592 9.3.2 Bayesian revision of reliability 593 9.3.3 Summary of Section 9.3 597 9.4 Temporal Reliability 597 9.4.1 Failure process and survival time 597 9.4.2 Hazard function 602 9.4.3 Reliable life 605 9.4.4 Summary of Section 9.4 606 9.5 Reliability-Based Design 606 9.6 Summary for Chapter 9 612 References 613 Problems 615 10 Bayesian Decision Methods and Parameter Uncertainty 623 10.1 Basic Decision Theory 624 10.1.1 Bayes rules 624 10.1.2 Decision trees 627 10.1.3 The minimax solution 630 10.1.4 Summary of Section 10.1 632 10.2 Posterior Bayesian Decision Analysis 632 10.2.1 Subjective probabilities 633
  11. 11. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Contents xi 10.2.2 Loss and utility functions 634 10.2.3 The discrete case 635 10.2.4 Inference with conditional binomial and prior beta 636 10.2.5 Poisson hazards and gamma prior 638 10.2.6 Inferences with normal distribution 639 10.2.7 Likelihood ratio testing 642 10.2.8 Summary of Section 10.2 643 10.3 Markov Chain Monte Carlo Methods 643 10.4 James-Stein Estimators 650 10.5 Summary and Discussion of Chapter 10 653 References 653 Problems 656 Appendix A: Further mathematics 659 A.1 Chebyshev Inequality 659 A.2 Convex Function and Jensen Inequality 659 A.3 Derivation of the Poisson distribution 659 A.4 Derivation of the normal distribution 660 A.5 MGF of the normal distribution 661 A.6 Central limit theorem 662 A.7 Pdf of Students T distribution 663 A.8 Pdf of the F distribution 664 A.9 Wilcoxon signed-rank test: mean and variance of the test statistic 664 A.10 Spearmans rank correlation coefcient 665 Appendix B: Glossary of Symbols 667 Appendix C: Tables of Selected Distributions 673 Appendix D: Brief Answers to Selected Problems 684 Appendix E: Data Lists 687 Index 707
  12. 12. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 xii
  13. 13. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Dedication To my parents. To estimate the debt I owe them requires a lifespan of nibbanic extent. To Mali, Shani, Siraj, and Natasha. N.T.K. A mamma Aria, a Donatella, ai due Riccardi della mia vita e al nostro indimenticabile Rufus. R.R. xiii
  14. 14. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Preface to the First Edition Statistics, probability, and reliability are subject areas that are not commonly easy for stu- dents of civil and environmental engineering. Such difculties notwithstanding, a greater emphasis is currently being made on the teaching of these methods throughout institutions of higher learning. Many professors with whom we have spoken have expressed the need for a single textbook of sufcient breadth and clarity to cover these topics. One might ask why it is necessary to write a new book specically for civil and envi- ronmental engineers. Firstly, we see a particular importance of statistical and associated methods in our disciplines. For example, some modes of failure, interactions, probability distributions, outliers, and spatial relationships that one encounters are unique and require different approaches. Secondly, colleagues have said that existing books are either old and outdated or omit particularly important engineering problems, emphasizing instead areas that may not be directly relevant to the practitioner. We set ourselves several objectives in writing this book. First, it was necessary to update much of the older material, which have rightly stood for decades, even centuries. Indeed. Second, we had to look at the engineers structures, waterways, and the like and bring in as much material as possible for the tasks at hand. We felt an urgent need to modernize, incorporate new concepts throughout, and reduce or eliminate the impact of some topics. We aimed to order the material in a logical sequence. In particular we tried to adopt a writing style and method of presentation that are lively and without overrigorous drudgery. These had to be accomplished without compromising a deep and thorough treatment of fundamentals. The layout of the book is sraightforward, so it can be used to suit ones personal needs. We apologize to any readers who think we have strayed from the path of simplicity in certain parts, such as the associated variables and contagious distributions of Chapter 3 and the order statistics of Chapter 7. One might wish to omit these sections on a rst reading. The introductions to the chapters will be helpful for this purpose. The explanation of the theory is accompanied by the assumptions made. Denitions are separately highlighted. In many places we point out the limitations and pitfalls or viola- tions. There are warnings of possible misuses, misunderstandings, and misinterpretations. We provide guidance to the proper interpretation of statistical results. The numerous examples, for which we have for the most part used recorded observa- tions, will be helpful to beginners as well as to mature students who will consult the text as a reference. We hope these examples will lead to a better understanding of the material and design variabilities, a prelude to the making of sound decisions. Each chapter concludes with extensive homework problems. In many instances, as in Chapter 1, they are based on real data not used elsewhere in the text. We have not used cards or dice or coins or black and red balls in any of the problems and examples. Answers to selected problems are summarized in Appendix D. A detailed manual of solutions is available. Computers are continuously becoming cheaper and more powerful. Newer ways of handling data are being devised. At the inception, we seriously considered the use of commercial software packages to enhance the scope of the book. However, the problem of choosing one, from the many suitable packages acted as a deterrent. Our concern was the serious limitations imposed by utilizing a source that necessitates corresponding purchase xiv
  15. 15. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Preface to the First Edition xv by an adopting school or by individual engineers. Besides, the calculations illustrated in the book can be made using worksheets available as standard software for personal computers. As an aid, the data in Appendix E will be placed on the Internet. We have utilized the space saved (from jargon and notation of a particular software, output, graphs, and tables) to widen the scope, make our explanations more thorough, and insert additional illustrations and problems. Readers also have an almost all-inclusive index, a comprehensive glossary of notation, additional mathematical explanations, and other material in the appendixes. Furthermore, we hope that the extensive, annotated bibli- ographies at the end of each chapter, numerous citations and tables, will make this a useful reference source. The book is written for use by students, practicing engineers, teachers, and researchers in civil and environmental engineering and applied statistics; female readers will nd no hint of male chauvinism here. It is designed for a one- or two-semester course and is suitable for nal-year undergraduate and rst-year graduate students. The text is self-contained for study by engineers. A background of elementary calculus and matrix algebra is assumed. ACKNOWLEDGMENTS We acknowledge with thanks the work of the staff at Publication Services, Inc., in Cham- paign, IL. Gianfausto Salvadori gave his time generously in reviewing the manuscript and providing solutions to some homework problems. Thanks are due again to Adri Buishand for his elaborate and painstaking reviews. Our publisher solicited other reviewers whose reports were useful. Howard Tillotson and colleagues at the University of Birmingham, England, provided data and some student problems. Discussions with Tony Lawrance at lunch in the University Staff House and the example problem he solved at Helsinki Airport are appreciated. Valuable assistance was provided by Giovanni Solari and Giulio Ballio in wind and steel engineering, respectively. In addition, Giovanni Vannuchi was consulted on geotechnical engineering. Research staff and doctoral students at the Politecnico di Mi- lano helped with the homework problems and the preparation of the index. Dora Tartaglia worked diligently on revisions to the manuscript. We thank the publishers, companies, and individuals who gave us permission to use their material, data, and tables; some of the tables were obtained through our own resources We shall be pleased to have any omissions brought to our notice. The support and hospitality provided at the Universit`a degli Studi di Pavia by Luigi Natale and others are acknowledged with thanks. Most importantly, without the patience and tolerance of our families this book could not have been completed. N. T. Kottegoda R. Rosso Milano, Italy 1 July 1996
  16. 16. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Preface to the Second Edition Last year a senior European professor, who uses our book, was visiting us in Milano. When told of the revisions underway he expressed some surprise. There is nothing to revise, he said. But all books need revision sooner or later, especially a multidimensional one. The equations, examples, problems, gures, tables, references, and footnotes are all subject to inevitable human fallibilities: typographical errors and errors of fact. Our rst objective was to bring the text as close to the ideal state as possible. The second priority was to modernize. In Chapter 10, a new section is added on Markov chain Monte Carlo modeling; this has popularized Bayesian methods in recent years; there is a full description and case study on Gibbs sampling. In Chapter 8 on simulation, we include a new section on sensitivity analysis and uncertainty analysis; a clear and detailed distinction is made between epis- temic and aleatory uncertainties; their implications in decision-making are discussed. In Chapter 7 on Frequency Analysis of Extreme Events, natural hazards and ood hydrology are updated. In Chapter 6 on regression analysis, further considerations have been made on the diagnostics of regression; there are new discussions on general and generalized linear models. In Chapter 5 on Model Estimation and Testing we give special importance to the Anderson-Darling goodness-of-t test because of its sensitivity to departures in the tail areas of a probability density function; we make applications to nonnormal distributions using the same data as in the estimation of parameters. In Chapter 3 a section is added on the novel method of copulas with particular emphasis on bivariate distributions. We have revised the problems following Basic Probability Concepts in Chapter 2. Other chapters are also revised and modernized and the annotated references are updated. As before, we have kept in mind the scientic method of Claude Bernard, the French medical researcher of the nineteenth century. This had three essential parts: observation of phenomena in nature (seen in Appendix E, and in the examples and problems), observation of experiments (as reported in each chapter), and the theoretical part (clear enough for the audience in mind, but without over-simplication). Nobody trusts a model except the one who originated it; everybody trusts data except those who record it. Models and data are subject to uncertainty. There is still a gap between models and data. We attempt to bridge this gap. The title of the book has been abridged from Statistics, Probability, and Reliability for Civil and Environmental Engineers to Applied Statistics for Civil and Environmental Engineers. The applications and problems pertain almost equally to both disciplines and all areas are included. Another aspect we emphasized before was that the calculations illustrated in the book can be made using worksheets available as standard software for personal computers. Alternatively, R which is now commonplace can be downloaded free of charge and adopted to run some of the homework problems, if one so prefers. Our decision not to recommend the use of particular commercial software packages, by giving details of jargon, notation, and so on, seems to be justied. We nd that a specic version soon become obsolete with the advent of a new version. A limited access solutions manual is available with the data from Appendix E on the Wiley-Blackwell website [www.blackwellpublishing.com/kottegoda]. xvi
  17. 17. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 Preface to the Second Edition xvii We are grateful for the encouragement given by many users of the rst edition, and to the few who pointed out some discrepancies. We thank the anonymous reviewers for their useful comments. Gianfausto Salvadori, Carlo De Michele, Adri Buishand, and Tony Lawrance assisted us again in the revisions. Julia Burden and Lucy Alexander of Blackwell Publishing supported us throughout the project. Universit`a degli Studi di Pavia is thanked for continued hospitality. The help provided by Fabrizio Borsa and Enrico Raiteri in the preparation of some gures is acknowledged. N. T. Kottegoda R. Rosso Milano, Italy 14 September 2007
  18. 18. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 13, 2008 12:36 xviii
  19. 19. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 2, 2008 15:39 Introduction As a wide-ranging discipline, statistics concerns numerous procedures for deriving infor- mation from data that have been affected by chance variations. On the basis of scientic experiments, one may record and make summaries of observations, quantify variations, or other changes of signicance, and compare data sequences by means of some numbers or characteristics. The use of statistics in this way is for descriptive purposes. At a more sophisticated level of analysis and interpretation, one can, for instance, test hypotheses using the inferential approach developed during the twentieth century. Thus it may be ascertained, for instance, whether the change of an ingredient affects the properties of a concrete or whether a particular method of surfacing produces a longer-lasting road; this approach often includes the estimation by means of observations of the parameters of a statistical model. Then inferences can be drawn from data and predictions made or decisions taken. When faced with uncertainty, this last phase is the principal aim of a civil or environmental engineer acting as an applied statistician. In all activities, engineers have to cope with possible uncertainties. Observations of soil pressures, tensile strengths of concrete, yield strengths of steel, trafc densities, rainfalls, river ows, and pollution loads in streams vary from one case to the next for apparently unknown reasons or on account of factors that cannot be assessed to any degree of accu- racy. However, designs need to be completed and structures, highways, water supply, and sewerage schemes constructed. Sound engineering judgment, in fact, springs from physi- cal and mathematical theories, but it goes far beyond that. Randomness in nature must be taken into account. Thus the onus of dealing with the uncertainties lies with the engineer. The appropriate methods of tackling the uncertainty vary with different circumstances. The key is often the dispersion that is commonly evidenced in available data sets. Some phenomena may have negligible or low variability. In such a case, the mean of past observa- tions may be used as a descriptor, for example, the elastic constant of a steel. Nevertheless, the consequences of a possible change in the mean should also be considered. Frequently, the variability in observations is found to be quite substantial. In such situations, an engi- neer sometimes uses, rather conservatively, a design value such as the peak storm runoff or the compressive strength of a concrete. Alternatively, it has been the practice to express the ability of a component in a structure to withstand a specied loading without failure or a permissible deection by a so-called factor of safety; this is in effect a blanket to cover all possible contingencies. However, we envisage some problems here in following a purely deterministic approach because there are doubts concerning the consistency of specied strengths, ows, loads, or factors from one case to another. These cannot be lightly dismissed or easily compounded when the consequences of ignoring variability are detrimental or, in general, if the decision is sensitive to a particular uncertainty. (Often there are crucial economic considerations in these matters.) This obstacle strongly sug- gests that the way forward is by treating statistics and probability as necessary aids in decision making, thus coping with uncertainty through the engineering process. Note that statistical methods are in no way intended to replace the physical knowl- edge and experience of the engineer and his or her skills in experimentation. The engineer should know how the measurements are made and recorded and how errors may arise from possible limitations in the equipment. There should be readiness to make changes and im- provements so that the data-gathering process is as reliable and representative as possible. 1
  20. 20. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 2, 2008 15:39 2 Applied Statistics for Civil and Environmental Engineers On this basis, statistics can be a complementary and a valuable aid to technology. In prudent hands it can lead to the best practical assessment of what is partially known or uncertain. The quantication of uncertainty and the assessment of its effects on design and im- plementation must include concepts and methods of probability, because statistics is built on the foundation of probability theory. In addition, decision making under risk involves the use of applied probability. Historically, probability theory arose as a branch of math- ematics concerned with the analysis of certain games of chance; it consequently found applications in the measurement and understanding of uncertainty in innumerable natural phenomena and human activities. The fundamental interrelationship between statistics and probability is clearly evident in practice. As seen in past decades, there has been an irreversible change in emphasis from descriptive to inferential statistics. In this respect we must note that statistical inferences and the risk and reliability of decision making under uncertainty are evaluated through applied probability, using frequentist or Bayesian estimation. This applies to the most widely used methods. Alternatives that come under generalized information theory are now available. The reliability of a system, structure, or component is the complement of its probability of failure. Risk and reliability analysis, however, entail many activities. The survival prob- ability of a system is usually stated in terms of the reliabilities of its components. The modeling process is an essential part of the analysis, and time can be an important factor. Also, the risk factor that one computes may be inherent, additional, or composite. All these points show that reliability design deserves special emphasis. Methods of reducing data, reviewed in Chapter 1, begin with tabulation and graphical representation, which are necessary rst steps in understanding the uncertainty in data and the inherent variability. Numerical summaries provide descriptions for further analysis. Exploratory methods are followed by relationships between data observed in pairs. Thus the investigation begins. The route is long and diverse, because statistics is the science and art of experimenting, collecting, analyzing, and making inferences from data. This opening chapter provides a route map of what is to follow so that one can gain insight into the numerous tools statistics offers and realizes the variety of problems that can be tackled. In Chapters 2 and 3, we develop a background in probability theory for coping with uncertainty in engineering. Using basic concepts, we then discuss the total probability and Bayes theorems and dene statistical properties of distributions used for estimation purposes. Chapter 4 examines various mathematical models of random processes. There is a wide-ranging discussion of discrete and continuous distributions; joint and derived types are also given in Chapters 3 and 4; we introduce copulas that can effectively model joint distributions. Model estimation and testing methods, such as condence intervals, hypothesis testing, analysis of variance, probability plotting, and identication of outliers, are treated in Chapter 5. The estimation and testing are based on the principle that all suppositions need to be carefully examined in light of experimentation and observation. Details of regression and multivariate statistical methods are provided in Chapter 6, along withprincipalcomponentanalysisandassociatedmethodsandspatialcorrelation.Extreme value analysis applied to oods, droughts, winds, earthquakes, and other natural hazards is found in Chapter 7; some special types of models are included. Simulation is the subject of Chapter8,whichcomprisestheuseofsimulationindesignandforotherpracticalpurposes; also, we discuss sensitivity analysis and uncertainty analysis of the aleatory and epistemic types. In Chapter 9, risk and reliability analysis and reliability design are developed in detail. Chapter 10 is devoted to Bayesian and other types of economic decision making, used when the engineer faces uncertainty; we include here Markov chain Monte Carlo methods that have recently popularized the Bayesian approach.
  21. 21. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Chapter 1 Preliminary Data Analysis All natural processes, as well as those devised by humans, are subject to variability. Civil engineers are aware, for example, that crushing strengths of concrete, soil pressures, strengths of welds, trafc ow, oods, and pollution loads in streams have wide variations. These may arise on account of natural changes in properties, differences in interactions between the ingredients of a material, environmental factors, or other causes. To cope with uncertainty, the engineer must rst obtain and investigate a sample of data, such as a set of ow data or triaxial test results. The sample is used in applying statistics and probability at the descriptive stage. For inferential purposes, however, one needs to make decisions regarding the population from which the sample is drawn. By this we mean the total or aggregate, which, for most physical processes, is the virtually unlimited universe of all possible measurements. The main interest of the statistician is in the aggregation; the individual items provide the hints, clues, and evidence. A data set comprises a number of measurements of a phenomenon such as the failure load of a structural component. The quantities measured are termed variables, each of which may take any one of a specied set of values. Because of its inherent randomness and hence unpredictability, a phenomenon that an engineer or scientist usually encounters is referred to as a random variable, a name given to any quantity whose value depends on chance.1 Random variables are usually denoted by capital letters. These are classied by the form that their values can possibly take (or are assumed to take). The pattern of variability is called a distribution. A continuous variable can have any value on a conti- nuous scale between two limits, such as the volume of water owing in a river per second or the amount of daily rainfall measured in some city. A discrete variable, on the contrary, can only assume countable isolated numbers like integers, such as the number of vehicles turning left at an intersection, or other distinct values. Having obtained a sample of data, the rst step is its presentation. Consider, for ex- ample, the modulus of rupture data for a certain type of timber shown in Table E.1.1, in Appendix E. The initial problem facing the civil engineer is that such an array of data by itself does not give a clear idea of the underlying characteristics of the stress values in this natural type of construction material. To extract the salient features and the particular types of information one needs, one must summarize the data and present them in some readily comprehensible forms. There are several methods of presentation, organization, and reduction of data. Graphical methods constitute the rst approach. 1.1 GRAPHICAL REPRESENTATION If a picture is worth a thousand words, then graphical techniques provide an excellent method to visualize the variability and other properties of a set of data. To the powerful interactive system of ones brain and eyes, graphical displays provide insight into the form 1 The term will be formally dened in Section 3.1. 3
  22. 22. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 4 Applied Statistics for Civil and Environmental Engineers and shape of the data and lead to a preliminary concept of the generating process. We proceed by assembling the data into graphs, scanning the details, and noting the important characteristics. There are numerous types of graphs. Line and dot diagrams, histograms, relative frequency polygons, and cumulative frequency curves are given in this section. Subsequently, exploratory methods, such as stem-and-leaf plots and box diagrams and graphs depicting a possible association between two variables, are presented in Sections 1.3 and 1.4. We begin with the simple task of counting. 1.1.1 Line diagram or bar chart The occurrences of a discrete variable can be classied on a line diagram or bar chart. In this type of graph, the horizontal axis gives the values of the discrete variable and the occurrences are represented by the heights of vertical lines. The horizontal spread of these lines and their relative heights indicate the variability and other characteristics of the data. Example 1.1. Flood occurrences. Consider the annual number of oods of the Magra River at Calamazza, situated between Pisa and Genoa in northwestern Italy, over a 34-year period, as shown in Table 1.1.1. A ood in the river at the point of measurement means the river has risen above a specied level, beyond which the river poses a threat to lives and property. The data are plotted in Fig. 1.1.1 as a line diagram. The data suggest a symmetrical distribution with a midlocation of four oods per year. In some other river basins, there is a nonlinear decrease in the occurrences for increasing numbers of oods in a year commencing at zero, showing a negative exponential type of variation. 1.1.2 Dot diagram A different type of graph is required to present continuous data. If the data are few (say, less than 25 items) a dot diagram is a useful visual aid. Consider the possibility that only Table 1.1.1 Number of ood occurrences per year from 1939 to 1972 at the gauging station of Calamazza on the Magra River, between Pisa and Genoa in northwestern Italya Number of oods Number of in a year occurrences 0 0 1 2 2 6 3 7 4 9 5 4 6 1 7 4 8 1 9 0 Total 34 a A ood occurrence is dened as river discharge exceeding 300 m3/s.
  23. 23. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 5 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 Number of floods Numberofoccurrences Fig. 1.1.1 Line diagram for ood occurrences in the Magra River at Calamazza between Genoa and Pisa in northwestern Italy. the rst 15 items of data in Table E.1.1which shows the modulus of rupture in N/mm2 for 50 mm 150 mm Swedish redwood and whitewoodare available. The abridged data are ranked in ascending order and are given in Table 1.1.2 and plotted in Fig. 1.1.2. The reader can see that the midlocation is close to 40 N/mm2 but the wide spread makes this location difcult to discern. A larger sample should certainly be helpful. 1.1.3 Histogram If there are at least, say, 25 observations, one of the most common graphical forms is a block diagram called the histogram. For this purpose, the data are divided into groups according to their magnitudes. The horizontal axis of the graph gives the magnitudes. Blocks are drawn to represent the groups, each of which has a distinct upper and lower limit. The area of a block is proportional to the number of occurrences in the group. The variability of the data is shown by the horizontal spread of the blocks, and the most common values are found in blocks with the largest areas. Other features such as the symmetry of the data or lack of it are also shown. The rst step is to take into account the range r of the observations, that is, the difference between the largest and smallest values. Example 1.2. Timber strength. We go back to the timber strength data given in Table E.1.1. They are arranged in order of magnitude in Table 1.1.3. There are n = 165 observations with somewhat high variability, as expected, because timber is a naturally variable material. Here the range r = 70.22 0.00 = 70.22 N/mm2 . To draw a histogram, one divides the range into a number of classes or cells nc. The number of occurrences in each class is counted and tabulated. These are called frequencies. Table 1.1.2 The rst 15 items of modulus of rupture data measuring timber strengths in N/mm2 , from Table E.1.1 (commencing with the top row), ranked in increasing order 29.11 29.93 32.02 32.40 33.06 34.12 35.58 39.34 40.53 41.64 45.54 48.37 48.78 50.98 65.35
  24. 24. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 6 Applied Statistics for Civil and Environmental Engineers 25 30 35 40 45 50 55 60 65 70 Modulus of rupture, N/mm2 Fig. 1.1.2 Dot diagram for a short sample of timber strengths from Table 1.1.3. The width of the classes is usually made equal to facilitate interpretation. For some work such as the tting of a theoretical function to observed frequencies, however, unequal class widths are used. Care should be exercised in the choice of the number of classes, nc. Too few will cause an omission of some important features of the data; too many will not give a clear overall picture because there may be high uctuations in the frequencies. A rule of thumb is to make nc = n or an integer close to this, but it should be at least 5 and not greater than 25. Thus, histograms based on fewer than 25 items may not be meaningful. Sturges (1926) suggested the approximation nc = 1 + 3.3 log10 n. (1.1.1) AmoretheoreticallybasedalternativefollowstheworkofFreedmanandDiaconis(1981):2 nc = r n1/3 2 iqr . (1.1.2) Here iqr is the interquartile range. To clarify this term, we must dene Q2, or the median. This denotes the middle term of a set of data when the values are arranged in ascending order, or the average of the two middle terms if n is an even number. The rst or lower quartile, Q1, is the median of the lower half of the data, and likewise the third Table 1.1.3 Ranked modulus of rupture data for timber strengths in N/mm2 , in ascending ordera 0.00 28.00 31.60 34.44 36.84 39.21 41.75 44.30 47.25 53.99 17.98 28.13 32.02 34.49 36.85 39.33 41.78 44.36 47.42 54.04 22.67 28.46 32.03 34.56 36.88 39.34 41.85 44.36 47.61 54.71 22.74 28.69 32.40 34.63 36.92 39.60 42.31 44.51 47.74 55.23 22.75 28.71 32.48 35.03 37.51 39.62 42.47 44.54 47.83 56.60 23.14 28.76 32.68 35.17 37.65 39.77 43.07 44.59 48.37 56.80 23.16 28.83 32.76 35.30 37.69 39.93 43.12 44.78 48.39 57.99 23.19 28.97 33.06 35.43 37.78 39.97 43.26 44.78 48.78 58.34 24.09 28.98 33.14 35.58 38.00 40.20 43.33 45.19 49.57 65.35 24.25 29.11 33.18 35.67 38.05 40.27 43.33 45.54 49.59 65.61 24.84 29.90 33.19 35.88 38.16 40.39 43.41 45.92 49.65 69.07 25.39 29.93 33.47 35.89 38.64 40.53 43.48 45.97 50.91 70.22 25.98 30.02 33.61 36.00 38.71 40.71 43.48 46.01 50.98 26.63 30.05 33.71 36.38 38.81 40.85 43.64 46.33 51.39 27.31 30.33 33.92 36.47 39.05 40.85 43.99 46.50 51.90 27.90 30.53 34.12 36.53 39.15 41.64 44.00 46.86 53.00 27.93 31.33 34.40 36.81 39.20 41.72 44.07 46.99 53.63 a The original data set is given in Table E.1.1; n = 165. The median is underlined. 2 See also Scott (1979).
  25. 25. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 7 Table 1.1.4 Frequency computations for the modulus of rupture data ranked in Table 1.1.3a Class upper limit Class center Absolute Relative Cumulative relative (N/mm2 ) (N/mm2 ) frequency frequency frequency (%) 5 2.5 1 0.006 0.61 10 7.5 0 0.000 0.61 15 12.5 0 0.000 0.61 20 17.5 1 0.006 1.21 25 22.5 9 0.055 6.67 30 27.5 18 0.109 17.58 35 32.5 26 0.158 33.33 40 37.5 38 0.230 56.36 45 42.5 34 0.206 76.97 50 47.5 20 0.121 89.09 55 52.5 9 0.055 94.55 60 57.5 5 0.030 97.58 65 62.5 0 0.000 97.58 70 67.5 3 0.018 99.39 75 72.5 1 0.006 100.00 a The width of each class is 5 N/mm2 in this example. or upper quartile, Q3, is the median of the upper half of the data. This denition will be used throughout.3 Thus, iqr = Q3 Q1. (1.1.3) Example 1.3. Timber strength. For the timber strength data of Table E.1.1, the median, that is, Q2, is 39.05 N/mm2 . Also Q3 and Q1 are 44.57 and 32.91 N/mm2 , respectively, and hence iqr = 11.66 N/mm2 . From the simple square-root rule, the number of classes, nc = 12.84. However, by using Eqs. (1.1.1) and (1.1.2), the number of classes are 8.32 and 16.52, respectively. If these are rounded to 9 and 15 and the range is extended to 72 and 75 N/mm2 for graphical purposes, the equal class widths become 8 and 5 N/mm2 , respectively. Let us use these widths. It is important to specify the class boundaries without ambiguity for the counting of frequencies; for example, in the rst case, these should be from 0 to 7.99, 8.00 to 15.99, and so on. As already mentioned, the vertical axis of a histogram is made to represent the frequency and the horizontal axis is used as a measurement scale on which the class boundaries are marked. For each of these class widths, 8 and 5 N/mm2 , class boundaries are made and counting of frequencies is completed using Table 1.1.3; the lowest boundary is at 0 and the highest boundaries are at 72 and 75 N/mm2 , respectively. Table 1.1.4 gives the absolute and relative frequencies for class widths of 5 N/mm2 . Rectangles are then erected over each of the classes, proportional in area to the class frequencies. When equal class widths are used, as shown here, the heights of the rectangles represent the frequencies. Thus, Figs. 1.1.3 and 1.1.4 are obtained. The information conveyed by the two histograms seems to be similar. The diagrams are almost symmetrical with a peak in the class below 40 N/mm2 and a steady decrease on either side. This type of diagram usually brings out any possible imperfections in the data, such as 3 There are alternatives, such as rounding (n + 1)/4 and (n + 1) (3/4) to the nearest integers to calculate the locations of Q1 and Q3, respectively. The rounding is upward or downward, respectively, when the numbers fall exactly between two integers.
  26. 26. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 8 Applied Statistics for Civil and Environmental Engineers 0.0 0.1 0.2 0.3 0.4 07.99 815.99 1623.99 2431.99 3239.99 4047.99 4855.99 5663.99 6471.99 7279.99 Modulus of rupture (N/mm2) Relativefrequency Fig. 1.1.3 Histogram for timber strength data with class width of 8 N/mm2 . the gaps at the ends. Further investigations are required to understand the true nature of the population. More on these aspects will follow in this and subsequent chapters. 1.1.4 Frequency polygon A frequency polygon is a useful diagnostic tool to determine the distribution of a variable. It can be drawn by joining the midpoints of the tops of the rectangles of a histogram after extending the diagram by one class on both sides. We assume that equal class widths are used. If the ordinates of a histogram are divided by the total number of observations, then a relative frequency histogram is obtained. Thus, the ordinates for each class denote the probabilities bounded by 0 and 1, by which we simply mean the chances of occurrence. The resulting diagram is called the relative frequency polygon. Example 1.4. Timber strength. Corresponding to the histogram of Fig. 1.1.4, the values of class center are computed and a relative frequency polygon is obtained; this is shown in Fig. 1.1.5. 0.00 0.10 0.20 0.30 04.99 1014.99 2024.99 3034.99 4044.99 5054.99 6064.99 7074.99 Modulus of rupture (N/mm2) Relativefrequency Fig. 1.1.4 Histogram for timber strength data with class width of 5 N/mm2 .
  27. 27. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 9 0.0 0.1 0.2 0.3 0 20 40 60 80 Modulus of rupture (N/mm2) Relativefrequency Fig. 1.1.5 Relative frequency polygon for timber strength data with class width of 5 N/mm2 . As the number of observations becomes large, the class widths theoretically tend to de- crease and, in the limiting case of an innite sample, a relative frequency polygon becomes a frequency curve. This is in fact a probability curve, which represents a mathematical probability density function, abbreviated as pdf, of the population.4 1.1.5 Cumulative relative frequency diagram If a cumulative sum is taken of the relative frequencies step by step from the smallest class to the largest, then the line joining the ordinates (cumulative relative frequencies) at the ends of the class boundaries forms a cumulative relative frequency or probability diagram. On the vertical axis of the graph, this line gives the probabilities of nonexceedance of values shown on the horizontal axis. In practice, this plot is made by utilizing and displaying every item of data distinctly, without the necessity of proceeding via a histogram and the restric- tive categories that it entails. For this purpose, one may simply determine (e.g., from the ranked data of Table 1.1.3) the number of observations less than or equal to each value and divide these numbers by the total number of observations. This procedure is adopted here.5 Thus, the probability diagram, as represented by the cumulative relative frequency diagram, becomes an important practical tool. This diagram yields the median and other quartiles directly. Also, one can nd the 9 values that divide the total frequency into 10 equal parts called deciles and the so-called percentiles, where the pth percentile is the value that is greater than p percent of the observations. In general, it is possible to obtain the (n 1) values that divide the total frequency into n equal parts called the quantiles. Hence a cumulative frequency polygon is also called a quantile or Q-plot; a Q-plot though has quantiles on the vertical axis unlike a cumulative frequency diagram. Example 1.5. Timber strength. Figure 1.1.6 is the cumulative frequency diagram obtained from the ranked timber strength data of Table 1.1.3 using each item of data as just described. 4 This function is discussed in Chapter 3. One of the rst tasks in applying inferential statistics, as presented in Chapters 4 and 5, will be to estimate the mathematical function from a nite sample and examine its closeness to the histogram. 5 Further aspects of this subject, as related to probability plots, are described in Chapter 5.
  28. 28. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 10 Applied Statistics for Civil and Environmental Engineers 0.0 0.2 0.4 0.6 0.8 1.0 0 20 40 60 80 Modulus of rupture (N/mm2) Cumulativerelativefrequency Fig. 1.1.6 Cumulative relative frequency diagram for timber strength data. The deciles and percentiles can be abstracted. By convention a vertical probability or proportionality scale is used rather than one giving percentages (except in duration curves, discussed shortly). The 90th percentile, for instance, is 51 N/mm2 approximately and the value 40 N/mm2 has a probability of nonexceedance of approximately 0.56. If the sample size increases indenitely, the cumulative relative frequency diagram will become a distribution curve in the limit. This represents the population by means of a (mathematical) distribution function, usually called a cumulative distribution function, ab- breviated to cdf, just as a relative frequency polygon leads to a probability density function. As a graphical method of ascertaining the distribution of the population, the quantile plot can be drawn using a modied nonlinear scale for the probabilities, which represents one of several types of theoretical distributions.6 Also, as shown in Section 1.4, two distributions can be compared using a Q-Q plot. 1.1.6 Duration curves For the assessment of water resources and for associated design and planning purposes, engineers nd it useful to draw duration curves. When dealing with ows in rivers, this type of graph is known as a ow duration curve. It is in effect a cumulative frequency diagram with specic time scales. The vertical axis can represent, for example, the percentage of the time a ow is exceeded; and in addition, the number of days per year or season during which the ow is exceeded (or not) may be given. The volume of ow per day is given on the horizontal axis. For some purposes, the vertical and horizontal axes are interchanged as in a Q-plot. One example of a practical use is the scaled area enclosed by the curve, a horizontal line representing 100% of the time, and a vertical line drawn at a minimum value of ow, which is desirable to be maintained in the river. This area represents the estimated supplementary volume of water that should be diverted to the river on an annual basis to meet such an objective. Example 1.6. Streamow duration. Figure 1.1.7 gives the ow duration curve of the Dora Riparia River in the Alpine region of northern Italy, calculated over a period of 47 years from the records at Salbertrand gauging station. This gure is drawn using the same procedure 6 This method is demonstrated in Section 5.8.
  29. 29. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 11 0 73 146 219 292 365 0 10 20 30 40 50 Daily streamflow (m3/s) Duration,daysperyearflow isexceeded 0 10 20 30 40 50 60 70 80 90 100 Percentage,duration Fig. 1.1.7 Flow duration curve of Dora Riparia River at Salbertrand in the Alpine region of Italy. adopted for a cumulative relative frequency diagram, such as Fig. 1.1.6. For instance, suppose it is decided to divert a proportion of the discharges above 10 m3 /s and below 20 m3 /s from the river. Then the area bounded by the curve and the vertical lines drawn at these discharges, using the vertical scale on the left-hand side, will give the estimated maximum amount available for diversion during the year in m3 after multiplication by the number of seconds in a day. This area is hatched in Fig. 1.1.7. If such a decision were to be implemented over a long- term basis, it should be essential to use a long series of data and to estimate the distribution function. 1.1.7 Summary of Section 1.1 In this section we have introduced some of the basic graphical methods. Other procedures such as stem-and-leaf plots and scatter diagrams are presented in Sections 1.3 and 1.4, respectively. More advanced plots are introduced in Chapters 5 and 6. In the next section we discuss associated numerical methods. 1.2 NUMERICAL SUMMARIES OF DATA Useful graphical procedures for presenting data and extracting knowledge on variabil- ity and other properties were shown in Section 1.1. There is a complementary method through which much of the information contained in a data set can be represented eco- nomically and conveyed or transmitted with greater precision. This method utilizes a set of characteristic numbers to summarize the data and highlight their main features. These numerical summaries represent several important properties of the histogram and the rel- ative frequency polygon. The most important purpose of these descriptive measures is for statistical inference, a role that graphs cannot fulll. Basically, there are three distinctive types: measures of central tendency, of dispersion, and of asymmetry, all of which can be visualized through the histogram as discussed in Section 1.1. The additional measure of peakedness, that is, the relative height of the peak, requires a large sample for its estimation and is mainly relevant in the case of symmetric distributions.
  30. 30. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 12 Applied Statistics for Civil and Environmental Engineers 1.2.1 Measures of central tendency Generally data from many natural systems, as well as those devised by humans, tend to cluster around some values of variables. A particular value, known as the central value, can be taken as a representative of the sample. This feature is called central tendency because the spread seems to take place about a center. The denition of the central value is exible, and its magnitude is obtained through one of the measures of its location. There are three such well-known measures: the mean, the mode, and the median. The choice depends on the use or application of the central value. The sample arithmetic mean is estimated from a sample of observations: x1, x2, . . . , xn, as x = 1 n n i=1 xi . (1.2.1) If one uses a single number to represent the data, the sample mean seems ideal for the purpose. After counting, this calculation is the next basic step in statistics. For theoretical purposes the mean is the most important numerical measure of location. As stated in Section 1.1, if the sample size increases indenitely a curve is obtained from a frequency polygon; the mean is the centroid of the area between this curve and the horizontal axis and it is thus the balance point of the frequency curve. The population value of the mean is denoted by . We reiterate our denition of popu- lation with reference to a phenomenon such as that represented by the timber strength data of Table E.1.1. A population is the aggregate of observations that might result by making an experiment in a particular manner. The sample mean has a disadvantage because it may sometimes be affected by un- expectedly high or low values, called outliers. Such values do not seem to conform to the distribution of the rest of the data. There may be physical reasons for outliers. Their presence may be attributed to conditions that have perhaps changed from what were as- sumed, or because the data are generated by more than one process. On the other hand, they may arise on account of errors of faulty instrumentation, measurement, observation, or recording. The engineer must examine any visible outliers and ascertain whether they are erroneous or whether their inclusion is justiable. The occurrence of any improbable value requires careful scrutiny in practice, and this should be followed by rectication or elimination if there are valid reasons for doing so. Example 1.7. Timber strength. A case in point is the value of zero in the timber strength data of Table E.1.1 This value is retained here for comparative purposes. The mean of the 165 items, which is 39.09 N/mm2 , becomes 39.33 N/mm2 without the value of zero. Example 1.8. Concrete test Table E.1.2 is a list of the densities and compressive strengths at 28 days from the results of 40 concrete cube test records conducted in Barton-on-Trent, England, during the period 8 July 1991 to 21 September 1992, and arranged in reverse chronological order. These have sample means of 2445 kg/m3 and 60.14 N/mm2 , respectively. The two numbers are measures of location representing the density and compressive strength of concrete. With many discordant values at the extremes, a trimmed mean, such as a 5% trimmed mean, may be calculated. For this purpose, the data are ranked and the mean is obtained after ignoring 5% of the observations from each of the two extremities (see Problem 1.16).
  31. 31. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 13 The technique of coding is sometimes used to facilitate calculations when the data are given to several signicant gures but the digits are constant except for the last few. For example, the densities in Table E.1.2 are higher than 2400 N/mm2 and less than 2500 N/mm2 , so that the number 2400 can be subtracted from the densities. The remainders will retain the essential characteristics of the original set (apart from the enforced shift in the mean), thus simplifying the arithmetic. In considering the entire data set, a weighted mean is obtained if the variables of a sample are multiplied by numbers called weights and then divided by the sum of the weights. It is used if some variables should contribute more (or less) to the average than others. The median is the central value in an ordered set or the average of the two central values if the number of values, n, is even, as specied in Section 1.1. Example 1.9. Concrete test. The calculation of the median and other measures of location will be greatly facilitated if the data are arranged in order of magnitude. For example, the compressive strengths of concrete given in Table E.1.2 are rewritten in ascending order in Table 1.2.1. The median of these data is 60.1 N/mm2 , which is the average of 60.0 and 60.2 N/mm2 . The median of the timber strength data of Table 1.1.3 is 39.05 N/mm2 , as noted in the table. The median has an advantage over the mean. It is relatively unaffected by outliers and is thus often referred to as a resistant measure. For instance, the exclusion of the zero value in Table 1.1.3 results only in a minor change of the median from 39.05 to 39.10 N/mm2 . One of the countless practical uses of the median is the application of a disinfectant to many samples of bacteria. Here, one seeks an association between the proportion of bacteria destroyed and the strength of the disinfectant. The concentration that kills 50% of the bacteria is the median dose. This is termed LD50 (lethal dose for 50%) and provides an excellent measure. The mode is the value that occurs most frequently. Quite often the mode is not unique because two or more sets of values have equal status. For this reason and for convenience, the mode is often taken from the histogram or frequency polygon. Example 1.10. Concrete test. For the ranked compressive strengths of concrete in Table 1.2.1, the mode is 60.5 N/mm2 . Example 1.11. Timber strength. From Fig. 1.1.4, for example, the mode of the timber strength data is 37.5 N/mm2 , which corresponds to the midpoint of the class with the highest frequency. However, there is ambiguity in the choice of the class widths as already noted. On the other hand, in Table 1.1.3 there are nine values in the range 38.6439.34 N/mm2 , and thus 39 N/mm2 seems a more representative value, but this problem can only be resolved theoretically. As the sample size becomes indenitely large, the modal value will correspond to the peak of the relative frequency curve on a theoretical basis. The mode may often have greater practical signicance than the mean and the median. It becomes more useful as the asymmetry of the distribution increases. For instance, if an engineer were to ask a person who sits habitually on the banks of a river shing to indicate the mean level of the river, he or she is inclined to point out the modal level. It is the value most likely to occur and it
  32. 32. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 14 Applied Statistics for Civil and Environmental Engineers Table 1.2.1 Ordered data of density and compressive strength of concretea Compressive strength Order Density (kg/m3 ) (N/mm2 ) 1 2411 49.9 2 2415 50.7 3 2425 52.5 4 2427 53.2 5 2427 53.4 6 2428 54.4 7 2429 54.6 8 2433 55.8 9 2435 56.3 10 2435 56.7 11 2436 56.9 12 2436 57.8 13 2436 57.9 14 2436 58.8 15 2437 58.9 16 2437 59.0 17 2441 59.6 18 2441 59.8 19 2444 59.8 20 2445 60.0 21 2445 60.2 22 2446 60.5 23 2447 60.5 24 2447 60.5 25 2448 60.9 26 2448 60.9 27 2449 61.1 28 2450 61.5 29 2454 61.9 30 2454 63.3 31 2455 63.4 32 2456 64.9 33 2456 64.9 34 2457 65.7 35 2458 67.2 36 2469 67.3 37 2471 68.1 38 2472 68.3 39 2473 68.9 40 2488 69.5 a The original data sets are given in Table E.1.2. is not affected by exceptionally high or low values. Clearly, the deletion of the zero value from Table 1.1.3 does not alter the mode, as we have also seen in the case of the median. These positive attributes of the mode and median notwithstanding, the mean is indis- pensable for many theoretical purposes. Also in the same class as the sample arithmetic
  33. 33. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 15 mean, there are two other measures of location that are used in special situations. These are the harmonic and geometric means. The harmonic mean is the reciprocal of the mean of the reciprocals. Thus the harmonic mean for a sample of observations, x1, x2, . . . , xn, is dened as xh = 1 1/n[(1/x1) + (1/x2) + + 1/xn)] . (1.2.2) It is applied in situations where the reciprocal of a variable is averaged. Example 1.12. Stream ow velocity. A practical example of the harmonic mean is the determination of the mean velocity of a stream based on measurements of travel times over a given reach of the stream using a oating device. For instance, if three velocities are calculated as 0.20, 0.24, and 0.16 m/s, then the sample harmonic mean is xh = 1 (1/3)[(1/0.20) + (1/0.24) + (1/0.16)] = 0.19 m/s. The geometric mean is used in averaging values that represent a rate of change. Here the variable follows an exponential, that is, a logarithmic law. For a sample of observations, x1, x2, . . . , xn, the geometric mean is the positive nth root of the product of the n values. This is the same as the antilog of the mean of the logarithms: xg = (x1x2 . . . xn)1/n = exp 1 n n i=1 In xi = n i=1 x 1/n i . (1.2.3) Example1.13. Populationgrowth. Considerthecaseofpopulationsoftownsandcitiesthat increase geometrically, which means that a future increase is expected that is proportional to the current population. Such information is invaluable for planning and designing urban water supplies and sewerage systems. Suppose, for example, that according to a census conducted in 1970 and again in 1990 the population of a city had increased from 230,000 to 310,000. An engineer needs to verify, for purposes of design, the per capita consumption of water in the intermediate period and hence tries to estimate the population in 1980. The central value to use in this situation is the geometric mean of the two numbers which is xg = (230, 000 310, 000)1/2 = 267,021. (Note that the sample arithmetic mean x = 270,000.) As we see in Example 1.13, the geometric mean is less than the arithmetic mean.7 1.2.2 Measures of dispersion Whereas a measure of central tendency is obtained by locating a central or representative value, a measure of dispersion represents the degree of scatter shown by observations or the inherent variability in a phenomenon under observation. Dispersion also indicates the precision of the data. One method of quantication is through an order statistic, that is, one of ranked data.8 The simplest in the category is the range, which is the difference between the largest and smallest values, as dened in Section 1.1. 7 This theoretical property is demonstrated in Example 3.10. 8 We shall discuss order statistics formally in Chapter 7; see also Chapter 5.
  34. 34. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 16 Applied Statistics for Civil and Environmental Engineers Example 1.14. Timber strength. As noted before, the range of the timber strength data of Table 1.1.3 is 70.22 0.00 = 70.22 N/mm2 . Example 1.15. Concrete test. For the compressive strengths of concrete given in Table E.1.2 and ranked in Table 1.2.1, the range is r = 69.5 49.9 = 19.6 N/mm2 ; the range of the concrete densities is 2488 2411 = 77 kg/m3 . These numbers provide a measure of the spread of the data in each case. The range, however, is a nondecreasing function of the sample size and thus charac- terizes the population poorly. Moreover, the range is unduly affected by high and low values that may be somewhat incompatible with the rest of the data even though they may not always be classied as outliers. For this reason, the interquartile range, iqr, which is relatively a resistant measure, is preferable. As dened in Section 1.1, in a ranked set of data this is the difference between the median of the top half and the median of the bottom half. Example 1.16. Concrete test. For the compressive strengths of concrete, the iqr is 6.55 N/mm2 . Example 1.17. Timber strength. The timber strength data in Table 1.1.3 have an iqr of 11.66 and 11.47 N/mm2 , respectively, with or without the zero value. A similar and more general measure is given by the interval between two symmetrical percentiles. For example, the 9010 percentile range for the timber strength data is approximately 52 28 = 24 N/mm2 from Fig. 1.1.6. The aforementioned measures of dispersion can be easily obtained. However, their shortcoming is that, apart from two values or numbers equivalent to them, the vast infor- mation usually found in a sample of data is ignored. This criticism is not applicable if one determines the average deviation about some central value, thus including all the obser- vations. For example, the mean absolute deviation, denoted by d, measures the average absolute deviation from the sample mean. For a sample of observations, x1, x2, . . . , xn, it is dened as d = |x1 x| + |x2 x| + + |xn x| n = n i=1 |xi x| n . (1.2.4) Example 1.18. Annual rainfall. If the annual rainfalls in a city are 50, 56, 42, 53, and 49 cm over a 5-year period, the absolute deviation with respect to the sample mean of 50 cm is given by d = 1 5 (|50 50| + |56 50| + |42 50| + |53 50| + |49 50|) = 3.6 cm. This measure of dispersion is easily understood and practically useful. However, it is valid only if the large and small deviations are as signicant as the average deviations. There are strong theoretical reasons (as seen in Chapters 3, 4, and 5), on the other hand, for using the sample standard deviation, denoted by s, which is the root mean square deviation about the mean. Indeed, this is the principal measure of dispersion (although the interquartile
  35. 35. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 17 range is meaningful and expedient). For a sample of observations, x1, x2, . . . , xn it is dened by s = 1 n [(x1 x)2 + (x2 x)2 + + (xn x)2] = 1 n n i=1 (xi x)2. (1.2.5) By expanding and summarizing the terms on the extreme right-hand side, s = 1 n n i=1 x2 i 2x n i=1 xi + n x2 = 1 n n i=1 x2 i x2. (1.2.6) Engineers will recognize that this measure is analogous to the radius of gyration of a structural cross section. In contrast to the mean absolute deviation, it is highly inuenced by the largest and smallest values. The standard deviation of the population is denoted by . It is common practice to replace the divisor n of Eq. (1.2.5) by (n 1) and denote the left-hand side by s. Consequently, the estimate of the standard deviation is, on average, closer to the population value because it is said to have smaller bias. Therefore, Eq. (1.2.5) will, on average, give an underestimate of except in the rare case in which is known.9 The required modication to Eq. (1.2.6) is as follows: s = 1 n 1 n i=1 x2 i n n 1 x2. (1.2.7) This reduction in n can be justied by means of the concept of degrees of freedom. It is a consequence of the fact that the sum of the n deviations (x1 x), (x2 x), . . . , (xn x) is zero, which follows from Eq. (1.2.1) for the mean. Hence, regardless of the arrangement of the data, if any (n 1) terms are specied the remaining term is xed or known, because xn x = n1 i=1 (xi x). It follows from this equation that one degree of freedom is lost in dening the sample standard deviation. The concept of degrees of freedom was introduced by the English statistician R. A. Fisher on the analogy of a dynamical system in which the term denotes the number of independent coordinate values necessary to determine the system. Example 1.19. Annual rainfall. From the annual rainfall data in Example 1.18 (50, 56, 42, 53, and 49 cm), one can estimate the standard deviation by using Eq. (1.2.5), as follows: s = 1 5 [(50 50)2 + (56 50)2 + (42 50)2 + (53 50)2 + (49 50)2] = 1 5 (02 + 62 + 82 + 32 + 12) = 110 5 = 4.69 cm. An alternative estimate of (which is, on average, less biased) is obtained using Eq. (1.2.7) as follows: s = 110 4 = 5.24 cm. 9 Terms such as bias are discussed formally in Section 5.2. It is shown in Example 5.1 that s2 is unbiased; however, s is known to have bias, though less than s on average.
  36. 36. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 18 Applied Statistics for Civil and Environmental Engineers Example 1.20. Timber strength. By using Eq. (1.2.7), the sample standard deviation of the timber strength data of Table E.1.1 is 9.92 N/mm2 (or 9.46 N/mm2 if the zero value is excluded). Example 1.21. Concrete test. By using Eq. (1.2.7), the sample standard deviation for the density and compressive strength of concrete in Table E.1.2 are 15.99 kg/m3 and 5.02 N/mm2 , respectively. Dividing the standard deviation by the mean gives the dimensionless measure of dis- persion called the sample coefcient of variation, v: v = s x (1.2.8) Thisisusuallyexpressedasapercentage.Thecoefcientofvariationisusefulincomparing different data sets with respect to central location and dispersion. Example 1.22. Comparison of timber and concrete strength data. From the values of mean and standard deviation in Examples 1.7 and 1.20, the sample coefcient of variation of the timber strength data is 25.3% (or 24.0% without the value of zero). Similarly, from Examples 1.8 and 1.21 the density and compressive strength of concrete data have sample coefcients of 0.65 and 8.24%, respectively. The higher variation in the timber strength data is a reection of the variability of the natural material, whereas the low variation in the density of the concrete is evidence of a uniform quality in the constituents and a high standard of workmanship, including care taken in mixing. The variation in the compressive strength of concrete is higher than that of its density. This can be attributed to random factors that inuence strength, such as some subtle changes in the effectiveness of the concrete that do not alter its density. From the square of the sample standard deviation one obtains the sample variance, s2 , which is the mean of the squared deviations from the mean. The population variance is denoted by 2 . The variance, like the mean, is important in theoretical distributions. BysquaringEqs.(1.2.6)and(1.2.7),twoestimatorsofthepopulationvariancearefound. Here estimator refers to a method of estimating a constant in a parent population. As in all the foregoing equations, this term means the random variable of which the estimate is a realization. An unbiased estimator is obtained from Eq. (1.2.7) because on average (that is by repeated sampling) the estimator tends to the population variance 2 . In other words, the expectation E, which is in effect the average from an innite number of observations, of the square of the right-hand side of Eq. (1.2.7) is equal to 2 . There are also measures of dispersion pertaining to the mean of the deviations between the observations. Ginis mean difference, for example, is a long-standing method.10 This is given by g = 2 n (n 1) i> j n j=1 [x(i) x( j)], (1.2.9) in which the observations x1, x2, . . . , xn are arranged in ascending order. 10 See, for example, Stuart and Ord (1994, p. 58) for more details of this method originated by the Italian mathematician, Gini. See also Problem 1.7.
  37. 37. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 19 1.2.3 Measure of asymmetry Another important property of the histogram or frequency polygon is its shape with respect to symmetry (on either side of the mode). The sample coefcient of skewness measures the asymmetry of a set of data about its mean. For a sample of observations, x1, x2, . . . , xn, it is dened as g1 = n i=1 (xi x)3 ns3 . (1.2.10) Division by the cube of the sample standard deviation gives a dimensionless measure. A histogram is said to have positive skewness if it has a longer tail on the right, which is toward increasing values, than on the left. In this case the number of values less than the mean is greater than the number that exceeds the mean. Many natural phenomena tend to have this property. For a positively skewed histogram, mode < median < mean. This inequality is reversed if skewness is negative. A symmetrical histogram suggests zero skewness. Example 1.23. Comparison of timber and concrete strength data. The coefcient of skewness of the timber strength data of Table E.1.1 and the compressive strength data of Table E.1.2 are 0.15 (or 0.53 after excluding the zero value) and 0.03, respectively. These indicate a small skewness in the rst case and a symmetrical distribution in the second case. The example indicates that this measure of skewness is sensitive to the tails of the distribution. 1.2.4 Measure of peakedness The extent of the relative steepness of ascent in the vicinity and on either side of the mode in a histogram or frequency polygon is said to be a measure of its peakedness or tail weight. This is quantied by the dimensionless sample coefcient of kurtosis, which is dened for a sample of observations, x1, x2, . . . , xn by g2 = n i=1 (xi x)4 ns4 . (1.2.11) Example 1.24. Comparison of timber and concrete strength data. The kurtosis of the timber strength data of Table E.1.1 is 4.46 (or 3.57 without the zero value) and that of the compressive strengths of Table E.1.2 is 2.33. One can easily see from Eq. (1.2.11) that even a small variation in one of the items of data may inuence the kurtosis signicantly. This observation warrants a large sample size, perhaps 200 or greater, for the estimation of the kurtosis. Small sample sizes, particularly in the second case with n = 40, preclude the attachment of any special signicance to these estimates. 1.2.5 Summary of Section 1.2 Of the numerical summaries listed here, the mean, standard deviation, and coefcient of skewness are the best representative measures of the histogram or frequency polygon, from both visual and theoretical aspects. These provide economical measures for summarizing the information in a data set. Sample estimates for the data we have been discussing here, including the coefcients of variation and kurtosis, are given in Table 1.2.2.
  38. 38. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 20 Applied Statistics for Civil and Environmental Engineers Table 1.2.2 Sample estimates of numerical summaries of the timber strength data of Table 1.1.3 and the concrete strength and density data of Table 1.2.1 Sample Standard Coefcient of Coefcient Coefcient Data set size Meana deviationa variation (%) of skewness of kurtosis Estimated by equation 1.2.1 1.2.7 1.2.8 1.2.10 1.2.11 Timber strengthfull sample 165 39.09 9.92 25.3 0.15 4.46 Timber strength without the zero value 164 39.33 9.46 24.0 0.53 3.57 Compressive strength of concrete 40 60.14 5.02 8.35 0.03 2.33 Density of concrete 40 2445 15.99 0.65 0.38 3.15 a Units for strength are N/mm2; units for density are kg/m3. 1.3 EXPLORATORY METHODS Some graphical displays are used when one does not have any specic questions in mind before examining a data set. These methods were appropriately called exploratory data analysis by Tukey (1977). Among such procedures the box plot is advantageous, and the stem-and-leaf plot is also a valuable tool. 1.3.1 Stem-and-leaf plot The histogram is a highly effective graphical procedure for showing various characteristics of data as seen in Section 1.1. However, for smaller samples, less than, say, 40 in size, it may not give a clear indication of the variability and other properties of the data. The stem-and-leaf plot, which resembles a histogram turned through a right angle, is a useful procedure in such cases. Its advantage is that the data are grouped without loss of information because the magnitudes of all the values are presented. Furthermore, its intrinsic tabular form highlights extreme values and other characteristics that a histogram may obscure. As in a histogram, the data are initially ranked in ascending order but a different approach is adopted in nding the number of classes. The class widths are almost invariably equal. For the increments or class intervals (and hence class widths) one uses 0.5, 1, or 2 multiplied by a power of 10, which means that the intervals are in units such as 0.1 or 200 or 10,000, which are more tractable than, say, 0.13 or 140 or 12,000. The terminology is best explained through the following worked example. Example 1.25. Concrete test. For the concrete strength data of Table E.1.2, the maximum and minimum values are 69.5 and 49.9 N/mm2 , respectively. As a rst choice, the data can be divided into 21 classes in intervals of 1 N/mm2 with lower boundaries at 49, 50, 51 N/mm2 , and so on, up to 69 N/mm2 . For the ordered stem-and-leaf plot of Fig. 1.3.1, a vertical line is drawn with the class boundaries marked in increasing order immediately to its left. The boundary values are called the leading digits and, together with the vertical line, constitute the stem. The trailing digits on the right represent the items of data in increasing order when read jointly with the leading digits using the indicated units. They are termed leaves, and their counts are the class frequencies. Thus the digits 49 (stem) and 9 (leaf) constitute 49.9. It is useful to provide an additional column at the extreme left, as shown here, giving the cumulative frequenciescalled depthsup to each class. This is completed
  39. 39. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 21 1 49 9 2 50 7 2 51 3 52 5 5 53 2 4 7 54 4 6 8 55 8 11 56 3 7 9 13 57 8 9 15 58 8 9 19 59 0 6 8 8 (7) 60 0 2 5 5 5 9 9 14 61 1 5 9 11 62 11 63 3 4 9 64 9 9 7 65 7 6 66 6 67 2 3 4 68 1 3 9 1 69 5 Fig. 1.3.1 Stem-and-leaf plot for compressive strengths of concrete in Table E.1.2; units for stem: 1 N/mm2 ; units for leaves: 0.1 N/mm2 . rstly by starting at the top and totaling downward to the line containing the median for which the individual frequency is given in parentheses, and secondly by starting at the bottom and totaling upward to the line containing the median. The diagram gives all the information in the data, which is its main advantage. Further- more, the range, median, symmetry, or gaps in the data, frequently occurring values, and any possible outliers can be highlighted. In this example, a symmetrical distribution is indicated. The plot may be redrawn with a smaller number of classes, perhaps for greater clarity, using the guidelines for choosing the intervals stipulated previously. The units of data in a plot can be rounded to any number of signicant gures as necessary. Also, the number of stems in a plot can be doubled by dividing each stem into two lines. When 1 multiplied by a power of 10 is used as an interval, for example, the rst line, which is denoted by an asterisk ( ), will thus have leaves 0 to 4, and the leaves of the second, represented by a period (.), will be from 5 to 9. Likewise, one may divide a stem into ve lines. The stem-and-leaf plot is best suited for small to moderate sample sizes, say, less than 200.
  40. 40. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 22 Applied Statistics for Civil and Environmental Engineers (N/mm2) 10 20 30 40 50 60 70 80 Strength excluding 0 value Timber strength, of concrete Compressive strength 17.98 33.10 39.10 44.57 70.22 56.8 69.5 49.9 63.4 60.1 73.23 46.97 61.78 65.35 65.61 69.07 15.89 Maximum and minimum values Critical values for detecting outliers Other high values Quartiles Fig. 1.3.2 Box plots for timber strength and compressive strength of concrete data from Tables 1.1.3 to 1.2.1. 1.3.2 Box plot Another plot that is highly useful in data presentation is the box plot, which displays the three quartiles, Q1, Q2, Q3, on a rectangular box aligned either horizontally or vertically. The box, together with the minimum and maximum values, which are shown at the ends of lines extended at either side from the box from the midpoints of its extremities, constitute the box-and-whiskers plot, as it is sometimes called. The numerical signposts are arranged as follows from top to bottom: minimum, Q1, Q2, Q3, and maximum. Together they constitute a ve-number summary. The minimum and maximum values may be replaced by the 5th and 95th (or other extreme) percentiles or supplemented by these and additional extreme values. These plots play an important role in comparing two or more samples. The width of the box is made proportional to the sample size in such cases, if they are different. Example 1.26. Comparison of timber and concrete strength data. Let us use a box plot to compare the strengths of two representative materials used by civil engineers. Figure 1.3.2 shows the timber strength data ranked in Table 1.1.3, with the zero value excluded, and the compressive strength of concrete data that were ranked in Table 1.2.1. The box plot of compressive strengths of concrete shown on the right strongly indicates symmetry in their distribution. In the case of the timber strength data, the box is less symmetrical. However, there are clear signs of positive skewness; because the length of the line connecting the highest value to the box is longer than that connecting the lowest value to the box. Empirical rules have been devised to detect outliers by means of box plots. As previ- ously stated, this term signies an excessive discordance with reference to an assumed distribution to which the majority of observations belong. One such procedure identies
  41. 41. P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFK BLUK154-Kottegoda April 15, 2008 7:11 Preliminary Data Analysis 23 outliers as those values located at distances greater than 1.5 iqr above the third quartile or less than 1.5 iqr below the rst quartile. Example 1.27. Comparison of timber and concrete strength data. The iqrs for the timber strength and compressive strength of concrete data are 11.47 and 6.55 N/mm2 and thus the two critical distances for detecting outliers are 17.21 and 9.83 N/mm2 , respectively. These distances are set out on either from the extremities of the boxes and are shown by thick horizontal lines in Fig. 1.3.2. By this rule, the concrete data do not have any outliers, whereas there are four outliers beyond the demarcating line for high outliers in the timber strength data of Table 1.1.2. These are the values 65.35, 65.61, 69.07, and 70.22 N/mm2 . At the other extremity, there is the zero value that was discarded before the diagram was drawn. When such an observation is recorded one should verify whether it stems from a faulty calibration or other source of error; it is clearly an outlier by the method described here.11 1.3.3 Summary of Section 1.3 In general, box plots are helpful in highlighting distributional features, including the range and many of