Bio Statics

7/27/2019 Bio Statics

1/93

Good morning


2/93

Biostatistics

Ashfaq yaqoob

18.01.2010


3/93

Introduction

Any science needs precision for its development.

For precision, facts, observations or measurements have to be expressed in figures.

"When you can measure what you are speaking about and express it in numbers, you know something about it, but when you cannot express it in numbers your knowledge is of a meagre and unsatisfactory kind." - Lord Kelvin


4/93

Similarly in medicine, be it diagnosis, treatment or research, everything depends on measurement.

E.g. you have to count the number of missing teeth or measure the vertical dimension and express it as a number so that it makes sense.

A statistic or datum is a measured or counted fact or piece of information stated as a figure, such as the height of one person or the birth weight of a baby.

Statistics or data is the plural of the same.


5/93

Statistics is the science of figures.

It is a field of study concerned with the techniques and methods of collecting, classifying, summarizing and interpreting data, drawing inferences, testing hypotheses and making recommendations.

Biostatistics is the term used when the tools of statistics are applied to data derived from the biological sciences.


6/93

Data: discrete observations of attributes or events that carry little meaning when considered alone.

Information: data that have been reduced and adjusted for variations such as age and sex, so that comparisons over time and place are possible.

Intelligence: the transformation of information through integration and processing with experience and perceptions based on social and political values.

Any measurable characteristic of a population is called a parameter.


7/93

Statistics used to summarize, or describe, the characteristics of a sample are called descriptive statistics.

Statistical procedures that are used to make inferences (ie, draw conclusions) about the population that the sample represents are called inferential statistics.


8/93

Descriptive statistics


9/93

In the real world, we cannot study the infinite members of an entire population.

Instead, we must select a sample in the hope that it will serve as a representative surrogate.


10/93

A sample can be used to estimate quantities in the population as a whole.

Sampling variation is minimized by

an adequate sample size

proper sampling techniques


11/93

Non-random sampling: easier and more convenient to perform.

Random sampling:

In random sampling (also called probability sampling), everyone in the sampling frame has an equal probability of being chosen.


12/93

Non-random sampling (also called non-probability sampling) does not aim to give everyone an equal probability of selection, but is usually easier and more convenient to perform.

Convenience or opportunistic sampling is the crudest type of non-random sampling.

This involves selecting the most convenient group available (e.g. using the first 20 colleagues we see at work).

Though simple to perform, it is unlikely to result in a sample that is either representative of the population or replicable.


13/93

Random selection of samples is important.

In random sampling, everyone in the sampling frame has an equal probability of being chosen.

This makes it more likely that the sample is truly representative of the population.

It can help minimize bias (bias can be defined as an effect that produces results which are systematically different from the true values).


14/93

Simple random sampling: units are chosen using random numbers.

a. Lottery method

b. Table of random numbers

Multistage sampling: sampling is carried out in successive stages, e.g. a school health survey of all children.

Cluster sampling: all of the subjects in the final-stage sample are investigated.

Stratified sampling: subjects are randomly selected from different strata or groups.


15/93

Systematic sampling: one unit is selected at random, and additional units are then selected at evenly spaced intervals until a sample of the required size is formed.

Pathfinder surveys: a specified proportion of the population is surveyed, e.g. 1%.
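Simple random and systematic sampling can be sketched in a few lines of Python. This is only an illustration: the function names, the sampling frame of 100 numbered subjects and the fixed seed are assumptions, not part of the slides, and the systematic version assumes the required sample size is no larger than the frame.

```python
import random

def simple_random_sample(frame, k, seed=None):
    """Draw k units so that every unit in the frame is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, k)

def systematic_sample(frame, k, seed=None):
    """Pick a random starting unit, then take every interval-th unit after it."""
    rng = random.Random(seed)
    interval = len(frame) // k          # evenly spaced interval between units
    start = rng.randrange(interval)     # random start within the first interval
    return [frame[start + i * interval] for i in range(k)]

# Hypothetical sampling frame: subjects numbered 1 to 100
frame = list(range(1, 101))
srs = simple_random_sample(frame, 10, seed=1)
sys_sample = systematic_sample(frame, 10, seed=1)
```

With a frame of 100 and a sample of 10, the systematic sample takes every 10th subject after the random start.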


16/93

Sources of data

1. Experiments

2. Surveys

3. Records

Data may be primary or secondary.

Categories

1. Quantitative/continuous: measured with a number.

2. Qualitative/discrete: cannot be meaningfully summarized by a number.


17/93

Qualitative or discrete data

In such data there is no notion of magnitude or size of an attribute, as the attribute cannot be measured.

Only the number of persons having the same attribute varies and can be counted.

E.g. out of 100 people, 75 have class I occlusion, 15 have class II occlusion and 10 have class III occlusion.

Classes I, II and III are attributes which cannot be measured in figures; only the number of people having each of them can be determined.


18/93

Quantitative or continuous data

Here the attribute has a magnitude. Both the attribute and the number of persons having the attribute vary.

E.g. freeway space. It varies for every patient. It is a quantity with a different value for each individual and is measurable. It is continuous, as it can take any value between 2 and 4, such as 2.10, 2.55 or 3.07.


19/93

Data presentation

Statistical data, once collected, should be systematically arranged and presented

To arouse the interest of readers

For data reduction

To bring out important points clearly and strikingly

For easy grasp and meaningful conclusions

To facilitate further analysis

To facilitate communication


20/93

The two main types of data presentation are

Tabulation

Graphic representation with charts and diagrams

Tabulation

It is the most common method.

Data are presented in the form of columns and rows.


21/93

General principles for designing tables

1. Tables should be numbered.

2. A brief, self-explanatory title should be given for each table.

3. Headings of rows and columns should be clear and concise.

4. Data must be presented according to size or importance, chronologically or alphabetically.

Tables can be of the following types:

Simple tables

Frequency distribution tables


22/93

Simple table

Number of patients in MCODS Mangalore

Month         Number of patients
Jan 2006      2000
Feb 2006      1800
March 2006    2300


23/93

Frequency distribution table

Data are first split into convenient groups, and the number of items in each group is shown in adjacent columns.


24/93

Frequency distribution table

Number of cavities    Number of patients
0 to 3                78
3 to 6                67
6 to 9                32
9 and above           16


25/93

Charts and diagrams

Useful method of presenting statistical data

Powerful impact on imagination of the people


26/93

Bar chart

The length of the bars, drawn vertically or horizontally, is proportional to the frequency of the variable.

A suitable scale is chosen.

Bars are usually equally spaced.

They are of three types:

- Simple bar chart

- Multiple bar chart: two or more variables are grouped together.

- Component bar chart: bars are divided into parts, each part representing a certain item and proportional to the magnitude of that item.


27/93

Bar diagrams

[Figure: examples of simple, sub-divided and multiple bar diagrams]


28/93

Histogram

A pictorial diagram of a frequency distribution.

Frequency polygon: obtained by joining the midpoints of the histogram blocks at the height of the frequency by straight lines, usually forming a polygon.

[Figure: histogram of the number of carious lesions, in classes from 0 to 3 up to 24 to 27, on a frequency axis from 0 to 80; bar labels as extracted: 75, 45, 40, 32, 43, 22, 34, 29, 38]


29/93

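Pie charts

The frequencies of the groups are shown as segments of a circle.

The angle of each segment denotes the frequency.

The angle is calculated by

$\text{angle} = \dfrac{\text{class frequency}}{\text{total observations}} \times 360^\circ$

[Figure: pie chart with segments for PROSTHO, CONSO, PERIO, ORTHO and PEDO; segment labels as extracted: 200 (31%), 150 (24%), 180 (29%), 70 (11%), 30 (5%)]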
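The angle formula can be checked with a short calculation. The sketch below uses the five segment counts shown in the pie chart; the function name is an assumption made for illustration.

```python
def pie_angles(frequencies):
    """Angle of each segment = class frequency / total observations * 360."""
    total = sum(frequencies)
    return [f * 360 / total for f in frequencies]

# Segment counts taken from the pie chart on this slide
counts = [200, 150, 180, 70, 30]
angles = pie_angles(counts)
```

Because every angle is a share of the same total, the angles always add up to 360 degrees.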


30/93

Scatter diagrams show the relation between two variables.

If the dots are clustered around a straight line, there is evidence of a linear relationship.

If there is no such cluster, there is probably no relation between the variables.

[Figure: scatter diagram of carious lesions against sugar exposure]


31/93

Pictogram

A popular method of presenting data to the common man.

Spot map or map diagram

These maps are prepared to show the geographic distribution of the frequencies of characteristics.


32/93

Measures of statistical averages or central tendency

A measure of central tendency is a value in the distribution around which the other values are distributed.

It gives a picture of the central value.

1. Arithmetic mean

2. Median

3. Mode


33/93

Mean refers to the arithmetic mean.

It is the sum of all the observations divided by the total number of observations (n).

It is denoted by $\bar{X}$ for a sample and $\mu$ for a population:

$\bar{X} = \dfrac{X_1 + X_2 + X_3 + \dots + X_n}{n}$

Advantage: it is easy to calculate.

Disadvantage: it is influenced by extreme values.


34/93

Median

When all the observations are arranged in either ascending or descending order, the middle observation is known as the median.

When the number of observations is even, the average of the two middle values is taken.

The median is a better indicator of the central value when extreme values are present, as it is not affected by them.


35/93

Mode

The most frequently occurring observation in a data set is called the mode.

It is not often used in medical statistics.

Example

Number of decayed teeth in 10 children: 2, 2, 4, 1, 3, 0, 10, 2, 3, 8

Mean = 35 / 10 = 3.5

Median: the ordered values are 0, 1, 2, 2, 2, 3, 3, 4, 8, 10, so the median = (2 + 3) / 2 = 2.5

Mode = 2 (occurs 3 times)
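The three averages for the decayed-teeth example can be verified with Python's standard `statistics` module. This is a minimal check; the variable names are chosen here for illustration.

```python
import statistics

# Number of decayed teeth in the 10 children from the example
decayed = [2, 2, 4, 1, 3, 0, 10, 2, 3, 8]

mean = statistics.mean(decayed)      # sum of observations / number of observations
median = statistics.median(decayed)  # average of the two middle ordered values
mode = statistics.mode(decayed)      # most frequently occurring value
```

Note how the single child with 10 decayed teeth pulls the mean above the median.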


36/93

Variations

Data collected show considerable variation.

There is variation from person to person, and also variation in the same person at different times.

Thus measures of variation or dispersion are used:

Range

Mean (average) deviation

Standard deviation (sigma)


37/93

Range: the difference between the highest and lowest values.

Mean deviation: the average of the absolute deviations from the arithmetic mean.

$\text{M.D.} = \dfrac{\sum |X_i - \bar{X}|}{n}$

$X_i$ = observation, $\bar{X}$ = mean, n = number of observations


38/93

Standard deviation: the root mean square deviation, denoted by $\sigma$ (sigma) or S.D.

$\sigma = \sqrt{\dfrac{\sum (X_i - \bar{X})^2}{n}}$

The greater the standard deviation, the greater the magnitude of dispersion from the mean.

A small standard deviation means a high degree of uniformity of the observations.

Usually, measurements more than 2 SD away from the mean are considered rare or unusual in any distribution.
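The measures of dispersion can be computed directly from their definitions. The sketch below reuses the decayed-teeth counts from the averages example and divides by n, as in the formulas on these slides; many software packages divide by n - 1 instead when estimating the standard deviation from a sample.

```python
import math

# Decayed-teeth counts reused from the averages example
data = [2, 2, 4, 1, 3, 0, 10, 2, 3, 8]
n = len(data)
mean = sum(data) / n

# Range: highest value minus lowest value
data_range = max(data) - min(data)

# Mean deviation: average absolute deviation from the mean
mean_dev = sum(abs(x - mean) for x in data) / n

# Standard deviation: square root of the mean squared deviation (divisor n)
variance = sum((x - mean) ** 2 for x in data) / n
sd = math.sqrt(variance)
```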


39/93

Variance of the data: the square of the standard deviation.

Another way to describe dispersion is to present the interquartile range, i.e. the values at the 25th and 75th percentiles, which are less likely to be influenced by values at the extreme upper and lower ends of the spread of data points.


40/93

For continuous data, the most commonly used measure of central tendency is the mean.

For ordinal data, the median or mode is used to represent the center of the data.

The median is also used as a measure of central tendency for continuous data that are skewed, to minimize the effect of extremely large or small values on the estimate of the center of the data.

Nominal data are summarized by reporting the proportion or percentage of the data that are classified in each level.


42/93

Sample Size and Power

Designing studies with inadequate sample sizes may lead to errors and false conclusions (false negative findings).

False negative findings can occur either by chance or because the study is underpowered.

Careful sample size calculation can guide researchers as to what can and cannot be accomplished in a study with a finite amount of resources.


43/93

Although sample size calculations are performed using mathematical methods, the preparation for the calculation requires both statistical reasoning and clinical experience.

Calculation of sample size requires four things:

1. Deciding on the design of the study

2. Assessing the availability of resources

3. Specifying distribution assumptions

4. Defining a clinically relevant effect
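As an illustration of how these four inputs feed a calculation, the sketch below uses one common textbook approximation for comparing the means of two equal-sized independent groups: n per group = 2 (z for alpha + z for power)^2 x SD^2 / effect^2. The formula, the function name and the example numbers are assumptions for illustration and do not come from the slides; other designs need other formulas.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided comparison of two means.

    sigma: assumed common standard deviation (distribution assumption)
    delta: smallest clinically relevant difference between the means
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Hypothetical inputs: SD of 1.0 unit, clinically relevant difference of 0.5 unit
n = n_per_group(sigma=1.0, delta=0.5)
```

Halving the clinically relevant difference roughly quadruples the required sample size, which is why defining the effect comes before the arithmetic.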


44/93

Inferential statistics


45/93

Inferential statistics are those statistical procedures that compare groups to see if the groups are significantly different from each other.

There are two kinds:

parametric statistics

nonparametric statistics


46/93

Parametric statistics refers to a group of statistical tests that use means and a measure of variation (standard deviation, variance) to help determine if groups are different from each other.


47/93

Certain conditions regarding the data must be met before the simplest parametric tests, based on means and standard deviations, may be validly used.

1. The data must be continuous (measured on a continuous scale, eg, millimeters, pounds, degrees).

2. A frequency plot of the data must look like a normal distribution (bell-shaped curve).

3. The dispersion or spread of data for each variable must be the same in each group being compared (the size of the variance or standard deviation of the variable is the same in each of the groups being compared).


48/93

Distributions

Begin the initial analysis of the data by plotting them on a graph to see how they are distributed.

The points can often be seen to follow some recognized pattern or distribution.

Many patterns of distributions occur in nature. Frequently, these patterns can be described by mathematical functions, which then enable us to determine the likelihood that a data point will fall under a specific area of the distribution curve.


49/93

The normal distribution or Gaussian distribution

Bell-shaped curve

The data cluster around a central point and spread symmetrically around this center point.

The central point is the mean of the sample.

The width of the bell-shaped curve depends on how much variability there is in the data.


50/93


51/93

The way to estimate the amount of variability is to calculate the SD, the square root of the average squared deviation of each data point from the mean value of all the data points.

The larger the SD is, the greater the variability in the data.

The greater the variability is, the wider the shape of the curve.


52/93


53/93

Importance of distribution

Many statistical tests are based on parametric assumptions (ie, the data are assumed to follow a distribution that can be summarized by parameters), requiring the distribution of the data to be normal (bell-shaped).

Many parametric statistical tests are insensitive to mild departures of the data from normality, but severe departures from the normal distribution mandate the use of distribution-free tests, i.e. nonparametric statistics.


54/93

Parametric statistics tend to be more powerful than nonparametric statistics.

This means that they are more likely than nonparametric statistics to detect a significant difference between samples when the difference is real, but use of a parametric test when its assumptions are violated is incorrect.


55/93

Common parametric tests include the

Student t test and

Analysis of variance (ANOVA)


56/93

Ordinal data are analyzed by nonparametric procedures.

Nonparametric statistics use the ranks or medians of the data, rather than means and standard deviations, to make group comparisons.

Common nonparametric tests based on ranks include the Mann-Whitney U test, the Wilcoxon signed rank test, and the Kruskal-Wallis test.

Nonparametric statistical tests are also used for continuous data that are not normally distributed (i.e. that do not follow a bell-shaped curve).


57/93

The most common test to analyze nominal data is the $\chi^2$ (chi-square) test.

Data that are nominal (eg, sex, tooth type) cannot be summarized by means or ordered into ranks.

Ratios or proportions can be determined.


58/93

Test statistics

Statistical procedures comparing samples provide a test statistic or critical ratio that is associated with a probability level (P value).

The probability level is the likelihood or chance that two groups representative of the same population would be chosen and would show a difference at least as big as the one detected.

A P value < .05 means that, if the two groups were samples from the same population, there would be a less than 5% chance (1 in 20) of observing a difference at least this large.

By convention, when P


59/93

Parametric Tests

The Student t test is used when only two groups are being compared.

The Student t test uses sample means and standard deviations to calculate the probability or likelihood that the groups are different.

It helps us to determine if the means differ because the two groups represent two different populations, or if the means differ because the groups have different subjects but each group represents the same population.


60/93

The t test exists in two forms, depending on whether the two groups under comparison are paired (matched) or independent of each other.


61/93

A common paired design occurs when a single group of subjects is measured before and after a procedure to examine the effect of some intervention (eg, treatment).

A matched group study design is one in which the outcome of each subject in the treatment group is compared directly to the outcome in another subject who is as similar as possible to its mate, with the exception of the treatment under investigation.


62/93

An example of a paired study is a comparison of the masticatory efficiency of complete denture wearers with bilateral balanced occlusion before and after selective grinding.


63/93

The two-sample, independent t test is used to compare independent or unmatched groups.

An example is a comparison of masticatory efficiency between bilateral balanced occlusion and lingualised occlusion in complete denture wearers.


64/93

In paired study designs, the number of subjects in both groups is the same, whereas in the two-sample, independent design, the size of the two samples may be different.
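The two forms of the t test differ in what is averaged: the paired form works on the within-subject differences, while the independent form compares two group means using a pooled variance. The sketch below computes only the t statistics; the scores are invented for illustration, the function names are assumptions, and turning t into a P value needs the t distribution with n - 1 or n1 + n2 - 2 degrees of freedom respectively, which is not shown here.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic for paired data: mean difference / standard error of the differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

def independent_t(x, y):
    """t statistic for two independent groups, assuming equal variances (pooled)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))

# Hypothetical masticatory efficiency scores (not real data)
before = [50, 55, 60, 52, 58]
after = [54, 58, 63, 55, 64]
t_paired = paired_t(before, after)

balanced = [54, 58, 63, 55, 64, 60]   # groups of different sizes are allowed here
lingualised = [50, 57, 59, 53, 61]
t_indep = independent_t(balanced, lingualised)
```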


65/93

If more than two groups are being compared, ANOVA is used.

Unlike the t test, which uses the mean and standard deviation of groups for its computations, ANOVA uses the mean and variance of groups for its computations.

The test statistic is the F statistic.

ANOVA compares all the groups simultaneously in a single overall test, rather than through a series of separate pair-wise tests.


66/93

A significant P value indicates that a difference exists somewhere among the groups, but ANOVA does not identify which groups are different.

To determine which pairs differ, post hoc or a posteriori tests are used to examine the groups in detail and reveal which groups significantly differ from each other.

Common post hoc tests are the Tukey-Kramer honestly significant difference, Scheffé, Dunnett, Duncan, and Newman-Keuls tests.
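The F statistic is the ratio of the variance between group means to the variance within groups. A minimal sketch of a one-way ANOVA F calculation follows; the function name and the three small groups are invented for illustration, and the P value lookup and post hoc tests are not included.

```python
def one_way_anova_f(groups):
    """F = mean square between groups / mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical measurements in three groups
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

For these groups the between-group mean square is 13 and the within-group mean square is 1, so F is 13.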


67/93

Nonparametric Tests

A common nonparametric test for comparison of two unpaired samples is the Mann-Whitney U test, also known as the Wilcoxon rank sum test.

It compares the medians of the groups. The test statistic is the U statistic.

Example: grade point averages

The nonparametric test comparable to the paired t test is the Wilcoxon signed rank test.


68/93

The nonparametric test comparable to ANOVA is the Kruskal-Wallis procedure.

It examines intergroup differences based on ranks.


69/93

$\chi^2$ (chi-square) test

Used to analyze nominal data.

It is used to compare the proportions of the data that fall into each level of the nominal variable.
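The chi-square statistic compares the observed count in each cell of a table with the count expected if the proportions were the same in every group. The sketch below computes the statistic for a table of counts; the function name and the 2 x 2 counts are invented for illustration, and no continuity correction or P value lookup is included.

```python
def chi_square(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = two groups, columns = outcome present / absent
chi2 = chi_square([[10, 20], [20, 10]])
```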



70/93

Correlation

To test whether or not two variables bear a linear relationship to each other (ie, whether or not they vary together, either positively or negatively), the technique of Pearson product-moment linear correlation is commonly used.

The correlation coefficient (r) is a dimensionless index of the extent to which the two characteristics vary together.

r can range from +1, denoting a perfect positive relationship, to -1, characteristic of a perfect negative relationship; r = 0 signifies no linear relationship.

In practice, r usually takes intermediate values such as 0.6, -0.3 or 0.1.
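The Pearson coefficient can be computed from the deviations of each variable from its mean. A minimal sketch follows; the function name and the sugar exposure and carious lesion values are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: sugar exposures per day and number of carious lesions
sugar = [1, 2, 3, 4, 5, 6]
lesions = [1, 3, 2, 5, 4, 7]
r = pearson_r(sugar, lesions)
```

Points lying exactly on a rising straight line give r = +1, and points on a falling line give r = -1.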


71/93

Regression

If a linear relationship is statistically significant and strong enough to be of practical use, the next step is to model it mathematically in the form of a prediction equation so that it can be used clinically.

$Y = A + BX$

where A is the intercept and B is the slope of the line.
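One common way to obtain A and B is ordinary least squares, which chooses the line that minimizes the squared vertical distances of the points from it. The slides do not name a fitting method, so least squares is an assumption here, as are the function names and the data.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept A and slope B in Y = A + B*X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, x_new):
    """Use the fitted prediction equation for a new value of X."""
    return a + b * x_new

# Hypothetical points lying exactly on the line Y = 1 + 2X
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
y_hat = predict(a, b, 4)
```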


72/93

Regression and correlation are closely related: correlation deals with the strength of a linear relationship and regression with its form.


73/93

Multivariate Analysis


74/93

A statistical analysis that involves more than one dependent variable.

The analysis of simultaneous relationships among several variables.

Examining simultaneously the effects of age, sex, and social class on hypertension would be an example of multivariate analysis.


75/93

It considers the interrelationships of several traits at a time.

Multivariate analysis comprises a set of techniques dedicated to the analysis of data sets with more than one variable.


76/93

One data set

Interval or ratio level of measurement: principal component analysis (PCA)

Nominal or ordinal level of measurement: correspondence analysis (CA), multiple correspondence analysis (MCA)

Similarity or distance: multidimensional scaling (MDS)

Multidimensional scaling (MDS) is a set of related statistical techniques often used in data visualization for exploring similarities or dissimilarities in data.


77/93

Two data sets

Case one: one independent variable set and one dependent variable set

- Multiple linear regression analysis (MLR)

- Regression with too many predictors and/or several dependent variables:

Partial least squares (PLS) regression (PLSR)

Principal component regression (PCR)

Ridge regression (RR)

Reduced rank regression (RRR) or redundancy analysis

- Multivariate analysis of variance (MANOVA)

- Predicting a nominal variable: discriminant analysis (DA)

- Fitting a model: confirmatory factor analysis (CFA)


78/93

Two (or more) dependent variable sets:

Canonical correlation analysis (CC)

Multiple factor analysis (MFA)

Multiple correspondence analysis (MCA)

Procrustean analysis (PA)


79/93

Regression analysis

In statistics, regression analysis is used to model relationships between random variables and to determine the magnitude of the relationships between variables, and it can be used to make predictions based on the models.


80/93

Predictor variables may be defined quantitatively or qualitatively (categorically).

If the predictors are all quantitative, one performs multiple regression.

If the predictors are all qualitative, one performs analysis of variance.

If some predictors are quantitative and some qualitative, one performs an analysis of covariance.


81/93

If two or more independent variables are correlated, we say that the variables are multicollinear.

Multicollinearity results in parameter estimates that are unbiased and consistent, but which may have relatively large variances.


82/93

Many patterns of distributions occur in nature. Frequently, these patterns can be described by mathematical functions.

The most common statistical tests can be applied to data that are normally distributed.

What if the data obtained are not normally distributed?

A log transformation of the data towards a normal distribution is undertaken.

The usual statistical tests can then be applied to the log-transformed data, provided the transformed values are approximately normally distributed.


83/93

Logistic regression

In statistics, logistic regression is a model used for prediction of the probability of occurrence of an event.

It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index.


84/93

The input is z and the output is f(z). The logistic function is useful because it can take as an input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1.

The variable z represents the exposure to some set of risk factors, while f(z) represents the probability of a particular outcome, given that set of risk factors. The variable z is a measure of the total contribution of all the risk factors used in the model and is known as the logit.


85/93

$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots$

$\beta_0$ is the intercept: it is the value of z when the other risk factors are absent.

$\beta_1$, $\beta_2$ and $\beta_3$ are regression coefficients.

$x_1$, $x_2$ and $x_3$ are risk factors for heart disease.

The application of a logistic regression may be illustrated


86/93

using a fictitious example of death from heart disease. This simplified model uses only three risk factors (age, sex and cholesterol) to predict the 10-year risk of death from heart disease.

$\beta_0 = -5.0$ (the intercept)

$\beta_1 = +2.0$

$\beta_2 = -1.0$

$\beta_3 = +1.2$

$x_1$ = age in decades

$x_2$ = sex, where 0 is male and 1 is female

$x_3$ = cholesterol level, in mmol/dl

Risk of death $= \dfrac{1}{1 + e^{-z}}$, where $z = -5.0 + 2.0x_1 - 1.0x_2 + 1.2x_3$
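The fictitious model can be evaluated in a few lines. The coefficients below are the ones given on this slide; the function name and the input values passed to it are arbitrary illustrations, not clinical data.

```python
import math

def risk_of_death(x1, x2, x3):
    """Logistic model from the slide: risk = 1 / (1 + e^(-z)).

    x1, x2 and x3 are the age, sex and cholesterol terms as defined on the slide.
    """
    z = -5.0 + 2.0 * x1 - 1.0 * x2 + 1.2 * x3   # the logit
    return 1 / (1 + math.exp(-z))

# Arbitrary illustrative inputs: same x1 and x3, differing only in sex
risk_male = risk_of_death(1.0, 0, 1.0)
risk_female = risk_of_death(1.0, 1, 1.0)
```

Because the coefficient for sex is negative, the female subject has the lower predicted risk when the other inputs are equal, and whatever the inputs the output stays between 0 and 1.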


87/93

Discriminant Analysis

Discriminant function (modified Maddrey's discriminant function): originally described by Maddrey and Boitnott to predict prognosis in alcoholic hepatitis.

Canonical variate analysis attempts to establish whether a set of variables can be used to distinguish between two or more groups.


88/93

Suppose we have two samples representing different populations.

We measure one character for them and find that, although their means for this character are not identical, their distributions overlap considerably, so that on the basis of this character one could not, with any degree of accuracy, identify an unknown specimen as belonging to one or the other of the two populations.

A second character may also differentiate them somewhat, but not absolutely.

Two variables, say X1 and X2, can be used together to distinguish them.


89/93

Discriminant function analysis computes a new variable, say Z, which is a linear function of both variables X1 and X2.

This function is constructed in such a way that as many as possible of the members of one population have high values of Z and as many as possible of the members of the other have low values, so that Z serves as a much better discriminator between the two populations than do the variables X1 and X2 taken singly.


90/93

Example: blood pressure, cholesterol levels and blood sugar differ between those who are obese and those who are normal in body build.

Discriminant function analysis can be utilised for assessing the combined effect of the factors that differ between the two groups of subjects.


91/93

Meta-analysis

In statistics, a meta-analysis combines the results of several studies that address a set of related research hypotheses.

The first meta-analysis was performed by Karl Pearson in 1904, in an attempt to overcome the problem of reduced statistical power in studies with small sample sizes; analyzing the results from a group of studies can allow more accurate data analysis.


92/93

CONCLUSION

Understanding the complexities of statistical modeling not only enables the use of test characteristics in the actual design of diagnostic tests; familiarity with fundamental concepts will also facilitate insight into, and critical evaluation of, research that relies on such methodology.


93/93

Thank you
