Statistics

By Manish Runthla

Definition of Statistics

• Statistical analysis involves the process of collecting and analyzing data and then summarizing the data into a numerical form.

• Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

• A collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

Characteristics

• Aggregate of facts
• Affected to a marked extent by a multiplicity of causes
• Numerically expressed
• Estimated according to a reasonable standard of accuracy
• Collected in a systematic manner
• Collected for a predetermined purpose

Need for the study of statistics

• Business
• Mathematics
• Accounting
• Economics
• Banking
• Management & administration
• Astronomy

Tools of statistics

• Mean: the average
• Median: the middle value
• Mode: the value with the highest frequency
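As a minimal sketch of these three tools, the snippet below uses Python's standard `statistics` module. The data values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical observations (not taken from the slides).
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

With an even number of observations, as here, the median is taken as the average of the two middle values.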

Data

• Observations (such as measurements, genders, survey responses) that have been collected.

• Information in raw or unorganized form (such as letters, numbers, or symbols) that refers to, or represents, conditions, ideas, or objects.

Data collection

• Data collection is a systematic approach to gathering information from a variety of sources to get a complete and accurate picture of an area of interest.

Methods of collection:

• Primary data
• Internal records: a kind of secondary data that is not published, such as a bank's customer records
• Secondary data

Classification of data

• Geography: area, country, state, region
• Polity: ideology (socialistic, capitalist, sovereign)
• Demography: age, religion, gender, height, weight, any individual characteristics
• Income: low income, mid income, high income
• Quantitative data
• Qualitative data: preference, likability, other characteristics pertaining to the individual
• Chronological: sorted in ascending or descending order

Mathematics

• The study of the measurement, properties, and relationships of quantities and sets, using numbers and symbols. Mathematics is commonly expressed in the form of equations.

• Mathematics is a group of related sciences, including algebra, geometry, and calculus, concerned with the study of number, quantity, shape, and space and their interrelationships by using a specialized notation.

Branches of mathematics

• Foundations: formulation and analysis of language
• Algebra: study of one or several variables
• Arithmetic
• Analysis
• Geometry: concerned with the axiomatic study of polygons, conic sections, spheres, polyhedra, and related geometric objects in two and three dimensions
• Applied mathematics: numerical methods and computer science, which seek concrete solutions, sometimes approximate, to explicit mathematical problems

Central tendency

• Central tendency is a central or typical value for a probability distribution.

Objectives:
• To get one single value that describes the characteristics of the entire data
• To facilitate comparison

Computation:
• Mean
• Median
• Mode

Dispersion

• A statistical term describing the size of the range of values expected for a particular variable.
• Index of dispersion: a normalized measure of the dispersion of a probability distribution.
• Price dispersion: a variation in prices across sellers of the same item.
• Wage dispersion: the amount of variation in wages encountered in an economy.

Objectives

• To determine the reliability of an average
• To serve as a basis for the control of variability
• To compare two or more series
• To facilitate the use of other statistical measures

Measures of dispersion include:

• Range
• Interquartile range
• Variance
• Mean deviation
• Standard deviation
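Mean deviation is listed above but not defined elsewhere in these slides. The sketch below assumes the common definition, the average absolute deviation of the values from their mean, and uses hypothetical data.

```python
# Hypothetical observations (not taken from the slides).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Assumed definition: average of the absolute deviations from the mean.
mean_deviation = sum(abs(x - mean) for x in values) / len(values)

print(mean_deviation)
```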

Range

• The range of a set of data is the difference between the largest (highest) and smallest (lowest) values.
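A minimal illustration of the range, using hypothetical values:

```python
# Hypothetical observations (not taken from the slides).
values = [4, 8, 15, 16, 23, 42]

# Range = largest value minus smallest value.
data_range = max(values) - min(values)

print(data_range)
```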

Interquartile range

• A measure of statistical dispersion, equal to the difference between the upper and lower quartiles: IQR = Q3 − Q1.
• Q1 is the first (lower) quartile.
• Q3 is the third (upper) quartile.
• If the actual values of the first or third quartiles of a population P differ substantially from the values calculated for a normal distribution with the same mean and standard deviation, P is not normally distributed.
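The interquartile range can be sketched with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The data are hypothetical, and quartile conventions differ between textbooks and software, so other tools may return slightly different quartiles for the same data.

```python
import statistics

# Hypothetical observations (not taken from the slides).
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(values, n=4)

iqr = q3 - q1  # difference between the upper and lower quartiles

print(q1, q3, iqr)
```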

Variance

• Variance measures how far a set of numbers is spread out. A variance of zero indicates that all the values are identical. Variance is always non-negative: a small variance indicates that the data points tend to be very close to the mean (expected value) and hence to each other, while a high variance indicates that the data points are very spread out around the mean and from each other.

• The variance is a numerical value used to indicate how widely individuals in a group vary. If individual observations vary greatly from the group mean, the variance is large; if they vary little, it is small.

$$\text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
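The variance formula above can be checked with a short sketch. The data are hypothetical; the hand computation divides by N, and `statistics.pvariance` is the standard-library function that uses the same divisor.

```python
import statistics

# Hypothetical observations (not taken from the slides).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Mean of the squared deviations from the mean (divisor N).
variance = sum((x - mean) ** 2 for x in values) / n

# The standard library's population variance should agree.
library_variance = statistics.pvariance(values)

print(variance, library_variance)
```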

Standard deviation

• The square root of the arithmetic mean of the squares of the deviations of the values from their mean. Standard deviation is denoted by the small Greek letter σ (read as sigma) and is also called the root mean square deviation.

• Put another way, the standard deviation is the square root of the quotient obtained by dividing the sum of the squared differences of each observation from the mean by the number of observations in the sample or population.

• A standard deviation close to 0 indicates that the data points tend to be very close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

• In short: the square root of the variance, that is, of the mean of the squared deviations.
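As a small sketch of this definition, the snippet below takes the square root of the mean squared deviation for hypothetical data and compares it with `statistics.pstdev`, which uses the same divisor N.

```python
import math
import statistics

# Hypothetical observations (not taken from the slides).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Root mean square deviation from the mean.
std_dev = math.sqrt(sum((x - mean) ** 2 for x in values) / n)

library_std_dev = statistics.pstdev(values)

print(std_dev, library_std_dev)
```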

Skewness

• Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

$$g_1 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3}$$

• A symmetrical distribution has a skewness of zero.
• An asymmetrical distribution with a long tail to the right (higher values) has positive skew.
• An asymmetrical distribution with a long tail to the left (lower values) has negative skew.
• Skewness is unitless.
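The skewness formula and the sign rules above can be illustrated with a short sketch. It assumes that s is the standard deviation computed with N in the denominator; the function name and the data sets are hypothetical.

```python
import math

def skewness(values):
    """Third central moment divided by s cubed (s computed with divisor N)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((y - mean) ** 2 for y in values) / n)
    return (sum((y - mean) ** 3 for y in values) / n) / s ** 3

symmetric = [1, 2, 3, 4, 5]      # same shape left and right of the center
right_tailed = [1, 1, 1, 2, 10]  # long tail toward higher values
left_tailed = [-10, -2, -1, -1, -1]  # long tail toward lower values

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
```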

Kurtosis

• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.

$$\text{kurtosis} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^4 / N}{s^4}$$

• Kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population.
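A minimal sketch of the kurtosis formula above, again assuming that s is computed with N in the denominator. Under this formula a normal distribution has a kurtosis of 3, so values below 3 indicate a flatter shape and values above 3 a more peaked, heavier-tailed one. The function name and data are hypothetical.

```python
import math

def kurtosis(values):
    """Fourth central moment divided by s to the fourth (s uses divisor N)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((y - mean) ** 2 for y in values) / n)
    return (sum((y - mean) ** 4 for y in values) / n) / s ** 4

flat = [1, 2, 3, 4, 5]                     # evenly spread, no distinct peak
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]   # sharp peak with two extreme values

print(kurtosis(flat), kurtosis(peaked))
```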

Correlation

• The correlation coefficient is a measure of the linear dependence between two variables X and Y, giving a value between +1 and −1 inclusive, where +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation. It is widely used in the sciences as a measure of the degree of linear dependence between two variables.
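The slide does not name a specific coefficient. The sketch below assumes the Pearson correlation coefficient, which matches the description of a linear measure bounded by −1 and +1. The function name and data are hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson_correlation(x, [2 * v + 1 for v in x])  # perfectly increasing line
r_negative = pearson_correlation(x, [-v for v in x])         # perfectly decreasing line

print(r_positive, r_negative)
```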

Sampling

• A sample is drawn from a population, which is a predefined set of potential respondents in a geographical area.

• The most common sampling element in marketing research is the human respondent, who could be:
- A consumer
- A potential consumer
- A dealer or retailer
- A person exposed to an advertisement

Types of sampling

• Probabilistic sampling: in a probabilistic sampling technique, each sampling unit (household or individual) has a known probability of being included in the sample.
1. Simple random sampling
2. Stratified random sampling
3. Cluster sampling
4. Systematic sampling
5. Multistage or combination sampling

• Non-probabilistic sampling:
- Quota sampling (a fixed number)
- Judgment sampling
- Convenience sampling
- Snowball sampling
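Three of the probabilistic techniques listed above can be sketched with Python's `random` module. The population of 100 unit IDs, the 10% sampling fraction, and the two strata are all hypothetical choices made for illustration.

```python
import random

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical sampling frame of 100 unit IDs

# Simple random sampling: every unit has the same chance of selection.
simple_sample = rng.sample(population, 10)

# Systematic sampling: pick a random start, then take every k-th unit.
step = len(population) // 10
start = rng.randrange(step)
systematic_sample = population[start::step]

# Stratified random sampling: draw a random sample within each stratum.
strata = {"urban": population[:60], "rural": population[60:]}  # hypothetical strata
stratified_sample = {
    name: rng.sample(units, len(units) // 10) for name, units in strata.items()
}

print(simple_sample, systematic_sample, stratified_sample)
```

Drawing the same fraction from each stratum keeps the strata represented in proportion to their size in the population.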

• An enumeration is a complete, ordered listing of all the items in a collection.

Methods of enumeration:

• Explicit complete enumeration: Full enumeration of all possible alternatives and comparison of all of them to pick the best solution.

• Implicit complete enumeration: Parts of the solution space that are definitely sub-optimal are excluded. This reduces complexity because only the most promising solutions have to be considered. For implicit complete enumeration, methods like Branch & Bound, limited enumeration and dynamic optimization can be used.

• Incomplete enumeration: Selecting alternatives by only looking at parts of the solution space by applying certain heuristics. This provides approximate solutions, but not necessarily optimal ones.
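The difference between explicit complete enumeration and incomplete (heuristic) enumeration can be sketched on a tiny selection problem. The items, the capacity limit, and the greedy rule are hypothetical and are not taken from the slides.

```python
from itertools import combinations

# Hypothetical items as (name, weight, value) and a hypothetical capacity.
items = [("a", 6, 30), ("b", 5, 20), ("c", 5, 20)]
capacity = 10

def total(subset, index):
    """Sum one field (1 = weight, 2 = value) over a subset of items."""
    return sum(item[index] for item in subset)

# Explicit complete enumeration: evaluate every subset that fits the capacity.
feasible = [
    subset
    for size in range(len(items) + 1)
    for subset in combinations(items, size)
    if total(subset, 1) <= capacity
]
best = max(feasible, key=lambda subset: total(subset, 2))

# Incomplete enumeration: greedy heuristic by value per unit of weight.
greedy, remaining = [], capacity
for item in sorted(items, key=lambda it: it[2] / it[1], reverse=True):
    if item[1] <= remaining:
        greedy.append(item)
        remaining -= item[1]

print(total(best, 2), total(greedy, 2))
```

In this example the heuristic only looks at part of the solution space and settles for a lower total value than the full enumeration finds, which is the trade-off described above.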

Sampling error

• Sampling error is incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population. Since the sample does not include all members of the population, statistics on the sample, such as means and quantiles, generally differ from parameters on the entire population.

• Sampling bias
• Random sampling
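Sampling error can be illustrated by comparing a sample mean with the mean of the full population. The population below is simulated, and its size, mean, and spread are hypothetical.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 measurements.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Simple random sample of 100 members of that population.
sample = rng.sample(population, 100)

population_mean = statistics.mean(population)  # parameter
sample_mean = statistics.mean(sample)          # statistic

# Sampling error: how far the sample statistic is from the parameter.
sampling_error = sample_mean - population_mean

print(population_mean, sample_mean, sampling_error)
```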

Time series

• A Time series is a sequence of data points, typically consisting of successive measurements made over a time interval.

• Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, intelligent transport and trajectory forecasting, earthquake prediction, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.

• Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data.

Components of time series:

• Secular trend
• Seasonal variation
• Cyclical variation
• Irregular variation

Components of time series

• Secular trend: A time series may show an upward or downward trend over a period of years. This may be due to factors such as an increase in population, technological progress, or a large-scale shift in consumer demand. The secular trend results from the long-term effect of socio-economic and political factors and shows the growth or decline in a time series over a long period.

• Seasonal variation: Seasonal variations are short-term fluctuations in a time series which occur periodically within a year and repeat year after year. The major factors responsible for this repetitive pattern are weather conditions and the customs of people. For example, more woolen clothes are sold in winter than in summer.

• Cyclical variation: Cyclical variations are recurrent upward or downward movements in a time series whose period is greater than a year. These variations are not as regular as seasonal variations, and cycles vary in length and size. The ups and downs in business activity are the effects of cyclical variation.

• Irregular variation: Irregular variations are fluctuations in a time series that are short in duration, erratic in nature, and follow no regular pattern of occurrence. They are sudden changes which are unlikely to be repeated. These variations are also referred to as residual variations since, by definition, they represent what is left in a time series after the trend, cyclical, and seasonal variations are removed. Irregular fluctuations result from unforeseen events such as floods, earthquakes, wars, and famines.
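Separating a trend from a seasonal pattern can be sketched on a synthetic quarterly series. The sketch assumes an additive model (series = trend + seasonal effect), which the slides do not specify, and uses a centered moving average as one common way to estimate the trend. All numbers are hypothetical, and the series contains no cyclical or irregular component.

```python
period = 4
seasonal_pattern = [5.0, -2.0, -4.0, 1.0]  # hypothetical quarterly effects (sum to zero)

# Synthetic series: linear upward trend plus a repeating seasonal effect.
series = [100.0 + 2.0 * t + seasonal_pattern[t % period] for t in range(16)]

# Centered 2x4 moving average: averages out a pattern that repeats every 4 points.
weights = [0.125, 0.25, 0.25, 0.25, 0.125]
trend = {
    t: sum(w * series[t - 2 + k] for k, w in enumerate(weights))
    for t in range(2, len(series) - 2)
}

# Average the detrended values quarter by quarter to estimate seasonality.
seasonal = []
for quarter in range(period):
    detrended = [series[t] - trend[t] for t in trend if t % period == quarter]
    seasonal.append(sum(detrended) / len(detrended))

print(trend[5], seasonal)
```

In real data, whatever remains after removing the estimated trend and seasonal effects would correspond to the cyclical and irregular variations.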
