Statistics

This page provides an introduction to Statistics.

Overview

Statistics involves summarizing and describing the main features of a dataset, as well as drawing conclusions and making decisions based on data.

Data comes in three forms, described below:

Ungrouped Data

Data in which each observation is listed separately, with no grouping or classification.

For example, the marks of $10$ students:

$1, 2, 5, 4, 1, 3, 7, 1, 3, 9$

Discrete Frequency Data

Data in which observations are grouped into distinct values, with a frequency giving the number of times each value appears.

For example, student scores in a math test:

| Score | Number of students with this score |
|---|---|
| 40 | 2 |
| 45 | 5 |
| 50 | 8 |
| 55 | 3 |
| 60 | 2 |
| 70 | 1 |

Continuous Frequency Data

Data in which observations are grouped into continuous intervals or ranges, with a frequency giving the number of observations in each interval.

For example, heights of people:

| Height interval | Number of people whose height falls in this interval |
|---|---|
| 150-155 | 5 |
| 155-160 | 8 |
| 160-165 | 12 |
| 165-170 | 10 |
| 170-175 | 6 |

Measure of Central Tendency

Mean, median and mode are measures of central tendency. Each is a single number that represents the whole dataset.

Mean

The mean is the arithmetic average of the observations.

For ungrouped data, mean formula is:

$$\frac{\sum \text{x}_\text{i}}{\text{n}}$$

Where,

$\text{x}_\text{i}$ is the $\text{i}^{\text{th}}$ observation,
$\text{n}$ is the number of observations.

For discrete frequency data, mean formula is:

$$\frac{\sum_{\text{i} = 1}^\text{n} \text{x}_\text{i} * \text{f}_\text{i}}{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i}}$$

Where,

$\text{x}_\text{i}$ is the $\text{i}^{\text{th}}$ observation group,
$\text{f}_\text{i}$ is the frequency of the $\text{i}^{\text{th}}$ observation group,
$\text{n}$ is the number of observation groups.

For continuous frequency data, mean formula is:

$$\frac{\sum_{\text{i} = 1}^\text{n} \text{x}_\text{i} * \text{f}_\text{i}}{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i}}$$

Where,

$\text{x}_\text{i}$ is the midpoint of the $\text{i}^{\text{th}}$ observation class interval,
$\text{f}_\text{i}$ is the frequency of the $\text{i}^{\text{th}}$ observation class interval,
$\text{n}$ is the number of observation class intervals.
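The three mean formulas can be sketched in Python using the example data from the tables above. The function and variable names are illustrative only, and the continuous case assumes each class interval is represented by its midpoint.

```python
def mean_ungrouped(observations):
    """Mean of ungrouped data: sum of observations divided by their count."""
    return sum(observations) / len(observations)


def mean_grouped(values, frequencies):
    """Mean of frequency data: sum(x_i * f_i) / sum(f_i)."""
    total_frequency = sum(frequencies)
    weighted_sum = sum(x * f for x, f in zip(values, frequencies))
    return weighted_sum / total_frequency


# Ungrouped data: marks of 10 students.
marks = [1, 2, 5, 4, 1, 3, 7, 1, 3, 9]

# Discrete frequency data: scores and the number of students with each score.
scores = [40, 45, 50, 55, 60, 70]
students = [2, 5, 8, 3, 2, 1]

# Continuous frequency data: height intervals and the number of people in each.
# Assumption: each interval is represented by its midpoint.
height_classes = [(150, 155), (155, 160), (160, 165), (165, 170), (170, 175)]
people = [5, 8, 12, 10, 6]
midpoints = [(low + high) / 2 for low, high in height_classes]

print(mean_ungrouped(marks))                      # 3.6
print(round(mean_grouped(scores, students), 3))   # 50.476
print(round(mean_grouped(midpoints, people), 3))  # 162.988
```

The same `mean_grouped` helper serves both grouped cases because the two formulas differ only in what $\text{x}_\text{i}$ stands for.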

Median

The median is the middle value of the data when it is arranged in order.

For ungrouped data, median is calculated as below:

First, arrange the given data in ascending or descending order.

  • If the total number of observations $\text{n}$ is odd, then the median is the $\left(\frac{\text{n} + 1}{2}\right)^{\text{th}}$ term.
  • If the total number of observations $\text{n}$ is even, then the median is the arithmetic mean of the $\left(\frac{\text{n}}{2}\right)^{\text{th}}$ and $\left(\frac{\text{n} + 2}{2}\right)^{\text{th}}$ terms.

For discrete frequency data, median is calculated as below:

  • First arrange all observations in increasing order.
  • Now calculate the cumulative frequency ($\text{C}_\text{f}$) of each observation.
  • The median is the observation ($\text{x}_\text{i}$) whose $\text{C}_\text{f}$ is equal to or just greater than $\frac{\text{Sum of all frequencies}}{2}$.

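For continuous frequency data, median is calculated as below:

  • First arrange all class intervals in increasing order.
  • Now calculate the cumulative frequency ($\text{C}_\text{f}$) of each class interval.
  • The median class is the class interval whose $\text{C}_\text{f}$ is equal to or just greater than $\frac{\text{Sum of all frequencies}}{2}$.
  • The median is then interpolated within the median class as $\text{l} + \left(\frac{\frac{\text{N}}{2} - \text{C}}{\text{f}}\right) * \text{h}$, where $\text{l}$ is the lower limit of the median class, $\text{N}$ is the sum of all frequencies, $\text{C}$ is the cumulative frequency of the class before the median class, $\text{f}$ is the frequency of the median class and $\text{h}$ is the width of the class interval.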
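The median procedures can be sketched as follows, using the example data from earlier in this page. The function names are illustrative. For continuous data the sketch locates the median class by cumulative frequency and then applies the usual textbook interpolation $\text{l} + \frac{\text{N}/2 - \text{C}}{\text{f}} * \text{h}$ inside that class.

```python
def median_ungrouped(observations):
    ordered = sorted(observations)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th term, which is index n // 2 when counting from 0.
        return ordered[n // 2]
    # Even n: mean of the (n / 2)-th and ((n + 2) / 2)-th terms.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def median_discrete(values, frequencies):
    half = sum(frequencies) / 2
    cumulative = 0
    for x, f in sorted(zip(values, frequencies)):
        cumulative += f
        # First observation whose cumulative frequency reaches half the total.
        if cumulative >= half:
            return x


def median_continuous(classes, frequencies):
    half = sum(frequencies) / 2
    cumulative_before = 0  # cumulative frequency of the classes already passed
    for (low, high), f in sorted(zip(classes, frequencies)):
        if cumulative_before + f >= half:
            # Median class found: interpolate inside it.
            return low + (half - cumulative_before) / f * (high - low)
        cumulative_before += f


marks = [1, 2, 5, 4, 1, 3, 7, 1, 3, 9]
scores, students = [40, 45, 50, 55, 60, 70], [2, 5, 8, 3, 2, 1]
height_classes = [(150, 155), (155, 160), (160, 165), (165, 170), (170, 175)]
people = [5, 8, 12, 10, 6]

print(median_ungrouped(marks))                    # 3.0
print(median_discrete(scores, students))          # 50
print(median_continuous(height_classes, people))  # 163.125
```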

Mode

The mode is the most frequent value.

For ungrouped data, mode is:

The observation $\text{x}_\text{i}$ occurring the maximum number of times.

For discrete frequency data, mode is:

The observation $\text{x}_\text{i}$ which has the highest value of $\text{f}_\text{i}$.

Where,

$\text{x}_\text{i}$ is the value of the observation group,
$\text{f}_\text{i}$ is the frequency of the observation group.

For continuous frequency data, mode formula is:

$$\text{l} + \left(\frac{\text{f}_1 - \text{f}_0}{2\text{f}_1 - \text{f}_0 - \text{f}_2}\right) * \text{h}$$

Where,

$\text{l}$ is the lower limit of the modal class,
$\text{f}_0$ is the frequency of the class preceding the modal class,
$\text{f}_1$ is the frequency of the modal class,
$\text{f}_2$ is the frequency of the class succeeding the modal class,
$\text{h}$ is the width of the class interval.

The modal class is the class interval whose frequency is greatest.

Note: If the modal class is the last class interval, then the value of $\text{f}_2$ will be $0$.
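A minimal Python sketch of the three mode rules, applied to the example data from earlier in this page. The function names are illustrative. Treating $\text{f}_0$ as $0$ when the class with the greatest frequency is the first class is an assumption made here by symmetry with the note above, and the sketch does not handle the case where the denominator $2\text{f}_1 - \text{f}_0 - \text{f}_2$ is zero.

```python
from collections import Counter


def mode_ungrouped(observations):
    # most_common(1) returns a list with the single (value, count) pair of highest count.
    value, _count = Counter(observations).most_common(1)[0]
    return value


def mode_discrete(values, frequencies):
    # The observation with the highest frequency (first one on ties).
    return values[frequencies.index(max(frequencies))]


def mode_continuous(classes, frequencies):
    i = frequencies.index(max(frequencies))  # index of the class with greatest frequency
    low, high = classes[i]
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0  # assumption: 0 if there is no previous class
    f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0  # 0 if it is the last class
    return low + (f1 - f0) / (2 * f1 - f0 - f2) * (high - low)


marks = [1, 2, 5, 4, 1, 3, 7, 1, 3, 9]
scores, students = [40, 45, 50, 55, 60, 70], [2, 5, 8, 3, 2, 1]
height_classes = [(150, 155), (155, 160), (160, 165), (165, 170), (170, 175)]
people = [5, 8, 12, 10, 6]

print(mode_ungrouped(marks))                              # 1
print(mode_discrete(scores, students))                    # 50
print(round(mode_continuous(height_classes, people), 3))  # 163.333
```

For the height data the class with the greatest frequency is 160-165, so $\text{f}_0 = 8$, $\text{f}_1 = 12$, $\text{f}_2 = 10$ and the result is $160 + \frac{4}{6} * 5$.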

Measure of Dispersion

A measure of dispersion describes how spread out the data is, which tells us how well a measure of central tendency represents the dataset.

There are $4$ measures of dispersion:

  • Range
  • Mean deviation about $\alpha$, where $\alpha$ can be mean, median or mode
  • Variance ($\sigma^2$)
  • Standard Deviation ($\sigma$)

Range

Range is the difference between the largest and smallest values in a dataset.

For all types of data, range formula is:

$$\text{largest value} - \text{smallest value}$$
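As a quick illustration, the range of the example datasets from earlier in this page can be computed as below. The variable names are illustrative. For the continuous data, taking the largest value as the upper limit of the highest class and the smallest value as the lower limit of the lowest class is an assumption of this sketch.

```python
marks = [1, 2, 5, 4, 1, 3, 7, 1, 3, 9]
scores = [40, 45, 50, 55, 60, 70]
height_classes = [(150, 155), (155, 160), (160, 165), (165, 170), (170, 175)]

range_ungrouped = max(marks) - min(marks)   # 9 - 1
range_discrete = max(scores) - min(scores)  # 70 - 40

# Assumption: use the outermost class limits for continuous data.
lowest_limit = min(low for low, _high in height_classes)
highest_limit = max(high for _low, high in height_classes)
range_continuous = highest_limit - lowest_limit  # 175 - 150

print(range_ungrouped, range_discrete, range_continuous)  # 8 30 25
```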

Mean Deviation

Mean deviation is the average absolute distance between each value in a dataset and the mean value. It can also be calculated about the median or the mode.

For ungrouped data, mean deviation formula is:

$$\frac{\sum_{\text{i} = 1}^\text{n} |\text{x}_\text{i} - \overline{\text{x}}|}{\text{n}}$$

Where,

$\text{x}_\text{i}$ is the $\text{i}^{\text{th}}$ observation,
$\overline{\text{x}}$ is the mean of all the observations,
$\text{n}$ is the total number of observations.

Note: By replacing $\overline{\text{x}}$ in the above formula with the median or the mode, we can calculate the mean deviation about the median and the mean deviation about the mode respectively.

For discrete frequency data, mean deviation formula is:

$$\frac{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i} * |\text{x}_\text{i} - \overline{\text{x}}|}{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i}}$$

Where,

$\text{x}_\text{i}$ is the $\text{i}^{\text{th}}$ observation group,
$\overline{\text{x}}$ is the mean of all the observations for discrete frequency data,
$\text{f}_\text{i}$ is the frequency of the $\text{i}^{\text{th}}$ observation group.

Note: By replacing $\overline{\text{x}}$ in the above formula with the median or the mode, we can calculate the mean deviation about the median and the mean deviation about the mode respectively.

For continuous frequency data, mean deviation formula is:

$$\frac{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i} * |\text{x}_\text{i} - \overline{\text{x}}|}{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i}}$$

Where,

$\text{x}_\text{i}$ is the midpoint of the $\text{i}^{\text{th}}$ observation class interval,
$\overline{\text{x}}$ is the mean of all the observations for continuous frequency data,
$\text{f}_\text{i}$ is the frequency of the $\text{i}^{\text{th}}$ observation class interval.

Note: By replacing $\overline{\text{x}}$ in the above formula with the median or the mode, we can calculate the mean deviation about the median and the mean deviation about the mode respectively.
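The mean deviation formulas share one shape, so a single helper can cover all three data types. The sketch below is illustrative: `mean_deviation` and its `about` parameter are hypothetical names, ungrouped data is treated as every observation having frequency $1$, and continuous data is passed in as class midpoints.

```python
def mean_deviation(values, frequencies=None, about=None):
    """Frequency-weighted mean of |x_i - about|.

    frequencies defaults to 1 per value (ungrouped data);
    about defaults to the mean, but can be set to the median or mode.
    """
    if frequencies is None:
        frequencies = [1] * len(values)
    total = sum(frequencies)
    if about is None:
        about = sum(x * f for x, f in zip(values, frequencies)) / total
    return sum(f * abs(x - about) for x, f in zip(values, frequencies)) / total


marks = [1, 2, 5, 4, 1, 3, 7, 1, 3, 9]
scores, students = [40, 45, 50, 55, 60, 70], [2, 5, 8, 3, 2, 1]
midpoints, people = [152.5, 157.5, 162.5, 167.5, 172.5], [5, 8, 12, 10, 6]

print(round(mean_deviation(marks), 2))              # 2.12 (about the mean, 3.6)
print(mean_deviation(marks, about=3))               # 2.0  (about the median, 3)
print(round(mean_deviation(scores, students), 3))   # 4.966
print(round(mean_deviation(midpoints, people), 3))  # 4.985
```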

Variance

Variance is the average of the squared differences between each value in a dataset and the mean value. It is denoted by $\sigma^2$.

For ungrouped data, variance formula is:

$$\frac{\sum_{\text{i} = 1}^\text{n} (\text{x}_\text{i})^2}{\text{n}} - (\overline{\text{x}})^2$$

Where,

$\text{x}_\text{i}$ is the $\text{i}^{\text{th}}$ observation,
$\overline{\text{x}}$ is the mean of all the observations,
$\text{n}$ is the total number of observations.

For discrete frequency data, variance formula is:

$$\frac{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i} * (\text{x}_\text{i} - \overline{\text{x}})^2}{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i}}$$

Where,

$\text{x}_\text{i}$ is the $\text{i}^{\text{th}}$ observation group,
$\text{f}_\text{i}$ is the frequency of the $\text{i}^{\text{th}}$ observation group,
$\text{n}$ is the number of observation groups.

For continuous frequency data, variance formula is:

$$\frac{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i} * (\text{x}_\text{i} - \overline{\text{x}})^2}{\sum_{\text{i} = 1}^\text{n} \text{f}_\text{i}}$$

Where,

$\text{x}_\text{i}$ is the midpoint of the $\text{i}^{\text{th}}$ observation class interval,
$\text{f}_\text{i}$ is the frequency of the $\text{i}^{\text{th}}$ observation class interval,
$\text{n}$ is the number of observation class intervals.
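The ungrouped formula above is the "mean of squares minus square of mean" form, while the grouped formulas use squared deviations directly. The sketch below, with illustrative function names, computes both and shows that they agree on the ungrouped marks data when every observation is given frequency $1$.

```python
def variance_ungrouped(observations):
    """sum(x_i^2) / n - mean^2"""
    n = len(observations)
    mean = sum(observations) / n
    return sum(x * x for x in observations) / n - mean ** 2


def variance_grouped(values, frequencies):
    """sum(f_i * (x_i - mean)^2) / sum(f_i)"""
    total = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / total
    return sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies)) / total


marks = [1, 2, 5, 4, 1, 3, 7, 1, 3, 9]
scores, students = [40, 45, 50, 55, 60, 70], [2, 5, 8, 3, 2, 1]
midpoints, people = [152.5, 157.5, 162.5, 167.5, 172.5], [5, 8, 12, 10, 6]

print(round(variance_ungrouped(marks), 2))                     # 6.64
print(round(variance_grouped(marks, [1] * len(marks)), 2))     # 6.64 (same data, frequency 1 each)
print(round(variance_grouped(scores, students), 3))            # 47.392
print(round(variance_grouped(midpoints, people), 3))           # 37.567
```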

Standard Deviation

Standard deviation is the square root of the variance. It represents the spread or dispersion of a dataset in the same units as the data, and is denoted by $\sigma$.

Coefficient of Variation

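The coefficient of variation expresses the standard deviation as a percentage of the mean, so the variability of datasets with different scales can be compared.

$$\frac{\sigma}{\overline{\text{x}}} * 100$$

A higher coefficient of variation means the data is more variable. A lower coefficient of variation means the data is more consistent, and so more reliable.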
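To make the comparison concrete, the sketch below computes the standard deviation and coefficient of variation for two of the example datasets from earlier in this page. The function names are illustrative, and ungrouped data is passed with a frequency of $1$ per observation.

```python
import math


def standard_deviation(values, frequencies):
    total = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / total
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies)) / total
    return math.sqrt(variance)


def coefficient_of_variation(values, frequencies):
    total = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / total
    return standard_deviation(values, frequencies) / mean * 100


marks = [1, 2, 5, 4, 1, 3, 7, 1, 3, 9]
scores, students = [40, 45, 50, 55, 60, 70], [2, 5, 8, 3, 2, 1]

cv_marks = coefficient_of_variation(marks, [1] * len(marks))
cv_scores = coefficient_of_variation(scores, students)

print(round(cv_marks, 1))   # 71.6
print(round(cv_scores, 1))  # 13.6
```

Here the test scores have the lower coefficient of variation, so they are the more consistent of the two datasets relative to their mean.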

Important Points

  • If every observation in a dataset is increased or decreased by the same constant value $\alpha$, then:
$\overline{\text{x}}_{\text{new}} = \overline{\text{x}}_{\text{old}} \pm \alpha$ and $\sigma^2_{\text{new}} = \sigma^2_{\text{old}}$
  • If all observations are multiplied by the same non-zero number $\alpha$, then:
$\overline{\text{x}}_{\text{new}} = \overline{\text{x}}_{\text{old}} * \alpha$ and $\sigma^2_{\text{new}} = \alpha^2 * \sigma^2_{\text{old}}$
  • The sum of squares of the deviations is minimum when the deviations are taken from the mean.
$\sum_{\text{i} = 1}^{\text{n}} (\text{x}_\text{i} - \overline{\text{x}})^2 \text{ is least}$
  • The sum of deviations from the mean is zero.
$\sum_{\text{i} = 1}^{\text{n}} (\text{x}_\text{i} - \overline{\text{x}}) = 0$
  • Extreme values do not affect the median as strongly as they affect the mean. For example, for the dataset $1, 2, 3, 400, 500$ the median is $3$ and the mean is $\approx 181$.
  • The sum of the absolute differences between each observation and a value $\alpha$ is smallest when $\alpha$ is the median.
$\sum_{\text{i} = 1}^{\text{n}} |\text{x}_\text{i} - \alpha| \text{ is least, where } \alpha \text{ is the median}$
  • The variance of a given dataset can never exceed:
$\text{Variance} (\sigma^2) \le \left(\frac{\text{range}}{2}\right)^2$
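These properties can be checked numerically. The sketch below uses the dataset $1, 2, 3, 400, 500$ from the list above together with Python's standard `statistics` module. The constant `alpha = 10` and the helper names are arbitrary choices for illustration, and the two "least" properties are only spot-checked by comparing the mean against the median, not proved.

```python
import math
import statistics

data = [1, 2, 3, 400, 500]
alpha = 10  # hypothetical constant used for both the shift and the scaling

mean = statistics.mean(data)           # 181.2
median = statistics.median(data)       # 3
variance = statistics.pvariance(data)  # population variance (divides by n)

shifted = [x + alpha for x in data]
scaled = [x * alpha for x in data]

# Adding a constant shifts the mean and leaves the variance unchanged.
shift_ok = (math.isclose(statistics.mean(shifted), mean + alpha)
            and math.isclose(statistics.pvariance(shifted), variance))

# Multiplying by a constant scales the mean by alpha and the variance by alpha^2.
scale_ok = (math.isclose(statistics.mean(scaled), mean * alpha)
            and math.isclose(statistics.pvariance(scaled), alpha ** 2 * variance))


def sum_of_squares(about):
    return sum((x - about) ** 2 for x in data)


def sum_of_abs(about):
    return sum(abs(x - about) for x in data)


deviations_sum_to_zero = math.isclose(sum(x - mean for x in data), 0, abs_tol=1e-9)
mean_beats_median_on_squares = sum_of_squares(mean) <= sum_of_squares(median)
median_beats_mean_on_abs = sum_of_abs(median) <= sum_of_abs(mean)

value_range = max(data) - min(data)
variance_within_bound = variance <= (value_range / 2) ** 2

print(shift_ok, scale_ok, deviations_sum_to_zero)
print(mean_beats_median_on_squares, median_beats_mean_on_abs, variance_within_bound)
```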