Variability, Dispersion and Central Tendency
Quantitative data can be described by measures of central tendency, dispersion, and "shape". Central tendency is described by the median, the mode, and the means (there are different means, e.g. geometric and arithmetic). Dispersion is the degree to which data is distributed around this central tendency, and is represented by range, deviation, variance, standard deviation and standard error.
This chapter answers parts from Section A(e) of the Primary Syllabus, "Describe frequency distributions and measures of central tendency and dispersion". This topic was examined in Question 23 from the first paper of 2015. The pass rate was 8%. The only candidate who passed "gave an example of a simple data set (a set of numbers) and calculated the mean, median & mode and explained the effect of an outlier".
Quantitative data is:
- Expressed numerically, and ordered on a scale
- Interval data: values increase at constant intervals, but the scale does not start at a true zero, e.g. temperature on the Celsius scale
- Ratio data: interval data which has a true zero, e.g. pressure
- Binary data: yes or no answers
- Discrete data: isolated data points separated by gaps
- Continuous data: part of a continuous range of values
Measure of central tendency
- This is the average of a population - allowing the population to be represented by a single value.
- Median is the middle number in a data set that is ordered from least to greatest
- Mode is the number that occurs most often in a data set
- Arithmetic mean is the sum of a set of numerical values divided by the number of values
- Geometric mean is the nth root of the product of n numbers
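As a minimal sketch of the definitions above, the following uses Python's standard `statistics` module on a small data set and then adds a single outlier. The data values and the `summarise` helper are made up for illustration. The point is that the mean is dragged towards the outlier, whereas the median and mode barely move.

```python
import statistics

# Hypothetical data set, chosen only to give round numbers.
data = [2, 3, 3, 5, 7]
with_outlier = data + [100]


def summarise(values):
    """Return the four measures of central tendency described above."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # nth root of the product of n (positive) numbers
        "geometric_mean": statistics.geometric_mean(values),
    }


print(summarise(data))          # mean 4, median 3, mode 3
print(summarise(with_outlier))  # mean 20, median 4.0, mode 3
```

The geometric mean is only defined here for positive values, and it is less affected by the outlier than the arithmetic mean.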
Degree of dispersion
- These describe the dispersion of data around some sort of mean.
- Range: the difference between the highest and the lowest score
- Interquartile range: the difference between 75th and 25th percentiles
- Percentile: the percentage band into which the score falls (median = the 50th percentile)
- Deviation: distance between an observed score and the mean score. Because the difference can be positive or negative and this is cumbersome, usually the absolute deviation is used (which ignores the plus or minus sign).
- Variance: the average of the squared deviations from the mean
- Standard deviation: square root of variance
- Measure of the average spread of individual samples from the mean
- Reporting the SD along with the mean gives one an impression of how well that mean value represents the data (i.e. if the SD is huge, the mean is a poor measure of central tendency, because the data is so widely scattered.)
- Standard error
- This is an estimate of the spread of sample means around the population mean.
- You don't know the population mean; you only know the sample mean and the standard deviation for your sample. If the standard deviation is large or the sample is small, the sample mean may be rather far from the population mean. How far is it? The SE can estimate this.
- Mean absolute deviation is the average of the absolute deviations from a central point for all data. As such, it is a summary of the net statistical variability in the data set. On average, it says, the data is this different from this central point.
- Coefficient of variation, also known as "relative standard deviation", is the SD divided by the mean. As a dimensionless number, it allows comparisons between different data sets (i.e. ones using different units).
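The measures of dispersion listed above can be sketched for a small data set as follows. The data values are invented for illustration. Two assumptions are worth flagging: the variance and SD here are the population versions (`pvariance`, `pstdev`, dividing by n), whereas for a sample one would usually use `statistics.variance` and `statistics.stdev` (dividing by n - 1); and the quartiles use the default method of `statistics.quantiles`, which is only one of several conventions and may differ slightly from other software.

```python
import statistics

# Hypothetical data set, chosen only to give round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)

# Range: difference between the highest and the lowest score.
value_range = max(data) - min(data)

# Interquartile range: 75th percentile minus 25th percentile
# (default "exclusive" quantile method; other methods exist).
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Deviation of each score from the mean, and the mean absolute deviation.
deviations = [x - mean for x in data]
mean_abs_dev = sum(abs(d) for d in deviations) / len(data)

# Population variance (mean of squared deviations) and its square root.
variance = statistics.pvariance(data)
sd = statistics.pstdev(data)

# Coefficient of variation: SD divided by the mean (dimensionless).
cv = sd / mean

print(value_range, iqr, mean_abs_dev, variance, sd, cv)
```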
Standard Error (SE) = SD / square root of n
- The variability among sample means will be increased if there is (a) a wide variability of individual data and (b) small samples
- SE is used to calculate the confidence interval.
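The formula SE = SD / square root of n can be illustrated with a short sketch. The `standard_error` helper and the data are hypothetical, and the sample SD (n - 1 denominator) is assumed. The second data set simply repeats the first four times, so the spread of individual values is similar but the sample is larger, and the SE shrinks accordingly.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical samples: same values, different sample size.
small_sample = [4, 6, 5, 7]
large_sample = small_sample * 4

print(standard_error(small_sample))
print(standard_error(large_sample))  # smaller, because n is larger
```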
Shape of the data
- This vaguely refers to the shape of the probability distribution bell curve.
- Skewness is a measure of the asymmetry of the probability distribution, i.e. the tendency of the bell curve to be asymmetrical.
- Kurtosis or "peakedness" describes the width and height of the peak of the bell curve, i.e. the tendency for the scores to gather around the middle of the bell curve.
- A normal distribution is a perfectly symmetrical bell curve, and is not skewed.
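Skewness and kurtosis can be sketched as the third and fourth standardised moments of a data set. This is one common definition among several (statistical packages often apply small-sample corrections), and the function names and data below are made up for illustration. Kurtosis is reported here as "excess" kurtosis, i.e. with 3 subtracted, so that a normal distribution scores approximately zero.

```python
import statistics


def _z_scores(values):
    """Distance of each value from the mean, in population-SD units."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]


def skewness(values):
    """Third standardised moment: 0 for symmetrical data."""
    z = _z_scores(values)
    return sum(v ** 3 for v in z) / len(z)


def excess_kurtosis(values):
    """Fourth standardised moment minus 3 (about 0 for a normal curve)."""
    z = _z_scores(values)
    return sum(v ** 4 for v in z) / len(z) - 3


# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]

print(skewness(symmetric))         # approximately zero
print(skewness(right_skewed))      # positive: long tail to the right
print(excess_kurtosis(symmetric))  # negative: flatter than a normal curve
```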
Point estimate
- According to the college, a point estimate is "a single value estimate of a population parameter. It represents a descriptive statistic for a summary measure, or a measure of central tendency from a given population sample." For example, in a population there exists a "true" average height; the point estimate of this average height is the average height of a sample group taken from that population.
Confidence interval
- The range of values within which the "actual" result is likely to be found.
- A CI of 95% means that if a trial was repeated an infinite number of times, 95% of the confidence intervals calculated from those trials would contain the "true" population value.
- The CI gives an indication of the precision of the sample mean as an estimate of the "true" population mean
- A wide CI can be caused by small samples or by a large variance within a sample.
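A confidence interval around a sample mean can be sketched as the mean plus or minus a multiple of the SE. The sketch below assumes a normal approximation (a multiplier of about 1.96 for 95%); for small samples a t-distribution multiplier is usually preferred, which would make the interval wider. The `confidence_interval` helper and the data are hypothetical.

```python
import math
import statistics


def confidence_interval(values, level=0.95):
    """Approximate CI for the mean, using a normal (z) multiplier."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    # z is about 1.96 for a 95% interval (normal approximation).
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * se, mean + z * se


# Hypothetical samples: same values, different sample size.
small_sample = [4, 6, 5, 7]
large_sample = small_sample * 4

print(confidence_interval(small_sample))
print(confidence_interval(large_sample))  # narrower: larger n, smaller SE
```

Both intervals are centred on the same sample mean (5.5); only their width differs, which reflects the precision of the estimate.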