
Estimates

3 Measures of central tendency

  • Central tendency is a way to describe the core of a dataset, also known as the distribution center.
  • Measures of central tendency aim to define the typical or expected value of a dataset.
  • The most common measures of central tendency are:
    • Mean: The average of all values in the dataset.
    • Median: The middle value when the data is sorted.
    • Mode: The value that appears most frequently in the dataset.
  • The median is often preferred when there are extreme values in the data.
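
As a quick illustration, Python's standard `statistics` module computes all three measures directly (the scores below are made-up values):

```python
from statistics import mean, median, mode

# Hypothetical exam scores (illustrative values only)
scores = [70, 80, 80, 90, 100]

print(mean(scores))    # 84   -- average of all values
print(median(scores))  # 80   -- middle value when sorted
print(mode(scores))    # 80   -- most frequent value
```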

3.1 Mean

  • Mean: The mean, or arithmetic average, is a central tendency measure representing the midpoint or average value of a dataset.
  • Calculating Mean: The mean is calculated by summing all data points and dividing by the total number of observations.
  • Mean’s Limitations: The mean can be influenced by extreme values (outliers), potentially skewing the representation of the data’s center.
  • Ideal Use: The mean is most effective for representing the central tendency of symmetrical distributions.
  • Alternatives: For skewed distributions, the median is often a more reliable measure of central tendency.
  • Mean and Variability: The mean alone doesn’t indicate the spread or variability of the data. The standard deviation is used to assess this.
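
The mean's sensitivity to outliers is easy to demonstrate. In this sketch (made-up values), a single extreme value pulls the mean far from the center while the median barely moves:

```python
from statistics import mean, median

values = [10, 12, 11, 13, 12]
with_outlier = values + [100]   # add one extreme value

print(mean(values))          # 11.6
print(mean(with_outlier))    # jumps above 26 -- dragged by the outlier
print(median(with_outlier))  # 12 -- essentially unchanged
```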

3.2 Median

  • Median: A measure of central tendency suitable for ordinal data. It represents the middle value in a sorted dataset, with half the values above and half below.
  • Finding the Median:
    • Odd Number of Values: Median is the (N+1)/2th term (where N is the total number of values).
    • Even Number of Values: Median is the average of the N/2th and (N/2 + 1)th terms.
  • Grouped Data: The formulas above apply to ungrouped data; for grouped data, the median is instead interpolated within a "median class" using a separate formula (see Section 3.2.2).
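
The two positional rules above can be sketched as a small function (the 1-based term positions are translated to Python's 0-based indexing; the input is assumed to be already sorted):

```python
def median_sorted(data):
    """Median of ungrouped data; assumes `data` is already sorted."""
    n = len(data)
    if n % 2 == 1:
        return data[(n + 1) // 2 - 1]           # the (N+1)/2-th term (1-based)
    mid = n // 2
    return (data[mid - 1] + data[mid]) / 2      # average of the N/2-th and (N/2+1)-th terms

print(median_sorted([3, 5, 8]))      # 5
print(median_sorted([3, 5, 8, 10]))  # 6.5
```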

3.2.2 Median for grouped data

  • Calculating the Median for Grouped Data: The median of a grouped dataset is interpolated within the median class as Median = l + ((N/2 − cf) / f) × h, where N is the total frequency and:
    • "l": The lower limit of the median class.
    • "cf": The cumulative frequency of the class preceding the median class.
    • "f": The frequency of the median class.
    • "h": The size (width) of the median class.
  • Comparing Median and Mean: Comparing the median and mean provides insight into how the data is distributed. Identical mean and median values suggest a symmetric distribution.
  • Median vs. Mean in Skewed Distributions: In skewed distributions, the median is a more reliable measure of central tendency than the mean because it’s less affected by outliers. Outliers can significantly pull the mean away from the center of the distribution, making it potentially misleading.
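
Using the grouped-median formula Median = l + ((N/2 − cf) / f) × h, a minimal sketch (with a hypothetical frequency table) might look like:

```python
def grouped_median(classes):
    """Median for grouped data: l + ((N/2 - cf) / f) * h.

    `classes` is a list of (lower_limit, upper_limit, frequency) tuples,
    assumed contiguous and sorted by class interval."""
    n = sum(f for _, _, f in classes)           # N: total frequency
    cumulative = 0
    for lower, upper, f in classes:
        if cumulative + f >= n / 2:             # found the median class
            l, h, cf = lower, upper - lower, cumulative
            return l + (n / 2 - cf) / f * h
        cumulative += f

# Hypothetical class intervals (e.g. test scores) and their frequencies
table = [(0, 10, 2), (10, 20, 5), (20, 30, 8), (30, 40, 5)]
print(grouped_median(table))  # 23.75
```

Here N = 20, so the median class is 20–30 (cumulative frequency first reaches N/2 = 10 there), giving 20 + ((10 − 7) / 8) × 10 = 23.75.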

3.3 Mode

  • Mode is a measure of central tendency representing the most frequent value in a dataset.
  • It can be applied to nominal data, unlike mean and median.
  • Datasets can have multiple modes (bimodal, trimodal, multimodal) or no mode.
  • Mode is unaffected by extreme values, making it useful for skewed data.
  • Limitations include lack of mathematical manipulability and limited insight into the entire dataset.
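
Python's `statistics` module illustrates this directly, including the multi-mode case (the color labels here are arbitrary nominal data):

```python
from statistics import mode, multimode

colors = ["red", "blue", "red", "green", "blue"]  # nominal data

print(mode(colors))       # 'red'  -- first mode encountered
print(multimode(colors))  # ['red', 'blue'] -- the dataset is bimodal
```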

3.4 Percentiles, quartiles and interquartile range

  • Percentiles and Quartiles: Percentiles and quartiles divide a sorted dataset into equal proportions. Percentiles divide the data into 100 equal parts (1% each), while quartiles divide it into four equal parts (25% each).
  • Interquartile Range (IQR): The IQR is a measure of dispersion that represents the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It encompasses the central 50% of the data and is often used for non-Gaussian distributions.
  • Median: The median is the 50th percentile, representing the middle value of a dataset.
  • Q-Q Plot: A Q-Q plot compares the quantiles of two probability distributions to visually determine if they originate from the same population. If the distributions are identical, the points on the plot will fall along a straight line.
  • Uses of Q-Q Plots: Q-Q plots are used to compare the shapes of distributions, including their location, scale, and skewness. They can be used for comparing datasets or models.
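
Quartiles and the IQR can be computed with the standard library. Note that exact quartile values depend on the interpolation method; `statistics.quantiles` uses the "exclusive" method by default, so other tools may give slightly different cut points:

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Cut points dividing the sorted data into four equal groups: Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1   # interquartile range: spread of the central 50%

print(q1, q2, q3)  # 2.75 5.5 8.25
print(iqr)         # 5.5
```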

4 Skewness and kurtosis

  • Skewness measures how much a distribution deviates from the symmetry of a normal (bell curve) distribution. It can be positive, negative, or zero (symmetric).
    • Positive skewness: The data is clustered towards the left, with a longer tail on the right. Mean > Median > Mode.
    • Negative skewness: The data is clustered towards the right, with a longer tail on the left. Mode > Median > Mean.
  • Kurtosis measures the heaviness of a distribution's tails, traditionally described as its peakedness or flatness, compared to a normal distribution.
    • Positive kurtosis (leptokurtic): The distribution has a higher peak and heavier tails, indicating more outliers than a normal distribution.
    • Negative kurtosis (platykurtic): The distribution has a flatter peak and lighter tails, indicating fewer outliers than a normal distribution.
    • Normal kurtosis (mesokurtic): The distribution is a normal bell curve.
  • Skewness and kurtosis are used to characterize the shape of a distribution and help determine how closely it resembles a normal distribution.
    • Skewness describes the horizontal "drag" of the data, while kurtosis describes the vertical "drag" or height of the peak.
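
A minimal moment-based sketch of both measures, using the population formulas (libraries such as SciPy provide equivalent functions):

```python
from statistics import mean, pstdev

def skewness(data):
    """Population (moment) skewness: E[(x - mu)^3] / sigma^3."""
    mu, sigma = mean(data), pstdev(data)
    n = len(data)
    return sum((x - mu) ** 3 for x in data) / (n * sigma ** 3)

def excess_kurtosis(data):
    """E[(x - mu)^4] / sigma^4 - 3; positive = leptokurtic, negative = platykurtic."""
    mu, sigma = mean(data), pstdev(data)
    n = len(data)
    return sum((x - mu) ** 4 for x in data) / (n * sigma ** 4) - 3

right_skewed = [1, 2, 2, 3, 3, 3, 10]   # long right tail
print(skewness(right_skewed) > 0)        # True: positively skewed
print(skewness([1, 2, 3, 4, 5]))         # 0.0: symmetric
```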

5 Variability and its measures

  • Variability is a statistical concept that quantifies the spread or dispersion of data points within a dataset.
  • Measures of variability complement measures of central tendency (like the mean) by indicating how much data points deviate from the central value.
  • Low dispersion implies data points are clustered tightly around the center, while high dispersion indicates data points are spread out.
  • Variability is crucial because it influences the stability and predictability of data. High variability increases the likelihood of extreme values.
  • Understanding variability helps in comprehending the chance of unusual events or outliers in a dataset.

5.1 Variance

  • Variance Definition: Variance is a measure of how spread out a set of data points is from its mean. It quantifies the average squared deviation of values from the mean.
  • High Variance Interpretation: A large variance indicates that the data points are widely dispersed from the mean and from each other.
  • Variance Calculation Formulas:
    • Population Variance: σ² = Σ(X − μ)² / N, where μ is the population mean (used to assess the variation of the entire population)
    • Sample Variance: s² = Σ(X − M)² / (N − 1), where M is the sample mean (used to estimate the variance of the population from a sample)
  • Sample Variance Adjustment: The N - 1 in the denominator of the sample variance formula accounts for the tendency of samples to underestimate the true population variance.

5.2 Standard deviation

  • Standard deviation measures the typical (root-mean-square) distance of data points from the mean.
  • A higher standard deviation indicates greater data dispersion, while a lower standard deviation indicates less dispersion.
  • Standard deviation uses the original units of the data, making it easier to understand.
  • The standard deviation is calculated as the square root of the variance.
  • Two formulas are provided: one for population standard deviation and one for sample standard deviation.
  • The standard deviation is represented by σ (sigma) for the population and by s for a sample.
  • Standard deviation is widely used as an indicator of variation; the mean absolute deviation is a related measure that averages absolute rather than squared deviations.
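
The square-root relationship between variance and standard deviation can be checked directly (same illustrative values as above):

```python
import math
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # population variance is 4

# Standard deviation is the square root of the variance,
# expressed in the data's original units
print(pstdev(data))                # 2.0
print(math.sqrt(pvariance(data)))  # 2.0 -- identical by definition
```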

5.3 Standard error

  • Standard error is a measure of how representative a sample is of the total population. It essentially indicates the difference between the sample mean and the population mean.
  • A large standard error suggests the sample mean doesn’t accurately represent the population mean.
  • A small standard error implies the sample mean is close to the population mean, indicating a good representation.
  • Increasing sample size decreases standard error.
  • Standard deviation measures the spread of data within a sample, while standard error measures the spread of sample means from the population mean.
  • For any sample with more than one observation, the standard deviation is larger than the standard error (SE = SD / √n).
  • Standard error reflects variability across multiple samples, whereas standard deviation represents variability within a specific sample.
  • Standard error is typically estimated from a single sample (as SD / √n), whereas standard deviation is calculated directly from the sample data.
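
A minimal sketch of the usual estimate, SE = SD / √n, with made-up sample values; note how the √n in the denominator means larger samples shrink the standard error:

```python
import math
from statistics import stdev

sample = [12, 15, 14, 10, 13, 16, 11, 13]   # hypothetical measurements

n = len(sample)
se = stdev(sample) / math.sqrt(n)   # standard error of the mean

print(stdev(sample))   # 2.0
print(round(se, 3))    # 0.707 -- smaller than the standard deviation
```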

5.4 Coefficient of variation

  • Coefficient of Variation (CV) Definition: The CV is a measure of relative variability, expressing the standard deviation as a fraction (or percentage) of the mean.
  • Purpose of CV: Used to compare variability between datasets with different scales or units.
  • Calculation: CV is calculated by dividing the standard deviation by the mean.
  • Interpretation:
    • A CV of 1 (or 100%) means the standard deviation equals the mean.
    • A CV less than 1 indicates the standard deviation is smaller than the mean.
    • A CV greater than 1 indicates the standard deviation is larger than the mean.
    • Higher CV values suggest greater relative variability.
  • Example: A courier service with an average delivery time of 30 minutes and a standard deviation of 6 minutes has a CV of 0.20 (or 20%). This means the standard deviation is 20% of the average delivery time.
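
The courier example from the text, plus a second made-up dataset in different units, shows how the CV makes spreads comparable across scales:

```python
# Courier example from the text: mean delivery time 30 min, SD 6 min
mean_time = 30.0
sd_time = 6.0
cv_time = sd_time / mean_time
print(cv_time)   # 0.2 -- the SD is 20% of the mean

# Hypothetical second dataset in different units: package weights in kg
cv_weight = 0.8 / 2.5        # SD 0.8 kg, mean 2.5 kg
print(cv_weight > cv_time)   # True: weights vary more, relative to their mean
```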

5.4.1 Comprehending the source of variability for analysis

  • Understanding Variability is Key: Every statistical study assumes that findings from samples can be applied to the entire population. To ensure this, we need to understand and address the sources of variability in our data.
  • Sources of Variability: These can include biological factors (like individual differences), technological factors (like variations in data collection methods), and even the scientist performing the study.
  • Addressing Variability: We can use both quantitative and technological methods to adjust for these differences. This might involve standardized data collection procedures, controls in the statistical analyses, or carefully selecting study groups to minimize inherent differences.
  • Challenges with Inherent Variability: When studying factors like disease, it becomes difficult to find groups of individuals that are identical except for the presence or absence of the disease. This makes it challenging to isolate the effects of the disease itself.