Introduction to Statistics

UNIT-I: Combinatorics

Permutation and Combination, Repetition and Constrained Repetition, Binomial Coefficients, Binomial Theorem.

UNIT-II: Frequency Distributions and Measures of Central Tendency

Frequency distributions, Histograms and frequency polygons, Measures of central tendency: Mean, Mode, Median, Dispersion, Mean deviation and standard deviation. Moments, Skewness, Kurtosis.

UNIT-III: Probability Theory and Distributions

Elementary probability theory: Definition, conditional probability, Probability distribution, mathematical expectation. Theoretical distribution: Binomial, Poisson, and Normal distribution, Relation between the Binomial, Poisson, and Normal distribution.

UNIT-IV: Correlation, Regression, and Curve Fitting

Correlation and Regression: Linear Correlation, Measure of Correlation, Least Squares Regression lines. Curve fitting: Method of least squares, least squares line, least squares parabola. Chi-square test: definition of chi-square; significance test: contingency test, coefficient of contingency.

UNIT-V: Sampling Theory and Hypothesis Testing

Basics of sampling theory: Sample mean and variance, Student's t-test, test of Hypotheses and significance, degree of freedom, Z-test, small and large sampling, Introduction to Monte Carlo method.

UNIT-I: Combinatorics

1. Permutation and Combination

Permutation is the arrangement of objects in a specific order. Combination is the selection of objects without considering the order.

Permutation Formula:

P(n, r) = n! / (n - r)!

Combination Formula:

C(n, r) = n! / [r! (n - r)!]

Example:

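From 5 distinct letters, in how many ways can you arrange 3 (order matters), and in how many ways can you select 3 (order ignored)?

Permutation: P(5,3) = 5! / (5-3)! = 120 / 2 = 60
Combination: C(5,3) = 5! / (3!2!) = 120 / 12 = 10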

Diagram (Text form):

Set = {A, B, C, D, E}
Permutation (3 items): ABC, ACB, BAC, BCA, CAB, CBA...
Combination (3 items): ABC, ABD, ABE, ACD...
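
The counts above can be checked with a short Python sketch. It assumes Python 3.8 or later (for `math.perm` and `math.comb`); the variable names are illustrative only.

```python
import math
from itertools import combinations, permutations

letters = ["A", "B", "C", "D", "E"]

# Ordered arrangements of 3 letters out of 5: P(5, 3)
n_perm = math.perm(5, 3)
# Unordered selections of 3 letters out of 5: C(5, 3)
n_comb = math.comb(5, 3)

# Enumerate explicitly to confirm that the formulas match the listings
all_perms = ["".join(p) for p in permutations(letters, 3)]
all_combs = ["".join(c) for c in combinations(letters, 3)]

print(n_perm, len(all_perms))
print(n_comb, len(all_combs))
print(all_combs)
```

Each combination corresponds to 3! = 6 permutations, which is why 60 / 6 = 10.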

2. Repetition and Constrained Repetition

Repetition allows objects to be selected more than once. Constrained repetition limits the number of times an object can be selected.

Permutation with Repetition:

Total = n^r
(Choose r items from n, allowing repetition)

Example:

How many 2-digit numbers can be formed from the 3 digits {1,2,3} when repetition is allowed?

3^2 = 9 numbers: 11, 12, 13, 21, 22, ...
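
A minimal sketch that enumerates these arrangements with `itertools.product`. The constrained case (3-digit numbers in which no digit is used more than twice) is a hypothetical illustration, not an example from the notes.

```python
from collections import Counter
from itertools import product

digits = [1, 2, 3]

# Unconstrained repetition: each of the 2 positions can hold any digit, so n^r = 3^2
two_digit = [10 * a + b for a, b in product(digits, repeat=2)]
print(len(two_digit), two_digit)

# Constrained repetition (hypothetical rule): 3-digit numbers where
# no digit may appear more than max_uses times
max_uses = 2
constrained = [
    t for t in product(digits, repeat=3)
    if max(Counter(t).values()) <= max_uses
]
print(len(constrained))  # 3^3 arrangements minus 111, 222 and 333
```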

3. Binomial Coefficients

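Binomial coefficients are the values C(n, r). They appear as the coefficients in the Binomial Theorem and as the entries of Pascal's Triangle, where row n (counting from 0) lists C(n, 0), C(n, 1), ..., C(n, n).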

C(n, r) = n! / [r!(n-r)!]
Pascal's Triangle:
       1
      1 1
     1 2 1
    1 3 3 1
   1 4 6 4 1
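
The triangle can be generated row by row, since each interior entry is the sum of the two entries above it. The function name `pascal_rows` below is an illustrative choice.

```python
import math

def pascal_rows(n_rows):
    """Return the first n_rows rows of Pascal's triangle as lists."""
    rows = [[1]]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        # Interior entries are sums of adjacent entries in the previous row
        middle = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + middle + [1])
    return rows[:n_rows]

rows = pascal_rows(5)
for row in rows:
    print(row)

# Entry r of row n equals C(n, r)
matches_formula = all(
    rows[n][r] == math.comb(n, r) for n in range(5) for r in range(n + 1)
)
print(matches_formula)
```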

4. Binomial Theorem

The Binomial Theorem gives a way to expand expressions of the form (a + b)ⁿ.

Formula:

(a + b)^n = Σ [C(n, k) * a^(n-k) * b^k] for k = 0 to n

Example:

(a + b)^2 = C(2,0)a^2 + C(2,1)ab + C(2,2)b^2
          = a^2 + 2ab + b^2
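
As a numerical check of the theorem, the sketch below sums the terms C(n, k) a^(n-k) b^k for sample values of a, b, and n (chosen arbitrarily here) and compares the total with (a + b)^n.

```python
import math

def binomial_terms(a, b, n):
    """Terms C(n, k) * a^(n-k) * b^k for k = 0..n."""
    return [math.comb(n, k) * a ** (n - k) * b ** k for k in range(n + 1)]

terms = binomial_terms(2, 3, 4)
print(terms)
print(sum(terms), (2 + 3) ** 4)  # the two values should agree
```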

UNIT-II: Frequency Distributions and Statistical Measures

1. Frequency Distributions

A frequency distribution organizes raw data into a table showing the frequency (number of times) each value appears.

Example:
Data: 2, 3, 2, 5, 3, 2, 4, 5
Frequency Table:
Value | Frequency
------|----------
  2   |     3
  3   |     2
  4   |     1
  5   |     2
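
The same table can be built with `collections.Counter` from the standard library; this is a sketch of one convenient approach, not the only one.

```python
from collections import Counter

data = [2, 3, 2, 5, 3, 2, 4, 5]
freq = Counter(data)  # maps each value to the number of times it appears

print("Value | Frequency")
for value in sorted(freq):
    print(f"{value:^5} | {freq[value]:^9}")
```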

2. Histograms and Frequency Polygons

A histogram is a graphical representation of a frequency distribution using bars.
A frequency polygon is a line graph created by plotting each class frequency above the midpoint of its class interval and joining the points with straight lines.

Textual Histogram:
[10-20]: ||||
[20-30]: ||||||
[30-40]: |||
[40-50]: |||||

3. Measures of Central Tendency

Mean

The arithmetic average: the sum of the data values divided by the number of values.

Mean = (2 + 4 + 6 + 8) / 4 = 5

Median

The middle value when data is arranged in order.

Data: 3, 5, 7 → Median = 5
If the count is even, the median is the average of the two middle values, e.g. 3, 5, 7, 9 → Median = (5 + 7) / 2 = 6.

Mode

The value that appears most frequently.

Data: 2, 3, 3, 4, 5 → Mode = 3
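
These three measures are available in Python's standard `statistics` module. The sketch reuses the data from the examples above; the even-count list is an added illustration.

```python
import statistics

mean_value = statistics.mean([2, 4, 6, 8])
median_odd = statistics.median([3, 5, 7])
# Even count: the median is the average of the two middle values
median_even = statistics.median([3, 5, 7, 9])
mode_value = statistics.mode([2, 3, 3, 4, 5])

print(mean_value, median_odd, median_even, mode_value)
```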

4. Dispersion Measures

Mean Deviation

The average of absolute deviations from the mean.

Data = 3, 4, 5, Mean = 4
Absolute deviations = 1, 0, 1 → Mean Deviation = (1+0+1)/3 ≈ 0.67

Standard Deviation (σ)

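Measures how far the data values spread around the mean; it is the square root of the average squared deviation. Formula:

σ = √[Σ(x - mean)² / N]
Example: Data = 2, 4, 4, 4, 5, 5, 7, 9
Mean = 40 / 8 = 5, Σ(x - mean)² = 32, σ = √(32 / 8) = 2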
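
Both dispersion measures can be computed directly from their definitions. The sketch below divides by N (the population convention used in the formula above); the function names are illustrative.

```python
import math

def mean_deviation(data):
    """Average absolute deviation from the mean."""
    m = sum(data) / len(data)
    return sum(abs(x - m) for x in data) / len(data)

def population_std(data):
    """Square root of the average squared deviation (divides by N)."""
    m = sum(data) / len(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

md = mean_deviation([3, 4, 5])
sd = population_std([2, 4, 4, 4, 5, 5, 7, 9])
print(round(md, 2), sd)
```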

5. Moments

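Moments describe the shape characteristics of a distribution. The r-th moment about the mean is Σ(x - mean)^r / N.

1st Moment (about the origin): Mean
2nd Moment (about the mean): Variance
3rd Moment (about the mean): used to measure Skewness
4th Moment (about the mean): used to measure Kurtosis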

6. Skewness

Measures asymmetry of data:

Positive Skew: Tail on right
Negative Skew: Tail on left
Symmetric: No skew (e.g. the normal distribution)

Diagram (Text):
Left-skewed:      *      * * * * * * *
Symmetrical:      *   * * * * *   *
Right-skewed:     * * * * * * *      *

7. Kurtosis

Measures the peakedness of a distribution, and the heaviness of its tails, relative to the normal curve.

Leptokurtic: More peaked than normal
Mesokurtic: Normal curve
Platykurtic: Flatter than normal

Diagram (Text):
Leptokurtic:      tall, narrow peak
Mesokurtic:       moderate peak (normal curve)
Platykurtic:      low, broad peak
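
A minimal sketch of how moments lead to numerical measures of skewness and kurtosis. It assumes the common moment-based coefficients (third and fourth central moments divided by σ³ and σ⁴), which these notes do not state explicitly; the data sets are hypothetical.

```python
def central_moment(data, k):
    """k-th moment about the mean, dividing by N."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def skewness(data):
    # Assumed definition: third central moment divided by sigma cubed
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    # Assumed definition: fourth central moment divided by sigma^4
    # (a normal curve gives a value of about 3)
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 10]  # hypothetical data with a long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tailed), kurtosis(right_tailed))
```

Under these definitions a positive coefficient indicates a tail on the right, a negative one a tail on the left, and a kurtosis above or below 3 indicates a leptokurtic or platykurtic shape respectively.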

UNIT-III: Elementary Probability and Theoretical Distributions

1. Elementary Probability Theory

Probability: A measure of the likelihood that an event will occur, ranging from 0 (impossible) to 1 (certain).
If 'S' is the sample space of equally likely outcomes and 'E' is an event, then:
P(E) = Number of favorable outcomes / Total number of outcomes

Conditional Probability

The probability of an event A given that event B has occurred.
P(A | B) = P(A ∩ B) / P(B), provided P(B) ≠ 0
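
The definition can be illustrated by enumerating a small sample space. The example below (two fair dice, A = "the sum is 8", B = "the first die is even") is hypothetical and not taken from the notes; exact fractions avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(E) = favorable outcomes / total outcomes."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

def is_a(outcome):
    return outcome[0] + outcome[1] == 8   # event A: the sum is 8

def is_b(outcome):
    return outcome[0] % 2 == 0            # event B: the first die is even

p_b = prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))
p_a_given_b = p_a_and_b / p_b             # P(A | B) = P(A ∩ B) / P(B)

print(p_b, p_a_and_b, p_a_given_b)
```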

2. Probability Distributions

A probability distribution assigns a probability to each possible value of a random variable. The probabilities are non-negative and sum to 1.

Example:
X:     0   1   2
P(X):  0.2 0.5 0.3

3. Mathematical Expectation

The expected value (mean) of a random variable is the weighted average of all possible values.

E(X) = Σ [x * P(x)]
Example:
X:     1   2   3
P(X):  0.2 0.5 0.3
E(X) = 1*0.2 + 2*0.5 + 3*0.3 = 2.1
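
The weighted sum translates directly into code. This sketch also checks that the probabilities sum to 1 before computing E(X).

```python
values = [1, 2, 3]
probs = [0.2, 0.5, 0.3]

# A valid distribution must have probabilities that sum to 1
total_prob = sum(probs)

# E(X) = sum of x * P(x)
expected = sum(x * p for x, p in zip(values, probs))

print(total_prob, expected)
```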

4. Theoretical Distributions

Binomial Distribution

Used when there is a fixed number of independent trials (n), each with two outcomes (success or failure) and the same probability of success p.
Formula: P(X = r) = nCr * p^r * (1-p)^(n-r)

Example: n = 4, p = 0.5
P(X = 2) = 4C2 * (0.5)^2 * (0.5)^2 = 0.375
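
A short sketch of the formula as a function; `binomial_pmf` is an illustrative name. Summing over all r confirms that the probabilities add up to 1.

```python
import math

def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * (1 - p)^(n - r)."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

p_two = binomial_pmf(2, 4, 0.5)
total = sum(binomial_pmf(r, 4, 0.5) for r in range(5))

print(p_two, total)
```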

Poisson Distribution

Models the number of events that occur independently at a constant average rate λ within a fixed interval of time or space; it is often used for rare events.
Formula: P(X = r) = (λ^r * e^(-λ)) / r!

Example: λ = 3, r = 2
P(X=2) = (3^2 * e^-3) / 2! ≈ 0.224
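
The same calculation as a small function (the name `poisson_pmf` is illustrative):

```python
import math

def poisson_pmf(r, lam):
    """P(X = r) = lam^r * e^(-lam) / r!"""
    return lam ** r * math.exp(-lam) / math.factorial(r)

p_two = poisson_pmf(2, 3)
print(round(p_two, 3))
```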

Normal Distribution

A continuous distribution that is symmetric and bell-shaped, with Mean = Median = Mode.
Standardization formula (converts X to the standard normal variable Z, which has mean 0 and standard deviation 1): Z = (X - μ) / σ

Example: μ = 100, σ = 15, X = 115
Z = (115 - 100) / 15 = 1.0
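
A sketch using `statistics.NormalDist` (Python 3.8 or later) to compute the Z value and, as an added illustration, the probability of observing a value at or below X.

```python
from statistics import NormalDist

mu, sigma, x = 100, 15, 115

z = (x - mu) / sigma                    # standardized value
p_below = NormalDist(mu, sigma).cdf(x)  # P(X <= 115)

print(z, round(p_below, 4))
```

A Z of 1.0 means that X lies one standard deviation above the mean, with roughly 84% of the distribution at or below it.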

5. Relationship Between Binomial, Poisson, and Normal

Binomial → Poisson: When n is large, p is small, λ = np
Binomial → Normal: For large n, p not too close to 0 or 1
Poisson → Normal: When λ is large

Approximations:
Binomial(n, p) ≈ Poisson(λ = np) if n→∞ and p→0 with np held fixed
Binomial(n, p) ≈ Normal(μ = np, σ = √(npq)), where q = 1 - p, if n is large
Poisson(λ) ≈ Normal(μ = λ, σ = √λ) if λ is large
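
The first two approximations can be checked numerically. The parameter values below are arbitrary illustrations, and the 0.5 added in the normal case is a continuity correction, an assumption not mentioned in these notes.

```python
import math
from statistics import NormalDist

def binomial_pmf(r, n, p):
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

def poisson_pmf(r, lam):
    return lam ** r * math.exp(-lam) / math.factorial(r)

# Binomial -> Poisson: large n, small p, lambda = n * p
n, p = 1000, 0.003
binom_rare = binomial_pmf(2, n, p)
poisson_rare = poisson_pmf(2, n * p)
print(binom_rare, poisson_rare)

# Binomial -> Normal: large n, p not close to 0 or 1
n2, p2 = 100, 0.5
mu = n2 * p2
sigma = math.sqrt(n2 * p2 * (1 - p2))
binom_cdf = sum(binomial_pmf(r, n2, p2) for r in range(56))  # exact P(X <= 55)
normal_cdf = NormalDist(mu, sigma).cdf(55.5)                 # with continuity correction
print(binom_cdf, normal_cdf)
```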

UNIT-IV: Correlation, Regression, Curve Fitting & Chi-Square Test

1. Correlation

Correlation measures the degree of relationship between two variables.

Linear Correlation

A linear relationship exists when a change in one variable is associated with a proportional change in the other. Correlation alone does not show that one variable causes the other to change.

Measure of Correlation

The most common measure is Pearson’s correlation coefficient (r):
r = Σ[(X - X̄)(Y - Ȳ)] / √[Σ(X - X̄)² * Σ(Y - Ȳ)²]
Value of r lies between -1 and 1.

r = 1: Perfect positive correlation
r = -1: Perfect negative correlation
r = 0: No linear correlation
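
A direct implementation of the formula for r, as a sketch. The data set is hypothetical and the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]   # hypothetical observations

r = pearson_r(xs, ys)
print(round(r, 4))
print(pearson_r([1, 2, 3], [2, 4, 6]))   # points on a rising line
print(pearson_r([1, 2, 3], [6, 4, 2]))   # points on a falling line
```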

2. Regression

Regression is used to predict the value of one variable based on another.

Least Square Regression Lines

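The regression line of Y on X minimizes the sum of squares of the vertical distances from the data points to the line. The regression line of X on Y minimizes the horizontal distances instead, so the two lines generally differ.

Regression line of Y on X: Y = a + bX
Regression line of X on Y: X = c + dY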
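
A sketch of both regression lines, assuming the standard least squares solution (slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope · x̄), which the notes do not spell out. The data and the name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares line predicting ys from xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]   # hypothetical observations

y_on_x = fit_line(xs, ys)   # predicts Y from X
x_on_y = fit_line(ys, xs)   # predicts X from Y (roles swapped)

print(y_on_x)
print(x_on_y)
```

The product of the two slopes equals r², so the lines coincide only when the correlation is perfect.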

3. Curve Fitting

Curve fitting is the process of finding a curve that best fits the given data using the method of least squares.

Method of Least Squares

This method minimizes the sum of the squares of the errors (deviations) between the observed and fitted values.

Least Squares Line

A straight-line fit Y = a + bX, where a and b are chosen to minimize Σ(Y - a - bX)².

Least Squares Parabola

A curve of the form Y = a + bX + cX², where a, b, and c are chosen to minimize Σ(Y - a - bX - cX²)².
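
A minimal sketch of fitting a least squares parabola in pure Python. It assumes the usual approach of solving the three normal equations (built from sums of powers of X) by Gaussian elimination; the helper names and the data, which lie exactly on Y = 1 + X + X², are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution

def fit_parabola(xs, ys):
    """Return [a, b, c] minimizing the sum of (y - a - b*x - c*x^2)^2."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # t[k] = sum of y*x^k
    normal_matrix = [[s[i + j] for j in range(3)] for i in range(3)]
    return solve_linear(normal_matrix, t)

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 7, 13, 21]   # hypothetical data on Y = 1 + X + X^2

a, b, c = fit_parabola(xs, ys)
print(a, b, c)
```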

4. Chi-Square Test

The Chi-square (χ²) test is a statistical test to determine if there is a significant difference between observed and expected frequencies.

Definition of Chi-Square

χ² = Σ[(O - E)² / E]
Where O = observed frequency, E = expected frequency.

Significance Test

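The computed χ² is compared with the critical value from the chi-square table at the chosen significance level (e.g. 5%) and the appropriate degrees of freedom. If the computed value exceeds the critical value, the difference between observed and expected frequencies is significant.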

Contingency Test

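Used for testing whether the row and column variables of a contingency table are independent. Each expected frequency is E = (row total × column total) / N, and the degrees of freedom are (rows - 1)(columns - 1).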

Coefficient of Contingency

A measure derived from Chi-square to express the degree of association between two attributes.
C = √[χ² / (χ² + N)], where N = total number of observations.
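
A sketch of the full calculation for a hypothetical 2×2 contingency table. It assumes the usual expected frequency under independence, (row total × column total) / N, and degrees of freedom (rows - 1)(columns - 1).

```python
import math

observed = [[30, 10],
            [20, 40]]   # hypothetical 2x2 contingency table

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected frequency if rows and columns were independent
        e = row_totals[i] * col_totals[j] / n_total
        chi2 += (o - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)
contingency = math.sqrt(chi2 / (chi2 + n_total))

print(round(chi2, 3), df, round(contingency, 3))
```

With 1 degree of freedom, the 5% critical value in standard chi-square tables is 3.841, so a χ² of this size would lead to rejecting independence for this made-up table.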

UNIT-V: Basics of Sampling Theory

1. Sample Mean and Variance

The sample mean is the average of a set of sample data. It is used to estimate the population mean. The sample variance measures how much the data points differ from the sample mean.

Sample Mean (x̄) = Σx / n
Sample Variance (s²) = Σ(x - x̄)² / (n - 1)
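
Both formulas are implemented by the standard `statistics` module, whose `variance` function divides by n - 1. The sketch computes them by hand and with the library for comparison; the data set is illustrative.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

sample_mean = sum(sample) / n
# Dividing by n - 1 rather than n gives the sample variance
sample_var = sum((x - sample_mean) ** 2 for x in sample) / (n - 1)

print(sample_mean, sample_var)
print(statistics.mean(sample), statistics.variance(sample))
```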

2. Student's t-test

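The t-test is used to determine whether a sample mean differs significantly from a hypothesized population mean, or whether the means of two groups differ significantly. It is particularly suited to small samples where the population variance is unknown. The one-sample statistic, with n - 1 degrees of freedom, is: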

t = (x̄ - μ) / (s / √n)
Where:
x̄ = sample mean, μ = population mean, s = sample standard deviation, n = sample size
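
A sketch of the one-sample t statistic computed from raw data. The sample and the hypothesized mean μ = 4 are hypothetical; the resulting t would then be compared with a critical value from a t-table.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
mu_0 = 4                             # hypothesized population mean

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)         # sample standard deviation (n - 1 divisor)

t = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1

print(round(t, 4), df)
```

For 7 degrees of freedom, the two-tailed 5% critical value in standard t-tables is 2.365, so this made-up sample would not lead to rejecting μ = 4.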

3. Test of Hypotheses and Significance

Hypothesis testing is used to test an assumption regarding a population parameter. A hypothesis can be either null (H₀) or alternative (H₁). The goal is to determine whether the evidence is strong enough to reject the null hypothesis.

Null Hypothesis (H₀): Assumes no effect or difference.
Alternative Hypothesis (H₁): Assumes there is an effect or difference.
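p-value: Probability of observing data at least as extreme as the current data, assuming the null hypothesis is true. H₀ is rejected when the p-value falls below the chosen significance level (commonly 0.05).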

4. Degree of Freedom

The degrees of freedom are the number of values in a calculation that remain free to vary once constraints (such as a fixed sample mean) are imposed. They determine which t-distribution or chi-square distribution is used to judge a test statistic.

Degrees of freedom (df) = n - 1 for a one-sample t-test
Where n is the sample size.

5. Z-test

The Z-test is used to test the hypothesis about the population mean when the population variance is known or the sample size is large (n > 30). It compares the observed sample mean with the population mean.

Z = (x̄ - μ) / (σ / √n)
Where:
x̄ = sample mean, μ = population mean, σ = population standard deviation, n = sample size
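
A sketch of a two-sided Z-test using `statistics.NormalDist` for the p-value. All numbers are hypothetical.

```python
import math
from statistics import NormalDist

x_bar = 52    # observed sample mean (hypothetical)
mu = 50       # population mean under the null hypothesis
sigma = 10    # known population standard deviation
n = 100       # sample size

z = (x_bar - mu) / (sigma / math.sqrt(n))

# Two-sided p-value: probability of a |Z| at least this large under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(z, round(p_value, 4))
```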

6. Small and Large Sampling

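Small sampling refers to samples of about 30 or fewer data points; large sampling refers to samples of more than 30. The threshold is a rule of thumb. For small samples with unknown population variance the t-test is used, while for large samples the Z-test is appropriate because the sample mean is approximately normally distributed.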

7. Introduction to Monte Carlo Method

The Monte Carlo method is a statistical technique that uses random sampling to obtain numerical results. It is commonly used in simulations, optimization problems, and modeling complex systems.

Monte Carlo Simulation Steps:
1. Define the problem and random variables.
2. Generate random numbers for variables.
3. Run simulations and collect results.
4. Analyze the results to estimate the solution.
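
The four steps above can be sketched with a classic illustration, estimating π from random points in the unit square. This example is an assumption chosen for brevity, not one given in the notes; a fixed seed makes the run repeatable.

```python
import math
import random

def estimate_pi(n_points, seed=0):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)          # step 2: source of random numbers
    inside = 0
    for _ in range(n_points):          # step 3: run the simulation
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: the quarter circle has area pi / 4, so scale the fraction by 4
    return 4 * inside / n_points

pi_estimate = estimate_pi(100_000)
print(pi_estimate, abs(pi_estimate - math.pi))
```

The error shrinks roughly in proportion to 1 / √n, so quadrupling the number of points about halves the typical error.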