Linear Regression Calculator

Calculate slope, intercept, R-squared, and correlation coefficient. Generate best-fit line equations for your data.

Quick Facts

  • Linear regression: y = mx + b (the best-fit line equation)
  • R-squared range: 0 to 1 (1 = perfect fit)
  • Correlation (r): -1 to +1 (direction and strength)
  • Minimum points: 2 data points (more is better)

Key Takeaways

  • Linear regression finds the best-fit straight line through your data points
  • R-squared tells you what percentage of variance is explained by the model
  • The slope shows how much Y changes for each unit increase in X
  • Correlation coefficient (r) ranges from -1 to +1, indicating direction and strength
  • More data points generally lead to more reliable regression results

What is Linear Regression?

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. Simple linear regression uses a single independent variable to predict the value of a dependent variable. It finds the best-fitting straight line through a set of data points, minimizing the sum of squared differences between observed and predicted values.

This technique is fundamental to statistics, data science, and machine learning. Linear regression helps researchers understand relationships between variables, make predictions, and identify trends in data. It serves as the foundation for more complex regression models and is essential for anyone working with quantitative data.

The Linear Regression Equation

y = b0 + b1 * x

where:
  • y = predicted value of the dependent variable
  • x = independent variable
  • b0 = y-intercept (the value of y when x = 0)
  • b1 = slope (the change in y for each unit change in x)
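
To make the equation concrete, here is a minimal Python sketch that evaluates a fitted line for a few x values. The coefficient values are arbitrary placeholders for illustration, not the output of any particular dataset.

# Evaluate y = b0 + b1 * x for a few x values.
# b0 and b1 are placeholder values chosen for illustration only.
b0 = 2.0   # y-intercept
b1 = 0.5   # slope

def predict(x):
    """Predicted y for a given x on the fitted line."""
    return b0 + b1 * x

for x in [0, 1, 2, 3]:
    print(x, predict(x))   # prints 2.0, 2.5, 3.0, 3.5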

Understanding Regression Coefficients

Slope (b1)

The slope tells you how much the dependent variable changes for each one-unit increase in the independent variable. A positive slope indicates a positive relationship (as x increases, y increases), while a negative slope indicates an inverse relationship.

Slope Formula

b1 = (n * Sum(xy) - Sum(x) * Sum(y)) / (n * Sum(x^2) - (Sum(x))^2)

Y-Intercept (b0)

The y-intercept represents the predicted value of y when x equals zero. Depending on your data context, this may or may not have a meaningful interpretation. For example, if x represents years of experience and y represents salary, the y-intercept would represent the starting salary with zero experience.

b0 = mean(y) - b1 * mean(x)
Where mean(x) and mean(y) are the averages of x and y values respectively.
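
Combining the two formulas, here is a minimal pure-Python sketch that computes both coefficients from the raw sums. The function name fit_line is our own, not from any library.

def fit_line(xs, ys):
    """Ordinary least squares for simple linear regression.
    Returns (b0, b1): intercept and slope."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # b1 = (n * Sum(xy) - Sum(x) * Sum(y)) / (n * Sum(x^2) - (Sum(x))^2)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # b0 = mean(y) - b1 * mean(x)
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1

For example, fit_line([1, 2, 3], [2, 4, 6]) returns (0.0, 2.0), i.e. the line y = 2x.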

Coefficient of Determination (R-squared)

R-squared measures how well the regression line fits the data. It represents the proportion of variance in the dependent variable that is explained by the independent variable. R-squared values range from 0 to 1:

  • R-squared = 1: Perfect fit, the line explains 100% of the variance
  • R-squared = 0.9: Excellent fit, explains 90% of the variance
  • R-squared = 0.7: Good fit, explains 70% of the variance
  • R-squared = 0.5: Moderate fit, explains 50% of the variance
  • R-squared = 0: No linear relationship
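
As a sketch of how this is computed in practice (reusing the hypothetical fit_line helper above), R-squared compares the residual sum of squares against the total sum of squares:

def r_squared(xs, ys):
    """R^2 = 1 - SS_res / SS_tot for a simple linear fit."""
    b0, b1 = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                     # total variation
    return 1 - ss_res / ss_tot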

Correlation Coefficient (r)

The correlation coefficient measures the strength and direction of the linear relationship between two variables. Values range from -1 to +1:

r Value           Interpretation
 0.9 to  1.0      Very strong positive
 0.7 to  0.9      Strong positive
 0.5 to  0.7      Moderate positive
 0.3 to  0.5      Weak positive
-0.3 to  0.3      Little to no correlation
-0.5 to -0.3      Weak negative
-0.7 to -0.5      Moderate negative
-0.9 to -0.7      Strong negative
-1.0 to -0.9      Very strong negative

Note that for simple linear regression with one independent variable, R-squared equals the square of the correlation coefficient (R-squared = r^2).
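
Here is a sketch of Pearson's r built from the same raw sums as the slope formula; squaring its result should match the r_squared sketch above for any simple linear fit.

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
    return num / den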

Assumptions of Linear Regression

Linearity

The relationship between x and y must be linear. Plot your data first to verify this assumption. If the relationship appears curved, consider polynomial regression or data transformation.

Independence

Observations must be independent of each other. This assumption is often violated in time-series data where consecutive observations may be correlated.

Homoscedasticity

The variance of residuals should be constant across all levels of x. If the spread of residuals increases or decreases with x, the assumption is violated (heteroscedasticity).

Normality

For statistical inference (hypothesis tests, confidence intervals), residuals should be normally distributed. This is less critical for prediction purposes with large samples.
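
One simple way to eyeball the linearity and homoscedasticity assumptions is a residual plot. The sketch below assumes matplotlib is installed and reuses the hypothetical fit_line helper from earlier; a patternless band of points around zero is a good sign, while a curve or funnel shape suggests a violated assumption.

import matplotlib.pyplot as plt

def plot_residuals(xs, ys):
    """Scatter residuals against x to check linearity and constant variance."""
    b0, b1 = fit_line(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    plt.scatter(xs, residuals)
    plt.axhline(0, linestyle="--")   # zero-residual reference line
    plt.xlabel("x")
    plt.ylabel("residual")
    plt.show()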

Practical Example

Example: Study Hours vs. Exam Scores

X (Hours): 1, 2, 3, 4, 5, 6, 7, 8

Y (Score): 52, 58, 65, 71, 75, 82, 87, 91

Results:

Equation: y = 47.29 + 5.63x

R-squared = 0.995 (99.5% of variance explained)

r = 0.998 (very strong positive correlation)

Interpretation: Each additional hour of study is associated with an approximately 5.63-point increase in exam score.
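
These figures are straightforward to reproduce. Assuming NumPy is available, np.polyfit and np.corrcoef give the same results; this is a verification sketch, not part of the calculator itself.

import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 58, 65, 71, 75, 82, 87, 91])

b1, b0 = np.polyfit(hours, scores, deg=1)   # degree-1 fit returns (slope, intercept)
r = np.corrcoef(hours, scores)[0, 1]        # Pearson correlation

print(f"y = {b0:.2f} + {b1:.2f}x")          # y = 47.29 + 5.63x
print(f"r = {r:.3f}, R^2 = {r ** 2:.3f}")   # r = 0.998, R^2 = 0.995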

Applications of Linear Regression

Business and Economics

  • Predicting sales based on advertising spend
  • Estimating demand based on price
  • Forecasting economic indicators

Science and Research

  • Analyzing experimental data
  • Establishing dose-response relationships
  • Calibrating measurement instruments

Healthcare

  • Predicting patient outcomes
  • Analyzing treatment effectiveness
  • Modeling disease progression

Limitations of Linear Regression

Correlation vs. Causation

A strong correlation does not imply causation. The regression relationship only describes association; establishing causality requires experimental design or additional evidence.

Extrapolation Risks

Predictions outside the range of observed data (extrapolation) may be unreliable. The linear relationship may not hold beyond the data range.

Outlier Sensitivity

Linear regression is sensitive to outliers, which can significantly influence the slope and intercept. Always examine your data for outliers and consider their impact.
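
To see this sensitivity concretely, here is a small demonstration using the hypothetical fit_line sketch from earlier: five perfectly linear points give a slope of 2, and a single outlier triples it.

# Perfectly linear data: y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(fit_line(xs, ys))                # (0.0, 2.0): intercept 0, slope 2

# The same data plus one outlier
print(fit_line(xs + [6], ys + [40]))   # approx (-9.33, 6.0): the slope triples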

Frequently Asked Questions

What is the difference between correlation and regression?

Correlation measures the strength of the linear relationship (r). Regression goes further by providing an equation to predict y from x. Correlation is symmetric (x and y can be swapped), while regression has distinct dependent and independent variables.

When should I use linear regression?

Use linear regression when you want to predict a continuous outcome variable from one or more predictor variables, and the relationship appears linear. Verify the assumptions before relying on results for inference.

How many data points do I need?

While you can calculate a regression with as few as 2 points, meaningful analysis requires more. A common rule of thumb is at least 10-20 observations per predictor variable for reliable estimates.

What does a low R-squared mean?

A low R-squared doesn't necessarily mean the regression is useless. It may indicate that other variables affect y, or that y has high inherent variability. Consider adding predictors (multiple regression) or accepting that prediction precision is limited.