Key Takeaways
- Linear regression finds the best-fit straight line through your data points
- R-squared tells you what percentage of variance is explained by the model
- The slope shows how much Y changes for each unit increase in X
- Correlation coefficient (r) ranges from -1 to +1, indicating direction and strength
- More data points generally lead to more reliable regression results
What is Linear Regression?
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. Simple linear regression uses a single independent variable to predict the value of a dependent variable. It finds the best-fitting straight line through a set of data points, minimizing the sum of squared differences between observed and predicted values.
This technique is fundamental to statistics, data science, and machine learning. Linear regression helps researchers understand relationships between variables, make predictions, and identify trends in data. It serves as the foundation for more complex regression models and is essential for anyone working with quantitative data.
The Linear Regression Equation
y = b0 + b1 * x
where y is the predicted value of the dependent variable, x is the independent variable, b0 is the y-intercept, and b1 is the slope.
Understanding Regression Coefficients
Slope (b1)
The slope tells you how much the dependent variable changes for each one-unit increase in the independent variable. A positive slope indicates a positive relationship (as x increases, y increases), while a negative slope indicates an inverse relationship.
Slope Formula
b1 = (n * Sum(xy) - Sum(x) * Sum(y)) / (n * Sum(x^2) - (Sum(x))^2)
Y-Intercept (b0)
The y-intercept represents the predicted value of y when x equals zero. Depending on your data context, this may or may not have a meaningful interpretation. For example, if x represents years of experience and y represents salary, the y-intercept would represent the predicted starting salary at zero experience.
b0 = mean(y) - b1 * mean(x)
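As a minimal sketch, the two formulas above translate directly into code. The variable names and the sample data here are purely illustrative, not from any real dataset:

```python
# Fit a simple linear regression using the textbook formulas above.
# The data is an illustrative sample.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
b0 = sum_y / n - b1 * (sum_x / n)                              # intercept

print(f"y = {b0:.2f} + {b1:.2f}x")  # y = 0.27 + 1.93x for this sample
```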
Coefficient of Determination (R-squared)
R-squared measures how well the regression line fits the data. It represents the proportion of variance in the dependent variable that is explained by the independent variable. R-squared values range from 0 to 1; the labels below are rough rules of thumb, and what counts as a "good" fit varies considerably by field:
- R-squared = 1: Perfect fit, the line explains 100% of the variance
- R-squared = 0.9: Excellent fit, explains 90% of the variance
- R-squared = 0.7: Good fit, explains 70% of the variance
- R-squared = 0.5: Moderate fit, explains 50% of the variance
- R-squared = 0: No linear relationship
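R-squared can be computed directly from the residuals. Here is a minimal sketch; the helper name and data are illustrative, continuing the fitted line from the earlier snippet:

```python
# R-squared = 1 - SS_res / SS_tot: the share of y's variance
# that the fitted line accounts for.
def r_squared(y, y_pred):
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    ss_res = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))  # unexplained
    return 1 - ss_res / ss_tot

y = [2.1, 4.3, 5.9, 8.2, 9.8]
y_pred = [0.27 + 1.93 * xi for xi in [1, 2, 3, 4, 5]]  # from the earlier fit
print(f"R-squared = {r_squared(y, y_pred):.3f}")
```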
Correlation Coefficient (r)
The correlation coefficient measures the strength and direction of the linear relationship between two variables. Values range from -1 to +1:
| r Value | Interpretation |
|---|---|
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.5 to 0.7 | Moderate positive |
| 0.3 to 0.5 | Weak positive |
| -0.3 to 0.3 | Little to no correlation |
| -0.5 to -0.3 | Weak negative |
| -0.7 to -0.5 | Moderate negative |
| -0.9 to -0.7 | Strong negative |
| -1.0 to -0.9 | Very strong negative |
Note that for simple linear regression with one independent variable, R-squared is simply the square of the correlation coefficient (R-squared = r^2).
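As a quick numerical check of this identity, using NumPy and the same illustrative sample data as in the earlier snippets:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
print(f"r = {r:.3f}, r^2 = {r**2:.3f}")  # r^2 equals R-squared here
```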
Assumptions of Linear Regression
Linearity
The relationship between x and y must be linear. Plot your data first to verify this assumption. If the relationship appears curved, consider polynomial regression or data transformation.
Independence
Observations must be independent of each other. This assumption is often violated in time-series data where consecutive observations may be correlated.
Homoscedasticity
The variance of residuals should be constant across all levels of x. If the spread of residuals increases or decreases with x, the assumption is violated (heteroscedasticity).
Normality
For statistical inference (hypothesis tests, confidence intervals), residuals should be normally distributed. This is less critical for prediction purposes with large samples.
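A common way to eyeball the linearity, homoscedasticity, and normality assumptions is to inspect the residuals visually. A minimal sketch with NumPy and Matplotlib, reusing the study-hours data from the example below:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([52, 58, 65, 71, 75, 82, 87, 91])

b1, b0 = np.polyfit(x, y, 1)     # least-squares fit: slope, intercept
residuals = y - (b0 + b1 * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, residuals)        # curvature hints at non-linearity;
ax1.axhline(0, color="gray")     # fanning hints at heteroscedasticity
ax1.set_title("Residuals vs. x")
ax2.hist(residuals, bins=5)      # rough symmetry suggests normality
ax2.set_title("Residual histogram")
plt.tight_layout()
plt.show()
```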
Practical Example
Example: Study Hours vs. Exam Scores
X (Hours): 1, 2, 3, 4, 5, 6, 7, 8
Y (Score): 52, 58, 65, 71, 75, 82, 87, 91
Results:
Equation: y = 47.29 + 5.63x
R-squared = 0.995 (99.5% of variance explained)
r = 0.998 (very strong positive correlation)
Interpretation: Each additional hour of study is associated with an approximately 5.63-point increase in exam score.
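These results can be reproduced with SciPy's linregress; this is one sketch, and any standard statistics library would do:

```python
from scipy import stats

hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 58, 65, 71, 75, 82, 87, 91]

fit = stats.linregress(hours, scores)
print(f"y = {fit.intercept:.2f} + {fit.slope:.2f}x")  # y = 47.29 + 5.63x
print(f"R-squared = {fit.rvalue**2:.3f}")             # 0.995
print(f"r = {fit.rvalue:.3f}")                        # 0.998
```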
Applications of Linear Regression
Business and Economics
- Predicting sales based on advertising spend
- Estimating demand based on price
- Forecasting economic indicators
Science and Research
- Analyzing experimental data
- Establishing dose-response relationships
- Calibrating measurement instruments
Healthcare
- Predicting patient outcomes
- Analyzing treatment effectiveness
- Modeling disease progression
Limitations of Linear Regression
Correlation vs. Causation
A strong correlation does not imply causation. The regression relationship only describes association; establishing causality requires experimental design or additional evidence.
Extrapolation Risks
Predictions outside the range of observed data (extrapolation) may be unreliable. The linear relationship may not hold beyond the data range.
Outlier Sensitivity
Linear regression is sensitive to outliers, which can significantly influence the slope and intercept. Always examine your data for outliers and consider their impact.
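To see this sensitivity concretely, here is a small sketch that appends one fabricated outlier to the study-hours data and refits:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([52, 58, 65, 71, 75, 82, 87, 91])
slope, _ = np.polyfit(x, y, 1)

# One implausible point (9 hours, score 40), added purely for illustration.
x_out = np.append(x, 9)
y_out = np.append(y, 40)
slope_out, _ = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {slope:.2f}")      # about 5.63
print(f"slope with outlier:    {slope_out:.2f}")  # about 1.77
```

A single extreme point drags the slope from roughly 5.63 down to roughly 1.77, which is why examining outliers before trusting a fit matters.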
Frequently Asked Questions
What is the difference between correlation and regression?
Correlation measures the strength of the linear relationship (r). Regression goes further by providing an equation to predict y from x. Correlation is symmetric (x and y can be swapped), while regression has distinct dependent and independent variables.
When should I use linear regression?
Use linear regression when you want to predict a continuous outcome variable from one or more predictor variables, and the relationship appears linear. Verify assumptions before relying on results for inference.
How many data points do I need?
While you can calculate a regression with as few as two points, meaningful analysis requires more. A common rule of thumb is at least 10 to 20 observations per predictor variable for reliable estimates.
What does a low R-squared mean?
A low R-squared doesn't necessarily mean the regression is useless. It may indicate that other variables affect y, or that y has high inherent variability. Consider adding predictors (multiple regression) or accepting that prediction precision is limited.