# Line of best fit
dataset["p"] = b0 + b1 * dataset["x"]
dataset.head(3)
          x         e         y         p
0  0.496714  0.324084  2.317512  1.993428
1 -0.138264 -0.385082  0.338389  0.723471
2  0.647689 -0.676922  1.618455  2.295377
Visualising the data
Now, let’s plot the data and the line of best fit
# Plot the data
plt.scatter(x = dataset["x"], y = dataset["y"])
plt.plot(dataset["x"], dataset["p"], color = 'green')
plt.xlabel("X Variable")
plt.ylabel("Y Variable")
plt.legend(labels = ["Data points", "Best fit line"])
plt.show()
Note: .abs() is a pandas method that returns the absolute value of each element, and Python's built-in abs() does the same for a single number. For example, abs(-3) and abs(3) both return 3. It is useful for computing the distance between two numbers without considering their sign, such as 5 and -3 (which are 8 units apart): a = 5; b = -3; abs(a - b) = 8.
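As a small illustration of the note above, the sketch below applies the built-in abs() to plain numbers and the .abs() method to a pandas Series. The Series name values and its contents are made up for this example and do not come from the lecture dataset.

```python
import pandas as pd

# Built-in abs() on plain numbers: distance between 5 and -3
a = 5
b = -3
distance = abs(a - b)

# .abs() on a pandas Series works element by element
values = pd.Series([-3, 3, -1.5])   # hypothetical data for illustration
absolute_values = values.abs()

print(distance)
print(absolute_values.tolist())
```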
Estimate the line of best fit 📈
We usually have information about the dependent variable \(y\) and the independent variable \(x\), but we don’t know the coefficients \(b_0\) and \(b_1\)
The goal is to estimate the line of best fit that minimises the sum of the squared residuals
The residuals are the differences between the observed values of \(y\) and the predicted values of \(y\)
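One way to make the points above concrete is the closed-form least-squares solution for a single regressor. The sketch below is only an illustration under stated assumptions: the data are simulated here (true intercept 1, true slope 2, standard normal noise), and the names example, b0_hat, b1_hat and residuals are hypothetical. The course may estimate the coefficients with a different tool.

```python
import numpy as np
import pandas as pd

# Simulated data (assumption: true intercept 1, true slope 2, standard normal noise)
rng = np.random.default_rng(0)
n = 200
example = pd.DataFrame({"x": rng.normal(size=n), "e": rng.normal(size=n)})
example["y"] = 1 + 2 * example["x"] + example["e"]

# Closed-form least-squares estimates for one regressor
xbar = example["x"].mean()
ybar = example["y"].mean()
numerator = ((example["x"] - xbar) * (example["y"] - ybar)).sum()
denominator = ((example["x"] - xbar) ** 2).sum()
b1_hat = numerator / denominator
b0_hat = ybar - b1_hat * xbar

# Predicted values and residuals (observed y minus predicted y)
example["p_estimated"] = b0_hat + b1_hat * example["x"]
residuals = example["y"] - example["p_estimated"]

# Cross-check against NumPy's degree-1 polynomial fit (returns slope first)
slope_np, intercept_np = np.polyfit(example["x"], example["y"], deg=1)

print(b0_hat, b1_hat)
print((residuals ** 2).sum())
```

With an intercept in the model, the residuals sum to zero up to floating-point error, which is a quick sanity check on the estimates.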
# Create the new dataset
subset_above2 = pd.DataFrame()
# Subset the records with y >= 2
subset_above2 = dataset.query("y >= 2")
# Count the original rows
original_rows = len(dataset)
# Count the subsetted rows
subsetted_rows = len(subset_above2)
# Compute the proportion of subsetted observations
proportion = subsetted_rows / original_rows
proportion
Compute the standard deviation of \(y\) as stdv_sample
Use .query() to subset observations that satisfy \(abs\left(y - ybar \right) \le stdv\_sample\)
HINT: Use .mean() and .std()
HINT: Use the globals @ybar, @stdv_sample
# Store the sample mean of y as ybar
ybar = dataset["y"].mean()
# Compute the standard deviation of y as stdv_sample
stdv_sample = dataset["y"].std()
# Subset observations that satisfy abs(y - ybar) <= stdv_sample
subset_within_stdv = dataset.query("abs(y - @ybar) <= @stdv_sample")
# Count the subsetted rows
len(subset_within_stdv)
Compute a column with the formula sample_error = y - p_estimated
Create a lambda function fn_positive_error = lambda error: error >= 0
Compute a column for whether the error is positive using .apply()
# Compute a column with the formula sample_error = y - p_estimated
dataset["sample_error"] = dataset["y"] - dataset["p_estimated"]
# Create a lambda function fn_positive_error = lambda error: error >= 0
fn_positive_error = lambda error: error >= 0
# Compute a column for whether the error is positive using .apply()
dataset["positive_error"] = dataset["sample_error"].apply(fn_positive_error)
dataset.head(3)
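For comparison, the same True/False column can also be obtained by comparing the whole column at once, without .apply(). The sketch below uses a small made-up DataFrame (errors is a hypothetical name, not the lecture dataset) to show that both approaches give the same values.

```python
import pandas as pd

# Hypothetical errors, made up for illustration
errors = pd.DataFrame({"sample_error": [0.3, -0.4, 0.0, 1.2]})

# Element-by-element check with a lambda function and .apply()
fn_positive_error = lambda error: error >= 0
via_apply = errors["sample_error"].apply(fn_positive_error)

# The same check applied to the whole column in one comparison
via_comparison = errors["sample_error"] >= 0

print(via_apply.tolist())
print(via_comparison.tolist())
```

The .apply() version is useful when the function is more involved than a single comparison; for a simple condition like this one, the direct comparison is shorter.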
Compute a new column error_sqr = sample_error ** 2
Calculate the mean of error_sqr
# Compute a new column error_sqr = sample_error ** 2
dataset["error_sqr"] = dataset["sample_error"] ** 2
# Calculate the mean of error_sqr
mean_error_sqr = dataset["error_sqr"].mean()
mean_error_sqr