{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "# NO CODE\n", "\n", "from datascience import *\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "import numpy as np\n", "from scipy import stats\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Towards Multiple Regression ##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section is an extended example of applications of the methods we have derived for regression. We will start with simple regression, which we understand well, and will then indicate how some of the results can be extended when there is more than one predictor variable.\n", "\n", "The data are from a study on the treatment of Hodgkin's disease, a type of cancer that can affect young people. The good news is that treatments for this cancer have [high success rates](https://en.wikipedia.org/wiki/Hodgkin_lymphoma#Prognosis). The bad news is that the treatments can be rather strong combinations of chemotherapy and radiation, and thus have serious side effects. A goal of the study was to identify combinations of treatments with reduced side effects.\n", "\n", "The table `hodgkins` contains data on a random sample of patients. Each row corresponds to a patient. The columns contain the patient's height in centimeters, the amount of radiation, the amount of medication used in chemotherapy, and measurements on the health of the patient's lungs." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "# NO CODE\n", "\n", "hodgkins = Table.read_table('../../data/hodgkins.csv')\n", "diffs = hodgkins.column(4) - hodgkins.column(3)\n", "hodgkins = hodgkins.with_columns('difference', diffs)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
height rad chemo base month15 difference
164 679 180 160.57 87.77 -72.8
168 311 180 98.24 67.62 -30.62
173 388 239 129.04 133.33 4.29
157 370 168 85.41 81.28 -4.13
160 468 151 67.94 79.26 11.32
170 341 96 150.51 80.97 -69.54
163 453 134 129.88 69.24 -60.64
175 529 264 87.45 56.48 -30.97
185 392 240 149.84 106.99 -42.85
178 479 216 92.24 73.43 -18.81
\n", "

... (12 rows omitted)

" ], "text/plain": [ "height | rad | chemo | base | month15 | difference\n", "164 | 679 | 180 | 160.57 | 87.77 | -72.8\n", "168 | 311 | 180 | 98.24 | 67.62 | -30.62\n", "173 | 388 | 239 | 129.04 | 133.33 | 4.29\n", "157 | 370 | 168 | 85.41 | 81.28 | -4.13\n", "160 | 468 | 151 | 67.94 | 79.26 | 11.32\n", "170 | 341 | 96 | 150.51 | 80.97 | -69.54\n", "163 | 453 | 134 | 129.88 | 69.24 | -60.64\n", "175 | 529 | 264 | 87.45 | 56.48 | -30.97\n", "185 | 392 | 240 | 149.84 | 106.99 | -42.85\n", "178 | 479 | 216 | 92.24 | 73.43 | -18.81\n", "... (12 rows omitted)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hodgkins" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "22" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = hodgkins.num_rows\n", "n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The radiation was directed towards each patient's chest area or \"mantle\", to destroy cancer cells in the lymph nodes near that area. Since this could adversely affect the patients' lungs, the researchers measured the health of the patients' lungs both before and after treatment. Each patient received a score, with larger scores corresponding to healthier lungs. \n", "\n", "The table records the baseline scores and also the scores 15 months after treatment. The change in score (15 month score minus baseline score) is in the final column. Notice the negative differences: 15 months after treatment, many patients' lungs weren't doing as well as before the treatment. \n", "\n", "Perhaps not surprisingly, patients with larger baseline scores had bigger drops in score. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "hodgkins.scatter('base', 'difference')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will regress the difference on the baseline score, this time using the Python module `statsmodels` that allows us to easily perform multiple regression as well. You don't have to learn the code below (though it's not hard). Just focus on understanding and interpreting the output.\n", "\n", "As a first step, we must import the module." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Table` method `to_df` allows us to convert the table `hodgkins` to a structure called a data frame that works more smoothly with `statsmodels`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
height rad chemo base month15 difference
0 164 679 180 160.57 87.77 -72.80
1 168 311 180 98.24 67.62 -30.62
2 173 388 239 129.04 133.33 4.29
3 157 370 168 85.41 81.28 -4.13
4 160 468 151 67.94 79.26 11.32
5 170 341 96 150.51 80.97 -69.54
6 163 453 134 129.88 69.24 -60.64
7 175 529 264 87.45 56.48 -30.97
8 185 392 240 149.84 106.99 -42.85
9 178 479 216 92.24 73.43 -18.81
10 179 376 160 117.43 101.61 -15.82
11 181 539 196 129.75 90.78 -38.97
12 173 217 204 97.59 76.38 -21.21
13 166 456 192 81.29 67.66 -13.63
14 170 252 150 98.29 55.51 -42.78
15 165 622 162 118.98 90.92 -28.06
16 173 305 213 103.17 79.74 -23.43
17 174 566 198 94.97 93.08 -1.89
18 173 322 119 85.00 41.96 -43.04
19 173 270 160 115.02 81.12 -33.90
20 183 259 241 125.02 97.18 -27.84
21 188 238 252 137.43 113.20 -24.23
\n", "
" ], "text/plain": [ " height rad chemo base month15 difference\n", "0 164 679 180 160.57 87.77 -72.80\n", "1 168 311 180 98.24 67.62 -30.62\n", "2 173 388 239 129.04 133.33 4.29\n", "3 157 370 168 85.41 81.28 -4.13\n", "4 160 468 151 67.94 79.26 11.32\n", "5 170 341 96 150.51 80.97 -69.54\n", "6 163 453 134 129.88 69.24 -60.64\n", "7 175 529 264 87.45 56.48 -30.97\n", "8 185 392 240 149.84 106.99 -42.85\n", "9 178 479 216 92.24 73.43 -18.81\n", "10 179 376 160 117.43 101.61 -15.82\n", "11 181 539 196 129.75 90.78 -38.97\n", "12 173 217 204 97.59 76.38 -21.21\n", "13 166 456 192 81.29 67.66 -13.63\n", "14 170 252 150 98.29 55.51 -42.78\n", "15 165 622 162 118.98 90.92 -28.06\n", "16 173 305 213 103.17 79.74 -23.43\n", "17 174 566 198 94.97 93.08 -1.89\n", "18 173 322 119 85.00 41.96 -43.04\n", "19 173 270 160 115.02 81.12 -33.90\n", "20 183 259 241 125.02 97.18 -27.84\n", "21 188 238 252 137.43 113.20 -24.23" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h_data = hodgkins.to_df()\n", "h_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several variables we could use to predict the difference. The only one we wouldn't use is the 15 month measurement, as that's precisely what we won't have for a new patient before the treatment is administered. \n", "\n", "But which of the rest should we use? One way to choose is to look at the *correlation matrix* of all the variables." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
height rad chemo base month15 difference
height 1.000000 -0.305206 0.576825 0.354229 0.390527 -0.043394
rad -0.305206 1.000000 -0.003739 0.096432 0.040616 -0.073453
chemo 0.576825 -0.003739 1.000000 0.062187 0.445788 0.346310
base 0.354229 0.096432 0.062187 1.000000 0.561371 -0.630183
month15 0.390527 0.040616 0.445788 0.561371 1.000000 0.288794
difference -0.043394 -0.073453 0.346310 -0.630183 0.288794 1.000000
\n", "
" ], "text/plain": [ " height rad chemo base month15 difference\n", "height 1.000000 -0.305206 0.576825 0.354229 0.390527 -0.043394\n", "rad -0.305206 1.000000 -0.003739 0.096432 0.040616 -0.073453\n", "chemo 0.576825 -0.003739 1.000000 0.062187 0.445788 0.346310\n", "base 0.354229 0.096432 0.062187 1.000000 0.561371 -0.630183\n", "month15 0.390527 0.040616 0.445788 0.561371 1.000000 0.288794\n", "difference -0.043394 -0.073453 0.346310 -0.630183 0.288794 1.000000" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h_data.corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each entry in this table is the correlation between the variable specified by the row label and the variable specified by the column label. That's why all the diagonal entries are $1$.\n", "\n", "Look at the last column (or last row). This contains the correlation between `difference` and each of the other variables. The baseline measurement has the largest correlation. To run the regression of `difference` on `base` we must first extract the columns of data that we need and then use the appropriate `statsmodels` methods.\n", "\n", "First, we create data frames corresponding to the response and the predictor variable. The methods are not the same as for `Tables`, but you will get a general sense of what they are doing." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "y = h_data[['difference']] # response\n", "x = h_data[['base']] # predictor\n", "\n", "# specify that the model includes an intercept\n", "x_with_int = sm.add_constant(x) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The name of the `OLS` method stands for Ordinary Least Squares, which is the kind of least squares that we have discussed. 
There are other more complicated kinds that you might encounter in more advanced classes.\n", "\n", "There is a lot of output, some of which we will discuss and the rest of which we will leave to another class. For some reason the output includes the date and time of running the regression, right in the middle of the summary statistics." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: difference R-squared: 0.397
Model: OLS Adj. R-squared: 0.367
Method: Least Squares F-statistic: 13.17
Date: Fri, 06 Dec 2019 Prob (F-statistic): 0.00167
Time: 22:34:07 Log-Likelihood: -92.947
No. Observations: 22 AIC: 189.9
Df Residuals: 20 BIC: 192.1
Df Model: 1
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const 32.1721 17.151 1.876 0.075 -3.604 67.949
base -0.5447 0.150 -3.630 0.002 -0.858 -0.232
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 1.133 Durbin-Watson: 1.774
Prob(Omnibus): 0.568 Jarque-Bera (JB): 0.484
Skew: 0.362 Prob(JB): 0.785
Kurtosis: 3.069 Cond. No. 530.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: difference R-squared: 0.397\n", "Model: OLS Adj. R-squared: 0.367\n", "Method: Least Squares F-statistic: 13.17\n", "Date: Fri, 06 Dec 2019 Prob (F-statistic): 0.00167\n", "Time: 22:34:07 Log-Likelihood: -92.947\n", "No. Observations: 22 AIC: 189.9\n", "Df Residuals: 20 BIC: 192.1\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 32.1721 17.151 1.876 0.075 -3.604 67.949\n", "base -0.5447 0.150 -3.630 0.002 -0.858 -0.232\n", "==============================================================================\n", "Omnibus: 1.133 Durbin-Watson: 1.774\n", "Prob(Omnibus): 0.568 Jarque-Bera (JB): 0.484\n", "Skew: 0.362 Prob(JB): 0.785\n", "Kurtosis: 3.069 Cond. No. 530.\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simple_regression = sm.OLS(y, x_with_int).fit()\n", "simple_regression.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three blocks of output. We will focus only on the middle block.\n", "\n", "- `const` and `base` refer to the intercept and baseline measurement.\n", "- `coef` stands for the estimated coefficients, which in our notation are $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$.\n", "- `t` is the $t$-statistic for testing whether or not the coefficient is 0. 
Based on our model, its degrees of freedom are $n-2 = 20$; you'll find this under `Df Residuals` in the top block.\n", "- `P>|t|` is the two-sided $p$-value of that test: the total area in the two tails of the $t$ distribution with $n-2$ degrees of freedom, beyond $\\pm \\vert t \\vert$.\n", "- `[0.025 0.975]` are the ends of a 95% confidence interval for the parameter.\n", "\n", "For the test of whether or not the true slope of the baseline measurement is $0$, the observed test statistic is\n", "\n", "$$\n", "\\frac{-0.5447 - 0}{0.150} ~ = ~ -3.63\n", "$$\n", "\n", "The area in one tail is the chance that a random variable with the $t$ distribution with $20$ degrees of freedom is less than $-3.63$. That's the cdf of the distribution evaluated at $-3.63$:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0008339581409629714" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_tail = stats.t.cdf(-3.63, 20)\n", "one_tail" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our test is two-sided (large values of $\\vert t \\vert$ favor the alternative), so the $p$-value of the test is the total area of two tails, which is the displayed value $0.002$ after rounding." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0016679162819259429" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p = 2*one_tail\n", "p" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find a 95% confidence interval for the true slope, we have to replace $2$ in the expression $\\hat{\\beta}_1 \\pm 2SE(\\hat{\\beta}_1)$ by the corresponding value from the $t$ distribution with 20 degrees of freedom. 
That's not very far from $2$:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.0859634472658364" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_95 = stats.t.ppf(0.975, 20)\n", "t_95" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A 95% confidence interval for the true slope is given by $\\hat{\\beta}_1 \\pm t_{95}SE(\\hat{\\beta}_1)$. The observed interval is therefore given by the calculation below, which results in the same values as in the output of `sm.OLS` above." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(-0.8575945170898753, -0.23180548291012454)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 95% confidence interval for the true slope\n", "\n", "-0.5447 - t_95*0.150, -0.5447 + t_95*0.150" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple Regression ###\n", "What if we wanted to regress `difference` on both `base` and `chemo`? The first thing to do would be to check the correlation matrix again:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
height rad chemo base month15 difference
height 1.000000 -0.305206 0.576825 0.354229 0.390527 -0.043394
rad -0.305206 1.000000 -0.003739 0.096432 0.040616 -0.073453
chemo 0.576825 -0.003739 1.000000 0.062187 0.445788 0.346310
base 0.354229 0.096432 0.062187 1.000000 0.561371 -0.630183
month15 0.390527 0.040616 0.445788 0.561371 1.000000 0.288794
difference -0.043394 -0.073453 0.346310 -0.630183 0.288794 1.000000
\n", "
" ], "text/plain": [ " height rad chemo base month15 difference\n", "height 1.000000 -0.305206 0.576825 0.354229 0.390527 -0.043394\n", "rad -0.305206 1.000000 -0.003739 0.096432 0.040616 -0.073453\n", "chemo 0.576825 -0.003739 1.000000 0.062187 0.445788 0.346310\n", "base 0.354229 0.096432 0.062187 1.000000 0.561371 -0.630183\n", "month15 0.390527 0.040616 0.445788 0.561371 1.000000 0.288794\n", "difference -0.043394 -0.073453 0.346310 -0.630183 0.288794 1.000000" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h_data.corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What you are looking for is not just that `chemo` is the next most highly correlated with `difference` after `base`. More importantly, you are looking to see how strongly the two predictor variables `base` and `chemo` are linearly related *to each other*. That is, you are trying to figure out whether the two variables pick up genuinely different dimensions of the data.\n", "\n", "The correlation matrix shows that the correlation between `base` and `chemo` is only about $0.06$. The two predictors are not close to being linear functions of each other. So let's run the regression.\n", "\n", "The code is exactly the same as before, except that we have included a second predictor variable." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: difference R-squared: 0.546
Model: OLS Adj. R-squared: 0.499
Method: Least Squares F-statistic: 11.44
Date: Fri, 06 Dec 2019 Prob (F-statistic): 0.000548
Time: 22:34:31 Log-Likelihood: -89.820
No. Observations: 22 AIC: 185.6
Df Residuals: 19 BIC: 188.9
Df Model: 2
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const -0.9992 20.227 -0.049 0.961 -43.335 41.336
base -0.5655 0.134 -4.226 0.000 -0.846 -0.285
chemo 0.1898 0.076 2.500 0.022 0.031 0.349
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 0.853 Durbin-Watson: 1.781
Prob(Omnibus): 0.653 Jarque-Bera (JB): 0.368
Skew: 0.317 Prob(JB): 0.832
Kurtosis: 2.987 Cond. No. 1.36e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.36e+03. This might indicate that there are
strong multicollinearity or other numerical problems." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: difference R-squared: 0.546\n", "Model: OLS Adj. R-squared: 0.499\n", "Method: Least Squares F-statistic: 11.44\n", "Date: Fri, 06 Dec 2019 Prob (F-statistic): 0.000548\n", "Time: 22:34:31 Log-Likelihood: -89.820\n", "No. Observations: 22 AIC: 185.6\n", "Df Residuals: 19 BIC: 188.9\n", "Df Model: 2 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const -0.9992 20.227 -0.049 0.961 -43.335 41.336\n", "base -0.5655 0.134 -4.226 0.000 -0.846 -0.285\n", "chemo 0.1898 0.076 2.500 0.022 0.031 0.349\n", "==============================================================================\n", "Omnibus: 0.853 Durbin-Watson: 1.781\n", "Prob(Omnibus): 0.653 Jarque-Bera (JB): 0.368\n", "Skew: 0.317 Prob(JB): 0.832\n", "Kurtosis: 2.987 Cond. No. 1.36e+03\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 1.36e+03. This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n", "\"\"\"" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = h_data[['difference']] # response\n", "x2 = h_data[['base', 'chemo']] # predictors\n", "\n", "# specify that the model includes an intercept\n", "x2_with_int = sm.add_constant(x2) \n", "\n", "multiple_regression = sm.OLS(y, x2_with_int).fit()\n", "multiple_regression.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ignore the scary warning above. 
There is neither strong multicollinearity (predictor variables being highly correlated with each other) nor any other serious issue.\n", "\n", "Just focus on the middle block of the output. It's just like the middle block of the simple regression output, with one more line corresponding to `chemo`.\n", "\n", "All of the values in the block are interpreted in the same way as before. The only change is in the degrees of freedom: because you are estimating one more parameter, the degrees of freedom have dropped by $1$, and are thus $19$ instead of $20$.\n", "\n", "For example, the 95% confidence interval for the slope of `chemo` is calculated as follows." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.03073017186497201, 0.348869828135028)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_95_df19 = stats.t.ppf(0.975, 19)\n", "\n", "0.1898 - t_95_df19*0.076, 0.1898 + t_95_df19*0.076" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, take a look at the value of `R-squared` in the very top line. It is $0.546$ compared to $0.397$ for the simple regression. It's a math fact that adding a predictor variable can never lower the `R-squared` value, and in practice it almost always raises it at least a little. This tempts people into using lots of predictors, whether or not the resulting model is comprehensible.\n", "\n", "With an \"everything as well as the kitchen sink\" approach to selecting predictor variables, a researcher might be inclined to use all the possible predictors." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: difference R-squared: 0.550
Model: OLS Adj. R-squared: 0.444
Method: Least Squares F-statistic: 5.185
Date: Fri, 06 Dec 2019 Prob (F-statistic): 0.00645
Time: 22:50:02 Log-Likelihood: -89.741
No. Observations: 22 AIC: 189.5
Df Residuals: 17 BIC: 194.9
Df Model: 4
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const 33.5226 101.061 0.332 0.744 -179.698 246.743
base -0.5393 0.160 -3.378 0.004 -0.876 -0.202
chemo 0.2124 0.103 2.053 0.056 -0.006 0.431
rad -0.0062 0.031 -0.203 0.841 -0.071 0.059
height -0.2274 0.658 -0.346 0.734 -1.615 1.160
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 0.589 Durbin-Watson: 1.812
Prob(Omnibus): 0.745 Jarque-Bera (JB): 0.321
Skew: 0.286 Prob(JB): 0.852
Kurtosis: 2.851 Cond. No. 1.46e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: difference R-squared: 0.550\n", "Model: OLS Adj. R-squared: 0.444\n", "Method: Least Squares F-statistic: 5.185\n", "Date: Fri, 06 Dec 2019 Prob (F-statistic): 0.00645\n", "Time: 22:50:02 Log-Likelihood: -89.741\n", "No. Observations: 22 AIC: 189.5\n", "Df Residuals: 17 BIC: 194.9\n", "Df Model: 4 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 33.5226 101.061 0.332 0.744 -179.698 246.743\n", "base -0.5393 0.160 -3.378 0.004 -0.876 -0.202\n", "chemo 0.2124 0.103 2.053 0.056 -0.006 0.431\n", "rad -0.0062 0.031 -0.203 0.841 -0.071 0.059\n", "height -0.2274 0.658 -0.346 0.734 -1.615 1.160\n", "==============================================================================\n", "Omnibus: 0.589 Durbin-Watson: 1.812\n", "Prob(Omnibus): 0.745 Jarque-Bera (JB): 0.321\n", "Skew: 0.286 Prob(JB): 0.852\n", "Kurtosis: 2.851 Cond. No. 1.46e+04\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 1.46e+04. 
This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n", "\"\"\"" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = h_data[['difference']] # response\n", "x3 = h_data[['base', 'chemo', 'rad', 'height']] # predictors\n", "\n", "# specify that the model includes an intercept\n", "x3_with_int = sm.add_constant(x3) \n", "\n", "bad_regression = sm.OLS(y, x3_with_int).fit()\n", "bad_regression.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is not a good idea. We end up with a far more complicated model for no appreciable gain in `R-squared`. The \"adjusted $R^2$\" penalizes us for using more predictor variables: notice that the value of `Adj. R-squared` is smaller for the regression with all the predictors than for the regression with just `base` and `chemo`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Curious about how to select predictors, or about what makes a good regression? Then take some more statistics classes! This one is complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 4 }