{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [],
   "source": [
    "# NO CODE\n",
    "\n",
    "from datascience import *\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('fivethirtyeight')\n",
    "import numpy as np\n",
    "from scipy import stats\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Towards Multiple Regression ##"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section is an extended example of applications of the methods we have derived for regression. We will start with simple regression, which we understand well, and will then indicate how some of the results can be extended when there is more than one predictor variable.\n",
    "\n",
    "The data are from a study on the treatment of Hodgkin's disease, a type of cancer that can affect young people. The good news is that treatments for this cancer have [high success rates](https://en.wikipedia.org/wiki/Hodgkin_lymphoma#Prognosis). The bad news is that the treatments can be rather strong combinations of chemotherapy and radiation, and thus have serious side effects. A goal of the study was to identify combinations of treatments with reduced side effects.\n",
    "\n",
    "The table `hodgkins` contains data on a random sample of patients. Each row corresponds to a patient. The columns contain the patient's height in centimeters, the amount of radiation, the amount of medication used in chemotherapy, and measurements on the health of the patient's lungs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [],
   "source": [
    "# NO CODE\n",
    "\n",
    "hodgkins = Table.read_table('../../data/hodgkins.csv')\n",
    "diffs = hodgkins.column(4) - hodgkins.column(3)\n",
    "hodgkins = hodgkins.with_columns('difference', diffs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "    <thead>\n",
       "        <tr>\n",
       "            <th>height</th> <th>rad</th> <th>chemo</th> <th>base</th> <th>month15</th> <th>difference</th>\n",
       "        </tr>\n",
       "    </thead>\n",
       "    <tbody>\n",
       "        <tr>\n",
       "            <td>164   </td> <td>679 </td> <td>180  </td> <td>160.57</td> <td>87.77  </td> <td>-72.8     </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>168   </td> <td>311 </td> <td>180  </td> <td>98.24 </td> <td>67.62  </td> <td>-30.62    </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>173   </td> <td>388 </td> <td>239  </td> <td>129.04</td> <td>133.33 </td> <td>4.29      </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>157   </td> <td>370 </td> <td>168  </td> <td>85.41 </td> <td>81.28  </td> <td>-4.13     </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>160   </td> <td>468 </td> <td>151  </td> <td>67.94 </td> <td>79.26  </td> <td>11.32     </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>170   </td> <td>341 </td> <td>96   </td> <td>150.51</td> <td>80.97  </td> <td>-69.54    </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>163   </td> <td>453 </td> <td>134  </td> <td>129.88</td> <td>69.24  </td> <td>-60.64    </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>175   </td> <td>529 </td> <td>264  </td> <td>87.45 </td> <td>56.48  </td> <td>-30.97    </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>185   </td> <td>392 </td> <td>240  </td> <td>149.84</td> <td>106.99 </td> <td>-42.85    </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>178   </td> <td>479 </td> <td>216  </td> <td>92.24 </td> <td>73.43  </td> <td>-18.81    </td>\n",
       "        </tr>\n",
       "    </tbody>\n",
       "</table>\n",
       "<p>... (12 rows omitted)</p>"
      ],
      "text/plain": [
       "height | rad  | chemo | base   | month15 | difference\n",
       "164    | 679  | 180   | 160.57 | 87.77   | -72.8\n",
       "168    | 311  | 180   | 98.24  | 67.62   | -30.62\n",
       "173    | 388  | 239   | 129.04 | 133.33  | 4.29\n",
       "157    | 370  | 168   | 85.41  | 81.28   | -4.13\n",
       "160    | 468  | 151   | 67.94  | 79.26   | 11.32\n",
       "170    | 341  | 96    | 150.51 | 80.97   | -69.54\n",
       "163    | 453  | 134   | 129.88 | 69.24   | -60.64\n",
       "175    | 529  | 264   | 87.45  | 56.48   | -30.97\n",
       "185    | 392  | 240   | 149.84 | 106.99  | -42.85\n",
       "178    | 479  | 216   | 92.24  | 73.43   | -18.81\n",
       "... (12 rows omitted)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hodgkins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "22"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n = hodgkins.num_rows\n",
    "n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The radiation was directed towards each patient's chest area or \"mantle\", to destroy cancer cells in the lymph nodes near that area. Since this could adversely affect the patients' lungs, the researchers measured the health of the patients' lungs both before and after treatment. Each patient received a score, with larger scores corresponding to more healthy lungs. \n",
    "\n",
    "The table records the baseline scores and also the scores 15 months after treatment. The change in score (15 month score minus baseline score) is in the final column. Notice the negative differences: 15 months after treatment, many patients' lungs weren't doing as well as before the treatment. \n",
    "\n",
    "Perhaps not surprisingly, patients with larger baseline scores had bigger drops in score. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXsAAAFWCAYAAACIDD3jAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3de3TU5Z3H8fcEjMjFJDO5mIBJlxKCsASF3YQCwgoUSL0UuUkUpSwhgFExFgUJworc765GKqFAXbm4pLSAID3lpogsSW0xq1toUFsg0oTcMCENEDL7B2ZKSggTmMnM5Pm8zsk5ML/fDN9vcvLhmWee3/OzlJaW2hERkSbNz9MFiIiI+ynsRUQMoLAXETGAwl5ExAAKexERAyjsRUQMoLAXETGAwl5ExAAKex+Vm5vr6RI8wsS+TewZzOzbnT0r7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQM093QB3iy/sIT56ZspPleGNaANaSmJhAYHerosEZEG08i+HvPTN/NNQTEXL1bxTUEx89/a5OmSRERuisK+HsXnyvCzWADws1goKi3zcEUiIjdHYV8Pa0Abqu1X7sdebbdjDWjj4YpERG6Owr4eaSmJtA2z4u/fnIhQK2kpiZ4uSUTkpugD2nqEBgeyctZkT5chInLLNLIXETGAwl5ExAAKexERAyjsRUQMoLAXETGAwl5ExAAKexERAyjsRUQMoLAXETGAwl5ExAAKexERAyjsRUQMoLAXETGAwl5ExAAKexERAyjsRUQMoJuXiAgA+YUlzE/fTPG5MqwBbUhLSSQ0ONDTZYmLaGQvIgDMT9/MNwXFXLxYxTcFxcx/a5OnSxIXUtiLCADF58rws1gA8LNYKCot83BF4koKexEBwBrQhmq7HYBqux1rQBsPVySupLAXEQDSUhJpG2bF3785EaFW0lISPV2SuJA+oBURAEKDA1k5a7KnyxA30cheRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQM0mbBfs2YNsbGxhIWF0a9fPz755BNPlyQi4jWaRNhv3bqV6dOn89Of/pSPPvqIuLg4Ro4cyalTpzxdmoiIV2gSYZ+ens7jjz/O2LFjiYmJYcmSJYSFhbF27VpPlyYi4hV8PuwvXrzI0aNH6d+/f63H+/fvz5EjRzxUlYiId/H57RKKioq4fPkyISEhtR4PCQmhoKCgzufk5uY2Rmlu11T6aCgT+zaxZzCz75vtOTo6ut7jPh/2NSzfbc1aw263X/NYjRt9U3xBbm5uk+ijoUzs28Sewcy+3dmzz0/j2Gw2mjVrds0ovrCw8JrRvoiIqXw+7P39/bn33nvZv39/rcf3799PfHy8h6oSEfEuTWIaJyUlhYkTJ9KjRw/i4+NZu3Ytf/3rXxk3bpynS5Pv6P6mIp7VJMJ+2LBhFBcXs2TJEvLz87nnnnv47//+byIjIz1dmnyn5v6mfhaL4/6m2jtdpPE0ibAHSEpKIikpydNlyHXo/qYinuXzc/biG3R/UxHPUthLo9D9TUU8q8lM44h30/1NRTxLI3sREQNoZG8gLYMUMY9G9gaqWQZ58WKVYxmkiDRtCnsDaRmkiHkU9gbSMkgR8yjsDaRlkCLm0Qe0BtIySBHzaGQvImIAhb2IiAEU9iIiBlDYi4gYQGEvImIAhb2IiAEU9iIiBlDYi4gYQGEvImIAhb2IiAEU9iIiBlDYi4gYQGEvImIAhb2IiAEU9iIiBlDYi4gYQGEvImIAhb2IiAEU9iIiBlDYi4gYQGEvImIAhb2IiAEU9iIiBlDYi4gYoLmnC5Bbk19Ywvz0zRSfK8Ma0Ia0lERCgwM9XZaIeBmN7H3c/PTNfFNQzMWLVXxTUMz8tzZ5uiQR8UIKex9XfK4MP4sFAD+LhaLSMg9XJCLeSGHv46wBbai22wGottuxBrTxcEUi4o0U9j4uLSWRtmFW/P2bExFqJS0l0dMliYgX0ge0Pi40OJCVsyZ7ugwR8XIKexEP0UoqaUyaxhHxEK2kksaksBfxEK2kksaksBfxEK2kksaksBfxEK2kksakD2hFPEQrqaQxaWQvImIAhb2IiAEU9iIiBlDYi4gYQB/QikvoalAR76aRvbiErgYV8W4NDvvPP/+c1atXs3DhQvLz8wH46quvKCtz7dV/JSUlvPjii/zrv/4rd911F126dOGFF16guLi41nmlpaUkJycTGRlJZGQkycnJlJaWurQWuTFdDSri3ZwO+wsXLjB27Fj69u3LtGnTWLx4MWfOnAFg1qxZLFu2zKWFnTlzhjNnzvDqq6/yySef8Pbbb/PJJ58wfvz4WuclJSWRk5PDli1byMzMJCcnh4kTJ7q0FrkxXQ0q4t2cDvvXXnuNAwcO8Pbbb5Obm4v9u19sgB/+8Ifs3bvXpYV17tyZd999lx/96Ee0b9+ePn36MGfOHA4cOMC3334LwPHjx9mzZw8rV64kPj6euLg4VqxYwW9+8xtyc3NdWo/UT1eDing3pz+g/eUvf8nMmTMZOXIkly9frnUsKiqKkydPury4f1RWVsbtt99Oy5YtAcjKyqJ169bEx8c7zunZsyetWrXiyJEjREdHu70muUJXg4p4N6fDvri4mI4dO9Z5rLq6mosXL7qsqLqUlpYyb948nnrqKZo3v1J2QUEBNpsNy3dzxQAWi4Xg4GAKCgqu+1pNZdTfVPpoKBP7NrFnMLPvm+35RoNbp8M+KiqK7Oxs+vXrd82xTz/9lA4dOjj1OnPnzmXp0qX1nrNjxw7uv/9+x9/Pnz9PYmIi4eHhzJkzp9a5Vwd9DbvdXufjNZrCiD83N7dJ9FHD2aWbTa1vZ5jYM5jZtzt7djrsR48ezfLly4mMjOThhx8GrgTtRx99xFtvvcX06dOdep3JkyczatSoes9p166d48/l5eWMHDkSgPfee48WLVo4joWGhlJYWFgr3O12O0VFRYSEhDjbmniBmqWbfhaLY+mmpoVEXMfpsJ8yZQqff/45EydO5LnnngMgISGByspKhg8f7vQKGJvNhs1mc+rcsrIyRo4cid1uJzMzk9atW9c6HhcXR3l5OVlZWY55+6ysLM6fP19rHl+8n5ZuiriX02HfrFkz1q5dS1JSEvv27ePs2bNYrVYGDBhAnz59XF5YWVkZw4YNo6ysjA0bNlBRUUFFRQUAQUFB+Pv7ExMTw8CBA0lNTeX111/HbreTmprK4MGDjXv75+usAW0cI3st3RRxvQZvl9CrVy969erljlpqOXr0KNnZ2QD06NGj1rGr5/QzMjKYNm0aw4YNA66821i8eLHb6xPXSktJZP5bmygq/fucvYi4jtNhv3v3bk6ePElycvI1xzIyMoiKimLQoEEuK+z+++936krYoKAgVq9e7bJ/VzxDSzdF3Mvpi6qWLFnimEb5R5WVlSxZssRlRYmIiGs5PbLPzc2lW7dudR7r2rWrwl68gnbfFKmb0yP76upqysvL6zxWVlZGVVWVy4oSuVnafVOkbk6H/T//8z+zZcuWOo9t2bKFLl26uKwokZulJZwidXN6GiclJYWnnnqKsWPHMnbsWCIiIjhz5gzr16/n/fffZ/369W4sU8Q5WsLp/TTV5hlOj+wffvhhFi5cyL59+xgxYgS9evVi2LBh7Nu3j0WLFvHII4+4s04Rp2j3Te+nqTbPaNA6+4kTJ/L444+TlZVFcXExNpuNuLi4a65sFfEULeH0fppq84wGX1TVpk0bBgwY4I5aRMQAmmrzjAaFfXV1NZ9++imnT5+msrLymuOJiXrLLCL109XSnuF02B87downnniCr7/+utZdqmpYLBaFvYjckKbaPMPpsP/pT39KVVUV69ato0uXLvj7+7uzLhERcSGnwz4nJ4f09HStuhER8UFOL720Wq0azYuI+Cinw/7pp59mzZo119xsXEREvJ/T0ziFhYXk5uYSHx/PAw88QGBg7SveLBYLM2bMcHmBIiJy65wO+6tvEv7ll19ec1xhL+JdtC2BXM3psC8pKXFnHSJC3QF9s3QTd7ma03P2IuJ+rtw3RtsSyNUaFPZ2u51du3Yxc+ZMnn76aU6ePAnAxx9/zJkzZ9xSoIhJXBnQ1oA2VH93AaS2JRCnw760tJRBgwbxxBNP8M4777B582aKi4sBeOedd1ixYoXbihQxhSsDWjuAytWcnrN/5ZVXyMvL4ze/+Q3du3cnJCTEcaxfv3688cYbbilQxCR17RtzruTsTb2WtiWQqzkd9rt27eK1114jLi7umrX27dq1Iy8vz+XFiZimroC+2bAXuZrT0zjnz58nIiKizmMXLlyoc3M0ERHxDk6HfYcOHdi3b1+dxw4dOkTnzp1dVpSIiLiW09M4EyZMYOrUqdx5552MGDECgHPnzvHuu++SkZHBypUr3VakmEMXAom4h9NhP3bsWL7++msWLFjA/PnzAXj00Ufx8/NjypQpjBo1ym1FSuPwhqDVhUAi7tGgO1X9x3/8B//+7//OgQMHOHv2LFarlQceeIDvfe97bipPGpM3BK0uBBJxD6fC/uLFi8yePZuRI0fSvXt3nnrqKXfXJR7gDUGr+5OKuIdTH9D6+/uzfv16/va3v7m7HvEgb7jiUhcCibiH09M4sbGx/N///R+9e/d2Zz3iQd5wI2hdCCTiHk6H/dy5cxk/fjx33303gwcPxvLd231pOhS0Ik2X02H/k5/8hG+//ZbHH3+c5s2bExISck3gf/755y4vUEREbp3TYd+3b1+N5kVEfJTTYb9q1Sp31iEiIm7UoHX2ItIw3nChmgg08OYln332GWPGjKF9+/bYbDaOHj0KwJw5c9izZ49bChTxZa6885TIrXA67A8fPsygQYPIzc1lxIgRVFdX//1F/PxYu3atWwoU8WXecKGaCDQg7F999VX69+/P//zP/zj2xqkRGxtLTk6Oy4sT8XXecKGaCDQg7D/77DPGjx+PxWK5ZlWOzWajsLDQ5cWJ+DpdESzewukPaG+//XYqKirqPJafn8+dd97psqJEmgpdqCbewumRfc+ePVm1alWtWxLWjPD/67/+i759+7q+OhERcQmnR/ZpaWkMGTKEPn368Mgjj2CxWNi0aRNpaWl89tln172LlYiIeJ7TI/uuXbuyc+dOQkJCWLZsGXa7nYyMDADef/99oqOj3VakiIjcmnpH9rt27aJ3794EBAQAcO+997J9+3YqKyspKSkhICCAli1bNkqhIiJy8+od2Y8ZM4YTJ04AYLVa+fTTTwFo0aIF4eHhCnoRER9Rb9i3bt2ab7/9FgD7d2uFRUTE99Q7jdOtWzeef/55evXqBcDixYsJDg6u81yLxcKbb77p+gpFROSW1Rv2y5cvZ8aMGXzyySdYLBZ+//vf4+/vX+e52v5YRMR71Rv20dHRbNmyBYCgoCA2b95Mjx49GqUwERFxnRt+QPvVV18BkJ6eTlhYWKMUJSIirlVv2O/atYvi4mIAnnnmGfLz8xulKBHxnPzCEqa8uoonX1jMlFdXUVBY6umSxAXqDfvQ0FCys7OBK6txNC8v0vRpD/6mqd6wHzp0KDNmzMBqtWKxWBg4cCBWq7XOL5vN5rYi7XY7w4cPJzAwkG3bttU6VlpaSnJyMpGRkURGRpKcnExpqUYiIjdLe/A3TfV+QLtgwQJ69uzJsWPHWLRoEY8//jjh4eGNVZvDm2++SbNmzeo8lpSUxOnTp9myZQsWi4XnnnuOiRMn8t577zVylSJNgzWgDd8UFONnsWgP/iak3rC3WCwMHToUgI0bNzJp0iS6du3aKIXV+MMf/sDPfvYzDhw4cM3+O8ePH2fPnj3s3r2b+Ph4AFasWEFCQgK5ubnar0fkJqSlJDL/rU0Ulf79vrni+5ze9dITd6IqKytj/PjxrFixgpCQkGuOZ2Vl0bp1a0fQw5WtmFu1asWRI0cU9iI3QXvwN031hv2hQ4fo1q0brVu35tChQzd8sd69e7usMIAXXniBAQMGMGjQoDqPFxQUYLPZan1wbLFYCA4OpqCg4Lqvm5ub69I6PaWp9NFQJvZtYs9gZt832/ONBrf1hv1DDz3Enj176NGjBw899NB1V+PUrNSpWaZZn7lz57J06dJ6z9mxYwd5eXl8/vnn7N+/v95z66rpRiuHmsKI39RpKhP7NrFnMLNvd/Zcb9jv2LGDjh07ArB9+3aXLL2cPHkyo0aNqvecdu3asXHjRo4dO0bbtm1rHRs3bhxxcXHs3r2b0NBQCgsLa4W73W6nqKiozmkfERFT1Rv2ffr0cfz5/vvvd8k/aLPZnFqm+corr/Dss8/WeqxXr1689tprPPjggwDExcVRXl5OVlaWY94+KyuL8+fP15rHFxExXb1h//DDDzv9QhaLhe3bt99yQTUiIiKIiIi45vF27drxve99D4CYmBgGDhxIamoqr7/+Ona7ndTUVAYPHmzc2z8RkfrUG/bV1dW1pm5OnDhBfn4+kZGRhIaGUlBQwMmTJ7nrrrvo0KGD24utS0ZGBtOmTWPYsGEAJCQksHjxYo/UIiLireoN+507dzr+/P777zN9+nTHB7Y1fve73zFu3DgmTZrkviq/U9eVsUFBQaxevdrt/7Y0nvzCEuanb6b43N/XeYcGB3q6LBGf5vQ6+/nz55OWlnbNFsf/8i//wvTp05k3b55jLl28ly8Eac3eLH4Wi2NvFq37Fnfzhd+NW1Hv3jhX+/LLL697l6qQkBDHVsji3XxhkyvtzSKe4Au/G7fC6bCPiopi3bp1dR5bt24dkZGRLitK3McXgtQa0Ibq7+55rL1ZpLH4wu/GrXB6GmfatGlMmDCBH/zgBzzyyCOOD2i3b9/On/70JzIyMtxZp7iIL2xypb1ZxBN84XfjVjgd9sOHD8dms7FgwQJWrFjBpUuXuO222+jevTtbt26lX79+7qxTXMQXglR7s4gn+MLvxq2wlJaW2hv6pOrqaoqKirDZbPj5OT0TJC5k4qXkYGbfJvYMZvbtse0SrsfPz0/bEYiI+BANy0VEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDKCwFxExgMJeRMQACnsREQMo7EVEDOD1Yf/pp58ydOhQ2rZtS7t27Rg0aBBFRUWO46WlpSQnJxMZGUlkZCTJycmUlpZ6sGIREe/j1WH/u9/9jkcffZQ+ffrw29/+lgMHDvDMM8/QvHlzxzlJSUnk5OSwZcsWMjMzycnJYeLEiR6sWkTE+zS/8SmeM2PGDCZMmMDUqVMdj3Xo0MHx5+PHj7Nnzx52795NfHw8ACtWrCAhIYHc3Fyio6MbvWYREW/ktSP7s2fPkpWVRVhYGEOGDCE6OpqEhAQ+/PBDxzlZWVm0bt3aEfQAPXv2pFWrVhw5csQTZYuIeCWvHdn/+c9/BmDBggXMmTOH2NhYtm3bxrBhwzhw4ABdu3aloKAAm82GxWJxPM9isRAcHExBQcF1Xzs3N9fd5TeKptJHQ5nYd2P3XFjyLW9v/i3nyioIaNOSSYmDsAW2adQaQD/rhrjRTEajh/3cuXNZunRpvefs2LEDf39/AMaNG8eTTz4JQLdu3fj4449Zt24dy5cvB6gV9DXsdnudj9doCtM7pk5Tmdi3J3p+89VVnL9wGf/bW3D+wmU2f3CElbMmN2oN+lm7VqOH/eTJkxk1alS957Rr184xMo+Jial1rGPHjpw+fRqA0NBQCgsLa4W73W6nqKiIkJAQN1QvYobic2X4ffc75WexUFRa5uGK5FY1etjbbDZsNtsNz4uKiiI8PPyatzRffvklnTt3BiAuLo7y8nKysrIc8/ZZWVmcP3++1jy+iDSMNaAN3xQU42exUG23Yw1o/CkccS2v/YDWYrHw7LPPsnr1an7961/z1VdfsWzZMrKzs/nJT34CXBn1Dxw4kNTUVLKzs8nKyiI1NZXBgwcb9/ZPxJXSUhJpG2bF3785EaFW0lISPV2S3CKv/YAW4Omnn+bSpUvMnDmT4uJiOnXqRGZmJl27dnWck5GRwbRp0xg2bBgACQkJLF682FMlizQJocGBjT5HL+7l1WEPMGXKFKZMmXLd40FBQaxevboRKxIR8T1eO40jIiKuo7AXETGAwl5ExAAKexERAyjsRUQMoLAXETGAwl5ExAAKexERAyjsRUQMoLAXETGAwl5ExAAKexERAyjsRUQMoLAXETGAwl5ExAAKexERAyjsRUQMoLAXETGAwl5ExAAKexERAyjsRUQM0NzTBYiIeJP8whLmp2+m+FwZ1oA2pKUkEhoc6OmybplG9iIiV5mfvplvCoq5eLGKbwqKmf/WJk+X5BIKexGRqxSfK8PPYgHAz2KhqLTMwxW5hsJeROQq1oA2VNvtAFTb7VgD2ni4ItdQ2IuIXCUtJZG2YVb8/ZsTEWolLSXR0yW5hD6gFRG5SmhwICtnTfZ0GS6nkb2IiAEU9iIiBlDYi4gYQGEvImIAfUArIuIlCku+5c1XV7nl6l2N7EVEvMTbm3/rtqt3FfYiIl7iXFmF267eVdiLiHiJgDYt3Xb1rsJeRMRLTEoc5Lard/UBrYiIl7AFtnHb1bsa2YuIGEBhLyJiAIW9iIgBFPYiIgZQ2IuIGEBhLyJiAIW9iIgBFPYiIgawlJaW2j1dhIiIuJdG9iIiBlDYi4gYQGEvImIAhb2IiAEU9iIiBlDYe7HLly8zd+5cYmNjCQsLIzY2lrlz51JVVeU4x263s2DBAjp16sRdd93Fgw8+yB//+EcPVt0whw4dYvTo0dxzzz0EBgayYcOGWsed6a+0tJTk5GQiIyOJjIwkOTmZ0tLSxmyjwerr+9KlS8yePZtevXoRERFBTEwMSUlJnDp1qtZrXLhwgRdffJH27dsTERHB6NGjycvLa+xWnHajn/XVpkyZQmBgIG+88Uatx32tZ3Cu7xMnTjBmzBgiIyMJDw+nb9++HD9+3HHcFX0r7L3YypUrWbNmDYsWLSIrK4uFCxeSkZHB8uXLHee8/vrrpKens2jRIvbt20dISAiPPvooZWWuu52ZO50/f57OnTuzcOFC7rjjjmuOO9NfUlISOTk5bNmyhczMTHJycpg4cWJjttFg9fVdUVHBZ599xtSpU/nwww/ZuHEjeXl5jBgxotZ/9C+//DI7duzg5z//Obt27aKsrIzHHnuMy5cvN3Y7TrnRz7rGtm3b+P3vf094ePg1x3ytZ7hx33/+858ZPHgwUVFRbN++ncOHDzNz5kxatWrlOMcVfWudvRd77LHHCAoK4mc/+5njsUmTJlFSUsJ7772H3W6nU6dOTJgwgalTpwLwt7/9jejoaF577TXGjRvnqdJvStu2bVm8eDFPPPEEgFP9HT9+nPj4eHbv3k3Pnj0BOHz4MAkJCWRnZxMdHe2xfpz1j33X5dixY/Ts2ZNDhw7RpUsXzp07R4cOHUhPT2fUqFEAnD59mq5du5KZmcmAAQMaq/ybcr2eT548yeDBg/n1r3/NiBEjSE5O5tlnnwXw+Z6h7r6TkpKwWCxkZGTU+RxX9a2RvRfr2bMnH3/8MX/605+AK7/wBw8e5Ic//CEAf/nLX8jPz6d///6O59xxxx306tWLI0eOeKRmV3Kmv6ysLFq3bk18fLzjnJ49e9KqVasm8T2oUfNOJjAwEICjR49y6dKlWt+bdu3aERMT47N9V1VVkZSUxNSpU4mJibnmeFPsubq6mt27dxMTE8Pw4cP5/ve/zwMPPMDWrVsd57iqb92W0Is9//zzlJeXEx8fT7NmzaiqqmLq1KkkJSUBkJ+fD0BISEit54WEhHDmzJlGr9fVnOmvoKAAm82GxWJxHLdYLAQHB1NQUNB4xbrRxYsXmTlzJkOGDKFt27bAlb6bNWuGzWardW5ISIjP9r1gwQKCgoIYP358ncebYs9nz56lvLyc5cuXM2PGDGbPns1HH33EhAkTaNmyJUOGDHFZ3wp7L7Z161Y2b97MmjVr6NSpE//7v//L9OnTiYyM5KmnnnKcd3XQwZXpj398zJfdqL+6em0q34OqqiqSk5M5d+4cmzZtuuH5vtr3xx9/zMaNGzl48GCDn+urPcOVkT3Aj370I5555hkAYmNjOXr0KGvWrGHIkCHXfW5D+9Y0jhebNWsWzzzzDMOHD6dLly6MHj2alJQUVqxYAUBYWBjANf+7FxYWXjMa9kXO9BcaGkphYSF2+98/erLb7RQVFfn896Cqqorx48fzxRdfsG3bNqxWq+NYaGgoly9fpqioqNZzfPVnf/DgQf76178SExODzWbDZrNx6tQpZs+eTefOnYGm1zOAzWajefPm10xbdezYkdOnTwOu61th78UqKipo1qxZrceaNWvmGA1ERUURFhbG/v37HccrKys5fPhwrTlsX+VMf3FxcZSXl5OVleU4Jysri/Pnz/v09+DSpUuMGzeOL774gh07djj+46tx7733ctttt9X63uTl5Tk+sPY1SUlJHDp0iIMHDzq+wsPDefrpp9m2bRvQ9HoG8Pf3p3v37uTm5tZ6/MSJE9x9992A6/rWNI4XGzJkCCtXriQqKopOnTqRk5NDeno6o0ePBq5MX0yePJlly5YRHR1Nhw4dWLp0Ka1atWLEiBEert455eXlfPXVV8CVt7SnT58mJyeHoKAg7r777hv2FxMTw8CBA0lNTeX111/HbreTmprK4MGDvXolTn19h4eHM3bsWP7whz+wadMmLBaL4/OLO++8kzvuuIOAgACefPJJZs2aRUhICEFBQaSlpdGlSxf+7d/+zYOdXd+Nftb/OEpt3rw5YWFhjp+jL/YMN+77ueeeY9y4cfTq1Yu+ffty8OBBtm7d6liP76q+tfTSi5WVlTFv3jzef/99CgsLCQsLY/jw4bz00t62iG8AAATOSURBVEu0aNECuDJlsXDhQtavX09paSk9evRg6dKljre+3u7gwYM8/PDD1zyemJjIqlWrnOqvpKSEadOm8cEHHwCQkJDA4sWLHStXvFF9fU+fPp1u3brV+bz09HTHsr3KykpeeeUVMjMzqayspG/fvixbtox27dq5tfabdaOf9T/q2rVrraWX4Hs9g3N9b9iwgeXLl5OXl0f79u154YUXag3YXNG3wl5ExACasxcRMYDCXkTEAAp7EREDKOxFRAygsBcRMYDCXkTEAAp7kTosWLCAwMDAWvvHi/gyhb2IiAEU9iIiBlDYi9Tj+PHjPPTQQ4SHhxMTE8O8efMcG9FVVlby8ssv84Mf/IC2bdvSsWNHHnvsMcfNZmrk5+czadIkOnXqRGhoKDExMTz22GOcPXvWcU5FRQWzZ88mNjaWkJAQYmNjWbp0qePfErlV2ghNpB5PPPEEY8aM4YUXXmDv3r0sWbIEPz8/Xn75ZS5cuEB5eTlTp04lLCyMkpISfv7znzNw4ECys7MdO1VOnDiRU6dOMWfOHNq2bcvZs2f58MMPqaioAK5sZTx8+HCOHTvGiy++SJcuXcjOzmbJkiWUlJQwb948T34LpIlQ2IvUY+zYsaSmpgLQv39/ysrKSE9PZ/LkyQQGBvLGG284zr18+TIDBgygY8eOZGZmkpKSAkB2djavvPKK4/6hAEOHDnX8OTMzk8OHD7Nz50569+4NQL9+/QBYtGgRzz//vM/u1y7eQ9M4IvV49NFHa/19+PDhlJeX88c//hGAX/3qVwwYMIDIyEhsNhsRERGUl5dz4sQJx3Puu+8+3njjDVatWsUXX3xR60YrAHv37uXuu+8mPj6eqqoqx1f//v25dOkS2dnZ7m9UmjyFvUg96rr/LcCZM2f44IMPGDduHB07dmTNmjXs3buX/fv3ExwcTGVlpeM569atIyEhgf/8z/+kd+/e3HPPPSxatMgxH3/27FlOnTpFcHBwra+aG0wXFxc3UrfSlGkaR6QeZ8+epVWrVrX+DhAeHs7atWtp3759rb3YL126RElJSa3XCAkJYenSpSxdupTc3Fw2bdrEggULCA4OZvz48VitVqKioli/fn2dNURGRrq+MTGORvYi9fjVr35V6++//OUvad26Nffccw8VFRU0b157vLR582YuX7583deLjo5m1qxZBAYGOqaCBgwYQF5eHq1ateK+++675stms7m+MTGORvYi9fjFL35BdXU13bt3Z+/evbzzzjtMnz6dwMBABg4cyM6dO3n55ZcZMmQIR48e5e233yYgIMDx/HPnzjF06FBGjhxJx44due2229i5cyelpaU88MADAIwaNYoNGzbw4x//mJSUFLp27crFixf5+uuv+eCDD9iwYQMtW7b01LdAmgiFvUg9Nm7cyEsvvcSSJUu48847mTp1Ki+99BJwZaVOXl4e7777LuvXr+e+++5j06ZNjBkzxvH8Fi1a0K1bN9555x1OnTqFn58fHTp0ICMjgwcffBCA2267ja1bt7JixQp+8Ytf8Je//IWWLVvyT//0TwwaNAh/f3+P9C5Ni25LKCJiAM3Zi4gYQGEvImIAhb2IiAEU9iIiBlDYi4gYQGEvImIAhb2IiAEU9iIiBlDYi4gY4P8BkQ4XkicvkIcAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 360x360 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "hodgkins.scatter('base', 'difference')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will regress the difference on the baseline score, this time using the Python module `statsmodels` that allows us to easily perform multiple regression as well. You don't have to learn the code below (though it's not hard). Just focus on understanding an interpreting the output.\n",
    "\n",
    "As a first step, we must import the module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Table` method `to_df` allows us to convert the table `hodgkins` to a structure called a data frame that works more smoothly with `statsmodels`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>height</th>\n",
       "      <th>rad</th>\n",
       "      <th>chemo</th>\n",
       "      <th>base</th>\n",
       "      <th>month15</th>\n",
       "      <th>difference</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>164</td>\n",
       "      <td>679</td>\n",
       "      <td>180</td>\n",
       "      <td>160.57</td>\n",
       "      <td>87.77</td>\n",
       "      <td>-72.80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>168</td>\n",
       "      <td>311</td>\n",
       "      <td>180</td>\n",
       "      <td>98.24</td>\n",
       "      <td>67.62</td>\n",
       "      <td>-30.62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>173</td>\n",
       "      <td>388</td>\n",
       "      <td>239</td>\n",
       "      <td>129.04</td>\n",
       "      <td>133.33</td>\n",
       "      <td>4.29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>157</td>\n",
       "      <td>370</td>\n",
       "      <td>168</td>\n",
       "      <td>85.41</td>\n",
       "      <td>81.28</td>\n",
       "      <td>-4.13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>160</td>\n",
       "      <td>468</td>\n",
       "      <td>151</td>\n",
       "      <td>67.94</td>\n",
       "      <td>79.26</td>\n",
       "      <td>11.32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>170</td>\n",
       "      <td>341</td>\n",
       "      <td>96</td>\n",
       "      <td>150.51</td>\n",
       "      <td>80.97</td>\n",
       "      <td>-69.54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>163</td>\n",
       "      <td>453</td>\n",
       "      <td>134</td>\n",
       "      <td>129.88</td>\n",
       "      <td>69.24</td>\n",
       "      <td>-60.64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>175</td>\n",
       "      <td>529</td>\n",
       "      <td>264</td>\n",
       "      <td>87.45</td>\n",
       "      <td>56.48</td>\n",
       "      <td>-30.97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>185</td>\n",
       "      <td>392</td>\n",
       "      <td>240</td>\n",
       "      <td>149.84</td>\n",
       "      <td>106.99</td>\n",
       "      <td>-42.85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>178</td>\n",
       "      <td>479</td>\n",
       "      <td>216</td>\n",
       "      <td>92.24</td>\n",
       "      <td>73.43</td>\n",
       "      <td>-18.81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>10</td>\n",
       "      <td>179</td>\n",
       "      <td>376</td>\n",
       "      <td>160</td>\n",
       "      <td>117.43</td>\n",
       "      <td>101.61</td>\n",
       "      <td>-15.82</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>11</td>\n",
       "      <td>181</td>\n",
       "      <td>539</td>\n",
       "      <td>196</td>\n",
       "      <td>129.75</td>\n",
       "      <td>90.78</td>\n",
       "      <td>-38.97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>12</td>\n",
       "      <td>173</td>\n",
       "      <td>217</td>\n",
       "      <td>204</td>\n",
       "      <td>97.59</td>\n",
       "      <td>76.38</td>\n",
       "      <td>-21.21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>13</td>\n",
       "      <td>166</td>\n",
       "      <td>456</td>\n",
       "      <td>192</td>\n",
       "      <td>81.29</td>\n",
       "      <td>67.66</td>\n",
       "      <td>-13.63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>14</td>\n",
       "      <td>170</td>\n",
       "      <td>252</td>\n",
       "      <td>150</td>\n",
       "      <td>98.29</td>\n",
       "      <td>55.51</td>\n",
       "      <td>-42.78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>15</td>\n",
       "      <td>165</td>\n",
       "      <td>622</td>\n",
       "      <td>162</td>\n",
       "      <td>118.98</td>\n",
       "      <td>90.92</td>\n",
       "      <td>-28.06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>16</td>\n",
       "      <td>173</td>\n",
       "      <td>305</td>\n",
       "      <td>213</td>\n",
       "      <td>103.17</td>\n",
       "      <td>79.74</td>\n",
       "      <td>-23.43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>17</td>\n",
       "      <td>174</td>\n",
       "      <td>566</td>\n",
       "      <td>198</td>\n",
       "      <td>94.97</td>\n",
       "      <td>93.08</td>\n",
       "      <td>-1.89</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>18</td>\n",
       "      <td>173</td>\n",
       "      <td>322</td>\n",
       "      <td>119</td>\n",
       "      <td>85.00</td>\n",
       "      <td>41.96</td>\n",
       "      <td>-43.04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>19</td>\n",
       "      <td>173</td>\n",
       "      <td>270</td>\n",
       "      <td>160</td>\n",
       "      <td>115.02</td>\n",
       "      <td>81.12</td>\n",
       "      <td>-33.90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>20</td>\n",
       "      <td>183</td>\n",
       "      <td>259</td>\n",
       "      <td>241</td>\n",
       "      <td>125.02</td>\n",
       "      <td>97.18</td>\n",
       "      <td>-27.84</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>21</td>\n",
       "      <td>188</td>\n",
       "      <td>238</td>\n",
       "      <td>252</td>\n",
       "      <td>137.43</td>\n",
       "      <td>113.20</td>\n",
       "      <td>-24.23</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    height  rad  chemo    base  month15  difference\n",
       "0      164  679    180  160.57    87.77      -72.80\n",
       "1      168  311    180   98.24    67.62      -30.62\n",
       "2      173  388    239  129.04   133.33        4.29\n",
       "3      157  370    168   85.41    81.28       -4.13\n",
       "4      160  468    151   67.94    79.26       11.32\n",
       "5      170  341     96  150.51    80.97      -69.54\n",
       "6      163  453    134  129.88    69.24      -60.64\n",
       "7      175  529    264   87.45    56.48      -30.97\n",
       "8      185  392    240  149.84   106.99      -42.85\n",
       "9      178  479    216   92.24    73.43      -18.81\n",
       "10     179  376    160  117.43   101.61      -15.82\n",
       "11     181  539    196  129.75    90.78      -38.97\n",
       "12     173  217    204   97.59    76.38      -21.21\n",
       "13     166  456    192   81.29    67.66      -13.63\n",
       "14     170  252    150   98.29    55.51      -42.78\n",
       "15     165  622    162  118.98    90.92      -28.06\n",
       "16     173  305    213  103.17    79.74      -23.43\n",
       "17     174  566    198   94.97    93.08       -1.89\n",
       "18     173  322    119   85.00    41.96      -43.04\n",
       "19     173  270    160  115.02    81.12      -33.90\n",
       "20     183  259    241  125.02    97.18      -27.84\n",
       "21     188  238    252  137.43   113.20      -24.23"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "h_data = hodgkins.to_df()\n",
    "h_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several variables we could use to predict the difference. The only one we wouldn't use is the 15 month measurement, as that's precisely what we won't have for a new patient before the treatment is adminstered. \n",
    "\n",
    "But which of the rest should we use? One way to choose is to look at the *correlation matrix* of all the variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>height</th>\n",
       "      <th>rad</th>\n",
       "      <th>chemo</th>\n",
       "      <th>base</th>\n",
       "      <th>month15</th>\n",
       "      <th>difference</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>height</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.305206</td>\n",
       "      <td>0.576825</td>\n",
       "      <td>0.354229</td>\n",
       "      <td>0.390527</td>\n",
       "      <td>-0.043394</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>rad</td>\n",
       "      <td>-0.305206</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.003739</td>\n",
       "      <td>0.096432</td>\n",
       "      <td>0.040616</td>\n",
       "      <td>-0.073453</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>chemo</td>\n",
       "      <td>0.576825</td>\n",
       "      <td>-0.003739</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.062187</td>\n",
       "      <td>0.445788</td>\n",
       "      <td>0.346310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>base</td>\n",
       "      <td>0.354229</td>\n",
       "      <td>0.096432</td>\n",
       "      <td>0.062187</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.561371</td>\n",
       "      <td>-0.630183</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>month15</td>\n",
       "      <td>0.390527</td>\n",
       "      <td>0.040616</td>\n",
       "      <td>0.445788</td>\n",
       "      <td>0.561371</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.288794</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>difference</td>\n",
       "      <td>-0.043394</td>\n",
       "      <td>-0.073453</td>\n",
       "      <td>0.346310</td>\n",
       "      <td>-0.630183</td>\n",
       "      <td>0.288794</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              height       rad     chemo      base   month15  difference\n",
       "height      1.000000 -0.305206  0.576825  0.354229  0.390527   -0.043394\n",
       "rad        -0.305206  1.000000 -0.003739  0.096432  0.040616   -0.073453\n",
       "chemo       0.576825 -0.003739  1.000000  0.062187  0.445788    0.346310\n",
       "base        0.354229  0.096432  0.062187  1.000000  0.561371   -0.630183\n",
       "month15     0.390527  0.040616  0.445788  0.561371  1.000000    0.288794\n",
       "difference -0.043394 -0.073453  0.346310 -0.630183  0.288794    1.000000"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "h_data.corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each entry in this table is the correlation between the variable specified by the row label and the variable specified by the column label. That's why all the diagonal entries are $1$.\n",
    "\n",
    "Look at the last column (or last row). This contains the correlation between `difference` and each of the other variables. The baseline measurement has the largest correlation. To run the regression of `difference` on `base` we must first extract the columns of data that we need and then use the appropriate `statsmodels` methods.\n",
    "\n",
    "First, we create data frames corresponding to the response and the predictor variable. The methods are not the same as for `Tables`, but you will get a general sense of what they are doing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = h_data[['difference']]  # response\n",
    "x = h_data[['base']]        # predictor\n",
    "\n",
    "# specify that the model includes an intercept\n",
    "x_with_int = sm.add_constant(x) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The name of the `OLS` method stands for Ordinary Least Squares, which is the kind of least squares that we have discussed. There are other more complicated kinds that you might encounter in more advanced classes.\n",
    "\n",
    "There is a lot of output, some of which we will discuss and the rest of which we will leave to another class. For some reason the output includes the date and time of running the regression, right in the middle of the summary statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>       <td>difference</td>    <th>  R-squared:         </th> <td>   0.397</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.367</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   13.17</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Fri, 06 Dec 2019</td> <th>  Prob (F-statistic):</th>  <td>0.00167</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>22:34:07</td>     <th>  Log-Likelihood:    </th> <td> -92.947</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>    22</td>      <th>  AIC:               </th> <td>   189.9</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>    20</td>      <th>  BIC:               </th> <td>   192.1</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     1</td>      <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "    <td></td>       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>const</th> <td>   32.1721</td> <td>   17.151</td> <td>    1.876</td> <td> 0.075</td> <td>   -3.604</td> <td>   67.949</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>base</th>  <td>   -0.5447</td> <td>    0.150</td> <td>   -3.630</td> <td> 0.002</td> <td>   -0.858</td> <td>   -0.232</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 1.133</td> <th>  Durbin-Watson:     </th> <td>   1.774</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.568</td> <th>  Jarque-Bera (JB):  </th> <td>   0.484</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.362</td> <th>  Prob(JB):          </th> <td>   0.785</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 3.069</td> <th>  Cond. No.          </th> <td>    530.</td>\n",
       "</tr>\n",
       "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:             difference   R-squared:                       0.397\n",
       "Model:                            OLS   Adj. R-squared:                  0.367\n",
       "Method:                 Least Squares   F-statistic:                     13.17\n",
       "Date:                Fri, 06 Dec 2019   Prob (F-statistic):            0.00167\n",
       "Time:                        22:34:07   Log-Likelihood:                -92.947\n",
       "No. Observations:                  22   AIC:                             189.9\n",
       "Df Residuals:                      20   BIC:                             192.1\n",
       "Df Model:                           1                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "const         32.1721     17.151      1.876      0.075      -3.604      67.949\n",
       "base          -0.5447      0.150     -3.630      0.002      -0.858      -0.232\n",
       "==============================================================================\n",
       "Omnibus:                        1.133   Durbin-Watson:                   1.774\n",
       "Prob(Omnibus):                  0.568   Jarque-Bera (JB):                0.484\n",
       "Skew:                           0.362   Prob(JB):                        0.785\n",
       "Kurtosis:                       3.069   Cond. No.                         530.\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "simple_regression = sm.OLS(y, x_with_int).fit()\n",
    "simple_regression.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are three blocks of output. We will focus only on the the middle block.\n",
    "\n",
    "- `const` and `base` refer to the intercept and baseline measurement.\n",
    "- `coef` stands for the estimated coefficients, which in our notation are $\\hat{\\beta_0}$ and $\\hat{\\beta_1}$.\n",
    "- `t` is the $t$-statistic for testing whether or not the coefficient is 0. Based on our model, its degrees of freedom are $n-2 = 20$; you'll find this under `Df Residuals` in the top block.\n",
    "- `P > |t|` is the total area in the two tails of the $t$ distribution with $n-2$ degrees of freedom.\n",
    "- `[0.025 0.975]` are the ends of a 95% confidence interval for the parameter.\n",
    "\n",
    "For the test of whether or not the true slope of the baseline measurement is $0$, the observed test statistic is\n",
    "\n",
    "$$\n",
    "\\frac{-0.5447 - 0}{0.150} ~ = ~ -3.63\n",
    "$$\n",
    "\n",
    "The area in one tail is the chance that the $t$ distribution with $20$ degrees of freedom is less than $-3.63$. That's the cdf of the distribution evaluated at $-3.63$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0008339581409629714"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "one_tail = stats.t.cdf(-3.63, 20)\n",
    "one_tail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our test is two-sided (large values of $\\vert t \\vert$ favor the alternative), so the $p$-value of the test is the total area of two tails, which is the displayed value $0.002$ after rounding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0016679162819259429"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p = 2*one_tail\n",
    "p"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To find a 95% confidence interval for the true slope, we have to replace $2$ in the expression $\\hat{\\beta}_1 \\pm 2SE(\\hat{\\beta}_1)$ by the corresponding value from the $t$ distribution with 20 degrees of freedom. That's not very far from $2$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.0859634472658364"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_95 = stats.t.ppf(0.975, 20)\n",
    "t_95"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A 95% confidence interval for the true slope is given by $\\hat{\\beta}_1 \\pm t_{95}SE(\\hat{\\beta}_1)$. The observed interval is therefore given by the calculation below, which results in the same values as in the output of `sm.OLS` above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(-0.8575945170898753, -0.23180548291012454)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 95% confidence interval for the true slope\n",
    "\n",
    "-0.5447 - t_95*0.150, -0.5447 + t_95*0.150"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple Regression ###\n",
    "What if we wanted to regress `difference` on both `base` and `chemo`? The first thing to do would be to check the correlation matrix again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>height</th>\n",
       "      <th>rad</th>\n",
       "      <th>chemo</th>\n",
       "      <th>base</th>\n",
       "      <th>month15</th>\n",
       "      <th>difference</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>height</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.305206</td>\n",
       "      <td>0.576825</td>\n",
       "      <td>0.354229</td>\n",
       "      <td>0.390527</td>\n",
       "      <td>-0.043394</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>rad</td>\n",
       "      <td>-0.305206</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.003739</td>\n",
       "      <td>0.096432</td>\n",
       "      <td>0.040616</td>\n",
       "      <td>-0.073453</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>chemo</td>\n",
       "      <td>0.576825</td>\n",
       "      <td>-0.003739</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.062187</td>\n",
       "      <td>0.445788</td>\n",
       "      <td>0.346310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>base</td>\n",
       "      <td>0.354229</td>\n",
       "      <td>0.096432</td>\n",
       "      <td>0.062187</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.561371</td>\n",
       "      <td>-0.630183</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>month15</td>\n",
       "      <td>0.390527</td>\n",
       "      <td>0.040616</td>\n",
       "      <td>0.445788</td>\n",
       "      <td>0.561371</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.288794</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>difference</td>\n",
       "      <td>-0.043394</td>\n",
       "      <td>-0.073453</td>\n",
       "      <td>0.346310</td>\n",
       "      <td>-0.630183</td>\n",
       "      <td>0.288794</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              height       rad     chemo      base   month15  difference\n",
       "height      1.000000 -0.305206  0.576825  0.354229  0.390527   -0.043394\n",
       "rad        -0.305206  1.000000 -0.003739  0.096432  0.040616   -0.073453\n",
       "chemo       0.576825 -0.003739  1.000000  0.062187  0.445788    0.346310\n",
       "base        0.354229  0.096432  0.062187  1.000000  0.561371   -0.630183\n",
       "month15     0.390527  0.040616  0.445788  0.561371  1.000000    0.288794\n",
       "difference -0.043394 -0.073453  0.346310 -0.630183  0.288794    1.000000"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "h_data.corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What you are looking for is not just that `chemo` is the next most highly correlated with `difference` after `base`. More importantly, you are looking to see how strongly the two predictor variables `base` and `chemo` are linearly related *to each other*. That is, you are trying to figure out whether the two variables pick up genuinely different dimensions of the data.\n",
    "\n",
    "The correlation matrix shows that the correlation between `base` and `chemo` is only about $0.06$. The two predictors are not close to being linear functions of each other. So let's run the regression.\n",
    "\n",
    "The code is exactly the same as before, except that we have included a second predictor variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>       <td>difference</td>    <th>  R-squared:         </th> <td>   0.546</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.499</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   11.44</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Fri, 06 Dec 2019</td> <th>  Prob (F-statistic):</th> <td>0.000548</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>22:34:31</td>     <th>  Log-Likelihood:    </th> <td> -89.820</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>    22</td>      <th>  AIC:               </th> <td>   185.6</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>    19</td>      <th>  BIC:               </th> <td>   188.9</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     2</td>      <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "    <td></td>       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>const</th> <td>   -0.9992</td> <td>   20.227</td> <td>   -0.049</td> <td> 0.961</td> <td>  -43.335</td> <td>   41.336</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>base</th>  <td>   -0.5655</td> <td>    0.134</td> <td>   -4.226</td> <td> 0.000</td> <td>   -0.846</td> <td>   -0.285</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>chemo</th> <td>    0.1898</td> <td>    0.076</td> <td>    2.500</td> <td> 0.022</td> <td>    0.031</td> <td>    0.349</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 0.853</td> <th>  Durbin-Watson:     </th> <td>   1.781</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.653</td> <th>  Jarque-Bera (JB):  </th> <td>   0.368</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.317</td> <th>  Prob(JB):          </th> <td>   0.832</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 2.987</td> <th>  Cond. No.          </th> <td>1.36e+03</td>\n",
       "</tr>\n",
       "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 1.36e+03. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:             difference   R-squared:                       0.546\n",
       "Model:                            OLS   Adj. R-squared:                  0.499\n",
       "Method:                 Least Squares   F-statistic:                     11.44\n",
       "Date:                Fri, 06 Dec 2019   Prob (F-statistic):           0.000548\n",
       "Time:                        22:34:31   Log-Likelihood:                -89.820\n",
       "No. Observations:                  22   AIC:                             185.6\n",
       "Df Residuals:                      19   BIC:                             188.9\n",
       "Df Model:                           2                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "const         -0.9992     20.227     -0.049      0.961     -43.335      41.336\n",
       "base          -0.5655      0.134     -4.226      0.000      -0.846      -0.285\n",
       "chemo          0.1898      0.076      2.500      0.022       0.031       0.349\n",
       "==============================================================================\n",
       "Omnibus:                        0.853   Durbin-Watson:                   1.781\n",
       "Prob(Omnibus):                  0.653   Jarque-Bera (JB):                0.368\n",
       "Skew:                           0.317   Prob(JB):                        0.832\n",
       "Kurtosis:                       2.987   Cond. No.                     1.36e+03\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "[2] The condition number is large, 1.36e+03. This might indicate that there are\n",
       "strong multicollinearity or other numerical problems.\n",
       "\"\"\""
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = h_data[['difference']]      # response\n",
    "x2 = h_data[['base', 'chemo']]  # predictors\n",
    "\n",
    "# specify that the model includes an intercept\n",
    "x2_with_int = sm.add_constant(x2) \n",
    "\n",
    "multiple_regression = sm.OLS(y, x2_with_int).fit()\n",
    "multiple_regression.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ignore the scary warning above. There isn't strong multicollinearity (predictor variables being highly correlated with each other) nor other serious issues.\n",
    "\n",
    "Just focus on the middle block of the output. It's just like the middle block of the simple regression output, with one more line corresponding to `chemo`.\n",
    "\n",
    "All of the values in the block are interpreted in the same way as before. The only change is in the degrees of freedom: because you are estimating one more parameter, the degrees of freedom have dropped by $1$, and are thus $19$ instead of $20$.\n",
    "\n",
    "For example, the 95% confidence interval for the slope of `chemo` is calculated as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.03073017186497201, 0.348869828135028)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_95_df19 = stats.t.ppf(0.975, 19)\n",
    "\n",
    "0.1898 - t_95_df19*0.076, 0.1898 + t_95_df19*0.076"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, take a look at the value of `R-squared` in the very top line. It is $0.546$ compared to $0.397$ for the simple regression. It's a math fact that the more predictor variables you use, the higher the `R-squared` value will be. This tempts people into using lots of predictors, whether or not the resulting model is comprehensible.\n",
    "\n",
    "With an \"everything as well as the kitchen sink\" approach to selecting predictor variables, a researcher might be inclined to use all the possible predictors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>       <td>difference</td>    <th>  R-squared:         </th> <td>   0.550</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.444</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   5.185</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Fri, 06 Dec 2019</td> <th>  Prob (F-statistic):</th>  <td>0.00645</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>22:50:02</td>     <th>  Log-Likelihood:    </th> <td> -89.741</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>    22</td>      <th>  AIC:               </th> <td>   189.5</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>    17</td>      <th>  BIC:               </th> <td>   194.9</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     4</td>      <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "     <td></td>       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>const</th>  <td>   33.5226</td> <td>  101.061</td> <td>    0.332</td> <td> 0.744</td> <td> -179.698</td> <td>  246.743</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>base</th>   <td>   -0.5393</td> <td>    0.160</td> <td>   -3.378</td> <td> 0.004</td> <td>   -0.876</td> <td>   -0.202</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>chemo</th>  <td>    0.2124</td> <td>    0.103</td> <td>    2.053</td> <td> 0.056</td> <td>   -0.006</td> <td>    0.431</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>rad</th>    <td>   -0.0062</td> <td>    0.031</td> <td>   -0.203</td> <td> 0.841</td> <td>   -0.071</td> <td>    0.059</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>height</th> <td>   -0.2274</td> <td>    0.658</td> <td>   -0.346</td> <td> 0.734</td> <td>   -1.615</td> <td>    1.160</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 0.589</td> <th>  Durbin-Watson:     </th> <td>   1.812</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.745</td> <th>  Jarque-Bera (JB):  </th> <td>   0.321</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.286</td> <th>  Prob(JB):          </th> <td>   0.852</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 2.851</td> <th>  Cond. No.          </th> <td>1.46e+04</td>\n",
       "</tr>\n",
       "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 1.46e+04. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:             difference   R-squared:                       0.550\n",
       "Model:                            OLS   Adj. R-squared:                  0.444\n",
       "Method:                 Least Squares   F-statistic:                     5.185\n",
       "Date:                Fri, 06 Dec 2019   Prob (F-statistic):            0.00645\n",
       "Time:                        22:50:02   Log-Likelihood:                -89.741\n",
       "No. Observations:                  22   AIC:                             189.5\n",
       "Df Residuals:                      17   BIC:                             194.9\n",
       "Df Model:                           4                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "const         33.5226    101.061      0.332      0.744    -179.698     246.743\n",
       "base          -0.5393      0.160     -3.378      0.004      -0.876      -0.202\n",
       "chemo          0.2124      0.103      2.053      0.056      -0.006       0.431\n",
       "rad           -0.0062      0.031     -0.203      0.841      -0.071       0.059\n",
       "height        -0.2274      0.658     -0.346      0.734      -1.615       1.160\n",
       "==============================================================================\n",
       "Omnibus:                        0.589   Durbin-Watson:                   1.812\n",
       "Prob(Omnibus):                  0.745   Jarque-Bera (JB):                0.321\n",
       "Skew:                           0.286   Prob(JB):                        0.852\n",
       "Kurtosis:                       2.851   Cond. No.                     1.46e+04\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "[2] The condition number is large, 1.46e+04. This might indicate that there are\n",
       "strong multicollinearity or other numerical problems.\n",
       "\"\"\""
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = h_data[['difference']]      # response\n",
    "x3 = h_data[['base', 'chemo', 'rad', 'height']]  # predictors\n",
    "\n",
    "# specify that the model includes an intercept\n",
    "x3_with_int = sm.add_constant(x3) \n",
    "\n",
    "bad_regression = sm.OLS(y, x3_with_int).fit()\n",
    "bad_regression.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is not a good idea. We end up with a far more complicated model for no appreciable gain in `R-squared`. The \"adjusted $R^2$\" penalizes us for using more predictor variables: notice that the value of `Adj. R-squared` is smaller for the regression with all the predictors than for the regression with just `base` and `chemo`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Curious about how to select predictors, or about what makes a good regression? Then take some more statistics classes! This one is complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Edit Metadata",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}