{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# DSM on SEER Dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The SEER dataset originally comes from https://seer.cancer.gov/.\n", "We selected a subset of features and columns from \"Incidence - SEER Research Plus Data, 18 Registries, Nov 2019 Sub (2000-2017) - Linked To County Attributes - Total U.S., 1969-2018 Counties, National Cancer Institute, DCCPS, Surveillance Research Program, released April 2020, based on the November 2019 submission.\" for breast cancer, with the Registry Type included as an additional feature.\n", "\n", "In this notebook, we will apply Deep Survival Machines for survival prediction on the SEER data." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Load the SEER Dataset\n", "\n", "The package includes helper functions to load the dataset.\n", "\n", "X represents an np.array of features (covariates),\n", "T is the event/censoring times, and\n", "E is the censoring indicator."
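, "\n", "\n", "As a toy illustration (synthetic values, not SEER data; the names mirror the description above), the three objects have these shapes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Synthetic stand-ins: 5 subjects, 3 covariates.\n", "X = np.random.randn(5, 3)           # features (covariates)\n", "T = np.array([5., 1., 3., 8., 2.])  # event/censoring times\n", "E = np.array([1, 0, 1, 1, 1])       # 1 = event observed, 0 = censored\n", "\n", "print(X.shape, T.shape, E.shape)  # (5, 3) (5,) (5,)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ""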
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "sys.path.append('../')\n", "\n", "import pandas as pd\n", "\n", "def load_seerER(csv_file):\n", "    \"\"\"Helper function to load and preprocess the SEER dataset with RegistryType.\"\"\"\n", "\n", "    data = pd.read_csv(csv_file, low_memory=False)\n", "\n", "    # Drop rows with an unknown ('U') tumor grade.\n", "    data.drop(data[data['Grade'] == 'U'].index, axis=0, inplace=True)\n", "\n", "    outcomes = data.copy()\n", "    outcomes['event'] = data['death']\n", "    outcomes['time'] = data['d.time']\n", "    outcomes = outcomes[['event', 'time']]\n", "\n", "    cat_feats = ['Sex', 'Marital', 'HisCdGrp',\n", "                 'Laterality', 'RadCd', 'ChemCd', 'RegistryT']\n", "    num_feats = ['Age', 'TNMalTumor', 'TNBenTumor', 'ERMean',\n", "                 'PSite', 'HisTICD03', 'Year', 'Grade']\n", "\n", "    return outcomes, data[cat_feats + num_feats]\n", "\n", "outcomes, features = load_seerER('datasets/Dataset.csv')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from auton_survival.preprocessing import Preprocessor\n", "\n", "cat_feats = ['Sex', 'Marital', 'HisCdGrp',\n", "             'Laterality', 'RadCd', 'ChemCd', 'RegistryT']\n", "num_feats = ['Age', 'TNMalTumor', 'TNBenTumor', 'ERMean',\n", "             'PSite', 'HisTICD03', 'Year', 'Grade']\n", "\n", "features = Preprocessor().fit_transform(features, cat_feats=cat_feats, num_feats=num_feats)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Compute horizons at which we evaluate the performance of DSM\n", "\n", "Survival predictions are issued at certain time horizons. Here we will evaluate the performance\n", "of DSM to issue predictions at the 25th, 50th and 75th event time quantiles, as is standard practice in survival analysis."
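, "\n", "\n", "Note that the horizons are computed from the *uncensored* event times only, so long censored follow-up times do not shift them. A toy example with synthetic times (not SEER data):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "toy_t = np.array([1., 2., 3., 4., 100.])\n", "toy_e = np.array([1, 1, 1, 1, 0])  # the 100. is a censoring time\n", "\n", "# Only uncensored times enter the quantile computation.\n", "print(np.quantile(toy_t[toy_e == 1], [0.25, 0.5, 0.75]))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ""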
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "horizons = [0.25, 0.5, 0.75]\n", "\n", "# Horizons are quantiles of the uncensored (event == 1) survival times.\n", "times = np.quantile(outcomes.time[outcomes.event == 1], horizons).tolist()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting the data into train, test and validation sets\n", "\n", "We will train DSM on 70% of the data, use a validation set of 10% for model selection, and report performance on the remaining 20% held-out test set." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "x, t, e = features.values, outcomes.time.values, outcomes.event.values\n", "\n", "n = len(x)\n", "\n", "tr_size = int(n*0.70)\n", "vl_size = int(n*0.10)\n", "te_size = int(n*0.20)\n", "\n", "x_train, x_test, x_val = x[:tr_size], x[-te_size:], x[tr_size:tr_size+vl_size]\n", "t_train, t_test, t_val = t[:tr_size], t[-te_size:], t[tr_size:tr_size+vl_size]\n", "e_train, e_test, e_val = e[:tr_size], e[-te_size:], e[tr_size:tr_size+vl_size]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Setting the parameter grid\n", "\n", "Let's set up the parameter grid to tune hyper-parameters. We will tune the number of underlying survival distributions\n", "($K$), the distribution choice (Log-Normal or Weibull), the learning rate for the Adam optimizer ($1\\times10^{-3}$ or $1\\times10^{-4}$) and the number of hidden layers ($0$, $1$ or $2$)."
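, "\n", "\n", "Since the grid is exhaustive, the search cost is the product of the option counts: $3 \\times 2 \\times 2 \\times 3 = 36$ configurations. A quick sanity check:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import ParameterGrid\n", "\n", "toy_grid = {'k': [3, 4, 6],\n", "            'distribution': ['LogNormal', 'Weibull'],\n", "            'learning_rate': [1e-4, 1e-3],\n", "            'layers': [[], [100], [100, 100]]}\n", "\n", "print(len(ParameterGrid(toy_grid)))  # 36" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ""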
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import ParameterGrid" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "param_grid = {'k': [3, 4, 6],\n", "              'distribution': ['LogNormal', 'Weibull'],\n", "              'learning_rate': [1e-4, 1e-3],\n", "              'layers': [[], [100], [100, 100]]\n", "              }\n", "params = ParameterGrid(param_grid)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Model Training and Selection" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from auton_survival.models.dsm import DeepSurvivalMachines\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models = []\n", "for param in params:\n", "    print(param)\n", "    model = DeepSurvivalMachines(k=param['k'],\n", "                                 distribution=param['distribution'],\n", "                                 layers=param['layers'])\n", "    # The fit method is called to train the model.\n", "    model.fit(x_train, t_train, e_train, iters=100,\n", "              learning_rate=param['learning_rate'])\n", "    # Record the validation negative log-likelihood for model selection.\n", "    models.append((model.compute_nll(x_val, t_val, e_val), model))\n", "\n", "# Select the model with the lowest validation NLL.\n", "model = min(models, key=lambda pair: pair[0])[1]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Inference" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "out_risk = model.predict_risk(x_test, times)\n", "out_survival = model.predict_survival(x_test, times)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation\n", "\n", "We evaluate the performance of DSM in terms of discriminative ability (the time-dependent concordance index and cumulative dynamic AUC) as well as the Brier score."
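, "\n", "\n", "For intuition: without censoring, the Brier score at a horizon is simply the mean squared difference between the predicted survival probability and the observed event-free status (the IPCW-weighted version from sksurv, used below, additionally corrects for censoring). A censoring-free toy example with made-up probabilities:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Predicted P(T > t) for four subjects, and whether each was event-free at t.\n", "surv_prob = np.array([0.9, 0.8, 0.2, 0.1])\n", "alive_at_t = np.array([1., 1., 0., 0.])\n", "\n", "print(round(float(np.mean((surv_prob - alive_at_t) ** 2)), 3))  # 0.025" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ""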
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sksurv.metrics import concordance_index_ipcw, brier_score, cumulative_dynamic_auc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cis = []\n", "brs = []\n", "\n", "# sksurv expects structured arrays of (event indicator, time).\n", "et_train = np.array([(e_train[i], t_train[i]) for i in range(len(e_train))],\n", "                    dtype=[('e', bool), ('t', float)])\n", "et_test = np.array([(e_test[i], t_test[i]) for i in range(len(e_test))],\n", "                   dtype=[('e', bool), ('t', float)])\n", "\n", "for i, _ in enumerate(times):\n", "    cis.append(concordance_index_ipcw(et_train, et_test, out_risk[:, i], times[i])[0])\n", "\n", "brs.append(brier_score(et_train, et_test, out_survival, times)[1])\n", "\n", "roc_auc = []\n", "for i, _ in enumerate(times):\n", "    roc_auc.append(cumulative_dynamic_auc(et_train, et_test, out_risk[:, i], times[i])[0])\n", "\n", "for i, horizon in enumerate(horizons):\n", "    print(f\"For the {horizon} quantile:\")\n", "    print(\"TD Concordance Index:\", cis[i])\n", "    print(\"Brier Score:\", brs[0][i])\n", "    print(\"ROC AUC:\", roc_auc[i][0], \"\\n\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 4 }