\documentclass[11pt]{article}
\usepackage[pdftex]{graphicx}
\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{xcolor}
\usepackage{chicago}
\usepackage[colorlinks = true, linkcolor = blue, breaklinks,
            citecolor = blue]{hyperref}
\usepackage{tikz}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
\SweaveOpts{keep.source=TRUE, fig=FALSE}
%\VignetteIndexEntry{intro}
%\VignetteDepends{rpart}
%\VignetteDepends{parallel}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\SweaveOpts{engine=R, eps=FALSE, pdf=TRUE, width=7, height=4.5}
\newcommand{\myfig}[1]{\resizebox{\textwidth}{!}{\includegraphics{#1.pdf}}}
\def\tree{\texttt{tree}}
\def\GPLTR{\texttt{GPLTR}}
\def\splus{S-Plus}
\newcommand{\Co}[1]{\texttt{#1}}
\title{An Introduction to the GPLTR package}
\author{Cyprien Mbogning\\ Inserm UNIT 1178, France}
\date{\today}
\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\tableofcontents
<<echo = FALSE>>=
options(continue = " ", width = 60)
options(SweaveHooks = list(fig = function() par(mar = c(4.1, 4.1, 0.8, 1.1))))
pdf.options(pointsize = 10)
par(xpd = NA) # stop clipping
library(GPLTR)
@

\section{Introduction}
This document gives a short overview of the methods implemented in the
{\GPLTR} package. The acronym {\GPLTR} stands for ``Generalized Partially
Linear Tree-based Regression model''. The {\GPLTR} programs build
classification or regression models of a very general structure, using a
three-stage procedure with several additional tools (see
\shortcite{mbogningpltr2014}); the resulting model is a hybrid model
combining a generalized linear model with an additional tree part on the
same scale. The model was first proposed by \shortcite{chenPLTR2007} for
genetic epidemiology studies, in order to assess complex joint gene-gene
and gene-environment effects while taking confounding variables into
account. In practice, GPLTR models form a class of semi-parametric
regression models that combines the advantages of generalized linear
regression and tree-structured models. To the best of our knowledge, no
other implemented package deals with this kind of model. The available
classical tree-based methods provide no way of controlling for
confounding factors outside the tree part: the final tree is generally a
mixture of confounders and explanatory variables, lacking a clear
interpretation and yielding a distorted joint effect. We also propose an
ensemble method, namely bagging (see \shortcite{mbogningbag2014}), to
address a classical concern with tree-based methods: their instability.
The bagged GPLTR comes with several variable importance scores for
prediction and variable selection in the framework of the GPLTR model.

\section{GPLTR model}
Denote by $\mathbf{Y}$ the outcome of interest, $\mathbf{X}$ a set of
confounding variables, and $\mathbf{G}$ the explanatory variables.
The model fitted inside the {\GPLTR} package is specified by
\begin{equation}
g\left( \mathbb{E}\left( \mathbf{Y}|\mathbf{X},\mathbf{G}\right) \right) =
\mathbf{X}'\theta +\beta _{T}F\left( T\left( \mathbf{G}\right) \right) ,
\label{pltr}
\end{equation}
where $g(\cdot)$ is a known link function (generalized linear model) and
$F\left( T\left( \mathbf{G}\right) \right)$ is a vector of indicator
variables representing the leaves of the tree $T\left( \mathbf{G}\right)$.
The variables considered in the linear part of model (\ref{pltr})
(confounding variables, or variables we wish to control for) have a
direct impact on the structure of the tree, from the split criterion to
the pruning procedure.

\section{Fitting methods}
The method used in this package to fit model (\ref{pltr}) can be
summarized in three major steps:
\begin{description}
\item[Step 1] Fit the linear part and build a maximal tree within an
iterative procedure that alternates between several offset terms. The
nodes of the tree are split by maximizing a deviance criterion, while an
intercept coefficient is fitted inside each node using the corresponding
GLM with the linear part treated as an offset.
\item[Step 2] To prune back the maximal tree obtained previously, use a
forward procedure to build a sequence of nested subtrees.
\item[Step 3] Select the optimal tree using either a BIC criterion, an
AIC criterion, or a K-fold cross-validation procedure on the underlying
GPLTR models corresponding to the sequence of nested trees. The original
parametric bootstrap test procedure proposed by Chen et al. is also
available.
\end{description}
We further propose a procedure to test the joint effect of the selected
tree while adjusting for confounders. Users are encouraged to read the
recent paper of \shortcite{mbogningpltr2014} for a more thorough
explanation of the model and the methods.
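To make the deviance criterion of Step 1 concrete, here is a minimal
sketch, in plain R, of how a candidate split could be scored for a binary
outcome with the linear part held fixed as an offset. This is not a
function of the package: \Co{y}, \Co{x\_offset} and \Co{split} are
hypothetical names for the outcome, the fitted linear part
$\mathbf{X}'\hat{\theta}$, and a candidate left/right assignment.
<<eval = FALSE>>=
## Minimal sketch of the Step 1 split criterion (hypothetical names):
## y is a 0/1 outcome, x_offset the fitted linear part, and split a
## logical vector sending each observation to the left child node.
deviance_gain <- function(y, x_offset, split) {
  ## intercept-only logistic GLMs with the linear part as offset
  d_parent <- deviance(glm(y ~ 1, offset = x_offset, family = binomial))
  d_left   <- deviance(glm(y[split] ~ 1, offset = x_offset[split],
                           family = binomial))
  d_right  <- deviance(glm(y[!split] ~ 1, offset = x_offset[!split],
                           family = binomial))
  ## the retained split maximizes this deviance improvement
  d_parent - (d_left + d_right)
}
@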
Table \ref{tabdesc1} gives a brief summary of the main functions
available in the {\GPLTR} package. A more complete description is
available in the package documentation.
\begin{table}[!h]
\caption{Main functions in the {\GPLTR} package}
\label{tabdesc1}
\begin{tabular}{ll}
\hline
Function & Description \\
\hline
pltr.glm & Fit an unpruned PLTR model \\
 & \\
best.tree.BIC.AIC & Prune back the unpruned PLTR model with a BIC or AIC criterion \\
 & \\
best.tree.CV & Prune back the unpruned PLTR model with a CV procedure \\
 & \\
best.tree.bootstrap & Prune back the unpruned PLTR model with a parametric bootstrap procedure \\
 & \\
p.val.tree & Test the joint effect of the selected tree part while adjusting for confounders \\
 & \\
bagging.pltr & Aggregate a set of PLTR models \\
 & \\
predict{\_}bagg.pltr & Predict new features with the bagged PLTR predictor \\
 & \\
VIMPBAG & Compute several variable importance scores from the bagged PLTR predictor \\
\hline
\end{tabular}
\end{table}

\section{Illustration via several examples}
In the following, we present the results obtained on the publicly
available ``burn'' data set (times to infection for burn patients, from
the book of Klein and Moeschberger \cite{kleinSURV2003}). This data set
comes from a study \shortcite{ichidaBURN1993} evaluating a change of
protocol in disinfectant practices for a cohort of $154$ patients. A
complete description of the data is also available in the {\GPLTR}
package documentation.

In this example, the dependent variable is the administration of
prophylactic antibiotic treatment ($D2$: yes/no), the confounding
variable is gender ($Z2$: male/female), and the potential explanatory
variables are: ethnicity, severity of the burn as measured by the
percentage of total body surface area burned, burn site (head, buttocks,
trunk, upper legs, lower legs, respiratory tract), and type of burn
(chemical, scald, electric, flame). In this analysis, we included gender
as a confounding factor since this factor has already been described as
related to infections among burn patients \shortcite{wisplinghoffRISK1999}.
Such an adjustment for confounders cannot be performed within the
classical CART framework.

\begin{flushleft}
\textbf{Results obtained with the classical CART algorithm}
\end{flushleft}
First of all, we fit a classical tree model for the dependent variable
$D2$, using the CART algorithm \shortcite{breimanCART1984} via the
\Co{rpart} routines \cite{therneaurpart2013} of the R software:
<<cart, fig = TRUE>>=
data(burn)
head(burn, n = 10)
rpart.burn <- rpart(D2 ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7 + Z8 +
                      Z9 + Z10 + Z11, data = burn, method = "class")
print(rpart.burn)
# par(mar = rep(0.1, 4))
plot(rpart.burn)
text(rpart.burn, xpd = TRUE, cex = .6, use.n = TRUE)
@
Figure \ref{cartt} shows the tree obtained with the classical CART
algorithm.
\begin{figure}
\myfig{intro-cart}
\caption{Tree obtained on the burn data set with \Co{rpart}.}
\label{cartt}
\end{figure}
The tree is built by the following process: first, the single variable
that best splits the data into two groups is found (via the Gini index
in this example). The data are then separated, and the process is
applied recursively to each subgroup until the subgroups either reach a
minimum size or no further improvement can be made.

\begin{flushleft}
\textbf{Results obtained with our proposed method within the {\GPLTR}
package}
\end{flushleft}
A logistic partially linear tree-based regression model is fitted using
our proposed method:
<<gpltr, fig = TRUE>>=
## use example(GPLTR) for a brief overview of the contents of the package.
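## In the call below, args.rpart is a list of rpart control options
## (see ?rpart.control): cp = 0 deliberately grows a large tree, since
## pruning is performed afterwards (Steps 2 and 3 of the procedure),
## while minbucket and maxdepth bound the size of the maximal tree.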
## fit the PLTR model after adjusting on gender (Z2) using the proposed method
## ?GPLTR or ?pltr.glm to access the help section

## setting the parameters
args.rpart <- list(minbucket = 10, maxdepth = 4, cp = 0,
                   maxcompete = 0, maxsurrogate = 0)
family <- "binomial"
X.names = "Z2"
Y.name = "D2"
G.names = c('Z1','Z3','Z4','Z5','Z6','Z7','Z8','Z9','Z10','Z11')

## Build the maximal tree with an adjustment on gender (Z2)
pltr.burn <- pltr.glm(burn, Y.name, X.names, G.names,
                      args.rpart = args.rpart, family = family,
                      iterMax = 8, iterMin = 6, verbose = TRUE)

## Prune back the maximal tree using either the BIC or the AIC criterion
pltr.burn_prun <- best.tree.BIC.AIC(xtree = pltr.burn$tree, burn, Y.name,
                                    X.names, family = family,
                                    verbose = FALSE)

## Summary of the tree selected by the BIC criterion
summary(pltr.burn_prun$tree$BIC)

## Summary of the final selected pltr model
summary(pltr.burn_prun$fit_glm$BIC)

## Plot the maximal tree and the BIC-pruned tree
par(mfrow = c(2, 1))
plot(pltr.burn$tree, main = '', margin = 0.05)
text(pltr.burn$tree, xpd = TRUE, cex = .4, col = 'blue')
plot(pltr.burn_prun$tree$BIC, branch = .5, main = '', margin = 0.05)
text(pltr.burn_prun$tree$BIC, xpd = TRUE, cex = .4, col = 'blue')
@
\begin{figure}
\myfig{intro-gpltr}
\caption{Top: the maximal tree obtained with \Co{pltr.glm}. Bottom: the
tree pruned via the BIC criterion.}
\label{pltrr}
\end{figure}
\begin{itemize}
\item The method behind the \Co{'binomial'} family above is a new one,
different from those implemented in the \Co{rpart} package: the
splitting criterion is based on a logistic deviance with the linear part
treated as an offset \shortcite{mbogningpltr2014}.
\item The child nodes of node $x$ are always $2x$ and $2x+1$, which
helps in navigating the tree summary (compare the summary with the
bottom panel of Figure \ref{pltrr}); see also the short sketch after
this list.
\item There are many items in the tree summary:
\begin{itemize}
\item the complexity table;
\item the variable importance;
\item the node number;
\item the number of cases within the node;
\item the number of events (cases with attribute 1) inside the node;
\item the logistic intercept coefficient fitted inside the node, which
is the summary statistic of the node and represents its predicted value.
This is the main difference from a conventional tree, where the
predicted value is the modal class of the node;
\item the logistic deviance of the above model inside the node, which is
used as the splitting criterion.
\end{itemize}
\item * indicates that the node is terminal.
\item The first split is on the percentage of total body surface area
burned ($Z_4$): $64$ individuals with $Z_4 < 15.5$ go to the left while
the remaining $90$ go to the right. The split with the maximum number of
events is always on the right.
\item The improvement listed is the change in deviance for the split,
i.e., $D(\text{parent}) - \left(D(\text{left son}) + D(\text{right
son})\right)$, where $D$ is the deviance. This is similar to a
likelihood ratio test statistic.
\item The other nodes can be described similarly.
\end{itemize}
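As an aside, the $2x$/$2x+1$ numbering rule can be checked directly on
the fitted tree, whose \Co{frame} row names are the node numbers. This
is a small sketch; the helper \Co{children} below is not part of the
package.
<<eval = FALSE>>=
## Node numbers of the BIC-pruned tree (row names of the rpart frame)
node_ids <- as.integer(rownames(pltr.burn_prun$tree$BIC$frame))
## hypothetical helper: the children of node x are 2x and 2x + 1
children <- function(x) c(2 * x, 2 * x + 1)
children(1)                      # 2 3: the two children of the root
intersect(children(2), node_ids) # children of node 2 present in the tree
@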
For both models (the rpart tree of Figure \ref{cartt} and the tree from
our proposed procedure in Figure \ref{pltrr}), the first split is on the
percentage of total body surface area burned ($< 16.5\%$ vs.\
$\geq 16.5\%$ for \Co{rpart}, and $< 15.5\%$ vs.\ $\geq 15.5\%$ for the
proposed method). The subsequent splits differ between patients with a
high or a low percentage of total body surface area burned.
The classical rpart model shows only one split, whereas our proposed
PLTR model shows two splits. Except in the CART model, where no further
split occurs, the subsequent split for the group of patients with a low
percentage of area burned is on the initial treatment (routine
bathing/body cleansing). For a high percentage of body surface area
burned, the split in both models is on respiratory tract damage. Our
proposed procedure identifies further splits on the treatment and on
buttock burns. In particular, we observe that the group of patients with
a high percentage of body surface area burned, without respiratory tract
damage, without buttock injury and without routine bathing shows a lower
proportion of prophylactic antibiotic administration. This group is not
detected by the original PLTR method. It is worth noting that the
confounding factor $Z2$ is significant, with a higher proportion of
prophylactic antibiotic administration among women.

The particularity of the PLTR model is that, in addition to the tree
part, the final model is a classical logistic model with new risk
factors emerging from the tree part. The summary of the final logistic
model is presented in the R output above. We can see, for example, that
the new risk factor formed by the individuals sharing the attributes
$Z4 < 15.5$ and $Z1 < 0.5$ is highly significant with respect to the
reference leaf (although this naive p-value does not account for the
tree-selection step). A similar interpretation can be made for the other
factors.

\section{Compute the generalization error of the procedure}
We can further compute the generalization error of the procedure via the
function \Co{best.tree.CV}, which can also return the best tree selected
by a K-fold cross-validation procedure.
<<>>=
set.seed(150)
pltr.burn_CV <- best.tree.CV(pltr.burn$tree, burn, Y.name, X.names,
                             G.names, family = family,
                             args.rpart = args.rpart, epsi = 0.001,
                             iterMax = 15, iterMin = 8, ncv = 10,
                             verbose = FALSE)
pltr.burn_CV$CV_ERROR
Bic_size <- sum(pltr.burn_prun$tree$BIC$frame$var == '<leaf>')
## Bic_size <- tree_select$best_index[[1]]
CV_ERROR_BIC <- pltr.burn_CV$CV_ERROR[[2]][Bic_size]
CV_ERROR_BIC
@
The previous R code computes the generalization error of the procedure
when the BIC criterion is used.

\section{Test the joint effect of the selected tree while adjusting for
confounders}
We can also test the joint effect of the selected tree after adjusting
for the confounding variable:
<<eval = FALSE>>=
## Use only one worker on a Windows platform.
args.parallel = list(numWorkers = 10, type = "PSOCK")
index = Bic_size
p_value <- p.val.tree(xtree = pltr.burn$tree, burn, Y.name, X.names,
                      G.names, B = 1000, args.rpart = args.rpart,
                      epsi = 1e-3, iterMax = 15, iterMin = 8,
                      family = family, LB = FALSE,
                      args.parallel = args.parallel, index = index,
                      verbose = FALSE)
p_value$P.value
@

\section{Bagging a set of PLTR models}
The high variability of the tree part of the PLTR model can result in an
unstable selected model. An ensemble method such as bagging can
stabilize the predictor. Users are encouraged to read the paper of
\shortcite{mbogningbag2014} for a more thorough explanation of the
procedure.
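The principle, sketched generically below before we turn to the package
functions, is to fit one PLTR model on each of $B$ bootstrap samples of
the data and to aggregate the $B$ predictions. This is not the package
implementation: \Co{fit\_fun} and \Co{predict\_fun} are hypothetical
user-supplied fitting and prediction functions.
<<eval = FALSE>>=
## Generic sketch of bagging (hypothetical fit_fun / predict_fun):
bagging_sketch <- function(data, fit_fun, predict_fun, B = 20) {
  models <- lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ] # bootstrap sample
    fit_fun(boot)                                      # one PLTR fit
  })
  ## return a predictor averaging the B individual predictions
  function(newdata) {
    preds <- sapply(models, predict_fun, newdata = newdata)
    rowMeans(preds)
  }
}
@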
A flowchart of the bagging procedure related to the PLTR model is as
follows:
\begin{center}
\begin{tikzpicture}[thick, scale=0.7, every node/.style={scale=0.7}]
\draw (0,0) node{$\mathcal{A}_n $};
\draw[->] (-1,-0.5)-- (-6,-5);
\draw[->] (-0.5,-0.5)-- (-3,-5);
\draw[->] (1,-0.5)-- (5.5,-5);
\draw (-6.5,-5.5) node{$\mathcal{A}_n^{*1} $};
\draw (-3.5,-5.5) node{$\mathcal{A}_n^{*2} $};
\draw (6,-5.5) node{$\mathcal{A}_n^{*B} $};
\draw[dotted] (-2.5,-5.5) -- (5,-5.5);
\draw (-4,-2.5) node[left]{Bootstrap};
\draw[->] (-6.5,-6) -- (-6.5,-10.5);
\draw[->] (-3.5,-6)-- (-3.5,-10.5);
\draw[->] (6,-6)-- (6,-10.5);
\draw (-6.75,-8.25) node[left]{PLTR};
\draw (-6.5,-11) node{$\text{pltr}_1 $};
\draw (-3.5,-11) node{$\text{pltr}_2 $};
\draw (6,-11) node{$\text{pltr}_B $};
\draw[dotted] (-2.5,-11) -- (5,-11);
\draw[->] (-6,-11.5) -- (-1,-16) ;
\draw[->] (-3,-11.5) -- (-0.5,-16) ;
\draw[->] (5.5,-11.5) -- (1,-16) ;
\draw (0,-16.5) node{BagPLTR};
\end{tikzpicture}
\end{center}
<<treesbag, fig = TRUE>>=
## ?bagging.pltr
set.seed(250)
Bag.burn <- bagging.pltr(burn, Y.name, X.names, G.names, family,
                         args.rpart, epsi = 0.01, iterMax = 4,
                         iterMin = 3, Bag = 20, verbose = FALSE,
                         doprune = FALSE)
## The threshold values used
Bag.burn$CUT
## The set of PLTR models in the bagging procedure
PLTR_BAG.burn <- Bag.burn$Glm_BAG
## The set of trees in the bagging procedure
TREE_BAG.burn <- Bag.burn$Tree_BAG
## Look at the variability of the trees in the bagging sequence
par(mfrow = c(3, 2))
for (i in 1:6) {
  plot(TREE_BAG.burn[[i]], branch = .5, main = '', margin = 0.05)
  text(TREE_BAG.burn[[i]], xpd = TRUE, cex = .6)
}
@
\begin{figure}
\myfig{intro-treesbag}
\caption{The first six trees within the bagging predictor: the trees
vary with the bootstrap sample drawn during the resampling step.}
\label{bagtree}
\end{figure}

\section{Prediction using the bagged PLTR predictor}
<<>>=
## Use the bagging procedure to predict new features
## ?predict_bagg.pltr
Pred_Bag.burn <- predict_bagg.pltr(Bag.burn, Y.name, newdata = burn,
                                   type = "response",
                                   thresshold = seq(0, 1, by = 0.1))
## The confusion matrix for each threshold value, using the majority vote
Pred_Bag.burn$CONF1
## The prediction error for each threshold value
Pred_err.burn <- Pred_Bag.burn$PRED_ERROR1
Pred_err.burn
@
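For a given threshold, the majority vote used above amounts to the
following sketch, where \Co{prob\_matrix} is a hypothetical $n \times B$
matrix of predicted probabilities from the $B$ bagged models (an
illustration, not the package code):
<<eval = FALSE>>=
## Sketch of the majority vote over the B bagged models:
## each model votes 1 when its predicted probability exceeds the
## threshold; the bagged class is the majority of the B votes.
majority_vote <- function(prob_matrix, threshold = 0.5) {
  votes <- prob_matrix > threshold  # n x B matrix of 0/1 votes
  as.integer(rowMeans(votes) > 0.5) # majority over the models
}
@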
\section{Compute the variable importances of the bagging procedure}
Several variable importance scores are proposed
\shortcite{mbogningbag2014}. Among them:
\begin{itemize}
\item the Permutation Importance Score (PIS);
\item the Deviance Importance Score (DIS);
\item the Depth Deviance Importance Score (DDIS);
\item the minimal depth score.
\end{itemize}
<<varimp, fig = TRUE>>=
Var_Imp_BAG.burn <- VIMPBAG(Bag.burn, burn, Y.name)
## Importance scores using the permutation method, for each threshold value
Var_Imp_BAG.burn$PIS
par(mfrow = c(1, 3))
barplot(Var_Imp_BAG.burn$PIS$CUT5, main = 'PIS', horiz = TRUE, las = 1,
        cex.names = .8, col = 'lightblue')
barplot(Var_Imp_BAG.burn$DIS, main = 'DIS', horiz = TRUE, las = 1,
        cex.names = .8, col = 'grey')
barplot(Var_Imp_BAG.burn$DDIS, main = 'DDIS', horiz = TRUE, las = 1,
        cex.names = .8, col = 'purple')
@
\begin{figure}
\myfig{intro-varimp}
\caption{Variable importances from the bagging procedure on the burn
data set: the permutation importance score (PIS, left), the deviance
importance score (DIS, middle) and the depth deviance importance score
(DDIS, right).}
\label{varimpp}
\end{figure}

\section{Compute the AUC of the bagged predictor based on OOB samples}
<<rocoob, fig = TRUE>>=
auc_BAG_oob <- bag.aucoob(Bag.burn, burn, Y.name)
## AUC of the predictor on OOB samples
auc_BAG_oob$AUCOOB
## Plot the ROC curve of the predictor based on OOB samples
par(mfrow = c(1, 1))
plot(auc_BAG_oob$FPR, auc_BAG_oob$TPR, type = 'b', lty = 3,
     col = 'blue', xlab = 'false positive rate',
     ylab = 'true positive rate')
legend(0.7, 0.3, sprintf('%3.3f', auc_BAG_oob$AUCOOB), lty = c(1, 1),
       lwd = c(2.5, 2.5), col = 'blue', title = 'AUC')
@
\begin{figure}
\myfig{intro-rocoob}
\caption{ROC curve, with AUC, of the bagged predictor computed on OOB
samples.}
\label{rocauc}
\end{figure}
\bibliographystyle{chicago}
\bibliography{Bibliopltr}
\end{document}