Survival Model and Attrition Analysis
n Background
q Conventional statistical methods are very successful at predicting which customers will have an event of interest within a given target time window. However, they are challenged by the question: for a given customer, when is the event of interest most likely to occur? Equivalently, how do we estimate the following survivor function S(•)?
Prob(event=‘Y’|time) = S(time, covariates)
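For reference, the quantities involved can be written in standard survival-analysis notation. This is a sketch under the usual textbook convention rather than notation taken from the slides: T is the (assumed continuous) time to event, x the covariate vector, f the density of T and h the hazard.

```latex
S(t \mid x) = \Pr(T > t \mid x), \qquad
h(t \mid x) = \frac{f(t \mid x)}{S(t \mid x)}, \qquad
S(t \mid x) = \exp\!\left(-\int_0^t h(u \mid x)\,du\right)
```

Under this convention the probability that the event has occurred by time t is 1 - S(t | x), so an estimate of S(•) answers the "when" question for every horizon at once.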
q The goal of this study is, through estimating S(•), to show:
Ø How to understand parametric and semi-parametric approaches
Ø How to employ parametric and semi-parametric approaches to estimate the survival function
Ø How to use SAS to conduct them
Ø How to evaluate the estimates
n What is Survival Analysis?
q Model time to event
q Unlike linear regression, survival analysis can have a dichotomous (binary) outcome
q Unlike logistic regression or decision trees, survival analysis analyzes the time to an event
q Why is that important?
Ø Able to account for censoring and time-dependent covariates
Ø Can compare survival between 2+ groups
Ø Assess relationship between covariates and survival time
Ø Capable of answering “who/when are most likely to have an event?”
n When to use Survival Analysis?
q Examples:
Ø Time to cancellation of products or services (attrition)
Ø Time to acquisition of add-on products or upgrades
Ø Re-deactivation rate after retention treatment
Ø etc.
q When one believes that 1+ explanatory variable(s) explain the differences in time to an event
q Especially when follow-up is incomplete
n Conventional modeling techniques have difficulty handling two common features of marketing data: censoring and time-dependent covariates
n Survival analysis encompasses a wide variety of methods for analyzing the timing of events
n As with the other statistics we have studied, we can do any of the following with survival analysis:
q Descriptive statistics
q Univariate statistics
q Multivariate statistics
n Descriptive statistics:
q How to describe lifetime?
Ø Mean or median of survival?
Ø What test would you use to compare statistics of survival between 2 cohorts?
q Average hazard rate
Ø Total # of failures divided by total observed survival time
Ø An incidence rate, with higher values indicating more events per unit of time
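As a small worked illustration of the average hazard rate (the numbers are hypothetical, not from the study data): let d_i = 1 if customer i has the event and 0 if the customer is censored, and let t_i be that customer's observed time in months.

```latex
\bar{h} = \frac{\sum_i d_i}{\sum_i t_i},
\qquad \text{e.g.}\quad
\frac{2 \text{ events}}{(10 + 20 + 30 + 40) \text{ months}} = 0.02 \text{ events per customer-month}
```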
n Univariate statistics:
q Univariate method: Kaplan-Meier survival curves:
Ø a.k.a. the product-limit formula
Ø Accounts for censoring
Ø Does not account for confounding or effect modification by other covariates
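The product-limit formula can be stated as follows. This is the standard textbook form, and the symbols are assumptions of this sketch: t_(1) < t_(2) < ... are the distinct observed event times, d_j is the number of events at t_(j), and n_j is the number of customers still at risk just before t_(j).

```latex
\hat{S}(t) = \prod_{j:\; t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right)
```

Censored customers drop out of the risk set n_j after their censoring time without contributing an event, which is how the estimator accounts for censoring.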
n Example (Kaplan-Meier curve): A plot of the Kaplan–Meier estimate of the survival function is a series of horizontal steps of declining magnitude which, given a large enough sample, approaches the true survival function for that population
(Figure: Kaplan-Meier survival curves for mcc_ind = 0 and mcc_ind = 1, annotated with the median of lifetime)
The median lifetime for mcc_ind = 0 is 120 months longer than that for mcc_ind = 1
Questions:
1. What are the median lifetimes of the 2 types of customers (mcc_ind=0 and 1)?
2. Are their survival distributions significantly different?
Test of Equality over Strata
Test         Chi-Square    Pr > Chi-Square
Log-Rank     243.7972      <.0001
Wilcoxon     254.0723      <.0001
-2Log(LR)    241.2043      <.0001
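Output of this kind can be produced with PROC LIFETEST. The sketch below is illustrative only: the data set name work.cust and the variables lifetime (in months) and event (0 = censored) are hypothetical, while mcc_ind is the grouping variable from the example.

```sas
ods graphics on;

proc lifetest data=work.cust method=km plots=survival;
   time lifetime*event(0);   /* 0 marks censored lifetimes */
   strata mcc_ind;           /* one K-M curve per group, plus tests of equality */
run;
```

The STRATA statement yields the Test of Equality over Strata table (Log-Rank, Wilcoxon and -2Log(LR)), and the quartile estimates printed for each stratum give the median lifetimes.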
n Comparing Multiple Kaplan-Meier curves
q Multiple pair-wise comparisons inflate the cumulative Type I error (the multiple comparison problem)
q Instead, compare all curves at once
Ø Analogous to using ANOVA to compare > 2 cohorts
Ø Then use judicious pair-wise testing
n Multivariate statistics:
n Limit of Kaplan-Meier Curves
q What happens when you have several covariates that you believe contribute to survival?
q Can use stratified K-M curves for more than 2 covariates
q Need another approach, a model with covariates, for many covariates
n Three Types of Survival Models
q If we model the survival time process without assuming a statistical distribution, this is called non-parametric survival analysis
q If we model the survival time process in a regression model and assume that a distribution applies to the error structure, we call this parametric survival analysis
q If we model the survival time process in a regression model and assume that proportional hazards hold, we call this semi-parametric survival analysis
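n Proportional Hazards Model
The model assumes that the effect of the covariates is to increase or decrease the hazard by a proportionate amount at all durations. Thus
$$h(t \mid x) = h_0(t)\,\exp(x'\beta)$$
where $h_0(t)$ is the baseline hazard and $\exp(x'\beta)$ is the relative risk associated with covariate vector x. So, integrating from 0 to t, the cumulative hazards satisfy
$$H(t \mid x) = H_0(t)\,\exp(x'\beta)$$
Then the survivor functions can be derived as
$$S(t \mid x) = S_0(t)^{\exp(x'\beta)}, \qquad S_0(t) = \exp(-H_0(t))$$
(Figure: parallel hazard functions from the proportional hazards model)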
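The "proportionate amount at all durations" can be made explicit by comparing two customers. A short derivation, assuming the usual form of the model with baseline hazard h_0(t) and coefficient vector β, and two hypothetical covariate vectors x_1 and x_2:

```latex
\frac{h(t \mid x_1)}{h(t \mid x_2)}
  = \frac{h_0(t)\,\exp(x_1'\beta)}{h_0(t)\,\exp(x_2'\beta)}
  = \exp\!\big((x_1 - x_2)'\beta\big)
```

The baseline hazard cancels, so the ratio does not depend on t. On a log scale the two hazard curves differ by a constant, which is why they plot as parallel.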
n Proportional Hazards Model Contd.
Two Common Tests for Examining the Proportional Hazards Assumption
q Test the interaction of covariates with time
Ø If the test shows that the interactions are significant, the covariate effects are time-dependent, which means the proportional hazards assumption is violated
q Conduct the Schoenfeld residuals test
Ø One popular assessment of proportional hazards is based on Schoenfeld residuals, which ought to show no association with time if proportionality holds. (Schoenfeld D. Partial residuals for the proportional hazards regression model. Biometrika, 1982, 69(1):239-241)
n Parametric Survival Model
q We consider briefly the analysis of survival data when one is willing to assume a parametric form for the distribution of survival time
q Survival distributions within the AFT (accelerated failure time) class are the Exponential, Weibull, Standard Gamma, Log-normal, Generalized Gamma and Log-logistic
q The AFT model describes a relationship between the survivor functions of any two individuals. If $S_i(t)$ is the survivor function for individual i, then for any other individual j, the AFT model holds that
$$S_i(t) = S_j(\phi_{ij}\, t) \quad \text{for all } t$$
where $\phi_{ij}$ is a constant that is specific to the pair (i,j). This model says, in effect, that what makes one individual different from another is the rate at which they age
n Parametric Survival Model Contd.
q Let T denote a continuous non-negative random variable representing survival time. Then a family of survival distributions can be expressed as follows:
$$\log T_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \sigma W$$
where W is a random disturbance term with a standard distribution on $(-\infty, \infty)$, and $\sigma$ and $\beta_0, \dots, \beta_k$ are parameters to be estimated
q A baseline hazard function may change over time
q A linear function of a set of k fixed covariates gives the relative risk when it is exponentiated
q The parametric approach produces estimates of parametric regression models with censored survival data using the method of maximum likelihood
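The log-linear form and the accelerated-failure-time property are two views of the same model. A sketch of the connection, where x_i'β is shorthand (introduced here) for the linear predictor including the intercept, σ and W are the scale and disturbance term of the log-linear equation, and S_0 denotes the survivor function of exp(σW):

```latex
S_i(t) = \Pr(T_i > t)
       = \Pr\!\big(e^{\sigma W} > t\,e^{-x_i'\beta}\big)
       = S_0\!\big(t\,e^{-x_i'\beta}\big),
\qquad
S_i(t) = S_j(\phi_{ij}\,t) \ \text{ with } \ \phi_{ij} = \exp\!\big((x_j - x_i)'\beta\big)
```

The covariates therefore stretch or compress the time axis rather than multiply the hazard.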
n Parametric Survival Model Contd.
q The relationships between the various distributions are shown below, where the direction of each arrow represents going from the general to a special case
(Figure: diagram of the relationships among the distributions; only the label "Loglogistic Distribution" survives)
n Goodness-of-Fit Tests
q There are three common statistics for model comparison
Ø Log-Likelihoods
Ø AIC
Ø Likelihood-Ratio Statistic
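For reference, the three statistics are related as follows. These are the standard definitions, stated here as an aid rather than taken from the slides: log L is the maximized log-likelihood, k the number of estimated parameters, and the likelihood-ratio comparison assumes that model 0 is nested in model 1 with q fewer free parameters.

```latex
\mathrm{AIC} = -2\log L + 2k,
\qquad
\mathrm{LR} = 2\,(\log L_1 - \log L_0) \;\sim\; \chi^2_q \ \ \text{(approximately, under model 0)}
```

A higher log-likelihood or a lower AIC indicates a better fit. AIC can compare non-nested distributions, while the likelihood-ratio statistic applies only to nested ones.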
n Goodness-of-Fit Tests
q Graphical Methods
Ø Exponential Distribution: the plot of $-\log S(t)$ versus t should yield a straight line through the origin
Ø Weibull Distribution: the plot of $\log[-\log S(t)]$ versus log t should be a straight line
Ø Log-Normal Distribution: the plot of $\Phi^{-1}(1 - S(t))$ versus log t should be a straight line, where $\Phi(\cdot)$ is the c.d.f. of the standard normal distribution
Ø Log-Logistic Distribution: the plot of $\log[(1 - S(t)) / S(t)]$ versus log t should be a straight line
q Cox-Snell Residuals Plot (Collett 1994)
Ø The Cox-Snell residual is defined as:
$$e_i = -\log S(t_i \mid x_i)$$
Ø where $t_i$ is the observed event time or censoring time for individual i, $x_i$ is the vector of covariate values for individual i, and S(t) is the estimated probability of surviving to time t based on the fitted model.
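The reason these residuals work as a diagnostic: if the fitted model is correct, S(T_i | x_i) is uniformly distributed on (0, 1), so e_i behaves like an observation from a unit exponential distribution (censored wherever t_i is censored). A sketch of the resulting check, where \hat{S}_{KM} denotes the Kaplan-Meier estimate computed from the residuals themselves (notation assumed for this illustration):

```latex
e_i \sim \mathrm{Exp}(1)
\ \Longrightarrow\ S_e(e) = \exp(-e)
\ \Longrightarrow\ -\log \hat{S}_{KM}(e) \approx e
```

A plot of -log Ŝ_KM(e) against e should therefore lie close to a straight line through the origin with slope 1.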
n Formulate the Business Problem
q Rank the current TD type-A customers by their likelihood of attrition at a given point in time within the next 12 months
n Time Framework
q From Dec2009 to Nov2010
n Population
q All customers who are open and active as of Oct2009, except seasonal accounts
q 10K eligible customers for modeling
q N customers are flagged as attritors according to the attrition definition
q m% overall attrition rate
n Target (involuntary attrition is excluded)
Note: All examples in this presentation are based on a fake dataset.
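n Model customer data with the Cox proportional hazards model using SAS as follows:
proc phreg data=TDM.smpl_typeA_attri_data;
   /* month = time to event; attrition = 0 marks censored records */
   model month*attrition(0) = var1-var31 / ties=efron;
   /* save survivor, log-survivor and log(-log survivor) estimates */
   baseline out=a survival=s logsurv=ls loglogs=lls;
run;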
n The syntax of the MODEL statement is: MODEL time < *censor ( list ) > = effects < /options > ;
n That is, our time scale is time since Oct2009 (measured in completed months).
n Lift Charts
The lift charts illustrate that the survival model performs better than logistic regression for modeling this attrition data
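n Conduct the Tests Using SAS
proc phreg data=TDM.smpl_typeA_attri_data;   /* interactions of covariates with time */
   model month*attrition(0) = var1-var31 tvar1-tvar31 / ties=efron;
   array v{31} var1-var31;   array tv{31} tvar1-tvar31;
   do k = 1 to 31;  tv{k} = v{k}*month;  end;   /* tvar_k = var_k * time, built inside PHREG */
run;
proc phreg data=TDM.smpl_typeA_attri_data;   /* Schoenfeld residuals of the main-effects model */
   model month*attrition(0) = var1-var31 / ties=efron;
   output out=b ressch=ressch1-ressch31; run;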
The test shows that most of the interactions of covariates with time are insignificant at the alpha=0.05 level (e.g. p=0.57 and 0.43 for var15*time and var29*time), but a couple of them are significant. For instance, p<.0001 for var13*time
n Schoenfeld Residuals Test
q As an example, the residuals for var15 have a fairly random scatter, and the OLS regression of the residuals on month gives a p-value of 0.5953. That indicates no significant trend exists.
q For var29, the regression of the residuals gives a p-value of 0.1847; the residual plot is not very informative, which is typical of graphs for dichotomous covariates
q The Schoenfeld residuals test shows no evidence of the proportional hazards assumption being violated for those variables
q For var13, there appears to be a slight tendency for the residuals to increase with time since entering the study. The p-value for var13 was 0.02, suggesting that there may be some departure from proportionality for that variable
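The residual checks described here can be reproduced along the following lines. This is a minimal sketch, assuming data set b was written by the OUTPUT statement with ressch=ressch1-ressch31 and still carries the month variable; PROC SGPLOT and PROC REG are one convenient choice, not necessarily the tools behind the reported p-values.

```sas
/* Schoenfeld residuals are defined only at event times,
   so they are missing for censored records in b. */
proc sgplot data=b;
   loess x=month y=ressch13;     /* residuals plotted with a smoothed trend */
run;

/* OLS regression of the residual on time: a slope that differs
   significantly from zero suggests non-proportional hazards. */
proc reg data=b;
   model ressch13 = month;
run;
quit;
```

Replacing ressch13 with ressch15 or ressch29 repeats the check for the other two covariates discussed.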
n Objectives
q The example shows how to develop a parametric survival model using SAS, based on TD type-B customer attrition data
q This analysis will help TD business units better understand attrition risk and attrition hazard by predicting “who will attrite” and, most importantly, “when will they attrite”
q The findings from this study can be used to optimize customer retention and/or treatment resources in TD attrition reduction efforts
n Attrition Definition
q TD type-B customer attrition is defined as a type-B customer account that is closed a certain number of days (at least 120 days) before maturity
q The attrition in this study refers only to customer-initiated attrition
n Exclusions
q Involuntary attritions are excluded
q All records with repeat attritions are excluded
q Mortgages closed within one month of opening are excluded
n Granularity
q This study examines type-B customer attrition at the account level
n Time Frame For Modeling:
q 01Jan2008 is the origin of time, and 31Aug2009 is the observation termination time
n Population
q Population For Modeling:
Ø All type-B customer accounts active as of 01Jan2008, except those that attrite involuntarily in the following months
Ø 200K type-B customer accounts are eligible for modeling
Ø M accounts are flagged as attritors
Ø n% average attrition rate over the 20-month study window (01Jan2008 to 31Aug2009)
n Attrition Hazard Function Estimation
q The purpose of estimation is to gain knowledge of the hazard characteristics, e.g., at what point in account tenure is the risk of attrition highest?
q The scatter plot of the estimated hazard shows that the shape of the hazard function resembles that of a Log-Logistic distribution. The highest risk of midterm attrition occurred around one and a half years of type-B product tenure
n Variables For Modeling
q There are 20 variables in the modeling dataset
q 11 categorical variables (X1 – X11) with 2 to 3 levels each
q 9 numeric variables (X12 – X20)
n Model Type Exploration
q The following scatter plot points to the Log-Logistic model. However, we’ll try multiple distributions and select the champion as the final model type
(Figure: Plot for Evaluating Log-Logistic Model; x-axis: LogTime)
n PROC LIFEREG -- Parametric Survival Model
proc lifereg data=TDM.smpl_typeB_attri_data;
model time*attrition(0)=&catvars &numvars/dist=&distr;
output out=a cdf=f;
run;
Notes:
1. &distr refers to Exponential, Weibull, LogNormal, LogLogistic, Gamma
2. The performance of the model under each distribution is evaluated by AIC and the Cox-Snell residuals plot
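One way to run the procedure once per candidate distribution is a small macro loop. The sketch below is illustrative: the macro name fit_all and its dists parameter are hypothetical, the keywords are the DIST= spellings PROC LIFEREG accepts for these five distributions, and the macro variables catvars and numvars are assumed to be defined beforehand, as in the call shown on the slide.

```sas
%macro fit_all(dists=exponential weibull lnormal llogistic gamma);
   %local i distr;
   %do i = 1 %to %sysfunc(countw(&dists));
      %let distr = %scan(&dists, &i);
      title "PROC LIFEREG fit with dist=&distr";
      proc lifereg data=TDM.smpl_typeB_attri_data;
         /* attrition = 0 marks censored accounts */
         model time*attrition(0) = &catvars &numvars / dist=&distr;
      run;
   %end;
   title;
%mend fit_all;

%fit_all()
```

Each run prints its log-likelihood and fit statistics, which can then be collected into a comparison table.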
n Evaluation of Model Specification
AIC and Log Likelihood By Model Distribution
Distribution    Log Likelihood    AIC
Exponential     -151746.90        303495.81
Weibull         -149094.71        298193.42
Lognormal       -148662.78        297329.56
Logistic        -252226.57        504457.13
LLogistic       -148331.69        296667.39
Gamma           -162677.70        325361.40
Gamma >298193
Champion: LLogistic (lowest AIC)
Note: The scale is 0.654503 for the champion model
n Cox-Snell Residuals Plot
q The following scatter plot demonstrates that the Log-Logistic model fits the type-B customer attrition data well
(Figure: Log SDF By Cox-Snell Residuals On Validation Data; x-axis: Cox-Snell Residuals)
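A plot of this kind can be built from the PROC LIFEREG output. The sketch below is an assumption-laden illustration rather than the code behind the slide: it uses data set a and the variable f from the OUTPUT statement (out=a cdf=f), and the names cs, km, e and neg_log_sdf are hypothetical. For a validation sample, the same steps would be applied to scored validation records.

```sas
/* Cox-Snell residual: e = -log S(t|x) = -log(1 - F(t|x)) */
data cs;
   set a;
   e = -log(1 - f);
run;

/* Kaplan-Meier estimate of the survivor function of the residuals,
   treating residuals of censored accounts as censored */
proc lifetest data=cs outsurv=km noprint;
   time e*attrition(0);
run;

data km;
   set km;
   if survival > 0 then neg_log_sdf = -log(survival);
run;

/* Under a well-fitting model the points follow the 45-degree line */
proc sgplot data=km;
   scatter x=e y=neg_log_sdf;
   lineparm x=0 y=0 slope=1;
run;
```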
n Model Performance Validation
q The lift decreases monotonically across deciles, which indicates the model
has strong predictive power to rank type-B product customers by the probability of attrition
(Figure: Lift Charts By Month On Validation Data; lift versus decile, with one series each for Feb2008, Apr2008, Jul2008 and Dec2008)
n Model Performance Validation Contd.
q The top-decile lift decreases monotonically from month to month, which is as expected: the rank-ordering power of the model decays over time
Top Decile Lift by Month On Validation Data
Month     Top-decile lift
Feb-08    3.12
Mar-08    3.05
Apr-08    2.99
May-08    2.93
Jun-08    2.84
Jul-08    2.76
Aug-08    2.66
Sep-08    2.58
Oct-08    2.52
Nov-08    2.42
Dec-08    2.33
n LIFEREG Procedure Versus PHREG
q Produces estimates of parametric regression models with censored survival data using the method of maximum likelihood
q Accommodates left censoring and interval censoring, while PHREG only allows right censoring
q Can be used to test certain hypotheses about the shape of the hazard function, while PHREG only gives you nonparametric estimates of the survivor function, which can be difficult to interpret
q Gives more efficient estimates (with smaller standard errors) than PHREG if the shape of the survival distribution is known
q Makes it possible to perform likelihood-ratio goodness-of-fit tests for many of the other probability distributions, thanks to the availability of the generalized gamma distribution
q Does not handle time-dependent covariates
n Introduced parametric and semi-parametric survival model approaches, and showed how to conduct and evaluate them using SAS
n Demonstrated that survival analysis is a very powerful statistical tool for predicting time-to-event in database marketing
n Uncovered insights into attrition risk and attrition hazard over the time of tenure, which are hard for conventional models to provide
n Overall, this study is helpful in customizing marketing communications and customer treatment programs so that marketing interventions are optimally timed