statistics Tutorial: Survival Model and Attrition Analysis

Survival Model and Attrition Analysis

■ Background

❑ Conventional statistical methods are very successful at predicting which customers will have an event of interest within a given target time window. However, they struggle with a different question: for a given customer, when is the event of interest most likely to occur?

Put differently, how do we estimate the following survivor function S(•)?

S(time, covariates) = Prob(no event up to time | covariates), so the probability that the event has occurred by that time is 1 - S(time, covariates)

❑ The goal of this study is, through estimating S(•), to show:

➢ How to understand parametric and semi-parametric approaches

➢ How to employ parametric and semi-parametric approaches to estimate the survival function

➢ How to use SAS to conduct them

➢ How to evaluate the estimates

■ What is Survival Analysis?

❑ Models time to event

❑ Unlike linear regression, survival analysis can have a dichotomous (binary) outcome

❑ Unlike logistic regression or decision trees, survival analysis analyzes the time to an event

❑ Why is that important?

➢ Able to account for censoring and time-dependent covariates

➢ Can compare survival between two or more groups

➢ Can assess the relationship between covariates and survival time

➢ Capable of answering “who is most likely to have an event, and when?”

■ When to use Survival Analysis?

❑ Examples:

➢ Time to cancellation of products or services (attrition)

➢ Time to acquiring add-on products or upgrading

➢ Re-deactivation rate after retention treatment

➢ etc.

❑ When one believes that one or more explanatory variables explain the differences in time to an event

❑ Especially when follow-up is incomplete

■ Conventional modeling techniques struggle with two common features of marketing data: censoring and time-dependent covariates

■ Survival analysis encompasses a wide variety of methods for analyzing the timing of events

■ As with other statistical methods we have studied, survival analysis supports any of the following:

❑ Descriptive statistics

❑ Univariate statistics

❑ Multivariate statistics

■ Descriptive statistics:

❑ How to describe lifetime?

➢ Mean or median survival time?

➢ What test would you use to compare survival statistics between 2 cohorts?

❑ Average hazard rate

➢ Total number of failures divided by total observed survival time

➢ An incidence rate, with higher values indicating more events per unit of time
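
The average hazard rate is simple arithmetic, and a small worked case may help. The sketch below uses a made-up six-customer data set (the data set name `toy` and the variables `tenure` and `event` are hypothetical, not from the study): three customers attrite and the total observed time is 81 customer-months, so the average hazard rate is 3/81, or about 0.037 events per customer-month.

```sas
/* Hypothetical toy data: tenure in months; event = 1 if attrited, 0 if censored */
data toy;
  input tenure event @@;
  datalines;
5 1  8 0  12 1  12 0  20 1  24 0
;
run;

/* Average hazard rate = total failures / total observed survival time */
proc sql;
  select sum(event)               as failures,
         sum(tenure)              as total_time,
         sum(event) / sum(tenure) as avg_hazard_rate
  from toy;
quit;
```

Note that censored customers still contribute their observed months to the denominator, which is what distinguishes this rate from a simple proportion of attritors.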

■ Univariate statistics:

❑ Univariate method: Kaplan-Meier survival curves

➢ Also known as the product-limit estimator

➢ Accounts for censoring

➢ Does not account for confounding or effect modification by other covariates
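
For reference, the product-limit estimator can be written out as follows. The notation ($t_i$ for the distinct event times, $d_i$ for the number of events at $t_i$, $n_i$ for the number still at risk just before $t_i$) is the standard textbook notation and does not appear in the original slides. The small numeric case is invented purely for illustration.

```latex
\hat{S}(t) \;=\; \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)

% Illustration with 5 customers: one event at month 2, one censored at month 3,
% one event at month 5, two still active after month 5.
% At t = 2: n = 5, d = 1.  At t = 5: n = 3 (one event and one censoring removed), d = 1.
\hat{S}(5) \;=\; \left(1 - \tfrac{1}{5}\right)\left(1 - \tfrac{1}{3}\right)
          \;=\; 0.8 \times 0.667 \;\approx\; 0.533
```

The censored customer leaves the risk set without causing a step in the curve, which is how the estimator accounts for censoring.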

■ Example (Kaplan-Meier curve): A plot of the Kaplan-Meier estimate of the survival function is a series of declining horizontal steps which, with a large enough sample, approaches the true survival function for that population

Questions:

1. What are the median lifetimes of the 2 types of customers (mcc_ind=0 and mcc_ind=1)?

2. Are their survival distributions significantly different?

[Figure: Kaplan-Meier survival curves for mcc_ind=0 and mcc_ind=1, with the median lifetime of each group marked and the 120-month gap between the two medians annotated]

The median lifetime of mcc_ind = 0 is 120 months longer than that of mcc_ind = 1

Test of Equality over Strata

Test         Chi-Square    Pr > Chi-Square
Log-Rank     243.7972      <.0001
Wilcoxon     254.0723      <.0001
-2Log(LR)    241.2043      <.0001
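
A minimal sketch of how curves and tests of this kind are typically requested in SAS, assuming a hypothetical data set `cust` with a lifetime variable `lifetime_months`, an event flag `attrition` (0 = censored) and the grouping variable `mcc_ind`. Only `mcc_ind` comes from the example; the other names are placeholders.

```sas
/* Kaplan-Meier curves by group, with confidence limits on the plot */
proc lifetest data=cust method=km plots=survival(cl);
  time lifetime_months*attrition(0);  /* attrition = 0 marks a censored record    */
  strata mcc_ind;                     /* one curve per level of mcc_ind, plus the */
                                      /* log-rank, Wilcoxon and -2Log(LR) tests   */
run;
```

The quartile estimates in the printed output give the median lifetime of each stratum, which answers question 1. The test table answers question 2.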

■ Comparing Multiple Kaplan-Meier curves

❑ Multiple pair-wise comparisons inflate the cumulative Type I error (the multiple comparison problem)

❑ Instead, compare all curves at once

➢ Analogous to using ANOVA to compare more than 2 cohorts

➢ Then use judicious pair-wise testing

■ Multivariate statistics:

■ Limit of Kaplan-Meier Curves

❑ What happens when you have several covariates that you believe contribute to survival?

❑ Stratified K-M curves can be used for a small number of covariates (two or slightly more)

❑ For many covariates another approach is needed: a model with covariates

■ Three Types of Survival Models

❑ If we model the survival time process without assuming a statistical distribution, this is called non-parametric survival analysis

❑ If we model the survival time process in a regression model and assume that a distribution applies to the error structure, we call this parametric survival analysis

❑ If we model the survival time process in a regression model and assume proportional hazards, we call this semi-parametric survival analysis

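■ Proportional Hazards Model

The model assumes that the effect of the covariates is to increase or decrease the hazard by a proportionate amount at all durations. Thus

$$h(t \mid x) = h_0(t)\, \exp(x'\beta)$$

where $h_0(t)$ is the baseline hazard and $\exp(x'\beta)$ is the relative risk associated with covariate vector x. So, integrating the hazard from 0 to t, the cumulative hazards satisfy

$$H(t \mid x) = H_0(t)\, \exp(x'\beta)$$

Then the survivor functions can be derived as

$$S(t \mid x) = \exp\{-H(t \mid x)\} = S_0(t)^{\exp(x'\beta)}$$

Parallel hazard functions from the proportional hazards model can be graphed as follows:

[Figure: parallel hazard functions from the proportional hazards model]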
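
To make the word "proportional" concrete: assuming the usual form $h(t \mid x) = h_0(t)\exp(x'\beta)$, the ratio of the hazards of any two individuals i and j does not depend on t, because the baseline hazard cancels. The coefficient value 0.5 below is an arbitrary illustration, not an estimate from the study.

```latex
\frac{h(t \mid x_i)}{h(t \mid x_j)}
  \;=\; \frac{h_0(t)\,\exp(x_i'\beta)}{h_0(t)\,\exp(x_j'\beta)}
  \;=\; \exp\{(x_i - x_j)'\beta\}

% A one-unit increase in covariate k multiplies the hazard by exp(beta_k).
% For example, beta_k = 0.5 gives exp(0.5) \approx 1.65,
% i.e. a hazard about 65% higher at every duration.
```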

■ Proportional Hazards Model Contd.

Two Common Tests for Examining the Proportional Hazards Assumption

❑ Test the interaction of covariates with time

If the test shows that the interactions are significant, the covariate effects are time-dependent, which means the proportional hazards assumption is violated

❑ Conduct the Schoenfeld residuals test

➢ One popular assessment of proportional hazards is based on Schoenfeld residuals, which ought to show no association with time if proportionality holds. (Schoenfeld D. Partial residuals for the proportional hazards regression model. Biometrika, 1982, 69(1):239-241)

■ Parametric Survival Model

❑ We consider briefly the analysis of survival data when one is willing to assume a parametric form for the distribution of survival time

❑ Survival distributions within the accelerated failure time (AFT) class include the Exponential, Weibull, Standard Gamma, Log-normal, Generalized Gamma and Log-logistic

❑ The AFT model describes a relationship between the survivor functions of any two individuals. If $S_i(t)$ is the survivor function for individual i, then for any other individual j, the AFT model holds that

$$S_i(t) = S_j(\phi_{ij}\, t) \quad \text{for all } t$$

where $\phi_{ij}$ is a constant that is specific to the pair (i, j). This model says, in effect, that what makes one individual different from another is the rate at which they age

■ Parametric Survival Model Contd.

❑ Let T denote a continuous non-negative random variable representing survival time. A family of survival distributions can then be expressed as follows:

$$\log T_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \sigma W$$

where W is a random disturbance term with a standard distribution on $(-\infty, \infty)$, and $\sigma$ and $\beta_0, \dots, \beta_k$ are parameters to be estimated

❑ A baseline hazard function may change over time

❑ A linear function of a set of k fixed covariates gives the relative risk when it is exponentiated

❑ The parametric approach produces estimates of parametric regression models with censored survival data using the method of maximum likelihood
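
The pair-specific AFT constant and the regression form are two views of the same model. The short derivation below is a sketch that assumes the log-linear form $\log T_i = x_i'\beta + \sigma W$ (with the intercept absorbed into $x_i'\beta$) and shows what the constant looks like in terms of the coefficients.

```latex
S_i(t) = P(T_i > t) = P\big(\sigma W > \log t - x_i'\beta\big)

S_j(\phi\, t) = P\big(\sigma W > \log t + \log\phi - x_j'\beta\big)

% The two survivor functions agree for all t when
\log\phi_{ij} = (x_j - x_i)'\beta
\quad\Longrightarrow\quad
\phi_{ij} = \exp\{(x_j - x_i)'\beta\}
```

So if individual j has the larger linear predictor, $\phi_{ij} > 1$: individual i reaches any given survival probability sooner, which is the sense in which i "ages faster".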

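■ Parametric Survival Model Contd.

❑ The relationships between the various distributions are shown below, where the direction of each arrow represents going from the general to a special case

[Figure: diagram of the relationships among the survival distributions; the only surviving label is "Loglogistic Distribution"]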

■ Goodness-of-Fit Tests

❑ There are three common statistics for model comparison

➢ Log-likelihood

➢ AIC

➢ Likelihood-ratio statistic
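
For reference, the standard definitions of the last two statistics are sketched below. The symbols ($\log L$ for the maximized log-likelihood, $k$ for the number of estimated parameters counted in the penalty, $q$ for the number of restrictions) are generic textbook notation rather than notation from the slides.

```latex
\mathrm{AIC} \;=\; -2\log L \;+\; 2k
\qquad \text{(smaller is better)}

% Likelihood-ratio statistic for a restricted model (0) nested in a general model (1):
\mathrm{LR} \;=\; 2\,\big(\log L_1 - \log L_0\big) \;\sim\; \chi^2_q
\quad \text{under the restricted model}
```

The likelihood-ratio test only applies to nested models (for example, exponential within Weibull), whereas AIC can rank non-nested candidates as well.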

■ Goodness-of-Fit Tests Contd.

❑ Graphical Methods

➢ Exponential Distribution: the plot of $-\log S(t)$ versus t should yield a straight line through the origin

➢ Weibull Distribution: the plot of $\log[-\log S(t)]$ versus log t should be a straight line

➢ Log-Normal Distribution: the plot of $\Phi^{-1}(1 - S(t))$ versus log t should be a straight line, where $\Phi(\cdot)$ is the standard normal c.d.f.

➢ Log-Logistic Distribution: the plot of $\log[(1 - S(t)) / S(t)]$ versus log t should be a straight line
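
The four straight-line checks can be sketched in SAS by transforming a Kaplan-Meier estimate of S(t). This is an illustrative outline, not the code behind the plots in this tutorial: it assumes the type-B data set and the `time` and `attrition` variables used in the PROC LIFEREG example, and the names `km`, `km_diag` and the `y_*` columns are made up here.

```sas
/* Kaplan-Meier estimate of S(t); OUTSURV= writes it to data set km */
proc lifetest data=TDM.smpl_typeB_attri_data outsurv=km noprint;
  time time*attrition(0);
run;

/* One transformed column per candidate distribution */
data km_diag;
  set km;
  where _censor_ = 0 and 0 < survival < 1 and time > 0;
  log_t   = log(time);
  y_exp   = -log(survival);                  /* vs. time : exponential  */
  y_weib  = log(-log(survival));             /* vs. log_t: Weibull      */
  y_lnorm = probit(1 - survival);            /* vs. log_t: log-normal   */
  y_llog  = log((1 - survival) / survival);  /* vs. log_t: log-logistic */
run;

/* Example: the log-logistic check, scatter plus fitted regression line */
proc sgplot data=km_diag;
  reg x=log_t y=y_llog;
run;
```

Swapping the `y=` variable (and using `x=time` for the exponential case) gives the other three plots. The distribution whose plot is closest to a straight line is the most plausible candidate.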

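❑ Cox-Snell Residuals Plot (Collett 1994)

➢ The Cox-Snell residual is defined as:

$$e_i = -\log S(t_i \mid x_i)$$

➢ where $t_i$ is the observed event time or censoring time for individual i, $x_i$ is the vector of covariate values for individual i, and $S(t \mid x)$ is the estimated probability of surviving to time t based on the fitted model. If the model fits, the residuals behave like a censored sample from a unit exponential distribution, so a plot of their estimated $-\log S(e)$ against e should be close to a straight line with slope 1.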

■ Formulate the Business Problem

❑ Rank current TD type-A customers by their likelihood of attrition at a given point in time within the next 12 months

■ Time Framework

❑ From Dec2009 to Nov2010

■ Population

❑ All customers who are open and active as of Oct2009, except seasonal accounts

❑ 10K eligible customers for modeling

❑ N customers are flagged as attritors according to the attrition definition

❑ m% overall attrition rate

■ Target (involuntary attrition is excluded)

Note: All examples in this presentation are based on a fake dataset.

■ Model customer data with the Cox proportional hazards model using SAS as follows:

proc phreg data=TDM.smpl_typeA_attri_data;
  /* month is the time variable; attrition = 0 marks censored records */
  model month*attrition(0) = var1-var31 / ties=efron;
  /* save the estimated survivor function with its log and log(-log) transforms */
  baseline out=a survival=s logsurv=ls loglogs=lls;
run;

■ The syntax of the MODEL statement is: MODEL time < *censor ( list ) > = effects < / options > ;

■ Here the time scale is time since Oct2009, measured in completed months.

■ Lift Charts

The lift charts show that the survival model outperforms logistic regression on this attrition data

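■ Conduct the Tests Using SAS

proc phreg data=TDM.smpl_typeA_attri_data;
  model month*attrition(0) = var1-var31 tvar1-tvar31 / ties=efron;
  /* time-dependent interactions, defined inside PHREG: tvar{i} = var{i} * month */
  array v{31} var1-var31;
  array tv{31} tvar1-tvar31;
  do i = 1 to 31; tv{i} = v{i} * month; end;
  output out=b ressch=ressch1-ressch31;
  test_proportionality: test tvar1, tvar2, tvar3, tvar4, tvar5, tvar6, tvar7, tvar8,
    tvar9, tvar10, tvar11, tvar12, tvar13, tvar14, tvar15, tvar16, tvar17, tvar18,
    tvar19, tvar20, tvar21, tvar22, tvar23, tvar24, tvar25, tvar26, tvar27, tvar28,
    tvar29, tvar30, tvar31;
run;

The test shows that most of the interactions of covariates with time are not significant at the alpha=0.05 level (e.g. p=0.57 and p=0.43 for the interactions of var15 and var29 with time), but a couple of them are. For instance, p<.0001 for the interaction of var13 with time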

■ Schoenfeld Residuals Test

❑ As an example, the residuals for var15 show a fairly random scatter, and the OLS regression of the residuals on month gives a p-value of 0.5953, which indicates that no significant trend exists.

❑ For var29, the regression gives a p-value of 0.1847. The residual plot is not very informative, which is typical of graphs for dichotomous covariates

❑ The Schoenfeld residuals test shows no evidence that the proportional hazards assumption is violated for those variables

❑ For var13, there appears to be a slight tendency for the residuals to increase with time since entering the study. The p-value for var13 was 0.02, suggesting that there may be some departure from proportionality for that variable
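
The residual checks described above can be outlined as follows. This sketch assumes the data set `b` written by the OUTPUT statement of the PROC PHREG step (with the Schoenfeld residuals stored as `ressch1`-`ressch31` alongside the original `month` variable). The smoothing and plotting choices are illustrative, not necessarily those used for the original charts.

```sas
/* Scatter of the var15 Schoenfeld residuals against time, with a loess smooth */
proc sgplot data=b;
  scatter x=month y=ressch15;
  loess   x=month y=ressch15 / nomarkers;
run;

/* OLS regression of the residual on month: a non-significant slope */
/* is consistent with proportional hazards for var15                */
proc reg data=b;
  model ressch15 = month;
run;
quit;
```

Schoenfeld residuals are defined only for records with an event, so censored records drop out of both steps automatically. Replacing `ressch15` with `ressch29` or `ressch13` repeats the check for the other two variables.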

■ Objectives

❑ The example shows how to develop a parametric survival model using SAS, based on TD type-B customer attrition data

❑ This analysis will help TD business units better understand attrition risk and attrition hazard by predicting “who will attrite” and, most importantly, “when will they attrite”

❑ The findings from this study can be used to optimize customer retention and/or treatment resources in TD attrition reduction efforts

■ Attrition Definition

❑ TD type-B customer attrition is defined as a type-B customer account that is closed a certain number of days (at least 120 days) before maturity

❑ Attrition in this study refers only to customer-initiated attrition

■ Exclusions

❑ Involuntary attrition is excluded

❑ All records with repeat attritions are excluded

❑ Mortgages closed within one month of opening are excluded

■ Granularity

❑ This study examines type-B customer attrition at the account level

■ Time Frame For Modeling:

❑ 01Jan2008 is the origin of time, and 31Aug2009 is the observation termination time

■ Population

❑ Population For Modeling:

➢ All type-B customer accounts active as of 01Jan2008, except those that attrite involuntarily in the following months

➢ 200K type-B customer accounts are eligible for modeling

➢ M accounts are flagged as attritors

➢ n% average attrition rate over the 20-month study time window (01Jan2008 to 31Aug2009)

■ Attrition Hazard Function Estimation

❑ The purpose of estimation is to gain knowledge of the hazard characteristics. E.g., at what point in account tenure is the risk of attrition highest?

❑ The scatter plot below shows that the shape of the hazard function resembles that of a Log-Logistic distribution.

[Figure: scatter plot of the estimated attrition hazard function over account tenure]

The highest risk of midterm attrition occurs at around one and a half years of type-B product tenure

■ Variables For Modeling

❑ There are 20 variables in the modeling dataset

❑ 11 categorical variables (X1 – X11), each with 2 to 3 levels

❑ 9 numeric variables (X12 – X20)

■ Model Type Exploration

❑ The following scatter plot points to the Log-Logistic model. However, we’ll try multiple distributions and select the champion as the final model type

[Figure: Plot for Evaluating Log-Logistic Model. x-axis: LogTime (0 to 3); y-axis from -5.5 to -1.5; fitted line with R2 = 0.9939]

■ PROC LIFEREG: Parametric Survival Model

proc lifereg data=TDM.smpl_typeB_attri_data;
  /* &catvars and &numvars hold the covariate lists; &distr names the distribution */
  model time*attrition(0) = &catvars &numvars / dist=&distr;
  /* save the fitted c.d.f. for each account in variable f */
  output out=a cdf=f;
run;

Notes:

1. &distr is one of Exponential, Weibull, LogNormal, LogLogistic, Gamma

2. The performance of the model under each distribution is evaluated by AIC and the Cox-Snell residuals plot
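
One way to try several distributions in a single run is a small macro loop. This is a hedged sketch rather than the author's code: the macro name `fit_all` and the `fit_*` data set names are invented, `&catvars` and `&numvars` are assumed to be defined as in the step above, and the ODS table name `FitStatistics` should be checked against your SAS release. The distribution keywords are the ones PROC LIFEREG accepts in `DIST=`.

```sas
%macro fit_all(dists=exponential weibull lnormal llogistic gamma);
  %local i distr;
  %do i = 1 %to %sysfunc(countw(&dists));
    %let distr = %scan(&dists, &i);

    /* Fit one AFT model per distribution and keep its fit statistics */
    proc lifereg data=TDM.smpl_typeB_attri_data;
      model time*attrition(0) = &catvars &numvars / dist=&distr;
      ods output FitStatistics=fit_&distr;   /* -2 log L, AIC, etc. */
    run;
  %end;
%mend fit_all;

%fit_all()
```

Each `fit_<distribution>` data set can then be stacked and sorted by AIC to pick the champion distribution.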

■ Evaluation of Model Specification

AIC and Log Likelihood By Model Distribution

Distribution    Log Likelihood    AIC
Exponential     -151746.90        303495.81
Weibull         -149094.71        298193.42
Lognormal       -148662.78        297329.56
Logistic        -252226.57        504457.13
LLogistic       -148331.69        296667.39    (Champion: lowest AIC)
Gamma           -162677.70        325361.40

Note: The AIC of the Gamma model is greater than 298193

Note: The Scale is 0.654503 for the champion model
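
The AIC column can be checked by hand from the log-likelihoods. In the sketch below the parameter counts `k` are an assumption inferred from the gap between AIC and -2 log L in the table (1 for Exponential, 3 for Gamma, 2 for the rest) and are not stated in the source. With those counts the recomputed values agree with the table to within 0.01, the difference being rounding of the reported log-likelihoods. The second step illustrates a likelihood-ratio test of the exponential model against the Weibull model, in which it is nested.

```sas
/* Recompute AIC = -2 log L + 2k from the reported log-likelihoods */
data aic_check;
  input dist :$12. loglik k;
  aic = -2*loglik + 2*k;
  datalines;
Exponential -151746.90 1
Weibull     -149094.71 2
Lognormal   -148662.78 2
Logistic    -252226.57 2
LLogistic   -148331.69 2
Gamma       -162677.70 3
;
run;

proc sort data=aic_check;
  by aic;                       /* smallest AIC first */
run;

proc print data=aic_check noobs;
  format loglik aic 12.2;
run;

/* Likelihood-ratio test: exponential (restricted) vs. Weibull, 1 d.f. */
data lr_test;
  lr = 2*(-149094.71 - (-151746.90));   /* = 5304.38 */
  p  = 1 - probchi(lr, 1);
run;
```

The sorted listing puts LLogistic first, matching the champion in the table, and the likelihood-ratio statistic of 5304.38 on 1 degree of freedom rejects the exponential model decisively.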

■ Cox-Snell Residuals Plot

❑ The following scatter plot shows that the Log-Logistic model fits the type-B customer attrition data well

[Figure: Log SDF By Cox-Snell Residuals On Validation Data. x-axis: Cox-Snell Residuals (0 to 5); y-axis from 0.5 to 3.5; fitted line with R2 = 0.9898]
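
A plot of this kind can be built from the fitted c.d.f. saved by PROC LIFEREG. The outline below assumes the data set `a` and variable `f` written by the `output out=a cdf=f;` statement shown earlier, applied to whichever sample is being checked. The names `cs`, `cs_km`, `cs_plot` and `neg_log_s` are made up for this sketch.

```sas
/* Cox-Snell residual: e = -log S(t | x), where S = 1 - F */
data cs;
  set a;
  e = -log(1 - f);
run;

/* Treat the residuals as censored survival times and estimate their S(e) */
proc lifetest data=cs outsurv=cs_km noprint;
  time e*attrition(0);
run;

data cs_plot;
  set cs_km;
  where _censor_ = 0 and 0 < survival < 1;
  neg_log_s = -log(survival);
run;

/* A good fit gives points close to the 45-degree reference line */
proc sgplot data=cs_plot;
  scatter x=e y=neg_log_s;
  lineparm x=0 y=0 slope=1;
run;
```

The reference line has slope 1 because, under a correctly specified model, the residuals follow a unit exponential distribution, for which -log S(e) = e.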

■ Model Performance Validation

❑ The lift decreases monotonically across deciles, which indicates that the model has strong predictive power to rank type-B product customers by their probability of attrition

[Figure: Lift Charts By Month On Validation Data. x-axis: decile (0 to 9); y-axis: lift (0.5 to 3.5); series: Lift - Feb2008, Lift - Apr2008, Lift - Jul2008, Lift - Dec2008]
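
Decile lift of this kind is the attrition rate within a decile divided by the overall attrition rate. A minimal sketch, assuming a hypothetical scored validation data set `scored` with a predicted attrition probability `p_attr` for the month of interest and the observed 0/1 flag `attrition` (both names are placeholders):

```sas
/* Decile 0 = highest predicted risk, decile 9 = lowest */
proc rank data=scored out=ranked groups=10 descending;
  var p_attr;
  ranks decile;
run;

/* Lift = attrition rate in the decile / overall attrition rate */
proc sql;
  select decile,
         mean(attrition) as attrition_rate,
         mean(attrition) / (select mean(attrition) from ranked) as lift
  from ranked
  group by decile
  order by decile;
quit;
```

Repeating the calculation with the predicted probability for each month gives one lift curve per month, as in the chart above.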

■ Model Performance Validation Contd.

❑ The top-decile lift decreases monotonically from month to month, as expected: the rank-ordering power of the model decays over time

Top Decile Lift by Month On Validation Data

Month     Top decile lift
Feb-08    3.12
Mar-08    3.05
Apr-08    2.99
May-08    2.93
Jun-08    2.84
Jul-08    2.76
Aug-08    2.66
Sep-08    2.58
Oct-08    2.52
Nov-08    2.42
Dec-08    2.33

■ LIFEREG Procedure Versus PHREG

❑ LIFEREG produces estimates of parametric regression models with censored survival data using the method of maximum likelihood

❑ Accommodates left censoring and interval censoring, while PHREG only allows right censoring

❑ Can be used to test certain hypotheses about the shape of the hazard function, while PHREG only gives nonparametric estimates of the survivor function, which can be difficult to interpret

❑ Gives more efficient estimates (with smaller standard errors) than PHREG if the shape of the survival distribution is known

❑ Makes it possible to perform likelihood-ratio goodness-of-fit tests for many of the other probability distributions, thanks to the availability of the generalized gamma distribution

❑ Does not handle time-dependent covariates

■ Introduced parametric and semi-parametric survival model approaches, and showed how to conduct and evaluate them using SAS

■ Demonstrated that survival analysis is a very powerful statistical tool for predicting time-to-event in database marketing

■ Uncovered insight into attrition risk and attrition hazard over the time of tenure, which is hard for conventional models to do

■ Overall, this study is helpful in customizing marketing communications and customer treatment programs so that marketing interventions are optimally timed


Wednesday, January 18, 2017
