survive.KaplanMeier

class survive.KaplanMeier(*, conf_type='log-log', conf_level=0.95, var_type='greenwood', tie_break='discrete', n_boot=500, random_state=None)

Kaplan-Meier nonparametric survival function estimator.

The Kaplan-Meier estimator [1] is also called the product-limit estimator. Much of this implementation is inspired by the R package survival [2].

For a quick introduction to the Kaplan-Meier estimator, see e.g. Section 4.2 in [3] or Section 1.4.1 in [4]. For a more thorough treatment, see Chapter 4 in [5].

Parameters:
conf_type : {‘log-log’, ‘linear’, ‘log’, ‘logit’, ‘arcsin’}

Type of confidence interval for the survival function estimate to report.

conf_level : float

Confidence level of the confidence intervals.

var_type : {‘greenwood’, ‘aalen-johansen’, ‘bootstrap’}

Type of variance estimate for the survival function to compute.

tie_break : {‘discrete’, ‘continuous’}

Specify how to handle tied event times when computing the Aalen-Johansen variance estimate (when var_type is “aalen-johansen”). Ignored for other values of var_type.

n_boot : int, optional

Number of bootstrap samples to draw when estimating the survival function variance using the bootstrap (when var_type is “bootstrap”). Ignored for other values of var_type.

random_state : int or numpy.random.RandomState, optional

Random number generator (or a seed for one) used for sampling and for variance computations if var_type is “bootstrap”. Ignored for other values of var_type.

Notes

Suppose we have observed right-censored and left-truncated event times. Let \(T_1 < T_2 < \cdots\) denote the ordered distinct event times. Let \(\Delta N(T_j)\) be the number of events observed at time \(T_j\), and let \(Y(T_j)\) denote the number of individuals at risk (under observation but not yet censored or “dead”) at time \(T_j\). The Kaplan-Meier estimator estimates the survival function \(S(t)\) of the time-to-event distribution by

\[\widehat{S}(t) = \prod_{j : T_j \leq t} \left(1 - \frac{\Delta N(T_j)}{Y(T_j)}\right).\]
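As a rough illustration of the product-limit formula, the following sketch computes \(\widehat{S}(T_j)\) at each distinct event time in plain Python. It is not this package's implementation: the function name kaplan_meier_steps and the toy data are made up for the example, and it assumes right-censoring only (no left truncation), so every individual whose observed time is at least \(T_j\) counts as at risk.

```python
def kaplan_meier_steps(times, events):
    """Return [(T_j, S(T_j))] for each distinct event time T_j.

    times  : observed times (event or censoring time)
    events : 1 if the event was observed, 0 if right-censored
    Assumption for this sketch: no left truncation.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    surv = 1.0
    steps = []
    for tj in event_times:
        # Y(T_j): individuals still under observation at T_j
        at_risk = sum(1 for t in times if t >= tj)
        # Delta N(T_j): events observed exactly at T_j
        deaths = sum(1 for t, e in zip(times, events) if e and t == tj)
        surv *= 1.0 - deaths / at_risk
        steps.append((tj, surv))
    return steps


# Hypothetical toy data: six individuals, two of them censored.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
for tj, s in kaplan_meier_steps(toy_times, toy_events):
    print(tj, round(s, 4))
```

On the toy data the estimate steps down to 5/6, 2/3, 4/9 and finally 0 at the event times 1, 2, 3 and 5.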

There are several supported ways of estimating the Kaplan-Meier variance \(\mathrm{Var}(\widehat{S}(t))\), each one corresponding to a different value of var_type:

“greenwood”

This is the classical Greenwood’s formula [6]:

\[\widehat{\mathrm{Var}}(\widehat{S}(t)) = \widehat{S}(t)^2 \sum_{j : T_j \leq t} \frac{\Delta N(T_j)}{Y(T_j) (Y(T_j) - \Delta N(T_j))}.\]
“aalen-johansen”

This uses the Poisson moment approximation to the binomial suggested in [7]. This method requires choosing how to handle tied event times by specifying the parameter tie_break. See Sections 3.1.3 and 3.2.2 in [8]. Possible values are

“discrete”

Tied event times are possible and are treated as simultaneous. The variance estimate is

\[\widehat{\mathrm{Var}}(\widehat{S}(t)) = \widehat{S}(t)^2 \sum_{j : T_j \leq t} \frac{\Delta N(T_j)}{Y(T_j)^2}.\]
“continuous”

True event times almost surely don’t coincide, and any observed ties are due to grouping or rounding. Tied event times will be treated as if each one occurred in succession, each one immediately following the previous one. The variance estimate is

\[\widehat{\mathrm{Var}}(\widehat{S}(t)) = \widehat{S}(t)^2 \sum_{j : T_j \leq t} \sum_{k=0}^{\Delta N(T_j) - 1} \frac{1}{\left(Y(T_j) - k\right)^2}.\]

This method is less frequently used than Greenwood’s formula, and the two methods are usually close to each other numerically. However, [9] recommends using Greenwood’s formula because it is less biased and has comparable or lower mean squared error.

“bootstrap”

This uses the bootstrap to estimate the survival function variance [10]. Specifically, one chooses a positive integer \(B\) (the number of bootstrap samples n_boot), forms \(B\) bootstrap samples by sampling with replacement from the data, and computes the Kaplan-Meier estimate \(\widehat{S}_b^*(t)\) for each time \(t\) and each \(b=1,\ldots,B\). The resulting variance estimate is

\[\widehat{\mathrm{Var}}(\widehat{S}(t)) = \frac{1}{B} \sum_{b=1}^B \left(\widehat{S}_b^*(t) - \frac{1}{B} \sum_{c=1}^B \widehat{S}_c^*(t)\right)^2.\]
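The bootstrap recipe can be sketched in a few lines of standard-library Python. This is only an illustration under simplifying assumptions (right-censoring only, a single fixed time \(t\), and resampling of whole (time, event) pairs); the helper names km_at and bootstrap_variance are hypothetical and do not reflect how this class is implemented internally.

```python
import random


def km_at(obs, t):
    """Kaplan-Meier estimate at time t from (time, event) pairs."""
    surv = 1.0
    for tj in sorted({time for time, event in obs if event and time <= t}):
        at_risk = sum(1 for time, _ in obs if time >= tj)
        deaths = sum(1 for time, event in obs if event and time == tj)
        surv *= 1.0 - deaths / at_risk
    return surv


def bootstrap_variance(obs, t, n_boot=500, seed=0):
    """Bootstrap variance of the Kaplan-Meier estimate at time t."""
    rng = random.Random(seed)
    n = len(obs)
    estimates = []
    for _ in range(n_boot):
        # Resample n observations with replacement.
        sample = [obs[rng.randrange(n)] for _ in range(n)]
        estimates.append(km_at(sample, t))
    mean = sum(estimates) / n_boot
    # Divide by B (not B - 1), matching the formula above.
    return sum((s - mean) ** 2 for s in estimates) / n_boot


# Hypothetical toy data: (time, event) with event = 0 for censoring.
toy_obs = [(1, 1), (2, 1), (2, 0), (3, 1), (4, 0), (5, 1)]
print(bootstrap_variance(toy_obs, t=3, n_boot=200, seed=1))
```

Fixing the seed makes the result reproducible, which is the role random_state plays for this estimator.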

Having chosen a variance estimate, we can estimate the standard error by

\[\widehat{\mathrm{SE}}(\widehat{S}(t)) = \sqrt{\widehat{\mathrm{Var}}(\widehat{S}(t))}.\]
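To make the closed-form variance formulas concrete, here is a minimal sketch that evaluates the sums for Greenwood's formula and for both Aalen-Johansen tie-breaking rules, then converts the result to a standard error. The function names and the risk-table rows are invented for the example; each row is a pair \((\Delta N(T_j), Y(T_j))\) for an event time \(T_j \leq t\).

```python
import math


def greenwood_sum(rows):
    """Sum in Greenwood's formula; requires Y(T_j) > Delta N(T_j)."""
    return sum(d / (y * (y - d)) for d, y in rows)


def aalen_johansen_sum(rows, tie_break="discrete"):
    """Sum in the Aalen-Johansen variance estimate."""
    if tie_break == "discrete":
        return sum(d / y ** 2 for d, y in rows)
    if tie_break == "continuous":
        # Tied events are treated as occurring one right after another.
        return sum(1.0 / (y - k) ** 2 for d, y in rows for k in range(d))
    raise ValueError("tie_break must be 'discrete' or 'continuous'")


def standard_error(surv, var_sum):
    """SE = sqrt(S(t)^2 * sum)."""
    return math.sqrt(surv ** 2 * var_sum)


# Hypothetical risk table: (events, at risk) at three event times.
rows = [(1, 6), (1, 5), (1, 3)]
surv = (5 / 6) * (4 / 5) * (2 / 3)  # Kaplan-Meier estimate, 4/9
print(standard_error(surv, greenwood_sum(rows)))
print(standard_error(surv, aalen_johansen_sum(rows, "discrete")))
```

With no tied event times the two tie-breaking rules give the same value; they differ only for rows with \(\Delta N(T_j) > 1\).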

Confidence intervals for \(p=\widehat{S}(t)\) exploit the asymptotic normality of the Kaplan-Meier estimator by choosing a strictly increasing and differentiable transformation \(f(p)\) and applying the delta method, which states that \(f(p)\) is also asymptotically normal, with standard error \(\mathrm{SE}(p) f^\prime(p)\). Consequently, a normal approximation confidence interval for \(f(p)\) is

\[f(p) \pm z \widehat{\mathrm{SE}}(p) f^\prime(p)\]

where \(z\) is the (1-conf_level)/2-quantile of the standard normal distribution. A confidence interval for \(p\) is then

\[f^{-1}\left(f(p) \pm z \widehat{\mathrm{SE}}(p) f^\prime(p)\right).\]

These general types of confidence intervals were proposed in [11]. Our implementation also shrinks the intervals to be between 0 and 1 if necessary. We list the supported transformations below.

conf_type \(f(p)\)
“linear” \(p\)
“log” \(\log(p)\)
“log-log” \(-\log(-\log(p))\)
“logit” \(\log(p/(1-p))\)
“arcsin” \(\arcsin(\sqrt{p})\)
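The transformation-based intervals can be sketched as follows for a single estimate \(p\) with \(0 < p < 1\). This is an illustration rather than the package's code: the names TRANSFORMS and transformed_ci are hypothetical, and the sketch uses the upper \((1+\text{conf\_level})/2\) normal quantile, which has the same magnitude as the quantile mentioned above and gives the same interval because of the \(\pm\).

```python
import math
from statistics import NormalDist

# (f, f', f^{-1}) for each conf_type in the table above.
TRANSFORMS = {
    "linear": (lambda p: p, lambda p: 1.0, lambda y: y),
    "log": (math.log, lambda p: 1.0 / p, math.exp),
    "log-log": (
        lambda p: -math.log(-math.log(p)),
        lambda p: -1.0 / (p * math.log(p)),
        lambda y: math.exp(-math.exp(-y)),
    ),
    "logit": (
        lambda p: math.log(p / (1.0 - p)),
        lambda p: 1.0 / (p * (1.0 - p)),
        lambda y: 1.0 / (1.0 + math.exp(-y)),
    ),
    "arcsin": (
        lambda p: math.asin(math.sqrt(p)),
        lambda p: 1.0 / (2.0 * math.sqrt(p * (1.0 - p))),
        # Clamp to [0, pi/2], where sin^2 is increasing, before inverting.
        lambda y: math.sin(min(max(y, 0.0), math.pi / 2)) ** 2,
    ),
}


def transformed_ci(p, se, conf_type="log-log", conf_level=0.95):
    """Delta-method confidence interval for p; requires 0 < p < 1."""
    f, f_prime, f_inv = TRANSFORMS[conf_type]
    z = NormalDist().inv_cdf((1.0 + conf_level) / 2.0)
    half_width = z * se * f_prime(p)
    lower = f_inv(f(p) - half_width)
    upper = f_inv(f(p) + half_width)
    # Shrink the interval into [0, 1] if necessary.
    return max(lower, 0.0), min(upper, 1.0)


for kind in TRANSFORMS:
    print(kind, transformed_ci(0.8, 0.05, kind))
```

Only the “linear” interval is symmetric about \(p\); the other transformations produce asymmetric intervals, and several of them stay inside \([0, 1]\) without any clipping.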

The confidence intervals implemented here are equivalent for large samples (i.e., asymptotically). For small samples (as small as 25 observations with up to 50% censoring away from the right tail), the “log” and “arcsin” confidence intervals have been shown to give close to the correct coverage probability, whereas the “linear” confidence interval needs much larger sample sizes to perform similarly [11]. For small samples, the “arcsin” intervals tend to be conservative, the “log” intervals tend to be slightly liberal, and the “linear” intervals tend to be very liberal [11].

The “log” intervals were introduced in the first edition of [4], and the “arcsin” intervals were introduced in [12].

References

[1] E. L. Kaplan and P. Meier. “Nonparametric estimation from incomplete observations.” Journal of the American Statistical Association. Volume 53, Issue 282 (1958), pp. 457–481. DOI.
[2] Terry M. Therneau. A Package for Survival Analysis in S. Version 2.38 (2015). CRAN.
[3] D. R. Cox and D. Oakes. Analysis of Survival Data. Chapman & Hall, London (1984), pp. ix+201.
[4] John D. Kalbfleisch and Ross L. Prentice. The Statistical Analysis of Failure Time Data. Second Edition. Wiley (2002), pp. xiv+439.
[5] John P. Klein and Melvin L. Moeschberger. Survival Analysis. Techniques for Censored and Truncated Data. Second Edition. Springer-Verlag, New York (2003), pp. xvi+538. DOI.
[6] M. Greenwood. “The natural duration of cancer.” Reports on Public Health and Medical Subjects. Volume 33 (1926), pp. 1–26.
[7] Odd O. Aalen and Søren Johansen. “An empirical transition matrix for non-homogeneous Markov chains based on censored observations.” Scandinavian Journal of Statistics. Volume 5, Number 3 (1978), pp. 141–150. JSTOR.
[8] Odd O. Aalen, Ørnulf Borgan, and Håkon K. Gjessing. Survival and Event History Analysis. A Process Point of View. Springer-Verlag, New York (2008), pp. xviii+540. DOI.
[9] John P. Klein. “Small sample moments of some estimators of the variance of the Kaplan-Meier and Nelson-Aalen estimators.” Scandinavian Journal of Statistics. Volume 18, Number 4 (1991), pp. 333–40. JSTOR.
[10] Bradley Efron. “Censored data and the bootstrap.” Journal of the American Statistical Association. Volume 76, Number 374 (1981), pp. 312–19. DOI.
[11] Ørnulf Borgan and Knut Liestøl. “A note on confidence intervals and bands for the survival function based on transformations.” Scandinavian Journal of Statistics. Volume 17, Number 1 (1990), pp. 35–41. JSTOR.
[12] Vijayan N. Nair. “Confidence Bands for Survival Functions with Censored Data: A Comparative Study.” Technometrics. Volume 26, Number 3 (1984), pp. 265–75. DOI.
Attributes:
conf_level

Confidence level of the confidence intervals.

conf_type

Type of confidence intervals to report.

data_

Survival data used to fit the estimator.

n_boot

Number of bootstrap samples to draw when var_type is “bootstrap”.

random_state

Seed for this model’s random number generator.

summary

Get a summary of this estimator.

tie_break

How to handle tied event times.

var_type

Type of variance estimate to compute.

Methods

check_fitted() Check whether this model is fitted.
fit(time, **kwargs) Fit the Kaplan-Meier estimator to survival data.
plot(*groups[, ci, ci_style, ci_kwargs, …]) Plot the estimates.
predict(time, *[, return_se, return_ci]) Compute estimates.
quantile(prob, *[, return_ci]) Empirical quantile estimates for the time-to-event distribution.
to_string([max_line_length]) String representation of this model.
check_fitted()

Check whether this model is fitted. If not, raise an exception.

conf_level

Confidence level of the confidence intervals.

Returns:
conf_level : float

The confidence level.

conf_type

Type of confidence intervals to report.

Returns:
conf_type : str

The type of confidence interval.

data_

Survival data used to fit the estimator.

This property is only available after fitting.

Returns:
data : SurvivalData

The survive.SurvivalData instance used to fit the estimator.

fit(time, **kwargs)

Fit the Kaplan-Meier estimator to survival data.

Parameters:
time : one-dimensional array-like or str or SurvivalData

The observed times, or all the survival data. If this is a survive.SurvivalData instance, then it is used to fit the estimator and any other parameters are ignored. Otherwise, time and the keyword arguments in kwargs are used to initialize a survive.SurvivalData object on which this estimator is fitted.

**kwargs : keyword arguments

Any additional keyword arguments used to initialize a survive.SurvivalData instance.

Returns:
survive.nonparametric.KaplanMeier

This estimator.

See also

survive.SurvivalData
Structure used to store survival data.
n_boot

Number of bootstrap samples to draw when var_type is “bootstrap”. Not used for any other values of var_type.

plot(*groups, ci=True, ci_style='fill', ci_kwargs=None, mark_censor=True, mark_censor_kwargs=None, legend=True, legend_kwargs=None, colors=None, palette=None, ax=None, **kwargs)

Plot the estimates.

Parameters:
*groups : list of group labels

Specify the groups whose curves should be plotted. If none are given, the curves for all groups are plotted.

ci : bool, optional

If True, draw pointwise confidence intervals.

ci_style : {“fill”, “lines”}, optional

Specify how to draw the confidence intervals. If ci_style is “fill”, the region between the lower and upper confidence interval curves will be filled. If ci_style is “lines”, only the lower and upper curves will be drawn (this is inspired by the style of confidence intervals drawn by plot.survfit in the R package survival).

ci_kwargs : dict, optional

Additional keyword parameters to pass to fill_between() (if ci_style is “fill”) or step() (if ci_style is “lines”) when plotting the pointwise confidence intervals.

mark_censor : bool, optional

If True, indicate the censored times by markers on the plot.

mark_censor_kwargs : dict, optional

Additional keyword parameters to pass to scatter() when marking censored times.

legend : bool, optional

Indicates whether to display a legend for the plot.

legend_kwargs : dict, optional

Keyword parameters to pass to legend().

colors : list or tuple or dict or str, optional

Colors for each group. This is ignored if palette is provided. This can be a sequence of valid matplotlib colors to cycle through, or a dictionary mapping group labels to matplotlib colors, or the name of a matplotlib colormap.

palette : str, optional

Name of a seaborn color palette. Requires seaborn to be installed. Setting a color palette overrides the colors parameter.

ax : matplotlib.axes.Axes, optional

The axes on which to plot. If this is not specified, the current axes will be used.

**kwargs : keyword arguments

Additional keyword arguments to pass to step() when plotting the estimates.

Returns:
matplotlib.axes.Axes

The Axes on which the plot was drawn.

predict(time, *, return_se=False, return_ci=False)

Compute estimates.

Parameters:
time : array-like

One-dimensional array of times at which to make estimates.

return_se : bool, optional

If True, also return standard error estimates.

return_ci : bool, optional

If True, also return confidence intervals.

Returns:
estimate : pandas.DataFrame

DataFrame of estimates. Each column represents a group, and each row represents an entry of time.

std_err : pandas.DataFrame, optional

Standard errors of the estimates. Same shape as estimate. Returned only if return_se is True.

lower : pandas.DataFrame, optional

Lower confidence interval bounds. Same shape as estimate. Returned only if return_ci is True.

upper : pandas.DataFrame, optional

Upper confidence interval bounds. Same shape as estimate. Returned only if return_ci is True.

quantile(prob, *, return_ci=False)

Empirical quantile estimates for the time-to-event distribution.

Parameters:
prob : array-like

One-dimensional array of values between 0 and 1 representing the probability levels of the desired quantiles.

return_ci : bool, optional

Specify whether to return confidence intervals for the quantile estimates.

Returns:
quantiles : pandas.DataFrame

The quantile estimates. Rows are indexed by the entries of prob and columns are indexed by the model’s group labels. Entries for probability levels for which the quantile estimate is not defined are nan (not a number).

lower : pandas.DataFrame, optional

Lower confidence interval bounds for the quantile estimates. Returned only if return_ci is True. Same shape as quantiles.

upper : pandas.DataFrame, optional

Upper confidence interval bounds for the quantile estimates. Returned only if return_ci is True. Same shape as quantiles.

Notes

For a probability level \(p\) between 0 and 1, the empirical \(p\)-quantile of the time-to-event distribution with estimated survival function \(\widehat{S}(t)\) is defined to be the time at which the horizontal line at height \(1-p\) intersects with the estimated survival curve. If such a time is not unique, then instead there is a time interval on which the estimated survival curve is flat and coincides with the horizontal line at height \(1-p\). In this case the midpoint of this interval is taken to be the empirical \(p\)-quantile estimate (this is just one of many possible conventions, and the one used by the R package survival [1]). If the survival function estimate never gets as low as \(1-p\), then the \(p\)-quantile cannot be estimated.

The confidence intervals computed here are based on finding the times at which the horizontal line at height \(1-p\) intersects the upper and lower confidence limits for \(\widehat{S}(t)\). This mimics the implementation in the R package survival [1], which is based on the confidence interval construction in [2].
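The midpoint convention for point estimates described above can be sketched as follows. The function km_quantile and its inputs are hypothetical: it takes the increasing distinct event times and the estimated survival values at those times, and assumes \(0 < p < 1\). How to treat a flat stretch that coincides with the line after the last event time is not specified above; the sketch simply returns that last event time.

```python
import math


def km_quantile(event_times, surv, p):
    """Empirical p-quantile from a step survival curve.

    event_times : increasing distinct event times T_1 < T_2 < ...
    surv        : surv[j] is the estimate on [T_j, T_{j+1})
    """
    level = 1.0 - p
    for j, (tj, sj) in enumerate(zip(event_times, surv)):
        if math.isclose(sj, level, abs_tol=1e-12):
            # The curve is flat at height 1 - p on [T_j, T_{j+1}):
            # take the midpoint of that interval.
            if j + 1 < len(event_times):
                return 0.5 * (tj + event_times[j + 1])
            return tj  # assumption: no later event time to average with
        if sj < level:
            # The curve jumps past height 1 - p at T_j.
            return tj
    # The curve never gets as low as 1 - p.
    return math.nan


# Hypothetical step curve.
times = [1, 2, 3]
surv = [0.75, 0.5, 0.25]
print(km_quantile(times, surv, 0.5))
print(km_quantile(times, surv, 0.4))
print(km_quantile(times, surv, 0.9))
```

Here the median falls on a flat stretch at height 0.5 between times 2 and 3, so the midpoint 2.5 is reported; the 0.4-quantile is 2, and the 0.9-quantile is nan because the curve never drops to 0.1.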

References

[1] Terry M. Therneau. A Package for Survival Analysis in S. Version 2.38 (2015). CRAN.
[2] Ron Brookmeyer and John Crowley. “A Confidence Interval for the Median Survival Time.” Biometrics. Volume 38, Number 1 (1982), pp. 29–41. DOI.
random_state

Seed for this model’s random number generator. This is not necessarily a numpy.random.RandomState instance. The internal RNG is not a public attribute and should not be used directly.

Returns:
random_state : object

The seed for this model’s RNG.

summary

Get a summary of this estimator.

Returns:
summary : NonparametricEstimatorSummary

The summary of this estimator.

tie_break

How to handle tied event times.

to_string(max_line_length=75)

String representation of this model.

Parameters:
max_line_length : int, optional

Specifies the maximum length of a line. If None, everything will be on one line.

Returns:
model_string : str

A string representation of this model that should be usable to instantiate a new, identical model.

var_type

Type of variance estimate to compute.