In 2017, HIV/AIDS was responsible for the deaths of one million people globally, including 50,000 children less than one year old Global Burden of Disease Collaborative Network (2018a). Although mathematical modeling has provided important insights into the dynamics of HIV infection during anti-retroviral treatment (ART), there is still a lack of accessible tools for researchers unfamiliar with modeling techniques to apply them to their own datasets.
Here we present ushr
, an open-source R package that
models the decline of HIV during ART using a popular mathematical
framework. ushr
can be applied to longitudinal data of
viral load measurements, and automates all stages of the model fitting
process. By mathematically fitting the data, important biological
parameters can be estimated, including the lifespans of short and
long-lived infected cells, and the time to reach viral suppression below
a defined detection threshold. The package also provides visualization
and summary tools for fast assessment of model results.
More generally, ushr
enables researchers without a
strong mathematical or computational background to model the dynamics of
HIV using longitudinal clinical data. Increasing accessibility to such
methods may facilitate quantitative analysis across a wide range of
independent studies, so that greater insights on HIV infection and
treatment dynamics may be gained.
Citation information can be found using
citation("ushr")
; the package paper is open access and
available at BMC Bioinformatics.
If you encounter any bugs related to this package please contact the package author directly. Additional descriptions of the model and analyses performed by this package are available in Morris et al. (2020) BMC Bioinformatics. Further details on the mathematical theory can also be found in the references cited below.
HIV decline in a patient on ART is typically described using ordinary differential equations (ODEs) that characterize the production and spread of virus by infected target cells, such as CD4 T cells Nowak and May (2000). Assuming ART completely blocks viral replication, and that viral dynamics occur on a faster timescale than those of infected cells, one can obtain the following expression for the timecourse of viral load, V, during treatment
Here δ and γ are the death rates of short and long-lived infected target cells, respectively (Shet, Nagaraja, and Dixit 2016). The parameters A and B are composite constants without direct interpretation; however, A + B represents the initial viral load (i.e. V(t = 0)), and A/(A + B) can be understood as the proportion of infected cells at ART initiation that are short-lived.
Eqn. $\ref{biphasic}$ is referred to as the biphasic model. According to this, viral load initially decays rapidly, reflecting the loss of short-lived infected cells (at rate δ), and then enters a second, slower decline phase reflecting the loss of longer-lived infected cells (at rate γ). For patient data exhibiting only one decline phase (for example, due to sparse or delayed viral load measurements), one can use a single phase version of Eqn. $\ref{biphasic}$ given by
where there are no assumptions on whether decay reflects the fast or slow phase of virus suppression.
For each individual, the time to reach virologic suppression below a defined threshold (`time to suppression’ (TTS)) can be estimated using both parametric and non-parametric methods. For the parametric approach, TTS was calculated as the first time at which V(t) = x, where x is the suppression threshold, and V(t) is given by Eqn. $\ref{biphasic}$ for the biphasic model and Eqn. $\ref{singlephase}$ for the single phase model. For the non-parametric approach, we first apply linear interpolation between the first measurement below the detection threshold and the preceding measurement. TTS is then defined as the time at which the interpolation line crosses the suppression threshold.
Raw clinical data is often noisy and sparse, making it unsuitable for
mathematical analysis of viral decline, and eventual suppression, during
ART. Therefore, prior to any analysis, data must be processed to exclude
individual trajectories that cannot be appropriately modeled. In
ushr
, we only consider individuals who reach suppression
below a pre-defined threshold, within a particular timeframe (both
specified by the user). By default, suppression is defined as having at
least one viral load measurement below the detection threshold of the
measurements assay, d.
Alternatively, the user may define suppression as sustaining at least
two consecutive measurements below d. Following previous work, all
measurements below the detection threshold are set to d/2 (Wu et
al. 1999). To isolate the kinetics leading to initial
suppression, viral load trajectories are truncated after the first
measurement below d.
To distinguish ‘true’ decay dynamics from instances of viral rebound (due to factors such as drug resistance or poor treatment adherence), we only consider viral load data that maintain a consistent decreasing trend towards suppression, such that each measurement is within a pre-defined range of the previous measurement. This buffer range ensures that transient increases in viral load (arising from noise and measurement error) do not exclude subjects from the analysis. We also allow initial increases in viral load (for example, arising from pharmacological delays in drug action) by defining the beginning of each individual’s decreasing sequence as the maximum value from a pre-defined range of initial measurements.
Parameter estimates with 95% confidence intervals are obtained for
each subject by fitting either the biphasic or single phase model to the
corresponding viral load data using maximum likelihood optimization (as
described previously (Hogan et al. 2015)).
Data are log10-transformed
prior to fitting and optimization is performed using
optim()
. After fitting, we use the resulting parameter
estimates to calculate the lifespans of HIV-infected cells: 1/δ and 1/γ for short and long-lived
infected cells from the biphasic model, respectively, and 1/γ̂ for the single phase model.
To improve parameter identifiability, only subjects with a minimum number of measurements above the detection threshold are fit using the biphasic or single phase models. These can be specified by the user, but we recommend at least six observations for the biphasic model and three for the single phase model. Individuals with fewer measurements are not included in the model fitting procedure, although they are still included in non-parametric TTS calculations.
To illustrate basic usage of the package and allow users to explore functionality, we include a publicly available data set from the ACTG315 clinical trial. Briefly, the raw data consist of longitudinal HIV viral load measurements from 46 chronically-infected adults up to 28 weeks following ART initiation. The detection threshold was 100 copies/ml and observations are recorded as log10 RNA copies/ml. These data are available at https://sph.uth.edu/divisions/biostatistics/wu/datasets/ACTG315LongitudinalDataViralLoad.htm (date originally accessed: 15 September 2019), and have been described previously (Lederman et al. 1998; Wu and Ding 1999; Connick et al. 2000).
To begin, we load the package and print the first six rows to identify our columns of interest; these are the viral load observations (‘log.10.RNA.’), the timing of these observations (‘Day’), and the identifier for each subject (‘Patid’).
## Obs.No Patid Day log10.RNA. CD4
## 1 1 1 0 4.3617 221.76
## 2 2 1 2 4.3617 159.84
## 3 3 1 7 3.5315 210.60
## 4 4 1 16 2.9777 204.12
## 5 5 1 29 2.6435 172.48
## 6 6 1 57 2.1139 270.94
Since ushr
requires absolute viral load (VL)
measurements, and specific column names (‘vl’, ‘time’, ‘id’), we first
back-transform the log10
viral load measurements into absolute values, and rename the column
headings.
actg315 <- actg315raw %>%
mutate(vl = 10^log10.RNA.) %>%
select(id = Patid, time = Day, vl)
print(head(actg315))
## id time vl
## 1 1 0 22998.5259
## 2 1 2 22998.5259
## 3 1 7 3400.1651
## 4 1 16 949.9484
## 5 1 29 440.0479
## 6 1 57 129.9870
We can then visualize these data using the plot_data()
function. The detection_threshold
argument defines the
detection threshold of the measurement assay.
Each panel represents a different individual, the points are the viral load measurements, and the dashed horizontal line is the assay detection threshold. From this we can see that the data is indeed noisy, individuals have different numbers of available observations, and only a subset suppress viral load below the detection threshold.
To fit the model to these data using just one line of code we call
the ushr()
function. This processes the data to filter out
any individuals who do not meet the inclusion criteria defined above,
and then fits either the single or biphasic model to each remaining
trajectory, depending on the number of available observations (see the
Background for more details). Note that the data processing step can be
omitted using the filter = FALSE
argument (default is
TRUE); however this is not recommended unless rigorous processing
efforts have already been made. Note also that the
censor_value
argument specifies how measurements below the
detection threshold should be treated (here we set them to half of the
detection theshold in line with previous work (Wu
et al. 1999)).
With the fitted model output, we can then plot both the biphasic and single phase fits as follows
Again, each panel represents a different individual, points are the
original data, and solid lines are the corresponding best-fit model. We
can see that twelve subjects were successfully fit with the biphasic
model, and four with the single phase model. Although some single phase
subjects had sufficient data to fit the biphasic model (i.e. at least
six observations), the resulting 95% parameter confidence intervals were
either unattainable or sufficiently wide to indicate an unreliable fit.
This can occur, for example, when one of the decay phases is poorly
documented (i.e. has few data points). As a result, the subjects were
re-fit with the single phase model. This re-fitting step is automated in
the package; however, the user can control the size of confidence
interval above which a biphasic fit is deemed unreliable using the
argument CI_max_diff
in ushr()
.
We can also visualize a summary of the fitting procedure and
parameter estimates using summarize_model()
. This creates a
list with the following elements: (i) a summary of which subjects were
successfully fit using the biphasic or single phase models, with their
corresponding infected cell lifespan estimates (summary
);
(ii) summary statistics for the estimated parameters from the biphasic
model (biphasicstats
); and (iii) summary statistics for the
estimated parameters from the single phase model
(singlestats
).
actg315_summary <- summarize_model(model_output, data = actg315, stats = TRUE)
head(actg315_summary$summary)
## id Included Model ShortLifespan LongLifespan SingleLifespan
## 1 1 Yes Biphasic 3.25 28.08
## 2 2 No
## 3 3 No
## 4 4 Yes Single phase 14.3
## 5 5 Yes Single phase 4.22
## 6 6 No
## # A tibble: 6 × 4
## Param Median SD Model
## <chr> <dbl> <dbl> <chr>
## 1 A 135000 367000 Biphasic
## 2 B 1890 6540 Biphasic
## 3 LongLifespan 25.6 11.9 Biphasic
## 4 ShortLifespan 2.08 0.834 Biphasic
## 5 delta 0.482 0.185 Biphasic
## 6 gamma 0.0391 0.0227 Biphasic
## # A tibble: 3 × 4
## Param Median SD Model
## <chr> <dbl> <dbl> <chr>
## 1 Bhat 26900 23300 Single phase
## 2 SingleLifespan 9.1 5.76 Single phase
## 3 gammahat 0.154 0.098 Single phase
For a better understanding of parameter identifiability, one can also print the parameter estimates for each individual and model, along with their corresponding 95% confidence intervals.
## # A tibble: 6 × 5
## id param estimate lowerCI upperCI
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 A 27545. 19879. 38167.
## 2 1 delta 0.308 0.228 0.414
## 3 1 B 1174. 701. 1966.
## 4 1 gamma 0.0356 0.0283 0.0448
## 5 13 A 385392. 300226. 494717.
## 6 13 delta 0.512 0.455 0.575
## id param estimate lowerCI upperCI
## 1 4 Bhat 1.621519e+03 5.480429e+02 4.797663e+03
## 2 4 gammahat 6.993061e-02 3.909354e-02 1.250920e-01
## 3 5 Bhat 4.223169e+04 1.962813e+04 9.086529e+04
## 4 5 gammahat 2.368478e-01 1.952923e-01 2.872457e-01
## 5 31 Bhat 1.158218e+04 8.791109e+03 1.525937e+04
## 6 31 gammahat 7.151531e-02 6.466642e-02 7.908956e-02
In addition to fitting the biphasic and single phase models, we can
calculate the time to viral suppression (TTS) using both the parametric
and non-parametric methods (see the Background for more details). Here
we set the suppression threshold to be the same as the detection
threshold (i.e. we want to know when viral load drops below the
detection threshold of the assay). First, to get parametric estimates
from the fitted model output, we use get_TTS()
with the
argument parametric = TRUE
. We can subsequently obtain
median and SD statistics, and the total number of subjects included in
the analysis, using the summarize()
function from
dplyr
.
TTSparametric <- get_TTS(model_output = model_output, parametric = TRUE,
suppression_threshold = 100)
head(TTSparametric)
## # A tibble: 6 × 4
## id TTS model calculation
## <dbl> <dbl> <chr> <chr>
## 1 1 69.2 biphasic parametric
## 2 13 48.7 biphasic parametric
## 3 17 73.1 biphasic parametric
## 4 19 65.0 biphasic parametric
## 5 26 122. biphasic parametric
## 6 27 69.8 biphasic parametric
## # A tibble: 1 × 3
## median SD N
## <dbl> <dbl> <int>
## 1 65.7 31.0 16
Alternatively, to calculate non-parametric TTS estimates, we set the
argument parametric = FALSE
, and supply the original data
using data = actg315
, rather than the fitted model output.
The estimates are similar to those for the parametric method but, given
the less stringent conditions for inclusion in the non-parametric
analysis (there is no minimum requirement on the number of
observations), we are able to estimate TTS for more subjects.
TTSnonparametric <- get_TTS(data = actg315, parametric = FALSE,
suppression_threshold = 100, censor_value = 50)
head(TTSnonparametric)
## # A tibble: 6 × 3
## id TTS calculation
## <dbl> <dbl> <chr>
## 1 1 69.8 non-parametric
## 2 2 6.89 non-parametric
## 3 4 43.3 non-parametric
## 4 5 27.7 non-parametric
## 5 13 54.0 non-parametric
## 6 17 77.9 non-parametric
## # A tibble: 1 × 3
## median SD N
## <dbl> <dbl> <int>
## 1 69.8 37.9 17
We can also plot the histograms for both methods using
plot_TTS()
.
ushr
provides additional functionality to the examples
documented here. Notable examples are:
ushr_triphasic()
(see
?ushr_triphasic()
); this may be more appropriate than the
biphasic model (Cardozo et al. 2017).
Results can be visualized using the same plotting/summary functions as
above.simulate_data()
function.Further details of all functions and user-specific customizations can be found in the documentation.