GPPM for disease progression modeling: basic use#

This tutorial introduces the use of the software GPMM available at https://gitlab.inria.fr/epione/GP_progression_model_V2.git.

Preparation#

We start by downloading and installing the package.

!pip install git+https://gitlab.inria.fr/epione/GP_progression_model_V2.git@fix_prediction
Collecting git+https://gitlab.inria.fr/epione/GP_progression_model_V2.git@fix_prediction
  Cloning https://gitlab.inria.fr/epione/GP_progression_model_V2.git (to revision fix_prediction) to /tmp/pip-req-build-u4p6a4ky
  Running command git clone -q https://gitlab.inria.fr/epione/GP_progression_model_V2.git /tmp/pip-req-build-u4p6a4ky
  Running command git checkout -b fix_prediction --track origin/fix_prediction
  Switched to a new branch 'fix_prediction'
  Branch 'fix_prediction' set up to track remote branch 'fix_prediction' from 'origin'.
Building wheels for collected packages: gppm
  Building wheel for gppm (setup.py) ... ?25l?25hdone
  Created wheel for gppm: filename=gppm-2.0.0-cp37-none-any.whl size=15757 sha256=f272c7e528112ed3711a1cf8a8b977c983851f88fa957502769635b8e1013434
  Stored in directory: /tmp/pip-ephem-wheel-cache-1_upv448/wheels/7b/4f/99/ae6da29607cff6cb811b4f4ca93aeabc1678a71e582534c93f
Successfully built gppm
Installing collected packages: gppm
Successfully installed gppm-2.0.0

Once GPPM is installed, we can import what is needed to run the tutorial

import GP_progression_model
# Setup
import matplotlib.pyplot as plt
import numpy as np
import os
#%matplotlib inline
## workaround for OS X
from sys import platform as sys_pf
if sys_pf == 'darwin':
    matplotlib.use("TkAgg")
import GP_progression_model
from GP_progression_model import DataGenerator
import torch
import pandas as pd
import torch

Illustration on a synthetic case#

To understand the principle behind GPPM, let’s generate some data that will be later used to fit the model. The aim of GPPM is to reconstruct long-term temporal trajectories from the analysis of collections of short-terms observations. To get an idea of the principle, the following piece of code can be used to generate some synthetic trajectories for Nbiom biomarkers, and to sample idnvidual measurements from them.

os.mkdir('synth')

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.fastest = True
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Parameters for progression curves and data generation
# Max value of the trajectories 
# (we assume that they range from 0, healthy, to L, pathological)
L = 1
# time interval for the trajectories
interval = [-15,15]

# Number of biomarkers
Nbiom = 4
# Number of individuals
Nsubs = 30
# Gaussian observational noise
noise = 0.1

# We set the seed=111 as a reference for the further analysis
seed = 111
np.random.seed(seed)

# Creating random parameter sets for each biomarker's progression
flag = 0
while (flag!=1):
    CurveParam = []
    for i in range(Nbiom):
        CurveParam.append([L,1*(0.5+np.random.rand()),noise])
        if CurveParam[i][1] > 0.0:
            flag = 1
# Calling the data generator
dg = DataGenerator.DataGenerator(Nbiom, interval, CurveParam, Nsubs, seed)

Here are the generated biomarkers trajectories (continuous lines) that we generated, along with observations from the different subjects (dashed segments). Each subject’s observation is assumed to be a noisy realisation from the original trajectory, evaluated over few samples. The number of time points available per subject is assigned randomly in the DataGenerator routine.

dg.plot('long')
../../../_images/e31b1b74995e9ca13ac669410db9640b3a3e9ff7cc84d92444aaaff8f6f49c62.png

The difference in shape between biomarkers trajectories is due to the different in speed (the slope) and in the timing (the moment at which the trajectory starts increasing). Here is the timing associated with each biomarker:

# The time_shift indicates the moment at which each biomarker becomes abnormal
ground_truth_ts = dg.time_shift
dict_gt_ts = {}
for biom in range(Nbiom):
  dict_gt_ts['biom_'+ str(biom)] = ground_truth_ts[biom]

pd.DataFrame(dict_gt_ts, index=['change time'])
biom_0 biom_1 biom_2 biom_3
change time 4.0 7.0 14.0 8.0

We note that the change time associated to biomarker 2 is the furthest in time (at t=14), which means that the trajectory of this biomarker is constant along the entire observation interval.

We are ready not to generate our real observations, by resetting the individual observations above to the same origin. In this case, the time information is zero-centered for each subject:

dg.plot('short')
../../../_images/a7d250d1ba4bdff41440b0347809880218946b829251ac0d2aff4b25f31cc1da.png

This kind of data is what is generally called spaghetti plot. The time axis is not related anymore directly to the original progression, but instead to an arbitrary unit (which may correspond for instance to a visit time). The goal of GPPM is to retrieve the long-term progressions from the spaghetti plot.

Let’s extract our synthetic data as a pandas DataFrame

df  = dg.get_df()

The data frame has the following jey fields:

  • RID: the identifier for each individual,

  • time: the zero-centered time (e.g. visit time)

  • time_gt: the ground thruth time corresponding to position with respect to the long-term original synthetic trajectories

  • biom_*: the measurements for each biomarker and time point

We note that the structure of this data frame is very similar to the one of typical clinical and research datasets (e.g. ADNI).

df
RID time biom_0 biom_1 biom_2 biom_3 time_gt
0 0.0 -1.0 0.979932 0.732228 0.011024 0.412049 8.0
1 0.0 1.0 1.138082 0.790095 -0.039992 0.836982 10.0
2 1.0 0.0 0.991504 0.928058 0.122740 1.042216 11.0
3 2.0 0.0 0.292893 0.020794 -0.023364 0.043239 3.0
4 3.0 0.0 1.012792 0.604817 -0.092707 0.403241 8.0
... ... ... ... ... ... ... ...
61 28.0 0.0 0.371390 0.165749 -0.118679 -0.054633 4.0
62 29.0 -2.0 0.062574 -0.041898 -0.056814 -0.005067 0.0
63 29.0 0.0 0.182433 0.080903 0.027772 -0.043215 2.0
64 29.0 1.0 0.256241 0.005242 -0.016863 -0.064809 3.0
65 29.0 2.0 0.544529 0.041862 0.040187 0.032076 4.0

66 rows × 7 columns

GPPM in action#

We are ready to apply GPPM to solve ou estimation problem:

# since the biomarkers are increasing from normal to abnormal states 
# (resp. from 0 to L) we set the monotonicity constraint to 1 for each biomarker
monotonicity = np.repeat(1,Nbiom)

# the input is the generated synthetic data frame
input_data = df.drop(columns=['time_gt'])

# we convert the input data to the pieces that will be fed to GPPM  
Xdata, Ydata, RID, list_biomarkers, group = GP_progression_model.convert_from_df(input_data,
                                                                 ['biom_' + str(i) for i in range(Nbiom)], time_var = 'time')

# here we call GPPM with the appropriate input
model = GP_progression_model.GP_Progression_Model(Xdata,Ydata, monotonicity = monotonicity, trade_off = 50,
                                                  names_biomarkers=['biom_' + str(i) for i in range(Nbiom)] )

Once created our GPPM object, we are ready to optimize. The optimization of GPPM iterates over two steps:

  • Estimation of the biomarkers trajectories. This is a regression problem, and is solved in GPPM by fitting a Gaussian Process (GP) mixed effect model to the observations. Gaussian processes are a fabulous family of non-parametric functions that can be used to solve a large variety of machine learning problems. In the case of GPPM, we impose that the trajectory must be monotonic over time, to describe steady evolutions from normal to pathological states.

  • Estimation of the subject-specific time reparameterization function. This step idetifies the most likely instant during the pathological evolution at which the individual has been observed, by optimizing the timing of the measurements with respect to the trajectories estimated above. The current version of the model support the estimation of both time-shift and scaling parameters.

Let’s run the optimization and observe how trajectories and individual parameters evolve over the iterations.

model.Optimize(plot = True, verbose = True, benchmark=True)
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Optimization step: 1 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 15.51 - Cost (fit): 355.66 - Cost (constr): 170.16|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 50 of 200 || Cost (DKL): 15.30 - Cost (fit): 238.45 - Cost (constr): 150.13|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 100 of 200 || Cost (DKL): 16.57 - Cost (fit): 205.81 - Cost (constr): 77.51|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 150 of 200 || Cost (DKL): 15.42 - Cost (fit): 226.96 - Cost (constr): 44.91|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 200 of 200 || Cost (DKL): 14.17 - Cost (fit): 184.17 - Cost (constr): 16.38|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/5ce916ecd019d18f30fda8ee0cf974c1142d656b4dca2772c70e464fa702978f.png
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 14.17 - Cost (fit): 240.22 - Cost (constr): 61.96|| Batch (each iter) of size 30 || Time (each iter): 0.04s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 14.17 - Cost (fit): 163.64 - Cost (constr): 16.38|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 100 of 200 || Cost (DKL): 14.17 - Cost (fit): 176.37 - Cost (constr): 41.22|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 150 of 200 || Cost (DKL): 14.17 - Cost (fit): 164.56 - Cost (constr): 20.46|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 200 of 200 || Cost (DKL): 14.17 - Cost (fit): 155.17 - Cost (constr): 16.51|| Batch (each iter) of size 30 || Time (each iter): 0.04s
../../../_images/6e7751ff7e54850ca1a77d00fe4dd37d3f4fb17688e7a96c0607727ee42cddb8.png
Optimization step: 2 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 14.15 - Cost (fit): 204.55 - Cost (constr): 36.75|| Batch (each iter) of size 30 || Time (each iter): 0.04s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 15.09 - Cost (fit): 154.31 - Cost (constr): 17.32|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 100 of 200 || Cost (DKL): 15.66 - Cost (fit): 103.17 - Cost (constr): 30.12|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 150 of 200 || Cost (DKL): 15.17 - Cost (fit): 112.74 - Cost (constr): 20.25|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 200 of 200 || Cost (DKL): 15.51 - Cost (fit): 110.74 - Cost (constr): 16.42|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/b4aee7d6045d4c506afcd36c4a923ba4c3aaa41165651c34c08512ebf9120e19.png
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 15.51 - Cost (fit): 66.27 - Cost (constr): 16.45|| Batch (each iter) of size 30 || Time (each iter): 0.05s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 15.51 - Cost (fit): 87.60 - Cost (constr): 20.36|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 100 of 200 || Cost (DKL): 15.51 - Cost (fit): 58.95 - Cost (constr): 16.38|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 150 of 200 || Cost (DKL): 15.51 - Cost (fit): 54.43 - Cost (constr): 16.38|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 200 of 200 || Cost (DKL): 15.51 - Cost (fit): 62.00 - Cost (constr): 21.28|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/561b02f723aba76afb8bb8f390a36c4097ade9754cff5924237debf0834a1596.png
Optimization step: 3 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 15.50 - Cost (fit): 47.49 - Cost (constr): 28.55|| Batch (each iter) of size 30 || Time (each iter): 0.04s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 15.61 - Cost (fit): 21.12 - Cost (constr): 16.38|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 100 of 200 || Cost (DKL): 17.46 - Cost (fit): 0.83 - Cost (constr): 17.02|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 150 of 200 || Cost (DKL): 19.71 - Cost (fit): 8.68 - Cost (constr): 22.75|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 200 of 200 || Cost (DKL): 21.50 - Cost (fit): -26.30 - Cost (constr): 23.18|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/e4fa6669dfbddd4e5ef9c3a320ca7df88bba039b29b5797580a120b4a3201c7b.png
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 21.50 - Cost (fit): 13.32 - Cost (constr): 17.05|| Batch (each iter) of size 30 || Time (each iter): 0.06s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 21.50 - Cost (fit): -10.16 - Cost (constr): 26.20|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 100 of 200 || Cost (DKL): 21.50 - Cost (fit): -39.39 - Cost (constr): 23.83|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 150 of 200 || Cost (DKL): 21.50 - Cost (fit): -14.59 - Cost (constr): 21.50|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 200 of 200 || Cost (DKL): 21.50 - Cost (fit): -48.67 - Cost (constr): 16.59|| Batch (each iter) of size 30 || Time (each iter): 0.04s
../../../_images/e7b4166de5908aadec1a334844cda3b7c900e1a74f78055f5e371e5a1cf679d1.png
Optimization step: 4 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 21.52 - Cost (fit): -17.67 - Cost (constr): 16.49|| Batch (each iter) of size 30 || Time (each iter): 0.05s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 23.31 - Cost (fit): -38.82 - Cost (constr): 20.30|| Batch (each iter) of size 30 || Time (each iter): 0.05s
Iteration 100 of 200 || Cost (DKL): 26.56 - Cost (fit): -51.46 - Cost (constr): 16.40|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 150 of 200 || Cost (DKL): 26.74 - Cost (fit): -53.30 - Cost (constr): 23.91|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 200 of 200 || Cost (DKL): 30.27 - Cost (fit): -52.75 - Cost (constr): 17.37|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/4d7d93b7ea83b933381f382fe7f32be284ac86f6110ae472b6620f45d0798adc.png
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 30.27 - Cost (fit): -79.87 - Cost (constr): 22.59|| Batch (each iter) of size 30 || Time (each iter): 0.04s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 30.27 - Cost (fit): -85.82 - Cost (constr): 23.56|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 100 of 200 || Cost (DKL): 30.27 - Cost (fit): -88.86 - Cost (constr): 16.39|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 150 of 200 || Cost (DKL): 30.27 - Cost (fit): -84.62 - Cost (constr): 18.26|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 200 of 200 || Cost (DKL): 30.27 - Cost (fit): -70.81 - Cost (constr): 18.83|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/d5eca4dce1a1b88ba11f28b45fdfc15e4fe51433bcf31851df852df926b54aba.png
Optimization step: 5 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 30.23 - Cost (fit): -51.87 - Cost (constr): 18.02|| Batch (each iter) of size 30 || Time (each iter): 0.05s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 32.13 - Cost (fit): -63.44 - Cost (constr): 18.78|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 100 of 200 || Cost (DKL): 33.72 - Cost (fit): -65.22 - Cost (constr): 17.61|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 150 of 200 || Cost (DKL): 34.77 - Cost (fit): -41.57 - Cost (constr): 20.51|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 200 of 200 || Cost (DKL): 36.42 - Cost (fit): -110.31 - Cost (constr): 21.96|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/df8d6cd34fa95de932a060b720c2361c9bcd9fda994f4490c6020d850c66c7a3.png
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 36.42 - Cost (fit): -112.59 - Cost (constr): 17.11|| Batch (each iter) of size 30 || Time (each iter): 0.03s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 36.42 - Cost (fit): -98.19 - Cost (constr): 26.84|| Batch (each iter) of size 30 || Time (each iter): 0.04s
Iteration 100 of 200 || Cost (DKL): 36.42 - Cost (fit): -125.11 - Cost (constr): 28.69|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 150 of 200 || Cost (DKL): 36.42 - Cost (fit): -111.69 - Cost (constr): 16.38|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 200 of 200 || Cost (DKL): 36.42 - Cost (fit): -78.53 - Cost (constr): 17.14|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/d2e91146a7873e9e81a64b0b019a426c9164c0178f0392841e76e359815a30f6.png
Optimization step: 6 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 36.50 - Cost (fit): -110.81 - Cost (constr): 16.67|| Batch (each iter) of size 30 || Time (each iter): 0.04s
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Iteration 50 of 200 || Cost (DKL): 38.18 - Cost (fit): -105.39 - Cost (constr): 23.05|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 100 of 200 || Cost (DKL): 39.09 - Cost (fit): -115.45 - Cost (constr): 23.37|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 150 of 200 || Cost (DKL): 40.47 - Cost (fit): -128.37 - Cost (constr): 22.59|| Batch (each iter) of size 30 || Time (each iter): 0.03s
Iteration 200 of 200 || Cost (DKL): 41.02 - Cost (fit): -137.26 - Cost (constr): 33.65|| Batch (each iter) of size 30 || Time (each iter): 0.03s
../../../_images/9326dd26db9e50d2a9f6ef226e739c674f03971066fbf77dc1f119d3489e2b1c.png

Here we go! The estimated trajectories can be observed with the command model.Plot()

model.Plot()
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
../../../_images/431c79dabe0e6f31ccc70c34692c0534ccd6e765981e1922a41339fa74aa65d0.png

GPPM an produce a variety of figures that can be used to interpret the results and extract a lot of useful information about data and trajectories

#This command saves the fitted biomarkers trajectories in separate files
model.Plot(save_fig = './synth/')
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")

Let’s read the images one by one

plt.figure(figsize=(10,10))
for i in range(Nbiom):
  plt.subplot(2,2,i+1)
  fig_biom = plt.imread('./synth/biom_'+str(i)+'.png')
  plt.imshow(fig_biom)
../../../_images/bf1404f481af2266bb7f9dff2b80db647e71a67bd643ee50a7bbf224cf4f26bf.png

We note that the estimated uncertainty for biomarker 2 is large, and that the trajectory is mostly flat.

By plotting with the option join=True we can extract more information on the fitted trajectories:

# with the option 
model.Plot(save_fig = './synth/', joint = True)
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")

A key information for understanding the relationship between biomarkers is the estimated time of maximum change. This is the moment at which the speed of the estimated trajectory is maximum, and indicates the relative positioning (ranking) of the biomarkers when varying from normal to pathological stages.

plt.figure(figsize=(10,6))
plt.subplot(1,2,1)
timing = plt.imread('./synth/change_timing.png')
plt.imshow(timing, aspect='auto')
plt.title('Estimated trajectories timing')
plt.subplot(1,2,2)
dg.plot('long')
../../../_images/01b85c859d1acf22825bfde75a97baed292ccd60680d05232d78c1b0727a3a5c.png

We observe that the order of the four biomarkers across the estimated time-axis correponds to the ground thruth change time generated for the synthetic data. The most interesting observation is that biomarker 2 is associated with very large uncertainty, since it is flat. In this case there is not a well defined transition time, and this leads to high uncertainty of the model: based on the data, it is not possible to establish when this biomarker is going to transition from normal to pathological values.

We can also quantify the magnitude of the trajectories of changes of each biomarker:

plt.figure(figsize=(10,6))
plt.subplot(1,2,1)
magnitude = plt.imread('./synth/change_magnitude.png')
plt.imshow(magnitude, aspect = 'auto')
plt.title('Estimated trajectories magnitude')
plt.subplot(1,2,2)
dg.plot('long')
../../../_images/9065161351497cfc48b4c36ae0e355ac59f73a3d8edd0c9de3c08b696344d7c1.png

We can see that biomarker 2 is associated with the lowest amount of change across the estimated long-term progression. Biomarker 0 and 3 are the one changing the most, and this is in agreement with the synthetic trajectories originally generated.

Given the model, it is possible to estimate the time-shift (or disease severity) associated with each individual, based on the available set of biomarkers. This can be simply done with the following command:

pred_time_shift = model.PredictOptimumTime(input_data, id_var='RID', time_var='time')
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
pred_time_shift
time Time Shift
RID
0.0 -1.0 1.530266
1.0 0.0 4.387256
2.0 0.0 -3.112342
3.0 0.0 0.816019
4.0 -1.0 -2.755218
5.0 0.0 -0.969600
6.0 0.0 -0.255352
7.0 -1.0 0.816019
8.0 -2.0 0.101771
9.0 -2.0 2.958761
10.0 -2.0 -2.398095
11.0 0.0 1.887390
12.0 -1.5 -2.398095
13.0 -0.5 0.458895
14.0 -1.0 -2.398095
15.0 -1.5 0.458895
16.0 -1.0 3.673009
17.0 -0.5 -1.683847
18.0 -2.0 -2.755218
19.0 -1.5 3.315885
20.0 -0.5 3.673009
21.0 -1.0 1.173143
22.0 -1.5 3.315885
23.0 -1.0 -4.897961
24.0 0.0 4.387256
25.0 0.0 -2.755218
26.0 -0.5 -0.969600
27.0 0.0 -0.255352
28.0 0.0 -2.398095
29.0 -2.0 -3.826590

We can plot the predicted individual time-shift against the ground truth time-shift generated for each individual.

ground_truth = dg.OutputTimeShift()

plt.scatter(pred_time_shift['Time Shift'],ground_truth) 
plt.xlabel('estimated time-shift')
plt.ylabel('ground truth disease stage')
Text(0, 0.5, 'ground truth disease stage')
../../../_images/dcca069f183689eef1a9912f34fdfd8042cdbc4e09b0b5e6995e8165405fc6f7.png

The estimated time-shift is in lage agreement with the original positioning of each individual in the time-axis of the long-term trajectories.

Modeling of longitudinal biomarkers form the (pseudo-)ADNI dataset#

This last section illustrates the use of GPPM on real data. We are going to work with a pseudo-version of ADNI. The dataset consists of biomarkers collected over time for 171 subjects. 82 individuals are healhty (NL), 752 are affected by mild cognitive impairment (MCI), and 82 are patients with Alzheimer’s disease (AD). The measurements were created based on the biomarker’s measures observed in the original ADNI dataset.


os.mkdir('real')

pseudo_adni = pd.read_csv('http://marcolorenzi.github.io/material/pseudo_adni.csv')

print(pseudo_adni.columns)

diags = { 'NL' : np.sum(pseudo_adni['group']==1), 
          'MCI' : np.sum(pseudo_adni['group']==2), 
          'AD' : np.sum(pseudo_adni['group']==3), }
diags
Index(['RID', 'Time', 'Hippocampus', 'Ventricles', 'Entorhinal', 'WholeBrain',
       'ADAS11', 'FAQ', 'AV45', 'FDG', 'group'],
      dtype='object')
{'AD': 320, 'MCI': 752, 'NL': 82}

GPPM in action#

As for the synthetic example, we first convert the original data frame into a GPPM input

biomarkers = ['Hippocampus', 'Ventricles', 'Entorhinal', 'WholeBrain',
       'ADAS11', 'FAQ', 'AV45', 'FDG']
Xdata, Ydata, RID, list_biomarkers, group = GP_progression_model.convert_from_df(pseudo_adni,
                                                                 biomarkers, time_var = 'Time')

In this case we need to specify a meaningful monotonicity constraint for each biomarkers. The monotonicity will reflect our knowledge on how each biomarker varies from normal to pathological values. For example, the brain volume decreases over time, and therefore the monotonicity constraint will be negative. On the contrary, the clinical score ADAS11 increases with more advanced pathological states, and therefore the monotonicity constraint will be positive.

We obtain:

dict_monotonicity = { 'Hippocampus' : -1, #decreasing
                     'Ventricles': 1, #increasing
                     'Entorhinal': -1, #decreasing
                     'WholeBrain': -1, #decreasing
                     'ADAS11': 1, #increasing
                     'FAQ': 1, #increasing
                     'AV45': 1, #increasing
                     'FDG': -1, #decreasing
                    }

We are ready to run and optimize GPPM on this dataset:

# trade off between data-fit and monotonicity
trade_off = 100

# create a GPPM object
model = GP_progression_model.GP_Progression_Model(Xdata,Ydata, names_biomarkers = biomarkers, monotonicity = [dict_monotonicity[k] for k in dict_monotonicity.keys()], trade_off = trade_off,
                                                  groups = group, group_names = ['NL','MCI','AD'], device = device)
model.model = model.model.to(device)

# Optimise the model
N_outer_iterations = 6
N_iterations = 200
model.Optimize(N_outer_iterations = N_outer_iterations, N_iterations = N_iterations, n_minibatch = 10, verbose = True, plot = False, benchmark = True)
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
Optimization step: 1 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 15.25 - Cost (fit): 319.07 - Cost (constr): 2140.06|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 50 of 200 || Cost (DKL): 15.11 - Cost (fit): 375.92 - Cost (constr): 875.30|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 100 of 200 || Cost (DKL): 17.36 - Cost (fit): 257.78 - Cost (constr): 466.63|| Batch (each iter) of size 9 || Time (each iter): 0.07s
Iteration 150 of 200 || Cost (DKL): 20.35 - Cost (fit): 226.63 - Cost (constr): 147.82|| Batch (each iter) of size 9 || Time (each iter): 0.07s
Iteration 200 of 200 || Cost (DKL): 22.76 - Cost (fit): 210.95 - Cost (constr): 116.81|| Batch (each iter) of size 9 || Time (each iter): 0.09s
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 22.76 - Cost (fit): 164.13 - Cost (constr): 134.60|| Batch (each iter) of size 9 || Time (each iter): 0.08s
Iteration 50 of 200 || Cost (DKL): 22.76 - Cost (fit): 213.85 - Cost (constr): 164.01|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 100 of 200 || Cost (DKL): 22.76 - Cost (fit): 229.10 - Cost (constr): 103.49|| Batch (each iter) of size 9 || Time (each iter): 0.10s
Iteration 150 of 200 || Cost (DKL): 22.76 - Cost (fit): 182.90 - Cost (constr): 188.14|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 200 of 200 || Cost (DKL): 22.76 - Cost (fit): 212.29 - Cost (constr): 107.95|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Optimization step: 2 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 22.79 - Cost (fit): 209.65 - Cost (constr): 200.35|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 50 of 200 || Cost (DKL): 23.85 - Cost (fit): 155.76 - Cost (constr): 136.65|| Batch (each iter) of size 9 || Time (each iter): 0.07s
Iteration 100 of 200 || Cost (DKL): 23.64 - Cost (fit): 136.93 - Cost (constr): 88.38|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 150 of 200 || Cost (DKL): 23.37 - Cost (fit): 130.74 - Cost (constr): 99.00|| Batch (each iter) of size 9 || Time (each iter): 0.07s
Iteration 200 of 200 || Cost (DKL): 23.29 - Cost (fit): 113.99 - Cost (constr): 93.18|| Batch (each iter) of size 9 || Time (each iter): 0.06s
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 23.29 - Cost (fit): 146.91 - Cost (constr): 67.48|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 50 of 200 || Cost (DKL): 23.29 - Cost (fit): 105.83 - Cost (constr): 58.19|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 100 of 200 || Cost (DKL): 23.29 - Cost (fit): 129.35 - Cost (constr): 44.85|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 150 of 200 || Cost (DKL): 23.29 - Cost (fit): 140.34 - Cost (constr): 67.39|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 200 of 200 || Cost (DKL): 23.29 - Cost (fit): 112.25 - Cost (constr): 71.30|| Batch (each iter) of size 9 || Time (each iter): 0.09s
Optimization step: 3 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 23.31 - Cost (fit): 182.45 - Cost (constr): 85.02|| Batch (each iter) of size 9 || Time (each iter): 0.10s
Iteration 50 of 200 || Cost (DKL): 23.43 - Cost (fit): 159.42 - Cost (constr): 68.73|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 100 of 200 || Cost (DKL): 22.30 - Cost (fit): 95.59 - Cost (constr): 49.70|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 150 of 200 || Cost (DKL): 21.20 - Cost (fit): 112.01 - Cost (constr): 50.06|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 200 of 200 || Cost (DKL): 20.72 - Cost (fit): 104.30 - Cost (constr): 53.42|| Batch (each iter) of size 9 || Time (each iter): 0.09s
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 20.72 - Cost (fit): 190.25 - Cost (constr): 72.73|| Batch (each iter) of size 9 || Time (each iter): 0.10s
Iteration 50 of 200 || Cost (DKL): 20.72 - Cost (fit): 87.13 - Cost (constr): 41.18|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 100 of 200 || Cost (DKL): 20.72 - Cost (fit): 102.05 - Cost (constr): 43.72|| Batch (each iter) of size 9 || Time (each iter): 0.07s
Iteration 150 of 200 || Cost (DKL): 20.72 - Cost (fit): 77.64 - Cost (constr): 54.73|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 200 of 200 || Cost (DKL): 20.72 - Cost (fit): 107.69 - Cost (constr): 42.10|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Optimization step: 4 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 20.71 - Cost (fit): 83.48 - Cost (constr): 57.58|| Batch (each iter) of size 9 || Time (each iter): 0.07s
Iteration 50 of 200 || Cost (DKL): 19.77 - Cost (fit): 157.81 - Cost (constr): 93.12|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 100 of 200 || Cost (DKL): 18.75 - Cost (fit): 127.68 - Cost (constr): 49.10|| Batch (each iter) of size 9 || Time (each iter): 0.08s
Iteration 150 of 200 || Cost (DKL): 18.15 - Cost (fit): 150.10 - Cost (constr): 41.18|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 200 of 200 || Cost (DKL): 17.73 - Cost (fit): 135.45 - Cost (constr): 47.16|| Batch (each iter) of size 9 || Time (each iter): 0.06s
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 17.73 - Cost (fit): 139.73 - Cost (constr): 60.69|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 50 of 200 || Cost (DKL): 17.73 - Cost (fit): 89.27 - Cost (constr): 53.93|| Batch (each iter) of size 9 || Time (each iter): 0.07s
Iteration 100 of 200 || Cost (DKL): 17.73 - Cost (fit): 108.91 - Cost (constr): 63.02|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 150 of 200 || Cost (DKL): 17.73 - Cost (fit): 104.92 - Cost (constr): 50.13|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Iteration 200 of 200 || Cost (DKL): 17.73 - Cost (fit): 121.70 - Cost (constr): 41.42|| Batch (each iter) of size 9 || Time (each iter): 0.06s
Optimization step: 5 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 17.75 - Cost (fit): 910.51 - Cost (constr): 41.22|| Batch (each iter) of size 86 || Time (each iter): 0.22s
Iteration 50 of 200 || Cost (DKL): 26.76 - Cost (fit): 618.36 - Cost (constr): 44.85|| Batch (each iter) of size 86 || Time (each iter): 0.22s
Iteration 100 of 200 || Cost (DKL): 30.98 - Cost (fit): 602.58 - Cost (constr): 54.54|| Batch (each iter) of size 86 || Time (each iter): 0.27s
Iteration 150 of 200 || Cost (DKL): 33.82 - Cost (fit): 613.27 - Cost (constr): 48.94|| Batch (each iter) of size 86 || Time (each iter): 0.21s
Iteration 200 of 200 || Cost (DKL): 37.01 - Cost (fit): 601.11 - Cost (constr): 45.43|| Batch (each iter) of size 86 || Time (each iter): 0.28s
 -- Time reparameterization --
Iteration 1 of 200 || Cost (DKL): 37.01 - Cost (fit): 536.88 - Cost (constr): 42.77|| Batch (each iter) of size 86 || Time (each iter): 0.35s
Iteration 50 of 200 || Cost (DKL): 37.01 - Cost (fit): 503.89 - Cost (constr): 41.24|| Batch (each iter) of size 86 || Time (each iter): 0.28s
Iteration 100 of 200 || Cost (DKL): 37.01 - Cost (fit): 606.19 - Cost (constr): 44.02|| Batch (each iter) of size 86 || Time (each iter): 0.21s
Iteration 150 of 200 || Cost (DKL): 37.01 - Cost (fit): 464.64 - Cost (constr): 59.21|| Batch (each iter) of size 86 || Time (each iter): 0.22s
Iteration 200 of 200 || Cost (DKL): 37.01 - Cost (fit): 505.09 - Cost (constr): 42.63|| Batch (each iter) of size 86 || Time (each iter): 0.21s
Optimization step: 6 out of 6
 -- Regression --
Iteration 1 of 200 || Cost (DKL): 37.12 - Cost (fit): 550.92 - Cost (constr): 50.04|| Batch (each iter) of size 86 || Time (each iter): 0.22s
Iteration 50 of 200 || Cost (DKL): 38.77 - Cost (fit): 459.65 - Cost (constr): 43.52|| Batch (each iter) of size 86 || Time (each iter): 0.22s
Iteration 100 of 200 || Cost (DKL): 41.05 - Cost (fit): 522.94 - Cost (constr): 43.02|| Batch (each iter) of size 86 || Time (each iter): 0.31s
Iteration 150 of 200 || Cost (DKL): 41.90 - Cost (fit): 427.11 - Cost (constr): 44.36|| Batch (each iter) of size 86 || Time (each iter): 0.22s
Iteration 200 of 200 || Cost (DKL): 42.45 - Cost (fit): 442.68 - Cost (constr): 56.11|| Batch (each iter) of size 86 || Time (each iter): 0.21s

And here is the output of our model: the biomarkers show different progression shapes, and different variability of the individual measurements.

model.Plot()
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
../../../_images/5c339a384d207dd088b89d959fbcc3474e2b061ad45e760431b94af41f04476a.png
model.Plot(save_fig = './real/')
model.Plot(save_fig = './real/', joint = True)

plt.figure(figsize=(6,6))
timing = plt.imread('./real/change_timing.png')
plt.imshow(timing, aspect='auto')
plt.tight_layout()
plt.title('Estimated trajectories timing')
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
Text(0.5, 1.0, 'Estimated trajectories timing')
../../../_images/05ad09d7a6ab8ec0ca2de25efab428607f6c896b26cb3caf29736bbf4344dfb5.png

The orderning of the biomarkers reflects the progression curves estimated above. The biomarkers showing the earliest changes are Hippocampus, whole brain and AV45, although the timing of this latter biomarker is more uncertain. The abnormality of entorhinal cortex and ventricles appears later. The changes of glucose metabolism are more spread towards the latest stages of the disease, while the clinical scores FAQ and ADAS11 are the last to reach abnormal values. This figure is quite in agreement with the known biomarker progression reported for Alzheimer’s disease.

plt.figure(figsize=(8,8))
timing = plt.imread('./real/change_magnitude.png')
plt.imshow(timing, aspect='auto')
plt.tight_layout()
plt.title('Estimated trajectories magnitude')
Text(0.5, 1.0, 'Estimated trajectories magnitude')
../../../_images/af85157c5c0a895746e39d4fbce163f62136c4e2931faf5074c8d766a2928b5f.png

We can finally predict the disease severity associated with each individual, and compare these estimates group-wise.

pred_time_shift = model.PredictOptimumTime(pseudo_adni, id_var='RID', time_var='Time')
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:474: UserWarning: non-inplace resize is deprecated
  warnings.warn("non-inplace resize is deprecated")
# This command creates another .csv file
pred_time_shift.to_csv('./real/predictions.csv')

diag_dict = dict([(1,'NL'),(2,'MCI'),(3,'AD')])
pred_time_shift.reset_index(inplace=True)

# This command creates another .png file
#optim_time = model.Diagnostic_predictions(predictions, verbose = True, group = [diag_dict[i] for i in group], save_fig = './real/')
pred_time_shift
RID Time Time Shift
0 0 0 47.603824
1 1 0 6.981236
2 2 0 13.751667
3 3 0 74.685547
4 4 0 122.078564
... ... ... ...
166 166 0 67.915117
167 167 0 -67.493503
168 168 0 54.374256
169 169 0 54.374256
170 170 0 61.144688

171 rows × 3 columns

rid_group = pseudo_adni.loc[~pseudo_adni['RID'].duplicated(),][['RID','group']]
group_and_ts = rid_group.merge(pred_time_shift, on = 'RID')
group_and_ts = rid_group.merge(pred_time_shift, on = 'RID')
fig, ax = plt.subplots(figsize=(8,6))
for label, df in group_and_ts.groupby('group'):
    if len(df['Time Shift'])>1:
        df['Time Shift'].plot(kind="kde", ax=ax, label= diag_dict[label])
    else:
        print('Warning: ' + label + ' group has only 1 element and will not be displayed')
plt.legend()
<matplotlib.legend.Legend at 0x7f4aa1c4e650>
../../../_images/2e9d82835a21d16a214b3741376dd47e26af69769a6feb953e51c22b5581fa14.png

Patients with AD appear significantly more advanced on the disease time-axis as compared to MCI and healthy, while MCI subjects are in between healthy and AD groups.