First steps with Leaspy

First steps with Leaspy#

Welcome for the second practical session of the day!

Objectives :#

Learn to use Leaspy methods

The set-up#

As before, if you have followed the installation details carefully, you should

be running this notebook in the leaspy_tutorial virtual environment
having all the needed packages already installed

💬 Question 1 💬 Run the following command lines

import os, sys
from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# import main classes from Leaspy package
sys.path.append('/Users/igor.koval/Documents/work/leaspy/')
from leaspy import Leaspy, Data, AlgorithmSettings, IndividualParameters#, __watermark__

# Watermark trace with all packages versions
#print("\n".join([f'{pkg}: {pkg_version}' for pkg, pkg_version in __watermark__.items()]))

Part I: Data#

ℹ️ Information ℹ️ Data that can be used as a leaspy input should have the following format.

This result in multiple rows per subject. The input format MUST follow the following rules:

A column named ID: corresponds to the subject indices
A columns named TIME: corresponds to the subject’s age at the corresponding visit
One column per feature
Each row is a visit, therefore the concatenation of the subject ID, the patient age at which the corresponding visit occured, and then the feature values

Concerning the features’ values, as we are using a logistic model, they MUST:

Be between 0 and 1
In average increase with time for each subject (normal states correspond to values near 0 and pathological states to values near 1)

Moreover, to calibrate the progression model, we highly recommend to keep subjects that have been seen at least two times. You probably noticed that there are NaN: do not worry, Leaspy can handle them ;)

💬 Question 2 💬 Run the following lines to load the data.

data_path = os.path.join(os.getcwd(), 'data', 'TP2_leaspy_beginner')
df = pd.read_csv(os.path.join(data_path, 'simulated_data-corrected.csv'))
df = df.set_index(["ID","TIME"])
df.head()

		MDS1_total	MDS2_total	MDS3_off_total	SCOPA_total	MOCA_total	REM_total	PUTAMEN_R	PUTAMEN_L	CAUDATE_R	CAUDATE_L	PUTAMEN
ID	TIME
GS-001	71.354607	0.112301	0.122472	0.171078	0.160001	0.275257	0.492485	0.780210	0.676774	0.622611	0.494641	0.728492
	71.554604	0.140880	0.109504	0.118693	0.135852	0.380934	0.577203	0.751444	0.719796	0.618434	0.530978	0.735620
	72.054604	0.225499	0.270502	0.061310	0.211134	0.351172	0.835828	0.823315	0.691504	0.717099	0.576978	0.757409
	73.054604	0.132519	0.253548	0.258786	0.245323	0.377842	0.566496	0.813593	0.787914	0.770048	0.709486	0.800754
	73.554604	0.278923	0.321920	0.143350	0.223102	0.292768	0.741811	0.888792	0.852720	0.797368	0.715465	0.870756

💬 Question 3 💬 Does the data set seem to have the good format?

Your answer: …

💬 Question 4 💬 How many patients are there in the dataset?

# To complete

# –––––––––––––––– #
# –––– Answer –––– #
# –––––––––––––––– #

n_subjects = df.index.get_level_values('ID').unique().shape[0]
print(f'{n_subjects} subjects in the dataset.')

200 subjects in the dataset.

💬 Question 5 💬 Create a training test that contains the first 160 patients and a testing set the rest. Each set will only contain the following features:

MDS1_total
MDS2_total
MDS3_off_total

Help : Be careful, one patient is not one line …

# To complete

df_train = ######################
df_test = ######################

# –––––––––––––––– #
# –––– Answer –––– #
# –––––––––––––––– #

df_train = df.loc[:'GS-160'][["MDS1_total", "MDS2_total", "MDS3_off_total"]]
df_test = df.loc['GS-161':][["MDS1_total", "MDS2_total", "MDS3_off_total"]]

Leaspy’s `Data` container#

ℹ️ Information ℹ️ Leaspy comes with its own data containers. The one used in a daily basis is Data. You can load your data from a csv with it Data.from_csv_file or from a DataFrame Data.from_dataframe.

💬 Question 6 💬 Run the following lines to convert DataFrame into Data object.

data_train = Data.from_dataframe(df_train)
data_test = Data.from_dataframe(df_test)

Part II : Instantiate a `Leaspy` object#

ℹ️ Information ℹ️ Before creating a leaspy object, you need to choose the type of progression shape you want to give to your data. The available models are the following:

linear
logistic

with the possibility to enforce a parallelism between the features. Parallelism imposes that all the features have the same average pace of progression.

Once that is done, you just have to call Leaspy('model_name'). The dedicated names are :

univariate_linear
linear
univariate_logistic
logistic
logistic_parallel
lme_model
constant_model

💬 Question 7 💬 We choose a logistic model. Run the following line to instantiate the leaspy object.

leaspy = Leaspy("logistic", source_dimension=2)

ℹ️ Information ℹ️ Leaspy object contains all the main methods provided by the software. With this object, you can:

calibrate a model
personalize a model to individual data (basically you infer the random effects with a gradient descent)
estimate the features values of subjects at given ages based on your calibrated model and their individual parameters
simulate synthetic subjects base on your calibrated model, a collection of individual parameters and data
load and save a model

💬 Question 8 💬 Check it out by running the following line

? Leaspy

💬 Question 9 💬 This Leaspy object comes with an handy attribute for vizualization. Let’s have a look on the data that will be used to calibrate our model

leaspy.model.dimension

ax = leaspy.plotting.patient_observations(data_train, alpha=.7, figsize=(14, 6))
ax.set_ylim(0,.8)
ax.grid()
plt.show()

../../../_images/dc681a456dd167197c4e33cf20d796d988e17bf25aff88bbf60bdf6646b325e8.png

Well… not so engaging, right? Let’s see what Leaspy can do for you.

Part III : Choose your algorithms#

ℹ️ Information ℹ️ Once you choosed your model, you need to choose an algorithm to calibrate it.

To run any algorithm, you need to specify the settings of the related algorithm thanks to the AlgorithmSettings object. To ease Leaspy’s usage for new users, we specified default values for each algorithm. Therefore, the name of the algorithm used is enough to run it. The one you need to fit your progression model is mcmc_saem, which stands for Markov chain Monte Carlo - Stochastic Approximation of Expectation Maximization.

💬 Question 10 💬 Run the following line to instanciate a AlgorithmSettings object.

algo_settings = AlgorithmSettings('mcmc_saem', 
                                  n_iter=2000,           # n_iter defines the number of iterations
                                  loss='MSE_diag_noise', # estimate the residual noise scaling per feature
                                  progress_bar=True)     # To display a nice progression bar during calibration

ℹ️ Information ℹ️ You can specify many more settings that are left by default for now. You can also save and load an AlgorithmSettings object in a json file.

💬 Question 11 💬 Run the following line to get more informations.

? AlgorithmSettings

ℹ️ Information ℹ️ It is often usefull, even if it is optional to store the different logs of the model during the iterations. You can use the following method with the path of the folder where the logs will be stored.

💬 Question 12 💬 Run the following lines.

algo_settings.set_logs(
    path='logs',          # Creates a logs file ; if existing, ask if rewrite it
    plot_periodicity=50,  # Saves the values to display in pdf every 50 iterations
    save_periodicity=10,  # Saves the values in csv files every 10 iterations
    console_print_periodicity=None,  # If = N, it display logs in the console/terminal every N iterations
    overwrite_logs_folder=True       # Default behaviour raise an error if the folder already exists.
)

...overwrite logs folder...

Part IV : Fit your model#

💬 Question 13 💬 Run the following lines to fit the model.

leaspy.fit(data_train, algorithm_settings=algo_settings)

|##################################################|   2000/2000 iterations
The standard deviation of the noise at the end of the calibration is:
MDS1_total: 0.0512
MDS2_total: 0.0544
MDS3_off_total: 0.0649

Calibration took: 1min 22s

ℹ️ Information ℹ️ This might take several minutes, so let’s discuss about the keyword argument source_dimension. This parameters depend on the number of variable you want the model to learn: it can go from 1 to the number of variables. If it is not set by the user the default value is \(\sqrt{N_{features}}\) as it has been shown empirically to give good results. You will learn more about the mathematical formulation of the model below (part V).

ℹ️ Information ℹ️ Before assuming that the model is estimated, you have to check that the convergence went well. For that, you can look the at the convergence during the iterations. To do so, you can explore the logs folder (in the same folder than this jupyter notebook) that shows the model convergence during the iterations. The first thing to look at is probably the plots/convergence_1.pdf and plots/convergence_2.pdf files : a run has had enough iterations to converge if the last 20 or even 30% of the iterations were stable for all the parameters. If not, you should provably re-run it with more iterations.

from IPython.display import IFrame
IFrame('./logs/plots/convergence_1.pdf', width=990, height=670)

💬 Question 14 💬 Check out the parameters of the model that are stored here

leaspy.model.parameters

{'g': tensor([1.9860, 1.8967, 1.2987]),
 'v0': tensor([-4.7647, -4.5189, -4.6165]),
 'betas': tensor([[ 0.0230,  0.0497],
         [-0.0321,  0.0040]]),
 'tau_mean': tensor(65.0794),
 'tau_std': tensor(10.3418),
 'xi_mean': tensor(0.),
 'xi_std': tensor(0.5935),
 'sources_mean': tensor(0.),
 'sources_std': tensor(1.),
 'noise_std': tensor([0.0512, 0.0544, 0.0649])}

ℹ️ Information ℹ️ Parameters are probably not straightfoward for now. The most important one is probably noise_std. It corresponds to the standard deviation of the Gaussian errors (one per feature). The smallest, the better - up to the lower bound which is the intrinsic noise in the data. Note that usually, cognitive measurements have an intrinsic error (computed on test-retest exams) between 5% and 10%.

💬 Question 15 💬 Let’s display noise_std

noise = leaspy.model.parameters['noise_std']
features = leaspy.model.features

print('Standard deviation of the residual noise for the feature:')
for n, f in zip(noise, features):
    print(f'- {f}: {n*100:.2f}%')

Standard deviation of the residual noise for the feature:
- MDS1_total: 5.12%
- MDS2_total: 5.44%
- MDS3_off_total: 6.49%

💬 Question 16 💬 Save the model with the command below

leaspy.save(os.path.join(data_path, "model_parameters.json"), indent=2)

💬 Question 17 💬 Load the model with the command below

leaspy = Leaspy.load(os.path.join(data_path, 'model_parameters.json'))

ℹ️ Information ℹ️ Now that we have sufficient evidence that the model has converged, let’s output what the average progression looks like!

First, let’s detail a bit what we are going to represent. We are going to display a trajectory: it corresponds to the temporal progression of the biomarkers. There is not only one trajectory for a cohort, as each subject has his or her own specific trajectory, meaning his or her disease progression. Each of these individual trajectories rely on individual parameters that are subject-specific. We will see those individual parameters a bit later, do not worry. For now, let’s stick to the average trajectory.

So what does the average trajectory corresponds to? The average trajectory correspond to a virtual patient whose individual parameters are the average individual parameters. And these averages are already estimated during the calibration.

💬 Question 18 💬 Let’s plot the average trajectory

ax = leaspy.plotting.average_trajectory(alpha=1, figsize=(14,6))
ax.grid()
plt.show()

../../../_images/330b11c111d69b9be82e08a1ec98bb8177a7ed31a081ee02670bf89ccce519c6.png

Part V : Personalize the model to individual data#

ℹ️ Information ℹ️ The personalization procedure allows to estimate the individual parameters that allows to modify the average progression to individual observations. The variations from the average trajectory to the individual one are encoded within three individual parameters :

\(\alpha_i = \exp(\xi_i)\) : the acceleration factor, that modulates the speed of progression : \(\alpha_i > 1\) means faster, \(\alpha_i < 1\) means slower than the average progression
\(\tau_i\) : the time shift which delays the progression in a given number of years. It has to be compared to tau_mean \( = \bar{\tau} \) which is in the model parameters above. In fact, \( \tau_i \sim \mathcal{N}( \bar{\tau}, \sigma_{\tau}^2)\) , so \(\tau_i > \bar{\tau}\) means that the patient has a disease that starts later than average, while \(\tau_i < \bar{\tau}\) means that the patient has a disease that starts earlier than average
\(w_i = (w_1, ..., w_N)\) (\(N\) being the number of features) : the space-shift which might, for a given individual, change the ordering of the conversion of the different features, compared to the mean trajectory.

In a nutshell, the \(k\)-th feature at the \(j\)-th visit of subject \(i\), which occurs at time \(t_{ij}\) writes:

\[y_{ijk} = f_\theta ( w_{ik}+ \exp(\xi_i) * (t_{ij} - \tau_i) ) + \epsilon_{ijk}\]

With:

\(\theta\) being the population parameters, infered during calibration of the model,
\(f_\nu\) a parametric family of trajectories depending of model type,
\(\epsilon_{ijk}\) an independent normally distributed error term.

This writing is not exactly correct but helps understand the role of each individual parameters.

[ Advanced ] Remember the sources, or the source_dimension? Well, \(w_i\) is not estimated directly, but rather thanks to a Independant Component Analysis, such that \(w_i = A s_i\) where \(s_i\) is of dimension \(N_s\) = source_dimension. See associated papers for further details.

Now, let’s estimate these individual parameters. The procedure relies on the scipy_minimize algorithm (gradient descent) that you have to define (or to load from an appropriate json file) :

💬 Question 19 💬 First set the parameters

settings_personalization = AlgorithmSettings('scipy_minimize', progress_bar=True, use_jacobian=True)

💬 Question 20 💬 Then use the second most important function of leaspy : leaspy.personalize. It estimates the individual parameters for the data you provide:

?leaspy.personalize

ip = leaspy.personalize(data_test, settings_personalization)

|########################################|   40/40 subjects
The standard deviation of the noise at the end of the personalization is:
MDS1_total: 0.0466
MDS2_total: 0.0492
MDS3_off_total: 0.0582

Personalization scipy_minimize took: 2s

ℹ️ Information ℹ️ Note here that you can personalize your model on patients that have only one visit! And you don’t have to use the same data as previously. It is especially useful, and important, in order to validate your model!

💬 Question 20b 💬 Once the personalization is done, check the different functions that the IndividualParameters provides (you can save and load them, transform them to dataframes, etc) :

?ip

💬 Question 21 💬 Plot the test data, but with reparametrized ages instead of real ages

# Plot the test data with individually reparametrized ages
ax = leaspy.plotting.patient_observations_reparametrized(data_test, ip, 
                                                         alpha=.7, linestyle='-', 
                                                         #patients_idx=list(data_test.individuals.keys())[:20],
                                                         figsize=(14, 6))
ax.grid()
plt.show()

../../../_images/224672410b047369aa8633ee4c5e22d9be244380ce603cb3bfca77828e614bb1.png

Remember the raw plotting of values during question 9? Better, no?

# Plot the test data with individually with true ages
ax = leaspy.plotting.patient_observations(data_test,
                                          alpha=.7, linestyle='-', 
                                          #patients_idx=list(data_test.individuals.keys())[:4],
                                          figsize=(14, 6))
ax.grid()
plt.show()

../../../_images/d1876eaf58fd9ecb4a46cb69fd0c003c20f455071b53fe2c3ef54e2d946f8e58.png

Now, let’s see what you can do with the individual parameters.

Part VI : Impute missing values & predict individual trajectories#

ℹ️ Information ℹ️ Together with the population parameters, the individual parameters entirely defines the individual trajectory, and thus, the biomarker values at any time. So you can reconstruct the individual biomarkers at different ages.

You can reconstruct your observations at seen ages, i.e. at visits that have been used to personalize the model. There are two reasons you might want to do that:

see how well the model fitted individual data
impute missing values: as Leaspy handles missing values, it can then reconstruct them (note that this reconstruction will be noiseless)

The third very important function - after leaspy.fit and leaspy.personalize - is leaspy.estimate. Given some individual parameters and timepoints, the function estimates the values of the biomarkers at the given timepoints which derive from the individual trajectory encoded thanks to the individual parameters.

💬 Question 22 💬 Check out the documentation

?leaspy.estimate

💬 Question 23 💬 Before running leaspy.estimate, let’s first retrieve the observations of subject ‘GS-187’ in the initial dataset. Get also his/her individual parameters as shown here:

observations = df_test.loc['GS-187']
print(f'Seen ages: {observations.index.values}')
print("Individual Parameters : ", ip['GS-187'])

Seen ages: [61.34811783 62.34811783 63.84811783 64.34812164 67.84812164 68.34812164
 69.34812164 69.84812164 70.84812164 71.34812164 71.84812164 72.34812164
 72.84812164 73.34812164]
Individual Parameters :  {'xi': 8.263221388915554e-05, 'tau': 70.71927642822266, 'sources': [0.8920779228210449, -0.9051297903060913]}

ℹ️ Information ℹ️ The estimate first argument is a dictionary, so that you can estimate the trajectory of multiple individuals simultaneously (as long as the individual parameters of all your queried patients are in ip.

💬 Question 24 💬 Now, let’s estimate the trajectory for this patient.

timepoints = np.linspace(60, 100, 100)
reconstruction = leaspy.estimate({'GS-187': timepoints}, ip)

def plot_trajectory(timepoints, reconstruction, observations=None):

    if observations is not None:
        ages = observations.index.values
    
    plt.figure(figsize=(14, 6))
    plt.grid()
    plt.ylim(0, .75)
    plt.ylabel('Biomarker normalized value')
    plt.xlim(60, 100)
    plt.xlabel('Patient age')
    colors = ['#003f5c', '#7a5195', '#ef5675', '#ffa600']
    
    for c, name, val in zip(colors, leaspy.model.features, reconstruction.T):
        plt.plot(timepoints, val, label=name, c=c, linewidth=3)
        if observations is not None:
            plt.plot(ages, observations[name], c=c, marker='o', markersize=12)
        
    plt.legend()
    plt.show()
                                
plot_trajectory(timepoints, reconstruction['GS-187'], observations)

/usr/local/Cellar/graph-tool/2.35/libexec/lib/python3.8/site-packages/matplotlib/cbook/__init__.py:1402: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  x[:, None]
/usr/local/Cellar/graph-tool/2.35/libexec/lib/python3.8/site-packages/matplotlib/axes/_base.py:278: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  y = y[:, np.newaxis]

../../../_images/27ab67fb7b62b3e83356aa675cff141ae9b6ee291f3551d53438e8299c86a40d.png

data_test.to_dataframe().set_index(['ID']).loc['GS-187']

	TIME	MDS1_total	MDS2_total	MDS3_off_total
ID
GS-187	61.348118	0.032155	0.079023	0.194349
GS-187	62.348118	0.000000	0.000000	0.105951
GS-187	63.848118	0.056661	0.015366	0.213064
GS-187	64.348122	0.113765	0.151636	0.220763
GS-187	67.848122	0.087926	0.142727	0.121259
GS-187	68.348122	0.127519	0.208425	0.215517
GS-187	69.348122	0.014113	0.116000	0.115312
GS-187	69.848122	0.098333	0.067149	0.147486
GS-187	70.848122	0.088108	0.168970	0.235806
GS-187	71.348122	0.025140	0.206862	0.245327
GS-187	71.848122	0.088468	0.193132	0.215797
GS-187	72.348122	0.156485	0.281915	0.255358
GS-187	72.848122	0.072483	0.134019	0.297301
GS-187	73.348122	0.076749	0.177526	0.196283

# Or with plotting object
ax = leaspy.plotting.patient_trajectories(data_test, ip,
                                          patients_idx=['GS-187','GS-180'],
                                          labels=['MDS1_total','MDS2_total', 'MDS3_off_total'],
                                          #reparametrized_ages=True, # check sources effect
                                          
                                          # plot kwargs
                                          #color=['#003f5c', '#7a5195', '#ef5675', '#ffa600'],
                                          alpha=1, linestyle='-', linewidth=2,
                                          #marker=None,
                                          markersize=8, obs_alpha=.5, #obs_ls=':', 
                                          figsize=(16, 6),
                                          factor_past=.5,
                                          factor_future=5, # future extrapolation
                                          )
ax.grid()
#ax.set_ylim(0, .75)
ax.set_xlim(45, 120)
plt.show()

../../../_images/6477502b86c67679e07eb8336391c1527aac08feeab2b64b663ca8d32e0d10d7.png

# Grasp source effects
ax = leaspy.plotting.patient_trajectories(data_test, ip,
                                          patients_idx='all',
                                          labels=['MDS1','MDS2', 'MDS3 (off)'],
                                          reparametrized_ages=True, # check sources effect
                                          
                                          # plot kwargs
                                          alpha=1, linestyle='-', linewidth=1,
                                          marker=None,
                                          figsize=(16, 6),
                                          factor_past=0,
                                          factor_future=0, # no extrapolation (future) nor past
                                          )
ax.grid()
#ax.set_ylim(0, .75)
#ax.set_xlim(45, 120)
plt.show()

../../../_images/624fb9396cbd3d1b945d3f25f3eb4fa98da6d0dea250021fdab54b9577525e2e.png

Part VII : Leaspy application - Cofactor analysis#

ℹ️ Information ℹ️ Besides prediction, the individual parameters are interesting in the sense that they provide meaningful and interesting insights about the disease progression. For that reason, these individual parameters can be correlated to other cofactors. Let’s consider that you have a covariate Cofactor 1 that encodes a genetic status: 1 if a specific mutation is present, 0 otherwise.

💬 Question 25 💬 Now, let’s see if this mutation has an effect on the disease progression:

import seaborn as sns

# —— Convert individual parameters to dataframe
df_ip = ip.to_dataframe()

# —— Join the cofactors to individual parameters
cofactor = pd.read_csv(os.path.join(data_path, "cof_leaspy1.csv"), index_col=['ID'])
df_ip = df_ip.join(cofactor.replace({'MUTATION':{0: 'Non-carrier', 1: 'Carrier'}}))

_, ax = plt.subplots(1, 2, figsize=(14, 6))

# —— Plot the time shifts in carriers and non-carriers
ax[0].set_title('Time shift histogram')
sns.histplot(data=df_ip, x='tau', hue='MUTATION', bins=15, ax=ax[0], stat='count', common_norm=False, kde=True)

# —— Plot the acceleration factor in carriers and non-carriers
ax[1].set_title('Log-Acceleration factor histogram')
sns.histplot(data=df_ip, x='xi', hue='MUTATION', bins=15, ax=ax[1], stat='count', common_norm=False, kde=True)

plt.show()

# __ Joint density (tau, xi) __
g = sns.jointplot(data=df_ip, x="tau", y="xi", hue="MUTATION", height=6)
g.plot_joint(sns.kdeplot, zorder=0, levels=8, bw_adjust=1.5)
g.ax_joint.grid();

/usr/local/Cellar/graph-tool/2.35/libexec/lib/python3.8/site-packages/matplotlib/cbook/__init__.py:1402: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  x[:, None]
/usr/local/Cellar/graph-tool/2.35/libexec/lib/python3.8/site-packages/matplotlib/axes/_base.py:276: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  x = x[:, np.newaxis]
/usr/local/Cellar/graph-tool/2.35/libexec/lib/python3.8/site-packages/matplotlib/axes/_base.py:278: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  y = y[:, np.newaxis]
/usr/local/Cellar/graph-tool/2.35/libexec/lib/python3.8/site-packages/matplotlib/cbook/__init__.py:1402: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  x[:, None]
/usr/local/Cellar/graph-tool/2.35/libexec/lib/python3.8/site-packages/matplotlib/axes/_base.py:276: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  x = x[:, np.newaxis]
/usr/local/Cellar/graph-tool/2.35/libexec/lib/python3.8/site-packages/matplotlib/axes/_base.py:278: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  y = y[:, np.newaxis]

../../../_images/9ea21d8a52cca8e729b670e05db22d6cff5836b754b519a9fcd301439bb3fe2d.png

../../../_images/90c31a2978fa218aacb9df97410f728d32f8179a93a436f7d7fd25630fbb5f45.png

💬 Question 26 💬 Now, check your hypothesis with statistical tests

# Shortcuts of df_ip for 2 sub-populations
carriers = df_ip[df_ip['MUTATION'] == 'Carrier']
non_carriers = df_ip[df_ip['MUTATION'] == 'Non-carrier']
# —— Student t-test (under the asumption of a gaussian distribution only)
print(stats.ttest_ind(carriers['tau'], non_carriers['tau']))
print(stats.ttest_ind(carriers['xi'], non_carriers['xi']))

# —— Mann-withney t-test
print(stats.mannwhitneyu(carriers['tau'], non_carriers['tau']))
print(stats.mannwhitneyu(carriers['xi'], non_carriers['xi']))

Ttest_indResult(statistic=1.941230030309452, pvalue=0.059673501020270414)
Ttest_indResult(statistic=-3.3108067640055814, pvalue=0.0020460286770261965)
MannwhitneyuResult(statistic=120.0, pvalue=0.030617610211415968)
MannwhitneyuResult(statistic=65.0, pvalue=0.00032679719774224823)

Part VIII : Data Simulation#

ℹ️ Information ℹ️ Now that you are able to predict the evolution of a patient and use it to analyse cofactors, you might want to simulate a new one thanks to the information that you have learned. To do so you can use the last method of leaspy that we will study : simulate.

💬 Question 27 💬 Have a look to the function

?leaspy.simulate

ℹ️ Information ℹ️ To use the fuction we will first extract the individual parameters using personalize with mode_real option. The simulate function learns the joined distribution of the individual parameters and baseline age of the subjects present in individual_parameters and data respectively to sample new patients from this joined distribution.

💬 Question 28 💬 Define the settings for the personalization and get the individual parameters

settings_ip_simulate = AlgorithmSettings('mode_real')
individual_params = leaspy.personalize(data_test, settings_ip_simulate)

The standard deviation of the noise at the end of the personalization is:
MDS1_total: 0.0469
MDS2_total: 0.0494
MDS3_off_total: 0.0583

Personalization mode_real took: 6s

💬 Question 29 💬 Define your algorithm for the simulation and simulate individuals from previously obtained individual parameters and dataset

settings_simulate = AlgorithmSettings('simulation')
simulated_data = leaspy.simulate(individual_params, data_test, settings_simulate)

💬 Question 30 💬 Access to the individual parameters of one individual that you have created

print(simulated_data.get_patient_individual_parameters("Generated_subject_001"))

{'tau': tensor([65.7623], dtype=torch.float64), 'xi': tensor([0.5340], dtype=torch.float64), 'sources': tensor([1.5067, 0.3401], dtype=torch.float64)}

💬 Question 31 💬 Plot the joint distribution of individual parameters (tau, xi) for simulated individuals that you have created

# Create a dataframe with individual parameters from both real & simulated individuals
df_ip_both = pd.concat({
    'estimated': individual_params.to_dataframe(),
    'simulated': simulated_data.get_dataframe_individual_parameters()
}, names=['Origin'])

g = sns.jointplot(data=df_ip_both, x='tau', y='xi', hue='Origin', height=8, 
                  marginal_kws=dict(common_norm=False))
g.plot_joint(sns.kdeplot, zorder=0, levels=8, bw_adjust=2.)
g.ax_joint.grid();

../../../_images/eb9a25c2ca51349925b760be029da8734658c4246c618a49f2122ba213e36c4e.png

💬 Question 32 💬 Plot some simulated individual trajectories

ax = leaspy.plotting.patient_observations(simulated_data.data, alpha=.7, figsize=(14, 4),
                                          patients_idx=[f'Generated_subject_{i:03}' for i in [1,2,7,42]])
ax.grid()
plt.show()

../../../_images/da4ce715138bf52fd3bffe18273e6d69afc7773d51a448120c68dfd914738faf.png