# Linear Mixed-effects ModelsΒΆ

Welcome for the first practical session of the day !

## Objectives :ΒΆ

• Get a better idea of medical data, especially longitudinal ones

• Understand mixed-effects models

• Get a taste of state-of-the-art techniques

## The set-upΒΆ

If you have followed the installation details carefully, you should

• be running this notebook in the leaspy_tutorial conda environment (be sure that the kernel you are using is leaspy_tutorial => check Kernel above)

• having all the needed packages already install

π¬ Question 1 π¬ Run the following command lines

import os
import sys

import pandas as pd
import numpy as np
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


## Part I: The dataΒΆ

We import a functional medical imaging dataset. We have extracted, for each timepoint, the average value of the metabolic activity of the putamen. This brain region is commonly damaged by Parkinsonβs disease.

IMPORTANT: The values have been normalized such that a normal value is zero and a very abrnomal value is one.

π¬ Question 2 π¬ Run the following cell and look at the head of the dataframe to better understand what the data are.

from leaspy.datasets import Loader

# To complete

# ββββββββββββββββ #
# ββββββββββββββββ #


PUTAMEN
ID TIME SPLIT
GS-001 71.354607 train 0.728492
71.554604 train 0.735620
72.054604 train 0.757409
73.054604 train 0.800754
73.554604 train 0.870756

βΉοΈ Information βΉοΈ The SPLIT column already distinguishes the train and test data.

π¬ Question 3 π¬ Describe the target variable PUTAMEN and the explicative variable TIME. You can plot:

• Sample size

• Mean, std

• Min & max values

• Quantiles

# To complete

# ββββββββββββββββ #
# ββββββββββββββββ #

df.reset_index().describe().round(2).T

count mean std min 25% 50% 75% max
TIME 1997.0 65.37 10.05 33.14 58.66 66.48 72.06 91.24
PUTAMEN 1997.0 0.71 0.10 0.35 0.64 0.71 0.77 0.96

π¬ Question 4 π¬ From this value, what can you say about the disease stage of the population?

Answer: The median and mean value is 0.71, so the average disease stage is high for these subjects.

π¬ Question 5 π¬ Display the data, where the Putamen (y-axis) is plot with respect to the Time (x-axis)

# To complete

# ββββββββββββββββ #
# ββββββββββββββββ #

sns.set_style('whitegrid')
plt.figure(figsize=(14,6))
sns.scatterplot(data=df.xs('train', level='SPLIT').reset_index(),
x='TIME', y='PUTAMEN', alpha=.5, s=60)
plt.title('PUTAMEN - Raw data')
plt.show()


β‘ Remark β‘ At first look, the PUTAMEN values do not seem highly correlated to TIME.

## Part II: Linear RegressionΒΆ

As we are some pro ML players, letβs make some predictions : letβs try to predict the putamen value based on the time alone.

π¬ Question 6 π¬ Store the train and test data in df_train and df_test

# To complete

# ββββββββββββββββ #
# ββββββββββββββββ #

pds = pd.IndexSlice
df_train = df.loc[pds[:, :, 'train']].copy()  # one possibility
df_test = df.xs('test', level='SPLIT').copy()  # an other one


π¬ Question 7 π¬ Run the linear reagression that is in scipy.

Be carefull, you have to train it only with the train set!

x = # Complete with the appropriate data
y = # Complete with the appropriate data

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# ββββββββββββββββ #
# ββββββββββββββββ #

x = df_train.index.get_level_values('TIME').values
y = df_train['PUTAMEN'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)


βΉοΈ Information βΉοΈ To run the notebook smoothly, you must comply to the following rules: we are going to try different models to predict the putamen values based on each observation.

We will store the results in the dataframe df such that :

ID

TIME

SPLIT

PUTAMEN

Model 1

Model 2

β¦

GS-001

74.4

train

0.78

0.93

0.75

β¦

GS-003

75.4

train

0.44

0.84

0.46

β¦

GS-018

51.8

test

0.71

0.73

0.78

β¦

GS-056

89.2

train

0.76

0.56

0.61

β¦

This will ease the comparison of the models.

β‘ Remark β‘ No need to add these predictions to df_train and df_test. You should be able to easily run the notebook by keeping df_train the way it is while appending the results in df.

π¬ Question 8 π¬ Add the predictions done by the linear regression in the column Linear Regression

df['Linear Regression'] = # Your code here

# ββββββββββββββββ #
# ββββββββββββββββ #

df['Linear Regression'] = intercept + slope * df.index.get_level_values('TIME')


βΉοΈ Information βΉοΈ Letβs introduce an object and a fonction that will be used to compare the models:

• overall_results will be the dataframe that stores the root mean square error on the train and test set for the different models

• compute_rmse_train_test is the function that given the dataframe df and a model_name (Linear Regression for instance), compute the mean absolute error on the train and test set and stores it in overall_results

π¬ Question 9 π¬ Run the following cell to see the results

overall_results = pd.DataFrame({'train': [], 'test': []})

def compute_rmse(df, model_name):
"""Compute RMSE between PUTAMEN column and the <model_name> column of df"""
y = df['PUTAMEN']
y_hat = df[model_name]
diff = y - y_hat
return np.sqrt(np.mean(diff * diff))

def compute_rmse_train_test(df, overall_results, model_name):
"""Inplace modification of <overall_results>"""
overall_results.loc[model_name, 'train'] = compute_rmse(df.xs('train', level='SPLIT'), model_name)
overall_results.loc[model_name, 'test'] = compute_rmse(df.xs('test', level='SPLIT'), model_name)

compute_rmse_train_test(df, overall_results, 'Linear Regression')
overall_results

train test
Linear Regression 0.091403 0.10213

β‘ Remark β‘ The RMSE is higher on the test set as on the train set

Letβs look at what we are doing by plotting the data and the linear regression. Throughout the notebook, we will use the function plot_individuals that, given a subset of IDs and a model name (as stored in the df dataframe) plots the individual data and their prediction

π¬ Question 10 π¬ Use the following cell.

def get_title(overall_results, model_name):
"""Precise model's name and its RMSE train & test"""
rmse_train = overall_results.loc[model_name, 'train']
rmse_test = overall_results.loc[model_name, 'test']
title = f'PUTAMEN - Raw data vs {model_name:s}\n'
title += r'$RMSE_{train}$ = %.3f     $RMSE_{test}$ = %.3f' % (rmse_train, rmse_test)
return title

def plot_individuals(df, overall_results, model_name, **kwargs):
# ---- Input manager
kind = kwargs.get('kind', 'lines')
sublist = kwargs.get('sublist', None)
highlight_test = kwargs.get('highlight_test', True)
ax = kwargs.get('ax', None)

if ax is None:
fig, ax = plt.subplots(figsize=(14, 6))
else:
plt.figure(figsize=(14, 6))

# ---- Select subjects
if sublist is None:
sublist = df.index.unique('ID')

# ---- If too many subject, do not display them in the legend
display_id_legend = len(sublist) <= 10

# ---- Plot
if kind == 'scatter':
# -- Used for Linear Regression
sns.scatterplot(data=df.reset_index(), x='TIME', y='PUTAMEN', alpha=.5,
s=60, label='Observations', ax=ax)
ax.plot(df.index.get_level_values('TIME').values,
df[model_name].values,
label=model_name, c='C1')
ax.legend(title='LABEL')
if highlight_test:
test = df.xs('test', level='SPLIT').loc[sublist].reset_index()
sns.scatterplot(data=test, x='TIME', y='PUTAMEN', legend=None, ax=ax)

elif kind == 'lines':
# -- Used for the other models
# - Stack observations & reconstructions by the model
df_stacked = df[['PUTAMEN', model_name]].copy()
df_stacked.rename(columns={'PUTAMEN': 'Observations'}, inplace=True)
df_stacked = df_stacked.stack().reset_index().set_index(['ID', 'SPLIT'])
df_stacked.columns = ['TIME', 'LABEL', 'PUTAMEN']

# - Plot
sns.lineplot(data=df_stacked.loc[sublist],
x='TIME', y='PUTAMEN', hue='ID', style='LABEL',
legend=display_id_legend, ax=ax)
if highlight_test:
test = df.xs('test', level='SPLIT').loc[sublist].reset_index()
sns.scatterplot(data=test, x='TIME', y='PUTAMEN', hue='ID', legend=None, ax=ax)
if display_id_legend:
ax.legend(title='LABEL', bbox_to_anchor=(1.05, 1), loc='upper left')
else:
raise ValueError('<kind> input accept only "scatter" and "lines".'
f' You gave {kind}')

ax.set_title(get_title(overall_results, model_name))
return ax

plot_individuals(df, overall_results, 'Linear Regression',
kind='scatter', highlight_test=False)
plt.show()


Is the previous plot relevant to assess the quality of our model?

We will answer this question in the following cells:

## Part III: The longitudinal aspectΒΆ

π¬ Question 11 π¬ Run the cell to have a better understanding of your data:

plot_individuals(df, overall_results, 'Linear Regression',
kind='lines', highlight_test=True)
plt.title('Longitudinal aspect')
plt.show()


The test data are highlited with dots. π¬ Question 12 π¬ What are actually the test data ?

π¬ Question 13 π¬Why does the global linear model not describe the temporal evolution of the variable?

## PART IV: Indivual Linear RegressionsΒΆ

βΉοΈ Information βΉοΈ In fact, this is not the best idea to have one general linear regression. Because we do not benefit from indiviudal information. Therefore, letβs do one linear regression per individual

π¬ Question 14 π¬ Look at what this function is doing and at the result

individual_parameters = pd.DataFrame({'INTERCEPT': [], 'SLOPE': []})

subject_idx = 'GS-194'

def compute_individual_parameters(df, subject_idx):
df_patient = df.loc[subject_idx]

x = df_patient.index.get_level_values('TIME').values
y = df_patient['PUTAMEN'].values

# -- Linear regression
slope, intercept, _, _, _ = stats.linregress(x, y)

return intercept, slope

individual_parameters.loc[subject_idx] = compute_individual_parameters(df_train, subject_idx)

individual_parameters

INTERCEPT SLOPE
GS-194 0.025188 -0.807941

π¬ Question 15 π¬ Apply the function to everyone

# Your answer

# ββββββββββββββββ #
# ββββββββββββββββ #

for subject_idx in df_train.index.unique('ID'):

slope, intercept = compute_individual_parameters(df_train, subject_idx)
individual_parameters.loc[subject_idx] = (intercept, slope)


INTERCEPT SLOPE
GS-194 -0.807941 0.025188
GS-001 -2.946305 0.051464
GS-002 -0.252949 0.018643
GS-003 0.517972 0.003816
GS-004 -1.076280 0.025680

π¬ Question 16 π¬ Now append the result of the model in df using the function below.

β‘ Remark β‘ Take a close look at what we are doing cause we will use the same syntax if other questions.

def compute_individual_reconstruction(x, parameters):

subject_idx = x.name[0]

slope = parameters.loc[subject_idx]['SLOPE']
intercept = parameters.loc[subject_idx]['INTERCEPT']

time = x.name[1]

return intercept + slope * time

df['Individual Linear Regression'] = df.apply(
lambda x: compute_individual_reconstruction(x, individual_parameters), axis=1)


π¬ Question 17 π¬ Use the compute_train_test_mean_absolute_error function to get the train and test errors and compare the two models.

# Your code here

# ββββββββββββββββ #
# ββββββββββββββββ #

compute_rmse_train_test(df, overall_results, 'Individual Linear Regression')
overall_results

train test
Linear Regression 0.091403 0.102130
Individual Linear Regression 0.017825 0.032197

We clearly see that the RMSE is much better! π¬ Question 18 π¬ Create a list with the five patients having the more visits and the five patients having the less visits, then use plot_individuals function to display their observations and reconstructions by the model Hint : use the keyword sublist

# sublist = # TODO
# plot_individuals(df, overall_results, model_name, sublist=sublist)

# ββββββββββββββββ #
# ββββββββββββββββ #

visits_per_subjects = df.groupby(df.index.get_level_values('ID')).count().sort_values('PUTAMEN')
sublist = visits_per_subjects.tail(5).index.tolist()

plot_individuals(df, overall_results, 'Individual Linear Regression', sublist=sublist)
plt.show()


π¬ Question 19 π¬ Explain why $$RMSE_{test} >> RMSE_{train}$$:

Answer: the LM overfit for patients with only few data

## Part V : Linear Mixed effects Model with statsmodelsΒΆ

With the previous method, we made a significant improvement. However, we suffer fro an overfitting problem. Letβs see what a mixed effect model can do for us!

### Run a LMM with statsmodelsΒΆ

βΉοΈ Information βΉοΈ We will use the statsmodel package to run a Linear Mixed Effect Model (LMM or LMEM in the literature).

π¬ Question 20 π¬ Load the following lines to import the packages

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.regression.mixed_linear_model import MixedLMParams


Statsmodels contains several API to create a model. For the ones familiar with R, you will be here in a familiar ground with the formula API.

• formula='PUTAMEN ~ TIME + 1' means that you want to explain PUTAMEN with TIME and an intercept

• groups="ID" means that you want random effect for all levels of ID

• re_formula="~TIME + 1" means that you want a random intercept and a random slope for TIME

If you go back to the equation you get : $$PUTAMEN_{id,time} = \underbrace{\alpha*TIME_{id,time} + \beta}_\text{formula} + \underbrace{\alpha_{id}*TIME_{id,time} + \beta_{id}}_\text{re_formula}$$

π¬ Question 21 π¬ Letβs try a very naive run:

lmm = smf.mixedlm(formula='PUTAMEN ~ 1 + TIME',
data=df_train.reset_index(),
groups="ID", re_formula="~ 1 + TIME").fit()
lmm.summary()

/home/juliette.ortholand/miniconda3/envs/leaspype/lib/python3.8/site-packages/statsmodels/base/model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
/home/juliette.ortholand/miniconda3/envs/leaspype/lib/python3.8/site-packages/statsmodels/regression/mixed_linear_model.py:2200: ConvergenceWarning: Retrying MixedLM optimization with lbfgs
warnings.warn(
/home/juliette.ortholand/miniconda3/envs/leaspype/lib/python3.8/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
warnings.warn(msg, ConvergenceWarning)

 Model: MixedLM Dependent Variable: PUTAMEN No. Observations: 1415 Method: REML No. Groups: 200 Scale: 0.0007 Min. group size: 2 Log-Likelihood: 2497.0458 Max. group size: 13 Converged: Yes Mean group size: 7.1
Coef. Std.Err. z P>|z| [0.025 0.975] -0.685 0.044 -15.644 0.000 -0.771 -0.599 0.022 0.001 29.528 0.000 0.020 0.023 0.003 0.259 0.000 0.005 0.000 0.000

β‘ Remark β‘ Letβs skip the different warning for now and see what happens if we ignore it

Letβs try and see.

π¬ Question 22 π¬ Run the following commands to get the intercept and slope

print(lmm.fe_params.loc['Intercept'])
print(lmm.fe_params.loc['TIME'])

-0.6848661069361888
0.021679337332944623


π¬ Question 23 π¬ Run the following commands to get the variation to the mean slope and intercept Example on few subject

{key: val for key, val in lmm.random_effects.items() if key in
['GS-00'+str(i) for i in range(1, 4)]}

{'GS-001': ID     -0.019149
TIME   -0.001092
dtype: float64,
'GS-002': ID      0.058495
TIME    0.004182
dtype: float64,
'GS-003': ID     -0.011974
TIME   -0.001002
dtype: float64}


π¬ Question 24 π¬ From the fixed and random effects, compute for each subject its INTERCEPT and SLOPE:

# df_random_effects['INTERCEPT'] = # TODO
# df_random_effects['SLOPE'] = # TODO


# ββββββββββββββββ #
# ββββββββββββββββ #

df_random_effects = pd.DataFrame.from_dict(lmm.random_effects, orient='index')
df_random_effects = df_random_effects.rename({'ID': 'Random intercept', 'TIME': 'Random slope'}, axis=1)

df_random_effects['INTERCEPT'] = df_random_effects['Random intercept'] + lmm.fe_params.loc['Intercept']
df_random_effects['SLOPE'] = df_random_effects['Random slope'] + lmm.fe_params.loc['TIME']


Random intercept Random slope INTERCEPT SLOPE
GS-001 -0.019149 -0.001092 -0.704015 0.020588
GS-002 0.058495 0.004182 -0.626371 0.025861
GS-003 -0.011974 -0.001002 -0.696840 0.020677
GS-004 -0.020968 -0.001529 -0.705834 0.020150
GS-005 0.006285 0.000363 -0.678581 0.022043

π¬ Question 25 π¬ Use the compute_individual_reconstruction function but with df_random_effects to compute the prediction with the new individual effects

df['Linear Mixed Effect Model'] = # Your code here

# ββββββββββββββββ #
# ββββββββββββββββ #

df['Linear Mixed Effect Model'] = df.apply(
lambda x: compute_individual_reconstruction(x, df_random_effects), axis=1)


π¬ Question 26 π¬ Store the results in overall_results (thanks to compute_rmse_train_test function) and compare the models

# Your code here

# ββββββββββββββββ #
# ββββββββββββββββ #

compute_rmse_train_test(df, overall_results, 'Linear Mixed Effect Model')

overall_results

train test
Linear Regression 0.091403 0.102130
Individual Linear Regression 0.017825 0.032197
Linear Mixed Effect Model 0.024533 0.039367

β‘ Remark β‘ The result is worse than with the previous model.

π¬ Question 27 π¬ What do you think happened? Letβs check it visually

# Your code here

# ββββββββββββββββ #
# ββββββββββββββββ #

plot_individuals(df, overall_results, 'Linear Mixed Effect Model', sublist=sublist)
plt.show()


β‘ Remark β‘ All the slopes are the same. This is related to the warning above : these are warnings are a way of alerting you that you may be in a non-standard situation. Most likely, one of your variance parameters is converging to zero. Which is the case if you have a look to time variance.

π¬ Question 28 π¬ Letβs rerun it by normalizing the time first. Add to df_train and df_test a renormalizing function. Be careful to normalize only with the known ages from train. Hint : Watchout to data leakage!

# Your code

# ---- Split again train & test
df_train = df.xs('train',level='SPLIT')
df_test = df.xs('test',level='SPLIT')

# ββββββββββββββββ #
# ββββββββββββββββ #

# ---- We use only train data to compute mean & std
ages = df_train.index.get_level_values(1).values
age_mean = ages.mean()
age_std = ages.std()

df['TIME_NORMALIZED'] = (df.index.get_level_values('TIME') -  age_mean) / age_std

# ---- Split again train & test
df_train = df.xs('train',level='SPLIT')
df_test = df.xs('test',level='SPLIT')


π¬ Question 29 π¬ Rerun the previous mixed lm (some cells above) but with the TIME_NORMALIZED instead of TIME in the formula and re_formula.

# YOUR CODE

# ββββββββββββββββ #
# ββββββββββββββββ #

lmm = smf.mixedlm(formula='PUTAMEN ~ TIME_NORMALIZED + 1', data=df_train.reset_index(),
groups="ID", re_formula="~TIME_NORMALIZED + 1").fit()

lmm.summary()

 Model: MixedLM Dependent Variable: PUTAMEN No. Observations: 1415 Method: REML No. Groups: 200 Scale: 0.0005 Min. group size: 2 Log-Likelihood: 2629.4084 Max. group size: 13 Converged: Yes Mean group size: 7.1
Coef. Std.Err. z P>|z| [0.025 0.975] 0.679 0.016 43.031 0.000 0.648 0.710 0.231 0.011 20.134 0.000 0.209 0.254 0.045 0.288 -0.006 0.111 0.017 0.126

Ahaaaaah! No warnings!

π¬ Question 30 π¬ Get the parameters as previously in df_random_effects_2 and store the INTERCEPT_NORMALIZED and SLOPE_NORMALIZED

# TODO

# ββββββββββββββββ #
# ββββββββββββββββ #

df_random_effects_2 = pd.DataFrame.from_dict(lmm.random_effects, orient='index')
df_random_effects_2 = df_random_effects_2.rename({'ID': 'Random intercept', 'TIME_NORMALIZED': 'Random slope'},
axis=1)

df_random_effects_2['INTERCEPT_NORMALIZED'] = df_random_effects_2['Random intercept'] + \
lmm.fe_params.loc['Intercept']
df_random_effects_2['SLOPE_NORMALIZED'] = df_random_effects_2['Random slope'] + \
lmm.fe_params.loc['TIME_NORMALIZED']


Random intercept Random slope INTERCEPT_NORMALIZED SLOPE_NORMALIZED
GS-001 -0.222614 0.185999 0.456428 0.417479
GS-002 0.233375 -0.075922 0.912417 0.155558
GS-003 0.009845 -0.091712 0.688887 0.139767
GS-004 -0.093862 0.017902 0.585179 0.249382
GS-005 0.055686 -0.050411 0.734728 0.181068

Here, we computed

$y = SLOPE_{normalized} * TIME_{normalized} + INTERCEPT_{normalized}$

which corresponds to

$y = SLOPE_{normalized} * \frac{(TIME - \mu(ages))}{ \sigma_{ages}} + INTERCEPT_{normalized}$

which is :

$y = SLOPE * TIME + INTERCEPT$

where $$$SLOPE = \frac{SLOPE_{normalized}}{ std_{ages}}$$$$and$$$$INTERCEPT = INTERCEPT_{normalized} - \frac{SLOPE_{normalized} * mean_{ages}}{std_{ages}}$$$

π¬ Question 31 π¬ From INTERCEPT_NORMALIZED & SLOPE_NORMALIZED, compute for each subject its INTERCEPT and SLOPE:

parameters['GS-001']
>>> {'INTERCEPT': ..., 'SLOPE': ...}

# YOUR CODE

# ββββββββββββββββ #
# ββββββββββββββββ #

df_random_effects_2['SLOPE'] = df_random_effects_2['SLOPE_NORMALIZED'] / age_std
df_random_effects_2['INTERCEPT'] = df_random_effects_2['INTERCEPT_NORMALIZED'] -\
(df_random_effects_2['SLOPE_NORMALIZED'] * age_mean)/age_std


π¬ Question 32 π¬ Use the compute_individual_reconstruction function but with df_random_effects_2 to compute the prediction with the new individual effects

# Your code

# ββββββββββββββββ #
# ββββββββββββββββ #

df['Linear Mixed Effect Model - V2'] = df.apply(
lambda x: compute_individual_reconstruction(x, df_random_effects_2), axis=1)


π¬ Question 33 π¬ Store the results in overall_results (thanks to compute_rmse_train_test function) and compare the models

# Your code

# ββββββββββββββββ #
# ββββββββββββββββ #

compute_rmse_train_test(df, overall_results, 'Linear Mixed Effect Model - V2')

overall_results

train test
Linear Regression 0.091403 0.102130
Individual Linear Regression 0.017825 0.032197
Linear Mixed Effect Model 0.024533 0.039367
Linear Mixed Effect Model - V2 0.019081 0.028939

The RMSE is much better!

π¬ Question 34 π¬ Display the subjects of sublist :

# Your code

# ββββββββββββββββ #
# ββββββββββββββββ #

plot_individuals(df, overall_results, 'Linear Mixed Effect Model - V2', sublist=sublist)
plt.show()


π¬ Question 35 π¬ What is the main advantage of a Linear Mixed-effects Model compared to multiple Linear Models ?

π¬ Question 36 π¬ Letβs check the average RMSE per subject depending of the number of timepoints per subjects:

models = ['Individual Linear Regression', 'Linear Mixed Effect Model - V2']

def plot_rmse_by_number_of_visit(models, df=df):
rmse = df[models].copy()
rmse.columns.names = ['MODEL']
rmse = rmse.stack()
rmse = (rmse - df['PUTAMEN']) ** 2

nbr_of_visits = rmse.xs('train',level='SPLIT').groupby(['MODEL','ID']).size()

rmse = rmse.xs('test',level='SPLIT')
rmse = rmse.reset_index().rename(columns={0: 'RMSE'})
rmse = rmse.reset_index()[['ID', 'MODEL', 'RMSE']]
rmse = rmse.groupby(['MODEL','ID']).mean() ** .5
rmse['Visits number in train'] = nbr_of_visits

plt.figure(figsize=(12, 5))
ax = sns.boxplot(data=rmse.reset_index(), y='RMSE',
x='Visits number in train', hue='MODEL',
showmeans=True, whis=[5,95], showfliers=False)
ax = sns.stripplot(data=rmse.reset_index(), y='RMSE',
x='Visits number in train', hue='MODEL',
dodge=True, alpha=.5, linewidth=1, ax=ax)

plt.grid(True, axis='both')
plt.ylim(0,None)
plt.legend()
return ax

plot_rmse_by_number_of_visit(models, df)
plt.show()


## PART VI: A taste of the future - Linear mixed-effect model with LeaspyΒΆ

In the next practical sessions you will learn to use the package developped by the Aramis team. Now, just to be able to compare performances you will run a few methods of leaspy in advanceβ¦

π¬ Question 37 π¬ Run the following cell to make the import of leaspy methods, format the data and fit a model :

# --- Import methods
from leaspy import Leaspy, Data, AlgorithmSettings

# --- Format the data
data = Data.from_dataframe(df_train[['PUTAMEN']])

# --- Fit a model
leaspy_univariate = Leaspy('univariate_linear')
settings_fit = AlgorithmSettings('mcmc_saem', progress_bar=True, seed=0)
leaspy_univariate.fit(data, settings_fit)

 ==> Setting seed to 0
|##################################################|   10000/10000 iterations
The standard deviation of the noise at the end of the calibration is:
0.0215

Calibration took: 26s


Well itβs a bit slow, but here is a joke to wait:

• What did the triangle say to the circle? βYouβre pointless.β

β¦

Okay, it was short. Here is another one:

• I had an argument with a 90Β° angle. It turns out it was right.

π¬ Question 38 π¬ Run the following two cells to make the predictions:

settings_personalize = AlgorithmSettings('scipy_minimize', progress_bar=True, use_jacobian=True)

individual_parameters = leaspy_univariate.personalize(data, settings_personalize)

|##################################################|   200/200 subjects
The standard deviation of the noise at the end of the personalization is:
0.0184

Personalization scipy_minimize took: 20s

timepoints = {idx: df.loc[idx].index.get_level_values('TIME').values for idx in
df.index.get_level_values('ID').unique()}
estimates = leaspy_univariate.estimate(timepoints, individual_parameters)


π¬ Question 39 π¬ Add the predictions to df

df['Leaspy linear'] = float('nan')

for idx in df.index.unique('ID'):
df.loc[idx, 'Leaspy linear'] = estimates[idx]


π¬ Question 40 π¬ Compute and add the new rmse

compute_rmse_train_test(df, overall_results, 'Leaspy linear')

overall_results

train test
Linear Regression 0.091403 0.102130
Individual Linear Regression 0.017825 0.032197
Linear Mixed Effect Model 0.024533 0.039367
Leaspy linear 0.018394 0.027894

π¬ Question 41 π¬ Display the subjects of sublist :

# Your code

# ββββββββββββββββ #
# ββββββββββββββββ #

plot_individuals(df, overall_results, 'Leaspy linear', sublist=sublist)
plt.show()


## PART VII: A taste of the future - Non-linear mixed-effect model with LeaspyΒΆ

Again you will use an other model of leapsyβ¦

π¬ Question 42 π¬ Fit the model with the data.

leaspy_univariate = Leaspy('univariate_logistic')
settings_fit = AlgorithmSettings('mcmc_saem', progress_bar=True, seed=0)
leaspy_univariate.fit(data, settings_fit)

 ==> Setting seed to 0
|##################################################|   10000/10000 iterations
The standard deviation of the noise at the end of the calibration is:
0.0214

Calibration took: 27s


π¬ Question 43 π¬ Run the following two cells to make the predictions:

settings_personalize = AlgorithmSettings('scipy_minimize', progress_bar=True, use_jacobian=True)

individual_parameters = leaspy_univariate.personalize(data, settings_personalize)

|##################################################|   200/200 subjects
The standard deviation of the noise at the end of the personalization is:
0.0183

Personalization scipy_minimize took: 4s

timepoints = {idx: df.loc[idx].index.get_level_values('TIME').values for idx in
df.index.get_level_values('ID').unique()}
estimates = leaspy_univariate.estimate(timepoints, individual_parameters)


π¬ Question 44 π¬ Add the predictions to df

df['Leaspy logistic'] = float('nan')

for idx in df.index.unique('ID'):
df.loc[idx, 'Leaspy logistic'] = estimates[idx]


π¬ Question 45 π¬ Compute and add the new rmse

compute_rmse_train_test(df, overall_results, 'Leaspy logistic')

overall_results

train test
Linear Regression 0.091403 0.102130
Individual Linear Regression 0.017825 0.032197
Linear Mixed Effect Model 0.024533 0.039367
Leaspy linear 0.018394 0.027894
Leaspy logistic 0.018331 0.026822

π¬ Question 46 π¬ Display the subjects of sublist :

# Your code

# ββββββββββββββββ #
# ββββββββββββββββ #

plot_individuals(df, overall_results, 'Leaspy logistic', sublist=sublist)
plt.show()


π¬ Question 47 π¬ Check the average RMSE per subject depending of the number of timepoints per subjects

models = ['Individual Linear Regression',
'Linear Mixed Effect Model - V2',
'Leaspy linear',
'Leaspy logistic']

plot_rmse_by_number_of_visit(models, df)
plt.show()


Here we clearly see that the few subjects who have less than 6 timepoints are better reconstructed with a mixed effect model!

## For the fast-running Zebras: all-in-one with a more realistic split of dataΒΆ

Letβs try to split differently data so to mimic real-world applications.

We donβt want to fit again a new model every time we predict the future of a new patient. To this aim, we want to calibrate (fit) a model on a large and representative dataset once and then personalize this model to totally new individuals. We can then make forecasts about these new individuals.

The previous split was not taking into account this constraint and this extra part will go through it.

• Train part: all data from some individuals (used for calibration of model)

• Test part: new individuals

• Present data: partial data of these new individuals that will be known (used for personalization of model)

• Future data: hidden data from these individuals (not known during personalization, used for prediction)

π¬ Question 48 π¬ Split data differently so to respect real-world constraint

# be sure that ages are increasing
df.sort_index(inplace=True)

individuals = df.index.unique('ID')
n_individuals = len(individuals)

# split on individuals
fraction = .75
individuals_train, individuals_test = individuals[:int(fraction*n_individuals)], individuals[int(fraction*n_individuals):]

s_train = df.droplevel(-1).loc[individuals_train][['PUTAMEN']]
s_test = df.droplevel(-1).loc[individuals_test][['PUTAMEN']]

# we split again test set in 2 parts: present (known) / future (to predict)
s_test_future = s_test.groupby('ID').tail(2) # 2 last tpts
s_test_present = s_test.loc[s_test.index.difference(s_test_future.index)]

s_test.loc[s_test_future.index, 'PART'] = 'future'
s_test.loc[s_test_present.index, 'PART'] = 'present'

s_test.set_index('PART',append=True,inplace=True)

# check no intersection in individuals between train/test
assert len(s_train.index.unique('ID').intersection(s_test.index.unique('ID'))) == 0

# check no intersection of (individual, timepoints) between test present/future
assert len(s_test_present.index.intersection(s_test_future.index)) == 0

# Leaspy Data objects creation
data_train = Data.from_dataframe(s_train)
data_test = {
'all': Data.from_dataframe(s_test.droplevel('PART')), # all test data [present+future pooled]
'present': Data.from_dataframe(s_test_present),
'future': Data.from_dataframe(s_test_future)
}


Letβs check that distributions of ages and putamen values between train & test are similarβ¦ unlike previous split!

π¬ Question 49 π¬ Check consistence of new train/test split and compare to previous split

df_new_split = pd.concat({
'Train': s_train,
'Test': s_test.droplevel(-1)
}, names=['SPLIT'])

for which_split, df_split in {'Previous split': df, 'New split': df_new_split}.items():
fig, axs = plt.subplots(1, 2, figsize=(14,6))
fig.suptitle(which_split, fontsize=20, fontweight='bold')

for (var, title), ax in zip({'TIME': 'Ages', 'PUTAMEN': 'Putamen value'}.items(), axs):

sns.histplot(data=df_split.reset_index(), x=var,
hue='SPLIT', stat='density', common_norm=False,
ax.set_title(f'{title} - comparison of distributions')
ax.set_xlabel(title)

plt.show()

#sns.scatterplot(data=s_train.reset_index(),
#                x='TIME', y='PUTAMEN', alpha=.5, s=60)
#sns.scatterplot(data=s_test.reset_index(),
#                x='TIME', y='PUTAMEN', alpha=.5, s=60)


π¬ Question 50 π¬ Double check number of individuals and visits in each split of our dataset

# Some descriptive stats on number of visits in test set (present part)

pd.options.display.float_format = '{:.1f}'.format

print('Visits in each split')
print({'Train': len(s_train),
'All test': len(s_test),
'Test - known part': len(s_test_present),
'Test - to predict part': len(s_test_future)
})
print()

print('Visits per individual in each split')
pd.concat({
'Calibration': s_train.groupby('ID').size().describe(percentiles=[]), # min: 3 tpts known
'Personalization': s_test_present.groupby('ID').size().describe(percentiles=[]), # min: 1 tpt known
'Prediction': s_test_future.groupby('ID').size().describe(percentiles=[]), # min: 2 tpts to predict
}).unstack(0).rename({'count':'nb of individuals'}, axis=0)

Visits in each split
{'Train': 1499, 'All test': 498, 'Test - known part': 398, 'Test - to predict part': 100}

Visits per individual in each split

Calibration Personalization Prediction
nb of individuals 150.0 50.0 50.0
mean 10.0 8.0 2.0
std 2.5 2.5 0.0
min 3.0 1.0 2.0
50% 10.0 8.0 2.0
max 18.0 13.0 2.0

π¬ Question 51 π¬ Write a all-in-one personalization & prediction function thanks to Leaspy api

def personalize_model(leaspy_model, settings_perso):
"""
all-in-one function that:
1. takes a calibrated leaspy model,
2. personalizes it with different splits of data
3. and estimates prediction errors compared to real data (including the hidden parts)
"""

ips_test = {}
rmse_test = {}

# personalize on train part just to have a comparison (baseline)
ips_test['train'], rmse_test['train'] = leaspy_model.personalize(data_train, settings_perso,
return_noise=True)
print(f'RMSE on train: {100*rmse_test["train"]:.2f}%')

# personalize using different test parts
for test_part, data_test_part in data_test.items():
ips_test[test_part], rmse_test[test_part] = leaspy_model.personalize(data_test_part, settings_perso,
return_noise=True)
print(f'RMSE on test {test_part}: {100*rmse_test[test_part]:.2f}%')

# reconstruct using different personalizations made
s_train_ix = s_train.assign(PART='train').set_index('PART',append=True).index # with fake part added

all_recons_df = pd.concat({
test_part: leaspy_model.estimate(s_test.index if test_part != 'train' else s_train_ix,
ips_test_part)
for test_part, ips_test_part in ips_test.items()
}, names=['PERSO_ON']).reorder_levels([1,2,3,0])

true_vals_same_ix = all_recons_df[[]].join(df.droplevel(-1).iloc[:,0])
pred_errs = all_recons_df - true_vals_same_ix

# return everything that could be needed
return pred_errs, all_recons_df, ips_test, rmse_test


π¬ Question 52 π¬ Calibrate, personalize & predict with LMM, univariate linear & univariate logistic thanks to Leaspy

## Leaspy LMM (using integrated exact personalization formula)

leaspy_lmm = Leaspy('lme')
settings_fit = AlgorithmSettings('lme_fit',
with_random_slope_age=True,
force_independent_random_effects=True, # orthogonal rand. eff.
seed=0)

# ------- Compute population parameters on train data -------
# (fixed effects, noise level and var-covar of random effects)
leaspy_lmm.fit(data_train, settings_fit)

# ------- Compute individual parameters (random effects) on different splits -------
pred_errs_lmm, all_recons_df_lmm, _, _  = personalize_model(leaspy_lmm, AlgorithmSettings('lme_personalize'))

 ==> Setting seed to 0
The standard deviation of the noise at the end of the calibration is: 0.0218
RMSE on train: 1.98%
RMSE on test all: 1.91%
RMSE on test present: 1.88%
RMSE on test future: 1.61%

## Leaspy univariate linear
leaspy_lin = Leaspy('univariate_linear')
settings_fit = AlgorithmSettings('mcmc_saem', n_iter=8000, progress_bar=True, seed=0)
settings_perso = AlgorithmSettings('scipy_minimize', progress_bar=True, use_jacobian=True)

# ------- Compute population parameters on train data -------
leaspy_lin.fit(data_train, settings_fit)

# ------- Compute individual parameters on different splits -------
pred_errs_lin, all_recons_df_lin, _, _  = personalize_model(leaspy_lin, settings_perso)

 ==> Setting seed to 0
|##################################################|   8000/8000 iterations
The standard deviation of the noise at the end of the calibration is:
0.0216

Calibration took: 22s
|##################################################|   150/150 subjects
The standard deviation of the noise at the end of the personalization is:
0.0194

Personalization scipy_minimize took: 4s
RMSE on train: 1.94%
|##################################################|   50/50 subjects
The standard deviation of the noise at the end of the personalization is:
0.0186

Personalization scipy_minimize took: 1s
RMSE on test all: 1.86%
|##################################################|   50/50 subjects
The standard deviation of the noise at the end of the personalization is:
0.0178

Personalization scipy_minimize took: 2s
RMSE on test present: 1.78%
|##################################################|   50/50 subjects
The standard deviation of the noise at the end of the personalization is:
0.0162

Personalization scipy_minimize took: 1s
RMSE on test future: 1.62%

## Leaspy univariate logistic
leaspy_log = Leaspy('univariate_logistic')

# Calibrate
leaspy_log.fit(data_train, settings_fit)

# Personalize
pred_errs_log, all_recons_df_log, _, _  = personalize_model(leaspy_log, settings_perso)

 ==> Setting seed to 0
|##################################################|   8000/8000 iterations
The standard deviation of the noise at the end of the calibration is:
0.0215

Calibration took: 30s
|##################################################|   150/150 subjects
The standard deviation of the noise at the end of the personalization is:
0.0194

Personalization scipy_minimize took: 5s
RMSE on train: 1.94%
|##################################################|   50/50 subjects
The standard deviation of the noise at the end of the personalization is:
0.0184

Personalization scipy_minimize took: 1s
RMSE on test all: 1.84%
|##################################################|   50/50 subjects
The standard deviation of the noise at the end of the personalization is:
0.0177

Personalization scipy_minimize took: 2s
RMSE on test present: 1.77%
|##################################################|   50/50 subjects
The standard deviation of the noise at the end of the personalization is:
0.0162

Personalization scipy_minimize took: 1s
RMSE on test future: 1.62%


π¬ Question 53 π¬ Display RMSE on all splits of data depending on models & part where models were personalized on

# reconstruct using different personalizations made
pd.options.display.float_format = '{:.5f}'.format

all_pred_errs = pd.concat({
'lmm': pred_errs_lmm,
'leaspy_linear': pred_errs_lin,
'leaspy_logistic': pred_errs_log,
}, names=['MODEL'])

print('RMSE by:\n- model\n- part of data for personalization\n- part of data for reconstruction\n')

rmses_by_perso_split = (all_pred_errs**2).groupby(['MODEL','PERSO_ON','PART']).mean() ** .5
rmses_by_perso_split.unstack('MODEL').droplevel(0, axis=1).sort_index(ascending=[True,False]).rename_axis(index={'PART':'RECONS_ON'})

RMSE by:
- model
- part of data for personalization
- part of data for reconstruction

MODEL lmm leaspy_linear leaspy_logistic
PERSO_ON RECONS_ON
all present 0.01897 0.01841 0.01821
future 0.01944 0.01910 0.01913
future present 0.04730 0.04808 0.04923
future 0.01609 0.01618 0.01616
present present 0.01880 0.01776 0.01775
future 0.02692 0.02666 0.02538
train train 0.01979 0.01942 0.01941

π¬ Question 54 π¬ Plot distributions of absolute errors

def plot_dist_errs(pred_errs, model_name, grouping_reverse=False):
plt.figure(figsize=(14,6))
plt.grid(True,axis='both',zorder=-1)

plot_opts = dict(
x='PART', order=['train','present','future'],
hue='PERSO_ON', hue_order=['all','present','future','train'],
)
x_lbl = 'Part of test set we look errors at'
legend_lbl = 'Personalization on\nwhich test part?'

if grouping_reverse:
plot_opts = dict(
hue='PART', hue_order=['train','present','future'],
x='PERSO_ON', order=['all','present','future','train'],
)
x_lbl, legend_lbl = legend_lbl, x_lbl # swap

sns.boxplot(data=pred_errs.abs().reset_index(),
y='PUTAMEN', **plot_opts, showfliers=False)

plt.ylabel(f'Distribution of {model_name} absolute errors', fontsize=14)

plt.xlabel(x_lbl, fontsize=14)
plt.legend().set_title(legend_lbl)

plot_dist_errs(pred_errs_lmm, 'LMM')
plot_dist_errs(pred_errs_lmm, 'LMM', True)


π¬ Question 55 π¬ Plot some individual trajectories with their associated predicted trajectories

idx_list = np.random.RandomState(5).choice(s_test.index.unique('ID'), 6).tolist()

def plot_recons(all_recons_df, model_name):

recontruction_and_raw_data = pd.concat({'truth': s_test,
'prediction': all_recons_df.xs('present', level='PERSO_ON')
}, names=['LABEL']).swaplevel(0,1).sort_index()

plt.figure(figsize=(14,6))

sns.lineplot(data=recontruction_and_raw_data.loc[idx_list].reset_index(),
x='TIME', y='PUTAMEN', hue='ID', style='LABEL', style_order=['truth','prediction'])

sns.scatterplot(data=s_test_future.loc[idx_list].reset_index(),
x='TIME', y='PUTAMEN', hue='ID', legend=None)

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

title = f'PUTAMEN - Raw data vs {model_name} (test known/to predict)'
plt.title(title)
plt.grid(True)
plt.show()

plot_recons(all_recons_df_lmm, 'LMM')
#plot_recons(all_recons_df_lin, 'Leaspy linear (univ.)')
#plot_recons(all_recons_df_log, 'Leaspy logistic (univ.)')


## Bonus: have a look at the LMM log-likelihood landscapeΒΆ

import statsmodels.formula.api as smf
from tqdm import tqdm

import site

from loglikelihood_landspace_lmm import plt_ll_landscape, plt_ll_landscape_

# new model (TIME_norm)
lmm_model = smf.mixedlm(formula="PUTAMEN ~ 1 + TIME_NORMALIZED",
data=df.xs('train',level='SPLIT').reset_index(),
groups="ID", re_formula="~ 1 + TIME_NORMALIZED")

lmm_model.cov_pen = None

view = dict(
vmin=2200,
vmid=2500,
vmax=2630,
levels=[2500, 2600, 2620]
)

Rs = {} # placeholder for results


π¬ Question 56 π¬ Plot log-likelihood landscapes for LMM model, depending on variance-covariance matrix of random effects

for fix_k, fix_v in tqdm([
# corr fixed
('corr', -.99),
('corr', -.9),
('corr', -.75),
('corr', -.5),
('corr', -.25), # ~best
('corr', 0),
('corr', .5),
('corr', .75),
('corr', .99),
# sx fixed
('sx', 1.),
('sx', 5),
('sx', 10), # ~best
('sx', 20),
('sx', 40),
# sy fixed
('sy', 1),
('sy', 6), # ~best
('sy', 10),
('sy', 20),
]):
if (fix_k, fix_v) in Rs:
plt_ll_landscape_(Rs[(fix_k, fix_v)], **view)
else:
Rs[(fix_k, fix_v)] = plt_ll_landscape(lmm_model, N=50, **{fix_k: fix_v}, **view)

  0%|          | 0/18 [00:00<?, ?it/s]

  6%|β         | 1/18 [01:23<23:39, 83.51s/it]

 11%|β         | 2/18 [02:58<24:06, 90.39s/it]

 17%|ββ        | 3/18 [04:43<24:16, 97.10s/it]

 22%|βββ       | 4/18 [06:01<20:53, 89.53s/it]

 28%|βββ       | 5/18 [07:27<19:04, 88.05s/it]

 33%|ββββ      | 6/18 [08:53<17:31, 87.61s/it]

 39%|ββββ      | 7/18 [10:06<15:08, 82.59s/it]

 44%|βββββ     | 8/18 [11:19<13:17, 79.72s/it]

 50%|βββββ     | 9/18 [12:31<11:35, 77.33s/it]/Users/etienne.maheux/Documents/repos/disease-course-mapping-solutions/TP1_LMM/utils/loglikelihood_landspace_lmm.py:130: UserWarning: No contour levels were found within the data range.
colors='black', alpha=.8, linewidths=1)

 56%|ββββββ    | 10/18 [13:44<10:07, 75.94s/it]

 61%|ββββββ    | 11/18 [14:58<08:46, 75.22s/it]

 67%|βββββββ   | 12/18 [16:11<07:27, 74.57s/it]

 72%|ββββββββ  | 13/18 [17:24<06:10, 74.09s/it]

 78%|ββββββββ  | 14/18 [18:38<04:56, 74.21s/it]

 83%|βββββββββ | 15/18 [19:56<03:45, 75.18s/it]

 89%|βββββββββ | 16/18 [21:14<02:32, 76.14s/it]

 94%|ββββββββββ| 17/18 [22:27<01:15, 75.09s/it]

100%|ββββββββββ| 18/18 [23:52<00:00, 79.58s/it]