Skip to content

Tutorials Overview

Welcome to the Sklong tutorials. This section walks through the core workflows, from representing longitudinal data to training estimators and building pipelines. If you are new to the library, start with temporal dependencies and data format before moving into the applied tutorials.

In order to visualise what the library delivers, the figure below shows the high-level flow from raw longitudinal data to the main Sklong available to-pick components.

Scikit-Longitudinal Banner
Click the image to expand it.

List of Tutorials

  • Temporal Dependency


    Learn how to set up temporal dependencies using features_group and non_longitudinal_features. Essential for all Sklong usage.

    Read the tutorial

  • Longitudinal Data Format


    Understand wide vs. long formats and why Sklong prefers wide. Includes loading and preparing data.

    Read the tutorial

  • Long ⇄ Wide Reshape


    Pivot messy long-format cohorts into the wide layout Sklong expects (and back) using LongitudinalDataset.to_wide / to_long.

    Read the tutorial

  • Advanced Feature Group (Temporal) Setup


    Handle uneven numbers of observations per subject, including missing waves and padded feature groups.

    Read the tutorial

  • Data Preparation: Flatten Temporal Dependency for Scikit-Learn Estimators


    Flatten longitudinal structure and plug it into standard estimators using transformations like AggrFunc.

    Read the tutorial

  • Algorithm Adaptation: Preserve Temporal Dependency for Sklong Estimators


    Fit and predict with a longitudinal-aware estimator like LexicoDecisionTreeClassifier.

    Read the tutorial

  • Binary vs. Multiclass Classification


    Compare the same longitudinal workflow across binary and multiclass targets, including predict_proba, classes_, and AUPRC evaluation.

    Read the tutorial

  • Pipelines: Mix Longitudinal Components


    Build a full pipeline combining transformation, preprocessing, and estimation steps.

    Read the tutorial

  • Hyperparameter Tuning: Grid vs. Random Search


    Compare grid search and random search for tuning LexicoRandomForestClassifier hyperparameters.

    Read the tutorial

  • Automated Machine Learning (CASH)


    Automate model selection and tuning across pipelines with Auto-Sklong.

    Jump onto Auto-Sklong

Dataset Used in Tutorials

All tutorials share the same synthetic health-inspired longitudinal dataset. Generate it once and reuse it across lessons:

import pandas as pd
import numpy as np

n_rows = 500
columns = [
    'age', 'gender',
    'smoke_w1', 'smoke_w2',
    'cholesterol_w1', 'cholesterol_w2',
    'blood_pressure_w1', 'blood_pressure_w2',
    'diabetes_w1', 'diabetes_w2',
    'exercise_w1', 'exercise_w2',
    'obesity_w1', 'obesity_w2',
    'stroke_w2'
]

data = []
for _ in range(n_rows):
    row = {
        'age': np.random.randint(40, 71),
        'gender': np.random.choice([0, 1]),
    }
    for feature in ['smoke', 'cholesterol', 'blood_pressure', 'diabetes', 'exercise', 'obesity']:
        w1 = np.random.choice([0, 1], p=[0.7, 0.3])
        w2 = np.random.choice([0, 1], p=[0.2, 0.8]) if w1 == 1 else np.random.choice([0, 1], p=[0.9, 0.1])
        row[f'{feature}_w1'] = w1
        row[f'{feature}_w2'] = w2

    stroke_risk = row['smoke_w2'] == 1 or row['cholesterol_w2'] == 1 or row['blood_pressure_w2'] == 1
    p_stroke = 0.2 if stroke_risk else 0.05
    row['stroke_w2'] = np.random.choice([0, 1], p=[1 - p_stroke, p_stroke])
    data.append(row)

df = pd.DataFrame(data)
csv_file = './extended_stroke_longitudinal.csv'
df.to_csv(csv_file, index=False)
print(f"Extended CSV file '{csv_file}' created successfully.")
age gender smoke_w1 smoke_w2 cholesterol_w1 cholesterol_w2 blood_pressure_w1 blood_pressure_w2 diabetes_w1 diabetes_w2 exercise_w1 exercise_w2 obesity_w1 obesity_w2 stroke_w2
66 0 0 1 0 0 0 0 1 1 0 1 0 0 0
59 0 0 0 1 1 0 0 1 1 1 1 1 1 1
63 0 0 0 1 1 0 0 0 0 0 0 0 0 1