Binary vs. Multiclass Classification
Dataset Used in Tutorials
Use the shared synthetic dataset defined in the tutorials overview. Generate it once there and reuse it here.
The same Sklong workflow supports both binary and multiclass classification. In practice, the main changes are:
| Aspect | Binary classification | Multiclass classification |
|---|---|---|
| Target labels | Two labels such as 0 and 1 | Three or more labels such as 0, 1, and 2 |
| predict_proba shape | (n_samples, 2) | (n_samples, n_classes) |
| classes_ | Two class labels | One entry per class |
| AUPRC | Usually computed from the positive-class scores | Usually computed with macro, weighted, or micro averaging |
The estimators below support both binary and multiclass targets:
- LexicoDecisionTreeClassifier
- LexicoRandomForestClassifier
- LexicoGradientBoostingClassifier
- LexicoDeepForestClassifier
- NestedTreesClassifier
- SepWav with voting or stacking
Little actually changes between the two settings: the fitting workflow is identical, classes_ simply lists every observed label, and predict_proba gains one extra column per added class.
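To make the link between classes_ and the predict_proba columns concrete, here is a minimal sketch in plain Python. It uses made-up probabilities rather than a fitted Sklong estimator, and it assumes the usual scikit-learn convention that column j of predict_proba belongs to classes_[j]. The helper name labels_from_proba is hypothetical and is not part of the library.

```python
def labels_from_proba(proba, classes):
    """Return, for each row, the label whose column holds the highest probability."""
    labels = []
    for row in proba:
        if len(row) != len(classes):
            raise ValueError("each row needs exactly one probability per class")
        # Index of the largest probability in this row.
        best_column = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_column])
    return labels


# Made-up scores for two samples; not the output of any real model.
binary_proba = [[0.8, 0.2], [0.3, 0.7]]            # shape (2, 2)
multiclass_proba = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]  # shape (2, 3)

print(labels_from_proba(binary_proba, [0, 1]))
print(labels_from_proba(multiclass_proba, [0, 1, 2]))
```

The same helper handles both settings because only the number of columns differs.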
Step 1: Binary classification
This first example uses the original stroke_w2 target from the tutorial dataset.
from sklearn.metrics import accuracy_score

from scikit_longitudinal import auprc_score
from scikit_longitudinal.data_preparation import LongitudinalDataset
from scikit_longitudinal.estimators.ensemble import LexicoRandomForestClassifier

# Load the tutorial dataset and split it on the binary target.
binary_dataset = LongitudinalDataset("./extended_stroke_longitudinal.csv")
binary_dataset.load_data_target_train_test_split(
    target_column="stroke_w2",
    test_size=0.2,
    random_state=42,
)

# Each inner list holds the column indices of one feature across its waves.
binary_dataset.setup_features_group([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]])

binary_clf = LexicoRandomForestClassifier(
    features_group=binary_dataset.feature_groups(),
    n_estimators=100,
    random_state=42,
)
binary_clf.fit(binary_dataset.X_train, binary_dataset.y_train)

binary_pred = binary_clf.predict(binary_dataset.X_test)
binary_proba = binary_clf.predict_proba(binary_dataset.X_test)

print(binary_clf.classes_)  # Example: [0 1]
print(binary_proba.shape)  # Example: (100, 2)
print(accuracy_score(binary_dataset.y_test, binary_pred))
# Pass only the positive-class column (second column when classes_ is [0 1]).
print(auprc_score(binary_dataset.y_test, binary_proba[:, 1]))
Note
In the binary case, predict_proba returns two columns ordered according to classes_. When you compute AUPRC from a one-dimensional score vector, pass the scores for the positive class, which is usually the second column.
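Hardcoding the second column works when the positive label is the second entry of classes_. As a small sketch of a more defensive alternative, the hypothetical helper below looks the column up from the class list instead. It is written in plain Python with made-up values and is not part of Sklong.

```python
def positive_class_scores(proba, classes, positive_label=1):
    """Extract the score column that belongs to `positive_label`."""
    # list.index raises ValueError if the label is not among the classes.
    column = list(classes).index(positive_label)
    return [row[column] for row in proba]


# Made-up probabilities for two samples.
proba = [[0.8, 0.2], [0.3, 0.7]]

print(positive_class_scores(proba, classes=[0, 1]))
print(positive_class_scores(proba, classes=["yes", "no"], positive_label="yes"))
```

With a fitted estimator, the same idea would amount to looking up the positive label in classes_ before slicing predict_proba.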
Step 2: Create a multiclass target from the same longitudinal table
To compare like for like, we can derive a three-class risk target from the same wave-2 measurements.
import pandas as pd

df = pd.read_csv("./extended_stroke_longitudinal.csv")

# Sum the four wave-2 measurements into a single risk score.
risk_score = (
    df["smoke_w2"]
    + df["cholesterol_w2"]
    + df["blood_pressure_w2"]
    + df["diabetes_w2"]
)

# Right-closed bins: (-1, 0] -> 0, (0, 2] -> 1, (2, 4] -> 2.
df["risk_group_w2"] = pd.cut(
    risk_score,
    bins=[-1, 0, 2, 4],
    labels=[0, 1, 2],
).astype(int)

df.to_csv("./extended_stroke_multiclass_longitudinal.csv", index=False)
Here the derived classes are:
- 0: low risk
- 1: medium risk
- 2: high risk
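The binning rule can be checked by hand with a short plain-Python sketch. It assumes the risk score is an integer between 0 and 4, which holds only if the four wave-2 columns are 0/1 indicators (an assumption about the tutorial dataset, not something stated here). The function name risk_group is hypothetical. It mirrors the right-closed intervals that pd.cut uses by default, except that it raises on out-of-range scores where pd.cut would produce a missing value.

```python
from bisect import bisect_left

BIN_EDGES = [-1, 0, 2, 4]  # same edges as in the pd.cut call
LABELS = [0, 1, 2]         # low, medium, high risk


def risk_group(score):
    """Map a risk score to its class using right-closed bins (a, b]."""
    if not BIN_EDGES[0] < score <= BIN_EDGES[-1]:
        raise ValueError(f"score {score} falls outside the bins")
    # bisect_left over the upper edges returns the first bin whose
    # upper edge is >= score, which is the right-closed bin we want.
    return LABELS[bisect_left(BIN_EDGES[1:], score)]


for score in range(5):
    print(score, "->", risk_group(score))
```

Under that assumption, a score of 0 maps to low risk, 1 or 2 to medium risk, and 3 or 4 to high risk.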
Step 3: Multiclass classification
The fitting workflow stays almost identical. The main difference is that the target now contains three labels and predict_proba returns three columns.
from sklearn.metrics import accuracy_score, classification_report

from scikit_longitudinal import auprc_score
from scikit_longitudinal.data_preparation import LongitudinalDataset
from scikit_longitudinal.estimators.ensemble import LexicoRandomForestClassifier

# Load the table written in Step 2 and split it on the three-class target.
multiclass_dataset = LongitudinalDataset("./extended_stroke_multiclass_longitudinal.csv")
multiclass_dataset.load_data_target_train_test_split(
    target_column="risk_group_w2",
    test_size=0.2,
    random_state=42,
)

# Same feature groups as in the binary example.
multiclass_dataset.setup_features_group([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]])

multiclass_clf = LexicoRandomForestClassifier(
    features_group=multiclass_dataset.feature_groups(),
    n_estimators=100,
    random_state=42,
)
multiclass_clf.fit(multiclass_dataset.X_train, multiclass_dataset.y_train)

multiclass_pred = multiclass_clf.predict(multiclass_dataset.X_test)
multiclass_proba = multiclass_clf.predict_proba(multiclass_dataset.X_test)

print(multiclass_clf.classes_)  # Example: [0 1 2]
print(multiclass_proba.shape)  # Example: (100, 3)
print(accuracy_score(multiclass_dataset.y_test, multiclass_pred))
print(classification_report(multiclass_dataset.y_test, multiclass_pred))
# Pass the full score matrix together with an averaging strategy.
print(auprc_score(multiclass_dataset.y_test, multiclass_proba, average="macro"))
Note
In the multiclass case, auprc_score expects the full two-dimensional score matrix and an averaging strategy such as macro, weighted, micro, or None.
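To show what the averaging strategies mean, the sketch below computes a one-vs-rest average precision per class in plain Python and then combines the per-class values. It is an illustration of the general idea, not the implementation behind auprc_score. The function names are hypothetical, the data is made up, and micro averaging is left out for brevity.

```python
def average_precision(y_true, scores):
    """Step-wise average precision for binary labels (1 = positive)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp, ap, prev_recall, i = 0, 0.0, 0.0, 0
    while i < len(order):
        # Treat all samples tied at this score as a single threshold.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += y_true[order[j]]
            j += 1
        precision, recall = tp / j, tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall, i = recall, j
    return ap


def multiclass_auprc(y_true, proba, classes, average="macro"):
    """One-vs-rest average precision, combined as macro, weighted, or None."""
    per_class, supports = [], []
    for column, label in enumerate(classes):
        y_bin = [1 if y == label else 0 for y in y_true]
        per_class.append(average_precision(y_bin, [row[column] for row in proba]))
        supports.append(sum(y_bin))
    if average is None:
        return per_class  # one value per class, no aggregation
    if average == "macro":
        return sum(per_class) / len(per_class)  # every class counts equally
    if average == "weighted":
        total = sum(supports)  # classes weighted by their sample count
        return sum(ap * s for ap, s in zip(per_class, supports)) / total
    raise ValueError(f"unsupported average: {average!r}")


# Made-up labels and scores for four samples and three classes.
y = [0, 1, 2, 2]
proba = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.1, 0.2, 0.7], [0.2, 0.5, 0.3]]
for strategy in (None, "macro", "weighted"):
    print(strategy, multiclass_auprc(y, proba, [0, 1, 2], average=strategy))
```

Macro averaging treats every class equally, while weighted averaging lets frequent classes dominate, so the two can diverge noticeably on imbalanced targets.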
Step 4: The same multiclass target also works with SepWav
If you prefer wave-wise ensembling, the multiclass target can also be used with SepWav.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from scikit_longitudinal.data_preparation import SepWav
from scikit_longitudinal.estimators.ensemble.longitudinal_voting.longitudinal_voting import (
    LongitudinalEnsemblingStrategy,
)

# Reuses multiclass_dataset from Step 3.
sepwav = SepWav(
    estimator=RandomForestClassifier(max_depth=5, random_state=42),
    features_group=multiclass_dataset.feature_groups(),
    non_longitudinal_features=multiclass_dataset.non_longitudinal_features(),
    feature_list_names=multiclass_dataset.data.columns.tolist(),
    voting=LongitudinalEnsemblingStrategy.STACKING,
    stacking_meta_learner=LogisticRegression(max_iter=200),
)
sepwav.fit(multiclass_dataset.X_train, multiclass_dataset.y_train)

sepwav_proba = sepwav.predict_proba(multiclass_dataset.X_test)

print(sepwav.classes_)  # Example: [0 1 2]
print(sepwav_proba.shape)  # Example: (100, 3)
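For intuition about wave-wise ensembling with multiclass labels, here is a simplified majority-vote sketch in plain Python. It is not SepWav's implementation, whose voting and stacking strategies may differ in detail. The function name, the tie-breaking rule (smallest label wins), and the per-wave predictions are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(per_wave_predictions):
    """Combine label predictions from several per-wave models.

    `per_wave_predictions` holds one list of labels per wave model;
    all lists must cover the same samples in the same order.
    """
    n_samples = len(per_wave_predictions[0])
    if any(len(preds) != n_samples for preds in per_wave_predictions):
        raise ValueError("every wave must predict the same number of samples")
    voted = []
    for i in range(n_samples):
        counts = Counter(preds[i] for preds in per_wave_predictions)
        top = max(counts.values())
        # Ties are broken in favour of the smallest label (an arbitrary choice).
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Made-up predictions from three hypothetical per-wave classifiers.
wave_predictions = [
    [0, 1, 2],
    [0, 2, 2],
    [1, 1, 0],
]
print(majority_vote(wave_predictions))
```

Nothing in the voting step depends on the number of classes, which is why the same ensembling idea carries over from binary to multiclass targets.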