Binary vs. Multiclass Classification

Dataset Used in Tutorials

Use the shared synthetic dataset defined in the tutorials overview. Generate it once there and reuse it here.

The same Sklong workflow supports both binary and multiclass classification. In practice, the main changes are:

| Aspect | Binary classification | Multiclass classification |
| --- | --- | --- |
| Target labels | Two labels, such as 0 and 1 | Three or more labels, such as 0, 1, and 2 |
| predict_proba shape | (n_samples, 2) | (n_samples, n_classes) |
| classes_ | Two class labels | One entry per class |
| AUPRC | Usually computed from the positive-class scores | Usually computed with macro, weighted, or micro averaging |
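
The shape difference in the table can be checked with any estimator that follows the scikit-learn classifier API. The sketch below uses scikit-learn's plain RandomForestClassifier on synthetic data as a stand-in. That the Lexico estimators behave the same way is an assumption here, although the table suggests they do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def fit_and_describe(n_classes):
    """Fit a small forest and return (classes_, predict_proba output)."""
    X, y = make_classification(
        n_samples=200,
        n_features=8,
        n_informative=4,
        n_classes=n_classes,
        random_state=0,
    )
    clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
    return clf.classes_, clf.predict_proba(X)


# Stand-in estimator, not a Sklong one: only the output shapes matter here.
binary_classes, binary_proba = fit_and_describe(n_classes=2)
multi_classes, multi_proba = fit_and_describe(n_classes=3)

print(binary_classes, binary_proba.shape)  # two labels, shape (200, 2)
print(multi_classes, multi_proba.shape)    # three labels, shape (200, 3)
```

In both cases each row of predict_proba sums to 1, and the number of columns equals len(classes_).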

The estimators below support both binary and multiclass targets:

  • LexicoDecisionTreeClassifier
  • LexicoRandomForestClassifier
  • LexicoGradientBoostingClassifier
  • LexicoDeepForestClassifier
  • NestedTreesClassifier
  • SepWav with voting or stacking

The animation below summarises what actually changes between the two settings: the fitting workflow is identical, classes_ simply lists every observed label, and predict_proba grows one extra column per added class.

[Animation: Binary vs. multiclass: same workflow, wider predict_proba]

Step 1: Binary classification

This first example uses the original stroke_w2 target from the tutorial dataset.

from sklearn.metrics import accuracy_score

from scikit_longitudinal import auprc_score
from scikit_longitudinal.data_preparation import LongitudinalDataset
from scikit_longitudinal.estimators.ensemble import LexicoRandomForestClassifier

binary_dataset = LongitudinalDataset("./extended_stroke_longitudinal.csv")
binary_dataset.load_data_target_train_test_split(
    target_column="stroke_w2",
    test_size=0.2,
    random_state=42,
)
# Each inner list holds the column indices of one longitudinal attribute across waves.
binary_dataset.setup_features_group([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]])

binary_clf = LexicoRandomForestClassifier(
    features_group=binary_dataset.feature_groups(),
    n_estimators=100,
    random_state=42,
)
binary_clf.fit(binary_dataset.X_train, binary_dataset.y_train)

binary_pred = binary_clf.predict(binary_dataset.X_test)
binary_proba = binary_clf.predict_proba(binary_dataset.X_test)

print(binary_clf.classes_) # Example: [0 1]
print(binary_proba.shape) # Example: (100, 2)
print(accuracy_score(binary_dataset.y_test, binary_pred))
print(auprc_score(binary_dataset.y_test, binary_proba[:, 1]))
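
The index pairs passed to setup_features_group are easy to mistype by hand. As a rough sketch, they can be derived from column names when waves are marked with a `_w<number>` suffix. The helper name `groups_from_wave_suffix` and the column layout below are hypothetical: only the `*_w2` names used later in this tutorial come from the text, and the rest are placeholders chosen so that the result matches the indices above.

```python
import re
from collections import defaultdict

# Hypothetical feature columns (target already removed); adjust to your data.
columns = [
    "sex", "age",
    "smoke_w1", "smoke_w2",
    "cholesterol_w1", "cholesterol_w2",
    "blood_pressure_w1", "blood_pressure_w2",
    "diabetes_w1", "diabetes_w2",
    "exercise_w1", "exercise_w2",
    "obesity_w1", "obesity_w2",
]


def groups_from_wave_suffix(names):
    """Group column indices by attribute name, ordered by wave number."""
    by_attribute = defaultdict(list)
    for index, name in enumerate(names):
        match = re.fullmatch(r"(.+)_w(\d+)", name)
        if match:  # columns without a wave suffix are treated as non-longitudinal
            by_attribute[match.group(1)].append((int(match.group(2)), index))
    return [[index for _, index in sorted(waves)] for waves in by_attribute.values()]


features_group = groups_from_wave_suffix(columns)
print(features_group)
# [[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]]
```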

Note

In the binary case, predict_proba returns two columns ordered according to classes_. When you compute AUPRC from a one-dimensional score vector, pass the column that corresponds to the positive class. With classes_ equal to [0, 1], that is the second column, binary_proba[:, 1].
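
Hard-coding column 1 works when the positive label sorts last, but it can silently pick the wrong column otherwise. A minimal sketch of looking the column up through classes_ instead is shown below. The helper `positive_class_scores` is hypothetical and not part of Sklong, and the toy arrays stand in for a fitted estimator's output.

```python
import numpy as np


def positive_class_scores(proba, classes, positive_label):
    """Return the predict_proba column that belongs to positive_label."""
    classes = np.asarray(classes)
    matches = np.flatnonzero(classes == positive_label)
    if matches.size != 1:
        raise ValueError(
            f"{positive_label!r} not found exactly once in {classes.tolist()}"
        )
    return np.asarray(proba)[:, matches[0]]


proba = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])

# Integer labels: the positive class 1 sits in column 1.
scores_int = positive_class_scores(proba, classes=[0, 1], positive_label=1)
print(scores_int)  # [0.1 0.7 0.4]

# String labels: "case" sorts before "control", so it sits in column 0.
scores_str = positive_class_scores(
    proba, classes=["case", "control"], positive_label="case"
)
print(scores_str)  # [0.9 0.3 0.6]
```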

Step 2: Create a multiclass target from the same longitudinal table

To compare like for like, we can derive a three-class risk target from the same wave-2 measurements.

import pandas as pd

df = pd.read_csv("./extended_stroke_longitudinal.csv")

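# Sum four wave-2 indicators into a single score.
# The bins below assume each indicator is coded 0/1, so the score lies in 0..4.
risk_score = (
    df["smoke_w2"]
    + df["cholesterol_w2"]
    + df["blood_pressure_w2"]
    + df["diabetes_w2"]
)

# Right-closed bins: (-1, 0] -> 0, (0, 2] -> 1, (2, 4] -> 2.
df["risk_group_w2"] = pd.cut(
    risk_score,
    bins=[-1, 0, 2, 4],
    labels=[0, 1, 2],
).astype(int)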

df.to_csv("./extended_stroke_multiclass_longitudinal.csv", index=False)

Here the derived classes are:

  • 0: low risk
  • 1: medium risk
  • 2: high risk
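
To make the binning concrete, the sketch below applies the same bins to every score from 0 to 4. It assumes each of the four wave-2 indicators is coded 0/1, which is an assumption about the tutorial dataset rather than something stated on this page. It also shows that a score outside the bins becomes missing, in which case the integer cast in Step 2 would be expected to fail.

```python
import pandas as pd

BINS = [-1, 0, 2, 4]  # right-closed intervals: (-1, 0], (0, 2], (2, 4]
LABELS = [0, 1, 2]    # low, medium, high risk

scores = pd.Series([0, 1, 2, 3, 4])
groups = pd.cut(scores, bins=BINS, labels=LABELS).astype(int)
mapping = dict(zip(scores.tolist(), groups.tolist()))
print(mapping)
# {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}

# A score of 5 falls outside every interval and is returned as NaN.
out_of_range = pd.cut(pd.Series([5]), bins=BINS, labels=LABELS)
print(out_of_range.isna().tolist())  # [True]
```

Checking `isna().any()` on the binned column before the cast is a cheap guard if your indicators are not strictly 0/1.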

Step 3: Multiclass classification

The fitting workflow stays almost identical. The main difference is that the target now contains three labels and predict_proba returns three columns.

from sklearn.metrics import accuracy_score, classification_report

from scikit_longitudinal import auprc_score
from scikit_longitudinal.data_preparation import LongitudinalDataset
from scikit_longitudinal.estimators.ensemble import LexicoRandomForestClassifier

multiclass_dataset = LongitudinalDataset("./extended_stroke_multiclass_longitudinal.csv")
multiclass_dataset.load_data_target_train_test_split(
    target_column="risk_group_w2",
    test_size=0.2,
    random_state=42,
)
multiclass_dataset.setup_features_group([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]])

multiclass_clf = LexicoRandomForestClassifier(
    features_group=multiclass_dataset.feature_groups(),
    n_estimators=100,
    random_state=42,
)
multiclass_clf.fit(multiclass_dataset.X_train, multiclass_dataset.y_train)

multiclass_pred = multiclass_clf.predict(multiclass_dataset.X_test)
multiclass_proba = multiclass_clf.predict_proba(multiclass_dataset.X_test)

print(multiclass_clf.classes_) # Example: [0 1 2]
print(multiclass_proba.shape) # Example: (100, 3)
print(accuracy_score(multiclass_dataset.y_test, multiclass_pred))
print(classification_report(multiclass_dataset.y_test, multiclass_pred))
print(auprc_score(multiclass_dataset.y_test, multiclass_proba, average="macro"))
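
The columns of the probability matrix line up with classes_, so column j holds the score for classes_[j]. The sketch below uses small hand-written arrays in place of multiclass_clf.classes_ and multiclass_proba to show how labels can be recovered from the matrix. Whether a given Sklong estimator's predict matches this argmax rule exactly (for example, on ties) is an assumption.

```python
import numpy as np

classes = np.array([0, 1, 2])  # stands in for multiclass_clf.classes_
proba = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6],
        [0.2, 0.5, 0.3],
    ]
)  # stands in for multiclass_proba

# Sanity checks: one column per class, and each row is a distribution.
assert proba.shape[1] == len(classes)
assert np.allclose(proba.sum(axis=1), 1.0)

# Pick, for each row, the label whose column has the highest probability.
labels_from_proba = classes[proba.argmax(axis=1)]
print(labels_from_proba)  # [0 2 1]
```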

Note

In the multiclass case, auprc_score expects the full two-dimensional score matrix and an averaging strategy such as macro, weighted, micro, or None.
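
The averaging strategies differ only in how per-class scores are combined. The sketch below illustrates this with scikit-learn's average_precision_score on one-hot targets, which is a step-wise summary of the precision-recall curve. It is not Sklong's auprc_score, and it is an assumption that auprc_score combines classes in the same way. The toy labels and scores are synthetic.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
classes = [0, 1, 2]
y_true = np.repeat(classes, [30, 20, 10])  # imbalanced on purpose

# Toy scores: random noise plus a bump on the true class, normalised per row.
scores = rng.random((y_true.size, len(classes)))
scores[np.arange(y_true.size), y_true] += 0.5
scores /= scores.sum(axis=1, keepdims=True)

# One-vs-rest view of the target: one 0/1 column per class.
y_onehot = label_binarize(y_true, classes=classes)

per_class = average_precision_score(y_onehot, scores, average=None)
macro = average_precision_score(y_onehot, scores, average="macro")
weighted = average_precision_score(y_onehot, scores, average="weighted")
micro = average_precision_score(y_onehot, scores, average="micro")

support = y_onehot.sum(axis=0)  # number of true samples per class
print("per class:", per_class)
print("macro    :", macro, "(unweighted mean of per-class scores)")
print("weighted :", weighted, "(mean weighted by class support)", support)
print("micro    :", micro, "(all class/sample pairs pooled)")
```

Macro treats every class equally, weighted favours frequent classes, and micro pools all decisions before scoring, so the three can diverge noticeably on imbalanced targets.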

Step 4: The same multiclass target also works with SepWav

If you prefer wave-wise ensembling, the multiclass target can also be used with SepWav.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from scikit_longitudinal.data_preparation import SepWav
from scikit_longitudinal.estimators.ensemble.longitudinal_voting.longitudinal_voting import (
    LongitudinalEnsemblingStrategy,
)

sepwav = SepWav(
    estimator=RandomForestClassifier(max_depth=5, random_state=42),
    features_group=multiclass_dataset.feature_groups(),
    non_longitudinal_features=multiclass_dataset.non_longitudinal_features(),
    feature_list_names=multiclass_dataset.data.columns.tolist(),
    voting=LongitudinalEnsemblingStrategy.STACKING,
    stacking_meta_learner=LogisticRegression(max_iter=200),
)

sepwav.fit(multiclass_dataset.X_train, multiclass_dataset.y_train)
sepwav_proba = sepwav.predict_proba(multiclass_dataset.X_test)

print(sepwav.classes_) # Example: [0 1 2]
print(sepwav_proba.shape) # Example: (100, 3)
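
For intuition about wave-wise ensembling, the sketch below trains one plain scikit-learn model per wave on synthetic data and averages their probability matrices (soft voting). It is a simplified illustration only: it does not reproduce SepWav's implementation, its handling of non-longitudinal features, or the stacking meta-learner used above, and the two waves are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples = 150
y = rng.integers(0, 3, size=n_samples)  # three classes: 0, 1, 2

# Two hypothetical waves, three features each, both loosely tied to the label.
wave_1 = rng.normal(size=(n_samples, 3)) + y[:, None] * 0.5
wave_2 = rng.normal(size=(n_samples, 3)) + y[:, None] * 1.0
waves = [wave_1, wave_2]

# One base model per wave, each seeing only that wave's columns.
models = [LogisticRegression(max_iter=200).fit(X_w, y) for X_w in waves]

# Soft voting: average the per-wave probability matrices.
per_wave_proba = np.stack(
    [model.predict_proba(X_w) for model, X_w in zip(models, waves)]
)
voted_proba = per_wave_proba.mean(axis=0)
voted_labels = models[0].classes_[voted_proba.argmax(axis=1)]

print(per_wave_proba.shape)  # (2, 150, 3): waves x samples x classes
print(voted_proba.shape)     # (150, 3)
```

The combined matrix keeps one column per class, which is why the multiclass target needs no special handling at the ensembling step.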