Lexicographical Decision Tree Regressor
Abstract of LexicoDecisionTreeRegressor
Extracted from Ribeiro & Freitas (2024), "Lexicographical random forests for longitudinal data classification".
Standard supervised machine learning methods often ignore the temporal information represented in longitudinal data, but that information can lead to more precise predictions in classification tasks. Data preprocessing techniques and classification algorithms can be adapted to cope directly with longitudinal data inputs, making use of temporal information such as the time-index of features and previous measurements of the class variable. In this article, we propose two changes to the classification task of predicting age-related diseases in a real-world dataset created from the English Longitudinal Study of Ageing. First, we explore the addition of previous measurements of the class variable, and estimating the missing data in those added features using intermediate classifiers. Second, we propose a new split-feature selection procedure for a random forest's decision trees, which considers the candidate features' time-indexes, in addition to the information gain ratio. Our experiments compared the proposed approaches to baseline approaches, in 3 prediction scenarios, varying the "time gap" for the prediction - how many years in advance the class (occurrence of an age-related disease) is predicted. The experiments were performed on 10 datasets varying the class variable, and showed that the proposed approaches increased the random forest's predictive accuracy.
Adapted to regression, this estimator applies the same lexicographic split-selection procedure inside DecisionTreeRegressor, replacing information-gain ratio with variance reduction (friedman_mse) as the primary objective while still preferring more recent waves on near-ties.
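To make the primary objective concrete, the sketch below computes Friedman's improvement score for one candidate split, which is the quantity scikit-learn's friedman_mse criterion uses to rank splits. The function name friedman_improvement and the toy targets are assumptions for illustration, and every sample is given a weight of 1.

```python
def friedman_improvement(y_left, y_right):
    """Friedman's improvement score for a split, with unit sample weights."""
    n_left, n_right = len(y_left), len(y_right)
    mean_left = sum(y_left) / n_left
    mean_right = sum(y_right) / n_right
    # Weighted squared difference between the two child means.
    return (n_left * n_right) / (n_left + n_right) * (mean_left - mean_right) ** 2


# A split that separates low from high targets scores better than one that mixes them.
good = friedman_improvement([1, 2, 3], [10, 11, 12])
poor = friedman_improvement([1, 10, 3], [2, 11, 12])
print(good, poor)
```

A larger score means the two child nodes have more clearly separated mean targets.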
What are features_group and non_longitudinal_features?
Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the
temporal structure of longitudinal data.
- features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
- non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.
Proper setup of these attributes is critical for leveraging temporal patterns effectively.
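As a minimal illustration, assume a hypothetical dataset whose columns are, in order, smoke_w1, smoke_w2, smoke_w3, bmi_w1, bmi_w2, age, sex. These names, the _w suffix convention, and the build_groups helper are invented for this example and are not part of Sklong. The two attributes could then be derived from the column positions:

```python
# Hypothetical column layout; names and wave suffixes are assumptions for illustration.
columns = ["smoke_w1", "smoke_w2", "smoke_w3", "bmi_w1", "bmi_w2", "age", "sex"]


def build_groups(columns):
    """Group column indices by attribute, ordering waves from oldest to most recent."""
    groups = {}   # attribute name -> list of (wave number, column index)
    static = []   # indices of columns without a wave suffix
    for idx, name in enumerate(columns):
        base, sep, wave = name.rpartition("_w")
        if sep and wave.isdigit():
            groups.setdefault(base, []).append((int(wave), idx))
        else:
            static.append(idx)
    features_group = [[idx for _, idx in sorted(waves)] for waves in groups.values()]
    return features_group, static


features_group, non_longitudinal_features = build_groups(columns)
print(features_group)             # [[0, 1, 2], [3, 4]]
print(non_longitudinal_features)  # [5, 6]
```

Each sublist holds the waves of one attribute in chronological order, so the last index in a sublist is always the most recent measurement.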
LexicoDecisionTreeRegressor
Bases: DecisionTreeRegressor
Lexico Decision Tree Regressor for longitudinal data regression.
This regressor extends scikit-learn's DecisionTreeRegressor for longitudinal data by integrating a
lexicographic optimisation approach that prioritises more recent waves during split selection. Splits are
evaluated with a bi-objective rule: the primary objective maximises the variance-reduction gain
(friedman_mse criterion), and the secondary objective favours features from more recent waves whenever
competing gains are within threshold_gain of each other. It is intended for time-dependent regression
targets such as patient health trends or economic indicators.
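The bi-objective rule can be sketched in a few lines. This is a simplified illustration rather than the library's splitter: the Candidate tuple, the lexico_select function, and the gain values are all hypothetical, and "within threshold_gain" is interpreted here as an absolute difference from the best gain.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    feature: int   # column index of the candidate split feature
    gain: float    # primary objective (e.g. variance-reduction gain)
    wave: int      # position within its features_group sublist; higher = more recent


def lexico_select(candidates: List[Candidate], threshold_gain: float = 0.0015) -> Candidate:
    """Pick the best-gain candidate, breaking near-ties in favour of the most recent wave."""
    best_gain = max(c.gain for c in candidates)
    # Candidates whose gain is within threshold_gain of the best are treated as equivalent.
    near_ties = [c for c in candidates if best_gain - c.gain <= threshold_gain]
    # Among near-ties, prefer the most recent wave; fall back to gain if waves are equal.
    return max(near_ties, key=lambda c: (c.wave, c.gain))


candidates = [
    Candidate(feature=0, gain=0.5000, wave=0),
    Candidate(feature=2, gain=0.4990, wave=2),  # slightly lower gain, most recent wave
    Candidate(feature=4, gain=0.3000, wave=1),
]
print(lexico_select(candidates).feature)                      # 2
print(lexico_select(candidates, threshold_gain=0.0).feature)  # 0
```

In this sketch, raising threshold_gain widens the set of near-ties, so recency decides more splits; setting it to 0 reduces the rule to ordinary best-gain selection.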
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold_gain | float | Threshold below which the gains of competing splits are treated as near-ties during split selection. Higher values let recency decide among more candidates; lower values make selection follow gain more strictly. | 0.0015 |
| features_group | List[List[int]] | A list of lists where each sublist contains feature indices for a longitudinal attribute, ordered from oldest to most recent wave. Required for longitudinal functionality. | None |
| criterion | str | The split quality metric. Fixed to "friedman_mse"; do not modify. | 'friedman_mse' |
| splitter | str | The split strategy. Fixed to "lexicoRF"; do not modify. | 'lexicoRF' |
| max_depth | Optional[int] | Maximum tree depth. If None, the tree grows until purity or other limits are reached. | None |
| min_samples_split | int | Minimum samples required to split a node. | 2 |
| min_samples_leaf | int | Minimum samples required at a leaf node. | 1 |
| min_weight_fraction_leaf | float | Minimum weighted fraction of total sample weight at a leaf. | 0.0 |
| max_features | Optional[Union[int, str]] | Number of features to consider for splits (e.g., "auto", "sqrt", int). | None |
| random_state | Optional[int] | Seed for random number generation. | None |
| max_leaf_nodes | Optional[int] | Maximum number of leaf nodes. | None |
| min_impurity_decrease | float | Minimum impurity decrease required for a split. | 0.0 |
| ccp_alpha | float | Complexity parameter for pruning; non-negative. | 0.0 |
Attributes:
| Name | Type | Description |
|---|---|---|
| n_features_ | int | Number of features in the fitted model. |
| n_outputs_ | int | Number of outputs (fixed to 1 for regression). |
| feature_importances_ | ndarray of shape (n_features,) | Impurity-based feature importances. |
| max_features_ | int | Inferred value of max_features. |
| tree_ | Tree object | The underlying decision tree structure. |
Examples:
Sklong currently focuses on classification tasks only. This regressor is used internally by
our LexicographicalGradientBoosting primitive. Feel free to experiment with it in your own
longitudinal regression tasks, but we do not guarantee its performance.
Source code in scikit_longitudinal/estimators/trees/lexicographical/lexico_decision_tree_regressor.py
fit(X, y, *args, **kwargs)
Fit the Lexico Decision Tree Regressor to the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Training input samples. | required |
| y | array-like of shape (n_samples,) | Target values. | required |
| *args | | Additional positional arguments for the superclass | () |
| **kwargs | | Additional keyword arguments for the superclass | {} |
Returns:
| Name | Type | Description |
|---|---|---|
| self | LexicoDecisionTreeRegressor | Fitted regressor instance. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
Data Prep Tip
Ensure X matches the features_group structure for accurate temporal modeling.
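One way to check this before fitting is a small validation helper. The function check_features_group below is hypothetical and not part of Sklong. It assumes plain non-negative indices and only verifies that every index in features_group and non_longitudinal_features refers to an existing column of X and that no column is listed twice.

```python
def check_features_group(n_features, features_group, non_longitudinal_features=()):
    """Raise ValueError if the index lists do not fit a matrix with n_features columns."""
    seen = set()
    all_indices = [i for group in features_group for i in group]
    all_indices += list(non_longitudinal_features)
    for idx in all_indices:
        if not 0 <= idx < n_features:
            raise ValueError(f"index {idx} is outside the {n_features} columns of X")
        if idx in seen:
            raise ValueError(f"index {idx} is listed more than once")
        seen.add(idx)


# Toy rows with 7 columns: three waves of one attribute, two of another, two static.
X = [[0, 1, 1, 22.5, 23.1, 64, 0],
     [1, 1, 0, 27.0, 26.4, 71, 1]]
check_features_group(len(X[0]), [[0, 1, 2], [3, 4]], [5, 6])  # passes silently
```

Running such a check after any column reordering or feature selection step helps catch index lists that no longer line up with X.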
Source code in scikit_longitudinal/estimators/trees/lexicographical/lexico_decision_tree_regressor.py
predict(X, check_input=True)
Predict regression target values for the input samples.
Inherited from scikit-learn's DecisionTreeRegressor. The Lexico tree only customises
split selection at fit time; prediction is the standard tree-traversal routine.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Input samples. | required |
| check_input | bool | Whether to validate the input; set to False to bypass validation. Forwarded to scikit-learn. | True |
Returns:
| Type | Description |
|---|---|
| np.ndarray | Predicted target values of shape (n_samples,). |
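For reference, the standard traversal can be sketched with parallel arrays in the style of scikit-learn's tree_ attributes (children_left, children_right, feature, threshold, value). The arrays below are hand-written toy values, not output from a fitted model, and predict_one is a hypothetical helper.

```python
LEAF = -1  # scikit-learn marks leaf nodes with a child index of -1

# Toy tree: node 0 splits on feature 2 at 0.5; nodes 1 and 2 are leaves.
children_left = [1, LEAF, LEAF]
children_right = [2, LEAF, LEAF]
feature = [2, -2, -2]          # -2 is the placeholder used for leaf nodes
threshold = [0.5, -2.0, -2.0]
value = [6.5, 2.0, 11.0]       # mean target of the training samples in each node


def predict_one(x):
    """Walk from the root to a leaf and return that leaf's stored mean."""
    node = 0
    while children_left[node] != LEAF:
        # Samples with feature value <= threshold go left, the rest go right.
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    return value[node]


print([predict_one(x) for x in [[0, 0, 0.0], [0, 0, 1.0]]])  # [2.0, 11.0]
```

Because the lexicographic rule only affects which splits are stored at fit time, nothing in this traversal depends on features_group or threshold_gain.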