Correlation Based Feature Selection Per Group (CFS Per Group)
Abstract of CorrelationBasedFeatureSelectionPerGroup
Extracted from Pomsuwan & Freitas (2017), "Feature selection for the classification of longitudinal human ageing data".
We propose a new variant of the Correlation-based Feature Selection (CFS) method for coping with longitudinal data - where variables are repeatedly measured across different time points. The proposed CFS variant is evaluated on ten datasets created using data from the English Longitudinal Study of Ageing (ELSA), with different age-related diseases used as the class variables to be predicted. The results show that, overall, the proposed CFS variant leads to better predictive performance than the standard CFS and the baseline approach of no feature selection, when using Naïve Bayes and J48 decision tree induction as classification algorithms (although the difference in performance is very small in the results for J4.8). We also report the most relevant features selected by J48 across the datasets.
What are features_group and non_longitudinal_features?
Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the
temporal structure of longitudinal data.
- features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
- non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.
Proper setup of these attributes is critical for leveraging temporal patterns effectively.
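As a rough illustration of what the two attributes contain, the sketch below derives them from column names that follow a `name_wN` convention. The helper `build_feature_groups` and the toy column names are hypothetical and not part of the library, which offers its own setup through `LongitudinalDataset`.

```python
import re
from collections import defaultdict


def build_feature_groups(columns, pattern=r"^(.+)_w(\d+)$"):
    """Hypothetical helper: group column indices by base name, ordered by wave."""
    waves = defaultdict(list)  # base name -> [(wave number, column index)]
    non_longitudinal = []      # indices of columns without a wave suffix
    for idx, name in enumerate(columns):
        match = re.match(pattern, name)
        if match:
            base, wave = match.groups()
            waves[base].append((int(wave), idx))
        else:
            non_longitudinal.append(idx)
    # Sort each group from the oldest wave to the most recent one.
    features_group = [[idx for _, idx in sorted(pairs)] for pairs in waves.values()]
    return features_group, non_longitudinal


columns = ["age", "smoke_w2", "smoke_w1", "chol_w1", "chol_w2", "gender"]
features_group, non_longitudinal_features = build_feature_groups(columns)
print(features_group)             # [[2, 1], [3, 4]]
print(non_longitudinal_features)  # [0, 5]
```

Note that the `smoke` group is `[2, 1]` rather than `[1, 2]`: the order inside each sublist follows the waves, not the column positions.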
CorrelationBasedFeatureSelectionPerGroup
Bases: CustomTransformerMixinEstimator
Correlation-based Feature Selection (CFS) per group (CFS Per Group).
The CorrelationBasedFeatureSelectionPerGroup class implements the CFS-Per-Group algorithm, a longitudinal variant
of the standard CFS method. It is designed to handle feature selection in longitudinal datasets by considering
temporal variations across multiple waves (time points). The algorithm operates in two phases:
- Phase 1: For each longitudinal feature group, CFS with a specified search method (e.g., exhaustive or greedy) is applied to select relevant and non-redundant features across waves. The selected features are then aggregated.
- Phase 2: The aggregated features from Phase 1 are combined with non-longitudinal features, and a standard CFS is applied to further refine the selection by removing redundant features.
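The two phases can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the library's implementation: it uses absolute Pearson correlation as a stand-in correlation measure, the standard CFS merit heuristic, and a simple greedy forward search. The function names (`pearson`, `merit`, `greedy_cfs`, `cfs_per_group`) and the toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov / math.sqrt(var_a * var_b))


def merit(subset, columns, y):
    """CFS merit: rewards feature-class correlation, penalises redundancy."""
    k = len(subset)
    r_cf = sum(pearson(columns[i], y) for i in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(pearson(columns[i], columns[j]) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(candidates, columns, y):
    """Forward search: add the best feature until the merit stops improving."""
    selected, best = [], 0.0
    while True:
        scored = [(merit(selected + [c], columns, y), c)
                  for c in candidates if c not in selected]
        if not scored:
            break
        score, choice = max(scored)
        if score <= best:
            break
        selected.append(choice)
        best = score
    return selected


def cfs_per_group(columns, y, features_group, non_longitudinal_features):
    # Phase 1: run CFS inside each longitudinal group and aggregate the survivors.
    aggregated = []
    for group in features_group:
        aggregated.extend(greedy_cfs(group, columns, y))
    # Phase 2: standard CFS over the aggregated and non-longitudinal features.
    candidates = aggregated + list(non_longitudinal_features)
    return sorted(greedy_cfs(candidates, columns, y))


# Toy data: one list per feature (column), eight samples each.
y = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # 0: a_w1, informative but redundant with a_w2
    [0, 0, 0, 0, 1, 1, 1, 1],  # 1: a_w2, matches the class exactly
    [0, 1, 0, 1, 0, 1, 0, 1],  # 2: b_w1, uncorrelated with the class
    [1, 0, 1, 0, 1, 0, 1, 0],  # 3: b_w2, uncorrelated with the class
    [1, 0, 0, 0, 1, 1, 1, 0],  # 4: static feature, weakly informative
]
result = cfs_per_group(columns, y, features_group=[[0, 1], [2, 3]],
                       non_longitudinal_features=[4])
print(result)  # [1]
```

In this toy case Phase 1 keeps only `a_w2` from the first group and nothing from the second, and Phase 2 then drops the static feature because adding it lowers the merit.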
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| non_longitudinal_features | Optional[List[int]] | List of indices for non-longitudinal features. These features are not part of the temporal matrix and are treated separately. Defaults to None. | None |
| search_method | str | The search method for Phase 1. Options are "exhaustiveSearch" or "greedySearch". Defaults to "greedySearch". | 'greedySearch' |
| features_group | Optional[List[List[int]]] | A temporal matrix where each sublist contains indices of a longitudinal attribute's waves. Required for the longitudinal component. Defaults to None. | None |
| parallel | bool | Whether to use parallel processing for CFS (useful for exhaustive search with multiple groups). Defaults to False. | False |
| outer_search_method | str | The search method for Phase 2 (outer search). If None, defaults to | None |
| inner_search_method | str | The search method for Phase 1 (inner search). Defaults to "exhaustiveSearch". | 'exhaustiveSearch' |
| version | int | The version of the CFS-Per-Group algorithm to use. Version 1 applies CFS per group without an outer search, while Version 2 includes an outer CFS on the aggregated features. Defaults to 1. | 1 |
| num_cpus | int | Number of CPUs for parallel processing. If -1, uses all available CPUs. Defaults to -1. | -1 |
Attributes:
| Name | Type | Description |
|---|---|---|
| selected_features_ | ndarray | Indices of the selected features after fitting. |
Examples:
Below are examples demonstrating the usage of the CorrelationBasedFeatureSelectionPerGroup class.
Basic Usage
from scikit_longitudinal.preprocessors.feature_selection.correlation_feature_selection import CorrelationBasedFeatureSelectionPerGroup
from scikit_longitudinal.data_preparation import LongitudinalDataset
from scikit_longitudinal.estimators.ensemble.longitudinal_voting.longitudinal_voting import LongitudinalEnsemblingStrategy
# Load dataset
dataset = LongitudinalDataset('./stroke_longitudinal.csv')
dataset.load_data()
dataset.load_target(target_column="stroke_w2")
dataset.setup_features_group("elsa")
dataset.load_train_test_split(test_size=0.2, random_state=42)
# Initialize CFS-Per-Group
cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features()
)
# Fit to data
cfs_longitudinal.fit(dataset.X_train, dataset.y_train)
# Transform data
X_selected = cfs_longitudinal.apply_selected_features_and_rename(dataset.X_train, cfs_longitudinal.selected_features_)
print(X_selected)
Advanced: parallel processing
# ... Same as above, but with parallel processing enabled ...
# Initialize with parallel processing
cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
    features_group=dataset.feature_groups(),
    search_method="exhaustiveSearch",
    parallel=True,  # Enable parallel processing
    num_cpus=4  # Number of CPUs to use; -1 uses all available CPUs
)
# ... Same as above, but with parallel processing enabled ...
Advanced: version 2 with outer search
# ... Same dataset setup as in Basic Usage ...
# Initialize with version 2 and outer search method
cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features(),
    version=2,  # Use version 2 of CFS-Per-Group
    outer_search_method="greedySearch"  # Search method for the outer (Phase 2) CFS
)
# ... Fit and transform as in Basic Usage ...
Source code in scikit_longitudinal/preprocessors/feature_selection/correlation_feature_selection/cfs_per_group.py
fit(X, y=None)
Fit the transformer to the input data.
Validates X (and y when provided) with scikit-learn's
check_X_y / check_array and then delegates to the subclass
implementation in _fit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Training input samples of shape | required |
| y | ndarray | Target values of shape | None |
Returns:
| Name | Type | Description |
|---|---|---|
| CustomTransformerMixinEstimator | CustomTransformerMixinEstimator | The fitted transformer ( |
Source code in scikit_longitudinal/templates/custom_transformer_mixin_estimator.py
transform(X)
Apply the transformation to the input data.
Validates X with scikit-learn's check_array and delegates to
the subclass implementation in _transform.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input samples of shape | required |
Returns:
| Type | Description |
|---|---|
| ndarray | np.ndarray: Transformed array. |
Source code in scikit_longitudinal/templates/custom_transformer_mixin_estimator.py
apply_selected_features_and_rename(df, selected_features, regex_match='^(.+)_w(\\d+)$')
staticmethod
Apply selected features to the DataFrame and rename non-longitudinal features.
This method selects the specified features from the DataFrame and renames any features that, after selection, appear as single-wave features (i.e., non-longitudinal). This ensures that such features are not misinterpreted as longitudinal in future processing.
Usage Note
This method should be used instead of the standard transform method to handle both feature selection and
renaming in one step, especially in pipelines where the temporal structure needs to be preserved.
Regex Match, what is that all about?
The regex identifies wave-based (longitudinal) feature names. The default pattern
^(.+)_w(\d+)$ matches a base name followed by a wave number (e.g., feature_w1, feature_w2).
This default works out of the box with the ELSA datasets.
The first group (.+) captures the base name of the feature, and the second group (\d+) captures the wave
number. With both parts, the method can tell which base names still have several waves after selection and
which are left with a single wave, and it renames the latter.
Why does this matter? A feature left with a single wave would otherwise still look longitudinal because of its wave suffix, and later steps that rely on the temporal structure could misinterpret it. This is particularly important in longitudinal datasets where features are collected over multiple time points.
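To make the role of the two capture groups concrete, here is a small sketch using Python's `re` module with the default pattern. The column names are made up for illustration, and the snippet only detects single-wave features. It does not reproduce the renaming scheme the method applies.

```python
import re
from collections import Counter

PATTERN = r"^(.+)_w(\d+)$"  # default pattern quoted in the text above

# Hypothetical column names left after feature selection.
selected_columns = ["smoke_w1", "smoke_w2", "chol_w3", "age"]

# Split each matching column into (base name, wave number).
parsed = {}
for name in selected_columns:
    match = re.match(PATTERN, name)
    if match:
        base, wave = match.groups()
        parsed[name] = (base, int(wave))

# A base name that survives with exactly one wave is no longer longitudinal.
wave_counts = Counter(base for base, _ in parsed.values())
single_wave = [name for name, (base, _) in parsed.items() if wave_counts[base] == 1]

print(parsed)       # {'smoke_w1': ('smoke', 1), 'smoke_w2': ('smoke', 2), 'chol_w3': ('chol', 3)}
print(single_wave)  # ['chol_w3']
```

Here `age` does not match the pattern at all, `smoke` keeps two waves and stays longitudinal, and `chol_w3` is the kind of column the method would rename.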
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame. | required |
| selected_features | List | List of selected feature indices. | required |
| regex_match | str | Regex pattern to identify wave-based features. Defaults to "^(.+)_w(\d+)$". | '^(.+)_w(\\d+)$' |
Returns:
| Type | Description |
|---|---|
| [DataFrame, None] | pd.DataFrame: DataFrame with selected features and renamed non-longitudinal features. |