Aggregation Function
What is the AggrFunc module?
The AggrFunc module facilitates the application of aggregation functions to feature groups within a longitudinal
dataset, enabling the use of temporal information before applying traditional machine learning algorithms.
What are features_group and non_longitudinal_features?
Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the
temporal structure of longitudinal data.
- features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
- non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.
Proper setup of these attributes is critical for leveraging temporal patterns effectively.
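As a minimal sketch of the two attributes described above, the snippet below builds them by hand for a toy pandas DataFrame. The column names and values are hypothetical and chosen only for illustration; real datasets will differ.

```python
import pandas as pd

# Hypothetical toy dataset: two longitudinal attributes (income, bmi)
# measured over three waves, plus two static features (sex, education).
df = pd.DataFrame(
    {
        "income_w1": [10, 20],
        "income_w2": [12, 22],
        "income_w3": [14, 24],
        "bmi_w1": [21.0, 30.0],
        "bmi_w2": [22.0, 29.0],
        "bmi_w3": [23.0, 28.0],
        "sex": [0, 1],
        "education": [3, 1],
    }
)

# One sublist per longitudinal attribute; indices ordered oldest -> most recent.
features_group = [[0, 1, 2], [3, 4, 5]]
# Column indices of the static, non-temporal features.
non_longitudinal_features = [6, 7]

# Map the indices back to column names to check the structure.
group_names = [[df.columns[i] for i in group] for group in features_group]
static_names = [df.columns[i] for i in non_longitudinal_features]
print(group_names)
print(static_names)
```

Mapping indices back to names like this is a quick way to confirm that each sublist really holds the waves of one attribute in chronological order.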
AggrFunc
Bases: DataPreparationMixin
AggrFunc (short for Aggregation Functions) aggregates feature groups in longitudinal datasets.
It applies an aggregation function to each feature group within a longitudinal
dataset, enabling the use of temporal information before applying traditional machine learning algorithms such as
those in Scikit-Learn or similar machine learning libraries.
The aggregation function is applied iteratively across waves for each feature group, producing a single aggregated
feature per group (e.g., mean_income from income_wave1, income_wave2, income_wave3 using the mean
function). Supported aggregation functions include mean, median, mode, and custom callable functions that
take a pandas Series as input and return a single value. Parallel processing is also supported via the Ray library
for enhanced efficiency on large datasets.
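The idea of collapsing each feature group into one aggregated column can be sketched in plain pandas. This is an illustration under stated assumptions, not the AggrFunc implementation: the helper `aggregate_groups`, its `prefix` argument, and the naming rule for the new column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one longitudinal attribute over three waves, one static feature.
df = pd.DataFrame(
    {
        "income_wave1": [10.0, 20.0],
        "income_wave2": [12.0, 22.0],
        "income_wave3": [14.0, 27.0],
        "sex": [0, 1],
    }
)
features_group = [[0, 1, 2]]


def aggregate_groups(data, groups, func, prefix):
    """Replace each group of wave columns with one row-wise aggregated column."""
    out = data.copy()
    for group in groups:
        cols = [data.columns[i] for i in group]
        # Strip the wave suffix to name the new column, e.g. "income_wave1" -> "income".
        base = cols[0].rsplit("_", 1)[0]
        # func receives each row's wave values as a pandas Series and returns one value.
        out[f"{prefix}_{base}"] = data[cols].apply(func, axis=1)
        out = out.drop(columns=cols)
    return out


aggregated = aggregate_groups(df, features_group, lambda s: s.mean(), "mean")
print(aggregated)
```

The static column is left untouched, and the three wave columns are replaced by a single `mean_income` column.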
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features_group | List[List[int]] | A temporal matrix representing the temporal dependency of a longitudinal dataset. Each sublist contains indices of a longitudinal attribute's waves. Defaults to None. See the "Temporal Dependency" page in the documentation for details. | None |
| non_longitudinal_features | List[Union[int, str]] | A list of indices or names of non-longitudinal features. Defaults to None. | None |
| feature_list_names | List[str] | A list of feature names in the dataset. Defaults to None. | None |
| aggregation_func | Union[str, Callable] | The aggregation function to apply. Options are "mean", "median", "mode", or a custom callable function. Defaults to "mean". See further in the | 'mean' |
| parallel | bool | Whether to use parallel processing for aggregation. Defaults to False. | False |
| num_cpus | int | Number of CPUs for parallel processing. Defaults to -1 (uses all available CPUs). | -1 |
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset | DataFrame | The longitudinal dataset to transform. |
| aggregation_func | Union[str, Callable] | The aggregation function applied to feature groups. |
| parallel | bool | Whether parallel processing is enabled. |
| num_cpus | int | Number of CPUs used for parallel processing. |
Examples:
Below are examples demonstrating the usage of the AggrFunc class with a stroke dataset.
Note that the file path used in the examples is a placeholder and should be replaced with the actual path to your dataset.
Basic Usage
```python
from scikit_longitudinal.data_preparation import LongitudinalDataset
from scikit_longitudinal.data_preparation.aggregation_function import AggrFunc

# Load dataset
dataset = LongitudinalDataset("./stroke_longitudinal.csv")
dataset.load_data()
dataset.load_target(target_column="stroke_w2")
dataset.setup_features_group("elsa")
dataset.load_train_test_split(test_size=0.2, random_state=42)

# Initialize AggrFunc
agg_func = AggrFunc(
    aggregation_func="mean",
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features(),
    feature_list_names=dataset.data.columns.tolist(),
)

# Apply transformation
agg_func.prepare_data(dataset.X_train)
transformed_dataset, _, _, _ = agg_func._transform()
```
Advanced: custom aggregation function
```python
from scikit_longitudinal.data_preparation import LongitudinalDataset
from scikit_longitudinal.data_preparation.aggregation_function import AggrFunc

# Load dataset
dataset = LongitudinalDataset("./stroke_longitudinal.csv")
dataset.load_data()
dataset.load_target(target_column="stroke_w2")
dataset.setup_features_group("elsa")
dataset.load_train_test_split(test_size=0.2, random_state=42)

# Define custom function
custom_func = lambda x: x.quantile(0.25)  # First quartile

# Initialize AggrFunc
agg_func = AggrFunc(
    aggregation_func=custom_func,
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features(),
    feature_list_names=dataset.data.columns.tolist(),
)

# Apply transformation
agg_func.prepare_data(dataset.X_train)
transformed_dataset, _, _, _ = agg_func._transform()
```
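Before passing a custom callable to AggrFunc, it can help to check it on a single pandas Series, since the documented contract is "Series in, single value out". The sketch below does this with plain pandas; `wave_range` and the blood-pressure values are hypothetical.

```python
import pandas as pd


def wave_range(values: pd.Series) -> float:
    """Hypothetical custom aggregation: spread between the highest and lowest wave."""
    return float(values.max() - values.min())


# One individual's measurements across three waves (made-up values).
waves = pd.Series([120.0, 135.0, 128.0], index=["bp_w1", "bp_w2", "bp_w3"])

# Both callables take the Series of wave values and return a single number.
spread = wave_range(waves)
first_quartile = (lambda x: x.quantile(0.25))(waves)
print(spread, first_quartile)
```

A named function such as `wave_range` is often easier to debug and document than a lambda, though either form satisfies the contract.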
Advanced: parallel processing
```python
# ... similar to the previous example, prepare data and transform ...

# Initialize AggrFunc with parallel processing
agg_func = AggrFunc(
    aggregation_func="mean",
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features(),
    feature_list_names=dataset.data.columns.tolist(),
    parallel=True,  # Enable parallel processing
    num_cpus=4,  # Specify number of CPUs (optional, -1 for all available)
)

# ... similar to the previous example, prepare data and transform ...
```
Source code in scikit_longitudinal/data_preparation/aggregation_function.py
get_params(deep=True)
Get the parameters of the AggrFunc instance.
This method retrieves the configuration parameters of the AggrFunc instance, useful for inspection or
hyperparameter tuning.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| deep | bool | Unused parameter but kept for consistency with the scikit-learn API. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict | The parameters of the AggrFunc instance. |
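The scikit-learn `get_params` convention can be illustrated with a small stand-in class. `TinyAggregator` below is hypothetical and only mirrors three of the documented parameter names; it is not the AggrFunc source.

```python
class TinyAggregator:
    """Hypothetical stand-in used only to illustrate the get_params convention."""

    def __init__(self, aggregation_func="mean", parallel=False, num_cpus=-1):
        self.aggregation_func = aggregation_func
        self.parallel = parallel
        self.num_cpus = num_cpus

    def get_params(self, deep=True):
        # `deep` is accepted for API compatibility and ignored in this sketch.
        return {
            "aggregation_func": self.aggregation_func,
            "parallel": self.parallel,
            "num_cpus": self.num_cpus,
        }


params = TinyAggregator(aggregation_func="median").get_params()
# The returned dict can rebuild an equivalent instance, which is what
# cloning and hyperparameter search utilities rely on.
clone = TinyAggregator(**params)
print(params)
```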
_prepare_data(X, y=None)
Prepare the data for transformation.
This method, overridden from DataPreparationMixin, converts input numpy arrays into a pandas DataFrame and
stores the target data for compatibility, though the target is not used in the transformation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | The input data. | required |
| y | ndarray | The target data. Defaults to None. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| AggrFunc | AggrFunc | The instance with prepared data. |
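The array-to-DataFrame conversion described for this method can be sketched as follows. Using `feature_list_names` as the column labels is an assumption made for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical input: two individuals, two income waves and one static feature.
X = np.array([[10.0, 12.0, 0.0], [20.0, 22.0, 1.0]])
feature_list_names = ["income_w1", "income_w2", "sex"]

# Wrap the array in a DataFrame so columns can be addressed by name or index.
dataset = pd.DataFrame(X, columns=feature_list_names)
print(dataset)
```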
_transform()
Apply the aggregation function to feature groups in the dataset.
This method applies the specified aggregation function to each feature group, replacing it with a single aggregated feature.
Parallel Processing
If parallel processing is enabled, it uses the Ray library.
Categorical Data Handling
When "mean" or "median" is requested for categorical data, the method automatically switches to "mode" and issues a warning.
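A fallback of this kind could look like the sketch below. The helper `resolve_aggregation` and its dtype check are assumptions for illustration; the library may detect categorical data differently.

```python
import warnings

import pandas as pd


def resolve_aggregation(values: pd.Series, requested: str) -> str:
    """Return the aggregation to use, falling back to 'mode' for non-numeric data."""
    if requested in ("mean", "median") and not pd.api.types.is_numeric_dtype(values):
        warnings.warn(f"'{requested}' is undefined for categorical data; using 'mode' instead.")
        return "mode"
    return requested


# Hypothetical categorical wave values for one individual.
smoker = pd.Series(["yes", "no", "yes"])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = resolve_aggregation(smoker, "mean")

# The mode is the most frequent value across the waves.
most_frequent = smoker.mode().iloc[0]
print(chosen, most_frequent, len(caught))
```

Numeric data passes through unchanged, so the warning only appears when the requested function cannot be computed.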
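Returns:
| Name | Type | Description |
|---|---|---|
| tuple | | A tuple of four values. In the examples above, the first is the transformed dataset and the remaining three are discarded. |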