automatminer.preprocessing package

Submodules

automatminer.preprocessing.core module

Top level preprocessing classes.

class automatminer.preprocessing.core.DataCleaner(max_na_frac=0.01, feature_na_method='drop', encode_categories=True, encoder='one-hot', drop_na_targets=True, na_method_fit='drop', na_method_transform='fill')

Bases: automatminer.base.DFTransformer

Transform a featurized dataframe into an ML-ready dataframe.

Works by first removing samples without a target value (if desired), then dropping features with high nan fractions, and finally removing or otherwise handling nans in individual samples (a relatively uncommon occurrence).

Parameters
  • max_na_frac (float) – The maximum fraction (0.0 - 1.0) of samples allowed to be nan for a given feature. Columns with a higher nan fraction are handled according to feature_na_method.

  • feature_na_method (str) – Defines how to handle features (columns) with a higher na fraction than max_na_frac. “drop” to drop these features. “fill” to fill these features with pandas bfill and ffill. “mean” to fill numerical variables with their mean and categorical variables with their mode. Alternatively, specify a number to replace the nans, e.g. 0. If all samples are nan, the feature will be dropped regardless.

  • encode_categories (bool) – If True, retains features which are categorical (data type is string or object) and then one-hot encodes them. If False, drops them.

  • encoder (str) – choose a method for encoding the categorical variables. Current options: ‘one-hot’ and ‘label’.

  • drop_na_targets (bool) – Drop samples containing target values which are na.

  • na_method_fit (str, float, int) – Set the na_method for samples in fit. Select one of the following methods: “fill” (use pandas fillna with ffill and bfill, sequentially), “ignore” (totally ignore nans in samples), “drop” (drop any remaining samples having a nan feature), “mean” (take means of numerical columns and modes of categorical columns). Alternatively, specify a number to replace the nans, e.g. 0.

  • na_method_transform (str, float, int) – The same as na_method_fit, but for transform.
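Taken together, these parameters describe a cleaning pipeline that can be sketched with plain pandas. The following is a minimal illustration of the three steps, not the DataCleaner implementation itself; the dataframe and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical featurized dataframe; "target" is the ML target column.
df = pd.DataFrame({
    "feat_a": [1.0, 2.0, np.nan, 4.0],
    "feat_b": [np.nan, np.nan, np.nan, 8.0],  # mostly nan
    "target": [0.1, np.nan, 0.3, 0.4],
})

max_na_frac = 0.5

# 1. Drop samples with a nan target (drop_na_targets=True).
df = df.dropna(subset=["target"])

# 2. Drop features whose nan fraction exceeds max_na_frac
#    (feature_na_method="drop").
features = df.drop(columns=["target"])
keep = features.columns[features.isna().mean() <= max_na_frac]
df = df[list(keep) + ["target"]]

# 3. Handle any remaining nans in individual samples (na_method="drop").
df = df.dropna()
```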

max_problem_col_warning_threshold

The maximum fraction of total columns allowed to be “problematic” (i.e., containing nans) before a warning is logged.

Type

float

The following attrs are set during fitting.
object_cols

The features identified as objects/categories

Type

list

number_cols

The features identified as numerical

Type

list

fitted_df

The fitted dataframe

Type

pd.DataFrame

fitted_target

The target variable in the dataframe.

Type

str

dropped_features

The features which were dropped.

Type

list

dropped_samples

A dataframe of samples to be dropped

Type

pandas.DataFrame

warnings

A list of warnings accumulated during fitting.

Type

[str]

fit(**kwargs)

Wrapper which logs the wrapped method.

Parameters

operation (str) – The operation to be logged.

Returns

The method result.

Return type

result

fit_transform(df, target, **fit_kwargs)

Combines the fitting and transformation of a dataframe.

Parameters
  • df (pandas.DataFrame) – The pandas dataframe to be fit.

  • target (str) – the target string specifying the ML target.

Returns

The transformed dataframe.

Return type

(pandas.DataFrame)

handle_na(df, target, na_method, coerce_mismatch=True)

First pass for handling cells without values (null or nan). Additional preprocessing may be necessary, as one column may be filled with the median while another is filled with the mean or mode, etc.

Parameters
  • df (pandas.DataFrame) – The dataframe containing features

  • target (str) – The key defining the ML target.

  • coerce_mismatch (bool) – If there is a mismatch between the fitted dataframe columns and the argument dataframe columns, create and drop mismatched columns so the dataframes match. If False, raises an error. New columns are instantiated as all zeros, as most of the time this is a one-hot encoding issue.

  • na_method (str) – How to deal with samples still containing nans after troublesome columns are already dropped. Default is ‘drop’. Other options are from pandas.DataFrame.fillna: {‘bfill’, ‘pad’, ‘ffill’}, or ‘ignore’ to ignore nans. Alternatively, specify a value to replace the nans, e.g. 0.

Returns

(pandas.DataFrame) The cleaned df
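The na_method options above map closely onto standard pandas operations. A minimal sketch (illustrative only, with a hypothetical dataframe):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, 6.0]})

# "fill": pandas ffill then bfill, sequentially.
filled = df.ffill().bfill()

# "mean": fill numerical columns with the column mean.
meaned = df.fillna(df.mean(numeric_only=True))

# "drop": remove any samples still containing a nan.
dropped = df.dropna()

# A number, e.g. 0: replace the nans with that value.
zeroed = df.fillna(0)
```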

property retained_features

The features retained during fitting, which may be used to craft the dataframe during transform.

Returns

The list of features retained.

Return type

(list)

to_numerical(df, target)

Transforms non-numerical columns to numerical columns which are machine learning-friendly.

Parameters
  • df (pandas.DataFrame) – The dataframe containing features

  • target (str) – The key defining the ML target.

Returns

(pandas.DataFrame) The numerical df
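The two encoder options described above (‘one-hot’ and ‘label’) correspond to standard pandas idioms. A minimal sketch with a hypothetical dataframe (not the to_numerical implementation itself):

```python
import pandas as pd

df = pd.DataFrame({
    "crystal_system": ["cubic", "hexagonal", "cubic"],  # categorical
    "volume": [10.0, 12.5, 9.8],                        # numerical
    "target": [1.0, 2.0, 3.0],
})

# 'one-hot': each category becomes its own 0/1 indicator column.
onehot = pd.get_dummies(df, columns=["crystal_system"])

# 'label': each category becomes a single integer code.
labeled = df.copy()
labeled["crystal_system"] = (
    labeled["crystal_system"].astype("category").cat.codes
)
```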

transform(**kwargs)

Wrapper which logs the wrapped method.

Parameters

operation (str) – The operation to be logged.

Returns

The method result.

Return type

result

class automatminer.preprocessing.core.FeatureReducer(reducers=('pca', ), corr_threshold=0.95, tree_importance_percentile=0.9, n_pca_features='auto', n_rebate_features=0.3, keep_features=None, remove_features=None)

Bases: automatminer.base.DFTransformer

Perform feature reduction on a clean dataframe.

Parameters
  • reducers ((str)) –

    The set of feature reduction operations to be performed on the data. The order of the strings determines the order in which the reducers will be applied. Valid reducer strings are the following:

    ’corr’: Removes any cross-correlated features having correlation coefficients larger than a threshold value. Retains feature names.

    ’tree’: Performs iterative feature reduction via tree-based feature reduction, using the .feature_importances_ attribute implemented in sklearn models. Retains feature names.

    ’rebate’: Performs ReliefF feature reduction using the skrebate package. Retains feature names.

    ’pca’: Performs Principal Component Analysis via eigendecomposition. Note the feature labels will be renamed to “PCA Feature X” if pca is present anywhere in the feature reduction scheme!

    Example: Apply tree-based feature reduction, then pca:

    reducers = (‘tree’, ‘pca’)

  • corr_threshold (float) – The correlation threshold between any two features needed for one to be removed (calculated as the correlation coefficient, R).

  • tree_importance_percentile (float) – the selected percentile (between 0.0 and 1.0) of the features, sorted (descending) based on their importance.

  • n_pca_features (int, float) – If int, the number of features to be retained by PCA. If float, the fraction of features to be retained by PCA once the dataframe is passed to it (i.e., 0.5 means PCA retains half of the features it is passed). PCA must be present in the reducers. ‘auto’ automatically determines the number of features to retain.

  • n_rebate_features (int, float) – If int, the number of ReBATE relief features to be retained. If float, the fraction of features to be retained by ReBATE once it is passed the dataframe (i.e., 0.5 means ReBATE retains half of the features it is passed). ReBATE must be present in the reducers.

  • keep_features (list, None) – A list of features that will not be removed. This option does nothing if PCA feature removal is present.

  • remove_features (list, None) – A list of features that will be removed. This option does nothing if PCA feature removal is present.
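The two interpretations of n_pca_features described above can be sketched with sklearn's PCA directly (a minimal illustration; in automatminer the retained columns would additionally be renamed to “PCA Feature X”):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # 50 samples, 8 features

# n_pca_features as an int: keep exactly that many components.
X_int = PCA(n_components=3).fit_transform(X)

# n_pca_features as a float: keep that fraction of the incoming
# features (0.5 of 8 features -> 4 components).
n = int(0.5 * X.shape[1])
X_frac = PCA(n_components=n).fit_transform(X)
```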

The following attrs are set during fitting.
removed_features

The keys are the feature reduction methods applied. The values are the feature labels removed by that feature reduction method.

Type

dict

retained_features

The features retained.

Type

list

reducer_params

The keys are the feature reduction methods applied. The values are the parameters used by each feature reducer.

Type

dict

fit(**kwargs)

Wrapper which logs the wrapped method.

Parameters

operation (str) – The operation to be logged.

Returns

The method result.

Return type

result

rm_correlated(df, target, r_max=0.95)

A feature selection method that removes features which are cross-correlated with each other by more than the threshold.

Parameters
  • df (pandas.DataFrame) – The dataframe containing features, target_key

  • target (str) – the name of the target column/feature

  • r_max (0<float<=1) – if R is greater than this value, the feature with the lower correlation to the target is removed.

Returns (pandas.DataFrame):

the dataframe with the highly cross-correlated features removed.
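The idea behind this method can be sketched in plain pandas: for every pair of features whose mutual correlation exceeds r_max, drop the member with the lower correlation to the target. This is an illustrative sketch with synthetic data, not the rm_correlated implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "f1": a,
    "f2": a + rng.normal(scale=0.01, size=200),  # nearly duplicates f1
    "f3": rng.normal(size=200),                  # independent feature
    "target": 2 * a + rng.normal(scale=0.1, size=200),
})

r_max = 0.95
corr = df.drop(columns=["target"]).corr().abs()
target_corr = df.corr()["target"].abs()

# For each highly cross-correlated pair, mark the feature with the
# lower target correlation for removal.
to_drop = set()
cols = corr.columns
for i, c1 in enumerate(cols):
    for c2 in cols[i + 1:]:
        if corr.loc[c1, c2] > r_max:
            to_drop.add(c1 if target_corr[c1] < target_corr[c2] else c2)

reduced = df.drop(columns=sorted(to_drop))
```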

transform(**kwargs)

Wrapper which logs the wrapped method.

Parameters

operation (str) – The operation to be logged.

Returns

The method result.

Return type

result

automatminer.preprocessing.feature_selection module

Various in-house feature reduction techniques.

class automatminer.preprocessing.feature_selection.TreeFeatureReducer(mode, importance_percentile=0.95, random_state=0)

Bases: automatminer.base.DFTransformer

Tree-based feature reduction tools based on sklearn models that have the .feature_importances_ attribute.

Parameters
  • mode (str) – “regression” or “classification”

  • importance_percentile (float) – the selected percentile of the features sorted (descending) based on their importance.

  • random_state (int) – relevant if non-deterministic algorithms such as random forest are used.

fit(X, y, tree='rf', recursive=True, cv=5)

Fits to the data (X) and target (y) to determine the selected_features.

Parameters
  • X (pandas.DataFrame) – input data; note that a numpy matrix is NOT accepted, since X.columns is used for feature names

  • y (pandas.Series or np.ndarray) – list of outputs used for fitting the tree model

  • tree (str or instantiated sklearn tree-based model) – if a model is directly fed, it must have the .feature_importances_ attribute

  • recursive (bool) – whether to recursively reduce the features (True) or just do it once (False)

  • cv (int or CrossValidation) – sklearn’s cross-validation with the same options (int or actual instantiated CrossValidation)

Returns (None):

sets the class attribute .selected_features
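The selection step can be sketched as follows: rank features by importance and keep them until their cumulative importance reaches the chosen percentile. This is a minimal, non-recursive sketch of that idea (not the TreeFeatureReducer implementation; data and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 5)),
                 columns=[f"f{i}" for i in range(5)])
y = 3 * X["f0"] + rng.normal(scale=0.1, size=100)  # only f0 matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Keep features (in descending importance order) until their
# cumulative importance reaches importance_percentile.
importance_percentile = 0.95
ranked = sorted(zip(X.columns, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
selected, total = [], 0.0
for name, imp in ranked:
    selected.append(name)
    total += imp
    if total >= importance_percentile:
        break
```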

get_reduced_features(tree_model, X, y, recursive=True)

Gives a reduced list of feature names given a tree-based model that has the .feature_importances_ attribute.

Parameters
  • tree_model (instantiated sklearn tree-based model) –

  • X (pandas.dataframe) –

  • y (pandas.Series or numpy.ndarray) – the target column

  • recursive (bool) –

Returns ([str]): list of the top percentile of features, where the percentile is determined by the importance_percentile argument.

get_top_features(feat_importance)

Simple function to go through a sorted list of features and select the top percentile.

Parameters

feat_importance ([(str, float)]) – a sorted list of (feature, importance) tuples

Returns ([str]): list of the top percentile of features, where the percentile is determined by the importance_percentile argument.

transform(X, y=None)

Transforms the data with the subset of features determined after calling the fit method on the data.

Parameters
  • X (pandas.DataFrame) – input data; note that a numpy matrix is NOT accepted, since X.columns is used for feature names

  • y (placeholder) – ignored input (for consistency in notation)

Returns (pandas.DataFrame): the data with reduced number of features.

automatminer.preprocessing.feature_selection.lower_corr_clf(df, target, f1, f2)

Train a simple linear model on the data to decide on the worse of two features. The feature which should be dropped is returned.

Parameters
  • df (pd.DataFrame) – The dataframe containing the target values and features in question

  • target (str) – The key for the target column

  • f1 (str) – The key for the first feature.

  • f2 (str) – The key for the second feature.

Returns

The name of the feature to be dropped (worse score).

Return type

(str)
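The approach described above can be sketched with sklearn: score each candidate feature alone with a simple linear classifier, and return the feature with the lower cross-validated score. The `worse_feature` helper below is a hypothetical illustration of the idea, not the lower_corr_clf implementation:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "informative": y + rng.normal(scale=0.3, size=n),  # predictive
    "noise": rng.normal(size=n),                       # uninformative
    "target": y,
})

def worse_feature(df, target, f1, f2):
    # Score each feature alone with a simple linear classifier and
    # return the one with the lower mean CV score (to be dropped).
    scores = {}
    for f in (f1, f2):
        scores[f] = cross_val_score(
            LogisticRegression(), df[[f]], df[target], cv=3).mean()
    return min(scores, key=scores.get)

dropped = worse_feature(df, "target", "informative", "noise")
```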

automatminer.preprocessing.feature_selection.rebate(df, target, n_features)

Run the MultiSURF* algorithm on a dataframe, returning the reduced df.

Parameters
  • df (pandas.DataFrame) – A dataframe

  • target (str) – The target key (must be present in df)

  • n_features (int) – The number of features desired to be returned.

Returns

(pd.DataFrame) The dataframe with fewer features, and no target column.

Module contents