automatminer.preprocessing package¶
Submodules¶
automatminer.preprocessing.core module¶
Top level preprocessing classes.
-
class automatminer.preprocessing.core.DataCleaner(max_na_frac=0.01, feature_na_method='drop', encode_categories=True, encoder='one-hot', drop_na_targets=True, na_method_fit='drop', na_method_transform='fill')¶
Bases: automatminer.base.DFTransformer
Transform a featurized dataframe into an ML-ready dataframe.
Works by first removing samples without a target value (if desired), then dropping features with high nan rates, and finally removing or otherwise handling nans for individual samples (a relatively uncommon occurrence).
- Parameters
max_na_frac (float) – The maximum fraction (0.0 - 1.0) of samples allowed to be nan for a given feature. Columns with a higher nan fraction are handled according to feature_na_method.
feature_na_method (str) – Defines how to handle features (columns) with a higher na fraction than max_na_frac. “drop” for dropping these features. “fill” for filling these features with pandas bfill and ffill. “mean” to fill categorical variables and take the mean for numerical variables. Alternatively, specify a number to replace the nans, e.g. 0. If all samples are nan, the feature will be dropped regardless.
encode_categories (bool) – If True, retains features which are categorical (data type is string or object) and then one-hot encodes them. If False, drops them.
encoder (str) – choose a method for encoding the categorical variables. Current options: ‘one-hot’ and ‘label’.
drop_na_targets (bool) – Drop samples containing target values which are na.
na_method_fit (str, float, int) – Set the na_method for samples in fit. Select one of the following methods: “fill” (use pandas fillna with ffill and bfill, sequentially), “ignore” (totally ignore nans in samples), “drop” (drop any remaining samples having a nan feature), “mean” (fills categorical variables, takes means of numerical). Alternatively, specify a number to replace the nans, e.g. 0.
na_method_transform (str, float, int) – The same as na_method_fit, but for transform.
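The cleaning steps above can be sketched with plain pandas. This is a minimal illustration, not the actual DataCleaner implementation; the `clean` helper name and its simplified logic are assumptions:

```python
import numpy as np
import pandas as pd

def clean(df, target, max_na_frac=0.01, drop_na_targets=True):
    """Hypothetical sketch of DataCleaner's core steps."""
    # 1. Drop samples missing the target value (drop_na_targets)
    if drop_na_targets:
        df = df[df[target].notna()]
    # 2. Drop feature columns whose nan fraction exceeds max_na_frac
    #    (the 'drop' setting of feature_na_method)
    feats = df.drop(columns=[target])
    na_frac = feats.isna().mean()
    keep = list(na_frac[na_frac <= max_na_frac].index)
    # 3. Drop any remaining samples containing nans (na_method='drop')
    return df[keep + [target]].dropna()

df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [np.nan, np.nan, np.nan, 1.0],  # mostly nan -> column dropped
    "y":  [0.1, 0.2, np.nan, 0.4],        # row 2 has no target -> dropped
})
cleaned = clean(df, "y", max_na_frac=0.5)
```

The real class also fits on training data and replays the same decisions at transform time; this sketch only shows the per-call logic.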
-
max_problem_col_warning_threshold
¶ The maximum fraction of total columns which may be “problematic” (nan-containing) before a warning is logged.
- Type
float
-
The following attrs are set during fitting.
-
fitted_df
¶ The fitted dataframe
- Type
pandas.DataFrame
-
dropped_samples
¶ A dataframe of samples to be dropped
- Type
pandas.DataFrame
-
fit
(**kwargs)¶ Wrapper for a method to log.
- Parameters
operation (str) – The operation to be logged.
- Returns
The method result.
- Return type
result
-
fit_transform
(df, target, **fit_kwargs)¶ Combines the fitting and transformation of a dataframe.
- Parameters
df (pandas.DataFrame) – The pandas dataframe to be fit.
target (str) – The target string specifying the ML target.
- Returns
The transformed dataframe.
- Return type
(pandas.DataFrame)
-
handle_na
(df, target, na_method, coerce_mismatch=True)¶ First pass for handling cells without values (null or nan). Additional preprocessing may be necessary, as one column may be filled with the median while another is filled with the mean or mode, etc.
- Parameters
df (pandas.DataFrame) – The dataframe containing features
target (str) – The key defining the ML target.
na_method (str) – How to deal with samples still containing nans after troublesome columns are already dropped. Default is ‘drop’. Other options are from pandas.DataFrame.fillna: {‘bfill’, ‘pad’, ‘ffill’}, or ‘ignore’ to ignore nans. Alternatively, specify a value to replace the nans, e.g. 0.
coerce_mismatch (bool) – If there is a mismatch between the fitted dataframe columns and the argument dataframe columns, create and drop mismatch columns so the dataframes are matching. If False, raises an error. New columns are instantiated as all zeros, as most of the time this is a one-hot encoding issue.
- Returns
(pandas.DataFrame) The cleaned df
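The documented na_method options map onto standard pandas operations. A minimal sketch (the `apply_na_method` helper name is hypothetical, not part of the library):

```python
import numpy as np
import pandas as pd

def apply_na_method(df, na_method):
    # Hypothetical helper mirroring the na_method options documented above
    if na_method == "drop":
        return df.dropna()                            # drop samples with nans
    if na_method == "fill":
        return df.ffill().bfill()                     # sequential ffill + bfill
    if na_method == "ignore":
        return df                                     # leave nans untouched
    if na_method == "mean":
        return df.fillna(df.mean(numeric_only=True))  # fill with column means
    return df.fillna(na_method)                       # a literal value, e.g. 0

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, 2.0]})
filled = apply_na_method(df, "mean")
```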
-
property
retained_features
¶ The features retained during fitting, which may be used to craft the dataframe during transform.
- Returns
The list of features retained.
- Return type
(list)
-
class automatminer.preprocessing.core.FeatureReducer(reducers=('pca', ), corr_threshold=0.95, tree_importance_percentile=0.9, n_pca_features='auto', n_rebate_features=0.3, keep_features=None, remove_features=None)¶
Bases: automatminer.base.DFTransformer
Perform feature reduction on a clean dataframe.
- Parameters
reducers ((str)) –
The set of feature reduction operations to be performed on the data. The order of strings determines the order in which the reducers will be applied. Valid reducer strings are the following:
- ’corr’: Removes any cross-correlated features having correlation coefficients larger than a threshold value. Retains feature names.
- ’tree’: Perform iterative feature reduction via a tree-based feature reduction, using the .feature_importances_ attribute implemented in sklearn. Retains feature names.
- ’rebate’: Perform ReliefF feature reduction using the skrebate package. Retains feature names.
- ’pca’: Perform Principal Component Analysis via eigendecomposition. Note the feature labels will be renamed to “PCA Feature X” if pca is present anywhere in the feature reduction scheme!
- Example: Apply tree-based feature reduction, then pca:
reducers = (‘tree’, ‘pca’)
corr_threshold (float) – The correlation threshold between any two features needed for one to be removed (calculated with R).
tree_importance_percentile (float) – The selected percentile (between 0.0 and 1.0) of the features, sorted (descending) based on their importance.
n_pca_features (int, float) – If int, the number of features to be retained by PCA. If float, the fraction of features to be retained by PCA once the dataframe is passed to it (i.e., 0.5 means PCA retains half of the features it is passed). PCA must be present in the reducers. ‘auto’ automatically determines the number of features to retain.
n_rebate_features (int, float) – If int, the number of ReBATE relief features to be retained. If float, the fraction of features to be retained by ReBATE once it is passed the dataframe (i.e., 0.5 means ReBATE retains half of the features it is passed). ReBATE must be present in the reducers.
keep_features (list, None) – A list of features that will not be removed. This option does nothing if PCA feature removal is present.
remove_features (list, None) – A list of features that will be removed. This option does nothing if PCA feature removal is present.
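The ‘pca’ reducer’s eigendecomposition approach and the fractional n_pca_features option can be sketched with numpy alone. This is an illustrative sketch, not FeatureReducer’s actual code:

```python
import numpy as np

def pca_reduce(X, n_components):
    # PCA via eigendecomposition of the covariance matrix, as the
    # 'pca' reducer description above indicates (sketch only)
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]  # top components
    return Xc @ eigvecs[:, order]

X = np.random.default_rng(0).normal(size=(50, 10))
n_pca_features = 0.5                       # float -> fraction of features
reduced = pca_reduce(X, int(n_pca_features * X.shape[1]))
labels = [f"PCA Feature {i}" for i in range(reduced.shape[1])]
```

Here a float n_pca_features of 0.5 retains 5 of the 10 input features, and the retained columns are renamed in the "PCA Feature X" style described above.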
-
The following attrs are set during fitting.
-
removed_features
¶ The keys are the feature reduction methods applied. The values are the feature labels removed by that feature reduction method.
- Type
dict
-
reducer_params
¶ The keys are the feature reduction methods applied. The values are the parameters used by each feature reducer.
- Type
dict
-
fit
(**kwargs)¶ Wrapper for a method to log.
- Parameters
operation (str) – The operation to be logged.
- Returns
The method result.
- Return type
result
A feature selection method that removes features which are cross-correlated with another feature by more than the threshold.
- Parameters
df (pandas.DataFrame) – The dataframe containing features, target_key
target (str) – the name of the target column/feature
r_max (0<float<=1) – if R is greater than this value, the feature that has lower correlation with the target is removed.
- Returns
the dataframe with the highly cross-correlated features removed.
- Return type
(pandas.DataFrame)
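The behavior described above can be sketched as follows. This is a simplified stand-in for the method, with an illustrative pairwise loop; the helper name and data are assumptions:

```python
import numpy as np
import pandas as pd

def rm_correlated(df, target, r_max=0.95):
    # For each feature pair with |corr| > r_max, drop the one having
    # lower correlation with the target (the documented rule)
    feats = [c for c in df.columns if c != target]
    corr = df[feats].corr().abs()
    tcorr = df.corr()[target].abs()
    drop = set()
    for i, f1 in enumerate(feats):
        for f2 in feats[i + 1:]:
            if corr.loc[f1, f2] > r_max and f1 not in drop and f2 not in drop:
                drop.add(f1 if tcorr[f1] < tcorr[f2] else f2)
    return df.drop(columns=list(drop))

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a,
                   "b": a + 0.01 * rng.normal(size=100),  # near-duplicate of a
                   "c": rng.normal(size=100)})
df["y"] = 2.0 * df["a"]
reduced = rm_correlated(df, "y", r_max=0.95)
```

Since "b" is nearly identical to "a" but correlates slightly less with the target, "b" is the one removed.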
automatminer.preprocessing.feature_selection module¶
Various in-house feature reduction techniques.
-
class automatminer.preprocessing.feature_selection.TreeFeatureReducer(mode, importance_percentile=0.95, random_state=0)¶
Bases: automatminer.base.DFTransformer
Tree-based feature reduction tools based on sklearn models that have the .feature_importances_ attribute.
- Parameters
-
fit
(X, y, tree='rf', recursive=True, cv=5)¶ Fits to the data (X) and target (y) to determine the selected_features.
- Parameters
X (pandas.DataFrame) – input data, note that numpy matrix is NOT accepted since the X.columns is used for feature names
y (pandas.Series or np.ndarray) – list of outputs used for fitting the tree model
tree (str or instantiated sklearn tree-based model) – if a model is directly fed, it must have the .feature_importances_ attribute
recursive (bool) – whether to recursively reduce the features (True) or just do it once (False)
cv (int or CrossValidation) – sklearn’s cross-validation with the same options (int or actual instantiated CrossValidation)
- Returns
None (sets the class attribute .selected_features).
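A single (non-recursive) pass of this kind of reduction might look like the following, using sklearn’s RandomForestRegressor for its .feature_importances_ attribute. The column names, synthetic data, and the 0.95 cutoff are illustrative, not the class’s exact internals:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=[f"f{i}" for i in range(5)])
y = 3.0 * X["f0"] + 0.1 * rng.normal(size=200)   # only f0 is informative

# Fit a tree model and rank features by importance (descending)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
order = np.argsort(model.feature_importances_)[::-1]
cum = np.cumsum(model.feature_importances_[order])
# Keep features until cumulative importance reaches the percentile
n_keep = int(np.searchsorted(cum, 0.95)) + 1     # importance_percentile=0.95
selected = list(X.columns[order[:n_keep]])
```

With recursive=True, the class repeats this kind of pass on the surviving features until the selection stabilizes.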
-
get_reduced_features
(tree_model, X, y, recursive=True)¶ Gives a reduced list of feature names given a tree-based model that has the .feature_importances_ attribute.
- Parameters
tree_model (instantiated sklearn tree-based model) –
X (pandas.DataFrame) –
y (pandas.Series or numpy.ndarray) – the target column
recursive (bool) –
- Returns ([str]): list of the top percentile of features, as determined by the importance_percentile argument.
-
get_top_features
(feat_importance)¶ Simple function to go through a sorted list of features and select the top percentile.
- Returns ([str]): list of the top percentile of features, as determined by the importance_percentile argument.
-
transform
(X, y=None)¶ Transforms the data with the subset of features determined after calling the fit method on the data.
- Parameters
X (pandas.DataFrame) – input data, note that numpy matrix is NOT accepted since the X.columns is used for feature names
y (placeholder) – ignored input (for consistency in notation)
- Returns (pandas.DataFrame): the data with a reduced number of features.
-
automatminer.preprocessing.feature_selection.
lower_corr_clf
(df, target, f1, f2)¶ Train a simple linear model on the data to decide on the worse of two features. The feature which should be dropped is returned.
- Parameters
df (pandas.DataFrame) – The dataframe containing the features and target.
target (str) – The name of the target column.
f1 (str) – The name of the first feature to compare.
f2 (str) – The name of the second feature to compare.
- Returns
The name of the feature to be dropped (worse score).
- Return type
(str)
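A sketch of this comparison using a one-feature least-squares fit: the `worse_feature` helper name, the R² scoring, and the synthetic data are assumptions, not the library’s exact model:

```python
import numpy as np

def worse_feature(x1, x2, y, names=("f1", "f2")):
    # Score each candidate with a one-feature linear fit to the target
    # and return the name of the feature with the worse R^2 (sketch)
    def r2(x):
        A = np.column_stack([x, np.ones_like(x)])  # slope + intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        return 1.0 - resid.var() / y.var()
    return names[0] if r2(x1) < r2(x2) else names[1]

rng = np.random.default_rng(0)
y = rng.normal(size=200)
signal = y + 0.1 * rng.normal(size=200)   # tracks the target closely
noise = rng.normal(size=200)              # unrelated to the target
dropped = worse_feature(noise, signal, y, names=("noise", "signal"))
```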
-
automatminer.preprocessing.feature_selection.
rebate
(df, target, n_features)¶ Run the MultiSURF* algorithm on a dataframe, returning the reduced df.