automatminer.preprocessing package¶
Submodules¶
automatminer.preprocessing.core module¶
Top level preprocessing classes.
-
class automatminer.preprocessing.core.DataCleaner(max_na_frac=0.01, feature_na_method='drop', encode_categories=True, encoder='one-hot', drop_na_targets=True, na_method_fit='drop', na_method_transform='fill')¶
Bases: automatminer.base.DFTransformer
Transform a featurized dataframe into an ML-ready dataframe.
Works by first removing samples without a target value (if desired), then dropping features with high nan rates, and finally removing or otherwise handling nans for individual samples (a relatively uncommon occurrence).
- Parameters
max_na_frac (float) – The maximum fraction (0.0 - 1.0) of samples allowed to be nan for a given feature. Columns with a higher nan fraction are handled according to feature_na_method.
feature_na_method (str) – Defines how to handle features (columns) with a higher na fraction than max_na_frac. “drop” for dropping these features. “fill” for filling these features with pandas bfill and ffill. “mean” to fill categorical variables and take the mean for numerical variables. Alternatively, specify a number to replace the nans, e.g. 0. If all samples are nan, the feature will be dropped regardless.
encode_categories (bool) – If True, retains features which are categorical (data type is string or object) and then one-hot encodes them. If False, drops them.
encoder (str) – choose a method for encoding the categorical variables. Current options: ‘one-hot’ and ‘label’.
drop_na_targets (bool) – Drop samples containing target values which are na.
na_method_fit (str, float, int) – Set the na_method for samples in fit. Select one of the following methods: “fill” (use pandas fillna with ffill and bfill, sequentially), “ignore” (totally ignore nans in samples), “drop” (drop any remaining samples having a nan feature), “mean” (fills categorical variables, takes means of numerical). Alternatively, specify a number to replace the nans, e.g. 0.
na_method_transform (str, float, int) – The same as na_method_fit, but for transform.
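The cleaning steps above can be sketched with plain pandas. This is a minimal illustration, not the actual DataCleaner implementation; the `clean` helper name and its simplified logic are assumptions:

```python
import numpy as np
import pandas as pd

def clean(df, target, max_na_frac=0.01, drop_na_targets=True):
    """Hypothetical sketch of DataCleaner's core steps."""
    # 1. Drop samples missing the target value (drop_na_targets)
    if drop_na_targets:
        df = df[df[target].notna()]
    # 2. Drop feature columns whose nan fraction exceeds max_na_frac
    #    (the 'drop' setting of feature_na_method)
    feats = df.drop(columns=[target])
    na_frac = feats.isna().mean()
    keep = list(na_frac[na_frac <= max_na_frac].index)
    # 3. Drop any remaining samples containing nans (na_method='drop')
    return df[keep + [target]].dropna()

df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [np.nan, np.nan, np.nan, 1.0],  # mostly nan -> column dropped
    "y":  [0.1, 0.2, np.nan, 0.4],        # row 2 has no target -> dropped
})
cleaned = clean(df, "y", max_na_frac=0.5)
```

The real class also fits on training data and replays the same decisions at transform time; this sketch only shows the per-call logic.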
-
max_problem_col_warning_threshold
¶ The maximum fraction of total columns which may be “problematic” (nan-containing) before a warning is logged.
- Type
float
-
The following attrs are set during fitting.
-
fitted_df
¶ The fitted dataframe
- Type
pandas.DataFrame
-
dropped_samples
¶ A dataframe of samples to be dropped
- Type
pandas.DataFrame
-
fit
(**kwargs)¶ Wrapper for a method to log.
- Parameters
operation (str) – The operation to be logged.
- Returns
The method result.
- Return type
result
-
fit_transform
(df, target, **fit_kwargs)¶ Combines the fitting and transformation of a dataframe.
- Parameters
df (pandas.DataFrame) – The pandas dataframe to be fit.
target (str) – The target string specifying the ML target.
- Returns
The transformed dataframe.
- Return type
(pandas.DataFrame)
-
handle_na
(df, target, na_method, coerce_mismatch=True)¶ First pass for handling cells without values (null or nan). Additional preprocessing may be necessary, as one column may be filled with the median while another is filled with the mean or mode, etc.
- Parameters
df (pandas.DataFrame) – The dataframe containing features
target (str) – The key defining the ML target.
na_method (str) – How to deal with samples still containing nans after troublesome columns are already dropped. Default is ‘drop’. Other options are from pandas.DataFrame.fillna: {‘bfill’, ‘pad’, ‘ffill’}, or ‘ignore’ to ignore nans. Alternatively, specify a value to replace the nans, e.g. 0.
coerce_mismatch (bool) – If there is a mismatch between the fitted dataframe columns and the argument dataframe columns, create and drop mismatch columns so the dataframes are matching. If False, raises an error. New columns are instantiated as all zeros, as most of the time this is a one-hot encoding issue.
- Returns
(pandas.DataFrame) The cleaned df
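The documented na_method options map onto standard pandas operations. A minimal sketch (the `apply_na_method` helper name is hypothetical, not part of the library):

```python
import numpy as np
import pandas as pd

def apply_na_method(df, na_method):
    # Hypothetical helper mirroring the na_method options documented above
    if na_method == "drop":
        return df.dropna()                            # drop samples with nans
    if na_method == "fill":
        return df.ffill().bfill()                     # sequential ffill + bfill
    if na_method == "ignore":
        return df                                     # leave nans untouched
    if na_method == "mean":
        return df.fillna(df.mean(numeric_only=True))  # fill with column means
    return df.fillna(na_method)                       # a literal value, e.g. 0

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, 2.0]})
filled = apply_na_method(df, "mean")
```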
-
property
retained_features
¶ The features retained during fitting, which may be used to craft the dataframe during transform.
- Returns
The list of features retained.
- Return type
(list)
-
class automatminer.preprocessing.core.FeatureReducer(reducers=('pca', ), corr_threshold=0.95, tree_importance_percentile=0.9, n_pca_features='auto', n_rebate_features=0.3, keep_features=None, remove_features=None)¶
Bases: automatminer.base.DFTransformer
Perform feature reduction on a clean dataframe.
- Parameters
reducers ((str)) –
The set of feature reduction operations to be performed on the data. The order of strings determines the order in which the reducers will be applied. Valid reducer strings are the following:
- ’corr’: Removes any cross-correlated features having correlation coefficients larger than a threshold value. Retains feature names.
- ’tree’: Perform iterative feature reduction via a tree-based feature reduction, using the .feature_importances_ attribute implemented in sklearn. Retains feature names.
- ’rebate’: Perform ReliefF feature reduction using the skrebate package. Retains feature names.
- ’pca’: Perform Principal Component Analysis via eigendecomposition. Note the feature labels will be renamed to “PCA Feature X” if pca is present anywhere in the feature reduction scheme!
- Example: Apply tree-based feature reduction, then pca:
reducers = (‘tree’, ‘pca’)
corr_threshold (float) – The correlation threshold between any two features needed for one to be removed (calculated with R).
tree_importance_percentile (float) – The selected percentile (between 0.0 and 1.0) of the features, sorted (descending) based on their importance.
n_pca_features (int, float) – If int, the number of features to be retained by PCA. If float, the fraction of features to be retained by PCA once the dataframe is passed to it (i.e., 0.5 means PCA retains half of the features it is passed). PCA must be present in the reducers. ‘auto’ automatically determines the number of features to retain.
n_rebate_features (int, float) – If int, the number of ReBATE relief features to be retained. If float, the fraction of features to be retained by ReBATE once it is passed the dataframe (i.e., 0.5 means ReBATE retains half of the features it is passed). ReBATE must be present in the reducers.
keep_features (list, None) – A list of features that will not be removed. This option does nothing if PCA feature removal is present.
remove_features (list, None) – A list of features that will be removed. This option does nothing if PCA feature removal is present.
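The ‘pca’ reducer’s eigendecomposition approach and the fractional n_pca_features option can be sketched with numpy alone. This is an illustrative sketch, not FeatureReducer’s actual code:

```python
import numpy as np

def pca_reduce(X, n_components):
    # PCA via eigendecomposition of the covariance matrix, as the
    # 'pca' reducer description above indicates (sketch only)
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]  # top components
    return Xc @ eigvecs[:, order]

X = np.random.default_rng(0).normal(size=(50, 10))
n_pca_features = 0.5                       # float -> fraction of features
reduced = pca_reduce(X, int(n_pca_features * X.shape[1]))
labels = [f"PCA Feature {i}" for i in range(reduced.shape[1])]
```

Here a float n_pca_features of 0.5 retains 5 of the 10 input features, and the retained columns are renamed in the "PCA Feature X" style described above.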
-
The following attrs are set during fitting.
-
removed_features
¶ The keys are the feature reduction methods applied. The values are the feature labels removed by that feature reduction method.
- Type
dict
-
reducer_params
¶ The keys are the feature reduction methods applied. The values are the parameters used by each feature reducer.
- Type
dict
-
fit
(**kwargs)¶ Wrapper for a method to log.
- Parameters
operation (str) – The operation to be logged.
- Returns
The method result.
- Return type
result
A feature selection method that removes features which are cross-correlated with another feature by more than the threshold.
- Parameters
df (pandas.DataFrame) – The dataframe containing features, target_key
target (str) – the name of the target column/feature
r_max (0<float<=1) – if R is greater than this value, the feature that has lower correlation with the target is removed.
- Returns
the dataframe with the highly cross-correlated features removed.
- Return type
(pandas.DataFrame)
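The behavior described above can be sketched as follows. This is a simplified stand-in for the method, with an illustrative pairwise loop; the helper name and data are assumptions:

```python
import numpy as np
import pandas as pd

def rm_correlated(df, target, r_max=0.95):
    # For each feature pair with |corr| > r_max, drop the one having
    # lower correlation with the target (the documented rule)
    feats = [c for c in df.columns if c != target]
    corr = df[feats].corr().abs()
    tcorr = df.corr()[target].abs()
    drop = set()
    for i, f1 in enumerate(feats):
        for f2 in feats[i + 1:]:
            if corr.loc[f1, f2] > r_max and f1 not in drop and f2 not in drop:
                drop.add(f1 if tcorr[f1] < tcorr[f2] else f2)
    return df.drop(columns=list(drop))

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a,
                   "b": a + 0.01 * rng.normal(size=100),  # near-duplicate of a
                   "c": rng.normal(size=100)})
df["y"] = 2.0 * df["a"]
reduced = rm_correlated(df, "y", r_max=0.95)
```

Since "b" is nearly identical to "a" but correlates slightly less with the target, "b" is the one removed.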
automatminer.preprocessing.feature_selection module¶
Various in-house feature reduction techniques.
-
class automatminer.preprocessing.feature_selection.TreeFeatureReducer(mode, importance_percentile=0.95, random_state=0)¶
Bases: automatminer.base.DFTransformer
Tree-based feature reduction tools based on sklearn models that have the .feature_importances_ attribute.
- Parameters
-
fit
(X, y, tree='rf', recursive=True, cv=5)¶ Fits to the data (X) and target (y) to determine the selected_features.
- Parameters
X (pandas.DataFrame) – input data, note that numpy matrix is NOT accepted since the X.columns is used for feature names
y (pandas.Series or np.ndarray) – list of outputs used for fitting the tree model
tree (str or instantiated sklearn tree-based model) – if a model is directly fed, it must have the .feature_importances_ attribute
recursive (bool) – whether to recursively reduce the features (True) or just do it once (False)
cv (int or CrossValidation) – sklearn’s cross-validation with the same options (int or actual instantiated CrossValidation)
- Returns
None (sets the class attribute .selected_features).
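A single (non-recursive) pass of this kind of reduction might look like the following, using sklearn’s RandomForestRegressor for its .feature_importances_ attribute. The column names, synthetic data, and the 0.95 cutoff are illustrative, not the class’s exact internals:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=[f"f{i}" for i in range(5)])
y = 3.0 * X["f0"] + 0.1 * rng.normal(size=200)   # only f0 is informative

# Fit a tree model and rank features by importance (descending)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
order = np.argsort(model.feature_importances_)[::-1]
cum = np.cumsum(model.feature_importances_[order])
# Keep features until cumulative importance reaches the percentile
n_keep = int(np.searchsorted(cum, 0.95)) + 1     # importance_percentile=0.95
selected = list(X.columns[order[:n_keep]])
```

With recursive=True, the class repeats this kind of pass on the surviving features until the selection stabilizes.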
-
get_reduced_features
(tree_model, X, y, recursive=True)¶ Gives a reduced list of feature names given a tree-based model that has the .feature_importances_ attribute.
- Parameters
tree_model (instantiated sklearn tree-based model) –
X (pandas.DataFrame) –
y (pandas.Series or numpy.ndarray) – the target column
recursive (bool) –
- Returns ([str]): list of the top percentile of features, as determined by the importance_percentile argument.
-
get_top_features
(feat_importance)¶ Simple function to go through a sorted list of features and select the top percentile.
- Returns ([str]): list of the top percentile of features, as determined by the importance_percentile argument.
-
transform
(X, y=None)¶ Transforms the data with the subset of features determined after calling the fit method on the data.
- Parameters
X (pandas.DataFrame) – input data, note that numpy matrix is NOT accepted since the X.columns is used for feature names
y (placeholder) – ignored input (for consistency in notation)
- Returns (pandas.DataFrame): the data with a reduced number of features.
-
automatminer.preprocessing.feature_selection.
lower_corr_clf
(df, target, f1, f2)¶ Train a simple linear model on the data to decide on the worse of two features. The feature which should be dropped is returned.
- Parameters
df (pandas.DataFrame) – The dataframe containing the features and target.
target (str) – The name of the target column.
f1 (str) – The name of the first feature to compare.
f2 (str) – The name of the second feature to compare.
- Returns
The name of the feature to be dropped (worse score).
- Return type
(str)
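A sketch of this comparison using a one-feature least-squares fit: the `worse_feature` helper name, the R² scoring, and the synthetic data are assumptions, not the library’s exact model:

```python
import numpy as np

def worse_feature(x1, x2, y, names=("f1", "f2")):
    # Score each candidate with a one-feature linear fit to the target
    # and return the name of the feature with the worse R^2 (sketch)
    def r2(x):
        A = np.column_stack([x, np.ones_like(x)])  # slope + intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        return 1.0 - resid.var() / y.var()
    return names[0] if r2(x1) < r2(x2) else names[1]

rng = np.random.default_rng(0)
y = rng.normal(size=200)
signal = y + 0.1 * rng.normal(size=200)   # tracks the target closely
noise = rng.normal(size=200)              # unrelated to the target
dropped = worse_feature(noise, signal, y, names=("noise", "signal"))
```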
-
automatminer.preprocessing.feature_selection.
rebate
(df, target, n_features)¶ Run the MultiSURF* algorithm on a dataframe, returning the reduced df.