automatminer.featurization package

Submodules

automatminer.featurization.base module

Base classes for sets of featurizers.

class automatminer.featurization.base.FeaturizerSet(exclude=None)

Bases: abc.ABC

Abstract class for defining sets of featurizers.

All FeaturizerSets should implement at least four sets of featurizers:

  • express - The “go-to” set of featurizers

  • heavy - A more expensive and complete (though not necessarily better) version of express.

  • all - All featurizers available for the intended featurization type(s)

  • debug - An ultra-minimal set of featurizers for debugging purposes.

Each set returned is a list of matminer featurizer objects. The choice of featurizers for a given set is at the discretion of the implementor.

Parameters

exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.
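The contract described above can be sketched with a small stand-in class. This is a minimal illustration, not automatminer's implementation: SketchFeaturizerSet, ToyFeaturizers, the _filter helper, and the three placeholder featurizer classes are all hypothetical names, and exclusion is assumed to compare against each featurizer's class name.

```python
from abc import ABC, abstractmethod


class SketchFeaturizerSet(ABC):
    """Hypothetical stand-in for FeaturizerSet (not automatminer code)."""

    def __init__(self, exclude=None):
        # Class names to drop from every returned set.
        self.exclude = list(exclude) if exclude else []

    def _filter(self, featurizers):
        # Assumption: exclusion matches on the featurizer's class name.
        return [f for f in featurizers if type(f).__name__ not in self.exclude]

    @property
    @abstractmethod
    def express(self): ...

    @property
    @abstractmethod
    def heavy(self): ...

    @property
    @abstractmethod
    def all(self): ...

    @property
    @abstractmethod
    def debug(self): ...


# Placeholder featurizers; real sets would hold matminer featurizer objects.
class FastFeat: ...
class SlowFeat: ...
class MatrixFeat: ...


class ToyFeaturizers(SketchFeaturizerSet):
    @property
    def debug(self):
        return self._filter([FastFeat()])

    @property
    def express(self):
        return self._filter([FastFeat()])

    @property
    def heavy(self):
        return self._filter([FastFeat(), SlowFeat()])

    @property
    def all(self):
        return self._filter([FastFeat(), SlowFeat(), MatrixFeat()])


def names(featurizers):
    """Return the class names of a list of featurizer objects."""
    return [type(f).__name__ for f in featurizers]
```

Because all four properties are abstract, a subclass that omits any of them cannot be instantiated, which is one way to enforce the "at least four sets" requirement.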

abstract property all

All featurizers available for this featurization type. These featurizers are allowed to:

  • have multiple, highly similar versions of the same featurizer,

  • not work on standard versions of the input types (e.g., SiteDOS works on the DOS for a single site, not a structure),

  • return non-vectorized outputs (e.g., matrices, other data types).

Return type

List[~T]

abstract property debug

An ultra-minimal set of featurizers for debugging.

Return type

List[~T]

abstract property express

A focused set of featurizers which should:

  • be reasonably fast to featurize

  • not be prone to errors/NaNs

  • provide informative learning features

  • not include many irrelevant features that make ML expensive

  • have each featurizer return a vector

  • allow the recognized type (structure, composition, etc.) as input.

Return type

List[~T]

abstract property heavy

A more expensive and complete (though not necessarily better) version of express.

Similar to express, all featurizers selected should return useful learning features. However the selected featurizers may now:

  • generate many (thousands+) features

  • be expensive to featurize (1s+ per item)

  • be prone to NaNs on certain datasets

Return type

List[~T]

automatminer.featurization.core module

Classes for automatic featurization and core featurizer functionality.

class automatminer.featurization.core.AutoFeaturizer(cache_src=None, preset=None, featurizers=None, exclude=None, functionalize=False, ignore_cols=None, ignore_errors=True, drop_inputs=True, guess_oxistates=True, multiindex=False, do_precheck=True, n_jobs=None, composition_col='composition', structure_col='structure', bandstructure_col='bandstructure', dos_col='dos')

Bases: automatminer.base.DFTransformer

Automatically featurize a dataframe.

Use this object first by calling fit, then by calling transform.

AutoFeaturizer requires you to specify the column names for each type of featurization, or just use the defaults:

  • “composition”: To use composition features

  • “structure”: To use structure features

  • “bandstructure”: To use bandstructure features

  • “dos”: To use density of states features

The featurizers corresponding to each featurizer type cannot be used if the correct column name is not present.
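The relationship between column names and usable featurizer types can be sketched as follows. This is an illustrative helper only: usable_featurizer_types is a hypothetical function, not part of automatminer, and it works on a plain list of column names rather than a dataframe. The keyword defaults mirror those in the AutoFeaturizer signature.

```python
def usable_featurizer_types(
    columns,
    composition_col="composition",
    structure_col="structure",
    bandstructure_col="bandstructure",
    dos_col="dos",
):
    """Return the featurizer types whose input column is present."""
    col_for_type = {
        "composition": composition_col,
        "structure": structure_col,
        "bandstructure": bandstructure_col,
        "dos": dos_col,
    }
    # A featurizer type is usable only if its column exists in the data.
    return [ftype for ftype, col in col_for_type.items() if col in columns]


# A dataframe with columns "formula", "structure", and "target", where the
# compositions live in "formula" instead of the default column name.
available = usable_featurizer_types(
    ["formula", "structure", "target"], composition_col="formula"
)
```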

Parameters
  • cache_src (str) – An absolute path to a json file holding feature information. If the file exists, features are read from it (loc index-wise) instead of featurizing. If the file does not exist, AutoFeaturizer will featurize normally, then save the features to a new file. Only features (not featurizer input objects) will be saved.

  • preset (str) – “express”, “heavy”, “debug”, or “all”. Determines which preset featurizers should be applied. See the featurizer sets for specifics of each. Default is “express”. Incompatible with the featurizers arg.

  • featurizers (dict) –

    Use this option if you want to manually specify the featurizers to use. Keys are the featurizer types you want applied (e.g., “structure”, “composition”). The corresponding values are lists of featurizer objects you want for each featurizer type.

    Example

    {"composition": [ElementProperty.from_preset("matminer"),
                     EwaldEnergy()],
     "structure": [BagofBonds(), GlobalSymmetryFeatures()]}

  • exclude ([str]) – Class names of featurizers to exclude. Only used if you use a preset.

  • ignore_cols ([str]) – Column names to be ignored/removed from any dataframe undergoing fitting or transformation. If columns are not ignored, they may be used later on for learning.

  • ignore_errors (bool) – If True, each featurizer will ignore all errors during featurization.

  • drop_inputs (bool) – Drop the columns containing input objects for featurization after they are featurized.

  • guess_oxistates (bool) – If True, try to decorate sites with oxidation state.

  • multiindex (bool) – If True, returns a multiindexed dataframe. Not recommended for use in MatPipe.

  • do_precheck (bool) – Execute a precheck on each featurizer before featurizing with it. See matminer prechecking for more info.

  • n_jobs (int) –

    The number of parallel jobs to use during featurization for each featurizer. Default is n_cores.

  • composition_col (str) – Name of the column containing compositions to be featurized.

  • structure_col (str) – Name of the column containing structures to be featurized.

  • bandstructure_col (str) – Name of the column containing bandstructures to be featurized.

  • dos_col (str) – Name of the column containing density of states objects to be featurized.
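The interplay between preset and featurizers described in the parameter list can be sketched as a small resolution step. This is a hedged illustration, not AutoFeaturizer's code: resolve_featurizer_source is a hypothetical helper, and raising ValueError when both arguments are given is an assumption about how the documented incompatibility might be enforced.

```python
VALID_PRESETS = ("express", "heavy", "all", "debug")


def resolve_featurizer_source(preset=None, featurizers=None):
    """Decide whether featurizers come from a preset or from the user."""
    if preset is not None and featurizers is not None:
        # Assumption: the documented incompatibility is treated as an error.
        raise ValueError("Pass either preset or featurizers, not both.")
    if featurizers is not None:
        # Manual mode: a dict mapping featurizer type to featurizer objects.
        return ("manual", featurizers)
    preset = "express" if preset is None else preset  # documented default
    if preset not in VALID_PRESETS:
        raise ValueError(f"Unknown preset: {preset!r}")
    return ("preset", preset)
```

In manual mode the exclude argument has no effect, since it is documented as applying only when a preset is used.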

These attributes are set during fitting
featurizers

Same format as the input dictionary in Parameters. Values contain the actual objects being used for featurization. Featurizers can be removed if do_precheck=True and the featurizer is not valid for more than self.min_precheck_frac fraction of the fitting dataset.

Type

dict

features

The features generated from the application of all featurizers.

Type

dict

auto_featurizer

Whether the featurizers are set automatically or passed by the user.

Type

bool

fitted_input_df

The dataframe which was fitted on.

Type

pd.DataFrame

converted_input_df

The converted dataframe which was fitted on (i.e., strings converted to compositions).

Type

pd.DataFrame

removed_featurizers

A list of featurizers removed by prechecking methods, if applicable.

Type

[BaseFeaturizer]

Attributes not set during fitting and not specified by arguments
min_precheck_frac

The minimum fraction of a featurizer’s input that must be valid (as judged by featurizer.precheck(data)) for the featurizer to be kept.

Type

float
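How a precheck threshold might decide which featurizers survive can be sketched as below. This is a stand-in under stated assumptions, not automatminer's logic: AcceptAll, EvenOnly, valid_fraction, and split_by_precheck are hypothetical names, the inputs are plain integers instead of materials objects, and a featurizer is assumed to be kept when its valid fraction is at least min_precheck_frac.

```python
class AcceptAll:
    """Placeholder featurizer whose precheck passes every input."""

    def precheck(self, x):
        return True


class EvenOnly:
    """Placeholder featurizer that is only valid for even integers."""

    def precheck(self, x):
        return x % 2 == 0


def valid_fraction(featurizer, inputs):
    """Fraction of inputs for which featurizer.precheck returns True."""
    inputs = list(inputs)
    if not inputs:
        return 0.0
    n_valid = sum(1 for x in inputs if featurizer.precheck(x))
    return n_valid / len(inputs)


def split_by_precheck(featurizers, inputs, min_precheck_frac):
    """Split featurizers into (kept, removed) using the threshold."""
    kept, removed = [], []
    for f in featurizers:
        # Assumption: keep when the valid fraction meets the minimum.
        if valid_fraction(f, inputs) >= min_precheck_frac:
            kept.append(f)
        else:
            removed.append(f)
    return kept, removed


kept, removed = split_by_precheck([AcceptAll(), EvenOnly()], [2, 4, 6, 7], 0.9)
```

Here EvenOnly is valid for three of the four inputs, so it falls below the 0.9 threshold and lands in the removed list, analogous to removed_featurizers.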

fit(**kwargs)

Wrapper for a method to log.

Parameters

operation (str) – The operation to be logging.

Returns

The method result.

Return type

result

transform(**kwargs)

Wrapper for a method to log.

Parameters

operation (str) – The operation to be logging.

Returns

The method result.

Return type

result

automatminer.featurization.sets module

Defines sets of featurizers to be used by automatminer during featurization.

Featurizer sets are classes with attributes containing lists of featurizers. For example, the set of all express structure featurizers could be found with:

StructureFeaturizers().express
class automatminer.featurization.sets.AllFeaturizers(exclude=None)

Bases: automatminer.featurization.base.FeaturizerSet

Featurizer set containing all available featurizers.

This class provides subsets for composition, structure, density of states and band structure based featurizers. Additional sets containing all featurizers and the set of express/heavy/etc. featurizers are provided.

Example usage:

composition_featurizers = AllFeaturizers().composition
Parameters

exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.
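One way a combined set could expose both per-type subsets and cross-type presets is sketched below. This is only an assumption about the design, not AllFeaturizers' implementation: TypeSet and CombinedSet are hypothetical classes, and strings stand in for featurizer objects.

```python
class TypeSet:
    """Placeholder for one per-type featurizer set (hypothetical)."""

    def __init__(self, express, heavy):
        self.express = express
        self.heavy = heavy


class CombinedSet:
    """Hypothetical container concatenating presets across input types."""

    def __init__(self, **type_sets):
        # e.g. composition=TypeSet(...), structure=TypeSet(...)
        self._type_sets = type_sets

    def _concat(self, level):
        # Gather the requested preset from every per-type set, in order.
        combined = []
        for type_set in self._type_sets.values():
            combined.extend(getattr(type_set, level))
        return combined

    @property
    def express(self):
        return self._concat("express")

    @property
    def heavy(self):
        return self._concat("heavy")


combined = CombinedSet(
    composition=TypeSet(["c_fast"], ["c_fast", "c_slow"]),
    structure=TypeSet(["s_fast"], ["s_fast", "s_slow"]),
)
```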

property all

All featurizers available for this featurization type. These featurizers are allowed to:

  • have multiple, highly similar versions of the same featurizer,

  • not work on standard versions of the input types (e.g., SiteDOS works on the DOS for a single site, not a structure),

  • return non-vectorized outputs (e.g., matrices, other data types).

property bandstructure

List of all band structure based featurizers.

property composition

List of all composition based featurizers.

property debug

An ultra-minimal set of featurizers for debugging.

property dos

List of all density of states based featurizers.

property express

A focused set of featurizers which should:

  • be reasonably fast to featurize

  • not be prone to errors/NaNs

  • provide informative learning features

  • not include many irrelevant features that make ML expensive

  • have each featurizer return a vector

  • allow the recognized type (structure, composition, etc.) as input.

property heavy

A more expensive and complete (though not necessarily better) version of express.

Similar to express, all featurizers selected should return useful learning features. However the selected featurizers may now:

  • generate many (thousands+) features

  • be expensive to featurize (1s+ per item)

  • be prone to NaNs on certain datasets

property structure

List of all structure based featurizers.

class automatminer.featurization.sets.BSFeaturizers(exclude=None)

Bases: automatminer.featurization.base.FeaturizerSet

Featurizer set containing band structure featurizers.

See the FeaturizerSet documentation for a description of each property (sublist of featurizers).

Example usage:

bs_featurizers = BSFeaturizers().express
Parameters

exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.

property all

List of all band structure based featurizers.

property debug

An ultra-minimal set of featurizers for debugging.

property express

A focused set of featurizers which should:

  • be reasonably fast to featurize

  • not be prone to errors/NaNs

  • provide informative learning features

  • not include many irrelevant features that make ML expensive

  • have each featurizer return a vector

  • allow the recognized type (structure, composition, etc.) as input.

property heavy

A more expensive and complete (though not necessarily better) version of express.

Similar to express, all featurizers selected should return useful learning features. However the selected featurizers may now:

  • generate many (thousands+) features

  • be expensive to featurize (1s+ per item)

  • be prone to NaNs on certain datasets

class automatminer.featurization.sets.CompositionFeaturizers(exclude=None)

Bases: automatminer.featurization.base.FeaturizerSet

Featurizer set containing composition featurizers.

See the FeaturizerSet documentation for a description of each property (sublist of featurizers).

Example usage:

best_featurizers = CompositionFeaturizers().express
Parameters

exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.

property all

All featurizers available for this featurization type. These featurizers are allowed to:

  • have multiple, highly similar versions of the same featurizer,

  • not work on standard versions of the input types (e.g., SiteDOS works on the DOS for a single site, not a structure),

  • return non-vectorized outputs (e.g., matrices, other data types).

property debug

An ultra-minimal set of featurizers for debugging.

property express

A focused set of featurizers which should:

  • be reasonably fast to featurize

  • not be prone to errors/NaNs

  • provide informative learning features

  • not include many irrelevant features that make ML expensive

  • have each featurizer return a vector

  • allow the recognized type (structure, composition, etc.) as input.

property heavy

A more expensive and complete (though not necessarily better) version of express.

Similar to express, all featurizers selected should return useful learning features. However the selected featurizers may now:

  • generate many (thousands+) features

  • be expensive to featurize (1s+ per item)

  • be prone to NaNs on certain datasets

class automatminer.featurization.sets.DOSFeaturizers(exclude=None)

Bases: automatminer.featurization.base.FeaturizerSet

Featurizer set containing density of states featurizers.

See the FeaturizerSet documentation for a description of each property (sublist of featurizers).

Example usage:

dos_featurizers = DOSFeaturizers().express

Density of states featurizers should work on the entire density of states if they are in express or heavy. If they are in “all” they may work on sites or return matrices.

Parameters

exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.

property all

List of all density of states based featurizers.

property debug

An ultra-minimal set of featurizers for debugging.

property express

A focused set of featurizers which should:

  • be reasonably fast to featurize

  • not be prone to errors/NaNs

  • provide informative learning features

  • not include many irrelevant features that make ML expensive

  • have each featurizer return a vector

  • allow the recognized type (structure, composition, etc.) as input.

property heavy

A more expensive and complete (though not necessarily better) version of express.

Similar to express, all featurizers selected should return useful learning features. However the selected featurizers may now:

  • generate many (thousands+) features

  • be expensive to featurize (1s+ per item)

  • be prone to NaNs on certain datasets

class automatminer.featurization.sets.StructureFeaturizers(exclude=None)

Bases: automatminer.featurization.base.FeaturizerSet

Featurizer set containing structure featurizers.

See the FeaturizerSet documentation for a description of each property (sublist of featurizers).

Example usage:

best_featurizers = StructureFeaturizers().express
Parameters

exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.

property all

All featurizers available for this featurization type. These featurizers are allowed to:

  • have multiple, highly similar versions of the same featurizer,

  • not work on standard versions of the input types (e.g., SiteDOS works on the DOS for a single site, not a structure),

  • return non-vectorized outputs (e.g., matrices, other data types).

property debug

An ultra-minimal set of featurizers for debugging.

property express

A focused set of featurizers which should:

  • be reasonably fast to featurize

  • not be prone to errors/NaNs

  • provide informative learning features

  • not include many irrelevant features that make ML expensive

  • have each featurizer return a vector

  • allow the recognized type (structure, composition, etc.) as input.

property heavy

A more expensive and complete (though not necessarily better) version of express.

Similar to express, all featurizers selected should return useful learning features. However the selected featurizers may now:

  • generate many (thousands+) features

  • be expensive to featurize (1s+ per item)

  • be prone to NaNs on certain datasets

property need_fit

Module contents