automatminer.featurization package¶
Subpackages¶
Submodules¶
automatminer.featurization.base module¶
Base classes for sets of featurizers.
-
class
automatminer.featurization.base.
FeaturizerSet
(exclude=None)¶ Bases:
abc.ABC
Abstract class for defining sets of featurizers.
All FeaturizerSets should implement at least four sets of featurizers:
express - The “go-to” set of featurizers.
heavy - A more expensive and complete (though not necessarily better) version of express.
all - All featurizers available for the intended featurization type(s).
debug - An ultra-minimal set of featurizers for debugging purposes.
Each set returned is a list of matminer featurizer objects. The choice of featurizers for a given set is at the discretion of the implementor.
- Parameters
exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.
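The contract described above can be sketched with standard-library Python only. This is a minimal illustration, not automatminer's implementation: `FeaturizerSetSketch`, `ToyFeaturizers`, the `_filter` helper, and the three toy featurizer classes are all hypothetical names, and the way `exclude` is applied (by comparing class names) is an assumption based on the parameter description.

```python
from abc import ABC, abstractmethod


class FeaturizerSetSketch(ABC):
    """Stand-in for FeaturizerSet (hypothetical, for illustration only)."""

    def __init__(self, exclude=None):
        self.exclude = list(exclude) if exclude else []

    def _filter(self, featurizers):
        # Assumed behavior: drop featurizers whose class name is excluded.
        return [f for f in featurizers if type(f).__name__ not in self.exclude]

    @property
    @abstractmethod
    def express(self):
        ...

    @property
    @abstractmethod
    def heavy(self):
        ...

    @property
    @abstractmethod
    def all(self):
        ...

    @property
    @abstractmethod
    def debug(self):
        ...


# Toy featurizer classes standing in for matminer featurizer objects.
class FastFeaturizer:
    pass


class SlowFeaturizer:
    pass


class MatrixFeaturizer:
    pass


class ToyFeaturizers(FeaturizerSetSketch):
    """A concrete set implementing all four required properties."""

    @property
    def debug(self):
        return self._filter([FastFeaturizer()])

    @property
    def express(self):
        return self._filter([FastFeaturizer()])

    @property
    def heavy(self):
        return self._filter([FastFeaturizer(), SlowFeaturizer()])

    @property
    def all(self):
        return self._filter(
            [FastFeaturizer(), SlowFeaturizer(), MatrixFeaturizer()]
        )


heavy_names = [
    type(f).__name__
    for f in ToyFeaturizers(exclude=["SlowFeaturizer"]).heavy
]
print(heavy_names)
```

Because the four properties are abstract, a subclass that omits any of them cannot be instantiated, which is one way to enforce the "at least four sets" requirement.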
-
abstract property
all
¶ All featurizers available for this featurization type. These featurizers are allowed to:
have multiple, highly similar versions of the same featurizer,
not work on standard versions of the input types (e.g., SiteDOS works on the DOS for a single site, not a structure),
return non-vectorized outputs (e.g., matrices, other data types).
- Return type
List
[~T]
-
abstract property
express
¶ A focused set of featurizers which should:
be reasonably fast to featurize
not be prone to errors/NaNs
provide informative learning features
not include many irrelevant features that make ML expensive
have each featurizer return a vector
allow the recognized type (structure, composition, etc.) as input.
- Return type
List
[~T]
-
abstract property
heavy
¶ A more expensive and complete (though not necessarily better) version of express.
Similar to express, all featurizers selected should return useful learning features. However, the selected featurizers may now:
generate many (thousands+) features
be expensive to featurize (1s+ per item)
be prone to NaNs on certain datasets
- Return type
List
[~T]
automatminer.featurization.core module¶
Classes for automatic featurization and core featurizer functionality.
-
class
automatminer.featurization.core.
AutoFeaturizer
(cache_src=None, preset=None, featurizers=None, exclude=None, functionalize=False, ignore_cols=None, ignore_errors=True, drop_inputs=True, guess_oxistates=True, multiindex=False, do_precheck=True, n_jobs=None, composition_col='composition', structure_col='structure', bandstructure_col='bandstructure', dos_col='dos')¶ Bases:
automatminer.base.DFTransformer
Automatically featurize a dataframe.
Use this object first by calling fit, then by calling transform.
AutoFeaturizer requires you to specify the column names for each type of featurization, or just use the defaults:
“composition”: To use composition features
“structure”: To use structure features
“bandstructure”: To use bandstructure features
“dos”: To use density of states features
The featurizers corresponding to each featurizer type cannot be used if the correct column name is not present.
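The column rule above can be sketched as a small lookup. This is an illustration under stated assumptions rather than AutoFeaturizer's own logic: `usable_featurizer_types` and `DEFAULT_COLUMNS` are hypothetical names, and the defaults are copied from the constructor signature.

```python
# Default column name for each featurizer type, as in the signature above.
DEFAULT_COLUMNS = {
    "composition": "composition",
    "structure": "structure",
    "bandstructure": "bandstructure",
    "dos": "dos",
}


def usable_featurizer_types(df_columns, column_names=None):
    """Return the featurizer types whose input column is present.

    Hypothetical helper: a featurizer type is usable only if the column
    name configured for it appears among the dataframe's columns.
    """
    column_names = column_names or DEFAULT_COLUMNS
    present = set(df_columns)
    return [ftype for ftype, col in column_names.items() if col in present]


# A dataframe with "composition" and "dos" columns plus a target column.
usable = usable_featurizer_types(["composition", "dos", "band_gap"])

# Renaming the composition column, as composition_col="formula" would.
custom = usable_featurizer_types(
    ["formula"], {**DEFAULT_COLUMNS, "composition": "formula"}
)
print(usable, custom)
```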
- Parameters
cache_src (str) – An absolute path to a json file holding feature information. If file exists, will read features (loc indexwise) from this file instead of featurizing. If this file does not exist, AutoFeaturizer will featurize normally, then save the features to a new file. Only features (not featurizer input objects) will be saved
preset (str) – “express”, “heavy”, “debug”, or “all”. Determines which featurizers are applied. See the Featurizer sets for specifics of each. Default is “express”. Incompatible with the featurizers arg.
featurizers (dict) –
Use this option if you want to manually specify the featurizers to use. Keys are the featurizer types you want applied (e.g., “structure”, “composition”). The corresponding values are lists of featurizer objects you want for each featurizer type.
Example
{"composition": [ElementProperty.from_preset("matminer"), EwaldEnergy()],
"structure": [BagofBonds(), GlobalSymmetryFeatures()]}
exclude ([str]) – Class names of featurizers to exclude. Only used if you use a preset.
ignore_cols ([str]) – Column names to be ignored/removed from any dataframe undergoing fitting or transformation. If columns are not ignored, they may be used later on for learning.
ignore_errors (bool) – If True, each featurizer will ignore all errors during featurization.
drop_inputs (bool) – Drop the columns containing input objects for featurization after they are featurized.
guess_oxistates (bool) – If True, try to decorate sites with oxidation state.
multiindex (bool) – If True, returns a multiindexed dataframe. Not recommended for use in MatPipe.
do_precheck (bool) – Execute a precheck on each featurizer before featurizing with it. See matminer prechecking for more info.
n_jobs (int) –
The number of parallel jobs to use during featurization for each featurizer. Default is n_cores.
composition_col (str) – Name of the column containing compositions to be featurized.
structure_col (str) – Name of the column containing structures to be featurized.
bandstructure_col (str) – Name of the column containing bandstructures to be featurized.
dos_col (str) – Name of the column containing density of states objects to be featurized.
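The cache_src behavior described above (read features from the file if it exists, otherwise featurize and save) follows a common read-or-compute pattern. The sketch below illustrates that pattern with the standard library only. `featurize_with_cache` and `toy_featurize` are hypothetical names, and the JSON layout (features keyed by row index) is an assumption, not the format AutoFeaturizer writes.

```python
import json
import os
import tempfile


def featurize_with_cache(cache_src, rows, featurize):
    """Load features from cache_src if it exists, else compute and save.

    Hypothetical helper. `rows` maps a row index to a featurizer input;
    only the computed features, not the inputs, are written to the file.
    """
    if os.path.exists(cache_src):
        with open(cache_src) as f:
            return json.load(f)
    features = {str(index): featurize(value) for index, value in rows.items()}
    with open(cache_src, "w") as f:
        json.dump(features, f)
    return features


calls = []  # Records each input that is actually featurized.


def toy_featurize(formula):
    calls.append(formula)
    return {"n_chars": len(formula)}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.json")
    rows = {0: "Fe2O3", 1: "NaCl"}
    first = featurize_with_cache(path, rows, toy_featurize)   # computes, saves
    second = featurize_with_cache(path, rows, toy_featurize)  # reads the file

print(first == second, calls)
```

The second call never invokes `toy_featurize`, which is the point of the cache: expensive featurization runs once per file.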
-
These attributes are set during fitting
-
featurizers
¶ Same format as input dictionary in Args. Values contain the actual objects being used for featurization. Featurizers can be removed if check_validity=True and the featurizer is not valid for more than self.min_precheck_frac fraction of the fitting dataset.
- Type
dict
-
fitted_input_df
¶ The dataframe which was fitted on
- Type
pd.DataFrame
-
converted_input_df
¶ The converted dataframe which was fitted on (i.e., strings converted to compositions).
- Type
pd.DataFrame
-
removed_featurizers
¶ A list of featurizers removed by prechecking methods, if applicable
- Type
[BaseFeaturizer]
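The precheck-based removal mentioned above can be sketched as follows. This is a hedged illustration, not automatminer's code: `prune_by_precheck` and the two toy featurizer classes are hypothetical, the only assumption about real featurizers is that they expose a `precheck` method taking a single input, and the `>=` comparison against `min_precheck_frac` is a guess at the threshold semantics. The sketch also assumes a non-empty list of inputs.

```python
class AcceptsEverything:
    """Toy featurizer whose precheck passes for every input."""

    def precheck(self, formula):
        return True


class ShortFormulasOnly:
    """Toy featurizer whose precheck passes only for short formulas."""

    def precheck(self, formula):
        return len(formula) <= 4


def prune_by_precheck(featurizers, inputs, min_precheck_frac):
    """Split featurizers into (kept, removed) by precheck pass fraction."""
    kept, removed = [], []
    for featurizer in featurizers:
        n_valid = sum(1 for item in inputs if featurizer.precheck(item))
        frac_valid = n_valid / len(inputs)
        if frac_valid >= min_precheck_frac:
            kept.append(featurizer)
        else:
            removed.append(featurizer)
    return kept, removed


inputs = ["NaCl", "Fe2O3", "LiCoO2", "MgO"]
kept, removed = prune_by_precheck(
    [AcceptsEverything(), ShortFormulasOnly()], inputs, min_precheck_frac=0.9
)
kept_names = [type(f).__name__ for f in kept]
removed_names = [type(f).__name__ for f in removed]
print(kept_names, removed_names)
```

Here `ShortFormulasOnly` passes on two of four inputs (a fraction of 0.5), so it falls below the 0.9 threshold and lands in the removed list.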
-
Attributes not set during fitting and not specified by arguments
-
min_precheck_frac
The minimum fraction of a featurizer’s input that can be valid (via featurizer.precheck(data)).
- Type
float
automatminer.featurization.sets module¶
Defines sets of featurizers to be used by automatminer during featurization.
Featurizer sets are classes with attributes containing lists of featurizers. For example, the set of all express structure featurizers could be found with:
StructureFeaturizers().express
-
class
automatminer.featurization.sets.
AllFeaturizers
(exclude=None)¶ Bases:
automatminer.featurization.base.FeaturizerSet
Featurizer set containing all available featurizers.
This class provides subsets for composition, structure, density of states and band structure based featurizers. Additional sets containing all featurizers and the set of express/heavy/etc. featurizers are provided.
Example usage:
composition_featurizers = AllFeaturizers().composition
- Parameters
exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.
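One way to picture the structure described above is a set that holds one per-type set and exposes both per-type lists and presets collected across types. The sketch below is an assumption about that layout, not the library's implementation: `ToyTypeSet`, `AllSetsSketch`, `_collect`, and `_filter` are hypothetical names, and plain strings stand in for featurizer objects.

```python
class ToyTypeSet:
    """Stand-in for a per-type featurizer set (strings, not featurizers)."""

    def __init__(self, express, heavy_extra, all_extra):
        self.express = list(express)
        self.heavy = self.express + list(heavy_extra)
        self.all = self.heavy + list(all_extra)


class AllSetsSketch:
    """Hypothetical aggregate over several per-type sets."""

    def __init__(self, exclude=None):
        self.exclude = set(exclude or [])
        self._sets = {
            "composition": ToyTypeSet(["CompFast"], ["CompSlow"], []),
            "structure": ToyTypeSet(
                ["StructFast"], ["StructSlow"], ["StructMatrix"]
            ),
        }

    def _filter(self, names):
        # Apply the exclude list to any returned set.
        return [name for name in names if name not in self.exclude]

    def _collect(self, preset):
        # Concatenate the given preset from every per-type set.
        collected = []
        for type_set in self._sets.values():
            collected.extend(getattr(type_set, preset))
        return self._filter(collected)

    @property
    def composition(self):
        return self._filter(self._sets["composition"].all)

    @property
    def structure(self):
        return self._filter(self._sets["structure"].all)

    @property
    def express(self):
        return self._collect("express")

    @property
    def all(self):
        return self._collect("all")


sets = AllSetsSketch(exclude=["StructSlow"])
print(sets.composition, sets.express, sets.all)
```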
-
property
all
¶ All featurizers available for this featurization type. These featurizers are allowed to:
have multiple, highly similar versions of the same featurizer,
not work on standard versions of the input types (e.g., SiteDOS works on the DOS for a single site, not a structure),
return non-vectorized outputs (e.g., matrices, other data types).
-
property
bandstructure
¶ List of all band structure based featurizers.
-
property
composition
¶ List of all composition based featurizers.
-
property
debug
¶ An ultra-minimal set of featurizers for debugging.
-
property
dos
¶ List of all density of states based featurizers.
-
property
express
¶ A focused set of featurizers which should:
be reasonably fast to featurize
not be prone to errors/NaNs
provide informative learning features
not include many irrelevant features that make ML expensive
have each featurizer return a vector
allow the recognized type (structure, composition, etc.) as input.
-
property
heavy
¶ A more expensive and complete (though not necessarily better) version of express.
Similar to express, all featurizers selected should return useful learning features. However, the selected featurizers may now:
generate many (thousands+) features
be expensive to featurize (1s+ per item)
be prone to NaNs on certain datasets
-
property
structure
¶ List of all structure based featurizers.
-
class
automatminer.featurization.sets.
BSFeaturizers
(exclude=None)¶ Bases:
automatminer.featurization.base.FeaturizerSet
Featurizer set containing band structure featurizers.
See the FeaturizerSet documentation for a description of each property (sublist of featurizers).
Example usage:
bs_featurizers = BSFeaturizers().express
- Parameters
exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.
-
property
all
¶ List of all band structure based featurizers.
-
property
debug
¶ An ultra-minimal set of featurizers for debugging.
-
property
express
¶ A focused set of featurizers which should:
be reasonably fast to featurize
not be prone to errors/NaNs
provide informative learning features
not include many irrelevant features that make ML expensive
have each featurizer return a vector
allow the recognized type (structure, composition, etc.) as input.
-
property
heavy
¶ A more expensive and complete (though not necessarily better) version of express.
Similar to express, all featurizers selected should return useful learning features. However, the selected featurizers may now:
generate many (thousands+) features
be expensive to featurize (1s+ per item)
be prone to NaNs on certain datasets
-
class
automatminer.featurization.sets.
CompositionFeaturizers
(exclude=None)¶ Bases:
automatminer.featurization.base.FeaturizerSet
Featurizer set containing composition featurizers.
See the FeaturizerSet documentation for a description of each property (sublist of featurizers).
Example usage:
best_featurizers = CompositionFeaturizers().express
- Parameters
exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.
-
property
all
¶ All featurizers available for this featurization type. These featurizers are allowed to:
have multiple, highly similar versions of the same featurizer,
not work on standard versions of the input types (e.g., SiteDOS works on the DOS for a single site, not a structure),
return non-vectorized outputs (e.g., matrices, other data types).
-
property
debug
¶ An ultra-minimal set of featurizers for debugging.
-
property
express
¶ A focused set of featurizers which should:
be reasonably fast to featurize
not be prone to errors/NaNs
provide informative learning features
not include many irrelevant features that make ML expensive
have each featurizer return a vector
allow the recognized type (structure, composition, etc.) as input.
-
property
heavy
¶ A more expensive and complete (though not necessarily better) version of express.
Similar to express, all featurizers selected should return useful learning features. However, the selected featurizers may now:
generate many (thousands+) features
be expensive to featurize (1s+ per item)
be prone to NaNs on certain datasets
-
class
automatminer.featurization.sets.
DOSFeaturizers
(exclude=None)¶ Bases:
automatminer.featurization.base.FeaturizerSet
Featurizer set containing density of states featurizers.
See the FeaturizerSet documentation for a description of each property (sublist of featurizers).
Example usage:
dos_featurizers = DOSFeaturizers().express
Density of states featurizers should work on the entire density of states if they are in express or heavy. If they are in “all” they may work on sites or return matrices.
- Parameters
exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.
-
property
all
¶ List of all density of states based featurizers.
-
property
debug
¶ An ultra-minimal set of featurizers for debugging.
-
property
express
¶ A focused set of featurizers which should:
be reasonably fast to featurize
not be prone to errors/NaNs
provide informative learning features
not include many irrelevant features that make ML expensive
have each featurizer return a vector
allow the recognized type (structure, composition, etc.) as input.
-
property
heavy
¶ A more expensive and complete (though not necessarily better) version of express.
Similar to express, all featurizers selected should return useful learning features. However, the selected featurizers may now:
generate many (thousands+) features
be expensive to featurize (1s+ per item)
be prone to NaNs on certain datasets
-
class
automatminer.featurization.sets.
StructureFeaturizers
(exclude=None)¶ Bases:
automatminer.featurization.base.FeaturizerSet
Featurizer set containing structure featurizers.
See the FeaturizerSet documentation for a description of each property (sublist of featurizers).
Example usage:
best_featurizers = StructureFeaturizers().express
- Parameters
exclude (list of str, optional) – A list of featurizer class names that will be excluded from the set of featurizers returned.
-
property
all
¶ All featurizers available for this featurization type. These featurizers are allowed to:
have multiple, highly similar versions of the same featurizer,
not work on standard versions of the input types (e.g., SiteDOS works on the DOS for a single site, not a structure),
return non-vectorized outputs (e.g., matrices, other data types).
-
property
debug
¶ An ultra-minimal set of featurizers for debugging.
-
property
express
¶ A focused set of featurizers which should:
be reasonably fast to featurize
not be prone to errors/NaNs
provide informative learning features
not include many irrelevant features that make ML expensive
have each featurizer return a vector
allow the recognized type (structure, composition, etc.) as input.
-
property
heavy
¶ A more expensive and complete (though not necessarily better) version of express.
Similar to express, all featurizers selected should return useful learning features. However, the selected featurizers may now:
generate many (thousands+) features
be expensive to featurize (1s+ per item)
be prone to NaNs on certain datasets
-
property
need_fit
¶