matminer.featurizers package

Subpackages

Submodules

matminer.featurizers.bandstructure module

class matminer.featurizers.bandstructure.BandFeaturizer(kpoints=None, find_method='nearest', nbands=2)

Bases: BaseFeaturizer

Featurizes a pymatgen band structure object.

Args:
kpoints ([1x3 numpy array]): list of fractional coordinates of k-points at which energy is extracted.

find_method (str): the method for finding or interpolating the energy at the given kpoints. It has no effect if kpoints is None. Options are:

‘nearest’: the energy of the nearest available k-point to the input k-point is returned.

‘linear’: the result of linear interpolation is returned; see the documentation for scipy.interpolate.griddata.

nbands (int): the number of valence/conduction bands to be featurized

__init__(kpoints=None, find_method='nearest', nbands=2)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(bs)
Args:
bs (pymatgen BandStructure or BandStructureSymmLine or their dict): The band structure to featurize. To obtain all features, bs should include the structure attribute.

Returns:
([float]): a list of band structure features. If bs.structure is not available, features that require the structure are returned as NaN.

List of currently supported features:

band_gap (eV): the difference between the CBM and VBM energies

is_gap_direct (0.0|1.0): whether the band gap is direct or not

direct_gap (eV): the minimum direct distance between the last valence band and the first conduction band

p_ex1_norm (float): k-space distance between the Gamma point and the k-point of the VBM

n_ex1_norm (float): k-space distance between the Gamma point and the k-point of the CBM

p_ex1_degen: degeneracy of the VBM

n_ex1_degen: degeneracy of the CBM

If kpoints is provided (e.g. for kpoints == [[0.0, 0.0, 0.0]]):

n_0.0;0.0;0.0_en: (energy of the first conduction band at [0.0, 0.0, 0.0] - CBM energy)

p_0.0;0.0;0.0_en: (energy of the last valence band at [0.0, 0.0, 0.0] - VBM energy)
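The gap-related definitions above can be illustrated with a small worked example. This is only a sketch on made-up band energies, not the BandFeaturizer implementation: the lists, the helper kpoint_label, and the assumption that both bands are sampled on the same k-points are all hypothetical.

```python
# Toy band energies (eV) sampled on the same three k-points; values are made up.
kpoints = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.5, 0.0)]
last_valence = [-0.2, 0.0, -0.5]     # highest valence band at each k-point
first_conduction = [1.5, 1.8, 1.1]   # lowest conduction band at each k-point

vbm = max(last_valence)              # valence band maximum
cbm = min(first_conduction)          # conduction band minimum

# band_gap: difference between the CBM and VBM energies.
band_gap = cbm - vbm

# direct_gap: smallest conduction-valence difference taken at the same k-point.
direct_gap = min(c - v for c, v in zip(first_conduction, last_valence))

# is_gap_direct: 1.0 if the VBM and CBM sit at the same k-point, else 0.0.
is_gap_direct = float(last_valence.index(vbm) == first_conduction.index(cbm))


def kpoint_label(prefix, kpoint):
    """Hypothetical helper that builds labels such as 'n_0.0;0.0;0.0_en'."""
    coords = ";".join(str(float(c)) for c in kpoint)
    return f"{prefix}_{coords}_en"


print(band_gap, direct_gap, is_gap_direct)
print(kpoint_label("n", kpoints[0]), kpoint_label("p", kpoints[0]))
```

In this toy case the VBM and CBM fall on different k-points, so the gap is indirect and direct_gap is larger than band_gap.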

static get_bindex_bspin(extremum, is_cbm)

Returns the band index and spin of band extremum

Args:
extremum (dict): dictionary containing the CBM/VBM, i.e. output of BandStructure.get_cbm()

is_cbm (bool): whether the extremum is the CBM or not

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.bandstructure.BranchPointEnergy(n_vb=1, n_cb=1, calculate_band_edges=True, atol=1e-05)

Bases: BaseFeaturizer

Branch point energy and absolute band edge position.

Calculates the branch point energy and (optionally) an absolute band edge position assuming the branch point energy is the center of the gap

Args:

n_vb (int): number of valence bands to include in the BPE calculation

n_cb (int): number of conduction bands to include in the BPE calculation

calculate_band_edges (bool): whether to also return band edge positions

atol (float): absolute tolerance when finding equivalent fractional k-points in the irreducible Brillouin zone (IBZ) when weights is None

__init__(n_vb=1, n_cb=1, calculate_band_edges=True, atol=1e-05)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()
Returns ([str]): absolute energy levels as provided in the input BandStructure. “absolute” means no reference energy is subtracted from branch_point_energy, vbm or cbm.

featurize(bs, target_gap=None, weights=None)
Args:

bs (BandStructure): Uniform (not symm line) band structure

target_gap (float): if set, the band gap is scissored to match this number

weights ([float]): if set, its length has to be equal to the number of k-points in bs.kpoints; it explicitly determines the k-point weights when averaging

Returns:

(int) branch point energy on same energy scale as BS eigenvalues

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.base module

class matminer.featurizers.base.BaseFeaturizer

Bases: BaseEstimator, TransformerMixin, ABC

Abstract class to calculate features from raw materials input data such as a compound formula, a pymatgen crystal structure, or a band structure object.

## Using a BaseFeaturizer Class

There are multiple ways for running the featurize routines:

featurize: Featurize a single entry

featurize_many: Featurize a list of entries

featurize_dataframe: Compute features for many entries, store results as columns in a dataframe

Some featurizers require first calling the fit method before the featurization methods can function. Generally, you pass the dataset to fit to determine which features a featurizer should compute. For example, a featurizer that returns the partial radial distribution function may need to know which elements are present in a dataset.

You can also use the precheck and precheck_dataframe methods to ensure a featurizer is in scope for a given sample (or dataset) before featurizing.

You can also employ the featurizer as part of a ScikitLearn Pipeline object. For these cases, ScikitLearn calls the transform function of the BaseFeaturizer which is a less-featured wrapper of featurize_many. You would then provide your input data as an array to the Pipeline, which would output the features as an array.

Beyond the featurizing capability, BaseFeaturizer also includes methods for retrieving proper references for a featurizer. The citations function returns a list of papers that should be cited. The implementors function returns a list of people who wrote the featurizer, so that you know who to contact with questions.

## Implementing a New BaseFeaturizer Class

These operations must be implemented for each new featurizer:
featurize - Takes a single material as input, returns the features of that material.

feature_labels - Generates a human-meaningful name for each of the features.

citations - Returns a list of citations in BibTeX format

implementors - Returns a list of people who contributed to writing a paper.

None of these operations should change the state of the featurizer. I.e., running each method twice should not produce different results, no class attributes should be changed, and running one operation should not affect the output of another.

All options of the featurizer must be set by the __init__ function. All options must be listed as keyword arguments with default values, and the value must be saved as a class attribute with the same name (e.g., argument n should be stored in self.n). These requirements are necessary for compatibility with the get_params and set_params methods of BaseEstimator, which enable easy interoperability with ScikitLearn.
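To see why the attribute names must match the __init__ arguments, the following is a minimal sketch of the kind of introspection that get_params relies on: it reads the constructor signature and looks up an attribute of the same name. ToyFeaturizer, get_params_sketch, and clone_sketch are hypothetical names for illustration and are not matminer or scikit-learn code.

```python
import inspect


class ToyFeaturizer:
    """Hypothetical featurizer-like class; not part of matminer."""

    def __init__(self, n=3, cutoff=5.0):
        # Each option is stored under exactly the same name as the argument.
        self.n = n
        self.cutoff = cutoff


def get_params_sketch(obj):
    """Collect constructor options by reading the __init__ signature."""
    signature = inspect.signature(type(obj).__init__)
    names = [name for name in signature.parameters if name != "self"]
    # This lookup fails if an option was stored under a different name.
    return {name: getattr(obj, name) for name in names}


def clone_sketch(obj):
    """Rebuild an equivalent object from its recovered options."""
    return type(obj)(**get_params_sketch(obj))


original = ToyFeaturizer(n=7)
print(get_params_sketch(original))
copy = clone_sketch(original)
print(copy.n, copy.cutoff)
```

If an option were stored as, say, self._n instead of self.n, the getattr lookup would raise AttributeError, which is the failure mode the naming rule avoids.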

Depending on the complexity of your featurizer, it may be worthwhile to implement a from_preset class method. The from_preset method takes the name of a preset and returns an instance of the featurizer with some hard-coded set of inputs. The from_preset option is particularly useful for defining the settings used by papers in the literature.

Optionally, you can implement the fit operation if there are attributes of your featurizer that must be set for the featurizer to work. Any variables that are set by fitting should be stored as class attributes that end with an underscore. (This follows the pattern used by ScikitLearn).
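As a rough illustration of the fit convention, the sketch below learns which elements occur in a dataset and stores them in an attribute ending with an underscore. ElementCounter and its dict-based inputs are hypothetical stand-ins; a real featurizer would subclass BaseFeaturizer and typically work on pymatgen objects.

```python
class ElementCounter:
    """Hypothetical featurizer-like class used only for illustration."""

    def __init__(self, sort_elements=True):
        # Options come from __init__ and keep the argument's name.
        self.sort_elements = sort_elements

    def fit(self, X, y=None):
        # State learned from data is stored with a trailing underscore.
        elements = {el for entry in X for el in entry}
        self.elements_ = sorted(elements) if self.sort_elements else list(elements)
        return self

    def feature_labels(self):
        return [f"count of {el}" for el in self.elements_]

    def featurize(self, entry):
        # One feature per element seen during fit; absent elements count as 0.
        return [entry.get(el, 0) for el in self.elements_]


# Entries are plain dicts mapping element symbol to amount (an assumption).
training = [{"Fe": 2, "O": 3}, {"Na": 1, "Cl": 1}]
counter = ElementCounter().fit(training)
print(counter.feature_labels())
print(counter.featurize({"Fe": 1}))
```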

Another option to consider is whether it is worth making any utility operations for your featurizer. featurize must return a list of features, but this may not be the most natural representation for your features (e.g., a dict could be better). Making a separate function for computing features in this natural representation and having the featurize function call this method and then convert the data into a list is a recommended approach. Users who want to compute the representation in the natural form can use the utility function and users who want the data in a ML-ready format (list) can call featurize. See PartialRadialDistributionFunction for an example of this concept.
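The utility-function pattern described above might look like the following sketch, where a dict is the natural representation and featurize only flattens it into a list. FractionFeaturizer and compute_fractions are hypothetical names chosen for this example and do not exist in matminer.

```python
class FractionFeaturizer:
    """Hypothetical example class; not part of matminer."""

    def __init__(self, elements=("Fe", "O")):
        self.elements = elements

    def compute_fractions(self, amounts):
        """Utility method: return the natural representation (a dict)."""
        total = sum(amounts.values())
        return {el: amounts.get(el, 0) / total for el in self.elements}

    def feature_labels(self):
        return [f"{el} fraction" for el in self.elements]

    def featurize(self, amounts):
        """ML-ready form: the same values as a list, ordered like the labels."""
        fractions = self.compute_fractions(amounts)
        return [fractions[el] for el in self.elements]


featurizer = FractionFeaturizer()
print(featurizer.compute_fractions({"Fe": 2, "O": 3}))
print(featurizer.featurize({"Fe": 2, "O": 3}))
```

Keeping the list order tied to self.elements in both feature_labels and featurize is what keeps labels and values aligned.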

An additional factor to consider is the chunksize for data parallelisation. For lightweight computational tasks, the overhead associated with passing data from multiprocessing.Pool.map() to the function being parallelized can increase the time taken for all tasks to be completed. By setting the self._chunksize argument, the overhead associated with passing data to the tasks can be reduced. Note that there is only an advantage to using chunksize when the time taken to pass the data from map to the function call is within several orders of magnitude to that of the function call itself. By default, we allow the Python multiprocessing library to determine the chunk size automatically based on the size of the list being featurized. You may want to specify a small chunk size for computationally-expensive featurizers, which will enable better distribution of tasks across threads. In contrast, for more lightweight featurizers, it is recommended that the implementor trial a range of chunksize values to find the optimum. As a general rule of thumb, if the featurize function takes 0.1 seconds or less, a chunksize of around 30 will perform best.
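To make the role of chunksize concrete, the sketch below shows how a list of entries is grouped into chunks, each of which would be handed to a worker as one task. It only illustrates the grouping idea; split_into_chunks is a hypothetical helper, and the exact chunking done by the multiprocessing library may differ in detail.

```python
def split_into_chunks(entries, chunksize):
    """Group entries into consecutive chunks of at most chunksize items."""
    return [entries[i:i + chunksize] for i in range(0, len(entries), chunksize)]


entries = list(range(100))

# With chunksize=1, every entry is a separate task: 100 hand-offs to workers.
single = split_into_chunks(entries, 1)

# With chunksize=30, the same work needs only 4 hand-offs (30, 30, 30, 10).
grouped = split_into_chunks(entries, 30)

print(len(single), len(grouped), [len(chunk) for chunk in grouped])
```

Fewer, larger chunks reduce the per-task overhead for cheap featurizers, while smaller chunks spread expensive tasks more evenly across workers.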

## Documenting a BaseFeaturizer

The class documentation for each featurizer must contain a description of the options and the features that will be computed. The options of the class must all be defined in the __init__ function of the class, and we recommend documenting them using the [Google style](https://google.github.io/styleguide/pyguide.html).

For auto-generated documentation purposes, the first line of the featurizer doc should come under the class declaration (not under __init__) and should be a one line summary of the featurizer.

We recommend starting the class documentation with a high-level overview of the features. For example, mention what kind of characteristics of the material they describe and refer the reader to a paper that describes these features well (use a hyperlink if possible, so that the readthedocs will link to that paper). Then, describe each of the individual features in a block named “Features”. It is necessary here to give the user enough information to map a feature name to what it means. The objective in this part is to allow people to understand what each column of their dataframe is without having to read the Python code. You do not need to explain all of the math/algorithms behind each feature in enough detail to reproduce it, only enough to give an idea of what it is.

property chunksize
abstract citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

abstract feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

abstract featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

featurize_dataframe(df, col_id, ignore_errors=False, return_errors=False, inplace=False, multiindex=False, pbar=True)

Compute features for all entries contained in input dataframe.

Args:

df (Pandas dataframe): Dataframe containing input data.

col_id (str or list of str): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.

ignore_errors (bool): Returns NaN for dataframe rows where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): Returns the errors encountered for each row in a separate XFeaturizer errors column if True. Requires ignore_errors to be True.

inplace (bool): If True, adds columns to the original object in memory and returns None. Else, returns the updated object. Should be identical to pandas inplace behavior.

multiindex (bool): If True, use a Featurizer - Feature 2-level index using the MultiIndex capabilities of pandas. If done inplace, multiindex featurization will overwrite the original dataframe’s column index.

pbar (bool): Shows a progress bar if True.

Returns:

updated dataframe.

featurize_many(entries, ignore_errors=False, return_errors=False, pbar=True)

Featurize a list of entries.

If featurize takes multiple inputs, supply inputs as a list of tuples.

Featurize_many supports entries as a list, tuple, numpy array, Pandas Series, or Pandas DataFrame.

Args:

entries (list-like object): A list of entries to be featurized.

ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.

pbar (bool): Show a progress bar for featurization if True.

Returns:

(list) features for each entry.

featurize_wrapper(x, return_errors=False, ignore_errors=False)

An exception wrapper for featurize, used in featurize_many and featurize_dataframe. featurize_wrapper changes the behavior of featurize when ignore_errors is True in featurize_many/dataframe.

Args:

x: input data to featurize (type depends on featurizer).

ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.

Returns:

(list) one or more features.
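The behavior described for ignore_errors and return_errors can be sketched as a small exception wrapper. This is an illustration under assumptions, not the matminer implementation: featurize_wrapper_sketch, its n_features argument, and the inverse toy featurizer are hypothetical.

```python
import math
import traceback


def featurize_wrapper_sketch(featurize, n_features, x,
                             return_errors=False, ignore_errors=False):
    """Call featurize(x), optionally turning exceptions into NaN features."""
    try:
        features = list(featurize(x))
        if return_errors:
            # Successful entries get NaN in the extra error slot.
            features.append(float("nan"))
        return features
    except Exception:
        if not ignore_errors:
            raise  # exceptions propagate as normal
        features = [float("nan")] * n_features
        if return_errors:
            # Failed entries carry the traceback string as an extra 'feature'.
            features.append(traceback.format_exc())
        return features


def inverse(x):
    """Toy featurizer that fails for x == 0."""
    return [1.0 / x]


ok = featurize_wrapper_sketch(inverse, 1, 2.0, return_errors=True, ignore_errors=True)
failed = featurize_wrapper_sketch(inverse, 1, 0.0, return_errors=True, ignore_errors=True)
print(ok[0], math.isnan(ok[1]))
print(math.isnan(failed[0]), "ZeroDivisionError" in failed[1])
```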

fit(X, y=None, **fit_kwargs)

Update the parameters of this featurizer based on available data

Args:

X - [list of tuples], training data

Returns:

self

fit_featurize_dataframe(df, col_id, fit_args=None, *args, **kwargs)

The dataframe equivalent of fit_transform. Takes a dataframe and column id as input, fits the featurizer to that dataframe, and returns a featurized dataframe. Accepts the same arguments as featurize_dataframe.

Args:

df (Pandas dataframe): Dataframe containing input data.

col_id (str or list of str): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.

fit_args (list): list of arguments for fit function.

Returns:

updated dataframe based on featurizer fitted to that dataframe.

abstract implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

property n_jobs
precheck(x: Any) -> bool

Precheck (provide an estimate of whether a featurizer will work or not) for a single entry (e.g., a single composition). If the entry fails the precheck, it will most likely fail featurization; if it passes, it is likely (but not guaranteed) to featurize correctly.

Prechecks should be:
  • accurate (but can be good estimates rather than ground truth)

  • fast to evaluate

  • unlikely to be obsolete via changes in the featurizer in the near

    future

This method should be overridden by any featurizer requiring its use, as by default all entries will pass prechecking. Also, precheck is a good opportunity to throw warnings about long runtimes (e.g., doing nearest neighbors computations on a structure with many thousand sites).

See the documentation for precheck_dataframe for more information.

Args:
*x (Composition, Structure, etc.): Input to-be-featurized. Can be

a single input or multiple inputs.

Returns:

(bool): True, if passes the precheck. False, if fails.
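A precheck override might look like the sketch below, which rejects very large inputs and warns about inputs that are likely to be slow. ManySitesFeaturizer, its thresholds, and the use of a plain list as a stand-in for a structure are hypothetical and only illustrate the pattern.

```python
import warnings


class ManySitesFeaturizer:
    """Hypothetical featurizer-like class; not part of matminer."""

    def __init__(self, max_sites=50, warn_sites=20):
        self.max_sites = max_sites
        self.warn_sites = warn_sites

    def precheck(self, sites):
        """Fast estimate of whether featurization is likely to work."""
        n_sites = len(sites)
        if n_sites > self.max_sites:
            return False  # out of scope: too large for this featurizer
        if n_sites > self.warn_sites:
            # In scope, but a good moment to warn about long runtimes.
            warnings.warn(f"{n_sites} sites may take a long time to featurize")
        return True


checker = ManySitesFeaturizer()
print(checker.precheck([0] * 10))
print(checker.precheck([0] * 100))
```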

precheck_dataframe(df, col_id, return_frac=True, inplace=False) -> float | DataFrame

Precheck an entire dataframe. Subclasses wanting to use precheck functionality should not override this method, they should override precheck (unless the entire df determines whether single entries pass or fail a precheck).

Prechecking should be a quick and useful way to check that for a particular dataframe (set of featurizer inputs), the featurizer is:

  1. in scope, and/or…

  2. robust to errors and/or…

  3. any other reason you would not practically want to use this featurizer on this dataframe.

By prechecking before featurizing, you can avoid applying featurizers to data that will ultimately fail, return unreliable numbers, or are out of scope. Prechecking is also a good time to throw/observe warnings (such as long runtime warnings!).

Args:

df (pd.DataFrame): A dataframe.

col_id (str or [str]): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.

return_frac (bool): If True, returns the fraction of entries passing the precheck (e.g., 0.5). Else, returns a dataframe.

inplace (bool): Only relevant if return_frac=False. If inplace=True, the input dataframe is modified in memory with a boolean column for precheck. Otherwise, a new df with this column is returned.

Returns:
(float, pd.DataFrame): If return_frac=True, returns the fraction of entries passing the precheck. Else, returns the dataframe with an extra boolean column added for the precheck.
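The two return modes can be sketched without pandas by treating rows as dicts. This is a simplified illustration: precheck_rows_sketch, the "precheck_pass" column name, and the toy precheck function are hypothetical and do not reflect the names matminer actually uses.

```python
def precheck_rows_sketch(rows, col_id, precheck, return_frac=True):
    """Apply precheck to one column of dict-based rows."""
    results = [bool(precheck(row[col_id])) for row in rows]
    if return_frac:
        # Fraction of entries that pass, e.g. 0.5.
        return sum(results) / len(results)
    # Otherwise return new rows with an extra boolean column (name is made up).
    return [dict(row, precheck_pass=passed) for row, passed in zip(rows, results)]


def short_formula(formula):
    """Toy precheck: only formulas of at most four characters are in scope."""
    return len(formula) <= 4


rows = [{"formula": "NaCl"}, {"formula": "Fe2O3"}, {"formula": "MgO"}, {"formula": "LiFePO4"}]
fraction = precheck_rows_sketch(rows, "formula", short_formula)
flagged = precheck_rows_sketch(rows, "formula", short_formula, return_frac=False)
print(fraction)
print(flagged[1])
```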

set_chunksize(chunksize)

Set the chunksize used for Pool.map parallelisation.

set_n_jobs(n_jobs: int) -> None

Set the number of concurrent jobs to spawn during featurization.

Args:

n_jobs (int): Number of threads in multiprocessing pool.

Note: It seems multiprocessing can be the cause of out-of-memory (OOM) errors, especially when trying to featurize large structures on HPC nodes with strict memory limits. Using featurizer.set_n_jobs(1) has been known to help as a workaround.

transform(X)

Compute features for a list of inputs

class matminer.featurizers.base.MultipleFeaturizer(featurizers, iterate_over_entries=True)

Bases: BaseFeaturizer

Class to run multiple featurizers on the same input data.

All featurizers must take the same kind of data as input to the featurize function.

Args:

featurizers (list of BaseFeaturizer): A list of featurizers to run.

iterate_over_entries (bool): Whether to iterate over the entries or featurizers. Iterating over entries will enable increased caching but will only display a single progress bar for all featurizers. If set to False, iteration will be performed over featurizers, resulting in reduced caching but individual progress bars for each featurizer.
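The two iteration orders can be sketched with plain functions standing in for featurizers. Both orders produce the same feature rows; they differ only in which loop is outermost. The function names and toy featurizers below are hypothetical and are not the MultipleFeaturizer implementation.

```python
def featurize_by_entries(featurizers, entries):
    """Outer loop over entries: each entry visits every featurizer in turn."""
    return [[value for f in featurizers for value in f(entry)] for entry in entries]


def featurize_by_featurizers(featurizers, entries):
    """Outer loop over featurizers: each featurizer sweeps all entries."""
    rows = [[] for _ in entries]
    for f in featurizers:
        for row, entry in zip(rows, entries):
            row.extend(f(entry))
    return rows


def length_features(formula):
    return [len(formula)]


def oxygen_features(formula):
    return [formula.count("O"), float("O" in formula)]


entries = ["Fe2O3", "NaCl", "MgO"]
featurizers = [length_features, oxygen_features]
by_entries = featurize_by_entries(featurizers, entries)
by_featurizers = featurize_by_featurizers(featurizers, entries)
print(by_entries)
print(by_entries == by_featurizers)
```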

__init__(featurizers, iterate_over_entries=True)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

featurize_many(entries, ignore_errors=False, return_errors=False, pbar=True)

Featurize a list of entries.

If featurize takes multiple inputs, supply inputs as a list of tuples.

Featurize_many supports entries as a list, tuple, numpy array, Pandas Series, or Pandas DataFrame.

Args:

entries (list-like object): A list of entries to be featurized.

ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.

pbar (bool): Show a progress bar for featurization if True.

Returns:

(list) features for each entry.

featurize_wrapper(x, return_errors=False, ignore_errors=False)

An exception wrapper for featurize, used in featurize_many and featurize_dataframe. featurize_wrapper changes the behavior of featurize when ignore_errors is True in featurize_many/dataframe.

Args:

x: input data to featurize (type depends on featurizer).

ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.

Returns:

(list) one or more features.

fit(X, y=None, **fit_kwargs)

Update the parameters of this featurizer based on available data

Args:

X - [list of tuples], training data

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

set_n_jobs(n_jobs)

Set the number of concurrent jobs to spawn during featurization.

Args:

n_jobs (int): Number of threads in multiprocessing pool.

Note: It seems multiprocessing can be the cause of out-of-memory (OOM) errors, especially when trying to featurize large structures on HPC nodes with strict memory limits. Using featurizer.set_n_jobs(1) has been known to help as a workaround.

class matminer.featurizers.base.StackedFeaturizer(featurizer=None, model=None, name=None, class_names=None)

Bases: BaseFeaturizer

Use the output of a machine learning model as features

For regression models, we use the single output class.

For classification models, we use the probability for the first N-1 classes where N is the number of classes.
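Because class probabilities sum to one, the last probability carries no extra information, which is why only the first N-1 are kept. The sketch below illustrates this on a hand-written probability vector; stacked_features, stacked_labels, and the label format are hypothetical and may not match the names StackedFeaturizer generates.

```python
def stacked_features(probabilities):
    """Keep the probabilities of the first N-1 classes."""
    return list(probabilities[:-1])


def stacked_labels(name, class_names):
    """Hypothetical label format: one label per retained class."""
    return [f"{name} P({cls})" for cls in class_names[:-1]]


# Made-up output of a three-class classifier for a single entry.
probabilities = [0.2, 0.5, 0.3]
class_names = ["metal", "semiconductor", "insulator"]

features = stacked_features(probabilities)
labels = stacked_labels("gap model", class_names)
print(dict(zip(labels, features)))

# The dropped value can always be recovered from the retained ones.
recovered_last = 1.0 - sum(features)
print(round(recovered_last, 10))
```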

__init__(featurizer=None, model=None, name=None, class_names=None)

Initialize featurizer

Args:

featurizer (BaseFeaturizer): Featurizer used to generate inputs to the model

model (BaseEstimator): Fitted machine learning model to be evaluated

name (str): [Optional] name of model, used when creating feature names

class_names ([str]): Required for classification models, used when creating feature names (scikit-learn does not specify the number of classes for a classifier). Class names must be in the same order as the classes in the model (e.g., class_names[0] must be the name of the class 0)

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.conversions module

This module defines featurizers that can convert between different data formats

Note that these featurizers do not produce machine learning-ready features. Instead, they should be used to pre-process data, either through a standalone transformation or as part of a Pipeline.

class matminer.featurizers.conversions.ASEAtomstoStructure(target_col_id='PMG Structure from ASE Atoms', overwrite_data=False)

Bases: ConversionFeaturizer

Convert dataframes of ase structures to pymatgen structures for further use with matminer.

Args:

target_col_id (str): Column to place PMG structures.

overwrite_data (bool): If True, will overwrite target_col_id even if there is data currently in that column

__init__(target_col_id='PMG Structure from ASE Atoms', overwrite_data=False)
featurize(ase_atoms)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.CompositionToOxidComposition(target_col_id='composition_oxid', overwrite_data=False, coerce_mixed=True, return_original_on_error=False, **kwargs)

Bases: ConversionFeaturizer

Utility featurizer to add oxidation states to a pymatgen Composition.

Oxidation states are determined using pymatgen’s guessing routines. The expected input is a pymatgen.core.composition.Composition object.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it exists.

coerce_mixed (bool): If a composition has both species containing oxid states and not containing oxid states, strips all of the oxid states and guesses the entire composition’s oxid states.

return_original_on_error (bool): If True and the oxidation states cannot be guessed, the composition without oxidation states will be returned. If False, an error will be thrown.

**kwargs: Parameters to control the settings for pymatgen.io.structure.Structure.add_oxidation_state_by_guess().

__init__(target_col_id='composition_oxid', overwrite_data=False, coerce_mixed=True, return_original_on_error=False, **kwargs)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(comp)

Add oxidation states to a Composition using pymatgen’s guessing routines.

Args:

comp (pymatgen.core.composition.Composition): A composition.

Returns:
(pymatgen.core.composition.Composition): A Composition object

decorated with oxidation states.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.CompositionToStructureFromMP(target_col_id='structure', overwrite_data=False, mapi_key=None)

Bases: ConversionFeaturizer

Featurizer to get a Structure object from Materials Project using the composition alone. The most stable entry from Materials Project is selected, or NaN if no entry is found in the Materials Project.

Args:
target_col_id (str or None): The column in which the converted data will be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it exists.

mapi_key (str): Materials API key

__init__(target_col_id='structure', overwrite_data=False, mapi_key=None)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(comp)

Get the most stable structure from Materials Project.

Args:

comp (pymatgen.core.composition.Composition): A composition.

Returns:

(pymatgen.core.structure.Structure): A Structure object.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.ConversionFeaturizer(target_col_id, overwrite_data)

Bases: BaseFeaturizer

Abstract class to perform data conversions.

Featurizers subclassing this class do not produce machine learning-ready features but instead are used to pre-process data. As Featurizers, the conversion process can take advantage of the parallelisation implemented in ScikitLearn.

Note that feature_labels are set dynamically and may depend on the column id of the data being featurized. As such, feature_labels may differ before and after featurization.

ConversionFeaturizers differ from other Featurizers in that the user can specify the column in which to write the converted data. The output column is controlled through target_col_id. ConversionFeaturizers also have the ability to overwrite data in existing columns. This is controlled by the overwrite_data option. “In place” conversion of data can be achieved by setting target_col_id=None and overwrite_data=True. See the docstring below for more details.

Args:
target_col_id (str or None): The column in which the converted data will be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it exists.
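The rules for choosing the output column can be sketched as a small function. This is an illustration of the documented rules only: resolve_target_col is a hypothetical helper, and raising ValueError is an assumption, since the text above says only that an error is thrown.

```python
def resolve_target_col(col_id, target_col_id, existing_columns, overwrite_data=False):
    """Return the column name that converted data would be written to."""
    if target_col_id is None:
        # "In place": write back to the column being featurized.
        target = col_id
    elif target_col_id.startswith("_"):
        # Leading underscore: derive the name from the input column.
        target = "{}_{}".format(col_id, target_col_id[1:])
    else:
        target = target_col_id
    if target in existing_columns and not overwrite_data:
        # The exception type here is an assumption made for this sketch.
        raise ValueError(f"column {target!r} exists and overwrite_data is False")
    return target


columns = ["structure_dict", "formula"]
print(resolve_target_col("structure_dict", "_object", columns))
print(resolve_target_col("formula", "composition", columns))
print(resolve_target_col("formula", None, columns, overwrite_data=True))
```

With the '_object' default shown for DictToObject, an input column named structure_dict would therefore map to structure_dict_object under these rules.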

__init__(target_col_id, overwrite_data)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

featurize_dataframe(df, col_id, **kwargs)

Perform the data conversion and set the target column dynamically.

target_col_id, and accordingly feature_labels, may depend on the column id of the data being featurized. As such, target_col_id is first set dynamically before the BaseFeaturizer.featurize_dataframe() super method is called.

Args:

df (Pandas.DataFrame): Dataframe containing input data.

col_id (str or list of str): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.

**kwargs: Additional keyword arguments that will be passed through

to BaseFeaturizer.featurize_dataframe().

Returns:

(Pandas.Dataframe): The updated dataframe.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.DictToObject(target_col_id='_object', overwrite_data=False)

Bases: ConversionFeaturizer

Utility featurizer to decode a dict to Python object via MSON.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it

exists.

__init__(target_col_id='_object', overwrite_data=False)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(dict_data)

Convert a dict to a Python object via MSON.

Args:
dict_data (dict): A MSONable dictionary. E.g. Produced from

pymatgen.core.structure.Structure.as_dict().

Returns:

(object): An object with the type specified by dict_data.
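
As a rough illustration of what "decode a dict via MSON" means: MSON-style dicts carry "@module" and "@class" keys that identify the class to rebuild. The sketch below imitates that dispatch with a lookup table. The Point class, the "example" module name, and the REGISTRY table are hypothetical; a real decoder imports the named module dynamically instead of using a registry.

```python
class Point:
    """Hypothetical MSON-style class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def as_dict(self):
        return {"@module": "example", "@class": "Point", "x": self.x, "y": self.y}

    @classmethod
    def from_dict(cls, d):
        return cls(d["x"], d["y"])


# Assumed lookup table standing in for a dynamic import of "@module".
REGISTRY = {("example", "Point"): Point}


def decode(d):
    """Rebuild an object from a dict carrying "@module" and "@class" keys."""
    cls = REGISTRY[(d["@module"], d["@class"])]
    return cls.from_dict(d)


restored = decode(Point(1.0, 2.0).as_dict())
print(type(restored).__name__, restored.x, restored.y)
```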

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.JsonToObject(target_col_id='_object', overwrite_data=False)

Bases: ConversionFeaturizer

Utility featurizer to decode json data to a Python object via MSON.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it

exists.

__init__(target_col_id='_object', overwrite_data=False)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(json_data)

Convert JSON data to a Python object via MSON.

Args:
json_data (str): MSONable json data. E.g. Produced from

pymatgen.core.structure.Structure.to_json().

Returns:

(object): An object with the type specified by json_data.
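
The JSON case differs from the dict case only in that the input is a string that must be parsed first. A minimal sketch using the standard library's json.loads with an object_hook is shown below. The Point class and the "example" module name are hypothetical, and the hook is a stand-in for a full MSON decoder.

```python
import json


class Point:
    """Hypothetical MSON-style class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    @classmethod
    def from_dict(cls, d):
        return cls(d["x"], d["y"])


def mson_like_hook(d):
    """Called by json.loads for every decoded JSON object, innermost first."""
    if d.get("@class") == "Point":
        return Point.from_dict(d)
    return d  # Leave ordinary dicts untouched.


json_data = json.dumps(
    {"points": [
        {"@module": "example", "@class": "Point", "x": 0.0, "y": 0.5},
        {"@module": "example", "@class": "Point", "x": 1.0, "y": 2.0},
    ]}
)

decoded = json.loads(json_data, object_hook=mson_like_hook)
print([(p.x, p.y) for p in decoded["points"]])
```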

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.PymatgenFunctionApplicator(func, func_args=None, func_kwargs=None, target_col_id=None, overwrite_data=False)

Bases: ConversionFeaturizer

Featurizer to run any function on pymatgen primitives.

For example, apply

lambda structure: structure.composition.anonymized_formula

to all rows in a dataframe and return the results in the specified column.

Args:

func (function): Function object or lambda to pass the pmg primitive objects to.

func_args (list): List of args to pass along with the pmg object to func.

func_kwargs (dict): Dict of kwargs to pass along with the pmg object to func.

target_col_id (str): Output column for the results. If not provided, the func name

will be used.

overwrite_data (bool): If True, will overwrite target_col_id even if there is

data currently in that column
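
The way func, func_args and func_kwargs fit together can be sketched without pymatgen. The helper apply_function and the example function scaled_length are hypothetical, and plain strings stand in for pymatgen objects. Using func.__name__ as the default column name is an assumption based on the note above that "the func name will be used".

```python
def apply_function(objects, func, func_args=None, func_kwargs=None, target_col_id=None):
    """Apply func to each object and return (column_name, results)."""
    func_args = func_args or []
    func_kwargs = func_kwargs or {}
    # Assumed default: fall back to the function's name for the output column.
    column = target_col_id or func.__name__
    # Each object is passed first, followed by the extra args and kwargs.
    results = [func(obj, *func_args, **func_kwargs) for obj in objects]
    return column, results


def scaled_length(obj, factor, offset=0):
    """Toy function standing in for a method on a pymatgen object."""
    return len(obj) * factor + offset


column, values = apply_function(
    ["Fe2O3", "NaCl"], scaled_length, func_args=[2], func_kwargs={"offset": 1}
)
print(column, values)
```

In Python a lambda's __name__ is "<lambda>", so if the default really is derived from the function name, passing target_col_id explicitly is likely the safer choice when using lambdas.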

__init__(func, func_args=None, func_kwargs=None, target_col_id=None, overwrite_data=False)
featurize(obj)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.StrToComposition(reduce=False, target_col_id='composition', overwrite_data=False)

Bases: ConversionFeaturizer

Utility featurizer to convert a string to a Composition

The expected input is a composition in string form (e.g. “Fe2O3”).

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
reduce (bool): Whether to return a reduced

pymatgen.core.composition.Composition object.

target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it

exists.

__init__(reduce=False, target_col_id='composition', overwrite_data=False)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(string_composition)

Convert a string to a pymatgen Composition.

Args:
string_composition (str): A chemical formula as a string (e.g.

“Fe2O3”).

Returns:

(pymatgen.core.composition.Composition): A composition object.
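
To make the string-to-composition step and the reduce option concrete, here is a deliberately simplified, standard-library-only sketch. It is not pymatgen's parser: parse_formula is a hypothetical helper that handles only integer amounts and no parentheses, and it assumes that "reduced" means dividing all amounts by their greatest common divisor.

```python
import re
from functools import reduce
from math import gcd


def parse_formula(formula, reduce_formula=False):
    """Parse a simple formula such as "Fe2O3" into {element: amount}.

    Illustrative only: integer amounts, no parentheses, no charges.
    """
    counts = {}
    for element, amount in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(amount) if amount else 1)
    if reduce_formula and counts:
        # Divide every amount by the greatest common divisor.
        divisor = reduce(gcd, counts.values())
        counts = {el: n // divisor for el, n in counts.items()}
    return counts


print(parse_formula("Fe2O3"))
print(parse_formula("Fe4O6", reduce_formula=True))
```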

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.StructureToComposition(reduce=False, target_col_id='composition', overwrite_data=False)

Bases: ConversionFeaturizer

Utility featurizer to convert a Structure to a Composition.

The expected input is a pymatgen.core.structure.Structure object.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:

reduce (bool): Whether to return a reduced Composition object.

target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it

exists.

__init__(reduce=False, target_col_id='composition', overwrite_data=False)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(structure)

Convert a pymatgen Structure to a pymatgen Composition.

Args:

structure (pymatgen.core.structure.Structure): A structure.

Returns:

(pymatgen.core.composition.Composition): A Composition object.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.StructureToIStructure(target_col_id='istructure', overwrite_data=False)

Bases: ConversionFeaturizer

Utility featurizer to convert a Structure to an immutable IStructure.

This is useful if you are using features that employ caching.

The expected input is a pymatgen.core.structure.Structure object.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it

exists.

__init__(target_col_id='istructure', overwrite_data=False)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(structure)

Convert a pymatgen Structure to an immutable IStructure.

Args:

structure (pymatgen.core.structure.Structure): A structure.

Returns:
(pymatgen.core.structure.IStructure): An immutable IStructure

object.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.StructureToOxidStructure(target_col_id='structure_oxid', overwrite_data=False, return_original_on_error=False, **kwargs)

Bases: ConversionFeaturizer

Utility featurizer to add oxidation states to a pymatgen Structure.

Oxidation states are determined using pymatgen’s guessing routines. The expected input is a pymatgen.core.structure.Structure object.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it

exists.

return_original_on_error (bool): Controls what happens if the oxidation

states cannot be guessed. If set to True, the structure without oxidation states will be returned. If set to False, an error will be thrown.

**kwargs: Parameters to control the settings for

pymatgen.core.structure.Structure.add_oxidation_state_by_guess().

__init__(target_col_id='structure_oxid', overwrite_data=False, return_original_on_error=False, **kwargs)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(structure)

Add oxidation states to a Structure using pymatgen’s guessing routines.

Args:

structure (pymatgen.core.structure.Structure): A structure.

Returns:
(pymatgen.core.structure.Structure): A Structure object decorated

with oxidation states.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.dos module

class matminer.featurizers.dos.DOSFeaturizer(contributors=1, decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)

Bases: BaseFeaturizer

Significant character and contribution of the density of states from a CompleteDos object. Contributors are the atomic orbitals from each site within the structure. This underlines the importance of dos.structure.

Args:
contributors (int):

Sets the number of top contributors to the DOS that are returned as features. (i.e. contributors=1 will only return the main cb and main vb orbital)

decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

Returns (featurize returns [float] and feature_labels returns [str]):

xbm_score_i (float): fractions of ith contributor orbital

xbm_location_i (str): fractional coordinate of ith contributor/site

xbm_character_i (str): character of ith contributor (s, p, d, f)

xbm_specie_i (str): elemental specie of ith contributor (ex: ‘Ti’)

xbm_hybridization (int): the amount of hybridization at the band edge

characterized by an entropy score (x ln x). The hybridization score is larger for a greater number of significant contributors
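
The "entropy score (x ln x)" can be illustrated with a short sketch. The sign convention, the use of the natural logarithm, and the function name entropy_score are assumptions here; the snippet only shows why such a score grows with the number of significant contributors.

```python
import math


def entropy_score(fractions):
    """Shannon-type score -sum(x ln x) over nonzero fractional contributions."""
    return -sum(x * math.log(x) for x in fractions if x > 0)


# One dominant contributor versus several equal contributors.
print(entropy_score([1.0]))
print(entropy_score([0.5, 0.5]))
print(entropy_score([0.25, 0.25, 0.25, 0.25]))
```

A single contributor gives a score of zero, and the score increases as the weight is spread over more orbitals.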

__init__(contributors=1, decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()
Returns ([str]): list of names of the features. See the docs for the

featurize method for more information.

featurize(dos)
Args:
dos (pymatgen CompleteDos or their dict):

The density of states to featurize. Must be a complete DOS (i.e. contains PDOS and structure, in addition to total DOS).

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.dos.DopingFermi(dopings=None, eref='midgap', T=300, return_eref=False)

Bases: BaseFeaturizer

The Fermi level (w.r.t. the selected reference energy) associated with a specified carrier concentration (1/cm^3) and temperature. This featurizer requires the total density of states and the structure. The structure, available as dos.structure (e.g. in CompleteDos), is required by the FermiDos class.

Args:
dopings ([float]): list of doping concentrations 1/cm3. Note that a

negative concentration is treated as electron majority carrier (n-type) and positive for holes (p-type)

eref (str or int or float): energy alignment reference. Defaults

to midgap (equilibrium fermi). A fixed number can also be used. str options: “midgap”, “vbm”, “cbm”, “dos_fermi”, “band_center”

T (float): absolute temperature in Kelvin

return_eref (bool): if True, instead of aligning the fermi levels based

on eref, it (eref) will be explicitly returned as a feature

Returns (featurize returns [float] and feature_labels returns [str]):
examples:
fermi_c-1e+20T300 (float): the fermi level for the electron

concentration of 1e20 and the temperature of 300K.

fermi_c1e+18T600 (float): fermi level for the hole concentration

of 1e18 and the temperature of 600K.

midgap eref (float): if return_eref==True then eref (midgap here)

energy is returned. In this case, fermi levels are absolute as opposed to relative to eref (i.e. if not return_eref)
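
The example labels above follow a simple pattern: the signed concentration after "c" and the temperature after "T". The sketch below reproduces the two documented labels with plain string formatting. fermi_label and carrier_type are hypothetical helpers, and the library's own formatting code may differ.

```python
def fermi_label(doping, temperature):
    """Build a label such as "fermi_c-1e+20T300" from a doping level and temperature."""
    return "fermi_c{}T{}".format(float(doping), temperature)


def carrier_type(doping):
    """Negative concentrations denote electrons (n-type), positive ones holes (p-type)."""
    return "n-type" if doping < 0 else "p-type"


for doping, temperature in [(-1e20, 300), (1e18, 600)]:
    print(fermi_label(doping, temperature), carrier_type(doping))
```

In this sketch the temperature is inserted as given, so passing 300.0 instead of 300 would produce a different label string.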

__init__(dopings=None, eref='midgap', T=300, return_eref=False)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()
Returns ([str]): list of names of the features generated by featurize

example: “fermi_c-1e+20T300” that is the fermi level for the electron concentration of 1e20 (c-1e+20) and temperature of 300K.

featurize(dos, bandgap=None)
Args:

dos (pymatgen Dos, CompleteDos or FermiDos): the density of states to featurize.

bandgap (float): for example the experimentally measured band gap

or one that is calculated via more accurate methods than the one used to generate dos. dos will be scissored to have the same electronic band gap as bandgap.

Returns ([float]): features are fermi levels in eV at the given

concentrations and temperature + eref in eV if return_eref

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.dos.DosAsymmetry(decay_length=0.5, sampling_resolution=100, gaussian_smear=0.05)

Bases: BaseFeaturizer

Quantifies the asymmetry of the DOS near the Fermi level.

The DOS asymmetry is defined as the natural logarithm of the quotient of the total DOS above the Fermi level and the total DOS below the Fermi level. A positive number indicates that there are more states directly above the Fermi level than below the Fermi level. This featurizer is only meant for metals and semi-metals.
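
The definition above amounts to a one-line formula. In the sketch below, dos_above and dos_below are assumed to be the already-sampled total DOS just above and just below the Fermi level; how those totals are sampled is outside the scope of the example, and the function name dos_asymmetry is illustrative.

```python
import math


def dos_asymmetry(dos_above, dos_below):
    """Natural log of the ratio of total DOS above to total DOS below the Fermi level."""
    return math.log(dos_above / dos_below)


print(dos_asymmetry(2.0, 1.0))  # positive: more states above the Fermi level
print(dos_asymmetry(1.0, 2.0))  # negative: more states below the Fermi level
print(dos_asymmetry(1.5, 1.5))  # zero: symmetric
```

Because the definition divides by the DOS below the Fermi level and takes a logarithm, both totals must be positive, which is consistent with the restriction to metals and semi-metals.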

Args:
decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

__init__(decay_length=0.5, sampling_resolution=100, gaussian_smear=0.05)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Returns the labels for each of the features.

featurize(dos)

Calculates the DOS asymmetry.

Args:

dos (Dos): A pymatgen Dos object.

Returns:

A float describing the asymmetry of the DOS.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.dos.Hybridization(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05, species=None)

Bases: BaseFeaturizer

Quantifies s/p/d/f orbital character and their hybridizations at band edges.

Args:
decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

species ([str]): the species for which orbital contributions are

separately returned.

Returns (featurize returns [float] and feature_labels returns [str]):

set of orbitals contributions and hybridizations. If species, then also individual contributions from given species. Examples:

cbm_s (float): s-orbital character of the cbm up to energy_cutoff

vbm_sp (float): sp-hybridization at the vbm edge. Minimum is 0

or no hybridization (e.g. all s or vbm_s==1) and 1.0 is maximum hybridization (i.e. vbm_s==0.5, vbm_p==0.5)

cbm_Si_p (float): p-orbital character of Si
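
One simple function with the properties described for vbm_sp (0 when a single orbital dominates, 1.0 when both characters equal 0.5) is four times the product of the two fractional characters. The sketch below uses that form as an assumption for illustration; it is not confirmed here to be the formula the featurizer uses.

```python
def hybridization(character_a, character_b):
    """Assumed pairwise score: 4 * a * b, which peaks at 1.0 when a == b == 0.5."""
    return 4.0 * character_a * character_b


print(hybridization(1.0, 0.0))  # all s, no hybridization
print(hybridization(0.5, 0.5))  # maximum sp hybridization
print(hybridization(0.8, 0.2))  # intermediate case
```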

__init__(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05, species=None)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Returns ([str]): feature names starting with the extrema (cbm or vbm) followed by either s,p,d,f orbital to show normalized contribution or a pair showing their hybridization or contribution of an element. See the class docs for examples.

featurize(dos, decay_length=None)

Takes in the density of states and returns the orbital contributions and hybridizations.

Args:

dos (pymatgen CompleteDos): note that dos.structure is required

decay_length (float or None): if set, it overrides the instance

variable self.decay_length.

Returns ([float]): features, see class doc for more info

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.dos.SiteDOS(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)

Bases: BaseFeaturizer

Reports the fractional s/p/d/f DOS for a particular site. A CompleteDos object is required because knowledge of the structure is needed. This featurizer works for metals as well as semiconductors. If the material is a semiconductor, cbm and vbm correspond to the two respective band edges. If the material is a metal, cbm and vbm correspond to above and below the Fermi level, respectively.

Args:
decay_length (float in eV):

the dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. three times the decay_length corresponds to 10% sampling strength. there is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

number of points to sample dos

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in dos

Returns (list of floats):

cbm_score_i (float): fractional score for i in {s,p,d,f}

cbm_score_total (float): the total sum of all the {s,p,d,f} scores

this is useful information when comparing the relative contributions from multiple sites

vbm_score_i (float): fractional score for i in {s,p,d,f}

vbm_score_total (float): the total sum of all the {s,p,d,f} scores

this is useful information when comparing the relative contributions from multiple sites

__init__(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()
Returns (list of str): list of names of the features. See the docs for

the featurizer class for more information.

featurize(dos, idx)

get dos scores for given site index

Args:
dos (pymatgen CompleteDos or their dict):

dos to featurize, must contain pdos and structure

idx (int): index of target site in structure.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.dos.get_cbm_vbm_scores(dos, decay_length, sampling_resolution, gaussian_smear)

Quantifies the contribution of all atomic orbitals (s/p/d/f) from all crystal sites to the conduction band minimum (CBM) and the valence band maximum (VBM). An exponential decay function is used to sample the DOS. An example use may be sorting the output based on cbm_score or vbm_score.

Args:
dos (pymatgen CompleteDos or their dict):

The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS)

decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

Returns:
orbital_scores [(dict)]:

A list of how much each orbital contributes to the partial density of states near the band edge. Dictionary items are:

cbm_score: (float) fractional contribution to conduction band

vbm_score: (float) fractional contribution to valence band

species: (pymatgen Specie) the Specie of the orbital

character: (str) is the orbital character s, p, d, or f

location: [(float)] fractional coordinates of the orbital
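
The suggested use, sorting the output by cbm_score or vbm_score, can be sketched as follows. The entries are made-up values that only follow the documented key names; in real output, species would be a pymatgen Specie rather than a string.

```python
# Illustrative stand-in for the returned list; the numbers are invented.
orbital_scores = [
    {"cbm_score": 0.10, "vbm_score": 0.55, "species": "O", "character": "p",
     "location": [0.25, 0.25, 0.25]},
    {"cbm_score": 0.70, "vbm_score": 0.05, "species": "Ti", "character": "d",
     "location": [0.0, 0.0, 0.0]},
    {"cbm_score": 0.20, "vbm_score": 0.40, "species": "O", "character": "s",
     "location": [0.75, 0.75, 0.75]},
]

# Largest contributors first.
by_cbm = sorted(orbital_scores, key=lambda entry: entry["cbm_score"], reverse=True)
by_vbm = sorted(orbital_scores, key=lambda entry: entry["vbm_score"], reverse=True)

top_cbm, top_vbm = by_cbm[0], by_vbm[0]
print("main CBM orbital:", top_cbm["species"], top_cbm["character"])
print("main VBM orbital:", top_vbm["species"], top_vbm["character"])
```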

matminer.featurizers.dos.get_site_dos_scores(dos, idx, decay_length, sampling_resolution, gaussian_smear)

Quantifies the contribution of all atomic orbitals (s/p/d/f) from a particular crystal site to the conduction band minimum (CBM) and the valence band maximum (VBM). An exponential decay function is used to sample the DOS. If the DOS is that of a metal, then CBM and VBM indicate the orbital scores above and below the Fermi energy, respectively.

Args:
dos (pymatgen CompleteDos or their dict):

The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS)

decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

idx (int):

site index for which to gather dos s/p/d/f scores

Returns:
orbital_scores (dict):

a dictionary of the fractional s/p/d/f orbital scores from the total dos accumulated from that site. dictionary structure:

{cbm: {s: (float), …, f: (float), total: (float)},

vbm: {s: (float), …, f: (float), total: (float)}}

matminer.featurizers.function module

class matminer.featurizers.function.FunctionFeaturizer(expressions=None, multi_feature_depth=1, postprocess=None, combo_function=None, latexify_labels=False)

Bases: BaseFeaturizer

Features from functions applied to existing features, e.g. “1/x”

This featurizer must be fit either by calling .fit_featurize_dataframe or by calling .fit followed by featurize_dataframe.

This class featurizes a dataframe according to a set of expressions representing functions to apply to existing features. The approach here uses sympy-based parsing of string expressions, rather than explicit python functions. The primary reason for this is to provide better support for book-keeping (e.g. with feature labels), substitution, and elimination of symbolic redundancy, for which sympy is well-suited.

Note original feature names in the resulting feature set will have their sympy-illegal characters substituted with underscores. For example:

“exp(-MagpieData_avg_dev_NfValence)/sqrt(MagpieData_range_Number)”

Where the original feature names were

“MagpieData avg_dev NfValence” and “MagpieData range Number”
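
The renaming shown in this example can be reproduced with a few lines of plain Python. The character list is copied from the class attribute ILLEGAL_CHARACTERS documented for this featurizer; the sanitize helper is hypothetical and only mirrors the described substitution.

```python
ILLEGAL_CHARACTERS = ["|", " ", "/", "\\", "?", "@", "#", "$", "%"]


def sanitize(name):
    """Replace every sympy-illegal character in a feature name with an underscore."""
    for char in ILLEGAL_CHARACTERS:
        name = name.replace(char, "_")
    return name


for original in ["MagpieData avg_dev NfValence", "MagpieData range Number"]:
    print(sanitize(original))
```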

Args:
expressions ([str]): list of sympy-parseable expressions

representing a function of a single variable x, e. g. [“1 / x”, “x ** 2”], defaults to the list above

multi_feature_depth (int): how many features to include if using

multiple fields for functionalization, e. g. 2 will include pairwise combined features

postprocess (function or type): type to cast functional outputs

to, if, for example, you want to include the possibility of complex numbers in your outputs, use postprocess=np.complex128, defaults to float

combo_function (function): function to combine multi-features,

defaults to np.prod (i.e. cumulative product of expressions), note that a combo function must cleanly process sympy expressions and takes a list of arbitrary length as input, other options include np.sum

latexify_labels (bool): whether to render labels in latex,

defaults to False

ILLEGAL_CHARACTERS = ['|', ' ', '/', '\\', '?', '@', '#', '$', '%']
__init__(expressions=None, multi_feature_depth=1, postprocess=None, combo_function=None, latexify_labels=False)
citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

property exp_dict

Generates a dictionary of expressions keyed by number of variables in each expression

Returns:

Dictionary of expressions keyed by number of variables

feature_labels()
Returns:

Set of feature labels corresponding to expressions

featurize(*args)

Main featurizer function, essentially iterates over all of the functions in self.function_list to generate features for each argument.

Args:
*args: list of numbers to generate functional output

features

Returns:

list of functional outputs corresponding to input args

fit(X, y=None, **fit_kwargs)

Sets the feature labels. Not intended to be used by a user, only intended to be invoked as part of featurize_dataframe

Args:

X (DataFrame or array-like): data to fit to

Returns:

Set of feature labels corresponding to expressions

generate_string_expressions(input_variable_names)

Method to generate string expressions for input strings, mainly used to generate column names for featurize_dataframe

Args:
input_variable_names ([str]): strings corresponding to

functional input variable names

Returns:

List of string expressions generated by substitution of variable names into functions

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.function.generate_expressions_combinations(expressions, combo_depth=2, combo_function=<function prod>)

This function takes a list of strings representing functions of x, converts them to sympy expressions, and combines them according to the combo_depth parameter. Also filters resultant expressions for any redundant ones determined by sympy expression equivalence.

Args:
expressions (strings): all of the sympy-parseable strings

to be converted to expressions and combined, e. g. [“1 / x”, “x ** 2”], must be functions of x

combo_depth (int): the number of independent variables to consider

combo_function (method): the function which combines the

respective expressions provided, defaults to np.prod, i. e. the cumulative product of the expressions

Returns:
list of unique non-trivial expressions for featurization

of inputs
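
The combination step can be approximated with strings alone. The sketch below is not the sympy-based implementation: combine_expressions is a hypothetical helper that substitutes numbered variables x0, x1, … for x and joins the terms with a product. It does not remove symbolically redundant results, which the documented function does via sympy.

```python
import itertools
import re


def combine_expressions(expressions, combo_depth=2):
    """Build product expressions over combo_depth independent variables.

    String-only approximation: no symbolic simplification or de-duplication.
    """
    combined = []
    for choice in itertools.product(expressions, repeat=combo_depth):
        terms = [
            # Replace the standalone variable x with x0, x1, ... per position.
            "({})".format(re.sub(r"\bx\b", "x{}".format(i), expr))
            for i, expr in enumerate(choice)
        ]
        combined.append("*".join(terms))
    return combined


for expression in combine_expressions(["1 / x", "x ** 2"], combo_depth=2):
    print(expression)
```

The word-boundary pattern keeps function names such as exp intact while still replacing the variable inside exp(x).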

Module contents