matminer.featurizers package

Submodules

matminer.featurizers.bandstructure module

class matminer.featurizers.bandstructure.BandFeaturizer(kpoints=None, find_method='nearest', nbands=2)

Bases: matminer.featurizers.base.BaseFeaturizer

Featurizes a pymatgen band structure object.

Args:
kpoints ([1x3 numpy array]): list of fractional coordinates of k-points at which energy is extracted.

find_method (str): the method for finding or interpolating for energy at given kpoints. It does nothing if kpoints is None. Options are:

‘nearest’: the energy of the nearest available k-point to the input k-point is returned.

‘linear’: the result of linear interpolation is returned; see the documentation for scipy.interpolate.griddata.

nbands (int): the number of valence/conduction bands to be featurized

__init__(kpoints=None, find_method='nearest', nbands=2)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation, ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(bs)
Args:
bs (pymatgen BandStructure or BandStructureSymmLine or their dict): The band structure to featurize. To obtain all features, bs should include the structure attribute.

Returns:
([float]): a list of band structure features. If not bs.structure, features that require the structure will be returned as NaN.

List of currently supported features:

band_gap (eV): the difference between the CBM and VBM energy

is_gap_direct (0.0|1.0): whether the band gap is direct or not

direct_gap (eV): the minimum direct distance of the last valence band and the first conduction band

p_ex1_norm (float): k-space distance between Gamma point and k-point of VBM

n_ex1_norm (float): k-space distance between Gamma point and k-point of CBM

p_ex1_degen: degeneracy of VBM

n_ex1_degen: degeneracy of CBM

if kpoints is provided (e.g. for kpoints == [[0.0, 0.0, 0.0]]):

n_0.0;0.0;0.0_en: (energy of the first conduction band at [0.0, 0.0, 0.0] - CBM energy)

p_0.0;0.0;0.0_en: (energy of the last valence band at [0.0, 0.0, 0.0] - VBM energy)
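As an illustration of the first three features and of the k-point labels, the sketch below works on two toy bands sampled at the same k-points. It does not use pymatgen or matminer: the helper names `gap_features` and `kpoint_labels` and the toy energies are assumptions made for this example, and band degeneracies are ignored.

```python
def gap_features(valence, conduction):
    """Return [band_gap, is_gap_direct, direct_gap] for two bands.

    valence and conduction hold the energies (eV) of the last valence band
    and the first conduction band at the same ordered k-points.
    """
    vbm_idx = max(range(len(valence)), key=valence.__getitem__)
    cbm_idx = min(range(len(conduction)), key=conduction.__getitem__)
    band_gap = conduction[cbm_idx] - valence[vbm_idx]
    # Smallest vertical (same k-point) separation between the two bands.
    direct_gap = min(c - v for v, c in zip(valence, conduction))
    is_gap_direct = 1.0 if vbm_idx == cbm_idx else 0.0
    return [band_gap, is_gap_direct, direct_gap]


def kpoint_labels(kpoints):
    """Build labels such as 'n_0.0;0.0;0.0_en' for each requested k-point."""
    labels = []
    for k in kpoints:
        tag = ";".join(str(c) for c in k)
        labels += [f"n_{tag}_en", f"p_{tag}_en"]
    return labels


# Toy bands: VBM at the first k-point, CBM at the last one (indirect gap).
features = gap_features([0.0, -0.5, -1.0], [2.0, 1.5, 1.0])
labels = kpoint_labels([[0.0, 0.0, 0.0]])
```

Here the gap is indirect, so `is_gap_direct` is 0.0 and `direct_gap` is larger than `band_gap`.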

static get_bindex_bspin(extremum, is_cbm)

Returns the band index and spin of band extremum

Args:
extremum (dict): dictionary containing the CBM/VBM, i.e. output of Bandstructure.get_cbm()

is_cbm (bool): whether the extremum is the CBM or not

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.bandstructure.BranchPointEnergy(n_vb=1, n_cb=1, calculate_band_edges=True, atol=1e-05)

Bases: matminer.featurizers.base.BaseFeaturizer

Branch point energy and absolute band edge position.

Calculates the branch point energy and (optionally) an absolute band edge position, assuming the branch point energy is the center of the gap.

Args:

n_vb (int): number of valence bands to include in BPE calc

n_cb (int): number of conduction bands to include in BPE calc

calculate_band_edges (bool): whether to also return band edge positions

atol (float): absolute tolerance when finding equivalent fractional k-points in irreducible Brillouin zone (IBZ) when weights is None

__init__(n_vb=1, n_cb=1, calculate_band_edges=True, atol=1e-05)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation, ideally in BibTeX format.

feature_labels()
Returns ([str]): absolute energy levels as provided in the input BandStructure. “absolute” means no reference energy is subtracted from branch_point_energy, vbm or cbm.

featurize(bs, target_gap=None, weights=None)
Args:

bs (BandStructure): Uniform (not symm line) band structure

target_gap (float): if set the band gap is scissored to match this number

weights ([float]): if set, its length has to be equal to bs.kpoints to explicitly determine the k-point weights when averaging

Returns:

(int) branch point energy on same energy scale as BS eigenvalues
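The text above does not spell out the averaging, so the following is only a sketch under one common definition of the branch point energy: the k-point-weighted average of the mid-point between the mean of the selected valence bands and the mean of the selected conduction bands. The function name and the toy energies are assumptions for illustration, not matminer's implementation.

```python
def branch_point_energy(valence, conduction, weights=None):
    """Average the mid-point of the selected band energies over k-points.

    valence[k] and conduction[k] list the energies (eV) of the n_vb highest
    valence bands and the n_cb lowest conduction bands at k-point k.
    """
    if weights is None:
        weights = [1.0] * len(valence)  # treat all k-points equally
    total = 0.0
    for vb, cb, w in zip(valence, conduction, weights):
        total += w * (sum(vb) / len(vb) + sum(cb) / len(cb))
    return total / (2.0 * sum(weights))


# Two k-points with one valence and one conduction band each.
bpe = branch_point_energy([[0.0], [-1.0]], [[2.0], [5.0]])
weighted = branch_point_energy([[0.0], [-1.0]], [[2.0], [5.0]], weights=[3.0, 1.0])
```

Passing explicit weights shifts the result toward the k-points with the larger weight, which is the role the weights argument plays above.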

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.base module

class matminer.featurizers.base.BaseFeaturizer

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin, abc.ABC

Abstract class to calculate features from raw materials input data such as a compound formula, a pymatgen crystal structure, or a bandstructure object.

## Using a BaseFeaturizer Class

There are multiple ways for running the featurize routines:

featurize: Featurize a single entry

featurize_many: Featurize a list of entries

featurize_dataframe: Compute features for many entries, store results as columns in a dataframe

Some featurizers require first calling the fit method before the featurization methods can function. Generally, you pass the dataset to fit to determine which features a featurizer should compute. For example, a featurizer that returns the partial radial distribution function may need to know which elements are present in a dataset.

You can also use the precheck and precheck_dataframe methods to ensure a featurizer is in scope for a given sample (or dataset) before featurizing.

You can also employ the featurizer as part of a ScikitLearn Pipeline object. For these cases, ScikitLearn calls the transform function of the BaseFeaturizer which is a less-featured wrapper of featurize_many. You would then provide your input data as an array to the Pipeline, which would output the features as an array.

Beyond the featurizing capability, BaseFeaturizer also includes methods for retrieving proper references for a featurizer. The citations function returns a list of papers that should be cited. The implementors function returns a list of people who wrote the featurizer, so that you know who to contact with questions.

## Implementing a New BaseFeaturizer Class

These operations must be implemented for each new featurizer:

featurize - Takes a single material as input, returns the features of that material.

feature_labels - Generates a human-meaningful name for each of the features.

citations - Returns a list of citations in BibTeX format

implementors - Returns a list of people who contributed to writing a paper.

None of these operations should change the state of the featurizer. I.e., running each method twice should not produce different results, no class attributes should be changed, and running one operation should not affect the output of another.

All options of the featurizer must be set by the __init__ function. All options must be listed as keyword arguments with default values, and the value must be saved as a class attribute with the same name (e.g., argument n should be stored in self.n). These requirements are necessary for compatibility with the get_params and set_params methods of BaseEstimator, which enable easy interoperability with ScikitLearn.

Depending on the complexity of your featurizer, it may be worthwhile to implement a from_preset class method. The from_preset method takes the name of a preset and returns an instance of the featurizer with some hard-coded set of inputs. The from_preset option is particularly useful for defining the settings used by papers in the literature.

Optionally, you can implement the fit operation if there are attributes of your featurizer that must be set for the featurizer to work. Any variables that are set by fitting should be stored as class attributes that end with an underscore. (This follows the pattern used by ScikitLearn).

Another option to consider is whether it is worth making any utility operations for your featurizer. featurize must return a list of features, but this may not be the most natural representation for your features (e.g., a dict could be better). Making a separate function for computing features in this natural representation and having the featurize function call this method and then convert the data into a list is a recommended approach. Users who want to compute the representation in the natural form can use the utility function and users who want the data in a ML-ready format (list) can call featurize. See PartialRadialDistributionFunction for an example of this concept.
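A minimal sketch of this pattern follows. `StringStats` is a hypothetical toy featurizer written as a plain class so that it runs without matminer installed; in practice it would subclass BaseFeaturizer. The utility method `compute_stats` returns the natural dict form, and `featurize` converts it to a list ordered like `feature_labels`.

```python
class StringStats:
    """Toy featurizer: character statistics of a formula string."""

    def __init__(self, count_digits=True):
        # Each option is a keyword argument stored under the same name.
        self.count_digits = count_digits

    def compute_stats(self, formula):
        """Utility returning the natural (dict) representation."""
        stats = {
            "n_chars": len(formula),
            "n_upper": sum(ch.isupper() for ch in formula),
        }
        if self.count_digits:
            stats["n_digits"] = sum(ch.isdigit() for ch in formula)
        return stats

    def featurize(self, formula):
        # Convert the dict into a list ordered like feature_labels().
        stats = self.compute_stats(formula)
        return [stats[label] for label in self.feature_labels()]

    def feature_labels(self):
        labels = ["n_chars", "n_upper"]
        if self.count_digits:
            labels.append("n_digits")
        return labels

    def citations(self):
        return []

    def implementors(self):
        return ["Example Author"]


featurizer = StringStats()
row = featurizer.featurize("Fe2O3")
```

None of the four required methods changes the state of the object, so calling any of them twice gives the same result.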

An additional factor to consider is the chunksize for data parallelisation. For lightweight computational tasks, the overhead associated with passing data from multiprocessing.Pool.map() to the function being parallelized can increase the time taken for all tasks to be completed. By setting the self._chunksize argument, the overhead associated with passing data to the tasks can be reduced. Note that there is only an advantage to using chunksize when the time taken to pass the data from map to the function call is within several orders of magnitude to that of the function call itself. By default, we allow the Python multiprocessing library to determine the chunk size automatically based on the size of the list being featurized. You may want to specify a small chunk size for computationally-expensive featurizers, which will enable better distribution of tasks across threads. In contrast, for more lightweight featurizers, it is recommended that the implementor trial a range of chunksize values to find the optimum. As a general rule of thumb, if the featurize function takes 0.1 seconds or less, a chunksize of around 30 will perform best.
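To make the effect of the chunk size concrete, the sketch below groups a list of entries the way a chunked map would, so that each worker receives a batch rather than a single item. `split_into_chunks` is a hypothetical helper for illustration; the real batching is done internally when a chunksize is passed to multiprocessing.Pool.map.

```python
def split_into_chunks(entries, chunksize):
    """Group entries into consecutive chunks of at most chunksize items."""
    return [entries[i:i + chunksize] for i in range(0, len(entries), chunksize)]


entries = list(range(100))
chunks = split_into_chunks(entries, 30)
n_transfers = len(chunks)  # one hand-off to a worker per chunk
```

With 100 entries and a chunk size of 30, only 4 hand-offs are needed instead of 100, which is where the saving for lightweight featurizers comes from.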

## Documenting a BaseFeaturizer

The class documentation for each featurizer must contain a description of the options and the features that will be computed. The options of the class must all be defined in the __init__ function of the class, and we recommend documenting them using the [Google style](https://google.github.io/styleguide/pyguide.html).

For auto-generated documentation purposes, the first line of the featurizer doc should come under the class declaration (not under __init__) and should be a one line summary of the featurizer.

We recommend starting the class documentation with a high-level overview of the features. For example, mention what kind of characteristics of the material they describe and refer the reader to a paper that describes these features well (use a hyperlink if possible, so that the readthedocs will link to that paper). Then, describe each of the individual features in a block named “Features”. It is necessary here to give the user enough information to map a feature name to what it means. The objective in this part is to allow people to understand what each column of their dataframe is without having to read the Python code. You do not need to explain all of the math/algorithms behind each feature in enough detail to reproduce it, just enough for readers to get an idea of what it is.

property chunksize
abstract citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation, ideally in BibTeX format.

abstract feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

abstract featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

featurize_dataframe(df, col_id, ignore_errors=False, return_errors=False, inplace=False, multiindex=False, pbar=True)

Compute features for all entries contained in input dataframe.

Args:

df (Pandas dataframe): Dataframe containing input data.

col_id (str or list of str): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.

ignore_errors (bool): Returns NaN for dataframe rows where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): Returns the errors encountered for each row in a separate XFeaturizer errors column if True. Requires ignore_errors to be True.

inplace (bool): If True, adds columns to the original object in memory and returns None. Else, returns the updated object. Should be identical to pandas inplace behavior.

multiindex (bool): If True, use a Featurizer - Feature 2-level index using the MultiIndex capabilities of pandas. If done inplace, multiindex featurization will overwrite the original dataframe’s column index.

pbar (bool): Shows a progress bar if True.

Returns:

updated dataframe.

featurize_many(entries, ignore_errors=False, return_errors=False, pbar=True)

Featurize a list of entries.

If featurize takes multiple inputs, supply inputs as a list of tuples.

Featurize_many supports entries as a list, tuple, numpy array, Pandas Series, or Pandas DataFrame.

Args:

entries (list-like object): A list of entries to be featurized.

ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.

pbar (bool): Show a progress bar for featurization if True.

Returns:

(list) features for each entry.
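The list-of-tuples convention for multi-input featurizers can be sketched as follows. Both functions are stand-alone stand-ins written for this example (the two-input `featurize` is hypothetical), not the matminer methods themselves.

```python
def featurize(radius_a, radius_b):
    """Hypothetical two-input featurizer: sum and ratio of two radii."""
    return [radius_a + radius_b, radius_a / radius_b]


def featurize_many(entries):
    """Apply featurize to every entry, unpacking each tuple into arguments."""
    return [featurize(*entry) for entry in entries]


features = featurize_many([(1.0, 2.0), (3.0, 1.5)])
```

Each tuple supplies one value per positional argument of `featurize`, and the result holds one feature list per entry.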

featurize_wrapper(x, return_errors=False, ignore_errors=False)

An exception wrapper for featurize, used in featurize_many and featurize_dataframe. featurize_wrapper changes the behavior of featurize when ignore_errors is True in featurize_many/dataframe.

Args:

x: input data to featurize (type depends on featurizer).

ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.

Returns:

(list) one or more features.
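The behavior described for ignore_errors and return_errors can be sketched with a stand-alone function. This is an illustration of the documented semantics only: the extra `featurize` and `n_features` parameters and the toy `inverse` function are assumptions of this example and are not part of the method's signature.

```python
import math
import traceback


def featurize_wrapper(featurize, x, n_features, return_errors=False, ignore_errors=False):
    """Call featurize(x); on failure optionally return NaNs instead of raising."""
    try:
        features = list(featurize(x))
        if return_errors:
            features.append(float("nan"))  # no error for this entry
        return features
    except Exception:
        if not ignore_errors:
            raise
        features = [float("nan")] * n_features
        if return_errors:
            features.append(traceback.format_exc())  # traceback as extra feature
        return features


def inverse(x):
    """Toy featurize function that fails for x == 0."""
    return [1.0 / x]


ok = featurize_wrapper(inverse, 4.0, 1, return_errors=True, ignore_errors=True)
bad = featurize_wrapper(inverse, 0.0, 1, return_errors=True, ignore_errors=True)
```

The successful entry ends with NaN in the error slot, while the failing entry has NaN features followed by the traceback string.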

fit(X, y=None, **fit_kwargs)

Update the parameters of this featurizer based on available data

Args:

X - [list of tuples], training data

Returns:

self

fit_featurize_dataframe(df, col_id, fit_args=None, *args, **kwargs)

The dataframe equivalent of fit_transform. Takes a dataframe and column id as input, fits the featurizer to that dataframe, and returns a featurized dataframe. Accepts the same arguments as featurize_dataframe.

Args:

df (Pandas dataframe): Dataframe containing input data.

col_id (str or list of str): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.

fit_args (list): list of arguments for fit function.

Returns:

updated dataframe based on featurizer fitted to that dataframe.

abstract implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

property n_jobs
precheck(*x) → bool

Precheck (provide an estimate of whether a featurizer will work or not) for a single entry (e.g., a single composition). If the entry fails the precheck, it will most likely fail featurization; if it passes, it is likely (but not guaranteed) to featurize correctly.

Prechecks should be:
  • accurate (but can be good estimates rather than ground truth)

  • fast to evaluate

  • unlikely to be obsolete via changes in the featurizer in the near future

This method should be overridden by any featurizer requiring its use, as by default all entries will pass prechecking. Also, precheck is a good opportunity to throw warnings about long runtimes (e.g., doing nearest neighbors computations on a structure with many thousand sites).

See the documentation for precheck_dataframe for more information.

Args:
*x (Composition, Structure, etc.): Input to-be-featurized. Can be a single input or multiple inputs.

Returns:

(bool): True, if passes the precheck. False, if fails.

precheck_dataframe(df, col_id, return_frac=True, inplace=False) → [<class ‘float’>, <class ‘pandas.core.frame.DataFrame’>]

Precheck an entire dataframe. Subclasses wanting to use precheck functionality should not override this method, they should override precheck (unless the entire df determines whether single entries pass or fail a precheck).

Prechecking should be a quick and useful way to check that for a particular dataframe (set of featurizer inputs), the featurizer is:

  1. in scope, and/or…

  2. robust to errors and/or…

  3. any other reason you would not practically want to use this featurizer on this dataframe.

By prechecking before featurizing, you can avoid applying featurizers to data that will ultimately fail, return unreliable numbers, or are out of scope. Prechecking is also a good time to throw/observe warnings (such as long runtime warnings!).

Args:

df (pd.DataFrame): A dataframe

col_id (str or [str]): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.

return_frac (bool): If True, returns the fraction of entries passing the precheck (e.g., 0.5). Else, returns a dataframe.

inplace (bool): Only relevant if return_frac=False. If inplace=True, the input dataframe is modified in memory with a boolean column for precheck. Otherwise, a new df with this column is returned.

Returns:
(float, pd.DataFrame): If return_frac=True, returns the fraction of entries passing the precheck. Else, returns the dataframe with an extra boolean column added for the precheck.
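The relationship between a per-entry precheck and the returned fraction can be sketched without pandas. The precheck rule below (non-empty alphanumeric formula strings) and both function names are hypothetical, chosen only to keep the example small.

```python
def precheck(formula):
    """Hypothetical precheck: accept only non-empty alphanumeric formulas."""
    return bool(formula) and formula.isalnum()


def precheck_fraction(entries):
    """Return the fraction of entries passing precheck (as with return_frac=True)."""
    results = [precheck(entry) for entry in entries]
    return sum(results) / len(results)


frac = precheck_fraction(["NaCl", "Fe2O3", "", "Si O2"])
```

Two of the four entries pass, so the fraction is 0.5; with return_frac=False the per-entry booleans would instead be stored as a column.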

set_chunksize(chunksize)

Set the chunksize used for Pool.map parallelisation.

set_n_jobs(n_jobs)

Set the number of threads for this.

transform(X)

Compute features for a list of inputs

class matminer.featurizers.base.MultipleFeaturizer(featurizers, iterate_over_entries=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Class to run multiple featurizers on the same input data.

All featurizers must take the same kind of data as input to the featurize function.

Args:

featurizers (list of BaseFeaturizer): A list of featurizers to run.

iterate_over_entries (bool): Whether to iterate over the entries or featurizers. Iterating over entries will enable increased caching but will only display a single progress bar for all featurizers. If set to False, iteration will be performed over featurizers, resulting in reduced caching but individual progress bars for each featurizer.
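Conceptually, running several featurizers on the same input concatenates their feature lists and labels in a fixed order. The sketch below shows this with two stand-in classes; `Length`, `UpperCount`, `combined_featurize` and `combined_labels` are hypothetical names for illustration and do not reproduce the caching or progress-bar behavior described above.

```python
class Length:
    """Stand-in featurizer returning the length of a string."""

    def featurize(self, x):
        return [len(x)]

    def feature_labels(self):
        return ["length"]


class UpperCount:
    """Stand-in featurizer counting upper-case characters."""

    def featurize(self, x):
        return [sum(ch.isupper() for ch in x)]

    def feature_labels(self):
        return ["n_upper"]


def combined_featurize(featurizers, x):
    """Concatenate the features of every featurizer for one entry."""
    return [value for f in featurizers for value in f.featurize(x)]


def combined_labels(featurizers):
    """Concatenate the labels in the same order as the features."""
    return [label for f in featurizers for label in f.feature_labels()]


featurizers = [Length(), UpperCount()]
row = combined_featurize(featurizers, "NaCl")
labels = combined_labels(featurizers)
```

Because every featurizer receives the same `x`, they must all accept the same kind of input, as stated above.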

__init__(featurizers, iterate_over_entries=True)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation, ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

featurize_many(entries, ignore_errors=False, return_errors=False, pbar=True)

Featurize a list of entries.

If featurize takes multiple inputs, supply inputs as a list of tuples.

Featurize_many supports entries as a list, tuple, numpy array, Pandas Series, or Pandas DataFrame.

Args:

entries (list-like object): A list of entries to be featurized.

ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.

pbar (bool): Show a progress bar for featurization if True.

Returns:

(list) features for each entry.

featurize_wrapper(x, return_errors=False, ignore_errors=False)

An exception wrapper for featurize, used in featurize_many and featurize_dataframe. featurize_wrapper changes the behavior of featurize when ignore_errors is True in featurize_many/dataframe.

Args:

x: input data to featurize (type depends on featurizer).

ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.

return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.

Returns:

(list) one or more features.

fit(X, y=None, **fit_kwargs)

Update the parameters of this featurizer based on available data

Args:

X - [list of tuples], training data

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

set_n_jobs(n_jobs)

Set the number of threads for this.

class matminer.featurizers.base.StackedFeaturizer(featurizer=None, model=None, name=None, class_names=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Use the output of a machine learning model as features

For regression models, we use the single output class.

For classification models, we use the probability for the first N-1 classes where N is the number of classes.

__init__(featurizer=None, model=None, name=None, class_names=None)

Initialize featurizer

Args:

featurizer (BaseFeaturizer): Featurizer used to generate inputs to the model

model (BaseEstimator): Fitted machine learning model to be evaluated

name (str): [Optional] name of model, used when creating feature names

class_names ([str]): Required for classification models, used when creating feature names (scikit-learn does not specify the number of classes for a classifier). Class names must be in the same order as the classes in the model (e.g., class_names[0] must be the name of the class 0)
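For a classifier, the probabilities of all N classes sum to one, so the last one carries no extra information and only the first N-1 are kept. The sketch below shows that selection on a single row of predicted probabilities. The function names and the label format are assumptions for illustration; the actual feature names generated by StackedFeaturizer may differ.

```python
def stacked_features(probabilities):
    """Keep the probabilities of the first N-1 classes of one prediction."""
    return list(probabilities[:-1])


def stacked_labels(name, class_names):
    """Hypothetical label format: '<name> P(<class>)' for the first N-1 classes."""
    return [f"{name} P({cls})" for cls in class_names[:-1]]


features = stacked_features([0.7, 0.2, 0.1])
labels = stacked_labels("clf", ["metal", "semiconductor", "insulator"])
```

The order of `class_names` must match the order of the model's classes for the labels to line up with the probabilities.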

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation, ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.composition module

class matminer.featurizers.composition.AtomicOrbitals

Bases: matminer.featurizers.base.BaseFeaturizer

Determine HOMO/LUMO features based on a composition.

The highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) are estimated from the atomic orbital energies of the composition. The atomic orbital energies are from NIST: https://www.nist.gov/pml/data/atomic-reference-data-electronic-structure-calculations

Warning: For compositions with inter-species fractions greater than 10,000 (e.g. dilute alloys such as FeC0.00001) the composition will be truncated (to Fe in this example). In such extreme cases, the truncation likely reflects the true physics of the situation (i.e. that the dilute element does not significantly contribute orbital character to the band structure), but the user should be aware of this behavior.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation, ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)
Args:
comp: (Composition) pymatgen Composition object

Returns:

HOMO_character: (str) orbital symbol (‘s’, ‘p’, ‘d’, or ‘f’)

HOMO_element: (str) symbol of element for HOMO

HOMO_energy: (float in eV) absolute energy of HOMO

LUMO_character: (str) orbital symbol (‘s’, ‘p’, ‘d’, or ‘f’)

LUMO_element: (str) symbol of element for LUMO

LUMO_energy: (float in eV) absolute energy of LUMO

gap_AO: (float in eV) the estimated bandgap from HOMO and LUMO energies

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.AtomicPackingEfficiency(threshold=0.01, n_nearest=(1, 3, 5), max_types=6)

Bases: matminer.featurizers.base.BaseFeaturizer

Packing efficiency based on a geometric theory of the amorphous packing of hard spheres.

This featurizer computes two different kinds of features. The first relates to the distance between a composition and the composition of the clusters of atoms expected to be efficiently packed based on a theory from [Laws et al.](http://www.nature.com/doifinder/10.1038/ncomms9123). The second corresponds to the packing efficiency of a system if all atoms in the alloy are simultaneously as efficiently-packed as possible.

The packing efficiency in these models is based on the Atomic Packing Efficiency (APE), which measures the difference between the ratio of the radii of the central atom to its neighbors and the ideal ratio of a cluster with the same number of atoms that has optimal packing efficiency. If the difference between the ratios is too large, the APE is positive. If the difference is too small, the APE is negative.

Features:
dist from {k} clusters |APE| < {thr} - The distance between an alloy composition and the k clusters that have a packing efficiency below thr from ideal

mean simul. packing efficiency - Mean packing efficiency of all atoms. The packing efficiency is measured with respect to ideal (0)

mean abs simul. packing efficiency - Mean absolute value of the packing efficiencies. Closer to zero is more efficiently packed

References:

[1] K.J. Laws, D.B. Miracle, M. Ferry, A predictive structural model for bulk metallic glasses, Nat. Commun. 6 (2015) 8123. doi:10.1038/ncomms9123.

__init__(threshold=0.01, n_nearest=(1, 3, 5), max_types=6)

Initialize the featurizer

Args:
threshold (float): Threshold to use for determining whether a cluster is efficiently packed.

n_nearest ({int}): Number of nearest clusters to use when considering features

max_types (int): Maximum number of atom types to consider when looking for efficient clusters. The process for finding efficient clusters is very expensive for large numbers of types

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation, ideally in BibTeX format.

compute_nearest_cluster_distance(comp)

Compute the distance between a composition and the nearest efficiently-packed clusters.

Measures the mean L_2 distance between the alloy composition and the k-nearest clusters with Atomic Packing Efficiencies within the user-specified tolerance of 1. k is any of the numbers defined in the “n_nearest” parameter of this class.

If there are fewer than k efficient clusters in the system, we use the maximum distance between any two compositions (1) for the unmatched neighbors.

Args:

comp (Composition): Composition of material to evaluate

Return:

[float] Average distances
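The averaging described above, including the padding with a distance of 1 when fewer than k clusters exist, can be sketched as follows. Compositions are represented here as plain lists of atomic fractions and the function name is an assumption of this example; the search for efficiently-packed clusters itself is not shown.

```python
import math


def mean_nearest_distances(comp, clusters, n_nearest=(1, 3, 5)):
    """Mean L2 distance from comp to its k nearest clusters, for each k.

    comp and each cluster are equal-length lists of atomic fractions.
    Missing neighbors (fewer than k clusters) count as distance 1.
    """
    dists = sorted(math.dist(comp, cluster) for cluster in clusters)
    output = []
    for k in n_nearest:
        nearest = dists[:k] + [1.0] * max(0, k - len(dists))
        output.append(sum(nearest) / k)
    return output


# One exact match and one cluster with a different composition.
distances = mean_nearest_distances(
    [0.5, 0.5], [[0.5, 0.5], [0.8, 0.2]], n_nearest=(1, 2, 4)
)
```

With only two clusters available, the k = 4 average includes two padded distances of 1.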

compute_simultaneous_packing_efficiency(comp)

Compute the packing efficiency of the system when the neighbor shell of each atom has the same composition as the alloy. When this criterion is satisfied, it is possible for every atom in this system to be simultaneously as efficiently-packed as possible.

Args:

comp (Composition): Composition to be assessed

Returns:

(float) Average APE of all atoms

(float) Average deviation of the APE of each atom from ideal (0)

create_cluster_lookup_tool(elements)

Get the compositions of efficiently-packed clusters in a certain system of elements

Args:

elements ([Element]): Elements in system

Return:
(NearNeighbors): Tool to find nearby clusters in this system. None if there are no efficiently-packed clusters for this combination of elements

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

find_ideal_cluster_size(radius_ratio)

Get the optimal cluster size for a certain radius ratio

Finds the number of nearest neighbors n that minimizes |1 - rp(n)/r|, where rp(n) is the ideal radius ratio for a certain n and r is the actual ratio.

Args:

radius_ratio (float): r / r_{neighbor}

Returns:

(int) number of neighboring atoms that will be the most efficiently packed.

(float) Optimal APE
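The minimization of |1 - rp(n)/r| can be sketched as a search over candidate neighbor counts. In this sketch the ideal-ratio function is passed in as an argument; `toy_ideal_ratio` is a placeholder and not the physical model, and both the candidate range and the signed form of the returned APE (1 - rp(n)/r) are assumptions consistent with the expression above.

```python
def find_ideal_cluster_size(radius_ratio, ideal_ratio, candidates=range(3, 25)):
    """Return (n, APE) for the n minimizing |1 - ideal_ratio(n) / radius_ratio|."""
    best_n = min(candidates, key=lambda n: abs(1 - ideal_ratio(n) / radius_ratio))
    return best_n, 1 - ideal_ratio(best_n) / radius_ratio


def toy_ideal_ratio(n):
    """Placeholder ideal ratio, NOT the physical model: grows linearly with n."""
    return n / 12


n, ape = find_ideal_cluster_size(1.0, toy_ideal_ratio)
n_large, ape_large = find_ideal_cluster_size(1.3, toy_ideal_ratio)
```

A larger radius ratio selects a larger cluster, since a bigger central atom can accommodate more neighbors.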

get_ideal_radius_ratio(n_neighbors)

Compute the ideal ratio between the central atom and neighboring atoms for a neighbor with a certain number of nearest neighbors.

Based on work by Miracle, Lord, and Ranganathan.

Args:

n_neighbors (int): Number of atoms in 1st NN shell

Return:

(float) ideal radius ratio r / r_{neighbor}

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.BandCenter

Bases: matminer.featurizers.base.BaseFeaturizer

Estimation of absolute position of band center using electronegativity.

Features
  • Band center

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation, ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

(Rough) estimation of absolute position of band center using geometric mean of electronegativity.

Args:

comp (Composition).

Returns:

(float) band center.
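A composition-weighted geometric mean can be sketched as below. The electronegativity values are supplied by the caller, the element names and numbers are placeholders chosen to make the arithmetic easy, and the negative sign (energies measured below the vacuum level) is an assumption of this sketch rather than something stated above.

```python
import math


def band_center(fractions, electronegativity):
    """Negative composition-weighted geometric mean of electronegativities.

    fractions maps element -> atomic fraction (summing to 1);
    electronegativity maps element -> electronegativity in eV.
    """
    log_mean = sum(x * math.log(electronegativity[el]) for el, x in fractions.items())
    return -math.exp(log_mean)


# Placeholder elements and values, not real electronegativities.
center = band_center({"A": 0.5, "B": 0.5}, {"A": 4.0, "B": 9.0})
```

For equal fractions of 4.0 and 9.0 the geometric mean is 6.0, so the sketch returns approximately -6.0.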

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.CationProperty(data_source, features, stats)

Bases: matminer.featurizers.composition.ElementProperty

Features based on properties of cations in a material

Requires that oxidation states have already been determined. Property statistics weighted by composition.

Features: Based on the statistics of the data_source chosen, computed by element stoichiometry. The format generally is:

“{data source} {statistic} {property}”

For example:

“DemlData range magn_moment” # Range of magnetic moment via Deml et al. data

For a list of all statistics, see the PropertyStats documentation; for a list of all attributes available for a given data_source, see the documentation for the data sources (e.g., PymatgenData, MagpieData, MatscholarElementData, etc.).

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Get elemental property attributes

Args:

comp: Pymatgen composition object

Returns:

all_attributes: Specified property statistics of features

classmethod from_preset(preset_name)

Return ElementProperty from a preset string.

Args:

preset_name: (str) can be one of “magpie”, “deml”, “matminer”,

“matscholar_el”, or “megnet_el”.

Returns:

ElementProperty based on the preset name.

class matminer.featurizers.composition.CohesiveEnergy(mapi_key=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Cohesive energy per atom using elemental cohesive energies and formation energy.

Get the cohesive energy per atom of a compound by combining known elemental cohesive energies with the formation energy of the compound.

Parameters:
mapi_key (str): Materials API key for looking up formation energy

by composition alone (if you don’t set the formation energy yourself).
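
The relationship between the three quantities can be sketched as follows. This is a minimal illustration, assuming the usual sign convention (cohesive energies positive, formation energy negative for a stable compound); the function name and the numeric values are placeholders, not the library's implementation or reference data.

```python
def cohesive_energy_per_atom(fractions, elemental_cohesive, formation_energy_per_atom):
    """Estimate the cohesive energy per atom of a compound.

    fractions: dict mapping element label -> atomic fraction (sums to 1).
    elemental_cohesive: dict mapping element label -> elemental cohesive
        energy per atom (positive, same energy unit as the formation energy).
    formation_energy_per_atom: formation energy per atom of the compound.
    """
    # Fraction-weighted sum of the elemental cohesive energies.
    elemental_term = sum(
        frac * elemental_cohesive[el] for el, frac in fractions.items()
    )
    # A more negative formation energy means a more strongly bound compound.
    return elemental_term - formation_energy_per_atom


# Placeholder numbers chosen for easy arithmetic.
e_coh = cohesive_energy_per_atom(
    fractions={"A": 0.5, "B": 0.5},
    elemental_cohesive={"A": 1.0, "B": 2.0},
    formation_energy_per_atom=-2.0,
)
```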

__init__(mapi_key=None)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp, formation_energy_per_atom=None)
Args:

comp: (str) compound composition, e.g., “NaCl”

formation_energy_per_atom: (float) the formation energy per atom of your compound. If not set, will look up the most stable formation energy from the Materials Project database.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.CohesiveEnergyMP(mapi_key=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Cohesive energy per atom lookup using Materials Project

Parameters:
mapi_key (str): Materials API key for looking up cohesive energy

by composition alone.

__init__(mapi_key=None)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)
Args:

comp: (str) compound composition, eg: “NaCl”

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.ElectronAffinity

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate average electron affinity times formal charge of anion elements. Note: The formal charges must already be computed before calling featurize. Generates average (electron affinity*formal charge) of anions.

__init__()

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)
Args:

comp: (Composition) Composition to be featurized

Returns:

avg_anion_affin (single-element list): average electron affinity*formal charge of anions

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.ElectronegativityDiff(stats=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Features from electronegativity differences between anions and cations.

These features are computed by first determining the concentration-weighted average electronegativity of the anions. For example, the average electronegativity of the anions in CaCoSO is equal to 1/2 of that of S and 1/2 of that of O. We then compute the difference between the electronegativity of each cation and the average anion electronegativity.

The feature values are then determined based on the concentration-weighted statistics in the same manner as ElementProperty features. For example, one value could be the mean electronegativity difference over all the anions.

Parameters:

data_source (data class): source from which to retrieve element data

stats: Property statistics to compute

Generates average electronegativity difference between cations and anions
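
The CaCoSO example described above can be worked through by hand. The sketch below uses Pauling electronegativities and assigns cations and anions manually; in practice that assignment comes from the oxidation states. Taking the absolute value of each difference is an assumption of this sketch.

```python
# Pauling electronegativities for the elements in CaCoSO.
PAULING = {"Ca": 1.00, "Co": 1.88, "S": 2.58, "O": 3.44}

# Amounts per formula unit, split by hand into cations and anions.
cations = {"Ca": 1, "Co": 1}
anions = {"S": 1, "O": 1}


def weighted_mean(values, weights):
    """Mean of values[k] weighted by weights[k]."""
    total = sum(weights.values())
    return sum(values[k] * w for k, w in weights.items()) / total


# Step 1: concentration-weighted average electronegativity of the anions.
anion_avg = weighted_mean(PAULING, anions)

# Step 2: difference between each cation and the anion average.
diffs = {el: abs(PAULING[el] - anion_avg) for el in cations}

# Step 3: one possible statistic, the concentration-weighted mean difference.
mean_diff = weighted_mean(diffs, cations)
```

Other statistics (minimum, maximum, range, and so on) would be computed from the same per-cation differences.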

__init__(stats=None)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)
Args:

comp: Pymatgen Composition object

Returns:

en_diff_stats (list of floats): Property stats of electronegativity difference

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.ElementFraction

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate the atomic fraction of each element in a composition.

Generates a vector where each index represents an element in atomic number order.
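
To make the vector layout concrete, here is a small sketch that uses a truncated element list (H through O only, an assumption made to keep the example short). The helper name and the dict input are hypothetical.

```python
# Truncated list of element symbols in atomic number order (H..O only).
ELEMENTS = ["H", "He", "Li", "Be", "B", "C", "N", "O"]


def element_fraction_vector(amounts):
    """Return atomic fractions ordered by atomic number.

    amounts: dict mapping element symbol -> number of atoms per formula unit.
    Elements absent from the composition get a fraction of 0.
    """
    total = sum(amounts.values())
    return [amounts.get(el, 0) / total for el in ELEMENTS]


# H2O: index 0 (H) holds 2/3 and index 7 (O) holds 1/3.
vector = element_fraction_vector({"H": 2, "O": 1})
```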

__init__()

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)
Args:

comp: Pymatgen Composition object

Returns:

vector (list of floats): fraction of each element in a composition

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.ElementProperty(data_source, features, stats)

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate elemental property attributes.

To initialize quickly, use the from_preset() method.

Features: Based on the statistics of the data_source chosen, computed by element stoichiometry. The format generally is:

“{data source} {statistic} {property}”

For example:

“PymatgenData range X” # Range of electronegativity from Pymatgen data

For a list of all statistics, see the PropertyStats documentation; for a list of all attributes available for a given data_source, see the documentation for the data sources (e.g., PymatgenData, MagpieData, MatscholarElementData, etc.).

Args:
data_source (AbstractData or str): source from which to retrieve

element property data (or use str for preset: “pymatgen”, “magpie”, or “deml”)

features (list of strings): List of elemental properties to use

(these must be supported by data_source)

stats (list of strings): a list of weighted statistics to compute for each

property (see PropertyStats for available stats)
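
The label format described above can be sketched in a few lines. The helper name is hypothetical, and the loop order (all statistics for one property before moving on to the next property) is an assumption of this sketch rather than documented behavior.

```python
def build_feature_labels(source_name, features, stats):
    """Build labels of the form '{data source} {statistic} {property}'."""
    return [
        "{} {} {}".format(source_name, stat, feature)
        for feature in features
        for stat in stats
    ]


# "X" is the electronegativity property used in the example above;
# "atomic_mass" is an assumed second property name for illustration.
labels = build_feature_labels("PymatgenData", ["X", "atomic_mass"], ["mean", "range"])
```

Each requested property contributes one label per requested statistic, so the number of features is len(features) * len(stats).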

__init__(data_source, features, stats)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Get elemental property attributes

Args:

comp: Pymatgen composition object

Returns:

all_attributes: Specified property statistics of features

classmethod from_preset(preset_name)

Return ElementProperty from a preset string.

Args:

preset_name: (str) can be one of “magpie”, “deml”, “matminer”,

“matscholar_el”, or “megnet_el”.

Returns:

ElementProperty based on the preset name.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.IonProperty(data_source=<matminer.utils.data.PymatgenData object>, fast=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Ionic property attributes. Similar to ElementProperty.

__init__(data_source=<matminer.utils.data.PymatgenData object>, fast=False)
Args:
data_source - (OxidationStateMixin) - An AbstractData class that supports

the get_oxidation_state method.

fast - (boolean) whether to assume elements exist in a single oxidation state,

which can dramatically accelerate the calculation of whether an ionic compound is possible, but will miss heterovalent compounds like Fe3O4.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Ionic character attributes

Args:

comp: (Composition) Composition to be featurized

Returns:

cpd_possible (bool): Indicates if a neutral ionic compound is possible

max_ionic_char (float): Maximum ionic character between two atoms

avg_ionic_char (float): Average ionic character
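
One common way to quantify the ionic character between two atoms is Pauling's relation, 1 - exp(-(X_A - X_B)^2 / 4), where X is the Pauling electronegativity. The sketch below applies it to every pair of elements and takes the maximum. Whether IonProperty uses exactly this expression is an assumption here, and the averaging scheme behind avg_ionic_char is not shown.

```python
import math
from itertools import combinations

# Pauling electronegativities for the two elements in NaCl.
PAULING = {"Na": 0.93, "Cl": 3.16}


def ionic_character(x_a, x_b):
    """Pauling's estimate of the fractional ionic character of an A-B bond."""
    return 1.0 - math.exp(-0.25 * (x_a - x_b) ** 2)


def max_ionic_character(elements):
    """Largest pairwise ionic character among the given element symbols."""
    return max(
        ionic_character(PAULING[a], PAULING[b])
        for a, b in combinations(elements, 2)
    )


max_char = max_ionic_character(["Na", "Cl"])
```

Two atoms with equal electronegativity give an ionic character of zero, and the value approaches one as the difference grows.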

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.Meredig

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate features as defined in Meredig et al.

Features:

Atomic fraction of each of the first 103 elements, in order of atomic number.

17 statistics of elemental properties:

  • Mean atomic weight of constituent elements
  • Mean periodic table row and column number
  • Mean and range of atomic number
  • Mean and range of atomic radius
  • Mean and range of electronegativity
  • Mean number of valence electrons in each orbital
  • Fraction of total valence electrons in each orbital

__init__()

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Get elemental property attributes

Args:

comp: Pymatgen composition object

Returns:

all_attributes: Specified property statistics of features

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.Miedema(struct_types='all', ss_types='min', data_source='Miedema')

Bases: matminer.featurizers.base.BaseFeaturizer

Formation enthalpies of intermetallic compounds, from Miedema et al.

Calculate the formation enthalpies of the intermetallic compound, solid solution and amorphous phase of a given composition, based on the semi-empirical Miedema model (and some extensions), particularly for transition metal alloys.

Supports elemental, binary and multicomponent alloys. For elemental/binary alloys, the formulation is based on the original works by Miedema et al. in the 1980s. For multicomponent alloys, the formulation is essentially a linear combination of the sub-binary systems. This is reported to work well for ternary alloys, but should be used with care for quaternary and higher-order alloys.

Args:
struct_types (str or [str]): default=’all’

  • ‘inter’: intermetallic compound
  • ‘ss’: solid solution
  • ‘amor’: amorphous phase
  • ‘all’: same as [‘inter’, ‘ss’, ‘amor’]
  • [‘inter’, ‘ss’]: intermetallic compound and solid solution

ss_types (str or [str]): only for ss, default=’min’

  • ‘fcc’: fcc solid solution
  • ‘bcc’: bcc solid solution
  • ‘hcp’: hcp solid solution
  • ‘no_latt’: solid solution with no specific structure type
  • ‘min’: min value of [‘fcc’, ‘bcc’, ‘hcp’, ‘no_latt’]
  • ‘all’: same as [‘fcc’, ‘bcc’, ‘hcp’, ‘no_latt’]
  • [‘fcc’, ‘bcc’]: fcc and bcc solid solutions

data_source (str): source of dataset, default=’Miedema’

‘Miedema’: ‘Miedema.csv’ placed in “matminer/utils/data_files/”, containing the following model parameters for 73 elements: ‘molar_volume’, ‘electron_density’, ‘electronegativity’, ‘valence_electrons’, ‘a_const’, ‘R_const’, ‘H_trans’, ‘compressibility’, ‘shear_modulus’, ‘melting_point’, ‘structural_stability’. Please see the references for details.

Returns:
(list of floats) Miedema formation enthalpies (eV/atom) for input struct_types:

  • Miedema_deltaH_inter: for intermetallic compound
  • Miedema_deltaH_ss: for solid solution, can include ‘fcc’, ‘bcc’, ‘hcp’, ‘no_latt’, ‘min’ based on input ss_types
  • Miedema_deltaH_amor: for amorphous phase

__init__(struct_types='all', ss_types='min', data_source='Miedema')

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

deltaH_chem(elements, fracs, struct)

Chemical term of formation enthalpy.

Args:

elements (list of str): list of elements

fracs (list of floats): list of atomic fractions

struct (str): ‘inter’, ‘ss’ or ‘amor’

Returns:

deltaH_chem (float): chemical term of formation enthalpy

deltaH_elast(elements, fracs)

Elastic term of formation enthalpy.

Args:

elements (list of str): list of elements

fracs (list of floats): list of atomic fractions

Returns:

deltaH_elastic (float): elastic term of formation enthalpy

deltaH_struct(elements, fracs, latt)

Structural term of formation enthalpy, only for solid solution.

Args:

elements (list of str): list of elements

fracs (list of floats): list of atomic fractions

latt (str): ‘fcc’, ‘bcc’, ‘hcp’ or ‘no_latt’

Returns:

deltaH_struct (float): structural term of formation enthalpy

deltaH_topo(elements, fracs)

Topological term of formation enthalpy, only for amorphous phase.

Args:

elements (list of str): list of elements

fracs (list of floats): list of atomic fractions

Returns:

deltaH_topo (float): topological term of formation enthalpy

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Get Miedema formation enthalpies of target structures: inter, amor, ss (can be further divided into ‘min’, ‘fcc’, ‘bcc’, ‘hcp’, ‘no_latt’ for different lattice_types).

Args:

comp: Pymatgen composition object

Returns:

miedema (list of floats): formation enthalpies of target structures

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

precheck(c: pymatgen.core.composition.Composition) → bool

Precheck a single entry. Miedema does not work for compositions containing any elements for which the Miedema model has no parameters. To precheck an entire dataframe (and automatically gather the fraction of structures that will pass the precheck), please use precheck_dataframe.

Args:

c (pymatgen.Composition): The composition to precheck.

Returns:

(bool): If True, c passed the precheck; otherwise, it failed.

class matminer.featurizers.composition.OxidationStates(stats=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Statistics about the oxidation states for each specie. Features are concentration-weighted statistics of the oxidation states.

__init__(stats=None)
Args:

stats - (list of string), which statistics to compute

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

classmethod from_preset(preset_name)

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.Stoichiometry(p_list=(0, 2, 3, 5, 7, 10), num_atoms=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate norms of stoichiometric attributes.

Parameters:

p_list (list of ints): list of norms to calculate

num_atoms (bool): whether to return number of atoms per formula unit

__init__(p_list=(0, 2, 3, 5, 7, 10), num_atoms=False)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Get stoichiometric attributes.

Args:

comp: Pymatgen composition object

p_list (list of ints)

Returns:
p_norm (list of floats): Lp norm-based stoichiometric attributes.

Returns number of atoms if no p-values specified.
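
As an illustration of what an Lp norm of a stoichiometry means, the sketch below computes (sum of f_i^p)^(1/p) over the atomic fractions f_i, treating p = 0 as the number of distinct elements. The helper name and the dict input are assumptions made for this example, and the handling of p = 0 reflects the usual convention rather than a documented detail.

```python
def stoichiometry_norms(amounts, p_list=(0, 2, 3, 5, 7, 10)):
    """Lp norms of the atomic fractions in a composition.

    amounts: dict mapping element symbol -> number of atoms per formula unit.
    """
    total = sum(amounts.values())
    fracs = [n / total for n in amounts.values() if n > 0]
    norms = []
    for p in p_list:
        if p == 0:
            # The "0-norm" counts the elements that are present.
            norms.append(float(len(fracs)))
        else:
            norms.append(sum(f ** p for f in fracs) ** (1.0 / p))
    return norms


# Fe2O3 has fractions 0.4 (Fe) and 0.6 (O).
norms = stoichiometry_norms({"Fe": 2, "O": 3}, p_list=(0, 2))
```

Higher values of p weight the most abundant element more heavily, so the norm tends toward the largest atomic fraction as p grows.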

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.TMetalFraction

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate fraction of magnetic transition metals in a composition.

Parameters:

data_source (data class): source from which to retrieve element data

Generates: Fraction of magnetic transition metal atoms in a compound

__init__()

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)
Args:

comp: Pymatgen Composition object

Returns:

frac_magn_atoms (single-element list): fraction of magnetic transition metal atoms in a compound

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.ValenceOrbital(orbitals=('s', 'p', 'd', 'f'), props=('avg', 'frac'))

Bases: matminer.featurizers.base.BaseFeaturizer

Attributes of valence orbital shells

Args:

data_source (data object): source from which to retrieve element data

orbitals (list): orbitals to calculate

props (list): specifies whether to return average number of electrons in each orbital, fraction of electrons in each orbital, or both
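
The two kinds of output can be illustrated with NaCl, taking Na as 3s1 and Cl as 3s2 3p5. The table of valence electron counts, the helper name, and the output ordering (averages first, then fractions) are assumptions of this sketch; the values a data source reports as valence electrons may differ.

```python
# Valence electron counts per orbital type (Na: 3s1, Cl: 3s2 3p5).
VALENCE = {
    "Na": {"s": 1, "p": 0, "d": 0, "f": 0},
    "Cl": {"s": 2, "p": 5, "d": 0, "f": 0},
}


def valence_attributes(amounts, orbitals=("s", "p", "d", "f")):
    """Average electrons per atom in each orbital, then fraction per orbital."""
    total_atoms = sum(amounts.values())
    avg = [
        sum(n * VALENCE[el][orb] for el, n in amounts.items()) / total_atoms
        for orb in orbitals
    ]
    total_electrons = sum(avg)
    frac = [a / total_electrons for a in avg]
    return avg + frac


attrs = valence_attributes({"Na": 1, "Cl": 1})
```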

__init__(orbitals=('s', 'p', 'd', 'f'), props=('avg', 'frac'))

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Weighted fraction of valence electrons in each orbital

Args:

comp: Pymatgen composition object

Returns:
valence_attributes (list of floats): Average number and/or

fraction of valence electrons in specified orbitals

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.composition.YangSolidSolution

Bases: matminer.featurizers.base.BaseFeaturizer

Mixing thermochemistry and size mismatch terms of Yang and Zhang (2012)

This featurizer returns two different features developed by Yang and Zhang (https://linkinghub.elsevier.com/retrieve/pii/S0254058411009357) to predict whether metal alloys will form metallic glasses, crystalline solid solutions, or intermetallics. The first, Omega, is related to the balance between the mixing entropy and mixing enthalpy of the liquid phase. The second, delta, is related to the atomic size mismatch between the different elements of the material.

Features

  • Yang omega - Mixing thermochemistry feature, Omega
  • Yang delta - Atomic size mismatch term

__init__()

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

compute_delta(comp)

Compute Yang’s delta parameter

\sqrt{\sum^n_{i=1} c_i \left( 1 - \frac{r_i}{\bar{r}} \right)^2 }

where c_i and r_i are the fraction and radius of element i, and \bar{r} is the fraction-weighted average of the radii. We use the radii compiled by Miracle et al. (https://www.tandfonline.com/doi/ref/10.1179/095066010X12646898728200?scroll=top).

Args:

comp (Composition) - Composition to assess

Returns:

(float) delta
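
The delta formula above translates directly into code. In this sketch the radii are placeholder numbers rather than the Miracle et al. values, and the function name and dict inputs are hypothetical.

```python
import math


def yang_delta(fractions, radii):
    """Atomic size mismatch: sqrt(sum_i c_i * (1 - r_i / r_bar)^2).

    fractions: dict mapping element label -> atomic fraction c_i.
    radii: dict mapping element label -> atomic radius r_i (any unit).
    """
    # Fraction-weighted average radius.
    r_bar = sum(c * radii[el] for el, c in fractions.items())
    return math.sqrt(
        sum(c * (1.0 - radii[el] / r_bar) ** 2 for el, c in fractions.items())
    )


# Placeholder radii: r_bar = 1.25, so each term is 0.5 * 0.2^2.
delta = yang_delta({"A": 0.5, "B": 0.5}, {"A": 1.0, "B": 1.5})
```

When all elements have the same radius, every term vanishes and delta is zero.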

compute_omega(comp)

Compute Yang’s mixing thermodynamics descriptor

\frac{T_m \Delta S_{mix}}{ |  \Delta H_{mix} | }

where T_m is the average melting temperature, \Delta S_{mix} is the ideal mixing entropy, and \Delta H_{mix} is the average mixing enthalpy of all pairs of elements in the alloy.

Args:

comp (Composition) - Composition to featurize

Returns:

(float) Omega
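
A minimal sketch of the Omega expression follows. It assumes a fraction-weighted average for T_m, the ideal mixing entropy -R * sum(c_i ln c_i), and a mixing enthalpy supplied by the caller in J/mol; how the pairwise mixing enthalpies are looked up and averaged is not covered here. All numeric inputs are placeholders.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol K)


def ideal_mixing_entropy(fractions):
    """Ideal mixing entropy in J/(mol K): -R * sum(c_i * ln(c_i))."""
    return -R * sum(c * math.log(c) for c in fractions.values() if c > 0)


def yang_omega(fractions, melting_points, delta_h_mix):
    """Omega = T_m * dS_mix / |dH_mix|, with dH_mix given in J/mol."""
    # Fraction-weighted average melting temperature in K.
    t_m = sum(c * melting_points[el] for el, c in fractions.items())
    return t_m * ideal_mixing_entropy(fractions) / abs(delta_h_mix)


# Placeholder inputs: equimolar binary, T_m = 1500 K, dH_mix = -5000 J/mol.
omega = yang_omega(
    fractions={"A": 0.5, "B": 0.5},
    melting_points={"A": 1000.0, "B": 2000.0},
    delta_h_mix=-5000.0,
)
```

Because the denominator is an absolute value, Omega is positive regardless of the sign of the mixing enthalpy, and it diverges as the mixing enthalpy approaches zero.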

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(comp)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

precheck(c: pymatgen.core.composition.Composition) → bool

Precheck a single entry. YangSolidSolution does not work for compositions containing any binary element combinations for which the model has no parameters. We can nearly equivalently approximate this by checking against the unary element list.

To precheck an entire dataframe (and automatically gather the fraction of structures that will pass the precheck), please use precheck_dataframe.

Args:

c (pymatgen.Composition): The composition to precheck.

Returns:

(bool): If True, c passed the precheck; otherwise, it failed.

matminer.featurizers.composition.has_oxidation_states(comp)

Check if a composition object has oxidation states for each element

TODO: Does this make sense to add to pymatgen? -wardlt

Args:

comp (Composition): Composition to check

Returns:

(boolean) Whether this composition object contains oxidation states

matminer.featurizers.composition.is_ionic(comp)

Determines whether a compound is an ionic compound.

Looks at the oxidation states of each site and checks if both anions and cations exist

Args:

comp (Composition): Composition to check

Returns:

(bool) Whether the composition describes an ionic compound
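
The logic of the two helper functions above can be sketched without pymatgen by representing a composition as a list of (symbol, oxidation state) pairs, with None marking a missing oxidation state. This representation and the `_sketch` function names are assumptions made for illustration.

```python
def has_oxidation_states_sketch(species):
    """True if every species carries an oxidation state."""
    return all(oxi is not None for _, oxi in species)


def is_ionic_sketch(species):
    """True if both a cation (oxi > 0) and an anion (oxi < 0) are present."""
    states = [oxi for _, oxi in species if oxi is not None]
    has_cations = any(oxi > 0 for oxi in states)
    has_anions = any(oxi < 0 for oxi in states)
    return has_cations and has_anions


nacl = [("Na", 1), ("Cl", -1)]
alloy = [("Fe", 0), ("Ni", 0)]
undecorated = [("Na", None), ("Cl", None)]
```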

matminer.featurizers.conversions module

This module defines featurizers that can convert between different data formats

Note that these featurizers do not produce machine learning-ready features. Instead, they should be used to pre-process data, either through a standalone transformation or as part of a Pipeline.

class matminer.featurizers.conversions.CompositionToOxidComposition(target_col_id='composition_oxid', overwrite_data=False, coerce_mixed=True, return_original_on_error=False, **kwargs)

Bases: matminer.featurizers.conversions.ConversionFeaturizer

Utility featurizer to add oxidation states to a pymatgen Composition.

Oxidation states are determined using pymatgen’s guessing routines. The expected input is a pymatgen.core.composition.Composition object.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_column if it

exists.

coerce_mixed (bool): If a composition has both species containing

oxid states and not containing oxid states, strips all of the oxid states and guesses the entire composition’s oxid states.

return_original_on_error: If the oxidation states cannot be

guessed and set to True, the composition without oxidation states will be returned. If set to False, an error will be thrown.

**kwargs: Parameters to control the settings for

pymatgen.io.structure.Structure.add_oxidation_state_by_guess().

__init__(target_col_id='composition_oxid', overwrite_data=False, coerce_mixed=True, return_original_on_error=False, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(comp)

Add oxidation states to a Composition using pymatgen’s guessing routines.

Args:

comp (pymatgen.core.composition.Composition): A composition.

Returns:
(pymatgen.core.composition.Composition): A Composition object

decorated with oxidation states.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.CompositionToStructureFromMP(target_col_id='structure', overwrite_data=False, mapi_key=None)

Bases: matminer.featurizers.conversions.ConversionFeaturizer

Featurizer to get a Structure object from Materials Project using the composition alone. The most stable entry from Materials Project is selected, or NaN if no entry is found in the Materials Project.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_column if it

exists.

mapi_key (str): Materials API key

__init__(target_col_id='structure', overwrite_data=False, mapi_key=None)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(comp)

Get the most stable structure from Materials Project.

Args:

comp (pymatgen.core.composition.Composition): A composition.

Returns:

(pymatgen.core.structure.Structure): A Structure object.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.ConversionFeaturizer(target_col_id, overwrite_data)

Bases: matminer.featurizers.base.BaseFeaturizer

Abstract class to perform data conversions.

Featurizers subclassing this class do not produce machine learning-ready features but instead are used to pre-process data. As Featurizers, the conversion process can take advantage of the parallelisation implemented in ScikitLearn.

Note that feature_labels are set dynamically and may depend on the column id of the data being featurized. As such, feature_labels may differ before and after featurization.

ConversionFeaturizers differ from other Featurizers in that the user can specify the column in which to write the converted data. The output column is controlled through target_col_id. ConversionFeaturizers also have the ability to overwrite data in existing columns. This is controlled by the overwrite_data option. “In place” conversion of data can be achieved by setting target_col_id=None and overwrite_data=True. See the docstring below for more details.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_col_id if it

exists.
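
The column-naming rules described for target_col_id can be summarized in a small sketch. The function name and the choice of ValueError are assumptions; the text above only states that an error is thrown when the column exists and overwrite_data is not set.

```python
def resolve_target_col(col_id, target_col_id, existing_columns, overwrite_data=False):
    """Work out which column converted data would be written to.

    col_id: name of the column being featurized.
    target_col_id: None, a plain column name, or a suffix starting with "_".
    existing_columns: column names already present in the dataframe.
    """
    if target_col_id is None:
        # "In place" conversion: write back to the input column.
        target = col_id
    elif target_col_id.startswith("_"):
        # Leading underscore: append the rest as a suffix to col_id.
        target = "{}_{}".format(col_id, target_col_id[1:])
    else:
        target = target_col_id
    if target in existing_columns and not overwrite_data:
        raise ValueError("Column '{}' exists; set overwrite_data=True.".format(target))
    return target


suffixed = resolve_target_col("structure_dict", "_object", ["structure_dict"])
in_place = resolve_target_col("composition", None, ["composition"], overwrite_data=True)
```

With the default target_col_id='_object' of DictToObject, an input column named "structure_dict" would therefore map to "structure_dict_object" under these rules.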

__init__(target_col_id, overwrite_data)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

featurize_dataframe(df, col_id, **kwargs)

Perform the data conversion and set the target column dynamically.

target_col_id, and accordingly feature_labels, may depend on the column id of the data being featurized. As such, target_col_id is first set dynamically before the BaseFeaturizer.featurize_dataframe() super method is called.

Args:

df (Pandas.DataFrame): Dataframe containing input data.

col_id (str or list of str): column label containing objects to

featurize. Can be multiple labels if the featurize function requires multiple inputs.

**kwargs: Additional keyword arguments that will be passed through

to BaseFeaturizer.featurize_dataframe().

Returns:

(Pandas.Dataframe): The updated dataframe.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.DictToObject(target_col_id='_object', overwrite_data=False)

Bases: matminer.featurizers.conversions.ConversionFeaturizer

Utility featurizer to decode a dict to Python object via MSON.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_column if it

exists.

__init__(target_col_id='_object', overwrite_data=False)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(dict_data)

Convert an MSONable dictionary to a Python object.

Args:
dict_data (dict): An MSONable dictionary, e.g. produced by

pymatgen.core.structure.Structure.as_dict().

Returns:

(object): An object with the type specified by dict_data.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.JsonToObject(target_col_id='_object', overwrite_data=False)

Bases: matminer.featurizers.conversions.ConversionFeaturizer

Utility featurizer to decode json data to a Python object via MSON.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_column if it

exists.

__init__(target_col_id='_object', overwrite_data=False)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(json_data)

Decode MSONable JSON data to a Python object.

Args:
json_data (str): MSONable JSON data, e.g. produced by

pymatgen.core.structure.Structure.to_json().

Returns:

(object): An object with the type specified by json_data.
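
MSON-style dictionaries record the originating class under the “@module” and “@class” keys, which is what lets a decoder rebuild the object. The sketch below imitates that idea with the standard json module and a small registry. The Point class, the REGISTRY lookup and the decode hook are hypothetical stand-ins for illustration only and do not reproduce the decoder matminer relies on.

```python
import json


class Point:
    """Toy class following the as_dict/from_dict convention."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def as_dict(self):
        return {"@module": "example", "@class": "Point", "x": self.x, "y": self.y}

    @classmethod
    def from_dict(cls, d):
        return cls(d["x"], d["y"])


# Hypothetical registry standing in for the import-by-name step.
REGISTRY = {"Point": Point}


def decode(d):
    """object_hook: rebuild registered classes, pass other dicts through."""
    cls = REGISTRY.get(d.get("@class"))
    return cls.from_dict(d) if cls else d


json_data = json.dumps(Point(1.0, 2.0).as_dict())
obj = json.loads(json_data, object_hook=decode)
print(type(obj).__name__, obj.x, obj.y)  # Point 1.0 2.0
```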

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.StrToComposition(reduce=False, target_col_id='composition', overwrite_data=False)

Bases: matminer.featurizers.conversions.ConversionFeaturizer

Utility featurizer to convert a string to a Composition

The expected input is a composition in string form (e.g. “Fe2O3”).

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
reduce (bool): Whether to return a reduced

pymatgen.core.composition.Composition object.

target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_column if it

exists.

__init__(reduce=False, target_col_id='composition', overwrite_data=False)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(string_composition)

Convert a string to a pymatgen Composition.

Args:
string_composition (str): A chemical formula as a string (e.g.

“Fe2O3”).

Returns:

(pymatgen.core.composition.Composition): A composition object.
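
To make the string-to-composition step and the reduce option concrete, here is a minimal sketch that does not use pymatgen. The parse_formula helper is hypothetical and handles only flat formulas with integer counts (no parentheses, charges or fractional amounts), whereas a pymatgen Composition supports much more.

```python
import re
from functools import reduce as fold
from math import gcd


def parse_formula(formula, reduce=False):
    """Parse a flat formula such as "Fe2O3" into {element: amount}."""
    amounts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means one atom of that element.
        amounts[symbol] = amounts.get(symbol, 0) + int(count or 1)
    if reduce and amounts:
        # Divide by the greatest common divisor to get the reduced formula.
        divisor = fold(gcd, amounts.values())
        amounts = {el: n // divisor for el, n in amounts.items()}
    return amounts


print(parse_formula("Fe2O3"))               # {'Fe': 2, 'O': 3}
print(parse_formula("Fe4O6", reduce=True))  # {'Fe': 2, 'O': 3}
```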

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.StructureToComposition(reduce=False, target_col_id='composition', overwrite_data=False)

Bases: matminer.featurizers.conversions.ConversionFeaturizer

Utility featurizer to convert a Structure to a Composition.

The expected input is a pymatgen.core.structure.Structure object.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:

reduce (bool): Whether to return a reduced Composition object.

target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_column if it

exists.

__init__(reduce=False, target_col_id='composition', overwrite_data=False)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(structure)

Convert a pymatgen Structure to a pymatgen Composition.

Args:

structure (pymatgen.core.structure.Structure): A structure.

Returns:

(pymatgen.core.composition.Composition): A Composition object.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.StructureToIStructure(target_col_id='istructure', overwrite_data=False)

Bases: matminer.featurizers.conversions.ConversionFeaturizer

Utility featurizer to convert a Structure to an immutable IStructure.

This is useful if you are using features that employ caching.

The expected input is a pymatgen.core.structure.Structure object.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_column if it

exists.

__init__(target_col_id='istructure', overwrite_data=False)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(structure)

Convert a pymatgen Structure to an immutable IStructure.

Args:

structure (pymatgen.core.structure.Structure): A structure.

Returns:
(pymatgen.core.structure.IStructure): An immutable IStructure

object.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.conversions.StructureToOxidStructure(target_col_id='structure_oxid', overwrite_data=False, return_original_on_error=False, **kwargs)

Bases: matminer.featurizers.conversions.ConversionFeaturizer

Utility featurizer to add oxidation states to a pymatgen Structure.

Oxidation states are determined using pymatgen’s guessing routines. The expected input is a pymatgen.core.structure.Structure object.

Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.

Args:
target_col_id (str or None): The column in which the converted data will

be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).

overwrite_data (bool): Overwrite any data in target_column if it

exists.

return_original_on_error (bool): If set to True and the oxidation

states cannot be guessed, the structure without oxidation states will be returned. If set to False, an error will be thrown.

**kwargs: Parameters to control the settings for

pymatgen.core.structure.Structure.add_oxidation_state_by_guess().
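
The effect of return_original_on_error can be pictured as a simple fallback. In the sketch below, add_oxidation_states and failing_guess are hypothetical names, a plain string stands in for a Structure, and the assumption that a failed guess surfaces as ValueError is made only for the example.

```python
def add_oxidation_states(structure, guess, return_original_on_error=False):
    """Apply `guess` to `structure`, optionally falling back to the input.

    `guess` is a hypothetical callable that returns a decorated copy or
    raises ValueError when no oxidation states can be assigned.
    """
    try:
        return guess(structure)
    except ValueError:
        if return_original_on_error:
            # Hand back the undecorated input instead of failing.
            return structure
        raise


def failing_guess(structure):
    raise ValueError("could not guess oxidation states")


print(add_oxidation_states("NaCl", failing_guess, return_original_on_error=True))  # NaCl
```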

__init__(target_col_id='structure_oxid', overwrite_data=False, return_original_on_error=False, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

featurize(structure)

Add oxidation states to a Structure using pymatgen’s guessing routines.

Args:

structure (pymatgen.core.structure.Structure): A structure.

Returns:
(pymatgen.core.structure.Structure): A Structure object decorated

with oxidation states.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.deprecated module

class matminer.featurizers.deprecated.CrystalSiteFingerprint(**kwargs)

Bases: matminer.featurizers.base.BaseFeaturizer

A local order parameter fingerprint for periodic crystals.

A site fingerprint intended for periodic crystals. The fingerprint represents the value of various order parameters for the site; each value is the product of two quantities: (i) the value of the order parameter itself and (ii) a factor that describes how consistent the number of neighbors is with that order parameter. Note that we can include only factor (ii) using the “wt” order parameter which is always set to 1. Also note that the cation-anion flag works only if the structures are oxidation-state decorated (e.g., use pymatgen’s BVAnalyzer or matminer’s structure_to_oxidstructure()).

__init__(**kwargs)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get crystal fingerprint of site with given index in input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure.

Returns:

list of weighted order parameters of target site.

static from_preset(preset, cation_anion=False)

Use preset parameters to get the fingerprint.

Args:

preset (str): name of preset (“cn” or “ops”).

cation_anion (bool): whether to only consider cation<->anion bonds

(bonds with zero charge are also allowed)

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.dos module

class matminer.featurizers.dos.DOSFeaturizer(contributors=1, decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)

Bases: matminer.featurizers.base.BaseFeaturizer

Significant character and contribution of the density of states from a CompleteDos object. Contributors are the atomic orbitals from each site within the structure. This underlines the importance of dos.structure.

Args:
contributors (int):

Sets the number of top contributors to the DOS that are returned as features. (i.e. contributors=1 will only return the main cb and main vb orbital)

decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS
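
The interplay of decay_length and sampling_resolution can be sketched as follows. The helper decay_weights is hypothetical, and the weight formula exp(-offset / decay_length) is an assumption of this sketch rather than something stated above. Under that assumption the weight falls to roughly 5% at three decay lengths and below 1% at five, so the percentages quoted above are best read as approximate.

```python
import math


def decay_weights(decay_length=0.1, sampling_resolution=100):
    """Return (offsets, weights) for energies sampled away from a band edge.

    Offsets are in eV from the edge and stop at the hard cutoff of five
    decay lengths. The weight formula exp(-offset / decay_length) is an
    assumption of this sketch.
    """
    cutoff = 5 * decay_length
    step = cutoff / (sampling_resolution - 1)
    offsets = [i * step for i in range(sampling_resolution)]
    weights = [math.exp(-offset / decay_length) for offset in offsets]
    return offsets, weights


offsets, weights = decay_weights()
print(len(weights), weights[0], weights[-1])
```

A larger decay_length widens the energy window that contributes to the band-edge scores, while sampling_resolution only controls how finely that window is sampled.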

Returns (featurize returns [float] and feature_labels returns [str]):

xbm_score_i (float): fraction of the ith contributor orbital

xbm_location_i (str): fractional coordinate of ith contributor/site

xbm_character_i (str): character of ith contributor (s, p, d, f)

xbm_specie_i (str): elemental specie of ith contributor (ex: ‘Ti’)

xbm_hybridization (int): the amount of hybridization at the band edge

characterized by an entropy score (x ln x). The hybridization score is larger for a greater number of significant contributors
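
The entropy score mentioned for xbm_hybridization can be illustrated with a few lines of Python. The function entropy_score is a hypothetical name, and the sketch assumes the score is the sum of -x ln x over fractional contributions x; any extra normalization used by matminer is not covered here.

```python
import math


def entropy_score(fractions):
    """Sum of -x * ln(x) over the non-zero fractional contributions."""
    return -sum(x * math.log(x) for x in fractions if x > 0)


# One dominant orbital gives no hybridization; more equal contributors give more.
for fractions in ([1.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]):
    print(fractions, entropy_score(fractions))
```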

__init__(contributors=1, decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()
Returns ([str]): list of names of the features. See the docs for the

featurize method for more information.

featurize(dos)
Args:
dos (pymatgen CompleteDos or their dict):

The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS) and must contain the structure.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.dos.DopingFermi(dopings=None, eref='midgap', T=300, return_eref=False)

Bases: matminer.featurizers.base.BaseFeaturizer

The fermi level (w.r.t. selected reference energy) associated with a specified carrier concentration (1/cm3) and temperature. This featurizer requires the total density of states and structure. The Structure as dos.structure (e.g. in CompleteDos) is required by the FermiDos class.

Args:
dopings ([float]): list of doping concentrations 1/cm3. Note that a

negative concentration is treated as electron majority carrier (n-type) and positive for holes (p-type)

eref (str or int or float): energy alignment reference. Defaults

to midgap (equilibrium fermi). A fixed number can also be used. str options: “midgap”, “vbm”, “cbm”, “dos_fermi”, “band_center”

T (float): absolute temperature in Kelvin

return_eref: if True, instead of aligning the fermi levels based

on eref, it (eref) will be explicitly returned as a feature

Returns (featurize returns [float] and feature_labels returns [str]):
examples:
fermi_c-1e+20T300 (float): the fermi level for the electron

concentration of 1e20 and the temperature of 300K.

fermi_c1e+18T600 (float): fermi level for the hole concentration

of 1e18 and the temperature of 600K.

midgap eref (float): if return_eref==True then eref (midgap here)

energy is returned. In this case, fermi levels are absolute as opposed to relative to eref (i.e. if not return_eref)
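
The example labels follow a simple pattern that can be reproduced with string formatting. The helpers fermi_label and carrier_type are hypothetical, and the format string is inferred from the two examples above rather than taken from the matminer source.

```python
def fermi_label(doping, T):
    """Build a label such as "fermi_c-1e+20T300" (format inferred from the examples)."""
    return "fermi_c{}T{}".format(doping, T)


def carrier_type(doping):
    """Negative concentrations denote electrons (n-type), positive ones holes (p-type)."""
    return "n-type" if doping < 0 else "p-type"


for concentration, temperature in ((-1e20, 300), (1e18, 600)):
    print(fermi_label(concentration, temperature), carrier_type(concentration))
# fermi_c-1e+20T300 n-type
# fermi_c1e+18T600 p-type
```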

__init__(dopings=None, eref='midgap', T=300, return_eref=False)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()
Returns ([str]): list of names of the features generated by featurize

example: “fermi_c-1e+20T300” that is the fermi level for the electron concentration of 1e20 (c-1e+20) and temperature of 300K.

featurize(dos, bandgap=None)
Args:

dos (pymatgen Dos, CompleteDos or FermiDos):

bandgap (float): for example the experimentally measured band gap

or one that is calculated via more accurate methods than the one used to generate dos. dos will be scissored to have the same electronic band gap as bandgap.

Returns ([float]): features are fermi levels in eV at the given

concentrations and temperature + eref in eV if return_eref

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.dos.DosAsymmetry(decay_length=0.5, sampling_resolution=100, gaussian_smear=0.05)

Bases: matminer.featurizers.base.BaseFeaturizer

Quantifies the asymmetry of the DOS near the Fermi level.

The DOS asymmetry is defined as the natural logarithm of the quotient of the total DOS above the Fermi level and the total DOS below the Fermi level. A positive number indicates that there are more states directly above the Fermi level than below the Fermi level. This featurizer is only meant for metals and semi-metals.
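
The definition reduces to a single logarithm once the two totals are known. In this sketch dos_asymmetry is a hypothetical helper, and dos_above and dos_below are assumed to be the already sampled and weighted totals on either side of the Fermi level.

```python
import math


def dos_asymmetry(dos_above, dos_below):
    """Natural log of (total DOS above the Fermi level / total DOS below it)."""
    return math.log(dos_above / dos_below)


print(dos_asymmetry(2.0, 1.0))  # positive: more states above the Fermi level
print(dos_asymmetry(1.0, 2.0))  # negative: more states below the Fermi level
```

The measure is antisymmetric: swapping the two totals flips the sign, and equal totals give zero.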

Args:
decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

__init__(decay_length=0.5, sampling_resolution=100, gaussian_smear=0.05)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Returns the labels for each of the features.

featurize(dos)

Calculates the DOS asymmetry.

Args:

dos (Dos): A pymatgen Dos object.

Returns:

A float describing the asymmetry of the DOS.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.dos.Hybridization(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05, species=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Quantify s/p/d/f orbital character and their hybridizations at band edges.

Args:
decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

species ([str]): the species for which orbital contributions are

separately returned.

Returns (featurize returns [float] and feature_labels returns [str]):

Set of orbital contributions and hybridizations. If species is given, then also individual contributions from the given species. Examples:

cbm_s (float): s-orbital character of the cbm up to energy_cutoff

vbm_sp (float): sp-hybridization at the vbm edge. Minimum is 0

or no hybridization (e.g. all s or vbm_s==1) and 1.0 is maximum hybridization (i.e. vbm_s==0.5, vbm_p==0.5)

cbm_Si_p (float): p-orbital character of Si
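
One expression that satisfies the bounds quoted for vbm_sp is four times the product of the two orbital fractions: it is 0 when either orbital is absent and 1.0 when both fractions equal 0.5. The sketch below uses that expression purely as an assumption for illustration; the text above does not confirm the exact formula used by the featurizer, and pair_hybridization is a hypothetical name.

```python
def pair_hybridization(frac_a, frac_b):
    """Hybridization of two orbital characters, scaled so the maximum is 1.0."""
    return 4.0 * frac_a * frac_b


print(pair_hybridization(1.0, 0.0))  # 0.0, pure s character
print(pair_hybridization(0.5, 0.5))  # 1.0, maximum sp hybridization
```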

__init__(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05, species=None)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Returns ([str]): feature names starting with the extrema (cbm or vbm) followed by either s,p,d,f orbital to show normalized contribution or a pair showing their hybridization or contribution of an element. See the class docs for examples.

featurize(dos, decay_length=None)

Takes in the density of states and returns the orbital contributions and hybridizations.

Args:

dos (pymatgen CompleteDos): note that dos.structure is required

decay_length (float or None): if set, it overrides the instance

variable self.decay_length.

Returns ([float]): features, see class doc for more info

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.dos.SiteDOS(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)

Bases: matminer.featurizers.base.BaseFeaturizer

Report the fractional s/p/d/f DOS for a particular site. A CompleteDos object is required because knowledge of the structure is needed. This featurizer works for metals as well as semiconductors. If the DOS describes a semiconductor, cbm and vbm correspond to the two respective band edges. If the DOS describes a metal, cbm and vbm correspond to above and below the Fermi level, respectively.

Args:
decay_length (float in eV):

the dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. three times the decay_length corresponds to 10% sampling strength. there is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

number of points to sample dos

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in dos

Returns (list of floats):

cbm_score_i (float): fractional score for i in {s,p,d,f}

cbm_score_total (float): the total sum of all the {s,p,d,f} scores

this is useful information when comparing the relative contributions from multiple sites

vbm_score_i (float): fractional score for i in {s,p,d,f}

vbm_score_total (float): the total sum of all the {s,p,d,f} scores

this is useful information when comparing the relative contributions from multiple sites

__init__(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()
Returns (list of str): list of names of the features. See the docs for

the featurizer class for more information.

featurize(dos, idx)

get dos scores for given site index

Args:
dos (pymatgen CompleteDos or their dict):

dos to featurize, must contain pdos and structure

idx (int): index of target site in structure.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.dos.get_cbm_vbm_scores(dos, decay_length, sampling_resolution, gaussian_smear)

Quantifies the contribution of all atomic orbitals (s/p/d/f) from all crystal sites to the conduction band minimum (CBM) and the valence band maximum (VBM). An exponential decay function is used to sample the DOS. An example use may be sorting the output based on cbm_score or vbm_score.

Args:
dos (pymatgen CompleteDos or their dict):

The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS)

decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

Returns:
orbital_scores [(dict)]:

A list of how much each orbital contributes to the partial density of states near the band edge. Dictionary items are:

cbm_score: (float) fractional contribution to conduction band

vbm_score: (float) fractional contribution to valence band

species: (pymatgen Specie) the Specie of the orbital

character: (str) is the orbital character s, p, d, or f

location: [(float)] fractional coordinates of the orbital
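
As suggested in the description, the returned list can be sorted by either score. The entries below are made-up illustrative values, and plain strings stand in for pymatgen Specie objects so that the sketch runs without pymatgen.

```python
# Illustrative stand-in for the list returned by get_cbm_vbm_scores.
orbital_scores = [
    {"cbm_score": 0.10, "vbm_score": 0.55, "species": "O", "character": "p",
     "location": [0.5, 0.5, 0.0]},
    {"cbm_score": 0.70, "vbm_score": 0.05, "species": "Ti", "character": "d",
     "location": [0.0, 0.0, 0.0]},
    {"cbm_score": 0.20, "vbm_score": 0.40, "species": "O", "character": "s",
     "location": [0.5, 0.0, 0.5]},
]

# Largest contributors first, for each band edge.
by_cbm = sorted(orbital_scores, key=lambda orb: orb["cbm_score"], reverse=True)
by_vbm = sorted(orbital_scores, key=lambda orb: orb["vbm_score"], reverse=True)

print(by_cbm[0]["species"], by_cbm[0]["character"])  # Ti d
print(by_vbm[0]["species"], by_vbm[0]["character"])  # O p
```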

matminer.featurizers.dos.get_site_dos_scores(dos, idx, decay_length, sampling_resolution, gaussian_smear)

Quantifies the contribution of all atomic orbitals (s/p/d/f) from a particular crystal site to the conduction band minimum (CBM) and the valence band maximum (VBM). An exponential decay function is used to sample the DOS. if the dos is a metal, then CBM and VBM indicate the orbital scores above and below the fermi energy, respectively.

Args:
dos (pymatgen CompleteDos or their dict):

The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS)

decay_length (float in eV):

The dos is sampled by an exponential decay function. this parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength)

sampling_resolution (int):

Number of points to sample DOS

gaussian_smear (float in eV):

Gaussian smearing (sigma) around each sampled point in the DOS

idx (int):

site index for which to gather dos s/p/d/f scores

Returns:
orbital_scores (dict):

a dictionary of the fractional s/p/d/f orbital scores from the total dos accumulated from that site. dictionary structure:

{cbm: {s: (float), …, f: (float), total: (float)},

vbm: {s: (float), …, f: (float), total: (float)}}

matminer.featurizers.function module

class matminer.featurizers.function.FunctionFeaturizer(expressions=None, multi_feature_depth=1, postprocess=None, combo_function=None, latexify_labels=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Features from functions applied to existing features, e.g. “1/x”

This featurizer must be fit either by calling .fit_featurize_dataframe or by calling .fit followed by featurize_dataframe.

This class featurizes a dataframe according to a set of expressions representing functions to apply to existing features. The approach here uses sympy-based parsing of string expressions, rather than explicit python functions. The primary reason this has been done is to provide better support for book-keeping (e.g. with feature labels), substitution, and elimination of symbolic redundancy, which sympy is well-suited for.

Args:
expressions ([str]): list of sympy-parseable expressions

representing a function of a single variable x, e. g. [“1 / x”, “x ** 2”], defaults to the list above

multi_feature_depth (int): how many features to include if using

multiple fields for functionalization, e. g. 2 will include pairwise combined features

postprocess (function or type): type to cast functional outputs

to, if, for example, you want to include the possibility of complex numbers in your outputs, use postprocess=np.complex, defaults to float

combo_function (function): function to combine multi-features,

defaults to np.prod (i.e. cumulative product of expressions), note that a combo function must cleanly process sympy expressions and takes a list of arbitrary length as input, other options include np.sum

latexify_labels (bool): whether to render labels in latex,

defaults to False
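
To show how expressions, multi_feature_depth and combo_function interact, the sketch below swaps sympy parsing for plain Python callables. The helper functional_features, the EXPRESSIONS table and the label format are hypothetical, and the sketch assumes that a depth of 2 also keeps the single-variable features. It does not perform the symbolic redundancy elimination described above.

```python
from itertools import combinations, product
from math import prod

# Plain callables stand in for sympy-parsed expressions of x.
EXPRESSIONS = {"1/x": lambda x: 1 / x, "x**2": lambda x: x ** 2}


def functional_features(values, names, multi_feature_depth=1, combo_function=prod):
    """Return {label: value} for every expression/column combination."""
    features = {}
    for depth in range(1, multi_feature_depth + 1):
        for cols in combinations(range(len(values)), depth):
            for exprs in product(EXPRESSIONS, repeat=depth):
                # Apply one expression per selected column, then combine.
                terms = [EXPRESSIONS[e](values[c]) for e, c in zip(exprs, cols)]
                label = "*".join(
                    "({})".format(e.replace("x", names[c])) for e, c in zip(exprs, cols)
                )
                features[label] = combo_function(terms)
    return features


feats = functional_features([2.0, 4.0], ["a", "b"], multi_feature_depth=2)
for label, value in feats.items():
    print(label, value)
```

With two columns and two expressions this yields four single-column features and four pairwise products; passing combo_function=sum would add the terms instead of multiplying them.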

__init__(expressions=None, multi_feature_depth=1, postprocess=None, combo_function=None, latexify_labels=False)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

property exp_dict

Generates a dictionary of expressions keyed by number of variables in each expression

Returns:

Dictionary of expressions keyed by number of variables

feature_labels()
Returns:

Set of feature labels corresponding to expressions

featurize(*args)

Main featurizer function, essentially iterates over all of the functions in self.function_list to generate features for each argument.

Args:
*args: list of numbers to generate functional output

features

Returns:

list of functional outputs corresponding to input args

fit(X, y=None, **fit_kwargs)

Sets the feature labels. Not intended to be used by a user, only intended to be invoked as part of featurize_dataframe

Args:

X (DataFrame or array-like): data to fit to

Returns:

Set of feature labels corresponding to expressions

generate_string_expressions(input_variable_names)

Method to generate string expressions for input strings, mainly used to generate column names for featurize_dataframe

Args:

input_variable_names ([str]): strings corresponding to functional input variable names

Returns:

List of string expressions generated by substitution of variable names into functions

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.function.generate_expressions_combinations(expressions, combo_depth=2, combo_function=<function prod>)

This function takes a list of strings representing functions of x, converts them to sympy expressions, and combines them according to the combo_depth parameter. Also filters resultant expressions for any redundant ones determined by sympy expression equivalence.

Args:

expressions ([str]): all of the sympy-parseable strings to be converted to expressions and combined, e.g. [“1 / x”, “x ** 2”]; must be functions of x

combo_depth (int): the number of independent variables to consider

combo_function (method): the function which combines the respective expressions provided; defaults to np.prod, i.e. the cumulative product of the expressions

Returns:

list of unique non-trivial expressions for featurization of inputs

matminer.featurizers.site module

class matminer.featurizers.site.AGNIFingerprints(directions=(None, 'x', 'y', 'z'), etas=None, cutoff=8)

Bases: matminer.featurizers.base.BaseFeaturizer

Product integral of RDF and Gaussian window function, from Botu et al.

Integral of the product of the radial distribution function and a Gaussian window function. Originally used by Botu et al. to fit empirical potentials. These features come in two forms: atomic fingerprints and direction-resolved fingerprints.

Atomic fingerprints describe the local environment of an atom and are computed using the function:

A_i(\eta) = \sum\limits_{j \ne i} e^{-(\frac{r_{ij}}{\eta})^2} f(r_{ij})

where i is the index of the atom, j is the index of a neighboring atom, \eta is a scaling parameter (the window width), r_{ij} is the distance between atoms i and j, and f is a cutoff function where f(r_{ij}) = 0.5[\cos(\frac{\pi r_{ij}}{R_c}) + 1] if r_{ij} < R_c and 0 otherwise.

The direction-resolved fingerprints are computed using:

V_i^k(\eta) = \sum\limits_{j \ne i} \frac{r_{ij}^k}{r_{ij}} e^{-(\frac{r_{ij}}{\eta})^2} f(r_{ij})

where r_{ij}^k is the k^{th} component of \mathbf{r}_i - \mathbf{r}_j.

TODO: Differentiate between different atom types (maybe as another class)
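The two fingerprint formulas can be sketched in plain Python as follows. This is a minimal illustration of the equations only, not matminer's implementation: the function names, the hand-written neighbor list, and the use of Cartesian displacement tuples are all assumptions made for the example.

```python
import math


def cutoff_function(r, r_c):
    """f(r) = 0.5 * [cos(pi * r / R_c) + 1] for r < R_c, and 0 otherwise."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)


def agni_fingerprint(displacements, eta, r_c, k=None):
    """Sum the Gaussian-windowed contributions of all neighbors.

    displacements: vectors r_i - r_j to each neighbor j (Angstroms).
    k: None for the atomic fingerprint A_i, or 0/1/2 for the x/y/z
       direction-resolved fingerprint V_i^k.
    """
    total = 0.0
    for vec in displacements:
        r = math.sqrt(sum(c * c for c in vec))
        term = math.exp(-((r / eta) ** 2)) * cutoff_function(r, r_c)
        if k is not None:
            term *= vec[k] / r  # projection of the unit vector onto direction k
        total += term
    return total


# Hypothetical neighbor shell: two atoms mirrored along x, one along +y.
neighbors = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
a_i = agni_fingerprint(neighbors, eta=2.0, r_c=8.0)
v_x = agni_fingerprint(neighbors, eta=2.0, r_c=8.0, k=0)
v_y = agni_fingerprint(neighbors, eta=2.0, r_c=8.0, k=1)
```

In this toy shell the two mirrored x neighbors cancel in v_x, while the single +y neighbor leaves v_y positive, which is the directional information the atomic fingerprint alone cannot carry.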

__init__(directions=(None, 'x', 'y', 'z'), etas=None, cutoff=8)
Args:

directions (iterable): List of directions for the fingerprints. Can be one or more of None, ‘x’, ‘y’, or ‘z’

etas (iterable of floats): List of window widths to compute

cutoff (float): Cutoff distance (Angstroms)

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.AngularFourierSeries(bins, cutoff=10.0)

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the angular Fourier series (AFS), including both angular and radial info

The AFS is the product of pairwise distance function (g_n, g_n’) between two pairs of atoms (sharing the common central site) and the cosine of the angle between the two pairs. The AFS is a 2-dimensional feature (the axes are g_n, g_n’).

Examples of distance functionals are square functions, Gaussian, trig functions, and Bessel functions. An example for Gaussian:

lambda d: exp( -(d - d_n)**2 ), where d_n is the coefficient for g_n

See grdf() for a full list of available binning functions.

There are two preset conditions:

gaussian: bin functions are Gaussians

histogram: bin functions are rectangular functions

Features:

AFS ([gn], [gn’]) - Angular Fourier Series between binning functions (g1 and g2)

Args:

bins: ([AbstractPairwise]) a list of binning functions that implement the AbstractPairwise base class

cutoff: (float) maximum distance to look for neighbors. The featurizer will run slowly for large distance cutoffs because the number of neighbor pairs scales as the square of the number of neighbors
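A minimal sketch of how an AFS value could be accumulated from the description above, assuming the sum runs over ordered pairs of distinct neighbors and using the Gaussian form exp(-(d - d_n)**2) quoted earlier. The helper names and the toy neighbor vectors are hypothetical; real use would go through AbstractPairwise bins and a pymatgen Structure.

```python
import itertools
import math


def gaussian_bin(center):
    """Distance function g_n(d) = exp(-(d - d_n)**2) with coefficient d_n."""
    return lambda d: math.exp(-((d - center) ** 2))


def angular_fourier_series(neighbor_vectors, bins):
    """Return the flattened n_bins x n_bins AFS for one central site.

    neighbor_vectors: displacements from the central site to each neighbor.
    """
    dists = [math.sqrt(sum(c * c for c in v)) for v in neighbor_vectors]
    n = len(bins)
    afs = [[0.0] * n for _ in range(n)]
    # Assumption: every ordered pair (j, k) of distinct neighbors contributes.
    for j, k in itertools.permutations(range(len(neighbor_vectors)), 2):
        dot = sum(a * b for a, b in zip(neighbor_vectors[j], neighbor_vectors[k]))
        cos_theta = dot / (dists[j] * dists[k])
        for a, g_a in enumerate(bins):
            for b, g_b in enumerate(bins):
                afs[a][b] += g_a(dists[j]) * g_b(dists[k]) * cos_theta
    return [value for row in afs for value in row]


# Two neighbors on opposite sides of the central site (angle of 180 degrees).
vectors = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
features = angular_fourier_series(vectors, [gaussian_bin(1.0), gaussian_bin(2.0)])
```

The double loop over neighbor pairs is also why the cost grows with the square of the neighbor count as the cutoff increases.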

__init__(bins, cutoff=10.0)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get AFS of the input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure struct.

Returns:

Flattened list of AFS values. The list order is: g_n g_n’

static from_preset(preset, width=0.5, spacing=0.5, cutoff=10)
Preset bin functions for this featurizer. Example use:
>>> AFS = AngularFourierSeries.from_preset('gaussian')
>>> AFS.featurize(struct, idx)
Args:

preset (str): shape of bin (either ‘gaussian’ or ‘histogram’)

width (float): bin width. std dev for gaussian, width for histogram

spacing (float): the spacing between bin centers

cutoff (float): maximum distance to look for neighbors

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.AverageBondAngle(method)

Bases: matminer.featurizers.base.BaseFeaturizer

Determines the average bond angle of a specific site with its nearest neighbors using one of pymatgen’s NearNeighbor classes. Neighbors that are adjacent to each other are stored and the angles between them are computed. The ‘average bond angle’ of a site is the mean bond angle between all its nearest neighbors.

__init__(method)

Initialize featurizer

Args:

method (NearNeighbor) - subclass under NearNeighbor used to compute nearest neighbors

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc, idx)

Get average bond angle of a site with all its nearest neighbors.

Args:

strc (Structure): Pymatgen Structure object

idx (int): index of target site in structure object

Returns:

average bond angle (list)

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.AverageBondLength(method)

Bases: matminer.featurizers.base.BaseFeaturizer

Determines the average bond length between one specific site and all its nearest neighbors using one of pymatgen’s NearNeighbor classes. These nearest neighbor calculators return weights related to the proximity of each neighbor to this site. ‘Average bond length’ of a site is the weighted average of the distance between site and all its nearest neighbors.
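The weighted average described above can be sketched as below. The distances and weights are made-up values standing in for what a NearNeighbor calculator might return; the function name is hypothetical.

```python
def weighted_average_bond_length(distances, weights):
    """Weighted mean of neighbor distances: sum(w * d) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("At least one neighbor must have a non-zero weight.")
    return sum(w * d for w, d in zip(weights, distances)) / total_weight


# Two close neighbors with full weight and one distant neighbor with half weight.
avg = weighted_average_bond_length([2.0, 2.0, 3.0], [1.0, 1.0, 0.5])
```

Here the distant neighbor pulls the average up by less than it would in an unweighted mean (2.2 rather than about 2.33).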

__init__(method)

Initialize featurizer

Args:

method (NearNeighbor) - subclass under NearNeighbor used to compute nearest neighbors

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc, idx)

Get weighted average bond length of a site and all its nearest neighbors.

Args:

strc (Structure): Pymatgen Structure object

idx (int): index of target site in structure object

Returns:

average bond length (list)

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.BondOrientationalParameter(max_l=10, compute_w=False, compute_w_hat=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Averages of spherical harmonics of local neighbors

Bond Orientational Parameters (BOPs) describe the local environment around an atom by considering the local symmetry of the bonds as computed using spherical harmonics. To create descriptors that are invariant to rotating the coordinate system, we use the average of all spherical harmonics of a certain degree, following the approach of Steinhardt et al. We weight the contributions of each neighbor with the solid angle of the Voronoi tessellation (see Mickel et al., https://aip.scitation.org/doi/abs/10.1063/1.4774084, for further discussion). The weighting scheme makes these descriptors vary smoothly with small distortions of a crystal structure.

In addition to the average spherical harmonics, this class can also compute the W and \hat{W} parameters proposed by Steinhardt et al.

Attributes:

BOOP Q l=<n> - Average spherical harmonic for a certain degree, n.

BOOP W l=<n> - W parameter for a certain degree of spherical harmonic, n.

BOOP What l=<n> - \hat{W} parameter for a certain degree of spherical harmonic, n.

References:

Steinhardt et al., PRB (1983)

Seko et al., PRB (2017)
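As a rough illustration of the rotation-invariant Q parameter, the sketch below assumes the standard Steinhardt definition and uses the spherical-harmonic addition theorem, which lets Q_l be written as sqrt(sum_ij w_i w_j P_l(cos gamma_ij)) with weights normalized to 1 and P_l the Legendre polynomial. This avoids evaluating spherical harmonics directly and is not how the class itself is implemented. The function names and the example weights are assumptions; in the featurizer the weights would come from Voronoi solid angles.

```python
import math


def legendre(l, x):
    """Legendre polynomial P_l(x) from Bonnet's recurrence relation."""
    p_prev, p_curr = 1.0, x
    if l == 0:
        return p_prev
    for n in range(1, l):
        p_prev, p_curr = p_curr, ((2 * n + 1) * x * p_curr - n * p_prev) / (n + 1)
    return p_curr


def steinhardt_q(l, neighbor_vectors, weights=None):
    """Q_l for one site from neighbor displacement vectors and optional weights."""
    if weights is None:
        weights = [1.0] * len(neighbor_vectors)
    norm = sum(weights)
    units = []
    for vec in neighbor_vectors:
        length = math.sqrt(sum(c * c for c in vec))
        units.append([c / length for c in vec])
    total = 0.0
    for w_i, u_i in zip(weights, units):
        for w_j, u_j in zip(weights, units):
            cos_gamma = sum(a * b for a, b in zip(u_i, u_j))
            cos_gamma = max(-1.0, min(1.0, cos_gamma))  # guard against rounding
            total += (w_i / norm) * (w_j / norm) * legendre(l, cos_gamma)
    return math.sqrt(max(total, 0.0))


# Ideal octahedral coordination with equal weights.
octahedron = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
q4 = steinhardt_q(4, octahedron)  # about 0.764
q6 = steinhardt_q(6, octahedron)  # about 0.354
```

Because only the angles between neighbor directions enter, rotating the whole neighbor shell leaves the values unchanged.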

__init__(max_l=10, compute_w=False, compute_w_hat=False)

Initialize the featurizer

Args:

max_l (int) - Maximum spherical harmonic to consider

compute_w (bool) - Whether to compute Ws as well

compute_w_hat (bool) - Whether to compute What

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc, idx)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.ChemEnvSiteFingerprint(cetypes, strategy, geom_finder, max_csm=8, max_dist_fac=1.41)

Bases: matminer.featurizers.base.BaseFeaturizer

Resemblance of given sites to ideal environments

Site fingerprint computed from pymatgen’s ChemEnv package that provides resemblance percentages of a given site to ideal environments.

Args:

cetypes ([str]): chemical environments (CEs) to be considered.

strategy (ChemenvStrategy): ChemEnv neighbor-finding strategy.

geom_finder (LocalGeometryFinder): ChemEnv local geometry finder.

max_csm (float): maximum continuous symmetry measure (CSM; default of 8 taken from chemenv). Note that any CSM larger than max_csm will be set to max_csm in order to avoid negative values (i.e., all features are constrained to be between 0 and 1).

max_dist_fac (float): maximum distance factor (default: 1.41).
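One mapping that is consistent with the max_csm note above (CSM values capped at max_csm, features constrained to lie between 0 and 1) is sketched below. The linear form 1 - csm / max_csm and the function name are assumptions made for illustration; the text here does not state the exact formula the class uses.

```python
def csm_to_resemblance(csm, max_csm=8.0):
    """Map a continuous symmetry measure to a resemblance value in [0, 1].

    Assumption: a CSM of 0 (perfect match) maps to 1 and any CSM at or
    above max_csm maps to 0, with linear scaling in between.
    """
    capped = min(csm, max_csm)  # larger CSMs are set to max_csm
    return 1.0 - capped / max_csm


resemblances = [csm_to_resemblance(c) for c in (0.0, 4.0, 20.0)]
```

Without the cap, the CSM of 20 in this example would give a negative value, which is the case the max_csm note guards against.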

__init__(cetypes, strategy, geom_finder, max_csm=8, max_dist_fac=1.41)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get ChemEnv fingerprint of site with given index in input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure struct.

Returns:

(numpy array): resemblance fraction of target site to ideal local environments.

static from_preset(preset)

Use a standard collection of CE types and choose your ChemEnv neighbor-finding strategy.

Args:

preset (str): preset types (“simple” or “multi_weights”).

Returns:

ChemEnvSiteFingerprint object from a preset.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.ChemicalSRO(nn, includes=None, excludes=None, sort=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Chemical short range ordering, deviation of local site and nominal structure compositions

Chemical SRO features to evaluate the deviation of local chemistry from the nominal composition of the structure.

A local bonding preference is computed using f_el = N_el/(sum of N_el) - c_el, where N_el is the number of neighbors of a given element type around the target site, sum of N_el is the total over all element types (i.e., the coordination number), and c_el is the fraction of that element in the entire structure. A positive f_el indicates that “bonding” with the specific element is favored, at least at the target site; a negative f_el indicates that the “bonding” is not favored, at least at the target site.

Note that ChemicalSRO is only featurized for elements identified by “fit” (see following), thus “fit” must be called before “featurize”, or else an error will be raised.

Features:

CSRO__[nn method]_[element] - The Chemical SRO of a site computed based on neighbors determined with a certain NN-detection method for a certain element.
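The bonding-preference formula f_el = N_el/(sum of N_el) - c_el can be illustrated with a few lines of Python. The neighbor list and composition below are invented for the example, and the function name is hypothetical; in the featurizer the neighbors come from the chosen NN method and the composition from the fitted structure.

```python
from collections import Counter


def chemical_sro(neighbor_elements, composition):
    """Return f_el for every element in the nominal composition.

    neighbor_elements: element symbols of the neighbors of the target site.
    composition: dict mapping element symbol to its atomic fraction.
    """
    counts = Counter(neighbor_elements)
    coordination_number = len(neighbor_elements)
    return {
        el: counts.get(el, 0) / coordination_number - fraction
        for el, fraction in composition.items()
    }


# A site in a hypothetical Cu50Zr50 structure with 9 Cu and 3 Zr neighbors.
f = chemical_sro(["Cu"] * 9 + ["Zr"] * 3, {"Cu": 0.5, "Zr": 0.5})
```

In this example f["Cu"] is positive (Cu neighbors are over-represented relative to the nominal 50%) and f["Zr"] is negative by the same amount.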

__init__(nn, includes=None, excludes=None, sort=True)

Initialize the featurizer

Args:

nn (NearestNeighbor): instance of one of pymatgen’s NearestNeighbor classes.

includes (array-like or str): elements included to calculate CSRO.

excludes (array-like or str): elements excluded to calculate CSRO.

sort (bool): whether to sort elements by Mendeleev number.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get CSRO features of site with given index in input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure.

Returns:

(list of floats): Chemical SRO features for each element.

fit(X, y=None)

Identify elements to be included in the following featurization, by intersecting the elements present in the passed structures with those explicitly included (or excluded) in __init__. Only elements in self.el_list_ will be featurized. In addition, the compositions of the passed structures are stored in the dict self.el_amt_dict_, which avoids recomputing the composition when featurizing multiple sites in the same structure.

Args:

X (array-like): containing Pymatgen structures and sites, supports multiple choices:

- 2D array-like object, e.g. [[struct, site], [struct, site], …] or np.array([[struct, site], [struct, site], …])

- Pandas dataframe, e.g. df[[‘struct’, ‘site’]]

y : unused (added for consistency with overridden method signature)

Returns:

self

static from_preset(preset, **kwargs)

Use one of the standard instances of a given NearNeighbor class.

Args:

preset (str): preset type (“VoronoiNN”, “JmolNN”, “MinimumDistanceNN”, “MinimumOKeeffeNN”, or “MinimumVIRENN”).

**kwargs: allow to pass args to the NearNeighbor class.

Returns:

ChemicalSRO from a preset.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.CoordinationNumber(nn=None, use_weights='none')

Bases: matminer.featurizers.base.BaseFeaturizer

Number of first nearest neighbors of a site.

Determines the number of nearest neighbors of a site using one of pymatgen’s NearNeighbor classes. These nearest neighbor calculators can return weights related to the proximity of each neighbor to this site. It is possible to take these weights into account to prevent the coordination number from changing discontinuously with small perturbations of a structure, either by summing the total weights or using the normalization method presented by Ward et al. (http://link.aps.org/doi/10.1103/PhysRevB.96.014107).

Features:

CN_[method] - Coordination number computed using a certain method for calculating nearest neighbors.

__init__(nn=None, use_weights='none')

Initialize the featurizer

Args:

nn (NearestNeighbor) - Method used to determine coordination number

use_weights (string) - Method used to account for weights of neighbors:

‘none’ - Do not use weights when computing coordination number

‘sum’ - Use sum of weights as the coordination number

‘effective’ - Compute the ‘effective coordination number’, which is computed as \frac{(\sum_n w_n)^2}{\sum_n w_n^2}
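The three use_weights options can be compared on a small set of neighbor weights. This is a sketch of the arithmetic only: the function name and the weights are illustrative, whereas in the featurizer the weights would come from the NearNeighbor method.

```python
def coordination_number(weights, use_weights="none"):
    """Coordination number from neighbor weights under the three schemes."""
    if use_weights == "none":
        return float(len(weights))  # plain neighbor count
    if use_weights == "sum":
        return float(sum(weights))
    if use_weights == "effective":
        # (sum_n w_n)^2 / sum_n w_n^2
        return sum(weights) ** 2 / sum(w * w for w in weights)
    raise ValueError("use_weights must be 'none', 'sum', or 'effective'")


# Two strongly bonded neighbors and one that is only half as close.
weights = [1.0, 1.0, 0.5]
cn_none = coordination_number(weights, "none")
cn_sum = coordination_number(weights, "sum")
cn_eff = coordination_number(weights, "effective")
```

With equal weights all three schemes agree; as a neighbor's weight shrinks toward zero, the 'sum' and 'effective' values fall smoothly instead of dropping by a whole unit.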

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get coordination number of site with given index in input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure struct.

Returns:

[float] - Coordination number

static from_preset(preset, **kwargs)

Use one of the standard instances of a given NearNeighbor class.

Args:

preset (str): preset type (“VoronoiNN”, “JmolNN”, “MinimumDistanceNN”, “MinimumOKeeffeNN”, or “MinimumVIRENN”).

**kwargs: allow to pass args to the NearNeighbor class.

Returns:

CoordinationNumber from a preset.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.CrystalNNFingerprint(op_types, chem_info=None, **kwargs)

Bases: matminer.featurizers.base.BaseFeaturizer

A local order parameter fingerprint for periodic crystals.

The fingerprint represents the value of various order parameters for the site. The “wt” order parameter describes how consistent a site is with a certain coordination number. The remaining order parameters are computed by multiplying the “wt” for that coordination number with the OP value.

The chem_info parameter can be used to also get chemical descriptors that describe differences in some chemical parameter (e.g., electronegativity) between the central site and the site neighbors.

__init__(op_types, chem_info=None, **kwargs)

Initialize the CrystalNNFingerprint. Use the from_preset() function to use default params.

Args:

op_types (dict): a dict of coordination number (int) to a list of str representing the order parameter types

chem_info (dict): a dict of chemical properties (e.g., atomic mass) to dictionaries that map an element to a value (e.g., chem_info[“Pauling scale”][“O”] = 3.44)

**kwargs: other settings to be passed into CrystalNN class

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get crystal fingerprint of site with given index in input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure.

Returns:

list of weighted order parameters of target site.

static from_preset(preset, **kwargs)

Use preset parameters to get the fingerprint.

Args:

preset (str): name of preset (“cn” or “ops”)

**kwargs: other settings to be passed into CrystalNN class

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.EwaldSiteEnergy(accuracy=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Compute site energy from Coulombic interactions

User notes:
  • This class uses the charges that are already defined for the structure.

  • Ewald summations can be expensive. If you are evaluating every site in many large structures, run all of the sites for each structure at the same time. We cache the Ewald result for the structure that was run last, so finishing all sites of one structure before moving on to the next is faster than looping over site indices first and structures second.
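The effect of the last-structure cache on loop ordering can be seen with a toy stand-in. The class below is hypothetical and only mimics the caching behavior described in the note (one cached result, replaced whenever a different structure arrives); it performs no Ewald summation, and plain lists of charges stand in for structures.

```python
class LastStructureCache:
    """Toy featurizer that caches per-site results for the last structure only."""

    def __init__(self):
        self.expensive_calls = 0
        self._last_structure = None
        self._last_energies = None

    def _expensive_sum(self, structure):
        # Placeholder for the costly whole-structure computation.
        self.expensive_calls += 1
        return [float(charge) for charge in structure]

    def featurize(self, structure, idx):
        if structure is not self._last_structure:
            self._last_energies = self._expensive_sum(structure)
            self._last_structure = structure
        return [self._last_energies[idx]]


structures = [[1, -1], [2, -2], [3, -3]]  # stand-ins for three structures

# Recommended order: all sites of one structure, then the next structure.
structure_major = LastStructureCache()
for s in structures:
    for idx in range(len(s)):
        structure_major.featurize(s, idx)

# Slow order: the same site index across all structures, then the next index.
site_major = LastStructureCache()
for idx in range(2):
    for s in structures:
        site_major.featurize(s, idx)
```

The structure-major loop triggers the expensive step once per structure (3 times), while the site-major loop invalidates the cache on every call (6 times).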

Features:

ewald_site_energy - Energy for the site computed from Coulombic interactions

__init__(accuracy=None)
Args:

accuracy (int): Accuracy of Ewald summation, number of decimal places

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc, idx)
Args:

strc (Structure): Pymatgen Structure object.

idx (int): index of target site in structure.

Returns:

([float]) - Electrostatic energy of the site

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.GaussianSymmFunc(etas_g2=None, etas_g4=None, zetas_g4=None, gammas_g4=None, cutoff=6.5)

Bases: matminer.featurizers.base.BaseFeaturizer

Gaussian symmetry function features suggested by Behler et al.

The function is based on pair distances and angles, to approximate the functional dependence of local energies, and was originally used in the fitting of machine-learning potentials. The symmetry functions can be divided into a set of radial functions (g2 function) and a set of angular functions (g4 function). The number of symmetry functions returned is determined by the parameters etas_g2, etas_g4, zetas_g4 and gammas_g4. See the original paper for more details: “Atom-centered symmetry functions for constructing high-dimensional neural network potentials”, J Behler, J Chem Phys 134, 074106 (2011).

The cutoff function is taken as the polynomial form (cosine_cutoff) to give a smoothed truncation. A Fortran and a different Python version can be found in the code Amp: Atomistic Machine-learning Package (https://bitbucket.org/andrewpeterson/amp).

Args:

etas_g2 (list of floats): etas used in radial functions. (default: [0.05, 4., 20., 80.])

etas_g4 (list of floats): etas used in angular functions. (default: [0.005])

zetas_g4 (list of floats): zetas used in angular functions. (default: [1., 4.])

gammas_g4 (list of floats): gammas used in angular functions. (default: [+1., -1.])

cutoff (float): cutoff distance. (default: 6.5)

__init__(etas_g2=None, etas_g4=None, zetas_g4=None, gammas_g4=None, cutoff=6.5)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

static cosine_cutoff(rs, cutoff)

Polynomial cutoff function to give a smoothed truncation of the Gaussian symmetry functions.

Args:

rs (ndarray): distances to elements

cutoff (float): cutoff distance.

Returns:

(ndarray) cutoff function.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get Gaussian symmetry function features of site with given index in input structure. Args:

struct (Structure): Pymatgen Structure object. idx (int): index of target site in structure.

Returns:

(list of floats): Gaussian symmetry function features.

static g2(eta, rs, cutoff)

Gaussian radial symmetry function of the center atom, given an eta parameter.

Args:

eta: radial function parameter.

rs: distances from the central atom to each neighbor

cutoff (float): cutoff distance.

Returns:

(float) Gaussian radial symmetry function.
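A minimal sketch of a radial symmetry function with a cosine cutoff is given below. It assumes the common form 0.5 * [cos(pi * r / R_c) + 1] for the cutoff and a Gaussian exp(-eta * r**2 / R_c**2) with eta scaled by the squared cutoff. That scaling is an assumption borrowed from the Amp convention and is not confirmed by the text here, so treat the numbers as illustrative.

```python
import math


def cosine_cutoff(r, cutoff):
    """Smoothly decay from 1 at r = 0 to 0 at r = cutoff; 0 beyond the cutoff."""
    if r > cutoff:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / cutoff) + 1.0)


def g2(eta, rs, cutoff):
    """Radial symmetry function: Gaussian-weighted sum over neighbor distances.

    Assumption: eta is dimensionless and scaled by cutoff**2.
    """
    return sum(
        math.exp(-eta * r ** 2 / cutoff ** 2) * cosine_cutoff(r, cutoff)
        for r in rs
    )


# Hypothetical neighbor distances (Angstroms); the last lies beyond the cutoff.
distances = [2.5, 2.5, 3.6, 7.0]
features = [g2(eta, distances, cutoff=6.5) for eta in (0.05, 4.0, 20.0, 80.0)]
```

Each eta yields one feature, so the four default etas_g2 values give four radial features per site; larger eta values emphasize only the closest neighbors.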

static g4(etas, zetas, gammas, neigh_dist, neigh_coords, cutoff)

Gaussian angular symmetry function of the center atom, given a set of eta, zeta and gamma parameters.

Args:

etas ([float]): angular function parameters.

zetas ([float]): angular function parameters.

gammas ([float]): angular function parameters.

neigh_coords (list of [floats]): coordinates of neighboring atoms, with respect to the central atom

cutoff (float): cutoff parameter.

Returns:

(float) Gaussian angular symmetry function for all combinations of eta, zeta, gamma

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.GeneralizedRadialDistributionFunction(bins, cutoff=20.0, mode='GRDF')

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the general radial distribution function (GRDF) for a site.

The GRDF is a radial measure of crystal order around a site. There are two featurizing modes:

  1. GRDF: (recommended) - n_bins length vector

    In GRDF mode, the GRDF is computed by considering all sites around a central site (i.e., no sites are omitted when computing the GRDF). The features output from this mode will be vectors with length n_bins.

  2. pairwise GRDF: (advanced users) - n_sites x n_bins matrix

    In this mode, GRDFs are still computed around a central site, but only one other site (and its translational equivalents) is used to compute each GRDF (e.g. site 1 with site 2 and the translational equivalents of site 2). This results in an n_sites x n_bins matrix of features. Requires fit to determine the maximum number of sites.

The GRDF is a generalization of the partial radial distribution function (PRDF). In contrast with the PRDF, the bins of the GRDF are not mutually exclusive and need not carry a constant weight of 1. The PRDF is a special case of the GRDF in which the bins are rectangular functions. Examples of other functions to use with the GRDF are Gaussian, trig, and Bessel functions.

See grdf() for a full list of available binning functions.

There are two preset conditions:

gaussian: bin functions are Gaussians

histogram: bin functions are rectangular functions

Args:

bins: ([AbstractPairwise]) List of pairwise binning functions. Each of these functions must implement the AbstractPairwise class.

cutoff: (float) maximum distance to look for neighbors

mode: (str) the featurizing mode. Supported options are: ‘GRDF’ and ‘pairwise_GRDF’
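The relationship between rectangular and Gaussian bins can be sketched as follows. The bin factories and the plain summation are simplifications made for illustration: real bins implement AbstractPairwise, and any normalization the featurizer applies is omitted here. The neighbor distances are invented.

```python
import math


def histogram_bin(start, width):
    """Rectangular bin: weight 1 inside [start, start + width), else 0."""
    return lambda d: 1.0 if start <= d < start + width else 0.0


def gaussian_bin(center, width):
    """Gaussian bin centered at `center`; overlaps with neighboring bins."""
    return lambda d: math.exp(-(((d - center) / width) ** 2))


def grdf(distances, bins, cutoff):
    """One feature per bin: the bin function summed over neighbors within cutoff."""
    in_range = [d for d in distances if d <= cutoff]
    return [sum(b(d) for d in in_range) for b in bins]


# Hypothetical neighbor distances (Angstroms); 5.0 lies beyond the cutoff.
distances = [0.5, 1.2, 1.8, 2.5, 5.0]
hist = grdf(distances, [histogram_bin(s, 1.0) for s in (0.0, 1.0, 2.0)], cutoff=3.0)
gauss = grdf(distances, [gaussian_bin(c, 1.0) for c in (0.5, 1.5, 2.5)], cutoff=3.0)
```

With rectangular bins each neighbor is counted in exactly one bin (a PRDF-like histogram), whereas with Gaussian bins every neighbor contributes a fractional weight to several bins.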

__init__(bins, cutoff=20.0, mode='GRDF')

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get GRDF of the input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure struct.

Returns:

Flattened list of GRDF values. For each run mode the list order is:

GRDF: bin#

pairwise GRDF: site2# bin#

The site2# corresponds to a pymatgen site index and bin# corresponds to one of the bin functions

fit(X, y=None, **fit_kwargs)

Determine the maximum number of sites in X to assign correct feature labels

Args:

X - [list of tuples], training data; tuple values should be (struc, idx)

Returns:

self

static from_preset(preset, width=1.0, spacing=1.0, cutoff=10, mode='GRDF')
Preset bin functions for this featurizer. Example use:
>>> GRDF = GeneralizedRadialDistributionFunction.from_preset('gaussian')
>>> GRDF.featurize(struct, idx)
Args:

preset (str): shape of bin (either ‘gaussian’ or ‘histogram’)

width (float): bin width. std dev for gaussian, width for histogram

spacing (float): the spacing between bin centers

cutoff (float): maximum distance to look for neighbors

mode (str): featurizing mode. either ‘GRDF’ or ‘pairwise_GRDF’

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.IntersticeDistribution(cutoff=6.5, interstice_types=None, stats=None, radius_type='MiracleRadius')

Bases: matminer.featurizers.base.BaseFeaturizer

Interstice distribution in the neighboring cluster around an atom site.

The interstices are categorized into distance, area and volume interstices. Each of these metrics is a measure of the relative amount of empty space around each atom as determined using atomic sphere models. The distance interstice is the fraction of a bonding line unoccupied by the atom spheres; the area interstice is the unoccupied area within the triangulated surface formed by atom triplets in the convex hull formed by neighbors; and the volume interstice is the unoccupied portion of a tetrahedron formed between the central atom and neighbor atom triplets. Please refer to the original paper for more details (Wang et al. Nat Commun 10, 5537 (2019)).

For amorphous alloys (metallic glasses), the coordination environments are anisotropic, which can be reflected in the inequality of the interstices present around an atom. To describe the anisotropy, here we derive statistics of the interstices to featurize the interstice distribution around the atom. An alternative is to group the interstices into histograms with fixed bins, in which case the features are a vector of the histogram values.

User note: This class is particularly designed for featurizing the site-specific packing heterogeneity in metallic glasses, especially those made up entirely of metallic elements. If non-metallic elements are present in the structures, the interstice estimates may deviate more from the actual values (although this deviation is systematic, so the interstice estimates can still be used to represent the packing heterogeneity).

Args:
cutoff (float): cutoff distance in determining the potential

neighbors for Voronoi tessellation analysis. (default: 6.5)

interstice_types (str or [str]): interstice distribution types,

support sub-list of [‘dist’, ‘area’, ‘vol’].

stats ([str]): statistics of distance/area/volume interstices.

radius_type (str): source of radius estimate. (default: “MiracleRadius”)

__init__(cutoff=6.5, interstice_types=None, stats=None, radius_type='MiracleRadius')

Initialize self. See help(type(self)) for accurate signature.

static analyze_area_interstice(nn_coords, nn_rs, convex_hull_simplices)

Analyze the area interstices in the neighbor convex hull facets.

Args:

nn_coords (array-like, shape (N, 3)): Nearest Neighbors’ coordinates.

nn_rs ([float]): Nearest Neighbors’ radii.

convex_hull_simplices (array-like, shape (M, 3)): Indices of points forming the simplicial facets of convex hull.

Returns:

area_interstice_list ([float]): Area interstice list.

static analyze_dist_interstices(center_r, nn_rs, nn_dists)

Analyze the distance interstices between center atom and neighbors.

Args:

center_r (float): central atom’s radius.

nn_rs ([float]): Nearest Neighbors’ radii.

nn_dists ([float]): Nearest Neighbors’ distances.

Returns:

dist_interstice_list ([float]): Distance interstice list.
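As a rough illustration of the distance interstice described for this class (the fraction of a bonding line not covered by the two atomic spheres), the following minimal sketch assumes the formula 1 - (center_r + nn_r) / nn_dist. The function name dist_interstices_sketch and the input values are hypothetical, and this is not necessarily how analyze_dist_interstices is implemented.

```python
def dist_interstices_sketch(center_r, nn_rs, nn_dists):
    """Fraction of each center-neighbor line not covered by the two spheres.

    Assumption: interstice = 1 - (center_r + nn_r) / nn_dist.
    """
    return [1 - (center_r + nn_r) / nn_dist
            for nn_r, nn_dist in zip(nn_rs, nn_dists)]


# Hypothetical radii and distances in angstroms.
values = dist_interstices_sketch(1.25, [1.25, 1.5], [2.5, 3.0])
```

Under this assumption, two touching spheres give an interstice of zero, and a larger gap between the spheres gives a larger value.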

static analyze_vol_interstice(center_coords, nn_coords, center_r, nn_rs, convex_hull_simplices)

Analyze the volume interstices in the tetrahedra formed by center atom and neighbor convex hull triplets.

Args:

center_coords ([float]): Central atomic coordinates.

nn_coords (array-like, shape (N, 3)): Nearest Neighbors’ coordinates.

center_r (float): central atom’s radius.

nn_rs ([float]): Nearest Neighbors’ radii.

convex_hull_simplices (array-like, shape (M, 3)): Indices of points forming the simplicial facets of convex hull.

Returns:

volume_interstice_list ([float]): Volume interstice list.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get interstice distribution fingerprints of site with given index in input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure.

Returns:

interstice_fps ([float]): Interstice distribution fingerprints.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.LocalPropertyDifference(data_source=<matminer.utils.data.MagpieData object>, weight='area', properties=('Electronegativity', ), signed=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Differences in elemental properties between site and its neighboring sites.

Uses the Voronoi tessellation of the structure to determine the neighbors of the site, and assigns each neighbor (n) a weight (A_n) that corresponds to the area of the facet on the tessellation corresponding to that neighbor. The local property difference is then computed by \frac{\sum_n {A_n |p_n - p_0|}}{\sum_n {A_n}} where p_n is the property (e.g., atomic number) of a neighbor and p_0 is the property of the site. If the signed parameter is set to True, the signed difference of the properties is returned instead of the absolute difference.
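The weighted average above can be sketched in a few lines of plain Python. The function name, the example property values and the facet areas are hypothetical, and the sign convention for the signed variant (p_n - p_0) is an assumption; the sketch does not perform the Voronoi tessellation itself.

```python
def local_property_difference(p0, neighbor_props, weights, signed=False):
    """Weighted mean of property differences between a site and its neighbors.

    p0: property of the central site; neighbor_props: p_n for each neighbor;
    weights: A_n, e.g. Voronoi facet areas.
    Assumption: the signed variant uses (p_n - p0) without the absolute value.
    """
    diffs = [p_n - p0 for p_n in neighbor_props]
    if not signed:
        diffs = [abs(d) for d in diffs]
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)


# Hypothetical electronegativities and facet areas.
unsigned_diff = local_property_difference(2.0, [3.0, 1.0], [1.0, 3.0])
signed_diff = local_property_difference(2.0, [3.0, 1.0], [1.0, 3.0], signed=True)
```

In this example both neighbors differ from the site by 1.0, so the absolute variant gives 1.0, while the signed variant is pulled negative by the neighbor with the larger facet.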

Features:
  • “local property difference in [property]” - Weighted average of differences between an elemental property of a site and that of each of its neighbors, weighted by size of face on Voronoi tessellation

References:

Ward et al. _PRB_ 2017

__init__(data_source=<matminer.utils.data.MagpieData object>, weight='area', properties=('Electronegativity', ), signed=False)

Initialize the featurizer

Args:
data_source (AbstractData) - Class from which to retrieve

elemental properties

weight (str) - What aspect of each voronoi facet to use to

weigh each neighbor (see VoronoiNN)

properties ([str]) - List of properties to use (default=[‘Electronegativity’])

signed (bool) - whether to return absolute difference or signed difference of properties (default=False (absolute difference))

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc, idx)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

static from_preset(preset)

Create a new LocalPropertyDifference class according to a preset

Args:

preset (str) - Name of preset

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.OPSiteFingerprint(target_motifs=None, dr=0.1, ddr=0.01, ndr=1, dop=0.001, dist_exp=2, zero_ops=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Local structure order parameters computed from a site’s neighbor env.

For each order parameter, we determine the neighbor shell that complies with the expected coordination number. For example, we find the 4 nearest neighbors for the tetrahedral OP, the 6 nearest for the octahedral OP, and the 8 nearest neighbors for the bcc OP. If we don’t find such a shell, the OP is either set to zero or evaluated with the shell of the next largest observed coordination number.

Args:

target_motifs (dict): target op or motif type where keys

are corresponding coordination numbers (e.g., {4: “tetrahedral”}).

dr (float): width for binning neighbors in unit of relative

distances (= distance/nearest neighbor distance). The binning is necessary to make the neighbor-finding step robust against small numerical variations in neighbor distances (default: 0.1).

ddr (float): variation of width for finding stable OP values.

ndr (int): number of width variations for each variation direction (e.g., ndr = 0 only uses the input dr, whereas ndr = 1 tests dr = dr - ddr, dr, and dr + ddr).

dop (float): binning width to compute histogram for each OP

if ndr > 0.

dist_exp (int): exponent for distance factor to multiply

order parameters with that penalizes (large) variations in distances in a given motif. 0 will switch the option off (default: 2).

zero_ops (boolean): set an OP to zero if there is no neighbor

shell that complies with the expected coordination number of a given OP (e.g., CN=4 for tetrahedron; default: True).
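The role of dr, ddr and ndr can be illustrated with a small sketch that lists the bin widths tested around dr. The function name width_variations is hypothetical, and the extension to ndr > 1 in steps of ddr is an assumption based on the ndr = 0 and ndr = 1 cases described above.

```python
def width_variations(dr=0.1, ddr=0.01, ndr=1):
    """Bin widths tested around dr: dr - ndr*ddr, ..., dr, ..., dr + ndr*ddr."""
    return [dr + i * ddr for i in range(-ndr, ndr + 1)]


widths_default = width_variations()      # three widths centered on 0.1
widths_single = width_variations(ndr=0)  # only the input dr
```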

__init__(target_motifs=None, dr=0.1, ddr=0.01, ndr=1, dop=0.001, dist_exp=2, zero_ops=True)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get OP fingerprint of site with given index in input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure.

Returns:

opvals (numpy array): order parameters of target site.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.SOAP(**kwargs)

Bases: matminer.featurizers.base.BaseFeaturizer

Smooth overlap of atomic positions (interface via DScribe).

Class for generating a partial power spectrum from Smooth Overlap of Atomic Positions (SOAP). This implementation uses real (tesseral) spherical harmonics as the angular basis set and provides two orthonormalized alternatives for the radial basis functions: spherical primitive gaussian type orbitals (“gto”) or the polynomial basis set (“polynomial”). By default the faster gto-basis is used. Please see the DScribe SOAP documentation for more details.

Note that SOAP is only featurized for elements identified by “fit” (see following), thus “fit” must be called before “featurize”, or else an error will be raised.

Based originally on the following publications:

“On representing chemical environments”, Albert P. Bartók, Risi

Kondor, and Gábor Csányi, Phys. Rev. B 87, 184115, (2013), https://doi.org/10.1103/PhysRevB.87.184115

“Comparing molecules and solids across structural and alchemical

space”, Sandip De, Albert P. Bartók, Gábor Csányi and Michele Ceriotti, Phys. Chem. Chem. Phys. 18, 13754 (2016), https://doi.org/10.1039/c6cp00415f

Implementation (and some documentation) originally based on DScribe: https://github.com/SINGROUP/dscribe.

“DScribe: Library of descriptors for machine learning in materials science”,

Himanen, L., Jäger, M. O. J., Morooka, E. V., Federici Canova, F., Ranawat, Y. S., Gao, D. Z., Rinke, P. and Foster, A. S. Computer Physics Communications, 106949 (2019), https://doi.org/10.1016/j.cpc.2019.106949

Args:
rcut (float): A cutoff for local region in angstroms. Should be

bigger than 1 angstrom.

nmax (int): The number of radial basis functions.

lmax (int): The maximum degree of spherical harmonics.

sigma (float): The standard deviation of the gaussians used to expand the atomic density.

rbf (str): The radial basis functions to use. The available options are:

  • “gto”: Spherical gaussian type orbitals defined as g_{nl}(r) = \sum_{n'=1}^{n_\mathrm{max}}\,\beta_{nn'l} r^l e^{-\alpha_{n'l}r^2}

  • “polynomial”: Polynomial basis defined as g_{n}(r) = \sum_{n'=1}^{n_\mathrm{max}}\,\beta_{nn'} (r-r_\mathrm{cut})^{n'+2}

periodic (bool): Determines whether the system is considered to be

periodic.

crossover (bool): Determines if crossover of atomic types should

be included in the power spectrum. If enabled, the power spectrum is calculated over all unique species combinations Z and Z’. If disabled, the power spectrum does not contain cross-species information and is only run over each unique species Z. Turned on by default to correspond to the original definition.
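To make the effect of crossover concrete, the sketch below enumerates which species pairs would contribute blocks to the power spectrum. The helper name species_blocks is hypothetical, and treating (Z, Z’) and (Z’, Z) as a single unordered pair is an assumption; it only counts blocks and does not compute any SOAP features.

```python
from itertools import combinations_with_replacement


def species_blocks(species, crossover=True):
    """List the species pairs (Z, Z') that contribute power-spectrum blocks.

    Assumption: pairs are unordered, so (Z, Z') and (Z', Z) count once.
    """
    unique = sorted(set(species))
    if crossover:
        return list(combinations_with_replacement(unique, 2))
    return [(z, z) for z in unique]


with_cross = species_blocks(["O", "Ti", "Ba"])
without_cross = species_blocks(["O", "Ti", "Ba"], crossover=False)
```

For n species this gives n(n+1)/2 blocks with crossover and n blocks without it, which is why disabling crossover shortens the feature vector for chemically diverse data sets.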

__init__(rcut, nmax, lmax, sigma, periodic, rbf='gto', crossover=True)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

fit(X, y=None)

Fit the SOAP featurizer to a dataframe.

Args:

X ([SiteCollection]): For example, a list of pymatgen Structures.

y : unused (added for consistency with overridden method signature)

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.SiteElementalProperty(data_source=None, properties='Number')

Bases: matminer.featurizers.base.BaseFeaturizer

Elemental properties of atom on a certain site

Features:

site [property] - Elemental property for this site

References:

Seko et al., _PRB_ (2017)

Schmidt et al., _Chem Mater_. (2017)

__init__(data_source=None, properties='Number')

Initialize the featurizer

Args:

data_source (AbstractData): Tool used to look up elemental properties

properties ([string]): List of properties to use for features

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc, idx)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

static from_preset(preset)

Create the class with pre-defined settings

Args:

preset (string): Desired preset

Returns:

SiteElementalProperty initialized with desired settings

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.site.VoronoiFingerprint(cutoff=6.5, use_symm_weights=False, symm_weights='solid_angle', stats_vol=None, stats_area=None, stats_dist=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Voronoi tessellation-based features around target site.

Calculate the following sets of features based on Voronoi tessellation analysis around the target site:

Voronoi indices

n_i denotes the number of i-edged facets, and i is in the range of 3-10. e.g. for bcc lattice, the Voronoi indices are [0,6,0,8,…]; for fcc/hcp lattice, the Voronoi indices are [0,12,0,0,…]; for icosahedra, the Voronoi indices are [0,0,12,0,…];

i-fold symmetry indices

computed as n_i/sum(n_i), and i is in the range of 3-10. These reflect the strength of i-fold symmetry in local sites. e.g. for bcc lattice, the i-fold symmetry indices are [0,6/14,0,8/14,…], indicating both 4-fold and a stronger 6-fold symmetries are present; for fcc/hcp lattice, the i-fold symmetry indices are [0,1,0,0,…], indicating only 4-fold symmetry is present; for icosahedra, the i-fold symmetry indices are [0,0,1,0,…], indicating only 5-fold symmetry is present;

Weighted i-fold symmetry indices

if use_symm_weights = True

Voronoi volume

total volume of the Voronoi polyhedron around the target site

Voronoi volume statistics of sub_polyhedra formed by each facet + center

stats_vol = [‘mean’, ‘std_dev’, ‘minimum’, ‘maximum’]

Voronoi area

total area of the Voronoi polyhedron around the target site

Voronoi area statistics of the facets

stats_area = [‘mean’, ‘std_dev’, ‘minimum’, ‘maximum’]

Voronoi nearest-neighboring distance statistics

stats_dist = [‘mean’, ‘std_dev’, ‘minimum’, ‘maximum’]
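The relation between Voronoi indices and i-fold symmetry indices (n_i/sum(n_i)) can be checked with a short sketch. The function name symmetry_indices is hypothetical; the input is the bcc example given in the feature list, written out for i = 3 to 10.

```python
def symmetry_indices(voronoi_indices):
    """Normalize Voronoi indices n_i (i = 3..10) to n_i / sum(n_i)."""
    total = sum(voronoi_indices)
    return [n / total for n in voronoi_indices]


# bcc example from the text: 6 four-edged and 8 six-edged facets.
bcc = symmetry_indices([0, 6, 0, 8, 0, 0, 0, 0])
```

The result has 6/14 in the 4-fold position and 8/14 in the 6-fold position, and the entries sum to one.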

Args:
cutoff (float): cutoff distance in determining the potential

neighbors for Voronoi tessellation analysis. (default: 6.5)

use_symm_weights(bool): whether to use weights to derive weighted

i-fold symmetry indices.

symm_weights(str): weights to be used in weighted i-fold symmetry

indices. Supported options: ‘solid_angle’, ‘area’, ‘volume’, ‘face_dist’. (default: ‘solid_angle’)

stats_vol (list of str): volume statistics types.

stats_area (list of str): area statistics types.

stats_dist (list of str): neighboring distance statistics types.

__init__(cutoff=6.5, use_symm_weights=False, symm_weights='solid_angle', stats_vol=None, stats_area=None, stats_dist=None)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct, idx)

Get Voronoi fingerprints of site with given index in input structure.

Args:

struct (Structure): Pymatgen Structure object.

idx (int): index of target site in structure.

Returns:
(list of floats): Voronoi fingerprints.

  • Voronoi indices

  • i-fold symmetry indices

  • weighted i-fold symmetry indices (if use_symm_weights = True)

  • Voronoi volume

  • Voronoi volume statistics

  • Voronoi area

  • Voronoi area statistics

  • Voronoi dist statistics

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

matminer.featurizers.site.get_wigner_coeffs(l)

Get the list of non-zero Wigner 3j triplets

Args:

l (int): Desired l

Returns:
List of tuples that contain:
  • ((int)) m coordinates of the triplet

  • (float) Wigner coefficient

matminer.featurizers.structure module

class matminer.featurizers.structure.BagofBonds(coulomb_matrix=SineCoulombMatrix(flatten=False), token=' - ')

Bases: matminer.featurizers.base.BaseFeaturizer

Compute a Bag of Bonds vector, as first described by Hansen et al. (2015).

The Bag of Bonds approach is based on creating an even-length vector from a Coulomb matrix output. Practically, it represents the Coulombic interactions between each possible set of sites in a structure as a vector.

BagofBonds must be fit to an iterable of structures using the “fit” method before featurization can occur. This is because the bags and the maximum lengths of each bag must be set prior to featurization. We recommend fitting and featurizing on the same data to maintain consistency between generated feature sets. This can be done using the fit_transform method (for lists of structures) or the fit_featurize_dataframe method (for dataframes).

BagofBonds is based on a method by Hansen et al., “Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space” (2015).

Args:
coulomb_matrix (BaseFeaturizer): A featurizer object containing a

“featurize” method which returns a matrix of size nsites x nsites. Good choices are CoulombMatrix() or SineCoulombMatrix(), with the flatten=False parameter set.

token (str): The string used to separate species in a bond, including

spaces. The token must contain at least one space and cannot have alphabetic characters in it, and should be padded by spaces. For example, for the bond Cs+ - Cl-, the token is ‘ - ‘. This determines how bonds are represented in the dataframe.
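The token rules stated above (at least one space, no alphabetic characters) can be expressed as a small validator. Both helper names below are hypothetical and this is only a sketch of the stated rules, not the checks matminer performs internally.

```python
def check_token(token):
    """Validate a bond token using the rules stated in the docstring."""
    if " " not in token:
        raise ValueError("token must contain at least one space")
    if any(ch.isalpha() for ch in token):
        raise ValueError("token cannot contain alphabetic characters")
    return token


def split_bond(bond, token=" - "):
    """Split a bond label such as 'Cs+ - Cl-' into its two species."""
    return tuple(bond.split(check_token(token)))


species_pair = split_bond("Cs+ - Cl-")
```

Padding the separator with spaces presumably keeps it distinct from the minus sign in charged species such as Cl-, so the label splits cleanly into two species.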

__init__(coulomb_matrix=SineCoulombMatrix(flatten=False), token=' - ')

Initialize self. See help(type(self)) for accurate signature.

bag(s, return_baglens=False)

Convert a structure into a bag of bonds, where each bag has no padded zeros. Using this function will give the ‘raw’ bags, which when concatenated, will have different lengths.

Args:
s (Structure): A pymatgen Structure or IStructure object. May also

work with a

return_baglens (bool): If True, returns the bag of bonds as

a dictionary with the number of bonds as values in place of the vectors of Coulomb matrix values. If False, calculates Coulomb matrix values and returns ‘raw’ bags.

Returns:
(dict) A bag of bonds, where the keys are sorted tuples of pymatgen

Site objects representing bonds or sites, and the values are the Coulomb matrix values for that bag.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Featurizes a structure according to the bag of bonds method. Specifically, each structure is first bagged by flattening the Coulomb matrix for the structure. Then, it is zero-padded according to the maximum number of bonds in each bag, for the set of bags that BagofBonds was fit with.

Args:

s (Structure): A pymatgen structure object

Returns:

(list): The Bag of Bonds vector for the input structure
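The bagging and zero-padding steps described for featurize can be sketched as follows. This is an illustration under several assumptions, not the matminer implementation: species are given as plain strings, the matrix is a nested list, bags are concatenated in alphabetical order of their names, and values inside a bag are sorted in descending order. The names raw_bags and padded_vector and all numbers are hypothetical.

```python
from itertools import combinations


def raw_bags(species, matrix, token=" - "):
    """Group matrix entries into bags keyed by site species or species pair."""
    bags = {}
    for i, sp in enumerate(species):
        bags.setdefault(sp, []).append(matrix[i][i])  # diagonal: single sites
    for i, j in combinations(range(len(species)), 2):
        key = token.join(sorted((species[i], species[j])))
        bags.setdefault(key, []).append(matrix[i][j])  # off-diagonal: bonds
    return bags


def padded_vector(bags, bag_lengths):
    """Concatenate bags in a fixed order, zero-padding each to its fit length."""
    vector = []
    for key in sorted(bag_lengths):
        values = sorted(bags.get(key, []), reverse=True)
        vector += values + [0.0] * (bag_lengths[key] - len(values))
    return vector


# Hypothetical 3-site structure and symmetric Coulomb-like matrix.
species = ["Cs", "Cl", "Cl"]
matrix = [[9.0, 2.0, 1.5], [2.0, 4.0, 0.5], [1.5, 0.5, 4.0]]
bags = raw_bags(species, matrix)
# Bag lengths as they might be learned in fit from a larger data set.
lengths = {"Cl": 2, "Cs": 1, "Cl - Cs": 3, "Cl - Cl": 1}
vector = padded_vector(bags, lengths)
```

Because the bag lengths come from the fitted data set rather than from the single structure, every structure yields a vector of the same length, which is why fit has to run before featurize.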

fit(X, y=None)

Define the bags using a list of structures.

Both the names of the bags (e.g., Cs-Cl) and the maximum lengths of the bags are set with fit.

Args:
X (Series/list): An iterable of pymatgen Structure

objects which will be used to determine the allowed bond types and bag lengths.

y : unused (added for consistency with overridden method signature)

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.BondFractions(nn=<pymatgen.analysis.local_env.CrystalNN object>, bbv=0, no_oxi=False, approx_bonds=False, token=' - ', allowed_bonds=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the fraction of each bond in a structure, based on NearestNeighbors.

For example, in a structure with 2 Li-O bonds and 3 Li-P bonds:

Li-O: 0.4

Li-P: 0.6
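The example above (2 Li-O bonds and 3 Li-P bonds) amounts to counting bond labels and normalizing. The following sketch uses a hypothetical helper, bond_fractions, that takes an already enumerated list of bond labels; finding the bonds with a NearestNeighbors object is not shown.

```python
from collections import Counter


def bond_fractions(bonds):
    """Return the fraction of each bond type in a list of bond labels."""
    counts = Counter(bonds)
    total = sum(counts.values())
    return {bond: n / total for bond, n in counts.items()}


fractions = bond_fractions(["Li - O"] * 2 + ["Li - P"] * 3)
```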

Features:

BondFractions must be fit with iterable of structures before featurization in order to define the allowed bond types (features). To do this, pass a list of allowed_bonds. Otherwise, fit based on a list of structures. If allowed_bonds is defined and BondFractions is also fit, the intersection of the two lists of possible bonds is used.

For dataframes containing structures of various compositions, a unified dataframe is returned which has the collection of all possible bond types gathered from all structures as columns. To approximate bonds based on chemical rules (i.e., for a structure which you’d like to featurize but has bonds not in the allowed set), use approx_bonds = True.

BondFractions is based on the “sum over bonds” in the Bag of Bonds approach, based on a method by Hansen et al., “Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space” (2015).

Args:
nn (NearestNeighbors): A Pymatgen nearest neighbors derived object. For

example, pymatgen.analysis.local_env.VoronoiNN().

bbv (float): The ‘bad bond values’, values substituted for

structure-bond combinations which can not physically exist, but exist in the unified dataframe. For example, if a dataframe contains structures of BaLiP and BaTiO3, determines the value to place in the Li-P column for the BaTiO3 row; by default, is 0.

no_oxi (bool): If True, the featurizer will be agnostic to oxidation

states, which prevents oxidation states from differentiating bonds. For example, if True, Ca - O is identical to Ca2+ - O2-, Ca3+ - O-, etc., and all of them will be included in Ca - O column.

approx_bonds (bool): If True, approximates the fractions of bonds not

in allowed_bonds (forbidden bonds) with similar allowed bonds. Chemical rules are used to determine which bonds are most ‘similar’; particularly, the Euclidean distance between the 2-tuples of the bonds in Mendeleev no. space is minimized for the approximate bond chosen.

token (str): The string used to separate species in a bond, including

spaces. The token must contain at least one space and cannot have alphabetic characters in it, and should be padded by spaces. For example, for the bond Cs+ - Cl-, the token is ‘ - ‘. This determines how bonds are represented in the dataframe.

allowed_bonds ([str]): A listlike object containing bond types as

strings. For example, Cs - Cl, or Li+ - O2-. Ions and elements will still have distinct bonds if (1) the bonds list originally contained them and (2) no_oxi is False. These must match the token specified.
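The effect of no_oxi on bond labels can be illustrated with a small string-based sketch. It assumes that oxidation states appear as an optional number followed by a sign at the end of each species label (as in Ca2+ or O2-); the helper names are hypothetical and matminer may handle species objects differently.

```python
import re


def strip_oxidation_state(species):
    """Drop a trailing charge such as '2+' or '-' from a species label."""
    return re.sub(r"\d*[+-]$", "", species)


def neutral_bond(bond, token=" - "):
    """Rewrite a bond label so that it ignores oxidation states."""
    return token.join(strip_oxidation_state(s) for s in bond.split(token))


labels = [neutral_bond(b) for b in ["Ca2+ - O2-", "Ca3+ - O-", "Ca - O"]]
```

All three labels collapse to the same column name, which mirrors the description of no_oxi=True grouping differently charged bonds into a single Ca - O column.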

__init__(nn=<pymatgen.analysis.local_env.CrystalNN object>, bbv=0, no_oxi=False, approx_bonds=False, token=' - ', allowed_bonds=None)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

enumerate_all_bonds(structures)

Identify all the unique, possible bond types of all structures present, and create the ‘unified’ bonds list.

Args:

structures (list/ndarray): List of pymatgen Structures

Returns:

A tuple of unique, possible bond types for an entire list of structures. This tuple is used to form the unified feature labels.

enumerate_bonds(s)

Lists out all the bond possibilities in a single structure.

Args:

s (Structure): A pymatgen structure

Returns:

A list of bond types in ‘Li-O’ form, where the order of the elements in each bond type is alphabetic.

feature_labels()

Returns the list of allowed bonds. Throws an error if the featurizer has not been fit.

featurize(s)

Quantify the fractions of each bond type in a structure.

For collections of structures, bond types which are not found in a particular structure (e.g., Li-P in BaTiO3) are assigned the bad bond value bbv (0 by default).

Args:

s (Structure): A pymatgen Structure object

Returns:
(list) The feature list of bond fractions, in the order of the

alphabetized corresponding bond names.

fit(X, y=None)

Define the bond types allowed to be returned during each featurization. Bonds found during featurization which are not allowed will be omitted from the returned dataframe or matrix.

Fit BondFractions by either passing an iterable of structures as X or by defining the bonds explicitly with allowed_bonds in __init__.

Args:
X (Series/list): An iterable of pymatgen Structure

objects which will be used to determine the allowed bond types.

y : unused (added for consistency with overridden method signature)

Returns:

self

static from_preset(preset, **kwargs)

Use one of the standard instances of a given NearNeighbor class. Pass args to __init__, such as allowed_bonds, using this method as well.

Args:

preset (str): preset type (“CrystalNN”, “VoronoiNN”, “JmolNN”, “MinimumDistanceNN”, “MinimumOKeeffeNN”, or “MinimumVIRENN”).

Returns:

BondFractions from a preset.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.CGCNNFeaturizer(**kwargs)

Bases: matminer.featurizers.base.BaseFeaturizer

Features generated by training a Crystal Graph Convolutional Neural Network (CGCNN) model.

This featurizer requires a CGCNN model that can either be:
  1. from a pretrained model, currently only supports the models from the CGCNN repo (12/10/18): https://github.com/txie-93/cgcnn;

  2. train a CGCNN model based on the X (structures) and y (target) from fresh start;

  3. similar to 2), but train a model from a warm_start model that can either be a pretrained model or saved checkpoints.

Please see the fit function for more details.

After obtaining a CGCNN model, we will featurize the structures by taking the crystal feature vector obtained after pooling as the features.

This featurizer requires installing cgcnn and torch. We wrap and refactor some of the classes and functions from the original cgcnn to make them work better for matminer. Please also see utils/cgcnn for more details.

Features:
  • Features for the structures extracted from CGCNN model after pooling.

__init__(task='classification', atom_init_fea=None, pretrained_name=None, warm_start_file=None, warm_start_latest=False, save_model_to_dir=None, save_checkpoint_to_dir=None, checkpoint_interval=100, del_checkpoint=True, **cgcnn_kwargs)
Args:
task (str):

Task type, “classification” or “regression”.

atom_init_fea (dict):

A dict of {atom type: atom feature}. If not provided, will use the default atom features from the CGCNN repo.

pretrained_name (str):

CGCNN pretrained model name, if None don’t use pre-trained model

warm_start_file (str):

The warm start model file, if None, don’t warm start.

warm_start_latest(bool):

Warm start from the latest model or best model. This option exists because we customize our checkpoints to contain both the best model and the latest model. If the warm start model does not contain these two options, the state_dict given in the model/checkpoints is used to warm start.

save_model_to_dir (str):

Whether to save the best model to disk, if None, don’t save, otherwise, save the best model to ‘save_model_to_dir’ path.

save_checkpoint_to_dir (str):

Whether to save checkpoints during training, if None, don’t save, otherwise, save them to ‘save_checkpoint_to_dir’ path.

checkpoint_interval (int):

Save checkpoint every n epochs if save_checkpoint_to_dir is not None. If the number of epochs is less than checkpoint_interval, checkpoint_interval is reset to int(epochs/2).

del_checkpoint (bool):

Whether to delete checkpoints if training ends successfully.

**cgcnn_kwargs (optional): settings of CGCNN, containing:
CrystalGraphConvNet model kwargs:
-atom_fea_len (int): Number of hidden atom features in conv layers, default 64.

-n_conv (int): Number of conv layers, default 3.

-h_fea_len (int): Number of hidden features after pooling, default 128.

-n_epochs (int): Number of total epochs to run, default 30.

-print_freq (int): Print frequency, default 10.

-test (bool): Whether to save test predictions

-task (str): “classification” or “regression”, default “classification”.

Dataset (CIFDataWrapper) kwargs:
-max_num_nbr (int): The maximum number of neighbors while

constructing the crystal graph, default 12

-radius (float): The cutoff radius for searching neighbors,

default 8

-dmin (float): The minimum distance for constructing

GaussianDistance, default 0

-step (float): The step size for constructing

GaussianDistance, default 0.2

-random_seed (int): Random seed for shuffling the dataset,

default 123

DataLoader kwargs:

-batch_size (int): Mini-batch size, default 256

-num_workers (int): Number of data loading workers, default 0

-train_size (int): Number of training data to be loaded, default none

-val_size (int): Number of validation data to be loaded, default 1000

-test_size (int): Number of test data to be loaded, default 1000

-return_test (bool): Whether to return the test dataset loader. default True

Optimizer kwargs:
-optim (str): Choose an optimizer, “SGD” or “Adam”,

default “SGD”.

-lr (float): Initial learning rate, default 0.01

-momentum (float): Momentum, default 0.9

-weight_decay (float): Weight decay (default: 0)

Scheduler MultiStepLR kwargs:
-gamma (float): Multiplicative factor of learning rate

decay, default: 0.1.

-lr_milestones (list): List of epoch indices.

Must be increasing.

These input cgcnn_kwargs will be processed and grouped in _initialize_kwargs.
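Two of the settings above follow simple rules that can be written down directly. The sketch below shows the learning rate produced by a multi-step schedule (decay by gamma at each milestone, which is the closed form used by PyTorch’s MultiStepLR) and the checkpoint_interval reset rule described for __init__. Both function names are hypothetical and the code does not touch cgcnn or torch.

```python
from bisect import bisect_right


def multistep_lr(epoch, lr_milestones, lr=0.01, gamma=0.1):
    """Learning rate after decaying by gamma at each milestone reached."""
    return lr * gamma ** bisect_right(sorted(lr_milestones), epoch)


def effective_checkpoint_interval(n_epochs=30, checkpoint_interval=100):
    """Apply the reset rule described for checkpoint_interval."""
    if n_epochs < checkpoint_interval:
        return int(n_epochs / 2)
    return checkpoint_interval


rates = [multistep_lr(e, lr_milestones=(10, 20)) for e in (0, 10, 25)]
interval = effective_checkpoint_interval()
```

With the documented defaults (30 epochs, checkpoint_interval of 100), the reset rule would give an interval of 15 epochs.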

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc)

Get the feature vector after pooling layer of the CGCNN model obtained from fit.

Args:

strc (Structure): Structure object

Returns:

Features extracted after the pooling layer in CGCNN model

fit(X, y)

Get a CGCNN model that can either be:
  1. from a pretrained model, currently only supports the models from the CGCNN repo;

  2. train a CGCNN model based on the X (structures) and y (target) from fresh start;

  3. similar to 2), but train a model from a warm_start model that can either be a pretrained model or saved checkpoints.

Note that to use CGCNNFeaturizer, a target y is needed!

Args:

X (Series/list):

An iterable of pymatgen Structure objects.

y (Series/list):

Target property that CGCNN is designed to predict.

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

property latest_model

Get the latest model

property model

Get the best model

class matminer.featurizers.structure.ChemicalOrdering(shells=(1, 2, 3), weight='area')

Bases: matminer.featurizers.base.BaseFeaturizer

How much the ordering of species in the structure differs from random

These parameters describe how much the ordering of all species in a structure deviates from random using a Warren-Cowley-like ordering parameter. The first step of this calculation is to determine the nearest neighbor shells of each site. Then, for each shell a degree of order for each type is determined by computing:

\alpha (t,s) = 1 - \frac{\sum_n w_n \delta (t - t_n)}{x_t \sum_n w_n}

where w_n is the weight associated with a certain neighbor, t_n is the type of the neighbor, and x_t is the fraction of type t in the structure. For atoms that are randomly dispersed in a structure, this formula yields 0 for all types. For structures where each site is surrounded only by atoms of another type, this formula yields large values of alpha.

The mean absolute value of this parameter across all sites is used as a feature.
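The formula can be sketched for a single site and shell as follows. This is a minimal illustration, not the matminer implementation: the function names are hypothetical, and the neighbor types and weights are assumed to have been determined already (for example from a Voronoi tessellation).

```python
def ordering_parameter(neighbor_types, weights, t, x_t):
    """alpha(t, s) = 1 - sum_n w_n * delta(t - t_n) / (x_t * sum_n w_n) for one site."""
    total_weight = sum(weights)
    # Weight carried by neighbors whose type t_n equals t.
    matching_weight = sum(w for w, t_n in zip(weights, neighbor_types) if t_n == t)
    return 1.0 - matching_weight / (x_t * total_weight)


def mean_abs_ordering(alphas):
    """Mean absolute value of the per-site ordering parameters."""
    return sum(abs(a) for a in alphas) / len(alphas)


# A site in a 50/50 A-B structure whose neighbors mirror the composition.
alpha_random = ordering_parameter(["A", "B", "A", "B"], [1.0] * 4, t="A", x_t=0.5)
# The same site surrounded only by B atoms.
alpha_ordered = ordering_parameter(["B"] * 4, [1.0] * 4, t="A", x_t=0.5)
```

In the first case the neighbor shell matches the overall composition and the parameter vanishes; in the second it does not.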

Features:
mean ordering parameter shell [n] - Mean ordering parameter for atoms in the n-th neighbor shell

References:

Ward et al. _PRB_ 2017

__init__(shells=(1, 2, 3), weight='area')

Initialize the featurizer

Args:

shells ([int]) - Which neighbor shells to evaluate

weight (str) - Attribute used to weigh neighbor contributions

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.CoulombMatrix(diag_elems=True, flatten=True)

Bases: matminer.featurizers.base.BaseFeaturizer

The Coulomb matrix, a representation of nuclear coulombic interaction.

Generate the Coulomb matrix, M, of the input structure (or molecule). The Coulomb matrix was put forward by Rupp et al. (Phys. Rev. Lett. 108, 058301, 2012) and is defined by off-diagonal elements M_ij = Z_i*Z_j/|R_i-R_j| and diagonal elements 0.5*Z_i^2.4, where Z_i and R_i denote the nuclear charge and the position of atom i, respectively.

Coulomb Matrix features are flattened (for ML-readiness) by default. Use fit before featurizing to use flattened features. To return the matrix form, set flatten=False.
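The definition above can be sketched in a few lines of pure Python. This is an illustrative sketch only: the function name is hypothetical, distances are left in whatever unit the positions are given in (any unit conversion is omitted), and periodic images are ignored.

```python
import math


def coulomb_matrix(charges, positions, diag_elems=True):
    """Build M with M_ij = Z_i * Z_j / |R_i - R_j| and M_ii = 0.5 * Z_i ** 2.4."""
    n = len(charges)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term; set to 0 when diag_elems is False.
                m[i][j] = 0.5 * charges[i] ** 2.4 if diag_elems else 0.0
            else:
                m[i][j] = charges[i] * charges[j] / math.dist(positions[i], positions[j])
    return m


# Two hydrogen-like centers (Z = 1) separated by 0.74 distance units.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)])
```

The resulting matrix is symmetric, with the pair repulsion on the off-diagonal and the fitted self-interaction term on the diagonal.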

Args:
diag_elems (bool): flag indicating whether (True, default) to use the original definition of the diagonal elements; if set to False, the diagonal elements are set to 0

flatten (bool): If True, returns a flattened vector based on eigenvalues of the matrix form. Otherwise, returns a matrix object (single feature), which will likely need to be processed further.

__init__(diag_elems=True, flatten=True)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Get Coulomb matrix of input structure.

Args:

s: input Structure (or Molecule) object.

Returns:

m: (Nsites x Nsites matrix) Coulomb matrix.

fit(X, y=None)

Fit the Coulomb Matrix to a list of structures.

Args:

X ([Structure]): A list of pymatgen structures.

y : unused (added for consistency with overridden method signature)

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.DensityFeatures(desired_features=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculates density and density-like features

Features:
  • density

  • volume per atom (“vpa”)

  • packing fraction

__init__(desired_features=None)
Args:
desired_features: [str] - choose from “density”, “vpa”, “packing fraction”
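For orientation, the three quantities can be sketched from basic cell data. This is a simplified illustration under stated assumptions, not the matminer implementation: the function name is hypothetical, atoms are treated as hard spheres with caller-supplied radii, and masses, radii and volume are assumed to be given in amu, Å and Å^3.

```python
import math

# 1 amu per cubic angstrom expressed in g/cm^3.
AMU_PER_A3_TO_G_PER_CM3 = 1.66053906660


def density_features(masses_amu, radii_angstrom, cell_volume_a3):
    """Return density (g/cm^3), volume per atom (A^3) and packing fraction."""
    n_atoms = len(masses_amu)
    density = sum(masses_amu) / cell_volume_a3 * AMU_PER_A3_TO_G_PER_CM3
    vpa = cell_volume_a3 / n_atoms
    # Fraction of the cell volume filled by hard spheres of the given radii.
    sphere_volume = sum(4.0 / 3.0 * math.pi * r ** 3 for r in radii_angstrom)
    return {
        "density": density,
        "vpa": vpa,
        "packing fraction": sphere_volume / cell_volume_a3,
    }


feats = density_features([1.0, 1.0], [0.5, 0.5], 2 * 1.66053906660)
```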

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

precheck(s: pymatgen.core.structure.Structure) → bool

Precheck a single entry. DensityFeatures does not work for disordered structures. To precheck an entire dataframe (and automatically gather the fraction of structures that will pass the precheck), please use precheck_dataframe.

Args:

s (pymatgen.Structure): The structure to precheck.

Returns:

(bool): If True, s passed the precheck; otherwise, it failed.

class matminer.featurizers.structure.Dimensionality(nn_method=<pymatgen.analysis.local_env.CrystalNN object>)

Bases: matminer.featurizers.base.BaseFeaturizer

Returns dimensionality of structure: 1 means linear chains of atoms OR isolated atoms/no bonds, 2 means layered, 3 means 3D connected structure. This feature is sensitive to bond length tables that you use.

__init__(nn_method=<pymatgen.analysis.local_env.CrystalNN object>)
Args:
nn_method: The nearest neighbor method used to determine atomic connectivity.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.ElectronicRadialDistributionFunction(cutoff=None, dr=0.05)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate the inherent electronic radial distribution function (ReDF)

The ReDF is defined according to Willighagen et al., Acta Cryst., 2005, B61, 29-36.

The ReDF is a structure-integral RDF (i.e., summed over all sites) in which the positions of neighboring sites are weighted by electrostatic interactions inferred from atomic partial charges. Atomic charges are obtained from the ValenceIonicRadiusEvaluator class.

Args:
cutoff: (float) distance up to which the ReDF is to be calculated (default: longest diagonal in primitive cell).

dr: (float) width of bins (“x”-axis) of ReDF (default: 0.05 Å).

__init__(cutoff=None, dr=0.05)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Get ReDF of input structure.

Args:

s: input Structure object.

Returns:

(dict) a copy of the electronic radial distribution functions (ReDF) as a dictionary. The distance list (“x”-axis values of ReDF) can be accessed via key ‘distances’; the ReDF itself is accessible via key ‘redf’.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.EwaldEnergy(accuracy=4, per_atom=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the energy from Coulombic interactions.

Note: The energy is computed using _charges already defined for the structure_.

Features:

ewald_energy - Coulomb interaction energy of the structure

__init__(accuracy=4, per_atom=True)
Args:

accuracy (int): Accuracy of Ewald summation, number of decimal places

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc)
Args:

strc (Structure) - Structure being analyzed

Returns:

([float]) - Electrostatic energy of the structure

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.GlobalInstabilityIndex(r_cut=4.0, disordered_pymatgen=False)

Bases: matminer.featurizers.base.BaseFeaturizer

The global instability index of a structure.

The default is to use IUCr 2016 bond valence parameters for computing bond valence sums. If the structure has disordered site occupancies or non-integer valences on sites, pymatgen’s bond valence sum method can be used instead.

Note that pymatgen’s bond valence sum method is prone to error unless the correct scale factor is supplied. A scale factor based on testing with perovskites is used here. TODO: Use scipy to optimize scale factor for minimizing GII

Based on the following publication:

‘Structural characterization of R2BaCuO5 (R = Y, Lu, Yb, Tm, Er, Ho, Dy, Gd, Eu and Sm) oxides by X-ray and neutron diffraction’, A.Salinas-Sanchez, J.L.Garcia-Muñoz, J.Rodriguez-Carvajal, R.Saez-Puche, and J.L.Martinez, Journal of Solid State Chemistry, 100, 201-211 (1992), https://doi.org/10.1016/0022-4596(92)90094-C

Args:

r_cut: Float, how far to search for neighbors when computing bond valences

disordered_pymatgen: Boolean, whether to fall back on pymatgen’s bond valence sum method for disordered structures

Features:
The global instability index is the square root of the mean squared difference between the bond valence sums and the formal valences, averaged over all atoms in the unit cell.
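The definition can be sketched as follows. This is a minimal illustration and not the matminer implementation: the function names are hypothetical, the bond valence expression exp((R0 - d) / B) is the standard bond valence model form and is assumed here, and both the bond valence sums and the formal valences are taken as magnitudes.

```python
import math


def bond_valence(r0, b, dist):
    """Bond valence of a single bond from tabulated parameters R0 and B (assumed form)."""
    return math.exp((r0 - dist) / b)


def global_instability_index(bond_valence_sums, formal_valences):
    """Root mean square deviation of bond valence sums from formal valences."""
    squared = [(bvs - abs(v)) ** 2 for bvs, v in zip(bond_valence_sums, formal_valences)]
    return math.sqrt(sum(squared) / len(squared))


# A bond at exactly R0 carries unit valence.
unit_bond = bond_valence(r0=1.8, b=0.37, dist=1.8)
# Two sites whose bond valence sums deviate by 0.1 from a formal valence of 2.
gii = global_instability_index([2.1, 1.9], [2, 2])
```

A value of zero means every site's bond valence sum matches its formal valence exactly; larger values indicate more strained bonding.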

__init__(r_cut=4.0, disordered_pymatgen=False)

Initialize self. See help(type(self)) for accurate signature.

calc_bv_sum(site_val, site_el, neighbor_list)

Computes bond valence sum for site.

Args:

site_val (Integer): valence of site

site_el (String): element name

neighbor_list (List): List of neighboring sites and their distances

calc_gii_iucr(s)

Computes global instability index using tabulated bv params.

Args:

s: Pymatgen Structure object

Returns:

gii: Float, the global instability index

calc_gii_pymatgen(struct, scale_factor=0.965)

Calculates global instability index using Pymatgen’s bond valence sum.

Args:

struct: Pymatgen Structure object

scale_factor: Float, tunable scale factor for bond valence

Returns:

gii: Float, global instability index

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

static compute_bv(params, dist)

Compute bond valence from parameters.

Args:

params: Dataframe with Ro and B parameters

dist: Float, distance to neighboring atom

Returns:

bv: Float, bond valence

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct)

Get global instability index.

Args:

struct: Pymatgen Structure object

Returns:

[gii]: Length 1 list with float value

get_bv_params(cation, anion, cat_val, an_val)

Look up bond valence parameters from the IUCr table.

Args:

cation: String, cation element

anion: String, anion element

cat_val: Integer, cation formal valence

an_val: Integer, anion formal valence

Returns:

bond_val_list: dataframe of bond valence parameters

get_equiv_sites(s, site)

Find identical sites from analyzing space group symmetry.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

precheck(struct)

Bond valence methods require atom pairs with oxidation states.

Additionally, check if at least the first and last site’s species have an entry in the bond valence parameters.

Args:

struct: Pymatgen Structure

class matminer.featurizers.structure.GlobalSymmetryFeatures(desired_features=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Determines symmetry features, e.g. spacegroup number and crystal system

Features:
  • Spacegroup number

  • Crystal system (1 of 7)

  • Centrosymmetry (has inversion symmetry)

__init__(desired_features=None)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

crystal_idx = {'cubic': 1, 'hexagonal': 2, 'monoclinic': 6, 'orthorhombic': 5, 'tetragonal': 4, 'triclinic': 7, 'trigonal': 3}

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.JarvisCFID(use_cell=True, use_chem=True, use_chg=True, use_rdf=True, use_adf=True, use_ddf=True, use_nn=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Classical Force-Field Inspired Descriptors (CFID) from Jarvis-ML.

Chemo-structural descriptors from five different sub-methods, including pairwise radial, nearest neighbor, bond-angle, dihedral-angle and core-charge distributions. With all descriptors enabled, there are 1,557 features per structure.

Adapted from the nist/jarvis package hosted at: https://github.com/usnistgov/jarvis

Find details at: https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.083801

Args/Features:
use_cell (bool): Use structure cell descriptors (4 features, based on DensityFeatures and log volume per atom).

use_chem (bool): Use chemical composition descriptors (438 features)

use_chg (bool): Use core charge descriptors (378 features)

use_adf (bool): Use angular distribution function (179 features x 2, one set of features for each cutoff).

use_rdf (bool): Use radial distribution function (100 features)

use_ddf (bool): Use dihedral angle distribution function (179 features)

use_nn (bool): Use nearest neighbors (100 descriptors)
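The per-block counts listed above add up to the stated total of 1,557 features. The short sketch below tallies them for any combination of flags; the dictionary and helper name are illustrative and are not part of matminer.

```python
# Feature counts per descriptor block, taken from the argument descriptions.
CFID_BLOCK_SIZES = {
    "use_cell": 4,
    "use_chem": 438,
    "use_chg": 378,
    "use_adf": 179 * 2,  # one set of 179 features for each cutoff
    "use_rdf": 100,
    "use_ddf": 179,
    "use_nn": 100,
}


def n_cfid_features(**flags):
    """Total number of features for the enabled blocks (all enabled by default)."""
    return sum(size for name, size in CFID_BLOCK_SIZES.items() if flags.get(name, True))


total = n_cfid_features()
without_chem = n_cfid_features(use_chem=False)
```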

__init__(use_cell=True, use_chem=True, use_chg=True, use_rdf=True, use_adf=True, use_ddf=True, use_nn=True)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Get chemo-structural CFID descriptors

Args:

s: Structure object

Returns:

(np.ndarray) Final descriptors

get_chem(element)

Get chemical descriptors for an element

Args:

element: element name

Returns:

arr: descriptor array value

get_chg(element)

Get charge descriptors for an element

Args:

element: element name

Returns:

arr: descriptor array values

get_distributions(structure, c_size=10.0, max_cut=5.0)

Get radial and angular distribution functions

Args:

structure: Structure object

c_size: max. cell size

max_cut: max. bond cut-off for angular distribution

Returns:

adfa, adfb, ddf, rdf, bondo:

adfa: Angular distribution up to the first cut-off

adfb: Angular distribution up to the second cut-off

ddf: Dihedral angle distribution up to the first cut-off

rdf: Radial distribution function

bondo: Bond order distribution

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.MaximumPackingEfficiency

Bases: matminer.featurizers.base.BaseFeaturizer

Maximum possible packing efficiency of this structure

Uses a Voronoi tessellation to determine the largest radius each atom can have before any atom touches any of its neighbors. Given the maximum radius size, this class computes the maximum packing efficiency of the structure as a feature.

Features:

max packing efficiency - Maximum possible packing efficiency

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.MinimumRelativeDistances(cutoff=10.0)

Bases: matminer.featurizers.base.BaseFeaturizer

Determines the relative distance of each site to its closest neighbor.

We use the relative distance, f_ij = r_ij / (r^atom_i + r^atom_j), as a measure rather than the absolute distances, r_ij, to account for the fact that different atoms/species have different sizes. The function uses the valence-ionic radius estimator implemented in Pymatgen.

Args:

cutoff: (float) (absolute) distance up to which tentative closest neighbors (on the basis of relative distances) are to be determined.
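The relative-distance measure can be sketched for a finite set of sites as follows. This is an illustration only: the function name is hypothetical, the radii are supplied by the caller rather than estimated, and periodic images are ignored.

```python
import math


def min_relative_distances(positions, radii, cutoff=10.0):
    """For each site i, the smallest f_ij = r_ij / (r_i + r_j) over neighbors within cutoff."""
    result = []
    for i, (p_i, r_i) in enumerate(zip(positions, radii)):
        candidates = []
        for j, (p_j, r_j) in enumerate(zip(positions, radii)):
            if i == j:
                continue
            r_ij = math.dist(p_i, p_j)
            # The cutoff applies to the absolute distance.
            if r_ij <= cutoff:
                candidates.append(r_ij / (r_i + r_j))
        # A site with no neighbor inside the cutoff has no defined value.
        result.append(min(candidates) if candidates else float("nan"))
    return result


sites = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
f_min = min_relative_distances(sites, radii=[1.0, 1.0, 0.5])
```

A value near 1 indicates that two atoms are roughly in contact; larger values indicate looser packing around that site.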

__init__(cutoff=10.0)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s, cutoff=10.0)

Get minimum relative distances of all sites of the input structure.

Args:

s: Pymatgen Structure object.

Returns:
dists_relative_min: (list of floats) list of all minimum relative

distances (i.e., for all sites).

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.OrbitalFieldMatrix(period_tag=False, flatten=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Representation based on the valence shell electrons of neighboring atoms.

Each atom is described by a 32-element vector (or 39-element vector, see period tag for details) uniquely representing the valence subshell. A 32x32 (39x39) matrix is formed by multiplying two atomic vectors. The OFM for an atomic environment is the sum of these matrices over every atom that the center atom coordinates with, each multiplied by a distance function (in this case, 1/r times the weight of the coordinating atom in the Voronoi polyhedra method). The OFM of a structure or molecule is the average of the OFMs for all the sites in the structure.
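The construction can be sketched with short toy vectors in place of the 32-element ones. This is a minimal illustration, not the matminer implementation: the function names are hypothetical, the orientation of the outer product (center vector along rows, neighbor vector along columns) is an assumption, and the Voronoi weights and distances are supplied by the caller.

```python
def single_site_ofm(center_vec, neighbors):
    """Sum of weighted outer products for one site.

    `neighbors` is a list of (neighbor_vec, voronoi_weight, distance) tuples.
    """
    size = len(center_vec)
    ofm = [[0.0] * size for _ in range(size)]
    for vec, weight, dist in neighbors:
        scale = weight / dist  # distance function: 1/r times the Voronoi weight
        for a in range(size):
            for b in range(size):
                ofm[a][b] += center_vec[a] * vec[b] * scale
    return ofm


def mean_ofm(ofms):
    """Element-wise average of the per-site matrices."""
    size = len(ofms[0])
    return [[sum(m[a][b] for m in ofms) / len(ofms) for b in range(size)]
            for a in range(size)]


# Toy 2-element "one-hot" vectors for a center atom with a single neighbor.
site_ofm = single_site_ofm([1, 0], [([0, 1], 0.5, 2.0)])
```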

Args:
period_tag (bool): In the original OFM, an element is represented by a vector of length 32, where each element is 1 or 0, which represents the valence subshell of the element. With period_tag=True, the vector size is increased to 39, where the 7 extra elements represent the period of the element. Note lanthanides are treated as period 6, actinides as period 7. Default False as in the original paper.

flatten (bool): Flatten the avg OFM to a 1024-vector (if period_tag=False) or a 1521-vector (if period_tag=True).

Attributes:

size: Either 32 or 39, the size of the vectors used to describe elements.

Reference:

Pham et al. _Sci Tech Adv Mat_. 2017 <http://dx.doi.org/10.1080/14686996.2017.1378060>

__init__(period_tag=False, flatten=True)

Initialize the featurizer

Args:
period_tag (bool): In the original OFM, an element is represented by a vector of length 32, where each element is 1 or 0, which represents the valence subshell of the element. With period_tag=True, the vector size is increased to 39, where the 7 extra elements represent the period of the element. Note lanthanides are treated as period 6, actinides as period 7. Default False as in the original paper.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Makes a supercell for structure s (to protect sites from coordinating with themselves), and then finds the mean of the orbital field matrices of each site to characterize a structure

Args:

s (Structure): structure to characterize

Returns:
mean_ofm (size X size matrix): orbital field matrix

characterizing s

get_atom_ofms(struct, symm=False)

Calls get_single_ofm for every site in struct. If symm=True, get_single_ofm is called for symmetrically distinct sites, and counts is constructed such that ofms[i] occurs counts[i] times in the structure

Args:

struct (Structure): structure to find OFMs for

symm (bool): whether to calculate OFMs for only symmetrically distinct sites

Returns:

ofms ([size X size matrix] X len(struct)): OFMs for struct

if symm:

ofms ([size X size matrix] X number of symmetrically distinct sites): OFMs for struct

counts: number of identical sites for each OFM

get_mean_ofm(ofms, counts)

Averages a list of ofms, weights by counts

get_ohv(sp, period_tag)

Get the “one-hot-vector” for pymatgen Element sp. This 32 or 39-length vector represents the valence shell of the given element.

Args:

sp (Element): element whose ohv should be returned

period_tag (bool): If true, the vector contains items corresponding to the period of the element

Returns:

my_ohv (numpy array length 39 if period_tag, else 32): ohv for sp

get_single_ofm(site, site_dict)

Gets the orbital field matrix for a single chemical environment, where site is the center atom whose environment is characterized and site_dict is a dictionary of site : weight, where the weights are the Voronoi Polyhedra weights of the corresponding coordinating sites.

Args:

site (Site): center atom

site_dict (dict of Site:float): chemical environment

Returns:

atom_ofm (size X size numpy matrix): ofm for site

get_structure_ofm(struct)

Calls get_mean_ofm on the results of get_atom_ofms to give a size X size matrix characterizing a structure

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.PartialRadialDistributionFunction(cutoff=20.0, bin_size=0.1, include_elems=(), exclude_elems=())

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the partial radial distribution function (PRDF) of a crystal structure

The PRDF of a crystal structure is the radial distribution function broken down for each pair of atom types. The PRDF was proposed as a structural descriptor by Schutt et al. (https://journals.aps.org/prb/abstract/10.1103/PhysRevB.89.205118).

Args:

cutoff: (float) distance up to which to calculate the RDF.

bin_size: (float) size of each bin of the (discrete) RDF.

include_elems: (list of string), list of elements that must be included in PRDF

exclude_elems: (list of string), list of elements that should not be included in PRDF

Features:
Each feature corresponds to the density of number of bonds for a certain pair of elements at a certain range of distances. For example, “Al-Al PRDF r=1.00-1.50” corresponds to the density of Al-Al bonds between 1 and 1.5 distance units. By default, this featurizer generates RDFs for each pair of elements in the training set.
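The pair-resolved binning behind these features can be sketched for a finite set of sites. This is an illustration only: the function name is hypothetical, periodic images are ignored, and the raw pair counts are returned without the density normalization that a full PRDF applies.

```python
import math
from collections import defaultdict
from itertools import combinations


def partial_pair_histograms(species, positions, cutoff=20.0, bin_size=0.1):
    """Count pair distances per element pair, binned by distance."""
    n_bins = math.ceil(cutoff / bin_size)
    hists = defaultdict(lambda: [0] * n_bins)
    for (el_i, p_i), (el_j, p_j) in combinations(zip(species, positions), 2):
        r = math.dist(p_i, p_j)
        if r < cutoff:
            # Sort the pair so that ("Al", "Ni") and ("Ni", "Al") share one key.
            key = tuple(sorted((el_i, el_j)))
            hists[key][min(int(r / bin_size), n_bins - 1)] += 1
    return dict(hists)


hists = partial_pair_histograms(
    ["Al", "Al", "Ni"],
    [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 2.6, 0.0)],
    cutoff=3.0,
    bin_size=1.0,
)
```

Each key corresponds to one element pair and each list entry to one distance bin, which mirrors how the feature labels combine a pair name with a distance range.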

__init__(cutoff=20.0, bin_size=0.1, include_elems=(), exclude_elems=())

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

compute_prdf(s)

Compute the PRDF for a structure

Args:

s: (Structure), structure to be evaluated

Returns:

dist_bins - float, start of each of the bins

prdf - dict, where each key is a pair of elements (strings), and the value is the radial distribution function for that pair of elements

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Get PRDF of the input structure.

Args:

s: Pymatgen Structure object.

Returns:
prdf, dist: (tuple of arrays) the first element is a dictionary where keys are tuples of element names and values are PRDFs.

fit(X, y=None)

Define the list of elements to be included in the PRDF. By default, the PRDF will include all of the elements in X

Args:
X: (numpy array nx1) structures used in the training set. Each entry must be a Pymatgen Structure object.

y: Not used

fit_kwargs: not used

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.RadialDistributionFunction(cutoff=20.0, bin_size=0.1)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate the radial distribution function (RDF) of a crystal structure.

Features:
  • Radial distribution function

Args:

cutoff: (float) distance up to which to calculate the RDF.

bin_size: (float) size of each bin of the (discrete) RDF.
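How cutoff and bin_size shape the output can be sketched for a finite set of sites. This is an illustration under stated assumptions, not the matminer implementation: the function name is hypothetical, periodic images are ignored, and counts are normalized only by the shell volume and the number of sites.

```python
import math


def radial_distribution(positions, cutoff=20.0, bin_size=0.1):
    """Return (rdf, dist), where dist holds the inner radius of each bin."""
    n_bins = math.ceil(cutoff / bin_size)
    counts = [0] * n_bins
    for i, p_i in enumerate(positions):
        for p_j in positions[i + 1:]:
            r = math.dist(p_i, p_j)
            if r < cutoff:
                # Each pair is seen once from either site.
                counts[min(int(r / bin_size), n_bins - 1)] += 2
    dist = [k * bin_size for k in range(n_bins)]
    rdf = []
    for inner, count in zip(dist, counts):
        outer = inner + bin_size
        shell_volume = 4.0 / 3.0 * math.pi * (outer ** 3 - inner ** 3)
        rdf.append(count / (len(positions) * shell_volume))
    return rdf, dist


rdf, dist = radial_distribution([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], cutoff=3.0, bin_size=1.0)
```

The number of bins is cutoff / bin_size rounded up, so halving bin_size doubles the length of the returned arrays.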

__init__(cutoff=20.0, bin_size=0.1)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Get RDF of the input structure.

Args:

s (Structure): Pymatgen Structure object.

Returns:
rdf, dist: (tuple of arrays) the first element is the normalized RDF, whereas the second element is the inner radius of the RDF bin.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.SineCoulombMatrix(diag_elems=True, flatten=True)

Bases: matminer.featurizers.base.BaseFeaturizer

A variant of the Coulomb matrix developed for periodic crystals.

This function generates a variant of the Coulomb matrix developed for periodic crystals by Faber et al. (Inter. J. Quantum Chem. 115, 16, 2015). It is identical to the Coulomb matrix, except that the inverse distance function is replaced by the inverse of a sin**2 function of the vector between the sites which is periodic in the dimensions of the structure lattice. See paper for details.

Coulomb Matrix features are flattened (for ML-readiness) by default. Use fit before featurizing to use flattened features. To return the matrix form, set flatten=False.

Args:
diag_elems (bool): flag indicating whether (True, default) to use the original definition of the diagonal elements; if set to False, the diagonal elements are set to 0

flatten (bool): If True, returns a flattened vector based on eigenvalues of the matrix form. Otherwise, returns a matrix object (single feature), which will likely need to be processed further.

__init__(diag_elems=True, flatten=True)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)
Args:

s (Structure or Molecule): input structure (or molecule)

Returns:

(Nsites x Nsites matrix) Sine matrix or, if flatten=True, a flattened vector based on its eigenvalues.

fit(X, y=None)

Fit the Sine Coulomb Matrix to a list of structures.

Args:

X ([Structure]): A list of pymatgen structures.

y : unused (added for consistency with overridden method signature)

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.SiteStatsFingerprint(site_featurizer, stats=('mean', 'std_dev'), min_oxi=None, max_oxi=None, covariance=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Computes statistics of properties across all sites in a structure.

This featurizer first uses a site featurizer class (see site.py for options) to compute features of each site in a structure, and then computes features of the entire structure by measuring statistics of each attribute. It can optionally compute the statistics of only those sites whose oxidation states fall in a certain range (e.g., only anions).

Features:
  • Returns each statistic of each site feature
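The two-step idea (per-site features first, then statistics over sites) can be sketched as follows. This is a minimal illustration: the function name is hypothetical, only the two default statistics are handled, the per-site feature vectors are supplied directly instead of coming from a site featurizer, and the population standard deviation is an assumption.

```python
import math


def site_stats(site_features, stats=("mean", "std_dev")):
    """Collapse per-site feature vectors into structure-level statistics."""
    n_features = len(site_features[0])
    out = []
    for k in range(n_features):
        column = [row[k] for row in site_features]
        mean = sum(column) / len(column)
        for stat in stats:
            if stat == "mean":
                out.append(mean)
            elif stat == "std_dev":
                variance = sum((x - mean) ** 2 for x in column) / len(column)
                out.append(math.sqrt(variance))
            else:
                raise ValueError(f"unsupported statistic in this sketch: {stat}")
    return out


# Two sites, each described by two site features.
fingerprint = site_stats([[1.0, 10.0], [3.0, 10.0]])
```

The output length is the number of site features times the number of statistics, ordered feature by feature.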

__init__(site_featurizer, stats='mean', 'std_dev', min_oxi=None, max_oxi=None, covariance=False)
Args:

site_featurizer (BaseFeaturizer): a site-based featurizer

stats ([str]): list of weighted statistics to compute for each feature. If stats is None, a list is returned for each feature that contains the calculated feature for each site in the structure. Note: for the nth mode, stat must be ‘n*_mode’; e.g. stat=‘2nd_mode’

min_oxi (int): minimum site oxidation state for inclusion (e.g., zero means metals/cations only)

max_oxi (int): maximum site oxidation state for inclusion

covariance (bool): whether to compute the covariance of site features
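The aggregation this class performs can be sketched in plain Python. This is a minimal illustration rather than matminer's implementation: the helper name `site_stats`, the input layout (one feature list and one oxidation state per site), the inclusive oxidation-state bounds, the use of the population standard deviation, and the feature-major output order are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def site_stats(site_features, oxi_states, stats=("mean", "std_dev"),
               min_oxi=None, max_oxi=None):
    """Aggregate per-site feature vectors into structure-level statistics.

    Hypothetical helper: the name and input layout are illustrative only.
    """
    # Keep only sites whose oxidation state lies within the bounds
    # (bounds are assumed to be inclusive here).
    kept = [
        feats for feats, oxi in zip(site_features, oxi_states)
        if (min_oxi is None or oxi >= min_oxi)
        and (max_oxi is None or oxi <= max_oxi)
    ]
    if not kept:
        raise ValueError("no sites within the requested oxidation-state range")
    reducers = {"mean": fmean, "std_dev": pstdev}
    # Transpose so each column holds one feature across all kept sites.
    columns = list(zip(*kept))
    # One value per (feature, statistic) pair, in feature-major order.
    return [reducers[stat](col) for col in columns for stat in stats]


# Two cation sites (+2) and two anion sites (-2), two site features each.
features = [[6.0, 1.0], [6.0, 3.0], [4.0, 2.0], [4.0, 2.0]]
oxi = [2, 2, -2, -2]
all_sites = site_stats(features, oxi)
cations_only = site_stats(features, oxi, min_oxi=0)
```

With `min_oxi=0` only the two +2 sites contribute, so the statistics describe the cation sublattice alone.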

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(s)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

static from_preset(preset, **kwargs)

Create a SiteStatsFingerprint class according to a preset

Args:

preset (str) - Name of preset

kwargs - Options for SiteStatsFingerprint

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.StructuralComplexity(symprec=0.1)

Bases: matminer.featurizers.base.BaseFeaturizer

Shannon information entropy of a structure.

This descriptor treats a structure as a message and evaluates its structural complexity (S) using the following equation:

S = - v \sum_{i=1}^{k} p_i \log_2 p_i

p_i = m_i / v

where v is the total number of atoms in the unit cell, p_i is the probability mass function, k is the number of symmetrically inequivalent sites, and m_i is the number of sites belonging to the i-th symmetrically inequivalent site.

Features:
  • information entropy (bits/atom)

  • information entropy (bits/unit cell)

Args:

symprec: precision for symmetrizing a structure
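As a worked illustration of the equation above, the sketch below evaluates the entropy directly from a list of site multiplicities m_i. The helper name `structural_complexity` is hypothetical, and the multiplicities are supplied by hand rather than derived from a symmetry analysis. It also assumes that the per-atom feature is S / v, which is an inference from the feature names rather than something stated in the docstring.

```python
from math import log2


def structural_complexity(multiplicities):
    """Return (bits/atom, bits/unit cell) from site multiplicities m_i.

    Hypothetical helper: `multiplicities` lists how many atoms of the unit
    cell sit on each symmetrically inequivalent site.
    """
    v = sum(multiplicities)                     # total atoms in the unit cell
    probs = [m / v for m in multiplicities]     # p_i = m_i / v
    per_atom = sum(-p * log2(p) for p in probs) # assumed to be S / v
    per_cell = v * per_atom                     # S = -v * sum(p_i * log2(p_i))
    return per_atom, per_cell


# Four atoms that are all symmetry-equivalent: no information content.
one_orbit = structural_complexity([4])
# Two atoms on two distinct sites: 1 bit per atom, 2 bits per cell.
two_orbits = structural_complexity([1, 1])
```

A structure whose atoms all occupy a single inequivalent site gives zero entropy; the value grows as atoms are spread over more distinct sites.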

__init__(symprec=0.1)

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(struct)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.StructuralHeterogeneity(weight='area', stats=('minimum', 'maximum', 'range', 'mean', 'avg_dev'))

Bases: matminer.featurizers.base.BaseFeaturizer

Variance in the bond lengths and atomic volumes in a structure

These features are based on several statistics derived from the Voronoi tessellation of a structure. The first set of features relates to the variance in the average bond length across all atoms in the structure. The second relates to the variance of bond lengths between each neighbor of each atom. The final feature is the variance in Voronoi cell sizes across the structure.

We define the ‘average bond length’ of a site as the weighted average of the bond lengths for all neighbors. By default, the weight is the area of the face between the sites.

The ‘neighbor distance variation’ is defined as the weighted mean absolute deviation in bond length for all neighbors of a particular site. As before, the weight is according to face area by default. For this statistic, we divide the mean absolute deviation by the mean neighbor distance for that site.
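The two per-site quantities defined above can be sketched as follows. This is a minimal illustration, not the library's code: the helper names are hypothetical, the neighbor distances and Voronoi face areas are passed in by hand instead of coming from a tessellation, and the "mean neighbor distance" used as the divisor is assumed to be the same weighted average bond length.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean of `values`."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def site_bond_stats(distances, face_areas):
    """Return (average bond length, neighbor distance variation) for one site.

    Hypothetical helper: `distances` are neighbor distances and `face_areas`
    are the areas of the Voronoi faces shared with each neighbor.
    """
    # Average bond length: face-area-weighted mean of neighbor distances.
    avg_len = weighted_mean(distances, face_areas)
    # Weighted mean absolute deviation of the distances about that mean.
    mad = weighted_mean([abs(d - avg_len) for d in distances], face_areas)
    # Divide by the mean neighbor distance to obtain a relative variation.
    return avg_len, mad / avg_len


# Two neighbors with equal face areas.
equal_faces = site_bond_stats([2.0, 3.0], [1.0, 1.0])
# The closer neighbor shares a face three times larger.
unequal_faces = site_bond_stats([2.0, 3.0], [3.0, 1.0])
```

Giving the nearer neighbor a larger face area pulls the average bond length toward 2.0 and lowers the relative variation.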

Features:
  • mean absolute deviation in relative bond length - Mean absolute deviation in the average bond lengths for all sites, divided by the mean average bond length

  • max relative bond length - Maximum average bond length, divided by the mean average bond length

  • min relative bond length - Minimum average bond length, divided by the mean average bond length

  • [stat] neighbor distance variation - Statistic (e.g., mean) of the neighbor distance variation

  • mean absolute deviation in relative cell size - Mean absolute deviation in the Voronoi cell volume across all sites in the structure, divided by the mean Voronoi cell volume

References:

Ward et al. _PRB_ 2017

__init__(weight='area', stats=('minimum', 'maximum', 'range', 'mean', 'avg_dev'))

Initialize self. See help(type(self)) for accurate signature.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.StructureComposition(featurizer=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Features related to the composition of a structure

This class is a thin wrapper that calls a composition-based featurizer on the composition of a Structure.

Features:
  • Depends on the featurizer
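The wrapper pattern described here can be sketched with small stand-in classes. Everything below is hypothetical and exists only for illustration: `ToyStructure`, `ToyElementFraction`, and `ToyStructureComposition` are not matminer or pymatgen classes, and a plain dict stands in for a composition object.

```python
class ToyStructure:
    """Hypothetical stand-in for a structure; exposes only a composition."""

    def __init__(self, species):
        self.species = list(species)

    @property
    def composition(self):
        # Map each element symbol to its number of atoms in the cell.
        counts = {}
        for symbol in self.species:
            counts[symbol] = counts.get(symbol, 0) + 1
        return counts


class ToyElementFraction:
    """Hypothetical composition-based featurizer: atomic fractions."""

    def __init__(self, elements):
        self.elements = list(elements)

    def feature_labels(self):
        return [f"{el} fraction" for el in self.elements]

    def featurize(self, comp):
        total = sum(comp.values())
        return [comp.get(el, 0) / total for el in self.elements]


class ToyStructureComposition:
    """Forwards a structure's composition to the wrapped featurizer."""

    def __init__(self, featurizer):
        self.featurizer = featurizer

    def feature_labels(self):
        # Labels come straight from the wrapped featurizer.
        return self.featurizer.feature_labels()

    def featurize(self, strc):
        # The only structure-specific step: extract the composition.
        return self.featurizer.featurize(strc.composition)


wrapper = ToyStructureComposition(ToyElementFraction(["Na", "Cl", "O"]))
labels = wrapper.feature_labels()
values = wrapper.featurize(ToyStructure(["Na", "Cl", "Na", "Cl"]))
```

Because both the labels and the values are delegated, the set of features depends entirely on the wrapped featurizer.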

__init__(featurizer=None)

Initialize the featurizer

Args:

featurizer (BaseFeaturizer) - Composition-based featurizer

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

fit(X, y=None, **fit_kwargs)

Update the parameters of this featurizer based on available data

Args:

X - [list of tuples], training data

Returns:

self

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

class matminer.featurizers.structure.XRDPowderPattern(two_theta_range=(0, 127), bw_method=0.05, pattern_length=None, **kwargs)

Bases: matminer.featurizers.base.BaseFeaturizer

1D array representing powder diffraction of a structure as calculated by pymatgen. The powder is smeared / normalized according to gaussian_kde.

__init__(two_theta_range=(0, 127), bw_method=0.05, pattern_length=None, **kwargs)

Initialize the featurizer.

Args:
two_theta_range ([float of length 2]): Tuple for range of two_thetas to calculate in degrees. Defaults to (0, 127). Set to None if you want all diffracted beams within the limiting sphere of radius 2 / wavelength.

bw_method (float): how much to smear the XRD pattern

pattern_length (float): length of final array; defaults to one value per degree (i.e. two_theta_range + 1)

**kwargs: any other arguments to pass into pymatgen’s XRDCalculator, such as the type of radiation.
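The relationship between two_theta_range and the default pattern_length, and the general idea of smearing discrete peaks onto a fixed grid, can be sketched as below. This is an illustration under stated assumptions, not the featurizer's algorithm: the helper names are hypothetical, a fixed-width Gaussian (`sigma`, in degrees) stands in for the gaussian_kde smoothing, the pattern is scaled to a maximum of 1, and the peak positions and intensities are invented inputs.

```python
from math import exp


def default_pattern_length(two_theta_range):
    """One value per degree with both endpoints included (width + 1)."""
    lo, hi = two_theta_range
    return int(hi - lo) + 1


def smear_peaks(peaks, two_theta_range=(0, 127), sigma=0.5,
                pattern_length=None):
    """Smear (two_theta, intensity) peaks onto an evenly spaced grid.

    Hypothetical helper: a fixed-width Gaussian replaces gaussian_kde, and
    the output is scaled so that its largest value is 1. Assumes the grid
    has at least two points.
    """
    lo, hi = two_theta_range
    n = pattern_length or default_pattern_length(two_theta_range)
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Sum the Gaussian contribution of every peak at every grid point.
    pattern = [
        sum(inten * exp(-0.5 * ((x - pos) / sigma) ** 2)
            for pos, inten in peaks)
        for x in grid
    ]
    top = max(pattern)
    return [p / top for p in pattern] if top > 0 else pattern


# Two invented peaks at 30 and 60 degrees two-theta.
pattern = smear_peaks([(30.0, 100.0), (60.0, 40.0)])
```

With the default range of (0, 127) the grid has 128 points, so every structure yields a feature vector of the same length regardless of how many diffraction peaks it has.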

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,

ideally in BibTeX format.

feature_labels()

Generate attribute names.

Returns:

([str]) attribute labels.

featurize(strc)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:

x: input data to featurize (type depends on featurizer).

Returns:

(list) one or more features.

implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,

“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

Module contents