matminer.featurizers package
Subpackages
- matminer.featurizers.composition package
- Subpackages
- matminer.featurizers.composition.tests package
- Submodules
- matminer.featurizers.composition.tests.base module
- matminer.featurizers.composition.tests.test_alloy module
- matminer.featurizers.composition.tests.test_composite module
- matminer.featurizers.composition.tests.test_element module
- matminer.featurizers.composition.tests.test_ion module
- matminer.featurizers.composition.tests.test_orbital module
- matminer.featurizers.composition.tests.test_packing module
- matminer.featurizers.composition.tests.test_thermo module
- Module contents
- Submodules
- matminer.featurizers.composition.alloy module
Miedema
WenAlloys
WenAlloys.__init__()
WenAlloys.citations()
WenAlloys.compute_atomic_fraction()
WenAlloys.compute_configuration_entropy()
WenAlloys.compute_delta()
WenAlloys.compute_enthalpy()
WenAlloys.compute_gamma_radii()
WenAlloys.compute_lambda()
WenAlloys.compute_local_mismatch()
WenAlloys.compute_magpie_summary()
WenAlloys.compute_strength_local_mismatch_shear()
WenAlloys.compute_weight_fraction()
WenAlloys.feature_labels()
WenAlloys.featurize()
WenAlloys.implementors()
WenAlloys.precheck()
YangSolidSolution
- matminer.featurizers.composition.composite module
- matminer.featurizers.composition.element module
- matminer.featurizers.composition.ion module
- matminer.featurizers.composition.orbital module
- matminer.featurizers.composition.packing module
AtomicPackingEfficiency
AtomicPackingEfficiency.__init__()
AtomicPackingEfficiency.citations()
AtomicPackingEfficiency.compute_nearest_cluster_distance()
AtomicPackingEfficiency.compute_simultaneous_packing_efficiency()
AtomicPackingEfficiency.create_cluster_lookup_tool()
AtomicPackingEfficiency.feature_labels()
AtomicPackingEfficiency.featurize()
AtomicPackingEfficiency.find_ideal_cluster_size()
AtomicPackingEfficiency.get_ideal_radius_ratio()
AtomicPackingEfficiency.implementors()
- matminer.featurizers.composition.thermo module
- Module contents
- Subpackages
- matminer.featurizers.site package
- Subpackages
- matminer.featurizers.site.tests package
- Submodules
- matminer.featurizers.site.tests.base module
- matminer.featurizers.site.tests.test_bonding module
- matminer.featurizers.site.tests.test_chemical module
- matminer.featurizers.site.tests.test_external module
- matminer.featurizers.site.tests.test_fingerprint module
- matminer.featurizers.site.tests.test_misc module
- matminer.featurizers.site.tests.test_rdf module
- Module contents
- Submodules
- matminer.featurizers.site.bonding module
- matminer.featurizers.site.chemical module
- matminer.featurizers.site.external module
- matminer.featurizers.site.fingerprint module
- matminer.featurizers.site.misc module
CoordinationNumber
IntersticeDistribution
IntersticeDistribution.__init__()
IntersticeDistribution.analyze_area_interstice()
IntersticeDistribution.analyze_dist_interstices()
IntersticeDistribution.analyze_vol_interstice()
IntersticeDistribution.citations()
IntersticeDistribution.feature_labels()
IntersticeDistribution.featurize()
IntersticeDistribution.implementors()
- matminer.featurizers.site.rdf module
AngularFourierSeries
GaussianSymmFunc
GeneralizedRadialDistributionFunction
GeneralizedRadialDistributionFunction.__init__()
GeneralizedRadialDistributionFunction.citations()
GeneralizedRadialDistributionFunction.feature_labels()
GeneralizedRadialDistributionFunction.featurize()
GeneralizedRadialDistributionFunction.fit()
GeneralizedRadialDistributionFunction.from_preset()
GeneralizedRadialDistributionFunction.implementors()
- Module contents
- Subpackages
- matminer.featurizers.structure package
- Subpackages
- matminer.featurizers.structure.tests package
- Submodules
- matminer.featurizers.structure.tests.base module
- matminer.featurizers.structure.tests.test_bonding module
- matminer.featurizers.structure.tests.test_composite module
- matminer.featurizers.structure.tests.test_matrix module
- matminer.featurizers.structure.tests.test_misc module
- matminer.featurizers.structure.tests.test_order module
- matminer.featurizers.structure.tests.test_rdf module
- matminer.featurizers.structure.tests.test_sites module
- matminer.featurizers.structure.tests.test_symmetry module
- Module contents
- Submodules
- matminer.featurizers.structure.bonding module
BagofBonds
BondFractions
GlobalInstabilityIndex
GlobalInstabilityIndex.__init__()
GlobalInstabilityIndex.calc_bv_sum()
GlobalInstabilityIndex.calc_gii_iucr()
GlobalInstabilityIndex.calc_gii_pymatgen()
GlobalInstabilityIndex.citations()
GlobalInstabilityIndex.compute_bv()
GlobalInstabilityIndex.feature_labels()
GlobalInstabilityIndex.featurize()
GlobalInstabilityIndex.get_bv_params()
GlobalInstabilityIndex.get_equiv_sites()
GlobalInstabilityIndex.implementors()
GlobalInstabilityIndex.precheck()
MinimumRelativeDistances
StructuralHeterogeneity
- matminer.featurizers.structure.composite module
- matminer.featurizers.structure.matrix module
CoulombMatrix
OrbitalFieldMatrix
OrbitalFieldMatrix.__init__()
OrbitalFieldMatrix.citations()
OrbitalFieldMatrix.feature_labels()
OrbitalFieldMatrix.featurize()
OrbitalFieldMatrix.get_atom_ofms()
OrbitalFieldMatrix.get_mean_ofm()
OrbitalFieldMatrix.get_ohv()
OrbitalFieldMatrix.get_single_ofm()
OrbitalFieldMatrix.get_structure_ofm()
OrbitalFieldMatrix.implementors()
SineCoulombMatrix
- matminer.featurizers.structure.misc module
- matminer.featurizers.structure.order module
- matminer.featurizers.structure.rdf module
ElectronicRadialDistributionFunction
PartialRadialDistributionFunction
PartialRadialDistributionFunction.__init__()
PartialRadialDistributionFunction.citations()
PartialRadialDistributionFunction.compute_prdf()
PartialRadialDistributionFunction.feature_labels()
PartialRadialDistributionFunction.featurize()
PartialRadialDistributionFunction.fit()
PartialRadialDistributionFunction.implementors()
PartialRadialDistributionFunction.precheck()
RadialDistributionFunction
get_rdf_bin_labels()
- matminer.featurizers.structure.sites module
- matminer.featurizers.structure.symmetry module
- Module contents
- Subpackages
- matminer.featurizers.tests package
- Submodules
- matminer.featurizers.tests.test_bandstructure module
- matminer.featurizers.tests.test_base module
FittableFeaturizer
MatrixFeaturizer
MultiArgs2
MultiTypeFeaturizer
MultipleFeatureFeaturizer
SingleFeaturizer
SingleFeaturizerMultiArgs
SingleFeaturizerMultiArgsWithPrecheck
SingleFeaturizerWithPrecheck
TestBaseClass
TestBaseClass.make_test_data()
TestBaseClass.setUp()
TestBaseClass.test_caching()
TestBaseClass.test_dataframe()
TestBaseClass.test_featurize_many()
TestBaseClass.test_fittable()
TestBaseClass.test_ignore_errors()
TestBaseClass.test_indices()
TestBaseClass.test_inplace()
TestBaseClass.test_matrix()
TestBaseClass.test_multifeature_no_zero_index()
TestBaseClass.test_multifeatures_multiargs()
TestBaseClass.test_multiindex_in_multifeaturizer()
TestBaseClass.test_multiindex_inplace()
TestBaseClass.test_multiindex_return()
TestBaseClass.test_multiple()
TestBaseClass.test_multiprocessing_df()
TestBaseClass.test_multitype_multifeat()
TestBaseClass.test_precheck()
TestBaseClass.test_stacked_featurizer()
- matminer.featurizers.tests.test_conversions module
TestConversions
TestConversions.test_ase_conversion()
TestConversions.test_composition_to_oxidcomposition()
TestConversions.test_composition_to_structurefromMP()
TestConversions.test_conversion_multiindex()
TestConversions.test_conversion_multiindex_dynamic()
TestConversions.test_conversion_overwrite()
TestConversions.test_dict_to_object()
TestConversions.test_json_to_object()
TestConversions.test_pymatgen_general_converter()
TestConversions.test_str_to_composition()
TestConversions.test_structure_to_composition()
TestConversions.test_structure_to_oxidstructure()
TestConversions.test_to_istructure()
- matminer.featurizers.tests.test_dos module
- matminer.featurizers.tests.test_function module
- Module contents
- matminer.featurizers.utils package
- Subpackages
- Submodules
- matminer.featurizers.utils.grdf module
- matminer.featurizers.utils.oxidation module
- matminer.featurizers.utils.stats module
PropertyStats
PropertyStats.avg_dev()
PropertyStats.calc_stat()
PropertyStats.eigenvalues()
PropertyStats.flatten()
PropertyStats.geom_std_dev()
PropertyStats.holder_mean()
PropertyStats.inverse_mean()
PropertyStats.kurtosis()
PropertyStats.maximum()
PropertyStats.mean()
PropertyStats.minimum()
PropertyStats.mode()
PropertyStats.quantile()
PropertyStats.range()
PropertyStats.skewness()
PropertyStats.sorted()
PropertyStats.std_dev()
- Module contents
Submodules
matminer.featurizers.bandstructure module
- class matminer.featurizers.bandstructure.BandFeaturizer(kpoints=None, find_method='nearest', nbands=2)
Bases: BaseFeaturizer
Featurizes a pymatgen band structure object.
- Args:
kpoints ([1x3 numpy array]): list of fractional coordinates of k-points at which energy is extracted.
find_method (str): the method for finding or interpolating for energy at given kpoints. It does nothing if kpoints is None. Options are:
- ‘nearest’: the energy of the nearest available k-point to the input k-point is returned.
- ‘linear’: the result of linear interpolation is returned; see the documentation for scipy.interpolate.griddata.
nbands (int): the number of valence/conduction bands to be featurized.
- __init__(kpoints=None, find_method='nearest', nbands=2)
- citations()
Citation(s) and reference(s) for this feature.
- Returns:
(list) each element should be a string citation, ideally in BibTeX format.
- feature_labels()
Generate attribute names.
- Returns:
([str]) attribute labels.
- featurize(bs)
- Args:
bs (pymatgen BandStructure or BandStructureSymmLine or their dict): The band structure to featurize. To obtain all features, bs should include the structure attribute.
- Returns:
([float]): a list of band structure features. If not bs.structure, features that require the structure will be returned as NaN.
- List of currently supported features:
- band_gap (eV): the difference between the CBM and VBM energy
- is_gap_direct (0.0|1.0): whether the band gap is direct or not
- direct_gap (eV): the minimum direct distance of the last valence band and the first conduction band
- p_ex1_norm (float): k-space distance between Gamma point and k-point of VBM
- n_ex1_norm (float): k-space distance between Gamma point and k-point of CBM
- p_ex1_degen: degeneracy of VBM
- n_ex1_degen: degeneracy of CBM
- if kpoints is provided (e.g. for kpoints == [[0.0, 0.0, 0.0]]):
- n_0.0;0.0;0.0_en: (energy of the first conduction band at [0.0, 0.0, 0.0] - CBM energy)
- p_0.0;0.0;0.0_en: (energy of the last valence band at [0.0, 0.0, 0.0] - VBM energy)
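As a rough illustration of how the first two features and the k-point labels relate to the band edges, the sketch below uses plain dictionaries in place of a pymatgen band structure. The helper names (`band_edge_features`, `kpoint_label`) and the input layout are assumptions made for this example and are not part of matminer; the label format simply mirrors the `n_0.0;0.0;0.0_en` example above.

```python
def kpoint_label(kpoint, carrier):
    """Build a label such as 'n_0.0;0.0;0.0_en' for one k-point.

    carrier is 'n' (conduction band) or 'p' (valence band).
    """
    coords = ";".join(str(float(c)) for c in kpoint)
    return f"{carrier}_{coords}_en"


def band_edge_features(vbm, cbm):
    """Return [band_gap, is_gap_direct] from two hypothetical edge dicts.

    Each dict holds an 'energy' in eV and fractional 'kpoint' coordinates.
    """
    band_gap = cbm["energy"] - vbm["energy"]
    # The gap is direct when both extrema sit at the same k-point.
    same_kpoint = all(
        abs(a - b) < 1e-8 for a, b in zip(vbm["kpoint"], cbm["kpoint"])
    )
    return [band_gap, 1.0 if same_kpoint else 0.0]


vbm = {"energy": 5.0, "kpoint": [0.0, 0.0, 0.0]}
cbm = {"energy": 6.5, "kpoint": [0.5, 0.0, 0.5]}
features = band_edge_features(vbm, cbm)
labels = [kpoint_label([0.0, 0.0, 0.0], carrier) for carrier in ("n", "p")]
```

Here the extrema sit at different k-points, so `is_gap_direct` comes out as 0.0, encoded as a float like every other entry in the feature list.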
- static get_bindex_bspin(extremum, is_cbm)
Returns the band index and spin of band extremum.
- Args:
extremum (dict): dictionary containing the CBM/VBM, i.e. output of Bandstructure.get_cbm()
is_cbm (bool): whether the extremum is the CBM or not
- implementors()
List of implementors of the feature.
- Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.bandstructure.BranchPointEnergy(n_vb=1, n_cb=1, calculate_band_edges=True, atol=1e-05)
Bases: BaseFeaturizer
Branch point energy and absolute band edge position.
Calculates the branch point energy and (optionally) an absolute band edge position, assuming the branch point energy is the center of the gap.
- Args:
n_vb (int): number of valence bands to include in BPE calc
n_cb (int): number of conduction bands to include in BPE calc
calculate_band_edges (bool): whether to also return band edge positions
atol (float): absolute tolerance when finding equivalent fractional k-points in irreducible Brillouin zone (IBZ) when weights is None
- __init__(n_vb=1, n_cb=1, calculate_band_edges=True, atol=1e-05)
- citations()
Citation(s) and reference(s) for this feature.
- Returns:
(list) each element should be a string citation, ideally in BibTeX format.
- feature_labels()
- Returns ([str]): absolute energy levels as provided in the input BandStructure. “absolute” means no reference energy is subtracted from branch_point_energy, vbm or cbm.
- featurize(bs, target_gap=None, weights=None)
- Args:
bs (BandStructure): Uniform (not symm line) band structure
target_gap (float): if set, the band gap is scissored to match this number
weights ([float]): if set, its length has to be equal to the length of bs.kpoints, to explicitly determine the k-point weights when averaging
- Returns:
(int) branch point energy on same energy scale as BS eigenvalues
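To make the roles of n_vb, n_cb and weights more concrete, here is a minimal sketch of one common definition of the branch point energy: a (weighted) k-point average of the midpoint between the mean of the highest n_vb valence bands and the mean of the lowest n_cb conduction bands. The function name and the list-of-lists input are hypothetical, and whether matminer's implementation matches this formula in every detail is an assumption.

```python
def branch_point_energy(valence, conduction, n_vb=1, n_cb=1, weights=None):
    """Weighted k-point average of the valence/conduction midpoint.

    valence[k] and conduction[k] hold the band energies (eV) at k-point k.
    This is an illustrative formula, not matminer's code.
    """
    if weights is None:
        weights = [1.0] * len(valence)  # equal weight for every k-point
    total = 0.0
    for vb, cb, w in zip(valence, conduction, weights):
        top_vb = sorted(vb)[-n_vb:]    # highest n_vb valence bands
        bottom_cb = sorted(cb)[:n_cb]  # lowest n_cb conduction bands
        total += w * (sum(top_vb) / n_vb + sum(bottom_cb) / n_cb)
    return total / (2.0 * sum(weights))


valence = [[-1.0, 0.0], [-2.0, -1.0]]
conduction = [[2.0, 3.0], [2.0, 4.0]]
bpe_uniform = branch_point_energy(valence, conduction)
bpe_weighted = branch_point_energy(valence, conduction, weights=[1.0, 3.0])
```

Giving the second k-point three times the weight pulls the result toward that k-point's midpoint, which is the effect the weights argument is described as controlling.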
- implementors()
List of implementors of the feature.
- Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
matminer.featurizers.base module
- class matminer.featurizers.base.BaseFeaturizer
Bases: BaseEstimator, TransformerMixin, ABC
Abstract class to calculate features from raw materials input data such as a compound formula or a pymatgen crystal structure or bandstructure object.
## Using a BaseFeaturizer Class
There are multiple ways of running the featurize routines:
- featurize: Featurize a single entry
- featurize_many: Featurize a list of entries
- featurize_dataframe: Compute features for many entries, store results as columns in a dataframe
Some featurizers require first calling the fit method before the featurization methods can function. Generally, you pass the dataset to fit to determine which features a featurizer should compute. For example, a featurizer that returns the partial radial distribution function may need to know which elements are present in a dataset.
You can also use the precheck and precheck_dataframe methods to ensure a featurizer is in scope for a given sample (or dataset) before featurizing.
You can also employ the featurizer as part of a ScikitLearn Pipeline object. For these cases, ScikitLearn calls the transform function of the BaseFeaturizer which is a less-featured wrapper of featurize_many. You would then provide your input data as an array to the Pipeline, which would output the features as an array.
Beyond the featurizing capability, BaseFeaturizer also includes methods for retrieving proper references for a featurizer. The citations function returns a list of papers that should be cited. The implementors function returns a list of people who wrote the featurizer, so that you know who to contact with questions.
## Implementing a New BaseFeaturizer Class
- These operations must be implemented for each new featurizer:
- featurize - Takes a single material as input, returns the features of that material.
- feature_labels - Generates a human-meaningful name for each of the features.
- citations - Returns a list of citations in BibTeX format.
- implementors - Returns a list of people who contributed to writing a paper.
None of these operations should change the state of the featurizer. I.e., running each method twice should not produce different results, no class attributes should be changed, and running one operation should not affect the output of another.
All options of the featurizer must be set by the __init__ function. All options must be listed as keyword arguments with default values, and each value must be saved as a class attribute with the same name (e.g., argument n should be stored in self.n). These requirements are necessary for compatibility with the get_params and set_params methods of BaseEstimator, which enable easy interoperability with ScikitLearn.
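The reason for the "same name" rule can be seen in a simplified sketch of what a get_params-style method does: it reads the parameter names from the __init__ signature and looks up attributes with those names. `ToyFeaturizer` below is a hypothetical stand-in written only for this illustration; a real featurizer would subclass BaseFeaturizer and inherit get_params and set_params from BaseEstimator rather than define them.

```python
import inspect


class ToyFeaturizer:
    """Hypothetical stand-in, not a matminer class."""

    def __init__(self, n=2, normalize=True):
        # Every option is a keyword argument stored under the same name.
        self.n = n
        self.normalize = normalize

    def get_params(self):
        """Simplified imitation of the idea behind BaseEstimator.get_params."""
        signature = inspect.signature(type(self).__init__)
        names = [name for name in signature.parameters if name != "self"]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        """Simplified imitation of BaseEstimator.set_params."""
        for name, value in params.items():
            setattr(self, name, value)
        return self


feat = ToyFeaturizer(n=3)
params = feat.get_params()
clone = ToyFeaturizer(**params).set_params(normalize=False)
```

If `n` were stored as, say, `self.num`, the lookup by signature name would fail, which is why the convention matters for cloning and pipeline use.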
Depending on the complexity of your featurizer, it may be worthwhile to implement a from_preset class method. The from_preset method takes the name of a preset and returns an instance of the featurizer with some hard-coded set of inputs. The from_preset option is particularly useful for defining the settings used by papers in the literature.
Optionally, you can implement the fit operation if there are attributes of your featurizer that must be set for the featurizer to work. Any variables that are set by fitting should be stored as class attributes that end with an underscore. (This follows the pattern used by ScikitLearn).
Another option to consider is whether it is worth making any utility operations for your featurizer. featurize must return a list of features, but this may not be the most natural representation for your features (e.g., a dict could be better). Making a separate function for computing features in this natural representation and having the featurize function call this method and then convert the data into a list is a recommended approach. Users who want to compute the representation in the natural form can use the utility function and users who want the data in a ML-ready format (list) can call featurize. See PartialRadialDistributionFunction for an example of this concept.
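The fit convention and the utility-function pattern described above can be sketched together. The class below is hypothetical and deliberately does not import matminer: compositions are plain dicts, `elements_` is the fitted attribute (note the trailing underscore), `compute_fractions` returns the natural dict representation, and featurize converts it into a list.

```python
class ElementFractions:
    """Hypothetical featurizer sketch; not a matminer class."""

    def fit(self, X, y=None):
        # State learned from the data ends with an underscore.
        self.elements_ = sorted({el for comp in X for el in comp})
        return self

    def compute_fractions(self, comp):
        """Utility method: natural representation as a dict."""
        total = sum(comp.values())
        return {el: amount / total for el, amount in comp.items()}

    def featurize(self, comp):
        """ML-ready representation: a list ordered like feature_labels()."""
        fractions = self.compute_fractions(comp)
        return [fractions.get(el, 0.0) for el in self.elements_]

    def feature_labels(self):
        return [f"{el} fraction" for el in self.elements_]


data = [{"Fe": 2, "O": 3}, {"Na": 1, "Cl": 1}]
feat = ElementFractions().fit(data)
row = feat.featurize(data[0])
labels = feat.feature_labels()
```

Users who want the dict can call `compute_fractions` directly, while featurize keeps the fixed-length list that dataframe and pipeline code expects.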
An additional factor to consider is the chunksize for data parallelisation. For lightweight computational tasks, the overhead associated with passing data from multiprocessing.Pool.map() to the function being parallelized can increase the time taken for all tasks to be completed. By setting the self._chunksize argument, the overhead associated with passing data to the tasks can be reduced. Note that there is only an advantage to using chunksize when the time taken to pass the data from map to the function call is within several orders of magnitude to that of the function call itself. By default, we allow the Python multiprocessing library to determine the chunk size automatically based on the size of the list being featurized. You may want to specify a small chunk size for computationally-expensive featurizers, which will enable better distribution of tasks across threads. In contrast, for more lightweight featurizers, it is recommended that the implementor trial a range of chunksize values to find the optimum. As a general rule of thumb, if the featurize function takes 0.1 seconds or less, a chunksize of around 30 will perform best.
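To give a feel for the trade-off, the sketch below counts how many tasks a pool would dispatch for different chunk sizes. `default_chunksize` reproduces the heuristic that CPython's multiprocessing.Pool.map is understood to apply when no chunksize is given (roughly four chunks per worker); treat that as an assumption about the interpreter rather than a guarantee, and note that both helper names are made up for this example.

```python
import math


def default_chunksize(n_items, n_workers):
    """Chunk size assumed to be chosen by Pool.map when chunksize is None."""
    chunksize, extra = divmod(n_items, n_workers * 4)
    return chunksize + 1 if extra else chunksize


def n_tasks(n_items, chunksize):
    """Number of chunks (tasks) sent to the workers."""
    return math.ceil(n_items / chunksize)


n_items, n_workers = 1000, 4
auto = default_chunksize(n_items, n_workers)
tasks_auto = n_tasks(n_items, auto)     # few large tasks
tasks_30 = n_tasks(n_items, 30)         # rule-of-thumb value from the text
tasks_1 = n_tasks(n_items, 1)           # one task per entry
```

Fewer, larger tasks reduce the data-passing overhead for lightweight featurizers, whereas many small tasks balance the load better when individual entries are expensive.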
## Documenting a BaseFeaturizer
The class documentation for each featurizer must contain a description of the options and the features that will be computed. The options of the class must all be defined in the __init__ function of the class, and we recommend documenting them using the [Google style](https://google.github.io/styleguide/pyguide.html).
For auto-generated documentation purposes, the first line of the featurizer doc should come under the class declaration (not under __init__) and should be a one line summary of the featurizer.
We recommend starting the class documentation with a high-level overview of the features. For example, mention what kind of characteristics of the material they describe and refer the reader to a paper that describes these features well (use a hyperlink if possible, so that the readthedocs will link to that paper). Then, describe each of the individual features in a block named “Features”. It is necessary here to give the user enough information to map a feature name to what it means. The objective in this part is to allow people to understand what each column of their dataframe is without having to read the Python code. You do not need to explain all of the math/algorithms behind each feature in enough detail to reproduce it, just enough to give an idea of what it is.
- property chunksize
- abstract citations()
Citation(s) and reference(s) for this feature.
- Returns:
(list) each element should be a string citation, ideally in BibTeX format.
- abstract feature_labels()
Generate attribute names.
- Returns:
([str]) attribute labels.
- abstract featurize(*x)
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Args:
x: input data to featurize (type depends on featurizer).
- Returns:
(list) one or more features.
- featurize_dataframe(df, col_id, ignore_errors=False, return_errors=False, inplace=False, multiindex=False, pbar=True)
Compute features for all entries contained in input dataframe.
- Args:
df (Pandas dataframe): Dataframe containing input data.
col_id (str or list of str): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.
ignore_errors (bool): Returns NaN for dataframe rows where exceptions are thrown if True. If False, exceptions are thrown as normal.
return_errors (bool): Returns the errors encountered for each row in a separate XFeaturizer errors column if True. Requires ignore_errors to be True.
inplace (bool): If True, adds columns to the original object in memory and returns None. Else, returns the updated object. Should be identical to pandas inplace behavior.
multiindex (bool): If True, use a Featurizer - Feature 2-level index using the MultiIndex capabilities of pandas. If done inplace, multiindex featurization will overwrite the original dataframe’s column index.
pbar (bool): Shows a progress bar if True.
- Returns:
updated dataframe.
- featurize_many(entries, ignore_errors=False, return_errors=False, pbar=True)
Featurize a list of entries.
If featurize takes multiple inputs, supply inputs as a list of tuples.
featurize_many supports entries as a list, tuple, numpy array, Pandas Series, or Pandas DataFrame.
- Args:
entries (list-like object): A list of entries to be featurized.
ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.
return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.
pbar (bool): Show a progress bar for featurization if True.
- Returns:
(list) features for each entry.
- featurize_wrapper(x, return_errors=False, ignore_errors=False)
An exception wrapper for featurize, used in featurize_many and featurize_dataframe. featurize_wrapper changes the behavior of featurize when ignore_errors is True in featurize_many/dataframe.
- Args:
x: input data to featurize (type depends on featurizer).
ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.
return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.
- Returns:
(list) one or more features.
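The interplay of ignore_errors and return_errors described above can be sketched with a small stand-alone function. This is not matminer's implementation: `featurize_safely`, its `n_features` argument and the toy `inverse` featurizer are all hypothetical names chosen to illustrate the documented behavior.

```python
import math
import traceback


def featurize_safely(featurize, x, n_features,
                     ignore_errors=False, return_errors=False):
    """Illustrative exception wrapper around a featurize callable."""
    try:
        features = list(featurize(x))
        if return_errors:
            features.append(math.nan)  # no error for this entry
        return features
    except Exception:
        if not ignore_errors:
            raise  # exceptions are thrown as normal
        features = [math.nan] * n_features
        if return_errors:
            # The traceback string becomes an extra 'feature'.
            features.append(traceback.format_exc())
        return features


def inverse(x):
    """Toy featurizer that fails for x == 0."""
    return [1.0 / x]


ok = featurize_safely(inverse, 4.0, 1, ignore_errors=True, return_errors=True)
bad = featurize_safely(inverse, 0.0, 1, ignore_errors=True, return_errors=True)
```

Both results have the same length, so rows that failed and rows that succeeded can share one set of columns.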
- fit(X, y=None, **fit_kwargs)
Update the parameters of this featurizer based on available data.
- Args:
X - [list of tuples], training data
- Returns:
self
- fit_featurize_dataframe(df, col_id, fit_args=None, *args, **kwargs)
The dataframe equivalent of fit_transform. Takes a dataframe and column id as input, fits the featurizer to that dataframe, and returns a featurized dataframe. Accepts the same arguments as featurize_dataframe.
- Args:
df (Pandas dataframe): Dataframe containing input data.
col_id (str or list of str): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.
fit_args (list): list of arguments for fit function.
- Returns:
updated dataframe based on featurizer fitted to that dataframe.
- abstract implementors()
List of implementors of the feature.
- Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- property n_jobs
- precheck(x: Any) → bool
Precheck (provide an estimate of whether a featurizer will work or not) for a single entry (e.g., a single composition). If the entry fails the precheck, it will most likely fail featurization; if it passes, it is likely (but not guaranteed) to featurize correctly.
- Prechecks should be:
- accurate (but can be good estimates rather than ground truth)
- fast to evaluate
- unlikely to be obsolete via changes in the featurizer in the near future
This method should be overridden by any featurizer requiring its use, as by default all entries will pass prechecking. Also, precheck is a good opportunity to throw warnings about long runtimes (e.g., doing nearest neighbors computations on a structure with many thousand sites).
See the documentation for precheck_dataframe for more information.
- Args:
*x (Composition, Structure, etc.): Input to-be-featurized. Can be a single input or multiple inputs.
- Returns:
(bool): True, if passes the precheck. False, if fails.
- precheck_dataframe(df, col_id, return_frac=True, inplace=False) → float | DataFrame
Precheck an entire dataframe. Subclasses wanting to use precheck functionality should not override this method; they should override precheck (unless the entire df determines whether single entries pass or fail a precheck).
Prechecking should be a quick and useful way to check, for a particular dataframe (set of featurizer inputs), whether the featurizer is:
- in scope, and/or
- robust to errors, and/or
- unsuitable for any other reason that would make you not want to use it on this dataframe.
By prechecking before featurizing, you can avoid applying featurizers to data that will ultimately fail, return unreliable numbers, or are out of scope. Prechecking is also a good time to throw/observe warnings (such as long runtime warnings!).
- Args:
df (pd.DataFrame): A dataframe
col_id (str or [str]): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.
return_frac (bool): If True, returns the fraction of entries passing the precheck (e.g., 0.5). Else, returns a dataframe.
inplace (bool): Only relevant if return_frac=False. If inplace=True, the input dataframe is modified in memory with a boolean column for precheck. Otherwise, a new df with this column is returned.
- Returns:
(float, pd.DataFrame): If return_frac=True, returns the fraction of entries passing the precheck. Else, returns the dataframe with an extra boolean column added for the precheck.
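The relationship between a per-entry precheck and the fraction reported for a whole dataset can be sketched without pandas. Everything here is hypothetical: the rule inside `precheck` (at most three elements) is invented for the example, and a plain list of booleans stands in for the boolean dataframe column.

```python
def precheck(elements, max_elements=3):
    """Hypothetical precheck: pass entries with at most max_elements elements."""
    return len(elements) <= max_elements


def precheck_many(entries, return_frac=True):
    """Apply precheck to each entry; return the passing fraction or the flags."""
    flags = [precheck(entry) for entry in entries]
    if return_frac:
        return sum(flags) / len(flags)
    return flags


entries = [
    ["Fe", "O"],
    ["Na", "Cl"],
    ["Al", "Co", "Cr", "Fe", "Ni"],
    ["Si"],
]
frac = precheck_many(entries)
flags = precheck_many(entries, return_frac=False)
```

A low fraction is a cheap early signal that a featurizer is a poor fit for the dataset, before any expensive featurization has run.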
- set_chunksize(chunksize)
Set the chunksize used for Pool.map parallelisation.
- set_n_jobs(n_jobs: int) → None
Set the number of concurrent jobs to spawn during featurization.
- Args:
n_jobs (int): Number of threads in multiprocessing pool.
Note: It seems multiprocessing can be the cause of out-of-memory (OOM) errors, especially when trying to featurize large structures on HPC nodes with strict memory limits. Using featurizer.set_n_jobs(1) has been known to help as a workaround.
- transform(X)
Compute features for a list of inputs.
- class matminer.featurizers.base.MultipleFeaturizer(featurizers, iterate_over_entries=True)
Bases: BaseFeaturizer
Class to run multiple featurizers on the same input data.
All featurizers must take the same kind of data as input to the featurize function.
- Args:
featurizers (list of BaseFeaturizer): A list of featurizers to run.
iterate_over_entries (bool): Whether to iterate over the entries or featurizers. Iterating over entries will enable increased caching but will only display a single progress bar for all featurizers. If set to False, iteration will be performed over featurizers, resulting in reduced caching but individual progress bars for each featurizer.
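Conceptually, running several featurizers on the same input amounts to concatenating their feature lists and their labels in the same order. The sketch below shows that idea with three hypothetical classes that operate on strings; it does not use matminer and leaves out caching, progress bars and iterate_over_entries.

```python
class Length:
    """Toy featurizer: number of characters."""

    def featurize(self, text):
        return [len(text)]

    def feature_labels(self):
        return ["length"]


class VowelCount:
    """Toy featurizer: number of vowels."""

    def featurize(self, text):
        return [sum(ch in "aeiou" for ch in text.lower())]

    def feature_labels(self):
        return ["vowel count"]


class Combined:
    """Hypothetical stand-in for the idea behind MultipleFeaturizer."""

    def __init__(self, featurizers):
        self.featurizers = featurizers

    def featurize(self, x):
        # Concatenate the outputs in the order the featurizers were given.
        return [value for f in self.featurizers for value in f.featurize(x)]

    def feature_labels(self):
        return [label for f in self.featurizers for label in f.feature_labels()]


combined = Combined([Length(), VowelCount()])
values = combined.featurize("matminer")
labels = combined.feature_labels()
```

Because values and labels are built in the same order, each value lines up with its label, which is also why all wrapped featurizers must accept the same kind of input.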
- __init__(featurizers, iterate_over_entries=True)
- citations()
Citation(s) and reference(s) for this feature.
- Returns:
(list) each element should be a string citation, ideally in BibTeX format.
- feature_labels()
Generate attribute names.
- Returns:
([str]) attribute labels.
- featurize(*x)
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Args:
x: input data to featurize (type depends on featurizer).
- Returns:
(list) one or more features.
- featurize_many(entries, ignore_errors=False, return_errors=False, pbar=True)
Featurize a list of entries.
If featurize takes multiple inputs, supply inputs as a list of tuples.
featurize_many supports entries as a list, tuple, numpy array, Pandas Series, or Pandas DataFrame.
- Args:
entries (list-like object): A list of entries to be featurized.
ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.
return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.
pbar (bool): Show a progress bar for featurization if True.
- Returns:
(list) features for each entry.
- featurize_wrapper(x, return_errors=False, ignore_errors=False)
An exception wrapper for featurize, used in featurize_many and featurize_dataframe. featurize_wrapper changes the behavior of featurize when ignore_errors is True in featurize_many/dataframe.
- Args:
x: input data to featurize (type depends on featurizer).
ignore_errors (bool): Returns NaN for entries where exceptions are thrown if True. If False, exceptions are thrown as normal.
return_errors (bool): If True, returns the feature list as determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.
- Returns:
(list) one or more features.
- fit(X, y=None, **fit_kwargs)
Update the parameters of this featurizer based on available data.
- Args:
X - [list of tuples], training data
- Returns:
self
- implementors()
List of implementors of the feature.
- Returns:
(list) each element should either be a string with author name (e.g., “Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- set_n_jobs(n_jobs)
Set the number of concurrent jobs to spawn during featurization.
- Args:
n_jobs (int): Number of threads in multiprocessing pool.
Note: It seems multiprocessing can be the cause of out-of-memory (OOM) errors, especially when trying to featurize large structures on HPC nodes with strict memory limits. Using featurizer.set_n_jobs(1) has been known to help as a workaround.
- class matminer.featurizers.base.StackedFeaturizer(featurizer=None, model=None, name=None, class_names=None)
Bases: BaseFeaturizer
Use the output of a machine learning model as features.
For regression models, we use the single output class.
For classification models, we use the probability for the first N-1 classes where N is the number of classes.
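Using only the first N-1 class probabilities loses no information, because the probabilities sum to one and the last value can always be recovered from the others. A minimal sketch, with a hypothetical helper name and a hand-written probability vector in place of a fitted model's output:

```python
def stacked_features(probabilities):
    """Keep the probabilities of the first N-1 classes."""
    return list(probabilities[:-1])


proba = [0.2, 0.5, 0.3]           # one sample, N = 3 classes
features = stacked_features(proba)
recovered_last = 1.0 - sum(features)
```

Dropping the redundant column also avoids feeding perfectly collinear features into a downstream model.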
- __init__(featurizer=None, model=None, name=None, class_names=None)
Initialize featurizer
- Args:
featurizer (BaseFeaturizer): Featurizer used to generate inputs to the model
model (BaseEstimator): Fitted machine learning model to be evaluated
name (str): [Optional] name of model, used when creating feature names
class_names ([str]): Required for classification models, used when creating feature names (scikit-learn does not specify the number of classes for a classifier). Class names must be in the same order as the classes in the model (e.g., class_names[0] must be the name of the class 0)
- citations()
Citation(s) and reference(s) for this feature.
- Returns:
(list) each element should be a string citation, ideally in BibTeX format.
- feature_labels()
Generate attribute names.
- Returns:
([str]) attribute labels.
- featurize(*x)
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Args:
x: input data to featurize (type depends on featurizer).
- Returns:
(list) one or more features.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
matminer.featurizers.conversions module¶
This module defines featurizers that can convert between different data formats.
Note that these featurizers do not produce machine learning-ready features. Instead, they should be used to pre-process data, either through a standalone transformation or as part of a Pipeline.
- class matminer.featurizers.conversions.ASEAtomstoStructure(target_col_id='PMG Structure from ASE Atoms', overwrite_data=False)¶
Bases:
ConversionFeaturizer
Convert ASE Atoms objects in a dataframe to pymatgen structures for further use with matminer.
- Args:
target_col_id (str): Column to place PMG structures.
overwrite_data (bool): If True, will overwrite target_col_id even if there is data currently in that column.
- __init__(target_col_id='PMG Structure from ASE Atoms', overwrite_data=False)¶
- featurize(ase_atoms)¶
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Args:
x: input data to featurize (type depends on featurizer).
- Returns:
(list) one or more features.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.CompositionToOxidComposition(target_col_id='composition_oxid', overwrite_data=False, coerce_mixed=True, return_original_on_error=False, **kwargs)¶
Bases:
ConversionFeaturizer
Utility featurizer to add oxidation states to a pymatgen Composition.
Oxidation states are determined using pymatgen’s guessing routines. The expected input is a pymatgen.core.composition.Composition object.
Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.
- Args:
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_column if it
exists.
- coerce_mixed (bool): If a composition has both species containing
oxid states and not containing oxid states, strips all of the oxid states and guesses the entire composition’s oxid states.
- return_original_on_error (bool): If True and the oxidation states cannot be
guessed, the composition without oxidation states will be returned. If False, an error will be thrown.
- **kwargs: Parameters to control the settings for
pymatgen.core.composition.Composition.add_charges_from_oxi_state_guesses().
- __init__(target_col_id='composition_oxid', overwrite_data=False, coerce_mixed=True, return_original_on_error=False, **kwargs)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- featurize(comp)¶
Add oxidation states to a Composition using pymatgen’s guessing routines.
- Args:
comp (pymatgen.core.composition.Composition): A composition.
- Returns:
- (pymatgen.core.composition.Composition): A Composition object
decorated with oxidation states.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.CompositionToStructureFromMP(target_col_id='structure', overwrite_data=False, mapi_key=None)¶
Bases:
ConversionFeaturizer
Featurizer to get a Structure object from Materials Project using the composition alone. The most stable entry from Materials Project is selected, or NaN if no entry is found in the Materials Project.
- Args:
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_column if it
exists.
mapi_key (str): Materials API key.
- __init__(target_col_id='structure', overwrite_data=False, mapi_key=None)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- featurize(comp)¶
Get the most stable structure from Materials Project.
- Args:
comp (pymatgen.core.composition.Composition): A composition.
- Returns:
(pymatgen.core.structure.Structure): A Structure object.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.ConversionFeaturizer(target_col_id, overwrite_data)¶
Bases:
BaseFeaturizer
Abstract class to perform data conversions.
Featurizers subclassing this class do not produce machine learning-ready features but instead are used to pre-process data. As Featurizers, the conversion process can take advantage of the parallelisation implemented in scikit-learn.
Note that feature_labels are set dynamically and may depend on the column id of the data being featurized. As such, feature_labels may differ before and after featurization.
ConversionFeaturizers differ from other Featurizers in that the user can specify the column in which to write the converted data. The output column is controlled through target_col_id. ConversionFeaturizers also have the ability to overwrite data in existing columns. This is controlled by the overwrite_data option. “In place” conversion of data can be achieved by setting target_col_id=None and overwrite_data=True. See the docstring below for more details.
- Args:
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_col_id if it
exists.
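The column rules above can be sketched as a small standalone function. This is a paraphrase of the documented behaviour, not matminer's code: resolve_target_column and its existing_columns argument are hypothetical names, and the exception type is an assumption.

```python
def resolve_target_column(col_id, target_col_id, overwrite_data, existing_columns):
    """Return the column name the converted data would be written to."""
    if target_col_id is None:
        # "In place" conversion: reuse the input column.
        target = col_id
    elif target_col_id.startswith("_"):
        # Leading underscore: prefix the target with the input column name.
        target = "{}_{}".format(col_id, target_col_id[1:])
    else:
        target = target_col_id
    if target in existing_columns and not overwrite_data:
        # Exception type is assumed; the docs only say "an error will be thrown".
        raise ValueError(f"Column '{target}' exists and overwrite_data is False")
    return target


prefixed = resolve_target_column("structure_dict", "_object", False, ["structure_dict"])
in_place = resolve_target_column("formula", None, True, ["formula"])
```

Because the in-place target is always an existing column, the sketch makes it visible why target_col_id=None only works together with overwrite_data=True.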
- __init__(target_col_id, overwrite_data)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- feature_labels()¶
Generate attribute names.
- Returns:
([str]) attribute labels.
- featurize(*x)¶
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Args:
x: input data to featurize (type depends on featurizer).
- Returns:
(list) one or more features.
- featurize_dataframe(df, col_id, **kwargs)¶
Perform the data conversion and set the target column dynamically.
target_col_id, and accordingly feature_labels, may depend on the column id of the data being featurized. As such, target_col_id is first set dynamically before the BaseFeaturizer.featurize_dataframe() super method is called.
- Args:
df (pandas.DataFrame): Dataframe containing input data.
col_id (str or list of str): column label containing objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.
- **kwargs: Additional keyword arguments that will be passed through
to BaseFeaturizer.featurize_dataframe().
- Returns:
(pandas.DataFrame): The updated dataframe.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.DictToObject(target_col_id='_object', overwrite_data=False)¶
Bases:
ConversionFeaturizer
Utility featurizer to decode a dict to a Python object via MSON.
Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.
- Args:
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_column if it
exists.
- __init__(target_col_id='_object', overwrite_data=False)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- featurize(dict_data)¶
Decode an MSONable dict into a Python object.
- Args:
- dict_data (dict): An MSONable dictionary, e.g. produced from
pymatgen.core.structure.Structure.as_dict().
- Returns:
(object): An object with the type specified by dict_data.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.JsonToObject(target_col_id='_object', overwrite_data=False)¶
Bases:
ConversionFeaturizer
Utility featurizer to decode json data to a Python object via MSON.
Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.
- Args:
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_column if it
exists.
- __init__(target_col_id='_object', overwrite_data=False)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- featurize(json_data)¶
Decode MSONable JSON data into a Python object.
- Args:
- json_data (str): MSONable JSON data, e.g. produced from
pymatgen.core.structure.Structure.to_json().
- Returns:
(object): An object with the type specified by json_data.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.PymatgenFunctionApplicator(func, func_args=None, func_kwargs=None, target_col_id=None, overwrite_data=False)¶
Bases:
ConversionFeaturizer
Featurizer to apply any function to pymatgen primitives.
For example, apply
lambda structure: structure.composition.anonymized_formula
to all rows in a dataframe and return the results in the specified column.
- Args:
func (function): Function object or lambda to pass the pmg primitive objects to.
func_args (list): List of args to pass along with the pmg object to func.
func_kwargs (dict): Dict of kwargs to pass along with the pmg object to func.
target_col_id (str): Output column for the results. If not provided, the func name will be used.
- overwrite_data (bool): If True, will overwrite target_col_id even if there is
data currently in that column
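How the arguments combine can be illustrated without pymatgen. The sketch below uses hypothetical helpers (apply_function, default_column_name) and a plain number in place of a pymatgen object. Reading the fallback column name from func.__name__ is an assumption about what "the func name" means.

```python
def apply_function(obj, func, func_args=None, func_kwargs=None):
    """Call func with obj first, followed by any extra args and kwargs."""
    args = func_args or []
    kwargs = func_kwargs or {}
    return func(obj, *args, **kwargs)


def default_column_name(func, target_col_id=None):
    """Fall back to the function's name when no output column is given."""
    return target_col_id if target_col_id is not None else func.__name__


def scale(value, factor, offset=0):
    """Stand-in for a function that would normally take a pymatgen object."""
    return value * factor + offset


result = apply_function(2, scale, func_args=[10], func_kwargs={"offset": 1})
column = default_column_name(scale)
```

Under this assumption a lambda would yield the column name "<lambda>", so passing target_col_id explicitly is the safer choice when func is anonymous.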
- __init__(func, func_args=None, func_kwargs=None, target_col_id=None, overwrite_data=False)¶
- featurize(obj)¶
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Args:
x: input data to featurize (type depends on featurizer).
- Returns:
(list) one or more features.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.StrToComposition(reduce=False, target_col_id='composition', overwrite_data=False)¶
Bases:
ConversionFeaturizer
Utility featurizer to convert a string to a Composition.
The expected input is a composition in string form (e.g. “Fe2O3”).
Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.
- Args:
- reduce (bool): Whether to return a reduced
pymatgen.core.composition.Composition object.
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_column if it
exists.
- __init__(reduce=False, target_col_id='composition', overwrite_data=False)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- featurize(string_composition)¶
Convert a string to a pymatgen Composition.
- Args:
- string_composition (str): A chemical formula as a string (e.g.
“Fe2O3”).
- Returns:
(pymatgen.core.composition.Composition): A composition object.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.StructureToComposition(reduce=False, target_col_id='composition', overwrite_data=False)¶
Bases:
ConversionFeaturizer
Utility featurizer to convert a Structure to a Composition.
The expected input is a pymatgen.core.structure.Structure object.
Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.
- Args:
- reduce (bool): Whether to return a reduced Composition object.
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_column if it
exists.
- __init__(reduce=False, target_col_id='composition', overwrite_data=False)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- featurize(structure)¶
Convert a pymatgen Structure to a pymatgen Composition.
- Args:
structure (pymatgen.core.structure.Structure): A structure.
- Returns:
(pymatgen.core.composition.Composition): A Composition object.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.StructureToIStructure(target_col_id='istructure', overwrite_data=False)¶
Bases:
ConversionFeaturizer
Utility featurizer to convert a Structure to an immutable IStructure.
This is useful if you are using features that employ caching.
The expected input is a pymatgen.core.structure.Structure object.
Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.
- Args:
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_column if it
exists.
- __init__(target_col_id='istructure', overwrite_data=False)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- featurize(structure)¶
Convert a pymatgen Structure to an immutable IStructure.
- Args:
structure (pymatgen.core.structure.Structure): A structure.
- Returns:
- (pymatgen.core.structure.IStructure): An immutable IStructure
object.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.conversions.StructureToOxidStructure(target_col_id='structure_oxid', overwrite_data=False, return_original_on_error=False, **kwargs)¶
Bases:
ConversionFeaturizer
Utility featurizer to add oxidation states to a pymatgen Structure.
Oxidation states are determined using pymatgen’s guessing routines. The expected input is a pymatgen.core.structure.Structure object.
Note that this Featurizer does not produce machine learning-ready features but instead can be applied to pre-process data or as part of a Pipeline.
- Args:
- target_col_id (str or None): The column in which the converted data will
be written. If the column already exists then an error will be thrown unless overwrite_data is set to True. If target_col_id begins with an underscore the data will be written to the column: “{}_{}”.format(col_id, target_col_id[1:]), where col_id is the column being featurized. If target_col_id is set to None then the data will be written “in place” to the col_id column (this will only work if overwrite_data=True).
- overwrite_data (bool): Overwrite any data in target_column if it
exists.
- return_original_on_error (bool): If True and the oxidation states cannot be
guessed, the structure without oxidation states will be returned. If False, an error will be thrown.
- **kwargs: Parameters to control the settings for
pymatgen.core.structure.Structure.add_oxidation_state_by_guess().
- __init__(target_col_id='structure_oxid', overwrite_data=False, return_original_on_error=False, **kwargs)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- featurize(structure)¶
Add oxidation states to a Structure using pymatgen’s guessing routines.
- Args:
structure (pymatgen.core.structure.Structure): A structure.
- Returns:
- (pymatgen.core.structure.Structure): A Structure object decorated
with oxidation states.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
matminer.featurizers.dos module¶
- class matminer.featurizers.dos.DOSFeaturizer(contributors=1, decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)¶
Bases:
BaseFeaturizer
Significant character and contribution of the density of states from a CompleteDos object. Contributors are the atomic orbitals from each site within the structure. This underlines the importance of dos.structure.
- Args:
- contributors (int):
Sets the number of top contributors to the DOS that are returned as features. (i.e. contributors=1 will only return the main cb and main vb orbital)
- decay_length (float in eV):
The DOS is sampled by an exponential decay function. This parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength).
- sampling_resolution (int):
Number of points to sample DOS
- gaussian_smear (float in eV):
Gaussian smearing (sigma) around each sampled point in the DOS
- Returns (featurize returns [float] and feature_labels returns [str]):
xbm_score_i (float): fractions of ith contributor orbital
xbm_location_i (str): fractional coordinate of ith contributor/site
xbm_character_i (str): character of ith contributor (s, p, d, f)
xbm_specie_i (str): elemental specie of ith contributor (ex: ‘Ti’)
xbm_hybridization (int): the amount of hybridization at the band edge characterized by an entropy score (x ln x). The hybridization score is larger for a greater number of significant contributors.
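A short sketch shows why an entropy-style score grows with the number of significant contributors. It assumes the score is the Shannon form -sum(x ln x) over contributor fractions x. The sign convention, any normalization, and the helper name entropy_score are assumptions, not taken from the matminer source.

```python
import math


def entropy_score(fractions):
    """Shannon-style entropy, -sum(x ln x), over contributor fractions."""
    # Zero fractions are skipped because x ln x tends to 0 as x goes to 0.
    return -sum(x * math.log(x) for x in fractions if x > 0)


single = entropy_score([1.0])           # one orbital dominates the band edge
two_equal = entropy_score([0.5, 0.5])   # two equal contributors
four_equal = entropy_score([0.25] * 4)  # four equal contributors
```

With a single dominant contributor the score is zero, and it rises as the band edge is shared among more orbitals of comparable weight.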
- __init__(contributors=1, decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- feature_labels()¶
- Returns ([str]): list of names of the features. See the docs for the
featurize method for more information.
- featurize(dos)¶
- Args:
- dos (pymatgen CompleteDos or their dict):
The density of states to featurize. Must be a complete DOS (i.e. contains PDOS and structure, in addition to total DOS).
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.dos.DopingFermi(dopings=None, eref='midgap', T=300, return_eref=False)¶
Bases:
BaseFeaturizer
The fermi level (w.r.t. selected reference energy) associated with a specified carrier concentration (1/cm3) and temperature. This featurizer requires the total density of states and structure. The structure, available as dos.structure (e.g. in CompleteDos), is required by the FermiDos class.
- Args:
- dopings ([float]): list of doping concentrations 1/cm3. Note that a
negative concentration is treated as electron majority carrier (n-type) and positive for holes (p-type)
- eref (str or int or float): energy alignment reference. Defaults
to midgap (equilibrium fermi). A fixed number can also be used. str options: “midgap”, “vbm”, “cbm”, “dos_fermi”, “band_center”
T (float): absolute temperature in Kelvin.
return_eref (bool): if True, instead of aligning the fermi levels based on eref, it (eref) will be explicitly returned as a feature.
- Returns (featurize returns [float] and feature_labels returns [str]):
- examples:
- fermi_c-1e+20T300 (float): the fermi level for the electron
concentration of 1e20 and the temperature of 300K.
- fermi_c1e+18T600 (float): fermi level for the hole concentration
of 1e18 and the temperature of 600K.
- midgap eref (float): if return_eref==True then eref (midgap here)
energy is returned. In this case, fermi levels are absolute as opposed to relative to eref (i.e. if not return_eref)
- __init__(dopings=None, eref='midgap', T=300, return_eref=False)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- feature_labels()¶
- Returns ([str]): list of names of the features generated by featurize
example: “fermi_c-1e+20T300” that is the fermi level for the electron concentration of 1e20 (c-1e+20) and temperature of 300K.
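The label pattern in the examples can be reproduced with ordinary string formatting. This is a sketch that matches the documented labels. The helper names fermi_label and carrier_type are hypothetical, and the library's actual formatting call is not shown in this reference.

```python
def fermi_label(doping, temperature):
    """Build a label such as fermi_c-1e+20T300 from a doping level and T."""
    # Python renders the float -1e20 as "-1e+20", matching the documented form.
    return f"fermi_c{doping}T{temperature}"


def carrier_type(doping):
    """Negative concentrations mean electrons (n-type), positive mean holes."""
    return "n-type" if doping < 0 else "p-type"


electron_label = fermi_label(-1e20, 300)
hole_label = fermi_label(1e18, 600)
kinds = [carrier_type(c) for c in (-1e20, 1e18)]
```

The sign of the concentration is the only thing that distinguishes electron from hole doping in the label, so it is worth checking the sign convention before building a dopings list.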
- featurize(dos, bandgap=None)¶
- Args:
dos (pymatgen Dos, CompleteDos or FermiDos): the density of states.
bandgap (float): for example the experimentally measured band gap or one that is calculated via more accurate methods than the one used to generate dos. dos will be scissored to have the same electronic band gap as bandgap.
- Returns ([float]): features are fermi levels in eV at the given
concentrations and temperature + eref in eV if return_eref
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.dos.DosAsymmetry(decay_length=0.5, sampling_resolution=100, gaussian_smear=0.05)¶
Bases:
BaseFeaturizer
Quantifies the asymmetry of the DOS near the Fermi level.
The DOS asymmetry is defined as the natural logarithm of the quotient of the total DOS above the Fermi level and the total DOS below the Fermi level. A positive number indicates that there are more states directly above the Fermi level than below the Fermi level. This featurizer is only meant for metals and semi-metals.
- Args:
- decay_length (float in eV):
The DOS is sampled by an exponential decay function. This parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength).
- sampling_resolution (int):
Number of points to sample DOS
- gaussian_smear (float in eV):
Gaussian smearing (sigma) around each sampled point in the DOS
- __init__(decay_length=0.5, sampling_resolution=100, gaussian_smear=0.05)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- feature_labels()¶
Returns the labels for each of the features.
- featurize(dos)¶
Calculates the DOS asymmetry.
- Args:
dos (Dos): A pymatgen Dos object.
- Returns:
A float describing the asymmetry of the DOS.
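The definition can be made concrete with a simplified calculation on a discrete DOS. This sketch assumes a plain exponential weight exp(-|E - E_F| / decay_length) with a hard cutoff at five decay lengths, and it omits the Gaussian smearing and resampling the featurizer applies. The function dos_asymmetry and its arguments are hypothetical.

```python
import math


def dos_asymmetry(energies, densities, e_fermi, decay_length=0.5):
    """ln(weighted DOS above E_F / weighted DOS below E_F), simplified."""
    cutoff = 5 * decay_length  # hard cutoff described in the Args above
    above = 0.0
    below = 0.0
    for energy, density in zip(energies, densities):
        distance = abs(energy - e_fermi)
        if distance == 0 or distance > cutoff:
            continue  # skip the Fermi level itself and points past the cutoff
        weight = math.exp(-distance / decay_length)
        if energy > e_fermi:
            above += weight * density
        else:
            below += weight * density
    return math.log(above / below)


# Made-up DOS with twice as many states above the Fermi level as below it.
energies = [-0.4, -0.2, 0.2, 0.4]
densities = [1.0, 1.0, 2.0, 2.0]
asymmetry = dos_asymmetry(energies, densities, e_fermi=0.0)
```

A positive result means more weighted states just above the Fermi level, and swapping the two halves of the DOS flips the sign. The ratio is undefined when either side has no states within the cutoff, which is consistent with the restriction to metals and semi-metals.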
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.dos.Hybridization(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05, species=None)¶
Bases:
BaseFeaturizer
Quantify the s/p/d/f orbital character and their hybridizations at band edges.
- Args:
- decay_length (float in eV):
The DOS is sampled by an exponential decay function. This parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength).
- sampling_resolution (int):
Number of points to sample DOS
- gaussian_smear (float in eV):
Gaussian smearing (sigma) around each sampled point in the DOS
- species ([str]): the species for which orbital contributions are
separately returned.
- Returns (featurize returns [float] and feature_labels returns [str]):
Set of orbital contributions and hybridizations. If species is given, then also individual contributions from the given species. Examples:
cbm_s (float): s-orbital character of the cbm up to energy_cutoff
vbm_sp (float): sp-hybridization at the vbm edge. Minimum is 0 or no hybridization (e.g. all s or vbm_s==1) and 1.0 is maximum hybridization (i.e. vbm_s==0.5, vbm_p==0.5)
cbm_Si_p (float): p-orbital character of Si
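One simple function that reproduces both documented endpoints of the hybridization score is four times the product of the two orbital fractions. The sketch below uses that form as an assumption chosen to match the endpoints, not as a statement of the library's formula, and the name hybridization_score is hypothetical.

```python
def hybridization_score(frac_a, frac_b):
    """Product-based score: 0 for a pure orbital, 1 for an even 50/50 mix."""
    # The factor 4 rescales the maximum of frac_a * frac_b (0.25) to 1.0.
    return 4.0 * frac_a * frac_b


pure_s = hybridization_score(1.0, 0.0)   # all s character, no p
even_sp = hybridization_score(0.5, 0.5)  # equal s and p character
uneven_sp = hybridization_score(0.8, 0.2)
```

Under this form the score falls off smoothly as the mix becomes uneven, so an 80/20 split lands between the two endpoints.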
- __init__(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05, species=None)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- feature_labels()¶
Returns ([str]): feature names starting with the extrema (cbm or vbm) followed by either s,p,d,f orbital to show normalized contribution or a pair showing their hybridization or contribution of an element. See the class docs for examples.
- featurize(dos, decay_length=None)¶
Takes in the density of states and returns the orbital contributions and hybridizations.
- Args:
dos (pymatgen CompleteDos): note that dos.structure is required
decay_length (float or None): if set, it overrides the instance variable self.decay_length.
- Returns ([float]): features, see class doc for more info
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- class matminer.featurizers.dos.SiteDOS(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)¶
Bases:
BaseFeaturizer
Report the fractional s/p/d/f DOS for a particular site. A CompleteDos object is required because knowledge of the structure is needed. This featurizer will work for metals as well as semiconductors. If the DOS is that of a semiconductor, cbm and vbm will correspond to the two respective band edges. If the DOS is that of a metal, then cbm and vbm correspond to above and below the Fermi level, respectively.
- Args:
- decay_length (float in eV):
The DOS is sampled by an exponential decay function. This parameter sets the decay length of the exponential. Three times the decay_length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength).
- sampling_resolution (int):
number of points to sample dos
- gaussian_smear (float in eV):
Gaussian smearing (sigma) around each sampled point in dos
- Returns (list of floats):
cbm_score_i (float): fractional score for i in {s,p,d,f}
cbm_score_total (float): the total sum of all the {s,p,d,f} scores; this is useful when comparing the relative contributions from multiple sites
vbm_score_i (float): fractional score for i in {s,p,d,f}
vbm_score_total (float): the total sum of all the {s,p,d,f} scores; this is useful when comparing the relative contributions from multiple sites
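The sampling described above can be sketched in plain Python. This is an illustration only, not matminer's implementation: it assumes a weight of exp(-|E - edge| / decay_length), a hard cutoff at five decay lengths, and simple linear interpolation, and it omits Gaussian smearing. The helper names `interpolate` and `edge_score` and the toy DOS data are hypothetical.

```python
import math

def interpolate(xs, ys, x):
    """Linear interpolation on a sorted grid; clamps outside the grid."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def edge_score(energies, densities, edge, decay_length,
               sampling_resolution=100, above=True):
    """Exponentially weighted sum of a DOS curve sampled next to a band edge.

    Assumed weighting (not taken from matminer's source):
    exp(-offset / decay_length), sampled out to 5 * decay_length.
    """
    cutoff = 5 * decay_length
    step = cutoff / (sampling_resolution - 1)
    sign = 1.0 if above else -1.0  # above the edge for cbm, below for vbm
    score = 0.0
    for i in range(sampling_resolution):
        offset = i * step
        weight = math.exp(-offset / decay_length)
        score += weight * interpolate(energies, densities, edge + sign * offset)
    return score

# Toy projected DOS on a uniform energy grid (made-up data).
energies = [-1.0 + 0.01 * i for i in range(201)]
pdos = {"s": [1.0] * 201, "p": [3.0] * 201}
cbm = 0.2

raw = {orb: edge_score(energies, dens, cbm, decay_length=0.1)
       for orb, dens in pdos.items()}
total = sum(raw.values())
fractions = {orb: value / total for orb, value in raw.items()}
print(fractions, total)
```

Because the p density is three times the s density everywhere in this toy data, the fractional scores come out as one quarter s and three quarters p, while `total` retains the absolute magnitude that is needed to compare one site against another.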
- __init__(decay_length=0.1, sampling_resolution=100, gaussian_smear=0.05)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- feature_labels()¶
- Returns (list of str): list of names of the features. See the docs for
the featurizer class for more information.
- featurize(dos, idx)¶
Get DOS scores for the given site index.
- Args:
- dos (pymatgen CompleteDos or its dict representation):
DOS to featurize; must contain the PDOS and structure.
idx (int): index of the target site in the structure.
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- matminer.featurizers.dos.get_cbm_vbm_scores(dos, decay_length, sampling_resolution, gaussian_smear)¶
Quantifies the contribution of all atomic orbitals (s/p/d/f) from all crystal sites to the conduction band minimum (CBM) and the valence band maximum (VBM). An exponential decay function is used to sample the DOS. An example use may be sorting the output based on cbm_score or vbm_score.
- Args:
- dos (pymatgen CompleteDos or their dict):
The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS)
- decay_length (float in eV):
The DOS is sampled by an exponential decay function. This parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength).
- sampling_resolution (int):
Number of points to sample DOS
- gaussian_smear (float in eV):
Gaussian smearing (sigma) around each sampled point in the DOS
- Returns:
- orbital_scores [(dict)]:
A list of how much each orbital contributes to the partial density of states near the band edge. Dictionary items are:
- cbm_score (float): fractional contribution to the conduction band
- vbm_score (float): fractional contribution to the valence band
- species (pymatgen Specie): the Specie of the orbital
- character (str): the orbital character: s, p, d, or f
- location ([float]): fractional coordinates of the orbital
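As a sketch of the sorting use case mentioned in the description, the snippet below ranks a hand-written list shaped like the documented return value. The numbers are made up, and plain strings stand in for the pymatgen Specie objects that the real function returns.

```python
# Hypothetical data mimicking the documented list of dictionaries.
orbital_scores = [
    {"cbm_score": 0.12, "vbm_score": 0.55, "species": "O",
     "character": "p", "location": [0.5, 0.5, 0.0]},
    {"cbm_score": 0.61, "vbm_score": 0.08, "species": "Ti",
     "character": "d", "location": [0.0, 0.0, 0.0]},
    {"cbm_score": 0.27, "vbm_score": 0.37, "species": "O",
     "character": "s", "location": [0.5, 0.5, 0.0]},
]

# Largest contributors first.
by_cbm = sorted(orbital_scores, key=lambda d: d["cbm_score"], reverse=True)
by_vbm = sorted(orbital_scores, key=lambda d: d["vbm_score"], reverse=True)

top_cbm = (by_cbm[0]["species"], by_cbm[0]["character"])
top_vbm = (by_vbm[0]["species"], by_vbm[0]["character"])
print("dominant CBM orbital:", top_cbm)
print("dominant VBM orbital:", top_vbm)
```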
- matminer.featurizers.dos.get_site_dos_scores(dos, idx, decay_length, sampling_resolution, gaussian_smear)¶
Quantifies the contribution of all atomic orbitals (s/p/d/f) from a particular crystal site to the conduction band minimum (CBM) and the valence band maximum (VBM). An exponential decay function is used to sample the DOS. If the DOS is that of a metal, then CBM and VBM indicate the orbital scores above and below the Fermi energy, respectively.
- Args:
- dos (pymatgen CompleteDos or their dict):
The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS)
- decay_length (float in eV):
The DOS is sampled by an exponential decay function. This parameter sets the decay length of the exponential. Three times the decay length corresponds to 10% sampling strength. There is a hard cutoff at five times the decay length (1% sampling strength).
- sampling_resolution (int):
Number of points to sample DOS
- gaussian_smear (float in eV):
Gaussian smearing (sigma) around each sampled point in the DOS
- idx (int):
site index for which to gather dos s/p/d/f scores
- Returns:
- orbital_scores (dict):
A dictionary of the fractional s/p/d/f orbital scores from the total DOS accumulated from that site. Dictionary structure:
- {cbm: {s: (float), …, f: (float), total: (float)},
vbm: {s: (float), …, f: (float), total: (float)}}
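One way to consume the nested dictionary is to flatten it into parallel lists of labels and values, as a featurizer would need to do. The sketch below uses made-up numbers, and the label pattern (`cbm_score_s`, `cbm_score_total`, and so on) is an assumption modeled on the names in the SiteDOS class description rather than the labels the library is confirmed to emit.

```python
# Hypothetical return value following the documented structure.
orbital_scores = {
    "cbm": {"s": 0.10, "p": 0.60, "d": 0.30, "f": 0.0, "total": 2.4},
    "vbm": {"s": 0.05, "p": 0.90, "d": 0.05, "f": 0.0, "total": 3.1},
}

labels, values = [], []
for edge in ("cbm", "vbm"):
    for key in ("s", "p", "d", "f", "total"):
        labels.append(f"{edge}_score_{key}")  # illustrative label format
        values.append(orbital_scores[edge][key])

for label, value in zip(labels, values):
    print(f"{label}: {value}")
```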
matminer.featurizers.function module¶
- class matminer.featurizers.function.FunctionFeaturizer(expressions=None, multi_feature_depth=1, postprocess=None, combo_function=None, latexify_labels=False)¶
Bases:
BaseFeaturizer
Features from functions applied to existing features, e.g. “1/x”
This featurizer must be fit either by calling .fit_featurize_dataframe or by calling .fit followed by featurize_dataframe.
This class featurizes a dataframe according to a set of expressions representing functions to apply to existing features. The approach here uses sympy-based parsing of string expressions, rather than explicit Python functions. The primary reason for this is to provide better support for book-keeping (e.g. with feature labels), substitution, and elimination of symbolic redundancy, which sympy is well suited for.
Note original feature names in the resulting feature set will have their sympy-illegal characters substituted with underscores. For example:
“exp(-MagpieData_avg_dev_NfValence)/sqrt(MagpieData_range_Number)”
Where the original feature names were
“MagpieData avg_dev NfValence” and “MagpieData range Number”
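The renaming can be reproduced with a few lines of plain Python. This is a sketch that assumes each character listed in ILLEGAL_CHARACTERS is replaced one-for-one by an underscore; the helper `sanitize` and the placeholder variables `x0` and `x1` are hypothetical and not part of the matminer API.

```python
# Character list copied from FunctionFeaturizer.ILLEGAL_CHARACTERS.
ILLEGAL_CHARACTERS = ["|", " ", "/", "\\", "?", "@", "#", "$", "%"]

def sanitize(name):
    """Replace every sympy-illegal character with an underscore."""
    for ch in ILLEGAL_CHARACTERS:
        name = name.replace(ch, "_")
    return name

names = ["MagpieData avg_dev NfValence", "MagpieData range Number"]
clean = [sanitize(n) for n in names]

# Substitute the cleaned names into a two-variable expression template.
expression = "exp(-x0)/sqrt(x1)"
label = expression.replace("x0", clean[0]).replace("x1", clean[1])
print(label)
```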
- Args:
- expressions ([str]): list of sympy-parseable expressions
representing a function of a single variable x, e.g. [“1 / x”, “x ** 2”]; if None, a default list of expressions is used
- multi_feature_depth (int): how many features to include if using
multiple fields for functionalization, e.g. 2 will include pairwise combined features
- postprocess (function or type): type to cast functional outputs
to. For example, to allow complex numbers in the outputs, use postprocess=np.complex128. Defaults to float.
- combo_function (function): function to combine multi-features,
defaults to np.prod (i.e. the product of the expressions). A combo function must cleanly process sympy expressions and take a list of arbitrary length as input. Other options include np.sum.
- latexify_labels (bool): whether to render labels in latex,
defaults to False
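To illustrate why the postprocess cast matters, the sketch below applies a few functions to a single value and casts the results. It is a simplified stand-in: ordinary Python callables replace the parsed sympy expressions, the built-in `complex` replaces np.complex128, and the helper `apply_functions` is hypothetical. The library's own handling of out-of-domain values may differ.

```python
# Python callables standing in for parsed sympy expressions (assumption).
functions = {
    "1/x": lambda x: 1 / x,
    "x**2": lambda x: x ** 2,
    "sqrt(x)": lambda x: x ** 0.5,  # complex result for negative floats
}

def apply_functions(value, postprocess=float):
    """Evaluate every function at `value` and cast each result."""
    return {label: postprocess(fn(value)) for label, fn in functions.items()}

real_features = apply_functions(4.0)
complex_features = apply_functions(-4.0, postprocess=complex)

# With the default float cast, the complex square root cannot be converted.
try:
    apply_functions(-4.0)
    float_cast_failed = False
except TypeError:
    float_cast_failed = True

print(real_features)
print(complex_features)
print("float cast failed for negative input:", float_cast_failed)
```

A float cast is the simpler default when every input lies inside the domain of every expression; a complex cast trades that simplicity for outputs that stay defined on negative inputs.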
- ILLEGAL_CHARACTERS = ['|', ' ', '/', '\\', '?', '@', '#', '$', '%']¶
- __init__(expressions=None, multi_feature_depth=1, postprocess=None, combo_function=None, latexify_labels=False)¶
- citations()¶
Citation(s) and reference(s) for this feature.
- Returns:
- (list) each element should be a string citation,
ideally in BibTeX format.
- property exp_dict¶
Generates a dictionary of expressions keyed by number of variables in each expression
- Returns:
Dictionary of expressions keyed by number of variables
- feature_labels()¶
- Returns:
Set of feature labels corresponding to expressions
- featurize(*args)¶
Main featurizer function, essentially iterates over all of the functions in self.function_list to generate features for each argument.
- Args:
- *args: list of numbers to generate functional output
features
- Returns:
list of functional outputs corresponding to input args
- fit(X, y=None, **fit_kwargs)¶
Sets the feature labels. Not intended to be used by a user, only intended to be invoked as part of featurize_dataframe
- Args:
X (DataFrame or array-like): data to fit to
- Returns:
Set of feature labels corresponding to expressions
- generate_string_expressions(input_variable_names)¶
Method to generate string expressions for input strings, mainly used to generate column names for featurize_dataframe
- Args:
- input_variable_names ([str]): strings corresponding to
functional input variable names
- Returns:
List of string expressions generated by substitution of variable names into functions
- implementors()¶
List of implementors of the feature.
- Returns:
- (list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- matminer.featurizers.function.generate_expressions_combinations(expressions, combo_depth=2, combo_function=<function prod>)¶
This function takes a list of strings representing functions of x, converts them to sympy expressions, and combines them according to the combo_depth parameter. Also filters resultant expressions for any redundant ones determined by sympy expression equivalence.
- Args:
- expressions (strings): all of the sympy-parseable strings
to be converted to expressions and combined, e.g. [“1 / x”, “x ** 2”]; must be functions of x
combo_depth (int): the number of independent variables to consider
combo_function (method): the function which combines the respective expressions provided; defaults to np.prod, i.e. the product of the expressions
- Returns:
- list of unique non-trivial expressions for featurization
of inputs
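The combination step can be sketched at the string level. This is not the library's implementation: it uses regular-expression substitution and string concatenation in place of sympy, so it only removes exact textual duplicates and does not simplify or detect symbolically equivalent expressions. The function name `combine_expressions` and the `x0`, `x1` variable naming are assumptions for illustration.

```python
import itertools
import re

def combine_expressions(expressions, combo_depth=2):
    """Build product expressions over combo_depth independent variables."""
    # One copy of the expression list per variable: x -> x0, x1, ...
    # The word-boundary pattern leaves names such as "exp" untouched.
    per_variable = [
        [re.sub(r"\bx\b", f"x{n}", expr) for expr in expressions]
        for n in range(combo_depth)
    ]
    combos = set()
    for terms in itertools.product(*per_variable):
        combos.add(" * ".join(f"({term})" for term in terms))
    return sorted(combos)

result = combine_expressions(["1 / x", "x ** 2"], combo_depth=2)
for expr in result:
    print(expr)
```

With two input expressions and a depth of two, the sketch yields four pairwise products, one for each choice of expression per variable.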