matminer.data_retrieval package

Subpackages

Submodules

matminer.data_retrieval.retrieve_AFLOW module

class matminer.data_retrieval.retrieve_AFLOW.AFLOWDataRetrieval

Bases: BaseDataRetrieval

Retrieves data from the AFLOW database.

AFLOW uses the AFLUX API syntax, and the aflow library handles the HTTP requests for material properties. Note that this helper library is not an official repository of the AFLOW consortium. However, this library does dynamically generate the keywords supported by the AFLUX API from their servers, which makes it robust against changes in the AFLOW system.

If you use this data retrieval class, please additionally cite: Rose, F., Toher, C., Gossett, E., Oses, C., Nardelli, M.B., Fornari, M., Curtarolo, S., 2017. AFLUX: The LUX materials search API for the AFLOW data repositories. Computational Materials Science 137, 362–370. https://doi.org/10.1016/j.commatsci.2017.04.036

api_link()

The link to comprehensive API documentation or data source.

Returns:

(str): A link to the API documentation for this DataRetrieval class.

citations()

Retrieves a list of BibTeX-formatted citation strings that should be cited when using a data retrieval method.

Returns:

([str]): BibTeX-formatted entries

get_dataframe(criteria, properties, files=None, request_size=10000, request_limit=0, index_auid=True)

Retrieves data from AFLOW in a DataFrame format.

The method builds an AFLUX API query from pymongo-like filter criteria and requested properties. Then, results are collected over HTTP. Note that the “compound”, “auid”, and “aurl” fields are always returned.

Args:

criteria: (dict) Pymongo-like query operator. The first-level dictionary keys must be supported AFLOW properties. The values of the dictionary must be either singletons (int, str, etc.) or dictionaries. The keys of this second-level dictionary can be the pymongo operators ‘$in’, ‘$gt’, ‘$lt’, or ‘$not’. There cannot be further nesting.

VALID:

{'auid': {'$in': ['aflow:a17a2da2f3d3953a']}}

INVALID:

{'auid': {'$not': {'$in': ['aflow:a17a2da2f3d3953a']}}}

properties: (list of str) Properties returned in the DataFrame. See the API link for a list of supported properties.

files: (list of str) For convenience, specific files may also be downloaded as pymatgen objects. Each file download is collected by a separate HTTP request, which is slow. The default behavior is to return none of these objects. Supported files:

  • “prototype_structure” - the prototype structure

  • “input_structure” - the input structure

  • “band_structure” - TODO

  • “dos” - TODO

request_size: (int) Number of results to return per HTTP request.

request_limit: (int) Maximum number of requests to submit. The default behavior is to request all matching records.

index_auid: (bool) Whether to set the “AFLOW unique identifier” as the index of the DataFrame.

Returns (pandas.DataFrame): The data requested from the AFLOW database.
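The nesting rules for criteria can be checked before a query is sent. The following is a minimal sketch, not part of matminer: check_criteria is a hypothetical helper, and it only checks the operator names and the nesting depth. It cannot tell whether the first-level keys are supported AFLOW properties, because that list comes from the AFLOW servers.

```python
ALLOWED_OPERATORS = {"$in", "$gt", "$lt", "$not"}


def check_criteria(criteria):
    """Raise ValueError if `criteria` breaks the nesting rules described above.

    Hypothetical helper for illustration; it is not a matminer function.
    """
    for prop, condition in criteria.items():
        if not isinstance(condition, dict):
            continue  # singleton value such as an int or a str
        for operator, operand in condition.items():
            if operator not in ALLOWED_OPERATORS:
                raise ValueError(f"unsupported operator {operator!r} for {prop!r}")
            if isinstance(operand, dict):
                raise ValueError(f"nested operators are not allowed for {prop!r}")
    return True


valid = {"auid": {"$in": ["aflow:a17a2da2f3d3953a"]}}
invalid = {"auid": {"$not": {"$in": ["aflow:a17a2da2f3d3953a"]}}}

print(check_criteria(valid))
try:
    check_criteria(invalid)
except ValueError as err:
    print("rejected:", err)
```

The two dictionaries are the VALID and INVALID examples from the criteria description.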

static get_relaxed_structure(aurl)

Collects the relaxed structure as a pymatgen.Structure.

Args:

aurl: (str) The url for the material entry in AFLOW.

Returns: (pymatgen.Structure) The relaxed structure.

class matminer.data_retrieval.retrieve_AFLOW.RetrievalQuery(catalog=None, batch_size=100, step=1)

Bases: Query

Provides instance constructors for pymongo-like queries.

classmethod from_pymongo(criteria, properties, request_size)

Generates an aflow Query object from pymongo-like arguments.

Args:

criteria: (dict) Pymongo-like query operator. See the AFLOWDataRetrieval.get_dataframe method for more details.

properties: (list of str) Properties returned in the DataFrame. See the API link for a list of supported properties.

request_size: (int) Number of results to return per HTTP request. Note that this is similar to “limit” in pymongo.find.

matminer.data_retrieval.retrieve_Citrine module

class matminer.data_retrieval.retrieve_Citrine.CitrineDataRetrieval(api_key=None)

Bases: BaseDataRetrieval

CitrineDataRetrieval is used to retrieve data from the Citrination database. See the API client docs at api_link below.

__init__(api_key=None)
Args:

api_key: (str) Your Citrine API key, or None if you’ve set the CITRINE_KEY environment variable

api_link()

The link to comprehensive API documentation or data source.

Returns:

(str): A link to the API documentation for this DataRetrieval class.

citations()

Retrieves a list of BibTeX-formatted citation strings that should be cited when using a data retrieval method.

Returns:

([str]): BibTeX-formatted entries

get_data(formula=None, prop=None, data_type=None, reference=None, min_measurement=None, max_measurement=None, from_record=None, data_set_id=None, max_results=None)

Gets raw API data from Citrine in JSON format. See api_link for more information on the input parameters.

Args:

formula: (str) filter for the chemical formula field; only those results that have chemical formulas that contain this string will be returned

prop: (str) name of the property to search for

data_type: (str) ‘EXPERIMENTAL’/‘COMPUTATIONAL’/‘MACHINE_LEARNING’; filter for properties obtained from experimental work, computational methods, or machine learning.

reference: (str) filter for the reference field; only those results that have contributors that contain this string will be returned

min_measurement: (str/num) minimum of the property value range

max_measurement: (str/num) maximum of the property value range

from_record: (int) index of first record to return (indexed from 0)

data_set_id: (int) id of the particular data set to search on

max_results: (int) number of records to limit the results to

Returns: (list) of jsons/pifs returned by Citrine’s API

get_dataframe(criteria, properties=None, common_fields=None, secondary_fields=False, print_properties_options=True)

Gets a Pandas dataframe object from data retrieved from the Citrine API.

Args:

criteria (dict): see the get_data method for supported keys, except prop; prop should be included in properties.

properties ([str]): requested properties/fields/columns. For example, [“Seebeck coefficient”, “Band gap”]. If unsure about the exact wording, capitalization, etc., try something like [“gap”] with “max_results”: 3 and print_properties_options=True to see the exact options for this field.

common_fields ([str]): fields that are common to all the requested properties. A common example is “chemicalFormula”. Look for suggested common fields after a quick query for more information.

secondary_fields (bool): if True, fields not included in properties may be added to the output (e.g. references). Recommended only if len(properties)==1.

print_properties_options (bool): whether to print available options for the “properties” and “common_fields” arguments.

Returns: (object) Pandas dataframe object containing the results

matminer.data_retrieval.retrieve_Citrine.get_value(dict_item)
matminer.data_retrieval.retrieve_Citrine.parse_scalars(scalars)

matminer.data_retrieval.retrieve_MDF module

class matminer.data_retrieval.retrieve_MDF.MDFDataRetrieval(anonymous=False, **kwargs)

Bases: BaseDataRetrieval

MDFDataRetrieval is used to retrieve data from the Materials Data Facility database and convert them into a Pandas DataFrame. Note that invocation with full access to MDF will require authentication (see api_link) but an anonymous mode is supported, which can be used with anonymous=True as a keyword arg.

Examples:

>>> mdf_dr = MDFDataRetrieval(anonymous=True)
>>> results = mdf_dr.get_dataframe({"elements": ["Ag", "Be"], "source_names": ["oqmd"]})

>>> results = mdf_dr.get_dataframe({"source_names": ["oqmd"],
...     "match_ranges": {"oqmd.band_gap.value": [4.0, "*"]}})

If you use this data retrieval class, please additionally cite: Blaiszik, B., Chard, K., Pruyne, J., Ananthakrishnan, R., Tuecke, S., Foster, I., 2016. The Materials Data Facility: Data Services to Advance Materials Science Research. JOM 68, 2045–2052. https://doi.org/10.1007/s11837-016-2001-3

__init__(anonymous=False, **kwargs)
Args:

anonymous (bool): whether to use anonymous login (i.e. no Globus authentication)

**kwargs: kwargs for Forge, including index (Globus search index to search on), local_ep, anonymous

api_link()

The link to comprehensive API documentation or data source.

Returns:

(str): A link to the API documentation for this DataRetrieval class.

citations()

Retrieves a list of BibTeX-formatted citation strings that should be cited when using a data retrieval method.

Returns:

([str]): BibTeX-formatted entries

get_data(squery, unwind_arrays=True, **kwargs)

Gets a dataframe from the MDF API from an explicit string query (rather than input args like get_dataframe).

Args:

squery (str): String for explicit query

unwind_arrays (bool): whether or not to unwind arrays in flattening docs for dataframe

**kwargs: kwargs for query

Returns:

dataframe corresponding to query

get_dataframe(criteria, properties=None, unwind_arrays=True)

Retrieves data from the MDF API and formats it as a Pandas DataFrame.

Args:

criteria (dict): options for keys are

  • source_names ([str]): source names to include, e.g. ["oqmd"]

  • elements ([str]): elements to include, e.g. ["Ag", "Si"]

  • titles ([str]): titles to include, e.g. ["Coarsening of a semisolid Al-Cu alloy"]

  • tags ([str]): tags to include, e.g. ["outcar"]

  • resource_types ([str]): resources to include, e.g. ["record"]

  • match_fields ({}): field-value mappings to include, e.g. {"oqmd.converged": True}

  • exclude_fields ({}): field-value mappings to exclude, e.g. {"oqmd.converged": False}

  • match_ranges ({}): field-range mappings to include, e.g. {"oqmd.band_gap.value": [1, 5]}; use "*" for no lower or upper bound, e.g. {"oqmd.band_gap.value": [1, "*"]}

  • exclude_ranges ({}): field-range mappings to exclude, e.g. {"oqmd.band_gap.value": [3, "*"]} to exclude all results with band gap higher than 3

raw (bool): whether or not to return raw (non-dataframe) output, defaults to False

unwind_arrays (bool): whether or not to unwind arrays in flattening docs for dataframe

Returns (pandas.DataFrame):

DataFrame corresponding to all documents from aggregated query
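The meaning of "*" in match_ranges and exclude_ranges can be illustrated locally. This is only a sketch of the range semantics. The real filtering is performed by the MDF search service, in_range is a hypothetical helper, the records are made up, and treating both endpoints as inclusive is an assumption.

```python
def in_range(value, bounds):
    """Return True if `value` lies within [low, high]; "*" leaves that side open.

    Hypothetical helper; inclusive endpoints are an assumption.
    """
    low, high = bounds
    if low != "*" and value < low:
        return False
    if high != "*" and value > high:
        return False
    return True


# Made-up records standing in for documents returned by a search.
records = [
    {"name": "A", "band_gap": 0.5},
    {"name": "B", "band_gap": 4.0},
    {"name": "C", "band_gap": 6.2},
]

# Equivalent in spirit to match_ranges={"band_gap": [4.0, "*"]}.
wide_gap = [r["name"] for r in records if in_range(r["band_gap"], [4.0, "*"])]
print(wide_gap)
```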

matminer.data_retrieval.retrieve_MDF.make_dataframe(docs, unwind_arrays=True)

Formats raw docs returned from an MDF API search into a dataframe.

Args:

docs [{}]: list of documents from forge search or aggregation

Returns: DataFrame corresponding to formatted docs

matminer.data_retrieval.retrieve_MP module

class matminer.data_retrieval.retrieve_MP.MPDataRetrieval(api_key=None)

Bases: BaseDataRetrieval

Retrieves data from the Materials Project database.

If you use this data retrieval class, please additionally cite:

Ong, S.P., Cholia, S., Jain, A., Brafman, M., Gunter, D., Ceder, G., Persson, K.A., 2015. The Materials Application Programming Interface (API): A simple, flexible and efficient API for materials data based on REpresentational State Transfer (REST) principles. Computational Materials Science 97, 209–215. https://doi.org/10.1016/j.commatsci.2014.10.037

__init__(api_key=None)
Args:

api_key: (str) Your Materials Project API key, or None if you’ve set up your pymatgen config.

api_link()

The link to comprehensive API documentation or data source.

Returns:

(str): A link to the API documentation for this DataRetrieval class.

citations()

Retrieves a list of BibTeX-formatted citation strings that should be cited when using a data retrieval method.

Returns:

([str]): BibTeX-formatted entries

get_data(criteria, properties, mp_decode=True, index_mpid=True)
Args:

criteria: (str/dict) see MPRester.query() for a description of this parameter. String examples: “mp-1234”, “Fe2O3”, “Li-Fe-O”, “*2O3”. Dict example: {"band_gap": {"$gt": 1}}

properties: (list) see MPRester.query() for a description of this parameter. Example: ["formula", "formation_energy_per_atom"]

mp_decode: (bool) see MPRester.query() for a description of this parameter. Whether to decode to a Pymatgen object where possible.

index_mpid: (bool) Whether to set the materials_id as the dataframe index.

Returns ([dict]):

a list of jsons that match the criteria and contain properties

get_dataframe(criteria, properties, index_mpid=True, **kwargs)

Gets data from MP in a dataframe format. See api_link for more details.

Args:

criteria (dict): the same as in get_data

properties ([str]): the same properties supported as in get_data, plus: “structure”, “initial_structure”, “final_structure”, “bandstructure” (line mode), “bandstructure_uniform”, “phonon_bandstructure”, “phonon_ddb”, “phonon_dos”. Note that for a long list of compounds, it may take a long time to retrieve some of these objects.

index_mpid (bool): the same as in get_data

kwargs (dict): the same keyword arguments as in get_data

Returns (pandas.DataFrame):

try_get_prop_by_material_id(prop, material_id_list, **kwargs)

Calls the relevant get_*_by_material_id method. “prop” is a property, such as bandstructure, that is not among the properties supported by get_data but is available through a dedicated method, for example get_bandstructure_by_material_id.

Args:

prop (str): the name of the property. Options are: “bandstructure”, “dos”, “phonon_dos”, “phonon_bandstructure”, “phonon_ddb”

material_id_list ([str]): list of material_id of compounds

kwargs (dict): other keyword arguments that get_*_by_material_id may have; e.g. line_mode in get_bandstructure_by_material_id

Returns ([target prop object or NaN]):

If the target property is not available for a certain material_id, NaN is returned.
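The NaN fallback described above can be sketched with a stand-in fetcher. This is an illustration of the pattern only: try_get, fake_get_dos, and FAKE_DOS are hypothetical names, and the broad except clause is a simplification. How the actual method detects a missing property is not stated here.

```python
import math


def try_get(fetch, material_ids, **kwargs):
    """Call `fetch` for each id and substitute NaN when the lookup fails.

    Hypothetical helper; catching every Exception is a simplification.
    """
    results = []
    for material_id in material_ids:
        try:
            results.append(fetch(material_id, **kwargs))
        except Exception:
            results.append(float("nan"))
    return results


# Made-up data standing in for a remote database.
FAKE_DOS = {"mp-1": "dos-object-1"}


def fake_get_dos(material_id):
    """Stand-in for a get_*_by_material_id method; raises KeyError if unknown."""
    return FAKE_DOS[material_id]


values = try_get(fake_get_dos, ["mp-1", "mp-2"])
print(values[0], math.isnan(values[1]))
```

Returning one entry per input id keeps the result aligned with material_id_list, so it can be attached to a DataFrame as a column.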

matminer.data_retrieval.retrieve_MPDS module

Warning: This retrieval class is to be deprecated in favor of the mpds_client library (https://pypi.org/project/mpds-client), which is fully compatible with matminer. Install it with pip install mpds_client.

exception matminer.data_retrieval.retrieve_MPDS.APIError(msg, code=0)

Bases: Exception

Simple error handling

__init__(msg, code=0)
class matminer.data_retrieval.retrieve_MPDS.MPDSDataRetrieval(api_key=None, endpoint=None)

Bases: BaseDataRetrieval

Retrieves data from Materials Platform for Data Science (MPDS). See api_link for more information.

Usage:

$> export MPDS_KEY=…

client = MPDSDataRetrieval()

dataframe = client.get_dataframe({"formula": "SrTiO3", "props": "phonons"})

or

jsonobj = client.get_data(
    {"formula": "SrTiO3", "sgs": 99, "props": "atomic properties"},
    fields={"S": ["entry", "cell_abc", "sg_n", "basis_noneq", "els_noneq"]}
)

or

jsonobj = client.get_data({"formula": "SrTiO3"}, fields={})

If you use this data retrieval class, please additionally cite: Blokhin, E., Villars, P., 2018. The PAULING FILE Project and Materials Platform for Data Science: From Big Data Toward Materials Genome, in: Andreoni, W., Yip, S. (Eds.), Handbook of Materials Modeling: Methods: Theory and Modeling. Springer International Publishing, Cham, pp. 1-26. https://doi.org/10.1007/978-3-319-42913-7_62-2

__init__(api_key=None, endpoint=None)

MPDS API consumer constructor

Args:

api_key: (str) The MPDS API key, or None if the MPDS_KEY envvar is set

endpoint: (str) MPDS API gateway URL

Returns: None

api_link()

The link to comprehensive API documentation or data source.

Returns:

(str): A link to the API documentation for this DataRetrieval class.

chillouttime = 2
citations()

Retrieves a list of BibTeX-formatted citation strings that should be cited when using a data retrieval method.

Returns:

([str]): BibTeX-formatted entries

static compile_crystal(datarow, flavor='pmg')

Helper method for representing the MPDS crystal structures in two flavors: either as a Pymatgen Structure object, or as an ASE Atoms object.

Attention! These two flavors are not compatible, e.g. primitive vs. crystallographic cell is defaulted, atoms wrapped or non-wrapped into the unit cell, etc.

Note that the crystal structures are not retrieved by default, so the following fields must be specified during retrieval:

  • cell_abc

  • sg_n

  • basis_noneq

  • els_noneq

e.g. like this: {'S': ['cell_abc', 'sg_n', 'basis_noneq', 'els_noneq']}

NB: occupancies are not considered.

Args:

datarow: (list) Required data to construct crystal structure: [cell_abc, sg_n, basis_noneq, els_noneq]

flavor: (str) Either “pmg” or “ase”

Returns:
  • if flavor is pmg, Pymatgen Structure object

  • if flavor is ase, ASE Atoms object

default_properties = ('Phase', 'Formula', 'SG', 'Entry', 'Property', 'Units', 'Value')
endpoint = 'https://api.mpds.io/v0/download/facet'
get_data(criteria, phases=None, fields=None)

Retrieve data in JSON. JSON is expected to be valid against the schema at http://developer.mpds.io/mpds.schema.json

Args:

criteria (dict): Search query like {"categ_A": "val_A", "categ_B": "val_B"} (documented at http://developer.mpds.io/#Categories). Example: criteria={"elements": "K-Ag", "classes": "iodide", "props": "heat capacity", "lattices": "cubic"}

phases (list): Phase IDs, according to the MPDS distinct phases concept

fields (dict): Data of interest for C-, S-, and P-entries, e.g. for phase diagrams: {'C': ['naxes', 'arity', 'shapes']}, documented at http://developer.mpds.io/#JSON-schemata

Returns:

List of dicts: C-, S-, and P-entries, the format is documented at http://developer.mpds.io/#JSON-schemata

get_dataframe(criteria, properties=('Phase', 'Formula', 'SG', 'Entry', 'Property', 'Units', 'Value'), **kwargs)

Retrieve data as a Pandas dataframe.

Args:

criteria (dict): the same as criteria in get_data

properties ([str]): list of properties/titles to be included

**kwargs: other keyword arguments available in get_data

Returns: (object) Pandas DataFrame object containing the results

maxnpages = 100
pagesize = 1000

matminer.data_retrieval.retrieve_MongoDB module

class matminer.data_retrieval.retrieve_MongoDB.MongoDataRetrieval(coll)

Bases: BaseDataRetrieval

__init__(coll)

Retrieves data from a MongoDB collection to a pandas.DataFrame object.

Args:

coll: A MongoDB collection object

api_link()

The link to comprehensive API documentation or data source.

Returns:

(str): A link to the API documentation for this DataRetrieval class.

get_dataframe(criteria, properties=None, limit=0, sort=None, idx_field=None, strict=False)
Args:

criteria: (dict) - a pymongo-style query to filter data records

properties: ([str] or None) - a list of str fields to retrieve; dot-notation is allowed (e.g. “structure.lattice.a”). Set to None to try to auto-detect the fields.

limit: (int) - max number of entries. 0 means no limit

sort: (tuple) - pymongo-style sort option

idx_field: (str) - name of field to use as index (must have unique entries)

strict: (bool) - if False, replaces missing values with NaN

Returns (pandas.DataFrame):
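Dot-notation properties and the strict flag can be pictured as a lookup into a nested document. The sketch below is an illustration under stated assumptions, not matminer's implementation: get_field is a hypothetical helper, the document is made up, list indices inside the path are not handled, and raising KeyError when strict=True is an assumption.

```python
import math


def get_field(doc, dotted_key, strict=False):
    """Follow a dot-notation key through nested dicts.

    Hypothetical helper: a missing key yields NaN unless `strict` is True,
    in which case a KeyError is raised (an assumed behavior).
    """
    current = doc
    for part in dotted_key.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif strict:
            raise KeyError(dotted_key)
        else:
            return float("nan")
    return current


# Made-up document standing in for a MongoDB record.
doc = {"formula": "SrTiO3", "structure": {"lattice": {"a": 3.9}}}

print(get_field(doc, "structure.lattice.a"))
print(math.isnan(get_field(doc, "structure.lattice.b")))
```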

matminer.data_retrieval.retrieve_MongoDB.clean_projection(projection)

Projecting on both a field and one of its sub-fields, e.g. ‘a.b’ and ‘a’, is disallowed in MongoDB, so project inclusively. See unit tests for examples of what this is doing.

Args:

projection: (list) - list of fields to retrieve; dot-notation is allowed.
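One way to read "project inclusively" is to drop any field that is already covered by a requested parent field. The sketch below shows that reading only. drop_covered_fields is a hypothetical name, and the result may differ from what clean_projection actually returns (for example in ordering).

```python
def drop_covered_fields(projection):
    """Keep only fields that are not nested under another requested field.

    Hypothetical helper illustrating inclusive projection.
    """
    fields = set(projection)
    kept = [
        field
        for field in fields
        if not any(
            field != other and field.startswith(other + ".") for other in fields
        )
    ]
    return sorted(kept)


print(drop_covered_fields(["a.b", "a", "c.d"]))
```

Comparing against other + "." rather than other alone avoids treating "ab" as a sub-field of "a".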

matminer.data_retrieval.retrieve_MongoDB.is_int(x)
matminer.data_retrieval.retrieve_MongoDB.remove_ints(projection)

Transforms a string like “a.1.x” to “a.x” - for Mongo projection purposes

Args:

projection: (str) the projection to remove ints from

Returns (str)
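The transformation can be sketched in a few lines. strip_int_parts is a hypothetical name used for illustration, and using str.isdigit to detect integer path components is an assumption about what counts as an integer.

```python
def strip_int_parts(projection):
    """Drop purely numeric components from a dot-notation path.

    Hypothetical helper; str.isdigit is an assumed test for integers.
    """
    parts = projection.split(".")
    return ".".join(part for part in parts if not part.isdigit())


print(strip_int_parts("a.1.x"))
```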

matminer.data_retrieval.retrieve_base module

class matminer.data_retrieval.retrieve_base.BaseDataRetrieval

Bases: object

Abstract class to retrieve data from various material APIs while adhering to a quasi-standard format for querying.

## Implementing a new DataRetrieval class

If you have an API which you’d like to incorporate into matminer’s data retrieval tools, using BaseDataRetrieval is the preferred way of doing so. All DataRetrieval classes should subclass BaseDataRetrieval and implement the following:

  • get_dataframe()

  • api_link()

Retrieving data should be done by the user with get_dataframe. Criteria should be a dictionary which will be used to form a query to the database. Properties should be a list which defines the columns that will be returned. While the ‘criteria’ and ‘properties’ arguments may have different valid values depending on the database, they should always have sensible formats and names if possible. For example, the user should be calling this:

df = MyDataRetrieval().get_dataframe(criteria={'band_gap': 0.0}, properties=['structure'])

…or this:

df = MyDataRetrieval().get_dataframe(criteria={'band_gap': [0.0, 0.15]}, properties=["density of states"])

NOT this:

df = MyDataRetrieval().get_dataframe(criteria={'query.bg[0] && band_gap': 0.0}, properties=['Struct.page[Value]'])

The implemented DataRetrieval class should handle the conversion from a ‘sensible’ query to a query fit for the individual API and database.
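As an illustration of such a conversion, the sketch below turns ‘sensible’ criteria into a pymongo-style filter. Both the function name to_native_query and the choice of a pymongo-style target are assumptions made for this example. A real DataRetrieval class would translate to whatever its own API expects.

```python
def to_native_query(criteria):
    """Translate 'sensible' criteria into a pymongo-style filter.

    Hypothetical helper: a two-element list or tuple is read as an
    inclusive [low, high] range, and any other value as an exact match.
    """
    query = {}
    for name, value in criteria.items():
        if isinstance(value, (list, tuple)) and len(value) == 2:
            low, high = value
            query[name] = {"$gte": low, "$lte": high}
        else:
            query[name] = value
    return query


print(to_native_query({"band_gap": [0.0, 0.15]}))
print(to_native_query({"band_gap": 0.0}))
```

Keeping the translation in one small function means get_dataframe itself only has to run the native query and arrange the results into columns.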

There may be cases where a ‘sensible’ query is not sufficient to define a query to the API; in this case, use the get_dataframe kwargs sparingly to augment the criteria, properties, or form of the underlying API query.

A method for accessing raw DB data with an API-native query may be provided by overriding get_data. The link to the original API documentation must be provided by overriding api_link().

## Documenting a DataRetrieval class

The class documentation for each DataRetrieval class must contain a brief description of the possible data that can be retrieved with the API source. It should also detail the form of the criteria and properties that can be retrieved with the class, and/or should link to a web page showing this information. The options of the class must all be defined in the __init__ function of the class, and we recommend documenting them using the [Google style](https://google.github.io/styleguide/pyguide.html).

api_link()

The link to comprehensive API documentation or data source.

Returns:

(str): A link to the API documentation for this DataRetrieval class.

citations()

Retrieves a list of BibTeX-formatted citation strings that should be cited when using a data retrieval method.

Returns:

([str]): BibTeX-formatted entries

get_dataframe(criteria, properties, **kwargs)

Retrieve a dataframe of properties from the database which satisfy criteria.

Args:

criteria (dict): The name of each criterion is the key; the value or range of the criterion is the value.

properties (list): Properties to return from the query matching the criteria. For example, ['structure', 'formula']

Returns:

(pandas DataFrame) The dataframe containing properties as columns and samples as rows.

Module contents