matminer.data_retrieval package¶
Subpackages¶
- matminer.data_retrieval.tests package
- Submodules
- matminer.data_retrieval.tests.base module
- matminer.data_retrieval.tests.test_retrieve_AFLOW module
- matminer.data_retrieval.tests.test_retrieve_Citrine module
- matminer.data_retrieval.tests.test_retrieve_MDF module
- matminer.data_retrieval.tests.test_retrieve_MP module
- matminer.data_retrieval.tests.test_retrieve_MPDS module
- matminer.data_retrieval.tests.test_retrieve_MongoDB module
- Module contents
Submodules¶
matminer.data_retrieval.retrieve_AFLOW module¶
- class matminer.data_retrieval.retrieve_AFLOW.AFLOWDataRetrieval¶
Bases:
BaseDataRetrieval
Retrieves data from the AFLOW database.
AFLOW uses the AFLUX API syntax, and the aflow library handles the HTTP requests for material properties. Note that this helper library is not an official repository of the AFLOW consortium. However, this library does dynamically generate the keywords supported by the AFLUX API from their servers, which makes it robust against changes in the AFLOW system.
If you use this data retrieval class, please additionally cite: Rose, F., Toher, C., Gossett, E., Oses, C., Nardelli, M.B., Fornari, M., Curtarolo, S., 2017. AFLUX: The LUX materials search API for the AFLOW data repositories. Computational Materials Science 137, 362–370. https://doi.org/10.1016/j.commatsci.2017.04.036
- api_link()¶
The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
- citations()¶
Retrieve a list of formatted strings of bibtex citations which should be cited when using a data retrieval method.
- Returns:
([str]): BibTeX-formatted entries
- get_dataframe(criteria, properties, files=None, request_size=10000, request_limit=0, index_auid=True)¶
Retrieves data from AFLOW in a DataFrame format.
The method builds an AFLUX API query from pymongo-like filter criteria and requested properties. Then, results are collected over HTTP. Note that the “compound”, “auid”, and “aurl” fields are always returned.
- Args:
- criteria: (dict) Pymongo-like query operator. The first-level
dictionary keys must be supported AFLOW properties. The values of the dictionary must be either singletons (int, str, etc.) or dictionaries. The keys of this second-level dictionary can be the pymongo operators ‘$in’, ‘$gt’, ‘$lt’, or ‘$not’. No further nesting is allowed.
- VALID:
{‘auid’: {‘$in’: [‘aflow:a17a2da2f3d3953a’]}}
- INVALID:
{‘auid’: {‘$not’: {‘$in’: [‘aflow:a17a2da2f3d3953a’]}}}
- properties: (list of str) Properties returned in the DataFrame.
See the api link for a list of supported properties.
- files: (list of str) For convenience, specific files may also be
downloaded as pymatgen objects. Each file download is collected by a separate HTTP request (read: slow). The default behavior is to return none of these objects. Supported files:
“prototype_structure” - the prototype structure
“input_structure” - the input structure
“band_structure” - TODO
“dos” - TODO
- request_size: (int) Number of results to return per HTTP request.
- request_limit: (int) Maximum number of requests to submit. The
default behavior is to request all matching records.
- index_auid: (bool) Whether to set the “AFLOW unique identifier” as
the index of the DataFrame.
Returns (pandas.DataFrame): The data requested from the AFLOW database.
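The nesting rule for criteria can be checked locally before any request is sent. The following is a minimal sketch, not part of matminer or the aflow library: `validate_criteria` and `ALLOWED_OPERATORS` are hypothetical names, and the check covers only the nesting rule described above (it does not verify that the keys are supported AFLOW properties).

```python
# Hypothetical helper; not part of matminer or the aflow library.
ALLOWED_OPERATORS = {"$in", "$gt", "$lt", "$not"}


def validate_criteria(criteria):
    """Return True if criteria obeys the two-level nesting rule."""
    for prop, condition in criteria.items():
        if not isinstance(condition, dict):
            # Singleton value (int, str, etc.): acceptable at the first level.
            continue
        for operator, operand in condition.items():
            if operator not in ALLOWED_OPERATORS:
                return False
            if isinstance(operand, dict):
                # A third level of nesting is not allowed.
                return False
    return True


valid = {"auid": {"$in": ["aflow:a17a2da2f3d3953a"]}}
invalid = {"auid": {"$not": {"$in": ["aflow:a17a2da2f3d3953a"]}}}
print(validate_criteria(valid), validate_criteria(invalid))
```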
- static get_relaxed_structure(aurl)¶
Collects the relaxed structure as a pymatgen.Structure.
- Args:
aurl: (str) The url for the material entry in AFLOW.
Returns: (pymatgen.Structure) The relaxed structure.
- class matminer.data_retrieval.retrieve_AFLOW.RetrievalQuery(catalog=None, batch_size=100, step=1)¶
Bases:
Query
Provides instance constructors for pymongo-like queries.
- classmethod from_pymongo(criteria, properties, request_size)¶
Generates an aflow Query object from pymongo-like arguments.
- Args:
- criteria: (dict) Pymongo-like query operator. See the
AFLOWDataRetrieval.get_dataframe method for more details.
- properties: (list of str) Properties returned in the DataFrame.
See the api link for a list of supported properties.
- request_size: (int) Number of results to return per HTTP request.
Note that this is similar to “limit” in pymongo.find.
matminer.data_retrieval.retrieve_Citrine module¶
- class matminer.data_retrieval.retrieve_Citrine.CitrineDataRetrieval(api_key=None)¶
Bases:
BaseDataRetrieval
CitrineDataRetrieval is used to retrieve data from the Citrination database. See the API client docs at api_link below.
- __init__(api_key=None)¶
- Args:
- api_key: (str) Your Citrine API key, or None if
you’ve set the CITRINE_KEY environment variable
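The fallback from an explicit key to the CITRINE_KEY environment variable could look roughly like the sketch below. `resolve_api_key` is a hypothetical helper, not a matminer function; giving the explicit argument precedence and raising `ValueError` when neither source is set are assumptions of this sketch.

```python
import os


def resolve_api_key(api_key=None, env_var="CITRINE_KEY"):
    """Hypothetical helper: prefer the explicit key, else read env_var."""
    if api_key is not None:
        return api_key
    key = os.environ.get(env_var)
    if key is None:
        # Assumed behavior: fail early when no key is available.
        raise ValueError(f"Pass api_key or set the {env_var} environment variable")
    return key


print(resolve_api_key("explicit-key"))
```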
- api_link()¶
The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
- citations()¶
Retrieve a list of formatted strings of bibtex citations which should be cited when using a data retrieval method.
- Returns:
([str]): BibTeX-formatted entries
- get_data(formula=None, prop=None, data_type=None, reference=None, min_measurement=None, max_measurement=None, from_record=None, data_set_id=None, max_results=None)¶
Gets raw API data from Citrine in JSON format. See api_link for more information on input parameters.
- Args:
- formula: (str) filter for the chemical formula field; only those
results that have chemical formulas that contain this string will be returned
- prop: (str) name of the property to search for
- data_type: (str) ‘EXPERIMENTAL’/‘COMPUTATIONAL’/‘MACHINE_LEARNING’;
filter for properties obtained from experimental work, computational methods, or machine learning.
- reference: (str) filter for the reference field; only those
results that have contributors that contain this string will be returned
- min_measurement: (str/num) minimum of the property value range
- max_measurement: (str/num) maximum of the property value range
- from_record: (int) index of first record to return (indexed from 0)
- data_set_id: (int) id of the particular data set to search on
- max_results: (int) number of records to limit the results to
Returns: (list) of jsons/pifs returned by Citrine’s API
- get_dataframe(criteria, properties=None, common_fields=None, secondary_fields=False, print_properties_options=True)¶
Gets a Pandas dataframe object from data retrieved from the Citrine API.
- Args:
- criteria (dict): see get_data method for supported keys except
prop; prop should be included in properties.
- properties ([str]): requested properties/fields/columns.
For example, [“Seebeck coefficient”, “Band gap”]. If unsure about the exact wording or capitalization, try something like [“gap”] together with “max_results”: 3 and print_properties_options=True to see the exact options for this field.
- common_fields ([str]): fields that are common to all the requested
properties. A common example is “chemicalFormula”. Run a quick query and look at the suggested common fields for more information.
- secondary_fields (bool): if True, fields not included in properties
may be added to the output (e.g. references). Recommended only if len(properties)==1
- print_properties_options (bool): whether to print available options
for “properties” and “common_fields” arguments.
Returns: (object) Pandas dataframe object containing the results
- matminer.data_retrieval.retrieve_Citrine.get_value(dict_item)¶
- matminer.data_retrieval.retrieve_Citrine.parse_scalars(scalars)¶
matminer.data_retrieval.retrieve_MDF module¶
- class matminer.data_retrieval.retrieve_MDF.MDFDataRetrieval(anonymous=False, **kwargs)¶
Bases:
BaseDataRetrieval
MDFDataRetrieval is used to retrieve data from the Materials Data Facility database and convert them into a Pandas DataFrame. Note that invocation with full access to MDF will require authentication (see api_link) but an anonymous mode is supported, which can be used with anonymous=True as a keyword arg.
- Examples:
>>> mdf_dr = MDFDataRetrieval(anonymous=True)
>>> results = mdf_dr.get_dataframe({"elements": ["Ag", "Be"], "source_names": ["oqmd"]})
>>> results = mdf_dr.get_dataframe({"source_names": ["oqmd"],
...     "match_ranges": {"oqmd.band_gap.value": [4.0, "*"]}})
If you use this data retrieval class, please additionally cite: Blaiszik, B., Chard, K., Pruyne, J., Ananthakrishnan, R., Tuecke, S., Foster, I., 2016. The Materials Data Facility: Data Services to Advance Materials Science Research. JOM 68, 2045–2052. https://doi.org/10.1007/s11837-016-2001-3
- __init__(anonymous=False, **kwargs)¶
- Args:
- anonymous (bool): whether to use anonymous login (i. e. no
globus authentication)
- **kwargs: kwargs for Forge, including index (globus search index
to search on), local_ep, anonymous
- api_link()¶
The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
- citations()¶
Retrieve a list of formatted strings of bibtex citations which should be cited when using a data retrieval method.
- Returns:
([str]): BibTeX-formatted entries
- get_data(squery, unwind_arrays=True, **kwargs)¶
Gets a dataframe from the MDF API from an explicit string query (rather than input args like get_dataframe).
- Args:
- squery (str): String for explicit query
- unwind_arrays (bool): whether or not to unwind arrays in
flattening docs for dataframe
- **kwargs: kwargs for query
- Returns:
dataframe corresponding to query
- get_dataframe(criteria, properties=None, unwind_arrays=True)¶
Retrieves data from the MDF API and formats it as a Pandas DataFrame.
- Args:
- criteria (dict): options for keys are
- source_names ([str]): source names to include, e. g. [“oqmd”]
- elements ([str]): elements to include, e. g. [“Ag”, “Si”]
- titles ([str]): titles to include, e. g. [“Coarsening of a
semisolid Al-Cu alloy”]
- tags ([str]): tags to include, e. g. [“outcar”]
- resource_types ([str]): resources to include, e. g. [“record”]
- match_fields ({}): field-value mappings to include, e. g.
{“oqmd.converged”: True}
- exclude_fields ({}): field-value mappings to exclude, e. g.
{“oqmd.converged”: False}
- match_ranges ({}): field-range mappings to include, e. g.
{“oqmd.band_gap.value”: [1, 5]}; use “*” for no lower or upper bound, e. g. {“oqmd.band_gap.value”: [1, “*”]}
- exclude_ranges ({}): field-range mappings to exclude, e. g.
{“oqmd.band_gap.value”: [3, “*”]} to exclude all results with a band gap higher than 3
- raw (bool): whether or not to return raw (non-dataframe)
output, defaults to False
- unwind_arrays (bool): whether or not to unwind arrays in
flattening docs for dataframe
- Returns (pandas.DataFrame):
DataFrame corresponding to all documents from aggregated query
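To make the “*” convention in match_ranges and exclude_ranges concrete, here is a small local illustration. It is only a sketch: `in_range` is a hypothetical helper, the records are made up, treating bounds as inclusive is an assumption, and nothing here reflects how MDF actually evaluates a query on its servers.

```python
def in_range(value, bounds):
    """Check value against [lower, upper]; "*" means no bound on that side.

    Inclusive bounds are an assumption of this sketch.
    """
    lower, upper = bounds
    above = lower == "*" or value >= lower
    below = upper == "*" or value <= upper
    return above and below


# Made-up records standing in for documents returned by a search.
records = [{"band_gap": 0.5}, {"band_gap": 2.0}, {"band_gap": 4.5}]
match = [1, "*"]      # keep band gaps of at least 1
exclude = [3, "*"]    # then drop band gaps of 3 or more
kept = [
    r for r in records
    if in_range(r["band_gap"], match) and not in_range(r["band_gap"], exclude)
]
print(kept)
```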
- matminer.data_retrieval.retrieve_MDF.make_dataframe(docs, unwind_arrays=True)¶
Formats raw docs returned from MDF API search into a dataframe
- Args:
- docs [{}]: list of documents from forge search
or aggregation
Returns: DataFrame corresponding to formatted docs
matminer.data_retrieval.retrieve_MP module¶
- class matminer.data_retrieval.retrieve_MP.MPDataRetrieval(api_key=None)¶
Bases:
BaseDataRetrieval
Retrieves data from the Materials Project database.
If you use this data retrieval class, please additionally cite:
Ong, S.P., Cholia, S., Jain, A., Brafman, M., Gunter, D., Ceder, G., Persson, K.A., 2015. The Materials Application Programming Interface (API): A simple, flexible and efficient API for materials data based on REpresentational State Transfer (REST) principles. Computational Materials Science 97, 209–215. https://doi.org/10.1016/j.commatsci.2014.10.037
- __init__(api_key=None)¶
- Args:
- api_key: (str) Your Materials Project API key, or None if you’ve
set up your pymatgen config.
- api_link()¶
The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
- citations()¶
Retrieve a list of formatted strings of bibtex citations which should be cited when using a data retrieval method.
- Returns:
([str]): BibTeX-formatted entries
- get_data(criteria, properties, mp_decode=True, index_mpid=True)¶
- Args:
- criteria: (str/dict) see MPRester.query() for a description of this
parameter. String examples: “mp-1234”, “Fe2O3”, “Li-Fe-O”, “*2O3”. Dict example: {“band_gap”: {“$gt”: 1}}
- properties: (list) see MPRester.query() for a description of this
parameter. Example: [“formula”, “formation_energy_per_atom”]
- mp_decode: (bool) see MPRester.query() for a description of this
parameter. Whether to decode to a Pymatgen object where possible.
- index_mpid: (bool) Whether to set the materials_id as the dataframe
index.
- Returns ([dict]):
a list of jsons that match the criteria and contain properties
- get_dataframe(criteria, properties, index_mpid=True, **kwargs)¶
Gets data from MP in a dataframe format. See api_link for more details.
- Args:
- criteria (dict): the same as in get_data
- properties ([str]): the same properties supported as in get_data
plus: “structure”, “initial_structure”, “final_structure”, “bandstructure” (line mode), “bandstructure_uniform”, “phonon_bandstructure”, “phonon_ddb”, “phonon_dos”. Note that for a long list of compounds, it may take a long time to retrieve some of these objects.
- index_mpid (bool): the same as in get_data
- kwargs (dict): the same keyword arguments as in get_data
Returns (pandas.DataFrame):
- try_get_prop_by_material_id(prop, material_id_list, **kwargs)¶
Calls the relevant get_*_by_material_id method. “prop” is a property, such as bandstructure, that is not among the supported properties of the get_data function but is available through a dedicated method, for example get_bandstructure_by_material_id.
- Args:
- prop (str): the name of the property. Options are:
“bandstructure”, “dos”, “phonon_dos”, “phonon_bandstructure”, “phonon_ddb”
- material_id_list ([str]): list of material_id of compounds
- kwargs (dict): other keyword arguments that get_*_by_material_id
may have; e.g. line_mode in get_bandstructure_by_material_id
- Returns ([target prop object or NaN]):
If the target property is not available for a certain material_id, NaN is returned.
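The NaN-on-failure behavior described above follows a common pattern, sketched here without any network access. `try_get_by_id`, `fake_getter`, and `FAKE_DATA` are hypothetical names; the real method's error handling may differ, and catching every `Exception` is a simplification made for this example.

```python
import math


def try_get_by_id(getter, material_id_list):
    """Call getter for each id; use NaN where the getter fails."""
    results = []
    for material_id in material_id_list:
        try:
            results.append(getter(material_id))
        except Exception:
            # Assumed: any failure means the property is unavailable.
            results.append(float("nan"))
    return results


# Hypothetical stand-in for a get_*_by_material_id method.
FAKE_DATA = {"mp-1": "bandstructure-object"}


def fake_getter(material_id):
    return FAKE_DATA[material_id]  # raises KeyError for unknown ids


out = try_get_by_id(fake_getter, ["mp-1", "mp-2"])
print(out[0], math.isnan(out[1]))
```

Keeping one entry per requested id, even for failures, keeps the result aligned with material_id_list so it can be attached as a column.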
matminer.data_retrieval.retrieve_MPDS module¶
Warning: This retrieval class is to be deprecated in favor of the mpds_client library (pip install mpds_client; https://pypi.org/project/mpds-client), which is fully compatible with matminer.
- exception matminer.data_retrieval.retrieve_MPDS.APIError(msg, code=0)¶
Bases:
Exception
Simple error handling
- __init__(msg, code=0)¶
- class matminer.data_retrieval.retrieve_MPDS.MPDSDataRetrieval(api_key=None, endpoint=None)¶
Bases:
BaseDataRetrieval
Retrieves data from Materials Platform for Data Science (MPDS). See api_link for more information.
Usage:
$> export MPDS_KEY=…
client = MPDSDataRetrieval()
dataframe = client.get_dataframe({"formula": "SrTiO3", "props": "phonons"})
or
jsonobj = client.get_data(
    {"formula": "SrTiO3", "sgs": 99, "props": "atomic properties"},
    fields={
        'S': ["entry", "cell_abc", "sg_n", "basis_noneq", "els_noneq"]
    }
)
or
jsonobj = client.get_data({"formula": "SrTiO3"}, fields={})
If you use this data retrieval class, please additionally cite: Blokhin, E., Villars, P., 2018. The PAULING FILE Project and Materials Platform for Data Science: From Big Data Toward Materials Genome, in: Andreoni, W., Yip, S. (Eds.), Handbook of Materials Modeling: Methods: Theory and Modeling. Springer International Publishing, Cham, pp. 1-26. https://doi.org/10.1007/978-3-319-42913-7_62-2
- __init__(api_key=None, endpoint=None)¶
MPDS API consumer constructor
- Args:
- api_key: (str) The MPDS API key, or None if the MPDS_KEY envvar is set
- endpoint: (str) MPDS API gateway URL
Returns: None
- api_link()¶
The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
- chillouttime = 2¶
- citations()¶
Retrieve a list of formatted strings of bibtex citations which should be cited when using a data retrieval method.
- Returns:
([str]): BibTeX-formatted entries
- static compile_crystal(datarow, flavor='pmg')¶
Helper method for representing the MPDS crystal structures in two flavors: either as a Pymatgen Structure object, or as an ASE Atoms object.
Attention! These two flavors are not compatible, e.g. primitive vs. crystallographic cell is defaulted, atoms wrapped or non-wrapped into the unit cell, etc.
Note that the crystal structures are not retrieved by default, so the following fields must be specified at retrieval:
cell_abc
sg_n
basis_noneq
els_noneq
e.g. like this: {‘S’:[‘cell_abc’, ‘sg_n’, ‘basis_noneq’, ‘els_noneq’]}
NB: occupancies are not considered.
- Args:
- datarow: (list) Required data to construct crystal structure:
[cell_abc, sg_n, basis_noneq, els_noneq]
- flavor: (str) Either “pmg” or “ase”
- Returns:
if flavor is pmg, Pymatgen Structure object
if flavor is ase, ASE Atoms object
- default_properties = ('Phase', 'Formula', 'SG', 'Entry', 'Property', 'Units', 'Value')¶
- endpoint = 'https://api.mpds.io/v0/download/facet'¶
- get_data(criteria, phases=None, fields=None)¶
Retrieve data in JSON. JSON is expected to be valid against the schema at http://developer.mpds.io/mpds.schema.json
- Args:
- criteria (dict): Search query like {“categ_A”: “val_A”, “categ_B”: “val_B”},
documented at http://developer.mpds.io/#Categories
example: criteria={“elements”: “K-Ag”, “classes”: “iodide”,
“props”: “heat capacity”, “lattices”: “cubic”}
- phases (list): Phase IDs, according to the MPDS distinct phases concept
- fields (dict): Data of interest for C-, S-, and P-entries,
e.g. for phase diagrams: {‘C’: [‘naxes’, ‘arity’, ‘shapes’]}, documented at http://developer.mpds.io/#JSON-schemata
- Returns:
List of dicts: C-, S-, and P-entries, the format is documented at http://developer.mpds.io/#JSON-schemata
- get_dataframe(criteria, properties=('Phase', 'Formula', 'SG', 'Entry', 'Property', 'Units', 'Value'), **kwargs)¶
Retrieve data as a Pandas dataframe.
- Args:
- criteria (dict): the same as criteria in get_data
- properties ([str]): list of properties/titles to be included
- **kwargs: other keyword arguments available in get_data
Returns: (object) Pandas DataFrame object containing the results
- maxnpages = 100¶
- pagesize = 1000¶
matminer.data_retrieval.retrieve_MongoDB module¶
- class matminer.data_retrieval.retrieve_MongoDB.MongoDataRetrieval(coll)¶
Bases:
BaseDataRetrieval
- __init__(coll)¶
Retrieves data from a MongoDB collection into a pandas.DataFrame object.
- Args:
coll: A MongoDB collection object
- api_link()¶
The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
- get_dataframe(criteria, properties=None, limit=0, sort=None, idx_field=None, strict=False)¶
- Args:
- criteria: (dict) - a pymongo-style query to filter data records
- properties: ([str] or None) - a list of str fields to retrieve;
dot-notation is allowed (e.g. “structure.lattice.a”). Set to None to try to auto-detect the fields.
- limit: (int) - max number of entries. 0 means no limit
- sort: (tuple) - pymongo-style sort option
- idx_field: (str) - name of field to use as index (must have unique
entries)
- strict: (bool) - if False, replaces missing values with NaN
Returns (pandas.DataFrame):
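Dot-notation lookup and the strict flag can be illustrated on a plain nested document. This is a sketch under stated assumptions rather than matminer's implementation: `get_by_dot_notation` is a hypothetical helper, integer path parts are assumed to index into lists, and raising `KeyError` when strict is True is an assumption (the description above only specifies the non-strict case).

```python
import math


def get_by_dot_notation(doc, field, strict=False):
    """Walk a nested dict/list using a dotted path such as "structure.lattice.a"."""
    value = doc
    for part in field.split("."):
        try:
            # Integer parts index into lists; other parts are dict keys.
            value = value[int(part)] if isinstance(value, list) else value[part]
        except (KeyError, IndexError, ValueError, TypeError):
            if strict:
                # Assumed strict behavior: surface the missing field.
                raise KeyError(field)
            return float("nan")
    return value


doc = {"structure": {"lattice": {"a": 3.9}}, "sites": [{"x": 0.0}, {"x": 0.5}]}
print(get_by_dot_notation(doc, "structure.lattice.a"))
print(get_by_dot_notation(doc, "sites.1.x"))
print(math.isnan(get_by_dot_notation(doc, "structure.lattice.b")))
```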
- matminer.data_retrieval.retrieve_MongoDB.clean_projection(projection)¶
Projecting on both e.g. ‘a.b’ and ‘a’ is disallowed in MongoDB, so project inclusively. See the unit tests for examples of what this does.
- Args:
projection: (list) - list of fields to retrieve; dot-notation is allowed.
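One way to project inclusively is to drop every field whose parent path is also requested, since the parent already contains it. The function below is a sketch of that idea only; `clean_projection_sketch` is a hypothetical name, and the sorting and de-duplication are choices made for this example rather than documented behavior of clean_projection.

```python
def clean_projection_sketch(projection):
    """Drop any field whose parent path is also requested."""
    fields = sorted(set(projection))
    return [
        field for field in fields
        # "a.b" is redundant when "a" is requested, because "a.b" starts with "a.".
        if not any(field.startswith(other + ".") for other in fields)
    ]


print(clean_projection_sketch(["a", "a.b", "c.d"]))
```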
- matminer.data_retrieval.retrieve_MongoDB.is_int(x)¶
- matminer.data_retrieval.retrieve_MongoDB.remove_ints(projection)¶
Transforms a string like “a.1.x” to “a.x” - for Mongo projection purposes
- Args:
projection: (str) the projection to remove ints from
Returns (str)
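The transformation from “a.1.x” to “a.x” can be sketched in one line. `remove_ints_sketch` is a hypothetical stand-in written for illustration; it treats any path component made only of digits as an integer index, which is an assumption about how integers are detected.

```python
def remove_ints_sketch(projection):
    """Drop purely numeric path components, e.g. "a.1.x" -> "a.x"."""
    return ".".join(part for part in projection.split(".") if not part.isdigit())


print(remove_ints_sketch("a.1.x"))
```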
matminer.data_retrieval.retrieve_base module¶
- class matminer.data_retrieval.retrieve_base.BaseDataRetrieval¶
Bases:
object
Abstract class to retrieve data from various material APIs while adhering to a quasi-standard format for querying.
## Implementing a new DataRetrieval class
If you have an API which you’d like to incorporate into matminer’s data retrieval tools, using BaseDataRetrieval is the preferred way of doing so. All DataRetrieval classes should subclass BaseDataRetrieval and implement the following:
get_dataframe()
api_link()
Retrieving data should be done by the user with get_dataframe. Criteria should be a dictionary which will be used to form a query to the database. Properties should be a list which defines the columns that will be returned. While the ‘criteria’ and ‘properties’ arguments may have different valid values depending on the database, they should always have sensible formats and names if possible. For example, the user should be calling this:
df = MyDataRetrieval().get_dataframe(criteria={'band_gap': 0.0},
                                     properties=['structure'])
…or this:
df = MyDataRetrieval().get_dataframe(criteria={'band_gap': [0.0, 0.15]},
                                     properties=["density of states"])
NOT this:
df = MyDataRetrieval().get_dataframe(criteria={'query.bg[0] && band_gap': 0.0},
                                     properties=['Struct.page[Value]'])
The implemented DataRetrieval class should handle the conversion from a ‘sensible’ query to a query fit for the individual API and database.
There may be cases where a ‘sensible’ query is not sufficient to define a query to the API; in this case, use the get_dataframe kwargs sparingly to augment the criteria, properties, or form of the underlying API query.
A method for accessing raw DB data with an API-native query may be provided by overriding get_data. The link to the original API documentation must be provided by overriding api_link().
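The conversion from a ‘sensible’ query to an API-native one might look like the following sketch. It assumes, purely for illustration, that the underlying API accepts pymongo-style filters and that a two-element list means an inclusive [min, max] range; `to_native_query` is a hypothetical helper that a subclass could call inside get_dataframe.

```python
def to_native_query(criteria):
    """Translate 'sensible' criteria into a pymongo-style filter (assumed target)."""
    query = {}
    for name, value in criteria.items():
        if isinstance(value, (list, tuple)) and len(value) == 2:
            lower, upper = value
            # Assumed convention: a two-element value is an inclusive range.
            query[name] = {"$gte": lower, "$lte": upper}
        else:
            # A single value is matched exactly.
            query[name] = value
    return query


print(to_native_query({"band_gap": 0.0}))
print(to_native_query({"band_gap": [0.0, 0.15]}))
```

Keeping the translation in one small function makes the user-facing criteria format easy to document and test separately from the API calls.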
## Documenting a DataRetrieval class
The class documentation for each DataRetrieval class must contain a brief description of the possible data that can be retrieved with the API source. It should also detail the form of the criteria and properties that can be retrieved with the class, and/or should link to a web page showing this information. The options of the class must all be defined in the __init__ function of the class, and we recommend documenting them using the [Google style](https://google.github.io/styleguide/pyguide.html).
- api_link()¶
The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
- citations()¶
Retrieve a list of formatted strings of bibtex citations which should be cited when using a data retrieval method.
- Returns:
([str]): BibTeX-formatted entries
- get_dataframe(criteria, properties, **kwargs)¶
Retrieve a dataframe of properties from the database which satisfy criteria.
- Args:
- criteria (dict): The name of each criterion is the key; the value
or range of the criterion is the value.
- properties (list): Properties to return from the query matching
the criteria. For example, [‘structure’, ‘formula’]
- Returns:
- (pandas DataFrame) The dataframe containing properties as columns
and samples as rows.