Guide to adding datasets to matminer¶
All information current as of 10/24/2018
In addition to providing tools for retrieving current data from several standard materials science databases, matminer also provides a suite of static datasets pre-formatted as pandas DataFrame objects and stored as compressed JSON files. These files are stored on Figshare, a long-term academic data storage platform, and include metadata that makes each dataset discoverable and clearly linked to the research and contributors that generated the data.
This functionality serves two purposes: first, to give others a quick and easy way to access data used in research by the Hacking Materials group, and second, to provide the community with a set of standard datasets that can be used for benchmarking. As the application of machine learning in materials science matures, there is a growing need for benchmark datasets that researchers can use as standard training and testing sets when comparing model performance.
To add a dataset to the collection currently supported by matminer, there are six primary steps:
1. Fork the matminer repository on GitHub¶
The matminer code base, including the metadata that defines how matminer handles datasets, is available on GitHub.
All editing should take place either within the dev_scripts/dataset_management folder or within matminer/datasets
2. Prepare the dataset for long term hosting¶
To work properly with matminer’s loading functions, all datasets are assumed to be pandas DataFrame objects stored as JSON files using the MontyEncoder encoding scheme available in the monty Python package. Any dataset added to matminer must meet this requirement.
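For illustration only, serializing a DataFrame this way looks roughly like the sketch below; the prep script described next (and matminer’s helpers in matminer.utils.io) handle this for you, and the file name, columns, and JSON orientation here are placeholder assumptions.

import gzip
import json

import pandas as pd
from monty.json import MontyEncoder

# Placeholder data; MontyEncoder lets pymatgen objects (e.g. Structure)
# and numpy types survive JSON serialization.
df = pd.DataFrame({"formula": ["Al2O3", "SiO2"], "n_sites": [10, 9]})

# Write the DataFrame as compressed JSON. The orientation used here is only
# an assumption for the sketch; the prep script defines the actual format.
with gzip.open("example_dataset.json.gz", "wt") as f:
    json.dump(df.to_dict(orient="split"), f, cls=MontyEncoder)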
The script prep_dataset_for_figshare.py was written to expedite and standardize this process. If the dataset being uploaded needs no modification from the contents stored in the file, you can simply run this script to convert your dataset to the desired format like so:

python prep_dataset_for_figshare.py -fp /path/to/dataset(s) -ct (compression_type: gz or bz2)
This script can take multiple file paths and/or directory names. If given a directory name, it will crawl the directory and try to process all files within. The prepped files will then be available in ~/dataset_to_json/ along with a .txt file containing some metadata on the files, which will be used later on.
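As an optional sanity check, you can load a prepped file back into a DataFrame before uploading it. The sketch below assumes matminer’s load_dataframe_from_json helper in matminer.utils.io; the file name is a placeholder.

import os

from matminer.utils.io import load_dataframe_from_json

# "my_dataset.json.gz" is a placeholder for the name of your prepped file.
path = os.path.expanduser("~/dataset_to_json/my_dataset.json.gz")
df = load_dataframe_from_json(path)
print(df.head())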
If the dataset does need modification, or if you would like the dataset name to differ from that of the file being converted, you will need to make a small modification to this script prior to running it on your selected datasets.
To update the script to preprocess your dataset:¶
Write a preprocessor for the dataset in prep_dataset_for_figshare.py. The preprocessor should take the dataset path, do any preprocessing necessary to turn it into a usable dataframe, and return a tuple of the form (string_of_dataset_name, dataframe). If the preprocessor produces more than one dataset, it should return a tuple of two lists of the form ([dataset_name_1, dataset_name_2, …], [df_1, df_2, …]).
For example:
def _preprocess_heusler_magnetic(file_path):
    df = _read_dataframe_from_file(file_path)

    dropcols = ['gap width', 'stability']
    df = df.drop(dropcols, axis=1)

    return "heusler_magnetic", df
Here _read_dataframe_from_file() is a simple utility function that determines which pandas loading function to use based on the file type of the path passed to it. Keyword arguments passed to this function are passed on to the underlying pandas loading functions (a simplified sketch of this helper is shown after the next example).

An example of a preprocessor that returns multiple datasets:
def _preprocess_double_perovskites_gap(file_path):
    df = pd.read_excel(file_path, sheet_name='bandgap')

    df = df.rename(columns={'A1_atom': 'a_1', 'B1_atom': 'b_1',
                            'A2_atom': 'a_2', 'B2_atom': 'b_2'})

    lumo = pd.read_excel(file_path, sheet_name='lumo')

    return ["double_perovskites_gap", "double_perovskites_gap_lumo"], [df, lumo]
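For reference, the _read_dataframe_from_file() helper mentioned above behaves roughly like the simplified sketch below; this is an assumption based on its description, not the actual implementation.

import os

import pandas as pd


def _read_dataframe_from_file(file_path, **kwargs):
    # Simplified sketch: dispatch to a pandas reader based on the file
    # extension and forward keyword arguments to that reader.
    ext = os.path.splitext(file_path)[1].lower()

    if ext == '.csv':
        return pd.read_csv(file_path, **kwargs)
    elif ext in ('.xls', '.xlsx'):
        return pd.read_excel(file_path, **kwargs)
    elif ext == '.json':
        return pd.read_json(file_path, **kwargs)

    raise ValueError("Unsupported file type: {}".format(ext))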
Add the preprocessor function to the dictionary that maps file names to preprocessors in prep_dataset_for_figshare.py. The prep script identifies datasets by their file name; a dictionary called _datasets_to_preprocessing_routines maps these dataset names to their preprocessors and should be updated like so:

_datasets_to_preprocessing_routines = {
    "elastic_tensor_2015": _preprocess_elastic_tensor_2015,
    "piezoelectric_tensor": _preprocess_piezoelectric_tensor,
    ...
    "wolverton_oxides": _preprocess_wolverton_oxides,
    "m2ax_elastic": _preprocess_m2ax,
    YOUR_DATASET_FILE_NAME: YOUR_PREPROCESSOR,
}
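For context, the prep script presumably uses this mapping along the lines of the following simplified sketch (assumed lookup logic, not the actual code):

import os


def _find_preprocessor(file_path):
    # Assumes the _datasets_to_preprocessing_routines dict shown above.
    # The dataset is identified by its file name (without extension),
    # which is used to look up the matching preprocessor.
    dataset_name = os.path.splitext(os.path.basename(file_path))[0]
    return _datasets_to_preprocessing_routines[dataset_name]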
Once this is done, the preprocessor is ready to use.
3. Upload the dataset to long term hosting¶
Once the dataset file is ready, it should be hosted on Figshare or a comparable open-access academic data hosting service. The Hacking Materials group maintains a collective Figshare account and uses the following procedure when adding a dataset; other contributors should follow a similar protocol.
Add the dataset’s compressed JSON file, as well as the original file, as an entry in the “matminer datasets” Figshare project.
Fill out ALL metadata carefully; see existing entries for examples of the expected quality of citations and descriptions.
If the dataset was originally generated from a source outside the group, the source should be thoroughly cited within the dataset description and metadata.
4. Update the matminer dataset metadata file¶
Matminer stores a file called dataset_metadata.json which contains information on all datasets available in the package. This file is automatically checked by CircleCI for proper formatting, and the available datasets are regularly checked to ensure they match the descriptors contained in this metadata. While the appropriate metadata can be added manually, it is preferable to run the helper script modify_dataset_metadata.py to do the bulk of interfacing with this file, to prevent missing data or formatting mistakes.
Run the modify_dataset_metadata.py file and add the appropriate metadata; see existing metadata as a guideline for new datasets. The url attribute should be filled with a Figshare download link for the individual file on Figshare. Other items will be dataset specific or included in the .txt file produced in step 2.
Replace the metadata file in matminer/datasets with the newly generated file (should be done automatically)
Look over the modified dataset_metadata.json file and fix mistakes if necessary.
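An optional sanity check along these lines can catch formatting slips before CircleCI does; the dataset key below is a placeholder.

import json

# Confirm the edited metadata file still parses and that the new entry is
# present. The path is relative to the repository root; "your_dataset_name"
# is a placeholder for the name of the dataset you added.
with open("matminer/datasets/dataset_metadata.json") as f:
    metadata = json.load(f)

assert "your_dataset_name" in metadata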
5. Update the dataset tests and loading code¶
Dataset testing uses unit tests to ensure dataset metadata and dataset contents are formatted properly and available. When adding new datasets, these tests need to be updated. In addition, matminer provides a set of convenience functions that explicitly load a single dataset, as opposed to the keyword-based generic loader. These convenience functions provide additional post-processing options for filtering or modifying data in the dataset after it has been loaded. A convenience function should be added alongside the dataset tests.
Update dataset names saved in matminer/datasets/tests/base.py
class DatasetTest(unittest.TestCase):
    def setUp(self):
        self.dataset_names = [
            'flla',
            'elastic_tensor_2015',
            'piezoelectric_tensor',
            ...
            YOUR DATASET NAME HERE
        ]
Write a test for loading the dataset in test_datasets.py.
These tests ensure that the dataset is downloadable and that its data matches what is described in the file metadata. See prior datasets for examples, and the .txt file from step 2 for column type info. A typical test consists of a call to a universal test function that only needs to be told which dataframe columns should be of which type, plus dataset-specific tests if necessary.
Example:
def test_dielectric_constant(self):
    object_headers = ['material_id', 'formula', 'structure', 'e_electronic',
                      'e_total', 'cif', 'meta', 'poscar']

    numeric_headers = ['nsites', 'space_group', 'volume', 'band_gap',
                       'n', 'poly_electronic', 'poly_total']

    bool_headers = ['pot_ferroelectric']

    # Unique Tests
    def _unique_tests(df):
        self.assertEqual(type(df['structure'][0]), Structure)

    # Universal Tests
    self.universal_dataset_check(
        "dielectric_constant", object_headers, numeric_headers,
        bool_headers=bool_headers, test_func=_unique_tests
    )
Write a convenience function for the dataset in convenience_loaders.py
This can be as simple as returning the results of load_dataset, or it can provide the user with extra options to return only subsets of the dataset with certain properties.
Example:
def load_elastic_tensor(version="2015", include_metadata=False, data_home=None,
                        download_if_missing=True):
    """
    Convenience function for loading the elastic_tensor dataset.

    Args:
        version (str): Version of the elastic_tensor dataset to load
            (defaults to 2015)

        include_metadata (bool): Whether or not to include the cif, meta,
            and poscar dataset columns. False by default.

        data_home (str, None): Where to look for and store the loaded dataset

        download_if_missing (bool): Whether or not to download the dataset
            if it isn't on disk

    Returns: (pd.DataFrame)
    """
    df = load_dataset("elastic_tensor" + "_" + version, data_home,
                      download_if_missing)

    if not include_metadata:
        df = df.drop(['cif', 'kpoint_density', 'poscar'], axis=1)

    return df
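Once merged, such a loader can be called like any other convenience function; for example (illustrative usage of the function above):

from matminer.datasets.convenience_loaders import load_elastic_tensor

# Downloads the dataset if it is not already cached locally, then drops
# the cif/kpoint_density/poscar metadata columns.
df = load_elastic_tensor(version="2015", include_metadata=False)
print(df.columns)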
Write a test for the added convenience function
These tests can be simple and will depend on the options provided in the convenience function. See existing tests for examples.
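Such a test might look like the minimal sketch below, which assumes a hypothetical load_your_dataset() convenience function added in the previous step; real tests should follow the conventions of the existing ones.

import unittest

import pandas as pd

# load_your_dataset is a hypothetical convenience function used only for
# illustration; substitute the loader you actually added.
from matminer.datasets.convenience_loaders import load_your_dataset


class ConvenienceLoaderTest(unittest.TestCase):
    def test_load_your_dataset(self):
        # The loader should return a non-empty DataFrame.
        df = load_your_dataset()
        self.assertIsInstance(df, pd.DataFrame)
        self.assertFalse(df.empty)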
6. Make a Pull Request to the matminer GitHub repository¶
Make a commit describing the added dataset
Make a pull request from your fork to the primary repository