Predicting bulk moduli with matminer

Fit data mining models to ~6000 calculated bulk moduli from Materials Project

Time to complete: 30 minutes

This notebook is an example of using the MP data retrieval tool retrieve_MP.py to retrieve computed bulk moduli from the materials project databases in the form of a pandas dataframe, using matminer’s tools to populate the dataframe with descriptors/features from pymatgen, and then fitting regression models from the scikit-learn library to the dataset.

Preamble

Import libraries, and set pandas display options.

# filter warnings messages from the notebook
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

# Set pandas view options
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Step 1: Use matminer to obtain data from MP (automatically) in a “pandas” dataframe

Step 1a: Import matminer’s MP data retrieval tool and get calculated bulk moduli and possible descriptors.

from matminer.data_retrieval.retrieve_MP import MPDataRetrieval

api_key = None   # Set your MP API key here. If set as an environment variable 'MAPI_KEY', set it to 'None'
mpr = MPDataRetrieval(api_key)     # Create an adapter to the MP Database.

# criteria is to get all entries with elasticity (K_VRH is bulk modulus) data
criteria = {'elasticity.K_VRH': {'$ne': None}}

# properties are the materials attributes we want
# See https://github.com/materialsproject/mapidoc for available properties you can specify
properties = ['pretty_formula', 'spacegroup.symbol', 'elasticity.K_VRH', 'formation_energy_per_atom', 'band_gap',
              'e_above_hull', 'density', 'volume', 'nsites']

# get the data!
df_mp = mpr.get_dataframe(criteria=criteria, properties=properties)
print 'Number of bulk moduli extracted = {}'.format(len(df_mp))

Number of bulk moduli extracted = 6023

Step 1b: Explore the dataset.

df_mp.head()
df_mp.describe()

Step 1c. Filter out unstable entries and negative bulk moduli

df_mp = df_mp[df_mp['elasticity.K_VRH'] > 0]
df_mp = df_mp[df_mp['e_above_hull'] < 0.1]
df_mp.describe()

Step 2: Add descriptors/features

Step 2a: create volume per atom descriptor

# add volume per atom descriptor
df_mp['vpa'] = df_mp['volume']/df_mp['nsites']

# explore columns
df_mp.head()

Step 2b: add several more descriptors using MatMiner’s pymatgen descriptor getter tools

from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.data import PymatgenData
from pymatgen import Composition

df_mp["composition"] = df_mp['pretty_formula'].map(lambda x: Composition(x))

dataset = PymatgenData()
descriptors = ['row', 'group', 'atomic_mass',
               'atomic_radius', 'boiling_point', 'melting_point', 'X']
stats = ["mean", "std_dev"]

ep = ElementProperty(data_source=dataset, features=descriptors, stats=stats)
df_mp = ep.featurize_dataframe(df_mp, "composition")

#Remove NaN values
df_mp = df_mp.dropna()

df_mp.head()

Step 3: Fit a Linear Regression model, get R2 and RMSE

Step 3a: Define what column is the target output, and what are the relevant descriptors

# target output column
y = df_mp['elasticity.K_VRH'].values

# possible descriptor columns
X_cols = [c for c in df_mp.columns
          if c not in ['elasticity.K_VRH', 'pretty_formula',
                       'volume', 'nsites', 'spacegroup.symbol', 'e_above_hull', 'composition']]
X = df_mp.as_matrix(X_cols)

print("Possible descriptors are: {}".format(X_cols))
Possible descriptors are: ['formation_energy_per_atom', 'band_gap', 'density', 'vpa', 'mean X', 'mean atomic_mass',
'mean atomic_radius', 'mean boiling_point', 'mean group', 'mean melting_point', 'mean row', 'std_dev X',
'std_dev atomic_mass', 'std_dev atomic_radius', 'std_dev boiling_point', 'std_dev group', 'std_dev melting_point',
'std_dev row']

Step 3b: Fit the linear regression model

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()

lr.fit(X, y)

# get fit statistics
print 'R2 = ' + str(round(lr.score(X, y), 3))
print 'RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=lr.predict(X)))
R2 = 0.804
RMSE = 32.558

Step 3c: Cross validate the results

from sklearn.model_selection import KFold, cross_val_score

# Use 10-fold cross validation (90% training, 10% test)
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

# compute cross validation scores for random forest model
scores = cross_val_score(lr, X, y, scoring='mean_squared_error',
                         cv=crossvalidation, n_jobs=1)
rmse_scores = [np.sqrt(abs(s)) for s in scores]

print 'Cross-validation results:'
print 'Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(np.abs(rmse_scores)))
Cross-validation results:
Folds: 10, mean RMSE: 33.200

Step 4: Plot the results with FigRecipes

from matminer.figrecipes.plotly.make_plots import PlotlyFig

pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               plot_title='Linear regression',
               plot_mode='offline',
               margin_left=150,
               textsize=35,
               ticksize=30,
               filename="lr_regression.html")

# a line to represent a perfect model with 1:1 prediction
xy_params = {'x_col': [0, 400],
             'y_col': [0, 400],
             'color': 'black',
             'mode': 'lines',
             'legend': None,
             'text': None,
             'size': None}

pf.xy_plot(x_col=y,
           y_col=lr.predict(X),
           size=3,
           marker_outline_width=0.5,
           text=df_mp['pretty_formula'],
           add_xy_plot=[xy_params])
_images/example_bulkmod.png

Great! We just fit a linear regression model to pymatgen features using matminer and sklearn. Now let’s use a Random Forest model to examine the importance of our features.

Step 5: Follow similar steps for a Random Forest model

Step 5a: Fit the Random Forest model, get R2 and RMSE

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=50, random_state=1)

rf.fit(X, y)
print 'R2 = ' + str(round(rf.score(X, y), 3))
print 'RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=rf.predict(X)))
R2 = 0.988
RMSE = 7.947

Step 5b: Cross-validate the results

# compute cross validation scores for random forest model
scores = cross_val_score(rf, X, y, scoring='mean_squared_error', cv=crossvalidation, n_jobs=1)

rmse_scores = [np.sqrt(abs(s)) for s in scores]
print 'Cross-validation results:'
print 'Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(np.abs(rmse_scores)))
Cross-validation results:
Folds: 10, mean RMSE: 20.087

Step 6: Plot our results and determine what features are the most important

Step 6a: Plot the random forest model

from matminer.figrecipes.plotly.make_plots import PlotlyFig

pf_rf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
                  y_title='Random forest bulk modulus (GPa)',
                  plot_title='Random forest regression',
                  plot_mode='offline',
                  margin_left=150,
                  textsize=35,
                  ticksize=30,
                  filename="rf_regression.html")

# a line to represent a perfect model with 1:1 prediction
xy_line = {'x_col': [0, 450],
           'y_col': [0, 450],
           'color': 'black',
           'mode': 'lines',
           'legend': None,
           'text': None,
           'size': None}


pf_rf.xy_plot(x_col=y,
              y_col=rf.predict(X),
              size=3,
              marker_outline_width=0.5,
              text=df_mp['pretty_formula'],
              add_xy_plot=[xy_line])
_images/example_bulkmod_rf.png

Step 6b: Plot the importance of the features we used

importances = rf.feature_importances_
X_cols = np.asarray(X_cols)
indices = np.argsort(importances)[::-1]

pf = PlotlyFig(y_title='Importance (%)',
               plot_title='Feature by importances',
               plot_mode='offline',
               margin_left=150,
               margin_bottom=200,
               textsize=20,
               ticksize=15,
               filename="rf_importances.html")

pf.bar_chart(x=X_cols[indices], y=importances[indices])
_images/example_bulkmod_feats.png