==================================== Predicting bulk moduli with matminer ==================================== ----------------------------------------------------------------------------- Fit data mining models to ~6000 calculated bulk moduli from Materials Project ----------------------------------------------------------------------------- **Time to complete: 30 minutes** This notebook is an example of using the MP data retrieval tool :code:`retrieve_MP.py` to retrieve computed bulk moduli from `the materials project databases `_ in the form of a pandas dataframe, using matminer's tools to populate the dataframe with descriptors/features from pymatgen, and then fitting regression models from the scikit-learn library to the dataset. Preamble ________ **Import libraries, and set pandas display options.** .. code-block:: python # filter warnings messages from the notebook import warnings warnings.filterwarnings('ignore') import numpy as np import pandas as pd # Set pandas view options pd.set_option('display.width', 1000) pd.set_option('display.max_columns', None) pd.set_option('display.max_rows', None) Step 1: Use matminer to obtain data from MP (automatically) in a "pandas" dataframe ___________________________________________________________________________________ **Step 1a: Import matminer's MP data retrieval tool and get calculated bulk moduli and possible descriptors.** .. code-block:: python from matminer.data_retrieval.retrieve_MP import MPDataRetrieval api_key = None # Set your MP API key here. If set as an environment variable 'MAPI_KEY', set it to 'None' mpr = MPDataRetrieval(api_key) # Create an adapter to the MP Database. # criteria is to get all entries with elasticity (K_VRH is bulk modulus) data criteria = {'elasticity.K_VRH': {'$ne': None}} # properties are the materials attributes we want # See https://github.com/materialsproject/mapidoc for available properties you can specify properties = ['pretty_formula', 'spacegroup.symbol', 'elasticity.K_VRH', 'formation_energy_per_atom', 'band_gap', 'e_above_hull', 'density', 'volume', 'nsites'] # get the data! df_mp = mpr.get_dataframe(criteria=criteria, properties=properties) print 'Number of bulk moduli extracted = {}'.format(len(df_mp)) :code:`Number of bulk moduli extracted = 6023` **Step 1b: Explore the dataset.** .. code-block:: python df_mp.head() df_mp.describe() **Step 1c. Filter out unstable entries and negative bulk moduli** .. code-block:: python df_mp = df_mp[df_mp['elasticity.K_VRH'] > 0] df_mp = df_mp[df_mp['e_above_hull'] < 0.1] df_mp.describe() Step 2: Add descriptors/features ________________________________ **Step 2a: create volume per atom descriptor** .. code-block:: python # add volume per atom descriptor df_mp['vpa'] = df_mp['volume']/df_mp['nsites'] # explore columns df_mp.head() **Step 2b: add several more descriptors using MatMiner's pymatgen descriptor getter tools** .. code-block:: python from matminer.featurizers.composition import ElementProperty from matminer.featurizers.data import PymatgenData from pymatgen import Composition df_mp["composition"] = df_mp['pretty_formula'].map(lambda x: Composition(x)) dataset = PymatgenData() descriptors = ['row', 'group', 'atomic_mass', 'atomic_radius', 'boiling_point', 'melting_point', 'X'] stats = ["mean", "std_dev"] ep = ElementProperty(data_source=dataset, features=descriptors, stats=stats) df_mp = ep.featurize_dataframe(df_mp, "composition") #Remove NaN values df_mp = df_mp.dropna() df_mp.head() Step 3: Fit a Linear Regression model, get R\ :sup:`2` and RMSE _______________________________________________________________ **Step 3a: Define what column is the target output, and what are the relevant descriptors** .. code-block:: python # target output column y = df_mp['elasticity.K_VRH'].values # possible descriptor columns X_cols = [c for c in df_mp.columns if c not in ['elasticity.K_VRH', 'pretty_formula', 'volume', 'nsites', 'spacegroup.symbol', 'e_above_hull', 'composition']] X = df_mp.as_matrix(X_cols) print("Possible descriptors are: {}".format(X_cols)) .. code-block:: python Possible descriptors are: ['formation_energy_per_atom', 'band_gap', 'density', 'vpa', 'mean X', 'mean atomic_mass', 'mean atomic_radius', 'mean boiling_point', 'mean group', 'mean melting_point', 'mean row', 'std_dev X', 'std_dev atomic_mass', 'std_dev atomic_radius', 'std_dev boiling_point', 'std_dev group', 'std_dev melting_point', 'std_dev row'] **Step 3b: Fit the linear regression model** .. code-block:: python from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error lr = LinearRegression() lr.fit(X, y) # get fit statistics print 'R2 = ' + str(round(lr.score(X, y), 3)) print 'RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=lr.predict(X))) .. code-block:: python R2 = 0.804 RMSE = 32.558 **Step 3c: Cross validate the results** .. code-block:: python from sklearn.model_selection import KFold, cross_val_score # Use 10-fold cross validation (90% training, 10% test) crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1) # compute cross validation scores for random forest model scores = cross_val_score(lr, X, y, scoring='mean_squared_error', cv=crossvalidation, n_jobs=1) rmse_scores = [np.sqrt(abs(s)) for s in scores] print 'Cross-validation results:' print 'Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(np.abs(rmse_scores))) .. code-block:: python Cross-validation results: Folds: 10, mean RMSE: 33.200 Step 4: Plot the results with FigRecipes ________________________________________ .. code-block:: python from matminer.figrecipes.plotly.make_plots import PlotlyFig pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)', y_title='Predicted bulk modulus (GPa)', plot_title='Linear regression', plot_mode='offline', margin_left=150, textsize=35, ticksize=30, filename="lr_regression.html") # a line to represent a perfect model with 1:1 prediction xy_params = {'x_col': [0, 400], 'y_col': [0, 400], 'color': 'black', 'mode': 'lines', 'legend': None, 'text': None, 'size': None} pf.xy_plot(x_col=y, y_col=lr.predict(X), size=3, marker_outline_width=0.5, text=df_mp['pretty_formula'], add_xy_plot=[xy_params]) .. image:: _static/example_bulkmod.png :scale: 70 Great! We just fit a linear regression model to pymatgen features using matminer and sklearn. Now let's use a Random Forest model to examine the importance of our features. Step 5: Follow similar steps for a Random Forest model ______________________________________________________ **Step 5a: Fit the Random Forest model, get R2 and RMSE** .. code-block:: python from sklearn.ensemble import RandomForestRegressor rf = RandomForestRegressor(n_estimators=50, random_state=1) rf.fit(X, y) print 'R2 = ' + str(round(rf.score(X, y), 3)) print 'RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=rf.predict(X))) .. code-block:: python R2 = 0.988 RMSE = 7.947 **Step 5b: Cross-validate the results** .. code-block:: python # compute cross validation scores for random forest model scores = cross_val_score(rf, X, y, scoring='mean_squared_error', cv=crossvalidation, n_jobs=1) rmse_scores = [np.sqrt(abs(s)) for s in scores] print 'Cross-validation results:' print 'Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(np.abs(rmse_scores))) .. code-block:: python Cross-validation results: Folds: 10, mean RMSE: 20.087 Step 6: Plot our results and determine what features are the most important ___________________________________________________________________________ **Step 6a: Plot the random forest model** .. code-block:: python from matminer.figrecipes.plotly.make_plots import PlotlyFig pf_rf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)', y_title='Random forest bulk modulus (GPa)', plot_title='Random forest regression', plot_mode='offline', margin_left=150, textsize=35, ticksize=30, filename="rf_regression.html") # a line to represent a perfect model with 1:1 prediction xy_line = {'x_col': [0, 450], 'y_col': [0, 450], 'color': 'black', 'mode': 'lines', 'legend': None, 'text': None, 'size': None} pf_rf.xy_plot(x_col=y, y_col=rf.predict(X), size=3, marker_outline_width=0.5, text=df_mp['pretty_formula'], add_xy_plot=[xy_line]) .. image:: _static/example_bulkmod_rf.png :scale: 80 **Step 6b: Plot the importance of the features we used** .. code-block:: python importances = rf.feature_importances_ X_cols = np.asarray(X_cols) indices = np.argsort(importances)[::-1] pf = PlotlyFig(y_title='Importance (%)', plot_title='Feature by importances', plot_mode='offline', margin_left=150, margin_bottom=200, textsize=20, ticksize=15, filename="rf_importances.html") pf.bar_chart(x=X_cols[indices], y=importances[indices]) .. image:: _static/example_bulkmod_feats.png :scale: 60