Basic Usage
Basic usage of Automatminer requires only one class: MatPipe.
MatPipe works with pandas dataframes as input and output. It can train
on training data using its fit method, predict on new data using predict,
and run benchmarks using benchmark, all in an automatic, end-to-end fashion.
Materials primitives (e.g., crystal structures) go in one end, and property
predictions come out the other. MatPipe handles the intermediate
operations such as assigning descriptors, cleaning problematic data, data
conversions, imputation, and machine learning.
This is just a quick overview of the basic functionality. For a detailed and comprehensive tutorial, see the Jupyter notebooks in the automatminer directory of the matminer_examples repository.
Initializing a pipeline
The easiest way to initialize a MatPipe is to use a preset.
from automatminer import MatPipe
pipe = MatPipe.from_preset("express")
This preset is a set of options specifying exactly how each of
MatPipe’s constituent classes is set up. Typically, the “express”
preset will give you results with a moderate degree of accuracy and relatively
quick training, so we’ll use that here.
Note: The default MatPipe() is equivalent to
MatPipe.from_preset("express"); other presets have different
configuration options!
Training a pipeline
MatPipe has similar fit/transform syntax to scikit-learn. Your dataframe might be of the form:
train_df
structure | my_property |
---|---|
(Structure object) | 0.3819 |
(Structure object) | -0.1123 |
(Structure object) | -0.091 |
… | … |
Here the structure column contains pymatgen Structure objects and
the property column is the property you are interested in (the target). Use
fit to train, and specify the target column. For the dataframe
above, you’d do:
from automatminer import MatPipe
pipe = MatPipe.from_preset("express")
# Fitting pipe on train_df using "my_property" as target
pipe.fit(train_df, "my_property")
The MatPipe is now fit and can be used to make predictions on new data!
Making predictions
Once the pipeline is fit, we can make predictions on out-of-sample data, provided that data has the same input types that our pipeline was trained on. For example:
prediction_df
structure |
---|
(Structure object) |
(Structure object) |
(Structure object) |
… |
Use predict to make predictions on new data.
from automatminer import MatPipe
pipe = MatPipe.from_preset("express")
pipe.fit(train_df, "my_property")
# Predicting my_property values of some unknown prediction_df structures
prediction_df = pipe.predict(prediction_df)
The output will be stored in a column called "<your property> predicted".
prediction_df
structure | my_property predicted |
---|---|
(Structure object) | 0.449 |
(Structure object) | -0.573 |
(Structure object) | -0.005 |
… | … |
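If you held some labeled rows out of training, the predicted column can be compared against the known values. The following is a minimal sketch, assuming pandas is installed. The helper name prediction_mae and the toy numbers are illustrative and not part of Automatminer; the only thing taken from the text above is the "<your property> predicted" column naming.

```python
import pandas as pd


def prediction_mae(predicted_df, true_values, target):
    """Mean absolute error between known values and predicted values.

    Hypothetical helper: assumes predictions are stored in the column
    "<target> predicted" and that true_values shares predicted_df's index.
    """
    predicted_col = f"{target} predicted"
    if predicted_col not in predicted_df.columns:
        raise KeyError(f"Expected a column named {predicted_col!r}")
    errors = (predicted_df[predicted_col] - true_values).abs()
    return float(errors.mean())


# Toy stand-in for the output of pipe.predict(...); the values are made up.
toy_predictions = pd.DataFrame({"my_property predicted": [0.5, -0.5, 0.0]})
toy_truth = pd.Series([0.25, -0.25, 0.5])
mae = prediction_mae(toy_predictions, toy_truth, "my_property")
```

With a real pipeline you would pass the dataframe returned by pipe.predict together with the held-out target values.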
Using different presets
You can try out different configurations, such as more intensive featurization routines or quicker training, by initializing MatPipe with a different config.
The “heavy” preset typically includes more CPU-intensive featurization and longer training times.
from automatminer import MatPipe
pipe = MatPipe.from_preset("heavy")
In contrast, use “debug” if you want very quick predictions.
from automatminer import MatPipe
pipe = MatPipe.from_preset("debug")
Saving your pipeline for later
Once fit, you can save your pipeline as a pickle file:
pipe.save("my_pipeline.p")
To load your file, use the MatPipe.load static method.
pipe = MatPipe.load("my_pipeline.p")
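Saving and loading combine naturally into a "fit once, reuse later" pattern. The sketch below is one possible way to write it, not something Automatminer provides: load_or_fit is a hypothetical helper, and the pipeline class is passed in as an argument so the function relies only on the from_preset, fit, save, and load calls shown on this page.

```python
import os


def load_or_fit(pipe_cls, path, train_df, target, preset="express"):
    """Return a fitted pipeline, reusing a saved one when it exists.

    pipe_cls is expected to be automatminer's MatPipe (an assumption of
    this sketch); only from_preset, fit, save, and load are used.
    """
    if os.path.exists(path):
        # A saved pipeline is already on disk, so skip training.
        return pipe_cls.load(path)
    pipe = pipe_cls.from_preset(preset)
    pipe.fit(train_df, target)
    pipe.save(path)
    return pipe
```

Usage would look like pipe = load_or_fit(MatPipe, "my_pipeline.p", train_df, "my_property"). Since the saved file is a pickle, only load pipelines from sources you trust.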
Examine your pipeline
Summarize
For a human-readable executive summary of your pipeline, use MatPipe.summarize().
summary = pipe.summarize()
The dict returned by summarize specifies the top-level information as strings.
An analogy: if your pipeline were a plumbing system, summarize would
tell you how long each section of pipe is and the pump model.
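Because summarize returns a dict, it can be convenient to print it as indented text in a terminal. The sketch below uses only the standard library. The function format_digest is a hypothetical helper, and the keys in example_digest are invented placeholders, not real Automatminer output; nesting is handled in case some values are themselves dicts.

```python
def format_digest(digest, indent=0):
    """Render a (possibly nested) dict as indented "key: value" lines."""
    lines = []
    pad = "  " * indent
    for key, value in digest.items():
        if isinstance(value, dict):
            # Nested section: print the key, then its contents one level deeper.
            lines.append(f"{pad}{key}:")
            lines.extend(format_digest(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


# Placeholder dict for illustration; real keys come from pipe.summarize().
example_digest = {
    "step_a": "description of step a",
    "step_b": {"option": "value"},
}
rendered = "\n".join(format_digest(example_digest))
```

In practice you would call print("\n".join(format_digest(pipe.summarize()))).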
Inspect
To get comprehensive details on a pipeline, use MatPipe.inspect().
details = pipe.inspect()
Inspection specifies all parameters to all Automatminer objects needed to
construct the pipeline and all of its internal operations. In contrast to the
summary, which provides a more human-interpretable digest, inspection generates
the true attribute names and values of each object in the MatPipe hierarchy.
It is typically very long, though human readable. An analogy: if your pipeline
were a plumbing system, inspect would tell you everything summarize
tells you, plus the model numbers of all the bolts, joints,
and valves.
Save to a file
Both summarize and inspect accept a filename argument if you’d
like to save their outputs to JSON, YAML, or text.
summary = pipe.summarize("my_summary.yaml")
details = pipe.inspect("my_details.json")
Monitoring the log
The Automatminer log is a powerful tool for determining what is happening within the pipeline in real time. We recommend you monitor it closely as the pipeline runs.
In addition to stdout, Automatminer writes a log file in the current
working directory (automatminer.log, timestamped if duplicates exist).
Here’s an example of an automatminer log when fitting on a dataset.
2019-10-11 16:05:41 INFO Problem type is: regression
2019-10-11 16:05:41 INFO Fitting MatPipe pipeline to data.
2019-10-11 16:05:41 INFO AutoFeaturizer: Starting fitting.
2019-10-11 16:05:41 INFO AutoFeaturizer: Adding compositions from structures.
...
2019-10-11 16:05:47 INFO DataCleaner: Handling feature na by max na threshold of 0.01 with method 'drop'.
2019-10-11 16:05:47 INFO DataCleaner: After handling na: 636 samples, 168 features
2019-10-11 16:05:47 INFO DataCleaner: Finished fitting.
2019-10-11 16:05:47 INFO FeatureReducer: Starting fitting.
2019-10-11 16:05:47 INFO FeatureReducer: 57 features removed due to cross correlation more than 0.95
2019-10-11 16:05:49 INFO TreeFeatureReducer: Finished tree-based feature reduction of 110 initial features to 13
2019-10-11 16:05:49 INFO FeatureReducer: Finished fitting.
2019-10-11 16:05:49 INFO FeatureReducer: Starting transforming.
2019-10-11 16:05:49 INFO FeatureReducer: Finished transforming.
2019-10-11 16:05:49 INFO TPOTAdaptor: Starting fitting.
2019-10-11 16:07:50 INFO TPOTAdaptor: Finished fitting.
2019-10-11 16:07:50 INFO MatPipe successfully fit.
If you see WARNING or ERROR, you should inspect the pipeline
to make sure everything is configured as intended. If you see a CRITICAL,
something is likely misconfigured within the pipeline and should be
looked into in detail!
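For long runs it can help to scan the log file for these levels automatically. The following is a minimal sketch using only the standard library. It assumes each log line follows the "date time LEVEL message" layout of the example log above; find_attention_lines is a hypothetical helper, and the WARNING line in the sample is made up for illustration.

```python
import re

# Matches lines such as "2019-10-11 16:05:41 INFO Problem type is: regression".
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)
ATTENTION_LEVELS = {"WARNING", "ERROR", "CRITICAL"}


def find_attention_lines(lines):
    """Return (level, message) pairs for WARNING, ERROR, and CRITICAL lines."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in ATTENTION_LEVELS:
            hits.append((match.group("level"), match.group("message")))
    return hits


sample = [
    "2019-10-11 16:05:41 INFO Problem type is: regression",
    "2019-10-11 16:05:42 WARNING Example warning text (made up for this sketch)",
    "...",
]
flagged = find_attention_lines(sample)
```

To check a real run, pass an open file instead of the sample list, for example with open("automatminer.log") as f: flagged = find_attention_lines(f).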
Quick reminders
A quick note: Default MatPipe configs automatically infer the type of pymatgen object from the dataframe column name: e.g., “composition” = pymatgen.Composition, “structure” = pymatgen.Structure, “bandstructure” = pymatgen.electronic_structure.bandstructure.BandStructure, and “dos” = pymatgen.electronic_structure.dos.DOS.
Make sure your dataframe has the correct name for its input! If you want to use custom names, see the advanced usage page.
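A quick pre-flight check of column names can catch a misnamed input before a long fit. The sketch below simply encodes the four default names listed above in a dict; recognized_inputs is a hypothetical helper, and it compares names exactly, which is an assumption of this sketch rather than a statement about how Automatminer matches them.

```python
# Column name -> pymatgen type, as listed above for the default configs.
DEFAULT_INPUT_COLUMNS = {
    "composition": "pymatgen.Composition",
    "structure": "pymatgen.Structure",
    "bandstructure": "pymatgen.electronic_structure.bandstructure.BandStructure",
    "dos": "pymatgen.electronic_structure.dos.DOS",
}


def recognized_inputs(column_names):
    """Map each recognized input column to the pymatgen type it implies."""
    return {
        name: DEFAULT_INPUT_COLUMNS[name]
        for name in column_names
        if name in DEFAULT_INPUT_COLUMNS
    }


found = recognized_inputs(["structure", "my_property"])
nothing_found = recognized_inputs(["crystal", "my_property"])
```

With a dataframe you could call recognized_inputs(train_df.columns); an empty result suggests the input column needs renaming.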