API Reference¶
Method¶
-
class
mlgauge.SklearnMethod(estimator, metrics, export_model=False, cv=5)¶ A wrapper to directly use an sklearn estimator with an analysis.
-
__init__(estimator, metrics, export_model=False, cv=5)¶ Initialize sklearn method.
- Parameters
estimator (estimator) – An sklearn estimator or a pipeline.
metrics (list) – List of metric strings or sklearn callable metric functions. Refer to the sklearn documentation for the available metrics.
export_model (bool) – Exports the sklearn estimator through joblib as estimator(_fold_k).joblib if set to True.
cv (int) – The cross-validation to use when use_test_set is False. Ignored otherwise.
-
train(X_train, y_train, feature_names=None, category_indicator=None)¶ Train the model and return the training score.
- Parameters
X_train (array) – array of training vector.
y_train (array) – array of target vector.
feature_names (list) – list of names of the features in X_train
category_indicator (list) – list of boolean indicating whether a feature is a categorical variable.
- Returns
list of metric scores evaluated on the training data.
- Return type
list
- Raises
AttributeError – raised when method is called with use_test_set set to false
-
test(X_test, y_test, feature_names=None, category_indicator=None)¶ Evaluate the model and return the test score.
- Parameters
X_test (array) – array of testing vector.
y_test (array) – array of target vector.
feature_names (list) – list of names of the features in X_test
category_indicator (list) – list of boolean indicating whether a feature is a categorical variable.
- Returns
list of metric scores evaluated on the testing data.
- Return type
list
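To make the feature_names and category_indicator arguments more concrete, here is a minimal sketch of how the two lists could be read together. It assumes that both lists are aligned position by position with the columns of the input array, which the reference implies but does not state outright. The helper name split_features and the sample feature names are hypothetical and are not part of mlgauge.

```python
def split_features(feature_names, category_indicator):
    """Partition feature names into (categorical, numerical) lists.

    Assumes one boolean per feature, in the same order as the names.
    """
    if len(feature_names) != len(category_indicator):
        raise ValueError("feature_names and category_indicator must have equal length")
    pairs = list(zip(feature_names, category_indicator))
    categorical = [name for name, is_categorical in pairs if is_categorical]
    numerical = [name for name, is_categorical in pairs if not is_categorical]
    return categorical, numerical


# Hypothetical dataset with one categorical column.
categorical, numerical = split_features(
    ["age", "city", "income"], [False, True, False]
)
print(categorical)  # ['city']
print(numerical)    # ['age', 'income']
```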
-
-
class
mlgauge.Method¶ The baseclass for method definitions that will be used in the analysis. Inherit this class into your class to define your own methods.
An Analysis instance will dynamically set the different attributes to different values based on the dataset that is currently used. These attributes will be available to the inheriting class to provide additional information.
-
output_dir¶ Path of the directory where the outputs should be saved. Will be automatically assigned by Analysis class. When defining a method based on this class, use this attribute to save any artifacts like plots, model dumps etc.
- Type
str
-
feature_names¶ List of feature names of the input dataset.
- Type
list
-
use_test_set¶ If True, the method implements a test method.
- Type
bool
-
cv¶ The value indicates the number of folds used when use_test_set is set to False.
- Type
int
-
__init__()¶ Initialize self. See help(type(self)) for accurate signature.
-
set_output_dir(path)¶ Set output directory to save results of the method.
- Parameters
path (str) – path of the directory where the outputs should be saved.
-
set_test_set(use_test_set)¶ Specify if the method requires a test set.
- Parameters
use_test_set (bool) – method implements a test method when set to True.
-
train(X_train, y_train, feature_names=None, category_indicator=None)¶ Train the model and return the training score.
- Parameters
X_train (array) – array of training vector.
y_train (array) – array of target vector.
feature_names (list) – list of names of the features in X_train.
category_indicator (list) – list of boolean indicating whether a feature is a categorical variable.
- Raises
NotImplementedError – raised when called from the base class.
-
test(X_test, y_test, feature_names=None, category_indicator=None)¶ Evaluate the model and return the test score.
- Parameters
X_test (array) – array of testing vector.
y_test (array) – array of target vector.
feature_names (list) – list of names of the features in X_test
category_indicator (list) – list of boolean indicating whether a feature is a categorical variable.
- Raises
NotImplementedError – raised when called from the base class.
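As a rough illustration of the subclassing contract described above, the sketch below defines a hypothetical MeanBaseline method that always predicts the mean of the training targets and reports a single mean squared error score. The class name, the _mse helper and the fallback stand-in for Method (used only when mlgauge is not installed, so that the snippet stays runnable) are assumptions made for illustration and are not part of mlgauge.

```python
try:
    from mlgauge import Method
except ImportError:
    class Method:  # hypothetical stand-in, NOT the real mlgauge base class
        """Placeholder that mirrors only the documented train/test interface."""

        def train(self, X_train, y_train, feature_names=None, category_indicator=None):
            raise NotImplementedError

        def test(self, X_test, y_test, feature_names=None, category_indicator=None):
            raise NotImplementedError


class MeanBaseline(Method):
    """Hypothetical method: predicts the mean of the training targets."""

    def _mse(self, y):
        # Mean squared error of the constant prediction self.mean_.
        return sum((value - self.mean_) ** 2 for value in y) / len(y)

    def train(self, X_train, y_train, feature_names=None, category_indicator=None):
        # "Fit" the model by storing the mean target, then score on the training data.
        self.mean_ = sum(y_train) / len(y_train)
        return [self._mse(y_train)]

    def test(self, X_test, y_test, feature_names=None, category_indicator=None):
        # Score the stored constant prediction on the held-out data.
        return [self._mse(y_test)]
```

Both methods return a list with one entry per metric, so an Analysis using this sketch would be given a one-element metric_names list such as ['mse'].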
-
Analysis¶
-
class
mlgauge.Analysis(methods, metric_names=None, datasets='all', n_datasets=20, data_source='pmlb', drop_na=False, use_test_set=True, test_size=0.25, random_state=None, output_dir=None, local_cache_dir=None, disable_progress=False)¶ The analysis class to run the method comparisons.
The class gathers datasets, methods and runs the given methods across different datasets and compiles the results.
-
results¶ Named array containing resulting metrics of the analysis.
The dimensions are named “datasets”, “methods”, “metrics”, “splits”. You can index on each dimension using the name of the dataset, method, metric and split (“train”/“test” if the method uses a test set, “fold_1”, “fold_2”, … otherwise) using the loc attribute, similar to pandas.
For example, to identify the test mse score of your linear model on the houses dataset:
result.loc['houses', 'linear', 'mse', 'test']
Note
When integer IDs are specified for openml datasets, the results attribute’s dataset key will be set as a string.
Refer to the xarray documentation for more detailed usage.
- Type
xr.DataArray
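The labels used on the “splits” and “datasets” dimensions can be sketched in plain Python as follows. This is only an illustration of the naming rules described above. The helper names split_labels and dataset_key are hypothetical, and the assumption that fold labels run from fold_1 up to fold_cv is inferred from the description rather than confirmed against the implementation.

```python
def split_labels(use_test_set, cv=5):
    """Return the expected labels of the "splits" dimension."""
    if use_test_set:
        return ["train", "test"]
    # Assumed: one label per cross-validation fold, numbered from 1.
    return [f"fold_{k}" for k in range(1, cv + 1)]


def dataset_key(dataset):
    """Return the key used to look up a dataset in the results.

    Integer openml IDs are stored as strings, so 531 becomes "531".
    """
    return str(dataset)


print(split_labels(use_test_set=True))         # ['train', 'test']
print(split_labels(use_test_set=False, cv=3))  # ['fold_1', 'fold_2', 'fold_3']
print(dataset_key(531))                        # 531
```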
-
__init__(methods, metric_names=None, datasets='all', n_datasets=20, data_source='pmlb', drop_na=False, use_test_set=True, test_size=0.25, random_state=None, output_dir=None, local_cache_dir=None, disable_progress=False)¶ Initialize analysis.
- Parameters
methods (list) – List of tuple containing the method name and a method object.
metric_names (list) –
List of strings representing the names of the metrics. The names are only used to label the metrics output by the method objects. If None, metrics are not collected from the methods.
The length of the list should match the length of the lists returned by the train and test methods of the Method instances.
datasets (str or list) –
One of the following options:
“all”: randomly select n_datasets from all available datasets in pmlb.
“classification”: randomly select n_datasets from all available classification datasets in pmlb.
“regression”: randomly select n_datasets from all available regression datasets in pmlb.
list of strings: a list of valid pmlb/openml dataset names.
list of ints: a list of valid openml dataset IDs. This is recommended for openml to avoid issues with versions.
list of (‘dataset_name’, (X, y)) tuples: Use this format to pass a custom dataset in the X y format.
list of (‘dataset_name’, (X_train, y_train), (X_test, y_test)) tuples: Use this format to pass a custom training and testing set in the X y format.
Here, X and y can be numpy arrays or pandas DataFrames. Using a DataFrame allows the input feature names to be passed to the methods.
n_datasets (int) – Number of datasets to randomly sample from the available pmlb datasets. Ignored if datasets is not a string.
data_source (str) – Source to fetch from when dataset names/IDs are passed. ‘pmlb’ or ‘openml’
drop_na (bool) – If True will drop all rows in the dataset with null values.
random_state (None, int or RandomState instance) – seed for the PRNG.
use_test_set (bool) – Whether the methods use a testing set.
test_size (float) – The size of the test set. Ignored if use_test_set is False.
output_dir (str) – Path of the output directory where method artifacts will be stored. A separate directory for each method will be created inside the directory. Defaults to an “output” directory in the current working directory.
local_cache_dir (str) – Local cache to use for pmlb datasets. If None will not use cached data.
-
run()¶ Load the datasets, run the methods and collect the results.
-
get_result()¶ Get the result of the analysis.
- Returns
A 4d named array containing the result metrics.
- Return type
(xr.DataArray)
-
get_result_as_df(metric=None, train=False, mean_folds=True)¶ Get results as a pandas dataframe.
- Parameters
metric (str) – Enter the metric string for which the result should be displayed. Defaults to the first name in metric_names.
train (bool) – If true, will also return the train scores. Ignored if use_test_set is False.
mean_folds (bool) – If true, will return mean and std deviation of the k-fold results, otherwise returns all folds. Ignored if use_test_set is True.
- Returns
Pandas dataframe with datasets for rows.
When use_test_set is True, the columns contain the train and test results; otherwise the mean and standard deviation of the k-fold validation are returned. If mean_folds is set to False, the scores of all folds are returned.
- Return type
(pd.DataFrame)
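The effect of mean_folds can be pictured with a small standalone sketch that collapses per-fold scores into a mean and a standard deviation. The fold scores below are made-up values, the summarise_folds helper and the "mean"/"std" keys are hypothetical, and the use of the population standard deviation is an assumption: the reference does not say which variant mlgauge computes.

```python
from statistics import mean, pstdev

# Made-up scores of one method on one dataset, keyed by split label.
fold_scores = {"fold_1": 0.80, "fold_2": 0.90, "fold_3": 0.70}


def summarise_folds(scores):
    """Collapse per-fold scores into their mean and standard deviation."""
    values = list(scores.values())
    # Assumption: population standard deviation (ddof=0).
    return {"mean": mean(values), "std": pstdev(values)}


summary = summarise_folds(fold_scores)
print(summary)
```

With mean_folds set to False, the per-fold values themselves (the contents of fold_scores in this sketch) would be reported instead of the two summary numbers.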
-
plot_results(metric=None, ax=None)¶ Plot results as a bar plot.
- Parameters
metric (str) – Enter the metric string for which the result should be displayed.
ax (matplotlib Axes) – Axes in which to draw the plot, otherwise use the currently-active Axes.
- Returns
Axes containing the plot.
- Return type
(matplotlib Axes)
-