API Reference

Method

class mlgauge.SklearnMethod(estimator, metrics, export_model=False, cv=5)

A wrapper to directly use an sklearn estimator with an analysis.

__init__(estimator, metrics, export_model=False, cv=5)

Initialize sklearn method.

Parameters
  • estimator (estimator) – An sklearn estimator or a pipeline.

  • metrics (list) – List of metric strings or sklearn callable metric functions. Refer to the sklearn documentation for the available metrics.

  • export_model (bool) – Exports the sklearn estimator through joblib as estimator(_fold_k).joblib if set to True.

  • cv (int) – Number of cross-validation folds to use when use_test_set is False. Ignored otherwise.

train(X_train, y_train, feature_names=None, category_indicator=None)

Train the model and return the training score.

Parameters
  • X_train (array) – Array of training vectors.

  • y_train (array) – Array of target values.

  • feature_names (list) – List of names of the features in X_train.

  • category_indicator (list) – List of booleans indicating whether each feature is a categorical variable.

Returns

list of metric scores evaluated on the training data.

Return type

list

Raises

AttributeError – raised when the method is called with use_test_set set to False.

test(X_test, y_test, feature_names=None, category_indicator=None)

Evaluate the model and return the test score.

Parameters
  • X_test (array) – Array of test vectors.

  • y_test (array) – Array of target values.

  • feature_names (list) – List of names of the features in X_test.

  • category_indicator (list) – List of booleans indicating whether each feature is a categorical variable.

Returns

list of metric scores evaluated on the testing data.

Return type

list
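The feature_names and category_indicator arguments describe the columns of the input. A minimal sketch of how a method might use them, assuming both lists are aligned with the columns of X (one entry per feature); the column names below are hypothetical:

```python
# Hypothetical column metadata, one entry per feature (assumed aligned with X).
feature_names = ["age", "city", "income"]
category_indicator = [False, True, False]

# Names of the categorical and numeric features.
categorical = [n for n, is_cat in zip(feature_names, category_indicator) if is_cat]
numeric = [n for n, is_cat in zip(feature_names, category_indicator) if not is_cat]

# Column positions of the categorical features, e.g. for an encoder.
categorical_idx = [i for i, is_cat in enumerate(category_indicator) if is_cat]
```

Both arguments default to None, so a method should be prepared to handle their absence.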

class mlgauge.Method

The base class for method definitions used in the analysis. Inherit from this class to define your own methods.

An Analysis instance dynamically sets the attributes below based on the dataset currently in use. These attributes are available to the inheriting class to provide additional information.

output_dir

Path of the directory where the outputs should be saved. Assigned automatically by the Analysis class. When defining a method based on this class, use this attribute to save any artifacts such as plots and model dumps.

Type

str

feature_names

List of feature names of the input dataset.

Type

list

use_test_set

When True, the method implements a test method.

Type

bool

cv

Number of cross-validation folds used when use_test_set is False.

Type

int

__init__()

Initialize self. See help(type(self)) for accurate signature.

set_output_dir(path)

Set output directory to save results of the method.

Parameters

path (str) – Path of the directory where the outputs should be saved.

set_test_set(use_test_set)

Specify if the method requires a test set.

Parameters

use_test_set (bool) – When True, the method implements a test method.

train(X_train, y_train, feature_names=None, category_indicator=None)

Train the model and return the training score.

Parameters
  • X_train (array) – Array of training vectors.

  • y_train (array) – Array of target values.

  • feature_names (list) – List of names of the features in X_train.

  • category_indicator (list) – List of booleans indicating whether each feature is a categorical variable.

Raises

NotImplementedError – raised when called from the base class.

test(X_test, y_test, feature_names=None, category_indicator=None)

Evaluate the model and return the test score.

Parameters
  • X_test (array) – Array of test vectors.

  • y_test (array) – Array of target values.

  • feature_names (list) – List of names of the features in X_test.

  • category_indicator (list) – List of booleans indicating whether each feature is a categorical variable.

Raises

NotImplementedError – raised when called from the base class.
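As an illustration of the subclassing pattern described for this class, the following is a minimal sketch of a custom method. MeanBaseline and the file name mean_baseline.json are hypothetical. The fallback Method class is only a stand-in so the snippet runs without mlgauge installed. The sketch assumes that set_output_dir stores the path on output_dir and that train and test should return one score per metric as a list, as SklearnMethod does.

```python
import json
import os

try:
    from mlgauge import Method
except ImportError:
    class Method:
        """Stand-in exposing only the documented setters (not the real class)."""

        def set_output_dir(self, path):
            self.output_dir = path

        def set_test_set(self, use_test_set):
            self.use_test_set = use_test_set


class MeanBaseline(Method):
    """Hypothetical method that always predicts the mean training target."""

    def train(self, X_train, y_train, feature_names=None, category_indicator=None):
        self.mean_ = sum(y_train) / len(y_train)
        # Save an artifact if an output directory has been assigned.
        out_dir = getattr(self, "output_dir", None)
        if out_dir:
            with open(os.path.join(out_dir, "mean_baseline.json"), "w") as fh:
                json.dump({"mean": float(self.mean_)}, fh)
        return [self._mse(y_train)]

    def test(self, X_test, y_test, feature_names=None, category_indicator=None):
        return [self._mse(y_test)]

    def _mse(self, y):
        # Mean squared error of the constant prediction.
        return sum((v - self.mean_) ** 2 for v in y) / len(y)
```

Since each call returns a single score, such a method would be paired with a single entry in metric_names, for example ["mse"].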

Analysis

class mlgauge.Analysis(methods, metric_names=None, datasets='all', n_datasets=20, data_source='pmlb', drop_na=False, use_test_set=True, test_size=0.25, random_state=None, output_dir=None, local_cache_dir=None, disable_progress=False)

The analysis class to run the method comparisons.

The class gathers datasets, methods and runs the given methods across different datasets and compiles the results.

results

Named array containing resulting metrics of the analysis.

The dimensions are named “datasets”, “methods”, “metrics” and “splits”. You can index each dimension by the name of the dataset, method, metric and split (“train”/“test” if the method uses a test set; “fold_1”, “fold_2”, … otherwise) using the loc attribute, as in pandas.

For example, to retrieve the test MSE score of your linear model on the houses dataset:

results.loc['houses', 'linear', 'mse', 'test']

Note

When integer IDs are specified for openml datasets, the dataset keys of the results attribute are stored as strings.

Refer to the xarray documentation for more detailed usage.

Type

xr.DataArray
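To show how label-based indexing on such an array behaves, here is a small sketch that builds a stand-in DataArray by hand with the four dimension names described above. The scores and the method name 'tree' are made up for illustration; a real results array would come from an Analysis run.

```python
import numpy as np
import xarray as xr

# Hypothetical scores: 1 dataset x 2 methods x 1 metric x 2 splits.
results = xr.DataArray(
    np.array([[[[0.10, 0.20]], [[0.05, 0.30]]]]),
    dims=["datasets", "methods", "metrics", "splits"],
    coords={
        "datasets": ["houses"],
        "methods": ["linear", "tree"],
        "metrics": ["mse"],
        "splits": ["train", "test"],
    },
)

# Positional label lookup, one label per dimension.
test_mse = results.loc["houses", "linear", "mse", "test"].item()

# Selecting by dimension name keeps the remaining dimensions.
test_scores = results.sel(metrics="mse", splits="test")
```

Here test_scores is a 2-D array over datasets and methods, which is convenient for comparing methods side by side.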

__init__(methods, metric_names=None, datasets='all', n_datasets=20, data_source='pmlb', drop_na=False, use_test_set=True, test_size=0.25, random_state=None, output_dir=None, local_cache_dir=None, disable_progress=False)

Initialize analysis.

Parameters
  • methods (list) – List of tuples, each containing a method name and a method object.

  • metric_names (list) –

    List of strings naming the metrics. The names are only used to label the metrics output by the method objects. If None, metrics are not collected from the methods.

    The list should have the same length as the list returned by the train and test methods of the Method instances.

  • datasets (str or list) –

    One of the following options:

    “all”: randomly select n_datasets from all available datasets in pmlb.

    “classification”: randomly select n_datasets from all available classification datasets in pmlb.

    “regression”: randomly select n_datasets from all available regression datasets in pmlb.

    list of strings: a list of valid pmlb/openml dataset names.

    list of ints: a list of valid openml dataset IDs. This is recommended for openml to avoid issues with versions.

    list of (‘dataset_name’, (X, y)) tuples: use this form to pass a custom dataset in the X, y format.

    list of (‘dataset_name’, (X_train, y_train), (X_test, y_test)) tuples: use this form to pass custom training and testing sets in the X, y format.

    Here, X and y can be numpy arrays or pandas DataFrames; using a DataFrame allows the input feature names to be passed to the methods.

  • n_datasets (int) – Number of datasets to randomly sample from the available pmlb datasets. Ignored if datasets is not a string.

  • data_source (str) – Source to fetch from when dataset names/IDs are passed. Either ‘pmlb’ or ‘openml’.

  • drop_na (bool) – If True, drops all rows in the dataset with null values.

  • random_state (None, int or RandomState instance) – Seed for the PRNG.

  • use_test_set (bool) – Whether the methods use a test set.

  • test_size (float) – The size of the test set. Ignored if use_test_set is False.

  • output_dir (str) – Path of the output directory where method artifacts will be stored. A separate directory for each method will be created inside the directory. Defaults to an “output” directory in the current working directory.

  • local_cache_dir (str) – Local cache to use for pmlb datasets. If None, cached data is not used.
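The two custom-dataset forms accepted by the datasets parameter can be sketched as plain Python structures. The dataset names 'toy' and 'toy_split' and the synthetic data are hypothetical, and the snippet only builds the lists; passing them on as Analysis(methods=..., datasets=...) is left out so that it runs without mlgauge.

```python
import numpy as np

# Synthetic regression data: 8 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# Form 1: a name with a single (X, y) pair.
whole = [("toy", (X, y))]

# Form 2: a name with explicit (X_train, y_train) and (X_test, y_test) pairs.
split = [("toy_split", (X[:6], y[:6]), (X[6:], y[6:]))]
```

With numpy arrays no feature names are available to the methods; a pandas DataFrame for X would carry them along.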

run()

Load the datasets, run the methods and collect the results.

get_result()

Get the result of the analysis.

Returns

A 4d named array containing the result metrics.

Return type

(xr.DataArray)

get_result_as_df(metric=None, train=False, mean_folds=True)

Get results as a pandas dataframe.

Parameters
  • metric (str) – Name of the metric for which the results should be displayed. Defaults to the first name in metric_names.

  • train (bool) – If True, also returns the train scores. Ignored if use_test_set is False.

  • mean_folds (bool) – If True, returns the mean and standard deviation of the k-fold results; otherwise returns the scores of all folds. Ignored if use_test_set is True.

Returns

Pandas dataframe with datasets for rows.

When use_test_set is True, the columns contain the train and test results; otherwise they contain the mean and standard deviation of the k-fold validation. If mean_folds is False, the scores of all folds are returned.

Return type

(pd.DataFrame)
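To make the k-fold summary concrete, the sketch below collapses hypothetical per-fold scores into a mean and a standard deviation using only the standard library. It is not the library's implementation. In particular, this reference does not say whether the population or the sample standard deviation is reported, so the use of pstdev here is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical scores of one method on one dataset, keyed by split name.
fold_scores = {"fold_1": 0.80, "fold_2": 0.90, "fold_3": 0.70}

# Summary comparable to what mean_folds=True describes.
summary = {
    "mean": mean(fold_scores.values()),
    "std": pstdev(fold_scores.values()),  # assumption: population std
}
```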

plot_results(metric=None, ax=None)

Plot results as a bar plot.

Parameters
  • metric (str) – Name of the metric for which the results should be displayed.

  • ax (matplotlib Axes) – Axes in which to draw the plot, otherwise use the currently-active Axes.

Returns

Axes containing the plot.

Return type

(matplotlib Axes)