XGBoost integration guide#

[Figure: Custom dashboard displaying metadata logged with XGBoost]

XGBoost is an optimized distributed library that implements machine learning algorithms under the Gradient Boosting framework. With the Neptune-XGBoost integration, the following metadata is logged automatically:

Metrics
Parameters
The pickled model
The feature importance chart
Visualized trees
Hardware consumption metrics
stdout and stderr streams
Training code and Git information

Before you start#

Sign up at neptune.ai/register.
Create a project for storing your metadata.
Ensure that you have at least version 1.3.0 of XGBoost installed:
```
pip install "xgboost>=1.3.0"
```
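
To confirm that the installed XGBoost meets the minimum version stated above, a check along the following lines may help. This is a minimal sketch using only the standard library. The helper names `parse_version` and `xgboost_is_recent_enough` are hypothetical and not part of XGBoost or Neptune, and the parser only looks at the leading numeric part of each version component.

```python
import re
from importlib import metadata

MIN_XGBOOST = (1, 3, 0)  # minimum version stated in this guide


def parse_version(version: str) -> tuple:
    """Return the first three numeric components of a version string as ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions so that "2.1" compares as (2, 1, 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def xgboost_is_recent_enough(installed: str, minimum: tuple = MIN_XGBOOST) -> bool:
    """Compare a version string against the minimum required version."""
    return parse_version(installed) >= minimum


try:
    installed = metadata.version("xgboost")
except metadata.PackageNotFoundError:
    print("xgboost is not installed")
else:
    ok = xgboost_is_recent_enough(installed)
    print(f"xgboost {installed}: {'OK' if ok else 'too old, upgrade to >= 1.3.0'}")
```

Comparing integer tuples rather than raw strings avoids treating a version such as 1.10.0 as older than 1.3.0.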

Installing the integration#

To use your preinstalled version of Neptune together with the integration:

pip install -U neptune-xgboost

To install both Neptune and the integration:

pip install -U "neptune[xgboost]"

Passing your Neptune credentials

Once you've signed up and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"

To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

export NEPTUNE_PROJECT="ml-team/classification"

Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.

On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt: setx SOME_NEPTUNE_VARIABLE "some-value". A variable set with setx is available in Command Prompt windows opened afterward.
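
Before starting a run, it can be useful to verify that both variables are actually visible to Python. The following is a minimal sketch using only the standard library. The helper names `missing_credentials` and `looks_like_full_project_name` are hypothetical, and the format check only tests for the workspace-name/project-name shape described above, not whether the project exists.

```python
import os

REQUIRED_VARIABLES = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


def looks_like_full_project_name(name: str) -> bool:
    """Check for exactly one slash with non-empty text on both sides."""
    workspace, sep, project = name.partition("/")
    return bool(workspace) and bool(sep) and bool(project) and "/" not in project


missing = missing_credentials()
if missing:
    print("Not set:", ", ".join(missing))
else:
    project = os.environ["NEPTUNE_PROJECT"]
    if looks_like_full_project_name(project):
        print("Neptune credentials found in the environment")
    else:
        print(f"NEPTUNE_PROJECT={project!r} is not of the form workspace-name/project-name")
```

The sketch deliberately never prints the API token itself.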

Although it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune:

run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)

For more help, see Set Neptune credentials.

If you want to log visualized trees after training (recommended), additionally install Graphviz:

pip install -U graphviz

Note

The command above installs only the Python interface to Graphviz. You need to install the Graphviz software itself separately.

For installation help, see the Graphviz documentation.
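
Because two separate pieces are involved, a quick check of both can save a failed tree plot later. The sketch below uses only the standard library and assumes the Graphviz software exposes its usual `dot` executable on the PATH. The function name `graphviz_status` is hypothetical.

```python
import importlib.util
import shutil


def graphviz_status() -> dict:
    """Report whether the Python package and the `dot` executable can be found."""
    return {
        "python_package": importlib.util.find_spec("graphviz") is not None,
        "dot_executable": shutil.which("dot") is not None,
    }


status = graphviz_status()
for component, found in status.items():
    print(f"{component}: {'found' if found else 'missing'}")
```

If `python_package` is found but `dot_executable` is missing, only the pip package is installed and the Graphviz software still needs to be set up.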

If you'd rather follow the guide without any setup, you can run the example in Colab.

XGBoost logging example#

This example walks you through logging metadata as you train your model with XGBoost.

You can log metadata during training with NeptuneCallback.

Logging metadata during training#

Start a run:

import neptune
run = neptune.init_run()

If you haven't set up your credentials, you can log anonymously:

run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/xgboost-integration",
)

Initialize the Neptune callback:

from neptune.integrations.xgboost import NeptuneCallback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

Prepare your data, parameters, and so on.

Pass the callback to the train() function and train the model:

import xgboost as xgb

xgb.train(
    params=model_params,
    dtrain=dtrain,
    callbacks=[neptune_callback],
)

To stop the connection to Neptune and sync all data, call the stop() method:
```
run.stop()
```
Run your script as you normally would.

To open the run, click the Neptune link that appears in the console output.

Example link: https://app.neptune.ai/common/xgboost-integration/e/XGBOOST-84

Exploring results in Neptune#

In the run view, you can see the logged metadata organized into folder-like namespaces.

| Name | Description |
| --- | --- |
| `booster_config` | All parameters for the booster. |
| `early_stopping` | `best_score` and `best_iteration` (logged if early stopping was activated). |
| `epoch` | Epochs (visualized as a chart from first to last epoch). |
| `learning_rate` | Learning rate visualized as a chart. |
| `pickled_model` | Trained model logged as a pickled file. |
| `plots` | Feature importance and visualized trees. |
| `train` | Training metrics. |
| `valid` | Validation metrics. |

More options#

Changing the base namespace#

By default, the metadata is logged under the namespace training.

You can change the namespace when creating the Neptune callback:

neptune_callback = NeptuneCallback(
    run=run,
    base_namespace="my_custom_name",
)
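
To see what changing the base namespace means for where fields end up, the sketch below lists the paths you might expect in the run. It is an illustration only: it assumes each namespace from the table in "Exploring results in Neptune" sits directly under the base namespace, and the helper `expected_paths` is hypothetical rather than part of the integration.

```python
# Sub-namespaces taken from the table in "Exploring results in Neptune"
SUB_NAMESPACES = (
    "booster_config",
    "early_stopping",
    "epoch",
    "learning_rate",
    "pickled_model",
    "plots",
    "train",
    "valid",
)


def expected_paths(base_namespace: str = "training") -> list:
    """Join the base namespace with each sub-namespace using "/" as separator."""
    base = base_namespace.strip("/")
    return [f"{base}/{name}" for name in SUB_NAMESPACES]


for path in expected_paths("my_custom_name"):
    print(path)
```

With the default base namespace, the same helper yields paths such as `training/booster_config`.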

Using Neptune callback with CV function#

You can use NeptuneCallback in the xgboost.cv function. Neptune will log additional metadata for each fold in CV.

Pass the Neptune callback to the callbacks argument of xgb.cv():

The core code is shown first, followed by the full script:

import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run()

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data, params, etc.
...

# Run cross validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()

Full script:

import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,  # enables anonymous logging
    project="common/xgboost-integration",  # public project for testing
    name="xgb-cv",  # optional
    tags=["xgb-integration", "cv"],  # optional
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)

# Define parameters
model_params = {
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"]
}
evals = [(dtrain, "train"), (dval, "valid")]
num_round = 57

# Run cross validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    num_boost_round=num_round,
    nfold=7,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()

Notes on the run arguments:

The api_token argument is included to enable anonymous logging. Once you've registered, leave the token out of your script and instead save it as an environment variable.

Projects in the common workspace are public and can be used for testing. To log to your own workspace, pass the full name of your Neptune project: workspace-name/project-name. For example, project="ml-team/classification". You can copy the name from the project settings (menu in the top-right → Details & privacy).

In the All metadata section of the run view, you can see a fold_n namespace for each fold in an n-fold CV:

fold_n
  |—— booster_config
  |—— pickled_model
  |—— plots
        |—— importance
        |—— trees

Namespaces inside the fold_n namespace:

| Name | Description |
| --- | --- |
| `booster_config` | All parameters for the booster. |
| `pickled_model` | Trained model logged as a pickled file. |
| `plots` | Feature importance and visualized trees. |

Working with scikit-learn API#

You can use NeptuneCallback in the scikit-learn API of XGBoost.

Pass the Neptune callback while creating the regressor object:

The core code is shown first, followed by the full script:

import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run()

# Create neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data, params, etc.
X_train = ...
y_train = ...
model_params = {...}

# Create regressor object and pass the Neptune callback
reg = xgb.XGBRegressor(**model_params, callbacks=[neptune_callback])

# Fit the model
reg.fit(X_train, y_train)

# Stop run
run.stop()

Full script:

import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,  # enables anonymous logging
    project="common/xgboost-integration",  # public project for testing
)

# Create neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
data = fetch_california_housing()
y = data["target"]
X = data["data"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create regressor object and pass the Neptune callback
reg = xgb.XGBRegressor(callbacks=[neptune_callback])

# Fit the model
reg.fit(X_train, y_train)

# Stop run
run.stop()

Manually logging metadata#

If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library.

When you initialize the run, you get a run object, to which you can assign different types of metadata in a structure of your own choosing.

import neptune

# Create a new Neptune run
run = neptune.init_run()

# Log metrics inside loops
for epoch in range(n_epochs):
    # Your training loop

    run["train/epoch/loss"].append(loss)  # Each append() call appends a value
    run["train/epoch/accuracy"].append(acc)

# Track artifact versions and metadata
run["train/images"].track_files("./datasets/images")

# Upload entire files
run["test/preds"].upload("path/to/test_preds.csv")

# Log text or other metadata, in a structure of your choosing
run["tokenizer"] = "regexp_tokenize"

Add Neptune to your code
What you can log and display
Resume a run
API reference ≫ XGBoost integration
neptune-xgboost repo on GitHub
XGBoost on GitHub