The Autoembedder¶

Introduction¶

The Autoembedder is an autoencoder with additional embedding layers for the categorical columns. Its usage is flexible, and hyperparameters like the number of layers can be easily adjusted and tuned. The data provided for training can be either a path to a Dask or Pandas DataFrame stored in the Parquet format or the DataFrame object directly.

Installation¶

If you are using Poetry, you can install the package with the following command:

poetry add autoembedder

If you are using pip, you can install the package with the following command:

pip install autoembedder

Installing dependencies¶

With Poetry:

poetry install

With pip:

pip install -r requirements.txt

Usage¶

0. Some imports¶

from autoembedder import Autoembedder, dataloader, fit

1. Create dataloaders¶

First, we create two dataloaders: one for the training data and one for the validation data. As a source, each accepts a path to a Parquet file, a path to a folder of Parquet files, or a Pandas/Dask DataFrame.

train_dl = dataloader(train_df)
valid_dl = dataloader(valid_df)
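
The snippet above assumes that the training and validation DataFrames already exist. One way to obtain them from a single Pandas DataFrame is sketched below. The column names, the 80/20 ratio, and the helper name `split_frame` are illustrative assumptions and not part of the package.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, train_frac: float = 0.8, seed: int = 0):
    """Randomly split df into a training and a validation DataFrame."""
    # Sample the training rows reproducibly, then use the remaining rows for validation.
    train = df.sample(frac=train_frac, random_state=seed)
    valid = df.drop(train.index)
    return train.reset_index(drop=True), valid.reset_index(drop=True)


# Toy data with two continuous columns (hypothetical names).
df = pd.DataFrame({
    "amount": [float(i) for i in range(10)],
    "duration": [i * 0.5 for i in range(10)],
})
train_df, valid_df = split_frame(df)
print(len(train_df), len(valid_df))
```

The two resulting DataFrames can then be passed to `dataloader` as shown above.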

2. Set parameters¶

Now, we need to set the parameters. They are used for handling the data and training the model. In this example, only parameters for the training are set; the Parameters section lists all possible parameters. The following is sufficient:

parameters = {
    "hidden_layers": [[25, 20], [20, 10]],
    "epochs": 10,
    "lr": 0.0001,
    "verbose": 1,
}
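
In the examples in this document, each inner list of `hidden_layers` appears to describe one layer as an `[input size, output size]` pair, with the output size of one layer matching the input size of the next. Assuming that reading is correct (it is inferred from the examples, not confirmed here), a small sanity check could look like the sketch below. `check_hidden_layers` is a hypothetical helper, not part of the package.

```python
def check_hidden_layers(hidden_layers):
    """Return True if every layer is a pair of positive ints and consecutive layers chain."""
    for layer in hidden_layers:
        if len(layer) != 2 or not all(isinstance(n, int) and n > 0 for n in layer):
            return False
    # The output size of each layer must equal the input size of the following layer.
    return all(prev[1] == nxt[0] for prev, nxt in zip(hidden_layers, hidden_layers[1:]))


example_layers = [[25, 20], [20, 10]]
print(check_hidden_layers(example_layers))
```

Running such a check before training can surface a mistyped layer size early, instead of as a shape error during the first forward pass.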

3. Initialize the autoembedder¶

Then, we need to initialize the autoembedder. In this example, we are not using any categorical features, so we can skip the embedding_sizes argument.

model = Autoembedder(parameters, num_cont_features=train_df.shape[1])

4. Train the model¶

Everything is set up. Now we can fit the model.

fit(parameters, model, train_dl, valid_dl)

Example¶

Check out this Jupyter notebook for an applied example using the Credit Card Fraud Detection dataset from Kaggle.

Parameters¶

This is a list of all parameters that can be passed to the Autoembedder for training. When using the training script, replace each _ in a parameter name with - and pass the parameters as command-line arguments. For boolean values, see the Comment column for how to pass them.

Run the training script¶

You can also simply use the training script:

python3 training.py \
--epochs 20 \
--train-input-path "path/to/your/train_data" \
--test-input-path "path/to/your/test_data" \
--hidden-layers "[[12, 6], [6, 3]]"

For help, just run:

python3 training.py --help
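
The naming rules for the training script (underscores become hyphens, booleans are passed as `--flag` / `--no-flag`) can be applied mechanically. The following is a minimal sketch that turns a parameter dictionary into an argument list. `to_cli_args` is a hypothetical helper and not part of the package, and passing nested lists as their Python string representation is an assumption based on the command shown above.

```python
def to_cli_args(params):
    """Convert a parameter dict into a list of command-line arguments."""
    args = []
    for name, value in params.items():
        flag = name.replace("_", "-")
        if isinstance(value, bool):
            # Booleans become --flag or --no-flag and take no value.
            args.append(f"--{flag}" if value else f"--no-{flag}")
        else:
            # Everything else, including nested lists, is passed as a string.
            args.extend([f"--{flag}", str(value)])
    return args


params = {
    "epochs": 20,
    "train_input_path": "path/to/your/train_data",
    "test_input_path": "path/to/your/test_data",
    "hidden_layers": [[12, 6], [6, 3]],
    "drop_last": False,
}
command = ["python3", "training.py", *to_cli_args(params)]
print(command)
```

A list built this way can be handed to `subprocess.run` without shell quoting concerns.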

Argument	Type	Required	Default value	Comment
batch_size	int	False	32
drop_last	bool	False	True	--drop-last / --no-drop-last
pin_memory	bool	False	True	--pin-memory / --no-pin-memory
num_workers	int	False	0	0 means that the data will be loaded in the main process
use_mps	bool	False	False	--use-mps / --no-use-mps
model_title	str	False	autoembedder_{`datetime`}.bin
model_save_path	str	False
n_save_checkpoints	int	False
lr	float	False	0.001
amsgrad	bool	False	False	--amsgrad / --no-amsgrad
epochs	int	True
dropout_rate	float	False	0	Dropout rate for the dropout layers in the encoder and decoder.
layer_bias	bool	False	True	--layer-bias / --no-layer-bias
weight_decay	float	False	False
l1_lambda	float	False	0
xavier_init	bool	False	False	--xavier-init / --no-xavier-init
activation	str	False	tanh	Activation function; either `tanh`, `relu`, `leaky_relu` or `elu`
tensorboard_log_path	str	False
trim_eval_errors	bool	False	False	--trim-eval-errors / --no-trim-eval-errors; Removes the max and min loss when calculating the `mean loss diff` and `median loss diff`. This can be useful if some rows create very high losses.
verbose	int	False	0	Set this to `1` if you want to see the model summary and the validation and evaluation results. Set this to `2` if you want to see the training progress bar. `0` means no output.
target	str	False		The target column. If not set no evaluation will be performed.
train_input_path	str	True
test_input_path	str	True
eval_input_path	str	False		Path to the evaluation data. If no path is provided no evaluation will be performed.
hidden_layers	str	True		Contains a string representation of a list of lists of integers which represents the hidden layer structure. E.g.: `"[[64, 32], [32, 16], [16, 8]]"`
cat_columns	str	False	"[]"	Contains a string representation of a list of lists of categorical columns (strings). The columns which use the same encoder should be together in a list. E.g.: `"[['a', 'b'], ['c']]"`.
drop_cat_columns	bool	False		--drop-cat-columns / --no-drop-cat-columns
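
`hidden_layers` and `cat_columns` are passed as string representations of nested lists. How the training script parses them is not shown here. One safe way to turn such strings back into Python lists, sketched under that caveat, is `ast.literal_eval` from the standard library. The helper name `parse_nested_list` is hypothetical.

```python
import ast


def parse_nested_list(text):
    """Parse a string such as "[[64, 32], [32, 16]]" into a list of lists."""
    value = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    if not isinstance(value, list) or not all(isinstance(item, list) for item in value):
        raise ValueError(f"expected a list of lists, got: {text!r}")
    return value


hidden_layers = parse_nested_list("[[64, 32], [32, 16], [16, 8]]")
cat_columns = parse_nested_list("[['a', 'b'], ['c']]")
print(hidden_layers, cat_columns)
```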

Why additional embedding layers?¶

The additional embedding layers automatically embed all columns with the Pandas category data type. If categorical columns have another data type, they will not be embedded and will be handled like continuous columns. Simply encoding the categorical values as integers (e.g., with a label encoder) decreases the quality of the outcome, because the integer codes impose an arbitrary order on the categories.
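
Because only columns with the Pandas `category` data type are embedded, categorical columns usually need to be converted before the data is handed to the dataloader. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [10.5, 3.2, 7.7, 1.0],
    "country": ["DE", "FR", "DE", "US"],  # categorical, but stored as strings
})

# Mark the categorical column explicitly so that it has the category dtype.
df["country"] = df["country"].astype("category")

# Separate the columns by dtype to see which ones would be treated as categorical.
cat_cols = list(df.select_dtypes(include="category").columns)
cont_cols = [c for c in df.columns if c not in cat_cols]
print(cat_cols, cont_cols)
```

Checking the two lists before training is a quick way to confirm that no categorical column is silently treated as continuous.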