
Evaluator

loss_delta(_, __, model, parameters, df=None)

This evaluation function calculates the loss delta between the two classes of the binary target variable on the evaluation set. This delta describes how well the model can distinguish between the categories of the target variable.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `_` | `None` | Not in use. Needed by PyTorch-Ignite. | *required* |
| `__` | `None` | Not in use. Needed by PyTorch-Ignite. | *required* |
| `model` | `Autoembedder` | Instance of the model used for prediction. | *required* |
| `parameters` | `Dict[str, Any]` | Dictionary with the parameters used for training and prediction. All possible parameters are listed in the [documentation](https://chrislemke.github.io/autoembedder/#parameters). | *required* |
| `df` | `Optional[Union[dd.DataFrame, pd.DataFrame]]` | Dask or Pandas DataFrame. If `eval_input_path` is set, the data is loaded from that path and this argument is ignored. | `None` |

Returns:

| Type | Description |
| ---- | ----------- |
| `Tuple[float, float]` | `loss_mean_diff` and `loss_median_diff`: the absolute differences of the mean and the median loss between the two classes. |
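
A minimal usage sketch, assuming a trained `Autoembedder` instance and an evaluation DataFrame with a binary target column (the file path, the `label` column name, and the model variable are illustrative, not part of the package's documented examples):

```python
import pandas as pd

from autoembedder.evaluator import loss_delta

# Hypothetical evaluation data with a binary target column named "label".
df = pd.read_parquet("data/eval.parquet")

parameters = {
    "target": "label",      # binary target column in `df`
    "trim_eval_errors": 1,  # drop the largest and smallest loss per class
}

model = ...  # a trained Autoembedder instance (assumed)

# The first two arguments are unused placeholders filled by PyTorch-Ignite.
loss_mean_diff, loss_median_diff = loss_delta(None, None, model, parameters, df=df)
print(f"mean delta: {loss_mean_diff:.4f}, median delta: {loss_median_diff:.4f}")
```

Larger deltas indicate that the model reconstructs the two classes with clearly different losses, i.e. separates them better.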

Source code in src/autoembedder/evaluator.py
def loss_delta(
    _,
    __,
    model: Autoembedder,
    parameters: Dict[str, Any],
    df: Optional[Union[dd.DataFrame, pd.DataFrame]] = None,
) -> Tuple[float, float]:  # type: ignore
    """This evaluation function calculates the loss delta between the training
    and test set. This delta describes how well the model can distinguish
    between the categories of the target variable.

    Args:
        _ (None): Not in use. Needed by PyTorch-Ignite.
        __ (None): Not in use. Needed by PyTorch-Ignite.
        model (Autoembedder): Instance of the model used for prediction.
        parameters (Dict[str, Any]): Dictionary with the parameters used for training and prediction.
            In the [documentation](https://chrislemke.github.io/autoembedder/#parameters) all possible parameters are listed.
        df (Optional[Union[dd.DataFrame, pd.DataFrame]], optional): Dask or Pandas DataFrame. If `eval_input_path`
            is set in `parameters`, the data is loaded from that path and this argument is ignored.

    Returns:
        Tuple[float, float]: `loss_mean_diff` and `loss_median_diff`: the absolute
            differences of the mean and the median loss between the two classes.
    """
    # Loading from `eval_input_path` takes precedence over a passed `df`.
    if parameters.get("eval_input_path", None) is not None:
        try:
            # Read with Dask and shuffle the rows.
            df = (
                dd.read_parquet(parameters["eval_input_path"], infer_divisions=True)
                .compute()
                .sample(frac=1)
            )
        except ValueError:
            # Fall back to Pandas if Dask cannot read the Parquet file.
            df = pd.read_parquet(parameters["eval_input_path"]).sample(frac=1)
    elif df is not None:
        # Materialize Dask DataFrames, then shuffle the rows.
        if isinstance(df, dd.DataFrame):
            df = df.compute()
        df = df.sample(frac=1)
    else:
        raise ValueError(
            "No DataFrame given! Please provide a DataFrame or a path to a parquet file."
        )

    target = parameters["target"]

    # Balance the classes: downsample class 0 to the number of class 1 rows.
    df_1 = df.query(f"{target} == 1").drop([target], axis=1)
    df_0 = df.query(f"{target} == 0").drop([target], axis=1).sample(n=df_1.shape[0])

    losses_0: List[float] = []
    losses_1: List[float] = []

    # Collect the per-row reconstruction loss (MSE) for each class separately.
    for losses_df, losses in [(df_0, losses_0), (df_1, losses_1)]:
        loss = MSELoss()
        for batch in losses_df.itertuples(index=False):
            losses.append(_predict(model, batch, loss, parameters))

    if parameters.get("trim_eval_errors", 0) == 1:
        losses_0.remove(max(losses_0))
        losses_0.remove(min(losses_0))
        losses_1.remove(max(losses_1))
        losses_1.remove(min(losses_1))

    # Absolute differences of the mean and of the median loss between the classes.
    return np.absolute(np.mean(losses_1) - np.mean(losses_0)), np.absolute(
        np.median(losses_1) - np.median(losses_0)
    )
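
The two leading placeholder arguments exist because PyTorch-Ignite calls an engine's process function as `process_function(engine, batch)`. A minimal sketch of that wiring, assuming `model`, `parameters`, and `df` as in the usage example above (the `Engine` setup here is illustrative, not the package's documented API):

```python
from functools import partial

from ignite.engine import Engine

# ignite supplies (engine, batch) as the first two positional arguments,
# which loss_delta accepts as the unused `_` and `__`.
evaluator = Engine(partial(loss_delta, model=model, parameters=parameters, df=df))

# A single dummy batch suffices: loss_delta operates on the whole DataFrame.
state = evaluator.run([None], max_epochs=1)
loss_mean_diff, loss_median_diff = state.output
```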