Sentence Embeddings

SWEN's Tier 1 classification uses sentence embeddings to find transactions that are semantically similar to past examples, even when the exact wording differs.

The Model

The encoder model is configurable via the SWEN_ML_ENCODER_MODEL environment variable. The default is: paraphrase-multilingual-MiniLM-L12-v2

Property              Value (default model)
Architecture          MiniLM-L12
Language              Multilingual (50+ languages, including German)
Training objective    Paraphrase similarity (cosine)
Embedding dimension   384
Model size            ~120 MB
Licence               Apache 2.0

Why a paraphrase model?

FinTS transaction purposes (Verwendungszweck) contain many variations of the same merchant — REWE MARKT 123 HAMBURG, REWE SAGT DANKE 456. A paraphrase-optimised model is explicitly trained to embed such variations close to each other, making cosine similarity a reliable clustering signal.
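
For example, the two REWE strings above should land close together in embedding space while an unrelated merchant stays distant. A minimal sketch using the default model (the string for the unrelated merchant is illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
texts = ["REWE MARKT 123 HAMBURG", "REWE SAGT DANKE 456", "SHELL TANKSTELLE 789"]
embeddings = model.encode(texts, normalize_embeddings=True)

# With normalised embeddings, cosine similarity is just a dot product
print(embeddings[0] @ embeddings[1])  # REWE vs. REWE: high
print(embeddings[0] @ embeddings[2])  # REWE vs. Shell: lower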

Custom models

Any model supported by the configured encoder backend can be substituted. For higher accuracy (at the cost of a larger download and slower inference), replace the default with a larger German-specific model such as deutsche-telekom/gbert-large-paraphrase-cosine.
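
A sketch of how the override takes effect, assuming the service reads the variable at startup (the exact loading code in SWEN may differ):

import os
from sentence_transformers import SentenceTransformer

# SWEN_ML_ENCODER_MODEL selects the encoder; falls back to the default
model_name = os.environ.get(
    "SWEN_ML_ENCODER_MODEL",
    "paraphrase-multilingual-MiniLM-L12-v2",
)
model = SentenceTransformer(model_name)  # e.g. deutsche-telekom/gbert-large-paraphrase-cosine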

HuggingFace Cache

The model is downloaded once on first startup and cached. In Docker, persist the cache across container recreations with a named volume:

volumes:
  ml-model-cache:

services:
  ml:
    volumes:
      - ml-model-cache:/root/.cache/huggingface

The environment variable HF_HOME can override the cache directory.

First-run download

The first time the ML service starts, it downloads the configured model from HuggingFace (roughly 120 MB for the default model; a larger model such as deutsche-telekom/gbert-large-paraphrase-cosine can exceed 1 GB). Classification requests return 503 until the download completes. Monitor with:

docker compose logs -f ml

Subsequent restarts load the model from the local cache (a few seconds).

Encoder Backend

SWEN supports two encoder backends, selected via SWEN_ML_ENCODER_BACKEND:

Backend                   SWEN_ML_ENCODER_BACKEND value   Notes
sentence-transformers     sentence-transformers           Recommended: automatic pooling and normalisation
HuggingFace transformers  huggingface                     Manual pooling via SWEN_ML_ENCODER_POOLING (mean / cls / max)

sentence-transformers example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embedding = model.encode("REWE MARKT 123 HAMBURG", normalize_embeddings=True)

HuggingFace backend additionally respects SWEN_ML_ENCODER_NORMALIZE, SWEN_ML_ENCODER_MAX_LENGTH, and SWEN_ML_ENCODER_POOLING.
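
A sketch of what the huggingface backend computes with mean pooling and normalisation enabled (illustrative; the actual backend code may differ):

import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("REWE MARKT 123 HAMBURG", return_tensors="pt",
                   truncation=True, max_length=128)  # cf. SWEN_ML_ENCODER_MAX_LENGTH
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (1, seq_len, 384)

mask = inputs["attention_mask"].unsqueeze(-1).float()  # zero out padding tokens
embedding = (hidden * mask).sum(1) / mask.sum(1)       # mean pooling
embedding = torch.nn.functional.normalize(embedding)   # cf. SWEN_ML_ENCODER_NORMALIZE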

Pooling Strategy

Both the default model and deutsche-telekom/gbert-large-paraphrase-cosine use mean pooling over the last hidden states of all non-padding tokens. This is configured in each model's 1_Pooling/config.json on HuggingFace and applied automatically by sentence-transformers.

Stored example embeddings are retrieved from the ML service's database and compared using cosine similarity:

\[ \text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} \]

Since embeddings are already L2-normalised (normalize_embeddings=True), this reduces to a simple dot product — fast and numerically stable.
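
A quick numerical check of that equivalence (a standalone sketch, not SWEN code):

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2-normalisation both norms are 1, so the dot product alone suffices
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

assert np.isclose(cosine, a_n @ b_n)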

The top-2 nearest neighbours are compared. A match is accepted when:

  • the best similarity ≥ 0.85 (high confidence), or
  • the best similarity ≥ 0.70 and the margin over the 2nd-best ≥ 0.10

The margin check prevents accepting an ambiguous result when the two nearest examples score nearly the same; the sketch below illustrates the rule.
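
A minimal sketch of this decision rule (thresholds from the list above; the function name and signature are illustrative, not SWEN's actual API):

import numpy as np

HIGH_SIM = 0.85    # accept outright
LOW_SIM = 0.70     # accept only with a clear margin
MIN_MARGIN = 0.10

def accept_top2(similarities: np.ndarray) -> int | None:
    """Return the index of the winning example, or None if no confident match."""
    order = np.argsort(similarities)
    best_idx = int(order[-1])
    best = similarities[best_idx]
    second = similarities[order[-2]] if len(similarities) > 1 else -1.0
    if best >= HIGH_SIM:
        return best_idx
    if best >= LOW_SIM and best - second >= MIN_MARGIN:
        return best_idx
    return None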

Warm-up

On service startup, SWEN loads the model and runs a single dummy inference to ensure the CUDA/CPU kernels are compiled before the first real request:

# Warm-up call in lifespan
model.encode("warm-up", normalize_embeddings=True)

This avoids a latency penalty on the first real request.