From ae5f51b2793058744bbfc64d72bf5be73b19928b Mon Sep 17 00:00:00 2001
From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Date: Thu, 25 Apr 2024 13:37:23 +0200
Subject: [PATCH 01/39] [`v3`] Training refactor - MultiGPU, loss logging, bf16, etc. (#2449)

* See #1638: Adds huggingface trainer for sentence transformers
* Fix type of tokenizer
* Get the trainer using the feature collation
* Update the docstring to reflect changes
* Initial draft for refactoring training using the Transformers Trainer
* Separate 'fit' functionality (new and old) into a mixin
* Resolve test issues
* Reformat
* Update the imports
* Add TODO regarding custom label columns
* Remove dead code
* Don't provide the trainer to the eval sampler
* Introduce datasets as a dependency
* Introduce "accelerate" as a dependency
* Avoid use_amp on CPU tests
* Specify that SentenceTransformer is a class, not a module
* Avoid circular import
* Remove | used as an "or" operator in typing
* Use test evaluator after training, as intended
* Use tokenize function instead of tokenizer; Add EvaluatorCallback which calls the evaluator on every epoch (for BC); Stop saving "do_lower_case" from Transformer;
* Reformat
* Revert Transformer tokenizer changes
* Add support for the tokenizer to return more than just input_ids & attention_masks
  Required for LSTM
* Use the test evaluators after training the examples
* Use pure torch for BoW tokenization
* Use dev evaluator for BiLSTM - test fails
* Add Trainer support for BoW-based models
* Pass epoch to evaluator in every-epoch callback
  For fit backwards compatibility
* Run formatting
* Use steps_per_epoch to set max_steps if possible
* Ignore extracting dataloader arguments for now
* Remove dead code
* Allow both "label" and "score" columns for labels
* Reformatting
* Improve errors if datasets don't match with loss dictionary well
* Made tests more consistent; list instead of set
* Simplify trainer with DatasetDict
* Implement a proportional sampler in addition to round robin
* Add CLIP finetuning support to the Trainer
* Start updating evaluators to return dictionaries
* Reformat
* Hackishly insert the DataParallel model into the loss function
* Allow for fsdp=["full_shard", "auto_wrap"] with fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"}
* Re-add support for DataParallel
* Use 'ParallelMode.NOT_PARALLEL'
* Prevent crash with DDP & an evaluation set
* When training with multiple datasets, add "dataset_name" column
  Rather than relying on some Batch Sampler hacking (which fails with some distributed training approaches)
* Update type hints: make loss & evaluator optional
  Co-authored-by: Wang Bo
* Set correct superclasses for samplers
* Override 'accelerator.even_batches' as it's incompatible with multi-dataset
* Throw exception if "return_loss" or "dataset_name" columns are used
* Set min. version for accelerate
* Heavily extend model card generation
* Remove some dead code
* Fix evaluator type hints
* Ensure that 'model_card_template.md' is included in the built package
* Rephrase comments slightly
* Heavily refactor samplers; add no duplicates/group by label samplers
* Ensure that data_loader.dataset exists in FitMixin
* Adopt 8 as the default batch size
* Fix logging error in example
* Remove the deprecated correct_bias
* Simplify with walrus operator
* Fix some bugs in set_widget_examples with short datasets
* Improve docstring slightly
* Add edge case in case training data has an unrecognized format
* Fix extracting dataset metadata
* Remove moot TYPE_CHECKING
* Set base model when loading a ST model also
* Add test_dataloader, add prefetch_factor to dataloaders
* Resolve predict_example fix; fix newlines in text
* Fix bug in compute_dataset_metrics examples
* Add call to action in ValueError
* Reuse original model card if no training is done
* Also collect nested losses (e.g. MatryoshkaLoss) and make losses in tags
* Remove generated tag; keep loss: prefix on tags
* Remove unused arguments
* Add support for "best model step" in model card
* Make hyperparameters code-formatted
* Fix load_best_model for Transformers models, prevent for non-Transformers
* Store base_model_revision in model_card_data
* Prevent crash when loading a local model
* Allow for bfloat16 inference

---------

Co-authored-by: Matthew Franglen
Co-authored-by: Wang Bo
---
 .gitignore | 6 +-
 MANIFEST.in | 1 +
 ...aining_stsbenchmark_avg_word_embeddings.py | 2 +-
 .../training_stsbenchmark_bow.py | 2 +-
 .../training_stsbenchmark_cnn.py | 2 +-
 ...ing_stsbenchmark_tf-idf_word_embeddings.py | 2 +-
 examples/training/clip/train_clip.ipynb | 6 +-
 .../distillation/model_distillation.py | 2 +-
 .../ms_marco/train_bi-encoder_margin-mse.py | 2 +-
 .../multilingual/make_multilingual.py | 6 +-
 .../train_askubuntu_ct-improved.py | 2 +
 requirements.txt | 4 +-
 sentence_transformers/SentenceTransformer.py | 400 ++------
 sentence_transformers/__init__.py | 16 +
 sentence_transformers/data_collator.py | 39 +
 .../BinaryClassificationEvaluator.py | 40 +-
 .../EmbeddingSimilarityEvaluator.py | 46 +-
 .../InformationRetrievalEvaluator.py | 25 +-
 .../evaluation/LabelAccuracyEvaluator.py | 12 +-
 .../evaluation/MSEEvaluator.py | 16 +-
 .../evaluation/MSEEvaluatorFromDataFrame.py | 16 +-
 .../evaluation/ParaphraseMiningEvaluator.py | 17 +-
 .../evaluation/RerankingEvaluator.py | 17 +-
 .../evaluation/SentenceEvaluator.py | 43 +-
 .../evaluation/SequentialEvaluator.py | 29 +-
 .../evaluation/TranslationEvaluator.py | 17 +-
 .../evaluation/TripletEvaluator.py | 29 +-
 sentence_transformers/fit_mixin.py | 619 ++++++++++++
 .../losses/AdaptiveLayerLoss.py | 13 +
 sentence_transformers/losses/AnglELoss.py | 13 +
 .../losses/BatchAllTripletLoss.py | 13 +
 .../losses/BatchHardSoftMarginTripletLoss.py | 13 +
 .../losses/BatchHardTripletLoss.py | 13 +
 .../losses/BatchSemiHardTripletLoss.py | 13 +
 .../CachedMultipleNegativesRankingLoss.py | 18 +-
 sentence_transformers/losses/CoSENTLoss.py | 12 +
 .../losses/ContrastiveLoss.py | 15 +
 .../losses/ContrastiveTensionLoss.py | 24 +
 .../losses/CosineSimilarityLoss.py | 9 +-
 .../losses/DenoisingAutoEncoderLoss.py | 16 +
 sentence_transformers/losses/GISTEmbedLoss.py | 13 +
 sentence_transformers/losses/MSELoss.py | 14 +
 sentence_transformers/losses/MarginMSELoss.py | 13 +
 .../losses/Matryoshka2dLoss.py | 13 +
 .../losses/MatryoshkaLoss.py | 13 +
 .../losses/MegaBatchMarginLoss.py | 18 +
.../losses/MultipleNegativesRankingLoss.py | 13 + sentence_transformers/losses/SoftmaxLoss.py | 14 + sentence_transformers/losses/TripletLoss.py | 31 +- sentence_transformers/model_card.py | 920 ++++++++++++++++++ sentence_transformers/model_card_template.md | 228 +++++ sentence_transformers/models/BoW.py | 5 +- sentence_transformers/models/CLIPModel.py | 4 + sentence_transformers/models/Transformer.py | 2 + sentence_transformers/sampler.py | 210 ++++ sentence_transformers/trainer.py | 553 +++++++++++ sentence_transformers/training_args.py | 39 + sentence_transformers/util.py | 20 + setup.py | 3 + tests/conftest.py | 6 + tests/test_evaluator.py | 9 +- tests/test_model_card_data.py | 24 + tests/test_pretrained_stsb.py | 3 +- tests/test_train_stsb.py | 12 +- tests/test_trainer.py | 127 +++ 65 files changed, 3450 insertions(+), 447 deletions(-) create mode 100644 MANIFEST.in create mode 100644 sentence_transformers/data_collator.py create mode 100644 sentence_transformers/fit_mixin.py create mode 100644 sentence_transformers/model_card.py create mode 100644 sentence_transformers/model_card_template.md create mode 100644 sentence_transformers/sampler.py create mode 100644 sentence_transformers/trainer.py create mode 100644 sentence_transformers/training_args.py create mode 100644 tests/test_model_card_data.py create mode 100644 tests/test_trainer.py diff --git a/.gitignore b/.gitignore index 7c27f10bf..9eac52b8e 100644 --- a/.gitignore +++ b/.gitignore @@ -19,7 +19,9 @@ nr_*/ /docs/make.bat /examples/training/quora_duplicate_questions/quora-IR-dataset/ build - htmlcov .coverage -.venv \ No newline at end of file +wandb +checkpoints +tmp +.venv diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 000000000..d6144a5d0 --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1 @@ +include sentence_transformers/model_card_template.md \ No newline at end of file diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py index e3c9a7376..bb965df98 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py @@ -105,4 +105,4 @@ model = SentenceTransformer(model_save_path) test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") -model.evaluate(evaluator) +model.evaluate(test_evaluator) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py index 16121753f..503de464d 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py @@ -128,4 +128,4 @@ model = SentenceTransformer(model_save_path) test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") -model.evaluate(evaluator) +model.evaluate(test_evaluator) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py index c73315364..a7c822f52 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py @@ -105,4 +105,4 @@ model = SentenceTransformer(model_save_path) test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") 
-model.evaluate(evaluator) +model.evaluate(test_evaluator) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py index 17006638d..f45a4e7d3 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py @@ -132,4 +132,4 @@ model = SentenceTransformer(model_save_path) test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") -model.evaluate(evaluator) +model.evaluate(test_evaluator) diff --git a/examples/training/clip/train_clip.ipynb b/examples/training/clip/train_clip.ipynb index a5dd57de1..ea5e7758e 100644 --- a/examples/training/clip/train_clip.ipynb +++ b/examples/training/clip/train_clip.ipynb @@ -89,15 +89,15 @@ "train_dataset = []\n", "for idx in range(0, len(photos), 2):\n", " # We can use image pairs directly. Because our images aren't labeled, we use a random label as an example\n", - " train_dataset.append(InputExample(texts=[photos[idx], photos[idx + 1]], label=random.choice([0, 1])))\n", + " # train_dataset.append(InputExample(texts=[photos[idx], photos[idx + 1]], label=random.choice([0, 1])))\n", " \n", " # Or images and text together\n", " train_dataset.append(InputExample(texts=[photos[idx], \"This is the caption\"], label=1))\n", " train_dataset.append(InputExample(texts=[photos[idx], \"This is another unrelated caption\"], label=0))\n", "\n", " # Or just texts\n", - " train_dataset.append(InputExample(texts=[\"This is a caption\", \"This is a similar caption\"], label=1))\n", - " train_dataset.append(InputExample(texts=[\"This is a caption\", \"This is an unrelated caption\"], label=0))\n" + " # train_dataset.append(InputExample(texts=[\"This is a caption\", \"This is a similar caption\"], label=1))\n", + " # train_dataset.append(InputExample(texts=[\"This is a caption\", \"This is an unrelated caption\"], label=0))\n" ] }, { diff --git a/examples/training/distillation/model_distillation.py b/examples/training/distillation/model_distillation.py index bf469bc8f..f8e6bf333 100644 --- a/examples/training/distillation/model_distillation.py +++ b/examples/training/distillation/model_distillation.py @@ -205,6 +205,6 @@ evaluation_steps=5000, output_path=output_path, save_best_model=True, - optimizer_params={"lr": 1e-4, "eps": 1e-6, "correct_bias": False}, + optimizer_params={"lr": 1e-4, "eps": 1e-6}, use_amp=True, ) diff --git a/examples/training/ms_marco/train_bi-encoder_margin-mse.py b/examples/training/ms_marco/train_bi-encoder_margin-mse.py index d84852861..7b397da62 100644 --- a/examples/training/ms_marco/train_bi-encoder_margin-mse.py +++ b/examples/training/ms_marco/train_bi-encoder_margin-mse.py @@ -165,7 +165,7 @@ negs_to_use = args.negs_to_use.split(",") else: # Use all systems negs_to_use = list(data["neg"].keys()) - logging.info("Using negatives from the following systems:", negs_to_use) + logging.info("Using negatives from the following systems: {}".format(", ".join(negs_to_use))) for system_name in negs_to_use: if system_name not in data["neg"]: diff --git a/examples/training/multilingual/make_multilingual.py b/examples/training/multilingual/make_multilingual.py index 0b8f3d29b..fafe454dd 100644 --- a/examples/training/multilingual/make_multilingual.py +++ b/examples/training/multilingual/make_multilingual.py @@ -189,7 +189,7 @@ def download_corpora(filepaths): dev_mse 
= evaluation.MSEEvaluator( src_sentences, trg_sentences, - name=os.path.basename(dev_file), + name=os.path.basename(dev_file).split(".")[0], teacher_model=teacher_model, batch_size=inference_batch_size, ) @@ -197,7 +197,7 @@ def download_corpora(filepaths): # TranslationEvaluator computes the embeddings for all parallel sentences. It then check if the embedding of source[i] is the closest to target[i] out of all available target sentences dev_trans_acc = evaluation.TranslationEvaluator( - src_sentences, trg_sentences, name=os.path.basename(dev_file), batch_size=inference_batch_size + src_sentences, trg_sentences, name=os.path.basename(dev_file).split(".")[0], batch_size=inference_batch_size ) evaluators.append(dev_trans_acc) @@ -238,7 +238,7 @@ def download_corpora(filepaths): data["sentences2"], data["scores"], batch_size=inference_batch_size, - name=filename, + name=filename.split(".")[0], show_progress_bar=False, ) evaluators.append(test_evaluator) diff --git a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py index fa0b9f8d7..e0c56e7f4 100644 --- a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py +++ b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py @@ -103,6 +103,8 @@ def read_eval_dataset(filepath): model.fit( train_objectives=[(train_dataloader, train_loss)], + evaluator=dev_evaluator, + evaluation_steps=100, epochs=1, warmup_steps=100, use_amp=True, # Set to True, if your GPU has optimized FP16 cores diff --git a/requirements.txt b/requirements.txt index 7fe24f4a6..6344944fb 100644 --- a/requirements.txt +++ b/requirements.txt @@ -5,4 +5,6 @@ numpy scikit-learn scipy huggingface-hub>=0.15.1 -Pillow \ No newline at end of file +Pillow +datasets +accelerate>=0.20.3 \ No newline at end of file diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py index 656103966..bdbd03905 100644 --- a/sentence_transformers/SentenceTransformer.py +++ b/sentence_transformers/SentenceTransformer.py @@ -2,10 +2,10 @@ import json import logging import os -import shutil from collections import OrderedDict +from pathlib import Path import warnings -from typing import List, Dict, Literal, Tuple, Iterable, Type, Union, Callable, Optional, TYPE_CHECKING +from typing import List, Dict, Literal, Tuple, Iterable, Union, Optional import numpy as np from numpy import ndarray import transformers @@ -13,20 +13,20 @@ from huggingface_hub import HfApi import torch from torch import nn, Tensor, device -from torch.optim import Optimizer -from torch.utils.data import DataLoader import torch.multiprocessing as mp from tqdm.autonotebook import trange import math import queue import tempfile +from sentence_transformers.model_card import SentenceTransformerModelCardData, generate_model_card + + from . import __MODEL_HUB_ORGANIZATION__ from .evaluation import SentenceEvaluator from .util import ( import_from_string, batch_to_device, - fullname, is_sentence_transformer_model, load_dir_path, load_file_path, @@ -36,17 +36,13 @@ ) from .quantization import quantize_embeddings from .models import Transformer, Pooling, Normalize -from .model_card_templates import ModelCardTemplate +from .fit_mixin import FitMixin from . 
import __version__ logger = logging.getLogger(__name__) -if TYPE_CHECKING: - from sentence_transformers.readers import InputExample - - -class SentenceTransformer(nn.Sequential): +class SentenceTransformer(nn.Sequential, FitMixin): """ Loads or creates a SentenceTransformer model that can be used to map sentences / text to embeddings. @@ -89,11 +85,13 @@ def __init__( token: Optional[Union[bool, str]] = None, use_auth_token: Optional[Union[bool, str]] = None, truncate_dim: Optional[int] = None, + model_card_data: Optional[SentenceTransformerModelCardData] = None, ): # Note: self._load_sbert_model can also update `self.prompts` and `self.default_prompt_name` self.prompts = prompts or {} self.default_prompt_name = default_prompt_name self.truncate_dim = truncate_dim + self.model_card_data = model_card_data or SentenceTransformerModelCardData() self._model_card_vars = {} self._model_card_text = None self._model_config = {} @@ -263,6 +261,9 @@ def __init__( "Either update the model configuration or call `model.set_pooling_include_prompt(False)` after loading the model." ) + # Pass the model to the model card data for later use in generating a model card upon saving this model + self.model_card_data.register_model(self) + def encode( self, sentences: Union[str, List[str]], @@ -423,7 +424,10 @@ def encode( all_embeddings = torch.Tensor() elif convert_to_numpy: if not isinstance(all_embeddings, np.ndarray): - all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings]) + if all_embeddings[0].dtype == torch.bfloat16: + all_embeddings = np.asarray([emb.float().numpy() for emb in all_embeddings]) + else: + all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings]) elif isinstance(all_embeddings, np.ndarray): all_embeddings = [torch.from_numpy(embedding) for embedding in all_embeddings] @@ -724,63 +728,34 @@ def save( self._create_model_card(path, model_name, train_datasets) def _create_model_card( - self, path: str, model_name: Optional[str] = None, train_datasets: Optional[List[str]] = None + self, path: str, model_name: Optional[str] = None, train_datasets: Optional[List[str]] = "deprecated" ): """ - Create an automatic model and stores it in path + Create an automatic model and stores it in path. If no training was done, and the loaded model was + a Sentence Transformer model already, then its model card is reused. """ - if self._model_card_text is not None and len(self._model_card_text) > 0: + if model_name: + model_path = Path(model_name) + if not model_path.exists() and not self.model_card_data.model_id: + self.model_card_data.model_id = model_name + + # If we loaded a Sentence Transformer model from the Hub, and no training was done, then + # we don't generate a new model card, but reuse the old one instead. 
+ if self._model_card_text and self.model_card_data.trainer is None: model_card = self._model_card_text else: - tags = ModelCardTemplate.__TAGS__.copy() - model_card = ModelCardTemplate.__MODEL_CARD__ - - if ( - len(self._modules) == 2 - and isinstance(self._first_module(), Transformer) - and isinstance(self._last_module(), Pooling) - and self._last_module().get_pooling_mode_str() in ["cls", "max", "mean"] - ): - pooling_module = self._last_module() - pooling_mode = pooling_module.get_pooling_mode_str() - model_card = model_card.replace( - "{USAGE_TRANSFORMERS_SECTION}", ModelCardTemplate.__USAGE_TRANSFORMERS__ - ) - pooling_fct_name, pooling_fct = ModelCardTemplate.model_card_get_pooling_function(pooling_mode) - model_card = ( - model_card.replace("{POOLING_FUNCTION}", pooling_fct) - .replace("{POOLING_FUNCTION_NAME}", pooling_fct_name) - .replace("{POOLING_MODE}", pooling_mode) + try: + model_card = generate_model_card(self) + except Exception as exc: + logger.error( + f"Error while generating model card: {exc}\n" + "Consider opening an issue on https://github.com/UKPLab/sentence-transformers/issues with these logs.\n" + "Skipping model card creation." ) - tags.append("transformers") - - # Print full model - model_card = model_card.replace("{FULL_MODEL_STR}", str(self)) - - # Add tags - model_card = model_card.replace("{TAGS}", "\n".join(["- " + t for t in tags])) - - datasets_str = "" - if train_datasets is not None: - datasets_str = "datasets:\n" + "\n".join(["- " + d for d in train_datasets]) - model_card = model_card.replace("{DATASETS}", datasets_str) - - # Add dim info - self._model_card_vars["{NUM_DIMENSIONS}"] = self.get_sentence_embedding_dimension() - - # Replace vars we created while using the model - for name, value in self._model_card_vars.items(): - model_card = model_card.replace(name, str(value)) - - # Replace remaining vars with default values - for name, value in ModelCardTemplate.__DEFAULT_VARS__.items(): - model_card = model_card.replace(name, str(value)) - - if model_name is not None: - model_card = model_card.replace("{MODEL_NAME}", model_name.strip()) + return with open(os.path.join(path, "README.md"), "w", encoding="utf8") as fOut: - fOut.write(model_card.strip()) + fOut.write(model_card) @save_to_hub_args_decorator def save_to_hub( @@ -881,6 +856,7 @@ def push_to_hub( exist_ok=exist_ok, ) repo_id = repo_url.repo_id # Update the repo_id in case the old repo_id didn't contain a user or organization + self.model_card_data.set_model_id(repo_id) if local_model_path: folder_url = api.upload_folder( repo_id=repo_id, folder_path=local_model_path, commit_message=commit_message @@ -904,21 +880,6 @@ def push_to_hub( # This isn't expected to ever be reached. return folder_url - def smart_batching_collate(self, batch: List["InputExample"]) -> Tuple[List[Dict[str, Tensor]], Tensor]: - """ - Transforms a batch from a SmartBatchingDataset to a batch of tensors for the model - Here, batch is a list of InputExample instances: [InputExample(...), ...] - - :param batch: - a batch from a SmartBatchingDataset - :return: - a batch of tensors for the model - """ - texts = [example.texts for example in batch] - sentence_features = [self.tokenize(sentence) for sentence in zip(*texts)] - labels = torch.tensor([example.label for example in batch]) - return sentence_features, labels - def _text_length(self, text: Union[List[int], List[List[int]]]): """ Help function to get the length for the input text. 
Text can be either @@ -935,214 +896,6 @@ def _text_length(self, text: Union[List[int], List[List[int]]]): else: return sum([len(t) for t in text]) # Sum of length of individual strings - def fit( - self, - train_objectives: Iterable[Tuple[DataLoader, nn.Module]], - evaluator: SentenceEvaluator = None, - epochs: int = 1, - steps_per_epoch=None, - scheduler: str = "WarmupLinear", - warmup_steps: int = 10000, - optimizer_class: Type[Optimizer] = torch.optim.AdamW, - optimizer_params: Dict[str, object] = {"lr": 2e-5}, - weight_decay: float = 0.01, - evaluation_steps: int = 0, - output_path: str = None, - save_best_model: bool = True, - max_grad_norm: float = 1, - use_amp: bool = False, - callback: Callable[[float, int, int], None] = None, - show_progress_bar: bool = True, - checkpoint_path: str = None, - checkpoint_save_steps: int = 500, - checkpoint_save_total_limit: int = 0, - ): - """ - Train the model with the given training objective - Each training objective is sampled in turn for one batch. - We sample only as many batches from each objective as there are in the smallest one - to make sure of equal training with each dataset. - - :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning - :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. - :param epochs: Number of epochs for training - :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal the DataLoader size from train_objectives. - :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. - :param optimizer_class: Optimizer - :param optimizer_params: Optimizer parameters - :param weight_decay: Weight decay for model parameters - :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps - :param output_path: Storage path for the model and evaluation files - :param save_best_model: If true, the best model (according to evaluator) is stored at output_path - :param max_grad_norm: Used for gradient normalization. - :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0 - :param callback: Callback function that is invoked after each evaluation. 
- It must accept the following three parameters in this order: - `score`, `epoch`, `steps` - :param show_progress_bar: If True, output a tqdm progress bar - :param checkpoint_path: Folder to save checkpoints during training - :param checkpoint_save_steps: Will save a checkpoint after so many steps - :param checkpoint_save_total_limit: Total number of checkpoints to store - """ - - ##Add info to model card - # info_loss_functions = "\n".join(["- {} with {} training examples".format(str(loss), len(dataloader)) for dataloader, loss in train_objectives]) - info_loss_functions = [] - for dataloader, loss in train_objectives: - info_loss_functions.extend(ModelCardTemplate.get_train_objective_info(dataloader, loss)) - info_loss_functions = "\n\n".join([text for text in info_loss_functions]) - - info_fit_parameters = json.dumps( - { - "evaluator": fullname(evaluator), - "epochs": epochs, - "steps_per_epoch": steps_per_epoch, - "scheduler": scheduler, - "warmup_steps": warmup_steps, - "optimizer_class": str(optimizer_class), - "optimizer_params": optimizer_params, - "weight_decay": weight_decay, - "evaluation_steps": evaluation_steps, - "max_grad_norm": max_grad_norm, - }, - indent=4, - sort_keys=True, - ) - self._model_card_text = None - self._model_card_vars["{TRAINING_SECTION}"] = ModelCardTemplate.__TRAINING_SECTION__.replace( - "{LOSS_FUNCTIONS}", info_loss_functions - ).replace("{FIT_PARAMETERS}", info_fit_parameters) - - if use_amp: - if is_torch_npu_available(): - scaler = torch.npu.amp.GradScaler() - else: - scaler = torch.cuda.amp.GradScaler() - self.to(self.device) - - dataloaders = [dataloader for dataloader, _ in train_objectives] - - # Use smart batching - for dataloader in dataloaders: - dataloader.collate_fn = self.smart_batching_collate - - loss_models = [loss for _, loss in train_objectives] - for loss_model in loss_models: - loss_model.to(self.device) - - self.best_score = -9999999 - - if steps_per_epoch is None or steps_per_epoch == 0: - steps_per_epoch = min([len(dataloader) for dataloader in dataloaders]) - - num_train_steps = int(steps_per_epoch * epochs) - - # Prepare optimizers - optimizers = [] - schedulers = [] - for loss_model in loss_models: - param_optimizer = list(loss_model.named_parameters()) - - no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"] - optimizer_grouped_parameters = [ - { - "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], - "weight_decay": weight_decay, - }, - {"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, - ] - - optimizer = optimizer_class(optimizer_grouped_parameters, **optimizer_params) - scheduler_obj = self._get_scheduler( - optimizer, scheduler=scheduler, warmup_steps=warmup_steps, t_total=num_train_steps - ) - - optimizers.append(optimizer) - schedulers.append(scheduler_obj) - - global_step = 0 - data_iterators = [iter(dataloader) for dataloader in dataloaders] - - num_train_objectives = len(train_objectives) - - skip_scheduler = False - for epoch in trange(epochs, desc="Epoch", disable=not show_progress_bar): - training_steps = 0 - - for loss_model in loss_models: - loss_model.zero_grad() - loss_model.train() - - for _ in trange(steps_per_epoch, desc="Iteration", smoothing=0.05, disable=not show_progress_bar): - for train_idx in range(num_train_objectives): - loss_model = loss_models[train_idx] - optimizer = optimizers[train_idx] - scheduler = schedulers[train_idx] - data_iterator = data_iterators[train_idx] - - try: - data = next(data_iterator) - 
except StopIteration: - data_iterator = iter(dataloaders[train_idx]) - data_iterators[train_idx] = data_iterator - data = next(data_iterator) - - features, labels = data - labels = labels.to(self.device) - features = list(map(lambda batch: batch_to_device(batch, self.device), features)) - - if use_amp: - with torch.autocast(device_type=self.device.type): - loss_value = loss_model(features, labels) - - scale_before_step = scaler.get_scale() - scaler.scale(loss_value).backward() - scaler.unscale_(optimizer) - torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm) - scaler.step(optimizer) - scaler.update() - - skip_scheduler = scaler.get_scale() != scale_before_step - else: - loss_value = loss_model(features, labels) - loss_value.backward() - torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm) - optimizer.step() - - optimizer.zero_grad() - - if not skip_scheduler: - scheduler.step() - - training_steps += 1 - global_step += 1 - - if evaluation_steps > 0 and training_steps % evaluation_steps == 0: - self._eval_during_training( - evaluator, output_path, save_best_model, epoch, training_steps, callback - ) - - for loss_model in loss_models: - loss_model.zero_grad() - loss_model.train() - - if ( - checkpoint_path is not None - and checkpoint_save_steps is not None - and checkpoint_save_steps > 0 - and global_step % checkpoint_save_steps == 0 - ): - self._save_checkpoint(checkpoint_path, checkpoint_save_total_limit, global_step) - - self._eval_during_training(evaluator, output_path, save_best_model, epoch, -1, callback) - - if evaluator is None and output_path is not None: # No evaluator, but output path: save final model version - self.save(output_path) - - if checkpoint_path is not None: - self._save_checkpoint(checkpoint_path, checkpoint_save_total_limit, global_step) - def evaluate(self, evaluator: SentenceEvaluator, output_path: str = None): """ Evaluate the model @@ -1156,38 +909,6 @@ def evaluate(self, evaluator: SentenceEvaluator, output_path: str = None): os.makedirs(output_path, exist_ok=True) return evaluator(self, output_path) - def _eval_during_training(self, evaluator, output_path, save_best_model, epoch, steps, callback): - """Runs evaluation during the training""" - eval_path = output_path - if output_path is not None: - os.makedirs(output_path, exist_ok=True) - eval_path = os.path.join(output_path, "eval") - os.makedirs(eval_path, exist_ok=True) - - if evaluator is not None: - score = evaluator(self, output_path=eval_path, epoch=epoch, steps=steps) - if callback is not None: - callback(score, epoch, steps) - if score > self.best_score: - self.best_score = score - if save_best_model: - self.save(output_path) - - def _save_checkpoint(self, checkpoint_path, checkpoint_save_total_limit, step): - # Store new checkpoint - self.save(os.path.join(checkpoint_path, str(step))) - - # Delete old checkpoints - if checkpoint_save_total_limit is not None and checkpoint_save_total_limit > 0: - old_checkpoints = [] - for subdir in os.listdir(checkpoint_path): - if subdir.isdigit(): - old_checkpoints.append({"step": int(subdir), "path": os.path.join(checkpoint_path, subdir)}) - - if len(old_checkpoints) > checkpoint_save_total_limit: - old_checkpoints = sorted(old_checkpoints, key=lambda x: x["step"]) - shutil.rmtree(old_checkpoints[0]["path"]) - def _load_auto_model( self, model_name_or_path: str, @@ -1222,6 +943,7 @@ def _load_auto_model( }, ) pooling_model = Pooling(transformer_model.get_word_embedding_dimension(), "mean") + 
self.model_card_data.set_base_model(model_name_or_path, revision=revision) return [transformer_model, pooling_model] def _load_sbert_model( @@ -1353,37 +1075,19 @@ def _load_sbert_model( module = module_class.load(module_path) modules[module_config["name"]] = module + if revision is None: + path_parts = Path(modules_json_path) + if len(path_parts.parts) >= 2: + revision_path_part = Path(modules_json_path).parts[-2] + if len(revision_path_part) == 40: + revision = revision_path_part + self.model_card_data.set_base_model(model_name_or_path, revision=revision) return modules @staticmethod def load(input_path): return SentenceTransformer(input_path) - @staticmethod - def _get_scheduler(optimizer, scheduler: str, warmup_steps: int, t_total: int): - """ - Returns the correct learning rate scheduler. Available scheduler: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - """ - scheduler = scheduler.lower() - if scheduler == "constantlr": - return transformers.get_constant_schedule(optimizer) - elif scheduler == "warmupconstant": - return transformers.get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps) - elif scheduler == "warmuplinear": - return transformers.get_linear_schedule_with_warmup( - optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total - ) - elif scheduler == "warmupcosine": - return transformers.get_cosine_schedule_with_warmup( - optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total - ) - elif scheduler == "warmupcosinewithhardrestarts": - return transformers.get_cosine_with_hard_restarts_schedule_with_warmup( - optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total - ) - else: - raise ValueError("Unknown scheduler {}".format(scheduler)) - @property def device(self) -> device: """ @@ -1440,3 +1144,17 @@ def _target_device(self) -> torch.device: @_target_device.setter def _target_device(self, device: Optional[Union[int, str, torch.device]] = None) -> None: self.to(device) + + @property + def _no_split_modules(self) -> List[str]: + try: + return self._first_module()._no_split_modules + except AttributeError: + return [] + + @property + def _keys_to_ignore_on_save(self) -> List[str]: + try: + return self._first_module()._keys_to_ignore_on_save + except AttributeError: + return [] diff --git a/sentence_transformers/__init__.py b/sentence_transformers/__init__.py index f8b457b7b..7cd7a2230 100644 --- a/sentence_transformers/__init__.py +++ b/sentence_transformers/__init__.py @@ -1,12 +1,25 @@ __version__ = "2.8.0.dev0" __MODEL_HUB_ORGANIZATION__ = "sentence-transformers" + +import importlib +import os + from .datasets import SentencesDataset, ParallelSentencesDataset from .LoggingHandler import LoggingHandler from .SentenceTransformer import SentenceTransformer from .readers import InputExample from .cross_encoder.CrossEncoder import CrossEncoder +from .trainer import SentenceTransformerTrainer +from .training_args import SentenceTransformerTrainingArguments +from .model_card import SentenceTransformerModelCardData from .quantization import quantize_embeddings + +# If codecarbon is installed and the log level is not defined, +# automatically overwrite the default to "error" +if importlib.util.find_spec("codecarbon") and "CODECARBON_LOG_LEVEL" not in os.environ: + os.environ["CODECARBON_LOG_LEVEL"] = "error" + __all__ = [ "LoggingHandler", "SentencesDataset", @@ -14,5 +27,8 @@ "SentenceTransformer", "InputExample", "CrossEncoder", + "SentenceTransformerTrainer", + 
"SentenceTransformerTrainingArguments", + "SentenceTransformerModelCardData", "quantize_embeddings", ] diff --git a/sentence_transformers/data_collator.py b/sentence_transformers/data_collator.py new file mode 100644 index 000000000..bd4d5ff27 --- /dev/null +++ b/sentence_transformers/data_collator.py @@ -0,0 +1,39 @@ +from dataclasses import dataclass, field +from typing import Any, Callable, Dict, List + +import torch + + +@dataclass +class SentenceTransformerDataCollator: + """Collator for a SentenceTransformers model. + This encodes the text columns to {column}_input_ids and {column}_attention_mask columns. + This works with the two text dataset that is used as the example in the training overview: + https://www.sbert.net/docs/training/overview.html""" + + tokenize_fn: Callable + valid_label_columns: List[str] = field(default_factory=lambda: ["label", "score"]) + + def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]: + columns = list(features[0].keys()) + + # We should always be able to return a loss, label or not: + batch = {"return_loss": True} + + if "dataset_name" in columns: + columns.remove("dataset_name") + batch["dataset_name"] = features[0]["dataset_name"] + + # Extract the label column if it exists + for label_column in self.valid_label_columns: + if label_column in columns: + batch["label"] = torch.tensor([row[label_column] for row in features]) + columns.remove(label_column) + break + + # Extract the feature columns + for column in columns: + tokenized = self.tokenize_fn([row[column] for row in features]) + for key, value in tokenized.items(): + batch[f"{column}_{key}"] = value + return batch diff --git a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py index e8cdab23e..40709ae18 100644 --- a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py +++ b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py @@ -7,7 +7,7 @@ from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances from sklearn.metrics import average_precision_score import numpy as np -from typing import List, Optional +from typing import Dict, List, Optional from ..readers import InputExample @@ -52,6 +52,8 @@ def __init__( self.labels = labels self.truncate_dim = truncate_dim + self.primary_metric = "max_ap" + assert len(self.sentences1) == len(self.sentences2) assert len(self.sentences1) == len(self.labels) for label in labels: @@ -70,13 +72,13 @@ def __init__( self.csv_headers = [ "epoch", "steps", - "cossim_accuracy", - "cossim_accuracy_threshold", - "cossim_f1", - "cossim_precision", - "cossim_recall", - "cossim_f1_threshold", - "cossim_ap", + "cosine_accuracy", + "cosine_accuracy_threshold", + "cosine_f1", + "cosine_precision", + "cosine_recall", + "cosine_f1_threshold", + "cosine_ap", "manhattan_accuracy", "manhattan_accuracy_threshold", "manhattan_f1", @@ -99,6 +101,7 @@ def __init__( "dot_f1_threshold", "dot_ap", ] + self.primary_metric = "cosine_accuracy" @classmethod def from_input_examples(cls, examples: List[InputExample], **kwargs): @@ -112,7 +115,9 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): scores.append(example.label) return cls(sentences1, sentences2, scores, **kwargs) - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = 
-1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -127,9 +132,6 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i scores = self.compute_metrices(model) - # Main score is the max of Average Precision (AP) - main_score = max(scores[short_name]["ap"] for short_name in scores) - file_output_data = [epoch, steps] for header_name in self.csv_headers: @@ -149,7 +151,17 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer = csv.writer(f) writer.writerow(file_output_data) - return main_score + metrics = { + f"{short_name}_{metric}": value + for short_name, values in scores.items() + for metric, value in values.items() + } + metrics.update( + {f"max_{metric}": max(scores[short_name][metric] for short_name in scores) for metric in scores["cosine"]} + ) + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics def compute_metrices(self, model): with nullcontext() if self.truncate_dim is None else model.truncate_sentence_embeddings(self.truncate_dim): @@ -189,7 +201,7 @@ def compute_metrices(self, model): labels = np.asarray(self.labels) output_scores = {} for short_name, name, scores, reverse in [ - ["cossim", "Cosine-Similarity", cosine_scores, True], + ["cosine", "Cosine-Similarity", cosine_scores, True], ["manhattan", "Manhattan-Distance", manhattan_distances, False], ["euclidean", "Euclidean-Distance", euclidean_distances, False], ["dot", "Dot-Product", dot_scores, True], diff --git a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py index 531e0680a..0cb14500e 100644 --- a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py +++ b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py @@ -8,7 +8,7 @@ from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances from scipy.stats import pearsonr, spearmanr import numpy as np -from typing import List, Literal, Optional +from typing import Dict, List, Literal, Optional from ..readers import InputExample @@ -52,6 +52,7 @@ def __init__( :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. Defaults to None. 
""" + super().__init__() self.sentences1 = sentences1 self.sentences2 = sentences2 self.scores = scores @@ -103,7 +104,9 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): scores.append(example.label) return cls(sentences1, sentences2, scores, **kwargs) - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -200,15 +203,30 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i ] ) - if self.main_similarity == SimilarityFunction.COSINE: - return eval_spearman_cosine - elif self.main_similarity == SimilarityFunction.EUCLIDEAN: - return eval_spearman_euclidean - elif self.main_similarity == SimilarityFunction.MANHATTAN: - return eval_spearman_manhattan - elif self.main_similarity == SimilarityFunction.DOT_PRODUCT: - return eval_spearman_dot - elif self.main_similarity is None: - return max(eval_spearman_cosine, eval_spearman_manhattan, eval_spearman_euclidean, eval_spearman_dot) - else: - raise ValueError("Unknown main_similarity value") + self.primary_metric = { + SimilarityFunction.COSINE: "spearman_cosine", + SimilarityFunction.EUCLIDEAN: "spearman_euclidean", + SimilarityFunction.MANHATTAN: "spearman_manhattan", + SimilarityFunction.DOT_PRODUCT: "spearman_dot", + }.get(self.main_similarity, "spearman_max") + metrics = { + "pearson_cosine": eval_pearson_cosine, + "spearman_cosine": eval_spearman_cosine, + "pearson_manhattan": eval_pearson_manhattan, + "spearman_manhattan": eval_spearman_manhattan, + "pearson_euclidean": eval_pearson_euclidean, + "spearman_euclidean": eval_spearman_euclidean, + "pearson_dot": eval_pearson_dot, + "spearman_dot": eval_spearman_dot, + "pearson_max": max(eval_pearson_cosine, eval_pearson_manhattan, eval_pearson_euclidean, eval_pearson_dot), + "spearman_max": max( + eval_spearman_cosine, eval_spearman_manhattan, eval_spearman_euclidean, eval_spearman_dot + ), + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics + + @property + def description(self) -> str: + return "Semantic Similarity" diff --git a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py index 9b0466420..54338e6f5 100644 --- a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py +++ b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py @@ -40,11 +40,12 @@ def __init__( write_csv: bool = True, truncate_dim: Optional[int] = None, score_functions: Dict[str, Callable[[Tensor, Tensor], Tensor]] = { - "cos_sim": cos_sim, - "dot_score": dot_score, + "cosine": cos_sim, + "dot": dot_score, }, # Score function, higher=more similar main_score_function: str = None, ): + super().__init__() self.queries_ids = [] for qid in queries: if qid in relevant_docs and len(relevant_docs[qid]) > 0: @@ -97,7 +98,7 @@ def __init__( def __call__( self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1, *args, **kwargs - ) -> float: + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -146,9 +147,23 @@ def __call__( fOut.close() if self.main_score_function is None: - return max([scores[name]["map@k"][max(self.map_at_k)] for name in 
self.score_function_names]) + score_function = max( + [(name, scores[name]["map@k"][max(self.map_at_k)]) for name in self.score_function_names], + key=lambda x: x[1], + )[0] + self.primary_metric = f"{score_function}_map@{max(self.map_at_k)}" else: - return scores[self.main_score_function]["map@k"][max(self.map_at_k)] + self.primary_metric = f"{self.main_score_function}_map@{max(self.map_at_k)}" + + metrics = { + f"{score_function}_{metric_name.replace('@k', '@' + str(k))}": value + for score_function, values_dict in scores.items() + for metric_name, values in values_dict.items() + for k, value in values.items() + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics def compute_metrices( self, model: SentenceTransformer, corpus_model=None, corpus_embeddings: Tensor = None diff --git a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py index e94e0adfe..05ebfe253 100644 --- a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py +++ b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py @@ -1,3 +1,4 @@ +from typing import Dict from sentence_transformers import SentenceTransformer from . import SentenceEvaluator import torch @@ -27,6 +28,7 @@ def __init__(self, dataloader: DataLoader, name: str = "", softmax_model=None, w :param dataloader: the data for the evaluation """ + super().__init__() self.dataloader = dataloader self.name = name self.softmax_model = softmax_model @@ -37,8 +39,11 @@ def __init__(self, dataloader: DataLoader, name: str = "", softmax_model=None, w self.write_csv = write_csv self.csv_file = "accuracy_evaluation" + name + "_results.csv" self.csv_headers = ["epoch", "steps", "accuracy"] + self.primary_metric = "accuracy" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: model.eval() total = 0 correct = 0 @@ -79,4 +84,7 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer = csv.writer(f) writer.writerow([epoch, steps, accuracy]) - return accuracy + metrics = {"accuracy": accuracy} + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics diff --git a/sentence_transformers/evaluation/MSEEvaluator.py b/sentence_transformers/evaluation/MSEEvaluator.py index ecb92be09..80ae29899 100644 --- a/sentence_transformers/evaluation/MSEEvaluator.py +++ b/sentence_transformers/evaluation/MSEEvaluator.py @@ -4,7 +4,7 @@ import logging import os import csv -from typing import List, Optional +from typing import Dict, List, Optional logger = logging.getLogger(__name__) @@ -41,6 +41,7 @@ def __init__( write_csv: bool = True, truncate_dim: Optional[int] = None, ): + super().__init__() self.truncate_dim = truncate_dim with nullcontext() if self.truncate_dim is None else teacher_model.truncate_sentence_embeddings( self.truncate_dim @@ -57,8 +58,9 @@ def __init__( self.csv_file = "mse_evaluation_" + name + "_results.csv" self.csv_headers = ["epoch", "steps", "MSE"] self.write_csv = write_csv + self.primary_metric = "negative_mse" - def __call__(self, model: SentenceTransformer, output_path, epoch=-1, steps=-1): + def __call__(self, model: SentenceTransformer, output_path, epoch=-1, steps=-1) -> Dict[str, float]: if epoch != -1: 
if steps == -1: out_txt = f" after epoch {epoch}" @@ -93,4 +95,12 @@ def __call__(self, model: SentenceTransformer, output_path, epoch=-1, steps=-1): writer.writerow([epoch, steps, mse]) - return -mse # Return negative score as SentenceTransformers maximizes the performance + # Return negative score as SentenceTransformers maximizes the performance + metrics = {"negative_mse": -mse} + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics + + @property + def description(self) -> str: + return "Knowledge Distillation" diff --git a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py index fd66f8942..bb027614e 100644 --- a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py +++ b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py @@ -42,6 +42,7 @@ def __init__( write_csv: bool = True, truncate_dim: Optional[int] = None, ): + super().__init__() self.combinations = combinations self.name = name self.batch_size = batch_size @@ -51,6 +52,7 @@ def __init__( self.csv_file = "mse_evaluation" + name + "_results.csv" self.csv_headers = ["epoch", "steps"] + self.primary_metric = "negative_mse" self.write_csv = write_csv self.truncate_dim = truncate_dim self.data = {} @@ -77,7 +79,9 @@ def __init__( all_src_embeddings = teacher_model.encode(all_source_sentences, batch_size=self.batch_size) self.teacher_embeddings = {sent: emb for sent, emb in zip(all_source_sentences, all_src_embeddings)} - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1): + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: model.eval() mse_scores = [] @@ -105,4 +109,12 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer.writerow([epoch, steps] + mse_scores) - return -np.mean(mse_scores) # Return negative score as SentenceTransformers maximizes the performance + # Return negative score as SentenceTransformers maximizes the performance + metrics = {"negative_mse": -np.mean(mse_scores).item()} + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics + + @property + def description(self) -> str: + return "Knowledge Distillation" diff --git a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py index de9fe7059..264cf7782 100644 --- a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py +++ b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py @@ -54,6 +54,7 @@ def __init__( dimension. Defaults to None. 
""" + super().__init__() self.sentences = [] self.ids = [] @@ -99,8 +100,11 @@ def __init__( self.csv_file: str = "paraphrase_mining_evaluation" + name + "_results.csv" self.csv_headers = ["epoch", "steps", "precision", "recall", "f1", "threshold", "average_precision"] self.write_csv = write_csv + self.primary_metric = "average_precision" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -174,7 +178,16 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer = csv.writer(f) writer.writerow([epoch, steps, best_precision, best_recall, best_f1, threshold, average_precision]) - return average_precision + metrics = { + "average_precision": average_precision, + "f1": best_f1, + "precision": best_precision, + "recall": best_recall, + "threshold": threshold, + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics @staticmethod def add_transitive_closure(graph): diff --git a/sentence_transformers/evaluation/RerankingEvaluator.py b/sentence_transformers/evaluation/RerankingEvaluator.py index 4cba7d15d..1216bff79 100644 --- a/sentence_transformers/evaluation/RerankingEvaluator.py +++ b/sentence_transformers/evaluation/RerankingEvaluator.py @@ -9,7 +9,7 @@ import torch from sklearn.metrics import average_precision_score, ndcg_score import tqdm -from typing import Callable, Optional +from typing import Callable, Dict, Optional logger = logging.getLogger(__name__) @@ -50,6 +50,7 @@ def __init__( truncate_dim: Optional[int] = None, mrr_at_k: Optional[int] = None, ): + super().__init__() self.samples = samples self.name = name @@ -82,8 +83,11 @@ def __init__( "NDCG@{}".format(self.at_k), ] self.write_csv = write_csv + self.primary_metric = "map" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -131,7 +135,14 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer.writerow([epoch, steps, mean_ap, mean_mrr, mean_ndcg]) - return mean_ap + metrics = { + "map": mean_ap, + f"mrr@{self.at_k}": mean_mrr, + f"ndcg@{self.at_k}": mean_ndcg, + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics def compute_metrices(self, model): return ( diff --git a/sentence_transformers/evaluation/SentenceEvaluator.py b/sentence_transformers/evaluation/SentenceEvaluator.py index 7e7116689..d8f4f5232 100644 --- a/sentence_transformers/evaluation/SentenceEvaluator.py +++ b/sentence_transformers/evaluation/SentenceEvaluator.py @@ -1,3 +1,6 @@ +import re +from typing import Any, Dict, Union + from sentence_transformers import SentenceTransformer @@ -8,7 +11,13 @@ class SentenceEvaluator: Extend this class and implement __call__ for custom evaluators. 
""" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __init__(self): + self.greater_is_better = True + # TODO: Add better `primary_metrics` support + + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Union[float, Dict[str, float]]: """ This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result. @@ -25,6 +34,36 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i the steps in the current epoch at time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch. - :return: a score for the evaluation with a higher score indicating a better result + :return: Either a score for the evaluation with a higher score indicating a better result, + or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` + must be defined """ pass + + def prefix_name_to_metrics(self, metrics: Dict[str, float], name: str): + if not name: + return metrics + metrics = {name + "_" + key: value for key, value in metrics.items()} + if hasattr(self, "primary_metric") and not self.primary_metric.startswith(name + "_"): + self.primary_metric = name + "_" + self.primary_metric + return metrics + + def store_metrics_in_model_card_data(self, model: "SentenceTransformer", metrics: Dict[str, Any]) -> None: + model.model_card_data.set_evaluation_metrics(self, metrics) + + @property + def description(self) -> str: + """ + Returns a human-readable description of the evaluator: BinaryClassificationEvaluator -> Binary Classification + + 1. Remove "Evaluator" from the class name + 2. Add a space before every capital letter + """ + class_name = self.__class__.__name__ + try: + index = class_name.index("Evaluator") + class_name = class_name[:index] + except IndexError: + pass + + return re.sub(r"([a-z])([A-Z])", "\g<1> \g<2>", class_name) diff --git a/sentence_transformers/evaluation/SequentialEvaluator.py b/sentence_transformers/evaluation/SequentialEvaluator.py index 6808530cc..2e2fb5ced 100644 --- a/sentence_transformers/evaluation/SequentialEvaluator.py +++ b/sentence_transformers/evaluation/SequentialEvaluator.py @@ -1,6 +1,6 @@ from sentence_transformers import SentenceTransformer from . 
diff --git a/sentence_transformers/evaluation/SequentialEvaluator.py b/sentence_transformers/evaluation/SequentialEvaluator.py
index 6808530cc..2e2fb5ced 100644
--- a/sentence_transformers/evaluation/SequentialEvaluator.py
+++ b/sentence_transformers/evaluation/SequentialEvaluator.py
@@ -1,6 +1,6 @@
 from sentence_transformers import SentenceTransformer
 from . import SentenceEvaluator
-from typing import Iterable
+from typing import Dict, Iterable


 class SequentialEvaluator(SentenceEvaluator):
@@ -12,12 +12,31 @@ class SequentialEvaluator(SentenceEvaluator):
     """

     def __init__(self, evaluators: Iterable[SentenceEvaluator], main_score_function=lambda scores: scores[-1]):
+        super().__init__()
         self.evaluators = evaluators
         self.main_score_function = main_score_function

-    def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float:
+    def __call__(
+        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+    ) -> Dict[str, float]:
+        evaluations = []
         scores = []
-        for evaluator in self.evaluators:
-            scores.append(evaluator(model, output_path, epoch, steps))
+        for evaluator_idx, evaluator in enumerate(self.evaluators):
+            evaluation = evaluator(model, output_path, epoch, steps)

-        return self.main_score_function(scores)
+            if not isinstance(evaluation, dict):
+                scores.append(evaluation)
+                evaluation = {f"evaluator_{evaluator_idx}": evaluation}
+            else:
+                # Check the evaluator (not the returned dict) for a `primary_metric`
+                if hasattr(evaluator, "primary_metric"):
+                    scores.append(evaluation[evaluator.primary_metric])
+                else:
+                    scores.append(evaluation[list(evaluation.keys())[0]])
+
+            evaluations.append(evaluation)
+
+        self.primary_metric = "sequential_score"
+        main_score = self.main_score_function(scores)
+        results = {key: value for evaluation in evaluations for key, value in evaluation.items()}
+        results["sequential_score"] = main_score
+        return results
diff --git a/sentence_transformers/evaluation/TranslationEvaluator.py b/sentence_transformers/evaluation/TranslationEvaluator.py
index acc5a887d..199d4cad1 100644
--- a/sentence_transformers/evaluation/TranslationEvaluator.py
+++ b/sentence_transformers/evaluation/TranslationEvaluator.py
@@ -6,7 +6,7 @@ import os
 import csv
 import numpy as np
-from typing import List, Optional
+from typing import Dict, List, Optional

 import torch

@@ -54,6 +54,7 @@ def __init__(
             The dimension to truncate sentence embeddings to. `None` uses the model's current
             truncation dimension. Defaults to None.
""" + super().__init__() self.source_sentences = source_sentences self.target_sentences = target_sentences self.name = name @@ -70,8 +71,11 @@ def __init__( self.csv_file = "translation_evaluation" + name + "_results.csv" self.csv_headers = ["epoch", "steps", "src2trg", "trg2src"] self.write_csv = write_csv + self.primary_metric = "mean_accuracy" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -145,4 +149,11 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer.writerow([epoch, steps, acc_src2trg, acc_trg2src]) - return (acc_src2trg + acc_trg2src) / 2 + metrics = { + "src2trg_accuracy": acc_src2trg, + "trg2src_accuracy": acc_trg2src, + "mean_accuracy": (acc_src2trg + acc_trg2src) / 2, + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics diff --git a/sentence_transformers/evaluation/TripletEvaluator.py b/sentence_transformers/evaluation/TripletEvaluator.py index 17e76790e..da7719f97 100644 --- a/sentence_transformers/evaluation/TripletEvaluator.py +++ b/sentence_transformers/evaluation/TripletEvaluator.py @@ -5,7 +5,7 @@ import os import csv from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances -from typing import List, Optional +from typing import Dict, List, Optional from ..readers import InputExample @@ -42,6 +42,7 @@ def __init__( :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. Defaults to None. 
""" + super().__init__() self.anchors = anchors self.positives = positives self.negatives = negatives @@ -76,7 +77,9 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): negatives.append(example.texts[2]) return cls(anchors, positives, negatives, **kwargs) - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -157,11 +160,17 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer = csv.writer(f) writer.writerow([epoch, steps, accuracy_cos, accuracy_manhattan, accuracy_euclidean]) - if self.main_distance_function == SimilarityFunction.COSINE: - return accuracy_cos - if self.main_distance_function == SimilarityFunction.MANHATTAN: - return accuracy_manhattan - if self.main_distance_function == SimilarityFunction.EUCLIDEAN: - return accuracy_euclidean - - return max(accuracy_cos, accuracy_manhattan, accuracy_euclidean) + self.primary_metric = { + SimilarityFunction.COSINE: "accuracy_cosine", + SimilarityFunction.EUCLIDEAN: "accuracy_euclidean", + SimilarityFunction.MANHATTAN: "accuracy_manhattan", + }.get(self.main_distance_function, "accuracy_max") + metrics = { + "accuracy_cosine": accuracy_cos, + "accuracy_manhattan": accuracy_manhattan, + "accuracy_euclidean": accuracy_euclidean, + "accuracy_max": max(accuracy_cos, accuracy_manhattan, accuracy_euclidean), + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py new file mode 100644 index 000000000..66e1b9130 --- /dev/null +++ b/sentence_transformers/fit_mixin.py @@ -0,0 +1,619 @@ +import json +import logging +import os +from pathlib import Path +import shutil +from typing import Any, List, Dict, Tuple, Iterable, Type, Callable, Optional, TYPE_CHECKING +import transformers +import torch +from torch import nn, Tensor +from torch.optim import Optimizer +from torch.utils.data import DataLoader +from tqdm.autonotebook import trange +from datasets import Dataset, DatasetDict +from transformers import TrainerCallback, TrainerState, TrainerControl +from sentence_transformers.datasets.SentenceLabelDataset import SentenceLabelDataset +from sentence_transformers.datasets.NoDuplicatesDataLoader import NoDuplicatesDataLoader +from sentence_transformers.training_args import ( + SentenceTransformerTrainingArguments, + MultiDatasetBatchSamplers, + BatchSamplers, +) + +from .evaluation import SentenceEvaluator +from .util import ( + batch_to_device, + fullname, +) +from .model_card_templates import ModelCardTemplate + +logger = logging.getLogger(__name__) + +if TYPE_CHECKING: + from sentence_transformers.SentenceTransformer import SentenceTransformer + from sentence_transformers.readers.InputExample import InputExample + + +class SaveModelCallback(TrainerCallback): + """A Callback to save the model to the `output_dir`. + + There are two cases: + 1. save_best_model is True and evaluator is defined: + We save on evaluate, but only if the new model is better than the currently saved one + according to the evaluator. + 2. If evaluator is not defined: + We save after the model has been trained. 
+    """
+
+    def __init__(self, output_dir: str, evaluator: Optional[SentenceEvaluator], save_best_model: bool) -> None:
+        super().__init__()
+        self.output_dir = output_dir
+        self.evaluator = evaluator
+        # TODO: ^ has to implement `greater_is_better` and `primary_metric`
+        self.save_best_model = save_best_model
+        self.best_metric = None
+
+    def is_better(self, new_metric: float) -> bool:
+        if getattr(self.evaluator, "greater_is_better", True):
+            return new_metric > self.best_metric
+        return new_metric < self.best_metric
+
+    def on_evaluate(
+        self,
+        args: SentenceTransformerTrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        metrics: Dict[str, Any],
+        model: "SentenceTransformer",
+        **kwargs,
+    ):
+        if self.evaluator is not None and self.save_best_model:
+            metric_key = getattr(self.evaluator, "primary_metric", "evaluator")
+            for key, value in metrics.items():
+                if key.endswith(metric_key):
+                    if self.best_metric is None or self.is_better(value):
+                        self.best_metric = value
+                        model.save(self.output_dir)
+
+    def on_train_end(
+        self,
+        args: SentenceTransformerTrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        model: "SentenceTransformer",
+        **kwargs,
+    ):
+        if self.evaluator is None:
+            model.save(self.output_dir)
+
+
+class EvaluatorCallback(TrainerCallback):
+    """The SentenceTransformer.fit method always ran the evaluator at the end of every epoch,
+    in addition to after every "evaluation_steps" training steps. This callback preserves that behavior.
+
+    The `.trainer` must be provided after the trainer has been created.
+    """
+
+    def __init__(self, evaluator: SentenceEvaluator) -> None:
+        super().__init__()
+        self.evaluator = evaluator
+        self.metric_key_prefix = "eval"
+        self.trainer = None
+
+    def on_epoch_end(
+        self,
+        args: SentenceTransformerTrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        model: "SentenceTransformer",
+        **kwargs,
+    ):
+        evaluator_metrics = self.evaluator(model, epoch=state.epoch)
+        if not isinstance(evaluator_metrics, dict):
+            evaluator_metrics = {"evaluator": evaluator_metrics}
+
+        # Prefix all keys with metric_key_prefix + '_'
+        for key in list(evaluator_metrics.keys()):
+            if not key.startswith(f"{self.metric_key_prefix}_"):
+                evaluator_metrics[f"{self.metric_key_prefix}_{key}"] = evaluator_metrics.pop(key)
+
+        if self.trainer is not None:
+            self.trainer.callback_handler.on_evaluate(args, state, control, metrics=evaluator_metrics)
+
+
+class OriginalCallback(TrainerCallback):
+    """A Callback to invoke the original callback function that was provided to SentenceTransformer.fit()
+
+    The provided callback function must have the following signature: `(score: float, epoch: int, steps: int) -> None`
+    """
+
+    def __init__(self, callback: Callable[[float, int, int], None], evaluator: SentenceEvaluator) -> None:
+        super().__init__()
+        self.callback = callback
+        self.evaluator = evaluator
+
+    def on_evaluate(
+        self,
+        args: transformers.TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        metrics: Dict[str, Any],
+        **kwargs,
+    ):
+        metric_key = getattr(self.evaluator, "primary_metric", "evaluator")
+        for key, value in metrics.items():
+            if key.endswith(metric_key):
+                return self.callback(value, state.epoch, state.global_step)
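+
+
+# A minimal illustration of the legacy callback contract that OriginalCallback
+# adapts; `my_callback`, `loader`, `loss` and `dev_evaluator` are hypothetical:
+#
+#     def my_callback(score: float, epoch: int, steps: int) -> None:
+#         print(f"epoch={epoch} steps={steps} score={score:.4f}")
+#
+#     model.fit(train_objectives=[(loader, loss)], evaluator=dev_evaluator, callback=my_callback)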
+
+
+class FitMixin:
+    """Mixin class for injecting the `fit` method into Sentence Transformers"""
+
+    def fit(
+        self,
+        train_objectives: Iterable[Tuple[DataLoader, nn.Module]],
+        evaluator: SentenceEvaluator = None,
+        epochs: int = 1,
+        steps_per_epoch=None,
+        scheduler: str = "WarmupLinear",
+        warmup_steps: int = 10000,
+        optimizer_class: Type[Optimizer] = torch.optim.AdamW,
+        optimizer_params: Dict[str, object] = {"lr": 2e-5},
+        weight_decay: float = 0.01,
+        evaluation_steps: int = 0,
+        output_path: str = None,
+        save_best_model: bool = True,
+        max_grad_norm: float = 1,
+        use_amp: bool = False,
+        callback: Callable[[float, int, int], None] = None,
+        show_progress_bar: bool = True,
+        checkpoint_path: str = None,
+        checkpoint_save_steps: int = 500,
+        checkpoint_save_total_limit: int = 0,
+    ):
+        """
+        Train the model with the given training objectives.
+        Each training objective is sampled in turn for one batch.
+        We sample only as many batches from each objective as there are in the smallest one
+        to make sure of equal training with each dataset.
+
+        :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning
+        :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disk.
+        :param epochs: Number of epochs for training
+        :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal to the DataLoader size from train_objectives.
+        :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
+        :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from 0 up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero.
+        :param optimizer_class: Optimizer
+        :param optimizer_params: Optimizer parameters
+        :param weight_decay: Weight decay for model parameters
+        :param evaluation_steps: If > 0, evaluate the model using the evaluator after every `evaluation_steps` training steps
+        :param output_path: Storage path for the model and evaluation files
+        :param save_best_model: If true, the best model (according to evaluator) is stored at output_path
+        :param max_grad_norm: Used for gradient clipping.
+        :param use_amp: Use Automatic Mixed Precision (AMP). Only for PyTorch >= 1.6.0
+        :param callback: Callback function that is invoked after each evaluation.
+            It must accept the following three parameters in this order:
+            `score`, `epoch`, `steps`
+        :param show_progress_bar: If True, output a tqdm progress bar
+        :param checkpoint_path: Folder to save checkpoints during training
+        :param checkpoint_save_steps: Will save a checkpoint after so many steps
+        :param checkpoint_save_total_limit: Total number of checkpoints to store
+        """
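+        # The legacy arguments below are mapped onto the new Trainer API: the
+        # dataloaders become a DatasetDict, the losses a per-dataset loss dict,
+        # and the remaining arguments are translated into
+        # SentenceTransformerTrainingArguments before a SentenceTransformerTrainer
+        # runs the actual training loop.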
+        # Delayed import to avoid the SentenceTransformers -> FitMixin -> SentenceTransformerTrainer -> SentenceTransformers circular import
+        from sentence_transformers.trainer import SentenceTransformerTrainer
+
+        data_loaders, loss_fns = zip(*train_objectives)
+
+        # Reset the dataloaders' collate functions to the identity, as we just want the raw InputExamples
+        def identity(batch):
+            return batch
+
+        for data_loader in data_loaders:
+            data_loader.collate_fn = identity
+
+        batch_size = 8
+        batch_sampler = BatchSamplers.BATCH_SAMPLER
+        # Convert the dataloaders into a DatasetDict
+        # TODO: This should be done in a more efficient way
+        train_dataset_dict = {}
+        for loader_idx, data_loader in enumerate(data_loaders, start=1):
+            if isinstance(data_loader, NoDuplicatesDataLoader):
+                batch_sampler = BatchSamplers.NO_DUPLICATES
+            elif hasattr(data_loader, "dataset") and isinstance(data_loader.dataset, SentenceLabelDataset):
+                batch_sampler = BatchSamplers.GROUP_BY_LABEL
+
+            batch_size = getattr(data_loader, "batch_size", batch_size)
+            texts = []
+            labels = []
+            for batch in data_loader:
+                batch_texts, batch_labels = zip(*[(example.texts, example.label) for example in batch])
+                texts += batch_texts
+                labels += batch_labels
+            dataset = Dataset.from_dict({f"sentence_{idx}": text for idx, text in enumerate(zip(*texts))})
+            # Add a label column, unless all labels are 0 (the default value for `label` in InputExample)
+            add_label_column = True
+            try:
+                if set(labels) == {0}:
+                    add_label_column = False
+            except TypeError:
+                pass
+            if add_label_column:
+                dataset = dataset.add_column("label", labels)
+            train_dataset_dict[f"_dataset_{loader_idx}"] = dataset
+
+        train_dataset_dict = DatasetDict(train_dataset_dict)
+
+        def _default_checkpoint_dir() -> str:
+            dir_name = "checkpoints/model"
+            idx = 1
+            while Path(dir_name).exists() and len(list(Path(dir_name).iterdir())) != 0:
+                dir_name = f"checkpoints/model_{idx}"
+                idx += 1
+            return dir_name
+
+        # Convert loss_fns into a dict with `_dataset_{idx}` keys
+        loss_fn_dict = {f"_dataset_{idx}": loss_fn for idx, loss_fn in enumerate(loss_fns, start=1)}
+        # TODO: Test model checkpointing & loading
+
+        # Use steps_per_epoch to set max_steps if possible
+        max_steps = -1
+        if steps_per_epoch is not None and steps_per_epoch > 0:
+            if epochs == 1:
+                max_steps = steps_per_epoch
+            else:
+                logger.warning(
+                    "Setting `steps_per_epoch` alongside `epochs` > 1 no longer works. "
+                    "We will train with the full datasets per epoch."
+ ) + steps_per_epoch = None + + args = SentenceTransformerTrainingArguments( + output_dir=checkpoint_path or _default_checkpoint_dir(), + batch_sampler=batch_sampler, + multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN, + per_device_train_batch_size=batch_size, + per_device_eval_batch_size=batch_size, + num_train_epochs=epochs, + max_steps=max_steps, + evaluation_strategy="steps" if evaluation_steps is not None and evaluation_steps > 0 else "no", + eval_steps=evaluation_steps, + # load_best_model_at_end=save_best_model, # <- TODO: Look into a good solution for save_best_model + max_grad_norm=max_grad_norm, + fp16=use_amp, + disable_tqdm=not show_progress_bar, + save_strategy="steps" if checkpoint_path is not None else "no", + save_steps=checkpoint_save_steps, + save_total_limit=checkpoint_save_total_limit, + ) + + if steps_per_epoch is None or steps_per_epoch == 0: + steps_per_epoch = min([len(train_dataset) // batch_size for train_dataset in train_dataset_dict.values()]) + num_train_steps = int(steps_per_epoch * epochs) + + # Prepare optimizer & scheduler + param_optimizer = list(self.named_parameters()) + + no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], + "weight_decay": weight_decay, + }, + {"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, + ] + + optimizer = optimizer_class(optimizer_grouped_parameters, **optimizer_params) + scheduler_obj = self._get_scheduler( + optimizer, scheduler=scheduler, warmup_steps=warmup_steps, t_total=num_train_steps + ) + + # Create callbacks + callbacks = [] + if evaluator is not None: + callbacks.append(EvaluatorCallback(evaluator)) + if callback is not None: + callbacks.append(OriginalCallback(callback, evaluator)) + + trainer = SentenceTransformerTrainer( + model=self, + args=args, + train_dataset=train_dataset_dict, + eval_dataset=None, + loss=loss_fn_dict, + evaluator=evaluator, + optimizers=(optimizer, scheduler_obj), + callbacks=callbacks, + ) + # Set the trainer on the EvaluatorCallback, required for logging the metrics + for callback in trainer.callback_handler.callbacks: + if isinstance(callback, EvaluatorCallback): + callback.trainer = trainer + + if output_path is not None: + trainer.add_callback(SaveModelCallback(output_path, evaluator, save_best_model)) + + trainer.train() + + @staticmethod + def _get_scheduler(optimizer, scheduler: str, warmup_steps: int, t_total: int): + """ + Returns the correct learning rate scheduler. 
Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
+        """
+        scheduler = scheduler.lower()
+        if scheduler == "constantlr":
+            return transformers.get_constant_schedule(optimizer)
+        elif scheduler == "warmupconstant":
+            return transformers.get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)
+        elif scheduler == "warmuplinear":
+            return transformers.get_linear_schedule_with_warmup(
+                optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
+            )
+        elif scheduler == "warmupcosine":
+            return transformers.get_cosine_schedule_with_warmup(
+                optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
+            )
+        elif scheduler == "warmupcosinewithhardrestarts":
+            return transformers.get_cosine_with_hard_restarts_schedule_with_warmup(
+                optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
+            )
+        else:
+            raise ValueError("Unknown scheduler {}".format(scheduler))
+
+    def smart_batching_collate(self, batch: List["InputExample"]) -> Tuple[List[Dict[str, Tensor]], Tensor]:
+        """
+        Transforms a batch from a SmartBatchingDataset to a batch of tensors for the model
+        Here, batch is a list of InputExample instances: [InputExample(...), ...]
+
+        :param batch:
+            a batch from a SmartBatchingDataset
+        :return:
+            a batch of tensors for the model
+        """
+        texts = [example.texts for example in batch]
+        sentence_features = [self.tokenize(sentence) for sentence in zip(*texts)]
+        labels = torch.tensor([example.label for example in batch])
+        return sentence_features, labels
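+
+    # For illustration: given toy InputExamples, the collate function above
+    # yields one tokenized feature dict per text column plus a label tensor:
+    #
+    #     batch = [InputExample(texts=["a cat sits", "a feline rests"], label=1.0),
+    #              InputExample(texts=["it rains", "the sun shines"], label=0.0)]
+    #     features, labels = model.smart_batching_collate(batch)
+    #     # features[0] tokenizes ("a cat sits", "it rains"); labels == tensor([1., 0.])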
+
+    """
+    Temporary methods that will be removed when this refactor is complete:
+    """
+
+    def old_fit(
+        self,
+        train_objectives: Iterable[Tuple[DataLoader, nn.Module]],
+        evaluator: SentenceEvaluator = None,
+        epochs: int = 1,
+        steps_per_epoch=None,
+        scheduler: str = "WarmupLinear",
+        warmup_steps: int = 10000,
+        optimizer_class: Type[Optimizer] = torch.optim.AdamW,
+        optimizer_params: Dict[str, object] = {"lr": 2e-5},
+        weight_decay: float = 0.01,
+        evaluation_steps: int = 0,
+        output_path: str = None,
+        save_best_model: bool = True,
+        max_grad_norm: float = 1,
+        use_amp: bool = False,
+        callback: Callable[[float, int, int], None] = None,
+        show_progress_bar: bool = True,
+        checkpoint_path: str = None,
+        checkpoint_save_steps: int = 500,
+        checkpoint_save_total_limit: int = 0,
+    ):
+        """
+        Train the model with the given training objectives.
+        Each training objective is sampled in turn for one batch.
+        We sample only as many batches from each objective as there are in the smallest one
+        to make sure of equal training with each dataset.
+
+        :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning
+        :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disk.
+        :param epochs: Number of epochs for training
+        :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal to the DataLoader size from train_objectives.
+        :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
+        :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from 0 up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero.
+        :param optimizer_class: Optimizer
+        :param optimizer_params: Optimizer parameters
+        :param weight_decay: Weight decay for model parameters
+        :param evaluation_steps: If > 0, evaluate the model using the evaluator after every `evaluation_steps` training steps
+        :param output_path: Storage path for the model and evaluation files
+        :param save_best_model: If true, the best model (according to evaluator) is stored at output_path
+        :param max_grad_norm: Used for gradient clipping.
+        :param use_amp: Use Automatic Mixed Precision (AMP). Only for PyTorch >= 1.6.0
+        :param callback: Callback function that is invoked after each evaluation.
+            It must accept the following three parameters in this order:
+            `score`, `epoch`, `steps`
+        :param show_progress_bar: If True, output a tqdm progress bar
+        :param checkpoint_path: Folder to save checkpoints during training
+        :param checkpoint_save_steps: Will save a checkpoint after so many steps
+        :param checkpoint_save_total_limit: Total number of checkpoints to store
+        """
+
+        ## Add info to model card
+        # info_loss_functions = "\n".join(["- {} with {} training examples".format(str(loss), len(dataloader)) for dataloader, loss in train_objectives])
+        info_loss_functions = []
+        for dataloader, loss in train_objectives:
+            info_loss_functions.extend(ModelCardTemplate.get_train_objective_info(dataloader, loss))
+        info_loss_functions = "\n\n".join([text for text in info_loss_functions])
+
+        info_fit_parameters = json.dumps(
+            {
+                "evaluator": fullname(evaluator),
+                "epochs": epochs,
+                "steps_per_epoch": steps_per_epoch,
+                "scheduler": scheduler,
+                "warmup_steps": warmup_steps,
+                "optimizer_class": str(optimizer_class),
+                "optimizer_params": optimizer_params,
+                "weight_decay": weight_decay,
+                "evaluation_steps": evaluation_steps,
+                "max_grad_norm": max_grad_norm,
+            },
+            indent=4,
+            sort_keys=True,
+        )
+        self._model_card_text = None
+        self._model_card_vars["{TRAINING_SECTION}"] = ModelCardTemplate.__TRAINING_SECTION__.replace(
+            "{LOSS_FUNCTIONS}", info_loss_functions
+        ).replace("{FIT_PARAMETERS}", info_fit_parameters)
+
+        if use_amp:
+            from torch.cuda.amp import autocast
+
+            scaler = torch.cuda.amp.GradScaler()
+
+        self.to(self.device)
+
+        dataloaders = [dataloader for dataloader, _ in train_objectives]
+
+        # Use smart batching
+        for dataloader in dataloaders:
+            dataloader.collate_fn = self.smart_batching_collate
+
+        loss_models = [loss for _, loss in train_objectives]
+        for loss_model in loss_models:
+            loss_model.to(self.device)
+
+        self.best_score = -9999999
+
+        if steps_per_epoch is None or steps_per_epoch == 0:
+            steps_per_epoch = min([len(dataloader) for dataloader in dataloaders])
+
+        num_train_steps = int(steps_per_epoch * epochs)
+
+        # Prepare optimizers
+        optimizers = []
+        schedulers = []
+        for loss_model in loss_models:
+            param_optimizer = list(loss_model.named_parameters())
+
+            no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
+            optimizer_grouped_parameters = [
+                {
+                    "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
+                    "weight_decay": weight_decay,
+                },
+                {"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
+            ]
+
+            optimizer = optimizer_class(optimizer_grouped_parameters, **optimizer_params)
+            scheduler_obj = self._get_scheduler(
+                optimizer, scheduler=scheduler, warmup_steps=warmup_steps, t_total=num_train_steps
+            )
+
+            optimizers.append(optimizer)
+            schedulers.append(scheduler_obj)
+
+        global_step = 0
+        data_iterators = [iter(dataloader) for dataloader in dataloaders]
+        
num_train_objectives = len(train_objectives) + + skip_scheduler = False + for epoch in trange(epochs, desc="Epoch", disable=not show_progress_bar): + training_steps = 0 + + for loss_model in loss_models: + loss_model.zero_grad() + loss_model.train() + + for _ in trange(steps_per_epoch, desc="Iteration", smoothing=0.05, disable=not show_progress_bar): + for train_idx in range(num_train_objectives): + loss_model = loss_models[train_idx] + optimizer = optimizers[train_idx] + scheduler = schedulers[train_idx] + data_iterator = data_iterators[train_idx] + + try: + data = next(data_iterator) + except StopIteration: + data_iterator = iter(dataloaders[train_idx]) + data_iterators[train_idx] = data_iterator + data = next(data_iterator) + + features, labels = data + labels = labels.to(self.device) + features = list(map(lambda batch: batch_to_device(batch, self.device), features)) + + if use_amp: + with autocast(): + loss_value = loss_model(features, labels) + + scale_before_step = scaler.get_scale() + scaler.scale(loss_value).backward() + scaler.unscale_(optimizer) + torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm) + scaler.step(optimizer) + scaler.update() + + skip_scheduler = scaler.get_scale() != scale_before_step + else: + loss_value = loss_model(features, labels) + loss_value.backward() + torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm) + optimizer.step() + + optimizer.zero_grad() + + if not skip_scheduler: + scheduler.step() + + training_steps += 1 + global_step += 1 + + if evaluation_steps > 0 and training_steps % evaluation_steps == 0: + self._eval_during_training( + evaluator, output_path, save_best_model, epoch, training_steps, callback + ) + + for loss_model in loss_models: + loss_model.zero_grad() + loss_model.train() + + if ( + checkpoint_path is not None + and checkpoint_save_steps is not None + and checkpoint_save_steps > 0 + and global_step % checkpoint_save_steps == 0 + ): + self._save_checkpoint(checkpoint_path, checkpoint_save_total_limit, global_step) + + self._eval_during_training(evaluator, output_path, save_best_model, epoch, -1, callback) + + if evaluator is None and output_path is not None: # No evaluator, but output path: save final model version + self.save(output_path) + + if checkpoint_path is not None: + self._save_checkpoint(checkpoint_path, checkpoint_save_total_limit, global_step) + + def _eval_during_training(self, evaluator, output_path, save_best_model, epoch, steps, callback): + """Runs evaluation during the training""" + eval_path = output_path + if output_path is not None: + os.makedirs(output_path, exist_ok=True) + eval_path = os.path.join(output_path, "eval") + os.makedirs(eval_path, exist_ok=True) + + if evaluator is not None: + score = evaluator(self, output_path=eval_path, epoch=epoch, steps=steps) + if callback is not None: + callback(score, epoch, steps) + if score > self.best_score: + self.best_score = score + if save_best_model: + self.save(output_path) + + def _save_checkpoint(self, checkpoint_path, checkpoint_save_total_limit, step): + # Store new checkpoint + self.save(os.path.join(checkpoint_path, str(step))) + + # Delete old checkpoints + if checkpoint_save_total_limit is not None and checkpoint_save_total_limit > 0: + old_checkpoints = [] + for subdir in os.listdir(checkpoint_path): + if subdir.isdigit(): + old_checkpoints.append({"step": int(subdir), "path": os.path.join(checkpoint_path, subdir)}) + + if len(old_checkpoints) > checkpoint_save_total_limit: + old_checkpoints = sorted(old_checkpoints, 
key=lambda x: x["step"]) + shutil.rmtree(old_checkpoints[0]["path"]) diff --git a/sentence_transformers/losses/AdaptiveLayerLoss.py b/sentence_transformers/losses/AdaptiveLayerLoss.py index f63c6b6d5..6337b95f3 100644 --- a/sentence_transformers/losses/AdaptiveLayerLoss.py +++ b/sentence_transformers/losses/AdaptiveLayerLoss.py @@ -230,3 +230,16 @@ def get_config_dict(self) -> Dict[str, Any]: "kl_div_weight": self.kl_div_weight, "kl_temperature": self.kl_temperature, } + + @property + def citation(self) -> str: + return """ +@misc{li20242d, + title={2D Matryoshka Sentence Embeddings}, + author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li}, + year={2024}, + eprint={2402.14776}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +""" diff --git a/sentence_transformers/losses/AnglELoss.py b/sentence_transformers/losses/AnglELoss.py index 00780444b..a506a1317 100644 --- a/sentence_transformers/losses/AnglELoss.py +++ b/sentence_transformers/losses/AnglELoss.py @@ -54,3 +54,16 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0): train_loss = losses.AnglELoss(model=model) """ super().__init__(model, scale, similarity_fct=util.pairwise_angle_sim) + + @property + def citation(self) -> str: + return """ +@misc{li2023angleoptimized, + title={AnglE-optimized Text Embeddings}, + author={Xianming Li and Jing Li}, + year={2023}, + eprint={2309.12871}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +""" diff --git a/sentence_transformers/losses/BatchAllTripletLoss.py b/sentence_transformers/losses/BatchAllTripletLoss.py index a7364b59d..a0fb1055c 100644 --- a/sentence_transformers/losses/BatchAllTripletLoss.py +++ b/sentence_transformers/losses/BatchAllTripletLoss.py @@ -117,3 +117,16 @@ def batch_all_triplet_loss(self, labels, embeddings): triplet_loss = triplet_loss.sum() / (num_positive_triplets + 1e-16) return triplet_loss + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py index e3f1b0262..a70f419e1 100644 --- a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py +++ b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py @@ -120,3 +120,16 @@ def batch_hard_triplet_soft_margin_loss(self, labels: Tensor, embeddings: Tensor triplet_loss = tl.mean() return triplet_loss + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/losses/BatchHardTripletLoss.py b/sentence_transformers/losses/BatchHardTripletLoss.py index 6668fe70a..ab023ec3d 100644 --- a/sentence_transformers/losses/BatchHardTripletLoss.py +++ b/sentence_transformers/losses/BatchHardTripletLoss.py @@ -238,3 +238,16 @@ def get_anchor_negative_triplet_mask(labels): # Uses broadcasting where the 1st argument has shape (1, batch_size) and the 2nd (batch_size, 1) return ~(labels.unsqueeze(0) == labels.unsqueeze(1)) + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + 
title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/losses/BatchSemiHardTripletLoss.py b/sentence_transformers/losses/BatchSemiHardTripletLoss.py index 15d71add8..a54a6bc26 100644 --- a/sentence_transformers/losses/BatchSemiHardTripletLoss.py +++ b/sentence_transformers/losses/BatchSemiHardTripletLoss.py @@ -154,3 +154,16 @@ def _masked_maximum(data, mask, dim=1): masked_maximums += axis_minimums return masked_maximums + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py index 72868faea..12a0cb931 100644 --- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py @@ -232,8 +232,9 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor reps.append(reps_mbs) self.random_states.append(random_state_mbs) - # Step (2): Calculate the loss, backward up to the embeddings and cache the gradients wrt. to the embeddings - loss = self.calculate_loss_and_cache_gradients(reps) + with torch.set_grad_enabled(True): + # Step (2): Calculate the loss, backward up to the embeddings and cache the gradients wrt. to the embeddings + loss = self.calculate_loss_and_cache_gradients(reps) # Step (3): A 2nd embedding step with gradients/computation graphs and connect the cached gradients into the backward chain loss.register_hook(partial(_backward_hook, sentence_features=sentence_features, loss_obj=self)) @@ -241,3 +242,16 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor def get_config_dict(self): return {"scale": self.scale, "similarity_fct": self.similarity_fct.__name__} + + @property + def citation(self) -> str: + return """ +@misc{gao2021scaling, + title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup}, + author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan}, + year={2021}, + eprint={2101.06983}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} +""" diff --git a/sentence_transformers/losses/CoSENTLoss.py b/sentence_transformers/losses/CoSENTLoss.py index b88c3a353..e937d7ef9 100644 --- a/sentence_transformers/losses/CoSENTLoss.py +++ b/sentence_transformers/losses/CoSENTLoss.py @@ -83,3 +83,15 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor def get_config_dict(self): return {"scale": self.scale, "similarity_fct": self.similarity_fct.__name__} + + @property + def citation(self) -> str: + return """ +@online{kexuefm-8847, + title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT}, + author={Su Jianlin}, + year={2022}, + month={Jan}, + url={https://kexue.fm/archives/8847}, +} +""" diff --git a/sentence_transformers/losses/ContrastiveLoss.py b/sentence_transformers/losses/ContrastiveLoss.py index 13be27fef..55f5ad993 100644 --- a/sentence_transformers/losses/ContrastiveLoss.py +++ b/sentence_transformers/losses/ContrastiveLoss.py @@ -95,3 +95,18 @@ def forward(self, 
sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor labels.float() * distances.pow(2) + (1 - labels).float() * F.relu(self.margin - distances).pow(2) ) return losses.mean() if self.size_average else losses.sum() + + @property + def citation(self) -> str: + return """ +@inproceedings{hadsell2006dimensionality, + author={Hadsell, R. and Chopra, S. and LeCun, Y.}, + booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)}, + title={Dimensionality Reduction by Learning an Invariant Mapping}, + year={2006}, + volume={2}, + number={}, + pages={1735-1742}, + doi={10.1109/CVPR.2006.100} +} +""" diff --git a/sentence_transformers/losses/ContrastiveTensionLoss.py b/sentence_transformers/losses/ContrastiveTensionLoss.py index 828b72406..85af67fc3 100644 --- a/sentence_transformers/losses/ContrastiveTensionLoss.py +++ b/sentence_transformers/losses/ContrastiveTensionLoss.py @@ -86,6 +86,18 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor loss = self.criterion(sim_scores, labels.type_as(sim_scores)) return loss + @property + def citation(self) -> str: + return """ +@inproceedings{carlsson2021semantic, + title={Semantic Re-tuning with Contrastive Tension}, + author={Fredrik Carlsson and Amaru Cuba Gyllensten and Evangelia Gogoulou and Erik Ylip{\"a}{\"a} Hellqvist and Magnus Sahlgren}, + booktitle={International Conference on Learning Representations}, + year={2021}, + url={https://openreview.net/forum?id=Ov_sMNau-PF} +} +""" + class ContrastiveTensionLossInBatchNegatives(nn.Module): def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_fct=util.cos_sim): @@ -161,6 +173,18 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor labels = torch.tensor(range(len(scores)), dtype=torch.long, device=scores.device) return (self.cross_entropy_loss(scores, labels) + self.cross_entropy_loss(scores.t(), labels)) / 2 + @property + def citation(self) -> str: + return """ +@inproceedings{carlsson2021semantic, + title={Semantic Re-tuning with Contrastive Tension}, + author={Fredrik Carlsson and Amaru Cuba Gyllensten and Evangelia Gogoulou and Erik Ylip{\"a}{\"a} Hellqvist and Magnus Sahlgren}, + booktitle={International Conference on Learning Representations}, + year={2021}, + url={https://openreview.net/forum?id=Ov_sMNau-PF} +} +""" + ################# CT Data Loader ################# # For CT, we need batches in a specific format diff --git a/sentence_transformers/losses/CosineSimilarityLoss.py b/sentence_transformers/losses/CosineSimilarityLoss.py index 7920fc93b..46b075b38 100644 --- a/sentence_transformers/losses/CosineSimilarityLoss.py +++ b/sentence_transformers/losses/CosineSimilarityLoss.py @@ -1,6 +1,8 @@ import torch from torch import nn, Tensor -from typing import Iterable, Dict +from typing import Any, Iterable, Dict + +from sentence_transformers.util import fullname from ..SentenceTransformer import SentenceTransformer @@ -62,4 +64,7 @@ def __init__(self, model: SentenceTransformer, loss_fct=nn.MSELoss(), cos_score_ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor): embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features] output = self.cos_score_transformation(torch.cosine_similarity(embeddings[0], embeddings[1])) - return self.loss_fct(output, labels.view(-1)) + return self.loss_fct(output, labels.float().view(-1)) + + def get_config_dict(self) -> Dict[str, Any]: + return 
{"loss_fct": fullname(self.loss_fct)} diff --git a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py index 8cdb607df..b29e70f68 100644 --- a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py +++ b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py @@ -158,3 +158,19 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor ce_loss_fct = nn.CrossEntropyLoss(ignore_index=self.tokenizer_decoder.pad_token_id) loss = ce_loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), label_ids.reshape(-1)) return loss + + @property + def citation(self) -> str: + return """ +@inproceedings{wang-2021-TSDAE, + title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoderfor Unsupervised Sentence Embedding Learning", + author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna", + booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021", + month = nov, + year = "2021", + address = "Punta Cana, Dominican Republic", + publisher = "Association for Computational Linguistics", + pages = "671--688", + url = "https://arxiv.org/abs/2104.06979", +} +""" diff --git a/sentence_transformers/losses/GISTEmbedLoss.py b/sentence_transformers/losses/GISTEmbedLoss.py index 1fdc753ef..ff8d3d288 100644 --- a/sentence_transformers/losses/GISTEmbedLoss.py +++ b/sentence_transformers/losses/GISTEmbedLoss.py @@ -152,3 +152,16 @@ def get_config_dict(self) -> Dict[str, Any]: "guide": self.guide, "temperature": self.temperature, } + + @property + def citation(self) -> str: + return """ +@misc{solatorio2024gistembed, + title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning}, + author={Aivin V. Solatorio}, + year={2024}, + eprint={2402.16829}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} +""" diff --git a/sentence_transformers/losses/MSELoss.py b/sentence_transformers/losses/MSELoss.py index acbc90c03..89db913bd 100644 --- a/sentence_transformers/losses/MSELoss.py +++ b/sentence_transformers/losses/MSELoss.py @@ -63,3 +63,17 @@ def __init__(self, model): def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor): rep = self.model(sentence_features[0])["sentence_embedding"] return self.loss_fct(rep, labels) + + @property + def citation(self) -> str: + return """ +@inproceedings{reimers-2020-multilingual-sentence-bert, + title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation", + author = "Reimers, Nils and Gurevych, Iryna", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing", + month = "11", + year = "2020", + publisher = "Association for Computational Linguistics", + url = "https://arxiv.org/abs/2004.09813", +} +""" diff --git a/sentence_transformers/losses/MarginMSELoss.py b/sentence_transformers/losses/MarginMSELoss.py index 063a36e65..26e202fe3 100644 --- a/sentence_transformers/losses/MarginMSELoss.py +++ b/sentence_transformers/losses/MarginMSELoss.py @@ -83,3 +83,16 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor margin_pred = scores_pos - scores_neg return self.loss_fct(margin_pred, labels) + + @property + def citation(self) -> str: + return """ +@misc{hofstätter2021improving, + title={Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation}, + author={Sebastian Hofstätter and Sophia Althammer and Michael Schröder and Mete Sertkan and Allan Hanbury}, + year={2021}, + 
eprint={2010.02666}, + archivePrefix={arXiv}, + primaryClass={cs.IR} +} +""" diff --git a/sentence_transformers/losses/Matryoshka2dLoss.py b/sentence_transformers/losses/Matryoshka2dLoss.py index 7d0e8fd5e..da6f1512b 100644 --- a/sentence_transformers/losses/Matryoshka2dLoss.py +++ b/sentence_transformers/losses/Matryoshka2dLoss.py @@ -111,3 +111,16 @@ def get_config_dict(self) -> Dict[str, Any]: **super().get_config_dict(), **self.loss.get_config_dict(), } + + @property + def citation(self) -> str: + return """ +@misc{li20242d, + title={2D Matryoshka Sentence Embeddings}, + author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li}, + year={2024}, + eprint={2402.14776}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +""" diff --git a/sentence_transformers/losses/MatryoshkaLoss.py b/sentence_transformers/losses/MatryoshkaLoss.py index 3d25027a1..0e9ab32a6 100644 --- a/sentence_transformers/losses/MatryoshkaLoss.py +++ b/sentence_transformers/losses/MatryoshkaLoss.py @@ -142,3 +142,16 @@ def get_config_dict(self) -> Dict[str, Any]: "matryoshka_weights": self.matryoshka_weights, "n_dims_per_step": self.n_dims_per_step, } + + @property + def citation(self) -> str: + return """ +@misc{kusupati2024matryoshka, + title={Matryoshka Representation Learning}, + author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi}, + year={2024}, + eprint={2205.13147}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} +""" diff --git a/sentence_transformers/losses/MegaBatchMarginLoss.py b/sentence_transformers/losses/MegaBatchMarginLoss.py index 3c2b983ef..b63d05afd 100644 --- a/sentence_transformers/losses/MegaBatchMarginLoss.py +++ b/sentence_transformers/losses/MegaBatchMarginLoss.py @@ -141,3 +141,21 @@ def forward_non_mini_batched(self, sentence_features: Iterable[Dict[str, Tensor] negatives_max, _ = torch.max(negative_scores, dim=1) losses = F.relu(self.positive_margin - positive_scores) + F.relu(negatives_max - self.negative_margin) return losses.mean() + + @property + def citation(self) -> str: + return """ +@inproceedings{wieting-gimpel-2018-paranmt, + title = "{P}ara{NMT}-50{M}: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations", + author = "Wieting, John and Gimpel, Kevin", + editor = "Gurevych, Iryna and Miyao, Yusuke", + booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", + month = jul, + year = "2018", + address = "Melbourne, Australia", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/P18-1042", + doi = "10.18653/v1/P18-1042", + pages = "451--462", +} +""" diff --git a/sentence_transformers/losses/MultipleNegativesRankingLoss.py b/sentence_transformers/losses/MultipleNegativesRankingLoss.py index 0fd191b14..5416b3d70 100644 --- a/sentence_transformers/losses/MultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/MultipleNegativesRankingLoss.py @@ -94,3 +94,16 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor def get_config_dict(self): return {"scale": self.scale, "similarity_fct": self.similarity_fct.__name__} + + @property + def citation(self) -> str: + return """ +@misc{henderson2017efficient, + title={Efficient Natural Language Response Suggestion for Smart Reply}, + author={Matthew Henderson and Rami Al-Rfou and Brian 
Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, + year={2017}, + eprint={1705.00652}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +""" diff --git a/sentence_transformers/losses/SoftmaxLoss.py b/sentence_transformers/losses/SoftmaxLoss.py index 201af5573..c8da160e6 100644 --- a/sentence_transformers/losses/SoftmaxLoss.py +++ b/sentence_transformers/losses/SoftmaxLoss.py @@ -119,3 +119,17 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor return loss else: return reps, output + + @property + def citation(self) -> str: + return """ +@inproceedings{reimers-2019-sentence-bert, + title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", + author = "Reimers, Nils and Gurevych, Iryna", + booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", + month = "11", + year = "2019", + publisher = "Association for Computational Linguistics", + url = "https://arxiv.org/abs/1908.10084", +} +""" diff --git a/sentence_transformers/losses/TripletLoss.py b/sentence_transformers/losses/TripletLoss.py index ee25cebf8..1768bc2b7 100644 --- a/sentence_transformers/losses/TripletLoss.py +++ b/sentence_transformers/losses/TripletLoss.py @@ -72,15 +72,6 @@ def __init__( self.distance_metric = distance_metric self.triplet_margin = triplet_margin - def get_config_dict(self): - distance_metric_name = self.distance_metric.__name__ - for name, value in vars(TripletDistanceMetric).items(): - if value == self.distance_metric: - distance_metric_name = "TripletDistanceMetric.{}".format(name) - break - - return {"distance_metric": distance_metric_name, "triplet_margin": self.triplet_margin} - def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor): reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features] @@ -90,3 +81,25 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor losses = F.relu(distance_pos - distance_neg + self.triplet_margin) return losses.mean() + + def get_config_dict(self): + distance_metric_name = self.distance_metric.__name__ + for name, value in vars(TripletDistanceMetric).items(): + if value == self.distance_metric: + distance_metric_name = "TripletDistanceMetric.{}".format(name) + break + + return {"distance_metric": distance_metric_name, "triplet_margin": self.triplet_margin} + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py new file mode 100644 index 000000000..633517574 --- /dev/null +++ b/sentence_transformers/model_card.py @@ -0,0 +1,920 @@ +from copy import copy +import json +import random +from collections import Counter, defaultdict +from dataclasses import dataclass, field, fields +from pathlib import Path +from platform import python_version +import re +from textwrap import indent +from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Union +import logging + +import accelerate +import datasets +import tokenizers +import torch +from torch import nn +import transformers +from datasets import Dataset, DatasetDict +from huggingface_hub import CardData, ModelCard, dataset_info 
as get_dataset_info, model_info as get_model_info +from huggingface_hub.repocard_data import eval_results_to_model_index, EvalResult +from huggingface_hub.utils import yaml_dump +from transformers import TrainerCallback +from transformers.integrations import CodeCarbonCallback +from transformers.modelcard import make_markdown_table +from transformers.trainer_callback import TrainerControl, TrainerState +from tqdm.autonotebook import tqdm + +from sentence_transformers import __version__ as sentence_transformers_version +from sentence_transformers.evaluation import SequentialEvaluator +from sentence_transformers.models import Transformer +from sentence_transformers.util import cos_sim, fullname +from sentence_transformers.training_args import SentenceTransformerTrainingArguments + + +logger = logging.getLogger(__name__) + +if TYPE_CHECKING: + from sentence_transformers.evaluation import SentenceEvaluator + from sentence_transformers.SentenceTransformer import SentenceTransformer + from sentence_transformers.trainer import SentenceTransformerTrainer + + +class ModelCardCallback(TrainerCallback): + def __init__(self, trainer: "SentenceTransformerTrainer", default_args_dict: Dict[str, Any]) -> None: + super().__init__() + self.trainer = trainer + self.default_args_dict = default_args_dict + + callbacks = [ + callback + for callback in self.trainer.callback_handler.callbacks + if isinstance(callback, CodeCarbonCallback) + ] + if callbacks: + trainer.model.model_card_data.code_carbon_callback = callbacks[0] + + trainer.model.model_card_data.trainer = trainer + + def on_init_end( + self, + args: SentenceTransformerTrainingArguments, + state: TrainerState, + control: TrainerControl, + model: "SentenceTransformer", + **kwargs, + ): + from sentence_transformers.losses import AdaptiveLayerLoss, Matryoshka2dLoss, MatryoshkaLoss + + # Try to infer the dataset "name", "id" and "revision" from the dataset cache files + if self.trainer.train_dataset: + model.model_card_data.train_datasets = model.model_card_data.extract_dataset_metadata( + self.trainer.train_dataset, model.model_card_data.train_datasets, "train" + ) + + if self.trainer.eval_dataset: + model.model_card_data.eval_datasets = model.model_card_data.extract_dataset_metadata( + self.trainer.eval_dataset, model.model_card_data.eval_datasets, "eval" + ) + + if isinstance(self.trainer.loss, dict): + losses = list(self.trainer.loss.values()) + else: + losses = [self.trainer.loss] + # Some losses are known to use other losses internally, e.g. 
MatryoshkaLoss, AdaptiveLayerLoss and Matryoshka2dLoss + # So, verify for `loss` attributes in the losses + loss_idx = 0 + while loss_idx < len(losses): + loss = losses[loss_idx] + if ( + isinstance(loss, (MatryoshkaLoss, AdaptiveLayerLoss, Matryoshka2dLoss)) + and hasattr(loss, "loss") + and loss.loss not in losses + ): + losses.append(loss.loss) + loss_idx += 1 + + model.model_card_data.set_losses(losses) + + def on_train_begin( + self, + args: SentenceTransformerTrainingArguments, + state: TrainerState, + control: TrainerControl, + model: "SentenceTransformer", + **kwargs, + ) -> None: + # model.model_card_data.hyperparameters = extract_hyperparameters_from_trainer(self.trainer) + ignore_keys = { + "output_dir", + "logging_dir", + "logging_strategy", + "logging_first_step", + "logging_steps", + "evaluation_strategy", + "eval_steps", + "eval_delay", + "save_strategy", + "save_steps", + "save_total_limit", + "metric_for_best_model", + "greater_is_better", + "report_to", + "samples_per_label", + "show_progress_bar", + "do_train", + "do_eval", + "do_test", + "run_name", + "hub_token", + "push_to_hub_token", + } + args_dict = args.to_dict() + model.model_card_data.all_hyperparameters = { + key: value for key, value in args_dict.items() if key not in ignore_keys + } + model.model_card_data.non_default_hyperparameters = { + key: value + for key, value in args_dict.items() + if key not in ignore_keys and key in self.default_args_dict and value != self.default_args_dict[key] + } + + def on_evaluate( + self, + args: SentenceTransformerTrainingArguments, + state: TrainerState, + control: TrainerControl, + model: "SentenceTransformer", + metrics: Dict[str, float], + **kwargs, + ) -> None: + loss_dict = {" ".join(key.split("_")[1:]): metrics[key] for key in metrics if key.endswith("_loss")} + if ( + model.model_card_data.training_logs + and model.model_card_data.training_logs[-1]["Step"] == state.global_step + ): + model.model_card_data.training_logs[-1].update(loss_dict) + else: + model.model_card_data.training_logs.append( + { + "Epoch": state.epoch, + "Step": state.global_step, + **loss_dict, + } + ) + + def on_log( + self, + args: SentenceTransformerTrainingArguments, + state: TrainerState, + control: TrainerControl, + model: "SentenceTransformer", + logs: Dict[str, float], + **kwargs, + ): + keys = {"loss"} & set(logs) + if keys: + if ( + model.model_card_data.training_logs + and model.model_card_data.training_logs[-1]["Step"] == state.global_step + ): + model.model_card_data.training_logs[-1]["Training Loss"] = logs[keys.pop()] + else: + model.model_card_data.training_logs.append( + { + "Epoch": state.epoch, + "Step": state.global_step, + "Training Loss": logs[keys.pop()], + } + ) + + +YAML_FIELDS = [ + "language", + "license", + "library_name", + "tags", + "datasets", + "metrics", + "pipeline_tag", + "widget", + "model-index", + "co2_eq_emissions", + "base_model", +] +IGNORED_FIELDS = ["model", "trainer", "eval_results_dict"] + + +@dataclass +class SentenceTransformerModelCardData(CardData): + """A dataclass storing data used in the model card. + + Args: + language (`Optional[Union[str, List[str]]]`): The model language, either a string or a list, + e.g. "en" or ["en", "de", "nl"] + license (`Optional[str]`): The license of the model, e.g. "apache-2.0", "mit", + or "cc-by-nc-sa-4.0" + model_name (`Optional[str]`): The pretty name of the model, e.g. "SentenceTransformer based on microsoft/mpnet-base". + model_id (`Optional[str]`): The model ID when pushing the model to the Hub, + e.g. 
"tomaarsen/sbert-mpnet-base-allnli". + train_datasets (`List[Dict[str, str]]`): A list of the names and/or Hugging Face dataset IDs of the training datasets. + e.g. [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}, {"name": "STSB"}] + eval_datasets (`List[Dict[str, str]]`): A list of the names and/or Hugging Face dataset IDs of the evaluation datasets. + e.g. [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"id": "mteb/stsbenchmark-sts"}] + task_name (`str`): The human-readable task the model is trained on, + e.g. "semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more". + tags (`Optional[List[str]]`): A list of tags for the model, + e.g. ["sentence-transformers", "sentence-similarity", "feature-extraction"]. + generate_widget_examples (`bool`): Whether to generate widget examples on every model save. + + + + Install [``codecarbon``](https://github.com/mlco2/codecarbon) to automatically track carbon emission usage and + include it in your model cards. + + + + Example:: + + >>> model = SentenceTransformer( + ... "microsoft/mpnet-base", + ... model_card_data=SentenceTransformerModelCardData( + ... model_id="tomaarsen/sbert-mpnet-base-allnli", + ... train_datasets=[{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}], + ... eval_datasets=[{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}], + ... license="apache-2.0", + ... language="en", + ... ), + ... ) + """ + + # Potentially provided by the user + language: Optional[Union[str, List[str]]] = field(default_factory=list) + license: Optional[str] = None + model_name: Optional[str] = None + model_id: Optional[str] = None + train_datasets: List[Dict[str, str]] = field(default_factory=list) + eval_datasets: List[Dict[str, str]] = field(default_factory=list) + task_name: str = ( + "semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more" + ) + tags: Optional[List[str]] = field( + default_factory=lambda: [ + "sentence-transformers", + "sentence-similarity", + "feature-extraction", + ] + ) + generate_widget_examples: bool = True + + # Automatically filled by `ModelCardCallback` and the Trainer directly + base_model: Optional[str] = field(default=None, init=False) + base_model_revision: Optional[str] = field(default=None, init=False) + non_default_hyperparameters: Dict[str, Any] = field(default_factory=dict, init=False) + all_hyperparameters: Dict[str, Any] = field(default_factory=dict, init=False) + eval_results_dict: Optional[Dict["SentenceEvaluator", Dict[str, Any]]] = field(default_factory=dict, init=False) + training_logs: List[Dict[str, float]] = field(default_factory=list, init=False) + widget: List[Dict[str, str]] = field(default_factory=list, init=False) + predict_example: Optional[str] = field(default=None, init=False) + label_example_list: List[Dict[str, str]] = field(default_factory=list, init=False) + code_carbon_callback: Optional[CodeCarbonCallback] = field(default=None, init=False) + citations: Dict[str, str] = field(default_factory=dict, init=False) + best_model_step: Optional[int] = field(default=None, init=False) + trainer: Optional["SentenceTransformerTrainer"] = field(default=None, init=False, repr=False) + + # Utility fields + first_save: bool = field(default=True, init=False) + widget_step: int = field(default=-1, init=False) + + # Computed once, always unchanged + pipeline_tag: str = 
field(default="sentence-similarity", init=False) + library_name: str = field(default="sentence-transformers", init=False) + version: Dict[str, str] = field( + default_factory=lambda: { + "python": python_version(), + "sentence_transformers": sentence_transformers_version, + "transformers": transformers.__version__, + "torch": torch.__version__, + "accelerate": accelerate.__version__, + "datasets": datasets.__version__, + "tokenizers": tokenizers.__version__, + }, + init=False, + ) + + # Passed via `register_model` only + model: Optional["SentenceTransformer"] = field(default=None, init=False, repr=False) + + def __post_init__(self): + # We don't want to save "ignore_metadata_errors" in our Model Card + infer_languages = not self.language + if isinstance(self.language, str): + self.language = [self.language] + + self.train_datasets = self.validate_datasets(self.train_datasets, infer_languages=infer_languages) + self.eval_datasets = self.validate_datasets(self.eval_datasets, infer_languages=infer_languages) + + if self.model_id and self.model_id.count("/") != 1: + logger.warning( + f"The provided {self.model_id!r} model ID should include the organization or user," + ' such as "tomaarsen/mpnet-base-nli-matryoshka". Setting `model_id` to None.' + ) + self.model_id = None + + def validate_datasets(self, dataset_list, infer_languages: bool = True) -> None: + output_dataset_list = [] + for dataset in dataset_list: + if "name" not in dataset: + if "id" in dataset: + dataset["name"] = dataset["id"] + + if "id" in dataset: + # Try to determine the language from the dataset on the Hub + try: + info = get_dataset_info(dataset["id"]) + except Exception: + logger.warning( + f"The dataset `id` {dataset['id']!r} does not exist on the Hub. Setting the `id` to None." 
+ ) + del dataset["id"] + else: + # TODO: Perhaps we can try to infer the dataset name from the dataset card + if info.cardData and infer_languages and "language" in info.cardData: + dataset_language = info.cardData.get("language") + if isinstance(dataset_language, str): + dataset_language = [dataset_language] + for language in dataset_language: + if language not in self.language: + self.language.append(language) + + output_dataset_list.append(dataset) + return output_dataset_list + + def set_losses(self, losses: nn.Module) -> None: + citations = { + "Sentence Transformers": """ +@inproceedings{reimers-2019-sentence-bert, + title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", + author = "Reimers, Nils and Gurevych, Iryna", + booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", + month = "11", + year = "2019", + publisher = "Association for Computational Linguistics", + url = "https://arxiv.org/abs/1908.10084", +} +""" + } + for loss in losses: + try: + citations[loss.__class__.__name__] = loss.citation + except Exception: + pass + inverted_citations = defaultdict(list) + for loss, citation in citations.items(): + inverted_citations[citation].append(loss) + + def join_list(losses: List[str]) -> str: + if len(losses) > 1: + return ", ".join(losses[:-1]) + " and " + losses[-1] + return losses[0] + + self.citations = {join_list(losses): citation for citation, losses in inverted_citations.items()} + self.tags += [f"loss:{loss}" for loss in {loss.__class__.__name__: loss for loss in losses}] + + def set_best_model_step(self, step: int) -> None: + self.best_model_step = step + + def set_widget_examples(self, dataset: Union[Dataset, DatasetDict]) -> None: + if isinstance(dataset, Dataset): + dataset = DatasetDict(dataset=dataset) + + self.widget = [] + # Sample the datasets to use for the widget + dataset_names = random.choices(list(dataset.keys()), k=5) + num_samples = 1000 + num_samples_to_encode = 500 + source_sentences = set() + for dataset_name in tqdm(dataset_names, desc="Computing widget examples", unit="example", leave=False): + # Sample 1000 examples from the dataset, get the 500 shortest texts and encode them + dataset_size = len(dataset[dataset_name]) + samples = dataset[dataset_name].select( + random.sample(range(dataset_size), k=min(num_samples, dataset_size)) + ) + all_texts = { + value + for sample in samples + for key, value in sample.items() + if isinstance(value, str) and value not in source_sentences and key != "dataset_name" + } + if len(all_texts) < 5: + continue + + all_texts = sorted(all_texts, key=len)[:num_samples_to_encode] + embeddings = self.model.encode(all_texts, show_progress_bar=False) + + # Select a relatively short example from the dataset as the source, + # and find the most similar, median, and dissimilar examples + source_sentence_idx, source_sentence = sorted(list(enumerate(all_texts)), key=lambda x: len(x[1]))[ + min(len(all_texts) - 1, 10) + ] + _, indices = cos_sim(embeddings[source_sentence_idx], embeddings)[0].sort() + similar_sentence = all_texts[indices[-2]] + median_sentence = all_texts[len(all_texts) // 2] + dissimilar_sentence = all_texts[indices[0]] + self.widget.append( + { + "source_sentence": source_sentence, + "sentences": [similar_sentence, median_sentence, dissimilar_sentence], + } + ) + source_sentences.add(source_sentence) + + self.predict_example = [source_sentence, similar_sentence, median_sentence] + + def set_evaluation_metrics(self, evaluator: "SentenceEvaluator", 
metrics: Dict[str, Any]): + self.eval_results_dict[evaluator] = copy(metrics) + + # If the evaluator has a primary metric and we have a trainer, then add the primary metric to the training logs + if hasattr(evaluator, "primary_metric"): + primary_metrics = evaluator.primary_metric + if isinstance(evaluator, SequentialEvaluator): + primary_metrics = [sub_evaluator.primary_metric for sub_evaluator in evaluator.evaluators] + elif isinstance(primary_metrics, str): + primary_metrics = [primary_metrics] + + if self.trainer is None: + step = 0 + epoch = 0 + else: + step = self.trainer.state.global_step + epoch = self.trainer.state.epoch + training_log_metrics = {key: value for key, value in metrics.items() if key in primary_metrics} + + if self.training_logs and self.training_logs[-1]["Step"] == step: + self.training_logs[-1].update(training_log_metrics) + else: + self.training_logs.append( + { + "Epoch": epoch, + "Step": step, + **training_log_metrics, + } + ) + + def set_label_examples(self, dataset: Dataset) -> None: + num_examples_per_label = 3 + examples = defaultdict(list) + finished_labels = set() + for sample in dataset: + text = sample["text"] + label = sample["label"] + if label not in finished_labels: + examples[label].append(f"
<li>{repr(text)}</li>")
+ if len(examples[label]) >= num_examples_per_label:
+ finished_labels.add(label)
+ if len(finished_labels) == self.num_classes:
+ break
+ self.label_example_list = [
+ {
+ "Label": self.model.labels[label] if self.model.labels and isinstance(label, int) else label,
+ "Examples": "<ul>" + "".join(example_set) + "</ul>
    ", + } + for label, example_set in examples.items() + ] + + def infer_datasets(self, dataset: Union[Dataset, DatasetDict], dataset_name: Optional[str] = None) -> None: + if isinstance(dataset, DatasetDict): + return [ + dataset + for dataset_name, sub_dataset in dataset.items() + for dataset in self.infer_datasets(sub_dataset, dataset_name=dataset_name) + ] + + def subtuple_finder(tuple: Tuple[str], subtuple: Tuple[str]) -> int: + for i, element in enumerate(tuple): + if element == subtuple[0] and tuple[i : i + len(subtuple)] == subtuple: + return i + return -1 + + cache_files = dataset.cache_files + dataset_output = {} + # Ignore the dataset name if it is a default name from the FitMixin backwards compatibility + if dataset_name and re.match("_dataset_\d+", dataset_name): + dataset_name = None + if dataset_name: + dataset_output["name"] = dataset_name + if cache_files and "filename" in cache_files[0]: + cache_path_parts = Path(cache_files[0]["filename"]).parts + # Check if the cachefile is under "huggingface/datasets" + subtuple = ("huggingface", "datasets") + index = subtuple_finder(cache_path_parts, subtuple) + if index == -1: + return + + # Get the folder after "huggingface/datasets" + cache_dataset_name = cache_path_parts[index + len(subtuple)] + # If the dataset has an author: + if "___" in cache_dataset_name: + author, dataset_name = cache_dataset_name.split("___") + dataset_output["id"] = f"{author}/{dataset_name}" + else: + author = None + dataset_name = cache_dataset_name + dataset_output["id"] = get_dataset_info(dataset_name).id + + # If the cache path ends with a 40 character hash, it is the current revision + if len(cache_path_parts[-2]) == 40: + dataset_output["revision"] = cache_path_parts[-2] + + return [dataset_output] + + def compute_dataset_metrics( + self, + dataset: Dict[str, str], + dataset_info: Dict[str, Any], + loss: Optional[Union[Dict[str, nn.Module], nn.Module]], + ) -> Dict[str, str]: + """ + Given a dataset, compute the following: + * Dataset Size + * Dataset Columns + * Dataset Stats + - Strings: min, mean, max word count/token length + - Integers: Counter() instance + - Floats: min, mean, max range + - List: number of elements or min, mean, max number of elements + * 3 Example samples + * Loss function name + - Loss function config + """ + if not dataset: + return {} + + dataset_info["size"] = len(dataset) + dataset_info["columns"] = [f"{column}" for column in dataset.column_names] + dataset_info["stats"] = {} + for column in dataset.column_names: + subsection = dataset[:1000][column] + first = subsection[0] + if isinstance(first, str): + tokenized = self.model.tokenize(subsection) + if isinstance(tokenized, dict) and "attention_mask" in tokenized: + lengths = tokenized["attention_mask"].sum(dim=1).tolist() + suffix = "tokens" + else: + lengths = [len(sentence) for sentence in subsection] + suffix = "characters" + dataset_info["stats"][column] = { + "dtype": "string", + "data": { + "min": f"{round(min(lengths), 2)} {suffix}", + "mean": f"{round(sum(lengths) / len(lengths), 2)} {suffix}", + "max": f"{round(max(lengths), 2)} {suffix}", + }, + } + elif isinstance(first, (int, bool)): + counter = Counter(subsection) + dataset_info["stats"][column] = { + "dtype": "int", + "data": { + key: f"{'~' if len(counter) > 1 else ''}{counter[key] / len(subsection):.2%}" + for key in sorted(counter) + }, + } + elif isinstance(first, float): + dataset_info["stats"][column] = { + "dtype": "float", + "data": { + "min": round(min(dataset[column]), 2), + "mean": 
round(sum(dataset[column]) / len(dataset), 2), + "max": round(max(dataset[column]), 2), + }, + } + elif isinstance(first, list): + counter = Counter([len(lst) for lst in subsection]) + if len(counter) == 1: + dataset_info["stats"][column] = { + "dtype": "list", + "data": { + "size": f"{len(first)} elements", + }, + } + else: + dataset_info["stats"][column] = { + "dtype": "list", + "data": { + "min": f"{min(counter)} elements", + "mean": f"{sum(counter) / len(counter):.2f} elements", + "max": f"{max(counter)} elements", + }, + } + else: + dataset_info["stats"][column] = {"dtype": fullname(first), "data": {}} + + def to_html_list(data: dict): + return "
    • " + "
    • ".join(f"{key}: {value}" for key, value in data.items()) + "
    " + + stats_lines = [ + {"": "type", **{key: value["dtype"] for key, value in dataset_info["stats"].items()}}, + {"": "details", **{key: to_html_list(value["data"]) for key, value in dataset_info["stats"].items()}}, + ] + dataset_info["stats_table"] = indent(make_markdown_table(stats_lines).replace("-:|", "--|"), " ") + + dataset_info["examples"] = dataset[:3] + num_samples = len(dataset_info["examples"][list(dataset_info["examples"])[0]]) + examples_lines = [] + for sample_idx in range(num_samples): + columns = {} + for column in dataset.column_names: + value = dataset_info["examples"][column][sample_idx] + # If the value is a long list, truncate it + if isinstance(value, list) and len(value) > 5: + value = str(value[:5])[:-1] + ", ...]" + # Avoid newlines in the table + value = str(value).replace("\n", "
    ") + columns[column] = f"{value}" + examples_lines.append(columns) + dataset_info["examples_table"] = indent(make_markdown_table(examples_lines).replace("-:|", "--|"), " ") + + dataset_info["loss"] = { + "fullname": fullname(loss), + } + if hasattr(loss, "get_config_dict"): + config = loss.get_config_dict() + try: + str_config = json.dumps(config, indent=4) + except TypeError: + str_config = str(config) + dataset_info["loss"]["config_code"] = indent(f"```json\n{str_config}\n```", " ") + return dataset_info + + def extract_dataset_metadata( + self, dataset: Union[Dataset, DatasetDict], dataset_metadata, dataset_type: Literal["train", "eval"] + ) -> Dict[str, Any]: + if dataset: + if dataset_metadata and ( + (isinstance(dataset, DatasetDict) and len(dataset_metadata) != len(dataset)) + or (isinstance(dataset, Dataset) and len(dataset_metadata) != 1) + ): + logger.warning( + f"The number of `{dataset_type}_datasets` in the model card data does not match the number of {dataset_type} datasets in the Trainer. " + f"Removing the provided `{dataset_type}_datasets` from the model card data." + ) + dataset_metadata = [] + + if not dataset_metadata: + dataset_metadata = self.infer_datasets(dataset) + + if isinstance(dataset, DatasetDict): + dataset_metadata = [ + self.compute_dataset_metrics( + dataset_value, + dataset_info, + self.trainer.loss[dataset_name] if isinstance(self.trainer.loss, dict) else self.trainer.loss, + ) + for dataset_name, dataset_value, dataset_info in zip( + dataset.keys(), dataset.values(), dataset_metadata + ) + ] + else: + dataset_metadata = [self.compute_dataset_metrics(dataset, dataset_metadata[0], self.trainer.loss)] + + return self.validate_datasets(dataset_metadata) + + def register_model(self, model: "SentenceTransformer") -> None: + self.model = model + + def set_model_id(self, model_id: str) -> None: + self.model_id = model_id + + def set_base_model(self, model_id: str, revision: Optional[str] = None) -> None: + try: + model_info = get_model_info(model_id) + except Exception: + # Getting the model info can fail for many reasons: model does not exist, no internet, outage, etc. + return False + self.base_model = model_info.id + if revision is None or revision == "main": + revision = model_info.sha + self.base_model_revision = revision + return True + + def try_to_set_base_model(self) -> None: + if isinstance(self.model[0], Transformer): + base_model = self.model[0].auto_model.config._name_or_path + base_model_path = Path(base_model) + # Sometimes the name_or_path ends exactly with the model_id, e.g. + # "C:\\Users\\tom/.cache\\torch\\sentence_transformers\\BAAI_bge-small-en-v1.5\\" + candidate_model_ids = ["/".join(base_model_path.parts[-2:])] + # Sometimes the name_or_path its final part contains the full model_id, with "/" replaced with a "_", e.g. + # "/root/.cache/torch/sentence_transformers/sentence-transformers_all-mpnet-base-v2/" + # In that case, we take the last part, split on _, and try all combinations + # e.g. "a_b_c_d" -> ['a/b_c_d', 'a_b/c_d', 'a_b_c/d'] + splits = base_model_path.name.split("_") + candidate_model_ids += [ + "_".join(splits[:idx]) + "/" + "_".join(splits[idx:]) for idx in range(1, len(splits)) + ] + for model_id in candidate_model_ids: + if self.set_base_model(model_id): + break + + def format_eval_metrics(self) -> Dict[str, Any]: + """Format the evaluation metrics for the model card. 
+ + The following keys will be returned: + - eval_metrics: A list of dictionaries containing the class name, description, dataset name, and a markdown table + This is used to display the evaluation metrics in the model card. + - metrics: A list of all metric keys. This is used in the model card metadata. + - model-index: A list of dictionaries containing the task name, task type, dataset type, dataset name, metric name, + metric type, and metric value. This is used to display the evaluation metrics in the model card metadata. + """ + eval_metrics = [] + all_metrics = {} + eval_results = [] + for evaluator, metrics in self.eval_results_dict.items(): + name = getattr(evaluator, "name", None) + primary_metric = getattr(evaluator, "primary_metric", None) + if name and all(key.startswith(name + "_") for key in metrics.keys()): + metrics = {key[len(name) + 1 :]: value for key, value in metrics.items()} + if primary_metric and primary_metric.startswith(name + "_"): + primary_metric = primary_metric[len(name) + 1 :] + + def try_to_pure_python(value: Any) -> Any: + """Try to convert a value from a Numpy or Torch scalar to pure Python, if not already pure Python""" + try: + if hasattr(value, "dtype"): + return value.item() + except Exception: + pass + return value + + # Try to convert to pure Python + metrics = {key: try_to_pure_python(value) for key, value in metrics.items()} + + table_lines = [ + { + "Metric": f"**{metric_key}**" if metric_key == primary_metric else metric_key, + "Value": f"**{round(metric_value, 4)}**" + if metric_key == primary_metric + else round(metric_value, 4), + } + for metric_key, metric_value in metrics.items() + ] + + # E.g. "Binary Classification" or "Semantic Similarity" + description = evaluator.description + dataset_name = getattr(evaluator, "name", None) + eval_metrics.append( + { + "class_name": fullname(evaluator), + "description": description, + "dataset_name": dataset_name, + "table": make_markdown_table(table_lines).replace("-:|", "--|"), + } + ) + eval_results.extend( + [ + EvalResult( + task_name=description, + task_type=description.lower().replace(" ", "-"), + dataset_type=dataset_name or "unknown", + dataset_name=dataset_name.replace("_", " ").replace("-", " ") or "Unknown", + metric_name=metric_key.replace("_", " ").title(), + metric_type=metric_key, + metric_value=metric_value, + ) + for metric_key, metric_value in metrics.items() + if isinstance(metric_value, (int, float)) + ] + ) + all_metrics.update(metrics) + + return { + "eval_metrics": eval_metrics, + "metrics": list(all_metrics.keys()), + "model-index": eval_results_to_model_index(self.model_name, eval_results), + } + + def format_training_logs(self): + # Get the keys from all evaluation lines + eval_lines_keys = {key for lines in self.training_logs for key in lines.keys()} + + # Sort the metric columns: Epoch, Step, Training Loss, Validation Loss, Evaluator results + def sort_metrics(key: str) -> str: + if key == "Epoch": + return "0" + if key == "Step": + return "1" + if key == "Training Loss": + return "2" + if key.endswith("loss"): + return "3" + return key + + sorted_eval_lines_keys = sorted(eval_lines_keys, key=sort_metrics) + training_logs = [ + { + key: f"**{round(line[key], 4) if key in line else '-'}**" + if line["Step"] == self.best_model_step + else line.get(key, "-") + for key in sorted_eval_lines_keys + } + for line in self.training_logs + ] + eval_lines = make_markdown_table(training_logs) + return { + "eval_lines": eval_lines, + "explain_bold_in_eval": "**" in eval_lines, + } + + 
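
The prefix-stripping and scalar conversion inside `format_eval_metrics` above is easier to see outside of the diff. A minimal standalone sketch (not part of this patch; the "sts-dev" evaluator name and metric keys are illustrative):

```python
import numpy as np

# Hypothetical evaluator output: keys prefixed with the evaluator name, NumPy scalar values
name = "sts-dev"
metrics = {
    "sts-dev_pearson_cosine": np.float64(0.8567),
    "sts-dev_spearman_cosine": np.float64(0.8521),
}

# Strip the "<name>_" prefix, but only when every key carries it
if name and all(key.startswith(name + "_") for key in metrics):
    metrics = {key[len(name) + 1 :]: value for key, value in metrics.items()}

# Convert NumPy/Torch scalars (anything exposing `.dtype`) to pure Python values
metrics = {key: value.item() if hasattr(value, "dtype") else value for key, value in metrics.items()}

print(metrics)  # {'pearson_cosine': 0.8567, 'spearman_cosine': 0.8521}
```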
def get_codecarbon_data(self): + emissions_data = self.code_carbon_callback.tracker._prepare_emissions_data() + results = { + "co2_eq_emissions": { + # * 1000 to convert kg to g + "emissions": float(emissions_data.emissions) * 1000, + "energy_consumed": float(emissions_data.energy_consumed), + "source": "codecarbon", + "training_type": "fine-tuning", + "on_cloud": emissions_data.on_cloud == "Y", + "cpu_model": emissions_data.cpu_model, + "ram_total_size": emissions_data.ram_total_size, + "hours_used": round(emissions_data.duration / 3600, 3), + } + } + if emissions_data.gpu_model: + results["co2_eq_emissions"]["hardware_used"] = emissions_data.gpu_model + return results + + def to_dict(self) -> Dict[str, Any]: + # Extract some meaningful examples from the evaluation or training dataset to showcase the performance + if self.trainer and self.widget_step < self.trainer.state.global_step and self.generate_widget_examples: + if dataset := self.trainer.eval_dataset or self.trainer.train_dataset: + self.set_widget_examples(dataset) + self.widget_step = self.trainer.state.global_step + + # Try to set the base model + if self.first_save and not self.base_model: + try: + self.try_to_set_base_model() + except Exception: + pass + + # Set the model name + if not self.model_name: + if self.base_model: + self.model_name = f"SentenceTransformer based on {self.base_model}" + else: + self.model_name = "SentenceTransformer" + + super_dict = {field.name: getattr(self, field.name) for field in fields(self)} + + # Compute required formats from the (usually post-training) evaluation data + if self.eval_results_dict: + try: + super_dict.update(self.format_eval_metrics()) + except Exception as exc: + logger.warning(f"Error while formatting evaluation metrics: {exc}") + raise exc + + # Compute required formats for the during-training evaluation data + if self.training_logs: + try: + super_dict.update(self.format_training_logs()) + except Exception as exc: + logger.warning(f"Error while formatting training logs: {exc}") + + super_dict["hide_eval_lines"] = len(self.training_logs) > 100 + + # Try to add the code carbon callback data + if ( + self.code_carbon_callback + and self.code_carbon_callback.tracker + and self.code_carbon_callback.tracker._start_time is not None + ): + super_dict.update(self.get_codecarbon_data()) + + # Add some additional metadata stored in the model itself + super_dict["model_max_length"] = self.model.get_max_seq_length() + super_dict["output_dimensionality"] = self.model.get_sentence_embedding_dimension() + super_dict["model_string"] = str(self.model) + + self.first_save = False + + for key in IGNORED_FIELDS: + super_dict.pop(key, None) + return super_dict + + def to_yaml(self, line_break=None) -> str: + return yaml_dump( + {key: value for key, value in self.to_dict().items() if key in YAML_FIELDS and value is not None}, + sort_keys=False, + line_break=line_break, + ).strip() + + +def generate_model_card(model: "SentenceTransformer") -> str: + template_path = Path(__file__).parent / "model_card_template.md" + model_card = ModelCard.from_template(card_data=model.model_card_data, template_path=template_path, hf_emoji="🤗") + return model_card.content diff --git a/sentence_transformers/model_card_template.md b/sentence_transformers/model_card_template.md new file mode 100644 index 000000000..2362eb0c6 --- /dev/null +++ b/sentence_transformers/model_card_template.md @@ -0,0 +1,228 @@ +--- +# For reference on model card metadata, see the spec: 
https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1 +# Doc / guide: https://huggingface.co/docs/hub/model-cards +{{ card_data }} +--- + +# {{ model_name if model_name else "Sentence Transformer model" }} + +This is a [sentence-transformers](https://www.SBERT.net) model{% if base_model %} finetuned from [{{ base_model }}](https://huggingface.co/{{ base_model }}){% else %} trained{% endif %}{% if train_datasets | selectattr("name") | list %} on the {% for dataset in (train_datasets | selectattr("name")) %}{% if dataset.id %}[{{ dataset.name if dataset.name else dataset.id }}](https://huggingface.co/datasets/{{ dataset.id }}){% else %}{{ dataset.name }}{% endif %}{% if not loop.last %}{% if loop.index == (train_datasets | selectattr("name") | list | length - 1) %} and {% else %}, {% endif %}{% endif %}{% endfor %} dataset{{"s" if train_datasets | selectattr("name") | list | length > 1 else ""}}{% endif %}. It maps sentences & paragraphs to a {{ output_dimensionality }}-dimensional dense vector space and can be used for {{ task_name }}. + +## Model Details + +### Model Description +- **Model Type:** Sentence Transformer +{% if base_model -%} + {%- if base_model_revision -%} + - **Base model:** [{{ base_model }}](https://huggingface.co/{{ base_model }}) + {%- else -%} + - **Base model:** [{{ base_model }}](https://huggingface.co/{{ base_model }}) + {%- endif -%} +{%- else -%} + +{%- endif %} +- **Maximum Sequence Length:** {{ model_max_length }} tokens +- **Output Dimensionality:** {{ output_dimensionality }} tokens +{% if train_datasets | selectattr("name") | list -%} + - **Training Dataset{{"s" if train_datasets | selectattr("name") | list | length > 1 else ""}}:** + {%- for dataset in (train_datasets | selectattr("name")) %} + {%- if dataset.id %} + - [{{ dataset.name if dataset.name else dataset.id }}](https://huggingface.co/datasets/{{ dataset.id }}) + {%- else %} + - {{ dataset.name }} + {%- endif %} + {%- endfor %} +{%- else -%} + +{%- endif %} +{% if language -%} + - **Language{{"s" if language is not string and language | length > 1 else ""}}:** + {%- if language is string %} {{ language }} + {%- else %} {% for lang in language -%} + {{ lang }}{{ ", " if not loop.last else "" }} + {%- endfor %} + {%- endif %} +{%- else -%} + +{%- endif %} +{% if license -%} + - **License:** {{ license }} +{%- else -%} + +{%- endif %} + +### Model Sources + +- **Documentation:** [Sentence Transformers Documentation](https://sbert.net) +- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) +- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) + +### Full Model Architecture + +``` +{{ model_string }} +``` + +## Usage + +### Direct Usage (Sentence Transformers) + +First install the Sentence Transformers library: + +```bash +pip install -U sentence-transformers +``` + +Then you can load this model and run inference. 
+```python +from sentence_transformers import SentenceTransformer + +# Download from the {{ hf_emoji }} Hub +model = SentenceTransformer("{{ model_id | default('sentence_transformers_model_id', true) }}") +# Run inference +sentences = [ +{%- for text in (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) %} + {{ "%r" | format(text) }}, +{%- endfor %} +] +embeddings = model.encode(sentences) +print(embeddings.shape) +# [{{ (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) | length}}, {{ output_dimensionality | default(1024, true) }}] +``` + + + + + + +{% if eval_metrics %} +## Evaluation + +### Metrics +{% for metrics in eval_metrics %} +#### {{ metrics.description }} +{% if metrics.dataset_name %}* Dataset: `{{ metrics.dataset_name }}`{% endif %} +* Evaluated with {% if metrics.class_name.startswith("sentence_transformers.") %}[{{ metrics.class_name.split(".")[-1] }}](https://sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.{{ metrics.class_name.split(".")[-1] }}){% else %}{{ metrics.class_name }}{% endif %} + +{{ metrics.table }} +{%- endfor %}{% endif %} + + + + +## Training Details +{% for dataset_type, dataset_list in [("training", train_datasets), ("evaluation", eval_datasets)] %}{% if dataset_list %} +### {{ dataset_type.title() }} Dataset{{"s" if dataset_list | length > 1 else ""}} +{% for dataset in dataset_list %} +#### {{ dataset['name'] or 'Unnamed Dataset' }} + +{% if dataset['name'] %}* Dataset: {% if 'id' in dataset %}[{{ dataset['name'] }}](https://huggingface.co/datasets/{{ dataset['id'] }}){% else %}{{ dataset['name'] }}{% endif %} +{%- if 'revision' in dataset and 'id' in dataset %} at [{{ dataset['revision'][:7] }}](https://huggingface.co/datasets/{{ dataset['id'] }}/tree/{{ dataset['revision'] }}){% endif %}{% endif %} +* Size: {{ "{:,}".format(dataset['size']) }} {{ dataset_type }} samples +* Columns: {% if dataset['columns'] | length == 1 %}{{ dataset['columns'][0] }}{% elif dataset['columns'] | length == 2 %}{{ dataset['columns'][0] }} and {{ dataset['columns'][1] }}{% else %}{{ dataset['columns'][:-1] | join(', ') }}, and {{ dataset['columns'][-1] }}{% endif %} +* Approximate statistics based on the first 1000 samples: +{{ dataset['stats_table'] }}* Samples: +{{ dataset['examples_table'] }}* Loss: {% if dataset["loss"]["fullname"].startswith("sentence_transformers.") %}[{{ dataset["loss"]["fullname"].split(".")[-1] }}](https://sbert.net/docs/package_reference/losses.html#{{ dataset["loss"]["fullname"].split(".")[-1].lower() }}){% else %}{{ dataset["loss"]["fullname"] }}{% endif %}{% if "config_code" in dataset["loss"] %} with these parameters: +{{ dataset["loss"]["config_code"] }}{% endif %} +{% endfor %}{% endif %}{% endfor -%} + +{% if all_hyperparameters %} +### Training Hyperparameters +{% if non_default_hyperparameters -%} +#### Non-Default Hyperparameters + +{% for name, value in non_default_hyperparameters.items() %}- `{{ name }}`: {{ value }} +{% endfor %}{%- endif %} +#### All Hyperparameters +
<details><summary>Click to expand</summary>
+
+{% for name, value in all_hyperparameters.items() %}- `{{ name }}`: {{ value }}
+{% endfor %}
+</details>
+{% endif %}
+
+{%- if eval_lines %}
+### Training Logs
+{% if hide_eval_lines %}<details><summary>Click to expand</summary>
+
+{% endif -%}
+{{ eval_lines }}{% if explain_bold_in_eval %}
+* The bold row denotes the saved checkpoint.{% endif %}
+{%- if hide_eval_lines %}
+</details>
    {% endif %} +{% endif %} + +{%- if co2_eq_emissions %} +### Environmental Impact +Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon). +- **Energy Consumed**: {{ "%.3f"|format(co2_eq_emissions["energy_consumed"]) }} kWh +- **Carbon Emitted**: {{ "%.3f"|format(co2_eq_emissions["emissions"] / 1000) }} kg of CO2 +- **Hours Used**: {{ co2_eq_emissions["hours_used"] }} hours + +### Training Hardware +- **On Cloud**: {{ "Yes" if co2_eq_emissions["on_cloud"] else "No" }} +- **GPU Model**: {{ co2_eq_emissions["hardware_used"] or "No GPU used" }} +- **CPU Model**: {{ co2_eq_emissions["cpu_model"] }} +- **RAM Size**: {{ "%.2f"|format(co2_eq_emissions["ram_total_size"]) }} GB +{% endif %} +### Framework Versions +- Python: {{ version["python"] }} +- Sentence Transformers: {{ version["sentence_transformers"] }} +- Transformers: {{ version["transformers"] }} +- PyTorch: {{ version["torch"] }} +- Accelerate: {{ version["accelerate"] }} +- Datasets: {{ version["datasets"] }} +- Tokenizers: {{ version["tokenizers"] }} + +## Citation + +### BibTeX +{% for loss_name, citation in citations.items() %} +#### {{ loss_name }} +```bibtex +{{ citation | trim }} +``` +{% endfor %} + + + + + \ No newline at end of file diff --git a/sentence_transformers/models/BoW.py b/sentence_transformers/models/BoW.py index 02fee875e..c9f7aef06 100644 --- a/sentence_transformers/models/BoW.py +++ b/sentence_transformers/models/BoW.py @@ -5,7 +5,6 @@ import os import json import logging -import numpy as np from .tokenizer import WhitespaceTokenizer @@ -70,7 +69,7 @@ def get_sentence_features(self, tokenized_texts: List[List[int]], pad_seq_length vectors = [] for tokens in tokenized_texts: - vector = np.zeros(self.get_sentence_embedding_dimension(), dtype=np.float32) + vector = torch.zeros(self.get_sentence_embedding_dimension(), dtype=torch.float32) for token in tokens: if self.cumulative_term_frequency: vector[token] += self.weights[token] @@ -78,7 +77,7 @@ def get_sentence_features(self, tokenized_texts: List[List[int]], pad_seq_length vector[token] = self.weights[token] vectors.append(vector) - return {"sentence_embedding": torch.tensor(vectors, dtype=torch.float)} + return {"sentence_embedding": torch.stack(vectors)} def get_config_dict(self): return {key: self.__dict__[key] for key in self.config_keys} diff --git a/sentence_transformers/models/CLIPModel.py b/sentence_transformers/models/CLIPModel.py index b4ab32e37..dea1c12a0 100644 --- a/sentence_transformers/models/CLIPModel.py +++ b/sentence_transformers/models/CLIPModel.py @@ -74,6 +74,10 @@ def tokenize(self, texts, padding: Union[str, bool] = True): encoding["image_text_info"] = image_text_info return encoding + @property + def tokenizer(self): + return self.processor + def save(self, output_path: str): self.model.save_pretrained(output_path) self.processor.save_pretrained(output_path) diff --git a/sentence_transformers/models/Transformer.py b/sentence_transformers/models/Transformer.py index 25727ab1e..ae9dbfbb9 100644 --- a/sentence_transformers/models/Transformer.py +++ b/sentence_transformers/models/Transformer.py @@ -35,6 +35,8 @@ def __init__( config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir) self._load_model(model_name_or_path, config, cache_dir, **model_args) + if max_seq_length is not None and "model_max_length" not in tokenizer_args: + tokenizer_args["model_max_length"] = max_seq_length self.tokenizer = AutoTokenizer.from_pretrained( tokenizer_name_or_path if tokenizer_name_or_path 
is not None else model_name_or_path,
+ cache_dir=cache_dir,
diff --git a/sentence_transformers/sampler.py b/sentence_transformers/sampler.py
new file mode 100644
index 000000000..e717d21ec
--- /dev/null
+++ b/sentence_transformers/sampler.py
@@ -0,0 +1,210 @@
+from collections import defaultdict
+from itertools import accumulate, cycle
+from typing import List
+import logging
+
+from datasets import Dataset
+from torch.utils.data import BatchSampler, SubsetRandomSampler, ConcatDataset
+import torch
+
+logger = logging.getLogger(__name__)
+
+
+class SetEpochMixin:
+ """
+ Required for a BatchSampler as the Trainer will call set_epoch on the BatchSampler at the beginning of each epoch.
+ The BatchSampler can then set the generator seed accordingly.
+ """
+
+ def __init__(self, *args, **kwargs) -> None:
+ super().__init__(*args, **kwargs)
+
+ self.epoch = 0
+
+ def set_epoch(self, epoch: int):
+ self.epoch = epoch
+
+
+class DefaultBatchSampler(SetEpochMixin, BatchSampler):
+ pass
+
+
+class GroupByLabelBatchSampler(SetEpochMixin, BatchSampler):
+ def __init__(
+ self,
+ dataset: Dataset,
+ batch_size: int,
+ drop_last: bool,
+ valid_label_columns: List[str] = None,
+ generator: torch.Generator = None,
+ seed: int = 0,
+ ):
+ super().__init__(dataset, batch_size, drop_last)
+ self.dataset = dataset
+ self.batch_size = batch_size
+ self.drop_last = drop_last
+ self.generator = generator
+ self.seed = seed
+
+ if self.batch_size % 2 == 1:
+ raise ValueError("The batch size for `GroupByLabelBatchSampler` must be divisible by 2.")
+
+ for column_name in valid_label_columns or []:
+ if column_name in dataset.column_names:
+ labels = dataset[column_name]
+ break
+ else:
+ raise ValueError(f"None of the valid_label_columns {valid_label_columns} are in the dataset.")
+
+ del dataset
+ groups = defaultdict(list)
+ for sample_idx, label in enumerate(labels):
+ groups[label].append(sample_idx)
+
+ self.groups = {
+ label: sample_indices[:num_samples]
+ for label, sample_indices in groups.items()
+ if (num_samples := len(sample_indices) // 2)
+ }
+
+ def __iter__(self):
+ if self.generator and self.seed:
+ self.generator.manual_seed(self.seed + self.epoch)
+
+ labels = list(self.groups.keys())
+ partial_batch = []
+ for label_idx in torch.randperm(len(self.groups), generator=self.generator):
+ label = labels[label_idx]
+ samples = self.groups[label]
+ partial_batch.extend(samples)
+ while len(partial_batch) >= self.batch_size:
+ yield partial_batch[: self.batch_size]
+ partial_batch = partial_batch[self.batch_size :]
+
+ if not self.drop_last and partial_batch:
+ yield partial_batch
+
+
+class NoDuplicatesBatchSampler(SetEpochMixin, BatchSampler):
+ def __init__(
+ self,
+ dataset: Dataset,
+ batch_size: int,
+ drop_last: bool,
+ valid_label_columns: List[str] = [],
+ generator: torch.Generator = None,
+ seed: int = 0,
+ ):
+ super().__init__(dataset, batch_size, drop_last)
+ if label_columns := set(dataset.column_names) & (set(valid_label_columns) | {"dataset_name"}):
+ dataset = dataset.remove_columns(label_columns)
+ self.dataset = dataset
+ self.batch_size = batch_size
+ self.drop_last = drop_last
+ self.generator = generator
+ self.seed = seed
+
+ def __iter__(self):
+ """
+ Iterate over the remaining non-yielded indices. For each index, check if the sample values are already in the
+ batch. If not, add the sample values to the batch and keep going until the batch is full. If the batch is full,
+ yield the batch indices and continue with the next batch.
+ """ + if self.generator and self.seed: + self.generator.manual_seed(self.seed + self.epoch) + + remaining_indices = set(torch.randperm(len(self.dataset), generator=self.generator).tolist()) + while remaining_indices: + batch_values = set() + batch_indices = [] + for index in remaining_indices: + sample_values = set(self.dataset[index].values()) + if sample_values & batch_values: + continue + + batch_indices.append(index) + if len(batch_indices) == self.batch_size: + yield batch_indices + break + + batch_values.update(sample_values) + + else: + # NOTE: some indices might still have been ignored here + if not self.drop_last: + yield batch_indices + + remaining_indices -= set(batch_indices) + + def __len__(self) -> int: + if self.drop_last: + return len(self.dataset) // self.batch_size + else: + return (len(self.dataset) + self.batch_size - 1) // self.batch_size + + +class RoundRobinBatchSampler(SetEpochMixin, BatchSampler): + def __init__( + self, + dataset: ConcatDataset, + batch_samplers: List[BatchSampler], + generator: torch.Generator, + seed: int, + ): + super().__init__(dataset, batch_samplers[0].batch_size, batch_samplers[0].drop_last) + self.dataset = dataset + self.batch_samplers = batch_samplers + self.generator = generator + self.seed = seed + + def __iter__(self): + self.generator.manual_seed(self.seed + self.epoch) + + num_samples = [len(dataset) for dataset in self.dataset.datasets] + sample_offsets = [0] + list(accumulate(num_samples)) + + batch_samplers = [iter(sampler) for sampler in self.batch_samplers] + for dataset_idx in cycle(range(len(batch_samplers))): + sample_offset = sample_offsets[dataset_idx] + try: + yield [idx + sample_offset for idx in next(batch_samplers[dataset_idx])] + + except StopIteration: + # current iterator is apparently exhausted + break + + def __len__(self) -> int: + return min([len(sampler) for sampler in self.batch_samplers]) * len(self.batch_samplers) + + +class ProportionalBatchSampler(SetEpochMixin, BatchSampler): + def __init__( + self, + dataset: ConcatDataset, + batch_samplers: List[BatchSampler], + generator: torch.Generator, + seed: int, + ): + super().__init__(dataset, batch_samplers[0].batch_size, batch_samplers[0].drop_last) + self.dataset = dataset + self.batch_samplers = batch_samplers + self.generator = generator + self.seed = seed + + def __iter__(self): + self.generator.manual_seed(self.seed + self.epoch) + + num_samples = [len(dataset) for dataset in self.dataset.datasets] + sample_offsets = [0] + list(accumulate(num_samples)) + + num_batches = [len(sampler) for sampler in self.batch_samplers] + dataset_indices = [idx for idx, length in enumerate(num_batches) for _ in range(length)] + dataset_idx_sampler = SubsetRandomSampler(dataset_indices, generator=self.generator) + + batch_samplers = [iter(sampler) for sampler in self.batch_samplers] + for dataset_idx in dataset_idx_sampler: + sample_offset = sample_offsets[dataset_idx] + yield [idx + sample_offset for idx in next(batch_samplers[dataset_idx])] + + def __len__(self) -> int: + return sum([len(sampler) for sampler in self.batch_samplers]) diff --git a/sentence_transformers/trainer.py b/sentence_transformers/trainer.py new file mode 100644 index 000000000..8eb4877ba --- /dev/null +++ b/sentence_transformers/trainer.py @@ -0,0 +1,553 @@ +from contextlib import nullcontext +import logging +import os +from typing import Any, Callable, Dict, List, Optional, Tuple, Union, TYPE_CHECKING + +import torch +from torch import nn +from torch.utils.data import DataLoader, ConcatDataset, 
Dataset, BatchSampler, SubsetRandomSampler +from transformers import PreTrainedTokenizerBase, Trainer, EvalPrediction, TrainerCallback +from transformers.integrations import WandbCallback +from transformers.trainer import TRAINING_ARGS_NAME +from transformers.training_args import ParallelMode + +from datasets import DatasetDict +from transformers.trainer_utils import EvalLoopOutput +from transformers.data.data_collator import DataCollator +from sentence_transformers.losses import CoSENTLoss + +from sentence_transformers.models.Transformer import Transformer +from sentence_transformers.training_args import ( + SentenceTransformerTrainingArguments, + BatchSamplers, + MultiDatasetBatchSamplers, +) +from sentence_transformers.data_collator import SentenceTransformerDataCollator +from sentence_transformers.evaluation import SentenceEvaluator +from sentence_transformers.sampler import ( + DefaultBatchSampler, + GroupByLabelBatchSampler, + NoDuplicatesBatchSampler, + ProportionalBatchSampler, + RoundRobinBatchSampler, +) +from sentence_transformers.util import disable_logging + +from sentence_transformers.model_card import ModelCardCallback + +logger = logging.getLogger(__name__) + +if TYPE_CHECKING: + from sentence_transformers.SentenceTransformer import SentenceTransformer + + +class SentenceTransformerTrainer(Trainer): + def __init__( + self, + model: Optional["SentenceTransformer"] = None, + args: SentenceTransformerTrainingArguments = None, + train_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + loss: Optional[Union[Dict[str, nn.Module], nn.Module]] = None, + evaluator: Optional[SentenceEvaluator] = None, + data_collator: Optional[DataCollator] = None, + tokenizer: Optional[Union[PreTrainedTokenizerBase, Callable]] = None, + model_init: Optional[Callable[[], "SentenceTransformer"]] = None, + compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, + callbacks: Optional[List[TrainerCallback]] = None, + optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), + preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None, + ) -> None: + if args is None: + output_dir = "tmp_trainer" + logger.info(f"No `TrainingArguments` passed, using `output_dir={output_dir}`.") + args = SentenceTransformerTrainingArguments(output_dir=output_dir) + elif not isinstance(args, SentenceTransformerTrainingArguments): + raise ValueError("Please use `TrainingArguments` imported from `sentence_transformers`.") + + # Get a dictionary of the default training arguments, so we can determine which arguments have been changed + # for the model card + default_args_dict = SentenceTransformerTrainingArguments(output_dir="unused").to_dict() + + # If the model ID is set via the SentenceTransformerTrainingArguments, but not via the SentenceTransformerModelCardData, + # then we can set it here for the model card regardless + if args.hub_model_id and not model.model_card_data.model_id: + model.model_card_data.set_model_id(args.hub_model_id) + + if tokenizer is None and isinstance(model.tokenizer, PreTrainedTokenizerBase): + tokenizer = model.tokenizer + + if data_collator is None: + data_collator = SentenceTransformerDataCollator(tokenize_fn=model.tokenize) + super().__init__( + model=model, + args=args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + tokenizer=tokenizer, + model_init=model_init, + 
compute_metrics=compute_metrics, + callbacks=callbacks, + optimizers=optimizers, + preprocess_logits_for_metrics=preprocess_logits_for_metrics, + ) + # Set the W&B project via environment variables if it's not already set + if any([isinstance(callback, WandbCallback) for callback in self.callback_handler.callbacks]): + os.environ.setdefault("WANDB_PROJECT", "sentence-transformers") + + if loss is None: + logger.info("No `loss` passed, using `losses.CoSENTLoss` as a default option.") + loss = CoSENTLoss(self.model) + + self.loss = loss + if isinstance(loss, dict): + self.loss = {dataset_name: loss_fn.to(self.model.device) for dataset_name, loss_fn in loss.items()} + for dataset_name, dataset in zip(["train", "eval"], [train_dataset, eval_dataset]): + if dataset is None: + continue + if not isinstance(dataset, dict): + raise ValueError( + f"If the provided `loss` is a dict, then the `{dataset_name}_dataset` must be a `DatasetDict`." + ) + if missing := set(dataset.keys()) - set(loss.keys()): + raise ValueError( + f"If the provided `loss` is a dict, then all keys from the `{dataset_name}_dataset` dictionary must occur in `loss` also. " + f"Currently, {sorted(missing)} occur{'s' if len(missing) == 1 else ''} in `{dataset_name}_dataset` but not in `loss`." + ) + else: + self.loss.to(self.model.device) + self.evaluator = evaluator + + # Add a callback responsible for automatically tracking data required for the automatic model card generation + model_card_callback = ModelCardCallback(self, default_args_dict) + self.add_callback(model_card_callback) + model_card_callback.on_init_end(self.args, self.state, self.control, self.model) + + def add_dataset_name_column(self, dataset_dict: DatasetDict) -> DatasetDict: + for key, dataset in dataset_dict.items(): + if "dataset_name" not in dataset.column_names: + dataset_dict[key] = dataset.add_column("dataset_name", [key] * len(dataset)) + return dataset_dict + + def compute_loss( + self, + model: "SentenceTransformer", + inputs: Dict[str, Union[torch.Tensor, Any]], + return_outputs: bool = False, + ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]: + dataset_name = inputs.pop("dataset_name", None) + features, labels = self.collect_features(inputs) + loss_fn = self.loss + + if isinstance(loss_fn, dict) and dataset_name: + loss_fn = loss_fn[dataset_name] + + # Hackishly insert the distributed model into the loss function, if the loss stores the model + # Only called once per process + if ( + self.args.parallel_mode != ParallelMode.NOT_PARALLEL + and hasattr(model, "module") + and getattr(loss_fn, "model", None) == model.module + ): + loss_fn.model = model + loss = loss_fn(features, labels) + if return_outputs: + output = torch.cat([model(row)["sentence_embedding"][:, None] for row in features], dim=1) + return loss, output + return loss + + def collect_features( + self, inputs: Dict[str, Union[torch.Tensor, Any]] + ) -> Tuple[List[Dict[str, torch.Tensor]], Optional[torch.Tensor]]: + """Turn the inputs from the dataloader into the separate model inputs & the labels. 
+ + Example:: + + >>> list(inputs.keys()) + ['return_loss', 'label', 'sentence_0_input_ids', 'sentence_0_token_type_ids', 'sentence_0_attention_mask', 'sentence_1_input_ids', 'sentence_1_token_type_ids', 'sentence_1_attention_mask'] + >>> features, labels = self.collect_features(inputs) + >>> len(features) + 2 + >>> list(features[0].keys()) + ['input_ids', 'token_type_ids', 'attention_mask'] + >>> list(features[1].keys()) + ['input_ids', 'token_type_ids', 'attention_mask'] + >>> torch.equal(labels, inputs["label"]) + True + """ + # All inputs ending with `_input_ids` (Transformers), `_sentence_embedding` (BoW), `_pixel_values` (CLIPModel) + # are considered to correspond to a feature + features = [] + for column in inputs: + if column.endswith("_input_ids"): + prefix = column[: -len("input_ids")] + elif column.endswith("_sentence_embedding"): + prefix = column[: -len("sentence_embedding")] + elif column.endswith("_pixel_values"): + prefix = column[: -len("pixel_values")] + else: + continue + features.append({key[len(prefix) :]: value for key, value in inputs.items() if key.startswith(prefix)}) + labels = inputs.get("label", None) + return features, labels + + def evaluate( + self, + eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> Dict[str, float]: + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + if isinstance(eval_dataset, DatasetDict): + eval_dataset = self.add_dataset_name_column(eval_dataset) + return super().evaluate(eval_dataset, ignore_keys, metric_key_prefix) + + def evaluation_loop( + self, + dataloader: DataLoader, + description: str, + prediction_loss_only: Optional[bool] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> EvalLoopOutput: + output = super().evaluation_loop( + dataloader=dataloader, + description=description, + prediction_loss_only=prediction_loss_only, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + # If the evaluator is not defined, we can just return the output + if self.evaluator is None: + return output + + # If we are training and eval_dataset is a DatasetDict, then we should + # 1) only run the evaluator for the first dataset + # 2) prefix that only run as "eval", rather than e.g. 
"eval_multi_nli" + if self.is_in_train and isinstance(self.eval_dataset, dict) and metric_key_prefix.startswith("eval_"): + if metric_key_prefix[5:] == list(self.eval_dataset.keys())[0]: + metric_key_prefix = "eval" + else: + return output + + with nullcontext() if self.is_local_process_zero() else disable_logging(logging.INFO): + evaluator_metrics = self.evaluator(self.model) + if not isinstance(evaluator_metrics, dict): + evaluator_metrics = {"evaluator": evaluator_metrics} + + # Prefix all keys with metric_key_prefix + '_' + for key in list(evaluator_metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + evaluator_metrics[f"{metric_key_prefix}_{key}"] = evaluator_metrics.pop(key) + + output.metrics.update(evaluator_metrics) + + return output + + def _load_best_model(self) -> None: + # We want to ensure that this does not fail, and it may change if transformers updates how checkpoints are saved + # Loading the best model is only supported for `transformers`-based models + if not isinstance(self.model[0], Transformer): + logger.info("Could not load best model, as the model is not a `transformers`-based model.") + return + + try: + if checkpoint := self.state.best_model_checkpoint: + step = checkpoint.rsplit("-", 1)[-1] + self.model.model_card_data.set_best_model_step(int(step)) + except Exception: + pass + + # Override the model with the `tranformers`-based auto_model, and restore the original SentenceTransformers + # model with the loaded `transformers` model + full_model = self.model + self.model = self.model[0].auto_model + try: + return super()._load_best_model() + finally: + loaded_auto_model = self.model + self.model = full_model + self.model[0].auto_model = loaded_auto_model + + def validate_column_names(self, dataset: Dataset, dataset_name: Optional[str] = None) -> bool: + if overlap := set(dataset.column_names) & {"return_loss", "dataset_name"}: + raise ValueError( + f"The following column names are invalid in your {dataset_name + ' ' if dataset_name else ''}dataset: {list(overlap)}." + " Avoid using these column names, as they are reserved for internal use." 
+ ) + + def get_batch_sampler( + self, + dataset: Dataset, + batch_size: int, + drop_last: bool, + valid_label_columns: Optional[List[str]] = None, + generator: Optional[torch.Generator] = None, + ) -> BatchSampler: + if self.args.batch_sampler == BatchSamplers.NO_DUPLICATES: + return NoDuplicatesBatchSampler( + dataset=dataset, + batch_size=batch_size, + drop_last=drop_last, + valid_label_columns=valid_label_columns, + generator=generator, + ) + + if self.args.batch_sampler == BatchSamplers.GROUP_BY_LABEL: + return GroupByLabelBatchSampler( + dataset=dataset, + batch_size=batch_size, + drop_last=drop_last, + valid_label_columns=valid_label_columns, + ) + + if self.args.batch_sampler == BatchSamplers.BATCH_SAMPLER: + return DefaultBatchSampler( + SubsetRandomSampler(range(len(dataset)), generator=generator), + batch_size=batch_size, + drop_last=drop_last, + ) + + def get_multi_dataset_batch_sampler( + self, + dataset: ConcatDataset, + batch_samplers: List[BatchSampler], + generator: Optional[torch.Generator] = None, + seed: Optional[int] = 0, + ) -> BatchSampler: + if self.args.multi_dataset_batch_sampler == MultiDatasetBatchSamplers.ROUND_ROBIN: + return RoundRobinBatchSampler( + dataset=dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=seed, + ) + + if self.args.multi_dataset_batch_sampler == MultiDatasetBatchSamplers.PROPORTIONAL: + return ProportionalBatchSampler( + dataset=dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=seed, + ) + + def get_train_dataloader(self) -> DataLoader: + """ + Returns the training [`~torch.utils.data.DataLoader`]. + + Will use no sampler if `train_dataset` does not implement `__len__`, a random sampler (adapted to distributed + training if necessary) otherwise. + + Subclass and override this method if you want to inject some custom behavior. 
+ """ + if self.train_dataset is None: + raise ValueError("Trainer: training requires a train_dataset.") + + train_dataset = self.train_dataset + data_collator = self.data_collator + + generator = torch.Generator() + if self.args.seed: + generator.manual_seed(self.args.seed) + + if isinstance(train_dataset, DatasetDict): + for dataset_name, dataset in train_dataset.items(): + self.validate_column_names(dataset, dataset_name=dataset_name) + train_dataset = self.add_dataset_name_column(train_dataset) + batch_samplers = [ + self.get_batch_sampler( + dataset, + batch_size=self.args.per_device_train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + for dataset in train_dataset.values() + ] + + train_dataset = ConcatDataset(train_dataset.values()) + batch_sampler = self.get_multi_dataset_batch_sampler( + dataset=train_dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=self.args.seed, + ) + + else: + self.validate_column_names(train_dataset) + + batch_sampler = self.get_batch_sampler( + train_dataset, + batch_size=self.args.train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + + dataloader_params = { + "collate_fn": data_collator, + "num_workers": self.args.dataloader_num_workers, + "pin_memory": self.args.dataloader_pin_memory, + "persistent_workers": self.args.dataloader_persistent_workers, + "prefetch_factor": self.args.dataloader_prefetch_factor, + "batch_sampler": batch_sampler, + } + + # If 'even_batches' is True, it will use the initial few samples to pad out the last sample. This can + # cause issues with multi-dataset training, so we want to set this to False. + # For evaluation, setting 'even_batches' to False results in hanging, so we keep it as True there. + self.accelerator.even_batches = False + self._train_dataloader = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params)) + return self._train_dataloader + + def get_eval_dataloader(self, eval_dataset: Union[Dataset, None] = None) -> DataLoader: + """ + Returns the evaluation [`~torch.utils.data.DataLoader`]. + + Subclass and override this method if you want to inject some custom behavior. + + Args: + eval_dataset (`torch.utils.data.Dataset`, *optional*): + If provided, will override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns not accepted + by the `model.forward()` method are automatically removed. It must implement `__len__`. 
+ """ + if eval_dataset is None and self.eval_dataset is None: + # Prevent errors if the evaluator is set but no eval_dataset is provided + if self.evaluator is not None: + return DataLoader([]) + raise ValueError("Trainer: evaluation requires an eval_dataset.") + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + data_collator = self.data_collator + + generator = torch.Generator() + if self.args.seed: + generator.manual_seed(self.args.seed) + + # TODO: Correctly validate the column names for the eval_dataset + if isinstance(eval_dataset, DatasetDict): + eval_dataset = self.add_dataset_name_column(eval_dataset) + eval_dataset = self.add_dataset_name_column(eval_dataset) + batch_samplers = [ + self.get_batch_sampler( + dataset, + batch_size=self.args.per_device_eval_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + for dataset in eval_dataset.values() + ] + + eval_dataset = ConcatDataset(eval_dataset.values()) + batch_sampler = self.get_multi_dataset_batch_sampler( + dataset=eval_dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=self.args.seed, + ) + else: + batch_sampler = self.get_batch_sampler( + eval_dataset, + batch_size=self.args.train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + + dataloader_params = { + "collate_fn": data_collator, + "num_workers": self.args.dataloader_num_workers, + "pin_memory": self.args.dataloader_pin_memory, + "persistent_workers": self.args.dataloader_persistent_workers, + "prefetch_factor": self.args.dataloader_prefetch_factor, + "batch_sampler": batch_sampler, + } + + # If 'even_batches' is True, it will use the initial few samples to pad out the last sample. This can + # cause issues with multi-dataset training, so we want to set this to False during training. + # For evaluation, setting 'even_batches' to False results in hanging, so we keep it as True here. + self.accelerator.even_batches = True + return self.accelerator.prepare(DataLoader(eval_dataset, **dataloader_params)) + + def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader: + """ + Returns the training [`~torch.utils.data.DataLoader`]. + + Subclass and override this method if you want to inject some custom behavior. + + Args: + test_dataset (`torch.utils.data.Dataset`, *optional*): + The test dataset to use. If it is a [`~datasets.Dataset`], columns not accepted by the + `model.forward()` method are automatically removed. It must implement `__len__`. 
+ """ + data_collator = self.data_collator + + generator = torch.Generator() + if self.args.seed: + generator.manual_seed(self.args.seed) + + if isinstance(test_dataset, DatasetDict): + for dataset_name, dataset in test_dataset.items(): + self.validate_column_names(dataset, dataset_name=dataset_name) + test_dataset = self.add_dataset_name_column(test_dataset) + batch_samplers = [ + self.get_batch_sampler( + dataset, + batch_size=self.args.per_device_train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + for dataset in test_dataset.values() + ] + + test_dataset = ConcatDataset(test_dataset.values()) + batch_sampler = self.get_multi_dataset_batch_sampler( + dataset=test_dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=self.args.seed, + ) + + else: + self.validate_column_names(test_dataset) + + batch_sampler = self.get_batch_sampler( + test_dataset, + batch_size=self.args.train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + + dataloader_params = { + "collate_fn": data_collator, + "num_workers": self.args.dataloader_num_workers, + "pin_memory": self.args.dataloader_pin_memory, + "persistent_workers": self.args.dataloader_persistent_workers, + "prefetch_factor": self.args.dataloader_prefetch_factor, + "batch_sampler": batch_sampler, + } + + # If 'even_batches' is True, it will use the initial few samples to pad out the last sample. This can + # cause issues with multi-dataset training, so we want to set this to False. + # For evaluation, setting 'even_batches' to False results in hanging, so we keep it as True there. + self.accelerator.even_batches = False + self._train_dataloader = self.accelerator.prepare(DataLoader(test_dataset, **dataloader_params)) + return self._train_dataloader + + def _save(self, output_dir: Optional[str] = None, state_dict=None): + # If we are executing this function, we are the process zero, so we don't check for that. + output_dir = output_dir if output_dir is not None else self.args.output_dir + os.makedirs(output_dir, exist_ok=True) + logger.info(f"Saving model checkpoint to {output_dir}") + + self.model.save(output_dir, safe_serialization=self.args.save_safetensors) + + if self.tokenizer is not None: + self.tokenizer.save_pretrained(output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME)) diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py new file mode 100644 index 000000000..bdd46a9aa --- /dev/null +++ b/sentence_transformers/training_args.py @@ -0,0 +1,39 @@ +from dataclasses import dataclass, field +from typing import Union +from transformers import TrainingArguments as TransformersTrainingArguments +from transformers.utils import ExplicitEnum + + +class BatchSamplers(ExplicitEnum): + """ + Stores the acceptable string identifiers for batch samplers. + """ + + BATCH_SAMPLER = "batch_sampler" # Just the default PyTorch batch sampler [default] + NO_DUPLICATES = "no_duplicates" # Ensures no duplicate samples in a batch + GROUP_BY_LABEL = "group_by_label" # Ensure each batch has 2+ samples from the same label + + +class MultiDatasetBatchSamplers(ExplicitEnum): + """ + Stores the acceptable string identifiers for multi-dataset batch samplers. 
+ """ + + ROUND_ROBIN = "round_robin" # Round-robin sampling from each dataset + PROPORTIONAL = "proportional" # Sample from each dataset in proportion to its size [default] + + +@dataclass +class SentenceTransformerTrainingArguments(TransformersTrainingArguments): + batch_sampler: Union[BatchSamplers, str] = field( + default=BatchSamplers.BATCH_SAMPLER, metadata={"help": "The batch sampler to use."} + ) + multi_dataset_batch_sampler: Union[MultiDatasetBatchSamplers, str] = field( + default=MultiDatasetBatchSamplers.PROPORTIONAL, metadata={"help": "The multi-dataset batch sampler to use."} + ) + + def __post_init__(self): + super().__post_init__() + + self.batch_sampler = BatchSamplers(self.batch_sampler) + self.multi_dataset_batch_sampler = MultiDatasetBatchSamplers(self.multi_dataset_batch_sampler) diff --git a/sentence_transformers/util.py b/sentence_transformers/util.py index 4cc0a2c8a..85bda801c 100644 --- a/sentence_transformers/util.py +++ b/sentence_transformers/util.py @@ -1,3 +1,4 @@ +from contextlib import contextmanager import functools import requests from torch import Tensor, device @@ -525,6 +526,25 @@ def __delattr__(self, attr: str) -> None: raise +@contextmanager +def disable_logging(highest_level=logging.CRITICAL): + """ + A context manager that will prevent any logging messages + triggered during the body from being processed. + + :param highest_level: the maximum logging level allowed. + """ + + previous_level = logging.root.manager.disable + + logging.disable(highest_level) + + try: + yield + finally: + logging.disable(previous_level) + + def is_sentence_transformer_model( model_name_or_path: str, token: Optional[Union[bool, str]] = None, diff --git a/setup.py b/setup.py index 637e1f192..a791e73e2 100644 --- a/setup.py +++ b/setup.py @@ -16,6 +16,7 @@ url="https://www.SBERT.net", download_url="https://github.com/UKPLab/sentence-transformers/", packages=find_packages(), + include_package_data=True, python_requires=">=3.8.0", install_requires=[ "transformers>=4.34.0,<5.0.0", @@ -26,6 +27,8 @@ "scipy", "huggingface-hub>=0.15.1", "Pillow", + "datasets", + "accelerate>=0.20.3", ], extras_require={ "dev": [ diff --git a/tests/conftest.py b/tests/conftest.py index f9db97866..05609b7a9 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -5,6 +5,7 @@ from sentence_transformers import SentenceTransformer, CrossEncoder from sentence_transformers.models import Transformer, Pooling +from datasets import load_dataset, DatasetDict @pytest.fixture() @@ -40,6 +41,11 @@ def distilbert_base_uncased_model() -> SentenceTransformer: return model +@pytest.fixture(scope="session") +def stsb_dataset_dict() -> DatasetDict: + return load_dataset("mteb/stsbenchmark-sts") + + @pytest.fixture() def cache_dir(): """ diff --git a/tests/test_evaluator.py b/tests/test_evaluator.py index e62f0dcf4..ee6eb3409 100644 --- a/tests/test_evaluator.py +++ b/tests/test_evaluator.py @@ -77,8 +77,9 @@ def test_LabelAccuracyEvaluator(paraphrase_distilroberta_base_v1_model: Sentence dev_dataloader = DataLoader(dev_samples, shuffle=False, batch_size=16) evaluator = evaluation.LabelAccuracyEvaluator(dev_dataloader, softmax_model=train_loss) - acc = evaluator(model) - assert acc > 0.2 + metrics = evaluator(model) + assert "accuracy" in metrics + assert metrics["accuracy"] > 0.2 def test_ParaphraseMiningEvaluator(paraphrase_distilroberta_base_v1_model: SentenceTransformer) -> None: @@ -91,5 +92,5 @@ def test_ParaphraseMiningEvaluator(paraphrase_distilroberta_base_v1_model: Sente 3: "On the table the cat is", } 
data_eval = evaluation.ParaphraseMiningEvaluator(sentences, [(0, 1), (2, 3)]) - score = data_eval(model) - assert score > 0.99 + metrics = data_eval(model) + assert metrics[data_eval.primary_metric] > 0.99 diff --git a/tests/test_model_card_data.py b/tests/test_model_card_data.py new file mode 100644 index 000000000..3c0a0f06a --- /dev/null +++ b/tests/test_model_card_data.py @@ -0,0 +1,24 @@ +from sentence_transformers import SentenceTransformer + +import pytest + + +@pytest.mark.parametrize( + ("revision", "expected_base_revision"), + [ + ("f3cb857cba53019a20df283396bcca179cf051a4", "f3cb857cba53019a20df283396bcca179cf051a4"), + ("f3cb857", "f3cb857"), + ("main", "valid-revision"), + (None, "valid-revision"), + ], +) +def test_model_card_data(revision, expected_base_revision) -> None: + model_name = "sentence-transformers-testing/stsb-bert-tiny-safetensors" + model = SentenceTransformer(model_name, revision=revision) + + assert model.model_card_data.base_model == model_name + if expected_base_revision == "valid-revision": + assert model.model_card_data.base_model_revision + assert len(model.model_card_data.base_model_revision) == 40 + else: + assert model.model_card_data.base_model_revision == expected_base_revision diff --git a/tests/test_pretrained_stsb.py b/tests/test_pretrained_stsb.py index 0616cbbe0..4a98a337d 100644 --- a/tests/test_pretrained_stsb.py +++ b/tests/test_pretrained_stsb.py @@ -37,7 +37,8 @@ def pretrained_model_score( evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") - score = model.evaluate(evaluator) * 100 + scores = model.evaluate(evaluator) + score = scores[evaluator.primary_metric] * 100 print(model_name, "{:.2f} vs. exp: {:.2f}".format(score, expected_score)) assert score > expected_score or abs(score - expected_score) < 0.1 diff --git a/tests/test_train_stsb.py b/tests/test_train_stsb.py index ca6c1d867..a71fe8f06 100644 --- a/tests/test_train_stsb.py +++ b/tests/test_train_stsb.py @@ -8,6 +8,7 @@ from typing import Generator, List, Tuple import pytest +import torch from torch.utils.data import DataLoader from sentence_transformers import ( @@ -63,7 +64,8 @@ def nli_resource() -> Generator[List[InputExample], None, None]: def evaluate_stsb_test(model, expected_score, test_samples) -> None: evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") - score = model.evaluate(evaluator) * 100 + scores = model.evaluate(evaluator) + score = scores[evaluator.primary_metric] * 100 print("STS-Test Performance: {:.2f} vs. 
exp: {:.2f}".format(score, expected_score)) assert score > expected_score or abs(score - expected_score) < 0.1 @@ -83,7 +85,7 @@ def test_train_stsb_slow( epochs=1, evaluation_steps=1000, warmup_steps=int(len(train_dataloader) * 0.1), - use_amp=True, + use_amp=torch.cuda.is_available(), ) evaluate_stsb_test(model, 80.0, sts_test_samples) @@ -104,7 +106,7 @@ def test_train_stsb( epochs=1, evaluation_steps=1000, warmup_steps=int(len(train_dataloader) * 0.1), - use_amp=True, + use_amp=torch.cuda.is_available(), ) evaluate_stsb_test(model, 60.0, sts_test_samples) @@ -130,7 +132,7 @@ def test_train_nli_slow( evaluator=None, epochs=1, warmup_steps=int(len(train_dataloader) * 0.1), - use_amp=True, + use_amp=torch.cuda.is_available(), ) evaluate_stsb_test(model, 50.0, sts_test_samples) @@ -156,7 +158,7 @@ def test_train_nli( evaluator=None, epochs=1, warmup_steps=int(len(train_dataloader) * 0.1), - use_amp=True, + use_amp=torch.cuda.is_available(), ) evaluate_stsb_test(model, 50.0, sts_test_samples) diff --git a/tests/test_trainer.py b/tests/test_trainer.py new file mode 100644 index 000000000..8d8c123af --- /dev/null +++ b/tests/test_trainer.py @@ -0,0 +1,127 @@ +from pathlib import Path +import re +import tempfile +import pytest +from sentence_transformers import SentenceTransformerTrainer, SentenceTransformer, losses +from datasets import DatasetDict + + +def test_trainer_multi_dataset_errors( + stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: DatasetDict +) -> None: + train_dataset = stsb_dataset_dict["train"] + loss = { + "multi_nli": losses.CosineSimilarityLoss(model=stsb_bert_tiny_model), + "snli": losses.CosineSimilarityLoss(model=stsb_bert_tiny_model), + "stsb": losses.CosineSimilarityLoss(model=stsb_bert_tiny_model), + } + with pytest.raises( + ValueError, match="If the provided `loss` is a dict, then the `train_dataset` must be a `DatasetDict`." + ): + SentenceTransformerTrainer(model=stsb_bert_tiny_model, train_dataset=train_dataset, loss=loss) + + train_dataset = DatasetDict( + { + "multi_nli": stsb_dataset_dict["train"], + "snli": stsb_dataset_dict["train"], + "stsb": stsb_dataset_dict["train"], + "stsb-extra": stsb_dataset_dict["train"], + } + ) + with pytest.raises( + ValueError, + match="If the provided `loss` is a dict, then all keys from the `train_dataset` dictionary must occur in `loss` also. " + "Currently, \['stsb-extra'\] occurs in `train_dataset` but not in `loss`.", + ): + SentenceTransformerTrainer(model=stsb_bert_tiny_model, train_dataset=train_dataset, loss=loss) + + train_dataset = DatasetDict( + { + "multi_nli": stsb_dataset_dict["train"], + "snli": stsb_dataset_dict["train"], + "stsb": stsb_dataset_dict["train"], + } + ) + with pytest.raises( + ValueError, match="If the provided `loss` is a dict, then the `eval_dataset` must be a `DatasetDict`." + ): + SentenceTransformerTrainer( + model=stsb_bert_tiny_model, + train_dataset=train_dataset, + eval_dataset=stsb_dataset_dict["validation"], + loss=loss, + ) + + eval_dataset = DatasetDict( + { + "multi_nli": stsb_dataset_dict["validation"], + "snli": stsb_dataset_dict["validation"], + "stsb": stsb_dataset_dict["validation"], + "stsb-extra-1": stsb_dataset_dict["validation"], + "stsb-extra-2": stsb_dataset_dict["validation"], + } + ) + with pytest.raises( + ValueError, + match="If the provided `loss` is a dict, then all keys from the `eval_dataset` dictionary must occur in `loss` also. 
" + "Currently, \['stsb-extra-1', 'stsb-extra-2'\] occur in `eval_dataset` but not in `loss`.", + ): + SentenceTransformerTrainer( + model=stsb_bert_tiny_model, train_dataset=train_dataset, eval_dataset=eval_dataset, loss=loss + ) + + +def test_trainer_invalid_column_names( + stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: DatasetDict +) -> None: + train_dataset = stsb_dataset_dict["train"] + for column_name in ("return_loss", "dataset_name"): + invalid_train_dataset = train_dataset.rename_column("sentence1", column_name) + trainer = SentenceTransformerTrainer(model=stsb_bert_tiny_model, train_dataset=invalid_train_dataset) + with pytest.raises( + ValueError, + match=re.escape( + f"The following column names are invalid in your dataset: ['{column_name}']." + " Avoid using these column names, as they are reserved for internal use." + ), + ): + trainer.train() + + invalid_train_dataset = DatasetDict( + { + "stsb": train_dataset.rename_column("sentence1", column_name), + "stsb-2": train_dataset, + } + ) + trainer = SentenceTransformerTrainer(model=stsb_bert_tiny_model, train_dataset=invalid_train_dataset) + with pytest.raises( + ValueError, + match=re.escape( + f"The following column names are invalid in your stsb dataset: ['{column_name}']." + " Avoid using these column names, as they are reserved for internal use." + ), + ): + trainer.train() + + +def test_model_card_reuse(stsb_bert_tiny_model: SentenceTransformer): + assert stsb_bert_tiny_model._model_card_text + # Reuse the model card if no training was done + with tempfile.TemporaryDirectory() as tmp_folder: + model_path = Path(tmp_folder) / "tiny_model_local" + stsb_bert_tiny_model.save(str(model_path)) + + with open(model_path / "README.md", "r") as f: + model_card_text = f.read() + assert model_card_text == stsb_bert_tiny_model._model_card_text + + # Create a new model card if a Trainer was initialized + SentenceTransformerTrainer(model=stsb_bert_tiny_model) + + with tempfile.TemporaryDirectory() as tmp_folder: + model_path = Path(tmp_folder) / "tiny_model_local" + stsb_bert_tiny_model.save(str(model_path)) + + with open(model_path / "README.md", "r") as f: + model_card_text = f.read() + assert model_card_text != stsb_bert_tiny_model._model_card_text From d2ac37d7115e38b126bc0174b0c01a792cdf6498 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Thu, 25 Apr 2024 15:52:23 +0200 Subject: [PATCH 02/39] [`v3`] Add `similarity` and `similarity_pairwise` methods to Sentence Transformers (#2615) * Add similarity function to model configuration * Add more tests * Replace util.cos_sim with model.similarity in some examples * Reintroduce evaluation.SimilarityFunction * Remove last references of score function in ST class * Add similarity_fn_name to model card * Add save_pretrained alias for save * Introduce DOT alias for DOT_PRODUCT --- examples/applications/image-search/README.md | 8 +- .../semantic-search/semantic_search.py | 6 +- .../text-summarization/text-summarization.py | 10 +- examples/training/adaptive_layer/README.md | 3 +- examples/training/matryoshka/README.md | 3 +- sentence_transformers/SentenceTransformer.py | 142 +++++++++++++++++- .../BinaryClassificationEvaluator.py | 58 +++---- .../EmbeddingSimilarityEvaluator.py | 9 +- .../InformationRetrievalEvaluator.py | 14 +- .../evaluation/SimilarityFunction.py | 9 +- .../evaluation/TripletEvaluator.py | 47 ++++-- sentence_transformers/model_card.py | 38 +++-- sentence_transformers/model_card_template.md | 6 + 
sentence_transformers/similarity_functions.py | 69 +++++++++ sentence_transformers/util.py | 127 ++++++++++------ tests/test_sentence_transformer.py | 47 ++++++ tests/test_util.py | 66 +++++++- 17 files changed, 513 insertions(+), 149 deletions(-) create mode 100644 sentence_transformers/similarity_functions.py diff --git a/examples/applications/image-search/README.md b/examples/applications/image-search/README.md index f995e409e..7d691d0c7 100644 --- a/examples/applications/image-search/README.md +++ b/examples/applications/image-search/README.md @@ -12,7 +12,7 @@ Ensure that you have [transformers](https://pypi.org/project/transformers/) inst SentenceTransformers provides a wrapper for the [OpenAI CLIP Model](https://github.com/openai/CLIP), which was trained on a variety of (image, text)-pairs. ```python -from sentence_transformers import SentenceTransformer, util +from sentence_transformers import SentenceTransformer from PIL import Image # Load CLIP model @@ -26,9 +26,9 @@ text_emb = model.encode( ["Two dogs in the snow", "A cat on a table", "A picture of London at night"] ) -# Compute cosine similarities -cos_scores = util.cos_sim(img_emb, text_emb) -print(cos_scores) +# Compute similarities +similarity_scores = model.similarity(img_emb, text_emb) +print(similarity_scores) ``` You can use the CLIP model for: diff --git a/examples/applications/semantic-search/semantic_search.py b/examples/applications/semantic-search/semantic_search.py index c8da195d5..5b0e3ad62 100644 --- a/examples/applications/semantic-search/semantic_search.py +++ b/examples/applications/semantic-search/semantic_search.py @@ -7,7 +7,7 @@ This script outputs for various queries the top 5 most similar sentences in the corpus. """ -from sentence_transformers import SentenceTransformer, util +from sentence_transformers import SentenceTransformer import torch embedder = SentenceTransformer("all-MiniLM-L6-v2") @@ -40,8 +40,8 @@ query_embedding = embedder.encode(query, convert_to_tensor=True) # We use cosine-similarity and torch.topk to find the highest 5 scores - cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0] - top_results = torch.topk(cos_scores, k=top_k) + similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0] + top_results = torch.topk(similarity_scores, k=top_k) print("\n\n======================\n\n") print("Query:", query) diff --git a/examples/applications/text-summarization/text-summarization.py b/examples/applications/text-summarization/text-summarization.py index a510debef..64dc0a4cd 100644 --- a/examples/applications/text-summarization/text-summarization.py +++ b/examples/applications/text-summarization/text-summarization.py @@ -19,7 +19,7 @@ """ import nltk -from sentence_transformers import SentenceTransformer, util +from sentence_transformers import SentenceTransformer import numpy as np from LexRank import degree_centrality_scores @@ -43,13 +43,13 @@ print("Num sentences:", len(sentences)) # Compute the sentence embeddings -embeddings = model.encode(sentences, convert_to_tensor=True) +embeddings = model.encode(sentences) -# Compute the pair-wise cosine similarities -cos_scores = util.cos_sim(embeddings, embeddings).numpy() +# Compute the similarity scores +similarity_scores = model.similarity(embeddings, embeddings).numpy() # Compute the centrality for each sentence -centrality_scores = degree_centrality_scores(cos_scores, threshold=None) +centrality_scores = degree_centrality_scores(similarity_scores, threshold=None) # We argsort so that the first element is the 
sentence with the highest score
 most_central_sentence_indices = np.argsort(-centrality_scores)
diff --git a/examples/training/adaptive_layer/README.md b/examples/training/adaptive_layer/README.md
index c3843bf56..8ab7dcf8b 100644
--- a/examples/training/adaptive_layer/README.md
+++ b/examples/training/adaptive_layer/README.md
@@ -120,7 +120,6 @@ Then we can run inference with it using
 tensor([[0.7761, 0.1655]])
 # compared to tensor([[ 0.7547, -0.0162]]) for the full model
 ```
diff --git a/examples/training/matryoshka/README.md b/examples/training/matryoshka/README.md
index 62fb2e623..6781bf53c 100644
--- a/examples/training/matryoshka/README.md
+++ b/examples/training/matryoshka/README.md
@@ -58,7 +58,6 @@ After a model has been trained using a Matryoshka loss, you can then run inferen
 ```python
 from sentence_transformers import SentenceTransformer
-from sentence_transformers.util import cos_sim
 import torch.nn.functional as F
 
 matryoshka_dim = 64
@@ -77,7 +76,7 @@ embeddings = model.encode(
 )
 assert embeddings.shape[-1] == matryoshka_dim
 
-similarities = cos_sim(embeddings[0], embeddings[1:])
+similarities = model.similarity(embeddings[0], embeddings[1:])
 # => tensor([[0.7839, 0.4933]])
 ```
 As you can see, the similarity between the search query and the correct document is much higher than that of an unrelated document, despite the very small matryoshka dimension applied. Feel free to copy this script locally, modify the `matryoshka_dim`, and observe the difference in similarities.
diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py
index bdbd03905..54a1bae6d 100644
--- a/sentence_transformers/SentenceTransformer.py
+++ b/sentence_transformers/SentenceTransformer.py
@@ -5,7 +5,7 @@
 from collections import OrderedDict
 from pathlib import Path
 import warnings
-from typing import List, Dict, Literal, Tuple, Iterable, Union, Optional
+from typing import Callable, List, Dict, Literal, Tuple, Iterable, Union, Optional, overload
 import numpy as np
 from numpy import ndarray
 import transformers
@@ -20,7 +20,7 @@
 import tempfile
 
 from sentence_transformers.model_card import SentenceTransformerModelCardData, generate_model_card
-
+from sentence_transformers.similarity_functions import SimilarityFunction
 from . import __MODEL_HUB_ORGANIZATION__
 from .evaluation import SentenceEvaluator
 
@@ -59,6 +59,9 @@ class SentenceTransformer(nn.Sequential, FitMixin):
         titles in "}`.
     :param default_prompt_name: The name of the prompt that should be used by default. If not set,
         no prompt will be applied.
+    :param similarity_fn_name: The name of the similarity function to use. Valid options are "cosine", "dot",
+        "euclidean", and "manhattan". If not set, it is automatically set to "cosine" if `similarity` or
+        `similarity_pairwise` are called while `model.similarity_fn_name` is still `None`.
     :param cache_folder: Path to store models. Can also be set by the SENTENCE_TRANSFORMERS_HOME environment variable.
     :param trust_remote_code: Whether or not to allow for custom models defined on the Hub in their own modeling files.
         This option should only be set to True for repositories you trust and in which you have read the code, as it
This option should only be set to True for repositories you trust and in which you have read the code, as it @@ -78,6 +81,7 @@ def __init__( device: Optional[str] = None, prompts: Optional[Dict[str, str]] = None, default_prompt_name: Optional[str] = None, + similarity_fn_name: Optional[Union[str, SimilarityFunction]] = None, cache_folder: Optional[str] = None, trust_remote_code: bool = False, revision: Optional[str] = None, @@ -90,6 +94,7 @@ def __init__( # Note: self._load_sbert_model can also update `self.prompts` and `self.default_prompt_name` self.prompts = prompts or {} self.default_prompt_name = default_prompt_name + self.similarity_fn_name = similarity_fn_name self.truncate_dim = truncate_dim self.model_card_data = model_card_data or SentenceTransformerModelCardData() self._model_card_vars = {} @@ -436,6 +441,105 @@ def encode( return all_embeddings + @property + def similarity_fn_name(self) -> Optional[str]: + return self._similarity_fn_name + + @similarity_fn_name.setter + def similarity_fn_name(self, value: Union[str, SimilarityFunction]) -> None: + if isinstance(value, SimilarityFunction): + value = value.value + self._similarity_fn_name = value + + if value is not None: + self._similarity = SimilarityFunction.to_similarity_fn(value) + self._similarity_pairwise = SimilarityFunction.to_similarity_pairwise_fn(value) + + @overload + def similarity(self, embeddings1: Tensor, embeddings2: Tensor) -> Tensor: ... + + @overload + def similarity(self, embeddings1: ndarray, embeddings2: ndarray) -> Tensor: ... + + @property + def similarity(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]: + """ + Compute the similarity between two collections of embeddings. The output will be a matrix with the similarity + scores between all embeddings from the first parameter and all embeddings from the second parameter. This + differs from `similarity_pairwise` which computes the similarity between each pair of embeddings. + + Example + :: + + >>> model = SentenceTransformer("all-mpnet-base-v2") + >>> sentences = [ + ... "The weather is so nice!", + ... "It's so sunny outside.", + ... "He's driving to the movie theater.", + ... "She's going to the cinema.", + ... ] + >>> embeddings = model.encode(sentences, normalize_embeddings=True) + >>> model.similarity(embeddings, embeddings) + tensor([[1.0000, 0.7235, 0.0290, 0.1309], + [0.7235, 1.0000, 0.0613, 0.1129], + [0.0290, 0.0613, 1.0000, 0.5027], + [0.1309, 0.1129, 0.5027, 1.0000]]) + >>> model.similarity_fn_name + "cosine" + >>> model.similarity_fn_name = "euclidean" + >>> model.similarity(embeddings, embeddings) + tensor([[-0.0000, -0.7437, -1.3935, -1.3184], + [-0.7437, -0.0000, -1.3702, -1.3320], + [-1.3935, -1.3702, -0.0000, -0.9973], + [-1.3184, -1.3320, -0.9973, -0.0000]]) + + :param embeddings1: [num_embeddings_1, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + :param embeddings2: [num_embeddings_2, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + :return: A [num_embeddings_1, num_embeddings_2]-shaped torch tensor with similarity scores. + """ + if self.similarity_fn_name is None: + self.similarity_fn_name = SimilarityFunction.COSINE + return self._similarity + + @overload + def similarity_pairwise(self, embeddings1: Tensor, embeddings2: Tensor) -> Tensor: ... + + @overload + def similarity_pairwise(self, embeddings1: ndarray, embeddings2: ndarray) -> Tensor: ... 
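+    # The @overload stubs above only document the call signatures for type checkers; the
+    # actual callable is produced by the `similarity_pairwise` property below, which resolves
+    # to the concrete function selected via `similarity_fn_name`.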
+
+    @property
+    def similarity_pairwise(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]:
+        """
+        Compute the similarity between two collections of embeddings. The output will be a vector with the similarity
+        scores between each pair of embeddings.
+
+        Example
+            ::
+
+                >>> model = SentenceTransformer("all-mpnet-base-v2")
+                >>> sentences = [
+                ...     "The weather is so nice!",
+                ...     "It's so sunny outside.",
+                ...     "He's driving to the movie theater.",
+                ...     "She's going to the cinema.",
+                ... ]
+                >>> embeddings = model.encode(sentences, normalize_embeddings=True)
+                >>> model.similarity_pairwise(embeddings[::2], embeddings[1::2])
+                tensor([0.7235, 0.5027])
+                >>> model.similarity_fn_name
+                "cosine"
+                >>> model.similarity_fn_name = "euclidean"
+                >>> model.similarity_pairwise(embeddings[::2], embeddings[1::2])
+                tensor([-0.7437, -0.9973])
+
+        :param embeddings1: [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor.
+        :param embeddings2: [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor.
+        :return: A [num_embeddings]-shaped torch tensor with pairwise similarity scores.
+        """
+        if self.similarity_fn_name is None:
+            self.similarity_fn_name = SimilarityFunction.COSINE
+        return self._similarity_pairwise
+
     def start_multi_process_pool(self, target_devices: List[str] = None):
         """
         Starts multi process to process the encoding with several, independent processes.
@@ -672,7 +776,8 @@ def save(
         safe_serialization: bool = True,
     ):
         """
-        Saves all elements for this seq. sentence embedder into different sub-folders
+        Saves a model and its configuration files to a directory, so that it can be loaded
+        with `SentenceTransformer(path)` again.
 
         :param path: Path on disc
         :param model_name: Optional model name
@@ -700,6 +805,7 @@ def save(
             config = self._model_config.copy()
             config["prompts"] = self.prompts
             config["default_prompt_name"] = self.default_prompt_name
+            config["similarity_fn_name"] = self.similarity_fn_name
             json.dump(config, fOut, indent=2)
 
         # Save modules
@@ -727,6 +833,32 @@ def save(
         if create_model_card:
             self._create_model_card(path, model_name, train_datasets)
 
+    def save_pretrained(
+        self,
+        path: str,
+        model_name: Optional[str] = None,
+        create_model_card: bool = True,
+        train_datasets: Optional[List[str]] = None,
+        safe_serialization: bool = True,
+    ):
+        """
+        Saves a model and its configuration files to a directory, so that it can be loaded
+        with `SentenceTransformer(path)` again. Alias of `SentenceTransformer.save`.
+
+        :param path: Path on disc
+        :param model_name: Optional model name
+        :param create_model_card: If True, create a README.md with basic information about this model
+        :param train_datasets: Optional list with the names of the datasets used to train the model
+        :param safe_serialization: If true, save the model using safetensors.
If false, save the model in the traditional PyTorch way
+        """
+        self.save(
+            path,
+            model_name=model_name,
+            create_model_card=create_model_card,
+            train_datasets=train_datasets,
+            safe_serialization=safe_serialization,
+        )
+
     def _create_model_card(
         self, path: str, model_name: Optional[str] = None, train_datasets: Optional[List[str]] = "deprecated"
     ):
@@ -982,7 +1114,9 @@ def _load_sbert_model(
                 )
             )
 
-        # Set prompts if not already overridden by the __init__ calls
+        # Set score functions & prompts if not already overridden by the __init__ calls
+        if self.similarity_fn_name is None:
+            self.similarity_fn_name = self._model_config.get("similarity_fn_name", None)
         if not self.prompts:
             self.prompts = self._model_config.get("prompts", {})
         if not self.default_prompt_name:
diff --git a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
index 40709ae18..a18ec9f85 100644
--- a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
+++ b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
@@ -1,5 +1,7 @@
 from sentence_transformers import SentenceTransformer
 from contextlib import nullcontext
+
+from sentence_transformers.similarity_functions import SimilarityFunction
 from . import SentenceEvaluator
 import logging
 import os
@@ -18,7 +20,7 @@ class BinaryClassificationEvaluator(SentenceEvaluator):
     """
     Evaluate a model based on the similarity of the embeddings by calculating the accuracy of
     identifying similar and dissimilar sentences.
-    The metrics are the cosine similarity as well as euclidean and Manhattan distance
+    The metrics are the cosine similarity, dot product, Euclidean and Manhattan distances
 
     The returned score is the accuracy with a specified metric.
 
     The results are written in a CSV. If a CSV already exists, then values are appended.
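To illustrate the `similarity_fn_name` persistence wired up above, here is a minimal sketch of the intended save/load round-trip. The local path `minilm-dot` and the choice of `all-MiniLM-L6-v2` are illustrative assumptions, not part of this patch:

```python
from sentence_transformers import SentenceTransformer

# Pick a non-default similarity function at load time
model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name="dot")

# `save_pretrained` is the new alias of `save`; the chosen function is written into
# the model configuration alongside the prompts
model.save_pretrained("minilm-dot")

# `_load_sbert_model` restores `similarity_fn_name` from the saved config, so the
# reloaded model scores with the dot product without any further setup
reloaded = SentenceTransformer("minilm-dot")
assert reloaded.similarity_fn_name == "dot"
```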
@@ -69,39 +71,19 @@ def __init__( self.show_progress_bar = show_progress_bar self.csv_file = "binary_classification_evaluation" + ("_" + name if name else "") + "_results.csv" - self.csv_headers = [ - "epoch", - "steps", - "cosine_accuracy", - "cosine_accuracy_threshold", - "cosine_f1", - "cosine_precision", - "cosine_recall", - "cosine_f1_threshold", - "cosine_ap", - "manhattan_accuracy", - "manhattan_accuracy_threshold", - "manhattan_f1", - "manhattan_precision", - "manhattan_recall", - "manhattan_f1_threshold", - "manhattan_ap", - "euclidean_accuracy", - "euclidean_accuracy_threshold", - "euclidean_f1", - "euclidean_precision", - "euclidean_recall", - "euclidean_f1_threshold", - "euclidean_ap", - "dot_accuracy", - "dot_accuracy_threshold", - "dot_f1", - "dot_precision", - "dot_recall", - "dot_f1_threshold", - "dot_ap", + self.csv_headers = ["epoch", "steps"] + metrics = [ + "accuracy", + "accuracy_threshold", + "f1", + "precision", + "recall", + "f1_threshold", + "ap", ] - self.primary_metric = "cosine_accuracy" + for v in SimilarityFunction.possible_values(): + for m in metrics: + self.csv_headers.append(f"{v}_{m}") @classmethod def from_input_examples(cls, examples: List[InputExample], **kwargs): @@ -196,15 +178,15 @@ def compute_metrices(self, model): embeddings1_np = np.asarray(embeddings1) embeddings2_np = np.asarray(embeddings2) - dot_scores = [np.dot(embeddings1_np[i], embeddings2_np[i]) for i in range(len(embeddings1_np))] + dot_scores = np.sum(embeddings1_np * embeddings2_np, axis=-1) labels = np.asarray(self.labels) output_scores = {} for short_name, name, scores, reverse in [ - ["cosine", "Cosine-Similarity", cosine_scores, True], - ["manhattan", "Manhattan-Distance", manhattan_distances, False], - ["euclidean", "Euclidean-Distance", euclidean_distances, False], - ["dot", "Dot-Product", dot_scores, True], + [SimilarityFunction.COSINE.value, "Cosine-Similarity", cosine_scores, True], + [SimilarityFunction.DOT_PRODUCT.value, "Dot-Product", dot_scores, True], + [SimilarityFunction.MANHATTAN.value, "Manhattan-Distance", manhattan_distances, False], + [SimilarityFunction.EUCLIDEAN.value, "Euclidean-Distance", euclidean_distances, False], ]: acc, acc_threshold = self.find_best_acc_and_threshold(scores, labels, reverse) f1, precision, recall, f1_threshold = self.find_best_f1_and_threshold(scores, labels, reverse) diff --git a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py index 0cb14500e..0f2e9ca39 100644 --- a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py +++ b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py @@ -1,14 +1,15 @@ from contextlib import nullcontext from sentence_transformers import SentenceTransformer -from . import SentenceEvaluator, SimilarityFunction +from . 
import SentenceEvaluator +from sentence_transformers.similarity_functions import SimilarityFunction import logging import os import csv from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances from scipy.stats import pearsonr, spearmanr import numpy as np -from typing import Dict, List, Literal, Optional +from typing import Dict, List, Literal, Optional, Union from ..readers import InputExample @@ -31,7 +32,7 @@ def __init__( sentences2: List[str], scores: List[float], batch_size: int = 16, - main_similarity: SimilarityFunction = None, + main_similarity: Optional[Union[str, SimilarityFunction]] = None, name: str = "", show_progress_bar: bool = False, write_csv: bool = True, @@ -63,7 +64,7 @@ def __init__( assert len(self.sentences1) == len(self.sentences2) assert len(self.sentences1) == len(self.scores) - self.main_similarity = main_similarity + self.main_similarity = SimilarityFunction(main_similarity) if main_similarity else None self.name = name self.batch_size = batch_size diff --git a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py index 54338e6f5..917574c3e 100644 --- a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py +++ b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py @@ -1,5 +1,7 @@ from sentence_transformers import SentenceTransformer from contextlib import nullcontext + +from sentence_transformers.similarity_functions import SimilarityFunction from . import SentenceEvaluator import torch from torch import Tensor @@ -8,7 +10,7 @@ from ..util import cos_sim, dot_score import os import numpy as np -from typing import List, Dict, Optional, Set, Callable +from typing import List, Dict, Optional, Set, Callable, Union import heapq @@ -40,10 +42,10 @@ def __init__( write_csv: bool = True, truncate_dim: Optional[int] = None, score_functions: Dict[str, Callable[[Tensor, Tensor], Tensor]] = { - "cosine": cos_sim, - "dot": dot_score, + SimilarityFunction.COSINE.value: cos_sim, + SimilarityFunction.DOT_PRODUCT.value: dot_score, }, # Score function, higher=more similar - main_score_function: str = None, + main_score_function: Optional[Union[str, SimilarityFunction]] = None, ): super().__init__() self.queries_ids = [] @@ -70,7 +72,7 @@ def __init__( self.write_csv = write_csv self.score_functions = score_functions self.score_function_names = sorted(list(self.score_functions.keys())) - self.main_score_function = main_score_function + self.main_score_function = SimilarityFunction(main_score_function) if main_score_function else None self.truncate_dim = truncate_dim if name: @@ -153,7 +155,7 @@ def __call__( )[0] self.primary_metric = f"{score_function}_map@{max(self.map_at_k)}" else: - self.primary_metric = f"{self.main_score_function}_map@{max(self.map_at_k)}" + self.primary_metric = f"{self.main_score_function.value}_map@{max(self.map_at_k)}" metrics = { f"{score_function}_{metric_name.replace('@k', '@' + str(k))}": value diff --git a/sentence_transformers/evaluation/SimilarityFunction.py b/sentence_transformers/evaluation/SimilarityFunction.py index 22d112732..f149b30a3 100644 --- a/sentence_transformers/evaluation/SimilarityFunction.py +++ b/sentence_transformers/evaluation/SimilarityFunction.py @@ -1,8 +1,3 @@ -from enum import Enum +from sentence_transformers.similarity_functions import SimilarityFunction - -class SimilarityFunction(Enum): - COSINE = 0 - EUCLIDEAN = 1 - MANHATTAN = 2 - DOT_PRODUCT = 3 +__all__ 
= ["SimilarityFunction"] diff --git a/sentence_transformers/evaluation/TripletEvaluator.py b/sentence_transformers/evaluation/TripletEvaluator.py index da7719f97..3b9a908c2 100644 --- a/sentence_transformers/evaluation/TripletEvaluator.py +++ b/sentence_transformers/evaluation/TripletEvaluator.py @@ -1,11 +1,13 @@ +import numpy as np from sentence_transformers import SentenceTransformer from contextlib import nullcontext -from . import SentenceEvaluator, SimilarityFunction +from . import SentenceEvaluator +from sentence_transformers.similarity_functions import SimilarityFunction import logging import os import csv from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances -from typing import Dict, List, Optional +from typing import Dict, List, Optional, Union from ..readers import InputExample @@ -23,7 +25,7 @@ def __init__( anchors: List[str], positives: List[str], negatives: List[str], - main_distance_function: SimilarityFunction = None, + main_distance_function: Optional[Union[str, SimilarityFunction]] = None, name: str = "", batch_size: int = 16, show_progress_bar: bool = False, @@ -34,7 +36,8 @@ def __init__( :param anchors: Sentences to check similarity to. (e.g. a query) :param positives: List of positive sentences :param negatives: List of negative sentences - :param main_distance_function: One of 0 (Cosine), 1 (Euclidean) or 2 (Manhattan). Defaults to None, returning all 3. + :param main_distance_function: The distance function to use. If not specified, use cosine similarity, + dot product, Euclidean, and Manhattan. :param name: Name for the output :param batch_size: Batch size used to compute embeddings :param show_progress_bar: If true, prints a progress bar @@ -52,7 +55,7 @@ def __init__( assert len(self.anchors) == len(self.positives) assert len(self.anchors) == len(self.negatives) - self.main_distance_function = main_distance_function + self.main_distance_function = SimilarityFunction(main_distance_function) if main_distance_function else None self.batch_size = batch_size if show_progress_bar is None: @@ -93,7 +96,12 @@ def __call__( logger.info(f"TripletEvaluator: Evaluating the model on the {self.name} dataset{out_txt}:") num_triplets = 0 - num_correct_cos_triplets, num_correct_manhattan_triplets, num_correct_euclidean_triplets = 0, 0, 0 + ( + num_correct_cos_triplets, + num_correct_dot_triplets, + num_correct_manhattan_triplets, + num_correct_euclidean_triplets, + ) = 0, 0, 0, 0 with nullcontext() if self.truncate_dim is None else model.truncate_sentence_embeddings(self.truncate_dim): embeddings_anchors = model.encode( @@ -119,6 +127,10 @@ def __call__( pos_cos_distance = paired_cosine_distances(embeddings_anchors, embeddings_positives) neg_cos_distances = paired_cosine_distances(embeddings_anchors, embeddings_negatives) + # Dot score + pos_dot_distance = np.sum(embeddings_anchors * embeddings_positives, axis=-1) + neg_dot_distances = np.sum(embeddings_anchors * embeddings_negatives, axis=-1) + # Manhattan pos_manhattan_distance = paired_manhattan_distances(embeddings_anchors, embeddings_positives) neg_manhattan_distances = paired_manhattan_distances(embeddings_anchors, embeddings_negatives) @@ -133,6 +145,9 @@ def __call__( if pos_cos_distance[idx] < neg_cos_distances[idx]: num_correct_cos_triplets += 1 + if pos_dot_distance[idx] < neg_dot_distances[idx]: + num_correct_dot_triplets += 1 + if pos_manhattan_distance[idx] < neg_manhattan_distances[idx]: num_correct_manhattan_triplets += 1 @@ -140,10 +155,12 @@ def 
__call__(
                 num_correct_euclidean_triplets += 1
 
         accuracy_cos = num_correct_cos_triplets / num_triplets
+        accuracy_dot = num_correct_dot_triplets / num_triplets
         accuracy_manhattan = num_correct_manhattan_triplets / num_triplets
         accuracy_euclidean = num_correct_euclidean_triplets / num_triplets
 
         logger.info("Accuracy Cosine Distance: \t{:.2f}".format(accuracy_cos * 100))
+        logger.info("Accuracy Dot Product: \t{:.2f}".format(accuracy_dot * 100))
         logger.info("Accuracy Manhattan Distance:\t{:.2f}".format(accuracy_manhattan * 100))
         logger.info("Accuracy Euclidean Distance:\t{:.2f}\n".format(accuracy_euclidean * 100))
 
@@ -161,15 +178,17 @@ def __call__(
             writer.writerow([epoch, steps, accuracy_cos, accuracy_manhattan, accuracy_euclidean])
 
         self.primary_metric = {
-            SimilarityFunction.COSINE: "accuracy_cosine",
-            SimilarityFunction.EUCLIDEAN: "accuracy_euclidean",
-            SimilarityFunction.MANHATTAN: "accuracy_manhattan",
-        }.get(self.main_distance_function, "accuracy_max")
+            SimilarityFunction.COSINE: "cosine_accuracy",
+            SimilarityFunction.DOT_PRODUCT: "dot_accuracy",
+            SimilarityFunction.EUCLIDEAN: "euclidean_accuracy",
+            SimilarityFunction.MANHATTAN: "manhattan_accuracy",
+        }.get(self.main_distance_function, "max_accuracy")
         metrics = {
-            "accuracy_cosine": accuracy_cos,
-            "accuracy_manhattan": accuracy_manhattan,
-            "accuracy_euclidean": accuracy_euclidean,
-            "accuracy_max": max(accuracy_cos, accuracy_manhattan, accuracy_euclidean),
+            "cosine_accuracy": accuracy_cos,
+            "dot_accuracy": accuracy_dot,
+            "manhattan_accuracy": accuracy_manhattan,
+            "euclidean_accuracy": accuracy_euclidean,
+            "max_accuracy": max(accuracy_cos, accuracy_dot, accuracy_manhattan, accuracy_euclidean),
         }
         metrics = self.prefix_name_to_metrics(metrics, self.name)
         self.store_metrics_in_model_card_data(model, metrics)
diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py
index 633517574..df382374b 100644
--- a/sentence_transformers/model_card.py
+++ b/sentence_transformers/model_card.py
@@ -10,9 +10,6 @@
 from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Union
 import logging
 
-import accelerate
-import datasets
-import tokenizers
 import torch
 from torch import nn
 import transformers
@@ -206,6 +203,22 @@ def on_log(
 IGNORED_FIELDS = ["model", "trainer", "eval_results_dict"]
 
 
+def get_versions() -> Dict[str, Any]:
+    from accelerate import __version__ as accelerate_version
+    from datasets import __version__ as datasets_version
+    from tokenizers import __version__ as tokenizers_version
+
+    return {
+        "python": python_version(),
+        "sentence_transformers": sentence_transformers_version,
+        "transformers": transformers.__version__,
+        "torch": torch.__version__,
+        "accelerate": accelerate_version,
+        "datasets": datasets_version,
+        "tokenizers": tokenizers_version,
+    }
+
+
 @dataclass
 class SentenceTransformerModelCardData(CardData):
     """A dataclass storing data used in the model card.
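To make the renamed `TripletEvaluator` metric keys above concrete, a small usage sketch follows (assuming the dict-returning evaluator interface from this patch; the sentences and the printed value are illustrative, not real outputs):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")
evaluator = TripletEvaluator(
    anchors=["What is the capital of France?"],
    positives=["Paris is the capital of France."],
    negatives=["Berlin is the capital of Germany."],
)

# Evaluators now return a dict of metrics; keys follow the new
# "<function>_accuracy" convention instead of "accuracy_<function>"
metrics = evaluator(model)
print(metrics["cosine_accuracy"])  # e.g. 1.0
```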
@@ -290,18 +303,7 @@ class SentenceTransformerModelCardData(CardData): # Computed once, always unchanged pipeline_tag: str = field(default="sentence-similarity", init=False) library_name: str = field(default="sentence-transformers", init=False) - version: Dict[str, str] = field( - default_factory=lambda: { - "python": python_version(), - "sentence_transformers": sentence_transformers_version, - "transformers": transformers.__version__, - "torch": torch.__version__, - "accelerate": accelerate.__version__, - "datasets": datasets.__version__, - "tokenizers": tokenizers.__version__, - }, - init=False, - ) + version: Dict[str, str] = field(default_factory=get_versions, init=False) # Passed via `register_model` only model: Optional["SentenceTransformer"] = field(default=None, init=False, repr=False) @@ -899,6 +901,12 @@ def to_dict(self) -> Dict[str, Any]: super_dict["model_max_length"] = self.model.get_max_seq_length() super_dict["output_dimensionality"] = self.model.get_sentence_embedding_dimension() super_dict["model_string"] = str(self.model) + super_dict["similarity_fn_name"] = { + "cosine": "Cosine Similarity", + "dot": "Dot Product", + "euclidean": "Euclidean Distance", + "manhattan": "Manhattan Distance", + }.get(self.model.similarity_fn_name, self.model.similarity_fn_name.replace("_", " ").title()) self.first_save = False diff --git a/sentence_transformers/model_card_template.md b/sentence_transformers/model_card_template.md index 2362eb0c6..f503c6770 100644 --- a/sentence_transformers/model_card_template.md +++ b/sentence_transformers/model_card_template.md @@ -23,6 +23,7 @@ This is a [sentence-transformers](https://www.SBERT.net) model{% if base_model % {%- endif %} - **Maximum Sequence Length:** {{ model_max_length }} tokens - **Output Dimensionality:** {{ output_dimensionality }} tokens +- **Similarity Function:** {{ similarity_fn_name }} {% if train_datasets | selectattr("name") | list -%} - **Training Dataset{{"s" if train_datasets | selectattr("name") | list | length > 1 else ""}}:** {%- for dataset in (train_datasets | selectattr("name")) %} @@ -88,6 +89,11 @@ sentences = [ embeddings = model.encode(sentences) print(embeddings.shape) # [{{ (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) | length}}, {{ output_dimensionality | default(1024, true) }}] + +# Get the similarity scores for the embeddings +similarities = model.similarity(embeddings) +print(similarities.shape) +# [{{ (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) | length}}, {{ (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) | length}}] ``` +[![HF Models](https://img.shields.io/badge/%F0%9F%A4%97-models-yellow)](https://huggingface.co/models?library=sentence-transformers) [![GitHub - License](https://img.shields.io/github/license/UKPLab/sentence-transformers?logo=github&style=flat&color=green)][#github-license] [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sentence-transformers?logo=pypi&style=flat&color=blue)][#pypi-package] [![PyPI - Package Version](https://img.shields.io/pypi/v/sentence-transformers?logo=pypi&style=flat&color=orange)][#pypi-package] -[![Conda - Platform](https://img.shields.io/conda/pn/conda-forge/sentence-transformers?logo=anaconda&style=flat)][#conda-forge-package] -[![Conda (channel 
only)](https://img.shields.io/conda/vn/conda-forge/sentence-transformers?logo=anaconda&style=flat&color=orange)][#conda-forge-package] [![Docs - GitHub.io](https://img.shields.io/static/v1?logo=github&style=flat&color=pink&label=docs&message=sentence-transformers)][#docs-package] - + [#github-license]: https://github.com/UKPLab/sentence-transformers/blob/master/LICENSE [#pypi-package]: https://pypi.org/project/sentence-transformers/ @@ -20,38 +16,24 @@ This framework provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar text are closer and can efficiently be found using cosine similarity. -We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases. +We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases. -Further, this framework allows an easy **[fine-tuning of custom embeddings models](https://www.sbert.net/docs/training/overview.html)**, to achieve maximal performance on your specific task. +Further, this framework allows an easy **[fine-tuning of custom embeddings models](https://www.sbert.net/docs/sentence_transformer/training_overview.html)**, to achieve maximal performance on your specific task. For the **full documentation**, see **[www.SBERT.net](https://www.sbert.net)**. -The following publications are integrated in this framework: - -- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019) -- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020) -- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL 2021) -- [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210) (arXiv 2020) -- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (arXiv 2021) -- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) (arXiv 2021) -- [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) (arXiv 2022) - ## Installation -We recommend **Python 3.8** or higher, **[PyTorch 1.11.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.32.0](https://github.com/huggingface/transformers)** or higher. The code does **not** work with Python 2.7. +We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**. 
**Install with pip**
 
-Install the *sentence-transformers* with `pip`:
-
 ```
 pip install -U sentence-transformers
 ```
 
 **Install with conda**
 
-You can install the *sentence-transformers* with `conda`:
-
 ```
 conda install -c conda-forge sentence-transformers
 ```
@@ -73,8 +55,6 @@ If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA
 
 See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documenation.
 
-[This example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/computing-embeddings/computing_embeddings.py) shows you how to use an already trained Sentence Transformer model to embed sentences for another task.
-
 First download a pretrained model.
 
 ````python
@@ -87,45 +67,40 @@ Then provide some sentences to the model.
 
 ````python
 sentences = [
-    "This framework generates embeddings for each input sentence",
-    "Sentences are passed as a list of string.",
-    "The quick brown fox jumps over the lazy dog.",
+    "The weather is lovely today.",
+    "It's so sunny outside!",
+    "He drove to the stadium.",
 ]
-sentence_embeddings = model.encode(sentences)
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# => (3, 384)
 ````
 
-And that's it already. We now have a list of numpy arrays with the embeddings.
+And that's already it. We now have a numpy array with the embeddings, one for each text. We can use these to compute similarities.
 
 ````python
-for sentence, embedding in zip(sentences, sentence_embeddings):
-    print("Sentence:", sentence)
-    print("Embedding:", embedding)
-    print("")
+similarities = model.similarity(embeddings, embeddings)
+print(similarities)
+# tensor([[1.0000, 0.6660, 0.1046],
+#         [0.6660, 1.0000, 0.1411],
+#         [0.1046, 0.1411, 1.0000]])
 ````
 
 ## Pre-Trained Models
 
-We provide a large list of [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.
-
-[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
+We provide a large list of [Pretrained Models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.
 
 ## Training
 
 This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task.
 
-See [Training Overview](https://www.sbert.net/docs/training/overview.html) for an introduction how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) how to train models on various datasets.
+See [Training Overview](https://www.sbert.net/docs/sentence_transformer/training_overview.html) for an introduction on how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) of how to train models on various datasets.
 
 Some highlights are:
 - Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
- Multi-Lingual and multi-task learning
 - Evaluation during training to find optimal model
-- [20+ loss-functions](https://www.sbert.net/docs/package_reference/losses.html) allowing to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss.
-
-## Performance
-
-Our models are evaluated extensively on 15+ datasets including challening domains like Tweets, Reddit, emails. They achieve by far the **best performance** from all available sentence embedding methods. Further, we provide several **smaller models** that are **optimized for speed**.
-
-[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
+- [20+ loss-functions](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html) allowing you to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss, etc.
 
 ## Application Examples
 
@@ -133,12 +108,11 @@ You can use this framework for:
 
 - [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)
 - [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
+- [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
+- [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
 - [Clustering](https://www.sbert.net/examples/applications/clustering/README.html)
 - [Paraphrase Mining](https://www.sbert.net/examples/applications/paraphrase-mining/README.html)
-
- - [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
-
- - [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
-
- - [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
-
- - [Text Summarization](https://www.sbert.net/examples/applications/text-summarization/README.html)
+- [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
 - [Multilingual Image Search, Clustering & Duplicate Detection](https://www.sbert.net/examples/applications/image-search/README.html)
 
 and many more use-cases.
@@ -193,7 +167,7 @@ If you use one of the multilingual models, feel free to cite our publication [Ma
 
 Please have a look at [Publications](https://www.sbert.net/docs/publications.html) for our different publications that are integrated into SentenceTransformers.
 
-Contact person: Tom Aarsen, [tom.aarsen@huggingface.co](mailto:tom.aarsen@huggingface.co)
+Maintainer: [Tom Aarsen](https://github.com/tomaarsen), 🤗 Hugging Face
 
 https://www.ukp.tu-darmstadt.de/
diff --git a/docs/Makefile b/docs/Makefile
index 484135cad..ae30537c3 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -1,3 +1,6 @@
 docs:
-	sphinx-build -c . -a -E .. _build
\ No newline at end of file
+	sphinx-build -c . -a -E .. _build
+
+docs-quick:
+	sphinx-build -c . ..
_build \ No newline at end of file diff --git a/docs/_static/css/custom.css b/docs/_static/css/custom.css index 7938469ed..0bab3e76f 100644 --- a/docs/_static/css/custom.css +++ b/docs/_static/css/custom.css @@ -24,4 +24,88 @@ dl.class > dt { .wy-side-nav-search { padding-top: 0px; -} \ No newline at end of file +} + +.components { + display: flex; + flex-flow: row wrap; +} + +.components > .box { + flex: 1; + margin: 0.5rem; + padding: 1rem; + border-style: solid; + border-width: 1px; + border-radius: 0.5rem; + border-color: rgb(55 65 81); + background-color: #e3e3e3; + color: #404040; /* Override the colors imposed by */ +} + +.components > .box:nth-child(1) > .header { + background-image: linear-gradient(to bottom right, #60a5fa, #3b82f6); +} + +.components > .box:nth-child(2) > .header { + background-image: linear-gradient(to bottom right, #fb923c, #f97316); +} + +.components > .box:nth-child(3) > .header { + background-image: linear-gradient(to bottom right, #f472b6, #ec4899); +} + +.components > .box:nth-child(4) > .header { + background-image: linear-gradient(to bottom right, #a78bfa, #8b5cf6); +} + +.components > .box:nth-child(5) > .header { + background-image: linear-gradient(to bottom right, #34d399, #10b981); +} + +.components > .optional { + background: repeating-linear-gradient( + 135deg, + #f1f1f1, + #f1f1f1 25px, + #e3e3e3 25px, + #e3e3e3 50px + ); +} + +.components > .box > .header { + border-style: solid; + border-width: 1px; + border-radius: 0.5rem; + border-color: rgb(55 65 81); + padding: 0.5rem; + text-align: center; + margin-bottom: 0.5rem; + font-weight: bold; + color: white; +} + +.sidebar p { + font-size: 100% !important; +} + +.training-arguments { + background-color: #f3f6f6; + border: 1px solid #e1e4e5; +} + +.training-arguments > .header { + font-weight: 700; + padding: 6px 12px; + background: #e1e4e5; +} + +.training-arguments > .table { + display: grid; + grid-template-columns: repeat(auto-fill, minmax(15em, 1fr)); +} + +.training-arguments > .table > a { + padding: 0.5rem; + border: 1px solid #e1e4e5; +} diff --git a/docs/_themes/sphinx_rtd_theme/footer.html b/docs/_themes/sphinx_rtd_theme/footer.html index c82e5ed45..4c4c2b429 100644 --- a/docs/_themes/sphinx_rtd_theme/footer.html +++ b/docs/_themes/sphinx_rtd_theme/footer.html @@ -24,9 +24,6 @@ © {% trans %}Copyright{% endtrans %} {{ copyright }} {%- endif %} {%- endif %} - - • Contact - {%- if build_id and build_url %} {# Translators: Build is a noun, not a verb #} diff --git a/docs/_themes/sphinx_rtd_theme/layout.html b/docs/_themes/sphinx_rtd_theme/layout.html index 2696eaaa2..3e30b0fa5 100644 --- a/docs/_themes/sphinx_rtd_theme/layout.html +++ b/docs/_themes/sphinx_rtd_theme/layout.html @@ -121,8 +121,12 @@
[HTML markup omitted]
    diff --git a/docs/_themes/sphinx_rtd_theme/theme.conf b/docs/_themes/sphinx_rtd_theme/theme.conf index fd0521f02..f26931470 100644 --- a/docs/_themes/sphinx_rtd_theme/theme.conf +++ b/docs/_themes/sphinx_rtd_theme/theme.conf @@ -8,7 +8,7 @@ canonical_url = analytics_id = collapse_navigation = True sticky_navigation = True -navigation_depth = 4 +navigation_depth = includehidden = True titles_only = logo_only = diff --git a/docs/changelog/v3.0.md b/docs/changelog/v3.0.md new file mode 100644 index 000000000..e69de29bb diff --git a/docs/conf.py b/docs/conf.py index e9c182541..ba2e61f0b 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -21,8 +21,8 @@ # -- Project information ----------------------------------------------------- project = "Sentence-Transformers" -copyright = str(datetime.datetime.now().year) + ", Nils Reimers" -author = "Nils Reimers" +copyright = str(datetime.datetime.now().year) +author = "Nils Reimers, Tom Aarsen" # -- General configuration --------------------------------------------------- @@ -30,7 +30,14 @@ # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. -extensions = ["sphinx.ext.autodoc", "recommonmark", "sphinx_markdown_tables"] +extensions = [ + "sphinx.ext.napoleon", + "sphinx.ext.autodoc", + "recommonmark", + "sphinx_markdown_tables", + "sphinx.ext.intersphinx", + "sphinx_tabs.tabs", +] # Add any paths that contain templates here, relative to this directory. templates_path = ["_templates"] @@ -38,7 +45,24 @@ # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This pattern also affects html_static_path and html_extra_path. -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "nr_examples"] +exclude_patterns = [ + "_build", + "Thumbs.db", + ".DS_Store", + "nr_examples", + "archived", + "dist", + "build", + "output", + "models", + "model_card_template.md", +] + +intersphinx_mapping = { + "datasets": ("https://huggingface.co/docs/datasets/main/en/", None), + "transformers": ("https://huggingface.co/docs/transformers/main/en/", None), + "torch": ("https://pytorch.org/docs/stable/", None), +} # -- Options for HTML output ------------------------------------------------- @@ -49,7 +73,11 @@ html_theme = "sphinx_rtd_theme" html_theme_path = ["_themes"] -html_theme_options = {"logo_only": True, "canonical_url": "https://www.sbert.net"} +html_theme_options = { + "logo_only": True, + "canonical_url": "https://www.sbert.net", + "collapse_navigation": False, +} # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, diff --git a/docs/contact.md b/docs/contact.md deleted file mode 100644 index 1ce3cf82e..000000000 --- a/docs/contact.md +++ /dev/null @@ -1,19 +0,0 @@ -# Contact - -In case of questions, feel free to open a [Github Issue](https://github.com/UKPLab/sentence-transformers/issues) or write me an email: [info@nils-reimers.de](mailto:info@nils-reimers.de). - -**SentenceTransformers is maintained by:** -Nils Reimers -Ubiquitous Knowledge Processing (UKP) Lab -FB 20 / Department of Computer Science -Technische Universität Darmstadt -Hochschulstr. 10 -64289 Darmstadt -Germany -[Website](https://www.informatik.tu-darmstadt.de/ukp/ukp_home/index.en.jsp) - - -**Privacy Policy** -The webserver / web hosting company might collect certain log files to prevent abuse of services. 
These log files can include: IP address, URL, date and time. - -We do not use any tracking services or cookies to track or re-identify visitors. \ No newline at end of file diff --git a/docs/cross_encoder/pretrained_models.md b/docs/cross_encoder/pretrained_models.md new file mode 100644 index 000000000..14715ccd8 --- /dev/null +++ b/docs/cross_encoder/pretrained_models.md @@ -0,0 +1,111 @@ +# Pretrained Models + +We have released various pre-trained Cross Encoder models via our [Cross Encoder Hugging Face organization](https://huggingface.co/models?author=cross-encoder). Additionally, numerous community CrossEncoder models have been publicly released on the Hugging Face Hub. + +Each of these models can be easily downloaded and used like so: + +```python +from sentence_transformers import CrossEncoder +import torch + +model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", default_activation_function=torch.nn.Sigmoid()) +scores = model.predict([ + ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."), + ("How many people live in Berlin?", "Berlin is well known for its museums."), +]) +# => array([0.9998173 , 0.01312432], dtype=float32) +``` + +Cross-Encoders require text pairs as inputs and output a score 0...1 (if the Sigmoid activation function is used). They do not work for individual sentences and they don't compute embeddings for individual texts. + +## MS MARCO +[MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset of real user queries from the Bing search engine with annotated relevant text passages. + +```eval_rst +.. note:: + You can initialize these models with ``default_activation_function=torch.nn.Sigmoid()`` to force the model to return scores between 0 and 1. Otherwise, the raw value can reasonably range between -10 and 10. +``` + +- [cross-encoder/ms-marco-TinyBERT-L-2-v2](https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L-2-v2) - MRR@10 on MS Marco Dev Set: 32.56 +- [cross-encoder/ms-marco-MiniLM-L-2-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-2-v2) - MRR@10 on MS Marco Dev Set: 34.85 +- [cross-encoder/ms-marco-MiniLM-L-4-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-4-v2) - MRR@10 on MS Marco Dev Set: 37.70 +- [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) - MRR@10 on MS Marco Dev Set: 39.01 +- [cross-encoder/ms-marco-MiniLM-L-12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2) - MRR@10 on MS Marco Dev Set: 39.02 + +For details on the usage, see [Retrieve & Re-Rank](../../examples/applications/retrieve_rerank/README.md) or [MS MARCO Cross-Encoders](../pretrained-models/ce-msmarco.md). + +## SQuAD (QNLI) + +QNLI is based on the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) ([HF](https://huggingface.co/datasets/rajpurkar/squad)) and was introduced by the [GLUE Benchmark](https://arxiv.org/abs/1804.07461) ([HF](https://huggingface.co/datasets/nyu-mll/glue)). Given a passage from Wikipedia, annotators created questions that are answerable by that passage.
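For illustration, here is a minimal sketch of how one of the QNLI models listed below might be used; the question/passage pair is invented, and the exact score range depends on the model's activation function:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/qnli-electra-base")
# Each pair is (question, passage); a higher score means the passage
# is more likely to contain the answer to the question.
scores = model.predict([
    ("Where is the Eiffel Tower located?", "The Eiffel Tower stands on the Champ de Mars in Paris, France."),
    ("Where is the Eiffel Tower located?", "Gustave Eiffel also designed the internal frame of the Statue of Liberty."),
])
```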
+ +- [cross-encoder/qnli-distilroberta-base](https://huggingface.co/cross-encoder/qnli-distilroberta-base) - Accuracy on QNLI dev set: 90.96 +- [cross-encoder/qnli-electra-base](https://huggingface.co/cross-encoder/qnli-electra-base) - Accuracy on QNLI dev set: 93.21 + +## STSbenchmark +The following models can be used like this: +```python +from sentence_transformers import CrossEncoder + +model = CrossEncoder("cross-encoder/stsb-roberta-base") +scores = model.predict([("It's a wonderful day outside.", "It's so sunny today!"), ("It's a wonderful day outside.", "He drove to work earlier.")]) +# => array([0.60443085, 0.00240758], dtype=float32) +``` + +They return a score 0...1 indicating the semantic similarity of the given sentence pair. +- [cross-encoder/stsb-TinyBERT-L-4](https://huggingface.co/cross-encoder/stsb-TinyBERT-L-4) - STSbenchmark test performance: 85.50 +- [cross-encoder/stsb-distilroberta-base](https://huggingface.co/cross-encoder/stsb-distilroberta-base) - STSbenchmark test performance: 87.92 +- [cross-encoder/stsb-roberta-base](https://huggingface.co/cross-encoder/stsb-roberta-base) - STSbenchmark test performance: 90.17 +- [cross-encoder/stsb-roberta-large](https://huggingface.co/cross-encoder/stsb-roberta-large) - STSbenchmark test performance: 91.47 + +## Quora Duplicate Questions +These models have been trained on the [Quora duplicate questions dataset](https://huggingface.co/datasets/sentence-transformers/quora-duplicates). They can be used like the STSb models and give a score 0...1 indicating the probability that two questions are duplicate questions. + +- [cross-encoder/quora-distilroberta-base](https://huggingface.co/cross-encoder/quora-distilroberta-base) - Average Precision dev set: 87.48 +- [cross-encoder/quora-roberta-base](https://huggingface.co/cross-encoder/quora-roberta-base) - Average Precision dev set: 87.80 +- [cross-encoder/quora-roberta-large](https://huggingface.co/cross-encoder/quora-roberta-large) - Average Precision dev set: 87.91 + +```eval_rst +.. note:: + These models don't work for question similarity. The questions *How to learn Java* and *How to learn Python* will get a low score, as they are not duplicates. For question similarity, the respective bi-encoder trained on the Quora dataset yields much more meaningful results. +``` + +## NLI +Given two sentences, do they contradict each other, does one entail the other, or are they neutral? The following models were trained on the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) datasets.
+ +- [cross-encoder/nli-deberta-v3-base](https://huggingface.co/cross-encoder/nli-deberta-v3-base) - Accuracy on MNLI mismatched set: 90.04 +- [cross-encoder/nli-deberta-base](https://huggingface.co/cross-encoder/nli-deberta-base) - Accuracy on MNLI mismatched set: 88.08 +- [cross-encoder/nli-deberta-v3-xsmall](https://huggingface.co/cross-encoder/nli-deberta-v3-xsmall) - Accuracy on MNLI mismatched set: 87.77 +- [cross-encoder/nli-deberta-v3-small](https://huggingface.co/cross-encoder/nli-deberta-v3-small) - Accuracy on MNLI mismatched set: 87.55 +- [cross-encoder/nli-roberta-base](https://huggingface.co/cross-encoder/nli-roberta-base) - Accuracy on MNLI mismatched set: 87.47 +- [cross-encoder/nli-MiniLM2-L6-H768](https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768) - Accuracy on MNLI mismatched set: 86.89 +- [cross-encoder/nli-distilroberta-base](https://huggingface.co/cross-encoder/nli-distilroberta-base) - Accuracy on MNLI mismatched set: 83.98 + +```python +from sentence_transformers import CrossEncoder + +model = CrossEncoder("cross-encoder/nli-deberta-v3-base") +scores = model.predict([ + ("A man is eating pizza", "A man eats something"), + ("A black race car starts up in front of a crowd of people.", "A man is driving down a lonely road."), +]) + +# Convert scores to labels +label_mapping = ["contradiction", "entailment", "neutral"] +labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)] +# => ['entailment', 'contradiction'] +``` + +## Community Models + +Some notable models from the community include: + +- [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) +- [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) +- [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) +- [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) +- [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise) +- [jinaai/jina-reranker-v1-tiny-en](https://huggingface.co/jinaai/jina-reranker-v1-tiny-en) +- [jinaai/jina-reranker-v1-turbo-en](https://huggingface.co/jinaai/jina-reranker-v1-turbo-en) +- [mixedbread-ai/mxbai-rerank-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1) +- [mixedbread-ai/mxbai-rerank-base-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-base-v1) +- [mixedbread-ai/mxbai-rerank-large-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v1) +- [maidalun1020/bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) \ No newline at end of file diff --git a/docs/cross_encoder/training/examples.md b/docs/cross_encoder/training/examples.md new file mode 100644 index 000000000..2235e3bbd --- /dev/null +++ b/docs/cross_encoder/training/examples.md @@ -0,0 +1,16 @@ + +# Training Examples + +See the following examples for how to train Cross-Encoders: + +- [training_stsbenchmark.py](../../../examples/training/cross-encoder/training_stsbenchmark.py) - Example of how to train for Semantic Textual Similarity (STS) on the STS benchmark dataset. +- [training_quora_duplicate_questions.py](../../../examples/training/cross-encoder/training_quora_duplicate_questions.py) - Example of how to train a Cross-Encoder to predict whether two questions are duplicates. Uses Quora Duplicate Questions as the training dataset. +- [training_nli.py](../../../examples/training/cross-encoder/training_nli.py) - Example of a multi-label classification task for Natural Language Inference (NLI). + +```eval_rst +..
toctree:: + :maxdepth: 1 + :caption: Supervised Learning + + ../../../examples/training/ms_marco/cross_encoder_README +``` \ No newline at end of file diff --git a/docs/cross_encoder/training_overview.md b/docs/cross_encoder/training_overview.md new file mode 100644 index 000000000..a8902e6ec --- /dev/null +++ b/docs/cross_encoder/training_overview.md @@ -0,0 +1,65 @@ + +# Training Overview + +```eval_rst +.. note:: + The CrossEncoder training approach was not updated in v3.0, when `training Sentence Transformer models <../sentence_transformer/training_overview.html>`_ was improved. Improving CrossEncoder training is planned for a future major update. +``` + +The `CrossEncoder` class is a wrapper around the Hugging Face `AutoModelForSequenceClassification`, but with some methods to make training and predicting scores a little bit easier. The saved models are 100% compatible with Hugging Face and can also be loaded with their classes. + +First, you need some sentence pair data. You can either have a continuous score, like: + +```eval_rst + +.. sidebar:: Documentation + + - :class:`~sentence_transformers.readers.InputExample` +``` + +```python +from sentence_transformers import InputExample + +train_samples = [ + InputExample(texts=["sentence1", "sentence2"], label=0.3), + InputExample(texts=["Another", "pair"], label=0.8), +] +``` + +Or you have distinct classes as in the [training_nli.py](../../examples/training/cross-encoder/training_nli.py) example: +```python +from sentence_transformers import InputExample + +label2int = {"contradiction": 0, "entailment": 1, "neutral": 2} +train_samples = [ + InputExample(texts=["sentence1", "sentence2"], label=label2int["neutral"]), + InputExample(texts=["Another", "pair"], label=label2int["entailment"]), +] +``` + +Then, you define the base model and the number of labels. You can take any [Hugging Face pre-trained model](https://huggingface.co/models) that is compatible with AutoModel: ```python from sentence_transformers import CrossEncoder model = CrossEncoder('distilroberta-base', num_labels=1) ``` For binary tasks and tasks with continuous scores (like STS), we set `num_labels=1`. For classification tasks, we set it to the number of labels we have. +```eval_rst + +We start the training by calling :meth:`CrossEncoder.fit `: + +.. sidebar:: Documentation + + - :class:`~sentence_transformers.cross_encoder.CrossEncoder` + - :meth:`CrossEncoder.fit ` + +:: + + model.fit( + train_dataloader=train_dataloader, + evaluator=evaluator, + epochs=num_epochs, + warmup_steps=warmup_steps, + output_path=model_save_path, + ) +``` \ No newline at end of file diff --git a/docs/cross_encoder/usage/usage.rst b/docs/cross_encoder/usage/usage.rst new file mode 100644 index 000000000..c36d5a5f4 --- /dev/null +++ b/docs/cross_encoder/usage/usage.rst @@ -0,0 +1,75 @@ + +Usage +===== + +Characteristics of Cross Encoder (a.k.a. reranker) models: + +1. Calculates a **similarity score** given **pairs of texts**. +2. Generally provides **superior performance** compared to a Sentence Transformer (a.k.a. bi-encoder) model. +3. Often **slower** than a Sentence Transformer model, as it requires computation for each pair rather than each text. +4. Due to the previous 2 characteristics, Cross Encoders are often used to **re-rank the top-k results** from a Sentence Transformer model. + +Once you have `installed `_ Sentence Transformers, you can easily use Cross Encoder models: + +.. sidebar:: Documentation + + 1. :class:`~sentence_transformers.cross_encoder.CrossEncoder` + 2. :meth:`CrossEncoder.predict ` + 3.
:meth:`CrossEncoder.rank ` + + .. note:: + MS Marco models return logits rather than scores between 0 and 1. Load the :class:`~sentence_transformers.cross_encoder.CrossEncoder` with ``default_activation_function=torch.nn.Sigmoid()`` to get scores between 0 and 1. This does not affect the ranking. + +:: + + from sentence_transformers import CrossEncoder + + # 1. Load a pre-trained CrossEncoder model + model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2") + + # 2. Predict scores for a pair of sentences + scores = model.predict([ + ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."), + ("How many people live in Berlin?", "Berlin is well known for its museums."), + ]) + # => array([ 8.607138 , -4.3200774], dtype=float32) + + # 3. Rank a list of passages for a query + query = "How many people live in Berlin?" + passages = [ + "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.", + "Berlin is well known for its museums.", + "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.", + "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.", + "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019", + "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.", + "Berlin is subdivided into 12 boroughs or districts (Bezirke).", + "In 2015, the total labour force in Berlin was 1.85 million.", + "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.", + "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.", + ] + ranks = model.rank(query, passages) + + # Print the scores + print("Query:", query) + for rank in ranks: + print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}") + """ + Query: How many people live in Berlin? + 8.92 The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union. + 8.61 Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers. + 8.24 An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population. + 7.60 In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991. + 6.35 In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs. + 5.42 Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union. + 3.45 In 2015, the total labour force in Berlin was 1.85 million. + 0.33 Berlin is subdivided into 12 boroughs or districts (Bezirke). + -4.24 The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019 + -4.32 Berlin is well known for its museums. + """ + +.. 
toctree:: + :maxdepth: 1 + :caption: Tasks + + ../../../examples/applications/retrieve_rerank/README diff --git a/docs/hugging_face.md b/docs/hugging_face.md deleted file mode 100644 index 19348ea46..000000000 --- a/docs/hugging_face.md +++ /dev/null @@ -1,96 +0,0 @@ -# Hugging Face 🤗 - -## The Hugging Face Hub - -In addition to the official [pre-trained models](https://www.sbert.net/docs/pretrained_models.html), you can find over 500 `sentence-transformer` models on the [Hugging Face Hub](http://hf.co/models?library=sentence-transformers&sort=downloads). - -All models on the Hugging Face Hub come with the following: -1. An [automatically generated model card](https://huggingface.co/docs/hub/models-cards#what-are-model-cards) with a description, example code snippets, architecture overview, and more. -2. [Metadata tags](https://huggingface.co/docs/hub/models-cards#model-card-metadata) that help for discoverability and contain additional information such as a usage license. -3. An [interactive widget](https://huggingface.co/docs/hub/models-widgets) you can use to play with the model directly in the browser. -4. An [Inference API](https://huggingface.co/docs/hub/models-inference) that allows you to make inference requests. - - - -## Using Hugging Face models - -Any pre-trained models from the Hub can be loaded with a single line of code: - -```py -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("model_name") -``` - -You can even click `Use in sentence-transformers` to get a code snippet that you can copy and paste! - -
-[image markup omitted]
    - -Here is an example that loads the [multi-qa-MiniLM-L6-cos-v1 model](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) and uses it to encode sentences and then compute the distance between them for doing semantic search. - -```py -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1") - -query_embedding = model.encode("How big is London") -passage_embedding = model.encode([ - "London has 9,787,426 inhabitants at the 2011 census", - "London is known for its finacial district", -]) - -print("Similarity:", util.dot_score(query_embedding, passage_embedding)) -``` - -Here is another example, this time using the [clips/mfaq model](https://huggingface.co/clips/mfaq) for multilingual FAQ retrieval. After embedding the query and the answers, we perform a semantic search to find the most relevant answer. - -```py -from sentence_transformers import SentenceTransformer, util - -question = "How many models can I host on HuggingFace?" -answer_1 = "All plans come with unlimited private models and datasets." -answer_2 = "AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem." -answer_3 = "Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job." - -model = SentenceTransformer("clips/mfaq") -query_embedding = model.encode(question) -corpus_embeddings = model.encode([answer_1, answer_2, answer_3]) - -print(util.semantic_search(query_embedding, corpus_embeddings)) -``` - -## Sharing your models - -Once you've installed the [Hub Client Library](https://huggingface.co/docs/huggingface_hub/quick-start), you can login through your terminal with your Hugging Face account. - -```bash -pip install huggingface_hub -huggingface-cli login -``` - -Then, you can share your SentenceTransformers models by calling the [`push_to_hub` method](https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.push_to_hub) from a trained model. By default, the model will be uploaded to your account, but you can upload to an [organization](https://huggingface.co/docs/hub/organizations) by providing the organization as a part of the `repo_id`, e.g. `model.push_to_hub("my_organization/my_model_name")`. `push_to_hub` automatically generates a model card, an inference widget, example code snippets, and more. - -```py -from sentence_transformers import SentenceTransformer - -# Load or train a model -model.push_to_hub("my_new_model") -``` - -You can automatically add to the Hub's model card a list of datasets you used to train the model with the argument `train_datasets: Optional[List[str]] = None)`. See the "Datasets used to train" section in the [ITESM/sentece-embeddings-BETO](https://huggingface.co/ITESM/sentece-embeddings-BETO) model for an example of the final result. - -```py -model.push_to_hub("my_new_model", train_datasets=["GEM/wiki_lingua", "code_search_net"]) -``` - -## Sharing your embeddings - -The Hugging Face Hub can also be used to store and share any embeddings you generate. You can export your embeddings to CSV, ZIP, Pickle, or any other format, and then upload them to the Hub as a [Dataset](https://huggingface.co/docs/hub/datasets-adding). Read the ["Getting Started With Embeddings" blog post](https://huggingface.co/blog/getting-started-with-embeddings#2-host-embeddings-for-free-on-the-hugging-face-hub) for more information. 
- -## Additional resources - -* [Hugging Face Hub docs](https://huggingface.co/docs/hub/index) -* Integration with Hub [announcement](https://huggingface.co/blog/sentence-transformers-in-the-hub). diff --git a/docs/img/hf-logo.svg b/docs/img/hf-logo.svg new file mode 100644 index 000000000..18797de4f --- /dev/null +++ b/docs/img/hf-logo.svg @@ -0,0 +1,21 @@ +[SVG markup omitted] diff --git a/docs/installation.md b/docs/installation.md index 6fc0b3036..2f694b475 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,31 +1,39 @@ # Installation -We recommend **Python 3.8** or higher, **[PyTorch 1.11.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.32.0](https://github.com/huggingface/transformers)** or higher. +We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**. -## Install SentenceTransformers - -**Install with pip** +## Install with pip Install the *sentence-transformers* with `pip`: ``` pip install -U sentence-transformers ``` -**Install with conda** +## Install with conda -Apple silicon Installation of *sentence-transformers* +Apple silicon installation of *sentence-transformers* ``` conda install -c conda-forge sentence-transformers ``` -**Install from source** +## Install from source + +You can install *sentence-transformers* directly from source to take advantage of the bleeding edge `master` branch rather than the latest stable release: +``` +pip install git+https://github.com/UKPLab/sentence-transformers +``` + +## Editable install -Alternatively, you can also clone the latest version from the [repository](https://github.com/UKPLab/sentence-transformers) and install it directly from the source code: -```` +If you want to make changes to *sentence-transformers*, you will need an editable install. Clone the repository and install it with these commands: +``` +git clone https://github.com/UKPLab/sentence-transformers +cd sentence-transformers pip install -e . -```` +``` + +These commands will link the new `sentence-transformers` folder into your Python library paths, such that this folder is used when importing `sentence-transformers`. ## Install PyTorch with CUDA support -If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA Version. Follow -[PyTorch - Get Started](https://pytorch.org/get-started/locally/) for further details how to install PyTorch. +To use a GPU/CUDA, you must install PyTorch with CUDA support. Follow [PyTorch - Get Started](https://pytorch.org/get-started/locally/) for installation steps. \ No newline at end of file diff --git a/docs/package_reference/SentenceTransformer.md b/docs/package_reference/SentenceTransformer.md deleted file mode 100644 index cb4d36c9a..000000000 --- a/docs/package_reference/SentenceTransformer.md +++ /dev/null @@ -1,15 +0,0 @@ -# SentenceTransformer - -This page documents the properties and methods when you load a SentenceTransformer model: -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("model-name") -``` - -```eval_rst -..
autoclass:: sentence_transformers.SentenceTransformer - :members: - :exclude-members: save_to_hub - -``` \ No newline at end of file diff --git a/docs/package_reference/cross_encoder/cross_encoder.md b/docs/package_reference/cross_encoder/cross_encoder.md new file mode 100644 index 000000000..30c1fcf9d --- /dev/null +++ b/docs/package_reference/cross_encoder/cross_encoder.md @@ -0,0 +1,14 @@ +# CrossEncoder + +## CrossEncoder +For an introduction to Cross-Encoders, see [Cross-Encoders](../../examples/applications/cross-encoder/README.md). +```eval_rst +.. autoclass:: sentence_transformers.cross_encoder.CrossEncoder + :members: +``` + +## Training Inputs + +```eval_rst +.. autoclass:: sentence_transformers.readers.InputExample +``` \ No newline at end of file diff --git a/docs/package_reference/cross_encoder.md b/docs/package_reference/cross_encoder/evaluation.md similarity index 73% rename from docs/package_reference/cross_encoder.md rename to docs/package_reference/cross_encoder/evaluation.md index fc8737d0c..23f9d1265 100644 --- a/docs/package_reference/cross_encoder.md +++ b/docs/package_reference/cross_encoder/evaluation.md @@ -1,19 +1,31 @@ -# cross_encoder -For an introduction to Cross-Encoders, see [Cross-Encoders](../../examples/applications/cross-encoder/README.md). -```eval_rst -.. autoclass:: sentence_transformers.cross_encoder.CrossEncoder - :members: -``` - - -## Evaluation +# Evaluation CrossEncoders have their own evaluation classes, which are in `sentence_transformers.cross_encoder.evaluation`. +## CEBinaryAccuracyEvaluator ```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CEBinaryAccuracyEvaluator +``` +## CEBinaryClassificationEvaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CEBinaryClassificationEvaluator +``` + +## CECorrelationEvaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CECorrelationEvaluator +``` + +## CEF1Evaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CEF1Evaluator +``` + +## CESoftmaxAccuracyEvaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CESoftmaxAccuracyEvaluator +``` + +## CERerankingEvaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CERerankingEvaluator ``` \ No newline at end of file diff --git a/docs/package_reference/cross_encoder/index.rst b/docs/package_reference/cross_encoder/index.rst new file mode 100644 index 000000000..f27406944 --- /dev/null +++ b/docs/package_reference/cross_encoder/index.rst @@ -0,0 +1,9 @@ + +Cross Encoder +============= + +.. toctree:: + :hidden: + + cross_encoder + evaluation \ No newline at end of file diff --git a/docs/package_reference/quantization.md b/docs/package_reference/quantization.md deleted file mode 100644 index 4e47fd112..000000000 --- a/docs/package_reference/quantization.md +++ /dev/null @@ -1,7 +0,0 @@ -# quantization -`sentence_transformers.quantization` defines different helpful functions to quantize. - -```eval_rst -..
automodule:: sentence_transformers.quantization - :members: quantize_embeddings, semantic_search_faiss, semantic_search_usearch -``` diff --git a/docs/package_reference/sentence_transformer/SentenceTransformer.md b/docs/package_reference/sentence_transformer/SentenceTransformer.md new file mode 100644 index 000000000..1a1b3e71c --- /dev/null +++ b/docs/package_reference/sentence_transformer/SentenceTransformer.md @@ -0,0 +1,20 @@ +# SentenceTransformer + +## SentenceTransformer +```eval_rst +.. autoclass:: sentence_transformers.SentenceTransformer + :members: + :inherited-members: fit, old_fit + :exclude-members: save_to_hub, add_module, append, apply, buffers, children, extra_repr, forward, get_buffer, get_extra_state, get_parameter, get_submodule, ipu, load_state_dict, modules, named_buffers, named_children, named_modules, named_parameters, parameters, register_backward_hook, register_buffer, register_forward_hook, register_forward_pre_hook, register_full_backward_hook, register_full_backward_pre_hook, register_load_state_dict_post_hook, register_module, register_parameter, register_state_dict_pre_hook, requires_grad_, set_extra_state, share_memory, state_dict, to_empty, type, xpu, zero_grad +``` + +## SentenceTransformerModelCardData +```eval_rst +.. autoclass:: sentence_transformers.model_card.SentenceTransformerModelCardData +``` + +## SimilarityFunction +```eval_rst +.. autoclass:: sentence_transformers.SimilarityFunction + :members: +``` \ No newline at end of file diff --git a/docs/package_reference/datasets.md b/docs/package_reference/sentence_transformer/datasets.md similarity index 100% rename from docs/package_reference/datasets.md rename to docs/package_reference/sentence_transformer/datasets.md diff --git a/docs/package_reference/evaluation.md b/docs/package_reference/sentence_transformer/evaluation.md similarity index 67% rename from docs/package_reference/evaluation.md rename to docs/package_reference/sentence_transformer/evaluation.md index eb1c46c6e..df5fb258c 100644 --- a/docs/package_reference/evaluation.md +++ b/docs/package_reference/sentence_transformer/evaluation.md @@ -1,17 +1,52 @@ # Evaluation `sentence_transformers.evaluation` defines different classes that can be used to evaluate the model during training. +## BinaryClassificationEvaluator ```eval_rst .. autoclass:: sentence_transformers.evaluation.BinaryClassificationEvaluator +``` + +## EmbeddingSimilarityEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.EmbeddingSimilarityEvaluator +``` + +## InformationRetrievalEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.InformationRetrievalEvaluator -.. autoclass:: sentence_transformers.evaluation.LabelAccuracyEvaluator +``` + +## MSEEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.MSEEvaluator -.. autoclass:: sentence_transformers.evaluation.MSEEvaluatorFromDataFrame +``` + +## ParaphraseMiningEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.ParaphraseMiningEvaluator +``` + +## RerankingEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.RerankingEvaluator +``` + +## SentenceEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.SentenceEvaluator +``` + +## SequentialEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.SequentialEvaluator +``` + +## TranslationEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.TranslationEvaluator +``` + +## TripletEvaluator +```eval_rst ..
autoclass:: sentence_transformers.evaluation.TripletEvaluator ``` diff --git a/docs/package_reference/sentence_transformer/index.rst b/docs/package_reference/sentence_transformer/index.rst new file mode 100644 index 000000000..063ed31e1 --- /dev/null +++ b/docs/package_reference/sentence_transformer/index.rst @@ -0,0 +1,15 @@ + +Sentence Transformer +==================== + +.. toctree:: + :hidden: + + SentenceTransformer + trainer + training_args + losses + evaluation + datasets + models + quantization \ No newline at end of file diff --git a/docs/package_reference/losses.md b/docs/package_reference/sentence_transformer/losses.md similarity index 93% rename from docs/package_reference/losses.md rename to docs/package_reference/sentence_transformer/losses.md index 65475427d..db5b4a66c 100644 --- a/docs/package_reference/losses.md +++ b/docs/package_reference/sentence_transformer/losses.md @@ -1,7 +1,7 @@ # Losses `sentence_transformers.losses` defines different loss functions that can be used to fine-tune embedding models on training data. The choice of loss function plays a critical role when fine-tuning the model. It determines how well our embedding model will work for the specific downstream task. -Sadly, there is no "one size fits all" loss function. Which loss function is suitable depends on the available training data and on the target task. Consider checking out the [Loss Overview](../training/loss_overview.html) to help narrow down your choice of loss function(s). +Sadly, there is no "one size fits all" loss function. Which loss function is suitable depends on the available training data and on the target task. Consider checking out the [Loss Overview](../../sentence_transformer/loss_overview.html) to help narrow down your choice of loss function(s). ## BatchAllTripletLoss ```eval_rst @@ -57,8 +57,7 @@ Sadly, there is no "one size fits all" loss function. Which loss function is sui ## CosineSimilarityLoss -![SBERT Siamese Network Architecture](../img/SBERT_Siamese_Network.png "SBERT Siamese Architecture") - +SBERT Siamese Network Architecture For each sentence pair, we pass sentence A and sentence B through our network which yields the embeddings *u* and *v*. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. diff --git a/docs/package_reference/models.md b/docs/package_reference/sentence_transformer/models.md similarity index 100% rename from docs/package_reference/models.md rename to docs/package_reference/sentence_transformer/models.md diff --git a/docs/package_reference/sentence_transformer/quantization.md b/docs/package_reference/sentence_transformer/quantization.md new file mode 100644 index 000000000..cc53a0ef5 --- /dev/null +++ b/docs/package_reference/sentence_transformer/quantization.md @@ -0,0 +1,12 @@ +# quantization +`sentence_transformers.quantization` defines different helpful functions to perform embedding quantization. + +```eval_rst +.. note:: + `Embedding Quantization <../../../examples/applications/embedding-quantization/README.html>`_ differs from model quantization. The former shrinks the size of embeddings such that semantic search/retrieval is faster and requires less memory and disk space. The latter refers to lowering the precision of the model weights to speed up inference. This page only shows documentation for the former. +``` + +```eval_rst +..
automodule:: sentence_transformers.quantization + :members: quantize_embeddings, semantic_search_faiss, semantic_search_usearch +``` diff --git a/docs/package_reference/sentence_transformer/trainer.md b/docs/package_reference/sentence_transformer/trainer.md new file mode 100644 index 000000000..64e03f84e --- /dev/null +++ b/docs/package_reference/sentence_transformer/trainer.md @@ -0,0 +1,11 @@ + +# Trainer + +## SentenceTransformerTrainer + +```eval_rst +.. autoclass:: sentence_transformers.trainer.SentenceTransformerTrainer + :members: + :inherited-members: + :exclude-members: autocast_smart_context_manager, collect_features, compute_loss_context_manager, evaluation_loop, floating_point_ops, get_decay_parameter_names, get_optimizer_cls_and_kwargs, init_hf_repo, log_metrics, metrics_format, num_examples, num_tokens, predict, prediction_loop, prediction_step, save_metrics, save_model, save_state, training_step +``` \ No newline at end of file diff --git a/docs/package_reference/sentence_transformer/training_args.md b/docs/package_reference/sentence_transformer/training_args.md new file mode 100644 index 000000000..0c68fe97c --- /dev/null +++ b/docs/package_reference/sentence_transformer/training_args.md @@ -0,0 +1,21 @@ + +# Training Arguments + +## SentenceTransformerTrainingArguments +```eval_rst +.. autoclass:: sentence_transformers.training_args.SentenceTransformerTrainingArguments + :members: + :inherited-members: +``` + +## BatchSamplers +```eval_rst +.. autoclass:: sentence_transformers.training_args.BatchSamplers + :members: +``` + +## MultiDatasetBatchSamplers +```eval_rst +.. autoclass:: sentence_transformers.training_args.MultiDatasetBatchSamplers + :members: +``` \ No newline at end of file diff --git a/docs/package_reference/util.md b/docs/package_reference/util.md index a3f30fb15..690b4cd19 100644 --- a/docs/package_reference/util.md +++ b/docs/package_reference/util.md @@ -1,7 +1,15 @@ # util `sentence_transformers.util` defines different helpful functions to work with text embeddings. +## Helper Functions ```eval_rst .. automodule:: sentence_transformers.util - :members: cos_sim, dot_score, paraphrase_mining, semantic_search, community_detection, http_get, truncate_embeddings + :members: paraphrase_mining, semantic_search, community_detection, http_get, truncate_embeddings, normalize_embeddings +``` + +## Similarity Metrics + +```eval_rst +.. automodule:: sentence_transformers.util + :members: cos_sim, pairwise_cos_sim, dot_score, pairwise_dot_score, manhattan_sim, pairwise_manhattan_sim, euclidean_sim, pairwise_euclidean_sim ``` diff --git a/docs/pretrained-models/msmarco-v1.md b/docs/pretrained-models/msmarco-v1.md index be537f2d7..3123bfc03 100644 --- a/docs/pretrained-models/msmarco-v1.md +++ b/docs/pretrained-models/msmarco-v1.md @@ -6,7 +6,6 @@ The training data consists of over 500k examples, while the complete corpus con ## Version History -As we work on the topic, we will publish updated (and improved) models. ### v1 Version 1 models were trained on the training set of the MS Marco Passage retrieval task. The models were trained using in-batch negative sampling via the MultipleNegativesRankingLoss with a scaling factor of 20 and a batch size of 128.
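As a hedged sketch of that v1 recipe, the following shows roughly how in-batch negative training with `MultipleNegativesRankingLoss` (scaling factor 20, batch size 128) might be reproduced with the `SentenceTransformerTrainer` introduced in this PR; the base model and the two-row dataset are placeholders rather than the original MS MARCO setup:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Placeholder (query, positive passage) pairs standing in for the MS MARCO training set
train_dataset = Dataset.from_dict({
    "query": [
        "how many people live in berlin",
        "what is the capital of france",
    ],
    "passage": [
        "Berlin had a population of 3,520,031 registered inhabitants.",
        "Paris is the capital and most populous city of France.",
    ],
})

model = SentenceTransformer("distilroberta-base")
# In-batch negative sampling with a scaling factor of 20
loss = MultipleNegativesRankingLoss(model, scale=20)

args = SentenceTransformerTrainingArguments(
    output_dir="output/msmarco-sketch",
    per_device_train_batch_size=128,  # the batch size used for the v1 models
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```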
diff --git a/docs/pretrained-models/msmarco-v2.md b/docs/pretrained-models/msmarco-v2.md index c9a88e4df..23a528d6d 100644 --- a/docs/pretrained-models/msmarco-v2.md +++ b/docs/pretrained-models/msmarco-v2.md @@ -34,6 +34,5 @@ As a baseline, we show the results for lexical search with BM25 using Elasticsearch ## Version History -As we work on the topic, we will publish updated (and improved) models. - - [Version 1](msmarco-v1.md) diff --git a/docs/pretrained-models/msmarco-v3.md b/docs/pretrained-models/msmarco-v3.md index e5134fd97..8ba14c798 100644 --- a/docs/pretrained-models/msmarco-v3.md +++ b/docs/pretrained-models/msmarco-v3.md @@ -58,7 +58,6 @@ If they received a low score by the cross-encoder, we saved them as hard negativ We then trained the v2 models with these new hard negatives. ## Version History -As we work on the topic, we will publish updated (and improved) models. - [Version 2](msmarco-v2.md) - [Version 1](msmarco-v1.md) diff --git a/docs/pretrained-models/msmarco-v5.md b/docs/pretrained-models/msmarco-v5.md index d3f29ca71..9f93c0741 100644 --- a/docs/pretrained-models/msmarco-v5.md +++ b/docs/pretrained-models/msmarco-v5.md @@ -65,7 +65,6 @@ If they received a low score by the cross-encoder, we saved them as hard negativ We then trained the v2 models with these new hard negatives. ## Version History -As we work on the topic, we will publish updated (and improved) models. - [Version 3](msmarco-v3.md) - [Version 2](msmarco-v2.md) diff --git a/docs/pretrained_cross-encoders.md b/docs/pretrained_cross-encoders.md deleted file mode 100644 index e95097e8d..000000000 --- a/docs/pretrained_cross-encoders.md +++ /dev/null @@ -1,93 +0,0 @@ -# Pretrained Cross-Encoders - -This page lists available **pretrained Cross-Encoders**. Cross-Encoders require the input of a text pair and output a score 0...1. They do not work for individual sentences and they don't compute embeddings for individual texts. - -![BiEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png) - -## MS MARCO -[MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset with real user queries from Bing search engine with annotated relevant text passages. - -These models can be used like this: -```python -from sentence_transformers import CrossEncoder - -model = CrossEncoder("model_name", max_length=512) -scores = model.predict([("Query1", "Paragraph1"), ("Query1", "Paragraph2")]) - -# For Example -scores = model.predict([ - ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."), - ("How many people live in Berlin?", "Berlin is well known for its museums."), -]) -``` - -- **cross-encoder/ms-marco-TinyBERT-L-2-v2** - MRR@10 on MS Marco Dev Set: 32.56 -- **cross-encoder/ms-marco-MiniLM-L-2-v2** - MRR@10 on MS Marco Dev Set: 34.85 -- **cross-encoder/ms-marco-MiniLM-L-4-v2** - MRR@10 on MS Marco Dev Set: 37.70 -- **cross-encoder/ms-marco-MiniLM-L-6-v2** - MRR@10 on MS Marco Dev Set: 39.01 -- **cross-encoder/ms-marco-MiniLM-L-12-v2** - MRR@10 on MS Marco Dev Set: 39.02 - - -For details on the usage, see [Applications - Information Retrieval](../examples/applications/retrieve_rerank/README.md) - - -[MS MARCO Cross-Encoders - More details](pretrained-models/ce-msmarco.md) - -## SQuAD (QNLI) - -QNLI is based on the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) and was introduced by the [GLUE Benchmark](https://arxiv.org/abs/1804.07461).
Given a passage from Wikipedia, annotators created questions that are answerable by that passage. - -- **cross-encoder/qnli-distilroberta-base** - Accuracy on QNLI dev set: 90.96 -- **cross-encoder/qnli-electra-base** - Accuracy on QNLI dev set: 93.21 - - -## STSbenchmark -The following models can be used like this: -```python -from sentence_transformers import CrossEncoder - -model = CrossEncoder("model_name") -scores = model.predict([("Sent A1", "Sent B1"), ("Sent A2", "Sent B2")]) -``` - -They return a score 0...1 indicating the semantic similarity of the given sentence pair. -- **cross-encoder/stsb-TinyBERT-L-4** - STSbenchmark test performance: 85.50 -- **cross-encoder/stsb-distilroberta-base** - STSbenchmark test performance: 87.92 -- **cross-encoder/stsb-roberta-base** - STSbenchmark test performance: 90.17 -- **cross-encoder/stsb-roberta-large** - STSbenchmark test performance: 91.47 - -## Quora Duplicate Questions -These models have been trained on the [Quora duplicate questions dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). They can used like the STSb models and give a score 0...1 indicating the probability that two questions are duplicate questions. - -- **cross-encoder/quora-distilroberta-base** - Average Precision dev set: 87.48 -- **cross-encoder/quora-roberta-base** - Average Precision dev set: 87.80 -- **cross-encoder/quora-roberta-large** - Average Precision dev set: 87.91 - -Note: The model don't work for question similarity. The question *How to learn Java* and *How to learn Python* will get a low score, as these questions are not duplicates. For question similarity, the respective bi-encoder trained on the Quora dataset yields much more meaningful results. - - - -## NLI -Given two sentences, are these contradicting each other, entailing one the other or are these netural? The following models were trained on the [SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) datasets. 
-- **cross-encoder/nli-deberta-v3-base** - Accuracy on MNLI mismatched set: 90.04 -- **cross-encoder/nli-deberta-base** - Accuracy on MNLI mismatched set: 88.08 -- **cross-encoder/nli-deberta-v3-xsmall** - Accuracy on MNLI mismatched set: 87.77 -- **cross-encoder/nli-deberta-v3-small** - Accuracy on MNLI mismatched set: 87.55 -- **cross-encoder/nli-roberta-base** - Accuracy on MNLI mismatched set: 87.47 -- **cross-encoder/nli-MiniLM2-L6-H768** - Accuracy on MNLI mismatched set: 86.89 -- **cross-encoder/nli-distilroberta-base** - Accuracy on MNLI mismatched set: 83.98 - -```python -from sentence_transformers import CrossEncoder - -model = CrossEncoder("model_name") -scores = model.predict([ - ("A man is eating pizza", "A man eats something"), - ("A black race car starts up in front of a crowd of people.", "A man is driving down a lonely road."), -]) - -# Convert scores to labels -label_mapping = ["contradiction", "entailment", "neutral"] -labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)] -``` - diff --git a/docs/quickstart.md b/docs/quickstart.md deleted file mode 100644 index cf6c003aa..000000000 --- a/docs/quickstart.md +++ /dev/null @@ -1,96 +0,0 @@ -# Quickstart -Once you have [installed](installation.md) Sentence Transformers, the usage is simple: -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Our sentences we like to encode -sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of string.", - "The quick brown fox jumps over the lazy dog.", -] - -# Sentences are encoded by calling model.encode() -sentence_embeddings = model.encode(sentences) - -# Print the embeddings -for sentence, embedding in zip(sentences, sentence_embeddings): - print("Sentence:", sentence) - print("Embedding:", embedding) - print("") -``` - - -With `SentenceTransformer('all-MiniLM-L6-v2')` we define which sentence transformer model we like to load. In this example, we load [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which is a MiniLM model finetuned on a large dataset of over 1 billion training pairs. - -BERT (and other transformer networks) output for each token in our input text an embedding. In order to create a fixed-sized sentence embedding out of this, the model applies mean pooling, i.e., the output embeddings for all tokens are averaged to yield a fixed-sized vector. - -## Comparing Sentence Similarities - -The sentences (texts) are mapped such that sentences with similar meanings are close in vector space. One common method to measure the similarity in vector space is to use [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). 
For two sentences, this can be done like this: - -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Sentences are encoded by calling model.encode() -emb1 = model.encode("This is a red cat with a hat.") -emb2 = model.encode("Have you seen my red cat?") - -cos_sim = util.cos_sim(emb1, emb2) -print("Cosine-Similarity:", cos_sim) -``` - -If you have a list with more sentences, you can use the following code example: -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("all-MiniLM-L6-v2") - -sentences = [ - "A man is eating food.", - "A man is eating a piece of bread.", - "The girl is carrying a baby.", - "A man is riding a horse.", - "A woman is playing violin.", - "Two men pushed carts through the woods.", - "A man is riding a white horse on an enclosed ground.", - "A monkey is playing drums.", - "Someone in a gorilla costume is playing a set of drums.", -] - -# Encode all sentences -embeddings = model.encode(sentences) - -# Compute cosine similarity between all pairs -cos_sim = util.cos_sim(embeddings, embeddings) - -# Add all pairs to a list with their cosine similarity score -all_sentence_combinations = [] -for i in range(len(cos_sim) - 1): - for j in range(i + 1, len(cos_sim)): - all_sentence_combinations.append([cos_sim[i][j], i, j]) - -# Sort list by the highest cosine similarity score -all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True) - -print("Top-5 most similar pairs:") -for score, i, j in all_sentence_combinations[0:5]: - print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j])) -``` - -See on the left the *Usage* sections for more examples how to use SentenceTransformers. - -## Pre-Trained Models -Various pre-trained models exists optimized for many tasks exists. For a full list, see **[Pretrained Models](pretrained_models.md)**. - - - -## Training your own Embeddings - -Training your own sentence embeddings models for all type of use-cases is easy and requires often only minimal coding effort. For a comprehensive tutorial, see [Training/Overview](training/overview.md). - -You can also extend easily existent sentence embeddings models to **further languages**. For details, see [Multi-Lingual Training](../examples/training/multilingual/README). diff --git a/docs/quickstart.rst b/docs/quickstart.rst new file mode 100644 index 000000000..38461820f --- /dev/null +++ b/docs/quickstart.rst @@ -0,0 +1,160 @@ +Quickstart +========== + +Sentence Transformer +-------------------- + +Characteristics of Sentence Transformer (a.k.a. bi-encoder) models: + +1. Calculates a **fixed-size vector representation (embedding)** given **texts or images**. +2. Embedding calculation is often **efficient**, embedding similarity calculation is **very fast**. +3. Applicable for a **wide range of tasks**, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more. +4. Often used as a **first step in a two-step retrieval process**, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder. + +Once you have `installed `_ Sentence Transformers, you can easily use Sentence Transformer models: + +.. sidebar:: Documentation + + 1. :class:`SentenceTransformer ` + 2. :meth:`SentenceTransformer.encode ` + 3.
:meth:`SentenceTransformer.similarity ` + + **Other useful methods and links:** + + - :meth:`SentenceTransformer.similarity_pairwise ` + - `SentenceTransformer > Usage <./sentence_transformer/usage/usage.html>`_ + - `SentenceTransformer > Pretrained Models <./sentence_transformer/pretrained_models.html>`_ + - `SentenceTransformer > Training Overview <./sentence_transformer/training_overview.html>`_ + - `SentenceTransformer > Dataset Overview <./sentence_transformer/dataset_overview.html>`_ + - `SentenceTransformer > Loss Overview <./sentence_transformer/loss_overview.html>`_ + - `SentenceTransformer > Training Examples <./sentence_transformer/training/examples.html>`_ + +:: + + from sentence_transformers import SentenceTransformer + + # 1. Load a pretrained Sentence Transformer model + model = SentenceTransformer("all-MiniLM-L6-v2") + + # The sentences to encode + sentences = [ + "The weather is lovely today.", + "It's so sunny outside!", + "He drove to the stadium.", + ] + + # 2. Calculate embeddings by calling model.encode() + embeddings = model.encode(sentences) + print(embeddings.shape) + # (3, 384) + + # 3. Calculate the embedding similarities + similarities = model.similarity(embeddings, embeddings) + print(similarities) + # tensor([[1.0000, 0.6660, 0.1046], + # [0.6660, 1.0000, 0.1411], + # [0.1046, 0.1411, 1.0000]]) + +With ``SentenceTransformer("all-MiniLM-L6-v2")`` we pick which `Sentence Transformer model `_ we load. In this example, we load `all-MiniLM-L6-v2 `_, which is a MiniLM model finetuned on a large dataset of over 1 billion training pairs. Using `SentenceTransformer.similarity() <./package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity>`_, we compute the similarity between all pairs of sentences. As expected, the similarity between the first two sentences (0.6660) is higher than the similarity between the first and the third sentence (0.1046) or the second and the third sentence (0.1411). + +Finetuning Sentence Transformer models is easy and requires only a few lines of code. For more information, see the `Training Overview <./sentence_transformer/training_overview.html>`_ section. + +Cross Encoder +------------- + +Characteristics of Cross Encoder (a.k.a. reranker) models: + +1. Calculates a **similarity score** given **pairs of texts**. +2. Generally provides **superior performance** compared to a Sentence Transformer (a.k.a. bi-encoder) model. +3. Often **slower** than a Sentence Transformer model, as it requires computation for each pair rather than each text. +4. Due to the previous 2 characteristics, Cross Encoders are often used to **re-rank the top-k results** from a Sentence Transformer model. + +The usage for Cross Encoder (a.k.a. reranker) models is similar to Sentence Transformers: + +.. sidebar:: Documentation + + 1. :class:`CrossEncoder ` + 2. :meth:`CrossEncoder.rank ` + 3. :meth:`CrossEncoder.predict ` + + **Other useful methods and links:** + + - `CrossEncoder > Usage <./cross_encoder/usage/usage.html>`_ + - `CrossEncoder > Pretrained Models <./cross_encoder/pretrained_models.html>`_ + - `CrossEncoder > Training Overview <./cross_encoder/training_overview.html>`_ + - `CrossEncoder > Dataset Overview <./cross_encoder/dataset_overview.html>`_ + - `CrossEncoder > Loss Overview <./cross_encoder/loss_overview.html>`_ + - `CrossEncoder > Training Examples <./cross_encoder/training/examples.html>`_ + +:: + + from sentence_transformers.cross_encoder import CrossEncoder + + # 1.
Load a pretrained CrossEncoder model + model = CrossEncoder("cross-encoder/stsb-distilroberta-base") + + # We want to compute the similarity between the query sentence... + query = "A man is eating pasta." + + # ... and all sentences in the corpus + corpus = [ + "A man is eating food.", + "A man is eating a piece of bread.", + "The girl is carrying a baby.", + "A man is riding a horse.", + "A woman is playing violin.", + "Two men pushed carts through the woods.", + "A man is riding a white horse on an enclosed ground.", + "A monkey is playing drums.", + "A cheetah is running behind its prey.", + ] + + # 2. We rank all sentences in the corpus for the query + ranks = model.rank(query, corpus) + + # Print the scores + print("Query: ", query) + for rank in ranks: + print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}") + """ + Query: A man is eating pasta. + 0.67 A man is eating food. + 0.34 A man is eating a piece of bread. + 0.08 A man is riding a horse. + 0.07 A man is riding a white horse on an enclosed ground. + 0.01 The girl is carrying a baby. + 0.01 Two men pushed carts through the woods. + 0.01 A monkey is playing drums. + 0.01 A woman is playing violin. + 0.01 A cheetah is running behind its prey. + """ + + # 3. Alternatively, you can also manually compute the score between two sentences + import numpy as np + + sentence_combinations = [[query, sentence] for sentence in corpus] + scores = model.predict(sentence_combinations) + + # Sort the scores in decreasing order to get the corpus indices + ranked_indices = np.argsort(scores)[::-1] + print("Scores:", scores) + print("Indices:", ranked_indices) + """ + Scores: [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378, 0.00536814, 0.06676237, 0.00534825, 0.00516717] + Indices: [0 1 3 6 2 5 7 4 8] + """ + +With ``CrossEncoder("cross-encoder/stsb-distilroberta-base")`` we pick which `CrossEncoder model <./cross_encoder/pretrained_models.html>`_ we load. In this example, we load `cross-encoder/stsb-distilroberta-base `_, which is a `DistilRoBERTa `_ model finetuned on the `STS Benchmark `_ dataset. + +Next Steps +---------- + +Consider reading one of the following sections next: + +* `Sentence Transformers > Usage <./sentence_transformer/usage/usage.html>`_ +* `Sentence Transformers > Pretrained Models <./sentence_transformer/pretrained_models.html>`_ +* `Sentence Transformers > Training Overview <./sentence_transformer/training_overview.html>`_ +* `Sentence Transformers > Training Examples > Multilingual Models <../examples/training/multilingual/README.html>`_ +* `Cross Encoder > Usage <./cross_encoder/usage/usage.html>`_ +* `Cross Encoder > Pretrained Models <./cross_encoder/pretrained_models.html>`_ + diff --git a/docs/requirements.txt b/docs/requirements.txt index 0721a6b8a..bbc151601 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -4,4 +4,5 @@ sphinx<4 Jinja2<3.1 sphinx_markdown_tables recommonmark==0.7.1 +sphinx-tabs==3.4.5 -e .. \ No newline at end of file diff --git a/docs/sentence_transformer/dataset_overview.md b/docs/sentence_transformer/dataset_overview.md new file mode 100644 index 000000000..95923a3e8 --- /dev/null +++ b/docs/sentence_transformer/dataset_overview.md @@ -0,0 +1,121 @@ +# Dataset Overview + +```eval_rst +.. hint:: + + **Quickstart:** Find `curated datasets `_ or `community datasets `_, choose a loss function via this `loss overview `_, and `verify `_ that it works with your dataset. 
+```
+
+It is important that your dataset format matches your loss function (or that you choose a loss function that matches your dataset format). See [Training Overview > Dataset Format](./training_overview.html#dataset-format) to learn how to verify whether a dataset format works with a loss function.
+
+In practice, most dataset configurations will take one of four forms:
+
+- **Positive Pair**: A pair of related sentences. This can be used both for symmetric tasks (semantic textual similarity) and asymmetric tasks (semantic search), with examples including pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (`query`, `response`), or pairs of (`source_language`, `target_language`). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.
+  - **Examples:** [sentence-transformers/sentence-compression](https://huggingface.co/datasets/sentence-transformers/sentence-compression), [sentence-transformers/coco-captions](https://huggingface.co/datasets/sentence-transformers/coco-captions), [sentence-transformers/codesearchnet](https://huggingface.co/datasets/sentence-transformers/codesearchnet), [sentence-transformers/natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions), [sentence-transformers/gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq), [sentence-transformers/squad](https://huggingface.co/datasets/sentence-transformers/squad), [sentence-transformers/wikihow](https://huggingface.co/datasets/sentence-transformers/wikihow), [sentence-transformers/eli5](https://huggingface.co/datasets/sentence-transformers/eli5)
+- **Triplets**: (anchor, positive, negative) text triplets. These datasets don't need labels.
+  - **Examples:** [sentence-transformers/quora-duplicates](https://huggingface.co/datasets/sentence-transformers/quora-duplicates), [nirantk/triplets](https://huggingface.co/datasets/nirantk/triplets), [sentence-transformers/all-nli](https://huggingface.co/datasets/sentence-transformers/all-nli)
+- **Pair with Similarity Score**: A pair of sentences with a score indicating their similarity. Common examples are "Semantic Textual Similarity" datasets.
+  - **Examples:** [sentence-transformers/stsb](https://huggingface.co/datasets/sentence-transformers/stsb), [PhilipMay/stsb_multi_mt](https://huggingface.co/datasets/PhilipMay/stsb_multi_mt).
+- **Texts with Classes**: A text with its corresponding class. This data format is easily converted by loss functions into three sentences (triplets) where the first is an "anchor", the second a "positive" of the same class as the anchor, and the third a "negative" of a different class.
+  - **Examples:** [trec](https://huggingface.co/datasets/trec), [yahoo_answers_topics](https://huggingface.co/datasets/yahoo_answers_topics).
+
+Note that it is often simple to transform a dataset from one format to another, such that it works with your loss function of choice.
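+
+```eval_rst
+For example, a minimal sketch of such a conversion: turning the "pair-class" subset of `sentence-transformers/all-nli <https://huggingface.co/datasets/sentence-transformers/all-nli>`_ into the "Positive Pair" format, assuming the usual NLI label mapping where ``0`` means entailment::
+
+    from datasets import load_dataset
+
+    # Keep only the entailment pairs, then drop the label column so that
+    # the two remaining text columns form positive pairs
+    dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train")
+    positive_pairs = dataset.filter(lambda example: example["label"] == 0)
+    positive_pairs = positive_pairs.remove_columns("label")
+    print(positive_pairs)
+```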
+ +## Datasets on the Hugging Face Hub + +```eval_rst +The `Datasets library `_ (``pip install datasets``) allows you to load datasets from the Hugging Face Hub with the :func:`~datasets.load_dataset` function:: + + from datasets import load_dataset + + # Indicate the dataset id from the Hub + dataset_id = "sentence-transformers/natural-questions" + dataset = load_dataset(dataset_id, split="train") + """ + Dataset({ + features: ['query', 'answer'], + num_rows: 100231 + }) + """ + print(dataset[0]) + """ + { + 'query': 'when did richmond last play in a preliminary final', + 'answer': "Richmond Football Club Richmond began 2017 with 5 straight wins, a feat it had not achieved since 1995. A series of close losses hampered the Tigers throughout the middle of the season, including a 5-point loss to the Western Bulldogs, 2-point loss to Fremantle, and a 3-point loss to the Giants. Richmond ended the season strongly with convincing victories over Fremantle and St Kilda in the final two rounds, elevating the club to 3rd on the ladder. Richmond's first final of the season against the Cats at the MCG attracted a record qualifying final crowd of 95,028; the Tigers won by 51 points. Having advanced to the first preliminary finals for the first time since 2001, Richmond defeated Greater Western Sydney by 36 points in front of a crowd of 94,258 to progress to the Grand Final against Adelaide, their first Grand Final appearance since 1982. The attendance was 100,021, the largest crowd to a grand final since 1986. The Crows led at quarter time and led by as many as 13, but the Tigers took over the game as it progressed and scored seven straight goals at one point. They eventually would win by 48 points – 16.12 (108) to Adelaide's 8.12 (60) – to end their 37-year flag drought.[22] Dustin Martin also became the first player to win a Premiership medal, the Brownlow Medal and the Norm Smith Medal in the same season, while Damien Hardwick was named AFL Coaches Association Coach of the Year. Richmond's jump from 13th to premiers also marked the biggest jump from one AFL season to the next." + } + """ +``` + +For more information on how to manipulate your dataset see the [Datasets Documentation](https://huggingface.co/docs/datasets/access). + +```eval_rst +.. tip:: + + It's common for Hugging Face Datasets to contain extraneous columns, e.g. sample_id, metadata, source, type, etc. You can use :meth:`Dataset.remove_columns ` to remove these columns, as they will be used as inputs otherwise. You can also use :meth:`Dataset.select_columns ` to keep only the desired columns. +``` + +## Pre-existing Datasets + +The [Hugging Face Hub](https://huggingface.co/datasets) hosts 150k+ datasets, many of which can be converted for training embedding models. +We are aiming to tag all Hugging Face datasets that work out of the box with Sentence Transformers with `sentence-transformers`, allowing you to easily find them by browsing to [https://huggingface.co/datasets?other=sentence-transformers](https://huggingface.co/datasets?other=sentence-transformers). We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks. 
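+
+```eval_rst
+For example, a minimal sketch of loading one of the tagged datasets from the table below and inspecting its columns; column names differ per dataset, so check them before training::
+
+    from datasets import load_dataset
+
+    # Load a dataset tagged with `sentence-transformers` and inspect its columns
+    dataset = load_dataset("sentence-transformers/gooaq", split="train")
+    print(dataset.column_names)
+    # e.g. ['question', 'answer']
+```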
+ +These are some of the popular pre-existing datasets tagged as ``sentence-transformers`` that can be used to train and fine-tune SentenceTransformer models: + +| Dataset | Description | +|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------| +| [GooAQ](https://huggingface.co/datasets/sentence-transformers/gooaq) | (Question, Answer) pairs from Google auto suggest | +| [Yahoo Answers](https://huggingface.co/datasets/sentence-transformers/yahoo-answers) | (Title+Question, Answer), (Title, Answer), (Title, Question), (Question, Answer) pairs from Yahoo Answers | +| [MS MARCO Triplets (msmarco-distilbert-base-tas-b)](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-tas-b) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (msmarco-distilbert-base-v3)](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (msmarco-MiniLM-L-6-v3)](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-MiniLM-L-6-v3) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-cls-dot-v2)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v2) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-cls-dot-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-mean-dot-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-mean-dot-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (mpnet-margin-mse-mean-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-mpnet-margin-mse-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (co-condenser-margin-mse-cls-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-cls-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-mnrl-mean-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-sym-mnrl-mean-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-sym-mnrl-mean-v2)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v2) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets 
(co-condenser-margin-mse-sym-mnrl-mean-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (BM25)](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [Stack Exchange Duplicates](https://huggingface.co/datasets/sentence-transformers/stackexchange-duplicates) | (Title, Title), (Title+Body, Title+Body), (Body, Body) pairs of duplicate questions from StackExchange | +| [ELI5](https://huggingface.co/datasets/sentence-transformers/eli5) | (Question, Answer) pairs from ELI5 dataset | +| [SQuAD](https://huggingface.co/datasets/sentence-transformers/squad) | (Question, Answer) pairs from SQuAD dataset | +| [WikiHow](https://huggingface.co/datasets/sentence-transformers/wikihow) | (Summary, Text) pairs from WikiHow | +| [Amazon Reviews 2018](https://huggingface.co/datasets/sentence-transformers/amazon-reviews) | (Title, review) pairs from Amazon Reviews | +| [Natural Questions](https://huggingface.co/datasets/sentence-transformers/natural-questions) | (Query, Answer) pairs from the Natural Questions dataset | +| [Amazon QA](https://huggingface.co/datasets/sentence-transformers/amazon-qa) | (Question, Answer) pairs from Amazon | +| [S2ORC](https://huggingface.co/datasets/sentence-transformers/s2orc) | (Title, Abstract), (Abstract, Citation), (Title, Citation) pairs of scientific papers | +| [Quora Duplicates](https://huggingface.co/datasets/sentence-transformers/quora-duplicates) | Duplicate question pairs from Quora | +| [WikiAnswers](https://huggingface.co/datasets/sentence-transformers/wikianswers-duplicates) | Duplicate question pairs from WikiAnswers | +| [AGNews](https://huggingface.co/datasets/sentence-transformers/agnews) | (Title, Description) pairs of news articles from the AG News dataset | +| [AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli) | (Anchor, Entailment, Contradiction) triplets from SNLI + MultiNLI | +| [NPR](https://huggingface.co/datasets/sentence-transformers/npr) | (Title, Body) pairs from the npr.org website | +| [SPECTER](https://huggingface.co/datasets/sentence-transformers/specter) | (Title, Positive Title, Negative Title) triplets of Scientific Publications from Specter | +| [Simple Wiki](https://huggingface.co/datasets/sentence-transformers/simple-wiki) | (English, Simple English) pairs from Wikipedia | +| [PAQ](https://huggingface.co/datasets/sentence-transformers/paq) | (Query, Answer) from the Probably-Asked Questions dataset | +| [altlex](https://huggingface.co/datasets/sentence-transformers/altlex) | (English, Simple English) pairs from Wikipedia | +| [CC News](https://huggingface.co/datasets/sentence-transformers/ccnews) | (Title, article) pairs from the CC News dataset | +| [CodeSearchNet](https://huggingface.co/datasets/sentence-transformers/codesearchnet) | (Comment, Code) pairs from open source libraries on GitHub | +| [Sentence Compression](https://huggingface.co/datasets/sentence-transformers/sentence-compression) | (Long text, Short text) pairs from the Sentence Compression dataset | +| [Trivia QA](https://huggingface.co/datasets/sentence-transformers/trivia-qa) | (Query, Answer) pairs from the TriviaQA dataset | +| [Flickr30k Captions](https://huggingface.co/datasets/sentence-transformers/flickr30k-captions) | Duplicate captions from the Flickr30k 
dataset | +| [xsum](https://huggingface.co/datasets/sentence-transformers/xsum) | (News Article, Summary) pairs from XSUM dataset | +| [Coco Captions](https://huggingface.co/datasets/sentence-transformers/coco-captions) | Duplicate captions from the Coco Captions dataset | +| [Parallel Sentences: Europarl](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: Global Voices](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-global-voices) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: MUSE](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-muse) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: JW300](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-jw300) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: News Commentary](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-news-commentary) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: OpenSubtitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opensubtitles) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: Talks](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-talks) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: Tatoeba](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-tatoeba) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: WikiMatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikimatrix) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: WikiTitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikititles) | (English, Non-English) pairs across numerous languages | + +```eval_rst + +.. note:: + + We advise users to tag datasets that can be used for training embedding models with ``sentence-transformers`` by adding ``tags: sentence-transformers``. We would also gladly accept high quality datasets to be added to the list above for all to see and use. +``` \ No newline at end of file diff --git a/docs/training/loss_overview.md b/docs/sentence_transformer/loss_overview.md similarity index 88% rename from docs/training/loss_overview.md rename to docs/sentence_transformer/loss_overview.md index 33ec36f41..f46b0418e 100644 --- a/docs/training/loss_overview.md +++ b/docs/sentence_transformer/loss_overview.md @@ -1,10 +1,14 @@ # Loss Overview -Loss functions play a critical role in the performance of your fine-tuned model. Sadly, there is no "one size fits all" loss function. Ideally, this overview should help narrow down your choice of loss function(s) by matching them to your data formats. +Loss functions play a critical role in the performance of your fine-tuned model. Sadly, there is no "one size fits all" loss function. Ideally, this table should help narrow down your choice of loss function(s) by matching them to your data formats. -**Note**: you can often convert one training data format into another, allowing more loss functions to be viable for your scenario. 
For example, `(sentence_A, sentence_B) pairs` with `class` labels can be converted into `(anchor, positive, negative) triplets` by sampling sentences with the same or different classes. +```eval_rst +.. note:: -| Texts | Labels | Appropriate Loss Functions | + You can often convert one training data format into another, allowing more loss functions to be viable for your scenario. For example, ``(sentence_A, sentence_B) pairs`` with ``class`` labels can be converted into ``(anchor, positive, negative) triplets`` by sampling sentences with the same or different classes. +``` + +| Inputs | Labels | Appropriate Loss Functions | |-----------------------------------------------|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `single sentences` | `class` | `BatchAllTripletLoss`
    `BatchHardSoftMarginTripletLoss`
    `BatchHardTripletLoss`
    `BatchSemiHardTripletLoss` | | `single sentences` | `none` | `ContrastiveTensionLoss`
    `DenoisingAutoEncoderLoss` | @@ -41,4 +45,19 @@ For example, when finetuning a small model to behave more like a larger & strong In practice, not all loss functions get used equally often. The most common scenarios are: * `(anchor, positive) pairs` without any labels: MultipleNegativesRankingLoss is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. CachedMultipleNegativesRankingLoss is often used to increase the batch size, resulting in superior performance. -* `(sentence_A, sentence_B) pairs` with a `float similarity score`: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance. \ No newline at end of file +* `(sentence_A, sentence_B) pairs` with a `float similarity score`: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance. + +## Custom Loss Functions + +```eval_rst +Advanced users can create and train with their own loss functions. Custom loss functions only have a few requirements: + +- They must be a subclass of :class:`torch.nn.Module`. +- They must have ``model`` as the first argument in the constructor. +- They must implement a ``forward`` method that accepts ``sentence_features`` and ``labels``. The former is a list of tokenized batches, one element for each column. These tokenized batches can be fed directly to the ``model`` being trained to produce embeddings. The latter is an optional tensor of labels. The method must return a single loss value. + +To get full support with the automatic model card generation, you may also wish to implement: + +- a ``get_config_dict`` method that returns a dictionary of loss parameters. +- a ``citation`` property so your work gets cited in all models that train with the loss. +``` \ No newline at end of file diff --git a/docs/pretrained_models.md b/docs/sentence_transformer/pretrained_models.md similarity index 50% rename from docs/pretrained_models.md rename to docs/sentence_transformer/pretrained_models.md index b35947802..ac653ef85 100644 --- a/docs/pretrained_models.md +++ b/docs/sentence_transformer/pretrained_models.md @@ -1,42 +1,77 @@ # Pretrained Models -We provide various pre-trained models. Using these models is easy: +We provide various pre-trained Sentence Transformers models via our Sentence Transformers Hugging Face organization. Additionally, over 6,000 community Sentence Transformers models have been publicly released on the Hugging Face Hub. All models can be found here: +* **Original models**: [Sentence Transformers Hugging Face organization](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers). +* **Community models**: [All Sentence Transformer models on Hugging Face](https://huggingface.co/models?library=sentence-transformers). + +Each of these models can be easily downloaded and used like so: + +```eval_rst +.. sidebar:: Original Models + + For the original models from the `Sentence Transformers Hugging Face organization `_, it is not necessary to include the model author or organization prefix. For example, this snippet loads `sentence-transformers/all-mpnet-base-v2 `_. 
+```
 
 ```python
 from sentence_transformers import SentenceTransformer
 
-model = SentenceTransformer("model_name")
+# Load https://huggingface.co/sentence-transformers/all-mpnet-base-v2
+model = SentenceTransformer("all-mpnet-base-v2")
+embeddings = model.encode([
+    "The weather is lovely today.",
+    "It's so sunny outside!",
+    "He drove to the stadium.",
+])
+similarities = model.similarity(embeddings, embeddings)
 ```
 
-All models are hosted on the [HuggingFace Model Hub](https://huggingface.co/sentence-transformers).
+```eval_rst
+.. note::
+    Consider using the `Massive Text Embedding Benchmark (MTEB) leaderboard `_ as a source of inspiration for strong Sentence Transformer models. Be wary:
 
-## Model Overview
+    - **Model sizes**: it is recommended to filter out the larger models, as they might not be feasible to run without high-end hardware.
+    - **Experimentation is key**: models that perform well on the leaderboard do not necessarily do well on your tasks; it is **crucial** to experiment with various promising models.
+```
+
+## Original Models
 
-The following table provides an overview of (selected) models. They have been extensively evaluated for their quality to embedded sentences (Performance Sentence Embeddings) and to embedded search queries & paragraphs (Performance Semantic Search).
+The following table provides an overview of a selection of our models. They have been extensively evaluated for their quality when embedding sentences (Performance Sentence Embeddings) and when embedding search queries & paragraphs (Performance Semantic Search).
 
-The **all-*** models were trained on all available training data (more than 1 billion training pairs) and are designed as **general purpose** models. The **all-mpnet-base-v2** model provides the best quality, while **all-MiniLM-L6-v2** is 5 times faster and still offers good quality. Toggle *All models* to see all evaluated models or visit [HuggingFace Model Hub](https://huggingface.co/models?library=sentence-transformers) to view all existing sentence-transformers models.
+The **all-*** models were trained on all available training data (more than 1 billion training pairs) and are designed as **general purpose** models. The [**all-mpnet-base-v2**](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model provides the best quality, while [**all-MiniLM-L6-v2**](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) is 5 times faster and still offers good quality. Toggle *All models* to see all evaluated original models.
 
-
+
 
 ---
 
-## Semantic Search
+## Semantic Search Models
+
+The following models have been specifically trained for **Semantic Search**: Given a question / search query, these models are able to find relevant text passages. For more details, see [Usage > Semantic Search](../../examples/applications/semantic-search/README.md).
+
+```eval_rst
+.. sidebar:: Documentation
 
-The following models have been specifically trained for **Semantic Search**: Given a question / search query, these models are able to find relevant text passages. For more details, see [Usage - Semantic Search](../examples/applications/semantic-search/README.md).
+    #. `multi-qa-mpnet-base-cos-v1 `_
+    #. :class:`SentenceTransformer `
+    #. :meth:`SentenceTransformer.encode `
+    #. :meth:`SentenceTransformer.similarity `
+
+```
 
 ```python
-from sentence_transformers import SentenceTransformer, util
+from sentence_transformers import SentenceTransformer
 
-model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
+model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")
 
 query_embedding = model.encode("How big is London")
-passage_embedding = model.encode([
-    "London has 9,787,426 inhabitants at the 2011 census",
+passage_embeddings = model.encode([
     "London is known for its financial district",
+    "London has 9,787,426 inhabitants at the 2011 census",
+    "The United Kingdom is the fourth largest exporter of goods in the world",
 ])
 
-print("Similarity:", util.dot_score(query_embedding, passage_embedding))
+similarity = model.similarity(query_embedding, passage_embeddings)
+# => tensor([[0.4659, 0.6142, 0.2697]])
 ```
 
@@ -45,94 +80,80 @@ print("Similarity:", util.dot_score(query_embedding, passage_embedding))
 
 The following models have been trained on [215M question-answer pairs](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1#training) from various sources and domains, including StackExchange, Yahoo Answers, Google & Bing search queries and many more. These models perform well across many search tasks and domains.
 
-
-These models were tuned to be used with dot-product:
+These models were tuned to be used with the dot-product similarity score:
 
 | Model | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
 | --- | :---: | :---: |
-| [multi-qa-MiniLM-L6-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1) | 49.19 | 18,000 / 750 |
-| [multi-qa-distilbert-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-dot-v1) | 52.51 | 7,000 / 350 |
 | [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | 57.60 | 4,000 / 170 |
+| [multi-qa-distilbert-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-dot-v1) | 52.51 | 7,000 / 350 |
+| [multi-qa-MiniLM-L6-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1) | 49.19 | 18,000 / 750 |
 
-
-
-These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance:
+These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance as the similarity functions:
 
 | Model | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
 | --- | :---: | :---: |
-| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | 51.83 | 18,000 / 750 |
-| [multi-qa-distilbert-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1) | 52.83 | 7,000 / 350 |
 | [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) | 57.46 | 4,000 / 170 |
+| [multi-qa-distilbert-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1) | 52.83 | 7,000 / 350 |
+| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | 51.83 | 18,000 / 750 |
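+
+For the dot-product tuned models above, a minimal usage sketch looks the same, but is scored with `util.dot_score`:
+
+```python
+from sentence_transformers import SentenceTransformer, util
+
+# A model tuned for dot-product similarity scoring
+model = SentenceTransformer("multi-qa-MiniLM-L6-dot-v1")
+
+query_embedding = model.encode("How big is London")
+passage_embeddings = model.encode([
+    "London is known for its financial district",
+    "London has 9,787,426 inhabitants at the 2011 census",
+])
+
+# Dot-product scores; higher means more relevant
+print(util.dot_score(query_embedding, passage_embeddings))
+```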
 
 ### MSMARCO Passage Models
 
-The [MSMARCO Passage Ranking Dataset](https://github.com/microsoft/MSMARCO-Passage-Ranking) contains 500k real queries from Bing search together with the relevant passages from various web sources. Given the diversity of the MSMARCO dataset, models also perform well on other domains.
+The following models have been trained on the [MSMARCO Passage Ranking Dataset](https://github.com/microsoft/MSMARCO-Passage-Ranking), which contains 500k real queries from Bing search together with the relevant passages from various web sources. Given the diversity of the MSMARCO dataset, models also perform well on other domains.
 
-Models tuned to be used with dot-product:
+These models were tuned to be used with the dot-product similarity score:
 
 | Model | MSMARCO MRR@10 dev set | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
 | --- | :---: | :---: | :---: |
-| [msmarco-distilbert-base-tas-b](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b) | 34.43 | 49.25 | 7,000 / 350 |
-| [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) | 37.25 | 49.47 | 7,000 / 350 |
 | [msmarco-bert-base-dot-v5](https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5) | 38.08 | 52.11 | 4,000 / 170 |
+| [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) | 37.25 | 49.47 | 7,000 / 350 |
+| [msmarco-distilbert-base-tas-b](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b) | 34.43 | 49.25 | 7,000 / 350 |
 
-
-These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance:
+These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance as the similarity functions:
 
 | Model | MSMARCO MRR@10 dev set | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
 | --- | :---: | :---: | :---: |
-| [msmarco-MiniLM-L6-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5) | 32.27 | 42.16 | 18,000 / 750 |
-| [msmarco-MiniLM-L12-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) | 32.75 | 43.89 | 11,000 / 400 |
 | [msmarco-distilbert-cos-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5) | 33.79 | 44.98 | 7,000 / 350 |
+| [msmarco-MiniLM-L12-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) | 32.75 | 43.89 | 11,000 / 400 |
+| [msmarco-MiniLM-L6-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5) | 32.27 | 42.16 | 18,000 / 750 |
 
-[MSMARCO Models - More details](pretrained-models/msmarco-v5.md)
+[MSMARCO Models - More details](../pretrained-models/msmarco-v5.md)
 
 ---
-## Multi-Lingual Models
-The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close in vector space. You do not need to specify the input language. Details are in our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813). We used the following 50+ languages: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.
-
-
+## Multilingual Models
+The following models generate similar embeddings for the same texts in different languages. You do not need to specify the input language. Details are in our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813).
We used the following 50+ languages: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw. -**Semantic Similarity** +### Semantic Similarity Models These models find semantically similar sentences within one language or across languages: -- **distiluse-base-multilingual-cased-v1**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. -- **distiluse-base-multilingual-cased-v2**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). This version supports 50+ languages, but performs a bit weaker than the v1 model. -- **paraphrase-multilingual-MiniLM-L12-v2** - Multilingual version of *paraphrase-MiniLM-L12-v2*, trained on parallel data for 50+ languages. -- **paraphrase-multilingual-mpnet-base-v2** - Multilingual version of *paraphrase-mpnet-base-v2*, trained on parallel data for 50+ languages. +- **[distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1)**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. +- **[distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2)**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). This version supports 50+ languages, but performs a bit weaker than the v1 model. +- **[paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)** - Multilingual version of [paraphrase-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L12-v2), trained on parallel data for 50+ languages. +- **[paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)** - Multilingual version of [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2), trained on parallel data for 50+ languages. -**Bitext Mining** +### Bitext Mining Bitext mining describes the process of finding translated sentence pairs in two languages. If this is your use-case, the following model gives the best performance: -- **LaBSE** - [LaBSE](https://arxiv.org/abs/2007.01852) Model. Supports 109 languages. Works well for finding translation pairs in multiple languages. As detailed [here](https://arxiv.org/abs/2004.09813), LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other. +- **[LaBSE](https://huggingface.co/sentence-transformers/LaBSE)** - [LaBSE](https://arxiv.org/abs/2007.01852) Model. Supports 109 languages. Works well for finding translation pairs in multiple languages. As detailed [here](https://arxiv.org/abs/2004.09813), LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other. 
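+
+For example, a minimal bitext mining sketch with LaBSE, pairing each English sentence with its most similar candidate from another language (the sentences are illustrative):
+
+```python
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("LaBSE")
+
+english = ["The cat sits on the mat.", "I love reading books."]
+german = ["Ich lese gerne Bücher.", "Die Katze sitzt auf der Matte."]
+
+# Embed both sides, then match each English sentence to its best German candidate
+similarities = model.similarity(model.encode(english), model.encode(german))
+for i, j in enumerate(similarities.argmax(dim=1)):
+    print(english[i], "<->", german[int(j)])
+```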
- -Extending a model to new languages is easy by following [the description here](https://www.sbert.net/examples/training/multilingual/README.html). - ----- +Extending a model to new languages is easy by following [Training Examples > Multilingual Models](../../examples/training/multilingual/README.html). ## Image & Text-Models -The following models can embed images and text into a joint vector space. See [Image Search](../examples/applications/image-search/README.md) for more details how to use for text2image-search, image2image-search, image clustering, and zero-shot image classification. +The following models can embed images and text into a joint vector space. See [Usage > Image Search](../../examples/applications/image-search/README.md) for more details how to use for text2image-search, image2image-search, image clustering, and zero-shot image classification. The following models are available with their respective Top 1 accuracy on zero-shot ImageNet validation dataset. | Model | Top 1 Performance | | --- | :---: | -| [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) | 63.3 | -| [clip-ViT-B-16](https://huggingface.co/sentence-transformers/clip-ViT-B-16) | 68.1 | | [clip-ViT-L-14](https://huggingface.co/sentence-transformers/clip-ViT-L-14) | 75.4 | +| [clip-ViT-B-16](https://huggingface.co/sentence-transformers/clip-ViT-B-16) | 68.1 | +| [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) | 63.3 | We further provide this multilingual text-image model: -- **clip-ViT-B-32-multilingual-v1** - Multilingual text encoder for the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model using [Multilingual Knowledge Distillation](https://arxiv.org/abs/2004.09813). This model can encode text in 50+ languages to match the image vectors from the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model. - - ---- +- **[clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1)** - Multilingual text encoder for the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model using [Multilingual Knowledge Distillation](https://arxiv.org/abs/2004.09813). This model can encode text in 50+ languages to match the image vectors from the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model. -## Other Models - -### INSTRUCTOR models +## INSTRUCTOR models Some INSTRUCTOR models, such as [hkunlp/instructor-large](https://huggingface.co/hkunlp/instructor-large), are natively supported in Sentence Transformers. These models are special, as they are trained with instructions in mind. Notably, the primary difference between normal Sentence Transformer models and Instructor models is that the latter do not include the instructions themselves in the pooling step. The following models work out of the box: @@ -185,61 +206,7 @@ print(similarities) All other Instructor models either 1) will not load as they refer to `InstructorEmbedding` in their `modules.json` or 2) require calling `model.set_pooling_include_prompt(include_prompt=False)` after loading. -### Scientific Publications +## Scientific Similarity Models [SPECTER](https://arxiv.org/abs/2004.07180) is a model trained on scientific citations and can be used to estimate the similarity of two publications. We can use it to find similar papers. 
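+
+For example, a minimal sketch in the style of the example scripts linked below, where each paper is encoded as a single `title[SEP]abstract` string (titles and abstracts here are illustrative):
+
+```python
+from sentence_transformers import SentenceTransformer, util
+
+model = SentenceTransformer("allenai-specter")
+
+# Papers are encoded as "title[SEP]abstract" strings
+papers = [
+    "BERT[SEP]We introduce a new language representation model called BERT.",
+    "Attention Is All You Need[SEP]We propose a new network architecture, the Transformer.",
+]
+query = "SciBERT[SEP]A pretrained language model for scientific text."
+
+scores = util.cos_sim(model.encode(query), model.encode(papers))
+print(scores)
+```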
-- **allenai-specter** - [Semantic Search Python Example](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_publications.py) / [Semantic Search Colab Example](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06) - - - - - -### Natural Questions (NQ) Dataset Models -The following models were trained on [Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions), a dataset with 100k real queries from Google search together with the relevant passages from Wikipedia. - -- **nq-distilbert-base-v1**: MRR10: 72.36 on NQ dev set (small) - -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("nq-distilbert-base-v1") - -query_embedding = model.encode("How many people live in London?") - -# The passages are encoded as [ [title1, text1], [title2, text2], ...] -passage_embedding = model.encode( - [["London", "London has 9,787,426 inhabitants at the 2011 census."]] -) - -print("Similarity:", util.cos_sim(query_embedding, passage_embedding)) -``` - -You can index the passages as shown [here](../examples/applications/semantic-search/README.md). - -**Note:** The NQ model doesn't perform well. Use the above mentioned Multi-QA models to achieve the optimal performance. - -[More details](pretrained-models/nq-v1.md) - - - -### DPR-Models - -In [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) Karpukhin et al. trained models based on [Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions): -- **facebook-dpr-ctx_encoder-single-nq-base** -- **facebook-dpr-question_encoder-single-nq-base** - -They also trained models on the combination of Natural Questions, TriviaQA, WebQuestions, and CuratedTREC. -- **facebook-dpr-ctx_encoder-multiset-base** -- **facebook-dpr-question_encoder-multiset-base** - -**Note:** The DPR models perform comparabily bad. Use the above mentioned Multi-QA models to achieve the optimal performance. - -[More details & usage of the DPR models](pretrained-models/dpr.md) - -### Average Word Embeddings Models - -The following models apply compute the average word embedding for some well-known word embedding methods. Their computation speed is much higher than the transformer based models, but the quality of the embeddings are worse. -- **average_word_embeddings_glove.6B.300d** -- **average_word_embeddings_komninos** -- **average_word_embeddings_levy_dependency** -- **average_word_embeddings_glove.840B.300d** +- **[allenai-specter](https://huggingface.co/sentence-transformers/allenai-specter)** - [Semantic Search Python Example](../../examples/applications/semantic-search/semantic_search_publications.py) / [Semantic Search Colab Example](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06) diff --git a/docs/sentence_transformer/training/distributed.rst b/docs/sentence_transformer/training/distributed.rst new file mode 100644 index 000000000..fc6e78138 --- /dev/null +++ b/docs/sentence_transformer/training/distributed.rst @@ -0,0 +1,85 @@ + +Distributed Training +==================== + +Sentence Transformers implements two forms of distributed training: Data Parallel (DP) and Distributed Data Parallel (DDP). Read the `Data Parallelism documentation `_ on Hugging Face for more details on these strategies. Some of the key differences include: + +1. DDP is generally faster than DP because it has to communicate less data. +2. 
With DP, GPU 0 does the bulk of the work, while with DDP, the work is distributed more evenly across all GPUs. +3. DDP allows for training across multiple machines, while DP is limited to a single machine. + +In short, **DDP is generally recommended**. You can use DDP by running your normal training scripts with ``torchrun`` or ``accelerate``. For example, if you have a script called ``train_script.py``, you can run it with DDP using the following command: + +.. tabs:: + + .. tab:: Via ``torchrun`` + + - `torchrun documentation `_ + + :: + + torchrun --nproc_per_node=4 train_script.py + + .. tab:: Via ``accelerate`` + + - `accelerate documentation `_ + + :: + + accelerate launch --num_processes 4 train_script.py + +.. note:: + + When performing distributed training, you have to wrap your code in a ``main`` function and call it with ``if __name__ == "__main__":``. This is because each process will run the entire script, so you don't want to run the same code multiple times. Here is an example of how to do this:: + + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments, SentenceTransformerTrainer + # Other imports here + + def main(): + # Your training code here + + if __name__ == "__main__": + main() + +.. note:: + + When using DDP, using ``dataloader_drop_last=True`` in :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` is recommended, as the training may halt at the last (incomplete) training batch otherwise. + +Comparison +---------- + +The following table shows the speedup of DDP over DP and no parallelism given a certain hardware setup. + +- Hardware: a ``p3.8xlarge`` AWS instance, i.e. 4x V100 GPUs +- Model being trained: `microsoft/mpnet-base `_ (133M parameters) +- Maximum sequence length: 384 (following `all-mpnet-base-v2 `_) +- Training datasets: MultiNLI, SNLI and STSB (note: these have short texts) +- Losses: :class:`~sentence_transformers.losses.SoftmaxLoss` for MultiNLI and SNLI, :class:`~sentence_transformers.losses.CosineSimilarityLoss` for STSB +- Batch size per device: 32 + +.. list-table:: + :header-rows: 1 + + * - Strategy + - Launcher + - Samples per Second + * - No Parallelism + - ``CUDA_VISIBLE_DEVICES=0 python train_script.py`` + - 2724 + * - Data Parallel (DP) + - ``python train_script.py`` (DP is used by default when launching a script with ``python``) + - 3675 (1.349x speedup) + * - **Distributed Data Parallel (DDP)** + - ``torchrun --nproc_per_node=4 train_script.py`` or ``accelerate launch --num_processes 4 train_script.py`` + - **6980 (2.562x speedup)** + +FSDP +---- + +Fully Sharded Data Parallelism (FSDP) is another distributed training strategy that is not fully supported by Sentence Transformers. It is a more advanced version of DDP that is particularly useful for very large models. Note that in the previous comparison, FSDP reaches 5782 samples per second (2.122x speedup), i.e. **worse than DDP**. FSDP only makes sense with very large models. If you want to use FSDP with Sentence Transformers, you have to be aware of the following limitations: + +- You can't use the ``evaluator`` functionality with FSDP. +- You have to save the trained model with ``trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")`` followed with ``trainer.save_model("output")``. 
+- You have to use ``fsdp=["full_shard", "auto_wrap"]`` and ``fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"}`` in your ``SentenceTransformerTrainingArguments``, where ``transformer_layer_cls_to_wrap`` is set to the name of the repeated layer in the encoder that houses the multi-head attention and feed-forward layers, e.g. ``BertLayer`` or ``MPNetLayer``.
+
+Read the `FSDP documentation `_ by Accelerate for more details.
\ No newline at end of file
diff --git a/docs/sentence_transformer/training/examples.rst b/docs/sentence_transformer/training/examples.rst
new file mode 100644
index 000000000..f78d5916b
--- /dev/null
+++ b/docs/sentence_transformer/training/examples.rst
@@ -0,0 +1,32 @@
+
+Training Examples
+=================
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Supervised Learning
+
+   ../../../examples/training/sts/README
+   ../../../examples/training/nli/README
+   ../../../examples/training/paraphrases/README
+   ../../../examples/training/quora_duplicate_questions/README
+   ../../../examples/training/ms_marco/README
+   ../../../examples/training/matryoshka/README
+   ../../../examples/training/adaptive_layer/README
+   ../../../examples/training/multilingual/README
+   ../../../examples/training/distillation/README
+   ../../../examples/training/data_augmentation/README
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Unsupervised Learning
+
+   ../../../examples/unsupervised_learning/README
+   ../../../examples/domain_adaptation/README
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Advanced Usage
+
+   ../../../examples/training/hpo/README
+   distributed
\ No newline at end of file
diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md
new file mode 100644
index 000000000..def0a1759
--- /dev/null
+++ b/docs/sentence_transformer/training_overview.md
@@ -0,0 +1,667 @@
+# Training Overview
+
+## Why Finetune?
+Finetuning Sentence Transformer models often heavily improves the performance of the model on your use case, because each task requires a different notion of similarity. For example, given news articles:
+- "Apple launches the new iPad"
+- "NVIDIA is gearing up for the next GPU generation"
+
+Then for the following use cases, we may have different notions of similarity:
+- a model for **classification** of news articles as Economy, Sports, Technology, Politics, etc., should produce **similar embeddings** for these texts.
+- a model for **semantic textual similarity** should produce **dissimilar embeddings** for these texts, as they have different meanings.
+- a model for **semantic search** would **not need a notion of similarity** between two documents, as it should only compare queries and documents.
+
+Also see [**Training Examples**](training/examples) for numerous training scripts for common real-world applications that you can adopt.
+
+## Training Components
+Training Sentence Transformer models involves 3 to 5 components:
+
+

    + +## Dataset +```eval_rst +The :class:`SentenceTransformerTrainer` trains and evaluates using :class:`datasets.Dataset` (one dataset) or :class:`datasets.DatasetDict` instances (multiple datasets, see also `Multi-dataset training <#multi-dataset-training>`_). + +.. tabs:: + + .. tab:: Data on 🤗 Hugging Face Hub + + If you want to load data from the `Hugging Face Datasets `_, then you should use :func:`datasets.load_dataset`: + + .. raw:: html + + + + :: + + from datasets import load_dataset + + train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train") + eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev") + + print(train_dataset) + """ + Dataset({ + features: ['premise', 'hypothesis', 'label'], + num_rows: 942069 + }) + """ + + Some datasets (including `sentence-transformers/all-nli `_) require you to provide a "subset" alongside the dataset name. ``sentence-transformers/all-nli`` has 4 subsets, each with different data formats: `pair `_, `pair-class `_, `pair-score `_, `triplet `_. + + .. note:: + + Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them by browsing to `https://huggingface.co/datasets?other=sentence-transformers `_. We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks. + + .. tab:: Local Data (CSV, JSON, Parquet, Arrow, SQL) + + If you have local data in common file-formats, then you can load this data easily using :func:`datasets.load_dataset`: + + .. raw:: html + + + + :: + + from datasets import load_dataset + + dataset = load_dataset("csv", data_files="my_file.csv") + + or:: + + from datasets import load_dataset + + dataset = load_dataset("json", data_files="my_file.json") + + .. tab:: Local Data that requires pre-processing + + .. sidebar:: Documentation + + - :meth:`datasets.Dataset.from_dict` + + If you have local data that requires some extra pre-processing, my recommendation is to initialize your dataset using :meth:`datasets.Dataset.from_dict` and a dictionary of lists, like so: + + .. raw:: html + + + + :: + + from datasets import Dataset + + sentence1_list = [] + sentence2_list = [] + # Open a file, do preprocessing, filtering, cleaning, etc. + # and append to the lists + + dataset = Dataset.from_dict({ + "sentence1": sentence1_list, + "sentence2": sentence2_list, + }) + + Each key from the dictionary will become a column in the resulting dataset. + +``` + +### Dataset Format + +```eval_rst +It is important that your dataset format matches your loss function (or that you choose a loss function that matches your dataset format). Verifying whether a dataset format works with a loss function involves two steps: + +1. If your loss function requires a *Label* according to the `Loss Overview `_ table, then your dataset must have a **column named "label" or "score"**. This column is automatically taken as the label. +2. All columns not named "label" or "score" are considered *Inputs* according to the `Loss Overview `_ table. The number of remaining columns must match the number of valid inputs for your chosen loss. The names of these columns are **irrelevant**, only the **order matters**. 
+
+For example, given a dataset with columns ``["text1", "text2", "label"]`` where the "label" column has a float similarity score, we can use it with :class:`~sentence_transformers.losses.CoSENTLoss`, :class:`~sentence_transformers.losses.AnglELoss`, and :class:`~sentence_transformers.losses.CosineSimilarityLoss` because it:
+
+1. has a "label" column as is required for these loss functions.
+2. has 2 non-label columns, exactly the number required by these loss functions.
+
+Be sure to re-order your dataset columns with :meth:`Dataset.select_columns ` if your columns are not ordered correctly. For example, if your dataset has ``["good_answer", "bad_answer", "question"]`` as columns, then this dataset can technically be used with a loss that requires (anchor, positive, negative) triplets, but the ``good_answer`` column will be taken as the anchor, ``bad_answer`` as the positive, and ``question`` as the negative.
+
+Additionally, if your dataset has extraneous columns (e.g. sample_id, metadata, source, type), you should remove these with :meth:`Dataset.remove_columns ` as they will be used as inputs otherwise. You can also use :meth:`Dataset.select_columns ` to keep only the desired columns.
+```
+
+## Loss Function
+Loss functions quantify how well a model performs for a given batch of data, allowing an optimizer to update the model weights to produce more favourable (i.e., lower) loss values. This is the core of the training process.
+
+Sadly, there is no single loss function that works best for all use-cases. Instead, which loss function to use greatly depends on your available data and on your target task. See [Dataset Format](#dataset-format) to learn what datasets are valid for which loss functions. Additionally, the [Loss Overview](loss_overview) will be your best friend to learn about the options.
+
+```eval_rst
+Most loss functions can be initialized with just the :class:`SentenceTransformer` that you're training, alongside some optional parameters, e.g.:
+
+.. sidebar:: Documentation
+
+    - :class:`sentence_transformers.losses.CoSENTLoss`
+    - `Losses API Reference <../package_reference/sentence_transformer/losses>`_
+    - `Loss Overview `_
+
+::
+
+    from datasets import load_dataset
+    from sentence_transformers import SentenceTransformer
+    from sentence_transformers.losses import CoSENTLoss
+
+    # Load a model to train/finetune
+    model = SentenceTransformer("xlm-roberta-base")
+
+    # Initialize the CoSENTLoss
+    # This loss requires pairs of text and a float similarity score as a label
+    loss = CoSENTLoss(model)
+
+    # Load an example training dataset that works with our loss function:
+    train_dataset = load_dataset("sentence-transformers/all-nli", "pair-score", split="train")
+    """
+    Dataset({
+        features: ['sentence1', 'sentence2', 'label'],
+        num_rows: 942069
+    })
+    """
+```
+
+## Training Arguments
+
+```eval_rst
+The :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` class can be used to specify parameters for influencing training performance as well as defining the tracking/debugging parameters. Although it is optional, it is heavily recommended to experiment with the various useful arguments.
+```
+
+The following are tables with some of the most useful training arguments.
+
+
+
+## Loss Function
+Loss functions quantify how well a model performs for a given batch of data, allowing an optimizer to update the model weights to produce more favourable (i.e., lower) loss values. This is the core of the training process.
+
+Sadly, there is no single loss function that works best for all use cases. Instead, which loss function to use greatly depends on your available data and on your target task. See [Dataset Format](#dataset-format) to learn which datasets are valid for which loss functions. Additionally, the [Loss Overview](loss_overview) will be your best friend to learn about the options.
+
+```eval_rst
+Most loss functions can be initialized with just the :class:`SentenceTransformer` that you're training, alongside some optional parameters, e.g.:
+
+.. sidebar:: Documentation
+
+    - :class:`sentence_transformers.losses.CoSENTLoss`
+    - `Losses API Reference <../package_reference/sentence_transformer/losses.html>`_
+    - `Loss Overview <loss_overview.html>`_
+
+::
+
+    from datasets import load_dataset
+    from sentence_transformers import SentenceTransformer
+    from sentence_transformers.losses import CoSENTLoss
+
+    # Load a model to train/finetune
+    model = SentenceTransformer("xlm-roberta-base")
+
+    # Initialize the CoSENTLoss
+    # This loss requires pairs of text and a float similarity score as a label
+    loss = CoSENTLoss(model)
+
+    # Load an example training dataset that works with our loss function:
+    train_dataset = load_dataset("sentence-transformers/all-nli", "pair-score", split="train")
+    """
+    Dataset({
+        features: ['sentence1', 'sentence2', 'label'],
+        num_rows: 942069
+    })
+    """
+```
+
+## Training Arguments
+
+```eval_rst
+The :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` class can be used to specify parameters for influencing training performance as well as defining the tracking/debugging parameters. Although it is optional, it is strongly recommended to experiment with the various useful arguments.
+```
+
+The following are tables with some of the most useful training arguments.
+
+
+```eval_rst
+Here is an example of how :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` can be initialized:
+```
+
+```python
+from sentence_transformers import SentenceTransformerTrainingArguments
+from sentence_transformers.training_args import BatchSamplers
+
+args = SentenceTransformerTrainingArguments(
+    # Required parameter:
+    output_dir="models/mpnet-base-all-nli-triplet",
+    # Optional training parameters:
+    num_train_epochs=1,
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=16,
+    warmup_ratio=0.1,
+    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
+    bf16=False,  # Set to True if you have a GPU that supports BF16
+    batch_sampler=BatchSamplers.NO_DUPLICATES,  # Losses that use "in-batch negatives" benefit from no duplicates
+    # Optional tracking/debugging parameters:
+    eval_strategy="steps",
+    eval_steps=100,
+    save_strategy="steps",
+    save_steps=100,
+    save_total_limit=2,
+    logging_steps=100,
+    run_name="mpnet-base-all-nli-triplet",  # Will be used in W&B if `wandb` is installed
+)
+```
+
+## Evaluator
+
+```eval_rst
+Several evaluators exist that can help with evaluation before, during, and after training:
+
+======================================================================== ===========================================================================================================================
+Evaluator                                                                Required Data
+======================================================================== ===========================================================================================================================
+:class:`~sentence_transformers.evaluation.BinaryClassificationEvaluator` Pairs with class labels
+:class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator`  Pairs with similarity scores
+:class:`~sentence_transformers.evaluation.InformationRetrievalEvaluator` Queries (qid => question), Corpus (cid => document), and relevant documents (qid => set[cid])
+:class:`~sentence_transformers.evaluation.MSEEvaluator`                  Source sentences to embed with a teacher model and target sentences to embed with the student model. Can be the same texts.
+:class:`~sentence_transformers.evaluation.ParaphraseMiningEvaluator`     Mapping of IDs to sentences & pairs with IDs of duplicate sentences.
+:class:`~sentence_transformers.evaluation.RerankingEvaluator`            List of ``{'query': '...', 'positive': [...], 'negative': [...]}`` dictionaries.
+:class:`~sentence_transformers.evaluation.TranslationEvaluator`          Pairs of sentences in two separate languages.
+:class:`~sentence_transformers.evaluation.TripletEvaluator`              (anchor, positive, negative) triplets.
+======================================================================== ===========================================================================================================================
+
+Additionally, :class:`~sentence_transformers.evaluation.SequentialEvaluator` should be used to combine multiple evaluators into one evaluator that can be passed to the :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` (see the example after the tabs below). How often the evaluator is run depends on the ``eval_strategy`` and ``eval_steps`` `Training Arguments <#training-arguments>`_.
+
+Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face.
+
+.. tabs::
+
+    .. tab:: EmbeddingSimilarityEvaluator with STSb
+
+        ::
+
+            from datasets import load_dataset
+            from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction
+
+            # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb)
+            eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
+
+            # Initialize the evaluator
+            dev_evaluator = EmbeddingSimilarityEvaluator(
+                sentences1=eval_dataset["sentence1"],
+                sentences2=eval_dataset["sentence2"],
+                scores=eval_dataset["score"],
+                main_similarity=SimilarityFunction.COSINE,
+                name="sts-dev",
+            )
+            # You can run evaluation like so:
+            # dev_evaluator(model)
+
+    .. tab:: TripletEvaluator with AllNLI
+
+        ::
+
+            from datasets import load_dataset
+            from sentence_transformers.evaluation import TripletEvaluator, SimilarityFunction
+
+            # Load triplets from the AllNLI dataset (https://huggingface.co/datasets/sentence-transformers/all-nli)
+            max_samples = 1000
+            eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]")
+
+            # Initialize the evaluator
+            dev_evaluator = TripletEvaluator(
+                anchors=eval_dataset["anchor"],
+                positives=eval_dataset["positive"],
+                negatives=eval_dataset["negative"],
+                main_distance_function=SimilarityFunction.COSINE,
+                name="all-nli-dev",
+            )
+            # You can run evaluation like so:
+            # dev_evaluator(model)
+```
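+
+As a minimal sketch of combining evaluators, assuming the two evaluators from the tabs above were created as `sts_evaluator` and `triplet_evaluator`:
+
+```python
+from sentence_transformers.evaluation import SequentialEvaluator
+
+# Wrap several evaluators into a single evaluator that runs them in sequence;
+# it can be passed to the SentenceTransformerTrainer like any other evaluator.
+evaluator = SequentialEvaluator([sts_evaluator, triplet_evaluator])
+
+# You can run evaluation like so:
+# evaluator(model)
+```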
+
+## Trainer
+
+```eval_rst
+The :class:`sentence_transformers.SentenceTransformerTrainer` is where all previous components come together. We only have to provide the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, and evaluator (optional), and we can start training. Let's have a look at a script where all of these components come together:
+
+.. sidebar:: Documentation
+
+    #. :class:`~sentence_transformers.SentenceTransformer`
+    #. :class:`~sentence_transformers.model_card.SentenceTransformerModelCardData`
+    #. :func:`~datasets.load_dataset`
+    #. :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`
+    #. :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`
+    #. :class:`~sentence_transformers.evaluation.TripletEvaluator`
+    #. :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`
+    #. :meth:`SentenceTransformer.save_pretrained <sentence_transformers.SentenceTransformer.save_pretrained>`
+    #. :meth:`SentenceTransformer.push_to_hub <sentence_transformers.SentenceTransformer.push_to_hub>`
+
+    - `Training Examples `_
+
+::
+
+    from datasets import load_dataset
+    from sentence_transformers import (
+        SentenceTransformer,
+        SentenceTransformerTrainer,
+        SentenceTransformerTrainingArguments,
+        SentenceTransformerModelCardData,
+    )
+    from sentence_transformers.losses import MultipleNegativesRankingLoss
+    from sentence_transformers.training_args import BatchSamplers
+    from sentence_transformers.evaluation import TripletEvaluator
+
+    # 1. Load a model to finetune with 2. (Optional) model card data
+    model = SentenceTransformer(
+        "microsoft/mpnet-base",
+        model_card_data=SentenceTransformerModelCardData(
+            language="en",
+            license="apache-2.0",
+            model_name="MPNet base trained on AllNLI triplets",
+        )
+    )
+
+    # 3. Load a dataset to finetune on
+    dataset = load_dataset("sentence-transformers/all-nli", "triplet")
+    train_dataset = dataset["train"].select(range(100_000))
+    eval_dataset = dataset["dev"]
+    test_dataset = dataset["test"]
+
+    # 4. Define a loss function
+    loss = MultipleNegativesRankingLoss(model)
+
+    # 5. (Optional) Specify training arguments
+    args = SentenceTransformerTrainingArguments(
+        # Required parameter:
+        output_dir="models/mpnet-base-all-nli-triplet",
+        # Optional training parameters:
+        num_train_epochs=1,
+        per_device_train_batch_size=16,
+        per_device_eval_batch_size=16,
+        warmup_ratio=0.1,
+        fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
+        bf16=False,  # Set to True if you have a GPU that supports BF16
+        batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
+        # Optional tracking/debugging parameters:
+        eval_strategy="steps",
+        eval_steps=100,
+        save_strategy="steps",
+        save_steps=100,
+        save_total_limit=2,
+        logging_steps=100,
+        run_name="mpnet-base-all-nli-triplet",  # Will be used in W&B if `wandb` is installed
+    )
+
+    # 6. (Optional) Create an evaluator & evaluate the base model
+    dev_evaluator = TripletEvaluator(
+        anchors=eval_dataset["anchor"],
+        positives=eval_dataset["positive"],
+        negatives=eval_dataset["negative"],
+        name="all-nli-dev",
+    )
+    dev_evaluator(model)
+
+    # 7. Create a trainer & train
+    trainer = SentenceTransformerTrainer(
+        model=model,
+        args=args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        loss=loss,
+        evaluator=dev_evaluator,
+    )
+    trainer.train()
+
+    # (Optional) Evaluate the trained model on the test set
+    test_evaluator = TripletEvaluator(
+        anchors=test_dataset["anchor"],
+        positives=test_dataset["positive"],
+        negatives=test_dataset["negative"],
+        name="all-nli-test",
+    )
+    test_evaluator(model)
+
+    # 8. Save the trained model
+    model.save_pretrained("models/mpnet-base-all-nli-triplet/final")
+
+    # 9. (Optional) Push it to the Hugging Face Hub
+    model.push_to_hub("mpnet-base-all-nli-triplet")
+
+```
+
+### Callbacks
+
+```eval_rst
+This Sentence Transformers trainer integrates support for various :class:`transformers.TrainerCallback` subclasses, such as:
+
+- :class:`~transformers.integrations.WandbCallback` to automatically log training metrics to W&B if ``wandb`` is installed
+- :class:`~transformers.integrations.TensorBoardCallback` to log training metrics to TensorBoard if ``tensorboard`` is accessible.
+- :class:`~transformers.integrations.CodeCarbonCallback` to track the carbon emissions of your model during training if ``codecarbon`` is installed.
+
+  - Note: These carbon emissions will be included in your automatically generated model card.
+
+See the Transformers `Callbacks <https://huggingface.co/docs/transformers/main_classes/callback>`_
+documentation for more information on the integrated callbacks and how to write your own callbacks, such as the sketch below.
+```
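+
+As a minimal sketch of a custom callback, here is a hypothetical `LossPrinterCallback` (the name and printing behaviour are illustrative, not part of the library) that prints every set of logged metrics:
+
+```python
+from transformers import TrainerCallback
+
+class LossPrinterCallback(TrainerCallback):
+    # Called every time the trainer logs metrics (e.g. every `logging_steps` steps)
+    def on_log(self, args, state, control, logs=None, **kwargs):
+        if logs is not None:
+            print(f"Step {state.global_step}: {logs}")
+
+# It can then be passed to the trainer, e.g.:
+# trainer = SentenceTransformerTrainer(..., callbacks=[LossPrinterCallback()])
+```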
+
+## Multi-Dataset Training
+```eval_rst
+The top performing models are trained using many datasets at once. Normally, this is rather tricky, as each dataset has a different format. However, :class:`SentenceTransformerTrainer` can train with multiple datasets without having to convert each dataset to the same format. It can even apply different loss functions to each of the datasets. The steps to train with multiple datasets are:
+
+- Use a dictionary of :class:`~datasets.Dataset` instances (or a :class:`~datasets.DatasetDict`) as the ``train_dataset`` and ``eval_dataset``.
+- (Optional) Use a dictionary of loss functions mapping dataset names to losses. Only required if you wish to use different loss functions for different datasets.
+
+Each training/evaluation batch will only contain samples from one of the datasets. The order in which batches are sampled from the multiple datasets is defined by the :class:`~sentence_transformers.training_args.MultiDatasetBatchSamplers` enum, which can be passed to the :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` via ``multi_dataset_batch_sampler``. Valid options are:
+
+- ``MultiDatasetBatchSamplers.ROUND_ROBIN``: Round-robin sampling from each dataset until one is exhausted. With this strategy, it's likely that not all samples from each dataset are used, but each dataset is sampled from equally.
+- ``MultiDatasetBatchSamplers.PROPORTIONAL`` (default): Sample from each dataset in proportion to its size. With this strategy, all samples from each dataset are used and larger datasets are sampled from more frequently (see the sketch below).
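+
+As a minimal sketch (with a hypothetical output directory), selecting the sampler looks like this::
+
+    from sentence_transformers import SentenceTransformerTrainingArguments
+    from sentence_transformers.training_args import MultiDatasetBatchSamplers
+
+    args = SentenceTransformerTrainingArguments(
+        output_dir="models/multi-dataset-run",
+        # Sample batches from each dataset proportionally to its size (the default)
+        multi_dataset_batch_sampler=MultiDatasetBatchSamplers.PROPORTIONAL,
+    )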
+
+This multi-task training has been shown to be very effective, e.g. `Huang et al. `_ employed :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, :class:`~sentence_transformers.losses.CoSENTLoss`, and a variation on :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` without in-batch negatives and with only hard negatives to reach state-of-the-art performance on Chinese. They even applied :class:`~sentence_transformers.losses.MatryoshkaLoss` to allow the model to produce `Matryoshka Embeddings <../../examples/training/matryoshka/README.html>`_.
+
+Training on multiple datasets looks like this:
+
+.. sidebar:: Documentation
+
+    - :func:`datasets.load_dataset`
+    - :class:`~sentence_transformers.SentenceTransformer`
+    - :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`
+    - :class:`~sentence_transformers.losses.CoSENTLoss`
+    - :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`
+    - :class:`~sentence_transformers.losses.SoftmaxLoss`
+    - `sentence-transformers/all-nli <https://huggingface.co/datasets/sentence-transformers/all-nli>`_
+    - `sentence-transformers/stsb <https://huggingface.co/datasets/sentence-transformers/stsb>`_
+    - `sentence-transformers/quora-duplicates <https://huggingface.co/datasets/sentence-transformers/quora-duplicates>`_
+    - `sentence-transformers/natural-questions <https://huggingface.co/datasets/sentence-transformers/natural-questions>`_
+
+    **Training Examples:**
+
+    - `Quora Duplicate Questions > Multi-task learning `_
+    - `AllNLI + STSb > Multi-task learning `_
+
+::
+
+    from datasets import load_dataset
+    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
+    from sentence_transformers.losses import CoSENTLoss, MultipleNegativesRankingLoss, SoftmaxLoss
+
+    # 1. Load a model to finetune
+    model = SentenceTransformer("bert-base-uncased")
+
+    # 2. Load several Datasets to train with
+    # (anchor, positive)
+    all_nli_pair_train = load_dataset("sentence-transformers/all-nli", "pair", split="train[:10000]")
+    # (premise, hypothesis) + label
+    all_nli_pair_class_train = load_dataset("sentence-transformers/all-nli", "pair-class", split="train[:10000]")
+    # (sentence1, sentence2) + score
+    all_nli_pair_score_train = load_dataset("sentence-transformers/all-nli", "pair-score", split="train[:10000]")
+    # (anchor, positive, negative)
+    all_nli_triplet_train = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]")
+    # (sentence1, sentence2) + score
+    stsb_pair_score_train = load_dataset("sentence-transformers/stsb", split="train[:10000]")
+    # (anchor, positive)
+    quora_pair_train = load_dataset("sentence-transformers/quora-duplicates", split="train[:10000]")
+    # (query, answer)
+    natural_questions_train = load_dataset("sentence-transformers/natural-questions", split="train[:10000]")
+
+    # We can combine all datasets into a dictionary with dataset names to datasets
+    train_dataset = {
+        "all-nli-pair": all_nli_pair_train,
+        "all-nli-pair-class": all_nli_pair_class_train,
+        "all-nli-pair-score": all_nli_pair_score_train,
+        "all-nli-triplet": all_nli_triplet_train,
+        "stsb": stsb_pair_score_train,
+        "quora": quora_pair_train,
+        "natural-questions": natural_questions_train,
+    }
+
+    # 3. Load several Datasets to evaluate with
+    # (anchor, positive, negative)
+    all_nli_triplet_dev = load_dataset("sentence-transformers/all-nli", "triplet", split="dev")
+    # (sentence1, sentence2, score)
+    stsb_pair_score_dev = load_dataset("sentence-transformers/stsb", split="validation")
+    # (anchor, positive)
+    quora_pair_dev = load_dataset("sentence-transformers/quora-duplicates", split="train[10000:11000]")
+    # (query, answer)
+    natural_questions_dev = load_dataset("sentence-transformers/natural-questions", split="train[10000:11000]")
+
+    # We can use a dictionary for the evaluation dataset too, but we don't have to. We could also just use
+    # no evaluation dataset, or one dataset.
+    eval_dataset = {
+        "all-nli-triplet": all_nli_triplet_dev,
+        "stsb": stsb_pair_score_dev,
+        "quora": quora_pair_dev,
+        "natural-questions": natural_questions_dev,
+    }
+
+    # 4. Load several loss functions to train with
+    # (anchor, positive), (anchor, positive, negative)
+    mnrl_loss = MultipleNegativesRankingLoss(model)
+    # (sentence_A, sentence_B) + class
+    softmax_loss = SoftmaxLoss(model)
+    # (sentence_A, sentence_B) + score
+    cosent_loss = CoSENTLoss(model)
+
+    # Create a mapping with dataset names to loss functions, so the trainer knows which loss to apply where.
+    # Note that you can also just use one loss if all of your training/evaluation datasets use the same loss
+    losses = {
+        "all-nli-pair": mnrl_loss,
+        "all-nli-pair-class": softmax_loss,
+        "all-nli-pair-score": cosent_loss,
+        "all-nli-triplet": mnrl_loss,
+        "stsb": cosent_loss,
+        "quora": mnrl_loss,
+        "natural-questions": mnrl_loss,
+    }
+
+    # 5. Define a simple trainer, although it's recommended to use one with args & evaluators
+    trainer = SentenceTransformerTrainer(
+        model=model,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        loss=losses,
+    )
+    trainer.train()
+
+    # 6. Save the trained model and optionally push it to the Hugging Face Hub
+    model.save_pretrained("bert-base-all-nli-stsb-quora-nq")
+    model.push_to_hub("bert-base-all-nli-stsb-quora-nq")
+```
+
+## Deprecated Training
+```eval_rst
+Prior to the Sentence Transformers v3.0 release, models would be trained with the :meth:`SentenceTransformer.fit <sentence_transformers.SentenceTransformer.fit>` method and a :class:`~torch.utils.data.DataLoader` of :class:`~sentence_transformers.readers.InputExample`, which looked something like this::
+
+    from sentence_transformers import SentenceTransformer, InputExample, losses
+    from torch.utils.data import DataLoader
+
+    # Define the model. Either from scratch or by loading a pre-trained model
+    model = SentenceTransformer("distilbert/distilbert-base-uncased")
+
+    # Define your train examples. You need more than just two examples...
+    train_examples = [
+        InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
+        InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
+    ]
+
+    # Define your train dataset, the dataloader and the train loss
+    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
+    train_loss = losses.CosineSimilarityLoss(model)
+
+    # Tune the model
+    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
+
+Since the v3.0 release, using :meth:`SentenceTransformer.fit <sentence_transformers.SentenceTransformer.fit>` is still possible, but it will initialize a :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` behind the scenes. It is recommended to use the Trainer directly, as you will have more control via the :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`, but existing training scripts relying on :meth:`SentenceTransformer.fit <sentence_transformers.SentenceTransformer.fit>` should still work.
+
+In case there are issues with the updated :meth:`SentenceTransformer.fit <sentence_transformers.SentenceTransformer.fit>`, you can also get exactly the old behaviour by calling :meth:`SentenceTransformer.old_fit <sentence_transformers.SentenceTransformer.old_fit>` instead, but this method will be fully deprecated in a future release.
+
+```
+
+## Best Base Embedding Models
+The quality of your text embedding model depends on which transformer model you choose. Sadly, we cannot infer from better performance on e.g. the GLUE or SuperGLUE benchmarks that a model will also yield better sentence representations.
+
+To test the suitability of transformer models, I use the [training_nli_v2.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py) script and train on 560k (anchor, positive, negative)-triplets for 1 epoch with batch size 64. I then evaluate on 14 diverse text similarity tasks (clustering, semantic search, duplicate detection, etc.) from various domains.
+
+The following table shows the performance of different models on this benchmark:
+
+| Model                                                                                                                               | Performance (14 sentence similarity tasks) |
+|-------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------:|
+| [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base)                                                                 | 60.99 |
+| [nghuyong/ernie-2.0-en](https://huggingface.co/nghuyong/ernie-2.0-en)                                                               | 60.73 |
+| [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base)                                                             | 60.21 |
+| [roberta-base](https://huggingface.co/roberta-base)                                                                                 | 59.63 |
+| [t5-base](https://huggingface.co/t5-base)                                                                                           | 59.21 |
+| [bert-base-uncased](https://huggingface.co/bert-base-uncased)                                                                       | 59.17 |
+| [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased)                                                           | 59.03 |
+| [nreimers/TinyBERT_L-6_H-768_v2](https://huggingface.co/nreimers/TinyBERT_L-6_H-768_v2)                                             | 58.27 |
+| [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base)                                                                   | 57.63 |
+| [nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large](https://huggingface.co/nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large)   | 57.31 |
+| [albert-base-v2](https://huggingface.co/albert-base-v2)                                                                             | 57.14 |
+| [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased)                                       | 56.79 |
+| [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base)                                                       | 54.46 |
+
diff --git a/docs/sentence_transformer/usage/semantic_textual_similarity.rst b/docs/sentence_transformer/usage/semantic_textual_similarity.rst
new file mode 100644
index 000000000..cd4c332b1
--- /dev/null
+++ b/docs/sentence_transformer/usage/semantic_textual_similarity.rst
@@ -0,0 +1,132 @@
+Semantic Textual Similarity
+===========================
+
+For Semantic Textual Similarity (STS), we want to produce embeddings for all texts involved and calculate the similarities between them. The text pairs with the highest similarity score are the most semantically similar. See also the `Computing Embeddings <../../../examples/applications/computing-embeddings/README.html>`_ documentation for more advanced details on getting embedding scores.
+
+.. sidebar:: Documentation
+
+    1. :class:`SentenceTransformer <sentence_transformers.SentenceTransformer>`
+    2. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>`
+    3. :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>`
+
+::
+
+    from sentence_transformers import SentenceTransformer
+
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+
+    # Two lists of sentences
+    sentences1 = [
+        "The new movie is awesome",
+        "The cat sits outside",
+        "A man is playing guitar",
+    ]
+
+    sentences2 = [
+        "The dog plays in the garden",
+        "The new movie is so great",
+        "A woman watches TV",
+    ]
+
+    # Compute embeddings for both lists
+    embeddings1 = model.encode(sentences1)
+    embeddings2 = model.encode(sentences2)
+
+    # Compute cosine similarities
+    similarities = model.similarity(embeddings1, embeddings2)
+
+    # Output the pairs with their score
+    for idx_i, sentence1 in enumerate(sentences1):
+        print(sentence1)
+        for idx_j, sentence2 in enumerate(sentences2):
+            print(f" - {sentence2: <30}: {similarities[idx_i][idx_j]:.4f}")
+
+.. code-block:: txt
+    :emphasize-lines: 3
+
+    The new movie is awesome
+     - The dog plays in the garden   : 0.0543
+     - The new movie is so great     : 0.8939
+     - A woman watches TV            : -0.0502
+    The cat sits outside
+     - The dog plays in the garden   : 0.2838
+     - The new movie is so great     : -0.0029
+     - A woman watches TV            : 0.1310
+    A man is playing guitar
+     - The dog plays in the garden   : 0.2277
+     - The new movie is so great     : -0.0136
+     - A woman watches TV            : -0.0327
+
+In this example, the :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>` method returns a 3x3 matrix with the respective cosine similarity scores for all possible pairs between *embeddings1* and *embeddings2*.
+
+Similarity Calculation
+----------------------
+
+The similarity metric that is used is stored in the SentenceTransformer instance under :attr:`SentenceTransformer.similarity_fn_name <sentence_transformers.SentenceTransformer.similarity_fn_name>`. Valid options are:
+
+- ``SimilarityFunction.COSINE`` (a.k.a. ``"cosine"``): Cosine Similarity (**default**)
+- ``SimilarityFunction.DOT_PRODUCT`` (a.k.a. ``"dot"``): Dot Product
+- ``SimilarityFunction.EUCLIDEAN`` (a.k.a. ``"euclidean"``): Negative Euclidean Distance
+- ``SimilarityFunction.MANHATTAN`` (a.k.a. ``"manhattan"``): Negative Manhattan Distance
+
+This value can be changed in a handful of ways:
+
+1. By initializing the SentenceTransformer instance with the desired similarity function::
+
+    from sentence_transformers import SentenceTransformer, SimilarityFunction
+
+    model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name=SimilarityFunction.DOT_PRODUCT)
+
+2. By setting the value directly on the SentenceTransformer instance::
+
+    from sentence_transformers import SentenceTransformer, SimilarityFunction
+
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+    model.similarity_fn_name = SimilarityFunction.DOT_PRODUCT
+
+3. By setting the value under the ``"similarity_fn_name"`` key in the ``config_sentence_transformers.json`` file of a saved model. When you save a Sentence Transformer model, this value will be automatically saved as well.
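+
+As a minimal sketch of the third option (the file is written for you, so you rarely edit it by hand; the save path here is hypothetical), changing the attribute and saving the model persists the choice::
+
+    from sentence_transformers import SentenceTransformer, SimilarityFunction
+
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+    model.similarity_fn_name = SimilarityFunction.DOT_PRODUCT
+
+    # Saving the model also stores "similarity_fn_name" in config_sentence_transformers.json
+    model.save_pretrained("my/local/model")
+
+    # Reloading restores the configured similarity function
+    model = SentenceTransformer("my/local/model")
+    print(model.similarity_fn_name)
+    # => "dot"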
+
+Sentence Transformers implements two methods to calculate the similarity between embeddings:
+
+- :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>`: Calculates the similarity between all pairs of embeddings.
+- :meth:`SentenceTransformer.pairwise_cosine_similarity `: Calculates the similarity between embeddings in a pairwise fashion.
+
+::
+
+    from sentence_transformers import SentenceTransformer, SimilarityFunction
+
+    # Load a pretrained Sentence Transformer model
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+
+    # Embed some sentences
+    sentences = [
+        "The weather is lovely today.",
+        "It's so sunny outside!",
+        "He drove to the stadium.",
+    ]
+    embeddings = model.encode(sentences)
+
+    similarities = model.similarity(embeddings, embeddings)
+    print(similarities)
+    # tensor([[1.0000, 0.6660, 0.1046],
+    #         [0.6660, 1.0000, 0.1411],
+    #         [0.1046, 0.1411, 1.0000]])
+
+    # Change the similarity function to Manhattan distance
+    model.similarity_fn_name = SimilarityFunction.MANHATTAN
+    print(model.similarity_fn_name)
+    # => "manhattan"
+
+    similarities = model.similarity(embeddings, embeddings)
+    print(similarities)
+    # tensor([[ -0.0000, -12.6269, -20.2167],
+    #         [-12.6269,  -0.0000, -20.1288],
+    #         [-20.2167, -20.1288,  -0.0000]])
+
+.. note::
+
+    If a Sentence Transformer instance ends with a :class:`~sentence_transformers.models.Normalize` module, then it is sensible to choose the "dot" metric instead of "cosine".
+
+    Dot product on normalized embeddings is equivalent to cosine similarity, but "cosine" will re-normalize the embeddings again. As a result, the "dot" metric will be faster than "cosine".
+
+If you want to find the highest scoring pairs in a long list of sentences, have a look at `Paraphrase Mining <../../examples/applications/paraphrase-mining/README.md>`_.
diff --git a/docs/sentence_transformer/usage/usage.rst b/docs/sentence_transformer/usage/usage.rst
new file mode 100644
index 000000000..d8beec379
--- /dev/null
+++ b/docs/sentence_transformer/usage/usage.rst
@@ -0,0 +1,59 @@
+
+Usage
+=====
+
+Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:
+
+1. Calculates a **fixed-size vector representation (embedding)** given **texts or images**.
+2. Embedding calculation is often **efficient**, embedding similarity calculation is **very fast**.
+3. Applicable for a **wide range of tasks**, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
+4. Often used as a **first step in a two-step retrieval process**, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.
+
+Once you have `installed `_ Sentence Transformers, you can easily use Sentence Transformer models:
+
+.. sidebar:: Documentation
+
+    1. :class:`SentenceTransformer <sentence_transformers.SentenceTransformer>`
+    2. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>`
+    3. :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>`
+
+::
+
+    from sentence_transformers import SentenceTransformer
+
+    # 1. Load a pretrained Sentence Transformer model
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+
+    # The sentences to encode
+    sentences = [
+        "The weather is lovely today.",
+        "It's so sunny outside!",
+        "He drove to the stadium.",
+    ]
+
+    # 2. Calculate embeddings by calling model.encode()
+    embeddings = model.encode(sentences)
+    print(embeddings.shape)
+    # [3, 384]
+
+    # 3. Calculate the embedding similarities
+    similarities = model.similarity(embeddings, embeddings)
+    print(similarities)
+    # tensor([[1.0000, 0.6660, 0.1046],
+    #         [0.6660, 1.0000, 0.1411],
+    #         [0.1046, 0.1411, 1.0000]])
+
+.. toctree::
+    :maxdepth: 1
+    :caption: Tasks and Advanced Usage
+
+    ../../../examples/applications/computing-embeddings/README
+    semantic_textual_similarity
+    ../../../examples/applications/semantic-search/README
+    ../../../examples/applications/retrieve_rerank/README
+    ../../../examples/applications/clustering/README
+    ../../../examples/applications/paraphrase-mining/README
+    ../../../examples/applications/parallel-sentence-mining/README
+    ../../../examples/applications/image-search/README
+    ../../../examples/applications/embedding-quantization/README
+
diff --git a/docs/training/overview.md b/docs/training/overview.md
deleted file mode 100644
index b3f3ad539..000000000
--- a/docs/training/overview.md
+++ /dev/null
@@ -1,283 +0,0 @@
-# Training Overview
-
-Each task is unique, and having sentence / text embeddings tuned for that specific task greatly improves the performance.
-
-SentenceTransformers was designed in such way that fine-tuning your own sentence / text embeddings models is easy. It provides most of the building blocks that you can stick together to tune embeddings for your specific task.
-
-Sadly there is no single training strategy that works for all use-cases. Instead, which training strategy to use greatly depends on your available data and on your target task.
-
-In the **Training** section, I will discuss the fundamentals of training your own embedding models with SentenceTransformers.
In the **Training Examples** section, I will provide examples how to tune embedding models for common real-world applications. - -## Network Architecture - -For sentence / text embeddings, we want to map a variable length input text to a fixed sized dense vector. The most basic network architecture we can use is the following: - -![SBERT Network Architecture](../img/SBERT_Architecture.png "SBERT Siamese Architecture") - - -We feed the input sentence or text into a transformer network like BERT. BERT produces contextualized word embeddings for all input tokens in our text. As we want a fixed-sized output representation (vector u), we need a pooling layer. Different pooling options are available, the most basic one is mean-pooling: We simply average all contextualized word embeddings BERT is giving us. This gives us a fixed 768 dimensional output vector independent how long our input text was. - -The depicted architecture, consisting of a BERT layer and a pooling layer is one final SentenceTransformer model. - -## Creating Networks from Scratch - -In the quick start & usage examples, we used pre-trained SentenceTransformer models that already come with a BERT layer and a pooling layer. - -But we can create the networks architectures from scratch by defining the individual layers. For example, the following code would create the depicted network architecture: - -```python -from sentence_transformers import SentenceTransformer, models - -word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256) -pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension()) - -model = SentenceTransformer(modules=[word_embedding_model, pooling_model]) -``` - -First we define our individual layers, in this case, we define 'bert-base-uncased' as the *word_embedding_model*. We limit that layer to a maximal sequence length of 256, texts longer than that will be truncated. Further, we create a (mean) pooling layer. We create a new *SentenceTransformer* model by calling `SentenceTransformer(modules=[word_embedding_model, pooling_model])`. For the *modules* parameter, we pass a list of layers which are executed consecutively. Input text are first passed to the first entry (*word_embedding_model*). The output is then passed to the second entry (*pooling_model*), which then returns our sentence embedding. - -We can also construct more complex models: -```python -from sentence_transformers import SentenceTransformer, models -from torch import nn - -word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256) -pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension()) -dense_model = models.Dense( - in_features=pooling_model.get_sentence_embedding_dimension(), - out_features=256, - activation_function=nn.Tanh(), -) - -model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model]) -``` - -Here, we add on top of the pooling layer a fully connected dense layer with Tanh activation, which performs a down-project to 256 dimensions. Hence, embeddings by this model will only have 256 instead of 768 dimensions. 
- -Additionally, we can also create SentenceTransformer models from scratch for image search by loading any CLIP model from the Hugging Face Hub or a local path: - -```py -from sentence_transformers import SentenceTransformer, models - -image_embedding_model = models.CLIPModel("openai/clip-vit-base-patch32") -model = SentenceTransformer(modules=[image_embedding_model]) -``` - -For all available building blocks see [» Models Package Reference](../package_reference/models.md) - -## Training Data - -To train a SentenceTransformer model, you need to inform it somehow that two sentences have a certain degree of similarity. Therefore, each example in the data requires a label or structure that allows the model to understand whether two sentences are similar or different. - -Unfortunately, there is no single way to prepare your data to train a Sentence Transformers model. It largely depends on your goals and the structure of your data. If you don't have an explicit label, which is the most likely scenario, you can derive it from the design of the documents where you obtained the sentences. For example, two sentences in the same report should be more comparable than two sentences in different reports. Neighboring sentences might be more comparable than non-neighboring sentences. - -For more information on available datasets for training SentenceTransformers models see [» Datasets Reference](../../examples/training/datasets/README.md). - -To represent our training data, we use the `InputExample` class to store training examples. As parameters, it accepts texts, which is a list of strings representing our pairs (or triplets). Further, we can also pass a label (either float or int). The following shows a simple example, where we pass text pairs to `InputExample` together with a label indicating the semantic similarity. - - ```python - from sentence_transformers import SentenceTransformer, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer("distilbert-base-nli-mean-tokens") - train_examples = [ - InputExample(texts=["My first sentence", "My second sentence"], label=0.8), - InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16) - ``` - -We wrap our `train_examples` with the standard PyTorch `DataLoader`, which shuffles our data and produces batches of certain sizes. - - - -## Loss Functions - -The loss function plays a critical role when fine-tuning the model. It determines how well our embedding model will work for the specific downstream task. - -Sadly there is no "one size fits all" loss function. Which loss function is suitable depends on the available training data and on the target task. - - -To fine-tune our network, we need somehow to tell our network which sentence pairs are similar, and should be close in vector space, and which pairs are dissimilar, and should be far away in vector space. - -The most simple way is to have sentence pairs annotated with a score indicating their similarity, e.g. on a scale 0 to 1. We can then train the network with a Siamese Network Architecture (for details see: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)) - -![SBERT Siamese Network Architecture](../img/SBERT_Siamese_Network.png "SBERT Siamese Architecture") - - -For each sentence pair, we pass sentence A and sentence B through our network which yields the embeddings *u* und *v*. 
The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. This allows our network to be fine-tuned and to recognize the similarity of sentences. - - -A minimal example with `CosineSimilarityLoss` is the following: -```python -from sentence_transformers import SentenceTransformer, InputExample, losses -from torch.utils.data import DataLoader - -# Define the model. Either from scratch of by loading a pre-trained model -model = SentenceTransformer("distilbert-base-nli-mean-tokens") - -# Define your train examples. You need more than just two examples... -train_examples = [ - InputExample(texts=["My first sentence", "My second sentence"], label=0.8), - InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3), -] - -# Define your train dataset, the dataloader and the train loss -train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16) -train_loss = losses.CosineSimilarityLoss(model) - -# Tune the model -model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100) -``` - - -We tune the model by calling model.fit(). We pass a list of `train_objectives`, which consist of tuples `(dataloader, loss_function)`. We can pass more than one tuple in order to perform multi-task learning on several datasets with different loss functions. - -The `fit` method accepts the following parameter: - -```eval_rst -.. autoclass:: sentence_transformers.SentenceTransformer - :members: fit -``` - -## Evaluators - -During training, we usually want to measure the performance to see if the performance improves. For this, the *[sentence_transformers.evaluation](../package_reference/evaluation.md)* package exists. It contains various evaluators which we can pass to the `fit`-method. These evaluators are run periodically during training. Further, they return a score and only the model with the highest score will be stored on disc. - -The usage is simple: -```python -from sentence_transformers import evaluation - -sentences1 = [ - "This list contains the first column", - "With your sentences", - "You want your model to evaluate on", -] -sentences2 = [ - "Sentences contains the other column", - "The evaluator matches sentences1[i] with sentences2[i]", - "Compute the cosine similarity and compares it to scores[i]", -] -scores = [0.3, 0.6, 0.2] - -evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores) - -# ... Your other code to load training data - -model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=1, - warmup_steps=100, - evaluator=evaluator, - evaluation_steps=500, -) -``` - - - -### Continue Training on Other Data -[training_stsbenchmark_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py) shows an example where training on a fine-tuned model is continued. In that example, we use a sentence transformer model that was first fine-tuned on the NLI dataset and then continue training on the training data from the STS benchmark. - -First, we load a pre-trained model from the server: -```python -model = SentenceTransformer("bert-base-nli-mean-tokens") -``` - - -The next steps are as before. 
We specify training and dev data: -```python -train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size) -train_loss = losses.CosineSimilarityLoss(model=model) - -evaluator = EmbeddingSimilarityEvaluator.from_input_examples( - sts_reader.get_examples("sts-dev.csv") -) -``` - -In that example, we use CosineSimilarityLoss, which computes the cosine similarity between two sentences and compares this score with a provided gold similarity score. - -Then we can train as before: -```python -model.fit( - train_objectives=[(train_dataloader, train_loss)], - evaluator=evaluator, - epochs=num_epochs, - evaluation_steps=1000, - warmup_steps=warmup_steps, - output_path=model_save_path, -) -``` - - -## Loading Custom SentenceTransformer Models -Loading trained models is easy. You can specify a path: -```python -model = SentenceTransformer("./my/path/to/model/") -``` -Note: It is important that a / or \ is present in the path, otherwise, it is not recognized as a path. - -You can also host the training output on a server and download it: - ```python -model = SentenceTransformer('http://www.server.com/path/to/model/my_model.zip') -``` -With the first call, the model is downloaded and stored in the local Hugging Face cache folder (`~/.cache/huggingface`). In order to work, you must zip all files and subfolders of your model. - - - -## Multitask Training -This code allows multi-task learning with training data from different datasets and with different loss-functions. For an example, see [training_multi-task.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_multi-task.py). - - -## Adding Special Tokens - -Depending on the task, you might want to add special tokens to the tokenizer and the Transformer model. You can use the following code-snippet to achieve this: -```python -from sentence_transformers import SentenceTransformer, models - -word_embedding_model = models.Transformer("bert-base-uncased") - -tokens = ["[DOC]", "[QRY]"] -word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True) -word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer)) - -pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension()) -model = SentenceTransformer(modules=[word_embedding_model, pooling_model]) -``` - -If you want to extend the vocabulary for an existent SentenceTransformer model, you can use the following code: -```python -from sentence_transformers import SentenceTransformer, models - -model = SentenceTransformer("all-MiniLM-L6-v2") -word_embedding_model = model._first_module() - -tokens = ["[DOC]", "[QRY]"] -word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True) -word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer)) -``` - -In the above example, the two new tokens `[DOC]` and `[QRY]` are added to the model. Their respective word embeddings are intialized randomly. It is advisable to then fine-tune the model on your downstream task. - - -## Best Transformer Model -The quality of your text embedding model depends on which transformer model you choose. Sadly we cannot infer from a better performance on e.g. the GLUE or SuperGLUE benchmark that this model will also yield better representations. 
- -To test the suitability of transformer models, I use the [training_nli_v2.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py) script and train on 560k (anchor, positive, negative)-triplets for 1 epoch with batch size 64. I then evaluate on 14 diverse text similarity tasks (clustering, semantic search, duplicate detection etc.) from various domains. - -In the following table you find the performance for different models and their performance on this benchmark: - -| Model | Performance (14 sentence similarity tasks) | -| --- | :---: | -| [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) | 60.99 | -| [nghuyong/ernie-2.0-en](https://huggingface.co/nghuyong/ernie-2.0-en) | 60.73 | -| [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base) | 60.21 | -| [roberta-base](https://huggingface.co/roberta-base) | 59.63 | -| [t5-base](https://huggingface.co/t5-base) | 59.21 | -| [bert-base-uncased](https://huggingface.co/bert-base-uncased) | 59.17 | -| [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) | 59.03 | -| [nreimers/TinyBERT_L-6_H-768_v2](https://huggingface.co/nreimers/TinyBERT_L-6_H-768_v2) | 58.27 | -| [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) | 57.63 | -| [nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large](https://huggingface.co/nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large) | 57.31 | -| [albert-base-v2](https://huggingface.co/albert-base-v2) | 57.14 | -| [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased) | 56.79 | -| [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) | 54.46 | diff --git a/docs/usage/semantic_textual_similarity.md b/docs/usage/semantic_textual_similarity.md deleted file mode 100644 index fc772d1e1..000000000 --- a/docs/usage/semantic_textual_similarity.md +++ /dev/null @@ -1,82 +0,0 @@ -# Semantic Textual Similarity - -Once you have [sentence embeddings computed](../../examples/applications/computing-embeddings/README.md), you usually want to compare them to each other. Here, I show you how you can compute the cosine similarity between embeddings, for example, to measure the semantic similarity of two texts. - -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Two lists of sentences -sentences1 = [ - "The cat sits outside", - "A man is playing guitar", - "The new movie is awesome", -] - -sentences2 = [ - "The dog plays in the garden", - "A woman watches TV", - "The new movie is so great", -] - -# Compute embedding for both lists -embeddings1 = model.encode(sentences1, convert_to_tensor=True) -embeddings2 = model.encode(sentences2, convert_to_tensor=True) - -# Compute cosine-similarities -cosine_scores = util.cos_sim(embeddings1, embeddings2) - -# Output the pairs with their score -for i in range(len(sentences1)): - print("{} \t\t {} \t\t Score: {:.4f}".format( - sentences1[i], sentences2[i], cosine_scores[i][i] - )) -``` - -We pass the `convert_to_tensor=True` parameter to the encode function. This will return a pytorch tensor containing our embeddings. We can then call `util.cos_sim(A, B)` which computes the cosine similarity between all vectors in *A* and all vectors in *B*. - -It returns in the above example a 3x3 matrix with the respective cosine similarity scores for all possible pairs between *embeddings1* and *embeddings2*. 
- - -You can use this function also to find out the pairs with the highest cosine similarity scores: -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Single list of sentences -sentences = [ - "The cat sits outside", - "A man is playing guitar", - "I love pasta", - "The new movie is awesome", - "The cat plays in the garden", - "A woman watches TV", - "The new movie is so great", - "Do you like pizza?", -] - -# Compute embeddings -embeddings = model.encode(sentences, convert_to_tensor=True) - -# Compute cosine-similarities for each sentence with each other sentence -cosine_scores = util.cos_sim(embeddings, embeddings) - -# Find the pairs with the highest cosine similarity scores -pairs = [] -for i in range(cosine_scores.shape[0]): - for j in range(cosine_scores.shape[1]): - pairs.append({"index": [i, j], "score": cosine_scores[i][j]}) - -# Sort scores in decreasing order -pairs = sorted(pairs, key=lambda x: x["score"], reverse=True) - -for pair in pairs[0:10]: - i, j = pair["index"] - print("{} \t\t {} \t\t Score: {:.4f}".format( - sentences[i], sentences[j], pair["score"] - )) -``` - -Note, in the above approach we use a brute-force approach to find the highest scoring pairs, which has a quadratic complexity. For long lists of sentences, this might be infeasible. If you want find the highest scoring pairs in a long list of sentences, have a look at [Paraphrase Mining](../../examples/applications/paraphrase-mining/README.md). \ No newline at end of file diff --git a/examples/README.md b/examples/README.md index 4ab4c6c6f..145d3f3b6 100644 --- a/examples/README.md +++ b/examples/README.md @@ -9,7 +9,7 @@ The [applications](applications/) folder contains examples how to use SentenceTr The [evaluation](evaluation/) folder contains some examples how to evaluate SentenceTransformer models for common tasks. ## Training -The [training](training/) folder contains examples how to fine-tune transformer models like BERT, RoBERTa, or XLM-RoBERTa for generating sentence embedding. For the documentation how to train your own models, see [Training Overview](http://www.sbert.net/docs/training/overview.html). +The [training](training/) folder contains examples how to fine-tune transformer models like BERT, RoBERTa, or XLM-RoBERTa for generating sentence embedding. For the documentation how to train your own models, see [Training Overview](http://www.sbert.net/docs/sentence_transformer/training_overview.html). ## Unsupervised Learning diff --git a/examples/applications/clustering/README.md b/examples/applications/clustering/README.md index 98c5e64f7..d8d1c3e9c 100644 --- a/examples/applications/clustering/README.md +++ b/examples/applications/clustering/README.md @@ -15,7 +15,7 @@ In [fast_clustering.py](fast_clustering.py) we present a clustering algorithm th You can configure the threshold of cosine-similarity for which we consider two sentences as similar. Also, you can specify the minimal size for a local community. This allows you to get either large coarse-grained clusters or small fine-grained clusters. 
-We apply it on the [Quora Duplicate Questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset and the output looks something like this: +We apply it on the [Quora Duplicate Questions](https://huggingface.co/datasets/sentence-transformers/quora-duplicates) dataset and the output looks something like this: ``` Cluster 1, #83 Elements @@ -51,7 +51,6 @@ For each topic, you want to extract the words that describe this topic: ![20news](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/20news_top2vec.png) -Sentence-Transformers can be used to identify these topics in a collection of sentences, paragraphs or short documents. For an excellent tutorial, see [Topic Modeling with BERT](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) as well as the repositories [Top2Vec](https://github.com/ddangelov/Top2Vec) and [BERTopic](https://github.com/MaartenGr/BERTopic). +Sentence-Transformers can be used to identify these topics in a collection of sentences, paragraphs or short documents. For an excellent tutorial, see [Topic Modeling with BERT](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) as well as the [BERTopic](https://github.com/MaartenGr/BERTopic) and [Top2Vec](https://github.com/ddangelov/Top2Vec) repositories. - - Image source: [Top2Vec: Distributed Representations of Topics](https://arxiv.org/abs/2008.09470) +Image source: [Top2Vec: Distributed Representations of Topics](https://arxiv.org/abs/2008.09470) diff --git a/examples/applications/computing-embeddings/README.md b/examples/applications/computing-embeddings/README.md deleted file mode 100644 index db1e7d80d..000000000 --- a/examples/applications/computing-embeddings/README.md +++ /dev/null @@ -1,213 +0,0 @@ -# Computing Sentence Embeddings - - - -The basic function to compute sentence embeddings looks like this: -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Our sentences we like to encode -sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of strings.", - "The quick brown fox jumps over the lazy dog.", -] - -# Sentences are encoded by calling model.encode() -embeddings = model.encode(sentences) - -# Print the embeddings -for sentence, embedding in zip(sentences, embeddings): - print("Sentence:", sentence) - print("Embedding:", embedding) - print("") -``` - -**Note:** Even though we talk about sentence embeddings, you can use it also for shorter phrases as well as for longer texts with multiple sentences. See the section on Input Sequence Length for more notes on embeddings for paragraphs. - -First, we load a sentence-transformer model: -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("model_name_or_path") -``` - -You can either specify a [pre-trained model](https://www.sbert.net/docs/pretrained_models.html) or you can pass a path on your disc to load the sentence-transformer model from that folder. - -If available, the model is automatically executed on the GPU. You can specify the device for the model like this: -```python -model = SentenceTransformer("model_name_or_path", device="cuda") -``` - -With *device* any pytorch device (like CPU, cuda, cuda:0 etc.) - - -The relevant method to encode a set of sentences / texts is `model.encode()`. In the following, you can find parameters this method accepts. 
Some relevant parameters are *batch_size* (depending on your GPU a different batch size is optimal) as well as *convert_to_numpy* (returns a numpy matrix) and *convert_to_tensor* (returns a pytorch tensor). - -```eval_rst -.. autoclass:: sentence_transformers.SentenceTransformer - :members: encode -``` - -## Prompt Templates -Some models require using specific text *prompts* to achieve optimal performance. For example, with [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) you should prefix all queries with `query: ` and all passages with `passage: `. Another example is [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5), which performs best for retrieval when the input texts are prefixed with `Represent this sentence for searching relevant passages: `. - -Sentence Transformer models can be initialized with `prompts` and `default_prompt_name` parameters: -* `prompts` is an optional argument that accepts a dictionary of prompts with prompt names to prompt texts. The prompt will be prepended to the input text during inference. For example, - ```python - model = SentenceTransformer( - "intfloat/multilingual-e5-large", - prompts={ - "classification": "Classify the following text: ", - "retrieval": "Retrieve semantically similar text: ", - "clustering": "Identify the topic or theme based on the text: ", - }, - ) - # or - model.prompts = { - "classification": "Classify the following text: ", - "retrieval": "Retrieve semantically similar text: ", - "clustering": "Identify the topic or theme based on the text: ", - } - ``` -* `default_prompt_name` is an optional argument that determines the default prompt to be used. It has to correspond with a prompt name from `prompts`. If `None`, then no prompt is used by default. For example, - ```python - model = SentenceTransformer( - "intfloat/multilingual-e5-large", - prompts={ - "classification": "Classify the following text: ", - "retrieval": "Retrieve semantically similar text: ", - "clustering": "Identify the topic or theme based on the text: ", - }, - default_prompt_name="retrieval", - ) - # or - model.default_prompt_name="retrieval" - ``` -Both of these parameters can also be specified in the `config_sentence_transformers.json` file of a saved model. That way, you won't have to specify these options manually when loading. When you save a Sentence Transformer model, these options will be automatically saved as well. - - -During inference, prompts can be applied in a few different ways. All of these scenarios result in identical texts being embedded: -1. Explicitly using the `prompt` option in `SentenceTransformer.encode`: - ```python - embeddings = model.encode("How to bake a strawberry cake", prompt="Retrieve semantically similar text: ") - ``` -2. Explicitly using the `prompt_name` option in `SentenceTransformer.encode` by relying on the prompts loaded from a) initialization or b) the model config. - ```python - embeddings = model.encode("How to bake a strawberry cake", prompt_name="retrieval") - ``` -3. If `prompt` nor `prompt_name` are specified in `SentenceTransformer.encode`, then the prompt specified by `default_prompt_name` will be applied. If it is `None`, then no prompt will be applied. - ```python - embeddings = model.encode("How to bake a strawberry cake") - ``` - - -## Input Sequence Length -Transformer models like BERT / RoBERTa / DistilBERT etc. the runtime and the memory requirement grows quadratic with the input length. This limits transformers to inputs of certain lengths. 
A common value for BERT & Co. are 512 word pieces, which corresponds to about 300-400 words (for English). Longer texts than this are truncated to the first x word pieces. - -By default, the provided methods use a limit of 128 word pieces, longer inputs will be truncated. You can get and set the maximal sequence length like this: - -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("all-MiniLM-L6-v2") - -print("Max Sequence Length:", model.max_seq_length) - -# Change the length to 200 -model.max_seq_length = 200 - -print("Max Sequence Length:", model.max_seq_length) -``` - -**Note:** You cannot increase the length higher than what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be that good. - -## Storing & Loading Embeddings -The easiest method is to use *pickle* to store pre-computed embeddings on disc and to load it from disc. This can especially be useful if you need to encode large set of sentences. - - -```python -from sentence_transformers import SentenceTransformer -import pickle - -model = SentenceTransformer("all-MiniLM-L6-v2") -sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of string.", - "The quick brown fox jumps over the lazy dog.", -] - - -embeddings = model.encode(sentences) - -# Store sentences & embeddings on disc -with open("embeddings.pkl", "wb") as fOut: - pickle.dump({"sentences": sentences, "embeddings": embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL) - -# Load sentences & embeddings from disc -with open("embeddings.pkl", "rb") as fIn: - stored_data = pickle.load(fIn) - stored_sentences = stored_data["sentences"] - stored_embeddings = stored_data["embeddings"] -``` - -## Multi-Process / Multi-GPU Encoding - -You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). For an example, see: [computing_embeddings_multi_gpu.py](computing_embeddings_multi_gpu.py). - -The relevant method is `start_multi_process_pool()`, which starts multiple processes that are used for encoding. - - ```eval_rst -.. automethod:: sentence_transformers.SentenceTransformer.start_multi_process_pool -``` - -## Sentence Embeddings with Transformers -Most of our pre-trained models are based on [Huggingface.co/Transformers](https://huggingface.co/transformers/) and are also hosted in the [models repository](https://huggingface.co/models) from Huggingface. 
It is possible to use our sentence embeddings models without installing sentence-transformers: - -```python -from transformers import AutoTokenizer, AutoModel -import torch - - -# Mean Pooling - Take attention mask into account for correct averaging -def mean_pooling(model_output, attention_mask): - token_embeddings = model_output[0] # First element of model_output contains all token embeddings - input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() - sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) - sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9) - return sum_embeddings / sum_mask - - -# Sentences we want sentence embeddings for -sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of string.", - "The quick brown fox jumps over the lazy dog.", -] - -# Load AutoModel from huggingface model repository -tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2") -model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2") - -# Tokenize sentences -encoded_input = tokenizer( - sentences, padding=True, truncation=True, max_length=128, return_tensors="pt" -) - -# Compute token embeddings -with torch.no_grad(): - model_output = model(**encoded_input) - -# Perform pooling. In this case, mean pooling -sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"]) -``` - - -You can find the available models here: [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers) - - -In the above example we add mean pooling on top of the AutoModel (which will load a BERT model). We also have models with max-pooling and where we use the CLS token. How to apply this pooling correctly, have a look at [sentence-transformers/bert-base-nli-max-tokens](https://huggingface.co/sentence-transformers/bert-base-nli-max-tokens) and [/sentence-transformers/bert-base-nli-cls-token](https://huggingface.co/sentence-transformers/bert-base-nli-cls-token). - - diff --git a/examples/applications/computing-embeddings/README.rst b/examples/applications/computing-embeddings/README.rst new file mode 100644 index 000000000..2abc73fde --- /dev/null +++ b/examples/applications/computing-embeddings/README.rst @@ -0,0 +1,150 @@ +Computing Embeddings +==================== + +Once you have `installed `_ Sentence Transformers, you can easily use Sentence Transformer models: + +.. sidebar:: Documentation + + 1. :class:`SentenceTransformer <sentence_transformers.SentenceTransformer>` + 2. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>` + 3. :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>` + +:: + + from sentence_transformers import SentenceTransformer + + # 1. Load a pretrained Sentence Transformer model + model = SentenceTransformer("all-MiniLM-L6-v2") + + # The sentences to encode + sentences = [ + "The weather is lovely today.", + "It's so sunny outside!", + "He drove to the stadium.", + ] + + # 2. Calculate embeddings by calling model.encode() + embeddings = model.encode(sentences) + print(embeddings.shape) + # [3, 384] + + # 3. Calculate the embedding similarities + similarities = model.similarity(embeddings, embeddings) + print(similarities) + # tensor([[1.0000, 0.6660, 0.1046], + # [0.6660, 1.0000, 0.1411], + # [0.1046, 0.1411, 1.0000]]) + +.. note:: + Even though we talk about sentence embeddings, you can use Sentence Transformers for shorter phrases as well as for longer texts with multiple sentences.
See `Input Sequence Length <#input-sequence-length>`_ for notes on embeddings for longer texts. + + +Initializing a Sentence Transformer Model +----------------------------------------- + +The first step is to load a pretrained Sentence Transformer model. You can use any of the models from the `Pretrained Models <../docs/sentence_transformer/pretrained_models.html>`_ or a local model. See also :class:`~sentence_transformers.SentenceTransformer` for information on parameters. + +:: + + from sentence_transformers import SentenceTransformer + + model = SentenceTransformer("all-mpnet-base-v2") + # Alternatively, you can pass a path to a local model directory: + model = SentenceTransformer("output/models/mpnet-base-finetuned-all-nli") + +The model will automatically be placed on the most performant available device, e.g. ``cuda`` or ``mps`` if available. You can also specify the device explicitly: + +:: + + model = SentenceTransformer("all-mpnet-base-v2", device="cuda") + +Calculating Embeddings +---------------------- + +The method to calculate embeddings is :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>`. + + +Prompt Templates +---------------- + +Some models require using specific text *prompts* to achieve optimal performance. For example, with `intfloat/multilingual-e5-large <https://huggingface.co/intfloat/multilingual-e5-large>`_ you should prefix all queries with ``"query: "`` and all passages with ``"passage: "``. Another example is `BAAI/bge-large-en-v1.5 <https://huggingface.co/BAAI/bge-large-en-v1.5>`_, which performs best for retrieval when the input texts are prefixed with ``"Represent this sentence for searching relevant passages: "``. + +Sentence Transformer models can be initialized with ``prompts`` and ``default_prompt_name`` parameters: + +- ``prompts`` is an optional argument that accepts a dictionary of prompts with prompt names to prompt texts. The prompt will be prepended to the input text during inference. For example:: + + model = SentenceTransformer( + "intfloat/multilingual-e5-large", + prompts={ + "classification": "Classify the following text: ", + "retrieval": "Retrieve semantically similar text: ", + "clustering": "Identify the topic or theme based on the text: ", + }, + ) + # or + model.prompts = { + "classification": "Classify the following text: ", + "retrieval": "Retrieve semantically similar text: ", + "clustering": "Identify the topic or theme based on the text: ", + } + +- ``default_prompt_name`` is an optional argument that determines the default prompt to be used. It has to correspond with a prompt name from ``prompts``. If ``None``, then no prompt is used by default. For example:: + + model = SentenceTransformer( + "intfloat/multilingual-e5-large", + prompts={ + "classification": "Classify the following text: ", + "retrieval": "Retrieve semantically similar text: ", + "clustering": "Identify the topic or theme based on the text: ", + }, + default_prompt_name="retrieval", + ) + # or + model.default_prompt_name="retrieval" + +Both of these parameters can also be specified in the ``config_sentence_transformers.json`` file of a saved model. That way, you won't have to specify these options manually when loading. When you save a Sentence Transformer model, these options will be automatically saved as well. + +During inference, prompts can be applied in a few different ways. All of these scenarios result in identical texts being embedded: + +1.
Explicitly using the ``prompt`` option in ``SentenceTransformer.encode``:: + + embeddings = model.encode("How to bake a strawberry cake", prompt="Retrieve semantically similar text: ") + +2. Explicitly using the ``prompt_name`` option in ``SentenceTransformer.encode`` by relying on the prompts loaded from a) initialization or b) the model config:: + + embeddings = model.encode("How to bake a strawberry cake", prompt_name="retrieval") + +3. If neither ``prompt`` nor ``prompt_name`` is specified in ``SentenceTransformer.encode``, then the prompt specified by ``default_prompt_name`` will be applied. If it is ``None``, then no prompt will be applied:: + + embeddings = model.encode("How to bake a strawberry cake") + +Input Sequence Length +--------------------- + +For transformer models like BERT, RoBERTa, DistilBERT etc., the runtime and memory requirements grow quadratically with the input length. This limits transformers to inputs of certain lengths. A common value for BERT-based models is 512 tokens, which corresponds to about 300-400 words (for English). + +Each model has a maximum sequence length under ``model.max_seq_length``, which is the maximal number of tokens that can be processed. Longer texts will be truncated to the first ``model.max_seq_length`` tokens:: + + from sentence_transformers import SentenceTransformer + + model = SentenceTransformer("all-MiniLM-L6-v2") + print("Max Sequence Length:", model.max_seq_length) + # => Max Sequence Length: 256 + + # Change the length to 200 + model.max_seq_length = 200 + + print("Max Sequence Length:", model.max_seq_length) + # => Max Sequence Length: 200 + +.. note:: + + You cannot increase the length beyond what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be that good. + +Multi-Process / Multi-GPU Encoding +---------------------------------- + +You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). For an example, see: `computing_embeddings_multi_gpu.py <computing_embeddings_multi_gpu.py>`_. + + +The relevant method is :meth:`~sentence_transformers.SentenceTransformer.start_multi_process_pool`, which starts multiple processes that are used for encoding. \ No newline at end of file diff --git a/examples/applications/embedding-quantization/README.md b/examples/applications/embedding-quantization/README.md index b75a96ac0..964d2997c 100644 --- a/examples/applications/embedding-quantization/README.md +++ b/examples/applications/embedding-quantization/README.md @@ -20,6 +20,16 @@ Quantizing an embedding with a dimensionality of 1024 to binary would result in As a result, in practice quantizing a `float32` embedding with a dimensionality of 1024 yields an `int8` or `uint8` embedding with a dimensionality of 128. See two approaches of how you can produce quantized embeddings using Sentence Transformers below: +```eval_rst +.. sidebar:: References + + #. `mixedbread-ai/mxbai-embed-large-v1 <https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1>`_ + #. :class:`~sentence_transformers.SentenceTransformer` + #. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>` + #. :func:`~sentence_transformers.quantization.quantize_embeddings` + +``` + ```python from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings @@ -38,11 +48,6 @@ embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."
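# Binary quantization maps each float32 dimension to a single bit, so a
# 1024-dimensional embedding is packed into 128 int8/uint8 values (32x smaller)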
binary_embeddings = quantize_embeddings(embeddings, precision="binary") ``` -**References:** -* mixedbread-ai/mxbai-embed-large-v1 -* SentenceTransformer.encode -* quantize_embeddings - Here you can see the differences between default `float32` embeddings and binary embeddings in terms of shape, size, and `numpy` dtype: ```python @@ -84,6 +89,16 @@ Computing int8 quantization buckets based on 2 embeddings. int8 quantization is See how you can produce scalar quantized embeddings using Sentence Transformers below: +```eval_rst +.. sidebar:: References + + #. `mixedbread-ai/mxbai-embed-large-v1 <https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1>`_ + #. :class:`~sentence_transformers.SentenceTransformer` + #. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>` + #. :func:`~sentence_transformers.quantization.quantize_embeddings` + +``` + ```python from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings @@ -105,11 +120,6 @@ int8_embeddings = quantize_embeddings( ) ``` -**References:** -* mixedbread-ai/mxbai-embed-large-v1 -* SentenceTransformer.encode -* quantize_embeddings - Here you can see the differences between default `float32` embeddings and `int8` scalar embeddings in terms of shape, size, and `numpy` dtype: ```python @@ -154,6 +164,7 @@ The following demo showcases the retrieval efficiency using `exact` search throu width="100%" height="1000" > +
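To give a feel for what such exact binary retrieval involves, here is a minimal sketch (the corpus and query are made up for illustration; it uses `quantize_embeddings` with `precision="ubinary"` and a plain `numpy` Hamming distance, not the code behind the demo):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

corpus = [
    "Berlin is the capital of Germany.",
    "The lake freezes over in winter.",
]
# "ubinary" packs one bit per dimension into uint8 values (dim / 8 bytes per text)
corpus_bits = quantize_embeddings(model.encode(corpus), precision="ubinary")
query_bits = quantize_embeddings(model.encode(["What is the capital of Germany?"]), precision="ubinary")

# Exact search: Hamming distance = number of differing bits per (query, document) pair
hamming = np.unpackbits(query_bits ^ corpus_bits, axis=1).sum(axis=1)
print(hamming.argsort())  # corpus indices ranked from most to least similar
```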

    ## Try it yourself diff --git a/examples/applications/parallel-sentence-mining/README.md b/examples/applications/parallel-sentence-mining/README.md index de99e60f4..4996ee38d 100644 --- a/examples/applications/parallel-sentence-mining/README.md +++ b/examples/applications/parallel-sentence-mining/README.md @@ -24,10 +24,10 @@ This is an example sentences. Dies ist ein Beispielsatz. Usually you apply this method to large corpora, for example, when you want to find all translated sentences in the English Wikipedia and the Chinese Wikipedia. -## Marging Based Mining +## Margin Based Mining We follow the setup from [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pdf/1812.10464.pdf) to find translated sentences in two datasets: -1) First, we encode all sentences to their respective embedding. As shown in [our paper](https://arxiv.org/abs/2004.09813) is [LaBSE](https://tfhub.dev/google/LaBSE/1) currently the best method for bitext mining. The model is integrated in Sentence-Transformers +1) First, we encode all sentences to their respective embedding. As shown in [our paper](https://arxiv.org/abs/2004.09813), [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) is currently the best method for bitext mining. The model is integrated in Sentence-Transformers 2) Once we have all embeddings, we find the *k* nearest neighbor sentences for all sentences in both directions. Typical choices for k are between 4 and 16. 3) Then, we score all possible sentence combinations using the formula mentioned in Section 4.3. 4) The pairs with the highest scores are most likely translated sentences. Note that the score can be larger than 1. Usually you have to find some cut-off where you ignore pairs below that threshold. For a high quality, a threshold of about 1.2 - 1.3 works quite well. @@ -35,4 +35,4 @@ We follow the setup from [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pd ## Examples - **[bucc2018.py](bucc2018.py)** - This script contains an example for the [BUCC 2018 shared task](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) on finding parallel sentences. This dataset can be used to evaluate different strategies, as we know which sentences are parallel in the two corpora. The script mines for parallel sentences and then prints the optimal threshold that leads to the highest F1-score. - **[bitext_mining.py](bitext_mining.py)** - This file reads in two text files (with a single sentence in each line) and outputs parallel sentences to *parallel-sentences-out.tsv.gz*. -- **[In-domain Data Selection for MT](https://www.clinjournal.org/clinj/article/view/137)** - This paper also employed S-BERT to generate/select in-domain parallel data for machine translation systems – using monolingual texts. +- **[In-domain Data Selection for MT](https://www.clinjournal.org/clinj/article/view/137)** - This paper also employed Sentence Transformers to generate/select in-domain parallel data for machine translation systems – using monolingual texts. diff --git a/examples/applications/paraphrase-mining/README.md b/examples/applications/paraphrase-mining/README.md index 02ae141eb..685d47ffa 100644 --- a/examples/applications/paraphrase-mining/README.md +++ b/examples/applications/paraphrase-mining/README.md @@ -1,54 +1,49 @@ # Paraphrase Mining -Paraphrase mining is the task of finding paraphrases (texts with identical / similar meaning) in a large corpus of sentences.
In [Semantic Textual Similarity](../../../docs/usage/semantic_textual_similarity.md) we saw a simplified version of finding paraphrases in a list of sentences. The approach presented there used a brute-force approach to score and rank all pairs. +Paraphrase mining is the task of finding paraphrases (texts with identical / similar meaning) in a large corpus of sentences. In [Semantic Textual Similarity](../../../docs/sentence_transformer/usage/semantic_textual_similarity.rst) we saw a simplified version of finding paraphrases in a list of sentences. The approach presented there used a brute-force approach to score and rank all pairs. -However, as this has a quadratic runtime, it fails to scale to large (10,000 and more) collections of sentences. +```eval_rst +However, as this has a quadratic runtime, it fails to scale to large (10,000 and more) collections of sentences. For larger collections, the :func:`~sentence_transformers.util.paraphrase_mining` function can be used:: -For larger collections, *util* offers the *paraphrase_mining* function that can be used like this: -```python -from sentence_transformers import SentenceTransformer, util + from sentence_transformers import SentenceTransformer + from sentence_transformers.util import paraphrase_mining -model = SentenceTransformer("all-MiniLM-L6-v2") + model = SentenceTransformer("all-MiniLM-L6-v2") -# Single list of sentences - Possible tens of thousands of sentences -sentences = [ - "The cat sits outside", - "A man is playing guitar", - "I love pasta", - "The new movie is awesome", - "The cat plays in the garden", - "A woman watches TV", - "The new movie is so great", - "Do you like pizza?", -] + # Single list of sentences - Possible tens of thousands of sentences + sentences = [ + "The cat sits outside", + "A man is playing guitar", + "I love pasta", + "The new movie is awesome", + "The cat plays in the garden", + "A woman watches TV", + "The new movie is so great", + "Do you like pizza?", + ] -paraphrases = util.paraphrase_mining(model, sentences) + paraphrases = paraphrase_mining(model, sentences) -for paraphrase in paraphrases[0:10]: - score, i, j = paraphrase - print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score)) -``` + for paraphrase in paraphrases[0:10]: + score, i, j = paraphrase + print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score)) -The **paraphrase_mining()**-method accepts the following parameters: -```eval_rst -.. autofunction:: sentence_transformers.util.paraphrase_mining -``` +The :func:`~sentence_transformers.util.paraphrase_mining` function accepts the following parameters: -Instead of computing all pairwise cosine scores and ranking all possible, combinations, the approach is a bit more complex (and hence efficient). We chunk our corpus into smaller pieces, which is defined by *query_chunk_size* and *corpus_chunk_size*. For example, if we set *query_chunk_size=1000*, we search paraphrases for 1,000 sentences at a time in the remaining corpus (all other sentences). However, the remaining corpus is also chunked, for example, if we set *corpus_chunk_size=10000*, we look for paraphrases in 10k sentences at a time. - -If we pass a list of 20k sentences, we will chunk it to 20x1000 sentences, and each of the query is compared first against sentences 0-10k and then 10k-20k. +.. autofunction:: sentence_transformers.util.paraphrase_mining -This is done to reduce the memory requirement. Increasing both values improves the speed, but increases also the memory requirement.
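To make the chunking idea concrete, here is a simplified sketch of such a chunked scoring loop (an illustration only, not the actual implementation of `paraphrase_mining`; it assumes a 2D `numpy` array of normalized embeddings):

```python
import numpy as np

def chunked_cosine_scores(embeddings: np.ndarray, query_chunk_size: int = 1000, corpus_chunk_size: int = 10000):
    # Only one (query_chunk_size, corpus_chunk_size) score matrix is held in
    # memory at a time, instead of the full (n, n) matrix of all pairwise scores.
    n = len(embeddings)
    for q_start in range(0, n, query_chunk_size):
        query_chunk = embeddings[q_start : q_start + query_chunk_size]
        for c_start in range(0, n, corpus_chunk_size):
            corpus_chunk = embeddings[c_start : c_start + corpus_chunk_size]
            # Cosine scores for this block, assuming normalized embeddings
            yield q_start, c_start, query_chunk @ corpus_chunk.T
```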
+To optimize memory and computation time, paraphrase mining is performed in chunks, as specified by ``query_chunk_size`` and ``corpus_chunk_size``. +To be specific, only ``query_chunk_size * corpus_chunk_size`` pairs will be compared at a time, rather than ``len(sentences) * len(sentences)``. This is more time- and memory-efficient. Additionally, :func:`~sentence_transformers.util.paraphrase_mining` only considers the ``top_k`` best scores per sentence per chunk. You can experiment with this value as an efficiency-performance trade-off. +For example, in the following script, you will get only the single most relevant other sentence for each sentence. -The next critical thing is finding the pairs with the highest similarities. Instead of getting and sorting all n^2 pairwise scores, we take for each query only the *top_k* scores. So with *top_k=100*, we find at most 100 paraphrases per sentence per chunk. You can play around with *top_k* to the ensure a certain behaviour. +:: -So for example, with -```python -paraphrases = util.paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1) -``` + paraphrases = paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1) -You will get for each sentence only the one most other relevant sentence. Note, if B is the most similar sentence for A, A must not be the most similar sentence for B. So it can happen that the returned list contains entries like (A, B) and (B, C). +The final key parameter is ``max_pairs``, which determines the maximum number of paraphrase pairs that the function returns. Usually, you get fewer pairs returned because the list is cleaned of duplicates, e.g., if it contains (A, B) and (B, A), then only one is returned. -The final relevant parameter is *max_pairs*, which determines the maximum number of paraphrase pairs you like to get returned. If you set it to e.g. *max_pairs=100*, you will not get more than 100 paraphrase pairs returned. Usually, you get fewer pairs returned as the list is cleaned of duplicates, e.g., if it contains (A, B) and (B, A), then only one is returned. - +.. note:: + + If B is the most similar sentence for A, A is not necessarily the most similar sentence for B. So it can happen that the returned list contains entries like (A, B) and (B, C). +``` \ No newline at end of file diff --git a/examples/applications/retrieve_rerank/README.md b/examples/applications/retrieve_rerank/README.md index d6d50de95..5b73db7ed 100644 --- a/examples/applications/retrieve_rerank/README.md +++ b/examples/applications/retrieve_rerank/README.md @@ -1,41 +1,33 @@ # Retrieve & Re-Rank In [Semantic Search](../semantic-search/README.md) we have shown how to use SentenceTransformer to compute embeddings for queries, sentences, and paragraphs and how to use this for semantic search. -For complex search tasks, for example, for question answering retrieval, the search can significantly be improved by using **Retrieve & Re-Rank**. +For complex search tasks, for example question answering retrieval, the search can be significantly improved by using **Retrieve & Re-Rank**. ## Retrieve & Re-Rank Pipeline -A pipeline for information retrieval / question answering retrieval that works well is the following. All components are provided and explained in this article: +The following pipeline for Information Retrieval / Question Answering Retrieval works very well.
All components are provided and explained in this article: ![InformationRetrieval](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png) -Given a search query, we first use a **retrieval system** that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query. For the retrieval, we can use either lexical search, e.g. with Elasticsearch, or we can use dense retrieval with a bi-encoder. - -However, the retrieval system might retrieve documents that are not that relevant for the search query. Hence, in a second stage, we use a **re-ranker** based on a **cross-encoder** that scores the relevancy of all candidates for the given search query. - -The output will be a ranked list of hits we can present to the user. +Given a search query, we first use a **retrieval system** that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query. For the retrieval, we can use either lexical search, e.g. with a search engine like Elasticsearch, or we can use dense retrieval with a bi-encoder. However, the retrieval system might retrieve documents that are not that relevant for the search query. Hence, in a second stage, we use a **re-ranker** based on a **cross-encoder** that scores the relevancy of all candidates for the given search query. The output will be a ranked list of hits we can present to the user. ## Retrieval: Bi-Encoder -For the retrieval of the candidate set, we can either use lexical search (e.g. [Elasticsearch](https://www.elastic.co/elasticsearch/)), or we can use a bi-encoder which is implemented in this repository. +For the retrieval of the candidate set, we can either use lexical search (e.g. [Elasticsearch](https://www.elastic.co/elasticsearch/)), or we can use a bi-encoder which is implemented in Sentence Transformers. Lexical search looks for literal matches of the query words in your document collection. It will not recognize synonyms, acronyms or spelling variations. In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space. ![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png) -Semantic search overcomes the short comings of lexical search and can recognize synonym and acronyms. Have a look at the [semantic search article](../semantic-search/README.md) for different options to implement semantic search. +Semantic search overcomes the shortcomings of lexical search and can recognize synonyms and acronyms. Have a look at the [semantic search article](../semantic-search/README.md) for different options to implement semantic search. ## Re-Ranker: Cross-Encoder -The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates. - -A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a possible document is passed simultaneously to transformer network, which then outputs a single score between 0 and 1 indicating how relevant the document is for the given query. +The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates. A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a possible document are passed simultaneously to the transformer network, which then outputs a single score between 0 and 1 indicating how relevant the document is for the given query.
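As a minimal sketch of this scoring step (the query and passages are made up; the model is one of the MS MARCO cross-encoders referenced at the end of this page):

```python
from sentence_transformers import CrossEncoder

# The cross-encoder reads the query and a candidate passage together and
# returns one relevance score per (query, passage) pair
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How many people live in Berlin?"
passages = [
    "Berlin has around 3.7 million registered inhabitants.",
    "Berlin is well known for its museums.",
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)  # higher score = more relevant to the query
```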
![CrossEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png) -The advantage of Cross-Encoders is the higher performance, as they perform attention across the query and the document. - -Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder. +The advantage of Cross-Encoders is the higher performance, as they perform attention across the query and the document. Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder. ## Example Scripts @@ -50,7 +42,7 @@ The bi-encoder produces embeddings independently for your paragraphs and for you ```python from sentence_transformers import SentenceTransformer -model = SentenceTransformer("model_name") +model = SentenceTransformer("multi-qa-mpnet-base-dot-v1") docs = [ "My first paragraph. That contains information", @@ -65,9 +57,8 @@ query_embedding = model.encode(query) For more details on how to compare the embeddings, see [semantic search](../semantic-search/README.md). We provide pre-trained models based on: -- **MS MARCO:** 500k real user queries from Bing search engine. See [MS MARCO models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) +- **MS MARCO:** 500k real user queries from the Bing search engine. See [MS MARCO models](../../../docs/pretrained-models/msmarco-v3.html) ## Pre-trained Cross-Encoders (Re-Ranker) - -For pre-trained models, see: [MS MARCO Cross-Encoders](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html) +For pre-trained Cross Encoder models, see: [MS MARCO Cross-Encoders](../../../docs/pretrained-models/ce-msmarco.html) diff --git a/examples/applications/semantic-search/README.md b/examples/applications/semantic-search/README.md index 77a539039..ba29a8ac3 100644 --- a/examples/applications/semantic-search/README.md +++ b/examples/applications/semantic-search/README.md @@ -1,50 +1,65 @@ # Semantic Search -Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines which only find documents based on lexical matches, semantic search can also find synonyms. - +Semantic search seeks to improve search accuracy by understanding the semantic meaning of the search query and the corpus to search over. Semantic search can also perform well given synonyms, abbreviations, and misspellings, unlike keyword search engines that can only find documents based on lexical matches. ## Background -The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space. - -At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. +The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found.
These entries should have a high semantic similarity with the query. ![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png) - ## Symmetric vs. Asymmetric Semantic Search A **critical distinction** for your setup is *symmetric* vs. *asymmetric semantic search*: -- For **symmetric semantic search** your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be *"How to learn Python online?"* and you want to find an entry like *"How to learn Python on the web?"*. For symmetric tasks, you could potentially flip the query and the entries in your corpus. +- For **symmetric semantic search** your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be *"How to learn Python online?"* and you want to find an entry like *"How to learn Python on the web?"*. For symmetric tasks, you could potentially flip the query and the entries in your corpus. + - Related training example: [Quora Duplicate Questions](../../training/quora_duplicate_questions/README.md). + - Suitable models: [Pre-Trained Sentence Embedding Models](../../../docs/sentence_transformer/pretrained_models#sentence-embedding-models) - For **asymmetric semantic search**, you usually have a **short query** (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like *"What is Python"* and you want to find the paragraph *"Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy ..."*. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense. + - Related training example: [MS MARCO](../../training/ms_marco/README.html) + - Suitable models: [Pre-Trained MS MARCO Models](../../../docs/pretrained-models/msmarco-v3.md) It is critical **that you choose the right model** for your type of task. -Suitable models for **symmetric semantic search**: [Pre-Trained Sentence Embedding Models](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) - - -Suitable models for **asymmetric semantic search**: [Pre-Trained MS MARCO Models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) - - - -## Python - -For small corpora (up to about 1 million entries) we can compute the cosine-similarity between the query and all entries in the corpus. - -In the following example, we define a small corpus with few example sentences and compute the embeddings for the corpus as well as for our query. - -We then use the [util.cos_sim()](../../../docs/usage/semantic_textual_similarity.md) function to compute the cosine similarity between the query and all corpus entries. - -For large corpora, sorting all scores would take too much time. Hence, we use [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html) to only get the top k entries. 
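Condensed to its core, this manual approach looks roughly as follows (a sketch along the lines of [semantic_search.py](semantic_search.py), with a shortened corpus):

```python
import torch
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

query_embedding = embedder.encode("A man is eating pasta.", convert_to_tensor=True)
# Cosine similarity between the query and every corpus entry
similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]

# For large corpora, sorting every score is wasteful; take only the top-k entries
scores, indices = torch.topk(similarity_scores, k=2)
for score, idx in zip(scores, indices):
    print(corpus[idx], f"(Score: {score:.4f})")
```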
+## Manual Implementation +For small corpora (up to about 1 million entries), we can perform semantic search with a manual implementation by computing the embeddings for the corpus as well as for our query, and then calculating the [semantic textual similarity](../../../docs/sentence_transformer/usage/semantic_textual_similarity.rst) using [SentenceTransformer.similarity](../../../docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity). For a simple example, see [semantic_search.py](semantic_search.py): ```eval_rst + +.. sidebar:: Output + + .. code-block:: txt + + Query: A man is eating pasta. + Top 5 most similar sentences in corpus: + A man is eating food. (Score: 0.7035) + A man is eating a piece of bread. (Score: 0.5272) + A man is riding a horse. (Score: 0.1889) + A man is riding a white horse on an enclosed ground. (Score: 0.1047) + A cheetah is running behind its prey. (Score: 0.0980) + + Query: Someone in a gorilla costume is playing a set of drums. + Top 5 most similar sentences in corpus: + A monkey is playing drums. (Score: 0.6433) + A woman is playing violin. (Score: 0.2564) + A man is riding a horse. (Score: 0.1389) + A man is riding a white horse on an enclosed ground. (Score: 0.1191) + A cheetah is running behind its prey. (Score: 0.1080) + + Query: A cheetah chases prey on across a field. + Top 5 most similar sentences in corpus: + A cheetah is running behind its prey. (Score: 0.8253) + A man is eating food. (Score: 0.1399) + A monkey is playing drums. (Score: 0.1292) + A man is riding a white horse on an enclosed ground. (Score: 0.1097) + A man is riding a horse. (Score: 0.0650) + .. literalinclude:: semantic_search.py ``` ## Optimized Implementation -Instead of implementing semantic search by yourself, you can use the *util.semantic_search* function. +Instead of implementing semantic search by yourself, you can use the [util.semantic_search](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) function. The function accepts the following parameters: @@ -52,12 +67,10 @@ The function accepts the following parameters: .. autofunction:: sentence_transformers.util.semantic_search ``` -By default, up to 100 queries are processed in parallel. Further, the corpus is chunked into set of up to 500k entries. You can increase *query_chunk_size* and *corpus_chunk_size*, which leads to increased speed for large corpora, but also increases the memory requirement. +By default, up to 100 queries are processed in parallel. Further, the corpus is chunked into sets of up to 500k entries. You can increase ``query_chunk_size`` and ``corpus_chunk_size``, which leads to increased speed for large corpora, but also increases the memory requirement. ## Speed Optimization -To get the optimal speed for the `util.semantic_search` method, it is advisable to have the `query_embeddings` as well as the `corpus_embeddings` on the same GPU-device. This significantly boost the performance. - -Further, we can normalize the corpus embeddings so that each corpus embeddings is of length 1. In that case, we can use dot-product for computing scores. +To get the optimal speed for the [util.semantic_search](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method, it is advisable to have the `query_embeddings` as well as the `corpus_embeddings` on the same GPU-device. This significantly boosts the performance.
Further, we can normalize the corpus embeddings so that each corpus embedding is of length 1. In that case, we can use dot-product for computing scores. ```python corpus_embeddings = corpus_embeddings.to("cuda") corpus_embeddings = util.normalize_embeddings(corpus_embeddings) @@ -67,9 +80,6 @@ query_embeddings = util.normalize_embeddings(query_embeddings) hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score) ``` - - - ## Elasticsearch [Elasticsearch](https://www.elastic.co/elasticsearch/) has the possibility to [index dense vectors](https://www.elastic.co/what-is/vector-search) and to use them for document scoring. We can easily index embedding vectors, store other data alongside our vectors and, most importantly, efficiently retrieve relevant entries using [approximate nearest neighbor search](https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0) (HNSW, see also below) on the embeddings. For further details, see [semantic_search_quora_elasticsearch.py](semantic_searc ## Approximate Nearest Neighbor -Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (like it is used by *util.semantic_search*). +Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (like it is used by [util.semantic_search](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search)). -In that case, Approximate Nearest Neighbor (ANN) can be helpful. Here, the data is partitioned into smaller fractions of similar embeddings. This index can be searched efficiently and the embeddings with the highest similarity (the nearest neighbors) can be retrieved within milliseconds, even if you have millions of vectors. - -However, the results are not necessarily exact. It is possible that some vectors with high similarity will be missed. That's the reason why it is called approximate nearest neighbor. +In that case, Approximate Nearest Neighbor (ANN) can be helpful. Here, the data is partitioned into smaller fractions of similar embeddings. This index can be searched efficiently and the embeddings with the highest similarity (the nearest neighbors) can be retrieved within milliseconds, even if you have millions of vectors. However, the results are not necessarily exact. It is possible that some vectors with high similarity will be missed. For all ANN methods, there are usually one or more parameters to tune that determine the recall-speed trade-off. If you want the highest speed, you have a high chance of missing hits. If you want high recall, the search speed decreases. Three popular libraries for approximate nearest neighbor are [Annoy](https://github.com/spotify/annoy), [FAISS](https://github.com/facebookresearch/faiss), and [hnswlib](https://github.com/nmslib/hnswlib/).
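For instance, a minimal hnswlib-based search might look as follows (a sketch with made-up data; the example scripts below show complete versions):

```python
import hnswlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["A man is eating food.", "A monkey is playing drums.", "A woman is playing violin."]
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

# Build the HNSW index; ef_construction and M trade index quality against build time
index = hnswlib.Index(space="cosine", dim=corpus_embeddings.shape[1])
index.init_index(max_elements=len(corpus), ef_construction=200, M=64)
index.add_items(corpus_embeddings)
index.set_ef(50)  # query-time recall/speed trade-off (higher = better recall)

query_embedding = model.encode("A man is eating pasta.", normalize_embeddings=True)
corpus_ids, distances = index.knn_query(query_embedding, k=3)
```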
Examples: + - [semantic_search_quora_hnswlib.py](semantic_search_quora_hnswlib.py) - [semantic_search_quora_annoy.py](semantic_search_quora_annoy.py) - [semantic_search_quora_faiss.py](semantic_search_quora_faiss.py) ## Retrieve & Re-Rank -For complex semantic search scenarios, a retrieve & re-rank pipeline is advisable: +For complex semantic search scenarios, a two-stage retrieve & re-rank pipeline is advisable: ![InformationRetrieval](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png) For further details, see [Retrieve & Re-rank](../retrieve_rerank/README.md). ## Examples -In the following we list examples for different use-cases. +We list a handful of common use cases: ### Similar Questions Retrieval -[semantic_search_quora_pytorch.py](semantic_search_quora_pytorch.py) [ [Colab version](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing) ] shows an example based on the [Quora duplicate questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using the *util.semantic_search* method. As model, we use *distilbert-multilingual-nli-stsb-quora-ranking*, which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries have the same length and content as the questions in the corpus. +[semantic_search_quora_pytorch.py](semantic_search_quora_pytorch.py) [ [Colab version](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing) ] shows an example based on the [Quora duplicate questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using the [util.semantic_search](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method. As model, we use [distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking), which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries have the same length and content as the questions in the corpus. ### Similar Publication Retrieval -[semantic_search_publications.py](semantic_search_publications.py) [ [Colab version](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06?usp=sharing) ] shows an example how to find similar scientific publications. As corpus, we use all publications that have been presented at the EMNLP 2016 - 2018 conferences. As search query, we input the title and abstract of more recent publications and find related publications from our copurs. We use the [SPECTER](https://arxiv.org/abs/2004.07180) model. This is a **symmetric search task**, as the paper in the corpus consists of title & abstract and we search for title & abstract. +[semantic_search_publications.py](semantic_search_publications.py) [ [Colab version](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06?usp=sharing) ] shows an example how to find similar scientific publications. 
As corpus, we use all publications that have been presented at the EMNLP 2016 - 2018 conferences. As search query, we input the title and abstract of more recent publications and find related publications from our corpus. We use the [SPECTER](https://huggingface.co/sentence-transformers/allenai-specter) model. This is a **symmetric search task**, as the paper in the corpus consists of title & abstract and we search for title & abstract. ### Question & Answer Retrieval -[semantic_search_wikipedia_qa.py](semantic_search_wikipedia_qa.py) [ [Colab Version](https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing) ]: This example uses a model that was trained on the [Natural Questions dataset](https://ai.google.com/research/NaturalQuestions/). It consists of about 100k real Google search queries, together with an annotated passage from Wikipedia that provides the answer. It is an example of an **asymmetric search task**. As corpus, we use the smaller [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) so that it fits easily into memory. +[semantic_search_wikipedia_qa.py](semantic_search_wikipedia_qa.py) [ [Colab Version](https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing) ]: This example uses a model that was trained on the [Natural Questions dataset](https://huggingface.co/datasets/sentence-transformers/natural-questions). It consists of about 100k real Google search queries, together with an annotated passage from Wikipedia that provides the answer. It is an example of an **asymmetric search task**. As corpus, we use the smaller [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) so that it fits easily into memory. [retrieve_rerank_simple_wikipedia.ipynb](../retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) [ [Colab Version](https://colab.research.google.com/github/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) ]: This script uses the [Retrieve & Re-rank](../retrieve_rerank/README.md) strategy and is an example for an **asymmetric search task**. We split all Wikipedia articles into paragraphs and encode them with a bi-encoder. If a new query / question is entered, it is encoded by the same bi-encoder and the paragraphs with the highest cosine-similarity are retrieved (see [semantic search](../semantic-search/README.md)).
Next, the retrieved candidates are scored by a Cross-Encoder re-ranker and the 5 passages with the highest score from the Cross-Encoder are presented to the user. We use models that were trained on the [MS Marco Passage Reranking](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset, a dataset with about 500k real queries from Bing search. diff --git a/examples/applications/semantic-search/semantic_search.py b/examples/applications/semantic-search/semantic_search.py index 5b0e3ad62..80f9e9986 100644 --- a/examples/applications/semantic-search/semantic_search.py +++ b/examples/applications/semantic-search/semantic_search.py @@ -24,6 +24,7 @@ "A monkey is playing drums.", "A cheetah is running behind its prey.", ] +# Use "convert_to_tensor=True" to keep the tensors on GPU (if available) corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True) # Query sentences: @@ -33,7 +34,6 @@ "A cheetah chases prey on across a field.", ] - # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity top_k = min(5, len(corpus)) for query in queries: @@ -41,13 +41,12 @@ # We use cosine-similarity and torch.topk to find the highest 5 scores similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0] - top_results = torch.topk(similarity_scores, k=top_k) + scores, indices = torch.topk(similarity_scores, k=top_k) - print("\n\n======================\n\n") - print("Query:", query) - print("\nTop 5 most similar sentences in corpus:") + print("\nQuery:", query) + print("Top 5 most similar sentences in corpus:") - for score, idx in zip(top_results[0], top_results[1]): + for score, idx in zip(scores, indices): print(corpus[idx], "(Score: {:.4f})".format(score)) """ diff --git a/examples/domain_adaptation/README.md b/examples/domain_adaptation/README.md index d15cefada..a5deee55c 100644 --- a/examples/domain_adaptation/README.md +++ b/examples/domain_adaptation/README.md @@ -7,13 +7,13 @@ Domain adaptation is still an active research field and there exists no perfect ## Domain Adaptation vs. Unsupervised Learning There exist methods for [unsupervised text embedding learning](../unsupervised_learning/README.md), however, they generally perform rather badly: They are not really able to learn domain specific concepts. -A much better approach is domain adaptation: Here you have an unlabeled corpus from your specific domain together with an existing labeled corpus. You can find many suitable labeled training datasets here: [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) +A much better approach is domain adaptation: Here you have an unlabeled corpus from your specific domain together with an existing labeled corpus. You can find many suitable labeled training datasets here: [Embedding Model Datasets Collection](https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552) ## Adaptive Pre-Training -When using adaptive pre-training, you first pre-train on your target corpus using e.g. [Masked Language Modeling](../unsupervised_learning/MLM/README.md) or [TSDAE](../unsupervised_learning/TSDAE/README.md) and then you fine-tune on an existing training dataset (see [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data)). +When using adaptive pre-training, you first pre-train on your target corpus using e.g.
[Masked Language Modeling](../unsupervised_learning/MLM/README.md) or [TSDAE](../unsupervised_learning/TSDAE/README.md) and then you fine-tune on an existing training dataset (see [Embedding Model Datasets Collection](https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552)). -![Adaptive Pre-Training](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/adaptive_pre-training.png) +<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/adaptive_pre-training.png" alt="Adaptive Pre-Training" /> In our paper [TSDAE](https://arxiv.org/abs/2104.06979) we evaluated several methods for domain adaptation on 4 domain specific sentence embedding tasks: @@ -44,9 +44,9 @@ A big **disadvantage of adaptive pre-training** is the high computational overhe ## GPL: Generative Pseudo-Labeling -[GPL](https://arxiv.org/abs/2112.07577) overcomes the aforementioned issue: It can be applied on-top of a fine-tuned model. Hence, you can use one of the [pre-trained models](https://www.sbert.net/docs/pretrained_models.html) and adapt it to your specific domain: +[GPL](https://arxiv.org/abs/2112.07577) overcomes the aforementioned issue: It can be applied on top of a fine-tuned model. Hence, you can use one of the [pre-trained models](../../docs/sentence_transformer/pretrained_models.md) and adapt it to your specific domain: -![GPL_Overview](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_overview.png) +<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_overview.png" alt="GPL_Overview" /> The longer you train, the better your model gets. In our experiments, we were training the models for about 1 day on a V100-GPU. GPL can be combined with adaptive pre-training, which can give another performance boost. @@ -58,15 +58,16 @@ The longer you train, the better your model gets. In our experiments, we were tr GPL works in three phases: -![GPL Architecture](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_architecture.png) +<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_architecture.png" alt="GPL Architecture" /> - **Query Generation**: For a given text from our domain, we first use a T5 model that generates a possible query for the given text. E.g. when your text is *"Python is a high-level general-purpose programming language"*, the model might generate a query like *"What is Python"*. You can find various query generators on our [doc2query-hub](https://huggingface.co/doc2query). - **Negative Mining**: Next, for the generated query *"What is Python"* we mine negative passages from our corpus, i.e. passages that are similar to the query but which a user would not consider relevant. Such a negative passage could be *"Java is a high-level, class-based, object-oriented programming language."*. We do this mining using dense retrieval, i.e. we use one of the existing text embedding models and retrieve relevant paragraphs for the given query. -- **Pseudo Labeling**: It might be that in the negative mining step we retrieve a passage that is actually relevant for the query (like another definition for *"What is Python"*). To overcome this issue, we use a [Cross-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) to score all (query, passage)-pairs. -- **Training**: Once we have the triplets *(generated query, positive passage, mined negative passage)* and the Cross-Encoder scores for *(query, positive)* and *(query, negative)* we can start training the text embedding model using [MarginMSELoss](https://www.sbert.net/docs/package_reference/losses.html#marginmseloss).
+- **Pseudo Labeling**: It might be that in the negative mining step we retrieve a passage that is actually relevant for the query (like another definition for *"What is Python"*). To overcome this issue, we use a [Cross-Encoder](../applications/cross-encoder/README.html) to score all (query, passage)-pairs. +- **Training**: Once we have the triplets *(generated query, positive passage, mined negative passage)* and the Cross-Encoder scores for *(query, positive)* and *(query, negative)* we can start training the text embedding model using [MarginMSELoss](../../docs/package_reference/sentence_transformer/losses.html#marginmseloss). The **pseudo labeling** step is quite important, as it results in increased performance compared to the previous method, QGen, which treated passages just as positive (1) or negative (0). As we see in the following picture, for a generated query (*"what is futures contract"*), the negative mining step retrieves passages that are partly or highly relevant to the generated query. Using MarginMSELoss and the Cross-Encoder, we can identify these passages and teach the text embedding model that these passages are also relevant for the given query. + ![GPL Architecture](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_negatives.jpg) diff --git a/examples/evaluation/evaluation_inference_speed.py b/examples/evaluation/evaluation_inference_speed.py index 6e16cbd74..a91ec0067 100644 --- a/examples/evaluation/evaluation_inference_speed.py +++ b/examples/evaluation/evaluation_inference_speed.py @@ -22,9 +22,9 @@ # Load a sentence transformer model model = SentenceTransformer(model_name) -max_sentences = 10_000 -dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train") -sentences = list(set(dataset["anchor"] + dataset["positive"]))[:max_sentences] +max_sentences = 100_000 +all_nli_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train") +sentences = list(set(all_nli_dataset["anchor"]))[:max_sentences] print("Model Name:", model_name) print("Number of sentences:", len(sentences)) diff --git a/examples/training/README.md b/examples/training/README.md index fbba76048..cf48333e3 100644 --- a/examples/training/README.md +++ b/examples/training/README.md @@ -4,14 +4,22 @@ This folder contains various examples to fine-tune `SentenceTransformers` for sp To get started, I recommend having a look at the Semantic Textual Similarity ([STS](sts/)) or the Natural Language Inference ([NLI](nli/)) examples. -For the documentation how to train your own models, see [Training Overview](http://www.sbert.net/docs/training/overview.html). +For documentation on how to train your own models, see [Training Overview](http://www.sbert.net/docs/sentence_transformer/training_overview.html). ## Training Examples +- [adaptive_layer](adaptive_layer/) - Examples to train models whose layers can be removed on the fly for faster inference. - [avg_word_embeddings](avg_word_embeddings/) - This folder contains examples to train models based on classical word embeddings like GloVe. These models are extremely fast, but are more inaccurate than transformer-based models. +- [clip](clip/) - Examples to train CLIP image models. +- [cross-encoder](cross-encoder/) - Examples to train [CrossEncoder](http://www.sbert.net/docs/cross_encoder/usage/usage.html) models. +- [data_augmentation](data_augmentation/) - Examples of how to apply data augmentation strategies to improve embedding models.
- [distillation](distillation/) - Examples to make models smaller, faster and lighter. +- [hpo](hpo/) - Examples with hyperparameter search to find the best hyperparameters for your task. +- [matryoshka](matryoshka/) - Examples of training embedding models whose embeddings can be truncated (allowing for faster search) with minimal performance loss. +- [ms_marco](ms_marco/) - Example training scripts for training on the MS MARCO information retrieval dataset. - [multilingual](multilingual/) - Existing monolingual models can be extended to various languages ([paper](https://arxiv.org/abs/2004.09813)). This folder contains a step-by-step guide to extend existing models to new languages. - [nli](nli/) - Natural Language Inference (NLI) data can be quite helpful to pre-train and fine-tune models to create meaningful sentence embeddings. +- [other](other/) - Various tiny examples for showcasing one specific training case. +- [paraphrases](paraphrases/) - Examples for training models capable of recognizing paraphrases, i.e. understanding when texts have the same meaning despite using different words. - [quora_duplicate_questions](quora_duplicate_questions/) - Quora Duplicate Questions is a large corpus with duplicate questions from the Quora community. The folder contains examples of how to train models for duplicate question mining and for semantic search. - [sts](sts/) - The most basic method to train models is using Semantic Textual Similarity (STS) data. Here, we have a sentence pair and a score indicating the semantic similarity. -- [other](other/) - Various tiny examples for show-casing one specific training case. diff --git a/examples/training/adaptive_layer/adaptive_layer_sts.py b/examples/training/adaptive_layer/adaptive_layer_sts.py index b95ac63b2..c2ebdd6f4 100644 --- a/examples/training/adaptive_layer/adaptive_layer_sts.py +++ b/examples/training/adaptive_layer/adaptive_layer_sts.py @@ -43,8 +43,8 @@ logging.info(train_dataset) # 3. Define our training loss -# CoSENTLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one -# similarity score column (between 0 and 1) +# CoSENTLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) needs two text +# columns and one similarity score column (between 0 and 1) inner_train_loss = losses.CoSENTLoss(model=model) train_loss = losses.AdaptiveLayerLoss(model, inner_train_loss) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py index a6bb8fe79..a89cea13a 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py @@ -49,7 +49,7 @@ model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dan1, dan2]) # 3.
Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py index 4df3d8567..c2453ee07 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py @@ -44,7 +44,7 @@ model = SentenceTransformer(modules=[word_embedding_model, lstm, pooling_model]) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py index 951e006a1..3be966717 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py @@ -74,7 +74,7 @@ model = SentenceTransformer(modules=[bow, dan1, dan2]) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py index 07c743ac3..db8c4ee50 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py @@ -51,7 +51,7 @@ model = SentenceTransformer(modules=[word_embedding_model, cnn, pooling_model]) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py index f11657e8c..183894b07 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py @@ -77,7 +77,7 @@ model = SentenceTransformer(modules=[word_embedding_model, word_weights, pooling_model, dan1, dan2]) # 3. 
Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/datasets/README.md b/examples/training/datasets/README.md deleted file mode 100644 index fe39c421a..000000000 --- a/examples/training/datasets/README.md +++ /dev/null @@ -1,59 +0,0 @@ -# Training Datasets - -Most dataset configurations will take one of four forms: - -- **Case 1**: The example is a pair of sentences and a label indicating how similar they are. The label can be either an integer or a float. This case applies to datasets originally prepared for Natural Language Inference (NLI), since they contain pairs of sentences with a label indicating whether they infer each other or not. - **Case Example:** [SNLI](https://huggingface.co/datasets/snli). -- **Case 2**: The example is a pair of positive (similar) sentences **without** a label. For example, pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (`query`, `response`), or pairs of (`source_language`, `target_language`). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences. - **Case Examples:** [Sentence Compression](https://huggingface.co/datasets/embedding-data/sentence-compression), [COCO Captions](https://huggingface.co/datasets/embedding-data/coco_captions_quintets), [Flickr30k captions](https://huggingface.co/datasets/embedding-data/flickr30k_captions_quintets). -- **Case 3**: The example is a sentence with an integer label indicating the class to which it belongs. This data format is easily converted by loss functions into three sentences (triplets) where the first is an "anchor", the second a "positive" of the same class as the anchor, and the third a "negative" of a different class. - **Case Examples:** [TREC](https://huggingface.co/datasets/trec), [Yahoo Answers Topics](https://huggingface.co/datasets/yahoo_answers_topics). -- **Case 4**: The example is a triplet (anchor, positive, negative) without classes or labels for the sentences. - **Case Example:** [Quora Triplets](https://huggingface.co/datasets/embedding-data/QQP_triplets) - -Note that Sentence Transformers models can be trained with human labeling (cases 1 and 3) or with labels automatically deduced from text formatting (cases 2 and 4). - -You can get almost ready-to-train datasets from various sources. One of them is the Hugging Face Hub. - -## Datasets on the Hugging Face Hub - -The [Datasets library](https://huggingface.co/docs/datasets/index) (`pip install datasets`) allows you to load datasets from the Hugging Face Hub with the `load_dataset` function: - -```python -from datasets import load_dataset - -# Indicate the repo id from the Hub -dataset_id = "embedding-data/QQP_triplets" - -dataset = load_dataset(dataset_id) -``` - -For more information on how to manipulate your dataset see [» Datasets Documentation](https://huggingface.co/docs/datasets/access). - -These are popular datasets used to train and fine-tune SentenceTransformers models. 
-
-| | Dataset |
-| - | --------------------------------------------------------------------------------------------------------- |
-| | [altlex pairs](https://huggingface.co/datasets/embedding-data/altlex) |
-| | [sentence compression pairs](https://huggingface.co/datasets/embedding-data/sentence-compression) |
-| | [QQP triplets](https://huggingface.co/datasets/embedding-data/QQP_triplets) |
-| | [PAQ pairs](https://huggingface.co/datasets/embedding-data/PAQ_pairs) |
-| | [SPECTER triplets](https://huggingface.co/datasets/embedding-data/SPECTER) |
-| | [Amazon QA pairs](https://huggingface.co/datasets/embedding-data/Amazon-QA) |
-| | [Simple Wiki pairs](https://huggingface.co/datasets/embedding-data/simple-wiki) |
-| | [Wiki Answers equivalent sentences](https://huggingface.co/datasets/embedding-data/WikiAnswers) |
-| | [COCO Captions quintets](https://huggingface.co/datasets/embedding-data/coco_captions_quintets) |
-| | [Flickr30k Captions quintets](https://huggingface.co/datasets/embedding-data/flickr30k_captions_quintets) |
-| | [MS Marco](https://huggingface.co/datasets/ms_marco) |
-| | [GOOAQ](https://huggingface.co/datasets/gooaq) |
-| | [MS Marco](https://huggingface.co/datasets/ms_marco) |
-| | [Yahoo Answers topics](https://huggingface.co/datasets/yahoo_answers_topics) |
-| | [Search QA](https://huggingface.co/datasets/search_qa) |
-| | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml ) |
-| | [ELI5](https://huggingface.co/datasets/eli5) |
-| | [MultiNLI](https://huggingface.co/datasets/multi_nli) |
-| | [SNLI](https://huggingface.co/datasets/snli) |
-| | [S2ORC](https://huggingface.co/datasets/s2orc) |
-| | [Trivia QA](https://huggingface.co/datasets/trivia_qa) |
-| | [Code Search Net](https://huggingface.co/datasets/code_search_net) |
-| | [Natural Questions](https://huggingface.co/datasets/natural_questions) |
diff --git a/examples/training/distillation/README.md b/examples/training/distillation/README.md
index 37d7d9860..901804ba3 100644
--- a/examples/training/distillation/README.md
+++ b/examples/training/distillation/README.md
@@ -2,38 +2,40 @@
This folder contains examples to make SentenceTransformer models **faster, cheaper and lighter**. These light models achieve 97.5% - 100% performance of the original model on downstream tasks.

## Knowledge Distillation
-See: **[model_distillation.py](model_distillation.py)**
-
-Knowledge distillation describes the process to transfer knowledge from a teacher model to a student model. It can be used to extend sentence embeddings to new languages ([Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813)), but the traditional approach is to have slow (but well performing) teacher model and a fast student model.
+Knowledge distillation describes the process of transferring knowledge from a teacher model to a student model. It can be used to extend sentence embeddings to new languages ([Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813)), but the traditional approach is to have a slow (but well performing) teacher model and a fast student model.

The fast student model imitates the teacher model and thereby achieves high performance.
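+In essence, the student is simply trained to regress the teacher's embeddings. A minimal sketch of this idea with the `SentenceTransformerTrainer` (the model names and dataset below are illustrative placeholders, not the exact setup of the example scripts):
+
+```python
+from datasets import Dataset, load_dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+
+# Placeholder models: any strong teacher and a smaller student with the
+# same embedding dimension (required for MSELoss) will do
+teacher_model = SentenceTransformer("stsb-roberta-base-v2")
+student_model = SentenceTransformer("distilbert-base-uncased")
+
+sentences = load_dataset("sentence-transformers/all-nli", "pair", split="train[:10000]")["anchor"]
+
+# MSELoss expects text columns plus a "label" column holding the target (teacher) embeddings
+train_dataset = Dataset.from_dict({
+    "sentence": sentences,
+    "label": teacher_model.encode(sentences).tolist(),
+})
+
+trainer = SentenceTransformerTrainer(
+    model=student_model,
+    train_dataset=train_dataset,
+    loss=losses.MSELoss(model=student_model),
+)
+trainer.train()
+```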
-![Knowledge Distillation](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/monolingual-distillation.png)
-
+<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/monolingual-distillation.png" alt="Knowledge Distillation" />

-**[model_distillation.py](model_distillation.py)** implements two options for creating the student model:
-1) Use a light transformer model like TinyBERT or BERT-Small to imitate the teacher.
-2) We take the teacher model and keep only certain layers, for example, only 4 layers.
+We implement two options for creating the student model:
+1) [model_distillation.py](model_distillation.py): Use a light transformer model like TinyBERT or BERT-Small to imitate the bigger teacher.
+2) [model_distillation_layer_reduction.py](model_distillation_layer_reduction.py): We take the teacher model and keep only certain layers, for example, only 4 layers.

-Option 2) works usually better, as we keep most of the weights from the teacher. In Option 1, we have to tune all
-weights in the student from scratch.
+Option 2) usually works better, as we keep most of the weights from the teacher. In Option 1, we have to tune all weights of the student from scratch.

## Speed - Performance Trade-Off
-Smaller models are faster, but show a (slightly) worse performance when evaluated on down stream tasks. To get an impression of this trade-off, we show some numbers of the *stsb-roberta-base* model with different number of layers:
+Smaller models are faster, but show (slightly) worse performance when evaluated on downstream tasks. To get an impression of this trade-off, we show some numbers for the [stsb-roberta-base](https://huggingface.co/sentence-transformers/stsb-roberta-base) model with different numbers of layers:

| Layers | STSbenchmark Performance | Performance Decrease | Speed (Sent. / Sec. on V100-GPU) |
| ---- |:----:|:----:|:----:|
| teacher: 12 | 85.44 | - | 2300 |
-| 8 | 85.54 | +0.1% | 3200 |
-| 6 | 85.23 | -0.2% | 4000 |
-| 4 | 84.92 | -0.6% | 5300 |
-| 3 | 84.39 | -1.2% |6500 |
-| 2 | 83.32 | -2.5% | 7700 |
-| 1 | 80.86 | -5.4%| 9200 |
+| 8 | 85.54 | +0.1% | 3200 (~1.4x) |
+| 6 | 85.23 | -0.2% | 4000 (~1.7x) |
+| 4 | 84.92 | -0.6% | 5300 (~2.3x) |
+| 3 | 84.39 | -1.2% | 6500 (~2.8x) |
+| 2 | 83.32 | -2.5% | 7700 (~3.3x) |
+| 1 | 80.86 | -5.4% | 9200 (~4.0x) |

## Dimensionality Reduction
-By default, the pretrained models output embeddings with size 768 (base-models) or with size 1024 (large-models). However, when you store Millions of embeddings, this can require quite a lot of memory / storage.
+
+```eval_rst
+.. warning::
+    Since writing this, `Embedding Quantization <../../applications/embedding-quantization/README.html>`_ has been introduced as the go-to approach for shrinking embedding sizes. Following `Thakur et al. `_, we recommend that approach over PCA.
+```
+
+By default, the pretrained models output embeddings with size 768 (base-models) or with size 1024 (large-models). However, when you store millions of embeddings, this can require quite a lot of memory / storage.

**[dimensionality_reduction.py](dimensionality_reduction.py)** contains a simple example of how to reduce the embedding dimension to any size using Principal Component Analysis (PCA). In that example, we reduce 768 dimensions to 128, reducing the storage requirement by a factor of 6. The performance only slightly drops from 85.44 to 84.96 on the STS benchmark dataset.
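For reference, the core of that example looks roughly like this (a hedged sketch: the model name and sentence source are placeholders, not the exact script):

```python
import torch
from datasets import load_dataset
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer("all-mpnet-base-v2")  # placeholder 768-dimensional model
train_sentences = load_dataset("sentence-transformers/stsb", split="train")["sentence1"]

# Fit PCA on a representative sample of embeddings
pca = PCA(n_components=128)
pca.fit(model.encode(train_sentences))

# Fold the learned projection into the model as a final Dense module
dense = models.Dense(
    in_features=model.get_sentence_embedding_dimension(),
    out_features=128,
    bias=False,
    activation_function=torch.nn.Identity(),
)
dense.linear.weight = torch.nn.Parameter(torch.tensor(pca.components_, dtype=torch.float32))
model.add_module("dense", dense)  # model.encode(...) now yields 128-dimensional embeddings
```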
@@ -47,3 +49,8 @@ A [quantized model](https://pytorch.org/docs/stable/quantization.html) executes
For models that are run on **CPUs**, this can yield 40% smaller models and a faster inference time: Depending on the CPU, speedups are between 15% and 400%.

Model quantization is (as of now) not supported for GPUs by PyTorch.

For an example, see [model_quantization.py](model_quantization.py)
+
+```eval_rst
+.. note::
+    The quantization support of Sentence Transformers is still being improved.
+```
\ No newline at end of file
diff --git a/examples/training/hpo/README.rst b/examples/training/hpo/README.rst
new file mode 100644
index 000000000..8e5a583bd
--- /dev/null
+++ b/examples/training/hpo/README.rst
@@ -0,0 +1,217 @@
+
+Hyperparameter Optimization
+===========================
+
+The :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` supports hyperparameter optimization using ``transformers``, which in turn supports four hyperparameter search backends: `optuna `_, `sigopt `_, `raytune `_, and `wandb `_. You should install your backend of choice before using it::
+
+    pip install optuna/sigopt/wandb/ray[tune]
+
+On this page, we'll show you how to use the hyperparameter optimization feature with the ``optuna`` backend. The other backends work similarly, but you should refer to their respective documentation or the `transformers HPO documentation `_ for more information.
+
+HPO Components
+--------------
+
+The hyperparameter optimization process consists of the following components:
+
+.. raw:: html
+
+
+
+Hyperparameter Search Space
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The hyperparameter search space is defined by a function that returns a dictionary of hyperparameters and their respective search spaces. Here's an example of an ``optuna`` search space function that defines the hyperparameters for a ``SentenceTransformer`` model::
+
+    def hpo_search_space(trial):
+        return {
+            "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2),
+            "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128),
+            "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3),
+            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
+        }
+
+Model Initialization
+~~~~~~~~~~~~~~~~~~~~
+
+The model initialization function is a function that takes the hyperparameters of the current "trial" as input and returns a ``SentenceTransformer`` model. Generally, this function is quite simple. Here's an example of a model initialization function::
+
+    def hpo_model_init(trial):
+        return SentenceTransformer("distilbert-base-uncased")
+
+Loss Initialization
+~~~~~~~~~~~~~~~~~~~
+
+The loss initialization function is a function that takes the model initialized for the current trial and returns a loss function. Here's an example of a loss initialization function::
+
+    def hpo_loss_init(model):
+        return losses.CosineSimilarityLoss(model)
+
+Compute Objective
+~~~~~~~~~~~~~~~~~
+
+The compute objective function is a function that takes the evaluation ``metrics`` and returns the float value to be minimized or maximized. Here's an example of a compute objective function::
+
+    def hpo_compute_objective(metrics):
+        return metrics["eval_sts-dev_spearman_cosine"]
+
+.. note::
+
+    The dictionary keys of ``metrics`` are all prepended with ``eval_``. Additionally, if you're interested in maximizing an evaluator metric, note that the key also contains the evaluator's ``name``. So, to optimize on ``spearman_cosine`` from an :class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator` which was initialized with ``name="sts-dev"``, you would use the key ``eval_sts-dev_spearman_cosine`` in your ``hpo_compute_objective``.
+
+    Another common option is to use ``eval_loss``.
+
+Putting It All Together
+------------------------
+
+You can perform HPO on any regular training loop, with the only difference being that you don't call :meth:`SentenceTransformerTrainer.train <sentence_transformers.trainer.SentenceTransformerTrainer.train>`, but :meth:`SentenceTransformerTrainer.hyperparameter_search <sentence_transformers.trainer.SentenceTransformerTrainer.hyperparameter_search>` instead. Here's an example of how to put it all together:
+
+.. sidebar:: Documentation
+
+    #. `sentence-transformers/all-nli <https://huggingface.co/datasets/sentence-transformers/all-nli>`_
+    #. :class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator`
+    #. `Hyperparameter Search Space <#hyperparameter-search-space>`_
+    #. `Model Initialization <#model-initialization>`_
+    #. `Loss Initialization <#loss-initialization>`_
+    #. `Compute Objective <#compute-objective>`_
+    #. :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`
+    #. :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`
+    #.
:meth:`~sentence_transformers.trainer.SentenceTransformerTrainer.hyperparameter_search` + +:: + + from sentence_transformers import losses + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments + from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction + from sentence_transformers.training_args import BatchSamplers + from datasets import load_dataset + + # 1. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli, only 10k train and 1k dev + train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]") + eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev[:1000]") + + # 2. Create an evaluator to perform useful HPO + stsb_eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") + dev_evaluator = EmbeddingSimilarityEvaluator( + sentences1=stsb_eval_dataset["sentence1"], + sentences2=stsb_eval_dataset["sentence2"], + scores=stsb_eval_dataset["score"], + main_similarity=SimilarityFunction.COSINE, + name="sts-dev", + ) + + # 3. Define the Hyperparameter Search Space + def hpo_search_space(trial): + return { + "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2), + "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128), + "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3), + "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True), + } + + # 4. Define the Model Initialization + def hpo_model_init(trial): + return SentenceTransformer("distilbert-base-uncased") + + # 5. Define the Loss Initialization + def hpo_loss_init(model): + return losses.MultipleNegativesRankingLoss(model) + + # 6. Define the Objective Function + def hpo_compute_objective(metrics): + """ + Valid keys are: 'eval_loss', 'eval_sts-dev_pearson_cosine', 'eval_sts-dev_spearman_cosine', + 'eval_sts-dev_pearson_manhattan', 'eval_sts-dev_spearman_manhattan', 'eval_sts-dev_pearson_euclidean', + 'eval_sts-dev_spearman_euclidean', 'eval_sts-dev_pearson_dot', 'eval_sts-dev_spearman_dot', + 'eval_sts-dev_pearson_max', 'eval_sts-dev_spearman_max', 'eval_runtime', 'eval_samples_per_second', + 'eval_steps_per_second', 'epoch' + + due to the evaluator that we're using. + """ + return metrics["eval_sts-dev_spearman_cosine"] + + # 7. Define the training arguments + args = SentenceTransformerTrainingArguments( + # Required parameter: + output_dir="checkpoints", + # Optional training parameters: + # max_steps=10000, # We might want to limit the number of steps for HPO + fp16=True, # Set to False if you get an error that your GPU can't run on FP16 + bf16=False, # Set to True if you have a GPU that supports BF16 + batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch + # Optional tracking/debugging parameters: + eval_strategy="no", # We don't need to evaluate/save during HPO + save_strategy="no", + logging_steps=10, + run_name="hpo", # Will be used in W&B if `wandb` is installed + ) + + # 8. Create the trainer with model_init rather than model + trainer = SentenceTransformerTrainer( + model=None, + args=args, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + evaluator=dev_evaluator, + model_init=hpo_model_init, + loss=hpo_loss_init, + ) + + # 9. 
Perform the HPO + best_trial = trainer.hyperparameter_search( + hp_space=hpo_search_space, + compute_objective=hpo_compute_objective, + n_trials=20, + direction="maximize", + backend="optuna", + ) + print(best_trial) + +:: + + [I 2024-05-17 15:10:47,844] Trial 0 finished with value: 0.7889856589698055 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 123, 'warmup_ratio': 0.07380948785410107, 'learning_rate': 2.686331417509812e-06}. Best is trial 0 with value: 0.7889856589698055. + [I 2024-05-17 15:12:13,283] Trial 1 finished with value: 0.7927780672090986 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 69, 'warmup_ratio': 0.2927897848007451, 'learning_rate': 5.885372118095137e-06}. Best is trial 1 with value: 0.7927780672090986. + [I 2024-05-17 15:12:43,896] Trial 2 finished with value: 0.7684829743509601 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 114, 'warmup_ratio': 0.0739429232666916, 'learning_rate': 7.344415188959276e-05}. Best is trial 1 with value: 0.7927780672090986. + [I 2024-05-17 15:14:49,730] Trial 3 finished with value: 0.7873032743147989 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 43, 'warmup_ratio': 0.15184370143796674, 'learning_rate': 9.703232080395476e-06}. Best is trial 1 with value: 0.7927780672090986. + [I 2024-05-17 15:15:39,597] Trial 4 finished with value: 0.7759251781929949 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 127, 'warmup_ratio': 0.263946220093495, 'learning_rate': 1.231454337152625e-06}. Best is trial 1 with value: 0.7927780672090986. + [I 2024-05-17 15:17:02,191] Trial 5 finished with value: 0.7964580509886684 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 34, 'warmup_ratio': 0.2276865359631089, 'learning_rate': 7.889007438884571e-06}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:18:55,559] Trial 6 finished with value: 0.7901878917859169 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 48, 'warmup_ratio': 0.23228838664572948, 'learning_rate': 2.883013292682523e-06}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:20:27,027] Trial 7 finished with value: 0.7935671067660925 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 62, 'warmup_ratio': 0.22061123927198237, 'learning_rate': 2.95413457610349e-06}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:22:23,147] Trial 8 finished with value: 0.7848123114933252 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 45, 'warmup_ratio': 0.23071701022961139, 'learning_rate': 9.793681667449783e-06}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:22:52,826] Trial 9 finished with value: 0.7909708416168918 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 121, 'warmup_ratio': 0.22440506724181647, 'learning_rate': 4.0744671365843346e-05}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:23:30,395] Trial 10 finished with value: 0.7928991732385567 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 89, 'warmup_ratio': 0.14607293301068847, 'learning_rate': 2.5557492055039498e-05}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:24:18,024] Trial 11 finished with value: 0.7991870087507459 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 66, 'warmup_ratio': 0.16886154348739527, 'learning_rate': 3.705926066938032e-06}. 
Best is trial 11 with value: 0.7991870087507459.
+    [I 2024-05-17 15:25:44,198] Trial 12 finished with value: 0.7923304174306207 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 33, 'warmup_ratio': 0.15953772535423974, 'learning_rate': 1.8076298025704224e-05}. Best is trial 11 with value: 0.7991870087507459.
+    [I 2024-05-17 15:26:20,739] Trial 13 finished with value: 0.8020260244040395 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 90, 'warmup_ratio': 0.18105202625281253, 'learning_rate': 5.513908793512551e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:26:57,783] Trial 14 finished with value: 0.7571110256860063 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 95, 'warmup_ratio': 0.00122391151793258, 'learning_rate': 1.0432486633629492e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:27:32,581] Trial 15 finished with value: 0.8009013936824717 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 101, 'warmup_ratio': 0.1761274711346081, 'learning_rate': 4.5918293464430035e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:28:05,850] Trial 16 finished with value: 0.8017668050806169 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 103, 'warmup_ratio': 0.10766501647726355, 'learning_rate': 5.0309795522333e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:28:37,393] Trial 17 finished with value: 0.7769412380909586 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 108, 'warmup_ratio': 0.1036610178950246, 'learning_rate': 1.7747598626081271e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:29:19,340] Trial 18 finished with value: 0.8011921300048339 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 80, 'warmup_ratio': 0.117014165550441, 'learning_rate': 1.238558867958792e-05}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:29:59,508] Trial 19 finished with value: 0.8027501854704168 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 84, 'warmup_ratio': 0.014601112207929548, 'learning_rate': 5.627813947769514e-06}. Best is trial 19 with value: 0.8027501854704168.
+
+    BestRun(run_id='19', objective=0.8027501854704168, hyperparameters={'num_train_epochs': 1, 'per_device_train_batch_size': 84, 'warmup_ratio': 0.014601112207929548, 'learning_rate': 5.627813947769514e-06}, run_summary=None)
+
+As you can see, the strongest hyperparameters reached **0.802** Spearman correlation on the STS (dev) benchmark. For context, training with the default training arguments (``per_device_train_batch_size=8``, ``learning_rate=5e-5``) results in **0.736**, and hyperparameters chosen based on experience (``per_device_train_batch_size=64``, ``learning_rate=2e-5``) result in **0.783** Spearman correlation. Consequently, HPO proved quite effective here in improving the model performance.
+
+Example Scripts
+---------------
+
+- `hpo_nli.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/hpo/hpo_nli.py>`_ - An example script that performs hyperparameter optimization on the AllNLI dataset.
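+
+Retraining with the Best Hyperparameters
+----------------------------------------
+
+After the search finishes, you will often want to train one final model with the winning configuration. This is not part of the example script, but a minimal sketch of that follow-up, reusing the objects from the full example above, could look like this::
+
+    # Copy the best hyperparameters into the training arguments
+    for hyperparameter, value in best_trial.hyperparameters.items():
+        setattr(args, hyperparameter, value)
+
+    model = hpo_model_init(None)
+    trainer = SentenceTransformerTrainer(
+        model=model,
+        args=args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        evaluator=dev_evaluator,
+        loss=hpo_loss_init(model),
+    )
+    trainer.train()
+    print(dev_evaluator(trainer.model))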
diff --git a/examples/training/hpo/hpo_nli.py b/examples/training/hpo/hpo_nli.py new file mode 100644 index 000000000..758224ec9 --- /dev/null +++ b/examples/training/hpo/hpo_nli.py @@ -0,0 +1,95 @@ +from sentence_transformers import losses +from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction +from sentence_transformers.training_args import BatchSamplers +from datasets import load_dataset + +# 1. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli, 10k samples +train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]") +eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev[:1000]") + +# 2. Create an evaluator to perform useful HPO +stsb_eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") +dev_evaluator = EmbeddingSimilarityEvaluator( + sentences1=stsb_eval_dataset["sentence1"], + sentences2=stsb_eval_dataset["sentence2"], + scores=stsb_eval_dataset["score"], + main_similarity=SimilarityFunction.COSINE, + name="sts-dev", +) + + +# 3. Define the Hyperparameter Search Space +def hpo_search_space(trial): + return { + "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2), + "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128), + "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3), + "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True), + } + + +# 4. Define the Model Initialization +def hpo_model_init(trial): + return SentenceTransformer("distilbert-base-uncased") + + +# 5. Define the Loss Initialization +def hpo_loss_init(model): + return losses.MultipleNegativesRankingLoss(model) + + +# 6. Define the Objective Function +def hpo_compute_objective(metrics): + """ + Valid keys are: 'eval_loss', 'eval_sts-dev_pearson_cosine', 'eval_sts-dev_spearman_cosine', + 'eval_sts-dev_pearson_manhattan', 'eval_sts-dev_spearman_manhattan', 'eval_sts-dev_pearson_euclidean', + 'eval_sts-dev_spearman_euclidean', 'eval_sts-dev_pearson_dot', 'eval_sts-dev_spearman_dot', + 'eval_sts-dev_pearson_max', 'eval_sts-dev_spearman_max', 'eval_runtime', 'eval_samples_per_second', + 'eval_steps_per_second', 'epoch' + + due to the evaluator that we're using. + """ + return metrics["eval_sts-dev_spearman_cosine"] + + +# 7. Define the training arguments +args = SentenceTransformerTrainingArguments( + # Required parameter: + output_dir="checkpoints", + # Optional training parameters: + # max_steps=10000, # We might want to limit the number of steps for HPO + fp16=True, # Set to False if you get an error that your GPU can't run on FP16 + bf16=False, # Set to True if you have a GPU that supports BF16 + batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch + # Optional tracking/debugging parameters: + eval_strategy="no", # We don't need to evaluate/save during HPO + save_strategy="no", + logging_steps=10, + run_name="hpo", # Will be used in W&B if `wandb` is installed +) + +# 8. Create the trainer with model_init rather than model +trainer = SentenceTransformerTrainer( + model=None, + args=args, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + evaluator=dev_evaluator, + model_init=hpo_model_init, + loss=hpo_loss_init, +) + +# 9. 
Perform the HPO
+best_trial = trainer.hyperparameter_search(
+    hp_space=hpo_search_space,
+    compute_objective=hpo_compute_objective,
+    n_trials=20,
+    direction="maximize",
+    backend="optuna",
+)
+print(best_trial)
+
+# Alternatively, to just train normally:
+# trainer.train()
+# print(dev_evaluator(trainer.model))
diff --git a/examples/training/ms_marco/README.md b/examples/training/ms_marco/README.md
index 31c3ae433..1ceac4c3d 100644
--- a/examples/training/ms_marco/README.md
+++ b/examples/training/ms_marco/README.md
@@ -1,31 +1,28 @@
# MS MARCO
[MS MARCO Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset to train models for information retrieval. It consists of about 500k real search queries from the Bing search engine with the relevant text passage that answers the query.

-This pages shows how to **train** models (Cross-Encoder and Sentence Embedding Models) on this dataset so that it can be used for searching text passages given queries (key words, phrases or questions).
+This page shows how to **train** Sentence Transformer models on this dataset so that they can be used for searching text passages given queries (key words, phrases or questions).

If you are interested in how to use these models, see [Application - Retrieve & Re-Rank](../../applications/retrieve_rerank/README.md).

-There are **pre-trained models** available, which you can directly use without the need of training your own models. For more information, see: [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) | [Pretrained Cross-Encoders](https://www.sbert.net/docs/pretrained_cross-encoders.html)
-
-
+There are **pre-trained models** available, which you can directly use without the need to train your own models. For more information, see: [Pretrained Models > MSMARCO Passage Models](../../../docs/sentence_transformer/pretrained_models.html#msmarco-passage-models).

## Bi-Encoder
-Cross-Encoder are only suitable for reranking a small set of passages. For retrieval of suitable documents from a large collection, we have to use a bi-encoder. The documents are independently encoded into fixed-sized embeddings. A query is embedded into the same vector space. Relevant documents can then be found by using dot-product.
+For retrieval of suitable documents from a large collection, we have to use a Sentence Transformer (a.k.a. bi-encoder) model. The documents are independently encoded into fixed-sized embeddings. A query is embedded into the same vector space. Relevant documents can then be found by using cosine similarity or dot-product.

![BiEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/BiEncoder.png)
-
-There are two strategies to **train an bi-encoder** on the MS MARCO dataset:
+This page describes two strategies to **train a bi-encoder** on the MS MARCO dataset:

### MultipleNegativesRankingLoss
- **Training code: [train_bi-encoder_mnrl.py](train_bi-encoder_mnrl.py)**
+**Training code: [train_bi-encoder_mnrl.py](train_bi-encoder_mnrl.py)**

-When we use [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss), we provide triplets: ``(query, positive_passage, negative_passage)`` where `positive_passage` is the relevant passage to the query and `negative_passage` is a non-relevant passage to the query.
-
-We compute the embeddings for all queries, positive passages, and negative passages in the corpus and then optimize the following objective: We want to have the `(query, positive_passage)` pair to be close in the vector space, while `(query, negative_passage)` should be distant in vector space.
+```eval_rst
+When we use :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, we provide triplets: ``(query, positive_passage, negative_passage)`` where ``positive_passage`` is the relevant passage to the query and ``negative_passage`` is a non-relevant passage to the query. We compute the embeddings for all queries, positive passages, and negative passages in the corpus and then optimize the following objective: The ``(query, positive_passage)`` pair must be close in the vector space, while ``(query, negative_passage)`` should be distant in vector space. To further improve the training, we use **in-batch negatives**:
+```

![MultipleNegativesRankingLoss](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/MultipleNegativeRankingLoss.png)

@@ -33,36 +30,38 @@ We embed all `queries`, `positive_passages`, and `negative_passages` into the ve
One way to **improve training** is to choose really good negatives, also known as **hard negatives**: The negative should look really similar to the positive passage, but it should not be relevant to the query.

-We find these hard negatives in the following way: We use existing retrieval systems (e.g. lexical search and other bi-encoder retrieval systems), and for each query we find the most relevant passages. We then use a powerful [Cross-Encoder](../../applications/cross-encoder/README.md) to score the found `(query, passage)` pairs. We provide scores for 160 million such pairs in our [msmarco-hard-negatives dataset](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives).
+We find these hard negatives in the following way: We use existing retrieval systems (e.g. lexical search and other bi-encoder retrieval systems), and for each query we find the most relevant passages. We then use a powerful [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) [Cross-Encoder](../../applications/cross-encoder/README.md) to score the found `(query, passage)` pairs. We provide scores for 160 million such pairs in our [MS MARCO Mined Triplet dataset collection](https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets-6644d6f1ff58c5103fe65f23).
+
+```eval_rst
+For :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, we must ensure that, in the triplet ``(query, positive_passage, negative_passage)``, the ``negative_passage`` is indeed not relevant for the query. The MS MARCO dataset is sadly **highly redundant**: even though there is on average only one passage marked as relevant for a query, it actually contains many passages that humans would consider relevant. We must ensure that these passages are **not passed as negatives**. We do this by enforcing a certain threshold on the CrossEncoder score difference between the relevant passage and the mined hard negatives. By default, we set a threshold of 3: If the ``(query, positive_passage)`` pair gets a score of 9 from the CrossEncoder, then we will only consider negatives with a score below 6 from the CrossEncoder. This threshold ensures that we actually use negatives in our triplets.
+``` -For MultipleNegativesRankingLoss, we must ensure that in the triplet `(query, positive_passage, negative_passage)` that the `negative_passage` is actually not relevant for the query. The MS MARCO dataset is sadly **highly redundant**, and even though that there is on average only one passage marked as relevant for a query, it actually contains many passages that humans would consider as relevant. We must ensure that these passages are **not passed as negatives**: We do this by ensuring a certain threshold in the CrossEncoder scores between the relevant passages and the mined hard negative. By default, we set a threshold of 3: If the `(query, positive_passage)` gets a score of 9 from the CrossEncoder, than we will only consider negatives with a score below 6 from the CrossEncoder. This threshold ensures that we actually use negatives in our triplets. +You can find this data by traversing to any of the datasets in the [MS MARCO Mined Triplet dataset collection](https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets-6644d6f1ff58c5103fe65f23) and using the ``triplet-hard`` subset. Across all datasets, this refers to 175.7 million triplets. The original data can be found [here](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives). Load some of it using: +```python +from datasets import load_dataset +train_dataset = load_dataset("sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1", "triplet-hard", split="train") +# Dataset({ +# features: ['query', 'positive', 'negative'], +# num_rows: 11662655 +# }) +print(train_dataset[0]) +# {'query': 'what are the liberal arts?', 'positive': 'liberal arts. 1. the academic course of instruction at a college intended to provide general knowledge and comprising the arts, humanities, natural sciences, and social sciences, as opposed to professional or technical subjects.', 'negative': "Rather than preparing students for a specific career, liberal arts programs focus on cultural literacy and hone communication and analytical skills. They often cover various disciplines, ranging from the humanities to social sciences. 1 Program Levels in Liberal Arts: Associate degree, Bachelor's degree, Master's degree."} +``` ### MarginMSE **Training code: [train_bi-encoder_margin-mse.py](train_bi-encoder_margin-mse.py)** -[MarginMSELoss](https://www.sbert.net/docs/package_reference/losses.html#marginmseloss) is based on the paper of [Hofstätter et al](https://arxiv.org/abs/2010.02666). As for MultipleNegativesRankingLoss, we have triplets: `(query, passage1, passage2)`. In contrast to MultipleNegativesRankingLoss, `passage1` and `passage2` do not have to be strictly positive/negative, both can be relevant or not relevant for a given query. - -We then compute the [Cross-Encoder](../../applications/cross-encoder/README.md) score for `(query, passage1)` and `(query, passage2)`. We provide scores for 160 million such pairs in our [msmarco-hard-negatives dataset](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives). We then compute the distance: `CE_distance = CEScore(query, passage1) - CEScore(query, passage2)` - -For our bi-encoder training, we encode `query`, `passage1`, and `passage2` into vector spaces and then measure the dot-product between `(query, passage1)` and `(query, passage2)`. 
Again, we measure the distance: `BE_distance = DotScore(query, passage1) - DotScore(query, passage2)` - -We then want to ensure that the distance predicted by the bi-encoder is close to the distance predicted by the cross-encoder, i.e., we optimize the mean-squared error (MSE) between `CE_distance` and `BE_distance`. - -An **advantage** of MarginMSELoss compared to MultipleNegativesRankingLoss is that we **don't require** a `positive` and `negative` passage. As mentioned before, MS MARCO is redundant, and many passages contain the same or similar content. With MarginMSELoss, we can train on two relevant passages without issues: In that case, the `CE_distance` will be smaller and we expect that our bi-encoder also puts both passages closer in the vector space. +```eval_rst +:class:`~sentence_transformers.losses.MarginMSELoss` is based on the paper of `Hofstätter et al `_. Like when training with :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, we can use triplets: ``(query, passage1, passage2)``. However, in contrast to :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, `passage1` and `passage2` do not have to be strictly positive/negative, both can be relevant or not relevant for a given query. -And **disadvantage** of MarginMSELoss is the slower training time: We need way more epochs to get good results. In MultipleNegativesRankingLoss, with a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query only against two passages. +We then compute the `Cross-Encoder <../../applications/cross-encoder/README.html>`_ score for ``(query, passage1)`` and ``(query, passage2)``. We provide scores for 160 million such pairs in our `msmarco-hard-negatives dataset `_. We then compute the distance: ``CE_distance = CEScore(query, passage1) - CEScore(query, passage2)``. -## Cross-Encoder -A [Cross-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) accepts both inputs, the query and the possible relevant passage and returns a score between 0 and 1 how relevant the passage is for the given query. +For our Sentence Transformer (e.g. bi-encoder) training, we encode ``query``, ``passage1``, and ``passage2`` into embeddings and then measure the dot-product between ``(query, passage1)`` and ``(query, passage2)``. Again, we measure the distance: ``BE_distance = DotScore(query, passage1) - DotScore(query, passage2)`` -![CrossEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png) +We then want to ensure that the distance predicted by the bi-encoder is close to the distance predicted by the cross-encoder, i.e., we optimize the mean-squared error (MSE) between ``CE_distance`` and ``BE_distance``. -Cross-Encoders are often used for **re-ranking:** Given a list with possible relevant passages for a query, for example retrieved from BM25 / Elasticsearch, the cross-encoder re-ranks this list so that the most relevant passages are the top of the result list. +An **advantage** of :class:`~sentence_transformers.losses.MarginMSELoss` compared to :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is that we **don't require** a ``positive`` and ``negative`` passage. As mentioned before, MS MARCO is redundant and many passages contain the same or similar content. 
With :class:`~sentence_transformers.losses.MarginMSELoss`, we can train on two relevant passages without issues: In that case, the ``CE_distance`` will be smaller and we expect that our bi-encoder also puts both passages closer in the vector space.

-To **train an cross-encoder** on the MS MARCO dataset, see:
-- **[train_cross-encoder_scratch.py](train_cross-encoder_scratch.py)** trains a cross-encoder from scratch using the provided data from the MS MARCO dataset.
-
-## Cross-Encoder Knowledge Distillation
-![](https://github.com/UKPLab/sentence-transformers/raw/master/docs/img/msmarco-training-ce-distillation.png)
-- **[train_cross-encoder_kd.py](train_cross-encoder_kd.py)** uses a knowledge distillation setup: [Hostätter et al.](https://arxiv.org/abs/2010.02666) trained an ensemble of 3 (large) models for the MS MARCO dataset and predicted the scores for various (query, passage)-pairs (50% positive, 50% negative). In this example, we use knowledge distillation with a small & fast model and learn the logits scores from the teacher ensemble. This yields performances comparable to large models, while being 18 times faster. \ No newline at end of file
+A **disadvantage** of :class:`~sentence_transformers.losses.MarginMSELoss` is its slower training time: we need far more epochs to get good results. With :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` and a batch size of 64, we compare one query against 128 passages. With :class:`~sentence_transformers.losses.MarginMSELoss`, we compare a query only against two passages.
+```
diff --git a/examples/training/ms_marco/cross_encoder_README.md b/examples/training/ms_marco/cross_encoder_README.md
new file mode 100644
index 000000000..d58d5bb02
--- /dev/null
+++ b/examples/training/ms_marco/cross_encoder_README.md
@@ -0,0 +1,22 @@
+# MS MARCO
+[MS MARCO Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset to train models for information retrieval. It consists of about 500k real search queries from the Bing search engine with the relevant text passage that answers the query.
+
+This page shows how to **train** Cross Encoder models on this dataset so that they can be used for searching text passages given queries (key words, phrases or questions).
+
+If you are interested in how to use these models, see [Application - Retrieve & Re-Rank](../../applications/retrieve_rerank/README.md).
+
+There are **pre-trained models** available, which you can directly use without the need to train your own models. For more information, see [Pretrained Cross-Encoders](../../../docs/cross_encoder/pretrained_models.html#ms-marco).
+
+## Cross-Encoder
+A [Cross-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) accepts both inputs, the query and a possibly relevant passage, and returns a score between 0 and 1 indicating how relevant the passage is for the given query (see the short scoring sketch below).
+
+![CrossEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png)
+
+Cross-Encoders are often used for **re-ranking:** Given a list of possibly relevant passages for a query, for example retrieved from BM25 / Elasticsearch, the cross-encoder re-ranks this list so that the most relevant passages are at the top of the result list.
+
+To **train a cross-encoder** on the MS MARCO dataset, see:
+- **[train_cross-encoder_scratch.py](train_cross-encoder_scratch.py)** trains a cross-encoder from scratch using the provided data from the MS MARCO dataset.
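+
+To get a feel for the scoring interface before training, here is a hedged sketch (the model is one of the pretrained MS MARCO cross-encoders; the query and passages are made up):
+
+```python
+from sentence_transformers import CrossEncoder
+
+model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
+scores = model.predict([
+    ("how many people live in berlin", "Berlin has around 3.7 million registered inhabitants."),
+    ("how many people live in berlin", "Berlin is well known for its museums and nightlife."),
+])
+print(scores)  # higher score = more relevant; sort passages by score to re-rank them
+```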
+
+## Cross-Encoder Knowledge Distillation
+![](https://github.com/UKPLab/sentence-transformers/raw/master/docs/img/msmarco-training-ce-distillation.png)
+- **[train_cross-encoder_kd.py](train_cross-encoder_kd.py)** uses a knowledge distillation setup: [Hofstätter et al.](https://arxiv.org/abs/2010.02666) trained an ensemble of 3 (large) models for the MS MARCO dataset and predicted the scores for various (query, passage)-pairs (50% positive, 50% negative). In this example, we use knowledge distillation with a small & fast model and learn the logit scores from the teacher ensemble. This yields performance comparable to large models, while being 18 times faster. \ No newline at end of file
diff --git a/examples/training/multilingual/README.md b/examples/training/multilingual/README.md
index 98a2d706d..700bce9ca 100644
--- a/examples/training/multilingual/README.md
+++ b/examples/training/multilingual/README.md
@@ -1,25 +1,122 @@
-# Multilingual-Models
-The issue with multilingual BERT (mBERT) as well as with XLM-RoBERTa is that those produce rather bad sentence representation out-of-the-box. Further, the vectors spaces between languages are not aligned, i.e., the sentences with the same content in different languages would be mapped to different locations in the vector space.
+# Multilingual Models
+The issue with multilingual BERT (mBERT) as well as with XLM-RoBERTa is that these produce rather poor sentence representations out-of-the-box. Further, the vector spaces between languages are not aligned, i.e., sentences with the same content in different languages would be mapped to different locations in the vector space.

-In my publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) I describe any easy approach to extend sentence embeddings to further languages.
+In my publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) I describe an easy approach to extend sentence embeddings to further languages.

Chien Vu also wrote a nice blog article on this technique: [A complete guide to transfer learning from English to other Languages using Sentence Embeddings BERT Models](https://towardsdatascience.com/a-complete-guide-to-transfer-learning-from-english-to-other-languages-using-sentence-embeddings-8c427f8804a9)

-## Available Pre-trained Models
-For a list of available models, see [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).
+## Extend your own models
+![Multilingual Knowledge Distillation](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/multilingual-distillation.png)
+
+The idea is based on a fixed (monolingual) **teacher model** that produces sentence embeddings with our desired properties in one language (e.g. English). The **student model** is supposed to mimic the teacher model, i.e., the same English sentence should be mapped to the same vector by the teacher and by the student model. Additionally, in order to make the student model work for other languages, we train the student model on parallel (translated) sentences. The translation of each sentence should also be mapped to the same vector as the original sentence.
+
+In the above figure, the student model should map *Hello World* and the German translation *Hallo Welt* to the vector of ``teacher_model('Hello World')``. We achieve this by training the student model using mean squared error (MSE) loss.
+
+In our experiments we initialized the student model with the multilingual [XLM-RoBERTa model](https://huggingface.co/FacebookAI/xlm-roberta-base).
+
+## Training
+For a **fully automatic code example**, see [make_multilingual.py](make_multilingual.py).
+
+This script downloads the parallel sentences corpus, a corpus with transcripts and translations from talks. It then extends a monolingual model to several languages (en, de, es, it, fr, ar, tr). This corpus contains parallel data for more than 100 languages, hence, you can simply change the script and train a multilingual model in your favorite languages.
+
+## Datasets
+
+```eval_rst
+As training data we require parallel sentences, i.e., sentences translated in various languages. In particular, we will use :class:`~datasets.Dataset` instances with ``"english"`` and ``"non_english"`` columns. We have prepared a large collection of such datasets in our `Parallel Sentences dataset collection `_.
+```
+
+The training script will take the `"english"` column and add a `"label"` column containing the embeddings of the English texts. Then, the student model's embeddings for both the `"english"` and `"non_english"` texts will be trained to be similar to this `"label"`. You can load such a training dataset like so:
+
+```python
+from datasets import load_dataset
+
+train_dataset = load_dataset("sentence-transformers/parallel-sentences-talks", "en-de", split="train")
+print(train_dataset[0])
+# {"english": "So I think practicality is one case where it's worth teaching people by hand.", "non_english": "Ich denke, dass es sich aus diesem Grund lohnt, den Leuten das Rechnen von Hand beizubringen."}
+```
+
+## Sources for Training Data
+A great website for a vast number of parallel (translated) datasets is [OPUS](http://opus.nlpl.eu/). There, you find parallel datasets for more than 400 languages. You can use these to create your own parallel sentence datasets, if you wish.
+
+## Evaluation
+
+Training can be evaluated in different ways. For an example of how to use these evaluation methods, see [make_multilingual.py](make_multilingual.py).
+
+### MSE Evaluation
+You can measure the mean squared error (MSE) between the student embeddings and teacher embeddings.
+
+```python
+from datasets import load_dataset
+from sentence_transformers.evaluation import MSEEvaluator
+
+eval_dataset = load_dataset("sentence-transformers/parallel-sentences-talks", "en-fr", split="dev")
+
+dev_mse = MSEEvaluator(
+    source_sentences=eval_dataset["english"],
+    target_sentences=eval_dataset["non_english"],
+    name="en-fr-dev",
+    teacher_model=teacher_model,
+    batch_size=32,
+)
+```
+
+This evaluator computes the teacher embeddings for the `source_sentences`, for example, for English. During training, the student model is used to compute embeddings for the `target_sentences`, for example, for French. The distance between teacher and student embeddings is measured. Lower scores indicate better performance.
+
+### Translation Accuracy
+You can also measure the translation accuracy. As inputs, this evaluator accepts a list of `source_sentences` (e.g. English), and a list of `target_sentences` (e.g. Spanish), such that `target_sentences[i]` is a translation of `source_sentences[i]`.
+
+For each sentence pair, we check whether `target_sentences[i]` has the highest similarity to `source_sentences[i]` out of all target sentences. If this is the case, we have a hit, otherwise an error. This evaluator reports accuracy (higher = better).
+
+```python
+from datasets import load_dataset
+from sentence_transformers.evaluation import TranslationEvaluator
+
+eval_dataset = load_dataset("sentence-transformers/parallel-sentences-talks", "en-fr", split="dev")
+
+dev_trans_acc = TranslationEvaluator(
+    source_sentences=eval_dataset["english"],
+    target_sentences=eval_dataset["non_english"],
+    name="en-fr-dev",
+    batch_size=32,
+)
+```
+
+### Multilingual Semantic Textual Similarity
+You can also measure the semantic textual similarity (STS) between sentence pairs in different languages:
+
+```python
+from datasets import load_dataset
+from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+
+test_dataset = load_dataset("mteb/sts17-crosslingual-sts", "nl-en", split="test")
+
+test_emb_similarity = EmbeddingSimilarityEvaluator(
+    sentences1=test_dataset["sentence1"],
+    sentences2=test_dataset["sentence2"],
+    scores=[score / 5.0 for score in test_dataset["score"]],  # Convert 0-5 scores to 0-1 scores
+    batch_size=32,
+    name="sts17-nl-en-test",
+    show_progress_bar=False,
+)
+```
+
+Here, `sentences1` and `sentences2` are lists of sentences and `scores` is a list of numeric values indicating the semantic similarity between `sentences1[i]` and `sentences2[i]`.
+
+## Available Pre-trained Models
+For a list of available models, see [Pretrained Models](../../../docs/sentence_transformer/pretrained_models.html#multilingual-models).

## Usage
You can use the models in the following way:
+
```python
from sentence_transformers import SentenceTransformer
-embedder = SentenceTransformer("model-name")
-embeddings = embedder.encode(["Hello World", "Hallo Welt", "Hola mundo"])
-print(embeddings)
+model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
+embeddings = model.encode(["Hello World", "Hallo Welt", "Hola mundo", "Bye, Moon!"])
+similarities = model.similarity(embeddings, embeddings)
+# tensor([[1.0000, 0.9429, 0.8880, 0.4558],
+#         [0.9429, 1.0000, 0.9680, 0.5307],
+#         [0.8880, 0.9680, 1.0000, 0.4933],
+#         [0.4558, 0.5307, 0.4933, 1.0000]])
```
-
## Performance
The performance was evaluated on the [Semantic Textual Similarity (STS) 2017 dataset](http://ixa2.si.ehu.es/stswiki/index.php/Main_Page). The task is to predict the semantic similarity (on a scale 0-5) of two given sentences. STS2017 has monolingual test data for English, Arabic, and Spanish, and cross-lingual test data for English-Arabic, -Spanish and -Turkish.
- -This scripts downloads the parallel sentences corpus, a corpus with transcripts and translations from talks. It than extends a monolingual model to several languages (en, de, es, it, fr, ar, tr). This corpus contains parallel data for more than 100 languages, hence, you can simple change the script and train a multilingual model in your favorite languages. - - - -## Data Format - -As training data we require parallel sentences, i.e., sentences translated in various languages. As data format, we use a tab-separated .tsv file. In the first column, you have your source sentence, for example, an English sentence. In the following columns, you have the translations of this source sentence. If you have multiple translations per source sentence, you can put them in the same line or in different lines. -``` -Source_sentence Target_lang1 Target_lang2 Target_lang3 -Source_sentence Target_lang1 Target_lang2 -``` - -An example file could look like this (EN DE ES): -``` -Hello World Hallo Welt Hola Mundo -Sentences are separated with a tab character. Die Sätze sind per Tab getrennt. Las oraciones se separan con un carácter de tabulación. -``` - -The order of the translations are not important, it is only important that the first column contains a sentence in a language that is understood by the teacher model. - -## Loading Training Datasets - -You can load such a training file using the *ParallelSentencesDataset* class: -```python -from sentence_transformers.datasets import ParallelSentencesDataset - -train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model) -train_data.load_data("path/to/tab/separated/train-en-de.tsv") -train_data.load_data("path/to/tab/separated/train-en-es.tsv.gz") -train_data.load_data("path/to/tab/separated/train-en-fr.tsv.gz") - -train_dataloader = DataLoader(train_data, shuffle=True, batch_size=train_batch_size) -train_loss = losses.MSELoss(model=student_model) -``` - -You load a file with the *load_data()* method. You can load multiple files by calling load_data multiple times. You can also regular files or .gz-compressed files. - -Per default, all datasets are weighted equally. In the above example a (source, translation)-pair will be sampled equally from all three datasets. If you pass a `weight` parameter (integer), you can weight some datasets higher or lower. - -## Sources for Training Data -A great website for a vast number of parallel (translated) datasets is [OPUS](http://opus.nlpl.eu/). There, you find parallel datasets for more than 400 languages. - -The [examples/training/multilingual](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/multilingual/) folder contains some scripts that downloads parallel training data and brings it into the right format: -- [get_parallel_data_opus.py](get_parallel_data_opus.py): This script downloads data from the [OPUS](http://opus.nlpl.eu/) website. -- [get_parallel_data_tatoeba.py](get_parallel_data_tatoeba.py): This script downloads data from the [Tatoeba](https://tatoeba.org/) website, a website for language learners with example sentences for more than many languages. -- [get_parallel_data_talks.py](get_parallel_data_talks.py): This script downloads data the parallel sentences corpus, which contains transcripts and translations of more than 4,000 talks in 100+ languages. - -## Evaluation - -Training can be evaluated in different ways. For an example how to use these evaluation methods, see [make_multilingual.py](make_multilingual.py). 
- -### MSE Evaluation -You can measure the mean squared error (MSE) between the student embeddings and teacher embeddings. This can be achieved with the `` - -```python -# src_sentences and trg_sentences are lists of translated sentences, such that trg_sentences[i] is the translation of src_sentences[i] -dev_mse = evaluation.MSEEvaluator(src_sentences, trg_sentences, teacher_model=teacher_model) -``` - -This evaluator computes the teacher embeddings for the `src_sentences`, for example, for English. During training, the student model is used to compute embeddings for the `trg_sentences`, for example, for Spanish. The distance between teacher and student embeddings is measures. Lower scores indicate a better performance. - -### Translation Accuracy -You can also measure the translation accuracy. Given a list with source sentences, for example, 1000 English sentences. And a list with matching target (translated) sentences, for example, 1000 Spanish sentences. - -For each sentence pair, we check if their embeddings are the closest using cosine similarity. I.e., for each `src_sentences[i]` we check if `trg_sentences[i]` has the highest similarity out of all target sentences. If this is the case, we have a hit, otherwise an error. This evaluator reports accuracy (higher = better). - -```python -# src_sentences and trg_sentences are lists of translated sentences, such that trg_sentences[i] is the translation of src_sentences[i] -dev_trans_acc = evaluation.TranslationEvaluator( - src_sentences, - trg_sentences, - name=os.path.basename(dev_file), - batch_size=inference_batch_size, -) -``` - -### Multi-Lingual Semantic Textual Similarity -You can also measure the semantic textual similarity (STS) between sentence pairs in different languages: - -```python -sts_evaluator = evaluation.EmbeddingSimilarityEvaluatorFromList(sentences1, sentences2, scores) -``` - -Where `sentences1` and `sentences2` are lists of sentences and score is numeric value indicating the semantic similarity between `sentences1[i]` and `sentences2[i]`. - - ## Citation If you use the code for multilingual models, feel free to cite our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813): ``` diff --git a/examples/training/multilingual/make_multilingual.py b/examples/training/multilingual/make_multilingual.py index 21f30f8fd..b50d5d408 100644 --- a/examples/training/multilingual/make_multilingual.py +++ b/examples/training/multilingual/make_multilingual.py @@ -151,8 +151,8 @@ def prepare_dataset(batch): # Mean Squared Error (MSE) measures the (euclidean) distance between teacher and student embeddings dev_mse = MSEEvaluator( - eval_dataset["english"], - eval_dataset["non_english"], + source_sentences=eval_dataset["english"], + target_sentences=eval_dataset["non_english"], name=subset, teacher_model=teacher_model, batch_size=inference_batch_size, @@ -162,8 +162,8 @@ def prepare_dataset(batch): # TranslationEvaluator computes the embeddings for all parallel sentences. 
It then check if
# source[i] is the closest to target[i] out of all available target sentences
dev_trans_acc = TranslationEvaluator(
-    eval_dataset["english"],
-    eval_dataset["non_english"],
+    source_sentences=eval_dataset["english"],
+    target_sentences=eval_dataset["non_english"],
    name=subset,
    batch_size=inference_batch_size,
)
diff --git a/examples/training/nli/README.md b/examples/training/nli/README.md
index 5ab2dce23..553afcdbe 100644
--- a/examples/training/nli/README.md
+++ b/examples/training/nli/README.md
@@ -1,16 +1,25 @@
# Natural Language Inference
-Given two sentence (premise and hypothesis), Natural Language Inference (NLI) is the task of deciding if the premise entails the hypothesis, if they are contradiction or if they are neutral. Commonly used NLI dataset are [SNLI](https://arxiv.org/abs/1508.05326) and [MultiNLI](https://arxiv.org/abs/1704.05426).
+Given two sentences (premise and hypothesis), Natural Language Inference (NLI) is the task of deciding if the premise entails the hypothesis, if they are in contradiction, or if they are neutral. Commonly used NLI datasets are [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli).
[Conneau et al.](https://arxiv.org/abs/1705.02364) showed that NLI data can be quite useful when training Sentence Embedding methods. We also found this in our [Sentence-BERT-Paper](https://arxiv.org/abs/1908.10084) and often use NLI as a first fine-tuning step for sentence embedding methods.
To train on NLI, see the following example files:
-- **[training_nli.py](training_nli.py)** - This example uses the Softmax-Classification-Loss, as described in the [SBERT-Paper](https://arxiv.org/abs/1908.10084), to learn sentence embeddings.
-- **[training_nli_v2.py](training_nli_v2.py)** - The Softmax-Classification-Loss, as used in our original SBERT paper, does not yield optimal performance. A better loss is [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss), where we provide pairs or triplets. In that example, we provide a triplet of the format: (anchor, entailment_sentence, contradiction_sentence). The NLI data provides such triplets. The MultipleNegativesRankingLoss yields much higher performances and is more intuitive than the Softmax-Classification-Loss. We have used this loss to train the paraphrase model in our [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) paper.
-- **[training_nli_v3.py](training_nli_v3.py)** - Following the [GISTEmbed](https://arxiv.org/abs/2402.16829) paper, we can modify the in-batch negative selection from [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using a guiding model. Candidate negative pairs are ignored during training if the guiding model considers the pair to be too similar. In practice, the [GISTEmbedLoss](https://www.sbert.net/docs/package_reference/losses.html#gistembedloss) tends to produce a stronger training signal than `MultipleNegativesRankingLoss` at the cost of some training overhead for running inference on the guiding model.
+1. **[training_nli.py](training_nli.py)**:
+   ```eval_rst
+   This example uses :class:`~sentence_transformers.losses.SoftmaxLoss` as described in the original `Sentence Transformers paper <https://arxiv.org/abs/1908.10084>`_.
+   ```
+2. **[training_nli_v2.py](training_nli_v2.py)**:
+   ```eval_rst
+   The :class:`~sentence_transformers.losses.SoftmaxLoss` as used in our original SBERT paper does not yield optimal performance. A better loss is :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, where we provide pairs or triplets. In this script, we provide a triplet of the format: (anchor, entailment_sentence, contradiction_sentence). The NLI data provides such triplets. The :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` yields much higher performances and is more intuitive than :class:`~sentence_transformers.losses.SoftmaxLoss`. We have used this loss to train the paraphrase model in our `Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation <https://arxiv.org/abs/2004.09813>`_ paper.
+   ```
+3. **[training_nli_v3.py](training_nli_v3.py)**:
+   ```eval_rst
+   Following the `GISTEmbed <https://arxiv.org/abs/2402.16829>`_ paper, we can modify the in-batch negative selection from :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` using a guiding model. Candidate negative pairs are ignored during training if the guiding model considers the pair to be too similar. In practice, the :class:`~sentence_transformers.losses.GISTEmbedLoss` tends to produce a stronger training signal than :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` at the cost of some training overhead for running inference on the guiding model.
+   ```

## Data
-In our experiments we combine [SNLI](https://arxiv.org/abs/1508.05326) and [MultiNLI](https://arxiv.org/abs/1704.05426), which we call AllNLI. These two datasets contain sentence pairs and one of three labels: entailment, neutral, contradiction:
+We combine [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) into a dataset we call [AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli). These two datasets contain sentence pairs and one of three labels: entailment, neutral, contradiction:

| Sentence A (Premise) | Sentence B (Hypothesis) | Label |
| --- | --- | --- |
@@ -18,45 +27,45 @@ In our experiments we combine [SNLI](https://arxiv.org/abs/1508.05326) and [Mult
| An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | neutral |
| A man inspects the uniform of a figure in some East Asian country. | The man is sleeping. | contradiction |
-
-
-
+We format AllNLI in a few different subsets, compatible with different loss functions. See for example the [triplet subset of AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli/viewer/triplet).

## SoftmaxLoss
-[Conneau et al.](https://arxiv.org/abs/1705.02364) described how a softmax classifier on top of a siamese network can be used to learn meaningful sentence representation. We can achieve this by using the [losses.SoftmaxLoss](../../../docs/package_reference/losses.html#softmaxloss) package.
+```eval_rst
+`Conneau et al. <https://arxiv.org/abs/1705.02364>`_ described how a softmax classifier on top of a `siamese network `_ can be used to learn meaningful sentence representations. We can achieve this by using :class:`~sentence_transformers.losses.SoftmaxLoss`:
+```
+<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_SoftmaxLoss.png" alt="SBERT SoftmaxLoss"/>
-The softmax loss looks like this:
+We pass the two sentences through our SentenceTransformer model and get the sentence embeddings *u* and *v*. We then concatenate *u*, *v* and *|u-v|* to form one long vector. This vector is then passed to a softmax classifier, which predicts our three classes (entailment, neutral, contradiction).
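+
+As a rough sketch, this is how the loss could be set up with the new trainer. Treat this as an illustration rather than the full example script: the base model is a placeholder, and we assume the `pair-class` subset of [AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli) with its integer `label` column:
+
+```python
+from datasets import load_dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+
+model = SentenceTransformer("distilroberta-base")  # placeholder base model
+
+# (premise, hypothesis) pairs with an integer label per pair (assumed: 3 classes)
+train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train")
+
+train_loss = losses.SoftmaxLoss(
+    model=model,
+    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
+    num_labels=3,  # entailment, neutral, contradiction
+)
+
+trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=train_loss)
+trainer.train()
+```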
-
-![SBERT SoftmaxLoss](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_SoftmaxLoss.png "SBERT SoftmaxLoss")
-
-We pass the two sentences through our SentenceTransformer network and get the sentence embeddings *u* and *v*. We then concatenate u, v and |u-v| to form one, long vector. This vector is then passed to a softmax classifier, which predicts our three classes (entailment, neutral, contradiction).
-
-This setup learns sentence embeddings, that can later be used for wide variety of tasks.
+This setup learns sentence embeddings that can later be used for a wide variety of tasks.

## MultipleNegativesRankingLoss
+```eval_rst
+That the :class:`~sentence_transformers.losses.SoftmaxLoss` with NLI data produces (relatively) good sentence embeddings is rather coincidental. The :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is much more intuitive and produces significantly better sentence representations.
+```
-
-That the softmax-loss with NLI data produces (relatively) good sentence embeddings is rather coincidental. The [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) is much more intuitive and produces also significantly better sentence representations.
-
-The training data for MultipleNegativesRankingLoss consists of sentence pairs [(a1, b1), ..., (an, bn)] where we assume that (ai, bi) are similar sentences and (ai, bj) are dissimilar sentences for i != j. The minimizes the distance between (ai, bi) while it simultaneously maximizes the distance (ai, bj) for all i != j.
+The training data for MultipleNegativesRankingLoss consists of sentence pairs [(a1, b1), ..., (an, bn)] where we assume that (ai, bi) are similar sentences and (ai, bj) are dissimilar sentences for i != j. The loss minimizes the distance between (ai, bi) while it simultaneously maximizes the distance between (ai, bj) for all i != j. For example, in the following picture:
-
-For example in the following picture:
-
-![](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/MultipleNegativeRankingLoss.png)
+<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/MultipleNegativeRankingLoss.png" alt="SBERT MultipleNegativeRankingLoss"/>

The distance between (a1, b1) is reduced, while the distance between (a1, b2...5) will be increased. The same is done for a2, ..., a5.
-
-Using MultipleNegativeRankingLoss with NLI is rather easy: We define sentences that have an *entailment* label as positive pairs. E.g, we have pairs like (*"A soccer game with multiple males playing."*, *"Some men are playing a sport."*) and want that these pairs are close in vector space.
+```eval_rst
+Using :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` with NLI is rather easy: We define sentences that have an *entailment* label as positive pairs. E.g., we have pairs like (*"A soccer game with multiple males playing."*, *"Some men are playing a sport."*) and want these pairs to be close in vector space. The `pair subset of AllNLI <https://huggingface.co/datasets/sentence-transformers/all-nli/viewer/pair>`_ has been prepared in this format.
+```

### MultipleNegativesRankingLoss with Hard Negatives
-We can further improve MultipleNegativesRankingLoss by not only providing pairs, but by providing triplets: [(a1, b1, c1), ..., (an, bn, cn)]
-
-The entry for ci are so-called hard-negatives: On a lexical level, they are similar to ai and bi. But on a semantic level, they mean different things and should not be close in the vector space.
+We can further improve MultipleNegativesRankingLoss by providing triplets rather than pairs: [(a1, b1, c1), ..., (an, bn, cn)].
The samples for ci are so-called hard-negatives: On a lexical level, they are similar to ai and bi, but on a semantic level, they mean different things and should not be close to ai in the vector space. For NLI data, we can use the contradiction-label to create such triplets with a hard negative. So our triplets look like this: -("*A soccer game with multiple males playing."*, *"Some men are playing a sport."*, *"A group of men playing a baseball game."*). +("*A soccer game with multiple males playing."*, *"Some men are playing a sport."*, *"A group of men playing a baseball game."*). We want the sentences *"A soccer game with multiple males playing."* and *"Some men are playing a sport."* to be close in the vector space, while there should be a larger distance between *"A soccer game with multiple males playing."* and "*A group of men playing a baseball game."*. The [triplet subset of AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli/viewer/triplet) has been prepared in this format. + +### GISTEmbedLoss +```eval_rst + +:class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` can be extended even further by recognizing that the in-batch negative sampling as shown in `this example <#multiplenegativesrankingloss>`_ is a bit flawed. In particular, we automatically assume that the pairs (a\ :sub:`1`\ , b\ :sub:`2`\ ), ..., (a\ :sub:`1`\ , b\ :sub:`n`\ ) are negative, but that does not strictly have to be true. -We want the sentences *"A soccer game with multiple males playing."* and *"Some men are playing a sport."* to be close in the vector space, while there should be a larger distance between *"A soccer game with multiple males playing."* and "*A group of men playing a baseball game."*. +To address this, :class:`~sentence_transformers.losses.GISTEmbedLoss` uses a Sentence Transformer model to guide the in-batch negative sample selection. In particular, if the guide model considers the similarity of (a\ :sub:`1`\ , b\ :sub:`n`\ ) to be larger than (a\ :sub:`1`\ , b\ :sub:`1`\ ), then the (a\ :sub:`1`\ , b\ :sub:`n`\ ) pair is considered a false negative and consequently ignored in the training process. In essence, this results in higher quality training data for the model. 
+``` \ No newline at end of file diff --git a/examples/training/paraphrases/MultiDatasetDataLoader.py b/examples/training/paraphrases/MultiDatasetDataLoader.py deleted file mode 100644 index 9220a37be..000000000 --- a/examples/training/paraphrases/MultiDatasetDataLoader.py +++ /dev/null @@ -1,91 +0,0 @@ -import math -import logging -import random - - -class MultiDatasetDataLoader: - def __init__(self, datasets, batch_size_pairs, batch_size_triplets=None, dataset_size_temp=-1): - self.allow_swap = True - self.batch_size_pairs = batch_size_pairs - self.batch_size_triplets = batch_size_pairs if batch_size_triplets is None else batch_size_triplets - - # Compute dataset weights - self.dataset_lengths = list(map(len, datasets)) - self.dataset_lengths_sum = sum(self.dataset_lengths) - - weights = [] - if dataset_size_temp > 0: # Scale probability with dataset size - for dataset in datasets: - prob = len(dataset) / self.dataset_lengths_sum - weights.append(max(1, int(math.pow(prob, 1 / dataset_size_temp) * 1000))) - else: # Equal weighting of all datasets - weights = [100] * len(datasets) - - logging.info("Dataset lengths and weights: {}".format(list(zip(self.dataset_lengths, weights)))) - - self.dataset_idx = [] - self.dataset_idx_pointer = 0 - - for idx, weight in enumerate(weights): - self.dataset_idx.extend([idx] * weight) - random.shuffle(self.dataset_idx) - - self.datasets = [] - for dataset in datasets: - random.shuffle(dataset) - self.datasets.append( - { - "elements": dataset, - "pointer": 0, - } - ) - - def __iter__(self): - for _ in range(int(self.__len__())): - # Select dataset - if self.dataset_idx_pointer >= len(self.dataset_idx): - self.dataset_idx_pointer = 0 - random.shuffle(self.dataset_idx) - - dataset_idx = self.dataset_idx[self.dataset_idx_pointer] - self.dataset_idx_pointer += 1 - - # Select batch from this dataset - dataset = self.datasets[dataset_idx] - batch_size = self.batch_size_pairs if len(dataset["elements"][0].texts) == 2 else self.batch_size_triplets - - batch = [] - texts_in_batch = set() - guid_in_batch = set() - while len(batch) < batch_size: - example = dataset["elements"][dataset["pointer"]] - - valid_example = True - # First check if one of the texts in already in the batch - for text in example.texts: - text_norm = text.strip().lower() - if text_norm in texts_in_batch: - valid_example = False - - texts_in_batch.add(text_norm) - - # If the example has a guid, check if guid is in batch - if example.guid is not None: - valid_example = valid_example and example.guid not in guid_in_batch - guid_in_batch.add(example.guid) - - if valid_example: - if self.allow_swap and random.random() > 0.5: - example.texts[0], example.texts[1] = example.texts[1], example.texts[0] - - batch.append(example) - - dataset["pointer"] += 1 - if dataset["pointer"] >= len(dataset["elements"]): - dataset["pointer"] = 0 - random.shuffle(dataset["elements"]) - - yield self.collate_fn(batch) if self.collate_fn is not None else batch - - def __len__(self): - return int(self.dataset_lengths_sum / self.batch_size_pairs) diff --git a/examples/training/paraphrases/README.md b/examples/training/paraphrases/README.md index 4ba0d7fbe..1e76d2c85 100644 --- a/examples/training/paraphrases/README.md +++ b/examples/training/paraphrases/README.md @@ -1,65 +1,17 @@ # Paraphrase Data -**This page is currently work-in-progress and will be extended in the future** +```eval_rst +In our paper `Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation `_, we showed that paraphrase data 
together with :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is a powerful combination to learn sentence embedding models. Read `NLI > MultipleNegativesRankingLoss <../nli/README.html#multiplenegativesrankingloss>`_ for more information on this loss function.
+```
-In our paper [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) we showed that paraphrase dataset together with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) is a powerful combination to learn sentence embeddings models.
+The [training.py](training.py) script loads various datasets from the [Dataset Overview](../../../docs/sentence_transformer/dataset_overview.html#pre-existing-datasets). We construct batches by sampling examples from the respective dataset. So far, examples are not mixed between the datasets, i.e., a batch consists only of examples from a single dataset.
-You can find here: [NLI - MultipleNegativesRankingLoss](https://www.sbert.net/examples/training/nli/README.html#multiplenegativesrankingloss) more information how the loss can be used.
-
-In this folder, we collect different datasets and scripts to train using paraphrase data.
-
-## Datasets
-
-You can find here: [sbert.net/datasets/paraphrases](http://sbert.net/datasets/paraphrases) a list of datasets with paraphrases suitable for training.
-
-| Name | Source | #Sentence-Pairs | STSb-dev |
-| --- | --- | :---: | :---: |
-| [AllNLI.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/AllNLI.tsv.gz) | [SNLI](https://nlp.stanford.edu/projects/snli/) + [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) | 277,230 | 86.54 |
-| [sentence-compression.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/sentence-compression.tsv.gz) | [sentence-compression](https://github.com/google-research-datasets/sentence-compression) | 180,000 | 84.36 |
-| [SimpleWiki.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/SimpleWiki.tsv.gz) | [SimpleWiki](https://cs.pomona.edu/~dkauchak/simplification/) | 102,225 | 84.26 |
-| [altlex.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/altlex.tsv.gz) | [altlex](https://github.com/chridey/altlex/) | 112,696 | 83.34 |
-| [msmarco-triplets.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/msmarco-triplets.tsv.gz) | [MS MARCO Passages](https://microsoft.github.io/msmarco/) | 5,028,051 | 83.12 |
-| [quora_duplicates.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/quora_duplicates.tsv.gz) | [Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | 103,663 | 82.55 |
-| [coco_captions-with-guid.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/coco_captions-with-guid.tsv.gz) | [COCO](https://cocodataset.org/) | 828,395 | 82.25
-| [flickr30k_captions-with-guid.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/flickr30k_captions-with-guid.tsv.gz) | [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | 317,695 | 82.04
-|
[yahoo_answers_title_question.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/yahoo_answers_title_question.tsv.gz) | [Yahoo Answers Dataset](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) | 659,896 | 81.19 | -| [S2ORC_citation_pairs.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/S2ORC_citation_pairs.tsv.gz) | [Semantic Scholar Open Research Corpus](http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/) | 52,603,982 | 81.02 | -| [yahoo_answers_title_answer.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/yahoo_answerstitle_answer.tsv.gz) | [Yahoo Answers Dataset](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) | 1,198,260 | 80.25 -| [stackexchange_duplicate_questions.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/stackexchange_duplicate_questions.tsv.gz) | [Stackexchange](https://stackexchange.com/) | 169,438 | 80.37 -| [yahoo_answers_question_answer.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/yahoo_answers_question_answer.tsv.gz) | [Yahoo Answers Dataset](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) | 681,164 | 79.88 | -| [wiki-atomic-edits.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/wiki-atomic-edits.tsv.gz) | [wiki-atomic-edits](https://github.com/google-research-datasets/wiki-atomic-edits) | 22,980,185 | 79.58 -| [wiki-split.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/wiki-split.tsv.gz) | [wiki-split](https://github.com/google-research-datasets/wiki-split) | 929,944 | 76.59 - - -See the respective linked source website for the dataset license. - - -All datasets have a sample per line and the individual sentences are separated by a tab (\t). Some datasets (like AllNLI) has three sentences per line: An anchor, a positive, and a hard negative. - -We measure for each dataset the performance on the STSb development dataset after 2k training steps with a distilroberta-base model and a batch size of 256. - -**Note**: We find that the STSb dataset is a suboptimal dataset to evaluate the quality of sentence embedding models. It consists mainly of rather simple sentences, it does not require any domain specific knowledge, and the included sentences are of rather high quality compared to noisy, user-written content. Please do not infer from the above numbers how the approaches will perform on your domain specific dataset. - -## Training -See [training.py](training.py) for the training script. - -The training script allows to load one or multiple files. We construct batches by sampling examples from the respective dataset. So far, examples are not mixed between the datasets, i.e., a batch consists only of examples from a single dataset. - -As the dataset sizes are quite different in size, we perform a temperature controlled sampling from the datasets: Smaller datasets are up-sampled, while larger datasets are down-sampled. This allows an effective training with very large and smaller datasets. 
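+
+A minimal sketch of this multi-dataset setup with the new trainer (the dataset choices and the base model here are placeholders; a single loss is shared across both datasets, and the round-robin batch sampler is described just below):
+
+```python
+from datasets import DatasetDict, load_dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+from sentence_transformers.training_args import (
+    MultiDatasetBatchSamplers,
+    SentenceTransformerTrainingArguments,
+)
+
+model = SentenceTransformer("distilroberta-base")  # placeholder base model
+
+# Two (anchor, positive) pair datasets; each batch is drawn from a single dataset
+train_dataset = DatasetDict({
+    "all-nli": load_dataset("sentence-transformers/all-nli", "pair", split="train"),
+    "quora": load_dataset("sentence-transformers/quora-duplicates", "pair", split="train"),
+})
+train_loss = losses.MultipleNegativesRankingLoss(model)
+
+args = SentenceTransformerTrainingArguments(
+    output_dir="output/paraphrase-model",  # placeholder output directory
+    # Sample the same number of batches from every dataset (round-robin), see below
+    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
+)
+trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=train_loss)
+trainer.train()
+```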
+As the datasets are quite different in size, we perform [round-robin sampling](../../../docs/package_reference/sentence_transformer/training_args.html#sentence_transformers.training_args.MultiDatasetBatchSamplers) to train using the same number of batches from each dataset.

## Pre-Trained Models
-Have a look at [pre-trained models](https://www.sbert.net/docs/pretrained_models.html) to view all models that were trained on these paraphrase datasets.
-
-- **paraphrase-MiniLM-L12-v2** - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
-- **paraphrase-distilroberta-base-v2** - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
-- **paraphrase-distilroberta-base-v1** - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split
-- **paraphrase-xlm-r-multilingual-v1** - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: paraphrase-distilroberta-base-v1, Student: xlm-r-base)
-
-
-## Work in Progress
+Have a look at [pre-trained models](../../../docs/sentence_transformer/pretrained_models.md) to view all models that were trained on these paraphrase datasets.
-Training with this data is currently work-in-progress. Things that will be added in the next time:
-- **More datasets**: Are you aware of more suitable training datasets? Let me know: [info@nils-reimers.de](mailto:info@nils-reimers.de)
-- **Optimized batching**: Currently batches are only drawn from one dataset. Future work might include also batches that are sampled across datasets
-- **Optimized loss function**: Currently the same parameters of MultipleNegativesRankingLoss is used for all datasets. Future work includes testing if the dataset benefit from individual loss functions.
-- **Pre-trained models**: Once all datasets are collected, we will train and release respective models.
\ No newline at end of file
+- [paraphrase-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L12-v2) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
+- [paraphrase-distilroberta-base-v2](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v2) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
+- [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split
+- [paraphrase-xlm-r-multilingual-v1](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1) - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1), Student: [xlm-r-base](https://huggingface.co/FacebookAI/xlm-roberta-base))
diff --git a/examples/training/quora_duplicate_questions/README.md b/examples/training/quora_duplicate_questions/README.md
index d61b04ac3..598fcb337 100644
--- a/examples/training/quora_duplicate_questions/README.md
+++ b/examples/training/quora_duplicate_questions/README.md
@@ -1,191 +1,144 @@
# Quora Duplicate Questions
-This folder contains scripts that demonstrate how to train SentenceTransformers for **Information Retrieval**. As simple example, we will use the [Quora Duplicate Questions dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). It contains over 500,000 sentences with over 400,000 pairwise annotations whether two questions are a duplicate or not.
-
-## Pretrained Models
-
-Currently the following models trained on Quora Duplicate Questions are available:
-* **distilbert-base-nli-stsb-quora-ranking**: We extended the *distilbert-base-nli-stsb-mean-tokens* model and trained it with *OnlineContrastiveLoss* and with *MultipleNegativesRankingLoss* on the Quora Duplicate questions dataset. For the code, see [training_multi-task-learning.py](training_multi-task-learning.py)
-* **distilbert-multilingual-nli-stsb-quora-ranking**: Extension of *distilbert-base-nli-stsb-quora-ranking* to be multi-lingual. Trained on parallel data for 50 languages.
-
-You can load & use pre-trained models like this:
-```python
-from sentence_transformers import SentenceTransformer
-
-model = SentenceTransformer("model_name")
-```
-
-
-## Dataset
-As dataset to train a **Duplicate Questions Semantic Search Engine** we use [Quora Duplicate Questions dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). The original format looks like this:
-```
-id qid1 qid2 question1 question2 is_duplicate
-0 1 2 What is the step by step guide to invest in share market in india? What is the step by step guide to invest in share market? 0
-1 3 4 What is the story of Kohinoor (Koh-i-Noor) Diamond? What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? 0
-```
-
-As a first step, we process this file to create distinct train/dev/test splits for different tasks. We define the following tasks:
-- **Duplicate Questions Classification**: Given two questions, are these questions duplicates? This is the original task as defined by Quora, however, it is rather a unpractical task. How do we retrieve possible duplicates in a large corpus for a given question? Further, models performing well on this classification task do not necessarily perform well on the following two task.
-- **Duplicate Questions Mining**: Given a large set (like 100k) of questions, identify all question pairs that are duplicates.
-- **Duplicate Questions Information Retrieval**: Given a large corpus (350k+) of questions. For a new, unseen question, find the most related (i.e. duplicate) questions in this corpus.
-
-
-**Download**: You can download the finished dataset here: [quora-IR-dataset.zip](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/quora-IR-dataset.zip)
-
-For details on the creation of the dataset, see [create_splits.py](create_splits.py).
-
-
-## Usage
-
-### Duplicate Questions Mining
-
-Given a large set of sentences (in this case questions), identify all pairs that are duplicates. See [Paraphrase Mining](../../applications/paraphrase-mining/README.md) for an example how to use sentence transformers to mine for duplicate questions / paraphrases. This approach can be scaled to hundred thousands of sentences given you have enough memory.
-
-### Semantic Search
-
-The model can also be used for Information Retrieval / Semantic Search. Given a new question, search a large corpus of hundred thousands of questions for duplicate questions. Given you have enough memory, this approach works well to copora up in the Millions (depending on your real-time requirements).
-
-For an interactive example, see [Semantic Search](../../applications/semantic-search/README.md).
+This folder contains scripts that demonstrate how to train SentenceTransformers for **Information Retrieval**. As a simple example, we will use the [Quora Duplicate Questions dataset](https://huggingface.co/datasets/sentence-transformers/quora-duplicates). It contains over 500,000 sentences with over 400,000 pairwise annotations on whether two questions are a duplicate or not.
+Models trained on this dataset can be used for mining duplicate questions, i.e., given a large set of sentences (in this case questions), identify all pairs that are duplicates. See [Paraphrase Mining](../../applications/paraphrase-mining/README.md) for an example of how to use sentence transformers to mine for duplicate questions / paraphrases. This approach can be scaled to hundreds of thousands of sentences.

## Training
-Choosing the right loss function is crucial for getting well working sentence embeddings. For the given task, two loss functions are especially suitable: **ConstrativeLoss** and **MultipleNegativesRankingLoss**
+```eval_rst
+Choosing the right loss function is crucial for finetuning useful models. For the given task, two loss functions are especially suitable: :class:`~sentence_transformers.losses.OnlineContrastiveLoss` and :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`.
+```
-### Constrative Loss
-For the complete example, see [training_OnlineContrastiveLoss.py](training_OnlineContrastiveLoss.py).
+### Contrastive Loss
+For the complete training example, see [training_OnlineContrastiveLoss.py](training_OnlineContrastiveLoss.py).
-In the original dataset, we have questions given with a label of 0=not duplicate and 1=duplicate. In that case, we can use contrastive loss: Similar pairs with label 1 are pulled together, so that they are close in vector space. Dissimilar pairs, that are closer than a defined margin, are pushed away in vector space.
+```eval_rst
+The Quora Duplicates dataset has a `pair-class subset <https://huggingface.co/datasets/sentence-transformers/quora-duplicates/viewer/pair-class>`_ which consists of question pairs and labels: 1 for duplicate and 0 for different.
-Choosing the distance function and especially choosing a sensible margin are quite important for the success of contrastive loss. In the given example, we use cosine_distance (which is 1-cosine_similarity) with a margin of 0.5. I.e., non-duplicate questions should have a cosine_distance of at least 0.5 (which is equivalent to a 0.5 cosine similarity difference).
+As shown by our `Loss Overview <../../../docs/sentence_transformer/loss_overview.md>`_, this allows us to use :class:`~sentence_transformers.losses.ContrastiveLoss`. Similar pairs with label 1 are pulled together, so that they are close in vector space, while dissimilar pairs that are closer than a defined margin are pushed away in vector space.
-An improved version of contrastive loss is OnlineContrastiveLoss, which looks which negative pairs have a lower distance that the largest positive pair and which positive pairs have a higher distance than the lowest distance of negative pairs. I.e., this loss automatically detects the hard cases in a batch and computes the loss only for these cases.
+An improved version is :class:`~sentence_transformers.losses.OnlineContrastiveLoss`. This loss checks which negative pairs have a lower distance than the largest positive pair and which positive pairs have a higher distance than the lowest distance of negative pairs. I.e., this loss automatically detects the hard cases in a batch and computes the loss only for these cases.
+```

The loss can be used like this:

```python
-train_samples = []
-with open(
-    os.path.join(dataset_path, "classification/train_pairs.tsv"), encoding="utf8"
-) as fIn:
-    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
-    for row in reader:
-        sample = InputExample(
-            texts=[row["question1"], row["question2"]],
-            label=int(row["is_duplicate"]),
-        )
-        train_samples.append(sample)
-
-
-train_dataset = SentencesDataset(train_samples, model=model)
-train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
-train_loss = losses.OnlineContrastiveLoss(model=model, distance_metric=distance_metric, margin=margin)
-```
-
-For each row in our train dataset, we create new InputExample objects and the two questions as texts and the is_duplicate as the label.
-
-
+from datasets import load_dataset
+from sentence_transformers import losses
+
+train_dataset = load_dataset("sentence-transformers/quora-duplicates", "pair-class", split="train")
+# => Dataset({
+#     features: ['sentence1', 'sentence2', 'label'],
+#     num_rows: 404290
+# })
+print(train_dataset[0])
+# => {'sentence1': 'What is the step by step guide to invest in share market in india?', 'sentence2': 'What is the step by step guide to invest in share market?', 'label': 0}
+# `model` is the SentenceTransformer model being finetuned
+train_loss = losses.OnlineContrastiveLoss(model=model, margin=0.5)
+```

## MultipleNegativesRankingLoss
For the complete example, see [training_MultipleNegativesRankingLoss.py](training_MultipleNegativesRankingLoss.py).
-*MultipleNegativesRankingLoss* is especially suitable for Information Retrieval / Semantic Search. A nice advantage of *MultipleNegativesRankingLoss* is that it only requires positive pairs, i.e., we only need examples of duplicate questions.
-
-From all pairs, we sample a mini-batch *(a_1, b_1), ..., (a_n, b_n)* where *(a_i, b_i)* is a duplicate question.
-
-MultipleNegativesRankingLoss now uses all *b_j* with j != i as negative example for *(a_i, b_i)*. For example, for *a_1* we have given the options *(b_1, ..., b_n)* and we need to identify which is the correct duplicate question to *a_1*. We do this by computing the dot-product between the embedding of *a_1* and all *b*'s and softmax normalize it so that we get a probability distribution over *(b_1, ..., b_n)*. In the best case, the positive example *b_1* get a probability of close to 1 while all others get scores close to 0. We use negative log-likelihood to compute the loss.
-
-
-*MultipleNegativesRankingLoss* implements this idea in an efficient way so that the embeddings are re-used. With a batch-size of 64, we have 64 positive pairs and each positive pairs has 64-1 negative distractors.
-
+```eval_rst
+:class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is especially suitable for Information Retrieval / Semantic Search. A nice advantage is that it only requires positive pairs, i.e., we only need examples of duplicate questions. See `NLI > MultipleNegativesRankingLoss <../nli/README.html#multiplenegativesrankingloss>`_ for more information on how the loss works.
+```

Using the loss is easy and does not require tuning of any hyperparameters:

```python
-train_samples = []
-with open(os.path.join(dataset_path, "classification/train_pairs.tsv"), encoding="utf8") as fIn:
-    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
-    for row in reader:
-        if row["is_duplicate"] == "1":
-            train_samples.append(
-                InputExample(texts=[row["question1"], row["question2"]], label=1)
-            )
-            train_samples.append(
-                InputExample(texts=[row["question2"], row["question1"]], label=1)
-            )  # if A is a duplicate of B, then B is a duplicate of A
-
-
-# After reading the train_samples, we create a SentencesDataset and a DataLoader
-train_dataset = SentencesDataset(train_samples, model=model)
-train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
+from datasets import load_dataset
+from sentence_transformers import losses
+
+train_dataset = load_dataset("sentence-transformers/quora-duplicates", "pair", split="train")
+# => Dataset({
+#     features: ['anchor', 'positive'],
+#     num_rows: 149263
+# })
+print(train_dataset[0])
+# => {'anchor': 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?', 'positive': "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"}
train_loss = losses.MultipleNegativesRankingLoss(model)
```

-We only use the positive examples. As 'is_duplicate' is a symmetric relation, we not only add (A, B) but also (B, A) to our training sample set.
+As 'is_duplicate' is a symmetric relation, we can add not just (anchor, positive) but also (positive, anchor) to our training sample set:
-**Note 1:** Increasing the batch sizes usually yields better results, as the task gets harder. It is more difficult to identify the correct duplicate question out of a set of 100 questions than out of a set of only 10 questions. So it is advisable to set the training batch size as large as possible. I trained it with a batch size of 350 on 32 GB GPU memory.
+
+```python
+from datasets import concatenate_datasets
+
+train_dataset = concatenate_datasets([
+    train_dataset,
+    train_dataset.rename_columns({"anchor": "positive", "positive": "anchor"})
+])
+# Dataset({
+#     features: ['anchor', 'positive'],
+#     num_rows: 298526
+# })
+```
```eval_rst
.. note::
    Increasing the batch sizes usually yields better results, as the task gets harder. It is more difficult to identify the correct duplicate question out of a set of 100 questions than out of a set of only 10 questions. So it is advisable to set the training batch size as large as possible. I trained it with a batch size of 350 on 32 GB GPU memory.

-**Note 2:** MultipleNegativesRankingLoss only works if *(a_i, b_j)* with j != i is actually a negative, non-duplicate question pair. In few instances, this assumption is wrong. But in the majority of cases, if we sample two random questions, they are not duplicates. If your dataset cannot fulfil this property, MultipleNegativesRankingLoss might not work well.

+.. note::
+    :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` only works if *(a_i, b_j)* with j != i is actually a negative, non-duplicate question pair. In a few instances, this assumption is wrong. But in the majority of cases, if we sample two random questions, they are not duplicates. If your dataset cannot fulfil this property, :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` might not work well.
+```

### Multi-Task-Learning
-Contrastive Loss works well for pair classification, i.e., given two pairs, are these duplicates or not. It pushes negative pairs far away in vector space, so that the distinguishing between duplicate and non-duplicate pairs works good.
+```eval_rst
+:class:`~sentence_transformers.losses.ContrastiveLoss` works well for pair classification, i.e., given a pair of questions, are these duplicates or not. It pushes negative pairs far away in vector space, so that distinguishing between duplicate and non-duplicate pairs works well.
-MultipleNegativesRankingLoss on the other sides mainly reduces the distance between positive pairs out of large set of possible candidates. However, the distance between non-duplicate questions is not so large, so that this loss does not work that well for pair classification.
+:class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, on the other hand, mainly reduces the distance between positive pairs out of a large set of possible candidates. However, the distance between non-duplicate questions is not so large, so that this loss does not work that well for pair classification.
+```

In [training_multi-task-learning.py](training_multi-task-learning.py) I demonstrate how we can train the network with both losses. The essential code is to define both losses and to pass it to the fit method.
+ ```python -train_samples_MultipleNegativesRankingLoss = [] -train_samples_ContrastiveLoss = [] - -with open(os.path.join(dataset_path, "classification/train_pairs.tsv"), encoding="utf8") as fIn: - reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE) - for row in reader: - train_samples_ContrastiveLoss.append( - InputExample( - texts=[row["question1"], row["question2"]], - label=int(row["is_duplicate"]), - ) - ) - if row["is_duplicate"] == "1": - train_samples_MultipleNegativesRankingLoss.append( - InputExample(texts=[row["question1"], row["question2"]], label=1) - ) - train_samples_MultipleNegativesRankingLoss.append( - InputExample(texts=[row["question2"], row["question1"]], label=1) - ) # if A is a duplicate of B, then B is a duplicate of A - -# Create data loader and loss for MultipleNegativesRankingLoss -train_dataset_MultipleNegativesRankingLoss = SentencesDataset( - train_samples_MultipleNegativesRankingLoss, model=model +from datasets import load_dataset +from sentence_transformers.losses import ContrastiveLoss, MultipleNegativesRankingLoss +from sentence_transformers import SentenceTransformerTrainer, SentenceTransformer + +model_name = "stsb-distilbert-base" +model = SentenceTransformer(model_name) + +# https://huggingface.co/datasets/sentence-transformers/quora-duplicates +mnrl_dataset = load_dataset( + "sentence-transformers/quora-duplicates", "triplet", split="train" +) # The "pair" subset also works +mnrl_train_dataset = mnrl_dataset.select(range(100000)) +mnrl_eval_dataset = mnrl_dataset.select(range(100000, 101000)) + +mnrl_train_loss = MultipleNegativesRankingLoss(model=model) + +# https://huggingface.co/datasets/sentence-transformers/quora-duplicates +cl_dataset = load_dataset("sentence-transformers/quora-duplicates", "pair-class", split="train") +cl_train_dataset = cl_dataset.select(range(100000)) +cl_eval_dataset = cl_dataset.select(range(100000, 101000)) + +cl_train_loss = ContrastiveLoss(model=model, margin=0.5) + +# Create the trainer & start training +trainer = SentenceTransformerTrainer( + model=model, + train_dataset={ + "mnrl": mnrl_train_dataset, + "cl": cl_train_dataset, + }, + eval_dataset={ + "mnrl": mnrl_eval_dataset, + "cl": cl_eval_dataset, + }, + loss={ + "mnrl": mnrl_train_loss, + "cl": cl_train_loss, + }, ) -train_dataloader_MultipleNegativesRankingLoss = DataLoader( - train_dataset_MultipleNegativesRankingLoss, - shuffle=True, - batch_size=train_batch_size, -) -train_loss_MultipleNegativesRankingLoss = losses.MultipleNegativesRankingLoss(model) +trainer.train() +``` +## Pretrained Models -# Create data loader and loss for OnlineContrastiveLoss -train_dataset_ConstrativeLoss = SentencesDataset( - train_samples_ConstrativeLoss, model=model -) -train_dataloader_ConstrativeLoss = DataLoader( - train_dataset_ConstrativeLoss, shuffle=True, batch_size=train_batch_size -) -train_loss_ConstrativeLoss = losses.OnlineContrastiveLoss( - model=model, distance_metric=distance_metric, margin=margin -) +Currently the following models trained on Quora Duplicate Questions are available: +* [distilbert-base-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-quora-ranking): We extended the [distilbert-base-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens) model and trained it with *OnlineContrastiveLoss* and with *MultipleNegativesRankingLoss* on the Quora Duplicate questions dataset. 
For the code, see [training_multi-task-learning.py](training_multi-task-learning.py)
+* [distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking): Extension of *distilbert-base-nli-stsb-quora-ranking* to be multi-lingual. Trained on parallel data for 50 languages.
-# .....
-# Train the model
-model.fit(
-    train_objectives=[
-        (train_dataloader_MultipleNegativesRankingLoss, train_loss_MultipleNegativesRankingLoss),
-        (train_dataloader_ConstrativeLoss, train_loss_ConstrativeLoss),
-    ],
-    evaluator=seq_evaluator,
-    epochs=num_epochs,
-    warmup_steps=1000,
-    output_path=model_save_path,
-)
-```
+You can load & use pre-trained models like this:
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("distilbert-base-nli-stsb-quora-ranking")
+```
\ No newline at end of file
diff --git a/examples/training/sts/README.md b/examples/training/sts/README.md
index e95266da6..6c95adfc4 100644
--- a/examples/training/sts/README.md
+++ b/examples/training/sts/README.md
@@ -1,42 +1,61 @@
# Semantic Textual Similarity
-Semantic Textual Similarity (STS) assigns a score on the similarity of two texts. In this example, we use the [STSbenchmark](https://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) as training data to fine-tune our network. See the following example scripts how to tune SentenceTransformer on STS data:
+Semantic Textual Similarity (STS) assigns a score to the similarity of two texts. In this example, we use the [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) dataset as training data to fine-tune our model. See the following example scripts for how to tune SentenceTransformer on STS data:
-- **[training_stsbenchmark.py](training_stsbenchmark.py)** - This example shows how to create a SentenceTransformer model from scratch by using a pre-trained transformer model together with a pooling layer.
- - **[training_stsbenchmark_continue_training.py](training_stsbenchmark_continue_training.py)** - This example shows how to continue training on STS data for a previously created & trained SentenceTransformer model. In that example, we load a model trained on [NLI data](../nli/README.md).
-
+- **[training_stsbenchmark.py](training_stsbenchmark.py)** - This example shows how to create a SentenceTransformer model from scratch by using a pre-trained transformer model (e.g. [`distilbert-base-uncased`](https://huggingface.co/distilbert/distilbert-base-uncased)) together with a pooling layer.
+- **[training_stsbenchmark_continue_training.py](training_stsbenchmark_continue_training.py)** - This example shows how to continue training on STS data for a previously created & trained SentenceTransformer model (e.g. [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)).

## Training data
-In STS, we have sentence pairs annotated together with a score indicating the similarity. For the [STSbenchmark](https://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark), the scores ranges from 0 (the content of the two sentences are competely different) up to 5 (the two sentences are identical in terms of their meaning). To train our network, we need to normalize these scores to a range of 0-1. This can simply be done by dividing the score by 5.
+```eval_rst
+In STS, we have sentence pairs annotated together with a score indicating the similarity. In the original STSbenchmark dataset, the scores range from 0 to 5.
We have normalized these scores to range between 0 and 1 in `stsb <https://huggingface.co/datasets/sentence-transformers/stsb>`_, as that is required for :class:`~sentence_transformers.losses.CosineSimilarityLoss`, as you can see in the `Loss Overview <../../../docs/sentence_transformer/loss_overview.html>`_.
+```
-To store our training data, we create a list with `InputExample` objects. Each `InputExample` contains the sentence pair together with the label (score) that ranges between 0 - 1. A simplified version how the training data has to look like is the following:
+Here is a simplified version of our training data:

```python
-from sentence_transformers import (
-    SentenceTransformer,
-    SentencesDataset,
-    InputExample,
-    losses,
-)
-
-model = SentenceTransformer("nli-distilroberta-base-v2")
-train_examples = [
-    InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
-    InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
-]
-train_dataset = SentencesDataset(train_examples, model)
+from datasets import Dataset
+
+sentence1_list = ["My first sentence", "Another pair"]
+sentence2_list = ["My second sentence", "Unrelated sentence"]
+labels_list = [0.8, 0.3]
+train_dataset = Dataset.from_dict({
+    "sentence1": sentence1_list,
+    "sentence2": sentence2_list,
+    "label": labels_list,
+})
+# => Dataset({
+#     features: ['sentence1', 'sentence2', 'label'],
+#     num_rows: 2
+# })
+print(train_dataset[0])
+# => {'sentence1': 'My first sentence', 'sentence2': 'My second sentence', 'label': 0.8}
+print(train_dataset[1])
+# => {'sentence1': 'Another pair', 'sentence2': 'Unrelated sentence', 'label': 0.3}
```
-## Loss Function
-As loss function we use [CosineSimilarityLoss](../../../docs/package_reference/losses.html#cosinesimilarityloss).
+In the aforementioned scripts, we directly load the [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) dataset:
+```python
+from datasets import load_dataset
-*CosineSimilarityLoss* trains the network with a siamese network structure (for details see: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084))
+train_dataset = load_dataset("sentence-transformers/stsb", split="train")
+# => Dataset({
+#     features: ['sentence1', 'sentence2', 'score'],
+#     num_rows: 5749
+# })
+```
+## Loss Function
+```eval_rst
+We use :class:`~sentence_transformers.losses.CosineSimilarityLoss` as our loss function.
+```
-![SBERT Siamese Network Architecture](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_Siamese_Network.png "SBERT Siamese Architecture")
+<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_Siamese_Network.png" alt="SBERT Siamese Network Architecture"/>
+For each sentence pair, we pass sentence A and sentence B through the BERT-based model, which yields the embeddings *u* and *v*. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. Note that the two sentences are fed through the same model rather than two separate models. In particular, the cosine similarity for similar texts is maximized and the cosine similarity for dissimilar texts is minimized. This allows our model to be fine-tuned and to recognize the similarity of sentences.
-For each sentence pair, we pass sentence A and sentence B through our network which yields the embeddings *u* und *v*. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. This allows our network to be fine-tuned and to recognize the similarity of sentences.
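+
+Putting this together, a minimal training sketch with this loss and the `stsb` dataset (the base model is a placeholder, and `trainer.train()` runs with default training arguments):
+
+```python
+from datasets import load_dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+
+model = SentenceTransformer("distilroberta-base")  # placeholder base model
+
+# Columns: sentence1, sentence2, score; the score is already normalized to 0..1
+train_dataset = load_dataset("sentence-transformers/stsb", split="train")
+
+# Both sentences are embedded by the same model; the cosine similarity of the two
+# embeddings is regressed towards the gold "score" column
+train_loss = losses.CosineSimilarityLoss(model=model)
+
+trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=train_loss)
+trainer.train()
+```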
+For more details, see [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084). -This training in a siamese network structure is done automatically when we use CosineSimilarityLoss. +```eval_rst +:class:`~sentence_transformers.losses.CoSENTLoss` and :class:`~sentence_transformers.losses.AnglELoss` are more modern variants of :class:`~sentence_transformers.losses.CosineSimilarityLoss` that accept the same data format (a sentence pair with a similarity score ranging from 0.0 to 1.0). Informal experiments indicate that these two produce stronger models than :class:`~sentence_transformers.losses.CosineSimilarityLoss`. +``` \ No newline at end of file diff --git a/examples/training/sts/training_stsbenchmark.py b/examples/training/sts/training_stsbenchmark.py index ea194640e..9bdc1efe3 100644 --- a/examples/training/sts/training_stsbenchmark.py +++ b/examples/training/sts/training_stsbenchmark.py @@ -43,7 +43,7 @@ logging.info(train_dataset) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and one # similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) # train_loss = losses.CoSENTLoss(model=model) diff --git a/examples/training/sts/training_stsbenchmark_continue_training.py b/examples/training/sts/training_stsbenchmark_continue_training.py index c902306a1..ff4c70bdd 100644 --- a/examples/training/sts/training_stsbenchmark_continue_training.py +++ b/examples/training/sts/training_stsbenchmark_continue_training.py @@ -39,7 +39,7 @@ logging.info(train_dataset) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and one # similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) # train_loss = losses.CoSENTLoss(model=model) diff --git a/examples/unsupervised_learning/README.md b/examples/unsupervised_learning/README.md index ba123fba2..6ad9672fe 100644 --- a/examples/unsupervised_learning/README.md +++ b/examples/unsupervised_learning/README.md @@ -2,8 +2,10 @@ This page contains a collection of unsupervised learning methods to learn sentence embeddings. The methods have in common that they **do not require labeled training data**. Instead, they can learn semantically meaningful sentence embeddings just from the text itself. -**Note:** Unsupervised learning approaches are still an activate research area and in many cases the models perform rather poorly compared to models that are using training pairs as provided in our [training data collection](https://huggingface.co/datasets/sentence-transformers/embedding-training-data). A better approach is **[Domain Adaptation](../domain_adaptation/README.md)** where you combine unsupervised learning on your target domain with existent labeled data. This gives the best performance on your specific corpus. - +```eval_rst +.. note:: + Unsupervised learning approaches are still an active research area and in many cases the models perform rather poorly compared to models that use training pairs as provided in our `training data collection <https://huggingface.co/datasets/sentence-transformers/embedding-training-data>`_.
A better approach is `Domain Adaptation <../domain_adaptation/README.md>`_ where you combine unsupervised learning on your target domain with existing labeled data. This should give the best performance on your specific corpus. +``` ## TSDAE In our work [TSDAE (Transformer-based Denoising AutoEncoder)](https://arxiv.org/abs/2104.06979) we present an unsupervised sentence embedding learning method based on denoising auto-encoders: diff --git a/index.rst b/index.rst index 4d2936c2e..ef8b630b9 100644 --- a/index.rst +++ b/index.rst @@ -1,77 +1,74 @@ -SentenceTransformers Documentation -================================================= - -SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in our paper `Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks `_. - -You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for `semantic textual similarity `_, `semantic search `_, or `paraphrase mining `_. - -The framework is based on `PyTorch `_ and `Transformers `_ and offers a large collection of `pre-trained models `_ tuned for various tasks. Further, it is easy to `fine-tune your own models `_. +.. note:: + Sentence Transformers v3.0 was just released, introducing a new training API for Sentence Transformer models. Read `SentenceTransformer > Training Overview `_ to learn more about the training API, and check out `v3.0 Release Notes `_ for details on the other changes. -Installation -================================================= - -You can install it using pip: - -.. code-block:: python - - pip install -U sentence-transformers - +SentenceTransformers Documentation +================================== -We recommend **Python 3.8** or higher, and at least **PyTorch 1.11.0**. See `installation `_ for further installation options, especially if you want to use a GPU. +Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. +It can be used to compute embeddings using Sentence Transformer models (`quickstart `_) or to calculate similarity scores using Cross-Encoder models (`quickstart `_). This unlocks a wide range of applications, including `semantic search `_, `semantic textual similarity `_, and `paraphrase mining `_. +A wide selection of over `5,000 pre-trained Sentence Transformers models `_ are available for immediate use on 🤗 Hugging Face, including many of the state-of-the-art models from the `Massive Text Embeddings Benchmark (MTEB) leaderboard `_. Additionally, it is easy to `train or finetune your own models `_ using Sentence Transformers, enabling you to create custom models for your specific use cases. +Sentence Transformers was created by `UKPLab `_ and is being maintained by `🤗 Hugging Face `_. Don't hesitate to open an issue on the `Sentence Transformers repository `_ if something is broken or if you have further questions. Usage -================================================= -The usage is as simple as: - -.. code-block:: python - - from sentence_transformers import SentenceTransformer - model = SentenceTransformer("all-MiniLM-L6-v2") - - # Our sentences to encode - sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of string.", - "The quick brown fox jumps over the lazy dog."
- ] - - # Sentences are encoded by calling model.encode() - embeddings = model.encode(sentences) +===== +.. seealso:: + + See the `Quickstart `_ for quick information on how to use Sentence Transformers. - # Print the embeddings - for sentence, embedding in zip(sentences, embeddings): - print("Sentence:", sentence) - print("Embedding:", embedding) - print("") +Using Sentence Transformer models is straightforward: +.. sidebar:: Installation + You can install *sentence-transformers* using pip: + + .. code-block:: bash + + pip install -U sentence-transformers + + We recommend **Python 3.8+** and **PyTorch 1.11.0+**. See `installation `_ for further installation options. +.. code-block:: python -Performance -========================= - -Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is tuned to provide the highest possible speed. Have a look at `Pre-Trained Models `_ for an overview of available models and the respective performance on different tasks. - - - + from sentence_transformers import SentenceTransformer + # 1. Load a pretrained Sentence Transformer model + model = SentenceTransformer("all-MiniLM-L6-v2") + # The sentences to encode + sentences = [ + "The weather is lovely today.", + "It's so sunny outside!", + "He drove to the stadium.", + ] -Contact -========================= + # 2. Calculate embeddings by calling model.encode() + embeddings = model.encode(sentences) + print(embeddings.shape) + # (3, 384) -Contact person: Tom Aarsen, tom.aarsen@huggingface.co + # 3. Calculate the embedding similarities + similarities = model.similarity(embeddings, embeddings) + print(similarities) + # tensor([[1.0000, 0.6660, 0.1046], + # [0.6660, 1.0000, 0.1411], + # [0.1046, 0.1411, 1.0000]]) -Don't hesitate to open an issue on the `repository `_ if something is broken (and it shouldn't be) or if you have further questions. +What Next? +========== -*This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.* +Consider reading one of the following sections to answer the related questions: +* How to **use** Sentence Transformer models? `Sentence Transformers > Usage `_ +* What Sentence Transformer **models** can I use? `Sentence Transformers > Pretrained Models `_ +* How do I **train/finetune** a Sentence Transformer model? `Sentence Transformers > Training Overview `_ +* How to **use** Cross Encoder models? `Cross Encoder > Usage `_ +* What Cross Encoder **models** can I use? `Cross Encoder > Pretrained Models `_ -Citing & Authors -========================= +Citing +====== If you find this repository helpful, feel free to cite our publication `Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks `_: @@ -124,71 +121,41 @@ If you use the code for `data augmentation `_), - or `"flash_attention_2"` (using `Dao-AILab/flash-attention `_). - By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual `"eager"` - implementation. - - See the `PreTrainedModel.from_pretrained - `_ - documentation for more details. - :param tokenizer_kwargs: Additional tokenizer configuration parameters to be passed to the Huggingface Transformers tokenizer. - See the `AutoTokenizer.from_pretrained - `_ - documentation for more details. - :param config_kwargs: Additional model configuration parameters to be passed to the Huggingface Transformers config.
- See the `AutoConfig.from_pretrained - `_ - documentation for more details. - :param model_card_data: A model card data object that contains information about the model. This is used to generate - a model card when saving the model. If not set, a default model card data object is created. - - Example + Args: + model_name_or_path (str, optional): If it is a filepath on disc, it loads the model from that path. If it is not a path, + it first tries to download a pre-trained SentenceTransformer model. If that fails, tries to construct a model + from the Hugging Face Hub with that name. + modules (Iterable[nn.Module], optional): A list of torch Modules that should be called sequentially, can be used to create custom + SentenceTransformer models from scratch. + device (str, optional): Device (like "cuda", "cpu", "mps", "npu") that should be used for computation. If None, checks if a GPU + can be used. + prompts (Dict[str, str], optional): A dictionary with prompts for the model. The key is the prompt name, the value is the prompt text. + The prompt text will be prepended before any text to encode. For example: + `{"query": "query: ", "passage": "passage: "}` or `{"clustering": "Identify the main category based on the + titles in "}`. + default_prompt_name (str, optional): The name of the prompt that should be used by default. If not set, + no prompt will be applied. + similarity_fn_name (str or SimilarityFunction, optional): The name of the similarity function to use. Valid options are "cosine", "dot", + "euclidean", and "manhattan". If not set, it is automatically set to "cosine" if `similarity` or + `similarity_pairwise` are called while `model.similarity_fn_name` is still `None`. + cache_folder (str, optional): Path to store models. Can also be set by the SENTENCE_TRANSFORMERS_HOME environment variable. + trust_remote_code (bool, optional): Whether or not to allow for custom models defined on the Hub in their own modeling files. + This option should only be set to True for repositories you trust and in which you have read the code, as it + will execute code present on the Hub on your local machine. + revision (str, optional): The specific model version to use. It can be a branch name, a tag name, or a commit id, + for a stored model on Hugging Face. + local_files_only (bool, optional): If `True`, avoid downloading the model. + token (bool or str, optional): Hugging Face authentication token to download private models. + use_auth_token (bool or str, optional): Deprecated argument. Please use `token` instead. + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. `None` does no truncation. Truncation is + only applicable during inference when :meth:`SentenceTransformer.encode` is called. + model_kwargs (Dict[str, Any], optional): Additional model configuration parameters to be passed to the Huggingface Transformers model. + Particularly useful options are: + + - ``torch_dtype``: Override the default `torch.dtype` and load the model under a specific `dtype`. + The different options are: + + 1. ``torch.float16``, ``torch.bfloat16`` or ``torch.float``: load in a specified + ``dtype``, ignoring the model's ``config.torch_dtype`` if one exists. If not specified - the model will + get loaded in ``torch.float`` (fp32). + + 2. ``"auto"`` - A ``torch_dtype`` entry in the ``config.json`` file of the model will be + attempted to be used. 
If this entry isn't found, the ``dtype`` of the first weight in + the checkpoint that's of a floating point type is used instead. This will load the model + using the ``dtype`` it was saved in at the end of training; note that this cannot be used as an indicator of how + the model was trained, since a model may be trained in one of the half-precision dtypes but saved in fp32. + - ``attn_implementation``: The attention implementation to use in the model (if relevant). Can be any of + `"eager"` (manual implementation of the attention), `"sdpa"` (using `F.scaled_dot_product_attention + `_), + or `"flash_attention_2"` (using `Dao-AILab/flash-attention `_). + By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual `"eager"` + implementation. + + See the `PreTrainedModel.from_pretrained + `_ + documentation for more details. + tokenizer_kwargs (Dict[str, Any], optional): Additional tokenizer configuration parameters to be passed to the Huggingface Transformers tokenizer. + See the `AutoTokenizer.from_pretrained + `_ + documentation for more details. + config_kwargs (Dict[str, Any], optional): Additional model configuration parameters to be passed to the Huggingface Transformers config. + See the `AutoConfig.from_pretrained + `_ + documentation for more details. + model_card_data (:class:`~sentence_transformers.model_card.SentenceTransformerModelCardData`, optional): A model + card data object that contains information about the model. This is used to generate a model card when saving + the model. If not set, a default model card data object is created. + + Example: :: from sentence_transformers import SentenceTransformer @@ -364,34 +367,55 @@ def encode( """ Computes sentence embeddings. - :param sentences: the sentences to embed. - :param prompt_name: The name of the prompt to use for encoding. Must be a key in the `prompts` dictionary, - which is either set in the constructor or loaded from the model configuration. For example if - `prompt_name` is ``"query"`` and the `prompts` is ``{"query": "query: ", ...}``, then the sentence "What - is the capital of France?" will be encoded as "query: What is the capital of France?" because the sentence - is appended to the prompt. If `prompt` is also set, this argument is ignored. - :param prompt: The prompt to use for encoding. For example, if the prompt is ``"query: "``, then the - sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" - because the sentence is appended to the prompt. If `prompt` is set, `prompt_name` is ignored. - :param batch_size: the batch size used for the computation. - :param show_progress_bar: Whether to output a progress bar when encode sentences. - :param output_value: The type of embeddings to return: "sentence_embedding" to get sentence embeddings, - "token_embeddings" to get wordpiece token embeddings, and `None`, to get all output values. Defaults - to "sentence_embedding". - :param precision: The precision to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or - "ubinary". All non-float32 precisions are quantized embeddings. Quantized embeddings are smaller in - size and faster to compute, but may have a lower accuracy. They are useful for reducing the size - of the embeddings of a corpus for semantic search, among other tasks. Defaults to "float32". - :param convert_to_numpy: Whether the output should be a list of numpy vectors. If False, it is a list of PyTorch tensors.
- :param convert_to_tensor: Whether the output should be one large tensor. Overwrites `convert_to_numpy`. - :param device: Which `torch.device` to use for the computation. - :param normalize_embeddings: Whether to normalize returned vectors to have length 1. In that case, - the faster dot-product (util.dot_score) instead of cosine similarity can be used. - - :return: By default, a 2d numpy array with shape [num_inputs, output_dimension] is returned. If only one string - input is provided, then the output is a 1d array with shape [output_dimension]. If `convert_to_tensor`, a - torch Tensor is returned instead. If `self.truncate_dim <= output_dimension` then output_dimension is - `self.truncate_dim`. + Args: + sentences (Union[str, List[str]]): The sentences to embed. + prompt_name (Optional[str], optional): The name of the prompt to use for encoding. Must be a key in the `prompts` dictionary, + which is either set in the constructor or loaded from the model configuration. For example if + ``prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What + is the capital of France?" will be encoded as "query: What is the capital of France?" because the sentence + is appended to the prompt. If ``prompt`` is also set, this argument is ignored. Defaults to None. + prompt (Optional[str], optional): The prompt to use for encoding. For example, if the prompt is "query: ", then the + sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" + because the sentence is appended to the prompt. If ``prompt`` is set, ``prompt_name`` is ignored. Defaults to None. + batch_size (int, optional): The batch size used for the computation. Defaults to 32. + show_progress_bar (bool, optional): Whether to output a progress bar when encoding sentences. Defaults to None. + output_value (Optional[Literal["sentence_embedding", "token_embeddings"]], optional): The type of embeddings to return: + "sentence_embedding" to get sentence embeddings, "token_embeddings" to get wordpiece token embeddings, and `None`, + to get all output values. Defaults to "sentence_embedding". + precision (Literal["float32", "int8", "uint8", "binary", "ubinary"], optional): The precision to use for the embeddings. + Can be "float32", "int8", "uint8", "binary", or "ubinary". All non-float32 precisions are quantized embeddings. + Quantized embeddings are smaller in size and faster to compute, but may have a lower accuracy. They are useful for + reducing the size of the embeddings of a corpus for semantic search, among other tasks. Defaults to "float32". + convert_to_numpy (bool, optional): Whether the output should be a list of numpy vectors. If False, it is a list of PyTorch tensors. + Defaults to True. + convert_to_tensor (bool, optional): Whether the output should be one large tensor. Overwrites `convert_to_numpy`. + Defaults to False. + device (str, optional): Which :class:`torch.device` to use for the computation. Defaults to None. + normalize_embeddings (bool, optional): Whether to normalize returned vectors to have length 1. In that case, + the faster dot-product (util.dot_score) instead of cosine similarity can be used. Defaults to False. + + Returns: + Union[List[Tensor], ndarray, Tensor]: By default, a 2d numpy array with shape [num_inputs, output_dimension] is returned. + If only one string input is provided, then the output is a 1d array with shape [output_dimension]. If ``convert_to_tensor``, + a torch Tensor is returned instead.
If ``self.truncate_dim <= output_dimension`` then output_dimension is ``self.truncate_dim``. + + Example: + :: + + from sentence_transformers import SentenceTransformer + + # Load a pre-trained SentenceTransformer model + model = SentenceTransformer('all-mpnet-base-v2') + + # Encode some texts + sentences = [ + "The weather is lovely today.", + "It's so sunny outside!", + "He drove to the stadium.", + ] + embeddings = model.encode(sentences) + print(embeddings.shape) + # (3, 768) """ if self.device.type == "hpu" and not self.is_hpu_graph_enabled: import habana_frameworks.torch as ht @@ -551,7 +575,17 @@ def encode( @property def similarity_fn_name(self) -> Optional[str]: - """Return the name of the similarity function used by :meth:`SentenceTransformer.similarity` and :meth:`SentenceTransformer.similarity_pairwise`.""" + """Return the name of the similarity function used by :meth:`SentenceTransformer.similarity` and :meth:`SentenceTransformer.similarity_pairwise`. + + Returns: + Optional[str]: The name of the similarity function. Can be None if not set, in which case any uses of + :meth:`SentenceTransformer.similarity` and :meth:`SentenceTransformer.similarity_pairwise` default to "cosine". + + Example: + >>> model = SentenceTransformer("multi-qa-mpnet-base-dot-v1") + >>> model.similarity_fn_name + 'dot' + """ return self._similarity_fn_name @similarity_fn_name.setter @@ -577,7 +611,14 @@ def similarity(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray] scores between all embeddings from the first parameter and all embeddings from the second parameter. This differs from `similarity_pairwise` which computes the similarity between each pair of embeddings. - Example + Args: + embeddings1 (Union[Tensor, ndarray]): [num_embeddings_1, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + embeddings2 (Union[Tensor, ndarray]): [num_embeddings_2, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + + Returns: + Tensor: A [num_embeddings_1, num_embeddings_2]-shaped torch tensor with similarity scores. + + Example: :: >>> model = SentenceTransformer("all-mpnet-base-v2") @@ -601,10 +642,6 @@ def similarity(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray] [-0.7437, -0.0000, -1.3702, -1.3320], [-1.3935, -1.3702, -0.0000, -0.9973], [-1.3184, -1.3320, -0.9973, -0.0000]]) - - :param embeddings1: [num_embeddings_1, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. - :param embeddings2: [num_embeddings_2, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. - :return: A [num_embeddings_1, num_embeddings_2]-shaped torch tensor with similarity scores. """ if self.similarity_fn_name is None: self.similarity_fn_name = SimilarityFunction.COSINE @@ -622,7 +659,14 @@ def similarity_pairwise(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, Compute the similarity between two collections of embeddings. The output will be a vector with the similarity scores between each pair of embeddings. - Example + Args: + embeddings1 (Union[Tensor, ndarray]): [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + embeddings2 (Union[Tensor, ndarray]): [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + + Returns: + Tensor: A [num_embeddings]-shaped torch tensor with pairwise similarity scores. 
+ + Example: :: >>> model = SentenceTransformer("all-mpnet-base-v2") @@ -640,27 +684,28 @@ def similarity_pairwise(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, >>> model.similarity_fn_name = "euclidean" >>> model.similarity_pairwise(embeddings[::2], embeddings[1::2]) tensor([-0.7437, -0.9973]) - - :param embeddings1: [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. - :param embeddings2: [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. - :return: A [num_embeddings]-shaped torch tensor with pairwise similarity scores. """ if self.similarity_fn_name is None: self.similarity_fn_name = SimilarityFunction.COSINE return self._similarity_pairwise - def start_multi_process_pool(self, target_devices: List[str] = None): + def start_multi_process_pool(self, target_devices: List[str] = None) -> Dict[str, Any]: """ - Starts multi process to process the encoding with several, independent processes. + Starts a multi-process pool to process the encoding with several independent processes + via :meth:`SentenceTransformer.encode_multi_process `. + This method is recommended if you want to encode on multiple GPUs or CPUs. It is advised to start only one process per GPU. This method works together with encode_multi_process and stop_multi_process_pool. - :param target_devices: PyTorch target devices, e.g. ["cuda:0", "cuda:1", ...], ["npu:0", "npu:1", ...] or - ["cpu", "cpu", "cpu", "cpu"]. If target_devices is None and CUDA/NPU is available, then all available - CUDA/NPU devices will be used. If target_devices is None and CUDA/NPU is not available, then 4 CPU - devices will be used. - :return: Returns a dict with the target processes, an input queue and and output queue. + Args: + target_devices (List[str], optional): PyTorch target devices, e.g. ["cuda:0", "cuda:1", ...], + ["npu:0", "npu:1", ...], or ["cpu", "cpu", "cpu", "cpu"]. If target_devices is None and CUDA/NPU + is available, then all available CUDA/NPU devices will be used. If target_devices is None and + CUDA/NPU is not available, then 4 CPU devices will be used. + + Returns: + Dict[str, Any]: A dictionary with the target processes, an input queue, and an output queue. """ if target_devices is None: if torch.cuda.is_available(): @@ -694,7 +739,13 @@ def start_multi_process_pool(self, target_devices: List[str] = None): @staticmethod def stop_multi_process_pool(pool): """ - Stops all processes started with start_multi_process_pool + Stops all processes started with start_multi_process_pool. + + Args: + pool (Dict[str, object]): A dictionary containing the input queue, output queue, and process list. + + Returns: + None """ for p in pool["processes"]: p.terminate() @@ -716,31 +767,56 @@ def encode_multi_process( chunk_size: int = None, precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32", normalize_embeddings: bool = False, - ): + ) -> np.ndarray: """ - This method allows to run encode() on multiple GPUs. The sentences are chunked into smaller packages - and sent to individual processes, which encode these on the different GPUs. This method is only suitable - for encoding large sets of sentences - - :param sentences: List of sentences - :param pool: A pool of workers started with SentenceTransformer.start_multi_process_pool - :param prompt_name: The name of the prompt to use for encoding. Must be a key in the `prompts` dictionary, - which is either set in the constructor or loaded from the model configuration. 
For example if - `prompt_name` is ``"query"`` and the `prompts` is ``{"query": "query: {}", ...}``, then the sentence "What - is the capital of France?" will be encoded as "query: What is the capital of France?". If `prompt` is - also set, this argument is ignored. - :param prompt: The prompt to use for encoding. For example, if the prompt is ``"query: {}"``, then the - sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?". - If `prompt` is set, `prompt_name` is ignored. - :param batch_size: Encode sentences with batch size - :param chunk_size: Sentences are chunked and sent to the individual processes. If none, it determine a sensible size. - :param precision: The precision to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or - "ubinary". All non-float32 precisions are quantized embeddings. Quantized embeddings are smaller in - size and faster to compute, but may have a lower accuracy. They are useful for reducing the size - of the embeddings of a corpus for semantic search, among other tasks. Defaults to "float32". - :param normalize_embeddings: Whether to normalize returned vectors to have length 1. In that case, - the faster dot-product (util.dot_score) instead of cosine similarity can be used. - :return: 2d numpy array with shape [num_inputs, output_dimension] + Encodes a list of sentences using multiple processes and GPUs via + :meth:`SentenceTransformer.encode `. + The sentences are chunked into smaller packages and sent to individual processes, which encode them on different + GPUs or CPUs. This method is only suitable for encoding large sets of sentences. + + Args: + sentences (List[str]): List of sentences to encode. + pool (Dict[str, object]): A pool of workers started with SentenceTransformer.start_multi_process_pool. + prompt_name (Optional[str], optional): The name of the prompt to use for encoding. Must be a key in the `prompts` dictionary, + which is either set in the constructor or loaded from the model configuration. For example if + ``prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What + is the capital of France?" will be encoded as "query: What is the capital of France?" because the sentence + is appended to the prompt. If ``prompt`` is also set, this argument is ignored. Defaults to None. + prompt (Optional[str], optional): The prompt to use for encoding. For example, if the prompt is "query: ", then the + sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" + because the sentence is appended to the prompt. If ``prompt`` is set, ``prompt_name`` is ignored. Defaults to None. + batch_size (int): Encode sentences with batch size. (default: 32) + chunk_size (int): Sentences are chunked and sent to the individual processes. If None, it determines a + sensible size. Defaults to None. + precision (Literal["float32", "int8", "uint8", "binary", "ubinary"]): The precision to use for the + embeddings. Can be "float32", "int8", "uint8", "binary", or "ubinary". All non-float32 precisions + are quantized embeddings. Quantized embeddings are smaller in size and faster to compute, but may + have lower accuracy. They are useful for reducing the size of the embeddings of a corpus for + semantic search, among other tasks. Defaults to "float32". + normalize_embeddings (bool): Whether to normalize returned vectors to have length 1. In that case, + the faster dot-product (util.dot_score) instead of cosine similarity can be used. 
Defaults to False. + + Returns: + np.ndarray: A 2D numpy array with shape [num_inputs, output_dimension]. + + Example: + :: + + from sentence_transformers import SentenceTransformer + + def main(): + model = SentenceTransformer("all-mpnet-base-v2") + sentences = ["The weather is so nice!", "It's so sunny outside.", "He's driving to the movie theater.", "She's going to the cinema."] * 1000 + + pool = model.start_multi_process_pool() + embeddings = model.encode_multi_process(sentences, pool) + model.stop_multi_process_pool(pool) + + print(embeddings.shape) + # => (4000, 768) + + if __name__ == "__main__": + main() """ if chunk_size is None: @@ -800,25 +876,42 @@ def set_pooling_include_prompt(self, include_prompt: bool) -> None: """ Sets the `include_prompt` attribute in the pooling layer in the model, if there is one. - :param include_prompt: Whether to include the prompt in the pooling layer. + This is useful for INSTRUCTOR models, as the prompt should be excluded from the pooling strategy + for these models. + + Args: + include_prompt (bool): Whether to include the prompt in the pooling layer. + + Returns: + None """ for module in self: if isinstance(module, Pooling): module.include_prompt = include_prompt break - def get_max_seq_length(self): + def get_max_seq_length(self) -> Optional[int]: """ - Returns the maximal sequence length for input the model accepts. Longer inputs will be truncated + Returns the maximal sequence length that the model accepts. Longer inputs will be truncated. + + Returns: + Optional[int]: The maximal sequence length that the model accepts, or None if it is not defined. """ if hasattr(self._first_module(), "max_seq_length"): return self._first_module().max_seq_length return None - def tokenize(self, texts: Union[List[str], List[Dict], List[Tuple[str, str]]]): + def tokenize(self, texts: Union[List[str], List[Dict], List[Tuple[str, str]]]) -> Dict[str, Tensor]: """ - Tokenizes the texts + Tokenizes the texts. + + Args: + texts (Union[List[str], List[Dict], List[Tuple[str, str]]]): A list of texts to be tokenized. + + Returns: + Dict[str, Tensor]: A dictionary of tensors with the tokenized texts. Common keys are "input_ids", + "attention_mask", and "token_type_ids". """ return self._first_module().tokenize(texts) @@ -827,7 +920,10 @@ def get_sentence_features(self, *features): def get_sentence_embedding_dimension(self) -> Optional[int]: """ - :return: The number of dimensions in the output of `encode`. If it's not known, it's `None`. + Returns the number of dimensions in the output of :meth:`SentenceTransformer.encode `. + + Returns: + Optional[int]: The number of dimensions in the output of `encode`. If it's not known, it's `None`. """ output_dim = None for mod in reversed(self._modules.values()): @@ -844,23 +940,25 @@ def get_sentence_embedding_dimension(self) -> Optional[int]: @contextmanager def truncate_sentence_embeddings(self, truncate_dim: Optional[int]): """ - In this context, `model.encode` outputs sentence embeddings truncated at dimension `truncate_dim`. + In this context, :meth:`SentenceTransformer.encode ` outputs + sentence embeddings truncated at dimension ``truncate_dim``. This may be useful when you are using the same model for different applications where different dimensions are needed. - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` does no truncation. + Args: + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. ``None`` does no truncation. 
- Example:: - - from sentence_transformers import SentenceTransformer + Example: + :: - model = SentenceTransformer("model-name") + from sentence_transformers import SentenceTransformer - with model.truncate_sentence_embeddings(truncate_dim=16): - embeddings_truncated = model.encode(["hello there", "hiya"]) - assert embeddings_truncated.shape[-1] == 16 + model = SentenceTransformer("all-mpnet-base-v2") + with model.truncate_sentence_embeddings(truncate_dim=16): + embeddings_truncated = model.encode(["hello there", "hiya"]) + assert embeddings_truncated.shape[-1] == 16 """ original_output_dim = self.truncate_dim try: @@ -887,13 +985,15 @@ def save( ): """ Saves a model and its configuration files to a directory, so that it can be loaded - with `SentenceTransformer(path)` again. - - :param path: Path on disc - :param model_name: Optional model name - :param create_model_card: If True, create a README.md with basic information about this model - :param train_datasets: Optional list with the names of the datasets used to to train the model - :param safe_serialization: If true, save the model using safetensors. If false, save the model the traditional PyTorch way + with ``SentenceTransformer(path)`` again. + + Args: + path (str): Path on disc where the model will be saved. + model_name (str, optional): Optional model name. + create_model_card (bool, optional): If True, create a README.md with basic information about this model. + train_datasets (List[str], optional): Optional list with the names of the datasets used to train the model. + safe_serialization (bool, optional): If True, save the model using safetensors. If False, save the model + the traditional (but unsafe) PyTorch way. """ if path is None: return @@ -953,13 +1053,15 @@ def save_pretrained( ): """ Saves a model and its configuration files to a directory, so that it can be loaded - with `SentenceTransformer(path)` again. Alias of `SentenceTransformer.save`. - - :param path: Path on disc - :param model_name: Optional model name - :param create_model_card: If True, create a README.md with basic information about this model - :param train_datasets: Optional list with the names of the datasets used to to train the model - :param safe_serialization: If true, save the model using safetensors. If false, save the model the traditional PyTorch way + with ``SentenceTransformer(path)`` again. + + Args: + path (str): Path on disc where the model will be saved. + model_name (str, optional): Optional model name. + create_model_card (bool, optional): If True, create a README.md with basic information about this model. + train_datasets (List[str], optional): Optional list with the names of the datasets used to train the model. + safe_serialization (bool, optional): If True, save the model using safetensors. If False, save the model + the traditional (but unsafe) PyTorch way. """ self.save( path, @@ -973,8 +1075,16 @@ def _create_model_card( self, path: str, model_name: Optional[str] = None, train_datasets: Optional[List[str]] = "deprecated" ): """ - Create an automatic model and stores it in path. If no training was done, and the loaded model was - a Sentence Transformer model already, then its model card is reused. + Creates an automatic model card and stores it in the specified path. If no training was done and the loaded model + was a Sentence Transformer model already, then its model card is reused. + + Args: + path (str): The path where the model card will be stored. + model_name (Optional[str], optional): The name of the model. Defaults to None.
+ train_datasets (Optional[List[str]], optional): Deprecated argument. Defaults to "deprecated". + + Returns: + None """ if model_name: model_path = Path(model_name) @@ -1018,18 +1128,19 @@ def save_to_hub( Uploads all elements of this Sentence Transformer to a new HuggingFace Hub repository. - :param repo_id: Repository name for your model in the Hub, including the user or organization. - :param token: An authentication token (See https://huggingface.co/settings/token) - :param private: Set to true, for hosting a private model - :param safe_serialization: If true, save the model using safetensors. If false, save the model the traditional PyTorch way - :param commit_message: Message to commit while pushing. - :param local_model_path: Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded - :param exist_ok: If true, saving to an existing repository is OK. If false, saving only to a new repository is possible - :param replace_model_card: If true, replace an existing model card in the hub with the automatically created model card - :param train_datasets: Datasets used to train the model. If set, the datasets will be added to the model card in the Hub. - :param organization: Deprecated. Organization in which you want to push your model or tokenizer (you must be a member of this organization). - - :return: The url of the commit of your model in the repository on the Hugging Face Hub. + Args: + repo_id (str): Repository name for your model in the Hub, including the user or organization. + token (str, optional): An authentication token (See https://huggingface.co/settings/token). + private (bool, optional): Set to true to host a private model. + safe_serialization (bool, optional): If true, save the model using safetensors. If false, save the model the traditional PyTorch way. + commit_message (str, optional): Message to commit while pushing. + local_model_path (str, optional): Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded. + exist_ok (bool, optional): If true, saving to an existing repository is OK. If false, saving only to a new repository is possible. + replace_model_card (bool, optional): If true, replace an existing model card in the hub with the automatically created model card. + train_datasets (List[str], optional): Datasets used to train the model. If set, the datasets will be added to the model card in the Hub. + + Returns: + str: The url of the commit of your model in the repository on the Hugging Face Hub. """ logger.warning( "The `save_to_hub` method is deprecated and will be removed in a future version of SentenceTransformers." @@ -1078,17 +1189,19 @@ def push_to_hub( """ Uploads all elements of this Sentence Transformer to a new HuggingFace Hub repository. - :param repo_id: Repository name for your model in the Hub, including the user or organization. - :param token: An authentication token (See https://huggingface.co/settings/token) - :param private: Set to true, for hosting a private model - :param safe_serialization: If true, save the model using safetensors. If false, save the model the traditional PyTorch way - :param commit_message: Message to commit while pushing. - :param local_model_path: Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded - :param exist_ok: If true, saving to an existing repository is OK.
If false, saving only to a new repository is possible - :param replace_model_card: If true, replace an existing model card in the hub with the automatically created model card - :param train_datasets: Datasets used to train the model. If set, the datasets will be added to the model card in the Hub. - - :return: The url of the commit of your model in the repository on the Hugging Face Hub. + Args: + repo_id (str): Repository name for your model in the Hub, including the user or organization. + token (str, optional): An authentication token (See https://huggingface.co/settings/token). + private (bool, optional): Set to true to host a private model. + safe_serialization (bool, optional): If true, save the model using safetensors. If false, save the model the traditional PyTorch way. + commit_message (str, optional): Message to commit while pushing. + local_model_path (str, optional): Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded. + exist_ok (bool, optional): If true, saving to an existing repository is OK. If false, saving only to a new repository is possible. + replace_model_card (bool, optional): If true, replace an existing model card in the hub with the automatically created model card. + train_datasets (List[str], optional): Datasets used to train the model. If set, the datasets will be added to the model card in the Hub. + + Returns: + str: The url of the commit of your model in the repository on the Hugging Face Hub. """ api = HfApi(token=token) repo_url = api.create_repo( @@ -1140,12 +1253,14 @@ def _text_length(self, text: Union[List[int], List[List[int]]]): def evaluate(self, evaluator: SentenceEvaluator, output_path: str = None): """ - Evaluate the model + Evaluate the model using the given evaluator. + + Args: + evaluator (SentenceEvaluator): The evaluator used to evaluate the model. + output_path (str, optional): The path where the evaluator can write the results. Defaults to None. - :param evaluator: - the evaluator - :param output_path: - the evaluator can write the results to this path + Returns: + The evaluation results. """ if output_path is not None: os.makedirs(output_path, exist_ok=True) @@ -1162,14 +1277,26 @@ def _load_auto_model( model_kwargs: Optional[Dict[str, Any]] = None, tokenizer_kwargs: Optional[Dict[str, Any]] = None, config_kwargs: Optional[Dict[str, Any]] = None, - ): + ) -> List[nn.Module]: """ Creates a simple Transformer + Mean Pooling model and returns the modules + + Args: + model_name_or_path (str): The name or path of the pre-trained model. + token (Optional[Union[bool, str]]): The token to use for the model. + cache_folder (Optional[str]): The folder to cache the model. + revision (Optional[str], optional): The revision of the model. Defaults to None. + trust_remote_code (bool, optional): Whether to trust remote code. Defaults to False. + local_files_only (bool, optional): Whether to use only local files. Defaults to False. + model_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the model. Defaults to None. + tokenizer_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the tokenizer. Defaults to None. + config_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the config. Defaults to None. + + Returns: + List[nn.Module]: A list containing the transformer model and the pooling model. """ logger.warning( - "No sentence-transformers model found with name {}.
Creating a new one with mean pooling.".format( - model_name_or_path - ) + f"No sentence-transformers model found with name {model_name_or_path}. Creating a new one with mean pooling." ) shared_kwargs = { @@ -1204,9 +1331,23 @@ def _load_sbert_model( model_kwargs: Optional[Dict[str, Any]] = None, tokenizer_kwargs: Optional[Dict[str, Any]] = None, config_kwargs: Optional[Dict[str, Any]] = None, - ): + ) -> Dict[str, nn.Module]: """ - Loads a full sentence-transformers model + Loads a full SentenceTransformer model using the modules.json file. + + Args: + model_name_or_path (str): The name or path of the pre-trained model. + token (Optional[Union[bool, str]]): The token to use for the model. + cache_folder (Optional[str]): The folder to cache the model. + revision (Optional[str], optional): The revision of the model. Defaults to None. + trust_remote_code (bool, optional): Whether to trust remote code. Defaults to False. + local_files_only (bool, optional): Whether to use only local files. Defaults to False. + model_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the model. Defaults to None. + tokenizer_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the tokenizer. Defaults to None. + config_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the config. Defaults to None. + + Returns: + OrderedDict[str, nn.Module]: An ordered dictionary containing the modules of the model. """ # Check if the config_sentence_transformers.json file exists (exists since v2 of the framework) config_sentence_transformers_json_path = load_file_path( @@ -1398,9 +1539,21 @@ def tokenizer(self, value): self._first_module().tokenizer = value @property - def max_seq_length(self): + def max_seq_length(self) -> int: """ - Property to get the maximal input sequence length for the model. Longer inputs will be truncated. + Returns the maximal input sequence length for the model. Longer inputs will be truncated. + + Returns: + int: The maximal input sequence length. + + Example: + :: + + from sentence_transformers import SentenceTransformer + + model = SentenceTransformer("all-mpnet-base-v2") + print(model.max_seq_length) + # => 384 """ return self._first_module().max_seq_length @@ -1414,7 +1567,7 @@ def max_seq_length(self, value): @property def _target_device(self) -> torch.device: logger.warning( - "`SentenceTransformer._target_device` has been removed, please use `SentenceTransformer.device` instead.", + "`SentenceTransformer._target_device` has been deprecated, please use `SentenceTransformer.device` instead.", ) return self.device diff --git a/sentence_transformers/cross_encoder/CrossEncoder.py b/sentence_transformers/cross_encoder/CrossEncoder.py index 7e99fc286..cc205c317 100644 --- a/sentence_transformers/cross_encoder/CrossEncoder.py +++ b/sentence_transformers/cross_encoder/CrossEncoder.py @@ -4,7 +4,7 @@ import numpy as np import logging import os -from typing import Dict, Type, Callable, List, Optional +from typing import Dict, Type, Callable, List, Optional, Union import torch from torch import nn from torch.optim import Optimizer @@ -29,26 +29,28 @@ class CrossEncoder(PushToHubMixin): It does not yield a sentence embedding and does not work for individual sentences. - :param model_name: A model name from Hugging Face Hub that can be loaded with AutoModel, or a path to a local - model. We provide several pre-trained CrossEncoder models that can be used for common tasks. - :param num_labels: Number of labels of the classifier. 
If 1, the CrossEncoder is a regression model that - outputs a continuous score 0...1. If > 1, it output several scores that can be soft-maxed to get - probability scores for the different classes. - :param max_length: Max length for input sequences. Longer sequences will be truncated. If None, max - length of the model will be used - :param device: Device that should be used for the model. If None, it will use CUDA if available. - :param tokenizer_args: Arguments passed to AutoTokenizer - :param automodel_args: Arguments passed to AutoModelForSequenceClassification - :param trust_remote_code: Whether or not to allow for custom models defined on the Hub in their own modeling files. - This option should only be set to True for repositories you trust and in which you have read the code, as it - will execute code present on the Hub on your local machine. - :param revision: The specific model version to use. It can be a branch name, a tag name, or a commit id, - for a stored model on Hugging Face. - :param local_files_only: If `True`, avoid downloading the model. - :param default_activation_function: Callable (like nn.Sigmoid) about the default activation function that - should be used on-top of model.predict(). If None. nn.Sigmoid() will be used if num_labels=1, - else nn.Identity() - :param classifier_dropout: The dropout ratio for the classification head. + Args: + model_name (str): A model name from Hugging Face Hub that can be loaded with AutoModel, or a path to a local + model. We provide several pre-trained CrossEncoder models that can be used for common tasks. + num_labels (int, optional): Number of labels of the classifier. If 1, the CrossEncoder is a regression model that + outputs a continuous score 0...1. If > 1, it output several scores that can be soft-maxed to get + probability scores for the different classes. Defaults to None. + max_length (int, optional): Max length for input sequences. Longer sequences will be truncated. If None, max + length of the model will be used. Defaults to None. + device (str, optional): Device that should be used for the model. If None, it will use CUDA if available. + Defaults to None. + tokenizer_args (Dict, optional): Arguments passed to AutoTokenizer. Defaults to None. + automodel_args (Dict, optional): Arguments passed to AutoModelForSequenceClassification. Defaults to None. + trust_remote_code (bool, optional): Whether or not to allow for custom models defined on the Hub in their own modeling files. + This option should only be set to True for repositories you trust and in which you have read the code, as it + will execute code present on the Hub on your local machine. Defaults to False. + revision (Optional[str], optional): The specific model version to use. It can be a branch name, a tag name, or a commit id, + for a stored model on Hugging Face. Defaults to None. + local_files_only (bool, optional): If `True`, avoid downloading the model. Defaults to False. + default_activation_function (Callable, optional): Callable (like nn.Sigmoid) about the default activation function that + should be used on-top of model.predict(). If None. nn.Sigmoid() will be used if num_labels=1, + else nn.Identity(). Defaults to None. + classifier_dropout (float, optional): The dropout ratio for the classification head. Defaults to None. 
""" def __init__( @@ -57,14 +59,18 @@ def __init__( num_labels: int = None, max_length: int = None, device: str = None, - tokenizer_args: Dict = {}, - automodel_args: Dict = {}, + tokenizer_args: Dict = None, + automodel_args: Dict = None, trust_remote_code: bool = False, revision: Optional[str] = None, local_files_only: bool = False, default_activation_function=None, classifier_dropout: float = None, ): + if tokenizer_args is None: + tokenizer_args = {} + if automodel_args is None: + automodel_args = {} self.config = AutoConfig.from_pretrained( model_name, trust_remote_code=trust_remote_code, revision=revision, local_files_only=local_files_only ) @@ -187,25 +193,26 @@ def fit( We sample only as many batches from each objective as there are in the smallest one to make sure of equal training with each dataset. - :param train_dataloader: DataLoader with training InputExamples - :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. - :param epochs: Number of epochs for training - :param loss_fct: Which loss function to use for training. If None, will use nn.BCEWithLogitsLoss() if self.config.num_labels == 1 else nn.CrossEntropyLoss() - :param activation_fct: Activation function applied on top of logits output of model. - :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. - :param optimizer_class: Optimizer - :param optimizer_params: Optimizer parameters - :param weight_decay: Weight decay for model parameters - :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps - :param output_path: Storage path for the model and evaluation files - :param save_best_model: If true, the best model (according to evaluator) is stored at output_path - :param max_grad_norm: Used for gradient normalization. - :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0 - :param callback: Callback function that is invoked after each evaluation. + Args: + train_dataloader (DataLoader): DataLoader with training InputExamples + evaluator (SentenceEvaluator, optional): An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. Defaults to None. + epochs (int, optional): Number of epochs for training. Defaults to 1. + loss_fct: Which loss function to use for training. If None, will use nn.BCEWithLogitsLoss() if self.config.num_labels == 1 else nn.CrossEntropyLoss(). Defaults to None. + activation_fct: Activation function applied on top of logits output of model. + scheduler (str, optional): Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts. Defaults to "WarmupLinear". + warmup_steps (int, optional): Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. Defaults to 10000. 
+ optimizer_class (Type[Optimizer], optional): Optimizer. Defaults to torch.optim.AdamW. + optimizer_params (Dict[str, object], optional): Optimizer parameters. Defaults to {"lr": 2e-5}. + weight_decay (float, optional): Weight decay for model parameters. Defaults to 0.01. + evaluation_steps (int, optional): If > 0, evaluate the model using evaluator after each number of training steps. Defaults to 0. + output_path (str, optional): Storage path for the model and evaluation files. Defaults to None. + save_best_model (bool, optional): If true, the best model (according to evaluator) is stored at output_path. Defaults to True. + max_grad_norm (float, optional): Used for gradient normalization. Defaults to 1. + use_amp (bool, optional): Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0. Defaults to False. + callback (Callable[[float, int, int], None], optional): Callback function that is invoked after each evaluation. It must accept the following three parameters in this order: - `score`, `epoch`, `steps` - :param show_progress_bar: If True, output a tqdm progress bar + `score`, `epoch`, `steps`. Defaults to None. + show_progress_bar (bool, optional): If True, output a tqdm progress bar. Defaults to True. """ train_dataloader.collate_fn = self.smart_batching_collate @@ -307,19 +314,38 @@ def predict( apply_softmax=False, convert_to_numpy: bool = True, convert_to_tensor: bool = False, - ): + ) -> Union[List[float], np.ndarray, torch.Tensor]: """ - Performs predicts with the CrossEncoder on the given sentence pairs. - - :param sentences: A list of sentence pairs [[Sent1, Sent2], [Sent3, Sent4]] - :param batch_size: Batch size for encoding - :param show_progress_bar: Output progress bar - :param num_workers: Number of workers for tokenization - :param activation_fct: Activation function applied on the logits output of the CrossEncoder. If None, nn.Sigmoid() will be used if num_labels=1, else nn.Identity - :param convert_to_numpy: Convert the output to a numpy matrix. - :param apply_softmax: If there are more than 2 dimensions and apply_softmax=True, applies softmax on the logits output - :param convert_to_tensor: Convert the output to a tensor. - :return: Predictions for the passed sentence pairs + Performs predictions with the CrossEncoder on the given sentence pairs. + + Args: + sentences (List[List[str]]): A list of sentence pairs [[Sent1, Sent2], [Sent3, Sent4]] + batch_size (int, optional): Batch size for encoding. Defaults to 32. + show_progress_bar (bool, optional): Output progress bar. Defaults to None. + num_workers (int, optional): Number of workers for tokenization. Defaults to 0. + activation_fct (callable, optional): Activation function applied on the logits output of the CrossEncoder. + If None, nn.Sigmoid() will be used if num_labels=1, else nn.Identity. Defaults to None. + convert_to_numpy (bool, optional): Convert the output to a numpy matrix. Defaults to True. + apply_softmax (bool, optional): If there are more than 2 dimensions and apply_softmax=True, + applies softmax on the logits output. Defaults to False. + convert_to_tensor (bool, optional): Convert the output to a tensor. Defaults to False. + + Returns: + Union[List[float], np.ndarray, torch.Tensor]: Predictions for the passed sentence pairs. + The return type depends on the `convert_to_numpy` and `convert_to_tensor` parameters. + If `convert_to_tensor` is True, the output will be a torch.Tensor. + If `convert_to_numpy` is True, the output will be a numpy.ndarray. 
+ Otherwise, the output will be a list of float values. + + Examples: + :: + + from sentence_transformers import CrossEncoder + + model = CrossEncoder("cross-encoder/stsb-roberta-base") + sentences = [["I love cats", "Cats are amazing"], ["I prefer dogs", "Dogs are loyal"]] + model.predict(sentences) + # => array([0.6912767, 0.4303499], dtype=float32) """ input_was_string = False if isinstance(sentences[0], str): # Cast an individual sentence to a list with length 1 @@ -388,6 +414,22 @@ def rank( """ Performs ranking with the CrossEncoder on the given query and documents. Returns a sorted list with the document indices and scores. + Args: + query (str): A single query. + documents (List[str]): A list of documents. + top_k (Optional[int], optional): Return the top-k documents. If None, all documents are returned. Defaults to None. + return_documents (bool, optional): If True, also returns the documents. If False, only returns the indices and scores. Defaults to False. + batch_size (int, optional): Batch size for encoding. Defaults to 32. + show_progress_bar (bool, optional): Output progress bar. Defaults to None. + num_workers (int, optional): Number of workers for tokenization. Defaults to 0. + activation_fct (Callable, optional): Activation function applied on the logits output of the CrossEncoder. If None, nn.Sigmoid() will be used if num_labels=1, else nn.Identity. Defaults to None. + convert_to_numpy (bool, optional): Convert the output to a numpy matrix. Defaults to True. + apply_softmax (bool, optional): If there are more than 2 dimensions and apply_softmax=True, applies softmax on the logits output. Defaults to False. + convert_to_tensor (bool, optional): Convert the output to a tensor. Defaults to False. + + Returns: + List[Dict]: A sorted list with the document indices and scores, and optionally also documents. + Example: :: @@ -423,19 +465,6 @@ def rank( {'corpus_id': 4, 'score': -5.082967, 'text': "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era."}] - - :param query: A single query - :param documents: A list of documents - :param top_k: Return the top-k documents. If None, all documents are returned. - :param return_documents: If True, also returns the documents. If False, only returns the indices and scores. - :param batch_size: Batch size for encoding - :param show_progress_bar: Output progress bar - :param num_workers: Number of workers for tokenization - :param activation_fct: Activation function applied on the logits output of the CrossEncoder. If None, nn.Sigmoid() will be used if num_labels=1, else nn.Identity - :param convert_to_numpy: Convert the output to a numpy matrix. - :param apply_softmax: If there are more than 2 dimensions and apply_softmax=True, applies softmax on the logits output - :param convert_to_tensor: Convert the output to a tensor. - :return: A sorted list with the document indices and scores, and optionally also documents.
""" query_doc_pairs = [[query, doc] for doc in documents] scores = self.predict( diff --git a/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py b/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py index 21ad057c2..f9bf25b25 100644 --- a/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py +++ b/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py @@ -4,8 +4,9 @@ from typing import List import numpy as np + +from sentence_transformers.readers.InputExample import InputExample from .. import CrossEncoder -from ... import InputExample from sklearn.metrics import f1_score logger = logging.getLogger(__name__) @@ -19,18 +20,13 @@ class CEF1Evaluator: binary tasks the returned metric is binary F1 score. For the multiclass tasks the returned metric is macro F1 score. - :param sentence_pairs: A list of sentence pairs, where each pair is a list of two strings. - :type sentence_pairs: list[list[str]] - :param labels: A list of integer labels corresponding to each sentence pair. - :type labels: list[int] - :param batch_size: Batch size for prediction. Defaults to 32. - :type batch_size: int - :param show_progress_bar: Show tqdm progress bar. - :type show_progress_bar: bool - :param name: An optional name for the CSV file with stored results. Defaults to an empty string. - :type name: str, optional - :param write_csv: Flag to determine if the data should be saved to a CSV file. Defaults to True. - :type write_csv: bool, optional + Args: + sentence_pairs (List[List[str]]): A list of sentence pairs, where each pair is a list of two strings. + labels (List[int]): A list of integer labels corresponding to each sentence pair. + batch_size (int, optional): Batch size for prediction. Defaults to 32. + show_progress_bar (bool, optional): Show tqdm progress bar. + name (str, optional): An optional name for the CSV file with stored results. Defaults to an empty string. + write_csv (bool, optional): Flag to determine if the data should be saved to a CSV file. Defaults to True. """ def __init__( @@ -42,7 +38,7 @@ def __init__( show_progress_bar: bool = False, name: str = "", write_csv: bool = True, - ): + ) -> None: self.sentence_pairs = sentence_pairs self.labels = labels self.batch_size = batch_size @@ -72,6 +68,16 @@ def __init__( @classmethod def from_input_examples(cls, examples: List[InputExample], **kwargs): + """ + Create an instance of CEF1Evaluator from a list of InputExample objects. + + Args: + examples (List[InputExample]): A list of InputExample objects. + **kwargs: Additional keyword arguments to pass to the CEF1Evaluator constructor. + + Returns: + CEF1Evaluator: An instance of CEF1Evaluator. + """ sentence_pairs = [] labels = [] @@ -81,13 +87,19 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): return cls(sentence_pairs, labels, **kwargs) - def __call__( - self, - model: CrossEncoder, - output_path: str = None, - epoch: int = -1, - steps: int = -1, - ) -> float: + def __call__(self, model: CrossEncoder, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + """ + Evaluate the model using the CEF1Evaluator. + + Args: + model (CrossEncoder): The cross-encoder model to evaluate. + output_path (str, optional): The path to save the evaluation results. Defaults to None. + epoch (int, optional): The epoch number. Defaults to -1. + steps (int, optional): The number of steps. Defaults to -1. + + Returns: + float: The F1 score. 
+ """ if epoch != -1: if steps == -1: out_txt = f"after epoch {epoch}:" diff --git a/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py index 5d4813877..fa6160ec4 100644 --- a/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py +++ b/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py @@ -15,8 +15,10 @@ class CERerankingEvaluator: Given a query and a list of documents, it computes the score [query, doc_i] for all possible documents and sorts them in decreasing order. Then, MRR@10 and NDCG@10 are computed to measure the quality of the ranking. - :param samples: Must be a list and each element is of the form: {'query': '', 'positive': [], 'negative': []}. Query is the search query, - positive is a list of positive (relevant) documents, negative is a list of negative (irrelevant) documents. + Args: + samples (List[Dict, str, Union[str, List[str]]): Must be a list and each element is of the form: + {'query': '', 'positive': [], 'negative': []}. Query is the search query, positive is a list + of positive (relevant) documents, negative is a list of negative (irrelevant) documents. """ def __init__( diff --git a/sentence_transformers/data_collator.py b/sentence_transformers/data_collator.py index bd4d5ff27..afdd86bfb 100644 --- a/sentence_transformers/data_collator.py +++ b/sentence_transformers/data_collator.py @@ -9,7 +9,8 @@ class SentenceTransformerDataCollator: """Collator for a SentenceTransformers model. This encodes the text columns to {column}_input_ids and {column}_attention_mask columns. This works with the two text dataset that is used as the example in the training overview: - https://www.sbert.net/docs/training/overview.html""" + https://www.sbert.net/docs/training/overview.html + """ tokenize_fn: Callable valid_label_columns: List[str] = field(default_factory=lambda: ["label", "score"]) diff --git a/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py b/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py index 33a02c2e3..973d55cac 100644 --- a/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py +++ b/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py @@ -11,8 +11,10 @@ class DenoisingAutoEncoderDataset(Dataset): It is used in combination with the DenoisingAutoEncoderLoss: Here, a decoder tries to re-construct the sentence without noise. - :param sentences: A list of sentences - :param noise_fn: A noise function: Given a string, it returns a string with noise, e.g. deleted words + Args: + sentences: A list of sentences + noise_fn: A noise function: Given a string, it returns a string + with noise, e.g. 
deleted words """ def __init__(self, sentences: List[str], noise_fn=lambda s: DenoisingAutoEncoderDataset.delete(s)): diff --git a/sentence_transformers/datasets/ParallelSentencesDataset.py b/sentence_transformers/datasets/ParallelSentencesDataset.py index a83e4abee..1ec72e90b 100644 --- a/sentence_transformers/datasets/ParallelSentencesDataset.py +++ b/sentence_transformers/datasets/ParallelSentencesDataset.py @@ -37,8 +37,11 @@ def __init__( """ Parallel sentences dataset reader to train student model given a teacher model - :param student_model: Student sentence embedding model that should be trained - :param teacher_model: Teacher model, that provides the sentence embeddings for the first column in the dataset file + Args: + student_model (SentenceTransformer): The student sentence embedding model that should be trained. + teacher_model (SentenceTransformer): The teacher model that provides the sentence embeddings for the first column in the dataset file. + batch_size (int, optional): The batch size for training. Defaults to 8. + use_embedding_cache (bool, optional): Whether to use an embedding cache. Defaults to True. """ self.student_model = student_model self.teacher_model = teacher_model @@ -53,16 +56,20 @@ def __init__( self.embedding_cache = {} self.num_sentences = 0 - def load_data(self, filepath: str, weight: int = 100, max_sentences: int = None, max_sentence_length: int = 128): + def load_data( + self, filepath: str, weight: int = 100, max_sentences: int = None, max_sentence_length: int = 128 + ) -> None: """ Reads in a tab-separated .txt/.csv/.tsv or .gz file. The different columns contain the different translations of the sentence in the first column - :param filepath: Filepath to the file - :param weight: If more than one dataset is loaded with load_data: With which frequency should data be sampled from this dataset? - :param max_sentences: Max number of lines to be read from filepath - :param max_sentence_length: Skip the example if one of the sentences is has more characters than max_sentence_length - :param batch_size: Size for encoding parallel sentences - :return: + Args: + filepath (str): Filepath to the file. + weight (int, optional): If more than one dataset is loaded with load_data, specifies the frequency at which data should be sampled from this dataset. Defaults to 100. + max_sentences (int, optional): Maximum number of lines to be read from the filepath. Defaults to None. + max_sentence_length (int, optional): Skip the example if one of the sentences has more characters than max_sentence_length. Defaults to 128. + + Returns: + None """ logger.info("Load " + filepath) diff --git a/sentence_transformers/datasets/SentenceLabelDataset.py b/sentence_transformers/datasets/SentenceLabelDataset.py index eb69cca8c..ca90665eb 100644 --- a/sentence_transformers/datasets/SentenceLabelDataset.py +++ b/sentence_transformers/datasets/SentenceLabelDataset.py @@ -1,5 +1,3 @@ -""" """ - from torch.utils.data import IterableDataset import numpy as np from typing import List @@ -27,14 +25,13 @@ def __init__(self, examples: List[InputExample], samples_per_label: int = 2, wit """ Creates a LabelSampler for a SentenceLabelDataset. - :param examples: - a list with InputExamples - :param samples_per_label: - the number of consecutive, random and unique samples drawn per label. Batch size should be a multiple of samples_per_label - :param with_replacement: - if this is True, then each sample is drawn at most once (depending on the total number of samples per label).
- if this is False, then one sample can be drawn in multiple draws, but still not multiple times in the same - drawing. + Args: + examples (List[InputExample]): A list of InputExamples. + samples_per_label (int, optional): The number of consecutive, random, and unique samples drawn per label. + The batch size should be a multiple of samples_per_label. Defaults to 2. + with_replacement (bool, optional): If True, each sample is drawn at most once (depending on the total number + of samples per label). If False, one sample can be drawn in multiple draws, but not multiple times in + the same drawing. Defaults to False. """ super().__init__() diff --git a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py index e15a61d9d..b3838f116 100644 --- a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py +++ b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py @@ -27,17 +27,17 @@ class BinaryClassificationEvaluator(SentenceEvaluator): The labels need to be 0 for dissimilar pairs and 1 for similar pairs. - :param sentences1: The first column of sentences - :param sentences2: The second column of sentences - :param labels: labels[i] is the label for the pair (sentences1[i], sentences2[i]). Must be 0 or 1 - :param name: Name for the output - :param batch_size: Batch size used to compute embeddings - :param show_progress_bar: If true, prints a progress bar - :param write_csv: Write results to a CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. - - Example + Args: + sentences1 (List[str]): The first column of sentences. + sentences2 (List[str]): The second column of sentences. + labels (List[int]): labels[i] is the label for the pair (sentences1[i], sentences2[i]). Must be 0 or 1. + name (str, optional): Name for the output. Defaults to "". + batch_size (int, optional): Batch size used to compute embeddings. Defaults to 32. + show_progress_bar (bool, optional): If true, prints a progress bar. Defaults to False. + write_csv (bool, optional): Write results to a CSV file. Defaults to True. + truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. Defaults to None. + + Example: :: from sentence_transformers import SentenceTransformer @@ -152,6 +152,18 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): def __call__( self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 ) -> Dict[str, float]: + """ + Compute the evaluation metrics for the given model. + + Args: + model (SentenceTransformer): The model to evaluate. + output_path (str, optional): Path to save the evaluation results CSV file. Defaults to None. + epoch (int, optional): The epoch number. Defaults to -1. + steps (int, optional): The number of steps. Defaults to -1. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. 
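+
+        Example:
+            A minimal sketch of calling the evaluator directly (the model name and data are
+            illustrative)::
+
+                from sentence_transformers import SentenceTransformer
+                from sentence_transformers.evaluation import BinaryClassificationEvaluator
+
+                model = SentenceTransformer("all-MiniLM-L6-v2")
+                evaluator = BinaryClassificationEvaluator(
+                    sentences1=["A man is eating food.", "A plane is taking off."],
+                    sentences2=["A man is eating a piece of bread.", "A bird is flying."],
+                    labels=[1, 0],
+                    name="dev",
+                )
+                metrics = evaluator(model)
+                # metrics maps metric names (accuracy, F1, precision, recall, average precision,
+                # computed per similarity function) to their float values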
+ """ if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" diff --git a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py index 5bbbaa4fe..4c89d6178 100644 --- a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py +++ b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py @@ -23,7 +23,7 @@ class EmbeddingSimilarityEvaluator(SentenceEvaluator): The metrics are the cosine similarity as well as euclidean and Manhattan distance The returned score is the Spearman correlation with a specified metric. - Example + Example: :: from datasets import load_dataset @@ -33,14 +33,14 @@ class EmbeddingSimilarityEvaluator(SentenceEvaluator): # Load a model model = SentenceTransformer('all-mpnet-base-v2') - # Load the STSB dataset (https://huggingface.co/datasets/nyu-mll/glue/viewer/stsb) - eval_dataset = load_dataset("nyu-mll/glue", "stsb", split="validation") + # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb) + eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") # Initialize the evaluator dev_evaluator = EmbeddingSimilarityEvaluator( sentences1=eval_dataset["sentence1"], sentences2=eval_dataset["sentence2"], - scores=[score / 5 for score in eval_dataset["label"]], + scores=eval_dataset["score"], main_similarity=SimilarityFunction.COSINE, name="sts-dev", ) @@ -52,7 +52,7 @@ class EmbeddingSimilarityEvaluator(SentenceEvaluator): Euclidean-Distance: Pearson: 0.7824 Spearman: 0.7827 Dot-Product-Similarity: Pearson: 0.7192 Spearman: 0.7126 ''' - # => 0.8004 + # => {'sts-dev_pearson_cosine': 0.880607226102985, 'sts-dev_spearman_cosine': 0.881019449484294, ...} """ def __init__( @@ -69,18 +69,22 @@ def __init__( truncate_dim: Optional[int] = None, ): """ - Constructs an evaluator based for the dataset - - The labels need to indicate the similarity between the sentences. - - :param sentences1: List with the first sentence in a pair - :param sentences2: List with the second sentence in a pair - :param scores: Similarity score between sentences1[i] and sentences2[i] - :param write_csv: Write results to a CSV file - :param precision: The precision to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or - "ubinary". Defaults to None. - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current - truncation dimension. Defaults to None. + Constructs an evaluator based for the dataset. + + Args: + sentences1 (List[str]): List with the first sentence in a pair. + sentences2 (List[str]): List with the second sentence in a pair. + scores (List[float]): Similarity score between sentences1[i] and sentences2[i]. + batch_size (int, optional): The batch size for processing the sentences. Defaults to 16. + main_similarity (Optional[Union[str, SimilarityFunction]], optional): The main similarity function to use. + Can be a string (e.g. "cosine", "dot") or a SimilarityFunction object. Defaults to None. + name (str, optional): The name of the evaluator. Defaults to "". + show_progress_bar (bool, optional): Whether to show a progress bar during evaluation. Defaults to False. + write_csv (bool, optional): Whether to write the evaluation results to a CSV file. Defaults to True. + precision (Optional[Literal["float32", "int8", "uint8", "binary", "ubinary"]], optional): The precision + to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or "ubinary". Defaults to None. 
+ truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. `None` uses the + model's current truncation dimension. Defaults to None. """ super().__init__() self.sentences1 = sentences1 diff --git a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py index 235bd1b75..8381abf5b 100644 --- a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py +++ b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py @@ -24,7 +24,7 @@ class InformationRetrievalEvaluator(SentenceEvaluator): Given a set of queries and a large corpus set. It will retrieve for each query the top-k most similar document. It measures Mean Reciprocal Rank (MRR), Recall@k, and Normalized Discounted Cumulative Gain (NDCG) - Example + Example: :: import random @@ -46,9 +46,9 @@ class InformationRetrievalEvaluator(SentenceEvaluator): corpus = corpus.filter(lambda x: x["_id"] in required_corpus_ids) # Convert the datasets to dictionaries - corpus = dict(zip(corpus["_id"], corpus["text"])) # Our corpus (qid => question) + corpus = dict(zip(corpus["_id"], corpus["text"])) # Our corpus (cid => document) queries = dict(zip(queries["_id"], queries["text"])) # Our queries (qid => question) - relevant_docs = {} # Query ID to relevant documents (qid => set([relevant_question_ids]) + relevant_docs = {} # Query ID to relevant documents (qid => set([relevant_cids]) for qid, corpus_ids in zip(relevant_docs_data["query-id"], relevant_docs_data["corpus-id"]): qid = str(qid) corpus_ids = str(corpus_ids) @@ -129,7 +129,28 @@ def __init__( SimilarityFunction.DOT_PRODUCT.value: dot_score, }, # Score function, higher=more similar main_score_function: Optional[Union[str, SimilarityFunction]] = None, - ): + ) -> None: + """ + Initializes the InformationRetrievalEvaluator. + + Args: + queries (Dict[str, str]): A dictionary mapping query IDs to queries. + corpus (Dict[str, str]): A dictionary mapping document IDs to documents. + relevant_docs (Dict[str, Set[str]]): A dictionary mapping query IDs to a set of relevant document IDs. + corpus_chunk_size (int): The size of each chunk of the corpus. Defaults to 50000. + mrr_at_k (List[int]): A list of integers representing the values of k for MRR calculation. Defaults to [10]. + ndcg_at_k (List[int]): A list of integers representing the values of k for NDCG calculation. Defaults to [10]. + accuracy_at_k (List[int]): A list of integers representing the values of k for accuracy calculation. Defaults to [1, 3, 5, 10]. + precision_recall_at_k (List[int]): A list of integers representing the values of k for precision and recall calculation. Defaults to [1, 3, 5, 10]. + map_at_k (List[int]): A list of integers representing the values of k for MAP calculation. Defaults to [100]. + show_progress_bar (bool): Whether to show a progress bar during evaluation. Defaults to False. + batch_size (int): The batch size for evaluation. Defaults to 32. + name (str): A name for the evaluation. Defaults to "". + write_csv (bool): Whether to write the evaluation results to a CSV file. Defaults to True. + truncate_dim (int, optional): The dimension to truncate the embeddings to. Defaults to None. + score_functions (Dict[str, Callable[[Tensor, Tensor], Tensor]]): A dictionary mapping score function names to score functions. Defaults to {SimilarityFunction.COSINE.value: cos_sim, SimilarityFunction.DOT_PRODUCT.value: dot_score}. 
+ main_score_function (Union[str, SimilarityFunction], optional): The main score function to use for evaluation. Defaults to None. + """ super().__init__() self.queries_ids = [] for qid in queries: diff --git a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py index 05ebfe253..2bd65a163 100644 --- a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py +++ b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py @@ -25,8 +25,8 @@ def __init__(self, dataloader: DataLoader, name: str = "", softmax_model=None, w """ Constructs an evaluator for the given dataset - :param dataloader: - the data for the evaluation + Args: + dataloader (DataLoader): the data for the evaluation """ super().__init__() self.dataloader = dataloader diff --git a/sentence_transformers/evaluation/MSEEvaluator.py b/sentence_transformers/evaluation/MSEEvaluator.py index 1cca98e6c..46e7518f6 100644 --- a/sentence_transformers/evaluation/MSEEvaluator.py +++ b/sentence_transformers/evaluation/MSEEvaluator.py @@ -20,16 +20,18 @@ class MSEEvaluator(SentenceEvaluator): For multilingual knowledge distillation (https://arxiv.org/abs/2004.09813), source_sentences are in English and target_sentences are in a different language like German, Chinese, Spanish... - :param source_sentences: Source sentences are embedded with the teacher model - :param target_sentences: Target sentences are ambedding with the student model. - :param show_progress_bar: Show progress bar when computing embeddings - :param batch_size: Batch size to compute sentence embeddings - :param name: Name of the evaluator - :param write_csv: Write results to CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. - - Example + Args: + source_sentences (List[str]): Source sentences to embed with the teacher model. + target_sentences (List[str]): Target sentences to embed with the student model. + teacher_model (SentenceTransformer, optional): The teacher model to compute the source sentence embeddings. + show_progress_bar (bool, optional): Show progress bar when computing embeddings. Defaults to False. + batch_size (int, optional): Batch size to compute sentence embeddings. Defaults to 32. + name (str, optional): Name of the evaluator. Defaults to "". + write_csv (bool, optional): Write results to CSV file. Defaults to True. + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. `None` uses the model's current truncation + dimension. Defaults to None. + + Example: :: from sentence_transformers import SentenceTransformer diff --git a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py index bb027614e..0d6fd2778 100644 --- a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py +++ b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py @@ -15,21 +15,22 @@ class MSEEvaluatorFromDataFrame(SentenceEvaluator): """ Computes the mean squared error (x100) between the computed sentence embedding and some target sentence embedding. - :param dataframe: It must have the following format. Rows contains different, parallel sentences. - Columns are the respective language codes:: + Args: + dataframe (List[Dict[str, str]]): It must have the following format. Rows contain different, parallel sentences.
+ Columns are the respective language codes:: [{'en': 'My sentence in English', 'es': 'Oración en español', 'fr': 'Phrase en français'...}, {'en': 'My second sentence', ...}] - - :param combinations: Must be of the format ``[('en', 'es'), ('en', 'fr'), ...]``. - First entry in a tuple is the source language. The sentence in the respective language will be fetched from - the dataframe and passed to the teacher model. Second entry in a tuple the the target language. Sentence - will be fetched from the dataframe and passed to the student model - :param batch_size: Batch size to compute sentence embeddings - :param name: Name of the evaluator - :param write_csv: Write results to CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. + teacher_model (SentenceTransformer): The teacher model used to compute the sentence embeddings. + combinations (List[Tuple[str, str]]): Must be of the format ``[('en', 'es'), ('en', 'fr'), ...]``. + First entry in a tuple is the source language. The sentence in the respective language will be fetched from + the dataframe and passed to the teacher model. Second entry in a tuple is the target language. The sentence + will be fetched from the dataframe and passed to the student model. + batch_size (int, optional): The batch size to compute sentence embeddings. Defaults to 8. + name (str, optional): The name of the evaluator. Defaults to "". + write_csv (bool, optional): Whether to write the results to a CSV file. Defaults to True. + truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. If None, uses the model's + current truncation dimension. Defaults to None. """ def __init__( diff --git a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py index 46f2f4fdb..701edf29b 100644 --- a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py +++ b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py @@ -19,7 +19,7 @@ class ParaphraseMiningEvaluator(SentenceEvaluator): identifies the pairs with the highest similarity. It compares the extracted paraphrase pairs with a set of gold labels and computes the F1 score. - Example + Example: :: from datasets import load_dataset @@ -76,22 +76,35 @@ def __init__( truncate_dim: Optional[int] = None, ): """ - - :param sentences_map: A dictionary that maps sentence-ids to sentences, i.e. sentences_map[id] => sentence. - :param duplicates_list: Duplicates_list is a list with id pairs [(id1, id2), (id1, id5)] that identifies the duplicates / paraphrases in the sentences_map - :param duplicates_dict: A default dictionary mapping [id1][id2] to true if id1 and id2 are duplicates. Must be symmetric, i.e., if [id1][id2] => True, then [id2][id1] => True. - :param add_transitive_closure: If true, it adds a transitive closure, i.e. if dup[a][b] and dup[b][c], then dup[a][c] - :param query_chunk_size: To identify the paraphrases, the cosine-similarity between all sentence-pairs will be computed. As this might require a lot of memory, we perform a batched computation. #query_batch_size sentences will be compared against up to #corpus_batch_size sentences. In the default setting, 5000 sentences will be grouped together and compared up-to against 100k other sentences.
- :param corpus_chunk_size: The corpus will be batched, to reduce the memory requirement - :param max_pairs: We will only extract up to #max_pairs potential paraphrase candidates. - :param top_k: For each query, we extract the top_k most similar pairs and add it to a sorted list. I.e., for one sentence we cannot find more than top_k paraphrases - :param show_progress_bar: Output a progress bar - :param batch_size: Batch size for computing sentence embeddings - :param name: Name of the experiment - :param write_csv: Write results to CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. - + Initializes the ParaphraseMiningEvaluator. + + Args: + sentences_map (Dict[str, str]): A dictionary that maps sentence-ids to sentences. + For example, sentences_map[id] => sentence. + duplicates_list (List[Tuple[str, str]], optional): A list with id pairs [(id1, id2), (id1, id5)] + that identifies the duplicates / paraphrases in the sentences_map. Defaults to None. + duplicates_dict (Dict[str, Dict[str, bool]], optional): A default dictionary mapping [id1][id2] + to true if id1 and id2 are duplicates. Must be symmetric, i.e., if [id1][id2] => True, + then [id2][id1] => True. Defaults to None. + add_transitive_closure (bool, optional): If true, it adds a transitive closure, + i.e. if dup[a][b] and dup[b][c], then dup[a][c]. Defaults to False. + query_chunk_size (int, optional): To identify the paraphrases, the cosine-similarity between + all sentence-pairs will be computed. As this might require a lot of memory, we perform + a batched computation. query_chunk_size sentences will be compared against up to + corpus_chunk_size sentences. In the default setting, 5000 sentences will be grouped + together and compared against up to 100k other sentences. Defaults to 5000. + corpus_chunk_size (int, optional): The corpus will be batched, to reduce the memory requirement. + Defaults to 100000. + max_pairs (int, optional): We will only extract up to max_pairs potential paraphrase candidates. + Defaults to 500000. + top_k (int, optional): For each query, we extract the top_k most similar pairs and add them to a sorted list. + I.e., for one sentence we cannot find more than top_k paraphrases. Defaults to 100. + show_progress_bar (bool, optional): Output a progress bar. Defaults to False. + batch_size (int, optional): Batch size for computing sentence embeddings. Defaults to 16. + name (str, optional): Name of the experiment. Defaults to "". + write_csv (bool, optional): Write results to CSV file. Defaults to True. + truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. + `None` uses the model's current truncation dimension. Defaults to None. """ super().__init__() self.sentences = [] diff --git a/sentence_transformers/evaluation/RerankingEvaluator.py b/sentence_transformers/evaluation/RerankingEvaluator.py index 1216bff79..902d6281d 100644 --- a/sentence_transformers/evaluation/RerankingEvaluator.py +++ b/sentence_transformers/evaluation/RerankingEvaluator.py @@ -21,20 +21,20 @@ class RerankingEvaluator(SentenceEvaluator): Given a query and a list of documents, it computes the score [query, doc_i] for all possible documents and sorts them in decreasing order. Then, MRR@10, NDCG@10 and MAP are computed to measure the quality of the ranking. - :param samples: Must be a list and each element is of the form: {'query': '', 'positive': [], 'negative': []}.
- Query is the search query, positive is a list of positive (relevant) documents, negative is a list of negative - (irrelevant) documents. - - :param at_k: Only consider the top k most similar documents to each query for the evaluation - :param name: Name of the evaluator - :param write_csv: Write results to CSV file - :param similarity_fct: similarity function between sentence embeddings. By default, cosine similarity. - :param batch_size: Batch size to compute sentence embeddings - :param show_progress_bar: Show progress bar when computing embeddings - :param use_batched_encoding: Whether or not to encode queries and documents in batches for greater speed, or 1-by-1 - to save memory - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. + Args: + samples (list): A list of dictionaries, where each dictionary represents a sample and has the following keys: + - 'query': The search query. + - 'positive': A list of positive (relevant) documents. + - 'negative': A list of negative (irrelevant) documents. + at_k (int, optional): Only consider the top k most similar documents to each query for the evaluation. Defaults to 10. + name (str, optional): Name of the evaluator. Defaults to "". + write_csv (bool, optional): Write results to CSV file. Defaults to True. + similarity_fct (Callable[[torch.Tensor, torch.Tensor], torch.Tensor], optional): Similarity function between sentence embeddings. By default, cosine similarity. Defaults to cos_sim. + batch_size (int, optional): Batch size to compute sentence embeddings. Defaults to 64. + show_progress_bar (bool, optional): Show progress bar when computing embeddings. Defaults to False. + use_batched_encoding (bool, optional): Whether or not to encode queries and documents in batches for greater speed, or 1-by-1 to save memory. Defaults to True. + truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. Defaults to None. + mrr_at_k (Optional[int], optional): Deprecated parameter. Please use `at_k` instead. Defaults to None. """ def __init__( @@ -88,6 +88,18 @@ def __init__( def __call__( self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 ) -> Dict[str, float]: + """ + Evaluates the model on the dataset and returns the evaluation metrics. + + Args: + model (SentenceTransformer): The SentenceTransformer model to evaluate. + output_path (str, optional): The output path to write the results. Defaults to None. + epoch (int, optional): The current epoch number. Defaults to -1. + steps (int, optional): The current step number. Defaults to -1. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. + """ if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -145,6 +157,15 @@ def __call__( return metrics def compute_metrices(self, model): + """ + Computes the evaluation metrics for the given model. + + Args: + model (SentenceTransformer): The SentenceTransformer model to compute metrics for. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. 
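+
+        Example:
+            A minimal sketch (the model name and sample data are illustrative)::
+
+                from sentence_transformers import SentenceTransformer
+                from sentence_transformers.evaluation import RerankingEvaluator
+
+                model = SentenceTransformer("all-MiniLM-L6-v2")
+                samples = [{
+                    "query": "What is Python?",
+                    "positive": ["Python is an interpreted programming language."],
+                    "negative": ["A python is a large constricting snake."],
+                }]
+                evaluator = RerankingEvaluator(samples)
+                metrics = evaluator.compute_metrices(model)
+                # a dictionary with MAP, MRR@k and NDCG@k values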
+ """ return ( self.compute_metrices_batched(model) if self.use_batched_encoding @@ -153,8 +174,13 @@ def compute_metrices(self, model): def compute_metrices_batched(self, model): """ - Computes the metrices in a batched way, by batching all queries and - all documents together + Computes the evaluation metrics in a batched way, by batching all queries and all documents together. + + Args: + model (SentenceTransformer): The SentenceTransformer model to compute metrics for. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. """ all_mrr_scores = [] all_ndcg_scores = [] @@ -222,10 +248,13 @@ def compute_metrices_batched(self, model): def compute_metrices_individual(self, model): """ - Embeds every (query, positive, negative) tuple individually. - Is slower than the batched version, but saves memory as only the - embeddings for one tuple are needed. Useful when you have - a really large test set + Computes the evaluation metrics individually by embedding every (query, positive, negative) tuple individually. + + Args: + model (SentenceTransformer): The SentenceTransformer model to compute metrics for. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. """ all_mrr_scores = [] all_ndcg_scores = [] diff --git a/sentence_transformers/evaluation/SentenceEvaluator.py b/sentence_transformers/evaluation/SentenceEvaluator.py index d8f4f5232..a336fc362 100644 --- a/sentence_transformers/evaluation/SentenceEvaluator.py +++ b/sentence_transformers/evaluation/SentenceEvaluator.py @@ -22,21 +22,23 @@ def __call__( This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result. - :param model: - the model to evaluate - :param output_path: - path where predictions and metrics are written to - :param epoch - the epoch where the evaluation takes place. - This is used for the file prefixes. - If this is -1, then we assume evaluation on test data. - :param steps - the steps in the current epoch at time of the evaluation. - This is used for the file prefixes. - If this is -1, then we assume evaluation at the end of the epoch. - :return: Either a score for the evaluation with a higher score indicating a better result, - or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` - must be defined + Args: + model: the model to evaluate + output_path: path where predictions and metrics are written + to + epoch: the epoch where the evaluation takes place. This is + used for the file prefixes. If this is -1, then we + assume evaluation on test data. + steps: the steps in the current epoch at time of the + evaluation. This is used for the file prefixes. If this + is -1, then we assume evaluation at the end of the + epoch. + + Returns: + Either a score for the evaluation with a higher score + indicating a better result, or a dictionary with scores. If + the latter is chosen, then `evaluator.primary_metric` must + be defined """ pass diff --git a/sentence_transformers/evaluation/SequentialEvaluator.py b/sentence_transformers/evaluation/SequentialEvaluator.py index 2e2fb5ced..3f9f43f36 100644 --- a/sentence_transformers/evaluation/SequentialEvaluator.py +++ b/sentence_transformers/evaluation/SequentialEvaluator.py @@ -12,6 +12,22 @@ class SequentialEvaluator(SentenceEvaluator): """ def __init__(self, evaluators: Iterable[SentenceEvaluator], main_score_function=lambda scores: scores[-1]): + """ + Initializes a SequentialEvaluator object. 
+ + Args: + evaluators (Iterable[SentenceEvaluator]): A collection of SentenceEvaluator objects. + main_score_function (function, optional): A function that takes a list of scores and returns the main score. + Defaults to selecting the last score in the list. + + Example: + :: + + evaluator1 = BinaryClassificationEvaluator(...) + evaluator2 = InformationRetrievalEvaluator(...) + evaluator3 = MSEEvaluator(...) + seq_evaluator = SequentialEvaluator([evaluator1, evaluator2, evaluator3]) + """ super().__init__() self.evaluators = evaluators self.main_score_function = main_score_function diff --git a/sentence_transformers/evaluation/TranslationEvaluator.py b/sentence_transformers/evaluation/TranslationEvaluator.py index 603cbe744..8580b8d70 100644 --- a/sentence_transformers/evaluation/TranslationEvaluator.py +++ b/sentence_transformers/evaluation/TranslationEvaluator.py @@ -19,7 +19,7 @@ class TranslationEvaluator(SentenceEvaluator): and assuming that fr_i is the translation of en_i. Checks if vec(en_i) has the highest similarity to vec(fr_i). Computes the accuracy in both directions - Example + Example: :: from sentence_transformers import SentenceTransformer @@ -66,23 +66,16 @@ def __init__( The labels need to indicate the similarity between the sentences. - :param source_sentences: - List of sentences in source language - :param target_sentences: - List of sentences in target language - :param show_progress_bar: - Show progress bar when computing embeddings - :param batch_size: - Batch size to compute sentence embeddings - :param name: - Name of the evaluator - :param print_wrong_matches: - Prints incorrect matches - :param write_csv: - Write results to CSV file - :param truncate_dim: - The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. - Defaults to None. + Args: + source_sentences (List[str]): List of sentences in the source language. + target_sentences (List[str]): List of sentences in the target language. + show_progress_bar (bool): Whether to show a progress bar when computing embeddings. Defaults to False. + batch_size (int): The batch size to compute sentence embeddings. Defaults to 16. + name (str): The name of the evaluator. Defaults to an empty string. + print_wrong_matches (bool): Whether to print incorrect matches. Defaults to False. + write_csv (bool): Whether to write the evaluation results to a CSV file. Defaults to True. + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. If None, the model's + current truncation dimension will be used. Defaults to None. """ super().__init__() self.source_sentences = source_sentences diff --git a/sentence_transformers/evaluation/TripletEvaluator.py b/sentence_transformers/evaluation/TripletEvaluator.py index 86ff42ba8..fe34a32ac 100644 --- a/sentence_transformers/evaluation/TripletEvaluator.py +++ b/sentence_transformers/evaluation/TripletEvaluator.py @@ -19,7 +19,7 @@ class TripletEvaluator(SentenceEvaluator): Evaluate a model based on a triplet: (sentence, positive_example, negative_example). Checks if distance(sentence, positive_example) < distance(sentence, negative_example). - Example + Example: :: from sentence_transformers import SentenceTransformer @@ -66,17 +66,21 @@ def __init__( truncate_dim: Optional[int] = None, ): """ - :param anchors: Sentences to check similarity to. (e.g. a query) - :param positives: List of positive sentences - :param negatives: List of negative sentences - :param main_distance_function: The distance function to use. 
If not specified, use cosine similarity, - dot product, Euclidean, and Manhattan. - :param name: Name for the output - :param batch_size: Batch size used to compute embeddings - :param show_progress_bar: If true, prints a progress bar - :param write_csv: Write results to a CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current - truncation dimension. Defaults to None. + Initializes a TripletEvaluator object. + + Args: + anchors (List[str]): Sentences to check similarity to. (e.g. a query) + positives (List[str]): List of positive sentences + negatives (List[str]): List of negative sentences + main_distance_function (Union[str, SimilarityFunction], optional): + The distance function to use. If not specified, cosine similarity, + dot product, Euclidean, and Manhattan distances are all used. Defaults to None. + name (str): Name for the output. Defaults to "". + batch_size (int): Batch size used to compute embeddings. Defaults to 16. + show_progress_bar (bool): If true, prints a progress bar. Defaults to False. + write_csv (bool): Write results to a CSV file. Defaults to True. + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. + `None` uses the model's current truncation dimension. Defaults to None. """ super().__init__() self.anchors = anchors diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py index 02f3cc744..8ec3b94ba 100644 --- a/sentence_transformers/fit_mixin.py +++ b/sentence_transformers/fit_mixin.py @@ -173,32 +173,59 @@ def fit( checkpoint_save_total_limit: int = 0, ): """ - Train the model with the given training objective - Each training objective is sampled in turn for one batch. - We sample only as many batches from each objective as there are in the smallest one - to make sure of equal training with each dataset. - - :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning - :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. - :param epochs: Number of epochs for training - :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal the DataLoader size from train_objectives. - :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. - :param optimizer_class: Optimizer - :param optimizer_params: Optimizer parameters - :param weight_decay: Weight decay for model parameters - :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps - :param output_path: Storage path for the model and evaluation files - :param save_best_model: If true, the best model (according to evaluator) is stored at output_path - :param max_grad_norm: Used for gradient normalization. - :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0 - :param callback: Callback function that is invoked after each evaluation.
- It must accept the following three parameters in this order: - `score`, `epoch`, `steps` - :param show_progress_bar: If True, output a tqdm progress bar - :param checkpoint_path: Folder to save checkpoints during training - :param checkpoint_save_steps: Will save a checkpoint after so many steps - :param checkpoint_save_total_limit: Total number of checkpoints to store + Deprecated training method from before Sentence Transformers v3.0; it is recommended to use + :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` instead. This method uses + :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` behind the scenes, but does + not provide as much flexibility as the Trainer itself. + + This training approach uses a list of DataLoaders and Loss functions to train the model. Each DataLoader + is sampled in turn for one batch. We sample only as many batches from each DataLoader as there are in the + smallest one to make sure of equal training with each dataset, i.e. round robin sampling. + + This method should produce equivalent results in v3.0+ as before v3.0, but if you encounter any issues + with your existing training scripts, then you may wish to use + :meth:`SentenceTransformer.old_fit` instead. + That uses the old training method from before v3.0. + + Args: + train_objectives: Tuples of (DataLoader, LossFunction). Pass + more than one for multi-task learning + evaluator: An evaluator (sentence_transformers.evaluation) + evaluates the model performance during training on held- + out dev data. It is used to determine the best model + that is saved to disc. + epochs: Number of epochs for training + steps_per_epoch: Number of training steps per epoch. If set + to None (default), one epoch equals the DataLoader + size from train_objectives. + scheduler: Learning rate scheduler. Available schedulers: + constantlr, warmupconstant, warmuplinear, warmupcosine, + warmupcosinewithhardrestarts + warmup_steps: Behavior depends on the scheduler. For + WarmupLinear (default), the learning rate is increased + from 0 up to the maximal learning rate. After these many + training steps, the learning rate is decreased linearly + back to zero. + optimizer_class: Optimizer + optimizer_params: Optimizer parameters + weight_decay: Weight decay for model parameters + evaluation_steps: If > 0, evaluate the model using evaluator + after each number of training steps + output_path: Storage path for the model and evaluation files + save_best_model: If true, the best model (according to + evaluator) is stored at output_path + max_grad_norm: Used for gradient normalization. + use_amp: Use Automatic Mixed Precision (AMP). Only for + Pytorch >= 1.6.0 + callback: Callback function that is invoked after each + evaluation. It must accept the following three + parameters in this order: `score`, `epoch`, `steps` + show_progress_bar: If True, output a tqdm progress bar + checkpoint_path: Folder to save checkpoints during training + checkpoint_save_steps: Will save a checkpoint after so many + steps + checkpoint_save_total_limit: Total number of checkpoints to + store """ # Delayed import to counter the SentenceTransformers -> FitMixin -> SentenceTransformerTrainer -> SentenceTransformers circular import from sentence_transformers.trainer import SentenceTransformerTrainer @@ -338,7 +365,13 @@ def _default_checkpoint_dir() -> str: @staticmethod def _get_scheduler(optimizer, scheduler: str, warmup_steps: int, t_total: int): """ - Returns the correct learning rate scheduler.
Available scheduler: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts + Returns the correct learning rate scheduler. Available schedulers: + + - constantlr, + - warmupconstant, + - warmuplinear, + - warmupcosine, + - warmupcosinewithhardrestarts """ scheduler = scheduler.lower() if scheduler == "constantlr": @@ -365,9 +398,10 @@ def smart_batching_collate(self, batch: List["InputExample"]) -> Tuple[List[Dict Transforms a batch from a SmartBatchingDataset to a batch of tensors for the model Here, batch is a list of InputExample instances: [InputExample(...), ...] - :param batch: - a batch from a SmartBatchingDataset - :return: + Args: + batch: a batch from a SmartBatchingDataset + + Returns: + a batch of tensors for the model """ texts = [example.texts for example in batch] @@ -410,32 +444,53 @@ def old_fit( checkpoint_save_total_limit: int = 0, ): """ - Train the model with the given training objective - Each training objective is sampled in turn for one batch. - We sample only as many batches from each objective as there are in the smallest one - to make sure of equal training with each dataset. - - :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning - :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. - :param epochs: Number of epochs for training - :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal the DataLoader size from train_objectives. - :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. - :param optimizer_class: Optimizer - :param optimizer_params: Optimizer parameters - :param weight_decay: Weight decay for model parameters - :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps - :param output_path: Storage path for the model and evaluation files - :param save_best_model: If true, the best model (according to evaluator) is stored at output_path - :param max_grad_norm: Used for gradient normalization. - :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0 - :param callback: Callback function that is invoked after each evaluation. - It must accept the following three parameters in this order: - `score`, `epoch`, `steps` - :param show_progress_bar: If True, output a tqdm progress bar - :param checkpoint_path: Folder to save checkpoints during training - :param checkpoint_save_steps: Will save a checkpoint after so many steps - :param checkpoint_save_total_limit: Total number of checkpoints to store + Deprecated training method from before Sentence Transformers v3.0; it is recommended to use + :class:`sentence_transformers.trainer.SentenceTransformerTrainer` instead. This method should + only be used if you encounter issues with your existing training scripts after upgrading to v3.0+. + + This training approach uses a list of DataLoaders and Loss functions to train the model. Each DataLoader + is sampled in turn for one batch.
We sample only as many batches from each DataLoader as there are in the + smallest one to make sure of equal training with each dataset, i.e. round robin sampling. + + Args: + train_objectives: Tuples of (DataLoader, LossFunction). Pass + more than one for multi-task learning + evaluator: An evaluator (sentence_transformers.evaluation) + evaluates the model performance during training on held- + out dev data. It is used to determine the best model + that is saved to disc. + epochs: Number of epochs for training + steps_per_epoch: Number of training steps per epoch. If set + to None (default), one epoch equals the DataLoader + size from train_objectives. + scheduler: Learning rate scheduler. Available schedulers: + constantlr, warmupconstant, warmuplinear, warmupcosine, + warmupcosinewithhardrestarts + warmup_steps: Behavior depends on the scheduler. For + WarmupLinear (default), the learning rate is increased + from 0 up to the maximal learning rate. After these many + training steps, the learning rate is decreased linearly + back to zero. + optimizer_class: Optimizer + optimizer_params: Optimizer parameters + weight_decay: Weight decay for model parameters + evaluation_steps: If > 0, evaluate the model using evaluator + after each number of training steps + output_path: Storage path for the model and evaluation files + save_best_model: If true, the best model (according to + evaluator) is stored at output_path + max_grad_norm: Used for gradient normalization. + use_amp: Use Automatic Mixed Precision (AMP). Only for + Pytorch >= 1.6.0 + callback: Callback function that is invoked after each + evaluation. It must accept the following three + parameters in this order: `score`, `epoch`, `steps` + show_progress_bar: If True, output a tqdm progress bar + checkpoint_path: Folder to save checkpoints during training + checkpoint_save_steps: Will save a checkpoint after so many + steps + checkpoint_save_total_limit: Total number of checkpoints to + store """ ##Add info to model card diff --git a/sentence_transformers/losses/AdaptiveLayerLoss.py b/sentence_transformers/losses/AdaptiveLayerLoss.py index 678aada1a..d50ac3227 100644 --- a/sentence_transformers/losses/AdaptiveLayerLoss.py +++ b/sentence_transformers/losses/AdaptiveLayerLoss.py @@ -111,20 +111,33 @@ def __init__( layers of the Sentence Transformer model. This is useful for when you want to train a model where users have the option to lower the number of layers used to improve their inference speed and memory usage. - :param model: SentenceTransformer model - :param loss: The loss function to be used, e.g. :class:`MultipleNegativesRankingLoss`, :class:`CoSENTLoss`, etc. - :param n_layers_per_step: The number of layers to use per step. If -1, then all layers are used. If > 0, then - a random sample of `n_layers_per_step` layers are used per step, separate from the final layer, which is - always used. The 2DMSE paper uses `n_layers_per_step=1`. The default value is 1. - :param last_layer_weight: The weight to use for the loss of the final layer. Increase this to focus more on the - performance when using all layers. The default value is 1.0. - :param prior_layers_weight: The weight to use for the loss of the prior layers. Increase this to focus more on - the performance when using fewer layers. The default value is 1.0. - :param kl_div_weight: The weight to use for the KL-divergence loss that is used to make the prior layers match - that of the last layer. Increase this to focus more on the performance when using fewer layers.
The default - value is 1.0. - :param kl_temperature: The temperature to use for the KL-divergence loss. If 0, then the KL-divergence loss is - not used. The default value is 1.0. + Args: + model: SentenceTransformer model + loss: The loss function to be used, e.g. + :class:`MultipleNegativesRankingLoss`, + :class:`CoSENTLoss`, etc. + n_layers_per_step: The number of layers to use per step. If + -1, then all layers are used. If > 0, then a random + sample of `n_layers_per_step` layers are used per step, + separate from the final layer, which is always used. The + 2DMSE paper uses `n_layers_per_step=1`. The default + value is 1. + last_layer_weight: The weight to use for the loss of the + final layer. Increase this to focus more on the + performance when using all layers. The default value is + 1.0. + prior_layers_weight: The weight to use for the loss of the + prior layers. Increase this to focus more on the + performance when using fewer layers. The default value + is 1.0. + kl_div_weight: The weight to use for the KL-divergence loss + that is used to make the prior layers match that of the + last layer. Increase this to focus more on the + performance when using fewer layers. The default value + is 1.0. + kl_temperature: The temperature to use for the KL-divergence + loss. If 0, then the KL-divergence loss is not used. The + default value is 1.0. References: - The concept was inspired by the 2DMSE paper: https://arxiv.org/abs/2402.14776 @@ -147,21 +160,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('microsoft/mpnet-base') - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32) - train_loss = losses.MultipleNegativesRankingLoss(model=model) - train_loss = losses.AdaptiveLayerLoss(model, train_loss) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.MultipleNegativesRankingLoss(model=model) + loss = losses.AdaptiveLayerLoss(model, loss) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super().__init__() self.model = model diff --git a/sentence_transformers/losses/AnglELoss.py b/sentence_transformers/losses/AnglELoss.py index a506a1317..661e8693d 100644 --- a/sentence_transformers/losses/AnglELoss.py +++ b/sentence_transformers/losses/AnglELoss.py @@ -20,8 +20,10 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0): pairs of input pairs in the batch that match this condition. This is the same as CoSENTLoss, with a different similarity function. - :param model: SentenceTransformerModel - :param scale: Output of similarity function is multiplied by scale value. Represents the inverse temperature. + Args: + model: SentenceTransformerModel + scale: Output of similarity function is multiplied by scale + value. Represents the inverse temperature. 
References: - For further details, see: https://arxiv.org/abs/2309.12871v1 @@ -43,15 +45,23 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0): Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset - model = SentenceTransformer('bert-base-uncased') - train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=1.0), - InputExample(texts=['My third sentence', 'Unrelated sentence'], label=0.3)] + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "score": [1.0, 0.3], + }) + loss = losses.AnglELoss(model) - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.AnglELoss(model=model) + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, + ) + trainer.train() """ super().__init__(model, scale, similarity_fct=util.pairwise_angle_sim) diff --git a/sentence_transformers/losses/BatchAllTripletLoss.py b/sentence_transformers/losses/BatchAllTripletLoss.py index a0fb1055c..1843a615a 100644 --- a/sentence_transformers/losses/BatchAllTripletLoss.py +++ b/sentence_transformers/losses/BatchAllTripletLoss.py @@ -17,9 +17,13 @@ def __init__( must be integers, with same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used. - :param margin: Negative samples should be at least margin further apart from the anchor than the positive. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used. + margin: Negative samples should be at least margin further + apart from the anchor than the positive. References: * Source: https://github.com/NegatioN/OnlineMiningTripletLoss/blob/master/online_triplet_loss/losses.py @@ -46,24 +50,29 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['Sentence from class 0'], label=0), - InputExample(texts=['Another sentence from class 0'], label=0), - InputExample(texts=['Sentence from class 1'], label=1), - InputExample(texts=['Sentence from class 2'], label=2), - ] - train_batch_size = 2 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.BatchAllTripletLoss(model=model) - model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + # E.g. 
0: sports, 1: economy, 2: politics + train_dataset = Dataset.from_dict({ + "sentence": [ + "He played a great game.", + "The stock is up 20%", + "They won 2-1.", + "The last goal was amazing.", + "They all voted against the bill.", + ], + "label": [0, 1, 0, 0, 2], + }) + loss = losses.BatchAllTripletLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(BatchAllTripletLoss, self).__init__() diff --git a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py index a70f419e1..02914a165 100644 --- a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py +++ b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py @@ -15,8 +15,11 @@ def __init__( must be integers, with same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. This soft-margin variant does not require setting a margin. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used. Definitions: :Easy triplets: Triplets which have a loss of 0 because @@ -49,24 +52,29 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['Sentence from class 0'], label=0), - InputExample(texts=['Another sentence from class 0'], label=0), - InputExample(texts=['Sentence from class 1'], label=1), - InputExample(texts=['Sentence from class 2'], label=2) - ] - train_batch_size = 2 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.BatchHardSoftMarginTripletLoss(model=model) - model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + # E.g. 
0: sports, 1: economy, 2: politics + train_dataset = Dataset.from_dict({ + "sentence": [ + "He played a great game.", + "The stock is up 20%", + "They won 2-1.", + "The last goal was amazing.", + "They all voted against the bill.", + ], + "label": [0, 1, 0, 0, 2], + }) + loss = losses.BatchHardSoftMarginTripletLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(BatchHardSoftMarginTripletLoss, self).__init__(model) self.sentence_embedder = model diff --git a/sentence_transformers/losses/BatchHardTripletLoss.py b/sentence_transformers/losses/BatchHardTripletLoss.py index ab023ec3d..51df4a8b5 100644 --- a/sentence_transformers/losses/BatchHardTripletLoss.py +++ b/sentence_transformers/losses/BatchHardTripletLoss.py @@ -6,15 +6,11 @@ class BatchHardTripletLossDistanceFunction: - """ - This class defines distance functions, that can be used with Batch[All/Hard/SemiHard]TripletLoss - """ + """This class defines distance functions, that can be used with Batch[All/Hard/SemiHard]TripletLoss""" @staticmethod def cosine_distance(embeddings): - """ - Compute the 2D matrix of cosine distances (1-cosine_similarity) between all embeddings. - """ + """Compute the 2D matrix of cosine distances (1-cosine_similarity) between all embeddings.""" return 1 - util.pytorch_cos_sim(embeddings, embeddings) @staticmethod @@ -69,9 +65,13 @@ def __init__( The labels must be integers, with same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used - :param margin: Negative samples should be at least margin further apart from the anchor than the positive. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used + margin: Negative samples should be at least margin further + apart from the anchor than the positive. Definitions: :Easy triplets: Triplets which have a loss of 0 because @@ -106,24 +106,29 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['Sentence from class 0'], label=0), - InputExample(texts=['Another sentence from class 0'], label=0), - InputExample(texts=['Sentence from class 1'], label=1), - InputExample(texts=['Sentence from class 2'], label=2) - ] - train_batch_size = 2 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.BatchHardTripletLoss(model=model) - model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + # E.g. 
0: sports, 1: economy, 2: politics + train_dataset = Dataset.from_dict({ + "sentence": [ + "He played a great game.", + "The stock is up 20%", + "They won 2-1.", + "The last goal was amazing.", + "They all voted against the bill.", + ], + "label": [0, 1, 0, 0, 2], + }) + loss = losses.BatchHardTripletLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(BatchHardTripletLoss, self).__init__() self.sentence_embedder = model diff --git a/sentence_transformers/losses/BatchSemiHardTripletLoss.py b/sentence_transformers/losses/BatchSemiHardTripletLoss.py index a54a6bc26..c997d1f58 100644 --- a/sentence_transformers/losses/BatchSemiHardTripletLoss.py +++ b/sentence_transformers/losses/BatchSemiHardTripletLoss.py @@ -19,9 +19,13 @@ def __init__( The labels must be integers, with same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used - :param margin: Negative samples should be at least margin further apart from the anchor than the positive. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used + margin: Negative samples should be at least margin further + apart from the anchor than the positive. Definitions: :Easy triplets: Triplets which have a loss of 0 because @@ -57,24 +61,29 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['Sentence from class 0'], label=0), - InputExample(texts=['Another sentence from class 0'], label=0), - InputExample(texts=['Sentence from class 1'], label=1), - InputExample(texts=['Sentence from class 2'], label=2) - ] - train_batch_size = 2 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.BatchSemiHardTripletLoss(model=model) - model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + # E.g. 
0: sports, 1: economy, 2: politics + train_dataset = Dataset.from_dict({ + "sentence": [ + "He played a great game.", + "The stock is up 20%", + "They won 2-1.", + "The last goal was amazing.", + "They all voted against the bill.", + ], + "label": [0, 1, 0, 0, 2], + }) + loss = losses.BatchSemiHardTripletLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(BatchSemiHardTripletLoss, self).__init__() self.sentence_embedder = model diff --git a/sentence_transformers/losses/CachedGISTEmbedLoss.py b/sentence_transformers/losses/CachedGISTEmbedLoss.py index 6612a1773..cf3392456 100644 --- a/sentence_transformers/losses/CachedGISTEmbedLoss.py +++ b/sentence_transformers/losses/CachedGISTEmbedLoss.py @@ -77,9 +77,12 @@ def __init__( :class:`CachedMultipleNegativesRankingLoss`, it is possible to reduce memory usage while maintaining performance levels comparable to those of :class:`GISTEmbedLoss`. - :param model: SentenceTransformer model - :param guide: SentenceTransformer model to guide the in-batch negative sample selection. - :param temperature: Temperature parameter to scale the cosine similarities. + Args: + model: SentenceTransformer model + guide: SentenceTransformer model to guide the in-batch + negative sample selection. + temperature: Temperature parameter to scale the cosine + similarities. References: - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf @@ -105,22 +108,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-uncased') - guide = SentenceTransformer('avsolatorio/GIST-small-Embedding-v0') - - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1024) # Here we can try much larger batch sizes! - train_loss = losses.CachedGISTEmbedLoss(model=model, mini_batch_size=32, guide=guide) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + guide = SentenceTransformer("all-MiniLM-L6-v2") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.CachedGISTEmbedLoss(model, guide, mini_batch_size=64) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(CachedGISTEmbedLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py index 80bdc4899..d3e2c7204 100644 --- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py @@ -83,9 +83,13 @@ def __init__( Notes: All steps are done with mini-batches. In the original implementation of GradCache, (2) is not done in mini-batches and requires a lot of memory when the batch size is large. One drawback is the speed: GradCache will sacrifice around 20% computation time according to the paper.
- :param model: SentenceTransformer model - :param scale: Output of similarity function is multiplied by scale value - :param similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1) + Args: + model: SentenceTransformer model + scale: Output of similarity function is multiplied by scale + value + similarity_fct: similarity function between sentence + embeddings. By default, cos_sim. Can also be set to dot + product (and then set scale to 1) References: - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf @@ -112,20 +116,22 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-uncased') - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1024) # Here we can try much larger batch sizes! - train_loss = losses.CachedMultipleNegativesRankingLoss(model=model, mini_batch_size = 32) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=64) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(CachedMultipleNegativesRankingLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/CoSENTLoss.py b/sentence_transformers/losses/CoSENTLoss.py index e937d7ef9..e0c5203b7 100644 --- a/sentence_transformers/losses/CoSENTLoss.py +++ b/sentence_transformers/losses/CoSENTLoss.py @@ -22,9 +22,13 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f resulting in faster convergence and a final model with superior performance. Consequently, CoSENTLoss may be used as a drop-in replacement for :class:`CosineSimilarityLoss` in any training script. - :param model: SentenceTransformerModel - :param similarity_fct: Function to compute the PAIRWISE similarity between embeddings. Default is ``util.pairwise_cos_sim``. - :param scale: Output of similarity function is multiplied by scale value. Represents the inverse temperature. + Args: + model: SentenceTransformerModel + similarity_fct: Function to compute the PAIRWISE similarity + between embeddings. Default is + ``util.pairwise_cos_sim``. + scale: Output of similarity function is multiplied by scale + value. Represents the inverse temperature.
References: - For further details, see: https://kexue.fm/archives/8847 @@ -46,15 +50,23 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - - model = SentenceTransformer('bert-base-uncased') - train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=1.0), - InputExample(texts=['My third sentence', 'Unrelated sentence'], label=0.3)] - - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.CoSENTLoss(model=model) + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "score": [1.0, 0.3], + }) + loss = losses.CoSENTLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, + ) + trainer.train() """ super(CoSENTLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/ContrastiveLoss.py b/sentence_transformers/losses/ContrastiveLoss.py index 55f5ad993..a7c66792f 100644 --- a/sentence_transformers/losses/ContrastiveLoss.py +++ b/sentence_transformers/losses/ContrastiveLoss.py @@ -6,9 +6,7 @@ class SiameseDistanceMetric(Enum): - """ - The metric for the contrastive loss - """ + """The metric for the contrastive loss""" EUCLIDEAN = lambda x, y: F.pairwise_distance(x, y, p=2) MANHATTAN = lambda x, y: F.pairwise_distance(x, y, p=1) @@ -27,10 +25,14 @@ def __init__( Contrastive loss. Expects as input two texts and a label of either 0 or 1. If the label == 1, then the distance between the two embeddings is reduced. If the label == 0, then the distance between the embeddings is increased. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrices that can be used - :param margin: Negative samples (label == 0) should have a distance of at least the margin value. - :param size_average: Average by the size of the mini-batch. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used + margin: Negative samples (label == 0) should have a distance + of at least the margin value. + size_average: Average by the size of the mini-batch.
References: * Further information: http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf @@ -53,23 +55,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('all-MiniLM-L6-v2') - train_examples = [ - InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1), - InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0), - ] - - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2) - train_loss = losses.ContrastiveLoss(model=model) - - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "label": [1, 0], + }) + loss = losses.ContrastiveLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(ContrastiveLoss, self).__init__() self.distance_metric = distance_metric diff --git a/sentence_transformers/losses/ContrastiveTensionLoss.py b/sentence_transformers/losses/ContrastiveTensionLoss.py index 85af67fc3..08dc5d723 100644 --- a/sentence_transformers/losses/ContrastiveTensionLoss.py +++ b/sentence_transformers/losses/ContrastiveTensionLoss.py @@ -23,7 +23,8 @@ class ContrastiveTensionLoss(nn.Module): Generally, :class:`ContrastiveTensionLossInBatchNegatives` is recommended over this loss, as it gives a stronger training signal. - :param model: SentenceTransformer model + Args: + model: SentenceTransformer model References: * Semantic Re-Tuning with Contrastive Tension: https://openreview.net/pdf?id=Ov_sMNau-PF @@ -112,9 +113,13 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Note that you should not use the `ContrastiveTensionDataLoader` for this loss, but just a normal DataLoader with `InputExample` instances. The two texts of each `InputExample` instance should be identical. - :param model: SentenceTransformer model - :param scale: Output of similarity function is multiplied by scale value - :param similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1) + Args: + model: SentenceTransformer model + scale: Output of similarity function is multiplied by scale + value + similarity_fct: similarity function between sentence + embeddings. By default, cos_sim. Can also be set to dot + product (and then set scale to 1) References: - Semantic Re-Tuning with Contrastive Tension: https://openreview.net/pdf?id=Ov_sMNau-PF diff --git a/sentence_transformers/losses/CosineSimilarityLoss.py b/sentence_transformers/losses/CosineSimilarityLoss.py index 46b075b38..8d27300e7 100644 --- a/sentence_transformers/losses/CosineSimilarityLoss.py +++ b/sentence_transformers/losses/CosineSimilarityLoss.py @@ -13,11 +13,15 @@ def __init__(self, model: SentenceTransformer, loss_fct=nn.MSELoss(), cos_score_ vectors ``u = model(sentence_A)`` and ``v = model(sentence_B)`` and measures the cosine-similarity between the two. 
By default, it minimizes the following loss: ``||input_label - cos_score_transformation(cosine_sim(u,v))||_2``. - :param model: SentenceTransformer model - :param loss_fct: Which pytorch loss function should be used to compare the ``cosine_similarity(u, v)`` with the input_label? - By default, MSE is used: ``||input_label - cosine_sim(u, v)||_2`` - :param cos_score_transformation: The cos_score_transformation function is applied on top of cosine_similarity. - By default, the identify function is used (i.e. no change). + Args: + model: SentenceTransformer model + loss_fct: Which pytorch loss function should be used to + compare the ``cosine_similarity(u, v)`` with the + input_label? By default, MSE is used: ``||input_label - + cosine_sim(u, v)||_2`` + cos_score_transformation: The cos_score_transformation + function is applied on top of cosine_similarity. By + default, the identity function is used (i.e. no change). References: - `Training Examples > Semantic Textual Similarity <../../examples/training/sts/README.html>`_ @@ -39,22 +43,23 @@ def __init__(self, model: SentenceTransformer, loss_fct=nn.MSELoss(), cos_score_ Example: :: - from sentence_transformers import SentenceTransformer, InputExample, losses - from torch.utils.data import DataLoader + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['My first sentence', 'My second sentence'], label=0.8), - InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3) - ] - train_batch_size = 1 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.CosineSimilarityLoss(model=model) + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "score": [1.0, 0.3], + }) + loss = losses.CosineSimilarityLoss(model) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(CosineSimilarityLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py index 9f13698a6..cdc35cb85 100644 --- a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py +++ b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py @@ -1,5 +1,5 @@ from torch import nn, Tensor -from typing import Iterable, Dict +from typing import Iterable, Dict, Optional from sentence_transformers import SentenceTransformer from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, PreTrainedModel import logging @@ -8,7 +8,9 @@ class DenoisingAutoEncoderLoss(nn.Module): - def __init__(self, model: SentenceTransformer, decoder_name_or_path: str = None, tie_encoder_decoder: bool = True): + def __init__( + self, model: SentenceTransformer, decoder_name_or_path: Optional[str] = None, tie_encoder_decoder: bool = True + ) -> None: """ This loss expects as input pairs of damaged sentences and the corresponding original ones. During training, the decoder reconstructs the original sentences from the encoded sentence embeddings.
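A minimal sketch of wiring this loss into the new trainer, mirroring the updated examples elsewhere in this patch; the "damaged"/"original" column names are illustrative assumptions (the loss consumes the two text columns in order: damaged text first, original text second): ::

    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
    from datasets import Dataset

    model = SentenceTransformer("microsoft/mpnet-base")
    # Column order matters: the damaged (noisy) text first, the original text second.
    train_dataset = Dataset.from_dict({
        "damaged": ["It's nice outside today.", "He drove work."],
        "original": ["It's nice weather outside today.", "He drove to work."],
    })
    # With tie_encoder_decoder=True (the default), the decoder parameters are tied to the encoder's.
    loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
    trainer.train()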
@@ -21,9 +23,10 @@ def __init__(self, model: SentenceTransformer, decoder_name_or_path: str = None, The data generation process (i.e. the 'damaging' process) has already been implemented in ``DenoisingAutoEncoderDataset``, allowing you to only provide regular sentences. - :param model: SentenceTransformer model - :param decoder_name_or_path: Model name or path for initializing a decoder (compatible with Huggingface's Transformers) - :param tie_encoder_decoder: whether to tie the trainable parameters of encoder and decoder + Args: + model (SentenceTransformer): The SentenceTransformer model. + decoder_name_or_path (str, optional): Model name or path for initializing a decoder (compatible with Huggingface's Transformers). Defaults to None. + tie_encoder_decoder (bool): Whether to tie the trainable parameters of encoder and decoder. Defaults to True. References: * TSDAE paper: https://arxiv.org/pdf/2104.06979.pdf diff --git a/sentence_transformers/losses/GISTEmbedLoss.py b/sentence_transformers/losses/GISTEmbedLoss.py index ff8d3d288..6c719d511 100644 --- a/sentence_transformers/losses/GISTEmbedLoss.py +++ b/sentence_transformers/losses/GISTEmbedLoss.py @@ -18,9 +18,13 @@ def __init__( in-batch negative sample selection. The cosine similarity is used to compute the loss and the temperature parameter is used to scale the cosine similarities. - :param model: SentenceTransformer model based on a `transformers` model. - :param guide: SentenceTransformer model to guide the in-batch negative sample selection. - :param temperature: Temperature parameter to scale the cosine similarities. + Args: + model: SentenceTransformer model based on a `transformers` + model. + guide: SentenceTransformer model to guide the in-batch + negative sample selection. + temperature: Temperature parameter to scale the cosine + similarities. 
References: - For further details, see: https://arxiv.org/abs/2402.16829 @@ -46,21 +50,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('all-MiniLM-L6-v2') - guide = SentenceTransformer('avsolatorio/GIST-small-Embedding-v0') - train_examples = [ - InputExample(texts=['The first query', 'The first positive passage', 'The first negative passage']), - InputExample(texts=['The second query', 'The second positive passage', 'The second negative passage']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2) - train_loss = losses.GISTEmbedLoss(model=model, guide=guide) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + guide = SentenceTransformer("all-MiniLM-L6-v2") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.GISTEmbedLoss(model, guide) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(GISTEmbedLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/MSELoss.py b/sentence_transformers/losses/MSELoss.py index 798c7949f..c377a96e1 100644 --- a/sentence_transformers/losses/MSELoss.py +++ b/sentence_transformers/losses/MSELoss.py @@ -12,7 +12,8 @@ def __init__(self, model): For an example, see `the distillation documentation <../../examples/training/distillation/README.html>`_ on extending language models to new languages. 
- :param model: SentenceTransformerModel + Args: + model: SentenceTransformerModel References: - Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation: https://arxiv.org/abs/2004.09813 @@ -34,30 +35,33 @@ def __init__(self, model): | sentence_1, sentence_2, ..., sentence_N | model sentence embeddings | +-----------------------------------------+-----------------------------+ - Example:: - - from sentence_transformers import SentenceTransformer, InputExample, losses - from torch.utils.data import DataLoader - - model_en = SentenceTransformer('bert-base-cased') - model_fr = SentenceTransformer('flaubert/flaubert_base_cased') - - examples_en = ['The first sentence', 'The second sentence', 'The third sentence', 'The fourth sentence'] - examples_fr = ['La première phrase', 'La deuxième phrase', 'La troisième phrase', 'La quatrième phrase'] - train_batch_size = 2 - - labels_en_en = model_en.encode(examples_en) - examples_en_fr = [InputExample(texts=[x], label=labels_en_en[i]) for i, x in enumerate(examples_en)] - loader_en_fr = DataLoader(examples_en_fr, batch_size=train_batch_size) - - examples_fr_fr = [InputExample(texts=[x], label=labels_en_en[i]) for i, x in enumerate(examples_fr)] - loader_fr_fr = DataLoader(examples_fr_fr, batch_size=train_batch_size) - - train_loss = losses.MSELoss(model=model_fr) - model_fr.fit( - [(loader_en_fr, train_loss), (loader_fr_fr, train_loss)], - epochs=10, - ) + Example: + :: + + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + student_model = SentenceTransformer("microsoft/mpnet-base") + teacher_model = SentenceTransformer("all-mpnet-base-v2") + train_dataset = Dataset.from_dict({ + "english": ["The first sentence", "The second sentence", "The third sentence", "The fourth sentence"], + "french": ["La première phrase", "La deuxième phrase", "La troisième phrase", "La quatrième phrase"], + }) + + def compute_labels(batch): + return { + "label": teacher_model.encode(batch["english"]) + } + + train_dataset = train_dataset.map(compute_labels, batched=True) + loss = losses.MSELoss(student_model) + + trainer = SentenceTransformerTrainer( + model=student_model, + train_dataset=train_dataset, + loss=loss, + ) + trainer.train() """ super(MSELoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/MarginMSELoss.py b/sentence_transformers/losses/MarginMSELoss.py index 26e202fe3..44ab49710 100644 --- a/sentence_transformers/losses/MarginMSELoss.py +++ b/sentence_transformers/losses/MarginMSELoss.py @@ -15,8 +15,9 @@ def __init__(self, model, similarity_fct=util.pairwise_dot_score): with a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query only against two passages. - :param model: SentenceTransformerModel - :param similarity_fct: Which similarity function to use. + Args: + model: SentenceTransformerModel + similarity_fct: Which similarity function to use. References: - For more details, please refer to https://arxiv.org/abs/2010.02666. @@ -38,33 +39,72 @@ def __init__(self, model, similarity_fct=util.pairwise_dot_score): +-----------------------------------------------+-----------------------------------------------+ Example: + + With gold labels, e.g. if you have hard scores for sentences. Imagine you want a model to embed sentences + with similar "quality" close to each other. 
If the "text1" has quality 5 out of 5, "text2" has quality + 1 out of 5, and "text3" has quality 3 out of 5, then the similarity of a pair can be defined as the + difference of the quality scores. So, the similarity between "text1" and "text2" is 4, and the + similarity between "text1" and "text3" is 2. If we use this as our "Teacher Model", the label becomes + similraity("text1", "text2") - similarity("text1", "text3") = 4 - 2 = 2. + + Positive values denote that the first passage is more similar to the query than the second passage, + while negative values denote the opposite. + + :: + + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "text1": ["It's nice weather outside today.", "He drove to work."], + "text2": ["It's so sunny.", "He took the car to work."], + "text3": ["It's very sunny.", "She walked to the store."], + "label": [0.1, 0.8], + }) + loss = losses.MarginMSELoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, + ) + trainer.train() + + We can also use a teacher model to compute the similarity scores. In this case, we can use the teacher model + to compute the similarity scores and use them as the silver labels. This is often used in knowledge distillation. + :: - from sentence_transformers import SentenceTransformer, InputExample, losses - from sentence_transformers.util import pairwise_dot_score - from torch.utils.data import DataLoader - import torch - - student_model = SentenceTransformer('sentence-transformers/distilbert-base-nli-mean-tokens') - teacher_model = SentenceTransformer('sentence-transformers/bert-base-nli-stsb-mean-tokens') - - train_examples = [ - ['The first query', 'The first positive passage', 'The first negative passage'], - ['The second query', 'The second positive passage', 'The second negative passage'], - ['The third query', 'The third positive passage', 'The third negative passage'], - ] - train_batch_size = 1 - encoded = torch.tensor([teacher_model.encode(x).tolist() for x in train_examples]) - labels = pairwise_dot_score(encoded[:, 0], encoded[:, 1]) - pairwise_dot_score(encoded[:, 0], encoded[:, 2]) - - train_input_examples = [InputExample(texts=x, label=labels[i]) for i, x in enumerate(train_examples)] - train_dataloader = DataLoader(train_input_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.MarginMSELoss(model=student_model) - - student_model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + student_model = SentenceTransformer("microsoft/mpnet-base") + teacher_model = SentenceTransformer("all-mpnet-base-v2") + train_dataset = Dataset.from_dict({ + "query": ["It's nice weather outside today.", "He drove to work."], + "passage1": ["It's so sunny.", "He took the car to work."], + "passage2": ["It's very sunny.", "She walked to the store."], + }) + + def compute_labels(batch): + emb_queries = teacher_model.encode(batch["query"]) + emb_passages1 = teacher_model.encode(batch["passage1"]) + emb_passages2 = teacher_model.encode(batch["passage2"]) + return { + "label": teacher_model.similarity_pairwise(emb_queries, emb_passages1) - teacher_model.similarity_pairwise(emb_queries, emb_passages2) + } + + train_dataset = train_dataset.map(compute_labels, batched=True) + # In 
this example, the labels become -0.036 and 0.68, respectively + loss = losses.MarginMSELoss(student_model) + + trainer = SentenceTransformerTrainer( + model=student_model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(MarginMSELoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/Matryoshka2dLoss.py b/sentence_transformers/losses/Matryoshka2dLoss.py index da6f1512b..a3043da89 100644 --- a/sentence_transformers/losses/Matryoshka2dLoss.py +++ b/sentence_transformers/losses/Matryoshka2dLoss.py @@ -31,25 +31,41 @@ def __init__( Note, this uses `n_layers_per_step=1` and `n_dims_per_step=1` as default, following the original 2DMSE implementation. - :param model: SentenceTransformer model - :param loss: The loss function to be used, e.g. :class:`MultipleNegativesRankingLoss`, :class:`CoSENTLoss`, etc. - :param matryoshka_dims: A list of embedding dimensions to be used for the loss function, e.g. [768, 512, 256, 128, 64]. - :param matryoshka_weights: A list of weights to be used for the loss function, e.g. [1, 1, 1, 1, 1]. If None, then the - weights will be set to 1 for all dimensions. - :param n_layers_per_step: The number of layers to use per step. If -1, then all layers are used. If > 0, then - a random sample of n_layers_per_step layers are used per step. The 2DMSE paper uses `n_layers_per_step=1`. - The default value is -1. - :param n_dims_per_step: The number of dimensions to use per step. If -1, then all dimensions are used. If > 0, then - a random sample of n_dims_per_step dimensions are used per step. The default value is -1. - :param last_layer_weight: The weight to use for the loss of the final layer. Increase this to focus more on the - performance when using all layers. The default value is 1.0. - :param prior_layers_weight: The weight to use for the loss of the prior layers. Increase this to focus more on - the performance when using fewer layers. The default value is 1.0. - :param kl_div_weight: The weight to use for the KL-divergence loss that is used to make the prior layers match - that of the last layer. Increase this to focus more on the performance when using fewer layers. The default - value is 1.0. - :param kl_temperature: The temperature to use for the KL-divergence loss. If 0, then the KL-divergence loss is - not used. The default value is 1.0. + Args: + model: SentenceTransformer model + loss: The loss function to be used, e.g. + :class:`MultipleNegativesRankingLoss`, + :class:`CoSENTLoss`, etc. + matryoshka_dims: A list of embedding dimensions to be used + for the loss function, e.g. [768, 512, 256, 128, 64]. + matryoshka_weights: A list of weights to be used for the + loss function, e.g. [1, 1, 1, 1, 1]. If None, then the + weights will be set to 1 for all dimensions. + n_layers_per_step: The number of layers to use per step. If + -1, then all layers are used. If > 0, then a random + sample of n_layers_per_step layers are used per step. + The 2DMSE paper uses `n_layers_per_step=1`. The default + value is -1. + n_dims_per_step: The number of dimensions to use per step. + If -1, then all dimensions are used. If > 0, then a + random sample of n_dims_per_step dimensions are used per + step. The default value is -1. + last_layer_weight: The weight to use for the loss of the + final layer. Increase this to focus more on the + performance when using all layers. The default value is + 1.0. + prior_layers_weight: The weight to use for the loss of the + prior layers. 
Increase this to focus more on the + performance when using fewer layers. The default value + is 1.0. + kl_div_weight: The weight to use for the KL-divergence loss + that is used to make the prior layers match that of the + last layer. Increase this to focus more on the + performance when using fewer layers. The default value + is 1.0. + kl_temperature: The temperature to use for the KL-divergence + loss. If 0, then the KL-divergence loss is not used. The + default value is 1.0. References: - See the 2D Matryoshka Sentence Embeddings (2DMSE) paper: https://arxiv.org/abs/2402.14776 @@ -76,7 +92,7 @@ def __init__( from sentence_transformers import SentenceTransformer, losses, InputExample from torch.utils.data import DataLoader - model = SentenceTransformer('microsoft/mpnet-base') + model = SentenceTransformer("microsoft/mpnet-base") train_examples = [ InputExample(texts=['Anchor 1', 'Positive 1']), InputExample(texts=['Anchor 2', 'Positive 2']), diff --git a/sentence_transformers/losses/MatryoshkaLoss.py b/sentence_transformers/losses/MatryoshkaLoss.py index c866f1e08..acdac95f4 100644 --- a/sentence_transformers/losses/MatryoshkaLoss.py +++ b/sentence_transformers/losses/MatryoshkaLoss.py @@ -60,13 +60,20 @@ def __init__( different embedding dimensions. This is useful for when you want to train a model where users have the option to lower the embedding dimension to improve their embedding comparison speed and costs. - :param model: SentenceTransformer model - :param loss: The loss function to be used, e.g. :class:`MultipleNegativesRankingLoss`, :class:`CoSENTLoss`, etc. - :param matryoshka_dims: A list of embedding dimensions to be used for the loss function, e.g. [768, 512, 256, 128, 64]. - :param matryoshka_weights: A list of weights to be used for the loss function, e.g. [1, 1, 1, 1, 1]. If None, then the - weights will be set to 1 for all dimensions. - :param n_dims_per_step: The number of dimensions to use per step. If -1, then all dimensions are used. If > 0, then - a random sample of n_dims_per_step dimensions are used per step. The default value is -1. + Args: + model: SentenceTransformer model + loss: The loss function to be used, e.g. + :class:`MultipleNegativesRankingLoss`, + :class:`CoSENTLoss`, etc. + matryoshka_dims: A list of embedding dimensions to be used + for the loss function, e.g. [768, 512, 256, 128, 64]. + matryoshka_weights: A list of weights to be used for the + loss function, e.g. [1, 1, 1, 1, 1]. If None, then the + weights will be set to 1 for all dimensions. + n_dims_per_step: The number of dimensions to use per step. + If -1, then all dimensions are used. If > 0, then a + random sample of n_dims_per_step dimensions are used per + step. The default value is -1. References: - The concept was introduced in this paper: https://arxiv.org/abs/2205.13147 @@ -92,7 +99,7 @@ def __init__( from sentence_transformers import SentenceTransformer, losses, InputExample from torch.utils.data import DataLoader - model = SentenceTransformer('microsoft/mpnet-base') + model = SentenceTransformer("microsoft/mpnet-base") train_examples = [ InputExample(texts=['Anchor 1', 'Positive 1']), InputExample(texts=['Anchor 2', 'Positive 2']), diff --git a/sentence_transformers/losses/MegaBatchMarginLoss.py b/sentence_transformers/losses/MegaBatchMarginLoss.py index b63d05afd..dc63ba6d9 100644 --- a/sentence_transformers/losses/MegaBatchMarginLoss.py +++ b/sentence_transformers/losses/MegaBatchMarginLoss.py @@ -21,12 +21,18 @@ def __init__( Then train as with the triplet loss. 
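A hedged sketch of this loss under the same trainer pattern as the other updated examples; the dataset, margins, and mini_batch_size here are illustrative (per the docstring, mini_batch_size should divide the training batch size): ::

    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
    from datasets import Dataset

    model = SentenceTransformer("microsoft/mpnet-base")
    train_dataset = Dataset.from_dict({
        "anchor": ["It's nice weather outside today.", "He drove to work."],
        "positive": ["It's so sunny.", "He took the car to the office."],
    })
    # The training batch acts as the "mega-batch" from which the hardest negative is mined,
    # so in practice a large per-device batch size is what makes this loss effective.
    loss = losses.MegaBatchMarginLoss(model, positive_margin=0.8, negative_margin=0.3, mini_batch_size=4)

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
    trainer.train()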
- :param model: SentenceTransformerModel - :param positive_margin: Positive margin, cos(anchor, positive) should be > positive_margin - :param negative_margin: Negative margin, cos(anchor, negative) should be < negative_margin - :param use_mini_batched_version: As large batch sizes require a lot of memory, we can use a mini-batched version. - We break down the large batch into smaller batches with fewer examples. - :param mini_batch_size: Size for the mini-batches. Should be a devisor for the batch size in your data loader. + Args: + model: SentenceTransformerModel + positive_margin: Positive margin, cos(anchor, positive) + should be > positive_margin + negative_margin: Negative margin, cos(anchor, negative) + should be < negative_margin + use_mini_batched_version: As large batch sizes require a lot + of memory, we can use a mini-batched version. We break + down the large batch into smaller batches with fewer + examples. + mini_batch_size: Size for the mini-batches. Should be a + divisor for the batch size in your data loader. References: - This loss function was inspired by the ParaNMT paper: https://www.aclweb.org/anthology/P18-1042/ diff --git a/sentence_transformers/losses/MultipleNegativesRankingLoss.py b/sentence_transformers/losses/MultipleNegativesRankingLoss.py index 30c1fd163..78b03303c 100644 --- a/sentence_transformers/losses/MultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/MultipleNegativesRankingLoss.py @@ -24,9 +24,13 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f ``(a_1, p_1, n_1), (a_2, p_2, n_2)``. Then, ``n_1`` is a hard negative for ``(a_1, p_1)``. The loss will use for the pair ``(a_i, p_i)`` all ``p_j`` for ``j != i`` and all ``n_j`` as negatives. - :param model: SentenceTransformer model - :param scale: Output of similarity function is multiplied by scale value - :param similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1) + Args: + model: SentenceTransformer model + scale: Output of similarity function is multiplied by scale + value + similarity_fct: similarity function between sentence + embeddings. By default, cos_sim.
Can also be set to dot + product (and then set scale to 1) References: - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf @@ -60,20 +64,22 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-uncased') - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32) - train_loss = losses.MultipleNegativesRankingLoss(model=model) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.MultipleNegativesRankingLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(MultipleNegativesRankingLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py b/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py index 85b157128..979502dde 100644 --- a/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py +++ b/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py @@ -20,9 +20,13 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Note: If you pass triplets, the negative entry will be ignored. An anchor is just searched for the positive. - :param model: SentenceTransformer model - :param scale: Output of similarity function is multiplied by scale value - :param similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1) + Args: + model: SentenceTransformer model + scale: Output of similarity function is multiplied by scale + value + similarity_fct: similarity function between sentence + embeddings. By default, cos_sim. Can also be set to dot + product (and then set scale to 1) Requirements: 1.
(anchor, positive) pairs @@ -40,20 +44,22 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-uncased') - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32) - train_loss = losses.MultipleNegativesSymmetricRankingLoss(model=model) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.MultipleNegativesSymmetricRankingLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(MultipleNegativesSymmetricRankingLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/OnlineContrastiveLoss.py b/sentence_transformers/losses/OnlineContrastiveLoss.py index 1aeaa0618..d36e61ccf 100644 --- a/sentence_transformers/losses/OnlineContrastiveLoss.py +++ b/sentence_transformers/losses/OnlineContrastiveLoss.py @@ -14,9 +14,13 @@ def __init__( are far apart) and hard negative pairs (negatives that are close) and computes the loss only for these pairs. This loss often yields better performances than ContrastiveLoss. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used - :param margin: Negative samples (label == 0) should have a distance of at least the margin value. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used + margin: Negative samples (label == 0) should have a distance + of at least the margin value. 
References: - `Training Examples > Quora Duplicate Questions <../../examples/training/quora_duplicate_questions/README.html>`_ @@ -39,21 +43,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset - model = SentenceTransformer('all-MiniLM-L6-v2') - train_examples = [ - InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1), - InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0), - ] + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "label": [1, 0], + }) + loss = losses.OnlineContrastiveLoss(model) - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2) - train_loss = losses.OnlineContrastiveLoss(model=model) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(OnlineContrastiveLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/SoftmaxLoss.py b/sentence_transformers/losses/SoftmaxLoss.py index c8da160e6..eaf95d4e3 100644 --- a/sentence_transformers/losses/SoftmaxLoss.py +++ b/sentence_transformers/losses/SoftmaxLoss.py @@ -26,13 +26,14 @@ def __init__( :class:`MultipleNegativesRankingLoss` is an alternative loss function that often yields better results, as per https://arxiv.org/abs/2004.09813. - :param model: SentenceTransformer model - :param sentence_embedding_dimension: Dimension of your sentence embeddings - :param num_labels: Number of different labels - :param concatenation_sent_rep: Concatenate vectors u,v for the softmax classifier? - :param concatenation_sent_difference: Add abs(u-v) for the softmax classifier? - :param concatenation_sent_multiplication: Add u*v for the softmax classifier? - :param loss_fct: Optional: Custom pytorch loss function. If not set, uses nn.CrossEntropyLoss() + Args: + model (SentenceTransformer): The SentenceTransformer model. + sentence_embedding_dimension (int): The dimension of the sentence embeddings. + num_labels (int): The number of different labels. + concatenation_sent_rep (bool): Whether to concatenate vectors u,v for the softmax classifier. Defaults to True. + concatenation_sent_difference (bool): Whether to add abs(u-v) for the softmax classifier. Defaults to True. + concatenation_sent_multiplication (bool): Whether to add u*v for the softmax classifier. Defaults to False. + loss_fct (Callable): Custom pytorch loss function. If not set, uses nn.CrossEntropyLoss(). Defaults to nn.CrossEntropyLoss(). 
References:
        - Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks: https://arxiv.org/abs/1908.10084
@@ -51,29 +52,33 @@ def __init__(
        Example:
            ::

-                from sentence_transformers import SentenceTransformer, SentencesDataset, losses
-                from sentence_transformers.readers import InputExample
-                from torch.utils.data import DataLoader
-
-                model = SentenceTransformer('distilbert-base-nli-mean-tokens')
-                train_examples = [
-                    InputExample(texts=['First pair, sent A', 'First pair, sent B'], label=0),
-                    InputExample(texts=['Second pair, sent A', 'Second pair, sent B'], label=1),
-                    InputExample(texts=['Third pair, sent A', 'Third pair, sent B'], label=0),
-                    InputExample(texts=['Fourth pair, sent A', 'Fourth pair, sent B'], label=2),
-                ]
-                train_batch_size = 2
-                train_dataset = SentencesDataset(train_examples, model)
-                train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
-                train_loss = losses.SoftmaxLoss(
+                from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+                from datasets import Dataset
+
+                model = SentenceTransformer("microsoft/mpnet-base")
+                train_dataset = Dataset.from_dict({
+                    "sentence1": [
+                        "A person on a horse jumps over a broken down airplane.",
+                        "A person on a horse jumps over a broken down airplane.",
+                        "A person on a horse jumps over a broken down airplane.",
+                        "Children smiling and waving at camera",
+                    ],
+                    "sentence2": [
+                        "A person is training his horse for a competition.",
+                        "A person is at a diner, ordering an omelette.",
+                        "A person is outdoors, on a horse.",
+                        "There are children present.",
+                    ],
+                    "label": [1, 2, 0, 0],
+                })
+                loss = losses.SoftmaxLoss(model, model.get_sentence_embedding_dimension(), num_labels=3)
+
+                trainer = SentenceTransformerTrainer(
                    model=model,
-                    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
-                    num_labels=len(set(x.label for x in train_examples))
-                )
-                model.fit(
-                    [(train_dataloader, train_loss)],
-                    epochs=10,
+                    train_dataset=train_dataset,
+                    loss=loss,
                )
+                trainer.train()
        """
        super(SoftmaxLoss, self).__init__()
        self.model = model
diff --git a/sentence_transformers/losses/TripletLoss.py b/sentence_transformers/losses/TripletLoss.py
index 1768bc2b7..de44db228 100644
--- a/sentence_transformers/losses/TripletLoss.py
+++ b/sentence_transformers/losses/TripletLoss.py
@@ -6,9 +6,7 @@


class TripletDistanceMetric(Enum):
-    """
-    The metric for the triplet loss
-    """
+    """The metric for the triplet loss"""

    COSINE = lambda x, y: 1 - F.cosine_similarity(x, y)
    EUCLIDEAN = lambda x, y: F.pairwise_distance(x, y, p=2)
@@ -28,10 +26,13 @@ def __init__(

        Margin is an important hyperparameter and needs to be tuned respectively.

-        :param model: SentenceTransformerModel
-        :param distance_metric: Function to compute distance between two embeddings. The class TripletDistanceMetric
-            contains common distance metrices that can be used.
-        :param triplet_margin: The negative should be at least this much further away from the anchor than the positive.
+        Args:
+            model: SentenceTransformerModel
+            distance_metric: Function to compute distance between two
+                embeddings. The class TripletDistanceMetric contains
+                common distance metrics that can be used.
+            triplet_margin: The negative should be at least this much
+                further away from the anchor than the positive.
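As a sketch of how ``distance_metric`` and ``triplet_margin`` interact: with ``TripletDistanceMetric.COSINE`` distances lie in [0, 2], so a smaller margin is appropriate than with Euclidean distance. The values below are illustrative, not recommendations::

    from sentence_transformers import SentenceTransformer, losses
    from sentence_transformers.losses import TripletDistanceMetric

    model = SentenceTransformer("microsoft/mpnet-base")
    loss = losses.TripletLoss(
        model=model,
        distance_metric=TripletDistanceMetric.COSINE,  # 1 - cosine_similarity
        triplet_margin=0.3,  # negative must be >= 0.3 further from the anchor than the positive
    )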
References:
        - For further details, see: https://en.wikipedia.org/wiki/Triplet_loss
@@ -49,23 +50,23 @@ def __init__(
        Example:
            ::

-                from sentence_transformers import SentenceTransformer, SentencesDataset, losses
-                from sentence_transformers.readers import InputExample
-                from torch.utils.data import DataLoader
-
-                model = SentenceTransformer('distilbert-base-nli-mean-tokens')
-                train_examples = [
-                    InputExample(texts=['Anchor 1', 'Positive 1', 'Negative 1']),
-                    InputExample(texts=['Anchor 2', 'Positive 2', 'Negative 2']),
-                ]
-                train_batch_size = 1
-                train_dataset = SentencesDataset(train_examples, model)
-                train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
-                train_loss = losses.TripletLoss(model=model)
-                model.fit(
-                    [(train_dataloader, train_loss)],
-                    epochs=10,
+                from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+                from datasets import Dataset
+
+                model = SentenceTransformer("microsoft/mpnet-base")
+                train_dataset = Dataset.from_dict({
+                    "anchor": ["It's nice weather outside today.", "He drove to work."],
+                    "positive": ["It's so sunny.", "He took the car to the office."],
+                    "negative": ["It's quite rainy, sadly.", "She walked to the store."],
+                })
+                loss = losses.TripletLoss(model=model)
+
+                trainer = SentenceTransformerTrainer(
+                    model=model,
+                    train_dataset=train_dataset,
+                    loss=loss,
                )
+                trainer.train()
        """
        super(TripletLoss, self).__init__()
        self.model = model
diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py
index ff9f68fcb..f41cefefa 100644
--- a/sentence_transformers/model_card.py
+++ b/sentence_transformers/model_card.py
@@ -241,12 +241,10 @@ class SentenceTransformerModelCardData(CardData):
            e.g. ["sentence-transformers", "sentence-similarity", "feature-extraction"].
        generate_widget_examples (`bool`): Whether to generate widget examples on every model save.
-
+
    .. tip::

-        Install [``codecarbon``](https://github.com/mlco2/codecarbon) to automatically track carbon emission usage and
-        include it in your model cards.
-
-
+        Install `codecarbon <https://github.com/mlco2/codecarbon>`_ to automatically track carbon emission usage and
+        include it in your model cards.

    Example::
diff --git a/sentence_transformers/models/Asym.py b/sentence_transformers/models/Asym.py
index ba9c2bc5b..a84911d50 100644
--- a/sentence_transformers/models/Asym.py
+++ b/sentence_transformers/models/Asym.py
@@ -29,9 +29,13 @@ def __init__(self, sub_modules: Dict[str, List[nn.Module]], allow_empty_key: boo
            #You can train it with InputExample like this. Note, that the order must always be the same:
            train_example = InputExample(texts=[{'query': 'Train query'}, {'doc': 'Document'}], label=1)
-
-        :param sub_modules: Dict in the format str -> List[models]. The models in the specified list will be applied for input marked with the respective key.
-        :param allow_empty_key: If true, inputs without a key can be processed. If false, an exception will be thrown if no key is specified.
+        Args:
+            sub_modules: Dict in the format str -> List[models]. The
+                models in the specified list will be applied for input
+                marked with the respective key.
+            allow_empty_key: If true, inputs without a key can be
+                processed. If false, an exception will be thrown if no
+                key is specified.
""" self.sub_modules = sub_modules self.allow_empty_key = allow_empty_key @@ -91,9 +95,7 @@ def save(self, output_path): ) def tokenize(self, texts: Union[List[str], List[Tuple[str, str]]], **kwargs): - """ - Tokenizes a text and maps tokens to token-ids - """ + """Tokenizes a text and maps tokens to token-ids""" if not isinstance(texts[0], dict): raise AttributeError("Asym. model requires that texts are passed as dicts: {'key': 'text'}") diff --git a/sentence_transformers/models/Dense.py b/sentence_transformers/models/Dense.py index 6592e6877..bfd50e5e5 100644 --- a/sentence_transformers/models/Dense.py +++ b/sentence_transformers/models/Dense.py @@ -8,16 +8,19 @@ class Dense(nn.Module): - """Feed-forward function with activiation function. + """ + Feed-forward function with activiation function. This layer takes a fixed-sized sentence embedding and passes it through a feed-forward layer. Can be used to generate deep averaging networks (DAN). - :param in_features: Size of the input dimension - :param out_features: Output size - :param bias: Add a bias vector - :param activation_function: Pytorch activation function applied on output - :param init_weight: Initial value for the matrix of the linear layer - :param init_bias: Initial value for the bias of the linear layer + Args: + in_features: Size of the input dimension + out_features: Output size + bias: Add a bias vector + activation_function: Pytorch activation function applied on + output + init_weight: Initial value for the matrix of the linear layer + init_bias: Initial value for the bias of the linear layer """ def __init__( diff --git a/sentence_transformers/models/Dropout.py b/sentence_transformers/models/Dropout.py index 28514f720..ea353279d 100644 --- a/sentence_transformers/models/Dropout.py +++ b/sentence_transformers/models/Dropout.py @@ -8,7 +8,8 @@ class Dropout(nn.Module): """Dropout layer. - :param dropout: Sets a dropout value for dense layer. + Args: + dropout: Sets a dropout value for dense layer. """ def __init__(self, dropout: float = 0.2): diff --git a/sentence_transformers/models/LSTM.py b/sentence_transformers/models/LSTM.py index a3cee522d..bab555d17 100644 --- a/sentence_transformers/models/LSTM.py +++ b/sentence_transformers/models/LSTM.py @@ -6,9 +6,7 @@ class LSTM(nn.Module): - """ - Bidirectional LSTM running over word embeddings. - """ + """Bidirectional LSTM running over word embeddings.""" def __init__( self, diff --git a/sentence_transformers/models/Normalize.py b/sentence_transformers/models/Normalize.py index f9301a81e..337b92a72 100644 --- a/sentence_transformers/models/Normalize.py +++ b/sentence_transformers/models/Normalize.py @@ -5,9 +5,7 @@ class Normalize(nn.Module): - """ - This layer normalizes embeddings to unit length - """ + """This layer normalizes embeddings to unit length""" def __init__(self): super(Normalize, self).__init__() diff --git a/sentence_transformers/models/Pooling.py b/sentence_transformers/models/Pooling.py index 23b61962e..9cddc7e4f 100644 --- a/sentence_transformers/models/Pooling.py +++ b/sentence_transformers/models/Pooling.py @@ -7,20 +7,33 @@ class Pooling(nn.Module): - """Performs pooling (max or mean) on the token embeddings. + """ + Performs pooling (max or mean) on the token embeddings. Using pooling, it generates from a variable sized sentence a fixed sized sentence embedding. This layer also allows to use the CLS token if it is returned by the underlying word embedding model. You can concatenate multiple poolings together. 
- :param word_embedding_dimension: Dimensions for the word embeddings
-    :param pooling_mode: Either "cls", "lasttoken", "max", "mean", "mean_sqrt_len_tokens", or "weightedmean". If set, overwrites the other pooling_mode_* settings
-    :param pooling_mode_cls_token: Use the first token (CLS token) as text representations
-    :param pooling_mode_max_tokens: Use max in each dimension over all tokens.
-    :param pooling_mode_mean_tokens: Perform mean-pooling
-    :param pooling_mode_mean_sqrt_len_tokens: Perform mean-pooling, but divide by sqrt(input_length).
-    :param pooling_mode_weightedmean_tokens: Perform (position) weighted mean pooling. See `SGPT: GPT Sentence Embeddings for Semantic Search `_.
-    :param pooling_mode_lasttoken: Perform last token pooling. See `SGPT: GPT Sentence Embeddings for Semantic Search `_ and `Text and Code Embeddings by Contrastive Pre-Training `_.
+    Args:
+        word_embedding_dimension: Dimensions for the word embeddings
+        pooling_mode: Either "cls", "lasttoken", "max", "mean",
+            "mean_sqrt_len_tokens", or "weightedmean". If set,
+            overwrites the other pooling_mode_* settings
+        pooling_mode_cls_token: Use the first token (CLS token) as text
+            representations
+        pooling_mode_max_tokens: Use max in each dimension over all
+            tokens.
+        pooling_mode_mean_tokens: Perform mean-pooling
+        pooling_mode_mean_sqrt_len_tokens: Perform mean-pooling, but
+            divide by sqrt(input_length).
+        pooling_mode_weightedmean_tokens: Perform (position) weighted
+            mean pooling. See `SGPT: GPT Sentence Embeddings for
+            Semantic Search <https://arxiv.org/abs/2202.08904>`_.
+        pooling_mode_lasttoken: Perform last token pooling. See `SGPT:
+            GPT Sentence Embeddings for Semantic Search
+            <https://arxiv.org/abs/2202.08904>`_ and `Text and Code
+            Embeddings by Contrastive Pre-Training
+            <https://arxiv.org/abs/2201.10005>`_.
    """

    POOLING_MODES = (
@@ -98,9 +111,7 @@ def __repr__(self):
        return "Pooling({})".format(self.get_config_dict())

    def get_pooling_mode_str(self) -> str:
-        """
-        Returns the pooling mode as string
-        """
+        """Returns the pooling mode as string"""
        modes = []
        if self.pooling_mode_cls_token:
            modes.append("cls")
diff --git a/sentence_transformers/models/Transformer.py b/sentence_transformers/models/Transformer.py
index 866d34259..f9d94e2d1 100644
--- a/sentence_transformers/models/Transformer.py
+++ b/sentence_transformers/models/Transformer.py
@@ -9,14 +9,22 @@
class Transformer(nn.Module):
    """Huggingface AutoModel to generate token embeddings.
    Loads the correct class, e.g. BERT / RoBERTa etc.

-    :param model_name_or_path: Huggingface models name (https://huggingface.co/models)
-    :param max_seq_length: Truncate any inputs longer than max_seq_length
-    :param model_args: Keyword arguments passed to the Huggingface Transformers model
-    :param tokenizer_args: Keyword arguments passed to the Huggingface Transformers tokenizer
-    :param config_args: Keyword arguments passed to the Huggingface Transformers config
-    :param cache_dir: Cache dir for Huggingface Transformers to store/load models
-    :param do_lower_case: If true, lowercases the input (independent if the model is cased or not)
-    :param tokenizer_name_or_path: Name or path of the tokenizer.
When None, then model_name_or_path is used
+    Args:
+        model_name_or_path: Huggingface model name
+            (https://huggingface.co/models)
+        max_seq_length: Truncate any inputs longer than max_seq_length
+        model_args: Keyword arguments passed to the Huggingface
+            Transformers model
+        tokenizer_args: Keyword arguments passed to the Huggingface
+            Transformers tokenizer
+        config_args: Keyword arguments passed to the Huggingface
+            Transformers config
+        cache_dir: Cache dir for Huggingface Transformers to store/load
+            models
+        do_lower_case: If true, lowercases the input (regardless of
+            whether the model is cased or not)
+        tokenizer_name_or_path: Name or path of the tokenizer. When
+            None, then model_name_or_path is used
    """

    def __init__(
@@ -124,9 +132,7 @@ def get_word_embedding_dimension(self) -> int:
        return self.auto_model.config.hidden_size

    def tokenize(self, texts: Union[List[str], List[Dict], List[Tuple[str, str]]], padding: Union[str, bool] = True):
-        """
-        Tokenizes a text and maps tokens to token-ids
-        """
+        """Tokenizes a text and maps tokens to token-ids"""
        output = {}
        if isinstance(texts[0], str):
            to_tokenize = [texts]
diff --git a/sentence_transformers/models/WeightedLayerPooling.py b/sentence_transformers/models/WeightedLayerPooling.py
index d2c8fd92c..33d5f4406 100644
--- a/sentence_transformers/models/WeightedLayerPooling.py
+++ b/sentence_transformers/models/WeightedLayerPooling.py
@@ -7,9 +7,7 @@


class WeightedLayerPooling(nn.Module):
-    """
-    Token embeddings are weighted mean of their different hidden layer representations
-    """
+    """Token embeddings are weighted mean of their different hidden layer representations"""

    def __init__(
        self, word_embedding_dimension, num_hidden_layers: int = 12, layer_start: int = 4, layer_weights=None
diff --git a/sentence_transformers/models/WordWeights.py b/sentence_transformers/models/WordWeights.py
index 42d2c21a3..3e53738bd 100644
--- a/sentence_transformers/models/WordWeights.py
+++ b/sentence_transformers/models/WordWeights.py
@@ -15,13 +15,14 @@ class WordWeights(nn.Module):

    def __init__(self, vocab: List[str], word_weights: Dict[str, float], unknown_word_weight: float = 1):
        """
-
-        :param vocab:
-            Vocabulary of the tokenizer
-        :param word_weights:
-            Mapping of tokens to a float weight value. Words embeddings are multiplied by this float value. Tokens in word_weights must not be equal to the vocab (can contain more or less values)
-        :param unknown_word_weight:
-            Weight for words in vocab, that do not appear in the word_weights lookup. These can be for example rare words in the vocab, where no weight exists.
+        Initializes the WordWeights class.
+
+        Args:
+            vocab (List[str]): Vocabulary of the tokenizer.
+            word_weights (Dict[str, float]): Mapping of tokens to a float weight value. Word embeddings are multiplied
+                by this float value. Tokens in word_weights need not match the vocab exactly (it can contain more or fewer values).
+            unknown_word_weight (float, optional): Weight for words in vocab that do not appear in the word_weights lookup.
+                These can be, for example, rare words in the vocab where no weight exists. Defaults to 1.
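As a concrete illustration, idf-style weights can be plugged in directly; a small sketch (the vocabulary and weight values below are made up)::

    from sentence_transformers.models import WordWeights

    vocab = ["the", "quick", "brown", "fox"]
    word_weights = WordWeights(
        vocab=vocab,
        word_weights={"quick": 2.5, "brown": 3.1, "fox": 4.0},  # e.g. idf values
        unknown_word_weight=1.0,  # used for "the", which has no entry above
    )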
""" super(WordWeights, self).__init__() self.config_keys = ["vocab", "word_weights", "unknown_word_weight"] diff --git a/sentence_transformers/quantization.py b/sentence_transformers/quantization.py index 06aa7400a..d958b1a34 100644 --- a/sentence_transformers/quantization.py +++ b/sentence_transformers/quantization.py @@ -37,31 +37,53 @@ def semantic_search_faiss( Only if these conditions are true, will we search for `top_k * rescore_multiplier` samples and then rescore to only keep `top_k`. - :param query_embeddings: Embeddings of the query sentences. Ideally not quantized to allow for rescoring. - :param corpus_embeddings: Embeddings of the corpus sentences. Either `corpus_embeddings` or `corpus_index` should - be used, not both. The embeddings can be quantized to "int8" or "binary" for more efficient search. - :param corpus_index: FAISS index for the corpus sentences. Either `corpus_embeddings` or `corpus_index` should - be used, not both. - :param corpus_precision: Precision of the corpus embeddings. The options are "float32", "int8", or "binary". - Default is "float32". - :param top_k: Number of top results to retrieve. Default is 10. - :param ranges: Ranges for quantization of embeddings. This is only used for int8 quantization, where the ranges - refers to the minimum and maximum values for each dimension. So, it's a 2D array with shape (2, embedding_dim). - Default is None, which means that the ranges will be calculated from the calibration embeddings. - :param calibration_embeddings: Embeddings used for calibration during quantization. This is only used for int8 - quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum - values for each dimension. Default is None, which means that the ranges will be calculated from the query - embeddings. This is not recommended. - :param rescore: Whether to perform rescoring. Note that rescoring still will only be used if the query embeddings - are not quantized and the corpus is quantized, i.e. the corpus precision is not "float32". Default is True. - :param rescore_multiplier: Oversampling factor for rescoring. The code will now search `top_k * rescore_multiplier` samples - and then rescore to only keep `top_k`. Default is 2. - :param exact: Whether to use exact search or approximate search. Default is True. - :param output_index: Whether to output the FAISS index used for the search. Default is False. - - :return: A tuple containing a list of search results and the time taken for the search. If `output_index` is True, - the tuple will also contain the FAISS index used for the search. - :raises ValueError: If both `corpus_embeddings` and `corpus_index` are provided or if neither is provided. + Args: + query_embeddings: Embeddings of the query sentences. Ideally not + quantized to allow for rescoring. + corpus_embeddings: Embeddings of the corpus sentences. Either + `corpus_embeddings` or `corpus_index` should be used, not + both. The embeddings can be quantized to "int8" or "binary" + for more efficient search. + corpus_index: FAISS index for the corpus sentences. Either + `corpus_embeddings` or `corpus_index` should be used, not + both. + corpus_precision: Precision of the corpus embeddings. The + options are "float32", "int8", or "binary". Default is + "float32". + top_k: Number of top results to retrieve. Default is 10. + ranges: Ranges for quantization of embeddings. This is only used + for int8 quantization, where the ranges refers to the + minimum and maximum values for each dimension. 
So, it's a 2D + array with shape (2, embedding_dim). Default is None, which + means that the ranges will be calculated from the + calibration embeddings. + calibration_embeddings: Embeddings used for calibration during + quantization. This is only used for int8 quantization, where + the calibration embeddings can be used to compute ranges, + i.e. the minimum and maximum values for each dimension. + Default is None, which means that the ranges will be + calculated from the query embeddings. This is not + recommended. + rescore: Whether to perform rescoring. Note that rescoring still + will only be used if the query embeddings are not quantized + and the corpus is quantized, i.e. the corpus precision is + not "float32". Default is True. + rescore_multiplier: Oversampling factor for rescoring. The code + will now search `top_k * rescore_multiplier` samples and + then rescore to only keep `top_k`. Default is 2. + exact: Whether to use exact search or approximate search. + Default is True. + output_index: Whether to output the FAISS index used for the + search. Default is False. + + Returns: + A tuple containing a list of search results and the time taken + for the search. If `output_index` is True, the tuple will also + contain the FAISS index used for the search. + + Raises: + ValueError: If both `corpus_embeddings` and `corpus_index` are + provided or if neither is provided. The list of search results is in the format: [[{"corpus_id": int, "score": float}, ...], ...] The time taken for the search is a float value. @@ -182,31 +204,53 @@ def semantic_search_usearch( Only if these conditions are true, will we search for `top_k * rescore_multiplier` samples and then rescore to only keep `top_k`. - :param query_embeddings: Embeddings of the query sentences. Ideally not quantized to allow for rescoring. - :param corpus_embeddings: Embeddings of the corpus sentences. Either `corpus_embeddings` or `corpus_index` should - be used, not both. The embeddings can be quantized to "int8" or "binary" for more efficient search. - :param corpus_index: usearch index for the corpus sentences. Either `corpus_embeddings` or `corpus_index` should - be used, not both. - :param corpus_precision: Precision of the corpus embeddings. The options are "float32", "int8", or "binary". - Default is "float32". - :param top_k: Number of top results to retrieve. Default is 10. - :param ranges: Ranges for quantization of embeddings. This is only used for int8 quantization, where the ranges - refers to the minimum and maximum values for each dimension. So, it's a 2D array with shape (2, embedding_dim). - Default is None, which means that the ranges will be calculated from the calibration embeddings. - :param calibration_embeddings: Embeddings used for calibration during quantization. This is only used for int8 - quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum - values for each dimension. Default is None, which means that the ranges will be calculated from the query - embeddings. This is not recommended. - :param rescore: Whether to perform rescoring. Note that rescoring still will only be used if the query embeddings - are not quantized and the corpus is quantized, i.e. the corpus precision is not "float32". Default is True. - :param rescore_multiplier: Oversampling factor for rescoring. The code will now search `top_k * rescore_multiplier` samples - and then rescore to only keep `top_k`. Default is 2. - :param exact: Whether to use exact search or approximate search. 
Default is True. - :param output_index: Whether to output the usearch index used for the search. Default is False. - - :return: A tuple containing a list of search results and the time taken for the search. If `output_index` is True, - the tuple will also contain the usearch index used for the search. - :raises ValueError: If both `corpus_embeddings` and `corpus_index` are provided or if neither is provided. + Args: + query_embeddings: Embeddings of the query sentences. Ideally not + quantized to allow for rescoring. + corpus_embeddings: Embeddings of the corpus sentences. Either + `corpus_embeddings` or `corpus_index` should be used, not + both. The embeddings can be quantized to "int8" or "binary" + for more efficient search. + corpus_index: usearch index for the corpus sentences. Either + `corpus_embeddings` or `corpus_index` should be used, not + both. + corpus_precision: Precision of the corpus embeddings. The + options are "float32", "int8", or "binary". Default is + "float32". + top_k: Number of top results to retrieve. Default is 10. + ranges: Ranges for quantization of embeddings. This is only used + for int8 quantization, where the ranges refers to the + minimum and maximum values for each dimension. So, it's a 2D + array with shape (2, embedding_dim). Default is None, which + means that the ranges will be calculated from the + calibration embeddings. + calibration_embeddings: Embeddings used for calibration during + quantization. This is only used for int8 quantization, where + the calibration embeddings can be used to compute ranges, + i.e. the minimum and maximum values for each dimension. + Default is None, which means that the ranges will be + calculated from the query embeddings. This is not + recommended. + rescore: Whether to perform rescoring. Note that rescoring still + will only be used if the query embeddings are not quantized + and the corpus is quantized, i.e. the corpus precision is + not "float32". Default is True. + rescore_multiplier: Oversampling factor for rescoring. The code + will now search `top_k * rescore_multiplier` samples and + then rescore to only keep `top_k`. Default is 2. + exact: Whether to use exact search or approximate search. + Default is True. + output_index: Whether to output the usearch index used for the + search. Default is False. + + Returns: + A tuple containing a list of search results and the time taken + for the search. If `output_index` is True, the tuple will also + contain the usearch index used for the search. + + Raises: + ValueError: If both `corpus_embeddings` and `corpus_index` are + provided or if neither is provided. The list of search results is in the format: [[{"corpus_id": int, "score": float}, ...], ...] The time taken for the search is a float value. @@ -327,18 +371,27 @@ def quantize_embeddings( Quantizes embeddings to a lower precision. This can be used to reduce the memory footprint and increase the speed of similarity search. The supported precisions are "float32", "int8", "uint8", "binary", and "ubinary". - :param embeddings: Unquantized (e.g. float) embeddings with to quantize to a given precision - :param precision: The precision to convert to. Options are "float32", "int8", "uint8", "binary", "ubinary". - :param ranges: Ranges for quantization of embeddings. This is only used for int8 quantization, where the ranges - refers to the minimum and maximum values for each dimension. So, it's a 2D array with shape (2, embedding_dim). 
- Default is None, which means that the ranges will be calculated from the calibration embeddings.
-    :type ranges: Optional[np.ndarray]
-    :param calibration_embeddings: Embeddings used for calibration during quantization. This is only used for int8
-        quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum
-        values for each dimension. Default is None, which means that the ranges will be calculated from the query
-        embeddings. This is not recommended.
-    :type calibration_embeddings: Optional[np.ndarray]
-    :return: Quantized embeddings with the specified precision
+    Args:
+        embeddings: Unquantized (e.g. float) embeddings to quantize
+            to a given precision
+        precision: The precision to convert to. Options are "float32",
+            "int8", "uint8", "binary", "ubinary".
+        ranges (Optional[np.ndarray]): Ranges for quantization of
+            embeddings. This is only used for int8 quantization, where
+            the ranges refer to the minimum and maximum values for each
+            dimension. So, it's a 2D array with shape (2,
+            embedding_dim). Default is None, which means that the ranges
+            will be calculated from the calibration embeddings.
+        calibration_embeddings (Optional[np.ndarray]): Embeddings used
+            for calibration during quantization. This is only used for
+            int8 quantization, where the calibration embeddings can be
+            used to compute ranges, i.e. the minimum and maximum values
+            for each dimension. Default is None, which means that the
+            ranges will be calculated from the query embeddings. This is
+            not recommended.
+
+    Returns:
+        Quantized embeddings with the specified precision
    """
    if isinstance(embeddings, Tensor):
        embeddings = embeddings.cpu().numpy()
diff --git a/sentence_transformers/readers/InputExample.py b/sentence_transformers/readers/InputExample.py
index 80e93c56f..1e0f6bbd2 100644
--- a/sentence_transformers/readers/InputExample.py
+++ b/sentence_transformers/readers/InputExample.py
@@ -2,21 +2,16 @@


class InputExample:
-    """
-    Structure for one input example with texts, the label and a unique id
-    """
+    """Structure for one input example with texts, the label and a unique id"""

    def __init__(self, guid: str = "", texts: List[str] = None, label: Union[int, float] = 0):
        """
        Creates one InputExample with the given texts, guid and label
-
-        :param guid
-            id for the example
-        :param texts
-            the texts for the example.
-        :param label
-            the label for the example
+        Args:
+            guid: id for the example
+            texts: the texts for the example.
+            label: the label for the example
        """
        self.guid = guid
        self.texts = texts
diff --git a/sentence_transformers/readers/LabelSentenceReader.py b/sentence_transformers/readers/LabelSentenceReader.py
index 47e2c77eb..70b28c7ef 100644
--- a/sentence_transformers/readers/LabelSentenceReader.py
+++ b/sentence_transformers/readers/LabelSentenceReader.py
@@ -5,7 +5,8 @@
class LabelSentenceReader:
    """Reads in a file that has at least two columns: a label and a sentence.
    This reader can for example be used with the BatchHardTripletLoss.
- Maps labels automatically to integers"""
+    Maps labels automatically to integers
+    """

    def __init__(self, folder, label_col_idx=0, sentence_col_idx=1, separator="\t"):
        self.folder = folder
diff --git a/sentence_transformers/readers/NLIDataReader.py b/sentence_transformers/readers/NLIDataReader.py
index 2112ba290..2d78a5a8f 100644
--- a/sentence_transformers/readers/NLIDataReader.py
+++ b/sentence_transformers/readers/NLIDataReader.py
@@ -4,9 +4,7 @@


class NLIDataReader(object):
-    """
-    Reads in the Stanford NLI dataset and the MultiGenre NLI dataset
-    """
+    """Reads in the Stanford NLI dataset and the MultiGenre NLI dataset"""

    def __init__(self, dataset_folder):
        self.dataset_folder = dataset_folder
diff --git a/sentence_transformers/readers/PairedFilesReader.py b/sentence_transformers/readers/PairedFilesReader.py
index 9c1a94d86..2a1c16495 100644
--- a/sentence_transformers/readers/PairedFilesReader.py
+++ b/sentence_transformers/readers/PairedFilesReader.py
@@ -3,15 +3,12 @@


class PairedFilesReader(object):
-    """
-    Reads in the a Pair Dataset, split in two files
-    """
+    """Reads in a Pair Dataset, split in two files"""

    def __init__(self, filepaths):
        self.filepaths = filepaths

    def get_examples(self, max_examples=0):
-        """ """
        fIns = []
        for filepath in self.filepaths:
            fIn = (
diff --git a/sentence_transformers/readers/STSDataReader.py b/sentence_transformers/readers/STSDataReader.py
index 5d000282d..e9a6e7600 100644
--- a/sentence_transformers/readers/STSDataReader.py
+++ b/sentence_transformers/readers/STSDataReader.py
@@ -5,8 +5,7 @@


class STSDataReader:
-    """
-    Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx)
+    """Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx)

    Default values expects a tab separated file with the first & second column the sentence pair and third column the score (0...1).
    Default config normalizes scores from 0...5 to 0...1
    """
@@ -34,9 +33,7 @@ def __init__(
        self.max_score = max_score

    def get_examples(self, filename, max_examples=0):
-        """
-        filename specified which data split to use (train.csv, dev.csv, test.csv).
-        """
+        """filename specifies which data split to use (train.csv, dev.csv, test.csv)."""
        filepath = os.path.join(self.dataset_folder, filename)
        with gzip.open(filepath, "rt", encoding="utf8") if filename.endswith(".gz") else open(
            filepath, encoding="utf-8"
@@ -59,8 +56,7 @@ def get_examples(self, filename, max_examples=0):

class STSBenchmarkDataReader(STSDataReader):
-    """
-    Reader especially for the STS benchmark dataset. There, the sentences are in column 5 and 6, the score is in column 4.
+    """Reader especially for the STS benchmark dataset. There, the sentences are in columns 5 and 6, the score is in column 4.
Scores are normalized from 0...5 to 0...1
    """
diff --git a/sentence_transformers/readers/TripletReader.py b/sentence_transformers/readers/TripletReader.py
index 6045ef697..99e1ff0f2 100644
--- a/sentence_transformers/readers/TripletReader.py
+++ b/sentence_transformers/readers/TripletReader.py
@@ -4,8 +4,7 @@


class TripletReader(object):
-    """
-    Reads in the a Triplet Dataset: Each line contains (at least) 3 columns, one anchor column (s1),
+    """Reads in a Triplet Dataset: Each line contains (at least) 3 columns, one anchor column (s1),
    one positive example (s2) and one negative example (s3)
    """
@@ -28,7 +27,6 @@ def __init__(
        self.quoting = quoting

    def get_examples(self, filename, max_examples=0):
-        """ """
        data = csv.reader(
            open(os.path.join(self.dataset_folder, filename), encoding="utf-8"),
            delimiter=self.delimiter,
diff --git a/sentence_transformers/similarity_functions.py b/sentence_transformers/similarity_functions.py
index b97e7ec92..589d0404a 100644
--- a/sentence_transformers/similarity_functions.py
+++ b/sentence_transformers/similarity_functions.py
@@ -16,6 +16,15 @@


class SimilarityFunction(Enum):
+    """
+    Enum class for supported similarity functions. The following functions are supported:
+
+    - ``SimilarityFunction.COSINE`` (``"cosine"``): Cosine similarity
+    - ``SimilarityFunction.DOT_PRODUCT`` (``"dot"``, ``dot_product``): Dot product similarity
+    - ``SimilarityFunction.EUCLIDEAN`` (``"euclidean"``): Euclidean distance
+    - ``SimilarityFunction.MANHATTAN`` (``"manhattan"``): Manhattan distance
+    """
+
    COSINE = "cosine"
    DOT_PRODUCT = "dot"
    DOT = "dot"  # Alias for DOT_PRODUCT
@@ -26,6 +35,25 @@
    def to_similarity_fn(
        similarity_function: Union[str, "SimilarityFunction"],
    ) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]:
+        """
+        Converts a similarity function name or enum value to the corresponding similarity function.
+
+        Args:
+            similarity_function (Union[str, SimilarityFunction]): The name or enum value of the similarity function.
+
+        Returns:
+            Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]: The corresponding similarity function.
+
+        Raises:
+            ValueError: If the provided function is not supported.
+
+        Example:
+            >>> similarity_fn = SimilarityFunction.to_similarity_fn("cosine")
+            >>> similarity_scores = similarity_fn(embeddings1, embeddings2)
+            >>> similarity_scores
+            tensor([[0.3952, 0.0554],
+                    [0.0992, 0.1570]])
+        """
        similarity_function = SimilarityFunction(similarity_function)

        if similarity_function == SimilarityFunction.COSINE:
@@ -47,6 +75,28 @@
    def to_similarity_pairwise_fn(
        similarity_function: Union[str, "SimilarityFunction"],
    ) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]:
+        """
+        Converts a similarity function into a pairwise similarity function.
+
+        The pairwise similarity function returns the diagonal vector from the similarity matrix, i.e. it only
+        computes the similarity(a[i], b[i]) for each i in the range of the input tensors, rather than
+        computing the similarity between all pairs of a and b.
+
+        Args:
+            similarity_function (Union[str, SimilarityFunction]): The name or enum value of the similarity function.
+
+        Returns:
+            Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]: The pairwise similarity function.
+
+        Raises:
+            ValueError: If the provided similarity function is not supported.
+
+        Example:
+            >>> pairwise_fn = SimilarityFunction.to_similarity_pairwise_fn("cosine")
+            >>> similarity_scores = pairwise_fn(embeddings1, embeddings2)
+            >>> similarity_scores
+            tensor([0.3952, 0.1570])
+        """
        similarity_function = SimilarityFunction(similarity_function)

        if similarity_function == SimilarityFunction.COSINE:
@@ -66,4 +116,15 @@ def to_similarity_pairwise_fn(

    @staticmethod
    def possible_values():
+        """
+        Returns a list of possible values for the SimilarityFunction enum.
+
+        Returns:
+            list: A list of possible values for the SimilarityFunction enum.
+
+        Example:
+            >>> possible_values = SimilarityFunction.possible_values()
+            >>> possible_values
+            ['cosine', 'dot', 'euclidean', 'manhattan']
+        """
        return [m.value for m in SimilarityFunction]
diff --git a/sentence_transformers/trainer.py b/sentence_transformers/trainer.py
index 532e4c009..0620d5f38 100644
--- a/sentence_transformers/trainer.py
+++ b/sentence_transformers/trainer.py
@@ -43,6 +43,72 @@
class SentenceTransformerTrainer(Trainer):
+    """
+    SentenceTransformerTrainer is a simple but feature-complete training and eval loop for PyTorch
+    based on the 🤗 Transformers :class:`~transformers.Trainer`.
+
+    This trainer integrates support for various :class:`transformers.TrainerCallback` subclasses, such as:
+
+    - :class:`~transformers.integrations.WandbCallback` to automatically log training metrics to W&B if `wandb` is installed
+    - :class:`~transformers.integrations.TensorBoardCallback` to log training metrics to TensorBoard if `tensorboard` is accessible.
+    - :class:`~transformers.integrations.CodeCarbonCallback` to track the carbon emissions of your model during training if `codecarbon` is installed.
+
+        - Note: These carbon emissions will be included in your automatically generated model card.
+
+    See the Transformers `Callbacks <https://huggingface.co/docs/transformers/main_classes/callback>`_
+    documentation for more information on the integrated callbacks and how to write your own callbacks.
+
+    Args:
+        model (:class:`~sentence_transformers.SentenceTransformer`, *optional*):
+            The model to train, evaluate or use for predictions. If not provided, a `model_init` must be passed.
+        args (:class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`, *optional*):
+            The arguments to tweak for training. Will default to a basic instance of
+            :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` with the
+            `output_dir` set to a directory named *tmp_trainer* in the current directory if not provided.
+        train_dataset (Union[:class:`datasets.Dataset`, :class:`datasets.DatasetDict`, Dict[str, :class:`datasets.Dataset`]], *optional*):
+            The dataset to use for training. Must have a format accepted by your loss function, see
+            `Training Overview > Dataset Format <../../../docs/sentence_transformer/training_overview.html#dataset-format>`_.
+        eval_dataset (Union[:class:`datasets.Dataset`, :class:`datasets.DatasetDict`, Dict[str, :class:`datasets.Dataset`]], *optional*):
+            The dataset to use for evaluation. Must have a format accepted by your loss function, see
+            `Training Overview > Dataset Format <../../../docs/sentence_transformer/training_overview.html#dataset-format>`_.
+        loss (Optional[Union[:class:`torch.nn.Module`, Dict[str, :class:`torch.nn.Module`],\
+            Callable[[:class:`~sentence_transformers.SentenceTransformer`], :class:`torch.nn.Module`],\
+            Dict[str, Callable[[:class:`~sentence_transformers.SentenceTransformer`], :class:`torch.nn.Module`]]]], *optional*):
+            The loss function to use for training.
Can either be a loss class instance, a dictionary mapping dataset names to
+            loss class instances, a function that returns a loss class instance given a model, or a dictionary mapping
+            dataset names to functions that return a loss class instance given a model. In practice, the latter two
+            are primarily used for hyper-parameter optimization. Will default to
+            :class:`~sentence_transformers.losses.CoSENTLoss` if no ``loss`` is provided.
+        evaluator (:class:`~sentence_transformers.evaluation.SentenceEvaluator`, *optional*):
+            The evaluator class to use for evaluation alongside the evaluation dataset. An evaluator will display more
+            useful metrics than the loss function.
+        callbacks (List of [:class:`transformers.TrainerCallback`], *optional*):
+            A list of callbacks to customize the training loop. Will add those to the list of default callbacks
+            detailed in the Transformers `Callbacks <https://huggingface.co/docs/transformers/main_classes/callback>`_ documentation.
+
+            If you want to remove one of the default callbacks used, use the :meth:`~transformers.Trainer.remove_callback` method.
+        optimizers (`Tuple[:class:`torch.optim.Optimizer`, :class:`torch.optim.lr_scheduler.LambdaLR`]`, *optional*, defaults to `(None, None)`):
+            A tuple containing the optimizer and the scheduler to use. Will default to an instance of :class:`torch.optim.AdamW`
+            on your model and a scheduler given by :func:`transformers.get_linear_schedule_with_warmup` controlled by `args`.
+
+    Important attributes:
+
+        - **model** -- Always points to the core model. If using a transformers model, it will be a [`PreTrainedModel`]
+          subclass.
+        - **model_wrapped** -- Always points to the most external model in case one or more other modules wrap the
+          original model. This is the model that should be used for the forward pass. For example, under `DeepSpeed`,
+          the inner model is wrapped in `DeepSpeed` and then again in `torch.nn.DistributedDataParallel`. If the inner
+          model hasn't been wrapped, then `self.model_wrapped` is the same as `self.model`.
+        - **is_model_parallel** -- Whether or not a model has been switched to a model parallel mode (different from
+          data parallelism, this means some of the model layers are split on different GPUs).
+        - **place_model_on_device** -- Whether or not to automatically place the model on the device - it will be set
+          to `False` if model parallel or deepspeed is used, or if the default
+          `TrainingArguments.place_model_on_device` is overridden to return `False` .
+        - **is_in_train** -- Whether or not a model is currently running `train` (e.g. when `evaluate` is called while
+          in `train`)
+
+    """
+
    def __init__(
        self,
        model: Optional["SentenceTransformer"] = None,
@@ -207,7 +273,24 @@ def compute_loss(
        model: "SentenceTransformer",
        inputs: Dict[str, Union[torch.Tensor, Any]],
        return_outputs: bool = False,
-    ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
+    ) -> Union[torch.Tensor, Tuple[torch.Tensor, Dict[str, Any]]]:
+        """
+        Computes the loss for the SentenceTransformer model.
+
+        It uses ``self.loss`` to compute the loss, which can be a single loss function or a dictionary of loss functions
+        for different datasets. If the loss is a dictionary, the dataset name is expected to be passed in the inputs
+        under the key "dataset_name". This is done automatically in the ``add_dataset_name_column`` method.
+        Note that even if ``return_outputs = True``, the outputs will be empty, as the SentenceTransformers losses do not
+        return outputs.
+
+        Args:
+            model (SentenceTransformer): The SentenceTransformer model.
+            inputs (Dict[str, Union[torch.Tensor, Any]]): The input data for the model.
+ return_outputs (bool, optional): Whether to return the outputs along with the loss. Defaults to False. + + Returns: + Union[torch.Tensor, Tuple[torch.Tensor, Dict[str, Any]]]: The computed loss. If `return_outputs` is True, returns a tuple of loss and outputs. Otherwise, returns only the loss. + """ dataset_name = inputs.pop("dataset_name", None) features, labels = self.collect_features(inputs) loss_fn = self.loss @@ -634,3 +717,9 @@ def _load_from_checkpoint(self, checkpoint_path: str) -> None: loaded_model = SentenceTransformer(checkpoint_path) self.model.load_state_dict(loaded_model.state_dict()) + + def create_model_card(self, *args, **kwargs): + raise NotImplementedError( + "SentenceTransformers does not implement the `create_model_card` method in its Trainer. " + "Instead, consider calling SentenceTransformer._create_model_card(path)." + ) diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py index f28f0d359..5f4386768 100644 --- a/sentence_transformers/training_args.py +++ b/sentence_transformers/training_args.py @@ -7,16 +7,32 @@ class BatchSamplers(ExplicitEnum): """ Stores the acceptable string identifiers for batch samplers. + + The batch sampler is responsible for determining how samples are grouped into batches during training. + Valid options are: + + - ``BatchSamplers.BATCH_SAMPLER``: The default PyTorch batch sampler. + - ``BatchSamplers.NO_DUPLICATES``: Ensures no duplicate samples in a batch. + - ``BatchSamplers.GROUP_BY_LABEL``: Ensures each batch has 2+ samples from the same label. """ - BATCH_SAMPLER = "batch_sampler" # Just the default PyTorch batch sampler [default] - NO_DUPLICATES = "no_duplicates" # Ensures no duplicate samples in a batch - GROUP_BY_LABEL = "group_by_label" # Ensure each batch has 2+ samples from the same label + BATCH_SAMPLER = "batch_sampler" + NO_DUPLICATES = "no_duplicates" + GROUP_BY_LABEL = "group_by_label" class MultiDatasetBatchSamplers(ExplicitEnum): """ Stores the acceptable string identifiers for multi-dataset batch samplers. + + The multi-dataset batch sampler is responsible for determining in what order batches are sampled from multiple + datasets during training. Valid options are: + + - ``MultiDatasetBatchSamplers.ROUND_ROBIN``: Round-robin sampling from each dataset until one is exhausted. + With this strategy, it's likely that not all samples from each dataset are used, but each dataset is sampled + from equally. + - ``MultiDatasetBatchSamplers.PROPORTIONAL``: Sample from each dataset in proportion to its size [default]. + With this strategy, all samples from each dataset are used and larger datasets are sampled from more frequently. """ ROUND_ROBIN = "round_robin" # Round-robin sampling from each dataset @@ -25,6 +41,22 @@ class MultiDatasetBatchSamplers(ExplicitEnum): @dataclass class SentenceTransformerTrainingArguments(TransformersTrainingArguments): + """ + SentenceTransformerTrainingArguments extends :class:`~transformers.TrainingArguments` with additional arguments + specific to Sentence Transformers. See :class:`~transformers.TrainingArguments` for the complete list of + available arguments. + + Args: + output_dir (`str`): + The output directory where the model checkpoints will be written. + batch_sampler (Union[:class:`~sentence_transformers.training_args.BatchSamplers`, `str`], *optional*): + The batch sampler to use. See :class:`~sentence_transformers.training_args.BatchSamplers` for valid options. + Defaults to ``BatchSamplers.BATCH_SAMPLER``. 
+ multi_dataset_batch_sampler (Union[:class:`~sentence_transformers.training_args.MultiDatasetBatchSamplers`, `str`], *optional*): + The multi-dataset batch sampler to use. See :class:`~sentence_transformers.training_args.MultiDatasetBatchSamplers` + for valid options. Defaults to ``MultiDatasetBatchSamplers.PROPORTIONAL``. + """ + batch_sampler: Union[BatchSamplers, str] = field( default=BatchSamplers.BATCH_SAMPLER, metadata={"help": "The batch sampler to use."} ) diff --git a/sentence_transformers/util.py b/sentence_transformers/util.py index 28b2376e9..30dbeaa61 100644 --- a/sentence_transformers/util.py +++ b/sentence_transformers/util.py @@ -21,18 +21,45 @@ def _convert_to_tensor(a: Union[list, np.ndarray, Tensor]) -> Tensor: + """ + Converts the input `a` to a PyTorch tensor if it is not already a tensor. + + Args: + a (Union[list, np.ndarray, Tensor]): The input array or tensor. + + Returns: + Tensor: The converted tensor. + """ if not isinstance(a, Tensor): a = torch.tensor(a) return a def _convert_to_batch(a: Tensor) -> Tensor: + """ + If the tensor `a` is 1-dimensional, it is unsqueezed to add a batch dimension. + + Args: + a (Tensor): The input tensor. + + Returns: + Tensor: The tensor with a batch dimension. + """ if a.dim() == 1: a = a.unsqueeze(0) return a def _convert_to_batch_tensor(a: Union[list, np.ndarray, Tensor]) -> Tensor: + """ + Converts the input data to a tensor with a batch dimension. + + Args: + a (Union[list, np.ndarray, Tensor]): The input data to be converted. + + Returns: + Tensor: The converted tensor with a batch dimension. + """ a = _convert_to_tensor(a) a = _convert_to_batch(a) return a @@ -40,18 +67,28 @@ def _convert_to_batch_tensor(a: Union[list, np.ndarray, Tensor]) -> Tensor: def pytorch_cos_sim(a: Tensor, b: Tensor) -> Tensor: """ - Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. + Computes the cosine similarity between two tensors. - :return: Matrix with res[i][j] = cos_sim(a[i], b[j]) + Args: + a (Union[list, np.ndarray, Tensor]): The first tensor. + b (Union[list, np.ndarray, Tensor]): The second tensor. + + Returns: + Tensor: Matrix with res[i][j] = cos_sim(a[i], b[j]) """ return cos_sim(a, b) def cos_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]) -> Tensor: """ - Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. + Computes the cosine similarity between two tensors. + + Args: + a (Union[list, np.ndarray, Tensor]): The first tensor. + b (Union[list, np.ndarray, Tensor]): The second tensor. - :return: Matrix with res[i][j] = cos_sim(a[i], b[j]) + Returns: + Tensor: Matrix with res[i][j] = cos_sim(a[i], b[j]) """ a = _convert_to_batch_tensor(a) b = _convert_to_batch_tensor(b) @@ -63,9 +100,14 @@ def cos_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tenso def pairwise_cos_sim(a: Tensor, b: Tensor) -> Tensor: """ - Computes the pairwise cossim cos_sim(a[i], b[i]) + Computes the pairwise cosine similarity cos_sim(a[i], b[i]). - :return: Vector with res[i] = cos_sim(a[i], b[i]) + Args: + a (Union[list, np.ndarray, Tensor]): The first tensor. + b (Union[list, np.ndarray, Tensor]): The second tensor. + + Returns: + Tensor: Vector with res[i] = cos_sim(a[i], b[i]) """ a = _convert_to_tensor(a) b = _convert_to_tensor(b) @@ -77,7 +119,12 @@ def dot_score(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Ten """ Computes the dot-product dot_prod(a[i], b[j]) for all i and j. 
- :return: Matrix with res[i][j] = dot_prod(a[i], b[j])
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.
+
+    Returns:
+        Tensor: Matrix with res[i][j] = dot_prod(a[i], b[j])
    """
    a = _convert_to_batch_tensor(a)
    b = _convert_to_batch_tensor(b)
@@ -87,9 +134,14 @@
def pairwise_dot_score(a: Tensor, b: Tensor) -> Tensor:
    """
-    Computes the pairwise dot-product dot_prod(a[i], b[i])
+    Computes the pairwise dot-product dot_prod(a[i], b[i]).

-    :return: Vector with res[i] = dot_prod(a[i], b[i])
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.
+
+    Returns:
+        Tensor: Vector with res[i] = dot_prod(a[i], b[i])
    """
    a = _convert_to_tensor(a)
    b = _convert_to_tensor(b)
@@ -99,9 +151,14 @@ def pairwise_dot_score(a: Tensor, b: Tensor) -> Tensor:
def manhattan_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]) -> Tensor:
    """
-    Computes the manhattan similarity manhattan_sim(a[i], b[j]) for all i and j.
+    Computes the manhattan similarity (i.e., negative distance) between two tensors.
+
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.

-    :return: Matrix with res[i][j] = manhattan_sim(a[i], b[j])
+    Returns:
+        Tensor: Matrix with res[i][j] = -manhattan_distance(a[i], b[j])
    """
    a = _convert_to_batch_tensor(a)
    b = _convert_to_batch_tensor(b)
@@ -111,9 +168,14 @@
def pairwise_manhattan_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]):
    """
-    Computes the negative manhattan distance.
+    Computes the manhattan similarity (i.e., negative distance) between pairs of tensors.

-    :return: Vector with res[i] = -manhattan_distance(a[i], b[i])
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.
+
+    Returns:
+        Tensor: Vector with res[i] = -manhattan_distance(a[i], b[i])
    """
    a = _convert_to_tensor(a)
    b = _convert_to_tensor(b)
@@ -123,9 +185,14 @@
def euclidean_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]) -> Tensor:
    """
-    Computes the euclidean similarity euclidean_sim(a[i], b[j]) for all i and j.
+    Computes the euclidean similarity (i.e., negative distance) between two tensors.
+
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.

-    :return: Matrix with res[i][j] = euclidean_sim(a[i], b[j])
+    Returns:
+        Tensor: Matrix with res[i][j] = -euclidean_distance(a[i], b[j])
    """
    a = _convert_to_batch_tensor(a)
    b = _convert_to_batch_tensor(b)
@@ -135,9 +202,14 @@
def pairwise_euclidean_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]):
    """
-    Computes the negative euclidean distance.
+    Computes the euclidean similarity (i.e., negative distance) between pairs of tensors.

-    :return: Vector with res[i] = -euclidean(a[i], b[i])
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.
+
+    Returns:
+        Tensor: Vector with res[i] = -euclidean_distance(a[i], b[i])
    """
    a = _convert_to_tensor(a)
    b = _convert_to_tensor(b)
@@ -147,11 +219,15 @@
def pairwise_angle_sim(x: Tensor, y: Tensor) -> Tensor:
    """
-    Computes the absolute normalized angle distance;
-    see AnglELoss or https://arxiv.org/abs/2309.12871v1
-    for more information.
+    Computes the absolute normalized angle distance. See :class:`~sentence_transformers.losses.AnglELoss`
+    or https://arxiv.org/abs/2309.12871v1 for more information.
+
+    Args:
+        x (Tensor): The first tensor.
+        y (Tensor): The second tensor.

-    :return: Vector with res[i] = angle_sim(a[i], b[i])
+    Returns:
+        Tensor: Vector with res[i] = angle_sim(a[i], b[i])
    """

    x = _convert_to_tensor(x)
@@ -177,7 +253,13 @@ def pairwise_angle_sim(x: Tensor, y: Tensor) -> Tensor:
def normalize_embeddings(embeddings: Tensor) -> Tensor:
    """
-    Normalizes the embeddings matrix, so that each sentence embedding has unit length
+    Normalizes the embeddings matrix, so that each sentence embedding has unit length.
+
+    Args:
+        embeddings (Tensor): The input embeddings matrix.
+
+    Returns:
+        Tensor: The normalized embeddings matrix.
    """
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)
@@ -194,30 +276,64 @@
def truncate_embeddings(
    embeddings: Union[np.ndarray, torch.Tensor], truncate_dim: Optional[int]
) -> Union[np.ndarray, torch.Tensor]:
    """
-    :param embeddings: Embeddings to truncate.
-    :param truncate_dim: The dimension to truncate sentence embeddings to. `None` does no truncation.
-    :return: Truncated embeddings.
+    Truncates the embeddings matrix.
+
+    Args:
+        embeddings (Union[np.ndarray, torch.Tensor]): Embeddings to truncate.
+        truncate_dim (Optional[int]): The dimension to truncate sentence embeddings to. `None` does no truncation.
+
+    Example:
+        >>> from sentence_transformers import SentenceTransformer
+        >>> from sentence_transformers.util import truncate_embeddings
+        >>> model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka")
+        >>> embeddings = model.encode(["It's so nice outside!", "Today is a beautiful day.", "He drove to work earlier"])
+        >>> embeddings.shape
+        (3, 768)
+        >>> model.similarity(embeddings, embeddings)
+        tensor([[1.0000, 0.8100, 0.1426],
+                [0.8100, 1.0000, 0.2121],
+                [0.1426, 0.2121, 1.0000]])
+        >>> truncated_embeddings = truncate_embeddings(embeddings, 128)
+        >>> truncated_embeddings.shape
+        (3, 128)
+        >>> model.similarity(truncated_embeddings, truncated_embeddings)
+        tensor([[1.0000, 0.8092, 0.1987],
+                [0.8092, 1.0000, 0.2716],
+                [0.1987, 0.2716, 1.0000]])
+
+    Returns:
+        Union[np.ndarray, torch.Tensor]: Truncated embeddings.
    """
    return embeddings[..., :truncate_dim]


def paraphrase_mining(
-    model, sentences: List[str], show_progress_bar: bool = False, batch_size: int = 32, *args, **kwargs
+    model,
+    sentences: List[str],
+    show_progress_bar: bool = False,
+    batch_size: int = 32,
+    query_chunk_size: int = 5000,
+    corpus_chunk_size: int = 100000,
+    max_pairs: int = 500000,
+    top_k: int = 100,
+    score_function: Callable[[Tensor, Tensor], Tensor] = cos_sim,
) -> List[List[Union[float, int]]]:
    """
    Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all
    other sentences and returns a list with the pairs that have the highest cosine similarity score.
- :param model: SentenceTransformer model for embedding computation - :param sentences: A list of strings (texts or sentences) - :param show_progress_bar: Plotting of a progress bar - :param batch_size: Number of texts that are encoded simultaneously by the model - :param query_chunk_size: Search for most similar pairs for #query_chunk_size at the same time. Decrease, to lower memory footprint (increases run-time). - :param corpus_chunk_size: Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease, to lower memory footprint (increases run-time). - :param max_pairs: Maximal number of text pairs returned. - :param top_k: For each sentence, we retrieve up to top_k other sentences - :param score_function: Function for computing scores. By default, cosine similarity. - :return: Returns a list of triplets with the format [score, id1, id2] + Args: + model (SentenceTransformer): SentenceTransformer model for embedding computation + sentences (List[str]): A list of strings (texts or sentences) + show_progress_bar (bool, optional): Whether to show a progress bar. Defaults to False. + batch_size (int, optional): Number of texts that are encoded simultaneously by the model. Defaults to 32. + query_chunk_size (int, optional): Search for most similar pairs for #query_chunk_size at the same time. Decrease to lower memory footprint (increases run-time). Defaults to 5000. + corpus_chunk_size (int, optional): Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease to lower memory footprint (increases run-time). Defaults to 100000. + max_pairs (int, optional): Maximal number of text pairs returned. Defaults to 500000. + top_k (int, optional): For each sentence, we retrieve up to top_k other sentences. Defaults to 100. + score_function (Callable[[Tensor, Tensor], Tensor], optional): Function for computing scores. Defaults to cos_sim (cosine similarity). + + Returns: + List[List[Union[float, int]]]: Returns a list of triplets with the format [score, id1, id2] """ # Compute embedding for the sentences @@ -225,7 +341,14 @@ def paraphrase_mining( sentences, show_progress_bar=show_progress_bar, batch_size=batch_size, convert_to_tensor=True ) - return paraphrase_mining_embeddings(embeddings, *args, **kwargs) + return paraphrase_mining_embeddings( + embeddings, + query_chunk_size=query_chunk_size, + corpus_chunk_size=corpus_chunk_size, + max_pairs=max_pairs, + top_k=top_k, + score_function=score_function, + ) def paraphrase_mining_embeddings( @@ -240,13 +363,16 @@ def paraphrase_mining_embeddings( Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score. - :param embeddings: A tensor with the embeddings - :param query_chunk_size: Search for most similar pairs for #query_chunk_size at the same time. Decrease, to lower memory footprint (increases run-time). - :param corpus_chunk_size: Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease, to lower memory footprint (increases run-time). - :param max_pairs: Maximal number of text pairs returned. - :param top_k: For each sentence, we retrieve up to top_k other sentences - :param score_function: Function for computing scores. By default, cosine similarity.
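A quick usage sketch for paraphrase_mining with the now-explicit keyword arguments (the model name and printed score are illustrative, not part of the patch):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sits outside",
    "A cat is sitting outdoors",
    "The new movie is awesome",
]
# Each entry is [score, id1, id2], sorted by decreasing score.
pairs = paraphrase_mining(model, sentences, top_k=2)
print(pairs[0])  # e.g. [0.89, 0, 1] -- the two cat sentences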
- :return: Returns a list of triplets with the format [score, id1, id2] + Args: + embeddings (Tensor): A tensor with the embeddings + query_chunk_size (int): Search for most similar pairs for #query_chunk_size at the same time. Decrease to lower memory footprint (increases run-time). + corpus_chunk_size (int): Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease to lower memory footprint (increases run-time). + max_pairs (int): Maximal number of text pairs returned. + top_k (int): For each sentence, we retrieve up to top_k other sentences + score_function (Callable[[Tensor, Tensor], Tensor]): Function for computing scores. By default, cosine similarity. + + Returns: + List[List[Union[float, int]]]: Returns a list of triplets with the format [score, id1, id2] """ top_k += 1  # A sentence has the highest similarity to itself. Increase +1 as we are interested in distinct pairs @@ -315,13 +441,16 @@ def semantic_search( This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries. - :param query_embeddings: A 2 dimensional tensor with the query embeddings. - :param corpus_embeddings: A 2 dimensional tensor with the corpus embeddings. - :param query_chunk_size: Process 100 queries simultaneously. Increasing that value increases the speed, but requires more memory. - :param corpus_chunk_size: Scans the corpus 100k entries at a time. Increasing that value increases the speed, but requires more memory. - :param top_k: Retrieve top k matching entries. - :param score_function: Function for computing scores. By default, cosine similarity. - :return: Returns a list with one entry for each query. Each entry is a list of dictionaries with the keys 'corpus_id' and 'score', sorted by decreasing cosine similarity scores. + Args: + query_embeddings (Tensor): A 2 dimensional tensor with the query embeddings. + corpus_embeddings (Tensor): A 2 dimensional tensor with the corpus embeddings. + query_chunk_size (int, optional): Process query_chunk_size queries simultaneously. Increasing that value increases the speed, but requires more memory. Defaults to 100. + corpus_chunk_size (int, optional): Scans the corpus in chunks of corpus_chunk_size entries at a time. Increasing that value increases the speed, but requires more memory. Defaults to 500000. + top_k (int, optional): Retrieve top k matching entries. Defaults to 10. + score_function (Callable[[Tensor, Tensor], Tensor], optional): Function for computing scores. By default, cosine similarity. + + Returns: + List[List[Dict[str, Union[int, float]]]]: A list with one entry for each query. Each entry is a list of dictionaries with the keys 'corpus_id' and 'score', sorted by decreasing cosine similarity scores. """ if isinstance(query_embeddings, (np.ndarray, np.generic)): @@ -382,7 +511,17 @@ def semantic_search( def http_get(url, path) -> None: """ - Downloads a URL to a given path on disc + Downloads a URL to a given path on disk. + + Args: + url (str): The URL to download. + path (str): The path to save the downloaded file. + + Raises: + requests.HTTPError: If the HTTP request returns a non-200 status code.
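A usage sketch for semantic_search as documented above (the model name is illustrative; scores depend on the model):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = model.encode(
    ["Berlin is in Germany", "Paris is in France", "I like pizza"], convert_to_tensor=True
)
query_embeddings = model.encode(["Where is Berlin?"], convert_to_tensor=True)

# One result list per query; each hit is {'corpus_id': ..., 'score': ...},
# sorted by decreasing score.
hits = semantic_search(query_embeddings, corpus_embeddings, top_k=2)
print(hits[0][0]["corpus_id"])  # likely 0, i.e. the Berlin sentence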
+ + Returns: + None """ if os.path.dirname(path) != "": os.makedirs(os.path.dirname(path), exist_ok=True) @@ -409,7 +548,14 @@ def http_get(url, path) -> None: def batch_to_device(batch, target_device: device): """ - send a pytorch batch to a device (CPU/GPU) + Send a PyTorch batch (i.e., a dictionary of string keys to Tensors) to a device (e.g. "cpu", "cuda", "mps"). + + Args: + batch (Dict[str, Tensor]): The batch to send to the device. + target_device (torch.device): The target device (e.g. "cpu", "cuda", "mps"). + + Returns: + Dict[str, Tensor]: The batch with tensors sent to the target device. """ for key in batch: if isinstance(batch[key], Tensor): @@ -421,6 +567,21 @@ def fullname(o) -> str: """ Gives a full name (package_name.class_name) for a class / object in Python. Will be used to load the correct classes from JSON files + + Args: + o: The object for which to get the full name. + + Returns: + str: The full name of the object. + + Example: + >>> from sentence_transformers.losses import MultipleNegativesRankingLoss + >>> from sentence_transformers import SentenceTransformer + >>> from sentence_transformers.util import fullname + >>> model = SentenceTransformer('all-MiniLM-L6-v2') + >>> loss = MultipleNegativesRankingLoss(model) + >>> fullname(loss) + 'sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss' """ module = o.__class__.__module__ @@ -434,6 +595,19 @@ def fullname(o) -> str: def import_from_string(dotted_path): """ Import a dotted module path and return the attribute/class designated by the last name in the path. Raise ImportError if the import failed. + + Args: + dotted_path (str): The dotted module path. + + Returns: + Any: The attribute/class designated by the last name in the path. + + Raises: + ImportError: If the import failed. + + Example: + >>> import_from_string('sentence_transformers.losses.MultipleNegativesRankingLoss') + <class 'sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss'> """ try: module_path, class_name = dotted_path.rsplit(".", 1) @@ -454,13 +628,28 @@ def import_from_string(dotted_path): def community_detection( - embeddings, threshold=0.75, min_community_size=10, batch_size=1024, show_progress_bar=False + embeddings: Union[torch.Tensor, np.ndarray], + threshold: float = 0.75, + min_community_size: int = 10, + batch_size: int = 1024, + show_progress_bar: bool = False, ) -> List[List[int]]: """ - Function for Fast Community Detection + Function for Fast Community Detection. + Finds in the embeddings all communities, i.e. embeddings that are close (closer than threshold). Returns only communities that are larger than min_community_size. The communities are returned in decreasing order. The first element in each list is the central point in the community. + + Args: + embeddings (torch.Tensor or numpy.ndarray): The input embeddings. + threshold (float): The threshold for determining if two embeddings are close. Defaults to 0.75. + min_community_size (int): The minimum size of a community to be considered. Defaults to 10. + batch_size (int): The batch size for computing cosine similarity scores. Defaults to 1024. + show_progress_bar (bool): Whether to show a progress bar during computation. Defaults to False. + + Returns: + List[List[int]]: A list of communities, where each community is represented as a list of indices. """ if not isinstance(embeddings, torch.Tensor): embeddings = torch.tensor(embeddings) @@ -571,7 +760,8 @@ def disable_logging(highest_level=logging.CRITICAL): A context manager that will prevent any logging messages triggered during the body from being processed.
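A small sketch of community_detection as documented above (the repeated sentences simply guarantee groups of the minimum size; model name and threshold are illustrative):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import community_detection

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How do I bake bread?", "Bread baking tips", "What is the capital of France?"] * 5
embeddings = model.encode(sentences, convert_to_tensor=True)

# Communities of at least 5 embeddings with cosine similarity above 0.7;
# within each returned list of indices, the first index is the central point.
communities = community_detection(embeddings, threshold=0.7, min_community_size=5)
print(communities)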
- :param highest_level: the maximum logging level allowed. + Args: + highest_level: The maximum logging level allowed. """ previous_level = logging.root.manager.disable @@ -591,6 +781,19 @@ def is_sentence_transformer_model( revision: Optional[str] = None, local_files_only: bool = False, ) -> bool: + """ + Checks if the given model name or path corresponds to a SentenceTransformer model. + + Args: + model_name_or_path (str): The name or path of the model. + token (Optional[Union[bool, str]]): The token to be used for authentication. Defaults to None. + cache_folder (Optional[str]): The folder to cache the model files. Defaults to None. + revision (Optional[str]): The revision of the model. Defaults to None. + local_files_only (bool): Whether to only use local files for the model. Defaults to False. + + Returns: + bool: True if the model is a SentenceTransformer model, False otherwise. + """ return bool( load_file_path( model_name_or_path, @@ -611,6 +814,20 @@ def load_file_path( revision: Optional[str] = None, local_files_only: bool = False, ) -> Optional[str]: + """ + Loads a file from a local or remote location. + + Args: + model_name_or_path (str): The model name or path. + filename (str): The name of the file to load. + token (Optional[Union[bool, str]]): The token to access the remote file (if applicable). + cache_folder (Optional[str]): The folder to cache the downloaded file (if applicable). + revision (Optional[str], optional): The revision of the file (if applicable). Defaults to None. + local_files_only (bool, optional): Whether to only consider local files. Defaults to False. + + Returns: + Optional[str]: The path to the loaded file, or None if the file could not be found or loaded. + """ # If file is local file_path = os.path.join(model_name_or_path, filename) if os.path.exists(file_path): @@ -639,6 +856,20 @@ def load_dir_path( revision: Optional[str] = None, local_files_only: bool = False, ) -> Optional[str]: + """ + Loads the directory path for a given model name or path. + + Args: + model_name_or_path (str): The name or path of the model. + directory (str): The directory to load. + token (Optional[Union[bool, str]]): The token for authentication. + cache_folder (Optional[str]): The folder to cache the downloaded files. + revision (Optional[str], optional): The revision of the model. Defaults to None. + local_files_only (bool, optional): Whether to only use local files. Defaults to False. + + Returns: + Optional[str]: The directory path if it exists, otherwise None. + """ # If file is local dir_path = os.path.join(model_name_or_path, directory) if os.path.exists(dir_path): @@ -687,11 +918,12 @@ def wrapper(self, *args, **kwargs): def get_device_name() -> Literal["mps", "cuda", "npu", "hpu", "cpu"]: """ Returns the name of the device where this module is running on. - It's simple implementation that doesn't cover cases when more powerful GPUs are available and - not a primary device ('cuda:0') or MPS device is available, but not configured properly: - https://pytorch.org/docs/master/notes/mps.html - :return: Device name, like 'cuda' or 'cpu' + It's a simple implementation that doesn't cover cases such as when more powerful GPUs are available but are + not the primary device ('cuda:0'), or when an MPS device is available but not configured properly.
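A brief sketch combining get_device_name and batch_to_device from this file (the token ids are made up):

import torch

from sentence_transformers.util import batch_to_device, get_device_name

device = get_device_name()  # e.g. "cuda" on a machine with an NVIDIA GPU, otherwise "mps"/"cpu"
batch = {
    "input_ids": torch.tensor([[101, 2023, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1]]),
}
batch = batch_to_device(batch, torch.device(device))
print(batch["input_ids"].device)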
+ + Returns: + str: Device name, like 'cuda' or 'cpu' """ if torch.cuda.is_available(): return "cuda" diff --git a/setup.py b/setup.py index 593033260..86b8717bd 100644 --- a/setup.py +++ b/setup.py @@ -42,6 +42,10 @@ "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3.8", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", "Topic :: Scientific/Engineering :: Artificial Intelligence", ], keywords="Transformer Networks BERT XLNet sentence embedding PyTorch NLP deep learning", diff --git a/tests/test_compute_embeddings.py b/tests/test_compute_embeddings.py index 3ffc51982..d301367b7 100644 --- a/tests/test_compute_embeddings.py +++ b/tests/test_compute_embeddings.py @@ -10,7 +10,6 @@ def test_encode_token_embeddings(paraphrase_distilroberta_base_v1_model: SentenceTransformer) -> None: """ Test that encode(output_value='token_embeddings') works - :return: """ model = paraphrase_distilroberta_base_v1_model sent = [ From 551feeb532126f214a64702420fc777e1cf37452 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Fri, 24 May 2024 14:40:44 +0200 Subject: [PATCH 22/39] [`v3`] Chore - include import sorting in ruff (#2672) * Include import sorting in ruff * Remove deprecated ignore-init-module-imports * Remove --select I from ruff.toml again after CI issues --- docs/_themes/sphinx_rtd_theme/__init__.py | 1 - docs/conf.py | 6 ++- .../applications/clustering/agglomerative.py | 3 +- .../clustering/fast_clustering.py | 4 +- examples/applications/clustering/kmeans.py | 3 +- .../computing_embeddings.py | 6 ++- .../computing_embeddings_multi_gpu.py | 3 +- .../computing_embeddings_streaming.py | 6 ++- .../cross-encoder/cross-encoder_reranking.py | 6 +-- .../cross-encoder/cross-encoder_usage.py | 3 +- .../semantic_search_faiss.py | 3 +- .../semantic_search_faiss_benchmark.py | 2 +- .../semantic_search_recommended.py | 7 +-- .../semantic_search_usearch.py | 3 +- .../semantic_search_usearch_benchmark.py | 2 +- examples/applications/image-search/example.py | 4 +- .../parallel-sentence-mining/bitext_mining.py | 10 +++-- .../bitext_mining_utils.py | 7 +-- .../parallel-sentence-mining/bucc2018.py | 12 ++--- .../in_document_search_crossencoder.py | 5 ++- .../semantic-search/semantic_search.py | 3 +- .../semantic_search_publications.py | 1 + .../semantic_search_quora_annoy.py | 5 ++- .../semantic_search_quora_elasticsearch.py | 9 ++-- .../semantic_search_quora_faiss.py | 5 ++- .../semantic_search_quora_hnswlib.py | 5 ++- .../semantic_search_quora_pytorch.py | 4 +- .../semantic_search_wikipedia_qa.py | 8 ++-- .../text-summarization/LexRank.py | 3 +- .../text-summarization/text-summarization.py | 2 +- .../evaluation/evaluation_inference_speed.py | 4 +- .../evaluation/evaluation_stsbenchmark.py | 9 ++-- .../evaluation_translation_matching.py | 6 +-- .../adaptive_layer/adaptive_layer_nli.py | 16 ++++--- .../adaptive_layer/adaptive_layer_sts.py | 15 ++++--- ...aining_stsbenchmark_avg_word_embeddings.py | 9 ++-- .../training_stsbenchmark_bilstm.py | 9 ++-- .../training_stsbenchmark_bow.py | 13 +++--- .../training_stsbenchmark_cnn.py | 10 ++--- ...ing_stsbenchmark_tf-idf_word_embeddings.py | 13 +++--- .../training/cross-encoder/training_nli.py | 14 +++--- .../training_quora_duplicate_questions.py | 14 +++--- .../cross-encoder/training_stsbenchmark.py | 17 +++---- 
.../train_sts_indomain_bm25.py | 20 ++++----- .../train_sts_indomain_nlpaug.py | 15 ++++--- .../train_sts_indomain_semantic.py | 23 +++++----- .../train_sts_qqp_crossdomain.py | 23 +++++----- .../train_sts_seed_optimization.py | 18 ++++---- .../distillation/dimensionality_reduction.py | 7 +-- .../distillation/model_distillation.py | 13 +++--- .../model_distillation_layer_reduction.py | 11 +++-- .../distillation/model_quantization.py | 12 ++--- examples/training/hpo/hpo_nli.py | 10 +++-- .../training/matryoshka/2d_matryoshka_nli.py | 16 ++++--- .../training/matryoshka/2d_matryoshka_sts.py | 15 ++++--- .../matryoshka/matryoshka_eval_stsb.py | 7 +-- .../training/matryoshka/matryoshka_nli.py | 16 ++++--- .../matryoshka/matryoshka_nli_reduced_dim.py | 17 ++++--- .../training/matryoshka/matryoshka_sts.py | 15 ++++--- .../ms_marco/eval_cross-encoder-trec-dl.py | 12 ++--- examples/training/ms_marco/eval_msmarco.py | 5 ++- .../multilingual/translate_queries.py | 8 ++-- .../ms_marco/train_bi-encoder_margin-mse.py | 21 ++++----- .../ms_marco/train_bi-encoder_mnrl.py | 17 +++---- .../ms_marco/train_cross-encoder_kd.py | 15 ++++--- .../ms_marco/train_cross-encoder_scratch.py | 15 ++++--- .../multilingual/get_parallel_data_opus.py | 2 +- .../multilingual/get_parallel_data_talks.py | 7 +-- .../multilingual/get_parallel_data_tatoeba.py | 5 ++- .../get_parallel_data_wikimatrix.py | 4 +- .../multilingual/make_multilingual.py | 9 ++-- examples/training/nli/training_nli.py | 11 +++-- examples/training/nli/training_nli_v2.py | 11 +++-- examples/training/nli/training_nli_v3.py | 11 +++-- .../other/training_batch_hard_trec.py | 16 +++---- .../training/other/training_multi-task.py | 7 +-- .../other/training_wikipedia_sections.py | 7 +-- examples/training/paraphrases/training.py | 8 ++-- .../create_splits.py | 6 +-- .../training_MultipleNegativesRankingLoss.py | 7 +-- .../training_OnlineContrastiveLoss.py | 7 +-- .../training_multi-task-learning.py | 7 +-- .../training/sts/training_stsbenchmark.py | 8 ++-- ...training_stsbenchmark_continue_training.py | 8 ++-- .../CT/train_askubuntu_ct.py | 7 +-- .../CT/train_ct_from_file.py | 11 ++--- .../unsupervised_learning/CT/train_stsb_ct.py | 15 ++++--- .../train_askubuntu_ct-improved.py | 7 +-- .../train_ct-improved_from_file.py | 11 ++--- .../train_stsb_ct-improved.py | 13 +++--- .../unsupervised_learning/MLM/train_mlm.py | 14 ++++-- .../SimCSE/train_askubuntu_simcse.py | 8 ++-- .../SimCSE/train_simcse_from_file.py | 13 +++--- .../SimCSE/train_stsb_simcse.py | 17 +++---- .../TSDAE/eval_askubuntu.py | 6 +-- .../TSDAE/train_askubuntu_tsdae.py | 11 ++--- .../TSDAE/train_stsb_tsdae.py | 15 ++++--- .../TSDAE/train_tsdae_from_file.py | 11 ++--- .../1_programming_query_generation.py | 8 ++-- .../2_programming_train_bi-encoder.py | 2 +- .../3_programming_semantic_search.py | 3 +- .../example_query_generation.py | 7 +-- ruff.toml | 1 - sentence_transformers/LoggingHandler.py | 1 + sentence_transformers/SentenceTransformer.py | 42 +++++++++--------- sentence_transformers/__init__.py | 21 +++++---- .../cross_encoder/CrossEncoder.py | 22 ++++------ .../evaluation/CEBinaryAccuracyEvaluator.py | 5 ++- .../CEBinaryClassificationEvaluator.py | 12 ++--- .../evaluation/CECorrelationEvaluator.py | 9 ++-- .../cross_encoder/evaluation/CEF1Evaluator.py | 4 +- .../evaluation/CERerankingEvaluator.py | 5 ++- .../evaluation/CESoftmaxAccuracyEvaluator.py | 5 ++- .../cross_encoder/evaluation/__init__.py | 4 +- .../datasets/DenoisingAutoEncoderDataset.py | 10 +++-- 
.../datasets/NoDuplicatesDataLoader.py | 2 +- .../datasets/ParallelSentencesDataset.py | 11 ++--- .../datasets/SentenceLabelDataset.py | 10 +++-- .../datasets/SentencesDataset.py | 8 ++-- sentence_transformers/datasets/__init__.py | 2 +- .../BinaryClassificationEvaluator.py | 24 +++++----- .../EmbeddingSimilarityEvaluator.py | 24 +++++----- .../InformationRetrievalEvaluator.py | 24 +++++----- .../evaluation/LabelAccuracyEvaluator.py | 19 ++++---- .../evaluation/MSEEvaluator.py | 14 +++--- .../evaluation/MSEEvaluatorFromDataFrame.py | 19 ++++---- .../evaluation/ParaphraseMiningEvaluator.py | 17 +++---- .../evaluation/RerankingEvaluator.py | 22 ++++++---- .../evaluation/SentenceEvaluator.py | 7 +-- .../evaluation/SequentialEvaluator.py | 11 +++-- .../evaluation/TranslationEvaluator.py | 17 ++++--- .../evaluation/TripletEvaluator.py | 21 +++++---- sentence_transformers/evaluation/__init__.py | 6 +-- sentence_transformers/fit_mixin.py | 22 +++++----- .../losses/AdaptiveLayerLoss.py | 8 ++-- sentence_transformers/losses/AnglELoss.py | 2 +- .../losses/BatchAllTripletLoss.py | 9 ++-- .../losses/BatchHardSoftMarginTripletLoss.py | 7 ++- .../losses/BatchHardTripletLoss.py | 6 ++- .../losses/BatchSemiHardTripletLoss.py | 9 ++-- .../losses/CachedGISTEmbedLoss.py | 9 ++-- .../CachedMultipleNegativesRankingLoss.py | 12 ++--- sentence_transformers/losses/CoSENTLoss.py | 10 +++-- .../losses/ContrastiveLoss.py | 6 ++- .../losses/ContrastiveTensionLoss.py | 15 ++++--- .../losses/CosineSimilarityLoss.py | 7 +-- .../losses/DenoisingAutoEncoderLoss.py | 10 +++-- sentence_transformers/losses/GISTEmbedLoss.py | 8 ++-- sentence_transformers/losses/MSELoss.py | 5 ++- sentence_transformers/losses/MarginMSELoss.py | 8 ++-- .../losses/Matryoshka2dLoss.py | 4 +- .../losses/MatryoshkaLoss.py | 6 ++- .../losses/MegaBatchMarginLoss.py | 8 ++-- .../losses/MultipleNegativesRankingLoss.py | 10 +++-- .../MultipleNegativesSymmetricRankingLoss.py | 10 +++-- .../losses/OnlineContrastiveLoss.py | 9 ++-- sentence_transformers/losses/SoftmaxLoss.py | 9 ++-- sentence_transformers/losses/TripletLoss.py | 10 +++-- sentence_transformers/losses/__init__.py | 44 +++++++++---------- sentence_transformers/model_card.py | 29 ++++++------ sentence_transformers/models/Asym.py | 11 ++--- sentence_transformers/models/BoW.py | 12 ++--- sentence_transformers/models/CLIPModel.py | 5 ++- sentence_transformers/models/CNN.py | 7 +-- sentence_transformers/models/Dense.py | 13 +++--- sentence_transformers/models/Dropout.py | 8 ++-- sentence_transformers/models/LSTM.py | 7 +-- sentence_transformers/models/LayerNorm.py | 10 ++--- sentence_transformers/models/Normalize.py | 4 +- sentence_transformers/models/Pooling.py | 10 ++--- sentence_transformers/models/Transformer.py | 7 +-- .../models/WeightedLayerPooling.py | 10 ++--- .../models/WordEmbeddings.py | 18 ++++---- sentence_transformers/models/WordWeights.py | 9 ++-- sentence_transformers/models/__init__.py | 4 +- .../models/tokenizer/PhraseTokenizer.py | 11 ++--- .../models/tokenizer/WhitespaceTokenizer.py | 9 ++-- .../models/tokenizer/WordTokenizer.py | 2 +- .../models/tokenizer/__init__.py | 4 +- sentence_transformers/quantization.py | 11 +++-- sentence_transformers/readers/InputExample.py | 2 +- .../readers/LabelSentenceReader.py | 3 +- .../readers/NLIDataReader.py | 3 +- .../readers/PairedFilesReader.py | 3 +- .../readers/STSDataReader.py | 3 +- .../readers/TripletReader.py | 3 +- sentence_transformers/readers/__init__.py | 2 +- sentence_transformers/sampler.py | 7 +-- 
sentence_transformers/similarity_functions.py | 9 ++-- sentence_transformers/trainer.py | 32 +++++++------- sentence_transformers/training_args.py | 1 + sentence_transformers/util.py | 23 +++++----- setup.py | 2 +- tests/conftest.py | 7 +-- tests/test_cmnrl.py | 8 ++-- tests/test_cross_encoder.py | 4 +- tests/test_image_embeddings.py | 2 +- tests/test_model_card_data.py | 4 +- tests/test_multi_process.py | 3 +- tests/test_sentence_transformer.py | 13 +++--- tests/test_trainer.py | 6 ++- 201 files changed, 1078 insertions(+), 866 deletions(-) diff --git a/docs/_themes/sphinx_rtd_theme/__init__.py b/docs/_themes/sphinx_rtd_theme/__init__.py index e9ae9ccc6..0f739cce4 100644 --- a/docs/_themes/sphinx_rtd_theme/__init__.py +++ b/docs/_themes/sphinx_rtd_theme/__init__.py @@ -8,7 +8,6 @@ import sphinx - __version__ = "0.5.0" __version_full__ = __version__ diff --git a/docs/conf.py b/docs/conf.py index ba2e61f0b..1d3ad9bb0 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -14,10 +14,12 @@ # import sys # sys.path.insert(0, os.path.abspath('.')) -from recommonmark.transform import AutoStructify +import datetime import os + +from recommonmark.transform import AutoStructify from sphinx.domains import Domain -import datetime + # -- Project information ----------------------------------------------------- project = "Sentence-Transformers" diff --git a/examples/applications/clustering/agglomerative.py b/examples/applications/clustering/agglomerative.py index b5f449396..bb914f514 100644 --- a/examples/applications/clustering/agglomerative.py +++ b/examples/applications/clustering/agglomerative.py @@ -4,9 +4,10 @@ Sentences are mapped to sentence embeddings and then agglomerative clustering with a threshold is applied. """ -from sentence_transformers import SentenceTransformer from sklearn.cluster import AgglomerativeClustering +from sentence_transformers import SentenceTransformer + embedder = SentenceTransformer("all-MiniLM-L6-v2") # Corpus with example sentences diff --git a/examples/applications/clustering/fast_clustering.py b/examples/applications/clustering/fast_clustering.py index de67238ae..eb1d0d191 100644 --- a/examples/applications/clustering/fast_clustering.py +++ b/examples/applications/clustering/fast_clustering.py @@ -12,11 +12,11 @@ In this example, we download a large set of questions from Quora and then find similar questions in this set. """ -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import time +from sentence_transformers import SentenceTransformer, util # Model for computing sentence embeddings. We use one trained for similar questions detection model = SentenceTransformer("all-MiniLM-L6-v2") diff --git a/examples/applications/clustering/kmeans.py b/examples/applications/clustering/kmeans.py index be3e6fc8d..076c58115 100644 --- a/examples/applications/clustering/kmeans.py +++ b/examples/applications/clustering/kmeans.py @@ -4,9 +4,10 @@ Sentences are mapped to sentence embeddings and then k-mean clustering is applied. 
""" -from sentence_transformers import SentenceTransformer from sklearn.cluster import KMeans +from sentence_transformers import SentenceTransformer + embedder = SentenceTransformer("all-MiniLM-L6-v2") # Corpus with example sentences diff --git a/examples/applications/computing-embeddings/computing_embeddings.py b/examples/applications/computing-embeddings/computing_embeddings.py index 482527e9f..666ecca26 100644 --- a/examples/applications/computing-embeddings/computing_embeddings.py +++ b/examples/applications/computing-embeddings/computing_embeddings.py @@ -3,10 +3,12 @@ generate sentence embeddings for a given list of sentences. """ -from sentence_transformers import SentenceTransformer, LoggingHandler -import numpy as np import logging +import numpy as np + +from sentence_transformers import LoggingHandler, SentenceTransformer + #### Just some code to print debug information to stdout np.set_printoptions(threshold=100) diff --git a/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py b/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py index 1f47117d9..12fbfa562 100644 --- a/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py +++ b/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py @@ -4,9 +4,10 @@ when encoding large text collections. """ -from sentence_transformers import SentenceTransformer, LoggingHandler import logging +from sentence_transformers import LoggingHandler, SentenceTransformer + logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] ) diff --git a/examples/applications/computing-embeddings/computing_embeddings_streaming.py b/examples/applications/computing-embeddings/computing_embeddings_streaming.py index 8f9d4b4e0..6be3cac40 100644 --- a/examples/applications/computing-embeddings/computing_embeddings_streaming.py +++ b/examples/applications/computing-embeddings/computing_embeddings_streaming.py @@ -8,12 +8,14 @@ https://huggingface.co/docs/datasets/stream """ -from sentence_transformers import SentenceTransformer, LoggingHandler import logging -from datasets import load_dataset + from torch.utils.data import DataLoader from tqdm import tqdm +from datasets import load_dataset +from sentence_transformers import LoggingHandler, SentenceTransformer + logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] ) diff --git a/examples/applications/cross-encoder/cross-encoder_reranking.py b/examples/applications/cross-encoder/cross-encoder_reranking.py index a13aa96c3..0d0a8753c 100644 --- a/examples/applications/cross-encoder/cross-encoder_reranking.py +++ b/examples/applications/cross-encoder/cross-encoder_reranking.py @@ -7,13 +7,13 @@ Then, we re-rank the hits from the Bi-Encoder using a Cross-Encoder. """ -from sentence_transformers import SentenceTransformer, util -from sentence_transformers import CrossEncoder -import os import csv +import os import pickle import time +from sentence_transformers import CrossEncoder, SentenceTransformer, util + # We use a BiEncoder (SentenceTransformer) that produces embeddings for questions. 
# We then search for similar questions using cosine similarity and identify the top 100 most similar questions model_name = "all-MiniLM-L6-v2" diff --git a/examples/applications/cross-encoder/cross-encoder_usage.py b/examples/applications/cross-encoder/cross-encoder_usage.py index 7034445c6..16eba858a 100644 --- a/examples/applications/cross-encoder/cross-encoder_usage.py +++ b/examples/applications/cross-encoder/cross-encoder_usage.py @@ -4,9 +4,10 @@ It output then the most similar sentences for the given query. """ -from sentence_transformers.cross_encoder import CrossEncoder import numpy as np +from sentence_transformers.cross_encoder import CrossEncoder + # Pre-trained cross encoder model = CrossEncoder("cross-encoder/stsb-distilroberta-base") diff --git a/examples/applications/embedding-quantization/semantic_search_faiss.py b/examples/applications/embedding-quantization/semantic_search_faiss.py index 5da707327..1a6a1189a 100644 --- a/examples/applications/embedding-quantization/semantic_search_faiss.py +++ b/examples/applications/embedding-quantization/semantic_search_faiss.py @@ -1,7 +1,8 @@ import time + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings, semantic_search_faiss -from datasets import load_dataset # 1. Load the quora corpus with questions dataset = load_dataset("quora", split="train").map( diff --git a/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py b/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py index 0ff84333c..e869ca50b 100644 --- a/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py +++ b/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py @@ -1,6 +1,6 @@ +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings, semantic_search_faiss -from datasets import load_dataset # 1. Load the quora corpus with questions dataset = load_dataset("quora", split="train").map( diff --git a/examples/applications/embedding-quantization/semantic_search_recommended.py b/examples/applications/embedding-quantization/semantic_search_recommended.py index 594dbbc60..9084b56cc 100644 --- a/examples/applications/embedding-quantization/semantic_search_recommended.py +++ b/examples/applications/embedding-quantization/semantic_search_recommended.py @@ -8,13 +8,14 @@ import os import time +import faiss import numpy as np +from usearch.index import Index + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings -from datasets import load_dataset -import faiss -from usearch.index import Index # We use usearch as it can efficiently load int8 vectors from disk. # Load the model diff --git a/examples/applications/embedding-quantization/semantic_search_usearch.py b/examples/applications/embedding-quantization/semantic_search_usearch.py index c80b5ca6d..19d410a5d 100644 --- a/examples/applications/embedding-quantization/semantic_search_usearch.py +++ b/examples/applications/embedding-quantization/semantic_search_usearch.py @@ -1,7 +1,8 @@ import time + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings, semantic_search_usearch -from datasets import load_dataset # 1. 
Load the quora corpus with questions dataset = load_dataset("quora", split="train").map( diff --git a/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py b/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py index 5a7280f26..e8e0583a2 100644 --- a/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py +++ b/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py @@ -1,6 +1,6 @@ +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings, semantic_search_usearch -from datasets import load_dataset # 1. Load the quora corpus with questions dataset = load_dataset("quora", split="train").map( diff --git a/examples/applications/image-search/example.py b/examples/applications/image-search/example.py index 49bb6a699..9eadfc3aa 100644 --- a/examples/applications/image-search/example.py +++ b/examples/applications/image-search/example.py @@ -1,12 +1,12 @@ -from sentence_transformers import SentenceTransformer, util, models from PIL import Image +from sentence_transformers import SentenceTransformer, models, util ########### image = Image.open("two_dogs_in_snow.jpg") -from transformers import CLIPProcessor, CLIPModel +from transformers import CLIPModel, CLIPProcessor model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") diff --git a/examples/applications/parallel-sentence-mining/bitext_mining.py b/examples/applications/parallel-sentence-mining/bitext_mining.py index 886788fac..ac42de325 100644 --- a/examples/applications/parallel-sentence-mining/bitext_mining.py +++ b/examples/applications/parallel-sentence-mining/bitext_mining.py @@ -13,13 +13,15 @@ https://github.com/facebookresearch/faiss """ -from sentence_transformers import SentenceTransformer, models -import numpy as np -from bitext_mining_utils import score_candidates, kNN, file_open import gzip + +import numpy as np +import torch import tqdm +from bitext_mining_utils import file_open, kNN, score_candidates from sklearn.decomposition import PCA -import torch + +from sentence_transformers import SentenceTransformer, models # Model we want to use for bitext mining. 
LaBSE achieves state-of-the-art performance model_name = "LaBSE" diff --git a/examples/applications/parallel-sentence-mining/bitext_mining_utils.py b/examples/applications/parallel-sentence-mining/bitext_mining_utils.py index 3fdbc74bc..b723392b2 100644 --- a/examples/applications/parallel-sentence-mining/bitext_mining_utils.py +++ b/examples/applications/parallel-sentence-mining/bitext_mining_utils.py @@ -6,11 +6,12 @@ https://github.com/facebookresearch/LASER """ -import faiss -import numpy as np -import time import gzip import lzma +import time + +import faiss +import numpy as np ######## Functions to find and score candidates diff --git a/examples/applications/parallel-sentence-mining/bucc2018.py b/examples/applications/parallel-sentence-mining/bucc2018.py index 4781d86bc..47208356c 100644 --- a/examples/applications/parallel-sentence-mining/bucc2018.py +++ b/examples/applications/parallel-sentence-mining/bucc2018.py @@ -10,14 +10,16 @@ https://github.com/facebookresearch/faiss """ -from sentence_transformers import SentenceTransformer, models -from collections import defaultdict import os import pickle -from sklearn.decomposition import PCA -import torch -from bitext_mining_utils import score_candidates, kNN +from collections import defaultdict + import numpy as np +import torch +from bitext_mining_utils import kNN, score_candidates +from sklearn.decomposition import PCA + +from sentence_transformers import SentenceTransformer, models # Model we want to use for bitext mining. LaBSE achieves state-of-the-art performance model_name = "LaBSE" diff --git a/examples/applications/retrieve_rerank/in_document_search_crossencoder.py b/examples/applications/retrieve_rerank/in_document_search_crossencoder.py index 131c9edec..fca9cb86f 100644 --- a/examples/applications/retrieve_rerank/in_document_search_crossencoder.py +++ b/examples/applications/retrieve_rerank/in_document_search_crossencoder.py @@ -17,10 +17,11 @@ Note: Requires NLTK: `pip install nltk` """ -from sentence_transformers import CrossEncoder -from nltk import sent_tokenize import time +from nltk import sent_tokenize + +from sentence_transformers import CrossEncoder # As document, we take the first two section from the Wikipedia article about Europe document = """Europe is a continent located entirely in the Northern Hemisphere and mostly in the Eastern Hemisphere. It comprises the westernmost part of Eurasia and is bordered by the Arctic Ocean to the north, the Atlantic Ocean to the west, the Mediterranean Sea to the south, and Asia to the east. Europe is commonly considered to be separated from Asia by the watershed of the Ural Mountains, the Ural River, the Caspian Sea, the Greater Caucasus, the Black Sea, and the waterways of the Turkish Straits. Although some of this border is over land, Europe is generally accorded the status of a full continent because of its great physical size and the weight of history and tradition. diff --git a/examples/applications/semantic-search/semantic_search.py b/examples/applications/semantic-search/semantic_search.py index 80f9e9986..9f882b7b3 100644 --- a/examples/applications/semantic-search/semantic_search.py +++ b/examples/applications/semantic-search/semantic_search.py @@ -7,9 +7,10 @@ This script outputs for various queries the top 5 most similar sentences in the corpus. 
""" -from sentence_transformers import SentenceTransformer import torch +from sentence_transformers import SentenceTransformer + embedder = SentenceTransformer("all-MiniLM-L6-v2") # Corpus with example sentences diff --git a/examples/applications/semantic-search/semantic_search_publications.py b/examples/applications/semantic-search/semantic_search_publications.py index 12cb6c05b..f2a58d90d 100644 --- a/examples/applications/semantic-search/semantic_search_publications.py +++ b/examples/applications/semantic-search/semantic_search_publications.py @@ -11,6 +11,7 @@ import json import os + from sentence_transformers import SentenceTransformer, util # First, we load the papers dataset (with title and abstract information) diff --git a/examples/applications/semantic-search/semantic_search_quora_annoy.py b/examples/applications/semantic-search/semantic_search_quora_annoy.py index 25c4ac7f5..80d74594d 100644 --- a/examples/applications/semantic-search/semantic_search_quora_annoy.py +++ b/examples/applications/semantic-search/semantic_search_quora_annoy.py @@ -26,14 +26,15 @@ return the closest questions in the corpus (questions in the corpus are mainly in English). """ -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import pickle import time + import torch from annoy import AnnoyIndex +from sentence_transformers import SentenceTransformer, util model_name = "quora-distilbert-multilingual" model = SentenceTransformer(model_name) diff --git a/examples/applications/semantic-search/semantic_search_quora_elasticsearch.py b/examples/applications/semantic-search/semantic_search_quora_elasticsearch.py index ff0f37b19..f1201a83c 100644 --- a/examples/applications/semantic-search/semantic_search_quora_elasticsearch.py +++ b/examples/applications/semantic-search/semantic_search_quora_elasticsearch.py @@ -19,14 +19,15 @@ return the closest questions in the corpus (questions in the corpus are mainly in English). """ -from sentence_transformers import SentenceTransformer, util -import os -from elasticsearch import Elasticsearch, helpers -from ssl import create_default_context import csv +import os import time +from ssl import create_default_context + import tqdm.autonotebook +from elasticsearch import Elasticsearch, helpers +from sentence_transformers import SentenceTransformer, util es = Elasticsearch( hosts=["https://localhost:9200"], diff --git a/examples/applications/semantic-search/semantic_search_quora_faiss.py b/examples/applications/semantic-search/semantic_search_quora_faiss.py index 3f45298cb..76880e36b 100644 --- a/examples/applications/semantic-search/semantic_search_quora_faiss.py +++ b/examples/applications/semantic-search/semantic_search_quora_faiss.py @@ -23,14 +23,15 @@ return the closest questions in the corpus (questions in the corpus are mainly in English). 
""" -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import pickle import time + import faiss import numpy as np +from sentence_transformers import SentenceTransformer, util model_name = "quora-distilbert-multilingual" model = SentenceTransformer(model_name) diff --git a/examples/applications/semantic-search/semantic_search_quora_hnswlib.py b/examples/applications/semantic-search/semantic_search_quora_hnswlib.py index a6f1dbc12..cff380490 100644 --- a/examples/applications/semantic-search/semantic_search_quora_hnswlib.py +++ b/examples/applications/semantic-search/semantic_search_quora_hnswlib.py @@ -21,13 +21,14 @@ return the closest questions in the corpus (questions in the corpus are mainly in English). """ -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import pickle import time + import hnswlib +from sentence_transformers import SentenceTransformer, util model_name = "quora-distilbert-multilingual" model = SentenceTransformer(model_name) diff --git a/examples/applications/semantic-search/semantic_search_quora_pytorch.py b/examples/applications/semantic-search/semantic_search_quora_pytorch.py index 0e715b479..903c6804a 100644 --- a/examples/applications/semantic-search/semantic_search_quora_pytorch.py +++ b/examples/applications/semantic-search/semantic_search_quora_pytorch.py @@ -13,12 +13,12 @@ Google Colab example: https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing """ -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import pickle import time +from sentence_transformers import SentenceTransformer, util model_name = "quora-distilbert-multilingual" model = SentenceTransformer(model_name) diff --git a/examples/applications/semantic-search/semantic_search_wikipedia_qa.py b/examples/applications/semantic-search/semantic_search_wikipedia_qa.py index 1cd8744bd..cdc0baf86 100644 --- a/examples/applications/semantic-search/semantic_search_wikipedia_qa.py +++ b/examples/applications/semantic-search/semantic_search_wikipedia_qa.py @@ -13,13 +13,15 @@ Google Colab Example: https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing """ -import json -from sentence_transformers import SentenceTransformer, util -import time import gzip +import json import os +import time + import torch +from sentence_transformers import SentenceTransformer, util + # We use the Bi-Encoder to encode all passages, so that we can use it with semantic search model_name = "nq-distilbert-base-v1" bi_encoder = SentenceTransformer(model_name) diff --git a/examples/applications/text-summarization/LexRank.py b/examples/applications/text-summarization/LexRank.py index 055cd8a79..db269c430 100644 --- a/examples/applications/text-summarization/LexRank.py +++ b/examples/applications/text-summarization/LexRank.py @@ -3,10 +3,11 @@ Source: https://github.com/crabcamp/lexrank/tree/dev """ +import logging + import numpy as np from scipy.sparse.csgraph import connected_components from scipy.special import softmax -import logging logger = logging.getLogger(__name__) diff --git a/examples/applications/text-summarization/text-summarization.py b/examples/applications/text-summarization/text-summarization.py index 64dc0a4cd..58c79d49c 100644 --- a/examples/applications/text-summarization/text-summarization.py +++ b/examples/applications/text-summarization/text-summarization.py @@ -19,10 +19,10 @@ """ import nltk -from sentence_transformers 
import SentenceTransformer import numpy as np from LexRank import degree_centrality_scores +from sentence_transformers import SentenceTransformer model = SentenceTransformer("all-MiniLM-L6-v2") diff --git a/examples/evaluation/evaluation_inference_speed.py b/examples/evaluation/evaluation_inference_speed.py index a91ec0067..7ac8ec4c3 100644 --- a/examples/evaluation/evaluation_inference_speed.py +++ b/examples/evaluation/evaluation_inference_speed.py @@ -7,11 +7,13 @@ python evaluation_inference_speed.py model_name """ -from sentence_transformers import SentenceTransformer import sys import time + import torch + from datasets import load_dataset +from sentence_transformers import SentenceTransformer # Limit torch to 4 threads torch.set_num_threads(4) diff --git a/examples/evaluation/evaluation_stsbenchmark.py b/examples/evaluation/evaluation_stsbenchmark.py index 1e3fb78e0..4a2e96a1f 100644 --- a/examples/evaluation/evaluation_stsbenchmark.py +++ b/examples/evaluation/evaluation_stsbenchmark.py @@ -7,14 +7,15 @@ python evaluation_stsbenchmark.py model_name """ -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -from datasets import load_dataset import logging +import os import sys + import torch -import os +from datasets import load_dataset +from sentence_transformers import SentenceTransformer +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction script_folder_path = os.path.dirname(os.path.realpath(__file__)) diff --git a/examples/evaluation/evaluation_translation_matching.py b/examples/evaluation/evaluation_translation_matching.py index bac21e9bd..567331401 100644 --- a/examples/evaluation/evaluation_translation_matching.py +++ b/examples/evaluation/evaluation_translation_matching.py @@ -19,11 +19,11 @@ python examples/evaluation/evaluation_translation_matching.py distiluse-base-multilingual-cased sentence-transformers/parallel-sentences-tatoeba en-ar en-de en-nl """ -from sentence_transformers import SentenceTransformer, evaluation -import sys import logging -from datasets import load_dataset +import sys +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, evaluation # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/adaptive_layer/adaptive_layer_nli.py b/examples/training/adaptive_layer/adaptive_layer_nli.py index 955bc2e6c..8062cd7fe 100644 --- a/examples/training/adaptive_layer/adaptive_layer_nli.py +++ b/examples/training/adaptive_layer/adaptive_layer_nli.py @@ -10,15 +10,19 @@ python adaptive_layer_nli.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction from 
sentence_transformers.training_args import BatchSamplers # Set the log level to INFO to get more information diff --git a/examples/training/adaptive_layer/adaptive_layer_sts.py b/examples/training/adaptive_layer/adaptive_layer_sts.py index c2ebdd6f4..7d13a0690 100644 --- a/examples/training/adaptive_layer/adaptive_layer_sts.py +++ b/examples/training/adaptive_layer/adaptive_layer_sts.py @@ -10,14 +10,19 @@ python adaptive_layer_sts.py pretrained_transformer_model_name """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction -import logging -from datetime import datetime -import sys # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py index a89cea13a..4ef74dc40 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py @@ -7,14 +7,13 @@ for available word embeddings files """ -import traceback -from datasets import load_dataset -from sentence_transformers import models, losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging +import traceback from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses, models +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py index c2453ee07..224111cab 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py @@ -5,14 +5,13 @@ Note, you can also pass BERT embeddings to the BiLSTM. 
""" -import traceback -from datasets import load_dataset -from sentence_transformers import models, losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging +import traceback from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses, models +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py index 3be966717..2ee204b3f 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py @@ -5,17 +5,16 @@ To make the model trainable, we add multiple dense layers to create a Deep Averaging Network (DAN). """ +import logging +import math +import os import traceback +from datetime import datetime + from datasets import load_dataset -import math -from sentence_transformers import models, losses, util -from sentence_transformers import SentenceTransformer +from sentence_transformers import SentenceTransformer, losses, models, util from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.models.tokenizer.WordTokenizer import ENGLISH_STOP_WORDS -import logging -from datetime import datetime -import os - from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py index db8c4ee50..3e97d0557 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py @@ -5,20 +5,18 @@ """ +import logging import sys import traceback -from datasets import load_dataset -from sentence_transformers import models, losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -import logging from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses, models +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments - # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py index 183894b07..08814eb26 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py +++ 
b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py @@ -9,16 +9,15 @@ https://public.ukp.informatik.tu-darmstadt.de/reimers/embeddings/wikipedia_doc_frequencies.txt """ -import traceback -from datasets import load_dataset -import math -from sentence_transformers import models, losses, util -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging -from datetime import datetime +import math import os +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses, models, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/cross-encoder/training_nli.py b/examples/training/cross-encoder/training_nli.py index b6c83d271..e368855a8 100644 --- a/examples/training/cross-encoder/training_nli.py +++ b/examples/training/cross-encoder/training_nli.py @@ -8,18 +8,20 @@ python training_nli.py """ -from torch.utils.data import DataLoader +import csv +import gzip +import logging import math +import os +from datetime import datetime + +from torch.utils.data import DataLoader + from sentence_transformers import LoggingHandler, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CEF1Evaluator, CESoftmaxAccuracyEvaluator from sentence_transformers.evaluation import SequentialEvaluator from sentence_transformers.readers import InputExample -import logging -from datetime import datetime -import os -import gzip -import csv #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/cross-encoder/training_quora_duplicate_questions.py b/examples/training/cross-encoder/training_quora_duplicate_questions.py index de9e5634e..ecbdda11c 100644 --- a/examples/training/cross-encoder/training_quora_duplicate_questions.py +++ b/examples/training/cross-encoder/training_quora_duplicate_questions.py @@ -9,17 +9,19 @@ """ -from torch.utils.data import DataLoader +import csv +import logging import math +import os +from datetime import datetime +from zipfile import ZipFile + +from torch.utils.data import DataLoader + from sentence_transformers import LoggingHandler, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator from sentence_transformers.readers import InputExample -import logging -from datetime import datetime -import os -import csv -from zipfile import ZipFile #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/cross-encoder/training_stsbenchmark.py b/examples/training/cross-encoder/training_stsbenchmark.py index 8e1c83922..61a186e09 100644 --- a/examples/training/cross-encoder/training_stsbenchmark.py +++ b/examples/training/cross-encoder/training_stsbenchmark.py @@ -8,17 +8,18 @@ python training_stsbenchmark.py """ -from torch.utils.data import DataLoader +import csv +import gzip +import logging import math -from sentence_transformers import LoggingHandler, util +import os +from datetime import datetime + +from torch.utils.data import DataLoader + +from 
sentence_transformers import InputExample, LoggingHandler, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator -from sentence_transformers import InputExample -import logging -from datetime import datetime -import os -import gzip -import csv #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/data_augmentation/train_sts_indomain_bm25.py b/examples/training/data_augmentation/train_sts_indomain_bm25.py index 121dbb9bc..20a971bd5 100644 --- a/examples/training/data_augmentation/train_sts_indomain_bm25.py +++ b/examples/training/data_augmentation/train_sts_indomain_bm25.py @@ -26,22 +26,22 @@ """ +import logging +import math +import sys import traceback -from datasets import load_dataset, Dataset, concatenate_datasets +from datetime import datetime + +import tqdm +from elasticsearch import Elasticsearch from torch.utils.data import DataLoader -from sentence_transformers import losses + +from datasets import Dataset, concatenate_datasets, load_dataset +from sentence_transformers import SentenceTransformer, losses from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator -from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.readers import InputExample -from elasticsearch import Elasticsearch -from datetime import datetime -import logging -import sys -import tqdm -import math - from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/data_augmentation/train_sts_indomain_nlpaug.py b/examples/training/data_augmentation/train_sts_indomain_nlpaug.py index 735e5cf36..8f0bd9310 100644 --- a/examples/training/data_augmentation/train_sts_indomain_nlpaug.py +++ b/examples/training/data_augmentation/train_sts_indomain_nlpaug.py @@ -29,17 +29,18 @@ python train_sts_indomain_nlpaug.py """ -import traceback -from datasets import load_dataset, Dataset, concatenate_datasets -import torch -from sentence_transformers import SentenceTransformer, losses -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -import nlpaug.augmenter.word as naw import logging -from datetime import datetime import sys +import traceback +from datetime import datetime + +import nlpaug.augmenter.word as naw +import torch import tqdm +from datasets import Dataset, concatenate_datasets, load_dataset +from sentence_transformers import SentenceTransformer, losses +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/data_augmentation/train_sts_indomain_semantic.py b/examples/training/data_augmentation/train_sts_indomain_semantic.py index b3b276cce..7849d8512 100644 --- a/examples/training/data_augmentation/train_sts_indomain_semantic.py +++ b/examples/training/data_augmentation/train_sts_indomain_semantic.py @@ -19,22 +19,23 @@ python train_sts_indomain_semantic.py bert-base-uncased 3 """ +import csv +import gzip 
+import logging +import math +import os +import sys +from datetime import datetime + +import torch +import tqdm from torch.utils.data import DataLoader -from sentence_transformers import models, losses, util -from sentence_transformers import LoggingHandler, SentenceTransformer + +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.readers import InputExample -from datetime import datetime -import logging -import csv -import torch -import tqdm -import sys -import math -import gzip -import os #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/data_augmentation/train_sts_qqp_crossdomain.py b/examples/training/data_augmentation/train_sts_qqp_crossdomain.py index 55bd0a8f8..ac26ff35d 100644 --- a/examples/training/data_augmentation/train_sts_qqp_crossdomain.py +++ b/examples/training/data_augmentation/train_sts_qqp_crossdomain.py @@ -17,22 +17,23 @@ python train_sts_qqp_crossdomain.py pretrained_transformer_model_name """ +import csv +import gzip +import logging +import math +import os +import sys +from datetime import datetime +from zipfile import ZipFile + +import torch from torch.utils.data import DataLoader -from sentence_transformers import models, losses, util, LoggingHandler, SentenceTransformer + +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator from sentence_transformers.evaluation import BinaryClassificationEvaluator from sentence_transformers.readers import InputExample -from datetime import datetime -from zipfile import ZipFile -import logging -import csv -import sys -import torch -import math -import gzip -import os - #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/data_augmentation/train_sts_seed_optimization.py b/examples/training/data_augmentation/train_sts_seed_optimization.py index ddc53c676..02920fba2 100644 --- a/examples/training/data_augmentation/train_sts_seed_optimization.py +++ b/examples/training/data_augmentation/train_sts_seed_optimization.py @@ -23,19 +23,21 @@ python train_sts_seed_optimization.py bert-base-uncased 10 0.3 """ -from torch.utils.data import DataLoader +import csv +import gzip +import logging import math -import torch +import os import random +import sys + import numpy as np -from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util +import torch +from torch.utils.data import DataLoader + +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models, util from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.readers import InputExample -import logging -import sys -import os -import gzip -import csv #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/distillation/dimensionality_reduction.py b/examples/training/distillation/dimensionality_reduction.py index 3ba9432a9..3244c5f8d 100644 --- a/examples/training/distillation/dimensionality_reduction.py +++ b/examples/training/distillation/dimensionality_reduction.py @@ 
-15,14 +15,15 @@ without further changes needed. """ -from datasets import load_dataset -from sklearn.decomposition import PCA -from sentence_transformers import SentenceTransformer, models import logging import random + import numpy as np import torch +from sklearn.decomposition import PCA +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, models from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator # Set the log level to INFO to get more information diff --git a/examples/training/distillation/model_distillation.py b/examples/training/distillation/model_distillation.py index b33f777ac..ea8bc6b7e 100644 --- a/examples/training/distillation/model_distillation.py +++ b/examples/training/distillation/model_distillation.py @@ -20,22 +20,21 @@ of the teacher performance, while being 2.3 times faster. """ -import traceback -from datasets import load_dataset, concatenate_datasets, Dataset -import pandas as pd -from sentence_transformers import models, losses, evaluation -from sentence_transformers import LoggingHandler, SentenceTransformer import logging +import traceback from datetime import datetime -from sklearn.decomposition import PCA + +import pandas as pd import torch +from sklearn.decomposition import PCA +from datasets import Dataset, concatenate_datasets, load_dataset +from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, losses, models from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments - #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/training/distillation/model_distillation_layer_reduction.py b/examples/training/distillation/model_distillation_layer_reduction.py index 7a0d76f5d..83f8cc6be 100644 --- a/examples/training/distillation/model_distillation_layer_reduction.py +++ b/examples/training/distillation/model_distillation_layer_reduction.py @@ -20,21 +20,20 @@ of the teacher performance, while being 2.3 times faster. 
""" -import traceback -from datasets import load_dataset, concatenate_datasets, Dataset -import pandas as pd -from sentence_transformers import losses, evaluation -from sentence_transformers import SentenceTransformer import logging +import traceback from datetime import datetime + +import pandas as pd import torch +from datasets import Dataset, concatenate_datasets, load_dataset +from sentence_transformers import SentenceTransformer, evaluation, losses from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments - # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/distillation/model_quantization.py b/examples/training/distillation/model_quantization.py index 78b3f49d0..07f76310d 100644 --- a/examples/training/distillation/model_quantization.py +++ b/examples/training/distillation/model_quantization.py @@ -9,16 +9,18 @@ https://pytorch.org/docs/stable/quantization.html """ +import csv +import gzip import logging import os +import time + import torch -from sentence_transformers import LoggingHandler, SentenceTransformer, util, InputExample -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from torch.nn import Embedding, Linear from torch.quantization import quantize_dynamic -import gzip -import csv -import time + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/hpo/hpo_nli.py b/examples/training/hpo/hpo_nli.py index 758224ec9..604d92bdd 100644 --- a/examples/training/hpo/hpo_nli.py +++ b/examples/training/hpo/hpo_nli.py @@ -1,8 +1,12 @@ -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers -from datasets import load_dataset # 1. 
Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli, 10k samples train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]") diff --git a/examples/training/matryoshka/2d_matryoshka_nli.py b/examples/training/matryoshka/2d_matryoshka_nli.py index 0b6acd2ec..4ed4dbfea 100644 --- a/examples/training/matryoshka/2d_matryoshka_nli.py +++ b/examples/training/matryoshka/2d_matryoshka_nli.py @@ -11,15 +11,19 @@ python 2d_matryoshka_nli.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers # Set the log level to INFO to get more information diff --git a/examples/training/matryoshka/2d_matryoshka_sts.py b/examples/training/matryoshka/2d_matryoshka_sts.py index 3880c8c93..a170f1581 100644 --- a/examples/training/matryoshka/2d_matryoshka_sts.py +++ b/examples/training/matryoshka/2d_matryoshka_sts.py @@ -10,14 +10,19 @@ python 2d_matryoshka_sts.py pretrained_transformer_model_name """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction -import logging -from datetime import datetime -import sys # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/matryoshka/matryoshka_eval_stsb.py b/examples/training/matryoshka/matryoshka_eval_stsb.py index 4c5cb511d..288850882 100644 --- a/examples/training/matryoshka/matryoshka_eval_stsb.py +++ b/examples/training/matryoshka/matryoshka_eval_stsb.py @@ -7,15 +7,16 @@ import os from typing import Dict, List, Optional, Tuple, cast -from datasets import load_dataset -import numpy as np import matplotlib.pyplot as plt +import numpy as np +from tqdm.auto import tqdm + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( EmbeddingSimilarityEvaluator, SimilarityFunction, ) -from tqdm.auto import tqdm # Dimension plot diff --git a/examples/training/matryoshka/matryoshka_nli.py b/examples/training/matryoshka/matryoshka_nli.py index 07a1d24d5..c111f2f05 100644 --- a/examples/training/matryoshka/matryoshka_nli.py +++ b/examples/training/matryoshka/matryoshka_nli.py @@ -11,15 +11,19 @@ python matryoshka_nli.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from 
sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers # Set the log level to INFO to get more information diff --git a/examples/training/matryoshka/matryoshka_nli_reduced_dim.py b/examples/training/matryoshka/matryoshka_nli_reduced_dim.py index b1e470b78..4f5dc600e 100644 --- a/examples/training/matryoshka/matryoshka_nli_reduced_dim.py +++ b/examples/training/matryoshka/matryoshka_nli_reduced_dim.py @@ -15,15 +15,20 @@ python matryoshka_nli_reduced_dim.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses, models -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, + models, +) +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers # Set the log level to INFO to get more information diff --git a/examples/training/matryoshka/matryoshka_sts.py b/examples/training/matryoshka/matryoshka_sts.py index 7cb17d715..4722f3cf3 100644 --- a/examples/training/matryoshka/matryoshka_sts.py +++ b/examples/training/matryoshka/matryoshka_sts.py @@ -10,14 +10,19 @@ python matryoshka_sts.py pretrained_transformer_model_name """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction -import logging -from datetime import datetime -import sys # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/ms_marco/eval_cross-encoder-trec-dl.py b/examples/training/ms_marco/eval_cross-encoder-trec-dl.py index acc96fe2f..f4635ca53 100644 --- a/examples/training/ms_marco/eval_cross-encoder-trec-dl.py +++ b/examples/training/ms_marco/eval_cross-encoder-trec-dl.py @@ -14,14 +14,16 @@ """ import gzip -from collections import defaultdict import logging -import tqdm -import numpy as np +import os import 
sys +from collections import defaultdict + +import numpy as np import pytrec_eval -from sentence_transformers import util, CrossEncoder -import os +import tqdm + +from sentence_transformers import CrossEncoder, util data_folder = "trec2019-data" os.makedirs(data_folder, exist_ok=True) diff --git a/examples/training/ms_marco/eval_msmarco.py b/examples/training/ms_marco/eval_msmarco.py index b40e25920..bfe6dbed9 100644 --- a/examples/training/ms_marco/eval_msmarco.py +++ b/examples/training/ms_marco/eval_msmarco.py @@ -6,12 +6,13 @@ python eval_msmarco.py model_name [max_corpus_size_in_thousands] """ -from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, util import logging -import sys import os +import sys import tarfile +from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, util + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/training/ms_marco/multilingual/translate_queries.py b/examples/training/ms_marco/multilingual/translate_queries.py index af7d7e941..56ca7542a 100644 --- a/examples/training/ms_marco/multilingual/translate_queries.py +++ b/examples/training/ms_marco/multilingual/translate_queries.py @@ -8,12 +8,14 @@ python translate_queries.py [target_language] """ -import os -from sentence_transformers import LoggingHandler, util import logging +import os +import sys import tarfile + from easynmt import EasyNMT -import sys + +from sentence_transformers import LoggingHandler, util #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/ms_marco/train_bi-encoder_margin-mse.py b/examples/training/ms_marco/train_bi-encoder_margin-mse.py index 7b397da62..bff34f5cb 100644 --- a/examples/training/ms_marco/train_bi-encoder_margin-mse.py +++ b/examples/training/ms_marco/train_bi-encoder_margin-mse.py @@ -1,18 +1,19 @@ -import sys +import argparse +import gzip import json -from torch.utils.data import DataLoader -from sentence_transformers import SentenceTransformer, LoggingHandler, util, models, losses, InputExample import logging -from datetime import datetime -import gzip import os -import tarfile -import tqdm -from torch.utils.data import Dataset +import pickle import random +import sys +import tarfile +from datetime import datetime from shutil import copyfile -import pickle -import argparse + +import tqdm +from torch.utils.data import DataLoader, Dataset + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/ms_marco/train_bi-encoder_mnrl.py b/examples/training/ms_marco/train_bi-encoder_mnrl.py index 110ced2e0..eea99b1c2 100644 --- a/examples/training/ms_marco/train_bi-encoder_mnrl.py +++ b/examples/training/ms_marco/train_bi-encoder_mnrl.py @@ -17,19 +17,20 @@ python train_bi-encoder_mnrl.py """ +import argparse +import gzip import json -from torch.utils.data import DataLoader -from sentence_transformers import SentenceTransformer, LoggingHandler, util, models, losses, InputExample import logging -from datetime import datetime -import gzip import os +import pickle +import random import tarfile +from datetime import datetime + import tqdm -from torch.utils.data import Dataset -import random -import pickle -import argparse +from torch.utils.data import
DataLoader, Dataset + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/ms_marco/train_cross-encoder_kd.py b/examples/training/ms_marco/train_cross-encoder_kd.py index 9045d27df..0d2cd6e0f 100644 --- a/examples/training/ms_marco/train_cross-encoder_kd.py +++ b/examples/training/ms_marco/train_cross-encoder_kd.py @@ -17,17 +17,18 @@ python train_cross-encoder_kd.py """ -from torch.utils.data import DataLoader -from sentence_transformers import LoggingHandler, util -from sentence_transformers.cross_encoder import CrossEncoder -from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator -from sentence_transformers import InputExample -import logging -from datetime import datetime import gzip +import logging import os import tarfile +from datetime import datetime + import torch +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, util +from sentence_transformers.cross_encoder import CrossEncoder +from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/ms_marco/train_cross-encoder_scratch.py b/examples/training/ms_marco/train_cross-encoder_scratch.py index faffcb87d..ebdd8d955 100644 --- a/examples/training/ms_marco/train_cross-encoder_scratch.py +++ b/examples/training/ms_marco/train_cross-encoder_scratch.py @@ -14,17 +14,18 @@ python train_cross-encoder_scratch.py """ -from torch.utils.data import DataLoader -from sentence_transformers import LoggingHandler, util -from sentence_transformers.cross_encoder import CrossEncoder -from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator -from sentence_transformers import InputExample -import logging -from datetime import datetime import gzip +import logging import os import tarfile +from datetime import datetime + import tqdm +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, util +from sentence_transformers.cross_encoder import CrossEncoder +from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/multilingual/get_parallel_data_opus.py b/examples/training/multilingual/get_parallel_data_opus.py index e66471624..a39d95b3b 100644 --- a/examples/training/multilingual/get_parallel_data_opus.py +++ b/examples/training/multilingual/get_parallel_data_opus.py @@ -29,9 +29,9 @@ """ -from opustools import OpusRead import os +from opustools import OpusRead corpora = ["JW300"] # Corpora you want to use source_languages = ["en"] # Source language, our teacher model is able to understand diff --git a/examples/training/multilingual/get_parallel_data_talks.py b/examples/training/multilingual/get_parallel_data_talks.py index 0c567b2e6..745e125d5 100644 --- a/examples/training/multilingual/get_parallel_data_talks.py +++ b/examples/training/multilingual/get_parallel_data_talks.py @@ -13,12 +13,13 @@ https://arxiv.org/abs/2004.09813 """ -import os -import sentence_transformers.util -import gzip import csv +import gzip +import os + from tqdm.autonotebook import tqdm +import sentence_transformers.util source_languages = set(["en"]) # Languages our (monolingual) teacher model understands target_languages
= set(["de", "es", "it", "fr", "ar", "tr"]) # New languages we want to extend to diff --git a/examples/training/multilingual/get_parallel_data_tatoeba.py b/examples/training/multilingual/get_parallel_data_tatoeba.py index a4a226d1a..77b033e6e 100644 --- a/examples/training/multilingual/get_parallel_data_tatoeba.py +++ b/examples/training/multilingual/get_parallel_data_tatoeba.py @@ -5,10 +5,11 @@ This script downloads the Tatoeba corpus and extracts the sentences & translations in the languages you like """ +import gzip import os -import sentence_transformers import tarfile -import gzip + +import sentence_transformers # Note: Tatoeba uses 3 letter language codes (ISO-639-2), # while other datasets like OPUS use 2 letter language codes (ISO-639-1) diff --git a/examples/training/multilingual/get_parallel_data_wikimatrix.py b/examples/training/multilingual/get_parallel_data_wikimatrix.py index 1f2e4520a..8b985dd38 100644 --- a/examples/training/multilingual/get_parallel_data_wikimatrix.py +++ b/examples/training/multilingual/get_parallel_data_wikimatrix.py @@ -9,10 +9,10 @@ https://arxiv.org/abs/2004.09813 """ -import os -import sentence_transformers.util import gzip +import os +import sentence_transformers.util source_languages = set(["en"]) # Languages our (monolingual) teacher model understands target_languages = set(["de", "es", "it", "fr", "ar", "tr"]) # New languages we want to extend to diff --git a/examples/training/multilingual/make_multilingual.py b/examples/training/multilingual/make_multilingual.py index b50d5d408..bb62d37bf 100644 --- a/examples/training/multilingual/make_multilingual.py +++ b/examples/training/multilingual/make_multilingual.py @@ -17,12 +17,14 @@ https://arxiv.org/abs/2004.09813 """ +import logging import traceback -from sentence_transformers import SentenceTransformer, LoggingHandler from datetime import datetime -from datasets import load_dataset, DatasetDict -import logging +import numpy as np + +from datasets import DatasetDict, load_dataset +from sentence_transformers import LoggingHandler, SentenceTransformer from sentence_transformers.evaluation import ( EmbeddingSimilarityEvaluator, MSEEvaluator, @@ -32,7 +34,6 @@ from sentence_transformers.losses import MSELoss from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments -import numpy as np logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/training/nli/training_nli.py b/examples/training/nli/training_nli.py index 2dd26c112..0a2dc6ba8 100644 --- a/examples/training/nli/training_nli.py +++ b/examples/training/nli/training_nli.py @@ -10,15 +10,14 @@ python training_nli.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import
SentenceTransformerTrainingArguments diff --git a/examples/training/nli/training_nli_v2.py b/examples/training/nli/training_nli_v2.py index 0567e0f8b..0b0025351 100644 --- a/examples/training/nli/training_nli_v2.py +++ b/examples/training/nli/training_nli_v2.py @@ -10,15 +10,14 @@ python training_nli_v2.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments diff --git a/examples/training/nli/training_nli_v3.py b/examples/training/nli/training_nli_v3.py index 1844a2588..ffc95e128 100644 --- a/examples/training/nli/training_nli_v3.py +++ b/examples/training/nli/training_nli_v3.py @@ -10,15 +10,14 @@ python training_nli_v3.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments diff --git a/examples/training/other/training_batch_hard_trec.py b/examples/training/other/training_batch_hard_trec.py index 0d17817c7..7082cb706 100644 --- a/examples/training/other/training_batch_hard_trec.py +++ b/examples/training/other/training_batch_hard_trec.py @@ -16,18 +16,18 @@ all sentences with the same label should be close and sentences for different labels should be clearly separated.
""" -from sentence_transformers import SentenceTransformer, LoggingHandler, losses, util -from sentence_transformers.datasets import SentenceLabelDataset -from torch.utils.data import DataLoader -from sentence_transformers.readers import InputExample -from sentence_transformers.evaluation import TripletEvaluator -from datetime import datetime - - import logging import os import random from collections import defaultdict +from datetime import datetime + +from torch.utils.data import DataLoader + +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, util +from sentence_transformers.datasets import SentenceLabelDataset +from sentence_transformers.evaluation import TripletEvaluator +from sentence_transformers.readers import InputExample logging.basicConfig( format="%(asctime)s - %(message)s", diff --git a/examples/training/other/training_multi-task.py b/examples/training/other/training_multi-task.py index b9de63c0a..e54e6f270 100644 --- a/examples/training/other/training_multi-task.py +++ b/examples/training/other/training_multi-task.py @@ -4,16 +4,17 @@ The system trains BERT on the AllNLI and on the STSbenchmark dataset. """ +import logging import traceback +from datetime import datetime + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.losses import CosineSimilarityLoss, SoftmaxLoss from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments -import logging -from datetime import datetime -from datasets import load_dataset # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/other/training_wikipedia_sections.py b/examples/training/other/training_wikipedia_sections.py index 8f9b36dfa..e1c835418 100644 --- a/examples/training/other/training_wikipedia_sections.py +++ b/examples/training/other/training_wikipedia_sections.py @@ -4,15 +4,16 @@ As a corpus, we use the Wikipedia sections dataset that was described by Dor et al., 2018, Learning Thematic Similarity Metric Using Triplet Networks. """ +import logging import traceback +from datetime import datetime + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import TripletEvaluator from sentence_transformers.losses import TripletLoss from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments -from datetime import datetime -from datasets import load_dataset -import logging # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/paraphrases/training.py b/examples/training/paraphrases/training.py index 6ad6b5877..1184b3f05 100644 --- a/examples/training/paraphrases/training.py +++ b/examples/training/paraphrases/training.py @@ -3,7 +3,11 @@ As a result, it does not produce exactly the same behaviour as the original script.
""" +import logging import traceback +from datetime import datetime + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.losses import MultipleNegativesRankingLoss @@ -14,10 +18,6 @@ MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments, ) -import logging -from datetime import datetime -from datasets import load_dataset - # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/quora_duplicate_questions/create_splits.py b/examples/training/quora_duplicate_questions/create_splits.py index f69a0040d..00e485e10 100644 --- a/examples/training/quora_duplicate_questions/create_splits.py +++ b/examples/training/quora_duplicate_questions/create_splits.py @@ -44,11 +44,11 @@ """ import csv -from collections import defaultdict -import random import os -from sentence_transformers import util +import random +from collections import defaultdict +from sentence_transformers import util random.seed(42) diff --git a/examples/training/quora_duplicate_questions/training_MultipleNegativesRankingLoss.py b/examples/training/quora_duplicate_questions/training_MultipleNegativesRankingLoss.py index 9ddea4dd2..b8641bc87 100644 --- a/examples/training/quora_duplicate_questions/training_MultipleNegativesRankingLoss.py +++ b/examples/training/quora_duplicate_questions/training_MultipleNegativesRankingLoss.py @@ -11,7 +11,11 @@ The model we get works well for duplicate question mining and for duplicate question information retrieval. For question pair classification, other losses (like OnlineContrastiveLoss) work better. """ +import logging +import random import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( @@ -23,9 +27,6 @@ from sentence_transformers.losses import MultipleNegativesRankingLoss from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments -import logging -from datetime import datetime -import random # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/quora_duplicate_questions/training_OnlineContrastiveLoss.py b/examples/training/quora_duplicate_questions/training_OnlineContrastiveLoss.py index 51fc34b0a..ab35b8edb 100644 --- a/examples/training/quora_duplicate_questions/training_OnlineContrastiveLoss.py +++ b/examples/training/quora_duplicate_questions/training_OnlineContrastiveLoss.py @@ -9,7 +9,11 @@ An issue with contrastive loss is that it might push sentences away that are already well positioned in vector space.
""" +import logging +import random import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( @@ -22,9 +26,6 @@ from sentence_transformers.losses.ContrastiveLoss import SiameseDistanceMetric from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments -import logging -from datetime import datetime -import random # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/quora_duplicate_questions/training_multi-task-learning.py b/examples/training/quora_duplicate_questions/training_multi-task-learning.py index 15d8c227c..c9b301d16 100644 --- a/examples/training/quora_duplicate_questions/training_multi-task-learning.py +++ b/examples/training/quora_duplicate_questions/training_multi-task-learning.py @@ -11,7 +11,11 @@ model.fit(train_objectives=[(train_dataloader_MultipleNegativesRankingLoss, train_loss_MultipleNegativesRankingLoss), (train_dataloader_contrastive_loss, train_loss_contrastive_loss)] ...) """ +import logging +import random import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( @@ -27,9 +31,6 @@ MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments, ) -import logging -from datetime import datetime -import random # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/sts/training_stsbenchmark.py b/examples/training/sts/training_stsbenchmark.py index 9bdc1efe3..c61648494 100644 --- a/examples/training/sts/training_stsbenchmark.py +++ b/examples/training/sts/training_stsbenchmark.py @@ -9,14 +9,14 @@ python training_stsbenchmark.py pretrained_transformer_model_name """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer, losses from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -import logging -from datetime import datetime -import sys - from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/sts/training_stsbenchmark_continue_training.py b/examples/training/sts/training_stsbenchmark_continue_training.py index ff4c70bdd..852892be9 100644 --- a/examples/training/sts/training_stsbenchmark_continue_training.py +++ b/examples/training/sts/training_stsbenchmark_continue_training.py @@ -6,14 +6,14 @@ If you want to fine-tune a huggingface/transformers model like bert-base-uncased, see training_nli.py and training_stsbenchmark.py """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer, losses from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -import logging -from datetime import datetime -import sys - from sentence_transformers.similarity_functions import SimilarityFunction from
sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/unsupervised_learning/CT/train_askubuntu_ct.py b/examples/unsupervised_learning/CT/train_askubuntu_ct.py index a4cf386f0..ed35538b0 100644 --- a/examples/unsupervised_learning/CT/train_askubuntu_ct.py +++ b/examples/unsupervised_learning/CT/train_askubuntu_ct.py @@ -1,11 +1,12 @@ -from sentence_transformers import SentenceTransformer, LoggingHandler -from sentence_transformers import models, util, evaluation, losses +import gzip import logging import os -import gzip from datetime import datetime + import torch +from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, losses, models, util + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/CT/train_ct_from_file.py b/examples/unsupervised_learning/CT/train_ct_from_file.py index 9e8444ddb..a78a9bed5 100644 --- a/examples/unsupervised_learning/CT/train_ct_from_file.py +++ b/examples/unsupervised_learning/CT/train_ct_from_file.py @@ -8,15 +8,16 @@ """ -import math -from sentence_transformers import models, losses -from sentence_transformers import LoggingHandler, SentenceTransformer -import logging -from datetime import datetime import gzip +import logging +import math import sys +from datetime import datetime + import tqdm +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/CT/train_stsb_ct.py b/examples/unsupervised_learning/CT/train_stsb_ct.py index 23b9a3ae9..0fe756f63 100644 --- a/examples/unsupervised_learning/CT/train_stsb_ct.py +++ b/examples/unsupervised_learning/CT/train_stsb_ct.py @@ -1,12 +1,13 @@ -import torch -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -from sentence_transformers import SentenceTransformer, LoggingHandler, models, util, InputExample -from sentence_transformers import losses -import os -import gzip import csv -from datetime import datetime +import gzip import logging +import os +from datetime import datetime + +import torch + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py index e0c56e7f4..9df16ff1c 100644 --- a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py +++ b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py @@ -1,11 +1,12 @@ -from sentence_transformers import SentenceTransformer, LoggingHandler, InputExample -from sentence_transformers import models, util, evaluation, losses +import gzip import logging import os -import gzip from datetime import datetime + from torch.utils.data import DataLoader +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, 
evaluation, losses, models, util + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_ct-improved_from_file.py b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_ct-improved_from_file.py index 8338628d1..06c04c4d6 100644 --- a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_ct-improved_from_file.py +++ b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_ct-improved_from_file.py @@ -8,16 +8,17 @@ """ -import math -from sentence_transformers import models, losses -from sentence_transformers import LoggingHandler, SentenceTransformer -import logging -from datetime import datetime import gzip +import logging +import math import sys +from datetime import datetime + import tqdm from torch.utils.data import DataLoader +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_stsb_ct-improved.py b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_stsb_ct-improved.py index 015d798c0..6201d94c2 100644 --- a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_stsb_ct-improved.py +++ b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_stsb_ct-improved.py @@ -1,13 +1,14 @@ -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -from sentence_transformers import SentenceTransformer, LoggingHandler, models, util, InputExample -from sentence_transformers import losses -import os -import gzip import csv -from datetime import datetime +import gzip import logging +import os +from datetime import datetime + from torch.utils.data import DataLoader +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/MLM/train_mlm.py b/examples/unsupervised_learning/MLM/train_mlm.py index ccc11f08f..3f6e144d6 100644 --- a/examples/unsupervised_learning/MLM/train_mlm.py +++ b/examples/unsupervised_learning/MLM/train_mlm.py @@ -8,13 +8,19 @@ python train_mlm.py model_name data/train_sentences.txt [data/dev_sentences.txt] """ -from transformers import AutoModelForMaskedLM, AutoTokenizer -from transformers import DataCollatorForLanguageModeling, DataCollatorForWholeWordMask -from transformers import Trainer, TrainingArguments -import sys import gzip +import sys from datetime import datetime +from transformers import ( + AutoModelForMaskedLM, + AutoTokenizer, + DataCollatorForLanguageModeling, + DataCollatorForWholeWordMask, + Trainer, + TrainingArguments, +) + if len(sys.argv) < 3: print("Usage: python train_mlm.py model_name data/train_sentences.txt [data/dev_sentences.txt]") exit() diff --git a/examples/unsupervised_learning/SimCSE/train_askubuntu_simcse.py b/examples/unsupervised_learning/SimCSE/train_askubuntu_simcse.py index 8b63aad70..bf581e017 100644 --- 
a/examples/unsupervised_learning/SimCSE/train_askubuntu_simcse.py +++ b/examples/unsupervised_learning/SimCSE/train_askubuntu_simcse.py @@ -1,11 +1,11 @@ -from sentence_transformers import SentenceTransformer, LoggingHandler, InputExample -from sentence_transformers import models, util, evaluation, losses +import gzip import logging import os -import gzip -from torch.utils.data import DataLoader from datetime import datetime +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, evaluation, losses, models, util #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/unsupervised_learning/SimCSE/train_simcse_from_file.py b/examples/unsupervised_learning/SimCSE/train_simcse_from_file.py index 861f99951..6c196d7e0 100644 --- a/examples/unsupervised_learning/SimCSE/train_simcse_from_file.py +++ b/examples/unsupervised_learning/SimCSE/train_simcse_from_file.py @@ -8,15 +8,16 @@ """ -from torch.utils.data import DataLoader -import math -from sentence_transformers import models, losses -from sentence_transformers import LoggingHandler, SentenceTransformer, InputExample -import logging -from datetime import datetime import gzip +import logging +import math import sys +from datetime import datetime + import tqdm +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/unsupervised_learning/SimCSE/train_stsb_simcse.py b/examples/unsupervised_learning/SimCSE/train_stsb_simcse.py index 0fcba32d4..82d98fbcb 100644 --- a/examples/unsupervised_learning/SimCSE/train_stsb_simcse.py +++ b/examples/unsupervised_learning/SimCSE/train_stsb_simcse.py @@ -1,13 +1,14 @@ -from torch.utils.data import DataLoader -import math -from sentence_transformers import models, losses -from sentence_transformers import LoggingHandler, SentenceTransformer, util, InputExample -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator +import csv +import gzip import logging -from datetime import datetime +import math import os -import gzip -import csv +from datetime import datetime + +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/unsupervised_learning/TSDAE/eval_askubuntu.py b/examples/unsupervised_learning/TSDAE/eval_askubuntu.py index 7efe16dfe..4b5a3cdfc 100644 --- a/examples/unsupervised_learning/TSDAE/eval_askubuntu.py +++ b/examples/unsupervised_learning/TSDAE/eval_askubuntu.py @@ -5,13 +5,13 @@ python eval_askubuntu.py [sbert_model_name_or_path] """ -from sentence_transformers import SentenceTransformer, LoggingHandler -from sentence_transformers import util, evaluation +import gzip import logging import os -import gzip import sys +from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, util + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/TSDAE/train_askubuntu_tsdae.py 
b/examples/unsupervised_learning/TSDAE/train_askubuntu_tsdae.py
index 6abccbb50..beac68f90 100644
--- a/examples/unsupervised_learning/TSDAE/train_askubuntu_tsdae.py
+++ b/examples/unsupervised_learning/TSDAE/train_askubuntu_tsdae.py
@@ -1,11 +1,12 @@
-from sentence_transformers import SentenceTransformer, LoggingHandler
-from sentence_transformers import models, util, datasets, evaluation, losses
+import gzip
 import logging
 import os
-import gzip
-from torch.utils.data import DataLoader
-from datetime import datetime
 import sys
+from datetime import datetime
+
+from torch.utils.data import DataLoader
+
+from sentence_transformers import LoggingHandler, SentenceTransformer, datasets, evaluation, losses, models, util
 
 #### Just some code to print debug information to stdout
 logging.basicConfig(
diff --git a/examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py b/examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
index 0e898d91c..96d80b14a 100644
--- a/examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
+++ b/examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
@@ -1,12 +1,13 @@
-from torch.utils.data import DataLoader
-from sentence_transformers import models, losses, datasets
-from sentence_transformers import LoggingHandler, SentenceTransformer, util, InputExample
-from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+import csv
+import gzip
 import logging
-from datetime import datetime
 import os
-import gzip
-import csv
+from datetime import datetime
+
+from torch.utils.data import DataLoader
+
+from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, datasets, losses, models, util
+from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
 
 #### Just some code to print debug information to stdout
 logging.basicConfig(
diff --git a/examples/unsupervised_learning/TSDAE/train_tsdae_from_file.py b/examples/unsupervised_learning/TSDAE/train_tsdae_from_file.py
index 257c9b6c9..14db4da15 100644
--- a/examples/unsupervised_learning/TSDAE/train_tsdae_from_file.py
+++ b/examples/unsupervised_learning/TSDAE/train_tsdae_from_file.py
@@ -8,14 +8,15 @@
 """
 
-from sentence_transformers import SentenceTransformer, LoggingHandler
-from sentence_transformers import models, datasets, losses
-import logging
 import gzip
-from torch.utils.data import DataLoader
-from datetime import datetime
+import logging
 import sys
+from datetime import datetime
+
 import tqdm
+from torch.utils.data import DataLoader
+
+from sentence_transformers import LoggingHandler, SentenceTransformer, datasets, losses, models
 
 #### Just some code to print debug information to stdout
 logging.basicConfig(
diff --git a/examples/unsupervised_learning/query_generation/1_programming_query_generation.py b/examples/unsupervised_learning/query_generation/1_programming_query_generation.py
index f75875ca9..7558edc08 100644
--- a/examples/unsupervised_learning/query_generation/1_programming_query_generation.py
+++ b/examples/unsupervised_learning/query_generation/1_programming_query_generation.py
@@ -11,12 +11,14 @@
 3_programming_semantic_search.py - Shows how the trained model can be used for semantic search
 """
-import json
 import gzip
-from transformers import T5Tokenizer, T5ForConditionalGeneration
+import json
+import os
+
 import torch
 import tqdm
-import os
+from transformers import T5ForConditionalGeneration, T5Tokenizer
+
 from sentence_transformers import util
 
 paragraphs = set()
diff --git a/examples/unsupervised_learning/query_generation/2_programming_train_bi-encoder.py b/examples/unsupervised_learning/query_generation/2_programming_train_bi-encoder.py
index ae316c810..6364b5731 100644
--- a/examples/unsupervised_learning/query_generation/2_programming_train_bi-encoder.py
+++ b/examples/unsupervised_learning/query_generation/2_programming_train_bi-encoder.py
@@ -11,9 +11,9 @@
 3_programming_semantic_search.py - Shows how the trained model can be used for semantic search
 """
-from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets
 import os
+from sentence_transformers import InputExample, SentenceTransformer, datasets, losses, models
 
 train_examples = []
 with open("generated_queries.tsv") as fIn:
diff --git a/examples/unsupervised_learning/query_generation/3_programming_semantic_search.py b/examples/unsupervised_learning/query_generation/3_programming_semantic_search.py
index 46a8b9ef6..d3f73ae44 100644
--- a/examples/unsupervised_learning/query_generation/3_programming_semantic_search.py
+++ b/examples/unsupervised_learning/query_generation/3_programming_semantic_search.py
@@ -11,11 +11,12 @@
 3_programming_semantic_search.py - Shows how the trained model can be used for semantic search
 """
-from sentence_transformers import SentenceTransformer, util
 import gzip
 import json
 import os
 
+from sentence_transformers import SentenceTransformer, util
+
 # Load the model we trained in 2_programming_train_bi-encoder.py
 model = SentenceTransformer("output/programming-model")
diff --git a/examples/unsupervised_learning/query_generation/example_query_generation.py b/examples/unsupervised_learning/query_generation/example_query_generation.py
index 99be09806..099b960a5 100644
--- a/examples/unsupervised_learning/query_generation/example_query_generation.py
+++ b/examples/unsupervised_learning/query_generation/example_query_generation.py
@@ -1,7 +1,8 @@
-import torch
-import numpy as np
 import random
-from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+import numpy as np
+import torch
+from transformers import T5ForConditionalGeneration, T5Tokenizer
 
 # Set all seeds to make output deterministic
 torch.manual_seed(0)
diff --git a/ruff.toml b/ruff.toml
index 765b3bee5..0c56fdb1d 100644
--- a/ruff.toml
+++ b/ruff.toml
@@ -1,4 +1,3 @@
-lint.ignore-init-module-imports = true
 line-length = 119
diff --git a/sentence_transformers/LoggingHandler.py b/sentence_transformers/LoggingHandler.py
index 7696f353e..7ae11480c 100644
--- a/sentence_transformers/LoggingHandler.py
+++ b/sentence_transformers/LoggingHandler.py
@@ -1,4 +1,5 @@
 import logging
+
 import tqdm
diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py
index f2ad12f83..280312847 100644
--- a/sentence_transformers/SentenceTransformer.py
+++ b/sentence_transformers/SentenceTransformer.py
@@ -1,46 +1,46 @@
-from contextlib import contextmanager
+import copy
+import importlib
 import json
 import logging
+import math
 import os
-from collections import OrderedDict
-from pathlib import Path
+import queue
+import tempfile
 import traceback
 import warnings
-from typing import Callable, List, Dict, Literal, Tuple, Iterable, Union, Optional, overload, Any
+from collections import OrderedDict
+from contextlib import contextmanager
+from pathlib import Path
+from typing import Any, Callable, Dict, Iterable, List, Literal, Optional, Tuple, Union, overload
+
 import numpy as np
-from numpy import ndarray
-import transformers
-from transformers import is_torch_npu_available
-from huggingface_hub import HfApi
 import torch
-from torch import nn, Tensor, device
 import torch.multiprocessing as mp
+import transformers
+from huggingface_hub import HfApi
+from numpy import ndarray
+from torch import Tensor, device, nn
 from tqdm.autonotebook import trange
-import math
-import queue
-import tempfile
-import copy
-import importlib
+from transformers import is_torch_npu_available
 
 from sentence_transformers.model_card import SentenceTransformerModelCardData, generate_model_card
 from sentence_transformers.similarity_functions import SimilarityFunction
 
-from . import __MODEL_HUB_ORGANIZATION__
+from . import __MODEL_HUB_ORGANIZATION__, __version__
 from .evaluation import SentenceEvaluator
+from .fit_mixin import FitMixin
+from .models import Normalize, Pooling, Transformer
+from .quantization import quantize_embeddings
 from .util import (
-    import_from_string,
     batch_to_device,
+    get_device_name,
+    import_from_string,
     is_sentence_transformer_model,
     load_dir_path,
     load_file_path,
     save_to_hub_args_decorator,
-    get_device_name,
     truncate_embeddings,
 )
-from .quantization import quantize_embeddings
-from .models import Transformer, Pooling, Normalize
-from .fit_mixin import FitMixin
-from . import __version__
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/__init__.py b/sentence_transformers/__init__.py
index 3d772711b..488a4b9fa 100644
--- a/sentence_transformers/__init__.py
+++ b/sentence_transformers/__init__.py
@@ -4,17 +4,16 @@
 import importlib
 import os
 
-from .datasets import SentencesDataset, ParallelSentencesDataset
-from .LoggingHandler import LoggingHandler
-from .SentenceTransformer import SentenceTransformer
-from .similarity_functions import SimilarityFunction
-from .readers import InputExample
-from .cross_encoder.CrossEncoder import CrossEncoder
-from .trainer import SentenceTransformerTrainer
-from .training_args import SentenceTransformerTrainingArguments
-from .model_card import SentenceTransformerModelCardData
-from .quantization import quantize_embeddings
-
+from sentence_transformers.cross_encoder.CrossEncoder import CrossEncoder
+from sentence_transformers.datasets import ParallelSentencesDataset, SentencesDataset
+from sentence_transformers.LoggingHandler import LoggingHandler
+from sentence_transformers.model_card import SentenceTransformerModelCardData
+from sentence_transformers.quantization import quantize_embeddings
+from sentence_transformers.readers import InputExample
+from sentence_transformers.SentenceTransformer import SentenceTransformer
+from sentence_transformers.similarity_functions import SimilarityFunction
+from sentence_transformers.trainer import SentenceTransformerTrainer
+from sentence_transformers.training_args import SentenceTransformerTrainingArguments
 
 # If codecarbon is installed and the log level is not defined,
 # automatically overwrite the default to "error"
diff --git a/sentence_transformers/cross_encoder/CrossEncoder.py b/sentence_transformers/cross_encoder/CrossEncoder.py
index cc205c317..9462b69b6 100644
--- a/sentence_transformers/cross_encoder/CrossEncoder.py
+++ b/sentence_transformers/cross_encoder/CrossEncoder.py
@@ -1,22 +1,20 @@
+import logging
+import os
 from functools import wraps
+from typing import Callable, Dict, List, Optional, Type, Union
 
-from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
 import numpy as np
-import logging
-import os
-from typing import Dict, Type, Callable, List, Optional, Union
 import torch
 from torch import nn
 from torch.optim import Optimizer
 from torch.utils.data import DataLoader
 from tqdm.autonotebook import tqdm, trange
-from transformers import is_torch_npu_available
+from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, is_torch_npu_available
 from transformers.utils import PushToHubMixin
 
-from .. import SentenceTransformer, util
-from ..evaluation import SentenceEvaluator
-from ..util import get_device_name
-
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.SentenceTransformer import SentenceTransformer
+from sentence_transformers.util import fullname, get_device_name, import_from_string
 
 logger = logging.getLogger(__name__)
@@ -114,7 +112,7 @@ def __init__(
         if default_activation_function is not None:
             self.default_activation_function = default_activation_function
             try:
-                self.config.sbert_ce_default_activation_function = util.fullname(self.default_activation_function)
+                self.config.sbert_ce_default_activation_function = fullname(self.default_activation_function)
             except Exception as e:
                 logger.warning(
                     "Was not able to update config about the default_activation_function: {}".format(str(e))
                 )
@@ -123,9 +121,7 @@ def __init__(
             hasattr(self.config, "sbert_ce_default_activation_function")
             and self.config.sbert_ce_default_activation_function is not None
         ):
-            self.default_activation_function = util.import_from_string(
-                self.config.sbert_ce_default_activation_function
-            )()
+            self.default_activation_function = import_from_string(self.config.sbert_ce_default_activation_function)()
         else:
             self.default_activation_function = nn.Sigmoid() if self.config.num_labels == 1 else nn.Identity()
diff --git a/sentence_transformers/cross_encoder/evaluation/CEBinaryAccuracyEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CEBinaryAccuracyEvaluator.py
index 70eccbd23..b0fb33723 100644
--- a/sentence_transformers/cross_encoder/evaluation/CEBinaryAccuracyEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CEBinaryAccuracyEvaluator.py
@@ -1,10 +1,11 @@
+import csv
 import logging
 import os
-import csv
 from typing import List
-from ... import InputExample
+
 import numpy as np
 
+from sentence_transformers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CEBinaryClassificationEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CEBinaryClassificationEvaluator.py
index 0d41a7a08..da6e3a36d 100644
--- a/sentence_transformers/cross_encoder/evaluation/CEBinaryClassificationEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CEBinaryClassificationEvaluator.py
@@ -1,13 +1,13 @@
+import csv
 import logging
-from sklearn.metrics import average_precision_score
-from typing import List
-import numpy as np
 import os
-import csv
+from typing import List
 
-from ... import InputExample
-from ...evaluation import BinaryClassificationEvaluator
+import numpy as np
+from sklearn.metrics import average_precision_score
 
+from sentence_transformers import InputExample
+from sentence_transformers.evaluation import BinaryClassificationEvaluator
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py
index fbb76ac53..80f8f48df 100644
--- a/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py
@@ -1,10 +1,11 @@
+import csv
 import logging
-from scipy.stats import pearsonr, spearmanr
-from typing import List
 import os
-import csv
-from ... import InputExample
+from typing import List
+
+from scipy.stats import pearsonr, spearmanr
 
+from sentence_transformers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py b/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py
index f9bf25b25..c28ab631e 100644
--- a/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py
@@ -4,10 +4,10 @@
 from typing import List
 
 import numpy as np
+from sklearn.metrics import f1_score
 
+from sentence_transformers.cross_encoder import CrossEncoder
 from sentence_transformers.readers.InputExample import InputExample
-from .. import CrossEncoder
-from sklearn.metrics import f1_score
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py
index fa6160ec4..8552fc9d9 100644
--- a/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py
@@ -1,8 +1,9 @@
+import csv
 import logging
-import numpy as np
 import os
-import csv
 from typing import Optional
+
+import numpy as np
 from sklearn.metrics import ndcg_score
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CESoftmaxAccuracyEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CESoftmaxAccuracyEvaluator.py
index ec4704e31..e3973d952 100644
--- a/sentence_transformers/cross_encoder/evaluation/CESoftmaxAccuracyEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CESoftmaxAccuracyEvaluator.py
@@ -1,10 +1,11 @@
+import csv
 import logging
 import os
-import csv
 from typing import List
-from ... import InputExample
+
 import numpy as np
 
+from sentence_transformers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/__init__.py b/sentence_transformers/cross_encoder/evaluation/__init__.py
index 43d2db677..ac176ff83 100644
--- a/sentence_transformers/cross_encoder/evaluation/__init__.py
+++ b/sentence_transformers/cross_encoder/evaluation/__init__.py
@@ -1,9 +1,9 @@
 from .CEBinaryAccuracyEvaluator import CEBinaryAccuracyEvaluator
 from .CEBinaryClassificationEvaluator import CEBinaryClassificationEvaluator
-from .CEF1Evaluator import CEF1Evaluator
 from .CECorrelationEvaluator import CECorrelationEvaluator
-from .CESoftmaxAccuracyEvaluator import CESoftmaxAccuracyEvaluator
+from .CEF1Evaluator import CEF1Evaluator
 from .CERerankingEvaluator import CERerankingEvaluator
+from .CESoftmaxAccuracyEvaluator import CESoftmaxAccuracyEvaluator
 
 __all__ = [
     "CEBinaryAccuracyEvaluator",
diff --git a/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py b/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py
index 973d55cac..997413cdf 100644
--- a/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py
+++ b/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py
@@ -1,8 +1,10 @@
-from torch.utils.data import Dataset
 from typing import List
-from ..readers.InputExample import InputExample
+
 import numpy as np
-from transformers.utils.import_utils import is_nltk_available, NLTK_IMPORT_ERROR
+from torch.utils.data import Dataset
+from transformers.utils.import_utils import NLTK_IMPORT_ERROR, is_nltk_available
+
+from sentence_transformers.readers.InputExample import InputExample
 
 
 class DenoisingAutoEncoderDataset(Dataset):
@@ -34,7 +36,7 @@ def __len__(self):
     # Deletion noise.
     @staticmethod
     def delete(text, del_ratio=0.6):
-        from nltk import word_tokenize, TreebankWordDetokenizer
+        from nltk import TreebankWordDetokenizer, word_tokenize
 
         words = word_tokenize(text)
         n = len(words)
diff --git a/sentence_transformers/datasets/NoDuplicatesDataLoader.py b/sentence_transformers/datasets/NoDuplicatesDataLoader.py
index e05b504b7..f910183c1 100644
--- a/sentence_transformers/datasets/NoDuplicatesDataLoader.py
+++ b/sentence_transformers/datasets/NoDuplicatesDataLoader.py
@@ -1,5 +1,5 @@
-import random
 import math
+import random
 
 
 class NoDuplicatesDataLoader:
diff --git a/sentence_transformers/datasets/ParallelSentencesDataset.py b/sentence_transformers/datasets/ParallelSentencesDataset.py
index 1ec72e90b..6b64179dc 100644
--- a/sentence_transformers/datasets/ParallelSentencesDataset.py
+++ b/sentence_transformers/datasets/ParallelSentencesDataset.py
@@ -1,11 +1,12 @@
-from torch.utils.data import Dataset
-import logging
 import gzip
-from .. import SentenceTransformer
-from ..readers import InputExample
-from typing import List
+import logging
 import random
+from typing import List
+
+from torch.utils.data import Dataset
 
+from sentence_transformers import SentenceTransformer
+from sentence_transformers.readers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/datasets/SentenceLabelDataset.py b/sentence_transformers/datasets/SentenceLabelDataset.py
index ca90665eb..c716f82ed 100644
--- a/sentence_transformers/datasets/SentenceLabelDataset.py
+++ b/sentence_transformers/datasets/SentenceLabelDataset.py
@@ -1,8 +1,10 @@
-from torch.utils.data import IterableDataset
-import numpy as np
-from typing import List
-from ..readers import InputExample
 import logging
+from typing import List
+
+import numpy as np
+from torch.utils.data import IterableDataset
+
+from sentence_transformers.readers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/datasets/SentencesDataset.py b/sentence_transformers/datasets/SentencesDataset.py
index ae689676e..f7795a8fc 100644
--- a/sentence_transformers/datasets/SentencesDataset.py
+++ b/sentence_transformers/datasets/SentencesDataset.py
@@ -1,7 +1,9 @@
-from torch.utils.data import Dataset
 from typing import List
-from .. import SentenceTransformer
-from ..readers.InputExample import InputExample
+
+from torch.utils.data import Dataset
+
+from sentence_transformers import SentenceTransformer
+from sentence_transformers.readers.InputExample import InputExample
 
 
 class SentencesDataset(Dataset):
diff --git a/sentence_transformers/datasets/__init__.py b/sentence_transformers/datasets/__init__.py
index 6d0f06471..33cc755d5 100644
--- a/sentence_transformers/datasets/__init__.py
+++ b/sentence_transformers/datasets/__init__.py
@@ -1,8 +1,8 @@
 from .DenoisingAutoEncoderDataset import DenoisingAutoEncoderDataset
 from .NoDuplicatesDataLoader import NoDuplicatesDataLoader
 from .ParallelSentencesDataset import ParallelSentencesDataset
-from .SentencesDataset import SentencesDataset
 from .SentenceLabelDataset import SentenceLabelDataset
+from .SentencesDataset import SentencesDataset
 
 __all__ = [
     "DenoisingAutoEncoderDataset",
diff --git a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
index b3838f116..a4910c299 100644
--- a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
+++ b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
@@ -1,17 +1,19 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-
-from sentence_transformers.similarity_functions import SimilarityFunction
-from . import SentenceEvaluator
+import csv
 import logging
 import os
-import csv
-from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
-from sklearn.metrics import average_precision_score
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional
+
 import numpy as np
-from typing import Dict, List, Optional
-from ..readers import InputExample
+from sklearn.metrics import average_precision_score
+from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.readers import InputExample
+from sentence_transformers.similarity_functions import SimilarityFunction
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -150,7 +152,7 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs):
         return cls(sentences1, sentences2, scores, **kwargs)
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         """
         Compute the evaluation metrics for the given model.
diff --git a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py
index 4c89d6178..bf9729631 100644
--- a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py
+++ b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py
@@ -1,17 +1,19 @@
-from contextlib import nullcontext
-
-from sentence_transformers import SentenceTransformer
-from . import SentenceEvaluator
-from sentence_transformers.similarity_functions import SimilarityFunction
+import csv
 import logging
 import os
-import csv
-from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
-from scipy.stats import pearsonr, spearmanr
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Literal, Optional, Union
+
 import numpy as np
-from typing import Dict, List, Literal, Optional, Union
-from ..readers import InputExample
+from scipy.stats import pearsonr, spearmanr
+from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.readers import InputExample
+from sentence_transformers.similarity_functions import SimilarityFunction
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -139,7 +141,7 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs):
         return cls(sentences1, sentences2, scores, **kwargs)
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
diff --git a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py
index 8381abf5b..1d46f980c 100644
--- a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py
+++ b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py
@@ -1,18 +1,20 @@
-from sentence_transformers import SentenceTransformer
+import heapq
+import logging
+import os
 from contextlib import nullcontext
+from typing import TYPE_CHECKING, Callable, Dict, List, Optional, Set, Union
 
-from sentence_transformers.similarity_functions import SimilarityFunction
-from . import SentenceEvaluator
+import numpy as np
 import torch
 from torch import Tensor
-import logging
 from tqdm import trange
-from ..util import cos_sim, dot_score
-import os
-import numpy as np
-from typing import List, Dict, Optional, Set, Callable, Union
-import heapq
 
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.similarity_functions import SimilarityFunction
+from sentence_transformers.util import cos_sim, dot_score
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -203,7 +205,7 @@ def __init__(
                     self.csv_headers.append("{}-MAP@{}".format(score_name, k))
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1, *args, **kwargs
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1, *args, **kwargs
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
@@ -272,7 +274,7 @@ def __call__(
         return metrics
 
     def compute_metrices(
-        self, model: SentenceTransformer, corpus_model=None, corpus_embeddings: Tensor = None
+        self, model: "SentenceTransformer", corpus_model=None, corpus_embeddings: Tensor = None
    ) -> Dict[str, float]:
         if corpus_model is None:
             corpus_model = model
diff --git a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py
index 2bd65a163..4031cef9d 100644
--- a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py
+++ b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py
@@ -1,13 +1,16 @@
-from typing import Dict
-from sentence_transformers import SentenceTransformer
-from . import SentenceEvaluator
-import torch
-from torch.utils.data import DataLoader
+import csv
 import logging
-from ..util import batch_to_device
 import os
-import csv
+from typing import TYPE_CHECKING, Dict
+
+import torch
+from torch.utils.data import DataLoader
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.util import batch_to_device
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -42,7 +45,7 @@ def __init__(self, dataloader: DataLoader, name: str = "", softmax_model=None, w
         self.primary_metric = "accuracy"
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         model.eval()
         total = 0
diff --git a/sentence_transformers/evaluation/MSEEvaluator.py b/sentence_transformers/evaluation/MSEEvaluator.py
index 46e7518f6..6fa300bb1 100644
--- a/sentence_transformers/evaluation/MSEEvaluator.py
+++ b/sentence_transformers/evaluation/MSEEvaluator.py
@@ -1,11 +1,13 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from sentence_transformers.evaluation import SentenceEvaluator
+import csv
 import logging
 import os
-import csv
-from typing import Dict, List, Optional
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -94,7 +96,7 @@ def __init__(
         self.write_csv = write_csv
         self.primary_metric = "negative_mse"
 
-    def __call__(self, model: SentenceTransformer, output_path: str = None, epoch=-1, steps=-1) -> Dict[str, float]:
+    def __call__(self, model: "SentenceTransformer", output_path: str = None, epoch=-1, steps=-1) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
                 out_txt = f" after epoch {epoch}"
diff --git a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py
index 0d6fd2778..6147877ca 100644
--- a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py
+++ b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py
@@ -1,12 +1,15 @@
-from contextlib import nullcontext
-from sentence_transformers.evaluation import SentenceEvaluator
-from sentence_transformers import SentenceTransformer
-from typing import List, Optional, Tuple, Dict
-import numpy as np
+import csv
 import logging
 import os
-import csv
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional, Tuple
+
+import numpy as np
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -36,7 +39,7 @@ class MSEEvaluatorFromDataFrame(SentenceEvaluator):
     def __init__(
         self,
         dataframe: List[Dict[str, str]],
-        teacher_model: SentenceTransformer,
+        teacher_model: "SentenceTransformer",
         combinations: List[Tuple[str, str]],
         batch_size: int = 8,
         name: str = "",
@@ -81,7 +84,7 @@ def __init__(
         self.teacher_embeddings = {sent: emb for sent, emb in zip(all_source_sentences, all_src_embeddings)}
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         model.eval()
diff --git a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py
index 701edf29b..7728e1393 100644
--- a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py
+++ b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py
@@ -1,14 +1,15 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from . import SentenceEvaluator
+import csv
 import logging
-from sentence_transformers.util import paraphrase_mining
 import os
-import csv
-
-from typing import List, Optional, Tuple, Dict
 from collections import defaultdict
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional, Tuple
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.util import paraphrase_mining
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -155,7 +156,7 @@ def __init__(
         self.primary_metric = "average_precision"
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
diff --git a/sentence_transformers/evaluation/RerankingEvaluator.py b/sentence_transformers/evaluation/RerankingEvaluator.py
index 902d6281d..8fa6f93bd 100644
--- a/sentence_transformers/evaluation/RerankingEvaluator.py
+++ b/sentence_transformers/evaluation/RerankingEvaluator.py
@@ -1,15 +1,19 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from . import SentenceEvaluator
+import csv
 import logging
-import numpy as np
 import os
-import csv
-from ..util import cos_sim
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Callable, Dict, Optional
+
+import numpy as np
 import torch
-from sklearn.metrics import average_precision_score, ndcg_score
 import tqdm
-from typing import Callable, Dict, Optional
+from sklearn.metrics import average_precision_score, ndcg_score
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.util import cos_sim
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -86,7 +90,7 @@ def __init__(
         self.primary_metric = "map"
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         """
         Evaluates the model on the dataset and returns the evaluation metrics.
diff --git a/sentence_transformers/evaluation/SentenceEvaluator.py b/sentence_transformers/evaluation/SentenceEvaluator.py
index a336fc362..a3a2497ed 100644
--- a/sentence_transformers/evaluation/SentenceEvaluator.py
+++ b/sentence_transformers/evaluation/SentenceEvaluator.py
@@ -1,7 +1,8 @@
 import re
-from typing import Any, Dict, Union
+from typing import TYPE_CHECKING, Any, Dict, Union
 
-from sentence_transformers import SentenceTransformer
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class SentenceEvaluator:
@@ -16,7 +17,7 @@ def __init__(self):
         # TODO: Add better `primary_metrics` support
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Union[float, Dict[str, float]]:
         """
         This is called during training to evaluate the model.
diff --git a/sentence_transformers/evaluation/SequentialEvaluator.py b/sentence_transformers/evaluation/SequentialEvaluator.py
index 3f9f43f36..eac6ff79c 100644
--- a/sentence_transformers/evaluation/SequentialEvaluator.py
+++ b/sentence_transformers/evaluation/SequentialEvaluator.py
@@ -1,6 +1,9 @@
-from sentence_transformers import SentenceTransformer
-from . import SentenceEvaluator
-from typing import Dict, Iterable
+from typing import TYPE_CHECKING, Dict, Iterable
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class SequentialEvaluator(SentenceEvaluator):
@@ -33,7 +36,7 @@ def __init__(self, evaluators: Iterable[SentenceEvaluator], main_score_function=
         self.main_score_function = main_score_function
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         evaluations = []
         scores = []
diff --git a/sentence_transformers/evaluation/TranslationEvaluator.py b/sentence_transformers/evaluation/TranslationEvaluator.py
index 8580b8d70..312a8c6f5 100644
--- a/sentence_transformers/evaluation/TranslationEvaluator.py
+++ b/sentence_transformers/evaluation/TranslationEvaluator.py
@@ -1,14 +1,17 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from . import SentenceEvaluator
+import csv
 import logging
-from ..util import pytorch_cos_sim
 import os
-import csv
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional
+
 import numpy as np
-from typing import Dict, List, Optional
 import torch
 
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.util import pytorch_cos_sim
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -97,7 +100,7 @@ def __init__(
         self.primary_metric = "mean_accuracy"
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
diff --git a/sentence_transformers/evaluation/TripletEvaluator.py b/sentence_transformers/evaluation/TripletEvaluator.py
index fe34a32ac..7b26c4a27 100644
--- a/sentence_transformers/evaluation/TripletEvaluator.py
+++ b/sentence_transformers/evaluation/TripletEvaluator.py
@@ -1,15 +1,18 @@
-import numpy as np
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from . import SentenceEvaluator
-from sentence_transformers.similarity_functions import SimilarityFunction
+import csv
 import logging
 import os
-import csv
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional, Union
+
+import numpy as np
 from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
-from typing import Dict, List, Optional, Union
-from ..readers import InputExample
 
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.readers import InputExample
+from sentence_transformers.similarity_functions import SimilarityFunction
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -118,7 +121,7 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs):
         return cls(anchors, positives, negatives, **kwargs)
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
diff --git a/sentence_transformers/evaluation/__init__.py b/sentence_transformers/evaluation/__init__.py
index 5c0309027..7a2568992 100644
--- a/sentence_transformers/evaluation/__init__.py
+++ b/sentence_transformers/evaluation/__init__.py
@@ -1,5 +1,3 @@
-from .SentenceEvaluator import SentenceEvaluator
-from .SimilarityFunction import SimilarityFunction
 from .BinaryClassificationEvaluator import BinaryClassificationEvaluator
 from .EmbeddingSimilarityEvaluator import EmbeddingSimilarityEvaluator
 from .InformationRetrievalEvaluator import InformationRetrievalEvaluator
@@ -7,10 +5,12 @@
 from .MSEEvaluator import MSEEvaluator
 from .MSEEvaluatorFromDataFrame import MSEEvaluatorFromDataFrame
 from .ParaphraseMiningEvaluator import ParaphraseMiningEvaluator
+from .RerankingEvaluator import RerankingEvaluator
+from .SentenceEvaluator import SentenceEvaluator
 from .SequentialEvaluator import SequentialEvaluator
+from .SimilarityFunction import SimilarityFunction
 from .TranslationEvaluator import TranslationEvaluator
 from .TripletEvaluator import TripletEvaluator
-from .RerankingEvaluator import RerankingEvaluator
 
 __all__ = [
     "SentenceEvaluator",
diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py
index 8ec3b94ba..62e688ff6 100644
--- a/sentence_transformers/fit_mixin.py
+++ b/sentence_transformers/fit_mixin.py
@@ -1,38 +1,40 @@
 import json
 import logging
 import os
-from pathlib import Path
 import shutil
-from typing import Any, List, Dict, Tuple, Iterable, Type, Callable, Optional, TYPE_CHECKING
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, Callable, Dict, Iterable, List, Optional, Tuple, Type
+
 import numpy as np
-import transformers
 import torch
-from torch import nn, Tensor
+import transformers
+from torch import Tensor, nn
 from torch.optim import Optimizer
 from torch.utils.data import DataLoader
 from tqdm.autonotebook import trange
+from transformers import TrainerCallback, TrainerControl, TrainerState
+
 from datasets import Dataset, DatasetDict
-from transformers import TrainerCallback, TrainerState, TrainerControl
 
-from sentence_transformers.datasets.SentenceLabelDataset import SentenceLabelDataset
 from sentence_transformers.datasets.NoDuplicatesDataLoader import NoDuplicatesDataLoader
+from sentence_transformers.datasets.SentenceLabelDataset import SentenceLabelDataset
 from sentence_transformers.training_args import (
-    SentenceTransformerTrainingArguments,
-    MultiDatasetBatchSamplers,
     BatchSamplers,
+    MultiDatasetBatchSamplers,
+    SentenceTransformerTrainingArguments,
 )
 
 from .evaluation import SentenceEvaluator
+from .model_card_templates import ModelCardTemplate
 from .util import (
     batch_to_device,
     fullname,
 )
-from .model_card_templates import ModelCardTemplate
 
 logger = logging.getLogger(__name__)
 
 if TYPE_CHECKING:
-    from sentence_transformers.SentenceTransformer import SentenceTransformer
     from sentence_transformers.readers.InputExample import InputExample
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class SaveModelCallback(TrainerCallback):
diff --git a/sentence_transformers/losses/AdaptiveLayerLoss.py b/sentence_transformers/losses/AdaptiveLayerLoss.py
index d50ac3227..d91e83834 100644
--- a/sentence_transformers/losses/AdaptiveLayerLoss.py
+++ b/sentence_transformers/losses/AdaptiveLayerLoss.py
@@ -1,12 +1,14 @@
 import random
-from typing import Any, Dict, Iterable, List, Tuple
 import warnings
+from typing import Any, Dict, Iterable, List, Tuple
+
+import torch
 from torch import Tensor, nn
 from torch.nn import functional as F
-import torch
+
 from sentence_transformers import SentenceTransformer
-from sentence_transformers.losses.CachedMultipleNegativesRankingLoss import CachedMultipleNegativesRankingLoss
 from sentence_transformers.losses.CachedGISTEmbedLoss import CachedGISTEmbedLoss
+from sentence_transformers.losses.CachedMultipleNegativesRankingLoss import CachedMultipleNegativesRankingLoss
 from sentence_transformers.models import Transformer
diff --git a/sentence_transformers/losses/AnglELoss.py b/sentence_transformers/losses/AnglELoss.py
index 661e8693d..f69900d87 100644
--- a/sentence_transformers/losses/AnglELoss.py
+++ b/sentence_transformers/losses/AnglELoss.py
@@ -1,4 +1,4 @@
-from sentence_transformers import losses, SentenceTransformer, util
+from sentence_transformers import SentenceTransformer, losses, util
 
 
 class AnglELoss(losses.CoSENTLoss):
diff --git a/sentence_transformers/losses/BatchAllTripletLoss.py b/sentence_transformers/losses/BatchAllTripletLoss.py
index 1843a615a..925c1a591 100644
--- a/sentence_transformers/losses/BatchAllTripletLoss.py
+++ b/sentence_transformers/losses/BatchAllTripletLoss.py
@@ -1,8 +1,11 @@
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+from typing import Dict, Iterable
+
+from torch import Tensor, nn
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 
+from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+
 
 class BatchAllTripletLoss(nn.Module):
     def __init__(
diff --git a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py
index 02914a165..b2212de7a 100644
--- a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py
+++ b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py
@@ -1,9 +1,12 @@
+from typing import Dict, Iterable
+
 import torch
 from torch import Tensor
-from typing import Iterable, Dict
-from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 
+from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+
 
 class BatchHardSoftMarginTripletLoss(BatchHardTripletLoss):
     def __init__(
diff --git a/sentence_transformers/losses/BatchHardTripletLoss.py b/sentence_transformers/losses/BatchHardTripletLoss.py
index 51df4a8b5..ca940e657 100644
--- a/sentence_transformers/losses/BatchHardTripletLoss.py
+++ b/sentence_transformers/losses/BatchHardTripletLoss.py
@@ -1,6 +1,8 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
+from torch import Tensor, nn
+
 from sentence_transformers import util
 from sentence_transformers.SentenceTransformer import SentenceTransformer
diff --git a/sentence_transformers/losses/BatchSemiHardTripletLoss.py b/sentence_transformers/losses/BatchSemiHardTripletLoss.py
index c997d1f58..20b8316c2 100644
--- a/sentence_transformers/losses/BatchSemiHardTripletLoss.py
+++ b/sentence_transformers/losses/BatchSemiHardTripletLoss.py
@@ -1,9 +1,12 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from .BatchHardTripletLoss import BatchHardTripletLossDistanceFunction
+from torch import Tensor, nn
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 
+from .BatchHardTripletLoss import BatchHardTripletLossDistanceFunction
+
 
 class BatchSemiHardTripletLoss(nn.Module):
     def __init__(
diff --git a/sentence_transformers/losses/CachedGISTEmbedLoss.py b/sentence_transformers/losses/CachedGISTEmbedLoss.py
index cf3392456..e3208298f 100644
--- a/sentence_transformers/losses/CachedGISTEmbedLoss.py
+++ b/sentence_transformers/losses/CachedGISTEmbedLoss.py
@@ -1,12 +1,15 @@
 from __future__ import annotations
+
 from contextlib import nullcontext
 from functools import partial
+from typing import Dict, Iterable, Iterator, List, Optional, Tuple
+
 import torch
-from torch import nn, Tensor
+import tqdm
+from torch import Tensor, nn
 from torch.utils.checkpoint import get_device_states, set_device_states
-from typing import Iterable, Dict, Iterator, List, Optional, Tuple
+
 from sentence_transformers import SentenceTransformer
-import tqdm
 from sentence_transformers.models import Transformer
diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
index d3e2c7204..5e1b4e1d0 100644
--- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
+++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
@@ -1,13 +1,15 @@
 from __future__ import annotations
+
 from contextlib import nullcontext
 from functools import partial
+from typing import Dict, Iterable, Iterator, List, Optional, Tuple
+
 import torch
-from torch import nn, Tensor
-from torch.utils.checkpoint import get_device_states, set_device_states
-from typing import Iterable, Dict, Iterator, List, Optional, Tuple
-from sentence_transformers import SentenceTransformer
-from sentence_transformers import util
 import tqdm
+from torch import Tensor, nn
+from torch.utils.checkpoint import get_device_states, set_device_states
+
+from sentence_transformers import SentenceTransformer, util
 
 
 class RandContext:
diff --git a/sentence_transformers/losses/CoSENTLoss.py b/sentence_transformers/losses/CoSENTLoss.py
index e0c5203b7..e59de1bdc 100644
--- a/sentence_transformers/losses/CoSENTLoss.py
+++ b/sentence_transformers/losses/CoSENTLoss.py
@@ -1,8 +1,10 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from ..SentenceTransformer import SentenceTransformer
-from .. import util
+from torch import Tensor, nn
+
+from sentence_transformers import util
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class CoSENTLoss(nn.Module):
diff --git a/sentence_transformers/losses/ContrastiveLoss.py b/sentence_transformers/losses/ContrastiveLoss.py
index a7c66792f..b11c80290 100644
--- a/sentence_transformers/losses/ContrastiveLoss.py
+++ b/sentence_transformers/losses/ContrastiveLoss.py
@@ -1,7 +1,9 @@
 from enum import Enum
-from typing import Iterable, Dict
+from typing import Dict, Iterable
+
 import torch.nn.functional as F
-from torch import nn, Tensor
+from torch import Tensor, nn
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
diff --git a/sentence_transformers/losses/ContrastiveTensionLoss.py b/sentence_transformers/losses/ContrastiveTensionLoss.py
index 08dc5d723..885a7db28 100644
--- a/sentence_transformers/losses/ContrastiveTensionLoss.py
+++ b/sentence_transformers/losses/ContrastiveTensionLoss.py
@@ -1,13 +1,14 @@
-import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from ..SentenceTransformer import SentenceTransformer
-from .. import util
 import copy
-import random
 import math
-from .. import InputExample
+import random
+from typing import Dict, Iterable
+
 import numpy as np
+import torch
+from torch import Tensor, nn
+
+from sentence_transformers import InputExample, util
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class ContrastiveTensionLoss(nn.Module):
diff --git a/sentence_transformers/losses/CosineSimilarityLoss.py b/sentence_transformers/losses/CosineSimilarityLoss.py
index 8d27300e7..bf9776474 100644
--- a/sentence_transformers/losses/CosineSimilarityLoss.py
+++ b/sentence_transformers/losses/CosineSimilarityLoss.py
@@ -1,9 +1,10 @@
+from typing import Any, Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Any, Iterable, Dict
+from torch import Tensor, nn
 
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 from sentence_transformers.util import fullname
-from ..SentenceTransformer import SentenceTransformer
 
 
 class CosineSimilarityLoss(nn.Module):
diff --git a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py
index cdc35cb85..56a31d702 100644
--- a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py
+++ b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py
@@ -1,8 +1,10 @@
-from torch import nn, Tensor
-from typing import Iterable, Dict, Optional
-from sentence_transformers import SentenceTransformer
-from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, PreTrainedModel
 import logging
+from typing import Dict, Iterable, Optional
+
+from torch import Tensor, nn
+from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, PreTrainedModel
+
+from sentence_transformers import SentenceTransformer
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/losses/GISTEmbedLoss.py b/sentence_transformers/losses/GISTEmbedLoss.py
index 6c719d511..9646402e4 100644
--- a/sentence_transformers/losses/GISTEmbedLoss.py
+++ b/sentence_transformers/losses/GISTEmbedLoss.py
@@ -1,8 +1,10 @@
-from typing import Any, Iterable, Dict
+from typing import Any, Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from sentence_transformers.SentenceTransformer import SentenceTransformer
+from torch import Tensor, nn
+
 from sentence_transformers.models import Transformer
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class GISTEmbedLoss(nn.Module):
diff --git a/sentence_transformers/losses/MSELoss.py b/sentence_transformers/losses/MSELoss.py
index c377a96e1..f7349a39d 100644
--- a/sentence_transformers/losses/MSELoss.py
+++ b/sentence_transformers/losses/MSELoss.py
@@ -1,6 +1,7 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
+from torch import Tensor, nn
 
 
 class MSELoss(nn.Module):
diff --git a/sentence_transformers/losses/MarginMSELoss.py b/sentence_transformers/losses/MarginMSELoss.py
index 44ab49710..2e13f59f5 100644
--- a/sentence_transformers/losses/MarginMSELoss.py
+++ b/sentence_transformers/losses/MarginMSELoss.py
@@ -1,6 +1,8 @@
-from .. import util
-from torch import nn, Tensor
-from typing import Iterable, Dict
+from typing import Dict, Iterable
+
+from torch import Tensor, nn
+
+from sentence_transformers import util
 
 
 class MarginMSELoss(nn.Module):
diff --git a/sentence_transformers/losses/Matryoshka2dLoss.py b/sentence_transformers/losses/Matryoshka2dLoss.py
index a3043da89..9c30543bd 100644
--- a/sentence_transformers/losses/Matryoshka2dLoss.py
+++ b/sentence_transformers/losses/Matryoshka2dLoss.py
@@ -1,7 +1,9 @@
 from typing import Any, Dict, List, Optional, Union
+
 from torch.nn import Module
-from sentence_transformers.SentenceTransformer import SentenceTransformer
+
 from sentence_transformers.losses import AdaptiveLayerLoss, MatryoshkaLoss
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class Matryoshka2dLoss(AdaptiveLayerLoss):
diff --git a/sentence_transformers/losses/MatryoshkaLoss.py b/sentence_transformers/losses/MatryoshkaLoss.py
index acdac95f4..850225239 100644
--- a/sentence_transformers/losses/MatryoshkaLoss.py
+++ b/sentence_transformers/losses/MatryoshkaLoss.py
@@ -1,8 +1,10 @@
 import random
-from typing import Any, Dict, Iterable, List, Optional, Union
 import warnings
-from torch import Tensor, nn
+from typing import Any, Dict, Iterable, List, Optional, Union
+
 import torch.nn.functional as F
+from torch import Tensor, nn
+
 from sentence_transformers import SentenceTransformer
 from sentence_transformers.losses.CachedGISTEmbedLoss import CachedGISTEmbedLoss
 from sentence_transformers.losses.CachedMultipleNegativesRankingLoss import CachedMultipleNegativesRankingLoss
diff --git a/sentence_transformers/losses/MegaBatchMarginLoss.py b/sentence_transformers/losses/MegaBatchMarginLoss.py
index dc63ba6d9..bc7ce6d05 100644
--- a/sentence_transformers/losses/MegaBatchMarginLoss.py
+++ b/sentence_transformers/losses/MegaBatchMarginLoss.py
@@ -1,8 +1,10 @@
-from .. import util
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
 import torch.nn.functional as F
+from torch import Tensor, nn
+
+from sentence_transformers import util
 
 
 class MegaBatchMarginLoss(nn.Module):
diff --git a/sentence_transformers/losses/MultipleNegativesRankingLoss.py b/sentence_transformers/losses/MultipleNegativesRankingLoss.py
index 78b03303c..0c3761523 100644
--- a/sentence_transformers/losses/MultipleNegativesRankingLoss.py
+++ b/sentence_transformers/losses/MultipleNegativesRankingLoss.py
@@ -1,8 +1,10 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from ..SentenceTransformer import SentenceTransformer
-from .. import util
+from torch import Tensor, nn
+
+from sentence_transformers import util
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class MultipleNegativesRankingLoss(nn.Module):
diff --git a/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py b/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py
index 979502dde..553cb0b97 100644
--- a/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py
+++ b/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py
@@ -1,8 +1,10 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from ..SentenceTransformer import SentenceTransformer
-from .. import util
+from torch import Tensor, nn
+
+from sentence_transformers import util
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class MultipleNegativesSymmetricRankingLoss(nn.Module):
diff --git a/sentence_transformers/losses/OnlineContrastiveLoss.py b/sentence_transformers/losses/OnlineContrastiveLoss.py
index d36e61ccf..fcfd1a69f 100644
--- a/sentence_transformers/losses/OnlineContrastiveLoss.py
+++ b/sentence_transformers/losses/OnlineContrastiveLoss.py
@@ -1,9 +1,12 @@
-from typing import Iterable, Dict
+from typing import Dict, Iterable
+
 import torch.nn.functional as F
-from torch import nn, Tensor
-from .ContrastiveLoss import SiameseDistanceMetric
+from torch import Tensor, nn
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 
+from .ContrastiveLoss import SiameseDistanceMetric
+
 
 class OnlineContrastiveLoss(nn.Module):
     def __init__(
diff --git a/sentence_transformers/losses/SoftmaxLoss.py b/sentence_transformers/losses/SoftmaxLoss.py
index eaf95d4e3..44e240499 100644
--- a/sentence_transformers/losses/SoftmaxLoss.py
+++ b/sentence_transformers/losses/SoftmaxLoss.py
@@ -1,9 +1,10 @@
-import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict, Callable
-from ..SentenceTransformer import SentenceTransformer
 import logging
+from typing import Callable, Dict, Iterable
+
+import torch
+from torch import Tensor, nn
 
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/losses/TripletLoss.py b/sentence_transformers/losses/TripletLoss.py
index de44db228..53c6c88d6 100644
--- a/sentence_transformers/losses/TripletLoss.py
+++ b/sentence_transformers/losses/TripletLoss.py
@@ -1,8 +1,10 @@
-from torch import nn, Tensor
-from typing import Iterable, Dict
-import torch.nn.functional as F
 from enum import Enum
-from ..SentenceTransformer import SentenceTransformer
+from typing import Dict, Iterable
+
+import torch.nn.functional as F
+from torch import Tensor, nn
+
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class TripletDistanceMetric(Enum):
diff --git a/sentence_transformers/losses/__init__.py b/sentence_transformers/losses/__init__.py
index 00a64e2cb..fdac35735 100644
--- a/sentence_transformers/losses/__init__.py
+++ b/sentence_transformers/losses/__init__.py
@@ -1,33 +1,33 @@
+# CoSENTLoss must be imported before AnglELoss
+from .CoSENTLoss import CoSENTLoss  # isort: skip
+
 from .AdaptiveLayerLoss import AdaptiveLayerLoss
-from .CosineSimilarityLoss import CosineSimilarityLoss
-from .SoftmaxLoss import SoftmaxLoss
-from .MultipleNegativesRankingLoss import MultipleNegativesRankingLoss
-from .MultipleNegativesSymmetricRankingLoss import MultipleNegativesSymmetricRankingLoss
-from .TripletLoss import TripletDistanceMetric, TripletLoss
-from .MarginMSELoss import MarginMSELoss
-from .MatryoshkaLoss import MatryoshkaLoss
-from .Matryoshka2dLoss import Matryoshka2dLoss
-from .MSELoss import MSELoss
+from .AnglELoss import AnglELoss
+from .BatchAllTripletLoss import BatchAllTripletLoss
+from .BatchHardSoftMarginTripletLoss import BatchHardSoftMarginTripletLoss
+from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+from .BatchSemiHardTripletLoss import BatchSemiHardTripletLoss
+from .CachedGISTEmbedLoss import CachedGISTEmbedLoss
 from .CachedMultipleNegativesRankingLoss import CachedMultipleNegativesRankingLoss
-from .ContrastiveLoss import SiameseDistanceMetric, ContrastiveLoss
+from .ContrastiveLoss import ContrastiveLoss, SiameseDistanceMetric
 from .ContrastiveTensionLoss import (
+    ContrastiveTensionDataLoader,
     ContrastiveTensionLoss,
     ContrastiveTensionLossInBatchNegatives,
-    ContrastiveTensionDataLoader,
 )
-from .CoSENTLoss import CoSENTLoss
-from .AnglELoss import AnglELoss
-from .OnlineContrastiveLoss import OnlineContrastiveLoss
-from .MegaBatchMarginLoss import MegaBatchMarginLoss
+from .CosineSimilarityLoss import CosineSimilarityLoss
 from .DenoisingAutoEncoderLoss import DenoisingAutoEncoderLoss
 from .GISTEmbedLoss import GISTEmbedLoss
-from .CachedGISTEmbedLoss import CachedGISTEmbedLoss
-
-# Triplet losses
-from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
-from .BatchHardSoftMarginTripletLoss import BatchHardSoftMarginTripletLoss
-from .BatchSemiHardTripletLoss import BatchSemiHardTripletLoss
-from .BatchAllTripletLoss import BatchAllTripletLoss
+from .MarginMSELoss import MarginMSELoss
+from .Matryoshka2dLoss import Matryoshka2dLoss
+from .MatryoshkaLoss import MatryoshkaLoss
+from .MegaBatchMarginLoss import MegaBatchMarginLoss
+from .MSELoss import MSELoss
+from .MultipleNegativesRankingLoss import MultipleNegativesRankingLoss
+from .MultipleNegativesSymmetricRankingLoss import MultipleNegativesSymmetricRankingLoss
+from .OnlineContrastiveLoss import OnlineContrastiveLoss
+from .SoftmaxLoss import SoftmaxLoss
+from .TripletLoss import TripletDistanceMetric, TripletLoss
 
 __all__ = [
     "AdaptiveLayerLoss",
diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py
index f41cefefa..7488d006d 100644
--- a/sentence_transformers/model_card.py
+++ b/sentence_transformers/model_card.py
@@ -1,39 +1,39 @@
-from copy import copy
 import json
+import logging
 import random
+import re
 from collections import Counter, defaultdict
+from copy import copy
 from dataclasses import dataclass, field, fields
 from pathlib import Path
 from platform import python_version
-import re
 from textwrap import indent
 from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Union
-import logging
 
 import torch
-from torch import nn
 import transformers
-from datasets import Dataset, DatasetDict
-from huggingface_hub import CardData, ModelCard, dataset_info as get_dataset_info, model_info as get_model_info
-from huggingface_hub.repocard_data import eval_results_to_model_index, EvalResult
+from huggingface_hub import CardData, ModelCard
+from huggingface_hub import dataset_info as get_dataset_info
+from huggingface_hub import model_info as get_model_info
+from huggingface_hub.repocard_data import EvalResult, eval_results_to_model_index
 from huggingface_hub.utils import yaml_dump
+from torch import nn
+from tqdm.autonotebook import tqdm
 from transformers import TrainerCallback
 from transformers.integrations import CodeCarbonCallback
 from transformers.modelcard import make_markdown_table
 from transformers.trainer_callback import TrainerControl, TrainerState
-from tqdm.autonotebook import tqdm
+
+from datasets import Dataset, DatasetDict
 from sentence_transformers import __version__ as sentence_transformers_version
-from sentence_transformers.evaluation import SequentialEvaluator
 from sentence_transformers.models import Transformer
-from sentence_transformers.util import cos_sim, fullname
 from sentence_transformers.training_args import SentenceTransformerTrainingArguments
-
+from sentence_transformers.util import cos_sim, fullname
 
 logger = logging.getLogger(__name__)
 
 if TYPE_CHECKING:
-    from sentence_transformers.evaluation import SentenceEvaluator
+    from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
     from sentence_transformers.SentenceTransformer import SentenceTransformer
     from sentence_transformers.trainer import SentenceTransformerTrainer
@@ -205,9 +205,10 @@ def on_log(
 
 def get_versions() -> Dict[str, Any]:
     from accelerate import __version__ as accelerate_version
-    from datasets import __version__ as datasets_version
     from tokenizers import __version__ as tokenizers_version
 
+    from datasets import __version__ as datasets_version
+
     return {
         "python": python_version(),
         "sentence_transformers": sentence_transformers_version,
@@ -435,6 +436,8 @@ def set_widget_examples(self, dataset: Union[Dataset, DatasetDict]) -> None:
             self.predict_example = [source_sentence, similar_sentence, median_sentence]
 
     def set_evaluation_metrics(self, evaluator: "SentenceEvaluator", metrics: Dict[str, Any]):
+        from sentence_transformers.evaluation import SequentialEvaluator
+
         self.eval_results_dict[evaluator] = copy(metrics)
 
         # If the evaluator has a primary metric and we have a trainer, then add the primary metric to the training logs
diff --git a/sentence_transformers/models/Asym.py b/sentence_transformers/models/Asym.py
index a84911d50..c4ca6ace8 100644
--- a/sentence_transformers/models/Asym.py
+++ b/sentence_transformers/models/Asym.py
@@ -1,10 +1,11 @@
-from torch import Tensor
-from torch import nn
-import os
 import json
-from ..util import import_from_string
+import os
 from collections import OrderedDict
-from typing import List, Dict, Union, Tuple
+from typing import Dict, List, Tuple, Union
+
+from torch import Tensor, nn
+
+from sentence_transformers.util import import_from_string
 
 
 class Asym(nn.Sequential):
diff --git a/sentence_transformers/models/BoW.py b/sentence_transformers/models/BoW.py
index c9f7aef06..1501ff121 100644
--- a/sentence_transformers/models/BoW.py
+++ b/sentence_transformers/models/BoW.py
@@ -1,12 +1,12 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import List, Dict
-import os
 import json
 import logging
-from .tokenizer import WhitespaceTokenizer
+import os
+from typing import Dict, List
 
+import torch
+from torch import Tensor, nn
+
+from .tokenizer import WhitespaceTokenizer
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/models/CLIPModel.py b/sentence_transformers/models/CLIPModel.py
index 9e5a06842..79a8f9f73 100644
--- a/sentence_transformers/models/CLIPModel.py
+++ b/sentence_transformers/models/CLIPModel.py
@@ -1,8 +1,9 @@
 from typing import Union
-from torch import nn
-import transformers
+
 import torch
+import transformers
 from PIL import Image
+from torch import nn
 
 
 class CLIPModel(nn.Module):
diff --git a/sentence_transformers/models/CNN.py b/sentence_transformers/models/CNN.py
index feaa23901..5cefa7f4e 100644
--- a/sentence_transformers/models/CNN.py
+++ b/sentence_transformers/models/CNN.py
@@ -1,8 +1,9 @@
+import json
+import os
+from typing import List
+
 import torch
 from torch import nn
-from typing import List
-import os
-import json
 
 
 class CNN(nn.Module):
diff --git a/sentence_transformers/models/Dense.py b/sentence_transformers/models/Dense.py
index bfd50e5e5..f0c5f1f4f 100644
--- a/sentence_transformers/models/Dense.py
+++ b/sentence_transformers/models/Dense.py
@@ -1,10 +1,11 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
-from ..util import fullname, import_from_string
+import os
+from typing import Dict
+
+import torch
+from torch import Tensor, nn
+
+from sentence_transformers.util import fullname, import_from_string
 
 
 class Dense(nn.Module):
diff --git a/sentence_transformers/models/Dropout.py b/sentence_transformers/models/Dropout.py
index ea353279d..f909e609b 100644
--- a/sentence_transformers/models/Dropout.py
+++ b/sentence_transformers/models/Dropout.py
@@ -1,8 +1,8 @@
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
+import os
+from typing import Dict
+
+from torch import Tensor, nn
 
 
 class Dropout(nn.Module):
diff --git a/sentence_transformers/models/LSTM.py b/sentence_transformers/models/LSTM.py
index bab555d17..239aff4f0 100644
--- a/sentence_transformers/models/LSTM.py
+++ b/sentence_transformers/models/LSTM.py
@@ -1,8 +1,9 @@
+import json
+import os
+from typing import List
+
 import torch
 from torch import nn
-from typing import List
-import os
-import json
 
 
 class LSTM(nn.Module):
diff --git a/sentence_transformers/models/LayerNorm.py b/sentence_transformers/models/LayerNorm.py
index f63369223..d02fd32e2 100644
--- a/sentence_transformers/models/LayerNorm.py
+++ b/sentence_transformers/models/LayerNorm.py
@@ -1,9 +1,9 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
+import os
+from typing import Dict
+
+import torch
+from torch import Tensor, nn
 
 
 class LayerNorm(nn.Module):
diff --git a/sentence_transformers/models/Normalize.py b/sentence_transformers/models/Normalize.py
index 337b92a72..06dc44186 100644
--- a/sentence_transformers/models/Normalize.py
+++ b/sentence_transformers/models/Normalize.py
@@ -1,7 +1,7 @@
-from torch import Tensor
-from torch import nn
 from typing import Dict
+
 import torch.nn.functional as F
+from torch import Tensor, nn
 
 
 class Normalize(nn.Module):
diff --git a/sentence_transformers/models/Pooling.py b/sentence_transformers/models/Pooling.py
index 9cddc7e4f..e0a3bf954 100644
--- a/sentence_transformers/models/Pooling.py
+++ b/sentence_transformers/models/Pooling.py
@@ -1,9 +1,9 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
+import os
+from typing import Dict
+
+import torch
+from torch import Tensor, nn
 
 
 class Pooling(nn.Module):
diff --git a/sentence_transformers/models/Transformer.py b/sentence_transformers/models/Transformer.py
index f9d94e2d1..d5a670869 100644
--- a/sentence_transformers/models/Transformer.py
+++ b/sentence_transformers/models/Transformer.py
@@ -1,8 +1,9 @@
-from torch import nn
-from transformers import AutoModel, AutoTokenizer, AutoConfig, T5Config, MT5Config
 import json
-from typing import Any, List, Dict, Optional, Union, Tuple
 import os
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+from torch import nn
+from transformers import AutoConfig, AutoModel, AutoTokenizer, MT5Config, T5Config
 
 
 class Transformer(nn.Module):
diff --git a/sentence_transformers/models/WeightedLayerPooling.py b/sentence_transformers/models/WeightedLayerPooling.py
index 33d5f4406..beb686d57 100644
--- a/sentence_transformers/models/WeightedLayerPooling.py
+++ b/sentence_transformers/models/WeightedLayerPooling.py
@@ -1,9 +1,9 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
+import os
+from typing import Dict
+
+import torch
+from torch import Tensor, nn
 
 
 class WeightedLayerPooling(nn.Module):
diff --git a/sentence_transformers/models/WordEmbeddings.py b/sentence_transformers/models/WordEmbeddings.py
index 44d7c5931..40086ade2 100644
--- a/sentence_transformers/models/WordEmbeddings.py
+++ b/sentence_transformers/models/WordEmbeddings.py
@@ -1,15 +1,17 @@
+import gzip
+import json
+import logging
+import os
+from typing import List
+
+import numpy as np
 import torch
 from torch import nn
-from typing import List
-import logging
-import gzip
 from tqdm import tqdm
-import numpy as np
-import os
-import json
-from ..util import import_from_string, fullname, http_get
-from .tokenizer import WordTokenizer, WhitespaceTokenizer
+from sentence_transformers.util import fullname, http_get, import_from_string
+
+from .tokenizer import WhitespaceTokenizer, WordTokenizer
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/models/WordWeights.py b/sentence_transformers/models/WordWeights.py
index 3e53738bd..d545fab38 100644
--- a/sentence_transformers/models/WordWeights.py
+++ b/sentence_transformers/models/WordWeights.py
@@ -1,11 +1,10 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import List, Dict
-import os
 import json
 import logging
+import os
+from typing import Dict, List
 
+import torch
+from torch import Tensor, nn
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/models/__init__.py b/sentence_transformers/models/__init__.py
index a0a518ba4..c238101ed 100644
--- a/sentence_transformers/models/__init__.py
+++ b/sentence_transformers/models/__init__.py
@@ -1,6 +1,6 @@
-from .Transformer import Transformer
 from .Asym import Asym
 from .BoW import BoW
+from .CLIPModel import CLIPModel
 from .CNN import CNN
 from .Dense import Dense
 from .Dropout import Dropout
@@ -8,10 +8,10 @@
 from .LSTM import LSTM
 from .Normalize import Normalize
 from .Pooling import Pooling
+from .Transformer import Transformer
 from .WeightedLayerPooling import WeightedLayerPooling
 from .WordEmbeddings import WordEmbeddings
 from .WordWeights import WordWeights
-from .CLIPModel import CLIPModel
 
 __all__ = [
     "Transformer",
diff --git a/sentence_transformers/models/tokenizer/PhraseTokenizer.py b/sentence_transformers/models/tokenizer/PhraseTokenizer.py
index 578a3c949..834154e0d 100644
--- a/sentence_transformers/models/tokenizer/PhraseTokenizer.py
+++ b/sentence_transformers/models/tokenizer/PhraseTokenizer.py
@@ -1,12 +1,13 @@
-from typing import List, Iterable
 import collections
-import string
-import os
 import json
 import logging
-from .WordTokenizer import WordTokenizer, ENGLISH_STOP_WORDS
-from transformers.utils.import_utils import is_nltk_available, NLTK_IMPORT_ERROR
+import os
+import string
+from typing import Iterable, List
+
+from transformers.utils.import_utils import NLTK_IMPORT_ERROR, is_nltk_available
+
+from .WordTokenizer import ENGLISH_STOP_WORDS, WordTokenizer
 
 logger =
logging.getLogger(__name__) diff --git a/sentence_transformers/models/tokenizer/WhitespaceTokenizer.py b/sentence_transformers/models/tokenizer/WhitespaceTokenizer.py index a5d9fc478..7a6a39473 100644 --- a/sentence_transformers/models/tokenizer/WhitespaceTokenizer.py +++ b/sentence_transformers/models/tokenizer/WhitespaceTokenizer.py @@ -1,9 +1,10 @@ -from typing import List, Iterable import collections -import string -import os import json -from .WordTokenizer import WordTokenizer, ENGLISH_STOP_WORDS +import os +import string +from typing import Iterable, List + +from .WordTokenizer import ENGLISH_STOP_WORDS, WordTokenizer class WhitespaceTokenizer(WordTokenizer): diff --git a/sentence_transformers/models/tokenizer/WordTokenizer.py b/sentence_transformers/models/tokenizer/WordTokenizer.py index cfe00f701..51bcfd09c 100644 --- a/sentence_transformers/models/tokenizer/WordTokenizer.py +++ b/sentence_transformers/models/tokenizer/WordTokenizer.py @@ -1,5 +1,5 @@ from abc import ABC, abstractmethod -from typing import List, Iterable +from typing import Iterable, List ENGLISH_STOP_WORDS = [ "!", diff --git a/sentence_transformers/models/tokenizer/__init__.py b/sentence_transformers/models/tokenizer/__init__.py index f11b207eb..b09bed73a 100644 --- a/sentence_transformers/models/tokenizer/__init__.py +++ b/sentence_transformers/models/tokenizer/__init__.py @@ -1,5 +1,5 @@ -from .WordTokenizer import WordTokenizer, ENGLISH_STOP_WORDS -from .WhitespaceTokenizer import WhitespaceTokenizer from .PhraseTokenizer import PhraseTokenizer +from .WhitespaceTokenizer import WhitespaceTokenizer +from .WordTokenizer import ENGLISH_STOP_WORDS, WordTokenizer __all__ = ["WordTokenizer", "WhitespaceTokenizer", "PhraseTokenizer", "ENGLISH_STOP_WORDS"] diff --git a/sentence_transformers/quantization.py b/sentence_transformers/quantization.py index d958b1a34..8750b974b 100644 --- a/sentence_transformers/quantization.py +++ b/sentence_transformers/quantization.py @@ -1,10 +1,9 @@ -import time -from torch import Tensor -from typing import List, Literal, Tuple, TYPE_CHECKING -import numpy as np import logging -from typing import Dict, Optional, Union +import time +from typing import TYPE_CHECKING, Dict, List, Literal, Optional, Tuple, Union +import numpy as np +from torch import Tensor logger = logging.getLogger(__name__) @@ -255,8 +254,8 @@ def semantic_search_usearch( The list of search results is in the format: [[{"corpus_id": int, "score": float}, ...], ...] The time taken for the search is a float value. """ - from usearch.index import Index from usearch.compiled import ScalarKind + from usearch.index import Index if corpus_embeddings is not None and corpus_index is not None: raise ValueError("Only corpus_embeddings or corpus_index should be used, not both.") diff --git a/sentence_transformers/readers/InputExample.py b/sentence_transformers/readers/InputExample.py index 1e0f6bbd2..7266159e3 100644 --- a/sentence_transformers/readers/InputExample.py +++ b/sentence_transformers/readers/InputExample.py @@ -1,4 +1,4 @@ -from typing import Union, List +from typing import List, Union class InputExample: diff --git a/sentence_transformers/readers/LabelSentenceReader.py b/sentence_transformers/readers/LabelSentenceReader.py index 70b28c7ef..82aefedb7 100644 --- a/sentence_transformers/readers/LabelSentenceReader.py +++ b/sentence_transformers/readers/LabelSentenceReader.py @@ -1,6 +1,7 @@ -from . import InputExample import os +from . 
import InputExample + class LabelSentenceReader: """Reads in a file that has at least two columns: a label and a sentence. diff --git a/sentence_transformers/readers/NLIDataReader.py b/sentence_transformers/readers/NLIDataReader.py index 2d78a5a8f..ce359d6f5 100644 --- a/sentence_transformers/readers/NLIDataReader.py +++ b/sentence_transformers/readers/NLIDataReader.py @@ -1,7 +1,8 @@ -from . import InputExample import gzip import os +from . import InputExample + class NLIDataReader(object): """Reads in the Stanford NLI dataset and the MultiGenre NLI dataset""" diff --git a/sentence_transformers/readers/PairedFilesReader.py b/sentence_transformers/readers/PairedFilesReader.py index 2a1c16495..157ac5cbe 100644 --- a/sentence_transformers/readers/PairedFilesReader.py +++ b/sentence_transformers/readers/PairedFilesReader.py @@ -1,6 +1,7 @@ -from . import InputExample import gzip +from . import InputExample + class PairedFilesReader(object): """Reads in a Pair Dataset, split into two files""" diff --git a/sentence_transformers/readers/STSDataReader.py b/sentence_transformers/readers/STSDataReader.py index e9a6e7600..6c9533989 100644 --- a/sentence_transformers/readers/STSDataReader.py +++ b/sentence_transformers/readers/STSDataReader.py @@ -1,8 +1,9 @@ -from . import InputExample import csv import gzip import os +from . import InputExample + class STSDataReader: """Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx) diff --git a/sentence_transformers/readers/TripletReader.py b/sentence_transformers/readers/TripletReader.py index 99e1ff0f2..be32ebd9b 100644 --- a/sentence_transformers/readers/TripletReader.py +++ b/sentence_transformers/readers/TripletReader.py @@ -1,7 +1,8 @@ -from . import InputExample import csv import os +from . 
import InputExample + class TripletReader(object): """Reads in the a Triplet Dataset: Each line contains (at least) 3 columns, one anchor column (s1), diff --git a/sentence_transformers/readers/__init__.py b/sentence_transformers/readers/__init__.py index f9b956cb8..fb2add55a 100644 --- a/sentence_transformers/readers/__init__.py +++ b/sentence_transformers/readers/__init__.py @@ -1,7 +1,7 @@ from .InputExample import InputExample from .LabelSentenceReader import LabelSentenceReader from .NLIDataReader import NLIDataReader -from .STSDataReader import STSDataReader, STSBenchmarkDataReader +from .STSDataReader import STSBenchmarkDataReader, STSDataReader from .TripletReader import TripletReader __all__ = [ diff --git a/sentence_transformers/sampler.py b/sentence_transformers/sampler.py index e717d21ec..d2cec4cee 100644 --- a/sentence_transformers/sampler.py +++ b/sentence_transformers/sampler.py @@ -1,11 +1,12 @@ +import logging from collections import defaultdict from itertools import accumulate, cycle from typing import List -import logging -from datasets import Dataset -from torch.utils.data import BatchSampler, SubsetRandomSampler, ConcatDataset import torch +from torch.utils.data import BatchSampler, ConcatDataset, SubsetRandomSampler + +from datasets import Dataset logger = logging.getLogger(__name__) diff --git a/sentence_transformers/similarity_functions.py b/sentence_transformers/similarity_functions.py index 589d0404a..753970683 100644 --- a/sentence_transformers/similarity_functions.py +++ b/sentence_transformers/similarity_functions.py @@ -3,15 +3,16 @@ from numpy import ndarray from torch import Tensor + from .util import ( cos_sim, - manhattan_sim, - euclidean_sim, dot_score, + euclidean_sim, + manhattan_sim, pairwise_cos_sim, - pairwise_manhattan_sim, - pairwise_euclidean_sim, pairwise_dot_score, + pairwise_euclidean_sim, + pairwise_manhattan_sim, ) diff --git a/sentence_transformers/trainer.py b/sentence_transformers/trainer.py index 0620d5f38..a30c3e6e5 100644 --- a/sentence_transformers/trainer.py +++ b/sentence_transformers/trainer.py @@ -1,30 +1,25 @@ -from contextlib import nullcontext import logging import os -from typing import Any, Callable, Dict, List, Optional, Tuple, Union, TYPE_CHECKING import warnings +from contextlib import nullcontext +from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union import torch from torch import nn -from torch.utils.data import DataLoader, ConcatDataset, Dataset, BatchSampler, SubsetRandomSampler -from transformers import PreTrainedTokenizerBase, Trainer, EvalPrediction, TrainerCallback +from torch.utils.data import BatchSampler, ConcatDataset, DataLoader, Dataset, SubsetRandomSampler +from transformers import EvalPrediction, PreTrainedTokenizerBase, Trainer, TrainerCallback +from transformers.data.data_collator import DataCollator from transformers.integrations import WandbCallback from transformers.trainer import TRAINING_ARGS_NAME +from transformers.trainer_utils import EvalLoopOutput from transformers.training_args import ParallelMode from datasets import DatasetDict -from transformers.trainer_utils import EvalLoopOutput -from transformers.data.data_collator import DataCollator -from sentence_transformers.losses import CoSENTLoss - -from sentence_transformers.models.Transformer import Transformer -from sentence_transformers.training_args import ( - SentenceTransformerTrainingArguments, - BatchSamplers, - MultiDatasetBatchSamplers, -) from sentence_transformers.data_collator import 
SentenceTransformerDataCollator -from sentence_transformers.evaluation import SentenceEvaluator +from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator +from sentence_transformers.losses.CoSENTLoss import CoSENTLoss +from sentence_transformers.model_card import ModelCardCallback +from sentence_transformers.models.Transformer import Transformer from sentence_transformers.sampler import ( DefaultBatchSampler, GroupByLabelBatchSampler, @@ -32,10 +27,13 @@ ProportionalBatchSampler, RoundRobinBatchSampler, ) +from sentence_transformers.training_args import ( + BatchSamplers, + MultiDatasetBatchSamplers, + SentenceTransformerTrainingArguments, +) from sentence_transformers.util import disable_logging -from sentence_transformers.model_card import ModelCardCallback - logger = logging.getLogger(__name__) if TYPE_CHECKING: diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py index 5f4386768..326c649a6 100644 --- a/sentence_transformers/training_args.py +++ b/sentence_transformers/training_args.py @@ -1,5 +1,6 @@ from dataclasses import dataclass, field from typing import Union + from transformers import TrainingArguments as TransformersTrainingArguments from transformers.utils import ExplicitEnum diff --git a/sentence_transformers/util.py b/sentence_transformers/util.py index 30dbeaa61..6b9cef844 100644 --- a/sentence_transformers/util.py +++ b/sentence_transformers/util.py @@ -1,21 +1,20 @@ -from contextlib import contextmanager import functools -import requests -from torch import Tensor, device -from typing import List, Callable, Literal, overload -from tqdm.autonotebook import tqdm -import sys +import heapq import importlib +import logging import os -import torch -import numpy as np import queue -import logging -from typing import Dict, Optional, Union +import sys +from contextlib import contextmanager +from typing import Callable, Dict, List, Literal, Optional, Union, overload +import numpy as np +import requests +import torch +from huggingface_hub import hf_hub_download, snapshot_download +from torch import Tensor, device +from tqdm.autonotebook import tqdm from transformers import is_torch_npu_available -from huggingface_hub import snapshot_download, hf_hub_download -import heapq logger = logging.getLogger(__name__) diff --git a/setup.py b/setup.py index 86b8717bd..922dcf7ae 100644 --- a/setup.py +++ b/setup.py @@ -1,4 +1,4 @@ -from setuptools import setup, find_packages +from setuptools import find_packages, setup with open("README.md", mode="r", encoding="utf-8") as readme_file: readme = readme_file.read() diff --git a/tests/conftest.py b/tests/conftest.py index 05609b7a9..acd2870da 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -1,11 +1,12 @@ import os import platform import tempfile + import pytest -from sentence_transformers import SentenceTransformer, CrossEncoder -from sentence_transformers.models import Transformer, Pooling -from datasets import load_dataset, DatasetDict +from datasets import DatasetDict, load_dataset +from sentence_transformers import CrossEncoder, SentenceTransformer +from sentence_transformers.models import Pooling, Transformer @pytest.fixture() diff --git a/tests/test_cmnrl.py b/tests/test_cmnrl.py index 9967c51af..3d47b4b02 100644 --- a/tests/test_cmnrl.py +++ b/tests/test_cmnrl.py @@ -1,11 +1,13 @@ from contextlib import nullcontext from typing import List + import pytest -from sentence_transformers import SentenceTransformer, InputExample, losses -import tqdm -from transformers import 
set_seed import torch +import tqdm from torch.optim import Adam +from transformers import set_seed + +from sentence_transformers import InputExample, SentenceTransformer, losses @pytest.mark.parametrize( diff --git a/tests/test_cross_encoder.py b/tests/test_cross_encoder.py index 2a7f16d5b..cdc442a16 100644 --- a/tests/test_cross_encoder.py +++ b/tests/test_cross_encoder.py @@ -5,8 +5,9 @@ import csv import gzip import os -from pathlib import Path import tempfile +from pathlib import Path +from typing import Generator, List, Tuple import pytest import torch @@ -15,7 +16,6 @@ from sentence_transformers import CrossEncoder, util from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator from sentence_transformers.readers import InputExample -from typing import Generator, List, Tuple @pytest.fixture() diff --git a/tests/test_image_embeddings.py b/tests/test_image_embeddings.py index 581ab3ffd..d684e258f 100644 --- a/tests/test_image_embeddings.py +++ b/tests/test_image_embeddings.py @@ -6,7 +6,7 @@ from PIL import Image -from sentence_transformers import util, SentenceTransformer +from sentence_transformers import SentenceTransformer, util def test_simple_encode(clip_vit_b_32_model: SentenceTransformer) -> None: diff --git a/tests/test_model_card_data.py b/tests/test_model_card_data.py index 3c0a0f06a..434fc081c 100644 --- a/tests/test_model_card_data.py +++ b/tests/test_model_card_data.py @@ -1,7 +1,7 @@ -from sentence_transformers import SentenceTransformer - import pytest +from sentence_transformers import SentenceTransformer + @pytest.mark.parametrize( ("revision", "expected_base_revision"), diff --git a/tests/test_multi_process.py b/tests/test_multi_process.py index 624ca4e89..a1deef2f9 100644 --- a/tests/test_multi_process.py +++ b/tests/test_multi_process.py @@ -2,9 +2,10 @@ Computes embeddings """ +from typing import Optional + import numpy as np import pytest -from typing import Optional from sentence_transformers import SentenceTransformer diff --git a/tests/test_sentence_transformer.py b/tests/test_sentence_transformer.py index ca3b15433..0a6479701 100644 --- a/tests/test_sentence_transformer.py +++ b/tests/test_sentence_transformer.py @@ -2,23 +2,22 @@ Tests general behaviour of the SentenceTransformer class """ -from functools import partial import json import logging import os -from pathlib import Path import re import tempfile +from functools import partial +from pathlib import Path from typing import Dict, List, Literal, Optional, Union, cast import numpy as np import pytest - -from huggingface_hub import HfApi, RepoUrl, GitRefs, GitRefInfo import torch -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import Normalize, Transformer, Pooling -from sentence_transformers import util +from huggingface_hub import GitRefInfo, GitRefs, HfApi, RepoUrl + +from sentence_transformers import SentenceTransformer, util +from sentence_transformers.models import Normalize, Pooling, Transformer from sentence_transformers.similarity_functions import SimilarityFunction diff --git a/tests/test_trainer.py b/tests/test_trainer.py index 8d8c123af..8d2de1b48 100644 --- a/tests/test_trainer.py +++ b/tests/test_trainer.py @@ -1,9 +1,11 @@ -from pathlib import Path import re import tempfile +from pathlib import Path + import pytest -from sentence_transformers import SentenceTransformerTrainer, SentenceTransformer, losses + from datasets import DatasetDict +from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, 
losses def test_trainer_multi_dataset_errors( From 5db04cbf0533c397cd7af84732385c0d3cdbf7bf Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Fri, 24 May 2024 15:00:40 +0200 Subject: [PATCH 23/39] [`v3`] Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy (#2673) * Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy * Reformat --- sentence_transformers/fit_mixin.py | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py index 62e688ff6..6af7dc149 100644 --- a/sentence_transformers/fit_mixin.py +++ b/sentence_transformers/fit_mixin.py @@ -8,6 +8,7 @@ import numpy as np import torch import transformers +from packaging import version from torch import Tensor, nn from torch.optim import Optimizer from torch.utils.data import DataLoader @@ -297,6 +298,12 @@ def _default_checkpoint_dir() -> str: ) steps_per_epoch = None + # Transformers renamed `evaluation_strategy` to `eval_strategy` in v4.41.0 + eval_strategy_key = ( + "eval_strategy" + if version.parse(transformers.__version__) >= version.parse("4.41.0") + else "evaluation_strategy" + ) args = SentenceTransformerTrainingArguments( output_dir=checkpoint_path or _default_checkpoint_dir(), batch_sampler=batch_sampler, @@ -305,7 +312,9 @@ def _default_checkpoint_dir() -> str: per_device_eval_batch_size=batch_size, num_train_epochs=epochs, max_steps=max_steps, - evaluation_strategy="steps" if evaluation_steps is not None and evaluation_steps > 0 else "no", + **{ + eval_strategy_key: "steps" if evaluation_steps is not None and evaluation_steps > 0 else "no", + }, eval_steps=evaluation_steps, # load_best_model_at_end=save_best_model, # <- TODO: Look into a good solution for save_best_model max_grad_norm=max_grad_norm, From 7177c4867ac37cfb1294b09e70a0c4e7852b37f1 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Fri, 24 May 2024 18:22:14 +0200 Subject: [PATCH 24/39] [`v3`] Add various useful Sphinx packages (copy code, link to code, nicer tabs) (#2674) * No longer hide toctrees in API Reference * Add linkcode support It's not perfect, as it'll always link to 'master', but it'll do pretty nicely for the most part. * Add copy button to all code blocks * Add nicer tabs * Reformatted --- docs/conf.py | 45 +++- .../package_reference/cross_encoder/index.rst | 1 - .../sentence_transformer/index.rst | 1 - docs/requirements.txt | 9 +- .../training/distributed.rst | 28 +- .../sentence_transformer/training_overview.md | 248 +++++++++--------- 6 files changed, 187 insertions(+), 145 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index 1d3ad9bb0..f7fe0882f 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -15,6 +15,8 @@ # sys.path.insert(0, os.path.abspath('.')) import datetime +import importlib +import inspect import os from recommonmark.transform import AutoStructify @@ -22,7 +24,7 @@ # -- Project information ----------------------------------------------------- -project = "Sentence-Transformers" +project = "Sentence Transformers" copyright = str(datetime.datetime.now().year) author = "Nils Reimers, Tom Aarsen" @@ -37,8 +39,10 @@ "sphinx.ext.autodoc", "recommonmark", "sphinx_markdown_tables", + "sphinx_copybutton", "sphinx.ext.intersphinx", - "sphinx_tabs.tabs", + "sphinx.ext.linkcode", + "sphinx_inline_tabs", ] # Add any paths that contain templates here, relative to this directory. 
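Aside: the `evaluation_strategy` workaround in PATCH 23 above picks the keyword *name* at runtime and passes it via dict unpacking. A minimal, self-contained sketch of that pattern, using a hypothetical `build_training_kwargs` helper in place of the real `SentenceTransformerTrainingArguments` call:

```python
import transformers
from packaging import version


def build_training_kwargs(evaluation_steps=None, **kwargs):
    # Hypothetical helper, standing in for the SentenceTransformerTrainingArguments
    # call in fit_mixin.py. transformers v4.41.0 renamed `evaluation_strategy` to
    # `eval_strategy`, so the keyword name must be chosen at runtime.
    eval_strategy_key = (
        "eval_strategy"
        if version.parse(transformers.__version__) >= version.parse("4.41.0")
        else "evaluation_strategy"
    )
    # Unpacking `**{key: value}` lets a keyword whose name is only known at
    # runtime be passed to an ordinary keyword-argument API, so the call site
    # needs no version branching.
    kwargs.update(
        {eval_strategy_key: "steps" if evaluation_steps is not None and evaluation_steps > 0 else "no"}
    )
    return kwargs


print(build_training_kwargs(evaluation_steps=500))
# On transformers >= 4.41.0: {'eval_strategy': 'steps'}
```

This avoids the deprecation warning on new transformers versions while remaining compatible with older ones that only accept `evaluation_strategy`.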
@@ -108,6 +112,43 @@ autoclass_content = "both" +# https://github.com/readthedocs/sphinx-autoapi/issues/202#issuecomment-907582382 +def linkcode_resolve(domain, info): + # Non-linkable objects from the starter kit in the tutorial. + if domain == "js" or info["module"] == "connect4": + return + + assert domain == "py", "expected only Python objects" + + mod = importlib.import_module(info["module"]) + if "." in info["fullname"]: + objname, attrname = info["fullname"].split(".") + obj = getattr(mod, objname) + try: + # object is a method of a class + obj = getattr(obj, attrname) + except AttributeError: + # object is an attribute of a class + return None + else: + obj = getattr(mod, info["fullname"]) + obj = inspect.unwrap(obj) + + try: + file = inspect.getsourcefile(obj) + lines = inspect.getsourcelines(obj) + except TypeError: + # e.g. object is a typing.Union + return None + file = os.path.relpath(file, os.path.abspath("..")) + if not file.startswith("sentence_transformers"): + # e.g. object is a typing.NewType + return None + start, end = lines[1], lines[1] + len(lines[0]) - 1 + + return f"https://github.com/UKPLab/sentence-transformers/blob/master/{file}#L{start}-L{end}" + + class GithubURLDomain(Domain): """ Resolve .py links to their respective Github URL diff --git a/docs/package_reference/cross_encoder/index.rst b/docs/package_reference/cross_encoder/index.rst index f27406944..81dc3bc41 100644 --- a/docs/package_reference/cross_encoder/index.rst +++ b/docs/package_reference/cross_encoder/index.rst @@ -3,7 +3,6 @@ Cross Encoder ============= .. toctree:: - :hidden: cross_encoder evaluation \ No newline at end of file diff --git a/docs/package_reference/sentence_transformer/index.rst b/docs/package_reference/sentence_transformer/index.rst index 063ed31e1..0e724e78b 100644 --- a/docs/package_reference/sentence_transformer/index.rst +++ b/docs/package_reference/sentence_transformer/index.rst @@ -3,7 +3,6 @@ Sentence Transformer ==================== .. toctree:: - :hidden: SentenceTransformer trainer diff --git a/docs/requirements.txt b/docs/requirements.txt index bbc151601..312d2e2eb 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,8 +1,9 @@ # Must use Python 3.8! -sphinx<4 -Jinja2<3.1 -sphinx_markdown_tables +sphinx==3.5.4 +Jinja2==3.0.3 +sphinx_markdown_tables==0.0.17 recommonmark==0.7.1 -sphinx-tabs==3.4.5 +sphinx-copybutton==0.5.2 +sphinx_inline_tabs==2023.4.21 -e .. \ No newline at end of file diff --git a/docs/sentence_transformer/training/distributed.rst b/docs/sentence_transformer/training/distributed.rst index fc6e78138..f2ade01cf 100644 --- a/docs/sentence_transformer/training/distributed.rst +++ b/docs/sentence_transformer/training/distributed.rst @@ -10,23 +10,29 @@ Sentence Transformers implements two forms of distributed training: Data Paralle In short, **DDP is generally recommended**. You can use DDP by running your normal training scripts with ``torchrun`` or ``accelerate``. For example, if you have a script called ``train_script.py``, you can run it with DDP using the following command: -.. tabs:: +.. |br| raw:: html - .. tab:: Via ``torchrun`` +
+     <br/>
    - - `torchrun documentation `_ +.. tab:: Via ``torchrun`` - :: + |br| - torchrun --nproc_per_node=4 train_script.py - - .. tab:: Via ``accelerate`` + - `torchrun documentation `_ - - `accelerate documentation `_ + :: - :: - - accelerate launch --num_processes 4 train_script.py + torchrun --nproc_per_node=4 train_script.py + +.. tab:: Via ``accelerate`` + + |br| + + - `accelerate documentation `_ + + :: + + accelerate launch --num_processes 4 train_script.py .. note:: diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md index def0a1759..4f55bfba7 100644 --- a/docs/sentence_transformer/training_overview.md +++ b/docs/sentence_transformer/training_overview.md @@ -45,102 +45,100 @@ Training Sentence Transformer models involves between 3 to 5 components: ```eval_rst The :class:`SentenceTransformerTrainer` trains and evaluates using :class:`datasets.Dataset` (one dataset) or :class:`datasets.DatasetDict` instances (multiple datasets, see also `Multi-dataset training <#multi-dataset-training>`_). -.. tabs:: +.. tab:: Data on 🤗 Hugging Face Hub - .. tab:: Data on 🤗 Hugging Face Hub + If you want to load data from the `Hugging Face Datasets `_, then you should use :func:`datasets.load_dataset`: - If you want to load data from the `Hugging Face Datasets `_, then you should use :func:`datasets.load_dataset`: + .. raw:: html - .. raw:: html + - + :: - :: + from datasets import load_dataset - from datasets import load_dataset + train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train") + eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev") - train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train") - eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev") + print(train_dataset) + """ + Dataset({ + features: ['premise', 'hypothesis', 'label'], + num_rows: 942069 + }) + """ - print(train_dataset) - """ - Dataset({ - features: ['premise', 'hypothesis', 'label'], - num_rows: 942069 - }) - """ + Some datasets (including `sentence-transformers/all-nli `_) require you to provide a "subset" alongside the dataset name. ``sentence-transformers/all-nli`` has 4 subsets, each with different data formats: `pair `_, `pair-class `_, `pair-score `_, `triplet `_. - Some datasets (including `sentence-transformers/all-nli `_) require you to provide a "subset" alongside the dataset name. ``sentence-transformers/all-nli`` has 4 subsets, each with different data formats: `pair `_, `pair-class `_, `pair-score `_, `triplet `_. + .. note:: - .. note:: + Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them by browsing to `https://huggingface.co/datasets?other=sentence-transformers `_. We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks. - Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them by browsing to `https://huggingface.co/datasets?other=sentence-transformers `_. We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks. +.. tab:: Local Data (CSV, JSON, Parquet, Arrow, SQL) - .. 
tab:: Local Data (CSV, JSON, Parquet, Arrow, SQL) + If you have local data in common file-formats, then you can load this data easily using :func:`datasets.load_dataset`: - If you have local data in common file-formats, then you can load this data easily using :func:`datasets.load_dataset`: + .. raw:: html - .. raw:: html + - + :: - :: + from datasets import load_dataset - from datasets import load_dataset - - dataset = load_dataset("csv", data_files="my_file.csv") - - or:: + dataset = load_dataset("csv", data_files="my_file.csv") + + or:: - from datasets import load_dataset + from datasets import load_dataset - dataset = load_dataset("json", data_files="my_file.json") + dataset = load_dataset("json", data_files="my_file.json") - .. tab:: Local Data that requires pre-processing +.. tab:: Local Data that requires pre-processing - .. sidebar:: Documentation + .. sidebar:: Documentation - - :meth:`datasets.Dataset.from_dict` + - :meth:`datasets.Dataset.from_dict` - If you have local data that requires some extra pre-processing, my recommendation is to initialize your dataset using :meth:`datasets.Dataset.from_dict` and a dictionary of lists, like so: + If you have local data that requires some extra pre-processing, my recommendation is to initialize your dataset using :meth:`datasets.Dataset.from_dict` and a dictionary of lists, like so: - .. raw:: html + .. raw:: html - + - :: + :: - from datasets import Dataset + from datasets import Dataset - sentence1_list = [] - sentence2_list = [] - # Open a file, do preprocessing, filtering, cleaning, etc. - # and append to the lists + sentence1_list = [] + sentence2_list = [] + # Open a file, do preprocessing, filtering, cleaning, etc. + # and append to the lists - dataset = Dataset.from_dict({ - "sentence1": sentence1_list, - "sentence2": sentence2_list, - }) + dataset = Dataset.from_dict({ + "sentence1": sentence1_list, + "sentence2": sentence2_list, + }) - Each key from the dictionary will become a column in the resulting dataset. + Each key from the dictionary will become a column in the resulting dataset. ``` @@ -298,72 +296,70 @@ Additionally, :class:`~sentence_transformers.evaluation.SequentialEvaluator` sho Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face. -.. tabs:: - - .. tab:: EmbeddingSimilarityEvaluator with STSb +.. tab:: EmbeddingSimilarityEvaluator with STSb - .. raw:: html + .. raw:: html - + - :: + :: - from datasets import load_dataset - from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction + from datasets import load_dataset + from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction - # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb) - eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") + # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb) + eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") - # Initialize the evaluator - dev_evaluator = EmbeddingSimilarityEvaluator( - sentences1=eval_dataset["sentence1"], - sentences2=eval_dataset["sentence2"], - scores=eval_dataset["score"], - main_similarity=SimilarityFunction.COSINE, - name="sts-dev", - ) - # You can run evaluation like so: - # dev_evaluator(model) - - .. 
tab:: TripletEvaluator with AllNLI - - .. raw:: html - - - - :: - - from datasets import load_dataset - from sentence_transformers.evaluation import TripletEvaluator, SimilarityFunction - - # Load triplets from the AllNLI dataset (https://huggingface.co/datasets/sentence-transformers/all-nli) - max_samples = 1000 - eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]") - - # Initialize the evaluator - dev_evaluator = TripletEvaluator( - anchors=eval_dataset["anchor"], - positives=eval_dataset["positive"], - negatives=eval_dataset["negative"], - main_distance_function=SimilarityFunction.COSINE, - name="all-nli-dev", - ) - # You can run evaluation like so: - # dev_evaluator(model) + # Initialize the evaluator + dev_evaluator = EmbeddingSimilarityEvaluator( + sentences1=eval_dataset["sentence1"], + sentences2=eval_dataset["sentence2"], + scores=eval_dataset["score"], + main_similarity=SimilarityFunction.COSINE, + name="sts-dev", + ) + # You can run evaluation like so: + # dev_evaluator(model) + +.. tab:: TripletEvaluator with AllNLI + + .. raw:: html + + + + :: + + from datasets import load_dataset + from sentence_transformers.evaluation import TripletEvaluator, SimilarityFunction + + # Load triplets from the AllNLI dataset (https://huggingface.co/datasets/sentence-transformers/all-nli) + max_samples = 1000 + eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]") + + # Initialize the evaluator + dev_evaluator = TripletEvaluator( + anchors=eval_dataset["anchor"], + positives=eval_dataset["positive"], + negatives=eval_dataset["negative"], + main_distance_function=SimilarityFunction.COSINE, + name="all-nli-dev", + ) + # You can run evaluation like so: + # dev_evaluator(model) ``` ## Trainer From fdd31eebd873f6aabf78d77ed0891f5f3aa567af Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Fri, 24 May 2024 18:29:26 +0200 Subject: [PATCH 25/39] [`v3`] Make the "primary_metric" for evaluators a bit more robust (#2675) * Make the "primary_metric" for evaluators a bit more robust * Also remove some other TODOs that are not very important or already done --- sentence_transformers/evaluation/SentenceEvaluator.py | 11 ++++++++++- sentence_transformers/fit_mixin.py | 4 +--- sentence_transformers/model_card.py | 4 +--- tests/test_sentence_transformer.py | 1 - 4 files changed, 12 insertions(+), 8 deletions(-) diff --git a/sentence_transformers/evaluation/SentenceEvaluator.py b/sentence_transformers/evaluation/SentenceEvaluator.py index a3a2497ed..9d0646556 100644 --- a/sentence_transformers/evaluation/SentenceEvaluator.py +++ b/sentence_transformers/evaluation/SentenceEvaluator.py @@ -13,8 +13,17 @@ class SentenceEvaluator: """ def __init__(self): + """ + Base class for all evaluators. Notably, this class introduces the ``greater_is_better`` and ``primary_metric`` + attributes. The former is a boolean indicating whether a higher evaluation score is better, which is used + for choosing the best checkpoint if ``load_best_model_at_end`` is set to ``True`` in the training arguments. + + The latter is a string indicating the primary metric for the evaluator. This has to be defined whenever + the evaluator returns a dictionary of metrics, and the primary metric is the key pointing to the primary + metric, i.e. the one that is used for model selection and/or logging. 
+ """ self.greater_is_better = True - # TODO: Add better `primary_metrics` support + self.primary_metric = None def __call__( self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1 diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py index 6af7dc149..099cd2042 100644 --- a/sentence_transformers/fit_mixin.py +++ b/sentence_transformers/fit_mixin.py @@ -53,7 +53,6 @@ def __init__(self, output_dir: str, evaluator: Optional[SentenceEvaluator], save super().__init__() self.output_dir = output_dir self.evaluator = evaluator - # TODO: ^ has to implement `greater_is_better` and `primary_metric` self.save_best_model = save_best_model self.best_metric = None @@ -245,7 +244,7 @@ def identity(batch): batch_size = 8 batch_sampler = BatchSamplers.BATCH_SAMPLER # Convert dataloaders into a DatasetDict - # TODO: This should be done in a more efficient way + # TODO: This is rather inefficient, as we load all data into memory. We might benefit from a more efficient solution train_dataset_dict = {} for loader_idx, data_loader in enumerate(data_loaders, start=1): if isinstance(data_loader, NoDuplicatesDataLoader): @@ -284,7 +283,6 @@ def _default_checkpoint_dir() -> str: # Convert loss_fns into a dict with `dataset_{idx}` keys loss_fn_dict = {f"_dataset_{idx}": loss_fn for idx, loss_fn in enumerate(loss_fns, start=1)} - # TODO: Test model checkpointing & loading # Use steps_per_epoch to perhaps set max_steps max_steps = -1 diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py index 7488d006d..e3db3b975 100644 --- a/sentence_transformers/model_card.py +++ b/sentence_transformers/model_card.py @@ -340,7 +340,6 @@ def validate_datasets(self, dataset_list, infer_languages: bool = True) -> None: ) del dataset["id"] else: - # TODO: Perhaps we can try to infer the dataset name from the dataset card if info.cardData and infer_languages and "language" in info.cardData: dataset_language = info.cardData.get("language") if dataset_language is None: @@ -441,8 +440,7 @@ def set_evaluation_metrics(self, evaluator: "SentenceEvaluator", metrics: Dict[s self.eval_results_dict[evaluator] = copy(metrics) # If the evaluator has a primary metric and we have a trainer, then add the primary metric to the training logs - if hasattr(evaluator, "primary_metric"): - primary_metrics = evaluator.primary_metric + if hasattr(evaluator, "primary_metric") and (primary_metrics := evaluator.primary_metric): if isinstance(evaluator, SequentialEvaluator): primary_metrics = [sub_evaluator.primary_metric for sub_evaluator in evaluator.evaluators] elif isinstance(primary_metrics, str): diff --git a/tests/test_sentence_transformer.py b/tests/test_sentence_transformer.py index 0a6479701..233be3f8c 100644 --- a/tests/test_sentence_transformer.py +++ b/tests/test_sentence_transformer.py @@ -461,7 +461,6 @@ def test(model: SentenceTransformer, expected_dim: int): embeddings = outputs["sentence_embedding"] else: outputs = cast(List[Dict[str, torch.Tensor]], outputs) - # TODO: can overload model.encode if ppl want type checker compatibility embeddings = [out_features["sentence_embedding"] for out_features in outputs] else: embeddings = outputs From 6b86c3e3aa444faf20f55d412ef3b12785953ac7 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Sun, 26 May 2024 10:07:08 +0200 Subject: [PATCH 26/39] Set `broadcast_buffers = False` when training with DDP (#2663) --- sentence_transformers/training_args.py | 4 ++++ 1 file 
changed, 4 insertions(+) diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py index 326c649a6..803bb4d85 100644 --- a/sentence_transformers/training_args.py +++ b/sentence_transformers/training_args.py @@ -74,3 +74,7 @@ def __post_init__(self): # The `compute_loss` method in `SentenceTransformerTrainer` is overridden to only compute the prediction loss, # so we set `prediction_loss_only` to `True` here to avoid self.prediction_loss_only = True + + # Disable broadcasting of buffers to avoid `RuntimeError: one of the variables needed for gradient computation + # has been modified by an inplace operation.` when training with DDP & a BertModel-based model. + self.ddp_broadcast_buffers = False From 8e5b1fe3a4cb04845de4a0e83f11565655c294fb Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 11:44:18 +0200 Subject: [PATCH 27/39] [`v3`] Warn about using DP instead of DDP + set dataloader_drop_last with DDP (#2677) * Warn about using DP instead of DDP + set dataloader_drop_last with DDP * Prevent duplicate warnings * Remove note, done automatically now * Avoid inequality comparison to True --- .../training/distributed.rst | 4 ---- sentence_transformers/training_args.py | 23 +++++++++++++++++++ 2 files changed, 23 insertions(+), 4 deletions(-) diff --git a/docs/sentence_transformer/training/distributed.rst b/docs/sentence_transformer/training/distributed.rst index f2ade01cf..3753dd852 100644 --- a/docs/sentence_transformer/training/distributed.rst +++ b/docs/sentence_transformer/training/distributed.rst @@ -47,10 +47,6 @@ In short, **DDP is generally recommended**. You can use DDP by running your norm if __name__ == "__main__": main() -.. note:: - - When using DDP, using ``dataloader_drop_last=True`` in :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` is recommended, as the training may halt at the last (incomplete) training batch otherwise. - Comparison ---------- diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py index 803bb4d85..4aefcd426 100644 --- a/sentence_transformers/training_args.py +++ b/sentence_transformers/training_args.py @@ -1,9 +1,13 @@ +import logging from dataclasses import dataclass, field from typing import Union from transformers import TrainingArguments as TransformersTrainingArguments +from transformers.training_args import ParallelMode from transformers.utils import ExplicitEnum +logger = logging.getLogger(__name__) + class BatchSamplers(ExplicitEnum): """ @@ -78,3 +82,22 @@ def __post_init__(self): # Disable broadcasting of buffers to avoid `RuntimeError: one of the variables needed for gradient computation # has been modified by an inplace operation.` when training with DDP & a BertModel-based model. self.ddp_broadcast_buffers = False + + if self.parallel_mode == ParallelMode.NOT_DISTRIBUTED: + # If output_dir is "unused", then this instance is created to compare training arguments vs the defaults, + # so we don't have to warn. + if self.output_dir != "unused": + logger.warning( + "Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. " + "See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information." 
+ ) + + elif self.parallel_mode == ParallelMode.DISTRIBUTED and not self.dataloader_drop_last: + # If output_dir is "unused", then this instance is created to compare training arguments vs the defaults, + # so we don't have to warn. + if self.output_dir != "unused": + logger.warning( + "When using DistributedDataParallel (DDP), it is recommended to set `dataloader_drop_last=True` to avoid hanging issues with an uneven last batch. " + "Setting `dataloader_drop_last=True`." + ) + self.dataloader_drop_last = True From cf62248e8d9f898474543a53000c9efd42641434 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 12:03:58 +0200 Subject: [PATCH 28/39] [`v3`] Add warning that Evaluators only run on 1 GPU when multi-GPU training (#2678) * Add warning that Evaluators only run on 1 GPU when multi-GPU training * Also add a note in the distributed training docs --- docs/sentence_transformer/training/distributed.rst | 4 ++++ docs/sentence_transformer/training_overview.md | 6 +++++- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/docs/sentence_transformer/training/distributed.rst b/docs/sentence_transformer/training/distributed.rst index 3753dd852..6e2f6c5a3 100644 --- a/docs/sentence_transformer/training/distributed.rst +++ b/docs/sentence_transformer/training/distributed.rst @@ -47,6 +47,10 @@ In short, **DDP is generally recommended**. You can use DDP by running your norm if __name__ == "__main__": main() +.. note:: + + When using an `Evaluator <../training_overview.html#evaluator>`_, the evaluator only runs on the first device unlike the training and evaluation datasets, which are shared across all devices. + Comparison ---------- diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md index 4f55bfba7..72c0fcf0f 100644 --- a/docs/sentence_transformer/training_overview.md +++ b/docs/sentence_transformer/training_overview.md @@ -360,12 +360,16 @@ Sometimes you don't have the required evaluation data to prepare one of these ev ) # You can run evaluation like so: # dev_evaluator(model) + +.. warning:: + + When using `Distributed Training `_, the evaluator only runs on the first device, unlike the training and evaluation datasets, which are shared across all devices. ``` ## Trainer ```eval_rst -The :class:`sentence_transformers.SentenceTransformerTrainer` is where all previous components come together. We only have to specify the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, evaluator (optional) and we can start training. Let's have a look at a script where all of these components come together: +The :class:`~sentence_transformers.SentenceTransformerTrainer` is where all previous components come together. We only have to specify the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, evaluator (optional) and we can start training. Let's have a look at a script where all of these components come together: .. 
sidebar:: Documentation From 1a2a883011e65379dc2b8d325a22be2f6eb5935c Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 12:04:23 +0200 Subject: [PATCH 29/39] [`v3`] Move training dependencies into a "train" extra (#2676) * Move training dependencies into a "train" extra * Install the train extra with the CI tests * Simplify dev install: also include train deps there * Implement is_..._available in ST instead; add is_training_available --- .github/workflows/tests.yml | 2 +- docs/installation.md | 110 ++++++++++++++++++++++++---- docs/package_reference/util.md | 2 +- sentence_transformers/fit_mixin.py | 9 +-- sentence_transformers/model_card.py | 37 ++++++---- sentence_transformers/sampler.py | 9 ++- sentence_transformers/trainer.py | 31 +++++--- sentence_transformers/util.py | 21 ++++++ setup.py | 8 +- tests/conftest.py | 7 +- tests/test_train_stsb.py | 17 +++++ tests/test_trainer.py | 21 +++++- 12 files changed, 216 insertions(+), 58 deletions(-) diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index 7c66f1507..391d160ac 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -45,7 +45,7 @@ jobs: if: steps.restore-cache.outputs.cache-hit != 'true' - name: Install the checked-out sentence-transformers - run: python -m pip install . + run: python -m pip install .[train] - name: Run unit tests shell: bash diff --git a/docs/installation.md b/docs/installation.md index 2f694b475..84ce4c0e4 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,35 +1,115 @@ # Installation -We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**. +We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**. There are three options to install Sentence Transformers: +* **Default:** This allows for loading, saving, and inference (i.e., getting embeddings) of models. +* **Default and Training**: All of the above plus training. +* **Development**: All of the above plus some dependencies for developing Sentence Transformers, see [Editable Install](#editable-install). ## Install with pip -Install the *sentence-transformers* with `pip`: -``` -pip install -U sentence-transformers -``` +```eval_rst -## Install with conda +.. tab:: Default + + :: + + pip install -U sentence-transformers + +.. tab:: Default and Training + + :: + + pip install -U "sentence-transformers[train]" + + To use `Weights and Biases `_ to track your training logs, you should also install ``wandb`` **(recommended)**:: + + pip install wandb + + And to track your Carbon Emissions while training and have this information automatically included in your model cards, also install ``codecarbon`` **(recommended)**:: + + pip install codecarbon + +.. tab:: Development + + :: + + pip install -U "sentence-transformers[dev]" -Apple silicon installation of *sentence-transformers* -``` -conda install -c conda-forge sentence-transformers ``` -## Install from source +## Install with Conda + +```eval_rst + +.. tab:: Default + + :: + + conda install -c conda-forge sentence-transformers + +.. 
tab:: Default and Training + + :: + + conda install -c conda-forge sentence-transformers accelerate datasets + + To use `Weights and Biases `_ to track your training logs, you should also install ``wandb`` **(recommended)**:: + + pip install wandb + + And to track your Carbon Emissions while training and have this information automatically included in your model cards, also install ``codecarbon`` **(recommended)**:: + + pip install codecarbon + +.. tab:: Development + + :: + + conda install -c conda-forge sentence-transformers accelerate datasets pre-commit pytest ruff -You can install *sentence-transformers* directly from source to take advantage of the bleeding edge `master` branch rather than the latest stable release: ``` -pip install git+https://github.com/UKPLab/sentence-transformers + +## Install from Source + +You can install ``sentence-transformers`` directly from source to take advantage of the bleeding edge `master` branch rather than the latest stable release: + +```eval_rst + +.. tab:: Default + + :: + + pip install git+https://github.com/UKPLab/sentence-transformers.git + +.. tab:: Default and Training + + :: + + pip install -U "sentence-transformers[train] @ git+https://github.com/UKPLab/sentence-transformers.git" + + To use `Weights and Biases `_ to track your training logs, you should also install ``wandb`` **(recommended)**:: + + pip install wandb + + And to track your carbon emissions while training and have this information automatically included in your model cards, also install ``codecarbon`` **(recommended)**:: + + pip install codecarbon + +.. tab:: Development + + :: + + pip install -U "sentence-transformers[dev] @ git+https://github.com/UKPLab/sentence-transformers.git" + ``` -## Editable install +## Editable Install -If you want to make changes to *sentence-transformers*, you will need an editable install. Clone the repository and install it with these commands: +If you want to make changes to ``sentence-transformers``, you will need an editable install. Clone the repository and install it with these commands: ``` git clone https://github.com/UKPLab/sentence-transformers cd sentence-transformers -pip install -e . +pip install -e ".[train,dev]" ``` These commands will link the new `sentence-transformers` folder and your Python library paths, such that this folder will be used when importing `sentence-transformers`. diff --git a/docs/package_reference/util.md b/docs/package_reference/util.md index 690b4cd19..1b5e9e326 100644 --- a/docs/package_reference/util.md +++ b/docs/package_reference/util.md @@ -4,7 +4,7 @@ ## Helper Functions ```eval_rst .. 
automodule:: sentence_transformers.util - :members: paraphrase_mining, semantic_search, community_detection, http_get, truncate_embeddings, normalize_embeddings + :members: paraphrase_mining, semantic_search, community_detection, http_get, truncate_embeddings, normalize_embeddings, is_training_available ``` ## Similarity Metrics diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py index 099cd2042..b4ece8c57 100644 --- a/sentence_transformers/fit_mixin.py +++ b/sentence_transformers/fit_mixin.py @@ -15,7 +15,6 @@ from tqdm.autonotebook import trange from transformers import TrainerCallback, TrainerControl, TrainerState -from datasets import Dataset, DatasetDict from sentence_transformers.datasets.NoDuplicatesDataLoader import NoDuplicatesDataLoader from sentence_transformers.datasets.SentenceLabelDataset import SentenceLabelDataset from sentence_transformers.training_args import ( @@ -23,13 +22,13 @@ MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments, ) +from sentence_transformers.util import batch_to_device, fullname, is_datasets_available from .evaluation import SentenceEvaluator from .model_card_templates import ModelCardTemplate -from .util import ( - batch_to_device, - fullname, -) + +if is_datasets_available(): + from datasets import Dataset, DatasetDict logger = logging.getLogger(__name__) diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py index e3db3b975..68d2c0a44 100644 --- a/sentence_transformers/model_card.py +++ b/sentence_transformers/model_card.py @@ -24,11 +24,13 @@ from transformers.modelcard import make_markdown_table from transformers.trainer_callback import TrainerControl, TrainerState -from datasets import Dataset, DatasetDict from sentence_transformers import __version__ as sentence_transformers_version from sentence_transformers.models import Transformer from sentence_transformers.training_args import SentenceTransformerTrainingArguments -from sentence_transformers.util import cos_sim, fullname +from sentence_transformers.util import cos_sim, fullname, is_accelerate_available, is_datasets_available + +if is_datasets_available(): + from datasets import Dataset, DatasetDict logger = logging.getLogger(__name__) @@ -204,20 +206,25 @@ def on_log( def get_versions() -> Dict[str, Any]: - from accelerate import __version__ as accelerate_version - from tokenizers import __version__ as tokenizers_version - - from datasets import __version__ as datasets_version - - return { + versions = { "python": python_version(), "sentence_transformers": sentence_transformers_version, "transformers": transformers.__version__, "torch": torch.__version__, - "accelerate": accelerate_version, - "datasets": datasets_version, - "tokenizers": tokenizers_version, } + if is_accelerate_available(): + from accelerate import __version__ as accelerate_version + + versions["accelerate"] = accelerate_version + if is_datasets_available(): + from datasets import __version__ as datasets_version + + versions["datasets"] = datasets_version + from tokenizers import __version__ as tokenizers_version + + versions["tokenizers"] = tokenizers_version + + return versions @dataclass @@ -387,7 +394,7 @@ def join_list(losses: List[str]) -> str: def set_best_model_step(self, step: int) -> None: self.best_model_step = step - def set_widget_examples(self, dataset: Union[Dataset, DatasetDict]) -> None: + def set_widget_examples(self, dataset: Union["Dataset", "DatasetDict"]) -> None: if isinstance(dataset, Dataset): dataset = 
DatasetDict(dataset=dataset) @@ -465,7 +472,7 @@ def set_evaluation_metrics(self, evaluator: "SentenceEvaluator", metrics: Dict[s } ) - def set_label_examples(self, dataset: Dataset) -> None: + def set_label_examples(self, dataset: "Dataset") -> None: num_examples_per_label = 3 examples = defaultdict(list) finished_labels = set() @@ -487,7 +494,7 @@ def set_label_examples(self, dataset: Dataset) -> None: ] def infer_datasets( - self, dataset: Union[Dataset, DatasetDict], dataset_name: Optional[str] = None + self, dataset: Union["Dataset", "DatasetDict"], dataset_name: Optional[str] = None ) -> List[Dict[str, str]]: if isinstance(dataset, DatasetDict): return [ @@ -661,7 +668,7 @@ def to_html_list(data: dict): return dataset_info def extract_dataset_metadata( - self, dataset: Union[Dataset, DatasetDict], dataset_metadata, dataset_type: Literal["train", "eval"] + self, dataset: Union["Dataset", "DatasetDict"], dataset_metadata, dataset_type: Literal["train", "eval"] ) -> Dict[str, Any]: if dataset: if dataset_metadata and ( diff --git a/sentence_transformers/sampler.py b/sentence_transformers/sampler.py index d2cec4cee..7dad2dec2 100644 --- a/sentence_transformers/sampler.py +++ b/sentence_transformers/sampler.py @@ -6,7 +6,10 @@ import torch from torch.utils.data import BatchSampler, ConcatDataset, SubsetRandomSampler -from datasets import Dataset +from sentence_transformers.util import is_datasets_available + +if is_datasets_available(): + from datasets import Dataset logger = logging.getLogger(__name__) @@ -33,7 +36,7 @@ class DefaultBatchSampler(SetEpochMixin, BatchSampler): class GroupByLabelBatchSampler(SetEpochMixin, BatchSampler): def __init__( self, - dataset: Dataset, + dataset: "Dataset", batch_size: int, drop_last: bool, valid_label_columns: List[str] = None, @@ -89,7 +92,7 @@ def __iter__(self): class NoDuplicatesBatchSampler(SetEpochMixin, BatchSampler): def __init__( self, - dataset: Dataset, + dataset: "Dataset", batch_size: int, drop_last: bool, valid_label_columns: List[str] = [], diff --git a/sentence_transformers/trainer.py b/sentence_transformers/trainer.py index a30c3e6e5..6d6c35ecc 100644 --- a/sentence_transformers/trainer.py +++ b/sentence_transformers/trainer.py @@ -6,7 +6,7 @@ import torch from torch import nn -from torch.utils.data import BatchSampler, ConcatDataset, DataLoader, Dataset, SubsetRandomSampler +from torch.utils.data import BatchSampler, ConcatDataset, DataLoader, SubsetRandomSampler from transformers import EvalPrediction, PreTrainedTokenizerBase, Trainer, TrainerCallback from transformers.data.data_collator import DataCollator from transformers.integrations import WandbCallback @@ -14,7 +14,6 @@ from transformers.trainer_utils import EvalLoopOutput from transformers.training_args import ParallelMode -from datasets import DatasetDict from sentence_transformers.data_collator import SentenceTransformerDataCollator from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator from sentence_transformers.losses.CoSENTLoss import CoSENTLoss @@ -32,7 +31,10 @@ MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments, ) -from sentence_transformers.util import disable_logging +from sentence_transformers.util import disable_logging, is_datasets_available, is_training_available + +if is_datasets_available(): + from datasets import Dataset, DatasetDict logger = logging.getLogger(__name__) @@ -111,8 +113,8 @@ def __init__( self, model: Optional["SentenceTransformer"] = None, args: SentenceTransformerTrainingArguments = None, - 
train_dataset: Optional[Union[Dataset, DatasetDict, Dict[str, Dataset]]] = None, - eval_dataset: Optional[Union[Dataset, DatasetDict, Dict[str, Dataset]]] = None, + train_dataset: Optional[Union["Dataset", "DatasetDict", Dict[str, "Dataset"]]] = None, + eval_dataset: Optional[Union["Dataset", "DatasetDict", Dict[str, "Dataset"]]] = None, loss: Optional[ Union[ nn.Module, @@ -130,6 +132,13 @@ def __init__( optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None, ) -> None: + if not is_training_available(): + raise RuntimeError( + "To train a SentenceTransformer model, you need to install the `accelerate` and `datasets` modules. " + "You can do so with the `train` extra:\n" + 'pip install -U "sentence-transformers[train]"' + ) + if args is None: output_dir = "tmp_trainer" logger.info(f"No `TrainingArguments` passed, using `output_dir={output_dir}`.") @@ -260,7 +269,7 @@ def prepare_loss( return loss.to(model.device) return loss(model).to(model.device) - def add_dataset_name_column(self, dataset_dict: DatasetDict) -> DatasetDict: + def add_dataset_name_column(self, dataset_dict: "DatasetDict") -> "DatasetDict": for key, dataset in dataset_dict.items(): if "dataset_name" not in dataset.column_names: dataset_dict[key] = dataset.add_column("dataset_name", [key] * len(dataset)) @@ -350,7 +359,7 @@ def collect_features( def evaluate( self, - eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + eval_dataset: Optional[Union["Dataset", Dict[str, "Dataset"]]] = None, ignore_keys: Optional[List[str]] = None, metric_key_prefix: str = "eval", ) -> Dict[str, float]: @@ -427,7 +436,7 @@ def _load_best_model(self) -> None: self.model = full_model self.model[0].auto_model = loaded_auto_model - def validate_column_names(self, dataset: Dataset, dataset_name: Optional[str] = None) -> bool: + def validate_column_names(self, dataset: "Dataset", dataset_name: Optional[str] = None) -> bool: if overlap := set(dataset.column_names) & {"return_loss", "dataset_name"}: raise ValueError( f"The following column names are invalid in your {dataset_name + ' ' if dataset_name else ''}dataset: {list(overlap)}." @@ -436,7 +445,7 @@ def validate_column_names(self, dataset: Dataset, dataset_name: Optional[str] = def get_batch_sampler( self, - dataset: Dataset, + dataset: "Dataset", batch_size: int, drop_last: bool, valid_label_columns: Optional[List[str]] = None, @@ -559,7 +568,7 @@ def get_train_dataloader(self) -> DataLoader: self._train_dataloader = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params)) return self._train_dataloader - def get_eval_dataloader(self, eval_dataset: Union[Dataset, None] = None) -> DataLoader: + def get_eval_dataloader(self, eval_dataset: Union["Dataset", None] = None) -> DataLoader: """ Returns the evaluation [`~torch.utils.data.DataLoader`]. @@ -628,7 +637,7 @@ def get_eval_dataloader(self, eval_dataset: Union[Dataset, None] = None) -> Data self.accelerator.even_batches = True return self.accelerator.prepare(DataLoader(eval_dataset, **dataloader_params)) - def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader: + def get_test_dataloader(self, test_dataset: "Dataset") -> DataLoader: """ Returns the training [`~torch.utils.data.DataLoader`]. 
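The `is_training_available()` guard added to `SentenceTransformerTrainer.__init__` above makes the trainer fail fast when the `train` extra is missing. To illustrate how code built on top of this API can branch on the same helper, here is a minimal hedged sketch — the model name and the tiny dataset are illustrative placeholders, not part of this patch:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import is_training_available

model = SentenceTransformer("all-MiniLM-L6-v2")

if is_training_available():
    # Both `accelerate` and `datasets` are importable, so the trainer can be built.
    from datasets import Dataset

    from sentence_transformers import SentenceTransformerTrainer, losses

    train_dataset = Dataset.from_dict({
        "anchor": ["It's nice weather outside today.", "He drove to work."],
        "positive": ["It's so sunny.", "He took the car to the office."],
    })
    trainer = SentenceTransformerTrainer(
        model=model,
        train_dataset=train_dataset,
        loss=losses.MultipleNegativesRankingLoss(model),
    )
    trainer.train()
else:
    # Inference-only installs can still encode; only training needs the extras.
    embeddings = model.encode(["This sentence can still be embedded."])
```

Without the extras, constructing the trainer raises the `RuntimeError` shown in the hunk above rather than failing with a confusing `ImportError` partway through training.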
diff --git a/sentence_transformers/util.py b/sentence_transformers/util.py index 6b9cef844..80fba6978 100644 --- a/sentence_transformers/util.py +++ b/sentence_transformers/util.py @@ -936,3 +936,24 @@ def get_device_name() -> Literal["mps", "cuda", "npu", "hpu", "cpu"]: if hthpu.is_available(): return "hpu" return "cpu" + + +def is_accelerate_available() -> bool: + """ + Returns True if the accelerate library is available. + """ + return importlib.util.find_spec("accelerate") is not None + + +def is_datasets_available() -> bool: + """ + Returns True if the datasets library is available. + """ + return importlib.util.find_spec("datasets") is not None + + +def is_training_available() -> bool: + """ + Returns True if we have the required dependencies for training Sentence Transformer models + """ + return is_accelerate_available() and is_datasets_available() diff --git a/setup.py b/setup.py index 922dcf7ae..2ad8457ae 100644 --- a/setup.py +++ b/setup.py @@ -27,11 +27,15 @@ "scipy", "huggingface-hub>=0.15.1", "Pillow", - "datasets", - "accelerate>=0.20.3", ], extras_require={ + "train": [ + "datasets", + "accelerate>=0.20.3", + ], "dev": [ + "datasets", + "accelerate>=0.20.3", "pre-commit", "pytest", "ruff>=0.3.0", diff --git a/tests/conftest.py b/tests/conftest.py index acd2870da..5a83759ad 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -4,9 +4,12 @@ import pytest -from datasets import DatasetDict, load_dataset from sentence_transformers import CrossEncoder, SentenceTransformer from sentence_transformers.models import Pooling, Transformer +from sentence_transformers.util import is_datasets_available + +if is_datasets_available(): + from datasets import DatasetDict, load_dataset @pytest.fixture() @@ -43,7 +46,7 @@ def distilbert_base_uncased_model() -> SentenceTransformer: @pytest.fixture(scope="session") -def stsb_dataset_dict() -> DatasetDict: +def stsb_dataset_dict() -> "DatasetDict": return load_dataset("mteb/stsbenchmark-sts") diff --git a/tests/test_train_stsb.py b/tests/test_train_stsb.py index a71fe8f06..e2ac0171a 100644 --- a/tests/test_train_stsb.py +++ b/tests/test_train_stsb.py @@ -19,6 +19,7 @@ ) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.readers import InputExample +from sentence_transformers.util import is_training_available @pytest.fixture() @@ -71,6 +72,10 @@ def evaluate_stsb_test(model, expected_score, test_samples) -> None: @pytest.mark.slow +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_train_stsb_slow( distilbert_base_uncased_model: SentenceTransformer, sts_resource: Tuple[List[InputExample], List[InputExample]] ) -> None: @@ -92,6 +97,10 @@ def test_train_stsb_slow( @pytest.mark.skipif("CI" in os.environ, reason="This test is too slow for the CI (~8 minutes)") +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_train_stsb( distilbert_base_uncased_model: SentenceTransformer, sts_resource: Tuple[List[InputExample], List[InputExample]] ) -> None: @@ -113,6 +122,10 @@ def test_train_stsb( @pytest.mark.slow +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_train_nli_slow( distilbert_base_uncased_model: SentenceTransformer, nli_resource: List[InputExample], @@ -139,6 +152,10 @@ def test_train_nli_slow( @pytest.mark.skipif("CI" 
in os.environ, reason="This test is too slow for the CI (~25 minutes)") +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_train_nli( distilbert_base_uncased_model: SentenceTransformer, nli_resource: List[InputExample], diff --git a/tests/test_trainer.py b/tests/test_trainer.py index 8d2de1b48..5188837cb 100644 --- a/tests/test_trainer.py +++ b/tests/test_trainer.py @@ -4,12 +4,19 @@ import pytest -from datasets import DatasetDict from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses +from sentence_transformers.util import is_datasets_available, is_training_available +if is_datasets_available(): + from datasets import DatasetDict + +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_trainer_multi_dataset_errors( - stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: DatasetDict + stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: "DatasetDict" ) -> None: train_dataset = stsb_dataset_dict["train"] loss = { @@ -73,8 +80,12 @@ def test_trainer_multi_dataset_errors( ) +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_trainer_invalid_column_names( - stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: DatasetDict + stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: "DatasetDict" ) -> None: train_dataset = stsb_dataset_dict["train"] for column_name in ("return_loss", "dataset_name"): @@ -106,6 +117,10 @@ def test_trainer_invalid_column_names( trainer.train() +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_model_card_reuse(stsb_bert_tiny_model: SentenceTransformer): assert stsb_bert_tiny_model._model_card_text # Reuse the model card if no training was done From cd236c9f283064f7d3c535c29d9111a5df32c99a Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 13:25:45 +0200 Subject: [PATCH 30/39] Update references to the API ref (#2679) --- docs/sentence_transformer/loss_overview.md | 45 +++++++++---------- examples/training/adaptive_layer/README.md | 10 ++--- .../training/matryoshka/2d_matryoshka_sts.py | 2 +- examples/training/matryoshka/README.md | 10 ++--- .../training/matryoshka/matryoshka_sts.py | 2 +- .../multilingual/make_multilingual.py | 2 +- examples/training/nli/training_nli.py | 2 +- examples/training/nli/training_nli_v2.py | 2 +- examples/training/nli/training_nli_v3.py | 2 +- .../other/training_wikipedia_sections.py | 2 +- .../unsupervised_learning/SimCSE/README.md | 2 +- .../query_generation/README.md | 2 +- sentence_transformers/model_card_template.md | 4 +- 13 files changed, 43 insertions(+), 44 deletions(-) diff --git a/docs/sentence_transformer/loss_overview.md b/docs/sentence_transformer/loss_overview.md index f46b0418e..ab5e8ed72 100644 --- a/docs/sentence_transformer/loss_overview.md +++ b/docs/sentence_transformer/loss_overview.md @@ -8,44 +8,43 @@ Loss functions play a critical role in the performance of your fine-tuned model. You can often convert one training data format into another, allowing more loss functions to be viable for your scenario. 
For example, ``(sentence_A, sentence_B) pairs`` with ``class`` labels can be converted into ``(anchor, positive, negative) triplets`` by sampling sentences with the same or different classes.
```
-| Inputs | Labels | Appropriate Loss Functions |
-|-----------------------------------------------|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `single sentences` | `class` | `BatchAllTripletLoss`<br>`BatchHardSoftMarginTripletLoss`<br>`BatchHardTripletLoss`<br>`BatchSemiHardTripletLoss` |
-| `single sentences` | `none` | `ContrastiveTensionLoss`<br>`DenoisingAutoEncoderLoss` |
-| `(anchor, anchor) pairs` | `none` | `ContrastiveTensionLossInBatchNegatives` |
-| `(damaged_sentence, original_sentence) pairs` | `none` | `DenoisingAutoEncoderLoss` |
-| `(sentence_A, sentence_B) pairs` | `class` | `SoftmaxLoss` |
-| `(anchor, positive) pairs` | `none` | `CachedMultipleNegativesRankingLoss`<br>`MultipleNegativesRankingLoss`<br>`MultipleNegativesSymmetricRankingLoss`<br>`MegaBatchMarginLoss`<br>`CachedGISTEmbedLoss`<br>`GISTEmbedLoss` |
-| `(anchor, positive/negative) pairs` | `1 if positive, 0 if negative` | `ContrastiveLoss`<br>`OnlineContrastiveLoss` |
-| `(sentence_A, sentence_B) pairs` | `float similarity score` | `CoSENTLoss`<br>`AnglELoss`<br>`CosineSimilarityLoss` |
-| `(anchor, positive, negative) triplets` | `none` | `CachedMultipleNegativesRankingLoss`<br>`MultipleNegativesRankingLoss`<br>`TripletLoss`<br>`CachedGISTEmbedLoss`<br>`GISTEmbedLoss` |
+| Inputs | Labels | Appropriate Loss Functions |
+|-----------------------------------------------|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `single sentences` | `class` | `BatchAllTripletLoss`<br>`BatchHardSoftMarginTripletLoss`<br>`BatchHardTripletLoss`<br>`BatchSemiHardTripletLoss` |
+| `single sentences` | `none` | `ContrastiveTensionLoss`<br>`DenoisingAutoEncoderLoss` |
+| `(anchor, anchor) pairs` | `none` | `ContrastiveTensionLossInBatchNegatives` |
+| `(damaged_sentence, original_sentence) pairs` | `none` | `DenoisingAutoEncoderLoss` |
+| `(sentence_A, sentence_B) pairs` | `class` | `SoftmaxLoss` |
+| `(anchor, positive) pairs` | `none` | `CachedMultipleNegativesRankingLoss`<br>`MultipleNegativesRankingLoss`<br>`MultipleNegativesSymmetricRankingLoss`<br>`MegaBatchMarginLoss`<br>`CachedGISTEmbedLoss`<br>`GISTEmbedLoss` |
+| `(anchor, positive/negative) pairs` | `1 if positive, 0 if negative` | `ContrastiveLoss`<br>`OnlineContrastiveLoss` |
+| `(sentence_A, sentence_B) pairs` | `float similarity score` | `CoSENTLoss`<br>`AnglELoss`<br>`CosineSimilarityLoss` |
+| `(anchor, positive, negative) triplets` | `none` | `CachedMultipleNegativesRankingLoss`<br>`MultipleNegativesRankingLoss`<br>`TripletLoss`<br>`CachedGISTEmbedLoss`<br>`GISTEmbedLoss` |

## Loss modifiers
These loss functions can be seen as *loss modifiers*: they work on top of standard loss functions, but apply those loss functions in different ways to try and instil useful properties into the trained embedding model.
-For example, models trained with `MatryoshkaLoss` produce embeddings whose size can be truncated without notable losses in performance, and models trained with `AdaptiveLayerLoss` still perform well when you remove model layers for faster inference.
-
-| Texts | Labels | Appropriate Loss Functions |
-|-------|--------|---------------------------------------------------------------|
-| `any` | `any` | `MatryoshkaLoss`<br>`AdaptiveLayerLoss`<br>`Matryoshka2dLoss` |
+For example, models trained with `MatryoshkaLoss` produce embeddings whose size can be truncated without notable losses in performance, and models trained with `AdaptiveLayerLoss` still perform well when you remove model layers for faster inference.
+| Texts | Labels | Appropriate Loss Functions |
+|-------|--------|-------------------------------------------------------------------------------|
+| `any` | `any` | `MatryoshkaLoss`<br>`AdaptiveLayerLoss`<br>`Matryoshka2dLoss` |

## Distillation
These loss functions are specifically designed to be used when distilling the knowledge from one model into another.
For example, when finetuning a small model to behave more like a larger & stronger one, or when finetuning a model to become multi-lingual.
-| Texts | Labels | Appropriate Loss Functions |
-|----------------------------------------------|---------------------------------------------------------------|------------------------------------------------------------------------------|
-| `sentence` | `model sentence embeddings` | `MSELoss` |
-| `sentence_1, sentence_2, ..., sentence_N` | `model sentence embeddings` | `MSELoss` |
-| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | `MarginMSELoss` |
+| Texts | Labels | Appropriate Loss Functions |
+|----------------------------------------------|---------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| `sentence` | `model sentence embeddings` | `MSELoss` |
+| `sentence_1, sentence_2, ..., sentence_N` | `model sentence embeddings` | `MSELoss` |
+| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | `MarginMSELoss` |

## Commonly used Loss Functions
In practice, not all loss functions get used equally often. The most common scenarios are:
-* `(anchor, positive) pairs` without any labels: MultipleNegativesRankingLoss is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. CachedMultipleNegativesRankingLoss is often used to increase the batch size, resulting in superior performance.
-* `(sentence_A, sentence_B) pairs` with a `float similarity score`: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance.
+* `(anchor, positive) pairs` without any labels: MultipleNegativesRankingLoss is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. CachedMultipleNegativesRankingLoss is often used to increase the batch size, resulting in superior performance.
+* `(sentence_A, sentence_B) pairs` with a `float similarity score`: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance.

## Custom Loss Functions
diff --git a/examples/training/adaptive_layer/README.md b/examples/training/adaptive_layer/README.md
index 8ab7dcf8b..71afe7c43 100644
--- a/examples/training/adaptive_layer/README.md
+++ b/examples/training/adaptive_layer/README.md
@@ -36,7 +36,7 @@ model = SentenceTransformer("microsoft/mpnet-base")
base_loss = CoSENTLoss(model=model)
loss = AdaptiveLayerLoss(model=model, loss=base_loss)
```
-* **Reference**: AdaptiveLayerLoss
+* **Reference**: AdaptiveLayerLoss

Note that training with `AdaptiveLayerLoss` is not notably slower than without using it.
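To make the note above concrete: the wrapper is purely compositional, so enabling adaptive layers in an existing script is a one-line change. A brief sketch, assuming the same model and base loss as the README snippet above (the rest is illustrative, not from the patch):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
base_loss = losses.CoSENTLoss(model=model)

# AdaptiveLayerLoss additionally applies `base_loss` to embeddings produced by
# a reduced number of transformer layers, so the model remains usable after
# layer truncation at inference time.
loss = losses.AdaptiveLayerLoss(model=model, loss=base_loss)
```

The same wrapping pattern applies to `MatryoshkaLoss` and to the combined `Matryoshka2dLoss` shown elsewhere in these docs.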
@@ -52,7 +52,7 @@ base_loss = CoSENTLoss(model=model) loss = Matryoshka2dLoss(model=model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64]) ``` -* **Reference**: Matryoshka2dLoss +* **Reference**: Matryoshka2dLoss ## Inference @@ -116,7 +116,7 @@ new_num_layers = 3 model[0].auto_model.encoder.layer = model[0].auto_model.encoder.layer[:new_num_layers] ``` -Then we can run inference with it using SentenceTransformers.encode. +Then we can run inference with it using SentenceTransformers.encode. ```python from sentence_transformers import SentenceTransformer @@ -142,11 +142,11 @@ As you can see, the similarity between the related sentences is much higher than ## Code Examples -See the following scripts as examples of how to apply the AdaptiveLayerLoss in practice: +See the following scripts as examples of how to apply the AdaptiveLayerLoss in practice: * **[adaptive_layer_nli.py](adaptive_layer_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `AdaptiveLayerLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation. * **[adaptive_layer_sts.py](adaptive_layer_sts.py)**: This example uses the CoSENTLoss with AdaptiveLayerLoss to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation. -And the following scripts to see how to apply Matryoshka2dLoss: +And the following scripts to see how to apply Matryoshka2dLoss: * **[2d_matryoshka_nli.py](../matryoshka/2d_matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `Matryoshka2dLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation. * **[2d_matryoshka_sts.py](../matryoshka/2d_matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `Matryoshka2dLoss` to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation. diff --git a/examples/training/matryoshka/2d_matryoshka_sts.py b/examples/training/matryoshka/2d_matryoshka_sts.py index a170f1581..55f31871c 100644 --- a/examples/training/matryoshka/2d_matryoshka_sts.py +++ b/examples/training/matryoshka/2d_matryoshka_sts.py @@ -48,7 +48,7 @@ logging.info(train_dataset) # 3. Define our training loss -# CoSENTLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one +# CoSENTLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) needs two text columns and one # similarity score column (between 0 and 1) inner_train_loss = losses.CoSENTLoss(model=model) train_loss = losses.Matryoshka2dLoss(model, inner_train_loss, [768, 512, 256, 128, 64]) diff --git a/examples/training/matryoshka/README.md b/examples/training/matryoshka/README.md index 6781bf53c..25cf1eaea 100644 --- a/examples/training/matryoshka/README.md +++ b/examples/training/matryoshka/README.md @@ -36,7 +36,7 @@ model = SentenceTransformer("microsoft/mpnet-base") base_loss = CoSENTLoss(model=model) loss = MatryoshkaLoss(model=model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64]) ``` -* **Reference**: MatryoshkaLoss +* **Reference**: MatryoshkaLoss Additionally, this can be combined with the `AdaptiveLayerLoss` such that the resulting model can be reduced both in the size of the output dimensions, but also in the number of layers for faster inference. 
See also the [Adaptive Layers](../adaptive_layer/README.html) for more information on reducing the number of model layers. In Sentence Transformers, the combination of these two losses is called `Matryoshka2dLoss`, and a shorthand is provided for simpler training. @@ -50,11 +50,11 @@ base_loss = CoSENTLoss(model=model) loss = Matryoshka2dLoss(model=model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64]) ``` -* **Reference**: Matryoshka2dLoss +* **Reference**: Matryoshka2dLoss ## Inference -After a model has been trained using a Matryoshka loss, you can then run inference with it using SentenceTransformers.encode. +After a model has been trained using a Matryoshka loss, you can then run inference with it using SentenceTransformers.encode. ```python from sentence_transformers import SentenceTransformer @@ -85,13 +85,13 @@ As you can see, the similarity between the search query and the correct document ## Code Examples -See the following scripts as examples of how to apply the MatryoshkaLoss in practice: +See the following scripts as examples of how to apply the MatryoshkaLoss in practice: * **[matryoshka_nli.py](matryoshka_nli.py)**: This example uses the MultipleNegativesRankingLoss with MatryoshkaLoss to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation. * **[matryoshka_nli_reduced_dim.py](matryoshka_nli_reduced_dim.py)**: This example uses the MultipleNegativesRankingLoss with MatryoshkaLoss to train a strong embedding model with a small maximum output dimension of 256. It trains using Natural Language Inference (NLI) data, and is an adaptation of the [NLI](../nli/README) documentation. * **[matryoshka_eval_stsb.py](matryoshka_eval_stsb.py)**: This example evaluates the embedding model trained with MatryoshkaLoss in [matryoshka_nli.py](matryoshka_nli.py) on the test set of the STSBenchmark dataset, and compares it to a non-Matryoshka trained model. * **[matryoshka_sts.py](matryoshka_sts.py)**: This example uses the CoSENTLoss with MatryoshkaLoss to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation. -And the following scripts to see how to apply Matryoshka2dLoss: +And the following scripts to see how to apply Matryoshka2dLoss: * **[2d_matryoshka_nli.py](2d_matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `Matryoshka2dLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation. * **[2d_matryoshka_sts.py](2d_matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `Matryoshka2dLoss` to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation. diff --git a/examples/training/matryoshka/matryoshka_sts.py b/examples/training/matryoshka/matryoshka_sts.py index 4722f3cf3..f0813f1ff 100644 --- a/examples/training/matryoshka/matryoshka_sts.py +++ b/examples/training/matryoshka/matryoshka_sts.py @@ -49,7 +49,7 @@ logging.info(train_dataset) # 3. 
Define our training loss -# CoSENTLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one +# CoSENTLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) needs two text columns and one # similarity score column (between 0 and 1) inner_train_loss = losses.CoSENTLoss(model=model) train_loss = losses.MatryoshkaLoss(model, loss=inner_train_loss, matryoshka_dims=matryoshka_dims) diff --git a/examples/training/multilingual/make_multilingual.py b/examples/training/multilingual/make_multilingual.py index bb62d37bf..6d0555125 100644 --- a/examples/training/multilingual/make_multilingual.py +++ b/examples/training/multilingual/make_multilingual.py @@ -140,7 +140,7 @@ def prepare_dataset(batch): logging.info("Prepared datasets for training:", train_dataset_dict) # 3. Define our training loss -# MSELoss (https://sbert.net/docs/package_reference/losses.html#mseloss) needs one text columns and one +# MSELoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss) needs one text columns and one # column with embeddings from the teacher model train_loss = MSELoss(model=student_model) diff --git a/examples/training/nli/training_nli.py b/examples/training/nli/training_nli.py index 0a2dc6ba8..1f41e2e25 100644 --- a/examples/training/nli/training_nli.py +++ b/examples/training/nli/training_nli.py @@ -43,7 +43,7 @@ eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev").select(range(1000)) logging.info(train_dataset) -# 3. Define our training loss: https://sbert.net/docs/package_reference/losses.html#softmaxloss +# 3. Define our training loss: https://sbert.net/docs/package_reference/sentence_transformer/losses.html#softmaxloss train_loss = losses.SoftmaxLoss( model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), diff --git a/examples/training/nli/training_nli_v2.py b/examples/training/nli/training_nli_v2.py index 0b0025351..00727a5f1 100644 --- a/examples/training/nli/training_nli_v2.py +++ b/examples/training/nli/training_nli_v2.py @@ -47,7 +47,7 @@ eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev").select(range(1000)) logging.info(train_dataset) -# 3. Define our training loss: https://sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss +# 3. Define our training loss: https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss train_loss = losses.MultipleNegativesRankingLoss(model) diff --git a/examples/training/nli/training_nli_v3.py b/examples/training/nli/training_nli_v3.py index ffc95e128..43a946089 100644 --- a/examples/training/nli/training_nli_v3.py +++ b/examples/training/nli/training_nli_v3.py @@ -47,7 +47,7 @@ eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev").select(range(1000)) logging.info(train_dataset) -# 3. Define our training loss: https://sbert.net/docs/package_reference/losses.html#gistembedloss +# 3. 
Define our training loss: https://sbert.net/docs/package_reference/sentence_transformer/losses.html#gistembedloss # The guiding model guide_model = SentenceTransformer("all-MiniLM-L6-v2") train_loss = losses.GISTEmbedLoss(model, guide_model) diff --git a/examples/training/other/training_wikipedia_sections.py b/examples/training/other/training_wikipedia_sections.py index e1c835418..a25614dfb 100644 --- a/examples/training/other/training_wikipedia_sections.py +++ b/examples/training/other/training_wikipedia_sections.py @@ -43,7 +43,7 @@ logging.info(train_dataset) # 3. Define our training loss -# TripletLoss (https://sbert.net/docs/package_reference/losses.html#tripletloss) needs three text columns +# TripletLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss) needs three text columns train_loss = TripletLoss(model) # 4. Define an evaluator for use during training. This is useful to keep track of alongside the evaluation loss. diff --git a/examples/unsupervised_learning/SimCSE/README.md b/examples/unsupervised_learning/SimCSE/README.md index 6b670de0b..7b67d9a73 100644 --- a/examples/unsupervised_learning/SimCSE/README.md +++ b/examples/unsupervised_learning/SimCSE/README.md @@ -6,7 +6,7 @@ The idea is to encode the same sentence twice. Due to the used dropout in transf ![SimCSE working](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SimCSE.png) ## Usage with SentenceTransformers -SentenceTransformers implements the [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss), which makes training with SimCSE trivial: +SentenceTransformers implements the [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss), which makes training with SimCSE trivial: ```python from sentence_transformers import SentenceTransformer, InputExample diff --git a/examples/unsupervised_learning/query_generation/README.md b/examples/unsupervised_learning/query_generation/README.md index de237a269..5fa0e0724 100644 --- a/examples/unsupervised_learning/query_generation/README.md +++ b/examples/unsupervised_learning/query_generation/README.md @@ -80,7 +80,7 @@ In the above code, we use [Top-p (nucleus) sampling](https://huggingface.co/blog ## Bi-Encoder Training -With the generated queries, we can then train a bi-encoder using the use [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss). +With the generated queries, we can then train a bi-encoder using the use [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss). ## Full Example We train a semantic search model to search through Wikipedia diff --git a/sentence_transformers/model_card_template.md b/sentence_transformers/model_card_template.md index bf1ac2896..73eb8b954 100644 --- a/sentence_transformers/model_card_template.md +++ b/sentence_transformers/model_card_template.md @@ -126,7 +126,7 @@ You can finetune this model on your own dataset. 
{% for metrics in eval_metrics %} #### {{ metrics.description }} {% if metrics.dataset_name %}* Dataset: `{{ metrics.dataset_name }}`{% endif %} -* Evaluated with {% if metrics.class_name.startswith("sentence_transformers.") %}[{{ metrics.class_name.split(".")[-1] }}](https://sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.{{ metrics.class_name.split(".")[-1] }}){% else %}{{ metrics.class_name }}{% endif %} +* Evaluated with {% if metrics.class_name.startswith("sentence_transformers.") %}[{{ metrics.class_name.split(".")[-1] }}](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.{{ metrics.class_name.split(".")[-1] }}){% else %}{{ metrics.class_name }}{% endif %} {{ metrics.table }} {%- endfor %}{% endif %} @@ -154,7 +154,7 @@ You can finetune this model on your own dataset. * Columns: {% if dataset['columns'] | length == 1 %}{{ dataset['columns'][0] }}{% elif dataset['columns'] | length == 2 %}{{ dataset['columns'][0] }} and {{ dataset['columns'][1] }}{% else %}{{ dataset['columns'][:-1] | join(', ') }}, and {{ dataset['columns'][-1] }}{% endif %} * Approximate statistics based on the first 1000 samples: {{ dataset['stats_table'] }}* Samples: -{{ dataset['examples_table'] }}* Loss: {% if dataset["loss"]["fullname"].startswith("sentence_transformers.") %}[{{ dataset["loss"]["fullname"].split(".")[-1] }}](https://sbert.net/docs/package_reference/losses.html#{{ dataset["loss"]["fullname"].split(".")[-1].lower() }}){% else %}{{ dataset["loss"]["fullname"] }}{% endif %}{% if "config_code" in dataset["loss"] %} with these parameters: +{{ dataset['examples_table'] }}* Loss: {% if dataset["loss"]["fullname"].startswith("sentence_transformers.") %}[{{ dataset["loss"]["fullname"].split(".")[-1] }}](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#{{ dataset["loss"]["fullname"].split(".")[-1].lower() }}){% else %}{{ dataset["loss"]["fullname"] }}{% endif %}{% if "config_code" in dataset["loss"] %} with these parameters: {{ dataset["loss"]["config_code"] }}{% endif %} {% endfor %}{% endif %}{% endfor -%} From 57794192c48246e2dac5ebbebdacd99d1678445e Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 14:37:50 +0200 Subject: [PATCH 31/39] [`v3`] Add "dataset_size:" to the tag denoting the number of training samples (#2680) * Prepend "dataset_size:" instead. 
I can always change the look of this later on the HF side
---
 sentence_transformers/model_card.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py
index 68d2c0a44..ea67e4dee 100644
--- a/sentence_transformers/model_card.py
+++ b/sentence_transformers/model_card.py
@@ -702,7 +702,7 @@ def extract_dataset_metadata(
        if dataset_type == "train":
            num_training_samples = sum([metadata.get("size", 0) for metadata in dataset_metadata])
            if num_training_samples:
-                self.tags += [self.num_training_samples_to_tag(num_training_samples)]
+                self.tags += ["dataset_size:" + self.num_training_samples_to_tag(num_training_samples)]
        return self.validate_datasets(dataset_metadata)
From 24bee0949f087b4b7919d24eda6c1fe60631b78a Mon Sep 17 00:00:00 2001
From: Tom Aarsen
Date: Mon, 27 May 2024 14:46:57 +0200
Subject: [PATCH 32/39] Fix formatting of Python modules
---
 docs/sentence_transformer/training_overview.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md
index 72c0fcf0f..51bac2fb2 100644
--- a/docs/sentence_transformer/training_overview.md
+++ b/docs/sentence_transformer/training_overview.md
@@ -481,9 +481,9 @@ The :class:`~sentence_transformers.SentenceTransformerTrainer` is where all prev
```eval_rst
This Sentence Transformers trainer integrates support for various :class:`transformers.TrainerCallback` subclasses, such as:
-- :class:`~transformers.integrations.WandbCallback` to automatically log training metrics to W&B if `wandb` is installed
-- :class:`~transformers.integrations.TensorBoardCallback` to log training metrics to TensorBoard if `tensorboard` is accessible.
-- :class:`~transformers.integrations.CodeCarbonCallback` to track the carbon emissions of your model during training if `codecarbon` is installed.
+- :class:`~transformers.integrations.WandbCallback` to automatically log training metrics to W&B if ``wandb`` is installed
+- :class:`~transformers.integrations.TensorBoardCallback` to log training metrics to TensorBoard if ``tensorboard`` is accessible.
+- :class:`~transformers.integrations.CodeCarbonCallback` to track the carbon emissions of your model during training if ``codecarbon`` is installed.
- Note: These carbon emissions will be included in your automatically generated model card.
From a373931ed336f1488f6a793810971d2f61b8c3aa Mon Sep 17 00:00:00 2001
From: Tom Aarsen
Date: Mon, 27 May 2024 15:26:29 +0200
Subject: [PATCH 33/39] Docs: pairwise_cosine_similarity -> pairwise_similarity
---
 docs/sentence_transformer/usage/semantic_textual_similarity.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/sentence_transformer/usage/semantic_textual_similarity.rst b/docs/sentence_transformer/usage/semantic_textual_similarity.rst
index cd4c332b1..219b4c291 100644
--- a/docs/sentence_transformer/usage/semantic_textual_similarity.rst
+++ b/docs/sentence_transformer/usage/semantic_textual_similarity.rst
@@ -89,7 +89,7 @@ This value can be changed in a handful of ways:
Sentence Transformers implements two methods to calculate the similarity between embeddings:
- :meth:`SentenceTransformer.similarity `: Calculates the similarity between all pairs of embeddings.
-- :meth:`SentenceTransformer.pairwise_cosine_similarity `: Calculates the similarity between embeddings in a pairwise fashion.
+- :meth:`SentenceTransformer.pairwise_similarity `: Calculates the similarity between embeddings in a pairwise fashion. :: From 403d188980755d8694e74aabad7496f393ce0c4b Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Mon, 27 May 2024 16:54:58 +0200 Subject: [PATCH 34/39] Link to the yet-to-be-released release notes instead --- docs/changelog/v3.0.md | 0 index.rst | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) delete mode 100644 docs/changelog/v3.0.md diff --git a/docs/changelog/v3.0.md b/docs/changelog/v3.0.md deleted file mode 100644 index e69de29bb..000000000 diff --git a/index.rst b/index.rst index ef8b630b9..980701531 100644 --- a/index.rst +++ b/index.rst @@ -1,6 +1,6 @@ .. note:: - Sentence Transformers v3.0 just released, introducing a new training API for Sentence Transformer models. Read `SentenceTransformer > Training Overview `_ to learn more about the training API, and check out `v3.0 Release Notes `_ for details on the other changes. + Sentence Transformers v3.0 just released, introducing a new training API for Sentence Transformer models. Read `SentenceTransformer > Training Overview `_ to learn more about the training API, and check out `v3.0 Release Notes `_ for details on the other changes. SentenceTransformers Documentation ================================== From 3f5dccbe5ebaf1060449ed4b637c820597cc3967 Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Mon, 27 May 2024 16:55:44 +0200 Subject: [PATCH 35/39] Update phrasing on local_files_only docstring --- sentence_transformers/SentenceTransformer.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py index 280312847..f6a48ae29 100644 --- a/sentence_transformers/SentenceTransformer.py +++ b/sentence_transformers/SentenceTransformer.py @@ -72,7 +72,7 @@ class SentenceTransformer(nn.Sequential, FitMixin): will execute code present on the Hub on your local machine. revision (str, optional): The specific model version to use. It can be a branch name, a tag name, or a commit id, for a stored model on Hugging Face. - local_files_only (bool, optional): If `True`, avoid downloading the model. + local_files_only (bool, optional): Whether or not to only look at local files (i.e., do not try to download the model). token (bool or str, optional): Hugging Face authentication token to download private models. use_auth_token (bool or str, optional): Deprecated argument. Please use `token` instead. truncate_dim (int, optional): The dimension to truncate sentence embeddings to. `None` does no truncation. Truncation is From 2f89fd617951be78081b53fcaf9a98be66658938 Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Tue, 28 May 2024 08:07:19 +0200 Subject: [PATCH 36/39] Link directly to the 2DMSE preprint --- examples/training/adaptive_layer/README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/examples/training/adaptive_layer/README.md b/examples/training/adaptive_layer/README.md index 71afe7c43..b904c40c8 100644 --- a/examples/training/adaptive_layer/README.md +++ b/examples/training/adaptive_layer/README.md @@ -1,6 +1,11 @@ # Adaptive Layers -Embedding models are often encoder models with numerous layers, such as 12 (e.g. [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) or 6 (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)). To get embeddings, every single one of these layers must be traversed. 
[2D Matryoshka Sentence Embeddings](https://arxiv.org/abs/2402.14776) (2DMSE) revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs. +Embedding models are often encoder models with numerous layers, such as 12 (e.g. [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) or 6 (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)). To get embeddings, every single one of these layers must be traversed. The [2D Matryoshka Sentence Embeddings](https://arxiv.org/abs/2402.14776v1) (2DMSE) preprint revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs. + +```eval_rst +.. note:: + The 2DMSE preprint was later updated and renamed to `ESE: Espresso Sentence Embeddings `_. The Sentence Transformers implementation of Adaptive Layers and Matryoshka2d (Adaptive Layer + Matryoshka Embeddings) are based on the initial preprint, and we accept contributions that implement the updated ESE paper. +``` ## Use Cases From 649a31c06a50d571a806f272f04d9ab652fbe7ab Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Tue, 28 May 2024 09:56:49 +0200 Subject: [PATCH 37/39] Add missing subset in quora-duplicates --- docs/sentence_transformer/training_overview.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md index 51bac2fb2..7999c2cf0 100644 --- a/docs/sentence_transformer/training_overview.md +++ b/docs/sentence_transformer/training_overview.md @@ -545,7 +545,7 @@ Training on multiple datasets looks like this: # (sentence1, sentence2) + score stsb_pair_score_train = load_dataset("sentence-transformers/stsb", split="train[:10000]") # (anchor, positive) - quora_pair_train = load_dataset("sentence-transformers/quora-duplicates", split="train[:10000]") + quora_pair_train = load_dataset("sentence-transformers/quora-duplicates", "pair", split="train[:10000]") # (query, answer) natural_questions_train = load_dataset("sentence-transformers/natural-questions", split="train[:10000]") @@ -566,7 +566,7 @@ Training on multiple datasets looks like this: # (sentence1, sentence2, score) stsb_pair_score_dev = load_dataset("sentence-transformers/stsb", split="validation") # (anchor, positive) - quora_pair_dev = load_dataset("sentence-transformers/quora-duplicates", split="train[10000:11000]") + quora_pair_dev = load_dataset("sentence-transformers/quora-duplicates", "pair", split="train[10000:11000]") # (query, answer) natural_questions_dev = load_dataset("sentence-transformers/natural-questions", split="train[10000:11000]") From 946a97d41d0f5e98f30ed61d38b902123f9e219c Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Tue, 28 May 2024 10:29:15 +0200 Subject: [PATCH 38/39] Add missing docstrings arguments for Cached... 
losses
---
 sentence_transformers/losses/CachedGISTEmbedLoss.py | 11 +++++++----
 .../losses/CachedMultipleNegativesRankingLoss.py | 13 ++++++++-----
 2 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/sentence_transformers/losses/CachedGISTEmbedLoss.py b/sentence_transformers/losses/CachedGISTEmbedLoss.py
index e3208298f..9480263d1 100644
--- a/sentence_transformers/losses/CachedGISTEmbedLoss.py
+++ b/sentence_transformers/losses/CachedGISTEmbedLoss.py
@@ -82,10 +82,13 @@ def __init__(
        Args:
            model: SentenceTransformer model
-            guide: SentenceTransformer model to guide the in-batch
-            negative sample selection.
-            temperature: Temperature parameter to scale the cosine
-            similarities.
+            guide: SentenceTransformer model to guide the in-batch negative sample selection.
+            temperature: Temperature parameter to scale the cosine similarities.
+            mini_batch_size: Mini-batch size for the forward pass; this denotes how much memory is actually used during
+                training and evaluation. The smaller the mini-batch size, the more memory efficient the training is, but
+                the slower the training will be. It's recommended to set it as high as your GPU memory allows. The default
+                value is 32.
+            show_progress_bar: If True, a progress bar for the mini-batches is shown during training. The default is False.
        References:
            - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf
diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
index 5e1b4e1d0..c6838ddef 100644
--- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
+++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
@@ -87,12 +87,15 @@ def __init__(
        Args:
            model: SentenceTransformer model
-            scale: Output of similarity function is multiplied by scale
-            value
-            similarity_fct: similarity function between sentence
-            embeddings. By default, cos_sim. Can also be set to dot
+            scale: Output of similarity function is multiplied by scale value
+            similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot
                product (and then set scale to 1)
-
+            mini_batch_size: Mini-batch size for the forward pass; this denotes how much memory is actually used during
+                training and evaluation. The smaller the mini-batch size, the more memory efficient the training is, but
+                the slower the training will be. It's recommended to set it as high as your GPU memory allows. The default
+                value is 32.
+            show_progress_bar: If True, a progress bar for the mini-batches is shown during training. The default is False.
+
        References:
            - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf
            - Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup: https://arxiv.org/pdf/2101.06983.pdf
From 85890d5713b24fd69af08d89c81db9fc0f3c5ea6 Mon Sep 17 00:00:00 2001
From: Tom Aarsen
Date: Tue, 28 May 2024 12:10:32 +0200
Subject: [PATCH 39/39] Update training overview docs based on the blogpost reviews
---
 docs/sentence_transformer/training_overview.md | 15 ++++++++-------
 .../losses/CachedMultipleNegativesRankingLoss.py | 2 +-
 2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md
index 7999c2cf0..fb9c421f6 100644
--- a/docs/sentence_transformer/training_overview.md
+++ b/docs/sentence_transformer/training_overview.md
@@ -128,14 +128,14 @@ The :class:`SentenceTransformerTrainer` trains and evaluates using :class:`datas
    from datasets import Dataset
-    sentence1_list = []
-    sentence2_list = []
+    anchors = []
+    positives = []
    # Open a file, do preprocessing, filtering, cleaning, etc.
    # and append to the lists
    dataset = Dataset.from_dict({
-        "sentence1": sentence1_list,
-        "sentence2": sentence2_list,
+        "anchor": anchors,
+        "positive": positives,
    })
Each key from the dictionary will become a column in the resulting dataset.
@@ -276,9 +276,10 @@ args = SentenceTransformerTrainingArguments(
## Evaluator
-```eval_rst
-Several evaluators exist that can help with evaluation before, during, and after training:
+You can provide the [`SentenceTransformerTrainer`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer) with an `eval_dataset` to get the evaluation loss during training, but it may be useful to get more concrete metrics during training, too. For this, you can use evaluators to assess the model's performance with useful metrics before, during, or after training. You can use both an `eval_dataset` and an evaluator, one or the other, or neither. They evaluate based on the `eval_strategy` and `eval_steps` [Training Arguments](#training-arguments).
+Here are the implemented Evaluators that come with Sentence Transformers:
```eval_rst
======================================================================== ===========================================================================================================================
Evaluator                                                                Required Data
======================================================================== ===========================================================================================================================
:class:`~sentence_transformers.evaluation.TripletEvaluator`              (anchor, positive, negative) pairs.
======================================================================== ===========================================================================================================================
-Additionally, :class:`~sentence_transformers.evaluation.SequentialEvaluator` should be used to combine multiple evaluators into one Evaluator that can be passed to the :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`. When the evaluator is run depends on the ``eval_strategy`` and ``eval_steps`` `Training Arguments <#training-arguments>`_.
+Additionally, :class:`~sentence_transformers.evaluation.SequentialEvaluator` should be used to combine multiple evaluators into one Evaluator that can be passed to the :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`. Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face. diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py index c6838ddef..e8f4c0f3c 100644 --- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py @@ -95,7 +95,7 @@ def __init__( the slower the training will be. It's recommended to set it as high as your GPU memory allows. The default value is 32. show_progress_bar: If True, a progress bar for the mini-batches is shown during training. The default is False. - + References: - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf - Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup: https://arxiv.org/pdf/2101.06983.pdf
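Finally, to show how the `mini_batch_size` documented in these hunks interacts with the actual batch size: the dataloader batch size controls how many in-batch negatives each example sees, while `mini_batch_size` only caps the memory of each cached forward/backward chunk. A hedged sketch — the batch sizes are illustrative, not recommendations:

```python
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("all-MiniLM-L6-v2")

# Peak memory follows `mini_batch_size`, so the outer batch size can grow
# far beyond what would fit into a single forward pass.
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="tmp_trainer",
    per_device_train_batch_size=1024,  # large effective batch, bounded memory
)
# `args` and `loss` would then be passed to a SentenceTransformerTrainer.
```

Lowering `mini_batch_size` trades training speed for memory, which is exactly the trade-off the docstring describes.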