From ae5f51b2793058744bbfc64d72bf5be73b19928b Mon Sep 17 00:00:00 2001
From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Date: Thu, 25 Apr 2024 13:37:23 +0200
Subject: [PATCH 01/39] [`v3`] Training refactor - MultiGPU, loss logging, bf16, etc. (#2449)

* See #1638: Adds huggingface trainer for sentence transformers
* Fix type of tokenizer
* Get the trainer using the feature collation
* Update the docstring to reflect changes
* Initial draft for refactoring training using the Transformers Trainer
* Separate 'fit' functionality (new and old) into a mixin
* Resolve test issues
* Reformat
* Update the imports
* Add TODO regarding custom label columns
* Remove dead code
* Don't provide the trainer to the eval sampler
* Introduce datasets as a dependency
* Introduce "accelerate" as a dependency
* Avoid use_amp on CPU tests
* Specify that SentenceTransformer is a class, not a module
* Avoid circular import
* Remove | used as an "or" operator in typing
* Use test evaluator after training, as intended
* Use tokenize function instead of tokenizer; Add EvaluatorCallback which calls the evaluator on every epoch (for BC); Stop saving "do_lower_case" from Transformer;
* Reformat
* Revert Transformer tokenizer changes
* Add support for the tokenizer to return more than just input_ids & attention_masks
  Required for LSTM
* Use the test evaluators after training the examples
* Use pure torch for BoW tokenization
* Use dev evaluator for BiLSTM - test fails
* Add Trainer support for BoW-based models
* Pass epoch to evaluator in every-epoch callback
  For fit backwards compatibility
* Run formatting
* Use steps_per_epoch to set max_steps if possible
* Ignore extracting dataloader arguments for now
* Remove dead code
* Allow both "label" and "score" columns for labels
* Reformatting
* Improve errors if datasets don't match with loss dictionary well
* Made tests more consistent; list instead of set
* Simplify trainer with DatasetDict
* Implement a proportional sampler in addition to round robin
* Add CLIP finetuning support to the Trainer
* Start updating evaluators to return dictionaries
* Reformat
* Hackishly insert the DataParallel model into the loss function
* Allow for fsdp=["full_shard", "auto_wrap"] with fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"}
* Re-add support for DataParallel
* Use 'ParallelMode.NOT_PARALLEL'
* Prevent crash with DDP & an evaluation set
* When training with multiple datasets, add "dataset_name" column
  Rather than relying on some Batch Sampler hacking (which fails with some distributed training approaches)
* Update type hints: make loss & evaluator optional
  Co-authored-by: Wang Bo
* Set correct superclasses for samplers
* Override 'accelerator.even_batches' as it's incompatible with multi-dataset
* Throw exception if "return_loss" or "dataset_name" columns are used
* Set min. version for accelerate
* Heavily extend model card generation
* Remove some dead code
* Fix evaluator type hints
* Ensure that 'model_card_template.md' is included in the built package
* Rephrase comments slightly
* Heavily refactor samplers; add no duplicates/group by label samplers
* Ensure that data_loader.dataset exists in FitMixin
* Adopt 8 as the default batch size
* Fix logging error in example
* Remove the deprecated correct_bias
* Simplify with walrus operator
* Fix some bugs in set_widget_examples with short datasets
* Improve docstring slightly
* Add edge case in case training data has an unrecognized format
* Fix extracting dataset metadata
* Remove moot TYPE_CHECKING
* Set base model when loading a ST model also
* Add test_dataloader, add prefetch_factor to dataloaders
* Resolve predict_example fix; fix newlines in text
* Fix bug in compute_dataset_metrics examples
* Add call to action in ValueError
* Reuse original model card if no training is done
* Also collect nested losses (e.g. MatryoshkaLoss) and make losses in tags
* Remove generated tag; keep loss: prefix on tags
* Remove unused arguments
* Add support for "best model step" in model card
* Make hyperparameters code-formatted
* Fix load_best_model for Transformers models, prevent for non-Transformers
* Store base_model_revision in model_card_data
* Prevent crash when loading a local model
* Allow for bfloat16 inference

---------

Co-authored-by: Matthew Franglen
Co-authored-by: Wang Bo
---
 .gitignore | 6 +-
 MANIFEST.in | 1 +
 ...aining_stsbenchmark_avg_word_embeddings.py | 2 +-
 .../training_stsbenchmark_bow.py | 2 +-
 .../training_stsbenchmark_cnn.py | 2 +-
 ...ing_stsbenchmark_tf-idf_word_embeddings.py | 2 +-
 examples/training/clip/train_clip.ipynb | 6 +-
 .../distillation/model_distillation.py | 2 +-
 .../ms_marco/train_bi-encoder_margin-mse.py | 2 +-
 .../multilingual/make_multilingual.py | 6 +-
 .../train_askubuntu_ct-improved.py | 2 +
 requirements.txt | 4 +-
 sentence_transformers/SentenceTransformer.py | 400 ++------
 sentence_transformers/__init__.py | 16 +
 sentence_transformers/data_collator.py | 39 +
 .../BinaryClassificationEvaluator.py | 40 +-
 .../EmbeddingSimilarityEvaluator.py | 46 +-
 .../InformationRetrievalEvaluator.py | 25 +-
 .../evaluation/LabelAccuracyEvaluator.py | 12 +-
 .../evaluation/MSEEvaluator.py | 16 +-
 .../evaluation/MSEEvaluatorFromDataFrame.py | 16 +-
 .../evaluation/ParaphraseMiningEvaluator.py | 17 +-
 .../evaluation/RerankingEvaluator.py | 17 +-
 .../evaluation/SentenceEvaluator.py | 43 +-
 .../evaluation/SequentialEvaluator.py | 29 +-
 .../evaluation/TranslationEvaluator.py | 17 +-
 .../evaluation/TripletEvaluator.py | 29 +-
 sentence_transformers/fit_mixin.py | 619 ++++++++++++
 .../losses/AdaptiveLayerLoss.py | 13 +
 sentence_transformers/losses/AnglELoss.py | 13 +
 .../losses/BatchAllTripletLoss.py | 13 +
 .../losses/BatchHardSoftMarginTripletLoss.py | 13 +
 .../losses/BatchHardTripletLoss.py | 13 +
 .../losses/BatchSemiHardTripletLoss.py | 13 +
 .../CachedMultipleNegativesRankingLoss.py | 18 +-
 sentence_transformers/losses/CoSENTLoss.py | 12 +
 .../losses/ContrastiveLoss.py | 15 +
 .../losses/ContrastiveTensionLoss.py | 24 +
 .../losses/CosineSimilarityLoss.py | 9 +-
 .../losses/DenoisingAutoEncoderLoss.py | 16 +
 sentence_transformers/losses/GISTEmbedLoss.py | 13 +
 sentence_transformers/losses/MSELoss.py | 14 +
 sentence_transformers/losses/MarginMSELoss.py | 13 +
 .../losses/Matryoshka2dLoss.py | 13 +
 .../losses/MatryoshkaLoss.py | 13 +
 .../losses/MegaBatchMarginLoss.py | 18 +
.../losses/MultipleNegativesRankingLoss.py | 13 + sentence_transformers/losses/SoftmaxLoss.py | 14 + sentence_transformers/losses/TripletLoss.py | 31 +- sentence_transformers/model_card.py | 920 ++++++++++++++++++ sentence_transformers/model_card_template.md | 228 +++++ sentence_transformers/models/BoW.py | 5 +- sentence_transformers/models/CLIPModel.py | 4 + sentence_transformers/models/Transformer.py | 2 + sentence_transformers/sampler.py | 210 ++++ sentence_transformers/trainer.py | 553 +++++++++++ sentence_transformers/training_args.py | 39 + sentence_transformers/util.py | 20 + setup.py | 3 + tests/conftest.py | 6 + tests/test_evaluator.py | 9 +- tests/test_model_card_data.py | 24 + tests/test_pretrained_stsb.py | 3 +- tests/test_train_stsb.py | 12 +- tests/test_trainer.py | 127 +++ 65 files changed, 3450 insertions(+), 447 deletions(-) create mode 100644 MANIFEST.in create mode 100644 sentence_transformers/data_collator.py create mode 100644 sentence_transformers/fit_mixin.py create mode 100644 sentence_transformers/model_card.py create mode 100644 sentence_transformers/model_card_template.md create mode 100644 sentence_transformers/sampler.py create mode 100644 sentence_transformers/trainer.py create mode 100644 sentence_transformers/training_args.py create mode 100644 tests/test_model_card_data.py create mode 100644 tests/test_trainer.py diff --git a/.gitignore b/.gitignore index 7c27f10bf..9eac52b8e 100644 --- a/.gitignore +++ b/.gitignore @@ -19,7 +19,9 @@ nr_*/ /docs/make.bat /examples/training/quora_duplicate_questions/quora-IR-dataset/ build - htmlcov .coverage -.venv \ No newline at end of file +wandb +checkpoints +tmp +.venv diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 000000000..d6144a5d0 --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1 @@ +include sentence_transformers/model_card_template.md \ No newline at end of file diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py index e3c9a7376..bb965df98 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py @@ -105,4 +105,4 @@ model = SentenceTransformer(model_save_path) test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") -model.evaluate(evaluator) +model.evaluate(test_evaluator) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py index 16121753f..503de464d 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py @@ -128,4 +128,4 @@ model = SentenceTransformer(model_save_path) test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") -model.evaluate(evaluator) +model.evaluate(test_evaluator) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py index c73315364..a7c822f52 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py @@ -105,4 +105,4 @@ model = SentenceTransformer(model_save_path) test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") 
-model.evaluate(evaluator) +model.evaluate(test_evaluator) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py index 17006638d..f45a4e7d3 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py @@ -132,4 +132,4 @@ model = SentenceTransformer(model_save_path) test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") -model.evaluate(evaluator) +model.evaluate(test_evaluator) diff --git a/examples/training/clip/train_clip.ipynb b/examples/training/clip/train_clip.ipynb index a5dd57de1..ea5e7758e 100644 --- a/examples/training/clip/train_clip.ipynb +++ b/examples/training/clip/train_clip.ipynb @@ -89,15 +89,15 @@ "train_dataset = []\n", "for idx in range(0, len(photos), 2):\n", " # We can use image pairs directly. Because our images aren't labeled, we use a random label as an example\n", - " train_dataset.append(InputExample(texts=[photos[idx], photos[idx + 1]], label=random.choice([0, 1])))\n", + " # train_dataset.append(InputExample(texts=[photos[idx], photos[idx + 1]], label=random.choice([0, 1])))\n", " \n", " # Or images and text together\n", " train_dataset.append(InputExample(texts=[photos[idx], \"This is the caption\"], label=1))\n", " train_dataset.append(InputExample(texts=[photos[idx], \"This is another unrelated caption\"], label=0))\n", "\n", " # Or just texts\n", - " train_dataset.append(InputExample(texts=[\"This is a caption\", \"This is a similar caption\"], label=1))\n", - " train_dataset.append(InputExample(texts=[\"This is a caption\", \"This is an unrelated caption\"], label=0))\n" + " # train_dataset.append(InputExample(texts=[\"This is a caption\", \"This is a similar caption\"], label=1))\n", + " # train_dataset.append(InputExample(texts=[\"This is a caption\", \"This is an unrelated caption\"], label=0))\n" ] }, { diff --git a/examples/training/distillation/model_distillation.py b/examples/training/distillation/model_distillation.py index bf469bc8f..f8e6bf333 100644 --- a/examples/training/distillation/model_distillation.py +++ b/examples/training/distillation/model_distillation.py @@ -205,6 +205,6 @@ evaluation_steps=5000, output_path=output_path, save_best_model=True, - optimizer_params={"lr": 1e-4, "eps": 1e-6, "correct_bias": False}, + optimizer_params={"lr": 1e-4, "eps": 1e-6}, use_amp=True, ) diff --git a/examples/training/ms_marco/train_bi-encoder_margin-mse.py b/examples/training/ms_marco/train_bi-encoder_margin-mse.py index d84852861..7b397da62 100644 --- a/examples/training/ms_marco/train_bi-encoder_margin-mse.py +++ b/examples/training/ms_marco/train_bi-encoder_margin-mse.py @@ -165,7 +165,7 @@ negs_to_use = args.negs_to_use.split(",") else: # Use all systems negs_to_use = list(data["neg"].keys()) - logging.info("Using negatives from the following systems:", negs_to_use) + logging.info("Using negatives from the following systems: {}".format(", ".join(negs_to_use))) for system_name in negs_to_use: if system_name not in data["neg"]: diff --git a/examples/training/multilingual/make_multilingual.py b/examples/training/multilingual/make_multilingual.py index 0b8f3d29b..fafe454dd 100644 --- a/examples/training/multilingual/make_multilingual.py +++ b/examples/training/multilingual/make_multilingual.py @@ -189,7 +189,7 @@ def download_corpora(filepaths): dev_mse 
= evaluation.MSEEvaluator( src_sentences, trg_sentences, - name=os.path.basename(dev_file), + name=os.path.basename(dev_file).split(".")[0], teacher_model=teacher_model, batch_size=inference_batch_size, ) @@ -197,7 +197,7 @@ def download_corpora(filepaths): # TranslationEvaluator computes the embeddings for all parallel sentences. It then check if the embedding of source[i] is the closest to target[i] out of all available target sentences dev_trans_acc = evaluation.TranslationEvaluator( - src_sentences, trg_sentences, name=os.path.basename(dev_file), batch_size=inference_batch_size + src_sentences, trg_sentences, name=os.path.basename(dev_file).split(".")[0], batch_size=inference_batch_size ) evaluators.append(dev_trans_acc) @@ -238,7 +238,7 @@ def download_corpora(filepaths): data["sentences2"], data["scores"], batch_size=inference_batch_size, - name=filename, + name=filename.split(".")[0], show_progress_bar=False, ) evaluators.append(test_evaluator) diff --git a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py index fa0b9f8d7..e0c56e7f4 100644 --- a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py +++ b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py @@ -103,6 +103,8 @@ def read_eval_dataset(filepath): model.fit( train_objectives=[(train_dataloader, train_loss)], + evaluator=dev_evaluator, + evaluation_steps=100, epochs=1, warmup_steps=100, use_amp=True, # Set to True, if your GPU has optimized FP16 cores diff --git a/requirements.txt b/requirements.txt index 7fe24f4a6..6344944fb 100644 --- a/requirements.txt +++ b/requirements.txt @@ -5,4 +5,6 @@ numpy scikit-learn scipy huggingface-hub>=0.15.1 -Pillow \ No newline at end of file +Pillow +datasets +accelerate>=0.20.3 \ No newline at end of file diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py index 656103966..bdbd03905 100644 --- a/sentence_transformers/SentenceTransformer.py +++ b/sentence_transformers/SentenceTransformer.py @@ -2,10 +2,10 @@ import json import logging import os -import shutil from collections import OrderedDict +from pathlib import Path import warnings -from typing import List, Dict, Literal, Tuple, Iterable, Type, Union, Callable, Optional, TYPE_CHECKING +from typing import List, Dict, Literal, Tuple, Iterable, Union, Optional import numpy as np from numpy import ndarray import transformers @@ -13,20 +13,20 @@ from huggingface_hub import HfApi import torch from torch import nn, Tensor, device -from torch.optim import Optimizer -from torch.utils.data import DataLoader import torch.multiprocessing as mp from tqdm.autonotebook import trange import math import queue import tempfile +from sentence_transformers.model_card import SentenceTransformerModelCardData, generate_model_card + + from . import __MODEL_HUB_ORGANIZATION__ from .evaluation import SentenceEvaluator from .util import ( import_from_string, batch_to_device, - fullname, is_sentence_transformer_model, load_dir_path, load_file_path, @@ -36,17 +36,13 @@ ) from .quantization import quantize_embeddings from .models import Transformer, Pooling, Normalize -from .model_card_templates import ModelCardTemplate +from .fit_mixin import FitMixin from . 
import __version__ logger = logging.getLogger(__name__) -if TYPE_CHECKING: - from sentence_transformers.readers import InputExample - - -class SentenceTransformer(nn.Sequential): +class SentenceTransformer(nn.Sequential, FitMixin): """ Loads or creates a SentenceTransformer model that can be used to map sentences / text to embeddings. @@ -89,11 +85,13 @@ def __init__( token: Optional[Union[bool, str]] = None, use_auth_token: Optional[Union[bool, str]] = None, truncate_dim: Optional[int] = None, + model_card_data: Optional[SentenceTransformerModelCardData] = None, ): # Note: self._load_sbert_model can also update `self.prompts` and `self.default_prompt_name` self.prompts = prompts or {} self.default_prompt_name = default_prompt_name self.truncate_dim = truncate_dim + self.model_card_data = model_card_data or SentenceTransformerModelCardData() self._model_card_vars = {} self._model_card_text = None self._model_config = {} @@ -263,6 +261,9 @@ def __init__( "Either update the model configuration or call `model.set_pooling_include_prompt(False)` after loading the model." ) + # Pass the model to the model card data for later use in generating a model card upon saving this model + self.model_card_data.register_model(self) + def encode( self, sentences: Union[str, List[str]], @@ -423,7 +424,10 @@ def encode( all_embeddings = torch.Tensor() elif convert_to_numpy: if not isinstance(all_embeddings, np.ndarray): - all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings]) + if all_embeddings[0].dtype == torch.bfloat16: + all_embeddings = np.asarray([emb.float().numpy() for emb in all_embeddings]) + else: + all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings]) elif isinstance(all_embeddings, np.ndarray): all_embeddings = [torch.from_numpy(embedding) for embedding in all_embeddings] @@ -724,63 +728,34 @@ def save( self._create_model_card(path, model_name, train_datasets) def _create_model_card( - self, path: str, model_name: Optional[str] = None, train_datasets: Optional[List[str]] = None + self, path: str, model_name: Optional[str] = None, train_datasets: Optional[List[str]] = "deprecated" ): """ - Create an automatic model and stores it in path + Create an automatic model and stores it in path. If no training was done, and the loaded model was + a Sentence Transformer model already, then its model card is reused. """ - if self._model_card_text is not None and len(self._model_card_text) > 0: + if model_name: + model_path = Path(model_name) + if not model_path.exists() and not self.model_card_data.model_id: + self.model_card_data.model_id = model_name + + # If we loaded a Sentence Transformer model from the Hub, and no training was done, then + # we don't generate a new model card, but reuse the old one instead. 
+ if self._model_card_text and self.model_card_data.trainer is None: model_card = self._model_card_text else: - tags = ModelCardTemplate.__TAGS__.copy() - model_card = ModelCardTemplate.__MODEL_CARD__ - - if ( - len(self._modules) == 2 - and isinstance(self._first_module(), Transformer) - and isinstance(self._last_module(), Pooling) - and self._last_module().get_pooling_mode_str() in ["cls", "max", "mean"] - ): - pooling_module = self._last_module() - pooling_mode = pooling_module.get_pooling_mode_str() - model_card = model_card.replace( - "{USAGE_TRANSFORMERS_SECTION}", ModelCardTemplate.__USAGE_TRANSFORMERS__ - ) - pooling_fct_name, pooling_fct = ModelCardTemplate.model_card_get_pooling_function(pooling_mode) - model_card = ( - model_card.replace("{POOLING_FUNCTION}", pooling_fct) - .replace("{POOLING_FUNCTION_NAME}", pooling_fct_name) - .replace("{POOLING_MODE}", pooling_mode) + try: + model_card = generate_model_card(self) + except Exception as exc: + logger.error( + f"Error while generating model card: {exc}\n" + "Consider opening an issue on https://github.com/UKPLab/sentence-transformers/issues with these logs.\n" + "Skipping model card creation." ) - tags.append("transformers") - - # Print full model - model_card = model_card.replace("{FULL_MODEL_STR}", str(self)) - - # Add tags - model_card = model_card.replace("{TAGS}", "\n".join(["- " + t for t in tags])) - - datasets_str = "" - if train_datasets is not None: - datasets_str = "datasets:\n" + "\n".join(["- " + d for d in train_datasets]) - model_card = model_card.replace("{DATASETS}", datasets_str) - - # Add dim info - self._model_card_vars["{NUM_DIMENSIONS}"] = self.get_sentence_embedding_dimension() - - # Replace vars we created while using the model - for name, value in self._model_card_vars.items(): - model_card = model_card.replace(name, str(value)) - - # Replace remaining vars with default values - for name, value in ModelCardTemplate.__DEFAULT_VARS__.items(): - model_card = model_card.replace(name, str(value)) - - if model_name is not None: - model_card = model_card.replace("{MODEL_NAME}", model_name.strip()) + return with open(os.path.join(path, "README.md"), "w", encoding="utf8") as fOut: - fOut.write(model_card.strip()) + fOut.write(model_card) @save_to_hub_args_decorator def save_to_hub( @@ -881,6 +856,7 @@ def push_to_hub( exist_ok=exist_ok, ) repo_id = repo_url.repo_id # Update the repo_id in case the old repo_id didn't contain a user or organization + self.model_card_data.set_model_id(repo_id) if local_model_path: folder_url = api.upload_folder( repo_id=repo_id, folder_path=local_model_path, commit_message=commit_message @@ -904,21 +880,6 @@ def push_to_hub( # This isn't expected to ever be reached. return folder_url - def smart_batching_collate(self, batch: List["InputExample"]) -> Tuple[List[Dict[str, Tensor]], Tensor]: - """ - Transforms a batch from a SmartBatchingDataset to a batch of tensors for the model - Here, batch is a list of InputExample instances: [InputExample(...), ...] - - :param batch: - a batch from a SmartBatchingDataset - :return: - a batch of tensors for the model - """ - texts = [example.texts for example in batch] - sentence_features = [self.tokenize(sentence) for sentence in zip(*texts)] - labels = torch.tensor([example.label for example in batch]) - return sentence_features, labels - def _text_length(self, text: Union[List[int], List[List[int]]]): """ Help function to get the length for the input text. 
Text can be either @@ -935,214 +896,6 @@ def _text_length(self, text: Union[List[int], List[List[int]]]): else: return sum([len(t) for t in text]) # Sum of length of individual strings - def fit( - self, - train_objectives: Iterable[Tuple[DataLoader, nn.Module]], - evaluator: SentenceEvaluator = None, - epochs: int = 1, - steps_per_epoch=None, - scheduler: str = "WarmupLinear", - warmup_steps: int = 10000, - optimizer_class: Type[Optimizer] = torch.optim.AdamW, - optimizer_params: Dict[str, object] = {"lr": 2e-5}, - weight_decay: float = 0.01, - evaluation_steps: int = 0, - output_path: str = None, - save_best_model: bool = True, - max_grad_norm: float = 1, - use_amp: bool = False, - callback: Callable[[float, int, int], None] = None, - show_progress_bar: bool = True, - checkpoint_path: str = None, - checkpoint_save_steps: int = 500, - checkpoint_save_total_limit: int = 0, - ): - """ - Train the model with the given training objective - Each training objective is sampled in turn for one batch. - We sample only as many batches from each objective as there are in the smallest one - to make sure of equal training with each dataset. - - :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning - :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. - :param epochs: Number of epochs for training - :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal the DataLoader size from train_objectives. - :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. - :param optimizer_class: Optimizer - :param optimizer_params: Optimizer parameters - :param weight_decay: Weight decay for model parameters - :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps - :param output_path: Storage path for the model and evaluation files - :param save_best_model: If true, the best model (according to evaluator) is stored at output_path - :param max_grad_norm: Used for gradient normalization. - :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0 - :param callback: Callback function that is invoked after each evaluation. 
- It must accept the following three parameters in this order: - `score`, `epoch`, `steps` - :param show_progress_bar: If True, output a tqdm progress bar - :param checkpoint_path: Folder to save checkpoints during training - :param checkpoint_save_steps: Will save a checkpoint after so many steps - :param checkpoint_save_total_limit: Total number of checkpoints to store - """ - - ##Add info to model card - # info_loss_functions = "\n".join(["- {} with {} training examples".format(str(loss), len(dataloader)) for dataloader, loss in train_objectives]) - info_loss_functions = [] - for dataloader, loss in train_objectives: - info_loss_functions.extend(ModelCardTemplate.get_train_objective_info(dataloader, loss)) - info_loss_functions = "\n\n".join([text for text in info_loss_functions]) - - info_fit_parameters = json.dumps( - { - "evaluator": fullname(evaluator), - "epochs": epochs, - "steps_per_epoch": steps_per_epoch, - "scheduler": scheduler, - "warmup_steps": warmup_steps, - "optimizer_class": str(optimizer_class), - "optimizer_params": optimizer_params, - "weight_decay": weight_decay, - "evaluation_steps": evaluation_steps, - "max_grad_norm": max_grad_norm, - }, - indent=4, - sort_keys=True, - ) - self._model_card_text = None - self._model_card_vars["{TRAINING_SECTION}"] = ModelCardTemplate.__TRAINING_SECTION__.replace( - "{LOSS_FUNCTIONS}", info_loss_functions - ).replace("{FIT_PARAMETERS}", info_fit_parameters) - - if use_amp: - if is_torch_npu_available(): - scaler = torch.npu.amp.GradScaler() - else: - scaler = torch.cuda.amp.GradScaler() - self.to(self.device) - - dataloaders = [dataloader for dataloader, _ in train_objectives] - - # Use smart batching - for dataloader in dataloaders: - dataloader.collate_fn = self.smart_batching_collate - - loss_models = [loss for _, loss in train_objectives] - for loss_model in loss_models: - loss_model.to(self.device) - - self.best_score = -9999999 - - if steps_per_epoch is None or steps_per_epoch == 0: - steps_per_epoch = min([len(dataloader) for dataloader in dataloaders]) - - num_train_steps = int(steps_per_epoch * epochs) - - # Prepare optimizers - optimizers = [] - schedulers = [] - for loss_model in loss_models: - param_optimizer = list(loss_model.named_parameters()) - - no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"] - optimizer_grouped_parameters = [ - { - "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], - "weight_decay": weight_decay, - }, - {"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, - ] - - optimizer = optimizer_class(optimizer_grouped_parameters, **optimizer_params) - scheduler_obj = self._get_scheduler( - optimizer, scheduler=scheduler, warmup_steps=warmup_steps, t_total=num_train_steps - ) - - optimizers.append(optimizer) - schedulers.append(scheduler_obj) - - global_step = 0 - data_iterators = [iter(dataloader) for dataloader in dataloaders] - - num_train_objectives = len(train_objectives) - - skip_scheduler = False - for epoch in trange(epochs, desc="Epoch", disable=not show_progress_bar): - training_steps = 0 - - for loss_model in loss_models: - loss_model.zero_grad() - loss_model.train() - - for _ in trange(steps_per_epoch, desc="Iteration", smoothing=0.05, disable=not show_progress_bar): - for train_idx in range(num_train_objectives): - loss_model = loss_models[train_idx] - optimizer = optimizers[train_idx] - scheduler = schedulers[train_idx] - data_iterator = data_iterators[train_idx] - - try: - data = next(data_iterator) - 
except StopIteration: - data_iterator = iter(dataloaders[train_idx]) - data_iterators[train_idx] = data_iterator - data = next(data_iterator) - - features, labels = data - labels = labels.to(self.device) - features = list(map(lambda batch: batch_to_device(batch, self.device), features)) - - if use_amp: - with torch.autocast(device_type=self.device.type): - loss_value = loss_model(features, labels) - - scale_before_step = scaler.get_scale() - scaler.scale(loss_value).backward() - scaler.unscale_(optimizer) - torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm) - scaler.step(optimizer) - scaler.update() - - skip_scheduler = scaler.get_scale() != scale_before_step - else: - loss_value = loss_model(features, labels) - loss_value.backward() - torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm) - optimizer.step() - - optimizer.zero_grad() - - if not skip_scheduler: - scheduler.step() - - training_steps += 1 - global_step += 1 - - if evaluation_steps > 0 and training_steps % evaluation_steps == 0: - self._eval_during_training( - evaluator, output_path, save_best_model, epoch, training_steps, callback - ) - - for loss_model in loss_models: - loss_model.zero_grad() - loss_model.train() - - if ( - checkpoint_path is not None - and checkpoint_save_steps is not None - and checkpoint_save_steps > 0 - and global_step % checkpoint_save_steps == 0 - ): - self._save_checkpoint(checkpoint_path, checkpoint_save_total_limit, global_step) - - self._eval_during_training(evaluator, output_path, save_best_model, epoch, -1, callback) - - if evaluator is None and output_path is not None: # No evaluator, but output path: save final model version - self.save(output_path) - - if checkpoint_path is not None: - self._save_checkpoint(checkpoint_path, checkpoint_save_total_limit, global_step) - def evaluate(self, evaluator: SentenceEvaluator, output_path: str = None): """ Evaluate the model @@ -1156,38 +909,6 @@ def evaluate(self, evaluator: SentenceEvaluator, output_path: str = None): os.makedirs(output_path, exist_ok=True) return evaluator(self, output_path) - def _eval_during_training(self, evaluator, output_path, save_best_model, epoch, steps, callback): - """Runs evaluation during the training""" - eval_path = output_path - if output_path is not None: - os.makedirs(output_path, exist_ok=True) - eval_path = os.path.join(output_path, "eval") - os.makedirs(eval_path, exist_ok=True) - - if evaluator is not None: - score = evaluator(self, output_path=eval_path, epoch=epoch, steps=steps) - if callback is not None: - callback(score, epoch, steps) - if score > self.best_score: - self.best_score = score - if save_best_model: - self.save(output_path) - - def _save_checkpoint(self, checkpoint_path, checkpoint_save_total_limit, step): - # Store new checkpoint - self.save(os.path.join(checkpoint_path, str(step))) - - # Delete old checkpoints - if checkpoint_save_total_limit is not None and checkpoint_save_total_limit > 0: - old_checkpoints = [] - for subdir in os.listdir(checkpoint_path): - if subdir.isdigit(): - old_checkpoints.append({"step": int(subdir), "path": os.path.join(checkpoint_path, subdir)}) - - if len(old_checkpoints) > checkpoint_save_total_limit: - old_checkpoints = sorted(old_checkpoints, key=lambda x: x["step"]) - shutil.rmtree(old_checkpoints[0]["path"]) - def _load_auto_model( self, model_name_or_path: str, @@ -1222,6 +943,7 @@ def _load_auto_model( }, ) pooling_model = Pooling(transformer_model.get_word_embedding_dimension(), "mean") + 
self.model_card_data.set_base_model(model_name_or_path, revision=revision) return [transformer_model, pooling_model] def _load_sbert_model( @@ -1353,37 +1075,19 @@ def _load_sbert_model( module = module_class.load(module_path) modules[module_config["name"]] = module + if revision is None: + path_parts = Path(modules_json_path) + if len(path_parts.parts) >= 2: + revision_path_part = Path(modules_json_path).parts[-2] + if len(revision_path_part) == 40: + revision = revision_path_part + self.model_card_data.set_base_model(model_name_or_path, revision=revision) return modules @staticmethod def load(input_path): return SentenceTransformer(input_path) - @staticmethod - def _get_scheduler(optimizer, scheduler: str, warmup_steps: int, t_total: int): - """ - Returns the correct learning rate scheduler. Available scheduler: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - """ - scheduler = scheduler.lower() - if scheduler == "constantlr": - return transformers.get_constant_schedule(optimizer) - elif scheduler == "warmupconstant": - return transformers.get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps) - elif scheduler == "warmuplinear": - return transformers.get_linear_schedule_with_warmup( - optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total - ) - elif scheduler == "warmupcosine": - return transformers.get_cosine_schedule_with_warmup( - optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total - ) - elif scheduler == "warmupcosinewithhardrestarts": - return transformers.get_cosine_with_hard_restarts_schedule_with_warmup( - optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total - ) - else: - raise ValueError("Unknown scheduler {}".format(scheduler)) - @property def device(self) -> device: """ @@ -1440,3 +1144,17 @@ def _target_device(self) -> torch.device: @_target_device.setter def _target_device(self, device: Optional[Union[int, str, torch.device]] = None) -> None: self.to(device) + + @property + def _no_split_modules(self) -> List[str]: + try: + return self._first_module()._no_split_modules + except AttributeError: + return [] + + @property + def _keys_to_ignore_on_save(self) -> List[str]: + try: + return self._first_module()._keys_to_ignore_on_save + except AttributeError: + return [] diff --git a/sentence_transformers/__init__.py b/sentence_transformers/__init__.py index f8b457b7b..7cd7a2230 100644 --- a/sentence_transformers/__init__.py +++ b/sentence_transformers/__init__.py @@ -1,12 +1,25 @@ __version__ = "2.8.0.dev0" __MODEL_HUB_ORGANIZATION__ = "sentence-transformers" + +import importlib +import os + from .datasets import SentencesDataset, ParallelSentencesDataset from .LoggingHandler import LoggingHandler from .SentenceTransformer import SentenceTransformer from .readers import InputExample from .cross_encoder.CrossEncoder import CrossEncoder +from .trainer import SentenceTransformerTrainer +from .training_args import SentenceTransformerTrainingArguments +from .model_card import SentenceTransformerModelCardData from .quantization import quantize_embeddings + +# If codecarbon is installed and the log level is not defined, +# automatically overwrite the default to "error" +if importlib.util.find_spec("codecarbon") and "CODECARBON_LOG_LEVEL" not in os.environ: + os.environ["CODECARBON_LOG_LEVEL"] = "error" + __all__ = [ "LoggingHandler", "SentencesDataset", @@ -14,5 +27,8 @@ "SentenceTransformer", "InputExample", "CrossEncoder", + "SentenceTransformerTrainer", + 
"SentenceTransformerTrainingArguments", + "SentenceTransformerModelCardData", "quantize_embeddings", ] diff --git a/sentence_transformers/data_collator.py b/sentence_transformers/data_collator.py new file mode 100644 index 000000000..bd4d5ff27 --- /dev/null +++ b/sentence_transformers/data_collator.py @@ -0,0 +1,39 @@ +from dataclasses import dataclass, field +from typing import Any, Callable, Dict, List + +import torch + + +@dataclass +class SentenceTransformerDataCollator: + """Collator for a SentenceTransformers model. + This encodes the text columns to {column}_input_ids and {column}_attention_mask columns. + This works with the two text dataset that is used as the example in the training overview: + https://www.sbert.net/docs/training/overview.html""" + + tokenize_fn: Callable + valid_label_columns: List[str] = field(default_factory=lambda: ["label", "score"]) + + def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]: + columns = list(features[0].keys()) + + # We should always be able to return a loss, label or not: + batch = {"return_loss": True} + + if "dataset_name" in columns: + columns.remove("dataset_name") + batch["dataset_name"] = features[0]["dataset_name"] + + # Extract the label column if it exists + for label_column in self.valid_label_columns: + if label_column in columns: + batch["label"] = torch.tensor([row[label_column] for row in features]) + columns.remove(label_column) + break + + # Extract the feature columns + for column in columns: + tokenized = self.tokenize_fn([row[column] for row in features]) + for key, value in tokenized.items(): + batch[f"{column}_{key}"] = value + return batch diff --git a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py index e8cdab23e..40709ae18 100644 --- a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py +++ b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py @@ -7,7 +7,7 @@ from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances from sklearn.metrics import average_precision_score import numpy as np -from typing import List, Optional +from typing import Dict, List, Optional from ..readers import InputExample @@ -52,6 +52,8 @@ def __init__( self.labels = labels self.truncate_dim = truncate_dim + self.primary_metric = "max_ap" + assert len(self.sentences1) == len(self.sentences2) assert len(self.sentences1) == len(self.labels) for label in labels: @@ -70,13 +72,13 @@ def __init__( self.csv_headers = [ "epoch", "steps", - "cossim_accuracy", - "cossim_accuracy_threshold", - "cossim_f1", - "cossim_precision", - "cossim_recall", - "cossim_f1_threshold", - "cossim_ap", + "cosine_accuracy", + "cosine_accuracy_threshold", + "cosine_f1", + "cosine_precision", + "cosine_recall", + "cosine_f1_threshold", + "cosine_ap", "manhattan_accuracy", "manhattan_accuracy_threshold", "manhattan_f1", @@ -99,6 +101,7 @@ def __init__( "dot_f1_threshold", "dot_ap", ] + self.primary_metric = "cosine_accuracy" @classmethod def from_input_examples(cls, examples: List[InputExample], **kwargs): @@ -112,7 +115,9 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): scores.append(example.label) return cls(sentences1, sentences2, scores, **kwargs) - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = 
-1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -127,9 +132,6 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i scores = self.compute_metrices(model) - # Main score is the max of Average Precision (AP) - main_score = max(scores[short_name]["ap"] for short_name in scores) - file_output_data = [epoch, steps] for header_name in self.csv_headers: @@ -149,7 +151,17 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer = csv.writer(f) writer.writerow(file_output_data) - return main_score + metrics = { + f"{short_name}_{metric}": value + for short_name, values in scores.items() + for metric, value in values.items() + } + metrics.update( + {f"max_{metric}": max(scores[short_name][metric] for short_name in scores) for metric in scores["cosine"]} + ) + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics def compute_metrices(self, model): with nullcontext() if self.truncate_dim is None else model.truncate_sentence_embeddings(self.truncate_dim): @@ -189,7 +201,7 @@ def compute_metrices(self, model): labels = np.asarray(self.labels) output_scores = {} for short_name, name, scores, reverse in [ - ["cossim", "Cosine-Similarity", cosine_scores, True], + ["cosine", "Cosine-Similarity", cosine_scores, True], ["manhattan", "Manhattan-Distance", manhattan_distances, False], ["euclidean", "Euclidean-Distance", euclidean_distances, False], ["dot", "Dot-Product", dot_scores, True], diff --git a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py index 531e0680a..0cb14500e 100644 --- a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py +++ b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py @@ -8,7 +8,7 @@ from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances from scipy.stats import pearsonr, spearmanr import numpy as np -from typing import List, Literal, Optional +from typing import Dict, List, Literal, Optional from ..readers import InputExample @@ -52,6 +52,7 @@ def __init__( :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. Defaults to None. 
""" + super().__init__() self.sentences1 = sentences1 self.sentences2 = sentences2 self.scores = scores @@ -103,7 +104,9 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): scores.append(example.label) return cls(sentences1, sentences2, scores, **kwargs) - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -200,15 +203,30 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i ] ) - if self.main_similarity == SimilarityFunction.COSINE: - return eval_spearman_cosine - elif self.main_similarity == SimilarityFunction.EUCLIDEAN: - return eval_spearman_euclidean - elif self.main_similarity == SimilarityFunction.MANHATTAN: - return eval_spearman_manhattan - elif self.main_similarity == SimilarityFunction.DOT_PRODUCT: - return eval_spearman_dot - elif self.main_similarity is None: - return max(eval_spearman_cosine, eval_spearman_manhattan, eval_spearman_euclidean, eval_spearman_dot) - else: - raise ValueError("Unknown main_similarity value") + self.primary_metric = { + SimilarityFunction.COSINE: "spearman_cosine", + SimilarityFunction.EUCLIDEAN: "spearman_euclidean", + SimilarityFunction.MANHATTAN: "spearman_manhattan", + SimilarityFunction.DOT_PRODUCT: "spearman_dot", + }.get(self.main_similarity, "spearman_max") + metrics = { + "pearson_cosine": eval_pearson_cosine, + "spearman_cosine": eval_spearman_cosine, + "pearson_manhattan": eval_pearson_manhattan, + "spearman_manhattan": eval_spearman_manhattan, + "pearson_euclidean": eval_pearson_euclidean, + "spearman_euclidean": eval_spearman_euclidean, + "pearson_dot": eval_pearson_dot, + "spearman_dot": eval_spearman_dot, + "pearson_max": max(eval_pearson_cosine, eval_pearson_manhattan, eval_pearson_euclidean, eval_pearson_dot), + "spearman_max": max( + eval_spearman_cosine, eval_spearman_manhattan, eval_spearman_euclidean, eval_spearman_dot + ), + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics + + @property + def description(self) -> str: + return "Semantic Similarity" diff --git a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py index 9b0466420..54338e6f5 100644 --- a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py +++ b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py @@ -40,11 +40,12 @@ def __init__( write_csv: bool = True, truncate_dim: Optional[int] = None, score_functions: Dict[str, Callable[[Tensor, Tensor], Tensor]] = { - "cos_sim": cos_sim, - "dot_score": dot_score, + "cosine": cos_sim, + "dot": dot_score, }, # Score function, higher=more similar main_score_function: str = None, ): + super().__init__() self.queries_ids = [] for qid in queries: if qid in relevant_docs and len(relevant_docs[qid]) > 0: @@ -97,7 +98,7 @@ def __init__( def __call__( self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1, *args, **kwargs - ) -> float: + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -146,9 +147,23 @@ def __call__( fOut.close() if self.main_score_function is None: - return max([scores[name]["map@k"][max(self.map_at_k)] for name in 
self.score_function_names]) + score_function = max( + [(name, scores[name]["map@k"][max(self.map_at_k)]) for name in self.score_function_names], + key=lambda x: x[1], + )[0] + self.primary_metric = f"{score_function}_map@{max(self.map_at_k)}" else: - return scores[self.main_score_function]["map@k"][max(self.map_at_k)] + self.primary_metric = f"{self.main_score_function}_map@{max(self.map_at_k)}" + + metrics = { + f"{score_function}_{metric_name.replace('@k', '@' + str(k))}": value + for score_function, values_dict in scores.items() + for metric_name, values in values_dict.items() + for k, value in values.items() + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics def compute_metrices( self, model: SentenceTransformer, corpus_model=None, corpus_embeddings: Tensor = None diff --git a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py index e94e0adfe..05ebfe253 100644 --- a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py +++ b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py @@ -1,3 +1,4 @@ +from typing import Dict from sentence_transformers import SentenceTransformer from . import SentenceEvaluator import torch @@ -27,6 +28,7 @@ def __init__(self, dataloader: DataLoader, name: str = "", softmax_model=None, w :param dataloader: the data for the evaluation """ + super().__init__() self.dataloader = dataloader self.name = name self.softmax_model = softmax_model @@ -37,8 +39,11 @@ def __init__(self, dataloader: DataLoader, name: str = "", softmax_model=None, w self.write_csv = write_csv self.csv_file = "accuracy_evaluation" + name + "_results.csv" self.csv_headers = ["epoch", "steps", "accuracy"] + self.primary_metric = "accuracy" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: model.eval() total = 0 correct = 0 @@ -79,4 +84,7 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer = csv.writer(f) writer.writerow([epoch, steps, accuracy]) - return accuracy + metrics = {"accuracy": accuracy} + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics diff --git a/sentence_transformers/evaluation/MSEEvaluator.py b/sentence_transformers/evaluation/MSEEvaluator.py index ecb92be09..80ae29899 100644 --- a/sentence_transformers/evaluation/MSEEvaluator.py +++ b/sentence_transformers/evaluation/MSEEvaluator.py @@ -4,7 +4,7 @@ import logging import os import csv -from typing import List, Optional +from typing import Dict, List, Optional logger = logging.getLogger(__name__) @@ -41,6 +41,7 @@ def __init__( write_csv: bool = True, truncate_dim: Optional[int] = None, ): + super().__init__() self.truncate_dim = truncate_dim with nullcontext() if self.truncate_dim is None else teacher_model.truncate_sentence_embeddings( self.truncate_dim @@ -57,8 +58,9 @@ def __init__( self.csv_file = "mse_evaluation_" + name + "_results.csv" self.csv_headers = ["epoch", "steps", "MSE"] self.write_csv = write_csv + self.primary_metric = "negative_mse" - def __call__(self, model: SentenceTransformer, output_path, epoch=-1, steps=-1): + def __call__(self, model: SentenceTransformer, output_path, epoch=-1, steps=-1) -> Dict[str, float]: if epoch != -1: 
if steps == -1: out_txt = f" after epoch {epoch}" @@ -93,4 +95,12 @@ def __call__(self, model: SentenceTransformer, output_path, epoch=-1, steps=-1): writer.writerow([epoch, steps, mse]) - return -mse # Return negative score as SentenceTransformers maximizes the performance + # Return negative score as SentenceTransformers maximizes the performance + metrics = {"negative_mse": -mse} + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics + + @property + def description(self) -> str: + return "Knowledge Distillation" diff --git a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py index fd66f8942..bb027614e 100644 --- a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py +++ b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py @@ -42,6 +42,7 @@ def __init__( write_csv: bool = True, truncate_dim: Optional[int] = None, ): + super().__init__() self.combinations = combinations self.name = name self.batch_size = batch_size @@ -51,6 +52,7 @@ def __init__( self.csv_file = "mse_evaluation" + name + "_results.csv" self.csv_headers = ["epoch", "steps"] + self.primary_metric = "negative_mse" self.write_csv = write_csv self.truncate_dim = truncate_dim self.data = {} @@ -77,7 +79,9 @@ def __init__( all_src_embeddings = teacher_model.encode(all_source_sentences, batch_size=self.batch_size) self.teacher_embeddings = {sent: emb for sent, emb in zip(all_source_sentences, all_src_embeddings)} - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1): + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: model.eval() mse_scores = [] @@ -105,4 +109,12 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer.writerow([epoch, steps] + mse_scores) - return -np.mean(mse_scores) # Return negative score as SentenceTransformers maximizes the performance + # Return negative score as SentenceTransformers maximizes the performance + metrics = {"negative_mse": -np.mean(mse_scores).item()} + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics + + @property + def description(self) -> str: + return "Knowledge Distillation" diff --git a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py index de9fe7059..264cf7782 100644 --- a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py +++ b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py @@ -54,6 +54,7 @@ def __init__( dimension. Defaults to None. 
""" + super().__init__() self.sentences = [] self.ids = [] @@ -99,8 +100,11 @@ def __init__( self.csv_file: str = "paraphrase_mining_evaluation" + name + "_results.csv" self.csv_headers = ["epoch", "steps", "precision", "recall", "f1", "threshold", "average_precision"] self.write_csv = write_csv + self.primary_metric = "average_precision" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -174,7 +178,16 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer = csv.writer(f) writer.writerow([epoch, steps, best_precision, best_recall, best_f1, threshold, average_precision]) - return average_precision + metrics = { + "average_precision": average_precision, + "f1": best_f1, + "precision": best_precision, + "recall": best_recall, + "threshold": threshold, + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics @staticmethod def add_transitive_closure(graph): diff --git a/sentence_transformers/evaluation/RerankingEvaluator.py b/sentence_transformers/evaluation/RerankingEvaluator.py index 4cba7d15d..1216bff79 100644 --- a/sentence_transformers/evaluation/RerankingEvaluator.py +++ b/sentence_transformers/evaluation/RerankingEvaluator.py @@ -9,7 +9,7 @@ import torch from sklearn.metrics import average_precision_score, ndcg_score import tqdm -from typing import Callable, Optional +from typing import Callable, Dict, Optional logger = logging.getLogger(__name__) @@ -50,6 +50,7 @@ def __init__( truncate_dim: Optional[int] = None, mrr_at_k: Optional[int] = None, ): + super().__init__() self.samples = samples self.name = name @@ -82,8 +83,11 @@ def __init__( "NDCG@{}".format(self.at_k), ] self.write_csv = write_csv + self.primary_metric = "map" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -131,7 +135,14 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer.writerow([epoch, steps, mean_ap, mean_mrr, mean_ndcg]) - return mean_ap + metrics = { + "map": mean_ap, + f"mrr@{self.at_k}": mean_mrr, + f"ndcg@{self.at_k}": mean_ndcg, + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics def compute_metrices(self, model): return ( diff --git a/sentence_transformers/evaluation/SentenceEvaluator.py b/sentence_transformers/evaluation/SentenceEvaluator.py index 7e7116689..d8f4f5232 100644 --- a/sentence_transformers/evaluation/SentenceEvaluator.py +++ b/sentence_transformers/evaluation/SentenceEvaluator.py @@ -1,3 +1,6 @@ +import re +from typing import Any, Dict, Union + from sentence_transformers import SentenceTransformer @@ -8,7 +11,13 @@ class SentenceEvaluator: Extend this class and implement __call__ for custom evaluators. 
""" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __init__(self): + self.greater_is_better = True + # TODO: Add better `primary_metrics` support + + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Union[float, Dict[str, float]]: """ This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result. @@ -25,6 +34,36 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i the steps in the current epoch at time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch. - :return: a score for the evaluation with a higher score indicating a better result + :return: Either a score for the evaluation with a higher score indicating a better result, + or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` + must be defined """ pass + + def prefix_name_to_metrics(self, metrics: Dict[str, float], name: str): + if not name: + return metrics + metrics = {name + "_" + key: value for key, value in metrics.items()} + if hasattr(self, "primary_metric") and not self.primary_metric.startswith(name + "_"): + self.primary_metric = name + "_" + self.primary_metric + return metrics + + def store_metrics_in_model_card_data(self, model: "SentenceTransformer", metrics: Dict[str, Any]) -> None: + model.model_card_data.set_evaluation_metrics(self, metrics) + + @property + def description(self) -> str: + """ + Returns a human-readable description of the evaluator: BinaryClassificationEvaluator -> Binary Classification + + 1. Remove "Evaluator" from the class name + 2. Add a space before every capital letter + """ + class_name = self.__class__.__name__ + try: + index = class_name.index("Evaluator") + class_name = class_name[:index] + except IndexError: + pass + + return re.sub(r"([a-z])([A-Z])", "\g<1> \g<2>", class_name) diff --git a/sentence_transformers/evaluation/SequentialEvaluator.py b/sentence_transformers/evaluation/SequentialEvaluator.py index 6808530cc..2e2fb5ced 100644 --- a/sentence_transformers/evaluation/SequentialEvaluator.py +++ b/sentence_transformers/evaluation/SequentialEvaluator.py @@ -1,6 +1,6 @@ from sentence_transformers import SentenceTransformer from . 
diff --git a/sentence_transformers/evaluation/SequentialEvaluator.py b/sentence_transformers/evaluation/SequentialEvaluator.py
index 6808530cc..2e2fb5ced 100644
--- a/sentence_transformers/evaluation/SequentialEvaluator.py
+++ b/sentence_transformers/evaluation/SequentialEvaluator.py
@@ -1,6 +1,6 @@
 from sentence_transformers import SentenceTransformer
 from . import SentenceEvaluator
-from typing import Iterable
+from typing import Dict, Iterable


 class SequentialEvaluator(SentenceEvaluator):
@@ -12,12 +12,31 @@ class SequentialEvaluator(SentenceEvaluator):
     """

     def __init__(self, evaluators: Iterable[SentenceEvaluator], main_score_function=lambda scores: scores[-1]):
+        super().__init__()
         self.evaluators = evaluators
         self.main_score_function = main_score_function

-    def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float:
+    def __call__(
+        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+    ) -> Dict[str, float]:
+        evaluations = []
         scores = []
-        for evaluator in self.evaluators:
-            scores.append(evaluator(model, output_path, epoch, steps))
+        for evaluator_idx, evaluator in enumerate(self.evaluators):
+            evaluation = evaluator(model, output_path, epoch, steps)

-        return self.main_score_function(scores)
+            if not isinstance(evaluation, dict):
+                scores.append(evaluation)
+                evaluation = {f"evaluator_{evaluator_idx}": evaluation}
+            else:
+                # Check the evaluator (not the returned dict) for a `primary_metric`
+                if hasattr(evaluator, "primary_metric"):
+                    scores.append(evaluation[evaluator.primary_metric])
+                else:
+                    scores.append(evaluation[list(evaluation.keys())[0]])
+
+            evaluations.append(evaluation)
+
+        self.primary_metric = "sequential_score"
+        main_score = self.main_score_function(scores)
+        results = {key: value for evaluation in evaluations for key, value in evaluation.items()}
+        results["sequential_score"] = main_score
+        return results
diff --git a/sentence_transformers/evaluation/TranslationEvaluator.py b/sentence_transformers/evaluation/TranslationEvaluator.py
index acc5a887d..199d4cad1 100644
--- a/sentence_transformers/evaluation/TranslationEvaluator.py
+++ b/sentence_transformers/evaluation/TranslationEvaluator.py
@@ -6,7 +6,7 @@ import os
 import csv
 import numpy as np
-from typing import List, Optional
+from typing import Dict, List, Optional

 import torch

@@ -54,6 +54,7 @@ def __init__(
             The dimension to truncate sentence embeddings to. `None` uses the model's current
             truncation dimension. Defaults to None.
""" + super().__init__() self.source_sentences = source_sentences self.target_sentences = target_sentences self.name = name @@ -70,8 +71,11 @@ def __init__( self.csv_file = "translation_evaluation" + name + "_results.csv" self.csv_headers = ["epoch", "steps", "src2trg", "trg2src"] self.write_csv = write_csv + self.primary_metric = "mean_accuracy" - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -145,4 +149,11 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer.writerow([epoch, steps, acc_src2trg, acc_trg2src]) - return (acc_src2trg + acc_trg2src) / 2 + metrics = { + "src2trg_accuracy": acc_src2trg, + "trg2src_accuracy": acc_trg2src, + "mean_accuracy": (acc_src2trg + acc_trg2src) / 2, + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics diff --git a/sentence_transformers/evaluation/TripletEvaluator.py b/sentence_transformers/evaluation/TripletEvaluator.py index 17e76790e..da7719f97 100644 --- a/sentence_transformers/evaluation/TripletEvaluator.py +++ b/sentence_transformers/evaluation/TripletEvaluator.py @@ -5,7 +5,7 @@ import os import csv from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances -from typing import List, Optional +from typing import Dict, List, Optional from ..readers import InputExample @@ -42,6 +42,7 @@ def __init__( :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. Defaults to None. 
""" + super().__init__() self.anchors = anchors self.positives = positives self.negatives = negatives @@ -76,7 +77,9 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): negatives.append(example.texts[2]) return cls(anchors, positives, negatives, **kwargs) - def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + def __call__( + self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 + ) -> Dict[str, float]: if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -157,11 +160,17 @@ def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: i writer = csv.writer(f) writer.writerow([epoch, steps, accuracy_cos, accuracy_manhattan, accuracy_euclidean]) - if self.main_distance_function == SimilarityFunction.COSINE: - return accuracy_cos - if self.main_distance_function == SimilarityFunction.MANHATTAN: - return accuracy_manhattan - if self.main_distance_function == SimilarityFunction.EUCLIDEAN: - return accuracy_euclidean - - return max(accuracy_cos, accuracy_manhattan, accuracy_euclidean) + self.primary_metric = { + SimilarityFunction.COSINE: "accuracy_cosine", + SimilarityFunction.EUCLIDEAN: "accuracy_euclidean", + SimilarityFunction.MANHATTAN: "accuracy_manhattan", + }.get(self.main_distance_function, "accuracy_max") + metrics = { + "accuracy_cosine": accuracy_cos, + "accuracy_manhattan": accuracy_manhattan, + "accuracy_euclidean": accuracy_euclidean, + "accuracy_max": max(accuracy_cos, accuracy_manhattan, accuracy_euclidean), + } + metrics = self.prefix_name_to_metrics(metrics, self.name) + self.store_metrics_in_model_card_data(model, metrics) + return metrics diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py new file mode 100644 index 000000000..66e1b9130 --- /dev/null +++ b/sentence_transformers/fit_mixin.py @@ -0,0 +1,619 @@ +import json +import logging +import os +from pathlib import Path +import shutil +from typing import Any, List, Dict, Tuple, Iterable, Type, Callable, Optional, TYPE_CHECKING +import transformers +import torch +from torch import nn, Tensor +from torch.optim import Optimizer +from torch.utils.data import DataLoader +from tqdm.autonotebook import trange +from datasets import Dataset, DatasetDict +from transformers import TrainerCallback, TrainerState, TrainerControl +from sentence_transformers.datasets.SentenceLabelDataset import SentenceLabelDataset +from sentence_transformers.datasets.NoDuplicatesDataLoader import NoDuplicatesDataLoader +from sentence_transformers.training_args import ( + SentenceTransformerTrainingArguments, + MultiDatasetBatchSamplers, + BatchSamplers, +) + +from .evaluation import SentenceEvaluator +from .util import ( + batch_to_device, + fullname, +) +from .model_card_templates import ModelCardTemplate + +logger = logging.getLogger(__name__) + +if TYPE_CHECKING: + from sentence_transformers.SentenceTransformer import SentenceTransformer + from sentence_transformers.readers.InputExample import InputExample + + +class SaveModelCallback(TrainerCallback): + """A Callback to save the model to the `output_dir`. + + There are two cases: + 1. save_best_model is True and evaluator is defined: + We save on evaluate, but only if the new model is better than the currently saved one + according to the evaluator. + 2. If evaluator is not defined: + We save after the model has been trained. 
+    """
+
+    def __init__(self, output_dir: str, evaluator: Optional[SentenceEvaluator], save_best_model: bool) -> None:
+        super().__init__()
+        self.output_dir = output_dir
+        self.evaluator = evaluator
+        # TODO: ^ has to implement `greater_is_better` and `primary_metric`
+        self.save_best_model = save_best_model
+        self.best_metric = None
+
+    def is_better(self, new_metric: float) -> bool:
+        if getattr(self.evaluator, "greater_is_better", True):
+            return new_metric > self.best_metric
+        return new_metric < self.best_metric
+
+    def on_evaluate(
+        self,
+        args: SentenceTransformerTrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        metrics: Dict[str, Any],
+        model: "SentenceTransformer",
+        **kwargs,
+    ):
+        if self.evaluator is not None and self.save_best_model:
+            metric_key = getattr(self.evaluator, "primary_metric", "evaluator")
+            for key, value in metrics.items():
+                if key.endswith(metric_key):
+                    if self.best_metric is None or self.is_better(value):
+                        self.best_metric = value
+                        model.save(self.output_dir)
+
+    def on_train_end(
+        self,
+        args: SentenceTransformerTrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        model: "SentenceTransformer",
+        **kwargs,
+    ):
+        if self.evaluator is None:
+            model.save(self.output_dir)
+
+
+class EvaluatorCallback(TrainerCallback):
+    """The SentenceTransformer.fit method always ran the evaluator at the end of every epoch,
+    in addition to after every "evaluation_steps" training steps. This callback preserves that behavior.
+
+    The `.trainer` must be provided after the trainer has been created.
+    """
+
+    def __init__(self, evaluator: SentenceEvaluator) -> None:
+        super().__init__()
+        self.evaluator = evaluator
+        self.metric_key_prefix = "eval"
+        self.trainer = None
+
+    def on_epoch_end(
+        self,
+        args: SentenceTransformerTrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        model: "SentenceTransformer",
+        **kwargs,
+    ):
+        evaluator_metrics = self.evaluator(model, epoch=state.epoch)
+        if not isinstance(evaluator_metrics, dict):
+            evaluator_metrics = {"evaluator": evaluator_metrics}
+
+        # Prefix all keys with metric_key_prefix + '_'
+        for key in list(evaluator_metrics.keys()):
+            if not key.startswith(f"{self.metric_key_prefix}_"):
+                evaluator_metrics[f"{self.metric_key_prefix}_{key}"] = evaluator_metrics.pop(key)
+
+        if self.trainer is not None:
+            self.trainer.callback_handler.on_evaluate(args, state, control, metrics=evaluator_metrics)
+
+
+class OriginalCallback(TrainerCallback):
+    """A Callback to invoke the original callback function that was provided to SentenceTransformer.fit()
+
+    The provided callback function must have the following signature: `(score: float, epoch: int, steps: int) -> None`
+    """
+
+    def __init__(self, callback: Callable[[float, int, int], None], evaluator: SentenceEvaluator) -> None:
+        super().__init__()
+        self.callback = callback
+        self.evaluator = evaluator
+
+    def on_evaluate(
+        self,
+        args: transformers.TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        metrics: Dict[str, Any],
+        **kwargs,
+    ):
+        metric_key = getattr(self.evaluator, "primary_metric", "evaluator")
+        for key, value in metrics.items():
+            if key.endswith(metric_key):
+                return self.callback(value, state.epoch, state.global_step)
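+
+
+# A minimal illustration of the legacy callback contract that OriginalCallback
+# adapts; `my_callback`, `loader`, `loss` and `dev_evaluator` are hypothetical:
+#
+#     def my_callback(score: float, epoch: int, steps: int) -> None:
+#         print(f"epoch={epoch} steps={steps} score={score:.4f}")
+#
+#     model.fit(train_objectives=[(loader, loss)], evaluator=dev_evaluator, callback=my_callback)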
+
+
+class FitMixin:
+    """Mixin class for injecting the `fit` method into Sentence Transformers"""
+
+    def fit(
+        self,
+        train_objectives: Iterable[Tuple[DataLoader, nn.Module]],
+        evaluator: SentenceEvaluator = None,
+        epochs: int = 1,
+        steps_per_epoch=None,
+        scheduler: str = "WarmupLinear",
+        warmup_steps: int = 10000,
+        optimizer_class: Type[Optimizer] = torch.optim.AdamW,
+        optimizer_params: Dict[str, object] = {"lr": 2e-5},
+        weight_decay: float = 0.01,
+        evaluation_steps: int = 0,
+        output_path: str = None,
+        save_best_model: bool = True,
+        max_grad_norm: float = 1,
+        use_amp: bool = False,
+        callback: Callable[[float, int, int], None] = None,
+        show_progress_bar: bool = True,
+        checkpoint_path: str = None,
+        checkpoint_save_steps: int = 500,
+        checkpoint_save_total_limit: int = 0,
+    ):
+        """
+        Train the model with the given training objectives.
+        Each training objective is sampled in turn for one batch.
+        We sample only as many batches from each objective as there are in the smallest one
+        to make sure of equal training with each dataset.
+
+        :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning
+        :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disk.
+        :param epochs: Number of epochs for training
+        :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal to the DataLoader size from train_objectives.
+        :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
+        :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from 0 up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero.
+        :param optimizer_class: Optimizer
+        :param optimizer_params: Optimizer parameters
+        :param weight_decay: Weight decay for model parameters
+        :param evaluation_steps: If > 0, evaluate the model using the evaluator after every `evaluation_steps` training steps
+        :param output_path: Storage path for the model and evaluation files
+        :param save_best_model: If true, the best model (according to evaluator) is stored at output_path
+        :param max_grad_norm: Used for gradient clipping.
+        :param use_amp: Use Automatic Mixed Precision (AMP). Only for PyTorch >= 1.6.0
+        :param callback: Callback function that is invoked after each evaluation.
+            It must accept the following three parameters in this order:
+            `score`, `epoch`, `steps`
+        :param show_progress_bar: If True, output a tqdm progress bar
+        :param checkpoint_path: Folder to save checkpoints during training
+        :param checkpoint_save_steps: Will save a checkpoint after so many steps
+        :param checkpoint_save_total_limit: Total number of checkpoints to store
+        """
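+        # The legacy arguments below are mapped onto the new Trainer API: the
+        # dataloaders become a DatasetDict, the losses a per-dataset loss dict,
+        # and the remaining arguments are translated into
+        # SentenceTransformerTrainingArguments before a SentenceTransformerTrainer
+        # runs the actual training loop.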
+        # Delayed import to avoid the SentenceTransformers -> FitMixin -> SentenceTransformerTrainer -> SentenceTransformers circular import
+        from sentence_transformers.trainer import SentenceTransformerTrainer
+
+        data_loaders, loss_fns = zip(*train_objectives)
+
+        # Reset the dataloaders' collate functions to the identity, as we just want the raw InputExamples
+        def identity(batch):
+            return batch
+
+        for data_loader in data_loaders:
+            data_loader.collate_fn = identity
+
+        batch_size = 8
+        batch_sampler = BatchSamplers.BATCH_SAMPLER
+        # Convert the dataloaders into a DatasetDict
+        # TODO: This should be done in a more efficient way
+        train_dataset_dict = {}
+        for loader_idx, data_loader in enumerate(data_loaders, start=1):
+            if isinstance(data_loader, NoDuplicatesDataLoader):
+                batch_sampler = BatchSamplers.NO_DUPLICATES
+            elif hasattr(data_loader, "dataset") and isinstance(data_loader.dataset, SentenceLabelDataset):
+                batch_sampler = BatchSamplers.GROUP_BY_LABEL
+
+            batch_size = getattr(data_loader, "batch_size", batch_size)
+            texts = []
+            labels = []
+            for batch in data_loader:
+                batch_texts, batch_labels = zip(*[(example.texts, example.label) for example in batch])
+                texts += batch_texts
+                labels += batch_labels
+            dataset = Dataset.from_dict({f"sentence_{idx}": text for idx, text in enumerate(zip(*texts))})
+            # Add a label column, unless all labels are 0 (the default value for `label` in InputExample)
+            add_label_column = True
+            try:
+                if set(labels) == {0}:
+                    add_label_column = False
+            except TypeError:
+                pass
+            if add_label_column:
+                dataset = dataset.add_column("label", labels)
+            train_dataset_dict[f"_dataset_{loader_idx}"] = dataset
+
+        train_dataset_dict = DatasetDict(train_dataset_dict)
+
+        def _default_checkpoint_dir() -> str:
+            dir_name = "checkpoints/model"
+            idx = 1
+            while Path(dir_name).exists() and len(list(Path(dir_name).iterdir())) != 0:
+                dir_name = f"checkpoints/model_{idx}"
+                idx += 1
+            return dir_name
+
+        # Convert loss_fns into a dict with `_dataset_{idx}` keys
+        loss_fn_dict = {f"_dataset_{idx}": loss_fn for idx, loss_fn in enumerate(loss_fns, start=1)}
+        # TODO: Test model checkpointing & loading
+
+        # Use steps_per_epoch to set max_steps if possible
+        max_steps = -1
+        if steps_per_epoch is not None and steps_per_epoch > 0:
+            if epochs == 1:
+                max_steps = steps_per_epoch
+            else:
+                logger.warning(
+                    "Setting `steps_per_epoch` alongside `epochs` > 1 no longer works. "
+                    "We will train with the full datasets per epoch."
+ ) + steps_per_epoch = None + + args = SentenceTransformerTrainingArguments( + output_dir=checkpoint_path or _default_checkpoint_dir(), + batch_sampler=batch_sampler, + multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN, + per_device_train_batch_size=batch_size, + per_device_eval_batch_size=batch_size, + num_train_epochs=epochs, + max_steps=max_steps, + evaluation_strategy="steps" if evaluation_steps is not None and evaluation_steps > 0 else "no", + eval_steps=evaluation_steps, + # load_best_model_at_end=save_best_model, # <- TODO: Look into a good solution for save_best_model + max_grad_norm=max_grad_norm, + fp16=use_amp, + disable_tqdm=not show_progress_bar, + save_strategy="steps" if checkpoint_path is not None else "no", + save_steps=checkpoint_save_steps, + save_total_limit=checkpoint_save_total_limit, + ) + + if steps_per_epoch is None or steps_per_epoch == 0: + steps_per_epoch = min([len(train_dataset) // batch_size for train_dataset in train_dataset_dict.values()]) + num_train_steps = int(steps_per_epoch * epochs) + + # Prepare optimizer & scheduler + param_optimizer = list(self.named_parameters()) + + no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], + "weight_decay": weight_decay, + }, + {"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, + ] + + optimizer = optimizer_class(optimizer_grouped_parameters, **optimizer_params) + scheduler_obj = self._get_scheduler( + optimizer, scheduler=scheduler, warmup_steps=warmup_steps, t_total=num_train_steps + ) + + # Create callbacks + callbacks = [] + if evaluator is not None: + callbacks.append(EvaluatorCallback(evaluator)) + if callback is not None: + callbacks.append(OriginalCallback(callback, evaluator)) + + trainer = SentenceTransformerTrainer( + model=self, + args=args, + train_dataset=train_dataset_dict, + eval_dataset=None, + loss=loss_fn_dict, + evaluator=evaluator, + optimizers=(optimizer, scheduler_obj), + callbacks=callbacks, + ) + # Set the trainer on the EvaluatorCallback, required for logging the metrics + for callback in trainer.callback_handler.callbacks: + if isinstance(callback, EvaluatorCallback): + callback.trainer = trainer + + if output_path is not None: + trainer.add_callback(SaveModelCallback(output_path, evaluator, save_best_model)) + + trainer.train() + + @staticmethod + def _get_scheduler(optimizer, scheduler: str, warmup_steps: int, t_total: int): + """ + Returns the correct learning rate scheduler. 
Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
+        """
+        scheduler = scheduler.lower()
+        if scheduler == "constantlr":
+            return transformers.get_constant_schedule(optimizer)
+        elif scheduler == "warmupconstant":
+            return transformers.get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)
+        elif scheduler == "warmuplinear":
+            return transformers.get_linear_schedule_with_warmup(
+                optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
+            )
+        elif scheduler == "warmupcosine":
+            return transformers.get_cosine_schedule_with_warmup(
+                optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
+            )
+        elif scheduler == "warmupcosinewithhardrestarts":
+            return transformers.get_cosine_with_hard_restarts_schedule_with_warmup(
+                optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
+            )
+        else:
+            raise ValueError("Unknown scheduler {}".format(scheduler))
+
+    def smart_batching_collate(self, batch: List["InputExample"]) -> Tuple[List[Dict[str, Tensor]], Tensor]:
+        """
+        Transforms a batch from a SmartBatchingDataset to a batch of tensors for the model
+        Here, batch is a list of InputExample instances: [InputExample(...), ...]
+
+        :param batch:
+            a batch from a SmartBatchingDataset
+        :return:
+            a batch of tensors for the model
+        """
+        texts = [example.texts for example in batch]
+        sentence_features = [self.tokenize(sentence) for sentence in zip(*texts)]
+        labels = torch.tensor([example.label for example in batch])
+        return sentence_features, labels
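+
+    # For illustration: given toy InputExamples, the collate function above
+    # yields one tokenized feature dict per text column plus a label tensor:
+    #
+    #     batch = [InputExample(texts=["a cat sits", "a feline rests"], label=1.0),
+    #              InputExample(texts=["it rains", "the sun shines"], label=0.0)]
+    #     features, labels = model.smart_batching_collate(batch)
+    #     # features[0] tokenizes ("a cat sits", "it rains"); labels == tensor([1., 0.])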
+
+    """
+    Temporary methods that will be removed when this refactor is complete:
+    """
+
+    def old_fit(
+        self,
+        train_objectives: Iterable[Tuple[DataLoader, nn.Module]],
+        evaluator: SentenceEvaluator = None,
+        epochs: int = 1,
+        steps_per_epoch=None,
+        scheduler: str = "WarmupLinear",
+        warmup_steps: int = 10000,
+        optimizer_class: Type[Optimizer] = torch.optim.AdamW,
+        optimizer_params: Dict[str, object] = {"lr": 2e-5},
+        weight_decay: float = 0.01,
+        evaluation_steps: int = 0,
+        output_path: str = None,
+        save_best_model: bool = True,
+        max_grad_norm: float = 1,
+        use_amp: bool = False,
+        callback: Callable[[float, int, int], None] = None,
+        show_progress_bar: bool = True,
+        checkpoint_path: str = None,
+        checkpoint_save_steps: int = 500,
+        checkpoint_save_total_limit: int = 0,
+    ):
+        """
+        Train the model with the given training objectives.
+        Each training objective is sampled in turn for one batch.
+        We sample only as many batches from each objective as there are in the smallest one
+        to make sure of equal training with each dataset.
+
+        :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning
+        :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disk.
+        :param epochs: Number of epochs for training
+        :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal to the DataLoader size from train_objectives.
+        :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
+        :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from 0 up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero.
+        :param optimizer_class: Optimizer
+        :param optimizer_params: Optimizer parameters
+        :param weight_decay: Weight decay for model parameters
+        :param evaluation_steps: If > 0, evaluate the model using the evaluator after every `evaluation_steps` training steps
+        :param output_path: Storage path for the model and evaluation files
+        :param save_best_model: If true, the best model (according to evaluator) is stored at output_path
+        :param max_grad_norm: Used for gradient clipping.
+        :param use_amp: Use Automatic Mixed Precision (AMP). Only for PyTorch >= 1.6.0
+        :param callback: Callback function that is invoked after each evaluation.
+            It must accept the following three parameters in this order:
+            `score`, `epoch`, `steps`
+        :param show_progress_bar: If True, output a tqdm progress bar
+        :param checkpoint_path: Folder to save checkpoints during training
+        :param checkpoint_save_steps: Will save a checkpoint after so many steps
+        :param checkpoint_save_total_limit: Total number of checkpoints to store
+        """
+
+        ## Add info to model card
+        # info_loss_functions = "\n".join(["- {} with {} training examples".format(str(loss), len(dataloader)) for dataloader, loss in train_objectives])
+        info_loss_functions = []
+        for dataloader, loss in train_objectives:
+            info_loss_functions.extend(ModelCardTemplate.get_train_objective_info(dataloader, loss))
+        info_loss_functions = "\n\n".join([text for text in info_loss_functions])
+
+        info_fit_parameters = json.dumps(
+            {
+                "evaluator": fullname(evaluator),
+                "epochs": epochs,
+                "steps_per_epoch": steps_per_epoch,
+                "scheduler": scheduler,
+                "warmup_steps": warmup_steps,
+                "optimizer_class": str(optimizer_class),
+                "optimizer_params": optimizer_params,
+                "weight_decay": weight_decay,
+                "evaluation_steps": evaluation_steps,
+                "max_grad_norm": max_grad_norm,
+            },
+            indent=4,
+            sort_keys=True,
+        )
+        self._model_card_text = None
+        self._model_card_vars["{TRAINING_SECTION}"] = ModelCardTemplate.__TRAINING_SECTION__.replace(
+            "{LOSS_FUNCTIONS}", info_loss_functions
+        ).replace("{FIT_PARAMETERS}", info_fit_parameters)
+
+        if use_amp:
+            from torch.cuda.amp import autocast
+
+            scaler = torch.cuda.amp.GradScaler()
+
+        self.to(self.device)
+
+        dataloaders = [dataloader for dataloader, _ in train_objectives]
+
+        # Use smart batching
+        for dataloader in dataloaders:
+            dataloader.collate_fn = self.smart_batching_collate
+
+        loss_models = [loss for _, loss in train_objectives]
+        for loss_model in loss_models:
+            loss_model.to(self.device)
+
+        self.best_score = -9999999
+
+        if steps_per_epoch is None or steps_per_epoch == 0:
+            steps_per_epoch = min([len(dataloader) for dataloader in dataloaders])
+
+        num_train_steps = int(steps_per_epoch * epochs)
+
+        # Prepare optimizers
+        optimizers = []
+        schedulers = []
+        for loss_model in loss_models:
+            param_optimizer = list(loss_model.named_parameters())
+
+            no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
+            optimizer_grouped_parameters = [
+                {
+                    "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
+                    "weight_decay": weight_decay,
+                },
+                {"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
+            ]
+
+            optimizer = optimizer_class(optimizer_grouped_parameters, **optimizer_params)
+            scheduler_obj = self._get_scheduler(
+                optimizer, scheduler=scheduler, warmup_steps=warmup_steps, t_total=num_train_steps
+            )
+
+            optimizers.append(optimizer)
+            schedulers.append(scheduler_obj)
+
+        global_step = 0
+        data_iterators = [iter(dataloader) for dataloader in dataloaders]
+        
num_train_objectives = len(train_objectives) + + skip_scheduler = False + for epoch in trange(epochs, desc="Epoch", disable=not show_progress_bar): + training_steps = 0 + + for loss_model in loss_models: + loss_model.zero_grad() + loss_model.train() + + for _ in trange(steps_per_epoch, desc="Iteration", smoothing=0.05, disable=not show_progress_bar): + for train_idx in range(num_train_objectives): + loss_model = loss_models[train_idx] + optimizer = optimizers[train_idx] + scheduler = schedulers[train_idx] + data_iterator = data_iterators[train_idx] + + try: + data = next(data_iterator) + except StopIteration: + data_iterator = iter(dataloaders[train_idx]) + data_iterators[train_idx] = data_iterator + data = next(data_iterator) + + features, labels = data + labels = labels.to(self.device) + features = list(map(lambda batch: batch_to_device(batch, self.device), features)) + + if use_amp: + with autocast(): + loss_value = loss_model(features, labels) + + scale_before_step = scaler.get_scale() + scaler.scale(loss_value).backward() + scaler.unscale_(optimizer) + torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm) + scaler.step(optimizer) + scaler.update() + + skip_scheduler = scaler.get_scale() != scale_before_step + else: + loss_value = loss_model(features, labels) + loss_value.backward() + torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm) + optimizer.step() + + optimizer.zero_grad() + + if not skip_scheduler: + scheduler.step() + + training_steps += 1 + global_step += 1 + + if evaluation_steps > 0 and training_steps % evaluation_steps == 0: + self._eval_during_training( + evaluator, output_path, save_best_model, epoch, training_steps, callback + ) + + for loss_model in loss_models: + loss_model.zero_grad() + loss_model.train() + + if ( + checkpoint_path is not None + and checkpoint_save_steps is not None + and checkpoint_save_steps > 0 + and global_step % checkpoint_save_steps == 0 + ): + self._save_checkpoint(checkpoint_path, checkpoint_save_total_limit, global_step) + + self._eval_during_training(evaluator, output_path, save_best_model, epoch, -1, callback) + + if evaluator is None and output_path is not None: # No evaluator, but output path: save final model version + self.save(output_path) + + if checkpoint_path is not None: + self._save_checkpoint(checkpoint_path, checkpoint_save_total_limit, global_step) + + def _eval_during_training(self, evaluator, output_path, save_best_model, epoch, steps, callback): + """Runs evaluation during the training""" + eval_path = output_path + if output_path is not None: + os.makedirs(output_path, exist_ok=True) + eval_path = os.path.join(output_path, "eval") + os.makedirs(eval_path, exist_ok=True) + + if evaluator is not None: + score = evaluator(self, output_path=eval_path, epoch=epoch, steps=steps) + if callback is not None: + callback(score, epoch, steps) + if score > self.best_score: + self.best_score = score + if save_best_model: + self.save(output_path) + + def _save_checkpoint(self, checkpoint_path, checkpoint_save_total_limit, step): + # Store new checkpoint + self.save(os.path.join(checkpoint_path, str(step))) + + # Delete old checkpoints + if checkpoint_save_total_limit is not None and checkpoint_save_total_limit > 0: + old_checkpoints = [] + for subdir in os.listdir(checkpoint_path): + if subdir.isdigit(): + old_checkpoints.append({"step": int(subdir), "path": os.path.join(checkpoint_path, subdir)}) + + if len(old_checkpoints) > checkpoint_save_total_limit: + old_checkpoints = sorted(old_checkpoints, 
key=lambda x: x["step"]) + shutil.rmtree(old_checkpoints[0]["path"]) diff --git a/sentence_transformers/losses/AdaptiveLayerLoss.py b/sentence_transformers/losses/AdaptiveLayerLoss.py index f63c6b6d5..6337b95f3 100644 --- a/sentence_transformers/losses/AdaptiveLayerLoss.py +++ b/sentence_transformers/losses/AdaptiveLayerLoss.py @@ -230,3 +230,16 @@ def get_config_dict(self) -> Dict[str, Any]: "kl_div_weight": self.kl_div_weight, "kl_temperature": self.kl_temperature, } + + @property + def citation(self) -> str: + return """ +@misc{li20242d, + title={2D Matryoshka Sentence Embeddings}, + author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li}, + year={2024}, + eprint={2402.14776}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +""" diff --git a/sentence_transformers/losses/AnglELoss.py b/sentence_transformers/losses/AnglELoss.py index 00780444b..a506a1317 100644 --- a/sentence_transformers/losses/AnglELoss.py +++ b/sentence_transformers/losses/AnglELoss.py @@ -54,3 +54,16 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0): train_loss = losses.AnglELoss(model=model) """ super().__init__(model, scale, similarity_fct=util.pairwise_angle_sim) + + @property + def citation(self) -> str: + return """ +@misc{li2023angleoptimized, + title={AnglE-optimized Text Embeddings}, + author={Xianming Li and Jing Li}, + year={2023}, + eprint={2309.12871}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +""" diff --git a/sentence_transformers/losses/BatchAllTripletLoss.py b/sentence_transformers/losses/BatchAllTripletLoss.py index a7364b59d..a0fb1055c 100644 --- a/sentence_transformers/losses/BatchAllTripletLoss.py +++ b/sentence_transformers/losses/BatchAllTripletLoss.py @@ -117,3 +117,16 @@ def batch_all_triplet_loss(self, labels, embeddings): triplet_loss = triplet_loss.sum() / (num_positive_triplets + 1e-16) return triplet_loss + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py index e3f1b0262..a70f419e1 100644 --- a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py +++ b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py @@ -120,3 +120,16 @@ def batch_hard_triplet_soft_margin_loss(self, labels: Tensor, embeddings: Tensor triplet_loss = tl.mean() return triplet_loss + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/losses/BatchHardTripletLoss.py b/sentence_transformers/losses/BatchHardTripletLoss.py index 6668fe70a..ab023ec3d 100644 --- a/sentence_transformers/losses/BatchHardTripletLoss.py +++ b/sentence_transformers/losses/BatchHardTripletLoss.py @@ -238,3 +238,16 @@ def get_anchor_negative_triplet_mask(labels): # Uses broadcasting where the 1st argument has shape (1, batch_size) and the 2nd (batch_size, 1) return ~(labels.unsqueeze(0) == labels.unsqueeze(1)) + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + 
title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/losses/BatchSemiHardTripletLoss.py b/sentence_transformers/losses/BatchSemiHardTripletLoss.py index 15d71add8..a54a6bc26 100644 --- a/sentence_transformers/losses/BatchSemiHardTripletLoss.py +++ b/sentence_transformers/losses/BatchSemiHardTripletLoss.py @@ -154,3 +154,16 @@ def _masked_maximum(data, mask, dim=1): masked_maximums += axis_minimums return masked_maximums + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py index 72868faea..12a0cb931 100644 --- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py @@ -232,8 +232,9 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor reps.append(reps_mbs) self.random_states.append(random_state_mbs) - # Step (2): Calculate the loss, backward up to the embeddings and cache the gradients wrt. to the embeddings - loss = self.calculate_loss_and_cache_gradients(reps) + with torch.set_grad_enabled(True): + # Step (2): Calculate the loss, backward up to the embeddings and cache the gradients wrt. to the embeddings + loss = self.calculate_loss_and_cache_gradients(reps) # Step (3): A 2nd embedding step with gradients/computation graphs and connect the cached gradients into the backward chain loss.register_hook(partial(_backward_hook, sentence_features=sentence_features, loss_obj=self)) @@ -241,3 +242,16 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor def get_config_dict(self): return {"scale": self.scale, "similarity_fct": self.similarity_fct.__name__} + + @property + def citation(self) -> str: + return """ +@misc{gao2021scaling, + title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup}, + author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan}, + year={2021}, + eprint={2101.06983}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} +""" diff --git a/sentence_transformers/losses/CoSENTLoss.py b/sentence_transformers/losses/CoSENTLoss.py index b88c3a353..e937d7ef9 100644 --- a/sentence_transformers/losses/CoSENTLoss.py +++ b/sentence_transformers/losses/CoSENTLoss.py @@ -83,3 +83,15 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor def get_config_dict(self): return {"scale": self.scale, "similarity_fct": self.similarity_fct.__name__} + + @property + def citation(self) -> str: + return """ +@online{kexuefm-8847, + title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT}, + author={Su Jianlin}, + year={2022}, + month={Jan}, + url={https://kexue.fm/archives/8847}, +} +""" diff --git a/sentence_transformers/losses/ContrastiveLoss.py b/sentence_transformers/losses/ContrastiveLoss.py index 13be27fef..55f5ad993 100644 --- a/sentence_transformers/losses/ContrastiveLoss.py +++ b/sentence_transformers/losses/ContrastiveLoss.py @@ -95,3 +95,18 @@ def forward(self, 
sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor labels.float() * distances.pow(2) + (1 - labels).float() * F.relu(self.margin - distances).pow(2) ) return losses.mean() if self.size_average else losses.sum() + + @property + def citation(self) -> str: + return """ +@inproceedings{hadsell2006dimensionality, + author={Hadsell, R. and Chopra, S. and LeCun, Y.}, + booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)}, + title={Dimensionality Reduction by Learning an Invariant Mapping}, + year={2006}, + volume={2}, + number={}, + pages={1735-1742}, + doi={10.1109/CVPR.2006.100} +} +""" diff --git a/sentence_transformers/losses/ContrastiveTensionLoss.py b/sentence_transformers/losses/ContrastiveTensionLoss.py index 828b72406..85af67fc3 100644 --- a/sentence_transformers/losses/ContrastiveTensionLoss.py +++ b/sentence_transformers/losses/ContrastiveTensionLoss.py @@ -86,6 +86,18 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor loss = self.criterion(sim_scores, labels.type_as(sim_scores)) return loss + @property + def citation(self) -> str: + return """ +@inproceedings{carlsson2021semantic, + title={Semantic Re-tuning with Contrastive Tension}, + author={Fredrik Carlsson and Amaru Cuba Gyllensten and Evangelia Gogoulou and Erik Ylip{\"a}{\"a} Hellqvist and Magnus Sahlgren}, + booktitle={International Conference on Learning Representations}, + year={2021}, + url={https://openreview.net/forum?id=Ov_sMNau-PF} +} +""" + class ContrastiveTensionLossInBatchNegatives(nn.Module): def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_fct=util.cos_sim): @@ -161,6 +173,18 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor labels = torch.tensor(range(len(scores)), dtype=torch.long, device=scores.device) return (self.cross_entropy_loss(scores, labels) + self.cross_entropy_loss(scores.t(), labels)) / 2 + @property + def citation(self) -> str: + return """ +@inproceedings{carlsson2021semantic, + title={Semantic Re-tuning with Contrastive Tension}, + author={Fredrik Carlsson and Amaru Cuba Gyllensten and Evangelia Gogoulou and Erik Ylip{\"a}{\"a} Hellqvist and Magnus Sahlgren}, + booktitle={International Conference on Learning Representations}, + year={2021}, + url={https://openreview.net/forum?id=Ov_sMNau-PF} +} +""" + ################# CT Data Loader ################# # For CT, we need batches in a specific format diff --git a/sentence_transformers/losses/CosineSimilarityLoss.py b/sentence_transformers/losses/CosineSimilarityLoss.py index 7920fc93b..46b075b38 100644 --- a/sentence_transformers/losses/CosineSimilarityLoss.py +++ b/sentence_transformers/losses/CosineSimilarityLoss.py @@ -1,6 +1,8 @@ import torch from torch import nn, Tensor -from typing import Iterable, Dict +from typing import Any, Iterable, Dict + +from sentence_transformers.util import fullname from ..SentenceTransformer import SentenceTransformer @@ -62,4 +64,7 @@ def __init__(self, model: SentenceTransformer, loss_fct=nn.MSELoss(), cos_score_ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor): embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features] output = self.cos_score_transformation(torch.cosine_similarity(embeddings[0], embeddings[1])) - return self.loss_fct(output, labels.view(-1)) + return self.loss_fct(output, labels.float().view(-1)) + + def get_config_dict(self) -> Dict[str, Any]: + return 
{"loss_fct": fullname(self.loss_fct)} diff --git a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py index 8cdb607df..b29e70f68 100644 --- a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py +++ b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py @@ -158,3 +158,19 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor ce_loss_fct = nn.CrossEntropyLoss(ignore_index=self.tokenizer_decoder.pad_token_id) loss = ce_loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), label_ids.reshape(-1)) return loss + + @property + def citation(self) -> str: + return """ +@inproceedings{wang-2021-TSDAE, + title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoderfor Unsupervised Sentence Embedding Learning", + author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna", + booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021", + month = nov, + year = "2021", + address = "Punta Cana, Dominican Republic", + publisher = "Association for Computational Linguistics", + pages = "671--688", + url = "https://arxiv.org/abs/2104.06979", +} +""" diff --git a/sentence_transformers/losses/GISTEmbedLoss.py b/sentence_transformers/losses/GISTEmbedLoss.py index 1fdc753ef..ff8d3d288 100644 --- a/sentence_transformers/losses/GISTEmbedLoss.py +++ b/sentence_transformers/losses/GISTEmbedLoss.py @@ -152,3 +152,16 @@ def get_config_dict(self) -> Dict[str, Any]: "guide": self.guide, "temperature": self.temperature, } + + @property + def citation(self) -> str: + return """ +@misc{solatorio2024gistembed, + title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning}, + author={Aivin V. Solatorio}, + year={2024}, + eprint={2402.16829}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} +""" diff --git a/sentence_transformers/losses/MSELoss.py b/sentence_transformers/losses/MSELoss.py index acbc90c03..89db913bd 100644 --- a/sentence_transformers/losses/MSELoss.py +++ b/sentence_transformers/losses/MSELoss.py @@ -63,3 +63,17 @@ def __init__(self, model): def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor): rep = self.model(sentence_features[0])["sentence_embedding"] return self.loss_fct(rep, labels) + + @property + def citation(self) -> str: + return """ +@inproceedings{reimers-2020-multilingual-sentence-bert, + title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation", + author = "Reimers, Nils and Gurevych, Iryna", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing", + month = "11", + year = "2020", + publisher = "Association for Computational Linguistics", + url = "https://arxiv.org/abs/2004.09813", +} +""" diff --git a/sentence_transformers/losses/MarginMSELoss.py b/sentence_transformers/losses/MarginMSELoss.py index 063a36e65..26e202fe3 100644 --- a/sentence_transformers/losses/MarginMSELoss.py +++ b/sentence_transformers/losses/MarginMSELoss.py @@ -83,3 +83,16 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor margin_pred = scores_pos - scores_neg return self.loss_fct(margin_pred, labels) + + @property + def citation(self) -> str: + return """ +@misc{hofstätter2021improving, + title={Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation}, + author={Sebastian Hofstätter and Sophia Althammer and Michael Schröder and Mete Sertkan and Allan Hanbury}, + year={2021}, + 
eprint={2010.02666}, + archivePrefix={arXiv}, + primaryClass={cs.IR} +} +""" diff --git a/sentence_transformers/losses/Matryoshka2dLoss.py b/sentence_transformers/losses/Matryoshka2dLoss.py index 7d0e8fd5e..da6f1512b 100644 --- a/sentence_transformers/losses/Matryoshka2dLoss.py +++ b/sentence_transformers/losses/Matryoshka2dLoss.py @@ -111,3 +111,16 @@ def get_config_dict(self) -> Dict[str, Any]: **super().get_config_dict(), **self.loss.get_config_dict(), } + + @property + def citation(self) -> str: + return """ +@misc{li20242d, + title={2D Matryoshka Sentence Embeddings}, + author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li}, + year={2024}, + eprint={2402.14776}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +""" diff --git a/sentence_transformers/losses/MatryoshkaLoss.py b/sentence_transformers/losses/MatryoshkaLoss.py index 3d25027a1..0e9ab32a6 100644 --- a/sentence_transformers/losses/MatryoshkaLoss.py +++ b/sentence_transformers/losses/MatryoshkaLoss.py @@ -142,3 +142,16 @@ def get_config_dict(self) -> Dict[str, Any]: "matryoshka_weights": self.matryoshka_weights, "n_dims_per_step": self.n_dims_per_step, } + + @property + def citation(self) -> str: + return """ +@misc{kusupati2024matryoshka, + title={Matryoshka Representation Learning}, + author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi}, + year={2024}, + eprint={2205.13147}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} +""" diff --git a/sentence_transformers/losses/MegaBatchMarginLoss.py b/sentence_transformers/losses/MegaBatchMarginLoss.py index 3c2b983ef..b63d05afd 100644 --- a/sentence_transformers/losses/MegaBatchMarginLoss.py +++ b/sentence_transformers/losses/MegaBatchMarginLoss.py @@ -141,3 +141,21 @@ def forward_non_mini_batched(self, sentence_features: Iterable[Dict[str, Tensor] negatives_max, _ = torch.max(negative_scores, dim=1) losses = F.relu(self.positive_margin - positive_scores) + F.relu(negatives_max - self.negative_margin) return losses.mean() + + @property + def citation(self) -> str: + return """ +@inproceedings{wieting-gimpel-2018-paranmt, + title = "{P}ara{NMT}-50{M}: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations", + author = "Wieting, John and Gimpel, Kevin", + editor = "Gurevych, Iryna and Miyao, Yusuke", + booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", + month = jul, + year = "2018", + address = "Melbourne, Australia", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/P18-1042", + doi = "10.18653/v1/P18-1042", + pages = "451--462", +} +""" diff --git a/sentence_transformers/losses/MultipleNegativesRankingLoss.py b/sentence_transformers/losses/MultipleNegativesRankingLoss.py index 0fd191b14..5416b3d70 100644 --- a/sentence_transformers/losses/MultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/MultipleNegativesRankingLoss.py @@ -94,3 +94,16 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor def get_config_dict(self): return {"scale": self.scale, "similarity_fct": self.similarity_fct.__name__} + + @property + def citation(self) -> str: + return """ +@misc{henderson2017efficient, + title={Efficient Natural Language Response Suggestion for Smart Reply}, + author={Matthew Henderson and Rami Al-Rfou and Brian 
Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, + year={2017}, + eprint={1705.00652}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +""" diff --git a/sentence_transformers/losses/SoftmaxLoss.py b/sentence_transformers/losses/SoftmaxLoss.py index 201af5573..c8da160e6 100644 --- a/sentence_transformers/losses/SoftmaxLoss.py +++ b/sentence_transformers/losses/SoftmaxLoss.py @@ -119,3 +119,17 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor return loss else: return reps, output + + @property + def citation(self) -> str: + return """ +@inproceedings{reimers-2019-sentence-bert, + title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", + author = "Reimers, Nils and Gurevych, Iryna", + booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", + month = "11", + year = "2019", + publisher = "Association for Computational Linguistics", + url = "https://arxiv.org/abs/1908.10084", +} +""" diff --git a/sentence_transformers/losses/TripletLoss.py b/sentence_transformers/losses/TripletLoss.py index ee25cebf8..1768bc2b7 100644 --- a/sentence_transformers/losses/TripletLoss.py +++ b/sentence_transformers/losses/TripletLoss.py @@ -72,15 +72,6 @@ def __init__( self.distance_metric = distance_metric self.triplet_margin = triplet_margin - def get_config_dict(self): - distance_metric_name = self.distance_metric.__name__ - for name, value in vars(TripletDistanceMetric).items(): - if value == self.distance_metric: - distance_metric_name = "TripletDistanceMetric.{}".format(name) - break - - return {"distance_metric": distance_metric_name, "triplet_margin": self.triplet_margin} - def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor): reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features] @@ -90,3 +81,25 @@ def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor losses = F.relu(distance_pos - distance_neg + self.triplet_margin) return losses.mean() + + def get_config_dict(self): + distance_metric_name = self.distance_metric.__name__ + for name, value in vars(TripletDistanceMetric).items(): + if value == self.distance_metric: + distance_metric_name = "TripletDistanceMetric.{}".format(name) + break + + return {"distance_metric": distance_metric_name, "triplet_margin": self.triplet_margin} + + @property + def citation(self) -> str: + return """ +@misc{hermans2017defense, + title={In Defense of the Triplet Loss for Person Re-Identification}, + author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, + year={2017}, + eprint={1703.07737}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +""" diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py new file mode 100644 index 000000000..633517574 --- /dev/null +++ b/sentence_transformers/model_card.py @@ -0,0 +1,920 @@ +from copy import copy +import json +import random +from collections import Counter, defaultdict +from dataclasses import dataclass, field, fields +from pathlib import Path +from platform import python_version +import re +from textwrap import indent +from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Union +import logging + +import accelerate +import datasets +import tokenizers +import torch +from torch import nn +import transformers +from datasets import Dataset, DatasetDict +from huggingface_hub import CardData, ModelCard, dataset_info 
as get_dataset_info, model_info as get_model_info +from huggingface_hub.repocard_data import eval_results_to_model_index, EvalResult +from huggingface_hub.utils import yaml_dump +from transformers import TrainerCallback +from transformers.integrations import CodeCarbonCallback +from transformers.modelcard import make_markdown_table +from transformers.trainer_callback import TrainerControl, TrainerState +from tqdm.autonotebook import tqdm + +from sentence_transformers import __version__ as sentence_transformers_version +from sentence_transformers.evaluation import SequentialEvaluator +from sentence_transformers.models import Transformer +from sentence_transformers.util import cos_sim, fullname +from sentence_transformers.training_args import SentenceTransformerTrainingArguments + + +logger = logging.getLogger(__name__) + +if TYPE_CHECKING: + from sentence_transformers.evaluation import SentenceEvaluator + from sentence_transformers.SentenceTransformer import SentenceTransformer + from sentence_transformers.trainer import SentenceTransformerTrainer + + +class ModelCardCallback(TrainerCallback): + def __init__(self, trainer: "SentenceTransformerTrainer", default_args_dict: Dict[str, Any]) -> None: + super().__init__() + self.trainer = trainer + self.default_args_dict = default_args_dict + + callbacks = [ + callback + for callback in self.trainer.callback_handler.callbacks + if isinstance(callback, CodeCarbonCallback) + ] + if callbacks: + trainer.model.model_card_data.code_carbon_callback = callbacks[0] + + trainer.model.model_card_data.trainer = trainer + + def on_init_end( + self, + args: SentenceTransformerTrainingArguments, + state: TrainerState, + control: TrainerControl, + model: "SentenceTransformer", + **kwargs, + ): + from sentence_transformers.losses import AdaptiveLayerLoss, Matryoshka2dLoss, MatryoshkaLoss + + # Try to infer the dataset "name", "id" and "revision" from the dataset cache files + if self.trainer.train_dataset: + model.model_card_data.train_datasets = model.model_card_data.extract_dataset_metadata( + self.trainer.train_dataset, model.model_card_data.train_datasets, "train" + ) + + if self.trainer.eval_dataset: + model.model_card_data.eval_datasets = model.model_card_data.extract_dataset_metadata( + self.trainer.eval_dataset, model.model_card_data.eval_datasets, "eval" + ) + + if isinstance(self.trainer.loss, dict): + losses = list(self.trainer.loss.values()) + else: + losses = [self.trainer.loss] + # Some losses are known to use other losses internally, e.g. 
MatryoshkaLoss, AdaptiveLayerLoss and Matryoshka2dLoss + # So, verify for `loss` attributes in the losses + loss_idx = 0 + while loss_idx < len(losses): + loss = losses[loss_idx] + if ( + isinstance(loss, (MatryoshkaLoss, AdaptiveLayerLoss, Matryoshka2dLoss)) + and hasattr(loss, "loss") + and loss.loss not in losses + ): + losses.append(loss.loss) + loss_idx += 1 + + model.model_card_data.set_losses(losses) + + def on_train_begin( + self, + args: SentenceTransformerTrainingArguments, + state: TrainerState, + control: TrainerControl, + model: "SentenceTransformer", + **kwargs, + ) -> None: + # model.model_card_data.hyperparameters = extract_hyperparameters_from_trainer(self.trainer) + ignore_keys = { + "output_dir", + "logging_dir", + "logging_strategy", + "logging_first_step", + "logging_steps", + "evaluation_strategy", + "eval_steps", + "eval_delay", + "save_strategy", + "save_steps", + "save_total_limit", + "metric_for_best_model", + "greater_is_better", + "report_to", + "samples_per_label", + "show_progress_bar", + "do_train", + "do_eval", + "do_test", + "run_name", + "hub_token", + "push_to_hub_token", + } + args_dict = args.to_dict() + model.model_card_data.all_hyperparameters = { + key: value for key, value in args_dict.items() if key not in ignore_keys + } + model.model_card_data.non_default_hyperparameters = { + key: value + for key, value in args_dict.items() + if key not in ignore_keys and key in self.default_args_dict and value != self.default_args_dict[key] + } + + def on_evaluate( + self, + args: SentenceTransformerTrainingArguments, + state: TrainerState, + control: TrainerControl, + model: "SentenceTransformer", + metrics: Dict[str, float], + **kwargs, + ) -> None: + loss_dict = {" ".join(key.split("_")[1:]): metrics[key] for key in metrics if key.endswith("_loss")} + if ( + model.model_card_data.training_logs + and model.model_card_data.training_logs[-1]["Step"] == state.global_step + ): + model.model_card_data.training_logs[-1].update(loss_dict) + else: + model.model_card_data.training_logs.append( + { + "Epoch": state.epoch, + "Step": state.global_step, + **loss_dict, + } + ) + + def on_log( + self, + args: SentenceTransformerTrainingArguments, + state: TrainerState, + control: TrainerControl, + model: "SentenceTransformer", + logs: Dict[str, float], + **kwargs, + ): + keys = {"loss"} & set(logs) + if keys: + if ( + model.model_card_data.training_logs + and model.model_card_data.training_logs[-1]["Step"] == state.global_step + ): + model.model_card_data.training_logs[-1]["Training Loss"] = logs[keys.pop()] + else: + model.model_card_data.training_logs.append( + { + "Epoch": state.epoch, + "Step": state.global_step, + "Training Loss": logs[keys.pop()], + } + ) + + +YAML_FIELDS = [ + "language", + "license", + "library_name", + "tags", + "datasets", + "metrics", + "pipeline_tag", + "widget", + "model-index", + "co2_eq_emissions", + "base_model", +] +IGNORED_FIELDS = ["model", "trainer", "eval_results_dict"] + + +@dataclass +class SentenceTransformerModelCardData(CardData): + """A dataclass storing data used in the model card. + + Args: + language (`Optional[Union[str, List[str]]]`): The model language, either a string or a list, + e.g. "en" or ["en", "de", "nl"] + license (`Optional[str]`): The license of the model, e.g. "apache-2.0", "mit", + or "cc-by-nc-sa-4.0" + model_name (`Optional[str]`): The pretty name of the model, e.g. "SentenceTransformer based on microsoft/mpnet-base". + model_id (`Optional[str]`): The model ID when pushing the model to the Hub, + e.g. 
"tomaarsen/sbert-mpnet-base-allnli". + train_datasets (`List[Dict[str, str]]`): A list of the names and/or Hugging Face dataset IDs of the training datasets. + e.g. [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}, {"name": "STSB"}] + eval_datasets (`List[Dict[str, str]]`): A list of the names and/or Hugging Face dataset IDs of the evaluation datasets. + e.g. [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"id": "mteb/stsbenchmark-sts"}] + task_name (`str`): The human-readable task the model is trained on, + e.g. "semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more". + tags (`Optional[List[str]]`): A list of tags for the model, + e.g. ["sentence-transformers", "sentence-similarity", "feature-extraction"]. + generate_widget_examples (`bool`): Whether to generate widget examples on every model save. + + + + Install [``codecarbon``](https://github.com/mlco2/codecarbon) to automatically track carbon emission usage and + include it in your model cards. + + + + Example:: + + >>> model = SentenceTransformer( + ... "microsoft/mpnet-base", + ... model_card_data=SentenceTransformerModelCardData( + ... model_id="tomaarsen/sbert-mpnet-base-allnli", + ... train_datasets=[{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}], + ... eval_datasets=[{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}], + ... license="apache-2.0", + ... language="en", + ... ), + ... ) + """ + + # Potentially provided by the user + language: Optional[Union[str, List[str]]] = field(default_factory=list) + license: Optional[str] = None + model_name: Optional[str] = None + model_id: Optional[str] = None + train_datasets: List[Dict[str, str]] = field(default_factory=list) + eval_datasets: List[Dict[str, str]] = field(default_factory=list) + task_name: str = ( + "semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more" + ) + tags: Optional[List[str]] = field( + default_factory=lambda: [ + "sentence-transformers", + "sentence-similarity", + "feature-extraction", + ] + ) + generate_widget_examples: bool = True + + # Automatically filled by `ModelCardCallback` and the Trainer directly + base_model: Optional[str] = field(default=None, init=False) + base_model_revision: Optional[str] = field(default=None, init=False) + non_default_hyperparameters: Dict[str, Any] = field(default_factory=dict, init=False) + all_hyperparameters: Dict[str, Any] = field(default_factory=dict, init=False) + eval_results_dict: Optional[Dict["SentenceEvaluator", Dict[str, Any]]] = field(default_factory=dict, init=False) + training_logs: List[Dict[str, float]] = field(default_factory=list, init=False) + widget: List[Dict[str, str]] = field(default_factory=list, init=False) + predict_example: Optional[str] = field(default=None, init=False) + label_example_list: List[Dict[str, str]] = field(default_factory=list, init=False) + code_carbon_callback: Optional[CodeCarbonCallback] = field(default=None, init=False) + citations: Dict[str, str] = field(default_factory=dict, init=False) + best_model_step: Optional[int] = field(default=None, init=False) + trainer: Optional["SentenceTransformerTrainer"] = field(default=None, init=False, repr=False) + + # Utility fields + first_save: bool = field(default=True, init=False) + widget_step: int = field(default=-1, init=False) + + # Computed once, always unchanged + pipeline_tag: str = 
field(default="sentence-similarity", init=False) + library_name: str = field(default="sentence-transformers", init=False) + version: Dict[str, str] = field( + default_factory=lambda: { + "python": python_version(), + "sentence_transformers": sentence_transformers_version, + "transformers": transformers.__version__, + "torch": torch.__version__, + "accelerate": accelerate.__version__, + "datasets": datasets.__version__, + "tokenizers": tokenizers.__version__, + }, + init=False, + ) + + # Passed via `register_model` only + model: Optional["SentenceTransformer"] = field(default=None, init=False, repr=False) + + def __post_init__(self): + # We don't want to save "ignore_metadata_errors" in our Model Card + infer_languages = not self.language + if isinstance(self.language, str): + self.language = [self.language] + + self.train_datasets = self.validate_datasets(self.train_datasets, infer_languages=infer_languages) + self.eval_datasets = self.validate_datasets(self.eval_datasets, infer_languages=infer_languages) + + if self.model_id and self.model_id.count("/") != 1: + logger.warning( + f"The provided {self.model_id!r} model ID should include the organization or user," + ' such as "tomaarsen/mpnet-base-nli-matryoshka". Setting `model_id` to None.' + ) + self.model_id = None + + def validate_datasets(self, dataset_list, infer_languages: bool = True) -> None: + output_dataset_list = [] + for dataset in dataset_list: + if "name" not in dataset: + if "id" in dataset: + dataset["name"] = dataset["id"] + + if "id" in dataset: + # Try to determine the language from the dataset on the Hub + try: + info = get_dataset_info(dataset["id"]) + except Exception: + logger.warning( + f"The dataset `id` {dataset['id']!r} does not exist on the Hub. Setting the `id` to None." 
+ ) + del dataset["id"] + else: + # TODO: Perhaps we can try to infer the dataset name from the dataset card + if info.cardData and infer_languages and "language" in info.cardData: + dataset_language = info.cardData.get("language") + if isinstance(dataset_language, str): + dataset_language = [dataset_language] + for language in dataset_language: + if language not in self.language: + self.language.append(language) + + output_dataset_list.append(dataset) + return output_dataset_list + + def set_losses(self, losses: nn.Module) -> None: + citations = { + "Sentence Transformers": """ +@inproceedings{reimers-2019-sentence-bert, + title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", + author = "Reimers, Nils and Gurevych, Iryna", + booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", + month = "11", + year = "2019", + publisher = "Association for Computational Linguistics", + url = "https://arxiv.org/abs/1908.10084", +} +""" + } + for loss in losses: + try: + citations[loss.__class__.__name__] = loss.citation + except Exception: + pass + inverted_citations = defaultdict(list) + for loss, citation in citations.items(): + inverted_citations[citation].append(loss) + + def join_list(losses: List[str]) -> str: + if len(losses) > 1: + return ", ".join(losses[:-1]) + " and " + losses[-1] + return losses[0] + + self.citations = {join_list(losses): citation for citation, losses in inverted_citations.items()} + self.tags += [f"loss:{loss}" for loss in {loss.__class__.__name__: loss for loss in losses}] + + def set_best_model_step(self, step: int) -> None: + self.best_model_step = step + + def set_widget_examples(self, dataset: Union[Dataset, DatasetDict]) -> None: + if isinstance(dataset, Dataset): + dataset = DatasetDict(dataset=dataset) + + self.widget = [] + # Sample the datasets to use for the widget + dataset_names = random.choices(list(dataset.keys()), k=5) + num_samples = 1000 + num_samples_to_encode = 500 + source_sentences = set() + for dataset_name in tqdm(dataset_names, desc="Computing widget examples", unit="example", leave=False): + # Sample 1000 examples from the dataset, get the 500 shortest texts and encode them + dataset_size = len(dataset[dataset_name]) + samples = dataset[dataset_name].select( + random.sample(range(dataset_size), k=min(num_samples, dataset_size)) + ) + all_texts = { + value + for sample in samples + for key, value in sample.items() + if isinstance(value, str) and value not in source_sentences and key != "dataset_name" + } + if len(all_texts) < 5: + continue + + all_texts = sorted(all_texts, key=len)[:num_samples_to_encode] + embeddings = self.model.encode(all_texts, show_progress_bar=False) + + # Select a relatively short example from the dataset as the source, + # and find the most similar, median, and dissimilar examples + source_sentence_idx, source_sentence = sorted(list(enumerate(all_texts)), key=lambda x: len(x[1]))[ + min(len(all_texts) - 1, 10) + ] + _, indices = cos_sim(embeddings[source_sentence_idx], embeddings)[0].sort() + similar_sentence = all_texts[indices[-2]] + median_sentence = all_texts[len(all_texts) // 2] + dissimilar_sentence = all_texts[indices[0]] + self.widget.append( + { + "source_sentence": source_sentence, + "sentences": [similar_sentence, median_sentence, dissimilar_sentence], + } + ) + source_sentences.add(source_sentence) + + self.predict_example = [source_sentence, similar_sentence, median_sentence] + + def set_evaluation_metrics(self, evaluator: "SentenceEvaluator", 
metrics: Dict[str, Any]): + self.eval_results_dict[evaluator] = copy(metrics) + + # If the evaluator has a primary metric and we have a trainer, then add the primary metric to the training logs + if hasattr(evaluator, "primary_metric"): + primary_metrics = evaluator.primary_metric + if isinstance(evaluator, SequentialEvaluator): + primary_metrics = [sub_evaluator.primary_metric for sub_evaluator in evaluator.evaluators] + elif isinstance(primary_metrics, str): + primary_metrics = [primary_metrics] + + if self.trainer is None: + step = 0 + epoch = 0 + else: + step = self.trainer.state.global_step + epoch = self.trainer.state.epoch + training_log_metrics = {key: value for key, value in metrics.items() if key in primary_metrics} + + if self.training_logs and self.training_logs[-1]["Step"] == step: + self.training_logs[-1].update(training_log_metrics) + else: + self.training_logs.append( + { + "Epoch": epoch, + "Step": step, + **training_log_metrics, + } + ) + + def set_label_examples(self, dataset: Dataset) -> None: + num_examples_per_label = 3 + examples = defaultdict(list) + finished_labels = set() + for sample in dataset: + text = sample["text"] + label = sample["label"] + if label not in finished_labels: + examples[label].append(f"
<li>{repr(text)}</li>")
+ if len(examples[label]) >= num_examples_per_label:
+ finished_labels.add(label)
+ if len(finished_labels) == self.num_classes:
+ break
+ self.label_example_list = [
+ {
+ "Label": self.model.labels[label] if self.model.labels and isinstance(label, int) else label,
+ "Examples": "<ul>" + "".join(example_set) + "</ul>
    ", + } + for label, example_set in examples.items() + ] + + def infer_datasets(self, dataset: Union[Dataset, DatasetDict], dataset_name: Optional[str] = None) -> None: + if isinstance(dataset, DatasetDict): + return [ + dataset + for dataset_name, sub_dataset in dataset.items() + for dataset in self.infer_datasets(sub_dataset, dataset_name=dataset_name) + ] + + def subtuple_finder(tuple: Tuple[str], subtuple: Tuple[str]) -> int: + for i, element in enumerate(tuple): + if element == subtuple[0] and tuple[i : i + len(subtuple)] == subtuple: + return i + return -1 + + cache_files = dataset.cache_files + dataset_output = {} + # Ignore the dataset name if it is a default name from the FitMixin backwards compatibility + if dataset_name and re.match("_dataset_\d+", dataset_name): + dataset_name = None + if dataset_name: + dataset_output["name"] = dataset_name + if cache_files and "filename" in cache_files[0]: + cache_path_parts = Path(cache_files[0]["filename"]).parts + # Check if the cachefile is under "huggingface/datasets" + subtuple = ("huggingface", "datasets") + index = subtuple_finder(cache_path_parts, subtuple) + if index == -1: + return + + # Get the folder after "huggingface/datasets" + cache_dataset_name = cache_path_parts[index + len(subtuple)] + # If the dataset has an author: + if "___" in cache_dataset_name: + author, dataset_name = cache_dataset_name.split("___") + dataset_output["id"] = f"{author}/{dataset_name}" + else: + author = None + dataset_name = cache_dataset_name + dataset_output["id"] = get_dataset_info(dataset_name).id + + # If the cache path ends with a 40 character hash, it is the current revision + if len(cache_path_parts[-2]) == 40: + dataset_output["revision"] = cache_path_parts[-2] + + return [dataset_output] + + def compute_dataset_metrics( + self, + dataset: Dict[str, str], + dataset_info: Dict[str, Any], + loss: Optional[Union[Dict[str, nn.Module], nn.Module]], + ) -> Dict[str, str]: + """ + Given a dataset, compute the following: + * Dataset Size + * Dataset Columns + * Dataset Stats + - Strings: min, mean, max word count/token length + - Integers: Counter() instance + - Floats: min, mean, max range + - List: number of elements or min, mean, max number of elements + * 3 Example samples + * Loss function name + - Loss function config + """ + if not dataset: + return {} + + dataset_info["size"] = len(dataset) + dataset_info["columns"] = [f"{column}" for column in dataset.column_names] + dataset_info["stats"] = {} + for column in dataset.column_names: + subsection = dataset[:1000][column] + first = subsection[0] + if isinstance(first, str): + tokenized = self.model.tokenize(subsection) + if isinstance(tokenized, dict) and "attention_mask" in tokenized: + lengths = tokenized["attention_mask"].sum(dim=1).tolist() + suffix = "tokens" + else: + lengths = [len(sentence) for sentence in subsection] + suffix = "characters" + dataset_info["stats"][column] = { + "dtype": "string", + "data": { + "min": f"{round(min(lengths), 2)} {suffix}", + "mean": f"{round(sum(lengths) / len(lengths), 2)} {suffix}", + "max": f"{round(max(lengths), 2)} {suffix}", + }, + } + elif isinstance(first, (int, bool)): + counter = Counter(subsection) + dataset_info["stats"][column] = { + "dtype": "int", + "data": { + key: f"{'~' if len(counter) > 1 else ''}{counter[key] / len(subsection):.2%}" + for key in sorted(counter) + }, + } + elif isinstance(first, float): + dataset_info["stats"][column] = { + "dtype": "float", + "data": { + "min": round(min(dataset[column]), 2), + "mean": 
round(sum(dataset[column]) / len(dataset), 2), + "max": round(max(dataset[column]), 2), + }, + } + elif isinstance(first, list): + counter = Counter([len(lst) for lst in subsection]) + if len(counter) == 1: + dataset_info["stats"][column] = { + "dtype": "list", + "data": { + "size": f"{len(first)} elements", + }, + } + else: + dataset_info["stats"][column] = { + "dtype": "list", + "data": { + "min": f"{min(counter)} elements", + "mean": f"{sum(counter) / len(counter):.2f} elements", + "max": f"{max(counter)} elements", + }, + } + else: + dataset_info["stats"][column] = {"dtype": fullname(first), "data": {}} + + def to_html_list(data: dict): + return "
    • " + "
    • ".join(f"{key}: {value}" for key, value in data.items()) + "
    " + + stats_lines = [ + {"": "type", **{key: value["dtype"] for key, value in dataset_info["stats"].items()}}, + {"": "details", **{key: to_html_list(value["data"]) for key, value in dataset_info["stats"].items()}}, + ] + dataset_info["stats_table"] = indent(make_markdown_table(stats_lines).replace("-:|", "--|"), " ") + + dataset_info["examples"] = dataset[:3] + num_samples = len(dataset_info["examples"][list(dataset_info["examples"])[0]]) + examples_lines = [] + for sample_idx in range(num_samples): + columns = {} + for column in dataset.column_names: + value = dataset_info["examples"][column][sample_idx] + # If the value is a long list, truncate it + if isinstance(value, list) and len(value) > 5: + value = str(value[:5])[:-1] + ", ...]" + # Avoid newlines in the table + value = str(value).replace("\n", "
    ") + columns[column] = f"{value}" + examples_lines.append(columns) + dataset_info["examples_table"] = indent(make_markdown_table(examples_lines).replace("-:|", "--|"), " ") + + dataset_info["loss"] = { + "fullname": fullname(loss), + } + if hasattr(loss, "get_config_dict"): + config = loss.get_config_dict() + try: + str_config = json.dumps(config, indent=4) + except TypeError: + str_config = str(config) + dataset_info["loss"]["config_code"] = indent(f"```json\n{str_config}\n```", " ") + return dataset_info + + def extract_dataset_metadata( + self, dataset: Union[Dataset, DatasetDict], dataset_metadata, dataset_type: Literal["train", "eval"] + ) -> Dict[str, Any]: + if dataset: + if dataset_metadata and ( + (isinstance(dataset, DatasetDict) and len(dataset_metadata) != len(dataset)) + or (isinstance(dataset, Dataset) and len(dataset_metadata) != 1) + ): + logger.warning( + f"The number of `{dataset_type}_datasets` in the model card data does not match the number of {dataset_type} datasets in the Trainer. " + f"Removing the provided `{dataset_type}_datasets` from the model card data." + ) + dataset_metadata = [] + + if not dataset_metadata: + dataset_metadata = self.infer_datasets(dataset) + + if isinstance(dataset, DatasetDict): + dataset_metadata = [ + self.compute_dataset_metrics( + dataset_value, + dataset_info, + self.trainer.loss[dataset_name] if isinstance(self.trainer.loss, dict) else self.trainer.loss, + ) + for dataset_name, dataset_value, dataset_info in zip( + dataset.keys(), dataset.values(), dataset_metadata + ) + ] + else: + dataset_metadata = [self.compute_dataset_metrics(dataset, dataset_metadata[0], self.trainer.loss)] + + return self.validate_datasets(dataset_metadata) + + def register_model(self, model: "SentenceTransformer") -> None: + self.model = model + + def set_model_id(self, model_id: str) -> None: + self.model_id = model_id + + def set_base_model(self, model_id: str, revision: Optional[str] = None) -> None: + try: + model_info = get_model_info(model_id) + except Exception: + # Getting the model info can fail for many reasons: model does not exist, no internet, outage, etc. + return False + self.base_model = model_info.id + if revision is None or revision == "main": + revision = model_info.sha + self.base_model_revision = revision + return True + + def try_to_set_base_model(self) -> None: + if isinstance(self.model[0], Transformer): + base_model = self.model[0].auto_model.config._name_or_path + base_model_path = Path(base_model) + # Sometimes the name_or_path ends exactly with the model_id, e.g. + # "C:\\Users\\tom/.cache\\torch\\sentence_transformers\\BAAI_bge-small-en-v1.5\\" + candidate_model_ids = ["/".join(base_model_path.parts[-2:])] + # Sometimes the name_or_path its final part contains the full model_id, with "/" replaced with a "_", e.g. + # "/root/.cache/torch/sentence_transformers/sentence-transformers_all-mpnet-base-v2/" + # In that case, we take the last part, split on _, and try all combinations + # e.g. "a_b_c_d" -> ['a/b_c_d', 'a_b/c_d', 'a_b_c/d'] + splits = base_model_path.name.split("_") + candidate_model_ids += [ + "_".join(splits[:idx]) + "/" + "_".join(splits[idx:]) for idx in range(1, len(splits)) + ] + for model_id in candidate_model_ids: + if self.set_base_model(model_id): + break + + def format_eval_metrics(self) -> Dict[str, Any]: + """Format the evaluation metrics for the model card. 
+ + The following keys will be returned: + - eval_metrics: A list of dictionaries containing the class name, description, dataset name, and a markdown table + This is used to display the evaluation metrics in the model card. + - metrics: A list of all metric keys. This is used in the model card metadata. + - model-index: A list of dictionaries containing the task name, task type, dataset type, dataset name, metric name, + metric type, and metric value. This is used to display the evaluation metrics in the model card metadata. + """ + eval_metrics = [] + all_metrics = {} + eval_results = [] + for evaluator, metrics in self.eval_results_dict.items(): + name = getattr(evaluator, "name", None) + primary_metric = getattr(evaluator, "primary_metric", None) + if name and all(key.startswith(name + "_") for key in metrics.keys()): + metrics = {key[len(name) + 1 :]: value for key, value in metrics.items()} + if primary_metric and primary_metric.startswith(name + "_"): + primary_metric = primary_metric[len(name) + 1 :] + + def try_to_pure_python(value: Any) -> Any: + """Try to convert a value from a Numpy or Torch scalar to pure Python, if not already pure Python""" + try: + if hasattr(value, "dtype"): + return value.item() + except Exception: + pass + return value + + # Try to convert to pure Python + metrics = {key: try_to_pure_python(value) for key, value in metrics.items()} + + table_lines = [ + { + "Metric": f"**{metric_key}**" if metric_key == primary_metric else metric_key, + "Value": f"**{round(metric_value, 4)}**" + if metric_key == primary_metric + else round(metric_value, 4), + } + for metric_key, metric_value in metrics.items() + ] + + # E.g. "Binary Classification" or "Semantic Similarity" + description = evaluator.description + dataset_name = getattr(evaluator, "name", None) + eval_metrics.append( + { + "class_name": fullname(evaluator), + "description": description, + "dataset_name": dataset_name, + "table": make_markdown_table(table_lines).replace("-:|", "--|"), + } + ) + eval_results.extend( + [ + EvalResult( + task_name=description, + task_type=description.lower().replace(" ", "-"), + dataset_type=dataset_name or "unknown", + dataset_name=dataset_name.replace("_", " ").replace("-", " ") or "Unknown", + metric_name=metric_key.replace("_", " ").title(), + metric_type=metric_key, + metric_value=metric_value, + ) + for metric_key, metric_value in metrics.items() + if isinstance(metric_value, (int, float)) + ] + ) + all_metrics.update(metrics) + + return { + "eval_metrics": eval_metrics, + "metrics": list(all_metrics.keys()), + "model-index": eval_results_to_model_index(self.model_name, eval_results), + } + + def format_training_logs(self): + # Get the keys from all evaluation lines + eval_lines_keys = {key for lines in self.training_logs for key in lines.keys()} + + # Sort the metric columns: Epoch, Step, Training Loss, Validation Loss, Evaluator results + def sort_metrics(key: str) -> str: + if key == "Epoch": + return "0" + if key == "Step": + return "1" + if key == "Training Loss": + return "2" + if key.endswith("loss"): + return "3" + return key + + sorted_eval_lines_keys = sorted(eval_lines_keys, key=sort_metrics) + training_logs = [ + { + key: f"**{round(line[key], 4) if key in line else '-'}**" + if line["Step"] == self.best_model_step + else line.get(key, "-") + for key in sorted_eval_lines_keys + } + for line in self.training_logs + ] + eval_lines = make_markdown_table(training_logs) + return { + "eval_lines": eval_lines, + "explain_bold_in_eval": "**" in eval_lines, + } + + 
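
The prefix-stripping and scalar conversion inside `format_eval_metrics` above is easier to see outside of the diff. A minimal standalone sketch (not part of this patch; the "sts-dev" evaluator name and metric keys are illustrative):

```python
import numpy as np

# Hypothetical evaluator output: keys prefixed with the evaluator name, NumPy scalar values
name = "sts-dev"
metrics = {
    "sts-dev_pearson_cosine": np.float64(0.8567),
    "sts-dev_spearman_cosine": np.float64(0.8521),
}

# Strip the "<name>_" prefix, but only when every key carries it
if name and all(key.startswith(name + "_") for key in metrics):
    metrics = {key[len(name) + 1 :]: value for key, value in metrics.items()}

# Convert NumPy/Torch scalars (anything exposing `.dtype`) to pure Python values
metrics = {key: value.item() if hasattr(value, "dtype") else value for key, value in metrics.items()}

print(metrics)  # {'pearson_cosine': 0.8567, 'spearman_cosine': 0.8521}
```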
def get_codecarbon_data(self): + emissions_data = self.code_carbon_callback.tracker._prepare_emissions_data() + results = { + "co2_eq_emissions": { + # * 1000 to convert kg to g + "emissions": float(emissions_data.emissions) * 1000, + "energy_consumed": float(emissions_data.energy_consumed), + "source": "codecarbon", + "training_type": "fine-tuning", + "on_cloud": emissions_data.on_cloud == "Y", + "cpu_model": emissions_data.cpu_model, + "ram_total_size": emissions_data.ram_total_size, + "hours_used": round(emissions_data.duration / 3600, 3), + } + } + if emissions_data.gpu_model: + results["co2_eq_emissions"]["hardware_used"] = emissions_data.gpu_model + return results + + def to_dict(self) -> Dict[str, Any]: + # Extract some meaningful examples from the evaluation or training dataset to showcase the performance + if self.trainer and self.widget_step < self.trainer.state.global_step and self.generate_widget_examples: + if dataset := self.trainer.eval_dataset or self.trainer.train_dataset: + self.set_widget_examples(dataset) + self.widget_step = self.trainer.state.global_step + + # Try to set the base model + if self.first_save and not self.base_model: + try: + self.try_to_set_base_model() + except Exception: + pass + + # Set the model name + if not self.model_name: + if self.base_model: + self.model_name = f"SentenceTransformer based on {self.base_model}" + else: + self.model_name = "SentenceTransformer" + + super_dict = {field.name: getattr(self, field.name) for field in fields(self)} + + # Compute required formats from the (usually post-training) evaluation data + if self.eval_results_dict: + try: + super_dict.update(self.format_eval_metrics()) + except Exception as exc: + logger.warning(f"Error while formatting evaluation metrics: {exc}") + raise exc + + # Compute required formats for the during-training evaluation data + if self.training_logs: + try: + super_dict.update(self.format_training_logs()) + except Exception as exc: + logger.warning(f"Error while formatting training logs: {exc}") + + super_dict["hide_eval_lines"] = len(self.training_logs) > 100 + + # Try to add the code carbon callback data + if ( + self.code_carbon_callback + and self.code_carbon_callback.tracker + and self.code_carbon_callback.tracker._start_time is not None + ): + super_dict.update(self.get_codecarbon_data()) + + # Add some additional metadata stored in the model itself + super_dict["model_max_length"] = self.model.get_max_seq_length() + super_dict["output_dimensionality"] = self.model.get_sentence_embedding_dimension() + super_dict["model_string"] = str(self.model) + + self.first_save = False + + for key in IGNORED_FIELDS: + super_dict.pop(key, None) + return super_dict + + def to_yaml(self, line_break=None) -> str: + return yaml_dump( + {key: value for key, value in self.to_dict().items() if key in YAML_FIELDS and value is not None}, + sort_keys=False, + line_break=line_break, + ).strip() + + +def generate_model_card(model: "SentenceTransformer") -> str: + template_path = Path(__file__).parent / "model_card_template.md" + model_card = ModelCard.from_template(card_data=model.model_card_data, template_path=template_path, hf_emoji="🤗") + return model_card.content diff --git a/sentence_transformers/model_card_template.md b/sentence_transformers/model_card_template.md new file mode 100644 index 000000000..2362eb0c6 --- /dev/null +++ b/sentence_transformers/model_card_template.md @@ -0,0 +1,228 @@ +--- +# For reference on model card metadata, see the spec: 
https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1 +# Doc / guide: https://huggingface.co/docs/hub/model-cards +{{ card_data }} +--- + +# {{ model_name if model_name else "Sentence Transformer model" }} + +This is a [sentence-transformers](https://www.SBERT.net) model{% if base_model %} finetuned from [{{ base_model }}](https://huggingface.co/{{ base_model }}){% else %} trained{% endif %}{% if train_datasets | selectattr("name") | list %} on the {% for dataset in (train_datasets | selectattr("name")) %}{% if dataset.id %}[{{ dataset.name if dataset.name else dataset.id }}](https://huggingface.co/datasets/{{ dataset.id }}){% else %}{{ dataset.name }}{% endif %}{% if not loop.last %}{% if loop.index == (train_datasets | selectattr("name") | list | length - 1) %} and {% else %}, {% endif %}{% endif %}{% endfor %} dataset{{"s" if train_datasets | selectattr("name") | list | length > 1 else ""}}{% endif %}. It maps sentences & paragraphs to a {{ output_dimensionality }}-dimensional dense vector space and can be used for {{ task_name }}. + +## Model Details + +### Model Description +- **Model Type:** Sentence Transformer +{% if base_model -%} + {%- if base_model_revision -%} + - **Base model:** [{{ base_model }}](https://huggingface.co/{{ base_model }}) + {%- else -%} + - **Base model:** [{{ base_model }}](https://huggingface.co/{{ base_model }}) + {%- endif -%} +{%- else -%} + +{%- endif %} +- **Maximum Sequence Length:** {{ model_max_length }} tokens +- **Output Dimensionality:** {{ output_dimensionality }} tokens +{% if train_datasets | selectattr("name") | list -%} + - **Training Dataset{{"s" if train_datasets | selectattr("name") | list | length > 1 else ""}}:** + {%- for dataset in (train_datasets | selectattr("name")) %} + {%- if dataset.id %} + - [{{ dataset.name if dataset.name else dataset.id }}](https://huggingface.co/datasets/{{ dataset.id }}) + {%- else %} + - {{ dataset.name }} + {%- endif %} + {%- endfor %} +{%- else -%} + +{%- endif %} +{% if language -%} + - **Language{{"s" if language is not string and language | length > 1 else ""}}:** + {%- if language is string %} {{ language }} + {%- else %} {% for lang in language -%} + {{ lang }}{{ ", " if not loop.last else "" }} + {%- endfor %} + {%- endif %} +{%- else -%} + +{%- endif %} +{% if license -%} + - **License:** {{ license }} +{%- else -%} + +{%- endif %} + +### Model Sources + +- **Documentation:** [Sentence Transformers Documentation](https://sbert.net) +- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) +- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) + +### Full Model Architecture + +``` +{{ model_string }} +``` + +## Usage + +### Direct Usage (Sentence Transformers) + +First install the Sentence Transformers library: + +```bash +pip install -U sentence-transformers +``` + +Then you can load this model and run inference. 
+```python +from sentence_transformers import SentenceTransformer + +# Download from the {{ hf_emoji }} Hub +model = SentenceTransformer("{{ model_id | default('sentence_transformers_model_id', true) }}") +# Run inference +sentences = [ +{%- for text in (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) %} + {{ "%r" | format(text) }}, +{%- endfor %} +] +embeddings = model.encode(sentences) +print(embeddings.shape) +# [{{ (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) | length}}, {{ output_dimensionality | default(1024, true) }}] +``` + + + + + + +{% if eval_metrics %} +## Evaluation + +### Metrics +{% for metrics in eval_metrics %} +#### {{ metrics.description }} +{% if metrics.dataset_name %}* Dataset: `{{ metrics.dataset_name }}`{% endif %} +* Evaluated with {% if metrics.class_name.startswith("sentence_transformers.") %}[{{ metrics.class_name.split(".")[-1] }}](https://sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.{{ metrics.class_name.split(".")[-1] }}){% else %}{{ metrics.class_name }}{% endif %} + +{{ metrics.table }} +{%- endfor %}{% endif %} + + + + +## Training Details +{% for dataset_type, dataset_list in [("training", train_datasets), ("evaluation", eval_datasets)] %}{% if dataset_list %} +### {{ dataset_type.title() }} Dataset{{"s" if dataset_list | length > 1 else ""}} +{% for dataset in dataset_list %} +#### {{ dataset['name'] or 'Unnamed Dataset' }} + +{% if dataset['name'] %}* Dataset: {% if 'id' in dataset %}[{{ dataset['name'] }}](https://huggingface.co/datasets/{{ dataset['id'] }}){% else %}{{ dataset['name'] }}{% endif %} +{%- if 'revision' in dataset and 'id' in dataset %} at [{{ dataset['revision'][:7] }}](https://huggingface.co/datasets/{{ dataset['id'] }}/tree/{{ dataset['revision'] }}){% endif %}{% endif %} +* Size: {{ "{:,}".format(dataset['size']) }} {{ dataset_type }} samples +* Columns: {% if dataset['columns'] | length == 1 %}{{ dataset['columns'][0] }}{% elif dataset['columns'] | length == 2 %}{{ dataset['columns'][0] }} and {{ dataset['columns'][1] }}{% else %}{{ dataset['columns'][:-1] | join(', ') }}, and {{ dataset['columns'][-1] }}{% endif %} +* Approximate statistics based on the first 1000 samples: +{{ dataset['stats_table'] }}* Samples: +{{ dataset['examples_table'] }}* Loss: {% if dataset["loss"]["fullname"].startswith("sentence_transformers.") %}[{{ dataset["loss"]["fullname"].split(".")[-1] }}](https://sbert.net/docs/package_reference/losses.html#{{ dataset["loss"]["fullname"].split(".")[-1].lower() }}){% else %}{{ dataset["loss"]["fullname"] }}{% endif %}{% if "config_code" in dataset["loss"] %} with these parameters: +{{ dataset["loss"]["config_code"] }}{% endif %} +{% endfor %}{% endif %}{% endfor -%} + +{% if all_hyperparameters %} +### Training Hyperparameters +{% if non_default_hyperparameters -%} +#### Non-Default Hyperparameters + +{% for name, value in non_default_hyperparameters.items() %}- `{{ name }}`: {{ value }} +{% endfor %}{%- endif %} +#### All Hyperparameters +
<details><summary>Click to expand</summary>
+
+{% for name, value in all_hyperparameters.items() %}- `{{ name }}`: {{ value }}
+{% endfor %}
+</details>
+{% endif %}
+
+{%- if eval_lines %}
+### Training Logs
+{% if hide_eval_lines %}<details><summary>Click to expand</summary>
+
+{% endif -%}
+{{ eval_lines }}{% if explain_bold_in_eval %}
+* The bold row denotes the saved checkpoint.{% endif %}
+{%- if hide_eval_lines %}
+</details>
    {% endif %} +{% endif %} + +{%- if co2_eq_emissions %} +### Environmental Impact +Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon). +- **Energy Consumed**: {{ "%.3f"|format(co2_eq_emissions["energy_consumed"]) }} kWh +- **Carbon Emitted**: {{ "%.3f"|format(co2_eq_emissions["emissions"] / 1000) }} kg of CO2 +- **Hours Used**: {{ co2_eq_emissions["hours_used"] }} hours + +### Training Hardware +- **On Cloud**: {{ "Yes" if co2_eq_emissions["on_cloud"] else "No" }} +- **GPU Model**: {{ co2_eq_emissions["hardware_used"] or "No GPU used" }} +- **CPU Model**: {{ co2_eq_emissions["cpu_model"] }} +- **RAM Size**: {{ "%.2f"|format(co2_eq_emissions["ram_total_size"]) }} GB +{% endif %} +### Framework Versions +- Python: {{ version["python"] }} +- Sentence Transformers: {{ version["sentence_transformers"] }} +- Transformers: {{ version["transformers"] }} +- PyTorch: {{ version["torch"] }} +- Accelerate: {{ version["accelerate"] }} +- Datasets: {{ version["datasets"] }} +- Tokenizers: {{ version["tokenizers"] }} + +## Citation + +### BibTeX +{% for loss_name, citation in citations.items() %} +#### {{ loss_name }} +```bibtex +{{ citation | trim }} +``` +{% endfor %} + + + + + \ No newline at end of file diff --git a/sentence_transformers/models/BoW.py b/sentence_transformers/models/BoW.py index 02fee875e..c9f7aef06 100644 --- a/sentence_transformers/models/BoW.py +++ b/sentence_transformers/models/BoW.py @@ -5,7 +5,6 @@ import os import json import logging -import numpy as np from .tokenizer import WhitespaceTokenizer @@ -70,7 +69,7 @@ def get_sentence_features(self, tokenized_texts: List[List[int]], pad_seq_length vectors = [] for tokens in tokenized_texts: - vector = np.zeros(self.get_sentence_embedding_dimension(), dtype=np.float32) + vector = torch.zeros(self.get_sentence_embedding_dimension(), dtype=torch.float32) for token in tokens: if self.cumulative_term_frequency: vector[token] += self.weights[token] @@ -78,7 +77,7 @@ def get_sentence_features(self, tokenized_texts: List[List[int]], pad_seq_length vector[token] = self.weights[token] vectors.append(vector) - return {"sentence_embedding": torch.tensor(vectors, dtype=torch.float)} + return {"sentence_embedding": torch.stack(vectors)} def get_config_dict(self): return {key: self.__dict__[key] for key in self.config_keys} diff --git a/sentence_transformers/models/CLIPModel.py b/sentence_transformers/models/CLIPModel.py index b4ab32e37..dea1c12a0 100644 --- a/sentence_transformers/models/CLIPModel.py +++ b/sentence_transformers/models/CLIPModel.py @@ -74,6 +74,10 @@ def tokenize(self, texts, padding: Union[str, bool] = True): encoding["image_text_info"] = image_text_info return encoding + @property + def tokenizer(self): + return self.processor + def save(self, output_path: str): self.model.save_pretrained(output_path) self.processor.save_pretrained(output_path) diff --git a/sentence_transformers/models/Transformer.py b/sentence_transformers/models/Transformer.py index 25727ab1e..ae9dbfbb9 100644 --- a/sentence_transformers/models/Transformer.py +++ b/sentence_transformers/models/Transformer.py @@ -35,6 +35,8 @@ def __init__( config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir) self._load_model(model_name_or_path, config, cache_dir, **model_args) + if max_seq_length is not None and "model_max_length" not in tokenizer_args: + tokenizer_args["model_max_length"] = max_seq_length self.tokenizer = AutoTokenizer.from_pretrained( tokenizer_name_or_path if tokenizer_name_or_path 
is not None else model_name_or_path,
+ cache_dir=cache_dir,
diff --git a/sentence_transformers/sampler.py b/sentence_transformers/sampler.py
new file mode 100644
index 000000000..e717d21ec
--- /dev/null
+++ b/sentence_transformers/sampler.py
@@ -0,0 +1,210 @@
+from collections import defaultdict
+from itertools import accumulate, cycle
+from typing import List
+import logging
+
+from datasets import Dataset
+from torch.utils.data import BatchSampler, SubsetRandomSampler, ConcatDataset
+import torch
+
+logger = logging.getLogger(__name__)
+
+
+class SetEpochMixin:
+ """
+ Required for a BatchSampler as the Trainer will call set_epoch on the BatchSampler at the beginning of each epoch.
+ The BatchSampler can then set the generator seed accordingly.
+ """
+
+ def __init__(self, *args, **kwargs) -> None:
+ super().__init__(*args, **kwargs)
+
+ self.epoch = 0
+
+ def set_epoch(self, epoch: int):
+ self.epoch = epoch
+
+
+class DefaultBatchSampler(SetEpochMixin, BatchSampler):
+ pass
+
+
+class GroupByLabelBatchSampler(SetEpochMixin, BatchSampler):
+ def __init__(
+ self,
+ dataset: Dataset,
+ batch_size: int,
+ drop_last: bool,
+ valid_label_columns: List[str] = None,
+ generator: torch.Generator = None,
+ seed: int = 0,
+ ):
+ super().__init__(dataset, batch_size, drop_last)
+ self.dataset = dataset
+ self.batch_size = batch_size
+ self.drop_last = drop_last
+ self.generator = generator
+ self.seed = seed
+
+ if self.batch_size % 2 == 1:
+ raise ValueError("The batch size for `GroupByLabelBatchSampler` must be divisible by 2.")
+
+ for column_name in valid_label_columns or []:
+ if column_name in dataset.column_names:
+ labels = dataset[column_name]
+ break
+ else:
+ raise ValueError(f"None of the valid_label_columns {valid_label_columns} are in the dataset.")
+
+ del dataset
+ groups = defaultdict(list)
+ for sample_idx, label in enumerate(labels):
+ groups[label].append(sample_idx)
+
+ self.groups = {
+ label: sample_indices[:num_samples]
+ for label, sample_indices in groups.items()
+ if (num_samples := len(sample_indices) // 2)
+ }
+
+ def __iter__(self):
+ if self.generator and self.seed:
+ self.generator.manual_seed(self.seed + self.epoch)
+
+ labels = list(self.groups.keys())
+ partial_batch = []
+ for label_idx in torch.randperm(len(self.groups), generator=self.generator):
+ label = labels[label_idx]
+ samples = self.groups[label]
+ partial_batch.extend(samples)
+ while len(partial_batch) >= self.batch_size:
+ yield partial_batch[: self.batch_size]
+ partial_batch = partial_batch[self.batch_size :]
+
+ if not self.drop_last and partial_batch:
+ yield partial_batch
+
+
+class NoDuplicatesBatchSampler(SetEpochMixin, BatchSampler):
+ def __init__(
+ self,
+ dataset: Dataset,
+ batch_size: int,
+ drop_last: bool,
+ valid_label_columns: List[str] = [],
+ generator: torch.Generator = None,
+ seed: int = 0,
+ ):
+ super().__init__(dataset, batch_size, drop_last)
+ if label_columns := set(dataset.column_names) & (set(valid_label_columns) | {"dataset_name"}):
+ dataset = dataset.remove_columns(label_columns)
+ self.dataset = dataset
+ self.batch_size = batch_size
+ self.drop_last = drop_last
+ self.generator = generator
+ self.seed = seed
+
+ def __iter__(self):
+ """
+ Iterate over the remaining non-yielded indices. For each index, check if the sample values are already in the
+ batch. If not, add the sample values to the batch and keep going until the batch is full. If the batch is full,
+ yield the batch indices and continue with the next batch.
+ """ + if self.generator and self.seed: + self.generator.manual_seed(self.seed + self.epoch) + + remaining_indices = set(torch.randperm(len(self.dataset), generator=self.generator).tolist()) + while remaining_indices: + batch_values = set() + batch_indices = [] + for index in remaining_indices: + sample_values = set(self.dataset[index].values()) + if sample_values & batch_values: + continue + + batch_indices.append(index) + if len(batch_indices) == self.batch_size: + yield batch_indices + break + + batch_values.update(sample_values) + + else: + # NOTE: some indices might still have been ignored here + if not self.drop_last: + yield batch_indices + + remaining_indices -= set(batch_indices) + + def __len__(self) -> int: + if self.drop_last: + return len(self.dataset) // self.batch_size + else: + return (len(self.dataset) + self.batch_size - 1) // self.batch_size + + +class RoundRobinBatchSampler(SetEpochMixin, BatchSampler): + def __init__( + self, + dataset: ConcatDataset, + batch_samplers: List[BatchSampler], + generator: torch.Generator, + seed: int, + ): + super().__init__(dataset, batch_samplers[0].batch_size, batch_samplers[0].drop_last) + self.dataset = dataset + self.batch_samplers = batch_samplers + self.generator = generator + self.seed = seed + + def __iter__(self): + self.generator.manual_seed(self.seed + self.epoch) + + num_samples = [len(dataset) for dataset in self.dataset.datasets] + sample_offsets = [0] + list(accumulate(num_samples)) + + batch_samplers = [iter(sampler) for sampler in self.batch_samplers] + for dataset_idx in cycle(range(len(batch_samplers))): + sample_offset = sample_offsets[dataset_idx] + try: + yield [idx + sample_offset for idx in next(batch_samplers[dataset_idx])] + + except StopIteration: + # current iterator is apparently exhausted + break + + def __len__(self) -> int: + return min([len(sampler) for sampler in self.batch_samplers]) * len(self.batch_samplers) + + +class ProportionalBatchSampler(SetEpochMixin, BatchSampler): + def __init__( + self, + dataset: ConcatDataset, + batch_samplers: List[BatchSampler], + generator: torch.Generator, + seed: int, + ): + super().__init__(dataset, batch_samplers[0].batch_size, batch_samplers[0].drop_last) + self.dataset = dataset + self.batch_samplers = batch_samplers + self.generator = generator + self.seed = seed + + def __iter__(self): + self.generator.manual_seed(self.seed + self.epoch) + + num_samples = [len(dataset) for dataset in self.dataset.datasets] + sample_offsets = [0] + list(accumulate(num_samples)) + + num_batches = [len(sampler) for sampler in self.batch_samplers] + dataset_indices = [idx for idx, length in enumerate(num_batches) for _ in range(length)] + dataset_idx_sampler = SubsetRandomSampler(dataset_indices, generator=self.generator) + + batch_samplers = [iter(sampler) for sampler in self.batch_samplers] + for dataset_idx in dataset_idx_sampler: + sample_offset = sample_offsets[dataset_idx] + yield [idx + sample_offset for idx in next(batch_samplers[dataset_idx])] + + def __len__(self) -> int: + return sum([len(sampler) for sampler in self.batch_samplers]) diff --git a/sentence_transformers/trainer.py b/sentence_transformers/trainer.py new file mode 100644 index 000000000..8eb4877ba --- /dev/null +++ b/sentence_transformers/trainer.py @@ -0,0 +1,553 @@ +from contextlib import nullcontext +import logging +import os +from typing import Any, Callable, Dict, List, Optional, Tuple, Union, TYPE_CHECKING + +import torch +from torch import nn +from torch.utils.data import DataLoader, ConcatDataset, 
Dataset, BatchSampler, SubsetRandomSampler +from transformers import PreTrainedTokenizerBase, Trainer, EvalPrediction, TrainerCallback +from transformers.integrations import WandbCallback +from transformers.trainer import TRAINING_ARGS_NAME +from transformers.training_args import ParallelMode + +from datasets import DatasetDict +from transformers.trainer_utils import EvalLoopOutput +from transformers.data.data_collator import DataCollator +from sentence_transformers.losses import CoSENTLoss + +from sentence_transformers.models.Transformer import Transformer +from sentence_transformers.training_args import ( + SentenceTransformerTrainingArguments, + BatchSamplers, + MultiDatasetBatchSamplers, +) +from sentence_transformers.data_collator import SentenceTransformerDataCollator +from sentence_transformers.evaluation import SentenceEvaluator +from sentence_transformers.sampler import ( + DefaultBatchSampler, + GroupByLabelBatchSampler, + NoDuplicatesBatchSampler, + ProportionalBatchSampler, + RoundRobinBatchSampler, +) +from sentence_transformers.util import disable_logging + +from sentence_transformers.model_card import ModelCardCallback + +logger = logging.getLogger(__name__) + +if TYPE_CHECKING: + from sentence_transformers.SentenceTransformer import SentenceTransformer + + +class SentenceTransformerTrainer(Trainer): + def __init__( + self, + model: Optional["SentenceTransformer"] = None, + args: SentenceTransformerTrainingArguments = None, + train_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + loss: Optional[Union[Dict[str, nn.Module], nn.Module]] = None, + evaluator: Optional[SentenceEvaluator] = None, + data_collator: Optional[DataCollator] = None, + tokenizer: Optional[Union[PreTrainedTokenizerBase, Callable]] = None, + model_init: Optional[Callable[[], "SentenceTransformer"]] = None, + compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, + callbacks: Optional[List[TrainerCallback]] = None, + optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), + preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None, + ) -> None: + if args is None: + output_dir = "tmp_trainer" + logger.info(f"No `TrainingArguments` passed, using `output_dir={output_dir}`.") + args = SentenceTransformerTrainingArguments(output_dir=output_dir) + elif not isinstance(args, SentenceTransformerTrainingArguments): + raise ValueError("Please use `TrainingArguments` imported from `sentence_transformers`.") + + # Get a dictionary of the default training arguments, so we can determine which arguments have been changed + # for the model card + default_args_dict = SentenceTransformerTrainingArguments(output_dir="unused").to_dict() + + # If the model ID is set via the SentenceTransformerTrainingArguments, but not via the SentenceTransformerModelCardData, + # then we can set it here for the model card regardless + if args.hub_model_id and not model.model_card_data.model_id: + model.model_card_data.set_model_id(args.hub_model_id) + + if tokenizer is None and isinstance(model.tokenizer, PreTrainedTokenizerBase): + tokenizer = model.tokenizer + + if data_collator is None: + data_collator = SentenceTransformerDataCollator(tokenize_fn=model.tokenize) + super().__init__( + model=model, + args=args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + tokenizer=tokenizer, + model_init=model_init, + 
compute_metrics=compute_metrics, + callbacks=callbacks, + optimizers=optimizers, + preprocess_logits_for_metrics=preprocess_logits_for_metrics, + ) + # Set the W&B project via environment variables if it's not already set + if any([isinstance(callback, WandbCallback) for callback in self.callback_handler.callbacks]): + os.environ.setdefault("WANDB_PROJECT", "sentence-transformers") + + if loss is None: + logger.info("No `loss` passed, using `losses.CoSENTLoss` as a default option.") + loss = CoSENTLoss(self.model) + + self.loss = loss + if isinstance(loss, dict): + self.loss = {dataset_name: loss_fn.to(self.model.device) for dataset_name, loss_fn in loss.items()} + for dataset_name, dataset in zip(["train", "eval"], [train_dataset, eval_dataset]): + if dataset is None: + continue + if not isinstance(dataset, dict): + raise ValueError( + f"If the provided `loss` is a dict, then the `{dataset_name}_dataset` must be a `DatasetDict`." + ) + if missing := set(dataset.keys()) - set(loss.keys()): + raise ValueError( + f"If the provided `loss` is a dict, then all keys from the `{dataset_name}_dataset` dictionary must occur in `loss` also. " + f"Currently, {sorted(missing)} occur{'s' if len(missing) == 1 else ''} in `{dataset_name}_dataset` but not in `loss`." + ) + else: + self.loss.to(self.model.device) + self.evaluator = evaluator + + # Add a callback responsible for automatically tracking data required for the automatic model card generation + model_card_callback = ModelCardCallback(self, default_args_dict) + self.add_callback(model_card_callback) + model_card_callback.on_init_end(self.args, self.state, self.control, self.model) + + def add_dataset_name_column(self, dataset_dict: DatasetDict) -> DatasetDict: + for key, dataset in dataset_dict.items(): + if "dataset_name" not in dataset.column_names: + dataset_dict[key] = dataset.add_column("dataset_name", [key] * len(dataset)) + return dataset_dict + + def compute_loss( + self, + model: "SentenceTransformer", + inputs: Dict[str, Union[torch.Tensor, Any]], + return_outputs: bool = False, + ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]: + dataset_name = inputs.pop("dataset_name", None) + features, labels = self.collect_features(inputs) + loss_fn = self.loss + + if isinstance(loss_fn, dict) and dataset_name: + loss_fn = loss_fn[dataset_name] + + # Hackishly insert the distributed model into the loss function, if the loss stores the model + # Only called once per process + if ( + self.args.parallel_mode != ParallelMode.NOT_PARALLEL + and hasattr(model, "module") + and getattr(loss_fn, "model", None) == model.module + ): + loss_fn.model = model + loss = loss_fn(features, labels) + if return_outputs: + output = torch.cat([model(row)["sentence_embedding"][:, None] for row in features], dim=1) + return loss, output + return loss + + def collect_features( + self, inputs: Dict[str, Union[torch.Tensor, Any]] + ) -> Tuple[List[Dict[str, torch.Tensor]], Optional[torch.Tensor]]: + """Turn the inputs from the dataloader into the separate model inputs & the labels. 
+ + Example:: + + >>> list(inputs.keys()) + ['return_loss', 'label', 'sentence_0_input_ids', 'sentence_0_token_type_ids', 'sentence_0_attention_mask', 'sentence_1_input_ids', 'sentence_1_token_type_ids', 'sentence_1_attention_mask'] + >>> features, labels = self.collect_features(inputs) + >>> len(features) + 2 + >>> list(features[0].keys()) + ['input_ids', 'token_type_ids', 'attention_mask'] + >>> list(features[1].keys()) + ['input_ids', 'token_type_ids', 'attention_mask'] + >>> torch.equal(labels, inputs["label"]) + True + """ + # All inputs ending with `_input_ids` (Transformers), `_sentence_embedding` (BoW), `_pixel_values` (CLIPModel) + # are considered to correspond to a feature + features = [] + for column in inputs: + if column.endswith("_input_ids"): + prefix = column[: -len("input_ids")] + elif column.endswith("_sentence_embedding"): + prefix = column[: -len("sentence_embedding")] + elif column.endswith("_pixel_values"): + prefix = column[: -len("pixel_values")] + else: + continue + features.append({key[len(prefix) :]: value for key, value in inputs.items() if key.startswith(prefix)}) + labels = inputs.get("label", None) + return features, labels + + def evaluate( + self, + eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> Dict[str, float]: + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + if isinstance(eval_dataset, DatasetDict): + eval_dataset = self.add_dataset_name_column(eval_dataset) + return super().evaluate(eval_dataset, ignore_keys, metric_key_prefix) + + def evaluation_loop( + self, + dataloader: DataLoader, + description: str, + prediction_loss_only: Optional[bool] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> EvalLoopOutput: + output = super().evaluation_loop( + dataloader=dataloader, + description=description, + prediction_loss_only=prediction_loss_only, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + # If the evaluator is not defined, we can just return the output + if self.evaluator is None: + return output + + # If we are training and eval_dataset is a DatasetDict, then we should + # 1) only run the evaluator for the first dataset + # 2) prefix that only run as "eval", rather than e.g. 
"eval_multi_nli" + if self.is_in_train and isinstance(self.eval_dataset, dict) and metric_key_prefix.startswith("eval_"): + if metric_key_prefix[5:] == list(self.eval_dataset.keys())[0]: + metric_key_prefix = "eval" + else: + return output + + with nullcontext() if self.is_local_process_zero() else disable_logging(logging.INFO): + evaluator_metrics = self.evaluator(self.model) + if not isinstance(evaluator_metrics, dict): + evaluator_metrics = {"evaluator": evaluator_metrics} + + # Prefix all keys with metric_key_prefix + '_' + for key in list(evaluator_metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + evaluator_metrics[f"{metric_key_prefix}_{key}"] = evaluator_metrics.pop(key) + + output.metrics.update(evaluator_metrics) + + return output + + def _load_best_model(self) -> None: + # We want to ensure that this does not fail, and it may change if transformers updates how checkpoints are saved + # Loading the best model is only supported for `transformers`-based models + if not isinstance(self.model[0], Transformer): + logger.info("Could not load best model, as the model is not a `transformers`-based model.") + return + + try: + if checkpoint := self.state.best_model_checkpoint: + step = checkpoint.rsplit("-", 1)[-1] + self.model.model_card_data.set_best_model_step(int(step)) + except Exception: + pass + + # Override the model with the `tranformers`-based auto_model, and restore the original SentenceTransformers + # model with the loaded `transformers` model + full_model = self.model + self.model = self.model[0].auto_model + try: + return super()._load_best_model() + finally: + loaded_auto_model = self.model + self.model = full_model + self.model[0].auto_model = loaded_auto_model + + def validate_column_names(self, dataset: Dataset, dataset_name: Optional[str] = None) -> bool: + if overlap := set(dataset.column_names) & {"return_loss", "dataset_name"}: + raise ValueError( + f"The following column names are invalid in your {dataset_name + ' ' if dataset_name else ''}dataset: {list(overlap)}." + " Avoid using these column names, as they are reserved for internal use." 
+ ) + + def get_batch_sampler( + self, + dataset: Dataset, + batch_size: int, + drop_last: bool, + valid_label_columns: Optional[List[str]] = None, + generator: Optional[torch.Generator] = None, + ) -> BatchSampler: + if self.args.batch_sampler == BatchSamplers.NO_DUPLICATES: + return NoDuplicatesBatchSampler( + dataset=dataset, + batch_size=batch_size, + drop_last=drop_last, + valid_label_columns=valid_label_columns, + generator=generator, + ) + + if self.args.batch_sampler == BatchSamplers.GROUP_BY_LABEL: + return GroupByLabelBatchSampler( + dataset=dataset, + batch_size=batch_size, + drop_last=drop_last, + valid_label_columns=valid_label_columns, + ) + + if self.args.batch_sampler == BatchSamplers.BATCH_SAMPLER: + return DefaultBatchSampler( + SubsetRandomSampler(range(len(dataset)), generator=generator), + batch_size=batch_size, + drop_last=drop_last, + ) + + def get_multi_dataset_batch_sampler( + self, + dataset: ConcatDataset, + batch_samplers: List[BatchSampler], + generator: Optional[torch.Generator] = None, + seed: Optional[int] = 0, + ) -> BatchSampler: + if self.args.multi_dataset_batch_sampler == MultiDatasetBatchSamplers.ROUND_ROBIN: + return RoundRobinBatchSampler( + dataset=dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=seed, + ) + + if self.args.multi_dataset_batch_sampler == MultiDatasetBatchSamplers.PROPORTIONAL: + return ProportionalBatchSampler( + dataset=dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=seed, + ) + + def get_train_dataloader(self) -> DataLoader: + """ + Returns the training [`~torch.utils.data.DataLoader`]. + + Will use no sampler if `train_dataset` does not implement `__len__`, a random sampler (adapted to distributed + training if necessary) otherwise. + + Subclass and override this method if you want to inject some custom behavior. 
+ """ + if self.train_dataset is None: + raise ValueError("Trainer: training requires a train_dataset.") + + train_dataset = self.train_dataset + data_collator = self.data_collator + + generator = torch.Generator() + if self.args.seed: + generator.manual_seed(self.args.seed) + + if isinstance(train_dataset, DatasetDict): + for dataset_name, dataset in train_dataset.items(): + self.validate_column_names(dataset, dataset_name=dataset_name) + train_dataset = self.add_dataset_name_column(train_dataset) + batch_samplers = [ + self.get_batch_sampler( + dataset, + batch_size=self.args.per_device_train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + for dataset in train_dataset.values() + ] + + train_dataset = ConcatDataset(train_dataset.values()) + batch_sampler = self.get_multi_dataset_batch_sampler( + dataset=train_dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=self.args.seed, + ) + + else: + self.validate_column_names(train_dataset) + + batch_sampler = self.get_batch_sampler( + train_dataset, + batch_size=self.args.train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + + dataloader_params = { + "collate_fn": data_collator, + "num_workers": self.args.dataloader_num_workers, + "pin_memory": self.args.dataloader_pin_memory, + "persistent_workers": self.args.dataloader_persistent_workers, + "prefetch_factor": self.args.dataloader_prefetch_factor, + "batch_sampler": batch_sampler, + } + + # If 'even_batches' is True, it will use the initial few samples to pad out the last sample. This can + # cause issues with multi-dataset training, so we want to set this to False. + # For evaluation, setting 'even_batches' to False results in hanging, so we keep it as True there. + self.accelerator.even_batches = False + self._train_dataloader = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params)) + return self._train_dataloader + + def get_eval_dataloader(self, eval_dataset: Union[Dataset, None] = None) -> DataLoader: + """ + Returns the evaluation [`~torch.utils.data.DataLoader`]. + + Subclass and override this method if you want to inject some custom behavior. + + Args: + eval_dataset (`torch.utils.data.Dataset`, *optional*): + If provided, will override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns not accepted + by the `model.forward()` method are automatically removed. It must implement `__len__`. 
+ """ + if eval_dataset is None and self.eval_dataset is None: + # Prevent errors if the evaluator is set but no eval_dataset is provided + if self.evaluator is not None: + return DataLoader([]) + raise ValueError("Trainer: evaluation requires an eval_dataset.") + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + data_collator = self.data_collator + + generator = torch.Generator() + if self.args.seed: + generator.manual_seed(self.args.seed) + + # TODO: Correctly validate the column names for the eval_dataset + if isinstance(eval_dataset, DatasetDict): + eval_dataset = self.add_dataset_name_column(eval_dataset) + eval_dataset = self.add_dataset_name_column(eval_dataset) + batch_samplers = [ + self.get_batch_sampler( + dataset, + batch_size=self.args.per_device_eval_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + for dataset in eval_dataset.values() + ] + + eval_dataset = ConcatDataset(eval_dataset.values()) + batch_sampler = self.get_multi_dataset_batch_sampler( + dataset=eval_dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=self.args.seed, + ) + else: + batch_sampler = self.get_batch_sampler( + eval_dataset, + batch_size=self.args.train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + + dataloader_params = { + "collate_fn": data_collator, + "num_workers": self.args.dataloader_num_workers, + "pin_memory": self.args.dataloader_pin_memory, + "persistent_workers": self.args.dataloader_persistent_workers, + "prefetch_factor": self.args.dataloader_prefetch_factor, + "batch_sampler": batch_sampler, + } + + # If 'even_batches' is True, it will use the initial few samples to pad out the last sample. This can + # cause issues with multi-dataset training, so we want to set this to False during training. + # For evaluation, setting 'even_batches' to False results in hanging, so we keep it as True here. + self.accelerator.even_batches = True + return self.accelerator.prepare(DataLoader(eval_dataset, **dataloader_params)) + + def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader: + """ + Returns the training [`~torch.utils.data.DataLoader`]. + + Subclass and override this method if you want to inject some custom behavior. + + Args: + test_dataset (`torch.utils.data.Dataset`, *optional*): + The test dataset to use. If it is a [`~datasets.Dataset`], columns not accepted by the + `model.forward()` method are automatically removed. It must implement `__len__`. 
+ """ + data_collator = self.data_collator + + generator = torch.Generator() + if self.args.seed: + generator.manual_seed(self.args.seed) + + if isinstance(test_dataset, DatasetDict): + for dataset_name, dataset in test_dataset.items(): + self.validate_column_names(dataset, dataset_name=dataset_name) + test_dataset = self.add_dataset_name_column(test_dataset) + batch_samplers = [ + self.get_batch_sampler( + dataset, + batch_size=self.args.per_device_train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + for dataset in test_dataset.values() + ] + + test_dataset = ConcatDataset(test_dataset.values()) + batch_sampler = self.get_multi_dataset_batch_sampler( + dataset=test_dataset, + batch_samplers=batch_samplers, + generator=generator, + seed=self.args.seed, + ) + + else: + self.validate_column_names(test_dataset) + + batch_sampler = self.get_batch_sampler( + test_dataset, + batch_size=self.args.train_batch_size, + drop_last=self.args.dataloader_drop_last, + valid_label_columns=data_collator.valid_label_columns, + generator=generator, + ) + + dataloader_params = { + "collate_fn": data_collator, + "num_workers": self.args.dataloader_num_workers, + "pin_memory": self.args.dataloader_pin_memory, + "persistent_workers": self.args.dataloader_persistent_workers, + "prefetch_factor": self.args.dataloader_prefetch_factor, + "batch_sampler": batch_sampler, + } + + # If 'even_batches' is True, it will use the initial few samples to pad out the last sample. This can + # cause issues with multi-dataset training, so we want to set this to False. + # For evaluation, setting 'even_batches' to False results in hanging, so we keep it as True there. + self.accelerator.even_batches = False + self._train_dataloader = self.accelerator.prepare(DataLoader(test_dataset, **dataloader_params)) + return self._train_dataloader + + def _save(self, output_dir: Optional[str] = None, state_dict=None): + # If we are executing this function, we are the process zero, so we don't check for that. + output_dir = output_dir if output_dir is not None else self.args.output_dir + os.makedirs(output_dir, exist_ok=True) + logger.info(f"Saving model checkpoint to {output_dir}") + + self.model.save(output_dir, safe_serialization=self.args.save_safetensors) + + if self.tokenizer is not None: + self.tokenizer.save_pretrained(output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME)) diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py new file mode 100644 index 000000000..bdd46a9aa --- /dev/null +++ b/sentence_transformers/training_args.py @@ -0,0 +1,39 @@ +from dataclasses import dataclass, field +from typing import Union +from transformers import TrainingArguments as TransformersTrainingArguments +from transformers.utils import ExplicitEnum + + +class BatchSamplers(ExplicitEnum): + """ + Stores the acceptable string identifiers for batch samplers. + """ + + BATCH_SAMPLER = "batch_sampler" # Just the default PyTorch batch sampler [default] + NO_DUPLICATES = "no_duplicates" # Ensures no duplicate samples in a batch + GROUP_BY_LABEL = "group_by_label" # Ensure each batch has 2+ samples from the same label + + +class MultiDatasetBatchSamplers(ExplicitEnum): + """ + Stores the acceptable string identifiers for multi-dataset batch samplers. 
+ """ + + ROUND_ROBIN = "round_robin" # Round-robin sampling from each dataset + PROPORTIONAL = "proportional" # Sample from each dataset in proportion to its size [default] + + +@dataclass +class SentenceTransformerTrainingArguments(TransformersTrainingArguments): + batch_sampler: Union[BatchSamplers, str] = field( + default=BatchSamplers.BATCH_SAMPLER, metadata={"help": "The batch sampler to use."} + ) + multi_dataset_batch_sampler: Union[MultiDatasetBatchSamplers, str] = field( + default=MultiDatasetBatchSamplers.PROPORTIONAL, metadata={"help": "The multi-dataset batch sampler to use."} + ) + + def __post_init__(self): + super().__post_init__() + + self.batch_sampler = BatchSamplers(self.batch_sampler) + self.multi_dataset_batch_sampler = MultiDatasetBatchSamplers(self.multi_dataset_batch_sampler) diff --git a/sentence_transformers/util.py b/sentence_transformers/util.py index 4cc0a2c8a..85bda801c 100644 --- a/sentence_transformers/util.py +++ b/sentence_transformers/util.py @@ -1,3 +1,4 @@ +from contextlib import contextmanager import functools import requests from torch import Tensor, device @@ -525,6 +526,25 @@ def __delattr__(self, attr: str) -> None: raise +@contextmanager +def disable_logging(highest_level=logging.CRITICAL): + """ + A context manager that will prevent any logging messages + triggered during the body from being processed. + + :param highest_level: the maximum logging level allowed. + """ + + previous_level = logging.root.manager.disable + + logging.disable(highest_level) + + try: + yield + finally: + logging.disable(previous_level) + + def is_sentence_transformer_model( model_name_or_path: str, token: Optional[Union[bool, str]] = None, diff --git a/setup.py b/setup.py index 637e1f192..a791e73e2 100644 --- a/setup.py +++ b/setup.py @@ -16,6 +16,7 @@ url="https://www.SBERT.net", download_url="https://github.com/UKPLab/sentence-transformers/", packages=find_packages(), + include_package_data=True, python_requires=">=3.8.0", install_requires=[ "transformers>=4.34.0,<5.0.0", @@ -26,6 +27,8 @@ "scipy", "huggingface-hub>=0.15.1", "Pillow", + "datasets", + "accelerate>=0.20.3", ], extras_require={ "dev": [ diff --git a/tests/conftest.py b/tests/conftest.py index f9db97866..05609b7a9 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -5,6 +5,7 @@ from sentence_transformers import SentenceTransformer, CrossEncoder from sentence_transformers.models import Transformer, Pooling +from datasets import load_dataset, DatasetDict @pytest.fixture() @@ -40,6 +41,11 @@ def distilbert_base_uncased_model() -> SentenceTransformer: return model +@pytest.fixture(scope="session") +def stsb_dataset_dict() -> DatasetDict: + return load_dataset("mteb/stsbenchmark-sts") + + @pytest.fixture() def cache_dir(): """ diff --git a/tests/test_evaluator.py b/tests/test_evaluator.py index e62f0dcf4..ee6eb3409 100644 --- a/tests/test_evaluator.py +++ b/tests/test_evaluator.py @@ -77,8 +77,9 @@ def test_LabelAccuracyEvaluator(paraphrase_distilroberta_base_v1_model: Sentence dev_dataloader = DataLoader(dev_samples, shuffle=False, batch_size=16) evaluator = evaluation.LabelAccuracyEvaluator(dev_dataloader, softmax_model=train_loss) - acc = evaluator(model) - assert acc > 0.2 + metrics = evaluator(model) + assert "accuracy" in metrics + assert metrics["accuracy"] > 0.2 def test_ParaphraseMiningEvaluator(paraphrase_distilroberta_base_v1_model: SentenceTransformer) -> None: @@ -91,5 +92,5 @@ def test_ParaphraseMiningEvaluator(paraphrase_distilroberta_base_v1_model: Sente 3: "On the table the cat is", } 
data_eval = evaluation.ParaphraseMiningEvaluator(sentences, [(0, 1), (2, 3)]) - score = data_eval(model) - assert score > 0.99 + metrics = data_eval(model) + assert metrics[data_eval.primary_metric] > 0.99 diff --git a/tests/test_model_card_data.py b/tests/test_model_card_data.py new file mode 100644 index 000000000..3c0a0f06a --- /dev/null +++ b/tests/test_model_card_data.py @@ -0,0 +1,24 @@ +from sentence_transformers import SentenceTransformer + +import pytest + + +@pytest.mark.parametrize( + ("revision", "expected_base_revision"), + [ + ("f3cb857cba53019a20df283396bcca179cf051a4", "f3cb857cba53019a20df283396bcca179cf051a4"), + ("f3cb857", "f3cb857"), + ("main", "valid-revision"), + (None, "valid-revision"), + ], +) +def test_model_card_data(revision, expected_base_revision) -> None: + model_name = "sentence-transformers-testing/stsb-bert-tiny-safetensors" + model = SentenceTransformer(model_name, revision=revision) + + assert model.model_card_data.base_model == model_name + if expected_base_revision == "valid-revision": + assert model.model_card_data.base_model_revision + assert len(model.model_card_data.base_model_revision) == 40 + else: + assert model.model_card_data.base_model_revision == expected_base_revision diff --git a/tests/test_pretrained_stsb.py b/tests/test_pretrained_stsb.py index 0616cbbe0..4a98a337d 100644 --- a/tests/test_pretrained_stsb.py +++ b/tests/test_pretrained_stsb.py @@ -37,7 +37,8 @@ def pretrained_model_score( evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") - score = model.evaluate(evaluator) * 100 + scores = model.evaluate(evaluator) + score = scores[evaluator.primary_metric] * 100 print(model_name, "{:.2f} vs. exp: {:.2f}".format(score, expected_score)) assert score > expected_score or abs(score - expected_score) < 0.1 diff --git a/tests/test_train_stsb.py b/tests/test_train_stsb.py index ca6c1d867..a71fe8f06 100644 --- a/tests/test_train_stsb.py +++ b/tests/test_train_stsb.py @@ -8,6 +8,7 @@ from typing import Generator, List, Tuple import pytest +import torch from torch.utils.data import DataLoader from sentence_transformers import ( @@ -63,7 +64,8 @@ def nli_resource() -> Generator[List[InputExample], None, None]: def evaluate_stsb_test(model, expected_score, test_samples) -> None: evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test") - score = model.evaluate(evaluator) * 100 + scores = model.evaluate(evaluator) + score = scores[evaluator.primary_metric] * 100 print("STS-Test Performance: {:.2f} vs. 
exp: {:.2f}".format(score, expected_score)) assert score > expected_score or abs(score - expected_score) < 0.1 @@ -83,7 +85,7 @@ def test_train_stsb_slow( epochs=1, evaluation_steps=1000, warmup_steps=int(len(train_dataloader) * 0.1), - use_amp=True, + use_amp=torch.cuda.is_available(), ) evaluate_stsb_test(model, 80.0, sts_test_samples) @@ -104,7 +106,7 @@ def test_train_stsb( epochs=1, evaluation_steps=1000, warmup_steps=int(len(train_dataloader) * 0.1), - use_amp=True, + use_amp=torch.cuda.is_available(), ) evaluate_stsb_test(model, 60.0, sts_test_samples) @@ -130,7 +132,7 @@ def test_train_nli_slow( evaluator=None, epochs=1, warmup_steps=int(len(train_dataloader) * 0.1), - use_amp=True, + use_amp=torch.cuda.is_available(), ) evaluate_stsb_test(model, 50.0, sts_test_samples) @@ -156,7 +158,7 @@ def test_train_nli( evaluator=None, epochs=1, warmup_steps=int(len(train_dataloader) * 0.1), - use_amp=True, + use_amp=torch.cuda.is_available(), ) evaluate_stsb_test(model, 50.0, sts_test_samples) diff --git a/tests/test_trainer.py b/tests/test_trainer.py new file mode 100644 index 000000000..8d8c123af --- /dev/null +++ b/tests/test_trainer.py @@ -0,0 +1,127 @@ +from pathlib import Path +import re +import tempfile +import pytest +from sentence_transformers import SentenceTransformerTrainer, SentenceTransformer, losses +from datasets import DatasetDict + + +def test_trainer_multi_dataset_errors( + stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: DatasetDict +) -> None: + train_dataset = stsb_dataset_dict["train"] + loss = { + "multi_nli": losses.CosineSimilarityLoss(model=stsb_bert_tiny_model), + "snli": losses.CosineSimilarityLoss(model=stsb_bert_tiny_model), + "stsb": losses.CosineSimilarityLoss(model=stsb_bert_tiny_model), + } + with pytest.raises( + ValueError, match="If the provided `loss` is a dict, then the `train_dataset` must be a `DatasetDict`." + ): + SentenceTransformerTrainer(model=stsb_bert_tiny_model, train_dataset=train_dataset, loss=loss) + + train_dataset = DatasetDict( + { + "multi_nli": stsb_dataset_dict["train"], + "snli": stsb_dataset_dict["train"], + "stsb": stsb_dataset_dict["train"], + "stsb-extra": stsb_dataset_dict["train"], + } + ) + with pytest.raises( + ValueError, + match="If the provided `loss` is a dict, then all keys from the `train_dataset` dictionary must occur in `loss` also. " + "Currently, \['stsb-extra'\] occurs in `train_dataset` but not in `loss`.", + ): + SentenceTransformerTrainer(model=stsb_bert_tiny_model, train_dataset=train_dataset, loss=loss) + + train_dataset = DatasetDict( + { + "multi_nli": stsb_dataset_dict["train"], + "snli": stsb_dataset_dict["train"], + "stsb": stsb_dataset_dict["train"], + } + ) + with pytest.raises( + ValueError, match="If the provided `loss` is a dict, then the `eval_dataset` must be a `DatasetDict`." + ): + SentenceTransformerTrainer( + model=stsb_bert_tiny_model, + train_dataset=train_dataset, + eval_dataset=stsb_dataset_dict["validation"], + loss=loss, + ) + + eval_dataset = DatasetDict( + { + "multi_nli": stsb_dataset_dict["validation"], + "snli": stsb_dataset_dict["validation"], + "stsb": stsb_dataset_dict["validation"], + "stsb-extra-1": stsb_dataset_dict["validation"], + "stsb-extra-2": stsb_dataset_dict["validation"], + } + ) + with pytest.raises( + ValueError, + match="If the provided `loss` is a dict, then all keys from the `eval_dataset` dictionary must occur in `loss` also. 
" + "Currently, \['stsb-extra-1', 'stsb-extra-2'\] occur in `eval_dataset` but not in `loss`.", + ): + SentenceTransformerTrainer( + model=stsb_bert_tiny_model, train_dataset=train_dataset, eval_dataset=eval_dataset, loss=loss + ) + + +def test_trainer_invalid_column_names( + stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: DatasetDict +) -> None: + train_dataset = stsb_dataset_dict["train"] + for column_name in ("return_loss", "dataset_name"): + invalid_train_dataset = train_dataset.rename_column("sentence1", column_name) + trainer = SentenceTransformerTrainer(model=stsb_bert_tiny_model, train_dataset=invalid_train_dataset) + with pytest.raises( + ValueError, + match=re.escape( + f"The following column names are invalid in your dataset: ['{column_name}']." + " Avoid using these column names, as they are reserved for internal use." + ), + ): + trainer.train() + + invalid_train_dataset = DatasetDict( + { + "stsb": train_dataset.rename_column("sentence1", column_name), + "stsb-2": train_dataset, + } + ) + trainer = SentenceTransformerTrainer(model=stsb_bert_tiny_model, train_dataset=invalid_train_dataset) + with pytest.raises( + ValueError, + match=re.escape( + f"The following column names are invalid in your stsb dataset: ['{column_name}']." + " Avoid using these column names, as they are reserved for internal use." + ), + ): + trainer.train() + + +def test_model_card_reuse(stsb_bert_tiny_model: SentenceTransformer): + assert stsb_bert_tiny_model._model_card_text + # Reuse the model card if no training was done + with tempfile.TemporaryDirectory() as tmp_folder: + model_path = Path(tmp_folder) / "tiny_model_local" + stsb_bert_tiny_model.save(str(model_path)) + + with open(model_path / "README.md", "r") as f: + model_card_text = f.read() + assert model_card_text == stsb_bert_tiny_model._model_card_text + + # Create a new model card if a Trainer was initialized + SentenceTransformerTrainer(model=stsb_bert_tiny_model) + + with tempfile.TemporaryDirectory() as tmp_folder: + model_path = Path(tmp_folder) / "tiny_model_local" + stsb_bert_tiny_model.save(str(model_path)) + + with open(model_path / "README.md", "r") as f: + model_card_text = f.read() + assert model_card_text != stsb_bert_tiny_model._model_card_text From d2ac37d7115e38b126bc0174b0c01a792cdf6498 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Thu, 25 Apr 2024 15:52:23 +0200 Subject: [PATCH 02/39] [`v3`] Add `similarity` and `similarity_pairwise` methods to Sentence Transformers (#2615) * Add similarity function to model configuration * Add more tests * Replace util.cos_sim with model.similarity in some examples * Reintroduce evaluation.SimilarityFunction * Remove last references of score function in ST class * Add similarity_fn_name to model card * Add save_pretrained alias for save * Introduce DOT alias for DOT_PRODUCT --- examples/applications/image-search/README.md | 8 +- .../semantic-search/semantic_search.py | 6 +- .../text-summarization/text-summarization.py | 10 +- examples/training/adaptive_layer/README.md | 3 +- examples/training/matryoshka/README.md | 3 +- sentence_transformers/SentenceTransformer.py | 142 +++++++++++++++++- .../BinaryClassificationEvaluator.py | 58 +++---- .../EmbeddingSimilarityEvaluator.py | 9 +- .../InformationRetrievalEvaluator.py | 14 +- .../evaluation/SimilarityFunction.py | 9 +- .../evaluation/TripletEvaluator.py | 47 ++++-- sentence_transformers/model_card.py | 38 +++-- sentence_transformers/model_card_template.md | 6 + 
sentence_transformers/similarity_functions.py | 69 +++++++++ sentence_transformers/util.py | 127 ++++++++++------ tests/test_sentence_transformer.py | 47 ++++++ tests/test_util.py | 66 +++++++- 17 files changed, 513 insertions(+), 149 deletions(-) create mode 100644 sentence_transformers/similarity_functions.py diff --git a/examples/applications/image-search/README.md b/examples/applications/image-search/README.md index f995e409e..7d691d0c7 100644 --- a/examples/applications/image-search/README.md +++ b/examples/applications/image-search/README.md @@ -12,7 +12,7 @@ Ensure that you have [transformers](https://pypi.org/project/transformers/) inst SentenceTransformers provides a wrapper for the [OpenAI CLIP Model](https://github.com/openai/CLIP), which was trained on a variety of (image, text)-pairs. ```python -from sentence_transformers import SentenceTransformer, util +from sentence_transformers import SentenceTransformer from PIL import Image # Load CLIP model @@ -26,9 +26,9 @@ text_emb = model.encode( ["Two dogs in the snow", "A cat on a table", "A picture of London at night"] ) -# Compute cosine similarities -cos_scores = util.cos_sim(img_emb, text_emb) -print(cos_scores) +# Compute similarities +similarity_scores = model.similarity(img_emb, text_emb) +print(similarity_scores) ``` You can use the CLIP model for: diff --git a/examples/applications/semantic-search/semantic_search.py b/examples/applications/semantic-search/semantic_search.py index c8da195d5..5b0e3ad62 100644 --- a/examples/applications/semantic-search/semantic_search.py +++ b/examples/applications/semantic-search/semantic_search.py @@ -7,7 +7,7 @@ This script outputs for various queries the top 5 most similar sentences in the corpus. """ -from sentence_transformers import SentenceTransformer, util +from sentence_transformers import SentenceTransformer import torch embedder = SentenceTransformer("all-MiniLM-L6-v2") @@ -40,8 +40,8 @@ query_embedding = embedder.encode(query, convert_to_tensor=True) # We use cosine-similarity and torch.topk to find the highest 5 scores - cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0] - top_results = torch.topk(cos_scores, k=top_k) + similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0] + top_results = torch.topk(similarity_scores, k=top_k) print("\n\n======================\n\n") print("Query:", query) diff --git a/examples/applications/text-summarization/text-summarization.py b/examples/applications/text-summarization/text-summarization.py index a510debef..64dc0a4cd 100644 --- a/examples/applications/text-summarization/text-summarization.py +++ b/examples/applications/text-summarization/text-summarization.py @@ -19,7 +19,7 @@ """ import nltk -from sentence_transformers import SentenceTransformer, util +from sentence_transformers import SentenceTransformer import numpy as np from LexRank import degree_centrality_scores @@ -43,13 +43,13 @@ print("Num sentences:", len(sentences)) # Compute the sentence embeddings -embeddings = model.encode(sentences, convert_to_tensor=True) +embeddings = model.encode(sentences) -# Compute the pair-wise cosine similarities -cos_scores = util.cos_sim(embeddings, embeddings).numpy() +# Compute the similarity scores +similarity_scores = model.similarity(embeddings, embeddings).numpy() # Compute the centrality for each sentence -centrality_scores = degree_centrality_scores(cos_scores, threshold=None) +centrality_scores = degree_centrality_scores(similarity_scores, threshold=None) # We argsort so that the first element is the 
sentence with the highest score
 most_central_sentence_indices = np.argsort(-centrality_scores)
diff --git a/examples/training/adaptive_layer/README.md b/examples/training/adaptive_layer/README.md
index c3843bf56..8ab7dcf8b 100644
--- a/examples/training/adaptive_layer/README.md
+++ b/examples/training/adaptive_layer/README.md
@@ -120,7 +120,6 @@ Then we can run inference with it using
 tensor([[0.7761, 0.1655]])
 # compared to tensor([[ 0.7547, -0.0162]]) for the full model
 ```
diff --git a/examples/training/matryoshka/README.md b/examples/training/matryoshka/README.md
index 62fb2e623..6781bf53c 100644
--- a/examples/training/matryoshka/README.md
+++ b/examples/training/matryoshka/README.md
@@ -58,7 +58,6 @@ After a model has been trained using a Matryoshka loss, you can then run inferen
 ```python
 from sentence_transformers import SentenceTransformer
-from sentence_transformers.util import cos_sim
 import torch.nn.functional as F
 
 matryoshka_dim = 64
@@ -77,7 +76,7 @@ embeddings = model.encode(
 )
 assert embeddings.shape[-1] == matryoshka_dim
 
-similarities = cos_sim(embeddings[0], embeddings[1:])
+similarities = model.similarity(embeddings[0], embeddings[1:])
 # => tensor([[0.7839, 0.4933]])
 ```
 As you can see, the similarity between the search query and the correct document is much higher than that of an unrelated document, despite the very small matryoshka dimension applied. Feel free to copy this script locally, modify the `matryoshka_dim`, and observe the difference in similarities.
diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py
index bdbd03905..54a1bae6d 100644
--- a/sentence_transformers/SentenceTransformer.py
+++ b/sentence_transformers/SentenceTransformer.py
@@ -5,7 +5,7 @@
 from collections import OrderedDict
 from pathlib import Path
 import warnings
-from typing import List, Dict, Literal, Tuple, Iterable, Union, Optional
+from typing import Callable, List, Dict, Literal, Tuple, Iterable, Union, Optional, overload
 import numpy as np
 from numpy import ndarray
 import transformers
@@ -20,7 +20,7 @@
 import tempfile
 
 from sentence_transformers.model_card import SentenceTransformerModelCardData, generate_model_card
-
+from sentence_transformers.similarity_functions import SimilarityFunction
 from . import __MODEL_HUB_ORGANIZATION__
 from .evaluation import SentenceEvaluator
 
@@ -59,6 +59,9 @@ class SentenceTransformer(nn.Sequential, FitMixin):
         titles in "}`.
     :param default_prompt_name: The name of the prompt that should be used by default. If not set,
         no prompt will be applied.
+    :param similarity_fn_name: The name of the similarity function to use. Valid options are "cosine", "dot",
+        "euclidean", and "manhattan". If not set, it is automatically set to "cosine" if `similarity` or
+        `similarity_pairwise` are called while `model.similarity_fn_name` is still `None`.
     :param cache_folder: Path to store models. Can also be set by the SENTENCE_TRANSFORMERS_HOME environment variable.
     :param trust_remote_code: Whether or not to allow for custom models defined on the Hub in their own modeling files.
         This option should only be set to True for repositories you trust and in which you have read the code, as it
This option should only be set to True for repositories you trust and in which you have read the code, as it @@ -78,6 +81,7 @@ def __init__( device: Optional[str] = None, prompts: Optional[Dict[str, str]] = None, default_prompt_name: Optional[str] = None, + similarity_fn_name: Optional[Union[str, SimilarityFunction]] = None, cache_folder: Optional[str] = None, trust_remote_code: bool = False, revision: Optional[str] = None, @@ -90,6 +94,7 @@ def __init__( # Note: self._load_sbert_model can also update `self.prompts` and `self.default_prompt_name` self.prompts = prompts or {} self.default_prompt_name = default_prompt_name + self.similarity_fn_name = similarity_fn_name self.truncate_dim = truncate_dim self.model_card_data = model_card_data or SentenceTransformerModelCardData() self._model_card_vars = {} @@ -436,6 +441,105 @@ def encode( return all_embeddings + @property + def similarity_fn_name(self) -> Optional[str]: + return self._similarity_fn_name + + @similarity_fn_name.setter + def similarity_fn_name(self, value: Union[str, SimilarityFunction]) -> None: + if isinstance(value, SimilarityFunction): + value = value.value + self._similarity_fn_name = value + + if value is not None: + self._similarity = SimilarityFunction.to_similarity_fn(value) + self._similarity_pairwise = SimilarityFunction.to_similarity_pairwise_fn(value) + + @overload + def similarity(self, embeddings1: Tensor, embeddings2: Tensor) -> Tensor: ... + + @overload + def similarity(self, embeddings1: ndarray, embeddings2: ndarray) -> Tensor: ... + + @property + def similarity(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]: + """ + Compute the similarity between two collections of embeddings. The output will be a matrix with the similarity + scores between all embeddings from the first parameter and all embeddings from the second parameter. This + differs from `similarity_pairwise` which computes the similarity between each pair of embeddings. + + Example + :: + + >>> model = SentenceTransformer("all-mpnet-base-v2") + >>> sentences = [ + ... "The weather is so nice!", + ... "It's so sunny outside.", + ... "He's driving to the movie theater.", + ... "She's going to the cinema.", + ... ] + >>> embeddings = model.encode(sentences, normalize_embeddings=True) + >>> model.similarity(embeddings, embeddings) + tensor([[1.0000, 0.7235, 0.0290, 0.1309], + [0.7235, 1.0000, 0.0613, 0.1129], + [0.0290, 0.0613, 1.0000, 0.5027], + [0.1309, 0.1129, 0.5027, 1.0000]]) + >>> model.similarity_fn_name + "cosine" + >>> model.similarity_fn_name = "euclidean" + >>> model.similarity(embeddings, embeddings) + tensor([[-0.0000, -0.7437, -1.3935, -1.3184], + [-0.7437, -0.0000, -1.3702, -1.3320], + [-1.3935, -1.3702, -0.0000, -0.9973], + [-1.3184, -1.3320, -0.9973, -0.0000]]) + + :param embeddings1: [num_embeddings_1, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + :param embeddings2: [num_embeddings_2, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + :return: A [num_embeddings_1, num_embeddings_2]-shaped torch tensor with similarity scores. + """ + if self.similarity_fn_name is None: + self.similarity_fn_name = SimilarityFunction.COSINE + return self._similarity + + @overload + def similarity_pairwise(self, embeddings1: Tensor, embeddings2: Tensor) -> Tensor: ... + + @overload + def similarity_pairwise(self, embeddings1: ndarray, embeddings2: ndarray) -> Tensor: ... 
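+    # The @overload stubs above only document the call signatures for type checkers; the
+    # actual callable is produced by the `similarity_pairwise` property below, which resolves
+    # to the concrete function selected via `similarity_fn_name`.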
+
+    @property
+    def similarity_pairwise(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]:
+        """
+        Compute the similarity between two collections of embeddings. The output will be a vector with the similarity
+        scores between each pair of embeddings.
+
+        Example
+            ::
+
+                >>> model = SentenceTransformer("all-mpnet-base-v2")
+                >>> sentences = [
+                ...     "The weather is so nice!",
+                ...     "It's so sunny outside.",
+                ...     "He's driving to the movie theater.",
+                ...     "She's going to the cinema.",
+                ... ]
+                >>> embeddings = model.encode(sentences, normalize_embeddings=True)
+                >>> model.similarity_pairwise(embeddings[::2], embeddings[1::2])
+                tensor([0.7235, 0.5027])
+                >>> model.similarity_fn_name
+                "cosine"
+                >>> model.similarity_fn_name = "euclidean"
+                >>> model.similarity_pairwise(embeddings[::2], embeddings[1::2])
+                tensor([-0.7437, -0.9973])
+
+        :param embeddings1: [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor.
+        :param embeddings2: [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor.
+        :return: A [num_embeddings]-shaped torch tensor with pairwise similarity scores.
+        """
+        if self.similarity_fn_name is None:
+            self.similarity_fn_name = SimilarityFunction.COSINE
+        return self._similarity_pairwise
+
     def start_multi_process_pool(self, target_devices: List[str] = None):
         """
         Starts multi process to process the encoding with several, independent processes.
@@ -672,7 +776,8 @@ def save(
         safe_serialization: bool = True,
     ):
         """
-        Saves all elements for this seq. sentence embedder into different sub-folders
+        Saves a model and its configuration files to a directory, so that it can be loaded
+        with `SentenceTransformer(path)` again.
 
         :param path: Path on disc
         :param model_name: Optional model name
@@ -700,6 +805,7 @@ def save(
             config = self._model_config.copy()
             config["prompts"] = self.prompts
             config["default_prompt_name"] = self.default_prompt_name
+            config["similarity_fn_name"] = self.similarity_fn_name
             json.dump(config, fOut, indent=2)
 
         # Save modules
@@ -727,6 +833,32 @@ def save(
         if create_model_card:
             self._create_model_card(path, model_name, train_datasets)
 
+    def save_pretrained(
+        self,
+        path: str,
+        model_name: Optional[str] = None,
+        create_model_card: bool = True,
+        train_datasets: Optional[List[str]] = None,
+        safe_serialization: bool = True,
+    ):
+        """
+        Saves a model and its configuration files to a directory, so that it can be loaded
+        with `SentenceTransformer(path)` again. Alias of `SentenceTransformer.save`.
+
+        :param path: Path on disc
+        :param model_name: Optional model name
+        :param create_model_card: If True, create a README.md with basic information about this model
+        :param train_datasets: Optional list with the names of the datasets used to train the model
+        :param safe_serialization: If true, save the model using safetensors.
If false, save the model in the traditional PyTorch way
+        """
+        self.save(
+            path,
+            model_name=model_name,
+            create_model_card=create_model_card,
+            train_datasets=train_datasets,
+            safe_serialization=safe_serialization,
+        )
+
     def _create_model_card(
         self, path: str, model_name: Optional[str] = None, train_datasets: Optional[List[str]] = "deprecated"
     ):
@@ -982,7 +1114,9 @@ def _load_sbert_model(
                 )
             )
 
-        # Set prompts if not already overridden by the __init__ calls
+        # Set score functions & prompts if not already overridden by the __init__ calls
+        if self.similarity_fn_name is None:
+            self.similarity_fn_name = self._model_config.get("similarity_fn_name", None)
         if not self.prompts:
             self.prompts = self._model_config.get("prompts", {})
         if not self.default_prompt_name:
diff --git a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
index 40709ae18..a18ec9f85 100644
--- a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
+++ b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
@@ -1,5 +1,7 @@
 from sentence_transformers import SentenceTransformer
 from contextlib import nullcontext
+
+from sentence_transformers.similarity_functions import SimilarityFunction
 from . import SentenceEvaluator
 import logging
 import os
@@ -18,7 +20,7 @@ class BinaryClassificationEvaluator(SentenceEvaluator):
     """
     Evaluate a model based on the similarity of the embeddings by calculating the accuracy of
     identifying similar and dissimilar sentences.
-    The metrics are the cosine similarity as well as euclidean and Manhattan distance
+    The metrics are the cosine similarity, dot product, Euclidean and Manhattan distances
 
     The returned score is the accuracy with a specified metric.
 
     The results are written in a CSV. If a CSV already exists, then values are appended.
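To illustrate the `similarity_fn_name` persistence wired up above, here is a minimal sketch of the intended save/load round-trip. The local path `minilm-dot` and the choice of `all-MiniLM-L6-v2` are illustrative assumptions, not part of this patch:

```python
from sentence_transformers import SentenceTransformer

# Pick a non-default similarity function at load time
model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name="dot")

# `save_pretrained` is the new alias of `save`; the chosen function is written into
# the model configuration alongside the prompts
model.save_pretrained("minilm-dot")

# `_load_sbert_model` restores `similarity_fn_name` from the saved config, so the
# reloaded model scores with the dot product without any further setup
reloaded = SentenceTransformer("minilm-dot")
assert reloaded.similarity_fn_name == "dot"
```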
@@ -69,39 +71,19 @@ def __init__( self.show_progress_bar = show_progress_bar self.csv_file = "binary_classification_evaluation" + ("_" + name if name else "") + "_results.csv" - self.csv_headers = [ - "epoch", - "steps", - "cosine_accuracy", - "cosine_accuracy_threshold", - "cosine_f1", - "cosine_precision", - "cosine_recall", - "cosine_f1_threshold", - "cosine_ap", - "manhattan_accuracy", - "manhattan_accuracy_threshold", - "manhattan_f1", - "manhattan_precision", - "manhattan_recall", - "manhattan_f1_threshold", - "manhattan_ap", - "euclidean_accuracy", - "euclidean_accuracy_threshold", - "euclidean_f1", - "euclidean_precision", - "euclidean_recall", - "euclidean_f1_threshold", - "euclidean_ap", - "dot_accuracy", - "dot_accuracy_threshold", - "dot_f1", - "dot_precision", - "dot_recall", - "dot_f1_threshold", - "dot_ap", + self.csv_headers = ["epoch", "steps"] + metrics = [ + "accuracy", + "accuracy_threshold", + "f1", + "precision", + "recall", + "f1_threshold", + "ap", ] - self.primary_metric = "cosine_accuracy" + for v in SimilarityFunction.possible_values(): + for m in metrics: + self.csv_headers.append(f"{v}_{m}") @classmethod def from_input_examples(cls, examples: List[InputExample], **kwargs): @@ -196,15 +178,15 @@ def compute_metrices(self, model): embeddings1_np = np.asarray(embeddings1) embeddings2_np = np.asarray(embeddings2) - dot_scores = [np.dot(embeddings1_np[i], embeddings2_np[i]) for i in range(len(embeddings1_np))] + dot_scores = np.sum(embeddings1_np * embeddings2_np, axis=-1) labels = np.asarray(self.labels) output_scores = {} for short_name, name, scores, reverse in [ - ["cosine", "Cosine-Similarity", cosine_scores, True], - ["manhattan", "Manhattan-Distance", manhattan_distances, False], - ["euclidean", "Euclidean-Distance", euclidean_distances, False], - ["dot", "Dot-Product", dot_scores, True], + [SimilarityFunction.COSINE.value, "Cosine-Similarity", cosine_scores, True], + [SimilarityFunction.DOT_PRODUCT.value, "Dot-Product", dot_scores, True], + [SimilarityFunction.MANHATTAN.value, "Manhattan-Distance", manhattan_distances, False], + [SimilarityFunction.EUCLIDEAN.value, "Euclidean-Distance", euclidean_distances, False], ]: acc, acc_threshold = self.find_best_acc_and_threshold(scores, labels, reverse) f1, precision, recall, f1_threshold = self.find_best_f1_and_threshold(scores, labels, reverse) diff --git a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py index 0cb14500e..0f2e9ca39 100644 --- a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py +++ b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py @@ -1,14 +1,15 @@ from contextlib import nullcontext from sentence_transformers import SentenceTransformer -from . import SentenceEvaluator, SimilarityFunction +from . 
import SentenceEvaluator +from sentence_transformers.similarity_functions import SimilarityFunction import logging import os import csv from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances from scipy.stats import pearsonr, spearmanr import numpy as np -from typing import Dict, List, Literal, Optional +from typing import Dict, List, Literal, Optional, Union from ..readers import InputExample @@ -31,7 +32,7 @@ def __init__( sentences2: List[str], scores: List[float], batch_size: int = 16, - main_similarity: SimilarityFunction = None, + main_similarity: Optional[Union[str, SimilarityFunction]] = None, name: str = "", show_progress_bar: bool = False, write_csv: bool = True, @@ -63,7 +64,7 @@ def __init__( assert len(self.sentences1) == len(self.sentences2) assert len(self.sentences1) == len(self.scores) - self.main_similarity = main_similarity + self.main_similarity = SimilarityFunction(main_similarity) if main_similarity else None self.name = name self.batch_size = batch_size diff --git a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py index 54338e6f5..917574c3e 100644 --- a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py +++ b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py @@ -1,5 +1,7 @@ from sentence_transformers import SentenceTransformer from contextlib import nullcontext + +from sentence_transformers.similarity_functions import SimilarityFunction from . import SentenceEvaluator import torch from torch import Tensor @@ -8,7 +10,7 @@ from ..util import cos_sim, dot_score import os import numpy as np -from typing import List, Dict, Optional, Set, Callable +from typing import List, Dict, Optional, Set, Callable, Union import heapq @@ -40,10 +42,10 @@ def __init__( write_csv: bool = True, truncate_dim: Optional[int] = None, score_functions: Dict[str, Callable[[Tensor, Tensor], Tensor]] = { - "cosine": cos_sim, - "dot": dot_score, + SimilarityFunction.COSINE.value: cos_sim, + SimilarityFunction.DOT_PRODUCT.value: dot_score, }, # Score function, higher=more similar - main_score_function: str = None, + main_score_function: Optional[Union[str, SimilarityFunction]] = None, ): super().__init__() self.queries_ids = [] @@ -70,7 +72,7 @@ def __init__( self.write_csv = write_csv self.score_functions = score_functions self.score_function_names = sorted(list(self.score_functions.keys())) - self.main_score_function = main_score_function + self.main_score_function = SimilarityFunction(main_score_function) if main_score_function else None self.truncate_dim = truncate_dim if name: @@ -153,7 +155,7 @@ def __call__( )[0] self.primary_metric = f"{score_function}_map@{max(self.map_at_k)}" else: - self.primary_metric = f"{self.main_score_function}_map@{max(self.map_at_k)}" + self.primary_metric = f"{self.main_score_function.value}_map@{max(self.map_at_k)}" metrics = { f"{score_function}_{metric_name.replace('@k', '@' + str(k))}": value diff --git a/sentence_transformers/evaluation/SimilarityFunction.py b/sentence_transformers/evaluation/SimilarityFunction.py index 22d112732..f149b30a3 100644 --- a/sentence_transformers/evaluation/SimilarityFunction.py +++ b/sentence_transformers/evaluation/SimilarityFunction.py @@ -1,8 +1,3 @@ -from enum import Enum +from sentence_transformers.similarity_functions import SimilarityFunction - -class SimilarityFunction(Enum): - COSINE = 0 - EUCLIDEAN = 1 - MANHATTAN = 2 - DOT_PRODUCT = 3 +__all__ 
= ["SimilarityFunction"] diff --git a/sentence_transformers/evaluation/TripletEvaluator.py b/sentence_transformers/evaluation/TripletEvaluator.py index da7719f97..3b9a908c2 100644 --- a/sentence_transformers/evaluation/TripletEvaluator.py +++ b/sentence_transformers/evaluation/TripletEvaluator.py @@ -1,11 +1,13 @@ +import numpy as np from sentence_transformers import SentenceTransformer from contextlib import nullcontext -from . import SentenceEvaluator, SimilarityFunction +from . import SentenceEvaluator +from sentence_transformers.similarity_functions import SimilarityFunction import logging import os import csv from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances -from typing import Dict, List, Optional +from typing import Dict, List, Optional, Union from ..readers import InputExample @@ -23,7 +25,7 @@ def __init__( anchors: List[str], positives: List[str], negatives: List[str], - main_distance_function: SimilarityFunction = None, + main_distance_function: Optional[Union[str, SimilarityFunction]] = None, name: str = "", batch_size: int = 16, show_progress_bar: bool = False, @@ -34,7 +36,8 @@ def __init__( :param anchors: Sentences to check similarity to. (e.g. a query) :param positives: List of positive sentences :param negatives: List of negative sentences - :param main_distance_function: One of 0 (Cosine), 1 (Euclidean) or 2 (Manhattan). Defaults to None, returning all 3. + :param main_distance_function: The distance function to use. If not specified, use cosine similarity, + dot product, Euclidean, and Manhattan. :param name: Name for the output :param batch_size: Batch size used to compute embeddings :param show_progress_bar: If true, prints a progress bar @@ -52,7 +55,7 @@ def __init__( assert len(self.anchors) == len(self.positives) assert len(self.anchors) == len(self.negatives) - self.main_distance_function = main_distance_function + self.main_distance_function = SimilarityFunction(main_distance_function) if main_distance_function else None self.batch_size = batch_size if show_progress_bar is None: @@ -93,7 +96,12 @@ def __call__( logger.info(f"TripletEvaluator: Evaluating the model on the {self.name} dataset{out_txt}:") num_triplets = 0 - num_correct_cos_triplets, num_correct_manhattan_triplets, num_correct_euclidean_triplets = 0, 0, 0 + ( + num_correct_cos_triplets, + num_correct_dot_triplets, + num_correct_manhattan_triplets, + num_correct_euclidean_triplets, + ) = 0, 0, 0, 0 with nullcontext() if self.truncate_dim is None else model.truncate_sentence_embeddings(self.truncate_dim): embeddings_anchors = model.encode( @@ -119,6 +127,10 @@ def __call__( pos_cos_distance = paired_cosine_distances(embeddings_anchors, embeddings_positives) neg_cos_distances = paired_cosine_distances(embeddings_anchors, embeddings_negatives) + # Dot score + pos_dot_distance = np.sum(embeddings_anchors * embeddings_positives, axis=-1) + neg_dot_distances = np.sum(embeddings_anchors * embeddings_negatives, axis=-1) + # Manhattan pos_manhattan_distance = paired_manhattan_distances(embeddings_anchors, embeddings_positives) neg_manhattan_distances = paired_manhattan_distances(embeddings_anchors, embeddings_negatives) @@ -133,6 +145,9 @@ def __call__( if pos_cos_distance[idx] < neg_cos_distances[idx]: num_correct_cos_triplets += 1 + if pos_dot_distance[idx] < neg_dot_distances[idx]: + num_correct_dot_triplets += 1 + if pos_manhattan_distance[idx] < neg_manhattan_distances[idx]: num_correct_manhattan_triplets += 1 @@ -140,10 +155,12 @@ def 
__call__(
                 num_correct_euclidean_triplets += 1
 
         accuracy_cos = num_correct_cos_triplets / num_triplets
+        accuracy_dot = num_correct_dot_triplets / num_triplets
         accuracy_manhattan = num_correct_manhattan_triplets / num_triplets
         accuracy_euclidean = num_correct_euclidean_triplets / num_triplets
 
         logger.info("Accuracy Cosine Distance: \t{:.2f}".format(accuracy_cos * 100))
+        logger.info("Accuracy Dot Product: \t{:.2f}".format(accuracy_dot * 100))
         logger.info("Accuracy Manhattan Distance:\t{:.2f}".format(accuracy_manhattan * 100))
         logger.info("Accuracy Euclidean Distance:\t{:.2f}\n".format(accuracy_euclidean * 100))
 
@@ -161,15 +178,17 @@ def __call__(
             writer.writerow([epoch, steps, accuracy_cos, accuracy_manhattan, accuracy_euclidean])
 
         self.primary_metric = {
-            SimilarityFunction.COSINE: "accuracy_cosine",
-            SimilarityFunction.EUCLIDEAN: "accuracy_euclidean",
-            SimilarityFunction.MANHATTAN: "accuracy_manhattan",
-        }.get(self.main_distance_function, "accuracy_max")
+            SimilarityFunction.COSINE: "cosine_accuracy",
+            SimilarityFunction.DOT_PRODUCT: "dot_accuracy",
+            SimilarityFunction.EUCLIDEAN: "euclidean_accuracy",
+            SimilarityFunction.MANHATTAN: "manhattan_accuracy",
+        }.get(self.main_distance_function, "max_accuracy")
         metrics = {
-            "accuracy_cosine": accuracy_cos,
-            "accuracy_manhattan": accuracy_manhattan,
-            "accuracy_euclidean": accuracy_euclidean,
-            "accuracy_max": max(accuracy_cos, accuracy_manhattan, accuracy_euclidean),
+            "cosine_accuracy": accuracy_cos,
+            "dot_accuracy": accuracy_dot,
+            "manhattan_accuracy": accuracy_manhattan,
+            "euclidean_accuracy": accuracy_euclidean,
+            "max_accuracy": max(accuracy_cos, accuracy_dot, accuracy_manhattan, accuracy_euclidean),
         }
         metrics = self.prefix_name_to_metrics(metrics, self.name)
         self.store_metrics_in_model_card_data(model, metrics)
diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py
index 633517574..df382374b 100644
--- a/sentence_transformers/model_card.py
+++ b/sentence_transformers/model_card.py
@@ -10,9 +10,6 @@
 from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Union
 import logging
 
-import accelerate
-import datasets
-import tokenizers
 import torch
 from torch import nn
 import transformers
@@ -206,6 +203,22 @@ def on_log(
 IGNORED_FIELDS = ["model", "trainer", "eval_results_dict"]
 
 
+def get_versions() -> Dict[str, Any]:
+    from accelerate import __version__ as accelerate_version
+    from datasets import __version__ as datasets_version
+    from tokenizers import __version__ as tokenizers_version
+
+    return {
+        "python": python_version(),
+        "sentence_transformers": sentence_transformers_version,
+        "transformers": transformers.__version__,
+        "torch": torch.__version__,
+        "accelerate": accelerate_version,
+        "datasets": datasets_version,
+        "tokenizers": tokenizers_version,
+    }
+
+
 @dataclass
 class SentenceTransformerModelCardData(CardData):
     """A dataclass storing data used in the model card.
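To make the renamed `TripletEvaluator` metric keys above concrete, a small usage sketch follows (assuming the dict-returning evaluator interface from this patch; the sentences and the printed value are illustrative, not real outputs):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")
evaluator = TripletEvaluator(
    anchors=["What is the capital of France?"],
    positives=["Paris is the capital of France."],
    negatives=["Berlin is the capital of Germany."],
)

# Evaluators now return a dict of metrics; keys follow the new
# "<function>_accuracy" convention instead of "accuracy_<function>"
metrics = evaluator(model)
print(metrics["cosine_accuracy"])  # e.g. 1.0
```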
@@ -290,18 +303,7 @@ class SentenceTransformerModelCardData(CardData): # Computed once, always unchanged pipeline_tag: str = field(default="sentence-similarity", init=False) library_name: str = field(default="sentence-transformers", init=False) - version: Dict[str, str] = field( - default_factory=lambda: { - "python": python_version(), - "sentence_transformers": sentence_transformers_version, - "transformers": transformers.__version__, - "torch": torch.__version__, - "accelerate": accelerate.__version__, - "datasets": datasets.__version__, - "tokenizers": tokenizers.__version__, - }, - init=False, - ) + version: Dict[str, str] = field(default_factory=get_versions, init=False) # Passed via `register_model` only model: Optional["SentenceTransformer"] = field(default=None, init=False, repr=False) @@ -899,6 +901,12 @@ def to_dict(self) -> Dict[str, Any]: super_dict["model_max_length"] = self.model.get_max_seq_length() super_dict["output_dimensionality"] = self.model.get_sentence_embedding_dimension() super_dict["model_string"] = str(self.model) + super_dict["similarity_fn_name"] = { + "cosine": "Cosine Similarity", + "dot": "Dot Product", + "euclidean": "Euclidean Distance", + "manhattan": "Manhattan Distance", + }.get(self.model.similarity_fn_name, self.model.similarity_fn_name.replace("_", " ").title()) self.first_save = False diff --git a/sentence_transformers/model_card_template.md b/sentence_transformers/model_card_template.md index 2362eb0c6..f503c6770 100644 --- a/sentence_transformers/model_card_template.md +++ b/sentence_transformers/model_card_template.md @@ -23,6 +23,7 @@ This is a [sentence-transformers](https://www.SBERT.net) model{% if base_model % {%- endif %} - **Maximum Sequence Length:** {{ model_max_length }} tokens - **Output Dimensionality:** {{ output_dimensionality }} tokens +- **Similarity Function:** {{ similarity_fn_name }} {% if train_datasets | selectattr("name") | list -%} - **Training Dataset{{"s" if train_datasets | selectattr("name") | list | length > 1 else ""}}:** {%- for dataset in (train_datasets | selectattr("name")) %} @@ -88,6 +89,11 @@ sentences = [ embeddings = model.encode(sentences) print(embeddings.shape) # [{{ (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) | length}}, {{ output_dimensionality | default(1024, true) }}] + +# Get the similarity scores for the embeddings +similarities = model.similarity(embeddings) +print(similarities.shape) +# [{{ (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) | length}}, {{ (predict_example or ["The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium."]) | length}}] ``` +[![HF Models](https://img.shields.io/badge/%F0%9F%A4%97-models-yellow)](https://huggingface.co/models?library=sentence-transformers) [![GitHub - License](https://img.shields.io/github/license/UKPLab/sentence-transformers?logo=github&style=flat&color=green)][#github-license] [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sentence-transformers?logo=pypi&style=flat&color=blue)][#pypi-package] [![PyPI - Package Version](https://img.shields.io/pypi/v/sentence-transformers?logo=pypi&style=flat&color=orange)][#pypi-package] -[![Conda - Platform](https://img.shields.io/conda/pn/conda-forge/sentence-transformers?logo=anaconda&style=flat)][#conda-forge-package] -[![Conda (channel 
only)](https://img.shields.io/conda/vn/conda-forge/sentence-transformers?logo=anaconda&style=flat&color=orange)][#conda-forge-package] [![Docs - GitHub.io](https://img.shields.io/static/v1?logo=github&style=flat&color=pink&label=docs&message=sentence-transformers)][#docs-package] - + [#github-license]: https://github.com/UKPLab/sentence-transformers/blob/master/LICENSE [#pypi-package]: https://pypi.org/project/sentence-transformers/ @@ -20,38 +16,24 @@ This framework provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar text are closer and can efficiently be found using cosine similarity. -We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases. +We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases. -Further, this framework allows an easy **[fine-tuning of custom embeddings models](https://www.sbert.net/docs/training/overview.html)**, to achieve maximal performance on your specific task. +Further, this framework allows an easy **[fine-tuning of custom embeddings models](https://www.sbert.net/docs/sentence_transformer/training_overview.html)**, to achieve maximal performance on your specific task. For the **full documentation**, see **[www.SBERT.net](https://www.sbert.net)**. -The following publications are integrated in this framework: - -- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019) -- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020) -- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL 2021) -- [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210) (arXiv 2020) -- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (arXiv 2021) -- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) (arXiv 2021) -- [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) (arXiv 2022) - ## Installation -We recommend **Python 3.8** or higher, **[PyTorch 1.11.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.32.0](https://github.com/huggingface/transformers)** or higher. The code does **not** work with Python 2.7. +We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**. 
**Install with pip**
 
-Install the *sentence-transformers* with `pip`:
-
 ```
 pip install -U sentence-transformers
 ```
 
 **Install with conda**
 
-You can install the *sentence-transformers* with `conda`:
-
 ```
 conda install -c conda-forge sentence-transformers
 ```
@@ -73,8 +55,6 @@ If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA
 
 See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documenation.
 
-[This example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/computing-embeddings/computing_embeddings.py) shows you how to use an already trained Sentence Transformer model to embed sentences for another task.
-
 First download a pretrained model.
 
 ````python
@@ -87,45 +67,40 @@ Then provide some sentences to the model.
 
 ````python
 sentences = [
-    "This framework generates embeddings for each input sentence",
-    "Sentences are passed as a list of string.",
-    "The quick brown fox jumps over the lazy dog.",
+    "The weather is lovely today.",
+    "It's so sunny outside!",
+    "He drove to the stadium.",
 ]
-sentence_embeddings = model.encode(sentences)
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# => (3, 384)
 ````
 
-And that's it already. We now have a list of numpy arrays with the embeddings.
+And that's already it. We now have a numpy array with the embeddings, one for each text. We can use these to compute similarities.
 
 ````python
-for sentence, embedding in zip(sentences, sentence_embeddings):
-    print("Sentence:", sentence)
-    print("Embedding:", embedding)
-    print("")
+similarities = model.similarity(embeddings, embeddings)
+print(similarities)
+# tensor([[1.0000, 0.6660, 0.1046],
+#         [0.6660, 1.0000, 0.1411],
+#         [0.1046, 0.1411, 1.0000]])
 ````
 
 ## Pre-Trained Models
 
-We provide a large list of [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.
-
-[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
+We provide a large list of [Pretrained Models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.
 
 ## Training
 
 This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task.
 
-See [Training Overview](https://www.sbert.net/docs/training/overview.html) for an introduction how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) how to train models on various datasets.
+See [Training Overview](https://www.sbert.net/docs/sentence_transformer/training_overview.html) for an introduction on how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) of how to train models on various datasets.
 
 Some highlights are:
 - Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
- Multi-Lingual and multi-task learning
 - Evaluation during training to find optimal model
-- [20+ loss-functions](https://www.sbert.net/docs/package_reference/losses.html) allowing to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss.
-
-## Performance
-
-Our models are evaluated extensively on 15+ datasets including challening domains like Tweets, Reddit, emails. They achieve by far the **best performance** from all available sentence embedding methods. Further, we provide several **smaller models** that are **optimized for speed**.
-
-[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
+- [20+ loss-functions](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html) allowing you to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss, etc.
 
 ## Application Examples
 
@@ -133,12 +108,11 @@ You can use this framework for:
 
 - [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)
 - [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
+- [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
+- [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
 - [Clustering](https://www.sbert.net/examples/applications/clustering/README.html)
 - [Paraphrase Mining](https://www.sbert.net/examples/applications/paraphrase-mining/README.html)
-
- - [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
-
- - [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
-
- - [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
-
- - [Text Summarization](https://www.sbert.net/examples/applications/text-summarization/README.html)
+- [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
 - [Multilingual Image Search, Clustering & Duplicate Detection](https://www.sbert.net/examples/applications/image-search/README.html)
 
 and many more use-cases.
@@ -193,7 +167,7 @@ If you use one of the multilingual models, feel free to cite our publication [Ma
 
 Please have a look at [Publications](https://www.sbert.net/docs/publications.html) for our different publications that are integrated into SentenceTransformers.
 
-Contact person: Tom Aarsen, [tom.aarsen@huggingface.co](mailto:tom.aarsen@huggingface.co)
+Maintainer: [Tom Aarsen](https://github.com/tomaarsen), 🤗 Hugging Face
 
 https://www.ukp.tu-darmstadt.de/
diff --git a/docs/Makefile b/docs/Makefile
index 484135cad..ae30537c3 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -1,3 +1,6 @@
 docs:
-	sphinx-build -c . -a -E .. _build
\ No newline at end of file
+	sphinx-build -c . -a -E .. _build
+
+docs-quick:
+	sphinx-build -c . ..
_build \ No newline at end of file diff --git a/docs/_static/css/custom.css b/docs/_static/css/custom.css index 7938469ed..0bab3e76f 100644 --- a/docs/_static/css/custom.css +++ b/docs/_static/css/custom.css @@ -24,4 +24,88 @@ dl.class > dt { .wy-side-nav-search { padding-top: 0px; -} \ No newline at end of file +} + +.components { + display: flex; + flex-flow: row wrap; +} + +.components > .box { + flex: 1; + margin: 0.5rem; + padding: 1rem; + border-style: solid; + border-width: 1px; + border-radius: 0.5rem; + border-color: rgb(55 65 81); + background-color: #e3e3e3; + color: #404040; /* Override the colors imposed by */ +} + +.components > .box:nth-child(1) > .header { + background-image: linear-gradient(to bottom right, #60a5fa, #3b82f6); +} + +.components > .box:nth-child(2) > .header { + background-image: linear-gradient(to bottom right, #fb923c, #f97316); +} + +.components > .box:nth-child(3) > .header { + background-image: linear-gradient(to bottom right, #f472b6, #ec4899); +} + +.components > .box:nth-child(4) > .header { + background-image: linear-gradient(to bottom right, #a78bfa, #8b5cf6); +} + +.components > .box:nth-child(5) > .header { + background-image: linear-gradient(to bottom right, #34d399, #10b981); +} + +.components > .optional { + background: repeating-linear-gradient( + 135deg, + #f1f1f1, + #f1f1f1 25px, + #e3e3e3 25px, + #e3e3e3 50px + ); +} + +.components > .box > .header { + border-style: solid; + border-width: 1px; + border-radius: 0.5rem; + border-color: rgb(55 65 81); + padding: 0.5rem; + text-align: center; + margin-bottom: 0.5rem; + font-weight: bold; + color: white; +} + +.sidebar p { + font-size: 100% !important; +} + +.training-arguments { + background-color: #f3f6f6; + border: 1px solid #e1e4e5; +} + +.training-arguments > .header { + font-weight: 700; + padding: 6px 12px; + background: #e1e4e5; +} + +.training-arguments > .table { + display: grid; + grid-template-columns: repeat(auto-fill, minmax(15em, 1fr)); +} + +.training-arguments > .table > a { + padding: 0.5rem; + border: 1px solid #e1e4e5; +} diff --git a/docs/_themes/sphinx_rtd_theme/footer.html b/docs/_themes/sphinx_rtd_theme/footer.html index c82e5ed45..4c4c2b429 100644 --- a/docs/_themes/sphinx_rtd_theme/footer.html +++ b/docs/_themes/sphinx_rtd_theme/footer.html @@ -24,9 +24,6 @@ © {% trans %}Copyright{% endtrans %} {{ copyright }} {%- endif %} {%- endif %} - - • Contact - {%- if build_id and build_url %} {# Translators: Build is a noun, not a verb #} diff --git a/docs/_themes/sphinx_rtd_theme/layout.html b/docs/_themes/sphinx_rtd_theme/layout.html index 2696eaaa2..3e30b0fa5 100644 --- a/docs/_themes/sphinx_rtd_theme/layout.html +++ b/docs/_themes/sphinx_rtd_theme/layout.html @@ -121,8 +121,12 @@
[HTML markup omitted]
    diff --git a/docs/_themes/sphinx_rtd_theme/theme.conf b/docs/_themes/sphinx_rtd_theme/theme.conf index fd0521f02..f26931470 100644 --- a/docs/_themes/sphinx_rtd_theme/theme.conf +++ b/docs/_themes/sphinx_rtd_theme/theme.conf @@ -8,7 +8,7 @@ canonical_url = analytics_id = collapse_navigation = True sticky_navigation = True -navigation_depth = 4 +navigation_depth = includehidden = True titles_only = logo_only = diff --git a/docs/changelog/v3.0.md b/docs/changelog/v3.0.md new file mode 100644 index 000000000..e69de29bb diff --git a/docs/conf.py b/docs/conf.py index e9c182541..ba2e61f0b 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -21,8 +21,8 @@ # -- Project information ----------------------------------------------------- project = "Sentence-Transformers" -copyright = str(datetime.datetime.now().year) + ", Nils Reimers" -author = "Nils Reimers" +copyright = str(datetime.datetime.now().year) +author = "Nils Reimers, Tom Aarsen" # -- General configuration --------------------------------------------------- @@ -30,7 +30,14 @@ # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. -extensions = ["sphinx.ext.autodoc", "recommonmark", "sphinx_markdown_tables"] +extensions = [ + "sphinx.ext.napoleon", + "sphinx.ext.autodoc", + "recommonmark", + "sphinx_markdown_tables", + "sphinx.ext.intersphinx", + "sphinx_tabs.tabs", +] # Add any paths that contain templates here, relative to this directory. templates_path = ["_templates"] @@ -38,7 +45,24 @@ # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This pattern also affects html_static_path and html_extra_path. -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "nr_examples"] +exclude_patterns = [ + "_build", + "Thumbs.db", + ".DS_Store", + "nr_examples", + "archived", + "dist", + "build", + "output", + "models", + "model_card_template.md", +] + +intersphinx_mapping = { + "datasets": ("https://huggingface.co/docs/datasets/main/en/", None), + "transformers": ("https://huggingface.co/docs/transformers/main/en/", None), + "torch": ("https://pytorch.org/docs/stable/", None), +} # -- Options for HTML output ------------------------------------------------- @@ -49,7 +73,11 @@ html_theme = "sphinx_rtd_theme" html_theme_path = ["_themes"] -html_theme_options = {"logo_only": True, "canonical_url": "https://www.sbert.net"} +html_theme_options = { + "logo_only": True, + "canonical_url": "https://www.sbert.net", + "collapse_navigation": False, +} # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, diff --git a/docs/contact.md b/docs/contact.md deleted file mode 100644 index 1ce3cf82e..000000000 --- a/docs/contact.md +++ /dev/null @@ -1,19 +0,0 @@ -# Contact - -In case of questions, feel free to open a [Github Issue](https://github.com/UKPLab/sentence-transformers/issues) or write me an email: [info@nils-reimers.de](mailto:info@nils-reimers.de). - -**SentenceTransformers is maintained by:** -Nils Reimers -Ubiquitous Knowledge Processing (UKP) Lab -FB 20 / Department of Computer Science -Technische Universität Darmstadt -Hochschulstr. 10 -64289 Darmstadt -Germany -[Website](https://www.informatik.tu-darmstadt.de/ukp/ukp_home/index.en.jsp) - - -**Privacy Policy** -The webserver / web hosting company might collect certain log files to prevent abuse of services. 
These log files can include: IP address, URL, date and time. - -We do not use any tracking services or cookies to track or re-identify visitors. \ No newline at end of file diff --git a/docs/cross_encoder/pretrained_models.md b/docs/cross_encoder/pretrained_models.md new file mode 100644 index 000000000..14715ccd8 --- /dev/null +++ b/docs/cross_encoder/pretrained_models.md @@ -0,0 +1,111 @@ +# Pretrained Models + +We have released various pre-trained Cross Encoder models via our [Cross Encoder Hugging Face organization](https://huggingface.co/models?author=cross-encoder). Additionally, numerous community CrossEncoder models have been publicly released on the Hugging Face Hub. + +Each of these models can be easily downloaded and used like so: + +```python +from sentence_transformers import CrossEncoder +import torch + +model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", default_activation_function=torch.nn.Sigmoid()) +scores = model.predict([ + ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."), + ("How many people live in Berlin?", "Berlin is well known for its museums."), +]) +# => array([0.9998173 , 0.01312432], dtype=float32) +``` + +Cross-Encoders require text pairs as inputs and output a score 0...1 (if the Sigmoid activation function is used). They do not work for individual sentences and they don't compute embeddings for individual texts. + +## MS MARCO +[MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset of real user queries from the Bing search engine with annotated relevant text passages. + +```eval_rst +.. note:: + You can initialize these models with ``default_activation_function=torch.nn.Sigmoid()`` to force the model to return scores between 0 and 1. Otherwise, the raw value can reasonably range between -10 and 10. +``` + +- [cross-encoder/ms-marco-TinyBERT-L-2-v2](https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L-2-v2) - MRR@10 on MS Marco Dev Set: 32.56 +- [cross-encoder/ms-marco-MiniLM-L-2-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-2-v2) - MRR@10 on MS Marco Dev Set: 34.85 +- [cross-encoder/ms-marco-MiniLM-L-4-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-4-v2) - MRR@10 on MS Marco Dev Set: 37.70 +- [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) - MRR@10 on MS Marco Dev Set: 39.01 +- [cross-encoder/ms-marco-MiniLM-L-12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2) - MRR@10 on MS Marco Dev Set: 39.02 + +For details on the usage, see [Retrieve & Re-Rank](../../examples/applications/retrieve_rerank/README.md) or [MS MARCO Cross-Encoders](../pretrained-models/ce-msmarco.md). + +## SQuAD (QNLI) + +QNLI is based on the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) ([HF](https://huggingface.co/datasets/rajpurkar/squad)) and was introduced by the [GLUE Benchmark](https://arxiv.org/abs/1804.07461) ([HF](https://huggingface.co/datasets/nyu-mll/glue)). Given a passage from Wikipedia, annotators created questions that are answerable by that passage.
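For illustration, here is a minimal sketch of how one of the QNLI models listed below might be used; the question/passage pair is invented, and the exact score range depends on the model's activation function:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/qnli-electra-base")
# Each pair is (question, passage); a higher score means the passage
# is more likely to contain the answer to the question.
scores = model.predict([
    ("Where is the Eiffel Tower located?", "The Eiffel Tower stands on the Champ de Mars in Paris, France."),
    ("Where is the Eiffel Tower located?", "Gustave Eiffel also designed the internal frame of the Statue of Liberty."),
])
```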
+ +- [cross-encoder/qnli-distilroberta-base](https://huggingface.co/cross-encoder/qnli-distilroberta-base) - Accuracy on QNLI dev set: 90.96 +- [cross-encoder/qnli-electra-base](https://huggingface.co/cross-encoder/qnli-electra-base) - Accuracy on QNLI dev set: 93.21 + +## STSbenchmark +The following models can be used like this: +```python +from sentence_transformers import CrossEncoder + +model = CrossEncoder("cross-encoder/stsb-roberta-base") +scores = model.predict([("It's a wonderful day outside.", "It's so sunny today!"), ("It's a wonderful day outside.", "He drove to work earlier.")]) +# => array([0.60443085, 0.00240758], dtype=float32) +``` + +They return a score 0...1 indicating the semantic similarity of the given sentence pair. +- [cross-encoder/stsb-TinyBERT-L-4](https://huggingface.co/cross-encoder/stsb-TinyBERT-L-4) - STSbenchmark test performance: 85.50 +- [cross-encoder/stsb-distilroberta-base](https://huggingface.co/cross-encoder/stsb-distilroberta-base) - STSbenchmark test performance: 87.92 +- [cross-encoder/stsb-roberta-base](https://huggingface.co/cross-encoder/stsb-roberta-base) - STSbenchmark test performance: 90.17 +- [cross-encoder/stsb-roberta-large](https://huggingface.co/cross-encoder/stsb-roberta-large) - STSbenchmark test performance: 91.47 + +## Quora Duplicate Questions +These models have been trained on the [Quora duplicate questions dataset](https://huggingface.co/datasets/sentence-transformers/quora-duplicates). They can be used like the STSb models and give a score 0...1 indicating the probability that two questions are duplicate questions. + +- [cross-encoder/quora-distilroberta-base](https://huggingface.co/cross-encoder/quora-distilroberta-base) - Average Precision dev set: 87.48 +- [cross-encoder/quora-roberta-base](https://huggingface.co/cross-encoder/quora-roberta-base) - Average Precision dev set: 87.80 +- [cross-encoder/quora-roberta-large](https://huggingface.co/cross-encoder/quora-roberta-large) - Average Precision dev set: 87.91 + +```eval_rst +.. note:: + These models don't work for question similarity. The questions *How to learn Java* and *How to learn Python* will get a low score, as they are not duplicates. For question similarity, the respective bi-encoder trained on the Quora dataset yields much more meaningful results. +``` + +## NLI +Given two sentences, do they contradict each other, does one entail the other, or are they neutral? The following models were trained on the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) datasets.
+ +- [cross-encoder/nli-deberta-v3-base](https://huggingface.co/cross-encoder/nli-deberta-v3-base) - Accuracy on MNLI mismatched set: 90.04 +- [cross-encoder/nli-deberta-base](https://huggingface.co/cross-encoder/nli-deberta-base) - Accuracy on MNLI mismatched set: 88.08 +- [cross-encoder/nli-deberta-v3-xsmall](https://huggingface.co/cross-encoder/nli-deberta-v3-xsmall) - Accuracy on MNLI mismatched set: 87.77 +- [cross-encoder/nli-deberta-v3-small](https://huggingface.co/cross-encoder/nli-deberta-v3-small) - Accuracy on MNLI mismatched set: 87.55 +- [cross-encoder/nli-roberta-base](https://huggingface.co/cross-encoder/nli-roberta-base) - Accuracy on MNLI mismatched set: 87.47 +- [cross-encoder/nli-MiniLM2-L6-H768](https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768) - Accuracy on MNLI mismatched set: 86.89 +- [cross-encoder/nli-distilroberta-base](https://huggingface.co/cross-encoder/nli-distilroberta-base) - Accuracy on MNLI mismatched set: 83.98 + +```python +from sentence_transformers import CrossEncoder + +model = CrossEncoder("cross-encoder/nli-deberta-v3-base") +scores = model.predict([ + ("A man is eating pizza", "A man eats something"), + ("A black race car starts up in front of a crowd of people.", "A man is driving down a lonely road."), +]) + +# Convert scores to labels +label_mapping = ["contradiction", "entailment", "neutral"] +labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)] +# => ['entailment', 'contradiction'] +``` + +## Community Models + +Some notable models from the community include: + +- [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) +- [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) +- [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) +- [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) +- [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise) +- [jinaai/jina-reranker-v1-tiny-en](https://huggingface.co/jinaai/jina-reranker-v1-tiny-en) +- [jinaai/jina-reranker-v1-turbo-en](https://huggingface.co/jinaai/jina-reranker-v1-turbo-en) +- [mixedbread-ai/mxbai-rerank-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1) +- [mixedbread-ai/mxbai-rerank-base-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-base-v1) +- [mixedbread-ai/mxbai-rerank-large-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v1) +- [maidalun1020/bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) \ No newline at end of file diff --git a/docs/cross_encoder/training/examples.md b/docs/cross_encoder/training/examples.md new file mode 100644 index 000000000..2235e3bbd --- /dev/null +++ b/docs/cross_encoder/training/examples.md @@ -0,0 +1,16 @@ + +# Training Examples + +See the following examples for how to train Cross-Encoders: + +- [training_stsbenchmark.py](../../../examples/training/cross-encoder/training_stsbenchmark.py) - Example of how to train for Semantic Textual Similarity (STS) on the STS benchmark dataset. +- [training_quora_duplicate_questions.py](../../../examples/training/cross-encoder/training_quora_duplicate_questions.py) - Example of how to train a Cross-Encoder to predict whether two questions are duplicates. Uses Quora Duplicate Questions as the training dataset. +- [training_nli.py](../../../examples/training/cross-encoder/training_nli.py) - Example of a multi-label classification task for Natural Language Inference (NLI). + +```eval_rst +..
toctree:: + :maxdepth: 1 + :caption: Supervised Learning + + ../../../examples/training/ms_marco/cross_encoder_README +``` \ No newline at end of file diff --git a/docs/cross_encoder/training_overview.md b/docs/cross_encoder/training_overview.md new file mode 100644 index 000000000..a8902e6ec --- /dev/null +++ b/docs/cross_encoder/training_overview.md @@ -0,0 +1,65 @@ + +# Training Overview + +```eval_rst +.. note:: + The CrossEncoder training approach was not updated in v3.0, when `training Sentence Transformer models <../sentence_transformer/training_overview.html>`_ was improved. Improving CrossEncoder training is planned for a future major update. +``` + +The `CrossEncoder` class is a wrapper around the Hugging Face `AutoModelForSequenceClassification`, but with some methods to make training and predicting scores a little bit easier. The saved models are 100% compatible with Hugging Face and can also be loaded with their classes. + +First, you need some sentence pair data. You can either have a continuous score, like: + +```eval_rst + +.. sidebar:: Documentation + + - :class:`~sentence_transformers.readers.InputExample` +``` + +```python +from sentence_transformers import InputExample + +train_samples = [ + InputExample(texts=["sentence1", "sentence2"], label=0.3), + InputExample(texts=["Another", "pair"], label=0.8), +] +``` + +Or you have distinct classes as in the [training_nli.py](../../examples/training/cross-encoder/training_nli.py) example: +```python +from sentence_transformers import InputExample + +label2int = {"contradiction": 0, "entailment": 1, "neutral": 2} +train_samples = [ + InputExample(texts=["sentence1", "sentence2"], label=label2int["neutral"]), + InputExample(texts=["Another", "pair"], label=label2int["entailment"]), +] +``` + +Then, you define the base model and the number of labels. You can take any [Hugging Face pre-trained model](https://huggingface.co/models) that is compatible with AutoModel: ```python from sentence_transformers import CrossEncoder model = CrossEncoder('distilroberta-base', num_labels=1) ``` For binary tasks and tasks with continuous scores (like STS), we set `num_labels=1`. For classification tasks, we set it to the number of labels we have. +```eval_rst + +We start the training by calling :meth:`CrossEncoder.fit `: + +.. sidebar:: Documentation + + - :class:`~sentence_transformers.cross_encoder.CrossEncoder` + - :meth:`CrossEncoder.fit ` + +:: + + model.fit( + train_dataloader=train_dataloader, + evaluator=evaluator, + epochs=num_epochs, + warmup_steps=warmup_steps, + output_path=model_save_path, + ) +``` \ No newline at end of file diff --git a/docs/cross_encoder/usage/usage.rst b/docs/cross_encoder/usage/usage.rst new file mode 100644 index 000000000..c36d5a5f4 --- /dev/null +++ b/docs/cross_encoder/usage/usage.rst @@ -0,0 +1,75 @@ + +Usage +===== + +Characteristics of Cross Encoder (a.k.a. reranker) models: + +1. Calculates a **similarity score** given **pairs of texts**. +2. Generally provides **superior performance** compared to a Sentence Transformer (a.k.a. bi-encoder) model. +3. Often **slower** than a Sentence Transformer model, as it requires computation for each pair rather than each text. +4. Due to the previous 2 characteristics, Cross Encoders are often used to **re-rank the top-k results** from a Sentence Transformer model. + +Once you have `installed `_ Sentence Transformers, you can easily use Cross Encoder models: + +.. sidebar:: Documentation + + 1. :class:`~sentence_transformers.cross_encoder.CrossEncoder` + 2. :meth:`CrossEncoder.predict ` + 3.
:meth:`CrossEncoder.rank ` + + .. note:: + MS Marco models return logits rather than scores between 0 and 1. Load the :class:`~sentence_transformers.cross_encoder.CrossEncoder` with ``default_activation_function=torch.nn.Sigmoid()`` to get scores between 0 and 1. This does not affect the ranking. + +:: + + from sentence_transformers import CrossEncoder + + # 1. Load a pre-trained CrossEncoder model + model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2") + + # 2. Predict scores for a pair of sentences + scores = model.predict([ + ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."), + ("How many people live in Berlin?", "Berlin is well known for its museums."), + ]) + # => array([ 8.607138 , -4.3200774], dtype=float32) + + # 3. Rank a list of passages for a query + query = "How many people live in Berlin?" + passages = [ + "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.", + "Berlin is well known for its museums.", + "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.", + "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.", + "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019", + "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.", + "Berlin is subdivided into 12 boroughs or districts (Bezirke).", + "In 2015, the total labour force in Berlin was 1.85 million.", + "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.", + "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.", + ] + ranks = model.rank(query, passages) + + # Print the scores + print("Query:", query) + for rank in ranks: + print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}") + """ + Query: How many people live in Berlin? + 8.92 The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union. + 8.61 Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers. + 8.24 An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population. + 7.60 In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991. + 6.35 In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs. + 5.42 Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union. + 3.45 In 2015, the total labour force in Berlin was 1.85 million. + 0.33 Berlin is subdivided into 12 boroughs or districts (Bezirke). + -4.24 The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019 + -4.32 Berlin is well known for its museums. + """ + +.. 
toctree:: + :maxdepth: 1 + :caption: Tasks + + ../../../examples/applications/retrieve_rerank/README diff --git a/docs/hugging_face.md b/docs/hugging_face.md deleted file mode 100644 index 19348ea46..000000000 --- a/docs/hugging_face.md +++ /dev/null @@ -1,96 +0,0 @@ -# Hugging Face 🤗 - -## The Hugging Face Hub - -In addition to the official [pre-trained models](https://www.sbert.net/docs/pretrained_models.html), you can find over 500 `sentence-transformer` models on the [Hugging Face Hub](http://hf.co/models?library=sentence-transformers&sort=downloads). - -All models on the Hugging Face Hub come with the following: -1. An [automatically generated model card](https://huggingface.co/docs/hub/models-cards#what-are-model-cards) with a description, example code snippets, architecture overview, and more. -2. [Metadata tags](https://huggingface.co/docs/hub/models-cards#model-card-metadata) that help for discoverability and contain additional information such as a usage license. -3. An [interactive widget](https://huggingface.co/docs/hub/models-widgets) you can use to play with the model directly in the browser. -4. An [Inference API](https://huggingface.co/docs/hub/models-inference) that allows you to make inference requests. - - - -## Using Hugging Face models - -Any pre-trained models from the Hub can be loaded with a single line of code: - -```py -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("model_name") -``` - -You can even click `Use in sentence-transformers` to get a code snippet that you can copy and paste! - -
-[image markup omitted]
    - -Here is an example that loads the [multi-qa-MiniLM-L6-cos-v1 model](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) and uses it to encode sentences and then compute the distance between them for doing semantic search. - -```py -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1") - -query_embedding = model.encode("How big is London") -passage_embedding = model.encode([ - "London has 9,787,426 inhabitants at the 2011 census", - "London is known for its finacial district", -]) - -print("Similarity:", util.dot_score(query_embedding, passage_embedding)) -``` - -Here is another example, this time using the [clips/mfaq model](https://huggingface.co/clips/mfaq) for multilingual FAQ retrieval. After embedding the query and the answers, we perform a semantic search to find the most relevant answer. - -```py -from sentence_transformers import SentenceTransformer, util - -question = "How many models can I host on HuggingFace?" -answer_1 = "All plans come with unlimited private models and datasets." -answer_2 = "AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem." -answer_3 = "Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job." - -model = SentenceTransformer("clips/mfaq") -query_embedding = model.encode(question) -corpus_embeddings = model.encode([answer_1, answer_2, answer_3]) - -print(util.semantic_search(query_embedding, corpus_embeddings)) -``` - -## Sharing your models - -Once you've installed the [Hub Client Library](https://huggingface.co/docs/huggingface_hub/quick-start), you can login through your terminal with your Hugging Face account. - -```bash -pip install huggingface_hub -huggingface-cli login -``` - -Then, you can share your SentenceTransformers models by calling the [`push_to_hub` method](https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.push_to_hub) from a trained model. By default, the model will be uploaded to your account, but you can upload to an [organization](https://huggingface.co/docs/hub/organizations) by providing the organization as a part of the `repo_id`, e.g. `model.push_to_hub("my_organization/my_model_name")`. `push_to_hub` automatically generates a model card, an inference widget, example code snippets, and more. - -```py -from sentence_transformers import SentenceTransformer - -# Load or train a model -model.push_to_hub("my_new_model") -``` - -You can automatically add to the Hub's model card a list of datasets you used to train the model with the argument `train_datasets: Optional[List[str]] = None)`. See the "Datasets used to train" section in the [ITESM/sentece-embeddings-BETO](https://huggingface.co/ITESM/sentece-embeddings-BETO) model for an example of the final result. - -```py -model.push_to_hub("my_new_model", train_datasets=["GEM/wiki_lingua", "code_search_net"]) -``` - -## Sharing your embeddings - -The Hugging Face Hub can also be used to store and share any embeddings you generate. You can export your embeddings to CSV, ZIP, Pickle, or any other format, and then upload them to the Hub as a [Dataset](https://huggingface.co/docs/hub/datasets-adding). Read the ["Getting Started With Embeddings" blog post](https://huggingface.co/blog/getting-started-with-embeddings#2-host-embeddings-for-free-on-the-hugging-face-hub) for more information. 
- -## Additional resources - -* [Hugging Face Hub docs](https://huggingface.co/docs/hub/index) -* Integration with Hub [announcement](https://huggingface.co/blog/sentence-transformers-in-the-hub). diff --git a/docs/img/hf-logo.svg b/docs/img/hf-logo.svg new file mode 100644 index 000000000..18797de4f --- /dev/null +++ b/docs/img/hf-logo.svg @@ -0,0 +1,21 @@ +[SVG markup omitted] diff --git a/docs/installation.md b/docs/installation.md index 6fc0b3036..2f694b475 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,31 +1,39 @@ # Installation -We recommend **Python 3.8** or higher, **[PyTorch 1.11.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.32.0](https://github.com/huggingface/transformers)** or higher. +We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**. -## Install SentenceTransformers - -**Install with pip** +## Install with pip Install the *sentence-transformers* with `pip`: ``` pip install -U sentence-transformers ``` -**Install with conda** +## Install with conda -Apple silicon Installation of *sentence-transformers* +Apple silicon installation of *sentence-transformers* ``` conda install -c conda-forge sentence-transformers ``` -**Install from source** +## Install from source + +You can install *sentence-transformers* directly from source to take advantage of the bleeding edge `master` branch rather than the latest stable release: +``` +pip install git+https://github.com/UKPLab/sentence-transformers +``` + +## Editable install -Alternatively, you can also clone the latest version from the [repository](https://github.com/UKPLab/sentence-transformers) and install it directly from the source code: -```` +If you want to make changes to *sentence-transformers*, you will need an editable install. Clone the repository and install it with these commands: +``` +git clone https://github.com/UKPLab/sentence-transformers +cd sentence-transformers pip install -e . -```` +``` + +These commands will link the new `sentence-transformers` folder into your Python library paths, such that this folder is used when importing `sentence-transformers`. ## Install PyTorch with CUDA support -If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA Version. Follow -[PyTorch - Get Started](https://pytorch.org/get-started/locally/) for further details how to install PyTorch. +To use a GPU/CUDA, you must install PyTorch with CUDA support. Follow [PyTorch - Get Started](https://pytorch.org/get-started/locally/) for installation steps. \ No newline at end of file diff --git a/docs/package_reference/SentenceTransformer.md b/docs/package_reference/SentenceTransformer.md deleted file mode 100644 index cb4d36c9a..000000000 --- a/docs/package_reference/SentenceTransformer.md +++ /dev/null @@ -1,15 +0,0 @@ -# SentenceTransformer - -This page documents the properties and methods when you load a SentenceTransformer model: -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("model-name") -``` - -```eval_rst -..
autoclass:: sentence_transformers.SentenceTransformer - :members: - :exclude-members: save_to_hub - -``` \ No newline at end of file diff --git a/docs/package_reference/cross_encoder/cross_encoder.md b/docs/package_reference/cross_encoder/cross_encoder.md new file mode 100644 index 000000000..30c1fcf9d --- /dev/null +++ b/docs/package_reference/cross_encoder/cross_encoder.md @@ -0,0 +1,14 @@ +# CrossEncoder + +## CrossEncoder +For an introduction to Cross-Encoders, see [Cross-Encoders](../../examples/applications/cross-encoder/README.md). +```eval_rst +.. autoclass:: sentence_transformers.cross_encoder.CrossEncoder + :members: +``` + +## Training Inputs + +```eval_rst +.. autoclass:: sentence_transformers.readers.InputExample +``` \ No newline at end of file diff --git a/docs/package_reference/cross_encoder.md b/docs/package_reference/cross_encoder/evaluation.md similarity index 73% rename from docs/package_reference/cross_encoder.md rename to docs/package_reference/cross_encoder/evaluation.md index fc8737d0c..23f9d1265 100644 --- a/docs/package_reference/cross_encoder.md +++ b/docs/package_reference/cross_encoder/evaluation.md @@ -1,19 +1,31 @@ -# cross_encoder -For an introduction to Cross-Encoders, see [Cross-Encoders](../../examples/applications/cross-encoder/README.md). -```eval_rst -.. autoclass:: sentence_transformers.cross_encoder.CrossEncoder - :members: -``` - - -## Evaluation +# Evaluation CrossEncoders have their own evaluation classes, which are in `sentence_transformers.cross_encoder.evaluation`. +## CEBinaryAccuracyEvaluator ```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CEBinaryAccuracyEvaluator +``` +## CEBinaryClassificationEvaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CEBinaryClassificationEvaluator +``` + +## CECorrelationEvaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CECorrelationEvaluator +``` + +## CEF1Evaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CEF1Evaluator +``` + +## CESoftmaxAccuracyEvaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CESoftmaxAccuracyEvaluator +``` + +## CERerankingEvaluator +```eval_rst .. autoclass:: sentence_transformers.cross_encoder.evaluation.CERerankingEvaluator ``` \ No newline at end of file diff --git a/docs/package_reference/cross_encoder/index.rst b/docs/package_reference/cross_encoder/index.rst new file mode 100644 index 000000000..f27406944 --- /dev/null +++ b/docs/package_reference/cross_encoder/index.rst @@ -0,0 +1,9 @@ + +Cross Encoder +============= + +.. toctree:: + :hidden: + + cross_encoder + evaluation \ No newline at end of file diff --git a/docs/package_reference/quantization.md b/docs/package_reference/quantization.md deleted file mode 100644 index 4e47fd112..000000000 --- a/docs/package_reference/quantization.md +++ /dev/null @@ -1,7 +0,0 @@ -# quantization -`sentence_transformers.quantization` defines different helpful functions to quantize. - -```eval_rst -..
automodule:: sentence_transformers.quantization - :members: quantize_embeddings, semantic_search_faiss, semantic_search_usearch -``` diff --git a/docs/package_reference/sentence_transformer/SentenceTransformer.md b/docs/package_reference/sentence_transformer/SentenceTransformer.md new file mode 100644 index 000000000..1a1b3e71c --- /dev/null +++ b/docs/package_reference/sentence_transformer/SentenceTransformer.md @@ -0,0 +1,20 @@ +# SentenceTransformer + +## SentenceTransformer +```eval_rst +.. autoclass:: sentence_transformers.SentenceTransformer + :members: + :inherited-members: fit, old_fit + :exclude-members: save_to_hub, add_module, append, apply, buffers, children, extra_repr, forward, get_buffer, get_extra_state, get_parameter, get_submodule, ipu, load_state_dict, modules, named_buffers, named_children, named_modules, named_parameters, parameters, register_backward_hook, register_buffer, register_forward_hook, register_forward_pre_hook, register_full_backward_hook, register_full_backward_pre_hook, register_load_state_dict_post_hook, register_module, register_parameter, register_state_dict_pre_hook, requires_grad_, set_extra_state, share_memory, state_dict, to_empty, type, xpu, zero_grad +``` + +## SentenceTransformerModelCardData +```eval_rst +.. autoclass:: sentence_transformers.model_card.SentenceTransformerModelCardData +``` + +## SimilarityFunction +```eval_rst +.. autoclass:: sentence_transformers.SimilarityFunction + :members: +``` \ No newline at end of file diff --git a/docs/package_reference/datasets.md b/docs/package_reference/sentence_transformer/datasets.md similarity index 100% rename from docs/package_reference/datasets.md rename to docs/package_reference/sentence_transformer/datasets.md diff --git a/docs/package_reference/evaluation.md b/docs/package_reference/sentence_transformer/evaluation.md similarity index 67% rename from docs/package_reference/evaluation.md rename to docs/package_reference/sentence_transformer/evaluation.md index eb1c46c6e..df5fb258c 100644 --- a/docs/package_reference/evaluation.md +++ b/docs/package_reference/sentence_transformer/evaluation.md @@ -1,17 +1,52 @@ # Evaluation `sentence_transformers.evaluation` defines different classes that can be used to evaluate the model during training. +## BinaryClassificationEvaluator ```eval_rst .. autoclass:: sentence_transformers.evaluation.BinaryClassificationEvaluator +``` + +## EmbeddingSimilarityEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.EmbeddingSimilarityEvaluator +``` + +## InformationRetrievalEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.InformationRetrievalEvaluator -.. autoclass:: sentence_transformers.evaluation.LabelAccuracyEvaluator +``` + +## MSEEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.MSEEvaluator -.. autoclass:: sentence_transformers.evaluation.MSEEvaluatorFromDataFrame +``` + +## ParaphraseMiningEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.ParaphraseMiningEvaluator +``` + +## RerankingEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.RerankingEvaluator +``` + +## SentenceEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.SentenceEvaluator +``` + +## SequentialEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.SequentialEvaluator +``` + +## TranslationEvaluator +```eval_rst .. autoclass:: sentence_transformers.evaluation.TranslationEvaluator +``` + +## TripletEvaluator +```eval_rst ..
autoclass:: sentence_transformers.evaluation.TripletEvaluator ``` diff --git a/docs/package_reference/sentence_transformer/index.rst b/docs/package_reference/sentence_transformer/index.rst new file mode 100644 index 000000000..063ed31e1 --- /dev/null +++ b/docs/package_reference/sentence_transformer/index.rst @@ -0,0 +1,15 @@ + +Sentence Transformer +==================== + +.. toctree:: + :hidden: + + SentenceTransformer + trainer + training_args + losses + evaluation + datasets + models + quantization \ No newline at end of file diff --git a/docs/package_reference/losses.md b/docs/package_reference/sentence_transformer/losses.md similarity index 93% rename from docs/package_reference/losses.md rename to docs/package_reference/sentence_transformer/losses.md index 65475427d..db5b4a66c 100644 --- a/docs/package_reference/losses.md +++ b/docs/package_reference/sentence_transformer/losses.md @@ -1,7 +1,7 @@ # Losses `sentence_transformers.losses` defines different loss functions that can be used to fine-tune embedding models on training data. The choice of loss function plays a critical role when fine-tuning the model. It determines how well our embedding model will work for the specific downstream task. -Sadly, there is no "one size fits all" loss function. Which loss function is suitable depends on the available training data and on the target task. Consider checking out the [Loss Overview](../training/loss_overview.html) to help narrow down your choice of loss function(s). +Sadly, there is no "one size fits all" loss function. Which loss function is suitable depends on the available training data and on the target task. Consider checking out the [Loss Overview](../../sentence_transformer/loss_overview.html) to help narrow down your choice of loss function(s). ## BatchAllTripletLoss ```eval_rst @@ -57,8 +57,7 @@ Sadly, there is no "one size fits all" loss function. Which loss function is sui ## CosineSimilarityLoss -![SBERT Siamese Network Architecture](../img/SBERT_Siamese_Network.png "SBERT Siamese Architecture") - +SBERT Siamese Network Architecture For each sentence pair, we pass sentence A and sentence B through our network which yields the embeddings *u* and *v*. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. diff --git a/docs/package_reference/models.md b/docs/package_reference/sentence_transformer/models.md similarity index 100% rename from docs/package_reference/models.md rename to docs/package_reference/sentence_transformer/models.md diff --git a/docs/package_reference/sentence_transformer/quantization.md b/docs/package_reference/sentence_transformer/quantization.md new file mode 100644 index 000000000..cc53a0ef5 --- /dev/null +++ b/docs/package_reference/sentence_transformer/quantization.md @@ -0,0 +1,12 @@ +# quantization +`sentence_transformers.quantization` defines different helpful functions to perform embedding quantization. + +```eval_rst +.. note:: + `Embedding Quantization <../../../examples/applications/embedding-quantization/README.html>`_ differs from model quantization. The former shrinks the size of embeddings such that semantic search/retrieval is faster and requires less memory and disk space. The latter refers to lowering the precision of the model weights to speed up inference. This page only shows documentation for the former. +``` + +```eval_rst +..
automodule:: sentence_transformers.quantization + :members: quantize_embeddings, semantic_search_faiss, semantic_search_usearch +``` diff --git a/docs/package_reference/sentence_transformer/trainer.md b/docs/package_reference/sentence_transformer/trainer.md new file mode 100644 index 000000000..64e03f84e --- /dev/null +++ b/docs/package_reference/sentence_transformer/trainer.md @@ -0,0 +1,11 @@ + +# Trainer + +## SentenceTransformerTrainer + +```eval_rst +.. autoclass:: sentence_transformers.trainer.SentenceTransformerTrainer + :members: + :inherited-members: + :exclude-members: autocast_smart_context_manager, collect_features, compute_loss_context_manager, evaluation_loop, floating_point_ops, get_decay_parameter_names, get_optimizer_cls_and_kwargs, init_hf_repo, log_metrics, metrics_format, num_examples, num_tokens, predict, prediction_loop, prediction_step, save_metrics, save_model, save_state, training_step +``` \ No newline at end of file diff --git a/docs/package_reference/sentence_transformer/training_args.md b/docs/package_reference/sentence_transformer/training_args.md new file mode 100644 index 000000000..0c68fe97c --- /dev/null +++ b/docs/package_reference/sentence_transformer/training_args.md @@ -0,0 +1,21 @@ + +# Training Arguments + +## SentenceTransformerTrainingArguments +```eval_rst +.. autoclass:: sentence_transformers.training_args.SentenceTransformerTrainingArguments + :members: + :inherited-members: +``` + +## BatchSamplers +```eval_rst +.. autoclass:: sentence_transformers.training_args.BatchSamplers + :members: +``` + +## MultiDatasetBatchSamplers +```eval_rst +.. autoclass:: sentence_transformers.training_args.MultiDatasetBatchSamplers + :members: +``` \ No newline at end of file diff --git a/docs/package_reference/util.md b/docs/package_reference/util.md index a3f30fb15..690b4cd19 100644 --- a/docs/package_reference/util.md +++ b/docs/package_reference/util.md @@ -1,7 +1,15 @@ # util `sentence_transformers.util` defines different helpful functions to work with text embeddings. +## Helper Functions ```eval_rst .. automodule:: sentence_transformers.util - :members: cos_sim, dot_score, paraphrase_mining, semantic_search, community_detection, http_get, truncate_embeddings + :members: paraphrase_mining, semantic_search, community_detection, http_get, truncate_embeddings, normalize_embeddings +``` + +## Similarity Metrics + +```eval_rst +.. automodule:: sentence_transformers.util + :members: cos_sim, pairwise_cos_sim, dot_score, pairwise_dot_score, manhattan_sim, pairwise_manhattan_sim, euclidean_sim, pairwise_euclidean_sim ``` diff --git a/docs/pretrained-models/msmarco-v1.md b/docs/pretrained-models/msmarco-v1.md index be537f2d7..3123bfc03 100644 --- a/docs/pretrained-models/msmarco-v1.md +++ b/docs/pretrained-models/msmarco-v1.md @@ -6,7 +6,6 @@ The training data consists of over 500k examples, while the complete corpus con ## Version History -As we work on the topic, we will publish updated (and improved) models. ### v1 Version 1 models were trained on the training set of the MS Marco Passage retrieval task. The models were trained using in-batch negative sampling via the MultipleNegativesRankingLoss with a scaling factor of 20 and a batch size of 128.
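As a hedged sketch of that v1 recipe, the following shows roughly how in-batch negative training with `MultipleNegativesRankingLoss` (scaling factor 20, batch size 128) might be reproduced with the `SentenceTransformerTrainer` introduced in this PR; the base model and the two-row dataset are placeholders rather than the original MS MARCO setup:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Placeholder (query, positive passage) pairs standing in for the MS MARCO training set
train_dataset = Dataset.from_dict({
    "query": [
        "how many people live in berlin",
        "what is the capital of france",
    ],
    "passage": [
        "Berlin had a population of 3,520,031 registered inhabitants.",
        "Paris is the capital and most populous city of France.",
    ],
})

model = SentenceTransformer("distilroberta-base")
# In-batch negative sampling with a scaling factor of 20
loss = MultipleNegativesRankingLoss(model, scale=20)

args = SentenceTransformerTrainingArguments(
    output_dir="output/msmarco-sketch",
    per_device_train_batch_size=128,  # the batch size used for the v1 models
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```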
diff --git a/docs/pretrained-models/msmarco-v2.md b/docs/pretrained-models/msmarco-v2.md index c9a88e4df..23a528d6d 100644 --- a/docs/pretrained-models/msmarco-v2.md +++ b/docs/pretrained-models/msmarco-v2.md @@ -34,6 +34,5 @@ As a baseline, we show the results for lexical search with BM25 using Elasticsearch ## Version History -As we work on the topic, we will publish updated (and improved) models. - - [Version 1](msmarco-v1.md) diff --git a/docs/pretrained-models/msmarco-v3.md b/docs/pretrained-models/msmarco-v3.md index e5134fd97..8ba14c798 100644 --- a/docs/pretrained-models/msmarco-v3.md +++ b/docs/pretrained-models/msmarco-v3.md @@ -58,7 +58,6 @@ If they received a low score by the cross-encoder, we saved them as hard negativ We then trained the v2 models with these new hard negatives. ## Version History -As we work on the topic, we will publish updated (and improved) models. - [Version 2](msmarco-v2.md) - [Version 1](msmarco-v1.md) diff --git a/docs/pretrained-models/msmarco-v5.md b/docs/pretrained-models/msmarco-v5.md index d3f29ca71..9f93c0741 100644 --- a/docs/pretrained-models/msmarco-v5.md +++ b/docs/pretrained-models/msmarco-v5.md @@ -65,7 +65,6 @@ If they received a low score by the cross-encoder, we saved them as hard negativ We then trained the v2 models with these new hard negatives. ## Version History -As we work on the topic, we will publish updated (and improved) models. - [Version 3](msmarco-v3.md) - [Version 2](msmarco-v2.md) diff --git a/docs/pretrained_cross-encoders.md b/docs/pretrained_cross-encoders.md deleted file mode 100644 index e95097e8d..000000000 --- a/docs/pretrained_cross-encoders.md +++ /dev/null @@ -1,93 +0,0 @@ -# Pretrained Cross-Encoders - -This page lists available **pretrained Cross-Encoders**. Cross-Encoders require the input of a text pair and output a score 0...1. They do not work for individual sentences and they don't compute embeddings for individual texts. - -![BiEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png) - -## MS MARCO -[MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset with real user queries from Bing search engine with annotated relevant text passages. - -These models can be used like this: -```python -from sentence_transformers import CrossEncoder - -model = CrossEncoder("model_name", max_length=512) -scores = model.predict([("Query1", "Paragraph1"), ("Query1", "Paragraph2")]) - -# For Example -scores = model.predict([ - ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."), - ("How many people live in Berlin?", "Berlin is well known for its museums."), -]) -``` - -- **cross-encoder/ms-marco-TinyBERT-L-2-v2** - MRR@10 on MS Marco Dev Set: 32.56 -- **cross-encoder/ms-marco-MiniLM-L-2-v2** - MRR@10 on MS Marco Dev Set: 34.85 -- **cross-encoder/ms-marco-MiniLM-L-4-v2** - MRR@10 on MS Marco Dev Set: 37.70 -- **cross-encoder/ms-marco-MiniLM-L-6-v2** - MRR@10 on MS Marco Dev Set: 39.01 -- **cross-encoder/ms-marco-MiniLM-L-12-v2** - MRR@10 on MS Marco Dev Set: 39.02 - - -For details on the usage, see [Applications - Information Retrieval](../examples/applications/retrieve_rerank/README.md) - - -[MS MARCO Cross-Encoders - More details](pretrained-models/ce-msmarco.md) - -## SQuAD (QNLI) - -QNLI is based on the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) and was introduced by the [GLUE Benchmark](https://arxiv.org/abs/1804.07461).
Given a passage from Wikipedia, annotators created questions that are answerable by that passage. - -- **cross-encoder/qnli-distilroberta-base** - Accuracy on QNLI dev set: 90.96 -- **cross-encoder/qnli-electra-base** - Accuracy on QNLI dev set: 93.21 - - -## STSbenchmark -The following models can be used like this: -```python -from sentence_transformers import CrossEncoder - -model = CrossEncoder("model_name") -scores = model.predict([("Sent A1", "Sent B1"), ("Sent A2", "Sent B2")]) -``` - -They return a score 0...1 indicating the semantic similarity of the given sentence pair. -- **cross-encoder/stsb-TinyBERT-L-4** - STSbenchmark test performance: 85.50 -- **cross-encoder/stsb-distilroberta-base** - STSbenchmark test performance: 87.92 -- **cross-encoder/stsb-roberta-base** - STSbenchmark test performance: 90.17 -- **cross-encoder/stsb-roberta-large** - STSbenchmark test performance: 91.47 - -## Quora Duplicate Questions -These models have been trained on the [Quora duplicate questions dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). They can used like the STSb models and give a score 0...1 indicating the probability that two questions are duplicate questions. - -- **cross-encoder/quora-distilroberta-base** - Average Precision dev set: 87.48 -- **cross-encoder/quora-roberta-base** - Average Precision dev set: 87.80 -- **cross-encoder/quora-roberta-large** - Average Precision dev set: 87.91 - -Note: The model don't work for question similarity. The question *How to learn Java* and *How to learn Python* will get a low score, as these questions are not duplicates. For question similarity, the respective bi-encoder trained on the Quora dataset yields much more meaningful results. - - - -## NLI -Given two sentences, are these contradicting each other, entailing one the other or are these netural? The following models were trained on the [SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) datasets. 
-- **cross-encoder/nli-deberta-v3-base** - Accuracy on MNLI mismatched set: 90.04 -- **cross-encoder/nli-deberta-base** - Accuracy on MNLI mismatched set: 88.08 -- **cross-encoder/nli-deberta-v3-xsmall** - Accuracy on MNLI mismatched set: 87.77 -- **cross-encoder/nli-deberta-v3-small** - Accuracy on MNLI mismatched set: 87.55 -- **cross-encoder/nli-roberta-base** - Accuracy on MNLI mismatched set: 87.47 -- **cross-encoder/nli-MiniLM2-L6-H768** - Accuracy on MNLI mismatched set: 86.89 -- **cross-encoder/nli-distilroberta-base** - Accuracy on MNLI mismatched set: 83.98 - -```python -from sentence_transformers import CrossEncoder - -model = CrossEncoder("model_name") -scores = model.predict([ - ("A man is eating pizza", "A man eats something"), - ("A black race car starts up in front of a crowd of people.", "A man is driving down a lonely road."), -]) - -# Convert scores to labels -label_mapping = ["contradiction", "entailment", "neutral"] -labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)] -``` - diff --git a/docs/quickstart.md b/docs/quickstart.md deleted file mode 100644 index cf6c003aa..000000000 --- a/docs/quickstart.md +++ /dev/null @@ -1,96 +0,0 @@ -# Quickstart -Once you have [installed](installation.md) Sentence Transformers, the usage is simple: -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Our sentences we like to encode -sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of string.", - "The quick brown fox jumps over the lazy dog.", -] - -# Sentences are encoded by calling model.encode() -sentence_embeddings = model.encode(sentences) - -# Print the embeddings -for sentence, embedding in zip(sentences, sentence_embeddings): - print("Sentence:", sentence) - print("Embedding:", embedding) - print("") -``` - - -With `SentenceTransformer('all-MiniLM-L6-v2')` we define which sentence transformer model we like to load. In this example, we load [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which is a MiniLM model finetuned on a large dataset of over 1 billion training pairs. - -BERT (and other transformer networks) output for each token in our input text an embedding. In order to create a fixed-sized sentence embedding out of this, the model applies mean pooling, i.e., the output embeddings for all tokens are averaged to yield a fixed-sized vector. - -## Comparing Sentence Similarities - -The sentences (texts) are mapped such that sentences with similar meanings are close in vector space. One common method to measure the similarity in vector space is to use [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). 
For two sentences, this can be done like this: - -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Sentences are encoded by calling model.encode() -emb1 = model.encode("This is a red cat with a hat.") -emb2 = model.encode("Have you seen my red cat?") - -cos_sim = util.cos_sim(emb1, emb2) -print("Cosine-Similarity:", cos_sim) -``` - -If you have a list with more sentences, you can use the following code example: -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("all-MiniLM-L6-v2") - -sentences = [ - "A man is eating food.", - "A man is eating a piece of bread.", - "The girl is carrying a baby.", - "A man is riding a horse.", - "A woman is playing violin.", - "Two men pushed carts through the woods.", - "A man is riding a white horse on an enclosed ground.", - "A monkey is playing drums.", - "Someone in a gorilla costume is playing a set of drums.", -] - -# Encode all sentences -embeddings = model.encode(sentences) - -# Compute cosine similarity between all pairs -cos_sim = util.cos_sim(embeddings, embeddings) - -# Add all pairs to a list with their cosine similarity score -all_sentence_combinations = [] -for i in range(len(cos_sim) - 1): - for j in range(i + 1, len(cos_sim)): - all_sentence_combinations.append([cos_sim[i][j], i, j]) - -# Sort list by the highest cosine similarity score -all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True) - -print("Top-5 most similar pairs:") -for score, i, j in all_sentence_combinations[0:5]: - print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j])) -``` - -See on the left the *Usage* sections for more examples how to use SentenceTransformers. - -## Pre-Trained Models -Various pre-trained models exists optimized for many tasks exists. For a full list, see **[Pretrained Models](pretrained_models.md)**. - - - -## Training your own Embeddings - -Training your own sentence embeddings models for all type of use-cases is easy and requires often only minimal coding effort. For a comprehensive tutorial, see [Training/Overview](training/overview.md). - -You can also extend easily existent sentence embeddings models to **further languages**. For details, see [Multi-Lingual Training](../examples/training/multilingual/README). diff --git a/docs/quickstart.rst b/docs/quickstart.rst new file mode 100644 index 000000000..38461820f --- /dev/null +++ b/docs/quickstart.rst @@ -0,0 +1,160 @@ +Quickstart +========== + +Sentence Transformer +-------------------- + +Characteristics of Sentence Transformer (a.k.a. bi-encoder) models: + +1. Calculates a **fixed-size vector representation (embedding)** given **texts or images**. +2. Embedding calculation is often **efficient**, embedding similarity calculation is **very fast**. +3. Applicable for a **wide range of tasks**, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more. +4. Often used as a **first step in a two-step retrieval process**, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder. + +Once you have `installed `_ Sentence Transformers, you can easily use Sentence Transformer models: + +.. sidebar:: Documentation + + 1. :class:`SentenceTransformer ` + 2. :meth:`SentenceTransformer.encode ` + 3.
:meth:`SentenceTransformer.similarity ` + + **Other useful methods and links:** + + - :meth:`SentenceTransformer.similarity_pairwise ` + - `SentenceTransformer > Usage <./sentence_transformer/usage/usage.html>`_ + - `SentenceTransformer > Pretrained Models <./sentence_transformer/pretrained_models.html>`_ + - `SentenceTransformer > Training Overview <./sentence_transformer/training_overview.html>`_ + - `SentenceTransformer > Dataset Overview <./sentence_transformer/dataset_overview.html>`_ + - `SentenceTransformer > Loss Overview <./sentence_transformer/loss_overview.html>`_ + - `SentenceTransformer > Training Examples <./sentence_transformer/training/examples.html>`_ + +:: + + from sentence_transformers import SentenceTransformer + + # 1. Load a pretrained Sentence Transformer model + model = SentenceTransformer("all-MiniLM-L6-v2") + + # The sentences to encode + sentences = [ + "The weather is lovely today.", + "It's so sunny outside!", + "He drove to the stadium.", + ] + + # 2. Calculate embeddings by calling model.encode() + embeddings = model.encode(sentences) + print(embeddings.shape) + # (3, 384) + + # 3. Calculate the embedding similarities + similarities = model.similarity(embeddings, embeddings) + print(similarities) + # tensor([[1.0000, 0.6660, 0.1046], + # [0.6660, 1.0000, 0.1411], + # [0.1046, 0.1411, 1.0000]]) + +With ``SentenceTransformer("all-MiniLM-L6-v2")`` we pick which `Sentence Transformer model `_ we load. In this example, we load `all-MiniLM-L6-v2 `_, which is a MiniLM model finetuned on a large dataset of over 1 billion training pairs. Using `SentenceTransformer.similarity() <./package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity>`_, we compute the similarity between all pairs of sentences. As expected, the similarity between the first two sentences (0.6660) is higher than the similarity between the first and the third sentence (0.1046) or the second and the third sentence (0.1411). + +Finetuning Sentence Transformer models is easy and requires only a few lines of code. For more information, see the `Training Overview <./sentence_transformer/training_overview.html>`_ section. + +Cross Encoder +------------- + +Characteristics of Cross Encoder (a.k.a. reranker) models: + +1. Calculates a **similarity score** given **pairs of texts**. +2. Generally provides **superior performance** compared to a Sentence Transformer (a.k.a. bi-encoder) model. +3. Often **slower** than a Sentence Transformer model, as it requires computation for each pair rather than each text. +4. Due to the previous 2 characteristics, Cross Encoders are often used to **re-rank the top-k results** from a Sentence Transformer model. + +The usage for Cross Encoder (a.k.a. reranker) models is similar to Sentence Transformers: + +.. sidebar:: Documentation + + 1. :class:`CrossEncoder ` + 2. :meth:`CrossEncoder.rank ` + 3. :meth:`CrossEncoder.predict ` + + **Other useful methods and links:** + + - `CrossEncoder > Usage <./cross_encoder/usage/usage.html>`_ + - `CrossEncoder > Pretrained Models <./cross_encoder/pretrained_models.html>`_ + - `CrossEncoder > Training Overview <./cross_encoder/training_overview.html>`_ + - `CrossEncoder > Dataset Overview <./cross_encoder/dataset_overview.html>`_ + - `CrossEncoder > Loss Overview <./cross_encoder/loss_overview.html>`_ + - `CrossEncoder > Training Examples <./cross_encoder/training/examples.html>`_ + +:: + + from sentence_transformers.cross_encoder import CrossEncoder + + # 1.
Load a pretrained CrossEncoder model + model = CrossEncoder("cross-encoder/stsb-distilroberta-base") + + # We want to compute the similarity between the query sentence... + query = "A man is eating pasta." + + # ... and all sentences in the corpus + corpus = [ + "A man is eating food.", + "A man is eating a piece of bread.", + "The girl is carrying a baby.", + "A man is riding a horse.", + "A woman is playing violin.", + "Two men pushed carts through the woods.", + "A man is riding a white horse on an enclosed ground.", + "A monkey is playing drums.", + "A cheetah is running behind its prey.", + ] + + # 2. We rank all sentences in the corpus for the query + ranks = model.rank(query, corpus) + + # Print the scores + print("Query: ", query) + for rank in ranks: + print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}") + """ + Query: A man is eating pasta. + 0.67 A man is eating food. + 0.34 A man is eating a piece of bread. + 0.08 A man is riding a horse. + 0.07 A man is riding a white horse on an enclosed ground. + 0.01 The girl is carrying a baby. + 0.01 Two men pushed carts through the woods. + 0.01 A monkey is playing drums. + 0.01 A woman is playing violin. + 0.01 A cheetah is running behind its prey. + """ + + # 3. Alternatively, you can also manually compute the score between two sentences + import numpy as np + + sentence_combinations = [[query, sentence] for sentence in corpus] + scores = model.predict(sentence_combinations) + + # Sort the scores in decreasing order to get the corpus indices + ranked_indices = np.argsort(scores)[::-1] + print("Scores:", scores) + print("Indices:", ranked_indices) + """ + Scores: [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378, 0.00536814, 0.06676237, 0.00534825, 0.00516717] + Indices: [0 1 3 6 2 5 7 4 8] + """ + +With ``CrossEncoder("cross-encoder/stsb-distilroberta-base")`` we pick which `CrossEncoder model <./cross_encoder/pretrained_models.html>`_ we load. In this example, we load `cross-encoder/stsb-distilroberta-base `_, which is a `DistilRoBERTa `_ model finetuned on the `STS Benchmark `_ dataset. + +Next Steps +---------- + +Consider reading one of the following sections next: + +* `Sentence Transformers > Usage <./sentence_transformer/usage/usage.html>`_ +* `Sentence Transformers > Pretrained Models <./sentence_transformer/pretrained_models.html>`_ +* `Sentence Transformers > Training Overview <./sentence_transformer/training_overview.html>`_ +* `Sentence Transformers > Training Examples > Multilingual Models <../examples/training/multilingual/README.html>`_ +* `Cross Encoder > Usage <./cross_encoder/usage/usage.html>`_ +* `Cross Encoder > Pretrained Models <./cross_encoder/pretrained_models.html>`_ + diff --git a/docs/requirements.txt b/docs/requirements.txt index 0721a6b8a..bbc151601 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -4,4 +4,5 @@ sphinx<4 Jinja2<3.1 sphinx_markdown_tables recommonmark==0.7.1 +sphinx-tabs==3.4.5 -e .. \ No newline at end of file diff --git a/docs/sentence_transformer/dataset_overview.md b/docs/sentence_transformer/dataset_overview.md new file mode 100644 index 000000000..95923a3e8 --- /dev/null +++ b/docs/sentence_transformer/dataset_overview.md @@ -0,0 +1,121 @@ +# Dataset Overview + +```eval_rst +.. hint:: + + **Quickstart:** Find `curated datasets `_ or `community datasets `_, choose a loss function via this `loss overview `_, and `verify `_ that it works with your dataset. 
+```
+
+It is important that your dataset format matches your loss function (or that you choose a loss function that matches your dataset format). See [Training Overview > Dataset Format](./training_overview.html#dataset-format) to learn how to verify whether a dataset format works with a loss function.
+
+In practice, most dataset configurations will take one of four forms:
+
+- **Positive Pair**: A pair of related sentences. This can be used both for symmetric tasks (semantic textual similarity) and asymmetric tasks (semantic search), with examples including pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (`query`, `response`), or pairs of (`source_language`, `target_language`). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.
+  - **Examples:** [sentence-transformers/sentence-compression](https://huggingface.co/datasets/sentence-transformers/sentence-compression), [sentence-transformers/coco-captions](https://huggingface.co/datasets/sentence-transformers/coco-captions), [sentence-transformers/codesearchnet](https://huggingface.co/datasets/sentence-transformers/codesearchnet), [sentence-transformers/natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions), [sentence-transformers/gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq), [sentence-transformers/squad](https://huggingface.co/datasets/sentence-transformers/squad), [sentence-transformers/wikihow](https://huggingface.co/datasets/sentence-transformers/wikihow), [sentence-transformers/eli5](https://huggingface.co/datasets/sentence-transformers/eli5)
+- **Triplets**: (anchor, positive, negative) text triplets. These datasets don't need labels.
+  - **Examples:** [sentence-transformers/quora-duplicates](https://huggingface.co/datasets/sentence-transformers/quora-duplicates), [nirantk/triplets](https://huggingface.co/datasets/nirantk/triplets), [sentence-transformers/all-nli](https://huggingface.co/datasets/sentence-transformers/all-nli)
+- **Pair with Similarity Score**: A pair of sentences with a score indicating their similarity. Common examples are "Semantic Textual Similarity" datasets.
+  - **Examples:** [sentence-transformers/stsb](https://huggingface.co/datasets/sentence-transformers/stsb), [PhilipMay/stsb_multi_mt](https://huggingface.co/datasets/PhilipMay/stsb_multi_mt).
+- **Texts with Classes**: A text with its corresponding class. This data format is easily converted by loss functions into three sentences (triplets) where the first is an "anchor", the second a "positive" of the same class as the anchor, and the third a "negative" of a different class.
+  - **Examples:** [trec](https://huggingface.co/datasets/trec), [yahoo_answers_topics](https://huggingface.co/datasets/yahoo_answers_topics).
+
+Note that it is often simple to transform a dataset from one format to another, such that it works with your loss function of choice.
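+
+```eval_rst
+For example, a minimal sketch of such a conversion: turning the "pair-class" subset of `sentence-transformers/all-nli <https://huggingface.co/datasets/sentence-transformers/all-nli>`_ into the "Positive Pair" format, assuming the usual NLI label mapping where ``0`` means entailment::
+
+    from datasets import load_dataset
+
+    # Keep only the entailment pairs, then drop the label column so that
+    # the two remaining text columns form positive pairs
+    dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train")
+    positive_pairs = dataset.filter(lambda example: example["label"] == 0)
+    positive_pairs = positive_pairs.remove_columns("label")
+    print(positive_pairs)
+```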
+ +## Datasets on the Hugging Face Hub + +```eval_rst +The `Datasets library `_ (``pip install datasets``) allows you to load datasets from the Hugging Face Hub with the :func:`~datasets.load_dataset` function:: + + from datasets import load_dataset + + # Indicate the dataset id from the Hub + dataset_id = "sentence-transformers/natural-questions" + dataset = load_dataset(dataset_id, split="train") + """ + Dataset({ + features: ['query', 'answer'], + num_rows: 100231 + }) + """ + print(dataset[0]) + """ + { + 'query': 'when did richmond last play in a preliminary final', + 'answer': "Richmond Football Club Richmond began 2017 with 5 straight wins, a feat it had not achieved since 1995. A series of close losses hampered the Tigers throughout the middle of the season, including a 5-point loss to the Western Bulldogs, 2-point loss to Fremantle, and a 3-point loss to the Giants. Richmond ended the season strongly with convincing victories over Fremantle and St Kilda in the final two rounds, elevating the club to 3rd on the ladder. Richmond's first final of the season against the Cats at the MCG attracted a record qualifying final crowd of 95,028; the Tigers won by 51 points. Having advanced to the first preliminary finals for the first time since 2001, Richmond defeated Greater Western Sydney by 36 points in front of a crowd of 94,258 to progress to the Grand Final against Adelaide, their first Grand Final appearance since 1982. The attendance was 100,021, the largest crowd to a grand final since 1986. The Crows led at quarter time and led by as many as 13, but the Tigers took over the game as it progressed and scored seven straight goals at one point. They eventually would win by 48 points – 16.12 (108) to Adelaide's 8.12 (60) – to end their 37-year flag drought.[22] Dustin Martin also became the first player to win a Premiership medal, the Brownlow Medal and the Norm Smith Medal in the same season, while Damien Hardwick was named AFL Coaches Association Coach of the Year. Richmond's jump from 13th to premiers also marked the biggest jump from one AFL season to the next." + } + """ +``` + +For more information on how to manipulate your dataset see the [Datasets Documentation](https://huggingface.co/docs/datasets/access). + +```eval_rst +.. tip:: + + It's common for Hugging Face Datasets to contain extraneous columns, e.g. sample_id, metadata, source, type, etc. You can use :meth:`Dataset.remove_columns ` to remove these columns, as they will be used as inputs otherwise. You can also use :meth:`Dataset.select_columns ` to keep only the desired columns. +``` + +## Pre-existing Datasets + +The [Hugging Face Hub](https://huggingface.co/datasets) hosts 150k+ datasets, many of which can be converted for training embedding models. +We are aiming to tag all Hugging Face datasets that work out of the box with Sentence Transformers with `sentence-transformers`, allowing you to easily find them by browsing to [https://huggingface.co/datasets?other=sentence-transformers](https://huggingface.co/datasets?other=sentence-transformers). We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks. 
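+
+```eval_rst
+For example, a minimal sketch of loading one of the tagged datasets from the table below and inspecting its columns; column names differ per dataset, so check them before training::
+
+    from datasets import load_dataset
+
+    # Load a dataset tagged with `sentence-transformers` and inspect its columns
+    dataset = load_dataset("sentence-transformers/gooaq", split="train")
+    print(dataset.column_names)
+    # e.g. ['question', 'answer']
+```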
+ +These are some of the popular pre-existing datasets tagged as ``sentence-transformers`` that can be used to train and fine-tune SentenceTransformer models: + +| Dataset | Description | +|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------| +| [GooAQ](https://huggingface.co/datasets/sentence-transformers/gooaq) | (Question, Answer) pairs from Google auto suggest | +| [Yahoo Answers](https://huggingface.co/datasets/sentence-transformers/yahoo-answers) | (Title+Question, Answer), (Title, Answer), (Title, Question), (Question, Answer) pairs from Yahoo Answers | +| [MS MARCO Triplets (msmarco-distilbert-base-tas-b)](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-tas-b) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (msmarco-distilbert-base-v3)](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (msmarco-MiniLM-L-6-v3)](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-MiniLM-L-6-v3) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-cls-dot-v2)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v2) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-cls-dot-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-mean-dot-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-mean-dot-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (mpnet-margin-mse-mean-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-mpnet-margin-mse-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (co-condenser-margin-mse-cls-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-cls-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-mnrl-mean-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-sym-mnrl-mean-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (distilbert-margin-mse-sym-mnrl-mean-v2)](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v2) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets 
(co-condenser-margin-mse-sym-mnrl-mean-v1)](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [MS MARCO Triplets (BM25)](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives | +| [Stack Exchange Duplicates](https://huggingface.co/datasets/sentence-transformers/stackexchange-duplicates) | (Title, Title), (Title+Body, Title+Body), (Body, Body) pairs of duplicate questions from StackExchange | +| [ELI5](https://huggingface.co/datasets/sentence-transformers/eli5) | (Question, Answer) pairs from ELI5 dataset | +| [SQuAD](https://huggingface.co/datasets/sentence-transformers/squad) | (Question, Answer) pairs from SQuAD dataset | +| [WikiHow](https://huggingface.co/datasets/sentence-transformers/wikihow) | (Summary, Text) pairs from WikiHow | +| [Amazon Reviews 2018](https://huggingface.co/datasets/sentence-transformers/amazon-reviews) | (Title, review) pairs from Amazon Reviews | +| [Natural Questions](https://huggingface.co/datasets/sentence-transformers/natural-questions) | (Query, Answer) pairs from the Natural Questions dataset | +| [Amazon QA](https://huggingface.co/datasets/sentence-transformers/amazon-qa) | (Question, Answer) pairs from Amazon | +| [S2ORC](https://huggingface.co/datasets/sentence-transformers/s2orc) | (Title, Abstract), (Abstract, Citation), (Title, Citation) pairs of scientific papers | +| [Quora Duplicates](https://huggingface.co/datasets/sentence-transformers/quora-duplicates) | Duplicate question pairs from Quora | +| [WikiAnswers](https://huggingface.co/datasets/sentence-transformers/wikianswers-duplicates) | Duplicate question pairs from WikiAnswers | +| [AGNews](https://huggingface.co/datasets/sentence-transformers/agnews) | (Title, Description) pairs of news articles from the AG News dataset | +| [AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli) | (Anchor, Entailment, Contradiction) triplets from SNLI + MultiNLI | +| [NPR](https://huggingface.co/datasets/sentence-transformers/npr) | (Title, Body) pairs from the npr.org website | +| [SPECTER](https://huggingface.co/datasets/sentence-transformers/specter) | (Title, Positive Title, Negative Title) triplets of Scientific Publications from Specter | +| [Simple Wiki](https://huggingface.co/datasets/sentence-transformers/simple-wiki) | (English, Simple English) pairs from Wikipedia | +| [PAQ](https://huggingface.co/datasets/sentence-transformers/paq) | (Query, Answer) from the Probably-Asked Questions dataset | +| [altlex](https://huggingface.co/datasets/sentence-transformers/altlex) | (English, Simple English) pairs from Wikipedia | +| [CC News](https://huggingface.co/datasets/sentence-transformers/ccnews) | (Title, article) pairs from the CC News dataset | +| [CodeSearchNet](https://huggingface.co/datasets/sentence-transformers/codesearchnet) | (Comment, Code) pairs from open source libraries on GitHub | +| [Sentence Compression](https://huggingface.co/datasets/sentence-transformers/sentence-compression) | (Long text, Short text) pairs from the Sentence Compression dataset | +| [Trivia QA](https://huggingface.co/datasets/sentence-transformers/trivia-qa) | (Query, Answer) pairs from the TriviaQA dataset | +| [Flickr30k Captions](https://huggingface.co/datasets/sentence-transformers/flickr30k-captions) | Duplicate captions from the Flickr30k 
dataset | +| [xsum](https://huggingface.co/datasets/sentence-transformers/xsum) | (News Article, Summary) pairs from XSUM dataset | +| [Coco Captions](https://huggingface.co/datasets/sentence-transformers/coco-captions) | Duplicate captions from the Coco Captions dataset | +| [Parallel Sentences: Europarl](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: Global Voices](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-global-voices) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: MUSE](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-muse) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: JW300](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-jw300) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: News Commentary](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-news-commentary) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: OpenSubtitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opensubtitles) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: Talks](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-talks) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: Tatoeba](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-tatoeba) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: WikiMatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikimatrix) | (English, Non-English) pairs across numerous languages | +| [Parallel Sentences: WikiTitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikititles) | (English, Non-English) pairs across numerous languages | + +```eval_rst + +.. note:: + + We advise users to tag datasets that can be used for training embedding models with ``sentence-transformers`` by adding ``tags: sentence-transformers``. We would also gladly accept high quality datasets to be added to the list above for all to see and use. +``` \ No newline at end of file diff --git a/docs/training/loss_overview.md b/docs/sentence_transformer/loss_overview.md similarity index 88% rename from docs/training/loss_overview.md rename to docs/sentence_transformer/loss_overview.md index 33ec36f41..f46b0418e 100644 --- a/docs/training/loss_overview.md +++ b/docs/sentence_transformer/loss_overview.md @@ -1,10 +1,14 @@ # Loss Overview -Loss functions play a critical role in the performance of your fine-tuned model. Sadly, there is no "one size fits all" loss function. Ideally, this overview should help narrow down your choice of loss function(s) by matching them to your data formats. +Loss functions play a critical role in the performance of your fine-tuned model. Sadly, there is no "one size fits all" loss function. Ideally, this table should help narrow down your choice of loss function(s) by matching them to your data formats. -**Note**: you can often convert one training data format into another, allowing more loss functions to be viable for your scenario. 
For example, `(sentence_A, sentence_B) pairs` with `class` labels can be converted into `(anchor, positive, negative) triplets` by sampling sentences with the same or different classes. +```eval_rst +.. note:: -| Texts | Labels | Appropriate Loss Functions | + You can often convert one training data format into another, allowing more loss functions to be viable for your scenario. For example, ``(sentence_A, sentence_B) pairs`` with ``class`` labels can be converted into ``(anchor, positive, negative) triplets`` by sampling sentences with the same or different classes. +``` + +| Inputs | Labels | Appropriate Loss Functions | |-----------------------------------------------|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `single sentences` | `class` | `BatchAllTripletLoss`
    `BatchHardSoftMarginTripletLoss`
    `BatchHardTripletLoss`
    `BatchSemiHardTripletLoss` | | `single sentences` | `none` | `ContrastiveTensionLoss`
    `DenoisingAutoEncoderLoss` | @@ -41,4 +45,19 @@ For example, when finetuning a small model to behave more like a larger & strong In practice, not all loss functions get used equally often. The most common scenarios are: * `(anchor, positive) pairs` without any labels: MultipleNegativesRankingLoss is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. CachedMultipleNegativesRankingLoss is often used to increase the batch size, resulting in superior performance. -* `(sentence_A, sentence_B) pairs` with a `float similarity score`: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance. \ No newline at end of file +* `(sentence_A, sentence_B) pairs` with a `float similarity score`: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance. + +## Custom Loss Functions + +```eval_rst +Advanced users can create and train with their own loss functions. Custom loss functions only have a few requirements: + +- They must be a subclass of :class:`torch.nn.Module`. +- They must have ``model`` as the first argument in the constructor. +- They must implement a ``forward`` method that accepts ``sentence_features`` and ``labels``. The former is a list of tokenized batches, one element for each column. These tokenized batches can be fed directly to the ``model`` being trained to produce embeddings. The latter is an optional tensor of labels. The method must return a single loss value. + +To get full support with the automatic model card generation, you may also wish to implement: + +- a ``get_config_dict`` method that returns a dictionary of loss parameters. +- a ``citation`` property so your work gets cited in all models that train with the loss. +``` \ No newline at end of file diff --git a/docs/pretrained_models.md b/docs/sentence_transformer/pretrained_models.md similarity index 50% rename from docs/pretrained_models.md rename to docs/sentence_transformer/pretrained_models.md index b35947802..ac653ef85 100644 --- a/docs/pretrained_models.md +++ b/docs/sentence_transformer/pretrained_models.md @@ -1,42 +1,77 @@ # Pretrained Models -We provide various pre-trained models. Using these models is easy: +We provide various pre-trained Sentence Transformers models via our Sentence Transformers Hugging Face organization. Additionally, over 6,000 community Sentence Transformers models have been publicly released on the Hugging Face Hub. All models can be found here: +* **Original models**: [Sentence Transformers Hugging Face organization](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers). +* **Community models**: [All Sentence Transformer models on Hugging Face](https://huggingface.co/models?library=sentence-transformers). + +Each of these models can be easily downloaded and used like so: + +```eval_rst +.. sidebar:: Original Models + + For the original models from the `Sentence Transformers Hugging Face organization `_, it is not necessary to include the model author or organization prefix. For example, this snippet loads `sentence-transformers/all-mpnet-base-v2 `_. 
+```
 
 ```python
 from sentence_transformers import SentenceTransformer
 
-model = SentenceTransformer("model_name")
+# Load https://huggingface.co/sentence-transformers/all-mpnet-base-v2
+model = SentenceTransformer("all-mpnet-base-v2")
+embeddings = model.encode([
+    "The weather is lovely today.",
+    "It's so sunny outside!",
+    "He drove to the stadium.",
+])
+similarities = model.similarity(embeddings, embeddings)
 ```
 
-All models are hosted on the [HuggingFace Model Hub](https://huggingface.co/sentence-transformers).
+```eval_rst
+.. note::
+    Consider using the `Massive Text Embedding Benchmark (MTEB) leaderboard `_ as a source of inspiration for strong Sentence Transformer models. Be wary:
 
-## Model Overview
+    - **Model sizes**: it is recommended to filter out the larger models, as they might not be feasible to run without high-end hardware.
+    - **Experimentation is key**: models that perform well on the leaderboard do not necessarily do well on your tasks; it is **crucial** to experiment with various promising models.
+```
+
+## Original Models
 
-The following table provides an overview of (selected) models. They have been extensively evaluated for their quality to embedded sentences (Performance Sentence Embeddings) and to embedded search queries & paragraphs (Performance Semantic Search).
+The following table provides an overview of a selection of our models. They have been extensively evaluated for their quality when embedding sentences (Performance Sentence Embeddings) and when embedding search queries & paragraphs (Performance Semantic Search).
 
-The **all-*** models were trained on all available training data (more than 1 billion training pairs) and are designed as **general purpose** models. The **all-mpnet-base-v2** model provides the best quality, while **all-MiniLM-L6-v2** is 5 times faster and still offers good quality. Toggle *All models* to see all evaluated models or visit [HuggingFace Model Hub](https://huggingface.co/models?library=sentence-transformers) to view all existing sentence-transformers models.
+The **all-*** models were trained on all available training data (more than 1 billion training pairs) and are designed as **general purpose** models. The [**all-mpnet-base-v2**](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model provides the best quality, while [**all-MiniLM-L6-v2**](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) is 5 times faster and still offers good quality. Toggle *All models* to see all evaluated original models.
 
-
+
 
 ---
 
-## Semantic Search
+## Semantic Search Models
+
+The following models have been specifically trained for **Semantic Search**: Given a question / search query, these models are able to find relevant text passages. For more details, see [Usage > Semantic Search](../../examples/applications/semantic-search/README.md).
+
+```eval_rst
+.. sidebar:: Documentation
 
-The following models have been specifically trained for **Semantic Search**: Given a question / search query, these models are able to find relevant text passages. For more details, see [Usage - Semantic Search](../examples/applications/semantic-search/README.md).
+    #. `multi-qa-mpnet-base-cos-v1 `_
+    #. :class:`SentenceTransformer `
+    #. :meth:`SentenceTransformer.encode `
+    #. :meth:`SentenceTransformer.similarity `
+
+```
 
 ```python
-from sentence_transformers import SentenceTransformer, util
+from sentence_transformers import SentenceTransformer
 
-model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
+model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")
 
 query_embedding = model.encode("How big is London")
-passage_embedding = model.encode([
-    "London has 9,787,426 inhabitants at the 2011 census",
+passage_embeddings = model.encode([
     "London is known for its financial district",
+    "London has 9,787,426 inhabitants at the 2011 census",
+    "The United Kingdom is the fourth largest exporter of goods in the world",
 ])
 
-print("Similarity:", util.dot_score(query_embedding, passage_embedding))
+similarity = model.similarity(query_embedding, passage_embeddings)
+# => tensor([[0.4659, 0.6142, 0.2697]])
 ```
 
@@ -45,94 +80,80 @@ print("Similarity:", util.dot_score(query_embedding, passage_embedding))
 
 The following models have been trained on [215M question-answer pairs](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1#training) from various sources and domains, including StackExchange, Yahoo Answers, Google & Bing search queries and many more. These models perform well across many search tasks and domains.
 
-
-These models were tuned to be used with dot-product:
+These models were tuned to be used with the dot-product similarity score:
 
 | Model | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
 | --- | :---: | :---: |
-| [multi-qa-MiniLM-L6-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1) | 49.19 | 18,000 / 750 |
-| [multi-qa-distilbert-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-dot-v1) | 52.51 | 7,000 / 350 |
 | [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | 57.60 | 4,000 / 170 |
+| [multi-qa-distilbert-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-dot-v1) | 52.51 | 7,000 / 350 |
+| [multi-qa-MiniLM-L6-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1) | 49.19 | 18,000 / 750 |
 
-
-
-These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance:
+These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance as the similarity functions:
 
 | Model | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
 | --- | :---: | :---: |
-| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | 51.83 | 18,000 / 750 |
-| [multi-qa-distilbert-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1) | 52.83 | 7,000 / 350 |
 | [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) | 57.46 | 4,000 / 170 |
+| [multi-qa-distilbert-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1) | 52.83 | 7,000 / 350 |
+| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | 51.83 | 18,000 / 750 |
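+
+For the dot-product tuned models above, a minimal usage sketch looks the same, but is scored with `util.dot_score`:
+
+```python
+from sentence_transformers import SentenceTransformer, util
+
+# A model tuned for dot-product similarity scoring
+model = SentenceTransformer("multi-qa-MiniLM-L6-dot-v1")
+
+query_embedding = model.encode("How big is London")
+passage_embeddings = model.encode([
+    "London is known for its financial district",
+    "London has 9,787,426 inhabitants at the 2011 census",
+])
+
+# Dot-product scores; higher means more relevant
+print(util.dot_score(query_embedding, passage_embeddings))
+```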
 
 ### MSMARCO Passage Models
 
-The [MSMARCO Passage Ranking Dataset](https://github.com/microsoft/MSMARCO-Passage-Ranking) contains 500k real queries from Bing search together with the relevant passages from various web sources. Given the diversity of the MSMARCO dataset, models also perform well on other domains.
+The following models have been trained on the [MSMARCO Passage Ranking Dataset](https://github.com/microsoft/MSMARCO-Passage-Ranking), which contains 500k real queries from Bing search together with the relevant passages from various web sources. Given the diversity of the MSMARCO dataset, models also perform well on other domains.
 
-Models tuned to be used with dot-product:
+These models were tuned to be used with the dot-product similarity score:
 
 | Model | MSMARCO MRR@10 dev set | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
 | --- | :---: | :---: | :---: |
-| [msmarco-distilbert-base-tas-b](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b) | 34.43 | 49.25 | 7,000 / 350 |
-| [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) | 37.25 | 49.47 | 7,000 / 350 |
 | [msmarco-bert-base-dot-v5](https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5) | 38.08 | 52.11 | 4,000 / 170 |
+| [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) | 37.25 | 49.47 | 7,000 / 350 |
+| [msmarco-distilbert-base-tas-b](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b) | 34.43 | 49.25 | 7,000 / 350 |
 
-
-These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance:
+These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance as the similarity functions:
 
 | Model | MSMARCO MRR@10 dev set | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
 | --- | :---: | :---: | :---: |
-| [msmarco-MiniLM-L6-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5) | 32.27 | 42.16 | 18,000 / 750 |
-| [msmarco-MiniLM-L12-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) | 32.75 | 43.89 | 11,000 / 400 |
 | [msmarco-distilbert-cos-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5) | 33.79 | 44.98 | 7,000 / 350 |
+| [msmarco-MiniLM-L12-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) | 32.75 | 43.89 | 11,000 / 400 |
+| [msmarco-MiniLM-L6-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5) | 32.27 | 42.16 | 18,000 / 750 |
 
-[MSMARCO Models - More details](pretrained-models/msmarco-v5.md)
+[MSMARCO Models - More details](../pretrained-models/msmarco-v5.md)
 
 ---
-## Multi-Lingual Models
-The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close in vector space. You do not need to specify the input language. Details are in our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813). We used the following 50+ languages: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.
-
-
+## Multilingual Models
+The following models generate similar embeddings for the same texts in different languages. You do not need to specify the input language. Details are in our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813).
We used the following 50+ languages: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw. -**Semantic Similarity** +### Semantic Similarity Models These models find semantically similar sentences within one language or across languages: -- **distiluse-base-multilingual-cased-v1**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. -- **distiluse-base-multilingual-cased-v2**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). This version supports 50+ languages, but performs a bit weaker than the v1 model. -- **paraphrase-multilingual-MiniLM-L12-v2** - Multilingual version of *paraphrase-MiniLM-L12-v2*, trained on parallel data for 50+ languages. -- **paraphrase-multilingual-mpnet-base-v2** - Multilingual version of *paraphrase-mpnet-base-v2*, trained on parallel data for 50+ languages. +- **[distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1)**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. +- **[distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2)**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). This version supports 50+ languages, but performs a bit weaker than the v1 model. +- **[paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)** - Multilingual version of [paraphrase-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L12-v2), trained on parallel data for 50+ languages. +- **[paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)** - Multilingual version of [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2), trained on parallel data for 50+ languages. -**Bitext Mining** +### Bitext Mining Bitext mining describes the process of finding translated sentence pairs in two languages. If this is your use-case, the following model gives the best performance: -- **LaBSE** - [LaBSE](https://arxiv.org/abs/2007.01852) Model. Supports 109 languages. Works well for finding translation pairs in multiple languages. As detailed [here](https://arxiv.org/abs/2004.09813), LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other. +- **[LaBSE](https://huggingface.co/sentence-transformers/LaBSE)** - [LaBSE](https://arxiv.org/abs/2007.01852) Model. Supports 109 languages. Works well for finding translation pairs in multiple languages. As detailed [here](https://arxiv.org/abs/2004.09813), LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other. 
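+
+For example, a minimal bitext mining sketch with LaBSE, pairing each English sentence with its most similar candidate from another language (the sentences are illustrative):
+
+```python
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("LaBSE")
+
+english = ["The cat sits on the mat.", "I love reading books."]
+german = ["Ich lese gerne Bücher.", "Die Katze sitzt auf der Matte."]
+
+# Embed both sides, then match each English sentence to its best German candidate
+similarities = model.similarity(model.encode(english), model.encode(german))
+for i, j in enumerate(similarities.argmax(dim=1)):
+    print(english[i], "<->", german[int(j)])
+```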
- -Extending a model to new languages is easy by following [the description here](https://www.sbert.net/examples/training/multilingual/README.html). - ----- +Extending a model to new languages is easy by following [Training Examples > Multilingual Models](../../examples/training/multilingual/README.html). ## Image & Text-Models -The following models can embed images and text into a joint vector space. See [Image Search](../examples/applications/image-search/README.md) for more details how to use for text2image-search, image2image-search, image clustering, and zero-shot image classification. +The following models can embed images and text into a joint vector space. See [Usage > Image Search](../../examples/applications/image-search/README.md) for more details how to use for text2image-search, image2image-search, image clustering, and zero-shot image classification. The following models are available with their respective Top 1 accuracy on zero-shot ImageNet validation dataset. | Model | Top 1 Performance | | --- | :---: | -| [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) | 63.3 | -| [clip-ViT-B-16](https://huggingface.co/sentence-transformers/clip-ViT-B-16) | 68.1 | | [clip-ViT-L-14](https://huggingface.co/sentence-transformers/clip-ViT-L-14) | 75.4 | +| [clip-ViT-B-16](https://huggingface.co/sentence-transformers/clip-ViT-B-16) | 68.1 | +| [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) | 63.3 | We further provide this multilingual text-image model: -- **clip-ViT-B-32-multilingual-v1** - Multilingual text encoder for the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model using [Multilingual Knowledge Distillation](https://arxiv.org/abs/2004.09813). This model can encode text in 50+ languages to match the image vectors from the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model. - - ---- +- **[clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1)** - Multilingual text encoder for the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model using [Multilingual Knowledge Distillation](https://arxiv.org/abs/2004.09813). This model can encode text in 50+ languages to match the image vectors from the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model. -## Other Models - -### INSTRUCTOR models +## INSTRUCTOR models Some INSTRUCTOR models, such as [hkunlp/instructor-large](https://huggingface.co/hkunlp/instructor-large), are natively supported in Sentence Transformers. These models are special, as they are trained with instructions in mind. Notably, the primary difference between normal Sentence Transformer models and Instructor models is that the latter do not include the instructions themselves in the pooling step. The following models work out of the box: @@ -185,61 +206,7 @@ print(similarities) All other Instructor models either 1) will not load as they refer to `InstructorEmbedding` in their `modules.json` or 2) require calling `model.set_pooling_include_prompt(include_prompt=False)` after loading. -### Scientific Publications +## Scientific Similarity Models [SPECTER](https://arxiv.org/abs/2004.07180) is a model trained on scientific citations and can be used to estimate the similarity of two publications. We can use it to find similar papers. 
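+
+For example, a minimal sketch in the style of the example scripts linked below, where each paper is encoded as a single `title[SEP]abstract` string (titles and abstracts here are illustrative):
+
+```python
+from sentence_transformers import SentenceTransformer, util
+
+model = SentenceTransformer("allenai-specter")
+
+# Papers are encoded as "title[SEP]abstract" strings
+papers = [
+    "BERT[SEP]We introduce a new language representation model called BERT.",
+    "Attention Is All You Need[SEP]We propose a new network architecture, the Transformer.",
+]
+query = "SciBERT[SEP]A pretrained language model for scientific text."
+
+scores = util.cos_sim(model.encode(query), model.encode(papers))
+print(scores)
+```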
-- **allenai-specter** - [Semantic Search Python Example](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_publications.py) / [Semantic Search Colab Example](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06) - - - - - -### Natural Questions (NQ) Dataset Models -The following models were trained on [Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions), a dataset with 100k real queries from Google search together with the relevant passages from Wikipedia. - -- **nq-distilbert-base-v1**: MRR10: 72.36 on NQ dev set (small) - -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("nq-distilbert-base-v1") - -query_embedding = model.encode("How many people live in London?") - -# The passages are encoded as [ [title1, text1], [title2, text2], ...] -passage_embedding = model.encode( - [["London", "London has 9,787,426 inhabitants at the 2011 census."]] -) - -print("Similarity:", util.cos_sim(query_embedding, passage_embedding)) -``` - -You can index the passages as shown [here](../examples/applications/semantic-search/README.md). - -**Note:** The NQ model doesn't perform well. Use the above mentioned Multi-QA models to achieve the optimal performance. - -[More details](pretrained-models/nq-v1.md) - - - -### DPR-Models - -In [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) Karpukhin et al. trained models based on [Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions): -- **facebook-dpr-ctx_encoder-single-nq-base** -- **facebook-dpr-question_encoder-single-nq-base** - -They also trained models on the combination of Natural Questions, TriviaQA, WebQuestions, and CuratedTREC. -- **facebook-dpr-ctx_encoder-multiset-base** -- **facebook-dpr-question_encoder-multiset-base** - -**Note:** The DPR models perform comparabily bad. Use the above mentioned Multi-QA models to achieve the optimal performance. - -[More details & usage of the DPR models](pretrained-models/dpr.md) - -### Average Word Embeddings Models - -The following models apply compute the average word embedding for some well-known word embedding methods. Their computation speed is much higher than the transformer based models, but the quality of the embeddings are worse. -- **average_word_embeddings_glove.6B.300d** -- **average_word_embeddings_komninos** -- **average_word_embeddings_levy_dependency** -- **average_word_embeddings_glove.840B.300d** +- **[allenai-specter](https://huggingface.co/sentence-transformers/allenai-specter)** - [Semantic Search Python Example](../../examples/applications/semantic-search/semantic_search_publications.py) / [Semantic Search Colab Example](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06) diff --git a/docs/sentence_transformer/training/distributed.rst b/docs/sentence_transformer/training/distributed.rst new file mode 100644 index 000000000..fc6e78138 --- /dev/null +++ b/docs/sentence_transformer/training/distributed.rst @@ -0,0 +1,85 @@ + +Distributed Training +==================== + +Sentence Transformers implements two forms of distributed training: Data Parallel (DP) and Distributed Data Parallel (DDP). Read the `Data Parallelism documentation `_ on Hugging Face for more details on these strategies. Some of the key differences include: + +1. DDP is generally faster than DP because it has to communicate less data. +2. 
With DP, GPU 0 does the bulk of the work, while with DDP, the work is distributed more evenly across all GPUs. +3. DDP allows for training across multiple machines, while DP is limited to a single machine. + +In short, **DDP is generally recommended**. You can use DDP by running your normal training scripts with ``torchrun`` or ``accelerate``. For example, if you have a script called ``train_script.py``, you can run it with DDP using the following command: + +.. tabs:: + + .. tab:: Via ``torchrun`` + + - `torchrun documentation `_ + + :: + + torchrun --nproc_per_node=4 train_script.py + + .. tab:: Via ``accelerate`` + + - `accelerate documentation `_ + + :: + + accelerate launch --num_processes 4 train_script.py + +.. note:: + + When performing distributed training, you have to wrap your code in a ``main`` function and call it with ``if __name__ == "__main__":``. This is because each process will run the entire script, so you don't want to run the same code multiple times. Here is an example of how to do this:: + + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments, SentenceTransformerTrainer + # Other imports here + + def main(): + # Your training code here + + if __name__ == "__main__": + main() + +.. note:: + + When using DDP, using ``dataloader_drop_last=True`` in :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` is recommended, as the training may halt at the last (incomplete) training batch otherwise. + +Comparison +---------- + +The following table shows the speedup of DDP over DP and no parallelism given a certain hardware setup. + +- Hardware: a ``p3.8xlarge`` AWS instance, i.e. 4x V100 GPUs +- Model being trained: `microsoft/mpnet-base `_ (133M parameters) +- Maximum sequence length: 384 (following `all-mpnet-base-v2 `_) +- Training datasets: MultiNLI, SNLI and STSB (note: these have short texts) +- Losses: :class:`~sentence_transformers.losses.SoftmaxLoss` for MultiNLI and SNLI, :class:`~sentence_transformers.losses.CosineSimilarityLoss` for STSB +- Batch size per device: 32 + +.. list-table:: + :header-rows: 1 + + * - Strategy + - Launcher + - Samples per Second + * - No Parallelism + - ``CUDA_VISIBLE_DEVICES=0 python train_script.py`` + - 2724 + * - Data Parallel (DP) + - ``python train_script.py`` (DP is used by default when launching a script with ``python``) + - 3675 (1.349x speedup) + * - **Distributed Data Parallel (DDP)** + - ``torchrun --nproc_per_node=4 train_script.py`` or ``accelerate launch --num_processes 4 train_script.py`` + - **6980 (2.562x speedup)** + +FSDP +---- + +Fully Sharded Data Parallelism (FSDP) is another distributed training strategy that is not fully supported by Sentence Transformers. It is a more advanced version of DDP that is particularly useful for very large models. Note that in the previous comparison, FSDP reaches 5782 samples per second (2.122x speedup), i.e. **worse than DDP**. FSDP only makes sense with very large models. If you want to use FSDP with Sentence Transformers, you have to be aware of the following limitations: + +- You can't use the ``evaluator`` functionality with FSDP. +- You have to save the trained model with ``trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")`` followed with ``trainer.save_model("output")``. 
+- You have to use ``fsdp=["full_shard", "auto_wrap"]`` and ``fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"}`` in your ``SentenceTransformerTrainingArguments``, where ``transformer_layer_cls_to_wrap`` is set to the name of the repeated layer in the encoder that houses the multi-head attention and feed-forward layers, e.g. ``BertLayer`` or ``MPNetLayer``.
+
+Read the `FSDP documentation `_ by Accelerate for more details.
\ No newline at end of file
diff --git a/docs/sentence_transformer/training/examples.rst b/docs/sentence_transformer/training/examples.rst
new file mode 100644
index 000000000..f78d5916b
--- /dev/null
+++ b/docs/sentence_transformer/training/examples.rst
@@ -0,0 +1,32 @@
+
+Training Examples
+=================
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Supervised Learning
+
+   ../../../examples/training/sts/README
+   ../../../examples/training/nli/README
+   ../../../examples/training/paraphrases/README
+   ../../../examples/training/quora_duplicate_questions/README
+   ../../../examples/training/ms_marco/README
+   ../../../examples/training/matryoshka/README
+   ../../../examples/training/adaptive_layer/README
+   ../../../examples/training/multilingual/README
+   ../../../examples/training/distillation/README
+   ../../../examples/training/data_augmentation/README
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Unsupervised Learning
+
+   ../../../examples/unsupervised_learning/README
+   ../../../examples/domain_adaptation/README
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Advanced Usage
+
+   ../../../examples/training/hpo/README
+   distributed
\ No newline at end of file
diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md
new file mode 100644
index 000000000..def0a1759
--- /dev/null
+++ b/docs/sentence_transformer/training_overview.md
@@ -0,0 +1,667 @@
+# Training Overview
+
+## Why Finetune?
+Finetuning Sentence Transformer models often heavily improves the performance of the model on your use case, because each task requires a different notion of similarity. For example, given news articles:
+- "Apple launches the new iPad"
+- "NVIDIA is gearing up for the next GPU generation"
+
+Then for the following use cases, we may have different notions of similarity:
+- a model for **classification** of news articles as Economy, Sports, Technology, Politics, etc., should produce **similar embeddings** for these texts.
+- a model for **semantic textual similarity** should produce **dissimilar embeddings** for these texts, as they have different meanings.
+- a model for **semantic search** would **not need a notion of similarity** between two documents, as it should only compare queries and documents.
+
+Also see [**Training Examples**](training/examples) for numerous training scripts for common real-world applications that you can adopt.
+
+## Training Components
+Training Sentence Transformer models involves 3 to 5 components:
+
+

    + +## Dataset +```eval_rst +The :class:`SentenceTransformerTrainer` trains and evaluates using :class:`datasets.Dataset` (one dataset) or :class:`datasets.DatasetDict` instances (multiple datasets, see also `Multi-dataset training <#multi-dataset-training>`_). + +.. tabs:: + + .. tab:: Data on 🤗 Hugging Face Hub + + If you want to load data from the `Hugging Face Datasets `_, then you should use :func:`datasets.load_dataset`: + + .. raw:: html + + + + :: + + from datasets import load_dataset + + train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train") + eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev") + + print(train_dataset) + """ + Dataset({ + features: ['premise', 'hypothesis', 'label'], + num_rows: 942069 + }) + """ + + Some datasets (including `sentence-transformers/all-nli `_) require you to provide a "subset" alongside the dataset name. ``sentence-transformers/all-nli`` has 4 subsets, each with different data formats: `pair `_, `pair-class `_, `pair-score `_, `triplet `_. + + .. note:: + + Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them by browsing to `https://huggingface.co/datasets?other=sentence-transformers `_. We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks. + + .. tab:: Local Data (CSV, JSON, Parquet, Arrow, SQL) + + If you have local data in common file-formats, then you can load this data easily using :func:`datasets.load_dataset`: + + .. raw:: html + + + + :: + + from datasets import load_dataset + + dataset = load_dataset("csv", data_files="my_file.csv") + + or:: + + from datasets import load_dataset + + dataset = load_dataset("json", data_files="my_file.json") + + .. tab:: Local Data that requires pre-processing + + .. sidebar:: Documentation + + - :meth:`datasets.Dataset.from_dict` + + If you have local data that requires some extra pre-processing, my recommendation is to initialize your dataset using :meth:`datasets.Dataset.from_dict` and a dictionary of lists, like so: + + .. raw:: html + + + + :: + + from datasets import Dataset + + sentence1_list = [] + sentence2_list = [] + # Open a file, do preprocessing, filtering, cleaning, etc. + # and append to the lists + + dataset = Dataset.from_dict({ + "sentence1": sentence1_list, + "sentence2": sentence2_list, + }) + + Each key from the dictionary will become a column in the resulting dataset. + +``` + +### Dataset Format + +```eval_rst +It is important that your dataset format matches your loss function (or that you choose a loss function that matches your dataset format). Verifying whether a dataset format works with a loss function involves two steps: + +1. If your loss function requires a *Label* according to the `Loss Overview `_ table, then your dataset must have a **column named "label" or "score"**. This column is automatically taken as the label. +2. All columns not named "label" or "score" are considered *Inputs* according to the `Loss Overview `_ table. The number of remaining columns must match the number of valid inputs for your chosen loss. The names of these columns are **irrelevant**, only the **order matters**. 
+
+For example, given a dataset with columns ``["text1", "text2", "label"]`` where the "label" column has a float similarity score, we can use it with :class:`~sentence_transformers.losses.CoSENTLoss`, :class:`~sentence_transformers.losses.AnglELoss`, and :class:`~sentence_transformers.losses.CosineSimilarityLoss` because it:
+
+1. has a "label" column as is required for these loss functions.
+2. has 2 non-label columns, exactly the number required by these loss functions.
+
+Be sure to re-order your dataset columns with :meth:`Dataset.select_columns ` if your columns are not ordered correctly. For example, if your dataset has ``["good_answer", "bad_answer", "question"]`` as columns, then this dataset can technically be used with a loss that requires (anchor, positive, negative) triplets, but the ``good_answer`` column will be taken as the anchor, ``bad_answer`` as the positive, and ``question`` as the negative.
+
+Additionally, if your dataset has extraneous columns (e.g. sample_id, metadata, source, type), you should remove these with :meth:`Dataset.remove_columns ` as they will be used as inputs otherwise. You can also use :meth:`Dataset.select_columns ` to keep only the desired columns.
+```
+
+## Loss Function
+Loss functions quantify how well a model performs for a given batch of data, allowing an optimizer to update the model weights to produce more favourable (i.e., lower) loss values. This is the core of the training process.
+
+Sadly, there is no single loss function that works best for all use-cases. Instead, which loss function to use greatly depends on your available data and on your target task. See [Dataset Format](#dataset-format) to learn what datasets are valid for which loss functions. Additionally, the [Loss Overview](loss_overview) will be your best friend to learn about the options.
+
+```eval_rst
+Most loss functions can be initialized with just the :class:`SentenceTransformer` that you're training, alongside some optional parameters, e.g.:
+
+.. sidebar:: Documentation
+
+    - :class:`sentence_transformers.losses.CoSENTLoss`
+    - `Losses API Reference <../package_reference/sentence_transformer/losses>`_
+    - `Loss Overview `_
+
+::
+
+    from datasets import load_dataset
+    from sentence_transformers import SentenceTransformer
+    from sentence_transformers.losses import CoSENTLoss
+
+    # Load a model to train/finetune
+    model = SentenceTransformer("xlm-roberta-base")
+
+    # Initialize the CoSENTLoss
+    # This loss requires pairs of text and a float similarity score as a label
+    loss = CoSENTLoss(model)
+
+    # Load an example training dataset that works with our loss function:
+    train_dataset = load_dataset("sentence-transformers/all-nli", "pair-score", split="train")
+    """
+    Dataset({
+        features: ['sentence1', 'sentence2', 'label'],
+        num_rows: 942069
+    })
+    """
+```
+
+## Training Arguments
+
+```eval_rst
+The :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` class can be used to specify parameters for influencing training performance as well as defining the tracking/debugging parameters. Although it is optional, it is heavily recommended to experiment with the various useful arguments.
+```
+
+The following are tables with some of the most useful training arguments.
+
+
+
+## Loss Function
+Loss functions quantify how well a model performs for a given batch of data, allowing an optimizer to update the model weights to produce more favourable (i.e., lower) loss values. This is the core of the training process.
+
+Sadly, there is no single loss function that works best for all use cases. Instead, which loss function to use greatly depends on your available data and on your target task. See [Dataset Format](#dataset-format) to learn which datasets are valid for which loss functions. Additionally, the [Loss Overview](loss_overview) will be your best friend to learn about the options.
+
+```eval_rst
+Most loss functions can be initialized with just the :class:`SentenceTransformer` that you're training, alongside some optional parameters, e.g.:
+
+.. sidebar:: Documentation
+
+    - :class:`sentence_transformers.losses.CoSENTLoss`
+    - `Losses API Reference <../package_reference/sentence_transformer/losses.html>`_
+    - `Loss Overview <loss_overview.html>`_
+
+::
+
+    from datasets import load_dataset
+    from sentence_transformers import SentenceTransformer
+    from sentence_transformers.losses import CoSENTLoss
+
+    # Load a model to train/finetune
+    model = SentenceTransformer("xlm-roberta-base")
+
+    # Initialize the CoSENTLoss
+    # This loss requires pairs of text and a float similarity score as a label
+    loss = CoSENTLoss(model)
+
+    # Load an example training dataset that works with our loss function:
+    train_dataset = load_dataset("sentence-transformers/all-nli", "pair-score", split="train")
+    """
+    Dataset({
+        features: ['sentence1', 'sentence2', 'label'],
+        num_rows: 942069
+    })
+    """
+```
+
+## Training Arguments
+
+```eval_rst
+The :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` class can be used to specify parameters for influencing training performance as well as defining the tracking/debugging parameters. Although it is optional, it is strongly recommended to experiment with the various useful arguments.
+```
+
+The following are tables with some of the most useful training arguments.
+
+
+```eval_rst
+Here is an example of how :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` can be initialized:
+```
+
+```python
+from sentence_transformers import SentenceTransformerTrainingArguments
+from sentence_transformers.training_args import BatchSamplers
+
+args = SentenceTransformerTrainingArguments(
+    # Required parameter:
+    output_dir="models/mpnet-base-all-nli-triplet",
+    # Optional training parameters:
+    num_train_epochs=1,
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=16,
+    warmup_ratio=0.1,
+    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
+    bf16=False,  # Set to True if you have a GPU that supports BF16
+    batch_sampler=BatchSamplers.NO_DUPLICATES,  # Losses that use "in-batch negatives" benefit from no duplicates
+    # Optional tracking/debugging parameters:
+    eval_strategy="steps",
+    eval_steps=100,
+    save_strategy="steps",
+    save_steps=100,
+    save_total_limit=2,
+    logging_steps=100,
+    run_name="mpnet-base-all-nli-triplet",  # Will be used in W&B if `wandb` is installed
+)
+```
+
+## Evaluator
+
+```eval_rst
+Several evaluators exist that can help with evaluation before, during, and after training:
+
+======================================================================== ===========================================================================================================================
+Evaluator                                                                Required Data
+======================================================================== ===========================================================================================================================
+:class:`~sentence_transformers.evaluation.BinaryClassificationEvaluator` Pairs with class labels
+:class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator`  Pairs with similarity scores
+:class:`~sentence_transformers.evaluation.InformationRetrievalEvaluator` Queries (qid => question), Corpus (cid => document), and relevant documents (qid => set[cid])
+:class:`~sentence_transformers.evaluation.MSEEvaluator`                  Source sentences to embed with a teacher model and target sentences to embed with the student model. Can be the same texts.
+:class:`~sentence_transformers.evaluation.ParaphraseMiningEvaluator`     Mapping of IDs to sentences & pairs with IDs of duplicate sentences.
+:class:`~sentence_transformers.evaluation.RerankingEvaluator`            List of ``{'query': '...', 'positive': [...], 'negative': [...]}`` dictionaries.
+:class:`~sentence_transformers.evaluation.TranslationEvaluator`          Pairs of sentences in two separate languages.
+:class:`~sentence_transformers.evaluation.TripletEvaluator`              (anchor, positive, negative) triplets.
+======================================================================== ===========================================================================================================================
+
+Additionally, :class:`~sentence_transformers.evaluation.SequentialEvaluator` should be used to combine multiple evaluators into one evaluator that can be passed to the :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` (see the example after the tabs below). How often the evaluator is run depends on the ``eval_strategy`` and ``eval_steps`` `Training Arguments <#training-arguments>`_.
+
+Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face.
+
+.. tabs::
+
+    .. tab:: EmbeddingSimilarityEvaluator with STSb
+
+        ::
+
+            from datasets import load_dataset
+            from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction
+
+            # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb)
+            eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
+
+            # Initialize the evaluator
+            dev_evaluator = EmbeddingSimilarityEvaluator(
+                sentences1=eval_dataset["sentence1"],
+                sentences2=eval_dataset["sentence2"],
+                scores=eval_dataset["score"],
+                main_similarity=SimilarityFunction.COSINE,
+                name="sts-dev",
+            )
+            # You can run evaluation like so:
+            # dev_evaluator(model)
+
+    .. tab:: TripletEvaluator with AllNLI
+
+        ::
+
+            from datasets import load_dataset
+            from sentence_transformers.evaluation import TripletEvaluator, SimilarityFunction
+
+            # Load triplets from the AllNLI dataset (https://huggingface.co/datasets/sentence-transformers/all-nli)
+            max_samples = 1000
+            eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]")
+
+            # Initialize the evaluator
+            dev_evaluator = TripletEvaluator(
+                anchors=eval_dataset["anchor"],
+                positives=eval_dataset["positive"],
+                negatives=eval_dataset["negative"],
+                main_distance_function=SimilarityFunction.COSINE,
+                name="all-nli-dev",
+            )
+            # You can run evaluation like so:
+            # dev_evaluator(model)
+```
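+
+As a minimal sketch of combining evaluators, assuming the two evaluators from the tabs above were created as `sts_evaluator` and `triplet_evaluator`:
+
+```python
+from sentence_transformers.evaluation import SequentialEvaluator
+
+# Wrap several evaluators into a single evaluator that runs them in sequence;
+# it can be passed to the SentenceTransformerTrainer like any other evaluator.
+evaluator = SequentialEvaluator([sts_evaluator, triplet_evaluator])
+
+# You can run evaluation like so:
+# evaluator(model)
+```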
+
+## Trainer
+
+```eval_rst
+The :class:`sentence_transformers.SentenceTransformerTrainer` is where all previous components come together. We only have to provide the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, and evaluator (optional), and we can start training. Let's have a look at a script where all of these components come together:
+
+.. sidebar:: Documentation
+
+    #. :class:`~sentence_transformers.SentenceTransformer`
+    #. :class:`~sentence_transformers.model_card.SentenceTransformerModelCardData`
+    #. :func:`~datasets.load_dataset`
+    #. :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`
+    #. :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`
+    #. :class:`~sentence_transformers.evaluation.TripletEvaluator`
+    #. :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`
+    #. :meth:`SentenceTransformer.save_pretrained <sentence_transformers.SentenceTransformer.save_pretrained>`
+    #. :meth:`SentenceTransformer.push_to_hub <sentence_transformers.SentenceTransformer.push_to_hub>`
+
+    - `Training Examples `_
+
+::
+
+    from datasets import load_dataset
+    from sentence_transformers import (
+        SentenceTransformer,
+        SentenceTransformerTrainer,
+        SentenceTransformerTrainingArguments,
+        SentenceTransformerModelCardData,
+    )
+    from sentence_transformers.losses import MultipleNegativesRankingLoss
+    from sentence_transformers.training_args import BatchSamplers
+    from sentence_transformers.evaluation import TripletEvaluator
+
+    # 1. Load a model to finetune with 2. (Optional) model card data
+    model = SentenceTransformer(
+        "microsoft/mpnet-base",
+        model_card_data=SentenceTransformerModelCardData(
+            language="en",
+            license="apache-2.0",
+            model_name="MPNet base trained on AllNLI triplets",
+        )
+    )
+
+    # 3. Load a dataset to finetune on
+    dataset = load_dataset("sentence-transformers/all-nli", "triplet")
+    train_dataset = dataset["train"].select(range(100_000))
+    eval_dataset = dataset["dev"]
+    test_dataset = dataset["test"]
+
+    # 4. Define a loss function
+    loss = MultipleNegativesRankingLoss(model)
+
+    # 5. (Optional) Specify training arguments
+    args = SentenceTransformerTrainingArguments(
+        # Required parameter:
+        output_dir="models/mpnet-base-all-nli-triplet",
+        # Optional training parameters:
+        num_train_epochs=1,
+        per_device_train_batch_size=16,
+        per_device_eval_batch_size=16,
+        warmup_ratio=0.1,
+        fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
+        bf16=False,  # Set to True if you have a GPU that supports BF16
+        batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
+        # Optional tracking/debugging parameters:
+        eval_strategy="steps",
+        eval_steps=100,
+        save_strategy="steps",
+        save_steps=100,
+        save_total_limit=2,
+        logging_steps=100,
+        run_name="mpnet-base-all-nli-triplet",  # Will be used in W&B if `wandb` is installed
+    )
+
+    # 6. (Optional) Create an evaluator & evaluate the base model
+    dev_evaluator = TripletEvaluator(
+        anchors=eval_dataset["anchor"],
+        positives=eval_dataset["positive"],
+        negatives=eval_dataset["negative"],
+        name="all-nli-dev",
+    )
+    dev_evaluator(model)
+
+    # 7. Create a trainer & train
+    trainer = SentenceTransformerTrainer(
+        model=model,
+        args=args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        loss=loss,
+        evaluator=dev_evaluator,
+    )
+    trainer.train()
+
+    # (Optional) Evaluate the trained model on the test set
+    test_evaluator = TripletEvaluator(
+        anchors=test_dataset["anchor"],
+        positives=test_dataset["positive"],
+        negatives=test_dataset["negative"],
+        name="all-nli-test",
+    )
+    test_evaluator(model)
+
+    # 8. Save the trained model
+    model.save_pretrained("models/mpnet-base-all-nli-triplet/final")
+
+    # 9. (Optional) Push it to the Hugging Face Hub
+    model.push_to_hub("mpnet-base-all-nli-triplet")
+
+```
+
+### Callbacks
+
+```eval_rst
+This Sentence Transformers trainer integrates support for various :class:`transformers.TrainerCallback` subclasses, such as:
+
+- :class:`~transformers.integrations.WandbCallback` to automatically log training metrics to W&B if ``wandb`` is installed
+- :class:`~transformers.integrations.TensorBoardCallback` to log training metrics to TensorBoard if ``tensorboard`` is accessible.
+- :class:`~transformers.integrations.CodeCarbonCallback` to track the carbon emissions of your model during training if ``codecarbon`` is installed.
+
+  - Note: These carbon emissions will be included in your automatically generated model card.
+
+See the Transformers `Callbacks <https://huggingface.co/docs/transformers/main_classes/callback>`_
+documentation for more information on the integrated callbacks and how to write your own callbacks, such as the sketch below.
+```
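+
+As a minimal sketch of a custom callback, here is a hypothetical `LossPrinterCallback` (the name and printing behaviour are illustrative, not part of the library) that prints every set of logged metrics:
+
+```python
+from transformers import TrainerCallback
+
+class LossPrinterCallback(TrainerCallback):
+    # Called every time the trainer logs metrics (e.g. every `logging_steps` steps)
+    def on_log(self, args, state, control, logs=None, **kwargs):
+        if logs is not None:
+            print(f"Step {state.global_step}: {logs}")
+
+# It can then be passed to the trainer, e.g.:
+# trainer = SentenceTransformerTrainer(..., callbacks=[LossPrinterCallback()])
+```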
+
+## Multi-Dataset Training
+```eval_rst
+The top performing models are trained using many datasets at once. Normally, this is rather tricky, as each dataset has a different format. However, :class:`SentenceTransformerTrainer` can train with multiple datasets without having to convert each dataset to the same format. It can even apply different loss functions to each of the datasets. The steps to train with multiple datasets are:
+
+- Use a dictionary of :class:`~datasets.Dataset` instances (or a :class:`~datasets.DatasetDict`) as the ``train_dataset`` and ``eval_dataset``.
+- (Optional) Use a dictionary of loss functions mapping dataset names to losses. Only required if you wish to use different loss functions for different datasets.
+
+Each training/evaluation batch will only contain samples from one of the datasets. The order in which batches are sampled from the multiple datasets is defined by the :class:`~sentence_transformers.training_args.MultiDatasetBatchSamplers` enum, which can be passed to the :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` via ``multi_dataset_batch_sampler``. Valid options are:
+
+- ``MultiDatasetBatchSamplers.ROUND_ROBIN``: Round-robin sampling from each dataset until one is exhausted. With this strategy, it's likely that not all samples from each dataset are used, but each dataset is sampled from equally.
+- ``MultiDatasetBatchSamplers.PROPORTIONAL`` (default): Sample from each dataset in proportion to its size. With this strategy, all samples from each dataset are used and larger datasets are sampled from more frequently (see the sketch below).
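+
+As a minimal sketch (with a hypothetical output directory), selecting the sampler looks like this::
+
+    from sentence_transformers import SentenceTransformerTrainingArguments
+    from sentence_transformers.training_args import MultiDatasetBatchSamplers
+
+    args = SentenceTransformerTrainingArguments(
+        output_dir="models/multi-dataset-run",
+        # Sample batches from each dataset proportionally to its size (the default)
+        multi_dataset_batch_sampler=MultiDatasetBatchSamplers.PROPORTIONAL,
+    )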
+
+This multi-task training has been shown to be very effective, e.g. `Huang et al. `_ employed :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, :class:`~sentence_transformers.losses.CoSENTLoss`, and a variation on :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` without in-batch negatives and with only hard negatives to reach state-of-the-art performance on Chinese. They even applied :class:`~sentence_transformers.losses.MatryoshkaLoss` to allow the model to produce `Matryoshka Embeddings <../../examples/training/matryoshka/README.html>`_.
+
+Training on multiple datasets looks like this:
+
+.. sidebar:: Documentation
+
+    - :func:`datasets.load_dataset`
+    - :class:`~sentence_transformers.SentenceTransformer`
+    - :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`
+    - :class:`~sentence_transformers.losses.CoSENTLoss`
+    - :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`
+    - :class:`~sentence_transformers.losses.SoftmaxLoss`
+    - `sentence-transformers/all-nli <https://huggingface.co/datasets/sentence-transformers/all-nli>`_
+    - `sentence-transformers/stsb <https://huggingface.co/datasets/sentence-transformers/stsb>`_
+    - `sentence-transformers/quora-duplicates <https://huggingface.co/datasets/sentence-transformers/quora-duplicates>`_
+    - `sentence-transformers/natural-questions <https://huggingface.co/datasets/sentence-transformers/natural-questions>`_
+
+    **Training Examples:**
+
+    - `Quora Duplicate Questions > Multi-task learning `_
+    - `AllNLI + STSb > Multi-task learning `_
+
+::
+
+    from datasets import load_dataset
+    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
+    from sentence_transformers.losses import CoSENTLoss, MultipleNegativesRankingLoss, SoftmaxLoss
+
+    # 1. Load a model to finetune
+    model = SentenceTransformer("bert-base-uncased")
+
+    # 2. Load several Datasets to train with
+    # (anchor, positive)
+    all_nli_pair_train = load_dataset("sentence-transformers/all-nli", "pair", split="train[:10000]")
+    # (premise, hypothesis) + label
+    all_nli_pair_class_train = load_dataset("sentence-transformers/all-nli", "pair-class", split="train[:10000]")
+    # (sentence1, sentence2) + score
+    all_nli_pair_score_train = load_dataset("sentence-transformers/all-nli", "pair-score", split="train[:10000]")
+    # (anchor, positive, negative)
+    all_nli_triplet_train = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]")
+    # (sentence1, sentence2) + score
+    stsb_pair_score_train = load_dataset("sentence-transformers/stsb", split="train[:10000]")
+    # (anchor, positive)
+    quora_pair_train = load_dataset("sentence-transformers/quora-duplicates", split="train[:10000]")
+    # (query, answer)
+    natural_questions_train = load_dataset("sentence-transformers/natural-questions", split="train[:10000]")
+
+    # We can combine all datasets into a dictionary with dataset names to datasets
+    train_dataset = {
+        "all-nli-pair": all_nli_pair_train,
+        "all-nli-pair-class": all_nli_pair_class_train,
+        "all-nli-pair-score": all_nli_pair_score_train,
+        "all-nli-triplet": all_nli_triplet_train,
+        "stsb": stsb_pair_score_train,
+        "quora": quora_pair_train,
+        "natural-questions": natural_questions_train,
+    }
+
+    # 3. Load several Datasets to evaluate with
+    # (anchor, positive, negative)
+    all_nli_triplet_dev = load_dataset("sentence-transformers/all-nli", "triplet", split="dev")
+    # (sentence1, sentence2, score)
+    stsb_pair_score_dev = load_dataset("sentence-transformers/stsb", split="validation")
+    # (anchor, positive)
+    quora_pair_dev = load_dataset("sentence-transformers/quora-duplicates", split="train[10000:11000]")
+    # (query, answer)
+    natural_questions_dev = load_dataset("sentence-transformers/natural-questions", split="train[10000:11000]")
+
+    # We can use a dictionary for the evaluation dataset too, but we don't have to. We could also just use
+    # no evaluation dataset, or one dataset.
+    eval_dataset = {
+        "all-nli-triplet": all_nli_triplet_dev,
+        "stsb": stsb_pair_score_dev,
+        "quora": quora_pair_dev,
+        "natural-questions": natural_questions_dev,
+    }
+
+    # 4. Load several loss functions to train with
+    # (anchor, positive), (anchor, positive, negative)
+    mnrl_loss = MultipleNegativesRankingLoss(model)
+    # (sentence_A, sentence_B) + class
+    softmax_loss = SoftmaxLoss(model)
+    # (sentence_A, sentence_B) + score
+    cosent_loss = CoSENTLoss(model)
+
+    # Create a mapping with dataset names to loss functions, so the trainer knows which loss to apply where.
+    # Note that you can also just use one loss if all of your training/evaluation datasets use the same loss
+    losses = {
+        "all-nli-pair": mnrl_loss,
+        "all-nli-pair-class": softmax_loss,
+        "all-nli-pair-score": cosent_loss,
+        "all-nli-triplet": mnrl_loss,
+        "stsb": cosent_loss,
+        "quora": mnrl_loss,
+        "natural-questions": mnrl_loss,
+    }
+
+    # 5. Define a simple trainer, although it's recommended to use one with args & evaluators
+    trainer = SentenceTransformerTrainer(
+        model=model,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        loss=losses,
+    )
+    trainer.train()
+
+    # 6. Save the trained model and optionally push it to the Hugging Face Hub
+    model.save_pretrained("bert-base-all-nli-stsb-quora-nq")
+    model.push_to_hub("bert-base-all-nli-stsb-quora-nq")
+```
+
+## Deprecated Training
+```eval_rst
+Prior to the Sentence Transformers v3.0 release, models would be trained with the :meth:`SentenceTransformer.fit <sentence_transformers.SentenceTransformer.fit>` method and a :class:`~torch.utils.data.DataLoader` of :class:`~sentence_transformers.readers.InputExample`, which looked something like this::
+
+    from sentence_transformers import SentenceTransformer, InputExample, losses
+    from torch.utils.data import DataLoader
+
+    # Define the model. Either from scratch or by loading a pre-trained model
+    model = SentenceTransformer("distilbert/distilbert-base-uncased")
+
+    # Define your train examples. You need more than just two examples...
+    train_examples = [
+        InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
+        InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
+    ]
+
+    # Define your train dataset, the dataloader and the train loss
+    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
+    train_loss = losses.CosineSimilarityLoss(model)
+
+    # Tune the model
+    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
+
+Since the v3.0 release, using :meth:`SentenceTransformer.fit <sentence_transformers.SentenceTransformer.fit>` is still possible, but it will initialize a :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` behind the scenes. It is recommended to use the Trainer directly, as you will have more control via the :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`, but existing training scripts relying on :meth:`SentenceTransformer.fit <sentence_transformers.SentenceTransformer.fit>` should still work.
+
+In case there are issues with the updated :meth:`SentenceTransformer.fit <sentence_transformers.SentenceTransformer.fit>`, you can also get exactly the old behaviour by calling :meth:`SentenceTransformer.old_fit <sentence_transformers.SentenceTransformer.old_fit>` instead, but this method will be fully deprecated in a future release.
+
+```
+
+## Best Base Embedding Models
+The quality of your text embedding model depends on which transformer model you choose. Sadly, we cannot infer from better performance on e.g. the GLUE or SuperGLUE benchmarks that a model will also yield better sentence representations.
+
+To test the suitability of transformer models, I use the [training_nli_v2.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py) script and train on 560k (anchor, positive, negative)-triplets for 1 epoch with batch size 64. I then evaluate on 14 diverse text similarity tasks (clustering, semantic search, duplicate detection, etc.) from various domains.
+
+The following table shows the performance of different models on this benchmark:
+
+| Model                                                                                                                               | Performance (14 sentence similarity tasks) |
+|-------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------:|
+| [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base)                                                                 | 60.99 |
+| [nghuyong/ernie-2.0-en](https://huggingface.co/nghuyong/ernie-2.0-en)                                                               | 60.73 |
+| [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base)                                                             | 60.21 |
+| [roberta-base](https://huggingface.co/roberta-base)                                                                                 | 59.63 |
+| [t5-base](https://huggingface.co/t5-base)                                                                                           | 59.21 |
+| [bert-base-uncased](https://huggingface.co/bert-base-uncased)                                                                       | 59.17 |
+| [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased)                                                           | 59.03 |
+| [nreimers/TinyBERT_L-6_H-768_v2](https://huggingface.co/nreimers/TinyBERT_L-6_H-768_v2)                                             | 58.27 |
+| [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base)                                                                   | 57.63 |
+| [nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large](https://huggingface.co/nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large)   | 57.31 |
+| [albert-base-v2](https://huggingface.co/albert-base-v2)                                                                             | 57.14 |
+| [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased)                                       | 56.79 |
+| [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base)                                                       | 54.46 |
+
diff --git a/docs/sentence_transformer/usage/semantic_textual_similarity.rst b/docs/sentence_transformer/usage/semantic_textual_similarity.rst
new file mode 100644
index 000000000..cd4c332b1
--- /dev/null
+++ b/docs/sentence_transformer/usage/semantic_textual_similarity.rst
@@ -0,0 +1,132 @@
+Semantic Textual Similarity
+===========================
+
+For Semantic Textual Similarity (STS), we want to produce embeddings for all texts involved and calculate the similarities between them. The text pairs with the highest similarity score are the most semantically similar. See also the `Computing Embeddings <../../../examples/applications/computing-embeddings/README.html>`_ documentation for more advanced details on getting embedding scores.
+
+.. sidebar:: Documentation
+
+    1. :class:`SentenceTransformer <sentence_transformers.SentenceTransformer>`
+    2. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>`
+    3. :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>`
+
+::
+
+    from sentence_transformers import SentenceTransformer
+
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+
+    # Two lists of sentences
+    sentences1 = [
+        "The new movie is awesome",
+        "The cat sits outside",
+        "A man is playing guitar",
+    ]
+
+    sentences2 = [
+        "The dog plays in the garden",
+        "The new movie is so great",
+        "A woman watches TV",
+    ]
+
+    # Compute embeddings for both lists
+    embeddings1 = model.encode(sentences1)
+    embeddings2 = model.encode(sentences2)
+
+    # Compute cosine similarities
+    similarities = model.similarity(embeddings1, embeddings2)
+
+    # Output the pairs with their score
+    for idx_i, sentence1 in enumerate(sentences1):
+        print(sentence1)
+        for idx_j, sentence2 in enumerate(sentences2):
+            print(f" - {sentence2: <30}: {similarities[idx_i][idx_j]:.4f}")
+
+.. code-block:: txt
+    :emphasize-lines: 3
+
+    The new movie is awesome
+     - The dog plays in the garden   : 0.0543
+     - The new movie is so great     : 0.8939
+     - A woman watches TV            : -0.0502
+    The cat sits outside
+     - The dog plays in the garden   : 0.2838
+     - The new movie is so great     : -0.0029
+     - A woman watches TV            : 0.1310
+    A man is playing guitar
+     - The dog plays in the garden   : 0.2277
+     - The new movie is so great     : -0.0136
+     - A woman watches TV            : -0.0327
+
+In this example, the :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>` method returns a 3x3 matrix with the respective cosine similarity scores for all possible pairs between *embeddings1* and *embeddings2*.
+
+Similarity Calculation
+----------------------
+
+The similarity metric that is used is stored in the SentenceTransformer instance under :attr:`SentenceTransformer.similarity_fn_name <sentence_transformers.SentenceTransformer.similarity_fn_name>`. Valid options are:
+
+- ``SimilarityFunction.COSINE`` (a.k.a. ``"cosine"``): Cosine Similarity (**default**)
+- ``SimilarityFunction.DOT_PRODUCT`` (a.k.a. ``"dot"``): Dot Product
+- ``SimilarityFunction.EUCLIDEAN`` (a.k.a. ``"euclidean"``): Negative Euclidean Distance
+- ``SimilarityFunction.MANHATTAN`` (a.k.a. ``"manhattan"``): Negative Manhattan Distance
+
+This value can be changed in a handful of ways:
+
+1. By initializing the SentenceTransformer instance with the desired similarity function::
+
+    from sentence_transformers import SentenceTransformer, SimilarityFunction
+
+    model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name=SimilarityFunction.DOT_PRODUCT)
+
+2. By setting the value directly on the SentenceTransformer instance::
+
+    from sentence_transformers import SentenceTransformer, SimilarityFunction
+
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+    model.similarity_fn_name = SimilarityFunction.DOT_PRODUCT
+
+3. By setting the value under the ``"similarity_fn_name"`` key in the ``config_sentence_transformers.json`` file of a saved model. When you save a Sentence Transformer model, this value will be automatically saved as well.
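+
+As a minimal sketch of the third option (the file is written for you, so you rarely edit it by hand; the save path here is hypothetical), changing the attribute and saving the model persists the choice::
+
+    from sentence_transformers import SentenceTransformer, SimilarityFunction
+
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+    model.similarity_fn_name = SimilarityFunction.DOT_PRODUCT
+
+    # Saving the model also stores "similarity_fn_name" in config_sentence_transformers.json
+    model.save_pretrained("my/local/model")
+
+    # Reloading restores the configured similarity function
+    model = SentenceTransformer("my/local/model")
+    print(model.similarity_fn_name)
+    # => "dot"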
+
+Sentence Transformers implements two methods to calculate the similarity between embeddings:
+
+- :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>`: Calculates the similarity between all pairs of embeddings.
+- :meth:`SentenceTransformer.pairwise_cosine_similarity `: Calculates the similarity between embeddings in a pairwise fashion.
+
+::
+
+    from sentence_transformers import SentenceTransformer, SimilarityFunction
+
+    # Load a pretrained Sentence Transformer model
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+
+    # Embed some sentences
+    sentences = [
+        "The weather is lovely today.",
+        "It's so sunny outside!",
+        "He drove to the stadium.",
+    ]
+    embeddings = model.encode(sentences)
+
+    similarities = model.similarity(embeddings, embeddings)
+    print(similarities)
+    # tensor([[1.0000, 0.6660, 0.1046],
+    #         [0.6660, 1.0000, 0.1411],
+    #         [0.1046, 0.1411, 1.0000]])
+
+    # Change the similarity function to Manhattan distance
+    model.similarity_fn_name = SimilarityFunction.MANHATTAN
+    print(model.similarity_fn_name)
+    # => "manhattan"
+
+    similarities = model.similarity(embeddings, embeddings)
+    print(similarities)
+    # tensor([[ -0.0000, -12.6269, -20.2167],
+    #         [-12.6269,  -0.0000, -20.1288],
+    #         [-20.2167, -20.1288,  -0.0000]])
+
+.. note::
+
+    If a Sentence Transformer instance ends with a :class:`~sentence_transformers.models.Normalize` module, then it is sensible to choose the "dot" metric instead of "cosine".
+
+    Dot product on normalized embeddings is equivalent to cosine similarity, but "cosine" will re-normalize the embeddings again. As a result, the "dot" metric will be faster than "cosine".
+
+If you want to find the highest scoring pairs in a long list of sentences, have a look at `Paraphrase Mining <../../examples/applications/paraphrase-mining/README.md>`_.
diff --git a/docs/sentence_transformer/usage/usage.rst b/docs/sentence_transformer/usage/usage.rst
new file mode 100644
index 000000000..d8beec379
--- /dev/null
+++ b/docs/sentence_transformer/usage/usage.rst
@@ -0,0 +1,59 @@
+
+Usage
+=====
+
+Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:
+
+1. Calculates a **fixed-size vector representation (embedding)** given **texts or images**.
+2. Embedding calculation is often **efficient**, embedding similarity calculation is **very fast**.
+3. Applicable for a **wide range of tasks**, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
+4. Often used as a **first step in a two-step retrieval process**, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.
+
+Once you have `installed `_ Sentence Transformers, you can easily use Sentence Transformer models:
+
+.. sidebar:: Documentation
+
+    1. :class:`SentenceTransformer <sentence_transformers.SentenceTransformer>`
+    2. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>`
+    3. :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>`
+
+::
+
+    from sentence_transformers import SentenceTransformer
+
+    # 1. Load a pretrained Sentence Transformer model
+    model = SentenceTransformer("all-MiniLM-L6-v2")
+
+    # The sentences to encode
+    sentences = [
+        "The weather is lovely today.",
+        "It's so sunny outside!",
+        "He drove to the stadium.",
+    ]
+
+    # 2. Calculate embeddings by calling model.encode()
+    embeddings = model.encode(sentences)
+    print(embeddings.shape)
+    # [3, 384]
+
+    # 3. Calculate the embedding similarities
+    similarities = model.similarity(embeddings, embeddings)
+    print(similarities)
+    # tensor([[1.0000, 0.6660, 0.1046],
+    #         [0.6660, 1.0000, 0.1411],
+    #         [0.1046, 0.1411, 1.0000]])
+
+.. toctree::
+    :maxdepth: 1
+    :caption: Tasks and Advanced Usage
+
+    ../../../examples/applications/computing-embeddings/README
+    semantic_textual_similarity
+    ../../../examples/applications/semantic-search/README
+    ../../../examples/applications/retrieve_rerank/README
+    ../../../examples/applications/clustering/README
+    ../../../examples/applications/paraphrase-mining/README
+    ../../../examples/applications/parallel-sentence-mining/README
+    ../../../examples/applications/image-search/README
+    ../../../examples/applications/embedding-quantization/README
+
diff --git a/docs/training/overview.md b/docs/training/overview.md
deleted file mode 100644
index b3f3ad539..000000000
--- a/docs/training/overview.md
+++ /dev/null
@@ -1,283 +0,0 @@
-# Training Overview
-
-Each task is unique, and having sentence / text embeddings tuned for that specific task greatly improves the performance.
-
-SentenceTransformers was designed in such way that fine-tuning your own sentence / text embeddings models is easy. It provides most of the building blocks that you can stick together to tune embeddings for your specific task.
-
-Sadly there is no single training strategy that works for all use-cases. Instead, which training strategy to use greatly depends on your available data and on your target task.
-
-In the **Training** section, I will discuss the fundamentals of training your own embedding models with SentenceTransformers.
In the **Training Examples** section, I will provide examples how to tune embedding models for common real-world applications. - -## Network Architecture - -For sentence / text embeddings, we want to map a variable length input text to a fixed sized dense vector. The most basic network architecture we can use is the following: - -![SBERT Network Architecture](../img/SBERT_Architecture.png "SBERT Siamese Architecture") - - -We feed the input sentence or text into a transformer network like BERT. BERT produces contextualized word embeddings for all input tokens in our text. As we want a fixed-sized output representation (vector u), we need a pooling layer. Different pooling options are available, the most basic one is mean-pooling: We simply average all contextualized word embeddings BERT is giving us. This gives us a fixed 768 dimensional output vector independent how long our input text was. - -The depicted architecture, consisting of a BERT layer and a pooling layer is one final SentenceTransformer model. - -## Creating Networks from Scratch - -In the quick start & usage examples, we used pre-trained SentenceTransformer models that already come with a BERT layer and a pooling layer. - -But we can create the networks architectures from scratch by defining the individual layers. For example, the following code would create the depicted network architecture: - -```python -from sentence_transformers import SentenceTransformer, models - -word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256) -pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension()) - -model = SentenceTransformer(modules=[word_embedding_model, pooling_model]) -``` - -First we define our individual layers, in this case, we define 'bert-base-uncased' as the *word_embedding_model*. We limit that layer to a maximal sequence length of 256, texts longer than that will be truncated. Further, we create a (mean) pooling layer. We create a new *SentenceTransformer* model by calling `SentenceTransformer(modules=[word_embedding_model, pooling_model])`. For the *modules* parameter, we pass a list of layers which are executed consecutively. Input text are first passed to the first entry (*word_embedding_model*). The output is then passed to the second entry (*pooling_model*), which then returns our sentence embedding. - -We can also construct more complex models: -```python -from sentence_transformers import SentenceTransformer, models -from torch import nn - -word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256) -pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension()) -dense_model = models.Dense( - in_features=pooling_model.get_sentence_embedding_dimension(), - out_features=256, - activation_function=nn.Tanh(), -) - -model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model]) -``` - -Here, we add on top of the pooling layer a fully connected dense layer with Tanh activation, which performs a down-project to 256 dimensions. Hence, embeddings by this model will only have 256 instead of 768 dimensions. 
- -Additionally, we can also create SentenceTransformer models from scratch for image search by loading any CLIP model from the Hugging Face Hub or a local path: - -```py -from sentence_transformers import SentenceTransformer, models - -image_embedding_model = models.CLIPModel("openai/clip-vit-base-patch32") -model = SentenceTransformer(modules=[image_embedding_model]) -``` - -For all available building blocks see [» Models Package Reference](../package_reference/models.md) - -## Training Data - -To train a SentenceTransformer model, you need to inform it somehow that two sentences have a certain degree of similarity. Therefore, each example in the data requires a label or structure that allows the model to understand whether two sentences are similar or different. - -Unfortunately, there is no single way to prepare your data to train a Sentence Transformers model. It largely depends on your goals and the structure of your data. If you don't have an explicit label, which is the most likely scenario, you can derive it from the design of the documents where you obtained the sentences. For example, two sentences in the same report should be more comparable than two sentences in different reports. Neighboring sentences might be more comparable than non-neighboring sentences. - -For more information on available datasets for training SentenceTransformers models see [» Datasets Reference](../../examples/training/datasets/README.md). - -To represent our training data, we use the `InputExample` class to store training examples. As parameters, it accepts texts, which is a list of strings representing our pairs (or triplets). Further, we can also pass a label (either float or int). The following shows a simple example, where we pass text pairs to `InputExample` together with a label indicating the semantic similarity. - - ```python - from sentence_transformers import SentenceTransformer, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer("distilbert-base-nli-mean-tokens") - train_examples = [ - InputExample(texts=["My first sentence", "My second sentence"], label=0.8), - InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16) - ``` - -We wrap our `train_examples` with the standard PyTorch `DataLoader`, which shuffles our data and produces batches of certain sizes. - - - -## Loss Functions - -The loss function plays a critical role when fine-tuning the model. It determines how well our embedding model will work for the specific downstream task. - -Sadly there is no "one size fits all" loss function. Which loss function is suitable depends on the available training data and on the target task. - - -To fine-tune our network, we need somehow to tell our network which sentence pairs are similar, and should be close in vector space, and which pairs are dissimilar, and should be far away in vector space. - -The most simple way is to have sentence pairs annotated with a score indicating their similarity, e.g. on a scale 0 to 1. We can then train the network with a Siamese Network Architecture (for details see: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)) - -![SBERT Siamese Network Architecture](../img/SBERT_Siamese_Network.png "SBERT Siamese Architecture") - - -For each sentence pair, we pass sentence A and sentence B through our network which yields the embeddings *u* und *v*. 
The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. This allows our network to be fine-tuned and to recognize the similarity of sentences. - - -A minimal example with `CosineSimilarityLoss` is the following: -```python -from sentence_transformers import SentenceTransformer, InputExample, losses -from torch.utils.data import DataLoader - -# Define the model. Either from scratch of by loading a pre-trained model -model = SentenceTransformer("distilbert-base-nli-mean-tokens") - -# Define your train examples. You need more than just two examples... -train_examples = [ - InputExample(texts=["My first sentence", "My second sentence"], label=0.8), - InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3), -] - -# Define your train dataset, the dataloader and the train loss -train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16) -train_loss = losses.CosineSimilarityLoss(model) - -# Tune the model -model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100) -``` - - -We tune the model by calling model.fit(). We pass a list of `train_objectives`, which consist of tuples `(dataloader, loss_function)`. We can pass more than one tuple in order to perform multi-task learning on several datasets with different loss functions. - -The `fit` method accepts the following parameter: - -```eval_rst -.. autoclass:: sentence_transformers.SentenceTransformer - :members: fit -``` - -## Evaluators - -During training, we usually want to measure the performance to see if the performance improves. For this, the *[sentence_transformers.evaluation](../package_reference/evaluation.md)* package exists. It contains various evaluators which we can pass to the `fit`-method. These evaluators are run periodically during training. Further, they return a score and only the model with the highest score will be stored on disc. - -The usage is simple: -```python -from sentence_transformers import evaluation - -sentences1 = [ - "This list contains the first column", - "With your sentences", - "You want your model to evaluate on", -] -sentences2 = [ - "Sentences contains the other column", - "The evaluator matches sentences1[i] with sentences2[i]", - "Compute the cosine similarity and compares it to scores[i]", -] -scores = [0.3, 0.6, 0.2] - -evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores) - -# ... Your other code to load training data - -model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=1, - warmup_steps=100, - evaluator=evaluator, - evaluation_steps=500, -) -``` - - - -### Continue Training on Other Data -[training_stsbenchmark_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py) shows an example where training on a fine-tuned model is continued. In that example, we use a sentence transformer model that was first fine-tuned on the NLI dataset and then continue training on the training data from the STS benchmark. - -First, we load a pre-trained model from the server: -```python -model = SentenceTransformer("bert-base-nli-mean-tokens") -``` - - -The next steps are as before. 
We specify training and dev data: -```python -train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size) -train_loss = losses.CosineSimilarityLoss(model=model) - -evaluator = EmbeddingSimilarityEvaluator.from_input_examples( - sts_reader.get_examples("sts-dev.csv") -) -``` - -In that example, we use CosineSimilarityLoss, which computes the cosine similarity between two sentences and compares this score with a provided gold similarity score. - -Then we can train as before: -```python -model.fit( - train_objectives=[(train_dataloader, train_loss)], - evaluator=evaluator, - epochs=num_epochs, - evaluation_steps=1000, - warmup_steps=warmup_steps, - output_path=model_save_path, -) -``` - - -## Loading Custom SentenceTransformer Models -Loading trained models is easy. You can specify a path: -```python -model = SentenceTransformer("./my/path/to/model/") -``` -Note: It is important that a / or \ is present in the path, otherwise, it is not recognized as a path. - -You can also host the training output on a server and download it: - ```python -model = SentenceTransformer('http://www.server.com/path/to/model/my_model.zip') -``` -With the first call, the model is downloaded and stored in the local Hugging Face cache folder (`~/.cache/huggingface`). In order to work, you must zip all files and subfolders of your model. - - - -## Multitask Training -This code allows multi-task learning with training data from different datasets and with different loss-functions. For an example, see [training_multi-task.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_multi-task.py). - - -## Adding Special Tokens - -Depending on the task, you might want to add special tokens to the tokenizer and the Transformer model. You can use the following code-snippet to achieve this: -```python -from sentence_transformers import SentenceTransformer, models - -word_embedding_model = models.Transformer("bert-base-uncased") - -tokens = ["[DOC]", "[QRY]"] -word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True) -word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer)) - -pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension()) -model = SentenceTransformer(modules=[word_embedding_model, pooling_model]) -``` - -If you want to extend the vocabulary for an existent SentenceTransformer model, you can use the following code: -```python -from sentence_transformers import SentenceTransformer, models - -model = SentenceTransformer("all-MiniLM-L6-v2") -word_embedding_model = model._first_module() - -tokens = ["[DOC]", "[QRY]"] -word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True) -word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer)) -``` - -In the above example, the two new tokens `[DOC]` and `[QRY]` are added to the model. Their respective word embeddings are intialized randomly. It is advisable to then fine-tune the model on your downstream task. - - -## Best Transformer Model -The quality of your text embedding model depends on which transformer model you choose. Sadly we cannot infer from a better performance on e.g. the GLUE or SuperGLUE benchmark that this model will also yield better representations. 
- -To test the suitability of transformer models, I use the [training_nli_v2.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py) script and train on 560k (anchor, positive, negative)-triplets for 1 epoch with batch size 64. I then evaluate on 14 diverse text similarity tasks (clustering, semantic search, duplicate detection etc.) from various domains. - -In the following table you find the performance for different models and their performance on this benchmark: - -| Model | Performance (14 sentence similarity tasks) | -| --- | :---: | -| [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) | 60.99 | -| [nghuyong/ernie-2.0-en](https://huggingface.co/nghuyong/ernie-2.0-en) | 60.73 | -| [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base) | 60.21 | -| [roberta-base](https://huggingface.co/roberta-base) | 59.63 | -| [t5-base](https://huggingface.co/t5-base) | 59.21 | -| [bert-base-uncased](https://huggingface.co/bert-base-uncased) | 59.17 | -| [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) | 59.03 | -| [nreimers/TinyBERT_L-6_H-768_v2](https://huggingface.co/nreimers/TinyBERT_L-6_H-768_v2) | 58.27 | -| [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) | 57.63 | -| [nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large](https://huggingface.co/nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large) | 57.31 | -| [albert-base-v2](https://huggingface.co/albert-base-v2) | 57.14 | -| [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased) | 56.79 | -| [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) | 54.46 | diff --git a/docs/usage/semantic_textual_similarity.md b/docs/usage/semantic_textual_similarity.md deleted file mode 100644 index fc772d1e1..000000000 --- a/docs/usage/semantic_textual_similarity.md +++ /dev/null @@ -1,82 +0,0 @@ -# Semantic Textual Similarity - -Once you have [sentence embeddings computed](../../examples/applications/computing-embeddings/README.md), you usually want to compare them to each other. Here, I show you how you can compute the cosine similarity between embeddings, for example, to measure the semantic similarity of two texts. - -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Two lists of sentences -sentences1 = [ - "The cat sits outside", - "A man is playing guitar", - "The new movie is awesome", -] - -sentences2 = [ - "The dog plays in the garden", - "A woman watches TV", - "The new movie is so great", -] - -# Compute embedding for both lists -embeddings1 = model.encode(sentences1, convert_to_tensor=True) -embeddings2 = model.encode(sentences2, convert_to_tensor=True) - -# Compute cosine-similarities -cosine_scores = util.cos_sim(embeddings1, embeddings2) - -# Output the pairs with their score -for i in range(len(sentences1)): - print("{} \t\t {} \t\t Score: {:.4f}".format( - sentences1[i], sentences2[i], cosine_scores[i][i] - )) -``` - -We pass the `convert_to_tensor=True` parameter to the encode function. This will return a pytorch tensor containing our embeddings. We can then call `util.cos_sim(A, B)` which computes the cosine similarity between all vectors in *A* and all vectors in *B*. - -It returns in the above example a 3x3 matrix with the respective cosine similarity scores for all possible pairs between *embeddings1* and *embeddings2*. 
- - -You can use this function also to find out the pairs with the highest cosine similarity scores: -```python -from sentence_transformers import SentenceTransformer, util - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Single list of sentences -sentences = [ - "The cat sits outside", - "A man is playing guitar", - "I love pasta", - "The new movie is awesome", - "The cat plays in the garden", - "A woman watches TV", - "The new movie is so great", - "Do you like pizza?", -] - -# Compute embeddings -embeddings = model.encode(sentences, convert_to_tensor=True) - -# Compute cosine-similarities for each sentence with each other sentence -cosine_scores = util.cos_sim(embeddings, embeddings) - -# Find the pairs with the highest cosine similarity scores -pairs = [] -for i in range(cosine_scores.shape[0]): - for j in range(cosine_scores.shape[1]): - pairs.append({"index": [i, j], "score": cosine_scores[i][j]}) - -# Sort scores in decreasing order -pairs = sorted(pairs, key=lambda x: x["score"], reverse=True) - -for pair in pairs[0:10]: - i, j = pair["index"] - print("{} \t\t {} \t\t Score: {:.4f}".format( - sentences[i], sentences[j], pair["score"] - )) -``` - -Note, in the above approach we use a brute-force approach to find the highest scoring pairs, which has a quadratic complexity. For long lists of sentences, this might be infeasible. If you want find the highest scoring pairs in a long list of sentences, have a look at [Paraphrase Mining](../../examples/applications/paraphrase-mining/README.md). \ No newline at end of file diff --git a/examples/README.md b/examples/README.md index 4ab4c6c6f..145d3f3b6 100644 --- a/examples/README.md +++ b/examples/README.md @@ -9,7 +9,7 @@ The [applications](applications/) folder contains examples how to use SentenceTr The [evaluation](evaluation/) folder contains some examples how to evaluate SentenceTransformer models for common tasks. ## Training -The [training](training/) folder contains examples how to fine-tune transformer models like BERT, RoBERTa, or XLM-RoBERTa for generating sentence embedding. For the documentation how to train your own models, see [Training Overview](http://www.sbert.net/docs/training/overview.html). +The [training](training/) folder contains examples how to fine-tune transformer models like BERT, RoBERTa, or XLM-RoBERTa for generating sentence embedding. For the documentation how to train your own models, see [Training Overview](http://www.sbert.net/docs/sentence_transformer/training_overview.html). ## Unsupervised Learning diff --git a/examples/applications/clustering/README.md b/examples/applications/clustering/README.md index 98c5e64f7..d8d1c3e9c 100644 --- a/examples/applications/clustering/README.md +++ b/examples/applications/clustering/README.md @@ -15,7 +15,7 @@ In [fast_clustering.py](fast_clustering.py) we present a clustering algorithm th You can configure the threshold of cosine-similarity for which we consider two sentences as similar. Also, you can specify the minimal size for a local community. This allows you to get either large coarse-grained clusters or small fine-grained clusters. 
-We apply it on the [Quora Duplicate Questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset and the output looks something like this: +We apply it on the [Quora Duplicate Questions](https://huggingface.co/datasets/sentence-transformers/quora-duplicates) dataset and the output looks something like this: ``` Cluster 1, #83 Elements @@ -51,7 +51,6 @@ For each topic, you want to extract the words that describe this topic: ![20news](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/20news_top2vec.png) -Sentence-Transformers can be used to identify these topics in a collection of sentences, paragraphs or short documents. For an excellent tutorial, see [Topic Modeling with BERT](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) as well as the repositories [Top2Vec](https://github.com/ddangelov/Top2Vec) and [BERTopic](https://github.com/MaartenGr/BERTopic). +Sentence-Transformers can be used to identify these topics in a collection of sentences, paragraphs or short documents. For an excellent tutorial, see [Topic Modeling with BERT](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) as well as the [BERTopic](https://github.com/MaartenGr/BERTopic) and [Top2Vec](https://github.com/ddangelov/Top2Vec) repositories. - - Image source: [Top2Vec: Distributed Representations of Topics](https://arxiv.org/abs/2008.09470) +Image source: [Top2Vec: Distributed Representations of Topics](https://arxiv.org/abs/2008.09470) diff --git a/examples/applications/computing-embeddings/README.md b/examples/applications/computing-embeddings/README.md deleted file mode 100644 index db1e7d80d..000000000 --- a/examples/applications/computing-embeddings/README.md +++ /dev/null @@ -1,213 +0,0 @@ -# Computing Sentence Embeddings - - - -The basic function to compute sentence embeddings looks like this: -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("all-MiniLM-L6-v2") - -# Our sentences we like to encode -sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of strings.", - "The quick brown fox jumps over the lazy dog.", -] - -# Sentences are encoded by calling model.encode() -embeddings = model.encode(sentences) - -# Print the embeddings -for sentence, embedding in zip(sentences, embeddings): - print("Sentence:", sentence) - print("Embedding:", embedding) - print("") -``` - -**Note:** Even though we talk about sentence embeddings, you can use it also for shorter phrases as well as for longer texts with multiple sentences. See the section on Input Sequence Length for more notes on embeddings for paragraphs. - -First, we load a sentence-transformer model: -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("model_name_or_path") -``` - -You can either specify a [pre-trained model](https://www.sbert.net/docs/pretrained_models.html) or you can pass a path on your disc to load the sentence-transformer model from that folder. - -If available, the model is automatically executed on the GPU. You can specify the device for the model like this: -```python -model = SentenceTransformer("model_name_or_path", device="cuda") -``` - -With *device* any pytorch device (like CPU, cuda, cuda:0 etc.) - - -The relevant method to encode a set of sentences / texts is `model.encode()`. In the following, you can find parameters this method accepts. 
Some relevant parameters are *batch_size* (depending on your GPU a different batch size is optimal) as well as *convert_to_numpy* (returns a numpy matrix) and *convert_to_tensor* (returns a pytorch tensor). - -```eval_rst -.. autoclass:: sentence_transformers.SentenceTransformer - :members: encode -``` - -## Prompt Templates -Some models require using specific text *prompts* to achieve optimal performance. For example, with [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) you should prefix all queries with `query: ` and all passages with `passage: `. Another example is [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5), which performs best for retrieval when the input texts are prefixed with `Represent this sentence for searching relevant passages: `. - -Sentence Transformer models can be initialized with `prompts` and `default_prompt_name` parameters: -* `prompts` is an optional argument that accepts a dictionary of prompts with prompt names to prompt texts. The prompt will be prepended to the input text during inference. For example, - ```python - model = SentenceTransformer( - "intfloat/multilingual-e5-large", - prompts={ - "classification": "Classify the following text: ", - "retrieval": "Retrieve semantically similar text: ", - "clustering": "Identify the topic or theme based on the text: ", - }, - ) - # or - model.prompts = { - "classification": "Classify the following text: ", - "retrieval": "Retrieve semantically similar text: ", - "clustering": "Identify the topic or theme based on the text: ", - } - ``` -* `default_prompt_name` is an optional argument that determines the default prompt to be used. It has to correspond with a prompt name from `prompts`. If `None`, then no prompt is used by default. For example, - ```python - model = SentenceTransformer( - "intfloat/multilingual-e5-large", - prompts={ - "classification": "Classify the following text: ", - "retrieval": "Retrieve semantically similar text: ", - "clustering": "Identify the topic or theme based on the text: ", - }, - default_prompt_name="retrieval", - ) - # or - model.default_prompt_name="retrieval" - ``` -Both of these parameters can also be specified in the `config_sentence_transformers.json` file of a saved model. That way, you won't have to specify these options manually when loading. When you save a Sentence Transformer model, these options will be automatically saved as well. - - -During inference, prompts can be applied in a few different ways. All of these scenarios result in identical texts being embedded: -1. Explicitly using the `prompt` option in `SentenceTransformer.encode`: - ```python - embeddings = model.encode("How to bake a strawberry cake", prompt="Retrieve semantically similar text: ") - ``` -2. Explicitly using the `prompt_name` option in `SentenceTransformer.encode` by relying on the prompts loaded from a) initialization or b) the model config. - ```python - embeddings = model.encode("How to bake a strawberry cake", prompt_name="retrieval") - ``` -3. If `prompt` nor `prompt_name` are specified in `SentenceTransformer.encode`, then the prompt specified by `default_prompt_name` will be applied. If it is `None`, then no prompt will be applied. - ```python - embeddings = model.encode("How to bake a strawberry cake") - ``` - - -## Input Sequence Length -Transformer models like BERT / RoBERTa / DistilBERT etc. the runtime and the memory requirement grows quadratic with the input length. This limits transformers to inputs of certain lengths. 
A common value for BERT & Co. are 512 word pieces, which corresponds to about 300-400 words (for English). Longer texts than this are truncated to the first x word pieces. - -By default, the provided methods use a limit of 128 word pieces, longer inputs will be truncated. You can get and set the maximal sequence length like this: - -```python -from sentence_transformers import SentenceTransformer - -model = SentenceTransformer("all-MiniLM-L6-v2") - -print("Max Sequence Length:", model.max_seq_length) - -# Change the length to 200 -model.max_seq_length = 200 - -print("Max Sequence Length:", model.max_seq_length) -``` - -**Note:** You cannot increase the length higher than what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be that good. - -## Storing & Loading Embeddings -The easiest method is to use *pickle* to store pre-computed embeddings on disc and to load it from disc. This can especially be useful if you need to encode large set of sentences. - - -```python -from sentence_transformers import SentenceTransformer -import pickle - -model = SentenceTransformer("all-MiniLM-L6-v2") -sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of string.", - "The quick brown fox jumps over the lazy dog.", -] - - -embeddings = model.encode(sentences) - -# Store sentences & embeddings on disc -with open("embeddings.pkl", "wb") as fOut: - pickle.dump({"sentences": sentences, "embeddings": embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL) - -# Load sentences & embeddings from disc -with open("embeddings.pkl", "rb") as fIn: - stored_data = pickle.load(fIn) - stored_sentences = stored_data["sentences"] - stored_embeddings = stored_data["embeddings"] -``` - -## Multi-Process / Multi-GPU Encoding - -You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). For an example, see: [computing_embeddings_multi_gpu.py](computing_embeddings_multi_gpu.py). - -The relevant method is `start_multi_process_pool()`, which starts multiple processes that are used for encoding. - - ```eval_rst -.. automethod:: sentence_transformers.SentenceTransformer.start_multi_process_pool -``` - -## Sentence Embeddings with Transformers -Most of our pre-trained models are based on [Huggingface.co/Transformers](https://huggingface.co/transformers/) and are also hosted in the [models repository](https://huggingface.co/models) from Huggingface. 
It is possible to use our sentence embeddings models without installing sentence-transformers: - -```python -from transformers import AutoTokenizer, AutoModel -import torch - - -# Mean Pooling - Take attention mask into account for correct averaging -def mean_pooling(model_output, attention_mask): - token_embeddings = model_output[0] # First element of model_output contains all token embeddings - input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() - sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) - sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9) - return sum_embeddings / sum_mask - - -# Sentences we want sentence embeddings for -sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of string.", - "The quick brown fox jumps over the lazy dog.", -] - -# Load AutoModel from huggingface model repository -tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2") -model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2") - -# Tokenize sentences -encoded_input = tokenizer( - sentences, padding=True, truncation=True, max_length=128, return_tensors="pt" -) - -# Compute token embeddings -with torch.no_grad(): - model_output = model(**encoded_input) - -# Perform pooling. In this case, mean pooling -sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"]) -``` - - -You can find the available models here: [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers) - - -In the above example we add mean pooling on top of the AutoModel (which will load a BERT model). We also have models with max-pooling and where we use the CLS token. How to apply this pooling correctly, have a look at [sentence-transformers/bert-base-nli-max-tokens](https://huggingface.co/sentence-transformers/bert-base-nli-max-tokens) and [/sentence-transformers/bert-base-nli-cls-token](https://huggingface.co/sentence-transformers/bert-base-nli-cls-token). - - diff --git a/examples/applications/computing-embeddings/README.rst b/examples/applications/computing-embeddings/README.rst new file mode 100644 index 000000000..2abc73fde --- /dev/null +++ b/examples/applications/computing-embeddings/README.rst @@ -0,0 +1,150 @@ +Computing Embeddings +==================== + +Once you have `installed `_ Sentence Transformers, you can easily use Sentence Transformer models: + +.. sidebar:: Documentation + + 1. :class:`SentenceTransformer <sentence_transformers.SentenceTransformer>` + 2. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>` + 3. :meth:`SentenceTransformer.similarity <sentence_transformers.SentenceTransformer.similarity>` + +:: + + from sentence_transformers import SentenceTransformer + + # 1. Load a pretrained Sentence Transformer model + model = SentenceTransformer("all-MiniLM-L6-v2") + + # The sentences to encode + sentences = [ + "The weather is lovely today.", + "It's so sunny outside!", + "He drove to the stadium.", + ] + + # 2. Calculate embeddings by calling model.encode() + embeddings = model.encode(sentences) + print(embeddings.shape) + # [3, 384] + + # 3. Calculate the embedding similarities + similarities = model.similarity(embeddings, embeddings) + print(similarities) + # tensor([[1.0000, 0.6660, 0.1046], + # [0.6660, 1.0000, 0.1411], + # [0.1046, 0.1411, 1.0000]]) + +.. note:: + Even though we talk about sentence embeddings, you can use Sentence Transformers for shorter phrases as well as for longer texts with multiple sentences.
See `Input Sequence Length <#input-sequence-length>`_ for notes on embeddings for longer texts. + + +Initializing a Sentence Transformer Model +----------------------------------------- + +The first step is to load a pretrained Sentence Transformer model. You can use any of the models from the `Pretrained Models <../docs/sentence_transformer/pretrained_models.html>`_ or a local model. See also :class:`~sentence_transformers.SentenceTransformer` for information on parameters. + +:: + + from sentence_transformers import SentenceTransformer + + model = SentenceTransformer("all-mpnet-base-v2") + # Alternatively, you can pass a path to a local model directory: + model = SentenceTransformer("output/models/mpnet-base-finetuned-all-nli") + +The model will automatically be placed on the most performant available device, e.g. ``cuda`` or ``mps`` if available. You can also specify the device explicitly: + +:: + + model = SentenceTransformer("all-mpnet-base-v2", device="cuda") + +Calculating Embeddings +---------------------- + +The method to calculate embeddings is :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>`. + + +Prompt Templates +---------------- + +Some models require using specific text *prompts* to achieve optimal performance. For example, with `intfloat/multilingual-e5-large <https://huggingface.co/intfloat/multilingual-e5-large>`_ you should prefix all queries with ``"query: "`` and all passages with ``"passage: "``. Another example is `BAAI/bge-large-en-v1.5 <https://huggingface.co/BAAI/bge-large-en-v1.5>`_, which performs best for retrieval when the input texts are prefixed with ``"Represent this sentence for searching relevant passages: "``. + +Sentence Transformer models can be initialized with ``prompts`` and ``default_prompt_name`` parameters: + +- ``prompts`` is an optional argument that accepts a dictionary of prompts with prompt names to prompt texts. The prompt will be prepended to the input text during inference. For example:: + + model = SentenceTransformer( + "intfloat/multilingual-e5-large", + prompts={ + "classification": "Classify the following text: ", + "retrieval": "Retrieve semantically similar text: ", + "clustering": "Identify the topic or theme based on the text: ", + }, + ) + # or + model.prompts = { + "classification": "Classify the following text: ", + "retrieval": "Retrieve semantically similar text: ", + "clustering": "Identify the topic or theme based on the text: ", + } + +- ``default_prompt_name`` is an optional argument that determines the default prompt to be used. It has to correspond with a prompt name from ``prompts``. If ``None``, then no prompt is used by default. For example:: + + model = SentenceTransformer( + "intfloat/multilingual-e5-large", + prompts={ + "classification": "Classify the following text: ", + "retrieval": "Retrieve semantically similar text: ", + "clustering": "Identify the topic or theme based on the text: ", + }, + default_prompt_name="retrieval", + ) + # or + model.default_prompt_name="retrieval" + +Both of these parameters can also be specified in the ``config_sentence_transformers.json`` file of a saved model. That way, you won't have to specify these options manually when loading. When you save a Sentence Transformer model, these options will be automatically saved as well. + +During inference, prompts can be applied in a few different ways. All of these scenarios result in identical texts being embedded: + +1.
Explicitly using the ``prompt`` option in ``SentenceTransformer.encode``:: + + embeddings = model.encode("How to bake a strawberry cake", prompt="Retrieve semantically similar text: ") + +2. Explicitly using the ``prompt_name`` option in ``SentenceTransformer.encode`` by relying on the prompts loaded from a) initialization or b) the model config:: + + embeddings = model.encode("How to bake a strawberry cake", prompt_name="retrieval") + +3. If neither ``prompt`` nor ``prompt_name`` is specified in ``SentenceTransformer.encode``, then the prompt specified by ``default_prompt_name`` will be applied. If it is ``None``, then no prompt will be applied:: + + embeddings = model.encode("How to bake a strawberry cake") + +Input Sequence Length +--------------------- + +For transformer models like BERT, RoBERTa, DistilBERT etc., the runtime and memory requirements grow quadratically with the input length. This limits transformers to inputs of certain lengths. A common value for BERT-based models is 512 tokens, which corresponds to about 300-400 words (for English). + +Each model has a maximum sequence length under ``model.max_seq_length``, which is the maximal number of tokens that can be processed. Longer texts will be truncated to the first ``model.max_seq_length`` tokens:: + + from sentence_transformers import SentenceTransformer + + model = SentenceTransformer("all-MiniLM-L6-v2") + print("Max Sequence Length:", model.max_seq_length) + # => Max Sequence Length: 256 + + # Change the length to 200 + model.max_seq_length = 200 + + print("Max Sequence Length:", model.max_seq_length) + # => Max Sequence Length: 200 + +.. note:: + + You cannot increase the length beyond what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be that good. + +Multi-Process / Multi-GPU Encoding +---------------------------------- + +You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). For an example, see: `computing_embeddings_multi_gpu.py <computing_embeddings_multi_gpu.py>`_. + + +The relevant method is :meth:`~sentence_transformers.SentenceTransformer.start_multi_process_pool`, which starts multiple processes that are used for encoding. \ No newline at end of file diff --git a/examples/applications/embedding-quantization/README.md b/examples/applications/embedding-quantization/README.md index b75a96ac0..964d2997c 100644 --- a/examples/applications/embedding-quantization/README.md +++ b/examples/applications/embedding-quantization/README.md @@ -20,6 +20,16 @@ Quantizing an embedding with a dimensionality of 1024 to binary would result in As a result, in practice quantizing a `float32` embedding with a dimensionality of 1024 yields an `int8` or `uint8` embedding with a dimensionality of 128. See two approaches of how you can produce quantized embeddings using Sentence Transformers below: +```eval_rst +.. sidebar:: References + + #. `mixedbread-ai/mxbai-embed-large-v1 <https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1>`_ + #. :class:`~sentence_transformers.SentenceTransformer` + #. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>` + #. :func:`~sentence_transformers.quantization.quantize_embeddings` + +``` + ```python from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings @@ -38,11 +48,6 @@ embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."
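# Binary quantization maps each float32 dimension to a single bit, so a
# 1024-dimensional embedding is packed into 128 int8/uint8 values (32x smaller)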
binary_embeddings = quantize_embeddings(embeddings, precision="binary") ``` -**References:** -* mixedbread-ai/mxbai-embed-large-v1 -* SentenceTransformer.encode -* quantize_embeddings - Here you can see the differences between default `float32` embeddings and binary embeddings in terms of shape, size, and `numpy` dtype: ```python @@ -84,6 +89,16 @@ Computing int8 quantization buckets based on 2 embeddings. int8 quantization is See how you can produce scalar quantized embeddings using Sentence Transformers below: +```eval_rst +.. sidebar:: References + + #. `mixedbread-ai/mxbai-embed-large-v1 <https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1>`_ + #. :class:`~sentence_transformers.SentenceTransformer` + #. :meth:`SentenceTransformer.encode <sentence_transformers.SentenceTransformer.encode>` + #. :func:`~sentence_transformers.quantization.quantize_embeddings` + +``` + ```python from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings @@ -105,11 +120,6 @@ int8_embeddings = quantize_embeddings( ) ``` -**References:** -* mixedbread-ai/mxbai-embed-large-v1 -* SentenceTransformer.encode -* quantize_embeddings - Here you can see the differences between default `float32` embeddings and `int8` scalar embeddings in terms of shape, size, and `numpy` dtype: ```python @@ -154,6 +164,7 @@ The following demo showcases the retrieval efficiency using `exact` search throu width="100%" height="1000" > +
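To give a feel for what such exact binary retrieval involves, here is a minimal sketch (the corpus and query are made up for illustration; it uses `quantize_embeddings` with `precision="ubinary"` and a plain `numpy` Hamming distance, not the code behind the demo):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

corpus = [
    "Berlin is the capital of Germany.",
    "The lake freezes over in winter.",
]
# "ubinary" packs one bit per dimension into uint8 values (dim / 8 bytes per text)
corpus_bits = quantize_embeddings(model.encode(corpus), precision="ubinary")
query_bits = quantize_embeddings(model.encode(["What is the capital of Germany?"]), precision="ubinary")

# Exact search: Hamming distance = number of differing bits per (query, document) pair
hamming = np.unpackbits(query_bits ^ corpus_bits, axis=1).sum(axis=1)
print(hamming.argsort())  # corpus indices ranked from most to least similar
```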

    ## Try it yourself diff --git a/examples/applications/parallel-sentence-mining/README.md b/examples/applications/parallel-sentence-mining/README.md index de99e60f4..4996ee38d 100644 --- a/examples/applications/parallel-sentence-mining/README.md +++ b/examples/applications/parallel-sentence-mining/README.md @@ -24,10 +24,10 @@ This is an example sentences. Dies ist ein Beispielsatz. Usually you apply this method to large corpora, for example, when you want to find all translated sentences in the English Wikipedia and the Chinese Wikipedia. -## Marging Based Mining +## Margin Based Mining We follow the setup from [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pdf/1812.10464.pdf) to find translated sentences in two datasets: -1) First, we encode all sentences to their respective embedding. As shown in [our paper](https://arxiv.org/abs/2004.09813) is [LaBSE](https://tfhub.dev/google/LaBSE/1) currently the best method for bitext mining. The model is integrated in Sentence-Transformers +1) First, we encode all sentences to their respective embedding. As shown in [our paper](https://arxiv.org/abs/2004.09813), [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) is currently the best method for bitext mining. The model is integrated in Sentence-Transformers 2) Once we have all embeddings, we find the *k* nearest neighbor sentences for all sentences in both directions. Typical choices for k are between 4 and 16. 3) Then, we score all possible sentence combinations using the formula mentioned in Section 4.3. 4) The pairs with the highest scores are most likely translated sentences. Note that the score can be larger than 1. Usually you have to find some cut-off where you ignore pairs below that threshold. For a high quality, a threshold of about 1.2 - 1.3 works quite well. @@ -35,4 +35,4 @@ We follow the setup from [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pd ## Examples - **[bucc2018.py](bucc2018.py)** - This script contains an example for the [BUCC 2018 shared task](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) on finding parallel sentences. This dataset can be used to evaluate different strategies, as we know which sentences are parallel in the two corpora. The script mines for parallel sentences and then prints the optimal threshold that leads to the highest F1-score. - **[bitext_mining.py](bitext_mining.py)** - This file reads in two text files (with a single sentence in each line) and outputs parallel sentences to *parallel-sentences-out.tsv.gz*. -- **[In-domain Data Selection for MT](https://www.clinjournal.org/clinj/article/view/137)** - This paper also employed S-BERT to generate/select in-domain parallel data for machine translation systems – using monolingual texts. +- **[In-domain Data Selection for MT](https://www.clinjournal.org/clinj/article/view/137)** - This paper also employed Sentence Transformers to generate/select in-domain parallel data for machine translation systems – using monolingual texts. diff --git a/examples/applications/paraphrase-mining/README.md b/examples/applications/paraphrase-mining/README.md index 02ae141eb..685d47ffa 100644 --- a/examples/applications/paraphrase-mining/README.md +++ b/examples/applications/paraphrase-mining/README.md @@ -1,54 +1,49 @@ # Paraphrase Mining -Paraphrase mining is the task of finding paraphrases (texts with identical / similar meaning) in a large corpus of sentences.
In [Semantic Textual Similarity](../../../docs/usage/semantic_textual_similarity.md) we saw a simplified version of finding paraphrases in a list of sentences. The approach presented there used a brute-force approach to score and rank all pairs. +Paraphrase mining is the task of finding paraphrases (texts with identical / similar meaning) in a large corpus of sentences. In [Semantic Textual Similarity](../../../docs/sentence_transformer/usage/semantic_textual_similarity.rst) we saw a simplified version of finding paraphrases in a list of sentences. The approach presented there used a brute-force approach to score and rank all pairs. -However, as this has a quadratic runtime, it fails to scale to large (10,000 and more) collections of sentences. +```eval_rst +However, as this has a quadratic runtime, it fails to scale to large (10,000 and more) collections of sentences. For larger collections, the :func:`~sentence_transformers.util.paraphrase_mining` function can be used:: -For larger collections, *util* offers the *paraphrase_mining* function that can be used like this: -```python -from sentence_transformers import SentenceTransformer, util + from sentence_transformers import SentenceTransformer + from sentence_transformers.util import paraphrase_mining -model = SentenceTransformer("all-MiniLM-L6-v2") + model = SentenceTransformer("all-MiniLM-L6-v2") -# Single list of sentences - Possible tens of thousands of sentences -sentences = [ - "The cat sits outside", - "A man is playing guitar", - "I love pasta", - "The new movie is awesome", - "The cat plays in the garden", - "A woman watches TV", - "The new movie is so great", - "Do you like pizza?", -] + # Single list of sentences - Possible tens of thousands of sentences + sentences = [ + "The cat sits outside", + "A man is playing guitar", + "I love pasta", + "The new movie is awesome", + "The cat plays in the garden", + "A woman watches TV", + "The new movie is so great", + "Do you like pizza?", + ] -paraphrases = util.paraphrase_mining(model, sentences) + paraphrases = paraphrase_mining(model, sentences) -for paraphrase in paraphrases[0:10]: - score, i, j = paraphrase - print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score)) -``` + for paraphrase in paraphrases[0:10]: + score, i, j = paraphrase + print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score)) -The **paraphrase_mining()**-method accepts the following parameters: -```eval_rst -.. autofunction:: sentence_transformers.util.paraphrase_mining -``` +The :func:`~sentence_transformers.util.paraphrase_mining` function accepts the following parameters: -Instead of computing all pairwise cosine scores and ranking all possible, combinations, the approach is a bit more complex (and hence efficient). We chunk our corpus into smaller pieces, which is defined by *query_chunk_size* and *corpus_chunk_size*. For example, if we set *query_chunk_size=1000*, we search paraphrases for 1,000 sentences at a time in the remaining corpus (all other sentences). However, the remaining corpus is also chunked, for example, if we set *corpus_chunk_size=10000*, we look for paraphrases in 10k sentences at a time. - -If we pass a list of 20k sentences, we will chunk it to 20x1000 sentences, and each of the query is compared first against sentences 0-10k and then 10k-20k. +.. autofunction:: sentence_transformers.util.paraphrase_mining -This is done to reduce the memory requirement. Increasing both values improves the speed, but increases also the memory requirement.
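To make the chunking idea concrete, here is a simplified sketch of such a chunked scoring loop (an illustration only, not the actual implementation of `paraphrase_mining`; it assumes a 2D `numpy` array of normalized embeddings):

```python
import numpy as np

def chunked_cosine_scores(embeddings: np.ndarray, query_chunk_size: int = 1000, corpus_chunk_size: int = 10000):
    # Only one (query_chunk_size, corpus_chunk_size) score matrix is held in
    # memory at a time, instead of the full (n, n) matrix of all pairwise scores.
    n = len(embeddings)
    for q_start in range(0, n, query_chunk_size):
        query_chunk = embeddings[q_start : q_start + query_chunk_size]
        for c_start in range(0, n, corpus_chunk_size):
            corpus_chunk = embeddings[c_start : c_start + corpus_chunk_size]
            # Cosine scores for this block, assuming normalized embeddings
            yield q_start, c_start, query_chunk @ corpus_chunk.T
```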
+To optimize memory and computation time, paraphrase mining is performed in chunks, as specified by ``query_chunk_size`` and ``corpus_chunk_size``. +To be specific, only ``query_chunk_size * corpus_chunk_size`` pairs will be compared at a time, rather than ``len(sentences) * len(sentences)``. This is more time- and memory-efficient. Additionally, :func:`~sentence_transformers.util.paraphrase_mining` only considers the ``top_k`` best scores per sentence per chunk. You can experiment with this value as an efficiency-performance trade-off. +For example, in the following script, you will get only the single most relevant other sentence for each sentence. -The next critical thing is finding the pairs with the highest similarities. Instead of getting and sorting all n^2 pairwise scores, we take for each query only the *top_k* scores. So with *top_k=100*, we find at most 100 paraphrases per sentence per chunk. You can play around with *top_k* to the ensure a certain behaviour. +:: -So for example, with -```python -paraphrases = util.paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1) -``` + paraphrases = paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1) -You will get for each sentence only the one most other relevant sentence. Note, if B is the most similar sentence for A, A must not be the most similar sentence for B. So it can happen that the returned list contains entries like (A, B) and (B, C). +The final key parameter is ``max_pairs``, which determines the maximum number of paraphrase pairs that the function returns. Usually, you get fewer pairs returned because the list is cleaned of duplicates, e.g., if it contains (A, B) and (B, A), then only one is returned. -The final relevant parameter is *max_pairs*, which determines the maximum number of paraphrase pairs you like to get returned. If you set it to e.g. *max_pairs=100*, you will not get more than 100 paraphrase pairs returned. Usually, you get fewer pairs returned as the list is cleaned of duplicates, e.g., if it contains (A, B) and (B, A), then only one is returned. - +.. note:: + + If B is the most similar sentence for A, A is not necessarily the most similar sentence for B. So it can happen that the returned list contains entries like (A, B) and (B, C). +``` \ No newline at end of file diff --git a/examples/applications/retrieve_rerank/README.md b/examples/applications/retrieve_rerank/README.md index d6d50de95..5b73db7ed 100644 --- a/examples/applications/retrieve_rerank/README.md +++ b/examples/applications/retrieve_rerank/README.md @@ -1,41 +1,33 @@ # Retrieve & Re-Rank In [Semantic Search](../semantic-search/README.md) we have shown how to use SentenceTransformer to compute embeddings for queries, sentences, and paragraphs and how to use this for semantic search. -For complex search tasks, for example, for question answering retrieval, the search can significantly be improved by using **Retrieve & Re-Rank**. +For complex search tasks, for example question answering retrieval, the search can be significantly improved by using **Retrieve & Re-Rank**. ## Retrieve & Re-Rank Pipeline -A pipeline for information retrieval / question answering retrieval that works well is the following. All components are provided and explained in this article: +The following pipeline for Information Retrieval / Question Answering Retrieval works very well.
All components are provided and explained in this article: ![InformationRetrieval](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png) -Given a search query, we first use a **retrieval system** that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query. For the retrieval, we can use either lexical search, e.g. with Elasticsearch, or we can use dense retrieval with a bi-encoder. - -However, the retrieval system might retrieve documents that are not that relevant for the search query. Hence, in a second stage, we use a **re-ranker** based on a **cross-encoder** that scores the relevancy of all candidates for the given search query. - -The output will be a ranked list of hits we can present to the user. +Given a search query, we first use a **retrieval system** that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query. For the retrieval, we can use either lexical search, e.g. with a search engine like Elasticsearch, or we can use dense retrieval with a bi-encoder. However, the retrieval system might retrieve documents that are not that relevant for the search query. Hence, in a second stage, we use a **re-ranker** based on a **cross-encoder** that scores the relevancy of all candidates for the given search query. The output will be a ranked list of hits we can present to the user. ## Retrieval: Bi-Encoder -For the retrieval of the candidate set, we can either use lexical search (e.g. [Elasticsearch](https://www.elastic.co/elasticsearch/)), or we can use a bi-encoder which is implemented in this repository. +For the retrieval of the candidate set, we can either use lexical search (e.g. [Elasticsearch](https://www.elastic.co/elasticsearch/)), or we can use a bi-encoder which is implemented in Sentence Transformers. Lexical search looks for literal matches of the query words in your document collection. It will not recognize synonyms, acronyms or spelling variations. In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space. ![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png) -Semantic search overcomes the short comings of lexical search and can recognize synonym and acronyms. Have a look at the [semantic search article](../semantic-search/README.md) for different options to implement semantic search. +Semantic search overcomes the shortcomings of lexical search and can recognize synonyms and acronyms. Have a look at the [semantic search article](../semantic-search/README.md) for different options to implement semantic search. ## Re-Ranker: Cross-Encoder -The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates. - -A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a possible document is passed simultaneously to transformer network, which then outputs a single score between 0 and 1 indicating how relevant the document is for the given query. +The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates. A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a possible document are passed simultaneously to the transformer network, which then outputs a single score between 0 and 1 indicating how relevant the document is for the given query.
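As a minimal sketch of this scoring step (the query and passages are made up; the model is one of the MS MARCO cross-encoders referenced at the end of this page):

```python
from sentence_transformers import CrossEncoder

# The cross-encoder reads the query and a candidate passage together and
# returns one relevance score per (query, passage) pair
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How many people live in Berlin?"
passages = [
    "Berlin has around 3.7 million registered inhabitants.",
    "Berlin is well known for its museums.",
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)  # higher score = more relevant to the query
```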
![CrossEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png) -The advantage of Cross-Encoders is the higher performance, as they perform attention across the query and the document. - -Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder. +The advantage of Cross-Encoders is the higher performance, as they perform attention across the query and the document. Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder. ## Example Scripts @@ -50,7 +42,7 @@ The bi-encoder produces embeddings independently for your paragraphs and for you ```python from sentence_transformers import SentenceTransformer -model = SentenceTransformer("model_name") +model = SentenceTransformer("multi-qa-mpnet-base-dot-v1") docs = [ "My first paragraph. That contains information", @@ -65,9 +57,8 @@ query_embedding = model.encode(query) For more details on how to compare the embeddings, see [semantic search](../semantic-search/README.md). We provide pre-trained models based on: -- **MS MARCO:** 500k real user queries from Bing search engine. See [MS MARCO models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) +- **MS MARCO:** 500k real user queries from the Bing search engine. See [MS MARCO models](../../../docs/pretrained-models/msmarco-v3.html) ## Pre-trained Cross-Encoders (Re-Ranker) - -For pre-trained models, see: [MS MARCO Cross-Encoders](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html) +For pre-trained Cross Encoder models, see: [MS MARCO Cross-Encoders](../../../docs/pretrained-models/ce-msmarco.html) diff --git a/examples/applications/semantic-search/README.md b/examples/applications/semantic-search/README.md index 77a539039..ba29a8ac3 100644 --- a/examples/applications/semantic-search/README.md +++ b/examples/applications/semantic-search/README.md @@ -1,50 +1,65 @@ # Semantic Search -Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines which only find documents based on lexical matches, semantic search can also find synonyms. - +Semantic search seeks to improve search accuracy by understanding the semantic meaning of the search query and the corpus to search over. Semantic search can also perform well given synonyms, abbreviations, and misspellings, unlike keyword search engines that can only find documents based on lexical matches. ## Background -The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space. - -At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. +The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found.
These entries should have a high semantic similarity with the query. ![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png) - ## Symmetric vs. Asymmetric Semantic Search A **critical distinction** for your setup is *symmetric* vs. *asymmetric semantic search*: -- For **symmetric semantic search** your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be *"How to learn Python online?"* and you want to find an entry like *"How to learn Python on the web?"*. For symmetric tasks, you could potentially flip the query and the entries in your corpus. +- For **symmetric semantic search** your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be *"How to learn Python online?"* and you want to find an entry like *"How to learn Python on the web?"*. For symmetric tasks, you could potentially flip the query and the entries in your corpus. + - Related training example: [Quora Duplicate Questions](../../training/quora_duplicate_questions/README.md). + - Suitable models: [Pre-Trained Sentence Embedding Models](../../../docs/sentence_transformer/pretrained_models#sentence-embedding-models) - For **asymmetric semantic search**, you usually have a **short query** (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like *"What is Python"* and you want to find the paragraph *"Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy ..."*. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense. + - Related training example: [MS MARCO](../../training/ms_marco/README.html) + - Suitable models: [Pre-Trained MS MARCO Models](../../../docs/pretrained-models/msmarco-v3.md) It is critical **that you choose the right model** for your type of task. -Suitable models for **symmetric semantic search**: [Pre-Trained Sentence Embedding Models](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) - - -Suitable models for **asymmetric semantic search**: [Pre-Trained MS MARCO Models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) - - - -## Python - -For small corpora (up to about 1 million entries) we can compute the cosine-similarity between the query and all entries in the corpus. - -In the following example, we define a small corpus with few example sentences and compute the embeddings for the corpus as well as for our query. - -We then use the [util.cos_sim()](../../../docs/usage/semantic_textual_similarity.md) function to compute the cosine similarity between the query and all corpus entries. - -For large corpora, sorting all scores would take too much time. Hence, we use [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html) to only get the top k entries. 
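Condensed to its core, this manual approach looks roughly as follows (a sketch along the lines of [semantic_search.py](semantic_search.py), with a shortened corpus):

```python
import torch
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

query_embedding = embedder.encode("A man is eating pasta.", convert_to_tensor=True)
# Cosine similarity between the query and every corpus entry
similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]

# For large corpora, sorting every score is wasteful; take only the top-k entries
scores, indices = torch.topk(similarity_scores, k=2)
for score, idx in zip(scores, indices):
    print(corpus[idx], f"(Score: {score:.4f})")
```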
+## Manual Implementation +For small corpora (up to about 1 million entries), we can perform semantic search with a manual implementation by computing the embeddings for the corpus as well as for our query, and then calculating the [semantic textual similarity](../../../docs/sentence_transformer/usage/semantic_textual_similarity.rst) using [SentenceTransformer.similarity](../../../docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity). For a simple example, see [semantic_search.py](semantic_search.py): ```eval_rst + +.. sidebar:: Output + + .. code-block:: txt + + Query: A man is eating pasta. + Top 5 most similar sentences in corpus: + A man is eating food. (Score: 0.7035) + A man is eating a piece of bread. (Score: 0.5272) + A man is riding a horse. (Score: 0.1889) + A man is riding a white horse on an enclosed ground. (Score: 0.1047) + A cheetah is running behind its prey. (Score: 0.0980) + + Query: Someone in a gorilla costume is playing a set of drums. + Top 5 most similar sentences in corpus: + A monkey is playing drums. (Score: 0.6433) + A woman is playing violin. (Score: 0.2564) + A man is riding a horse. (Score: 0.1389) + A man is riding a white horse on an enclosed ground. (Score: 0.1191) + A cheetah is running behind its prey. (Score: 0.1080) + + Query: A cheetah chases prey on across a field. + Top 5 most similar sentences in corpus: + A cheetah is running behind its prey. (Score: 0.8253) + A man is eating food. (Score: 0.1399) + A monkey is playing drums. (Score: 0.1292) + A man is riding a white horse on an enclosed ground. (Score: 0.1097) + A man is riding a horse. (Score: 0.0650) + .. literalinclude:: semantic_search.py ``` ## Optimized Implementation -Instead of implementing semantic search by yourself, you can use the *util.semantic_search* function. +Instead of implementing semantic search by yourself, you can use the [util.semantic_search](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) function. The function accepts the following parameters: @@ -52,12 +67,10 @@ The function accepts the following parameters: .. autofunction:: sentence_transformers.util.semantic_search ``` -By default, up to 100 queries are processed in parallel. Further, the corpus is chunked into set of up to 500k entries. You can increase *query_chunk_size* and *corpus_chunk_size*, which leads to increased speed for large corpora, but also increases the memory requirement. +By default, up to 100 queries are processed in parallel. Further, the corpus is chunked into sets of up to 500k entries. You can increase ``query_chunk_size`` and ``corpus_chunk_size``, which leads to increased speed for large corpora, but also increases the memory requirement. ## Speed Optimization -To get the optimal speed for the `util.semantic_search` method, it is advisable to have the `query_embeddings` as well as the `corpus_embeddings` on the same GPU-device. This significantly boost the performance. - -Further, we can normalize the corpus embeddings so that each corpus embeddings is of length 1. In that case, we can use dot-product for computing scores. +To get the optimal speed for the [util.semantic_search](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method, it is advisable to have the `query_embeddings` as well as the `corpus_embeddings` on the same GPU-device. This significantly boosts the performance.
Further, we can normalize the corpus embeddings so that each corpus embedding is of length 1. In that case, we can use dot-product for computing scores. ```python corpus_embeddings = corpus_embeddings.to("cuda") corpus_embeddings = util.normalize_embeddings(corpus_embeddings) @@ -67,9 +80,6 @@ query_embeddings = util.normalize_embeddings(query_embeddings) hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score) ``` - - - ## Elasticsearch [Elasticsearch](https://www.elastic.co/elasticsearch/) has the possibility to [index dense vectors](https://www.elastic.co/what-is/vector-search) and to use them for document scoring. We can easily index embedding vectors, store other data alongside our vectors and, most importantly, efficiently retrieve relevant entries using [approximate nearest neighbor search](https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0) (HNSW, see also below) on the embeddings. For further details, see [semantic_search_quora_elasticsearch.py](semantic_searc ## Approximate Nearest Neighbor -Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (like it is used by *util.semantic_search*). +Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (like it is used by [util.semantic_search](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search)). -In that case, Approximate Nearest Neighbor (ANN) can be helpful. Here, the data is partitioned into smaller fractions of similar embeddings. This index can be searched efficiently and the embeddings with the highest similarity (the nearest neighbors) can be retrieved within milliseconds, even if you have millions of vectors. - -However, the results are not necessarily exact. It is possible that some vectors with high similarity will be missed. That's the reason why it is called approximate nearest neighbor. +In that case, Approximate Nearest Neighbor (ANN) can be helpful. Here, the data is partitioned into smaller fractions of similar embeddings. This index can be searched efficiently and the embeddings with the highest similarity (the nearest neighbors) can be retrieved within milliseconds, even if you have millions of vectors. However, the results are not necessarily exact. It is possible that some vectors with high similarity will be missed. For all ANN methods, there are usually one or more parameters to tune that determine the recall-speed trade-off. If you want the highest speed, you have a high chance of missing hits. If you want high recall, the search speed decreases. Three popular libraries for approximate nearest neighbor are [Annoy](https://github.com/spotify/annoy), [FAISS](https://github.com/facebookresearch/faiss), and [hnswlib](https://github.com/nmslib/hnswlib/).
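For instance, a minimal hnswlib-based search might look as follows (a sketch with made-up data; the example scripts below show complete versions):

```python
import hnswlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["A man is eating food.", "A monkey is playing drums.", "A woman is playing violin."]
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

# Build the HNSW index; ef_construction and M trade index quality against build time
index = hnswlib.Index(space="cosine", dim=corpus_embeddings.shape[1])
index.init_index(max_elements=len(corpus), ef_construction=200, M=64)
index.add_items(corpus_embeddings)
index.set_ef(50)  # query-time recall/speed trade-off (higher = better recall)

query_embedding = model.encode("A man is eating pasta.", normalize_embeddings=True)
corpus_ids, distances = index.knn_query(query_embedding, k=3)
```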
Examples: + - [semantic_search_quora_hnswlib.py](semantic_search_quora_hnswlib.py) - [semantic_search_quora_annoy.py](semantic_search_quora_annoy.py) - [semantic_search_quora_faiss.py](semantic_search_quora_faiss.py) ## Retrieve & Re-Rank -For complex semantic search scenarios, a retrieve & re-rank pipeline is advisable: +For complex semantic search scenarios, a two-stage retrieve & re-rank pipeline is advisable: ![InformationRetrieval](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png) For further details, see [Retrieve & Re-rank](../retrieve_rerank/README.md). ## Examples -In the following we list examples for different use-cases. +We list a handful of common use cases: ### Similar Questions Retrieval -[semantic_search_quora_pytorch.py](semantic_search_quora_pytorch.py) [ [Colab version](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing) ] shows an example based on the [Quora duplicate questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using the *util.semantic_search* method. As model, we use *distilbert-multilingual-nli-stsb-quora-ranking*, which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries have the same length and content as the questions in the corpus. +[semantic_search_quora_pytorch.py](semantic_search_quora_pytorch.py) [ [Colab version](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing) ] shows an example based on the [Quora duplicate questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using the [util.semantic_search](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method. As model, we use [distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking), which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries have the same length and content as the questions in the corpus. ### Similar Publication Retrieval -[semantic_search_publications.py](semantic_search_publications.py) [ [Colab version](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06?usp=sharing) ] shows an example how to find similar scientific publications. As corpus, we use all publications that have been presented at the EMNLP 2016 - 2018 conferences. As search query, we input the title and abstract of more recent publications and find related publications from our copurs. We use the [SPECTER](https://arxiv.org/abs/2004.07180) model. This is a **symmetric search task**, as the paper in the corpus consists of title & abstract and we search for title & abstract. +[semantic_search_publications.py](semantic_search_publications.py) [ [Colab version](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06?usp=sharing) ] shows an example how to find similar scientific publications. 
As corpus, we use all publications that have been presented at the EMNLP 2016 - 2018 conferences. As search query, we input the title and abstract of more recent publications and find related publications from our corpus. We use the [SPECTER](https://huggingface.co/sentence-transformers/allenai-specter) model. This is a **symmetric search task**, as the paper in the corpus consists of title & abstract and we search for title & abstract. ### Question & Answer Retrieval -[semantic_search_wikipedia_qa.py](semantic_search_wikipedia_qa.py) [ [Colab Version](https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing) ]: This example uses a model that was trained on the [Natural Questions dataset](https://ai.google.com/research/NaturalQuestions/). It consists of about 100k real Google search queries, together with an annotated passage from Wikipedia that provides the answer. It is an example of an **asymmetric search task**. As corpus, we use the smaller [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) so that it fits easily into memory. +[semantic_search_wikipedia_qa.py](semantic_search_wikipedia_qa.py) [ [Colab Version](https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing) ]: This example uses a model that was trained on the [Natural Questions dataset](https://huggingface.co/datasets/sentence-transformers/natural-questions). It consists of about 100k real Google search queries, together with an annotated passage from Wikipedia that provides the answer. It is an example of an **asymmetric search task**. As corpus, we use the smaller [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) so that it fits easily into memory. [retrieve_rerank_simple_wikipedia.ipynb](../retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) [ [Colab Version](https://colab.research.google.com/github/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) ]: This script uses the [Retrieve & Re-rank](../retrieve_rerank/README.md) strategy and is an example for an **asymmetric search task**. We split all Wikipedia articles into paragraphs and encode them with a bi-encoder. If a new query / question is entered, it is encoded by the same bi-encoder and the paragraphs with the highest cosine-similarity are retrieved (see [semantic search](../semantic-search/README.md)).
Next, the retrieved candidates are scored by a Cross-Encoder re-ranker and the 5 passages with the highest score from the Cross-Encoder are presented to the user. We use models that were trained on the [MS Marco Passage Reranking](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset, a dataset with about 500k real queries from Bing search. diff --git a/examples/applications/semantic-search/semantic_search.py b/examples/applications/semantic-search/semantic_search.py index 5b0e3ad62..80f9e9986 100644 --- a/examples/applications/semantic-search/semantic_search.py +++ b/examples/applications/semantic-search/semantic_search.py @@ -24,6 +24,7 @@ "A monkey is playing drums.", "A cheetah is running behind its prey.", ] +# Use "convert_to_tensor=True" to keep the tensors on GPU (if available) corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True) # Query sentences: @@ -33,7 +34,6 @@ "A cheetah chases prey on across a field.", ] - # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity top_k = min(5, len(corpus)) for query in queries: @@ -41,13 +41,12 @@ # We use cosine-similarity and torch.topk to find the highest 5 scores similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0] - top_results = torch.topk(similarity_scores, k=top_k) + scores, indices = torch.topk(similarity_scores, k=top_k) - print("\n\n======================\n\n") - print("Query:", query) - print("\nTop 5 most similar sentences in corpus:") + print("\nQuery:", query) + print("Top 5 most similar sentences in corpus:") - for score, idx in zip(top_results[0], top_results[1]): + for score, idx in zip(scores, indices): print(corpus[idx], "(Score: {:.4f})".format(score)) """ diff --git a/examples/domain_adaptation/README.md b/examples/domain_adaptation/README.md index d15cefada..a5deee55c 100644 --- a/examples/domain_adaptation/README.md +++ b/examples/domain_adaptation/README.md @@ -7,13 +7,13 @@ Domain adaptation is still an active research field and there exists no perfect ## Domain Adaptation vs. Unsupervised Learning There exist methods for [unsupervised text embedding learning](../unsupervised_learning/README.md), however, they generally perform rather badly: They are not really able to learn domain specific concepts. -A much better approach is domain adaptation: Here you have an unlabeled corpus from your specific domain together with an existing labeled corpus. You can find many suitable labeled training datasets here: [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) +A much better approach is domain adaptation: Here you have an unlabeled corpus from your specific domain together with an existing labeled corpus. You can find many suitable labeled training datasets here: [Embedding Model Datasets Collection](https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552) ## Adaptive Pre-Training -When using adaptive pre-training, you first pre-train on your target corpus using e.g. [Masked Language Modeling](../unsupervised_learning/MLM/README.md) or [TSDAE](../unsupervised_learning/TSDAE/README.md) and then you fine-tune on an existing training dataset (see [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data)). +When using adaptive pre-training, you first pre-train on your target corpus using e.g.
[Masked Language Modeling](../unsupervised_learning/MLM/README.md) or [TSDAE](../unsupervised_learning/TSDAE/README.md) and then you fine-tune on an existing training dataset (see [Embedding Model Datasets Collection](https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552)). -![Adaptive Pre-Training](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/adaptive_pre-training.png) +<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/adaptive_pre-training.png" alt="Adaptive Pre-Training" /> In our paper [TSDAE](https://arxiv.org/abs/2104.06979) we evaluated several methods for domain adaptation on 4 domain specific sentence embedding tasks: @@ -44,9 +44,9 @@ A big **disadvantage of adaptive pre-training** is the high computational overhe ## GPL: Generative Pseudo-Labeling -[GPL](https://arxiv.org/abs/2112.07577) overcomes the aforementioned issue: It can be applied on-top of a fine-tuned model. Hence, you can use one of the [pre-trained models](https://www.sbert.net/docs/pretrained_models.html) and adapt it to your specific domain: +[GPL](https://arxiv.org/abs/2112.07577) overcomes the aforementioned issue: It can be applied on top of a fine-tuned model. Hence, you can use one of the [pre-trained models](../../docs/sentence_transformer/pretrained_models.md) and adapt it to your specific domain: -![GPL_Overview](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_overview.png) +<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_overview.png" alt="GPL_Overview" /> The longer you train, the better your model gets. In our experiments, we were training the models for about 1 day on a V100-GPU. GPL can be combined with adaptive pre-training, which can give another performance boost. @@ -58,15 +58,16 @@ The longer you train, the better your model gets. In our experiments, we were tr GPL works in three phases: -![GPL Architecture](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_architecture.png) +<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_architecture.png" alt="GPL Architecture" /> - **Query Generation**: For a given text from our domain, we first use a T5 model that generates a possible query for the given text. E.g. when your text is *"Python is a high-level general-purpose programming language"*, the model might generate a query like *"What is Python"*. You can find various query generators on our [doc2query-hub](https://huggingface.co/doc2query). - **Negative Mining**: Next, for the generated query *"What is Python"* we mine negative passages from our corpus, i.e. passages that are similar to the query but which a user would not consider relevant. Such a negative passage could be *"Java is a high-level, class-based, object-oriented programming language."*. We do this mining using dense retrieval, i.e. we use one of the existing text embedding models and retrieve relevant paragraphs for the given query. -- **Pseudo Labeling**: It might be that in the negative mining step we retrieve a passage that is actually relevant for the query (like another definition for *"What is Python"*). To overcome this issue, we use a [Cross-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) to score all (query, passage)-pairs. -- **Training**: Once we have the triplets *(generated query, positive passage, mined negative passage)* and the Cross-Encoder scores for *(query, positive)* and *(query, negative)* we can start training the text embedding model using [MarginMSELoss](https://www.sbert.net/docs/package_reference/losses.html#marginmseloss).
+- **Pseudo Labeling**: It might be that in the negative mining step we retrieve a passage that is actually relevant for the query (like another definition for *"What is Python"*). To overcome this issue, we use a [Cross-Encoder](../applications/cross-encoder/README.html) to score all (query, passage)-pairs. +- **Training**: Once we have the triplets *(generated query, positive passage, mined negative passage)* and the Cross-Encoder scores for *(query, positive)* and *(query, negative)* we can start training the text embedding model using [MarginMSELoss](../../docs/package_reference/sentence_transformer/losses.html#marginmseloss). The **pseudo labeling** step is quite important, as it results in increased performance compared to the previous method, QGen, which treated passages just as positive (1) or negative (0). As we see in the following picture, for a generated query (*"what is futures contract"*), the negative mining step retrieves passages that are partly or highly relevant to the generated query. Using MarginMSELoss and the Cross-Encoder, we can identify these passages and teach the text embedding model that these passages are also relevant for the given query. + ![GPL Architecture](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/gpl_negatives.jpg) diff --git a/examples/evaluation/evaluation_inference_speed.py b/examples/evaluation/evaluation_inference_speed.py index 6e16cbd74..a91ec0067 100644 --- a/examples/evaluation/evaluation_inference_speed.py +++ b/examples/evaluation/evaluation_inference_speed.py @@ -22,9 +22,9 @@ # Load a sentence transformer model model = SentenceTransformer(model_name) -max_sentences = 10_000 -dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train") -sentences = list(set(dataset["anchor"] + dataset["positive"]))[:max_sentences] +max_sentences = 100_000 +all_nli_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train") +sentences = list(set(all_nli_dataset["anchor"]))[:max_sentences] print("Model Name:", model_name) print("Number of sentences:", len(sentences)) diff --git a/examples/training/README.md b/examples/training/README.md index fbba76048..cf48333e3 100644 --- a/examples/training/README.md +++ b/examples/training/README.md @@ -4,14 +4,22 @@ This folder contains various examples to fine-tune `SentenceTransformers` for sp To get started, I recommend having a look at the Semantic Textual Similarity ([STS](sts/)) or the Natural Language Inference ([NLI](nli/)) examples. -For the documentation how to train your own models, see [Training Overview](http://www.sbert.net/docs/training/overview.html). +For documentation on how to train your own models, see [Training Overview](http://www.sbert.net/docs/sentence_transformer/training_overview.html). ## Training Examples +- [adaptive_layer](adaptive_layer/) - Examples to train models whose layers can be removed on the fly for faster inference. - [avg_word_embeddings](avg_word_embeddings/) - This folder contains examples to train models based on classical word embeddings like GloVe. These models are extremely fast, but are more inaccurate than transformer-based models. +- [clip](clip/) - Examples to train CLIP image models. +- [cross-encoder](cross-encoder/) - Examples to train [CrossEncoder](http://www.sbert.net/docs/cross_encoder/usage/usage.html) models. +- [data_augmentation](data_augmentation/) - Examples of how to apply data augmentation strategies to improve embedding models.
- [distillation](distillation/) - Examples to make models smaller, faster and lighter. +- [hpo](hpo/) - Examples with hyperparameter search to find the best hyperparameters for your task. +- [matryoshka](matryoshka/) - Examples of training embedding models whose embeddings can be truncated (allowing for faster search) with minimal performance loss. +- [ms_marco](ms_marco/) - Example training scripts for training on the MS MARCO information retrieval dataset. - [multilingual](multilingual/) - Existing monolingual models can be extended to various languages ([paper](https://arxiv.org/abs/2004.09813)). This folder contains a step-by-step guide to extend existing models to new languages. - [nli](nli/) - Natural Language Inference (NLI) data can be quite helpful to pre-train and fine-tune models to create meaningful sentence embeddings. +- [other](other/) - Various tiny examples for showcasing one specific training case. +- [paraphrases](paraphrases/) - Examples for training models capable of recognizing paraphrases, i.e. understanding when texts have the same meaning despite using different words. - [quora_duplicate_questions](quora_duplicate_questions/) - Quora Duplicate Questions is a large corpus with duplicate questions from the Quora community. The folder contains examples of how to train models for duplicate question mining and for semantic search. - [sts](sts/) - The most basic method to train models is using Semantic Textual Similarity (STS) data. Here, we have a sentence pair and a score indicating the semantic similarity. -- [other](other/) - Various tiny examples for show-casing one specific training case. diff --git a/examples/training/adaptive_layer/adaptive_layer_sts.py b/examples/training/adaptive_layer/adaptive_layer_sts.py index b95ac63b2..c2ebdd6f4 100644 --- a/examples/training/adaptive_layer/adaptive_layer_sts.py +++ b/examples/training/adaptive_layer/adaptive_layer_sts.py @@ -43,8 +43,8 @@ logging.info(train_dataset) # 3. Define our training loss -# CoSENTLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one -# similarity score column (between 0 and 1) +# CoSENTLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) needs two text +# columns and one similarity score column (between 0 and 1) inner_train_loss = losses.CoSENTLoss(model=model) train_loss = losses.AdaptiveLayerLoss(model, inner_train_loss) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py index a6bb8fe79..a89cea13a 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py @@ -49,7 +49,7 @@ model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dan1, dan2]) # 3.
Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py index 4df3d8567..c2453ee07 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py @@ -44,7 +44,7 @@ model = SentenceTransformer(modules=[word_embedding_model, lstm, pooling_model]) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py index 951e006a1..3be966717 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py @@ -74,7 +74,7 @@ model = SentenceTransformer(modules=[bow, dan1, dan2]) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py index 07c743ac3..db8c4ee50 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py @@ -51,7 +51,7 @@ model = SentenceTransformer(modules=[word_embedding_model, cnn, pooling_model]) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py index f11657e8c..183894b07 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py @@ -77,7 +77,7 @@ model = SentenceTransformer(modules=[word_embedding_model, word_weights, pooling_model, dan1, dan2]) # 3. 
Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and # one similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) diff --git a/examples/training/datasets/README.md b/examples/training/datasets/README.md deleted file mode 100644 index fe39c421a..000000000 --- a/examples/training/datasets/README.md +++ /dev/null @@ -1,59 +0,0 @@ -# Training Datasets - -Most dataset configurations will take one of four forms: - -- **Case 1**: The example is a pair of sentences and a label indicating how similar they are. The label can be either an integer or a float. This case applies to datasets originally prepared for Natural Language Inference (NLI), since they contain pairs of sentences with a label indicating whether they infer each other or not. - **Case Example:** [SNLI](https://huggingface.co/datasets/snli). -- **Case 2**: The example is a pair of positive (similar) sentences **without** a label. For example, pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (`query`, `response`), or pairs of (`source_language`, `target_language`). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences. - **Case Examples:** [Sentence Compression](https://huggingface.co/datasets/embedding-data/sentence-compression), [COCO Captions](https://huggingface.co/datasets/embedding-data/coco_captions_quintets), [Flickr30k captions](https://huggingface.co/datasets/embedding-data/flickr30k_captions_quintets). -- **Case 3**: The example is a sentence with an integer label indicating the class to which it belongs. This data format is easily converted by loss functions into three sentences (triplets) where the first is an "anchor", the second a "positive" of the same class as the anchor, and the third a "negative" of a different class. - **Case Examples:** [TREC](https://huggingface.co/datasets/trec), [Yahoo Answers Topics](https://huggingface.co/datasets/yahoo_answers_topics). -- **Case 4**: The example is a triplet (anchor, positive, negative) without classes or labels for the sentences. - **Case Example:** [Quora Triplets](https://huggingface.co/datasets/embedding-data/QQP_triplets) - -Note that Sentence Transformers models can be trained with human labeling (cases 1 and 3) or with labels automatically deduced from text formatting (cases 2 and 4). - -You can get almost ready-to-train datasets from various sources. One of them is the Hugging Face Hub. - -## Datasets on the Hugging Face Hub - -The [Datasets library](https://huggingface.co/docs/datasets/index) (`pip install datasets`) allows you to load datasets from the Hugging Face Hub with the `load_dataset` function: - -```python -from datasets import load_dataset - -# Indicate the repo id from the Hub -dataset_id = "embedding-data/QQP_triplets" - -dataset = load_dataset(dataset_id) -``` - -For more information on how to manipulate your dataset see [» Datasets Documentation](https://huggingface.co/docs/datasets/access). - -These are popular datasets used to train and fine-tune SentenceTransformers models. 
-
-| | Dataset |
-| - | --------------------------------------------------------------------------------------------------------- |
-| | [altlex pairs](https://huggingface.co/datasets/embedding-data/altlex) |
-| | [sentence compression pairs](https://huggingface.co/datasets/embedding-data/sentence-compression) |
-| | [QQP triplets](https://huggingface.co/datasets/embedding-data/QQP_triplets) |
-| | [PAQ pairs](https://huggingface.co/datasets/embedding-data/PAQ_pairs) |
-| | [SPECTER triplets](https://huggingface.co/datasets/embedding-data/SPECTER) |
-| | [Amazon QA pairs](https://huggingface.co/datasets/embedding-data/Amazon-QA) |
-| | [Simple Wiki pairs](https://huggingface.co/datasets/embedding-data/simple-wiki) |
-| | [Wiki Answers equivalent sentences](https://huggingface.co/datasets/embedding-data/WikiAnswers) |
-| | [COCO Captions quintets](https://huggingface.co/datasets/embedding-data/coco_captions_quintets) |
-| | [Flickr30k Captions quintets](https://huggingface.co/datasets/embedding-data/flickr30k_captions_quintets) |
-| | [MS Marco](https://huggingface.co/datasets/ms_marco) |
-| | [GOOAQ](https://huggingface.co/datasets/gooaq) |
-| | [MS Marco](https://huggingface.co/datasets/ms_marco) |
-| | [Yahoo Answers topics](https://huggingface.co/datasets/yahoo_answers_topics) |
-| | [Search QA](https://huggingface.co/datasets/search_qa) |
-| | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml ) |
-| | [ELI5](https://huggingface.co/datasets/eli5) |
-| | [MultiNLI](https://huggingface.co/datasets/multi_nli) |
-| | [SNLI](https://huggingface.co/datasets/snli) |
-| | [S2ORC](https://huggingface.co/datasets/s2orc) |
-| | [Trivia QA](https://huggingface.co/datasets/trivia_qa) |
-| | [Code Search Net](https://huggingface.co/datasets/code_search_net) |
-| | [Natural Questions](https://huggingface.co/datasets/natural_questions) |
diff --git a/examples/training/distillation/README.md b/examples/training/distillation/README.md
index 37d7d9860..901804ba3 100644
--- a/examples/training/distillation/README.md
+++ b/examples/training/distillation/README.md
@@ -2,38 +2,40 @@
This folder contains examples to make SentenceTransformer models **faster, cheaper and lighter**. These light models achieve 97.5% - 100% performance of the original model on downstream tasks.

## Knowledge Distillation
-See: **[model_distillation.py](model_distillation.py)**
-
-Knowledge distillation describes the process to transfer knowledge from a teacher model to a student model. It can be used to extend sentence embeddings to new languages ([Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813)), but the traditional approach is to have slow (but well performing) teacher model and a fast student model.
+Knowledge distillation describes the process of transferring knowledge from a teacher model to a student model. It can be used to extend sentence embeddings to new languages ([Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813)), but the traditional approach is to have a slow (but well performing) teacher model and a fast student model.

The fast student model imitates the teacher model and thereby achieves high performance.
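+In essence, the student is simply trained to regress the teacher's embeddings. A minimal sketch of this idea with the `SentenceTransformerTrainer` (the model names and dataset below are illustrative placeholders, not the exact setup of the example scripts):
+
+```python
+from datasets import Dataset, load_dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+
+# Placeholder models: any strong teacher and a smaller student with the
+# same embedding dimension (required for MSELoss) will do
+teacher_model = SentenceTransformer("stsb-roberta-base-v2")
+student_model = SentenceTransformer("distilbert-base-uncased")
+
+sentences = load_dataset("sentence-transformers/all-nli", "pair", split="train[:10000]")["anchor"]
+
+# MSELoss expects text columns plus a "label" column holding the target (teacher) embeddings
+train_dataset = Dataset.from_dict({
+    "sentence": sentences,
+    "label": teacher_model.encode(sentences).tolist(),
+})
+
+trainer = SentenceTransformerTrainer(
+    model=student_model,
+    train_dataset=train_dataset,
+    loss=losses.MSELoss(model=student_model),
+)
+trainer.train()
+```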
-![Knowledge Distillation](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/monolingual-distillation.png)
-
+<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/monolingual-distillation.png" alt="Knowledge Distillation" />

-**[model_distillation.py](model_distillation.py)** implements two options for creating the student model:
-1) Use a light transformer model like TinyBERT or BERT-Small to imitate the teacher.
-2) We take the teacher model and keep only certain layers, for example, only 4 layers.
+We implement two options for creating the student model:
+1) [model_distillation.py](model_distillation.py): Use a light transformer model like TinyBERT or BERT-Small to imitate the bigger teacher.
+2) [model_distillation_layer_reduction.py](model_distillation_layer_reduction.py): We take the teacher model and keep only certain layers, for example, only 4 layers.

-Option 2) works usually better, as we keep most of the weights from the teacher. In Option 1, we have to tune all
-weights in the student from scratch.
+Option 2) usually works better, as we keep most of the weights from the teacher. In Option 1, we have to tune all weights of the student from scratch.

## Speed - Performance Trade-Off
-Smaller models are faster, but show a (slightly) worse performance when evaluated on down stream tasks. To get an impression of this trade-off, we show some numbers of the *stsb-roberta-base* model with different number of layers:
+Smaller models are faster, but show (slightly) worse performance when evaluated on downstream tasks. To get an impression of this trade-off, we show some numbers for the [stsb-roberta-base](https://huggingface.co/sentence-transformers/stsb-roberta-base) model with different numbers of layers:

| Layers | STSbenchmark Performance | Performance Decrease | Speed (Sent. / Sec. on V100-GPU) |
| ---- |:----:|:----:|:----:|
| teacher: 12 | 85.44 | - | 2300 |
-| 8 | 85.54 | +0.1% | 3200 |
-| 6 | 85.23 | -0.2% | 4000 |
-| 4 | 84.92 | -0.6% | 5300 |
-| 3 | 84.39 | -1.2% |6500 |
-| 2 | 83.32 | -2.5% | 7700 |
-| 1 | 80.86 | -5.4%| 9200 |
+| 8 | 85.54 | +0.1% | 3200 (~1.4x) |
+| 6 | 85.23 | -0.2% | 4000 (~1.7x) |
+| 4 | 84.92 | -0.6% | 5300 (~2.3x) |
+| 3 | 84.39 | -1.2% | 6500 (~2.8x) |
+| 2 | 83.32 | -2.5% | 7700 (~3.3x) |
+| 1 | 80.86 | -5.4% | 9200 (~4.0x) |

## Dimensionality Reduction
-By default, the pretrained models output embeddings with size 768 (base-models) or with size 1024 (large-models). However, when you store Millions of embeddings, this can require quite a lot of memory / storage.
+
+```eval_rst
+.. warning::
+    Since writing this, `Embedding Quantization <../../applications/embedding-quantization/README.html>`_ has been introduced as the go-to approach for shrinking embedding sizes. Following `Thakur et al. `_, we recommend that approach over PCA.
+```
+
+By default, the pretrained models output embeddings with size 768 (base-models) or with size 1024 (large-models). However, when you store millions of embeddings, this can require quite a lot of memory / storage.

**[dimensionality_reduction.py](dimensionality_reduction.py)** contains a simple example of how to reduce the embedding dimension to any size using Principal Component Analysis (PCA). In that example, we reduce 768 dimensions to 128, reducing the storage requirement by a factor of 6. The performance only slightly drops from 85.44 to 84.96 on the STS benchmark dataset.
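For reference, the core of that example looks roughly like this (a hedged sketch: the model name and sentence source are placeholders, not the exact script):

```python
import torch
from datasets import load_dataset
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer("all-mpnet-base-v2")  # placeholder 768-dimensional model
train_sentences = load_dataset("sentence-transformers/stsb", split="train")["sentence1"]

# Fit PCA on a representative sample of embeddings
pca = PCA(n_components=128)
pca.fit(model.encode(train_sentences))

# Fold the learned projection into the model as a final Dense module
dense = models.Dense(
    in_features=model.get_sentence_embedding_dimension(),
    out_features=128,
    bias=False,
    activation_function=torch.nn.Identity(),
)
dense.linear.weight = torch.nn.Parameter(torch.tensor(pca.components_, dtype=torch.float32))
model.add_module("dense", dense)  # model.encode(...) now yields 128-dimensional embeddings
```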
@@ -47,3 +49,8 @@ A [quantized model](https://pytorch.org/docs/stable/quantization.html) executes
For models that are run on **CPUs**, this can yield 40% smaller models and a faster inference time: Depending on the CPU, speedups are between 15% and 400%.

Model quantization is (as of now) not supported for GPUs by PyTorch.

For an example, see [model_quantization.py](model_quantization.py)
+
+```eval_rst
+.. note::
+    The quantization support of Sentence Transformers is still being improved.
+```
\ No newline at end of file
diff --git a/examples/training/hpo/README.rst b/examples/training/hpo/README.rst
new file mode 100644
index 000000000..8e5a583bd
--- /dev/null
+++ b/examples/training/hpo/README.rst
@@ -0,0 +1,217 @@
+
+Hyperparameter Optimization
+===========================
+
+The :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` supports hyperparameter optimization using ``transformers``, which in turn supports four hyperparameter search backends: `optuna `_, `sigopt `_, `raytune `_, and `wandb `_. You should install your backend of choice before using it::
+
+    pip install optuna/sigopt/wandb/ray[tune]
+
+On this page, we'll show you how to use the hyperparameter optimization feature with the ``optuna`` backend. The other backends work similarly, but you should refer to their respective documentation or the `transformers HPO documentation `_ for more information.
+
+HPO Components
+--------------
+
+The hyperparameter optimization process consists of the following components:
+
+.. raw:: html
+
+
+
+Hyperparameter Search Space
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The hyperparameter search space is defined by a function that returns a dictionary of hyperparameters and their respective search spaces. Here's an example of an ``optuna`` search space function that defines the hyperparameters for a ``SentenceTransformer`` model::
+
+    def hpo_search_space(trial):
+        return {
+            "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2),
+            "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128),
+            "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3),
+            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
+        }
+
+Model Initialization
+~~~~~~~~~~~~~~~~~~~~
+
+The model initialization function is a function that takes the hyperparameters of the current "trial" as input and returns a ``SentenceTransformer`` model. Generally, this function is quite simple. Here's an example of a model initialization function::
+
+    def hpo_model_init(trial):
+        return SentenceTransformer("distilbert-base-uncased")
+
+Loss Initialization
+~~~~~~~~~~~~~~~~~~~
+
+The loss initialization function is a function that takes the model initialized for the current trial and returns a loss function. Here's an example of a loss initialization function::
+
+    def hpo_loss_init(model):
+        return losses.CosineSimilarityLoss(model)
+
+Compute Objective
+~~~~~~~~~~~~~~~~~
+
+The compute objective function is a function that takes the evaluation ``metrics`` and returns the float value to be minimized or maximized. Here's an example of a compute objective function::
+
+    def hpo_compute_objective(metrics):
+        return metrics["eval_sts-dev_spearman_cosine"]
+
+.. note::
+
+    The dictionary keys of ``metrics`` are all prepended with ``eval_``. Additionally, if you're interested in maximizing an evaluator metric, note that the key also contains the evaluator's ``name``. So, to optimize on ``spearman_cosine`` from an :class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator` which was initialized with ``name="sts-dev"``, you would use the key ``eval_sts-dev_spearman_cosine`` in your ``hpo_compute_objective``.
+
+    Another common option is to use ``eval_loss``.
+
+Putting It All Together
+------------------------
+
+You can perform HPO on any regular training loop, with the only difference being that you don't call :meth:`SentenceTransformerTrainer.train <sentence_transformers.trainer.SentenceTransformerTrainer.train>`, but :meth:`SentenceTransformerTrainer.hyperparameter_search <sentence_transformers.trainer.SentenceTransformerTrainer.hyperparameter_search>` instead. Here's an example of how to put it all together:
+
+.. sidebar:: Documentation
+
+    #. `sentence-transformers/all-nli <https://huggingface.co/datasets/sentence-transformers/all-nli>`_
+    #. :class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator`
+    #. `Hyperparameter Search Space <#hyperparameter-search-space>`_
+    #. `Model Initialization <#model-initialization>`_
+    #. `Loss Initialization <#loss-initialization>`_
+    #. `Compute Objective <#compute-objective>`_
+    #. :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`
+    #. :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`
+    #.
:meth:`~sentence_transformers.trainer.SentenceTransformerTrainer.hyperparameter_search` + +:: + + from sentence_transformers import losses + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments + from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction + from sentence_transformers.training_args import BatchSamplers + from datasets import load_dataset + + # 1. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli, only 10k train and 1k dev + train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]") + eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev[:1000]") + + # 2. Create an evaluator to perform useful HPO + stsb_eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") + dev_evaluator = EmbeddingSimilarityEvaluator( + sentences1=stsb_eval_dataset["sentence1"], + sentences2=stsb_eval_dataset["sentence2"], + scores=stsb_eval_dataset["score"], + main_similarity=SimilarityFunction.COSINE, + name="sts-dev", + ) + + # 3. Define the Hyperparameter Search Space + def hpo_search_space(trial): + return { + "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2), + "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128), + "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3), + "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True), + } + + # 4. Define the Model Initialization + def hpo_model_init(trial): + return SentenceTransformer("distilbert-base-uncased") + + # 5. Define the Loss Initialization + def hpo_loss_init(model): + return losses.MultipleNegativesRankingLoss(model) + + # 6. Define the Objective Function + def hpo_compute_objective(metrics): + """ + Valid keys are: 'eval_loss', 'eval_sts-dev_pearson_cosine', 'eval_sts-dev_spearman_cosine', + 'eval_sts-dev_pearson_manhattan', 'eval_sts-dev_spearman_manhattan', 'eval_sts-dev_pearson_euclidean', + 'eval_sts-dev_spearman_euclidean', 'eval_sts-dev_pearson_dot', 'eval_sts-dev_spearman_dot', + 'eval_sts-dev_pearson_max', 'eval_sts-dev_spearman_max', 'eval_runtime', 'eval_samples_per_second', + 'eval_steps_per_second', 'epoch' + + due to the evaluator that we're using. + """ + return metrics["eval_sts-dev_spearman_cosine"] + + # 7. Define the training arguments + args = SentenceTransformerTrainingArguments( + # Required parameter: + output_dir="checkpoints", + # Optional training parameters: + # max_steps=10000, # We might want to limit the number of steps for HPO + fp16=True, # Set to False if you get an error that your GPU can't run on FP16 + bf16=False, # Set to True if you have a GPU that supports BF16 + batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch + # Optional tracking/debugging parameters: + eval_strategy="no", # We don't need to evaluate/save during HPO + save_strategy="no", + logging_steps=10, + run_name="hpo", # Will be used in W&B if `wandb` is installed + ) + + # 8. Create the trainer with model_init rather than model + trainer = SentenceTransformerTrainer( + model=None, + args=args, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + evaluator=dev_evaluator, + model_init=hpo_model_init, + loss=hpo_loss_init, + ) + + # 9. 
Perform the HPO + best_trial = trainer.hyperparameter_search( + hp_space=hpo_search_space, + compute_objective=hpo_compute_objective, + n_trials=20, + direction="maximize", + backend="optuna", + ) + print(best_trial) + +:: + + [I 2024-05-17 15:10:47,844] Trial 0 finished with value: 0.7889856589698055 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 123, 'warmup_ratio': 0.07380948785410107, 'learning_rate': 2.686331417509812e-06}. Best is trial 0 with value: 0.7889856589698055. + [I 2024-05-17 15:12:13,283] Trial 1 finished with value: 0.7927780672090986 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 69, 'warmup_ratio': 0.2927897848007451, 'learning_rate': 5.885372118095137e-06}. Best is trial 1 with value: 0.7927780672090986. + [I 2024-05-17 15:12:43,896] Trial 2 finished with value: 0.7684829743509601 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 114, 'warmup_ratio': 0.0739429232666916, 'learning_rate': 7.344415188959276e-05}. Best is trial 1 with value: 0.7927780672090986. + [I 2024-05-17 15:14:49,730] Trial 3 finished with value: 0.7873032743147989 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 43, 'warmup_ratio': 0.15184370143796674, 'learning_rate': 9.703232080395476e-06}. Best is trial 1 with value: 0.7927780672090986. + [I 2024-05-17 15:15:39,597] Trial 4 finished with value: 0.7759251781929949 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 127, 'warmup_ratio': 0.263946220093495, 'learning_rate': 1.231454337152625e-06}. Best is trial 1 with value: 0.7927780672090986. + [I 2024-05-17 15:17:02,191] Trial 5 finished with value: 0.7964580509886684 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 34, 'warmup_ratio': 0.2276865359631089, 'learning_rate': 7.889007438884571e-06}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:18:55,559] Trial 6 finished with value: 0.7901878917859169 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 48, 'warmup_ratio': 0.23228838664572948, 'learning_rate': 2.883013292682523e-06}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:20:27,027] Trial 7 finished with value: 0.7935671067660925 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 62, 'warmup_ratio': 0.22061123927198237, 'learning_rate': 2.95413457610349e-06}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:22:23,147] Trial 8 finished with value: 0.7848123114933252 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 45, 'warmup_ratio': 0.23071701022961139, 'learning_rate': 9.793681667449783e-06}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:22:52,826] Trial 9 finished with value: 0.7909708416168918 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 121, 'warmup_ratio': 0.22440506724181647, 'learning_rate': 4.0744671365843346e-05}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:23:30,395] Trial 10 finished with value: 0.7928991732385567 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 89, 'warmup_ratio': 0.14607293301068847, 'learning_rate': 2.5557492055039498e-05}. Best is trial 5 with value: 0.7964580509886684. + [I 2024-05-17 15:24:18,024] Trial 11 finished with value: 0.7991870087507459 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 66, 'warmup_ratio': 0.16886154348739527, 'learning_rate': 3.705926066938032e-06}. 
Best is trial 11 with value: 0.7991870087507459.
+    [I 2024-05-17 15:25:44,198] Trial 12 finished with value: 0.7923304174306207 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 33, 'warmup_ratio': 0.15953772535423974, 'learning_rate': 1.8076298025704224e-05}. Best is trial 11 with value: 0.7991870087507459.
+    [I 2024-05-17 15:26:20,739] Trial 13 finished with value: 0.8020260244040395 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 90, 'warmup_ratio': 0.18105202625281253, 'learning_rate': 5.513908793512551e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:26:57,783] Trial 14 finished with value: 0.7571110256860063 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 95, 'warmup_ratio': 0.00122391151793258, 'learning_rate': 1.0432486633629492e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:27:32,581] Trial 15 finished with value: 0.8009013936824717 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 101, 'warmup_ratio': 0.1761274711346081, 'learning_rate': 4.5918293464430035e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:28:05,850] Trial 16 finished with value: 0.8017668050806169 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 103, 'warmup_ratio': 0.10766501647726355, 'learning_rate': 5.0309795522333e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:28:37,393] Trial 17 finished with value: 0.7769412380909586 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 108, 'warmup_ratio': 0.1036610178950246, 'learning_rate': 1.7747598626081271e-06}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:29:19,340] Trial 18 finished with value: 0.8011921300048339 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 80, 'warmup_ratio': 0.117014165550441, 'learning_rate': 1.238558867958792e-05}. Best is trial 13 with value: 0.8020260244040395.
+    [I 2024-05-17 15:29:59,508] Trial 19 finished with value: 0.8027501854704168 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 84, 'warmup_ratio': 0.014601112207929548, 'learning_rate': 5.627813947769514e-06}. Best is trial 19 with value: 0.8027501854704168.
+
+    BestRun(run_id='19', objective=0.8027501854704168, hyperparameters={'num_train_epochs': 1, 'per_device_train_batch_size': 84, 'warmup_ratio': 0.014601112207929548, 'learning_rate': 5.627813947769514e-06}, run_summary=None)
+
+As you can see, the strongest hyperparameters reached **0.802** Spearman correlation on the STS (dev) benchmark. For context, training with the default training arguments (``per_device_train_batch_size=8``, ``learning_rate=5e-5``) results in **0.736**, and hyperparameters chosen based on experience (``per_device_train_batch_size=64``, ``learning_rate=2e-5``) result in **0.783** Spearman correlation. Consequently, HPO proved quite effective here in improving the model performance.
+
+Example Scripts
+---------------
+
+- `hpo_nli.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/hpo/hpo_nli.py>`_ - An example script that performs hyperparameter optimization on the AllNLI dataset.
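+
+Retraining with the Best Hyperparameters
+----------------------------------------
+
+After the search finishes, you will often want to train one final model with the winning configuration. This is not part of the example script, but a minimal sketch of that follow-up, reusing the objects from the full example above, could look like this::
+
+    # Copy the best hyperparameters into the training arguments
+    for hyperparameter, value in best_trial.hyperparameters.items():
+        setattr(args, hyperparameter, value)
+
+    model = hpo_model_init(None)
+    trainer = SentenceTransformerTrainer(
+        model=model,
+        args=args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        evaluator=dev_evaluator,
+        loss=hpo_loss_init(model),
+    )
+    trainer.train()
+    print(dev_evaluator(trainer.model))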
diff --git a/examples/training/hpo/hpo_nli.py b/examples/training/hpo/hpo_nli.py new file mode 100644 index 000000000..758224ec9 --- /dev/null +++ b/examples/training/hpo/hpo_nli.py @@ -0,0 +1,95 @@ +from sentence_transformers import losses +from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction +from sentence_transformers.training_args import BatchSamplers +from datasets import load_dataset + +# 1. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli, 10k samples +train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]") +eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev[:1000]") + +# 2. Create an evaluator to perform useful HPO +stsb_eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") +dev_evaluator = EmbeddingSimilarityEvaluator( + sentences1=stsb_eval_dataset["sentence1"], + sentences2=stsb_eval_dataset["sentence2"], + scores=stsb_eval_dataset["score"], + main_similarity=SimilarityFunction.COSINE, + name="sts-dev", +) + + +# 3. Define the Hyperparameter Search Space +def hpo_search_space(trial): + return { + "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2), + "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128), + "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3), + "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True), + } + + +# 4. Define the Model Initialization +def hpo_model_init(trial): + return SentenceTransformer("distilbert-base-uncased") + + +# 5. Define the Loss Initialization +def hpo_loss_init(model): + return losses.MultipleNegativesRankingLoss(model) + + +# 6. Define the Objective Function +def hpo_compute_objective(metrics): + """ + Valid keys are: 'eval_loss', 'eval_sts-dev_pearson_cosine', 'eval_sts-dev_spearman_cosine', + 'eval_sts-dev_pearson_manhattan', 'eval_sts-dev_spearman_manhattan', 'eval_sts-dev_pearson_euclidean', + 'eval_sts-dev_spearman_euclidean', 'eval_sts-dev_pearson_dot', 'eval_sts-dev_spearman_dot', + 'eval_sts-dev_pearson_max', 'eval_sts-dev_spearman_max', 'eval_runtime', 'eval_samples_per_second', + 'eval_steps_per_second', 'epoch' + + due to the evaluator that we're using. + """ + return metrics["eval_sts-dev_spearman_cosine"] + + +# 7. Define the training arguments +args = SentenceTransformerTrainingArguments( + # Required parameter: + output_dir="checkpoints", + # Optional training parameters: + # max_steps=10000, # We might want to limit the number of steps for HPO + fp16=True, # Set to False if you get an error that your GPU can't run on FP16 + bf16=False, # Set to True if you have a GPU that supports BF16 + batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch + # Optional tracking/debugging parameters: + eval_strategy="no", # We don't need to evaluate/save during HPO + save_strategy="no", + logging_steps=10, + run_name="hpo", # Will be used in W&B if `wandb` is installed +) + +# 8. Create the trainer with model_init rather than model +trainer = SentenceTransformerTrainer( + model=None, + args=args, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + evaluator=dev_evaluator, + model_init=hpo_model_init, + loss=hpo_loss_init, +) + +# 9. 
Perform the HPO
+best_trial = trainer.hyperparameter_search(
+    hp_space=hpo_search_space,
+    compute_objective=hpo_compute_objective,
+    n_trials=20,
+    direction="maximize",
+    backend="optuna",
+)
+print(best_trial)
+
+# Alternatively, to just train normally:
+# trainer.train()
+# print(dev_evaluator(trainer.model))
diff --git a/examples/training/ms_marco/README.md b/examples/training/ms_marco/README.md
index 31c3ae433..1ceac4c3d 100644
--- a/examples/training/ms_marco/README.md
+++ b/examples/training/ms_marco/README.md
@@ -1,31 +1,28 @@
# MS MARCO
[MS MARCO Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset to train models for information retrieval. It consists of about 500k real search queries from the Bing search engine with the relevant text passage that answers the query.

-This pages shows how to **train** models (Cross-Encoder and Sentence Embedding Models) on this dataset so that it can be used for searching text passages given queries (key words, phrases or questions).
+This page shows how to **train** Sentence Transformer models on this dataset so that they can be used for searching text passages given queries (key words, phrases or questions).

If you are interested in how to use these models, see [Application - Retrieve & Re-Rank](../../applications/retrieve_rerank/README.md).

-There are **pre-trained models** available, which you can directly use without the need of training your own models. For more information, see: [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) | [Pretrained Cross-Encoders](https://www.sbert.net/docs/pretrained_cross-encoders.html)
-
-
+There are **pre-trained models** available, which you can directly use without the need to train your own models. For more information, see: [Pretrained Models > MSMARCO Passage Models](../../../docs/sentence_transformer/pretrained_models.html#msmarco-passage-models).

## Bi-Encoder
-Cross-Encoder are only suitable for reranking a small set of passages. For retrieval of suitable documents from a large collection, we have to use a bi-encoder. The documents are independently encoded into fixed-sized embeddings. A query is embedded into the same vector space. Relevant documents can then be found by using dot-product.
+For retrieval of suitable documents from a large collection, we have to use a Sentence Transformer (a.k.a. bi-encoder) model. The documents are independently encoded into fixed-sized embeddings. A query is embedded into the same vector space. Relevant documents can then be found by using cosine similarity or dot-product.

![BiEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/BiEncoder.png)
-
-There are two strategies to **train an bi-encoder** on the MS MARCO dataset:
+This page describes two strategies to **train a bi-encoder** on the MS MARCO dataset:

### MultipleNegativesRankingLoss
- **Training code: [train_bi-encoder_mnrl.py](train_bi-encoder_mnrl.py)**
+**Training code: [train_bi-encoder_mnrl.py](train_bi-encoder_mnrl.py)**

-When we use [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss), we provide triplets: ``(query, positive_passage, negative_passage)`` where `positive_passage` is the relevant passage to the query and `negative_passage` is a non-relevant passage to the query.
-
-We compute the embeddings for all queries, positive passages, and negative passages in the corpus and then optimize the following objective: We want to have the `(query, positive_passage)` pair to be close in the vector space, while `(query, negative_passage)` should be distant in vector space.
+```eval_rst
+When we use :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, we provide triplets: ``(query, positive_passage, negative_passage)`` where ``positive_passage`` is the relevant passage to the query and ``negative_passage`` is a non-relevant passage to the query. We compute the embeddings for all queries, positive passages, and negative passages in the corpus and then optimize the following objective: The ``(query, positive_passage)`` pair must be close in the vector space, while ``(query, negative_passage)`` should be distant in vector space. To further improve the training, we use **in-batch negatives**:
+```

![MultipleNegativesRankingLoss](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/MultipleNegativeRankingLoss.png)

@@ -33,36 +30,38 @@ We embed all `queries`, `positive_passages`, and `negative_passages` into the ve
One way to **improve training** is to choose really good negatives, also known as **hard negatives**: The negative should look really similar to the positive passage, but it should not be relevant to the query.

-We find these hard negatives in the following way: We use existing retrieval systems (e.g. lexical search and other bi-encoder retrieval systems), and for each query we find the most relevant passages. We then use a powerful [Cross-Encoder](../../applications/cross-encoder/README.md) to score the found `(query, passage)` pairs. We provide scores for 160 million such pairs in our [msmarco-hard-negatives dataset](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives).
+We find these hard negatives in the following way: We use existing retrieval systems (e.g. lexical search and other bi-encoder retrieval systems), and for each query we find the most relevant passages. We then use a powerful [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) [Cross-Encoder](../../applications/cross-encoder/README.md) to score the found `(query, passage)` pairs. We provide scores for 160 million such pairs in our [MS MARCO Mined Triplet dataset collection](https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets-6644d6f1ff58c5103fe65f23).
+
+```eval_rst
+For :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, we must ensure that, in the triplet ``(query, positive_passage, negative_passage)``, the ``negative_passage`` is indeed not relevant for the query. The MS MARCO dataset is sadly **highly redundant**: even though there is on average only one passage marked as relevant for a query, it actually contains many passages that humans would consider relevant. We must ensure that these passages are **not passed as negatives**. We do this by enforcing a certain threshold on the CrossEncoder score difference between the relevant passage and the mined hard negatives. By default, we set a threshold of 3: If the ``(query, positive_passage)`` pair gets a score of 9 from the CrossEncoder, then we will only consider negatives with a score below 6 from the CrossEncoder. This threshold ensures that we actually use negatives in our triplets.
+``` -For MultipleNegativesRankingLoss, we must ensure that in the triplet `(query, positive_passage, negative_passage)` that the `negative_passage` is actually not relevant for the query. The MS MARCO dataset is sadly **highly redundant**, and even though that there is on average only one passage marked as relevant for a query, it actually contains many passages that humans would consider as relevant. We must ensure that these passages are **not passed as negatives**: We do this by ensuring a certain threshold in the CrossEncoder scores between the relevant passages and the mined hard negative. By default, we set a threshold of 3: If the `(query, positive_passage)` gets a score of 9 from the CrossEncoder, than we will only consider negatives with a score below 6 from the CrossEncoder. This threshold ensures that we actually use negatives in our triplets. +You can find this data by traversing to any of the datasets in the [MS MARCO Mined Triplet dataset collection](https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets-6644d6f1ff58c5103fe65f23) and using the ``triplet-hard`` subset. Across all datasets, this refers to 175.7 million triplets. The original data can be found [here](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives). Load some of it using: +```python +from datasets import load_dataset +train_dataset = load_dataset("sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1", "triplet-hard", split="train") +# Dataset({ +# features: ['query', 'positive', 'negative'], +# num_rows: 11662655 +# }) +print(train_dataset[0]) +# {'query': 'what are the liberal arts?', 'positive': 'liberal arts. 1. the academic course of instruction at a college intended to provide general knowledge and comprising the arts, humanities, natural sciences, and social sciences, as opposed to professional or technical subjects.', 'negative': "Rather than preparing students for a specific career, liberal arts programs focus on cultural literacy and hone communication and analytical skills. They often cover various disciplines, ranging from the humanities to social sciences. 1 Program Levels in Liberal Arts: Associate degree, Bachelor's degree, Master's degree."} +``` ### MarginMSE **Training code: [train_bi-encoder_margin-mse.py](train_bi-encoder_margin-mse.py)** -[MarginMSELoss](https://www.sbert.net/docs/package_reference/losses.html#marginmseloss) is based on the paper of [Hofstätter et al](https://arxiv.org/abs/2010.02666). As for MultipleNegativesRankingLoss, we have triplets: `(query, passage1, passage2)`. In contrast to MultipleNegativesRankingLoss, `passage1` and `passage2` do not have to be strictly positive/negative, both can be relevant or not relevant for a given query. - -We then compute the [Cross-Encoder](../../applications/cross-encoder/README.md) score for `(query, passage1)` and `(query, passage2)`. We provide scores for 160 million such pairs in our [msmarco-hard-negatives dataset](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives). We then compute the distance: `CE_distance = CEScore(query, passage1) - CEScore(query, passage2)` - -For our bi-encoder training, we encode `query`, `passage1`, and `passage2` into vector spaces and then measure the dot-product between `(query, passage1)` and `(query, passage2)`. 
Again, we measure the distance: `BE_distance = DotScore(query, passage1) - DotScore(query, passage2)` - -We then want to ensure that the distance predicted by the bi-encoder is close to the distance predicted by the cross-encoder, i.e., we optimize the mean-squared error (MSE) between `CE_distance` and `BE_distance`. - -An **advantage** of MarginMSELoss compared to MultipleNegativesRankingLoss is that we **don't require** a `positive` and `negative` passage. As mentioned before, MS MARCO is redundant, and many passages contain the same or similar content. With MarginMSELoss, we can train on two relevant passages without issues: In that case, the `CE_distance` will be smaller and we expect that our bi-encoder also puts both passages closer in the vector space. +```eval_rst +:class:`~sentence_transformers.losses.MarginMSELoss` is based on the paper of `Hofstätter et al `_. Like when training with :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, we can use triplets: ``(query, passage1, passage2)``. However, in contrast to :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, `passage1` and `passage2` do not have to be strictly positive/negative, both can be relevant or not relevant for a given query. -And **disadvantage** of MarginMSELoss is the slower training time: We need way more epochs to get good results. In MultipleNegativesRankingLoss, with a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query only against two passages. +We then compute the `Cross-Encoder <../../applications/cross-encoder/README.html>`_ score for ``(query, passage1)`` and ``(query, passage2)``. We provide scores for 160 million such pairs in our `msmarco-hard-negatives dataset `_. We then compute the distance: ``CE_distance = CEScore(query, passage1) - CEScore(query, passage2)``. -## Cross-Encoder -A [Cross-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) accepts both inputs, the query and the possible relevant passage and returns a score between 0 and 1 how relevant the passage is for the given query. +For our Sentence Transformer (e.g. bi-encoder) training, we encode ``query``, ``passage1``, and ``passage2`` into embeddings and then measure the dot-product between ``(query, passage1)`` and ``(query, passage2)``. Again, we measure the distance: ``BE_distance = DotScore(query, passage1) - DotScore(query, passage2)`` -![CrossEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png) +We then want to ensure that the distance predicted by the bi-encoder is close to the distance predicted by the cross-encoder, i.e., we optimize the mean-squared error (MSE) between ``CE_distance`` and ``BE_distance``. -Cross-Encoders are often used for **re-ranking:** Given a list with possible relevant passages for a query, for example retrieved from BM25 / Elasticsearch, the cross-encoder re-ranks this list so that the most relevant passages are the top of the result list. +An **advantage** of :class:`~sentence_transformers.losses.MarginMSELoss` compared to :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is that we **don't require** a ``positive`` and ``negative`` passage. As mentioned before, MS MARCO is redundant and many passages contain the same or similar content. 
With :class:`~sentence_transformers.losses.MarginMSELoss`, we can train on two relevant passages without issues: In that case, the ``CE_distance`` will be smaller and we expect that our bi-encoder also puts both passages closer in the vector space.

-To **train an cross-encoder** on the MS MARCO dataset, see:
-- **[train_cross-encoder_scratch.py](train_cross-encoder_scratch.py)** trains a cross-encoder from scratch using the provided data from the MS MARCO dataset.
-
-## Cross-Encoder Knowledge Distillation
-![](https://github.com/UKPLab/sentence-transformers/raw/master/docs/img/msmarco-training-ce-distillation.png)
-- **[train_cross-encoder_kd.py](train_cross-encoder_kd.py)** uses a knowledge distillation setup: [Hostätter et al.](https://arxiv.org/abs/2010.02666) trained an ensemble of 3 (large) models for the MS MARCO dataset and predicted the scores for various (query, passage)-pairs (50% positive, 50% negative). In this example, we use knowledge distillation with a small & fast model and learn the logits scores from the teacher ensemble. This yields performances comparable to large models, while being 18 times faster. \ No newline at end of file
+A **disadvantage** of :class:`~sentence_transformers.losses.MarginMSELoss` is its slower training time: we need far more epochs to get good results. With :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` and a batch size of 64, we compare one query against 128 passages. With :class:`~sentence_transformers.losses.MarginMSELoss`, we compare a query only against two passages.
+```
diff --git a/examples/training/ms_marco/cross_encoder_README.md b/examples/training/ms_marco/cross_encoder_README.md
new file mode 100644
index 000000000..d58d5bb02
--- /dev/null
+++ b/examples/training/ms_marco/cross_encoder_README.md
@@ -0,0 +1,22 @@
+# MS MARCO
+[MS MARCO Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset to train models for information retrieval. It consists of about 500k real search queries from the Bing search engine with the relevant text passage that answers the query.
+
+This page shows how to **train** Cross Encoder models on this dataset so that they can be used for searching text passages given queries (key words, phrases or questions).
+
+If you are interested in how to use these models, see [Application - Retrieve & Re-Rank](../../applications/retrieve_rerank/README.md).
+
+There are **pre-trained models** available, which you can directly use without the need to train your own models. For more information, see [Pretrained Cross-Encoders](../../../docs/cross_encoder/pretrained_models.html#ms-marco).
+
+## Cross-Encoder
+A [Cross-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) accepts both inputs, the query and a possibly relevant passage, and returns a score between 0 and 1 indicating how relevant the passage is for the given query (see the short scoring sketch below).
+
+![CrossEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png)
+
+Cross-Encoders are often used for **re-ranking:** Given a list of possibly relevant passages for a query, for example retrieved from BM25 / Elasticsearch, the cross-encoder re-ranks this list so that the most relevant passages are at the top of the result list.
+
+To **train a cross-encoder** on the MS MARCO dataset, see:
+- **[train_cross-encoder_scratch.py](train_cross-encoder_scratch.py)** trains a cross-encoder from scratch using the provided data from the MS MARCO dataset.
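+
+To get a feel for the scoring interface before training, here is a hedged sketch (the model is one of the pretrained MS MARCO cross-encoders; the query and passages are made up):
+
+```python
+from sentence_transformers import CrossEncoder
+
+model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
+scores = model.predict([
+    ("how many people live in berlin", "Berlin has around 3.7 million registered inhabitants."),
+    ("how many people live in berlin", "Berlin is well known for its museums and nightlife."),
+])
+print(scores)  # higher score = more relevant; sort passages by score to re-rank them
+```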
+
+## Cross-Encoder Knowledge Distillation
+![](https://github.com/UKPLab/sentence-transformers/raw/master/docs/img/msmarco-training-ce-distillation.png)
+- **[train_cross-encoder_kd.py](train_cross-encoder_kd.py)** uses a knowledge distillation setup: [Hofstätter et al.](https://arxiv.org/abs/2010.02666) trained an ensemble of 3 (large) models for the MS MARCO dataset and predicted the scores for various (query, passage)-pairs (50% positive, 50% negative). In this example, we use knowledge distillation with a small & fast model and learn the logit scores from the teacher ensemble. This yields performance comparable to large models, while being 18 times faster. \ No newline at end of file
diff --git a/examples/training/multilingual/README.md b/examples/training/multilingual/README.md
index 98a2d706d..700bce9ca 100644
--- a/examples/training/multilingual/README.md
+++ b/examples/training/multilingual/README.md
@@ -1,25 +1,122 @@
-# Multilingual-Models
-The issue with multilingual BERT (mBERT) as well as with XLM-RoBERTa is that those produce rather bad sentence representation out-of-the-box. Further, the vectors spaces between languages are not aligned, i.e., the sentences with the same content in different languages would be mapped to different locations in the vector space.
+# Multilingual Models
+The issue with multilingual BERT (mBERT) as well as with XLM-RoBERTa is that these produce rather poor sentence representations out-of-the-box. Further, the vector spaces between languages are not aligned, i.e., sentences with the same content in different languages would be mapped to different locations in the vector space.

-In my publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) I describe any easy approach to extend sentence embeddings to further languages.
+In my publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) I describe an easy approach to extend sentence embeddings to further languages.

Chien Vu also wrote a nice blog article on this technique: [A complete guide to transfer learning from English to other Languages using Sentence Embeddings BERT Models](https://towardsdatascience.com/a-complete-guide-to-transfer-learning-from-english-to-other-languages-using-sentence-embeddings-8c427f8804a9)

-## Available Pre-trained Models
-For a list of available models, see [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).
+## Extend your own models
+![Multilingual Knowledge Distillation](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/multilingual-distillation.png)
+
+The idea is based on a fixed (monolingual) **teacher model** that produces sentence embeddings with our desired properties in one language (e.g. English). The **student model** is supposed to mimic the teacher model, i.e., the same English sentence should be mapped to the same vector by the teacher and by the student model. Additionally, in order to make the student model work for other languages, we train the student model on parallel (translated) sentences. The translation of each sentence should also be mapped to the same vector as the original sentence.
+
+In the above figure, the student model should map *Hello World* and the German translation *Hallo Welt* to the vector of ``teacher_model('Hello World')``. We achieve this by training the student model using mean squared error (MSE) loss.
+
+In our experiments we initialized the student model with the multilingual [XLM-RoBERTa model](https://huggingface.co/FacebookAI/xlm-roberta-base).
+
+## Training
+For a **fully automatic code example**, see [make_multilingual.py](make_multilingual.py).
+
+This script downloads the parallel sentences corpus, a corpus with transcripts and translations from talks. It then extends a monolingual model to several languages (en, de, es, it, fr, ar, tr). This corpus contains parallel data for more than 100 languages, hence, you can simply change the script and train a multilingual model in your favorite languages.
+
+## Datasets
+
+```eval_rst
+As training data we require parallel sentences, i.e., sentences translated in various languages. In particular, we will use :class:`~datasets.Dataset` instances with ``"english"`` and ``"non_english"`` columns. We have prepared a large collection of such datasets in our `Parallel Sentences dataset collection `_.
+```
+
+The training script will take the `"english"` column and add a `"label"` column containing the embeddings of the English texts. Then, the student model's embeddings for both the `"english"` and `"non_english"` texts will be trained to be similar to this `"label"`. You can load such a training dataset like so:
+
+```python
+from datasets import load_dataset
+
+train_dataset = load_dataset("sentence-transformers/parallel-sentences-talks", "en-de", split="train")
+print(train_dataset[0])
+# {"english": "So I think practicality is one case where it's worth teaching people by hand.", "non_english": "Ich denke, dass es sich aus diesem Grund lohnt, den Leuten das Rechnen von Hand beizubringen."}
+```
+
+## Sources for Training Data
+A great website for a vast number of parallel (translated) datasets is [OPUS](http://opus.nlpl.eu/). There, you find parallel datasets for more than 400 languages. You can use these to create your own parallel sentence datasets, if you wish.
+
+## Evaluation
+
+Training can be evaluated in different ways. For an example of how to use these evaluation methods, see [make_multilingual.py](make_multilingual.py).
+
+### MSE Evaluation
+You can measure the mean squared error (MSE) between the student embeddings and teacher embeddings.
+
+```python
+from datasets import load_dataset
+from sentence_transformers.evaluation import MSEEvaluator
+
+eval_dataset = load_dataset("sentence-transformers/parallel-sentences-talks", "en-fr", split="dev")
+
+dev_mse = MSEEvaluator(
+    source_sentences=eval_dataset["english"],
+    target_sentences=eval_dataset["non_english"],
+    name="en-fr-dev",
+    teacher_model=teacher_model,
+    batch_size=32,
+)
+```
+
+This evaluator computes the teacher embeddings for the `source_sentences`, for example, for English. During training, the student model is used to compute embeddings for the `target_sentences`, for example, for French. The distance between teacher and student embeddings is measured. Lower scores indicate better performance.
+
+### Translation Accuracy
+You can also measure the translation accuracy. As inputs, this evaluator accepts a list of `source_sentences` (e.g. English), and a list of `target_sentences` (e.g. Spanish), such that `target_sentences[i]` is a translation of `source_sentences[i]`.
+
+For each sentence pair, we check whether `target_sentences[i]` has the highest similarity to `source_sentences[i]` out of all target sentences. If this is the case, we have a hit, otherwise an error. This evaluator reports accuracy (higher = better).
+
+```python
+from datasets import load_dataset
+from sentence_transformers.evaluation import TranslationEvaluator
+
+eval_dataset = load_dataset("sentence-transformers/parallel-sentences-talks", "en-fr", split="dev")
+
+dev_trans_acc = TranslationEvaluator(
+    source_sentences=eval_dataset["english"],
+    target_sentences=eval_dataset["non_english"],
+    name="en-fr-dev",
+    batch_size=32,
+)
+```
+
+### Multilingual Semantic Textual Similarity
+You can also measure the semantic textual similarity (STS) between sentence pairs in different languages:
+
+```python
+from datasets import load_dataset
+from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+
+test_dataset = load_dataset("mteb/sts17-crosslingual-sts", "nl-en", split="test")
+
+test_emb_similarity = EmbeddingSimilarityEvaluator(
+    sentences1=test_dataset["sentence1"],
+    sentences2=test_dataset["sentence2"],
+    scores=[score / 5.0 for score in test_dataset["score"]],  # Convert 0-5 scores to 0-1 scores
+    batch_size=32,
+    name="sts17-nl-en-test",
+    show_progress_bar=False,
+)
+```
+
+Here, `sentences1` and `sentences2` are lists of sentences and `scores` is a list of numeric values indicating the semantic similarity between `sentences1[i]` and `sentences2[i]`.
+
+## Available Pre-trained Models
+For a list of available models, see [Pretrained Models](../../../docs/sentence_transformer/pretrained_models.html#multilingual-models).

## Usage
You can use the models in the following way:
+
```python
from sentence_transformers import SentenceTransformer
-embedder = SentenceTransformer("model-name")
-embeddings = embedder.encode(["Hello World", "Hallo Welt", "Hola mundo"])
-print(embeddings)
+model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
+embeddings = model.encode(["Hello World", "Hallo Welt", "Hola mundo", "Bye, Moon!"])
+similarities = model.similarity(embeddings, embeddings)
+# tensor([[1.0000, 0.9429, 0.8880, 0.4558],
+#         [0.9429, 1.0000, 0.9680, 0.5307],
+#         [0.8880, 0.9680, 1.0000, 0.4933],
+#         [0.4558, 0.5307, 0.4933, 1.0000]])
```
-
## Performance
The performance was evaluated on the [Semantic Textual Similarity (STS) 2017 dataset](http://ixa2.si.ehu.es/stswiki/index.php/Main_Page). The task is to predict the semantic similarity (on a scale 0-5) of two given sentences. STS2017 has monolingual test data for English, Arabic, and Spanish, and cross-lingual test data for English-Arabic, -Spanish and -Turkish.
- -This scripts downloads the parallel sentences corpus, a corpus with transcripts and translations from talks. It than extends a monolingual model to several languages (en, de, es, it, fr, ar, tr). This corpus contains parallel data for more than 100 languages, hence, you can simple change the script and train a multilingual model in your favorite languages. - - - -## Data Format - -As training data we require parallel sentences, i.e., sentences translated in various languages. As data format, we use a tab-separated .tsv file. In the first column, you have your source sentence, for example, an English sentence. In the following columns, you have the translations of this source sentence. If you have multiple translations per source sentence, you can put them in the same line or in different lines. -``` -Source_sentence Target_lang1 Target_lang2 Target_lang3 -Source_sentence Target_lang1 Target_lang2 -``` - -An example file could look like this (EN DE ES): -``` -Hello World Hallo Welt Hola Mundo -Sentences are separated with a tab character. Die Sätze sind per Tab getrennt. Las oraciones se separan con un carácter de tabulación. -``` - -The order of the translations are not important, it is only important that the first column contains a sentence in a language that is understood by the teacher model. - -## Loading Training Datasets - -You can load such a training file using the *ParallelSentencesDataset* class: -```python -from sentence_transformers.datasets import ParallelSentencesDataset - -train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model) -train_data.load_data("path/to/tab/separated/train-en-de.tsv") -train_data.load_data("path/to/tab/separated/train-en-es.tsv.gz") -train_data.load_data("path/to/tab/separated/train-en-fr.tsv.gz") - -train_dataloader = DataLoader(train_data, shuffle=True, batch_size=train_batch_size) -train_loss = losses.MSELoss(model=student_model) -``` - -You load a file with the *load_data()* method. You can load multiple files by calling load_data multiple times. You can also regular files or .gz-compressed files. - -Per default, all datasets are weighted equally. In the above example a (source, translation)-pair will be sampled equally from all three datasets. If you pass a `weight` parameter (integer), you can weight some datasets higher or lower. - -## Sources for Training Data -A great website for a vast number of parallel (translated) datasets is [OPUS](http://opus.nlpl.eu/). There, you find parallel datasets for more than 400 languages. - -The [examples/training/multilingual](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/multilingual/) folder contains some scripts that downloads parallel training data and brings it into the right format: -- [get_parallel_data_opus.py](get_parallel_data_opus.py): This script downloads data from the [OPUS](http://opus.nlpl.eu/) website. -- [get_parallel_data_tatoeba.py](get_parallel_data_tatoeba.py): This script downloads data from the [Tatoeba](https://tatoeba.org/) website, a website for language learners with example sentences for more than many languages. -- [get_parallel_data_talks.py](get_parallel_data_talks.py): This script downloads data the parallel sentences corpus, which contains transcripts and translations of more than 4,000 talks in 100+ languages. - -## Evaluation - -Training can be evaluated in different ways. For an example how to use these evaluation methods, see [make_multilingual.py](make_multilingual.py). 
- -### MSE Evaluation -You can measure the mean squared error (MSE) between the student embeddings and teacher embeddings. This can be achieved with the `` - -```python -# src_sentences and trg_sentences are lists of translated sentences, such that trg_sentences[i] is the translation of src_sentences[i] -dev_mse = evaluation.MSEEvaluator(src_sentences, trg_sentences, teacher_model=teacher_model) -``` - -This evaluator computes the teacher embeddings for the `src_sentences`, for example, for English. During training, the student model is used to compute embeddings for the `trg_sentences`, for example, for Spanish. The distance between teacher and student embeddings is measures. Lower scores indicate a better performance. - -### Translation Accuracy -You can also measure the translation accuracy. Given a list with source sentences, for example, 1000 English sentences. And a list with matching target (translated) sentences, for example, 1000 Spanish sentences. - -For each sentence pair, we check if their embeddings are the closest using cosine similarity. I.e., for each `src_sentences[i]` we check if `trg_sentences[i]` has the highest similarity out of all target sentences. If this is the case, we have a hit, otherwise an error. This evaluator reports accuracy (higher = better). - -```python -# src_sentences and trg_sentences are lists of translated sentences, such that trg_sentences[i] is the translation of src_sentences[i] -dev_trans_acc = evaluation.TranslationEvaluator( - src_sentences, - trg_sentences, - name=os.path.basename(dev_file), - batch_size=inference_batch_size, -) -``` - -### Multi-Lingual Semantic Textual Similarity -You can also measure the semantic textual similarity (STS) between sentence pairs in different languages: - -```python -sts_evaluator = evaluation.EmbeddingSimilarityEvaluatorFromList(sentences1, sentences2, scores) -``` - -Where `sentences1` and `sentences2` are lists of sentences and score is numeric value indicating the semantic similarity between `sentences1[i]` and `sentences2[i]`. - - ## Citation If you use the code for multilingual models, feel free to cite our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813): ``` diff --git a/examples/training/multilingual/make_multilingual.py b/examples/training/multilingual/make_multilingual.py index 21f30f8fd..b50d5d408 100644 --- a/examples/training/multilingual/make_multilingual.py +++ b/examples/training/multilingual/make_multilingual.py @@ -151,8 +151,8 @@ def prepare_dataset(batch): # Mean Squared Error (MSE) measures the (euclidean) distance between teacher and student embeddings dev_mse = MSEEvaluator( - eval_dataset["english"], - eval_dataset["non_english"], + source_sentences=eval_dataset["english"], + target_sentences=eval_dataset["non_english"], name=subset, teacher_model=teacher_model, batch_size=inference_batch_size, @@ -162,8 +162,8 @@ def prepare_dataset(batch): # TranslationEvaluator computes the embeddings for all parallel sentences. 
It then check if
# source[i] is the closest to target[i] out of all available target sentences
dev_trans_acc = TranslationEvaluator(
-    eval_dataset["english"],
-    eval_dataset["non_english"],
+    source_sentences=eval_dataset["english"],
+    target_sentences=eval_dataset["non_english"],
    name=subset,
    batch_size=inference_batch_size,
)
diff --git a/examples/training/nli/README.md b/examples/training/nli/README.md
index 5ab2dce23..553afcdbe 100644
--- a/examples/training/nli/README.md
+++ b/examples/training/nli/README.md
@@ -1,16 +1,25 @@
# Natural Language Inference
-Given two sentence (premise and hypothesis), Natural Language Inference (NLI) is the task of deciding if the premise entails the hypothesis, if they are contradiction or if they are neutral. Commonly used NLI dataset are [SNLI](https://arxiv.org/abs/1508.05326) and [MultiNLI](https://arxiv.org/abs/1704.05426).
+Given two sentences (premise and hypothesis), Natural Language Inference (NLI) is the task of deciding if the premise entails the hypothesis, if they are in contradiction, or if they are neutral. Commonly used NLI datasets are [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli).
[Conneau et al.](https://arxiv.org/abs/1705.02364) showed that NLI data can be quite useful when training Sentence Embedding methods. We also found this in our [Sentence-BERT-Paper](https://arxiv.org/abs/1908.10084) and often use NLI as a first fine-tuning step for sentence embedding methods.
To train on NLI, see the following example files:
-- **[training_nli.py](training_nli.py)** - This example uses the Softmax-Classification-Loss, as described in the [SBERT-Paper](https://arxiv.org/abs/1908.10084), to learn sentence embeddings.
-- **[training_nli_v2.py](training_nli_v2.py)** - The Softmax-Classification-Loss, as used in our original SBERT paper, does not yield optimal performance. A better loss is [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss), where we provide pairs or triplets. In that example, we provide a triplet of the format: (anchor, entailment_sentence, contradiction_sentence). The NLI data provides such triplets. The MultipleNegativesRankingLoss yields much higher performances and is more intuitive than the Softmax-Classification-Loss. We have used this loss to train the paraphrase model in our [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) paper.
-- **[training_nli_v3.py](training_nli_v3.py)** - Following the [GISTEmbed](https://arxiv.org/abs/2402.16829) paper, we can modify the in-batch negative selection from [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using a guiding model. Candidate negative pairs are ignored during training if the guiding model considers the pair to be too similar. In practice, the [GISTEmbedLoss](https://www.sbert.net/docs/package_reference/losses.html#gistembedloss) tends to produce a stronger training signal than `MultipleNegativesRankingLoss` at the cost of some training overhead for running inference on the guiding model.
+1. **[training_nli.py](training_nli.py)**:
+   ```eval_rst
+   This example uses :class:`~sentence_transformers.losses.SoftmaxLoss` as described in the original `Sentence Transformers paper <https://arxiv.org/abs/1908.10084>`_.
+   ```
+2. **[training_nli_v2.py](training_nli_v2.py)**:
+   ```eval_rst
+   The :class:`~sentence_transformers.losses.SoftmaxLoss` as used in our original SBERT paper does not yield optimal performance. A better loss is :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, where we provide pairs or triplets. In this script, we provide a triplet of the format: (anchor, entailment_sentence, contradiction_sentence). The NLI data provides such triplets. The :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` yields much higher performances and is more intuitive than :class:`~sentence_transformers.losses.SoftmaxLoss`. We have used this loss to train the paraphrase model in our `Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation <https://arxiv.org/abs/2004.09813>`_ paper.
+   ```
+3. **[training_nli_v3.py](training_nli_v3.py)**:
+   ```eval_rst
+   Following the `GISTEmbed <https://arxiv.org/abs/2402.16829>`_ paper, we can modify the in-batch negative selection from :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` using a guiding model. Candidate negative pairs are ignored during training if the guiding model considers the pair to be too similar. In practice, the :class:`~sentence_transformers.losses.GISTEmbedLoss` tends to produce a stronger training signal than :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` at the cost of some training overhead for running inference on the guiding model.
+   ```

## Data
-In our experiments we combine [SNLI](https://arxiv.org/abs/1508.05326) and [MultiNLI](https://arxiv.org/abs/1704.05426), which we call AllNLI. These two datasets contain sentence pairs and one of three labels: entailment, neutral, contradiction:
+We combine [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) into a dataset we call [AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli). These two datasets contain sentence pairs and one of three labels: entailment, neutral, contradiction:

| Sentence A (Premise) | Sentence B (Hypothesis) | Label |
| --- | --- | --- |
@@ -18,45 +27,45 @@ In our experiments we combine [SNLI](https://arxiv.org/abs/1508.05326) and [Mult
| An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | neutral |
| A man inspects the uniform of a figure in some East Asian country. | The man is sleeping. | contradiction |
-
-
-
+We format AllNLI in a few different subsets, compatible with different loss functions. See for example the [triplet subset of AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli/viewer/triplet).

## SoftmaxLoss
-[Conneau et al.](https://arxiv.org/abs/1705.02364) described how a softmax classifier on top of a siamese network can be used to learn meaningful sentence representation. We can achieve this by using the [losses.SoftmaxLoss](../../../docs/package_reference/losses.html#softmaxloss) package.
+```eval_rst
+`Conneau et al. <https://arxiv.org/abs/1705.02364>`_ described how a softmax classifier on top of a `siamese network `_ can be used to learn meaningful sentence representations. We can achieve this by using :class:`~sentence_transformers.losses.SoftmaxLoss`:
+```
+<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_SoftmaxLoss.png" alt="SBERT SoftmaxLoss"/>
-The softmax loss looks like this:
+We pass the two sentences through our SentenceTransformer model and get the sentence embeddings *u* and *v*. We then concatenate *u*, *v* and *|u-v|* to form one long vector. This vector is then passed to a softmax classifier, which predicts our three classes (entailment, neutral, contradiction).
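+
+As a rough sketch, this is how the loss could be set up with the new trainer. Treat this as an illustration rather than the full example script: the base model is a placeholder, and we assume the `pair-class` subset of [AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli) with its integer `label` column:
+
+```python
+from datasets import load_dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+
+model = SentenceTransformer("distilroberta-base")  # placeholder base model
+
+# (premise, hypothesis) pairs with an integer label per pair (assumed: 3 classes)
+train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train")
+
+train_loss = losses.SoftmaxLoss(
+    model=model,
+    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
+    num_labels=3,  # entailment, neutral, contradiction
+)
+
+trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=train_loss)
+trainer.train()
+```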
-
-![SBERT SoftmaxLoss](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_SoftmaxLoss.png "SBERT SoftmaxLoss")
-
-We pass the two sentences through our SentenceTransformer network and get the sentence embeddings *u* and *v*. We then concatenate u, v and |u-v| to form one, long vector. This vector is then passed to a softmax classifier, which predicts our three classes (entailment, neutral, contradiction).
-
-This setup learns sentence embeddings, that can later be used for wide variety of tasks.
+This setup learns sentence embeddings that can later be used for a wide variety of tasks.

## MultipleNegativesRankingLoss
+```eval_rst
+That the :class:`~sentence_transformers.losses.SoftmaxLoss` with NLI data produces (relatively) good sentence embeddings is rather coincidental. The :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is much more intuitive and produces significantly better sentence representations.
+```
-
-That the softmax-loss with NLI data produces (relatively) good sentence embeddings is rather coincidental. The [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) is much more intuitive and produces also significantly better sentence representations.
-
-The training data for MultipleNegativesRankingLoss consists of sentence pairs [(a1, b1), ..., (an, bn)] where we assume that (ai, bi) are similar sentences and (ai, bj) are dissimilar sentences for i != j. The minimizes the distance between (ai, bi) while it simultaneously maximizes the distance (ai, bj) for all i != j.
+The training data for MultipleNegativesRankingLoss consists of sentence pairs [(a1, b1), ..., (an, bn)] where we assume that (ai, bi) are similar sentences and (ai, bj) are dissimilar sentences for i != j. The loss minimizes the distance between (ai, bi) while it simultaneously maximizes the distance between (ai, bj) for all i != j. For example, in the following picture:
-
-For example in the following picture:
-
-![](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/MultipleNegativeRankingLoss.png)
+<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/MultipleNegativeRankingLoss.png" alt="SBERT MultipleNegativeRankingLoss"/>

The distance between (a1, b1) is reduced, while the distance between (a1, b2...5) will be increased. The same is done for a2, ..., a5.
-
-Using MultipleNegativeRankingLoss with NLI is rather easy: We define sentences that have an *entailment* label as positive pairs. E.g, we have pairs like (*"A soccer game with multiple males playing."*, *"Some men are playing a sport."*) and want that these pairs are close in vector space.
+```eval_rst
+Using :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` with NLI is rather easy: We define sentences that have an *entailment* label as positive pairs. E.g., we have pairs like (*"A soccer game with multiple males playing."*, *"Some men are playing a sport."*) and want these pairs to be close in vector space. The `pair subset of AllNLI <https://huggingface.co/datasets/sentence-transformers/all-nli/viewer/pair>`_ has been prepared in this format.
+```

### MultipleNegativesRankingLoss with Hard Negatives
-We can further improve MultipleNegativesRankingLoss by not only providing pairs, but by providing triplets: [(a1, b1, c1), ..., (an, bn, cn)]
-
-The entry for ci are so-called hard-negatives: On a lexical level, they are similar to ai and bi. But on a semantic level, they mean different things and should not be close in the vector space.
+We can further improve MultipleNegativesRankingLoss by providing triplets rather than pairs: [(a1, b1, c1), ..., (an, bn, cn)].
The samples for ci are so-called hard-negatives: On a lexical level, they are similar to ai and bi, but on a semantic level, they mean different things and should not be close to ai in the vector space. For NLI data, we can use the contradiction-label to create such triplets with a hard negative. So our triplets look like this: -("*A soccer game with multiple males playing."*, *"Some men are playing a sport."*, *"A group of men playing a baseball game."*). +("*A soccer game with multiple males playing."*, *"Some men are playing a sport."*, *"A group of men playing a baseball game."*). We want the sentences *"A soccer game with multiple males playing."* and *"Some men are playing a sport."* to be close in the vector space, while there should be a larger distance between *"A soccer game with multiple males playing."* and "*A group of men playing a baseball game."*. The [triplet subset of AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli/viewer/triplet) has been prepared in this format. + +### GISTEmbedLoss +```eval_rst + +:class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` can be extended even further by recognizing that the in-batch negative sampling as shown in `this example <#multiplenegativesrankingloss>`_ is a bit flawed. In particular, we automatically assume that the pairs (a\ :sub:`1`\ , b\ :sub:`2`\ ), ..., (a\ :sub:`1`\ , b\ :sub:`n`\ ) are negative, but that does not strictly have to be true. -We want the sentences *"A soccer game with multiple males playing."* and *"Some men are playing a sport."* to be close in the vector space, while there should be a larger distance between *"A soccer game with multiple males playing."* and "*A group of men playing a baseball game."*. +To address this, :class:`~sentence_transformers.losses.GISTEmbedLoss` uses a Sentence Transformer model to guide the in-batch negative sample selection. In particular, if the guide model considers the similarity of (a\ :sub:`1`\ , b\ :sub:`n`\ ) to be larger than (a\ :sub:`1`\ , b\ :sub:`1`\ ), then the (a\ :sub:`1`\ , b\ :sub:`n`\ ) pair is considered a false negative and consequently ignored in the training process. In essence, this results in higher quality training data for the model. 
+``` \ No newline at end of file diff --git a/examples/training/paraphrases/MultiDatasetDataLoader.py b/examples/training/paraphrases/MultiDatasetDataLoader.py deleted file mode 100644 index 9220a37be..000000000 --- a/examples/training/paraphrases/MultiDatasetDataLoader.py +++ /dev/null @@ -1,91 +0,0 @@ -import math -import logging -import random - - -class MultiDatasetDataLoader: - def __init__(self, datasets, batch_size_pairs, batch_size_triplets=None, dataset_size_temp=-1): - self.allow_swap = True - self.batch_size_pairs = batch_size_pairs - self.batch_size_triplets = batch_size_pairs if batch_size_triplets is None else batch_size_triplets - - # Compute dataset weights - self.dataset_lengths = list(map(len, datasets)) - self.dataset_lengths_sum = sum(self.dataset_lengths) - - weights = [] - if dataset_size_temp > 0: # Scale probability with dataset size - for dataset in datasets: - prob = len(dataset) / self.dataset_lengths_sum - weights.append(max(1, int(math.pow(prob, 1 / dataset_size_temp) * 1000))) - else: # Equal weighting of all datasets - weights = [100] * len(datasets) - - logging.info("Dataset lengths and weights: {}".format(list(zip(self.dataset_lengths, weights)))) - - self.dataset_idx = [] - self.dataset_idx_pointer = 0 - - for idx, weight in enumerate(weights): - self.dataset_idx.extend([idx] * weight) - random.shuffle(self.dataset_idx) - - self.datasets = [] - for dataset in datasets: - random.shuffle(dataset) - self.datasets.append( - { - "elements": dataset, - "pointer": 0, - } - ) - - def __iter__(self): - for _ in range(int(self.__len__())): - # Select dataset - if self.dataset_idx_pointer >= len(self.dataset_idx): - self.dataset_idx_pointer = 0 - random.shuffle(self.dataset_idx) - - dataset_idx = self.dataset_idx[self.dataset_idx_pointer] - self.dataset_idx_pointer += 1 - - # Select batch from this dataset - dataset = self.datasets[dataset_idx] - batch_size = self.batch_size_pairs if len(dataset["elements"][0].texts) == 2 else self.batch_size_triplets - - batch = [] - texts_in_batch = set() - guid_in_batch = set() - while len(batch) < batch_size: - example = dataset["elements"][dataset["pointer"]] - - valid_example = True - # First check if one of the texts in already in the batch - for text in example.texts: - text_norm = text.strip().lower() - if text_norm in texts_in_batch: - valid_example = False - - texts_in_batch.add(text_norm) - - # If the example has a guid, check if guid is in batch - if example.guid is not None: - valid_example = valid_example and example.guid not in guid_in_batch - guid_in_batch.add(example.guid) - - if valid_example: - if self.allow_swap and random.random() > 0.5: - example.texts[0], example.texts[1] = example.texts[1], example.texts[0] - - batch.append(example) - - dataset["pointer"] += 1 - if dataset["pointer"] >= len(dataset["elements"]): - dataset["pointer"] = 0 - random.shuffle(dataset["elements"]) - - yield self.collate_fn(batch) if self.collate_fn is not None else batch - - def __len__(self): - return int(self.dataset_lengths_sum / self.batch_size_pairs) diff --git a/examples/training/paraphrases/README.md b/examples/training/paraphrases/README.md index 4ba0d7fbe..1e76d2c85 100644 --- a/examples/training/paraphrases/README.md +++ b/examples/training/paraphrases/README.md @@ -1,65 +1,17 @@ # Paraphrase Data -**This page is currently work-in-progress and will be extended in the future** +```eval_rst +In our paper `Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation `_, we showed that paraphrase data 
together with :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is a powerful combination to learn sentence embedding models. Read `NLI > MultipleNegativesRankingLoss <../nli/README.html#multiplenegativesrankingloss>`_ for more information on this loss function.
+```
-In our paper [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) we showed that paraphrase dataset together with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) is a powerful combination to learn sentence embeddings models.
+The [training.py](training.py) script loads various datasets from the [Dataset Overview](../../../docs/sentence_transformer/dataset_overview.html#pre-existing-datasets). We construct batches by sampling examples from the respective dataset. So far, examples are not mixed between the datasets, i.e., a batch consists only of examples from a single dataset.
-You can find here: [NLI - MultipleNegativesRankingLoss](https://www.sbert.net/examples/training/nli/README.html#multiplenegativesrankingloss) more information how the loss can be used.
-
-In this folder, we collect different datasets and scripts to train using paraphrase data.
-
-## Datasets
-
-You can find here: [sbert.net/datasets/paraphrases](http://sbert.net/datasets/paraphrases) a list of datasets with paraphrases suitable for training.
-
-| Name | Source | #Sentence-Pairs | STSb-dev |
-| --- | --- | :---: | :---: |
-| [AllNLI.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/AllNLI.tsv.gz) | [SNLI](https://nlp.stanford.edu/projects/snli/) + [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) | 277,230 | 86.54 |
-| [sentence-compression.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/sentence-compression.tsv.gz) | [sentence-compression](https://github.com/google-research-datasets/sentence-compression) | 180,000 | 84.36 |
-| [SimpleWiki.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/SimpleWiki.tsv.gz) | [SimpleWiki](https://cs.pomona.edu/~dkauchak/simplification/) | 102,225 | 84.26 |
-| [altlex.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/altlex.tsv.gz) | [altlex](https://github.com/chridey/altlex/) | 112,696 | 83.34 |
-| [msmarco-triplets.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/msmarco-triplets.tsv.gz) | [MS MARCO Passages](https://microsoft.github.io/msmarco/) | 5,028,051 | 83.12 |
-| [quora_duplicates.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/quora_duplicates.tsv.gz) | [Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | 103,663 | 82.55 |
-| [coco_captions-with-guid.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/coco_captions-with-guid.tsv.gz) | [COCO](https://cocodataset.org/) | 828,395 | 82.25
-| [flickr30k_captions-with-guid.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/flickr30k_captions-with-guid.tsv.gz) | [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | 317,695 | 82.04
-|
[yahoo_answers_title_question.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/yahoo_answers_title_question.tsv.gz) | [Yahoo Answers Dataset](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) | 659,896 | 81.19 | -| [S2ORC_citation_pairs.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/S2ORC_citation_pairs.tsv.gz) | [Semantic Scholar Open Research Corpus](http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/) | 52,603,982 | 81.02 | -| [yahoo_answers_title_answer.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/yahoo_answerstitle_answer.tsv.gz) | [Yahoo Answers Dataset](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) | 1,198,260 | 80.25 -| [stackexchange_duplicate_questions.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/stackexchange_duplicate_questions.tsv.gz) | [Stackexchange](https://stackexchange.com/) | 169,438 | 80.37 -| [yahoo_answers_question_answer.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/yahoo_answers_question_answer.tsv.gz) | [Yahoo Answers Dataset](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) | 681,164 | 79.88 | -| [wiki-atomic-edits.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/wiki-atomic-edits.tsv.gz) | [wiki-atomic-edits](https://github.com/google-research-datasets/wiki-atomic-edits) | 22,980,185 | 79.58 -| [wiki-split.tsv.gz](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/wiki-split.tsv.gz) | [wiki-split](https://github.com/google-research-datasets/wiki-split) | 929,944 | 76.59 - - -See the respective linked source website for the dataset license. - - -All datasets have a sample per line and the individual sentences are separated by a tab (\t). Some datasets (like AllNLI) has three sentences per line: An anchor, a positive, and a hard negative. - -We measure for each dataset the performance on the STSb development dataset after 2k training steps with a distilroberta-base model and a batch size of 256. - -**Note**: We find that the STSb dataset is a suboptimal dataset to evaluate the quality of sentence embedding models. It consists mainly of rather simple sentences, it does not require any domain specific knowledge, and the included sentences are of rather high quality compared to noisy, user-written content. Please do not infer from the above numbers how the approaches will perform on your domain specific dataset. - -## Training -See [training.py](training.py) for the training script. - -The training script allows to load one or multiple files. We construct batches by sampling examples from the respective dataset. So far, examples are not mixed between the datasets, i.e., a batch consists only of examples from a single dataset. - -As the dataset sizes are quite different in size, we perform a temperature controlled sampling from the datasets: Smaller datasets are up-sampled, while larger datasets are down-sampled. This allows an effective training with very large and smaller datasets. 
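+
+A minimal sketch of this multi-dataset setup with the new trainer (the dataset choices and the base model here are placeholders; a single loss is shared across both datasets, and the round-robin batch sampler is described just below):
+
+```python
+from datasets import DatasetDict, load_dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+from sentence_transformers.training_args import (
+    MultiDatasetBatchSamplers,
+    SentenceTransformerTrainingArguments,
+)
+
+model = SentenceTransformer("distilroberta-base")  # placeholder base model
+
+# Two (anchor, positive) pair datasets; each batch is drawn from a single dataset
+train_dataset = DatasetDict({
+    "all-nli": load_dataset("sentence-transformers/all-nli", "pair", split="train"),
+    "quora": load_dataset("sentence-transformers/quora-duplicates", "pair", split="train"),
+})
+train_loss = losses.MultipleNegativesRankingLoss(model)
+
+args = SentenceTransformerTrainingArguments(
+    output_dir="output/paraphrase-model",  # placeholder output directory
+    # Sample the same number of batches from every dataset (round-robin), see below
+    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
+)
+trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=train_loss)
+trainer.train()
+```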
+As the datasets are quite different in size, we perform [round-robin sampling](../../../docs/package_reference/sentence_transformer/training_args.html#sentence_transformers.training_args.MultiDatasetBatchSamplers) to train using the same number of batches from each dataset.

## Pre-Trained Models
-Have a look at [pre-trained models](https://www.sbert.net/docs/pretrained_models.html) to view all models that were trained on these paraphrase datasets.
-
-- **paraphrase-MiniLM-L12-v2** - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
-- **paraphrase-distilroberta-base-v2** - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
-- **paraphrase-distilroberta-base-v1** - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split
-- **paraphrase-xlm-r-multilingual-v1** - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: paraphrase-distilroberta-base-v1, Student: xlm-r-base)
-
-
-## Work in Progress
+Have a look at [pre-trained models](../../../docs/sentence_transformer/pretrained_models.md) to view all models that were trained on these paraphrase datasets.
-Training with this data is currently work-in-progress. Things that will be added in the next time:
-- **More datasets**: Are you aware of more suitable training datasets? Let me know: [info@nils-reimers.de](mailto:info@nils-reimers.de)
-- **Optimized batching**: Currently batches are only drawn from one dataset. Future work might include also batches that are sampled across datasets
-- **Optimized loss function**: Currently the same parameters of MultipleNegativesRankingLoss is used for all datasets. Future work includes testing if the dataset benefit from individual loss functions.
-- **Pre-trained models**: Once all datasets are collected, we will train and release respective models.
\ No newline at end of file
+- [paraphrase-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L12-v2) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
+- [paraphrase-distilroberta-base-v2](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v2) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
+- [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split
+- [paraphrase-xlm-r-multilingual-v1](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1) - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1), Student: [xlm-r-base](https://huggingface.co/FacebookAI/xlm-roberta-base))
diff --git a/examples/training/quora_duplicate_questions/README.md b/examples/training/quora_duplicate_questions/README.md
index d61b04ac3..598fcb337 100644
--- a/examples/training/quora_duplicate_questions/README.md
+++ b/examples/training/quora_duplicate_questions/README.md
@@ -1,191 +1,144 @@
# Quora Duplicate Questions
-This folder contains scripts that demonstrate how to train SentenceTransformers for **Information Retrieval**. As simple example, we will use the [Quora Duplicate Questions dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). It contains over 500,000 sentences with over 400,000 pairwise annotations whether two questions are a duplicate or not.
-
-## Pretrained Models
-
-Currently the following models trained on Quora Duplicate Questions are available:
-* **distilbert-base-nli-stsb-quora-ranking**: We extended the *distilbert-base-nli-stsb-mean-tokens* model and trained it with *OnlineContrastiveLoss* and with *MultipleNegativesRankingLoss* on the Quora Duplicate questions dataset. For the code, see [training_multi-task-learning.py](training_multi-task-learning.py)
-* **distilbert-multilingual-nli-stsb-quora-ranking**: Extension of *distilbert-base-nli-stsb-quora-ranking* to be multi-lingual. Trained on parallel data for 50 languages.
-
-You can load & use pre-trained models like this:
-```python
-from sentence_transformers import SentenceTransformer
-
-model = SentenceTransformer("model_name")
-```
-
-
-## Dataset
-As dataset to train a **Duplicate Questions Semantic Search Engine** we use [Quora Duplicate Questions dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). The original format looks like this:
-```
-id qid1 qid2 question1 question2 is_duplicate
-0 1 2 What is the step by step guide to invest in share market in india? What is the step by step guide to invest in share market? 0
-1 3 4 What is the story of Kohinoor (Koh-i-Noor) Diamond? What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? 0
-```
-
-As a first step, we process this file to create distinct train/dev/test splits for different tasks. We define the following tasks:
-- **Duplicate Questions Classification**: Given two questions, are these questions duplicates? This is the original task as defined by Quora, however, it is rather a unpractical task. How do we retrieve possible duplicates in a large corpus for a given question? Further, models performing well on this classification task do not necessarily perform well on the following two task.
-- **Duplicate Questions Mining**: Given a large set (like 100k) of questions, identify all question pairs that are duplicates.
-- **Duplicate Questions Information Retrieval**: Given a large corpus (350k+) of questions. For a new, unseen question, find the most related (i.e. duplicate) questions in this corpus.
-
-
-**Download**: You can download the finished dataset here: [quora-IR-dataset.zip](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/quora-IR-dataset.zip)
-
-For details on the creation of the dataset, see [create_splits.py](create_splits.py).
-
-
-## Usage
-
-### Duplicate Questions Mining
-
-Given a large set of sentences (in this case questions), identify all pairs that are duplicates. See [Paraphrase Mining](../../applications/paraphrase-mining/README.md) for an example how to use sentence transformers to mine for duplicate questions / paraphrases. This approach can be scaled to hundred thousands of sentences given you have enough memory.
-
-### Semantic Search
-
-The model can also be used for Information Retrieval / Semantic Search. Given a new question, search a large corpus of hundred thousands of questions for duplicate questions. Given you have enough memory, this approach works well to copora up in the Millions (depending on your real-time requirements).
-
-For an interactive example, see [Semantic Search](../../applications/semantic-search/README.md).
+This folder contains scripts that demonstrate how to train SentenceTransformers for **Information Retrieval**. As a simple example, we will use the [Quora Duplicate Questions dataset](https://huggingface.co/datasets/sentence-transformers/quora-duplicates). It contains over 500,000 sentences with over 400,000 pairwise annotations on whether two questions are a duplicate or not.
+Models trained on this dataset can be used for mining duplicate questions, i.e., given a large set of sentences (in this case questions), identify all pairs that are duplicates. See [Paraphrase Mining](../../applications/paraphrase-mining/README.md) for an example of how to use sentence transformers to mine for duplicate questions / paraphrases. This approach can be scaled to hundreds of thousands of sentences.

## Training
-Choosing the right loss function is crucial for getting well working sentence embeddings. For the given task, two loss functions are especially suitable: **ConstrativeLoss** and **MultipleNegativesRankingLoss**
+```eval_rst
+Choosing the right loss function is crucial for finetuning useful models. For the given task, two loss functions are especially suitable: :class:`~sentence_transformers.losses.OnlineContrastiveLoss` and :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`.
+```
-### Constrative Loss
-For the complete example, see [training_OnlineContrastiveLoss.py](training_OnlineContrastiveLoss.py).
+### Contrastive Loss
+For the complete training example, see [training_OnlineContrastiveLoss.py](training_OnlineContrastiveLoss.py).
-In the original dataset, we have questions given with a label of 0=not duplicate and 1=duplicate. In that case, we can use contrastive loss: Similar pairs with label 1 are pulled together, so that they are close in vector space. Dissimilar pairs, that are closer than a defined margin, are pushed away in vector space.
+```eval_rst
+The Quora Duplicates dataset has a `pair-class subset <https://huggingface.co/datasets/sentence-transformers/quora-duplicates/viewer/pair-class>`_ which consists of question pairs and labels: 1 for duplicate and 0 for different.
-Choosing the distance function and especially choosing a sensible margin are quite important for the success of contrastive loss. In the given example, we use cosine_distance (which is 1-cosine_similarity) with a margin of 0.5. I.e., non-duplicate questions should have a cosine_distance of at least 0.5 (which is equivalent to a 0.5 cosine similarity difference).
+As shown by our `Loss Overview <../../../docs/sentence_transformer/loss_overview.md>`_, this allows us to use :class:`~sentence_transformers.losses.ContrastiveLoss`. Similar pairs with label 1 are pulled together, so that they are close in vector space, while dissimilar pairs that are closer than a defined margin are pushed away in vector space.
-An improved version of contrastive loss is OnlineContrastiveLoss, which looks which negative pairs have a lower distance that the largest positive pair and which positive pairs have a higher distance than the lowest distance of negative pairs. I.e., this loss automatically detects the hard cases in a batch and computes the loss only for these cases.
+An improved version is :class:`~sentence_transformers.losses.OnlineContrastiveLoss`. This loss checks which negative pairs have a lower distance than the largest positive pair and which positive pairs have a higher distance than the lowest distance of negative pairs. I.e., this loss automatically detects the hard cases in a batch and computes the loss only for these cases.
+```

The loss can be used like this:

```python
-train_samples = []
-with open(
-    os.path.join(dataset_path, "classification/train_pairs.tsv"), encoding="utf8"
-) as fIn:
-    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
-    for row in reader:
-        sample = InputExample(
-            texts=[row["question1"], row["question2"]],
-            label=int(row["is_duplicate"]),
-        )
-        train_samples.append(sample)
-
-
-train_dataset = SentencesDataset(train_samples, model=model)
-train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
-train_loss = losses.OnlineContrastiveLoss(model=model, distance_metric=distance_metric, margin=margin)
-```
-
-For each row in our train dataset, we create new InputExample objects and the two questions as texts and the is_duplicate as the label.
-
-
+from datasets import load_dataset
+from sentence_transformers import losses
+
+train_dataset = load_dataset("sentence-transformers/quora-duplicates", "pair-class", split="train")
+# => Dataset({
+#     features: ['sentence1', 'sentence2', 'label'],
+#     num_rows: 404290
+# })
+print(train_dataset[0])
+# => {'sentence1': 'What is the step by step guide to invest in share market in india?', 'sentence2': 'What is the step by step guide to invest in share market?', 'label': 0}
+# `model` is the SentenceTransformer model being finetuned
+train_loss = losses.OnlineContrastiveLoss(model=model, margin=0.5)
+```

## MultipleNegativesRankingLoss
For the complete example, see [training_MultipleNegativesRankingLoss.py](training_MultipleNegativesRankingLoss.py).
-*MultipleNegativesRankingLoss* is especially suitable for Information Retrieval / Semantic Search. A nice advantage of *MultipleNegativesRankingLoss* is that it only requires positive pairs, i.e., we only need examples of duplicate questions.
-
-From all pairs, we sample a mini-batch *(a_1, b_1), ..., (a_n, b_n)* where *(a_i, b_i)* is a duplicate question.
-
-MultipleNegativesRankingLoss now uses all *b_j* with j != i as negative example for *(a_i, b_i)*. For example, for *a_1* we have given the options *(b_1, ..., b_n)* and we need to identify which is the correct duplicate question to *a_1*. We do this by computing the dot-product between the embedding of *a_1* and all *b*'s and softmax normalize it so that we get a probability distribution over *(b_1, ..., b_n)*. In the best case, the positive example *b_1* get a probability of close to 1 while all others get scores close to 0. We use negative log-likelihood to compute the loss.
-
-
-*MultipleNegativesRankingLoss* implements this idea in an efficient way so that the embeddings are re-used. With a batch-size of 64, we have 64 positive pairs and each positive pairs has 64-1 negative distractors.
-
+```eval_rst
+:class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is especially suitable for Information Retrieval / Semantic Search. A nice advantage is that it only requires positive pairs, i.e., we only need examples of duplicate questions. See `NLI > MultipleNegativesRankingLoss <../nli/README.html#multiplenegativesrankingloss>`_ for more information on how the loss works.
+```

Using the loss is easy and does not require tuning of any hyperparameters:

```python
-train_samples = []
-with open(os.path.join(dataset_path, "classification/train_pairs.tsv"), encoding="utf8") as fIn:
-    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
-    for row in reader:
-        if row["is_duplicate"] == "1":
-            train_samples.append(
-                InputExample(texts=[row["question1"], row["question2"]], label=1)
-            )
-            train_samples.append(
-                InputExample(texts=[row["question2"], row["question1"]], label=1)
-            )  # if A is a duplicate of B, then B is a duplicate of A
-
-
-# After reading the train_samples, we create a SentencesDataset and a DataLoader
-train_dataset = SentencesDataset(train_samples, model=model)
-train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
+from datasets import load_dataset
+from sentence_transformers import losses
+
+train_dataset = load_dataset("sentence-transformers/quora-duplicates", "pair", split="train")
+# => Dataset({
+#     features: ['anchor', 'positive'],
+#     num_rows: 149263
+# })
+print(train_dataset[0])
+# => {'anchor': 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?', 'positive': "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"}
train_loss = losses.MultipleNegativesRankingLoss(model)
```

-We only use the positive examples. As 'is_duplicate' is a symmetric relation, we not only add (A, B) but also (B, A) to our training sample set.
+As 'is_duplicate' is a symmetric relation, we can add not just (anchor, positive) but also (positive, anchor) to our training sample set:
-**Note 1:** Increasing the batch sizes usually yields better results, as the task gets harder. It is more difficult to identify the correct duplicate question out of a set of 100 questions than out of a set of only 10 questions. So it is advisable to set the training batch size as large as possible. I trained it with a batch size of 350 on 32 GB GPU memory.
+
+```python
+from datasets import concatenate_datasets
+
+train_dataset = concatenate_datasets([
+    train_dataset,
+    train_dataset.rename_columns({"anchor": "positive", "positive": "anchor"})
+])
+# Dataset({
+#     features: ['anchor', 'positive'],
+#     num_rows: 298526
+# })
+```
```eval_rst
.. note::
    Increasing the batch sizes usually yields better results, as the task gets harder. It is more difficult to identify the correct duplicate question out of a set of 100 questions than out of a set of only 10 questions. So it is advisable to set the training batch size as large as possible. I trained it with a batch size of 350 on 32 GB GPU memory.

-**Note 2:** MultipleNegativesRankingLoss only works if *(a_i, b_j)* with j != i is actually a negative, non-duplicate question pair. In few instances, this assumption is wrong. But in the majority of cases, if we sample two random questions, they are not duplicates. If your dataset cannot fulfil this property, MultipleNegativesRankingLoss might not work well.

+.. note::
+    :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` only works if *(a_i, b_j)* with j != i is actually a negative, non-duplicate question pair. In a few instances, this assumption is wrong. But in the majority of cases, if we sample two random questions, they are not duplicates. If your dataset cannot fulfil this property, :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` might not work well.
+```

### Multi-Task-Learning
-Contrastive Loss works well for pair classification, i.e., given two pairs, are these duplicates or not. It pushes negative pairs far away in vector space, so that the distinguishing between duplicate and non-duplicate pairs works good.
+```eval_rst
+:class:`~sentence_transformers.losses.ContrastiveLoss` works well for pair classification, i.e., given a pair of questions, are these duplicates or not. It pushes negative pairs far away in vector space, so that distinguishing between duplicate and non-duplicate pairs works well.
-MultipleNegativesRankingLoss on the other sides mainly reduces the distance between positive pairs out of large set of possible candidates. However, the distance between non-duplicate questions is not so large, so that this loss does not work that well for pair classification.
+:class:`~sentence_transformers.losses.MultipleNegativesRankingLoss`, on the other hand, mainly reduces the distance between positive pairs out of a large set of possible candidates. However, the distance between non-duplicate questions is not so large, so that this loss does not work that well for pair classification.
+```

In [training_multi-task-learning.py](training_multi-task-learning.py) I demonstrate how we can train the network with both losses. The essential code is to define both losses and to pass it to the fit method.
+ ```python -train_samples_MultipleNegativesRankingLoss = [] -train_samples_ContrastiveLoss = [] - -with open(os.path.join(dataset_path, "classification/train_pairs.tsv"), encoding="utf8") as fIn: - reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE) - for row in reader: - train_samples_ContrastiveLoss.append( - InputExample( - texts=[row["question1"], row["question2"]], - label=int(row["is_duplicate"]), - ) - ) - if row["is_duplicate"] == "1": - train_samples_MultipleNegativesRankingLoss.append( - InputExample(texts=[row["question1"], row["question2"]], label=1) - ) - train_samples_MultipleNegativesRankingLoss.append( - InputExample(texts=[row["question2"], row["question1"]], label=1) - ) # if A is a duplicate of B, then B is a duplicate of A - -# Create data loader and loss for MultipleNegativesRankingLoss -train_dataset_MultipleNegativesRankingLoss = SentencesDataset( - train_samples_MultipleNegativesRankingLoss, model=model +from datasets import load_dataset +from sentence_transformers.losses import ContrastiveLoss, MultipleNegativesRankingLoss +from sentence_transformers import SentenceTransformerTrainer, SentenceTransformer + +model_name = "stsb-distilbert-base" +model = SentenceTransformer(model_name) + +# https://huggingface.co/datasets/sentence-transformers/quora-duplicates +mnrl_dataset = load_dataset( + "sentence-transformers/quora-duplicates", "triplet", split="train" +) # The "pair" subset also works +mnrl_train_dataset = mnrl_dataset.select(range(100000)) +mnrl_eval_dataset = mnrl_dataset.select(range(100000, 101000)) + +mnrl_train_loss = MultipleNegativesRankingLoss(model=model) + +# https://huggingface.co/datasets/sentence-transformers/quora-duplicates +cl_dataset = load_dataset("sentence-transformers/quora-duplicates", "pair-class", split="train") +cl_train_dataset = cl_dataset.select(range(100000)) +cl_eval_dataset = cl_dataset.select(range(100000, 101000)) + +cl_train_loss = ContrastiveLoss(model=model, margin=0.5) + +# Create the trainer & start training +trainer = SentenceTransformerTrainer( + model=model, + train_dataset={ + "mnrl": mnrl_train_dataset, + "cl": cl_train_dataset, + }, + eval_dataset={ + "mnrl": mnrl_eval_dataset, + "cl": cl_eval_dataset, + }, + loss={ + "mnrl": mnrl_train_loss, + "cl": cl_train_loss, + }, ) -train_dataloader_MultipleNegativesRankingLoss = DataLoader( - train_dataset_MultipleNegativesRankingLoss, - shuffle=True, - batch_size=train_batch_size, -) -train_loss_MultipleNegativesRankingLoss = losses.MultipleNegativesRankingLoss(model) +trainer.train() +``` +## Pretrained Models -# Create data loader and loss for OnlineContrastiveLoss -train_dataset_ConstrativeLoss = SentencesDataset( - train_samples_ConstrativeLoss, model=model -) -train_dataloader_ConstrativeLoss = DataLoader( - train_dataset_ConstrativeLoss, shuffle=True, batch_size=train_batch_size -) -train_loss_ConstrativeLoss = losses.OnlineContrastiveLoss( - model=model, distance_metric=distance_metric, margin=margin -) +Currently the following models trained on Quora Duplicate Questions are available: +* [distilbert-base-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-quora-ranking): We extended the [distilbert-base-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens) model and trained it with *OnlineContrastiveLoss* and with *MultipleNegativesRankingLoss* on the Quora Duplicate questions dataset. 
For the code, see [training_multi-task-learning.py](training_multi-task-learning.py)
+* [distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking): Extension of *distilbert-base-nli-stsb-quora-ranking* to be multi-lingual. Trained on parallel data for 50 languages.
-# .....
-# Train the model
-model.fit(
-    train_objectives=[
-        (train_dataloader_MultipleNegativesRankingLoss, train_loss_MultipleNegativesRankingLoss),
-        (train_dataloader_ConstrativeLoss, train_loss_ConstrativeLoss),
-    ],
-    evaluator=seq_evaluator,
-    epochs=num_epochs,
-    warmup_steps=1000,
-    output_path=model_save_path,
-)
-```
+You can load & use pre-trained models like this:
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("distilbert-base-nli-stsb-quora-ranking")
+```
\ No newline at end of file
diff --git a/examples/training/sts/README.md b/examples/training/sts/README.md
index e95266da6..6c95adfc4 100644
--- a/examples/training/sts/README.md
+++ b/examples/training/sts/README.md
@@ -1,42 +1,61 @@
# Semantic Textual Similarity
-Semantic Textual Similarity (STS) assigns a score on the similarity of two texts. In this example, we use the [STSbenchmark](https://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) as training data to fine-tune our network. See the following example scripts how to tune SentenceTransformer on STS data:
+Semantic Textual Similarity (STS) assigns a score to the similarity of two texts. In this example, we use the [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) dataset as training data to fine-tune our model. See the following example scripts for how to tune SentenceTransformer on STS data:
-- **[training_stsbenchmark.py](training_stsbenchmark.py)** - This example shows how to create a SentenceTransformer model from scratch by using a pre-trained transformer model together with a pooling layer.
- - **[training_stsbenchmark_continue_training.py](training_stsbenchmark_continue_training.py)** - This example shows how to continue training on STS data for a previously created & trained SentenceTransformer model. In that example, we load a model trained on [NLI data](../nli/README.md).
-
+- **[training_stsbenchmark.py](training_stsbenchmark.py)** - This example shows how to create a SentenceTransformer model from scratch by using a pre-trained transformer model (e.g. [`distilbert-base-uncased`](https://huggingface.co/distilbert/distilbert-base-uncased)) together with a pooling layer.
+- **[training_stsbenchmark_continue_training.py](training_stsbenchmark_continue_training.py)** - This example shows how to continue training on STS data for a previously created & trained SentenceTransformer model (e.g. [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)).

## Training data
-In STS, we have sentence pairs annotated together with a score indicating the similarity. For the [STSbenchmark](https://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark), the scores ranges from 0 (the content of the two sentences are competely different) up to 5 (the two sentences are identical in terms of their meaning). To train our network, we need to normalize these scores to a range of 0-1. This can simply be done by dividing the score by 5.
+```eval_rst
+In STS, we have sentence pairs annotated together with a score indicating the similarity. In the original STSbenchmark dataset, the scores range from 0 to 5.
We have normalized these scores to range between 0 and 1 in `stsb <https://huggingface.co/datasets/sentence-transformers/stsb>`_, as that is required for :class:`~sentence_transformers.losses.CosineSimilarityLoss`, as you can see in the `Loss Overview <../../../docs/sentence_transformer/loss_overview.html>`_.
+```
-To store our training data, we create a list with `InputExample` objects. Each `InputExample` contains the sentence pair together with the label (score) that ranges between 0 - 1. A simplified version how the training data has to look like is the following:
+Here is a simplified version of our training data:

```python
-from sentence_transformers import (
-    SentenceTransformer,
-    SentencesDataset,
-    InputExample,
-    losses,
-)
-
-model = SentenceTransformer("nli-distilroberta-base-v2")
-train_examples = [
-    InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
-    InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
-]
-train_dataset = SentencesDataset(train_examples, model)
+from datasets import Dataset
+
+sentence1_list = ["My first sentence", "Another pair"]
+sentence2_list = ["My second sentence", "Unrelated sentence"]
+labels_list = [0.8, 0.3]
+train_dataset = Dataset.from_dict({
+    "sentence1": sentence1_list,
+    "sentence2": sentence2_list,
+    "label": labels_list,
+})
+# => Dataset({
+#     features: ['sentence1', 'sentence2', 'label'],
+#     num_rows: 2
+# })
+print(train_dataset[0])
+# => {'sentence1': 'My first sentence', 'sentence2': 'My second sentence', 'label': 0.8}
+print(train_dataset[1])
+# => {'sentence1': 'Another pair', 'sentence2': 'Unrelated sentence', 'label': 0.3}
```
-## Loss Function
-As loss function we use [CosineSimilarityLoss](../../../docs/package_reference/losses.html#cosinesimilarityloss).
+In the aforementioned scripts, we directly load the [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) dataset:
+```python
+from datasets import load_dataset
-*CosineSimilarityLoss* trains the network with a siamese network structure (for details see: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084))
+train_dataset = load_dataset("sentence-transformers/stsb", split="train")
+# => Dataset({
+#     features: ['sentence1', 'sentence2', 'score'],
+#     num_rows: 5749
+# })
+```
+## Loss Function
+```eval_rst
+We use :class:`~sentence_transformers.losses.CosineSimilarityLoss` as our loss function.
+```
-![SBERT Siamese Network Architecture](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_Siamese_Network.png "SBERT Siamese Architecture")
+<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_Siamese_Network.png" alt="SBERT Siamese Network Architecture"/>
+For each sentence pair, we pass sentence A and sentence B through the BERT-based model, which yields the embeddings *u* and *v*. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. Note that the two sentences are fed through the same model rather than two separate models. In particular, the cosine similarity for similar texts is maximized and the cosine similarity for dissimilar texts is minimized. This allows our model to be fine-tuned and to recognize the similarity of sentences.
-For each sentence pair, we pass sentence A and sentence B through our network which yields the embeddings *u* und *v*. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. This allows our network to be fine-tuned and to recognize the similarity of sentences.
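+
+Putting this together, a minimal training sketch with this loss and the `stsb` dataset (the base model is a placeholder, and `trainer.train()` runs with default training arguments):
+
+```python
+from datasets import load_dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+
+model = SentenceTransformer("distilroberta-base")  # placeholder base model
+
+# Columns: sentence1, sentence2, score; the score is already normalized to 0..1
+train_dataset = load_dataset("sentence-transformers/stsb", split="train")
+
+# Both sentences are embedded by the same model; the cosine similarity of the two
+# embeddings is regressed towards the gold "score" column
+train_loss = losses.CosineSimilarityLoss(model=model)
+
+trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=train_loss)
+trainer.train()
+```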
+For more details, see [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084). -This training in a siamese network structure is done automatically when we use CosineSimilarityLoss. +```eval_rst +:class:`~sentence_transformers.losses.CoSENTLoss` and :class:`~sentence_transformers.losses.AnglELoss` are more modern variants of :class:`~sentence_transformers.losses.CosineSimilarityLoss` that accept the same data format (a sentence pair with a similarity score ranging from 0.0 to 1.0). Informal experiments indicate that these two produce stronger models than :class:`~sentence_transformers.losses.CosineSimilarityLoss`. +``` \ No newline at end of file diff --git a/examples/training/sts/training_stsbenchmark.py b/examples/training/sts/training_stsbenchmark.py index ea194640e..9bdc1efe3 100644 --- a/examples/training/sts/training_stsbenchmark.py +++ b/examples/training/sts/training_stsbenchmark.py @@ -43,7 +43,7 @@ logging.info(train_dataset) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and one # similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) # train_loss = losses.CoSENTLoss(model=model) diff --git a/examples/training/sts/training_stsbenchmark_continue_training.py b/examples/training/sts/training_stsbenchmark_continue_training.py index c902306a1..ff4c70bdd 100644 --- a/examples/training/sts/training_stsbenchmark_continue_training.py +++ b/examples/training/sts/training_stsbenchmark_continue_training.py @@ -39,7 +39,7 @@ logging.info(train_dataset) # 3. Define our training loss -# CosineSimilarityLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one +# CosineSimilarityLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) needs two text columns and one # similarity score column (between 0 and 1) train_loss = losses.CosineSimilarityLoss(model=model) # train_loss = losses.CoSENTLoss(model=model) diff --git a/examples/unsupervised_learning/README.md b/examples/unsupervised_learning/README.md index ba123fba2..6ad9672fe 100644 --- a/examples/unsupervised_learning/README.md +++ b/examples/unsupervised_learning/README.md @@ -2,8 +2,10 @@ This page contains a collection of unsupervised learning methods to learn sentence embeddings. The methods have in common that they **do not require labeled training data**. Instead, they can learn semantically meaningful sentence embeddings just from the text itself. -**Note:** Unsupervised learning approaches are still an activate research area and in many cases the models perform rather poorly compared to models that are using training pairs as provided in our [training data collection](https://huggingface.co/datasets/sentence-transformers/embedding-training-data). A better approach is **[Domain Adaptation](../domain_adaptation/README.md)** where you combine unsupervised learning on your target domain with existent labeled data. This gives the best performance on your specific corpus. - +```eval_rst +.. note:: + Unsupervised learning approaches are still an active research area and in many cases the models perform rather poorly compared to models that use training pairs as provided in our `training data collection <https://huggingface.co/datasets/sentence-transformers/embedding-training-data>`_.
A better approach is `Domain Adaptation <../domain_adaptation/README.md>`_ where you combine unsupervised learning on your target domain with existing labeled data. This should give the best performance on your specific corpus. +``` ## TSDAE In our work [TSDAE (Transformer-based Denoising AutoEncoder)](https://arxiv.org/abs/2104.06979) we present an unsupervised sentence embedding learning method based on denoising auto-encoders: diff --git a/index.rst b/index.rst index 4d2936c2e..ef8b630b9 100644 --- a/index.rst +++ b/index.rst @@ -1,77 +1,74 @@ -SentenceTransformers Documentation -================================================= - -SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in our paper `Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks `_. - -You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for `semantic textual similarity `_, `semantic search `_, or `paraphrase mining `_. - -The framework is based on `PyTorch `_ and `Transformers `_ and offers a large collection of `pre-trained models `_ tuned for various tasks. Further, it is easy to `fine-tune your own models `_. +.. note:: + Sentence Transformers v3.0 was just released, introducing a new training API for Sentence Transformer models. Read `SentenceTransformer > Training Overview `_ to learn more about the training API, and check out `v3.0 Release Notes `_ for details on the other changes. -Installation -================================================= - -You can install it using pip: - -.. code-block:: python - - pip install -U sentence-transformers - +SentenceTransformers Documentation +================================== -We recommend **Python 3.8** or higher, and at least **PyTorch 1.11.0**. See `installation `_ for further installation options, especially if you want to use a GPU. +Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. +It can be used to compute embeddings using Sentence Transformer models (`quickstart `_) or to calculate similarity scores using Cross-Encoder models (`quickstart `_). This unlocks a wide range of applications, including `semantic search `_, `semantic textual similarity `_, and `paraphrase mining `_. +A wide selection of over `5,000 pre-trained Sentence Transformers models `_ are available for immediate use on 🤗 Hugging Face, including many of the state-of-the-art models from the `Massive Text Embeddings Benchmark (MTEB) leaderboard `_. Additionally, it is easy to `train or finetune your own models `_ using Sentence Transformers, enabling you to create custom models for your specific use cases. +Sentence Transformers was created by `UKPLab `_ and is being maintained by `🤗 Hugging Face `_. Don't hesitate to open an issue on the `Sentence Transformers repository `_ if something is broken or if you have further questions. Usage -================================================= -The usage is as simple as: - -.. code-block:: python - - from sentence_transformers import SentenceTransformer - model = SentenceTransformer("all-MiniLM-L6-v2") - - # Our sentences to encode - sentences = [ - "This framework generates embeddings for each input sentence", - "Sentences are passed as a list of string.", - "The quick brown fox jumps over the lazy dog."
- ] - - # Sentences are encoded by calling model.encode() - embeddings = model.encode(sentences) +===== +.. seealso:: + + See the `Quickstart `_ for quick information on how to use Sentence Transformers. - # Print the embeddings - for sentence, embedding in zip(sentences, embeddings): - print("Sentence:", sentence) - print("Embedding:", embedding) - print("") +Using Sentence Transformer models is straightforward: +.. sidebar:: Installation + You can install *sentence-transformers* using pip: + + .. code-block:: bash + + pip install -U sentence-transformers + + We recommend **Python 3.8+** and **PyTorch 1.11.0+**. See `installation `_ for further installation options. +.. code-block:: python -Performance -========================= - -Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is tuned to provide the highest possible speed. Have a look at `Pre-Trained Models `_ for an overview of available models and the respective performance on different tasks. - - - + from sentence_transformers import SentenceTransformer + # 1. Load a pretrained Sentence Transformer model + model = SentenceTransformer("all-MiniLM-L6-v2") + # The sentences to encode + sentences = [ + "The weather is lovely today.", + "It's so sunny outside!", + "He drove to the stadium.", + ] -Contact -========================= + # 2. Calculate embeddings by calling model.encode() + embeddings = model.encode(sentences) + print(embeddings.shape) + # (3, 384) -Contact person: Tom Aarsen, tom.aarsen@huggingface.co + # 3. Calculate the embedding similarities + similarities = model.similarity(embeddings, embeddings) + print(similarities) + # tensor([[1.0000, 0.6660, 0.1046], + # [0.6660, 1.0000, 0.1411], + # [0.1046, 0.1411, 1.0000]]) -Don't hesitate to open an issue on the `repository `_ if something is broken (and it shouldn't be) or if you have further questions. +What Next? +========== -*This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.* +Consider reading one of the following sections to answer the related questions: +* How to **use** Sentence Transformer models? `Sentence Transformers > Usage `_ +* What Sentence Transformer **models** can I use? `Sentence Transformers > Pretrained Models `_ +* How do I **train/finetune** a Sentence Transformer model? `Sentence Transformers > Training Overview `_ +* How to **use** Cross Encoder models? `Cross Encoder > Usage `_ +* What Cross Encoder **models** can I use? `Cross Encoder > Pretrained Models `_ -Citing & Authors -========================= +Citing +====== If you find this repository helpful, feel free to cite our publication `Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks `_: @@ -124,71 +121,41 @@ If you use the code for `data augmentation `_), - or `"flash_attention_2"` (using `Dao-AILab/flash-attention `_). - By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual `"eager"` - implementation. - - See the `PreTrainedModel.from_pretrained - `_ - documentation for more details. - :param tokenizer_kwargs: Additional tokenizer configuration parameters to be passed to the Huggingface Transformers tokenizer. - See the `AutoTokenizer.from_pretrained - `_ - documentation for more details. - :param config_kwargs: Additional model configuration parameters to be passed to the Huggingface Transformers config.
- See the `AutoConfig.from_pretrained - `_ - documentation for more details. - :param model_card_data: A model card data object that contains information about the model. This is used to generate - a model card when saving the model. If not set, a default model card data object is created. - - Example + Args: + model_name_or_path (str, optional): If it is a filepath on disc, it loads the model from that path. If it is not a path, + it first tries to download a pre-trained SentenceTransformer model. If that fails, tries to construct a model + from the Hugging Face Hub with that name. + modules (Iterable[nn.Module], optional): A list of torch Modules that should be called sequentially, can be used to create custom + SentenceTransformer models from scratch. + device (str, optional): Device (like "cuda", "cpu", "mps", "npu") that should be used for computation. If None, checks if a GPU + can be used. + prompts (Dict[str, str], optional): A dictionary with prompts for the model. The key is the prompt name, the value is the prompt text. + The prompt text will be prepended before any text to encode. For example: + `{"query": "query: ", "passage": "passage: "}` or `{"clustering": "Identify the main category based on the + titles in "}`. + default_prompt_name (str, optional): The name of the prompt that should be used by default. If not set, + no prompt will be applied. + similarity_fn_name (str or SimilarityFunction, optional): The name of the similarity function to use. Valid options are "cosine", "dot", + "euclidean", and "manhattan". If not set, it is automatically set to "cosine" if `similarity` or + `similarity_pairwise` are called while `model.similarity_fn_name` is still `None`. + cache_folder (str, optional): Path to store models. Can also be set by the SENTENCE_TRANSFORMERS_HOME environment variable. + trust_remote_code (bool, optional): Whether or not to allow for custom models defined on the Hub in their own modeling files. + This option should only be set to True for repositories you trust and in which you have read the code, as it + will execute code present on the Hub on your local machine. + revision (str, optional): The specific model version to use. It can be a branch name, a tag name, or a commit id, + for a stored model on Hugging Face. + local_files_only (bool, optional): If `True`, avoid downloading the model. + token (bool or str, optional): Hugging Face authentication token to download private models. + use_auth_token (bool or str, optional): Deprecated argument. Please use `token` instead. + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. `None` does no truncation. Truncation is + only applicable during inference when :meth:`SentenceTransformer.encode` is called. + model_kwargs (Dict[str, Any], optional): Additional model configuration parameters to be passed to the Huggingface Transformers model. + Particularly useful options are: + + - ``torch_dtype``: Override the default `torch.dtype` and load the model under a specific `dtype`. + The different options are: + + 1. ``torch.float16``, ``torch.bfloat16`` or ``torch.float``: load in a specified + ``dtype``, ignoring the model's ``config.torch_dtype`` if one exists. If not specified - the model will + get loaded in ``torch.float`` (fp32). + + 2. ``"auto"`` - A ``torch_dtype`` entry in the ``config.json`` file of the model will be + attempted to be used. 
If this entry isn't found, the ``dtype`` of the first weight in + the checkpoint that's of a floating point type is used instead. This will load the model + using the ``dtype`` it was saved in at the end of training; note that this cannot be used as an indicator of how + the model was trained, since a model may be trained in one of the half-precision dtypes but saved in fp32. + - ``attn_implementation``: The attention implementation to use in the model (if relevant). Can be any of + `"eager"` (manual implementation of the attention), `"sdpa"` (using `F.scaled_dot_product_attention + `_), + or `"flash_attention_2"` (using `Dao-AILab/flash-attention `_). + By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual `"eager"` + implementation. + + See the `PreTrainedModel.from_pretrained + `_ + documentation for more details. + tokenizer_kwargs (Dict[str, Any], optional): Additional tokenizer configuration parameters to be passed to the Huggingface Transformers tokenizer. + See the `AutoTokenizer.from_pretrained + `_ + documentation for more details. + config_kwargs (Dict[str, Any], optional): Additional model configuration parameters to be passed to the Huggingface Transformers config. + See the `AutoConfig.from_pretrained + `_ + documentation for more details. + model_card_data (:class:`~sentence_transformers.model_card.SentenceTransformerModelCardData`, optional): A model + card data object that contains information about the model. This is used to generate a model card when saving + the model. If not set, a default model card data object is created. + + Example: :: from sentence_transformers import SentenceTransformer @@ -364,34 +367,55 @@ def encode( """ Computes sentence embeddings. - :param sentences: the sentences to embed. - :param prompt_name: The name of the prompt to use for encoding. Must be a key in the `prompts` dictionary, - which is either set in the constructor or loaded from the model configuration. For example if - `prompt_name` is ``"query"`` and the `prompts` is ``{"query": "query: ", ...}``, then the sentence "What - is the capital of France?" will be encoded as "query: What is the capital of France?" because the sentence - is appended to the prompt. If `prompt` is also set, this argument is ignored. - :param prompt: The prompt to use for encoding. For example, if the prompt is ``"query: "``, then the - sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" - because the sentence is appended to the prompt. If `prompt` is set, `prompt_name` is ignored. - :param batch_size: the batch size used for the computation. - :param show_progress_bar: Whether to output a progress bar when encode sentences. - :param output_value: The type of embeddings to return: "sentence_embedding" to get sentence embeddings, - "token_embeddings" to get wordpiece token embeddings, and `None`, to get all output values. Defaults - to "sentence_embedding". - :param precision: The precision to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or - "ubinary". All non-float32 precisions are quantized embeddings. Quantized embeddings are smaller in - size and faster to compute, but may have a lower accuracy. They are useful for reducing the size - of the embeddings of a corpus for semantic search, among other tasks. Defaults to "float32". - :param convert_to_numpy: Whether the output should be a list of numpy vectors. If False, it is a list of PyTorch tensors.
- :param convert_to_tensor: Whether the output should be one large tensor. Overwrites `convert_to_numpy`. - :param device: Which `torch.device` to use for the computation. - :param normalize_embeddings: Whether to normalize returned vectors to have length 1. In that case, - the faster dot-product (util.dot_score) instead of cosine similarity can be used. - - :return: By default, a 2d numpy array with shape [num_inputs, output_dimension] is returned. If only one string - input is provided, then the output is a 1d array with shape [output_dimension]. If `convert_to_tensor`, a - torch Tensor is returned instead. If `self.truncate_dim <= output_dimension` then output_dimension is - `self.truncate_dim`. + Args: + sentences (Union[str, List[str]]): The sentences to embed. + prompt_name (Optional[str], optional): The name of the prompt to use for encoding. Must be a key in the `prompts` dictionary, + which is either set in the constructor or loaded from the model configuration. For example if + ``prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What + is the capital of France?" will be encoded as "query: What is the capital of France?" because the sentence + is appended to the prompt. If ``prompt`` is also set, this argument is ignored. Defaults to None. + prompt (Optional[str], optional): The prompt to use for encoding. For example, if the prompt is "query: ", then the + sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" + because the sentence is appended to the prompt. If ``prompt`` is set, ``prompt_name`` is ignored. Defaults to None. + batch_size (int, optional): The batch size used for the computation. Defaults to 32. + show_progress_bar (bool, optional): Whether to output a progress bar when encoding sentences. Defaults to None. + output_value (Optional[Literal["sentence_embedding", "token_embeddings"]], optional): The type of embeddings to return: + "sentence_embedding" to get sentence embeddings, "token_embeddings" to get wordpiece token embeddings, and `None`, + to get all output values. Defaults to "sentence_embedding". + precision (Literal["float32", "int8", "uint8", "binary", "ubinary"], optional): The precision to use for the embeddings. + Can be "float32", "int8", "uint8", "binary", or "ubinary". All non-float32 precisions are quantized embeddings. + Quantized embeddings are smaller in size and faster to compute, but may have a lower accuracy. They are useful for + reducing the size of the embeddings of a corpus for semantic search, among other tasks. Defaults to "float32". + convert_to_numpy (bool, optional): Whether the output should be a list of numpy vectors. If False, it is a list of PyTorch tensors. + Defaults to True. + convert_to_tensor (bool, optional): Whether the output should be one large tensor. Overwrites `convert_to_numpy`. + Defaults to False. + device (str, optional): Which :class:`torch.device` to use for the computation. Defaults to None. + normalize_embeddings (bool, optional): Whether to normalize returned vectors to have length 1. In that case, + the faster dot-product (util.dot_score) instead of cosine similarity can be used. Defaults to False. + + Returns: + Union[List[Tensor], ndarray, Tensor]: By default, a 2d numpy array with shape [num_inputs, output_dimension] is returned. + If only one string input is provided, then the output is a 1d array with shape [output_dimension]. If ``convert_to_tensor``, + a torch Tensor is returned instead.
If ``self.truncate_dim <= output_dimension`` then output_dimension is ``self.truncate_dim``. + + Example: + :: + + from sentence_transformers import SentenceTransformer + + # Load a pre-trained SentenceTransformer model + model = SentenceTransformer('all-mpnet-base-v2') + + # Encode some texts + sentences = [ + "The weather is lovely today.", + "It's so sunny outside!", + "He drove to the stadium.", + ] + embeddings = model.encode(sentences) + print(embeddings.shape) + # (3, 768) """ if self.device.type == "hpu" and not self.is_hpu_graph_enabled: import habana_frameworks.torch as ht @@ -551,7 +575,17 @@ def encode( @property def similarity_fn_name(self) -> Optional[str]: - """Return the name of the similarity function used by :meth:`SentenceTransformer.similarity` and :meth:`SentenceTransformer.similarity_pairwise`.""" + """Return the name of the similarity function used by :meth:`SentenceTransformer.similarity` and :meth:`SentenceTransformer.similarity_pairwise`. + + Returns: + Optional[str]: The name of the similarity function. Can be None if not set, in which case any uses of + :meth:`SentenceTransformer.similarity` and :meth:`SentenceTransformer.similarity_pairwise` default to "cosine". + + Example: + >>> model = SentenceTransformer("multi-qa-mpnet-base-dot-v1") + >>> model.similarity_fn_name + 'dot' + """ return self._similarity_fn_name @similarity_fn_name.setter @@ -577,7 +611,14 @@ def similarity(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray] scores between all embeddings from the first parameter and all embeddings from the second parameter. This differs from `similarity_pairwise` which computes the similarity between each pair of embeddings. - Example + Args: + embeddings1 (Union[Tensor, ndarray]): [num_embeddings_1, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + embeddings2 (Union[Tensor, ndarray]): [num_embeddings_2, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + + Returns: + Tensor: A [num_embeddings_1, num_embeddings_2]-shaped torch tensor with similarity scores. + + Example: :: >>> model = SentenceTransformer("all-mpnet-base-v2") @@ -601,10 +642,6 @@ def similarity(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray] [-0.7437, -0.0000, -1.3702, -1.3320], [-1.3935, -1.3702, -0.0000, -0.9973], [-1.3184, -1.3320, -0.9973, -0.0000]]) - - :param embeddings1: [num_embeddings_1, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. - :param embeddings2: [num_embeddings_2, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. - :return: A [num_embeddings_1, num_embeddings_2]-shaped torch tensor with similarity scores. """ if self.similarity_fn_name is None: self.similarity_fn_name = SimilarityFunction.COSINE @@ -622,7 +659,14 @@ def similarity_pairwise(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, Compute the similarity between two collections of embeddings. The output will be a vector with the similarity scores between each pair of embeddings. - Example + Args: + embeddings1 (Union[Tensor, ndarray]): [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + embeddings2 (Union[Tensor, ndarray]): [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. + + Returns: + Tensor: A [num_embeddings]-shaped torch tensor with pairwise similarity scores. 
+ + Example: :: >>> model = SentenceTransformer("all-mpnet-base-v2") @@ -640,27 +684,28 @@ def similarity_pairwise(self) -> Callable[[Union[Tensor, ndarray], Union[Tensor, >>> model.similarity_fn_name = "euclidean" >>> model.similarity_pairwise(embeddings[::2], embeddings[1::2]) tensor([-0.7437, -0.9973]) - - :param embeddings1: [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. - :param embeddings2: [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor. - :return: A [num_embeddings]-shaped torch tensor with pairwise similarity scores. """ if self.similarity_fn_name is None: self.similarity_fn_name = SimilarityFunction.COSINE return self._similarity_pairwise - def start_multi_process_pool(self, target_devices: List[str] = None): + def start_multi_process_pool(self, target_devices: List[str] = None) -> Dict[str, Any]: """ - Starts multi process to process the encoding with several, independent processes. + Starts a multi-process pool to process the encoding with several independent processes + via :meth:`SentenceTransformer.encode_multi_process `. + This method is recommended if you want to encode on multiple GPUs or CPUs. It is advised to start only one process per GPU. This method works together with encode_multi_process and stop_multi_process_pool. - :param target_devices: PyTorch target devices, e.g. ["cuda:0", "cuda:1", ...], ["npu:0", "npu:1", ...] or - ["cpu", "cpu", "cpu", "cpu"]. If target_devices is None and CUDA/NPU is available, then all available - CUDA/NPU devices will be used. If target_devices is None and CUDA/NPU is not available, then 4 CPU - devices will be used. - :return: Returns a dict with the target processes, an input queue and and output queue. + Args: + target_devices (List[str], optional): PyTorch target devices, e.g. ["cuda:0", "cuda:1", ...], + ["npu:0", "npu:1", ...], or ["cpu", "cpu", "cpu", "cpu"]. If target_devices is None and CUDA/NPU + is available, then all available CUDA/NPU devices will be used. If target_devices is None and + CUDA/NPU is not available, then 4 CPU devices will be used. + + Returns: + Dict[str, Any]: A dictionary with the target processes, an input queue, and an output queue. """ if target_devices is None: if torch.cuda.is_available(): @@ -694,7 +739,13 @@ def start_multi_process_pool(self, target_devices: List[str] = None): @staticmethod def stop_multi_process_pool(pool): """ - Stops all processes started with start_multi_process_pool + Stops all processes started with start_multi_process_pool. + + Args: + pool (Dict[str, object]): A dictionary containing the input queue, output queue, and process list. + + Returns: + None """ for p in pool["processes"]: p.terminate() @@ -716,31 +767,56 @@ def encode_multi_process( chunk_size: int = None, precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32", normalize_embeddings: bool = False, - ): + ) -> np.ndarray: """ - This method allows to run encode() on multiple GPUs. The sentences are chunked into smaller packages - and sent to individual processes, which encode these on the different GPUs. This method is only suitable - for encoding large sets of sentences - - :param sentences: List of sentences - :param pool: A pool of workers started with SentenceTransformer.start_multi_process_pool - :param prompt_name: The name of the prompt to use for encoding. Must be a key in the `prompts` dictionary, - which is either set in the constructor or loaded from the model configuration. 
For example if - `prompt_name` is ``"query"`` and the `prompts` is ``{"query": "query: {}", ...}``, then the sentence "What - is the capital of France?" will be encoded as "query: What is the capital of France?". If `prompt` is - also set, this argument is ignored. - :param prompt: The prompt to use for encoding. For example, if the prompt is ``"query: {}"``, then the - sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?". - If `prompt` is set, `prompt_name` is ignored. - :param batch_size: Encode sentences with batch size - :param chunk_size: Sentences are chunked and sent to the individual processes. If none, it determine a sensible size. - :param precision: The precision to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or - "ubinary". All non-float32 precisions are quantized embeddings. Quantized embeddings are smaller in - size and faster to compute, but may have a lower accuracy. They are useful for reducing the size - of the embeddings of a corpus for semantic search, among other tasks. Defaults to "float32". - :param normalize_embeddings: Whether to normalize returned vectors to have length 1. In that case, - the faster dot-product (util.dot_score) instead of cosine similarity can be used. - :return: 2d numpy array with shape [num_inputs, output_dimension] + Encodes a list of sentences using multiple processes and GPUs via + :meth:`SentenceTransformer.encode `. + The sentences are chunked into smaller packages and sent to individual processes, which encode them on different + GPUs or CPUs. This method is only suitable for encoding large sets of sentences. + + Args: + sentences (List[str]): List of sentences to encode. + pool (Dict[str, object]): A pool of workers started with SentenceTransformer.start_multi_process_pool. + prompt_name (Optional[str], optional): The name of the prompt to use for encoding. Must be a key in the `prompts` dictionary, + which is either set in the constructor or loaded from the model configuration. For example if + ``prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What + is the capital of France?" will be encoded as "query: What is the capital of France?" because the sentence + is appended to the prompt. If ``prompt`` is also set, this argument is ignored. Defaults to None. + prompt (Optional[str], optional): The prompt to use for encoding. For example, if the prompt is "query: ", then the + sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" + because the sentence is appended to the prompt. If ``prompt`` is set, ``prompt_name`` is ignored. Defaults to None. + batch_size (int): Encode sentences with batch size. (default: 32) + chunk_size (int): Sentences are chunked and sent to the individual processes. If None, it determines a + sensible size. Defaults to None. + precision (Literal["float32", "int8", "uint8", "binary", "ubinary"]): The precision to use for the + embeddings. Can be "float32", "int8", "uint8", "binary", or "ubinary". All non-float32 precisions + are quantized embeddings. Quantized embeddings are smaller in size and faster to compute, but may + have lower accuracy. They are useful for reducing the size of the embeddings of a corpus for + semantic search, among other tasks. Defaults to "float32". + normalize_embeddings (bool): Whether to normalize returned vectors to have length 1. In that case, + the faster dot-product (util.dot_score) instead of cosine similarity can be used. 
Defaults to False. + + Returns: + np.ndarray: A 2D numpy array with shape [num_inputs, output_dimension]. + + Example: + :: + + from sentence_transformers import SentenceTransformer + + def main(): + model = SentenceTransformer("all-mpnet-base-v2") + sentences = ["The weather is so nice!", "It's so sunny outside.", "He's driving to the movie theater.", "She's going to the cinema."] * 1000 + + pool = model.start_multi_process_pool() + embeddings = model.encode_multi_process(sentences, pool) + model.stop_multi_process_pool(pool) + + print(embeddings.shape) + # => (4000, 768) + + if __name__ == "__main__": + main() """ if chunk_size is None: @@ -800,25 +876,42 @@ def set_pooling_include_prompt(self, include_prompt: bool) -> None: """ Sets the `include_prompt` attribute in the pooling layer in the model, if there is one. - :param include_prompt: Whether to include the prompt in the pooling layer. + This is useful for INSTRUCTOR models, as the prompt should be excluded from the pooling strategy + for these models. + + Args: + include_prompt (bool): Whether to include the prompt in the pooling layer. + + Returns: + None """ for module in self: if isinstance(module, Pooling): module.include_prompt = include_prompt break - def get_max_seq_length(self): + def get_max_seq_length(self) -> Optional[int]: """ - Returns the maximal sequence length for input the model accepts. Longer inputs will be truncated + Returns the maximal sequence length that the model accepts. Longer inputs will be truncated. + + Returns: + Optional[int]: The maximal sequence length that the model accepts, or None if it is not defined. """ if hasattr(self._first_module(), "max_seq_length"): return self._first_module().max_seq_length return None - def tokenize(self, texts: Union[List[str], List[Dict], List[Tuple[str, str]]]): + def tokenize(self, texts: Union[List[str], List[Dict], List[Tuple[str, str]]]) -> Dict[str, Tensor]: """ - Tokenizes the texts + Tokenizes the texts. + + Args: + texts (Union[List[str], List[Dict], List[Tuple[str, str]]]): A list of texts to be tokenized. + + Returns: + Dict[str, Tensor]: A dictionary of tensors with the tokenized texts. Common keys are "input_ids", + "attention_mask", and "token_type_ids". """ return self._first_module().tokenize(texts) @@ -827,7 +920,10 @@ def get_sentence_features(self, *features): def get_sentence_embedding_dimension(self) -> Optional[int]: """ - :return: The number of dimensions in the output of `encode`. If it's not known, it's `None`. + Returns the number of dimensions in the output of :meth:`SentenceTransformer.encode `. + + Returns: + Optional[int]: The number of dimensions in the output of `encode`. If it's not known, it's `None`. """ output_dim = None for mod in reversed(self._modules.values()): @@ -844,23 +940,25 @@ def get_sentence_embedding_dimension(self) -> Optional[int]: @contextmanager def truncate_sentence_embeddings(self, truncate_dim: Optional[int]): """ - In this context, `model.encode` outputs sentence embeddings truncated at dimension `truncate_dim`. + In this context, :meth:`SentenceTransformer.encode ` outputs + sentence embeddings truncated at dimension ``truncate_dim``. This may be useful when you are using the same model for different applications where different dimensions are needed. - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` does no truncation. + Args: + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. ``None`` does no truncation. 
- Example:: - - from sentence_transformers import SentenceTransformer + Example: + :: - model = SentenceTransformer("model-name") + from sentence_transformers import SentenceTransformer - with model.truncate_sentence_embeddings(truncate_dim=16): - embeddings_truncated = model.encode(["hello there", "hiya"]) - assert embeddings_truncated.shape[-1] == 16 + model = SentenceTransformer("all-mpnet-base-v2") + with model.truncate_sentence_embeddings(truncate_dim=16): + embeddings_truncated = model.encode(["hello there", "hiya"]) + assert embeddings_truncated.shape[-1] == 16 """ original_output_dim = self.truncate_dim try: @@ -887,13 +985,15 @@ def save( ): """ Saves a model and its configuration files to a directory, so that it can be loaded - with `SentenceTransformer(path)` again. - - :param path: Path on disc - :param model_name: Optional model name - :param create_model_card: If True, create a README.md with basic information about this model - :param train_datasets: Optional list with the names of the datasets used to to train the model - :param safe_serialization: If true, save the model using safetensors. If false, save the model the traditional PyTorch way + with ``SentenceTransformer(path)`` again. + + Args: + path (str): Path on disc where the model will be saved. + model_name (str, optional): Optional model name. + create_model_card (bool, optional): If True, create a README.md with basic information about this model. + train_datasets (List[str], optional): Optional list with the names of the datasets used to train the model. + safe_serialization (bool, optional): If True, save the model using safetensors. If False, save the model + the traditional (but unsafe) PyTorch way. """ if path is None: return @@ -953,13 +1053,15 @@ def save_pretrained( ): """ Saves a model and its configuration files to a directory, so that it can be loaded - with `SentenceTransformer(path)` again. Alias of `SentenceTransformer.save`. - - :param path: Path on disc - :param model_name: Optional model name - :param create_model_card: If True, create a README.md with basic information about this model - :param train_datasets: Optional list with the names of the datasets used to to train the model - :param safe_serialization: If true, save the model using safetensors. If false, save the model the traditional PyTorch way + with ``SentenceTransformer(path)`` again. + + Args: + path (str): Path on disc where the model will be saved. + model_name (str, optional): Optional model name. + create_model_card (bool, optional): If True, create a README.md with basic information about this model. + train_datasets (List[str], optional): Optional list with the names of the datasets used to train the model. + safe_serialization (bool, optional): If True, save the model using safetensors. If False, save the model + the traditional (but unsafe) PyTorch way. """ self.save( path, @@ -973,8 +1075,16 @@ def _create_model_card( self, path: str, model_name: Optional[str] = None, train_datasets: Optional[List[str]] = "deprecated" ): """ - Create an automatic model and stores it in path. If no training was done, and the loaded model was - a Sentence Transformer model already, then its model card is reused. + Creates an automatic model card and stores it in the specified path. If no training was done and the loaded model + was a Sentence Transformer model already, then its model card is reused. + + Args: + path (str): The path where the model card will be stored. + model_name (Optional[str], optional): The name of the model. Defaults to None.
+ train_datasets (Optional[List[str]], optional): Deprecated argument. Defaults to "deprecated". + + Returns: + None """ if model_name: model_path = Path(model_name) @@ -1018,18 +1128,19 @@ def save_to_hub( Uploads all elements of this Sentence Transformer to a new HuggingFace Hub repository. - :param repo_id: Repository name for your model in the Hub, including the user or organization. - :param token: An authentication token (See https://huggingface.co/settings/token) - :param private: Set to true, for hosting a private model - :param safe_serialization: If true, save the model using safetensors. If false, save the model the traditional PyTorch way - :param commit_message: Message to commit while pushing. - :param local_model_path: Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded - :param exist_ok: If true, saving to an existing repository is OK. If false, saving only to a new repository is possible - :param replace_model_card: If true, replace an existing model card in the hub with the automatically created model card - :param train_datasets: Datasets used to train the model. If set, the datasets will be added to the model card in the Hub. - :param organization: Deprecated. Organization in which you want to push your model or tokenizer (you must be a member of this organization). - - :return: The url of the commit of your model in the repository on the Hugging Face Hub. + Args: + repo_id (str): Repository name for your model in the Hub, including the user or organization. + token (str, optional): An authentication token (See https://huggingface.co/settings/token). + private (bool, optional): Set to true to host a private model. + safe_serialization (bool, optional): If true, save the model using safetensors. If false, save the model the traditional PyTorch way. + commit_message (str, optional): Message to commit while pushing. + local_model_path (str, optional): Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded. + exist_ok (bool, optional): If true, saving to an existing repository is OK. If false, saving only to a new repository is possible. + replace_model_card (bool, optional): If true, replace an existing model card in the hub with the automatically created model card. + train_datasets (List[str], optional): Datasets used to train the model. If set, the datasets will be added to the model card in the Hub. + + Returns: + str: The url of the commit of your model in the repository on the Hugging Face Hub. """ logger.warning( "The `save_to_hub` method is deprecated and will be removed in a future version of SentenceTransformers." @@ -1078,17 +1189,19 @@ def push_to_hub( """ Uploads all elements of this Sentence Transformer to a new HuggingFace Hub repository. - :param repo_id: Repository name for your model in the Hub, including the user or organization. - :param token: An authentication token (See https://huggingface.co/settings/token) - :param private: Set to true, for hosting a private model - :param safe_serialization: If true, save the model using safetensors. If false, save the model the traditional PyTorch way - :param commit_message: Message to commit while pushing. - :param local_model_path: Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded - :param exist_ok: If true, saving to an existing repository is OK.
If false, saving only to a new repository is possible - :param replace_model_card: If true, replace an existing model card in the hub with the automatically created model card - :param train_datasets: Datasets used to train the model. If set, the datasets will be added to the model card in the Hub. - - :return: The url of the commit of your model in the repository on the Hugging Face Hub. + Args: + repo_id (str): Repository name for your model in the Hub, including the user or organization. + token (str, optional): An authentication token (See https://huggingface.co/settings/token). + private (bool, optional): Set to true to host a private model. + safe_serialization (bool, optional): If true, save the model using safetensors. If false, save the model the traditional PyTorch way. + commit_message (str, optional): Message to commit while pushing. + local_model_path (str, optional): Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded. + exist_ok (bool, optional): If true, saving to an existing repository is OK. If false, saving only to a new repository is possible. + replace_model_card (bool, optional): If true, replace an existing model card in the hub with the automatically created model card. + train_datasets (List[str], optional): Datasets used to train the model. If set, the datasets will be added to the model card in the Hub. + + Returns: + str: The url of the commit of your model in the repository on the Hugging Face Hub. """ api = HfApi(token=token) repo_url = api.create_repo( @@ -1140,12 +1253,14 @@ def _text_length(self, text: Union[List[int], List[List[int]]]): def evaluate(self, evaluator: SentenceEvaluator, output_path: str = None): """ - Evaluate the model + Evaluate the model using the given evaluator. + + Args: + evaluator (SentenceEvaluator): The evaluator used to evaluate the model. + output_path (str, optional): The path where the evaluator can write the results. Defaults to None. - :param evaluator: - the evaluator - :param output_path: - the evaluator can write the results to this path + Returns: + The evaluation results. """ if output_path is not None: os.makedirs(output_path, exist_ok=True) @@ -1162,14 +1277,26 @@ def _load_auto_model( model_kwargs: Optional[Dict[str, Any]] = None, tokenizer_kwargs: Optional[Dict[str, Any]] = None, config_kwargs: Optional[Dict[str, Any]] = None, - ): + ) -> List[nn.Module]: """ Creates a simple Transformer + Mean Pooling model and returns the modules + + Args: + model_name_or_path (str): The name or path of the pre-trained model. + token (Optional[Union[bool, str]]): The token to use for the model. + cache_folder (Optional[str]): The folder to cache the model. + revision (Optional[str], optional): The revision of the model. Defaults to None. + trust_remote_code (bool, optional): Whether to trust remote code. Defaults to False. + local_files_only (bool, optional): Whether to use only local files. Defaults to False. + model_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the model. Defaults to None. + tokenizer_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the tokenizer. Defaults to None. + config_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the config. Defaults to None. + + Returns: + List[nn.Module]: A list containing the transformer model and the pooling model. """ logger.warning( - "No sentence-transformers model found with name {}.
Creating a new one with mean pooling.".format( - model_name_or_path - ) + f"No sentence-transformers model found with name {model_name_or_path}. Creating a new one with mean pooling." ) shared_kwargs = { @@ -1204,9 +1331,23 @@ def _load_sbert_model( model_kwargs: Optional[Dict[str, Any]] = None, tokenizer_kwargs: Optional[Dict[str, Any]] = None, config_kwargs: Optional[Dict[str, Any]] = None, - ): + ) -> Dict[str, nn.Module]: """ - Loads a full sentence-transformers model + Loads a full SentenceTransformer model using the modules.json file. + + Args: + model_name_or_path (str): The name or path of the pre-trained model. + token (Optional[Union[bool, str]]): The token to use for the model. + cache_folder (Optional[str]): The folder to cache the model. + revision (Optional[str], optional): The revision of the model. Defaults to None. + trust_remote_code (bool, optional): Whether to trust remote code. Defaults to False. + local_files_only (bool, optional): Whether to use only local files. Defaults to False. + model_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the model. Defaults to None. + tokenizer_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the tokenizer. Defaults to None. + config_kwargs (Optional[Dict[str, Any]], optional): Additional keyword arguments for the config. Defaults to None. + + Returns: + OrderedDict[str, nn.Module]: An ordered dictionary containing the modules of the model. """ # Check if the config_sentence_transformers.json file exists (exists since v2 of the framework) config_sentence_transformers_json_path = load_file_path( @@ -1398,9 +1539,21 @@ def tokenizer(self, value): self._first_module().tokenizer = value @property - def max_seq_length(self): + def max_seq_length(self) -> int: """ - Property to get the maximal input sequence length for the model. Longer inputs will be truncated. + Returns the maximal input sequence length for the model. Longer inputs will be truncated. + + Returns: + int: The maximal input sequence length. + + Example: + :: + + from sentence_transformers import SentenceTransformer + + model = SentenceTransformer("all-mpnet-base-v2") + print(model.max_seq_length) + # => 384 """ return self._first_module().max_seq_length @@ -1414,7 +1567,7 @@ def max_seq_length(self, value): @property def _target_device(self) -> torch.device: logger.warning( - "`SentenceTransformer._target_device` has been removed, please use `SentenceTransformer.device` instead.", + "`SentenceTransformer._target_device` has been deprecated, please use `SentenceTransformer.device` instead.", ) return self.device diff --git a/sentence_transformers/cross_encoder/CrossEncoder.py b/sentence_transformers/cross_encoder/CrossEncoder.py index 7e99fc286..cc205c317 100644 --- a/sentence_transformers/cross_encoder/CrossEncoder.py +++ b/sentence_transformers/cross_encoder/CrossEncoder.py @@ -4,7 +4,7 @@ import numpy as np import logging import os -from typing import Dict, Type, Callable, List, Optional +from typing import Dict, Type, Callable, List, Optional, Union import torch from torch import nn from torch.optim import Optimizer @@ -29,26 +29,28 @@ class CrossEncoder(PushToHubMixin): It does not yield a sentence embedding and does not work for individual sentences. - :param model_name: A model name from Hugging Face Hub that can be loaded with AutoModel, or a path to a local - model. We provide several pre-trained CrossEncoder models that can be used for common tasks. - :param num_labels: Number of labels of the classifier. 
If 1, the CrossEncoder is a regression model that - outputs a continuous score 0...1. If > 1, it output several scores that can be soft-maxed to get - probability scores for the different classes. - :param max_length: Max length for input sequences. Longer sequences will be truncated. If None, max - length of the model will be used - :param device: Device that should be used for the model. If None, it will use CUDA if available. - :param tokenizer_args: Arguments passed to AutoTokenizer - :param automodel_args: Arguments passed to AutoModelForSequenceClassification - :param trust_remote_code: Whether or not to allow for custom models defined on the Hub in their own modeling files. - This option should only be set to True for repositories you trust and in which you have read the code, as it - will execute code present on the Hub on your local machine. - :param revision: The specific model version to use. It can be a branch name, a tag name, or a commit id, - for a stored model on Hugging Face. - :param local_files_only: If `True`, avoid downloading the model. - :param default_activation_function: Callable (like nn.Sigmoid) about the default activation function that - should be used on-top of model.predict(). If None. nn.Sigmoid() will be used if num_labels=1, - else nn.Identity() - :param classifier_dropout: The dropout ratio for the classification head. + Args: + model_name (str): A model name from Hugging Face Hub that can be loaded with AutoModel, or a path to a local + model. We provide several pre-trained CrossEncoder models that can be used for common tasks. + num_labels (int, optional): Number of labels of the classifier. If 1, the CrossEncoder is a regression model that + outputs a continuous score 0...1. If > 1, it output several scores that can be soft-maxed to get + probability scores for the different classes. Defaults to None. + max_length (int, optional): Max length for input sequences. Longer sequences will be truncated. If None, max + length of the model will be used. Defaults to None. + device (str, optional): Device that should be used for the model. If None, it will use CUDA if available. + Defaults to None. + tokenizer_args (Dict, optional): Arguments passed to AutoTokenizer. Defaults to None. + automodel_args (Dict, optional): Arguments passed to AutoModelForSequenceClassification. Defaults to None. + trust_remote_code (bool, optional): Whether or not to allow for custom models defined on the Hub in their own modeling files. + This option should only be set to True for repositories you trust and in which you have read the code, as it + will execute code present on the Hub on your local machine. Defaults to False. + revision (Optional[str], optional): The specific model version to use. It can be a branch name, a tag name, or a commit id, + for a stored model on Hugging Face. Defaults to None. + local_files_only (bool, optional): If `True`, avoid downloading the model. Defaults to False. + default_activation_function (Callable, optional): Callable (like nn.Sigmoid) about the default activation function that + should be used on-top of model.predict(). If None. nn.Sigmoid() will be used if num_labels=1, + else nn.Identity(). Defaults to None. + classifier_dropout (float, optional): The dropout ratio for the classification head. Defaults to None. 
""" def __init__( @@ -57,14 +59,18 @@ def __init__( num_labels: int = None, max_length: int = None, device: str = None, - tokenizer_args: Dict = {}, - automodel_args: Dict = {}, + tokenizer_args: Dict = None, + automodel_args: Dict = None, trust_remote_code: bool = False, revision: Optional[str] = None, local_files_only: bool = False, default_activation_function=None, classifier_dropout: float = None, ): + if tokenizer_args is None: + tokenizer_args = {} + if automodel_args is None: + automodel_args = {} self.config = AutoConfig.from_pretrained( model_name, trust_remote_code=trust_remote_code, revision=revision, local_files_only=local_files_only ) @@ -187,25 +193,26 @@ def fit( We sample only as many batches from each objective as there are in the smallest one to make sure of equal training with each dataset. - :param train_dataloader: DataLoader with training InputExamples - :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. - :param epochs: Number of epochs for training - :param loss_fct: Which loss function to use for training. If None, will use nn.BCEWithLogitsLoss() if self.config.num_labels == 1 else nn.CrossEntropyLoss() - :param activation_fct: Activation function applied on top of logits output of model. - :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. - :param optimizer_class: Optimizer - :param optimizer_params: Optimizer parameters - :param weight_decay: Weight decay for model parameters - :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps - :param output_path: Storage path for the model and evaluation files - :param save_best_model: If true, the best model (according to evaluator) is stored at output_path - :param max_grad_norm: Used for gradient normalization. - :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0 - :param callback: Callback function that is invoked after each evaluation. + Args: + train_dataloader (DataLoader): DataLoader with training InputExamples + evaluator (SentenceEvaluator, optional): An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. Defaults to None. + epochs (int, optional): Number of epochs for training. Defaults to 1. + loss_fct: Which loss function to use for training. If None, will use nn.BCEWithLogitsLoss() if self.config.num_labels == 1 else nn.CrossEntropyLoss(). Defaults to None. + activation_fct: Activation function applied on top of logits output of model. + scheduler (str, optional): Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts. Defaults to "WarmupLinear". + warmup_steps (int, optional): Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. Defaults to 10000. 
+ optimizer_class (Type[Optimizer], optional): Optimizer. Defaults to torch.optim.AdamW. + optimizer_params (Dict[str, object], optional): Optimizer parameters. Defaults to {"lr": 2e-5}. + weight_decay (float, optional): Weight decay for model parameters. Defaults to 0.01. + evaluation_steps (int, optional): If > 0, evaluate the model using evaluator after each number of training steps. Defaults to 0. + output_path (str, optional): Storage path for the model and evaluation files. Defaults to None. + save_best_model (bool, optional): If true, the best model (according to evaluator) is stored at output_path. Defaults to True. + max_grad_norm (float, optional): Used for gradient normalization. Defaults to 1. + use_amp (bool, optional): Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0. Defaults to False. + callback (Callable[[float, int, int], None], optional): Callback function that is invoked after each evaluation. It must accept the following three parameters in this order: - `score`, `epoch`, `steps` - :param show_progress_bar: If True, output a tqdm progress bar + `score`, `epoch`, `steps`. Defaults to None. + show_progress_bar (bool, optional): If True, output a tqdm progress bar. Defaults to True. """ train_dataloader.collate_fn = self.smart_batching_collate @@ -307,19 +314,38 @@ def predict( apply_softmax=False, convert_to_numpy: bool = True, convert_to_tensor: bool = False, - ): + ) -> Union[List[float], np.ndarray, torch.Tensor]: """ - Performs predicts with the CrossEncoder on the given sentence pairs. - - :param sentences: A list of sentence pairs [[Sent1, Sent2], [Sent3, Sent4]] - :param batch_size: Batch size for encoding - :param show_progress_bar: Output progress bar - :param num_workers: Number of workers for tokenization - :param activation_fct: Activation function applied on the logits output of the CrossEncoder. If None, nn.Sigmoid() will be used if num_labels=1, else nn.Identity - :param convert_to_numpy: Convert the output to a numpy matrix. - :param apply_softmax: If there are more than 2 dimensions and apply_softmax=True, applies softmax on the logits output - :param convert_to_tensor: Convert the output to a tensor. - :return: Predictions for the passed sentence pairs + Performs predictions with the CrossEncoder on the given sentence pairs. + + Args: + sentences (List[List[str]]): A list of sentence pairs [[Sent1, Sent2], [Sent3, Sent4]] + batch_size (int, optional): Batch size for encoding. Defaults to 32. + show_progress_bar (bool, optional): Output progress bar. Defaults to None. + num_workers (int, optional): Number of workers for tokenization. Defaults to 0. + activation_fct (callable, optional): Activation function applied on the logits output of the CrossEncoder. + If None, nn.Sigmoid() will be used if num_labels=1, else nn.Identity. Defaults to None. + convert_to_numpy (bool, optional): Convert the output to a numpy matrix. Defaults to True. + apply_softmax (bool, optional): If there are more than 2 dimensions and apply_softmax=True, + applies softmax on the logits output. Defaults to False. + convert_to_tensor (bool, optional): Convert the output to a tensor. Defaults to False. + + Returns: + Union[List[float], np.ndarray, torch.Tensor]: Predictions for the passed sentence pairs. + The return type depends on the `convert_to_numpy` and `convert_to_tensor` parameters. + If `convert_to_tensor` is True, the output will be a torch.Tensor. + If `convert_to_numpy` is True, the output will be a numpy.ndarray. 
+ Otherwise, the output will be a list of float values. + + Examples: + :: + + from sentence_transformers import CrossEncoder + + model = CrossEncoder("cross-encoder/stsb-roberta-base") + sentences = [["I love cats", "Cats are amazing"], ["I prefer dogs", "Dogs are loyal"]] + model.predict(sentences) + # => array([0.6912767, 0.4303499], dtype=float32) """ input_was_string = False if isinstance(sentences[0], str): # Cast an individual sentence to a list with length 1 @@ -388,6 +414,22 @@ def rank( """ Performs ranking with the CrossEncoder on the given query and documents. Returns a sorted list with the document indices and scores. + Args: + query (str): A single query. + documents (List[str]): A list of documents. + top_k (Optional[int], optional): Return the top-k documents. If None, all documents are returned. Defaults to None. + return_documents (bool, optional): If True, also returns the documents. If False, only returns the indices and scores. Defaults to False. + batch_size (int, optional): Batch size for encoding. Defaults to 32. + show_progress_bar (bool, optional): Output progress bar. Defaults to None. + num_workers (int, optional): Number of workers for tokenization. Defaults to 0. + activation_fct (Callable, optional): Activation function applied on the logits output of the CrossEncoder. If None, nn.Sigmoid() will be used if num_labels=1, else nn.Identity. Defaults to None. + convert_to_numpy (bool, optional): Convert the output to a numpy matrix. Defaults to True. + apply_softmax (bool, optional): If there are more than 2 dimensions and apply_softmax=True, applies softmax on the logits output. Defaults to False. + convert_to_tensor (bool, optional): Convert the output to a tensor. Defaults to False. + + Returns: + List[Dict]: A sorted list with the document indices and scores, and optionally also documents. + Example: :: @@ -423,19 +465,6 @@ def rank( {'corpus_id': 4, 'score': -5.082967, 'text': "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era."}] - - :param query: A single query - :param documents: A list of documents - :param top_k: Return the top-k documents. If None, all documents are returned. - :param return_documents: If True, also returns the documents. If False, only returns the indices and scores. - :param batch_size: Batch size for encoding - :param show_progress_bar: Output progress bar - :param num_workers: Number of workers for tokenization - :param activation_fct: Activation function applied on the logits output of the CrossEncoder. If None, nn.Sigmoid() will be used if num_labels=1, else nn.Identity - :param convert_to_numpy: Convert the output to a numpy matrix. - :param apply_softmax: If there are more than 2 dimensions and apply_softmax=True, applies softmax on the logits output - :param convert_to_tensor: Convert the output to a tensor. - :return: A sorted list with the document indices and scores, and optionally also documents.
""" query_doc_pairs = [[query, doc] for doc in documents] scores = self.predict( diff --git a/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py b/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py index 21ad057c2..f9bf25b25 100644 --- a/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py +++ b/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py @@ -4,8 +4,9 @@ from typing import List import numpy as np + +from sentence_transformers.readers.InputExample import InputExample from .. import CrossEncoder -from ... import InputExample from sklearn.metrics import f1_score logger = logging.getLogger(__name__) @@ -19,18 +20,13 @@ class CEF1Evaluator: binary tasks the returned metric is binary F1 score. For the multiclass tasks the returned metric is macro F1 score. - :param sentence_pairs: A list of sentence pairs, where each pair is a list of two strings. - :type sentence_pairs: list[list[str]] - :param labels: A list of integer labels corresponding to each sentence pair. - :type labels: list[int] - :param batch_size: Batch size for prediction. Defaults to 32. - :type batch_size: int - :param show_progress_bar: Show tqdm progress bar. - :type show_progress_bar: bool - :param name: An optional name for the CSV file with stored results. Defaults to an empty string. - :type name: str, optional - :param write_csv: Flag to determine if the data should be saved to a CSV file. Defaults to True. - :type write_csv: bool, optional + Args: + sentence_pairs (List[List[str]]): A list of sentence pairs, where each pair is a list of two strings. + labels (List[int]): A list of integer labels corresponding to each sentence pair. + batch_size (int, optional): Batch size for prediction. Defaults to 32. + show_progress_bar (bool, optional): Show tqdm progress bar. + name (str, optional): An optional name for the CSV file with stored results. Defaults to an empty string. + write_csv (bool, optional): Flag to determine if the data should be saved to a CSV file. Defaults to True. """ def __init__( @@ -42,7 +38,7 @@ def __init__( show_progress_bar: bool = False, name: str = "", write_csv: bool = True, - ): + ) -> None: self.sentence_pairs = sentence_pairs self.labels = labels self.batch_size = batch_size @@ -72,6 +68,16 @@ def __init__( @classmethod def from_input_examples(cls, examples: List[InputExample], **kwargs): + """ + Create an instance of CEF1Evaluator from a list of InputExample objects. + + Args: + examples (List[InputExample]): A list of InputExample objects. + **kwargs: Additional keyword arguments to pass to the CEF1Evaluator constructor. + + Returns: + CEF1Evaluator: An instance of CEF1Evaluator. + """ sentence_pairs = [] labels = [] @@ -81,13 +87,19 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): return cls(sentence_pairs, labels, **kwargs) - def __call__( - self, - model: CrossEncoder, - output_path: str = None, - epoch: int = -1, - steps: int = -1, - ) -> float: + def __call__(self, model: CrossEncoder, output_path: str = None, epoch: int = -1, steps: int = -1) -> float: + """ + Evaluate the model using the CEF1Evaluator. + + Args: + model (CrossEncoder): The cross-encoder model to evaluate. + output_path (str, optional): The path to save the evaluation results. Defaults to None. + epoch (int, optional): The epoch number. Defaults to -1. + steps (int, optional): The number of steps. Defaults to -1. + + Returns: + float: The F1 score. 
+ """ if epoch != -1: if steps == -1: out_txt = f"after epoch {epoch}:" diff --git a/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py index 5d4813877..fa6160ec4 100644 --- a/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py +++ b/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py @@ -15,8 +15,10 @@ class CERerankingEvaluator: Given a query and a list of documents, it computes the score [query, doc_i] for all possible documents and sorts them in decreasing order. Then, MRR@10 and NDCG@10 are computed to measure the quality of the ranking. - :param samples: Must be a list and each element is of the form: {'query': '', 'positive': [], 'negative': []}. Query is the search query, - positive is a list of positive (relevant) documents, negative is a list of negative (irrelevant) documents. + Args: + samples (List[Dict, str, Union[str, List[str]]): Must be a list and each element is of the form: + {'query': '', 'positive': [], 'negative': []}. Query is the search query, positive is a list + of positive (relevant) documents, negative is a list of negative (irrelevant) documents. """ def __init__( diff --git a/sentence_transformers/data_collator.py b/sentence_transformers/data_collator.py index bd4d5ff27..afdd86bfb 100644 --- a/sentence_transformers/data_collator.py +++ b/sentence_transformers/data_collator.py @@ -9,7 +9,8 @@ class SentenceTransformerDataCollator: """Collator for a SentenceTransformers model. This encodes the text columns to {column}_input_ids and {column}_attention_mask columns. This works with the two text dataset that is used as the example in the training overview: - https://www.sbert.net/docs/training/overview.html""" + https://www.sbert.net/docs/training/overview.html + """ tokenize_fn: Callable valid_label_columns: List[str] = field(default_factory=lambda: ["label", "score"]) diff --git a/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py b/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py index 33a02c2e3..973d55cac 100644 --- a/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py +++ b/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py @@ -11,8 +11,10 @@ class DenoisingAutoEncoderDataset(Dataset): It is used in combination with the DenoisingAutoEncoderLoss: Here, a decoder tries to re-construct the sentence without noise. - :param sentences: A list of sentences - :param noise_fn: A noise function: Given a string, it returns a string with noise, e.g. deleted words + Args: + sentences: A list of sentences + noise_fn: A noise function: Given a string, it returns a string + with noise, e.g. 
deleted words """ def __init__(self, sentences: List[str], noise_fn=lambda s: DenoisingAutoEncoderDataset.delete(s)): diff --git a/sentence_transformers/datasets/ParallelSentencesDataset.py b/sentence_transformers/datasets/ParallelSentencesDataset.py index a83e4abee..1ec72e90b 100644 --- a/sentence_transformers/datasets/ParallelSentencesDataset.py +++ b/sentence_transformers/datasets/ParallelSentencesDataset.py @@ -37,8 +37,11 @@ def __init__( """ Parallel sentences dataset reader to train student model given a teacher model - :param student_model: Student sentence embedding model that should be trained - :param teacher_model: Teacher model, that provides the sentence embeddings for the first column in the dataset file + Args: + student_model (SentenceTransformer): The student sentence embedding model that should be trained. + teacher_model (SentenceTransformer): The teacher model that provides the sentence embeddings for the first column in the dataset file. + batch_size (int, optional): The batch size for training. Defaults to 8. + use_embedding_cache (bool, optional): Whether to use an embedding cache. Defaults to True. """ self.student_model = student_model self.teacher_model = teacher_model @@ -53,16 +56,20 @@ def __init__( self.embedding_cache = {} self.num_sentences = 0 - def load_data(self, filepath: str, weight: int = 100, max_sentences: int = None, max_sentence_length: int = 128): + def load_data( + self, filepath: str, weight: int = 100, max_sentences: int = None, max_sentence_length: int = 128 + ) -> None: """ Reads in a tab-separated .txt/.csv/.tsv or .gz file. The different columns contain the different translations of the sentence in the first column - :param filepath: Filepath to the file - :param weight: If more than one dataset is loaded with load_data: With which frequency should data be sampled from this dataset? - :param max_sentences: Max number of lines to be read from filepath - :param max_sentence_length: Skip the example if one of the sentences is has more characters than max_sentence_length - :param batch_size: Size for encoding parallel sentences - :return: + Args: + filepath (str): Filepath to the file. + weight (int, optional): If more than one dataset is loaded with load_data, specifies the frequency at which data should be sampled from this dataset. Defaults to 100. + max_sentences (int, optional): Maximum number of lines to be read from the filepath. Defaults to None. + max_sentence_length (int, optional): Skip the example if one of the sentences has more characters than max_sentence_length. Defaults to 128. + + Returns: + None """ logger.info("Load " + filepath) diff --git a/sentence_transformers/datasets/SentenceLabelDataset.py b/sentence_transformers/datasets/SentenceLabelDataset.py index eb69cca8c..ca90665eb 100644 --- a/sentence_transformers/datasets/SentenceLabelDataset.py +++ b/sentence_transformers/datasets/SentenceLabelDataset.py @@ -1,5 +1,3 @@ -""" """ - from torch.utils.data import IterableDataset import numpy as np from typing import List @@ -27,14 +25,13 @@ def __init__(self, examples: List[InputExample], samples_per_label: int = 2, wit """ Creates a LabelSampler for a SentenceLabelDataset. - :param examples: - a list with InputExamples - :param samples_per_label: - the number of consecutive, random and unique samples drawn per label. Batch size should be a multiple of samples_per_label - :param with_replacement: - if this is True, then each sample is drawn at most once (depending on the total number of samples per label).
- if this is False, then one sample can be drawn in multiple draws, but still not multiple times in the same - drawing. + Args: + examples (List[InputExample]): A list of InputExamples. + samples_per_label (int, optional): The number of consecutive, random, and unique samples drawn per label. + The batch size should be a multiple of samples_per_label. Defaults to 2. + with_replacement (bool, optional): If True, each sample is drawn at most once (depending on the total number + of samples per label). If False, one sample can be drawn in multiple draws, but not multiple times in + the same drawing. Defaults to False. """ super().__init__() diff --git a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py index e15a61d9d..b3838f116 100644 --- a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py +++ b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py @@ -27,17 +27,17 @@ class BinaryClassificationEvaluator(SentenceEvaluator): The labels need to be 0 for dissimilar pairs and 1 for similar pairs. - :param sentences1: The first column of sentences - :param sentences2: The second column of sentences - :param labels: labels[i] is the label for the pair (sentences1[i], sentences2[i]). Must be 0 or 1 - :param name: Name for the output - :param batch_size: Batch size used to compute embeddings - :param show_progress_bar: If true, prints a progress bar - :param write_csv: Write results to a CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. - - Example + Args: + sentences1 (List[str]): The first column of sentences. + sentences2 (List[str]): The second column of sentences. + labels (List[int]): labels[i] is the label for the pair (sentences1[i], sentences2[i]). Must be 0 or 1. + name (str, optional): Name for the output. Defaults to "". + batch_size (int, optional): Batch size used to compute embeddings. Defaults to 32. + show_progress_bar (bool, optional): If true, prints a progress bar. Defaults to False. + write_csv (bool, optional): Write results to a CSV file. Defaults to True. + truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. Defaults to None. + + Example: :: from sentence_transformers import SentenceTransformer @@ -152,6 +152,18 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs): def __call__( self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 ) -> Dict[str, float]: + """ + Compute the evaluation metrics for the given model. + + Args: + model (SentenceTransformer): The model to evaluate. + output_path (str, optional): Path to save the evaluation results CSV file. Defaults to None. + epoch (int, optional): The epoch number. Defaults to -1. + steps (int, optional): The number of steps. Defaults to -1. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. 
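+
+        Example:
+            A minimal sketch of calling the evaluator directly (the model name and data are
+            illustrative)::
+
+                from sentence_transformers import SentenceTransformer
+                from sentence_transformers.evaluation import BinaryClassificationEvaluator
+
+                model = SentenceTransformer("all-MiniLM-L6-v2")
+                evaluator = BinaryClassificationEvaluator(
+                    sentences1=["A man is eating food.", "A plane is taking off."],
+                    sentences2=["A man is eating a piece of bread.", "A bird is flying."],
+                    labels=[1, 0],
+                    name="dev",
+                )
+                metrics = evaluator(model)
+                # metrics maps metric names (accuracy, F1, precision, recall, average precision,
+                # computed per similarity function) to their float values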
+ """ if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" diff --git a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py index 5bbbaa4fe..4c89d6178 100644 --- a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py +++ b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py @@ -23,7 +23,7 @@ class EmbeddingSimilarityEvaluator(SentenceEvaluator): The metrics are the cosine similarity as well as euclidean and Manhattan distance The returned score is the Spearman correlation with a specified metric. - Example + Example: :: from datasets import load_dataset @@ -33,14 +33,14 @@ class EmbeddingSimilarityEvaluator(SentenceEvaluator): # Load a model model = SentenceTransformer('all-mpnet-base-v2') - # Load the STSB dataset (https://huggingface.co/datasets/nyu-mll/glue/viewer/stsb) - eval_dataset = load_dataset("nyu-mll/glue", "stsb", split="validation") + # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb) + eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") # Initialize the evaluator dev_evaluator = EmbeddingSimilarityEvaluator( sentences1=eval_dataset["sentence1"], sentences2=eval_dataset["sentence2"], - scores=[score / 5 for score in eval_dataset["label"]], + scores=eval_dataset["score"], main_similarity=SimilarityFunction.COSINE, name="sts-dev", ) @@ -52,7 +52,7 @@ class EmbeddingSimilarityEvaluator(SentenceEvaluator): Euclidean-Distance: Pearson: 0.7824 Spearman: 0.7827 Dot-Product-Similarity: Pearson: 0.7192 Spearman: 0.7126 ''' - # => 0.8004 + # => {'sts-dev_pearson_cosine': 0.880607226102985, 'sts-dev_spearman_cosine': 0.881019449484294, ...} """ def __init__( @@ -69,18 +69,22 @@ def __init__( truncate_dim: Optional[int] = None, ): """ - Constructs an evaluator based for the dataset - - The labels need to indicate the similarity between the sentences. - - :param sentences1: List with the first sentence in a pair - :param sentences2: List with the second sentence in a pair - :param scores: Similarity score between sentences1[i] and sentences2[i] - :param write_csv: Write results to a CSV file - :param precision: The precision to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or - "ubinary". Defaults to None. - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current - truncation dimension. Defaults to None. + Constructs an evaluator based for the dataset. + + Args: + sentences1 (List[str]): List with the first sentence in a pair. + sentences2 (List[str]): List with the second sentence in a pair. + scores (List[float]): Similarity score between sentences1[i] and sentences2[i]. + batch_size (int, optional): The batch size for processing the sentences. Defaults to 16. + main_similarity (Optional[Union[str, SimilarityFunction]], optional): The main similarity function to use. + Can be a string (e.g. "cosine", "dot") or a SimilarityFunction object. Defaults to None. + name (str, optional): The name of the evaluator. Defaults to "". + show_progress_bar (bool, optional): Whether to show a progress bar during evaluation. Defaults to False. + write_csv (bool, optional): Whether to write the evaluation results to a CSV file. Defaults to True. + precision (Optional[Literal["float32", "int8", "uint8", "binary", "ubinary"]], optional): The precision + to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or "ubinary". Defaults to None. 
+ truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. `None` uses the + model's current truncation dimension. Defaults to None. """ super().__init__() self.sentences1 = sentences1 diff --git a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py index 235bd1b75..8381abf5b 100644 --- a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py +++ b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py @@ -24,7 +24,7 @@ class InformationRetrievalEvaluator(SentenceEvaluator): Given a set of queries and a large corpus set. It will retrieve for each query the top-k most similar document. It measures Mean Reciprocal Rank (MRR), Recall@k, and Normalized Discounted Cumulative Gain (NDCG) - Example + Example: :: import random @@ -46,9 +46,9 @@ class InformationRetrievalEvaluator(SentenceEvaluator): corpus = corpus.filter(lambda x: x["_id"] in required_corpus_ids) # Convert the datasets to dictionaries - corpus = dict(zip(corpus["_id"], corpus["text"])) # Our corpus (qid => question) + corpus = dict(zip(corpus["_id"], corpus["text"])) # Our corpus (cid => document) queries = dict(zip(queries["_id"], queries["text"])) # Our queries (qid => question) - relevant_docs = {} # Query ID to relevant documents (qid => set([relevant_question_ids]) + relevant_docs = {} # Query ID to relevant documents (qid => set([relevant_cids]) for qid, corpus_ids in zip(relevant_docs_data["query-id"], relevant_docs_data["corpus-id"]): qid = str(qid) corpus_ids = str(corpus_ids) @@ -129,7 +129,28 @@ def __init__( SimilarityFunction.DOT_PRODUCT.value: dot_score, }, # Score function, higher=more similar main_score_function: Optional[Union[str, SimilarityFunction]] = None, - ): + ) -> None: + """ + Initializes the InformationRetrievalEvaluator. + + Args: + queries (Dict[str, str]): A dictionary mapping query IDs to queries. + corpus (Dict[str, str]): A dictionary mapping document IDs to documents. + relevant_docs (Dict[str, Set[str]]): A dictionary mapping query IDs to a set of relevant document IDs. + corpus_chunk_size (int): The size of each chunk of the corpus. Defaults to 50000. + mrr_at_k (List[int]): A list of integers representing the values of k for MRR calculation. Defaults to [10]. + ndcg_at_k (List[int]): A list of integers representing the values of k for NDCG calculation. Defaults to [10]. + accuracy_at_k (List[int]): A list of integers representing the values of k for accuracy calculation. Defaults to [1, 3, 5, 10]. + precision_recall_at_k (List[int]): A list of integers representing the values of k for precision and recall calculation. Defaults to [1, 3, 5, 10]. + map_at_k (List[int]): A list of integers representing the values of k for MAP calculation. Defaults to [100]. + show_progress_bar (bool): Whether to show a progress bar during evaluation. Defaults to False. + batch_size (int): The batch size for evaluation. Defaults to 32. + name (str): A name for the evaluation. Defaults to "". + write_csv (bool): Whether to write the evaluation results to a CSV file. Defaults to True. + truncate_dim (int, optional): The dimension to truncate the embeddings to. Defaults to None. + score_functions (Dict[str, Callable[[Tensor, Tensor], Tensor]]): A dictionary mapping score function names to score functions. Defaults to {SimilarityFunction.COSINE.value: cos_sim, SimilarityFunction.DOT_PRODUCT.value: dot_score}. 
+ main_score_function (Union[str, SimilarityFunction], optional): The main score function to use for evaluation. Defaults to None. + """ super().__init__() self.queries_ids = [] for qid in queries: diff --git a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py index 05ebfe253..2bd65a163 100644 --- a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py +++ b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py @@ -25,8 +25,8 @@ def __init__(self, dataloader: DataLoader, name: str = "", softmax_model=None, w """ Constructs an evaluator for the given dataset - :param dataloader: - the data for the evaluation + Args: + dataloader (DataLoader): the data for the evaluation """ super().__init__() self.dataloader = dataloader diff --git a/sentence_transformers/evaluation/MSEEvaluator.py b/sentence_transformers/evaluation/MSEEvaluator.py index 1cca98e6c..46e7518f6 100644 --- a/sentence_transformers/evaluation/MSEEvaluator.py +++ b/sentence_transformers/evaluation/MSEEvaluator.py @@ -20,16 +20,18 @@ class MSEEvaluator(SentenceEvaluator): For multilingual knowledge distillation (https://arxiv.org/abs/2004.09813), source_sentences are in English and target_sentences are in a different language like German, Chinese, Spanish... - :param source_sentences: Source sentences are embedded with the teacher model - :param target_sentences: Target sentences are ambedding with the student model. - :param show_progress_bar: Show progress bar when computing embeddings - :param batch_size: Batch size to compute sentence embeddings - :param name: Name of the evaluator - :param write_csv: Write results to CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. - - Example + Args: + source_sentences (List[str]): Source sentences to embed with the teacher model. + target_sentences (List[str]): Target sentences to embed with the student model. + teacher_model (SentenceTransformer, optional): The teacher model to compute the source sentence embeddings. + show_progress_bar (bool, optional): Show progress bar when computing embeddings. Defaults to False. + batch_size (int, optional): Batch size to compute sentence embeddings. Defaults to 32. + name (str, optional): Name of the evaluator. Defaults to "". + write_csv (bool, optional): Write results to CSV file. Defaults to True. + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. `None` uses the model's current truncation + dimension. Defaults to None. + + Example: :: from sentence_transformers import SentenceTransformer diff --git a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py index bb027614e..0d6fd2778 100644 --- a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py +++ b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py @@ -15,21 +15,22 @@ class MSEEvaluatorFromDataFrame(SentenceEvaluator): """ Computes the mean squared error (x100) between the computed sentence embedding and some target sentence embedding. - :param dataframe: It must have the following format. Rows contains different, parallel sentences. - Columns are the respective language codes:: + Args: + dataframe (List[Dict[str, str]]): It must have the following format. Rows contain different, parallel sentences.
+ Columns are the respective language codes:: [{'en': 'My sentence in English', 'es': 'Oración en español', 'fr': 'Phrase en français'...}, {'en': 'My second sentence', ...}] - - :param combinations: Must be of the format ``[('en', 'es'), ('en', 'fr'), ...]``. - First entry in a tuple is the source language. The sentence in the respective language will be fetched from - the dataframe and passed to the teacher model. Second entry in a tuple the the target language. Sentence - will be fetched from the dataframe and passed to the student model - :param batch_size: Batch size to compute sentence embeddings - :param name: Name of the evaluator - :param write_csv: Write results to CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. + teacher_model (SentenceTransformer): The teacher model used to compute the sentence embeddings. + combinations (List[Tuple[str, str]]): Must be of the format ``[('en', 'es'), ('en', 'fr'), ...]``. + First entry in a tuple is the source language. The sentence in the respective language will be fetched from + the dataframe and passed to the teacher model. Second entry in a tuple is the target language. The sentence + will be fetched from the dataframe and passed to the student model. + batch_size (int, optional): The batch size to compute sentence embeddings. Defaults to 8. + name (str, optional): The name of the evaluator. Defaults to "". + write_csv (bool, optional): Whether to write the results to a CSV file. Defaults to True. + truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. If None, uses the model's + current truncation dimension. Defaults to None. """ def __init__( diff --git a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py index 46f2f4fdb..701edf29b 100644 --- a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py +++ b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py @@ -19,7 +19,7 @@ class ParaphraseMiningEvaluator(SentenceEvaluator): identifies the pairs with the highest similarity. It compares the extracted paraphrase pairs with a set of gold labels and computes the F1 score. - Example + Example: :: from datasets import load_dataset @@ -76,22 +76,35 @@ def __init__( truncate_dim: Optional[int] = None, ): """ - - :param sentences_map: A dictionary that maps sentence-ids to sentences, i.e. sentences_map[id] => sentence. - :param duplicates_list: Duplicates_list is a list with id pairs [(id1, id2), (id1, id5)] that identifies the duplicates / paraphrases in the sentences_map - :param duplicates_dict: A default dictionary mapping [id1][id2] to true if id1 and id2 are duplicates. Must be symmetric, i.e., if [id1][id2] => True, then [id2][id1] => True. - :param add_transitive_closure: If true, it adds a transitive closure, i.e. if dup[a][b] and dup[b][c], then dup[a][c] - :param query_chunk_size: To identify the paraphrases, the cosine-similarity between all sentence-pairs will be computed. As this might require a lot of memory, we perform a batched computation. #query_batch_size sentences will be compared against up to #corpus_batch_size sentences. In the default setting, 5000 sentences will be grouped together and compared up-to against 100k other sentences.
- :param corpus_chunk_size: The corpus will be batched, to reduce the memory requirement - :param max_pairs: We will only extract up to #max_pairs potential paraphrase candidates. - :param top_k: For each query, we extract the top_k most similar pairs and add it to a sorted list. I.e., for one sentence we cannot find more than top_k paraphrases - :param show_progress_bar: Output a progress bar - :param batch_size: Batch size for computing sentence embeddings - :param name: Name of the experiment - :param write_csv: Write results to CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. - + Initializes the ParaphraseMiningEvaluator. + + Args: + sentences_map (Dict[str, str]): A dictionary that maps sentence-ids to sentences. + For example, sentences_map[id] => sentence. + duplicates_list (List[Tuple[str, str]], optional): A list with id pairs [(id1, id2), (id1, id5)] + that identifies the duplicates / paraphrases in the sentences_map. Defaults to None. + duplicates_dict (Dict[str, Dict[str, bool]], optional): A default dictionary mapping [id1][id2] + to true if id1 and id2 are duplicates. Must be symmetric, i.e., if [id1][id2] => True, + then [id2][id1] => True. Defaults to None. + add_transitive_closure (bool, optional): If true, it adds a transitive closure, + i.e. if dup[a][b] and dup[b][c], then dup[a][c]. Defaults to False. + query_chunk_size (int, optional): To identify the paraphrases, the cosine-similarity between + all sentence-pairs will be computed. As this might require a lot of memory, we perform + a batched computation. query_chunk_size sentences will be compared against up to + corpus_chunk_size sentences. In the default setting, 5000 sentences will be grouped + together and compared against up to 100k other sentences. Defaults to 5000. + corpus_chunk_size (int, optional): The corpus will be batched, to reduce the memory requirement. + Defaults to 100000. + max_pairs (int, optional): We will only extract up to max_pairs potential paraphrase candidates. + Defaults to 500000. + top_k (int, optional): For each query, we extract the top_k most similar pairs and add them to a sorted list. + I.e., for one sentence we cannot find more than top_k paraphrases. Defaults to 100. + show_progress_bar (bool, optional): Output a progress bar. Defaults to False. + batch_size (int, optional): Batch size for computing sentence embeddings. Defaults to 16. + name (str, optional): Name of the experiment. Defaults to "". + write_csv (bool, optional): Write results to CSV file. Defaults to True. + truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. + `None` uses the model's current truncation dimension. Defaults to None. """ super().__init__() self.sentences = [] diff --git a/sentence_transformers/evaluation/RerankingEvaluator.py b/sentence_transformers/evaluation/RerankingEvaluator.py index 1216bff79..902d6281d 100644 --- a/sentence_transformers/evaluation/RerankingEvaluator.py +++ b/sentence_transformers/evaluation/RerankingEvaluator.py @@ -21,20 +21,20 @@ class RerankingEvaluator(SentenceEvaluator): Given a query and a list of documents, it computes the score [query, doc_i] for all possible documents and sorts them in decreasing order. Then, MRR@10, NDCG@10 and MAP are computed to measure the quality of the ranking. - :param samples: Must be a list and each element is of the form: {'query': '', 'positive': [], 'negative': []}.
- Query is the search query, positive is a list of positive (relevant) documents, negative is a list of negative - (irrelevant) documents. - - :param at_k: Only consider the top k most similar documents to each query for the evaluation - :param name: Name of the evaluator - :param write_csv: Write results to CSV file - :param similarity_fct: similarity function between sentence embeddings. By default, cosine similarity. - :param batch_size: Batch size to compute sentence embeddings - :param show_progress_bar: Show progress bar when computing embeddings - :param use_batched_encoding: Whether or not to encode queries and documents in batches for greater speed, or 1-by-1 - to save memory - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current truncation - dimension. Defaults to None. + Args: + samples (list): A list of dictionaries, where each dictionary represents a sample and has the following keys: + - 'query': The search query. + - 'positive': A list of positive (relevant) documents. + - 'negative': A list of negative (irrelevant) documents. + at_k (int, optional): Only consider the top k most similar documents to each query for the evaluation. Defaults to 10. + name (str, optional): Name of the evaluator. Defaults to "". + write_csv (bool, optional): Write results to CSV file. Defaults to True. + similarity_fct (Callable[[torch.Tensor, torch.Tensor], torch.Tensor], optional): Similarity function between sentence embeddings. By default, cosine similarity. Defaults to cos_sim. + batch_size (int, optional): Batch size to compute sentence embeddings. Defaults to 64. + show_progress_bar (bool, optional): Show progress bar when computing embeddings. Defaults to False. + use_batched_encoding (bool, optional): Whether or not to encode queries and documents in batches for greater speed, or 1-by-1 to save memory. Defaults to True. + truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. Defaults to None. + mrr_at_k (Optional[int], optional): Deprecated parameter. Please use `at_k` instead. Defaults to None. """ def __init__( @@ -88,6 +88,18 @@ def __init__( def __call__( self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1 ) -> Dict[str, float]: + """ + Evaluates the model on the dataset and returns the evaluation metrics. + + Args: + model (SentenceTransformer): The SentenceTransformer model to evaluate. + output_path (str, optional): The output path to write the results. Defaults to None. + epoch (int, optional): The current epoch number. Defaults to -1. + steps (int, optional): The current step number. Defaults to -1. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. + """ if epoch != -1: if steps == -1: out_txt = f" after epoch {epoch}" @@ -145,6 +157,15 @@ def __call__( return metrics def compute_metrices(self, model): + """ + Computes the evaluation metrics for the given model. + + Args: + model (SentenceTransformer): The SentenceTransformer model to compute metrics for. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. 
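+
+        Example:
+            A minimal sketch (the model name and sample data are illustrative)::
+
+                from sentence_transformers import SentenceTransformer
+                from sentence_transformers.evaluation import RerankingEvaluator
+
+                model = SentenceTransformer("all-MiniLM-L6-v2")
+                samples = [{
+                    "query": "What is Python?",
+                    "positive": ["Python is an interpreted programming language."],
+                    "negative": ["A python is a large constricting snake."],
+                }]
+                evaluator = RerankingEvaluator(samples)
+                metrics = evaluator.compute_metrices(model)
+                # a dictionary with MAP, MRR@k and NDCG@k values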
+ """ return ( self.compute_metrices_batched(model) if self.use_batched_encoding @@ -153,8 +174,13 @@ def compute_metrices(self, model): def compute_metrices_batched(self, model): """ - Computes the metrices in a batched way, by batching all queries and - all documents together + Computes the evaluation metrics in a batched way, by batching all queries and all documents together. + + Args: + model (SentenceTransformer): The SentenceTransformer model to compute metrics for. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. """ all_mrr_scores = [] all_ndcg_scores = [] @@ -222,10 +248,13 @@ def compute_metrices_batched(self, model): def compute_metrices_individual(self, model): """ - Embeds every (query, positive, negative) tuple individually. - Is slower than the batched version, but saves memory as only the - embeddings for one tuple are needed. Useful when you have - a really large test set + Computes the evaluation metrics individually by embedding every (query, positive, negative) tuple individually. + + Args: + model (SentenceTransformer): The SentenceTransformer model to compute metrics for. + + Returns: + Dict[str, float]: A dictionary containing the evaluation metrics. """ all_mrr_scores = [] all_ndcg_scores = [] diff --git a/sentence_transformers/evaluation/SentenceEvaluator.py b/sentence_transformers/evaluation/SentenceEvaluator.py index d8f4f5232..a336fc362 100644 --- a/sentence_transformers/evaluation/SentenceEvaluator.py +++ b/sentence_transformers/evaluation/SentenceEvaluator.py @@ -22,21 +22,23 @@ def __call__( This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result. - :param model: - the model to evaluate - :param output_path: - path where predictions and metrics are written to - :param epoch - the epoch where the evaluation takes place. - This is used for the file prefixes. - If this is -1, then we assume evaluation on test data. - :param steps - the steps in the current epoch at time of the evaluation. - This is used for the file prefixes. - If this is -1, then we assume evaluation at the end of the epoch. - :return: Either a score for the evaluation with a higher score indicating a better result, - or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` - must be defined + Args: + model: the model to evaluate + output_path: path where predictions and metrics are written + to + epoch: the epoch where the evaluation takes place. This is + used for the file prefixes. If this is -1, then we + assume evaluation on test data. + steps: the steps in the current epoch at time of the + evaluation. This is used for the file prefixes. If this + is -1, then we assume evaluation at the end of the + epoch. + + Returns: + Either a score for the evaluation with a higher score + indicating a better result, or a dictionary with scores. If + the latter is chosen, then `evaluator.primary_metric` must + be defined """ pass diff --git a/sentence_transformers/evaluation/SequentialEvaluator.py b/sentence_transformers/evaluation/SequentialEvaluator.py index 2e2fb5ced..3f9f43f36 100644 --- a/sentence_transformers/evaluation/SequentialEvaluator.py +++ b/sentence_transformers/evaluation/SequentialEvaluator.py @@ -12,6 +12,22 @@ class SequentialEvaluator(SentenceEvaluator): """ def __init__(self, evaluators: Iterable[SentenceEvaluator], main_score_function=lambda scores: scores[-1]): + """ + Initializes a SequentialEvaluator object. 
+ + Args: + evaluators (Iterable[SentenceEvaluator]): A collection of SentenceEvaluator objects. + main_score_function (function, optional): A function that takes a list of scores and returns the main score. + Defaults to selecting the last score in the list. + + Example: + :: + + evaluator1 = BinaryClassificationEvaluator(...) + evaluator2 = InformationRetrievalEvaluator(...) + evaluator3 = MSEEvaluator(...) + seq_evaluator = SequentialEvaluator([evaluator1, evaluator2, evaluator3]) + """ super().__init__() self.evaluators = evaluators self.main_score_function = main_score_function diff --git a/sentence_transformers/evaluation/TranslationEvaluator.py b/sentence_transformers/evaluation/TranslationEvaluator.py index 603cbe744..8580b8d70 100644 --- a/sentence_transformers/evaluation/TranslationEvaluator.py +++ b/sentence_transformers/evaluation/TranslationEvaluator.py @@ -19,7 +19,7 @@ class TranslationEvaluator(SentenceEvaluator): and assuming that fr_i is the translation of en_i. Checks if vec(en_i) has the highest similarity to vec(fr_i). Computes the accuracy in both directions - Example + Example: :: from sentence_transformers import SentenceTransformer @@ -66,23 +66,16 @@ def __init__( The labels need to indicate the similarity between the sentences. - :param source_sentences: - List of sentences in source language - :param target_sentences: - List of sentences in target language - :param show_progress_bar: - Show progress bar when computing embeddings - :param batch_size: - Batch size to compute sentence embeddings - :param name: - Name of the evaluator - :param print_wrong_matches: - Prints incorrect matches - :param write_csv: - Write results to CSV file - :param truncate_dim: - The dimension to truncate sentence embeddings to. `None` uses the model's current truncation dimension. - Defaults to None. + Args: + source_sentences (List[str]): List of sentences in the source language. + target_sentences (List[str]): List of sentences in the target language. + show_progress_bar (bool): Whether to show a progress bar when computing embeddings. Defaults to False. + batch_size (int): The batch size to compute sentence embeddings. Defaults to 16. + name (str): The name of the evaluator. Defaults to an empty string. + print_wrong_matches (bool): Whether to print incorrect matches. Defaults to False. + write_csv (bool): Whether to write the evaluation results to a CSV file. Defaults to True. + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. If None, the model's + current truncation dimension will be used. Defaults to None. """ super().__init__() self.source_sentences = source_sentences diff --git a/sentence_transformers/evaluation/TripletEvaluator.py b/sentence_transformers/evaluation/TripletEvaluator.py index 86ff42ba8..fe34a32ac 100644 --- a/sentence_transformers/evaluation/TripletEvaluator.py +++ b/sentence_transformers/evaluation/TripletEvaluator.py @@ -19,7 +19,7 @@ class TripletEvaluator(SentenceEvaluator): Evaluate a model based on a triplet: (sentence, positive_example, negative_example). Checks if distance(sentence, positive_example) < distance(sentence, negative_example). - Example + Example: :: from sentence_transformers import SentenceTransformer @@ -66,17 +66,21 @@ def __init__( truncate_dim: Optional[int] = None, ): """ - :param anchors: Sentences to check similarity to. (e.g. a query) - :param positives: List of positive sentences - :param negatives: List of negative sentences - :param main_distance_function: The distance function to use. 
If not specified, use cosine similarity, - dot product, Euclidean, and Manhattan. - :param name: Name for the output - :param batch_size: Batch size used to compute embeddings - :param show_progress_bar: If true, prints a progress bar - :param write_csv: Write results to a CSV file - :param truncate_dim: The dimension to truncate sentence embeddings to. `None` uses the model's current - truncation dimension. Defaults to None. + Initializes a TripletEvaluator object. + + Args: + anchors (List[str]): Sentences to check similarity to. (e.g. a query) + positives (List[str]): List of positive sentences + negatives (List[str]): List of negative sentences + main_distance_function (Union[str, SimilarityFunction], optional): + The distance function to use. If not specified, cosine similarity, + dot product, Euclidean, and Manhattan distances are all used. Defaults to None. + name (str): Name for the output. Defaults to "". + batch_size (int): Batch size used to compute embeddings. Defaults to 16. + show_progress_bar (bool): If true, prints a progress bar. Defaults to False. + write_csv (bool): Write results to a CSV file. Defaults to True. + truncate_dim (int, optional): The dimension to truncate sentence embeddings to. + `None` uses the model's current truncation dimension. Defaults to None. """ super().__init__() self.anchors = anchors diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py index 02f3cc744..8ec3b94ba 100644 --- a/sentence_transformers/fit_mixin.py +++ b/sentence_transformers/fit_mixin.py @@ -173,32 +173,59 @@ def fit( checkpoint_save_total_limit: int = 0, ): """ - Train the model with the given training objective - Each training objective is sampled in turn for one batch. - We sample only as many batches from each objective as there are in the smallest one - to make sure of equal training with each dataset. - - :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning - :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. - :param epochs: Number of epochs for training - :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal the DataLoader size from train_objectives. - :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. - :param optimizer_class: Optimizer - :param optimizer_params: Optimizer parameters - :param weight_decay: Weight decay for model parameters - :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps - :param output_path: Storage path for the model and evaluation files - :param save_best_model: If true, the best model (according to evaluator) is stored at output_path - :param max_grad_norm: Used for gradient normalization. - :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0 - :param callback: Callback function that is invoked after each evaluation.
- It must accept the following three parameters in this order: - `score`, `epoch`, `steps` - :param show_progress_bar: If True, output a tqdm progress bar - :param checkpoint_path: Folder to save checkpoints during training - :param checkpoint_save_steps: Will save a checkpoint after so many steps - :param checkpoint_save_total_limit: Total number of checkpoints to store + Deprecated training method from before Sentence Transformers v3.0; it is recommended to use + :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` instead. This method uses + :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` behind the scenes, but does + not provide as much flexibility as the Trainer itself. + + This training approach uses a list of DataLoaders and Loss functions to train the model. Each DataLoader + is sampled in turn for one batch. We sample only as many batches from each DataLoader as there are in the + smallest one to make sure of equal training with each dataset, i.e. round robin sampling. + + This method should produce equivalent results in v3.0+ as before v3.0, but if you encounter any issues + with your existing training scripts, then you may wish to use + :meth:`SentenceTransformer.old_fit` instead. + That uses the old training method from before v3.0. + + Args: + train_objectives: Tuples of (DataLoader, LossFunction). Pass + more than one for multi-task learning + evaluator: An evaluator (sentence_transformers.evaluation) + evaluates the model performance during training on held- + out dev data. It is used to determine the best model + that is saved to disc. + epochs: Number of epochs for training + steps_per_epoch: Number of training steps per epoch. If set + to None (default), one epoch equals the DataLoader + size from train_objectives. + scheduler: Learning rate scheduler. Available schedulers: + constantlr, warmupconstant, warmuplinear, warmupcosine, + warmupcosinewithhardrestarts + warmup_steps: Behavior depends on the scheduler. For + WarmupLinear (default), the learning rate is increased + from 0 up to the maximal learning rate. After these many + training steps, the learning rate is decreased linearly + back to zero. + optimizer_class: Optimizer + optimizer_params: Optimizer parameters + weight_decay: Weight decay for model parameters + evaluation_steps: If > 0, evaluate the model using evaluator + after each number of training steps + output_path: Storage path for the model and evaluation files + save_best_model: If true, the best model (according to + evaluator) is stored at output_path + max_grad_norm: Used for gradient normalization. + use_amp: Use Automatic Mixed Precision (AMP). Only for + Pytorch >= 1.6.0 + callback: Callback function that is invoked after each + evaluation. It must accept the following three + parameters in this order: `score`, `epoch`, `steps` + show_progress_bar: If True, output a tqdm progress bar + checkpoint_path: Folder to save checkpoints during training + checkpoint_save_steps: Will save a checkpoint after so many + steps + checkpoint_save_total_limit: Total number of checkpoints to + store """ # Delayed import to counter the SentenceTransformers -> FitMixin -> SentenceTransformerTrainer -> SentenceTransformers circular import from sentence_transformers.trainer import SentenceTransformerTrainer @@ -338,7 +365,13 @@ def _default_checkpoint_dir() -> str: @staticmethod def _get_scheduler(optimizer, scheduler: str, warmup_steps: int, t_total: int): """ - Returns the correct learning rate scheduler.
Available scheduler: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts + Returns the correct learning rate scheduler. Available schedulers: + + - constantlr, + - warmupconstant, + - warmuplinear, + - warmupcosine, + - warmupcosinewithhardrestarts """ scheduler = scheduler.lower() if scheduler == "constantlr": @@ -365,9 +398,10 @@ def smart_batching_collate(self, batch: List["InputExample"]) -> Tuple[List[Dict Transforms a batch from a SmartBatchingDataset to a batch of tensors for the model Here, batch is a list of InputExample instances: [InputExample(...), ...] - :param batch: - a batch from a SmartBatchingDataset - :return: + Args: + batch: a batch from a SmartBatchingDataset + + Returns: + a batch of tensors for the model """ texts = [example.texts for example in batch] @@ -410,32 +444,53 @@ def old_fit( checkpoint_save_total_limit: int = 0, ): """ - Train the model with the given training objective - Each training objective is sampled in turn for one batch. - We sample only as many batches from each objective as there are in the smallest one - to make sure of equal training with each dataset. - - :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning - :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc. - :param epochs: Number of epochs for training - :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal the DataLoader size from train_objectives. - :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts - :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero. - :param optimizer_class: Optimizer - :param optimizer_params: Optimizer parameters - :param weight_decay: Weight decay for model parameters - :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps - :param output_path: Storage path for the model and evaluation files - :param save_best_model: If true, the best model (according to evaluator) is stored at output_path - :param max_grad_norm: Used for gradient normalization. - :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0 - :param callback: Callback function that is invoked after each evaluation. - It must accept the following three parameters in this order: - `score`, `epoch`, `steps` - :param show_progress_bar: If True, output a tqdm progress bar - :param checkpoint_path: Folder to save checkpoints during training - :param checkpoint_save_steps: Will save a checkpoint after so many steps - :param checkpoint_save_total_limit: Total number of checkpoints to store + Deprecated training method from before Sentence Transformers v3.0; it is recommended to use + :class:`sentence_transformers.trainer.SentenceTransformerTrainer` instead. This method should + only be used if you encounter issues with your existing training scripts after upgrading to v3.0+. + + This training approach uses a list of DataLoaders and Loss functions to train the model. Each DataLoader + is sampled in turn for one batch.
We sample only as many batches from each DataLoader as there are in the + smallest one to make sure of equal training with each dataset, i.e. round robin sampling. + + Args: + train_objectives: Tuples of (DataLoader, LossFunction). Pass + more than one for multi-task learning + evaluator: An evaluator (sentence_transformers.evaluation) + evaluates the model performance during training on held- + out dev data. It is used to determine the best model + that is saved to disc. + epochs: Number of epochs for training + steps_per_epoch: Number of training steps per epoch. If set + to None (default), one epoch equals the DataLoader + size from train_objectives. + scheduler: Learning rate scheduler. Available schedulers: + constantlr, warmupconstant, warmuplinear, warmupcosine, + warmupcosinewithhardrestarts + warmup_steps: Behavior depends on the scheduler. For + WarmupLinear (default), the learning rate is increased + from 0 up to the maximal learning rate. After these many + training steps, the learning rate is decreased linearly + back to zero. + optimizer_class: Optimizer + optimizer_params: Optimizer parameters + weight_decay: Weight decay for model parameters + evaluation_steps: If > 0, evaluate the model using evaluator + after each number of training steps + output_path: Storage path for the model and evaluation files + save_best_model: If true, the best model (according to + evaluator) is stored at output_path + max_grad_norm: Used for gradient normalization. + use_amp: Use Automatic Mixed Precision (AMP). Only for + Pytorch >= 1.6.0 + callback: Callback function that is invoked after each + evaluation. It must accept the following three + parameters in this order: `score`, `epoch`, `steps` + show_progress_bar: If True, output a tqdm progress bar + checkpoint_path: Folder to save checkpoints during training + checkpoint_save_steps: Will save a checkpoint after so many + steps + checkpoint_save_total_limit: Total number of checkpoints to + store """ ##Add info to model card diff --git a/sentence_transformers/losses/AdaptiveLayerLoss.py b/sentence_transformers/losses/AdaptiveLayerLoss.py index 678aada1a..d50ac3227 100644 --- a/sentence_transformers/losses/AdaptiveLayerLoss.py +++ b/sentence_transformers/losses/AdaptiveLayerLoss.py @@ -111,20 +111,33 @@ def __init__( layers of the Sentence Transformer model. This is useful for when you want to train a model where users have the option to lower the number of layers used to improve their inference speed and memory usage. - :param model: SentenceTransformer model - :param loss: The loss function to be used, e.g. :class:`MultipleNegativesRankingLoss`, :class:`CoSENTLoss`, etc. - :param n_layers_per_step: The number of layers to use per step. If -1, then all layers are used. If > 0, then - a random sample of `n_layers_per_step` layers are used per step, separate from the final layer, which is - always used. The 2DMSE paper uses `n_layers_per_step=1`. The default value is 1. - :param last_layer_weight: The weight to use for the loss of the final layer. Increase this to focus more on the - performance when using all layers. The default value is 1.0. - :param prior_layers_weight: The weight to use for the loss of the prior layers. Increase this to focus more on - the performance when using fewer layers. The default value is 1.0. - :param kl_div_weight: The weight to use for the KL-divergence loss that is used to make the prior layers match - that of the last layer. Increase this to focus more on the performance when using fewer layers.
The default - value is 1.0. - :param kl_temperature: The temperature to use for the KL-divergence loss. If 0, then the KL-divergence loss is - not used. The default value is 1.0. + Args: + model: SentenceTransformer model + loss: The loss function to be used, e.g. + :class:`MultipleNegativesRankingLoss`, + :class:`CoSENTLoss`, etc. + n_layers_per_step: The number of layers to use per step. If + -1, then all layers are used. If > 0, then a random + sample of `n_layers_per_step` layers are used per step, + separate from the final layer, which is always used. The + 2DMSE paper uses `n_layers_per_step=1`. The default + value is 1. + last_layer_weight: The weight to use for the loss of the + final layer. Increase this to focus more on the + performance when using all layers. The default value is + 1.0. + prior_layers_weight: The weight to use for the loss of the + prior layers. Increase this to focus more on the + performance when using fewer layers. The default value + is 1.0. + kl_div_weight: The weight to use for the KL-divergence loss + that is used to make the prior layers match that of the + last layer. Increase this to focus more on the + performance when using fewer layers. The default value + is 1.0. + kl_temperature: The temperature to use for the KL-divergence + loss. If 0, then the KL-divergence loss is not used. The + default value is 1.0. References: - The concept was inspired by the 2DMSE paper: https://arxiv.org/abs/2402.14776 @@ -147,21 +160,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('microsoft/mpnet-base') - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32) - train_loss = losses.MultipleNegativesRankingLoss(model=model) - train_loss = losses.AdaptiveLayerLoss(model, train_loss) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.MultipleNegativesRankingLoss(model=model) + loss = losses.AdaptiveLayerLoss(model, loss) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super().__init__() self.model = model diff --git a/sentence_transformers/losses/AnglELoss.py b/sentence_transformers/losses/AnglELoss.py index a506a1317..661e8693d 100644 --- a/sentence_transformers/losses/AnglELoss.py +++ b/sentence_transformers/losses/AnglELoss.py @@ -20,8 +20,10 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0): pairs of input pairs in the batch that match this condition. This is the same as CoSENTLoss, with a different similarity function. - :param model: SentenceTransformerModel - :param scale: Output of similarity function is multiplied by scale value. Represents the inverse temperature. + Args: + model: SentenceTransformerModel + scale: Output of similarity function is multiplied by scale + value. Represents the inverse temperature. 
References: - For further details, see: https://arxiv.org/abs/2309.12871v1 @@ -43,15 +45,23 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0): Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset - model = SentenceTransformer('bert-base-uncased') - train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=1.0), - InputExample(texts=['My third sentence', 'Unrelated sentence'], label=0.3)] + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "score": [1.0, 0.3], + }) + loss = losses.AnglELoss(model) - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.AnglELoss(model=model) + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, + ) + trainer.train() """ super().__init__(model, scale, similarity_fct=util.pairwise_angle_sim) diff --git a/sentence_transformers/losses/BatchAllTripletLoss.py b/sentence_transformers/losses/BatchAllTripletLoss.py index a0fb1055c..1843a615a 100644 --- a/sentence_transformers/losses/BatchAllTripletLoss.py +++ b/sentence_transformers/losses/BatchAllTripletLoss.py @@ -17,9 +17,13 @@ def __init__( must be integers, with same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used. - :param margin: Negative samples should be at least margin further apart from the anchor than the positive. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used. + margin: Negative samples should be at least margin further + apart from the anchor than the positive. References: * Source: https://github.com/NegatioN/OnlineMiningTripletLoss/blob/master/online_triplet_loss/losses.py @@ -46,24 +50,29 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['Sentence from class 0'], label=0), - InputExample(texts=['Another sentence from class 0'], label=0), - InputExample(texts=['Sentence from class 1'], label=1), - InputExample(texts=['Sentence from class 2'], label=2), - ] - train_batch_size = 2 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.BatchAllTripletLoss(model=model) - model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + # E.g. 
0: sports, 1: economy, 2: politics + train_dataset = Dataset.from_dict({ + "sentence": [ + "He played a great game.", + "The stock is up 20%", + "They won 2-1.", + "The last goal was amazing.", + "They all voted against the bill.", + ], + "label": [0, 1, 0, 0, 2], + }) + loss = losses.BatchAllTripletLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(BatchAllTripletLoss, self).__init__() diff --git a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py index a70f419e1..02914a165 100644 --- a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py +++ b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py @@ -15,8 +15,11 @@ def __init__( must be integers, with same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. This soft-margin variant does not require setting a margin. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used. Definitions: :Easy triplets: Triplets which have a loss of 0 because @@ -49,24 +52,29 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['Sentence from class 0'], label=0), - InputExample(texts=['Another sentence from class 0'], label=0), - InputExample(texts=['Sentence from class 1'], label=1), - InputExample(texts=['Sentence from class 2'], label=2) - ] - train_batch_size = 2 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.BatchHardSoftMarginTripletLoss(model=model) - model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + # E.g. 
0: sports, 1: economy, 2: politics + train_dataset = Dataset.from_dict({ + "sentence": [ + "He played a great game.", + "The stock is up 20%", + "They won 2-1.", + "The last goal was amazing.", + "They all voted against the bill.", + ], + "label": [0, 1, 0, 0, 2], + }) + loss = losses.BatchHardSoftMarginTripletLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(BatchHardSoftMarginTripletLoss, self).__init__(model) self.sentence_embedder = model diff --git a/sentence_transformers/losses/BatchHardTripletLoss.py b/sentence_transformers/losses/BatchHardTripletLoss.py index ab023ec3d..51df4a8b5 100644 --- a/sentence_transformers/losses/BatchHardTripletLoss.py +++ b/sentence_transformers/losses/BatchHardTripletLoss.py @@ -6,15 +6,11 @@ class BatchHardTripletLossDistanceFunction: - """ - This class defines distance functions, that can be used with Batch[All/Hard/SemiHard]TripletLoss - """ + """This class defines distance functions, that can be used with Batch[All/Hard/SemiHard]TripletLoss""" @staticmethod def cosine_distance(embeddings): - """ - Compute the 2D matrix of cosine distances (1-cosine_similarity) between all embeddings. - """ + """Compute the 2D matrix of cosine distances (1-cosine_similarity) between all embeddings.""" return 1 - util.pytorch_cos_sim(embeddings, embeddings) @staticmethod @@ -69,9 +65,13 @@ def __init__( The labels must be integers, with same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used - :param margin: Negative samples should be at least margin further apart from the anchor than the positive. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used + margin: Negative samples should be at least margin further + apart from the anchor than the positive. Definitions: :Easy triplets: Triplets which have a loss of 0 because @@ -106,24 +106,29 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['Sentence from class 0'], label=0), - InputExample(texts=['Another sentence from class 0'], label=0), - InputExample(texts=['Sentence from class 1'], label=1), - InputExample(texts=['Sentence from class 2'], label=2) - ] - train_batch_size = 2 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.BatchHardTripletLoss(model=model) - model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + # E.g. 
0: sports, 1: economy, 2: politics + train_dataset = Dataset.from_dict({ + "sentence": [ + "He played a great game.", + "The stock is up 20%", + "They won 2-1.", + "The last goal was amazing.", + "They all voted against the bill.", + ], + "label": [0, 1, 0, 0, 2], + }) + loss = losses.BatchHardTripletLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(BatchHardTripletLoss, self).__init__() self.sentence_embedder = model diff --git a/sentence_transformers/losses/BatchSemiHardTripletLoss.py b/sentence_transformers/losses/BatchSemiHardTripletLoss.py index a54a6bc26..c997d1f58 100644 --- a/sentence_transformers/losses/BatchSemiHardTripletLoss.py +++ b/sentence_transformers/losses/BatchSemiHardTripletLoss.py @@ -19,9 +19,13 @@ def __init__( The labels must be integers, with same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used - :param margin: Negative samples should be at least margin further apart from the anchor than the positive. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used + margin: Negative samples should be at least margin further + apart from the anchor than the positive. Definitions: :Easy triplets: Triplets which have a loss of 0 because @@ -57,24 +61,29 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['Sentence from class 0'], label=0), - InputExample(texts=['Another sentence from class 0'], label=0), - InputExample(texts=['Sentence from class 1'], label=1), - InputExample(texts=['Sentence from class 2'], label=2) - ] - train_batch_size = 2 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.BatchSemiHardTripletLoss(model=model) - model.fit( - train_objectives=[(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + # E.g. 
0: sports, 1: economy, 2: politics + train_dataset = Dataset.from_dict({ + "sentence": [ + "He played a great game.", + "The stock is up 20%", + "They won 2-1.", + "The last goal was amazing.", + "They all voted against the bill.", + ], + "label": [0, 1, 0, 0, 2], + }) + loss = losses.BatchSemiHardTripletLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(BatchSemiHardTripletLoss, self).__init__() self.sentence_embedder = model diff --git a/sentence_transformers/losses/CachedGISTEmbedLoss.py b/sentence_transformers/losses/CachedGISTEmbedLoss.py index 6612a1773..cf3392456 100644 --- a/sentence_transformers/losses/CachedGISTEmbedLoss.py +++ b/sentence_transformers/losses/CachedGISTEmbedLoss.py @@ -77,9 +77,12 @@ def __init__( :class:`CachedMultipleNegativesRankingLoss`, it is possible to reduce memory usage while maintaining performance levels comparable to those of :class:`GISTEmbedLoss`. - :param model: SentenceTransformer model - :param guide: SentenceTransformer model to guide the in-batch negative sample selection. - :param temperature: Temperature parameter to scale the cosine similarities. + Args: + model: SentenceTransformer model + guide: SentenceTransformer model to guide the in-batch + negative sample selection. + temperature: Temperature parameter to scale the cosine + similarities. References: - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf @@ -105,22 +108,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-uncased') - guide = SentenceTransformer('avsolatorio/GIST-small-Embedding-v0') - - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1024) # Here we can try much larger batch sizes! - train_loss = losses.CachedGISTEmbedLoss(model=model, mini_batch_size=32, guide=guide) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + guide = SentenceTransformer("all-MiniLM-L6-v2") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.CachedGISTEmbedLoss(model, guide, mini_batch_size=64) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(CachedGISTEmbedLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py index 80bdc4899..d3e2c7204 100644 --- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py @@ -83,9 +83,13 @@ def __init__( Notes: All steps are done with mini-batches. In the original implementation of GradCache, (2) is not done in mini-batches and requires a lot of memory when the batch size is large. One drawback is the speed: GradCache will sacrifice around 20% computation time according to the paper.
- :param model: SentenceTransformer model - :param scale: Output of similarity function is multiplied by scale value - :param similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1) + Args: + model: SentenceTransformer model + scale: Output of similarity function is multiplied by scale + value + similarity_fct: similarity function between sentence + embeddings. By default, cos_sim. Can also be set to dot + product (and then set scale to 1) References: - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf @@ -112,20 +116,22 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-uncased') - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1024) # Here we can try much larger batch sizes! - train_loss = losses.CachedMultipleNegativesRankingLoss(model=model, mini_batch_size = 32) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=64) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(CachedMultipleNegativesRankingLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/CoSENTLoss.py b/sentence_transformers/losses/CoSENTLoss.py index e937d7ef9..e0c5203b7 100644 --- a/sentence_transformers/losses/CoSENTLoss.py +++ b/sentence_transformers/losses/CoSENTLoss.py @@ -22,9 +22,13 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f resulting in faster convergence and a final model with superior performance. Consequently, CoSENTLoss may be used as a drop-in replacement for :class:`CosineSimilarityLoss` in any training script. - :param model: SentenceTransformerModel - :param similarity_fct: Function to compute the PAIRWISE similarity between embeddings. Default is ``util.pairwise_cos_sim``. - :param scale: Output of similarity function is multiplied by scale value. Represents the inverse temperature. + Args: + model: SentenceTransformerModel + similarity_fct: Function to compute the PAIRWISE similarity + between embeddings. Default is + ``util.pairwise_cos_sim``. + scale: Output of similarity function is multiplied by scale + value. Represents the inverse temperature.
References: - For further details, see: https://kexue.fm/archives/8847 @@ -46,15 +50,23 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - - model = SentenceTransformer('bert-base-uncased') - train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=1.0), - InputExample(texts=['My third sentence', 'Unrelated sentence'], label=0.3)] - - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.CoSENTLoss(model=model) + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "score": [1.0, 0.3], + }) + loss = losses.CoSENTLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, + ) + trainer.train() """ super(CoSENTLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/ContrastiveLoss.py b/sentence_transformers/losses/ContrastiveLoss.py index 55f5ad993..a7c66792f 100644 --- a/sentence_transformers/losses/ContrastiveLoss.py +++ b/sentence_transformers/losses/ContrastiveLoss.py @@ -6,9 +6,7 @@ class SiameseDistanceMetric(Enum): - """ - The metric for the contrastive loss - """ + """The metric for the contrastive loss""" EUCLIDEAN = lambda x, y: F.pairwise_distance(x, y, p=2) MANHATTAN = lambda x, y: F.pairwise_distance(x, y, p=1) @@ -27,10 +25,14 @@ def __init__( Contrastive loss. Expects as input two texts and a label of either 0 or 1. If the label == 1, then the distance between the two embeddings is reduced. If the label == 0, then the distance between the embeddings is increased. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrices that can be used - :param margin: Negative samples (label == 0) should have a distance of at least the margin value. - :param size_average: Average by the size of the mini-batch. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used + margin: Negative samples (label == 0) should have a distance + of at least the margin value. + size_average: Average by the size of the mini-batch.
References: * Further information: http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf @@ -53,23 +55,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses - from sentence_transformers.readers import InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('all-MiniLM-L6-v2') - train_examples = [ - InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1), - InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0), - ] - - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2) - train_loss = losses.ContrastiveLoss(model=model) - - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "label": [1, 0], + }) + loss = losses.ContrastiveLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(ContrastiveLoss, self).__init__() self.distance_metric = distance_metric diff --git a/sentence_transformers/losses/ContrastiveTensionLoss.py b/sentence_transformers/losses/ContrastiveTensionLoss.py index 85af67fc3..08dc5d723 100644 --- a/sentence_transformers/losses/ContrastiveTensionLoss.py +++ b/sentence_transformers/losses/ContrastiveTensionLoss.py @@ -23,7 +23,8 @@ class ContrastiveTensionLoss(nn.Module): Generally, :class:`ContrastiveTensionLossInBatchNegatives` is recommended over this loss, as it gives a stronger training signal. - :param model: SentenceTransformer model + Args: + model: SentenceTransformer model References: * Semantic Re-Tuning with Contrastive Tension: https://openreview.net/pdf?id=Ov_sMNau-PF @@ -112,9 +113,13 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Note that you should not use the `ContrastiveTensionDataLoader` for this loss, but just a normal DataLoader with `InputExample` instances. The two texts of each `InputExample` instance should be identical. - :param model: SentenceTransformer model - :param scale: Output of similarity function is multiplied by scale value - :param similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1) + Args: + model: SentenceTransformer model + scale: Output of similarity function is multiplied by scale + value + similarity_fct: similarity function between sentence + embeddings. By default, cos_sim. Can also be set to dot + product (and then set scale to 1) References: - Semantic Re-Tuning with Contrastive Tension: https://openreview.net/pdf?id=Ov_sMNau-PF diff --git a/sentence_transformers/losses/CosineSimilarityLoss.py b/sentence_transformers/losses/CosineSimilarityLoss.py index 46b075b38..8d27300e7 100644 --- a/sentence_transformers/losses/CosineSimilarityLoss.py +++ b/sentence_transformers/losses/CosineSimilarityLoss.py @@ -13,11 +13,15 @@ def __init__(self, model: SentenceTransformer, loss_fct=nn.MSELoss(), cos_score_ vectors ``u = model(sentence_A)`` and ``v = model(sentence_B)`` and measures the cosine-similarity between the two. 
By default, it minimizes the following loss: ``||input_label - cos_score_transformation(cosine_sim(u,v))||_2``. - :param model: SentenceTransformer model - :param loss_fct: Which pytorch loss function should be used to compare the ``cosine_similarity(u, v)`` with the input_label? - By default, MSE is used: ``||input_label - cosine_sim(u, v)||_2`` - :param cos_score_transformation: The cos_score_transformation function is applied on top of cosine_similarity. - By default, the identify function is used (i.e. no change). + Args: + model: SentenceTransformer model + loss_fct: Which pytorch loss function should be used to + compare the ``cosine_similarity(u, v)`` with the + input_label? By default, MSE is used: ``||input_label - + cosine_sim(u, v)||_2`` + cos_score_transformation: The cos_score_transformation + function is applied on top of cosine_similarity. By + default, the identity function is used (i.e. no change). References: - `Training Examples > Semantic Textual Similarity <../../examples/training/sts/README.html>`_ @@ -39,22 +43,23 @@ def __init__(self, model: SentenceTransformer, loss_fct=nn.MSELoss(), cos_score_ Example: :: - from sentence_transformers import SentenceTransformer, InputExample, losses - from torch.utils.data import DataLoader + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset - model = SentenceTransformer('distilbert-base-nli-mean-tokens') - train_examples = [ - InputExample(texts=['My first sentence', 'My second sentence'], label=0.8), - InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3) - ] - train_batch_size = 1 - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.CosineSimilarityLoss(model=model) + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "score": [1.0, 0.3], + }) + loss = losses.CosineSimilarityLoss(model) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(CosineSimilarityLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py index 9f13698a6..cdc35cb85 100644 --- a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py +++ b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py @@ -1,5 +1,5 @@ from torch import nn, Tensor -from typing import Iterable, Dict +from typing import Iterable, Dict, Optional from sentence_transformers import SentenceTransformer from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, PreTrainedModel import logging @@ -8,7 +8,9 @@ class DenoisingAutoEncoderLoss(nn.Module): - def __init__(self, model: SentenceTransformer, decoder_name_or_path: str = None, tie_encoder_decoder: bool = True): + def __init__( + self, model: SentenceTransformer, decoder_name_or_path: Optional[str] = None, tie_encoder_decoder: bool = True + ) -> None: """ This loss expects as input pairs of damaged sentences and the corresponding original ones. During training, the decoder reconstructs the original sentences from the encoded sentence embeddings.
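A minimal sketch of wiring this loss into the new trainer, mirroring the updated examples elsewhere in this patch; the "damaged"/"original" column names are illustrative assumptions (the loss consumes the two text columns in order: damaged text first, original text second): ::

    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
    from datasets import Dataset

    model = SentenceTransformer("microsoft/mpnet-base")
    # Column order matters: the damaged (noisy) text first, the original text second.
    train_dataset = Dataset.from_dict({
        "damaged": ["It's nice outside today.", "He drove work."],
        "original": ["It's nice weather outside today.", "He drove to work."],
    })
    # With tie_encoder_decoder=True (the default), the decoder parameters are tied to the encoder's.
    loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
    trainer.train()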
@@ -21,9 +23,10 @@ def __init__(self, model: SentenceTransformer, decoder_name_or_path: str = None, The data generation process (i.e. the 'damaging' process) has already been implemented in ``DenoisingAutoEncoderDataset``, allowing you to only provide regular sentences. - :param model: SentenceTransformer model - :param decoder_name_or_path: Model name or path for initializing a decoder (compatible with Huggingface's Transformers) - :param tie_encoder_decoder: whether to tie the trainable parameters of encoder and decoder + Args: + model (SentenceTransformer): The SentenceTransformer model. + decoder_name_or_path (str, optional): Model name or path for initializing a decoder (compatible with Huggingface's Transformers). Defaults to None. + tie_encoder_decoder (bool): Whether to tie the trainable parameters of encoder and decoder. Defaults to True. References: * TSDAE paper: https://arxiv.org/pdf/2104.06979.pdf diff --git a/sentence_transformers/losses/GISTEmbedLoss.py b/sentence_transformers/losses/GISTEmbedLoss.py index ff8d3d288..6c719d511 100644 --- a/sentence_transformers/losses/GISTEmbedLoss.py +++ b/sentence_transformers/losses/GISTEmbedLoss.py @@ -18,9 +18,13 @@ def __init__( in-batch negative sample selection. The cosine similarity is used to compute the loss and the temperature parameter is used to scale the cosine similarities. - :param model: SentenceTransformer model based on a `transformers` model. - :param guide: SentenceTransformer model to guide the in-batch negative sample selection. - :param temperature: Temperature parameter to scale the cosine similarities. + Args: + model: SentenceTransformer model based on a `transformers` + model. + guide: SentenceTransformer model to guide the in-batch + negative sample selection. + temperature: Temperature parameter to scale the cosine + similarities. 
References: - For further details, see: https://arxiv.org/abs/2402.16829 @@ -46,21 +50,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('all-MiniLM-L6-v2') - guide = SentenceTransformer('avsolatorio/GIST-small-Embedding-v0') - train_examples = [ - InputExample(texts=['The first query', 'The first positive passage', 'The first negative passage']), - InputExample(texts=['The second query', 'The second positive passage', 'The second negative passage']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2) - train_loss = losses.GISTEmbedLoss(model=model, guide=guide) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + guide = SentenceTransformer("all-MiniLM-L6-v2") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.GISTEmbedLoss(model, guide) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(GISTEmbedLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/MSELoss.py b/sentence_transformers/losses/MSELoss.py index 798c7949f..c377a96e1 100644 --- a/sentence_transformers/losses/MSELoss.py +++ b/sentence_transformers/losses/MSELoss.py @@ -12,7 +12,8 @@ def __init__(self, model): For an example, see `the distillation documentation <../../examples/training/distillation/README.html>`_ on extending language models to new languages. 
- :param model: SentenceTransformerModel + Args: + model: SentenceTransformerModel References: - Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation: https://arxiv.org/abs/2004.09813 @@ -34,30 +35,33 @@ def __init__(self, model): | sentence_1, sentence_2, ..., sentence_N | model sentence embeddings | +-----------------------------------------+-----------------------------+ - Example:: - - from sentence_transformers import SentenceTransformer, InputExample, losses - from torch.utils.data import DataLoader - - model_en = SentenceTransformer('bert-base-cased') - model_fr = SentenceTransformer('flaubert/flaubert_base_cased') - - examples_en = ['The first sentence', 'The second sentence', 'The third sentence', 'The fourth sentence'] - examples_fr = ['La première phrase', 'La deuxième phrase', 'La troisième phrase', 'La quatrième phrase'] - train_batch_size = 2 - - labels_en_en = model_en.encode(examples_en) - examples_en_fr = [InputExample(texts=[x], label=labels_en_en[i]) for i, x in enumerate(examples_en)] - loader_en_fr = DataLoader(examples_en_fr, batch_size=train_batch_size) - - examples_fr_fr = [InputExample(texts=[x], label=labels_en_en[i]) for i, x in enumerate(examples_fr)] - loader_fr_fr = DataLoader(examples_fr_fr, batch_size=train_batch_size) - - train_loss = losses.MSELoss(model=model_fr) - model_fr.fit( - [(loader_en_fr, train_loss), (loader_fr_fr, train_loss)], - epochs=10, - ) + Example: + :: + + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + student_model = SentenceTransformer("microsoft/mpnet-base") + teacher_model = SentenceTransformer("all-mpnet-base-v2") + train_dataset = Dataset.from_dict({ + "english": ["The first sentence", "The second sentence", "The third sentence", "The fourth sentence"], + "french": ["La première phrase", "La deuxième phrase", "La troisième phrase", "La quatrième phrase"], + }) + + def compute_labels(batch): + return { + "label": teacher_model.encode(batch["english"]) + } + + train_dataset = train_dataset.map(compute_labels, batched=True) + loss = losses.MSELoss(student_model) + + trainer = SentenceTransformerTrainer( + model=student_model, + train_dataset=train_dataset, + loss=loss, + ) + trainer.train() """ super(MSELoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/MarginMSELoss.py b/sentence_transformers/losses/MarginMSELoss.py index 26e202fe3..44ab49710 100644 --- a/sentence_transformers/losses/MarginMSELoss.py +++ b/sentence_transformers/losses/MarginMSELoss.py @@ -15,8 +15,9 @@ def __init__(self, model, similarity_fct=util.pairwise_dot_score): with a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query only against two passages. - :param model: SentenceTransformerModel - :param similarity_fct: Which similarity function to use. + Args: + model: SentenceTransformerModel + similarity_fct: Which similarity function to use. References: - For more details, please refer to https://arxiv.org/abs/2010.02666. @@ -38,33 +39,72 @@ def __init__(self, model, similarity_fct=util.pairwise_dot_score): +-----------------------------------------------+-----------------------------------------------+ Example: + + With gold labels, e.g. if you have hard scores for sentences. Imagine you want a model to embed sentences + with similar "quality" close to each other. 
If the "text1" has quality 5 out of 5, "text2" has quality + 1 out of 5, and "text3" has quality 3 out of 5, then the similarity of a pair can be defined as the + difference of the quality scores. So, the similarity between "text1" and "text2" is 4, and the + similarity between "text1" and "text3" is 2. If we use this as our "Teacher Model", the label becomes + similraity("text1", "text2") - similarity("text1", "text3") = 4 - 2 = 2. + + Positive values denote that the first passage is more similar to the query than the second passage, + while negative values denote the opposite. + + :: + + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "text1": ["It's nice weather outside today.", "He drove to work."], + "text2": ["It's so sunny.", "He took the car to work."], + "text3": ["It's very sunny.", "She walked to the store."], + "label": [0.1, 0.8], + }) + loss = losses.MarginMSELoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, + ) + trainer.train() + + We can also use a teacher model to compute the similarity scores. In this case, we can use the teacher model + to compute the similarity scores and use them as the silver labels. This is often used in knowledge distillation. + :: - from sentence_transformers import SentenceTransformer, InputExample, losses - from sentence_transformers.util import pairwise_dot_score - from torch.utils.data import DataLoader - import torch - - student_model = SentenceTransformer('sentence-transformers/distilbert-base-nli-mean-tokens') - teacher_model = SentenceTransformer('sentence-transformers/bert-base-nli-stsb-mean-tokens') - - train_examples = [ - ['The first query', 'The first positive passage', 'The first negative passage'], - ['The second query', 'The second positive passage', 'The second negative passage'], - ['The third query', 'The third positive passage', 'The third negative passage'], - ] - train_batch_size = 1 - encoded = torch.tensor([teacher_model.encode(x).tolist() for x in train_examples]) - labels = pairwise_dot_score(encoded[:, 0], encoded[:, 1]) - pairwise_dot_score(encoded[:, 0], encoded[:, 2]) - - train_input_examples = [InputExample(texts=x, label=labels[i]) for i, x in enumerate(train_examples)] - train_dataloader = DataLoader(train_input_examples, shuffle=True, batch_size=train_batch_size) - train_loss = losses.MarginMSELoss(model=student_model) - - student_model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + student_model = SentenceTransformer("microsoft/mpnet-base") + teacher_model = SentenceTransformer("all-mpnet-base-v2") + train_dataset = Dataset.from_dict({ + "query": ["It's nice weather outside today.", "He drove to work."], + "passage1": ["It's so sunny.", "He took the car to work."], + "passage2": ["It's very sunny.", "She walked to the store."], + }) + + def compute_labels(batch): + emb_queries = teacher_model.encode(batch["query"]) + emb_passages1 = teacher_model.encode(batch["passage1"]) + emb_passages2 = teacher_model.encode(batch["passage2"]) + return { + "label": teacher_model.similarity_pairwise(emb_queries, emb_passages1) - teacher_model.similarity_pairwise(emb_queries, emb_passages2) + } + + train_dataset = train_dataset.map(compute_labels, batched=True) + # In 
this example, the labels become -0.036 and 0.68, respectively + loss = losses.MarginMSELoss(student_model) + + trainer = SentenceTransformerTrainer( + model=student_model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(MarginMSELoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/Matryoshka2dLoss.py b/sentence_transformers/losses/Matryoshka2dLoss.py index da6f1512b..a3043da89 100644 --- a/sentence_transformers/losses/Matryoshka2dLoss.py +++ b/sentence_transformers/losses/Matryoshka2dLoss.py @@ -31,25 +31,41 @@ def __init__( Note, this uses `n_layers_per_step=1` and `n_dims_per_step=1` as default, following the original 2DMSE implementation. - :param model: SentenceTransformer model - :param loss: The loss function to be used, e.g. :class:`MultipleNegativesRankingLoss`, :class:`CoSENTLoss`, etc. - :param matryoshka_dims: A list of embedding dimensions to be used for the loss function, e.g. [768, 512, 256, 128, 64]. - :param matryoshka_weights: A list of weights to be used for the loss function, e.g. [1, 1, 1, 1, 1]. If None, then the - weights will be set to 1 for all dimensions. - :param n_layers_per_step: The number of layers to use per step. If -1, then all layers are used. If > 0, then - a random sample of n_layers_per_step layers are used per step. The 2DMSE paper uses `n_layers_per_step=1`. - The default value is -1. - :param n_dims_per_step: The number of dimensions to use per step. If -1, then all dimensions are used. If > 0, then - a random sample of n_dims_per_step dimensions are used per step. The default value is -1. - :param last_layer_weight: The weight to use for the loss of the final layer. Increase this to focus more on the - performance when using all layers. The default value is 1.0. - :param prior_layers_weight: The weight to use for the loss of the prior layers. Increase this to focus more on - the performance when using fewer layers. The default value is 1.0. - :param kl_div_weight: The weight to use for the KL-divergence loss that is used to make the prior layers match - that of the last layer. Increase this to focus more on the performance when using fewer layers. The default - value is 1.0. - :param kl_temperature: The temperature to use for the KL-divergence loss. If 0, then the KL-divergence loss is - not used. The default value is 1.0. + Args: + model: SentenceTransformer model + loss: The loss function to be used, e.g. + :class:`MultipleNegativesRankingLoss`, + :class:`CoSENTLoss`, etc. + matryoshka_dims: A list of embedding dimensions to be used + for the loss function, e.g. [768, 512, 256, 128, 64]. + matryoshka_weights: A list of weights to be used for the + loss function, e.g. [1, 1, 1, 1, 1]. If None, then the + weights will be set to 1 for all dimensions. + n_layers_per_step: The number of layers to use per step. If + -1, then all layers are used. If > 0, then a random + sample of n_layers_per_step layers are used per step. + The 2DMSE paper uses `n_layers_per_step=1`. The default + value is -1. + n_dims_per_step: The number of dimensions to use per step. + If -1, then all dimensions are used. If > 0, then a + random sample of n_dims_per_step dimensions are used per + step. The default value is -1. + last_layer_weight: The weight to use for the loss of the + final layer. Increase this to focus more on the + performance when using all layers. The default value is + 1.0. + prior_layers_weight: The weight to use for the loss of the + prior layers. 
Increase this to focus more on the + performance when using fewer layers. The default value + is 1.0. + kl_div_weight: The weight to use for the KL-divergence loss + that is used to make the prior layers match that of the + last layer. Increase this to focus more on the + performance when using fewer layers. The default value + is 1.0. + kl_temperature: The temperature to use for the KL-divergence + loss. If 0, then the KL-divergence loss is not used. The + default value is 1.0. References: - See the 2D Matryoshka Sentence Embeddings (2DMSE) paper: https://arxiv.org/abs/2402.14776 @@ -76,7 +92,7 @@ def __init__( from sentence_transformers import SentenceTransformer, losses, InputExample from torch.utils.data import DataLoader - model = SentenceTransformer('microsoft/mpnet-base') + model = SentenceTransformer("microsoft/mpnet-base") train_examples = [ InputExample(texts=['Anchor 1', 'Positive 1']), InputExample(texts=['Anchor 2', 'Positive 2']), diff --git a/sentence_transformers/losses/MatryoshkaLoss.py b/sentence_transformers/losses/MatryoshkaLoss.py index c866f1e08..acdac95f4 100644 --- a/sentence_transformers/losses/MatryoshkaLoss.py +++ b/sentence_transformers/losses/MatryoshkaLoss.py @@ -60,13 +60,20 @@ def __init__( different embedding dimensions. This is useful for when you want to train a model where users have the option to lower the embedding dimension to improve their embedding comparison speed and costs. - :param model: SentenceTransformer model - :param loss: The loss function to be used, e.g. :class:`MultipleNegativesRankingLoss`, :class:`CoSENTLoss`, etc. - :param matryoshka_dims: A list of embedding dimensions to be used for the loss function, e.g. [768, 512, 256, 128, 64]. - :param matryoshka_weights: A list of weights to be used for the loss function, e.g. [1, 1, 1, 1, 1]. If None, then the - weights will be set to 1 for all dimensions. - :param n_dims_per_step: The number of dimensions to use per step. If -1, then all dimensions are used. If > 0, then - a random sample of n_dims_per_step dimensions are used per step. The default value is -1. + Args: + model: SentenceTransformer model + loss: The loss function to be used, e.g. + :class:`MultipleNegativesRankingLoss`, + :class:`CoSENTLoss`, etc. + matryoshka_dims: A list of embedding dimensions to be used + for the loss function, e.g. [768, 512, 256, 128, 64]. + matryoshka_weights: A list of weights to be used for the + loss function, e.g. [1, 1, 1, 1, 1]. If None, then the + weights will be set to 1 for all dimensions. + n_dims_per_step: The number of dimensions to use per step. + If -1, then all dimensions are used. If > 0, then a + random sample of n_dims_per_step dimensions are used per + step. The default value is -1. References: - The concept was introduced in this paper: https://arxiv.org/abs/2205.13147 @@ -92,7 +99,7 @@ def __init__( from sentence_transformers import SentenceTransformer, losses, InputExample from torch.utils.data import DataLoader - model = SentenceTransformer('microsoft/mpnet-base') + model = SentenceTransformer("microsoft/mpnet-base") train_examples = [ InputExample(texts=['Anchor 1', 'Positive 1']), InputExample(texts=['Anchor 2', 'Positive 2']), diff --git a/sentence_transformers/losses/MegaBatchMarginLoss.py b/sentence_transformers/losses/MegaBatchMarginLoss.py index b63d05afd..dc63ba6d9 100644 --- a/sentence_transformers/losses/MegaBatchMarginLoss.py +++ b/sentence_transformers/losses/MegaBatchMarginLoss.py @@ -21,12 +21,18 @@ def __init__( Then train as with the triplet loss. 
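A hedged sketch of this loss under the same trainer pattern as the other updated examples; the dataset, margins, and mini_batch_size here are illustrative (per the docstring, mini_batch_size should divide the training batch size): ::

    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
    from datasets import Dataset

    model = SentenceTransformer("microsoft/mpnet-base")
    train_dataset = Dataset.from_dict({
        "anchor": ["It's nice weather outside today.", "He drove to work."],
        "positive": ["It's so sunny.", "He took the car to the office."],
    })
    # The training batch acts as the "mega-batch" from which the hardest negative is mined,
    # so in practice a large per-device batch size is what makes this loss effective.
    loss = losses.MegaBatchMarginLoss(model, positive_margin=0.8, negative_margin=0.3, mini_batch_size=4)

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
    trainer.train()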
- :param model: SentenceTransformerModel - :param positive_margin: Positive margin, cos(anchor, positive) should be > positive_margin - :param negative_margin: Negative margin, cos(anchor, negative) should be < negative_margin - :param use_mini_batched_version: As large batch sizes require a lot of memory, we can use a mini-batched version. - We break down the large batch into smaller batches with fewer examples. - :param mini_batch_size: Size for the mini-batches. Should be a devisor for the batch size in your data loader. + Args: + model: SentenceTransformerModel + positive_margin: Positive margin, cos(anchor, positive) + should be > positive_margin + negative_margin: Negative margin, cos(anchor, negative) + should be < negative_margin + use_mini_batched_version: As large batch sizes require a lot + of memory, we can use a mini-batched version. We break + down the large batch into smaller batches with fewer + examples. + mini_batch_size: Size for the mini-batches. Should be a + divisor for the batch size in your data loader. References: - This loss function was inspired by the ParaNMT paper: https://www.aclweb.org/anthology/P18-1042/ diff --git a/sentence_transformers/losses/MultipleNegativesRankingLoss.py b/sentence_transformers/losses/MultipleNegativesRankingLoss.py index 30c1fd163..78b03303c 100644 --- a/sentence_transformers/losses/MultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/MultipleNegativesRankingLoss.py @@ -24,9 +24,13 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f ``(a_1, p_1, n_1), (a_2, p_2, n_2)``. Then, ``n_1`` is a hard negative for ``(a_1, p_1)``. The loss will use for the pair ``(a_i, p_i)`` all ``p_j`` for ``j != i`` and all ``n_j`` as negatives. - :param model: SentenceTransformer model - :param scale: Output of similarity function is multiplied by scale value - :param similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1) + Args: + model: SentenceTransformer model + scale: Output of similarity function is multiplied by scale + value + similarity_fct: similarity function between sentence + embeddings. By default, cos_sim.
Can also be set to dot + product (and then set scale to 1) References: - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf @@ -60,20 +64,22 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-uncased') - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32) - train_loss = losses.MultipleNegativesRankingLoss(model=model) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.MultipleNegativesRankingLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(MultipleNegativesRankingLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py b/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py index 85b157128..979502dde 100644 --- a/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py +++ b/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py @@ -20,9 +20,13 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Note: If you pass triplets, the negative entry will be ignored. An anchor is just searched for the positive. - :param model: SentenceTransformer model - :param scale: Output of similarity function is multiplied by scale value - :param similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1) + Args: + model: SentenceTransformer model + scale: Output of similarity function is multiplied by scale + value + similarity_fct: similarity function between sentence + embeddings. By default, cos_sim. Can also be set to dot + product (and then set scale to 1) Requirements: 1.
(anchor, positive) pairs @@ -40,20 +44,22 @@ def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_f Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader - - model = SentenceTransformer('distilbert-base-uncased') - train_examples = [ - InputExample(texts=['Anchor 1', 'Positive 1']), - InputExample(texts=['Anchor 2', 'Positive 2']), - ] - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32) - train_loss = losses.MultipleNegativesSymmetricRankingLoss(model=model) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset + + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "anchor": ["It's nice weather outside today.", "He drove to work."], + "positive": ["It's so sunny.", "He took the car to the office."], + }) + loss = losses.MultipleNegativesSymmetricRankingLoss(model) + + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(MultipleNegativesSymmetricRankingLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/OnlineContrastiveLoss.py b/sentence_transformers/losses/OnlineContrastiveLoss.py index 1aeaa0618..d36e61ccf 100644 --- a/sentence_transformers/losses/OnlineContrastiveLoss.py +++ b/sentence_transformers/losses/OnlineContrastiveLoss.py @@ -14,9 +14,13 @@ def __init__( are far apart) and hard negative pairs (negatives that are close) and computes the loss only for these pairs. This loss often yields better performances than ContrastiveLoss. - :param model: SentenceTransformer model - :param distance_metric: Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used - :param margin: Negative samples (label == 0) should have a distance of at least the margin value. + Args: + model: SentenceTransformer model + distance_metric: Function that returns a distance between + two embeddings. The class SiameseDistanceMetric contains + pre-defined metrics that can be used + margin: Negative samples (label == 0) should have a distance + of at least the margin value. 
References: - `Training Examples > Quora Duplicate Questions <../../examples/training/quora_duplicate_questions/README.html>`_ @@ -39,21 +43,23 @@ def __init__( Example: :: - from sentence_transformers import SentenceTransformer, losses, InputExample - from torch.utils.data import DataLoader + from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses + from datasets import Dataset - model = SentenceTransformer('all-MiniLM-L6-v2') - train_examples = [ - InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1), - InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0), - ] + model = SentenceTransformer("microsoft/mpnet-base") + train_dataset = Dataset.from_dict({ + "sentence1": ["It's nice weather outside today.", "He drove to work."], + "sentence2": ["It's so sunny.", "She walked to the store."], + "label": [1, 0], + }) + loss = losses.OnlineContrastiveLoss(model) - train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2) - train_loss = losses.OnlineContrastiveLoss(model=model) - model.fit( - [(train_dataloader, train_loss)], - epochs=10, + trainer = SentenceTransformerTrainer( + model=model, + train_dataset=train_dataset, + loss=loss, ) + trainer.train() """ super(OnlineContrastiveLoss, self).__init__() self.model = model diff --git a/sentence_transformers/losses/SoftmaxLoss.py b/sentence_transformers/losses/SoftmaxLoss.py index c8da160e6..eaf95d4e3 100644 --- a/sentence_transformers/losses/SoftmaxLoss.py +++ b/sentence_transformers/losses/SoftmaxLoss.py @@ -26,13 +26,14 @@ def __init__( :class:`MultipleNegativesRankingLoss` is an alternative loss function that often yields better results, as per https://arxiv.org/abs/2004.09813. - :param model: SentenceTransformer model - :param sentence_embedding_dimension: Dimension of your sentence embeddings - :param num_labels: Number of different labels - :param concatenation_sent_rep: Concatenate vectors u,v for the softmax classifier? - :param concatenation_sent_difference: Add abs(u-v) for the softmax classifier? - :param concatenation_sent_multiplication: Add u*v for the softmax classifier? - :param loss_fct: Optional: Custom pytorch loss function. If not set, uses nn.CrossEntropyLoss() + Args: + model (SentenceTransformer): The SentenceTransformer model. + sentence_embedding_dimension (int): The dimension of the sentence embeddings. + num_labels (int): The number of different labels. + concatenation_sent_rep (bool): Whether to concatenate vectors u,v for the softmax classifier. Defaults to True. + concatenation_sent_difference (bool): Whether to add abs(u-v) for the softmax classifier. Defaults to True. + concatenation_sent_multiplication (bool): Whether to add u*v for the softmax classifier. Defaults to False. + loss_fct (Callable): Custom pytorch loss function. If not set, uses nn.CrossEntropyLoss(). Defaults to nn.CrossEntropyLoss(). 
References:
        - Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks: https://arxiv.org/abs/1908.10084
@@ -51,29 +52,33 @@ def __init__(
        Example:
            ::

-                from sentence_transformers import SentenceTransformer, SentencesDataset, losses
-                from sentence_transformers.readers import InputExample
-                from torch.utils.data import DataLoader
-
-                model = SentenceTransformer('distilbert-base-nli-mean-tokens')
-                train_examples = [
-                    InputExample(texts=['First pair, sent A', 'First pair, sent B'], label=0),
-                    InputExample(texts=['Second pair, sent A', 'Second pair, sent B'], label=1),
-                    InputExample(texts=['Third pair, sent A', 'Third pair, sent B'], label=0),
-                    InputExample(texts=['Fourth pair, sent A', 'Fourth pair, sent B'], label=2),
-                ]
-                train_batch_size = 2
-                train_dataset = SentencesDataset(train_examples, model)
-                train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
-                train_loss = losses.SoftmaxLoss(
+                from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+                from datasets import Dataset
+
+                model = SentenceTransformer("microsoft/mpnet-base")
+                train_dataset = Dataset.from_dict({
+                    "sentence1": [
+                        "A person on a horse jumps over a broken down airplane.",
+                        "A person on a horse jumps over a broken down airplane.",
+                        "A person on a horse jumps over a broken down airplane.",
+                        "Children smiling and waving at camera",
+                    ],
+                    "sentence2": [
+                        "A person is training his horse for a competition.",
+                        "A person is at a diner, ordering an omelette.",
+                        "A person is outdoors, on a horse.",
+                        "There are children present.",
+                    ],
+                    "label": [1, 2, 0, 0],
+                })
+                loss = losses.SoftmaxLoss(model, model.get_sentence_embedding_dimension(), num_labels=3)
+
+                trainer = SentenceTransformerTrainer(
                    model=model,
-                    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
-                    num_labels=len(set(x.label for x in train_examples))
-                )
-                model.fit(
-                    [(train_dataloader, train_loss)],
-                    epochs=10,
+                    train_dataset=train_dataset,
+                    loss=loss,
                )
+                trainer.train()
        """
        super(SoftmaxLoss, self).__init__()
        self.model = model
diff --git a/sentence_transformers/losses/TripletLoss.py b/sentence_transformers/losses/TripletLoss.py
index 1768bc2b7..de44db228 100644
--- a/sentence_transformers/losses/TripletLoss.py
+++ b/sentence_transformers/losses/TripletLoss.py
@@ -6,9 +6,7 @@


class TripletDistanceMetric(Enum):
-    """
-    The metric for the triplet loss
-    """
+    """The metric for the triplet loss"""

    COSINE = lambda x, y: 1 - F.cosine_similarity(x, y)
    EUCLIDEAN = lambda x, y: F.pairwise_distance(x, y, p=2)
@@ -28,10 +26,13 @@ def __init__(

        Margin is an important hyperparameter and needs to be tuned respectively.

-        :param model: SentenceTransformerModel
-        :param distance_metric: Function to compute distance between two embeddings. The class TripletDistanceMetric
-            contains common distance metrices that can be used.
-        :param triplet_margin: The negative should be at least this much further away from the anchor than the positive.
+        Args:
+            model: SentenceTransformerModel
+            distance_metric: Function to compute distance between two
+                embeddings. The class TripletDistanceMetric contains
+                common distance metrics that can be used.
+            triplet_margin: The negative should be at least this much
+                further away from the anchor than the positive.
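As a sketch of how ``distance_metric`` and ``triplet_margin`` interact: with ``TripletDistanceMetric.COSINE`` distances lie in [0, 2], so a smaller margin is appropriate than with Euclidean distance. The values below are illustrative, not recommendations::

    from sentence_transformers import SentenceTransformer, losses
    from sentence_transformers.losses import TripletDistanceMetric

    model = SentenceTransformer("microsoft/mpnet-base")
    loss = losses.TripletLoss(
        model=model,
        distance_metric=TripletDistanceMetric.COSINE,  # 1 - cosine_similarity
        triplet_margin=0.3,  # negative must be >= 0.3 further from the anchor than the positive
    )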
References:
        - For further details, see: https://en.wikipedia.org/wiki/Triplet_loss
@@ -49,23 +50,23 @@ def __init__(
        Example:
            ::

-                from sentence_transformers import SentenceTransformer, SentencesDataset, losses
-                from sentence_transformers.readers import InputExample
-                from torch.utils.data import DataLoader
-
-                model = SentenceTransformer('distilbert-base-nli-mean-tokens')
-                train_examples = [
-                    InputExample(texts=['Anchor 1', 'Positive 1', 'Negative 1']),
-                    InputExample(texts=['Anchor 2', 'Positive 2', 'Negative 2']),
-                ]
-                train_batch_size = 1
-                train_dataset = SentencesDataset(train_examples, model)
-                train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
-                train_loss = losses.TripletLoss(model=model)
-                model.fit(
-                    [(train_dataloader, train_loss)],
-                    epochs=10,
+                from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+                from datasets import Dataset
+
+                model = SentenceTransformer("microsoft/mpnet-base")
+                train_dataset = Dataset.from_dict({
+                    "anchor": ["It's nice weather outside today.", "He drove to work."],
+                    "positive": ["It's so sunny.", "He took the car to the office."],
+                    "negative": ["It's quite rainy, sadly.", "She walked to the store."],
+                })
+                loss = losses.TripletLoss(model=model)
+
+                trainer = SentenceTransformerTrainer(
+                    model=model,
+                    train_dataset=train_dataset,
+                    loss=loss,
                )
+                trainer.train()
        """
        super(TripletLoss, self).__init__()
        self.model = model
diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py
index ff9f68fcb..f41cefefa 100644
--- a/sentence_transformers/model_card.py
+++ b/sentence_transformers/model_card.py
@@ -241,12 +241,10 @@ class SentenceTransformerModelCardData(CardData):
            e.g. ["sentence-transformers", "sentence-similarity", "feature-extraction"].
        generate_widget_examples (`bool`): Whether to generate widget examples on every model save.
-
+
    .. tip::

-        Install [``codecarbon``](https://github.com/mlco2/codecarbon) to automatically track carbon emission usage and
-        include it in your model cards.
-
-
+        Install `codecarbon <https://github.com/mlco2/codecarbon>`_ to automatically track carbon emission usage and
+        include it in your model cards.

    Example::
diff --git a/sentence_transformers/models/Asym.py b/sentence_transformers/models/Asym.py
index ba9c2bc5b..a84911d50 100644
--- a/sentence_transformers/models/Asym.py
+++ b/sentence_transformers/models/Asym.py
@@ -29,9 +29,13 @@ def __init__(self, sub_modules: Dict[str, List[nn.Module]], allow_empty_key: boo
            #You can train it with InputExample like this. Note, that the order must always be the same:
            train_example = InputExample(texts=[{'query': 'Train query'}, {'doc': 'Document'}], label=1)
-
-        :param sub_modules: Dict in the format str -> List[models]. The models in the specified list will be applied for input marked with the respective key.
-        :param allow_empty_key: If true, inputs without a key can be processed. If false, an exception will be thrown if no key is specified.
+        Args:
+            sub_modules: Dict in the format str -> List[models]. The
+                models in the specified list will be applied for input
+                marked with the respective key.
+            allow_empty_key: If true, inputs without a key can be
+                processed. If false, an exception will be thrown if no
+                key is specified.
""" self.sub_modules = sub_modules self.allow_empty_key = allow_empty_key @@ -91,9 +95,7 @@ def save(self, output_path): ) def tokenize(self, texts: Union[List[str], List[Tuple[str, str]]], **kwargs): - """ - Tokenizes a text and maps tokens to token-ids - """ + """Tokenizes a text and maps tokens to token-ids""" if not isinstance(texts[0], dict): raise AttributeError("Asym. model requires that texts are passed as dicts: {'key': 'text'}") diff --git a/sentence_transformers/models/Dense.py b/sentence_transformers/models/Dense.py index 6592e6877..bfd50e5e5 100644 --- a/sentence_transformers/models/Dense.py +++ b/sentence_transformers/models/Dense.py @@ -8,16 +8,19 @@ class Dense(nn.Module): - """Feed-forward function with activiation function. + """ + Feed-forward function with activiation function. This layer takes a fixed-sized sentence embedding and passes it through a feed-forward layer. Can be used to generate deep averaging networks (DAN). - :param in_features: Size of the input dimension - :param out_features: Output size - :param bias: Add a bias vector - :param activation_function: Pytorch activation function applied on output - :param init_weight: Initial value for the matrix of the linear layer - :param init_bias: Initial value for the bias of the linear layer + Args: + in_features: Size of the input dimension + out_features: Output size + bias: Add a bias vector + activation_function: Pytorch activation function applied on + output + init_weight: Initial value for the matrix of the linear layer + init_bias: Initial value for the bias of the linear layer """ def __init__( diff --git a/sentence_transformers/models/Dropout.py b/sentence_transformers/models/Dropout.py index 28514f720..ea353279d 100644 --- a/sentence_transformers/models/Dropout.py +++ b/sentence_transformers/models/Dropout.py @@ -8,7 +8,8 @@ class Dropout(nn.Module): """Dropout layer. - :param dropout: Sets a dropout value for dense layer. + Args: + dropout: Sets a dropout value for dense layer. """ def __init__(self, dropout: float = 0.2): diff --git a/sentence_transformers/models/LSTM.py b/sentence_transformers/models/LSTM.py index a3cee522d..bab555d17 100644 --- a/sentence_transformers/models/LSTM.py +++ b/sentence_transformers/models/LSTM.py @@ -6,9 +6,7 @@ class LSTM(nn.Module): - """ - Bidirectional LSTM running over word embeddings. - """ + """Bidirectional LSTM running over word embeddings.""" def __init__( self, diff --git a/sentence_transformers/models/Normalize.py b/sentence_transformers/models/Normalize.py index f9301a81e..337b92a72 100644 --- a/sentence_transformers/models/Normalize.py +++ b/sentence_transformers/models/Normalize.py @@ -5,9 +5,7 @@ class Normalize(nn.Module): - """ - This layer normalizes embeddings to unit length - """ + """This layer normalizes embeddings to unit length""" def __init__(self): super(Normalize, self).__init__() diff --git a/sentence_transformers/models/Pooling.py b/sentence_transformers/models/Pooling.py index 23b61962e..9cddc7e4f 100644 --- a/sentence_transformers/models/Pooling.py +++ b/sentence_transformers/models/Pooling.py @@ -7,20 +7,33 @@ class Pooling(nn.Module): - """Performs pooling (max or mean) on the token embeddings. + """ + Performs pooling (max or mean) on the token embeddings. Using pooling, it generates from a variable sized sentence a fixed sized sentence embedding. This layer also allows to use the CLS token if it is returned by the underlying word embedding model. You can concatenate multiple poolings together. 
- :param word_embedding_dimension: Dimensions for the word embeddings
-    :param pooling_mode: Either "cls", "lasttoken", "max", "mean", "mean_sqrt_len_tokens", or "weightedmean". If set, overwrites the other pooling_mode_* settings
-    :param pooling_mode_cls_token: Use the first token (CLS token) as text representations
-    :param pooling_mode_max_tokens: Use max in each dimension over all tokens.
-    :param pooling_mode_mean_tokens: Perform mean-pooling
-    :param pooling_mode_mean_sqrt_len_tokens: Perform mean-pooling, but divide by sqrt(input_length).
-    :param pooling_mode_weightedmean_tokens: Perform (position) weighted mean pooling. See `SGPT: GPT Sentence Embeddings for Semantic Search `_.
-    :param pooling_mode_lasttoken: Perform last token pooling. See `SGPT: GPT Sentence Embeddings for Semantic Search `_ and `Text and Code Embeddings by Contrastive Pre-Training `_.
+    Args:
+        word_embedding_dimension: Dimensions for the word embeddings
+        pooling_mode: Either "cls", "lasttoken", "max", "mean",
+            "mean_sqrt_len_tokens", or "weightedmean". If set,
+            overwrites the other pooling_mode_* settings
+        pooling_mode_cls_token: Use the first token (CLS token) as text
+            representations
+        pooling_mode_max_tokens: Use max in each dimension over all
+            tokens.
+        pooling_mode_mean_tokens: Perform mean-pooling
+        pooling_mode_mean_sqrt_len_tokens: Perform mean-pooling, but
+            divide by sqrt(input_length).
+        pooling_mode_weightedmean_tokens: Perform (position) weighted
+            mean pooling. See `SGPT: GPT Sentence Embeddings for
+            Semantic Search <https://arxiv.org/abs/2202.08904>`_.
+        pooling_mode_lasttoken: Perform last token pooling. See `SGPT:
+            GPT Sentence Embeddings for Semantic Search
+            <https://arxiv.org/abs/2202.08904>`_ and `Text and Code
+            Embeddings by Contrastive Pre-Training
+            <https://arxiv.org/abs/2201.10005>`_.
    """

    POOLING_MODES = (
@@ -98,9 +111,7 @@ def __repr__(self):
        return "Pooling({})".format(self.get_config_dict())

    def get_pooling_mode_str(self) -> str:
-        """
-        Returns the pooling mode as string
-        """
+        """Returns the pooling mode as string"""
        modes = []
        if self.pooling_mode_cls_token:
            modes.append("cls")
diff --git a/sentence_transformers/models/Transformer.py b/sentence_transformers/models/Transformer.py
index 866d34259..f9d94e2d1 100644
--- a/sentence_transformers/models/Transformer.py
+++ b/sentence_transformers/models/Transformer.py
@@ -9,14 +9,22 @@
class Transformer(nn.Module):
    """Huggingface AutoModel to generate token embeddings.
    Loads the correct class, e.g. BERT / RoBERTa etc.

-    :param model_name_or_path: Huggingface models name (https://huggingface.co/models)
-    :param max_seq_length: Truncate any inputs longer than max_seq_length
-    :param model_args: Keyword arguments passed to the Huggingface Transformers model
-    :param tokenizer_args: Keyword arguments passed to the Huggingface Transformers tokenizer
-    :param config_args: Keyword arguments passed to the Huggingface Transformers config
-    :param cache_dir: Cache dir for Huggingface Transformers to store/load models
-    :param do_lower_case: If true, lowercases the input (independent if the model is cased or not)
-    :param tokenizer_name_or_path: Name or path of the tokenizer.
When None, then model_name_or_path is used
+    Args:
+        model_name_or_path: Huggingface model name
+            (https://huggingface.co/models)
+        max_seq_length: Truncate any inputs longer than max_seq_length
+        model_args: Keyword arguments passed to the Huggingface
+            Transformers model
+        tokenizer_args: Keyword arguments passed to the Huggingface
+            Transformers tokenizer
+        config_args: Keyword arguments passed to the Huggingface
+            Transformers config
+        cache_dir: Cache dir for Huggingface Transformers to store/load
+            models
+        do_lower_case: If true, lowercases the input (regardless of
+            whether the model is cased or not)
+        tokenizer_name_or_path: Name or path of the tokenizer. When
+            None, then model_name_or_path is used
    """

    def __init__(
@@ -124,9 +132,7 @@ def get_word_embedding_dimension(self) -> int:
        return self.auto_model.config.hidden_size

    def tokenize(self, texts: Union[List[str], List[Dict], List[Tuple[str, str]]], padding: Union[str, bool] = True):
-        """
-        Tokenizes a text and maps tokens to token-ids
-        """
+        """Tokenizes a text and maps tokens to token-ids"""
        output = {}
        if isinstance(texts[0], str):
            to_tokenize = [texts]
diff --git a/sentence_transformers/models/WeightedLayerPooling.py b/sentence_transformers/models/WeightedLayerPooling.py
index d2c8fd92c..33d5f4406 100644
--- a/sentence_transformers/models/WeightedLayerPooling.py
+++ b/sentence_transformers/models/WeightedLayerPooling.py
@@ -7,9 +7,7 @@


class WeightedLayerPooling(nn.Module):
-    """
-    Token embeddings are weighted mean of their different hidden layer representations
-    """
+    """Token embeddings are weighted mean of their different hidden layer representations"""

    def __init__(
        self, word_embedding_dimension, num_hidden_layers: int = 12, layer_start: int = 4, layer_weights=None
diff --git a/sentence_transformers/models/WordWeights.py b/sentence_transformers/models/WordWeights.py
index 42d2c21a3..3e53738bd 100644
--- a/sentence_transformers/models/WordWeights.py
+++ b/sentence_transformers/models/WordWeights.py
@@ -15,13 +15,14 @@ class WordWeights(nn.Module):

    def __init__(self, vocab: List[str], word_weights: Dict[str, float], unknown_word_weight: float = 1):
        """
-
-        :param vocab:
-            Vocabulary of the tokenizer
-        :param word_weights:
-            Mapping of tokens to a float weight value. Words embeddings are multiplied by this float value. Tokens in word_weights must not be equal to the vocab (can contain more or less values)
-        :param unknown_word_weight:
-            Weight for words in vocab, that do not appear in the word_weights lookup. These can be for example rare words in the vocab, where no weight exists.
+        Initializes the WordWeights class.
+
+        Args:
+            vocab (List[str]): Vocabulary of the tokenizer.
+            word_weights (Dict[str, float]): Mapping of tokens to a float weight value. Word embeddings are multiplied
+                by this float value. Tokens in word_weights need not match the vocab exactly (it can contain more or fewer values).
+            unknown_word_weight (float, optional): Weight for words in vocab that do not appear in the word_weights lookup.
+                These can be, for example, rare words in the vocab where no weight exists. Defaults to 1.
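As a concrete illustration, idf-style weights can be plugged in directly; a small sketch (the vocabulary and weight values below are made up)::

    from sentence_transformers.models import WordWeights

    vocab = ["the", "quick", "brown", "fox"]
    word_weights = WordWeights(
        vocab=vocab,
        word_weights={"quick": 2.5, "brown": 3.1, "fox": 4.0},  # e.g. idf values
        unknown_word_weight=1.0,  # used for "the", which has no entry above
    )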
""" super(WordWeights, self).__init__() self.config_keys = ["vocab", "word_weights", "unknown_word_weight"] diff --git a/sentence_transformers/quantization.py b/sentence_transformers/quantization.py index 06aa7400a..d958b1a34 100644 --- a/sentence_transformers/quantization.py +++ b/sentence_transformers/quantization.py @@ -37,31 +37,53 @@ def semantic_search_faiss( Only if these conditions are true, will we search for `top_k * rescore_multiplier` samples and then rescore to only keep `top_k`. - :param query_embeddings: Embeddings of the query sentences. Ideally not quantized to allow for rescoring. - :param corpus_embeddings: Embeddings of the corpus sentences. Either `corpus_embeddings` or `corpus_index` should - be used, not both. The embeddings can be quantized to "int8" or "binary" for more efficient search. - :param corpus_index: FAISS index for the corpus sentences. Either `corpus_embeddings` or `corpus_index` should - be used, not both. - :param corpus_precision: Precision of the corpus embeddings. The options are "float32", "int8", or "binary". - Default is "float32". - :param top_k: Number of top results to retrieve. Default is 10. - :param ranges: Ranges for quantization of embeddings. This is only used for int8 quantization, where the ranges - refers to the minimum and maximum values for each dimension. So, it's a 2D array with shape (2, embedding_dim). - Default is None, which means that the ranges will be calculated from the calibration embeddings. - :param calibration_embeddings: Embeddings used for calibration during quantization. This is only used for int8 - quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum - values for each dimension. Default is None, which means that the ranges will be calculated from the query - embeddings. This is not recommended. - :param rescore: Whether to perform rescoring. Note that rescoring still will only be used if the query embeddings - are not quantized and the corpus is quantized, i.e. the corpus precision is not "float32". Default is True. - :param rescore_multiplier: Oversampling factor for rescoring. The code will now search `top_k * rescore_multiplier` samples - and then rescore to only keep `top_k`. Default is 2. - :param exact: Whether to use exact search or approximate search. Default is True. - :param output_index: Whether to output the FAISS index used for the search. Default is False. - - :return: A tuple containing a list of search results and the time taken for the search. If `output_index` is True, - the tuple will also contain the FAISS index used for the search. - :raises ValueError: If both `corpus_embeddings` and `corpus_index` are provided or if neither is provided. + Args: + query_embeddings: Embeddings of the query sentences. Ideally not + quantized to allow for rescoring. + corpus_embeddings: Embeddings of the corpus sentences. Either + `corpus_embeddings` or `corpus_index` should be used, not + both. The embeddings can be quantized to "int8" or "binary" + for more efficient search. + corpus_index: FAISS index for the corpus sentences. Either + `corpus_embeddings` or `corpus_index` should be used, not + both. + corpus_precision: Precision of the corpus embeddings. The + options are "float32", "int8", or "binary". Default is + "float32". + top_k: Number of top results to retrieve. Default is 10. + ranges: Ranges for quantization of embeddings. This is only used + for int8 quantization, where the ranges refers to the + minimum and maximum values for each dimension. 
So, it's a 2D + array with shape (2, embedding_dim). Default is None, which + means that the ranges will be calculated from the + calibration embeddings. + calibration_embeddings: Embeddings used for calibration during + quantization. This is only used for int8 quantization, where + the calibration embeddings can be used to compute ranges, + i.e. the minimum and maximum values for each dimension. + Default is None, which means that the ranges will be + calculated from the query embeddings. This is not + recommended. + rescore: Whether to perform rescoring. Note that rescoring still + will only be used if the query embeddings are not quantized + and the corpus is quantized, i.e. the corpus precision is + not "float32". Default is True. + rescore_multiplier: Oversampling factor for rescoring. The code + will now search `top_k * rescore_multiplier` samples and + then rescore to only keep `top_k`. Default is 2. + exact: Whether to use exact search or approximate search. + Default is True. + output_index: Whether to output the FAISS index used for the + search. Default is False. + + Returns: + A tuple containing a list of search results and the time taken + for the search. If `output_index` is True, the tuple will also + contain the FAISS index used for the search. + + Raises: + ValueError: If both `corpus_embeddings` and `corpus_index` are + provided or if neither is provided. The list of search results is in the format: [[{"corpus_id": int, "score": float}, ...], ...] The time taken for the search is a float value. @@ -182,31 +204,53 @@ def semantic_search_usearch( Only if these conditions are true, will we search for `top_k * rescore_multiplier` samples and then rescore to only keep `top_k`. - :param query_embeddings: Embeddings of the query sentences. Ideally not quantized to allow for rescoring. - :param corpus_embeddings: Embeddings of the corpus sentences. Either `corpus_embeddings` or `corpus_index` should - be used, not both. The embeddings can be quantized to "int8" or "binary" for more efficient search. - :param corpus_index: usearch index for the corpus sentences. Either `corpus_embeddings` or `corpus_index` should - be used, not both. - :param corpus_precision: Precision of the corpus embeddings. The options are "float32", "int8", or "binary". - Default is "float32". - :param top_k: Number of top results to retrieve. Default is 10. - :param ranges: Ranges for quantization of embeddings. This is only used for int8 quantization, where the ranges - refers to the minimum and maximum values for each dimension. So, it's a 2D array with shape (2, embedding_dim). - Default is None, which means that the ranges will be calculated from the calibration embeddings. - :param calibration_embeddings: Embeddings used for calibration during quantization. This is only used for int8 - quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum - values for each dimension. Default is None, which means that the ranges will be calculated from the query - embeddings. This is not recommended. - :param rescore: Whether to perform rescoring. Note that rescoring still will only be used if the query embeddings - are not quantized and the corpus is quantized, i.e. the corpus precision is not "float32". Default is True. - :param rescore_multiplier: Oversampling factor for rescoring. The code will now search `top_k * rescore_multiplier` samples - and then rescore to only keep `top_k`. Default is 2. - :param exact: Whether to use exact search or approximate search. 
Default is True. - :param output_index: Whether to output the usearch index used for the search. Default is False. - - :return: A tuple containing a list of search results and the time taken for the search. If `output_index` is True, - the tuple will also contain the usearch index used for the search. - :raises ValueError: If both `corpus_embeddings` and `corpus_index` are provided or if neither is provided. + Args: + query_embeddings: Embeddings of the query sentences. Ideally not + quantized to allow for rescoring. + corpus_embeddings: Embeddings of the corpus sentences. Either + `corpus_embeddings` or `corpus_index` should be used, not + both. The embeddings can be quantized to "int8" or "binary" + for more efficient search. + corpus_index: usearch index for the corpus sentences. Either + `corpus_embeddings` or `corpus_index` should be used, not + both. + corpus_precision: Precision of the corpus embeddings. The + options are "float32", "int8", or "binary". Default is + "float32". + top_k: Number of top results to retrieve. Default is 10. + ranges: Ranges for quantization of embeddings. This is only used + for int8 quantization, where the ranges refers to the + minimum and maximum values for each dimension. So, it's a 2D + array with shape (2, embedding_dim). Default is None, which + means that the ranges will be calculated from the + calibration embeddings. + calibration_embeddings: Embeddings used for calibration during + quantization. This is only used for int8 quantization, where + the calibration embeddings can be used to compute ranges, + i.e. the minimum and maximum values for each dimension. + Default is None, which means that the ranges will be + calculated from the query embeddings. This is not + recommended. + rescore: Whether to perform rescoring. Note that rescoring still + will only be used if the query embeddings are not quantized + and the corpus is quantized, i.e. the corpus precision is + not "float32". Default is True. + rescore_multiplier: Oversampling factor for rescoring. The code + will now search `top_k * rescore_multiplier` samples and + then rescore to only keep `top_k`. Default is 2. + exact: Whether to use exact search or approximate search. + Default is True. + output_index: Whether to output the usearch index used for the + search. Default is False. + + Returns: + A tuple containing a list of search results and the time taken + for the search. If `output_index` is True, the tuple will also + contain the usearch index used for the search. + + Raises: + ValueError: If both `corpus_embeddings` and `corpus_index` are + provided or if neither is provided. The list of search results is in the format: [[{"corpus_id": int, "score": float}, ...], ...] The time taken for the search is a float value. @@ -327,18 +371,27 @@ def quantize_embeddings( Quantizes embeddings to a lower precision. This can be used to reduce the memory footprint and increase the speed of similarity search. The supported precisions are "float32", "int8", "uint8", "binary", and "ubinary". - :param embeddings: Unquantized (e.g. float) embeddings with to quantize to a given precision - :param precision: The precision to convert to. Options are "float32", "int8", "uint8", "binary", "ubinary". - :param ranges: Ranges for quantization of embeddings. This is only used for int8 quantization, where the ranges - refers to the minimum and maximum values for each dimension. So, it's a 2D array with shape (2, embedding_dim). 
- Default is None, which means that the ranges will be calculated from the calibration embeddings.
-    :type ranges: Optional[np.ndarray]
-    :param calibration_embeddings: Embeddings used for calibration during quantization. This is only used for int8
-        quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum
-        values for each dimension. Default is None, which means that the ranges will be calculated from the query
-        embeddings. This is not recommended.
-    :type calibration_embeddings: Optional[np.ndarray]
-    :return: Quantized embeddings with the specified precision
+    Args:
+        embeddings: Unquantized (e.g. float) embeddings to quantize
+            to a given precision
+        precision: The precision to convert to. Options are "float32",
+            "int8", "uint8", "binary", "ubinary".
+        ranges (Optional[np.ndarray]): Ranges for quantization of
+            embeddings. This is only used for int8 quantization, where
+            the ranges refer to the minimum and maximum values for each
+            dimension. So, it's a 2D array with shape (2,
+            embedding_dim). Default is None, which means that the ranges
+            will be calculated from the calibration embeddings.
+        calibration_embeddings (Optional[np.ndarray]): Embeddings used
+            for calibration during quantization. This is only used for
+            int8 quantization, where the calibration embeddings can be
+            used to compute ranges, i.e. the minimum and maximum values
+            for each dimension. Default is None, which means that the
+            ranges will be calculated from the query embeddings. This is
+            not recommended.
+
+    Returns:
+        Quantized embeddings with the specified precision
    """
    if isinstance(embeddings, Tensor):
        embeddings = embeddings.cpu().numpy()
diff --git a/sentence_transformers/readers/InputExample.py b/sentence_transformers/readers/InputExample.py
index 80e93c56f..1e0f6bbd2 100644
--- a/sentence_transformers/readers/InputExample.py
+++ b/sentence_transformers/readers/InputExample.py
@@ -2,21 +2,16 @@


class InputExample:
-    """
-    Structure for one input example with texts, the label and a unique id
-    """
+    """Structure for one input example with texts, the label and a unique id"""

    def __init__(self, guid: str = "", texts: List[str] = None, label: Union[int, float] = 0):
        """
        Creates one InputExample with the given texts, guid and label
-
-        :param guid
-            id for the example
-        :param texts
-            the texts for the example.
-        :param label
-            the label for the example
+        Args:
+            guid: id for the example
+            texts: the texts for the example.
+            label: the label for the example
        """
        self.guid = guid
        self.texts = texts
diff --git a/sentence_transformers/readers/LabelSentenceReader.py b/sentence_transformers/readers/LabelSentenceReader.py
index 47e2c77eb..70b28c7ef 100644
--- a/sentence_transformers/readers/LabelSentenceReader.py
+++ b/sentence_transformers/readers/LabelSentenceReader.py
@@ -5,7 +5,8 @@
class LabelSentenceReader:
    """Reads in a file that has at least two columns: a label and a sentence.
    This reader can for example be used with the BatchHardTripletLoss.
- Maps labels automatically to integers"""
+    Maps labels automatically to integers
+    """

    def __init__(self, folder, label_col_idx=0, sentence_col_idx=1, separator="\t"):
        self.folder = folder
diff --git a/sentence_transformers/readers/NLIDataReader.py b/sentence_transformers/readers/NLIDataReader.py
index 2112ba290..2d78a5a8f 100644
--- a/sentence_transformers/readers/NLIDataReader.py
+++ b/sentence_transformers/readers/NLIDataReader.py
@@ -4,9 +4,7 @@


class NLIDataReader(object):
-    """
-    Reads in the Stanford NLI dataset and the MultiGenre NLI dataset
-    """
+    """Reads in the Stanford NLI dataset and the MultiGenre NLI dataset"""

    def __init__(self, dataset_folder):
        self.dataset_folder = dataset_folder
diff --git a/sentence_transformers/readers/PairedFilesReader.py b/sentence_transformers/readers/PairedFilesReader.py
index 9c1a94d86..2a1c16495 100644
--- a/sentence_transformers/readers/PairedFilesReader.py
+++ b/sentence_transformers/readers/PairedFilesReader.py
@@ -3,15 +3,12 @@


class PairedFilesReader(object):
-    """
-    Reads in the a Pair Dataset, split in two files
-    """
+    """Reads in a Pair Dataset, split in two files"""

    def __init__(self, filepaths):
        self.filepaths = filepaths

    def get_examples(self, max_examples=0):
-        """ """
        fIns = []
        for filepath in self.filepaths:
            fIn = (
diff --git a/sentence_transformers/readers/STSDataReader.py b/sentence_transformers/readers/STSDataReader.py
index 5d000282d..e9a6e7600 100644
--- a/sentence_transformers/readers/STSDataReader.py
+++ b/sentence_transformers/readers/STSDataReader.py
@@ -5,8 +5,7 @@


class STSDataReader:
-    """
-    Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx)
+    """Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx)

    Default values expects a tab separated file with the first & second column the sentence pair and third column the score (0...1).
    Default config normalizes scores from 0...5 to 0...1
    """
@@ -34,9 +33,7 @@ def __init__(
        self.max_score = max_score

    def get_examples(self, filename, max_examples=0):
-        """
-        filename specified which data split to use (train.csv, dev.csv, test.csv).
-        """
+        """filename specifies which data split to use (train.csv, dev.csv, test.csv)."""
        filepath = os.path.join(self.dataset_folder, filename)
        with gzip.open(filepath, "rt", encoding="utf8") if filename.endswith(".gz") else open(
            filepath, encoding="utf-8"
@@ -59,8 +56,7 @@ def get_examples(self, filename, max_examples=0):

class STSBenchmarkDataReader(STSDataReader):
-    """
-    Reader especially for the STS benchmark dataset. There, the sentences are in column 5 and 6, the score is in column 4.
+    """Reader especially for the STS benchmark dataset. There, the sentences are in columns 5 and 6, the score is in column 4.
Scores are normalized from 0...5 to 0...1
    """
diff --git a/sentence_transformers/readers/TripletReader.py b/sentence_transformers/readers/TripletReader.py
index 6045ef697..99e1ff0f2 100644
--- a/sentence_transformers/readers/TripletReader.py
+++ b/sentence_transformers/readers/TripletReader.py
@@ -4,8 +4,7 @@


class TripletReader(object):
-    """
-    Reads in the a Triplet Dataset: Each line contains (at least) 3 columns, one anchor column (s1),
+    """Reads in a Triplet Dataset: Each line contains (at least) 3 columns, one anchor column (s1),
    one positive example (s2) and one negative example (s3)
    """
@@ -28,7 +27,6 @@ def __init__(
        self.quoting = quoting

    def get_examples(self, filename, max_examples=0):
-        """ """
        data = csv.reader(
            open(os.path.join(self.dataset_folder, filename), encoding="utf-8"),
            delimiter=self.delimiter,
diff --git a/sentence_transformers/similarity_functions.py b/sentence_transformers/similarity_functions.py
index b97e7ec92..589d0404a 100644
--- a/sentence_transformers/similarity_functions.py
+++ b/sentence_transformers/similarity_functions.py
@@ -16,6 +16,15 @@


class SimilarityFunction(Enum):
+    """
+    Enum class for supported similarity functions. The following functions are supported:
+
+    - ``SimilarityFunction.COSINE`` (``"cosine"``): Cosine similarity
+    - ``SimilarityFunction.DOT_PRODUCT`` (``"dot"``, ``dot_product``): Dot product similarity
+    - ``SimilarityFunction.EUCLIDEAN`` (``"euclidean"``): Euclidean distance
+    - ``SimilarityFunction.MANHATTAN`` (``"manhattan"``): Manhattan distance
+    """
+
    COSINE = "cosine"
    DOT_PRODUCT = "dot"
    DOT = "dot"  # Alias for DOT_PRODUCT
@@ -26,6 +35,25 @@
    def to_similarity_fn(
        similarity_function: Union[str, "SimilarityFunction"],
    ) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]:
+        """
+        Converts a similarity function name or enum value to the corresponding similarity function.
+
+        Args:
+            similarity_function (Union[str, SimilarityFunction]): The name or enum value of the similarity function.
+
+        Returns:
+            Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]: The corresponding similarity function.
+
+        Raises:
+            ValueError: If the provided function is not supported.
+
+        Example:
+            >>> similarity_fn = SimilarityFunction.to_similarity_fn("cosine")
+            >>> similarity_scores = similarity_fn(embeddings1, embeddings2)
+            >>> similarity_scores
+            tensor([[0.3952, 0.0554],
+                    [0.0992, 0.1570]])
+        """
        similarity_function = SimilarityFunction(similarity_function)

        if similarity_function == SimilarityFunction.COSINE:
@@ -47,6 +75,28 @@
    def to_similarity_pairwise_fn(
        similarity_function: Union[str, "SimilarityFunction"],
    ) -> Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]:
+        """
+        Converts a similarity function into a pairwise similarity function.
+
+        The pairwise similarity function returns the diagonal vector from the similarity matrix, i.e. it only
+        computes the similarity(a[i], b[i]) for each i in the range of the input tensors, rather than
+        computing the similarity between all pairs of a and b.
+
+        Args:
+            similarity_function (Union[str, SimilarityFunction]): The name or enum value of the similarity function.
+
+        Returns:
+            Callable[[Union[Tensor, ndarray], Union[Tensor, ndarray]], Tensor]: The pairwise similarity function.
+
+        Raises:
+            ValueError: If the provided similarity function is not supported.
+
+        Example:
+            >>> pairwise_fn = SimilarityFunction.to_similarity_pairwise_fn("cosine")
+            >>> similarity_scores = pairwise_fn(embeddings1, embeddings2)
+            >>> similarity_scores
+            tensor([0.3952, 0.1570])
+        """
        similarity_function = SimilarityFunction(similarity_function)

        if similarity_function == SimilarityFunction.COSINE:
@@ -66,4 +116,15 @@ def to_similarity_pairwise_fn(

    @staticmethod
    def possible_values():
+        """
+        Returns a list of possible values for the SimilarityFunction enum.
+
+        Returns:
+            list: A list of possible values for the SimilarityFunction enum.
+
+        Example:
+            >>> possible_values = SimilarityFunction.possible_values()
+            >>> possible_values
+            ['cosine', 'dot', 'euclidean', 'manhattan']
+        """
        return [m.value for m in SimilarityFunction]
diff --git a/sentence_transformers/trainer.py b/sentence_transformers/trainer.py
index 532e4c009..0620d5f38 100644
--- a/sentence_transformers/trainer.py
+++ b/sentence_transformers/trainer.py
@@ -43,6 +43,72 @@
class SentenceTransformerTrainer(Trainer):
+    """
+    SentenceTransformerTrainer is a simple but feature-complete training and eval loop for PyTorch
+    based on the 🤗 Transformers :class:`~transformers.Trainer`.
+
+    This trainer integrates support for various :class:`transformers.TrainerCallback` subclasses, such as:
+
+    - :class:`~transformers.integrations.WandbCallback` to automatically log training metrics to W&B if `wandb` is installed
+    - :class:`~transformers.integrations.TensorBoardCallback` to log training metrics to TensorBoard if `tensorboard` is accessible.
+    - :class:`~transformers.integrations.CodeCarbonCallback` to track the carbon emissions of your model during training if `codecarbon` is installed.
+
+        - Note: These carbon emissions will be included in your automatically generated model card.
+
+    See the Transformers `Callbacks <https://huggingface.co/docs/transformers/main_classes/callback>`_
+    documentation for more information on the integrated callbacks and how to write your own callbacks.
+
+    Args:
+        model (:class:`~sentence_transformers.SentenceTransformer`, *optional*):
+            The model to train, evaluate or use for predictions. If not provided, a `model_init` must be passed.
+        args (:class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`, *optional*):
+            The arguments to tweak for training. Will default to a basic instance of
+            :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` with the
+            `output_dir` set to a directory named *tmp_trainer* in the current directory if not provided.
+        train_dataset (Union[:class:`datasets.Dataset`, :class:`datasets.DatasetDict`, Dict[str, :class:`datasets.Dataset`]], *optional*):
+            The dataset to use for training. Must have a format accepted by your loss function, see
+            `Training Overview > Dataset Format <../../../docs/sentence_transformer/training_overview.html#dataset-format>`_.
+        eval_dataset (Union[:class:`datasets.Dataset`, :class:`datasets.DatasetDict`, Dict[str, :class:`datasets.Dataset`]], *optional*):
+            The dataset to use for evaluation. Must have a format accepted by your loss function, see
+            `Training Overview > Dataset Format <../../../docs/sentence_transformer/training_overview.html#dataset-format>`_.
+        loss (Optional[Union[:class:`torch.nn.Module`, Dict[str, :class:`torch.nn.Module`],\
+            Callable[[:class:`~sentence_transformers.SentenceTransformer`], :class:`torch.nn.Module`],\
+            Dict[str, Callable[[:class:`~sentence_transformers.SentenceTransformer`], :class:`torch.nn.Module`]]]], *optional*):
+            The loss function to use for training.
Can either be a loss class instance, a dictionary mapping dataset names to
+            loss class instances, a function that returns a loss class instance given a model, or a dictionary mapping
+            dataset names to functions that return a loss class instance given a model. In practice, the latter two
+            are primarily used for hyper-parameter optimization. Will default to
+            :class:`~sentence_transformers.losses.CoSENTLoss` if no ``loss`` is provided.
+        evaluator (:class:`~sentence_transformers.evaluation.SentenceEvaluator`, *optional*):
+            The evaluator class to use for evaluation alongside the evaluation dataset. An evaluator will display more
+            useful metrics than the loss function.
+        callbacks (List of [:class:`transformers.TrainerCallback`], *optional*):
+            A list of callbacks to customize the training loop. Will add those to the list of default callbacks
+            detailed in the Transformers `Callbacks <https://huggingface.co/docs/transformers/main_classes/callback>`_ documentation.
+
+            If you want to remove one of the default callbacks used, use the :meth:`~transformers.Trainer.remove_callback` method.
+        optimizers (`Tuple[:class:`torch.optim.Optimizer`, :class:`torch.optim.lr_scheduler.LambdaLR`]`, *optional*, defaults to `(None, None)`):
+            A tuple containing the optimizer and the scheduler to use. Will default to an instance of :class:`torch.optim.AdamW`
+            on your model and a scheduler given by :func:`transformers.get_linear_schedule_with_warmup` controlled by `args`.
+
+    Important attributes:
+
+        - **model** -- Always points to the core model. If using a transformers model, it will be a [`PreTrainedModel`]
+          subclass.
+        - **model_wrapped** -- Always points to the most external model in case one or more other modules wrap the
+          original model. This is the model that should be used for the forward pass. For example, under `DeepSpeed`,
+          the inner model is wrapped in `DeepSpeed` and then again in `torch.nn.DistributedDataParallel`. If the inner
+          model hasn't been wrapped, then `self.model_wrapped` is the same as `self.model`.
+        - **is_model_parallel** -- Whether or not a model has been switched to a model parallel mode (different from
+          data parallelism, this means some of the model layers are split on different GPUs).
+        - **place_model_on_device** -- Whether or not to automatically place the model on the device - it will be set
+          to `False` if model parallel or deepspeed is used, or if the default
+          `TrainingArguments.place_model_on_device` is overridden to return `False` .
+        - **is_in_train** -- Whether or not a model is currently running `train` (e.g. when `evaluate` is called while
+          in `train`)
+
+    """
+
    def __init__(
        self,
        model: Optional["SentenceTransformer"] = None,
@@ -207,7 +273,24 @@ def compute_loss(
        model: "SentenceTransformer",
        inputs: Dict[str, Union[torch.Tensor, Any]],
        return_outputs: bool = False,
-    ) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
+    ) -> Union[torch.Tensor, Tuple[torch.Tensor, Dict[str, Any]]]:
+        """
+        Computes the loss for the SentenceTransformer model.
+
+        It uses ``self.loss`` to compute the loss, which can be a single loss function or a dictionary of loss functions
+        for different datasets. If the loss is a dictionary, the dataset name is expected to be passed in the inputs
+        under the key "dataset_name". This is done automatically in the ``add_dataset_name_column`` method.
+        Note that even if ``return_outputs = True``, the outputs will be empty, as the SentenceTransformers losses do not
+        return outputs.
+
+        Args:
+            model (SentenceTransformer): The SentenceTransformer model.
+            inputs (Dict[str, Union[torch.Tensor, Any]]): The input data for the model.
+ return_outputs (bool, optional): Whether to return the outputs along with the loss. Defaults to False. + + Returns: + Union[torch.Tensor, Tuple[torch.Tensor, Dict[str, Any]]]: The computed loss. If `return_outputs` is True, returns a tuple of loss and outputs. Otherwise, returns only the loss. + """ dataset_name = inputs.pop("dataset_name", None) features, labels = self.collect_features(inputs) loss_fn = self.loss @@ -634,3 +717,9 @@ def _load_from_checkpoint(self, checkpoint_path: str) -> None: loaded_model = SentenceTransformer(checkpoint_path) self.model.load_state_dict(loaded_model.state_dict()) + + def create_model_card(self, *args, **kwargs): + raise NotImplementedError( + "SentenceTransformers does not implement the `create_model_card` method in its Trainer. " + "Instead, consider calling SentenceTransformer._create_model_card(path)." + ) diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py index f28f0d359..5f4386768 100644 --- a/sentence_transformers/training_args.py +++ b/sentence_transformers/training_args.py @@ -7,16 +7,32 @@ class BatchSamplers(ExplicitEnum): """ Stores the acceptable string identifiers for batch samplers. + + The batch sampler is responsible for determining how samples are grouped into batches during training. + Valid options are: + + - ``BatchSamplers.BATCH_SAMPLER``: The default PyTorch batch sampler. + - ``BatchSamplers.NO_DUPLICATES``: Ensures no duplicate samples in a batch. + - ``BatchSamplers.GROUP_BY_LABEL``: Ensures each batch has 2+ samples from the same label. """ - BATCH_SAMPLER = "batch_sampler" # Just the default PyTorch batch sampler [default] - NO_DUPLICATES = "no_duplicates" # Ensures no duplicate samples in a batch - GROUP_BY_LABEL = "group_by_label" # Ensure each batch has 2+ samples from the same label + BATCH_SAMPLER = "batch_sampler" + NO_DUPLICATES = "no_duplicates" + GROUP_BY_LABEL = "group_by_label" class MultiDatasetBatchSamplers(ExplicitEnum): """ Stores the acceptable string identifiers for multi-dataset batch samplers. + + The multi-dataset batch sampler is responsible for determining in what order batches are sampled from multiple + datasets during training. Valid options are: + + - ``MultiDatasetBatchSamplers.ROUND_ROBIN``: Round-robin sampling from each dataset until one is exhausted. + With this strategy, it's likely that not all samples from each dataset are used, but each dataset is sampled + from equally. + - ``MultiDatasetBatchSamplers.PROPORTIONAL``: Sample from each dataset in proportion to its size [default]. + With this strategy, all samples from each dataset are used and larger datasets are sampled from more frequently. """ ROUND_ROBIN = "round_robin" # Round-robin sampling from each dataset @@ -25,6 +41,22 @@ class MultiDatasetBatchSamplers(ExplicitEnum): @dataclass class SentenceTransformerTrainingArguments(TransformersTrainingArguments): + """ + SentenceTransformerTrainingArguments extends :class:`~transformers.TrainingArguments` with additional arguments + specific to Sentence Transformers. See :class:`~transformers.TrainingArguments` for the complete list of + available arguments. + + Args: + output_dir (`str`): + The output directory where the model checkpoints will be written. + batch_sampler (Union[:class:`~sentence_transformers.training_args.BatchSamplers`, `str`], *optional*): + The batch sampler to use. See :class:`~sentence_transformers.training_args.BatchSamplers` for valid options. + Defaults to ``BatchSamplers.BATCH_SAMPLER``. 
+ multi_dataset_batch_sampler (Union[:class:`~sentence_transformers.training_args.MultiDatasetBatchSamplers`, `str`], *optional*): + The multi-dataset batch sampler to use. See :class:`~sentence_transformers.training_args.MultiDatasetBatchSamplers` + for valid options. Defaults to ``MultiDatasetBatchSamplers.PROPORTIONAL``. + """ + batch_sampler: Union[BatchSamplers, str] = field( default=BatchSamplers.BATCH_SAMPLER, metadata={"help": "The batch sampler to use."} ) diff --git a/sentence_transformers/util.py b/sentence_transformers/util.py index 28b2376e9..30dbeaa61 100644 --- a/sentence_transformers/util.py +++ b/sentence_transformers/util.py @@ -21,18 +21,45 @@ def _convert_to_tensor(a: Union[list, np.ndarray, Tensor]) -> Tensor: + """ + Converts the input `a` to a PyTorch tensor if it is not already a tensor. + + Args: + a (Union[list, np.ndarray, Tensor]): The input array or tensor. + + Returns: + Tensor: The converted tensor. + """ if not isinstance(a, Tensor): a = torch.tensor(a) return a def _convert_to_batch(a: Tensor) -> Tensor: + """ + If the tensor `a` is 1-dimensional, it is unsqueezed to add a batch dimension. + + Args: + a (Tensor): The input tensor. + + Returns: + Tensor: The tensor with a batch dimension. + """ if a.dim() == 1: a = a.unsqueeze(0) return a def _convert_to_batch_tensor(a: Union[list, np.ndarray, Tensor]) -> Tensor: + """ + Converts the input data to a tensor with a batch dimension. + + Args: + a (Union[list, np.ndarray, Tensor]): The input data to be converted. + + Returns: + Tensor: The converted tensor with a batch dimension. + """ a = _convert_to_tensor(a) a = _convert_to_batch(a) return a @@ -40,18 +67,28 @@ def _convert_to_batch_tensor(a: Union[list, np.ndarray, Tensor]) -> Tensor: def pytorch_cos_sim(a: Tensor, b: Tensor) -> Tensor: """ - Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. + Computes the cosine similarity between two tensors. - :return: Matrix with res[i][j] = cos_sim(a[i], b[j]) + Args: + a (Union[list, np.ndarray, Tensor]): The first tensor. + b (Union[list, np.ndarray, Tensor]): The second tensor. + + Returns: + Tensor: Matrix with res[i][j] = cos_sim(a[i], b[j]) """ return cos_sim(a, b) def cos_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]) -> Tensor: """ - Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. + Computes the cosine similarity between two tensors. + + Args: + a (Union[list, np.ndarray, Tensor]): The first tensor. + b (Union[list, np.ndarray, Tensor]): The second tensor. - :return: Matrix with res[i][j] = cos_sim(a[i], b[j]) + Returns: + Tensor: Matrix with res[i][j] = cos_sim(a[i], b[j]) """ a = _convert_to_batch_tensor(a) b = _convert_to_batch_tensor(b) @@ -63,9 +100,14 @@ def cos_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tenso def pairwise_cos_sim(a: Tensor, b: Tensor) -> Tensor: """ - Computes the pairwise cossim cos_sim(a[i], b[i]) + Computes the pairwise cosine similarity cos_sim(a[i], b[i]). - :return: Vector with res[i] = cos_sim(a[i], b[i]) + Args: + a (Union[list, np.ndarray, Tensor]): The first tensor. + b (Union[list, np.ndarray, Tensor]): The second tensor. + + Returns: + Tensor: Vector with res[i] = cos_sim(a[i], b[i]) """ a = _convert_to_tensor(a) b = _convert_to_tensor(b) @@ -77,7 +119,12 @@ def dot_score(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Ten """ Computes the dot-product dot_prod(a[i], b[j]) for all i and j. 
- :return: Matrix with res[i][j] = dot_prod(a[i], b[j])
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.
+
+    Returns:
+        Tensor: Matrix with res[i][j] = dot_prod(a[i], b[j])
    """
    a = _convert_to_batch_tensor(a)
    b = _convert_to_batch_tensor(b)
@@ -87,9 +134,14 @@
def pairwise_dot_score(a: Tensor, b: Tensor) -> Tensor:
    """
-    Computes the pairwise dot-product dot_prod(a[i], b[i])
+    Computes the pairwise dot-product dot_prod(a[i], b[i]).

-    :return: Vector with res[i] = dot_prod(a[i], b[i])
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.
+
+    Returns:
+        Tensor: Vector with res[i] = dot_prod(a[i], b[i])
    """
    a = _convert_to_tensor(a)
    b = _convert_to_tensor(b)
@@ -99,9 +151,14 @@ def pairwise_dot_score(a: Tensor, b: Tensor) -> Tensor:
def manhattan_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]) -> Tensor:
    """
-    Computes the manhattan similarity manhattan_sim(a[i], b[j]) for all i and j.
+    Computes the manhattan similarity (i.e., negative distance) between two tensors.
+
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.

-    :return: Matrix with res[i][j] = manhattan_sim(a[i], b[j])
+    Returns:
+        Tensor: Matrix with res[i][j] = -manhattan_distance(a[i], b[j])
    """
    a = _convert_to_batch_tensor(a)
    b = _convert_to_batch_tensor(b)
@@ -111,9 +168,14 @@
def pairwise_manhattan_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]):
    """
-    Computes the negative manhattan distance.
+    Computes the manhattan similarity (i.e., negative distance) between pairs of tensors.

-    :return: Vector with res[i] = -manhattan_distance(a[i], b[i])
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.
+
+    Returns:
+        Tensor: Vector with res[i] = -manhattan_distance(a[i], b[i])
    """
    a = _convert_to_tensor(a)
    b = _convert_to_tensor(b)
@@ -123,9 +185,14 @@
def euclidean_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]) -> Tensor:
    """
-    Computes the euclidean similarity euclidean_sim(a[i], b[j]) for all i and j.
+    Computes the euclidean similarity (i.e., negative distance) between two tensors.
+
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.

-    :return: Matrix with res[i][j] = euclidean_sim(a[i], b[j])
+    Returns:
+        Tensor: Matrix with res[i][j] = -euclidean_distance(a[i], b[j])
    """
    a = _convert_to_batch_tensor(a)
    b = _convert_to_batch_tensor(b)
@@ -135,9 +202,14 @@
def pairwise_euclidean_sim(a: Union[list, np.ndarray, Tensor], b: Union[list, np.ndarray, Tensor]):
    """
-    Computes the negative euclidean distance.
+    Computes the euclidean similarity (i.e., negative distance) between pairs of tensors.

-    :return: Vector with res[i] = -euclidean(a[i], b[i])
+    Args:
+        a (Union[list, np.ndarray, Tensor]): The first tensor.
+        b (Union[list, np.ndarray, Tensor]): The second tensor.
+
+    Returns:
+        Tensor: Vector with res[i] = -euclidean_distance(a[i], b[i])
    """
    a = _convert_to_tensor(a)
    b = _convert_to_tensor(b)
@@ -147,11 +219,15 @@
def pairwise_angle_sim(x: Tensor, y: Tensor) -> Tensor:
    """
-    Computes the absolute normalized angle distance;
-    see AnglELoss or https://arxiv.org/abs/2309.12871v1
-    for more information.
+    Computes the absolute normalized angle distance. See :class:`~sentence_transformers.losses.AnglELoss`
+    or https://arxiv.org/abs/2309.12871v1 for more information.
+
+    Args:
+        x (Tensor): The first tensor.
+        y (Tensor): The second tensor.

-    :return: Vector with res[i] = angle_sim(a[i], b[i])
+    Returns:
+        Tensor: Vector with res[i] = angle_sim(a[i], b[i])
    """

    x = _convert_to_tensor(x)
@@ -177,7 +253,13 @@ def pairwise_angle_sim(x: Tensor, y: Tensor) -> Tensor:
def normalize_embeddings(embeddings: Tensor) -> Tensor:
    """
-    Normalizes the embeddings matrix, so that each sentence embedding has unit length
+    Normalizes the embeddings matrix, so that each sentence embedding has unit length.
+
+    Args:
+        embeddings (Tensor): The input embeddings matrix.
+
+    Returns:
+        Tensor: The normalized embeddings matrix.
    """
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)
@@ -194,30 +276,64 @@
def truncate_embeddings(
    embeddings: Union[np.ndarray, torch.Tensor], truncate_dim: Optional[int]
) -> Union[np.ndarray, torch.Tensor]:
    """
-    :param embeddings: Embeddings to truncate.
-    :param truncate_dim: The dimension to truncate sentence embeddings to. `None` does no truncation.
-    :return: Truncated embeddings.
+    Truncates the embeddings matrix.
+
+    Args:
+        embeddings (Union[np.ndarray, torch.Tensor]): Embeddings to truncate.
+        truncate_dim (Optional[int]): The dimension to truncate sentence embeddings to. `None` does no truncation.
+
+    Example:
+        >>> from sentence_transformers import SentenceTransformer
+        >>> from sentence_transformers.util import truncate_embeddings
+        >>> model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka")
+        >>> embeddings = model.encode(["It's so nice outside!", "Today is a beautiful day.", "He drove to work earlier"])
+        >>> embeddings.shape
+        (3, 768)
+        >>> model.similarity(embeddings, embeddings)
+        tensor([[1.0000, 0.8100, 0.1426],
+                [0.8100, 1.0000, 0.2121],
+                [0.1426, 0.2121, 1.0000]])
+        >>> truncated_embeddings = truncate_embeddings(embeddings, 128)
+        >>> truncated_embeddings.shape
+        (3, 128)
+        >>> model.similarity(truncated_embeddings, truncated_embeddings)
+        tensor([[1.0000, 0.8092, 0.1987],
+                [0.8092, 1.0000, 0.2716],
+                [0.1987, 0.2716, 1.0000]])
+
+    Returns:
+        Union[np.ndarray, torch.Tensor]: Truncated embeddings.
    """
    return embeddings[..., :truncate_dim]


def paraphrase_mining(
-    model, sentences: List[str], show_progress_bar: bool = False, batch_size: int = 32, *args, **kwargs
+    model,
+    sentences: List[str],
+    show_progress_bar: bool = False,
+    batch_size: int = 32,
+    query_chunk_size: int = 5000,
+    corpus_chunk_size: int = 100000,
+    max_pairs: int = 500000,
+    top_k: int = 100,
+    score_function: Callable[[Tensor, Tensor], Tensor] = cos_sim,
) -> List[List[Union[float, int]]]:
    """
    Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all
    other sentences and returns a list with the pairs that have the highest cosine similarity score.
- :param model: SentenceTransformer model for embedding computation - :param sentences: A list of strings (texts or sentences) - :param show_progress_bar: Plotting of a progress bar - :param batch_size: Number of texts that are encoded simultaneously by the model - :param query_chunk_size: Search for most similar pairs for #query_chunk_size at the same time. Decrease, to lower memory footprint (increases run-time). - :param corpus_chunk_size: Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease, to lower memory footprint (increases run-time). - :param max_pairs: Maximal number of text pairs returned. - :param top_k: For each sentence, we retrieve up to top_k other sentences - :param score_function: Function for computing scores. By default, cosine similarity. - :return: Returns a list of triplets with the format [score, id1, id2] + Args: + model (SentenceTransformer): SentenceTransformer model for embedding computation + sentences (List[str]): A list of strings (texts or sentences) + show_progress_bar (bool, optional): Whether to show a progress bar. Defaults to False. + batch_size (int, optional): Number of texts that are encoded simultaneously by the model. Defaults to 32. + query_chunk_size (int, optional): Search for most similar pairs for #query_chunk_size at the same time. Decrease to lower memory footprint (increases run-time). Defaults to 5000. + corpus_chunk_size (int, optional): Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease to lower memory footprint (increases run-time). Defaults to 100000. + max_pairs (int, optional): Maximal number of text pairs returned. Defaults to 500000. + top_k (int, optional): For each sentence, we retrieve up to top_k other sentences. Defaults to 100. + score_function (Callable[[Tensor, Tensor], Tensor], optional): Function for computing scores. Defaults to cos_sim (cosine similarity). + + Returns: + List[List[Union[float, int]]]: Returns a list of triplets with the format [score, id1, id2] """ # Compute embedding for the sentences @@ -225,7 +341,14 @@ def paraphrase_mining( sentences, show_progress_bar=show_progress_bar, batch_size=batch_size, convert_to_tensor=True ) - return paraphrase_mining_embeddings(embeddings, *args, **kwargs) + return paraphrase_mining_embeddings( + embeddings, + query_chunk_size=query_chunk_size, + corpus_chunk_size=corpus_chunk_size, + max_pairs=max_pairs, + top_k=top_k, + score_function=score_function, + ) def paraphrase_mining_embeddings( @@ -240,13 +363,16 @@ def paraphrase_mining_embeddings( Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score. - :param embeddings: A tensor with the embeddings - :param query_chunk_size: Search for most similar pairs for #query_chunk_size at the same time. Decrease, to lower memory footprint (increases run-time). - :param corpus_chunk_size: Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease, to lower memory footprint (increases run-time). - :param max_pairs: Maximal number of text pairs returned. - :param top_k: For each sentence, we retrieve up to top_k other sentences - :param score_function: Function for computing scores. By default, cosine similarity.
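A quick usage sketch for paraphrase_mining with the now-explicit keyword arguments (the model name and printed score are illustrative, not part of the patch):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sits outside",
    "A cat is sitting outdoors",
    "The new movie is awesome",
]
# Each entry is [score, id1, id2], sorted by decreasing score.
pairs = paraphrase_mining(model, sentences, top_k=2)
print(pairs[0])  # e.g. [0.89, 0, 1] -- the two cat sentences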
- :return: Returns a list of triplets with the format [score, id1, id2] + Args: + embeddings (Tensor): A tensor with the embeddings + query_chunk_size (int): Search for most similar pairs for #query_chunk_size at the same time. Decrease to lower memory footprint (increases run-time). + corpus_chunk_size (int): Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease to lower memory footprint (increases run-time). + max_pairs (int): Maximal number of text pairs returned. + top_k (int): For each sentence, we retrieve up to top_k other sentences + score_function (Callable[[Tensor, Tensor], Tensor]): Function for computing scores. By default, cosine similarity. + + Returns: + List[List[Union[float, int]]]: Returns a list of triplets with the format [score, id1, id2] """ top_k += 1  # A sentence has the highest similarity to itself. Increase +1 as we are interested in distinct pairs @@ -315,13 +441,16 @@ def semantic_search( This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries. - :param query_embeddings: A 2 dimensional tensor with the query embeddings. - :param corpus_embeddings: A 2 dimensional tensor with the corpus embeddings. - :param query_chunk_size: Process 100 queries simultaneously. Increasing that value increases the speed, but requires more memory. - :param corpus_chunk_size: Scans the corpus 100k entries at a time. Increasing that value increases the speed, but requires more memory. - :param top_k: Retrieve top k matching entries. - :param score_function: Function for computing scores. By default, cosine similarity. - :return: Returns a list with one entry for each query. Each entry is a list of dictionaries with the keys 'corpus_id' and 'score', sorted by decreasing cosine similarity scores. + Args: + query_embeddings (Tensor): A 2 dimensional tensor with the query embeddings. + corpus_embeddings (Tensor): A 2 dimensional tensor with the corpus embeddings. + query_chunk_size (int, optional): Process query_chunk_size queries simultaneously. Increasing that value increases the speed, but requires more memory. Defaults to 100. + corpus_chunk_size (int, optional): Scans the corpus in chunks of corpus_chunk_size entries at a time. Increasing that value increases the speed, but requires more memory. Defaults to 500000. + top_k (int, optional): Retrieve top k matching entries. Defaults to 10. + score_function (Callable[[Tensor, Tensor], Tensor], optional): Function for computing scores. By default, cosine similarity. + + Returns: + List[List[Dict[str, Union[int, float]]]]: A list with one entry for each query. Each entry is a list of dictionaries with the keys 'corpus_id' and 'score', sorted by decreasing cosine similarity scores. """ if isinstance(query_embeddings, (np.ndarray, np.generic)): @@ -382,7 +511,17 @@ def semantic_search( def http_get(url, path) -> None: """ - Downloads a URL to a given path on disc + Downloads a URL to a given path on disk. + + Args: + url (str): The URL to download. + path (str): The path to save the downloaded file. + + Raises: + requests.HTTPError: If the HTTP request returns a non-200 status code.
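A usage sketch for semantic_search as documented above (the model name is illustrative; scores depend on the model):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = model.encode(
    ["Berlin is in Germany", "Paris is in France", "I like pizza"], convert_to_tensor=True
)
query_embeddings = model.encode(["Where is Berlin?"], convert_to_tensor=True)

# One result list per query; each hit is {'corpus_id': ..., 'score': ...},
# sorted by decreasing score.
hits = semantic_search(query_embeddings, corpus_embeddings, top_k=2)
print(hits[0][0]["corpus_id"])  # likely 0, i.e. the Berlin sentence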
+ + Returns: + None """ if os.path.dirname(path) != "": os.makedirs(os.path.dirname(path), exist_ok=True) @@ -409,7 +548,14 @@ def http_get(url, path) -> None: def batch_to_device(batch, target_device: device): """ - send a pytorch batch to a device (CPU/GPU) + Send a PyTorch batch (i.e., a dictionary of string keys to Tensors) to a device (e.g. "cpu", "cuda", "mps"). + + Args: + batch (Dict[str, Tensor]): The batch to send to the device. + target_device (torch.device): The target device (e.g. "cpu", "cuda", "mps"). + + Returns: + Dict[str, Tensor]: The batch with tensors sent to the target device. """ for key in batch: if isinstance(batch[key], Tensor): @@ -421,6 +567,21 @@ def fullname(o) -> str: """ Gives a full name (package_name.class_name) for a class / object in Python. Will be used to load the correct classes from JSON files + + Args: + o: The object for which to get the full name. + + Returns: + str: The full name of the object. + + Example: + >>> from sentence_transformers.losses import MultipleNegativesRankingLoss + >>> from sentence_transformers import SentenceTransformer + >>> from sentence_transformers.util import fullname + >>> model = SentenceTransformer('all-MiniLM-L6-v2') + >>> loss = MultipleNegativesRankingLoss(model) + >>> fullname(loss) + 'sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss' """ module = o.__class__.__module__ @@ -434,6 +595,19 @@ def fullname(o) -> str: def import_from_string(dotted_path): """ Import a dotted module path and return the attribute/class designated by the last name in the path. Raise ImportError if the import failed. + + Args: + dotted_path (str): The dotted module path. + + Returns: + Any: The attribute/class designated by the last name in the path. + + Raises: + ImportError: If the import failed. + + Example: + >>> import_from_string('sentence_transformers.losses.MultipleNegativesRankingLoss') + <class 'sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss'> """ try: module_path, class_name = dotted_path.rsplit(".", 1) @@ -454,13 +628,28 @@ def import_from_string(dotted_path): def community_detection( - embeddings, threshold=0.75, min_community_size=10, batch_size=1024, show_progress_bar=False + embeddings: Union[torch.Tensor, np.ndarray], + threshold: float = 0.75, + min_community_size: int = 10, + batch_size: int = 1024, + show_progress_bar: bool = False, ) -> List[List[int]]: """ - Function for Fast Community Detection + Function for Fast Community Detection. + Finds in the embeddings all communities, i.e. embeddings that are close (closer than threshold). Returns only communities that are larger than min_community_size. The communities are returned in decreasing order. The first element in each list is the central point in the community. + + Args: + embeddings (torch.Tensor or numpy.ndarray): The input embeddings. + threshold (float): The threshold for determining if two embeddings are close. Defaults to 0.75. + min_community_size (int): The minimum size of a community to be considered. Defaults to 10. + batch_size (int): The batch size for computing cosine similarity scores. Defaults to 1024. + show_progress_bar (bool): Whether to show a progress bar during computation. Defaults to False. + + Returns: + List[List[int]]: A list of communities, where each community is represented as a list of indices. """ if not isinstance(embeddings, torch.Tensor): embeddings = torch.tensor(embeddings) @@ -571,7 +760,8 @@ def disable_logging(highest_level=logging.CRITICAL): A context manager that will prevent any logging messages triggered during the body from being processed.
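A small sketch of community_detection as documented above (the repeated sentences simply guarantee groups of the minimum size; model name and threshold are illustrative):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import community_detection

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How do I bake bread?", "Bread baking tips", "What is the capital of France?"] * 5
embeddings = model.encode(sentences, convert_to_tensor=True)

# Communities of at least 5 embeddings with cosine similarity above 0.7;
# within each returned list of indices, the first index is the central point.
communities = community_detection(embeddings, threshold=0.7, min_community_size=5)
print(communities)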
- :param highest_level: the maximum logging level allowed. + Args: + highest_level: The maximum logging level allowed. """ previous_level = logging.root.manager.disable @@ -591,6 +781,19 @@ def is_sentence_transformer_model( revision: Optional[str] = None, local_files_only: bool = False, ) -> bool: + """ + Checks if the given model name or path corresponds to a SentenceTransformer model. + + Args: + model_name_or_path (str): The name or path of the model. + token (Optional[Union[bool, str]]): The token to be used for authentication. Defaults to None. + cache_folder (Optional[str]): The folder to cache the model files. Defaults to None. + revision (Optional[str]): The revision of the model. Defaults to None. + local_files_only (bool): Whether to only use local files for the model. Defaults to False. + + Returns: + bool: True if the model is a SentenceTransformer model, False otherwise. + """ return bool( load_file_path( model_name_or_path, @@ -611,6 +814,20 @@ def load_file_path( revision: Optional[str] = None, local_files_only: bool = False, ) -> Optional[str]: + """ + Loads a file from a local or remote location. + + Args: + model_name_or_path (str): The model name or path. + filename (str): The name of the file to load. + token (Optional[Union[bool, str]]): The token to access the remote file (if applicable). + cache_folder (Optional[str]): The folder to cache the downloaded file (if applicable). + revision (Optional[str], optional): The revision of the file (if applicable). Defaults to None. + local_files_only (bool, optional): Whether to only consider local files. Defaults to False. + + Returns: + Optional[str]: The path to the loaded file, or None if the file could not be found or loaded. + """ # If file is local file_path = os.path.join(model_name_or_path, filename) if os.path.exists(file_path): @@ -639,6 +856,20 @@ def load_dir_path( revision: Optional[str] = None, local_files_only: bool = False, ) -> Optional[str]: + """ + Loads the directory path for a given model name or path. + + Args: + model_name_or_path (str): The name or path of the model. + directory (str): The directory to load. + token (Optional[Union[bool, str]]): The token for authentication. + cache_folder (Optional[str]): The folder to cache the downloaded files. + revision (Optional[str], optional): The revision of the model. Defaults to None. + local_files_only (bool, optional): Whether to only use local files. Defaults to False. + + Returns: + Optional[str]: The directory path if it exists, otherwise None. + """ # If file is local dir_path = os.path.join(model_name_or_path, directory) if os.path.exists(dir_path): @@ -687,11 +918,12 @@ def wrapper(self, *args, **kwargs): def get_device_name() -> Literal["mps", "cuda", "npu", "hpu", "cpu"]: """ Returns the name of the device where this module is running on. - It's simple implementation that doesn't cover cases when more powerful GPUs are available and - not a primary device ('cuda:0') or MPS device is available, but not configured properly: - https://pytorch.org/docs/master/notes/mps.html - :return: Device name, like 'cuda' or 'cpu' + It's a simple implementation that doesn't cover cases such as when more powerful GPUs are available but are + not the primary device ('cuda:0'), or when an MPS device is available but not configured properly.
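A brief sketch combining get_device_name and batch_to_device from this file (the token ids are made up):

import torch

from sentence_transformers.util import batch_to_device, get_device_name

device = get_device_name()  # e.g. "cuda" on a machine with an NVIDIA GPU, otherwise "mps"/"cpu"
batch = {
    "input_ids": torch.tensor([[101, 2023, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1]]),
}
batch = batch_to_device(batch, torch.device(device))
print(batch["input_ids"].device)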
+ + Returns: + str: Device name, like 'cuda' or 'cpu' """ if torch.cuda.is_available(): return "cuda" diff --git a/setup.py b/setup.py index 593033260..86b8717bd 100644 --- a/setup.py +++ b/setup.py @@ -42,6 +42,10 @@ "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3.8", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", "Topic :: Scientific/Engineering :: Artificial Intelligence", ], keywords="Transformer Networks BERT XLNet sentence embedding PyTorch NLP deep learning", diff --git a/tests/test_compute_embeddings.py b/tests/test_compute_embeddings.py index 3ffc51982..d301367b7 100644 --- a/tests/test_compute_embeddings.py +++ b/tests/test_compute_embeddings.py @@ -10,7 +10,6 @@ def test_encode_token_embeddings(paraphrase_distilroberta_base_v1_model: SentenceTransformer) -> None: """ Test that encode(output_value='token_embeddings') works - :return: """ model = paraphrase_distilroberta_base_v1_model sent = [ From 551feeb532126f214a64702420fc777e1cf37452 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Fri, 24 May 2024 14:40:44 +0200 Subject: [PATCH 22/39] [`v3`] Chore - include import sorting in ruff (#2672) * Include import sorting in ruff * Remove deprecated ignore-init-module-imports * Remove --select I from ruff.toml again after CI issues --- docs/_themes/sphinx_rtd_theme/__init__.py | 1 - docs/conf.py | 6 ++- .../applications/clustering/agglomerative.py | 3 +- .../clustering/fast_clustering.py | 4 +- examples/applications/clustering/kmeans.py | 3 +- .../computing_embeddings.py | 6 ++- .../computing_embeddings_multi_gpu.py | 3 +- .../computing_embeddings_streaming.py | 6 ++- .../cross-encoder/cross-encoder_reranking.py | 6 +-- .../cross-encoder/cross-encoder_usage.py | 3 +- .../semantic_search_faiss.py | 3 +- .../semantic_search_faiss_benchmark.py | 2 +- .../semantic_search_recommended.py | 7 +-- .../semantic_search_usearch.py | 3 +- .../semantic_search_usearch_benchmark.py | 2 +- examples/applications/image-search/example.py | 4 +- .../parallel-sentence-mining/bitext_mining.py | 10 +++-- .../bitext_mining_utils.py | 7 +-- .../parallel-sentence-mining/bucc2018.py | 12 ++--- .../in_document_search_crossencoder.py | 5 ++- .../semantic-search/semantic_search.py | 3 +- .../semantic_search_publications.py | 1 + .../semantic_search_quora_annoy.py | 5 ++- .../semantic_search_quora_elasticsearch.py | 9 ++-- .../semantic_search_quora_faiss.py | 5 ++- .../semantic_search_quora_hnswlib.py | 5 ++- .../semantic_search_quora_pytorch.py | 4 +- .../semantic_search_wikipedia_qa.py | 8 ++-- .../text-summarization/LexRank.py | 3 +- .../text-summarization/text-summarization.py | 2 +- .../evaluation/evaluation_inference_speed.py | 4 +- .../evaluation/evaluation_stsbenchmark.py | 9 ++-- .../evaluation_translation_matching.py | 6 +-- .../adaptive_layer/adaptive_layer_nli.py | 16 ++++--- .../adaptive_layer/adaptive_layer_sts.py | 15 ++++--- ...aining_stsbenchmark_avg_word_embeddings.py | 9 ++-- .../training_stsbenchmark_bilstm.py | 9 ++-- .../training_stsbenchmark_bow.py | 13 +++--- .../training_stsbenchmark_cnn.py | 10 ++--- ...ing_stsbenchmark_tf-idf_word_embeddings.py | 13 +++--- .../training/cross-encoder/training_nli.py | 14 +++--- .../training_quora_duplicate_questions.py | 14 +++--- .../cross-encoder/training_stsbenchmark.py | 17 +++---- 
.../train_sts_indomain_bm25.py | 20 ++++----- .../train_sts_indomain_nlpaug.py | 15 ++++--- .../train_sts_indomain_semantic.py | 23 +++++----- .../train_sts_qqp_crossdomain.py | 23 +++++----- .../train_sts_seed_optimization.py | 18 ++++---- .../distillation/dimensionality_reduction.py | 7 +-- .../distillation/model_distillation.py | 13 +++--- .../model_distillation_layer_reduction.py | 11 +++-- .../distillation/model_quantization.py | 12 ++--- examples/training/hpo/hpo_nli.py | 10 +++-- .../training/matryoshka/2d_matryoshka_nli.py | 16 ++++--- .../training/matryoshka/2d_matryoshka_sts.py | 15 ++++--- .../matryoshka/matryoshka_eval_stsb.py | 7 +-- .../training/matryoshka/matryoshka_nli.py | 16 ++++--- .../matryoshka/matryoshka_nli_reduced_dim.py | 17 ++++--- .../training/matryoshka/matryoshka_sts.py | 15 ++++--- .../ms_marco/eval_cross-encoder-trec-dl.py | 12 ++--- examples/training/ms_marco/eval_msmarco.py | 5 ++- .../multilingual/translate_queries.py | 8 ++-- .../ms_marco/train_bi-encoder_margin-mse.py | 21 ++++----- .../ms_marco/train_bi-encoder_mnrl.py | 17 +++---- .../ms_marco/train_cross-encoder_kd.py | 15 ++++--- .../ms_marco/train_cross-encoder_scratch.py | 15 ++++--- .../multilingual/get_parallel_data_opus.py | 2 +- .../multilingual/get_parallel_data_talks.py | 7 +-- .../multilingual/get_parallel_data_tatoeba.py | 5 ++- .../get_parallel_data_wikimatrix.py | 4 +- .../multilingual/make_multilingual.py | 9 ++-- examples/training/nli/training_nli.py | 11 +++-- examples/training/nli/training_nli_v2.py | 11 +++-- examples/training/nli/training_nli_v3.py | 11 +++-- .../other/training_batch_hard_trec.py | 16 +++---- .../training/other/training_multi-task.py | 7 +-- .../other/training_wikipedia_sections.py | 7 +-- examples/training/paraphrases/training.py | 8 ++-- .../create_splits.py | 6 +-- .../training_MultipleNegativesRankingLoss.py | 7 +-- .../training_OnlineContrastiveLoss.py | 7 +-- .../training_multi-task-learning.py | 7 +-- .../training/sts/training_stsbenchmark.py | 8 ++-- ...training_stsbenchmark_continue_training.py | 8 ++-- .../CT/train_askubuntu_ct.py | 7 +-- .../CT/train_ct_from_file.py | 11 ++--- .../unsupervised_learning/CT/train_stsb_ct.py | 15 ++++--- .../train_askubuntu_ct-improved.py | 7 +-- .../train_ct-improved_from_file.py | 11 ++--- .../train_stsb_ct-improved.py | 13 +++--- .../unsupervised_learning/MLM/train_mlm.py | 14 ++++-- .../SimCSE/train_askubuntu_simcse.py | 8 ++-- .../SimCSE/train_simcse_from_file.py | 13 +++--- .../SimCSE/train_stsb_simcse.py | 17 +++---- .../TSDAE/eval_askubuntu.py | 6 +-- .../TSDAE/train_askubuntu_tsdae.py | 11 ++--- .../TSDAE/train_stsb_tsdae.py | 15 ++++--- .../TSDAE/train_tsdae_from_file.py | 11 ++--- .../1_programming_query_generation.py | 8 ++-- .../2_programming_train_bi-encoder.py | 2 +- .../3_programming_semantic_search.py | 3 +- .../example_query_generation.py | 7 +-- ruff.toml | 1 - sentence_transformers/LoggingHandler.py | 1 + sentence_transformers/SentenceTransformer.py | 42 +++++++++--------- sentence_transformers/__init__.py | 21 +++++---- .../cross_encoder/CrossEncoder.py | 22 ++++------ .../evaluation/CEBinaryAccuracyEvaluator.py | 5 ++- .../CEBinaryClassificationEvaluator.py | 12 ++--- .../evaluation/CECorrelationEvaluator.py | 9 ++-- .../cross_encoder/evaluation/CEF1Evaluator.py | 4 +- .../evaluation/CERerankingEvaluator.py | 5 ++- .../evaluation/CESoftmaxAccuracyEvaluator.py | 5 ++- .../cross_encoder/evaluation/__init__.py | 4 +- .../datasets/DenoisingAutoEncoderDataset.py | 10 +++-- 
.../datasets/NoDuplicatesDataLoader.py | 2 +- .../datasets/ParallelSentencesDataset.py | 11 ++--- .../datasets/SentenceLabelDataset.py | 10 +++-- .../datasets/SentencesDataset.py | 8 ++-- sentence_transformers/datasets/__init__.py | 2 +- .../BinaryClassificationEvaluator.py | 24 +++++----- .../EmbeddingSimilarityEvaluator.py | 24 +++++----- .../InformationRetrievalEvaluator.py | 24 +++++----- .../evaluation/LabelAccuracyEvaluator.py | 19 ++++---- .../evaluation/MSEEvaluator.py | 14 +++--- .../evaluation/MSEEvaluatorFromDataFrame.py | 19 ++++---- .../evaluation/ParaphraseMiningEvaluator.py | 17 +++---- .../evaluation/RerankingEvaluator.py | 22 ++++++---- .../evaluation/SentenceEvaluator.py | 7 +-- .../evaluation/SequentialEvaluator.py | 11 +++-- .../evaluation/TranslationEvaluator.py | 17 ++++--- .../evaluation/TripletEvaluator.py | 21 +++++---- sentence_transformers/evaluation/__init__.py | 6 +-- sentence_transformers/fit_mixin.py | 22 +++++----- .../losses/AdaptiveLayerLoss.py | 8 ++-- sentence_transformers/losses/AnglELoss.py | 2 +- .../losses/BatchAllTripletLoss.py | 9 ++-- .../losses/BatchHardSoftMarginTripletLoss.py | 7 ++- .../losses/BatchHardTripletLoss.py | 6 ++- .../losses/BatchSemiHardTripletLoss.py | 9 ++-- .../losses/CachedGISTEmbedLoss.py | 9 ++-- .../CachedMultipleNegativesRankingLoss.py | 12 ++--- sentence_transformers/losses/CoSENTLoss.py | 10 +++-- .../losses/ContrastiveLoss.py | 6 ++- .../losses/ContrastiveTensionLoss.py | 15 ++++--- .../losses/CosineSimilarityLoss.py | 7 +-- .../losses/DenoisingAutoEncoderLoss.py | 10 +++-- sentence_transformers/losses/GISTEmbedLoss.py | 8 ++-- sentence_transformers/losses/MSELoss.py | 5 ++- sentence_transformers/losses/MarginMSELoss.py | 8 ++-- .../losses/Matryoshka2dLoss.py | 4 +- .../losses/MatryoshkaLoss.py | 6 ++- .../losses/MegaBatchMarginLoss.py | 8 ++-- .../losses/MultipleNegativesRankingLoss.py | 10 +++-- .../MultipleNegativesSymmetricRankingLoss.py | 10 +++-- .../losses/OnlineContrastiveLoss.py | 9 ++-- sentence_transformers/losses/SoftmaxLoss.py | 9 ++-- sentence_transformers/losses/TripletLoss.py | 10 +++-- sentence_transformers/losses/__init__.py | 44 +++++++++---------- sentence_transformers/model_card.py | 29 ++++++------ sentence_transformers/models/Asym.py | 11 ++--- sentence_transformers/models/BoW.py | 12 ++--- sentence_transformers/models/CLIPModel.py | 5 ++- sentence_transformers/models/CNN.py | 7 +-- sentence_transformers/models/Dense.py | 13 +++--- sentence_transformers/models/Dropout.py | 8 ++-- sentence_transformers/models/LSTM.py | 7 +-- sentence_transformers/models/LayerNorm.py | 10 ++--- sentence_transformers/models/Normalize.py | 4 +- sentence_transformers/models/Pooling.py | 10 ++--- sentence_transformers/models/Transformer.py | 7 +-- .../models/WeightedLayerPooling.py | 10 ++--- .../models/WordEmbeddings.py | 18 ++++---- sentence_transformers/models/WordWeights.py | 9 ++-- sentence_transformers/models/__init__.py | 4 +- .../models/tokenizer/PhraseTokenizer.py | 11 ++--- .../models/tokenizer/WhitespaceTokenizer.py | 9 ++-- .../models/tokenizer/WordTokenizer.py | 2 +- .../models/tokenizer/__init__.py | 4 +- sentence_transformers/quantization.py | 11 +++-- sentence_transformers/readers/InputExample.py | 2 +- .../readers/LabelSentenceReader.py | 3 +- .../readers/NLIDataReader.py | 3 +- .../readers/PairedFilesReader.py | 3 +- .../readers/STSDataReader.py | 3 +- .../readers/TripletReader.py | 3 +- sentence_transformers/readers/__init__.py | 2 +- sentence_transformers/sampler.py | 7 +-- 
sentence_transformers/similarity_functions.py | 9 ++-- sentence_transformers/trainer.py | 32 +++++++------- sentence_transformers/training_args.py | 1 + sentence_transformers/util.py | 23 +++++----- setup.py | 2 +- tests/conftest.py | 7 +-- tests/test_cmnrl.py | 8 ++-- tests/test_cross_encoder.py | 4 +- tests/test_image_embeddings.py | 2 +- tests/test_model_card_data.py | 4 +- tests/test_multi_process.py | 3 +- tests/test_sentence_transformer.py | 13 +++--- tests/test_trainer.py | 6 ++- 201 files changed, 1078 insertions(+), 866 deletions(-) diff --git a/docs/_themes/sphinx_rtd_theme/__init__.py b/docs/_themes/sphinx_rtd_theme/__init__.py index e9ae9ccc6..0f739cce4 100644 --- a/docs/_themes/sphinx_rtd_theme/__init__.py +++ b/docs/_themes/sphinx_rtd_theme/__init__.py @@ -8,7 +8,6 @@ import sphinx - __version__ = "0.5.0" __version_full__ = __version__ diff --git a/docs/conf.py b/docs/conf.py index ba2e61f0b..1d3ad9bb0 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -14,10 +14,12 @@ # import sys # sys.path.insert(0, os.path.abspath('.')) -from recommonmark.transform import AutoStructify +import datetime import os + +from recommonmark.transform import AutoStructify from sphinx.domains import Domain -import datetime + # -- Project information ----------------------------------------------------- project = "Sentence-Transformers" diff --git a/examples/applications/clustering/agglomerative.py b/examples/applications/clustering/agglomerative.py index b5f449396..bb914f514 100644 --- a/examples/applications/clustering/agglomerative.py +++ b/examples/applications/clustering/agglomerative.py @@ -4,9 +4,10 @@ Sentences are mapped to sentence embeddings and then agglomerative clustering with a threshold is applied. """ -from sentence_transformers import SentenceTransformer from sklearn.cluster import AgglomerativeClustering +from sentence_transformers import SentenceTransformer + embedder = SentenceTransformer("all-MiniLM-L6-v2") # Corpus with example sentences diff --git a/examples/applications/clustering/fast_clustering.py b/examples/applications/clustering/fast_clustering.py index de67238ae..eb1d0d191 100644 --- a/examples/applications/clustering/fast_clustering.py +++ b/examples/applications/clustering/fast_clustering.py @@ -12,11 +12,11 @@ In this example, we download a large set of questions from Quora and then find similar questions in this set. """ -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import time +from sentence_transformers import SentenceTransformer, util # Model for computing sentence embeddings. We use one trained for similar questions detection model = SentenceTransformer("all-MiniLM-L6-v2") diff --git a/examples/applications/clustering/kmeans.py b/examples/applications/clustering/kmeans.py index be3e6fc8d..076c58115 100644 --- a/examples/applications/clustering/kmeans.py +++ b/examples/applications/clustering/kmeans.py @@ -4,9 +4,10 @@ Sentences are mapped to sentence embeddings and then k-mean clustering is applied. 
""" -from sentence_transformers import SentenceTransformer from sklearn.cluster import KMeans +from sentence_transformers import SentenceTransformer + embedder = SentenceTransformer("all-MiniLM-L6-v2") # Corpus with example sentences diff --git a/examples/applications/computing-embeddings/computing_embeddings.py b/examples/applications/computing-embeddings/computing_embeddings.py index 482527e9f..666ecca26 100644 --- a/examples/applications/computing-embeddings/computing_embeddings.py +++ b/examples/applications/computing-embeddings/computing_embeddings.py @@ -3,10 +3,12 @@ generate sentence embeddings for a given list of sentences. """ -from sentence_transformers import SentenceTransformer, LoggingHandler -import numpy as np import logging +import numpy as np + +from sentence_transformers import LoggingHandler, SentenceTransformer + #### Just some code to print debug information to stdout np.set_printoptions(threshold=100) diff --git a/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py b/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py index 1f47117d9..12fbfa562 100644 --- a/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py +++ b/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py @@ -4,9 +4,10 @@ when encoding large text collections. """ -from sentence_transformers import SentenceTransformer, LoggingHandler import logging +from sentence_transformers import LoggingHandler, SentenceTransformer + logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] ) diff --git a/examples/applications/computing-embeddings/computing_embeddings_streaming.py b/examples/applications/computing-embeddings/computing_embeddings_streaming.py index 8f9d4b4e0..6be3cac40 100644 --- a/examples/applications/computing-embeddings/computing_embeddings_streaming.py +++ b/examples/applications/computing-embeddings/computing_embeddings_streaming.py @@ -8,12 +8,14 @@ https://huggingface.co/docs/datasets/stream """ -from sentence_transformers import SentenceTransformer, LoggingHandler import logging -from datasets import load_dataset + from torch.utils.data import DataLoader from tqdm import tqdm +from datasets import load_dataset +from sentence_transformers import LoggingHandler, SentenceTransformer + logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] ) diff --git a/examples/applications/cross-encoder/cross-encoder_reranking.py b/examples/applications/cross-encoder/cross-encoder_reranking.py index a13aa96c3..0d0a8753c 100644 --- a/examples/applications/cross-encoder/cross-encoder_reranking.py +++ b/examples/applications/cross-encoder/cross-encoder_reranking.py @@ -7,13 +7,13 @@ Then, we re-rank the hits from the Bi-Encoder using a Cross-Encoder. """ -from sentence_transformers import SentenceTransformer, util -from sentence_transformers import CrossEncoder -import os import csv +import os import pickle import time +from sentence_transformers import CrossEncoder, SentenceTransformer, util + # We use a BiEncoder (SentenceTransformer) that produces embeddings for questions. 
# We then search for similar questions using cosine similarity and identify the top 100 most similar questions model_name = "all-MiniLM-L6-v2" diff --git a/examples/applications/cross-encoder/cross-encoder_usage.py b/examples/applications/cross-encoder/cross-encoder_usage.py index 7034445c6..16eba858a 100644 --- a/examples/applications/cross-encoder/cross-encoder_usage.py +++ b/examples/applications/cross-encoder/cross-encoder_usage.py @@ -4,9 +4,10 @@ It output then the most similar sentences for the given query. """ -from sentence_transformers.cross_encoder import CrossEncoder import numpy as np +from sentence_transformers.cross_encoder import CrossEncoder + # Pre-trained cross encoder model = CrossEncoder("cross-encoder/stsb-distilroberta-base") diff --git a/examples/applications/embedding-quantization/semantic_search_faiss.py b/examples/applications/embedding-quantization/semantic_search_faiss.py index 5da707327..1a6a1189a 100644 --- a/examples/applications/embedding-quantization/semantic_search_faiss.py +++ b/examples/applications/embedding-quantization/semantic_search_faiss.py @@ -1,7 +1,8 @@ import time + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings, semantic_search_faiss -from datasets import load_dataset # 1. Load the quora corpus with questions dataset = load_dataset("quora", split="train").map( diff --git a/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py b/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py index 0ff84333c..e869ca50b 100644 --- a/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py +++ b/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py @@ -1,6 +1,6 @@ +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings, semantic_search_faiss -from datasets import load_dataset # 1. Load the quora corpus with questions dataset = load_dataset("quora", split="train").map( diff --git a/examples/applications/embedding-quantization/semantic_search_recommended.py b/examples/applications/embedding-quantization/semantic_search_recommended.py index 594dbbc60..9084b56cc 100644 --- a/examples/applications/embedding-quantization/semantic_search_recommended.py +++ b/examples/applications/embedding-quantization/semantic_search_recommended.py @@ -8,13 +8,14 @@ import os import time +import faiss import numpy as np +from usearch.index import Index + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings -from datasets import load_dataset -import faiss -from usearch.index import Index # We use usearch as it can efficiently load int8 vectors from disk. # Load the model diff --git a/examples/applications/embedding-quantization/semantic_search_usearch.py b/examples/applications/embedding-quantization/semantic_search_usearch.py index c80b5ca6d..19d410a5d 100644 --- a/examples/applications/embedding-quantization/semantic_search_usearch.py +++ b/examples/applications/embedding-quantization/semantic_search_usearch.py @@ -1,7 +1,8 @@ import time + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings, semantic_search_usearch -from datasets import load_dataset # 1. 
Load the quora corpus with questions dataset = load_dataset("quora", split="train").map( diff --git a/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py b/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py index 5a7280f26..e8e0583a2 100644 --- a/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py +++ b/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py @@ -1,6 +1,6 @@ +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.quantization import quantize_embeddings, semantic_search_usearch -from datasets import load_dataset # 1. Load the quora corpus with questions dataset = load_dataset("quora", split="train").map( diff --git a/examples/applications/image-search/example.py b/examples/applications/image-search/example.py index 49bb6a699..9eadfc3aa 100644 --- a/examples/applications/image-search/example.py +++ b/examples/applications/image-search/example.py @@ -1,12 +1,12 @@ -from sentence_transformers import SentenceTransformer, util, models from PIL import Image +from sentence_transformers import SentenceTransformer, models, util ########### image = Image.open("two_dogs_in_snow.jpg") -from transformers import CLIPProcessor, CLIPModel +from transformers import CLIPModel, CLIPProcessor model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") diff --git a/examples/applications/parallel-sentence-mining/bitext_mining.py b/examples/applications/parallel-sentence-mining/bitext_mining.py index 886788fac..ac42de325 100644 --- a/examples/applications/parallel-sentence-mining/bitext_mining.py +++ b/examples/applications/parallel-sentence-mining/bitext_mining.py @@ -13,13 +13,15 @@ https://github.com/facebookresearch/faiss """ -from sentence_transformers import SentenceTransformer, models -import numpy as np -from bitext_mining_utils import score_candidates, kNN, file_open import gzip + +import numpy as np +import torch import tqdm +from bitext_mining_utils import file_open, kNN, score_candidates from sklearn.decomposition import PCA -import torch + +from sentence_transformers import SentenceTransformer, models # Model we want to use for bitext mining. 
LaBSE achieves state-of-the-art performance model_name = "LaBSE" diff --git a/examples/applications/parallel-sentence-mining/bitext_mining_utils.py b/examples/applications/parallel-sentence-mining/bitext_mining_utils.py index 3fdbc74bc..b723392b2 100644 --- a/examples/applications/parallel-sentence-mining/bitext_mining_utils.py +++ b/examples/applications/parallel-sentence-mining/bitext_mining_utils.py @@ -6,11 +6,12 @@ https://github.com/facebookresearch/LASER """ -import faiss -import numpy as np -import time import gzip import lzma +import time + +import faiss +import numpy as np ######## Functions to find and score candidates diff --git a/examples/applications/parallel-sentence-mining/bucc2018.py b/examples/applications/parallel-sentence-mining/bucc2018.py index 4781d86bc..47208356c 100644 --- a/examples/applications/parallel-sentence-mining/bucc2018.py +++ b/examples/applications/parallel-sentence-mining/bucc2018.py @@ -10,14 +10,16 @@ https://github.com/facebookresearch/faiss """ -from sentence_transformers import SentenceTransformer, models -from collections import defaultdict import os import pickle -from sklearn.decomposition import PCA -import torch -from bitext_mining_utils import score_candidates, kNN +from collections import defaultdict + import numpy as np +import torch +from bitext_mining_utils import kNN, score_candidates +from sklearn.decomposition import PCA + +from sentence_transformers import SentenceTransformer, models # Model we want to use for bitext mining. LaBSE achieves state-of-the-art performance model_name = "LaBSE" diff --git a/examples/applications/retrieve_rerank/in_document_search_crossencoder.py b/examples/applications/retrieve_rerank/in_document_search_crossencoder.py index 131c9edec..fca9cb86f 100644 --- a/examples/applications/retrieve_rerank/in_document_search_crossencoder.py +++ b/examples/applications/retrieve_rerank/in_document_search_crossencoder.py @@ -17,10 +17,11 @@ Note: Requires NLTK: `pip install nltk` """ -from sentence_transformers import CrossEncoder -from nltk import sent_tokenize import time +from nltk import sent_tokenize + +from sentence_transformers import CrossEncoder # As document, we take the first two section from the Wikipedia article about Europe document = """Europe is a continent located entirely in the Northern Hemisphere and mostly in the Eastern Hemisphere. It comprises the westernmost part of Eurasia and is bordered by the Arctic Ocean to the north, the Atlantic Ocean to the west, the Mediterranean Sea to the south, and Asia to the east. Europe is commonly considered to be separated from Asia by the watershed of the Ural Mountains, the Ural River, the Caspian Sea, the Greater Caucasus, the Black Sea, and the waterways of the Turkish Straits. Although some of this border is over land, Europe is generally accorded the status of a full continent because of its great physical size and the weight of history and tradition. diff --git a/examples/applications/semantic-search/semantic_search.py b/examples/applications/semantic-search/semantic_search.py index 80f9e9986..9f882b7b3 100644 --- a/examples/applications/semantic-search/semantic_search.py +++ b/examples/applications/semantic-search/semantic_search.py @@ -7,9 +7,10 @@ This script outputs for various queries the top 5 most similar sentences in the corpus. 
""" -from sentence_transformers import SentenceTransformer import torch +from sentence_transformers import SentenceTransformer + embedder = SentenceTransformer("all-MiniLM-L6-v2") # Corpus with example sentences diff --git a/examples/applications/semantic-search/semantic_search_publications.py b/examples/applications/semantic-search/semantic_search_publications.py index 12cb6c05b..f2a58d90d 100644 --- a/examples/applications/semantic-search/semantic_search_publications.py +++ b/examples/applications/semantic-search/semantic_search_publications.py @@ -11,6 +11,7 @@ import json import os + from sentence_transformers import SentenceTransformer, util # First, we load the papers dataset (with title and abstract information) diff --git a/examples/applications/semantic-search/semantic_search_quora_annoy.py b/examples/applications/semantic-search/semantic_search_quora_annoy.py index 25c4ac7f5..80d74594d 100644 --- a/examples/applications/semantic-search/semantic_search_quora_annoy.py +++ b/examples/applications/semantic-search/semantic_search_quora_annoy.py @@ -26,14 +26,15 @@ return the closest questions in the corpus (questions in the corpus are mainly in English). """ -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import pickle import time + import torch from annoy import AnnoyIndex +from sentence_transformers import SentenceTransformer, util model_name = "quora-distilbert-multilingual" model = SentenceTransformer(model_name) diff --git a/examples/applications/semantic-search/semantic_search_quora_elasticsearch.py b/examples/applications/semantic-search/semantic_search_quora_elasticsearch.py index ff0f37b19..f1201a83c 100644 --- a/examples/applications/semantic-search/semantic_search_quora_elasticsearch.py +++ b/examples/applications/semantic-search/semantic_search_quora_elasticsearch.py @@ -19,14 +19,15 @@ return the closest questions in the corpus (questions in the corpus are mainly in English). """ -from sentence_transformers import SentenceTransformer, util -import os -from elasticsearch import Elasticsearch, helpers -from ssl import create_default_context import csv +import os import time +from ssl import create_default_context + import tqdm.autonotebook +from elasticsearch import Elasticsearch, helpers +from sentence_transformers import SentenceTransformer, util es = Elasticsearch( hosts=["https://localhost:9200"], diff --git a/examples/applications/semantic-search/semantic_search_quora_faiss.py b/examples/applications/semantic-search/semantic_search_quora_faiss.py index 3f45298cb..76880e36b 100644 --- a/examples/applications/semantic-search/semantic_search_quora_faiss.py +++ b/examples/applications/semantic-search/semantic_search_quora_faiss.py @@ -23,14 +23,15 @@ return the closest questions in the corpus (questions in the corpus are mainly in English). 
""" -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import pickle import time + import faiss import numpy as np +from sentence_transformers import SentenceTransformer, util model_name = "quora-distilbert-multilingual" model = SentenceTransformer(model_name) diff --git a/examples/applications/semantic-search/semantic_search_quora_hnswlib.py b/examples/applications/semantic-search/semantic_search_quora_hnswlib.py index a6f1dbc12..cff380490 100644 --- a/examples/applications/semantic-search/semantic_search_quora_hnswlib.py +++ b/examples/applications/semantic-search/semantic_search_quora_hnswlib.py @@ -21,13 +21,14 @@ return the closest questions in the corpus (questions in the corpus are mainly in English). """ -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import pickle import time + import hnswlib +from sentence_transformers import SentenceTransformer, util model_name = "quora-distilbert-multilingual" model = SentenceTransformer(model_name) diff --git a/examples/applications/semantic-search/semantic_search_quora_pytorch.py b/examples/applications/semantic-search/semantic_search_quora_pytorch.py index 0e715b479..903c6804a 100644 --- a/examples/applications/semantic-search/semantic_search_quora_pytorch.py +++ b/examples/applications/semantic-search/semantic_search_quora_pytorch.py @@ -13,12 +13,12 @@ Google Colab example: https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing """ -from sentence_transformers import SentenceTransformer, util -import os import csv +import os import pickle import time +from sentence_transformers import SentenceTransformer, util model_name = "quora-distilbert-multilingual" model = SentenceTransformer(model_name) diff --git a/examples/applications/semantic-search/semantic_search_wikipedia_qa.py b/examples/applications/semantic-search/semantic_search_wikipedia_qa.py index 1cd8744bd..cdc0baf86 100644 --- a/examples/applications/semantic-search/semantic_search_wikipedia_qa.py +++ b/examples/applications/semantic-search/semantic_search_wikipedia_qa.py @@ -13,13 +13,15 @@ Google Colab Example: https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing """ -import json -from sentence_transformers import SentenceTransformer, util -import time import gzip +import json import os +import time + import torch +from sentence_transformers import SentenceTransformer, util + # We use the Bi-Encoder to encode all passages, so that we can use it with semantic search model_name = "nq-distilbert-base-v1" bi_encoder = SentenceTransformer(model_name) diff --git a/examples/applications/text-summarization/LexRank.py b/examples/applications/text-summarization/LexRank.py index 055cd8a79..db269c430 100644 --- a/examples/applications/text-summarization/LexRank.py +++ b/examples/applications/text-summarization/LexRank.py @@ -3,10 +3,11 @@ Source: https://github.com/crabcamp/lexrank/tree/dev """ +import logging + import numpy as np from scipy.sparse.csgraph import connected_components from scipy.special import softmax -import logging logger = logging.getLogger(__name__) diff --git a/examples/applications/text-summarization/text-summarization.py b/examples/applications/text-summarization/text-summarization.py index 64dc0a4cd..58c79d49c 100644 --- a/examples/applications/text-summarization/text-summarization.py +++ b/examples/applications/text-summarization/text-summarization.py @@ -19,10 +19,10 @@ """ import nltk -from sentence_transformers 
import SentenceTransformer import numpy as np from LexRank import degree_centrality_scores +from sentence_transformers import SentenceTransformer model = SentenceTransformer("all-MiniLM-L6-v2") diff --git a/examples/evaluation/evaluation_inference_speed.py b/examples/evaluation/evaluation_inference_speed.py index a91ec0067..7ac8ec4c3 100644 --- a/examples/evaluation/evaluation_inference_speed.py +++ b/examples/evaluation/evaluation_inference_speed.py @@ -7,11 +7,13 @@ python evaluation_inference_speed.py model_name """ -from sentence_transformers import SentenceTransformer import sys import time + import torch + from datasets import load_dataset +from sentence_transformers import SentenceTransformer # Limit torch to 4 threads torch.set_num_threads(4) diff --git a/examples/evaluation/evaluation_stsbenchmark.py b/examples/evaluation/evaluation_stsbenchmark.py index 1e3fb78e0..4a2e96a1f 100644 --- a/examples/evaluation/evaluation_stsbenchmark.py +++ b/examples/evaluation/evaluation_stsbenchmark.py @@ -7,14 +7,15 @@ python evaluation_stsbenchmark.py model_name """ -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -from datasets import load_dataset import logging +import os import sys + import torch -import os +from datasets import load_dataset +from sentence_transformers import SentenceTransformer +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction script_folder_path = os.path.dirname(os.path.realpath(__file__)) diff --git a/examples/evaluation/evaluation_translation_matching.py b/examples/evaluation/evaluation_translation_matching.py index bac21e9bd..567331401 100644 --- a/examples/evaluation/evaluation_translation_matching.py +++ b/examples/evaluation/evaluation_translation_matching.py @@ -19,11 +19,11 @@ python examples/evaluation/evaluation_translation_matching.py distiluse-base-multilingual-cased sentence-transformers/parallel-sentences-tatoeba en-ar en-de en-nl """ -from sentence_transformers import SentenceTransformer, evaluation -import sys import logging -from datasets import load_dataset +import sys +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, evaluation # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/adaptive_layer/adaptive_layer_nli.py b/examples/training/adaptive_layer/adaptive_layer_nli.py index 955bc2e6c..8062cd7fe 100644 --- a/examples/training/adaptive_layer/adaptive_layer_nli.py +++ b/examples/training/adaptive_layer/adaptive_layer_nli.py @@ -10,15 +10,19 @@ python adaptive_layer_nli.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction from 
sentence_transformers.training_args import BatchSamplers # Set the log level to INFO to get more information diff --git a/examples/training/adaptive_layer/adaptive_layer_sts.py b/examples/training/adaptive_layer/adaptive_layer_sts.py index c2ebdd6f4..7d13a0690 100644 --- a/examples/training/adaptive_layer/adaptive_layer_sts.py +++ b/examples/training/adaptive_layer/adaptive_layer_sts.py @@ -10,14 +10,19 @@ python adaptive_layer_sts.py pretrained_transformer_model_name """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction -import logging -from datetime import datetime -import sys # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py index a89cea13a..4ef74dc40 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py @@ -7,14 +7,13 @@ for available word embeddings files """ -import traceback -from datasets import load_dataset -from sentence_transformers import models, losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging +import traceback from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses, models +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py index c2453ee07..224111cab 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bilstm.py @@ -5,14 +5,13 @@ Note, you can also pass BERT embeddings to the BiLSTM. 
""" -import traceback -from datasets import load_dataset -from sentence_transformers import models, losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging +import traceback from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses, models +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py index 3be966717..2ee204b3f 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_bow.py @@ -5,17 +5,16 @@ To make the model trainable, we add multiple dense layers to create a Deep Averaging Network (DAN). """ +import logging +import math +import os import traceback +from datetime import datetime + from datasets import load_dataset -import math -from sentence_transformers import models, losses, util -from sentence_transformers import SentenceTransformer +from sentence_transformers import SentenceTransformer, losses, models, util from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.models.tokenizer.WordTokenizer import ENGLISH_STOP_WORDS -import logging -from datetime import datetime -import os - from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py index db8c4ee50..3e97d0557 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py +++ b/examples/training/avg_word_embeddings/training_stsbenchmark_cnn.py @@ -5,20 +5,18 @@ """ +import logging import sys import traceback -from datasets import load_dataset -from sentence_transformers import models, losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -import logging from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses, models +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments - # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py index 183894b07..08814eb26 100644 --- a/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py +++ 
b/examples/training/avg_word_embeddings/training_stsbenchmark_tf-idf_word_embeddings.py @@ -9,16 +9,15 @@ https://public.ukp.informatik.tu-darmstadt.de/reimers/embeddings/wikipedia_doc_frequencies.txt """ -import traceback -from datasets import load_dataset -import math -from sentence_transformers import models, losses, util -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging -from datetime import datetime +import math import os +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses, models, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/cross-encoder/training_nli.py b/examples/training/cross-encoder/training_nli.py index b6c83d271..e368855a8 100644 --- a/examples/training/cross-encoder/training_nli.py +++ b/examples/training/cross-encoder/training_nli.py @@ -8,18 +8,20 @@ python training_nli.py """ -from torch.utils.data import DataLoader +import csv +import gzip +import logging import math +import os +from datetime import datetime + +from torch.utils.data import DataLoader + from sentence_transformers import LoggingHandler, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CEF1Evaluator, CESoftmaxAccuracyEvaluator from sentence_transformers.evaluation import SequentialEvaluator from sentence_transformers.readers import InputExample -import logging -from datetime import datetime -import os -import gzip -import csv #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/cross-encoder/training_quora_duplicate_questions.py b/examples/training/cross-encoder/training_quora_duplicate_questions.py index de9e5634e..ecbdda11c 100644 --- a/examples/training/cross-encoder/training_quora_duplicate_questions.py +++ b/examples/training/cross-encoder/training_quora_duplicate_questions.py @@ -9,17 +9,19 @@ """ -from torch.utils.data import DataLoader +import csv +import logging import math +import os +from datetime import datetime +from zipfile import ZipFile + +from torch.utils.data import DataLoader + from sentence_transformers import LoggingHandler, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator from sentence_transformers.readers import InputExample -import logging -from datetime import datetime -import os -import csv -from zipfile import ZipFile #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/cross-encoder/training_stsbenchmark.py b/examples/training/cross-encoder/training_stsbenchmark.py index 8e1c83922..61a186e09 100644 --- a/examples/training/cross-encoder/training_stsbenchmark.py +++ b/examples/training/cross-encoder/training_stsbenchmark.py @@ -8,17 +8,18 @@ python training_stsbenchmark.py """ -from torch.utils.data import DataLoader +import csv +import gzip +import logging import math -from sentence_transformers import LoggingHandler, util +import os +from datetime import datetime + +from torch.utils.data import DataLoader + +from 
sentence_transformers import InputExample, LoggingHandler, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator -from sentence_transformers import InputExample -import logging -from datetime import datetime -import os -import gzip -import csv #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/data_augmentation/train_sts_indomain_bm25.py b/examples/training/data_augmentation/train_sts_indomain_bm25.py index 121dbb9bc..20a971bd5 100644 --- a/examples/training/data_augmentation/train_sts_indomain_bm25.py +++ b/examples/training/data_augmentation/train_sts_indomain_bm25.py @@ -26,22 +26,22 @@ """ +import logging +import math +import sys import traceback -from datasets import load_dataset, Dataset, concatenate_datasets +from datetime import datetime + +import tqdm +from elasticsearch import Elasticsearch from torch.utils.data import DataLoader -from sentence_transformers import losses + +from datasets import Dataset, concatenate_datasets, load_dataset +from sentence_transformers import SentenceTransformer, losses from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator -from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.readers import InputExample -from elasticsearch import Elasticsearch -from datetime import datetime -import logging -import sys -import tqdm -import math - from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/data_augmentation/train_sts_indomain_nlpaug.py b/examples/training/data_augmentation/train_sts_indomain_nlpaug.py index 735e5cf36..8f0bd9310 100644 --- a/examples/training/data_augmentation/train_sts_indomain_nlpaug.py +++ b/examples/training/data_augmentation/train_sts_indomain_nlpaug.py @@ -29,17 +29,18 @@ python train_sts_indomain_nlpaug.py """ -import traceback -from datasets import load_dataset, Dataset, concatenate_datasets -import torch -from sentence_transformers import SentenceTransformer, losses -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -import nlpaug.augmenter.word as naw import logging -from datetime import datetime import sys +import traceback +from datetime import datetime + +import nlpaug.augmenter.word as naw +import torch import tqdm +from datasets import Dataset, concatenate_datasets, load_dataset +from sentence_transformers import SentenceTransformer, losses +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/data_augmentation/train_sts_indomain_semantic.py b/examples/training/data_augmentation/train_sts_indomain_semantic.py index b3b276cce..7849d8512 100644 --- a/examples/training/data_augmentation/train_sts_indomain_semantic.py +++ b/examples/training/data_augmentation/train_sts_indomain_semantic.py @@ -19,22 +19,23 @@ python train_sts_indomain_semantic.py bert-base-uncased 3 """ +import csv +import gzip 
+import logging +import math +import os +import sys +from datetime import datetime + +import torch +import tqdm from torch.utils.data import DataLoader -from sentence_transformers import models, losses, util -from sentence_transformers import LoggingHandler, SentenceTransformer + +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.readers import InputExample -from datetime import datetime -import logging -import csv -import torch -import tqdm -import sys -import math -import gzip -import os #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/data_augmentation/train_sts_qqp_crossdomain.py b/examples/training/data_augmentation/train_sts_qqp_crossdomain.py index 55bd0a8f8..ac26ff35d 100644 --- a/examples/training/data_augmentation/train_sts_qqp_crossdomain.py +++ b/examples/training/data_augmentation/train_sts_qqp_crossdomain.py @@ -17,22 +17,23 @@ python train_sts_qqp_crossdomain.py pretrained_transformer_model_name """ +import csv +import gzip +import logging +import math +import os +import sys +from datetime import datetime +from zipfile import ZipFile + +import torch from torch.utils.data import DataLoader -from sentence_transformers import models, losses, util, LoggingHandler, SentenceTransformer + +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models, util from sentence_transformers.cross_encoder import CrossEncoder from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator from sentence_transformers.evaluation import BinaryClassificationEvaluator from sentence_transformers.readers import InputExample -from datetime import datetime -from zipfile import ZipFile -import logging -import csv -import sys -import torch -import math -import gzip -import os - #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/data_augmentation/train_sts_seed_optimization.py b/examples/training/data_augmentation/train_sts_seed_optimization.py index ddc53c676..02920fba2 100644 --- a/examples/training/data_augmentation/train_sts_seed_optimization.py +++ b/examples/training/data_augmentation/train_sts_seed_optimization.py @@ -23,19 +23,21 @@ python train_sts_seed_optimization.py bert-base-uncased 10 0.3 """ -from torch.utils.data import DataLoader +import csv +import gzip +import logging import math -import torch +import os import random +import sys + import numpy as np -from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util +import torch +from torch.utils.data import DataLoader + +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models, util from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.readers import InputExample -import logging -import sys -import os -import gzip -import csv #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/distillation/dimensionality_reduction.py b/examples/training/distillation/dimensionality_reduction.py index 3ba9432a9..3244c5f8d 100644 --- a/examples/training/distillation/dimensionality_reduction.py +++ b/examples/training/distillation/dimensionality_reduction.py @@ 
-15,14 +15,15 @@ without further changes needed. """ -from datasets import load_dataset -from sklearn.decomposition import PCA -from sentence_transformers import SentenceTransformer, models import logging import random + import numpy as np import torch +from sklearn.decomposition import PCA +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, models from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator # Set the log level to INFO to get more information diff --git a/examples/training/distillation/model_distillation.py b/examples/training/distillation/model_distillation.py index b33f777ac..ea8bc6b7e 100644 --- a/examples/training/distillation/model_distillation.py +++ b/examples/training/distillation/model_distillation.py @@ -20,22 +20,21 @@ of the teacher performance, while being 2.3 times faster. """ -import traceback -from datasets import load_dataset, concatenate_datasets, Dataset -import pandas as pd -from sentence_transformers import models, losses, evaluation -from sentence_transformers import LoggingHandler, SentenceTransformer import logging +import traceback from datetime import datetime -from sklearn.decomposition import PCA + +import pandas as pd import torch +from sklearn.decomposition import PCA +from datasets import Dataset, concatenate_datasets, load_dataset +from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, losses, models from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments - #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/training/distillation/model_distillation_layer_reduction.py b/examples/training/distillation/model_distillation_layer_reduction.py index 7a0d76f5d..83f8cc6be 100644 --- a/examples/training/distillation/model_distillation_layer_reduction.py +++ b/examples/training/distillation/model_distillation_layer_reduction.py @@ -20,21 +20,20 @@ of the teacher performance, while being 2.3 times faster. 
""" -import traceback -from datasets import load_dataset, concatenate_datasets, Dataset -import pandas as pd -from sentence_transformers import losses, evaluation -from sentence_transformers import SentenceTransformer import logging +import traceback from datetime import datetime + +import pandas as pd import torch +from datasets import Dataset, concatenate_datasets, load_dataset +from sentence_transformers import SentenceTransformer, evaluation, losses from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments - # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/distillation/model_quantization.py b/examples/training/distillation/model_quantization.py index 78b3f49d0..07f76310d 100644 --- a/examples/training/distillation/model_quantization.py +++ b/examples/training/distillation/model_quantization.py @@ -9,16 +9,18 @@ https://pytorch.org/docs/stable/quantization.html """ +import csv +import gzip import logging import os +import time + import torch -from sentence_transformers import LoggingHandler, SentenceTransformer, util, InputExample -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from torch.nn import Embedding, Linear from torch.quantization import quantize_dynamic -import gzip -import csv -import time + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/hpo/hpo_nli.py b/examples/training/hpo/hpo_nli.py index 758224ec9..604d92bdd 100644 --- a/examples/training/hpo/hpo_nli.py +++ b/examples/training/hpo/hpo_nli.py @@ -1,8 +1,12 @@ -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers -from datasets import load_dataset # 1. 
Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli, 10k samples train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]") diff --git a/examples/training/matryoshka/2d_matryoshka_nli.py b/examples/training/matryoshka/2d_matryoshka_nli.py index 0b6acd2ec..4ed4dbfea 100644 --- a/examples/training/matryoshka/2d_matryoshka_nli.py +++ b/examples/training/matryoshka/2d_matryoshka_nli.py @@ -11,15 +11,19 @@ python 2d_matryoshka_nli.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers # Set the log level to INFO to get more information diff --git a/examples/training/matryoshka/2d_matryoshka_sts.py b/examples/training/matryoshka/2d_matryoshka_sts.py index 3880c8c93..a170f1581 100644 --- a/examples/training/matryoshka/2d_matryoshka_sts.py +++ b/examples/training/matryoshka/2d_matryoshka_sts.py @@ -10,14 +10,19 @@ python 2d_matryoshka_sts.py pretrained_transformer_model_name """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction -import logging -from datetime import datetime -import sys # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/matryoshka/matryoshka_eval_stsb.py b/examples/training/matryoshka/matryoshka_eval_stsb.py index 4c5cb511d..288850882 100644 --- a/examples/training/matryoshka/matryoshka_eval_stsb.py +++ b/examples/training/matryoshka/matryoshka_eval_stsb.py @@ -7,15 +7,16 @@ import os from typing import Dict, List, Optional, Tuple, cast -from datasets import load_dataset -import numpy as np import matplotlib.pyplot as plt +import numpy as np +from tqdm.auto import tqdm + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( EmbeddingSimilarityEvaluator, SimilarityFunction, ) -from tqdm.auto import tqdm # Dimension plot diff --git a/examples/training/matryoshka/matryoshka_nli.py b/examples/training/matryoshka/matryoshka_nli.py index 07a1d24d5..c111f2f05 100644 --- a/examples/training/matryoshka/matryoshka_nli.py +++ b/examples/training/matryoshka/matryoshka_nli.py @@ -11,15 +11,19 @@ python matryoshka_nli.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from 
sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers # Set the log level to INFO to get more information diff --git a/examples/training/matryoshka/matryoshka_nli_reduced_dim.py b/examples/training/matryoshka/matryoshka_nli_reduced_dim.py index b1e470b78..4f5dc600e 100644 --- a/examples/training/matryoshka/matryoshka_nli_reduced_dim.py +++ b/examples/training/matryoshka/matryoshka_nli_reduced_dim.py @@ -15,15 +15,20 @@ python matryoshka_nli_reduced_dim.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses, models -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, + models, +) +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers # Set the log level to INFO to get more information diff --git a/examples/training/matryoshka/matryoshka_sts.py b/examples/training/matryoshka/matryoshka_sts.py index 7cb17d715..4722f3cf3 100644 --- a/examples/training/matryoshka/matryoshka_sts.py +++ b/examples/training/matryoshka/matryoshka_sts.py @@ -10,14 +10,19 @@ python matryoshka_sts.py pretrained_transformer_model_name """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments +from sentence_transformers import ( + SentenceTransformer, + SentenceTransformerTrainer, + SentenceTransformerTrainingArguments, + losses, +) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction -import logging -from datetime import datetime -import sys # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/ms_marco/eval_cross-encoder-trec-dl.py b/examples/training/ms_marco/eval_cross-encoder-trec-dl.py index acc96fe2f..f4635ca53 100644 --- a/examples/training/ms_marco/eval_cross-encoder-trec-dl.py +++ b/examples/training/ms_marco/eval_cross-encoder-trec-dl.py @@ -14,14 +14,16 @@ """ import gzip -from collections import defaultdict import logging -import tqdm -import numpy as np +import os import 
sys +from collections import defaultdict + +import numpy as np import pytrec_eval -from sentence_transformers import util, CrossEncoder -import os +import tqdm + +from sentence_transformers import CrossEncoder, util data_folder = "trec2019-data" os.makedirs(data_folder, exist_ok=True) diff --git a/examples/training/ms_marco/eval_msmarco.py b/examples/training/ms_marco/eval_msmarco.py index b40e25920..bfe6dbed9 100644 --- a/examples/training/ms_marco/eval_msmarco.py +++ b/examples/training/ms_marco/eval_msmarco.py @@ -6,12 +6,13 @@ python eval_msmarco.py model_name [max_corpus_size_in_thousands] """ -from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, util import logging -import sys import os +import sys import tarfile +from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, util + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/training/ms_marco/multilingual/translate_queries.py b/examples/training/ms_marco/multilingual/translate_queries.py index af7d7e941..56ca7542a 100644 --- a/examples/training/ms_marco/multilingual/translate_queries.py +++ b/examples/training/ms_marco/multilingual/translate_queries.py @@ -8,12 +8,14 @@ python translate_queries.py [target_language] """ -import os -from sentence_transformers import LoggingHandler, util import logging +import os +import sys import tarfile + from easynmt import EasyNMT -import sys + +from sentence_transformers import LoggingHandler, util #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/ms_marco/train_bi-encoder_margin-mse.py b/examples/training/ms_marco/train_bi-encoder_margin-mse.py index 7b397da62..bff34f5cb 100644 --- a/examples/training/ms_marco/train_bi-encoder_margin-mse.py +++ b/examples/training/ms_marco/train_bi-encoder_margin-mse.py @@ -1,18 +1,19 @@ -import sys +import argparse +import gzip import json -from torch.utils.data import DataLoader -from sentence_transformers import SentenceTransformer, LoggingHandler, util, models, losses, InputExample import logging -from datetime import datetime -import gzip import os -import tarfile -import tqdm -from torch.utils.data import Dataset +import pickle import random +import sys +import tarfile +from datetime import datetime from shutil import copyfile -import pickle -import argparse + +import tqdm +from torch.utils.data import DataLoader, Dataset + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/ms_marco/train_bi-encoder_mnrl.py b/examples/training/ms_marco/train_bi-encoder_mnrl.py index 110ced2e0..eea99b1c2 100644 --- a/examples/training/ms_marco/train_bi-encoder_mnrl.py +++ b/examples/training/ms_marco/train_bi-encoder_mnrl.py @@ -17,19 +17,20 @@ python train_bi-encoder_mnrl.py """ +import argparse +import gzip import json -from torch.utils.data import DataLoader -from sentence_transformers import SentenceTransformer, LoggingHandler, util, models, losses, InputExample import logging -from datetime import datetime -import gzip import os +import pickle +import random import tarfile +from datetime import datetime + import tqdm -from torch.utils.data import Dataset -import random -import pickle -import argparse +from torch.utils.data import
DataLoader, Dataset + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/ms_marco/train_cross-encoder_kd.py b/examples/training/ms_marco/train_cross-encoder_kd.py index 9045d27df..0d2cd6e0f 100644 --- a/examples/training/ms_marco/train_cross-encoder_kd.py +++ b/examples/training/ms_marco/train_cross-encoder_kd.py @@ -17,17 +17,18 @@ python train_cross-encoder_kd.py """ -from torch.utils.data import DataLoader -from sentence_transformers import LoggingHandler, util -from sentence_transformers.cross_encoder import CrossEncoder -from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator -from sentence_transformers import InputExample -import logging -from datetime import datetime import gzip +import logging import os import tarfile +from datetime import datetime + import torch +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, util +from sentence_transformers.cross_encoder import CrossEncoder +from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/ms_marco/train_cross-encoder_scratch.py b/examples/training/ms_marco/train_cross-encoder_scratch.py index faffcb87d..ebdd8d955 100644 --- a/examples/training/ms_marco/train_cross-encoder_scratch.py +++ b/examples/training/ms_marco/train_cross-encoder_scratch.py @@ -14,17 +14,18 @@ python train_cross-encoder_scratch.py """ -from torch.utils.data import DataLoader -from sentence_transformers import LoggingHandler, util -from sentence_transformers.cross_encoder import CrossEncoder -from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator -from sentence_transformers import InputExample -import logging -from datetime import datetime import gzip +import logging import os import tarfile +from datetime import datetime + import tqdm +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, util +from sentence_transformers.cross_encoder import CrossEncoder +from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/training/multilingual/get_parallel_data_opus.py b/examples/training/multilingual/get_parallel_data_opus.py index e66471624..a39d95b3b 100644 --- a/examples/training/multilingual/get_parallel_data_opus.py +++ b/examples/training/multilingual/get_parallel_data_opus.py @@ -29,9 +29,9 @@ """ -from opustools import OpusRead import os +from opustools import OpusRead corpora = ["JW300"] # Corpora you want to use source_languages = ["en"] # Source language, our teacher model is able to understand diff --git a/examples/training/multilingual/get_parallel_data_talks.py b/examples/training/multilingual/get_parallel_data_talks.py index 0c567b2e6..745e125d5 100644 --- a/examples/training/multilingual/get_parallel_data_talks.py +++ b/examples/training/multilingual/get_parallel_data_talks.py @@ -13,12 +13,13 @@ https://arxiv.org/abs/2004.09813 """ -import os -import sentence_transformers.util -import gzip import csv +import gzip +import os + from tqdm.autonotebook import tqdm +import sentence_transformers.util source_languages = set(["en"]) # Languages our (monolingual) teacher model understands target_languages
= set(["de", "es", "it", "fr", "ar", "tr"]) # New languages we want to extend to diff --git a/examples/training/multilingual/get_parallel_data_tatoeba.py b/examples/training/multilingual/get_parallel_data_tatoeba.py index a4a226d1a..77b033e6e 100644 --- a/examples/training/multilingual/get_parallel_data_tatoeba.py +++ b/examples/training/multilingual/get_parallel_data_tatoeba.py @@ -5,10 +5,11 @@ This script downloads the Tatoeba corpus and extracts the sentences & translations in the languages you like """ +import gzip import os -import sentence_transformers import tarfile -import gzip + +import sentence_transformers # Note: Tatoeba uses 3 letter language codes (ISO-639-2), # while other datasets like OPUS use 2 letter language codes (ISO-639-1) diff --git a/examples/training/multilingual/get_parallel_data_wikimatrix.py b/examples/training/multilingual/get_parallel_data_wikimatrix.py index 1f2e4520a..8b985dd38 100644 --- a/examples/training/multilingual/get_parallel_data_wikimatrix.py +++ b/examples/training/multilingual/get_parallel_data_wikimatrix.py @@ -9,10 +9,10 @@ https://arxiv.org/abs/2004.09813 """ -import os -import sentence_transformers.util import gzip +import os +import sentence_transformers.util source_languages = set(["en"]) # Languages our (monolingual) teacher model understands target_languages = set(["de", "es", "it", "fr", "ar", "tr"]) # New languages we want to extend to diff --git a/examples/training/multilingual/make_multilingual.py b/examples/training/multilingual/make_multilingual.py index b50d5d408..bb62d37bf 100644 --- a/examples/training/multilingual/make_multilingual.py +++ b/examples/training/multilingual/make_multilingual.py @@ -17,12 +17,14 @@ https://arxiv.org/abs/2004.09813 """ +import logging import traceback -from sentence_transformers import SentenceTransformer, LoggingHandler from datetime import datetime -from datasets import load_dataset, DatasetDict -import logging +import numpy as np + +from datasets import DatasetDict, load_dataset +from sentence_transformers import LoggingHandler, SentenceTransformer from sentence_transformers.evaluation import ( EmbeddingSimilarityEvaluator, MSEEvaluator, @@ -32,7 +34,6 @@ from sentence_transformers.losses import MSELoss from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments -import numpy as np logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/training/nli/training_nli.py b/examples/training/nli/training_nli.py index 2dd26c112..0a2dc6ba8 100644 --- a/examples/training/nli/training_nli.py +++ b/examples/training/nli/training_nli.py @@ -10,15 +10,14 @@ python training_nli.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import
SentenceTransformerTrainingArguments diff --git a/examples/training/nli/training_nli_v2.py b/examples/training/nli/training_nli_v2.py index 0567e0f8b..0b0025351 100644 --- a/examples/training/nli/training_nli_v2.py +++ b/examples/training/nli/training_nli_v2.py @@ -10,15 +10,14 @@ python training_nli_v2.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments diff --git a/examples/training/nli/training_nli_v3.py b/examples/training/nli/training_nli_v3.py index 1844a2588..ffc95e128 100644 --- a/examples/training/nli/training_nli_v3.py +++ b/examples/training/nli/training_nli_v3.py @@ -10,15 +10,14 @@ python training_nli_v3.py pretrained_transformer_model_name """ -import traceback -from datasets import load_dataset -from sentence_transformers import losses -from sentence_transformers import SentenceTransformer -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator import logging -from datetime import datetime import sys +import traceback +from datetime import datetime +from datasets import load_dataset +from sentence_transformers import SentenceTransformer, losses +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments diff --git a/examples/training/other/training_batch_hard_trec.py b/examples/training/other/training_batch_hard_trec.py index 0d17817c7..7082cb706 100644 --- a/examples/training/other/training_batch_hard_trec.py +++ b/examples/training/other/training_batch_hard_trec.py @@ -16,18 +16,18 @@ all sentences with the same label should be close and sentences for different labels should be clearly separated.
""" -from sentence_transformers import SentenceTransformer, LoggingHandler, losses, util -from sentence_transformers.datasets import SentenceLabelDataset -from torch.utils.data import DataLoader -from sentence_transformers.readers import InputExample -from sentence_transformers.evaluation import TripletEvaluator -from datetime import datetime - - import logging import os import random from collections import defaultdict +from datetime import datetime + +from torch.utils.data import DataLoader + +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, util +from sentence_transformers.datasets import SentenceLabelDataset +from sentence_transformers.evaluation import TripletEvaluator +from sentence_transformers.readers import InputExample logging.basicConfig( format="%(asctime)s - %(message)s", diff --git a/examples/training/other/training_multi-task.py b/examples/training/other/training_multi-task.py index b9de63c0a..e54e6f270 100644 --- a/examples/training/other/training_multi-task.py +++ b/examples/training/other/training_multi-task.py @@ -4,16 +4,17 @@ The system trains BERT on the AllNLI and on the STSbenchmark dataset. """ +import logging import traceback +from datetime import datetime + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.losses import CosineSimilarityLoss, SoftmaxLoss from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments -import logging -from datetime import datetime -from datasets import load_dataset # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/other/training_wikipedia_sections.py b/examples/training/other/training_wikipedia_sections.py index 8f9b36dfa..e1c835418 100644 --- a/examples/training/other/training_wikipedia_sections.py +++ b/examples/training/other/training_wikipedia_sections.py @@ -4,15 +4,16 @@ As a corpus, we use the Wikipedia sections dataset that was described by Dor et al., 2018, Learning Thematic Similarity Metric Using Triplet Networks. """ +import logging import traceback +from datetime import datetime + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import TripletEvaluator from sentence_transformers.losses import TripletLoss from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments -from datetime import datetime -from datasets import load_dataset -import logging # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/paraphrases/training.py b/examples/training/paraphrases/training.py index 6ad6b5877..1184b3f05 100644 --- a/examples/training/paraphrases/training.py +++ b/examples/training/paraphrases/training.py @@ -3,7 +3,11 @@ As a result, it does not produce exactly the same behaviour as the original script.
""" +import logging import traceback +from datetime import datetime + +from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.losses import MultipleNegativesRankingLoss @@ -14,10 +18,6 @@ MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments, ) -import logging -from datetime import datetime -from datasets import load_dataset - # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/quora_duplicate_questions/create_splits.py b/examples/training/quora_duplicate_questions/create_splits.py index f69a0040d..00e485e10 100644 --- a/examples/training/quora_duplicate_questions/create_splits.py +++ b/examples/training/quora_duplicate_questions/create_splits.py @@ -44,11 +44,11 @@ """ import csv -from collections import defaultdict -import random import os -from sentence_transformers import util +import random +from collections import defaultdict +from sentence_transformers import util random.seed(42) diff --git a/examples/training/quora_duplicate_questions/training_MultipleNegativesRankingLoss.py b/examples/training/quora_duplicate_questions/training_MultipleNegativesRankingLoss.py index 9ddea4dd2..b8641bc87 100644 --- a/examples/training/quora_duplicate_questions/training_MultipleNegativesRankingLoss.py +++ b/examples/training/quora_duplicate_questions/training_MultipleNegativesRankingLoss.py @@ -11,7 +11,11 @@ The model we get works well for duplicate question mining and for duplicate question information retrieval. For question pair classification, other losses (like OnlineContrastiveLoss) work better. """ +import logging +import random import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( @@ -23,9 +27,6 @@ from sentence_transformers.losses import MultipleNegativesRankingLoss from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments -import logging -from datetime import datetime -import random # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/quora_duplicate_questions/training_OnlineContrastiveLoss.py b/examples/training/quora_duplicate_questions/training_OnlineContrastiveLoss.py index 51fc34b0a..ab35b8edb 100644 --- a/examples/training/quora_duplicate_questions/training_OnlineContrastiveLoss.py +++ b/examples/training/quora_duplicate_questions/training_OnlineContrastiveLoss.py @@ -9,7 +9,11 @@ An issue with contrastive loss is that it might push sentences away that are already well positioned in vector space.
""" +import logging +import random import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( @@ -22,9 +26,6 @@ from sentence_transformers.losses.ContrastiveLoss import SiameseDistanceMetric from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments -import logging -from datetime import datetime -import random # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/quora_duplicate_questions/training_multi-task-learning.py b/examples/training/quora_duplicate_questions/training_multi-task-learning.py index 15d8c227c..c9b301d16 100644 --- a/examples/training/quora_duplicate_questions/training_multi-task-learning.py +++ b/examples/training/quora_duplicate_questions/training_multi-task-learning.py @@ -11,7 +11,11 @@ model.fit(train_objectives=[(train_dataloader_MultipleNegativesRankingLoss, train_loss_MultipleNegativesRankingLoss), (train_dataloader_contrastive_loss, train_loss_contrastive_loss)] ...) """ +import logging +import random import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( @@ -27,9 +31,6 @@ MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments, ) -import logging -from datetime import datetime -import random # Set the log level to INFO to get more information logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO) diff --git a/examples/training/sts/training_stsbenchmark.py b/examples/training/sts/training_stsbenchmark.py index 9bdc1efe3..c61648494 100644 --- a/examples/training/sts/training_stsbenchmark.py +++ b/examples/training/sts/training_stsbenchmark.py @@ -9,14 +9,14 @@ python training_stsbenchmark.py pretrained_transformer_model_name """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer, losses from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -import logging -from datetime import datetime -import sys - from sentence_transformers.similarity_functions import SimilarityFunction from sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/training/sts/training_stsbenchmark_continue_training.py b/examples/training/sts/training_stsbenchmark_continue_training.py index ff4c70bdd..852892be9 100644 --- a/examples/training/sts/training_stsbenchmark_continue_training.py +++ b/examples/training/sts/training_stsbenchmark_continue_training.py @@ -6,14 +6,14 @@ If you want to fine-tune a huggingface/transformers model like bert-base-uncased, see training_nli.py and training_stsbenchmark.py """ +import logging +import sys import traceback +from datetime import datetime + from datasets import load_dataset from sentence_transformers import SentenceTransformer, losses from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -import logging -from datetime import datetime -import sys - from sentence_transformers.similarity_functions import SimilarityFunction from
sentence_transformers.trainer import SentenceTransformerTrainer from sentence_transformers.training_args import SentenceTransformerTrainingArguments diff --git a/examples/unsupervised_learning/CT/train_askubuntu_ct.py b/examples/unsupervised_learning/CT/train_askubuntu_ct.py index a4cf386f0..ed35538b0 100644 --- a/examples/unsupervised_learning/CT/train_askubuntu_ct.py +++ b/examples/unsupervised_learning/CT/train_askubuntu_ct.py @@ -1,11 +1,12 @@ -from sentence_transformers import SentenceTransformer, LoggingHandler -from sentence_transformers import models, util, evaluation, losses +import gzip import logging import os -import gzip from datetime import datetime + import torch +from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, losses, models, util + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/CT/train_ct_from_file.py b/examples/unsupervised_learning/CT/train_ct_from_file.py index 9e8444ddb..a78a9bed5 100644 --- a/examples/unsupervised_learning/CT/train_ct_from_file.py +++ b/examples/unsupervised_learning/CT/train_ct_from_file.py @@ -8,15 +8,16 @@ """ -import math -from sentence_transformers import models, losses -from sentence_transformers import LoggingHandler, SentenceTransformer -import logging -from datetime import datetime import gzip +import logging +import math import sys +from datetime import datetime + import tqdm +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/CT/train_stsb_ct.py b/examples/unsupervised_learning/CT/train_stsb_ct.py index 23b9a3ae9..0fe756f63 100644 --- a/examples/unsupervised_learning/CT/train_stsb_ct.py +++ b/examples/unsupervised_learning/CT/train_stsb_ct.py @@ -1,12 +1,13 @@ -import torch -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -from sentence_transformers import SentenceTransformer, LoggingHandler, models, util, InputExample -from sentence_transformers import losses -import os -import gzip import csv -from datetime import datetime +import gzip import logging +import os +from datetime import datetime + +import torch + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py index e0c56e7f4..9df16ff1c 100644 --- a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py +++ b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_askubuntu_ct-improved.py @@ -1,11 +1,12 @@ -from sentence_transformers import SentenceTransformer, LoggingHandler, InputExample -from sentence_transformers import models, util, evaluation, losses +import gzip import logging import os -import gzip from datetime import datetime + from torch.utils.data import DataLoader +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, 
evaluation, losses, models, util + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_ct-improved_from_file.py b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_ct-improved_from_file.py index 8338628d1..06c04c4d6 100644 --- a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_ct-improved_from_file.py +++ b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_ct-improved_from_file.py @@ -8,16 +8,17 @@ """ -import math -from sentence_transformers import models, losses -from sentence_transformers import LoggingHandler, SentenceTransformer -import logging -from datetime import datetime import gzip +import logging +import math import sys +from datetime import datetime + import tqdm from torch.utils.data import DataLoader +from sentence_transformers import LoggingHandler, SentenceTransformer, losses, models + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_stsb_ct-improved.py b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_stsb_ct-improved.py index 015d798c0..6201d94c2 100644 --- a/examples/unsupervised_learning/CT_In-Batch_Negatives/train_stsb_ct-improved.py +++ b/examples/unsupervised_learning/CT_In-Batch_Negatives/train_stsb_ct-improved.py @@ -1,13 +1,14 @@ -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator -from sentence_transformers import SentenceTransformer, LoggingHandler, models, util, InputExample -from sentence_transformers import losses -import os -import gzip import csv -from datetime import datetime +import gzip import logging +import os +from datetime import datetime + from torch.utils.data import DataLoader +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/MLM/train_mlm.py b/examples/unsupervised_learning/MLM/train_mlm.py index ccc11f08f..3f6e144d6 100644 --- a/examples/unsupervised_learning/MLM/train_mlm.py +++ b/examples/unsupervised_learning/MLM/train_mlm.py @@ -8,13 +8,19 @@ python train_mlm.py model_name data/train_sentences.txt [data/dev_sentences.txt] """ -from transformers import AutoModelForMaskedLM, AutoTokenizer -from transformers import DataCollatorForLanguageModeling, DataCollatorForWholeWordMask -from transformers import Trainer, TrainingArguments -import sys import gzip +import sys from datetime import datetime +from transformers import ( + AutoModelForMaskedLM, + AutoTokenizer, + DataCollatorForLanguageModeling, + DataCollatorForWholeWordMask, + Trainer, + TrainingArguments, +) + if len(sys.argv) < 3: print("Usage: python train_mlm.py model_name data/train_sentences.txt [data/dev_sentences.txt]") exit() diff --git a/examples/unsupervised_learning/SimCSE/train_askubuntu_simcse.py b/examples/unsupervised_learning/SimCSE/train_askubuntu_simcse.py index 8b63aad70..bf581e017 100644 --- 
a/examples/unsupervised_learning/SimCSE/train_askubuntu_simcse.py +++ b/examples/unsupervised_learning/SimCSE/train_askubuntu_simcse.py @@ -1,11 +1,11 @@ -from sentence_transformers import SentenceTransformer, LoggingHandler, InputExample -from sentence_transformers import models, util, evaluation, losses +import gzip import logging import os -import gzip -from torch.utils.data import DataLoader from datetime import datetime +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, evaluation, losses, models, util #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/unsupervised_learning/SimCSE/train_simcse_from_file.py b/examples/unsupervised_learning/SimCSE/train_simcse_from_file.py index 861f99951..6c196d7e0 100644 --- a/examples/unsupervised_learning/SimCSE/train_simcse_from_file.py +++ b/examples/unsupervised_learning/SimCSE/train_simcse_from_file.py @@ -8,15 +8,16 @@ """ -from torch.utils.data import DataLoader -import math -from sentence_transformers import models, losses -from sentence_transformers import LoggingHandler, SentenceTransformer, InputExample -import logging -from datetime import datetime import gzip +import logging +import math import sys +from datetime import datetime + import tqdm +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/unsupervised_learning/SimCSE/train_stsb_simcse.py b/examples/unsupervised_learning/SimCSE/train_stsb_simcse.py index 0fcba32d4..82d98fbcb 100644 --- a/examples/unsupervised_learning/SimCSE/train_stsb_simcse.py +++ b/examples/unsupervised_learning/SimCSE/train_stsb_simcse.py @@ -1,13 +1,14 @@ -from torch.utils.data import DataLoader -import math -from sentence_transformers import models, losses -from sentence_transformers import LoggingHandler, SentenceTransformer, util, InputExample -from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator +import csv +import gzip import logging -from datetime import datetime +import math import os -import gzip -import csv +from datetime import datetime + +from torch.utils.data import DataLoader + +from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, losses, models, util +from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator #### Just some code to print debug information to stdout logging.basicConfig( diff --git a/examples/unsupervised_learning/TSDAE/eval_askubuntu.py b/examples/unsupervised_learning/TSDAE/eval_askubuntu.py index 7efe16dfe..4b5a3cdfc 100644 --- a/examples/unsupervised_learning/TSDAE/eval_askubuntu.py +++ b/examples/unsupervised_learning/TSDAE/eval_askubuntu.py @@ -5,13 +5,13 @@ python eval_askubuntu.py [sbert_model_name_or_path] """ -from sentence_transformers import SentenceTransformer, LoggingHandler -from sentence_transformers import util, evaluation +import gzip import logging import os -import gzip import sys +from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation, util + #### Just some code to print debug information to stdout logging.basicConfig( format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()] diff --git a/examples/unsupervised_learning/TSDAE/train_askubuntu_tsdae.py 
b/examples/unsupervised_learning/TSDAE/train_askubuntu_tsdae.py
index 6abccbb50..beac68f90 100644
--- a/examples/unsupervised_learning/TSDAE/train_askubuntu_tsdae.py
+++ b/examples/unsupervised_learning/TSDAE/train_askubuntu_tsdae.py
@@ -1,11 +1,12 @@
-from sentence_transformers import SentenceTransformer, LoggingHandler
-from sentence_transformers import models, util, datasets, evaluation, losses
+import gzip
 import logging
 import os
-import gzip
-from torch.utils.data import DataLoader
-from datetime import datetime
 import sys
+from datetime import datetime
+
+from torch.utils.data import DataLoader
+
+from sentence_transformers import LoggingHandler, SentenceTransformer, datasets, evaluation, losses, models, util
 
 #### Just some code to print debug information to stdout
 logging.basicConfig(
diff --git a/examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py b/examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
index 0e898d91c..96d80b14a 100644
--- a/examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
+++ b/examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
@@ -1,12 +1,13 @@
-from torch.utils.data import DataLoader
-from sentence_transformers import models, losses, datasets
-from sentence_transformers import LoggingHandler, SentenceTransformer, util, InputExample
-from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+import csv
+import gzip
 import logging
-from datetime import datetime
 import os
-import gzip
-import csv
+from datetime import datetime
+
+from torch.utils.data import DataLoader
+
+from sentence_transformers import InputExample, LoggingHandler, SentenceTransformer, datasets, losses, models, util
+from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
 
 #### Just some code to print debug information to stdout
 logging.basicConfig(
diff --git a/examples/unsupervised_learning/TSDAE/train_tsdae_from_file.py b/examples/unsupervised_learning/TSDAE/train_tsdae_from_file.py
index 257c9b6c9..14db4da15 100644
--- a/examples/unsupervised_learning/TSDAE/train_tsdae_from_file.py
+++ b/examples/unsupervised_learning/TSDAE/train_tsdae_from_file.py
@@ -8,14 +8,15 @@
 """
 
-from sentence_transformers import SentenceTransformer, LoggingHandler
-from sentence_transformers import models, datasets, losses
-import logging
 import gzip
-from torch.utils.data import DataLoader
-from datetime import datetime
+import logging
 import sys
+from datetime import datetime
+
 import tqdm
+from torch.utils.data import DataLoader
+
+from sentence_transformers import LoggingHandler, SentenceTransformer, datasets, losses, models
 
 #### Just some code to print debug information to stdout
 logging.basicConfig(
diff --git a/examples/unsupervised_learning/query_generation/1_programming_query_generation.py b/examples/unsupervised_learning/query_generation/1_programming_query_generation.py
index f75875ca9..7558edc08 100644
--- a/examples/unsupervised_learning/query_generation/1_programming_query_generation.py
+++ b/examples/unsupervised_learning/query_generation/1_programming_query_generation.py
@@ -11,12 +11,14 @@
 3_programming_semantic_search.py - Shows how the trained model can be used for semantic search
 """
-import json
 import gzip
-from transformers import T5Tokenizer, T5ForConditionalGeneration
+import json
+import os
+
 import torch
 import tqdm
-import os
+from transformers import T5ForConditionalGeneration, T5Tokenizer
+
 from sentence_transformers import util
 
 paragraphs = set()
diff --git a/examples/unsupervised_learning/query_generation/2_programming_train_bi-encoder.py b/examples/unsupervised_learning/query_generation/2_programming_train_bi-encoder.py
index ae316c810..6364b5731 100644
--- a/examples/unsupervised_learning/query_generation/2_programming_train_bi-encoder.py
+++ b/examples/unsupervised_learning/query_generation/2_programming_train_bi-encoder.py
@@ -11,9 +11,9 @@
 3_programming_semantic_search.py - Shows how the trained model can be used for semantic search
 """
-from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets
 import os
+from sentence_transformers import InputExample, SentenceTransformer, datasets, losses, models
 
 train_examples = []
 with open("generated_queries.tsv") as fIn:
diff --git a/examples/unsupervised_learning/query_generation/3_programming_semantic_search.py b/examples/unsupervised_learning/query_generation/3_programming_semantic_search.py
index 46a8b9ef6..d3f73ae44 100644
--- a/examples/unsupervised_learning/query_generation/3_programming_semantic_search.py
+++ b/examples/unsupervised_learning/query_generation/3_programming_semantic_search.py
@@ -11,11 +11,12 @@
 3_programming_semantic_search.py - Shows how the trained model can be used for semantic search
 """
-from sentence_transformers import SentenceTransformer, util
 import gzip
 import json
 import os
 
+from sentence_transformers import SentenceTransformer, util
+
 # Load the model we trained in 2_programming_train_bi-encoder.py
 model = SentenceTransformer("output/programming-model")
diff --git a/examples/unsupervised_learning/query_generation/example_query_generation.py b/examples/unsupervised_learning/query_generation/example_query_generation.py
index 99be09806..099b960a5 100644
--- a/examples/unsupervised_learning/query_generation/example_query_generation.py
+++ b/examples/unsupervised_learning/query_generation/example_query_generation.py
@@ -1,7 +1,8 @@
-import torch
-import numpy as np
 import random
-from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+import numpy as np
+import torch
+from transformers import T5ForConditionalGeneration, T5Tokenizer
 
 # Set all seeds to make output deterministic
 torch.manual_seed(0)
diff --git a/ruff.toml b/ruff.toml
index 765b3bee5..0c56fdb1d 100644
--- a/ruff.toml
+++ b/ruff.toml
@@ -1,4 +1,3 @@
-lint.ignore-init-module-imports = true
 line-length = 119
diff --git a/sentence_transformers/LoggingHandler.py b/sentence_transformers/LoggingHandler.py
index 7696f353e..7ae11480c 100644
--- a/sentence_transformers/LoggingHandler.py
+++ b/sentence_transformers/LoggingHandler.py
@@ -1,4 +1,5 @@
 import logging
+
 import tqdm
diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py
index f2ad12f83..280312847 100644
--- a/sentence_transformers/SentenceTransformer.py
+++ b/sentence_transformers/SentenceTransformer.py
@@ -1,46 +1,46 @@
-from contextlib import contextmanager
+import copy
+import importlib
 import json
 import logging
+import math
 import os
-from collections import OrderedDict
-from pathlib import Path
+import queue
+import tempfile
 import traceback
 import warnings
-from typing import Callable, List, Dict, Literal, Tuple, Iterable, Union, Optional, overload, Any
+from collections import OrderedDict
+from contextlib import contextmanager
+from pathlib import Path
+from typing import Any, Callable, Dict, Iterable, List, Literal, Optional, Tuple, Union, overload
+
 import numpy as np
-from numpy import ndarray
-import transformers
-from transformers import is_torch_npu_available
-from huggingface_hub import HfApi
 import torch
-from torch import nn, Tensor, device
 import torch.multiprocessing as mp
+import transformers
+from huggingface_hub import HfApi
+from numpy import ndarray
+from torch import Tensor, device, nn
 from tqdm.autonotebook import trange
-import math
-import queue
-import tempfile
-import copy
-import importlib
+from transformers import is_torch_npu_available
 
 from sentence_transformers.model_card import SentenceTransformerModelCardData, generate_model_card
 from sentence_transformers.similarity_functions import SimilarityFunction
 
-from . import __MODEL_HUB_ORGANIZATION__
+from . import __MODEL_HUB_ORGANIZATION__, __version__
 from .evaluation import SentenceEvaluator
+from .fit_mixin import FitMixin
+from .models import Normalize, Pooling, Transformer
+from .quantization import quantize_embeddings
 from .util import (
-    import_from_string,
     batch_to_device,
+    get_device_name,
+    import_from_string,
     is_sentence_transformer_model,
     load_dir_path,
     load_file_path,
     save_to_hub_args_decorator,
-    get_device_name,
     truncate_embeddings,
 )
-from .quantization import quantize_embeddings
-from .models import Transformer, Pooling, Normalize
-from .fit_mixin import FitMixin
-from . import __version__
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/__init__.py b/sentence_transformers/__init__.py
index 3d772711b..488a4b9fa 100644
--- a/sentence_transformers/__init__.py
+++ b/sentence_transformers/__init__.py
@@ -4,17 +4,16 @@
 import importlib
 import os
 
-from .datasets import SentencesDataset, ParallelSentencesDataset
-from .LoggingHandler import LoggingHandler
-from .SentenceTransformer import SentenceTransformer
-from .similarity_functions import SimilarityFunction
-from .readers import InputExample
-from .cross_encoder.CrossEncoder import CrossEncoder
-from .trainer import SentenceTransformerTrainer
-from .training_args import SentenceTransformerTrainingArguments
-from .model_card import SentenceTransformerModelCardData
-from .quantization import quantize_embeddings
-
+from sentence_transformers.cross_encoder.CrossEncoder import CrossEncoder
+from sentence_transformers.datasets import ParallelSentencesDataset, SentencesDataset
+from sentence_transformers.LoggingHandler import LoggingHandler
+from sentence_transformers.model_card import SentenceTransformerModelCardData
+from sentence_transformers.quantization import quantize_embeddings
+from sentence_transformers.readers import InputExample
+from sentence_transformers.SentenceTransformer import SentenceTransformer
+from sentence_transformers.similarity_functions import SimilarityFunction
+from sentence_transformers.trainer import SentenceTransformerTrainer
+from sentence_transformers.training_args import SentenceTransformerTrainingArguments
 
 # If codecarbon is installed and the log level is not defined,
 # automatically overwrite the default to "error"
diff --git a/sentence_transformers/cross_encoder/CrossEncoder.py b/sentence_transformers/cross_encoder/CrossEncoder.py
index cc205c317..9462b69b6 100644
--- a/sentence_transformers/cross_encoder/CrossEncoder.py
+++ b/sentence_transformers/cross_encoder/CrossEncoder.py
@@ -1,22 +1,20 @@
+import logging
+import os
 from functools import wraps
+from typing import Callable, Dict, List, Optional, Type, Union
 
-from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
 import numpy as np
-import logging
-import os
-from typing import Dict, Type, Callable, List, Optional, Union
 import torch
 from torch import nn
 from torch.optim import Optimizer
 from torch.utils.data import DataLoader
 from tqdm.autonotebook import tqdm, trange
-from transformers import is_torch_npu_available
+from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, is_torch_npu_available
 from transformers.utils import PushToHubMixin
 
-from .. import SentenceTransformer, util
-from ..evaluation import SentenceEvaluator
-from ..util import get_device_name
-
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.SentenceTransformer import SentenceTransformer
+from sentence_transformers.util import fullname, get_device_name, import_from_string
 
 logger = logging.getLogger(__name__)
@@ -114,7 +112,7 @@ def __init__(
         if default_activation_function is not None:
             self.default_activation_function = default_activation_function
             try:
-                self.config.sbert_ce_default_activation_function = util.fullname(self.default_activation_function)
+                self.config.sbert_ce_default_activation_function = fullname(self.default_activation_function)
             except Exception as e:
                 logger.warning(
                     "Was not able to update config about the default_activation_function: {}".format(str(e))
                 )
@@ -123,9 +121,7 @@ def __init__(
             hasattr(self.config, "sbert_ce_default_activation_function")
             and self.config.sbert_ce_default_activation_function is not None
         ):
-            self.default_activation_function = util.import_from_string(
-                self.config.sbert_ce_default_activation_function
-            )()
+            self.default_activation_function = import_from_string(self.config.sbert_ce_default_activation_function)()
         else:
             self.default_activation_function = nn.Sigmoid() if self.config.num_labels == 1 else nn.Identity()
diff --git a/sentence_transformers/cross_encoder/evaluation/CEBinaryAccuracyEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CEBinaryAccuracyEvaluator.py
index 70eccbd23..b0fb33723 100644
--- a/sentence_transformers/cross_encoder/evaluation/CEBinaryAccuracyEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CEBinaryAccuracyEvaluator.py
@@ -1,10 +1,11 @@
+import csv
 import logging
 import os
-import csv
 from typing import List
-from ... import InputExample
+
 import numpy as np
 
+from sentence_transformers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CEBinaryClassificationEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CEBinaryClassificationEvaluator.py
index 0d41a7a08..da6e3a36d 100644
--- a/sentence_transformers/cross_encoder/evaluation/CEBinaryClassificationEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CEBinaryClassificationEvaluator.py
@@ -1,13 +1,13 @@
+import csv
 import logging
-from sklearn.metrics import average_precision_score
-from typing import List
-import numpy as np
 import os
-import csv
+from typing import List
 
-from ... import InputExample
-from ...evaluation import BinaryClassificationEvaluator
+import numpy as np
+from sklearn.metrics import average_precision_score
 
+from sentence_transformers import InputExample
+from sentence_transformers.evaluation import BinaryClassificationEvaluator
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py
index fbb76ac53..80f8f48df 100644
--- a/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CECorrelationEvaluator.py
@@ -1,10 +1,11 @@
+import csv
 import logging
-from scipy.stats import pearsonr, spearmanr
-from typing import List
 import os
-import csv
-from ... import InputExample
+from typing import List
+
+from scipy.stats import pearsonr, spearmanr
 
+from sentence_transformers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py b/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py
index f9bf25b25..c28ab631e 100644
--- a/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CEF1Evaluator.py
@@ -4,10 +4,10 @@
 from typing import List
 
 import numpy as np
+from sklearn.metrics import f1_score
 
+from sentence_transformers.cross_encoder import CrossEncoder
 from sentence_transformers.readers.InputExample import InputExample
-from .. import CrossEncoder
-from sklearn.metrics import f1_score
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py
index fa6160ec4..8552fc9d9 100644
--- a/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CERerankingEvaluator.py
@@ -1,8 +1,9 @@
+import csv
 import logging
-import numpy as np
 import os
-import csv
 from typing import Optional
+
+import numpy as np
 from sklearn.metrics import ndcg_score
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/CESoftmaxAccuracyEvaluator.py b/sentence_transformers/cross_encoder/evaluation/CESoftmaxAccuracyEvaluator.py
index ec4704e31..e3973d952 100644
--- a/sentence_transformers/cross_encoder/evaluation/CESoftmaxAccuracyEvaluator.py
+++ b/sentence_transformers/cross_encoder/evaluation/CESoftmaxAccuracyEvaluator.py
@@ -1,10 +1,11 @@
+import csv
 import logging
 import os
-import csv
 from typing import List
-from ... import InputExample
+
 import numpy as np
 
+from sentence_transformers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/cross_encoder/evaluation/__init__.py b/sentence_transformers/cross_encoder/evaluation/__init__.py
index 43d2db677..ac176ff83 100644
--- a/sentence_transformers/cross_encoder/evaluation/__init__.py
+++ b/sentence_transformers/cross_encoder/evaluation/__init__.py
@@ -1,9 +1,9 @@
 from .CEBinaryAccuracyEvaluator import CEBinaryAccuracyEvaluator
 from .CEBinaryClassificationEvaluator import CEBinaryClassificationEvaluator
-from .CEF1Evaluator import CEF1Evaluator
 from .CECorrelationEvaluator import CECorrelationEvaluator
-from .CESoftmaxAccuracyEvaluator import CESoftmaxAccuracyEvaluator
+from .CEF1Evaluator import CEF1Evaluator
 from .CERerankingEvaluator import CERerankingEvaluator
+from .CESoftmaxAccuracyEvaluator import CESoftmaxAccuracyEvaluator
 
 __all__ = [
     "CEBinaryAccuracyEvaluator",
diff --git a/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py b/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py
index 973d55cac..997413cdf 100644
--- a/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py
+++ b/sentence_transformers/datasets/DenoisingAutoEncoderDataset.py
@@ -1,8 +1,10 @@
-from torch.utils.data import Dataset
 from typing import List
-from ..readers.InputExample import InputExample
+
 import numpy as np
-from transformers.utils.import_utils import is_nltk_available, NLTK_IMPORT_ERROR
+from torch.utils.data import Dataset
+from transformers.utils.import_utils import NLTK_IMPORT_ERROR, is_nltk_available
+
+from sentence_transformers.readers.InputExample import InputExample
 
 
 class DenoisingAutoEncoderDataset(Dataset):
@@ -34,7 +36,7 @@ def __len__(self):
     # Deletion noise.
     @staticmethod
     def delete(text, del_ratio=0.6):
-        from nltk import word_tokenize, TreebankWordDetokenizer
+        from nltk import TreebankWordDetokenizer, word_tokenize
 
         words = word_tokenize(text)
         n = len(words)
diff --git a/sentence_transformers/datasets/NoDuplicatesDataLoader.py b/sentence_transformers/datasets/NoDuplicatesDataLoader.py
index e05b504b7..f910183c1 100644
--- a/sentence_transformers/datasets/NoDuplicatesDataLoader.py
+++ b/sentence_transformers/datasets/NoDuplicatesDataLoader.py
@@ -1,5 +1,5 @@
-import random
 import math
+import random
 
 
 class NoDuplicatesDataLoader:
diff --git a/sentence_transformers/datasets/ParallelSentencesDataset.py b/sentence_transformers/datasets/ParallelSentencesDataset.py
index 1ec72e90b..6b64179dc 100644
--- a/sentence_transformers/datasets/ParallelSentencesDataset.py
+++ b/sentence_transformers/datasets/ParallelSentencesDataset.py
@@ -1,11 +1,12 @@
-from torch.utils.data import Dataset
-import logging
 import gzip
-from .. import SentenceTransformer
-from ..readers import InputExample
-from typing import List
+import logging
 import random
+from typing import List
+
+from torch.utils.data import Dataset
 
+from sentence_transformers import SentenceTransformer
+from sentence_transformers.readers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/datasets/SentenceLabelDataset.py b/sentence_transformers/datasets/SentenceLabelDataset.py
index ca90665eb..c716f82ed 100644
--- a/sentence_transformers/datasets/SentenceLabelDataset.py
+++ b/sentence_transformers/datasets/SentenceLabelDataset.py
@@ -1,8 +1,10 @@
-from torch.utils.data import IterableDataset
-import numpy as np
-from typing import List
-from ..readers import InputExample
 import logging
+from typing import List
+
+import numpy as np
+from torch.utils.data import IterableDataset
+
+from sentence_transformers.readers import InputExample
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/datasets/SentencesDataset.py b/sentence_transformers/datasets/SentencesDataset.py
index ae689676e..f7795a8fc 100644
--- a/sentence_transformers/datasets/SentencesDataset.py
+++ b/sentence_transformers/datasets/SentencesDataset.py
@@ -1,7 +1,9 @@
-from torch.utils.data import Dataset
 from typing import List
-from .. import SentenceTransformer
-from ..readers.InputExample import InputExample
+
+from torch.utils.data import Dataset
+
+from sentence_transformers import SentenceTransformer
+from sentence_transformers.readers.InputExample import InputExample
 
 
 class SentencesDataset(Dataset):
diff --git a/sentence_transformers/datasets/__init__.py b/sentence_transformers/datasets/__init__.py
index 6d0f06471..33cc755d5 100644
--- a/sentence_transformers/datasets/__init__.py
+++ b/sentence_transformers/datasets/__init__.py
@@ -1,8 +1,8 @@
 from .DenoisingAutoEncoderDataset import DenoisingAutoEncoderDataset
 from .NoDuplicatesDataLoader import NoDuplicatesDataLoader
 from .ParallelSentencesDataset import ParallelSentencesDataset
-from .SentencesDataset import SentencesDataset
 from .SentenceLabelDataset import SentenceLabelDataset
+from .SentencesDataset import SentencesDataset
 
 __all__ = [
     "DenoisingAutoEncoderDataset",
diff --git a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
index b3838f116..a4910c299 100644
--- a/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
+++ b/sentence_transformers/evaluation/BinaryClassificationEvaluator.py
@@ -1,17 +1,19 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-
-from sentence_transformers.similarity_functions import SimilarityFunction
-from . import SentenceEvaluator
+import csv
 import logging
 import os
-import csv
-from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
-from sklearn.metrics import average_precision_score
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional
+
 import numpy as np
-from typing import Dict, List, Optional
-from ..readers import InputExample
+from sklearn.metrics import average_precision_score
+from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.readers import InputExample
+from sentence_transformers.similarity_functions import SimilarityFunction
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -150,7 +152,7 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs):
         return cls(sentences1, sentences2, scores, **kwargs)
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         """
         Compute the evaluation metrics for the given model.
diff --git a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py
index 4c89d6178..bf9729631 100644
--- a/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py
+++ b/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py
@@ -1,17 +1,19 @@
-from contextlib import nullcontext
-
-from sentence_transformers import SentenceTransformer
-from . import SentenceEvaluator
-from sentence_transformers.similarity_functions import SimilarityFunction
+import csv
 import logging
 import os
-import csv
-from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
-from scipy.stats import pearsonr, spearmanr
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Literal, Optional, Union
+
 import numpy as np
-from typing import Dict, List, Literal, Optional, Union
-from ..readers import InputExample
+from scipy.stats import pearsonr, spearmanr
+from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.readers import InputExample
+from sentence_transformers.similarity_functions import SimilarityFunction
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -139,7 +141,7 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs):
         return cls(sentences1, sentences2, scores, **kwargs)
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
diff --git a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py
index 8381abf5b..1d46f980c 100644
--- a/sentence_transformers/evaluation/InformationRetrievalEvaluator.py
+++ b/sentence_transformers/evaluation/InformationRetrievalEvaluator.py
@@ -1,18 +1,20 @@
-from sentence_transformers import SentenceTransformer
+import heapq
+import logging
+import os
 from contextlib import nullcontext
+from typing import TYPE_CHECKING, Callable, Dict, List, Optional, Set, Union
 
-from sentence_transformers.similarity_functions import SimilarityFunction
-from . import SentenceEvaluator
+import numpy as np
 import torch
 from torch import Tensor
-import logging
 from tqdm import trange
-from ..util import cos_sim, dot_score
-import os
-import numpy as np
-from typing import List, Dict, Optional, Set, Callable, Union
-import heapq
 
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.similarity_functions import SimilarityFunction
+from sentence_transformers.util import cos_sim, dot_score
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -203,7 +205,7 @@ def __init__(
                     self.csv_headers.append("{}-MAP@{}".format(score_name, k))
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1, *args, **kwargs
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1, *args, **kwargs
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
@@ -272,7 +274,7 @@ def __call__(
         return metrics
 
     def compute_metrices(
-        self, model: SentenceTransformer, corpus_model=None, corpus_embeddings: Tensor = None
+        self, model: "SentenceTransformer", corpus_model=None, corpus_embeddings: Tensor = None
    ) -> Dict[str, float]:
         if corpus_model is None:
             corpus_model = model
diff --git a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py
index 2bd65a163..4031cef9d 100644
--- a/sentence_transformers/evaluation/LabelAccuracyEvaluator.py
+++ b/sentence_transformers/evaluation/LabelAccuracyEvaluator.py
@@ -1,13 +1,16 @@
-from typing import Dict
-from sentence_transformers import SentenceTransformer
-from . import SentenceEvaluator
-import torch
-from torch.utils.data import DataLoader
+import csv
 import logging
-from ..util import batch_to_device
 import os
-import csv
+from typing import TYPE_CHECKING, Dict
+
+import torch
+from torch.utils.data import DataLoader
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.util import batch_to_device
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -42,7 +45,7 @@ def __init__(self, dataloader: DataLoader, name: str = "", softmax_model=None, w
         self.primary_metric = "accuracy"
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         model.eval()
         total = 0
diff --git a/sentence_transformers/evaluation/MSEEvaluator.py b/sentence_transformers/evaluation/MSEEvaluator.py
index 46e7518f6..6fa300bb1 100644
--- a/sentence_transformers/evaluation/MSEEvaluator.py
+++ b/sentence_transformers/evaluation/MSEEvaluator.py
@@ -1,11 +1,13 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from sentence_transformers.evaluation import SentenceEvaluator
+import csv
 import logging
 import os
-import csv
-from typing import Dict, List, Optional
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -94,7 +96,7 @@ def __init__(
         self.write_csv = write_csv
         self.primary_metric = "negative_mse"
 
-    def __call__(self, model: SentenceTransformer, output_path: str = None, epoch=-1, steps=-1) -> Dict[str, float]:
+    def __call__(self, model: "SentenceTransformer", output_path: str = None, epoch=-1, steps=-1) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
                 out_txt = f" after epoch {epoch}"
diff --git a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py
index 0d6fd2778..6147877ca 100644
--- a/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py
+++ b/sentence_transformers/evaluation/MSEEvaluatorFromDataFrame.py
@@ -1,12 +1,15 @@
-from contextlib import nullcontext
-from sentence_transformers.evaluation import SentenceEvaluator
-from sentence_transformers import SentenceTransformer
-from typing import List, Optional, Tuple, Dict
-import numpy as np
+import csv
 import logging
 import os
-import csv
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional, Tuple
+
+import numpy as np
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -36,7 +39,7 @@ class MSEEvaluatorFromDataFrame(SentenceEvaluator):
     def __init__(
         self,
         dataframe: List[Dict[str, str]],
-        teacher_model: SentenceTransformer,
+        teacher_model: "SentenceTransformer",
         combinations: List[Tuple[str, str]],
         batch_size: int = 8,
         name: str = "",
@@ -81,7 +84,7 @@ def __init__(
         self.teacher_embeddings = {sent: emb for sent, emb in zip(all_source_sentences, all_src_embeddings)}
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         model.eval()
diff --git a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py
index 701edf29b..7728e1393 100644
--- a/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py
+++ b/sentence_transformers/evaluation/ParaphraseMiningEvaluator.py
@@ -1,14 +1,15 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from . import SentenceEvaluator
+import csv
 import logging
-from sentence_transformers.util import paraphrase_mining
 import os
-import csv
-
-from typing import List, Optional, Tuple, Dict
 from collections import defaultdict
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional, Tuple
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.util import paraphrase_mining
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -155,7 +156,7 @@ def __init__(
         self.primary_metric = "average_precision"
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
diff --git a/sentence_transformers/evaluation/RerankingEvaluator.py b/sentence_transformers/evaluation/RerankingEvaluator.py
index 902d6281d..8fa6f93bd 100644
--- a/sentence_transformers/evaluation/RerankingEvaluator.py
+++ b/sentence_transformers/evaluation/RerankingEvaluator.py
@@ -1,15 +1,19 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from . import SentenceEvaluator
+import csv
 import logging
-import numpy as np
 import os
-import csv
-from ..util import cos_sim
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Callable, Dict, Optional
+
+import numpy as np
 import torch
-from sklearn.metrics import average_precision_score, ndcg_score
 import tqdm
-from typing import Callable, Dict, Optional
+from sklearn.metrics import average_precision_score, ndcg_score
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.util import cos_sim
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -86,7 +90,7 @@ def __init__(
         self.primary_metric = "map"
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         """
         Evaluates the model on the dataset and returns the evaluation metrics.
diff --git a/sentence_transformers/evaluation/SentenceEvaluator.py b/sentence_transformers/evaluation/SentenceEvaluator.py
index a336fc362..a3a2497ed 100644
--- a/sentence_transformers/evaluation/SentenceEvaluator.py
+++ b/sentence_transformers/evaluation/SentenceEvaluator.py
@@ -1,7 +1,8 @@
 import re
-from typing import Any, Dict, Union
+from typing import TYPE_CHECKING, Any, Dict, Union
 
-from sentence_transformers import SentenceTransformer
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class SentenceEvaluator:
@@ -16,7 +17,7 @@ def __init__(self):
         # TODO: Add better `primary_metrics` support
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Union[float, Dict[str, float]]:
         """
         This is called during training to evaluate the model.
diff --git a/sentence_transformers/evaluation/SequentialEvaluator.py b/sentence_transformers/evaluation/SequentialEvaluator.py
index 3f9f43f36..eac6ff79c 100644
--- a/sentence_transformers/evaluation/SequentialEvaluator.py
+++ b/sentence_transformers/evaluation/SequentialEvaluator.py
@@ -1,6 +1,9 @@
-from sentence_transformers import SentenceTransformer
-from . import SentenceEvaluator
-from typing import Dict, Iterable
+from typing import TYPE_CHECKING, Dict, Iterable
+
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class SequentialEvaluator(SentenceEvaluator):
@@ -33,7 +36,7 @@ def __init__(self, evaluators: Iterable[SentenceEvaluator], main_score_function=
         self.main_score_function = main_score_function
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         evaluations = []
         scores = []
diff --git a/sentence_transformers/evaluation/TranslationEvaluator.py b/sentence_transformers/evaluation/TranslationEvaluator.py
index 8580b8d70..312a8c6f5 100644
--- a/sentence_transformers/evaluation/TranslationEvaluator.py
+++ b/sentence_transformers/evaluation/TranslationEvaluator.py
@@ -1,14 +1,17 @@
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from . import SentenceEvaluator
+import csv
 import logging
-from ..util import pytorch_cos_sim
 import os
-import csv
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional
+
 import numpy as np
-from typing import Dict, List, Optional
 import torch
 
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.util import pytorch_cos_sim
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -97,7 +100,7 @@ def __init__(
         self.primary_metric = "mean_accuracy"
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
diff --git a/sentence_transformers/evaluation/TripletEvaluator.py b/sentence_transformers/evaluation/TripletEvaluator.py
index fe34a32ac..7b26c4a27 100644
--- a/sentence_transformers/evaluation/TripletEvaluator.py
+++ b/sentence_transformers/evaluation/TripletEvaluator.py
@@ -1,15 +1,18 @@
-import numpy as np
-from sentence_transformers import SentenceTransformer
-from contextlib import nullcontext
-from . import SentenceEvaluator
-from sentence_transformers.similarity_functions import SimilarityFunction
+import csv
 import logging
 import os
-import csv
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Dict, List, Optional, Union
+
+import numpy as np
 from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
-from typing import Dict, List, Optional, Union
-from ..readers import InputExample
 
+from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
+from sentence_transformers.readers import InputExample
+from sentence_transformers.similarity_functions import SimilarityFunction
+
+if TYPE_CHECKING:
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
@@ -118,7 +121,7 @@ def from_input_examples(cls, examples: List[InputExample], **kwargs):
         return cls(anchors, positives, negatives, **kwargs)
 
     def __call__(
-        self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1
+        self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1
     ) -> Dict[str, float]:
         if epoch != -1:
             if steps == -1:
diff --git a/sentence_transformers/evaluation/__init__.py b/sentence_transformers/evaluation/__init__.py
index 5c0309027..7a2568992 100644
--- a/sentence_transformers/evaluation/__init__.py
+++ b/sentence_transformers/evaluation/__init__.py
@@ -1,5 +1,3 @@
-from .SentenceEvaluator import SentenceEvaluator
-from .SimilarityFunction import SimilarityFunction
 from .BinaryClassificationEvaluator import BinaryClassificationEvaluator
 from .EmbeddingSimilarityEvaluator import EmbeddingSimilarityEvaluator
 from .InformationRetrievalEvaluator import InformationRetrievalEvaluator
@@ -7,10 +5,12 @@
 from .MSEEvaluator import MSEEvaluator
 from .MSEEvaluatorFromDataFrame import MSEEvaluatorFromDataFrame
 from .ParaphraseMiningEvaluator import ParaphraseMiningEvaluator
+from .RerankingEvaluator import RerankingEvaluator
+from .SentenceEvaluator import SentenceEvaluator
 from .SequentialEvaluator import SequentialEvaluator
+from .SimilarityFunction import SimilarityFunction
 from .TranslationEvaluator import TranslationEvaluator
 from .TripletEvaluator import TripletEvaluator
-from .RerankingEvaluator import RerankingEvaluator
 
 __all__ = [
     "SentenceEvaluator",
diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py
index 8ec3b94ba..62e688ff6 100644
--- a/sentence_transformers/fit_mixin.py
+++ b/sentence_transformers/fit_mixin.py
@@ -1,38 +1,40 @@
 import json
 import logging
 import os
-from pathlib import Path
 import shutil
-from typing import Any, List, Dict, Tuple, Iterable, Type, Callable, Optional, TYPE_CHECKING
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, Callable, Dict, Iterable, List, Optional, Tuple, Type
+
 import numpy as np
-import transformers
 import torch
-from torch import nn, Tensor
+import transformers
+from torch import Tensor, nn
 from torch.optim import Optimizer
 from torch.utils.data import DataLoader
 from tqdm.autonotebook import trange
+from transformers import TrainerCallback, TrainerControl, TrainerState
+
 from datasets import Dataset, DatasetDict
-from transformers import TrainerCallback, TrainerState, TrainerControl
 
-from sentence_transformers.datasets.SentenceLabelDataset import SentenceLabelDataset
 from sentence_transformers.datasets.NoDuplicatesDataLoader import NoDuplicatesDataLoader
+from sentence_transformers.datasets.SentenceLabelDataset import SentenceLabelDataset
 from sentence_transformers.training_args import (
-    SentenceTransformerTrainingArguments,
-    MultiDatasetBatchSamplers,
     BatchSamplers,
+    MultiDatasetBatchSamplers,
+    SentenceTransformerTrainingArguments,
 )
 
 from .evaluation import SentenceEvaluator
+from .model_card_templates import ModelCardTemplate
 from .util import (
     batch_to_device,
     fullname,
 )
-from .model_card_templates import ModelCardTemplate
 
 logger = logging.getLogger(__name__)
 
 if TYPE_CHECKING:
-    from sentence_transformers.SentenceTransformer import SentenceTransformer
     from sentence_transformers.readers.InputExample import InputExample
+    from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class SaveModelCallback(TrainerCallback):
diff --git a/sentence_transformers/losses/AdaptiveLayerLoss.py b/sentence_transformers/losses/AdaptiveLayerLoss.py
index d50ac3227..d91e83834 100644
--- a/sentence_transformers/losses/AdaptiveLayerLoss.py
+++ b/sentence_transformers/losses/AdaptiveLayerLoss.py
@@ -1,12 +1,14 @@
 import random
-from typing import Any, Dict, Iterable, List, Tuple
 import warnings
+from typing import Any, Dict, Iterable, List, Tuple
+
+import torch
 from torch import Tensor, nn
 from torch.nn import functional as F
-import torch
+
 from sentence_transformers import SentenceTransformer
-from sentence_transformers.losses.CachedMultipleNegativesRankingLoss import CachedMultipleNegativesRankingLoss
 from sentence_transformers.losses.CachedGISTEmbedLoss import CachedGISTEmbedLoss
+from sentence_transformers.losses.CachedMultipleNegativesRankingLoss import CachedMultipleNegativesRankingLoss
 from sentence_transformers.models import Transformer
diff --git a/sentence_transformers/losses/AnglELoss.py b/sentence_transformers/losses/AnglELoss.py
index 661e8693d..f69900d87 100644
--- a/sentence_transformers/losses/AnglELoss.py
+++ b/sentence_transformers/losses/AnglELoss.py
@@ -1,4 +1,4 @@
-from sentence_transformers import losses, SentenceTransformer, util
+from sentence_transformers import SentenceTransformer, losses, util
 
 
 class AnglELoss(losses.CoSENTLoss):
diff --git a/sentence_transformers/losses/BatchAllTripletLoss.py b/sentence_transformers/losses/BatchAllTripletLoss.py
index 1843a615a..925c1a591 100644
--- a/sentence_transformers/losses/BatchAllTripletLoss.py
+++ b/sentence_transformers/losses/BatchAllTripletLoss.py
@@ -1,8 +1,11 @@
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+from typing import Dict, Iterable
+
+from torch import Tensor, nn
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 
+from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+
 
 class BatchAllTripletLoss(nn.Module):
     def __init__(
diff --git a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py
index 02914a165..b2212de7a 100644
--- a/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py
+++ b/sentence_transformers/losses/BatchHardSoftMarginTripletLoss.py
@@ -1,9 +1,12 @@
+from typing import Dict, Iterable
+
 import torch
 from torch import Tensor
-from typing import Iterable, Dict
-from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 
+from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+
 
 class BatchHardSoftMarginTripletLoss(BatchHardTripletLoss):
     def __init__(
diff --git a/sentence_transformers/losses/BatchHardTripletLoss.py b/sentence_transformers/losses/BatchHardTripletLoss.py
index 51df4a8b5..ca940e657 100644
--- a/sentence_transformers/losses/BatchHardTripletLoss.py
+++ b/sentence_transformers/losses/BatchHardTripletLoss.py
@@ -1,6 +1,8 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
+from torch import Tensor, nn
+
 from sentence_transformers import util
 from sentence_transformers.SentenceTransformer import SentenceTransformer
diff --git a/sentence_transformers/losses/BatchSemiHardTripletLoss.py b/sentence_transformers/losses/BatchSemiHardTripletLoss.py
index c997d1f58..20b8316c2 100644
--- a/sentence_transformers/losses/BatchSemiHardTripletLoss.py
+++ b/sentence_transformers/losses/BatchSemiHardTripletLoss.py
@@ -1,9 +1,12 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from .BatchHardTripletLoss import BatchHardTripletLossDistanceFunction
+from torch import Tensor, nn
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 
+from .BatchHardTripletLoss import BatchHardTripletLossDistanceFunction
+
 
 class BatchSemiHardTripletLoss(nn.Module):
     def __init__(
diff --git a/sentence_transformers/losses/CachedGISTEmbedLoss.py b/sentence_transformers/losses/CachedGISTEmbedLoss.py
index cf3392456..e3208298f 100644
--- a/sentence_transformers/losses/CachedGISTEmbedLoss.py
+++ b/sentence_transformers/losses/CachedGISTEmbedLoss.py
@@ -1,12 +1,15 @@
 from __future__ import annotations
+
 from contextlib import nullcontext
 from functools import partial
+from typing import Dict, Iterable, Iterator, List, Optional, Tuple
+
 import torch
-from torch import nn, Tensor
+import tqdm
+from torch import Tensor, nn
 from torch.utils.checkpoint import get_device_states, set_device_states
-from typing import Iterable, Dict, Iterator, List, Optional, Tuple
+
 from sentence_transformers import SentenceTransformer
-import tqdm
 from sentence_transformers.models import Transformer
diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
index d3e2c7204..5e1b4e1d0 100644
--- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
+++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
@@ -1,13 +1,15 @@
 from __future__ import annotations
+
 from contextlib import nullcontext
 from functools import partial
+from typing import Dict, Iterable, Iterator, List, Optional, Tuple
+
 import torch
-from torch import nn, Tensor
-from torch.utils.checkpoint import get_device_states, set_device_states
-from typing import Iterable, Dict, Iterator, List, Optional, Tuple
-from sentence_transformers import SentenceTransformer
-from sentence_transformers import util
 import tqdm
+from torch import Tensor, nn
+from torch.utils.checkpoint import get_device_states, set_device_states
+
+from sentence_transformers import SentenceTransformer, util
 
 
 class RandContext:
diff --git a/sentence_transformers/losses/CoSENTLoss.py b/sentence_transformers/losses/CoSENTLoss.py
index e0c5203b7..e59de1bdc 100644
--- a/sentence_transformers/losses/CoSENTLoss.py
+++ b/sentence_transformers/losses/CoSENTLoss.py
@@ -1,8 +1,10 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from ..SentenceTransformer import SentenceTransformer
-from .. import util
+from torch import Tensor, nn
+
+from sentence_transformers import util
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class CoSENTLoss(nn.Module):
diff --git a/sentence_transformers/losses/ContrastiveLoss.py b/sentence_transformers/losses/ContrastiveLoss.py
index a7c66792f..b11c80290 100644
--- a/sentence_transformers/losses/ContrastiveLoss.py
+++ b/sentence_transformers/losses/ContrastiveLoss.py
@@ -1,7 +1,9 @@
 from enum import Enum
-from typing import Iterable, Dict
+from typing import Dict, Iterable
+
 import torch.nn.functional as F
-from torch import nn, Tensor
+from torch import Tensor, nn
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
diff --git a/sentence_transformers/losses/ContrastiveTensionLoss.py b/sentence_transformers/losses/ContrastiveTensionLoss.py
index 08dc5d723..885a7db28 100644
--- a/sentence_transformers/losses/ContrastiveTensionLoss.py
+++ b/sentence_transformers/losses/ContrastiveTensionLoss.py
@@ -1,13 +1,14 @@
-import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from ..SentenceTransformer import SentenceTransformer
-from .. import util
 import copy
-import random
 import math
-from .. import InputExample
+import random
+from typing import Dict, Iterable
+
 import numpy as np
+import torch
+from torch import Tensor, nn
+
+from sentence_transformers import InputExample, util
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class ContrastiveTensionLoss(nn.Module):
diff --git a/sentence_transformers/losses/CosineSimilarityLoss.py b/sentence_transformers/losses/CosineSimilarityLoss.py
index 8d27300e7..bf9776474 100644
--- a/sentence_transformers/losses/CosineSimilarityLoss.py
+++ b/sentence_transformers/losses/CosineSimilarityLoss.py
@@ -1,9 +1,10 @@
+from typing import Any, Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Any, Iterable, Dict
+from torch import Tensor, nn
 
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 from sentence_transformers.util import fullname
-from ..SentenceTransformer import SentenceTransformer
 
 
 class CosineSimilarityLoss(nn.Module):
diff --git a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py
index cdc35cb85..56a31d702 100644
--- a/sentence_transformers/losses/DenoisingAutoEncoderLoss.py
+++ b/sentence_transformers/losses/DenoisingAutoEncoderLoss.py
@@ -1,8 +1,10 @@
-from torch import nn, Tensor
-from typing import Iterable, Dict, Optional
-from sentence_transformers import SentenceTransformer
-from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, PreTrainedModel
 import logging
+from typing import Dict, Iterable, Optional
+
+from torch import Tensor, nn
+from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, PreTrainedModel
+
+from sentence_transformers import SentenceTransformer
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/losses/GISTEmbedLoss.py b/sentence_transformers/losses/GISTEmbedLoss.py
index 6c719d511..9646402e4 100644
--- a/sentence_transformers/losses/GISTEmbedLoss.py
+++ b/sentence_transformers/losses/GISTEmbedLoss.py
@@ -1,8 +1,10 @@
-from typing import Any, Iterable, Dict
+from typing import Any, Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from sentence_transformers.SentenceTransformer import SentenceTransformer
+from torch import Tensor, nn
+
 from sentence_transformers.models import Transformer
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class GISTEmbedLoss(nn.Module):
diff --git a/sentence_transformers/losses/MSELoss.py b/sentence_transformers/losses/MSELoss.py
index c377a96e1..f7349a39d 100644
--- a/sentence_transformers/losses/MSELoss.py
+++ b/sentence_transformers/losses/MSELoss.py
@@ -1,6 +1,7 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
+from torch import Tensor, nn
 
 
 class MSELoss(nn.Module):
diff --git a/sentence_transformers/losses/MarginMSELoss.py b/sentence_transformers/losses/MarginMSELoss.py
index 44ab49710..2e13f59f5 100644
--- a/sentence_transformers/losses/MarginMSELoss.py
+++ b/sentence_transformers/losses/MarginMSELoss.py
@@ -1,6 +1,8 @@
-from .. import util
-from torch import nn, Tensor
-from typing import Iterable, Dict
+from typing import Dict, Iterable
+
+from torch import Tensor, nn
+
+from sentence_transformers import util
 
 
 class MarginMSELoss(nn.Module):
diff --git a/sentence_transformers/losses/Matryoshka2dLoss.py b/sentence_transformers/losses/Matryoshka2dLoss.py
index a3043da89..9c30543bd 100644
--- a/sentence_transformers/losses/Matryoshka2dLoss.py
+++ b/sentence_transformers/losses/Matryoshka2dLoss.py
@@ -1,7 +1,9 @@
 from typing import Any, Dict, List, Optional, Union
+
 from torch.nn import Module
-from sentence_transformers.SentenceTransformer import SentenceTransformer
+
 from sentence_transformers.losses import AdaptiveLayerLoss, MatryoshkaLoss
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class Matryoshka2dLoss(AdaptiveLayerLoss):
diff --git a/sentence_transformers/losses/MatryoshkaLoss.py b/sentence_transformers/losses/MatryoshkaLoss.py
index acdac95f4..850225239 100644
--- a/sentence_transformers/losses/MatryoshkaLoss.py
+++ b/sentence_transformers/losses/MatryoshkaLoss.py
@@ -1,8 +1,10 @@
 import random
-from typing import Any, Dict, Iterable, List, Optional, Union
 import warnings
-from torch import Tensor, nn
+from typing import Any, Dict, Iterable, List, Optional, Union
+
 import torch.nn.functional as F
+from torch import Tensor, nn
+
 from sentence_transformers import SentenceTransformer
 from sentence_transformers.losses.CachedGISTEmbedLoss import CachedGISTEmbedLoss
 from sentence_transformers.losses.CachedMultipleNegativesRankingLoss import CachedMultipleNegativesRankingLoss
diff --git a/sentence_transformers/losses/MegaBatchMarginLoss.py b/sentence_transformers/losses/MegaBatchMarginLoss.py
index dc63ba6d9..bc7ce6d05 100644
--- a/sentence_transformers/losses/MegaBatchMarginLoss.py
+++ b/sentence_transformers/losses/MegaBatchMarginLoss.py
@@ -1,8 +1,10 @@
-from .. import util
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
 import torch.nn.functional as F
+from torch import Tensor, nn
+
+from sentence_transformers import util
 
 
 class MegaBatchMarginLoss(nn.Module):
diff --git a/sentence_transformers/losses/MultipleNegativesRankingLoss.py b/sentence_transformers/losses/MultipleNegativesRankingLoss.py
index 78b03303c..0c3761523 100644
--- a/sentence_transformers/losses/MultipleNegativesRankingLoss.py
+++ b/sentence_transformers/losses/MultipleNegativesRankingLoss.py
@@ -1,8 +1,10 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from ..SentenceTransformer import SentenceTransformer
-from .. import util
+from torch import Tensor, nn
+
+from sentence_transformers import util
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class MultipleNegativesRankingLoss(nn.Module):
diff --git a/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py b/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py
index 979502dde..553cb0b97 100644
--- a/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py
+++ b/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py
@@ -1,8 +1,10 @@
+from typing import Dict, Iterable
+
 import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict
-from ..SentenceTransformer import SentenceTransformer
-from .. import util
+from torch import Tensor, nn
+
+from sentence_transformers import util
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class MultipleNegativesSymmetricRankingLoss(nn.Module):
diff --git a/sentence_transformers/losses/OnlineContrastiveLoss.py b/sentence_transformers/losses/OnlineContrastiveLoss.py
index d36e61ccf..fcfd1a69f 100644
--- a/sentence_transformers/losses/OnlineContrastiveLoss.py
+++ b/sentence_transformers/losses/OnlineContrastiveLoss.py
@@ -1,9 +1,12 @@
-from typing import Iterable, Dict
+from typing import Dict, Iterable
+
 import torch.nn.functional as F
-from torch import nn, Tensor
-from .ContrastiveLoss import SiameseDistanceMetric
+from torch import Tensor, nn
+
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 
+from .ContrastiveLoss import SiameseDistanceMetric
+
 
 class OnlineContrastiveLoss(nn.Module):
     def __init__(
diff --git a/sentence_transformers/losses/SoftmaxLoss.py b/sentence_transformers/losses/SoftmaxLoss.py
index eaf95d4e3..44e240499 100644
--- a/sentence_transformers/losses/SoftmaxLoss.py
+++ b/sentence_transformers/losses/SoftmaxLoss.py
@@ -1,9 +1,10 @@
-import torch
-from torch import nn, Tensor
-from typing import Iterable, Dict, Callable
-from ..SentenceTransformer import SentenceTransformer
 import logging
+from typing import Callable, Dict, Iterable
+
+import torch
+from torch import Tensor, nn
 
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/losses/TripletLoss.py b/sentence_transformers/losses/TripletLoss.py
index de44db228..53c6c88d6 100644
--- a/sentence_transformers/losses/TripletLoss.py
+++ b/sentence_transformers/losses/TripletLoss.py
@@ -1,8 +1,10 @@
-from torch import nn, Tensor
-from typing import Iterable, Dict
-import torch.nn.functional as F
 from enum import Enum
-from ..SentenceTransformer import SentenceTransformer
+from typing import Dict, Iterable
+
+import torch.nn.functional as F
+from torch import Tensor, nn
+
+from sentence_transformers.SentenceTransformer import SentenceTransformer
 
 
 class TripletDistanceMetric(Enum):
diff --git a/sentence_transformers/losses/__init__.py b/sentence_transformers/losses/__init__.py
index 00a64e2cb..fdac35735 100644
--- a/sentence_transformers/losses/__init__.py
+++ b/sentence_transformers/losses/__init__.py
@@ -1,33 +1,33 @@
+# CoSENTLoss must be imported before AnglELoss
+from .CoSENTLoss import CoSENTLoss  # isort: skip
+
 from .AdaptiveLayerLoss import AdaptiveLayerLoss
-from .CosineSimilarityLoss import CosineSimilarityLoss
-from .SoftmaxLoss import SoftmaxLoss
-from .MultipleNegativesRankingLoss import MultipleNegativesRankingLoss
-from .MultipleNegativesSymmetricRankingLoss import MultipleNegativesSymmetricRankingLoss
-from .TripletLoss import TripletDistanceMetric, TripletLoss
-from .MarginMSELoss import MarginMSELoss
-from .MatryoshkaLoss import MatryoshkaLoss
-from .Matryoshka2dLoss import Matryoshka2dLoss
-from .MSELoss import MSELoss
+from .AnglELoss import AnglELoss
+from .BatchAllTripletLoss import BatchAllTripletLoss
+from .BatchHardSoftMarginTripletLoss import BatchHardSoftMarginTripletLoss
+from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
+from .BatchSemiHardTripletLoss import BatchSemiHardTripletLoss
+from .CachedGISTEmbedLoss import CachedGISTEmbedLoss
 from .CachedMultipleNegativesRankingLoss import CachedMultipleNegativesRankingLoss
-from .ContrastiveLoss import SiameseDistanceMetric, ContrastiveLoss
+from .ContrastiveLoss import ContrastiveLoss, SiameseDistanceMetric
 from .ContrastiveTensionLoss import (
+    ContrastiveTensionDataLoader,
     ContrastiveTensionLoss,
     ContrastiveTensionLossInBatchNegatives,
-    ContrastiveTensionDataLoader,
 )
-from .CoSENTLoss import CoSENTLoss
-from .AnglELoss import AnglELoss
-from .OnlineContrastiveLoss import OnlineContrastiveLoss
-from .MegaBatchMarginLoss import MegaBatchMarginLoss
+from .CosineSimilarityLoss import CosineSimilarityLoss
 from .DenoisingAutoEncoderLoss import DenoisingAutoEncoderLoss
 from .GISTEmbedLoss import GISTEmbedLoss
-from .CachedGISTEmbedLoss import CachedGISTEmbedLoss
-
-# Triplet losses
-from .BatchHardTripletLoss import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction
-from .BatchHardSoftMarginTripletLoss import BatchHardSoftMarginTripletLoss
-from .BatchSemiHardTripletLoss import BatchSemiHardTripletLoss
-from .BatchAllTripletLoss import BatchAllTripletLoss
+from .MarginMSELoss import MarginMSELoss
+from .Matryoshka2dLoss import Matryoshka2dLoss
+from .MatryoshkaLoss import MatryoshkaLoss
+from .MegaBatchMarginLoss import MegaBatchMarginLoss
+from .MSELoss import MSELoss
+from .MultipleNegativesRankingLoss import MultipleNegativesRankingLoss
+from .MultipleNegativesSymmetricRankingLoss import MultipleNegativesSymmetricRankingLoss
+from .OnlineContrastiveLoss import OnlineContrastiveLoss
+from .SoftmaxLoss import SoftmaxLoss
+from .TripletLoss import TripletDistanceMetric, TripletLoss
 
 __all__ = [
     "AdaptiveLayerLoss",
diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py
index f41cefefa..7488d006d 100644
--- a/sentence_transformers/model_card.py
+++ b/sentence_transformers/model_card.py
@@ -1,39 +1,39 @@
-from copy import copy
 import json
+import logging
 import random
+import re
 from collections import Counter, defaultdict
+from copy import copy
 from dataclasses import dataclass, field, fields
 from pathlib import Path
 from platform import python_version
-import re
 from textwrap import indent
 from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Union
-import logging
 
 import torch
-from torch import nn
 import transformers
-from datasets import Dataset, DatasetDict
-from huggingface_hub import CardData, ModelCard, dataset_info as get_dataset_info, model_info as get_model_info
-from huggingface_hub.repocard_data import eval_results_to_model_index, EvalResult
+from huggingface_hub import CardData, ModelCard
+from huggingface_hub import dataset_info as get_dataset_info
+from huggingface_hub import model_info as get_model_info
+from huggingface_hub.repocard_data import EvalResult, eval_results_to_model_index
 from huggingface_hub.utils import yaml_dump
+from torch import nn
+from tqdm.autonotebook import tqdm
 from transformers import TrainerCallback
 from transformers.integrations import CodeCarbonCallback
 from transformers.modelcard import make_markdown_table
 from transformers.trainer_callback import TrainerControl, TrainerState
-from tqdm.autonotebook import tqdm
+
+from datasets import Dataset, DatasetDict
 from sentence_transformers import __version__ as sentence_transformers_version
-from sentence_transformers.evaluation import SequentialEvaluator
 from sentence_transformers.models import Transformer
-from sentence_transformers.util import cos_sim, fullname
 from sentence_transformers.training_args import SentenceTransformerTrainingArguments
-
+from sentence_transformers.util import cos_sim, fullname
 
 logger = logging.getLogger(__name__)
 
 if TYPE_CHECKING:
-    from sentence_transformers.evaluation import SentenceEvaluator
+    from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
     from sentence_transformers.SentenceTransformer import SentenceTransformer
     from sentence_transformers.trainer import SentenceTransformerTrainer
@@ -205,9 +205,10 @@ def on_log(
 
 def get_versions() -> Dict[str, Any]:
     from accelerate import __version__ as accelerate_version
-    from datasets import __version__ as datasets_version
     from tokenizers import __version__ as tokenizers_version
 
+    from datasets import __version__ as datasets_version
+
     return {
         "python": python_version(),
         "sentence_transformers": sentence_transformers_version,
@@ -435,6 +436,8 @@ def set_widget_examples(self, dataset: Union[Dataset, DatasetDict]) -> None:
             self.predict_example = [source_sentence, similar_sentence, median_sentence]
 
     def set_evaluation_metrics(self, evaluator: "SentenceEvaluator", metrics: Dict[str, Any]):
+        from sentence_transformers.evaluation import SequentialEvaluator
+
         self.eval_results_dict[evaluator] = copy(metrics)
 
         # If the evaluator has a primary metric and we have a trainer, then add the primary metric to the training logs
diff --git a/sentence_transformers/models/Asym.py b/sentence_transformers/models/Asym.py
index a84911d50..c4ca6ace8 100644
--- a/sentence_transformers/models/Asym.py
+++ b/sentence_transformers/models/Asym.py
@@ -1,10 +1,11 @@
-from torch import Tensor
-from torch import nn
-import os
 import json
-from ..util import import_from_string
+import os
 from collections import OrderedDict
-from typing import List, Dict, Union, Tuple
+from typing import Dict, List, Tuple, Union
+
+from torch import Tensor, nn
+
+from sentence_transformers.util import import_from_string
 
 
 class Asym(nn.Sequential):
diff --git a/sentence_transformers/models/BoW.py b/sentence_transformers/models/BoW.py
index c9f7aef06..1501ff121 100644
--- a/sentence_transformers/models/BoW.py
+++ b/sentence_transformers/models/BoW.py
@@ -1,12 +1,12 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import List, Dict
-import os
 import json
 import logging
-from .tokenizer import WhitespaceTokenizer
+import os
+from typing import Dict, List
 
+import torch
+from torch import Tensor, nn
+
+from .tokenizer import WhitespaceTokenizer
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/models/CLIPModel.py b/sentence_transformers/models/CLIPModel.py
index 9e5a06842..79a8f9f73 100644
--- a/sentence_transformers/models/CLIPModel.py
+++ b/sentence_transformers/models/CLIPModel.py
@@ -1,8 +1,9 @@
 from typing import Union
-from torch import nn
-import transformers
+
 import torch
+import transformers
 from PIL import Image
+from torch import nn
 
 
 class CLIPModel(nn.Module):
diff --git a/sentence_transformers/models/CNN.py b/sentence_transformers/models/CNN.py
index feaa23901..5cefa7f4e 100644
--- a/sentence_transformers/models/CNN.py
+++ b/sentence_transformers/models/CNN.py
@@ -1,8 +1,9 @@
+import json
+import os
+from typing import List
+
 import torch
 from torch import nn
-from typing import List
-import os
-import json
 
 
 class CNN(nn.Module):
diff --git a/sentence_transformers/models/Dense.py b/sentence_transformers/models/Dense.py
index bfd50e5e5..f0c5f1f4f 100644
--- a/sentence_transformers/models/Dense.py
+++ b/sentence_transformers/models/Dense.py
@@ -1,10 +1,11 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
-from ..util import fullname, import_from_string
+import os
+from typing import Dict
+
+import torch
+from torch import Tensor, nn
+
+from sentence_transformers.util import fullname, import_from_string
 
 
 class Dense(nn.Module):
diff --git a/sentence_transformers/models/Dropout.py b/sentence_transformers/models/Dropout.py
index ea353279d..f909e609b 100644
--- a/sentence_transformers/models/Dropout.py
+++ b/sentence_transformers/models/Dropout.py
@@ -1,8 +1,8 @@
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
+import os
+from typing import Dict
+
+from torch import Tensor, nn
 
 
 class Dropout(nn.Module):
diff --git a/sentence_transformers/models/LSTM.py b/sentence_transformers/models/LSTM.py
index bab555d17..239aff4f0 100644
--- a/sentence_transformers/models/LSTM.py
+++ b/sentence_transformers/models/LSTM.py
@@ -1,8 +1,9 @@
+import json
+import os
+from typing import List
+
 import torch
 from torch import nn
-from typing import List
-import os
-import json
 
 
 class LSTM(nn.Module):
diff --git a/sentence_transformers/models/LayerNorm.py b/sentence_transformers/models/LayerNorm.py
index f63369223..d02fd32e2 100644
--- a/sentence_transformers/models/LayerNorm.py
+++ b/sentence_transformers/models/LayerNorm.py
@@ -1,9 +1,9 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
+import os
+from typing import Dict
+
+import torch
+from torch import Tensor, nn
 
 
 class LayerNorm(nn.Module):
diff --git a/sentence_transformers/models/Normalize.py b/sentence_transformers/models/Normalize.py
index 337b92a72..06dc44186 100644
--- a/sentence_transformers/models/Normalize.py
+++ b/sentence_transformers/models/Normalize.py
@@ -1,7 +1,7 @@
-from torch import Tensor
-from torch import nn
 from typing import Dict
+
 import torch.nn.functional as F
+from torch import Tensor, nn
 
 
 class Normalize(nn.Module):
diff --git a/sentence_transformers/models/Pooling.py b/sentence_transformers/models/Pooling.py
index 9cddc7e4f..e0a3bf954 100644
--- a/sentence_transformers/models/Pooling.py
+++ b/sentence_transformers/models/Pooling.py
@@ -1,9 +1,9 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
+import os
+from typing import Dict
+
+import torch
+from torch import Tensor, nn
 
 
 class Pooling(nn.Module):
diff --git a/sentence_transformers/models/Transformer.py b/sentence_transformers/models/Transformer.py
index f9d94e2d1..d5a670869 100644
--- a/sentence_transformers/models/Transformer.py
+++ b/sentence_transformers/models/Transformer.py
@@ -1,8 +1,9 @@
-from torch import nn
-from transformers import AutoModel, AutoTokenizer, AutoConfig, T5Config, MT5Config
 import json
-from typing import Any, List, Dict, Optional, Union, Tuple
 import os
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+from torch import nn
+from transformers import AutoConfig, AutoModel, AutoTokenizer, MT5Config, T5Config
 
 
 class Transformer(nn.Module):
diff --git a/sentence_transformers/models/WeightedLayerPooling.py b/sentence_transformers/models/WeightedLayerPooling.py
index 33d5f4406..beb686d57 100644
--- a/sentence_transformers/models/WeightedLayerPooling.py
+++ b/sentence_transformers/models/WeightedLayerPooling.py
@@ -1,9 +1,9 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import Dict
-import os
 import json
+import os
+from typing import Dict
+
+import torch
+from torch import Tensor, nn
 
 
 class WeightedLayerPooling(nn.Module):
diff --git a/sentence_transformers/models/WordEmbeddings.py b/sentence_transformers/models/WordEmbeddings.py
index 44d7c5931..40086ade2 100644
--- a/sentence_transformers/models/WordEmbeddings.py
+++ b/sentence_transformers/models/WordEmbeddings.py
@@ -1,15 +1,17 @@
+import gzip
+import json
+import logging
+import os
+from typing import List
+
+import numpy as np
 import torch
 from torch import nn
-from typing import List
-import logging
-import gzip
 from tqdm import tqdm
-import numpy as np
-import os
-import json
-from ..util import import_from_string, fullname, http_get
-from .tokenizer import WordTokenizer, WhitespaceTokenizer
+from sentence_transformers.util import fullname, http_get, import_from_string
+
+from .tokenizer import WhitespaceTokenizer, WordTokenizer
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/models/WordWeights.py b/sentence_transformers/models/WordWeights.py
index 3e53738bd..d545fab38 100644
--- a/sentence_transformers/models/WordWeights.py
+++ b/sentence_transformers/models/WordWeights.py
@@ -1,11 +1,10 @@
-import torch
-from torch import Tensor
-from torch import nn
-from typing import List, Dict
-import os
 import json
 import logging
+import os
+from typing import Dict, List
 
+import torch
+from torch import Tensor, nn
 
 logger = logging.getLogger(__name__)
diff --git a/sentence_transformers/models/__init__.py b/sentence_transformers/models/__init__.py
index a0a518ba4..c238101ed 100644
--- a/sentence_transformers/models/__init__.py
+++ b/sentence_transformers/models/__init__.py
@@ -1,6 +1,6 @@
-from .Transformer import Transformer
 from .Asym import Asym
 from .BoW import BoW
+from .CLIPModel import CLIPModel
 from .CNN import CNN
 from .Dense import Dense
 from .Dropout import Dropout
@@ -8,10 +8,10 @@
 from .LSTM import LSTM
 from .Normalize import Normalize
 from .Pooling import Pooling
+from .Transformer import Transformer
 from .WeightedLayerPooling import WeightedLayerPooling
 from .WordEmbeddings import WordEmbeddings
 from .WordWeights import WordWeights
-from .CLIPModel import CLIPModel
 
 __all__ = [
     "Transformer",
diff --git a/sentence_transformers/models/tokenizer/PhraseTokenizer.py b/sentence_transformers/models/tokenizer/PhraseTokenizer.py
index 578a3c949..834154e0d 100644
--- a/sentence_transformers/models/tokenizer/PhraseTokenizer.py
+++ b/sentence_transformers/models/tokenizer/PhraseTokenizer.py
@@ -1,12 +1,13 @@
-from typing import List, Iterable
 import collections
-import string
-import os
 import json
 import logging
-from .WordTokenizer import WordTokenizer, ENGLISH_STOP_WORDS
-from transformers.utils.import_utils import is_nltk_available, NLTK_IMPORT_ERROR
+import os
+import string
+from typing import Iterable, List
+
+from transformers.utils.import_utils import NLTK_IMPORT_ERROR, is_nltk_available
+
+from .WordTokenizer import ENGLISH_STOP_WORDS, WordTokenizer
 
 logger =
logging.getLogger(__name__) diff --git a/sentence_transformers/models/tokenizer/WhitespaceTokenizer.py b/sentence_transformers/models/tokenizer/WhitespaceTokenizer.py index a5d9fc478..7a6a39473 100644 --- a/sentence_transformers/models/tokenizer/WhitespaceTokenizer.py +++ b/sentence_transformers/models/tokenizer/WhitespaceTokenizer.py @@ -1,9 +1,10 @@ -from typing import List, Iterable import collections -import string -import os import json -from .WordTokenizer import WordTokenizer, ENGLISH_STOP_WORDS +import os +import string +from typing import Iterable, List + +from .WordTokenizer import ENGLISH_STOP_WORDS, WordTokenizer class WhitespaceTokenizer(WordTokenizer): diff --git a/sentence_transformers/models/tokenizer/WordTokenizer.py b/sentence_transformers/models/tokenizer/WordTokenizer.py index cfe00f701..51bcfd09c 100644 --- a/sentence_transformers/models/tokenizer/WordTokenizer.py +++ b/sentence_transformers/models/tokenizer/WordTokenizer.py @@ -1,5 +1,5 @@ from abc import ABC, abstractmethod -from typing import List, Iterable +from typing import Iterable, List ENGLISH_STOP_WORDS = [ "!", diff --git a/sentence_transformers/models/tokenizer/__init__.py b/sentence_transformers/models/tokenizer/__init__.py index f11b207eb..b09bed73a 100644 --- a/sentence_transformers/models/tokenizer/__init__.py +++ b/sentence_transformers/models/tokenizer/__init__.py @@ -1,5 +1,5 @@ -from .WordTokenizer import WordTokenizer, ENGLISH_STOP_WORDS -from .WhitespaceTokenizer import WhitespaceTokenizer from .PhraseTokenizer import PhraseTokenizer +from .WhitespaceTokenizer import WhitespaceTokenizer +from .WordTokenizer import ENGLISH_STOP_WORDS, WordTokenizer __all__ = ["WordTokenizer", "WhitespaceTokenizer", "PhraseTokenizer", "ENGLISH_STOP_WORDS"] diff --git a/sentence_transformers/quantization.py b/sentence_transformers/quantization.py index d958b1a34..8750b974b 100644 --- a/sentence_transformers/quantization.py +++ b/sentence_transformers/quantization.py @@ -1,10 +1,9 @@ -import time -from torch import Tensor -from typing import List, Literal, Tuple, TYPE_CHECKING -import numpy as np import logging -from typing import Dict, Optional, Union +import time +from typing import TYPE_CHECKING, Dict, List, Literal, Optional, Tuple, Union +import numpy as np +from torch import Tensor logger = logging.getLogger(__name__) @@ -255,8 +254,8 @@ def semantic_search_usearch( The list of search results is in the format: [[{"corpus_id": int, "score": float}, ...], ...] The time taken for the search is a float value. """ - from usearch.index import Index from usearch.compiled import ScalarKind + from usearch.index import Index if corpus_embeddings is not None and corpus_index is not None: raise ValueError("Only corpus_embeddings or corpus_index should be used, not both.") diff --git a/sentence_transformers/readers/InputExample.py b/sentence_transformers/readers/InputExample.py index 1e0f6bbd2..7266159e3 100644 --- a/sentence_transformers/readers/InputExample.py +++ b/sentence_transformers/readers/InputExample.py @@ -1,4 +1,4 @@ -from typing import Union, List +from typing import List, Union class InputExample: diff --git a/sentence_transformers/readers/LabelSentenceReader.py b/sentence_transformers/readers/LabelSentenceReader.py index 70b28c7ef..82aefedb7 100644 --- a/sentence_transformers/readers/LabelSentenceReader.py +++ b/sentence_transformers/readers/LabelSentenceReader.py @@ -1,6 +1,7 @@ -from . import InputExample import os +from . 
import InputExample + class LabelSentenceReader: """Reads in a file that has at least two columns: a label and a sentence. diff --git a/sentence_transformers/readers/NLIDataReader.py b/sentence_transformers/readers/NLIDataReader.py index 2d78a5a8f..ce359d6f5 100644 --- a/sentence_transformers/readers/NLIDataReader.py +++ b/sentence_transformers/readers/NLIDataReader.py @@ -1,7 +1,8 @@ -from . import InputExample import gzip import os +from . import InputExample + class NLIDataReader(object): """Reads in the Stanford NLI dataset and the MultiGenre NLI dataset""" diff --git a/sentence_transformers/readers/PairedFilesReader.py b/sentence_transformers/readers/PairedFilesReader.py index 2a1c16495..157ac5cbe 100644 --- a/sentence_transformers/readers/PairedFilesReader.py +++ b/sentence_transformers/readers/PairedFilesReader.py @@ -1,6 +1,7 @@ -from . import InputExample import gzip +from . import InputExample + class PairedFilesReader(object): """Reads in a Pair Dataset, split into two files""" diff --git a/sentence_transformers/readers/STSDataReader.py b/sentence_transformers/readers/STSDataReader.py index e9a6e7600..6c9533989 100644 --- a/sentence_transformers/readers/STSDataReader.py +++ b/sentence_transformers/readers/STSDataReader.py @@ -1,8 +1,9 @@ -from . import InputExample import csv import gzip import os +from . import InputExample + class STSDataReader: """Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx) diff --git a/sentence_transformers/readers/TripletReader.py b/sentence_transformers/readers/TripletReader.py index 99e1ff0f2..be32ebd9b 100644 --- a/sentence_transformers/readers/TripletReader.py +++ b/sentence_transformers/readers/TripletReader.py @@ -1,7 +1,8 @@ -from . import InputExample import csv import os +from . 
import InputExample + class TripletReader(object): """Reads in the a Triplet Dataset: Each line contains (at least) 3 columns, one anchor column (s1), diff --git a/sentence_transformers/readers/__init__.py b/sentence_transformers/readers/__init__.py index f9b956cb8..fb2add55a 100644 --- a/sentence_transformers/readers/__init__.py +++ b/sentence_transformers/readers/__init__.py @@ -1,7 +1,7 @@ from .InputExample import InputExample from .LabelSentenceReader import LabelSentenceReader from .NLIDataReader import NLIDataReader -from .STSDataReader import STSDataReader, STSBenchmarkDataReader +from .STSDataReader import STSBenchmarkDataReader, STSDataReader from .TripletReader import TripletReader __all__ = [ diff --git a/sentence_transformers/sampler.py b/sentence_transformers/sampler.py index e717d21ec..d2cec4cee 100644 --- a/sentence_transformers/sampler.py +++ b/sentence_transformers/sampler.py @@ -1,11 +1,12 @@ +import logging from collections import defaultdict from itertools import accumulate, cycle from typing import List -import logging -from datasets import Dataset -from torch.utils.data import BatchSampler, SubsetRandomSampler, ConcatDataset import torch +from torch.utils.data import BatchSampler, ConcatDataset, SubsetRandomSampler + +from datasets import Dataset logger = logging.getLogger(__name__) diff --git a/sentence_transformers/similarity_functions.py b/sentence_transformers/similarity_functions.py index 589d0404a..753970683 100644 --- a/sentence_transformers/similarity_functions.py +++ b/sentence_transformers/similarity_functions.py @@ -3,15 +3,16 @@ from numpy import ndarray from torch import Tensor + from .util import ( cos_sim, - manhattan_sim, - euclidean_sim, dot_score, + euclidean_sim, + manhattan_sim, pairwise_cos_sim, - pairwise_manhattan_sim, - pairwise_euclidean_sim, pairwise_dot_score, + pairwise_euclidean_sim, + pairwise_manhattan_sim, ) diff --git a/sentence_transformers/trainer.py b/sentence_transformers/trainer.py index 0620d5f38..a30c3e6e5 100644 --- a/sentence_transformers/trainer.py +++ b/sentence_transformers/trainer.py @@ -1,30 +1,25 @@ -from contextlib import nullcontext import logging import os -from typing import Any, Callable, Dict, List, Optional, Tuple, Union, TYPE_CHECKING import warnings +from contextlib import nullcontext +from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union import torch from torch import nn -from torch.utils.data import DataLoader, ConcatDataset, Dataset, BatchSampler, SubsetRandomSampler -from transformers import PreTrainedTokenizerBase, Trainer, EvalPrediction, TrainerCallback +from torch.utils.data import BatchSampler, ConcatDataset, DataLoader, Dataset, SubsetRandomSampler +from transformers import EvalPrediction, PreTrainedTokenizerBase, Trainer, TrainerCallback +from transformers.data.data_collator import DataCollator from transformers.integrations import WandbCallback from transformers.trainer import TRAINING_ARGS_NAME +from transformers.trainer_utils import EvalLoopOutput from transformers.training_args import ParallelMode from datasets import DatasetDict -from transformers.trainer_utils import EvalLoopOutput -from transformers.data.data_collator import DataCollator -from sentence_transformers.losses import CoSENTLoss - -from sentence_transformers.models.Transformer import Transformer -from sentence_transformers.training_args import ( - SentenceTransformerTrainingArguments, - BatchSamplers, - MultiDatasetBatchSamplers, -) from sentence_transformers.data_collator import 
SentenceTransformerDataCollator -from sentence_transformers.evaluation import SentenceEvaluator +from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator +from sentence_transformers.losses.CoSENTLoss import CoSENTLoss +from sentence_transformers.model_card import ModelCardCallback +from sentence_transformers.models.Transformer import Transformer from sentence_transformers.sampler import ( DefaultBatchSampler, GroupByLabelBatchSampler, @@ -32,10 +27,13 @@ ProportionalBatchSampler, RoundRobinBatchSampler, ) +from sentence_transformers.training_args import ( + BatchSamplers, + MultiDatasetBatchSamplers, + SentenceTransformerTrainingArguments, +) from sentence_transformers.util import disable_logging -from sentence_transformers.model_card import ModelCardCallback - logger = logging.getLogger(__name__) if TYPE_CHECKING: diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py index 5f4386768..326c649a6 100644 --- a/sentence_transformers/training_args.py +++ b/sentence_transformers/training_args.py @@ -1,5 +1,6 @@ from dataclasses import dataclass, field from typing import Union + from transformers import TrainingArguments as TransformersTrainingArguments from transformers.utils import ExplicitEnum diff --git a/sentence_transformers/util.py b/sentence_transformers/util.py index 30dbeaa61..6b9cef844 100644 --- a/sentence_transformers/util.py +++ b/sentence_transformers/util.py @@ -1,21 +1,20 @@ -from contextlib import contextmanager import functools -import requests -from torch import Tensor, device -from typing import List, Callable, Literal, overload -from tqdm.autonotebook import tqdm -import sys +import heapq import importlib +import logging import os -import torch -import numpy as np import queue -import logging -from typing import Dict, Optional, Union +import sys +from contextlib import contextmanager +from typing import Callable, Dict, List, Literal, Optional, Union, overload +import numpy as np +import requests +import torch +from huggingface_hub import hf_hub_download, snapshot_download +from torch import Tensor, device +from tqdm.autonotebook import tqdm from transformers import is_torch_npu_available -from huggingface_hub import snapshot_download, hf_hub_download -import heapq logger = logging.getLogger(__name__) diff --git a/setup.py b/setup.py index 86b8717bd..922dcf7ae 100644 --- a/setup.py +++ b/setup.py @@ -1,4 +1,4 @@ -from setuptools import setup, find_packages +from setuptools import find_packages, setup with open("README.md", mode="r", encoding="utf-8") as readme_file: readme = readme_file.read() diff --git a/tests/conftest.py b/tests/conftest.py index 05609b7a9..acd2870da 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -1,11 +1,12 @@ import os import platform import tempfile + import pytest -from sentence_transformers import SentenceTransformer, CrossEncoder -from sentence_transformers.models import Transformer, Pooling -from datasets import load_dataset, DatasetDict +from datasets import DatasetDict, load_dataset +from sentence_transformers import CrossEncoder, SentenceTransformer +from sentence_transformers.models import Pooling, Transformer @pytest.fixture() diff --git a/tests/test_cmnrl.py b/tests/test_cmnrl.py index 9967c51af..3d47b4b02 100644 --- a/tests/test_cmnrl.py +++ b/tests/test_cmnrl.py @@ -1,11 +1,13 @@ from contextlib import nullcontext from typing import List + import pytest -from sentence_transformers import SentenceTransformer, InputExample, losses -import tqdm -from transformers import 
set_seed import torch +import tqdm from torch.optim import Adam +from transformers import set_seed + +from sentence_transformers import InputExample, SentenceTransformer, losses @pytest.mark.parametrize( diff --git a/tests/test_cross_encoder.py b/tests/test_cross_encoder.py index 2a7f16d5b..cdc442a16 100644 --- a/tests/test_cross_encoder.py +++ b/tests/test_cross_encoder.py @@ -5,8 +5,9 @@ import csv import gzip import os -from pathlib import Path import tempfile +from pathlib import Path +from typing import Generator, List, Tuple import pytest import torch @@ -15,7 +16,6 @@ from sentence_transformers import CrossEncoder, util from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator from sentence_transformers.readers import InputExample -from typing import Generator, List, Tuple @pytest.fixture() diff --git a/tests/test_image_embeddings.py b/tests/test_image_embeddings.py index 581ab3ffd..d684e258f 100644 --- a/tests/test_image_embeddings.py +++ b/tests/test_image_embeddings.py @@ -6,7 +6,7 @@ from PIL import Image -from sentence_transformers import util, SentenceTransformer +from sentence_transformers import SentenceTransformer, util def test_simple_encode(clip_vit_b_32_model: SentenceTransformer) -> None: diff --git a/tests/test_model_card_data.py b/tests/test_model_card_data.py index 3c0a0f06a..434fc081c 100644 --- a/tests/test_model_card_data.py +++ b/tests/test_model_card_data.py @@ -1,7 +1,7 @@ -from sentence_transformers import SentenceTransformer - import pytest +from sentence_transformers import SentenceTransformer + @pytest.mark.parametrize( ("revision", "expected_base_revision"), diff --git a/tests/test_multi_process.py b/tests/test_multi_process.py index 624ca4e89..a1deef2f9 100644 --- a/tests/test_multi_process.py +++ b/tests/test_multi_process.py @@ -2,9 +2,10 @@ Computes embeddings """ +from typing import Optional + import numpy as np import pytest -from typing import Optional from sentence_transformers import SentenceTransformer diff --git a/tests/test_sentence_transformer.py b/tests/test_sentence_transformer.py index ca3b15433..0a6479701 100644 --- a/tests/test_sentence_transformer.py +++ b/tests/test_sentence_transformer.py @@ -2,23 +2,22 @@ Tests general behaviour of the SentenceTransformer class """ -from functools import partial import json import logging import os -from pathlib import Path import re import tempfile +from functools import partial +from pathlib import Path from typing import Dict, List, Literal, Optional, Union, cast import numpy as np import pytest - -from huggingface_hub import HfApi, RepoUrl, GitRefs, GitRefInfo import torch -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import Normalize, Transformer, Pooling -from sentence_transformers import util +from huggingface_hub import GitRefInfo, GitRefs, HfApi, RepoUrl + +from sentence_transformers import SentenceTransformer, util +from sentence_transformers.models import Normalize, Pooling, Transformer from sentence_transformers.similarity_functions import SimilarityFunction diff --git a/tests/test_trainer.py b/tests/test_trainer.py index 8d8c123af..8d2de1b48 100644 --- a/tests/test_trainer.py +++ b/tests/test_trainer.py @@ -1,9 +1,11 @@ -from pathlib import Path import re import tempfile +from pathlib import Path + import pytest -from sentence_transformers import SentenceTransformerTrainer, SentenceTransformer, losses + from datasets import DatasetDict +from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, 
losses def test_trainer_multi_dataset_errors( From 5db04cbf0533c397cd7af84732385c0d3cdbf7bf Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Fri, 24 May 2024 15:00:40 +0200 Subject: [PATCH 23/39] [`v3`] Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy (#2673) * Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy * Reformat --- sentence_transformers/fit_mixin.py | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py index 62e688ff6..6af7dc149 100644 --- a/sentence_transformers/fit_mixin.py +++ b/sentence_transformers/fit_mixin.py @@ -8,6 +8,7 @@ import numpy as np import torch import transformers +from packaging import version from torch import Tensor, nn from torch.optim import Optimizer from torch.utils.data import DataLoader @@ -297,6 +298,12 @@ def _default_checkpoint_dir() -> str: ) steps_per_epoch = None + # Transformers renamed `evaluation_strategy` to `eval_strategy` in v4.41.0 + eval_strategy_key = ( + "eval_strategy" + if version.parse(transformers.__version__) >= version.parse("4.41.0") + else "evaluation_strategy" + ) args = SentenceTransformerTrainingArguments( output_dir=checkpoint_path or _default_checkpoint_dir(), batch_sampler=batch_sampler, @@ -305,7 +312,9 @@ def _default_checkpoint_dir() -> str: per_device_eval_batch_size=batch_size, num_train_epochs=epochs, max_steps=max_steps, - evaluation_strategy="steps" if evaluation_steps is not None and evaluation_steps > 0 else "no", + **{ + eval_strategy_key: "steps" if evaluation_steps is not None and evaluation_steps > 0 else "no", + }, eval_steps=evaluation_steps, # load_best_model_at_end=save_best_model, # <- TODO: Look into a good solution for save_best_model max_grad_norm=max_grad_norm, From 7177c4867ac37cfb1294b09e70a0c4e7852b37f1 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Fri, 24 May 2024 18:22:14 +0200 Subject: [PATCH 24/39] [`v3`] Add various useful Sphinx packages (copy code, link to code, nicer tabs) (#2674) * No longer hide toctrees in API Reference * Add linkcode support It's not perfect, as it'll always link to 'master', but it'll do pretty nicely for the most part. * Add copy button to all code blocks * Add nicer tabs * Reformatted --- docs/conf.py | 45 +++- .../package_reference/cross_encoder/index.rst | 1 - .../sentence_transformer/index.rst | 1 - docs/requirements.txt | 9 +- .../training/distributed.rst | 28 +- .../sentence_transformer/training_overview.md | 248 +++++++++--------- 6 files changed, 187 insertions(+), 145 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index 1d3ad9bb0..f7fe0882f 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -15,6 +15,8 @@ # sys.path.insert(0, os.path.abspath('.')) import datetime +import importlib +import inspect import os from recommonmark.transform import AutoStructify @@ -22,7 +24,7 @@ # -- Project information ----------------------------------------------------- -project = "Sentence-Transformers" +project = "Sentence Transformers" copyright = str(datetime.datetime.now().year) author = "Nils Reimers, Tom Aarsen" @@ -37,8 +39,10 @@ "sphinx.ext.autodoc", "recommonmark", "sphinx_markdown_tables", + "sphinx_copybutton", "sphinx.ext.intersphinx", - "sphinx_tabs.tabs", + "sphinx.ext.linkcode", + "sphinx_inline_tabs", ] # Add any paths that contain templates here, relative to this directory. 
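Aside: the `evaluation_strategy` workaround in PATCH 23 above picks the keyword *name* at runtime and passes it via dict unpacking. A minimal, self-contained sketch of that pattern, using a hypothetical `build_training_kwargs` helper in place of the real `SentenceTransformerTrainingArguments` call:

```python
import transformers
from packaging import version


def build_training_kwargs(evaluation_steps=None, **kwargs):
    # Hypothetical helper, standing in for the SentenceTransformerTrainingArguments
    # call in fit_mixin.py. transformers v4.41.0 renamed `evaluation_strategy` to
    # `eval_strategy`, so the keyword name must be chosen at runtime.
    eval_strategy_key = (
        "eval_strategy"
        if version.parse(transformers.__version__) >= version.parse("4.41.0")
        else "evaluation_strategy"
    )
    # Unpacking `**{key: value}` lets a keyword whose name is only known at
    # runtime be passed to an ordinary keyword-argument API, so the call site
    # needs no version branching.
    kwargs.update(
        {eval_strategy_key: "steps" if evaluation_steps is not None and evaluation_steps > 0 else "no"}
    )
    return kwargs


print(build_training_kwargs(evaluation_steps=500))
# On transformers >= 4.41.0: {'eval_strategy': 'steps'}
```

This avoids the deprecation warning on new transformers versions while remaining compatible with older ones that only accept `evaluation_strategy`.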
@@ -108,6 +112,43 @@ autoclass_content = "both" +# https://github.com/readthedocs/sphinx-autoapi/issues/202#issuecomment-907582382 +def linkcode_resolve(domain, info): + # Non-linkable objects from the starter kit in the tutorial. + if domain == "js" or info["module"] == "connect4": + return + + assert domain == "py", "expected only Python objects" + + mod = importlib.import_module(info["module"]) + if "." in info["fullname"]: + objname, attrname = info["fullname"].split(".") + obj = getattr(mod, objname) + try: + # object is a method of a class + obj = getattr(obj, attrname) + except AttributeError: + # object is an attribute of a class + return None + else: + obj = getattr(mod, info["fullname"]) + obj = inspect.unwrap(obj) + + try: + file = inspect.getsourcefile(obj) + lines = inspect.getsourcelines(obj) + except TypeError: + # e.g. object is a typing.Union + return None + file = os.path.relpath(file, os.path.abspath("..")) + if not file.startswith("sentence_transformers"): + # e.g. object is a typing.NewType + return None + start, end = lines[1], lines[1] + len(lines[0]) - 1 + + return f"https://github.com/UKPLab/sentence-transformers/blob/master/{file}#L{start}-L{end}" + + class GithubURLDomain(Domain): """ Resolve .py links to their respective Github URL diff --git a/docs/package_reference/cross_encoder/index.rst b/docs/package_reference/cross_encoder/index.rst index f27406944..81dc3bc41 100644 --- a/docs/package_reference/cross_encoder/index.rst +++ b/docs/package_reference/cross_encoder/index.rst @@ -3,7 +3,6 @@ Cross Encoder ============= .. toctree:: - :hidden: cross_encoder evaluation \ No newline at end of file diff --git a/docs/package_reference/sentence_transformer/index.rst b/docs/package_reference/sentence_transformer/index.rst index 063ed31e1..0e724e78b 100644 --- a/docs/package_reference/sentence_transformer/index.rst +++ b/docs/package_reference/sentence_transformer/index.rst @@ -3,7 +3,6 @@ Sentence Transformer ==================== .. toctree:: - :hidden: SentenceTransformer trainer diff --git a/docs/requirements.txt b/docs/requirements.txt index bbc151601..312d2e2eb 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,8 +1,9 @@ # Must use Python 3.8! -sphinx<4 -Jinja2<3.1 -sphinx_markdown_tables +sphinx==3.5.4 +Jinja2==3.0.3 +sphinx_markdown_tables==0.0.17 recommonmark==0.7.1 -sphinx-tabs==3.4.5 +sphinx-copybutton==0.5.2 +sphinx_inline_tabs==2023.4.21 -e .. \ No newline at end of file diff --git a/docs/sentence_transformer/training/distributed.rst b/docs/sentence_transformer/training/distributed.rst index fc6e78138..f2ade01cf 100644 --- a/docs/sentence_transformer/training/distributed.rst +++ b/docs/sentence_transformer/training/distributed.rst @@ -10,23 +10,29 @@ Sentence Transformers implements two forms of distributed training: Data Paralle In short, **DDP is generally recommended**. You can use DDP by running your normal training scripts with ``torchrun`` or ``accelerate``. For example, if you have a script called ``train_script.py``, you can run it with DDP using the following command: -.. tabs:: +.. |br| raw:: html - .. tab:: Via ``torchrun`` +
+     <br/>
    - - `torchrun documentation `_ +.. tab:: Via ``torchrun`` - :: + |br| - torchrun --nproc_per_node=4 train_script.py - - .. tab:: Via ``accelerate`` + - `torchrun documentation `_ - - `accelerate documentation `_ + :: - :: - - accelerate launch --num_processes 4 train_script.py + torchrun --nproc_per_node=4 train_script.py + +.. tab:: Via ``accelerate`` + + |br| + + - `accelerate documentation `_ + + :: + + accelerate launch --num_processes 4 train_script.py .. note:: diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md index def0a1759..4f55bfba7 100644 --- a/docs/sentence_transformer/training_overview.md +++ b/docs/sentence_transformer/training_overview.md @@ -45,102 +45,100 @@ Training Sentence Transformer models involves between 3 to 5 components: ```eval_rst The :class:`SentenceTransformerTrainer` trains and evaluates using :class:`datasets.Dataset` (one dataset) or :class:`datasets.DatasetDict` instances (multiple datasets, see also `Multi-dataset training <#multi-dataset-training>`_). -.. tabs:: +.. tab:: Data on 🤗 Hugging Face Hub - .. tab:: Data on 🤗 Hugging Face Hub + If you want to load data from the `Hugging Face Datasets `_, then you should use :func:`datasets.load_dataset`: - If you want to load data from the `Hugging Face Datasets `_, then you should use :func:`datasets.load_dataset`: + .. raw:: html - .. raw:: html + - + :: - :: + from datasets import load_dataset - from datasets import load_dataset + train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train") + eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev") - train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train") - eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev") + print(train_dataset) + """ + Dataset({ + features: ['premise', 'hypothesis', 'label'], + num_rows: 942069 + }) + """ - print(train_dataset) - """ - Dataset({ - features: ['premise', 'hypothesis', 'label'], - num_rows: 942069 - }) - """ + Some datasets (including `sentence-transformers/all-nli `_) require you to provide a "subset" alongside the dataset name. ``sentence-transformers/all-nli`` has 4 subsets, each with different data formats: `pair `_, `pair-class `_, `pair-score `_, `triplet `_. - Some datasets (including `sentence-transformers/all-nli `_) require you to provide a "subset" alongside the dataset name. ``sentence-transformers/all-nli`` has 4 subsets, each with different data formats: `pair `_, `pair-class `_, `pair-score `_, `triplet `_. + .. note:: - .. note:: + Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them by browsing to `https://huggingface.co/datasets?other=sentence-transformers `_. We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks. - Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them by browsing to `https://huggingface.co/datasets?other=sentence-transformers `_. We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks. +.. tab:: Local Data (CSV, JSON, Parquet, Arrow, SQL) - .. 
tab:: Local Data (CSV, JSON, Parquet, Arrow, SQL) + If you have local data in common file-formats, then you can load this data easily using :func:`datasets.load_dataset`: - If you have local data in common file-formats, then you can load this data easily using :func:`datasets.load_dataset`: + .. raw:: html - .. raw:: html + - + :: - :: + from datasets import load_dataset - from datasets import load_dataset - - dataset = load_dataset("csv", data_files="my_file.csv") - - or:: + dataset = load_dataset("csv", data_files="my_file.csv") + + or:: - from datasets import load_dataset + from datasets import load_dataset - dataset = load_dataset("json", data_files="my_file.json") + dataset = load_dataset("json", data_files="my_file.json") - .. tab:: Local Data that requires pre-processing +.. tab:: Local Data that requires pre-processing - .. sidebar:: Documentation + .. sidebar:: Documentation - - :meth:`datasets.Dataset.from_dict` + - :meth:`datasets.Dataset.from_dict` - If you have local data that requires some extra pre-processing, my recommendation is to initialize your dataset using :meth:`datasets.Dataset.from_dict` and a dictionary of lists, like so: + If you have local data that requires some extra pre-processing, my recommendation is to initialize your dataset using :meth:`datasets.Dataset.from_dict` and a dictionary of lists, like so: - .. raw:: html + .. raw:: html - + - :: + :: - from datasets import Dataset + from datasets import Dataset - sentence1_list = [] - sentence2_list = [] - # Open a file, do preprocessing, filtering, cleaning, etc. - # and append to the lists + sentence1_list = [] + sentence2_list = [] + # Open a file, do preprocessing, filtering, cleaning, etc. + # and append to the lists - dataset = Dataset.from_dict({ - "sentence1": sentence1_list, - "sentence2": sentence2_list, - }) + dataset = Dataset.from_dict({ + "sentence1": sentence1_list, + "sentence2": sentence2_list, + }) - Each key from the dictionary will become a column in the resulting dataset. + Each key from the dictionary will become a column in the resulting dataset. ``` @@ -298,72 +296,70 @@ Additionally, :class:`~sentence_transformers.evaluation.SequentialEvaluator` sho Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face. -.. tabs:: - - .. tab:: EmbeddingSimilarityEvaluator with STSb +.. tab:: EmbeddingSimilarityEvaluator with STSb - .. raw:: html + .. raw:: html - + - :: + :: - from datasets import load_dataset - from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction + from datasets import load_dataset + from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction - # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb) - eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") + # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb) + eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") - # Initialize the evaluator - dev_evaluator = EmbeddingSimilarityEvaluator( - sentences1=eval_dataset["sentence1"], - sentences2=eval_dataset["sentence2"], - scores=eval_dataset["score"], - main_similarity=SimilarityFunction.COSINE, - name="sts-dev", - ) - # You can run evaluation like so: - # dev_evaluator(model) - - .. 
tab:: TripletEvaluator with AllNLI - - .. raw:: html - - - - :: - - from datasets import load_dataset - from sentence_transformers.evaluation import TripletEvaluator, SimilarityFunction - - # Load triplets from the AllNLI dataset (https://huggingface.co/datasets/sentence-transformers/all-nli) - max_samples = 1000 - eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]") - - # Initialize the evaluator - dev_evaluator = TripletEvaluator( - anchors=eval_dataset["anchor"], - positives=eval_dataset["positive"], - negatives=eval_dataset["negative"], - main_distance_function=SimilarityFunction.COSINE, - name="all-nli-dev", - ) - # You can run evaluation like so: - # dev_evaluator(model) + # Initialize the evaluator + dev_evaluator = EmbeddingSimilarityEvaluator( + sentences1=eval_dataset["sentence1"], + sentences2=eval_dataset["sentence2"], + scores=eval_dataset["score"], + main_similarity=SimilarityFunction.COSINE, + name="sts-dev", + ) + # You can run evaluation like so: + # dev_evaluator(model) + +.. tab:: TripletEvaluator with AllNLI + + .. raw:: html + + + + :: + + from datasets import load_dataset + from sentence_transformers.evaluation import TripletEvaluator, SimilarityFunction + + # Load triplets from the AllNLI dataset (https://huggingface.co/datasets/sentence-transformers/all-nli) + max_samples = 1000 + eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split=f"dev[:{max_samples}]") + + # Initialize the evaluator + dev_evaluator = TripletEvaluator( + anchors=eval_dataset["anchor"], + positives=eval_dataset["positive"], + negatives=eval_dataset["negative"], + main_distance_function=SimilarityFunction.COSINE, + name="all-nli-dev", + ) + # You can run evaluation like so: + # dev_evaluator(model) ``` ## Trainer From fdd31eebd873f6aabf78d77ed0891f5f3aa567af Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Fri, 24 May 2024 18:29:26 +0200 Subject: [PATCH 25/39] [`v3`] Make the "primary_metric" for evaluators a bit more robust (#2675) * Make the "primary_metric" for evaluators a bit more robust * Also remove some other TODOs that are not very important or already done --- sentence_transformers/evaluation/SentenceEvaluator.py | 11 ++++++++++- sentence_transformers/fit_mixin.py | 4 +--- sentence_transformers/model_card.py | 4 +--- tests/test_sentence_transformer.py | 1 - 4 files changed, 12 insertions(+), 8 deletions(-) diff --git a/sentence_transformers/evaluation/SentenceEvaluator.py b/sentence_transformers/evaluation/SentenceEvaluator.py index a3a2497ed..9d0646556 100644 --- a/sentence_transformers/evaluation/SentenceEvaluator.py +++ b/sentence_transformers/evaluation/SentenceEvaluator.py @@ -13,8 +13,17 @@ class SentenceEvaluator: """ def __init__(self): + """ + Base class for all evaluators. Notably, this class introduces the ``greater_is_better`` and ``primary_metric`` + attributes. The former is a boolean indicating whether a higher evaluation score is better, which is used + for choosing the best checkpoint if ``load_best_model_at_end`` is set to ``True`` in the training arguments. + + The latter is a string indicating the primary metric for the evaluator. This has to be defined whenever + the evaluator returns a dictionary of metrics, and the primary metric is the key pointing to the primary + metric, i.e. the one that is used for model selection and/or logging. 
+ """ self.greater_is_better = True - # TODO: Add better `primary_metrics` support + self.primary_metric = None def __call__( self, model: "SentenceTransformer", output_path: str = None, epoch: int = -1, steps: int = -1 diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py index 6af7dc149..099cd2042 100644 --- a/sentence_transformers/fit_mixin.py +++ b/sentence_transformers/fit_mixin.py @@ -53,7 +53,6 @@ def __init__(self, output_dir: str, evaluator: Optional[SentenceEvaluator], save super().__init__() self.output_dir = output_dir self.evaluator = evaluator - # TODO: ^ has to implement `greater_is_better` and `primary_metric` self.save_best_model = save_best_model self.best_metric = None @@ -245,7 +244,7 @@ def identity(batch): batch_size = 8 batch_sampler = BatchSamplers.BATCH_SAMPLER # Convert dataloaders into a DatasetDict - # TODO: This should be done in a more efficient way + # TODO: This is rather inefficient, as we load all data into memory. We might benefit from a more efficient solution train_dataset_dict = {} for loader_idx, data_loader in enumerate(data_loaders, start=1): if isinstance(data_loader, NoDuplicatesDataLoader): @@ -284,7 +283,6 @@ def _default_checkpoint_dir() -> str: # Convert loss_fns into a dict with `dataset_{idx}` keys loss_fn_dict = {f"_dataset_{idx}": loss_fn for idx, loss_fn in enumerate(loss_fns, start=1)} - # TODO: Test model checkpointing & loading # Use steps_per_epoch to perhaps set max_steps max_steps = -1 diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py index 7488d006d..e3db3b975 100644 --- a/sentence_transformers/model_card.py +++ b/sentence_transformers/model_card.py @@ -340,7 +340,6 @@ def validate_datasets(self, dataset_list, infer_languages: bool = True) -> None: ) del dataset["id"] else: - # TODO: Perhaps we can try to infer the dataset name from the dataset card if info.cardData and infer_languages and "language" in info.cardData: dataset_language = info.cardData.get("language") if dataset_language is None: @@ -441,8 +440,7 @@ def set_evaluation_metrics(self, evaluator: "SentenceEvaluator", metrics: Dict[s self.eval_results_dict[evaluator] = copy(metrics) # If the evaluator has a primary metric and we have a trainer, then add the primary metric to the training logs - if hasattr(evaluator, "primary_metric"): - primary_metrics = evaluator.primary_metric + if hasattr(evaluator, "primary_metric") and (primary_metrics := evaluator.primary_metric): if isinstance(evaluator, SequentialEvaluator): primary_metrics = [sub_evaluator.primary_metric for sub_evaluator in evaluator.evaluators] elif isinstance(primary_metrics, str): diff --git a/tests/test_sentence_transformer.py b/tests/test_sentence_transformer.py index 0a6479701..233be3f8c 100644 --- a/tests/test_sentence_transformer.py +++ b/tests/test_sentence_transformer.py @@ -461,7 +461,6 @@ def test(model: SentenceTransformer, expected_dim: int): embeddings = outputs["sentence_embedding"] else: outputs = cast(List[Dict[str, torch.Tensor]], outputs) - # TODO: can overload model.encode if ppl want type checker compatibility embeddings = [out_features["sentence_embedding"] for out_features in outputs] else: embeddings = outputs From 6b86c3e3aa444faf20f55d412ef3b12785953ac7 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Sun, 26 May 2024 10:07:08 +0200 Subject: [PATCH 26/39] Set `broadcast_buffers = False` when training with DDP (#2663) --- sentence_transformers/training_args.py | 4 ++++ 1 file 
changed, 4 insertions(+) diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py index 326c649a6..803bb4d85 100644 --- a/sentence_transformers/training_args.py +++ b/sentence_transformers/training_args.py @@ -74,3 +74,7 @@ def __post_init__(self): # The `compute_loss` method in `SentenceTransformerTrainer` is overridden to only compute the prediction loss, # so we set `prediction_loss_only` to `True` here to avoid self.prediction_loss_only = True + + # Disable broadcasting of buffers to avoid `RuntimeError: one of the variables needed for gradient computation + # has been modified by an inplace operation.` when training with DDP & a BertModel-based model. + self.ddp_broadcast_buffers = False From 8e5b1fe3a4cb04845de4a0e83f11565655c294fb Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 11:44:18 +0200 Subject: [PATCH 27/39] [`v3`] Warn about using DP instead of DDP + set dataloader_drop_last with DDP (#2677) * Warn about using DP instead of DDP + set dataloader_drop_last with DDP * Prevent duplicate warnings * Remove note, done automatically now * Avoid inequality comparison to True --- .../training/distributed.rst | 4 ---- sentence_transformers/training_args.py | 23 +++++++++++++++++++ 2 files changed, 23 insertions(+), 4 deletions(-) diff --git a/docs/sentence_transformer/training/distributed.rst b/docs/sentence_transformer/training/distributed.rst index f2ade01cf..3753dd852 100644 --- a/docs/sentence_transformer/training/distributed.rst +++ b/docs/sentence_transformer/training/distributed.rst @@ -47,10 +47,6 @@ In short, **DDP is generally recommended**. You can use DDP by running your norm if __name__ == "__main__": main() -.. note:: - - When using DDP, using ``dataloader_drop_last=True`` in :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` is recommended, as the training may halt at the last (incomplete) training batch otherwise. - Comparison ---------- diff --git a/sentence_transformers/training_args.py b/sentence_transformers/training_args.py index 803bb4d85..4aefcd426 100644 --- a/sentence_transformers/training_args.py +++ b/sentence_transformers/training_args.py @@ -1,9 +1,13 @@ +import logging from dataclasses import dataclass, field from typing import Union from transformers import TrainingArguments as TransformersTrainingArguments +from transformers.training_args import ParallelMode from transformers.utils import ExplicitEnum +logger = logging.getLogger(__name__) + class BatchSamplers(ExplicitEnum): """ @@ -78,3 +82,22 @@ def __post_init__(self): # Disable broadcasting of buffers to avoid `RuntimeError: one of the variables needed for gradient computation # has been modified by an inplace operation.` when training with DDP & a BertModel-based model. self.ddp_broadcast_buffers = False + + if self.parallel_mode == ParallelMode.NOT_DISTRIBUTED: + # If output_dir is "unused", then this instance is created to compare training arguments vs the defaults, + # so we don't have to warn. + if self.output_dir != "unused": + logger.warning( + "Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. " + "See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information." 
+ ) + + elif self.parallel_mode == ParallelMode.DISTRIBUTED and not self.dataloader_drop_last: + # If output_dir is "unused", then this instance is created to compare training arguments vs the defaults, + # so we don't have to warn. + if self.output_dir != "unused": + logger.warning( + "When using DistributedDataParallel (DDP), it is recommended to set `dataloader_drop_last=True` to avoid hanging issues with an uneven last batch. " + "Setting `dataloader_drop_last=True`." + ) + self.dataloader_drop_last = True From cf62248e8d9f898474543a53000c9efd42641434 Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 12:03:58 +0200 Subject: [PATCH 28/39] [`v3`] Add warning that Evaluators only run on 1 GPU when multi-GPU training (#2678) * Add warning that Evaluators only run on 1 GPU when multi-GPU training * Also add a note in the distributed training docs --- docs/sentence_transformer/training/distributed.rst | 4 ++++ docs/sentence_transformer/training_overview.md | 6 +++++- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/docs/sentence_transformer/training/distributed.rst b/docs/sentence_transformer/training/distributed.rst index 3753dd852..6e2f6c5a3 100644 --- a/docs/sentence_transformer/training/distributed.rst +++ b/docs/sentence_transformer/training/distributed.rst @@ -47,6 +47,10 @@ In short, **DDP is generally recommended**. You can use DDP by running your norm if __name__ == "__main__": main() +.. note:: + + When using an `Evaluator <../training_overview.html#evaluator>`_, the evaluator only runs on the first device unlike the training and evaluation datasets, which are shared across all devices. + Comparison ---------- diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md index 4f55bfba7..72c0fcf0f 100644 --- a/docs/sentence_transformer/training_overview.md +++ b/docs/sentence_transformer/training_overview.md @@ -360,12 +360,16 @@ Sometimes you don't have the required evaluation data to prepare one of these ev ) # You can run evaluation like so: # dev_evaluator(model) + +.. warning:: + + When using `Distributed Training `_, the evaluator only runs on the first device, unlike the training and evaluation datasets, which are shared across all devices. ``` ## Trainer ```eval_rst -The :class:`sentence_transformers.SentenceTransformerTrainer` is where all previous components come together. We only have to specify the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, evaluator (optional) and we can start training. Let's have a look at a script where all of these components come together: +The :class:`~sentence_transformers.SentenceTransformerTrainer` is where all previous components come together. We only have to specify the trainer with the model, training arguments (optional), training dataset, evaluation dataset (optional), loss function, evaluator (optional) and we can start training. Let's have a look at a script where all of these components come together: .. 
sidebar:: Documentation From 1a2a883011e65379dc2b8d325a22be2f6eb5935c Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 12:04:23 +0200 Subject: [PATCH 29/39] [`v3`] Move training dependencies into a "train" extra (#2676) * Move training dependencies into a "train" extra * Install the train extra with the CI tests * Simplify dev install: also include train deps there * Implement is_..._available in ST instead; add is_training_available --- .github/workflows/tests.yml | 2 +- docs/installation.md | 110 ++++++++++++++++++++++++---- docs/package_reference/util.md | 2 +- sentence_transformers/fit_mixin.py | 9 +-- sentence_transformers/model_card.py | 37 ++++++---- sentence_transformers/sampler.py | 9 ++- sentence_transformers/trainer.py | 31 +++++--- sentence_transformers/util.py | 21 ++++++ setup.py | 8 +- tests/conftest.py | 7 +- tests/test_train_stsb.py | 17 +++++ tests/test_trainer.py | 21 +++++- 12 files changed, 216 insertions(+), 58 deletions(-) diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index 7c66f1507..391d160ac 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -45,7 +45,7 @@ jobs: if: steps.restore-cache.outputs.cache-hit != 'true' - name: Install the checked-out sentence-transformers - run: python -m pip install . + run: python -m pip install .[train] - name: Run unit tests shell: bash diff --git a/docs/installation.md b/docs/installation.md index 2f694b475..84ce4c0e4 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,35 +1,115 @@ # Installation -We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**. +We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**. There are three options to install Sentence Transformers: +* **Default:** This allows for loading, saving, and inference (i.e., getting embeddings) of models. +* **Default and Training**: All of the above plus training. +* **Development**: All of the above plus some dependencies for developing Sentence Transformers, see [Editable Install](#editable-install). ## Install with pip -Install the *sentence-transformers* with `pip`: -``` -pip install -U sentence-transformers -``` +```eval_rst -## Install with conda +.. tab:: Default + + :: + + pip install -U sentence-transformers + +.. tab:: Default and Training + + :: + + pip install -U "sentence-transformers[train]" + + To use `Weights and Biases `_ to track your training logs, you should also install ``wandb`` **(recommended)**:: + + pip install wandb + + And to track your Carbon Emissions while training and have this information automatically included in your model cards, also install ``codecarbon`` **(recommended)**:: + + pip install codecarbon + +.. tab:: Development + + :: + + pip install -U "sentence-transformers[dev]" -Apple silicon installation of *sentence-transformers* -``` -conda install -c conda-forge sentence-transformers ``` -## Install from source +## Install with Conda + +```eval_rst + +.. tab:: Default + + :: + + conda install -c conda-forge sentence-transformers + +.. 
tab:: Default and Training + + :: + + conda install -c conda-forge sentence-transformers accelerate datasets + + To use `Weights and Biases `_ to track your training logs, you should also install ``wandb`` **(recommended)**:: + + pip install wandb + + And to track your Carbon Emissions while training and have this information automatically included in your model cards, also install ``codecarbon`` **(recommended)**:: + + pip install codecarbon + +.. tab:: Development + + :: + + conda install -c conda-forge sentence-transformers accelerate datasets pre-commit pytest ruff -You can install *sentence-transformers* directly from source to take advantage of the bleeding edge `master` branch rather than the latest stable release: ``` -pip install git+https://github.com/UKPLab/sentence-transformers + +## Install from Source + +You can install ``sentence-transformers`` directly from source to take advantage of the bleeding edge `master` branch rather than the latest stable release: + +```eval_rst + +.. tab:: Default + + :: + + pip install git+https://github.com/UKPLab/sentence-transformers.git + +.. tab:: Default and Training + + :: + + pip install -U "sentence-transformers[train] @ git+https://github.com/UKPLab/sentence-transformers.git" + + To use `Weights and Biases `_ to track your training logs, you should also install ``wandb`` **(recommended)**:: + + pip install wandb + + And to track your carbon emissions while training and have this information automatically included in your model cards, also install ``codecarbon`` **(recommended)**:: + + pip install codecarbon + +.. tab:: Development + + :: + + pip install -U "sentence-transformers[dev] @ git+https://github.com/UKPLab/sentence-transformers.git" + ``` -## Editable install +## Editable Install -If you want to make changes to *sentence-transformers*, you will need an editable install. Clone the repository and install it with these commands: +If you want to make changes to ``sentence-transformers``, you will need an editable install. Clone the repository and install it with these commands: ``` git clone https://github.com/UKPLab/sentence-transformers cd sentence-transformers -pip install -e . +pip install -e ".[train,dev]" ``` These commands will link the new `sentence-transformers` folder and your Python library paths, such that this folder will be used when importing `sentence-transformers`. diff --git a/docs/package_reference/util.md b/docs/package_reference/util.md index 690b4cd19..1b5e9e326 100644 --- a/docs/package_reference/util.md +++ b/docs/package_reference/util.md @@ -4,7 +4,7 @@ ## Helper Functions ```eval_rst .. 
automodule:: sentence_transformers.util - :members: paraphrase_mining, semantic_search, community_detection, http_get, truncate_embeddings, normalize_embeddings + :members: paraphrase_mining, semantic_search, community_detection, http_get, truncate_embeddings, normalize_embeddings, is_training_available ``` ## Similarity Metrics diff --git a/sentence_transformers/fit_mixin.py b/sentence_transformers/fit_mixin.py index 099cd2042..b4ece8c57 100644 --- a/sentence_transformers/fit_mixin.py +++ b/sentence_transformers/fit_mixin.py @@ -15,7 +15,6 @@ from tqdm.autonotebook import trange from transformers import TrainerCallback, TrainerControl, TrainerState -from datasets import Dataset, DatasetDict from sentence_transformers.datasets.NoDuplicatesDataLoader import NoDuplicatesDataLoader from sentence_transformers.datasets.SentenceLabelDataset import SentenceLabelDataset from sentence_transformers.training_args import ( @@ -23,13 +22,13 @@ MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments, ) +from sentence_transformers.util import batch_to_device, fullname, is_datasets_available from .evaluation import SentenceEvaluator from .model_card_templates import ModelCardTemplate -from .util import ( - batch_to_device, - fullname, -) + +if is_datasets_available(): + from datasets import Dataset, DatasetDict logger = logging.getLogger(__name__) diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py index e3db3b975..68d2c0a44 100644 --- a/sentence_transformers/model_card.py +++ b/sentence_transformers/model_card.py @@ -24,11 +24,13 @@ from transformers.modelcard import make_markdown_table from transformers.trainer_callback import TrainerControl, TrainerState -from datasets import Dataset, DatasetDict from sentence_transformers import __version__ as sentence_transformers_version from sentence_transformers.models import Transformer from sentence_transformers.training_args import SentenceTransformerTrainingArguments -from sentence_transformers.util import cos_sim, fullname +from sentence_transformers.util import cos_sim, fullname, is_accelerate_available, is_datasets_available + +if is_datasets_available(): + from datasets import Dataset, DatasetDict logger = logging.getLogger(__name__) @@ -204,20 +206,25 @@ def on_log( def get_versions() -> Dict[str, Any]: - from accelerate import __version__ as accelerate_version - from tokenizers import __version__ as tokenizers_version - - from datasets import __version__ as datasets_version - - return { + versions = { "python": python_version(), "sentence_transformers": sentence_transformers_version, "transformers": transformers.__version__, "torch": torch.__version__, - "accelerate": accelerate_version, - "datasets": datasets_version, - "tokenizers": tokenizers_version, } + if is_accelerate_available(): + from accelerate import __version__ as accelerate_version + + versions["accelerate"] = accelerate_version + if is_datasets_available(): + from datasets import __version__ as datasets_version + + versions["datasets"] = datasets_version + from tokenizers import __version__ as tokenizers_version + + versions["tokenizers"] = tokenizers_version + + return versions @dataclass @@ -387,7 +394,7 @@ def join_list(losses: List[str]) -> str: def set_best_model_step(self, step: int) -> None: self.best_model_step = step - def set_widget_examples(self, dataset: Union[Dataset, DatasetDict]) -> None: + def set_widget_examples(self, dataset: Union["Dataset", "DatasetDict"]) -> None: if isinstance(dataset, Dataset): dataset = 
DatasetDict(dataset=dataset) @@ -465,7 +472,7 @@ def set_evaluation_metrics(self, evaluator: "SentenceEvaluator", metrics: Dict[s } ) - def set_label_examples(self, dataset: Dataset) -> None: + def set_label_examples(self, dataset: "Dataset") -> None: num_examples_per_label = 3 examples = defaultdict(list) finished_labels = set() @@ -487,7 +494,7 @@ def set_label_examples(self, dataset: Dataset) -> None: ] def infer_datasets( - self, dataset: Union[Dataset, DatasetDict], dataset_name: Optional[str] = None + self, dataset: Union["Dataset", "DatasetDict"], dataset_name: Optional[str] = None ) -> List[Dict[str, str]]: if isinstance(dataset, DatasetDict): return [ @@ -661,7 +668,7 @@ def to_html_list(data: dict): return dataset_info def extract_dataset_metadata( - self, dataset: Union[Dataset, DatasetDict], dataset_metadata, dataset_type: Literal["train", "eval"] + self, dataset: Union["Dataset", "DatasetDict"], dataset_metadata, dataset_type: Literal["train", "eval"] ) -> Dict[str, Any]: if dataset: if dataset_metadata and ( diff --git a/sentence_transformers/sampler.py b/sentence_transformers/sampler.py index d2cec4cee..7dad2dec2 100644 --- a/sentence_transformers/sampler.py +++ b/sentence_transformers/sampler.py @@ -6,7 +6,10 @@ import torch from torch.utils.data import BatchSampler, ConcatDataset, SubsetRandomSampler -from datasets import Dataset +from sentence_transformers.util import is_datasets_available + +if is_datasets_available(): + from datasets import Dataset logger = logging.getLogger(__name__) @@ -33,7 +36,7 @@ class DefaultBatchSampler(SetEpochMixin, BatchSampler): class GroupByLabelBatchSampler(SetEpochMixin, BatchSampler): def __init__( self, - dataset: Dataset, + dataset: "Dataset", batch_size: int, drop_last: bool, valid_label_columns: List[str] = None, @@ -89,7 +92,7 @@ def __iter__(self): class NoDuplicatesBatchSampler(SetEpochMixin, BatchSampler): def __init__( self, - dataset: Dataset, + dataset: "Dataset", batch_size: int, drop_last: bool, valid_label_columns: List[str] = [], diff --git a/sentence_transformers/trainer.py b/sentence_transformers/trainer.py index a30c3e6e5..6d6c35ecc 100644 --- a/sentence_transformers/trainer.py +++ b/sentence_transformers/trainer.py @@ -6,7 +6,7 @@ import torch from torch import nn -from torch.utils.data import BatchSampler, ConcatDataset, DataLoader, Dataset, SubsetRandomSampler +from torch.utils.data import BatchSampler, ConcatDataset, DataLoader, SubsetRandomSampler from transformers import EvalPrediction, PreTrainedTokenizerBase, Trainer, TrainerCallback from transformers.data.data_collator import DataCollator from transformers.integrations import WandbCallback @@ -14,7 +14,6 @@ from transformers.trainer_utils import EvalLoopOutput from transformers.training_args import ParallelMode -from datasets import DatasetDict from sentence_transformers.data_collator import SentenceTransformerDataCollator from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator from sentence_transformers.losses.CoSENTLoss import CoSENTLoss @@ -32,7 +31,10 @@ MultiDatasetBatchSamplers, SentenceTransformerTrainingArguments, ) -from sentence_transformers.util import disable_logging +from sentence_transformers.util import disable_logging, is_datasets_available, is_training_available + +if is_datasets_available(): + from datasets import Dataset, DatasetDict logger = logging.getLogger(__name__) @@ -111,8 +113,8 @@ def __init__( self, model: Optional["SentenceTransformer"] = None, args: SentenceTransformerTrainingArguments = None, - 
train_dataset: Optional[Union[Dataset, DatasetDict, Dict[str, Dataset]]] = None, - eval_dataset: Optional[Union[Dataset, DatasetDict, Dict[str, Dataset]]] = None, + train_dataset: Optional[Union["Dataset", "DatasetDict", Dict[str, "Dataset"]]] = None, + eval_dataset: Optional[Union["Dataset", "DatasetDict", Dict[str, "Dataset"]]] = None, loss: Optional[ Union[ nn.Module, @@ -130,6 +132,13 @@ def __init__( optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None, ) -> None: + if not is_training_available(): + raise RuntimeError( + "To train a SentenceTransformer model, you need to install the `accelerate` and `datasets` modules. " + "You can do so with the `train` extra:\n" + 'pip install -U "sentence-transformers[train]"' + ) + if args is None: output_dir = "tmp_trainer" logger.info(f"No `TrainingArguments` passed, using `output_dir={output_dir}`.") @@ -260,7 +269,7 @@ def prepare_loss( return loss.to(model.device) return loss(model).to(model.device) - def add_dataset_name_column(self, dataset_dict: DatasetDict) -> DatasetDict: + def add_dataset_name_column(self, dataset_dict: "DatasetDict") -> "DatasetDict": for key, dataset in dataset_dict.items(): if "dataset_name" not in dataset.column_names: dataset_dict[key] = dataset.add_column("dataset_name", [key] * len(dataset)) @@ -350,7 +359,7 @@ def collect_features( def evaluate( self, - eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, + eval_dataset: Optional[Union["Dataset", Dict[str, "Dataset"]]] = None, ignore_keys: Optional[List[str]] = None, metric_key_prefix: str = "eval", ) -> Dict[str, float]: @@ -427,7 +436,7 @@ def _load_best_model(self) -> None: self.model = full_model self.model[0].auto_model = loaded_auto_model - def validate_column_names(self, dataset: Dataset, dataset_name: Optional[str] = None) -> bool: + def validate_column_names(self, dataset: "Dataset", dataset_name: Optional[str] = None) -> bool: if overlap := set(dataset.column_names) & {"return_loss", "dataset_name"}: raise ValueError( f"The following column names are invalid in your {dataset_name + ' ' if dataset_name else ''}dataset: {list(overlap)}." @@ -436,7 +445,7 @@ def validate_column_names(self, dataset: Dataset, dataset_name: Optional[str] = def get_batch_sampler( self, - dataset: Dataset, + dataset: "Dataset", batch_size: int, drop_last: bool, valid_label_columns: Optional[List[str]] = None, @@ -559,7 +568,7 @@ def get_train_dataloader(self) -> DataLoader: self._train_dataloader = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params)) return self._train_dataloader - def get_eval_dataloader(self, eval_dataset: Union[Dataset, None] = None) -> DataLoader: + def get_eval_dataloader(self, eval_dataset: Union["Dataset", None] = None) -> DataLoader: """ Returns the evaluation [`~torch.utils.data.DataLoader`]. @@ -628,7 +637,7 @@ def get_eval_dataloader(self, eval_dataset: Union[Dataset, None] = None) -> Data self.accelerator.even_batches = True return self.accelerator.prepare(DataLoader(eval_dataset, **dataloader_params)) - def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader: + def get_test_dataloader(self, test_dataset: "Dataset") -> DataLoader: """ Returns the training [`~torch.utils.data.DataLoader`]. 
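The `is_training_available()` guard added to `SentenceTransformerTrainer.__init__` above makes the trainer fail fast when the `train` extra is missing. To illustrate how code built on top of this API can branch on the same helper, here is a minimal hedged sketch — the model name and the tiny dataset are illustrative placeholders, not part of this patch:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import is_training_available

model = SentenceTransformer("all-MiniLM-L6-v2")

if is_training_available():
    # Both `accelerate` and `datasets` are importable, so the trainer can be built.
    from datasets import Dataset

    from sentence_transformers import SentenceTransformerTrainer, losses

    train_dataset = Dataset.from_dict({
        "anchor": ["It's nice weather outside today.", "He drove to work."],
        "positive": ["It's so sunny.", "He took the car to the office."],
    })
    trainer = SentenceTransformerTrainer(
        model=model,
        train_dataset=train_dataset,
        loss=losses.MultipleNegativesRankingLoss(model),
    )
    trainer.train()
else:
    # Inference-only installs can still encode; only training needs the extras.
    embeddings = model.encode(["This sentence can still be embedded."])
```

Without the extras, constructing the trainer raises the `RuntimeError` shown in the hunk above rather than failing with a confusing `ImportError` partway through training.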
diff --git a/sentence_transformers/util.py b/sentence_transformers/util.py index 6b9cef844..80fba6978 100644 --- a/sentence_transformers/util.py +++ b/sentence_transformers/util.py @@ -936,3 +936,24 @@ def get_device_name() -> Literal["mps", "cuda", "npu", "hpu", "cpu"]: if hthpu.is_available(): return "hpu" return "cpu" + + +def is_accelerate_available() -> bool: + """ + Returns True if the accelerate library is available. + """ + return importlib.util.find_spec("accelerate") is not None + + +def is_datasets_available() -> bool: + """ + Returns True if the datasets library is available. + """ + return importlib.util.find_spec("datasets") is not None + + +def is_training_available() -> bool: + """ + Returns True if we have the required dependencies for training Sentence Transformer models + """ + return is_accelerate_available() and is_datasets_available() diff --git a/setup.py b/setup.py index 922dcf7ae..2ad8457ae 100644 --- a/setup.py +++ b/setup.py @@ -27,11 +27,15 @@ "scipy", "huggingface-hub>=0.15.1", "Pillow", - "datasets", - "accelerate>=0.20.3", ], extras_require={ + "train": [ + "datasets", + "accelerate>=0.20.3", + ], "dev": [ + "datasets", + "accelerate>=0.20.3", "pre-commit", "pytest", "ruff>=0.3.0", diff --git a/tests/conftest.py b/tests/conftest.py index acd2870da..5a83759ad 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -4,9 +4,12 @@ import pytest -from datasets import DatasetDict, load_dataset from sentence_transformers import CrossEncoder, SentenceTransformer from sentence_transformers.models import Pooling, Transformer +from sentence_transformers.util import is_datasets_available + +if is_datasets_available(): + from datasets import DatasetDict, load_dataset @pytest.fixture() @@ -43,7 +46,7 @@ def distilbert_base_uncased_model() -> SentenceTransformer: @pytest.fixture(scope="session") -def stsb_dataset_dict() -> DatasetDict: +def stsb_dataset_dict() -> "DatasetDict": return load_dataset("mteb/stsbenchmark-sts") diff --git a/tests/test_train_stsb.py b/tests/test_train_stsb.py index a71fe8f06..e2ac0171a 100644 --- a/tests/test_train_stsb.py +++ b/tests/test_train_stsb.py @@ -19,6 +19,7 @@ ) from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator from sentence_transformers.readers import InputExample +from sentence_transformers.util import is_training_available @pytest.fixture() @@ -71,6 +72,10 @@ def evaluate_stsb_test(model, expected_score, test_samples) -> None: @pytest.mark.slow +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_train_stsb_slow( distilbert_base_uncased_model: SentenceTransformer, sts_resource: Tuple[List[InputExample], List[InputExample]] ) -> None: @@ -92,6 +97,10 @@ def test_train_stsb_slow( @pytest.mark.skipif("CI" in os.environ, reason="This test is too slow for the CI (~8 minutes)") +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_train_stsb( distilbert_base_uncased_model: SentenceTransformer, sts_resource: Tuple[List[InputExample], List[InputExample]] ) -> None: @@ -113,6 +122,10 @@ def test_train_stsb( @pytest.mark.slow +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_train_nli_slow( distilbert_base_uncased_model: SentenceTransformer, nli_resource: List[InputExample], @@ -139,6 +152,10 @@ def test_train_nli_slow( @pytest.mark.skipif("CI" 
in os.environ, reason="This test is too slow for the CI (~25 minutes)") +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_train_nli( distilbert_base_uncased_model: SentenceTransformer, nli_resource: List[InputExample], diff --git a/tests/test_trainer.py b/tests/test_trainer.py index 8d2de1b48..5188837cb 100644 --- a/tests/test_trainer.py +++ b/tests/test_trainer.py @@ -4,12 +4,19 @@ import pytest -from datasets import DatasetDict from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses +from sentence_transformers.util import is_datasets_available, is_training_available +if is_datasets_available(): + from datasets import DatasetDict + +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_trainer_multi_dataset_errors( - stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: DatasetDict + stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: "DatasetDict" ) -> None: train_dataset = stsb_dataset_dict["train"] loss = { @@ -73,8 +80,12 @@ def test_trainer_multi_dataset_errors( ) +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_trainer_invalid_column_names( - stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: DatasetDict + stsb_bert_tiny_model: SentenceTransformer, stsb_dataset_dict: "DatasetDict" ) -> None: train_dataset = stsb_dataset_dict["train"] for column_name in ("return_loss", "dataset_name"): @@ -106,6 +117,10 @@ def test_trainer_invalid_column_names( trainer.train() +@pytest.mark.skipif( + not is_training_available(), + reason='Sentence Transformers was not installed with the `["train"]` extra.', +) def test_model_card_reuse(stsb_bert_tiny_model: SentenceTransformer): assert stsb_bert_tiny_model._model_card_text # Reuse the model card if no training was done From cd236c9f283064f7d3c535c29d9111a5df32c99a Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 13:25:45 +0200 Subject: [PATCH 30/39] Update references to the API ref (#2679) --- docs/sentence_transformer/loss_overview.md | 45 +++++++++---------- examples/training/adaptive_layer/README.md | 10 ++--- .../training/matryoshka/2d_matryoshka_sts.py | 2 +- examples/training/matryoshka/README.md | 10 ++--- .../training/matryoshka/matryoshka_sts.py | 2 +- .../multilingual/make_multilingual.py | 2 +- examples/training/nli/training_nli.py | 2 +- examples/training/nli/training_nli_v2.py | 2 +- examples/training/nli/training_nli_v3.py | 2 +- .../other/training_wikipedia_sections.py | 2 +- .../unsupervised_learning/SimCSE/README.md | 2 +- .../query_generation/README.md | 2 +- sentence_transformers/model_card_template.md | 4 +- 13 files changed, 43 insertions(+), 44 deletions(-) diff --git a/docs/sentence_transformer/loss_overview.md b/docs/sentence_transformer/loss_overview.md index f46b0418e..ab5e8ed72 100644 --- a/docs/sentence_transformer/loss_overview.md +++ b/docs/sentence_transformer/loss_overview.md @@ -8,44 +8,43 @@ Loss functions play a critical role in the performance of your fine-tuned model. You can often convert one training data format into another, allowing more loss functions to be viable for your scenario. 
For example, ``(sentence_A, sentence_B) pairs`` with ``class`` labels can be converted into ``(anchor, positive, negative) triplets`` by sampling sentences with the same or different classes.
```
-| Inputs | Labels | Appropriate Loss Functions |
-|-----------------------------------------------|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `single sentences` | `class` | `BatchAllTripletLoss`<br>`BatchHardSoftMarginTripletLoss`<br>`BatchHardTripletLoss`<br>`BatchSemiHardTripletLoss` |
-| `single sentences` | `none` | `ContrastiveTensionLoss`<br>`DenoisingAutoEncoderLoss` |
-| `(anchor, anchor) pairs` | `none` | `ContrastiveTensionLossInBatchNegatives` |
-| `(damaged_sentence, original_sentence) pairs` | `none` | `DenoisingAutoEncoderLoss` |
-| `(sentence_A, sentence_B) pairs` | `class` | `SoftmaxLoss` |
-| `(anchor, positive) pairs` | `none` | `CachedMultipleNegativesRankingLoss`<br>`MultipleNegativesRankingLoss`<br>`MultipleNegativesSymmetricRankingLoss`<br>`MegaBatchMarginLoss`<br>`CachedGISTEmbedLoss`<br>`GISTEmbedLoss` |
-| `(anchor, positive/negative) pairs` | `1 if positive, 0 if negative` | `ContrastiveLoss`<br>`OnlineContrastiveLoss` |
-| `(sentence_A, sentence_B) pairs` | `float similarity score` | `CoSENTLoss`<br>`AnglELoss`<br>`CosineSimilarityLoss` |
-| `(anchor, positive, negative) triplets` | `none` | `CachedMultipleNegativesRankingLoss`<br>`MultipleNegativesRankingLoss`<br>`TripletLoss`<br>`CachedGISTEmbedLoss`<br>`GISTEmbedLoss` |
+| Inputs | Labels | Appropriate Loss Functions |
+|-----------------------------------------------|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `single sentences` | `class` | `BatchAllTripletLoss`<br>`BatchHardSoftMarginTripletLoss`<br>`BatchHardTripletLoss`<br>`BatchSemiHardTripletLoss` |
+| `single sentences` | `none` | `ContrastiveTensionLoss`<br>`DenoisingAutoEncoderLoss` |
+| `(anchor, anchor) pairs` | `none` | `ContrastiveTensionLossInBatchNegatives` |
+| `(damaged_sentence, original_sentence) pairs` | `none` | `DenoisingAutoEncoderLoss` |
+| `(sentence_A, sentence_B) pairs` | `class` | `SoftmaxLoss` |
+| `(anchor, positive) pairs` | `none` | `CachedMultipleNegativesRankingLoss`<br>`MultipleNegativesRankingLoss`<br>`MultipleNegativesSymmetricRankingLoss`<br>`MegaBatchMarginLoss`<br>`CachedGISTEmbedLoss`<br>`GISTEmbedLoss` |
+| `(anchor, positive/negative) pairs` | `1 if positive, 0 if negative` | `ContrastiveLoss`<br>`OnlineContrastiveLoss` |
+| `(sentence_A, sentence_B) pairs` | `float similarity score` | `CoSENTLoss`<br>`AnglELoss`<br>`CosineSimilarityLoss` |
+| `(anchor, positive, negative) triplets` | `none` | `CachedMultipleNegativesRankingLoss`<br>`MultipleNegativesRankingLoss`<br>`TripletLoss`<br>`CachedGISTEmbedLoss`<br>`GISTEmbedLoss` |

## Loss modifiers
These loss functions can be seen as *loss modifiers*: they work on top of standard loss functions, but apply those loss functions in different ways to try and instil useful properties into the trained embedding model.
-For example, models trained with `MatryoshkaLoss` produce embeddings whose size can be truncated without notable losses in performance, and models trained with `AdaptiveLayerLoss` still perform well when you remove model layers for faster inference.
-
-| Texts | Labels | Appropriate Loss Functions |
-|-------|--------|---------------------------------------------------------------|
-| `any` | `any` | `MatryoshkaLoss`<br>`AdaptiveLayerLoss`<br>`Matryoshka2dLoss` |
+For example, models trained with `MatryoshkaLoss` produce embeddings whose size can be truncated without notable losses in performance, and models trained with `AdaptiveLayerLoss` still perform well when you remove model layers for faster inference.
+| Texts | Labels | Appropriate Loss Functions |
+|-------|--------|-------------------------------------------------------------------------------|
+| `any` | `any` | `MatryoshkaLoss`<br>`AdaptiveLayerLoss`<br>`Matryoshka2dLoss` |

## Distillation
These loss functions are specifically designed to be used when distilling the knowledge from one model into another.
For example, when finetuning a small model to behave more like a larger & stronger one, or when finetuning a model to become multi-lingual.
-| Texts | Labels | Appropriate Loss Functions |
-|----------------------------------------------|---------------------------------------------------------------|------------------------------------------------------------------------------|
-| `sentence` | `model sentence embeddings` | `MSELoss` |
-| `sentence_1, sentence_2, ..., sentence_N` | `model sentence embeddings` | `MSELoss` |
-| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | `MarginMSELoss` |
+| Texts | Labels | Appropriate Loss Functions |
+|----------------------------------------------|---------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| `sentence` | `model sentence embeddings` | `MSELoss` |
+| `sentence_1, sentence_2, ..., sentence_N` | `model sentence embeddings` | `MSELoss` |
+| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | `MarginMSELoss` |

## Commonly used Loss Functions
In practice, not all loss functions get used equally often. The most common scenarios are:
-* `(anchor, positive) pairs` without any labels: MultipleNegativesRankingLoss is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. CachedMultipleNegativesRankingLoss is often used to increase the batch size, resulting in superior performance.
-* `(sentence_A, sentence_B) pairs` with a `float similarity score`: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance.
+* `(anchor, positive) pairs` without any labels: MultipleNegativesRankingLoss is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. CachedMultipleNegativesRankingLoss is often used to increase the batch size, resulting in superior performance.
+* `(sentence_A, sentence_B) pairs` with a `float similarity score`: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance.

## Custom Loss Functions
diff --git a/examples/training/adaptive_layer/README.md b/examples/training/adaptive_layer/README.md
index 8ab7dcf8b..71afe7c43 100644
--- a/examples/training/adaptive_layer/README.md
+++ b/examples/training/adaptive_layer/README.md
@@ -36,7 +36,7 @@ model = SentenceTransformer("microsoft/mpnet-base")
base_loss = CoSENTLoss(model=model)
loss = AdaptiveLayerLoss(model=model, loss=base_loss)
```
-* **Reference**: AdaptiveLayerLoss
+* **Reference**: AdaptiveLayerLoss

Note that training with `AdaptiveLayerLoss` is not notably slower than without using it.
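To make the note above concrete: the wrapper is purely compositional, so enabling adaptive layers in an existing script is a one-line change. A brief sketch, assuming the same model and base loss as the README snippet above (the rest is illustrative, not from the patch):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
base_loss = losses.CoSENTLoss(model=model)

# AdaptiveLayerLoss additionally applies `base_loss` to embeddings produced by
# a reduced number of transformer layers, so the model remains usable after
# layer truncation at inference time.
loss = losses.AdaptiveLayerLoss(model=model, loss=base_loss)
```

The same wrapping pattern applies to `MatryoshkaLoss` and to the combined `Matryoshka2dLoss` shown elsewhere in these docs.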
@@ -52,7 +52,7 @@ base_loss = CoSENTLoss(model=model) loss = Matryoshka2dLoss(model=model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64]) ``` -* **Reference**: Matryoshka2dLoss +* **Reference**: Matryoshka2dLoss ## Inference @@ -116,7 +116,7 @@ new_num_layers = 3 model[0].auto_model.encoder.layer = model[0].auto_model.encoder.layer[:new_num_layers] ``` -Then we can run inference with it using SentenceTransformers.encode. +Then we can run inference with it using SentenceTransformers.encode. ```python from sentence_transformers import SentenceTransformer @@ -142,11 +142,11 @@ As you can see, the similarity between the related sentences is much higher than ## Code Examples -See the following scripts as examples of how to apply the AdaptiveLayerLoss in practice: +See the following scripts as examples of how to apply the AdaptiveLayerLoss in practice: * **[adaptive_layer_nli.py](adaptive_layer_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `AdaptiveLayerLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation. * **[adaptive_layer_sts.py](adaptive_layer_sts.py)**: This example uses the CoSENTLoss with AdaptiveLayerLoss to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation. -And the following scripts to see how to apply Matryoshka2dLoss: +And the following scripts to see how to apply Matryoshka2dLoss: * **[2d_matryoshka_nli.py](../matryoshka/2d_matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `Matryoshka2dLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation. * **[2d_matryoshka_sts.py](../matryoshka/2d_matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `Matryoshka2dLoss` to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation. diff --git a/examples/training/matryoshka/2d_matryoshka_sts.py b/examples/training/matryoshka/2d_matryoshka_sts.py index a170f1581..55f31871c 100644 --- a/examples/training/matryoshka/2d_matryoshka_sts.py +++ b/examples/training/matryoshka/2d_matryoshka_sts.py @@ -48,7 +48,7 @@ logging.info(train_dataset) # 3. Define our training loss -# CoSENTLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one +# CoSENTLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) needs two text columns and one # similarity score column (between 0 and 1) inner_train_loss = losses.CoSENTLoss(model=model) train_loss = losses.Matryoshka2dLoss(model, inner_train_loss, [768, 512, 256, 128, 64]) diff --git a/examples/training/matryoshka/README.md b/examples/training/matryoshka/README.md index 6781bf53c..25cf1eaea 100644 --- a/examples/training/matryoshka/README.md +++ b/examples/training/matryoshka/README.md @@ -36,7 +36,7 @@ model = SentenceTransformer("microsoft/mpnet-base") base_loss = CoSENTLoss(model=model) loss = MatryoshkaLoss(model=model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64]) ``` -* **Reference**: MatryoshkaLoss +* **Reference**: MatryoshkaLoss Additionally, this can be combined with the `AdaptiveLayerLoss` such that the resulting model can be reduced both in the size of the output dimensions, but also in the number of layers for faster inference. 
See also the [Adaptive Layers](../adaptive_layer/README.html) for more information on reducing the number of model layers. In Sentence Transformers, the combination of these two losses is called `Matryoshka2dLoss`, and a shorthand is provided for simpler training. @@ -50,11 +50,11 @@ base_loss = CoSENTLoss(model=model) loss = Matryoshka2dLoss(model=model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64]) ``` -* **Reference**: Matryoshka2dLoss +* **Reference**: Matryoshka2dLoss ## Inference -After a model has been trained using a Matryoshka loss, you can then run inference with it using SentenceTransformers.encode. +After a model has been trained using a Matryoshka loss, you can then run inference with it using SentenceTransformers.encode. ```python from sentence_transformers import SentenceTransformer @@ -85,13 +85,13 @@ As you can see, the similarity between the search query and the correct document ## Code Examples -See the following scripts as examples of how to apply the MatryoshkaLoss in practice: +See the following scripts as examples of how to apply the MatryoshkaLoss in practice: * **[matryoshka_nli.py](matryoshka_nli.py)**: This example uses the MultipleNegativesRankingLoss with MatryoshkaLoss to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation. * **[matryoshka_nli_reduced_dim.py](matryoshka_nli_reduced_dim.py)**: This example uses the MultipleNegativesRankingLoss with MatryoshkaLoss to train a strong embedding model with a small maximum output dimension of 256. It trains using Natural Language Inference (NLI) data, and is an adaptation of the [NLI](../nli/README) documentation. * **[matryoshka_eval_stsb.py](matryoshka_eval_stsb.py)**: This example evaluates the embedding model trained with MatryoshkaLoss in [matryoshka_nli.py](matryoshka_nli.py) on the test set of the STSBenchmark dataset, and compares it to a non-Matryoshka trained model. * **[matryoshka_sts.py](matryoshka_sts.py)**: This example uses the CoSENTLoss with MatryoshkaLoss to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation. -And the following scripts to see how to apply Matryoshka2dLoss: +And the following scripts to see how to apply Matryoshka2dLoss: * **[2d_matryoshka_nli.py](2d_matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `Matryoshka2dLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation. * **[2d_matryoshka_sts.py](2d_matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `Matryoshka2dLoss` to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation. diff --git a/examples/training/matryoshka/matryoshka_sts.py b/examples/training/matryoshka/matryoshka_sts.py index 4722f3cf3..f0813f1ff 100644 --- a/examples/training/matryoshka/matryoshka_sts.py +++ b/examples/training/matryoshka/matryoshka_sts.py @@ -49,7 +49,7 @@ logging.info(train_dataset) # 3. 
Define our training loss -# CoSENTLoss (https://sbert.net/docs/package_reference/losses.html#cosentloss) needs two text columns and one +# CoSENTLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) needs two text columns and one # similarity score column (between 0 and 1) inner_train_loss = losses.CoSENTLoss(model=model) train_loss = losses.MatryoshkaLoss(model, loss=inner_train_loss, matryoshka_dims=matryoshka_dims) diff --git a/examples/training/multilingual/make_multilingual.py b/examples/training/multilingual/make_multilingual.py index bb62d37bf..6d0555125 100644 --- a/examples/training/multilingual/make_multilingual.py +++ b/examples/training/multilingual/make_multilingual.py @@ -140,7 +140,7 @@ def prepare_dataset(batch): logging.info("Prepared datasets for training:", train_dataset_dict) # 3. Define our training loss -# MSELoss (https://sbert.net/docs/package_reference/losses.html#mseloss) needs one text columns and one +# MSELoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss) needs one text columns and one # column with embeddings from the teacher model train_loss = MSELoss(model=student_model) diff --git a/examples/training/nli/training_nli.py b/examples/training/nli/training_nli.py index 0a2dc6ba8..1f41e2e25 100644 --- a/examples/training/nli/training_nli.py +++ b/examples/training/nli/training_nli.py @@ -43,7 +43,7 @@ eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev").select(range(1000)) logging.info(train_dataset) -# 3. Define our training loss: https://sbert.net/docs/package_reference/losses.html#softmaxloss +# 3. Define our training loss: https://sbert.net/docs/package_reference/sentence_transformer/losses.html#softmaxloss train_loss = losses.SoftmaxLoss( model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), diff --git a/examples/training/nli/training_nli_v2.py b/examples/training/nli/training_nli_v2.py index 0b0025351..00727a5f1 100644 --- a/examples/training/nli/training_nli_v2.py +++ b/examples/training/nli/training_nli_v2.py @@ -47,7 +47,7 @@ eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev").select(range(1000)) logging.info(train_dataset) -# 3. Define our training loss: https://sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss +# 3. Define our training loss: https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss train_loss = losses.MultipleNegativesRankingLoss(model) diff --git a/examples/training/nli/training_nli_v3.py b/examples/training/nli/training_nli_v3.py index ffc95e128..43a946089 100644 --- a/examples/training/nli/training_nli_v3.py +++ b/examples/training/nli/training_nli_v3.py @@ -47,7 +47,7 @@ eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev").select(range(1000)) logging.info(train_dataset) -# 3. Define our training loss: https://sbert.net/docs/package_reference/losses.html#gistembedloss +# 3. 
Define our training loss: https://sbert.net/docs/package_reference/sentence_transformer/losses.html#gistembedloss # The guiding model guide_model = SentenceTransformer("all-MiniLM-L6-v2") train_loss = losses.GISTEmbedLoss(model, guide_model) diff --git a/examples/training/other/training_wikipedia_sections.py b/examples/training/other/training_wikipedia_sections.py index e1c835418..a25614dfb 100644 --- a/examples/training/other/training_wikipedia_sections.py +++ b/examples/training/other/training_wikipedia_sections.py @@ -43,7 +43,7 @@ logging.info(train_dataset) # 3. Define our training loss -# TripletLoss (https://sbert.net/docs/package_reference/losses.html#tripletloss) needs three text columns +# TripletLoss (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss) needs three text columns train_loss = TripletLoss(model) # 4. Define an evaluator for use during training. This is useful to keep track of alongside the evaluation loss. diff --git a/examples/unsupervised_learning/SimCSE/README.md b/examples/unsupervised_learning/SimCSE/README.md index 6b670de0b..7b67d9a73 100644 --- a/examples/unsupervised_learning/SimCSE/README.md +++ b/examples/unsupervised_learning/SimCSE/README.md @@ -6,7 +6,7 @@ The idea is to encode the same sentence twice. Due to the used dropout in transf ![SimCSE working](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SimCSE.png) ## Usage with SentenceTransformers -SentenceTransformers implements the [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss), which makes training with SimCSE trivial: +SentenceTransformers implements the [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss), which makes training with SimCSE trivial: ```python from sentence_transformers import SentenceTransformer, InputExample diff --git a/examples/unsupervised_learning/query_generation/README.md b/examples/unsupervised_learning/query_generation/README.md index de237a269..5fa0e0724 100644 --- a/examples/unsupervised_learning/query_generation/README.md +++ b/examples/unsupervised_learning/query_generation/README.md @@ -80,7 +80,7 @@ In the above code, we use [Top-p (nucleus) sampling](https://huggingface.co/blog ## Bi-Encoder Training -With the generated queries, we can then train a bi-encoder using the use [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss). +With the generated queries, we can then train a bi-encoder using the use [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss). ## Full Example We train a semantic search model to search through Wikipedia diff --git a/sentence_transformers/model_card_template.md b/sentence_transformers/model_card_template.md index bf1ac2896..73eb8b954 100644 --- a/sentence_transformers/model_card_template.md +++ b/sentence_transformers/model_card_template.md @@ -126,7 +126,7 @@ You can finetune this model on your own dataset. 
{% for metrics in eval_metrics %} #### {{ metrics.description }} {% if metrics.dataset_name %}* Dataset: `{{ metrics.dataset_name }}`{% endif %} -* Evaluated with {% if metrics.class_name.startswith("sentence_transformers.") %}[{{ metrics.class_name.split(".")[-1] }}](https://sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.{{ metrics.class_name.split(".")[-1] }}){% else %}{{ metrics.class_name }}{% endif %} +* Evaluated with {% if metrics.class_name.startswith("sentence_transformers.") %}[{{ metrics.class_name.split(".")[-1] }}](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.{{ metrics.class_name.split(".")[-1] }}){% else %}{{ metrics.class_name }}{% endif %} {{ metrics.table }} {%- endfor %}{% endif %} @@ -154,7 +154,7 @@ You can finetune this model on your own dataset. * Columns: {% if dataset['columns'] | length == 1 %}{{ dataset['columns'][0] }}{% elif dataset['columns'] | length == 2 %}{{ dataset['columns'][0] }} and {{ dataset['columns'][1] }}{% else %}{{ dataset['columns'][:-1] | join(', ') }}, and {{ dataset['columns'][-1] }}{% endif %} * Approximate statistics based on the first 1000 samples: {{ dataset['stats_table'] }}* Samples: -{{ dataset['examples_table'] }}* Loss: {% if dataset["loss"]["fullname"].startswith("sentence_transformers.") %}[{{ dataset["loss"]["fullname"].split(".")[-1] }}](https://sbert.net/docs/package_reference/losses.html#{{ dataset["loss"]["fullname"].split(".")[-1].lower() }}){% else %}{{ dataset["loss"]["fullname"] }}{% endif %}{% if "config_code" in dataset["loss"] %} with these parameters: +{{ dataset['examples_table'] }}* Loss: {% if dataset["loss"]["fullname"].startswith("sentence_transformers.") %}[{{ dataset["loss"]["fullname"].split(".")[-1] }}](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#{{ dataset["loss"]["fullname"].split(".")[-1].lower() }}){% else %}{{ dataset["loss"]["fullname"] }}{% endif %}{% if "config_code" in dataset["loss"] %} with these parameters: {{ dataset["loss"]["config_code"] }}{% endif %} {% endfor %}{% endif %}{% endfor -%} From 57794192c48246e2dac5ebbebdacd99d1678445e Mon Sep 17 00:00:00 2001 From: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Date: Mon, 27 May 2024 14:37:50 +0200 Subject: [PATCH 31/39] [`v3`] Add "dataset_size:" to the tag denoting the number of training samples (#2680) * Prepend "dataset_size:" instead. 
I can always change the look of this later on the HF side
---
 sentence_transformers/model_card.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/sentence_transformers/model_card.py b/sentence_transformers/model_card.py
index 68d2c0a44..ea67e4dee 100644
--- a/sentence_transformers/model_card.py
+++ b/sentence_transformers/model_card.py
@@ -702,7 +702,7 @@ def extract_dataset_metadata(
        if dataset_type == "train":
            num_training_samples = sum([metadata.get("size", 0) for metadata in dataset_metadata])
            if num_training_samples:
-                self.tags += [self.num_training_samples_to_tag(num_training_samples)]
+                self.tags += ["dataset_size:" + self.num_training_samples_to_tag(num_training_samples)]
        return self.validate_datasets(dataset_metadata)
From 24bee0949f087b4b7919d24eda6c1fe60631b78a Mon Sep 17 00:00:00 2001
From: Tom Aarsen
Date: Mon, 27 May 2024 14:46:57 +0200
Subject: [PATCH 32/39] Fix formatting of Python modules
---
 docs/sentence_transformer/training_overview.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md
index 72c0fcf0f..51bac2fb2 100644
--- a/docs/sentence_transformer/training_overview.md
+++ b/docs/sentence_transformer/training_overview.md
@@ -481,9 +481,9 @@ The :class:`~sentence_transformers.SentenceTransformerTrainer` is where all prev
```eval_rst
This Sentence Transformers trainer integrates support for various :class:`transformers.TrainerCallback` subclasses, such as:
-- :class:`~transformers.integrations.WandbCallback` to automatically log training metrics to W&B if `wandb` is installed
-- :class:`~transformers.integrations.TensorBoardCallback` to log training metrics to TensorBoard if `tensorboard` is accessible.
-- :class:`~transformers.integrations.CodeCarbonCallback` to track the carbon emissions of your model during training if `codecarbon` is installed.
+- :class:`~transformers.integrations.WandbCallback` to automatically log training metrics to W&B if ``wandb`` is installed
+- :class:`~transformers.integrations.TensorBoardCallback` to log training metrics to TensorBoard if ``tensorboard`` is accessible.
+- :class:`~transformers.integrations.CodeCarbonCallback` to track the carbon emissions of your model during training if ``codecarbon`` is installed.
- Note: These carbon emissions will be included in your automatically generated model card.
From a373931ed336f1488f6a793810971d2f61b8c3aa Mon Sep 17 00:00:00 2001
From: Tom Aarsen
Date: Mon, 27 May 2024 15:26:29 +0200
Subject: [PATCH 33/39] Docs: pairwise_cosine_similarity -> pairwise_similarity
---
 docs/sentence_transformer/usage/semantic_textual_similarity.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/sentence_transformer/usage/semantic_textual_similarity.rst b/docs/sentence_transformer/usage/semantic_textual_similarity.rst
index cd4c332b1..219b4c291 100644
--- a/docs/sentence_transformer/usage/semantic_textual_similarity.rst
+++ b/docs/sentence_transformer/usage/semantic_textual_similarity.rst
@@ -89,7 +89,7 @@ This value can be changed in a handful of ways:
Sentence Transformers implements two methods to calculate the similarity between embeddings:
- :meth:`SentenceTransformer.similarity `: Calculates the similarity between all pairs of embeddings.
-- :meth:`SentenceTransformer.pairwise_cosine_similarity `: Calculates the similarity between embeddings in a pairwise fashion.
+- :meth:`SentenceTransformer.pairwise_similarity `: Calculates the similarity between embeddings in a pairwise fashion. :: From 403d188980755d8694e74aabad7496f393ce0c4b Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Mon, 27 May 2024 16:54:58 +0200 Subject: [PATCH 34/39] Link to the yet-to-be-released release notes instead --- docs/changelog/v3.0.md | 0 index.rst | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) delete mode 100644 docs/changelog/v3.0.md diff --git a/docs/changelog/v3.0.md b/docs/changelog/v3.0.md deleted file mode 100644 index e69de29bb..000000000 diff --git a/index.rst b/index.rst index ef8b630b9..980701531 100644 --- a/index.rst +++ b/index.rst @@ -1,6 +1,6 @@ .. note:: - Sentence Transformers v3.0 just released, introducing a new training API for Sentence Transformer models. Read `SentenceTransformer > Training Overview `_ to learn more about the training API, and check out `v3.0 Release Notes `_ for details on the other changes. + Sentence Transformers v3.0 just released, introducing a new training API for Sentence Transformer models. Read `SentenceTransformer > Training Overview `_ to learn more about the training API, and check out `v3.0 Release Notes `_ for details on the other changes. SentenceTransformers Documentation ================================== From 3f5dccbe5ebaf1060449ed4b637c820597cc3967 Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Mon, 27 May 2024 16:55:44 +0200 Subject: [PATCH 35/39] Update phrasing on local_files_only docstring --- sentence_transformers/SentenceTransformer.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py index 280312847..f6a48ae29 100644 --- a/sentence_transformers/SentenceTransformer.py +++ b/sentence_transformers/SentenceTransformer.py @@ -72,7 +72,7 @@ class SentenceTransformer(nn.Sequential, FitMixin): will execute code present on the Hub on your local machine. revision (str, optional): The specific model version to use. It can be a branch name, a tag name, or a commit id, for a stored model on Hugging Face. - local_files_only (bool, optional): If `True`, avoid downloading the model. + local_files_only (bool, optional): Whether or not to only look at local files (i.e., do not try to download the model). token (bool or str, optional): Hugging Face authentication token to download private models. use_auth_token (bool or str, optional): Deprecated argument. Please use `token` instead. truncate_dim (int, optional): The dimension to truncate sentence embeddings to. `None` does no truncation. Truncation is From 2f89fd617951be78081b53fcaf9a98be66658938 Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Tue, 28 May 2024 08:07:19 +0200 Subject: [PATCH 36/39] Link directly to the 2DMSE preprint --- examples/training/adaptive_layer/README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/examples/training/adaptive_layer/README.md b/examples/training/adaptive_layer/README.md index 71afe7c43..b904c40c8 100644 --- a/examples/training/adaptive_layer/README.md +++ b/examples/training/adaptive_layer/README.md @@ -1,6 +1,11 @@ # Adaptive Layers -Embedding models are often encoder models with numerous layers, such as 12 (e.g. [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) or 6 (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)). To get embeddings, every single one of these layers must be traversed. 
[2D Matryoshka Sentence Embeddings](https://arxiv.org/abs/2402.14776) (2DMSE) revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs. +Embedding models are often encoder models with numerous layers, such as 12 (e.g. [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) or 6 (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)). To get embeddings, every single one of these layers must be traversed. The [2D Matryoshka Sentence Embeddings](https://arxiv.org/abs/2402.14776v1) (2DMSE) preprint revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs. + +```eval_rst +.. note:: + The 2DMSE preprint was later updated and renamed to `ESE: Espresso Sentence Embeddings `_. The Sentence Transformers implementation of Adaptive Layers and Matryoshka2d (Adaptive Layer + Matryoshka Embeddings) are based on the initial preprint, and we accept contributions that implement the updated ESE paper. +``` ## Use Cases From 649a31c06a50d571a806f272f04d9ab652fbe7ab Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Tue, 28 May 2024 09:56:49 +0200 Subject: [PATCH 37/39] Add missing subset in quora-duplicates --- docs/sentence_transformer/training_overview.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md index 51bac2fb2..7999c2cf0 100644 --- a/docs/sentence_transformer/training_overview.md +++ b/docs/sentence_transformer/training_overview.md @@ -545,7 +545,7 @@ Training on multiple datasets looks like this: # (sentence1, sentence2) + score stsb_pair_score_train = load_dataset("sentence-transformers/stsb", split="train[:10000]") # (anchor, positive) - quora_pair_train = load_dataset("sentence-transformers/quora-duplicates", split="train[:10000]") + quora_pair_train = load_dataset("sentence-transformers/quora-duplicates", "pair", split="train[:10000]") # (query, answer) natural_questions_train = load_dataset("sentence-transformers/natural-questions", split="train[:10000]") @@ -566,7 +566,7 @@ Training on multiple datasets looks like this: # (sentence1, sentence2, score) stsb_pair_score_dev = load_dataset("sentence-transformers/stsb", split="validation") # (anchor, positive) - quora_pair_dev = load_dataset("sentence-transformers/quora-duplicates", split="train[10000:11000]") + quora_pair_dev = load_dataset("sentence-transformers/quora-duplicates", "pair", split="train[10000:11000]") # (query, answer) natural_questions_dev = load_dataset("sentence-transformers/natural-questions", split="train[10000:11000]") From 946a97d41d0f5e98f30ed61d38b902123f9e219c Mon Sep 17 00:00:00 2001 From: Tom Aarsen Date: Tue, 28 May 2024 10:29:15 +0200 Subject: [PATCH 38/39] Add missing docstrings arguments for Cached... 
losses
---
 sentence_transformers/losses/CachedGISTEmbedLoss.py | 11 +++++++----
 .../losses/CachedMultipleNegativesRankingLoss.py | 13 ++++++++-----
 2 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/sentence_transformers/losses/CachedGISTEmbedLoss.py b/sentence_transformers/losses/CachedGISTEmbedLoss.py
index e3208298f..9480263d1 100644
--- a/sentence_transformers/losses/CachedGISTEmbedLoss.py
+++ b/sentence_transformers/losses/CachedGISTEmbedLoss.py
@@ -82,10 +82,13 @@ def __init__(
        Args:
            model: SentenceTransformer model
-            guide: SentenceTransformer model to guide the in-batch
-            negative sample selection.
-            temperature: Temperature parameter to scale the cosine
-            similarities.
+            guide: SentenceTransformer model to guide the in-batch negative sample selection.
+            temperature: Temperature parameter to scale the cosine similarities.
+            mini_batch_size: Mini-batch size for the forward pass; this denotes how much memory is actually used during
+                training and evaluation. The smaller the mini-batch size, the more memory efficient the training is, but
+                the slower the training will be. It's recommended to set it as high as your GPU memory allows. The default
+                value is 32.
+            show_progress_bar: If True, a progress bar for the mini-batches is shown during training. The default is False.
        References:
            - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf
diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
index 5e1b4e1d0..c6838ddef 100644
--- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
+++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py
@@ -87,12 +87,15 @@ def __init__(
        Args:
            model: SentenceTransformer model
-            scale: Output of similarity function is multiplied by scale
-            value
-            similarity_fct: similarity function between sentence
-            embeddings. By default, cos_sim. Can also be set to dot
+            scale: Output of similarity function is multiplied by scale value
+            similarity_fct: similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot
                product (and then set scale to 1)
-
+            mini_batch_size: Mini-batch size for the forward pass; this denotes how much memory is actually used during
+                training and evaluation. The smaller the mini-batch size, the more memory efficient the training is, but
+                the slower the training will be. It's recommended to set it as high as your GPU memory allows. The default
+                value is 32.
+            show_progress_bar: If True, a progress bar for the mini-batches is shown during training. The default is False.
+
        References:
            - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf
            - Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup: https://arxiv.org/pdf/2101.06983.pdf
From 85890d5713b24fd69af08d89c81db9fc0f3c5ea6 Mon Sep 17 00:00:00 2001
From: Tom Aarsen
Date: Tue, 28 May 2024 12:10:32 +0200
Subject: [PATCH 39/39] Update training overview docs based on the blogpost reviews
---
 docs/sentence_transformer/training_overview.md | 15 ++++++++-------
 .../losses/CachedMultipleNegativesRankingLoss.py | 2 +-
 2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/docs/sentence_transformer/training_overview.md b/docs/sentence_transformer/training_overview.md
index 7999c2cf0..fb9c421f6 100644
--- a/docs/sentence_transformer/training_overview.md
+++ b/docs/sentence_transformer/training_overview.md
@@ -128,14 +128,14 @@ The :class:`SentenceTransformerTrainer` trains and evaluates using :class:`datas
    from datasets import Dataset
-    sentence1_list = []
-    sentence2_list = []
+    anchors = []
+    positives = []
    # Open a file, do preprocessing, filtering, cleaning, etc.
    # and append to the lists
    dataset = Dataset.from_dict({
-        "sentence1": sentence1_list,
-        "sentence2": sentence2_list,
+        "anchor": anchors,
+        "positive": positives,
    })
Each key from the dictionary will become a column in the resulting dataset.
@@ -276,9 +276,10 @@ args = SentenceTransformerTrainingArguments(
## Evaluator
-```eval_rst
-Several evaluators exist that can help with evaluation before, during, and after training:
+You can provide the [`SentenceTransformerTrainer`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer) with an `eval_dataset` to get the evaluation loss during training, but it may be useful to get more concrete metrics during training, too. For this, you can use evaluators to assess the model's performance with useful metrics before, during, or after training. You can use both an `eval_dataset` and an evaluator, one or the other, or neither. They evaluate based on the `eval_strategy` and `eval_steps` [Training Arguments](#training-arguments).
+Here are the implemented Evaluators that come with Sentence Transformers:
```eval_rst
======================================================================== ===========================================================================================================================
Evaluator                                                                Required Data
======================================================================== ===========================================================================================================================
:class:`~sentence_transformers.evaluation.TripletEvaluator`              (anchor, positive, negative) pairs.
======================================================================== ===========================================================================================================================
-Additionally, :class:`~sentence_transformers.evaluation.SequentialEvaluator` should be used to combine multiple evaluators into one Evaluator that can be passed to the :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`. When the evaluator is run depends on the ``eval_strategy`` and ``eval_steps`` `Training Arguments <#training-arguments>`_.
+Additionally, :class:`~sentence_transformers.evaluation.SequentialEvaluator` should be used to combine multiple evaluators into one Evaluator that can be passed to the :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`. Sometimes you don't have the required evaluation data to prepare one of these evaluators on your own, but you still want to track how well the model performs on some common benchmarks. In that case, you can use these evaluators with data from Hugging Face. diff --git a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py index c6838ddef..e8f4c0f3c 100644 --- a/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py +++ b/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py @@ -95,7 +95,7 @@ def __init__( the slower the training will be. It's recommended to set it as high as your GPU memory allows. The default value is 32. show_progress_bar: If True, a progress bar for the mini-batches is shown during training. The default is False. - + References: - Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf - Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup: https://arxiv.org/pdf/2101.06983.pdf
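Finally, to show how the `mini_batch_size` documented in these hunks interacts with the actual batch size: the dataloader batch size controls how many in-batch negatives each example sees, while `mini_batch_size` only caps the memory of each cached forward/backward chunk. A hedged sketch — the batch sizes are illustrative, not recommendations:

```python
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("all-MiniLM-L6-v2")

# Peak memory follows `mini_batch_size`, so the outer batch size can grow
# far beyond what would fit into a single forward pass.
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="tmp_trainer",
    per_device_train_batch_size=1024,  # large effective batch, bounded memory
)
# `args` and `loss` would then be passed to a SentenceTransformerTrainer.
```

Lowering `mini_batch_size` trades training speed for memory, which is exactly the trade-off the docstring describes.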