Merge v3.0 pre-release into master, prepare for full v3.0 release (#2685)

tomaarsen authored May 28, 2024
2 parents 684b6b5 + 85890d5 commit e55a6d1
Showing 292 changed files with 13,732 additions and 7,262 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -45,7 +45,7 @@ jobs:
if: steps.restore-cache.outputs.cache-hit != 'true'

- name: Install the checked-out sentence-transformers
run: python -m pip install .
run: python -m pip install .[train]

- name: Run unit tests
shell: bash
6 changes: 4 additions & 2 deletions .gitignore
@@ -19,7 +19,9 @@ nr_*/
/docs/make.bat
/examples/training/quora_duplicate_questions/quora-IR-dataset/
build

htmlcov
.coverage
.venv
wandb
checkpoints
tmp
.venv
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include sentence_transformers/model_card_template.md
74 changes: 24 additions & 50 deletions README.md
@@ -1,14 +1,10 @@
<!--- BADGES: START --->
[![HF Models](https://img.shields.io/badge/%F0%9F%A4%97-models-yellow)](https://huggingface.co/models?library=sentence-transformers)
[![GitHub - License](https://img.shields.io/github/license/UKPLab/sentence-transformers?logo=github&style=flat&color=green)][#github-license]
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sentence-transformers?logo=pypi&style=flat&color=blue)][#pypi-package]
[![PyPI - Package Version](https://img.shields.io/pypi/v/sentence-transformers?logo=pypi&style=flat&color=orange)][#pypi-package]
[![Conda - Platform](https://img.shields.io/conda/pn/conda-forge/sentence-transformers?logo=anaconda&style=flat)][#conda-forge-package]
[![Conda (channel only)](https://img.shields.io/conda/vn/conda-forge/sentence-transformers?logo=anaconda&style=flat&color=orange)][#conda-forge-package]
[![Docs - GitHub.io](https://img.shields.io/static/v1?logo=github&style=flat&color=pink&label=docs&message=sentence-transformers)][#docs-package]
<!---
[![PyPI - Downloads](https://img.shields.io/pypi/dm/sentence-transformers?logo=pypi&style=flat&color=green)][#pypi-package]
[![Conda](https://img.shields.io/conda/dn/conda-forge/sentence-transformers?logo=anaconda)][#conda-forge-package]
--->
<!-- [![PyPI - Downloads](https://img.shields.io/pypi/dm/sentence-transformers?logo=pypi&style=flat&color=green)][#pypi-package] -->

[#github-license]: https://github.com/UKPLab/sentence-transformers/blob/master/LICENSE
[#pypi-package]: https://pypi.org/project/sentence-transformers/
@@ -20,38 +16,24 @@

This framework provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar text are closer and can efficiently be found using cosine similarity.

We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases.
We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases.

Further, this framework allows an easy **[fine-tuning of custom embeddings models](https://www.sbert.net/docs/training/overview.html)**, to achieve maximal performance on your specific task.
Further, this framework allows an easy **[fine-tuning of custom embeddings models](https://www.sbert.net/docs/sentence_transformer/training_overview.html)**, to achieve maximal performance on your specific task.

For the **full documentation**, see **[www.SBERT.net](https://www.sbert.net)**.

The following publications are integrated in this framework:

- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019)
- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020)
- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL 2021)
- [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210) (arXiv 2020)
- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (arXiv 2021)
- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) (arXiv 2021)
- [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) (arXiv 2022)

## Installation

We recommend **Python 3.8** or higher, **[PyTorch 1.11.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.32.0](https://github.com/huggingface/transformers)** or higher. The code does **not** work with Python 2.7.
We recommend **Python 3.8+**, **[PyTorch 1.11.0+](https://pytorch.org/get-started/locally/)**, and **[transformers v4.34.0+](https://github.com/huggingface/transformers)**.

**Install with pip**

Install the *sentence-transformers* with `pip`:

```
pip install -U sentence-transformers
```

**Install with conda**

You can install the *sentence-transformers* with `conda`:

```
conda install -c conda-forge sentence-transformers
```
@@ -73,8 +55,6 @@ If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA

See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documenation.

[This example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/computing-embeddings/computing_embeddings.py) shows you how to use an already trained Sentence Transformer model to embed sentences for another task.

First download a pretrained model.

````python
@@ -87,58 +67,52 @@ Then provide some sentences to the model.

````python
sentences = [
"This framework generates embeddings for each input sentence",
"Sentences are passed as a list of string.",
"The quick brown fox jumps over the lazy dog.",
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
sentence_embeddings = model.encode(sentences)
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (3, 384)
````

And that's it already. We now have a list of numpy arrays with the embeddings.
And that's already it. We now have a numpy arrays with the embeddings, one for each text. We can use these to compute similarities.

````python
for sentence, embedding in zip(sentences, sentence_embeddings):
print("Sentence:", sentence)
print("Embedding:", embedding)
print("")
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
# [0.6660, 1.0000, 0.1411],
# [0.1046, 0.1411, 1.0000]])
````
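
The `model.similarity` call in the quickstart above returns a matrix of pairwise scores. As an illustration only, and assuming the model's similarity function is cosine similarity (the README text says similar texts are found using cosine similarity), a dependency-free sketch of that computation might look like the following. The names `cosine`, `similarity_matrix`, and the toy vectors are hypothetical and not part of the library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities; entry [i][j] compares rows i and j."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


# Toy 2-dimensional "embeddings" (hypothetical; real models return e.g. 384 dimensions).
toy_embeddings = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(x, 4) for x in row])
```

The result is symmetric with ones on the diagonal, since each vector is compared with itself there; this is the same pattern visible in the tensor printed by the quickstart.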

## Pre-Trained Models

We provide a large list of [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.

[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
We provide a large list of [Pretrained Models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.

## Training

This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task.

See [Training Overview](https://www.sbert.net/docs/training/overview.html) for an introduction how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) how to train models on various datasets.
See [Training Overview](https://www.sbert.net/docs/sentence_transformer/training_overview.html) for an introduction how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) how to train models on various datasets.

Some highlights are:
- Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
- Multi-Lingual and multi-task learning
- Evaluation during training to find optimal model
- [20+ loss-functions](https://www.sbert.net/docs/package_reference/losses.html) allowing to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss.

## Performance

Our models are evaluated extensively on 15+ datasets including challening domains like Tweets, Reddit, emails. They achieve by far the **best performance** from all available sentence embedding methods. Further, we provide several **smaller models** that are **optimized for speed**.

[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
- [20+ loss-functions](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html) allowing to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss, etc.

## Application Examples

You can use this framework for:

- [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
- [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
- [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
- [Clustering](https://www.sbert.net/examples/applications/clustering/README.html)
- [Paraphrase Mining](https://www.sbert.net/examples/applications/paraphrase-mining/README.html)
- [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
- [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
- [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
- [Text Summarization](https://www.sbert.net/examples/applications/text-summarization/README.html)
- [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
- [Multilingual Image Search, Clustering & Duplicate Detection](https://www.sbert.net/examples/applications/image-search/README.html)

and many more use-cases.
@@ -193,7 +167,7 @@ If you use one of the multilingual models, feel free to cite our publication [Ma

Please have a look at [Publications](https://www.sbert.net/docs/publications.html) for our different publications that are integrated into SentenceTransformers.

Contact person: Tom Aarsen, [[email protected]](mailto:[email protected])
Maintainer: [Tom Aarsen](https://github.com/tomaarsen), 🤗 Hugging Face

https://www.ukp.tu-darmstadt.de/

5 changes: 4 additions & 1 deletion docs/Makefile
@@ -1,3 +1,6 @@

docs:
sphinx-build -c . -a -E .. _build
sphinx-build -c . -a -E .. _build

docs-quick:
sphinx-build -c . .. _build
86 changes: 85 additions & 1 deletion docs/_static/css/custom.css
@@ -24,4 +24,88 @@ dl.class > dt {

.wy-side-nav-search {
padding-top: 0px;
}
}

.components {
display: flex;
flex-flow: row wrap;
}

.components > .box {
flex: 1;
margin: 0.5rem;
padding: 1rem;
border-style: solid;
border-width: 1px;
border-radius: 0.5rem;
border-color: rgb(55 65 81);
background-color: #e3e3e3;
color: #404040; /* Override the colors imposed by <a href> */
}

.components > .box:nth-child(1) > .header {
background-image: linear-gradient(to bottom right, #60a5fa, #3b82f6);
}

.components > .box:nth-child(2) > .header {
background-image: linear-gradient(to bottom right, #fb923c, #f97316);
}

.components > .box:nth-child(3) > .header {
background-image: linear-gradient(to bottom right, #f472b6, #ec4899);
}

.components > .box:nth-child(4) > .header {
background-image: linear-gradient(to bottom right, #a78bfa, #8b5cf6);
}

.components > .box:nth-child(5) > .header {
background-image: linear-gradient(to bottom right, #34d399, #10b981);
}

.components > .optional {
background: repeating-linear-gradient(
135deg,
#f1f1f1,
#f1f1f1 25px,
#e3e3e3 25px,
#e3e3e3 50px
);
}

.components > .box > .header {
border-style: solid;
border-width: 1px;
border-radius: 0.5rem;
border-color: rgb(55 65 81);
padding: 0.5rem;
text-align: center;
margin-bottom: 0.5rem;
font-weight: bold;
color: white;
}

.sidebar p {
font-size: 100% !important;
}

.training-arguments {
background-color: #f3f6f6;
border: 1px solid #e1e4e5;
}

.training-arguments > .header {
font-weight: 700;
padding: 6px 12px;
background: #e1e4e5;
}

.training-arguments > .table {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(15em, 1fr));
}

.training-arguments > .table > a {
padding: 0.5rem;
border: 1px solid #e1e4e5;
}
1 change: 0 additions & 1 deletion docs/_themes/sphinx_rtd_theme/__init__.py
@@ -8,7 +8,6 @@

import sphinx


__version__ = "0.5.0"
__version_full__ = __version__

3 changes: 0 additions & 3 deletions docs/_themes/sphinx_rtd_theme/footer.html
@@ -24,9 +24,6 @@
&copy; {% trans %}Copyright{% endtrans %} {{ copyright }}
{%- endif %}
{%- endif %}

&bull; <a href="/docs/contact.html">Contact</a>

{%- if build_id and build_url %}
<span class="build">
{# Translators: Build is a noun, not a verb #}
6 changes: 5 additions & 1 deletion docs/_themes/sphinx_rtd_theme/layout.html
@@ -121,8 +121,12 @@
</a>

<div style="display: flex; justify-content: center;">
<div id="twitter-button">
<!-- This snippet adds a "Follow SBERT on Twitter" button. I'll remove it as Nils doesn't post about SBERT anmymore -->
<!-- <div id="twitter-button">
<a href="https://twitter.com/Nils_Reimers" target="_blank" title="Follow SBERT on Twitter"><img src="/_static/Twitter_Logo_White.svg" height="20" style="margin: 0px 10px 0px -10px;"> </a>
</div> -->
<div id="hf-button">
<a href="https://huggingface.co/models?library=sentence-transformers" target="_blank" title="See all Sentence Transformer models"><img src="{{ pathto('_static/img/hf-logo.svg', 1) }}" style="margin: 0px 10px 0px -10px; padding: 0px; height: 28px; width: 28px;"></a>
</div>
<div id="github-button"></div>
</div>
2 changes: 1 addition & 1 deletion docs/_themes/sphinx_rtd_theme/theme.conf
@@ -8,7 +8,7 @@ canonical_url =
analytics_id =
collapse_navigation = True
sticky_navigation = True
navigation_depth = 4
navigation_depth =
includehidden = True
titles_only =
logo_only =