Fix smart_batching_collate Inefficiency (#2556)

* Fix smart_batching_collate Inefficiency

  SentenceTransformer.py:846 throws an inefficiency warning:
  ".....Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:275.) labels = torch.tensor([example.label for example in batch])"

* Update SentenceTransformer.py

* Remove some comments; add edge case (if labels is empty)

---------

Co-authored-by: Tom Aarsen <[email protected]>
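
For illustration, the label handling this commit introduces can be sketched as a standalone function. This is a minimal sketch rather than the library's code: the helper name `labels_to_tensor` and the sample label values are hypothetical, and only the branching logic mirrors the diff below.

```python
import numpy as np
import torch


def labels_to_tensor(labels):
    """Convert a batch of labels to a tensor (hypothetical helper)."""
    if labels and isinstance(labels[0], np.ndarray):
        # Stack into one ndarray first, then hand that array to torch.
        # This avoids passing a list of ndarrays to torch.tensor, which
        # is the pattern the warning complains about.
        return torch.from_numpy(np.stack(labels))
    # Scalars, plain lists, or an empty batch fall back to torch.tensor.
    return torch.tensor(labels)


# Made-up sample labels, one entry per example in a batch of four.
array_labels = [np.array([0.1, 0.9], dtype=np.float32) for _ in range(4)]
scalar_labels = [0, 1, 1, 0]

stacked = labels_to_tensor(array_labels)
scalars = labels_to_tensor(scalar_labels)
empty = labels_to_tensor([])

print(stacked.shape, stacked.dtype)  # torch.Size([4, 2]) torch.float32
print(scalars.shape, scalars.dtype)  # torch.Size([4]) torch.int64
print(empty.shape)                   # torch.Size([0])
```

The `labels and ...` guard is what keeps an empty batch from raising an `IndexError` on `labels[0]`. Note that `torch.from_numpy` preserves the NumPy dtype, so `float64` label arrays produce a `float64` tensor.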
UKPLab · May 22, 2024 · 684b6b5
1 parent 5f75ce5
Showing 1 changed file with 10 additions and 2 deletions.
diff --git a/sentence_transformers/SentenceTransformer.py b/sentence_transformers/SentenceTransformer.py
@@ -1000,8 +1000,16 @@ def smart_batching_collate(self, batch: List["InputExample"]) -> Tuple[List[Dict
         """
         texts = [example.texts for example in batch]
         sentence_features = [self.tokenize(sentence) for sentence in zip(*texts)]
-        labels = torch.tensor([example.label for example in batch])
-        return sentence_features, labels
+        labels = [example.label for example in batch]
+
+        # Use torch.from_numpy to convert the numpy array directly to a tensor,
+        # which is the recommended approach for converting numpy arrays to tensors
+        if labels and isinstance(labels[0], np.ndarray):
+            labels_tensor = torch.from_numpy(np.stack(labels))
+        else:
+            labels_tensor = torch.tensor(labels)
+
+        return sentence_features, labels_tensor
 
     def _text_length(self, text: Union[List[int], List[List[int]]]):
         """