Mismatch diarization result between pyannote/speaker-diarization-3.0 and k2-fsa/speaker-diarization #1708

Open
takipipo opened this issue Jan 14, 2025 · 10 comments

Comments

@takipipo

takipipo commented Jan 14, 2025

I attempted to diarize the audio clip using the same model, but I obtained different results. Is this a known issue related to the ONNX format, or did I make a mistake in my process?

I checked the pipeline configuration of pyannote/speaker-diarization-3.0 and selected the same models that are provided in sherpa-onnx.

How to reproduce

pyannote/speaker-diarization-3.0

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.0",
  use_auth_token="change_to_your_huggingface_token")

diarization = pipeline("ck-interview-mono.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

Output

start=0.0s stop=5.2s speaker_SPEAKER_00
start=6.0s stop=23.0s speaker_SPEAKER_00
start=23.8s stop=33.2s speaker_SPEAKER_00
start=33.3s stop=41.4s speaker_SPEAKER_00
start=42.2s stop=43.0s speaker_SPEAKER_00
start=43.0s stop=48.0s speaker_SPEAKER_01
start=48.7s stop=50.2s speaker_SPEAKER_01
start=50.5s stop=61.9s speaker_SPEAKER_01
start=62.2s stop=71.3s speaker_SPEAKER_01
start=71.5s stop=72.0s speaker_SPEAKER_00
start=71.9s stop=72.7s speaker_SPEAKER_01
start=73.5s stop=74.6s speaker_SPEAKER_00

k2-fsa/speaker-diarization

Ran on https://huggingface.co/spaces/k2-fsa/speaker-diarization

  1. speaker embedding model: wespeaker_en_voxceleb_resnet34_LM.onnx|26MB
  2. speaker segmentation model: pyannote/segmentation-3.0
  3. Number of speakers: 2

Output

0.031 -- 5.228 speaker_00
6.038 -- 23.048 speaker_00
23.825 -- 32.971 speaker_00
33.562 -- 41.375 speaker_00
42.151 -- 47.990 speaker_00
48.732 -- 72.728 speaker_00
73.522 -- 74.602 speaker_00
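
To make the mismatch concrete, the two outputs above can be reduced to total speaking time per label. This is only a minimal sketch: the segment values are copied from the outputs above, and the helper name `total_time_per_speaker` is hypothetical (it is not part of pyannote or sherpa-onnx).

```python
from collections import defaultdict

# (start, end, label) tuples copied from the pyannote output above.
pyannote_segments = [
    (0.0, 5.2, "SPEAKER_00"), (6.0, 23.0, "SPEAKER_00"),
    (23.8, 33.2, "SPEAKER_00"), (33.3, 41.4, "SPEAKER_00"),
    (42.2, 43.0, "SPEAKER_00"), (43.0, 48.0, "SPEAKER_01"),
    (48.7, 50.2, "SPEAKER_01"), (50.5, 61.9, "SPEAKER_01"),
    (62.2, 71.3, "SPEAKER_01"), (71.5, 72.0, "SPEAKER_00"),
    (71.9, 72.7, "SPEAKER_01"), (73.5, 74.6, "SPEAKER_00"),
]

# (start, end, label) tuples copied from the sherpa-onnx output above.
sherpa_segments = [
    (0.031, 5.228, "speaker_00"), (6.038, 23.048, "speaker_00"),
    (23.825, 32.971, "speaker_00"), (33.562, 41.375, "speaker_00"),
    (42.151, 47.990, "speaker_00"), (48.732, 72.728, "speaker_00"),
    (73.522, 74.602, "speaker_00"),
]


def total_time_per_speaker(segments):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for start, end, label in segments:
        totals[label] += end - start
    return dict(totals)


pyannote_totals = total_time_per_speaker(pyannote_segments)
sherpa_totals = total_time_per_speaker(sherpa_segments)

print("pyannote:", pyannote_totals)
print("sherpa-onnx:", sherpa_totals)
```

The segment boundaries are close in both runs; the difference is in the labels, since the sherpa-onnx run assigns every segment to speaker_00 even though two speakers were requested.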
@takipipo
Author

Additionally, I conducted a comparison of the embedding models using cosine similarity. The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

Cosine Similarity Calculation

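from pyannote.audio import Model
from pyannote.audio import Inference
import numpy as np
import sherpa_onnx
from scipy.spatial.distance import cdist

# read_wave is the helper from the k2-fsa/speaker-diarization Space (model.py); it is not defined in this snippet.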

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
inference = Inference(model, window="whole")
audio_fp = "change_to_your_audio_filepath"

embedding_pyannote = inference(audio_fp)
config = sherpa_onnx.SpeakerEmbeddingExtractorConfig(model = "/Users/kridtaphadsae-khow/.cache/huggingface/hub/models--csukuangfj--speaker-embedding-models/snapshots/0743f301363dec56491a490f6d6cbc9d67f9a3bf/wespeaker_en_voxceleb_resnet34_LM.onnx", num_threads = 1, debug=True, provider = "cpu")
extractor = sherpa_onnx.SpeakerEmbeddingExtractor(config)

audio, sample_rate = read_wave(audio_fp)
stream = extractor.create_stream()
stream.accept_waveform(sample_rate=sample_rate, waveform=audio)
embedding_sherpa = np.asarray(extractor.compute(stream))

distance = cdist(np.expand_dims(embedding_pyannote, axis=0), np.expand_dims(embedding_sherpa, axis=0), metric="cosine")
print(distance)
>> array([[0.82130009]])

@csukuangfj
Collaborator

> The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

If it is nearly 0, then you can consider them almost orthogonal.

If it is nearly 1, then you cannot say they are almost orthogonal.

@csukuangfj
Collaborator

Can you share ck-interview-mono.wav?

@takipipo
Author

> Can you share ck-interview-mono.wav?

audio clip

@takipipo
Author

takipipo commented Jan 15, 2025

> > The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.
>
> If it is nearly 0, then you can consider them almost orthogonal.
>
> If it is nearly 1, then you cannot say they are almost orthogonal.

In scipy, the "cosine" metric is a distance (1 - cosine similarity), so a value of 1 indicates orthogonal vectors, while 0 indicates parallel vectors.


@csukuangfj
Collaborator

I see what you mean now

cosine_distance = 1 - similarity_score
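
For anyone else who trips over the same naming, here is a small sketch of the relation between the two quantities. It uses only the standard library and follows the usual definition, distance = 1 - (u · v) / (‖u‖ ‖v‖); the function names and test vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Parallel vectors: similarity near 1, distance near 0.
parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])

# Orthogonal vectors: similarity 0, distance 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

# Opposite vectors: similarity -1, distance 2.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(parallel, orthogonal, opposite)
```

Under this definition, the distance of 0.8213 reported above corresponds to a cosine similarity of about 0.18, so the two embeddings are much closer to orthogonal than to parallel.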

@csukuangfj
Collaborator

> audio, sample_rate = read_wave(audio_fp)

Please show the complete code.

what is read_wave?

@takipipo
Author

> > audio, sample_rate = read_wave(audio_fp)
>
> Please show the complete code.
>
> what is read_wave?

I used the read_wave function that you provided at https://huggingface.co/spaces/k2-fsa/speaker-diarization/blob/main/model.py#L26-L48
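
For readers who cannot open the link, a rough stand-in for such a helper is sketched below. This is not a copy of the linked function: it assumes the helper takes a path to a mono 16-bit PCM WAV file and returns the samples scaled to [-1, 1) together with the sample rate. The name `read_wave_sketch` is hypothetical, and it returns a plain list rather than a NumPy array so that it needs only the standard library.

```python
import array
import sys
import wave


def read_wave_sketch(path):
    """Read a mono 16-bit PCM WAV file.

    Returns (samples, sample_rate), where samples is a list of floats
    scaled to the range [-1, 1).
    """
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if f.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        sample_rate = f.getframerate()
        raw = f.readframes(f.getnframes())

    ints = array.array("h")  # signed 16-bit integers
    ints.frombytes(raw)
    if sys.byteorder == "big":
        # WAV data is little-endian, so swap on big-endian hosts.
        ints.byteswap()

    samples = [s / 32768.0 for s in ints]
    return samples, sample_rate
```

When comparing embeddings from two toolkits, it is worth confirming that both receive the same sample rate and the same amplitude scaling, since a reader like this one does no resampling.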

@takipipo
Author

@csukuangfj any update?

@csukuangfj
Collaborator

Sorry, I will check it after the Chinese New Year.
