Speech-to-text, text-to-speech, speaker diarization, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, HarmonyOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust

k2-fsa/sherpa-onnx

Supported functions

All of the following functions are supported:

  • Speech recognition
  • Speech synthesis
  • Speaker identification
  • Speaker diarization
  • Speaker verification
  • Spoken language identification
  • Audio tagging
  • Voice activity detection
  • Keyword spotting
  • Add punctuation

Supported platforms

Supported operating systems by architecture:

  • x64: Android, Windows, macOS, Linux, HarmonyOS
  • x86: Android, Windows
  • arm64: Android, iOS, Windows, macOS, Linux, HarmonyOS
  • arm32: Android, Linux, HarmonyOS
  • riscv64: Linux

Supported programming languages

All of the following languages are supported:

  1. C++
  2. C
  3. Python
  4. JavaScript
  5. Java
  6. C#
  7. Kotlin
  8. Swift
  9. Go
  10. Dart
  11. Rust
  12. Pascal

For Rust support, please see sherpa-rs.

It also supports WebAssembly.

Introduction

This repository supports running the following functions locally

  • Speech-to-text (i.e., ASR); both streaming and non-streaming are supported
  • Text-to-speech (i.e., TTS)
  • Speaker diarization
  • Speaker identification
  • Speaker verification
  • Spoken language identification
  • Audio tagging
  • VAD (e.g., silero-vad)
  • Keyword spotting

on the platforms and operating systems listed under Supported platforms above, with the following APIs:

  • C++, C, Python, Go, C#
  • Java, Kotlin, JavaScript
  • Swift, Rust
  • Dart, Object Pascal

Links for Huggingface Spaces

You can visit the following Huggingface spaces to try sherpa-onnx without installing anything. All you need is a browser.

Description | URL
Speaker diarization | Click me
Speech recognition | Click me
Speech recognition with Whisper | Click me
Speech synthesis | Click me
Generate subtitles | Click me
Audio tagging | Click me
Spoken language identification with Whisper | Click me

We also have spaces built using WebAssembly. They are listed below:

Description | Huggingface space | ModelScope space
Voice activity detection with silero-vad | Click me | Address
Real-time speech recognition (Chinese + English) with Zipformer | Click me | Address
Real-time speech recognition (Chinese + English) with Paraformer | Click me | Address
Real-time speech recognition (Chinese + English + Cantonese) with Paraformer-large | Click me | Address
Real-time speech recognition (English) | Click me | Address
VAD + speech recognition (Chinese + English + Korean + Japanese + Cantonese) with SenseVoice | Click me | Address
VAD + speech recognition (English) with Whisper tiny.en | Click me | Address
VAD + speech recognition (English) with Moonshine tiny | Click me | Address
VAD + speech recognition (English) with Zipformer trained with GigaSpeech | Click me | Address
VAD + speech recognition (Chinese) with Zipformer trained with WenetSpeech | Click me | Address
VAD + speech recognition (Japanese) with Zipformer trained with ReazonSpeech | Click me | Address
VAD + speech recognition (Thai) with Zipformer trained with GigaSpeech2 | Click me | Address
VAD + speech recognition (Chinese, multiple dialects) with a TeleSpeech-ASR CTC model | Click me | Address
VAD + speech recognition (English + Chinese, plus multiple Chinese dialects) with Paraformer-large | Click me | Address
VAD + speech recognition (English + Chinese, plus multiple Chinese dialects) with Paraformer-small | Click me | Address
Speech synthesis (English) | Click me | Address
Speech synthesis (German) | Click me | Address
Speaker diarization | Click me | Address

Links for pre-built Android APKs

You can find pre-built Android APKs for this repository in the following table.

Description | URL | For users in China
Speaker diarization | Address | Click here
Streaming speech recognition | Address | Click here
Text-to-speech | Address | Click here
Voice activity detection (VAD) | Address | Click here
VAD + non-streaming speech recognition | Address | Click here
Two-pass speech recognition | Address | Click here
Audio tagging | Address | Click here
Audio tagging (WearOS) | Address | Click here
Speaker identification | Address | Click here
Spoken language identification | Address | Click here
Keyword spotting | Address | Click here

Links for pre-built Flutter apps

Real-time speech recognition

Description | URL | For users in China
Streaming speech recognition | Address | Click here

Text-to-speech

Description | URL | For users in China
Android (arm64-v8a, armeabi-v7a, x86_64) | Address | Click here
Linux (x64) | Address | Click here
macOS (x64) | Address | Click here
macOS (arm64) | Address | Click here
Windows (x64) | Address | Click here

Note: You need to build from source for iOS.

Links for pre-built Lazarus apps

Generating subtitles

Description | URL | For users in China
Generate subtitles | Address | Click here

Links for pre-trained models

Description | URL
Speech recognition (speech to text, ASR) | Address
Text-to-speech (TTS) | Address
VAD | Address
Keyword spotting | Address
Audio tagging | Address
Speaker identification (Speaker ID) | Address
Spoken language identification (Language ID) | See the multilingual Whisper ASR models under Speech recognition
Punctuation | Address
Speaker segmentation | Address

Some pre-trained ASR models (Streaming)

More streaming models are available than are shown here. The following table lists only some of them.

Name | Supported languages | Description
sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 | Chinese, English | See also
sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16 | Chinese, English | See also
sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23 | Chinese | Suitable for Cortex A7 CPU. See also
sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 | English | Suitable for Cortex A7 CPU. See also
sherpa-onnx-streaming-zipformer-korean-2024-06-16 | Korean | See also
sherpa-onnx-streaming-zipformer-fr-2023-04-14 | French | See also

Some pre-trained ASR models (Non-Streaming)

More non-streaming models are available than are shown here. The following table lists only some of them.

Name | Supported languages | Description
Whisper tiny.en | English | See also
Moonshine tiny | English | See also
sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 | Chinese, Cantonese, English, Korean, Japanese | Supports multiple Chinese dialects. See also
sherpa-onnx-paraformer-zh-2024-03-09 | Chinese, English | Also supports multiple Chinese dialects. See also
sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01 | Japanese | See also
sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24 | Russian | See also
sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24 | Russian | See also
sherpa-onnx-zipformer-ru-2024-09-18 | Russian | See also
sherpa-onnx-zipformer-korean-2024-06-24 | Korean | See also
sherpa-onnx-zipformer-thai-2024-06-20 | Thai | See also
sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04 | Chinese | Supports multiple dialects. See also

Useful links

How to reach us

Please see https://k2-fsa.github.io/sherpa/social-groups.html for the next-gen Kaldi WeChat and QQ discussion groups.

Projects using sherpa-onnx

Talk to any LLM with hands-free voice interaction, voice interruption, and a Live2D talking face, running locally across platforms.

See also Open-LLM-VTuber/Open-LLM-VTuber#50

Streaming ASR and TTS based on FastAPI

It shows how to use the ASR and TTS Python APIs with FastAPI.

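Uses streaming ASR in C# with a graphical user interface.

Video demo in Chinese: [Open source] Windows real-time subtitle software (a must-have for online classes and meetings)

It uses the JavaScript API of sherpa-onnx along with Electron.

Video demo in Chinese: It's blown up! Xuanshen teaches you to use a "typing hack"! The League of Legends tool that truly affects your win rate! The final piece of the League of Legends puzzle! Communicate barrier-free with everyone in the game!

A Node.js-based server providing a RESTful API for speech recognition.

A modular, low-resource conversational robot/smart speaker whose entire pipeline can run offline.

It uses Qt. Both ASR and TTS are used.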
