
How to create hotwords? #1762

Open

joonhyung-lee opened this issue Jan 24, 2025 · 3 comments

joonhyung-lee commented Jan 24, 2025

Hi,

I’m using the following model configuration:

encoder:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.int8.onnx \
decoder:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx \
decoding_method:=modified_beam_search \
joiner:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.int8.onnx \
tokens:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
bpe_vocab:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/bpe.vocab \
hotwords_file:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/hotwords_ko.txt \
hotwords_score:=2.0 \
vad_model:=../models/silero_vad.onnx
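For reference, here is a minimal sketch of the same configuration through the sherpa-onnx Python API (assuming sherpa_onnx.OnlineRecognizer.from_transducer; the parameter values simply mirror the launch arguments above):

import sherpa_onnx

# Sketch only: mirrors the launch arguments above via the Python API.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt",
    encoder="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.int8.onnx",
    decoder="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx",
    joiner="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.int8.onnx",
    decoding_method="modified_beam_search",  # hotword boosting requires beam search
    hotwords_file="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/hotwords_ko.txt",
    hotwords_score=2.0,
)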

I’ve followed the guide from this page and successfully generated the bpe.vocab file using:

python ./export_bpe_vocab.py --bpe-model ./bpe.model

Now, I’d like to configure the hotwords_ko.txt file to include specific keywords I want to emphasize.

Could you clarify the following:

  1. Should I list the original form of the words directly, one per line? For example, if I want to add "Hello", should I simply write it as "Hello" in hotwords_ko.txt?
  2. Or do I need to tokenize (encode) the words with the BPE model and write the encoded results, one per line? (See the sketch after this list.)
  3. Additionally, I'm using the model for Korean, and the documentation is unclear about the modeling unit for Korean. Should I follow the bpe approach as in the English example, or is there a different process for languages like Korean?

Thank you for your guidance!

joonhyung-lee (Author) commented:

I found this example script: https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/text2token.py

csukuangfj (Collaborator) commented:

So have you fixed it?

joonhyung-lee (Author) commented:

I've followed the hotwords generation process using text2token.py with the BPE model, but the recognition results aren't as accurate or responsive as I expected.

Is this the correct approach for creating hotwords, especially for Korean language models? What might I be missing in the tokenization or hotwords configuration?

Here's my example command, followed by the resulting hotwords_ko.txt.

python3 text2token.py \
  --text ./hotwords_input.txt \
  --tokens ../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
  --tokens-type bpe \
  --bpe-model ../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/bpe.vocab \
  --output ./hotwords_ko.txt
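One detail worth flagging: the --bpe-model argument above points at bpe.vocab, while the script's own usage example below passes bpe.model. If bpe.vocab is not an actual sentencepiece model file, that mismatch alone could be a source of unexpected results.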

Full code for text2token.py is as follows:

#!/usr/bin/env python3

"""
This script encodes the texts (given line by line through ``text``) into tokens
and writes the results to the file given by ``output``.

Usage:
If the tokens_type is bpe:
python3 ./text2token.py \
          --text texts.txt \
          --tokens tokens.txt \
          --tokens-type bpe \
          --bpe-model bpe.model \
          --output hotwords.txt

Example:
python3 text2token.py \
  --text ./hotwords_input.txt \
  --tokens ./sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
  --tokens-type bpe \
  --bpe-model ./sherpa-onnx-streaming-zipformer-korean-2024-06-16/bpe.model \
  --output ./sherpa-onnx-streaming-zipformer-korean-2024-06-16/hotwords_ko.txt
"""
import argparse
from sherpa_onnx import text2token

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="""Path to the input texts.

        Each line in the texts contains the original phrase, it might also contain some
        extra items, for example, the boosting score (startting with :), the triggering
        threshold (startting with #, only used in keyword spotting task) and the original
        phrase (startting with @). Note: extra items will be kept in the output.

        example input (tokens_type = bpe):

        HELLO WORLD :1.5 #0.4
        HI GOOGLE :2.0 #0.8
        HEY SIRI #0.35

        example output:

        ▁HE LL O ▁WORLD :1.5 #0.4
        ▁HI ▁GO O G LE :2.0 #0.8
        ▁HE Y ▁S I RI #0.35
        """,
    )

    parser.add_argument(
        "--tokens",
        type=str,
        required=True,
        help="The path to tokens.txt.",
    )

    parser.add_argument(
        "--tokens-type",
        type=str,
        required=True,
        choices=["cjkchar", "bpe", "cjkchar+bpe", "fpinyin", "ppinyin"],
        help="""The type of modeling units, should be cjkchar, bpe, cjkchar+bpe, fpinyin or ppinyin.
        fpinyin means full pinyin, each cjkchar has a pinyin(with tone).
        ppinyin means partial pinyin, it splits pinyin into initial and final,
        """,
    )

    parser.add_argument(
        "--bpe-model",
        type=str,
        help="The path to bpe.model. Only required when tokens-type is bpe or cjkchar+bpe.",
    )

    parser.add_argument(
        "--output",
        type=str,
        required=True,
        help="Path where the encoded tokens will be written to.",
    )

    return parser.parse_args()


def main():
    args = get_args()

    texts = []
    # Extra information: boosting score (starts with ":"), triggering threshold
    # (starts with "#"), and original keyword (starts with "@").
    extra_info = []
    with open(args.text, "r", encoding="utf8") as f:
        for line in f:
            extra = []
            text = []
            toks = line.strip().split()
            for tok in toks:
                if tok[0] == ":" or tok[0] == "#" or tok[0] == "@":
                    extra.append(tok)
                else:
                    text.append(tok)
            texts.append(" ".join(text))
            extra_info.append(extra)
    encoded_texts = text2token(
        texts,
        tokens=args.tokens,
        tokens_type=args.tokens_type,
        bpe_model=args.bpe_model,
    )
    with open(args.output, "w", encoding="utf8") as f:
        for i, txt in enumerate(encoded_texts):
            txt += extra_info[i]
            f.write(" ".join(txt) + "\n")


if __name__ == "__main__":
    main()
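
As a quick, hypothetical illustration of the parsing loop above (using the line format from the docstring example):

line = "HELLO WORLD :1.5 #0.4"
toks = line.strip().split()
text = [t for t in toks if t[0] not in ":#@"]   # ['HELLO', 'WORLD'] -> gets encoded
extra = [t for t in toks if t[0] in ":#@"]      # [':1.5', '#0.4'] -> kept verbatim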

Input File: hotwords_input.txt

아토: 2.5
초기화: 2.0
하이: 1.5
안녕: 1.5

Output File: hotwords_ko.txt

▁아 토 : ▁2 . 5
▁초 기 화 : ▁2 . 0
▁하 이 : ▁1 . 5
▁안 녕 : ▁1 . 5
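
A hedged observation: per the script's help text and parsing loop above, a boosting score is recognized only when it is attached to the ":" marker (e.g. "HELLO WORLD :1.5"). In the input above, the colon is attached to the word and the score stands alone ("아토: 2.5"), so both end up encoded as ordinary text, which is exactly what the output shows. Under the documented format the input would presumably need to look like:

아토 :2.5
초기화 :2.0
하이 :1.5
안녕 :1.5

which, per the parsing loop, should yield output with the scores kept verbatim:

▁아 토 :2.5
▁초 기 화 :2.0
▁하 이 :1.5
▁안 녕 :1.5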
