
How to create hotwords? #1762

Open

joonhyung-lee opened this issue Jan 24, 2025 · 3 comments

joonhyung-lee commented Jan 24, 2025

Hi,

I’m using the following model configuration:

encoder:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.int8.onnx \
decoder:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx \
decoding_method:=modified_beam_search \
joiner:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.int8.onnx \
tokens:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
bpe_vocab:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/bpe.vocab \
hotwords_file:=../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/hotwords_ko.txt \
hotwords_score:=2.0 \
vad_model:=../models/silero_vad.onnx
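For reference, here is a minimal sketch of the same configuration through the sherpa-onnx Python API (assuming sherpa_onnx.OnlineRecognizer.from_transducer; the parameter values simply mirror the launch arguments above):

import sherpa_onnx

# Sketch only: mirrors the launch arguments above via the Python API.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt",
    encoder="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.int8.onnx",
    decoder="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx",
    joiner="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.int8.onnx",
    decoding_method="modified_beam_search",  # hotword boosting requires beam search
    hotwords_file="../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/hotwords_ko.txt",
    hotwords_score=2.0,
)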

I’ve followed the guide from this page and successfully generated the bpe.vocab file using:

python ./export_bpe_vocab.py --bpe-model ./bpe.model

Now, I’d like to configure the hotwords_ko.txt file to include specific keywords I want to emphasize.

Could you clarify the following:

  1. Should I list the original form of the words directly, one per line? For example, if I want to add "Hello", should I simply write it as "Hello" in hotwords_ko.txt?
  2. Or do I need to tokenize (encode) the words with the BPE model and write the encoded results, one per line? (See the sketch after this list.)
  3. Additionally, I'm using the model for Korean, and the documentation is unclear about the modeling unit for Korean. Should I follow the bpe approach as in the English example, or is there a different process for languages like Korean?

Thank you for your guidance!

joonhyung-lee (Author) commented:

I found this example script: https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/text2token.py

csukuangfj (Collaborator) commented:

So have you fixed it?

joonhyung-lee (Author) commented:

I've followed the hotwords generation process using text2token.py with the BPE model, but the recognition results aren't as accurate or responsive as I expected.

Is this the correct approach for creating hotwords, especially for Korean language models? What might I be missing in the tokenization or hotwords configuration?

Here's my example command, followed by the resulting hotwords_ko.txt.

python3 text2token.py \
  --text ./hotwords_input.txt \
  --tokens ../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
  --tokens-type bpe \
  --bpe-model ../models/sherpa-onnx-streaming-zipformer-korean-2024-06-16/bpe.vocab \
  --output ./hotwords_ko.txt
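One detail worth flagging: the --bpe-model argument above points at bpe.vocab, while the script's own usage example below passes bpe.model. If bpe.vocab is not an actual sentencepiece model file, that mismatch alone could be a source of unexpected results.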

Full code for text2token.py is as follows:

#!/usr/bin/env python3

"""
This script encodes the texts (given line by line through ``text``) into tokens
and writes the results to the file given by ``output``.

Usage:
If the tokens_type is bpe:
python3 ./text2token.py \
          --text texts.txt \
          --tokens tokens.txt \
          --tokens-type bpe \
          --bpe-model bpe.model \
          --output hotwords.txt

Example:
python3 text2token.py \
  --text ./hotwords_input.txt \
  --tokens ./sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
  --tokens-type bpe \
  --bpe-model ./sherpa-onnx-streaming-zipformer-korean-2024-06-16/bpe.model \
  --output ./sherpa-onnx-streaming-zipformer-korean-2024-06-16/hotwords_ko.txt
"""
import argparse
from sherpa_onnx import text2token

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="""Path to the input texts.

        Each line in the texts contains the original phrase, it might also contain some
        extra items, for example, the boosting score (startting with :), the triggering
        threshold (startting with #, only used in keyword spotting task) and the original
        phrase (startting with @). Note: extra items will be kept in the output.

        example input (tokens_type = bpe):

        HELLO WORLD :1.5 #0.4
        HI GOOGLE :2.0 #0.8
        HEY SIRI #0.35

        example output:

        ▁HE LL O ▁WORLD :1.5 #0.4
        ▁HI ▁GO O G LE :2.0 #0.8
        ▁HE Y ▁S I RI #0.35
        """,
    )

    parser.add_argument(
        "--tokens",
        type=str,
        required=True,
        help="The path to tokens.txt.",
    )

    parser.add_argument(
        "--tokens-type",
        type=str,
        required=True,
        choices=["cjkchar", "bpe", "cjkchar+bpe", "fpinyin", "ppinyin"],
        help="""The type of modeling units, should be cjkchar, bpe, cjkchar+bpe, fpinyin or ppinyin.
        fpinyin means full pinyin, each cjkchar has a pinyin(with tone).
        ppinyin means partial pinyin, it splits pinyin into initial and final,
        """,
    )

    parser.add_argument(
        "--bpe-model",
        type=str,
        help="The path to bpe.model. Only required when tokens-type is bpe or cjkchar+bpe.",
    )

    parser.add_argument(
        "--output",
        type=str,
        required=True,
        help="Path where the encoded tokens will be written to.",
    )

    return parser.parse_args()


def main():
    args = get_args()

    texts = []
    # Extra information: boosting score (starts with ":"), triggering threshold
    # (starts with "#"), and original keyword (starts with "@").
    extra_info = []
    with open(args.text, "r", encoding="utf8") as f:
        for line in f:
            extra = []
            text = []
            toks = line.strip().split()
            for tok in toks:
                if tok[0] == ":" or tok[0] == "#" or tok[0] == "@":
                    extra.append(tok)
                else:
                    text.append(tok)
            texts.append(" ".join(text))
            extra_info.append(extra)
    encoded_texts = text2token(
        texts,
        tokens=args.tokens,
        tokens_type=args.tokens_type,
        bpe_model=args.bpe_model,
    )
    with open(args.output, "w", encoding="utf8") as f:
        for i, txt in enumerate(encoded_texts):
            txt += extra_info[i]
            f.write(" ".join(txt) + "\n")


if __name__ == "__main__":
    main()
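
As a quick, hypothetical illustration of the parsing loop above (using the line format from the docstring example):

line = "HELLO WORLD :1.5 #0.4"
toks = line.strip().split()
text = [t for t in toks if t[0] not in ":#@"]   # ['HELLO', 'WORLD'] -> gets encoded
extra = [t for t in toks if t[0] in ":#@"]      # [':1.5', '#0.4'] -> kept verbatim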

Input File: hotwords_input.txt

아토: 2.5
초기화: 2.0
하이: 1.5
안녕: 1.5

Output File: hotwords_ko.txt

▁아 토 : ▁2 . 5
▁초 기 화 : ▁2 . 0
▁하 이 : ▁1 . 5
▁안 녕 : ▁1 . 5
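
A hedged observation: per the script's help text and parsing loop above, a boosting score is recognized only when it is attached to the ":" marker (e.g. "HELLO WORLD :1.5"). In the input above, the colon is attached to the word and the score stands alone ("아토: 2.5"), so both end up encoded as ordinary text, which is exactly what the output shows. Under the documented format the input would presumably need to look like:

아토 :2.5
초기화 :2.0
하이 :1.5
안녕 :1.5

which, per the parsing loop, should yield output with the scores kept verbatim:

▁아 토 :2.5
▁초 기 화 :2.0
▁하 이 :1.5
▁안 녕 :1.5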
