Skip to content

Latest commit

 

History

History
64 lines (41 loc) · 2.96 KB

README.md

File metadata and controls

64 lines (41 loc) · 2.96 KB

Lucene Analyzers for Tibetan

This repository contains Lucene tools (analysers, tokenizers and filters) for the Tibetan Language. They are based on these Lucene analyzers.

Installation through maven:

    <dependency>
      <groupId>io.bdrc.lucene</groupId>
      <artifactId>lucene-bo</artifactId>
      <version>1.2.0</version>
    </dependency>

If the jar is needed for use in a non-maven based install, it may be found at

    https://repo1.maven.org/maven2/io/bdrc/lucene/lucene-bo/1.2.0/lucene-bo-1.2.0.jar

Components

TibetanAnalyzer

The main Analyzer. It tokenizes the input text using TibSyllableTokenizer, then applies TibAffixedFilter and StopFilter with a predefined list of stop words.

There are two constructors. The nullary constructor and

    TibetanAnalyzer(boolean segmentInWords, boolean lemmatize, boolean filterChars, String inputMode, String lexiconFileName)

    segmentInWords - if the segmentation is on words instead of syllables
    lemmatize - if the analyzer should remove affixed particles, and normalize words in words mode
    filterChars - if the text should be converted to NFD (necessary for texts containing NFC strings)
    inputMode - "unicode" (default), "ewts", "dts" (Diacritics Transliteration Schema) or "alalc" ([ALA-LC](https://www.loc.gov/catdir/cpso/romanization/tibetan.pdf))
    stopFilename - file name of the stop word list (defaults to empty string for the shipped one, set to null for no stop words)

The nullary constructor is equivalent to TibetanAnalyzer(true, true, true, "unicode", "")

TibWordTokenizer

This tokenizer produces words through a Maximal Matching algorithm. It builds on top of this Trie implementation.

Due to its design, this tokenizer doesn't deal with contextual ambiguities.
For example, if both དོན and དོན་གྲུབ exist in the Trie, དོན་གྲུབ will be returned every time the sequence དོན + གྲུབ is found.
The sentence སེམས་ཅན་གྱི་དོན་གྲུབ་པར་ཤོག will be tokenized into "སེམས་ཅན + གྱི + དོན་གྲུབ + པར + ཤོག" (སེམས་ཅན + གྱི + དོན + གྲུབ་པར + ཤོག expected)

TibSyllableTokenizer

This tokenizer produces syllable tokens (with no tshek) from the input Tibetan text.

TibAffixedFilter

This filter removes non-ambiguous affixed particles (འི, འོ, འིའོ, འམ, འང and འིས), leaving the འ if necessary (ex: དགའི -> དགའ, གའི -> ག).

Maven Build Options

To sign the .jars before deploying, pass -DperformRelease=true ; to include ewts-converter in the built jar, pass -DincludeDeps=true.

License

The code is Copyright 2017 Buddhist Digital Resource Center, and is provided under Apache License 2.0.