Performance optimizations, new filters for Japanese Sutegana and new Sudachi dictionary version #110

Merged
merged 8 commits into from
Jun 18, 2024

Conversation

azagniotov
Owner

Alexander Zagniotov added 8 commits June 18, 2024 08:12
Cherry-picked from: apache/lucene#12915 by @daixque

Context

Sutegana (捨て仮名) are the small-form hiragana and katakana letters used in Japanese. Unlike modern texts, old Japanese texts do not use sutegana and write the full-size letters instead. For example:

"ストップウォッチ" is written as "ストツプウオツチ"
"ちょっとまって" is written as "ちよつとまつて"

So it is useful to normalize sutegana to their full-size ("uppercase") counterparts when searching corpora that include old Japanese texts, such as patents, legal documents, and contract policies.
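
As a rough illustration of the normalization described above, here is a minimal Python sketch. It is not the code added by this PR or by apache/lucene#12915; the function name `normalize_sutegana` and the character table are assumptions made for this example, and the table covers only the basic small hiragana and katakana letters.

```python
# Illustrative sketch only: maps small kana (sutegana) to full-size kana.
# The mapping table below is an assumption for this example and is limited
# to the basic small hiragana and katakana letters.

SMALL_HIRAGANA = "ぁぃぅぇぉっゃゅょゎゕゖ"
LARGE_HIRAGANA = "あいうえおつやゆよわかけ"
SMALL_KATAKANA = "ァィゥェォッャュョヮヵヶ"
LARGE_KATAKANA = "アイウエオツヤユヨワカケ"

# str.maketrans requires both strings to have the same length.
_SUTEGANA_TABLE = str.maketrans(
    SMALL_HIRAGANA + SMALL_KATAKANA,
    LARGE_HIRAGANA + LARGE_KATAKANA,
)


def normalize_sutegana(text: str) -> str:
    """Replace every small kana in `text` with its full-size counterpart.

    Characters that are not in the table are returned unchanged.
    """
    return text.translate(_SUTEGANA_TABLE)


if __name__ == "__main__":
    # The two examples from the PR description.
    print(normalize_sutegana("ストップウォッチ"))  # ストツプウオツチ
    print(normalize_sutegana("ちょっとまって"))    # ちよつとまつて
```

Applying the same mapping at both index time and query time would let a modern spelling such as "ストップウォッチ" match the old spelling "ストツプウオツチ", since both reduce to the same normalized form.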
@azagniotov azagniotov changed the title Performance optimizations, new filters from Japanese Sutegana and new Sudachi dictionary version Performance optimizations, new filters for Japanese Sutegana and new Sudachi dictionary version Jun 18, 2024
@azagniotov azagniotov merged commit 9d2edbe into master Jun 18, 2024
3 checks passed
@azagniotov azagniotov deleted the sync_updates branch June 18, 2024 16:06