- lxxmorph/ -- directory of patch files to correct CATSS
- lxx_lexicon.yaml -- WIP stem dataset for LXX
- lxxmorph_generate.py -- script that goes through all the verb forms in lxxmorph/ and validates that the code + dataset generates the correct form
- generate_lxxmorph_lexicon.py -- similar to lxxmorph_generate.py but, instead of just validating and showing unexplainable forms, builds the starting point for a lexicon file to explain the forms
- lxxmorph_utils.py -- common code used by the above scripts
The particular books to be tested are configured in the MLXX_FILES variable in lxxmorph_generate.py.
- set LXX_FILENAME in generate_lxxmorph_lexicon.py to the path of the lxxmorph file you want to work on
- add an entry in book_to_num in lxxmorph_utils.py to map the book code used by the file to its book number
- run:
./generate_lxxmorph_lexicon.py > tmp1
cat lxx_lexicon.yaml tmp1 > tmp2
./sort_lexicon.py tmp2 > lxx_lexicon.yaml
- remove tmp1 and tmp2
- review all lines in lxx_lexicon.yaml that have # @ (you can review about 10 a minute once you get good at it)
@m means multiple possible stems, for example:

    ἀγνοέω:
        stems:
            1-: {'ἀγνοου{athematic}', 'ἀγνοε', 'ἀγνοο'}  # @m
which should be manually corrected to:

    ἀγνοέω:
        stems:
            1-: ἀγνοε
@1 means a single possible stem. Verify it and, if the lemma already exists, make sure the new stem is moved in with the others. For example:

    ἀποκόπτω:
        stems:
            3+: ἀπεκοψ
        stems:
            2-: ἀποκοψ  # @1
needs to be changed to:

    ἀποκόπτω:
        stems:
            2-: ἀποκοψ
            3+: ἀπεκοψ
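The manual fix for an @1 entry amounts to folding the new stem into the lemma's existing stems mapping. A minimal sketch of that merge in Python, assuming the stems are held as plain dicts keyed by stem class; merge_stems is a hypothetical helper for illustration and is not part of the repository's scripts:

```python
def merge_stems(existing, new):
    """Combine two stem mappings for one lemma without overwriting silently."""
    merged = dict(existing)
    for key, stem in new.items():
        if key in merged and merged[key] != stem:
            # Two different stems for the same key need a human decision.
            raise ValueError(
                f"conflicting stems for {key!r}: {merged[key]!r} vs {stem!r}"
            )
        merged[key] = stem
    # Sort by key so the result is ordered like the corrected entry above.
    return dict(sorted(merged.items()))


existing = {"3+": "ἀπεκοψ"}  # stem already in the lexicon
new = {"2-": "ἀποκοψ"}       # stem guessed by the generator (marked @1)
merged = merge_stems(existing, new)
print(merged)
```

Raising on a conflict rather than picking one stem keeps the decision with the reviewer, which is the point of the manual review step.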
@0 means no stem could be guessed. This normally means a rule is missing from stemming.yaml.
At the end of all this, you can run ./generate_lxxmorph_lexicon.py again; if you haven't made any mistakes, it should produce no output.
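To see how much review work is left, it can help to list the lines that still carry a # @ marker. The following is a rough sketch, not one of the repository's scripts: find_review_lines and summarize are hypothetical names, and the sample lines are made up from the examples above so the block runs without touching any file.

```python
import re
from collections import Counter

# Matches the review markers described above: "# @m", "# @1", "# @0".
MARKER = re.compile(r"#\s*@(\S+)")


def find_review_lines(lines):
    """Yield (line_number, marker, text) for each line carrying a '# @' marker."""
    for number, line in enumerate(lines, start=1):
        match = MARKER.search(line)
        if match:
            yield number, match.group(1), line.rstrip("\n")


def summarize(lines):
    """Count how many lines carry each marker."""
    return Counter(marker for _, marker, _ in find_review_lines(lines))


# In-memory sample; in practice the lines would come from lxx_lexicon.yaml.
sample = [
    "ἀγνοέω:\n",
    "    stems:\n",
    "        1-: {'ἀγνοου{athematic}', 'ἀγνοε', 'ἀγνοο'}  # @m\n",
    "ἀποκόπτω:\n",
    "    stems:\n",
    "        2-: ἀποκοψ  # @1\n",
]

for number, marker, text in find_review_lines(sample):
    print(f"{number}: @{marker}: {text.strip()}")
print(dict(summarize(sample)))
```

To run it against the real lexicon, pass the lines of the file instead of sample, for example by opening lxx_lexicon.yaml with encoding="utf-8" and iterating over the file object. An empty result means no markers remain.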
- update MLXX_FILES in lxxmorph_generate.py and run it to test your new lexicon.
You will almost certainly get failures. These could just be mistakes you made in the review step, could be missing stemming.yaml rules, or (actually most likely at this stage in the project) could be mistakes in the .mlxx file that need to be corrected.