This document outlines the shaping information needed to display characters from the Unicode Vedic Extensions block, which may be used within text runs in many Indic scripts.
The Vedic Extensions block encodes letters and marks that are used in a large body of ancient literature written in the Vedic Sanskrit language.
Primarily an oral language in the time period when the key literature originated, Vedic Sanskrit has no native script. Therefore, texts may be typeset in any one of the Indic scripts, using the Vedic Extensions to supplement the main script's character set.
Individual Vedic Extension characters may be named by a combination of the Vedic text in which the mark is used, the regional or manuscript tradition involved, or a simple visual or phonetic description of the character. Some commonly used general categories are worth noting.
Udatta is the term for a high tone on a vowel.
Anudatta is the term for a low tone on a vowel.
Svarita is the term for a falling or mixed tone on a vowel.
Anusvara is the term for a nasalization sound that precedes a consonant.
Visarga is the term for a soft breathing sound that follows a vowel.
Note: In modern Indic languages, the terms anusvara and visarga often refer to diacritical marks that have the above effects on pronunciation. In the Vedic Sanskrit language, however, they are generally considered independent letters.
For most codepoints, the General Category property defined in the Unicode standard is correct, but it is not sufficient to fully capture the expected shaping behavior (such as how the character is treated during glyph reordering). Therefore, they must additionally be classified by how they are treated when shaping a run of text.
Vedic Extensions codepoints should be classified as shown in the following table. Codepoints with no assigned meaning are marked as unassigned in the Unicode category column.
Assigned codepoints marked with a null in the Shaping class column evoke no special behavior from the shaping engine.
The Mark-placement subclass column indicates mark-placement positioning. Assigned codepoints marked with a null in this column evoke no special mark-placement behavior. Marks tagged with [Mn] in the Unicode category column are categorized as non-spacing; marks tagged with [Mc] are categorized as spacing-combining.
Some codepoints in the following table use a Shaping class that differs from the codepoint's Unicode General Category. The Shaping class takes precedence during OpenType shaping, as it captures more specific behavior.
Codepoint | Unicode category | Shaping class | Mark-placement subclass | Glyph |
---|---|---|---|---|
U+1CD0 | Mark [Mn] | CANTILLATION | TOP_POSITION | ᳐ Tone Karshana |
U+1CD1 | Mark [Mn] | CANTILLATION | TOP_POSITION | ᳑ Tone Shara |
U+1CD2 | Mark [Mn] | CANTILLATION | TOP_POSITION | ᳒ Tone Prenkha |
U+1CD3 | Punctuation | null | null | ᳓ Sign Nihshvasa |
U+1CD4 | Mark [Mn] | CANTILLATION | OVERSTRUCK | ᳔ Tone Midline Svarita |
U+1CD5 | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳕ Tone Aggravated Independent Svarita |
U+1CD6 | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳖ Tone Independent Svarita |
U+1CD7 | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳗ Tone Kathaka Independent Svarita |
U+1CD8 | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳘ Tone Candra Below |
U+1CD9 | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳙ Tone Kathaka Independent Svarita Schroeder |
U+1CDA | Mark [Mn] | CANTILLATION | TOP_POSITION | ᳚ Tone Double Svarita |
U+1CDB | Mark [Mn] | CANTILLATION | TOP_POSITION | ᳛ Tone Triple Svarita |
U+1CDC | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳜ Tone Kathaka Anudatta |
U+1CDD | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳝ Tone Dot Below |
U+1CDE | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳞ Tone Two Dots Below |
U+1CDF | Mark [Mn] | CANTILLATION | BOTTOM_POSITION | ᳟ Tone Three Dots Below |
U+1CE0 | Mark [Mn] | CANTILLATION | TOP_POSITION | ᳠ Tone Rigvedic Kashmiri Independent Svarita |
U+1CE1 | Mark [Mc] | CANTILLATION | RIGHT_POSITION | ᳡ Tone Atharvavedic Independent Svarita |
U+1CE2 | Mark [Mn] | AVAGRAHA | OVERSTRUCK | ᳢ Sign Visarga Svarita |
U+1CE3 | Mark [Mn] | null | OVERSTRUCK | ᳣ Sign Visarga Udatta |
U+1CE4 | Mark [Mn] | null | OVERSTRUCK | ᳤ Sign Reversed Visarga Udatta |
U+1CE5 | Mark [Mn] | null | OVERSTRUCK | ᳥ Sign Visarga Anudatta |
U+1CE6 | Mark [Mn] | null | OVERSTRUCK | ᳦ Sign Reversed Visarga Anudatta |
U+1CE7 | Mark [Mn] | null | OVERSTRUCK | ᳧ Sign Visarga Udatta With Tail |
U+1CE8 | Mark [Mn] | AVAGRAHA | OVERSTRUCK | ᳨ Sign Visarga Anudatta With Tail |
U+1CE9 | Letter | AVAGRAHA | null | ᳩ Sign Anusvara Antargomukha |
U+1CEA | Letter | null | null | ᳪ Sign Anusvara Bahirgomukha |
U+1CEB | Letter | null | null | ᳫ Sign Anusvara Vamagomukha |
U+1CEC | Letter | AVAGRAHA | null | ᳬ Sign Anusvara Vamagomukha With Tail |
U+1CED | Mark [Mn] | AVAGRAHA | BOTTOM_POSITION | ᳭ Sign Tiryak |
U+1CEE | Letter | AVAGRAHA | null | ᳮ Sign Hexiform Long Anusvara |
U+1CEF | Letter | null | null | ᳯ Sign Long Anusvara |
U+1CF0 | Letter | null | null | ᳰ Sign Rthang Long Anusvara |
U+1CF1 | Letter | AVAGRAHA | null | ᳱ Sign Anusvara Ubhayato Mukha |
U+1CF2 | Letter | CONSONANT_DEAD | null | ᳲ Sign Ardhavisarga |
U+1CF3 | Letter | CONSONANT_DEAD | null | ᳳ Sign Rotated Ardhavisarga |
U+1CF4 | Mark [Mn] | CANTILLATION | TOP_POSITION | ᳴ Tone Candra Above |
U+1CF5 | Letter | CONSONANT_WITH_STACKER | null | ᳵ Sign Jihvamuliya |
U+1CF6 | Letter | CONSONANT_WITH_STACKER | null | ᳶ Sign Upadhmaniya |
U+1CF7 | Mark [Mc] | null | null | ᳷ Sign Atikrama |
U+1CF8 | Mark [Mn] | CANTILLATION | null | ᳸ Tone Ring Above |
U+1CF9 | Mark [Mn] | CANTILLATION | null | ᳹ Tone Double Ring Above |
U+1CFA | Letter | PLACEHOLDER | null | ᳺ Sign Double Anusvara Antargomukha |
U+1CFB | unassigned | | | |
U+1CFC | unassigned | | | |
U+1CFD | unassigned | | | |
U+1CFE | unassigned | | | |
U+1CFF | unassigned | | | |
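One way a shaping engine might hold this classification is as a lookup keyed by codepoint. The following is a minimal sketch, not a reference implementation: the names `VedicClass`, `VEDIC_TABLE`, `classify`, and `shaping_class` are hypothetical, only a handful of rows are transcribed from the table, and `None` stands in for the table's null entries.

```python
from typing import NamedTuple, Optional


class VedicClass(NamedTuple):
    """One table row (hypothetical structure, for illustration only)."""
    category: str                 # "Unicode category" column
    shaping_class: Optional[str]  # None stands for a null shaping class
    placement: Optional[str]      # None stands for a null mark-placement subclass


# A few rows copied from the table; a real engine would carry every codepoint.
VEDIC_TABLE = {
    0x1CD0: VedicClass("Mark [Mn]", "CANTILLATION", "TOP_POSITION"),
    0x1CD3: VedicClass("Punctuation", None, None),
    0x1CD4: VedicClass("Mark [Mn]", "CANTILLATION", "OVERSTRUCK"),
    0x1CE1: VedicClass("Mark [Mc]", "CANTILLATION", "RIGHT_POSITION"),
    0x1CE2: VedicClass("Mark [Mn]", "AVAGRAHA", "OVERSTRUCK"),
    0x1CE3: VedicClass("Mark [Mn]", None, "OVERSTRUCK"),
    0x1CF2: VedicClass("Letter", "CONSONANT_DEAD", None),
    0x1CF5: VedicClass("Letter", "CONSONANT_WITH_STACKER", None),
    0x1CFA: VedicClass("Letter", "PLACEHOLDER", None),
}


def classify(ch: str) -> Optional[VedicClass]:
    """Return the row for a single character, or None if it is not listed here."""
    return VEDIC_TABLE.get(ord(ch))


def shaping_class(ch: str) -> Optional[str]:
    """Return the shaping class, or None for null or unlisted codepoints."""
    row = classify(ch)
    return row.shaping_class if row else None
```

Calling `shaping_class("\u1CD0")` returns `"CANTILLATION"`, while `shaping_class("\u1CD3")` returns `None`, since the table assigns that codepoint no shaping class.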
29 of the characters in the block are categorized as marks. 27 of these marks are subcategorized as non-spacing; the remaining two are spacing-combining.
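The General Category figures can be cross-checked against the Unicode data bundled with a Python interpreter. This is only a sketch: the helper name `category_counts` is hypothetical, and the results depend on which Unicode version the interpreter ships, because some codepoints in the block were added or recategorized in later versions of the standard.

```python
import unicodedata
from collections import Counter

BLOCK_START, BLOCK_END = 0x1CD0, 0x1CFF  # Vedic Extensions block


def category_counts() -> Counter:
    """Count General Category values for every codepoint in the block."""
    return Counter(
        unicodedata.category(chr(cp))
        for cp in range(BLOCK_START, BLOCK_END + 1)
    )


if __name__ == "__main__":
    counts = category_counts()
    # The counts below reflect this Unicode data version, not necessarily the latest.
    print("Unicode data version:", unicodedata.unidata_version)
    for cat in ("Mn", "Mc", "Lo", "Po", "Cn"):
        print(cat, counts[cat])
```

Here `Mn` and `Mc` correspond to the [Mn] and [Mc] tags in the table, and `Cn` covers the unassigned codepoints.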
Of the non-spacing marks, 19 are classified as CANTILLATION (or tone-marker) indicators, which modify the pitch of vowels. Most of these marks are positioned above or below the main character, using GPOS mark attachment, in a position that does not interact or interfere with the main character. In Unicode, the CANTILLATION classification is separate from the TONE_MARKER classification used in some scripts for semantic reasons; the two classifications are identical for shaping purposes.
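The arithmetic behind anchor-based mark attachment can be illustrated in a few lines. This sketch assumes the usual GPOS model in which the mark is shifted so that its anchor point coincides with the matching anchor on the base glyph. The `Point` type, the `attach_mark` function, and all coordinate values are hypothetical and not taken from any font.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: int
    y: int


def attach_mark(base_origin: Point, base_anchor: Point, mark_anchor: Point) -> Point:
    """Return the mark origin that makes the mark anchor land on the base anchor.

    base_anchor and mark_anchor are relative to their own glyph origins.
    """
    return Point(
        base_origin.x + base_anchor.x - mark_anchor.x,
        base_origin.y + base_anchor.y - mark_anchor.y,
    )


# Hypothetical font-unit values for a base glyph and a mark placed above it.
base_origin = Point(1000, 0)
top_anchor_on_base = Point(300, 700)
anchor_on_mark = Point(0, 650)

mark_origin = attach_mark(base_origin, top_anchor_on_base, anchor_on_mark)
# mark_origin == Point(x=1300, y=50)
```

A mark placed below the base works the same way, with the font supplying a separate anchor on the base for that position.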
Some of the marks (cantillation and non-cantillation) are classified as OVERSTRUCK in the Mark-placement subclass column. This indicates that the mark is intended to be rendered on top of the preceding character. During reordering, OVERSTRUCK marks are tagged for the ordering position POS_AFTER_MAIN.
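The tagging rule for OVERSTRUCK marks could be sketched as follows. The `PLACEMENT` dictionary holds only a few entries copied from the table, and the names `ordering_position` and `tag_run` are hypothetical. The sketch returns `None` for everything else, since how other marks are tagged is outside what this rule covers.

```python
from typing import List, Optional, Tuple

# Mark-placement subclasses for a few codepoints, copied from the table.
PLACEMENT = {
    0x1CD0: "TOP_POSITION",
    0x1CD4: "OVERSTRUCK",
    0x1CD5: "BOTTOM_POSITION",
    0x1CE2: "OVERSTRUCK",
    0x1CE3: "OVERSTRUCK",
}


def ordering_position(cp: int) -> Optional[str]:
    """Return POS_AFTER_MAIN for OVERSTRUCK marks; None means no tag assigned here."""
    if PLACEMENT.get(cp) == "OVERSTRUCK":
        return "POS_AFTER_MAIN"
    return None


def tag_run(text: str) -> List[Tuple[str, Optional[str]]]:
    """Pair each codepoint in a run with the tag this sketch assigns to it."""
    return [(f"U+{ord(ch):04X}", ordering_position(ord(ch))) for ch in text]
```

For example, `tag_run("\u0915\u1CE2")` pairs the Devanagari letter Ka with `None` and the Visarga Svarita sign with `"POS_AFTER_MAIN"`.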
Some marks are classified, for shaping purposes, as AVAGRAHA or VISARGA. This indicates that the mark behaves more like the Avagraha or Visarga character than like a diacritic.
Characters that are categorized in Unicode as letters vary with respect to whether or not they trigger special behavior in the shaping process. These include letters that are classified as CONSONANT and letters that are classified as AVAGRAHA.