-
-
Notifications
You must be signed in to change notification settings - Fork 764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-22940 MF2 ICU4C: Update for bidi support #3236
Conversation
911d047
to
703002a
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
703002a
to
01d2fdc
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
inRange(c, 0x00F8, 0x02FF) || inRange(c, 0x0370, 0x037D) || inRange(c, 0x037F, 0x1FFF) || | ||
inRange(c, 0x200C, 0x200D) || inRange(c, 0x2070, 0x218F) || inRange(c, 0x2C00, 0x2FEF) || | ||
inRange(c, 0x00F8, 0x02FF) || inRange(c, 0x0370, 0x037D) || inRange(c, 0x037F, 0x061B) || | ||
inRange(c, 0x061D, 0x200D) || inRange(c, 0x2070, 0x218F) || inRange(c, 0x2C00, 0x2FEF) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@catamorphism i see ALM is not namestart, but this change makes U+2000…U+200B isNameStart true. they are dashes and spaces, and not ID_Start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I think this would be far more reliable in using a UnicodeSet. That can be created as the C++ equivalent of a static final immutable object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srl295 Fixed
@macchiati Done in 780a947
@@ -125,7 +125,13 @@ static bool isContentChar(UChar32 c) { | |||
|| inRange(c, 0xE000, 0x10FFFF); | |||
} | |||
|
|||
// See `s` in the MessageFormat 2 grammar | |||
// See `bidi` in the MF2 grammar | |||
static bool isBidi(UChar32 c) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe isBidiControl might be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 780a947
needs squash but LGTM, seems like all issues addressed |
01d2fdc
to
9054d0c
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
9054d0c
to
5ff8167
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
5ff8167
to
cebadc4
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
cebadc4
to
5eb2b7d
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
I added 5eb2b7d to fix a bug that actually was in a previous PR, #3239 , but wasn't caught when merging that PR because of a bug in the test runner. This PR incidentally fixes that bug, so the test failure showed up here when I rebased this PR against main, and I'm fixing it here. (Will squash after @srl295 gets a chance to look at it, I left it unsquashed so far just so it's clear what I changed after he reviewed the PR.) |
5eb2b7d
to
03cba1f
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
The fuzzer is reporting a timeout bug with this test data: https://github.com/unicode-org/icu/actions/runs/12241776312/artifacts/2295676981 I downloaded the artifact and tried running it. The test string is |
I think I've found the cause of the timeout bug. 41994fa fixes it, if so; there's a loop in the parser that's checking for the presence of a syntax error and exiting if there already is one so that it doesn't loop infinitely, but if the What I don't understand is why the |
Will squash after final review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixes LGTM
e2353bd
to
63b7228
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
parseError.line = 0; | ||
parseError.offset = 0; | ||
parseError.lengthBeforeCurrentLine = 0; | ||
parseError.preContext[0] = '\0'; | ||
parseError.postContext[0] = '\0'; | ||
} | ||
|
||
UnicodeSet initContentChars(UErrorCode& status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the return type of these initX method "UnicodeSet" ? or it should be "UnicodeSet* "?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be UnicodeSet*
, but these declarations aren't actually necessary as the functions are defined before they're used.
|
||
UnicodeSet* initContentChars(UErrorCode& status) { | ||
if (U_FAILURE(status)) { | ||
return {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why you return {} here but return nullptr later? What is the differences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no difference, but I'll change it to nullptr
for consistency.
return {}; | ||
} | ||
|
||
UnicodeSet* result = new UnicodeSet(*unisets::getImpl(unisets::ALPHA)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please check the value of "unisets::getImpl(unisets::ALPHA)" is not nullptr before you deref. I understand in the current code, this code inside initNameStartChars is always called only if status is U_FAILURE is not true after initAlpha() but there are no facility to enforce initNameStartChars is always called after initAlpha and will not be moved around in the future. Therefore, it is a very weak assumption that unisets::getImpl(unisets::ALPHA) is not nullptr here.
Could we change to
UnicodeSet* isAlpha = unisets::getImpl(unisets::ALPHA);
if (isAlpha == nullptr) {
status = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
}
UnicodeSet* result = new UnicodeSet(*isalpha);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do this in a future PR.
status = U_MEMORY_ALLOCATION_ERROR; | ||
return nullptr; | ||
}; | ||
result->addAll(*unisets::getImpl(unisets::NAME_START)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a future PR
status = U_MEMORY_ALLOCATION_ERROR; | ||
return nullptr; | ||
}; | ||
result->addAll(*unisets::getImpl(unisets::CONTENT)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a future PR
status = U_MEMORY_ALLOCATION_ERROR; | ||
return nullptr; | ||
}; | ||
result->addAll(*unisets::getImpl(unisets::CONTENT)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a future PR
const UnicodeSet* get(Key key) { | ||
UErrorCode localStatus = U_ZERO_ERROR; | ||
umtx_initOnce(gMF2ParseUniSetsInitOnce, &initMF2ParseUniSets, localStatus); | ||
if (U_FAILURE(localStatus)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the first time the code call initMF2ParseUniSets the error happen, then localStatus will be error and return nullptr, but the gUnicodeSets will still be partially initialize. For example, gUnicodeSets[unisets::TEXT] may be nullptr if initNameChars failed. then later on you will pass a nullptr into contentChars and the later code contentChars->contains() will deref a nullptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a future PR (by making get()
take a UErrorCode
).
I found the error handling code in this PR very weak. If any of the operation of new operation during initMF2ParseUniSets return nullptr due to out of memory, the code will deref nullptr later on. |
bidiControlChars(unisets::get(unisets::BIDI)), | ||
alphaChars(unisets::get(unisets::ALPHA)), | ||
digitChars(unisets::get(unisets::DIGIT)), | ||
nameStartChars(unisets::get(unisets::NAME_START)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notice, with the current code, some of these unisets::get() may return nullptr while other not and cause nullptr deref later
} | ||
|
||
UnicodeSet* initNameStartChars(UErrorCode& status) { | ||
if (U_FAILURE(status)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have an implicit requirement that initNameStartChars must be called after initAlpha or it may crash here. I think this is dangerous. Could we enforce that explicitly, for example, by calling initAlpha in the beginning of initNameStartChars() ? Also, there is an implicit assumption that the status contains the status passed to initAlpha so if initAlpha fail the status here will contain that failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a future PR
gUnicodeSets[unisets::BIDI] = initBidiControls(status); | ||
gUnicodeSets[unisets::ALPHA] = initAlpha(status); | ||
gUnicodeSets[unisets::DIGIT] = initDigits(status); | ||
gUnicodeSets[unisets::NAME_START] = initNameStartChars(status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's consider the condition that initMF2ParseUniSets is called the first time while the memory is almost full and initContentChars initWhitespace and initBidiControls all success but initAlpha fail, in that case, all except CONTENT, WHITESPACE and BIDI will return nullptr . And the initMF2ParseUniSets will return error in status the first time. so the first get() call will return nullptr, and later on the getImpl may also return nullptr. I think you need to change
const UnicodeSet* get(Key key)
to
const UnicodeSet* get(Key key, UErrorCode& status)
to pass up the error andalso set the status to error code if getImpl() return nullptr to the Parser() constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a future PR
Improve checking for OOM errors when allocating UnicodeSets, per post-merge comments on unicode-org#3236
@FrankYFTang I submitted #3306 to apply the fixes you suggested. Thanks! |
Improve checking for OOM errors when allocating UnicodeSets, per post-merge comments on unicode-org#3236
Improve checking for OOM errors when allocating UnicodeSets, per post-merge comments on #3236
The tests in this PR are also included in a PR against the MF2 spec. However, some editing will have to occur unless #3198 (matching on variables instead of expressions) lands before then. There was a spec change to the syntax of
.match
constructs, and some of the tests include.match
constructs.Checklist