Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<regex>: Fix character range bounds in case-insensitive regexes #5164

Merged

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Dec 5, 2024

This PR deals with three related problems for character ranges in case-insensitive mode:

  • _Builder::_Add_range() casts the character bounds to unsigned int. As a consequence, characters with negative numeric values are not added to the bitmap, but rather to the _Large list of characters. This means that these characters are not found during matching. (Note the suspiciously different casts in the else branch for the case-sensitive case.)
  • The parser fails to reject some empty ranges in case-insensitive mode such as [Z-a] (= [z-a]).
  • When both the collate and icase flags are set, there is an unnecessary call to translate the bounds by _Traits.translate() first before passing them to _Traits.translate_nocase(). The standard says in [re.grammar]/14.1 and 14.2 that it is sufficient to call translate_nocase() only. (See also _Builder::_Add_char(), which already follows the Standard in this regard.)

The PR moves the entire character translation into the parser so that empty ranges can be reliably diagnosed there in case-insensitive mode as well. It also fixes the unsigned cast and removes the unnecessary translate() call.

The test deliberately does not use any manual signed/unsigned casts, but leaves all of these casts to char_traits to avoid getting the casts similarly wrong in <regex> and the test.

@muellerj2 muellerj2 requested a review from a team as a code owner December 5, 2024 12:40
Comment on lines -4121 to +4120
if (static_cast<typename _RxTraits::_Uelem>(_Val) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
if (static_cast<typename _RxTraits::_Uelem>(_Chr2) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there is still a (pre-existing) bug in this error check when the collate flag is set: The bounds should be transformed by _Traits.transform() before performing the comparison.

However, the matcher fails to do this transform() dance as well, so I think this should rather be fixed for both matcher and parser in a coordinated fashion in a follow-up PR. (But I will adjust the PR if you rather prefer to fix this check completely now.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow-up PR will be great.

@CaseyCarter CaseyCarter added the bug Something isn't working label Dec 5, 2024
@StephanTLavavej StephanTLavavej self-assigned this Dec 5, 2024
@StephanTLavavej StephanTLavavej added the regex Everyone's favorite header label Jan 8, 2025
Comment on lines -4121 to +4120
if (static_cast<typename _RxTraits::_Uelem>(_Val) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
if (static_cast<typename _RxTraits::_Uelem>(_Chr2) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow-up PR will be great.

@StephanTLavavej
Copy link
Member

Thanks - I really appreciate the detailed explanations in your PRs! 🐱

@StephanTLavavej StephanTLavavej removed their assignment Jan 12, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 13, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit af0bd00 into microsoft:main Jan 14, 2025
39 checks passed
@StephanTLavavej
Copy link
Member

Thanks! 😻+

@muellerj2 muellerj2 deleted the fix-case-insensitive-char-ranges branch January 14, 2025 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex Everyone's favorite header
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants