Updates
gregli-msft committed Mar 6, 2025
1 parent 4f54c71 commit d287205
Showing 8 changed files with 55 additions and 20 deletions.
14 changes: 7 additions & 7 deletions docs/regular-expressions.md
@@ -121,7 +121,7 @@ Mixing submatches and quantifiers has limitations. See [Possibly empty submatche
| Numbered sub-match and back reference | When **MatchOptions.NumberedSubMatches** is enabled, `(a)` captures a sub-match referenced with `\1`. |
| Non-capture group | `(?:a)`, creates group without capturing the result as a named or numbered sub-match. All groups are non-capturing unless **MatchOptions.NumberedSubMatches** is enabled. |

Named and numbered sub-matches cannot be used together. By default, named sub-matches are enabled and are preferred for clarity and maintainability, while standard capture groups become non capture groups with improved performance. This can be changed with **MatchOptions.NumberedSubMatches** which provides for traditional capture groups but disables named captures groups. Some implementations treat a mix of numbered and named capture groups differently which is why Power Fx disallows it.
Named and numbered sub-matches cannot be used together. By default, named sub-matches are enabled and are preferred for clarity and maintainability, while standard capture groups become non capture groups with improved performance. This can be changed with **MatchOptions.NumberedSubMatches** which provides for traditional capture groups but disables named capture groups. Some implementations treat a mix of numbered and named capture groups differently which is why Power Fx disallows it.

Self referencing capture groups are not supported, for example the regular expression `(a\1)`.

@@ -154,7 +154,7 @@ Inline options cannot be used to disable an option or set an option for a sub-ex
## Options

Match options change the behavior of regular expression matching. There are two ways to enable options, which can be mixed so long as there is no conflict:
- **MatchOptions** enum value passed as the third argument to **Match**, **MatchAll**, and **IsMatch**. Options can be combined with the `&` operator or `Concatenate` function, for example `MatchOptions.DotAll & MatchOptions.FreeSpacing`. All of the regular expression functions requires that **MatchOptions** is a constant value, it cannot be calculated or stored in a variable.
- **MatchOptions** enum value passed as the third argument to **Match**, **MatchAll**, and **IsMatch**. Options can be combined with the `&` operator or `Concatenate` function, for example `MatchOptions.DotAll & MatchOptions.FreeSpacing`. All of the regular expression functions require that **MatchOptions** is a constant value, it cannot be calculated or stored in a variable.
- `(?...)` prefix at the very beginning of the regular expression. Options can be combined with multiple letters in the `(?...)` construct, for example `(?sx)`. Some options do not have a `(?...)` equivalent but may have other ways to get the same effect, for example **MatchOptions.BeginsWith** is the equivalent of `^` at the beginning of the regular expression.

### Contains
@@ -275,17 +275,17 @@ MatchAll( "Hello" & Char(13) & Char(10) & "World", "^.+$" )

Enabled with **MatchOptions.NumberedSubMatches** with no inline option. `(?n)` is supported as the opposite of this option for compatibility and is the default.

By default, `(...)` does not capture, the equivalent of what most systems call "explicit capture". To capture, use a named capture with `(?<name>...)` with backreference `\k<name>`. This improves performance of the regular expression by not capturing gruops that do not need to be captures and improving clarity by using names instead of numbers that can change.
By default, `(...)` does not capture, the equivalent of what most systems call "explicit capture". To capture, use a named capture with `(?<name>...)` with backreference `\k<name>`. This improves performance of the regular expression by not capturing groups that do not need to be captures and improving clarity by using names instead of numbers that can change.

If you have an existing regular expression, it may depend on groups being captured automatically and numbered, including numbered back references. This is available by using the **MatchOptions.NumberedSubMaches** option.
If you have an existing regular expression, it may depend on groups being captured automatically and numbered, including numbered back references. This is available by using the **MatchOptions.NumberedSubMatches** option.

Named and numbered sub-matches cannot be used together. Some implementations treat a mix of numbered and named capture groups differently which is why Power Fx disallows it.

## Possibly empty submatches

As stated in the introduction, Power Fx's regular expressions are intentionally limited to features that can be consistently implemented on top of .NET, JavaScript, and other programming language regular expression engines. Authoring time errors prevent use of features that are not a part of this set.
As stated in the introduction, Power Fx's regular expressions are intentionally limited to features that can be consistently implemented on top of .NET, JavaScript, and other programming language regular expression engines. Authoring time errors prevent the use of features that are not a part of this set.

One area that can be signficaintly different between implementations is how empty submatches are handled. For example, consdier the regular expression `(?<submatch>a*)+` asked to match the text `a`. On .NET, the submatch will result in an empty text string, while on JavaScript it will result in `a`. Both can be argued as correct implementations, as the `+` quantifier can be satisfied with an empty string since the contents of the group has a `*` quantifier.
One area that can be significantly different between implementations is how empty submatches are handled. For example, consider the regular expression `(?<submatch>a*)+` asked to match the text `a`. On .NET, the submatch will result in an empty text string, while on JavaScript it will result in `a`. Both can be argued as correct implementations, as the `+` quantifier can be satisfied with an empty string since the contents of the group has a `*` quantifier.
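
The JavaScript half of this discrepancy can be observed directly. The following is a minimal sketch, not part of the Power Fx sources: the variable names are illustrative, and it exercises only the native JavaScript regular expression engine, not Power Fx itself.

```javascript
// Illustration only: run the pattern from the text above on a JavaScript engine.
const pattern = /(?<submatch>a*)+/;
const result = pattern.exec("a");

// result[0] is the whole match; result.groups holds named captures.
const fullMatch = result[0];
const submatch = result.groups.submatch;

// JavaScript rejects a repeated iteration that matches the empty string,
// so the capture from the first iteration ("a") is retained.
console.log(JSON.stringify(fullMatch), JSON.stringify(submatch)); // "a" "a"
```

According to the text above, the same pattern on .NET leaves the submatch empty, which is the inconsistency the restriction is designed to avoid.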

To avoid different results across Power Fx implementations, submatches that could be empty cannot be used with a quantifier. Here are examples of how a submatch could be empty:

@@ -296,6 +296,6 @@ To avoid different results across Power Fx implementations, submatches that coul
| `(?<submatch>a|b*)+` | Alternation within the submatch with something that could be empty could result in the entire submatch being empty. |
| `((?<submatch>a)|b)+` | Alternation outside the submatch could match `b` in which case the submatch would be empty.|

Note that the submatch in `(?<submatch>a+)+` cannot be empty, as there must be at leaset one `a` in he submatch, and is supported.
Note that the submatch in `(?<submatch>a+)+` cannot be empty, as there must be at least one `a` in he submatch, and is supported.

Backreferences to possibly empty submatches are also not supported.
@@ -783,6 +783,7 @@ internal static class TexlStrings
public static ErrorResourceKey ErrInvalidRegExNamedCaptureNameTooLong = new ErrorResourceKey("ErrInvalidRegExNamedCaptureNameTooLong");
public static ErrorResourceKey ErrInvalidRegExLowHighQuantifierFlip = new ErrorResourceKey("ErrInvalidRegExLowHighQuantifierFlip");
public static ErrorResourceKey ErrInvalidRegExLookbehindTooManyChars = new ErrorResourceKey("ErrInvalidRegExLookbehindTooManyChars");
public static ErrorResourceKey ErrInvalidRegExNumberOverflow = new ErrorResourceKey("ErrInvalidRegExNumberOverflow");

public static ErrorResourceKey ErrVariableRegEx = new ErrorResourceKey("ErrVariableRegEx");
public static ErrorResourceKey ErrVariableRegExOptions = new ErrorResourceKey("ErrVariableRegExOptions");
29 changes: 23 additions & 6 deletions src/libraries/Microsoft.PowerFx.Core/Texl/Builtins/Match.cs
@@ -276,7 +276,7 @@ private void AddWarnings(TexlNode regExNode, IErrorContainer errors, bool hidesF
// These tests can be run through all three engines and the results compared with by setting ExpressionEvaluationTests.RegExCompareEnabled, a PCRE2 DLL and NodeJS must be installed on the system.
//
// In short, we use the insersection of canonical .NET regular expressions and ECMAScript 2024's "v" flag for escaping rules.
// Someday when "v" is more widely avaialble, we can support more of its features such as set subtraction.
// Someday when "v" is more widely available, we can support more of its features such as set subtraction.
// We chose to use canonical .NET instead of RegexOptions.ECMAScript because we wanted the unicode definitions for words. See https://learn.microsoft.com/dotnet/standard/base-types/regular-expression-options#ecmascript-matching-behavior
//
// In addition, Power Fx regular expressions are opinionated and try to eliminate some of the ambiguity in the common regular expression language:
@@ -335,7 +335,7 @@ private bool IsSupportedRegularExpression(TexlNode regExNode, string regexPatter
[\#\ ] | # added for free spacing, always accepted for conssitency even in character classes, escape needs to be removed on Unicode aware ECMAScript
x[0-9a-fA-F]{2} | # hex character, must be exactly 2 hex digits
u[0-9a-fA-F]{4})) | # Unicode characters, must be exactly 4 hex digits
\\(?<goodUEscape>[pP])\{(?<UCategory>[\w=:-]+)\} | # Unicode chaeracter classes, extra characters here for a better error message
\\(?<goodUEscape>[pP])\{(?<UCategory>[\w=:-]+)\} | # Unicode character classes, extra characters here for a better error message
(?<goodEscapeOutsideCC>\\[bB]) | # acceptable outside a character class, includes negative classes until we have character class subtraction, include \P for future MatchOptions.LocaleAware
(?<goodEscapeOutsideAndInsideCCIfPositive>\\[DWS]) |
(?<goodEscapeInsideCCOnly>\\[&\-!#%,;:<=>@`~\^]) | # https://262.ecma-international.org/#prod-ClassSetReservedPunctuator, others covered with goodEscape above
@@ -471,7 +471,11 @@ void RegExError(ErrorResourceKey errKey, Match errToken = null, bool startContex
}
else if (token.Groups["goodExact"].Success)
{
var exact = Convert.ToInt32(token.Groups["goodExact"].Value, CultureInfo.InvariantCulture);
if (!int.TryParse(token.Groups["goodExact"].Value, out var exact))
{
RegExError(TexlStrings.ErrInvalidRegExNumberOverflow);
return false;
}

if (!groupTracker.SeenQuantifier(exact, exact, out var error))
{
@@ -481,8 +485,17 @@ void RegExError(ErrorResourceKey errKey, Match errToken = null, bool startContex
}
else if (token.Groups["goodLimitedL"].Success)
{
var low = Convert.ToInt32(token.Groups["goodLimitedL"].Value, CultureInfo.InvariantCulture);
var high = Convert.ToInt32(token.Groups["goodLimitedH"].Value, CultureInfo.InvariantCulture);
if (!int.TryParse(token.Groups["goodLimitedL"].Value, out var low))
{
RegExError(TexlStrings.ErrInvalidRegExNumberOverflow);
return false;
}

if (!int.TryParse(token.Groups["goodLimitedH"].Value, out var high))
{
RegExError(TexlStrings.ErrInvalidRegExNumberOverflow);
return false;
}

if (!groupTracker.SeenQuantifier(low, high, out var error))
{
@@ -500,7 +513,11 @@ void RegExError(ErrorResourceKey errKey, Match errToken = null, bool startContex
}
else if (token.Groups["goodUnlimited"].Success)
{
var low = Convert.ToInt32(token.Groups["goodUnlimited"].Value, CultureInfo.InvariantCulture);
if (!int.TryParse(token.Groups["goodUnlimited"].Value, out var low))
{
RegExError(TexlStrings.ErrInvalidRegExNumberOverflow);
return false;
}

if (!groupTracker.SeenQuantifier(low, -1, out var error))
{
@@ -235,7 +235,7 @@ public Task<FormulaValue> InvokeAsync(FormulaValue[] args, CancellationToken can
bool matchEnd = options.Contains("$");
bool numberedSubMatches = options.Contains("N");

// Can't add options ^ and $ too early as there may be freespacing comments, centalize the logic here and call subfunctions
// Can't add options ^ and $ too early as there may be freespacing comments, centralize the logic here and call subfunctions
string AlterStart()
{
// ^ doesn't require any translation if not in multilline, only matches the start of the string
@@ -7,10 +7,6 @@
using System.Linq.Expressions;
using System.Text;
using System.Text.RegularExpressions;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CodeActions;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.FlowAnalysis;
using Microsoft.PowerFx.Core.Localization;
using Microsoft.PowerFx.Core.Utils;
using Microsoft.PowerFx.Repl.Functions;
6 changes: 5 additions & 1 deletion src/strings/PowerFxResources.en-US.resx
@@ -4682,7 +4682,11 @@
<data name="ErrorResource_ErrInvalidRegExNamedCaptureNameTooLong_ShortMessage" xml:space="preserve">
<value>Invalid regular expression: Name of submatch is too long, found "{0}".</value>
<comment>Error Message.</comment>
</data>
</data>
<data name="ErrorResource_ErrInvalidRegExNumberOverflow_ShortMessage" xml:space="preserve">
<value>Invalid regular expression: Number is too large, found "{0}".</value>
<comment>Error Message.</comment>
</data>
<data name="ErrorResource_ErrVariableRegEx_ShortMessage" xml:space="preserve">
<value>Regular expression must be a constant value.</value>
<comment>Error Message.</comment>
@@ -1561,4 +1561,21 @@ Errors: Error 33-72: Invalid regular expression: Lookbehind exceeded maximum pos
>> Match( "aaaaaaaaaaaaaaaaaaaaab", "((aa){2,})(?<!\1{2,12}aaaaaaaaaa)b", MatchOptions.NumberedSubMatches )
Errors: Error 33-69: Invalid regular expression: Unlimited quantifiers are not supported in look behinds, found at the end of "\1".|Error 0-5: The function 'Match' has some invalid arguments.

// number of repetitions overflow
// very unlikely for standard use to hit signed 32-bit max, but someone may test the boundaries...
// back refs are never converted to a number, instead compared as strings against the capture group's index converted to a string, so not an issue

>> Match( "a", "a{1111111111111111111111111111111111,}" )
Errors: Error 12-52: Invalid regular expression: Number is too large, found "{1111111111111111111111111111111111,}".|Error 0-5: The function 'Match' has some invalid arguments.

>> Match( "a", "a{1111111111111111111111111111111111,111111111111111111111111111111111111111111111}" )
Errors: Error 12-97: Invalid regular expression: Number is too large, found "{1111111111111111111111111111111111,111111111111111111111111111111111111111111111}".|Error 0-5: The function 'Match' has some invalid arguments.

>> Match( "a", "a{1,111111111111111111111111111111111111111111111}" )
Errors: Error 12-64: Invalid regular expression: Number is too large, found "{1,111111111111111111111111111111111111111111111}".|Error 0-5: The function 'Match' has some invalid arguments.

>> Match( "a", "a{11111111111111111111111111111111111111,1}" )
Errors: Error 12-57: Invalid regular expression: Number is too large, found "{11111111111111111111111111111111111111,1}".|Error 0-5: The function 'Match' has some invalid arguments.

>> Match( "a", "a{11111111111111111111111111111111111111}" )
Errors: Error 12-55: Invalid regular expression: Number is too large, found "{11111111111111111111111111111111111111}".|Error 0-5: The function 'Match' has some invalid arguments.
@@ -50,7 +50,7 @@ public void TestRegExEnableTwice2()
config2.EnableRegExFunctions(TimeSpan.FromMilliseconds(50), 20);
}

// castrophic backtracking
// catastrophic backtracking
// 1. short, will succeed with little backtracking
// 2. short, will fail due to the extra comma at the end of the input, more backtracking but short enough that it completes in a reasonable amount of time
// 3. long, will still succeed with little backtracking