Clean up syntax error reporting #3278

Open
wants to merge 5 commits into main from feature/clean-up-syntax-errors

Conversation

@Swiddis (Collaborator) commented Jan 29, 2025

Description

While working on fixing PPL AST bugs (#3273), it stood out to me that our error reporting for syntax errors really isn't that good. This PR cleans up the handling of our errors.

Example query that's currently returning syntax errors:

POST _plugins/_ppl
{
  "query": "SOURCE = test_19a673e2 | WHERE x OR y"
}

Before:

{
  "error": {
    "reason": "Invalid Query",
    "details": "Failed to parse query due to offending symbol [OR] at: 'SOURCE = test_19a673e2 | WHERE x OR' <--- HERE... More details: Expecting tokens in {'SEARCH', 'DESCRIBE', 'SHOW', 'FROM', 'WHERE', 'FIELDS', 'RENAME', 'STATS', 'DEDUP', 'SORT', 'EVAL', 'HEAD', 'TOP', 'RARE', 'PARSE', 'METHOD', 'REGEX', 'PUNCT', 'GROK', 'PATTERN', 'PATTERNS', 'NEW_FIELD', 'KMEANS', 'AD', 'ML', 'FILLNULL', 'TRENDLINE', 'SOURCE', 'INDEX', 'D', 'DESC', 'DATASOURCES', 'SORTBY', 'AUTO', 'STR', 'NUM', 'KEEPEMPTY', 'CONSECUTIVE', 'DEDUP_SPLITVALUES', 'PARTITIONS', 'ALLNUM', 'DELIM', 'CENTROIDS', 'ITERATIONS', 'DISTANCE_TYPE', 'NUMBER_OF_TREES', 'SHINGLE_SIZE', 'SAMPLE_SIZE', 'OUTPUT_AFTER', 'TIME_DECAY', 'ANOMALY_RATE', 'CATEGORY_FIELD', 'TIME_FIELD', 'TIME_ZONE', 'TRAINING_DATA_SIZE', 'ANOMALY_SCORE_THRESHOLD', 'TRUE', 'FALSE', 'CONVERT_TZ', 'DATETIME', 'DAY', 'DAY_HOUR', 'DAY_MICROSECOND', 'DAY_MINUTE', 'DAY_OF_YEAR', 'DAY_SECOND', 'HOUR', 'HOUR_MICROSECOND', 'HOUR_MINUTE', 'HOUR_OF_DAY', 'HOUR_SECOND', 'INTERVAL', 'MICROSECOND', 'MILLISECOND', 'MINUTE', 'MINUTE_MICROSECOND', 'MINUTE_OF_DAY', 'MINUTE_OF_HOUR', 'MINUTE_SECOND', 'MONTH', 'MONTH_OF_YEAR', 'QUARTER', 'SECOND', 'SECOND_MICROSECOND', 'SECOND_OF_MINUTE', 'WEEK', 'WEEK_OF_YEAR', 'YEAR', 'YEAR_MONTH', 'IP', '.', '+', '-', '(', '`', 'AVG', 'COUNT', 'DISTINCT_COUNT', 'ESTDC', 'ESTDC_ERROR', 'MAX', 'MEAN', 'MEDIAN', 'MIN', 'MODE', 'RANGE', 'STDEV', 'STDEVP', 'SUM', 'SUMSQ', 'VAR_SAMP', 'VAR_POP', 'STDDEV_SAMP', 'STDDEV_POP', 'PERCENTILE', 'TAKE', 'FIRST', 'LAST', 'LIST', 'VALUES', 'EARLIEST', 'EARLIEST_TIME', 'LATEST', 'LATEST_TIME', 'PER_DAY', 'PER_HOUR', 'PER_MINUTE', 'PER_SECOND', 'RATE', 'SPARKLINE', 'C', 'DC', 'ABS', 'CBRT', 'CEIL', 'CEILING', 'CONV', 'CRC32', 'E', 'EXP', 'FLOOR', 'LN', 'LOG', 'LOG10', 'LOG2', 'MOD', 'PI', 'POSITION', 'POW', 'POWER', 'RAND', 'ROUND', 'SIGN', 'SQRT', 'TRUNCATE', 'ACOS', 'ASIN', 'ATAN', 'ATAN2', 'COS', 'COT', 'DEGREES', 'RADIANS', 'SIN', 'TAN', 'ADDDATE', 'ADDTIME', 'CURDATE', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURTIME', 'DATE', 'DATEDIFF', 'DATE_ADD', 'DATE_FORMAT', 'DATE_SUB', 'DAYNAME', 'DAYOFMONTH', 'DAYOFWEEK', 'DAYOFYEAR', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'EXTRACT', 'FROM_DAYS', 'FROM_UNIXTIME', 'GET_FORMAT', 'LAST_DAY', 'LOCALTIME', 'LOCALTIMESTAMP', 'MAKEDATE', 'MAKETIME', 'MONTHNAME', 'NOW', 'PERIOD_ADD', 'PERIOD_DIFF', 'SEC_TO_TIME', 'STR_TO_DATE', 'SUBDATE', 'SUBTIME', 'SYSDATE', 'TIME', 'TIMEDIFF', 'TIMESTAMP', 'TIMESTAMPADD', 'TIMESTAMPDIFF', 'TIME_FORMAT', 'TIME_TO_SEC', 'TO_DAYS', 'TO_SECONDS', 'UNIX_TIMESTAMP', 'UTC_DATE', 'UTC_TIME', 'UTC_TIMESTAMP', 'WEEKDAY', 'YEARWEEK', 'SUBSTR', 'SUBSTRING', 'LTRIM', 'RTRIM', 'TRIM', 'LOWER', 'UPPER', 'CONCAT', 'CONCAT_WS', 'LENGTH', 'STRCMP', 'RIGHT', 'LEFT', 'ASCII', 'LOCATE', 'REPLACE', 'REVERSE', 'CAST', 'LIKE', 'ISNULL', 'ISNOTNULL', 'CIDRMATCH', 'IFNULL', 'NULLIF', 'IF', 'TYPEOF', 'ALLOW_LEADING_WILDCARD', 'ANALYZE_WILDCARD', 'ANALYZER', 'AUTO_GENERATE_SYNONYMS_PHRASE_QUERY', 'BOOST', 'CUTOFF_FREQUENCY', 'DEFAULT_FIELD', 'DEFAULT_OPERATOR', 'ENABLE_POSITION_INCREMENTS', 'ESCAPE', 'FLAGS', 'FUZZY_MAX_EXPANSIONS', 'FUZZY_PREFIX_LENGTH', 'FUZZY_TRANSPOSITIONS', 'FUZZY_REWRITE', 'FUZZINESS', 'LENIENT', 'LOW_FREQ_OPERATOR', 'MAX_DETERMINIZED_STATES', 'MAX_EXPANSIONS', 'MINIMUM_SHOULD_MATCH', 'OPERATOR', 'PHRASE_SLOP', 'PREFIX_LENGTH', 'QUOTE_ANALYZER', 'QUOTE_FIELD_SUFFIX', 'REWRITE', 'SLOP', 'TIE_BREAKER', 'TYPE', 'ZERO_TERMS_QUERY', 'SPAN', 'MS', 'S', 'M', 'H', 'W', 'Q', 'Y', ID, INTEGER_LITERAL, DECIMAL_LITERAL, DQUOTA_STRING, SQUOTA_STRING, 
BQUOTA_STRING}",
    "type": "SyntaxCheckException"
  },
  "status": 400
}

After:

{
  "error": {
    "reason": "Invalid Query",
    "details": "[OR] is not a valid term at this part of the query: '..._19a673e2 | WHERE x OR' <-- HERE. Expecting one of 284 possible tokens. Some examples: 'SEARCH', 'DESCRIBE', 'SHOW', 'FROM', 'WHERE', ...",
    "type": "SyntaxCheckException"
  },
  "status": 400
}

Even though this particular query should be valid, I think it's much clearer from this what the parser is mad about, and easier to put in a bug report too.
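
To make the intended format concrete, here is a minimal, self-contained sketch of how a message like the one above can be assembled. This is an illustration only, not the PR's actual implementation; the class, method, and constant names, as well as the width and limit values, are made up.

import java.util.List;

public class SyntaxErrorMessageSketch {
  // Illustrative limits only; the real values live in the PR's parser code.
  private static final int CONTEXT_WIDTH = 22;
  private static final int SUGGESTION_LIMIT = 5;

  static String format(String query, String offendingToken, int stopIndex,
                       List<String> expectedTokens, int totalExpected) {
    // Keep only a short tail of the query ending at the offending token,
    // prefixed with "..." when characters were actually dropped.
    int start = Math.max(0, stopIndex + 1 - CONTEXT_WIDTH);
    String context = (start == 0 ? "" : "...") + query.substring(start, stopIndex + 1);

    // Show only the first few expected tokens instead of the full follow set.
    String examples = String.join(", ",
        expectedTokens.subList(0, Math.min(expectedTokens.size(), SUGGESTION_LIMIT)));

    return String.format(
        "[%s] is not a valid term at this part of the query: '%s' <-- HERE."
            + " Expecting one of %d possible tokens. Some examples: %s, ...",
        offendingToken, context, totalExpected, examples);
  }

  public static void main(String[] args) {
    // Prints the same details string as the "After" example above.
    String query = "SOURCE = test_19a673e2 | WHERE x OR y";
    System.out.println(format(query, "OR", 34, // "OR" ends at 0-based index 34
        List.of("'SEARCH'", "'DESCRIBE'", "'SHOW'", "'FROM'", "'WHERE'"), 284));
  }
}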

Related Issues

N/A

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Simeon Widdis <[email protected]>
@Swiddis force-pushed the feature/clean-up-syntax-errors branch from 31649f1 to a47be17 on January 29, 2025 22:47
  {
-   "query" : "SELECT * FROM sample:data"
+   "query" : "SOURCE = test_index | where a > 0)"
@Swiddis (Collaborator, Author) commented:

This SQL error no longer outputs the same error message (new parsing engine?). I couldn't hit the ANTLR exception with a new SQL query, so I updated it to a PPL one.

@dai-chen (Collaborator) left a comment:

So the improvement of this PR is to truncate the long error message? If so, is it possible to simplify the changes, especially lines 60-80?

if (contextStartIndex < 3) { // The ellipses won't save us anything below the first 4 characters
  return query.substring(0, offendingToken.getStopIndex() + 1);
}
return "..." + query.substring(contextStartIndex, offendingToken.getStopIndex() + 1);
A Contributor commented:

[Optional] Perhaps we should have a unit or IT test for this "..." code path, to make sure the lengthy log is indeed truncated as expected.
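
For illustration, such a test might look roughly like the sketch below. The getQueryContext method is a hypothetical stand-in that mirrors the truncation rule quoted above, not the PR's actual helper or class name.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class QueryContextTruncationTest {
  // Hypothetical stand-in mirroring the truncation rule quoted above.
  private static String getQueryContext(String query, int contextStartIndex, int stopIndex) {
    if (contextStartIndex < 3) { // "..." only helps once enough characters are dropped
      return query.substring(0, stopIndex + 1);
    }
    return "..." + query.substring(contextStartIndex, stopIndex + 1);
  }

  @Test
  void shortPrefixesAreNotEllipsized() {
    assertEquals("fields", getQueryContext("fields firstname", 0, 5));
  }

  @Test
  void longPrefixesAreEllipsized() {
    assertEquals(
        "..._19a673e2 | WHERE x OR",
        getQueryContext("SOURCE = test_19a673e2 | WHERE x OR y", 13, 34));
  }
}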

@Swiddis (Collaborator, Author) replied:

Maybe, though I'm not sure what the value is -- the method has no dependencies and seems unlikely to change later on, and if it were changed, the associated tests would almost certainly need to change with it, so they would only get in the way. If there's a part of the method that needs clarification to make it easier for others to parse/modify, let me know.

The Contributor replied:

Right, I share your point; let's leave it as it is then.

Signed-off-by: Simeon Widdis <[email protected]>
@Swiddis (Collaborator, Author) commented Feb 4, 2025:

So the improvement of this PR is to truncate the long error message? If so, is it possible to simplify the changes, especially lines 60-80?

Should be better now; I refactored it to only do the mapping for the first few tokens, so we can just use String.join everywhere.

Comment on lines +65 to +73
IntervalSet followSet = e.getExpectedTokens();
Vocabulary vocab = recognizer.getVocabulary();
List<String> tokenNames = new ArrayList<>(SUGGESTION_TRUNCATION_THRESHOLD);
for (int tokenType :
    followSet
        .toList()
        .subList(0, Math.min(followSet.size(), SUGGESTION_TRUNCATION_THRESHOLD))) {
  tokenNames.add(vocab.getDisplayName(tokenType));
}
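
The truncated tokenNames list can then be joined into the suggestion text, matching the String.join comment above. A minimal sketch of that step follows; the message wording here is illustrative, not necessarily the PR's literal format string.

// Continuing the snippet above (illustrative only):
String suggestions = String.join(", ", tokenNames);
String expectation =
    "Expecting one of " + followSet.size() + " possible tokens. Some examples: " + suggestions + ", ...";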
A Collaborator commented:

Can we extract this for readability?

import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.*;
A Member commented:

Can you replace the * import with explicit classes?

@@ -82,25 +82,25 @@ public void queryShouldBeCaseInsensitiveInKeywords() {
   @Test
   public void queryNotStartingWithSearchCommandShouldFailSyntaxCheck() {
     String query = "fields firstname";
-    queryShouldThrowSyntaxException(query, "Failed to parse query due to offending symbol");
+    queryShouldThrowSyntaxException(query, "is not a valid term at this part of the query");
A Member commented:

Extract this string into a const.

Labels: enhancement (New feature or request)
Projects: None yet
5 participants