Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(Views): views for a single dataframe #1594

Merged
merged 3 commits into from
Feb 7, 2025

Conversation

scaliseraoul
Copy link
Contributor

@scaliseraoul scaliseraoul commented Feb 6, 2025


Important

Adds support for views with columns in dataframes, updating schema, query building, and SQL sanitization logic, with corresponding tests.

  • Behavior:
    • create() in __init__.py now supports views with columns, updating schema creation logic.
    • check_columns_relations() in semantic_layer_schema.py now allows views with either relations or columns.
    • ViewDatasetLoader in view_loader.py updated to handle dependencies when columns are present.
  • Sanitization:
    • Renames sanitize_relation_name() to sanitize_view_column_name() in sql_sanitizer.py.
    • Updates all references to sanitize_view_column_name() in view_query_builder.py and tests.
  • Query Building:
    • ViewQueryBuilder in view_query_builder.py now normalizes view column names and aliases.
    • Updates _get_table_expression() to handle views with columns.
  • Tests:
    • Updates test_sql_sanitizer.py to test sanitize_view_column_name().
    • Updates test_view_query_builder.py to test new view handling logic.

This description was created by Ellipsis for 7480ad9. It will automatically update as commits are pushed.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 7480ad9 in 1 minute and 52 seconds

More details
  • Looked at 316 lines of code in 7 files
  • Skipped 0 files when reviewing.
  • Skipped posting 14 drafted comments based on config settings.
1. pandasai/__init__.py:117
  • Draft comment:
    Good use of parsed_columns. Ensure that overriding df-inferred columns with provided ones is intentional.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
2. pandasai/data_loader/semantic_layer_schema.py:313
  • Draft comment:
    Validation update: requiring at least one relation or column for views is clearer. Confirm that this new rule meets all use cases.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
3. pandasai/data_loader/view_loader.py:47
  • Draft comment:
    Fallback in _get_dependencies_datasets uses schema.columns[0]. Ensure schema.columns is non-empty when relations are absent.
  • Reason this comment was not posted:
    Marked as duplicate.
4. pandasai/helpers/sql_sanitizer.py:7
  • Draft comment:
    Renaming to sanitize_view_column_name clarifies its purpose. No issues observed.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
5. pandasai/query_builders/view_query_builder.py:22
  • Draft comment:
    Custom normalization methods for view column names and aliases help mitigate injection risks. Review injection tests to confirm consistency.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
6. pandasai/__init__.py:117
  • Draft comment:
    Parsed-columns initialization is clear. Consider verifying if an empty list should be handled explicitly.
  • Reason this comment was not posted:
    Marked as duplicate.
7. pandasai/__init__.py:130
  • Draft comment:
    Passing 'parsed_columns' to the SemanticLayerSchema for view/source is a good update; ensure this behavior is documented for users.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    The comment is purely informative and suggests ensuring documentation, which violates the rule against making purely informative comments and asking for documentation updates.
8. pandasai/data_loader/semantic_layer_schema.py:313
  • Draft comment:
    Updated validation now requires at least one relation or column for a view. Confirm that allowing a view with only columns (no relations) fits your design intent.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 50% <= threshold 50%
    This comment is asking the PR author to confirm their design intent, which violates the rule against asking for confirmation of intention. However, it also points out a specific change in validation logic that could have significant implications, which is useful feedback. The comment could be rephrased to focus on the implications of the change rather than asking for confirmation.
9. pandasai/data_loader/semantic_layer_schema.py:332
  • Draft comment:
    The uncovered_tables check now triggers only for multi-table scenarios. Verify that this logic correctly permits single-table views without raising errors.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    This comment is asking the PR author to verify the logic, which is against the rules. It doesn't provide a specific suggestion or point out a clear issue. It should be removed.
10. pandasai/data_loader/view_loader.py:47
  • Draft comment:
    Accessing self.schema.columns[0] assumes columns are present when relations are absent. Consider adding a safeguard or clear documentation of preconditions.
  • Reason this comment was not posted:
    Marked as duplicate.
11. pandasai/helpers/sql_sanitizer.py:7
  • Draft comment:
    Renaming to 'sanitize_view_column_name' improves clarity for view contexts; ensure all call sites are updated accordingly.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
    The comment is about checking call sites after a rename, which would be automatically caught by the build/compiler. Any IDE or build system would flag unresolved references. This falls under the rule "Do NOT comment on anything that would be obviously caught by the build, such as variable renames."
    The function rename might have semantic implications that aren't caught by the compiler, like ensuring the new name accurately reflects its usage.
    While the semantic change is relevant, asking to "ensure all call sites are updated" is specifically about the mechanical rename process, which is handled by tooling.
    The comment should be deleted as it primarily concerns a mechanical rename that would be caught by the build system.
12. pandasai/query_builders/view_query_builder.py:22
  • Draft comment:
    Consider adding error-handling in 'normalize_view_column_name' to catch parsing issues if the sanitized string isn't a valid SQL expression.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    The comment points out a legitimate concern - parse_one() could throw exceptions if the input isn't valid SQL. However, sanitize_view_column_name() should already ensure the input is safe. Looking at the usage, this is only called on column names from the schema, which should be validated elsewhere. Adding error handling here would be defensive programming but may not be necessary.
    I may be assuming too much about the validation happening elsewhere. The sanitize function may not guarantee valid SQL syntax.
    While input validation could fail, error handling should be handled at the schema validation level, not in this utility function. This function assumes valid inputs.
    The comment should be deleted as it suggests defensive programming that would be better handled elsewhere in the validation chain.
13. pandasai/query_builders/view_query_builder.py:27
  • Draft comment:
    The alias normalization (replacing dots with underscores) is effective; verify that the resulting aliases remain consistent across different SQL engines.
  • Reason this comment was not posted:
    Marked as duplicate.
14. tests/unit_tests/query_builders/test_view_query_builder.py:77
  • Draft comment:
    Injection tests are comprehensive. Ensure that the expected outputs (including alias quoting) remain stable if underlying sanitization logic evolves.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_WA7nfrDzslIdmc4w


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Collaborator

@gventuri gventuri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks strong to me, only left a minor comment

@gventuri gventuri requested a review from ArslanSaleem February 6, 2025 14:46
Copy link
Collaborator

@ArslanSaleem ArslanSaleem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation fails for mysql database it gives this error:
pandas.errors.DatabaseError: Execution failed on sql 'SELECT COUNT(*) AS "total_rows" FROM "hearts" AS hearts': (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near \'"hearts" AS hearts\' at line 3')

@gventuri gventuri merged commit 2455592 into sinaptik-ai:main Feb 7, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants