Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(Views): adding view for local sources #1586

Merged
merged 12 commits into from
Feb 6, 2025

Conversation

scaliseraoul
Copy link
Contributor

@scaliseraoul scaliseraoul commented Feb 4, 2025


Important

Introduces local data source handling with DuckDBConnectionManager, new query builders, and updates to DatasetLoader and Agent classes, along with comprehensive tests.

  • New Features:
    • Introduced DuckDBConnectionManager in duck_db_connection_manager.py for managing DuckDB connections.
    • Added LocalQueryBuilder, SqlQueryBuilder, and ViewQueryBuilder in query_builders for handling different query types.
    • Updated DatasetLoader in loader.py to support local, SQL, and view-based datasets.
  • Agent Updates:
    • Modified Agent class in base.py to use DuckDBConnectionManager for executing local SQL queries.
  • Schema and Sanitization:
    • Updated SemanticLayerSchema in semantic_layer_schema.py to handle local sources without requiring a table.
    • Added sanitize_relation_name in sql_sanitizer.py for sanitizing relation names.
  • Testing:
    • Added unit tests for DuckDBConnectionManager, DatasetLoader, and query builders in test_duckdbmanager.py, test_loader.py, and test_query_builder.py.
    • Updated tests for SQL sanitization in test_sql_sanitizer.py.
  • Miscellaneous:
    • Renamed query_builder.py to base_query_builder.py and moved to query_builders directory.

This description was created by Ellipsis for d4b9671. It will automatically update as commits are pushed.

@scaliseraoul scaliseraoul marked this pull request as ready for review February 4, 2025 15:51
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to d4b9671 in 2 minutes and 56 seconds

More details
  • Looked at 953 lines of code in 27 files
  • Skipped 0 files when reviewing.
  • Skipped posting 11 drafted comments based on config settings.
1. tests/unit_tests/data_loader/test_sql_loader.py:124
  • Draft comment:
    Ensure that expected transformations (like email anonymization and timestamp timezone conversion) are stable. Consider comparing key parts of the output rather than the full DataFrame to avoid brittle tests.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. tests/unit_tests/prompts/test_sql_prompt.py:63
  • Draft comment:
    The expected prompt string is very long and hardcoded. Consider using a dedicated fixture or helper to generate the expected output for better maintainability and to reduce brittleness if formatting changes.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. tests/unit_tests/query_builders/test_view_query_builder.py:24
  • Draft comment:
    The expected multi-line query string in 'test_build_query' is sensitive to whitespace/line breaks. Consider normalizing whitespace in both the generated and expected queries before comparing.
  • Reason this comment was not posted:
    Comment was on unchanged code.
4. tests/unit_tests/data_loader/test_loader.py:62
  • Draft comment:
    Using 'builtins.open' patch for schema file loading is good, but make sure file encoding and newline consistency are handled in both production and test code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    None
5. tests/unit_tests/data_loader/test_duckdbmanager.py:11
  • Draft comment:
    Typo in test method name: 'doesnt trow' should be 'doesn't throw'.
  • Reason this comment was not posted:
    Marked as duplicate.
6. tests/unit_tests/data_loader/test_loader.py:58
  • Draft comment:
    Ensure schema names are sanitized properly. The test 'test_load_schema_mysql_sanitized_name' expects 'non_sanitized-name' to become 'non_sanitized_name'. Verify that the sanitizer consistently replaces invalid characters with underscores.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
7. tests/unit_tests/query_builders/test_view_query_builder.py:24
  • Draft comment:
    The expected multiline query string in 'test_build_query' is tightly coupled with whitespace and newlines. Consider using textwrap.dedent to standardize formatting and reduce brittleness in tests.
  • Reason this comment was not posted:
    Marked as duplicate.
8. tests/unit_tests/prompts/test_sql_prompt.py:65
  • Draft comment:
    The long expected prompt string literal could be simplified using a dedented multiline string (using textwrap.dedent), to improve readability and reduce potential whitespace issues.
  • Reason this comment was not posted:
    Comment was on unchanged code.
9. tests/unit_tests/helpers/test_sql_sanitizer.py:63
  • Draft comment:
    The test 'test_safe_select_with_comment' correctly blocks queries with comments. Consider adding a note if safe inline comments should ever be permitted or if this strict rule is intentional for security.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    None
10. pandasai/query_builders/sql_query_builder.py:5
  • Draft comment:
    Ensure that using lower() on the table name in _get_from_statement is consistent with desired behavior; if source table names may have uppercase letters, this conversion might cause issues in some SQL engines.
  • Reason this comment was not posted:
    Comment did not seem useful.
11. pandasai/query_builders/view_query_builder.py:60
  • Draft comment:
    The alias for the view schema uses a hyphen (e.g. 'parent-children'). SQL aliases with hyphens may require quoting. Consider using underscores or quoting aliases to avoid SQL syntax issues.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_GrbqqsySHoc9gFIK


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

tests/unit_tests/data_loader/test_duckdbmanager.py Outdated Show resolved Hide resolved
@gventuri gventuri requested a review from ArslanSaleem February 4, 2025 17:02
Copy link
Collaborator

@gventuri gventuri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good overall, only left a couple of comments!

pandasai/dataframe/base.py Show resolved Hide resolved
pandasai/query_builders/base_query_builder.py Outdated Show resolved Hide resolved
pandasai/query_builders/base_query_builder.py Outdated Show resolved Hide resolved
pandasai/query_builders/base_query_builder.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ArslanSaleem ArslanSaleem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks very good. Found two issues apart for sql injection.

  1. In pull method it uses dataset_loader.load(self.path) where load method doesn't take any argument.
  2. In case of sql connector it is failing in code_cleaning._clean_sql_query where allowed table is passed as schema.name which doesn't exists in the database.

@scaliseraoul
Copy link
Contributor Author

Hi @ArslanSaleem , good catch for n1, could you iterate on n2,
"In case of SQL connector, it is failing in code_cleaning._clean_sql_query where allowed table is passed as schema.name which doesn't exist in the database."

cause schema.name it's the dataset name, and it's always present

* commit '36b26ed4566f3a21e70ea00b5c39543d05dfbec9':
  chore: remove numpy dependency
…ed one, using, refactor query builders to use sqlglot
Copy link
Collaborator

@gventuri gventuri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvements, only a couple of leftovers

pandasai/agent/base.py Show resolved Hide resolved
pandasai/query_builders/base_query_builder.py Show resolved Hide resolved
@gventuri gventuri merged commit 8a0123c into sinaptik-ai:main Feb 6, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants