DuplicateQualifiedField With Paritioned Data #1018

cfis · 2025-02-11T00:41:40Z

This might be more of an arrow issue, but I am running into this error:

Exception: DataFusion error: SchemaError(DuplicateQualifiedField { qualifier: Bare { table: "data" }, name: "year" }, Some(""))

This is happening when querying parquet files stored on S3 using hive partitioning. The partition fields are year/month/day. Those same fields are also contained in the parquet files themselves. Thus the error. Is there a way to avoid this? I can specify the schema manually but I'd like to avoid that.

Example Code:

import os
import time
import datafusion
from datafusion.object_store import AmazonS3

s3 = AmazonS3(
    bucket_name="<removed>",
    region="<removed>",
    endpoint="<removed>",
    access_key_id="<removed>",
    secret_access_key="<removed>")

ctx = datafusion.SessionContext()
ctx.register_object_store("s3://<removed>/", s3)

ctx.sql("""
CREATE EXTERNAL TABLE data
STORED AS PARQUET
PARTITIONED BY (year, month, day)
LOCATION 's3://<removed>/'
""")

sql = f"""SELECT count(*)
         FROM data
         WHERE year = 2025 AND
               month = 1 AND
               day = 1)
               """

kosiew · 2025-02-21T03:09:53Z

Managed to reproduce this without S3

import os
import tempfile
import pyarrow as pa
import pyarrow.parquet as pq
import datafusion

def create_parquet_file(file_path):
    # Create a Parquet file containing duplicate partition columns using pyarrow arrays
    table = pa.Table.from_arrays(
        [
            pa.array([1, 2, 3]),
            pa.array([2025, 2025, 2025]),  # duplicate partition field
            pa.array([10, 20, 30])
        ],
        names=['id', 'year', 'value']
    )
    pq.write_table(table, file_path)

def test_duplicate_field_error():
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create a hive-partitioned directory structure: data/year=2025/month=1/day=1
        storage_dir = os.path.join(tmpdir, "data", "year=2025", "month=1", "day=1")
        os.makedirs(storage_dir, exist_ok=True)
        file_path = os.path.join(storage_dir, "data.parquet")
        create_parquet_file(file_path)

        ctx = datafusion.SessionContext()
        # Register the external table with hive partitioning on local storage
        create_table = f"""
        CREATE EXTERNAL TABLE data
        STORED AS PARQUET
        PARTITIONED BY (year, month, day)
        LOCATION '{os.path.join(tmpdir, "data")}'
        """
        ctx.sql(create_table).collect()

        # Query the table.
        query = """
        SELECT count(*) as cnt
        FROM data
        WHERE year = 2025 AND month = 1 AND day = 1
        """
        
        # Run query without expecting exception.
        result = ctx.sql(query).collect()
        # Assert expected count. Adjust extraction of count as needed.
        assert result[0].column("cnt")[0].as_py() == 3

pytest output:

self = <datafusion.context.SessionContext object at 0x108096780>
query = '\n        SELECT count(*) as cnt\n        FROM data\n        WHERE year = 2025 AND month = 1 AND day = 1\n        '
options = None

    def sql(self, query: str, options: SQLOptions | None = None) -> DataFrame:
        """Create a :py:class:`~datafusion.DataFrame` from SQL query text.

        Note: This API implements DDL statements such as ``CREATE TABLE`` and
        ``CREATE VIEW`` and DML statements such as ``INSERT INTO`` with in-memory
        default implementation.See
        :py:func:`~datafusion.context.SessionContext.sql_with_options`.

        Args:
            query: SQL query text.
            options: If provided, the query will be validated against these options.

        Returns:
            DataFrame representation of the SQL query.
        """
        if options is None:
>           return DataFrame(self.ctx.sql(query))
E           Exception: DataFusion error: SchemaError(DuplicateQualifiedField { qualifier: Bare { table: "data" }, name: "year" }, Some(""))

cfis added the bug Something isn't working label Feb 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DuplicateQualifiedField With Paritioned Data #1018

DuplicateQualifiedField With Paritioned Data #1018

cfis commented Feb 11, 2025 •

edited

Loading

kosiew commented Feb 21, 2025 •

edited

Loading

DuplicateQualifiedField With Paritioned Data #1018

DuplicateQualifiedField With Paritioned Data #1018

Comments

cfis commented Feb 11, 2025 • edited Loading

kosiew commented Feb 21, 2025 • edited Loading

cfis commented Feb 11, 2025 •

edited

Loading

kosiew commented Feb 21, 2025 •

edited

Loading