-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pyarrow.lib.ArrowInvalid schema difference errors occur with duckdb==0.10.0
#163
Comments
Hi @shntnu thanks for opening this issue (sorry to hear it's happening) and for the thorough notes. Aside, I'm so excited to hear about Parquet being used for Cell Painting datasets in the future! I took a look at what you provided and tried to reproduce the findings. Here's what I noticed: DuckDB Duplicate Column HandlingWhen performing a SQL-based JOIN operation with DuckDB it can sometimes add additional column names per table where the column names are the same between tables. This is generalized SQL behavior which seeks to preserve the data output (leaning on a developer to specify / filter using other means). In DuckDB, this behavior is outlined here, which mentions the addition of integers per every duplicate column name. We can see an alignment in what you shared through the column names:
And also column values (notice the repeated values for
My first instinct was to mention the above and a workaround within DuckDB SQL which can be implemented: the EnvironmentI tried adding the code you provided to see what would happen within a Google Colab environment. Here's a link to the Google Colab notebook and a gist with similar content (as a backup). I didn't see the same errors which made me think that DuckDB, Arrow, or something in between might be operating differently depending on your environment. When you have the chance, could you share an environment lockfile, the output from |
@d33bs Thank you so much for looking into this. I'll drop in my env right away and then follow up later in case I have more to add.
|
Thank you @shntnu ! I found that Google Colab was using I'm digging into the specifics of the issue and giving |
duckdb==0.10.0
As a quick follow up, I found that from cytotable import convert
import logging
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.Metadata_TableNumber,
image.Metadata_ImageNumber,
image.Image_Metadata_Well,
image.Image_Metadata_Plate,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
)
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
) |
Thank you for looking this up. I can confirm that the code worked fine when pegged it to v0.9.2 using
|
I can also confirm that This gives us everything we need to proceed. Thank you @d33bs! Some notes/questions
Note: Updated version is in #163 (comment) Config{
"jump-cytominer-database-to-parquet": {
# version specifications using related references
"CONFIG_SOURCE_VERSION": {
"cell-health-dataset": "v5", # TODO: Update this
"cellprofiler": "v2.X", # TODO: Update this
"cytominer-database": "5aa00f58e4a31bbbd2a3779c87e7a3620b0030db", # TODO: Update this
},
# names of source table compartments (for ex. cells.csv, etc.)
"CONFIG_NAMES_COMPARTMENTS": ("cells", "nuclei", "cytoplasm"),
# names of source table metadata (for ex. image.csv, etc.)
"CONFIG_NAMES_METADATA": ("image",),
# column names in any compartment or metadata tables which contain
# unique names to avoid renaming
"CONFIG_IDENTIFYING_COLUMNS": (
"TableNumber",
"ImageNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber",
),
# chunk size to use for join operations to help with possible performance issues
# note: this number is an estimate and is may need changes contingent on data
# and system used by this library.
"CONFIG_CHUNK_SIZE": 5000,
# compartment and metadata joins performed using DuckDB SQL
# and modified at runtime as needed
"CONFIG_JOINS": """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Cells_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Nuclei_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
""",
},
} # Author: Zitong (Sam) Chen, Broad Institute, 2023
#
# Download sample SQLite file:
# wget https://raw.githubusercontent.com/d33bs/pycytominer/43cf984067700aa52f0b6752e3490d9e12d60170/tests/test_data/cytominer_database_example_data/test_SQ00014613.sqlite -O test_SQ00014613.sqlite
from cytotable import convert
import logging
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
) |
Thanks @shntnu ! I've opened duckdb/duckdb#11157 as a result of our findings here. Addressing your comments:
Thank you for the feedback here! Would you have any recommendations for how to improve or what you found most helpful as you found what you needed? Generally we hope to improve documentation via, for example, #25, but we could get more specific here.
My feeling is that we need feedback on whether the new behavior in
|
Hi @shntnu - thanks to review from @gwaybio in #166 and #167 we have now released a fix for the I'd like to leave this issue open for the time being to help acknowledge that we will eventually need to update the constraint once a fix is available from a new |
Thank you @d33bs – I can't believe how quickly this got addressed, both by you as well as the duckdb team! Thanks as well for all your notes in #163 (comment). I'll read those carefully and get back to you. |
Hi @shntnu - as a heads up, in addition to what I mentioned earlier, I recommend using As a result, I might recommend using the ThreadPoolExecutor instead which may be configured as follows: import cytotable
import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
cytotable.convert(
...
parsl_config=parsl.load(
Config(
executors=[
ThreadPoolExecutor(
# set maximum number of threads at any time, for example 3.
# if not set, the default is 2.
max_threads=3,
)
]
)
),
) |
@d33bs -- just a heads up that I've haven't been able to return to this and I might need to push it out a couple of weeks. Would that block you? Thank you again for the quick and thorough response! |
Thank you @shntnu for the updates here, I don't feel blocked here. Once #174 is merged I feel we should focus the remainder of questions / considerations here in new issues to help uncover any additional insights which may benefit your work with CytoTable (the original challenges with DuckDB and Arrow will have been resolved I feel). The following are issues / additional focus areas I can think of related to our discussion here (please feel free to add / suggest / etc):
|
I found this hard to pin down but in general, I'd say it's the difference between experiential learning vs. cognitive learning.
Could you clarify what are these renaming operations? Maybe the notes below would help understand my confusion The exercise made me realize that our lab should carefully think through the join because a lot could go wrong (this is not related to CytoTable per se) That is, we should think through this snippet below and revise based on how our data are structured.
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei I ran the code below not knowing what I should include # Author: Zitong (Sam) Chen, Broad Institute, 2023
#
# Download sample SQLite file:
# wget https://raw.githubusercontent.com/d33bs/pycytominer/43cf984067700aa52f0b6752e3490d9e12d60170/tests/test_data/cytominer_database_example_data/test_SQ00014613.sqlite -O test_SQ00014613.sqlite
from cytotable import convert
import logging
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
) and then I checked the output to see which of the columns were present or renamed import polars as pl
# Load the parquet file
df = pl.read_parquet("test_SQ00014613.parquet")
# Columns to check for in the parquet file
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber",
)
identifying_cols_metadata_suffix = [f"Metadata_{col}" for col in identifying_cols]
identifying_cols_image_suffix = [f"Image_{col}" for col in identifying_cols]
# Check if each column is present in the DataFrame
columns_present_1 = {col: col in df.columns for col in identifying_cols_metadata_suffix}
columns_present_2 = {col: col in df.columns for col in identifying_cols_image_suffix}
columns_present_3 = {col: col in df.columns for col in identifying_cols}
filt = lambda d: {key: value for key, value in d.items() if value}
import json
print(json.dumps(identifying_cols, indent=4))
print(json.dumps(filt(columns_present_1), indent=4))
print(json.dumps(filt(columns_present_2), indent=4))
print(json.dumps(filt(columns_present_3), indent=4))
From that, it looks like these columns are not present in any of the tables in the SQLite, which is true
So that means I should change my cols to identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
) |
It worked great! I factored in all your advice in #163 (comment) to create a new script # Author: Zitong (Sam) Chen, Broad Institute, 2023
#
# Download sample SQLite file:
# wget https://raw.githubusercontent.com/d33bs/pycytominer/43cf984067700aa52f0b6752e3490d9e12d60170/tests/test_data/cytominer_database_example_data/test_SQ00014613.sqlite -O test_SQ00014613.sqlite
from cytotable import convert
import logging
import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Cytoplasm_Parent_Nuclei",
"Cells_Parent_Nuclei",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
INNER JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = image.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = image.Metadata_ImageNumber
INNER JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
AND cytoplasm.Metadata_Cytoplasm_Parent_Nuclei = nuclei.Metadata_ObjectNumber
INNER JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_Cells_Parent_Nuclei = nuclei.Metadata_ObjectNumber
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
parsl_config=parsl.load(
Config(
executors=[
ThreadPoolExecutor(
# set maximum number of threads at any time, for example 3.
# if not set, the default is 2.
max_threads=3,
)
]
)
),
) This does not factor in this advice:
|
We are planning to convert several SQLite files to Parquet, and (potentially) make the joined Parquet the de facto standard for Cell Painting datasets in the future.
I get this error cytotable_error.txt when running the code below (
CytoTable==0.0.4
)There's something funky happening; from the log:
There is no error if I change
to
I adapted the script that @Zitong-Chen-16 had written; it seemed to have worked for her
https://github.com/broadinstitute/2021_09_01_VarChAMP/blob/main/6.downstream_analysis/scripts/0.convert_to_parquet.py
The script errors out after producing the
[cells|nuclei|cytoplasm|image].parquet
files. So I took advantage of this and poked around to see if the join query works if I try to do it directly like this below; it does:Output:
The text was updated successfully, but these errors were encountered: