-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28518: Iceberg: Fix ClassCastException during in-place migration to Iceberg tables with timestamp columns #5590
base: master
Are you sure you want to change the base?
Conversation
…Iceberg tables with timestamp columns
@ggangadharan , thanks for the PR. I have 1 question though. |
Thank you for raising this question. Upon investigation, it appears that the issue stems from how the IcebergRecordReader interprets the timestamp column for different file formats:
Due to this discrepancy, we are encountering a ClassCastException when working with Parquet tables. As you mentioned, I also believe the root cause lies in the underlying file format/Iceberg level. I’ve attached a screenshot for reference. Please let me know if you need further details or if we should take any additional steps to address this. if it looks okay , Please review the PR. |
I wonder why everything is ok when we directly create an Iceberg + Parquet table.
|
Hi @okumin , Thank you for taking the time to review the pull request. In the Iceberg Parquet table, the timestamp column is read as LOCALDATETIME. I’ve attached a screenshot for reference. There is a notable difference in how the timestamp column is stored at the Parquet file format level. Specifically:
For clarity, I’ve also included the metadata from Parquet-tools for reference. As Iceberg Parquet table
As standard parquet table
|
Thanks. I also remember Hive didn't follow the regular convention on encoding TIMESTAMP. I don't have an immediate idea on how to fix it |
We should try to fix the regular convention on encoding TIMESTAMP, but that might not fix the case with existing tables, For those the fix in the current PR seems ok to me |
@ayushtkn Thank you for the feedback. Based on this, I believe the changes are in a good state to proceed unless there are further concerns. @Aggarwal-Raghav @okumin Could you kindly review the code and share your feedback? Your insights would be greatly appreciated to help move this forward. If there are any questions or blockers, feel free to let me know. Thank you for your time and support! |
Thank you! It is obvious. So, is the remaining problem to verify INT96 can be compatible with Iceberg's TIMESTAMP, which means verifying other query engines or tools can read it as a timestamp. I am trying to check it. |
@okumin Thanks for the update Successfully read the migrated ICEBERG table (previously migrated from Hive) using spark.sql in Spark , and it worked as expected. Spark is reading the timestamp column as TimestampNTZType. As per documentation - TimestampNTZType : Timestamp without time zone(TIMESTAMP_NTZ). It represents values comprising values of fields year, month, day, hour, minute, and second. All operations are performed without taking any time zone into account. Ref - https://spark.apache.org/docs/latest/sql-ref-datatypes.html Attaching spark3-shell output for a reference.
FYI While reading the string column name, I encountered an error that has been reported here . Since it is related to a spark/Iceberg issue, we can ignore it for now
|
@okumin I have shared the output snippets from SPARK side in my previous response. Please let me know if there's anything else required from my side to help resolve this issue. |
@ggangadharan Thanks. I think you provided sufficient information. I was testing Spark and Trino, but the progress has not been so good. I faced another problem while testing them, and it's been a holiday season in my country. I would appreciate it if you could give us time |
@okumin Thanks for the update. Let me know if there's anything needed from my end. Will check and do the needful from my side. |
OffsetDateTime odt = (OffsetDateTime) o; | ||
time = odt.atZoneSameInstant(TypeInfoFactory.timestampLocalTZTypeInfo.getTimeZone()).toLocalDateTime(); | ||
} else { | ||
time = (LocalDateTime) o; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be safe to check if it is LocalDateTime before casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zratkai Thanks for taking a look into this.
Sure, I will make the necessary changes. Besides LocalDateTime and OffsetDateTime, are there any other classes that might be expected here? Please let me know so I can handle them appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should always be LocalDateTime, ideally. But the Iceberg + Parquet reader constructs OffsetDateTime. So, we should support only LocalDateTime and OffsetDateTime.
Though we expected never fail, it could be a little kinder if we had a better error message like String.format("An unexpected type %s was passed as timestamp", o.getClass)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed update @okumin
if it's okay. I will make the below code changes.
public Timestamp getPrimitiveJavaObject(Object o) {
if (o == null) {
return null;
}
LocalDateTime time;
if (o instanceof LocalDateTime) {
time = (LocalDateTime) o;
} else if (o instanceof OffsetDateTime) {
OffsetDateTime odt = (OffsetDateTime) o;
time = odt.atZoneSameInstant(TypeInfoFactory.timestampLocalTZTypeInfo.getTimeZone()).toLocalDateTime();
} else {
throw new ClassCastException(String.format("An unexpected type %s was passed as timestamp. " +
"Expected LocalDateTime/OffsetDateTime", o.getClass().getName()));
}
return Timestamp.ofEpochMilli(time.toInstant(ZoneOffset.UTC).toEpochMilli(), time.getNano());
}
Please check and let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @ayushtkn
I have checked how we handle timestamps for the past few days, and this is the summary. As far as I tested, I didn't find any patterns where this doesn't work so far. So, I am thinking to review this and give +1. I might ask someone for one more look because I might not have comprehensive information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ayushtkn @zabetak While investigating the history of TIMESTAMP and related tickets, I thought you would likely be knowledgeable about this area. I'd appreciate it if you could take another look at it.
I have two mysteries.
- I am not sure if it is very valid that the Iceberg reader returns OffsetDateTime
- INT96 might have historically broken or strange timestamps
OffsetDateTime odt = (OffsetDateTime) o; | ||
time = odt.atZoneSameInstant(TypeInfoFactory.timestampLocalTZTypeInfo.getTimeZone()).toLocalDateTime(); | ||
} else { | ||
time = (LocalDateTime) o; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should always be LocalDateTime, ideally. But the Iceberg + Parquet reader constructs OffsetDateTime. So, we should support only LocalDateTime and OffsetDateTime.
Though we expected never fail, it could be a little kinder if we had a better error message like String.format("An unexpected type %s was passed as timestamp", o.getClass)
…Iceberg tables, Changes after Review Comments
@ayushtkn Updated the changes in the recent commit, and all tests have passed. Please review. |
Quality Gate passedIssues Measures |
What changes were proposed in this pull request?
This fix improves the stability and reliability of in-place migrated Iceberg tables involving timestamp data types.
Why are the changes needed?
The issue occurred due to incorrect type casting in the timestamp handling logic, which caused the migrated Iceberg tables Fetch task to fail.
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Qtest - iceberg_inplace_migration_with_timestamp_column.q