-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28622: Duplicate Entries in TXN_WRITE_NOTIFICATION_LOG Due to Oracle's Handling of Empty Strings #5617
Conversation
+1. Looks good to me. |
327c3d6
to
3dab250
Compare
@@ -1205,7 +1205,7 @@ private void addWriteNotificationLog(List<NotificationEvent> eventBatch, List<Ac | |||
String select = sqlGenerator.addForUpdateClause("select \"WNL_ID\", \"WNL_FILES\" from" + | |||
" \"TXN_WRITE_NOTIFICATION_LOG\" " + | |||
"where \"WNL_DATABASE\" = ? " + | |||
"and \"WNL_TABLE\" = ? " + " and \"WNL_PARTITION\" = ? " + | |||
"and \"WNL_TABLE\" = ? " + " and (\"WNL_PARTITION\" = ? OR \"WNL_PARTITION\" IS NULL) " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- why would we get an empty string ('') as partition name?
- what if we have an empty partition and a non-empty, you'll lock and update both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ
for 1. you can simply create CREATE TABLE employee (id int, name string, salary int)
STORED AS ORC TBLPROPERTIES ('transactional' = 'true');
This will not have any partition information and in the table we are adding '' for partition
For 2. I am not changing any logic here. I am just covering the way oracle handles '' string
Consider below example in oracle database
CREATE TABLE t1 (name VARCHAR2(100));
INSERT INTO t1 (name) VALUES ('harshal');
INSERT INTO t1 (name) VALUES ('');
INSERT INTO t1 (name) VALUES (NULL);
SELECT * FROM t1; -- RETURNS all 3 rows
SELECT * FROM t1 WHERE name = ''; -- RETURNS 0 rows
SELECT * FROM t1 WHERE name IS NULL; --returns 2 rows
Because of this quirky behavior of Oracle, if the backend database is Oracle, then the MERGE operation adds 2 rows into TXN_WRITE_NOTIFICATION_LOG instead of 1.
Merge statement ran in Hive shell:
MERGE INTO employee AS a USING employee_update AS b ON a.id = b.id WHEN MATCHED THEN UPDATE SET salary = b.salary WHEN NOT MATCHED THEN INSERT VALUES (b.id, b.name, b.salary);
Rows added in oracle database in TXN_WRITE_NOTIFICATION_LOG table
"WNL_ID","WNL_TXNID","WNL_WRITEID","WNL_DATABASE","WNL_TABLE","WNL_PARTITION","WNL_TABLE_OBJ","WNL_PARTITION_OBJ","WNL_FILES","WNL_EVENT_TIME" 7,9,3,default,employee,,"{""1"":{""str"":""employee""},""2"":{""str"":""default""},""3"":{""str"":""hive""},""4"":{""i32"":1738583942},""5"":{""i32"":0},""6"":{""i32"":0},""7"":{""rec"":{""1"":{""lst"":[""rec"",3,{""1"":{""str"":""id""},""2"":{""str"":""int""}},{""1"":{""str"":""name""},""2"":{""str"":""string""}},{""1"":{""str"":""salary""},""2"":{""str"":""int""}}]},""2"":{""str"":""hdfs://ccycloud-1.harshal1.root.comops.site:8020/warehouse/tablespace/managed/hive/employee""},""3"":{""str"":""org.apache.hadoop.hive.ql.io.orc.OrcInputFormat""},""4"":{""str"":""org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat""},""5"":{""tf"":0},""6"":{""i32"":-1},""7"":{""rec"":{""2"":{""str"":""org.apache.hadoop.hive.ql.io.orc.OrcSerde""},""3"":{""map"":[""str"",""str"",0,{}]}}},""8"":{""lst"":[""str"",0]},""9"":{""lst"":[""rec"",0]},""10"":{""map"":[""str"",""str"",0,{}]},""11"":{""rec"":{""1"":{""lst"":[""str"",0]},""2"":{""lst"":[""lst"",0]},""3"":{""map"":[""lst"",""str"",0,{}]}}},""12"":{""tf"":0}}},""8"":{""lst"":[""rec"",0]},""9"":{""map"":[""str"",""str"",10,{""totalSize"":""2391"",""numRows"":""3"",""rawDataSize"":""0"",""transactional_properties"":""default"",""COLUMN_STATS_ACCURATE"":""{\""BASIC_STATS\"":\""true\""}"",""numFiles"":""3"",""transient_lastDdlTime"":""1738584068"",""bucketing_version"":""2"",""numFilesErasureCoded"":""0"",""transactional"":""true""}]},""12"":{""str"":""MANAGED_TABLE""},""15"":{""tf"":0},""17"":{""str"":""hive""},""18"":{""i32"":1},""19"":{""i64"":3},""26"":{""i64"":131}}",[NULL],"hdfs://ccycloud-1.harshal1.root.comops.site:8020/warehouse/tablespace/managed/hive/employee/delta_0000003_0000003_0001/bucket_00000_0###delta_0000003_0000003_0001",1738584069 8,9,3,default,employee,,"{""1"":{""str"":""employee""},""2"":{""str"":""default""},""3"":{""str"":""hive""},""4"":{""i32"":1738583942},""5"":{""i32"":0},""6"":{""i32"":0},""7"":{""rec"":{""1"":{""lst"":[""rec"",3,{""1"":{""str"":""id""},""2"":{""str"":""int""}},{""1"":{""str"":""name""},""2"":{""str"":""string""}},{""1"":{""str"":""salary""},""2"":{""str"":""int""}}]},""2"":{""str"":""hdfs://ccycloud-1.harshal1.root.comops.site:8020/warehouse/tablespace/managed/hive/employee""},""3"":{""str"":""org.apache.hadoop.hive.ql.io.orc.OrcInputFormat""},""4"":{""str"":""org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat""},""5"":{""tf"":0},""6"":{""i32"":-1},""7"":{""rec"":{""2"":{""str"":""org.apache.hadoop.hive.ql.io.orc.OrcSerde""},""3"":{""map"":[""str"",""str"",0,{}]}}},""8"":{""lst"":[""str"",0]},""9"":{""lst"":[""rec"",0]},""10"":{""map"":[""str"",""str"",0,{}]},""11"":{""rec"":{""1"":{""lst"":[""str"",0]},""2"":{""lst"":[""lst"",0]},""3"":{""map"":[""lst"",""str"",0,{}]}}},""12"":{""tf"":0}}},""8"":{""lst"":[""rec"",0]},""9"":{""map"":[""str"",""str"",10,{""totalSize"":""2391"",""numRows"":""3"",""rawDataSize"":""0"",""transactional_properties"":""default"",""COLUMN_STATS_ACCURATE"":""{\""BASIC_STATS\"":\""true\""}"",""numFiles"":""3"",""transient_lastDdlTime"":""1738584069"",""bucketing_version"":""2"",""numFilesErasureCoded"":""0"",""transactional"":""true""}]},""12"":{""str"":""MANAGED_TABLE""},""15"":{""tf"":0},""17"":{""str"":""hive""},""18"":{""i32"":1},""19"":{""i64"":3},""26"":{""i64"":131}}",[NULL],"hdfs://ccycloud-1.harshal1.root.comops.site:8020/warehouse/tablespace/managed/hive/employee/delta_0000003_0000003_0002/bucket_00000_0###delta_0000003_0000003_0002",1738584070
and if you see all other database's DDL statements for table creation, it has primary key on WNL_TXNID, WNL_DATABASE, WNL_TABLE, WNL_PARTITION -So, above thing will not even happen in case of other databases but this is a different topic for discussion
Snippet from hive-schema-4.0.0-beta-1.mssql.sql
ALTER TABLE TXN_WRITE_NOTIFICATION_LOG ADD CONSTRAINT TXN_WRITE_NOTIFICATION_LOG_PK PRIMARY KEY (WNL_TXNID, WNL_DATABASE, WNL_TABLE, WNL_PARTITION);
Apart from this, Similar logic is added other places but missing here
Ref:
Line 35 in dee6546
"AND \"CMC_METRIC_TYPE\" = :type AND (:partition IS NULL OR \"CMC_PARTITION\" = :partition)"; Line 56 in dee6546
" AND (tc.\"CTC_PARTITION\" = c.\"CTC_PARTITION\" OR (tc.\"CTC_PARTITION\" IS NULL AND c.\"CTC_PARTITION\" IS NULL)) " +
Let me know if you still have some doubts / suggestion then we can surely discuss in detail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harshal-16
maybe, better fix the way you populate that column. Insert NULL
instead of '' for unpartitioned tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw (:partition IS NULL OR \"CMC_PARTITION\" = :partition)
is not the equivalent to your change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the code and in most of the places partition is inserted as '' if value is not present, so i don't want to change that behaviour, For your 2nd suggestion, I have updated the condition please have a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are those places where partition is inserted as ''? check any of acid or compaction tables - it's always null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved as query is correct now, but it would be better to refactor this in the follow-up PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above comment
…acle's Handling of Empty Strings In Oracle, empty strings ('') are treated as NULL values for VARCHAR2 and CHAR data types. This behavior is unique to Oracle and can be confusing, as an empty string is typically considered distinct from NULL in other databases. As a result, the TXN_WRITE_NOTIFICATION_LOG table receives duplicate entries for a single Hive ACID transaction involving MERGE statements. This discrepancy leads to issues: the _files and _dumpmetadata files in a Hive ACID incremental dump will not align if the dump scope includes one or more MERGE statements. Consequently, the Hive ACID incremental LOAD fails at the target (DR), blocking subsequent replication executions. Solution * Add additional check for partition being null Testing: * Tested on cluster with oracle and mysql as backend database
3dab250
to
8df8cf7
Compare
|
Thanks @deniskuzZ for your review!! I have created a follow-up JIRA HIVE-28743 Can you please merge this PR as I don't have merge access |
In Oracle, empty strings ('') are treated as NULL values for VARCHAR2 and CHAR data types. This behavior is unique to Oracle and can be confusing, as an empty string is typically considered distinct from NULL in other databases.
As a result, the TXN_WRITE_NOTIFICATION_LOG table receives duplicate entries for a single Hive ACID transaction involving MERGE statements.
This discrepancy leads to issues: the _files and _dumpmetadata files in a Hive ACID incremental dump will not align if the dump scope includes one or more MERGE statements. Consequently, the Hive ACID incremental LOAD fails at the target (DR), blocking subsequent replication executions.
Solution
Testing:
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
Is the change a dependency upgrade?
How was this patch tested?