Tika Metadata Add performance optimization #2039
patrickdalla
started this conversation in
General
-
Hi @patrickdalla! Welcome back!
-
Right. Thanks for testing.
Anyway, I think we agree on one point: the original Tika implementation
is bad, and a good solution would be to subclass this metadata class and
override that behavior.
For the Sqlite Split PR, at least, this improvement proved mandatory to save
many timestamps and locations in the same multi-valued field, whether it is
done in the IPED project or in the Tika project.
So I will turn this into an issue here.
Maybe we can store all these multiple values in a separate temporary internal
field and load them from it when iterating over them to save to the Lucene
index.
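A minimal sketch of the "temporary internal field" idea, for illustration only. The class name `MultiValueBuffer` and its methods are hypothetical; they are not part of Tika, IPED, or Lucene. The assumption is that values are appended to a per-field list and only materialized as an array once, at the point where they would be written to the index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical buffer for multi-valued fields; not a Tika or IPED class. */
public class MultiValueBuffer {

    // One growable list per field name, so each add is amortized O(1).
    private final Map<String, List<String>> pending = new LinkedHashMap<>();

    /** Appends a value to the named field without copying existing values. */
    public void add(String name, String value) {
        pending.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Materializes the values once, e.g. right before indexing. */
    public String[] getValues(String name) {
        List<String> values = pending.get(name);
        return values == null ? new String[0] : values.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MultiValueBuffer buffer = new MultiValueBuffer();
        buffer.add("Communication:To", "contact-1");
        buffer.add("Communication:To", "contact-2");
        System.out.println(buffer.getValues("Communication:To").length);
    }
}
```

Whether this would sit in a `Metadata` subclass or next to it is an open design choice; the sketch only shows the buffering itself.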
On Thu, Jan 4, 2024, 23:29, Wladimir Leite ***@***.***>
wrote:
… I ran a test with 120,000 messages, each with ~8,000 members set as
Communication:To.
So it would be setting a property with 8K distinct string values, repeated
120K times (each string value has around 20 characters).
I couldn't reproduce the performance difference you observed.
Using Tika's Metadata, the whole processing took *2543* seconds, while using
your IpedMetadata it took *2579* seconds.
So basically the same time.
// Using Tika's `Metadata` with multiple calls to add(String name, String value)
Metadata meta = new Metadata();
ChatGroup groupChat = (ChatGroup) m.getChat();
for (Long id : groupChat.getMembers()) {
    if (id != m.getFrom().getId())
        meta.add(Message.MESSAGE_TO, e.getContact(id).toString());
}

// Using IpedMetadata, with a single call set(String name, ArrayList<String> values)
IpedMetadata meta = new IpedMetadata();
ChatGroup groupChat = (ChatGroup) m.getChat();
List<String> values = new ArrayList<String>();
for (Long id : groupChat.getMembers()) {
    if (id != m.getFrom().getId())
        values.add(e.getContact(id).toString());
}
meta.set(Message.MESSAGE_TO, values);
I can try running the exact same test in which you observed the huge
performance difference. Maybe something else is going on.
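To put a number on the concern being discussed, here is a small standalone sketch that counts element copies when a multi-valued field grows by one array slot per `add` call. That growth strategy is an assumption about how the original implementation behaves and should be checked against the Tika version in use. The class `AppendCostSketch` is hypothetical and does not call Tika.

```java
import java.util.Arrays;

/** Hypothetical cost model; does not use Tika. */
public class AppendCostSketch {

    /**
     * Appends n values by allocating a new array one slot larger each time
     * and returns how many existing elements were copied in total.
     */
    public static long copiesWithArrayAppend(int n) {
        String[] values = new String[0];
        long copied = 0;
        for (int i = 0; i < n; i++) {
            // Copy all current values into a new array with one extra slot.
            String[] grown = Arrays.copyOf(values, values.length + 1);
            copied += values.length;
            grown[values.length] = "v" + i;
            values = grown;
        }
        return copied;
    }

    public static void main(String[] args) {
        // 8,000 values per message, as in the test described above.
        System.out.println(copiesWithArrayAppend(8000));
    }
}
```

Under that assumption the total is n(n-1)/2 copies, so 31,996,000 for 8,000 values in one message. This counts copies only; it does not predict wall-clock time, which is consistent with the measured runs above being nearly identical.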
-
In fact, that is what I was starting to do when, strangely, the
bottleneck disappeared in the Sqlite Split PR.
-
Hi @wladimirleite, I'm back at work after my leave/vacation. I opened this discussion because of the performance issue you also noted when implementing #1999: the poor implementation of value addition on multi-valued metadata fields. In that PR it seems you found a different approach that does not need multiple values.
But the performance issue still exists, and it affects any other parser that needs to add multiple metadata values. Have you opened a specific issue to address this problem?