LO fails to load document after saving with odftoolkit due to invalid UTF-16 entities #137

Open
FlorianBruckner opened this issue Nov 22, 2021 · 6 comments

Comments

@FlorianBruckner

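Xalan's serializer contains a nasty bug: it writes characters outside the Basic Multilingual Plane as two separate numeric character references, one per UTF-16 surrogate. That is not well-formed XML and leads to a corrupt document. E.g. this input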

<text:span text:style-name="T19">𝜈</text:span>

is changed to this when the document is saved with odftoolkit:

<text:span text:style-name="T19">&#55349;&#57096;</text:span>

More information about the root cause can be found here:
https://issues.apache.org/jira/browse/XALANJ-2419
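
To illustrate the difference, here is a minimal Java sketch. It is not the Xalan code; the class and method names are made up for illustration. It contrasts escaping one reference per UTF-16 code unit (which reproduces the broken output above) with escaping one reference per Unicode code point (which is what well-formed XML requires).

```java
public class SurrogateRefDemo {

    /** Reproduces the broken output: one numeric reference per UTF-16 code unit. */
    public static String perCodeUnit(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append("&#").append((int) s.charAt(i)).append(';');
        }
        return sb.toString();
    }

    /** Well-formed alternative: one numeric reference per Unicode code point. */
    public static String perCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D708, the character from the example above, built from its code point
        String nu = new String(Character.toChars(0x1D708));
        System.out.println(perCodeUnit(nu));  // &#55349;&#57096;
        System.out.println(perCodePoint(nu)); // &#120584;
    }
}
```

The two values 55349 and 57096 are the high and low surrogates (0xD835, 0xDF08) of U+1D708. Surrogate values are not legal XML characters, which is why a consumer such as LibreOffice rejects the file.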

As it seems unlikely that there will ever be a new Xalan release including a fix for this, one option (and that is what I have been doing now) is to replace the xalan serializer dependency with a known good version, e.g.

        <dependency>
            <groupId>org.docx4j.org.apache</groupId>
            <artifactId>xalan-serializer</artifactId>
            <version>11.0.0</version>
        </dependency>

I cannot vouch for the integrity of this package but I have verified that it actually fixes the invalid encoding.

@mistmist
Contributor

how is this library actually used? i can only find the file odfdom/src/main/java/org/odftoolkit/odfdom/IElementWriter.java which defines an interface but this interface appears to be unused.... probably i'm missing something.

@FlorianBruckner
Author

This library is a replacement for xalan:serializer. The xalan serializer is used to serialize back to XML, and this is what causes my problem.

@dgerhardt

I also ran into this issue when trying to use the library to export user generated content. User generated content often contains Unicode emojis ("🙂") which trigger this incorrect behavior leading to broken docs.

@svanteschubert
Contributor

svanteschubert commented Jul 13, 2023

Apache Xalan-Java did a 2.7.3 release in April: https://xalan.apache.org/xalan-j/readme.html#notes_latest
Seven issues are listed as fixed, but none of them looks especially close to what you describe.
Still, it is worth a try!
ODF Toolkit already refers to the latest Xalan release on master. The 0.11.0 release still uses 2.7.2, but I have just published a new snapshot release, 0.12.0-SNAPSHOT, so you can test it in your environments.

If this problem still exists, I would suggest you raise it with the Apache Xalan developers:
https://xalan.apache.org/xalan-j/contact_us.html
It might help to check their issue tracker first, then file an issue and ask on the mailing list to get a quick response.

Please note that they still seem to use SVN, with a GitHub mirror that is read-only.
Nevertheless, some people have opened pull requests there, and some of them look like solutions to a problem close to the one you mentioned:
https://github.com/apache/xalan-j/pulls

Good luck!
Svante

@dgerhardt

dgerhardt commented Jul 18, 2023

Thanks for the reply, @svanteschubert!

I've tried overriding the Xalan dependency with 2.7.3 but unfortunately the latest version doesn't fix this issue. For now, I've replaced the dependency with the fork by docx4j which fixes it.
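
For anyone who wants to check whether a given serializer version is affected, the following Java sketch scans serialized XML for numeric character references that fall in the surrogate range (0xD800 to 0xDFFF). It is only an illustration: the class and method names are hypothetical, and it works on the output as a plain string rather than calling any Xalan or ODF Toolkit API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateRefCheck {

    // Matches decimal (&#55349;) and hexadecimal (&#xD835;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    /** Returns true if the XML text contains a reference to a surrogate value. */
    public static boolean containsSurrogateRef(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        while (m.find()) {
            String body = m.group(1);
            long value;
            try {
                value = body.startsWith("x")
                        ? Long.parseLong(body.substring(1), 16)
                        : Long.parseLong(body);
            } catch (NumberFormatException e) {
                continue; // absurdly long number, cannot be a surrogate
            }
            if (value >= 0xD800 && value <= 0xDFFF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsSurrogateRef(
                "<text:span>&#55349;&#57096;</text:span>")); // true
        System.out.println(containsSurrogateRef(
                "<text:span>&#120584;</text:span>"));        // false
    }
}
```

Running such a check on the `content.xml` of a saved document that contains an emoji would show whether the serializer on the classpath still produces the broken references.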

Three related issues already exist in their tracker and are marked as major bugs; the oldest one was reported 15 years ago. Looking at the SVN/Git history, it seems like the project was completely unmaintained for nearly a decade. But since last year, there has been some activity, so I'm slightly hopeful that they will pick up the existing fixes in the near future.

@svanteschubert
Contributor

svanteschubert commented Jul 19, 2023

@dgerhardt Hi Daniel,

I suggest writing to the Apache Xalan dev list and telling them about the problem and the solution. The more you are able to lower the bar for a release (their work), the likelier it is that they will fix it. For instance, the docx4j fork has a solution, so you might point to it! Or try to motivate them to take over that task! :-)

Godspeed, Daniel!
Svante

dmitriy-konovalov added a commit to dmitriy-konovalov/odftoolkit that referenced this issue Feb 15, 2024
Replacing xalan with fork to avoid document maformation when unicode emojis used in content tdf#137