-
-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write our to_ascii function (drop ICU dependency) #89
Comments
Is the |
That's correct @miguelteixeiraa. ToASCII refers to |
Thinking about pick this one up. Does this issue have any urgency? Is anyone already working on this? Also please let me know if there is another one that would be more interesting to pick up now. |
@miguelteixeiraa It would be a great issue. I have a nasty PR that demonstrates what it might look like... Don't copy my code, it is not good. It was done quickly. One mistake I did was to try to stay in UTF-8. It is a bad idea. Move to UTF-32 right away and process in UTF-32 mode throughout, getting back to UTF-8 only at the end. But I proved that you could almost get it done in a few hours. In truth, I probably did less than 20% of the required work, but it was enough to pass 90% of our tests. Getting it to be correct takes many hours. UTF-32 is pretty simple. Each character is a 32-bit integer (but only just about the first million values are allowed, so 32-bit is too much, but that's fine). The largest possible value is 0x10FFFF. But, actually, anything over 0xE01EF is disallowed for us. Here are some starting points.
Reference to the standard: https://www.unicode.org/reports/tr46/#ToUnicode Step 3 is a fun part. It depends on the Unicode version. For Unicode 15, the table is https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt Important: we want to be able to update the future Unicode releases so that the table should not be hard-coded... we should have a script that goes from the table to our implementation.
What you want to do is write some kind of script that reads the table and outputs code. That's more or less or ICU does it, I think. Take 0x2e. It is the '.' in ascii. It remains a dot. Fine. But look later...
So character 0x3002 must be mapped to 0x2e. Some characters are ignored, we just skip them. Now we must somehow go from the table to efficient C++ code. This has been done before, and I have not looked at how people do it. So here is my naive take. At the moment, 5986 different characters are mapped to 2945 distinct characters. This can be implemented using a key-value map, using 32 bits for the key and 32 bits (or less) for the value. That's about 50 kB right there. Maybe less if you get clever. A flat data structure with a binary search might work to locate the values and the mapping. Next we have the disallowed and the ignored values. This could be implemented using a flat list where you have 2 bits per unicode code points (1114111), so that's 2 MB. Each code point can be allowed, disallowed, ignored or mapped. But we probably don't want to use 2MB. So just represent the data as an list of ranges. Each range can be defined by its starting position (a value no larger than 1114111) and two bits representing whether it is allowed, disallowed, ignored or mapped. So each range can be represented using 32 bits. The table has something like 9000 lines, but most of it is descriptive, there are far fewer than 9000 ranges in there... So, let us say 24 kB to store that. Implementation-wise, a simple binary search through the ranges will do. So mapping, unallowed and so forth... we are talking about less than 100kB. Normalization is tiny. So I am hopeful we can do all of that in less than 150kB I recommend breaking down the implementation into the steps I have outlined. At each stage, write some tests. Don't be a cowboy like I did... it is too complicated to do in one sitting. |
I could probably even do the mapping and scripting over the big table. I have a good idea of how to do it. The normalization only looks mysterious, but if you scan through the perl code, you see that there is relatively little involved. |
For some context, currently we rely on ICU which requires something like 40 MB. I think that implementing just what we need would be far smaller. Less than 200 kB. |
Seems definitely hard. I like it. Thanks for sharing these ideas with me! I'll work on this. |
This task is hard 😵 I don't know when I'm going to finish this, but if anyone already has an idea in mind and thinks they could implement it quickly, please don't get hung up on me. I was thinking about storing the idna table in an interval map like this one: template <typename T, typename U>
class IntervalMap {
std::map<std::pair<T, T>, U> storage;
public:
IntervalMap() {}
void insert(std::pair<T, T> interval, U value)
{
storage[interval] = value;
}
U find(T value)
{
auto it = storage.lower_bound(std::make_pair(value, value));
if (it == storage.end() || value < it->first.first) {
if (it == storage.begin())
return U();
--it;
}
if (value >= it->first.first && value <= it->first.second)
return it->second;
return U();
}
}; then the table would be created like this: IntervalMap<unsigned int, std::pair<std::string, std::vector<unsigned int>>> idna_table;
idna_table.insert(std::make_pair(0x2FA1C, 0x2FA1C), {"mapped", {0x9F3B}});
idna_table.insert(std::make_pair(0x2FA1D, 0x2FA1D), {"mapped", {0x2A600}});
idna_table.insert(std::make_pair(0x2FA1E, 0x2FFFD), {"disallowed", {}});
idna_table.insert(std::make_pair(0x2FFFE, 0x2FFFF), {"disallowed", {}});
idna_table.insert(std::make_pair(0x30000, 0x3134A), {"valid", {}}); and access would be like this: idna_table.find(0x2FA12).second // the list of chars but maybe this implementation is too costly.. what do you think? the construction of this interval map could be done from a script, like this very ugly one here: url = "https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt"
response = requests.get(url)
mapping_table = response.text
lines = []
for line in mapping_table.split("\n"):
if(line.startswith("#") or ";" not in line):
continue
line = line[0:line.index("#")]
line = line.split(";")
line = [x.strip() for x in line if x.strip() != ""]
final_line = "idna_table.insert(std::make_pair("
if(".." in line[0]):
range = line[0].split("..")
final_line+=f"0x{range[0]}" + ", " + f"0x{range[1]}"
else:
final_line+=f"0x{line[0]}, 0x{line[0]}"
final_line+="), {" + "\"" + line[1] + "\", {"
if(len(line) > 2 and line[2] != "0xNV8"):
mapped_value = line[2].split(" ")
for value in mapped_value:
final_line += f"0x{value}" + ", "
final_line = final_line[:len(final_line) - 2]
final_line += "}});"
else:
final_line+="}});"
print(final_line) |
I could also ignore the "range/interval" thing, use unordered_map to store the whole table and get an O(1) |
Do not try to do it all, and especially not all at once. It is indeed quite a large project... there is a reason nobody does ever does it! I have a good idea how to implement this table and I can do it. You leave it up to me, I can probably do it this week. |
@miguelteixeiraa @anonrig I think we could start a secondary project, maybe https://github.com/ada-url/idna/, and work on it there. Because, indeed, it will be non-trivial and it will take some time (weeks). |
@miguelteixeiraa @anonrig And it could be cool to have it as a separate component. |
Sounds like a good idea! : ) |
Note that it is an evolving target : whatwg/url#744 |
@miguelteixeiraa Can you have a look at https://github.com/ada-url/idna ? Next step would be to write some tests for to_ascii. It is not complete because it does not do mapping, it does not scan for forbidden characters and it does not check for bidi conditions. But it is otherwise a good sketch. |
It is not impossibly hard to drop the ICU dependency. It would simplify builds and improve engineering.
I estimate that it would take about a week of work.
The text was updated successfully, but these errors were encountered: