Multilingual support #52
Comments
I posted a separate issue #55 of which this one is an instance. Below are draft requirements:
The Ontotext Platform allows more flexible lang specification, including negations, see this comment. But I think we don't need such advanced features? |
In our last call it was discussed that we could simply let the client specify their language using a standard HTTP header. This header would control the language in which the names and descriptions of entities, properties and types should be returned. This would have benefits:
Downsides:
|
@wetneb Hmm, which approach allows greater control for the users who are ultimately behind those clients? What are the pros and cons for users who deal with multiple languages per project and work in batches between languages? Does one approach reduce user control considerably? Could a client's workflow be adapted to still provide good user control? |
Even web-based clients can control which value they send in the |
@wetneb The Accept-Language page lists two considerations that align with @thadguidry's questions above:
So I think we should allow an explicit One of the requirements needs to be modified:
Example: assume this header:
and assume an entity has this set of labels. Then the service should return the selected label for display:
|
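For illustration, label selection driven by a language header could be sketched as follows. This is a minimal sketch, not the behaviour agreed in this thread: the header value and labels in the usage lines are made up, the function names are hypothetical, and the matching is a simplified prefix fallback rather than a full implementation of the HTTP language-negotiation rules.

```python
from typing import Optional


def parse_accept_language(header: str) -> list:
    """Parse an Accept-Language value into (tag, weight) pairs, best first."""
    ranges = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        weight = 1.0  # a range without an explicit q-value has weight 1
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat malformed weights as "not acceptable"
        ranges.append((tag.strip().lower(), weight))
    # sorted() is stable, so ranges with equal weight keep their header order
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def select_label(labels: dict, header: str, default_lang: str = "en") -> Optional[tuple]:
    """Return (language, label) for the best available match, or None."""
    available = {lang.lower(): lang for lang in labels}
    for tag, weight in parse_accept_language(header):
        if weight <= 0 or tag == "*":
            continue
        # Try the full tag, then shorter prefixes: zh-hant-tw -> zh-hant -> zh
        candidate = tag
        while candidate:
            if candidate in available:
                lang = available[candidate]
                return lang, labels[lang]
            candidate = candidate.rpartition("-")[0]
    # Assumed fallback policy: the service default, then any label at all
    if default_lang.lower() in available:
        lang = available[default_lang.lower()]
        return lang, labels[lang]
    for lang, label in labels.items():
        return lang, label
    return None


# Hypothetical entity labels and header values
labels = {"en": "Vienna", "de": "Wien", "bg": "Виена"}
print(select_label(labels, "de-AT, bg;q=0.8, en;q=0.5"))
print(select_label(labels, "fr, bg;q=0.8"))
```

The fallback order (service default, then any available label) is an assumption; a service could equally return no label when nothing matches.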
As soon as you introduce an explicit But if the reconciliation client does let the user pick a language, then why can't it just pass on this language to the server with a header instead of a GET/POST parameter? Reconciliation clients, even web-based, will be able to set such a header independently of the browser's defaults. So I would rather stick with a single, standard way to define the language. |
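To make the "clients can set the header themselves" point concrete, here is a minimal sketch of a client building a reconciliation request with an explicit language header. It assumes the header under discussion is `Accept-Language` (the name used in an earlier comment), that queries are sent as a form-encoded `queries` parameter, and it uses a placeholder endpoint URL; `build_reconcile_request` is a hypothetical helper, not part of any existing client.

```python
import json
import urllib.parse
import urllib.request


def build_reconcile_request(endpoint: str, queries: dict, language: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the user's chosen language."""
    # Assumption: the service accepts the query batch as a form field named "queries"
    body = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        # The client sets this value itself, independently of any browser default
        headers={"Accept-Language": language},
        method="POST",
    )


# Placeholder endpoint; the language would come from the user's choice in the client UI
request = build_reconcile_request(
    "https://example.org/reconcile",
    {"q1": {"query": "Hans-Eberhard Urbaniak"}},
    "de, en;q=0.5",
)
print(request.get_method(), request.header_items())
```

Sending the request would be a call to `urllib.request.urlopen(request)`; it is left out here so the sketch runs without network access.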
I'm starting to implement the reconciliation API and this is one of the key issues I'm struggling a bit with (without really being blocked). In our case another param would be helpful: the language associated with the query. For instance in our database we have a lot of transliterated names (originally in Tibetan, Sanskrit, Burmese, Khmer, etc.) and the search sometimes can't really guess in which language a query should be made, and it makes a lot of difference in the results. Perhaps this should be a separate query but since the title of this one is "Multilingual support", I thought this could be relevant. Thanks for this wonderful API BTW! |
@eroux thanks for chiming in! If that parameter was supplied in a header, would that make it any more difficult for you to rely on it? |
The parameter for the expected language of the results can be in a header, yes. In fact, the parameter for the language of the queries should go with the queries themselves, I think, perhaps something like {
"q1": {
"query": "Hans-Eberhard Urbaniak",
"query_lang": "en"
}
} |
This fits well with the internationalization guidelines from W3C, which recommend that specifications provide separate methods for expressing (1) the language of the intended audience vs. (2) the text-processing language for a specific text range (see https://www.w3.org/International/questions/qa-text-processing-vs-metadata). So for (1) we could take the HTTP header approach from #108, and for (2) we could add optional language fields to the JSON. This would probably make sense for all objects (queries, properties, property values, candidates, candidate types, features). The language could apply to all fields of that object (e.g. {
"queries": [
{
"query": "Deng Shuping",
"lang": "en",
"properties": [
{
"pid": "professionOrOccupation",
"v": "art historian"
},
{
"pid": "variantName",
"v": "鄧淑蘋",
"lang": "zh-Hant"
}
]
}
]
} Here, the (The default / override logic is also part of the W3C guidelines, see https://www.w3.org/TR/international-specs/#lang_inherit.) |
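The default / override logic for the query above could be resolved on the server roughly like this. It is a sketch under assumptions: `effective_languages` is a hypothetical helper, and the rule that a property without its own `lang` inherits the query's `lang` is one reading of the proposal, not a settled part of the specification.

```python
from typing import Optional


def effective_languages(query: dict, default_lang: Optional[str] = None) -> dict:
    """Attach an effective language to the query string and to each property value."""
    # The query-level "lang" is the default for everything nested inside the query
    query_lang = query.get("lang", default_lang)
    resolved = {"query": (query["query"], query_lang), "properties": []}
    for prop in query.get("properties", []):
        # A property-level "lang" overrides the inherited query language
        prop_lang = prop.get("lang", query_lang)
        resolved["properties"].append((prop["pid"], prop["v"], prop_lang))
    return resolved


query = {
    "query": "Deng Shuping",
    "lang": "en",
    "properties": [
        {"pid": "professionOrOccupation", "v": "art historian"},
        {"pid": "variantName", "v": "鄧淑蘋", "lang": "zh-Hant"},
    ],
}
for pid, value, lang in effective_languages(query)["properties"]:
    print(pid, value, lang)
```

With the example query, `professionOrOccupation` inherits `en` from the query while `variantName` keeps its own `zh-Hant`.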
I totally agree! Another way of encoding it would be the JSON-LD way: {
"pid": "variantName",
"v": {
"@value": "鄧淑蘋",
"@language": "zh-Hant"
}
} |
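A service could accept both encodings with a small normalisation step. The sketch below assumes exactly the two shapes shown in this thread (a flat `v` with a sibling `lang`, and a JSON-LD style object with `@value` and `@language`); `value_and_language` is a hypothetical helper name.

```python
from typing import Optional


def value_and_language(prop: dict, inherited_lang: Optional[str] = None) -> tuple:
    """Return (value, language) for a property in either encoding."""
    v = prop["v"]
    if isinstance(v, dict) and "@value" in v:
        # JSON-LD style: the language travels inside the value object
        return v["@value"], v.get("@language", inherited_lang)
    # Flat style: the language is a sibling field of "v"
    return v, prop.get("lang", inherited_lang)


flat = {"pid": "variantName", "v": "鄧淑蘋", "lang": "zh-Hant"}
json_ld = {"pid": "variantName", "v": {"@value": "鄧淑蘋", "@language": "zh-Hant"}}
print(value_and_language(flat) == value_and_language(json_ld))
```

Both shapes normalise to the same pair, so the choice between them mostly affects how the JSON maps to RDF, not what the service can express.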
In today's call @fsteeg mentioned that we should not have to worry too much about JSON-LD to determine the format of our JSON: it should be possible to add the right JSON-LD context to map our JSON structure to RDF appropriately. So we are inclined to go for @fsteeg's JSON structure above. We might also need to support passing along the language of an entity used as property value: {
"pid": "foo",
"v" : [ {
"id": "Q344",
"name": "some entity",
"lang": "en"
} ]
} |
Hi @wetneb ! Regarding your last commit (only reading its title, not the code):
"matches": [
{"id": "Q12", "name": "John Philips", "lang": "en"},
{"id": "Q23", "name": "Иван Филипов", "lang": "bg"}
] What's the |
Agreed, it's redundant with the inclusion of the language in the JSON payloads, which we want to do in another change. So I would just remove the |
The |
@VladimirAlexiev what do you think about @fsteeg's understanding of |
At the moment, the names and descriptions represented in the reconciliation queries and responses do not come with any language information.
For natively multilingual data sources (such as Wikidata) it would be convenient if this restriction could be lifted. At the moment, this is handled by offering one reconciliation endpoint per language supported by Wikidata (such as https://wikidata.reconci.link/en/api, https://wikidata.reconci.link/it/api, and so on), but it would be nicer to have a single endpoint which would support all languages directly. (It is complicated to teach users to insert the language code in the URL).
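The current workaround amounts to deriving one endpoint URL per language, which a client could do as in the sketch below. The URL pattern is taken from the two example endpoints above; the helper names are hypothetical, and whether every language code follows the same pattern is an assumption.

```python
def wikidata_endpoint(lang: str) -> str:
    """Build the per-language endpoint URL following the pattern quoted above."""
    return f"https://wikidata.reconci.link/{lang}/api"


def endpoints_for(languages: list) -> dict:
    """Map each language code to its own reconciliation endpoint."""
    return {lang: wikidata_endpoint(lang) for lang in languages}


print(endpoints_for(["en", "it"]))
```

Each language needs its own service entry on the client side, which is the configuration burden a single multilingual endpoint would remove.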
The lack of multilingual support was identified by a reviewer from the Ontology Matching 2020 workshop and earlier by @tfmorris in #48 (comment).
There are other endpoints which also encode some "constants" in the base URL of the reconciliation service. For instance, the OpenCorporates endpoint does this not for languages but for jurisdictions (letting users match against companies from a single country).
https://api.opencorporates.com/documentation/Open-Refine-Reconciliation-API
So perhaps the right way to address this is to provide a better way for services to receive global configuration options, which would encompass the use cases of both the Wikidata and OpenCorporates endpoints?