Multilingual support #52

wetneb · 2020-09-21T17:06:35Z

At the moment, the names and descriptions represented in the reconciliation queries and responses do not come with any language information.

For natively multilingual data sources (such as Wikidata) it would be convenient if this restriction could be lifted. At the moment, this is handled by offering one reconciliation endpoint per language supported by Wikidata (such as https://wikidata.reconci.link/en/api, https://wikidata.reconci.link/it/api, and so on), but it would be nicer to have a single endpoint that supports all languages directly, since it is complicated to teach users to insert the language code in the URL.

The lack of multilingual support was identified by a reviewer from the Ontology Matching 2020 workshop and earlier by @tfmorris in #48 (comment).

There are other endpoints which also encode some "constants" in the base URL of the reconciliation service. For instance, the OpenCorporates endpoint does this not for languages but for jurisdictions (letting users match against companies from a single country).
https://api.opencorporates.com/documentation/Open-Refine-Reconciliation-API

So perhaps the right way to address this is to provide a better way for services to receive global configuration options, which would encompass the use cases of both the Wikidata and OpenCorporates endpoints?

VladimirAlexiev · 2020-10-22T08:30:27Z

I posted a separate issue #55 of which this one is an instance.
I think we should allow expressing language preferences, similar to Accept-Language and the Wikidata label service. See w3c/sparql-dev#13.

Below are draft requirements:

language is a special param with these requirements:

- The language param values should conform to BCP 47 and preferably be selected from the IANA Language Subtag Registry (or see the Google Sheet iana-lang-tags for easier access).
- Language matching should conform to RFC 4647 (Matching of Language Tags), as used in HTTP Accept-Language and SPARQL langMatches. E.g. "en" should match a name with lang tag "en-GB".
- The language param can take several lang tags separated with commas, in which case they are interpreted as preference order.
- The service should prefer matches in the specified language(s) but can also return matches in other languages.
- The service should return entity names (and descriptions) in the specified language(s), but can fall back to any other language if the entity has no name in the specified language(s).
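
A minimal sketch of the matching rule in these draft requirements, assuming the "basic filtering" scheme of RFC 4647 (a case-insensitive prefix match on subtag boundaries). The function names `parse_language_param` and `matches` are hypothetical and only serve the illustration; they are not part of any specification.

```python
def parse_language_param(value: str) -> list[str]:
    """Split a comma-separated language value into tags, keeping the given order."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def matches(language_range: str, tag: str) -> bool:
    """RFC 4647 basic filtering: the range equals the tag or is a prefix of it
    that ends on a subtag boundary ("-"). The comparison is case-insensitive."""
    language_range, tag = language_range.lower(), tag.lower()
    if language_range == "*":
        return True  # the wildcard range matches every tag
    return tag == language_range or tag.startswith(language_range + "-")
```

With this rule, `matches("en", "en-GB")` holds, while `matches("en-GB", "en")` and `matches("en", "eng")` do not.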

The Ontotext Platform allows more flexible lang specification, including negations, see this comment. But I think we don't need such advanced features?

wetneb · 2020-10-22T12:22:13Z

I think it makes sense to treat this as a special case of #55 indeed. Thanks for drafting these specs, that looks pretty neat. I guess we'd need to make progress on #55 first since we would rely on this notion of parameter.

wetneb · 2023-01-22T23:18:40Z

In our last call it was discussed that we could simply let the client specify their language using the standard HTTP header Accept-Language, which @VladimirAlexiev already mentions above. I would not introduce a new language parameter for that, though: just use the HTTP header.

This header would control the language in which the names and descriptions of entities, properties and types should be returned.

This would have benefits:

- not reinventing the wheel: just rely on a standard feature of HTTP
- web-based reconciliation clients will have this header set by the browser directly, without the app needing to integrate it

Downsides:

- when formulating a query manually, it would not be possible to set the language explicitly in the URL itself

thadguidry · 2023-01-23T00:33:26Z

@wetneb Hmm, which approach allows greater control for the users who are ultimately behind those clients? What are the pros and cons for users who deal with multiple languages per project and work in batches between languages? Does one approach reduce user control considerably? Could a client's workflow be adapted to still provide good user control?

wetneb · 2023-01-23T07:46:24Z

Even web-based clients can control which value they send in the Accept-Language header (for instance with the fetch JS API) so I would say so!
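
As a rough illustration of a client choosing the header value itself, here is a sketch using Python's standard `urllib` in place of the fetch API. The endpoint URL and the helper name `build_reconciliation_request` are hypothetical, and sending the queries as a form-encoded `queries` field is an assumption about the service. The request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_reconciliation_request(
    endpoint: str, queries: dict, language: str
) -> urllib.request.Request:
    """Build a POST request whose Accept-Language is chosen by the client."""
    # Assumption: the service accepts the query batch as a form field "queries".
    body = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept-Language": language},  # user-selected, not a browser default
        method="POST",
    )


request = build_reconciliation_request(
    "https://example.org/reconcile",  # placeholder endpoint
    {"q1": {"query": "Hans-Eberhard Urbaniak"}},
    "it, en;q=0.5",
)
```

Passing the request to `urllib.request.urlopen` would send it; the language then comes from whatever the user selected in the client.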

VladimirAlexiev · 2023-01-23T09:08:36Z

@wetneb The Accept-Language page lists two considerations that align with @thadguidry's questions above:

The content of Accept-Language is often out of a user's control (when traveling, for instance).
A user may also want to visit a page in a language different from the user interface language.

So I think we should allow an explicit language parameter, and use Accept-Language as default.

One of the requirements needs to be modified:

The language param can take several lang tags separated with commas, in which case they are interpreted as preference order.

The language param can take several lang tags separated with commas.
- In Accept-Language, each lang tag can have an optional ;q= quality value (also called "q-factor" or "weight"). Quality values are relative and the default is ;q=1.0
- Accept-Language lang tags are sorted by quality value using a stable sort
- The list (as given in language or sorted as per Accept-Language) is used as preference order

Example: assume this header:

Accept-Language: en;q=0.1, en-US, en-GB

For each of the following sets of labels that an entity might have, the service should return the indicated label for display:

- en-GB: return en-GB
- en-GB, en-US: return en-US
- en, en-GB, en-US: return en-US
- en-NZ, fr: return en-NZ because it matches en
- fr: return fr as last fallback (as if *;q=0.01 was specified last)
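
The selection described above can be sketched as follows. This is one possible reading, not a normative algorithm: the function names `parse_accept_language` and `select_label` are hypothetical, matching is assumed to be RFC 4647 basic filtering, and only the `q` parameter is interpreted.

```python
from typing import Optional


def parse_accept_language(header: str) -> list[str]:
    """Return the language ranges ordered by descending quality value."""
    weighted = []
    for item in header.split(","):
        language_range, _, params = item.strip().partition(";")
        if not language_range:
            continue
        params = params.strip()
        # A range without a q-value gets the default quality 1.0.
        quality = float(params[2:]) if params.startswith("q=") else 1.0
        weighted.append((language_range.strip(), quality))
    # sorted() is stable: ranges of equal quality keep their header order.
    return [r for r, _ in sorted(weighted, key=lambda pair: -pair[1])]


def select_label(labels: dict[str, str], preferences: list[str]) -> Optional[str]:
    """Return the tag of the label to display, or None if there are no labels."""
    for language_range in preferences:
        wanted = language_range.lower()
        for tag in labels:
            lowered = tag.lower()
            if wanted == "*" or lowered == wanted or lowered.startswith(wanted + "-"):
                return tag
    # Last fallback: any available label, as if "*" had been listed last.
    return next(iter(labels), None)


preferences = parse_accept_language("en;q=0.1, en-US, en-GB")
```

Here `preferences` is `["en-US", "en-GB", "en"]`, and `select_label` reproduces the five cases listed above, for instance `en-NZ` for an entity labelled in `en-NZ` and `fr`.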

wetneb · 2023-01-23T09:36:17Z

> So I think we should allow an explicit language parameter, and use Accept-Language as default.

As soon as you introduce an explicit language parameter, you are then expecting that the reconciliation client makes use of it to let the user select a language.

But if the reconciliation client does let the user pick a language, then why can't it just pass on this language to the server with a header instead of a GET/POST parameter?

Reconciliation clients, even web-based, will be able to set such a header independently of the browser's defaults.

So I would rather stick with a single, standard way to define the language.

eroux · 2023-01-27T16:52:44Z

I'm starting to implement the reconciliation API and this is one of the key issues I'm struggling a bit with (without really being blocked). In our case another param would be helpful: the language associated with the query. For instance in our database we have a lot of transliterated names (originally in Tibetan, Sanskrit, Burmese, Khmer, etc.) and the search sometimes can't really guess in which language a query should be made, and it makes a lot of difference in the results. Perhaps this should be a separate query but since the title of this one is "Multilingual support", I thought this could be relevant. Thanks for this wonderful API BTW!

wetneb · 2023-01-27T19:32:40Z

@eroux thanks for chiming in! If that parameter was supplied in a header, would that make it any more difficult for you to rely on it?

eroux · 2023-01-27T19:47:52Z

The parameter for the expected language of the results can be in a header, yes (in fact Accept-Language seems very standard for that), no problem.

The parameter for the language of the queries should travel with the queries, I think, perhaps something like:

{
  "q1": {
    "query": "Hans-Eberhard Urbaniak",
    "query_lang": "en"
  }
}

fsteeg · 2023-02-09T15:49:11Z

> In our case another param would be helpful: the language associated with the query. For instance in our database we have a lot of transliterated names (originally in Tibetan, Sanskrit, Burmese, Khmer, etc.) and the search sometimes can't really guess in which language a query should be made, and it makes a lot of difference in the results. [...] the parameter for the queries should be with the queries I think

This fits well with the internationalization guidelines from W3C, which recommend that specifications provide separate methods for expressing (1) the language of the intended audience vs. (2) the text-processing language for a specific text range (see https://www.w3.org/International/questions/qa-text-processing-vs-metadata). So for (1) we could take the HTTP header approach from #108, and for (2) we could add optional language fields to the JSON.

This would probably make sense for all objects (queries, properties, property values, candidates, candidate types, features). The language could apply to all fields of that object (e.g. name and description of candidates), and to all contained objects (like all properties of queries), unless they override the container setting, e.g.:

{
  "queries": [
    {
      "query": "Deng Shuping",
      "lang": "en",
      "properties": [
        {
          "pid": "professionOrOccupation",
          "v": "art historian"
        },
        {
          "pid": "variantName",
          "v": "鄧淑蘋",
          "lang": "zh-Hant"
        }
      ]
    }
  ]
}

Here, the query (explicit) and the first properties.v (inherited) have text-processing language en, the second properties.v overrides it with zh-Hant.

(The default / override logic is also part of the W3C guidelines, see https://www.w3.org/TR/international-specs/#lang_inherit.)
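
A small sketch of how a service might resolve this default/override logic for the example above. The function name `effective_languages` is hypothetical, and only the query and its direct properties are handled; the proposal would apply the same rule to other objects too.

```python
from typing import Optional


def effective_languages(
    query: dict, default: Optional[str] = None
) -> list[tuple[str, str, Optional[str]]]:
    """Return (field, text, language) triples with inheritance resolved."""
    # The query's own "lang" applies to its text and is inherited by its properties.
    query_lang = query.get("lang", default)
    resolved = [("query", query["query"], query_lang)]
    for prop in query.get("properties", []):
        # A property keeps its own "lang" if present, else inherits from the query.
        resolved.append((prop["pid"], prop["v"], prop.get("lang", query_lang)))
    return resolved


query = {
    "query": "Deng Shuping",
    "lang": "en",
    "properties": [
        {"pid": "professionOrOccupation", "v": "art historian"},
        {"pid": "variantName", "v": "鄧淑蘋", "lang": "zh-Hant"},
    ],
}
resolved = effective_languages(query)
```

For this input, the first two entries resolve to `en` and the `variantName` value resolves to `zh-Hant`.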

eroux · 2023-02-09T15:58:56Z

I totally agree! Another way of encoding it would be the JSON-LD way:

{
  "pid": "variantName",
  "v": {
    "@value": "鄧淑蘋",
    "@language": "zh-Hant"
  }
}

wetneb · 2023-03-09T13:22:37Z

In today's call @fsteeg mentioned that we should not have to worry too much about JSON-LD to determine the format of our JSON: it should be possible to add the right JSON-LD context to map our JSON structure to RDF appropriately. So we are inclined to go for @fsteeg's JSON structure above.

We might also need to support passing along the language of an entity used as property value:

{
  "pid": "foo",
  "v": [
    {
      "id": "Q344",
      "name": "some entity",
      "lang": "en"
    }
  ]
}

VladimirAlexiev · 2023-05-23T10:16:03Z

Hi @wetneb ! Regarding your last commit (only reading its title, not the code):

I understand Accept-Language, and we discussed it above.
But Content-Language is suitable for a doc written in one language. I don't think Recon results fit that description: even the different matches of one query may carry different languages.
Imagine this situation:
- I query for "John Philips" and specify Accept-Language: bg, en;q=0.5
- The server has two person items Q12 John Philips (en) and Q23 Иван Филипов (bg) = John Philips (en)
- The server should return this JSON (it's wrong in too many ways to list, but you get the idea)

{
  "matches": [
    {"id": "Q12", "name": "John Philips", "lang": "en"},
    {"id": "Q23", "name": "Иван Филипов", "lang": "bg"}
  ]
}

What's the Content-Language of this document? It's neither bg nor en, because it's mixed.

wetneb · 2023-06-15T12:59:40Z

Agreed, it's redundant with the inclusion of the language in the JSON payloads, which we want to do in another change. So I would just remove the Content-Language header.

fsteeg · 2023-06-15T15:09:30Z

> The server has two person items Q12 John Philips (en) and Q23 Иван Филипов (bg) = John Philips (en) [...] What's the Content-Language of this document? It's neither bg nor en, because it's mixed

The Content-Language can actually contain both languages, so here it could be bg, en. This is what the W3C refers to as the metadata language or language of the intended audience, which can be multiple languages (see Types of language declaration). To express which string is in which (single) language, we added a section on setting the text processing language (in #129), which seems to basically work like your example.
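
To illustrate this reading, a service could derive the header value from the languages present in its response. This is only a sketch: the function name `content_language` is hypothetical, the candidate structure follows the simplified example above, and alphabetical ordering is chosen purely to make the result deterministic.

```python
def content_language(candidates: list[dict]) -> str:
    """Build a Content-Language value listing every language used by the candidates."""
    languages = {c["lang"] for c in candidates if c.get("lang")}
    return ", ".join(sorted(languages))


header_value = content_language(
    [
        {"id": "Q12", "name": "John Philips", "lang": "en"},
        {"id": "Q23", "name": "Иван Филипов", "lang": "bg"},
    ]
)
```

For the two candidates in the example, `header_value` is `bg, en`.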

wetneb · 2023-07-13T13:27:51Z

@VladimirAlexiev what do you think about @fsteeg's understanding of Content-Language above? If that sounds good to you we'd merge the PR #108.

VladimirAlexiev mentioned this issue Oct 22, 2020

allow Fixed Params #55

Open

VladimirAlexiev mentioned this issue Oct 22, 2020

add "varies by" (language, jurisdiction) to testbench reconciliation-api/testbench#19

Open

fsteeg mentioned this issue Dec 7, 2022

Content on accessibility #103

Merged

wetneb added a commit that referenced this issue Feb 9, 2023

Add Accept-Language and Content-Language headers, for #52.

398a9e4

wetneb mentioned this issue Feb 9, 2023

Add Accept-Language and Content-Language headers #108

Merged

This was referenced May 9, 2023

I18n checklist: language #125

Open

I18n checklist: text direction #126

Open

fsteeg added a commit that referenced this issue May 17, 2023

Add section on text-processing language (#52)

8a3c082

fsteeg mentioned this issue May 17, 2023

Add section on text-processing language #129

Merged

wetneb added a commit that referenced this issue May 17, 2023

Add Accept-Language and Content-Language headers, for #52.

49fa02b

fsteeg added a commit that referenced this issue Jul 13, 2023

Add section on text-processing language (#52)

93a35dd

wetneb mentioned this issue Nov 14, 2023

Make the Wikidata reconciliation language changeable from the UI OpenRefine/OpenRefine#2333

Open

acka47 mentioned this issue Apr 8, 2024

Support for multi-lingual candidate names #138

Open

acka47 mentioned this issue Nov 11, 2024

Do we need JSON-LD support? If yes, what for? #183

Open
