EuSQuAD is a version of SQuAD2.0 for Basque. Our approach is based on machine-translating the original corpus with a generic neural machine translation system and resolving mismatches between contexts and answers via semantic text similarity. The resulting dataset is the same size as the original SQuAD2.0 dataset (over 142k question-answer pairs) and is readily usable for QA-related tasks in Basque.
For more details see the paper.
EuSQuAD has the same JSON format and structure as the original SQuAD2.0, so it should be possible to load and use it with the same code and tools.
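As a rough illustration of what "the same format" means in practice, the following minimal Python sketch flattens a SQuAD2.0-style JSON structure into one record per question. The field names (`data`, `paragraphs`, `context`, `qas`, `is_impossible`, `answers`, `answer_start`) are those of the SQuAD2.0 format; the helper names, the file path argument, and the tiny in-memory sample are assumptions made up for this example and are not taken from the EuSQuAD distribution.

```python
import json
from typing import Iterator, List


def iter_squad_examples(dataset: dict) -> Iterator[dict]:
    """Yield one flat record per question from a SQuAD2.0-style dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "title": article.get("title", ""),
                    "context": context,
                    "question": qa["question"],
                    # Unanswerable questions carry is_impossible=True
                    # and an empty answers list.
                    "is_impossible": qa.get("is_impossible", False),
                    "answers": [
                        (a["text"], a["answer_start"])
                        for a in qa.get("answers", [])
                    ],
                }


def load_squad_file(path: str) -> List[dict]:
    """Read a SQuAD2.0-style JSON file (hypothetical path) into flat records."""
    with open(path, encoding="utf-8") as f:
        return list(iter_squad_examples(json.load(f)))


# Made-up sample in SQuAD2.0 layout, used only to exercise the helper.
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Donostia",
            "paragraphs": [
                {
                    "context": "Donostia Gipuzkoako hiriburua da.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Zein da Gipuzkoako hiriburua?",
                            "is_impossible": False,
                            "answers": [
                                {"text": "Donostia", "answer_start": 0}
                            ],
                        },
                        {
                            "id": "q2",
                            "question": "Zenbat biztanle ditu Donostiak?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}

examples = list(iter_squad_examples(sample))
for ex in examples:
    print(ex["id"], ex["is_impossible"], ex["answers"])
```

Once the dataset has been obtained, calling `load_squad_file` with the path to the downloaded JSON file should return the same kind of records, assuming the file follows the SQuAD2.0 layout as stated above.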
EuSQuAD can be requested through this REQUEST FORM.
The following researchers collaborated on the creation of the EuSQuAD dataset:
- Aitor García-Pablos
- Naiara Perez
- Montse Cuadros
- Jaione Bengoetxea
We would like to express our gratitude to the Machine Translation team from Vicomtech's HSLT department for providing the English-Basque translation service.
(TO BE UPDATED)
The same as the original SQuAD2.0
If you use this dataset, please cite the following paper:
@misc{garcíapablos2024eusquad,
title={{EuSQuAD}: Automatically Translated and Aligned {SQuAD2.0} for {B}asque},
author={Aitor García-Pablos and Naiara Perez and Montse Cuadros and Jaione Bengoetxea},
year={2024},
eprint={2404.12177},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
(The paper has been accepted for publication in issue 73 of the journal Procesamiento del Lenguaje Natural; we will update the BibTeX entry once it is published.)