This repository contains TANDO, a corpus for training and evaluating document-level machine translation models for the Basque-Spanish language pair.
The corpus was prepared within the ELKARTEK project TANDO (2020-2021: www.tando.eus) by members of the project consortium:
- Vicomtech Foundation (https://www.vicomtech.org)
- University of the Basque Country (UPV/EHU) / IXA taldea (http://ixa.si.ehu.es/)
- Elhuyar Foundation (https://www.elhuyar.eus)
- ISEA (https://www.isea.eus/en/)
- Ametzagaiña (http://www.ametza.com/)
The TANDO corpus includes both parallel and contrastive datasets, in text format, and covers different domains (literature, news, subtitles, talks, politics). It can be downloaded via the following link: https://datasets.vicomtech.org/v2-tando/tando-corpus_v1.0.tar.gz
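As a convenience, the download and extraction steps can be scripted. The following is only a minimal sketch using the Python standard library: the function names `download` and `extract` and any output paths are hypothetical, and nothing is assumed about the internal layout of the archive, which should be inspected after extraction.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import List

# URL taken from this README; everything else below is illustrative.
CORPUS_URL = "https://datasets.vicomtech.org/v2-tando/tando-corpus_v1.0.tar.gz"


def download(url: str, dest: Path) -> Path:
    """Download `url` to `dest`, skipping the transfer if `dest` already exists."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive: Path, out_dir: Path) -> List[str]:
    """Extract a .tar.gz archive into `out_dir` and return its member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter when this Python version has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(str(out_dir), **kwargs)
    return names
```

For example, calling `extract(download(CORPUS_URL, Path("tando-corpus_v1.0.tar.gz")), Path("tando-corpus"))` would fetch the archive into the current directory, unpack it into a `tando-corpus` directory (a name chosen here for illustration), and return the list of extracted paths.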
If you use any part of the corpus in your own work, please cite the following paper:
@inproceedings{gete-et-al2022tando-corpus,
title={TANDO: A Corpus for Document-level Machine Translation},
author={Gete, Harritxu and Etchegoyhen, Thierry and Ponce, David and Labaka, Gorka and
Aranberri, Nora and Corral, Ander and Saralegi, Xabier
and Ellakuria Santos, Igor and Martin, Maite},
booktitle={Proceedings of the 13th Edition of the Language Resources and Evaluation Conference (LREC 2022)},
location = {Marseille, France},
year={2022},
pages = {TBD}
}
The TANDO corpus is distributed under the Creative Commons BY-NC-SA 4.0 license.
To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
If you have any questions or suggestions, do not hesitate to contact us at the following addresses:
- Thierry Etchegoyhen: tetchegoyhen [AT] vicomtech [DOT] org
- Harritxu Gete: hgete [AT] vicomtech [DOT] org