This repository contains the mintzai-ST corpus.
The mintzai-ST corpus is a Basque-Spanish speech translation parallel corpus, based on the proceedings of the parliamentary sessions of the Basque Government between 2011 and 2018.
The corpus consists of audio files, transcriptions and translations, which may be used to train end-to-end or cascaded speech translation systems for Basque-Spanish in both directions.
The corpus can be downloaded via the following link: https://datasets.vicomtech.org/v2-mintzai-st/mintzai-st-corpus_v1.0.tar.gz
Please note that the archive is 25 GB, so downloading may take some time.
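The download and extraction steps can be scripted. The following is a minimal sketch, assuming Python 3.9 or later and only the standard library. The function names `download` and `extract` and any local paths are illustrative choices, not part of the corpus distribution, and the internal layout of the archive is not assumed here.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the download link above.
CORPUS_URL = "https://datasets.vicomtech.org/v2-mintzai-st/mintzai-st-corpus_v1.0.tar.gz"


def download(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` to `dest` in chunks so the archive is never held in memory."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest


def extract(archive: Path, target_dir: Path) -> list[str]:
    """Unpack a .tar.gz archive into `target_dir` and return the member names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members (absolute paths, etc.).
            tar.extractall(target_dir, filter="data")
        else:
            tar.extractall(target_dir)
    return names
```

For example, `extract(download(CORPUS_URL, Path("mintzai-st-corpus_v1.0.tar.gz")), Path("mintzai-st"))` would fetch the archive and unpack it into a hypothetical `mintzai-st` directory. Make sure enough disk space is available for both the 25 GB archive and its extracted contents.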
If you use any part of the corpus in your own work, please cite the following paper:
@inproceedings{etchegoyhen-et-al2021mintzai-st,
title={mintzai-ST: Corpus and Baselines for Basque-Spanish Speech Translation},
author={Etchegoyhen, Thierry and Arzelus, Haritz and Gete Ugarte, Harritxu and
Alvarez, Aitor and González-Docasal, Ander and Benites Fernandez, Edson},
booktitle={Proceedings of IberSPEECH2020},
location = {Valladolid, Spain},
year={2021},
pages = {TBD}
}
The mintzai-ST corpus is protected by copyright owned by Vicomtech:
Copyright (c) 2020 FUNDACION CENTRO DE TECNOLOGIAS DE INTERACCION VISUAL Y COMUNICACIONES VICOMTECH
The mintzai-ST corpus is distributed under the Creative Commons BY-NC-ND 4.0 license.
To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
If you have any questions or suggestions, do not hesitate to contact us at the following addresses:
- Thierry Etchegoyhen: tetchegoyhen [AT] vicomtech [DOT] org
- Aitor Alvarez: aalvarez [AT] vicomtech [DOT] org