FAPESP Corpora - README

REVISTA PESQUISA FAPESP PARALLEL CORPORA

Last update: January 30th, 2012

Content

These corpora are the Portuguese-English and Portuguese-Spanish bilingual collections of the online issues of the Brazilian scientific news magazine REVISTA PESQUISA FAPESP.
The monolingual collections are also available.
The monolingual files pt.xml, en.xml and es.xml contain details about the articles, such as date of release and title.

A file name has the form


<issue>_<article>



where the two parts are, respectively, the "file" attribute of the XML entry issue and the "file" attribute of the XML entry article.

Example:


For the file pt/data/961_4604
we find the following entry in pt.xml:
<issue articles="22" file="961" month="1" number="191" year="2012">
    ...
    <article file="4604">
        <title>Corpo, jogo e teoria</title>
        ...
    </article>
    ...
</issue>
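
The lookup described above can be sketched in Python as follows. This is a minimal illustration, not part of the release: the function names are hypothetical, the sample document wraps the issue in an assumed <issues> root element (the real root of pt.xml is not documented here), and the code only relies on the "file" attributes of the issue and article entries shown in the example.

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for pt.xml; the <issues> root element is an assumption.
SAMPLE = """<issues>
<issue articles="22" file="961" month="1" number="191" year="2012">
    <article file="4604">
        <title>Corpo, jogo e teoria</title>
    </article>
</issue>
</issues>"""


def split_file_name(name):
    """Split '<issue>_<article>' into its two identifiers."""
    issue_id, article_id = name.split("_", 1)
    return issue_id, article_id


def find_article(xml_source, file_name):
    """Return the (issue, article) elements for a corpus file name, or None.

    xml_source may be a path such as 'pt.xml' or an open file object.
    """
    issue_id, article_id = split_file_name(file_name)
    root = ET.parse(xml_source).getroot()
    # iter() also visits the root itself, so this works whatever the root is.
    for issue in root.iter("issue"):
        if issue.get("file") != issue_id:
            continue
        for article in issue.iter("article"):
            if article.get("file") == article_id:
                return issue, article
    return None


if __name__ == "__main__":
    match = find_article(io.StringIO(SAMPLE), "961_4604")
    if match is not None:
        issue, article = match
        print(issue.get("year"), issue.get("month"), article.findtext("title"))
```

To run it against the real metadata, pass the path to pt.xml, en.xml or es.xml instead of the in-memory sample.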



The content of the texts in the bilingual collections may not exactly match the content of the texts in the monolingual ones, because the sentence-aligned data is made up of 1-1 correspondences.

Release

Fapesp-v2 (jan/2012)

Pre-processed data, word alignment, language model, phrase tables and reordering models (Aziz and Specia, 2011)

Citing

When using these data, please cite Aziz and Specia (2011) and add a link to the magazine's webpage:




http://revistapesquisa.fapesp.br


@INPROCEEDINGS{aziz:2011:newfapesp,
AUTHOR={Wilker Aziz and Lucia Specia},
TITLE={Fully Automatic Compilation of a {Portuguese-English} Parallel Corpus for Statistical Machine Translation},
BOOKTITLE={STIL 2011},
ADDRESS={Cuiab\'a, MT},
DAYS={24-26},
MONTH={October},
YEAR={2011},
} 



Documentation

Please refer to the paper (Aziz and Specia, 2011).
For additional information, email Wilker Aziz.

Format

The data is sentence-aligned (alignment is given by the line number).
Under bitexts you will find the document pairs (one document per file).
You will also find the data split into different sets (training, development and test) as used in (Aziz and Specia, 2011).
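
Since alignment is given by line number, a sentence pair is simply line i of the source file together with line i of the target file. The sketch below shows one way to read such a pair of files in Python. The function name, the demo file names (sample.pt, sample.en) and the UTF-8 encoding are assumptions for illustration only; substitute the actual paths from the release.

```python
import os
import tempfile
from itertools import zip_longest


def read_aligned(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs aligned by line number.

    Raises ValueError if the two files do not have the same number of lines.
    The default encoding is an assumption, not something the README states.
    """
    missing = object()  # sentinel marking the end of the shorter file
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        pairs = zip_longest(src, tgt, fillvalue=missing)
        for line_no, (s, t) in enumerate(pairs, start=1):
            if s is missing or t is missing:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


if __name__ == "__main__":
    # Tiny made-up bitext written to a temporary directory for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        pt_path = os.path.join(tmp, "sample.pt")
        en_path = os.path.join(tmp, "sample.en")
        with open(pt_path, "w", encoding="utf-8") as f:
            f.write("Bom dia.\nObrigado.\n")
        with open(en_path, "w", encoding="utf-8") as f:
            f.write("Good morning.\nThank you.\n")
        for pt, en in read_aligned(pt_path, en_path):
            print(pt, "|||", en)
```

Checking that both files have the same length is a cheap way to catch a truncated or mismatched download before the pairs are used.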

License

FAPESP Parallel Corpora by Wilker Aziz is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Permissions beyond the scope of this license may be available at http://revistapesquisa.fapesp.br/.

You may distribute and use these data for non-commercial purposes, such as academic research, but any commercial use of the corpus must be agreed with REVISTA PESQUISA FAPESP.