Sagan Textual Entailment Test Suite

This textual entailment test suite aims to provide developers of Textual Entailment systems with additional test and training datasets.

We followed the algorithm proposed in [1] to increase the size of a Textual Entailment corpus by using Machine Translation systems to generate additional <t,h> pairs.

We used this algorithm to generate additional training datasets, starting from the RTEx datasets and following a double translation process (English to Spanish and back to English). We chose Spanish as the intermediate language and Microsoft Bing Translator as the only Machine Translation system in this process. Additional datasets will be provided soon.
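The double translation step can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the `translate` callable, the `toy_translate` stand-in, and the helper names are hypothetical, and no real Bing Translator API is called. It also assumes that both sides of a pair are round-tripped and that the entailment label is carried over unchanged; the exact procedure is described in [1].

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str, str]  # (text, hypothesis, label)
Translator = Callable[[str, str, str], str]  # (sentence, source, target) -> translation


def round_trip(sentence: str, translate: Translator,
               source: str = "en", pivot: str = "es") -> str:
    """Translate source -> pivot -> source and return the final sentence."""
    pivoted = translate(sentence, source, pivot)
    return translate(pivoted, pivot, source)


def expand_corpus(pairs: Iterable[Pair], translate: Translator) -> List[Pair]:
    """Return the original pairs plus one round-tripped copy of each pair."""
    expanded: List[Pair] = []
    for text, hypothesis, label in pairs:
        expanded.append((text, hypothesis, label))
        # Assumption: both t and h are round-tripped and the label is kept as-is.
        expanded.append((round_trip(text, translate),
                         round_trip(hypothesis, translate),
                         label))
    return expanded


def toy_translate(sentence: str, source: str, target: str) -> str:
    """Hypothetical stand-in for an MT system: it only tags the direction."""
    return f"[{source}>{target}] {sentence}"


original = [("A man sleeps.", "A person sleeps.", "ENTAILMENT")]
expanded = expand_corpus(original, toy_translate)
```

In practice `toy_translate` would be replaced by a wrapper around a real Machine Translation service with the same three-argument shape.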

Additionally, we provide a Cross-Lingual Textual Entailment (CLTE) dataset based on the monolingual RTE3 dataset used in the Third PASCAL Recognizing Textual Entailment Challenge. The texts (T) are written in English and the hypotheses (H) are written in Spanish. The procedure to generate this dataset can be found in [2].

Several datasets are provided; a description of their contents can be found below:

Downloads

Monolingual RTE datasets. Formats:
These datasets are marked for a 3-way decision in terms of entailment: "ENTAILMENT", "CONTRADICTION" and "UNKNOWN" (same format as RTE5 Pascal).

3-Way based on RTE Stanford datasets:
- RTE3_3w-4C-.XML (2974 pairs: 1520 Entailment, 340 Contradiction, 1114 Unknown)
- RTE4_3w-4C-.XML (3630 pairs: 1812 Entailment, 546 Contradiction, 1272 Unknown)
- RTE5_3w-4C-.XML (2040 pairs: 1018 Entailment, 310 Contradiction, 712 Unknown)
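
As a rough sketch of how such a file might be read, the snippet below parses RTE-style XML and counts the labels. The layout is an assumption based on the usual RTE format (a root element holding `<pair>` elements with an `entailment` attribute and `<t>` / `<h>` children); the inline sample is invented for illustration and is not taken from the actual files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny invented sample in the assumed RTE-style layout.
SAMPLE_XML = """<entailment-corpus>
  <pair id="1" entailment="ENTAILMENT" task="IR">
    <t>A dog is running in the park.</t>
    <h>An animal is outdoors.</h>
  </pair>
  <pair id="2" entailment="CONTRADICTION" task="QA">
    <t>The shop opens at nine.</t>
    <h>The shop never opens.</h>
  </pair>
  <pair id="3" entailment="UNKNOWN" task="IE">
    <t>Maria bought a book.</t>
    <h>Maria likes novels.</h>
  </pair>
</entailment-corpus>"""


def read_pairs(root):
    """Yield one dict per <pair> element found under the root."""
    for pair in root.iter("pair"):
        yield {
            "id": pair.get("id"),
            "label": pair.get("entailment"),
            "t": (pair.findtext("t") or "").strip(),
            "h": (pair.findtext("h") or "").strip(),
        }


def label_counts(pairs):
    """Count how many pairs carry each entailment label."""
    return Counter(p["label"] for p in pairs)


# For a downloaded file, ET.parse("RTE3_3w-4C-.XML").getroot() would replace fromstring.
root = ET.fromstring(SAMPLE_XML)
pairs = list(read_pairs(root))
counts = label_counts(pairs)
```

If the assumed layout holds, running `label_counts` on a downloaded file should reproduce the per-label figures listed above.
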
These datasets are marked for a 2-way decision in terms of entailment: "ENTAILMENT" and "NO ENTAILMENT".

2-Way based on RTE TAC datasets: These datasets were converted to the 2-way task by taking contradiction and unknown pairs as "NO ENTAILMENT".
- RTE3_2w-4C-.XML (2974 pairs: 1520 Entailment, 1454 No Entailment)
- RTE4_2w-4C-.XML (3630 pairs: 1812 Entailment, 1818 No Entailment)
- RTE5_2w-4C-.XML (2040 pairs: 1018 Entailment, 1022 No Entailment)
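
The 3-way to 2-way conversion described above amounts to a simple label mapping. The sketch below uses hypothetical helper names and the label strings quoted on this page (the exact attribute values inside the files may differ), and checks the mapping against the RTE3 counts listed above.

```python
from collections import Counter

THREE_WAY_LABELS = {"ENTAILMENT", "CONTRADICTION", "UNKNOWN"}


def to_two_way(label: str) -> str:
    """Map a 3-way label to a 2-way label; reject anything unexpected."""
    normalized = label.strip().upper()
    if normalized not in THREE_WAY_LABELS:
        raise ValueError(f"unexpected 3-way label: {label!r}")
    # Contradiction and unknown pairs both become "NO ENTAILMENT".
    return "ENTAILMENT" if normalized == "ENTAILMENT" else "NO ENTAILMENT"


def collapse_counts(three_way: Counter) -> Counter:
    """Aggregate 3-way label counts into 2-way label counts."""
    two_way = Counter()
    for label, count in three_way.items():
        two_way[to_two_way(label)] += count
    return two_way


# Counts listed for RTE3_3w-4C-.XML on this page.
rte3_three_way = Counter({"ENTAILMENT": 1520, "CONTRADICTION": 340, "UNKNOWN": 1114})
rte3_two_way = collapse_counts(rte3_three_way)
```

The 340 contradiction and 1114 unknown pairs add up to the 1454 "NO ENTAILMENT" pairs listed for RTE3_2w-4C-.XML.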

Cross-Lingual Textual Entailment datasets. Formats:
English to Spanish CLTE datasets:
- RTE3-DS-CLTE-EN-SP-Bi-Good.XML (541 pairs: 284 Entailment, 59 Contradiction, 198 Unknown)
- RTE3-TS-CLTE-EN-SP-Go-200pairs-Good.XML (200 pairs: 108 Entailment, 32 Contradiction, 60 Unknown)

This test suite may be downloaded and used without restriction. An acknowledgement would be appreciated if you publish results using it, and we would also be interested to hear what performance you get.

How to cite this Textual Entailment Test Suite:

1. The algorithm to generate the Monolingual corpus and a description can be found in the following paper:

Please cite as: Julio J. Castillo, "Using Machine Translation Systems to Expand a Corpus in Textual Entailment". ICETAL 2010, LNCS, Springer Verlag, 2010.

2. The algorithm to generate the Bilingual corpus and a description can be found in the following paper:

Please cite as: Julio J. Castillo, "WordNet-based semantic approach to textual entailment and cross-lingual textual entailment". International Journal of Machine Learning and Cybernetics 2011. Volume 2, Number 3, LNCS, Springer Verlag, 2011.

Papers that used this Textual Entailment Test Suite:

J. Castillo, M. Cardenas, "Using Sentence Semantic Similarity Based on WordNet in Recognizing Textual Entailment". 12th Ibero-American Conference on AI, IBERAMIA 2010, Bahía Blanca, Argentina, November 1-5, 2010, Springer LNAI, in press.

J. Castillo, "A Semantic Oriented Approach to Textual Entailment using WordNet-based Measures". 9th Mexican International Conference on Artificial Intelligence, MICAI 2010, Pachuca, Mexico, November 8-13, 2010, Springer LNAI, in press.

J. Castillo, M. Cardenas, "An Approach to Cross-Lingual Textual Entailment using Web Machine Translation Systems". 10th Mexican International Conference on Artificial Intelligence. POLITIS Research journal on computer science and computer engineering with applications, ISSN 1870-9044, Issue 44, December 2011.

Contact

For questions, please send an e-mail to: Julio Castillo (jotacastillo A T gmail.com)