LXAI Open Source Datasets

Repository of open source datasets for use in research pertaining to the Latinx Community

Natural Language

Spanish Emojis: Collection of Spanish phrase-emoji pairs.
TASS Dataset: A corpus of texts (mainly tweets) in Spanish tagged for sentiment analysis tasks. It is divided into several subsets, each created for the tasks proposed in a given edition of TASS over the years.
XNLI: The Cross-Lingual NLI Corpus: The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated from English into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. Counting the English originals, this results in 112.5k annotated pairs (7,500 pairs in each of 15 languages). Each premise can be associated with the corresponding hypothesis in any of the 15 languages, summing up to more than 1.5M combinations. paper
Europarl Parallel Corpus Spanish-English for Machine Translation: Extracted from the proceedings of the European Parliament. All formats contain document (<CHAPTER id>), speaker (<SPEAKER id name language>), and paragraph (<P>) mark-up, each on a separate line. The data is stored in one file per day, and in smaller units for newer data. Size: 187 MB. source.
Spanish ebooks by Project Gutenberg: One of the largest collections of free ebooks in Spanish. The portal also includes other languages, such as Catalan and Galician.
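
The Europarl entry above describes line-oriented mark-up, so a small reader can help when preparing the files for machine translation experiments. The following is a minimal sketch, not an official loader: the function name `parse_europarl`, the attribute syntax (`ID=1`, `NAME="..."`, `LANGUAGE="..."`), and the sample lines are assumptions made for illustration, since this list only names the tags and their attributes.

```python
import re

# Assumed tag shapes: one tag per line, attributes written as KEY=value
# or KEY="value". Only the tag names come from the dataset description.
TAG_RE = re.compile(r"^<(CHAPTER|SPEAKER|P)\b([^>]*)>\s*$")
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_europarl(lines):
    """Group text lines into paragraphs labelled with chapter and speaker."""
    chapter = None
    speaker = {}
    paragraphs = []
    current = None  # paragraph currently being filled, if any

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match is None:
            # Plain text: start a paragraph if none is open, then append.
            if current is None:
                current = {
                    "chapter": chapter,
                    "speaker": speaker.get("NAME"),
                    "language": speaker.get("LANGUAGE"),
                    "text": [],
                }
                paragraphs.append(current)
            current["text"].append(line)
            continue

        tag, attr_text = match.groups()
        attrs = {
            m.group(1).upper(): m.group(2) if m.group(2) is not None else m.group(3)
            for m in ATTR_RE.finditer(attr_text)
        }
        if tag == "CHAPTER":
            chapter = attrs.get("ID")
            speaker = {}  # a new chapter resets the speaker
        elif tag == "SPEAKER":
            speaker = attrs
        current = None  # any mark-up line closes the open paragraph

    return [{**p, "text": " ".join(p["text"])} for p in paragraphs]


# Hypothetical sample in the assumed format.
sample = [
    "<CHAPTER ID=1>",
    "Reanudación del período de sesiones",
    '<SPEAKER ID=1 NAME="La Presidenta" LANGUAGE="ES">',
    "Declaro reanudado el período de sesiones.",
    "<P>",
    "Les deseo un feliz año nuevo.",
]

for paragraph in parse_europarl(sample):
    print(paragraph["chapter"], paragraph["speaker"], paragraph["text"])
```

To read a real file, pass an open file object instead of `sample`, and check the encoding and attribute syntax against the downloaded data first.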

FEI Face Database: Set of 2,800 facial images from 200 individuals (14 per person). All images are in colour and taken against a white homogeneous background in an upright frontal position with profile rotation of up to about 180 degrees. Scale might vary by about 10%, and the original size of each image is 640x480 pixels. The faces are mainly those of students and staff at FEI, between 19 and 40 years old, with distinct appearance, hairstyle, and adornments. (For research purposes only.)
10k US Adult Faces Database: This database contains 10,168 natural face photographs and several measures for 2,222 of the faces, including memorability scores, computer vision and psychology attributes, and landmark point annotations. The face photographs are JPEGs with 72 pixels/in resolution and 256-pixel height.
Labeled Faces in the Wild: The data set contains more than 13,000 images of faces collected from the web.

LibriVox Spanish Audiobooks: 411 free audiobooks in Spanish (as of 11/1/18). Full audiobooks can be downloaded for free as a zip file and are divided into chapters in MP3 format.

Population Estimate of Non-Hispanic White Persons: A dataset from the U.S. Census Bureau hosted by Federal Reserve Economic Data (FRED). Observations cover 2009 to 2016.
LAPOP Survey Data: The AmericasBarometer data sets contain the responses to the annual survey of Latin American citizens conducted by the Latin American Public Opinion Project (LAPOP) from 2004 to the present. The survey gathers public opinion about politics, specifically topics related to democracy: regime support, political tolerance, authoritarianism, corruption, local governments and citizen participation. An example of the 2017 questionnaire can be found here
