
NLP from Scratch Tutorial Data

IMDb Reviews Dataset 

Purpose: Training the Deep Learning model

Information courtesy of IMDb (http://www.imdb.com). Used with permission.

The IMDb Reviews dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service IMDb. It is used for binary sentiment classification, that is, deciding whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All 50,000 reviews are labeled, so they can be used for supervised deep learning. For ease of reproducibility, we source the data from Zenodo.

Andrea Esuli, Alejandro Moreo, & Fabrizio Sebastiani. (2020). Sentiment Quantification Datasets [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4117827 


GloVe Embeddings

Purpose: To represent text data in a machine-readable, i.e., numeric, format

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation 

GloVe is an unsupervised algorithm that produces word embeddings from a global word-word co-occurrence matrix built over a corpus. You can download the zipped files containing the embeddings from https://nlp.stanford.edu/projects/glove/. Four options are available there, differing in size and training dataset; we opted for the least resource-heavy file, with 50-dimensional representations for each word.
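As a rough illustration of what "50-dimensional representations for each word" looks like in practice, the sketch below parses the plain-text layout used by the pre-trained GloVe files: one word per line, followed by its space-separated vector components. The function name `parse_glove` and the three-dimensional sample lines are hypothetical and exist only for this example; the tutorial's own loading code may differ.

```python
import io


def parse_glove(lines):
    """Parse GloVe text-format lines into a {word: [float, ...]} dictionary.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(v) for v in values]
    return embeddings


# Tiny in-memory stand-in for an embeddings file (made-up 3-dimensional vectors,
# not real GloVe values).
sample = io.StringIO("the 0.1 -0.2 0.3\nfilm 0.4 0.5 -0.6\n")
vectors = parse_glove(sample)
print(sorted(vectors))
```

With a downloaded file, the same function could be called on an open handle, for example `open("glove.6B.50d.txt", encoding="utf-8")`, assuming that is the file you extracted from the archive. Each word would then map to a list of 50 floats.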


Speech Dataset 

Purpose: The trained Deep Learning Model will perform sentiment analysis on this data

Curated by the authors of the tutorial

We have chosen speeches by activists around the globe talking about issues like climate change, feminism, LGBTQA+ rights and racism. These were sourced from newspapers, the official website of the United Nations and the archives of established universities, as cited in the table below. A CSV file was created containing the transcribed speeches, their speakers and the sources the speeches were obtained from. We made sure to include different demographics in our data and covered a range of topics, most of which focus on social and/or ethical issues. The dataset is released under the CC0 Creative Commons license, which means that it is free for the public to use and no copyright is reserved.

| Speech | Speaker | Source |
| --- | --- | --- |
| Barnard College Commencement | Leymah Gbowee | Barnard College |
| UN Speech on Youth Education | Malala Yousafzai | The Guardian |
| Remarks in the UNGA on racial discrimination | Linda Thomas-Greenfield | United States Mission to the United Nations |
| How Dare You | Greta Thunberg | NBC |
| The speech that silenced the world for 5 minutes | Severn Suzuki | Earth Charter |
| The Hope Speech | Harvey Milk | Museum of Fine Arts, Boston |
| Speech at the Time to Thrive Conference | Ellen Page | HuffPost |
| I Have a Dream | Martin Luther King | Marshall University |
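The CSV layout described above (transcribed speech, speaker, source) can be read with Python's standard `csv` module. This is a minimal sketch only: the column names `speech`, `speaker` and `source`, the helper `read_speeches`, and the placeholder rows are assumptions made for illustration, so check the header row of the actual file before relying on them.

```python
import csv
import io


def read_speeches(handle):
    """Read speech records from an open CSV handle into a list of dicts."""
    reader = csv.DictReader(handle)
    return [row for row in reader]


# In-memory stand-in for the real file, using placeholder text rather than
# actual transcripts. Quoted fields may contain commas.
sample_csv = io.StringIO(
    "speech,speaker,source\n"
    '"Placeholder transcript, with a comma.",Speaker A,Source A\n'
    '"Another placeholder transcript.",Speaker B,Source B\n'
)

rows = read_speeches(sample_csv)
for row in rows:
    word_count = len(row["speech"].split())
    print(row["speaker"], "|", row["source"], "|", word_count, "words")
```

For the real dataset, pass a file opened with `newline=""` and an explicit encoding, as the `csv` documentation recommends, so that line breaks inside quoted transcripts are preserved.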

What Comes Next

This folder documents the datasets used by the tutorial rather than teaching the workflow itself.

  1. Return to the parent tutorial folder and run the actual notebook or exercises that consume these resources.
  2. If you are studying the broader curriculum, move back to the surrounding NumPy or data-science track instead of treating this dataset note as a standalone lesson.