Selecting a Dataset for a Natural Language Processing Paper

The following article describes the process I followed to discover and download a new text dataset. Researchers often have to go through this process, and I hope this article makes someone’s life a little easier.

A while ago I was working on an area of Natural Language Processing named Sentiment and Emotion analysis, which in simple words aims to determine the sentiment (negative, positive, neutral), or emotion (happy, sad, etc.) expressed in a text. After a few years of not working on NLP, I find myself doing part-time research on Sarcasm Detection, which is closely related to my previous work.

The main goal of this research is to develop a sarcasm detection algorithm that can compare or surpass the state-of-the-art (SOTA), and publish a paper about it. To achieve this, the shortest path is to find a dataset widely used by many researchers in their latest work and develop algorithms using it. The best way to find such a dataset is through the Papers With Code academic website.

To find both the SOTA and a dataset for Sarcasm Detection, I visited that awesome website, searched for “Sarcasm”, and checked the publication dates of the papers returned, focusing on those published in 2020 or later. I was lucky to discover a 2020 paper introducing a completely new and “unbiased” dataset for Sarcasm Detection named iSarcasm, and later a series of 2022 papers competing on Task 6 (Sarcasm Detection) of The 16th International Workshop on Semantic Evaluation (SemEval-2022). The best part was that the dataset used during SemEval-2022 Task 6 was an extended version of iSarcasm called iSarcasmEval.

The rest of this article will describe the steps taken to download and prepare the datasets, which are composed of Twitter data. Oftentimes when working with Twitter data, researchers have to follow similar steps, and therefore I hope this article is helpful to someone.

The iSarcasm Dataset

The first dataset I processed was the one introduced in 2020 by the paper titled iSarcasm: A Dataset of Intended Sarcasm. The idea was to present a dataset of tweets labeled by the authors themselves, as opposed to labeled by a third party, an approach that often introduces bias or labeling errors.

We show the limitations of previous labelling methods in capturing intended sarcasm and introduce the iSarcasm dataset of tweets labeled for sarcasm directly by their authors.

From the Papers With Code page, you can access the GitHub repository that contains the data. The repo contains two CSV files with the ID of each tweet, a sarcasm_label column with two possible values (sarcastic and not_sarcastic), and a sarcasm_type column that identifies the specific type of sarcasm expressed. Why the IDs of the tweets and not the texts themselves? Researchers often do this for privacy reasons and to respect Twitter’s guidelines. Having only an ID implies two things: first, we need to use a Twitter crawler to obtain the actual texts; second, we may not be able to retrieve all of them, as some tweets might have been deleted by their authors.

A quick view of the iSarcasm dataset
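The layout described above can be inspected with Python’s standard csv module. The two rows below are made up for illustration; only the column names come from the repo (the real files are iSarcasm/isarcasm_train.csv and iSarcasm/isarcasm_test.csv):

```python
import csv
import io

# Two made-up rows mirroring the iSarcasm CSV layout
sample = """tweet_id,sarcasm_label,sarcasm_type
1234567890,sarcastic,irony
2345678901,not_sarcastic,
"""

# DictReader gives us one dict per row, keyed by the header names
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["tweet_id"], row["sarcasm_label"])
```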

To get this dataset ready, perform the following steps.

Step 1: Clone the repo

git clone https://github.com/silviu-oprea/iSarcasm.git

Step 2: Generate query for Tweets Lookup endpoint

The Twitter API allows developers to do many things within the Twitter platform. We are particularly interested in retrieving tweets based on their IDs, for which we can use the Tweets Lookup endpoint of the API. This endpoint lets us look up up to 100 tweets per call. We therefore need a simple piece of Python code that groups all the IDs in the dataset into query strings of 100 IDs each.

Python script to group tweet IDs into query strings of 100 IDs each

import pandas as pd

# Read the CSV file with Pandas
data = pd.read_csv("iSarcasm/isarcasm_train.csv")

# Extract the column containing the ids and convert it to a list
tweet_ids = data.loc[:, "tweet_id"].tolist()

# Create a file that will contain the queries
file = open("isarcasm_queries_train.txt", "w")
# Go over all of the ids and create query strings with groups of 100 ids
query = ""
for idx, tweet_id in enumerate(tweet_ids):
    query += str(tweet_id) + ","

    if (idx + 1) % 100 == 0:
        query = query[:-1] + "\n"
        file.write(query)
        query = ""

# Do not forget the last (possibly partial) group of ids
if query:
    file.write(query[:-1] + "\n")

# Close the file
file.close()

Make sure to have this script next to the folder containing the dataset. Replace “iSarcasm/isarcasm_train.csv” with “iSarcasm/isarcasm_test.csv” and “isarcasm_queries_train.txt” with “isarcasm_queries_test.txt” to process the test file in the same way.

The queries file with 100 IDs per line

Step 3: Get the tweets with the Tweets Lookup endpoint

We can now retrieve the tweets from Twitter using the Tweets Lookup endpoint. To do so, you first need to fulfill a few requirements, like having a developer account and creating a project. These easy-to-follow steps are fully described in this quick-start document.

Prerequisites

To complete this guide, you will need to have a set of keys and tokens to authenticate your request. You can generate these keys and tokens by following these steps:

Sign up for a developer account and receive approval.

Create a Project and an associated developer App in the developer portal.

Navigate to your App’s “Keys and tokens” page to generate the required credentials. Make sure to save all credentials in a secure location.

Assuming you have met all of the requirements, we can retrieve the tweets with a simple Python script. The script will query the Twitter API and obtain 100 results at a time, some of which will contain a tweet and others an error (the tweet has been deleted or became private). In the end, all the retrieved tweets are matched with their labels in the dataset and saved in a new text file, with one tweet and label, tab-separated, per line.

Python script to retrieve tweets and save them with their corresponding labels

import os
import pandas as pd

import requests
import json

# Get this token from your developer portal.
# Alternatively, set the environment variable in your terminal with:
# export 'BEARER_TOKEN'='<your_bearer_token>'
os.environ["BEARER_TOKEN"] = '<your_bearer_token>'

bearer_token = os.environ.get("BEARER_TOKEN")


def create_url(query):
    """
    Create the url for our GET /tweets request.
    Arg: query - a string of up to 100 comma-separated tweet ids
    Return: a valid url for the request
    """
    ids = "ids=" + query
    # tweet_fields can be filled in to request extra fields,
    # e.g. "tweet.fields=created_at,lang"
    tweet_fields = ""
    url = "https://api.twitter.com/2/tweets?{}&{}".format(ids, tweet_fields)
    return url


def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    Arg: r - a request
    Return: a valid request
    """
    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2TweetLookupPython"
    return r


def connect_to_endpoint(url):
    """
    Connect to the endpoint and try to retrieve the tweets.
    Arg: url - a valid GET /tweets url
    Return: a json response, hopefully with the tweets
    """
    response = requests.request("GET", url, auth=bearer_oauth)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Request returned an error: {} {}".format(
                response.status_code, response.text
            )
        )
    return response.json()

def get_tweets(queries_path):
    """
    Connect to the endpoint multiple times to retrieve all of the tweets.
    Arg: queries_path - path of the file containing all of the queries
         (comma-separated tweet ids)
    Return: ids - a list of the ids of the retrieved tweets
            tweets - a list of texts, the actual tweets
            errors - a list of errors, one per unavailable tweet
    """
    # Empty lists to accumulate the results
    tweets = []
    ids = []
    errors = []

    # Go over every query string
    with open(queries_path, "r") as file:
        for line in file:
            # Create the GET request with the current query
            url = create_url(line.strip())
            # Retrieve the tweets
            json_response = connect_to_endpoint(url)

            # Accumulate the tweets
            for tweet in json_response.get("data", []):
                tweets.append(tweet["text"])
                ids.append(int(tweet["id"]))

            # Accumulate the errors
            for error in json_response.get("errors", []):
                errors.append(error["title"])

    return ids, tweets, errors
    

ids, tweets, errors = get_tweets("isarcasm_queries_train.txt")

print("Retrieved tweets:", len(tweets))
print("Missing tweets:", len(errors))

# Load the original dataset and map each tweet id to its label.
# Matching by id is safer than relying on the API returning tweets
# in the same order as the CSV file.
data = pd.read_csv("iSarcasm/isarcasm_train.csv")
id_to_label = dict(zip(data["tweet_id"], data["sarcasm_label"]))

# Save the tweets and their matching labels into a new tab-separated file
file = open("isarcasm_train.txt", "w")
for tweet_id, tweet in zip(ids, tweets):
    # Remove new lines inside the tweet so each pair stays on one line
    tweet = " ".join(tweet.split("\n"))
    file.write(tweet + "\t" + id_to_label[tweet_id] + "\n")
file.close()

Before running the above script, make sure to replace <your_bearer_token> with your actual token. Also, to process the test dataset, change isarcasm_queries_train.txt to isarcasm_queries_test.txt, iSarcasm/isarcasm_train.csv to iSarcasm/isarcasm_test.csv, and isarcasm_train.txt to isarcasm_test.txt.

The retrieved tweets dataset, each line has a tweet and its label separated by a tab
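As a sanity check, the output file can be parsed back into (tweet, label) pairs. The helper below is not part of the original scripts; the function and file names are mine, and the sample file is hand-made for illustration:

```python
def load_tweets_tsv(path):
    """Parse a file with one tab-separated (tweet, label) pair per line."""
    pairs = []
    with open(path, "r") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last tab, in case a tweet itself contains one
            tweet, label = line.rsplit("\t", 1)
            pairs.append((tweet, label))
    return pairs


# Tiny demonstration with a hand-made sample file
with open("sample_tweets.txt", "w") as f:
    f.write("Oh great, another Monday\tsarcastic\n")
    f.write("I really enjoyed the movie\tnot_sarcastic\n")

pairs = load_tweets_tsv("sample_tweets.txt")
print(len(pairs), pairs[0][1])  # → 2 sarcastic
```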

The iSarcasmEval Dataset

The second dataset, which I consider an extension of the first, was introduced in the paper describing Task 6 of The 16th International Workshop on Semantic Evaluation (SemEval-2022). Luckily, this dataset already provides the text of each tweet and not just its ID. Apart from the column indicating whether the tweet is sarcastic or not, it has other columns for the different types of sarcasm.

Viewing parts of the iSarcasmEval dataset with pandas

The information is the same as the previous dataset but in a different format. Like with the previous data, I was only interested in knowing if the tweet is sarcastic or not. The following steps helped me process this dataset and format it just like the previous one.

Step 1: Clone the repo

git clone https://github.com/iabufarha/iSarcasmEval.git

Step 2: Reformat the dataset

The following script reformats the dataset to match the format of the previous one (iSarcasm). I use “sarcastic” and “not_sarcastic” as my labels; feel free to change them to whatever you like.

Python script to reformat the iSarcasmEval dataset

import pandas as pd
import math

# Load the CSV file with Pandas
train_data = pd.read_csv("iSarcasmEval/train/train.En.csv")

# Get only the columns with the text and the sarcastic label
tweets_and_labels = train_data.loc[:, ["tweet", "sarcastic"]].values.tolist()

# Create a new file to save the dataset
file = open("isarcasmeval_train.txt", "w")

# For every tweet-label pair
for count, (tweet, label) in enumerate(tweets_and_labels):
    # Some rows had no tweet and Pandas interpreted the empty cell as a nan.
    # Skip those rows
    if not isinstance(tweet, str) and math.isnan(tweet):
        continue

    # If the label is 1, save the tweet as "sarcastic", otherwise "not_sarcastic"
    label = "sarcastic" if label == 1 else "not_sarcastic"
    
    # This line removes new lines in the tweet    
    tweet = " ".join(tweet.split("\n"))

    file.write(tweet + "\t" + label + "\n")
file.close()

Finally, to process the test set, change “iSarcasmEval/train/train.En.csv” to “iSarcasmEval/test/task_A_En_test.csv” and “isarcasmeval_train.txt” to “isarcasmeval_test.txt”.

The reformatted iSarcasmEval dataset, where each line has a tweet and its label separated by a tab

Next Step: Do your Machine Learning!

After having retrieved and formatted your data, you are ready to do some cool sarcasm detection. Optionally, you may want to save the tweets in two different files, one for sarcastic and one for non-sarcastic tweets, avoiding having to include the labels inside the files. To do this you just need to make some simple changes to the scripts above. Have fun!
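The optional split into two files could be sketched like this. It is a minimal helper of my own, not part of the scripts above, and the file names are placeholders:

```python
def split_by_label(in_path, sarcastic_path, not_sarcastic_path):
    """Split a combined 'tweet<TAB>label' file into two files, one tweet per line."""
    with open(in_path, "r") as fin, \
         open(sarcastic_path, "w") as f_sar, \
         open(not_sarcastic_path, "w") as f_not:
        for line in fin:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last tab, in case a tweet itself contains one
            tweet, label = line.rsplit("\t", 1)
            out = f_sar if label == "sarcastic" else f_not
            out.write(tweet + "\n")


# Example usage with a tiny hand-made combined file
with open("combined.txt", "w") as f:
    f.write("Oh great, another Monday\tsarcastic\n")
    f.write("I really enjoyed the movie\tnot_sarcastic\n")

split_by_label("combined.txt", "sarcastic.txt", "not_sarcastic.txt")
print(open("sarcastic.txt").read().strip())  # → Oh great, another Monday
```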
