
Comparing Sentiment Analysis Models: GPT, BERT, and Gemini Insights

Sentiment analysis is essential for interpreting textual data. In this article, I present a comprehensive comparison of various sentiment analysis models, using a dataset of 5,000 Yelp reviews.

Important Information About the Sentiment Analysis

The original dataset consisted of 5,000 Yelp reviews, each rated on a scale of 1 to 5. To streamline the sentiment analysis and facilitate a fair comparison across models, I transformed the ratings as follows:

  • Ratings of 1 and 2 were categorized as Negative sentiment.
  • Ratings of 3 were classified as Neutral sentiment.
  • Ratings of 4 and 5 were grouped under Positive sentiment.

While individual perceptions may vary—what constitutes a rating of 4 for one person might differ for another—this conversion provides a reasonable foundation for deriving insights and forming opinions.
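As an illustration, the conversion can be written as a small function. This is a sketch rather than the exact code used for the analysis, and the name `rating_to_sentiment` is hypothetical:

```python
def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a three-way sentiment label."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"


# Star ratings of the five example reviews quoted in this article
example_ratings = [5, 4, 2, 3, 1]
example_labels = [rating_to_sentiment(r) for r in example_ratings]
print(example_labels)
```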

Some reviews in the dataset exceeded 512 characters. Because certain transformer models have a limit on input length (typically 512 tokens for BERT-style models), these longer reviews were truncated to their first 512 characters for analysis. As a result, sentiment classifications might differ from the intended sentiment of the full review, since the models only processed the initial portion of these longer texts.
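Because truncation can change the outcome, it helps to know which reviews were affected. The sketch below assumes the 512-character cut-off described above; the helper name and the sample texts are hypothetical and only there for illustration:

```python
MAX_CHARS = 512  # cut-off described above


def truncate_review(text: str, max_chars: int = MAX_CHARS) -> tuple[str, bool]:
    """Return the text cut to max_chars and a flag saying whether anything was cut."""
    was_truncated = len(text) > max_chars
    return text[:max_chars], was_truncated


# Placeholder texts, not taken from the Yelp data
reviews = ["Short and sweet.", "x" * 600]
results = [truncate_review(r) for r in reviews]
truncated_count = sum(flag for _, flag in results)
print(f"{truncated_count} of {len(reviews)} reviews were truncated")
```

Keeping the flag alongside each review makes it possible to compare model agreement on truncated and untruncated reviews separately.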

Review examples are provided below:

Review | Rating
OK so another winner in this neighborhood besides Nicky’s. The Penna Parma was outstanding and the Parm crusted salmon over whole wheat artichoke ravioli with a white basil sauce on special tonight was a great combo and wonderfully executed. Katie our server was professional AND personable and only added to our outstanding meal…..then there were the Pittsbugh natives seated nearby who were friendly and delightful to visit with. WOW what a place. The only downside is that its a little tight inside and a little loud but don’t let that stop you! Well worth the trip! | 5
For 30 bucks a month, this is a pretty rad spot. I’ve climbed all over the US and Canada, but I’m happy to call this my home gym. | 4
Service was okay, at best. I wouldn’t go there again. They quoted me at thousands of dollars of repairs for my car to pass inspection. I took it somewhere else and had it done for a fraction of the quote. | 2
The products and service are great, but the prices are outrageous! I am on a vegan diet and have found this co-op really helpful. If you can get past the prices, the selection is great! | 3
Worst climbing gym I have ever been to. Customer service was a joke. Wall appears as if a boy scout went ham with a few sheets of plywood back in the 80s and then did nothing since then. Holds are prehistoric and grimy. Setting is subpar. Insurance doesn’t allow for people who can’t belay to be belayed? Better off climbing the outside of the building for a good pump.. | 1

Models Compared

For the purpose of this research, I tested and compared the following models:

  1. GPT-4o
  2. DistilBERT (with no Neutral classification)
  3. TwitterRoBERTa
  4. MultilingualBERT
  5. TextBlob
  6. o1-preview (limited to 300 reviews)
  7. Gemini 1.5 (limited to 300 reviews)

Due to token limitations, I conducted a smaller comparison for o1-preview and Gemini 1.5 using only 300 reviews. While these results provide a preliminary view of their performance, further analysis is needed for a complete evaluation.

How I Performed the Analysis

To carry out the sentiment analysis for this project, I utilized a combination of tools and platforms to maximize efficiency and accuracy. Here’s a breakdown of my approach:

  1. ChatGPT and Gemini -> Uploading the Data: I began by uploading my CSV file containing the Yelp review data to ChatGPT and Gemini for sentiment analysis. To guide these AI models, I used the following prompt:
    “Please analyze the sentiment of each review in the attached file, which contains 5000 Yelp reviews. Classify each review strictly into one of the following categories: Positive, Negative, or Neutral. Return the results in a structured format, such as a table or CSV file, with two columns: ‘Review’ and ‘Sentiment.’”
    This prompt asked the models to return exactly one sentiment label for each review.
  2. Rest of the models -> Running Python Code on Google Colab: For a more hands-on analysis, I executed a Python script on Google Colab, leveraging pre-trained sentiment analysis models.
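Chat models do not always follow formatting instructions, so it is worth checking what comes back before counting labels. Here is a minimal sketch, assuming the model's reply was saved as CSV text with the 'Review' and 'Sentiment' columns requested in the prompt. The function name and the sample reply are illustrative only:

```python
import csv
import io

ALLOWED_LABELS = {"Positive", "Negative", "Neutral"}


def validate_llm_output(csv_text: str, expected_rows: int) -> list[str]:
    """Return a list of problems found in the model's CSV reply (empty if none)."""
    problems = []
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows, start=1):
        label = (row.get("Sentiment") or "").strip()
        if label not in ALLOWED_LABELS:
            problems.append(f"row {i}: unexpected label {label!r}")
    return problems


# Made-up reply for illustration, not actual model output
sample_reply = (
    "Review,Sentiment\n"
    "Great food and service,Positive\n"
    "Would not go again,Negative\n"
    "It was fine,Mixed\n"
)
problems = validate_llm_output(sample_reply, expected_rows=3)
print(problems)
```

A check like this catches skipped reviews and labels outside the three allowed categories before they distort the totals.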

For those interested in replicating this process or exploring sentiment analysis in their own research, here is a simplified version of the Python code I used:

# Step 1: Install the necessary libraries
!pip install transformers
!pip install pandas

# Step 2: Import the libraries
import pandas as pd
from transformers import pipeline

# Step 3: Load your CSV file into a Pandas DataFrame
file_name = 'your_file.csv'  # Replace 'your_file.csv' with your actual file name
df = pd.read_csv(file_name)
df.head()  # Display the first few rows to ensure it's loaded correctly

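# Step 4: Set up the sentiment analysis pipeline
# Replace 'model-name' with a sentiment classification checkpoint from the Hugging Face Hub,
# e.g., "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_pipeline = pipeline("sentiment-analysis", model="model-name")

# Step 5: Define a function to get the sentiment label
def get_sentiment(text):
    # Keep the first 512 characters; truncation=True also caps the input at the model's token limit
    result = sentiment_pipeline(str(text)[:512], truncation=True)
    # Label names vary by model (e.g., "POSITIVE", "positive", "LABEL_2", "5 stars").
    # Check the model card and extend this mapping; unmatched labels fall through to "Neutral".
    label = result[0]['label'].upper()
    if label == "POSITIVE":
        return "Positive"
    elif label == "NEGATIVE":
        return "Negative"
    else:
        return "Neutral"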

# Apply sentiment analysis to the specified text column
df['Sentiment'] = df['Text'].apply(get_sentiment)  # Replace 'Text' with your actual column name

# Step 6: Display the updated DataFrame
df.head()

# Step 7: Save the updated DataFrame to a new CSV file
df.to_csv('updated_sentiment_analysis.csv', index=False)

# Step 8: Provide a link to download the updated CSV file (for Google Colab users)
from google.colab import files
files.download('updated_sentiment_analysis.csv')
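
The script above covers the Hugging Face models. TextBlob works differently: it returns a polarity score between -1 and 1 rather than a label, so a threshold is needed to turn scores into Positive, Negative, or Neutral. The sketch below shows one common way to do that. The threshold value and the function name are assumptions for illustration, not the settings used in this analysis. With TextBlob installed, the score would come from `TextBlob(text).sentiment.polarity`.

```python
def polarity_to_label(polarity: float, threshold: float = 0.1) -> str:
    """Turn a polarity score in [-1, 1] into a three-way label.

    Scores within +/- threshold of zero count as Neutral. The default of 0.1
    is an arbitrary illustrative choice.
    """
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical polarity scores, not taken from the Yelp data
scores = [0.65, 0.05, -0.4, 0.0]
labels = [polarity_to_label(s) for s in scores]
print(labels)
```

A wider neutral band sends more reviews to Neutral, so the threshold has a direct effect on the label counts.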

Sentiment Analysis Review for 5000 Yelp Reviews

The sentiment analysis conducted on a dataset of 5000 Yelp reviews across multiple models yielded interesting variations in classification.

Here’s a brief breakdown of the results:

sentiment-analysis-results

Yelp’s original ratings showed 1847 positive, 2150 negative, and 1003 neutral reviews, serving as the baseline for comparison. GPT-4o displayed a clear positive bias, with 3973 reviews classified as positive and only 58 as neutral. DistilBERT, without a neutral option, split reviews into 2357 positive and 2643 negative, slightly favoring the negative side.

TwitterRoBERTa offered a balanced classification, with 2607 positive, 1970 negative, and 423 neutral, aligning well with Yelp’s ratings. MultilingualBERT mirrored the original sentiment spread most closely, with 2103 positive, 2044 negative, and 853 neutral, the nearest match among the models tested.

TextBlob, however, favored neutrality, labeling 3664 reviews as neutral and only 309 as positive, highlighting a potential issue in identifying polarized sentiment. Overall, while models like GPT-4o skewed towards positivity and TextBlob leaned neutral, TwitterRoBERTa and MultilingualBERT delivered more balanced results, closely reflecting Yelp’s original sentiment distribution.
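How closely a model's label distribution tracks the rating-based distribution can be put into a single number. The sketch below uses total variation distance (half the sum of absolute differences between label shares), a metric chosen here for illustration rather than one used in the original analysis. It relies on the counts reported above. The negative counts for GPT-4o and TextBlob are not stated in the text, so they are derived on the assumption that every one of the 5000 reviews received a label. DistilBERT is left out because it has no Neutral class.

```python
TOTAL = 5000

# (positive, negative, neutral) counts as reported in this article
yelp = (1847, 2150, 1003)
models = {
    # Negative counts for GPT-4o and TextBlob are derived as the remainder
    # (assumption: all 5000 reviews were labeled)
    "GPT-4o": (3973, TOTAL - 3973 - 58, 58),
    "TwitterRoBERTa": (2607, 1970, 423),
    "MultilingualBERT": (2103, 2044, 853),
    "TextBlob": (309, TOTAL - 309 - 3664, 3664),
}


def total_variation_distance(counts_a, counts_b):
    """Half the sum of absolute differences between two label distributions."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(counts_a, counts_b)
    )


distances = {
    name: total_variation_distance(counts, yelp) for name, counts in models.items()
}
for name, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{name}: {dist:.3f}")
```

A smaller distance means the label shares are closer to the rating-based shares. It says nothing about whether individual reviews were labeled correctly.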

Sentiment Analysis Review for 300 Yelp Reviews

This smaller sample of 300 Yelp reviews was analyzed across various models to compare performance on a reduced dataset. Here’s a summary of the findings:

Yelp’s original ratings had 112 positive, 129 negative, and 59 neutral reviews, establishing a benchmark for comparison. GPT-4o showed a strong bias towards positivity, labeling 240 reviews as positive and only 5 as neutral. DistilBERT, without a neutral category, split reviews almost evenly between 151 positive and 149 negative.

TwitterRoBERTa offered a balanced output, with 156 positive, 116 negative, and 28 neutral, while MultilingualBERT closely matched Yelp’s distribution with 123 positive, 118 negative, and 59 neutral reviews.

TextBlob favored neutral sentiment heavily, with 216 reviews classified as neutral and only 24 as positive.

o1-preview produced a balanced distribution, with 137 positive, 134 negative, and 29 neutral reviews. Gemini 1.5 was similar but leaned slightly towards negativity, with 129 positive, 143 negative, and 28 neutral reviews. Both models were far more moderate than GPT-4o with its strongly positive skew.
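With only 300 reviews, each share carries noticeable sampling uncertainty. As a rough illustration, the sketch below computes a 95% confidence interval for a proportion using the normal approximation, applied to two of the positive counts reported above. The helper name is hypothetical, and the approximation treats the 300 reviews as a random sample, which may not hold here.

```python
import math


def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion (z=1.96 gives ~95%)."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)


SAMPLE_SIZE = 300
# Positive counts reported in this article for the 300-review sample
positive_counts = {"Yelp ratings": 112, "Gemini 1.5": 129}

intervals = {
    name: proportion_ci(count, SAMPLE_SIZE) for name, count in positive_counts.items()
}
for name, (low, high) in intervals.items():
    share = positive_counts[name] / SAMPLE_SIZE
    print(f"{name}: {share:.1%} positive, 95% CI {low:.1%} to {high:.1%}")
```

The intervals are roughly eleven percentage points wide, so small differences between models on this sample should be read with caution.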

Overall, this analysis highlights notable differences in sentiment classification across models. While GPT-4o appears overly optimistic, models like TwitterRoBERTa and MultilingualBERT offer a more balanced distribution, aligning better with the original Yelp sentiments.
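To check how often a model agrees with the rating-based label on individual reviews, rather than only in aggregate, the predictions can be compared row by row. Here is a minimal sketch with made-up labels (not results from this analysis). The function name is illustrative:

```python
from collections import Counter


def agreement_report(true_labels, predicted_labels):
    """Return overall accuracy and a Counter of (true, predicted) label pairs."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must be non-empty and the same length")
    pairs = Counter(zip(true_labels, predicted_labels))
    correct = sum(count for (true, pred), count in pairs.items() if true == pred)
    return correct / len(true_labels), pairs


# Made-up example: rating-based labels vs. one model's predictions
rating_labels = ["Positive", "Negative", "Neutral", "Negative", "Positive"]
model_labels = ["Positive", "Negative", "Positive", "Neutral", "Positive"]

accuracy, confusion = agreement_report(rating_labels, model_labels)
print(f"Accuracy: {accuracy:.0%}")
for (true, pred), count in sorted(confusion.items()):
    print(f"rated {true}, predicted {pred}: {count}")
```

The pair counts show where disagreements concentrate, for example neutral-rated reviews that a model labels as positive.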

General conclusion

The results of this analysis revealed some unexpected trends. GPT-4o demonstrated a strong bias towards positive sentiment, while TextBlob leaned heavily toward neutral sentiment. This was particularly surprising given that TextBlob is a commonly used library, even in tools like ChatGPT and Gemini.

TwitterRoBERTa performed well, as expected, given its training on social media data similar to Yelp reviews. MultilingualBERT also stood out, effectively managing the variety in the text.

Both models are built on the BERT family of architectures (TwitterRoBERTa on RoBERTa, a retrained variant of BERT), which is pre-trained on massive amounts of text without human labels. This pre-training lets the model capture the meaning of words in context, which makes it well suited to sentiment analysis once fine-tuned on labeled examples.

Although GPT-4o didn’t perform as well as expected, o1-preview yielded a distribution much closer to the original ratings. One possible reason is that o1-preview is trained to reason through a problem before answering, as OpenAI describes in its release announcements. This suggests that newer models may interpret sentiment more accurately, although the 300-review sample is too small to be conclusive.


Overall, this analysis highlights the strengths of domain-specific models like TwitterRoBERTa and shows the power of BERT in understanding complex text data. It’s a great reminder that the right model choice can make a big difference in sentiment analysis accuracy.

Let me know if you found this analysis interesting and which models you use for sentiment analysis. I’d love to learn more from your experiences!

