Predicting Star Ratings: Sentiment Analysis Built on MongoDB

Nabiha Naqvie | Sunday, Nov 14, 2021 |  AI ML

If you want to build a robust machine learning model, the most important ingredient is data, but keep in mind that tuning your model will rely on devising a systematic way to store and query that data! In this post, we explore a project to understand a large dataset of Amazon reviews and predict star ratings using open source sentiment analyzers and the MongoDB ecosystem.

The complete project is posted here:


Customer reviews have become an epicenter of the e-commerce industry. Amazon, one of the best-known e-commerce companies, has seen an estimated $2.7 billion increase in annual revenue, and sales revenue of $107 billion in 2015, solely due to customer reviews [1]. The importance of reviews to businesses sparked our interest in understanding reviews from a company perspective, particularly how to narrow the focus for smaller companies that may not have the headcount or marketing savvy to preserve their online reputations.

The Project

This project came out of the Georgetown Data Science Program, which is centered on applied machine learning and focused on exercising skills across the full predictive pipeline, from data ingestion and storage to feature analysis, algorithm selection, and hyperparameter tuning. The program culminates in a capstone project that produces a data product to illustrate practical use of these skills. For an additional challenge, we decided to focus the project on using the sentiment of reviews to predict star ratings.

From the perspective of the business problem, our intent was to highlight for companies which reviews require the most urgent response, maximizing their opportunity to change the commenter from a detractor to a supporter. One of the goals of the project was to explore various models to find the best-suited model for sentiment analysis. We experimented with two different sentiment analyzers: EmoMap [2] and NLTK's VADER score.
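To make the VADER side of this concrete, here is a minimal sketch of scoring a review and turning the score into a coarse label. It is an illustration, not the project's actual code: the helper names `label_from_compound` and `vader_compound` are hypothetical, and the 0.05 cutoff is the threshold conventionally suggested for VADER's compound score rather than a value taken from our experiments.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label.

    The 0.05 threshold is the conventional VADER cutoff (an assumption here).
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_compound(text):
    """Return the VADER compound score for one review (requires nltk)."""
    # Imported lazily so the labeling helper above works without nltk installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # fetches the lexicon if missing
    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
```

With nltk installed, `label_from_compound(vader_compound("The charger stopped working after two days."))` would give a label that could be used to flag reviews for follow-up.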

The Data

Our dataset was the UCSD Amazon reviews dataset [3], which contains 142.8 million reviews collected from May 1996 to July 2014. For text data management, the best choice is often to store data in a NoSQL document storage database that allows streaming reads of the documents with minimal overhead, or to simply write each document to disk [4]. Since the UCSD Amazon reviews are text stored in JSON format, MongoDB was the best storage platform for our purposes. MongoDB is a document-oriented database, classified as NoSQL. NoSQL stands for "not only SQL" and covers a variety of non-relational databases. MongoDB is commonly known for storing JSON documents.
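As a rough sketch of what "streaming reads with minimal overhead" can look like before anything reaches a database, the generator below parses reviews one line at a time instead of loading the whole file. It assumes the reviews are stored as newline-delimited JSON and that each record carries `reviewText` and `overall` fields; the function name `iter_reviews` is hypothetical.

```python
import json


def iter_reviews(lines):
    """Yield one review dict per non-empty line of newline-delimited JSON.

    `lines` can be any iterable of strings, including an open file handle,
    so only one review is held in memory at a time.
    """
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Small in-memory example; field names are assumed, not taken from the project.
sample = [
    '{"reviewText": "Works great", "overall": 5.0}',
    "",
    '{"reviewText": "Broke in a week", "overall": 1.0}',
]
reviews = list(iter_reviews(sample))
```

Passing an open file (for example `open(path, encoding="utf-8")`) in place of `sample` streams a file from disk in the same way.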

The Mongo Ecosystem

As a first-time user, I had to learn that there are two components to using the database: MongoDB Atlas and MongoDB Compass. MongoDB Compass is a GUI that allows users to interact with the data, whereas MongoDB Atlas is a storage platform that creates and distributes the clusters and connects multiple users to MongoDB Compass. Compass is particularly convenient because it provides the opportunity to connect to the data with Jupyter Notebook. The drawbacks of working with the free tier are the 512MB data storage limit (unless you are willing to pay) and the bit of a learning curve involved in connecting MongoDB Atlas to MongoDB Compass. The main advantage of working with MongoDB was the ease of importing the data from MongoDB Compass to Jupyter Notebook.
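One common way to pull documents from a cluster into a notebook is the `pymongo` driver, so the sketch below assumes that route; it is not necessarily how our notebooks were wired up. The connection string, database name, collection name, and the `overall` and `reviewText` field names are all placeholders you would replace with your own.

```python
def build_rating_filter(min_stars=1, max_stars=5):
    """Build a MongoDB query document selecting reviews in a star range."""
    return {"overall": {"$gte": min_stars, "$lte": max_stars}}


def fetch_reviews(uri, db_name, collection_name, query, limit=1000):
    """Return up to `limit` matching reviews as a list of dicts.

    Requires the pymongo package and a reachable cluster; `uri` is the
    connection string shown in the Atlas UI.
    """
    from pymongo import MongoClient  # lazy import: only needed when querying

    client = MongoClient(uri)
    try:
        cursor = client[db_name][collection_name].find(
            query,
            {"_id": 0, "reviewText": 1, "overall": 1},  # projection
        ).limit(limit)
        return list(cursor)
    finally:
        client.close()


# Query document for 1- and 2-star reviews (no database connection needed).
low_star_query = build_rating_filter(1, 2)
```

Keeping the query in its own small function makes it easy to reuse the same filter in Compass, where the query bar accepts the same document syntax.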

Data Wrangling

We cleaned the data thoroughly, in particular by tokenizing and lemmatizing the review text, and added multiple binary categories for Exploratory Data Analysis (EDA). One example of a binary category was dividing the star ratings into positive (1) or negative (0). Once the data was prepared, we conducted initial feature analysis, where both sentiment analyzers scored higher than the other features. The data sets were split into train, validate, and test sets at a 60/20/20 split. The addition of a validation set allowed us to test for overfitting for the variables used in the VADER and Five Emotions models. The data sets were randomized at various stages to remove any effect of review order in the initial data set. After the team decided which variables to include in the models, the splits were changed to a train/test set with an 80/20 split.
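The labeling and splitting steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than our pipeline: the post does not say where the positive/negative cutoff sits, so the 4-star threshold below is assumed, and the function names and seed are hypothetical.

```python
import random


def binary_label(stars, positive_min=4):
    """Return 1 for a positive review, 0 otherwise (cutoff is an assumption)."""
    return 1 if stars >= positive_min else 0


def train_validate_test_split(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows, then cut them into train, validate, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # removes any ordering in the input
    n_train = round(len(shuffled) * fractions[0])
    n_validate = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # whatever remains
    return train, validate, test


ratings = [5, 1, 4, 2, 3] * 20  # 100 toy star ratings
labels = [binary_label(stars) for stars in ratings]
train, validate, test = train_validate_test_split(labels)
```

Setting `fractions=(0.8, 0.0, 0.2)` gives the later 80/20 train/test arrangement with an empty validation list.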

Model Selection

For the machine learning analysis, we took four different approaches and compared their model accuracy. The approaches were: a TF-IDF word vectorizer; a Doc2Vec document vectorizer; a Five Emotions (joy, anger, sadness, fear, and disgust) model; and a VADER-focused approach. The team modelled these approaches with a number of regressors and classifiers. The models predicted both a binary class (positive or negative review) and a specific 1-5 star rating. Models used included Binary and Multinomial Logistic Regression, Support Vector Machines (SVM), Random Forests (RF), Naive Bayes Classifiers (NBC), and K-Nearest Neighbors (KNN). Different combinations of the models were used for each of the four approaches to compare and contrast accuracy and precision.
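For readers new to TF-IDF, the from-scratch sketch below shows the idea: a term's weight is its frequency within a review multiplied by the log of how rare it is across all reviews. This is the plain textbook variant, written only for illustration; a library vectorizer such as scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / docs containing term)
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [["great", "product"], ["terrible", "product"]]
vectors = tfidf_vectors(docs)
```

In this toy example "product" appears in every review and gets a weight of zero, while "great" and "terrible" carry all the signal, which is the behavior that makes TF-IDF features useful for separating positive from negative reviews.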


Our strongest overall approach, in terms of accuracy, was TF-IDF. For the binary analysis, the logistic regression performed very well. For the multiclass analysis, TF-IDF was also the strongest performer. Our second most accurate approach was Doc2Vec. It performed well overall, but lagged behind TF-IDF by around 3 percentage points for both binary and multiclass. The VADER and Five Emotions approaches were both a tier below the word vectorizers in terms of binary accuracy, with VADER being the stronger of the two. Similarly, both VADER and Five Emotions models significantly underperformed the word vectorizer models for multiclass.

Next Steps

Moving forward, we plan to create a GUI or a dashboard that will alert merchants to specific negative reviews or feedback. The objective is to highlight opportunities to mitigate reputation damage, convert undecided shoppers to satisfied customers, and identify responses to campaigns that generate unintended negative responses.

To see our implementation, go to:


  1. Haque, M. E., Tozal, M. E., & Islam, A. (2018, August). Helpfulness prediction of online product reviews. In Proceedings of the ACM Symposium on Document Engineering 2018 (pp. 1-4).

  2. EmoMap (2018).

  3. Amazon review data (2018).

  4. Bengfort, B., Bilbro, R., & Ojeda, T. (2018). Applied Text Analysis with Python. O’Reilly Media, Inc.
