Amazon Review Sentiment

End-to-End NLP Pipeline for Product Review Understanding

An applied science project using real-world Amazon-style product reviews to predict sentiment and star ratings.

Overview

Amazon Review Sentiment is an end-to-end natural language processing (NLP) project that predicts sentiment and rating from product reviews. It mirrors the kind of applied science work done at large scale inside e-commerce systems: cleaning noisy text, extracting features, training models, and evaluating their performance on realistic tasks.

Problem Statement

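Given a corpus of product reviews, can we automatically infer how satisfied the customer was? This project treats the question as a pair of prediction tasks: inferring the sentiment of a review and predicting its star rating from the review text.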

Pipeline

  1. Load raw review text and associated star ratings from a CSV file.
  2. Clean text (lowercasing, basic tokenization, stopword handling).
  3. Convert text to TF–IDF features using scikit-learn.
  4. Train baseline models (Logistic Regression, Linear SVM, Random Forest).
  5. Evaluate using accuracy, F1-score, and macro-averaged metrics.
  6. Inspect feature weights and error cases for interpretation.
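Steps 2 through 5 can be sketched in a few lines of scikit-learn. This is a minimal illustration under stated assumptions, not the code in src/: the toy reviews, the binary labels, the clean_text helper, and the hyperparameters are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the CSV contents (1 = positive, 0 = negative).
TRAIN = [
    ("Great product, works perfectly!", 1),
    ("Absolutely love it, excellent quality.", 1),
    ("Fast shipping and great value.", 1),
    ("Excellent, would buy again.", 1),
    ("Terrible quality, broke after a day.", 0),
    ("Awful, a complete waste of money.", 0),
    ("Broke quickly, terrible support.", 0),
    ("Waste of money, do not buy.", 0),
]
TEST = [
    ("Great quality, love it.", 1),
    ("Terrible, a waste of money.", 0),
]


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def evaluate_baselines(train, test):
    """Fit each baseline on TF-IDF features; return accuracy and macro F1 per model."""
    x_train = [clean_text(text) for text, _ in train]
    y_train = [label for _, label in train]
    x_test = [clean_text(text) for text, _ in test]
    y_test = [label for _, label in test]

    # Assumed settings; the project's actual hyperparameters are not stated.
    models = {
        "logreg": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    }
    results = {}
    for name, model in models.items():
        # TfidfVectorizer tokenizes and drops English stopwords before weighting.
        pipe = make_pipeline(TfidfVectorizer(stop_words="english"), model)
        pipe.fit(x_train, y_train)
        pred = pipe.predict(x_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "macro_f1": f1_score(y_test, pred, average="macro"),
        }
    return results


if __name__ == "__main__":
    for name, scores in evaluate_baselines(TRAIN, TEST).items():
        print(name, scores)
```

Wrapping the vectorizer and the classifier in one pipeline means the TF–IDF vocabulary is learned from the training split only, which avoids leaking test-set statistics into the features.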

Sample Data

This repository includes a tiny synthetic sample file, data/amazon_reviews_sample.csv, so everything runs out of the box. For more realistic experiments, you can plug in any public Amazon review dataset (e.g., subsets of the Stanford SNAP or Kaggle Amazon Reviews data), as long as it uses the same columns.
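As a rough sketch of what plugging in a dataset involves, the snippet below reads review rows with the standard csv module and derives a sentiment label from the star rating. The column names (review_text, rating) and the star-to-sentiment mapping are assumptions made for illustration; check the header of data/amazon_reviews_sample.csv for the actual names.

```python
import csv
import io

# Hypothetical column names; the sample file's real header may differ.
TEXT_COL = "review_text"
RATING_COL = "rating"

# In-memory stand-in for a CSV file so the example runs on its own.
SAMPLE_CSV = """review_text,rating
"Great product, works perfectly!",5
"It is okay, nothing special.",3
"Broke after a day.",1
"""


def rating_to_sentiment(stars: int) -> str:
    """Assumed mapping: 4-5 stars positive, 3 neutral, 1-2 negative."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


def load_reviews(handle):
    """Read rows from an open CSV file and attach a sentiment label to each."""
    rows = []
    for row in csv.DictReader(handle):
        stars = int(row[RATING_COL])
        rows.append(
            {
                "text": row[TEXT_COL],
                "stars": stars,
                "sentiment": rating_to_sentiment(stars),
            }
        )
    return rows


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
for review in reviews:
    print(review["stars"], review["sentiment"], review["text"])
```

To read a real file instead, pass an open file handle, for example open("data/amazon_reviews_sample.csv", newline="", encoding="utf-8"), in place of the StringIO object.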

Figure: Sample sentiment distribution plot

Repository Structure

amazon-review-sentiment-ml/
├── index.html              # Project landing page (for GitHub Pages)
├── README.md               # Full technical overview
├── assets/
│   ├── style.css           # Minimal styling for the page
│   └── sentiment_distribution.png
├── data/
│   └── amazon_reviews_sample.csv
└── src/
    ├── preprocess.py       # Text cleaning and TF–IDF feature extraction
    ├── train_models.py     # Model training & evaluation
    └── utils.py            # Shared helpers

How to Reproduce the Experiments

  1. Clone the repository.
  2. Create a Python environment and install the dependencies listed in README.md.
  3. Optionally replace data/amazon_reviews_sample.csv with a larger public Amazon review dataset.
  4. Run src/preprocess.py to build TF–IDF features.
  5. Run src/train_models.py to train and evaluate baseline NLP models.

Relevance to Applied Science

This project reflects the end-to-end workflow of an applied scientist: