Overview
Amazon Review Sentiment is an end-to-end natural language processing (NLP) project that predicts sentiment and rating from product reviews. It mirrors the kind of applied science work done at large scale inside e-commerce systems: cleaning noisy text, extracting features, training models, and evaluating their performance on realistic tasks.
Problem Statement
Given a corpus of product reviews, can we automatically infer how satisfied the customer was? In this project, we treat this as:
- A sentiment classification task (positive / negative / neutral)
- A rating prediction task (predicting 1–5 star ratings via regression or classification)
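The two tasks are linked through the star rating: sentiment labels can be derived from it. A minimal sketch of one common mapping (the 3-star cutoff is an illustrative assumption, not something fixed by this project):

```python
def rating_to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    The thresholds are illustrative: 1-2 stars -> negative,
    3 -> neutral, 4-5 -> positive.
    """
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


print(rating_to_sentiment(5))  # positive
print(rating_to_sentiment(2))  # negative
```

Collapsing ratings this way turns the regression labels into classification labels, so both tasks can be trained from the same CSV.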
Pipeline
- Load raw review text and associated star ratings from a CSV file.
- Clean text (lowercasing, basic tokenization, stopword handling).
- Convert text to TF–IDF features using scikit-learn.
- Train baseline models (Logistic Regression, Linear SVM, Random Forest).
- Evaluate using accuracy, F1-score, and macro-averaged metrics.
- Inspect feature weights and error cases for interpretation.
Sample Data
This repository includes a tiny synthetic sample file,
data/amazon_reviews_sample.csv, so everything runs out of the box.
For more realistic experiments, you can plug in any public Amazon review dataset
(e.g., subsets from the Stanford SNAP or Kaggle Amazon Reviews data) using the same columns.
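Loading works the same way whichever dataset you plug in. This sketch assumes columns named review_text and star_rating, which may not match the actual sample file's schema, and uses an inline CSV so it is self-contained:

```python
import io

import pandas as pd

# Inline stand-in for data/amazon_reviews_sample.csv; the column names
# review_text and star_rating are assumptions about the schema.
csv_text = """review_text,star_rating
"Fast shipping and works great",5
"Stopped working after a week",1
"Okay for the price",3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Any replacement dataset just needs the same two columns.
print(df.shape)                    # (3, 2)
print(df["star_rating"].tolist())  # [5, 1, 3]
```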
Repository Structure
amazon-review-sentiment-ml/
├── index.html # Project landing page (for GitHub Pages)
├── README.md # Full technical overview
├── assets/
│ ├── style.css # Minimal styling for the page
│ └── sentiment_distribution.png
├── data/
│ └── amazon_reviews_sample.csv
└── src/
├── preprocess.py # Text cleaning and TF–IDF feature extraction
├── train_models.py # Model training & evaluation
└── utils.py # Shared helpers
How to Reproduce the Experiments
- Clone the repository.
- Create a Python environment and install dependencies listed in README.md.
- Optionally replace data/amazon_reviews_sample.csv with a larger public Amazon review dataset.
- Run src/preprocess.py to build TF–IDF features.
- Run src/train_models.py to train and evaluate baseline NLP models.
Relevance to Applied Science
This project reflects the end-to-end workflow of an applied scientist:
- Framing a real business question as measurable ML tasks.
- Designing a text processing and feature extraction pipeline.
- Training and evaluating multiple models with clear metrics.
- Communicating design decisions and results in a reproducible repository.