Overview
Amazon Review Sentiment is an end-to-end natural language processing (NLP) project that predicts sentiment and rating from product reviews. It mirrors the kind of applied science work done at large scale inside e-commerce systems: cleaning noisy text, extracting features, training models, and evaluating their performance on realistic tasks.
Problem Statement
Given a corpus of product reviews, can we automatically infer how satisfied the customer was? In this project, we treat this as:
- A sentiment classification task (positive / negative / neutral)
- A rating prediction task (predicting 1–5 star ratings via regression or classification)
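The two tasks are linked through the star rating: sentiment labels can be derived from it. A minimal sketch of one common mapping (the 3-star cutoff is an illustrative assumption, not something fixed by this project):

```python
def rating_to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    The thresholds are illustrative: 1-2 stars -> negative,
    3 -> neutral, 4-5 -> positive.
    """
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


print(rating_to_sentiment(5))  # positive
print(rating_to_sentiment(2))  # negative
```

Collapsing ratings this way turns the regression labels into classification labels, so both tasks can be trained from the same CSV.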
Pipeline
- Load raw review text and associated star ratings from a CSV file.
- Clean text (lowercasing, basic tokenization, stopword handling).
- Convert text to TF–IDF features using scikit-learn.
- Train baseline models (Logistic Regression, Linear SVM, Random Forest).
- Evaluate using accuracy, F1-score, and macro-averaged metrics.
- Inspect feature weights and error cases for interpretation.
Sample Data
This repository includes a tiny synthetic sample file,
data/amazon_reviews_sample.csv, so everything runs out of the box.
For more realistic experiments, you can plug in any public Amazon review dataset
(e.g., subsets from the Stanford SNAP or Kaggle Amazon Reviews data) using the same columns.
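Loading works the same way whichever dataset you plug in. This sketch assumes columns named review_text and star_rating, which may not match the actual sample file's schema, and uses an inline CSV so it is self-contained:

```python
import io

import pandas as pd

# Inline stand-in for data/amazon_reviews_sample.csv; the column names
# review_text and star_rating are assumptions about the schema.
csv_text = """review_text,star_rating
"Fast shipping and works great",5
"Stopped working after a week",1
"Okay for the price",3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Any replacement dataset just needs the same two columns.
print(df.shape)                    # (3, 2)
print(df["star_rating"].tolist())  # [5, 1, 3]
```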
Repository Structure
amazon-review-sentiment-ml/
├── index.html # Project landing page (for GitHub Pages)
├── README.md # Full technical overview
├── assets/
│ ├── style.css # Minimal styling for the page
│ └── sentiment_distribution.png
├── data/
│ └── amazon_reviews_sample.csv
└── src/
├── preprocess.py # Text cleaning and TF–IDF feature extraction
├── train_models.py # Model training & evaluation
└── utils.py # Shared helpers
How to Reproduce the Experiments
- Clone the repository.
- Create a Python environment and install dependencies listed in README.md.
- Optionally replace data/amazon_reviews_sample.csv with a larger public Amazon review dataset.
- Run src/preprocess.py to build TF–IDF features.
- Run src/train_models.py to train and evaluate baseline NLP models.
Relevance to Applied Science
This project reflects the end-to-end workflow of an applied scientist:
- Framing a real business question as measurable ML tasks.
- Designing a text processing and feature extraction pipeline.
- Training and evaluating multiple models with clear metrics.
- Communicating design decisions and results in a reproducible repository.