Sentiment Analysis: Positive/Negative Review Classifier

A machine learning system that classifies customer reviews as positive or negative using Natural Language Processing (NLP). Leverages Naive Bayes classification, bigram extraction, and Laplace smoothing to deliver high-accuracy sentiment predictions.

Naive Bayes · NLP · NLTK · Python · Text Preprocessing

Key Features

🧠 Naive Bayes Classification

Probabilistic classifier using Laplace smoothing for robust sentiment prediction.
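The scoring rule can be sketched in a few lines. This is not the project's actual implementation; it is a minimal sketch that assumes equal class priors and token-count features (`train_counts` and `classify` are hypothetical helper names), but it shows how Laplace smoothing keeps unseen tokens from zeroing out a class score.

```python
import math
from collections import Counter


def train_counts(docs, labels):
    """Count token occurrences per class (docs are token lists)."""
    counts = {"positive": Counter(), "negative": Counter()}
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    return counts


def classify(tokens, counts, alpha=1.0):
    """Pick the class with the highest Laplace-smoothed log-likelihood."""
    vocab = set()
    for c in counts.values():
        vocab.update(c)
    best, best_score = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        score = 0.0
        for tok in tokens:
            # Laplace smoothing: unseen tokens get a small nonzero probability
            score += math.log((c[tok] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

Working in log space avoids numerical underflow when many token probabilities are multiplied together.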

📝 Text Preprocessing

Performs text cleaning including lowercasing, stop word removal, stemming, and punctuation handling.

🔗 Bigram Features

Word-pair extraction for capturing contextual sentiment indicators and improving classification accuracy.

🧪 Unit Testing

Comprehensive test suite using Python's unittest module, ensuring correctness across all preprocessing and classification functions.
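A minimal sketch of what such a test might look like, using a simplified stand-in for the preprocessing step (the `normalize` helper and test names here are illustrative, not the project's actual suite):

```python
import string
import unittest


def normalize(text):
    """Simplified stand-in for preprocessing: lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


class TestPreprocessing(unittest.TestCase):
    def test_lowercasing(self):
        self.assertEqual(normalize("Great"), "great")

    def test_punctuation_removal(self):
        self.assertEqual(normalize("great!!!"), "great")

    def test_combined(self):
        self.assertEqual(normalize("GREAT, really great."), "great really great")


if __name__ == "__main__":
    unittest.main(argv=["prog"], exit=False)
```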

📊 Model Evaluation

Performance assessment using confusion matrix, F1-score, and class-specific metrics for thorough analysis.
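Per-class F1 follows directly from confusion-matrix counts. A minimal sketch (the counts below are illustrative only, not the project's reported results):

```python
def f1_from_confusion(tp, fp, fn):
    """Per-class F1 from confusion-matrix counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts, treating "positive" as the target class
print(round(f1_from_confusion(tp=90, fp=10, fn=10), 3))  # prints 0.9
```

Computing F1 per class, rather than overall accuracy alone, exposes the gap between the easier positive class and the harder negative class.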

⚡ High Performance

Optimized implementation handling 12,000+ samples with efficient feature extraction and classification.

Technical Implementation

Core Algorithm

Python bayes_classifier.py
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

class Bayes_Classifier:
    def preprocess_text(self, text):
        # Lowercase and strip punctuation (escaped so regex metacharacters are literal)
        text = text.lower()
        text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
        tokens = word_tokenize(text)
        stop_words = set(stopwords.words("english"))
        tokens = [word for word in tokens if word not in stop_words]
        # Stem tokens, then append bigram features alongside the unigrams
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
        bigrams = list(nltk.bigrams(tokens))
        bigram_tokens = ["_".join(bigram) for bigram in bigrams]
        return tokens + bigram_tokens

Key Features

  • Text Normalization: Lowercasing and punctuation removal
  • Tokenization: Word-level tokenization using NLTK
  • Stop Word Removal: Eliminates common words that don't carry sentiment
  • Stemming: Reduces words to their root form
  • Bigram Extraction: Captures word-pair relationships
  • Laplace Smoothing: Handles unseen words gracefully
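The bigram idea from the list above is easy to see in isolation: joining adjacent token pairs turns word order into features, so a negation like "not good" survives as a single token. A minimal sketch, independent of the NLTK helper used in the project:

```python
def bigram_features(tokens):
    """Join each adjacent token pair into a single feature string."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]


# "not_good" becomes a distinct feature, so the negation is not lost
print(bigram_features(["not", "good", "at", "all"]))
# prints ['not_good', 'good_at', 'at_all']
```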

Performance Results

📈 Positive Reviews

90%+ F1-Score

Excellent performance in recognizing positive sentiment patterns and language indicators.

📉 Negative Reviews

60%+ F1-Score

Solid performance on the harder-to-predict negative class, where sentiment is often expressed through more complex language patterns.

📊 Dataset Size

12,000+ Samples

Large-scale evaluation on diverse real-world text data ensuring robust model performance.

Key Learnings

NLP Best Practices

Enhanced understanding of real-world NLP applications, the importance of thorough text preprocessing, and the power of interpretable models like Naive Bayes.

Software Engineering

Reinforced good software engineering practices such as modular coding and automated testing for machine learning systems.

Model Evaluation

Improved skills in evaluating and improving ML classifiers using appropriate statistical metrics and performance analysis.

Interested in this project?

Feel free to reach out to discuss collaboration opportunities or ask any questions.