Sentiment Analysis: Positive/Negative Review Classifier
A machine learning system that classifies customer reviews as positive or negative using Natural Language Processing (NLP). It combines Naive Bayes classification, bigram feature extraction, and Laplace smoothing to deliver high-accuracy sentiment predictions.
Key Features
Naive Bayes Classification
Probabilistic classifier using Laplace smoothing for robust sentiment prediction.
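A minimal sketch of how Laplace-smoothed Naive Bayes scoring over bag-of-words features can look; the train/classify functions and the "positive"/"negative" label names here are illustrative assumptions, not the project's actual implementation.

import math
from collections import defaultdict

def train(samples):
    # samples: list of (token_list, label) pairs, e.g. output of preprocessing
    word_counts = {"positive": defaultdict(int), "negative": defaultdict(int)}
    class_counts = defaultdict(int)
    vocab = set()
    for tokens, label in samples:
        class_counts[label] += 1
        for token in tokens:
            word_counts[label][token] += 1
            vocab.add(token)
    return word_counts, class_counts, vocab

def classify(tokens, word_counts, class_counts, vocab):
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in word_counts:
        # Log prior plus Laplace-smoothed log likelihoods:
        # P(token | label) = (count + 1) / (total_words_in_label + |vocab|)
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

The add-one count in the numerator is the Laplace smoothing: a token never seen with a class still gets a small nonzero probability instead of zeroing out the whole product.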
Text Preprocessing
Performs text cleaning including lowercasing, stop word removal, stemming, and punctuation handling.
Bigram Features
Word-pair extraction for capturing contextual sentiment indicators and improving classification accuracy.
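As a quick illustration of what these word-pair features look like (the tokens are made up; NLTK's bigrams helper is assumed, as in the Core Algorithm below):

import nltk

# Hypothetical word tokens from a review
tokens = ["customer", "service", "excellent"]
# Consecutive word pairs joined with "_" so they can be counted
# like ordinary tokens; "customer_service" carries context that
# the individual words do not.
bigram_tokens = ["_".join(pair) for pair in nltk.bigrams(tokens)]
print(bigram_tokens)  # ['customer_service', 'service_excellent']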
Unit Testing
Comprehensive test suite using Python's unittest module, ensuring correctness across all preprocessing and classification functions.
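A minimal sketch of one such test; the module name and the specific assertions are hypothetical, not taken from the project's actual suite.

import unittest

from bayes_classifier import Bayes_Classifier  # hypothetical module name

class TestPreprocessing(unittest.TestCase):
    def test_preprocess_strips_punctuation_and_stems(self):
        classifier = Bayes_Classifier()
        features = classifier.preprocess_text("Loved the food!")
        # Punctuation gone, stop words dropped, remaining words stemmed
        self.assertIn("love", features)
        self.assertIn("food", features)
        self.assertNotIn("!", features)
        self.assertNotIn("the", features)

if __name__ == "__main__":
    unittest.main()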
Model Evaluation
Performance assessment using confusion matrix, F1-score, and class-specific metrics for thorough analysis.
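A sketch of the kind of per-class metrics involved, computed from true and predicted labels; the evaluate function and label values below are assumptions for illustration.

def evaluate(true_labels, predicted_labels, positive="positive"):
    # Confusion-matrix cells, treating `positive` as the target class
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"confusion_matrix": [[tp, fn], [fp, tn]],
            "precision": precision, "recall": recall, "f1": f1}

Swapping the positive argument gives the same metrics for the negative class, which is how class-specific performance can be compared.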
High Performance
Optimized implementation handling 12,000+ samples with efficient feature extraction and classification.
Technical Implementation
Core Algorithm
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

class Bayes_Classifier:
    def preprocess_text(self, text):
        # Normalize: lowercase and strip punctuation
        text = text.lower()
        text = re.sub(f"[{string.punctuation}]", "", text)
        # Tokenize and drop English stop words
        tokens = word_tokenize(text)
        stop_words = set(stopwords.words("english"))
        tokens = [word for word in tokens if word not in stop_words]
        # Stem the remaining words, then append bigram features
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
        bigrams = list(nltk.bigrams(tokens))
        bigram_tokens = ["_".join(bigram) for bigram in bigrams]
        return tokens + bigram_tokens
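For example, feeding a short made-up review through this method produces stemmed unigrams followed by joined bigrams:

classifier = Bayes_Classifier()
features = classifier.preprocess_text("The food was not great.")
# Stop words ("the", "was", "not") are dropped and the rest stemmed,
# leaving unigrams plus one bigram feature:
print(features)  # ['food', 'great', 'food_great']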
Key Features
- Text Normalization: Lowercasing and punctuation removal
- Tokenization: Word-level tokenization using NLTK
- Stop Word Removal: Eliminates common words that don't carry sentiment
- Stemming: Reduces words to their root form
- Bigram Extraction: Captures word-pair relationships
- Laplace Smoothing: Handles unseen words gracefully
Performance Results
Positive Reviews
Excellent performance in recognizing positive sentiment patterns and language indicators.
Negative Reviews
Respectable performance on the harder-to-predict negative sentiment class, which involves more complex language patterns.
Dataset Size
Large-scale evaluation on 12,000+ diverse real-world reviews, ensuring robust model performance.
Key Learnings
NLP Best Practices
Enhanced understanding of real-world NLP applications, the importance of thorough text preprocessing, and the power of interpretable models like Naive Bayes.
Software Engineering
Reinforced good software engineering practices such as modular coding and automated testing for machine learning systems.
Model Evaluation
Improved skills in evaluating and refining ML classifiers using appropriate statistical metrics and performance analysis.
Interested in this project?
Feel free to reach out to discuss collaboration opportunities or ask any questions.