Manuel Alejandro Matías Astorga

🎯 Project Overview

This project analyzes a dataset of 50,000 IMDb movie reviews, each labeled as positive or negative. The goal is to classify sentiment through a two-phase workflow:

Exploratory Data Analysis and preprocessing in R
Sentiment classification modeling and deployment in Python

The first phase leverages R’s rich ecosystem for text manipulation and data visualization. The second phase (planned) will train machine learning models and deploy a classifier in Python.

🛠️ Tools & Libraries

R: tidyverse, tidytext, ggplot2, udpipe, SnowballC, textstem, DT
Python (planned): scikit-learn, nltk, pandas, matplotlib, PyTorch
Techniques: Tokenization, POS tagging, stemming, lemmatization, n-grams, polarity
Format: R Markdown (.Rmd) → HTML (via RPubs)

📊 Key Explorations

Sentiment class distribution and review length analysis
HTML cleaning, stopword removal, punctuation & casing normalization
POS tagging using udpipe
Stemming (SnowballC) and lemmatization (textstem)
N-gram analysis for phrase structure insight
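The cleaning and n-gram steps listed above were carried out in R; as a rough illustration of the same ideas for the planned Python phase, a minimal standard-library sketch might look like the following. The stopword list, the sample review, and the function names `clean_review` and `ngrams` are assumptions made for this example, not code from the project notebook.

```python
import html
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "was", "of", "to", "in", "i"}

TAG_RE = re.compile(r"<[^>]+>")         # HTML tags such as <br />
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # anything except lowercase letters and whitespace


def clean_review(text: str) -> list[str]:
    """Strip HTML, normalize case and punctuation, and drop stopwords."""
    text = TAG_RE.sub(" ", text)        # remove tags first
    text = html.unescape(text)          # decode entities such as &amp;
    text = text.lower()                 # casing normalization
    text = NON_ALPHA_RE.sub(" ", text)  # punctuation and digits become spaces
    return [tok for tok in text.split() if tok not in STOPWORDS]


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every window of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical review text, written for this example.
sample = "This movie was GREAT!<br /><br />Great acting &amp; a great script."
tokens = clean_review(sample)
bigram_counts = Counter(ngrams(tokens, 2))
```

Note that replacing punctuation with spaces splits contractions ("don't" becomes "don" and "t"), which is one of the preprocessing choices worth checking before modeling.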

📈 Visualizations Preview

Some of the visualizations in the EDA notebook include:

Word frequency lollipop charts
Sentiment-based word clouds
N-gram distribution (bigrams & trigrams)
Polarity sentiment barplots
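The word-frequency charts and word clouds are built on per-class token counts. A small sketch of that counting step, assuming the reviews are already tokenized, is shown below; the `labeled_reviews` data and the `top_words_by_class` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict

# Hypothetical, already-tokenized reviews paired with their sentiment labels.
labeled_reviews = [
    (["great", "acting", "great", "script"], "positive"),
    (["loved", "great", "cast"], "positive"),
    (["boring", "plot", "bad", "acting"], "negative"),
    (["bad", "script", "bad", "pacing"], "negative"),
]


def top_words_by_class(reviews, k=2):
    """Count tokens separately per label and keep the k most frequent."""
    counts = defaultdict(Counter)
    for tokens, label in reviews:
        counts[label].update(tokens)
    return {label: counter.most_common(k) for label, counter in counts.items()}


top_words = top_words_by_class(labeled_reviews)
```

The resulting `(word, count)` pairs are the kind of table a lollipop chart or word cloud would be drawn from.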

🚧 Interactive notebook will be available soon via RPubs.

📘 Full Exploratory Report

🛠️ The full report with interactive visualizations is currently being compiled and will be published shortly on RPubs.

🔮 Next Steps

Export cleaned data to .csv for model training
Build classifiers using:
- Logistic Regression & Naive Bayes (baseline)
- Pipeline-based ML models (scikit-learn)
- Deep learning model using PyTorch (planned)
Evaluation: Confusion Matrix, F1 Score, ROC-AUC
Optionally deploy via Streamlit or Apache Spark
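Since the modeling phase has not been implemented yet, the following is only a sketch of what a scikit-learn baseline with the listed metrics could look like. The toy `texts` and `labels`, the TF-IDF settings, and the split parameters are all assumptions chosen for illustration; the real run would load the cleaned .csv export instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for the cleaned reviews (hypothetical data, not the IMDb set).
positive = [
    "great acting and a great script", "loved the cast", "wonderful touching story",
    "brilliant direction", "a fun and clever film", "superb performances throughout",
    "beautiful and moving", "excellent pacing and great ending",
]
negative = [
    "boring plot and bad acting", "terrible script", "a dull and lifeless film",
    "awful pacing", "bad dialogue and worse ending", "poor direction throughout",
    "weak story and boring cast", "a complete waste of time",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)  # 1 = positive, 0 = negative

# Stratified split keeps both classes present in the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorizer and classifier live in one pipeline so the vocabulary is fit on training data only.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
y_score = baseline.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_score)
```

Swapping the `"clf"` step for `sklearn.naive_bayes.MultinomialNB()` would give the Naive Bayes baseline with the same evaluation code. Scores on a toy set this small are not meaningful; the sketch only shows how the pieces fit together.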

🧾 Deliverables

01_EDA.Rmd — Core notebook (R-based)
Cleaned datasets (stemmed, lemmatized, udpipe); export planned for modeling phase
EDA report (RPubs) — Coming soon
model_sentiment.py — (Coming soon)
Streamlit or Apache Spark deployment — (Planned)

📌 Outcome

Completed an EDA and text-processing pipeline in R covering cleaning, POS tagging, stemming, lemmatization, and n-gram analysis.
This lays the foundation for cross-platform sentiment classification in Python.

🧠 What I Learned

R is powerful for quick and elegant EDA and text visualization.
Handling natural language data requires both linguistic and statistical intuition.
Preprocessing choices (e.g., stemming vs. lemmatization) can substantially affect downstream model performance.

🔗 View this project on GitHub

📌 Note: This project is currently in Phase 1 (EDA & preprocessing). The modeling and deployment phase will follow shortly.

Last updated: 2025-04-20
