Applied ML and Information Retrieval Pipelines
Overview
This project combines two university data science assignments into one applied systems case study. One module uses wearable sensor data from the HARTH dataset to classify human activities and explore unsupervised clustering. The other builds a document retrieval engine over a text collection using preprocessing, an inverted index, TF-IDF weighting, and cosine similarity ranking, then compares its results with ColBERT.
The goal was not only to train models, but to compare complete pipelines: how different data types are cleaned, represented, modeled, ranked, and evaluated. Together, the modules show the same engineering discipline applied across structured numerical signals and unstructured text.
Module A: Human Activity Recognition
The activity-recognition module uses wearable sensor readings from back and thigh accelerometers. It includes exploratory analysis, feature inspection, supervised classification, and unsupervised clustering.
Key steps:
- Combined and cleaned participant CSV files.
- Removed non-informative index columns.
- Analyzed sensor distributions, waveform behavior, and feature correlations.
- Trained and compared MLP, Random Forest, and Gaussian Naive Bayes classifiers.
- Evaluated classification performance with accuracy, precision, recall, and F1.
- Compared KMeans and DBSCAN clustering after scaling and PCA projection.
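The training and comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes scikit-learn, substitutes synthetic data from `make_classification` for the HARTH features, and the hyperparameters and the `evaluate` helper are hypothetical. Reporting training accuracy next to test accuracy is one simple way to make an overfitting gap visible.

```python
# Hypothetical sketch: synthetic data stands in for the HARTH sensor features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Six features loosely mirror two three-axis accelerometers (an assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative hyperparameters, not the values used in the project.
models = {
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gaussian NB": GaussianNB(),
}


def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each model and collect train/test accuracy plus macro-averaged metrics."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, pred, average="macro", zero_division=0
        )
        results[name] = {
            "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
            "test_accuracy": accuracy_score(y_test, pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return results


results = evaluate(models, X_train, y_train, X_test, y_test)
for name, scores in results.items():
    gap = scores["train_accuracy"] - scores["test_accuracy"]
    print(f"{name}: test={scores['test_accuracy']:.3f} f1={scores['f1']:.3f} gap={gap:.3f}")
```

The numbers printed by this sketch come from synthetic data and say nothing about the reported HARTH results.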
Notable results:
- Random Forest achieved the highest mean test accuracy, around 0.932, but its near-perfect training accuracy points to clear overfitting.
- MLP had slightly lower mean test accuracy, around 0.920, but generalized more consistently.
- KMeans produced clearer clusters than DBSCAN for this dataset.
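A clustering comparison of this kind might look like the sketch below. It assumes scikit-learn and uses synthetic blobs rather than HARTH data; the cluster count, `eps`, `min_samples`, and the `describe` helper are hypothetical choices for illustration, and the silhouette score is only one possible way to quantify how clear the clusters are.

```python
# Hypothetical sketch: synthetic blobs stand in for the scaled sensor features.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)

# Scale first so no single feature dominates, then project to two components.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Illustrative parameters, not the values used in the project.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_2d)


def describe(labels, points):
    """Summarize a clustering; DBSCAN marks noise points with the label -1."""
    clustered = labels != -1
    n_clusters = len(set(labels[clustered]))
    # The silhouette score is only defined for at least two clusters.
    score = None
    if n_clusters >= 2:
        score = float(silhouette_score(points[clustered], labels[clustered]))
    return {
        "clusters": n_clusters,
        "noise_points": int((~clustered).sum()),
        "silhouette": score,
    }


kmeans_summary = describe(kmeans_labels, X_2d)
dbscan_summary = describe(dbscan_labels, X_2d)
print("KMeans:", kmeans_summary)
print("DBSCAN:", dbscan_summary)
```

DBSCAN is sensitive to `eps` and `min_samples`, so a fair comparison would normally sweep those values instead of fixing them as done here.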
Module B: Information Retrieval
The retrieval module implements a classical search pipeline and compares it with ColBERT outputs. It covers the full retrieval loop: parsing documents and queries, preprocessing text, indexing, ranking, relevance comparison, and per-query visualization.
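The core of such a retrieval loop can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the module's implementation: the stopword list is a tiny placeholder, Porter stemming is omitted, the weighting is raw term frequency times log(N/df), and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny placeholder stopword list; stemming is omitted in this sketch.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return index


def idf_weights(index, n_docs):
    """Inverse document frequency as log(N / df)."""
    return {term: math.log(n_docs / len(postings)) for term, postings in index.items()}


def doc_norms(index, idf):
    """Euclidean length of each document's TF-IDF vector."""
    squared = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            squared[doc_id] += (tf * idf[term]) ** 2
    return {doc_id: math.sqrt(value) for doc_id, value in squared.items()}


def rank(query, index, idf, norms):
    """Return (doc_id, cosine similarity) pairs, best match first."""
    q_tf = Counter(t for t in preprocess(query) if t in index)
    q_vec = {t: tf * idf[t] for t, tf in q_tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    dots = defaultdict(float)
    for term, q_w in q_vec.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_w * tf * idf[term]
    scored = [
        (doc_id, dot / (q_norm * norms[doc_id]))
        for doc_id, dot in dots.items()
        if q_norm > 0 and norms[doc_id] > 0
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up documents for illustration only.
docs = {
    "d1": "Accelerometer signals for activity recognition",
    "d2": "Ranking documents with an inverted index",
    "d3": "Cosine similarity ranking of TF-IDF document vectors",
}
index = build_index(docs)
idf = idf_weights(index, len(docs))
norms = doc_norms(index, idf)
ranking = rank("inverted index ranking", index, idf, norms)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because only documents that share a term with the query are touched, the inverted index avoids scoring the whole collection for every query. Without stemming, "documents" and "document" remain distinct terms here.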
Key steps:
- Parsed document, query, and relevance collections.
- Applied stopword removal and Porter stemming.
- Built an inverted index from preprocessed terms.
- Implemented vector-space retrieval with TF-IDF weighting.
- Ranked documents by cosine similarity.
- Compared vector-space model variants with ColBERT results.
- Evaluated retrieval quality with precision-recall curves and mean average precision.
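The evaluation step can be illustrated with a short sketch of precision-recall points and mean average precision. It assumes binary relevance judgments and uses made-up rankings; the function names and the `runs` and `qrels` dictionaries are hypothetical, not taken from the project.

```python
def precision_recall_points(ranked, relevant):
    """Return (recall, precision) after each rank position."""
    if not relevant:
        return []
    points, hits = [], 0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Average the per-query average precision over all judged queries."""
    return sum(average_precision(runs.get(q, []), qrels[q]) for q in qrels) / len(qrels)


# Made-up rankings and relevance judgments for illustration only.
runs = {"q1": ["d3", "d2", "d1"], "q2": ["d2", "d4", "d1"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d1"}}

q1_points = precision_recall_points(runs["q1"], qrels["q1"])
map_score = mean_average_precision(runs, qrels)
print(q1_points)
print(round(map_score, 4))
```

The same functions could score any ranked list, so a vector-space run and a ColBERT run can be compared query by query on identical relevance judgments.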