Projects
I have worked on a diverse range of data science projects through university, professional work, and my spare time.
These projects cover a variety of topics and use important data science tools including Python, R, SQL, Tableau, and Power BI.
Highlighted below is a selection of the most engaging projects I have worked on recently.
Watch this space for more interesting and insightful data science projects coming soon.
Time-Series Analysis and Sales Forecasting with Prophet
This project explores a systematic and practical approach to time-series forecasting using Facebook Prophet to predict
grocery sales in a large supermarket chain. By focusing on the "Grocery" product family, the analysis highlights iterative
refinements through feature engineering, incorporation of external factors, and hyperparameter optimisation to enhance predictive accuracy. Key highlights include:
- Incorporated regressors for better anomaly handling, including holidays, outliers, and payday effects.
- Fine-tuned hyperparameters using cross-validation to enhance the model's ability to generalise to unseen data.
- Greatly improved performance metrics over the baseline model, achieving 92.97% prediction accuracy on the test set.
- Utilised Python for data preparation, model development, and analysis, along with detailed exploratory data analysis and visualisations.
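A payday indicator of the kind listed above could be built along these lines. This is only a minimal sketch: the payday rule (the 15th and the last day of each month) and the helper name `is_payday` are assumptions for illustration, not details taken from the project. A 0/1 column like this could then be registered with Prophet through its `add_regressor` method.

```python
import calendar
from datetime import date


def is_payday(day: date) -> int:
    """Return 1 if `day` is the 15th or the last day of its month, else 0.

    The payday rule is an assumption made for this example only.
    """
    last_day = calendar.monthrange(day.year, day.month)[1]
    return int(day.day == 15 or day.day == last_day)


# Flag three days in February 2017 (a 28-day month).
flags = [is_payday(date(2017, 2, d)) for d in (14, 15, 28)]
```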
View Project
Natural Language Processing to Categorize News Articles
This project focused on classifying over 210,000 HuffPost news article headlines into predefined categories using Natural Language Processing (NLP) techniques. Key highlights include:
- Utilised Python, Keras, NLTK and other libraries for data preprocessing and visualization.
- Prepared the text data using TF-IDF vectorization, tokenization, and sequence padding.
- Developed a Logistic Regression model for a baseline, followed by GRU and LSTM deep learning models.
- Significantly improved classification accuracy for 31 news article categories using GRU, highlighting the impact of effective
data preparation and deep learning techniques in NLP tasks.
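The tokenization and sequence padding steps mentioned above can be illustrated with a small standard-library sketch. It is not the project's code: the helper names `build_vocab` and `encode_and_pad`, the whitespace tokenizer, and the choice to pad on the right are all assumptions made for the example.

```python
def build_vocab(headlines):
    """Map each distinct lowercase token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for text in headlines:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode_and_pad(headlines, vocab, max_len):
    """Turn headlines into id sequences, truncated or right-padded with 0 to max_len."""
    padded = []
    for text in headlines:
        ids = [vocab[t] for t in text.lower().split() if t in vocab][:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded


# Hypothetical headlines, used only to show the shape of the output.
headlines = ["Markets rally today", "Rally fades"]
vocab = build_vocab(headlines)
sequences = encode_and_pad(headlines, vocab, max_len=4)
```

Every row in `sequences` has the same length, which is what recurrent layers such as GRU and LSTM expect when inputs are batched.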
View Project
ACT Crime Power BI Dashboard
Developed interactive Power BI dashboards to analyze and visualize crime trends in Canberra, utilizing datasets sourced from the ACT Police website.
- Wrangled datasets using Python, Excel and Power Query for smooth integration into Power BI.
- Created custom DAX measures for calculations and aggregations, enabling dynamic data analysis and visualization.
- Designed two dashboards for different data formats (June 2022 and August 2024 datasets).
- Incorporated geographic visualizations to highlight crime patterns by suburb and district.
- Prepared the PBIX files for future updates with automated data wrangling scripts.
- Developed brief presentations to showcase the key insights and findings derived from the dashboard.
View Project
Predictive Modeling of Eating Out in Sydney
Analyzed dining-out trends in Sydney with Python, using a real-world dataset of over 10,000 restaurants.
- Explored the relationships between restaurant ratings, cost, and cuisines to identify trends and patterns.
- Built regression models to predict numeric customer ratings based on features like cost, location, and cuisine type.
- Developed classification models to categorize restaurants into two success levels based on ratings: High and Low.
- Applied feature engineering techniques such as target encoding, one-hot encoding, and log transformations to optimize model performance.
- Deployed results in a Docker container for reproducibility and showcased findings with an interactive Tableau Dashboard.
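Target encoding and log transformation, two of the feature engineering techniques named above, can be sketched in plain Python as follows. The function name `target_encode` and the sample values are hypothetical. In practice the category means should be computed on training data only, to avoid leaking the target into the features.

```python
import math
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


# Hypothetical sample rows: cuisine type, rating, and cost.
cuisines = ["thai", "thai", "italian"]
ratings = [4.0, 3.0, 4.5]
costs = [20.0, 45.0, 120.0]

encoded_cuisine, cuisine_means = target_encode(cuisines, ratings)

# log1p compresses the long right tail that cost-like features often have.
log_costs = [math.log1p(cost) for cost in costs]
```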
View Project
Black Hole Cyber Attack Detection in Wireless MANET Networks
Led and managed a capstone project aimed at enhancing the security of Mobile Ad Hoc Networks (MANETs) by using machine learning to detect black hole cyber attacks. The project optimized the AODV routing protocol to address disruptions caused by malicious nodes.
- Simulated MANET networks with the NS3 simulator, incorporating black hole attack scenarios for analysis.
- Developed a Python script to transform AODV protocol messages exchanged between network nodes into a dataset suitable for machine learning.
- Developed accurate machine learning models using Scikit-learn to detect and classify malicious nodes.
View Project
Predicting Flight Delays
Developed a project to predict flight delays at major U.S. airports using machine learning techniques in Python and cloud-based solutions.
The project utilized US flight data along with airport weather and holiday information to build predictive models.
- Performed extensive feature engineering, including adding weather conditions, holiday indicators, and time-based features (e.g., departure hours, day of the week).
- Built baseline models locally and implemented advanced models, including XGBoost, using AWS SageMaker for scalable training and evaluation.
- Optimized classification thresholds to improve delay recall, achieving a significant performance boost for minority class predictions.
- Created a Tableau dashboard to visualize trends and delays, providing actionable insights for stakeholders.
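The threshold optimization described above can be sketched without any libraries. This is one possible approach under stated assumptions, not the project's method: the helpers `recall_at` and `pick_threshold`, the candidate thresholds, and the sample probabilities are all hypothetical. Lowering the threshold raises recall for the delay class, usually at the cost of more false alarms.

```python
def recall_at(threshold, probabilities, labels):
    """Recall for the positive class (label 1) when predicting 1 if p >= threshold."""
    true_pos = sum(1 for p, y in zip(probabilities, labels) if y == 1 and p >= threshold)
    actual_pos = sum(labels)
    return true_pos / actual_pos if actual_pos else 0.0


def pick_threshold(probabilities, labels, target_recall, candidates):
    """Return the highest candidate threshold whose recall meets target_recall, or None."""
    for threshold in sorted(candidates, reverse=True):
        if recall_at(threshold, probabilities, labels) >= target_recall:
            return threshold
    return None


# Hypothetical predicted delay probabilities and true labels (1 = delayed).
probs = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1, 0, 1, 1, 0, 0]
chosen = pick_threshold(probs, labels, target_recall=1.0, candidates=[0.5, 0.4, 0.3])
```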
View Project
Data Cleaning and Analysis Using SQL
Cleaned and analyzed a global layoffs dataset using various SQL techniques to ensure data integrity. Uncovered meaningful insights
about layoffs during the COVID-19 pandemic.
- Performed data cleaning by removing duplicates, standardizing text entries, and handling missing values.
- Utilized SQL features like Common Table Expressions (CTEs) and Window Functions for transformation and analysis.
- Conducted exploratory data analysis, revealing layoff insights by industry, location, and timeframe.
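Duplicate removal with a CTE and a window function, as described above, might look like the following sketch. It runs against an in-memory SQLite database through Python's `sqlite3` module. The table name `layoffs`, its columns, and the sample rows are hypothetical, and the project's actual SQL dialect may differ.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer, which recent Python builds bundle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layoffs (company TEXT, industry TEXT, total_laid_off INTEGER)")
conn.executemany(
    "INSERT INTO layoffs VALUES (?, ?, ?)",
    [("Acme", "Retail", 100), ("Acme", "Retail", 100), ("Globex", "Tech", 50)],
)

# Number the rows within each group of identical values, then delete every
# row after the first one in its group.
conn.execute(
    """
    WITH ranked AS (
        SELECT rowid AS rid,
               ROW_NUMBER() OVER (
                   PARTITION BY company, industry, total_laid_off
                   ORDER BY rowid
               ) AS row_num
        FROM layoffs
    )
    DELETE FROM layoffs
    WHERE rowid IN (SELECT rid FROM ranked WHERE row_num > 1)
    """
)

remaining = conn.execute(
    "SELECT company, total_laid_off FROM layoffs ORDER BY company"
).fetchall()
```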
View Project
Bird Species Image Classification
Developed a project to classify images of 10 bird species using machine learning techniques. Various models, including Random Forest, Support Vector Machine (SVM), and Convolutional Neural Networks (CNN), were built and evaluated for accuracy and performance.
- Performed data preprocessing, including standardization, and used cross-validation for hyperparameter tuning.
- Delivered high accuracy with an 8-layer CNN model, significantly surpassing the performance of Random Forest and SVM classifiers.
- Utilized Scikit-learn and TensorFlow libraries for model development and evaluation.
- Identified overfitting as a challenge and proposed dataset expansion and data augmentation for future improvements.
View Project