I performed data preprocessing and feature engineering, and applied machine learning algorithms to predict fraudulent advertisement clicks, achieving a top-15% result on the Kaggle private leaderboard. Competition link
Dataset type: structured time-series data of 180+ million records. Labels are imbalanced: only 0.2% of the dataset consists of fraudulent clicks.
Software, libraries and models:
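The exact models and features are listed above, but the core difficulty — a 0.2% positive rate — is worth illustrating. Below is a minimal, hypothetical sketch of one common remedy, class reweighting, using scikit-learn on synthetic stand-in data (the real pipeline, features, and model are not shown here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the click data: 20 fraudulent clicks in 10,000 rows,
# roughly matching the 0.2% positive rate described above.
n = 10_000
X = rng.normal(size=(n, 5))
y = np.zeros(n, dtype=int)
y[rng.choice(n, size=20, replace=False)] = 1
X[y == 1] += 2.0  # make the rare class separable enough to learn

# class_weight="balanced" reweights each class inversely to its frequency,
# so the rare fraudulent clicks are not drowned out by the majority class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

recall = (clf.predict(X)[y == 1] == 1).mean()
```

Without the reweighting, a model on data this skewed can reach 99.8% accuracy by predicting "not fraud" everywhere; the weighting (or an equivalent such as LightGBM's `scale_pos_weight`) forces it to pay attention to the minority class.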
I took a deep dive into an unlabeled GPS dataset collected by the Google Timeline application, performing intensive data analysis and visualization. From there I cleaned out inaccurate data, transformed the raw data into analysis-ready features, and applied a few clustering algorithms to identify clusters that reveal the user's habits and common routines.
This project is featured on the TechTarget analytics blog.
Dataset type: geographic dataset of 600K+ coordinates and metadata.
Software, libraries and models:
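The specific clustering algorithms used are listed above; as one illustrative (assumed, not necessarily the one used) approach, density-based clustering can pull frequently visited places out of raw coordinates. A minimal sketch with scikit-learn's DBSCAN and the haversine metric on synthetic stand-in points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for Timeline points: two frequently visited spots
# ("home", "work") plus scattered noise, as (lat, lon) in degrees.
rng = np.random.default_rng(42)
home = rng.normal(loc=[40.7128, -74.0060], scale=1e-4, size=(50, 2))
work = rng.normal(loc=[40.7580, -73.9855], scale=1e-4, size=(50, 2))
noise = rng.uniform(low=[40.5, -74.3], high=[40.9, -73.7], size=(20, 2))
coords = np.vstack([home, work, noise])

# DBSCAN with the haversine metric expects radians; eps is an angular
# distance, here roughly 100 m divided by Earth's radius (~6371 km).
eps_rad = 0.1 / 6371.0
db = DBSCAN(eps=eps_rad, min_samples=10, metric="haversine")
labels = db.fit_predict(np.radians(coords))

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Dense knots of points become clusters (candidate "habit" locations) while one-off readings are labeled noise (-1), which is a natural fit for GPS traces that mix dwelling with transit.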
I built and fine-tuned a machine learning model to predict next-month sales for every product and store. The final model (Light Gradient Boosting Machine) reached 7th place (out of 200) on the Kaggle public leaderboard as of 4/16/2018. Competition link
Dataset type: structured time-series data of 6 million records from 1C Company, one of the largest Russian software firms.
Software, libraries and models:
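Forecasting next-month sales per (shop, item) pair typically means aggregating the daily log to monthly counts and building lag features for a gradient-boosting model. A toy sketch of that feature-engineering step with pandas (column names follow the competition's schema; the actual features used are not listed here):

```python
import pandas as pd

# Toy stand-in for the daily sales log: one row per sale event.
sales = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2],  # consecutive months
    "shop_id":        [5, 5, 5, 5, 5, 5],
    "item_id":        [101, 101, 101, 102, 101, 102],
    "item_cnt_day":   [2, 1, 4, 3, 5, 1],
})

# Aggregate daily rows to monthly counts per (shop, item) - the target unit.
monthly = (sales.groupby(["date_block_num", "shop_id", "item_id"])
                ["item_cnt_day"].sum()
                .rename("item_cnt_month")
                .reset_index())

# Lag feature: last month's count for the same (shop, item) pair,
# a typical input column for a model like LightGBM.
monthly = monthly.sort_values("date_block_num")
monthly["cnt_lag_1"] = (monthly.groupby(["shop_id", "item_id"])
                               ["item_cnt_month"].shift(1))
```

The `groupby(...).shift(1)` keeps each lag within its own (shop, item) series, so a product's history never leaks into another product's features.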
I built a deep learning model to identify a scene/character from 10 classic animated movies (4 from Studio Ghibli, 6 from Disney) and published it as a web application that you can test here.
I also wrote a blog post showing how I built it and how I debugged my model using the Grad-CAM visualization technique.
Dataset type: data collected from movie frames and the internet
Software, libraries and models:
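The Grad-CAM computation mentioned above is compact enough to sketch. Given a convolutional layer's forward activations and the gradient of the class score with respect to them (which a real framework would produce; synthetic arrays stand in here), the heatmap is a gradient-weighted, ReLU-clipped sum of the activation maps:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer's activations and the gradients
    of the class score w.r.t. those activations.

    activations, gradients: arrays of shape (channels, H, W).
    Returns an (H, W) heatmap normalized to [0, 1].
    """
    # Channel importance = global average pool of the gradients.
    weights = gradients.mean(axis=(1, 2))                      # (channels,)
    # Weighted sum of activation maps, then ReLU to keep positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)
    if cam.max() > 0:
        cam /= cam.max()
    return cam

# Synthetic activations/gradients standing in for a real conv layer.
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.normal(size=(8, 7, 7))
heat = grad_cam(acts, grads)
```

Upsampled to the input size and overlaid on the frame, the heatmap shows which regions drove the prediction — which is exactly how it helps debug a scene/character classifier.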
I created a data module to handle text data and tabular data (metadata), and built a joint deep learning architecture that combines a recurrent neural network (AWD-LSTM) with a multi-layer perceptron using categorical embeddings, in order to train on both types of data efficiently.
You can follow the discussion on the fastai forum.
This architecture was adopted by Reply.ai to improve customer experience through better ticket classification. Read Reply.ai's Towards Data Science article here.
Dataset type: the model is tested on the Kaggle PetFinder and Kaggle Mercari Price datasets
Software, libraries and models:
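To make the joint architecture concrete, here is a simplified PyTorch sketch of the idea: an RNN encodes the text, embeddings plus an MLP encode the tabular metadata, and a head classifies their concatenation. A plain `nn.LSTM` stands in for AWD-LSTM, and all dimensions are illustrative, not the ones used in the project:

```python
import torch
import torch.nn as nn

class TextTabularNet(nn.Module):
    """Joint model: an LSTM encodes the text, categorical embeddings + an
    MLP encode the metadata, and a head classifies their concatenation.
    (A plain LSTM stands in for AWD-LSTM to keep the sketch short.)"""

    def __init__(self, vocab_size=100, emb_dim=16, hid=32,
                 cat_cards=(10, 5), cat_emb=4, n_cont=3, n_classes=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid, batch_first=True)
        # One embedding table per categorical column.
        self.cat_embs = nn.ModuleList(
            [nn.Embedding(card, cat_emb) for card in cat_cards])
        tab_in = cat_emb * len(cat_cards) + n_cont
        self.tab_mlp = nn.Sequential(nn.Linear(tab_in, hid), nn.ReLU())
        self.head = nn.Linear(hid * 2, n_classes)

    def forward(self, tokens, cats, conts):
        _, (h, _) = self.rnn(self.tok_emb(tokens))   # final hidden state
        text_feat = h[-1]                            # (batch, hid)
        cat_feat = torch.cat(
            [emb(cats[:, i]) for i, emb in enumerate(self.cat_embs)], dim=1)
        tab_feat = self.tab_mlp(torch.cat([cat_feat, conts], dim=1))
        return self.head(torch.cat([text_feat, tab_feat], dim=1))

model = TextTabularNet()
out = model(torch.randint(0, 100, (4, 12)),   # 4 texts, 12 tokens each
            torch.randint(0, 5, (4, 2)),      # 2 categorical columns
            torch.randn(4, 3))                # 3 continuous columns
```

Concatenating the two feature vectors before the head lets one backward pass train both branches together, which is the point of handling text and metadata in a single model rather than two.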
I didn't know much about machine learning until the last semester of my undergraduate studies, when I took a data mining class as an elective. Since then I have been fascinated by how versatile machine learning is at solving problems, and I have taken several online machine learning courses, such as Andrew Ng's famous ML course and Jeremy Howard and Rachel Thomas's practical course.
Using all the knowledge I have learned, I decided to rebuild some common machine learning algorithms using only Python and a limited set of NumPy functions, and compared them to existing implementations in popular machine learning frameworks such as scikit-learn and PyTorch.
My goal is to fully understand the algorithms and libraries I have been using for Kaggle competitions and other personal projects. So far this project has helped me tremendously in gaining insight into how these models work, what really happens at each iteration, how and why models need certain parameters — and even more basic things, such as how data flows in and out of a model or what matrix multiplication looks like in code.
Some of the code may not be well refactored, but my goal is not to optimize these algorithms or push them to production; there are already plenty of implementations out there. I would rather have their inner workings laid out in a transparent way so I can revisit them later.
(Links to Jupyter notebooks are included)
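To give a flavor of the rebuilds (this is a fresh illustrative example, not code from the notebooks): a binary logistic regression trained by batch gradient descent, written with only NumPy, of the kind scikit-learn would otherwise hide behind `LogisticRegression.fit`:

```python
import numpy as np

class ScratchLogisticRegression:
    """Binary logistic regression with plain batch gradient descent,
    using only NumPy - the kind of rebuild this project contains."""

    def __init__(self, lr=0.1, n_iter=2000):
        self.lr, self.n_iter = lr, n_iter

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        n, d = X.shape
        self.w, self.b = np.zeros(d), 0.0
        for _ in range(self.n_iter):
            p = self._sigmoid(X @ self.w + self.b)
            err = p - y                       # gradient of mean cross-entropy
            self.w -= self.lr * (X.T @ err) / n
            self.b -= self.lr * err.mean()
        return self

    def predict(self, X):
        return (self._sigmoid(X @ self.w + self.b) >= 0.5).astype(int)

# Tiny separable dataset: the scratch model should fit it easily.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[np.zeros(50), np.ones(50)].astype(int)
acc = (ScratchLogisticRegression().fit(X, y).predict(X) == y).mean()
```

Every moving part — the sigmoid, the cross-entropy gradient, the update rule — is visible on the page, which is exactly the transparency the project is after.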