Introduction

This work is based on the Dreaddit dataset, one of the more recent datasets on Kaggle, created from posts in five different Reddit communities. The problem differs slightly from typical tweet classification because the posts are longer than tweets. Additionally, the amount of data available for training is fairly small.

First look

This work doesn’t include an extensive EDA; it just takes a first look at the dataset: the columns, the amount of data available, and missing values. This is done using pandas-profiling, which performs very good automated EDA and reports a myriad of information. A lighter, faster report can also be generated by passing the minimal parameter.
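
A rough sketch of this step is shown below; the CSV file names are placeholders rather than the exact paths used in the notebook.

```python
# Minimal sketch of the auto-EDA step with pandas-profiling.
# "dreaddit-train.csv" is a placeholder for the dataset file.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("dreaddit-train.csv")

# Full report: column types, missing values, correlations, etc.
profile = ProfileReport(df, title="Dreaddit - first look")
profile.to_file("dreaddit_profile.html")

# A lighter, faster report:
# profile = ProfileReport(df, minimal=True)
```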

Code: pre-processor.ipynb

Pre-processing

First, a few unnecessary columns such as post_id and sentence_range are removed. Second, the subreddit category, which is categorical data, is one-hot encoded. Helper functions are then run on the post text to clean it, and the BERT base uncased tokenizer from 🤗 Transformers is used to tokenize the text data. Finally, the pre-processed data is written to a file.
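
A condensed sketch of this pipeline follows; the cleaning helper, column names (e.g. text), and file names are illustrative, not the exact ones from the notebook.

```python
# Sketch of the pre-processing step: drop columns, one-hot encode the
# subreddit, clean the post text, and tokenize it with BERT base uncased.
import re
import pandas as pd
from transformers import BertTokenizer

df = pd.read_csv("dreaddit-train.csv")  # placeholder path

# Remove columns that are not useful for modelling
df = df.drop(columns=["post_id", "sentence_range"])

# One-hot encode the subreddit category
df = pd.get_dummies(df, columns=["subreddit"])

# Illustrative text-cleaning helper
def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)       # strip URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation
    return re.sub(r"\s+", " ", text).strip()

df["text"] = df["text"].apply(clean_text)

# Tokenize with the BERT base uncased tokenizer from Hugging Face
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
df["tokens"] = df["text"].apply(
    lambda t: tokenizer.encode(t, truncation=True, max_length=512)
)

df.to_csv("dreaddit-preprocessed.csv", index=False)  # placeholder output
```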

Code: pre-processor.ipynb

Model evaluation

Multiple models were evaluated: XGBoost, SVM, Random Forest, LightGBM, GBM, etc. cross_val_score with 10-fold cross-validation was used to evaluate the models, which gave a rough idea of which models were weak and which were strong. Deep learning models were not used, mainly because of the small amount of data available. For this section, I took help from this notebook.
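
The evaluation loop looks roughly like the sketch below; the pre-processed file name, label column, and dropped text columns are assumptions.

```python
# Sketch of 10-fold cross-validation over several candidate models.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

df = pd.read_csv("dreaddit-preprocessed.csv")           # placeholder path
y = df["label"]                                         # assumed label column
X = df.drop(columns=["label", "text", "tokens"])        # keep numeric features

models = {
    "XGBoost": XGBClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "LightGBM": LGBMClassifier(),
    "GBM": GradientBoostingClassifier(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```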

Code: model.ipynb

Hyperparameter optimization

The models XGBoost, LightGBM, and GBM, which gave almost the same results, were optimized further. Optuna, a beautiful framework, was used for this. I ran 100 trials per model, and the best parameters of each model are captured in separate files. This is a time-consuming process, so using a GPU helps speed it up.
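
Below is a condensed sketch of such a study for XGBoost; the search space and the output file name are illustrative, not the exact ones from the notebooks.

```python
# Sketch of an Optuna study (100 trials) tuning XGBoost, with the best
# parameters written to a file at the end.
import json
import optuna
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

df = pd.read_csv("dreaddit-preprocessed.csv")           # placeholder path
y = df["label"]
X = df.drop(columns=["label", "text", "tokens"])

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = XGBClassifier(**params, eval_metric="logloss")
    return cross_val_score(model, X, y, cv=10, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

# Persist the best parameters for the final-model step
with open("xgboost_best_params.json", "w") as f:
    json.dump(study.best_params, f)
```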

Code: xgboost_optimization.ipynb, lgbm_optimization.ipynb, gbm_optimization.ipynb

Final model

The best parameters from the optimization step still resulted in roughly the same overall score for all of the above models. Hence, a more stable model was built by ensembling the best XGBoost, LightGBM, and GBM models with a voting classifier. This resulted in an F1 score of around 0.77.
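
A sketch of how such an ensemble can be assembled is shown below; the parameter file names and data columns are placeholders.

```python
# Sketch of a soft-voting ensemble built from the tuned XGBoost, LightGBM
# and GBM models.
import json
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

df = pd.read_csv("dreaddit-preprocessed.csv")           # placeholder path
y = df["label"]
X = df.drop(columns=["label", "text", "tokens"])

def load_params(path):
    with open(path) as f:
        return json.load(f)

ensemble = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier(**load_params("xgboost_best_params.json"))),
        ("lgbm", LGBMClassifier(**load_params("lgbm_best_params.json"))),
        ("gbm", GradientBoostingClassifier(**load_params("gbm_best_params.json"))),
    ],
    voting="soft",  # average the predicted class probabilities
)

scores = cross_val_score(ensemble, X, y, cv=10, scoring="f1")
print(f"Ensemble mean F1: {scores.mean():.3f}")
```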

Code: main.ipynb