Kaggle - NLP
Jigsaw Rate Severity of Toxic Comments
Description
This competition is about sentiment analysis. It continues the previous Jigsaw competition, which classified whether a comment is toxic; this one rates how toxic a comment is.
The end goal is a severity rating that agrees with the expert annotators in the field.
The related-work section points to a relevant dataset; the paper “Constructing Interval Variables via Faceted Rasch Measurement and Multitask Deep Learning: a Hate Speech Application” may help as well.
The competition name comes from Google Jigsaw.
TODO
Pre-trained word embedding
word2vec
All comprehensive More concise
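As a sketch of how pre-trained embeddings like word2vec are typically used for comments, one common baseline is to average the word vectors into a fixed-size feature. The tiny 3-d vectors below are made-up stand-ins for real pre-trained embeddings:

```python
# Hypothetical toy embeddings; a real run would load word2vec/GloVe vectors.
EMBEDDINGS = {
    "you":    [0.1, 0.0, 0.2],
    "are":    [0.0, 0.1, 0.1],
    "stupid": [0.9, 0.8, 0.7],
    "nice":   [0.1, 0.9, 0.1],
}
DIM = 3

def comment_vector(comment: str) -> list[float]:
    """Average the vectors of known words; zeros if none are known."""
    vecs = [EMBEDDINGS[w] for w in comment.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return [0.0] * DIM
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

print(comment_vector("you are stupid"))
```

Out-of-vocabulary words are simply skipped here; real pipelines often track an OOV rate as an extra feature.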
Text cleaning
https://github.com/jfilter/clean-text
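The clean-text library linked above handles this; a minimal hand-rolled cleaner in the same spirit (not the library's actual API) looks like:

```python
import re
import unicodedata

def clean_comment(text: str) -> str:
    """Minimal comment cleaner: normalize unicode, lowercase,
    drop URLs, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[^a-z0-9'\s]", " ", text)   # keep word-ish characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(clean_comment("You're SO dumb!!! see https://example.com :("))
# → "you're so dumb see"
```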
Jigsaw-Ridge Ensemble + TFIDF + FastText [0.868]
- data augmentation with nlpaug
[RAPIDS] TFIDF_linear_model_ensemble
For duplicated words, collapse the repeats to a single occurrence and add a new column/feature counting the repetitions. E.g. a single "happy" reads as less intense than "happy" repeated five times.
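That duplicate-collapsing idea could be sketched as follows (a rough illustration, not the notebook's implementation):

```python
def collapse_repeats(text: str) -> tuple[str, int]:
    """Collapse runs of the same word ("happy happy happy" -> "happy")
    and return the cleaned text plus the number of dropped repeats,
    which can become a new feature column."""
    collapsed, extra = [], 0
    for w in text.split():
        if collapsed and w == collapsed[-1]:
            extra += 1          # count each dropped duplicate
        else:
            collapsed.append(w)
    return " ".join(collapsed), extra

print(collapse_repeats("happy happy happy happy happy day"))
# → ('happy day', 4)
```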
Past winning solution
Toxic Comment Classification Challenge
Reading
Pre-trained word embedding
- https://github.com/Hironsan/awesome-embedding-models
Comprehensive pre-processing
- https://www.kaggle.com/vinayakshanawad/text-preprocess-py
- https://www.kaggle.com/xbf6xbf/processing-helps-boosting-about-0-0005-on-lb
- machinelearningmastery
Misc
Start with a simple model; here I used: Incredibly Simple Naive Bayes [0.768]
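The linked notebook isn't reproduced here, but as a sketch of what a bag-of-words Naive Bayes baseline does, here is a tiny from-scratch multinomial NB with add-one smoothing on made-up toy data:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes classifier with add-one smoothing
    and return a predict function."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d.split()}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    total = {c: sum(counts[c].values()) for c in classes}

    def predict(doc):
        def score(c):
            # log P(c) + sum of log P(word | c) over known words
            s = math.log(prior[c])
            for w in doc.split():
                if w in vocab:
                    s += math.log((counts[c][w] + 1) / (total[c] + len(vocab)))
            return s
        return max(classes, key=score)

    return predict

# Toy data: label 0 = benign, 1 = toxic
predict = train_nb(
    ["you are great", "nice work", "you are an idiot", "stupid idiot"],
    [0, 0, 1, 1],
)
print(predict("stupid idiot"))  # → 1
print(predict("nice work"))     # → 0
```

In practice the same idea is usually run through scikit-learn's TF-IDF vectorizer plus MultinomialNB rather than hand-rolled.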