
Jigsaw Rate Severity of Toxic Comments

Description

This competition is about sentiment analysis. It is a continuation of the previous Jigsaw competition, which determined whether a comment is toxic; this new one rates how toxic a comment is.

The end goal is to produce severity ratings similar to those given by experts in the field.

The related work section of the competition points to a relevant dataset. The paper “Constructing Interval Variables via Faceted Rasch Measurement and Multitask Deep Learning: a Hate Speech Application” may help as well.

The name comes from Google Jigsaw.

TODO

Pre-trained word embeddings

GloVe (Stanford)

word2vec

All comprehensive More concise

Text cleaning

https://github.com/jfilter/clean-text
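A rough sketch of how the clean-text package above could be applied to the comments; the keyword arguments are assumptions based on that repository's README, and the `text` column name is hypothetical.

```python
# Minimal sketch, assuming the clean-text package (pip install clean-text)
# and a pandas DataFrame with a hypothetical "text" column of raw comments.
import pandas as pd
from cleantext import clean

df = pd.DataFrame({"text": ["Visit http://example.com NOW!!!", "you are sooo   dumb"]})

df["text_clean"] = df["text"].apply(
    lambda t: clean(
        t,
        fix_unicode=True,   # repair broken unicode / mojibake
        to_ascii=True,      # transliterate to closest ASCII
        lower=True,         # lowercase everything
        no_urls=True,       # replace URLs with a placeholder
        no_emails=True,     # replace e-mail addresses with a placeholder
        no_punct=False,     # keep punctuation, it can carry toxicity signal
    )
)
print(df["text_clean"].tolist())
```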

Jigsaw-Ridge Ensemble + TFIDF + FastText [0.868]

[RAPIDS] TFIDF_linear_model_ensemble
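A minimal sketch of the TF-IDF + linear model idea behind the two notebooks above, assuming training data in the style of the earlier Toxic Comment Classification Challenge (the file name and the `comment_text` / `toxic` columns are assumptions); the learned score can then be used to rank comments by severity.

```python
# Minimal sketch of a TF-IDF + Ridge toxicity scorer (hypothetical file
# "train.csv" with "comment_text" / "toxic" columns).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

train = pd.read_csv("train.csv")  # hypothetical training file

vec = TfidfVectorizer(
    analyzer="char_wb",   # character n-grams are robust to obfuscated profanity
    ngram_range=(2, 5),
    max_features=50000,
    sublinear_tf=True,
)
X = vec.fit_transform(train["comment_text"])

model = Ridge(alpha=1.0)
model.fit(X, train["toxic"].astype(float))  # regress a toxicity target

# Score new comments; a higher score means rated as more severe.
scores = model.predict(vec.transform(["you are lovely", "I will hurt you"]))
print(scores)
```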

For duplicated words, shorten the run to a single occurrence and add a new column/feature counting the number of repetitions, e.g. a single “happy” is less intense than “happy” repeated five times.
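A minimal sketch of that idea (column names hypothetical): collapse consecutive repeats of a word to one occurrence and record how many repeats were removed as an extra feature.

```python
# Minimal sketch: collapse runs of a repeated word and count the removed
# repetitions as a new feature (column names hypothetical).
import pandas as pd

def collapse_repeats(text):
    tokens = text.split()
    kept, n_repeats = [], 0
    for tok in tokens:
        if kept and tok.lower() == kept[-1].lower():
            n_repeats += 1          # drop the duplicate but remember it happened
        else:
            kept.append(tok)
    return " ".join(kept), n_repeats

df = pd.DataFrame({"text": ["happy happy happy happy happy", "happy"]})
df[["text_dedup", "n_word_repeats"]] = df["text"].apply(
    lambda t: pd.Series(collapse_repeats(t))
)
print(df)
# "happy" repeated five times keeps the word once but records 4 repeats,
# so a model can treat it as more emphatic than a single "happy".
```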

Past winning solution

Toxic Comment Classification Challenge

Reading

Pre-trained word embeddings

  • https://github.com/Hironsan/awesome-embedding-models

Comprehensive pre-processing

  • https://www.kaggle.com/vinayakshanawad/text-preprocess-py
  • https://www.kaggle.com/xbf6xbf/processing-helps-boosting-about-0-0005-on-lb
  • machinelearningmastery

gensim
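A quick sketch of pulling a pre-trained embedding (GloVe or word2vec) through gensim's downloader, which is probably the easiest route; the model name below is one of the standard gensim-data identifiers.

```python
# Minimal sketch: load pre-trained GloVe vectors via gensim's downloader
# and look up vectors / nearest neighbours for vocabulary words.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # 100-d GloVe vectors

vec = wv["toxic"]                          # 100-dimensional numpy vector
print(vec.shape)
print(wv.most_similar("toxic", topn=5))    # nearest words in embedding space

# word2vec is available the same way, e.g. api.load("word2vec-google-news-300").
```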

Misc

Start with a simple model; here I used: Incredibly Simple Naive Bayes [0.768]
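A minimal sketch of such a baseline, assuming binary-labelled training data in the style of the earlier challenge (the file and column names are assumptions); the notebook linked above may differ in its details.

```python
# Minimal sketch of a simple Naive Bayes baseline: bag-of-words counts feeding
# MultinomialNB (hypothetical "comment_text" / "toxic" columns).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")  # hypothetical file from the earlier challenge

pipe = make_pipeline(
    CountVectorizer(stop_words="english", max_features=30000),
    MultinomialNB(),
)
pipe.fit(train["comment_text"], train["toxic"])

# Use the predicted probability of the toxic class as a severity score.
probs = pipe.predict_proba(["have a nice day", "shut up you idiot"])[:, 1]
print(probs)
```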
