NLP
Overview
- NLP overview benchmarks
- paperswithcode - https://paperswithcode.com/sota/sentiment-analysis-on-imdb
Pre-trained word embeddings
- fastText, GloVe
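A minimal sketch of loading pre-trained word vectors with gensim's downloader API; the dataset names come from the gensim-data catalogue (here GloVe), and a fastText vector set can be substituted the same way.

```python
# Load pre-trained embeddings via gensim's downloader (downloads on first use).
import gensim.downloader as api

# 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-100")

# Query the embedding space: nearest neighbours and pairwise similarity.
print(glove.most_similar("movie", topn=5))
print(glove.similarity("good", "great"))
```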
Pre-trained models
- BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
- XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
- RoBERTa (from Facebook) released with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
- DistilBERT (from HuggingFace), released together with the blogpost Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.
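All of the models above are available through the Hugging Face transformers library. Below is a minimal sketch using the sentiment-analysis pipeline with a DistilBERT checkpoint fine-tuned on SST-2; the checkpoint name is one common choice and any compatible model from the list can be swapped in.

```python
# Run a pre-trained transformer for sentiment analysis via the pipeline API.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("This movie was surprisingly good."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```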
Courses
Others
Visualisation
Pre-trained word embedding demo
Pre-processing
- emoticons
- URLs
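A rough sketch of this kind of tweet cleaning: strip URLs and user handles with regexes while keeping emoticons such as :) because they carry sentiment. The patterns are illustrative, not exhaustive.

```python
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop @mentions
    text = re.sub(r"#", "", text)              # keep the hashtag word, drop the symbol
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(clean_tweet("Loved it :) see https://example.com #review @someone"))
# -> "Loved it :) see review"
```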
Techniques
We can see that the stop-word list above contains some words that could be important in certain contexts, such as i, not, between, because, won, and against. You might need to customize the stop-word list for some applications.
For punctuation, we saw earlier that certain groupings like ':)' and '...' should be retained when dealing with tweets because they are used to express emotion. In other contexts, such as medical text analysis, they should be removed.
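A minimal sketch, assuming NLTK, of customizing the stop-word list and punctuation handling as discussed above: negations like "not" are kept, and because the punctuation set contains single characters only, emotion-bearing groupings like ":)" survive as tokens.

```python
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stop_words -= {"not", "against", "won"}   # keep words that can flip sentiment

punctuation = set(string.punctuation)     # single characters only, so ":)" and "..." are kept

tokens = "i did not like it :)".split()
filtered = [t for t in tokens if t not in stop_words and t not in punctuation]
print(filtered)  # ['not', 'like', ':)']
```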
Topic modelling
My template
- print strings in different colors
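A small sketch of the colored-print template using ANSI escape codes; it works in most terminals and needs no third-party package.

```python
# ANSI escape codes for a few foreground colors, plus a reset code.
RED = "\033[91m"
GREEN = "\033[92m"
YELLOW = "\033[93m"
RESET = "\033[0m"

def cprint(text: str, color: str = GREEN) -> None:
    """Print text wrapped in an ANSI color code, then reset the terminal color."""
    print(f"{color}{text}{RESET}")

cprint("error-like message", RED)
cprint("ok message", GREEN)
cprint("warning message", YELLOW)
```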
TODO
- English grammar: nouns, pronouns, etc.