
Overview

awesome-nlp

  • NLP overview and benchmarks
  • Papers with Code - https://paperswithcode.com/sota/sentiment-analysis-on-imdb

Pre-trained word embeddings

fastText, GloVe (see the loading sketch below)

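As a quick illustration, here is a minimal sketch of loading pre-trained GloVe vectors with gensim's downloader; the library choice and the model name `glove-wiki-gigaword-100` are assumptions for the example, not something the links above prescribe:

```python
# Minimal sketch: load pre-trained GloVe vectors via gensim's downloader.
# 'glove-wiki-gigaword-100' is one of several available pre-trained sets.
import gensim.downloader as api

vectors = api.load('glove-wiki-gigaword-100')  # downloads on first use
print(vectors.most_similar('king', topn=3))    # nearest neighbours in vector space
```
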
Pre-trained models

  • BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  • XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
  • RoBERTa (from Facebook) released with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
  • DistilBERT (from HuggingFace) released together with the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.

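All of these models are available through the HuggingFace transformers library. A minimal sketch below runs sentiment analysis with a fine-tuned DistilBERT; the pipeline task and checkpoint name are illustrative choices, not something the list above prescribes:

```python
# Minimal sketch: sentiment analysis with a pre-trained DistilBERT via the
# HuggingFace transformers pipeline API.
from transformers import pipeline

classifier = pipeline('sentiment-analysis',
                      model='distilbert-base-uncased-finetuned-sst-2-english')
print(classifier('I love NLP!'))  # e.g. [{'label': 'POSITIVE', 'score': ...}]
```
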
Courses

Stanford NLP

Others

GitHub awesome-nlp list

Visualisation

Pre-trained word embedding demo

Pre-processing

nltk tokenize (see the TweetTokenizer sketch after this list):

  • emoticons
  • URLs

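A minimal sketch with NLTK's TweetTokenizer, which keeps emoticons and URLs as single tokens instead of splitting them; the example tweet is made up:

```python
# Minimal sketch: tweet-aware tokenization with NLTK's TweetTokenizer.
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False,  # lowercase regular words
                           strip_handles=True,   # drop @mentions
                           reduce_len=True)      # shorten looong character runs
tweet = "@user loving this :) see https://example.com #nlp"
print(tokenizer.tokenize(tweet))
# roughly: ['loving', 'this', ':)', 'see', 'https://example.com', '#nlp']
```
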
Techniques

We can see that a standard stop words list (such as NLTK's English list) contains some words that could be important in some contexts. These could be words like i, not, between, because, won, against. You might need to customize the stop words list for some applications.

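A minimal sketch of such customization, assuming NLTK's English stop word list: remove the sentiment-bearing words from the default list before filtering.

```python
# Minimal sketch: keep sentiment-bearing words out of the stop word list.
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

keep = {'i', 'not', 'between', 'because', 'won', 'against'}
custom_stopwords = set(stopwords.words('english')) - keep

tokens = ['i', 'am', 'not', 'happy', 'with', 'the', 'service']
print([t for t in tokens if t not in custom_stopwords])
# -> ['i', 'not', 'happy', 'service']
```
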
For the punctuation, we saw earlier that certain groupings like ':)' and '...' should be retained when dealing with tweets because they are used to express emotions. In other contexts, like medical analysis, these should be removed.

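A minimal sketch of context-dependent punctuation removal: drop tokens made entirely of punctuation unless they appear on a whitelist of emotive groupings. The whitelist contents are an illustrative choice.

```python
# Minimal sketch: strip punctuation tokens but keep whitelisted emoticons.
import string

def remove_punctuation(tokens, keep=(':)', ':(', '...')):
    # a token survives if it is whitelisted or has any non-punctuation char
    return [t for t in tokens if t in keep
            or not all(ch in string.punctuation for ch in t)]

tokens = ['great', 'movie', ':)', '...', '!', ',']
print(remove_punctuation(tokens))           # ['great', 'movie', ':)', '...']
print(remove_punctuation(tokens, keep=()))  # ['great', 'movie']
```
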
Topic modelling

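No library is named here; as one common option, a minimal LDA sketch with gensim on a toy corpus (the corpus and topic count are made up for illustration):

```python
# Minimal sketch: LDA topic modelling with gensim on a tiny toy corpus.
from gensim import corpora
from gensim.models import LdaModel

docs = [['cat', 'dog', 'pet'],
        ['python', 'code', 'nlp'],
        ['dog', 'vet', 'pet']]
dictionary = corpora.Dictionary(docs)           # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```
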
My template

  • print strings in different colors (see the sketch after this list)

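A minimal sketch using raw ANSI escape codes, with no third-party dependency; this works in most terminals:

```python
# Minimal sketch: print strings in different colors with ANSI escape codes.
COLORS = {'red': '31', 'green': '32', 'yellow': '33', 'blue': '34'}

def print_color(text, color='red'):
    print(f"\033[{COLORS[color]}m{text}\033[0m")  # \033[0m resets styling

print_color('something failed', 'red')
print_color('all good', 'green')
```
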
TODO

  • English grammar: noun, pronoun, etc.
