Text Classification
- Devise features by hand: Does the message contain “church”? Does the email contain an Indian organization’s domain?
- Bag of words: count the occurrences of each word in a pre-defined ‘vocabulary’
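The two feature types above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the vocabulary, the example message, and the helper names `bag_of_words` and `contains_word` are all assumptions made up for the example, and tokenization is a plain lowercase whitespace split.

```python
from collections import Counter

# Hypothetical vocabulary, chosen only for illustration.
VOCABULARY = ["church", "meeting", "free", "offer"]


def bag_of_words(text, vocabulary=VOCABULARY):
    """Return the count of each vocabulary word in `text`, in vocabulary order."""
    tokens = text.lower().split()  # naive tokenization: lowercase + whitespace
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocabulary]


def contains_word(text, word="church"):
    """Hand-devised binary feature: 1 if `word` is a token of the message, else 0."""
    return int(word in text.lower().split())


message = "Free offer free tickets for the church meeting"
print(bag_of_words(message))   # [1, 1, 2, 1]
print(contains_word(message))  # 1
```

Each message becomes a fixed-length vector with one entry per vocabulary word, which is what lets a standard classifier consume text of any length.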
Pre-Processing
- Stemming: keep only the root of each word
  - “slowly” and “slow” are both mapped to “slow”
- Filtering
  - Stopwords: articles
  - Filler words
  - Rare words
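A rough sketch of these pre-processing steps, under stated assumptions: the stemmer below is a toy suffix stripper (not the Porter algorithm or any library's stemmer), the stopword, filler-word, and suffix lists are hypothetical, and "rare" is taken to mean appearing fewer than `min_count` times across the whole corpus.

```python
from collections import Counter

# Hypothetical word lists, for illustration only.
STOPWORDS = {"a", "an", "the"}
FILLER_WORDS = {"um", "uh", "like"}
SUFFIXES = ("ly", "ing", "ed", "s")


def naive_stem(word):
    """Toy stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(documents, min_count=2):
    """Drop stopwords and filler words, stem, then drop rare words."""
    stemmed = [
        [
            naive_stem(w)
            for w in doc.lower().split()
            if w not in STOPWORDS and w not in FILLER_WORDS
        ]
        for doc in documents
    ]
    # Count stems across the whole corpus to decide which ones are rare.
    corpus_counts = Counter(w for doc in stemmed for w in doc)
    return [[w for w in doc if corpus_counts[w] >= min_count] for doc in stemmed]


docs = ["the car moved slowly", "um a slow car", "an unusual zeppelin"]
print(naive_stem("slowly"))  # slow
print(preprocess(docs))      # [['car', 'slow'], ['slow', 'car'], []]
```

Because “slowly” and “slow” collapse to the same stem, they are counted together and survive the rare-word filter, while words that appear only once are dropped.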