Text Classification

  • Devise features by hand: Does the message contain “church”? Does the email contain an Indian organization’s domain?
  • Bag of words: Count of occurrences of each word in a pre-defined ‘vocabulary’
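
The two feature styles above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `tokenize` rule, the four-word `vocabulary`, and the sample `message` are all assumptions made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed, simplistic rule)."""
    return re.findall(r"[a-z']+", text.lower())

def contains_word(text, word):
    """Hand-devised binary feature: 1 if the word occurs in the text, else 0."""
    return int(word in tokenize(text))

def bag_of_words(text, vocabulary):
    """Count the occurrences of each vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and message, chosen only for illustration.
vocabulary = ["church", "meeting", "free", "offer"]
message = "Free offer! Claim your free gift before the church meeting."

print(contains_word(message, "church"))   # hand-devised feature
print(bag_of_words(message, vocabulary))  # one count per vocabulary word
```

Words outside the vocabulary ("claim", "gift", ...) are simply ignored, which is why the vocabulary has to be fixed before any document is vectorised.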

Pre-Processing

  • Stemming: keep only the root of each word
      • “slowly” and “slow” are both mapped to “slow”
  • Filtering: drop tokens that carry little information
      • Stopwords: e.g. articles (“a”, “an”, “the”)
      • Filler words
      • Rare words
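
A rough sketch of these pre-processing steps, under loud assumptions: `toy_stem` is a crude suffix stripper written for this example (a real stemmer such as Porter's has many more rules), the `STOPWORDS` and `FILLER_WORDS` sets are illustrative, and "rare" is taken to mean "appears fewer than `min_count` times in the corpus".

```python
from collections import Counter

STOPWORDS = {"a", "an", "the"}        # assumed: articles only
FILLER_WORDS = {"um", "uh", "like"}   # assumed illustrative list

def toy_stem(word):
    """Crude suffix stripping; NOT a real stemming algorithm."""
    for suffix in ("ly", "ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(documents, min_count=2):
    """documents: list of token lists. Returns filtered, stemmed token lists."""
    # 1. Drop stopwords and filler words, then stem what remains.
    stemmed = [
        [toy_stem(t) for t in doc if t not in STOPWORDS and t not in FILLER_WORDS]
        for doc in documents
    ]
    # 2. Drop stems that are rare across the whole corpus.
    counts = Counter(t for doc in stemmed for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in stemmed]

docs = [
    ["the", "car", "moved", "slowly"],
    ["um", "a", "slow", "car", "stopped"],
]
print(toy_stem("slowly"), toy_stem("slow"))
print(preprocess(docs))
```

Stemming before counting matters here: “slowly” and “slow” are counted as the same token, so together they survive the rare-word filter.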
\[ \begin{aligned} \text{tf(term)} &= \dfrac{n_\text{term}}{n_\text{terms in document}} \\ \text{idf(term)} &= \ln \left \vert \dfrac{n_\text{documents}}{n_\text{documents containing term}} \right \vert \\ \text{tf-idf(term)} &= \text{tf(term)} \times \text{idf(term)} \end{aligned} \]
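
The three formulas translate directly into code. The sketch below assumes each document is already a list of tokens and that the term occurs in at least one document (otherwise idf is undefined); the function names and the tiny `corpus` are made up for illustration.

```python
import math

def tf(term, document):
    """Term frequency: occurrences of the term divided by the document length."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """Inverse document frequency: ln(number of documents / documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    # The ratio is always >= 1, so the logarithm is already non-negative.
    return math.log(len(corpus) / n_containing)

def tf_idf(term, document, corpus):
    """tf-idf: the product of the two quantities above."""
    return tf(term, document) * idf(term, corpus)

# Hypothetical three-document corpus.
corpus = [
    ["slow", "car"],
    ["fast", "car"],
    ["slow", "slow", "train"],
]
print(tf("slow", corpus[2]))              # 2 of the 3 tokens are "slow"
print(idf("train", corpus))               # "train" occurs in 1 of 3 documents
print(tf_idf("slow", corpus[2], corpus))
```

A term that appears in every document gets idf = ln(1) = 0, so its tf-idf is zero no matter how often it occurs. This is the same effect stopword filtering aims for, obtained by weighting instead of removal.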
Last Updated: 2024-12-26 ; Contributors: AhmedThahir, web-flow
