TF-IDF

Published: 2023-10-16 20:29:18 | Author: lif323

Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a term is to one particular document within a collection of documents (a corpus). It is a statistical method.

Term Frequency (TF):

\[\text{TF} = \frac{\text{number of times the term appears in the document}}{\text{total number of terms in the document}} \]
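As a minimal sketch of the TF formula above, the following Python snippet counts a term in a tokenized document. The function name `term_frequency`, the whitespace tokenization, and the choice to return 0 for an empty document are assumptions made for illustration; they are not part of the definition.

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """TF = occurrences of `term` / total number of tokens in the document."""
    if not document_tokens:
        return 0.0  # assumption: an empty document yields TF = 0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# Toy document, split on whitespace (the tokenization is an assumption).
doc = "the cat sat on the mat".split()
tf_the = term_frequency("the", doc)  # 2 occurrences out of 6 tokens
tf_dog = term_frequency("dog", doc)  # the term does not occur in the document
```

Here `tf_the` is 2/6 and `tf_dog` is 0, since `Counter` returns 0 for a missing key.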

Inverse Document Frequency (IDF):

\[\text{IDF} = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus that contain the term}}\right) \]

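To avoid division by zero (which happens when no document in the corpus contains the term), the following form can be used instead.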

\[\text{IDF} = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus that contain the term} + 1}\right) \]
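The two IDF forms can be compared with a short Python sketch. The function names `idf` and `idf_smoothed`, the toy corpus, and the use of the natural logarithm are assumptions for illustration; the formulas do not fix a log base.

```python
import math

def idf(term, corpus):
    """Plain IDF: log(N / df). Raises ZeroDivisionError when df == 0."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    return math.log(n_docs / df)

def idf_smoothed(term, corpus):
    """IDF with +1 in the denominator: log(N / (df + 1))."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / (df + 1))

# Toy corpus of three tokenized documents (hypothetical data).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing at dawn".split(),
]

idf_cat = idf("cat", corpus)                         # log(3 / 2)
idf_smoothed_cat = idf_smoothed("cat", corpus)       # log(3 / 3)
idf_smoothed_unseen = idf_smoothed("zebra", corpus)  # log(3 / 1), no error
```

Calling `idf("zebra", corpus)` would raise `ZeroDivisionError`, which is the case the `+ 1` guards against. One side effect worth knowing: with the `+ 1` form, a term that appears in every document gets a slightly negative value, \(\log(N / (N + 1))\).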

TF-IDF is obtained by multiplying TF by IDF.

\[\text{TF-IDF} = \text{TF} \cdot \text{IDF} \]
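Putting the pieces together, a rough end-to-end sketch might look like the following. It uses the `+ 1` form of IDF with the natural logarithm; the function name `tf_idf`, the toy corpus, and the assumption that the document is non-empty are all illustrative choices rather than a reference implementation.

```python
import math
from collections import Counter

def tf_idf(term, document_tokens, corpus):
    """TF-IDF = TF * IDF, with IDF = log(N / (df + 1)).

    Assumes `document_tokens` is non-empty.
    """
    tf = Counter(document_tokens)[term] / len(document_tokens)
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    idf = math.log(len(corpus) / (df + 1))
    return tf * idf

# Toy corpus of four tokenized documents (hypothetical data).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing at dawn".split(),
    "stocks fell sharply today".split(),
]

doc = corpus[0]
scores = {term: tf_idf(term, doc, corpus) for term in set(doc)}
# Terms of the first document, highest score first.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy corpus, "mat" occurs only in the first document, so it scores higher than "the", even though "the" occurs twice in that document: the higher TF of "the" is outweighed by its lower IDF.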

References:
https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency