CVE-2021-43854: Inefficient Regular Expression Complexity in nltk (word_tokenize, sent_tokenize)
The vulnerability is present in PunktSentenceTokenizer, sent_tokenize and word_tokenize. Any users of this class, or these two functions, are vulnerable to a Regular Expression Denial of Service (ReDoS) attack.
In short, a specially crafted long input to any of these vulnerable functions will cause them to take a significant amount of time to execute. The effect of this vulnerability is noticeable with the following example:
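import timeit
from nltk.tokenize import word_tokenize
n = 8
for length in [10**i for i in range(2, n)]:
    seconds = timeit.timeit(lambda: word_tokenize("a" * length), number=1)  # assumed input: one long run of "a"
    print(f"A length of {length:<{n}} takes {seconds:.4f}s")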
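The slowdown described above is characteristic of regular-expression backtracking. As a rough illustration that does not depend on NLTK, the sketch below times a failing search with a hypothetical pattern, `\S*!`, which was chosen for this example and is not the expression NLTK uses. The names PATTERN, time_search and measure are likewise illustrative assumptions.

```python
import re
import time

# Hypothetical pattern for illustration only; it is NOT the expression NLTK uses.
# `\S*` greedily consumes the whole run of non-whitespace characters, fails to
# find the trailing "!", backtracks, and the search restarts at the next offset.
PATTERN = re.compile(r"\S*!")


def time_search(length):
    """Return (match, seconds) for searching a run of `length` 'a' characters."""
    text = "a" * length  # no whitespace and no "!", so the search cannot succeed
    start = time.perf_counter()
    match = PATTERN.search(text)
    return match, time.perf_counter() - start


def measure(lengths):
    """Time the failing search for each input length."""
    return [(length, time_search(length)[1]) for length in lengths]


if __name__ == "__main__":
    for length, seconds in measure([1_000, 2_000, 4_000, 8_000]):
        print(f"length {length:>6}: {seconds:.4f}s")
```

On CPython's backtracking `re` engine, each doubling of the input length is expected to make this search take roughly four times as long, although exact timings depend on the machine and interpreter version. That kind of superlinear growth is why a sufficiently long input can occupy a tokenizer for a very long time.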
References
- github.com/advisories/GHSA-f8m6-h2c7-8h9x
- github.com/nltk/nltk
- github.com/nltk/nltk/commit/1405aad979c6b8080dbbc8e0858f89b2e3690341
- github.com/nltk/nltk/issues/2866
- github.com/nltk/nltk/pull/2869
- github.com/nltk/nltk/security/advisories/GHSA-f8m6-h2c7-8h9x
- github.com/pypa/advisory-database/tree/main/vulns/nltk/PYSEC-2021-859.yaml
- nvd.nist.gov/vuln/detail/CVE-2021-43854