Advisories for Pypi/Scrapy package

2024

Duplicate Advisory: Scrapy leaks the authorization header on same-domain but cross-origin redirects

Duplicate Advisory: This advisory has been withdrawn because it is a duplicate of GHSA-4qqq-9vqf-3h3f. This link is maintained to preserve external references. Original Description: In scrapy/scrapy, an issue was identified where the Authorization header is not removed during redirects that only change the scheme (e.g., HTTPS to HTTP) but remain within the same domain. This behavior contravenes the Fetch standard, which mandates the removal of Authorization headers in cross-origin requests …

Scrapy's redirects ignoring scheme-specific proxy settings

When using system proxy settings, which are scheme-specific (i.e. specific to http:// or https:// URLs), Scrapy was not accounting for scheme changes during redirects. For example, an HTTP request would use the proxy configured for HTTP and, when redirected to an HTTPS URL, the new HTTPS request would still use the proxy configured for HTTP instead of switching to the proxy configured for HTTPS. The same happened in the opposite direction, from HTTPS to HTTP. If …
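The expected behavior can be illustrated with a minimal sketch: the proxy is looked up again from the scheme of each new URL rather than carried over from the original request. The `proxy_for` helper, the `proxies` mapping, and the proxy hostnames are hypothetical and are not part of Scrapy.

```python
from urllib.parse import urlsplit


def proxy_for(url, proxies):
    """Return the proxy configured for the scheme of `url`, or None."""
    scheme = urlsplit(url).scheme.lower()
    return proxies.get(scheme)


# Hypothetical scheme-specific proxy configuration.
proxies = {
    "http": "http://proxy-a.example:3128",
    "https": "http://proxy-b.example:3128",
}

original_url = "http://example.com/start"
redirect_url = "https://example.com/start"

# Re-evaluating the proxy after the redirect picks the HTTPS proxy
# instead of reusing the one chosen for the original HTTP request.
print(proxy_for(original_url, proxies))
print(proxy_for(redirect_url, proxies))
```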

Scrapy leaks the authorization header on same-domain but cross-origin redirects

Since version 2.11.1, Scrapy drops the Authorization header when a request is redirected to a different domain. However, it keeps the header if the domain remains the same but the scheme (http/https) or the port changes, both scenarios in which the header should also be dropped. In a man-in-the-middle attack, this could be used to obtain the value of that Authorization header.
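As a rough illustration of a same-origin check, the sketch below compares scheme, host, and port before deciding whether to keep the Authorization header on a redirect. The helper names `origin` and `headers_for_redirect` are hypothetical; this is not Scrapy's actual implementation.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) tuple that identifies an origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def headers_for_redirect(headers, source_url, target_url):
    """Copy `headers`, dropping Authorization if the origin changes."""
    new_headers = dict(headers)
    if origin(source_url) != origin(target_url):
        for name in list(new_headers):
            if name.lower() == "authorization":
                del new_headers[name]
    return new_headers


headers = {"Authorization": "Basic dXNlcjpwYXNz", "Accept": "text/html"}

# Same domain, but the scheme changes: the header is dropped.
print(headers_for_redirect(headers, "https://example.com/a", "http://example.com/a"))
# Same origin: the header is kept.
print(headers_for_redirect(headers, "https://example.com/a", "https://example.com/b"))
```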

Scrapy allows redirect following in protocols other than HTTP

Scrapy was following redirects regardless of the URL protocol, so redirects were working for data://, file://, ftp://, s3://, and any other scheme defined in the DOWNLOAD_HANDLERS setting. However, HTTP redirects should only work between URLs that use the http:// or https:// schemes. A malicious actor, given write access to the start requests (e.g. ability to define start_urls) of a spider and read access to the spider output, could exploit this …
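A minimal sketch of the restriction described above: resolve the redirect target, then refuse to follow it unless its scheme is http or https. The `resolve_redirect` helper and the `ALLOWED_REDIRECT_SCHEMES` name are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin, urlsplit

ALLOWED_REDIRECT_SCHEMES = {"http", "https"}


def resolve_redirect(base_url, location):
    """Resolve a Location value against `base_url`.

    Returns the absolute target URL, or None if its scheme is not
    one that HTTP redirects should be allowed to reach.
    """
    target = urljoin(base_url, location)
    if urlsplit(target).scheme.lower() not in ALLOWED_REDIRECT_SCHEMES:
        return None
    return target


print(resolve_redirect("http://example.com/a", "/b"))
print(resolve_redirect("http://example.com/a", "file:///etc/passwd"))
```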

Scrapy decompression bomb vulnerability

The scrapy/scrapy project is vulnerable to XML External Entity (XXE) attacks due to the use of lxml.etree.fromstring for parsing untrusted XML data without proper validation. This vulnerability allows attackers to perform denial of service attacks, access local files, generate network connections, or circumvent firewalls by submitting specially crafted XML data.

Duplicate Advisory: Scrapy decompression bomb vulnerability

Duplicate Advisory: This advisory has been withdrawn because it is a duplicate of GHSA-7j7m-v7m3-jqm7. This link is maintained to preserve external references. Original Description: The scrapy/scrapy project is vulnerable to XML External Entity (XXE) attacks due to the use of lxml.etree.fromstring for parsing untrusted XML data without proper validation. This vulnerability allows attackers to perform denial of service attacks, access local files, generate network connections, or circumvent firewalls by submitting …

Duplicate Advisory: Scrapy authorization header leakage on cross-domain redirect

Duplicate Advisory: This advisory has been withdrawn because it is a duplicate of GHSA-cw9j-q3vf-hrrv. This link is maintained to preserve external references. Original Description: In scrapy versions before 2.11.1, an issue was identified where the Authorization header, containing credentials for server authentication, is leaked to a third-party site during a cross-domain redirect. This vulnerability arises from the failure to remove the Authorization header when redirecting across domains. The exposure of …

Scrapy decompression bomb vulnerability

Impact: Scrapy limits allowed response sizes by default through the DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE settings. However, those limits were only being enforced during the download of the raw, usually-compressed response bodies, and not during decompression, making Scrapy vulnerable to decompression bombs. A malicious website being scraped could send a small response that, on decompression, could exhaust the memory available to the Scrapy process, potentially affecting any other process sharing that memory, …
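One general way to enforce a size limit during decompression is to inflate the body in bounded chunks and stop as soon as the output grows past the limit. The sketch below does this for gzip data with the standard-library `zlib` module; `gunzip_bounded` and `DecompressedSizeExceeded` are hypothetical names, and this is not how Scrapy itself implements the fix.

```python
import gzip
import zlib


class DecompressedSizeExceeded(Exception):
    """Raised when the decompressed body grows past the allowed size."""


def gunzip_bounded(data, max_size, chunk_size=64 * 1024):
    """Decompress gzip `data`, aborting once output exceeds `max_size`.

    Output is produced at most `chunk_size` bytes at a time, so the
    limit is overshot by no more than roughly one chunk before aborting.
    """
    # wbits = MAX_WBITS | 16 tells zlib to expect gzip framing.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    output = bytearray()
    pending = data
    while pending:
        output += decompressor.decompress(pending, chunk_size)
        if len(output) > max_size:
            raise DecompressedSizeExceeded(f"more than {max_size} bytes")
        pending = decompressor.unconsumed_tail
    output += decompressor.flush()
    if len(output) > max_size:
        raise DecompressedSizeExceeded(f"more than {max_size} bytes")
    return bytes(output)


# A few kilobytes of compressed data that expand to 5 MB.
bomb = gzip.compress(b"\0" * 5_000_000)
try:
    gunzip_bounded(bomb, max_size=1_000_000)
except DecompressedSizeExceeded as exc:
    print("rejected:", exc)
```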

Scrapy vulnerable to ReDoS via XMLFeedSpider

The following parts of the Scrapy API were found to be vulnerable to a ReDoS attack:

Scrapy authorization header leakage on cross-domain redirect

Impact: When you send a request with the Authorization header to one domain, and the response asks to redirect to a different domain, Scrapy’s built-in redirect middleware creates a follow-up redirect request that keeps the original Authorization header, leaking its content to that second domain. The right behavior would be to drop the Authorization header instead, in this scenario. Patches: Upgrade to Scrapy 2.11.1. If you are using Scrapy 1.8 …

ReDoS vulnerability of XMLFeedSpider

Impact: The following parts of the Scrapy API were found to be vulnerable to a ReDoS attack: The XMLFeedSpider class or any subclass that uses the default node iterator: iternodes, as well as direct uses of the scrapy.utils.iterators.xmliter function. Scrapy 2.6.0 to 2.11.0: The open_in_browser function for a response without a base tag. Handling a malicious response could cause extreme CPU and memory usage during the parsing of its content, …

2022

Scrapy before v2.6.2 and v1.8.3 vulnerable to one proxy sending credentials to another

Because of request retries and redirects, the same request can be processed by downloader middlewares more than once, including both the built-in HTTP proxy downloader middleware and any third-party proxy-rotation downloader middleware. These third-party proxy-rotation downloader middlewares could change the proxy metadata of a request to a new value, but fail to remove the Proxy-Authorization header from the previous value of the proxy metadata, causing the credentials of one proxy …
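The following sketch shows the kind of cleanup a proxy-rotation step needs: when the proxy of a request changes, the stale proxy credentials header (the standard `Proxy-Authorization` request header) is removed before the request is sent again. Plain dictionaries stand in for the request metadata and headers, and `switch_proxy` is a hypothetical helper, not part of Scrapy or of any third-party middleware.

```python
def switch_proxy(meta, headers, new_proxy):
    """Point a request at `new_proxy`, discarding the old proxy's credentials.

    `meta` and `headers` are plain dicts standing in for the request
    metadata and request headers.
    """
    if meta.get("proxy") != new_proxy:
        # Credentials were computed for the previous proxy; never forward them.
        headers.pop("Proxy-Authorization", None)
    meta["proxy"] = new_proxy


meta = {"proxy": "http://proxy-a.example:3128"}
headers = {"Proxy-Authorization": "Basic YTpzZWNyZXQ="}

switch_proxy(meta, headers, "http://proxy-b.example:3128")
print(meta, headers)
```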

Scrapy denial of service vulnerability

Scrapy 1.4 allows remote attackers to cause a denial of service (memory consumption) via large files because arbitrarily many files are read into memory, which is especially problematic if the files are then individually written in a separate thread to a slow storage resource, as demonstrated by interaction between dataReceived (in core/downloader/handlers/http11.py) and S3FilesStore.

Incorrect Authorization and Exposure of Sensitive Information to an Unauthorized Actor in scrapy

If you manually define cookies on a Request object, and that Request object gets a redirect response, the new Request object scheduled to follow the redirect keeps those user-defined cookies, regardless of the target domain.

Cookie-setting is not restricted based on the public suffix list

Responses from domain names whose public domain name suffix contains 1 or more periods (e.g. responses from example.co.uk, given its public domain name suffix is co.uk) are able to set cookies that are included in requests to any other domain sharing the same domain name suffix.

2021

Scrapy HTTP authentication credentials potentially leaked to target websites

If you use HttpAuthMiddleware (i.e. the http_user and http_pass spider attributes) for HTTP authentication, all requests will expose your credentials to the request target. This includes requests generated by Scrapy components, such as robots.txt requests sent by Scrapy when the ROBOTSTXT_OBEY setting is set to True, as well as requests reached through redirects.
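One way to avoid this kind of exposure is to attach credentials only to requests whose host matches a domain the credentials were intended for. The sketch below is a standalone illustration of such a check; `may_send_credentials` is a hypothetical helper, and treating subdomains as matching is an assumption made here, not a statement about Scrapy's behavior.

```python
from urllib.parse import urlsplit


def may_send_credentials(url, auth_domain):
    """Return True if `url` targets `auth_domain` or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    auth_domain = auth_domain.lower()
    return host == auth_domain or host.endswith("." + auth_domain)


print(may_send_credentials("https://example.com/login", "example.com"))
print(may_send_credentials("https://other.example/robots.txt", "example.com"))
```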