URL parsing check and documentation #17

Merged · 2 commits · Oct 9, 2024
4 changes: 4 additions & 0 deletions README.md
@@ -4,6 +4,10 @@

This project was initially a part of [lxml](https://github.com/lxml/lxml). Because the HTML cleaner is designed as blocklist-based, many reports about possible security vulnerabilities were filed against lxml, which made the project problematic for security-sensitive environments. Therefore we decided to extract the problematic part into a separate project.

**Important**: the HTML Cleaner in ``lxml_html_clean`` is **not** considered appropriate **for security-sensitive environments**. See e.g. [bleach](https://pypi.org/project/bleach/) for an alternative.

This project uses functions from Python's `urllib.parse` for URL parsing which **do not validate inputs**. For more information on potential security risks, refer to the [URL parsing security](https://docs.python.org/3/library/urllib.parse.html#url-parsing-security) documentation. A maliciously crafted URL could potentially bypass the allowed hosts check in `Cleaner`.
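For illustration, here is a minimal sketch (not part of the diff) of the ambiguity this paragraph warns about, using the URL from the new test below. Parsers that follow the WHATWG URL living standard, such as browsers, treat a backslash like a forward slash, while `urllib.parse.urlsplit` keeps it inside the netloc:

    from urllib.parse import urlsplit

    # A WHATWG-style parser treats "\" like "/" and reads the host as
    # example.com; urlsplit() instead reports the allowlisted-looking host.
    url = "http://example.com:\\@google.com"

    parts = urlsplit(url)
    print(parts.netloc)    # example.com:\@google.com
    print(parts.hostname)  # google.com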

## Installation

You can install this project directly via `pip install lxml_html_clean` or as an extra of lxml
8 changes: 8 additions & 0 deletions docs/index.rst
@@ -8,6 +8,14 @@
This project was initially a part of `lxml <https://github.com/lxml/lxml>`_. Because the HTML cleaner is designed as blocklist-based,
many reports about possible security vulnerabilities were filed against lxml, which made the project problematic for
security-sensitive environments. Therefore we decided to extract the problematic part into a separate project.

**Important**: the HTML Cleaner in ``lxml_html_clean`` is **not** considered appropriate **for security-sensitive environments**.
See e.g. `bleach <https://pypi.org/project/bleach/>`_ for an alternative.

This project uses functions from Python's ``urllib.parse`` for URL parsing which **do not validate inputs**.
For more information on potential security risks, refer to the
`URL parsing security <https://docs.python.org/3/library/urllib.parse.html#url-parsing-security>`_ documentation.
A maliciously crafted URL could potentially bypass the allowed hosts check in ``Cleaner``.
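As a usage sketch (mirroring the new test in this pull request, not text from the documentation diff), the check surfaces to callers as a warning plus a rejected URL:

    import warnings
    from lxml_html_clean import AmbiguousURLWarning, Cleaner

    cleaner = Cleaner(host_whitelist=['google.com'])
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = cleaner.clean_html(
            "<p><iframe src='http://example.com:\\@google.com'></iframe></p>")

    assert caught and caught[-1].category is AmbiguousURLWarning
    assert "example.com" not in result and "google.com" not in result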

Security
--------

4 changes: 3 additions & 1 deletion lxml_html_clean/__init__.pyi
@@ -5,5 +5,7 @@ from .clean import (
    autolink as autolink,
    autolink_html as autolink_html,
    word_break as word_break,
    word_break_html as word_break_html,
    LXMLHTMLCleanWarning as LXMLHTMLCleanWarning,
    AmbiguousURLWarning as AmbiguousURLWarning,
)
41 changes: 40 additions & 1 deletion lxml_html_clean/clean.py
@@ -10,6 +10,7 @@
import re
from collections import deque
from urllib.parse import urlsplit, unquote_plus
import warnings

from lxml import etree
from lxml.html import defs
@@ -18,7 +19,7 @@


__all__ = ['clean_html', 'clean', 'Cleaner', 'autolink', 'autolink_html',
           'word_break', 'word_break_html', 'LXMLHTMLCleanWarning', 'AmbiguousURLWarning']

# Look at http://code.sixapart.com/trac/livejournal/browser/trunk/cgi-bin/cleanhtml.pl
# Particularly the CSS cleaning; most of the tag cleaning is integrated now
@@ -100,6 +101,33 @@ def fromstring(string):
    return lxml_fromstring(_ascii_control_characters.sub("", string))


# This regular expression is inspired by the one in urllib3.
_URI_RE = re.compile(
    r"^(?:(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*[a-zA-Z0-9]):)?"
    r"(?://(?P<authority>[^\\/?#]*))?"
    r"(?P<path>[^?#]*)"
    r"(?:\?(?P<query_string>[^#]*))?"
    r"(?:#(?P<fragment>.*))?$",
    re.UNICODE,
)


def _get_authority_from_url(url):
    match = _URI_RE.match(url)
    if match:
        return match.group("authority")
    else:
        return None


class LXMLHTMLCleanWarning(Warning):
    pass


class AmbiguousURLWarning(LXMLHTMLCleanWarning):
    pass


class Cleaner:
"""
Instances cleans the document of each of the possible offending
Expand Down Expand Up @@ -185,6 +213,9 @@ class Cleaner:

    Note that you may also need to set ``whitelist_tags``.

    Note that URLs are parsed via functions from ``urllib.parse`` and
    no input validation is performed.

    ``whitelist_tags``:
        A set of tags that can be included with ``host_whitelist``.
        The default is ``iframe`` and ``embed``; you may wish to
@@ -496,6 +527,14 @@ def allow_embedded_url(self, el, url):
        parts = urlsplit(url)
        if parts.scheme not in ('http', 'https'):
            return False

        authority = _get_authority_from_url(url)
        if (parts.netloc or authority) and parts.netloc != authority:
            warnings.warn(f"It's impossible to parse the hostname from URL: '{url}'! "
                          "URL is not allowed because parsers returned ambiguous results.",
                          AmbiguousURLWarning)
            return False

        if parts.hostname in self.host_whitelist:
            return True
        return False
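For reference, a small illustrative sketch (not part of the diff; `_get_authority_from_url` is a private helper) of the disagreement the new guard in `allow_embedded_url` detects:

    from urllib.parse import urlsplit
    from lxml_html_clean.clean import _get_authority_from_url

    url = "http://example.com:\\@google.com"
    print(urlsplit(url).netloc)          # example.com:\@google.com
    print(_get_authority_from_url(url))  # example.com:  (the regex stops at "\")
    # netloc != authority, so the URL is rejected with an AmbiguousURLWarning.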
11 changes: 11 additions & 0 deletions lxml_html_clean/clean.pyi
@@ -8,6 +8,17 @@ _DT = TypeVar("_DT", str, bytes, HtmlElement)
_ET_DT = TypeVar("_ET_DT", str, bytes, HtmlElement, _ElementTree[HtmlElement])


def _get_authority_from_url(url: str) -> str | None: ...


class LXMLHTMLCleanWarning(Warning):
    pass


class AmbiguousURLWarning(LXMLHTMLCleanWarning):
    pass


class Cleaner:
    @overload  # allow_tags present, remove_unknown_tags must be False
    def __init__(
15 changes: 14 additions & 1 deletion tests/test_clean.py
@@ -2,9 +2,10 @@
import gzip
import io
import unittest
import warnings

import lxml.html
from lxml_html_clean import AmbiguousURLWarning, Cleaner, clean_html, LXMLHTMLCleanWarning
from .utils import peak_memory_usage


@@ -346,3 +347,15 @@ def test_memory_usage_many_elements_with_long_tails(self):
        mem = peak_memory_usage(cleaner.clean_html, html)

        self.assertTrue(mem < 10, f"Used {mem} MiB memory, expected at most 10 MiB")

    def test_possibly_invalid_url(self):
        cleaner = Cleaner(host_whitelist=['google.com'])
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            result = cleaner.clean_html(r"<p><iframe src='http://example.com:\@google.com'> </iframe></p>")
        self.assertGreaterEqual(len(w), 1)
        self.assertIs(w[-1].category, AmbiguousURLWarning)
        self.assertTrue(issubclass(w[-1].category, LXMLHTMLCleanWarning))
        self.assertIn("impossible to parse the hostname", str(w[-1].message))
        self.assertNotIn("google.com", result)
        self.assertNotIn("example.com", result)