urlutils - Structured URL

urlutils is a module dedicated to one of software’s most versatile, well-aged, and beloved data structures: the URL, also known as the Uniform Resource Locator.

Among other things, this module is a full reimplementation of URLs, without any reliance on the urlparse or urllib standard library modules. The centerpiece and top-level interface of urlutils is the URL type. Also featured is the find_all_links() convenience function. Some low-level functions and constants are also documented below.

The implementations in this module are based heavily on RFC 3986 and RFC 3987, and incorporate details from several other RFCs and W3C documents.

New in version 17.2.

The URL type

class boltons.urlutils.URL(url='')

The URL is one of the most ubiquitous data structures in the virtual and physical landscape. From blogs to billboards, URLs are so common that it’s easy to overlook their complexity and power.

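There are 8 parts of a URL, each with its own semantics and special characters:

  • scheme
  • username
  • password
  • host
  • port
  • path
  • query_params (query string parameters)
  • fragment

Each is exposed as an attribute on the URL object. RFC 3986 offers this brief structural summary of the main URL components: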

 foo://user:pass@example.com:8042/over/there?name=ferret#nose
 \_/   \_______/ \_________/ \__/\_________/ \_________/ \__/
  |        |          |        |      |           |        |
scheme  userinfo     host     port   path       query   fragment

And here’s how that example can be manipulated with the URL type:

>>> url = URL('foo://example.com:8042/over/there?name=ferret#nose')
>>> print(url.host)
example.com
>>> print(url.get_authority())
example.com:8042
>>> print(url.qp['name'])  # qp is a synonym for query_params
ferret

URL’s approach to encoding is that inputs are decoded as much as possible, and data remains in this decoded state until re-encoded using the to_text() method. In this way, it’s similar to Python’s current approach of encouraging immediate decoding of bytes to text.
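The decode-early, encode-late idea can be sketched with the standard library alone. This is only an illustration of the principle, not how URL is implemented (urlutils does not rely on urllib), and the sample path is made up for the example:

```python
from urllib.parse import quote, unquote

# Decode once, at the boundary, and work with readable text afterwards.
raw_path = "/caf%C3%A9/men%C3%BA"
decoded = unquote(raw_path)

# Re-encode only when the text has to go back over the network.
reencoded = quote(decoded, safe="/")

print(decoded)    # /café/menú
print(reencoded)  # /caf%C3%A9/men%C3%BA
```

Keeping the data decoded in between means that comparisons and edits operate on the characters a person would actually read.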

Note that URL instances are mutable objects. If an immutable representation of the URL is desired, the string from to_text() may be used. For an immutable, but almost-as-featureful, URL object, check out the hyperlink package.

scheme

The scheme is an ASCII string, normally lowercase, which specifies the semantics for the rest of the URL, as well as network protocol in many cases. For example, “http” in “http://hatnote.com”.

username

The username is a string used by some schemes for authentication. For example, “public” in “ftp://public@example.com”.

password

The password is a string also used for authentication. Although passwords in URLs are technically deprecated by RFC 3986 Section 7.5, they are still used in cases when the URL is private or the password is public. For example, “password” in “db://private:password@127.0.0.1”.

host

The host is a string used to resolve the network location of the resource, either empty, a domain, or IP address (v4 or v6). “example.com”, “127.0.0.1”, and “::1” are all good examples of host strings.

Per spec, fully-encoded output from to_text() is IDNA encoded for compatibility with DNS.

port

The port is an integer used, along with host, in connecting to network locations. 8080 is the port in “http://localhost:8080/index.html”.

Note

Many schemes have default ports, such as 80 for HTTP and 22 for SSH. Section 3.2.3 of RFC 3986 states that when a URL’s port is the same as its scheme’s default port, the port should not be emitted:

>>> URL(u'https://github.com:443/mahmoud/boltons').to_text()
u'https://github.com/mahmoud/boltons'

Custom schemes can register their port with register_scheme(). See URL.default_port for more info.
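As a rough, standalone sketch of the rule from RFC 3986 Section 3.2.3, the port can be dropped whenever it equals the scheme's default. DEFAULT_PORTS and render_netloc below are hypothetical names invented for this example, with a small hand-picked table; they are not part of boltons:

```python
# Hypothetical, hand-picked subset of default ports (not boltons' own registry).
DEFAULT_PORTS = {"http": 80, "https": 443, "ssh": 22, "ftp": 21}


def render_netloc(scheme, host, port=None):
    """Return host[:port], omitting the port when it is the scheme's default."""
    if port is None or DEFAULT_PORTS.get(scheme.lower()) == port:
        return host
    return "%s:%d" % (host, port)


print(render_netloc("https", "github.com", 443))  # github.com
print(render_netloc("http", "localhost", 8080))   # localhost:8080
```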

path

The path is the string starting with the first leading slash after the authority part of the URL and ending before the first question mark. Often percent-quoted for network use. “/a/b/c” is the path of “http://example.com/a/b/c?d=e”.

path_parts

The tuple form of path, split on slashes. Empty slash segments are preserved, including that of the leading slash:

>>> url = URL(u'http://example.com/a/b/c')
>>> url.path_parts
(u'', u'a', u'b', u'c')
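
Why the empty segments matter can be seen with plain string operations. This is a minimal sketch that assumes nothing beyond str.split and str.join, using a made-up path: because every slash produces a segment, the tuple form can be joined back into exactly the original path.

```python
path = "/a/b//c/"

# Splitting on "/" keeps empty segments for the leading slash,
# the doubled slash, and the trailing slash.
parts = tuple(path.split("/"))
print(parts)  # ('', 'a', 'b', '', 'c', '')

# Because nothing was dropped, joining restores the exact original path.
rebuilt = "/".join(parts)
print(rebuilt == path)  # True
```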

query_params

An instance of QueryParamDict, an OrderedMultiDict subtype, mapping textual keys and values which follow the first question mark after the path. Also available as the handy alias qp:

>>> url = URL('http://boltons.readthedocs.io/en/latest/?utm_source=docs&sphinx=ok')
>>> url.qp.keys()
[u'utm_source', u'sphinx']

Query parameters are also percent-encoded for network use cases.

fragment

The string following the first ‘#’ after the query_params until the end of the URL. It has no inherent internal structure, and is percent-quoted.

classmethod from_parts(scheme=None, host=None, path_parts=(), query_params=(), fragment='', port=None, username=None, password=None)

Build a new URL from parts. Note that the respective arguments are not in the order they would appear in a URL:

Parameters:
  • scheme (str) – The scheme of a URL, e.g., ‘http’
  • host (str) – The host string, e.g., ‘hatnote.com’
  • path_parts (tuple) – The individual text segments of the path, e.g., (‘post’, ‘123’)
  • query_params (dict) – An OrderedMultiDict, dict, or list of (key, value) pairs representing the keys and values of the URL’s query parameters.
  • fragment (str) – The fragment of the URL, e.g., ‘anchor1’
  • port (int) – The integer port of the URL. Automatic defaults are available for registered schemes.
  • username (str) – The username for the userinfo part of the URL.
  • password (str) – The password for the userinfo part of the URL.

Note that this method does relatively little validation. URL.to_text() should be used to check if any errors are produced while composing the final textual URL.

to_text(full_quote=False)

Render a string representing the current state of the URL object.

>>> url = URL('http://listen.hatnote.com')
>>> url.fragment = 'en'
>>> print(url.to_text())
http://listen.hatnote.com#en

By setting the full_quote flag, the URL can either be fully quoted or minimally quoted. The most common characteristic of an encoded URL is the presence of percent-encoded text (e.g., %60). Unquoted URLs are more readable and suitable for display, whereas fully-quoted URLs are more conservative and generally necessary for sending over the network.
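The difference between the two modes can be approximated with a small standalone helper. quote_segment is a hypothetical function written for this illustration, and the set of characters it escapes in minimal mode is an assumption, not the rules boltons actually applies:

```python
from urllib.parse import quote


def quote_segment(text, full_quote=True):
    """Percent-encode one path segment, either fully or minimally."""
    if full_quote:
        # Conservative: everything outside the unreserved set is escaped.
        return quote(text, safe="")
    # Minimal (assumed rule): escape only characters that would change
    # how the URL is parsed, and leave the rest readable.
    return "".join(quote(ch, safe="") if ch in "/?#%" else ch for ch in text)


print(quote_segment("naïve file?.txt"))                    # na%C3%AFve%20file%3F.txt
print(quote_segment("naïve file?.txt", full_quote=False))  # naïve file%3F.txt
```

The fully quoted form is safe to put on the wire, while the minimal form stays legible for display and still parses back to the same segment.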

default_port

Return the default port for the currently-set scheme. Returns None if the scheme is unrecognized. See register_scheme() above. If port matches this value, no port is emitted in the output of to_text().

Applies the same ‘+’ heuristic detailed in URL.uses_netloc().
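A minimal sketch of that lookup, assuming a tiny hypothetical registry: KNOWN_PORTS and default_port_for are names invented for this example and do not reflect the internals of URL.default_port.

```python
# Hypothetical registry for illustration only.
KNOWN_PORTS = {"http": 80, "https": 443, "ssh": 22}


def default_port_for(scheme):
    """Look up a default port, falling back to the part after the last '+'."""
    scheme = scheme.lower()
    if scheme in KNOWN_PORTS:
        return KNOWN_PORTS[scheme]
    # 'git+ssh' is not registered as a whole, so try its last subpart, 'ssh'.
    _, plus, tail = scheme.rpartition("+")
    if plus:
        return KNOWN_PORTS.get(tail)
    return None


print(default_port_for("git+ssh"))     # 22
print(default_port_for("fakescheme"))  # None
```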

uses_netloc

Whether or not a URL uses : or :// to separate the scheme from the rest of the URL depends on the scheme’s own standard definition. There is no way to infer this behavior from other parts of the URL. A scheme either supports network locations or it does not.

The URL type’s approach to this is to check for explicitly registered schemes, with common schemes like HTTP preregistered. This is the same approach taken by urlparse.

URL adds two additional heuristics if the scheme as a whole is not registered. First, it attempts to check the subpart of the scheme after the last + character. This adds intuitive behavior for schemes like git+ssh. Second, if a URL with an unrecognized scheme is loaded, it will maintain the separator it sees.

>>> print(URL('fakescheme://test.com').to_text())
fakescheme://test.com
>>> print(URL('mockscheme:hello:world').to_text())
mockscheme:hello:world

get_authority(full_quote=False, with_userinfo=False)

Used by URL schemes that have a network location, get_authority() combines username, password, host, and port into one string, the authority, that is used for connecting to a network-accessible resource.

It is used internally by to_text() and can be useful for labeling connections.

>>> url = URL('ftp://user@ftp.debian.org:2121/debian/README')
>>> print(url.get_authority())
ftp.debian.org:2121
>>> print(url.get_authority(with_userinfo=True))
user@ftp.debian.org:2121

Parameters:
  • full_quote (bool) – Whether or not to apply IDNA encoding. Defaults to False.
  • with_userinfo (bool) – Whether or not to include username and password, technically part of the authority. Defaults to False.

normalize(with_case=True)

Resolve any “.” and “..” references in the path, as well as normalize scheme and host casing. To turn off case normalization, pass with_case=False.

More information can be found in Section 6.2.2 of RFC 3986.

navigate(dest)

Factory method that returns a new URL based on a given destination, dest. Useful for navigating those relative links with ease.

The newly created URL is normalized before being returned.

>>> url = URL('http://boltons.readthedocs.io')
>>> url.navigate('en/latest/')
URL(u'http://boltons.readthedocs.io/en/latest/')

Parameters:
  • dest (str) – A string or URL object representing the destination

More information can be found in Section 5 of RFC 3986.
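For a feel of the reference resolution that Section 5 describes, the standard library's urljoin applies the same kind of rules. This is shown only as a point of comparison and is not what navigate() uses internally; the base URL below is made up for the example:

```python
from urllib.parse import urljoin

base = "http://boltons.readthedocs.io/en/latest/urlutils.html"

# A sibling document replaces the last path segment.
sibling = urljoin(base, "dictutils.html")
# '..' climbs one directory before descending again.
parent = urljoin(base, "../stable/")
# A leading slash restarts from the root of the same host.
root = urljoin(base, "/")
# A bare fragment keeps the whole base and only changes the fragment.
anchor = urljoin(base, "#quoting")

print(sibling)  # http://boltons.readthedocs.io/en/latest/dictutils.html
print(parent)   # http://boltons.readthedocs.io/en/stable/
print(root)     # http://boltons.readthedocs.io/
print(anchor)   # http://boltons.readthedocs.io/en/latest/urlutils.html#quoting
```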

Low-level functions

A slew of functions used internally by URL.

boltons.urlutils.parse_url(url_text)

Used to parse the text for a single URL into a dictionary, used internally by the URL type.

Note that “URL” has a very narrow, standards-based definition. While parse_url() may raise URLParseError under a very limited number of conditions, such as a non-integer port, a surprising number of strings are technically valid URLs. For instance, the text "url" is a valid URL, because it is a relative path.

In short, do not expect this function to validate form inputs or other more colloquial usages of URLs.

>>> res = parse_url('http://127.0.0.1:3000/?a=1')
>>> sorted(res.keys())  # res is a basic dictionary
['_netloc_sep', 'authority', 'family', 'fragment', 'host', 'password', 'path', 'port', 'query', 'scheme', 'username']
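
The same permissiveness can be observed with the standard library's urlsplit, used here purely as a comparison point rather than as a description of parse_url() itself. A bare word parses as a relative path, while a non-integer port is one of the few inputs that is rejected (with ValueError in the standard library, not URLParseError):

```python
from urllib.parse import urlsplit

# A bare word is a perfectly valid relative reference: it is all path.
parts = urlsplit("url")
print(parts.scheme, parts.netloc, parts.path)

# A non-integer port is one of the few things that is rejected.
try:
    urlsplit("http://example.com:notaport/").port
    bad_port_rejected = False
except ValueError:
    bad_port_rejected = True

print(bad_port_rejected)  # True
```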

boltons.urlutils.parse_host(host)

Low-level function used to parse the host portion of a URL.

Returns a tuple of (family, host) where family is a socket module constant or None, and host is a string.

>>> parse_host('googlewebsite.com') == (None, 'googlewebsite.com')
True
>>> parse_host('[::1]') == (socket.AF_INET6, '::1')
True
>>> parse_host('192.168.1.1') == (socket.AF_INET, '192.168.1.1')
True

The odd doctest formatting above is due to Python 3’s switch from int to enums for socket constants.
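One way to approximate this classification is with the standard library's ipaddress module. classify_host is a hypothetical helper written for this sketch and is not how parse_host() is implemented:

```python
import ipaddress
import socket


def classify_host(host):
    """Return (family, host), where family is None for non-IP hosts."""
    # IPv6 literals appear in URLs wrapped in square brackets.
    bare = host[1:-1] if host.startswith("[") and host.endswith("]") else host
    try:
        ip = ipaddress.ip_address(bare)
    except ValueError:
        # Not an IP literal, so treat it as a domain name.
        return None, host
    family = socket.AF_INET6 if ip.version == 6 else socket.AF_INET
    return family, bare


print(classify_host("[::1]") == (socket.AF_INET6, "::1"))              # True
print(classify_host("192.168.1.1") == (socket.AF_INET, "192.168.1.1"))  # True
```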

boltons.urlutils.parse_qsl(qs, keep_blank_values=True, encoding='utf8')

Converts a query string into a list of (key, value) pairs.
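The effect of keep_blank_values is easiest to see with the standard library function of the same name, used here only for comparison. Note that the standard library defaults to dropping blank values, whereas the signature above defaults to keeping them; the query string is made up for the example:

```python
from urllib.parse import parse_qsl

qs = "q=url+parsing&debug=&tag=a&tag=b"

# The standard library drops 'debug=' unless asked to keep it.
dropped = parse_qsl(qs)
kept = parse_qsl(qs, keep_blank_values=True)

print(dropped)  # [('q', 'url parsing'), ('tag', 'a'), ('tag', 'b')]
print(kept)     # [('q', 'url parsing'), ('debug', ''), ('tag', 'a'), ('tag', 'b')]
```

In both cases the pairs keep their original order and repeated keys are preserved, which is what makes a list of pairs a faithful representation of a query string.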

boltons.urlutils.resolve_path_parts(path_parts)

Normalize the URL path by resolving segments of ‘.’ and ‘..’, resulting in a dot-free path. See RFC 3986 section 5.2.4, Remove Dot Segments.
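A simplified sketch of dot-segment removal over a tuple of segments follows. resolve_dot_segments is a hypothetical stand-in written for illustration; it covers the common absolute-path cases and is not a complete or authoritative implementation of RFC 3986 Section 5.2.4, nor a copy of resolve_path_parts():

```python
def resolve_dot_segments(path_parts):
    """Resolve '.' and '..' in a sequence of path segments."""
    resolved = []
    for part in path_parts:
        if part == ".":
            continue
        if part == "..":
            # Never pop the leading empty segment that stands for the root slash.
            if resolved and resolved != [""]:
                resolved.pop()
            continue
        resolved.append(part)
    # A trailing '.' or '..' refers to a directory, so keep a trailing slash.
    if path_parts and path_parts[-1] in (".", ".."):
        resolved.append("")
    return tuple(resolved)


print(resolve_dot_segments(("", "a", "b", "..", "c", ".", "d")))  # ('', 'a', 'c', 'd')
print(resolve_dot_segments(("", "a", "b", "..")))                 # ('', 'a', '')
```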

class boltons.urlutils.QueryParamDict(*args, **kwargs)

A subclass of OrderedMultiDict specialized for representing query string values. Everything is fully unquoted on load and all parsed keys and values are strings by default.

As the name suggests, multiple values are supported and insertion order is preserved.

>>> qp = QueryParamDict.from_text(u'key=val1&key=val2&utm_source=rtd')
>>> qp.getlist('key')
[u'val1', u'val2']
>>> qp['key']
u'val2'
>>> qp.add('key', 'val3')
>>> qp.to_text()
'key=val1&key=val2&utm_source=rtd&key=val3'

See OrderedMultiDict for more API features.

classmethod from_text(query_string)

Parse query_string and return a new QueryParamDict.

to_text(full_quote=False)

Render and return a query string.

Parameters:
  • full_quote (bool) – Whether or not to percent-quote special characters or leave them decoded for readability.

Quoting

URLs have many parts, and almost as many individual “quoting” (encoding) strategies.

boltons.urlutils.quote_userinfo_part(text, full_quote=True)

Quote special characters in either the username or password section of the URL. Note that userinfo in URLs is considered deprecated in many circles (especially browsers), and support for percent-encoded userinfo can be spotty.

boltons.urlutils.quote_path_part(text, full_quote=True)

Percent-encode a single segment of a URL path.

boltons.urlutils.quote_query_part(text, full_quote=True)

Percent-encode a single query string key or value.

boltons.urlutils.quote_fragment_part(text, full_quote=True)

Quote the fragment part of the URL. Fragments don’t have subdelimiters, so the whole URL fragment can be passed.

There is, however, only one unquoting strategy:

boltons.urlutils.unquote(string, encoding='utf-8', errors='replace')

Percent-decode a string, by replacing %xx escapes with their single-character equivalent. The optional encoding and errors parameters specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method. By default, percent-encoded sequences are decoded with UTF-8, and invalid sequences are replaced by a placeholder character.

>>> unquote(u'abc%20def')
u'abc def'
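
The encoding and errors parameters are easiest to understand by example. The sketch below uses the standard library's unquote, which has the same signature, as a stand-in; it is an assumption that the boltons function behaves identically in these cases:

```python
from urllib.parse import unquote

# Multi-byte UTF-8 sequences decode to a single character.
check = unquote("%E2%9C%93")

# A byte that is not valid UTF-8 becomes the replacement character
# because errors defaults to 'replace'.
invalid = unquote("%FF")

# The same escape can mean something else under another encoding.
latin = unquote("caf%E9", encoding="latin-1")

print(check)  # ✓
print(latin)  # café
```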

Useful constants

boltons.urlutils.SCHEME_PORT_MAP

A mapping of URL schemes to their protocols’ default ports. Painstakingly assembled from the IANA scheme registry, port registry, and independent research.

Keys are lowercase strings, values are integers or None, with None indicating that the scheme does not have a default port (or may not support ports at all):

>>> boltons.urlutils.SCHEME_PORT_MAP['http']
80
>>> print(boltons.urlutils.SCHEME_PORT_MAP['file'])
None

See URL.port for more info on how it is used. See NO_NETLOC_SCHEMES for more scheme info.

Also available in JSON.

boltons.urlutils.NO_NETLOC_SCHEMES

This is a set of schemes that explicitly do not support network resolution, such as “mailto” and “urn”.