Building a URI Regular Expression
2012-05-21
Introduction
Uniform Resource Locators, or URLs, are a type of Uniform Resource
Identifier (URI). When solving programming problems, it may be useful to build
a regular expression that will match all URIs within a string.
This article will show how to build a regular expression, consistent with
the URI specification, that matches URIs.
URI Specification
The URI syntax is defined in RFC 3986. The definition is written in
Augmented Backus-Naur Form (ABNF), so all we need to do is convert the
ABNF definition to regular expression syntax.
ABNF to Regular Expressions
The following shows the ABNF definition for each part of the URI spec, its
equivalent regular expression and that expression written in Perl.
I have made one simplification by limiting host names to registered names
(reg-name) only and not allowing IP addresses.
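The conversion is mostly mechanical: ABNF concatenation stays concatenation, the alternation "/" becomes "|", a leading "*" becomes a trailing "*", "1*" becomes "+", and an optional "[ ... ]" becomes "(?:...)?". The following is a minimal sketch of those mappings. The variable names and sample strings are illustrative, and the last rule is made up purely to show "/" and "1*" together.

```perl
use strict;
use warnings;

# ABNF: port = *DIGIT                    -> zero or more digits
my $port = '[0-9]*';

# ABNF: pct-encoded = "%" HEXDIG HEXDIG  -> a literal, then a counted repeat
my $pct_encoded = '%[0-9A-F]{2}';

# ABNF: [ ":" port ]                     -> optional non-capturing group
my $opt_port = "(?::${port})?";

# ABNF: 1*( DIGIT / pct-encoded )        -> alternation in a group, then "+"
# (hypothetical rule, not part of RFC 3986)
my $digits_or_pct = "(?:[0-9]|${pct_encoded})+";

# An optional group also accepts the empty string, so all three samples match.
for my $sample (':8080', ':', '') {
    printf "%-8s %s\n", "'$sample'",
        $sample =~ /\A${opt_port}\z/ ? 'matches' : 'no match';
}
print "alternation ok\n" if '4%2F2' =~ /\A${digits_or_pct}\z/;
```

Each later rule is built the same way, by interpolating the smaller expressions into the larger ones.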
URI
ABNF |
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
|
Regex |
{scheme}:{hier_part}(?:\?{query})?(?:#{fragment})?
|
Perl |
$uri = "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";
|
Hierarchical Part
ABNF |
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
|
Regex |
(?://{authority}{path_abempty}|{path_absolute}|{path_rootless})?
|
Perl |
$hier_part = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";
|
URI Scheme
ABNF |
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
|
Regex |
[a-zA-Z][a-zA-Z0-9+\-.]*
|
Perl |
$scheme = '[a-zA-Z][a-zA-Z0-9+\-.]*';
|
Naming Authority
ABNF |
authority = [ userinfo "@" ] host [ ":" port ]
|
Regex |
(?:{userinfo}@)?{host}(?::{port})?
|
Perl |
$authority = "(?:${userinfo}\@)?${host}(?::${port})?";
|
User Information
ABNF |
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
|
Regex |
(?:{unreserved}|{pct_encoded}|{sub_delims}|:)*
|
Perl |
$userinfo = "(?:${unreserved}|${pct_encoded}|${sub_delims}|:)*";
|
Host
ABNF |
host = IP-literal / IPv4address / reg-name
|
Regex |
{reg_name}
Modified to only allow registered names
|
Perl |
$host = $reg_name;
|
Port Number
ABNF |
port = *DIGIT
|
Regex |
[0-9]*
|
Perl |
$port = '[0-9]*';
|
Registered Name
ABNF |
reg-name = *( unreserved / pct-encoded / sub-delims )
|
Regex |
(?:{unreserved}|{pct_encoded}|{sub_delims})*
|
Perl |
$reg_name = "(?:${unreserved}|${pct_encoded}|${sub_delims})*";
|
Path Absolute or Empty
ABNF |
path-abempty = *( "/" segment )
|
Regex |
(?:/{segment})*
|
Perl |
$path_abempty = "(?:/${segment})*";
|
Path Absolute
ABNF |
path-absolute = "/" [ segment-nz *( "/" segment ) ]
|
Regex |
/(?:{segment_nz}(?:/{segment})*)?
|
Perl |
$path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";
|
Path Beginning with Segment
ABNF |
path-rootless = segment-nz *( "/" segment )
|
Regex |
{segment_nz}(?:/{segment})*
|
Perl |
$path_rootless = "${segment_nz}(?:/${segment})*";
|
Path Empty
ABNF |
path-empty = 0<pchar>
|
Regex |
No regular expression is needed for this rule; the optional hier-part group already matches an empty path |
Segment
ABNF |
segment = *pchar
|
Regex |
{pchar}*
|
Perl |
$segment = "${pchar}*";
|
Segment, Non-Zero Length
ABNF |
segment-nz = 1*pchar
|
Regex |
{pchar}+
|
Perl |
$segment_nz = "${pchar}+";
|
Segment, Non-Zero Length, No colon
ABNF |
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
|
Regex |
(?:{unreserved}|{pct_encoded}|{sub_delims}|@)+
|
Perl |
$segment_nz_nc = "(?:${unreserved}|${pct_encoded}|${sub_delims}|\@)+";
|
Path Characters
ABNF |
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
|
Regex |
(?:{unreserved}|{pct_encoded}|{sub_delims}|[:@])
|
Perl |
$pchar = "(?:${unreserved}|${pct_encoded}|${sub_delims}|[:\@])";
|
Query Component
ABNF |
query = *( pchar / "/" / "?" )
|
Regex |
(?:{pchar}|[/?])*
|
Perl |
$query = "(?:${pchar}|[/?])*";
|
Fragment Component
ABNF |
fragment = *( pchar / "/" / "?" )
|
Regex |
(?:{pchar}|[/?])*
|
Perl |
$fragment = "(?:${pchar}|[/?])*";
|
Percent-Encoded
ABNF |
pct-encoded = "%" HEXDIG HEXDIG
|
Regex |
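%[0-9A-F]{2}
Matches uppercase hex digits only; RFC 3986 treats lowercase digits as equivalent, so use [0-9A-Fa-f] or the /i modifier to accept both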
|
Perl |
$pct_encoded = '%[0-9A-F]{2}';
|
Unreserved Characters
ABNF |
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
|
Regex |
[a-zA-Z0-9\-._~]
|
Perl |
$unreserved = '[a-zA-Z0-9\-._~]';
|
Subcomponent Delimiters
ABNF |
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
|
Regex |
[!$&'()*+,;=]
|
Perl |
$sub_delims = '[!$&\'()*+,;=]';
|
Simplifications
Several optimizations simplify the regex and improve its performance. For
example, the expanded definition of pchar is:
(?:
    [a-zA-Z0-9\-._~]    # unreserved
  |
    %[0-9A-F]{2}        # pct-encoded
  |
    [!$&'()*+,;=]       # sub-delims
  |
    [:@]                # ':' | '@'
)
Merging the three character classes into one simplifies this to:
(?:
    [a-zA-Z0-9\-._~!$&'()*+,;=:@]
  |
    %[0-9A-F]{2}
)
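A quick way to gain confidence in a simplification like this is to check that both forms accept exactly the same inputs. The sketch below compares them over every printable ASCII character plus a few percent-encoded samples. The variable names and the sample list are my own choices, and a test like this is only a sanity check, not a proof.

```perl
use strict;
use warnings;

# The two forms of pchar shown above, written without the layout whitespace.
my $pchar_long  = '(?:[a-zA-Z0-9\-._~]|%[0-9A-F]{2}|[!$&\'()*+,;=]|[:@])';
my $pchar_short = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-F]{2})';

# Every printable ASCII character, plus valid and invalid percent escapes.
my @samples = map { chr($_) } 0x20 .. 0x7E;
push @samples, '%2F', '%zz', '%2';

my $differences = 0;
for my $sample (@samples) {
    my $long  = $sample =~ /\A${pchar_long}\z/  ? 1 : 0;
    my $short = $sample =~ /\A${pchar_short}\z/ ? 1 : 0;
    $differences++ if $long != $short;
}
print "differences: $differences\n";   # prints "differences: 0"
```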
Perl Function
All the expressions can be put into a Perl function to assemble the
complete regular expression:
sub build_uri_regex
{
    # Character classes shared by several rules
    my $pchar_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@]';
    my $f_q_char      = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]';
    my $seg_nc_char   = '[a-zA-Z0-9\-._~!$&\'()*+,;=@]';
    my $reg_char      = '[a-zA-Z0-9\-._~!$&\'()*+,;=]';
    my $user_char     = '[a-zA-Z0-9\-._~!$&\'()*+,;=:]';
    my $pct_encoded   = '%[0-9A-F]{2}';

    # Path characters, query, fragment and path segments
    my $pchar         = "(?:${pchar_char}|${pct_encoded})";
    my $fragment      = "(?:${f_q_char}|${pct_encoded})*";
    my $query         = "(?:${f_q_char}|${pct_encoded})*";
    my $segment       = "${pchar}*";
    my $segment_nz    = "${pchar}+";
    my $segment_nz_nc = "(?:${seg_nc_char}|${pct_encoded})+";
    my $path_abempty  = "(?:/${segment})*";
    my $path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";
    my $path_rootless = "${segment_nz}(?:/${segment})*";

    # Authority
    my $reg_name      = "(?:${reg_char}|${pct_encoded})*";
    my $port          = '[0-9]*';
    my $host          = $reg_name;
    my $userinfo      = "(?:${user_char}|${pct_encoded})*";
    my $authority     = "(?:${userinfo}\@)?${host}(?::${port})?";

    # Top-level rules
    my $scheme        = '[a-zA-Z][a-zA-Z0-9\-.+]*';
    my $hier_part     = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";
    my $uri           = "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";
    return $uri;
}
Completed Regular Expression
The following shows the complete URI regular expression. A version
with white space added can be found at the end of this article.
[a-zA-Z][a-zA-Z0-9\-.+]*:(?://(?:(?:[a-zA-Z0-9\-._~!$&'()*+,;
=:]|%[0-9A-F]{2})*@)?(?:[a-zA-Z0-9\-._~!$&'()*+,;=]|%[0-9A-F]
{2})*(?::[0-9]*)?(?:/(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-
F]{2})*)*|/(?:(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+
(?:/(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)?|(?:[a
-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+(?:/(?:[a-zA-Z0-9\-
._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)?(?:\?(?:[a-zA-Z0-9\-._~!$
&'()*+,;=:@/?]|%[0-9A-F]{2})*)?(?:#(?:[a-zA-Z0-9\-._~!$&'()*+
,;=:@/?]|%[0-9A-F]{2})*)?
Final Thoughts
If you are given a URI, this expression can be used to validate it and,
with capturing groups added around the components, to split it into its
parts.
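As a minimal sketch of the splitting step, the following wraps the top-level components in named captures and anchors the match to the whole string. It assumes Perl 5.10 or later for named captures. The capture names, the helper split_uri and the compact variable names are illustrative choices; the character classes are the same ones used in the function above.

```perl
use strict;
use warnings;

# Compact restatement of the building blocks (same character classes as above).
my $pct       = '%[0-9A-F]{2}';
my $base      = q{a-zA-Z0-9\-._~!$&'()*+,;=};         # unreserved + sub-delims
my $pchar     = "(?:[${base}:\@]|${pct})";
my $qf        = "(?:[${base}:\@/?]|${pct})*";           # query and fragment
my $segment   = "${pchar}*";
my $seg_nz    = "${pchar}+";
my $authority = "(?:(?:[${base}:]|${pct})*\@)?(?:[${base}]|${pct})*(?::[0-9]*)?";
my $hier_part = "(?://${authority}(?:/${segment})*"
              . "|/(?:${seg_nz}(?:/${segment})*)?"
              . "|${seg_nz}(?:/${segment})*)?";
my $scheme    = '[a-zA-Z][a-zA-Z0-9+\-.]*';

# Same top-level rule as before, with a named capture around each component.
my $uri = "(?<scheme>${scheme}):(?<hier_part>${hier_part})"
        . "(?:\\?(?<query>${qf}))?(?:#(?<fragment>${qf}))?";

# Returns a hash reference of the captured components, or nothing on failure.
sub split_uri {
    my ($candidate) = @_;
    return unless $candidate =~ /\A${uri}\z/;
    my %parts = %+;          # copy the named captures before they are reset
    return \%parts;
}

my $parts = split_uri('http://user@example.com:8080/a/b?x=1#top');
print "$_ = $parts->{$_}\n" for sort keys %$parts;
```

Components that are absent from the input, such as a missing query, simply do not appear as keys in the returned hash.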
However, if you scan large blocks of text with this regular expression,
you will quickly discover that many strings that are not URIs also match
(e.g. "languages:"). Because of this, the expression is not useful for
finding all URIs within a block of text.
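The false positives follow directly from the grammar: everything after the colon is optional, so the shortest string the full expression accepts is a scheme followed by ":". The sketch below uses only that mandatory part to show where matches begin in ordinary prose. The sample text is made up, and the full expression would extend each hit past the colon wherever more URI characters follow.

```perl
use strict;
use warnings;

# Only the scheme and the colon are mandatory in the full expression,
# so this reduced pattern is enough to reproduce the false positives.
my $scheme  = '[a-zA-Z][a-zA-Z0-9+\-.]*';
my $minimal = "${scheme}:";

my $text = 'Supported languages: Perl and C. Note: see http://example.com/ for details.';

# In list context with /g, the match returns every captured hit.
my @hits = $text =~ /(${minimal})/g;
print join(', ', @hits), "\n";   # prints "languages:, Note:, http:"
```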
If you want to locate URIs within text, you should be able to use this
regular expression as a starting point and modify the rules to make the
expression more restrictive. For example, the scheme could be restricted
to only HTTP and HTTPS in the following way:
my $scheme = 'https?';
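As a rough sketch of how such a restricted expression might be used to scan text, the following combines the narrowed scheme with two further restrictions that are my own assumptions rather than part of the specification: the "//" and authority are always required, and the host must be non-empty. The variable names and the sample text are illustrative.

```perl
use strict;
use warnings;

my $pct       = '%[0-9A-F]{2}';
my $base      = q{a-zA-Z0-9\-._~!$&'()*+,;=};          # unreserved + sub-delims
my $pchar     = "(?:[${base}:\@]|${pct})";
my $qf        = "(?:[${base}:\@/?]|${pct})*";            # query and fragment
my $reg_name  = "(?:[${base}]|${pct})+";                 # assumption: non-empty host
my $authority = "(?:(?:[${base}:]|${pct})*\@)?${reg_name}(?::[0-9]*)?";
my $scheme    = 'https?';

# Assumption: always require "//" and an authority, then path-abempty.
my $http_uri  = "${scheme}://${authority}(?:/${pchar}*)*(?:\\?${qf})?(?:#${qf})?";

my $text = 'Supported languages: Perl. See http://example.com/docs?x=1 '
         . 'and https://example.org/a#top for more.';

# \b keeps a match from starting in the middle of a longer word.
my @found = $text =~ /\b(${http_uri})/g;
print "$_\n" for @found;
```

Note that characters such as "." and "," are valid in a URI, so punctuation that directly follows a URI in prose would still be swallowed by the match; handling that needs extra rules beyond this sketch.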
Completed Regular Expression (White-Space Added)
The following shows the complete URI regular expression with
white-space and comments added:
[a-zA-Z][a-zA-Z0-9\-.+]*: {scheme}
(?:
    //
    (?: {authority}
        (?: {userinfo}
            [a-zA-Z0-9\-._~!$&'()*+,;=:]
            |
            %[0-9A-F]{2}
        )*
        @
    )?
    (?: {host}
        [a-zA-Z0-9\-._~!$&'()*+,;=]
        |
        %[0-9A-F]{2}
    )*
    (?:
        :
        [0-9]* {port}
    )?
    (?:
        /
        (?: {path-abempty}
            [a-zA-Z0-9\-._~!$&'()*+,;=:@]
            |
            %[0-9A-F]{2}
        )*
    )*
    |
    /
    (?:
        (?: {path-absolute}
            [a-zA-Z0-9\-._~!$&'()*+,;=:@]
            |
            %[0-9A-F]{2}
        )+
        (?:
            /
            (?:
                [a-zA-Z0-9\-._~!$&'()*+,;=:@]
                |
                %[0-9A-F]{2}
            )*
        )*
    )?
    |
    (?: {path-rootless}
        [a-zA-Z0-9\-._~!$&'()*+,;=:@]
        |
        %[0-9A-F]{2}
    )+
    (?:
        /
        (?:
            [a-zA-Z0-9\-._~!$&'()*+,;=:@]
            |
            %[0-9A-F]{2}
        )*
    )*
)?
(?:
    \?
    (?: {query}
        [a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
        |
        %[0-9A-F]{2}
    )*
)?
(?:
    #
    (?: {fragment}
        [a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
        |
        %[0-9A-F]{2}
    )*
)?
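The {name} annotations in this layout are labels for the reader, not Perl syntax. To keep a layout like this in actual Perl source, the /x modifier can be used: it ignores whitespace outside character classes and treats "#" as the start of a comment, which means the literal "#" before the fragment has to be escaped. The following is a small sketch covering only the optional query and fragment at the end of the expression; the variable names are illustrative.

```perl
use strict;
use warnings;

# Character class shared by the query and the fragment.
my $qf_char = q{[a-zA-Z0-9\-._~!$&'()*+,;=:@/?]};

# The tail of the URI expression, laid out with /x.
my $tail = qr{
    (?: \?                                # a question mark starts the query
        (?: $qf_char | %[0-9A-F]{2} )*
    )?
    (?: \#                                # escaped, or it would start a comment
        (?: $qf_char | %[0-9A-F]{2} )*
    )?
}x;

for my $sample ('?x=1#top', '#top', '', '?x#a#b') {
    printf "%-10s %s\n", "'$sample'",
        $sample =~ /\A$tail\z/ ? 'matches' : 'no match';
}
```

Comments inside a pattern are still subject to variable interpolation, so they are best kept free of "$" and "@" characters.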