Building a URI Regular Expression
2012-05-21
Introduction
Uniform Resource Locators, or URLs, are a type of Uniform Resource 
Identifier (URI). When solving programming problems, it may be useful to build 
a regular expression that will match all URIs within a string.
This article will show how to build a regular expression, consistent with 
the URI specification, that matches URIs.
URI Specification
The URI syntax is defined in RFC 3986. The definition is written in 
Augmented Backus-Naur Form (ABNF), so all we need to do is convert each 
ABNF rule to regular expression syntax.
ABNF to Regular Expressions
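The following shows the ABNF definition for each part of the URI 
specification, its equivalent regular expression, and that expression 
written in Perl. In the regular expressions, a name in braces such as 
{scheme} stands for the expression of the rule with that name.
I have made one simplification: host names are limited to registered names 
(reg-name), so the IP-literal and IPv4address rules are not translated. 
Dotted IPv4 addresses still match, because they are also syntactically 
valid registered names.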
URI
| ABNF | 
URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
  | 
| Regex | 
{scheme}:{hier_part}(?:\?{query})?(?:#{fragment})?
 | 
| Perl | 
$uri = "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";
 | 
Hierarchical Part
| ABNF | 
hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty
 | 
| Regex | 
(?://{authority}{path_abempty}|{path_absolute}|{path_rootless})?
 | 
| Perl | 
$hier_part = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";
 | 
URI Scheme
| ABNF | 
scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
  | 
| Regex | 
[a-zA-Z][a-zA-Z0-9+\-.]*
  | 
| Perl | 
$scheme = '[a-zA-Z][a-zA-Z0-9+\-.]*';
  | 
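As a quick check of the scheme expression, the sketch below anchors it with \A and \z and tries it against a few strings. The is_scheme helper and the sample strings are my own illustration and not part of the specification.

```perl
use strict;
use warnings;

# Scheme expression from the table above.
my $scheme = '[a-zA-Z][a-zA-Z0-9+\-.]*';

# Hypothetical helper: true if the whole string is a syntactically valid scheme.
sub is_scheme {
    my ($candidate) = @_;
    return $candidate =~ /\A$scheme\z/ ? 1 : 0;
}

# Sample strings chosen for illustration only.
for my $candidate ('http', 'svn+ssh', 'x-my.scheme', '3com', '+tel') {
    printf "%-12s %s\n", $candidate, is_scheme($candidate) ? 'valid' : 'invalid';
}
```

The last two samples are rejected because a scheme must begin with a letter.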
Naming Authority
| ABNF | 
authority     = [ userinfo "@" ] host [ ":" port ]
  | 
| Regex | 
(?:{userinfo}@)?{host}(?::{port})?
 | 
| Perl | 
$authority = "(?:${userinfo}\@)?${host}(?::${port})?";
 | 
User Information
| ABNF | 
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
  | 
| Regex | 
(?:{unreserved}|{pct_encoded}|{sub_delims}|:)*
 | 
| Perl | 
$userinfo = "(?:${unreserved}|${pct_encoded}|${sub_delims}|:)*";
 | 
Host
| ABNF | 
host          = IP-literal / IPv4address / reg-name
  | 
| Regex | 
{reg_name}
Modified to only allow registered names
 | 
| Perl | 
$host = $reg_name;
  | 
Port Number
| ABNF | 
port          = *DIGIT
  | 
| Regex | 
[0-9]*
  | 
| Perl | 
$port = '[0-9]*';
  | 
Registered Name
| ABNF | 
reg-name      = *( unreserved / pct-encoded / sub-delims )
  | 
| Regex | 
(?:{unreserved}|{pct_encoded}|{sub_delims})*
 | 
| Perl | 
$reg_name = "(?:${unreserved}|${pct_encoded}|${sub_delims})*";
 | 
Path Absolute or Empty
| ABNF | 
path-abempty  = *( "/" segment )
  | 
| Regex | 
(?:/{segment})*
 | 
| Perl | 
$path_abempty = "(?:/${segment})*";
 | 
Path Absolute
| ABNF | 
path-absolute = "/" [ segment-nz *( "/" segment ) ]
  | 
| Regex | 
/(?:{segment_nz}(?:/{segment})*)?
 | 
| Perl | 
$path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";
 | 
Path Beginning with Segment
| ABNF | 
path-rootless = segment-nz *( "/" segment )
  | 
| Regex | 
{segment_nz}(?:/{segment})*
 | 
| Perl | 
$path_rootless = "${segment_nz}(?:/${segment})*";
 | 
Path Empty
| ABNF | 
path-empty    = 0<pchar>
  | 
| Regex | 
No regular expression is needed for this rule: it matches the empty string, which the optional group in the hierarchical part already allows | 
Segment
| ABNF | 
segment       = *pchar
  | 
| Regex | 
{pchar}*
 | 
| Perl | 
$segment = "${pchar}*";
 | 
Segment, Non-Zero Length
| ABNF | 
segment-nz    = 1*pchar
  | 
| Regex | 
{pchar}+
 | 
| Perl | 
$segment_nz = "${pchar}+";
 | 
Segment, Non-Zero Length, No Colon
| ABNF | 
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"
 | 
| Regex | 
(?:{unreserved}|{pct_encoded}|{sub_delims}|@)+
 | 
| Perl | 
$segment_nz_nc = "(?:${unreserved}|${pct_encoded}|${sub_delims}|\@)+";
 | 
Path Characters
| ABNF | 
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
  | 
| Regex | 
(?:{unreserved}|{pct_encoded}|{sub_delims}|[:@])
 | 
| Perl | 
$pchar = "(?:${unreserved}|${pct_encoded}|${sub_delims}|[:\@])";
 | 
Query Component
| ABNF | 
query         = *( pchar / "/" / "?" )
  | 
| Regex | 
(?:{pchar}|[/?])*
 | 
| Perl | 
$query = "(?:${pchar}|[/?])*";
 | 
Fragment Component
| ABNF | 
fragment      = *( pchar / "/" / "?" )
  | 
| Regex | 
(?:{pchar}|[/?])*
 | 
| Perl | 
$fragment = "(?:${pchar}|[/?])*";
 | 
Percent-Encoded
| ABNF | 
pct-encoded   = "%" HEXDIG HEXDIG
  | 
| Regex | 
%[0-9A-F]{2}
 | 
| Perl | 
$pct_encoded = '%[0-9A-F]{2}';
 | 
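One detail worth checking is letter case. RFC 3986 treats the hex digits "a" through "f" as equivalent to "A" through "F", while the expression above accepts only the uppercase form. The sketch below compares it with a relaxed variant; the name $pct_encoded_ci and the sample triplets are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Expression from the table above: uppercase hex digits only.
my $pct_encoded = '%[0-9A-F]{2}';

# Assumed variant that also accepts lowercase hex digits.
my $pct_encoded_ci = '%[0-9A-Fa-f]{2}';

# Sample triplets chosen for illustration only.
for my $triplet ('%2F', '%2f', '%G1', '%2') {
    my $strict  = $triplet =~ /\A$pct_encoded\z/    ? 'yes' : 'no';
    my $relaxed = $triplet =~ /\A$pct_encoded_ci\z/ ? 'yes' : 'no';
    print "$triplet strict=$strict relaxed=$relaxed\n";
}
```

If the input may contain lowercase escapes such as %2f, the relaxed class is probably the safer choice.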
Unreserved Characters
| ABNF | 
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
  | 
| Regex | 
[a-zA-Z0-9\-._~]
  | 
| Perl | 
$unreserved = '[a-zA-Z0-9\-._~]';
  | 
Subcomponent Delimiters
| ABNF | 
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="
 | 
| Regex | 
[!$&'()*+,;=]
  | 
| Perl | 
$sub_delims = '[!$&\'()*+,;=]';
  | 
Simplifications
Several optimizations can simplify the regex and improve its 
performance. For example, the fully expanded definition of pchar is:
(?:
	[a-zA-Z0-9\-._~]		# unreserved
|
	%[0-9A-F]{2}			# pct-encoded
|
	[!$&'()*+,;=]		# sub-delims
|
	[:@]				# ':' | '@'
)
Three of the four alternatives are single-character classes, so they can 
be merged into one class:
(?:
	[a-zA-Z0-9\-._~!$&'()*+,;=:@]
|
	%[0-9A-F]{2}
)
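A simplification like this is easy to get wrong, so it may be worth checking that both forms accept exactly the same input. The sketch below compares them over every single ASCII character and a handful of percent triplets. The test inputs and variable names are my own choice, not part of the original code.

```perl
use strict;
use warnings;

my $unreserved  = '[a-zA-Z0-9\-._~]';
my $pct_encoded = '%[0-9A-F]{2}';
my $sub_delims  = '[!$&\'()*+,;=]';

# The two forms of pchar shown above.
my $expanded   = "(?:${unreserved}|${pct_encoded}|${sub_delims}|[:\@])";
my $simplified = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-F]{2})';

# Inputs: every single ASCII character plus a few percent triplets.
my @inputs = ((map { chr($_) } 0 .. 127), '%20', '%7E', '%7e', '%', '%2');

# Count the inputs on which the two forms disagree.
my $differences = 0;
for my $input (@inputs) {
    my $old = $input =~ /\A$expanded\z/   ? 1 : 0;
    my $new = $input =~ /\A$simplified\z/ ? 1 : 0;
    $differences++ if $old != $new;
}
print "inputs on which the two forms disagree: $differences\n";
```

A count of zero is not a proof, but it does cover every single-character case that the merged class affects.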
Perl Function
All the expressions can be put into a Perl function to assemble the 
complete regular expression:
sub build_uri_regex
{
	my $pchar_char  = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@]';
	my $f_q_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]';
	my $seg_nc_char = '[a-zA-Z0-9\-._~!$&\'()*+,;=@]';
	my $reg_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=]';
	my $user_char   = '[a-zA-Z0-9\-._~!$&\'()*+,;=:]';
	my $pct_encoded = '%[0-9A-F]{2}';
	my $pchar = "(?:${pchar_char}|${pct_encoded})";
	my $fragment = "(?:${f_q_char}|${pct_encoded})*";
	my $query = "(?:${f_q_char}|${pct_encoded})*";
	my $segment = "${pchar}*";
	my $segment_nz = "${pchar}+";
	my $segment_nz_nc = "(?:${seg_nc_char}|${pct_encoded})+";
	my $path_abempty = "(?:/${segment})*";
	my $path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";
	my $path_rootless = "${segment_nz}(?:/${segment})*";
	my $reg_name = "(?:${reg_char}|${pct_encoded})*";
	my $port = '[0-9]*';
	my $host = $reg_name;
	my $userinfo = "(?:${user_char}|${pct_encoded})*";
	my $authority = "(?:${userinfo}\@)?${host}(?::${port})?";
	my $scheme = '[a-zA-Z][a-zA-Z0-9\-.+]*';
	my $hier_part = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";
	my $uri = "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";
	return $uri;
}
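The function returns the expression as a string, so it still has to be compiled and, for validation, anchored. The sketch below shows one way to do that. The function is repeated (without the unused segment-nz-nc variables) only so the example runs on its own. The is_uri helper and the sample URIs are assumptions for illustration.

```perl
use strict;
use warnings;

# Copy of build_uri_regex from above, repeated so the sketch is self-contained.
sub build_uri_regex {
    my $pchar_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@]';
    my $f_q_char      = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]';
    my $reg_char      = '[a-zA-Z0-9\-._~!$&\'()*+,;=]';
    my $user_char     = '[a-zA-Z0-9\-._~!$&\'()*+,;=:]';
    my $pct_encoded   = '%[0-9A-F]{2}';
    my $pchar         = "(?:${pchar_char}|${pct_encoded})";
    my $fragment      = "(?:${f_q_char}|${pct_encoded})*";
    my $query         = "(?:${f_q_char}|${pct_encoded})*";
    my $segment       = "${pchar}*";
    my $segment_nz    = "${pchar}+";
    my $path_abempty  = "(?:/${segment})*";
    my $path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";
    my $path_rootless = "${segment_nz}(?:/${segment})*";
    my $reg_name      = "(?:${reg_char}|${pct_encoded})*";
    my $port          = '[0-9]*';
    my $host          = $reg_name;
    my $userinfo      = "(?:${user_char}|${pct_encoded})*";
    my $authority     = "(?:${userinfo}\@)?${host}(?::${port})?";
    my $scheme        = '[a-zA-Z][a-zA-Z0-9\-.+]*';
    my $hier_part     = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";
    return "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";
}

# Compile once; \A and \z force the whole string to be a URI.
my $pattern = build_uri_regex();
my $uri_re  = qr/\A$pattern\z/;

# Hypothetical helper for whole-string validation.
sub is_uri {
    my ($text) = @_;
    return $text =~ $uri_re ? 1 : 0;
}

# Sample inputs chosen for illustration only.
for my $text (
    'http://www.example.com/path/index.html?x=1&y=2#top',
    'mailto:user@example.com',
    'urn:isbn:0451450523',
    'http://example.com/a b',     # space is not a URI character
    '//example.com/no-scheme',    # scheme is mandatory
    'http://[::1]/',              # IP literals were left out of host
) {
    printf "%-52s %s\n", $text, is_uri($text) ? 'valid' : 'invalid';
}
```

The last sample is rejected only because of the host simplification made earlier. It is a valid URI under the full specification.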
Completed Regular Expression
The following shows the complete URI regular expression, wrapped across 
several lines for display. A version with white space added can be found 
at the end of this article.
[a-zA-Z][a-zA-Z0-9\-.+]*:(?://(?:(?:[a-zA-Z0-9\-._~!$&'()*+,;
=:]|%[0-9A-F]{2})*@)?(?:[a-zA-Z0-9\-._~!$&'()*+,;=]|%[0-9A-F]
{2})*(?::[0-9]*)?(?:/(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-
F]{2})*)*|/(?:(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+
(?:/(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)?|(?:[a
-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+(?:/(?:[a-zA-Z0-9\-
._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)?(?:\?(?:[a-zA-Z0-9\-._~!$
&'()*+,;=:@/?]|%[0-9A-F]{2})*)?(?:#(?:[a-zA-Z0-9\-._~!$&'()*+
,;=:@/?]|%[0-9A-F]{2})*)?
Final Thoughts
If you are given a URI, this expression can be used to validate it and, 
with capturing groups added around the individual rules, to split it 
into its components.
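The expression as built contains only non-capturing groups, so one way to do the splitting is to wrap each rule in a named capture group. The sketch below is an assumption about how that might look. The group names, the split_uri helper and the sample URI are mine. It needs Perl 5.10 or later for named captures.

```perl
use 5.010;    # named captures and the // operator
use strict;
use warnings;

# Building blocks, following build_uri_regex above.
my $pct_encoded   = '%[0-9A-F]{2}';
my $pchar         = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@]|' . $pct_encoded . ')';
my $query         = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]|' . $pct_encoded . ')*';
my $fragment      = $query;
my $host          = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=]|' . $pct_encoded . ')*';
my $userinfo      = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:]|' . $pct_encoded . ')*';
my $port          = '[0-9]*';
my $scheme        = '[a-zA-Z][a-zA-Z0-9\-.+]*';
my $path_abempty  = "(?:/${pchar}*)*";
my $path_absolute = "/(?:${pchar}+(?:/${pchar}*)*)?";
my $path_rootless = "${pchar}+(?:/${pchar}*)*";

# Same structure as the full expression, with a named group around each part.
# The name "path" is reused in all three alternatives; only one can match.
my $uri = "(?<scheme>${scheme}):"
    . "(?://(?:(?<userinfo>${userinfo})\@)?(?<host>${host})(?::(?<port>${port}))?"
    . "(?<path>${path_abempty})|(?<path>${path_absolute})|(?<path>${path_rootless}))?"
    . "(?:\\?(?<query>${query}))?(?:#(?<fragment>${fragment}))?";

my @names = qw(scheme userinfo host port path query fragment);

# Hypothetical helper: returns a hash ref of the components that are present,
# or nothing if the text is not a URI.
sub split_uri {
    my ($text) = @_;
    return unless $text =~ /\A$uri\z/;
    my %parts;
    for my $name (@names) {
        $parts{$name} = $+{$name} if defined $+{$name};
    }
    return \%parts;
}

my $parts = split_uri('http://user@www.example.com:8080/a/b?x=1#frag');
for my $name (@names) {
    printf "%-9s %s\n", $name, $parts->{$name} // '(none)';
}
```

Components that are absent from the URI, such as the host in a mailto URI, are simply missing from the returned hash.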
However, if you try scanning large bodies of text with this regular 
expression, you will quickly discover that many non-URIs match it (e.g. 
"languages:"), because everything after the scheme and its colon is 
optional. As it stands, the expression is therefore not useful for 
finding all URIs within a block of text.
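The false positives are easy to reproduce. Everything after the colon is optional, so the mandatory part of the expression is just the scheme followed by ":". The sketch below searches an invented sentence with only that part. The variable $mandatory and the sample text are assumptions for illustration.

```perl
use strict;
use warnings;

# Only the mandatory part of the full expression: a scheme followed by ":".
my $scheme    = '[a-zA-Z][a-zA-Z0-9\-.+]*';
my $mandatory = "${scheme}:";

# Invented sample text containing one real URI.
my $text = 'Supported languages: Perl, Python. Note: see http://example.com/docs';

# Collect every place where a match of the full expression could begin.
my @hits = $text =~ /($mandatory)/g;
print "$_\n" for @hits;
```

Three candidates are found, "languages:", "Note:" and "http:", and only the last one begins a real URI.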
If you want to locate URIs within text, you can use this regular 
expression as a starting point and modify the rules to make it more 
restrictive. For example, the scheme could be restricted to HTTP and 
HTTPS as follows:
my $scheme = 'https?';
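Restricting the scheme alone still lets a bare "http:" match, so a search expression probably needs further tightening. The sketch below goes beyond the change above: it also requires "//" and a non-empty host, and it leaves out userinfo. Those extra restrictions, the variable names and the sample text are my assumptions, not rules from the specification.

```perl
use strict;
use warnings;

my $pct_encoded = '%[0-9A-F]{2}';
my $pchar = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@]|' . $pct_encoded . ')';
my $q_f   = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]|' . $pct_encoded . ')*';

# reg-name with + instead of *, so an empty host is rejected (assumption).
my $host = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=]|' . $pct_encoded . ')+';

# Restricted scheme, mandatory "//" and host, then port, path, query, fragment.
my $url = "https?://${host}(?::[0-9]*)?(?:/${pchar}*)*(?:\\?${q_f})?(?:#${q_f})?";

# Invented sample text.
my $text = 'See http://example.com/a?b=1 and https://example.org:8443/x#y '
         . 'but not languages: or ftp://example.net/';

my @urls = $text =~ /($url)/g;
print "$_\n" for @urls;
```

Characters such as "." and "," are valid in a URI, so punctuation that directly follows a URL in running text will be included in the match and may need to be trimmed afterwards.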
Completed Regular Expression (White-Space Added)
The following shows the complete URI regular expression with white 
space added. The names in braces on the right label the rule that each 
group implements and are not part of the expression:
[a-zA-Z][a-zA-Z0-9\-.+]*:			{scheme}
(?:
	//
	(?:					{authority}
		(?:				{userinfo}
			[a-zA-Z0-9\-._~!$&'()*+,;=:]
		|
			%[0-9A-F]{2}
		)*
		@
	)?
	(?:					{host}
		[a-zA-Z0-9\-._~!$&'()*+,;=]
	|
		%[0-9A-F]{2}
	)*
	(?:
		:
		[0-9]*				{port}
	)?
	(?:
		/
		(?:				{path-abempty}
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)*
	)*
|
	/
	(?:
		(?:				{path-absolute}
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)+
		(?:
			/
			(?:
				[a-zA-Z0-9\-._~!$&'()*+,;=:@]
			|
				%[0-9A-F]{2}
			)*
		)*
	)?
|
	(?:					{path-rootless}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@]
	|
		%[0-9A-F]{2}
	)+
	(?:
		/
		(?:
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)*
	)*
)?
(?:
	\?
	(?:					{query}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
	|
		%[0-9A-F]{2}
	)*
)?
(?:
	#
	(?:					{fragment}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
	|
		%[0-9A-F]{2}
	)*
)?
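To use a layout like this directly in Perl, the /x modifier allows white space and real comments inside a pattern. The sketch below does this for the query and fragment tail only, as an illustration and not a full conversion. The variable $tail and the capture groups are my additions. The "#" that introduces the fragment has to be escaped, because under /x an unescaped "#" starts a comment.

```perl
use strict;
use warnings;

my $f_q_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]';
my $pct_encoded = '%[0-9A-F]{2}';

# Query and fragment tail of the expression, laid out with /x.
my $tail = qr{
    (?:
        \?                                    # query delimiter
        ( (?: $f_q_char | $pct_encoded )* )   # query, captured as group 1
    )?
    (?:
        \#                                    # fragment delimiter, escaped
        ( (?: $f_q_char | $pct_encoded )* )   # fragment, captured as group 2
    )?
}x;

if ('?a=1&b=2#sec/2' =~ /\A$tail\z/) {
    print "query:    $1\n";
    print "fragment: $2\n";
}
```

The rest of the expression can be converted the same way, with each label in braces becoming a comment.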