Percent Encoding of Characters
2012-05-26
Introduction
Percent encoding is a method of representing characters that are otherwise prohibited in a string. It allows a string to carry characters that could not normally appear in it.
Percent encoding is most often seen in URLs (URIs) and the most commonly encoded character is a space. URLs are not allowed to contain the space character (ASCII character number 32, which is 0x20 in hexadecimal notation), so a space character gets written as '%20'. For example:
http://www.dispersiondesign.com/path containing spaces/
would be encoded as:
http://www.dispersiondesign.com/path%20containing%20spaces/
The following table shows some characters and their percent encoded equivalent:
Character | ASCII value | ASCII value (in hex) | Percent Encoded |
---|---|---|---|
(space) | 32 | 0x20 | %20 |
% | 37 | 0x25 | %25 |
& | 38 | 0x26 | %26 |
, | 44 | 0x2C | %2C |
. | 46 | 0x2E | %2E |
? | 63 | 0x3F | %3F |
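
The rows of this table can be reproduced with a few lines of JavaScript. This is only a minimal sketch: the function name `percentEncodeChar` is made up for illustration, and it assumes a single ASCII character as input.

```javascript
// Hypothetical helper: percent-encode one ASCII character as '%' followed
// by its two-digit, upper-case hexadecimal character code.
function percentEncodeChar(ch) {
    var code = ch.charCodeAt(0);
    if (code > 0x7f) {
        // Assumption of this sketch: only ASCII input is supported.
        throw new RangeError('This sketch only handles ASCII characters');
    }
    var hex = code.toString(16).toUpperCase();
    return '%' + (hex.length < 2 ? '0' + hex : hex);
}

// Print character, decimal value and percent encoded form for each table row.
[' ', '%', '&', ',', '.', '?'].forEach(function (ch) {
    console.log(JSON.stringify(ch), ch.charCodeAt(0), percentEncodeChar(ch));
});
```
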
Let’s see how to encode and decode percent encoding.
Decoding (Unescaping) Percent Encoding
In programming languages that support regular expressions, such as Perl, PHP and JavaScript, decoding a percent encoded string is a simple substitution operation. First we need a regular expression that locates valid percent encoded character sequences. In URLs, a percent encoded sequence starts with a '%' (percent) character, followed by exactly two characters that can be 0-9, a-f or A-F. In regular expression syntax, we can find two consecutive characters that are 0-9, a-f or A-F with:
[0-9a-fA-F]{2}
Finding these characters with a preceding '%' character is then simply:
%([0-9a-fA-F]{2})
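
To see what this expression does and does not match, here is a small sketch in JavaScript. The helper name `findPercentSequences` is hypothetical. Sequences with fewer than two hexadecimal digits, such as a trailing '%', are simply not matched.

```javascript
// The pattern from the text, with the global flag so every match is found.
var percentPattern = /%([0-9a-fA-F]{2})/g;

// Hypothetical helper: return every valid percent encoded sequence in str.
function findPercentSequences(str) {
    // With a global regex, match() returns all matches, or null if none.
    return str.match(percentPattern) || [];
}

console.log(findPercentSequences('path%20containing%20spaces%2F'));
console.log(findPercentSequences('100% of %zz and %4'));
```
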
The percent encoded value is a hexadecimal value, so it needs to be converted to a decimal value. In Perl, this is accomplished using the hex() function:
my $decimal = hex($1);
Then, the resulting decimal value needs to be converted to a character. The function in Perl for this is chr():
my $character = chr($decimal);
Putting this together, the unescaping (decoding) of percent encoding can be performed in Perl with a single line of code:
$str =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
JavaScript Solution
In JavaScript, the same thing can be performed with the parseInt() and fromCharCode() functions:
var regex = /%([0-9a-fA-F]{2})/g;
str = str.replace(regex, function (match, p1) {
    // p1 holds the two hex digits captured by the regex
    return String.fromCharCode(parseInt(p1, 16));
});
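However, JavaScript has a built-in function called unescape() that can perform the same task (it is now deprecated in favor of decodeURIComponent(), which also decodes multi-byte UTF-8 sequences):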
str = unescape(str);
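
One difference worth knowing: the regular expression approach leaves malformed sequences such as a lone '%' untouched, whereas the standard decodeURIComponent() function throws a URIError on them. The following sketch wraps decodeURIComponent() so that malformed input is returned unchanged. The name `safeDecode` and the fall-back behaviour are assumptions chosen for illustration, not the only reasonable design.

```javascript
// Hypothetical wrapper: decode a percent encoded string, returning the
// input unchanged if it contains a malformed percent sequence.
function safeDecode(str) {
    try {
        return decodeURIComponent(str);
    } catch (e) {
        // decodeURIComponent() throws URIError on malformed input.
        if (e instanceof URIError) {
            return str;
        }
        throw e;
    }
}

console.log(safeDecode('path%20containing%20spaces')); // path containing spaces
console.log(safeDecode('100%'));                       // 100%
```
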
Encoding (Escaping) with Percent Encoding
Creating a percent encoded string requires that the invalid characters first be defined. For example, if you wish to encode all characters that are not a-z, A-Z and 0-9, you would need a regular expression like the following:
[^0-9a-zA-Z]
Now, in Perl, these characters can be substituted using ord() to get the decimal ASCII value for the character and sprintf() to get the hexadecimal equivalent:
$str =~ s/([^0-9a-zA-Z])/sprintf("%%%02X", ord($1))/ge;
JavaScript Solution
In JavaScript, the solution can be written:
var regex = /[^0-9a-zA-Z]/g;
str = str.replace(regex, function (match) {
    var d = match.charCodeAt(0);
    // Pad to two hex digits; upper case matches the Perl output above
    return (d < 16 ? '%0' : '%') + d.toString(16).toUpperCase();
});
JavaScript also has a built-in function called escape() that will percent-encode a string. However, the escape() function does not give you any control over which characters are escaped.
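
If control over the escaped characters is needed, one option is a small helper that takes the set of characters to escape as a regular expression. The sketch below is one possible approach, not a standard API: the name `percentEncodeMatching` is hypothetical, it assumes the pattern carries the `g` and `u` flags, and it encodes each matched character as its UTF-8 bytes using the built-in TextEncoder so that non-ASCII characters also produce valid two-digit sequences.

```javascript
// Hypothetical helper: percent-encode every character of str matched by
// the `unsafe` pattern. Assumption: `unsafe` has the g and u flags, so all
// matches are replaced and surrogate pairs are matched as one character.
function percentEncodeMatching(str, unsafe) {
    var encoder = new TextEncoder(); // always encodes to UTF-8
    return str.replace(unsafe, function (match) {
        var out = '';
        encoder.encode(match).forEach(function (byte) {
            // Pad each byte to two upper-case hex digits.
            out += (byte < 16 ? '%0' : '%') + byte.toString(16).toUpperCase();
        });
        return out;
    });
}

// Escape everything except 0-9, a-z and A-Z, as in the example above.
console.log(percentEncodeMatching('a b&c', /[^0-9a-zA-Z]/gu)); // a%20b%26c
```

For comparison, the built-in encodeURIComponent() uses a fixed set: it leaves letters, digits and the characters - _ . ! ~ * ' ( ) unescaped and encodes everything else as UTF-8.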