Converting Text to HTML

2012-05-19

Introduction

While developing a web application, you might want to give a user the ability to submit blocks of text using a <textarea> form field. Unless you use a JavaScript HTML editor, the <textarea> field returns plain text to your program. If you try to display this text directly in an HTML page, the browser collapses each run of white-space, including line breaks, into a single space.

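For example, consider the following content:

Best Geek Websites:

1. http://www.slashdot.org/
2. http://www.thinkgeek.com/
3. http://www.engadget.com/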

When printed back unmodified, all the white-space, including line breaks, is displayed as single spaces:

Best Geek Websites: 1. http://www.slashdot.org/ 2. http://www.thinkgeek.com/ 3. http://www.engadget.com/

One solution is to wrap the content in <pre> tags before displaying it in the HTML page:

Best Geek Websites:

1. http://www.slashdot.org/
2. http://www.thinkgeek.com/
3. http://www.engadget.com/

However, if we format the text in our program first, we can add a variety of HTML tags. This allows us to add more advanced features like detecting URLs and email addresses and automatically adding links to them:

Best Geek Websites:

1. http://www.slashdot.org/
2. http://www.thinkgeek.com/
3. http://www.engadget.com/

Using the Perl programming language, I will demonstrate how to convert plain text to HTML.

Cleaning Up the Text

The first thing we will do is clean up the plain text a bit. To begin, we make sure that the text consistently uses line-feed (LF) characters, rather than carriage returns (CR) or CRLF pairs, to mark the end of a line:

# Convert CR or CRLF to LF
$str =~ s/\r\n?/\n/g;
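To see the effect, here is a minimal sketch that runs the substitution on a made-up string mixing all three line-ending styles. The sample text is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample mixing CRLF, CR and LF line endings
my $str = "one\r\ntwo\rthree\nfour";

# Convert CR or CRLF to LF
$str =~ s/\r\n?/\n/g;

# Every line now ends in a plain LF, so splitting on "\n" is reliable
my @lines = split /\n/, $str;
print scalar(@lines), "\n";   # 4
```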

Now we will remove any redundant white-space at the end of lines:

# Remove whitespace at the end of lines
$str =~ s/[ \t]+$//gm;

Note that the 'm' modifier at the end of the regular expression switches to multi-line mode, so that '$' matches at the end of every line and not only at the end of the entire block of text.
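The difference is easy to check by running the same substitution with and without 'm'. This is only a sketch; the sample text and the variable names $without_m and $with_m are made up for the comparison.

```perl
use strict;
use warnings;

# Hypothetical sample: every line ends in stray spaces or a tab
my $text = "first  \nsecond\t\nthird ";

# Without 'm', '$' matches only at the very end of the text,
# so only the last line is trimmed
(my $without_m = $text) =~ s/[ \t]+$//g;

# With 'm', '$' also matches at the end of every line
(my $with_m = $text) =~ s/[ \t]+$//gm;

print length($without_m), "\n";   # 21: one trailing space removed
print length($with_m), "\n";      # 18: all four whitespace characters removed
```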

We can also remove any empty/blank lines at the beginning or end of the text:

# Remove initial empty lines
$str =~ s/^\n+//s;

# Remove trailing empty lines
$str =~ s/\n+$//s;

Because these two expressions do not use the 'm' modifier, '^' matches only at the beginning of the block of text and '$' only at its end, not at every line. The 's' modifier affects only what '.' matches (it lets '.' match a newline), so it changes nothing here and can safely be omitted.
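As a quick check, the sketch below applies both substitutions to a made-up string that has blank lines at the start, in the middle and at the end. Only the outer ones should disappear.

```perl
use strict;
use warnings;

# Hypothetical sample with blank lines before, between and after the text
my $str = "\n\nfirst paragraph\n\nsecond paragraph\n\n\n";

# No 'm' modifier here, so '^' and '$' anchor to the whole string
$str =~ s/^\n+//s;   # Remove initial empty lines
$str =~ s/\n+$//s;   # Remove trailing empty lines

# The blank line between the two paragraphs is still there
print $str, "\n";
```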

The characters '&', '<' and '>' are special characters in HTML, so we need to convert any instances of these characters to their equivalent HTML entities:

# Convert HTML special characters to entities
$str =~ s/&/&amp;/g;
$str =~ s/</&lt;/g;
$str =~ s/>/&gt;/g;
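The order of these three substitutions matters: the ampersand has to be handled first. The sketch below compares the order used above with a reversed order on a made-up input; the variable names $good and $bad are only for illustration.

```perl
use strict;
use warnings;

my $input = 'if (a < b && b > c)';

# Ampersand first, as in the code above
(my $good = $input) =~ s/&/&amp;/g;
$good =~ s/</&lt;/g;
$good =~ s/>/&gt;/g;

# Ampersand last: the '&' of the entities just created is escaped again
(my $bad = $input) =~ s/</&lt;/g;
$bad =~ s/>/&gt;/g;
$bad =~ s/&/&amp;/g;

print "$good\n";   # if (a &lt; b &amp;&amp; b &gt; c)
print "$bad\n";    # if (a &amp;lt; b &amp;&amp; b &amp;gt; c)
```

With the ampersand last, a browser would show the literal text "&lt;" instead of "<".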

Add Some Hyperlinks

By detecting any URLs within the text, we can convert the plain text URLs to HTML links by adding <a> tags around them.

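First we need a regular expression (regex) to detect URLs. Numerous options exist for this; the one below borrows its character classes (unreserved, sub-delims, pchar) from the URI syntax in RFC 3986 and covers most http, https and ftp URLs. Note that its $tld pattern accepts only two to six letters, so widen that range if you need to match longer top-level domains: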

# Build a regular expression for detecting URLs
my $sub_delims = '!\$&\'\(\)\*\+,;=';
my $unreserved = '-a-zA-Z0-9\._~';
my $pchar = "${unreserved}${sub_delims}\%:\@";

my $scheme = '(?:https?|ftp)';
my $tld = '[a-zA-Z]{2,6}';
my $subdomain = '[a-zA-Z0-9](?:[-a-zA-Z0-9]{0,62}[a-zA-Z0-9])?';
my $domain = "(?:$subdomain\\.)+$tld";

my $port = '[0-9]*';
my $authority = "$domain(?::$port)?";

my $path = "(?:/[${pchar}]*)*";
my $query = "[${pchar}/\\?]*";
my $fragment = "[${pchar}/\\?]*";

my $url_ex = "$scheme://$authority$path(?:\\?$query)?(?:#$fragment)?";

Now we can use this regular expression to locate URLs within our text and add HTML links:

# Add hyperlinks to URLs
$str =~ s/($url_ex)/<a href="$1">$1<\/a>/g;
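Here is a self-contained sketch that rebuilds the same expression and applies it to a made-up sentence. The sample text and the example.org host are assumptions for illustration.

```perl
use strict;
use warnings;

# Same building blocks as above
my $sub_delims = '!\$&\'\(\)\*\+,;=';
my $unreserved = '-a-zA-Z0-9\._~';
my $pchar = "${unreserved}${sub_delims}\%:\@";
my $scheme = '(?:https?|ftp)';
my $tld = '[a-zA-Z]{2,6}';
my $subdomain = '[a-zA-Z0-9](?:[-a-zA-Z0-9]{0,62}[a-zA-Z0-9])?';
my $domain = "(?:$subdomain\\.)+$tld";
my $port = '[0-9]*';
my $authority = "$domain(?::$port)?";
my $path = "(?:/[${pchar}]*)*";
my $query = "[${pchar}/\\?]*";
my $fragment = "[${pchar}/\\?]*";
my $url_ex = "$scheme://$authority$path(?:\\?$query)?(?:#$fragment)?";

# Hypothetical input containing an http and an ftp URL
my $str = 'See http://www.slashdot.org/ and ftp://ftp.example.org/pub/file.txt for details';

# Add hyperlinks to URLs
$str =~ s/($url_ex)/<a href="$1">$1<\/a>/g;

print "$str\n";
```

One thing to keep in mind is that '.' and ',' are legal path characters, so a URL with a path that sits right before sentence punctuation will pull that punctuation into the link.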

If we also want to detect URLs that are missing the protocol (i.e. no http:// at the beginning), we can do the following:

# Add hyperlinks to URLs with missing protocol
my $www_url_ex = "www\\.$authority$path(?:\\?$query)?(?:#$fragment)?";
$str =~ s/(?<![-a-zA-Z0-9\/\.])($www_url_ex)/<a href="http:\/\/$1">$1<\/a>/g;

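This regular expression uses a ‘zero-width negative look-behind assertion’ (?<!pattern). Try saying that fast ten times! Simply put, a match is accepted only if the character immediately before 'www.' is not a letter, digit, hyphen, slash or dot. That keeps us from linking a second time the URLs we have just processed (where 'www.' follows '//') and from matching in the middle of a longer host name.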
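To show what the look-behind buys us, the sketch below runs the substitution with and without it on text that already contains one finished link. To keep it short it uses a simplified stand-in for $www_url_ex, which is an assumption and not the full expression built earlier; $naive and $guarded are illustrative names.

```perl
use strict;
use warnings;

# Simplified stand-in for $www_url_ex (an assumption to keep this short):
# "www.", a host name, and an optional path
my $www_url_ex = 'www\.[-a-zA-Z0-9.]+\.[a-zA-Z]{2,6}(?:/[-a-zA-Z0-9._~/]*)?';

# One URL already linked by the previous step, one bare "www." URL
my $str = '<a href="http://www.slashdot.org/">http://www.slashdot.org/</a> or www.thinkgeek.com/';

# Without the look-behind, "www." inside the existing link matches again
(my $naive = $str) =~ s/($www_url_ex)/<a href="http:\/\/$1">$1<\/a>/g;

# With the look-behind, any "www." that follows a "/" is skipped
(my $guarded = $str) =~ s/(?<![-a-zA-Z0-9\/\.])($www_url_ex)/<a href="http:\/\/$1">$1<\/a>/g;

# Count the opening <a> tags in each result
my $naive_count = () = $naive =~ /<a /g;
my $guarded_count = () = $guarded =~ /<a /g;
print "$naive_count $guarded_count\n";   # 4 2
```

The unguarded version nests new links inside the existing one, which produces broken HTML.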

Breaking the Text Into Paragraphs

We will assume that paragraphs are separated by empty lines. In other words, we can break up the text into paragraphs by splitting it wherever two or more consecutive newline characters occur:

# Locate paragraphs
my @paras = split(/\n{2,}/, $str);

Now we will iterate through each paragraph. Any remaining newline characters need to have <br /> tags added and we need to add <p> paragraph tags around each paragraph:

foreach(@paras)
{
	$_ =~ s/\n/<br \/>\n/sg;
	$_ = "<p>" . $_ . "</p>\n";
}

Finally, we join all the paragraphs back together to form the completed HTML formatted string:

my $html_str = join("\n", @paras);
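Putting the three paragraph steps together on a made-up input gives a feel for the output. The sample text below is an assumption; it has one line break inside the first paragraph and a different number of blank lines between paragraphs.

```perl
use strict;
use warnings;

# Hypothetical input: three paragraphs, the first with an internal line break
my $str = "Line one\nLine two\n\nSecond paragraph\n\n\nThird paragraph";

# Locate paragraphs
my @paras = split(/\n{2,}/, $str);

foreach (@paras)
{
	# Keep single line breaks visible, then wrap the paragraph
	$_ =~ s/\n/<br \/>\n/g;
	$_ = "<p>" . $_ . "</p>\n";
}

my $html_str = join("\n", @paras);
print $html_str;
```

Both the double and the triple newline act as a single paragraph break, so the result contains exactly three <p> elements.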

Putting It Together

All this code can be combined into a single function:

sub text2html
{
	my($str) = @_;

	# Convert CR or CRLF to LF
	$str =~ s/\r\n?/\n/g;

	# Remove whitespace at the end of lines
	$str =~ s/[ \t]+$//gm;

	# Remove initial empty lines
	$str =~ s/^\n+//s;

	# Remove trailing empty lines
	$str =~ s/\n+$//s;

	# Convert HTML special characters to entities
	$str =~ s/&/&amp;/g;
	$str =~ s/</&lt;/g;
	$str =~ s/>/&gt;/g;

	# Build a regular expression for detecting URLs
	my $sub_delims = '!\$&\'\(\)\*\+,;=';
	my $unreserved = '-a-zA-Z0-9\._~';
	my $pchar = "${unreserved}${sub_delims}\%:\@";

	my $scheme = '(?:https?|ftp)';
	my $tld = '[a-zA-Z]{2,6}';
	my $subdomain = '[a-zA-Z0-9](?:[-a-zA-Z0-9]{0,62}[a-zA-Z0-9])?';
	my $domain = "(?:$subdomain\\.)+$tld";

	my $port = '[0-9]*';
	my $authority = "$domain(?::$port)?";

	my $path = "(?:/[${pchar}]*)*";
	my $query = "[${pchar}/\\?]*";
	my $fragment = "[${pchar}/\\?]*";

	my $url_ex = "$scheme://$authority$path(?:\\?$query)?(?:#$fragment)?";

	# Add hyperlinks to URLs
	$str =~ s/($url_ex)/<a href="$1">$1<\/a>/g;

	# Add hyperlinks to URLs with missing protocol
	my $www_url_ex = "www\\.$authority$path(?:\\?$query)?(?:#$fragment)?";
	$str =~ s/(?<![-a-zA-Z0-9\/\.])($www_url_ex)/<a href="http:\/\/$1">$1<\/a>/g;

	# Locate paragraphs
	my @paras = split(/\n{2,}/, $str);

	foreach(@paras)
	{
		$_ =~ s/\n/<br \/>\n/sg;
		$_ = "<p>" . $_ . "</p>\n";
	}

	return join("\n", @paras);
}