
Perl Cookbook

Chapter 20: Web Automation

20.4. Converting ASCII to HTML

Problem

You want to convert ASCII text to HTML.

Solution

Use the simple little encoding filter in Example 20.3.

Example 20.3: text2html

#!/usr/bin/perl -w -p00
# text2html - trivial html encoding of normal text
# -p means apply this script to each record.
# -00 means that a record is now a paragraph

use HTML::Entities;
$_ = encode_entities($_, "\200-\377");

if (/^\s/) {
    # Paragraphs beginning with whitespace are wrapped in <PRE>
    s{(.*)$}        {<PRE>\n$1</PRE>\n}s;           # indented verbatim
} else {
    s{^(>.*)}       {$1<BR>}gm;                     # quoted text
    s{<URL:(.*?)>}  {<A HREF="$1">$1</A>}gs         # embedded URL  (good)
                    ||
    s{(http:\S+)}   {<A HREF="$1">$1</A>}gs;        # guessed URL   (bad)
    s{\*(\S+)\*}    {<STRONG>$1</STRONG>}g;         # this is *bold* here
    s{\b_(\S+)\_\b} {<EM>$1</EM>}g;                 # this is _italics_ here
    s{^}            {<P>\n};                        # add paragraph tag
}
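To see what the substitutions in text2html do to a single paragraph without running the whole filter, the same rules can be wrapped in a subroutine and called on a string. The following is only a sketch: markup_paragraph is a hypothetical name, not part of the original program, and the entity-encoding step is left out so that the sketch needs nothing beyond core Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# markup_paragraph is a hypothetical helper, not part of text2html itself.
# It applies the same substitutions to one paragraph passed in as a string.
# Assumption: entity encoding is skipped here to avoid a CPAN dependency.
sub markup_paragraph {
    my ($para) = @_;
    if ($para =~ /^\s/) {
        # Paragraph starts with whitespace: treat it as verbatim text
        $para =~ s{(.*)$}{<PRE>\n$1</PRE>\n}s;
    } else {
        $para =~ s{^(>.*)}{$1<BR>}gm;                       # quoted text
        $para =~ s{<URL:(.*?)>}{<A HREF="$1">$1</A>}gs      # embedded URL
            || $para =~ s{(http:\S+)}{<A HREF="$1">$1</A>}gs;  # guessed URL
        $para =~ s{\*(\S+)\*}{<STRONG>$1</STRONG>}g;        # *bold*
        $para =~ s{\b_(\S+)\_\b}{<EM>$1</EM>}g;             # _italics_
        $para =~ s{^}{<P>\n};                               # paragraph tag
    }
    return $para;
}

print markup_paragraph("This is *bold* and _italic_ text.\n\n");
print markup_paragraph("    an indented, verbatim line\n\n");
```

The first call takes the else branch and returns the paragraph with a leading <P> tag and the starred and underscored words wrapped in STRONG and EM. The second starts with whitespace, so it is wrapped in PRE and otherwise left alone.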

Discussion

Converting arbitrary plain text to HTML has no general solution because there are too many different, conflicting ways of representing formatting information in a plain text file. The more you know about the input, the better the job you can do of formatting it.

For example, if you knew that you would be fed a mail message, you could add this block to format the mail headers:

BEGIN {
    print "<TABLE>";
    $_ = encode_entities(scalar <>);
    s/\n\s+/ /g;  # continuation lines
    while ( /^(\S+?:)\s*(.*)$/gm ) {                # parse heading
        print "<TR><TH ALIGN='LEFT'>$1</TH><TD>$2</TD></TR>\n";
    }
    print "</TABLE><HR>";
}
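The header handling can also be tried on its own, outside the filter. The sketch below assumes the header paragraph is already in a string. The names headers_to_table and escape_html are hypothetical, and escape_html is a deliberately minimal stand-in for encode_entities that handles only &, < and >.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# escape_html is a hypothetical stand-in for encode_entities; it escapes
# only the three characters that are unsafe in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# headers_to_table is a hypothetical helper: it takes the header paragraph
# of a mail message as one string and returns the table markup as a string.
sub headers_to_table {
    my ($headers) = @_;
    $headers = escape_html($headers);
    $headers =~ s/\n\s+/ /g;                      # join continuation lines
    my $html = "<TABLE>";
    while ($headers =~ /^(\S+?:)\s*(.*)$/gm) {    # one row per header field
        $html .= "<TR><TH ALIGN='LEFT'>$1</TH><TD>$2</TD></TR>\n";
    }
    return $html . "</TABLE><HR>";
}

print headers_to_table(
    "From: Ann <ann\@example.com>\nSubject: a long\n  subject line\n\n"
);
```

Escaping comes before the row is built because header values such as addresses often contain angle brackets, which a browser would otherwise read as tags. The indented second line of the Subject field is folded into the first, so the message yields two table rows.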

See Also

The documentation for the CPAN module HTML::Entities

