Linux.com

Strip HTML fails

Posted by: Anonymous Coward on August 09, 2005 01:37 AM
The "strip HTML" command in the article is unnecessarily cluttered with backslashes. The author suggests:


      sed s/\<[^\<]*\>//g

He put the backslashes (\) in there to keep the shell from interpreting < and > as redirection operators. It would have been better to wrap the sed command in 'single quotes', which protects those characters just as well, like this:


      sed 's/<[^<]*>//g'
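As a quick check of the quoting point, the sketch below prints the argument sed would receive in each form and then runs the quoted version on a sample line. The function name strip_html is made up for illustration, and the sketch assumes a POSIX-style shell such as sh or bash.

```shell
# Show the script text that sed actually receives in each case.
# Note the unquoted form is also exposed to filename globbing ([...] and *),
# which is one more reason to prefer the quotes.
printf '%s\n' s/\<[^\<]*\>//g      # backslash-escaped, as in the article
printf '%s\n' 's/<[^<]*>//g'       # single-quoted

# Hypothetical helper wrapping the single-quoted command.
strip_html() {
  sed 's/<[^<]*>//g'
}

printf '%s\n' '<b>here</b>' | strip_html
```

Both printf lines should show the same text, so the shell hands sed an identical script either way; only the readability differs.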

Much easier to read. However, the bracket expression [^<] is also a problem, because when it meets input like this:


      <b>here</b>

the first two elements of the search pattern ("<[^<]*") will match the string "<b>here", and then the third element (">") will kick in. Finding "<" rather than ">" at that point, the engine will have to backtrack to reach "<b>". To optimize the script, it is better to exclude the other angle bracket:


      sed 's/<[^>]*>//g'
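To compare the two patterns side by side, here is a small sketch. The names strip_old and strip_new are hypothetical. The second sample line, with a stray ">" in the text, is my own test input and not something from the article; it suggests the difference can affect the result as well as the speed.

```shell
# Hypothetical wrappers for the two patterns being compared.
strip_old() { sed 's/<[^<]*>//g'; }   # excludes "<" inside the tag
strip_new() { sed 's/<[^>]*>//g'; }   # excludes ">" inside the tag

# Simple case: both patterns leave just the text.
printf '%s\n' '<b>here</b>' | strip_old
printf '%s\n' '<b>here</b>' | strip_new

# A literal ">" in the text: [^<]* is free to run past the end of the
# opening tag, so the old pattern removes part of the text as well.
printf '%s\n' '<b>1 > 0</b>' | strip_old
printf '%s\n' '<b>1 > 0</b>' | strip_new
```

With the ">"-excluding pattern, each match stops at the first closing bracket, so there is nothing to back out of and nothing extra to swallow.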

Finally, because sed processes its input one line at a time, this script doesn't work at all in cases where the HTML tags cross line boundaries:


      <a class="first"
      href="http://gnu.org">
      like this one here
      </a>

or in those perverse cases where an HTML tag legally contains embedded angle brackets:


      <img src="x.jpg" alt="<<like this>>" >
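Both limitations can be seen in a short sketch. For tags that span lines, one common sed idiom is to keep appending the next line (N) while an unclosed "<" remains in the pattern space; the function name strip_multiline and the label a are illustrative choices. Treat this as a sketch rather than a robust HTML stripper: a stray "<" in ordinary text will make it pull in following lines, and it does nothing for angle brackets embedded in attribute values.

```shell
# Hypothetical helper: strip tags, joining lines while a "<" is still open.
strip_multiline() {
  sed -e ':a' -e 's/<[^>]*>//g' -e '/</{' -e 'N' -e 'ba' -e '}'
}

# Tag split across two lines: the opening tag is removed once the lines are
# joined. Blank lines are left behind where only tags stood.
printf '%s\n' '<a class="first"' 'href="http://gnu.org">' \
  'like this one here' '</a>' | strip_multiline

# Embedded angle brackets still defeat the pattern: the match ends at the
# first ">" inside the alt text, leaving debris on the line.
printf '%s\n' '<img src="x.jpg" alt="<<like this>>" >' | sed 's/<[^>]*>//g'
```

The last command leaves the tail of the tag in the output, which is why a real HTML parser is the safer choice for anything beyond simple markup.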

For more help on using sed, visit http://sed.sourceforge.net/

—Eric Pement

Return to CLI Magic: Regular expressions and metacharacters