Strip HTML fails
Posted by: Anonymous Coward on August 09, 2005 01:37 AM
The "strip HTML" command is unnecessarily cluttered with backslashes. He suggests:
sed s/\<[^\<]*\>//g
He put the backslashes (\) in there to prevent the shell from interpreting the redirection operators, < and >. It would have been better to wrap the sed command in 'single quotes' to prevent the shell from using them, like this:
sed 's/<[^<]*>//g'
Much easier to read. However, the character class [^<] is also a problem, because when it meets input like this:
<b>here</b>
the first two elements of the search pattern ("<[^<]*") will match the string "<b>here", and only then does the third element (">") come into play. Not finding a > at that position, the engine has to backtrack until it reaches "<b>". To optimize the script, it is better to exclude the closing angle bracket instead:
sed 's/<[^>]*>//g'
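The points above can be checked in a shell. This is only a rough sketch: the variable names and the second sample string (text containing a literal >) are my own additions, not from the article. That second sample suggests [^<] can be wrong as well as slow, since the longest match may run on to a later > and take ordinary text with it.

```shell
#!/bin/sh
# The shell strips the backslashes, so both spellings hand sed the same script.
# (Unquoted, the word is also subject to filename globbing: one more reason to quote.)
escaped=$(printf '%s' s/\<[^\<]*\>//g)
quoted=$(printf '%s' 's/<[^<]*>//g')
[ "$escaped" = "$quoted" ] && echo "same script: $quoted"

# On the sample from the comment, both character classes give the same result.
simple='<b>here</b>'
out_lt=$(printf '%s\n' "$simple" | sed 's/<[^<]*>//g')
out_gt=$(printf '%s\n' "$simple" | sed 's/<[^>]*>//g')
echo "[^<] gives: $out_lt"
echo "[^>] gives: $out_gt"

# Hypothetical input (not from the article) with a literal > in the text.
stray='<b>2 > 1</b>'
stray_lt=$(printf '%s\n' "$stray" | sed 's/<[^<]*>//g')   # match runs on to the later >
stray_gt=$(printf '%s\n' "$stray" | sed 's/<[^>]*>//g')   # match stops at the first >
echo "[^<] gives: $stray_lt"
echo "[^>] gives: $stray_gt"
```

With [^<] the first match on the last sample is "<b>2 >", so part of the text is deleted along with the tag; with [^>] only the tags go.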
Finally, because sed reads its input one line at a time, this script doesn't work at all in cases where the HTML tags cross line boundaries:
<a class="first"
href="http://gnu.org">
like this one here
</a>
or in those perverse cases where an HTML tag legally contains embedded angle brackets:
<img src="x.jpg" alt="<<like this>>" >
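Both failure cases are easy to reproduce. The sketch below also tries one possible workaround for the multi-line case, which is my own assumption and not something the article proposes: it slurps the whole input into the pattern space with the common ':a;N;$!ba' loop before substituting. The variable names are made up for illustration. Slurping does nothing for the embedded-bracket case, which probably calls for a real HTML parser rather than a regex.

```shell
#!/bin/sh
tag_re='s/<[^>]*>//g'

multi='<a class="first"
href="http://gnu.org">
like this one here
</a>'

# Line by line: the opening tag is never matched, only </a> is removed.
per_line=$(printf '%s\n' "$multi" | sed "$tag_re")

# Assumed workaround: join all lines into the pattern space first,
# so that [^>]* can run across the embedded newlines.
slurped=$(printf '%s\n' "$multi" | sed -e ':a' -e 'N' -e '$!ba' -e "$tag_re")

# Embedded angle brackets: the match ends at the first >, leaving debris.
img='<img src="x.jpg" alt="<<like this>>" >'
img_out=$(printf '%s\n' "$img" | sed "$tag_re")

printf 'per line:\n%s\n\n' "$per_line"
printf 'slurped:\n%s\n\n' "$slurped"
printf 'img:\n%s\n' "$img_out"
```

The slurped run leaves only the link text (plus blank lines where the tags were), at the cost of holding the whole file in memory.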
For more help on using sed, visit http://sed.sourceforge.net/
—Eric Pement