Posted by: Anonymous Coward
on August 04, 2005 05:43 AM
Both versions are "right" but both versions are "wrong". They work fine for normal HTML, but aren't robust when you get malformed HTML such as:
<a <b <c d> or <a >b >c d<
Although I haven't tested either, I think that both versions will leave some of the < or > characters lying around. A sufficiently skilled attacker might actually be able to slip some HTML through the HTML stripper code, by crafting sufficiently malformed HTML.
If you want to be absolutely sure that you've removed all the tags, after you run the regexp above, strip out all of the remaining > and < characters just to be safe. There shouldn't be any raw < or > characters that aren't part of a tag in HTML anyway -- they should have been converted to &gt; or &lt;.
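Here is a rough Python sketch of that two-step cleanup. The regexp under discussion isn't reproduced in this comment, so TAG_RE below is only an assumed stand-in for a typical tag-stripping pattern, and strip_tags is a made-up name for illustration.

```python
import re

# Assumed stand-in for "the regexp above"; the real pattern is not shown here.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(text: str) -> str:
    """Remove anything tag-shaped, then drop any stray angle brackets."""
    without_tags = TAG_RE.sub("", text)
    # Second pass: whatever < or > survived the regexp is removed outright.
    return without_tags.replace("<", "").replace(">", "")

# The two malformed samples from this comment: regexp alone vs. both passes.
for sample in ("<a <b <c d>", "<a >b >c d<"):
    print(repr(TAG_RE.sub("", sample)), "->", repr(strip_tags(sample)))
```

With this stand-in pattern the second sample still has angle brackets in it after the regexp pass, and the extra replace step clears them out. A different regexp may leave different leftovers, but the second pass doesn't care which ones survive.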
Re:Little Mistake.
-drane