Programmer Notes: [C#]Remove HTML tags from string

Wednesday, October 3, 2012

[C#]Remove HTML tags from string

As discussed in the comments, the approach below is wrong, because the regex has no notion of CDATA sections or HTML comments:


// Naive approach: removes everything from each '<' up to the next '>'.
System.Text.RegularExpressions.Regex regHtml = new System.Text.RegularExpressions.Regex("<[^>]*>");
string s = regHtml.Replace(InputString, "");
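
To make the failure concrete, here is a minimal sketch that runs the same pattern over two small inputs of the kind raised in the comments: an HTML comment and a CDATA section, each containing a `>`. The class name `NaiveStripDemo` and the sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveStripDemo
{
    // Same pattern as the snippet above: '<', anything that is not '>', then '>'.
    private static readonly Regex TagPattern = new Regex("<[^>]*>");

    public static string Strip(string input)
    {
        return TagPattern.Replace(input, "");
    }

    public static void Main()
    {
        // The '>' inside the comment ends the match early, so part of the comment leaks.
        Console.WriteLine(Strip("<p>a<!-- x > y -->b</p>"));   // prints "a y -->b"

        // Same problem with a CDATA section, which is really closed by "]]>".
        Console.WriteLine(Strip("<![CDATA[ 1 > 0 ]]>"));        // prints " 0 ]]>"
    }
}
```

In both cases the regex stops at the first `>` it sees, not at the real end of the construct (`-->` or `]]>`), so leftover fragments end up in the "plain text".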

The best option here is to use a proper parser to extract the text from the HTML. In the past I've worked with Html Agility Pack, so you could give it a try! I will probably post a small snippet of code showing how to use it in the morning!
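
A rough sketch of the parser-based approach, assuming the `HtmlAgilityPack` NuGet package is referenced in the project. The class and method names (`HtmlText`, `ToPlainText`) are hypothetical, and the choice to drop comments and `script`/`style` elements before reading the text is an assumption about what "only text" should mean, not something the library requires.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlText
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect comments and script/style elements first, then remove them,
        // so the tree is not modified while it is being enumerated.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment
                     || n.Name == "script"
                     || n.Name == "style")
            .ToList();
        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

It could be called as `string s = HtmlText.ToPlainText(InputString);`, in place of the `regHtml.Replace` call. Because the parser builds a node tree, a `>` inside a comment no longer ends a "tag" early the way it does with the regex.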

Good night for now!

5 comments:

Unknown, October 3, 2012 at 5:50 PM
Parsing HTML with regex. Nope.

http://stackoverflow.com/a/1732454/285944

Немања Борић, October 3, 2012 at 5:58 PM
Well, I am not trying to parse HTML - I'm just removing HTML tags, leaving only the text behind. I think this is an appropriate technique to apply here.

Немања Борић, October 3, 2012 at 5:58 PM
And thank you for the comment :)

Unknown, October 3, 2012 at 7:39 PM
Right, but you're going to get stuck. For instance, your regex has no clue about CDATA, or HTML comments.

< ![CDATA[ This is a CDATA tag > BUT AHA! Your regex only looks for the closing >, and doesn't know that < ![CDATA[ tags are closed by something else, and not just the simple > I have escaped your naive regex filtering!!!!!! Oh, and that < CDATA ...> thing above looks like a new tag, and was deleted. Your script should have ignored this whole CDATA tag, but allows vast portions through ]]>

Немања Борић, October 3, 2012 at 7:48 PM
Well, thank you, I missed that! I will update the blog post now.