Wednesday, October 3, 2012

[C#] Remove HTML tags from a string

As discussed in the comments, this approach is wrong:
    System.Text.RegularExpressions.Regex regHtml = new System.Text.RegularExpressions.Regex("<[^>]*>");
    string s = regHtml.Replace(InputString, "");
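To make the problem concrete, here is a minimal sketch that runs the same pattern against an HTML comment and a CDATA section. The class and method names (StripTagsDemo, Strip) are made up for illustration; only the pattern comes from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripTagsDemo
{
    // Same pattern as above: "<", then anything that is not ">", then ">".
    static readonly Regex RegHtml = new Regex("<[^>]*>");

    // Hypothetical helper wrapping the regex-only approach.
    public static string Strip(string input)
    {
        return RegHtml.Replace(input, "");
    }

    static void Main()
    {
        // Plain markup: the regex does what you would hope.
        Console.WriteLine(Strip("<p>Hello <b>world</b></p>"));  // prints "Hello world"

        // A ">" inside a comment ends the match early,
        // so the rest of the comment leaks into the output.
        Console.WriteLine(Strip("<!-- a > b -->visible"));      // prints " b -->visible"

        // Same story for CDATA, which is closed by "]]>", not by the first ">".
        Console.WriteLine(Strip("<![CDATA[ x > y ]]>text"));    // prints " y ]]>text"
    }
}
```

The pattern has no notion of what kind of construct it is inside, so any ">" that appears before the real terminator cuts the match short.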
The best option here is to use a proper parser to extract the text from the HTML. In the past I've worked with Html Agility Pack, so you could give it a try! I will probably post a small snippet of code to show you how to use it in the morning!
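In the meantime, here is a rough sketch of what the parser-based version might look like. It assumes the HtmlAgilityPack NuGet package is referenced; the HtmlText class and ToPlainText method are names I made up for this example, and dropping script, style and comment nodes is my own choice rather than anything the library requires.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class HtmlText
{
    // Hypothetical helper: parse the markup, drop nodes that carry no
    // visible text, and return the remaining text content.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (unwanted != null)
        {
            // Copy to a list first so removal does not disturb the iteration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // InnerText concatenates the text of all remaining nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Calling HtmlText.ToPlainText(InputString) would then take the place of the Regex.Replace call. Whitespace is left as it appears in the source, so you may still want to trim or collapse it afterwards.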

Good night for now!

5 comments:

  1. Parsing HTML with regex. Nope.


    http://stackoverflow.com/a/1732454/285944

  2. Well, I am not trying to parse HTML. I'm just removing the HTML tags and leaving only the text behind. I think this is an appropriate technique to apply here.

  3. Right, but you're going to get stuck. For instance, your regex has no clue about CDATA, or HTML comments.

    < ![CDATA[ This is a CDATA tag > BUT AHA! Your regex only looks for the closing >, and doesn't know that < ![CDATA[ tags are closed by something else, and not just the simple > I have escaped your naive regex filtering!!!!!! Oh, and that < CDATA ...> thing above looks like a new tag, and was deleted. Your script should have ignored this whole CDATA tag, but allows vast portions through ]]>

  4. Well, thank you, I missed that! I will update the blog post now.
