The best option here is to use a proper parser to extract the text from HTML. In the past I've worked with Html Agility Pack, so you could give it a try! I will probably post a small snippet of code to show you how to use it in the morning!
System.Text.RegularExpressions.Regex regHtml = new System.Text.RegularExpressions.Regex("<[^>]*>");
string s = regHtml.Replace(InputString,"");
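For anyone who wants to try it quickly, here is a minimal self-contained sketch of the same approach. The class name `TagStripper`, the method name `StripTags`, and the sample input are made up for illustration; only the pattern comes from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Same pattern as above: "<", then any run of characters other than ">", then ">".
    private static readonly Regex TagPattern = new Regex("<[^>]*>");

    // Removes everything that looks like a tag and keeps the text in between.
    public static string StripTags(string input)
    {
        return TagPattern.Replace(input, "");
    }

    public static void Main()
    {
        string html = "<p>Hello, <b>world</b>!</p>";
        Console.WriteLine(StripTags(html)); // prints "Hello, world!"
    }
}
```

Compiling the pattern once into a static field avoids rebuilding the `Regex` on every call, which matters if you strip many strings.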
Good night for now!
Parsing HTML with regex. Nope.
http://stackoverflow.com/a/1732454/285944
Well, I am not trying to parse HTML - I'm just removing HTML tags and leaving only the text behind. I think this is an appropriate technique to apply here.
And thank you for the comment :)
Right, but you're going to get stuck. For instance, your regex has no clue about CDATA, or HTML comments.
< ![CDATA[ This is a CDATA tag > BUT AHA! Your regex only looks for the closing >, and doesn't know that < ![CDATA[ tags are closed by something else, and not just the simple > I have escaped your naive regex filtering!!!!!! Oh, and that < CDATA ...> thing above looks like a new tag, and was deleted. Your script should have ignored this whole CDATA tag, but allows vast portions through ]]>
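To make the point concrete, here is a small sketch of what the pattern from the post does with a comment and a CDATA section that contain a `>`. The class name `RegexPitfalls`, the method name `StripTags`, and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfalls
{
    // The pattern from the post: it treats the first ">" after "<" as the end of the tag.
    private static readonly Regex TagPattern = new Regex("<[^>]*>");

    public static string StripTags(string input)
    {
        return TagPattern.Replace(input, "");
    }

    public static void Main()
    {
        // The match stops at the first ">", which sits inside the comment body,
        // so the rest of the comment leaks into the output.
        Console.WriteLine(StripTags("<!-- a > b -->text"));  // prints " b -->text"

        // Same problem with CDATA, which is closed by "]]>" and not by a bare ">".
        Console.WriteLine(StripTags("<![CDATA[ x > y ]]>")); // prints " y ]]>"
    }
}
```

In both cases part of the comment or CDATA content survives in the result, which is the leak described above.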
Well, thank you, I missed that! I will update the blog post now.