I followed along with "Using Boost.Spirit to extract certain tags/attributes from HTML", trying to gently nudge the output to handle the text content of common body tags. I believe Spirit is appropriate for the task as a faster IO-bound solution than what we currently use: the Python HTMLParser module and about 10 lines of code. I have reviewed the existing "tiny" C++ parsers and would prefer to keep this simple task all on one page. That said, I could get no further with Qi and Phoenix, as I am missing the nuances of passing a boolean into a sub-token parser, shown below.

The intent of the grammar comes down to a few simple rules:
- select a few content tags and ignore their attributes
- traverse either \[code\]<p>foo</p>\[/code\] or \[code\]<p/>foo<next-element>\[/code\] to obtain "foo"
- skip over embedded elements and resume grabbing text, so that \[code\]<p>foo<script></script>var</p>\[/code\] yields "foo" and "var"
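
To pin down what output I expect from these three rules, here is a minimal sketch in plain standard C++. It is not Spirit and not the grammar I am asking about, just a hand-rolled scanner used as a reference for the behaviour. The function name `extract_text` and the set of content tags (`p`, `div`, `li`, `td`) are my own assumptions for illustration, and it does not handle comments or a `>` inside an attribute value.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical reference helper (not a Boost.Spirit API): returns the text runs
// found inside a content tag, or directly after a self-closed content tag.
std::vector<std::string> extract_text(const std::string& html) {
    // Assumed set of "content" tags; adjust to taste.
    static const std::set<std::string> content_tags = {"p", "div", "li", "td"};
    const std::size_t npos = std::string::npos;

    std::vector<std::string> out;
    int depth = 0;                   // number of currently open content tags
    bool after_self_closed = false;  // true right after e.g. <p/>, until the next tag
    std::size_t i = 0;

    while (i < html.size()) {
        if (html[i] != '<') {
            // Text run: extends up to the next tag or the end of input.
            std::size_t end = html.find('<', i);
            if (end == npos) end = html.size();
            if (depth > 0 || after_self_closed) {
                std::string text = html.substr(i, end - i);
                if (text.find_first_not_of(" \t\r\n") != npos) out.push_back(text);
            }
            i = end;
            continue;
        }

        // Tag: take everything between '<' and the next '>'.
        std::size_t close = html.find('>', i);
        if (close == npos) break;  // malformed tail, stop scanning
        std::string tag = html.substr(i + 1, close - i - 1);
        i = close + 1;
        after_self_closed = false;  // any tag ends a <p/>foo run

        bool closing = !tag.empty() && tag.front() == '/';
        bool self_closed = !tag.empty() && tag.back() == '/';

        // The tag name is the leading run of name characters; attributes are ignored.
        std::size_t b = closing ? 1 : 0;
        std::size_t e = b;
        while (e < tag.size() &&
               (std::isalnum(static_cast<unsigned char>(tag[e])) || tag[e] == '-')) {
            ++e;
        }
        std::string name = tag.substr(b, e - b);
        for (char& c : name) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }

        if (name == "script" && !closing && !self_closed) {
            // Skip the whole script element, then resume where it ends.
            std::size_t end = html.find("</script>", i);
            i = (end == npos) ? html.size() : end + 9;  // 9 == strlen("</script>")
            continue;
        }

        if (content_tags.count(name) == 0) continue;  // not a tag we select on

        if (self_closed) {
            after_self_closed = true;
        } else if (closing) {
            if (depth > 0) --depth;
        } else {
            ++depth;
        }
    }
    return out;
}
```

With this sketch, `extract_text("<p>foo</p>")` and `extract_text("<p/>foo<next-element>")` both return a single "foo", and `extract_text("<p>foo<script></script>var</p>")` returns "foo" followed by "var". The `after_self_closed` flag plays roughly the role of the boolean I would like to hand to the sub-token parser in the Spirit version.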