How to extract text from tag soup using boost::spirit

Andi

New Member
I followed along with "Using Boost.Spirit to extract certain tags/attributes from HTML", trying to gently nudge the output toward handling the text of common body tags. I believe Spirit is appropriate for the task as a faster solution for this IO-bound job than what we currently use: the Python HTMLParser module and about 10 lines of code. I have reviewed the existing "tiny" C++ parsers and would prefer to keep this simple task all on one page. That said, I could get no further with Qi and Phoenix, because I am missing the nuances of passing a boolean into a sub-token parser, as shown below. The intent of the grammar comes down to a few simple rules:
  • select a few content tags and ignore their attributes
  • traverse either \[code\]<p>foo</p>\[/code\] or \[code\]<p/>foo<next-element>\[/code\] to obtain "foo"
  • resume grabbing text after a skipped element, so that \[code\]<p>foo<script></script>var</p>\[/code\] yields "foo" "var"
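The three rules above can be pinned down with a small hand-written scanner, which may help as a reference for what the grammar is supposed to print. The sketch below is plain standard C++, not a Qi grammar, and every name in it (is_content_tag, trim, extract_text) is hypothetical. It assumes a stack of "keep" flags is enough to restore the outer state after a skipped element such as <script>, and it does not try to cope with a literal '<' inside script bodies or a '>' inside attribute values.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: the content tags listed in the grammar's "content" rule.
static bool is_content_tag(const std::string& name) {
    static const std::set<std::string> tags = {
        "b", "p", "i", "br", "em", "div", "span", "strong"};
    return tags.count(name) != 0;
}

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical function: returns the text fragments that should be kept.
std::vector<std::string> extract_text(const std::string& html) {
    const std::string::size_type npos = std::string::npos;
    std::vector<std::string> out;
    std::vector<bool> open;  // one flag per open element: is it a content tag?
    bool keep = false;       // should the next text fragment be kept?
    std::string::size_type pos = 0;

    while (pos < html.size()) {
        if (html[pos] != '<') {
            // Text fragment: everything up to the next '<'.
            std::string::size_type next = html.find('<', pos);
            if (next == npos) next = html.size();
            std::string text = trim(html.substr(pos, next - pos));
            if (keep && !text.empty()) out.push_back(text);
            pos = next;
            continue;
        }
        if (html.compare(pos, 4, "<!--") == 0) {  // comment: skip it whole
            std::string::size_type end = html.find("-->", pos + 4);
            pos = (end == npos) ? html.size() : end + 3;
            continue;
        }
        if (html.compare(pos, 2, "<?") == 0) {    // directive: skip it whole
            std::string::size_type end = html.find("?>", pos + 2);
            pos = (end == npos) ? html.size() : end + 2;
            continue;
        }
        std::string::size_type end = html.find('>', pos);
        if (end == npos) break;                   // unterminated tag: stop
        std::string tag = html.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty() || tag[0] == '!') continue;  // e.g. <!DOCTYPE ...>

        if (tag[0] == '/') {
            // Closing tag: drop the innermost element, restore the outer flag.
            if (!open.empty()) open.pop_back();
            keep = !open.empty() && open.back();
            continue;
        }
        // Opening or self-closing tag: the name is the leading alphanumeric
        // run, lowercased; attributes are ignored.
        bool self_closing = tag[tag.size() - 1] == '/';
        std::string name;
        for (char c : tag) {
            if (!std::isalnum(static_cast<unsigned char>(c))) break;
            name += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        keep = is_content_tag(name);
        if (!self_closing) open.push_back(keep);
    }
    return out;
}
```

Under those assumptions, extract_text("<p>foo<script></script>var</p>") returns "foo" and "var", and extract_text("<p/>foo<div>") returns "foo". The closing-tag branch is the "boolean flip" in question: it restores whatever flag the enclosing element had.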
My troubles begin when I try to act on the outer parse details, that is, to pass down a hint telling the fragment rule whether to keep the text (send it to cout) or keep skipping onward. \[code\][ if_(_r1) [std::cout<< qi::_1 << ' ' ]]\[/code\] is apparently not a supported feature, and I haven't seen a simple boolean-flipping event model in the Spirit examples to copy.

\[code\]
//#define BOOST_SPIRIT_DEBUG
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
//#include <boost/config/warning_disable.hpp>
//#include <boost/lambda/lambda.hpp>
//#include <boost/bind.hpp>
//#include <boost/algorithm/string/case_conv.hpp>
//#include <stack>

namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;

typedef std::string::const_iterator It;
typedef qi::space_type Skipper;

struct grammar : qi::grammar<It, Skipper>
{
    grammar() : grammar::base_type(html)
    {
        using namespace qi;
        using namespace phx;

        attr = +(char_) >> '=' >> ( as_string [ '"' >> lexeme [ *~char_('"') ] >> '"' ]
                                  | as_string [ "'" >> lexeme [ *~char_("'") ] >> "'" ] );
        directive = lit("<?") >> *(char_ - "?>") >> "?>";
        comment = lit("<!--") >> *(char_ - "-->") >> "-->";
        tagclose = lit("</") >> *( char_ - '>') >> '>';
        fragment = as_string[lexeme[ *(char_) - '<' ]] [ if_(_r1) [std::cout<< qi::_1 << ' ' ]];
        content = lit('b') | 'p' | 'i' | "br" | "em" | "div" | "span" | "strong" ;
        elem = lit("<")-(lit('!') | '?')
            >> content[ _a= true ] ^ (+char_ -content)[ _a = false]
            >> *(+space >> *attr)
            >> ((lit("/>")>>fragment(_a)>>html) ^ (lit('>')>> fragment(_a) % html >> tagclose));
        html = *(elem ^ comment ^ directive);
    }

    qi::rule<It, Skipper> html, tagclose, comment, content, directive, attr;
    qi::rule<It, Skipper, bool > elem;
    qi::rule<It, bool , Skipper > fragment;
};

int main(int argc, const char *argv[])
{
    std::string s;
    const static grammar html_;
    while (std::getline(std::cin, s)) {
        It f = s.begin(), l = s.end();
        phrase_parse(f, l, html_, qi::space);
    }
    return 0;
}
\[/code\]
 