alwayzhappy
New Member
Here's an easy point for an XPath expert! Document structure:
\[code\]
<tokens>
  <token> <word>Newt</word><entityType>PROPER_NOUN</entityType> </token>
  <token> <word>Gingrich</word><entityType>PROPER_NOUN</entityType> </token>
  <token> <word>admires</word><entityType>VERB</entityType> </token>
  <token> <word>Garry</word><entityType>PROPER_NOUN</entityType> </token>
  <token> <word>Trudeau</word><entityType>PROPER_NOUN</entityType> </token>
</tokens>
\[/code\]
Ignoring the semantic improbability of the document, I want to pull out [["Newt", "Gingrich"], ["Garry", "Trudeau"]], that is: when there are two tokens in a row whose entityTypes are PROPER_NOUN, I want to extract the words from those two tokens.

I've gotten as far as:
\[code\]
"//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']"
\[/code\]
... which gets as far as finding the second of two consecutive PROPER_NOUN tokens, but I'm not sure how to get it to emit the first token along with it. Some notes:
- I don't mind doing higher-level processing of the NodeSets (e.g. in Ruby / Nokogiri) if that simplifies the problem.
- In the event that there are three or more consecutive PROPER_NOUN tokens (call them A, B, C), ideally I'd like to emit [A, B], [B, C].
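
For what it's worth, here is a minimal sketch of the "higher-level processing" route mentioned in the notes above. It does not use Nokogiri: it assumes Ruby's bundled REXML library so that it runs without extra gems. The names `SAMPLE_XML` and `proper_noun_pairs` are made up for illustration. The idea is to collect the `token` siblings in document order and slide a window of two over them, keeping only the windows where both tokens are PROPER_NOUN. A run of three (A, B, C) then yields [A, B] and [B, C].

```ruby
require 'rexml/document'

# Sample document from the question (hypothetical constant name).
SAMPLE_XML = <<~DOC
  <tokens>
    <token> <word>Newt</word><entityType>PROPER_NOUN</entityType> </token>
    <token> <word>Gingrich</word><entityType>PROPER_NOUN</entityType> </token>
    <token> <word>admires</word><entityType>VERB</entityType> </token>
    <token> <word>Garry</word><entityType>PROPER_NOUN</entityType> </token>
    <token> <word>Trudeau</word><entityType>PROPER_NOUN</entityType> </token>
  </tokens>
DOC

# Hypothetical helper: returns [[word, word], ...] for every pair of
# adjacent <token> siblings that are both tagged PROPER_NOUN.
def proper_noun_pairs(xml)
  doc = REXML::Document.new(xml)
  tokens = doc.root.get_elements('token') # child <token> elements, in document order

  tokens.each_cons(2) # sliding window: [t0, t1], [t1, t2], ...
        .select { |pair| pair.all? { |t| t.elements['entityType']&.text == 'PROPER_NOUN' } }
        .map { |pair| pair.map { |t| t.elements['word']&.text } }
end

p proper_noun_pairs(SAMPLE_XML)
# prints [["Newt", "Gingrich"], ["Garry", "Trudeau"]]
```

If staying in XPath is preferred, one possible approach is to move the sibling test into a predicate so that the first token of each pair is selected instead of the second: `//token[entityType='PROPER_NOUN'][following-sibling::token[1][entityType='PROPER_NOUN']]`. Each match can then be paired with its own `following-sibling::token[1]` in a second step.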