Parsing tags with TagSoup in Haskell

frashlanutq5 · Mar 16, 2013

I've been trying to learn how to extract data from HTML files in Haskell, and have hit a wall. I'm not really experienced with Haskell at all; my previous knowledge is from Python (with BeautifulSoup for HTML parsing).

I'm using TagSoup to look at my HTML (it seemed to be the recommended option) and have a basic idea of how it works. Here's the segment of my code in question (self-contained, and it prints every remaining tag for testing):

[code]
import System.IO
import Network.HTTP
import Text.HTML.TagSoup
import Data.List

main :: IO ()
main = do
    http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody
    let tags = dropWhile (~/= TagOpen "div" []) (parseTags http)
    --let list = sections (~== TagOpen "td" [("align","center")]) tags
    done tags
  where
    done xs = case xs of
        [] -> putStrLn $ "\n"
        _  -> do
            putStrLn $ show $ head xs
            done (tail xs)
[/code]

However, I don't want to stop at just any "div" tag. I want to drop everything prior to a tag in a format like this:

[code]
TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")]
TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]
[/code]

I've tried writing it out:

[code]
let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox( spanCol[0-9]?)+( lastCol)?")]) (parseTags http)
[/code]

But then it tries to find the literal [0-9]+. I haven't figured out a workaround with the Text.Regex.Posix module yet, and escaping the characters doesn't work. What's the solution here?
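
For reference, the literal match happens because `~==` and `~/=` compare attribute values as plain strings, so a regex inside the `TagOpen` pattern is never interpreted. One possible workaround, sketched here under assumptions, is to pass `dropWhile` an ordinary predicate on the tag instead of a pattern. The helper name `isScoreBox` and the inline sample HTML are made up for illustration; the check on the "class" attribute is deliberately looser than the regex in the question (it only requires the `scoreBox` class to be present).

```haskell
import Data.Char (isDigit)
import Data.List (stripPrefix)
import Text.HTML.TagSoup

-- Hypothetical helper (not part of TagSoup): True for an opening <div>
-- whose "id" is "scores-" followed by one or more digits and whose
-- "class" list contains "scoreBox".
isScoreBox :: Tag String -> Bool
isScoreBox (TagOpen "div" attrs) =
    case (lookup "id" attrs, lookup "class" attrs) of
        (Just i, Just c) -> validId i && "scoreBox" `elem` words c
        _                -> False
  where
    validId i = case stripPrefix "scores-" i of
        Just ds -> not (null ds) && all isDigit ds
        Nothing -> False
isScoreBox _ = False

main :: IO ()
main = do
    -- Sample HTML assumed for illustration; the real page is fetched over HTTP.
    let html = "<p>intro</p><div id=\"scores-1997830\" class=\"scoreBox spanCol2\">x</div>"
        tags = dropWhile (not . isScoreBox) (parseTags html)
    mapM_ print tags
```

If a regex is preferred, the body of `validId` could presumably be swapped for a Text.Regex.Posix test such as `i =~ "^scores-[0-9]+$" :: Bool`; the predicate approach stays the same either way.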
