Looking for a way to trim HTML code using terminal commands

Xeon212 · Mar 20, 2013

I'm trying to learn \[code\]awk\[/code\] and \[code\]sed\[/code\] better, so that I can create cross-compatible terminal tools without needing things like PHP, Perl and so on. Right now I'm trying to clean up a very long string, which is basically part of an HTML document that I've fetched with \[code\]curl\[/code\], and I'm wondering about the best way to go about it.

Most solutions I have found rely on luxuries like static files or fixed structures. Since I'm cleaning up fetched HTML, I want to assume that the "periphery" of the string can change a lot, both in size and structure. So what I think I need to do is identify specific HTML tags, as these are unlikely to change, and extract the data from those tags no matter where they appear. An example could be something like this:

\[code\]<span class="unique-class">Payload</span>\[/code\]

I need to look for that entire opening tag and, when it is found, extract everything after the \[code\]>\[/code\] until a \[code\]<\[/code\] is found and another tag starts.

My original code is basically useless, because it just \[code\]grep\[/code\]s for lines matching certain words, and those words can also show up in uninteresting places on the same page. So I'm really open to anything.
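To make the idea concrete, here is a minimal sketch of the extraction described above, using only POSIX \[code\]awk\[/code\]. It rests on several assumptions: the function name \[code\]extract_payload\[/code\] is made up for illustration, the opening tag appears exactly as written (same attribute, same quoting, same spacing), and the payload contains no nested tags. It is not a real HTML parser and will miss tags that are written differently.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration only).
# Splits the input into records at every '<', so each record starts with
# a tag name. A record that begins with the wanted opening tag (minus its
# leading '<') holds the payload right after the closing '>'.
extract_payload() {
  awk -v tag='span class="unique-class">' '
    BEGIN { RS = "<" }                    # one record per tag start
    index($0, tag) == 1 {                 # record begins with the tag
      print substr($0, length(tag) + 1)   # text up to the next "<"
    }
  '
}

# Demo with changing "periphery" around the tag of interest.
printf '%s\n' '<div id="x">noise</div><span class="unique-class">Payload</span><p>more</p>' \
  | extract_payload
```

With a fetched page, this would be used as a filter at the end of the pipeline, for example by piping the output of \[code\]curl -s\[/code\] into \[code\]extract_payload\[/code\]. Because the record separator is \[code\]<\[/code\] rather than a newline, it does not matter whether the surrounding HTML is on one line or spread over many.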
