HTML in the wild is rarely clean: it’s usually full of analytics tags, JavaScript and deeply nested markup. That is fine for displaying in browsers, but at some point an iOS app will need to display HTML content, and when it does you usually want clean HTML or only a small subset of HTML tags. All it takes is one unexpected tag and the whole document layout could be ruined. So here is a quick and easy way of stripping HTML content down.
What’s the problem with regex?
The tried and tested method is to use regular expressions to strip away HTML tags, but this has a number of problems. What if the HTML isn’t properly formed? What if you have JavaScript inside the HTML and the regex matches on that? What if you strip away the tags but you don’t want to keep the content of link (anchor) tags such as ‘Click Here’? The regex approach might work for some really simple HTML, but it quickly becomes clear that one single regex isn’t going to solve this. Applying different regexes to different parts of the HTML snippet would work, except that this approach starts to require a deep knowledge of the HTML specification and would probably involve writing a small browser to cover all the edge cases, which you don’t want to do… or do you?
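To make the JavaScript problem concrete, here is a minimal sketch in plain C (which also compiles as Objective-C) using the POSIX regex API. The function name naive_strip_tags and the pattern <[^>]*> are my own choices for illustration, not something taken from a particular library.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Naively removes everything matching <[^>]*> from html.
 * Returns a newly allocated string that the caller must free. */
char *naive_strip_tags(const char *html)
{
    regex_t re;
    int rc = regcomp(&re, "<[^>]*>", REG_EXTENDED);
    assert(rc == 0);
    (void)rc;

    /* The result can never be longer than the input. */
    char *out = malloc(strlen(html) + 1);
    assert(out != NULL);
    size_t len = 0;

    const char *cursor = html;
    regmatch_t m;
    while (regexec(&re, cursor, 1, &m, 0) == 0) {
        /* Copy the text before the match, then skip the match itself. */
        memcpy(out + len, cursor, (size_t)m.rm_so);
        len += (size_t)m.rm_so;
        cursor += m.rm_eo;
    }
    /* Copy whatever follows the last match, including the terminator. */
    strcpy(out + len, cursor);

    regfree(&re);
    return out;
}
```

Feed it <p>Hi</p><script>if (a < b) { alert("x > y"); }</script> and you get back Hiif (a  y"); } because the regex treats everything between the < and > inside the script as one more tag, and what is left of the JavaScript ends up in your "clean" text.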
The power of a small browser in your binary
Wait, don’t go! It’s not as mammoth a task as it seems… You want to parse the HTML just like a browser would, except that you don’t need to render the content. We’ll borrow a trick from web browsers, which is to model the HTML content using the Document Object Model (DOM). This is a way of representing the HTML content internally as a tree structure. Web browsers typically parse the HTML into a DOM tree before using that tree to render the contents to the screen. It’s that parsing and the DOM tree that you want to use, because the HTML5 spec covers how HTML should be parsed and what happens if a tag is missing or not formatted correctly.
Once you have the DOM, you can walk down the tree and delete, modify or add tags. Finally, once you are happy with your tree, you can serialise it back to HTML and put it in your app, safe in the knowledge that you’ve stripped it of any nasty code or tags.
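To picture the walk, delete and serialise steps, here is a rough sketch in plain C. The Node struct and both functions are hypothetical and heavily simplified (no attributes, no entity escaping, a fixed number of children); they are not the API of any real parser, just the shape of the idea.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHILDREN 8

/* Hypothetical, simplified DOM node: an element has a tag and children,
 * a text node has tag == NULL and carries text instead. */
typedef struct Node {
    const char *tag;
    const char *text;
    struct Node *children[MAX_CHILDREN];
    size_t child_count;
} Node;

/* Deletes every descendant element whose tag equals `tag`
 * by compacting each children array in place. */
void remove_elements(Node *node, const char *tag)
{
    assert(node != NULL && tag != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < node->child_count; i++) {
        Node *child = node->children[i];
        if (child->tag != NULL && strcmp(child->tag, tag) == 0) {
            continue; /* drop this whole subtree */
        }
        remove_elements(child, tag);
        node->children[kept++] = child;
    }
    node->child_count = kept;
}

/* Appends the HTML for the subtree rooted at `node` to `out`.
 * `out` must already hold a NUL-terminated string; `cap` is its capacity. */
void serialise(const Node *node, char *out, size_t cap)
{
    assert(node != NULL && out != NULL && cap > 0);
    size_t used = strlen(out);
    if (node->tag == NULL) {
        /* Text node: written as-is, no escaping in this sketch. */
        snprintf(out + used, cap - used, "%s", node->text);
        return;
    }
    snprintf(out + used, cap - used, "<%s>", node->tag);
    for (size_t i = 0; i < node->child_count; i++) {
        serialise(node->children[i], out, cap);
    }
    used = strlen(out);
    snprintf(out + used, cap - used, "</%s>", node->tag);
}
```

Build a body node holding a p with the text Hello and a script with some tracking code, call remove_elements(&body, "script"), and serialising what is left gives <body><p>Hello</p></body>.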
HTML parsing on the shoulders of giants
This is where Gumbo comes in. Gumbo is an implementation of the HTML5 parsing algorithm, written by Google as a pure C99 library with no outside dependencies. It’s C, so it’s fast, but that also means it can be a bit difficult to work with, which is why you might want to use Objective-Gumbo instead: it’s an Objective-C wrapper around Gumbo.
To get started with the DOM, all you need is to create an OGDocument using your HTML text snippet:
NSString *HTML = @"...";
OGDocument *doc = [ObjectiveGumbo parseDocumentWithString:HTML];
OGElement *bodyTag = [[doc elementsWithTag:GUMBO_TAG_BODY] firstObject];
..
The DOM is actually really simple: you have nodes, and nodes can have child nodes. Each node can have a tag associated with it, such as a p tag or a body tag.
Now that you have your root node, the body tag, as a DOM element node, you can recursively iterate through it, modifying the DOM or simply writing out the HTML tags as strings.
- (NSString *)stringFromElement:(OGElement *)element
{
    NSMutableString *string = [[NSMutableString alloc] init];
    // Write out the starting tag
    if (element.tag == GUMBO_TAG_P) {
        [string appendString:@"<p>"];
    }
    // Handle the child nodes
    // (also recursively call this method until we have no children)
    for (OGNode *child in element.children) {
        if ([child isKindOfClass:[OGText class]]) {
            NSString *text = ((OGText *)child).text;
            // Escape special characters, e.g. convert & back to &amp;
            CFStringRef textHTML = CFXMLCreateStringByEscapingEntities(kCFAllocatorDefault,
                                                                       (__bridge CFStringRef)text,
                                                                       NULL);
            // Hand ownership to ARC so the escaped string is not leaked
            [string appendString:CFBridgingRelease(textHTML)];
        } else if ([child isKindOfClass:[OGElement class]]) {
            [string appendString:[self stringFromElement:(OGElement *)child]];
        }
    }
    // Write out the closing tag
    if (element.tag == GUMBO_TAG_P) {
        [string appendString:@"</p>"];
    }
    return string;
}
The above is just a simple method that scans for paragraph tags and outputs new, clean paragraph tags containing text that has been filtered by the HTML5 parser. You call it on the body tag like this:
NSString *HTML = @"...";
OGDocument *doc = [ObjectiveGumbo parseDocumentWithString:HTML];
OGElement *bodyTag = [[doc elementsWithTag:GUMBO_TAG_BODY] firstObject];
NSString *cleanHTML = [[self stringFromElement:bodyTag] copy];
Hopefully that is a good starting point. You could create a rule-based parser that keeps certain content in and removes the rest. Another option would be to strip out HTML links, or perhaps you might want to keep them but render them at the end of the document. There are many different options, so I recommend you have a play around. If you do something interesting, please post it to GitHub and send me a link.
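If you want a feel for what a rule-based approach might look like, here is one possible sketch in plain C. Everything in it (the Action enum, the kRules table, the default of unwrapping unknown tags) is a set of assumptions made up for illustration; you would pick your own rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    ACTION_KEEP,   /* emit the tag and its children */
    ACTION_UNWRAP, /* drop the tag but keep its children */
    ACTION_DROP    /* drop the tag and everything inside it */
} Action;

typedef struct {
    const char *tag;
    Action action;
} Rule;

/* Hypothetical rule table: keep paragraphs, drop links and scripts. */
static const Rule kRules[] = {
    { "p",      ACTION_KEEP },
    { "a",      ACTION_DROP },
    { "script", ACTION_DROP },
    { "style",  ACTION_DROP },
};

/* Looks up what to do with an element; unknown tags are unwrapped
 * so that their text survives but the markup does not. */
Action action_for_tag(const char *tag)
{
    assert(tag != NULL);
    for (size_t i = 0; i < sizeof kRules / sizeof kRules[0]; i++) {
        if (strcmp(kRules[i].tag, tag) == 0) {
            return kRules[i].action;
        }
    }
    return ACTION_UNWRAP;
}
```

A recursive writer could then ask action_for_tag for each element it visits: write the tags for ACTION_KEEP, only recurse into the children for ACTION_UNWRAP, and skip the subtree entirely for ACTION_DROP.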