text_extraction

These links were taken from "List of resources: Article text extraction from HTML documents".

Links

Boilerpipe: an open-source Java library and the official implementation of the algorithm presented in the paper by Kohlschütter et al.
The Readability bookmarklet by arc90labs is open source. Originally written in JavaScript, it has also been ported to other languages:
- python-readability – using BeautifulSoup (slow)
- fork of python-readability employing lxml for faster parsing
- ruby-readability
- PHP port
Project Goose by Gravity labs
Perl module HTML::Feature
Webstemmer is a web crawler and page layout analyzer with a text extraction utility
Demo of VIPS packaged in a .dll (its use is limited to research purposes only)
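
To give a rough idea of what the tools above automate, here is a minimal sketch of one common heuristic: split the page into blocks, then keep only blocks that are reasonably long and not dominated by link text. This is not the algorithm used by Boilerpipe, Readability, or any other project listed here. The names `BlockCollector` and `extract_article_text` and the two thresholds are assumptions made up for illustration, and the code uses only Python's standard library.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries, and tags whose content is ignored.
# Both sets are illustrative choices, not taken from any listed library.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and record each block's link density."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished (text, link_density) pairs
        self._text = []       # raw text pieces of the current block
        self._chars = 0       # non-whitespace characters in the current block
        self._link_chars = 0  # non-whitespace characters inside <a> elements
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and start a new one."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars / self._chars))
        self._text, self._chars, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        count = len("".join(data.split()))
        self._chars += count
        if self._in_link:
            self._link_chars += count


def extract_article_text(html, min_chars=40, max_link_density=0.3):
    """Return the blocks that look like article text, separated by blank lines."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last block tag
    kept = [
        text
        for text, density in parser.blocks
        if len(text) >= min_chars and density <= max_link_density
    ]
    return "\n\n".join(kept)
```

Calling `extract_article_text(page_html)` on a typical article page should drop most navigation bars and short captions, but a heuristic this simple will misjudge many real pages, which is the gap the libraries above try to close.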

Web APIs

After a short search, I came across some decent web APIs:

Alchemy API Web Page Cleaning – a well-known commercial API with a limited free service
ViewText.org – they’re asking you to be kind to their servers, so this is not your typical commercial service
DiffBot API – describes itself as: “Statistical machine learning algorithms are run over all of the visual elements on the page to extract out the article text and associated metadata, such as its images, videos, and tags.”
Purifry – promises high performance and good accuracy. It is also available as a binary.