These links come from "List of resources: Article text extraction from HTML documents":
- Boilerpipe: an open-source Java library and the official implementation of the algorithm presented in the previously mentioned paper by Kohlschütter et al.
- Project Goose by Gravity labs
- Perl module HTML::Feature
- Webstemmer is a web crawler and page layout analyzer with a text extraction utility
- Demo of VIPS packaged in a .dll (its use is limited to research purposes only)
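
To give a rough feel for the kind of signal such extractors rely on, here is a minimal Python sketch that scores text blocks by word count and link density and keeps only the blocks that look like article text. It is an illustration only, not Boilerpipe's actual algorithm: the class name `BlockCollector`, the function `extract_text`, the set of block-level tags, and the thresholds `min_words` and `max_link_density` are all assumptions made up for this example.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "article", "section"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts words and linked words."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # tuples of (text, word_count, link_word_count)
        self._text = []
        self._words = 0
        self._link_words = 0
        self._in_link = 0     # nesting depth of <a> tags
        self._skip = 0        # nesting depth of <script>/<style>

    def flush(self):
        # Close the current block and start a new one.
        if self._text:
            self.blocks.append((" ".join(self._text), self._words, self._link_words))
        self._text, self._words, self._link_words = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._text.extend(words)
        self._words += len(words)
        if self._in_link:
            self._link_words += len(words)


def extract_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = [
        text
        for text, n_words, n_link_words in parser.blocks
        if n_words >= min_words and n_link_words / n_words <= max_link_density
    ]
    return "\n".join(kept)
```

Calling `extract_text(page_html)` on a page's HTML string returns the surviving blocks joined by newlines. Short, link-heavy blocks such as navigation menus tend to be dropped, while long paragraphs with few links are kept. Real pages will need better thresholds and more features than this.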
After a short search, I also came across some decent web APIs:
- Alchemy API Web Page Cleaning – a well-known commercial API with a limited free service
- ViewText.org – they’re asking you to be kind to their servers, so this is not your typical commercial service
- DiffBot API – describes itself as: “Statistical machine learning algorithms are run over all of the visual elements on the page to extract out the article text and associated metadata, such as its images, videos, and tags.”
- Purifry – promises high performance and good accuracy. It’s also available as a binary.