Module for automatic summarization of text documents and HTML pages.
Updated May 16, 2024 - Python
Rework of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Domain-specific language for extracting structured data from HTML documents
Xtract-html is a tool for extracting the HTML display code of a website, which you can then reuse on your own website.
Article extraction benchmark: dataset and evaluation scripts
Xtract-htmlV2, the successor to the previous version, is a tool for getting the HTML code of any website you want.
Extract embedded metadata from HTML markup
Extract price amount and currency symbol from a raw text string
Script for extracting units from http://vocab.nerc.ac.uk/collection/P06/current/ to easily add units to the database (this should only be temporary, to demonstrate how units can work)
Fast Python port of arc90's readability tool, updated to match the latest readability.js!
Parse numbers written in natural language
Heuristic-based boilerplate removal tool
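To make one of the tasks listed above concrete, here is a minimal sketch of pulling a price amount and currency symbol out of a raw text string. It uses only the Python standard library and is not the API of any repository in this list: the names `Price`, `extract_price`, and `_PRICE_RE` are hypothetical, and the sketch assumes US-style numbers (comma as thousands separator, dot as decimal point) and a small fixed set of currency symbols.

```python
import re
from decimal import Decimal
from typing import NamedTuple, Optional


class Price(NamedTuple):
    """Hypothetical result type: both fields are None when nothing matches."""
    amount: Optional[Decimal]
    currency: Optional[str]


# Assumption: only these four symbols are recognized, and numbers are
# written US-style (1,299.99), not European-style (1.299,99).
_PRICE_RE = re.compile(
    r"(?P<pre>[$€£¥])?\s*"
    r"(?P<num>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)"
    r"\s*(?P<post>[$€£¥])?"
)


def extract_price(text: str) -> Price:
    """Return the first number in `text` and any adjacent currency symbol."""
    match = _PRICE_RE.search(text)
    if match is None:
        return Price(None, None)
    # Drop thousands separators before converting to an exact decimal.
    amount = Decimal(match.group("num").replace(",", ""))
    # The symbol may come before or after the number.
    currency = match.group("pre") or match.group("post")
    return Price(amount, currency)


if __name__ == "__main__":
    print(extract_price("Price: $1,299.99 incl. tax"))
    print(extract_price("25 €"))
```

`Decimal` is used instead of `float` so that amounts such as 1299.99 are kept exactly. A real extractor would also need to handle currency codes, locale-specific separators, and unrelated numbers appearing before the price.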