python - Remove boilerplate content from HTML page
I use the jusText implementation found at https://github.com/miso-belica/justext to clean the content out of an HTML page. It works like this:
import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("english"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)
I have already downloaded the pages I want to parse with this tool (some of them are no longer available online), and I would like to extract the HTML content from them. Since jusText appears to work on the output of a request (which is a Response object), I was wondering if there is a way to set the content of a Response object so that it contains the HTML text I want to parse.
response.content is of <type 'str'>, so no Response object is needed:
>>> from requests import get
>>> r = get("http://www.google.com/")
>>> type(r.content)
<type 'str'>
So you can pass your HTML string directly:
justext.justext(my_html_string, justext.get_stoplist("english"))
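For pages already saved on disk, the same call can be fed from a file. The following is a minimal sketch, not part of the original answer: the helper names read_saved_html and extract_text and the path saved_page.html are made up for illustration, and the stoplist name is spelled "English" as in the project README, which may matter on case-sensitive file systems.

```python
import io


def read_saved_html(path):
    """Return the raw bytes of a saved HTML file (hypothetical helper)."""
    # Binary mode: the bytes are passed on unchanged, so decoding is left to jusText.
    with io.open(path, "rb") as handle:
        return handle.read()


def extract_text(html, language="English"):
    """Return the text of paragraphs jusText does not flag as boilerplate."""
    # Imported lazily so read_saved_html() is usable without jusText installed.
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]


# Example use ("saved_page.html" is a placeholder path):
#     for text in extract_text(read_saved_html("saved_page.html")):
#         print(text)
```

Reading the file in binary mode mirrors what response.content gives you, so the saved pages go through the same code path as freshly downloaded ones.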