python - Remove boilerplate content from HTML page

I use the justext implementation found at https://github.com/miso-belica/justext to clean the content out of an HTML page. It works like this:

    import requests
    import justext

    response = requests.get("http://planet.python.org/")
    paragraphs = justext.justext(response.content, justext.get_stoplist("english"))
    for paragraph in paragraphs:
        # Print only the paragraphs that justext did not classify as boilerplate.
        if not paragraph.is_boilerplate:
            print(paragraph.text)

I have already downloaded the pages I want to parse with this tool (some of them are no longer available online), and I want to extract the content from their HTML. Since justext appears to work on the output of a request (which is a Response object), I am wondering whether there is a way to set the content of a Response object so that it contains the HTML text I want to parse.

response.content is of <type 'str'>, so justext is receiving a plain string rather than a Response object:

    >>> from requests import get
    >>> r = get("http://www.google.com/")
    >>> type(r.content)
    <type 'str'>

So you can call justext directly on your own HTML string:

justext.justext(my_html_string, justext.get_stoplist("english"))
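For pages already saved on disk, a minimal sketch could look like the following. The helper names `read_saved_page` and `extract_main_text` and the file name `saved_page.html` are made up for illustration and are not part of justext. The stoplist name is written as "English" here because that is how the project README spells it, and the lookup may be case-sensitive on some systems; treat that as an assumption to check against your install.

```python
def read_saved_page(path):
    """Return the raw bytes of an HTML file previously saved to disk (hypothetical helper)."""
    # Binary mode leaves decoding to justext, as with response.content.
    with open(path, "rb") as handle:
        return handle.read()


def extract_main_text(html, language="English"):
    """Return the text of every paragraph justext does not flag as boilerplate (hypothetical helper)."""
    # Imported here so the file-reading helper works even without justext installed.
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]


if __name__ == "__main__":
    # "saved_page.html" is a placeholder for one of your downloaded pages.
    for text in extract_main_text(read_saved_page("saved_page.html")):
        print(text)
```

No Response object is needed: the string or bytes read from the file takes the place of response.content.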
