Extracting and parsing HTML from a secure website with Python?
OK, I need to write a script. I don't mind which language: I'd prefer
something like Python or JavaScript, but I will take the time to learn
whatever works. The script will access multiple URLs, extract the text from
each site, and store it in a folder on my PC. (From there I am manipulating
the data with Python, which I know how to do.)
EDIT: Currently I am using Python's NLTK module. Here is a simplified
version of my code:
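from urllib.request import urlopen  # Python 3; on Python 2: from urllib import urlopen
import nltk

url = "<URL HERE>"
html = urlopen(url).read()
# clean_html() works in NLTK 2.x; in NLTK 3 it raises NotImplementedError
raw = nltk.clean_html(html)
print(raw)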
This code works fine for both http and https, but not for instances where
authentication is required.
Is there a Python module which deals with secure authentication?
Thanks in advance for your help! And to the mods who will view this as a
bad question, please just give me ways to make it better. I need ideas from
people, not from Google.
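
To make the question more concrete, here is a rough sketch of the kind of thing I am after. It assumes the sites use HTTP Basic authentication, which may not be the case (a site with a login form would need cookies or a session instead). It uses only the standard library's urllib.request; the function names make_auth_opener and fetch_html, the example URL, and the credentials are made up for illustration.

```python
import urllib.request


def make_auth_opener(top_level_url, username, password):
    """Build an opener that answers HTTP Basic auth challenges.

    Hypothetical helper: credentials apply to any URL under top_level_url.
    """
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # Passing None as the realm means "use these credentials for any realm".
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def fetch_html(opener, url, timeout=30):
    """Fetch a page through the opener and return its body as text."""
    with opener.open(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (placeholder URL and credentials, not real ones):
# opener = make_auth_opener("https://example.com/", "my_user", "my_password")
# html = fetch_html(opener, "https://example.com/protected/page.html")
```

The returned text could then be passed to whatever HTML-stripping step is already in place and written to a file per URL.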