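Many of us use Python for data mining, but some sites only show their content after you log in. So how do you simulate a login from Python? There is plenty written about this online, but a while ago I ran into a site that needs a cookie before it will let you log in, and the site itself was a bit buggy, so it took a lot of effort to get working. I'm posting the result here to share.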
First, a version built on the urllib package from the Python 3 standard library, which spares you from handling many of the details:
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-

    import urllib.error
    import urllib.parse
    import urllib.request
    import http.cookiejar

    StudentInfoURL = 'http://210.x.x.1:90/student/index.jsp'
    loginURL = 'http://210.x.x.1:90/login.jsp'
    loginCheckURL = 'http://210.x.x.1:90/j_security_check'
    # In Python 3 the request body must be bytes, so encode the urlencoded string
    post_data = urllib.parse.urlencode({'j_username': 'xxxxxxx', 'j_password': 'xxxxxxx'}).encode('ascii')
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36',
    }

    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # You must connect once here, otherwise you will not get the cookie
    opener.open(loginCheckURL)
    urllib.request.install_opener(opener)

    ###################### Handle the error here: just log in a second time ######################
    request = urllib.request.Request(loginCheckURL, post_data, headers)
    try:
        response = urllib.request.urlopen(request)
    except urllib.error.HTTPError:
        # The first POST fails on this site; the second attempt goes through
        response = urllib.request.urlopen(request)
    print(response.read().decode('GBK'))

    ###################### Normal access works from here on ######################
    request = urllib.request.Request(StudentInfoURL, headers=headers)
    fp = urllib.request.urlopen(request)
    print(fp.read().decode('GBK'))
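If you want to try the GET-then-POST cookie flow without touching a real site, something like the following sketch might help. Everything on the server side is an assumption made up for illustration: `DemoHandler`, the cookie value `demo123`, and the user name `alice` are hypothetical and have nothing to do with the actual site above. The client side (`login`) uses the same `CookieJar` plus `HTTPCookieProcessor` approach as the script.

```python
import http.cookiejar
import http.server
import threading
import urllib.parse
import urllib.request


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real site: hands out a session cookie
    on GET and only accepts a POST login that sends the cookie back."""

    def _reply(self, status, body, cookie=None):
        payload = body.encode("utf-8")
        self.send_response(status)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._reply(200, "login page", cookie="JSESSIONID=demo123; Path=/")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("ascii"))
        has_cookie = "JSESSIONID=demo123" in self.headers.get("Cookie", "")
        # The demo only checks the user name; the password is ignored
        ok = has_cookie and form.get("j_username") == ["alice"]
        self._reply(200 if ok else 403, "welcome" if ok else "denied")

    def log_message(self, *args):
        pass  # keep the demo quiet


def login(base_url, username, password):
    """GET once to collect the cookie, then POST the credentials with it."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # skip any system proxy for the local demo
        urllib.request.HTTPCookieProcessor(jar),
    )
    with opener.open(base_url + "/login.jsp") as resp:
        resp.read()
    data = urllib.parse.urlencode(
        {"j_username": username, "j_password": password}
    ).encode("ascii")
    with opener.open(base_url + "/j_security_check", data) as resp:
        return resp.read().decode("utf-8"), [c.name for c in jar]


def run_demo():
    """Start the demo server on a free local port, log in, shut it down."""
    server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return login("http://127.0.0.1:%d" % server.server_port, "alice", "secret")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

If you drop the first GET in `login`, the POST arrives without the cookie and the demo server answers 403, which `urllib` raises as an `HTTPError`. That is roughly the situation the "connect once" comment in the script is guarding against.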
Below is another version, built on the lower-level client module of the http package. I personally like this one a lot:
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-

    import http.client
    import urllib.parse

    ###########################################################
    HOST = '210.x.x.1:90'
    UserName = "xxxxxxx"
    PassWord = "xxxxxxx"
    # urlencode escapes any special characters in the credentials
    data = urllib.parse.urlencode({"j_username": UserName, "j_password": PassWord})
    Headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)",
    }
    ###########################################################

    # Connect to the server
    conn = http.client.HTTPConnection(HOST, timeout=30)
    conn.connect()

    # GET the login page to obtain the cookie
    conn.request("GET", "/j_security_check", None, Headers)
    res = conn.getresponse()
    # Keep only the leading "name=value" pair of the Set-Cookie header
    m_cookie = res.getheader("Set-Cookie").split(';')[0]
    res.read()

    # POST to the login page to log in
    Headers["Cookie"] = m_cookie
    conn.request("POST", "/j_security_check", data, Headers)
    res = conn.getresponse()
    res.read()
    if res.status == 400:
        # The first attempt was rejected, so POST to the login page again
        conn.request("POST", "/j_security_check", data, Headers)
        res = conn.getresponse()
        res.read()
    conn.close()

    ###################### Normal access works from here on ######################
    conn2 = http.client.HTTPConnection(HOST)
    conn2.request("GET", "/student/index.jsp", None, Headers)
    fp = conn2.getresponse()
    print(fp.status)
    print(fp.read().decode("GBK"))
    conn2.close()
    ###########################################################
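One fragile spot in this version is `res.getheader("Set-Cookie").split(';')[0]`: it fails if the server sends no cookie at all, and it only keeps the first cookie. A possible alternative, sketched below, is to let `http.cookies.SimpleCookie` from the standard library do the parsing. The helper name `cookie_header_from` is my own invention for illustration, and I'm assuming you collect the raw header values with something like `res.msg.get_all("Set-Cookie")`, which gives a list, or `None` when the header is absent.

```python
from http.cookies import SimpleCookie


def cookie_header_from(set_cookie_values):
    """Build a value for the Cookie request header from a list of
    Set-Cookie header values. Returns "" for None or an empty list."""
    parsed = SimpleCookie()
    for value in set_cookie_values or []:
        parsed.load(value)  # attributes such as Path and HttpOnly are not cookies
    return "; ".join(
        "%s=%s" % (name, morsel.coded_value) for name, morsel in parsed.items()
    )


if __name__ == "__main__":
    print(cookie_header_from(["JSESSIONID=ABC123; Path=/; HttpOnly"]))
```

With that in place, the line in the script could become something like `Headers["Cookie"] = cookie_header_from(res.msg.get_all("Set-Cookie"))`. I haven't tried this against the site in question, so treat it as a starting point.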
Criticism and suggestions are welcome.