爬虫请求页面的几种方法

爬虫

发布日期: 2019-05-28

文章字数: 546

阅读时长: 2 分

阅读次数:

爬虫请求页面的几种方法

看了好多爬虫请求方式，大同小异。在这里汇总整理一下几种请求网页的方式。文内含构造请求头，设置代理。

python中有多种库可以用来处理http请求，比如python的原生库：urllib包、requests类库。urllib和urllib2是相互独立的模块，python3.0以上把urllib和urllib2合并成一个库了，requests库使用了urllib3。requests库的口号是“HTTP For Humans”，为人类使用HTTP而生，用起来不知道要比python原生库好用多少呢，比起urllib包的繁琐，requests库特别简洁和容易理解。

其中比较常用的方法是 requests、selenium

pyppeteer 一些逆向化比较好

一、urllib.request

urllib包中的urllib.request 经常配合 urllib.parse 来一起使用，里面的 urlencode 可以把key-value这样的键值对转换成我们想要的格式
```
 dic = {
     "jl":city,
     "kw":profession,
     "p":page
 }
 urlwith = urllib.parse.urlencode(dic1)
 urlFinal = url + urlwith
```

请求网页代码

 from fake_useragent import UserAgent  #随机情求头
 import urllib.request
 user_agent =UserAgent()
 request = urllib.request.Request("https://httpbin.org/ip")
 request.add_header("User-Agent", user_agent.random)
 response = urllib.request.urlopen(request)
 html = response.read().decode("utf8",'ignore')

 print(html)

二、requests

安装

pip install requests
请求网页代码

from fake_useragent import UserAgent  #随机情求头
import requests

user_agent =UserAgent()
headers = {
            'Cookie':'' ,
            'User-Agent': user_agent.random
            }

proxy ={
        'https':'218.241.219.226:9999'
        }

html = requests.get("https://www.xicidaili.com/", headers=headers, proxies=proxy)

print(html.text)

三、selenium

1.请求网页代码

from selenium import webdriver
chromeOptions = webdriver.ChromeOptions()
browser = webdriver.Chrome()
#设置代理ip
# chromeOptions.add_argument("--proxy-server=https://112.85.130.91:9999")
# 一定要注意，=两边不能有空格，不能是这样--proxy-server = http://202.20.16.82:10152
# browser = webdriver.Chrome(options = chromeOptions)
browser.get("https://httpbin.org/ip")
# cookies = driver.get_cookies()

print(browser.page_source)
#打印cookies
print(browser.get_cookies())

四、pyppeteer

1.请求网页代码

import pyppeteer
#浏览器地址
exepath = 'C:/Users/Administrator/AppData/Local/Google/Chrome/Application/chrome.exe'
browser = await launch({'executablePath': exepath, 'headless': True, 'slowMo': 30})
page = await browser.newPage()
await page.setViewport({'width': 1366, 'height': 768})
await page.goto('https://httpbin.org/ip')

print(await page.content())
#打印cookies
print(await page.cookies())