反扒措施 | 城北

爬虫 Python

爬虫

发布日期: 2019-06-14

文章字数: 1.2k

阅读时长: 4 分

阅读次数:

爬虫常见反扒措施

在互联网上爬取数据的过程中难免出现ip被封或者服务器返回403等等，这可能是你被网站检测为爬虫而采取的反爬措施，本文主要总结了一些常见的情况及规避的措施。

构造合理的请求头

HTTP 的请求头是在你每次向网络服务器发送请求时，传递的一组属性和配置信息。HTTP 定义了十几种古怪的请求头类型，不过大多数都不常用。
较为常用的是 Cookie User-Agent。尤其是User-Agent，这个没有大多数网站是请求不到的，Cookie中包含了你的登录信息。

from fake_useragent import UserAgent
headers = {
            'Cookie':'' ,
            'User-Agent': user_agent.random
            }

另外，请求头还可能让网站改变内容的布局样式。例如，用移动设备浏览网站时，通常会看到一个没有广告、Flash以及其他干扰的简化的网站版本。因此，把你的请求头User-Agent改成下面这样，也许就可以看到一个更容易采集的网站了。

User-Agent:Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) App leWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53

使用代理

使用代理也是一种较为常见的方法，使用方法较为简单，只是使用前一定有确定ip可用。

    import requests
    proxies = {
        "http": "XXX.XX.XX.XX:XX",
    }
    print requests.get('http://www.ip181.com/',proxies=proxies).content

http://www.xdaili.cn/monitor 是一个可以检测代理IP的网站，它可以检测出你使用的ip是什么，正好可以用来检验自己是否使用代理ip成功。如果你有多个代理IP，那么可以建立一个IP池，每次随机从IP池中选择一个进行访问，如果访问失败了，则将其从池中去除即可。

图形验证码的识别

直接将图片识别

 import tesserocr
 from PIL import Image

 image = Image.open('code2.jpg')
 result = tesserocr.image_to_text(image)
 print(result)

将图片进行简单处理后识别，正确率较高

 import tesserocr
 from PIL import Image

 image = Image.open('code2.jpg')
 #向量化处理图片  方便识别
 image = image.convert('L')
 #调整阈值
 threshold = 127
 table = []
 for i in range(256):
     if i < threshold:
         table.append(0)
     else:
         table.append(1)

 image = image.point(table, '1')
 image.show()

 result = tesserocr.image_to_text(image)
 print(result)

示例操作

 import tesserocr
 from PIL import Image

 def get_image():
     #将网页截屏
     browser.save_screenshot('jietu.png')
     #找到验证码位置
     img=browser.find_element_by_id("chkimg")
     #获取验证码x,y轴坐标
     location = img.location
 #     print(location)
     #获取验证码的长宽
     size =img.size
     #写成我们需要截取的位置坐标
     left = location['x']
     top = location['y']
     right = left + size['width']
     bottom = top + size['height']
     #打开截图
     i=Image.open("jietu.png")
     #使用Image的crop函数，从截图中再次截取我们需要的区域
     image_obj =i.crop((left, top, right, bottom))

     image_obj.save('yzm.png')
     image=Image.open('yzm.png')

     #向量化处理图片  方便识别
     image=image.convert("L")
     #调整阈值
     threshold=80
     table=[]
     for i in range(256):
         if i <threshold:
             table.append(0)
         else:
             table.append(1)
     image=image.point(table,'1')
 #     image.show()  将图片显示出来
     #识别
     yzm=tesserocr.image_to_text(image)
     yzm=yzm.strip() #剔除空格
     #清空验证码输入框
     browser.find_element_by_id("verifyCode").clear()
     #将验证码填入
     browser.find_element_by_id("verifyCode").send_keys(yzm)

模拟滑块

下面的代码仅使用于需要拖到底的验证码滑块。

 swipe_button = self.browser.find_element_by_id('nc_1_n1z') #获取滑动拖动控件
 #模拟拽托

 action = ActionChains(self.browser) # 实例化一个action对象  
 action.click_and_hold(swipe_button).perform() # perform()用来执行ActionChains中存储的行为
 action.reset_actions()
 action.move_by_offset(580, 0).perform() # 移动滑块