发布于 2016-04-03 03:14:32 | 386 次阅读 | 评论: 0 | 来源: 网友投递

这里有新鲜出炉的Scrapy 0.24 中文文档,程序狗速度看过来!

Scrapy Python的爬虫框架

Scrapy是一个Python开发的一个快速,高层次的屏幕抓取和web抓取框架,用于抓取web站点并从页面中提取结构化的数据。Scrapy用途广泛,可以用于数据挖掘、监测和自动化测试。


这篇文章主要介绍了讲解Python的Scrapy爬虫框架使用代理进行采集的方法,并介绍了随机使用预先设好的user-agent来进行爬取的用法,需要的朋友可以参考下

1.在Scrapy工程下新建“middlewares.py”


# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
 # overwrite process request
 def process_request(self, request, spider):
  # Set the location of the proxy
  request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

  # Use the following lines if your proxy requires authentication
  proxy_user_pass = "USERNAME:PASSWORD"
  # setup basic authentication for the proxy
  encoded_user_pass = base64.encodestring(proxy_user_pass)
  request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2.在项目配置文件里(./project_name/settings.py)添加


DOWNLOADER_MIDDLEWARES = {
 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
 'project_name.middlewares.ProxyMiddleware': 100,
}

只要两步,现在请求就是通过代理的了。测试一下^_^


from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class TestSpider(CrawlSpider):
 name = "test"
 domain_name = "whatismyip.com"
 # The following url is subject to change, you can get the last updated one from here :
 # http://www.whatismyip.com/faq/automation.asp
 start_urls = ["http://xujian.info"]

 def parse(self, response):
  open('test.html', 'wb').write(response.body)

3.使用随机user-agent

默认情况下scrapy采集时只能使用一种user-agent,这样容易被网站屏蔽,下面的代码可以从预先定义的user- agent的列表中随机选择一个来采集不同的页面

在settings.py中添加以下代码


DOWNLOADER_MIDDLEWARES = {
  'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
  'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware' :400
 }

注意: Crawler; 是你项目的名字 ,通过它是一个目录的名称 下面是蜘蛛的代码


#!/usr/bin/python
#-*-coding:utf-8-*-

import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
 def __init__(self, user_agent=''):
  self.user_agent = user_agent

 def process_request(self, request, spider):
  #这句话用于随机选择user-agent
  ua = random.choice(self.user_agent_list)
  if ua:
   request.headers.setdefault('User-Agent', ua)

 #the default user_agent_list composes chrome,I E,firefox,Mozilla,opera,netscape
 #for more user agent strings,you can find it in http://www.useragentstring.com/pages/useragentstring.php
 user_agent_list = [\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"\
  "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",\
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",\
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",\
  "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",\
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",\
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
  ]


最新网友评论  共有(0)条评论 发布评论 返回顶部

Copyright © 2007-2017 PHPERZ.COM All Rights Reserved   冀ICP备14009818号  版权声明  广告服务