A Minimalist End-to-End Scrapy Tutorial (Part II) [Translation]

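In Part I, you learned how to set up a Scrapy project and write a basic spider that extracts pages by following the pagination links. However, the extracted data was only printed to the console. In Part II, I will introduce the concepts of Item and ItemLoader and explain why you should use them to store the extracted data.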

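Let's first take a look at the Scrapy architecture:

As steps 7 and 8 of the architecture show, Scrapy is designed around the concept of an Item: the spider parses the extracted data into Items, and the Items then pass through Item Pipelines for further processing. Here are the key reasons I see for using Items: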

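Scrapy is designed around Items and expects Items as the output of a spider. As you will see in Part IV, when you deploy the project to ScrapingHub or a similar service, there is a default UI for browsing Items and related statistics.
Items clearly define the common output data format in a separate file. This lets you quickly check what structured data you are collecting, and it raises an exception when you create inconsistent data by mistake, for example by misspelling a field name in your code. This happens more often than you might think.
You can add pre/post-processing to each Item field (via ItemLoader), such as trimming whitespace or removing special characters, and keep this processing code separate from the main spider logic so that your code stays structured and clean.
In Part III, you will learn how to add different item pipelines to do things like detecting duplicate items and saving items to a database.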

At the end of Part I, our spider yielded the following data:

yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
                'tags': quote.css('.tag::text').getall(),
            }

and

yield {
            'author_name': response.css('.author-title::text').get(),
            'author_birthday': response.css('.author-born-date::text').get(),
            'author_bornlocation': response.css('.author-born-location::text').get(),
            'author_bio': response.css('.author-description::text').get(),
        }

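You may notice that author and author_name are the same thing (one comes from the quote page, the other from the corresponding individual author page). So we actually extract six pieces of data: the quote text, tags, author name, birthday, birth location, and bio. Now, let's define the Item that holds this data.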

Open the auto-generated items.py file and update its content as follows:

from scrapy.item import Item, Field


class QuoteItem(Item):
    quote_content = Field()
    tags = Field()
    author_name = Field()
    author_birthday = Field()
    author_bornlocation = Field()
    author_bio = Field()

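We define just one Scrapy Item named "QuoteItem", with six fields to store the extracted data. If you have designed relational databases before, you may ask: should I have two Items, QuoteItem and AuthorItem, to better represent the data logically? You could, but it is not recommended in this case, because Scrapy returns Items asynchronously and you would need additional logic to match each quote item with its corresponding author item. Here it is much easier to put the related quote and author data into a single Item.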

Now you can put the extracted data into an Item in the spider file, like this:

from tutorial.items import QuoteItem
...
for quote in quotes:
    # create a fresh item for each quote so that quotes do not share one object
    quote_item = QuoteItem()
    quote_item['quote_content'] = quote.css('.text::text').get()
    quote_item['tags'] = quote.css('.tag::text').getall()

Or, preferably, use an ItemLoader as follows:

from scrapy.loader import ItemLoader
from tutorial.items import QuoteItem
...
        
for quote in quotes:
    loader = ItemLoader(item=QuoteItem(), selector=quote)
    loader.add_css('quote_content', '.text::text')
    loader.add_css('tags', '.tag::text')
    quote_item = loader.load_item()

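Hmm, the ItemLoader code looks more complicated, so why do it? The quick answer is that the raw data you get from a CSS selector may need further parsing. For example, the extracted quote_content is wrapped in the Unicode quotation marks “ and ”, which need to be removed.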

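The birthday is a string, such as "March 14, 1879", that needs to be parsed into a Python date.

The birth location, such as "in Ulm, Germany", carries an extra "in " prefix that needs to be removed.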
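
To make these three problems concrete, here is a small standard-library sketch of the raw values and what they should look like after cleaning. The quote text is a made-up placeholder; the birthday and location strings follow the format described above.

```python
from datetime import datetime

# Raw strings as they might come out of the CSS selectors.
raw_quote = "\u201cA sample quote.\u201d"   # placeholder text wrapped in curly quotes
raw_birthday = "March 14, 1879"
raw_location = "in Ulm, Germany"

# Remove the leading and trailing Unicode quotation marks.
clean_quote = raw_quote.strip("\u201c\u201d")

# Parse "Month day, year" into a datetime object.
birthday = datetime.strptime(raw_birthday, "%B %d, %Y")

# Drop the leading "in " prefix.
location = raw_location[len("in "):]

print(clean_quote)
print(birthday.date())
print(location)
```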

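An ItemLoader lets you specify this pre/post-processing neatly outside the spider code, and each Item field can have its own set of processing functions, which makes the code easier to reuse.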

For example, we can create a function that removes the Unicode quotation marks mentioned above, as follows:

from scrapy.item import Item, Field
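# On Scrapy 2.2+, MapCompose can also be imported from itemloaders.processors;
# scrapy.loader.processors is the older import path and is deprecated there.
from scrapy.loader.processors import MapCompose

def remove_quotes(text):
    # strip the unicode quotes
    text = text.strip('\u201c\u201d')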
    return text

class QuoteItem(Item):
    quote_content = Field(
        input_processor=MapCompose(remove_quotes)
        )
    tags = Field()
    author_name = Field()
    author_birthday = Field()
    author_bornlocation = Field()
    author_bio = Field()

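MapCompose lets us apply multiple processing functions to a field (only one in this example). An ItemLoader returns a list by default, for example ['death', 'life'] for tags. For the author name, the returned list holds a single value, such as ['Jimi Hendrix'], so the TakeFirst processor is used to take the first value of the list. After adding the other processors, items.py looks like: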

from scrapy.item import Item, Field
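# On Scrapy 2.2+, these can also be imported from itemloaders.processors;
# scrapy.loader.processors is the older import path and is deprecated there.
from scrapy.loader.processors import MapCompose, TakeFirst
from datetime import datetime


def remove_quotes(text):
    # strip the unicode quotes
    text = text.strip('\u201c\u201d')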
    return text


def convert_date(text):
    # convert string March 14, 1879 to Python date
    return datetime.strptime(text, '%B %d, %Y')


def parse_location(text):
    # parse location "in Ulm, Germany"
    # this simply removes the leading "in "; you can further parse city, state, country, etc.
    return text[3:]


class QuoteItem(Item):
    quote_content = Field(
        input_processor=MapCompose(remove_quotes),
        # TakeFirst returns the first value, not the whole list
        output_processor=TakeFirst()
        )
    author_name = Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst()
        )
    author_birthday = Field(
        input_processor=MapCompose(convert_date),
        output_processor=TakeFirst()
    )
    author_bornlocation = Field(
        input_processor=MapCompose(parse_location),
        output_processor=TakeFirst()
    )
    author_bio = Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst()
        )
    tags = Field()

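The key issue now is that we load two fields, quote_content and tags, from the quote page and then issue another request to fetch the corresponding author page, which provides author_name, author_birthday, author_bornlocation, and author_bio. To do this, we need to pass the quote_item from one page to the next as request metadata, as follows: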

yield response.follow(author_url, self.parse_author, meta={'quote_item': quote_item})

In the author parsing function, you can then retrieve the item:

def parse_author(self, response):
        quote_item = response.meta['quote_item']

Now, with Items and ItemLoaders added, our spider looks like this:

import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']


    def parse(self, response):
        self.logger.info('Parse function called on {}'.format(response.url))
        # quotes = response.xpath("//div[@class='quote']")
        quotes = response.css('div.quote')

        for quote in quotes:
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            # pay attention to the dot .// to use relative xpath
            # loader.add_xpath('quote_content', ".//span[@class='text']/text()")
            loader.add_css('quote_content', '.text::text')
            # loader.add_xpath('author', './/small//text()')
            loader.add_css('tags', '.tag::text')
            quote_item = loader.load_item()
            author_url = quote.css('.author + a::attr(href)').get()
            # go to the author page and pass the current collected quote info
            yield response.follow(author_url, self.parse_author, meta={'quote_item': quote_item})

        # go to Next page
        for a in response.css('li.next a'):
            yield response.follow(a, self.parse)

    def parse_author(self, response):
        quote_item = response.meta['quote_item']
        loader = ItemLoader(item=quote_item, response=response)
        loader.add_css('author_name', '.author-title::text')
        loader.add_css('author_birthday', '.author-born-date::text')
        loader.add_css('author_bornlocation', '.author-born-location::text')
        loader.add_css('author_bio', '.author-description::text')
        yield loader.load_item()

Run the spider from the console with scrapy crawl quotes, and you can find the total number of extracted items in the stats:

2019-09-11 09:49:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
'downloader/request_count': 111,
...
'item_scraped_count': 50,

Congratulations! You have finished Part II of this tutorial.

Reference

https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-ii-b917509b73f7
