Powered by GitBook

正则的使用

一、需要系统的学习正则表达式可以参考

二、在`python`爬虫中需要掌握的正则有

1、元字符
- 1..:除了\n以外的任意字符
- 2.*:出现0到多次
- 3.?:出现0或者1次
- 4.+:表示出现1到多次
2、常用的方法
- 1.compile:表示生成正则表达式参考地址
- 2.findall:查找全部注意返回的是一个列表参考地址

三、使用正则抓取唐诗

import re
import requests

class GuShiWen(object):
    def __init__(self):
        self.url = 'https://www.gushiwen.org/default_2.aspx'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
        }

    def get_html(self):
        """
        抓取古诗文第一页内容
        :return:
        """
        response = requests.get(url=self.url, headers=self.headers)
        if response.status_code == 200:
            gusiwen_list = []
            params = re.compile('.*?(<div class="sons".*?)<div style="width:1px; height:1px; overflow:hidden;">', re.S)
            article_list = params.findall(response.text)
            for article in article_list:
                gusiwen_dict = {}
                title = re.compile('.*?<b>(.*)</b>', re.S).findall(article)[0]
                content = re.compile('.*?<div class="contson".*?>(.*?)</div>', re.S).findall(article)[0].strip().replace('<br />', '')
                gusiwen_dict['title'] = title
                gusiwen_dict['content'] = content
                gusiwen_list.append(gusiwen_dict)
            print(gusiwen_list)
            return
        print('请求错误')

if __name__ == "__main__":
    gusiwen = GuShiWen()
    gusiwen.get_html()

四、在爬虫中使用正则获取数据

1、基本上是使用findall方法
2、主要是网页多行字符要使用re.S
3、如果正则比较复杂的时候使用re.compile()对正则包装下

results matching ""

No results matching ""