Recently I wanted to check the domain registration status of five-character Chinese idioms (domains formed from the initials of the five characters, e.g. 民以食为本 → myswb.com). For that I first needed a list of five-character idioms, so I wrote a simple crawler and, while I was at it, scraped every idiom from 3 to 12 characters.
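
As an aside, turning an idiom into its candidate domain is just a matter of taking the first letter of each character's pinyin. A minimal sketch, assuming the third-party pypinyin package (not part of the crawler itself):

from pypinyin import lazy_pinyin    # pip install pypinyin (assumed helper, not used by the crawler)

def idiom_to_domain(idiom, tld='.com'):
    # first letter of each character's pinyin, e.g. 民以食为本 -> myswb.com
    initials = ''.join(syllable[0] for syllable in lazy_pinyin(idiom))
    return initials + tld

print(idiom_to_domain('民以食为本'))    # myswb.com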

1. Overview

Fortunately, the 911cha site lists idioms of 3 to 12 characters (probably not exhaustive), for example the five-character idioms. I have shared the crawled results on GitHub, here, or you can browse them via the links below.

As an example, here are some of the five-character idioms; clicking a link shows the explanation (crawling the explanations as well would not be hard):

2. Crawling

The approach is much the same as in the earlier post 《第一个爬虫程序:建立联系方式表格》 (my first crawler: building a contact table). The complete source code is shared on GitHub, here.

2.1 Collecting the URLs

The idioms for a given character count are spread over several pages, so the first step is to collect the links to those pages. Inspecting the page with Chrome's Inspect element, the navigation HTML looks like this:

<div class="gclear pp bt center f14"><span class="gray">首页</span> <a href="zishu_5_p4.html">末页</a> <span class="gray">|</span> <a href="zishu_5.html" class="red noline">1</a> <a href="zishu_5_p2.html">2</a> <a href="zishu_5_p3.html">3</a> <a href="zishu_5_p4.html">4</a> <span class="gray">|</span> <span class="gray">上一页</span> <a href="zishu_5_p2.html">下一页</a></div>

With that, collecting the page links is straightforward; the code is as follows:

## Step 1: get all page urls
urls_set = set()
url = format_url.format(word_counts=word_counts)
parsed_uri = urlparse(url)
base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

try:
    response = requests.get(url)
except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
    continue    # skip this word count if the request fails (this block sits inside the per-word-count loop)

soup = BeautifulSoup(response.text)
for anchor in soup.find_all('div', {'class' : 'gclear pp bt center f14'}) : # navigation bar
    for item in anchor.find_all('a') :
        page_url = urljoin(base_url, item.attrs['href'])
        urls_set.add(page_url)
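
Note that the hrefs in the navigation bar are relative, so urljoin resolves them against the site root, for example:

from urllib.parse import urljoin

print(urljoin('http://chengyu.911cha.com/', 'zishu_5_p2.html'))
# http://chengyu.911cha.com/zishu_5_p2.html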

2.2 Scraping the idioms

With the less important code stripped out, it looks like this:

def crawler_chinese_idiom(self, url):
    idioms = list()

    parsed_uri = urlparse(url)
    base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    response = requests.get(url)
    soup = BeautifulSoup(response.text)

    #for result_set in soup.find_all("ul", {"class", re.compile(r"l[345]\s+center")}): # buggy
    for result_set in soup.find_all("ul", {"class" : ['l3', 'l4', 'l5', 'center']}): # idiom lists carry classes like "l4 center"
        for idiom in result_set.find_all('li') :
            sub_url = idiom.find_all('a')[0].attrs['href']
            idiom_url = urljoin(base_url, sub_url)
            t = (idiom.get_text(), idiom_url)
            idioms.append(t)

    return idioms
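
Each entry in the returned list is an (idiom text, detail URL) tuple. A quick usage sketch with the class from the full source below (the exact idioms and URLs depend on the site):

crawler = Crawler_Chinese_Idioms()
idioms = crawler.crawler_chinese_idiom('http://chengyu.911cha.com/zishu_5.html')
for text, detail_url in idioms[:3]:
    print(text, detail_url)    # prints an idiom and the URL of its explanation page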

2.3 Complete source code


#!/usr/bin/env python3

# this program is designed to crawl Chinese idioms between 3 and 12 characters
# By SparkandShine,  sparkandshine.net
# July 21st, 2015

from bs4 import BeautifulSoup
import bs4
import requests
import requests.exceptions
import re
#from urlparse import urlparse  python2.x
from urllib.parse import urlparse
from urllib.parse import urljoin

import os

class Crawler_Chinese_Idioms :
    def __init__(self):
        pass

    ### function output ###
    def format_output(self, filename, chinese_idioms):
        fp = open(filename, 'w', encoding='utf-8')    # write UTF-8 regardless of the locale


        for item in chinese_idioms :
            s = '\t'.join(item)
            fp.write(s + '\n')
            #print(s)

        fp.close()


    ### function crawler chinese idioms, word counts [3, 12]###
    def crawler_chinese_idioms(self):
        out_dir = 'dataset_chinese_idioms/'
        format_filename = 'chinese_idioms_{word_counts}.dat'

        if not os.path.exists(out_dir):
            os.makedirs(out_dir)

        format_url = 'http://chengyu.911cha.com/zishu_{word_counts}.html'


        for word_counts in range(3, 13) :
        #for word_counts in [8] :
            chinese_idioms = set()

            ## Step 1: get all page urls
            urls_set = set()
            url = format_url.format(word_counts=word_counts)
            parsed_uri = urlparse(url)
            base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            try:
                response = requests.get(url)
            except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
                continue    # skip this word count if the request fails

            soup = BeautifulSoup(response.text)
            for anchor in soup.find_all('div', {'class' : 'gclear pp bt center f14'}) : #navigate pages
                for item in anchor.find_all('a') :
                    page_url = urljoin(base_url, item.attrs['href'])
                    urls_set.add(page_url)
                #print(urls_set)

            ## Step 2: crawler chinese idioms
            for url in urls_set :
                idioms = self.crawler_chinese_idiom(url)
                chinese_idioms.update(idioms)

            ## Step 3: write to file
            filename = out_dir + format_filename.format(word_counts=word_counts)
            self.format_output(filename, chinese_idioms)


    ### function, crawler chinese idioms from a given url ###
    def crawler_chinese_idiom(self, url):
        idioms = list()

        parsed_uri = urlparse(url)
        base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        #print('base_url', base_url)

        try:
            response = requests.get(url)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            # ignore pages with errors
            return idioms

        soup = BeautifulSoup(response.text)

        ## !!! there might be a bug !!!
        #for result_set in soup.find_all("ul", {"class", re.compile(r"l[45]\s+center")}): #l4 center or l5 center
        #for result_set in soup.find_all("ul", {"class" :  "l4 center"}):
        for result_set in soup.find_all("ul", {"class" :  ['l3', 'l4', 'l5', 'center']}):
        #for result_set in soup.find_all("ul", {"class" :  ["l4 center", "l5 center"]}):
            for idiom in result_set.find_all('li') :
                sub_url = idiom.find_all('a')[0].attrs['href']
                idiom_url = urljoin(base_url, sub_url)

                t = (idiom.get_text(), idiom_url)
                #print(t)
                idioms.append(t)

        return idioms

### END OF CLASS ###

def main():
    crawler = Crawler_Chinese_Idioms()
    crawler.crawler_chinese_idioms()


if __name__ == '__main__':
    main()
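
Running the script produces one file per character count under dataset_chinese_idioms/, with one tab-separated idiom and URL per line. A quick sketch for reading one of these files back (UTF-8 assumed, matching the writer above):

# load the five-character idioms written by the crawler
with open('dataset_chinese_idioms/chinese_idioms_5.dat', encoding='utf-8') as fp:
    idioms = [line.rstrip('\n').split('\t') for line in fp if line.strip()]

for text, url in idioms[:5]:
    print(text, url)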

3. Getting a more complete list of idioms

The idioms crawled this way are surely incomplete. Another idea I have: crawl a reasonable number of pages (e.g. idiom lookup sites), run Chinese word segmentation on the text, and filter the idioms out of the HTML.
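
A rough sketch of that idea, assuming the jieba package and its part-of-speech tagger (the tag 'i' marks idioms, as far as I can tell; how well this catches rare idioms is untested):

# a rough sketch, assuming the jieba package (pip install jieba)
import jieba.posseg as pseg
import requests
from bs4 import BeautifulSoup

def extract_idioms_from_page(url):
    # fetch a page, strip the HTML, segment the text, and keep tokens tagged as idioms ('i')
    text = BeautifulSoup(requests.get(url).text, 'html.parser').get_text()
    return {p.word for p in pseg.cut(text) if p.flag == 'i'}

The resulting set would still need to be filtered by character count (3 to 12) and merged with the list crawled above.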

This article is original content by Spark & Shine; please credit the source when reposting. Last modified: 2022-03-18 19:54.
