Scraping the Youdict dictionary (youdict.com) with Python


Abstract:

Use Python to scrape the Youdict dictionary and build a word index.

Prerequisites:

1. Basic Python
2. Networking fundamentals
3. How web crawlers work
4. The requests module
5. The BeautifulSoup module
6. Basic database knowledge
7. The pymysql module


I won't go over installing Python here, or pip-installing requests, pymysql, and beautifulsoup4 (the package that provides BeautifulSoup); there are plenty of tutorials online (search-engine-driven programming is fine at this stage).



With those prerequisites in place, we can start writing the crawler.


1. Identify the target: target site http://www.youdict.com/ciku/


Target elements: the word (both English text and Chinese definition), the word's link, and the image link.
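Before writing the crawler itself, it helps to see how these elements map onto the page markup. The HTML fragment below is a hypothetical stand-in that matches the selectors the crawler code uses (`.col-sm-6` cards with a nested img, h3/a, and p); the live page may differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical word card matching the structure the crawler navigates.
sample = '''
<div class="col-sm-6">
  <div>
    <img src="/upload/word/hello.jpg">
    <div><h3><a href="/w/hello">hello</a></h3></div>
    <p>int. 你好</p>
  </div>
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')
card = soup.select(".col-sm-6")[0]
english = card.div.div.h3.a.text   # the word itself
imgurl = card.div.img['src']       # relative image link
chinese = card.div.p.text          # Chinese definition
print(english, imgurl, chinese)
```

The same attribute-style navigation (`.div.div.h3.a`) is what the full crawler relies on below.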



2. Write the code that fetches a page and extracts the elements:


newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_0.html'
res = requests.get(newsurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
divs = soup.select(".col-sm-6")    # one div per word card
for each_div in divs:
    english = each_div.div.div.h3.a.text        # the word itself
    imgurl = transurl(each_div.div.img['src'])  # absolute image link
    chinese = each_div.div.p.text               # Chinese definition
    insert(english, chinese, imgurl)


3. Build the URL from the site's pagination rule:

newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'

where i is set by the loop.
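To check the pattern, here is what the first few page URLs look like when i runs from 0; the `id_5_0_0_0_<i>.html` segment comes straight from the site's pagination scheme.

```python
# Generate the first three page URLs from the pagination pattern.
base = 'http://www.youdict.com/ciku/id_5_0_0_0_{}.html'
urls = [base.format(i) for i in range(3)]
print(urls)
```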




4. Connect to the database:

def insert(english, chinese, imgurl):
    # replace the placeholders with your own credentials
    db = pymysql.connect("localhost", "root", "your db password", "your db name")
    cursor = db.cursor()
    # escape the values before splicing them into the SQL string
    english = pymysql.escape_string(english)
    chinese = pymysql.escape_string(chinese)
    imgurl = pymysql.escape_string(imgurl)
    sql = ("insert into reaserchwords(english,chinese,imgurl) "
           "values('" + english + "','" + chinese + "','" + imgurl + "')")
    cursor.execute(sql)
    db.commit()
    db.close()




5. Put it all together into the complete crawler:

# coding=utf-8
'''
Created on 2018.8.18
@author: ZEC---
'''

import requests
import pymysql
from bs4 import BeautifulSoup


def insert(english, chinese, imgurl):
    # replace the placeholders with your own credentials
    db = pymysql.connect("localhost", "root", "your db password", "your db name")
    cursor = db.cursor()
    # escape the values before splicing them into the SQL string
    english = pymysql.escape_string(english)
    chinese = pymysql.escape_string(chinese)
    imgurl = pymysql.escape_string(imgurl)
    sql = ("insert into reaserchwords(english,chinese,imgurl) "
           "values('" + english + "','" + chinese + "','" + imgurl + "')")
    cursor.execute(sql)
    db.commit()
    db.close()

def transurl(url):
    # turn the relative image path into an absolute URL
    url = "http://www.youdict.com" + url
    return url.strip('\n')

def main_thread(start, end):
    i = start
    while i < end:
        newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'
        res = requests.get(newsurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        divs = soup.select(".col-sm-6")    # one div per word card
        for each_div in divs:
            english = each_div.div.div.h3.a.text
            imgurl = transurl(each_div.div.img['src'])
            chinese = each_div.div.p.text
            insert(english, chinese, imgurl)
        print("page " + str(i + 1) + " is ok")
        i = i + 1

main_thread(67, 274)
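One small refinement worth noting: transurl builds the absolute URL by hand-concatenating strings. The stdlib function urllib.parse.urljoin does the same job more robustly, correctly handling absolute paths and links that are already absolute URLs.

```python
from urllib.parse import urljoin

# Resolve a site-relative image path against a page URL.
base = 'http://www.youdict.com/ciku/id_5_0_0_0_0.html'
full = urljoin(base, '/upload/word/hello.jpg')
print(full)
```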



The word-search page I built on top of this data is at:

www.senlear.com/words