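I have been reading jandan.net (煎蛋网) since my second year of high school. In my first year of university I wrote a small C++ program to download images from its 无聊图 ("boring pictures") board: "Downloading jandan.net images" (下载煎蛋网图片)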

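A while ago, after I answered a question on Zhihu, someone asked me how to scrape images from jandan.net. My old code was too ugly and not cross-platform, so I won't post it here. Instead I spent two hours learning some Python and wrote a script, and the result is short and clean.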

The tricky part of scraping jandan.net images is the GIFs. Let's first look at what a GIF image element on jandan.net looks like:

<img src="http://***/thumbnail/***.gif" org_src="http://***/mw1024/***.gif" onload="add_img_loading_mask(this, load_sina_gif);" style="max-width: 486px; max-height: 450px;">

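As we can see, for a GIF the src attribute holds a link containing the string thumbnail, and it is followed by an org_src attribute whose value is the same link with thumbnail replaced by mw1024. In other words, org_src is the real GIF link. So after fetching the page source, we only need to delete the src link that contains thumbnail. The result looks like this: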

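Now let's look at the code. After fetching the page source, it uses regular expressions to extract the strings we need and then downloads the images.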
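
Before the full script, the GIF handling described above can be tried on its own. The following is a minimal sketch, not the script itself: the sample markup and the `example.com` host are hypothetical stand-ins for the masked URLs in the element shown earlier, and the helper names `strip_gif_thumbnail` and `find_image_urls` are made up for illustration. It uses `[^"]*` in the patterns so that a match cannot run past the end of one attribute value, and it relies only on the `re` module, so it should run under Python 2 or Python 3.

```python
import re

# Hypothetical sample markup modelled on the element shown in the post.
SAMPLE = ('<img src="http://example.com/thumbnail/a.gif" '
          'org_src="http://example.com/mw1024/a.gif" '
          'onload="add_img_loading_mask(this, load_sina_gif);">')

def strip_gif_thumbnail(source):
    """Remove the `<img src="...gif"` prefix so that only org_src remains."""
    # [^"]* keeps the match inside a single attribute value.
    return re.sub(r'<img src="[^"]*\.gif"', '', source)

def find_image_urls(source):
    """Return every URL that follows `src="` and ends in .jpg, .png or .gif."""
    pairs = re.findall(r'src="([^"]*?)(\.jpg|\.png|\.gif)', source)
    return [base + ext for base, ext in pairs]

filtered = strip_gif_thumbnail(SAMPLE)
urls = find_image_urls(filtered)
print(urls)
```

Because `src="` also matches the tail of `org_src="`, the same URL pattern picks up the full-size link once the thumbnail `src` is gone. Without the filtering step, both the thumbnail and the full-size link would be returned.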

'''

    author  : haipz
    site    : haipz.com
    email   : i@haipz.com

'''

import urllib2, os, re

def getHtml(url) :
    page = urllib2.urlopen(url)
    html = page.read()
    return html

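def filterComment(source) :
    # Keep only the part of the page between the comment markers.
    pattern = ur'begin comments([\s\S]*?)end comments'
    matchs = re.search(pattern, source)
    # Return an empty string when the markers are missing instead of crashing.
    return matchs.group() if matchs else ''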

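def filterThumbnail(source) :
    # Drop the thumbnail src of each GIF so that only org_src remains.
    # [^"]* keeps the match inside a single src value, so a .jpg tag that
    # precedes a GIF in the same comment is not swallowed.
    pattern = ur'<img src="[^"]*\.gif"'
    reobj = re.compile(pattern)
    result, number = reobj.subn('', source)
    return result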

def downloadPicture(picurl, picpath, picname) :
    pic = urllib2.urlopen(picurl, timeout = 60)
    # The with block closes the file even if the read or write fails.
    with open(picpath + picname, 'wb') as f :
        f.write(pic.read())
    print 'System: ' + picname + ' saved\n'

# raw_input returns the typed text as-is; Python 2's input() would eval it.
choice = int(raw_input("0. ooxx 1. pic : "))
pagestart = int(raw_input("page start: "))
pageend = int(raw_input("page end: "))

if choice == 0 :
    dirname = "ooxx"
else :
    dirname = "pic"

path = os.getcwd() + "/" + dirname
isExists = os.path.exists(path)
if not isExists :
    print 'System: ' + path + " created"
    os.makedirs(path)
else :
    print 'System: ' + path + " exists"

initurl = "http://jandan.net/" + dirname + "/"

# Note: range() stops before pageend, so the last page fetched is pageend - 1.
for pagenum in range(pagestart, pageend) :
    cururl = initurl + "page-" + str(pagenum)
    print 'Current url: ' + cururl
    inithtml = getHtml(cururl)
    curhtml = filterComment(inithtml)

    pattern = ur'<li id="comment-([\s\S]*?)</li>'
    reobj = re.compile(pattern)
    matchs = reobj.findall(curhtml)
    count0 = 0
    for match in matchs :
        count0 = count0 + 1
        match = filterThumbnail(match)

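        oopattern = ur'(?:<span id="cos_support-)(?:\d*?)(?:">)(\d*?)(?:</span>)'
        xxpattern = ur'(?:<span id="cos_unsupport-)(?:\d*?)(?:">)(\d*?)(?:</span>)'
        oomatch = re.search(oopattern, match)
        xxmatch = re.search(xxpattern, match)
        # Skip entries without vote counters instead of crashing on None.
        if not oomatch or not xxmatch :
            continue
        oo = oomatch.group(1)
        xx = xxmatch.group(1)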

        # Escape the dots so that they match a literal "." before the extension.
        picpattern = ur'(?:src=")([\s\S]*?)(\.jpg|\.png|\.gif)'
        picobj = re.compile(picpattern)
        result = picobj.findall(match)

        count1 = 0
        for pic in result :
            count1 = count1 + 1
            picurl = pic[0] + pic[1]
            picpath = path + '/'
            picname = str(pagenum) + '_oo' + oo + '_xx' + xx + '_' + str(count0) + '_' + str(count1) + pic[1]

            print 'Information:'
            print 'Picture url: ' + picurl
            print 'Picture path: ' + picpath
            print 'Picture name: ' + picname

            try :
                downloadPicture(picurl, picpath, picname)
            except Exception as e :
                print(e)
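
The per-comment parsing in the loop above can be exercised without any network access. The sketch below is an illustration under stated assumptions rather than a replacement for the script: the `SAMPLE_PAGE` fragment, the `example.com` URLs, the page number 900, and the function name `plan_downloads` are all hypothetical, and the markup is only shaped to fit the patterns the script uses. It returns the (URL, file name) pairs that would be downloaded, following the same naming scheme of page number, oo count, xx count, comment index, and picture index. It uses only the `re` module and should run under Python 2 or Python 3.

```python
import re

# Hypothetical fragment shaped like the markup the script's patterns expect.
SAMPLE_PAGE = (
    '<li id="comment-101"><img src="http://example.com/large/p.jpg" />'
    '<span id="cos_support-101">12</span>'
    '<span id="cos_unsupport-101">3</span></li>'
    '<li id="comment-102"><img src="http://example.com/thumbnail/q.gif" '
    'org_src="http://example.com/mw1024/q.gif" />'
    '<span id="cos_support-102">7</span>'
    '<span id="cos_unsupport-102">0</span></li>'
)

ENTRY = re.compile(r'<li id="comment-([\s\S]*?)</li>')
THUMB = re.compile(r'<img src="[^"]*\.gif"')
OO = re.compile(r'<span id="cos_support-\d*">(\d*)</span>')
XX = re.compile(r'<span id="cos_unsupport-\d*">(\d*)</span>')
PIC = re.compile(r'src="([^"]*?)(\.jpg|\.png|\.gif)')

def plan_downloads(html, pagenum):
    """Return (url, filename) pairs for every image in every comment entry."""
    plan = []
    for entry_index, entry in enumerate(ENTRY.findall(html), 1):
        # Remove GIF thumbnails so that only the org_src link is picked up.
        entry = THUMB.sub('', entry)
        oo, xx = OO.search(entry), XX.search(entry)
        if not oo or not xx:
            continue  # entry without vote counters
        for pic_index, (base, ext) in enumerate(PIC.findall(entry), 1):
            name = '%d_oo%s_xx%s_%d_%d%s' % (
                pagenum, oo.group(1), xx.group(1), entry_index, pic_index, ext)
            plan.append((base + ext, name))
    return plan

plan = plan_downloads(SAMPLE_PAGE, 900)
for url, name in plan:
    print(url + ' -> ' + name)
```

Separating the parsing from the downloading like this makes it possible to check the regular expressions against a saved copy of a page before fetching anything.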

Finally, here is the pyc file. To download it, right-click the link and choose "Save link as":

http://haipz.qiniudn.com/20150124ooxx.pyc

Please keep this copyright notice when reposting: http://haipz.com/blog/i/6406 - 海胖博客