I've recently been studying the coroutine part of Python web scraping, so I picked a novel site to practice on.
The script first scrapes all the chapter URLs synchronously, then downloads and saves the chapter text asynchronously, one .txt file per chapter. The Python environment is Python 3.9.
Libraries used: requests, BeautifulSoup, asyncio, aiohttp and aiofiles; install them before running the source.
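For reference, the third-party packages can be installed with pip (asyncio ships with the Python 3.9 standard library, and BeautifulSoup is published as beautifulsoup4):

pip install requests beautifulsoup4 aiohttp aiofiles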
The problem I ran into: the novel has 1400+ chapters, but each run only manages to download around 1200+ of them, presumably because of timeouts. Here is the code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import asyncio
import os

import aiohttp
import aiofiles

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    "Referer": "http://www.biqugse.com",
    "Accept-Encoding": "gzip, deflate",
}

async def get_content(xs_url, name):
    # Fetch a single chapter page and append its text to novel/<name>.txt.
    async with aiohttp.ClientSession() as session:
        async with session.get(xs_url, headers=headers) as resp:
            response = await resp.content.read()
    soup = BeautifulSoup(response, "html.parser")
    content = soup.find("div", id="content")
    # Collapse runs of whitespace and rejoin the paragraphs with line breaks.
    data = "\r\n".join(content.text.split())
    async with aiofiles.open(f"novel/{name}.txt", mode="a", encoding="utf-8") as f:
        await f.write(data)
    print(f"Downloaded {name}")

async def main(url):
    # Fetch the chapter list synchronously; the first 9 <dd> entries are the
    # "latest chapters" block at the top of the page, so skip them.
    resp = requests.get(url)
    dd_list = BeautifulSoup(resp.text, "html.parser").find("div", attrs={"id": "list"}).find_all("dd")[9:]
    tasks = []
    print("Creating async tasks")
    for dd in dd_list:
        # urljoin resolves the chapter href against the book URL
        # (str.rstrip strips a set of characters, not a suffix, so it is not used here).
        xs_url = urljoin(url, dd.find("a").get("href"))
        name = dd.find("a").text
        print(xs_url, name)
        tasks.append(asyncio.create_task(get_content(xs_url, name), name=name))
    await asyncio.wait(tasks)
    print("All async tasks finished")

if __name__ == "__main__":
    print("Starting")
    os.makedirs("novel", exist_ok=True)  # make sure the output directory exists
    main_url = "http://www.biqugse.com/25802/"
    asyncio.run(main(main_url))
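About the missing chapters: main() creates a task, and with it a brand-new ClientSession and TCP connection, for every one of the 1400+ chapters at once, and asyncio.wait does not re-raise task exceptions, so any chapter that hits a timeout or connection error simply disappears without a traceback. Below is a minimal sketch of one way to make this more robust, not the original code: a single shared session, an asyncio.Semaphore to cap how many requests are in flight, and a small retry loop. The concurrency limit of 50, the 30-second timeout and the 3 retries are assumptions, headers is the dict from the script above, and chapter_links is a hypothetical list of (url, name) pairs built from the <dd> entries exactly as in main() above.

import asyncio
import aiohttp
import aiofiles
from bs4 import BeautifulSoup

SEM_LIMIT = 50                              # assumption: at most 50 requests in flight
TIMEOUT = aiohttp.ClientTimeout(total=30)   # assumption: 30 s total per request
RETRIES = 3                                 # assumption: retry each chapter up to 3 times

async def get_content(session, sem, xs_url, name):
    # Shared session + semaphore + retries instead of one session per chapter.
    for attempt in range(1, RETRIES + 1):
        try:
            async with sem:
                async with session.get(xs_url, headers=headers) as resp:
                    response = await resp.content.read()
            break
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == RETRIES:
                print(f"Giving up on {name}")
                return
    soup = BeautifulSoup(response, "html.parser")
    data = "\r\n".join(soup.find("div", id="content").text.split())
    async with aiofiles.open(f"novel/{name}.txt", mode="a", encoding="utf-8") as f:
        await f.write(data)
    print(f"Downloaded {name}")

async def main(url):
    sem = asyncio.Semaphore(SEM_LIMIT)
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        # chapter_links: (xs_url, name) pairs built from the <dd> entries,
        # as in the main() above (hypothetical name, not in the original).
        tasks = [asyncio.create_task(get_content(session, sem, xs_url, name))
                 for xs_url, name in chapter_links]
        await asyncio.wait(tasks)

With the semaphore in place the run takes a little longer, but every chapter gets a bounded number of attempts instead of failing silently, which should close the gap between 1400+ chapters and the ~1200 that currently come back.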