How to Scrape Zhihu Q&A Content with a Python Crawler

周琼, 2023-09-12, Programming Experience

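Part 1: Overview of Scraping Zhihu Pages

To collect Zhihu question-and-answer content, we need a web crawler. Python has a rich set of scraping tools, including requests, BeautifulSoup, and Selenium. requests fetches the page content, BeautifulSoup parses the HTML, and Selenium simulates user actions in a browser. Here we use requests and BeautifulSoup. The code below fetches and parses a Zhihu question page: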

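import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/question/37787176'
# Assumption: a browser-like User-Agent; the site may still require login cookies
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()  # stop early on HTTP errors such as 403

# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(r.content, 'html.parser')

print(soup)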
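The fetch-and-parse steps recur in every part of this article, so it may help to wrap them in one helper. The following is a minimal sketch, not the author's method: `fetch_soup` and `DEFAULT_HEADERS` are hypothetical names, and the User-Agent value is an assumption about what the site will accept.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical default headers; the exact value is an assumption.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_soup(url, session=None, timeout=10):
    """Download url and return the page parsed as a BeautifulSoup tree."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Passing one `requests.Session` to several calls reuses the underlying connection, and accepting the session as a parameter makes the helper easy to exercise without touching the network.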

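Part 2: Finding the Question Title and Description

To get the title and description of a Zhihu question, inspect the page's HTML source and find the corresponding tags. On the question page, the title is in an h1 tag and the description is in a div. As the code below shows, BeautifulSoup's find method retrieves each tag: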

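import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/question/37787176'
# Assumption: a browser-like User-Agent; the site may still require login cookies
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()
soup = BeautifulSoup(r.content, 'html.parser')

# find() returns None when no tag matches, so guard before reading the text
title_tag = soup.find('h1', {'class': 'QuestionHeader-title'})
detail_tag = soup.find('div', {'class': 'QuestionHeader-detail'})
title = title_tag.text if title_tag else ''
description = detail_tag.text if detail_tag else ''

print('Question title:', title)
print('Question description:', description)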
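Because find returns None when a tag is absent, a small wrapper can keep the lookups from raising AttributeError. The sketch below runs on an inline HTML fragment rather than a live page. The fragment is made up for illustration (real Zhihu markup may differ and changes over time), and `text_or_default` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only so the example runs offline.
SAMPLE_HTML = """
<div class="QuestionHeader">
  <h1 class="QuestionHeader-title">Sample question?</h1>
  <div class="QuestionHeader-detail"> Some background. </div>
</div>
"""


def text_or_default(soup, name, css_class, default=""):
    """Return the stripped text of the first matching tag, or default."""
    tag = soup.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = text_or_default(soup, "h1", "QuestionHeader-title")
description = text_or_default(soup, "div", "QuestionHeader-detail")
# No tag carries this class, so the default is returned instead of an error.
missing = text_or_default(soup, "div", "No-such-class", default="(none)")

print(title, description, missing, sep=" | ")
```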

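Part 3: Getting the Answer Content and Upvote Count

Likewise, inspect the HTML source of the answer page and find the corresponding tags. There, the answer content is in a div and the upvote count is on a button. As the code below shows, BeautifulSoup's find_all method retrieves multiple tags, and a regular expression extracts the upvote count: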

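import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/question/37787176/answer/396803853'
# Assumption: a browser-like User-Agent; the site may still require login cookies
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()
soup = BeautifulSoup(r.content, 'html.parser')

content_tag = soup.find('div', {'class': 'RichContent-inner'})
content = content_tag.text if content_tag else ''

buttons = soup.find_all('button', {'class': 'Button VoteButton VoteButton--up'})

# Take the first run of digits from the first button's aria-label, if present
votes = None
if buttons:
    match = re.search(r'\d+', buttons[0].get('aria-label', ''))
    if match:
        votes = int(match.group())

print('Answer content:', content)
print('Upvotes:', votes)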
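The regular-expression step can be isolated and checked without fetching a page. This is a minimal sketch under assumptions: `parse_vote_count` is a hypothetical helper, and the label strings in the usage lines are made-up examples, since the exact wording of the aria-label on the live site is not confirmed here.

```python
import re

# First run of digits, optionally containing thousands separators.
_DIGITS = re.compile(r"\d[\d,]*")


def parse_vote_count(label):
    """Return the first integer found in label, or None if there is none."""
    match = _DIGITS.search(label or "")
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# Made-up labels for illustration only.
print(parse_vote_count("赞同 1234"))
print(parse_vote_count("赞同 1,234"))
print(parse_vote_count("赞同"))
```

If a label abbreviates large counts with a decimal point and a unit suffix, this sketch would return only the leading digits, so such labels would need separate handling.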

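Part 4: Scraping Zhihu User Information

On Zhihu, a user's profile information is available on their profile page. We fetch the HTML with requests, then parse it with BeautifulSoup and look up the relevant tags. The avatar, nickname, and bio can all be found on the page. The code below retrieves this information for the user John Doe: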

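import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/people/john-doe-29-56'
# Assumption: a browser-like User-Agent; the site may still require login cookies
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()
soup = BeautifulSoup(r.content, 'html.parser')

# Each lookup may return None, so guard before reading attributes or text
avatar_tag = soup.find('img', {'class': 'Avatar Avatar--large UserAvatar-inner'})
name_tag = soup.find('span', {'class': 'ProfileHeader-name'})
headline_tag = soup.find('span', {'class': 'RichText'})  # first RichText span on the page

avatar = avatar_tag.get('src') if avatar_tag else None
nickname = name_tag.text if name_tag else ''
headline = headline_tag.text if headline_tag else ''

print('Avatar URL:', avatar)
print('Nickname:', nickname)
print('Bio:', headline)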
