C Language Programming Notes (c语言编程笔录)

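A Hands-On PHP Crawler for Scraping Sina News

Updated: 2023-10-06

Introduction to PHP Web Crawlers

A PHP web crawler is a program that imitates browser behavior to fetch web pages automatically and parse their content. Written in PHP, a crawler can collect site data such as news, images, and videos, then analyze and process it to produce reports, articles, and similar output. Just as Python crawlers lean on the Requests and Beautiful Soup libraries, PHP crawlers usually rely on a third-party library; the example below uses the GuzzleHttp client, installed through Composer.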

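<pre>
// Load Composer's autoloader (install Guzzle first with: composer require guzzlehttp/guzzle)
require_once __DIR__ . '/vendor/autoload.php';

// Create a GuzzleHttp client object
$client = new \GuzzleHttp\Client();

// Request the site
$res = $client->request('GET', 'https://www.example.com');

// Print the status code, then the response body
echo $res->getStatusCode(), PHP_EOL;
echo $res->getBody();
</pre>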
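Fetching is only half of what the introduction describes; the response body still has to be parsed. As a minimal sketch of that step, the snippet below uses PHP's bundled DOM extension in roughly the role Beautiful Soup plays in Python. The sample HTML and the function name `parsePage` are assumptions made for illustration, and a real page may additionally need a character-encoding hint.

```php
<?php
// Sample HTML standing in for a fetched response body (hypothetical content).
$html = '<html><head><title>Example Domain</title></head>'
      . '<body><a href="https://www.example.com/a">First</a>'
      . '<a href="/b">Second</a></body></html>';

/**
 * Return the page title and every link target found in an HTML string.
 * parsePage is a hypothetical helper name used only in this sketch.
 */
function parsePage(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return [
        'title' => $titleNode ? trim($titleNode->textContent) : null,
        'links' => $links,
    ];
}

$page = parsePage($html);
echo $page['title'], PHP_EOL;
echo implode(PHP_EOL, $page['links']), PHP_EOL;
```

In a crawler, the string passed to `parsePage` would be the body returned by the HTTP client, cast to a string.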

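Crawler Execution Flow

A crawl consists of three main steps: (1) request a URL and fetch the page source; (2) parse the source and extract the required data; (3) store the data in a database or write it to a file. Because Sina News pages have a fairly stable structure and article detail pages follow a consistent URL pattern, a crawler for the site can follow the flow below.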

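<pre>
// Database connection; host, credentials, and database name are placeholders
$conn = mysqli_connect('localhost', 'user', 'password', 'news');

// Request the Sina home page
$url = 'http://www.sina.com.cn/';
$html = file_get_contents($url);
if ($html === false) {
  exit("Failed to fetch $url\n");
}

// Parse the page and collect news links (group 1: URL, group 2: link text)
$pattern = '/<a\s[^>]*?href="(https?:\/\/[^"]+)"[^>]*>(.*?)<\/a>/s';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Prepare the INSERT once; bound parameters prevent SQL injection
$stmt = mysqli_prepare($conn, 'INSERT INTO sina_news (title, content, time) VALUES (?, ?, ?)');

// Request each link to fetch the article page
foreach ($matches as $match) {
  $news_url = $match[1];
  $news_html = file_get_contents($news_url);
  if ($news_html === false) {
    continue; // skip pages that cannot be fetched
  }

  // Extract the title, body, and publication time
  $pattern = '/<title>(.*?)<\/title>.*?<div class="article">(.*?)<\/div>.*?<span class="time">(.*?)<\/span>/s';
  if (!preg_match($pattern, $news_html, $detail)) {
    continue; // not an article page, or the layout differs
  }

  // Store the article in the database
  $title = $detail[1];
  $content = $detail[2];
  $time = $detail[3];
  mysqli_stmt_bind_param($stmt, 'sss', $title, $content, $time);
  mysqli_stmt_execute($stmt);
}
</pre>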
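A regular expression that captures the article body up to the first closing `div` can cut the content short as soon as the body contains nested markup. As a hedged alternative, the sketch below extracts the same three fields with `DOMXPath`. Only the class names `article` and `time` come from the patterns above; the sample page and the function name `extractArticle` are assumptions for illustration.

```php
<?php
// Hypothetical article page; only the class names "article" and "time" come from the flow above.
$newsHtml = '<html><head><title>Sample headline</title></head><body>'
          . '<div class="article"><p>First paragraph.</p><p>Second paragraph.</p></div>'
          . '<span class="time">2023-10-06 08:30</span>'
          . '</body></html>';

/**
 * Extract title, content, and time from an article page.
 * Returns null when any of the three elements is missing.
 */
function extractArticle(string $html): ?array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $title = $xpath->query('//title')->item(0);
    $body  = $xpath->query('//div[@class="article"]')->item(0);
    $time  = $xpath->query('//span[@class="time"]')->item(0);
    if ($title === null || $body === null || $time === null) {
        return null; // not an article page, or the layout differs
    }

    // Join the text of each paragraph inside the article body.
    $paragraphs = [];
    foreach ($xpath->query('.//p', $body) as $p) {
        $paragraphs[] = trim($p->textContent);
    }

    return [
        'title'   => trim($title->textContent),
        'content' => implode("\n", $paragraphs),
        'time'    => trim($time->textContent),
    ];
}

$article = extractArticle($newsHtml);
print_r($article);
```

The returned array can be bound to a prepared `INSERT` statement directly, and a `null` result gives the loop a clear signal to skip the page.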

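<h4>Setting Request Headers and Using a Proxy</h4>
<p>
To reduce the chance that requests are blocked by the server or trip its anti-crawling measures, add browser-like request headers such as User-Agent, Accept-Language, and Referer. Routing requests through a proxy IP further reduces the risk of having your own IP banned. The sample below sets request headers and a proxy on a GuzzleHttp client.
</p>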
<pre>
// Request headers
$headers = [
  'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  'Accept-Language' => 'zh-CN,zh;q=0.9,en;q=0.8',
  'Referer' => 'https://www.google.com/',
  'Upgrade-Insecure-Requests' => '1',
];

// Proxy address (placeholder; replace it with a proxy you are allowed to use)
$proxy = '192.168.1.1:1080';

// Pass headers and proxy to the same client so that both apply to every request;
// creating a second client with only 'proxy' would discard the headers
$client = new \GuzzleHttp\Client([
  'headers' => $headers,
  'proxy' => $proxy,
]);
</pre>
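<p>
A single fixed User-Agent and proxy are easy for a server to recognize. One possible refinement, sketched here under stated assumptions, is to rotate both per request. The helper name <code>buildRequestOptions</code>, the User-Agent strings, the proxy addresses, and the timeout value are all hypothetical; <code>headers</code>, <code>proxy</code>, and <code>timeout</code> are standard Guzzle request options.
</p>

```php
<?php
/**
 * Build Guzzle request options for the n-th request, cycling through the
 * given User-Agent strings and proxies in round-robin order.
 * Assumes $userAgents is non-empty; $proxies may be empty.
 */
function buildRequestOptions(int $n, array $userAgents, array $proxies): array
{
    $options = [
        'headers' => [
            'User-Agent'      => $userAgents[$n % count($userAgents)],
            'Accept-Language' => 'zh-CN,zh;q=0.9,en;q=0.8',
        ],
        'timeout' => 10, // seconds; an arbitrary value for this sketch
    ];
    if ($proxies !== []) {
        $options['proxy'] = $proxies[$n % count($proxies)];
    }
    return $options;
}

// Placeholder values, not working proxies.
$userAgents = ['ExampleAgent/1.0', 'ExampleAgent/2.0'];
$proxies = ['http://192.168.1.1:1080', 'http://192.168.1.2:1080'];

for ($n = 0; $n < 3; $n++) {
    $options = buildRequestOptions($n, $userAgents, $proxies);
    echo $options['headers']['User-Agent'], ' via ', $options['proxy'], PHP_EOL;
    // With Guzzle installed: $client->request('GET', $url, $options);
}
```

<p>
Options passed to <code>request()</code> are merged with the client's defaults, so the rotation can sit on top of a client that already carries the shared headers.
</p>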

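<h4>Combining Big Data and Machine Learning</h4>
<p>
In practice a crawler can collect a very large volume of data, so how to store, search, and analyze it has to be considered. Big data and machine learning tools such as Hadoop, Spark, Elasticsearch, and Kibana can handle the processing, storage, and analysis. For the Sina News crawler, the scraped articles can be stored in Elasticsearch and the analysis results displayed in Kibana. The sample below searches the indexed articles in Elasticsearch and shows the kind of per-day count that Kibana would chart.
</p>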
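<pre>
// Elasticsearch client from the official elasticsearch-php library
// (7.x namespace shown; 8.x uses \Elastic\Elasticsearch\ClientBuilder); the host is a placeholder
$es = \Elasticsearch\ClientBuilder::create()->setHosts(['localhost:9200'])->build();

// Elasticsearch: full-text search on the article content
$search = 'keyword to search for';
$params = [
  'index' => 'sina_news',
  // 'type' is omitted: mapping types are deprecated since Elasticsearch 7
  'body' => [
    'query' => [
      'match' => [
        'content' => $search,
      ],
    ],
  ],
];
$response = $es->search($params);

// Kibana: a per-day article count is the kind of result to chart.
// Written as SQL against the sina_news table (MySQL syntax, not PHP):
// SELECT COUNT(*) AS count, DATE_FORMAT(time, '%Y-%m-%d') AS date FROM sina_news GROUP BY date;
</pre>        </div>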
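<p>
The per-day count can also be computed inside Elasticsearch with a <code>date_histogram</code> aggregation, which is the kind of query a Kibana date chart is built on. The sketch below only builds the search parameters. It assumes that the <code>time</code> field is mapped as a date and that the cluster supports <code>calendar_interval</code> (Elasticsearch 7.2 or later); the function name <code>buildDailyCountQuery</code> is hypothetical.
</p>

```php
<?php
/**
 * Build search parameters that count matching articles per day.
 * Assumes the "time" field is mapped as a date in the index.
 */
function buildDailyCountQuery(string $index, string $keyword): array
{
    return [
        'index' => $index,
        'body'  => [
            'size'  => 0, // return only the aggregation, no individual hits
            'query' => ['match' => ['content' => $keyword]],
            'aggs'  => [
                'per_day' => [
                    'date_histogram' => [
                        'field'             => 'time',
                        'calendar_interval' => 'day',
                    ],
                ],
            ],
        ],
    ];
}

$params = buildDailyCountQuery('sina_news', 'keyword');
echo json_encode($params['body'], JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE), PHP_EOL;
// With an elasticsearch-php client object: $response = $es->search($params);
```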
        <div class="share" id="down">          <div class="share-text">
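            <p>If this article infringes on your rights, please contact the site administrator to have it corrected or removed.</p>
            <p>Please credit the source when reposting.</p>
            <p>Article URL: <a href="https://www.radbuilder.com/marketing/Python/14540.html" target="_blank">https://www.radbuilder.com/marketing/Python/14540.html</a></p>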
          </div>
        </div>
      </div>
    </div>
  </div>
</article>
</body>
</html>