Learning the Scrapy Framework

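Scrapy is an application framework, written in pure Python, for crawling websites and extracting structured data. It is used in a wide range of applications.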

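  • Scrapy Engine: handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, including the passing of signals and data.
  • Scheduler: accepts the Requests sent by the engine, orders and enqueues them, and hands them back to the engine when the engine asks for them.
  • Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.
  • Spider: processes all Responses, parses them to extract the data needed for the Item fields, and submits any URLs to follow to the engine, which feeds them back into the Scheduler.
  • Item Pipeline: processes the Items produced by the Spider and performs post-processing (detailed analysis, filtering, storage, and so on).
  • Downloader Middlewares: components that let you customize and extend the download functionality.
  • Spider Middlewares: components that let you customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).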
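
The division of labor above can be made concrete with a toy, single-threaded model. This is only a sketch in plain Python, not Scrapy's engine: the names fake_download, toy_spider_parse, toy_pipeline, run_engine and the in-memory PAGES table are all hypothetical, and real Scrapy is asynchronous and routes traffic through the middlewares.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on the page).
PAGES = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", []),
    "http://example.com/b": ("page b", ["http://example.com/a"]),
}


def fake_download(url):
    """Downloader stand-in: turn a request (URL) into a response."""
    text, links = PAGES[url]
    return {"url": url, "text": text, "links": links}


def toy_spider_parse(response):
    """Spider stand-in: yield one item per page, then the URLs to follow."""
    yield {"item": {"url": response["url"], "text": response["text"]}}
    for link in response["links"]:
        yield {"request": link}


def toy_pipeline(item, storage):
    """Item Pipeline stand-in: post-process the item and store it."""
    item["text"] = item["text"].upper()
    storage.append(item)


def run_engine(start_urls):
    """Engine stand-in: shuttle data between the other components."""
    scheduler = deque(start_urls)   # Scheduler: a FIFO queue of pending requests
    seen = set(start_urls)          # simple duplicate filter
    storage = []
    while scheduler:
        url = scheduler.popleft()                       # engine asks the scheduler
        response = fake_download(url)                   # engine -> downloader
        for result in toy_spider_parse(response):       # engine -> spider
            if "item" in result:
                toy_pipeline(result["item"], storage)   # engine -> pipeline
            elif result["request"] not in seen:
                seen.add(result["request"])
                scheduler.append(result["request"])     # back into the scheduler
    return storage


stored = run_engine(["http://example.com/"])
print([item["url"] for item in stored])
```

Each page is downloaded once even though /a is linked twice, because the toy engine remembers which URLs it has already queued.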

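1: Steps for building a crawler

Building a Scrapy crawler takes four steps:

  • Create a project (scrapy startproject xxx): create a new crawler project
  • Define the target (write items.py): specify what you want to scrape
  • Build the spider (spiders/xxspider.py): write the spider and start crawling pages
  • Store the content (pipelines.py): design a pipeline to store the scraped content
  1. scrapy startproject myproject

  2. scrapy genspider <spider name> <spider domain>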

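2: Basic workflow of a Scrapy project

2.1 Default structure of a Scrapy project

Use the global command startproject to create a project. It creates a Scrapy project named project_name inside a folder called project_name.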

scrapy startproject myproject

Although it can be modified, every Scrapy project has a file structure similar to the following by default:

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

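The directory that contains scrapy.cfg is treated as the project root. That file contains a field naming the Python module that defines the project settings.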
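
As an illustration of that field, the sketch below parses a sample scrapy.cfg with Python's standard configparser. The file contents shown are an assumption based on what startproject typically generates for a project named myproject (a [settings] section with a default key); check the file in your own project.

```python
import configparser

# Assumed contents of scrapy.cfg for a project named "myproject".
SAMPLE_CFG = """
[settings]
default = myproject.settings
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

# The "default" key names the Python module that holds the project settings.
settings_module = parser.get("settings", "default")
print(settings_module)
```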

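2.2 Defining the data to scrape

An Item is the container that holds the scraped data. It is used much like a Python dict, and it adds a protection mechanism that catches the undefined-field errors caused by typos.
As with an ORM, you define an Item by creating a scrapy.Item subclass and declaring class attributes of type scrapy.Field.
First, model the item on the data we want to get from dmoz.org (DMOZ, the Open Directory Project, was a well-known open web directory, built and maintained by volunteers from around the world and billed as the largest global directory community). We need the name, URL, and description of each site listed on dmoz, so we define the corresponding fields in the item. Edit items.py: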

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()  # name of the listed site
    link = scrapy.Field()   # URL of the listed site
    desc = scrapy.Field()   # description of the listed site
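
To see what that protection means in practice, here is a plain-Python stand-in. It is not Scrapy's implementation: ToyItem and its fields tuple are hypothetical, and they only mimic the idea that assigning to an undeclared field raises KeyError instead of silently creating a new key.

```python
class ToyItem(dict):
    """Dict-like container that only accepts declared field names."""

    # Hypothetical stand-in for the scrapy.Field declarations.
    fields = ("title", "link", "desc")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"ToyItem does not support field: {key}")
        super().__setitem__(key, value)


item = ToyItem()
item["title"] = "Example title"      # declared field: accepted
print(item["title"])

try:
    item["titel"] = "typo"           # misspelled field: rejected
except KeyError as exc:
    print("rejected:", exc)
```

With an ordinary dict the misspelled key would be stored without complaint, and the mistake would only surface later as missing data.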

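2.3 Creating a Spider with the project command genspider

scrapy genspider [-t template] <name> <domain>

Creates a spider in the current project.
This is only a shortcut for creating spiders. It can generate a spider from a predefined template, but you can also write the spider source file yourself.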

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider -d basic
import scrapy

class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
    )

    def parse(self, response):
        pass

$ scrapy genspider -t basic example example.com
Created spider 'example' using template 'basic' in module:
  mybot.spiders.example
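
The $classname, $name and $domain placeholders in the basic template look like Python string.Template variables. Assuming that is how they are filled in, the substitution can be sketched as follows. The template text is copied from the output above; the render_spider helper and the capitalization rule for the class name are hypothetical.

```python
from string import Template

BASIC_TEMPLATE = Template('''import scrapy

class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
    )

    def parse(self, response):
        pass
''')


def render_spider(name, domain):
    """Fill the template placeholders for one spider (hypothetical helper)."""
    classname = name.capitalize() + "Spider"  # assumed naming rule
    return BASIC_TEMPLATE.substitute(classname=classname, name=name, domain=domain)


source = render_spider("example", "example.com")
print(source)
```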

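2.4 Writing a Spider to extract item data

A Spider is a user-written class for scraping data from a single website (or a group of websites).
It contains the initial URLs to download, the rules for following links in the pages, and the methods that parse page content to produce items.
To create a Spider, you must subclass scrapy.Spider and define the following three members:

  • name: identifies the Spider. It must be unique; you may not give the same name to different Spiders.
  • start_urls: the list of URLs the Spider starts crawling from. The first page fetched will therefore be one of these; subsequent URLs are extracted from the data retrieved from the initial URLs.
  • parse(): a method of the spider. It is called with the Response object generated for each downloaded initial URL as its only argument. The method is responsible for parsing the response data, extracting the scraped data (producing items), and generating Request objects for further URLs to process.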
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"  # unique identifier; the name you pass when starting the spider
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the file after the last path segment of the URL, e.g. "Books"
        filename = response.url.split("/")[-2]
        # Save the raw response body to disk
        with open(filename, 'wb') as f:
            f.write(response.body)
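
The file name in parse() comes from the second-to-last piece of the URL when it is split on "/". Because these start URLs end with a slash, the last piece is an empty string, so index -2 picks the final path segment. A quick check in plain Python (the helper name page_name is hypothetical):

```python
def page_name(url):
    """Return the second-to-last '/'-separated piece of the URL."""
    return url.split("/")[-2]


urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]
for url in urls:
    print(page_name(url))  # prints "Books", then "Resources"
```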

2.5 Running the crawl

Run the project command crawl to start the Spider:

scrapy crawl dmoz

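During this process:
Scrapy creates a scrapy.Request object for each URL in the Spider's start_urls attribute and assigns the parse method to each Request as its callback.
The Requests are scheduled and executed, producing scrapy.http.Response objects that are passed back to the spider's parse() method.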
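
The callback wiring described above can be imitated without Scrapy. In this sketch ToyRequest, ToyResponse, ToySpider and crawl are hypothetical stand-ins, and the "download" just fabricates a body, but the shape is the same: one request per start URL, each carrying parse as its callback, and each response handed back to that callback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyResponse:
    url: str
    body: bytes


@dataclass
class ToyRequest:
    url: str
    callback: Callable[[ToyResponse], object]


class ToySpider:
    start_urls = ["http://example.com/one/", "http://example.com/two/"]

    def __init__(self):
        self.parsed = []

    def parse(self, response):
        # Record which responses reached the callback.
        self.parsed.append((response.url, len(response.body)))


def crawl(spider):
    # One request per start URL, with parse attached as the callback.
    requests = [ToyRequest(url, callback=spider.parse) for url in spider.start_urls]
    for request in requests:
        # Stand-in for scheduling and downloading: fabricate a response body.
        response = ToyResponse(request.url, body=request.url.encode("utf-8"))
        request.callback(response)


spider = ToySpider()
crawl(spider)
print(spider.parsed)
```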

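2.6 Extracting data with selectors

Introduction to Selectors:
Scrapy has its own mechanism for extracting data. These are called selectors because they "select" parts of an HTML document using XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, and it also works on HTML. CSS is a language for applying styles to HTML documents; it defines selectors that associate those styles with specific HTML elements.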
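
For a feel of what such expressions select, the standard library's xml.etree.ElementTree supports a limited XPath subset. This is not Scrapy's selector API (which also copes with messy real-world HTML and with functions such as text()); the tiny well-formed document below is made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed document standing in for an HTML page.
DOC = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>1</td><td>2</td></tr></table>
  </body>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element

title = root.find("./head/title")             # like /html/head/title
cells = root.findall(".//td")                 # like //td
mine = root.findall(".//div[@class='mine']")  # like //div[@class="mine"]

print(title.text)                  # text of the <title> element
print([td.text for td in cells])
print([div.text for div in mine])
```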

Examples of XPath expressions and their meanings:

  • /html/head/title: selects the <title> element inside the <head> element of the HTML document
  • /html/head/title/text(): selects the text of the <title> element mentioned above
  • //td: selects all <td> elements
  • //div[@class="mine"]: selects all div elements that have the attribute class="mine"

Extracting the data:
Inspect the HTML source and work out suitable XPath expressions.
After looking at the page source, you will find that the site information is contained in the second <ul> element.
We can select every <li> element in the page's site list with this code:
response.xpath('//ul/li')

Item objects are custom Python dicts. You can use standard dict syntax to read the value of each field.
In general, a Spider returns the scraped data as Item objects. So, to return the scraped data, our final code is:

import scrapy

# assumes the project package is named "tutorial"
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # One item per <li> in the page's lists
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Crawling dmoz.org now yields DmozItem objects.
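
The per-<li> loop in parse() can be tried offline on a small hand-written list. The sketch uses xml.etree.ElementTree from the standard library rather than Scrapy selectors, so a/text() and a/@href are approximated with .text and .get("href"); the sample markup and the extract_items name are invented for the illustration. Unlike extract(), which returns lists of strings, this sketch stores single values.

```python
import xml.etree.ElementTree as ET

# Invented sample markup shaped like a list of links with descriptions.
SAMPLE = """
<ul>
  <li><a href="http://example.com/book1">Book One</a> - an introductory text</li>
  <li><a href="http://example.com/book2">Book Two</a> - a reference manual</li>
</ul>
"""


def extract_items(markup):
    """Build one dict per <li>, mirroring the title/link/desc fields."""
    root = ET.fromstring(markup)
    items = []
    for li in root.findall("./li"):
        anchor = li.find("a")
        items.append({
            "title": anchor.text,                       # roughly a/text()
            "link": anchor.get("href"),                 # roughly a/@href
            "desc": (anchor.tail or "").strip(" -\n"),  # text after the link
        })
    return items


for item in extract_items(SAMPLE):
    print(item)
```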