Scraping Page Information and Taking Website Screenshots with NodeJS and PhantomJS

Date: 2021-05-25

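PhantomJS is an economical and practical way to take web page screenshots, but its API is small, so building anything beyond that takes real effort. For example, its built-in web server, Mongoose, can handle at most 10 concurrent requests, so expecting PhantomJS to run as a standalone service is not realistic. Another language is needed to host the service, and NodeJS is used here.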
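One way to picture this division of labor is NodeJS starting PhantomJS as a child process for each job. The sketch below is only an illustration under assumptions: the script name `snapshot.js`, passing the campaign id as a command-line argument, and the helper names `buildPhantomArgs` and `runSnapshotJob` are hypothetical and do not come from the original code. It also assumes the `phantomjs` binary is on the PATH.

```javascript
var spawn = require('child_process').spawn;

// Build the argument list for one PhantomJS run.
// 'snapshot.js' and the campaignId argument are hypothetical.
function buildPhantomArgs(scriptPath, campaignId) {
  return [scriptPath, String(campaignId)];
}

// Start PhantomJS as a child process and report how it ended.
// onDone(err, exitCode) is called exactly once.
function runSnapshotJob(campaignId, onDone) {
  var child = spawn('phantomjs', buildPhantomArgs('snapshot.js', campaignId));
  var finished = false;

  function finish(err, code) {
    if (!finished) {
      finished = true;
      onDone(err, code);
    }
  }

  // 'error' fires when the process cannot be started (e.g. binary not found)
  child.on('error', function (err) { finish(err, null); });
  // 'close' fires when the process has exited and its streams are closed
  child.on('close', function (code) { finish(null, code); });

  return child;
}
```

Keeping the argument building in its own function makes it easy to check without launching a browser. Inside the PhantomJS script, command-line arguments would be available through `require('system').args`.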

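Installing PhantomJS

First, download the build for your platform from the PhantomJS website, or download the source code and compile it yourself. Then add PhantomJS to your PATH environment variable and run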

$ phantomjs

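If the command responds (with no arguments, PhantomJS should start its interactive prompt), you can move on to the next step.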
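The same check can be done from NodeJS before the service starts. This is a minimal sketch, assuming the binary is named `phantomjs`; the helper name `getPhantomVersion` is made up for illustration.

```javascript
var spawnSync = require('child_process').spawnSync;

// Return the text printed by `<command> --version`,
// or null if the command cannot be started or exits with an error.
function getPhantomVersion(command) {
  var result = spawnSync(command || 'phantomjs', ['--version'], { encoding: 'utf8' });
  if (result.error || result.status !== 0) {
    return null;
  }
  return result.stdout.trim();
}

var version = getPhantomVersion();
console.log(version ? 'PhantomJS ' + version + ' found' : 'phantomjs not found on PATH');
```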

Taking a simple screenshot with PhantomJS

The code is as follows:

var webpage = require('webpage')
  , page = webpage.create();
page.viewportSize = { width: 1024, height: 800 };
page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
page.settings = {
  javascriptEnabled: false,
  loadImages: true,
  userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/19.0'
};
page.open('http://

[... the listing is truncated here in the source; the text resumes partway through a later listing ...]

        ponent(imagePath),
        '&id=',
        id,
        '&status=',
      ].join('');
      postMan.post(data);
    }
    // release the memory
    page.close();
  });
}
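The fragment above assembles a POST body by joining string pieces, and the image path appears to be passed through `encodeURIComponent`. A minimal sketch of that idea as a standalone helper follows. The function name `buildPostBody` and the field name `imagePath` are assumptions, since the beginning of the original array is missing.

```javascript
// Build an application/x-www-form-urlencoded style body from a plain object.
// Both keys and values are percent-encoded so that characters such as
// '/', ' ' and '&' inside a value cannot break the key=value&key=value layout.
function buildPostBody(fields) {
  var parts = [];
  for (var key in fields) {
    if (Object.prototype.hasOwnProperty.call(fields, key)) {
      parts.push(encodeURIComponent(key) + '=' + encodeURIComponent(String(fields[key])));
    }
  }
  return parts.join('&');
}

// The field names and values here are illustrative only.
var body = buildPostBody({ imagePath: '/tmp/shots/a b.png', id: 7, status: 1 });
console.log(body);
```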

var postMan = {
  postPage: null,
  posting: false,
  datas: [],
  len: 0,
  currentNum: 0,
  // fetch the URL list from the NodeJS bridge and take a snapshot of each URL
  init: function (snapshot) {
    var postPage = webpage.create()
      , that = this;
    postPage.customHeaders = {
      'secret': pkg.secret
    };
    postPage.open('http://localhost:' + pkg.port + '/bridge?campaignId=' + campaignId, function () {
      var urls = JSON.parse(postPage.plainText).urls
        , url;

      // `this` is not postMan inside this callback, so go through `that`
      that.len = urls.length;

      if (that.len) {
        for (var i = that.len; i--;) {
          url = urls[i];
          snapshot(url.id, url.url, url.imagePath);
        }
      }
    });
    this.postPage = postPage;
  },
  // queue a result and start sending if nothing is in flight
  post: function (data) {
    this.datas.push(data);
    if (!this.posting) {
      this.posting = true;
      this.fire();
    }
  },
  // send the queued results to the bridge one at a time
  fire: function () {
    if (this.datas.length) {
      var data = this.datas.shift()
        , that = this;
      this.postPage.open('http://localhost:' + pkg.port + '/bridge', 'POST', data, function () {
        that.fire();
        // kill child process
        setTimeout(function () {
          // `this` is not postMan inside setTimeout either
          if (++that.currentNum === that.len) {
            that.postPage.close();
            phantom.exit();
          }
        }, 500);
      });
    } else {
      this.posting = false;
    }
  }
};
postMan.init(snapshot);
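The `post`/`fire` pair above is a serial queue: results are buffered and sent one at a time, and the next send starts only from the callback of the previous one. The sketch below isolates that pattern in plain JavaScript so it can be tried without PhantomJS. The name `createSerialQueue` and the fake sender are illustrative assumptions, not part of the original code.

```javascript
// Create a queue that hands items to `send` one at a time.
// `send(item, done)` must call `done()` once the item has been delivered.
function createSerialQueue(send) {
  var items = [];
  var sending = false;

  // Deliver the next item, or mark the queue idle when nothing is left.
  function next() {
    if (!items.length) {
      sending = false;
      return;
    }
    send(items.shift(), next);
  }

  return {
    // Add an item; start delivery only if nothing is in flight.
    push: function (item) {
      items.push(item);
      if (!sending) {
        sending = true;
        next();
      }
    },
    // Number of items still waiting to be sent.
    size: function () {
      return items.length;
    }
  };
}

// Usage with a fake sender that only records what it was given.
var sent = [];
var queue = createSerialQueue(function (item, done) {
  sent.push(item);
  setTimeout(done, 0); // stand-in for an asynchronous POST
});
queue.push('id=1&status=1');
queue.push('id=2&status=1'); // waits until the first send has called done()
```

Sending one request at a time matters here because a single page object (`postPage`) is reused for every POST.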

Result

[Image: result of scraping page information and taking website screenshots with NodeJS and PhantomJS]
