Thursday, March 5, 2009

[PerlChina] Original script: fetching pages that are only visible after logging in

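This is my first post here, so please go easy on me! o(∩_∩)o ... haha!
The script's strategy is fairly simple:
Step 1: register an account on the site whose pages you want to fetch, log in successfully, then type the following JavaScript into the IE address bar:
javascript:document.write(document.cookie); and press Enter to get the site's cookie values;
Step 2: assign those cookies to an LWP::UserAgent object;
Step 3: fetch the pages with LWP::UserAgent;
Step 4: parse the HTML.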
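
As a rough illustration of steps 1 and 2: the string printed by document.cookie is a list of name=value pairs separated by semicolons, so it has to be split before it can be handed to a cookie jar. The sketch below uses only core Perl. The helper name parse_cookie_string and the sample cookie values are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a "name=value; name2=value2" string, as printed by
# document.cookie, into a list of [name, value] pairs.
sub parse_cookie_string {
    my ($raw) = @_;
    my @pairs;
    for my $chunk ( split /;/, $raw ) {
        $chunk =~ s/^\s+|\s+$//g;    # trim spaces and line wraps
        next unless length $chunk;
        # Limit the split to 2 fields so an '=' inside a value is preserved.
        my ( $name, $value ) = split /=/, $chunk, 2;
        $value = '' unless defined $value;
        push @pairs, [ $name, $value ];
    }
    return @pairs;
}

# Made-up sample input, not taken from a real site.
my @cookies = parse_cookie_string("rtime=0; token=abc=def;\n ASPSESSIONID=XYZ ");
printf "%s => %s\n", @$_ for @cookies;
```

Splitting with a limit of 2 matters because cookie values may themselves contain '=' characters.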

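The full implementation follows. In testing, fetching is unstable on some sites: the data that comes back is sometimes a session-expired notice asking me to log in again.
I would appreciate any pointers on how to make this more stable, or on how to solve the problem. Thanks. :)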
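
One possible cause of session-expired pages, offered as a guess rather than a confirmed diagnosis: the ninth argument of HTTP::Cookies set_cookie is the cookie's max-age in seconds, and the jar stops sending a cookie once that time has passed. The sketch below inspects the jar without any network access. The domain www.example.com and the cookie names are placeholders.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;

# set_cookie(version, key, value, path, domain, port, path_spec, secure, maxage, discard)
# maxage = 1: the jar records an expiry about one second from now.
$jar->set_cookie( 0, 'short', 'abc', '/', 'www.example.com', undef, 0, 0, 1, 0 );
# maxage = undef, discard = 1: a session cookie with no expiry inside the jar.
$jar->set_cookie( 0, 'session', 'xyz', '/', 'www.example.com', undef, 0, 0, undef, 1 );

# scan() calls the callback once per stored cookie; the 9th argument is the expiry time.
my %expires;
$jar->scan( sub { my ( $key, $exp ) = @_[ 1, 8 ]; $expires{$key} = $exp; } );

for my $key ( sort keys %expires ) {
    my $exp = $expires{$key};
    my $note = defined $exp ? 'expires in ' . ( $exp - time() ) . 's' : 'no expiry';
    print "$key: $note\n";
}

# Attach the stored cookies to a request for the same host, as LWP would do.
my $req = HTTP::Request->new( GET => 'http://www.example.com/page.asp?id=1' );
$jar->add_cookie_header($req);
print 'Cookie header: ', $req->header('Cookie'), "\n";
```

A cookie whose expiry has passed by the time a later request is built is left out of the Cookie header, which the server may treat as a missing session.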

--------------------------------------------------------------------------------------------------------------------------------
Configuration file: config.xml
<?xml version='1.0' encoding='utf-8'?>
<config>
<domain>www.xxxxnet</domain>
<!-- Default '/' -->
<path>/</path>
<!-- Default './data' -->
<data_dir>data</data_dir>
<!-- Default 'Set-Cookie3' -->
<cookie_version>Set-Cookie3</cookie_version>
<!-- After logging in to the website, enter
'javascript:document.write(document.cookie)' in IE's address bar, press Enter, and copy the result here -->
<cookies>
cck_lasttime=1236217579437; cck_count=0; cnzz_a321858=60; vw321858=
%3A16839968%3A42577264%3A65005069%3A37251471%3A33312220%3A36235501%3A35414756%3A35922741%3A62201617%3A35405608%3A53791128%3A35904130%3A53791127%3A35727515%3A36402653%3A34408563%3A36743486%3A74800923%3A37495433%3A32038215%3A37759456%3A;
sin321858=none; rtime=0; ltime=1236243678484;
cnzz_eid=77435460-1236217578-;
ASPSESSIONIDQAADDSQC=GHONMGMDNILLLAELGEAKEMJN
</cookies>
<!-- Multiple dataurls blocks can be added -->
<dataurls>
<data_url_prefix>http://www.xxxxx.net/shangji/showgongying.asp?id=</data_url_prefix>
<start>2435</start>
<end>2437</end>
<starthtml><![CDATA[<table width="645" border="0" cellspacing="0" cellpadding="0">]]></starthtml>
<endhtml><![CDATA[<table width="98%" border="0" align="center" cellpadding="0" cellspacing="0">]]></endhtml>
<sub_dir>showgongying</sub_dir>
</dataurls>
<dataurls>
<data_url_prefix>http://www.xxxxxx.net/shangji/showshangji.asp?id=</data_url_prefix>
<start>2265</start>
<end>2267</end>
<sub_dir>showshangji</sub_dir>
<starthtml><![CDATA[<table width="645" border="0" cellspacing="0" cellpadding="0">]]></starthtml>
<endhtml><![CDATA[<table width="98%" border="0" align="center" cellpadding="0" cellspacing="0">]]></endhtml>
</dataurls>
</config>
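
A minimal sketch of how a config shaped like this one can be read with XML::Simple, assuming the module and an XML parser are installed. The XML here is a cut-down version with placeholder values, passed as a string instead of a file. It shows one detail worth knowing: without ForceArray, a config containing only one dataurls block yields a hash reference rather than an array reference.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# A cut-down config with a single <dataurls> block (placeholder values).
my $xml = <<'XML';
<config>
  <data_dir>data</data_dir>
  <dataurls>
    <data_url_prefix>http://www.example.com/show.asp?id=</data_url_prefix>
    <start>1</start>
    <end>3</end>
    <sub_dir>show</sub_dir>
  </dataurls>
</config>
XML

# ForceArray makes <dataurls> an array reference even when it appears only once.
my $config = XMLin( $xml, ForceArray => ['dataurls'] );

for my $block ( @{ $config->{dataurls} } ) {
    print "$block->{sub_dir}: ids $block->{start} to $block->{end} ",
        "from $block->{data_url_prefix}\n";
}
```

With ForceArray in place, the same loop works whether the file lists one dataurls block or several.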
------------------------------------------------------------------------------------------------------------------------------
HTML parsing module:
########################
# Version: 0.02
### Author: dungang
### Date: 2009.03.05
### File: ParseData.pm
# Email: dungang@hotmail.com
########################
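package ParseData;
use strict;
use warnings;
use base 'HTML::Parser';
use IO::File;

# Open the output file and remember the start/end marker tags for one page.
sub setstart {
    my ( $self, $savefile, $startstr, $endstr ) = @_;
    $self->{fh} = IO::File->new( $savefile, '>' )
        or die "Can't open $savefile for writing: $!";
    $self->{startstr} = $startstr;
    $self->{endstr}   = $endstr;
    $self->{bool}     = 0;    # not capturing until the start marker is seen
}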

# Start-tag handler: switch capturing on at the start marker and off at the end marker.
sub start {
    my ( $self, $tagname, $attr, $text, $dtext ) = @_;
    if ( lc($tagname) eq 'table' ) {
        if ( $text eq $self->{startstr} ) {
            $self->{bool} = 1;
        }
        elsif ( $text eq $self->{endstr} ) {
            $self->{bool} = 0;
        }
    }
}


# Text handler: while capturing, strip whitespace, HTML entities, and comments,
# then write whatever text is left.
sub text {
    my ( $self, $text ) = @_;
    if ( $self->{bool} ) {
        $text =~ s/\s+|&.*?;|<!--.*?-->//g;
        $self->{fh}->print( $text . "\n" ) if $text ne '';
    }
}

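# Finish one page: flush the parser first so buffered text is still written,
# then close the file and reset the capture flag for the next page.
sub setend {
    my $self = shift;
    $self->eof;
    $self->{fh}->close;
    $self->{bool} = 0;
}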

1;
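
The marker-based extraction used by ParseData can be tried on an in-memory page. This is a minimal sketch, assuming HTML::Parser is installed: the HTML, the two marker tags, and the variable names are invented for the example, and the text is collected in an array instead of being written to a file.

```perl
use strict;
use warnings;
use HTML::Parser;

# Invented markers: capture starts at the first tag and stops at the second.
my $start_marker = '<table id="data">';
my $end_marker   = '<table id="footer">';

my $html = '<table id="nav"><tr><td>menu</td></tr></table>'
    . '<table id="data"><tr><td>Name: Widget</td><td>Price: 42</td></tr></table>'
    . '<table id="footer"><tr><td>copyright</td></tr></table>';

my $capturing = 0;
my @lines;

my $p = HTML::Parser->new(
    api_version => 3,
    # 'text' is the tag exactly as it appears in the source, so it can be
    # compared with the marker strings.
    start_h => [
        sub {
            my ( $tagname, $text ) = @_;
            return unless $tagname eq 'table';
            if    ( $text eq $start_marker ) { $capturing = 1 }
            elsif ( $text eq $end_marker )   { $capturing = 0 }
        },
        'tagname,text'
    ],
    # 'dtext' is the text with HTML entities already decoded.
    text_h => [
        sub {
            my ($dtext) = @_;
            return unless $capturing;
            $dtext =~ s/^\s+|\s+$//g;
            push @lines, $dtext if length $dtext;
        },
        'dtext'
    ],
);
$p->unbroken_text(1);    # deliver each run of text in one piece
$p->parse($html);
$p->eof;

print "$_\n" for @lines;
```

Because the comparison is an exact string match against the tag source, the markers have to reproduce the site's tag character for character, including attribute order and spacing.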

-----------------------------------------------------------------------------------------------------------------------------------
Fetching script: getPage.pl

#!/usr/bin/perl
########################
# Version: 0.02
### Author: dungang
### Date: 2009.03.05
### File: getPage.pl
# Email: dungang@hotmail.com
########################
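use strict;
use warnings;
use lib '.';    # ParseData.pm sits next to this script; Perl 5.26+ no longer searches '.'
use XML::Parser;
use LWP::UserAgent;
use HTTP::Cookies;
use XML::Simple;
use File::Path;
use ParseData;
$XML::Simple::PREFERRED_PARSER = "XML::Parser";
-e 'config.xml' or die "Can't find config.xml file in current directory!";
# ForceArray keeps dataurls an array reference even when only one block is configured.
my $config = XMLin( 'config.xml', ForceArray => ['dataurls'] );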
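my $cookie_jar = HTTP::Cookies->new();
foreach my $pair ( split /;/, $config->{cookies} ) {
    $pair =~ s/^\s+|\s+$//g;    # the pasted cookie string may be wrapped across lines
    next unless length $pair;
    my ( $name, $value ) = split /=/, $pair, 2;    # keep any '=' inside the value
    $value = '' unless defined $value;
    $value =~ s/^\s+//;
    # Arguments: version, key, value, path, domain, port, path_spec, secure, maxage, discard.
    # Version 0 is a plain name=value cookie. maxage stays undefined so the cookie does not
    # expire inside the jar; a maxage of 1 would stop it being sent after one second.
    $cookie_jar->set_cookie( 0, $name, $value, $config->{path} || '/',
        $config->{domain}, undef, 0, 0, undef, 1 );
}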
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)");
$ua->cookie_jar($cookie_jar);
my $p = ParseData->new(
    start_h => [ \&ParseData::start, "self,tagname,attr,text,dtext" ],
    text_h  => [ \&ParseData::text,  'self,text' ],
);
foreach my $url ( @{ $config->{dataurls} } ) {
    my $data_dir = $config->{data_dir} . '/' . $url->{sub_dir};
    mkpath($data_dir) unless -d $data_dir;
    # Remove any line wraps that crept into the URL prefix in config.xml.
    ( my $prefix = $url->{data_url_prefix} ) =~ s/\s+//g;
    for my $i ( $url->{start} .. $url->{end} ) {
        my $req = HTTP::Request->new( GET => $prefix . $i );
        my $res = $ua->request($req);
        if ( $res->is_success ) {
            $p->setstart( $data_dir . '/' . $i . '.txt',
                $url->{starthtml}, $url->{endhtml} );
            $p->parse( $res->content );
            $p->setend;
        }
        else {
            print "$prefix$i: ", $res->status_line, "\n";
        }
    }
}
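
Since a successful HTTP status does not guarantee that the body is the wanted page, a check for the login page before parsing may help when a session has expired. The sketch below is only an illustration in core Perl: the marker strings, the helper name looks_like_login_page, and both sample pages are assumptions, and real markers would have to be taken from the target site's own login page.

```perl
use strict;
use warnings;

# Hypothetical markers; replace them with text that really appears on the
# target site's login or session-expired page.
my @login_markers = ( 'name="password"', 'Please log in' );

# Return true when the page body contains any of the markers.
sub looks_like_login_page {
    my ($html) = @_;
    return scalar grep { index( $html, $_ ) >= 0 } @login_markers;
}

# Invented sample pages.
my $data_page  = '<table width="645"><tr><td>Product 2435</td></tr></table>';
my $login_page = '<form action="login.asp"><input type="password" name="password"></form>';

for my $page ( $data_page, $login_page ) {
    my $verdict = looks_like_login_page($page)
        ? 'login page: stop and refresh the cookies'
        : 'data page: safe to parse';
    print "$verdict\n";
}
```

Stopping at the first login page avoids filling the data directory with files that contain only the login prompt.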
