parsing a doc with HTML::TableExtract [Perl] to fetch data


Post by lin »

howdy - good evening dear friends,

First of all, I am very happy to have found this great place. I like this forum very much, since it has a great and supportive community. I learn a lot from you folks here! Each question gets some great reviewers, and each thread is a valuable learning asset.

Well, I am new to Perl and fairly new to this board. I am currently working out a little parser: I want to parse a table on this page:

http://www.schulministerium.nrw.de/BP/S ... pDO=154763

This page has a table with labels and values.

We need to provide something that uniquely identifies the table in question. This can be the content of its headers or its HTML attributes. In this case there is only one table in the document, so strictly speaking we don't even need to do that. But if I were to pass anything to the constructor, I would pass the class of the table.

We do not want the columns of the table. The first column of this table consists of labels and the second column consists of values. To get the labels and values at the same time, we should process the table row-by-row.

Well - can this be done like so:


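#!/usr/bin/perl

use strict;
use warnings;
use HTML::TableExtract;
use YAML;

# Match the table by its class attribute.
my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_ergebnis_tab_info' },
);

# 't.html' is a locally saved copy of the page.
$te->parse_file('t.html');

# Dump the rows (label, value) rather than the columns,
# since the table is meant to be read row-by-row.
for my $table ( $te->tables ) {
    print Dump [ $table->rows ];
}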
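
To make the row-by-row idea more concrete, here is a minimal sketch of how each row could be turned into a label => value pair. It is only a guess at a workable approach, not a tested solution for the real page: the inline HTML is a made-up stand-in for the saved file, the labels and values in it are invented, and the %record hash is just a name I picked. Only the class name is taken from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical stand-in for the saved page. The labels and values are
# invented; only the class name comes from the code above.
my $html = <<'HTML';
<table class="bp_ergebnis_tab_info">
  <tr><td>Schulname</td><td>Beispielschule</td></tr>
  <tr><td>Ort</td><td>Musterstadt</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_ergebnis_tab_info' },
);
$te->parse($html);
$te->eof;    # tell the underlying HTML::Parser that the input is complete

# Walk the table row by row: first cell is the label, second the value.
my %record;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $label, $value ) = @$row;
        next unless defined $label;

        # Trim surrounding whitespace from both cells.
        for ( $label, $value ) {
            next unless defined;
            s/^\s+//;
            s/\s+$//;
        }
        $record{$label} = defined $value ? $value : '';
    }
}

print "$_: $record{$_}\n" for sort keys %record;
```

For the saved file, the parse and eof calls would be swapped for parse_file('t.html') as in the code above. A hash loses the original row order, so an array of [label, value] pairs may be the better choice if the order matters.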

Note: I want to parse a page like this one: http://www.schulministerium.nrw.de/BP/S ... pDO=154763

So for a first trial I save the HTML of the page and try the script out on the local copy.

Can you review the code and give some hints?

love to hear from you
regards
lin