[SOLVED] Extract Web Data

btfans
Forum Newbie
Posts: 22
Joined: Thu Jun 10, 2004 10:58 am

Post by btfans »

Hello,

I want to extract some data from an HTML page and reformat the data between the <pre> ... </pre> tags. Please help, as I am a newbie: this PHP code does not work.

Code: Select all

<?
	$file = "http://something.htm";
	$contents = file($file);
	$size = sizeof($contents);

	for($i = 0; $i < $size; $i++) {
		$alldata = $contents[$i];

		preg_match("/<pre.*?>(.+)<\/pre>/im",$alldata,$matches);
		print_r($matches);
	}

?>
The HTML page (encoded in Big5) is:

Code: Select all

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=big5">
<title>something</title>
</head>

<body bgcolor="#FFFFFF">
<p align="center"><img src="../../images_e/logo_dblue.gif" alt="logo" width="333" height="65">
<h1 align="center">report</h1>
<p><i>report detail</i>

<pre>
line 1:
tag2 : data2
tag3 : data3
tag4 : number n - data4

line 5
line 6
line 7
</pre>

</body>
</html>
Requirement:

The output string should show the following:

Code: Select all

line 1:
tag2 : data2
tag3 : data3
tag4 : data4
Thanks very much ....


feyd | Please use the code tags we've provided :: Posting Code in the Forums (http://forums.devnetwork.net/viewtopic.php?t=21171)
Last edited by btfans on Mon Jun 28, 2004 12:36 am, edited 1 time in total.
kettle_drum
DevNet Resident
Posts: 1150
Joined: Sun Jul 20, 2003 9:25 pm
Location: West Yorkshire, England

Post by kettle_drum »

If you're not good with regex, use a mixture of substr() and strpos() to cut away the data you don't want, or even explode().
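
A minimal sketch of the substr()/strpos() approach, assuming the page holds a single <pre> block. The helper name extract_between() and the inline sample page are made up for illustration; they are not from the original post.

```php
<?php
// Hypothetical helper: return the text between the first opening tag that
// starts with $openPrefix (e.g. "<pre") and the next $close, or null if absent.
function extract_between($html, $openPrefix, $close) {
    $start = stripos($html, $openPrefix);      // locate "<pre", case-insensitive
    if ($start === false) {
        return null;
    }
    $start = strpos($html, '>', $start);       // end of the opening tag
    if ($start === false) {
        return null;
    }
    $start++;                                  // first character after ">"
    $end = stripos($html, $close, $start);     // locate "</pre>"
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

// Sample page content, shortened from the post above.
$html = "<p><i>report detail</i>\n<pre>\nline 1:\ntag2 : data2\n</pre>\n</body>";
$body = extract_between($html, '<pre', '</pre>');
echo trim($body);
```

Checking each strpos() result against false with === matters here, because a match at position 0 is also "falsy".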
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

You can try switching your regex to:

Code: Select all

preg_match('#<pre[^>]*?'.'>(.*?)</pre>#is',$alldata,$matches);
[edit] Made a slight alteration to get the code to display correctly.
Last edited by feyd on Mon Jun 28, 2004 12:42 am, edited 1 time in total.
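
One thing worth noting: file() returns the page as an array of lines, and in the sample page <pre> and </pre> sit on different lines, so a pattern tested against one line at a time can never see both tags. A small sketch of the difference follows, using the pattern above written as one string. The inline $alldata is a stand-in for the real page and is only an assumption for the demo.

```php
<?php
// Stand-in for the downloaded page (in real use, the joined result of file()).
$alldata = "<body>\n<pre>\nline 1:\ntag2 : data2\n</pre>\n</body>\n";

// The "s" modifier lets "." match newlines; "i" makes the tags case-insensitive.
$pattern = '#<pre[^>]*?>(.*?)</pre>#is';

// Matching line by line: no single line contains both <pre> and </pre>.
$perLine = 0;
foreach (explode("\n", $alldata) as $line) {
    $perLine += preg_match($pattern, $line);
}

// Matching once against the whole page: the block is found.
$whole = preg_match($pattern, $alldata, $matches);

echo "per-line matches: $perLine\n";
echo "whole-page match: $whole\n";
```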
btfans
Forum Newbie
Posts: 22
Joined: Thu Jun 10, 2004 10:58 am

Post by btfans »

Will try.

Can anyone show me how to reformat

Code: Select all

line 1: 
tag2 : data2 
tag3 : data3 
tag4 : number n - data4 
line 5
line 6
line 7
 ...
to

Code: Select all

line 1: 
tag2 : data2 
tag3 : data3 
tag4 : data4
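
A rough sketch of one way to do this reformatting with a regex per line. It rests on two assumptions that are not stated in the thread: the wanted lines are exactly those containing a colon, and the unwanted "number n" part always ends with " - ". The function name reformat_report() is made up for the example.

```php
<?php
// Hypothetical helper: keep only lines with a colon and drop the
// "number n - " part that sits between the colon and the data.
function reformat_report($block) {
    $out = array();
    foreach (preg_split('/\r\n|\r|\n/', trim($block)) as $line) {
        $line = rtrim($line);
        if (strpos($line, ':') === false) {
            continue;                          // e.g. "line 5" or a blank line
        }
        // "tag4 : number n - data4" becomes "tag4 : data4";
        // lines without " - " are left unchanged.
        $out[] = preg_replace('/^([^:]*:\s*).*? - /', '$1', $line);
    }
    return implode("\n", $out);
}

$block = "line 1: \ntag2 : data2 \ntag3 : data3 \ntag4 : number n - data4 \nline 5\nline 6\nline 7\n";
$result = reformat_report($block);
echo $result;
```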
btfans
Forum Newbie
Posts: 22
Joined: Thu Jun 10, 2004 10:58 am

Post by btfans »

I modified the code to:


--------------------------------------------------------------------------------

<?
$file = "http://something.htm";
$contents = file($file);
$size = sizeof($contents);
$alldata = '';

for($i = 0; $i < $size; $i++) {
$alldata .= $contents[$i];

if (preg_match_all("|<pre.*?>(.*?)</pre>|is",$alldata,$matches));
{
$main = implode(' ',$matches[1]);
echo $main;
}
}
?>
--------------------------------------------------------------------------------



and the result is now:
--------------------------------------------------------------------------------

line 1: tag2 : data2tag3 : data3tag4 : number n - data4line 5line 6line 7
--------------------------------------------------------------------------------



and this repeats many times.

So my (silly) questions:
1) How do I get only the lines I want?
2) Why can't I see the "\n" line breaks?
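
Two observations on the code above. The preg_match_all() call sits inside the for loop, so it runs again every time a line is appended, and the ";" right after the if (...) makes the braces that follow run unconditionally; together that would explain the repeated output. As for "\n": the newlines are still in the string, but a browser collapses them when rendering HTML. A minimal sketch of making them visible, assuming the extracted text is in $text (a made-up variable for the demo):

```php
<?php
// Sample of what the <pre> match might contain.
$text = "line 1:\ntag2 : data2\ntag3 : data3";

// Escape HTML special characters first, then put a <br /> before each newline
// so the browser shows one item per line.
$html = nl2br(htmlspecialchars($text));

echo $html;
```

Wrapping the output in <pre> ... </pre> again would be another way to keep the line breaks visible.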
btfans
Forum Newbie
Posts: 22
Joined: Thu Jun 10, 2004 10:58 am

Post by btfans »

Hi,

Sorry for my inexperience with "\n" and PHP. I have now changed my code to:
<?
$file = "something.htm";
$contents = file($file);
$size = sizeof($contents);
$alldata=implode("\n", $contents);
preg_match_all("|<pre.*?>(.*?)</pre>|ism",$alldata,$matches);



foreach($matches[1] as $match)
{ $pieces = explode(":", $match);

echo "$pieces[0] <br>";
echo "$pieces[1] <br>";
echo "$pieces[2] <br>";
echo "$pieces[3] <br>";
echo nl2br ($pieces[4])."<br>\n";
}

?>
result:
line 1
tag2
data2tag3
data3tag4
number n - data4

line 5
How do I change it to:
line 1
tag2data2
tag3data3
tag4data4



and remove "number n -" from tag4?

Can I use
$pieces = explode(":\n", $match);

to extract all the parts between the ":"?

I don't quite understand how "\n" is handled in IE (is it ignored?).

Any advice is welcome.
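
For what it's worth, explode(":\n", $match) only splits where a colon is immediately followed by a newline, which in the sample happens only after "line 1". Also, file() keeps the newline at the end of each line, so implode("\n", $contents) doubles every line break; implode('', $contents) avoids that. One possible sketch is to split on "\n" first and then split each line on the first ":". The helper name parse_pairs() and the " - " separator are assumptions made for the example.

```php
<?php
// Hypothetical helper: turn "tag : data" lines into a tag => data array.
function parse_pairs($block) {
    $pairs = array();
    // Split into lines first; str_replace drops "\r" from Windows line endings.
    foreach (explode("\n", str_replace("\r", '', $block)) as $line) {
        $parts = explode(':', $line, 2);       // at most two pieces: tag, data
        if (count($parts) < 2) {
            continue;                          // no colon on this line, skip it
        }
        $data = trim($parts[1]);
        $dash = strpos($data, ' - ');          // assumed end of "number n"
        if ($dash !== false) {
            $data = trim(substr($data, $dash + 3));
        }
        $pairs[trim($parts[0])] = $data;
    }
    return $pairs;
}

// Stand-in for one entry of $matches[1].
$match = "\nline 1:\ntag2 : data2\ntag3 : data3\ntag4 : number n - data4\n\nline 5\nline 6\n";
$pairs = parse_pairs($match);

foreach ($pairs as $tag => $data) {
    echo htmlspecialchars($tag . $data), "<br>\n";
}
```

The "<br>" is what produces the visible line break in IE or any other browser; the "\n" after it only affects the page source.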
btfans
Forum Newbie
Posts: 22
Joined: Thu Jun 10, 2004 10:58 am

Post by btfans »

[SOLVED]