Html parsing and pattern matching

hydroxide
Forum Commoner
Posts: 77
Joined: Mon Jun 05, 2006 9:53 am

Post by hydroxide »

How could I parse an html file that follows a pattern identical to this:

Code: Select all

<b>Random Company Name</b><br>
Client ID: 12-23-111<br>
Processing Location: ftlauderdale, <<a href="mailto:mlewis@mycompany.com">mlewis@mycompany.com</a>><br>
 <<a href="mailto:lrivera@mycompany.com">lrivera@mycompany.com</a>><br>
Stuart, FL 34992<br>
Contact Name: Dorothy/George Johnson<br>
Contact Phone: 555 555-5555<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  Client requested to change pay period from Wed 5/24- Tues 5/30 to new dates of Mon 5/29 to Sun 6/4.  Also per Matt he was not aware that client's previous payroll company required a written 30 day notice prior to canceling their account.<br>
Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)<hr>
The names and numbers are obviously different in each entry. The file is a list of hundreds of entries like this one. How could I convert this into a flat file ready for insertion into a database?
Last edited by hydroxide on Tue Jun 06, 2006 6:48 am, edited 1 time in total.
TheMoose
Forum Contributor
Posts: 351
Joined: Tue May 23, 2006 10:42 am

Post by TheMoose »

Regular expressions can easily parse that, and then you can either (a) insert the parsed data into a database, or (b) create a flat file from it.

Regex Part I
Regex Part II
hydroxide
Forum Commoner
Posts: 77
Joined: Mon Jun 05, 2006 9:53 am

Post by hydroxide »

Even if there are hundreds of similar entries in the same file? I don't see how pattern matching could get all of the information I need... Also, how would it be possible to put it into a database table with columns like Contact_Name and Contact_Phone? How could php get the required information?
aerodromoi
Forum Contributor
Posts: 230
Joined: Sun May 07, 2006 5:21 am

Post by aerodromoi »

hydroxide wrote:Even if there are hundreds of similar entries in the same file? I don't see how pattern matching could get all of the information I need... Also, how would it be possible to put it into a database table with columns like Contact_Name and Contact_Phone? How could php get the required information?
Here's a basic example

Code: Select all

<?php
$string = "<b>bold text</b>Contact Phone: 555 555-5555<br>ipsum dolor sit amet</a>Contact Phone: 675 545-5555<br>";

preg_match_all("/(Contact Phone\:\s)([0-9-\s]*)(<br>)/is", $string, $matches, PREG_SET_ORDER);

for($i=0;$i<count($matches);$i++){
  print "phone no.".$i.": ".$matches[$i][2]."<br />";
}
?>
The script will output all contact phone numbers found in $string, provided that each one is preceded by "Contact Phone: " (including the trailing space) and followed by <br>, and that it consists only of digits, spaces and hyphens.
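
To make the indexing concrete, here is a minimal sketch of what $matches holds when PREG_SET_ORDER is used. The sample string and the variable name $phones are made up for illustration, and the hyphen is written last in the character class purely as a precaution for portability.

```php
<?php
// Sample input, shortened from the post above (illustrative only).
$string = "Contact Phone: 555 555-5555<br>text Contact Phone: 675 545-5555<br>";

// Group 1 = label, group 2 = the number, group 3 = the trailing <br>.
preg_match_all("/(Contact Phone:\s)([0-9\s-]*)(<br>)/i", $string, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER the first index is the match number and the
// second index is the capture group: $matches[match][group].
$phones = array();
foreach ($matches as $match) {
    $phones[] = $match[2];
}

print_r($phones);
```

Without the flag, preg_match_all() defaults to PREG_PATTERN_ORDER, which swaps the two indexes, so the first phone number would be at $matches[2][0] instead of $matches[0][2].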

aerodromoi
hydroxide
Forum Commoner
Posts: 77
Joined: Mon Jun 05, 2006 9:53 am

Post by hydroxide »

I'm really going around in circles now. I stripped all of the html tags out of the text file. It has 500 or so entries just like this one. How could I write a regexp that would go through this enormous file and return all of the matched values? I'm guessing I'd have to write ones to return the data for each category (contact name, etc). I need it to open the text file and read from there. I'm trying to figure it out with file();, but I'm not sure if that's the right way. Help?
[ABC Farming]
Client ID: 23-12-002
Processing Location: bradenton, <jjsmith@jjsmith.com>
<psmith@jjsmith.com>
Citra, FL 32113
Contact Name: Gaston Carbajal
Contact Phone: 352-361-5363
Client Original Call In Date: 5/31/06
Client Original Period Begin Date: 5/24/06
Client Orginal Period End Date: 5/30/06
Client Orginal Check Date: 6/02/06
Client Orginal Delivery Date: 6/02/06
Client New Call In Date: 6/07/06
Client New Period Begin Date: 5/31/06
Client New Period End Date: 6/06/06
Client New Check Date: 6/09/06
Client New Delivery Date: 6/09/06
Reason for false start:
Don't know, called him several times and did not call back.
Change date: Friday, June 02, 2006 at 08:16:55 (EDT)
________________________________________________________________________________

[CDF Roofing]
Client ID: 41-44-119
Processing Location: ftlauderdale, <mlewis@jjsmith.com>
<lrivera@jjsmith.com>
Stuart, FL 34992
Contact Name: Dorothy/George Johnson
Contact Phone: 772 260-1713
Client Original Call In Date: 05/31/06
Client Original Period Begin Date: 05/24/06
Client Orginal Period End Date: 05/30/06
Client Orginal Check Date: 06/02/06
Client Orginal Delivery Date: 06/02/06
Client New Call In Date: 06/05/06
Client New Period Begin Date: 05/29/06
Client New Period End Date: 06/04/06
Client New Check Date: 06/09/06
Client New Delivery Date: 06/09/06
Reason for false start:
1st False start: Client requested to change pay period from Wed 5/24- Tues 5/30 to new dates of Mon 5/29 to Sun 6/4. Also per Matt he was not aware that client's previous payroll company (Presidion) required a written 30 day notice prior to canceling their account.
Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)
aerodromoi
Forum Contributor
Posts: 230
Joined: Sun May 07, 2006 5:21 am

Post by aerodromoi »

hydroxide wrote:I'm really going around in circles now. I stripped all of the html tags out of the text file. It has 500 or so entries just like this one. How could I write a regexp that would go through this enormous file and return all of the matched values? I'm guessing I'd have to write ones to return the data for each category (contact name, etc).
That's correct. You could explode() the whole string on line breaks, though, which would give you an array containing every single line. However, as you surely don't want the string "Contact Phone: " in the field for a specific phone number, you'll have to use a regex for each piece of information you want to store in the database.
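
A rough sketch of the explode idea for the tag-stripped text, assuming each record ends with a line of underscores and each field sits on its own line as "Label: value". Instead of one regex per field, this variant splits each line at the first ": ", which is a swapped-in shortcut rather than the per-field regex approach. The helper name parse_records is made up for the example.

```php
<?php
// Hypothetical helper: turns the tag-stripped text into an array of records.
function parse_records($text)
{
    $records = array();
    // Assumption: records are separated by a line made only of underscores.
    foreach (preg_split('/^_+\s*$/m', $text) as $block) {
        $record = array();
        foreach (explode("\n", $block) as $line) {
            $line = trim($line);
            if (preg_match('/^\[(.+)\]$/', $line, $m)) {
                // The company name is the line in square brackets.
                $record['Company'] = $m[1];
            } elseif (strpos($line, ': ') !== false) {
                // Split "Label: value" at the first ": " only.
                list($label, $value) = explode(': ', $line, 2);
                $record[$label] = $value;
            }
        }
        if (!empty($record)) {
            $records[] = $record;
        }
    }
    return $records;
}

$text = "[ABC Farming]\nClient ID: 23-12-002\nContact Name: Gaston Carbajal\nContact Phone: 352-361-5363\n"
      . "________\n\n"
      . "[CDF Roofing]\nClient ID: 41-44-119\nContact Name: Dorothy/George Johnson\nContact Phone: 772 260-1713\n";

$records = parse_records($text);
print $records[1]['Contact Name'] . "\n";
```

One caveat of this shortcut: a free-text line that happens to contain ": " (such as the explanation under "Reason for false start:") is also treated as a field, so those lines would need separate handling.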

aerodromoi
hydroxide
Forum Commoner
Posts: 77
Joined: Mon Jun 05, 2006 9:53 am

Post by hydroxide »

I'm not exactly sure what you're saying. What could I do to allow a regexp to search and return matches for the entire text file? I'm kind of a newb (obviously :( ), so could you try to explain it less ambiguously?
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

It might not make sense to you, hydroxide, but you actually want the HTML tags in the file at this point. The regular expression scans the whole string and reads everything that matches the given pattern into an array for use later. Because only the captured part between the label and the <br> is kept, the HTML is effectively stripped as it goes. What aerodromoi suggested is the best method for doing what you want.
aerodromoi
Forum Contributor
Posts: 230
Joined: Sun May 07, 2006 5:21 am

Post by aerodromoi »

Assuming there is only one name/phone number per record:

Code: Select all

<?php
$string = "<b>Random Company Name</b><br>
Client ID: 12-23-111<br>
Contact Name: Dorothy/George Johnson<br>
Contact Phone: 555 555-5555<br>";

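// Hyphen moved to the end of each character class: a hyphen between \s and
// another character is rejected as an invalid range by newer PCRE versions.
preg_match_all("/(<b>)([a-zA-Z0-9\s_.:-]*)(<\/b>)/is", $string, $matchesheader, PREG_SET_ORDER);
preg_match_all("/(Client ID\:\s)([0-9\s-]*)(<br>)/is", $string, $matchesid, PREG_SET_ORDER);
preg_match_all("/(Contact Phone\:\s)([0-9\s-]*)(<br>)/is", $string, $matchesphone, PREG_SET_ORDER);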


print "header: ".$matchesheader[0][2]."<br />";
print "client id: ".$matchesid[0][2]."<br />";
print "phone: ".$matchesphone[0][2]."<br />";
?>
will print out

Code: Select all

header: Random Company Name
client id: 12-23-111
phone: 555 555-5555
Stripping the string of the html tags only makes it harder to retrieve the pieces you want.

aerodromoi
hydroxide
Forum Commoner
Posts: 77
Joined: Mon Jun 05, 2006 9:53 am

Post by hydroxide »

How can I, instead of just putting in a small bit of information, search an entire large file? Paste the whole thing into the script?

By the way, I really appreciate the help you've given me so far.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Code: Select all

preg_match_all("/(<b>)([a-zA-Z0-9\s-_\.\:]*)(<\/b>)/is", $string, $match['header'], PREG_SET_ORDER);
preg_match_all("/(Client ID\:\s)([0-9-\s]*)(<br>)/is", $string, $match['id'], PREG_SET_ORDER);
preg_match_all("/(Contact Phone\:\s)([0-9-\s]*)(<br>)/is", $string, $match['phone'], PREG_SET_ORDER);

for ($x = 0; $x <= count($match['id']); $x++)
{
   print "header: ".$match['header'][0][$x]."<br />";
   print "client id: ".$match['id'][0][$x]."<br />";
   print "phone: ".$match['phone'][0][$x]."<br />";    
}
Perhaps something like this.. although you are probably better off using a single preg call instead of multiple.
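
One way the single-call idea might look, as a sketch only: a single pattern with named groups that captures the header, ID and phone of each record in one pass. It assumes every record contains all three fields in this order; the group names header, id and phone and the sample data are chosen for the example.

```php
<?php
$string = "<b>Random Company Name</b><br>\nClient ID: 12-23-111<br>\nContact Name: Dorothy/George Johnson<br>\nContact Phone: 555 555-5555<br>\n<hr>\n"
        . "<b>Another Company</b><br>\nClient ID: 1234567<br>\nContact Name: James Bulter<br>\nContact Phone: 999 999-999<br>\n<hr>";

// One pattern for the whole record. The lazy .*? skips the lines between
// the fields; the s modifier lets the dot match newlines as well.
$pattern = '/<b>(?P<header>[^<]*)<\/b>.*?'
         . 'Client ID:\s(?P<id>[0-9\s-]*)<br>.*?'
         . 'Contact Phone:\s(?P<phone>[0-9\s-]*)<br>/is';

preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

// Each $m is one record; the named groups are available as array keys.
foreach ($matches as $i => $m) {
    print $i . ": " . $m['header'] . " | " . $m['id'] . " | " . $m['phone'] . "\n";
}
```

The trade-off is that a record missing one of the fields would let the lazy .*? run on into the next record, so this only suits data where every entry is complete.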
hydroxide
Forum Commoner
Posts: 77
Joined: Mon Jun 05, 2006 9:53 am

Post by hydroxide »

I tried to use file(); and fopen, but neither one would allow me to match through the file with those regular expressions. How could I make this work? I tried reading the documentation, but to no avail.
hydroxide
Forum Commoner
Posts: 77
Joined: Mon Jun 05, 2006 9:53 am

Post by hydroxide »

Also, when I tried to use your example, Jcart, it did not work properly:

Code: Select all

<?php
$string = "<b>Random Company Name</b><br>
Client ID: 12-23-111<br>
Processing Location: ftlauderdale, <<a href='mailto:mlewis@mycompany.com'>mlewis@mycompany.com</a>><br>
 <<a href='mailto:lrivera@mycompany.com'>lrivera@mycompany.com</a>><br>
Stuart, FL 34992<br>
Contact Name: Dorothy/George Johnson<br>
Contact Phone: 555 555-5555<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  Client requested to change pay period from Wed 5/24- Tues 5/30 to new dates of Mon 5/29 to Sun 6/4.  Also per Matt he was not aware that clients previous payroll company required a written 30 day notice prior to canceling their account.<br>
Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)<hr> 

<b>Another Company</b><br>
Client ID: 1234567<br>
Processing Location: ftlauderdale, <<a href='mailto:mlewis@mycompany.com'>mlewis@mycompany.com</a>><br>
 <<a href='mailto:lrivera@mycompany.com'>lrivera@mycompany.com</a>><br>
Stuart, FL 34999<br>
Contact Name: James Bulter<br>
Contact Phone: 999 999-999<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  More random junksjsabebfeia<br>
Change date: Thursday, June 07, 2006 at 16:23:12 (EDT)<hr> ";

preg_match_all("/(<b>)([a-zA-Z0-9\s-_\.\:]*)(<\/b>)/is", $string, $match['header'], PREG_SET_ORDER);
preg_match_all("/(Client ID\:\s)([0-9-\s]*)(<br>)/is", $string, $match['id'], PREG_SET_ORDER);
preg_match_all("/(Contact Phone\:\s)([0-9-\s]*)(<br>)/is", $string, $match['phone'], PREG_SET_ORDER);

for ($x = 0; $x <= count($match['id']); $x++)
{
   print " ".$match['header'][0][$x]."<br />";
   print " ".$match['id'][0][$x]."<br />";
   print " ".$match['phone'][0][$x]."<br />";   
}
?>
This script output:
Random Company Name
Client ID: 12-23-111

Contact Phone: 555 555-5555


Client ID:
Contact Phone:
Random Company Name
12-23-111
555 555-5555
Which isn't correct... and I'm not sure what's wrong. I need to have it match and display multiple results, which it won't do, obviously.

What I'm trying to figure out is how to make it search a text file for these matches and not just what I put in the actual script, which would then display all matches, not just the first ones.


Thanks again for all the help guys.
aerodromoi
Forum Contributor
Posts: 230
Joined: Sun May 07, 2006 5:21 am

Post by aerodromoi »

hydroxide wrote: Which isn't correct... and I'm not sure what's wrong. I need to have it match and display multiple results, which it won't do, obviously.

What I'm trying to figure out is how to make it search a text file for these matches and not just what I put in the actual script, which would then display all matches, not just the first ones.


Thanks again for all the help guys.

Here's a revised version:

Code: Select all

<?php
$string = "<b>Random Company Name</b><br>
Client ID: 12-23-111<br>
Processing Location: ftlauderdale, <<a href='mailto:mlewis@mycompany.com'>mlewis@mycompany.com</a>><br>
 <<a href='mailto:lrivera@mycompany.com'>lrivera@mycompany.com</a>><br>
Stuart, FL 34992<br>
Contact Name: Dorothy/George Johnson<br>
Contact Phone: 555 555-5555<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  Client requested to change pay period from Wed 5/24- Tues 5/30 to new dates of Mon 5/29 to Sun 6/4.  Also per Matt he was not aware that clients previous payroll company required a written 30 day notice prior to canceling their account.<br>
Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)<hr>

<b>Another Company</b><br>
Client ID: 1234567<br>
Processing Location: ftlauderdale, <<a href='mailto:mlewis@mycompany.com'>mlewis@mycompany.com</a>><br>
 <<a href='mailto:lrivera@mycompany.com'>lrivera@mycompany.com</a>><br>
Stuart, FL 34999<br>
Contact Name: James Bulter<br>
Contact Phone: 999 999-999<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  More random junksjsabebfeia<br>
Change date: Thursday, June 07, 2006 at 16:23:12 (EDT)<hr> ";


// Hyphen moved to the end of each character class: a hyphen between \s and
// another character is rejected as an invalid range by newer PCRE versions.
preg_match_all("/(<b>)([a-zA-Z0-9\s_.:-]*)(<\/b>)/is", $string, $matches['header'], PREG_SET_ORDER);
preg_match_all("/(Client ID\:\s)([0-9\s-]*)(<br>)/is", $string, $matches['id'], PREG_SET_ORDER);
preg_match_all("/(Contact Name\:\s)([a-zA-Z_\/.:\s-]*)(<br>)/is", $string, $matches['name'], PREG_SET_ORDER);
preg_match_all("/(Contact Phone\:\s)([0-9\s-]*)(<br>)/is", $string, $matches['phone'], PREG_SET_ORDER);

for($i=0;$i<count($matches['header']);$i++){
  print "<h2>".$i.":</h2>\n";
  print "header: ".$matches['header'][$i][2]."<br />";
  print "client id: ".$matches['id'][$i][2]."<br />";
  print "name: ".$matches['name'][$i][2]."<br />";  
  print "phone: ".$matches['phone'][$i][2]."<br />";
  print "<br />\n";
}
?>
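However, this approach only works if every regular expression matches in every record! If one of them misses an entry, the results for all the following entries no longer line up.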
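
The caveat above matters because the result arrays are only lined up by position. A possible way around that, sketched here under the assumption that each record ends with <hr>, is to split the string into records first and match the fields inside each record. The function name extract_fields and the field keys are hypothetical.

```php
<?php
// Hypothetical helper: returns one array per record; a field that cannot
// be matched becomes null instead of shifting the other records.
function extract_fields($html)
{
    $patterns = array(
        'header' => '/<b>([^<]*)<\/b>/i',
        'id'     => '/Client ID:\s([0-9\s-]*)<br>/i',
        'name'   => '/Contact Name:\s([^<]*)<br>/i',
        'phone'  => '/Contact Phone:\s([0-9\s-]*)<br>/i',
    );

    $rows = array();
    // Assumption: every record is terminated by an <hr> tag.
    foreach (preg_split('/<hr\s*\/?>/i', $html) as $record) {
        if (trim($record) === '') {
            continue; // skip the empty piece after the last <hr>
        }
        $row = array();
        foreach ($patterns as $field => $pattern) {
            $row[$field] = preg_match($pattern, $record, $m) ? trim($m[1]) : null;
        }
        $rows[] = $row;
    }
    return $rows;
}

$html = "<b>Random Company Name</b><br>\nClient ID: 12-23-111<br>\nContact Name: Dorothy/George Johnson<br>\nContact Phone: 555 555-5555<br><hr>\n"
      . "<b>Another Company</b><br>\nClient ID: 1234567<br>\nContact Phone: 999 999-999<br><hr> ";

$rows = extract_fields($html);
var_dump($rows[1]['name']); // the second sample record has no Contact Name line
```

Each row stays self-contained this way, so an incomplete entry can be spotted (or skipped) before anything is written to the database.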

As to reading the flat file into $string instead of pasting it into the script:

Code: Select all

$backend = "source.txt";
if (!file_exists($backend)) die("Sorry - the file you specified does not exist!");
$handle   = fopen($backend, "r");
$string   = fread($handle, filesize($backend));
fclose($handle);
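
Putting the pieces together, a minimal sketch of going from the source file to a flat file might look like this. It assumes the HTML version of the data and writes comma-separated lines with fputcsv(), a format most databases can import. The function name html_to_csv, the temporary file names and the column choice are placeholders, not part of the thread.

```php
<?php
// Hypothetical helper: reads $source, writes one CSV line per record
// to $target and returns the number of records written.
function html_to_csv($source, $target)
{
    $string = file_get_contents($source); // reads the whole file at once
    if ($string === false) {
        return 0;
    }

    $out = fopen($target, 'w');
    // Column names are placeholders; separator, enclosure and escape
    // characters are passed explicitly.
    fputcsv($out, array('Company', 'Client_ID', 'Contact_Name', 'Contact_Phone'), ',', '"', '\\');

    $count = 0;
    // Assumption: every record ends with <hr>.
    foreach (preg_split('/<hr\s*\/?>/i', $string) as $record) {
        if (!preg_match('/<b>([^<]*)<\/b>/i', $record, $header)) {
            continue; // not a record (e.g. trailing whitespace)
        }
        $row = array(trim($header[1]));
        foreach (array('Client ID', 'Contact Name', 'Contact Phone') as $label) {
            $found = preg_match('/' . preg_quote($label, '/') . ':\s*([^<]*)<br>/i', $record, $m);
            $row[] = $found ? trim($m[1]) : '';
        }
        fputcsv($out, $row, ',', '"', '\\');
        $count++;
    }
    fclose($out);
    return $count;
}

// Demo with temporary files standing in for the real export.
$source = tempnam(sys_get_temp_dir(), 'src');
$target = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($source, "<b>Random Company Name</b><br>\nClient ID: 12-23-111<br>\nContact Name: Dorothy/George Johnson<br>\nContact Phone: 555 555-5555<br><hr>\n");

$written = html_to_csv($source, $target);
print file_get_contents($target);
```

The remaining date fields could be added to the label list in the same way, as long as their labels are spelled exactly as they appear in the file.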
aerodromoi