
HTML parsing and pattern matching

Posted: Mon Jun 05, 2006 9:57 am
by hydroxide
How could I parse an HTML file that follows a pattern identical to this:

Code:

<b>Random Company Name</b><br>
Client ID: 12-23-111<br>
Processing Location: ftlauderdale, <<a href="mailto:mlewis@mycompany.com">mlewis@mycompany.com</a>><br>
 <<a href="mailto:lrivera@mycompany.com">lrivera@mycompany.com</a>><br>
Stuart, FL 34992<br>
Contact Name: Dorothy/George Johnson<br>
Contact Phone: 555 555-5555<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  Client requested to change pay period from Wed 5/24- Tues 5/30 to new dates of Mon 5/29 to Sun 6/4.  Also per Matt he was not aware that client's previous payroll company required a written 30 day notice prior to canceling their account.<br>
Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)<hr>
The names and numbers are obviously different in each entry. The file is a list of hundreds of entries in this form. How could I convert it into a flat file ready for insertion into a database?

Posted: Mon Jun 05, 2006 10:05 am
by TheMoose
Regular expressions can easily parse that, and then you can either (a) insert the parsed data into a database or (b) create a flat file from it.

Regex Part I
Regex Part II

Posted: Mon Jun 05, 2006 10:24 am
by hydroxide
Even if there are hundreds of similar entries in the same file? I don't see how pattern matching could get all of the information I need... Also, how would it be possible to put it into a database table with columns like Contact_Name and Contact_Phone? How could PHP get the required information?

Posted: Mon Jun 05, 2006 11:09 am
by aerodromoi
hydroxide wrote: Even if there are hundreds of similar entries in the same file? I don't see how pattern matching could get all of the information I need... Also, how would it be possible to put it into a database table with columns like Contact_Name and Contact_Phone? How could PHP get the required information?
Here's a basic example:

Code:

<?php
$string = "<b>bold text</b>Contact Phone: 555 555-5555<br>ipsum dolor sit amet</a>Contact Phone: 675 545-5555<br>";

preg_match_all("/(Contact Phone\:\s)([0-9-\s]*)(<br>)/is", $string, $matches, PREG_SET_ORDER);

for($i=0;$i<count($matches);$i++){
  print "phone no.".$i.": ".$matches[$i][2]."<br />";
}
?>
The script will output all contact phone numbers found in $string, provided that each one is preceded by "Contact Phone: " and followed by <br>, and that it consists only of digits, spaces and hyphens.

aerodromoi

Posted: Mon Jun 05, 2006 11:29 am
by hydroxide
I'm really going around in circles now. I stripped all of the HTML tags out of the text file. It has 500 or so entries just like the ones below. How could I write a regexp that would go through this enormous file and return all of the matched values? I'm guessing I'd have to write one for each category (contact name, etc.). I need it to open the text file and read from there. I'm trying to figure it out with file(), but I'm not sure if that's the right way. Help?
[ABC Farming]
Client ID: 23-12-002
Processing Location: bradenton, <jjsmith@jjsmith.com>
<psmith@jjsmith.com>
Citra, FL 32113
Contact Name: Gaston Carbajal
Contact Phone: 352-361-5363
Client Original Call In Date: 5/31/06
Client Original Period Begin Date: 5/24/06
Client Orginal Period End Date: 5/30/06
Client Orginal Check Date: 6/02/06
Client Orginal Delivery Date: 6/02/06
Client New Call In Date: 6/07/06
Client New Period Begin Date: 5/31/06
Client New Period End Date: 6/06/06
Client New Check Date: 6/09/06
Client New Delivery Date: 6/09/06
Reason for false start:
Don't know, called him several times and did not call back.
Change date: Friday, June 02, 2006 at 08:16:55 (EDT)
________________________________________________________________________________

[CDF Roofing]
Client ID: 41-44-119
Processing Location: ftlauderdale, <mlewis@jjsmith.com>
<lrivera@jjsmith.com>
Stuart, FL 34992
Contact Name: Dorothy/George Johnson
Contact Phone: 772 260-1713
Client Original Call In Date: 05/31/06
Client Original Period Begin Date: 05/24/06
Client Orginal Period End Date: 05/30/06
Client Orginal Check Date: 06/02/06
Client Orginal Delivery Date: 06/02/06
Client New Call In Date: 06/05/06
Client New Period Begin Date: 05/29/06
Client New Period End Date: 06/04/06
Client New Check Date: 06/09/06
Client New Delivery Date: 06/09/06
Reason for false start:
1st False start: Client requested to change pay period from Wed 5/24- Tues 5/30 to new dates of Mon 5/29 to Sun 6/4. Also per Matt he was not aware that client's previous payroll company (Presidion) required a written 30 day notice prior to canceling their account.
Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)

Posted: Mon Jun 05, 2006 11:35 am
by aerodromoi
hydroxide wrote: I'm really going around in circles now. I stripped all of the HTML tags out of the text file. It has 500 or so entries just like the ones below. How could I write a regexp that would go through this enormous file and return all of the matched values? I'm guessing I'd have to write one for each category (contact name, etc.).
That's correct. You could explode() the whole string on newlines, though, which would give you an array containing every single line. However, as you surely don't want the string "Contact Phone: " in the field for a specific phone number, you'll have to use a regex for each piece of information you want to store in the database.
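For the tag-free text, here is a rough sketch of the explode() idea. It assumes every field sits on its own line in the form "Label: value"; the sample text and the $fields array are made up for illustration. Splitting each line at the first ": " drops the label without a separate regex per field.

```php
<?php
// Hypothetical sample: one shortened record in the stripped (tag-free) format.
$text = "[ABC Farming]\n"
      . "Client ID: 23-12-002\n"
      . "Contact Name: Gaston Carbajal\n"
      . "Contact Phone: 352-361-5363\n"
      . "Change date: Friday, June 02, 2006 at 08:16:55 (EDT)\n";

$fields = array();
foreach (explode("\n", $text) as $line) {
    // Split at the first ": " only, so times such as 08:16:55 stay intact.
    $parts = explode(': ', $line, 2);
    if (count($parts) === 2) {
        $fields[trim($parts[0])] = trim($parts[1]);
    }
}

print "name: " . $fields['Contact Name'] . "\n";
print "phone: " . $fields['Contact Phone'] . "\n";
```

Free-text lines that happen to contain ": " (for example "1st False start: ...") would also be picked up as fields, so this only suits the strictly labelled lines.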

aerodromoi

Posted: Mon Jun 05, 2006 11:42 am
by hydroxide
I'm not exactly sure what you're saying. What could I do to allow a regexp to search and return matches for the entire text file? I'm kind of a newb (obviously :( ). Could you try to explain it less ambiguously?

Posted: Mon Jun 05, 2006 12:18 pm
by RobertGonzalez
It might not make sense to you, hydroxide, but you actually want the HTML tags in the file at this point. The regular expression scans the text and reads everything that matches the given pattern into an array for later use. Because only the captured text is kept, the HTML is effectively stripped as it goes. What aerodromoi suggested is the best method for doing what you want.

Posted: Mon Jun 05, 2006 1:05 pm
by aerodromoi
Assuming there is only one name/phone number per record:

Code:

<?php
$string = "<b>Random Company Name</b><br>
Client ID: 12-23-111<br>
Contact Name: Dorothy/George Johnson<br>
Contact Phone: 555 555-5555<br>";

preg_match_all("/(<b>)([a-zA-Z0-9\s_.:-]*)(<\/b>)/is", $string, $matchesheader, PREG_SET_ORDER);
preg_match_all("/(Client ID\:\s)([0-9-\s]*)(<br>)/is", $string, $matchesid, PREG_SET_ORDER);
preg_match_all("/(Contact Phone\:\s)([0-9-\s]*)(<br>)/is", $string, $matchesphone, PREG_SET_ORDER);


print "header: ".$matchesheader[0][2]."<br />";
print "client id: ".$matchesid[0][2]."<br />";
print "phone: ".$matchesphone[0][2]."<br />";
?>
will print out

Code:

header: Random Company Name
client id: 12-23-111
phone: 555 555-5555
Stripping the HTML tags from the string only makes it harder to retrieve the pieces you want.

aerodromoi

Posted: Mon Jun 05, 2006 7:13 pm
by hydroxide
How can I, instead of just putting in a small bit of information, search an entire large file? Paste the whole thing into the script?

By the way, I really appreciate the help you've given me so far.

Posted: Mon Jun 05, 2006 7:33 pm
by John Cartwright

Code:

preg_match_all("/(<b>)([a-zA-Z0-9\s_.:-]*)(<\/b>)/is", $string, $match['header'], PREG_SET_ORDER);
preg_match_all("/(Client ID\:\s)([0-9-\s]*)(<br>)/is", $string, $match['id'], PREG_SET_ORDER);
preg_match_all("/(Contact Phone\:\s)([0-9-\s]*)(<br>)/is", $string, $match['phone'], PREG_SET_ORDER);

for ($x = 0; $x <= count($match['id']); $x++)
{
   print "header: ".$match['header'][0][$x]."<br />";
   print "client id: ".$match['id'][0][$x]."<br />";
   print "phone: ".$match['phone'][0][$x]."<br />";    
}
Perhaps something like this, although you are probably better off using a single preg call instead of several.
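A minimal sketch of the single-call idea, assuming every record contains all of the fields and always in the same order. The sample string, the named groups and the $records variable are illustrative only, not taken from the real file.

```php
<?php
// Hypothetical sample: two shortened records in the HTML format from the first post.
$string = "<b>Random Company Name</b><br>\n"
        . "Client ID: 12-23-111<br>\n"
        . "Contact Name: Dorothy/George Johnson<br>\n"
        . "Contact Phone: 555 555-5555<br>\n"
        . "Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)<hr>\n"
        . "<b>Another Company</b><br>\n"
        . "Client ID: 1234567<br>\n"
        . "Contact Name: James Bulter<br>\n"
        . "Contact Phone: 999 999-999<br>\n"
        . "Change date: Thursday, June 07, 2006 at 16:23:12 (EDT)<hr>\n";

// One pattern covers a whole record; .*? lazily skips the lines in between.
// Flags: i = case-insensitive, s = the dot also matches newlines.
$pattern = '/<b>(?P<header>[^<]*)<\/b>.*?'
         . 'Client ID:\s(?P<id>[0-9 -]*)<br>.*?'
         . 'Contact Name:\s(?P<name>[^<]*)<br>.*?'
         . 'Contact Phone:\s(?P<phone>[0-9 -]*)<br>/is';

// With PREG_SET_ORDER each element of $records is one complete record.
preg_match_all($pattern, $string, $records, PREG_SET_ORDER);

foreach ($records as $i => $record) {
    print "<h2>" . $i . ":</h2>\n";
    print "header: " . $record['header'] . "<br />\n";
    print "client id: " . $record['id'] . "<br />\n";
    print "name: " . $record['name'] . "<br />\n";
    print "phone: " . $record['phone'] . "<br />\n";
}
```

If a record is missing one of the fields, the lazy .*? will run on into the next record, so this only works when the layout really is identical for every entry.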

Posted: Tue Jun 06, 2006 6:48 am
by hydroxide
I tried to use file() and fopen(), but neither one would allow me to match through the file with those regular expressions. How could I make this work? I tried reading the documentation, but to no avail.

Posted: Tue Jun 06, 2006 7:44 am
by hydroxide
Also, when I tried to use your example, Jcart, it did not work properly:

Code:

<?php
$string = "<b>Random Company Name</b><br>
Client ID: 12-23-111<br>
Processing Location: ftlauderdale, <<a href='mailto:mlewis@mycompany.com'>mlewis@mycompany.com</a>><br>
 <<a href='mailto:lrivera@mycompany.com'>lrivera@mycompany.com</a>><br>
Stuart, FL 34992<br>
Contact Name: Dorothy/George Johnson<br>
Contact Phone: 555 555-5555<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  Client requested to change pay period from Wed 5/24- Tues 5/30 to new dates of Mon 5/29 to Sun 6/4.  Also per Matt he was not aware that clients previous payroll company required a written 30 day notice prior to canceling their account.<br>
Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)<hr> 

<b>Another Company</b><br>
Client ID: 1234567<br>
Processing Location: ftlauderdale, <<a href='mailto:mlewis@mycompany.com'>mlewis@mycompany.com</a>><br>
 <<a href='mailto:lrivera@mycompany.com'>lrivera@mycompany.com</a>><br>
Stuart, FL 34999<br>
Contact Name: James Bulter<br>
Contact Phone: 999 999-999<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  More random junksjsabebfeia<br>
Change date: Thursday, June 07, 2006 at 16:23:12 (EDT)<hr> ";

preg_match_all("/(<b>)([a-zA-Z0-9\s_.:-]*)(<\/b>)/is", $string, $match['header'], PREG_SET_ORDER);
preg_match_all("/(Client ID\:\s)([0-9-\s]*)(<br>)/is", $string, $match['id'], PREG_SET_ORDER);
preg_match_all("/(Contact Phone\:\s)([0-9-\s]*)(<br>)/is", $string, $match['phone'], PREG_SET_ORDER);

for ($x = 0; $x <= count($match['id']); $x++)
{
   print " ".$match['header'][0][$x]."<br />";
   print " ".$match['id'][0][$x]."<br />";
   print " ".$match['phone'][0][$x]."<br />";   
}
?>
This script output:
Random Company Name
Client ID: 12-23-111

Contact Phone: 555 555-5555


Client ID:
Contact Phone:
Random Company Name
12-23-111
555 555-5555
Which isn't correct... and I'm not sure what's wrong. I need to have it match and display multiple results, which it won't do, obviously.

What I'm trying to figure out is how to make it search a text file for these matches and not just what I put in the actual script, which would then display all matches, not just the first ones.


Thanks again for all the help guys.

Posted: Tue Jun 06, 2006 9:22 am
by aerodromoi
hydroxide wrote: Which isn't correct... and I'm not sure what's wrong. I need to have it match and display multiple results, which it won't do, obviously.

What I'm trying to figure out is how to make it search a text file for these matches and not just what I put in the actual script, which would then display all matches, not just the first ones.


Thanks again for all the help guys.

Here's a revised version:

Code:

<?php
$string = "<b>Random Company Name</b><br>
Client ID: 12-23-111<br>
Processing Location: ftlauderdale, <<a href='mailto:mlewis@mycompany.com'>mlewis@mycompany.com</a>><br>
 <<a href='mailto:lrivera@mycompany.com'>lrivera@mycompany.com</a>><br>
Stuart, FL 34992<br>
Contact Name: Dorothy/George Johnson<br>
Contact Phone: 555 555-5555<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  Client requested to change pay period from Wed 5/24- Tues 5/30 to new dates of Mon 5/29 to Sun 6/4.  Also per Matt he was not aware that clients previous payroll company required a written 30 day notice prior to canceling their account.<br>
Change date: Thursday, June 01, 2006 at 16:23:12 (EDT)<hr>

<b>Another Company</b><br>
Client ID: 1234567<br>
Processing Location: ftlauderdale, <<a href='mailto:mlewis@mycompany.com'>mlewis@mycompany.com</a>><br>
 <<a href='mailto:lrivera@mycompany.com'>lrivera@mycompany.com</a>><br>
Stuart, FL 34999<br>
Contact Name: James Bulter<br>
Contact Phone: 999 999-999<br>
Client Original Call In Date: 05/31/06<br>
Client Original Period Begin Date: 05/24/06<br>
Client Orginal Period End Date: 05/30/06<br>
Client Orginal Check Date: 06/02/06<br>
Client Orginal Delivery Date: 06/02/06<br>
Client New Call In Date: 06/05/06<br>
Client New Period Begin Date: 05/29/06<br>
Client New Period End Date: 06/04/06<br>
Client New Check Date: 06/09/06<br>
Client New Delivery Date: 06/09/06<br>
<u>Reason for false start:</u><br>
1st False start:  More random junksjsabebfeia<br>
Change date: Thursday, June 07, 2006 at 16:23:12 (EDT)<hr> ";


preg_match_all("/(<b>)([a-zA-Z0-9\s_.:-]*)(<\/b>)/is", $string, $matches['header'], PREG_SET_ORDER);
preg_match_all("/(Client ID\:\s)([0-9-\s]*)(<br>)/is", $string, $matches['id'], PREG_SET_ORDER);
preg_match_all("/(Contact Name\:\s)([a-zA-Z-_\/\.\:\s]*)(<br>)/is", $string, $matches['name'], PREG_SET_ORDER);
preg_match_all("/(Contact Phone\:\s)([0-9-\s]*)(<br>)/is", $string, $matches['phone'], PREG_SET_ORDER);

for($i=0;$i<count($matches['header']);$i++){
  print "<h2>".$i.":</h2>\n";
  print "header: ".$matches['header'][$i][2]."<br />";
  print "client id: ".$matches['id'][$i][2]."<br />";
  print "name: ".$matches['name'][$i][2]."<br />";  
  print "phone: ".$matches['phone'][$i][2]."<br />";
  print "<br />\n";
}
?>
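However, this approach requires that every regular expression matches in every record! If one field fails to match somewhere, its array comes out shorter and the indexes no longer line up across fields.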
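One way to avoid depending on every pattern matching in every record is to split the text into records first and then look for each field inside its own record. The following is only a sketch: it assumes each record ends with a lowercase <hr>, and extract_field() is a made-up helper name.

```php
<?php
// Hypothetical helper: returns the first capture group, or null if the field is missing.
function extract_field($pattern, $record)
{
    return preg_match($pattern, $record, $m) ? trim($m[1]) : null;
}

// Hypothetical sample: the second record has no "Contact Phone" line.
$string = "<b>Random Company Name</b><br>\n"
        . "Client ID: 12-23-111<br>\n"
        . "Contact Name: Dorothy/George Johnson<br>\n"
        . "Contact Phone: 555 555-5555<br>\n"
        . "<hr>\n"
        . "<b>Another Company</b><br>\n"
        . "Client ID: 1234567<br>\n"
        . "Contact Name: James Bulter<br>\n"
        . "<hr>\n";

$rows = array();
foreach (explode('<hr>', $string) as $record) {
    if (trim($record) === '') {
        continue; // skip the empty piece after the last <hr>
    }
    $rows[] = array(
        'header' => extract_field('/<b>([^<]*)<\/b>/i', $record),
        'id'     => extract_field('/Client ID:\s([0-9 -]*)<br>/i', $record),
        'name'   => extract_field('/Contact Name:\s([^<]*)<br>/i', $record),
        'phone'  => extract_field('/Contact Phone:\s([0-9 -]*)<br>/i', $record),
    );
}

foreach ($rows as $i => $row) {
    print $i . ": " . $row['header'] . " / "
        . ($row['phone'] === null ? "(no phone found)" : $row['phone']) . "\n";
}
```

A missing field then shows up as null in that one record instead of shifting the values of all the records after it.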

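As for reading the text file into $string:

Code:

$backend = "source.txt";
// Stop early if the source file is missing.
if (!file_exists($backend)) die("Sorry - the file you specified does not exist!");
// Read the whole file into $string so the regular expressions can run over it.
$handle   = fopen($backend, "r");
$string   = fread($handle, filesize($backend));
fclose($handle);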
aerodromoi
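To get from the parsed values to a flat file that a database can import, something along these lines might work. It is a sketch under assumptions: the input is written to a temporary file here only so the example runs on its own (in practice $source would point at the real file), the output path is arbitrary, and the column names Company and Client_ID are invented next to the Contact_Phone column mentioned earlier in the thread.

```php
<?php
// Stand-in for the real input file: write a small sample to a temporary file.
$source = tempnam(sys_get_temp_dir(), 'src');
file_put_contents(
    $source,
    "<b>Random Company Name</b><br>\nClient ID: 12-23-111<br>\n"
    . "Contact Phone: 555 555-5555<br>\n<hr>\n"
    . "<b>Another Company</b><br>\nClient ID: 1234567<br>\n"
    . "Contact Phone: 999 999-999<br>\n<hr>\n"
);

// Read the whole file into one string.
$string = file_get_contents($source);

// Hypothetical output location for the flat (CSV) file.
$csvPath = tempnam(sys_get_temp_dir(), 'csv');
$out = fopen($csvPath, 'w');
fputcsv($out, array('Company', 'Client_ID', 'Contact_Phone'), ',', '"', '\\');

foreach (explode('<hr>', $string) as $record) {
    if (!preg_match('/<b>([^<]*)<\/b>/i', $record, $header)) {
        continue; // skip fragments without a company header
    }
    preg_match('/Client ID:\s([0-9 -]*)<br>/i', $record, $id);
    preg_match('/Contact Phone:\s([0-9 -]*)<br>/i', $record, $phone);

    // One CSV line per record; missing fields become empty strings.
    fputcsv($out, array(
        trim($header[1]),
        isset($id[1]) ? trim($id[1]) : '',
        isset($phone[1]) ? trim($phone[1]) : '',
    ), ',', '"', '\\');
}
fclose($out);

print "Wrote " . $csvPath . "\n";
```

fputcsv() takes care of quoting, so company names containing commas or quotes should not break the columns.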