I would not use a regex for this. Why? Because you don't know whether a given data type will be included in the document, so the array returned by the regex can end up with data in the wrong place!
For example, say (Vibration) is missing. You would end up with:
Code: Select all
[0] => Array
    (
        [Type] => Polyphonic
        [Customization] => Download
        [Vibration] => 1000 x 20 fields, Photo call
        [Phonebook] => Polyphonic
    )
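A minimal sketch of how that shift happens. The markup here is hypothetical (one label cell and one value cell per <tr>; the real page will differ), and the regex assigns values by position only:

```php
<?php
// hypothetical markup: the Vibration row is missing from this page
$html = '<tr><td>Type</td><td>Polyphonic</td></tr>'
      . '<tr><td>Customization</td><td>Download</td></tr>'
      . '<tr><td>Phonebook</td><td>1000 x 20 fields, Photo call</td></tr>';

// grab every value cell in page order, ignoring the label cell
preg_match_all ( '#<tr><td>[^<]+</td><td>([^<]+)</td></tr>#', $html, $m );

// assign the values to the fields by position only
$fields = array ( 'Type', 'Customization', 'Vibration', 'Phonebook' );
$row = array ();

foreach ( $m[1] AS $pos => $value )
{
    $row[$fields[$pos]] = $value;
}

print_r ( $row );
?>
```

Here the Phonebook value lands in the Vibration slot and Phonebook is never set, because nothing ties a value to the label it came with.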
The more results there are in the page, the more bad data you end up with! This is where loop control is needed, because you need to control...
1. what to find
2. limit each result set
3. if some data is not found, still set the array element so your db insert keeps all the fields it's expecting!
Sure, it's more work to write, but it gives you complete control instead of guesswork!
This is untested, but based on your data it should do what you want, plus it will handle multiple result sets that could be in each page!
The results should look like this... (an empty value means the data was not found)
Code: Select all
Array
(
    [0] => Array
        (
            [Type] => Polyphonic
            [Customization] => Download
            [Vibration] => Yes
            [Phonebook] => 1000 x 20 fields, Photo call
        )
    [1] => Array
        (
            [Type] => Polyphonic
            [Customization] =>
            [Vibration] => Yes
            [Phonebook] => 1000 x 20 fields, Photo call
        )
    [2] => Array
        (
            [Type] => Polyphonic
            [Customization] => Download
            [Vibration] =>
            [Phonebook] => 1000 x 20 fields, Photo call
        )
)
// script...
Code: Select all
<?php
// file to read or url...
$out = file_get_contents ( 'url_or_file_path_plus_name' );
// the data holder
$keep = array ();
// what to look for (link file name => field label)
$data = array (
    'h_ringtype.htm'   => 'Type',
    'h_ringcustom.htm' => 'Customization',
    'h_vibrat.htm'     => 'Vibration',
    'h_number.htm'     => 'Phonebook'
);
// split the page, because each piece of data we are
// looking for is contained in a <tr>data</tr>
$split = '<tr>';
$parts = explode ( $split, $out );
// the resulting data array increment counter
$na = 0;
for ( $i = 0; $i < count ( $parts ); $i++ )
{
    foreach ( $data AS $k => $v )
    {
        // if we find exactly what we want, grab it, clean it, set it and break until the next match
        // we use (2) test patterns because some data on the page might match (1) of the match
        // patterns, but the bogus data will never match both of our test patterns!
        if ( strpos ( $parts[$i], $k ) !== false && strpos ( $parts[$i], $v ) !== false )
        {
            // when 'Type' shows up again, a new result set has started, so we move on
            // to a new array. We do it this way to catch the last result without
            // having to repeat code blocks! In other words, it is to handle
            // multiple result sets found in a page...
            if ( isset ( $keep[$na][$v] ) && $v === 'Type' )
            {
                $na++;
            }
            // the cleaner, just get the data we want
            $temp = substr ( $parts[$i], strpos ( $parts[$i], $k ) );
            $temp = substr ( $temp, ( strpos ( $temp, $v ) + strlen ( $v ) ) );
            $temp = strip_tags ( $temp );
            // fix up any line breaks and cut out extra spaces
            $old = array ( '#\\r?\\n#', '#\\s+#' );
            $new = array ( ' ', ' ' );
            $keep[$na][$v] = trim ( preg_replace ( $old, $new, $temp ) );
            // done, exit the foreach()
            break;
        }
    }
}
// fix all the unset elements, we can't add this to
// the first loop because we would not catch the
// last array on a multi result page, which would
// mean you would have to write this twice. So in
// this case it is better to leave it here...
for ( $i = 0; $i < count ( $keep ); $i++ )
{
    if ( ! isset ( $keep[$i]['Type'] ) )
    {
        $keep[$i]['Type'] = '';
    }
    if ( ! isset ( $keep[$i]['Customization'] ) )
    {
        $keep[$i]['Customization'] = '';
    }
    if ( ! isset ( $keep[$i]['Vibration'] ) )
    {
        $keep[$i]['Vibration'] = '';
    }
    if ( ! isset ( $keep[$i]['Phonebook'] ) )
    {
        $keep[$i]['Phonebook'] = '';
    }
}
// just show the resulting array
print_r ( $keep );
?>
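If the list of fields grows, the last loop could probably be shortened. This is only a sketch of one alternative, not a change to the script above: the sample $keep is made-up input, and $defaults is a name I picked for illustration. It uses array_fill_keys() and array_merge() to fill in whatever is missing:

```php
<?php
// made-up input, shaped like the $keep array the script builds
$keep = array (
    array ( 'Type' => 'Polyphonic', 'Vibration' => 'Yes' ),
    array ( 'Customization' => 'Download' )
);

// every field the db insert expects, each defaulting to an empty string
$defaults = array_fill_keys ( array ( 'Type', 'Customization', 'Vibration', 'Phonebook' ), '' );

foreach ( $keep AS $i => $row )
{
    // values found on the page win, missing fields fall back to ''
    // and the field order follows $defaults
    $keep[$i] = array_merge ( $defaults, $row );
}

print_r ( $keep );
?>
```

Every result set then has all four fields in the same order, which is what the db insert expects.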