Why "Header" returns me redundant data?

Posted: Tue Jul 28, 2009 7:18 pm
by senglory

Code:

 
<TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">
\s*
<tr\s+class="header\s+maincol">\s*
(?<Header>.*?(?=</tr>))</tr>
 
\s+
(?<Content>
<tr(?:[^>]+?)>
.+
(?=</table>)
 
)
</table>
I apply this regex in Regulator to the zipped file here:

http://depositfiles.com/files/ipowahlh5

Why does the "Header" group contain 11111111111111? It's supposed to stop at the first </tr>.

And if I remove

Code:

 
(?<Content>
<tr(?:[^>]+?)>
.+
(?=</table>)
 
)
</table>
then it works as expected.

Re: Why "Header" returns me redundant data?

Posted: Wed Jul 29, 2009 1:13 am
by prometheuzz
I, probably like many people answering questions in this forum, am not going to download some file(s) and try to figure out what you're trying to do and/or what your question is. So, if you'd like my help, please describe your problem here on the forum, so that nobody needs to download anything. Thanks.

Re: Why "Header" returns me redundant data?

Posted: Wed Jul 29, 2009 1:45 am
by senglory
OK, let me show the text to be processed here:


<div class="main">
<div class="paging maincol">

<b> 1 </b> |

<a href="http://www.autosite.com.ua/used_results ... ge=2">2</a> |

<a href="http://www.autosite.com.ua/used_results ... ge=3">3</a> |

<a href="http://www.autosite.com.ua/used_results ... ge=4">4</a> |

<a href="http://www.autosite.com.ua/used_results ... ge=5">5</a> |

<a href="http://www.autosite.com.ua/used_results ... ge=6">6</a> |

<a href="http://www.autosite.com.ua/used_results ... ge=7">7</a> |

<a href="http://www.autosite.com.ua/used_results ... ge=8">8</a> |

<a href="http://www.autosite.com.ua/used_results ... ge=9">9</a> |

<a href="http://www.autosite.com.ua/used_results ... =10">10</a> |

<span class="hpad15"><a href="http://www.autosite.com.ua/used_results ... >Следующие &raquo;</a></span>

</div>
</div>
<div style="clear:both">
<table class="used-results maincol" cellspacing="0">
<tr class="header maincol">
<td class="header" width="118">
<div class="tr-left fl"><b>Цена</b></div>
<div class="tr-right fr"><a href="javascript:Sort('Price', 'asc')"><img src="/img/sort_asc_inactive.gif"></a><a href="javascript:Sort('Price', 'desc')"><img src="/img/sort_desc_inactive.gif"></a></div>
</td>
<td class="header" width="212"><div class="tr-left fl"><b>Марка/модель/модификация</b></div></td>
<td class="header" width="83">
<div class="tr-left fl"><b>Год</b><br />выпуска</div>
<div class="tr-right fr"><a href="javascript:Sort('Year', 'asc')"><img src="/img/sort_asc_inactive.gif"></a><a href="javascript:Sort('Year', 'desc')"><img src="/img/sort_desc_inactive.gif"></a></div>
</td>
<td class="header" width="75">
<div class="tr-left fl"><b>Пробег</b></div>
<div class="tr-right fr"><a href="javascript:Sort('Mileage', 'asc')"><img src="/img/sort_asc_inactive.gif"></a><a href="javascript:Sort('Mileage', 'desc')"><img src="/img/sort_desc_inactive.gif"></a></div>
</td>
<td width="105" class="header">
<div class="tr-left fl"><b>Дата</b><br />добавления</div>
<div class="tr-right fr"><a href="javascript:Sort('Date', 'asc')"><img src="/img/sort_asc_inactive.gif"></a><a href="javascript:Sort('Date', 'desc')"><img src="/img/sort_desc_inactive.gif"></a></div>
</td>
<td width="58">&nbsp;</td>
</tr>
11111111111111111111111111111
<tr>
<td class="res-left"><a href="http://www.autosite.com.ua/Skoda-Octavi ... html"><img src="/pictures/28-6-2009/520576/thumb_image1.jpg" border="0" class="imgborder" ></a><div class="price red11 b lh18">9100 <b>USD</b></div></td>
<td class="res"><div class="c">
<a href="http://www.autosite.com.ua/Skoda-Octavi ... 20576.html" class="red12"><strong>Skoda Octavia LS</strong></a>
<br />
<strong>Закарпатье</strong><br />
ХОРОШЕЕ СОСТОЯНИЕ.ЧЕШСКАЯ ЗБОРКА. ...&nbsp;&nbsp;<br />
<a href="http://www.autosite.com.ua/Skoda-Octavi ... е</a></div>
</td>
<td class="res">2000 г</td>
<td class="res">200000 км </td>
<td class="res">28/06/2009</td>
<td class="res-right" align="center"><br /><img src="http://i.autosite.com.ua/img/urgent_icon.gif" alt="" /></td>
</tr>


<tr bgcolor="#FCFDFE">
<td class="res-left"><a href="http://www.autosite.com.ua/Mazda-323_au ... html"><img src="/pictures/25-7-2009/543533/thumb_image1.jpg" border="0" class="imgborder" ></a><div class="price red11 b lh18">3800 <b>USD</b></div></td>
<td class="res"><div class="c">
<a href="http://www.autosite.com.ua/Mazda-323_auto_543533.html" class="red12"><strong>Mazda 323 </strong></a>
<br />
<strong>АР Крым</strong><br />
сигнализация, цз., протувотуманки, иммобилайзер, гу рул ...&nbsp;&nbsp;<br />
<a href="http://www.autosite.com.ua/Mazda-323_au ... е</a></div>
</td>
<td class="res">1991 г</td>
<td class="res">80000 км </td>
<td class="res">25/07/2009</td>
<td class="res-right" align="center"><br /><img src="http://i.autosite.com.ua/img/urgent_icon.gif" alt="" /></td>
</tr>

</table>



After processing, "Header" contains 11111111111111111111111111111. This is not what I want. I need 11111111111111111111111111111 and all the text below it to be in the "Content" group.

Re: Why "Header" returns me redundant data?

Posted: Wed Jul 29, 2009 2:26 am
by prometheuzz
senglory wrote:

Code:

 
<TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">
\s*
<tr\s+class="header\s+maincol">\s*
(?<Header>.*?(?=</tr>))</tr>
 
\s+
(?<Content>
<tr(?:[^>]+?)>
.+
(?=</table>)
 
)
</table>
I apply this regex in Regulator to the zipped file here:

http://depositfiles.com/files/ipowahlh5

Why does the "Header" group contain 11111111111111? It's supposed to stop at the first </tr>.
No, look at this part of your regex:

Code:

(?<Header>.*?(?=</tr>))</tr>
\s+
(?<Content><tr(?:[^>]+?)>)
When you have matched your first "</tr>", you tell the regex engine that only white space characters may come between that "</tr>" and the next "<tr", yet there is "11111111111111" between the first "</tr>" and the second "<tr". Besides, the second "<tr" is followed directly by a ">", making this part of your regex fail as well:

Code:

(?<Content><tr(?:[^>]+?)>)
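since you required at least one non-'>' character after it ([^>]+). So the engine backtracks: the lazy ".*?" in "Header" keeps growing past the first "</tr>" until it reaches a "</tr>" that is followed only by white space and a "<tr" that has attributes. That is how the "11111111111111" ends up in "Header".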
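
If you want to watch this happen on a small input, here is a rough Python sketch. The shortened HTML and the names html, PREFIX, CONTENT, original and header_only are made up for illustration. Python spells named groups (?P<name>...), and I'm assuming Regulator ran with the single-line, ignore-case and ignore-pattern-whitespace options, which the three re flags below stand in for.

```python
import re

# Trimmed-down stand-in for the posted table (cell contents shortened).
html = (
    '<table class="used-results maincol" cellspacing="0">\n'
    '<tr class="header maincol">\n'
    '<td class="header">Price</td>\n'
    '</tr>\n'
    '11111111111111\n'
    '<tr>\n'
    '<td class="res">2000</td>\n'
    '</tr>\n'
    '\n'
    '<tr bgcolor="#FCFDFE">\n'
    '<td class="res">1991</td>\n'
    '</tr>\n'
    '\n'
    '</table>\n'
)

FLAGS = re.VERBOSE | re.DOTALL | re.IGNORECASE

# Everything up to and including the \s+ after the header row.
PREFIX = r'''
    <TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">
    \s*
    <tr\s+class="header\s+maincol">\s*
    (?P<Header>.*?(?=</tr>))</tr>
    \s+
'''

# The part that was removed in the "it becomes OK" experiment.
CONTENT = r'''
    (?P<Content>
      <tr(?:[^>]+?)>
      .+
      (?=</table>)
    )
    </table>
'''

original = re.compile(PREFIX + CONTENT, FLAGS)
header_only = re.compile(PREFIX, FLAGS)

m = original.search(html)
# The lazy .*? had to grow past the first </tr> to let the rest match.
print(repr(m.group('Header')))
print(repr(m.group('Content')[:25]))

h = header_only.search(html)
# Without the Content part nothing forces backtracking, so Header is short.
print(repr(h.group('Header')))
```

With the Content part in place, the only "</tr>" followed by white space and a "<tr" with attributes is the second one, so that is where "Header" ends.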

Re: Why "Header" returns me redundant data?

Posted: Wed Jul 29, 2009 2:39 am
by prometheuzz
Try something like this:

Code:

$regex = '#
  <TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">\s*
  <tr\s+class="header\s+maincol">\s*
  (?<Header>
    (?:(?!</tr>).)*+
  )
  </tr>
  .*?
  (?<Content>
    <tr(?:[^>]*+)>
    (?:(?!</table>).)*+
  )
  </table>
#xsi';
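
For anyone not on PHP, roughly the same pattern can be tried in Python. This is only a sketch under a few assumptions: named groups are spelled (?P<name>...), the possessive *+ is replaced by a plain * (Python's re has possessive quantifiers only from 3.11 on, and on this input both give the same match), and the shortened HTML and the names html, fixed and with_marker are made up for illustration. Note that because of the .*? in front of "Content", the "11111111111111" lands in neither group. The with_marker variant is my own tweak, not part of the pattern above: it lets "Content" start right after the header's "</tr>".

```python
import re

# Trimmed-down stand-in for the posted table (cell contents shortened).
html = (
    '<table class="used-results maincol" cellspacing="0">\n'
    '<tr class="header maincol">\n'
    '<td class="header">Price</td>\n'
    '</tr>\n'
    '11111111111111\n'
    '<tr>\n'
    '<td class="res">2000</td>\n'
    '</tr>\n'
    '\n'
    '<tr bgcolor="#FCFDFE">\n'
    '<td class="res">1991</td>\n'
    '</tr>\n'
    '\n'
    '</table>\n'
)

FLAGS = re.VERBOSE | re.DOTALL | re.IGNORECASE

# Port of the suggested pattern: Header can never run across a </tr>.
fixed = re.compile(r'''
    <TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">\s*
    <tr\s+class="header\s+maincol">\s*
    (?P<Header>
      (?:(?!</tr>).)*
    )
    </tr>
    .*?
    (?P<Content>
      <tr[^>]*>
      (?:(?!</table>).)*
    )
    </table>
''', FLAGS)

# Hypothetical variant: Content begins directly after the header row,
# so any text before the first data row is captured as well.
with_marker = re.compile(r'''
    <TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">\s*
    <tr\s+class="header\s+maincol">\s*
    (?P<Header>
      (?:(?!</tr>).)*
    )
    </tr>
    \s*
    (?P<Content>
      (?:(?!</table>).)*
    )
    </table>
''', FLAGS)

m1 = fixed.search(html)
print(repr(m1.group('Header')))
print(repr(m1.group('Content')[:20]))

m2 = with_marker.search(html)
print(repr(m2.group('Content')[:20]))
```

The (?:(?!</tr>).)* construct checks at every position that "</tr>" does not start there before consuming a character, so backtracking cannot push "Header" past the first closing tag the way the lazy .*? could.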