Why does "Header" return redundant data?

Postby senglory » Tue Jul 28, 2009 7:18 pm

<TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">
\s*
<tr\s+class="header\s+maincol">\s*
(?<Header>.*?(?=</tr>))</tr>
\s+
(?<Content>
<tr(?:[^>]+?)>
.+
(?=</table>)
)
</table>

I apply this regex in Regulator to the zipped file at:

http://depositfiles.com/files/ipowahlh5

Why does the "Header" group contain 11111111111111? It is supposed to stop at the first </tr>.

If I remove this part:

(?<Content>
<tr(?:[^>]+?)>
.+
(?=</table>)
)
</table>

then it works as expected.

Postby prometheuzz » Wed Jul 29, 2009 1:13 am

I, probably like many people answering questions in this forum, am not going to download some file(s) and try to figure out what you're trying to do and/or what your question is. So, if you'd like my help, please describe your problem on the forum without needing to download anything. Thanks.

Postby senglory » Wed Jul 29, 2009 1:45 am

OK, let me show the text to be processed here:


<div class="main">
<div class="paging maincol">

<b> 1 </b> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=2">2</a> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=3">3</a> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=4">4</a> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=5">5</a> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=6">6</a> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=7">7</a> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=8">8</a> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=9">9</a> |

<a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=10">10</a> |

<span class="hpad15"><a href="http://www.autosite.com.ua/used_results.html?Make=0&Type=1&Model=&PriceFrom=&PriceTo=&Body=any&YearFrom=&YearTo=&Engine=any&MileageFrom=&MileageTo=&Color=any&Price=0&Year=0&Region=0&only_photo=&only_24=&only_dtp=&smod=&sdir=&page=10">Next &raquo;</a></span>

</div>
</div>
<div style="clear:both">
<table class="used-results maincol" cellspacing="0">
<tr class="header maincol">
<td class="header" width="118">
<div class="tr-left fl"><b>Price</b></div>
<div class="tr-right fr"><a href="javascript:Sort('Price', 'asc')"><img src="/img/sort_asc_inactive.gif"></a><a href="javascript:Sort('Price', 'desc')"><img src="/img/sort_desc_inactive.gif"></a></div>
</td>
<td class="header" width="212"><div class="tr-left fl"><b>Make/model/modification</b></div></td>
<td class="header" width="83">
<div class="tr-left fl"><b>Year</b>
of manufacture</div>
<div class="tr-right fr"><a href="javascript:Sort('Year', 'asc')"><img src="/img/sort_asc_inactive.gif"></a><a href="javascript:Sort('Year', 'desc')"><img src="/img/sort_desc_inactive.gif"></a></div>
</td>
<td class="header" width="75">
<div class="tr-left fl"><b>Mileage</b></div>
<div class="tr-right fr"><a href="javascript:Sort('Mileage', 'asc')"><img src="/img/sort_asc_inactive.gif"></a><a href="javascript:Sort('Mileage', 'desc')"><img src="/img/sort_desc_inactive.gif"></a></div>
</td>
<td width="105" class="header">
<div class="tr-left fl"><b>Date</b>
added</div>
<div class="tr-right fr"><a href="javascript:Sort('Date', 'asc')"><img src="/img/sort_asc_inactive.gif"></a><a href="javascript:Sort('Date', 'desc')"><img src="/img/sort_desc_inactive.gif"></a></div>
</td>
<td width="58">&nbsp;</td>
</tr>
11111111111111111111111111111
<tr>
<td class="res-left"><a href="http://www.autosite.com.ua/Skoda-Octavia-LS_auto_520576.html"><img src="/pictures/28-6-2009/520576/thumb_image1.jpg" border="0" class="imgborder" ></a><div class="price red11 b lh18">9100 <b>USD</b></div></td>
<td class="res"><div class="c">
<a href="http://www.autosite.com.ua/Skoda-Octavia-LS_auto_520576.html" class="red12"><strong>Skoda Octavia LS</strong></a>


<strong>Zakarpattia</strong>

GOOD CONDITION. CZECH ASSEMBLY. ...&nbsp;&nbsp;

<a href="http://www.autosite.com.ua/Skoda-Octavia-LS_auto_520576.html">Details</a></div>
</td>
<td class="res">2000</td>
<td class="res">200000 km </td>
<td class="res">28/06/2009</td>
<td class="res-right" align="center">
<img src="http://i.autosite.com.ua/img/urgent_icon.gif" alt="" /></td>
</tr>


<tr bgcolor="#FCFDFE">
<td class="res-left"><a href="http://www.autosite.com.ua/Mazda-323_auto_543533.html"><img src="/pictures/25-7-2009/543533/thumb_image1.jpg" border="0" class="imgborder" ></a><div class="price red11 b lh18">3800 <b>USD</b></div></td>
<td class="res"><div class="c">
<a href="http://www.autosite.com.ua/Mazda-323_auto_543533.html" class="red12"><strong>Mazda 323 </strong></a>


<strong>AR Crimea</strong>

alarm, central locking, fog lights, immobilizer, power steer ...&nbsp;&nbsp;

<a href="http://www.autosite.com.ua/Mazda-323_auto_543533.html">Details</a></div>
</td>
<td class="res">1991</td>
<td class="res">80000 km </td>
<td class="res">25/07/2009</td>
<td class="res-right" align="center">
<img src="http://i.autosite.com.ua/img/urgent_icon.gif" alt="" /></td>
</tr>

</table>



After processing, "Header" contains 11111111111111111111111111111. This is not what I want. I need 11111111111111111111111111111 and all the text below it to end up in the "Content" group.

Postby prometheuzz » Wed Jul 29, 2009 2:26 am

senglory wrote:
Why does the "Header" group contain 11111111111111? It is supposed to stop at the first </tr>.


No, look at this part of your regex:

(?<Header>.*?(?=</tr>))</tr>
\s+
(?<Content><tr(?:[^>]+?)>)

Once the first "</tr>" has been matched, you tell the regex engine that only white space characters may come between that "</tr>" and the next "<tr", yet there is "11111111111111" between the first "</tr>" and the second "<tr". Besides, the second "<tr" is followed directly by a ">", which makes this part of your regex fail as well:

(?<Content><tr(?:[^>]+?)>)

since you required at least one non-'>' character after it ([^>]+). Because the match cannot continue from the first "</tr>", the engine backtracks and lets the reluctant .*? in "Header" grow until it reaches a later "</tr>" that is followed only by white space and by a "<tr" with attributes.
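
The backtracking described above can be reproduced with a small script. This is only a sketch under several assumptions: it uses Python's re module instead of Regulator, writes the named groups as (?P<name>...), assumes the pattern was run with options equivalent to re.DOTALL, re.VERBOSE and re.IGNORECASE, and works on a shortened, made-up stand-in for the posted HTML (the variable names are hypothetical as well).

```python
import re

# Shortened, made-up stand-in for the posted page (assumption, not the real file).
html = (
    '<table class="used-results maincol" cellspacing="0">\n'
    '<tr class="header maincol">\n'
    '<td>Price</td>\n'
    '</tr>\n'
    '1111\n'
    '<tr>\n'
    '<td>row 1</td>\n'
    '</tr>\n'
    '\n'
    '<tr bgcolor="#FCFDFE">\n'
    '<td>row 2</td>\n'
    '</tr>\n'
    '</table>\n'
)

# The pattern from the question; only the named-group syntax is adapted to Python.
original = re.compile(
    r'''
    <TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">
    \s*
    <tr\s+class="header\s+maincol">\s*
    (?P<Header>.*?(?=</tr>))</tr>
    \s+
    (?P<Content>
      <tr(?:[^>]+?)>
      .+
      (?=</table>)
    )
    </table>
    ''',
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

m = original.search(html)
header = m.group("Header")
content = m.group("Content")

# "Header" has grown past the first </tr>; "Content" starts at the second data row.
print(repr(header))
print(repr(content))
```

In this sketch the marker line and the whole first data row end up in "Header", because the first "</tr>" followed only by white space and a "<tr" with attributes is the one that closes row 1.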
User avatar
prometheuzz
Forum Regular
 
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Why "Header" returns me redundant data?

Postby prometheuzz » Wed Jul 29, 2009 2:39 am

Try something like this:

// x: ignore white space in the pattern, s: dot also matches line breaks, i: case-insensitive
$regex = '#
  <TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">\s*
  <tr\s+class="header\s+maincol">\s*
  (?<Header>
    (?:(?!</tr>).)*+
  )
  </tr>
  .*?
  (?<Content>
    <tr(?:[^>]*+)>
    (?:(?!</table>).)*+
  )
  </table>
#xsi';
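
For readers who want to try the same idea outside PHP, here is a rough Python port. It is a sketch, not a drop-in replacement: the possessive quantifiers (*+) are swapped for plain greedy ones so it does not depend on a recent Python version, the named groups use the (?P<name>...) form, and the HTML is a shortened, made-up stand-in for the posted page.

```python
import re

# Shortened, made-up stand-in for the posted page (assumption, not the real file).
html = (
    '<table class="used-results maincol" cellspacing="0">\n'
    '<tr class="header maincol">\n'
    '<td>Price</td>\n'
    '</tr>\n'
    '1111\n'
    '<tr>\n'
    '<td>row 1</td>\n'
    '</tr>\n'
    '\n'
    '<tr bgcolor="#FCFDFE">\n'
    '<td>row 2</td>\n'
    '</tr>\n'
    '</table>\n'
)

suggested = re.compile(
    r'''
    <TABLE\s+class="used-results\s+maincol"\s+cellSpacing="0">\s*
    <tr\s+class="header\s+maincol">\s*
    (?P<Header>
      (?:(?!</tr>).)*        # any character, as long as no closing tr tag starts here
    )
    </tr>
    .*?                      # skip whatever sits between the header row and the next row
    (?P<Content>
      <tr[^>]*>              # opening tr tag, with or without attributes
      (?:(?!</table>).)*     # everything up to, but not including, the closing table tag
    )
    </table>
    ''',
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

m = suggested.search(html)
header = m.group("Header")
content = m.group("Content")

print(repr(header))   # only the header cells
print(repr(content))  # all data rows, starting at the first <tr>
```

Because each character of "Header" is checked against (?!</tr>), the group can no longer run past the first "</tr>", whatever follows it. Note that in this sketch the text between the header's "</tr>" and the first "<tr" is consumed by .*? and therefore lands in neither group.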