Finding an expression after another
Posted: Thu Mar 01, 2012 2:57 pm
by RonH
I need to parse some web pages. Hopefully they are all formatted consistently, but since I'm new to regex I'm unsure how to grab the info that comes after a key occurrence.
Ex.
<th scope="row" style="text-align:left;"><a href="/City_of_license" title="City of license">City of license</a></th>
<td class="" style=""><a href="/Airdrie,_Alberta" title="Airdrie, Alberta">Airdrie, Alberta</a></td>
I need to place Airdrie, Alberta under a heading of City of license. Getting all the info into a text file is all I need; I can then populate a DB from there.
How is this done? I have other info on the page as well, but once I figure out how to do this I can apply it to the others.
I was thinking of finding the first instance of "City of license", then the first title=" after it, and grabbing the text until the next " is reached.
Thanks
Re: Finding an expression after another
Posted: Thu Mar 01, 2012 8:58 pm
by ragax
Hi Ron,
I don't think I fully understand what we're trying to achieve.
Can you please provide the desired output? That will make it easier to work on the regex.
Wishing you a fun weekend.
Re: Finding an expression after another
Posted: Fri Mar 02, 2012 10:17 am
by RonH
I posted a reply last night but now it's not here...Hmm
Here is an excerpt from one of the web pages:
<th scope="row" style="text-align:left;"><a href="/City_of_license" title="City of license">City of license</a></th>
<td class="" style=""><a href="/Airdrie,_Alberta" title="Airdrie, Alberta">Airdrie, Alberta</a></td>
</tr>
<tr class="">
<th scope="row" style="text-align:left;">Branding</th>
<td class="" style="">Air 106-1</td>
</tr>
<tr class="note">
<th scope="row" style="text-align:left;"><a href="/Slogan" title="Slogan">Slogan</a></th>
<td class="" style="">Airdrie's Radio Station</td>
</tr>
<tr class="">
<th scope="row" style="text-align:left;"><a href="/Frequency" title="Frequency">Frequency</a></th>
<td class="" style="">106.1 <a href="/MHz" title="MHz" class="mw-redirect">MHz</a></td>
<th scope="row" style="text-align:left;"><a href="/Radio_format" title="Radio format">Format</a></th>
<td class="category" style=""><a href="/Adult_top_40" title="Adult top 40" class="mw-redirect">Adult top 40</a></td>
On my first pass I want to write to a file:
City of license Airdrie, Alberta
then
Frequency 106.1
then
Radio format Adult top 40
...
I cannot just look for the title=" before the data because there is more than one title=". I would like to search for the first occurrence of title=" after a given string, e.g. City of license.
Clear as mud?
Thanks
Ron
Re: Finding an expression after another
Posted: Fri Mar 02, 2012 2:33 pm
by ragax
Hi Ron,
Clear as mud?
Clear as day.
Here's a regex that works. I wrote it in comment mode (aka whitespace mode) so you can see what's going on. Then, following the regex, I pasted code that uses it.
Code:
(?sx) # comment mode, dot matches new line
>City[ ]of[ ]license # literal match
(?>.*?title=") # lazily eat up everything up to the next title
([^"]++) # capture the title (group 1)
(?>.*?>Frequency) # lazily eat up everything up to >Frequency
(?>(?:[^>]+>){3}) # eat up three tags
([^<\s]+) # capture the frequency (group 2)
(?>.*?>Format) # lazily eat up everything up to >Format
(?>.*?title=") # lazily eat up everything up to the next title
([^"]++) # capture the title (group 3)
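The core move in the pattern above, skipping lazily from a known label to the next title=" and capturing up to the closing quote, can also be tried on one field at a time. Here is a minimal sketch of that idea. The helper name firstTitleAfter is made up for illustration and is not part of the code in this thread, and the sample HTML is trimmed from the excerpt above.

```php
<?php
// Hypothetical helper (not from the thread): return the first title="..." value
// that appears after the text ">$label" in $html, or null if nothing matches.
function firstTitleAfter(string $html, string $label): ?string
{
    // preg_quote() escapes any regex metacharacters in the label;
    // the s flag lets the dot cross line breaks between the <th> and <td>.
    $pattern = '~>' . preg_quote($label, '~') . '.*?title="([^"]+)~s';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$html = '<th><a href="/City_of_license" title="City of license">City of license</a></th>
<td><a href="/Airdrie,_Alberta" title="Airdrie, Alberta">Airdrie, Alberta</a></td>';

echo firstTitleAfter($html, 'City of license'), "\n";
```

Matching one label per call means a field that is missing or out of order on some page only affects that field, at the cost of scanning the page once per label.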
Working Code:
<?php
$string='<th scope="row" style="text-align:left;"><a href="/City_of_license" title="City of license">City of license</a></th>
<td class="" style=""><a href="/Airdrie,_Alberta" title="Airdrie, Alberta">Airdrie, Alberta</a></td>
</tr>
<tr class="">
<th scope="row" style="text-align:left;">Branding</th>
<td class="" style="">Air 106-1</td>
</tr>
<tr class="note">
<th scope="row" style="text-align:left;"><a href="/Slogan" title="Slogan">Slogan</a></th>
<td class="" style="">Airdrie\'s Radio Station</td>
</tr>
<tr class="">
<th scope="row" style="text-align:left;"><a href="/Frequency" title="Frequency">Frequency</a></th>
<td class="" style="">106.1 <a href="/MHz" title="MHz" class="mw-redirect">MHz</a></td>
<th scope="row" style="text-align:left;"><a href="/Radio_format" title="Radio format">Format</a></th>
<td class="category" style=""><a href="/Adult_top_40" title="Adult top 40" class="mw-redirect">Adult top 40</a></td>
';
$regex='~(?sx) # comment mode, dot matches new line
>City[ ]of[ ]license # literal match
(?>.*?title=") # lazily eat up everything up to the next title
([^"]++) # capture the title (group 1)
(?>.*?>Frequency) # lazily eat up everything up to >Frequency
(?>(?:[^>]+>){3}) # eat up three tags
([^<\s]+) # capture the frequency (group 2)
(?>.*?>Format) # lazily eat up everything up to >Format
(?>.*?title=") # lazily eat up everything up to the next title
([^"]++) # capture the title (group 3)
~';
if (preg_match($regex,$string,$m)) { // print only when all three fields were found
    echo 'City of license '.$m[1].'<br />';
    echo 'Frequency '.$m[2].'<br />';
    echo 'Radio Format '.$m[3].'<br />';
}
?>
Output:
City of license Airdrie, Alberta
Frequency 106.1
Radio Format Adult top 40
I hope this is what you were looking for. Let me know if you have any questions.
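Since the end goal is a text file rather than browser output, here is a rough sketch of that last step. It assumes the three values have already been captured (they are hard-coded here from the sample output above). The output path and the tab separator are my own assumptions, so adjust them to whatever your DB import expects.

```php
<?php
// Assumed input: the values captured from the sample page (e.g. $m[1], $m[2], $m[3]).
$fields = [
    'City of license' => 'Airdrie, Alberta',
    'Frequency'       => '106.1',
    'Radio format'    => 'Adult top 40',
];

// Hypothetical output location; any writable path will do.
$outFile = sys_get_temp_dir() . '/station.txt';

$lines = [];
foreach ($fields as $heading => $value) {
    // A tab separator (an assumption) keeps headings that contain spaces unambiguous.
    $lines[] = $heading . "\t" . $value;
}

// Write one "heading<TAB>value" line per field.
file_put_contents($outFile, implode("\n", $lines) . "\n");
```

To collect several pages in one file, passing FILE_APPEND as the third argument to file_put_contents() adds to the file instead of overwriting it.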

Re: Finding an expression after another
Posted: Thu Mar 08, 2012 1:20 pm
by RonH
Thanks, works great!
Re: Finding an expression after another
Posted: Thu Mar 08, 2012 1:35 pm
by ragax
Glad to hear it, Ron, thanks for letting me know.
