Help needed with a regex on a large HTML file.

Yoni
Forum Newbie
Posts: 6
Joined: Sat Oct 28, 2006 11:56 pm

Post by Yoni »

Hi all,
First, a little background about who I am and what I want to do:
I have been unemployed for the last few months. Here in Israel we have a really good web site for job hunting, but for every job we want to apply to we have to send an e-mail manually, copying the e-mail address and the subject (which includes the job ID) by hand...

** All of the jobs appear on one HTML page without frames. **

So what I want to do is write a regex for the specific tags that contain the e-mail address, job title and job ID, and pass them to a script that will filter them and send an e-mail if the job is something that interests me.

So far my regex attempts return no results. Below is an example of the regex that I'm running:
preg_match_all('/\Wh\d class=PositionTitle\W[\w\s\\/]*\W\W\w\d\W/', $content, $match);


1st Note: The HTML file size is around 513 KB.
2nd Note: The HTML code includes Hebrew characters (it can be viewed with Windows-1255 or UTF-8 encoding).
3rd Note: When I ran the regex against a small sample file like:
<h1 class=PositionTitle>Cde</h1>
<h2 class=PositionTitle>EFG</h2>
with similar lines that include Hebrew characters, the result was fine. So I'm thinking that the file size may be breaking the regex.

I hope you'll be able to give me a hand here.

Thanks in advance!! :)

Truly Yours,
Yoni D.
php_east
Forum Contributor
Posts: 453
Joined: Sun Feb 22, 2009 1:31 pm
Location: Far Far East.

Post by php_east »

The way I do this is to first load the HTML into PHP's DOM,
then pick up the tags using DOM (more reliable than regexes),
and only after that bring out regexes to find what I want.

I usually get what I want this way, so you may wish to consider this method.
Loading the HTML into a DOMDocument isn't difficult, only a few lines.

Although I find DOM difficult to use, it has a nice getElementsByTagName method,
which saves tons of work with regexes.

hope this helps.
Yoni
Forum Newbie
Posts: 6
Joined: Sat Oct 28, 2006 11:56 pm

Post by Yoni »

Hi there,
I was wondering if you could please point me to a few of the main commands, references or such... I really don't know anything about DOM and HTML :\

Questions that might give me a kick-start are:
- Is it possible to use PHP to convert regular HTML files into DOM HTML files?
- If the answer to the 1st question is positive, which function should I use?
- If the answer to the 1st question is negative, what should I use to do the conversion?
- Can you please give me an example of the "getElementsByTagName" function?

thanks in advance!! :D
php_east
Forum Contributor
Posts: 453
Joined: Sun Feb 22, 2009 1:31 pm
Location: Far Far East.

Post by php_east »

Yoni wrote: Questions that might give me a kick-start are:
- Is it possible to use PHP to convert regular HTML files into DOM HTML files?
Yes, DOM is suited for that.
Yoni wrote: - If the answer to the 1st question is positive, which function should I use?

Code:

$doc = new DOMDocument();
$doc->loadHTML($html);

$html is your HTML string. The PHP manual also lists several other methods you can use; there is also ->loadHTMLFile(), which reads straight from a file.
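
Since the page contains Hebrew text, the character encoding deserves some care when loading. The following is only a minimal sketch, assuming the HTML is already a UTF-8 string: loadUtf8Html() is a made-up helper name, and prepending a charset meta tag is a common workaround, not the only option. If the file is stored as Windows-1255, it would need converting first (the commented iconv line assumes the iconv extension is available and uses a hypothetical file name).

```php
<?php
// Sketch: load a UTF-8 HTML string into DOMDocument without mangling Hebrew text.
// loadUtf8Html() is a hypothetical helper name, not part of PHP.
function loadUtf8Html(string $html): DOMDocument
{
    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Without a charset hint, loadHTML() may not read the bytes as UTF-8.
    // Prepending a meta tag tells the parser which encoding to use.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Assumption: for a Windows-1255 file, convert to UTF-8 before loading, e.g.
// $html = iconv('CP1255', 'UTF-8', file_get_contents('jobs.html'));

$doc   = loadUtf8Html('<h1 class=PositionTitle>משרה</h1>');
$title = $doc->getElementsByTagName('h1')->item(0)->nodeValue;
echo $title, "\n";
```

Switching libxml_use_internal_errors() on only for the duration of the call keeps a 500 KB page from flooding the output with warnings, while leaving the previous setting untouched for the rest of the script.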
Yoni wrote: - Can you please give me an example of the "getElementsByTagName" function?

Code:

$tag   = 'a';
$nodes = $doc->getElementsByTagName($tag);
foreach ($nodes as $node) {
    // nodeValue is the text content of the element
    echo $node->nodeValue;
}
 
A bit of study is required to use DOM. It is not that easy to use, especially if you aren't used to the DOM concept (like me), but it makes handling HTML a lot easier than regex. Once you have the nodeValue, you can test the element you found with a regex for the strings you want to pick up.
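
To make the "DOM first, regex second" idea concrete, here is a rough sketch. The sample markup is invented: the real job page is not shown in this thread, so the mailto links and the "Job 1234" subject format are assumptions and would need adjusting to the actual HTML. Only the PositionTitle class comes from the original question.

```php
<?php
// Hypothetical sample markup; the structure of the real page is an assumption.
$html = <<<'HTML'
<h2 class=PositionTitle>PHP Developer</h2>
<p><a href="mailto:jobs@example.com?subject=Job%201234">Apply</a></p>
<h3 class=PositionTitle>QA Engineer</h3>
<p><a href="mailto:hr@example.org?subject=Job%205678">Apply</a></p>
<p><a href="/about">About us</a></p>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Step 1: let DOM find the elements. '*' returns every element in document order.
$titles = [];
foreach ($doc->getElementsByTagName('*') as $node) {
    // Step 2: the regex only has to test a short string (is this h1..h6?),
    // never the whole 500 KB page.
    if (preg_match('/^h[1-6]$/i', $node->nodeName)
        && $node->getAttribute('class') === 'PositionTitle') {
        $titles[] = trim($node->nodeValue);
    }
}

// Same idea for links: DOM finds each <a>, the regex pulls the
// e-mail address and job ID out of its href attribute.
$jobs = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if (preg_match('/^mailto:([^?]+)\?subject=Job%20(\d+)/i', $href, $m)) {
        $jobs[] = ['email' => $m[1], 'id' => $m[2]];
    }
}

print_r($titles);
print_r($jobs);
```

Because each regex runs against one attribute or tag name at a time, the size of the page no longer matters to the pattern matching.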

hope this helps.