Ideas for Extracting Info from HTML Chat Logs

Posted: Sun Mar 12, 2006 1:10 am
by nigma
I've got a bunch of Adium log files that are formatted like this:

Code:

<div class="send">
<span class="timestamp">6:32:09 PM</span>
<span class="sender">dotkrisennay@hotmail.com: </span>
<pre class="message">hey</pre></div>

<div class="receive">
<span class="timestamp">6:32:19 PM</span>
<span class="sender">rolyoly@hotmail.com: </span>
<pre class="message"><B>hola</B></pre></div>
I want to extract all the timestamp fields that reside in a "receive" div and put them in an array, and do the same for those that reside in a "send" div. The thing is, I'd like to avoid using regular expressions because I don't want to spend the time learning the syntax right now (unless the requisite syntax wouldn't be that hard to learn?). I can use C, Ruby, Perl, or PHP. (Although I am hoping to avoid doing it in C, since the easiest way I can think of doing it there is not as easy as I'd like it to be.) Basically, in the end I just want the timestamps sorted into separate arrays so that I can do things like find the difference between a sent message and the closest received message, etc.

Anyone want to offer some input on the matter?

Posted: Sun Mar 12, 2006 1:31 am
by feyd
XML parsing?

Posted: Sun Mar 12, 2006 1:42 am
by nigma
What if the log files are neither valid nor well-formed XML?

Posted: Sun Mar 12, 2006 1:52 am
by feyd
then you need to do string parsing or regex..
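
For what the plain string-parsing route might look like, here is a minimal sketch in Ruby (one of the languages mentioned in the first post) that uses only String#index and slicing, no regular expressions. The method name timestamps_by_div is made up for illustration, and the sketch assumes every div looks exactly like the sample: a literal <div class="send"> or <div class="receive"> opening tag, one timestamp span inside it, and a closing </div>.

```ruby
# Collect timestamps per div class using only String#index and slicing.
# Assumption: tags are written exactly as in the sample log (same quoting,
# same attribute order), with one timestamp span per div.
def timestamps_by_div(log)
  open_tag  = '<span class="timestamp">'
  close_tag = '</span>'
  result = { 'send' => [], 'receive' => [] }

  result.each_key do |kind|
    div_tag = %(<div class="#{kind}">)
    pos = 0
    # Walk forward through every div of this kind.
    while (div_at = log.index(div_tag, pos))
      div_end = log.index('</div>', div_at) || log.length
      start = log.index(open_tag, div_at)
      break if start.nil?
      if start < div_end
        # Slice out the text between the span's opening and closing tags.
        start += open_tag.length
        stop = log.index(close_tag, start)
        break if stop.nil?
        result[kind] << log[start...stop]
      end
      pos = div_end
    end
  end
  result
end
```

Calling timestamps_by_div(File.read(path)) on a log file would then give a hash with one array under 'send' and one under 'receive'. It breaks as soon as the markup deviates from the sample, which is the usual price of hand-rolled string parsing.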

Posted: Sun Mar 12, 2006 10:34 am
by josh
If the whole file takes that form, use preg_match_all to get the div layers, then run preg_match_all inside each of those to get the spans, and sort through the resulting array, categorizing the values into arrays based on the class attribute.
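
A rough sketch of that two-pass idea, written in Ruby rather than PHP, with String#scan standing in for preg_match_all. The names DIV_RE, SPAN_RE and categorize are invented for this example, and it assumes the message text never contains a literal </div> or </span>.

```ruby
# First pass: capture each div's class and everything inside the div.
DIV_RE  = %r{<div class="(send|receive)">(.*?)</div>}m
# Second pass: capture each span's class and its text.
SPAN_RE = %r{<span class="(\w+)">(.*?)</span>}m

# Returns a nested hash: div class => span class => array of values.
def categorize(log)
  buckets = Hash.new { |h, k| h[k] = Hash.new { |h2, k2| h2[k2] = [] } }
  log.scan(DIV_RE) do |div_class, inner|
    inner.scan(SPAN_RE) do |span_class, text|
      buckets[div_class][span_class] << text.strip
    end
  end
  buckets
end
```

With the sample log from the first post, categorize(log)['send']['timestamp'] would hold the sent timestamps and categorize(log)['receive']['timestamp'] the received ones. The sender spans come along for free under the 'sender' key.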

Posted: Sun Mar 12, 2006 7:43 pm
by nigma
In Perl the regular expressions weren't as hard as I anticipated them being:

Code:

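# Assumes $_ holds the whole log (or at least a whole div), since the
# div and the span sit on separate lines in the sample above.
if (m%<div class="send">\s*<span class="timestamp">(\d+:\d+:\d+) \w+</span>%) {
  print "$1\n";
}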
Thanks for the advice.
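
For the follow-up step from the first post (the gap between a sent message and the closest received one), a small Ruby sketch could look like the following. The helper names to_seconds and reply_delays are hypothetical, "closest" is interpreted here as the first received message at or after the sent one, and since the timestamps carry no date the sketch assumes everything happened on the same day.

```ruby
# Convert a 12-hour stamp such as "6:32:09 PM" to seconds since midnight.
def to_seconds(stamp)
  clock, meridiem = stamp.split(' ')
  h, m, s = clock.split(':').map(&:to_i)
  h %= 12                               # 12 AM is hour 0, 12 PM is hour 12
  h += 12 if meridiem.upcase == 'PM'
  h * 3600 + m * 60 + s
end

# For each sent stamp, the number of seconds until the first received
# stamp at or after it, or nil when no such received stamp exists.
def reply_delays(sent, received)
  recv = received.map { |t| to_seconds(t) }.sort
  sent.map do |t|
    s = to_seconds(t)
    nxt = recv.find { |r| r >= s }
    nxt && nxt - s
  end
end

reply_delays(['6:32:09 PM'], ['6:32:19 PM'])  # => [10]
```

If "closest" should also consider received messages that came before the sent one, the find call would need to be swapped for a minimum over absolute differences.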