Ideas for Extracting Info from HTML Chat Logs

nigma
DevNet Resident
Posts: 1094
Joined: Sat Jan 25, 2003 1:49 am

Post by nigma »

I've got a bunch of adium log files that are formatted like this:

<div class="send">
<span class="timestamp">6:32:09 PM</span>
<span class="sender">dotkrisennay@hotmail.com: </span>
<pre class="message">hey</pre></div>

<div class="receive">
<span class="timestamp">6:32:19 PM</span>
<span class="sender">rolyoly@hotmail.com: </span>
<pre class="message"><B>hola</B></pre></div>
I want to extract all the timestamp fields that reside in a "receive" div and put them in an array, and do the same for those that reside in a "send" div. The thing is, I'd like to avoid using regular expressions because I don't want to spend the time learning the syntax right now (unless the requisite syntax wouldn't be that hard to learn?). I can use C, Ruby, Perl, or PHP. (Although I am hoping to avoid doing it in C, since the easiest way I can think of doing it there is not as easy as I'd like it to be.) Basically, in the end I just want the timestamps sorted into separate arrays so that I can do things like find the difference between a sent message and the closest received message, etc.

Anyone want to offer some input on the matter?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

XML parsing?

Post by nigma »

What if the log files are neither valid nor well-formed XML?

Post by feyd »

Then you need to do string parsing or regex.
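
For the string-parsing route, here is a minimal Ruby sketch that uses only String#index and slicing, with no regular expressions. The helper name timestamps_by_class is made up for illustration, and the sketch assumes every div in the log contains a timestamp span, as in the sample from the first post.

```ruby
# Hypothetical helper: collect timestamps per div class using only
# String#index and slicing (no regular expressions).
# Assumption: every div contains exactly one timestamp span.
def timestamps_by_class(log)
  result = { "send" => [], "receive" => [] }
  open_div = '<div class="'
  open_span = '<span class="timestamp">'
  pos = 0
  while (div_at = log.index(open_div, pos))
    # Read the class name between the quotes of the div tag.
    class_start = div_at + open_div.length
    class_end = log.index('"', class_start)
    break unless class_end
    klass = log[class_start...class_end]

    # Read the text of the next timestamp span.
    span_at = log.index(open_span, class_end)
    break unless span_at
    text_start = span_at + open_span.length
    text_end = log.index('</span>', text_start)
    break unless text_end

    result[klass] << log[text_start...text_end] if result.key?(klass)
    pos = text_end
  end
  result
end

sample = <<~LOG
  <div class="send">
  <span class="timestamp">6:32:09 PM</span>
  <span class="sender">dotkrisennay@hotmail.com: </span>
  <pre class="message">hey</pre></div>

  <div class="receive">
  <span class="timestamp">6:32:19 PM</span>
  <span class="sender">rolyoly@hotmail.com: </span>
  <pre class="message"><B>hola</B></pre></div>
LOG

stamps = timestamps_by_class(sample)
puts stamps["send"].inspect
puts stamps["receive"].inspect
```

If a div ever lacks a timestamp span, this would pick up the next div's timestamp instead, so it is only as reliable as that assumption.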
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Post by josh »

If the whole file takes that form, use preg_match_all to get the div blocks, then run preg_match_all inside each block to get the spans, and sort through the resulting array, categorizing the matches into arrays based on the class attribute.
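
preg_match_all is a PHP function. As a rough illustration of the same two-pass idea, here is a sketch in Ruby, where String#scan plays a similar role. The helper name spans_by_div_class is hypothetical, and the sketch assumes divs are never nested and that no message text contains a literal closing div tag.

```ruby
# Hypothetical helper mirroring the two-pass idea: the outer scan grabs each
# div together with its class, the inner scan grabs each span with its class.
# Assumption: divs are not nested and messages contain no literal </div>.
def spans_by_div_class(log)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  log.scan(%r{<div class="(\w+)">(.*?)</div>}m) do |div_class, body|
    fields = {}
    body.scan(%r{<span class="(\w+)">(.*?)</span>}m) do |span_class, text|
      fields[span_class] = text.strip
    end
    grouped[div_class] << fields
  end
  grouped
end

sample = <<~LOG
  <div class="send">
  <span class="timestamp">6:32:09 PM</span>
  <span class="sender">dotkrisennay@hotmail.com: </span>
  <pre class="message">hey</pre></div>

  <div class="receive">
  <span class="timestamp">6:32:19 PM</span>
  <span class="sender">rolyoly@hotmail.com: </span>
  <pre class="message"><B>hola</B></pre></div>
LOG

grouped = spans_by_div_class(sample)
sent_times = grouped["send"].map { |fields| fields["timestamp"] }
received_times = grouped["receive"].map { |fields| fields["timestamp"] }
puts sent_times.inspect
puts received_times.inspect
```

Keeping every span per div (not just the timestamp) makes it easy to pull out the sender later without rescanning the file.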

Post by nigma »

In Perl the regular expressions weren't as hard as I anticipated them being:

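# Slurp the whole log (file argument or stdin) so the match can span the newline after the div tag.
my $log = do { local $/; <> };
# The logs use class="send", not "sent"; \s* allows whitespace between the two tags.
while ($log =~ m%<div class="send">\s*<span class="timestamp">(\d+:\d+:\d+) \w+</span>%g) {
  print "$1\n";
}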
Thanks for the advice.
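
For the follow-up goal from the first post (the gap between a sent message and the closest received one), here is a small Ruby sketch. The function names are made up for illustration, "closest" is taken to mean nearest in either direction, and all timestamps are assumed to fall on the same day, since the log lines carry no date.

```ruby
# Convert a 12-hour stamp such as "6:32:09 PM" to seconds since midnight.
def seconds_since_midnight(stamp)
  clock, meridiem = stamp.split(" ")
  hours, minutes, seconds = clock.split(":").map(&:to_i)
  hours %= 12                              # treat 12 as 0 before adding PM offset
  hours += 12 if meridiem.upcase == "PM"
  hours * 3600 + minutes * 60 + seconds
end

# For each sent stamp, return the gap in seconds to the nearest received
# stamp (before or after). Assumption: everything happens on one day.
def closest_gaps(sent, received)
  received_secs = received.map { |stamp| seconds_since_midnight(stamp) }
  sent.map do |stamp|
    sent_at = seconds_since_midnight(stamp)
    received_secs.map { |r| (r - sent_at).abs }.min
  end
end

gaps = closest_gaps(["6:32:09 PM"], ["6:32:19 PM", "6:40:00 PM"])
puts gaps.inspect
```

With an empty received list each gap comes back as nil, so callers may want to filter those out.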