What would be the best way to parse Gmail chat logs from the web page where they are displayed? As far as I know, this is the only way to access the Gmail chat logs stored on the server (through Gmail or Gmail on the desktop).
Looking at the generated source for the conversation, the markup is nested divs and spans (the divs elsewhere on the page have randomized two-character ids and classes with no apparent pattern). Here is an excerpt for a line that has a timestamp on the left:
<div>
<span style="display:block;float:left;color:#888">
2:56 PM
</span>
<span style="display:block;padding-left:6em">
<span>
<span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs
</span>
</span>
</div>
But not every line has a timestamp, so lines without one have non-breaking spaces in its place:
<div>
<span style="display:block;float:left;color:#888">
</span>
<span style="display:block;padding-left:6em">
<span>
and reformat that into something like an xml format
</span>
</span>
</div>
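To make the structure concrete, here is a minimal sketch of reading rows shaped like the two excerpts above with Python's standard-library `html.parser`. It is not XPath, just a small event-based parser. The names `ChatRowParser` and `parse_rows` are made up for illustration, and the sketch assumes every row is a top-level div whose first child span is the timestamp (possibly only non-breaking spaces) and whose second child span is the message. The real Gmail page may not match this.

```python
from html.parser import HTMLParser


class ChatRowParser(HTMLParser):
    """Collect (timestamp, text) pairs from rows shaped like the excerpts.

    Assumption: each row is a top-level <div>; its first direct <span> holds
    the timestamp and its second direct <span> holds the message.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []          # finished (timestamp or None, text) tuples
        self._div_depth = 0     # nesting depth of <div> elements
        self._span_depth = 0    # nesting depth of <span> inside the current row
        self._span_index = -1   # which direct child span we are in (0 or 1)
        self._parts = None      # text fragments for [timestamp, message]

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._div_depth += 1
            if self._div_depth == 1:
                # A new row starts: reset per-row state.
                self._parts = [[], []]
                self._span_index = -1
                self._span_depth = 0
        elif tag == "span" and self._div_depth == 1:
            self._span_depth += 1
            if self._span_depth == 1:
                self._span_index += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._div_depth == 1 and self._span_depth > 0:
            self._span_depth -= 1
        elif tag == "div" and self._div_depth > 0:
            self._div_depth -= 1
            if self._div_depth == 0 and self._parts is not None:
                # str.split() also splits on non-breaking spaces, so a
                # timestamp cell holding only &nbsp; collapses to "".
                stamp, text = (" ".join("".join(p).split()) for p in self._parts)
                self.rows.append((stamp or None, text))
                self._parts = None

    def handle_data(self, data):
        inside_span = self._span_depth > 0 and 0 <= self._span_index < 2
        if self._parts is not None and inside_span:
            self._parts[self._span_index].append(data)


def parse_rows(html):
    """Return a list of (timestamp or None, message text) tuples."""
    parser = ChatRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling `parse_rows` on the two excerpts should give one tuple per row, with `None` as the timestamp for the row that has only non-breaking spaces.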
Should I use XPath? Is there something more efficient?
Edit:
Once rendered, the data looks like this:
12:43 AM John: Something something something.
Something something something.
me: Something something something?
12:44 AM Also, something something something.
12:47 AM Something something something.
12:48 AM Something something something
with something something something.
12:49 AM John: Something.
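If working from this rendered text instead of the markup, a rough sketch with a regular expression might look like the following. The names `LINE_RE` and `parse_lines` are hypothetical. The sketch assumes each line is one message, that a missing timestamp or speaker means "same as the previous line", and that a speaker is a single word followed by a colon. A message that itself begins with a word and a colon would be misread as having a speaker.

```python
import re

# Optional "H:MM AM/PM" timestamp, optional "name:" speaker, then the text.
LINE_RE = re.compile(
    r"^(?:(?P<time>\d{1,2}:\d{2} [AP]M)\s+)?"
    r"(?:(?P<speaker>\w+):\s+)?"
    r"(?P<text>.*)$"
)


def parse_lines(lines):
    """Turn rendered chat lines into dicts with time, speaker, and text.

    Assumption: a line without a timestamp or speaker inherits the most
    recent one seen above it.
    """
    records = []
    time = speaker = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = LINE_RE.match(line)
        time = match.group("time") or time
        speaker = match.group("speaker") or speaker
        records.append(
            {"time": time, "speaker": speaker, "text": match.group("text")}
        )
    return records
```

Each resulting dict could then be written out as one element of whatever XML-like format is wanted.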