Can somebody explain a "tokenizer" for me?

Not for 'how-to' coding questions but PHP theory instead, this forum is here for those of us who wish to learn about design aspects of programming with PHP.

Moderator: General Moderators

Post Reply
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Can somebody explain a "tokenizer" for me?

Post by Luke »

I have heard the word "tokenizer" thrown around in this forum and elsewhere. I am under the impression that a tokenizer is some kind of preprocessor which defines a sort of grammer for a file. This grammer is then used to transform the file into a more readable or usable format. Is this anywhere near what it is? Can somebody give me an explanation of what a tokenizer is? I'm thinking maybe I need one for qCal, but I just don't know. :(
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Can somebody explain a "tokenizer" for me?

Post by Chris Corbyn »

A tokenizer is the tool that's used in the very first step of parsing source code. It splits source code into its most basic tokens.

Take this string of source code:

Code: Select all

class Foo extends Bar {
}
It can be split into (roughly) the following tokens:

T_CLASS
T_STRING
T_EXTENDS
T_STRING
'{'
'}'

They're usually represented by numbers in a lexical analyzer.

It's the Parser's job (higher level tool) to understand what those tokens mean. The tokenizer (part of the lexical analyzer) just finds the actual tokens.
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Can somebody explain a "tokenizer" for me?

Post by Chris Corbyn »

I might suggest reading up on Flex (or lex) ;)
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Can somebody explain a "tokenizer" for me?

Post by Luke »

Do you think making a tokenizer / lexer is maybe overkill for parsing an icalendar file? I'm looking at other libraries written in python and ruby and they seem to be simply using string/regex functions to parse the files. I'd really like to write whatever is best, not what's easiest though.
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Can somebody explain a "tokenizer" for me?

Post by josh »

Yes, it is overkill. I don't know much about ical files but you usually aren't going to need more then an "interpreter". I did something similar when I had to parse ESRI shp file format, which has probably just as many idiosyncrasies as the ical format. Usually file formats are going to follow some static rules, if you have the right white papers its just a matter of representing the rules as executable code. Parsers would be needed for something more complex, like m4 config files used for the sendmail MTA... that may be the level of complexity with ical but I doubt it. Basically it sounds like you just need to parse some tuples, there's not a lot of business logic that needs to be inferred from the grammar.. that I know of, start with the easiest solution and only get complicated if you can justify it would be my advise. If you were doing TDD you'd just write test cases and then write regex to pass them, I'd only look to something else if regex failed.. although I'm surprised the files wouldn't be binary, not ascii. Can you give more detail about the actual format?
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Can somebody explain a "tokenizer" for me?

Post by Luke »

Umm yea I think you are right. The format is pretty simple. I think this is definitely overkill. It's basically as easy as this:

Code: Select all

BEGIN:COMPONENTNAME
PROPERTYNAME;PARAM1=value1;PARAM2=value2:propertyvalue
BEGIN:SUBCOMPONENT
END:SUBCOMPONENT
END:COMPONENTNAME
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Can somebody explain a "tokenizer" for me?

Post by josh »

Yeah I'd regex for keywords, and as they are found a builder object would build a tree of objects, those *are* your symbols/tokens, conceptually.
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Can somebody explain a "tokenizer" for me?

Post by Luke »

Hmm... here's what I have so far:

Code: Select all

<?php
/**
 * Default icalendar parser (others include xcal parser, etc.)
 * @package qCal
 * @copyright Luke Visinoni (luke.visinoni@gmail.com)
 * @author Luke Visinoni (luke.visinoni@gmail.com)
 * @license GNU Lesser General Public License
 */ 
class qCal_Parser_iCalendar extends qCal_Parser {
 
    /**
     * This method will parse raw icalendar data. Eventually I might want to add a parseFromFile() method or something
     */
    public function parse($data) {
    
        $lines = explode("\r\n", $data);
        foreach($lines as $line) {
            if (preg_match("/(BEGIN):([a-z]+)/i", $line, $matches)) {
                // match BEGIN:COMPONENT
                $this->beginComponent($matches[2]);
            } elseif (preg_match("/(END):([a-z]+)/i", $line, $matches)) {
                // match END:COMPONENT
                $this->endComponent($matches[2]);
            } elseif (preg_match("/([a-z0-9-]+)([;a-z0-9=]+)?:(.+)/i", $line, $matches)) {
                // match PROPERTYNAME;PARAM=paramvalue:PROPERTYVALUE
                $params = array_filter(explode(";", $matches[2]));
                foreach ($params as $key => $paramset) {
                    $params = array();
                    if (strpos($paramset, "=") !== false) {
                        list($name, $value) = explode("=", $paramset, 2);
                        // pretty sure this will be OK, I don't think that params can be set multiple times
                        $params[$name] = $value;
                    }
                }
                $this->addProperty($matches[1], $matches[3], $params);
            } elseif (preg_match("/\s(.+)/i", $line, $matches)) {
                // match whitespace character and then anything (a continued content line)
                $this->continueProperty($matches[1]);
            }
        }
    
    }
    /**
     * Open a component (start it)
     */
    protected function beginComponent($name) {
    
        
    
    }
    /**
     * End currently open component
     */
    protected function endComponent($name) {
    
        
    
    }
    /**
     * Add a property to the currently open component
     */
    protected function addProperty($name, $value, $params) {
    
        
    
    }
    /**
     * Accepts a string to append to the previous property's value
     */
    protected function continueProperty($add) {
    
        
    
    }
 
}
I've written and rewritten this thing like ten times and I can get it working, but I can never seem to get it to go more than two levels deep. Any advice on this? I'm thinking some kind of recursion is in order... anyway I just thought I'd ask if anybody had any advice, I'll keep playin around with it.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: Can somebody explain a "tokenizer" for me?

Post by alex.barylski »

Tokenizing and parsing are best viewed as distinct.

Tokenizing is the process of accepting a source string and breaking it into logical tokens. An alphabet string for instance would be tokenized into 26 character tokens. Source code on the other hand is far more expressive, so tokens range from single character operators (*) to multi-character strings or keywords, like switch.

Tokenizing by hand is difficult as you essentially need to implement a finite state machine for each token you wish to extract.

I have been told that it is possible to tokenize languages, even those which are context free, using regex. Although I am not personally convinced, apparently in theory it's possible. I would certainly look into tokenizing using regex for iCal files.

Parsing, is the process that actually gives your tokens semantic meaning.

Looking at the example you gave, I would use regex to tokenize and manually parse the tokens. The structure is fairly simple so it should be easy enough to parse by hand.
I've written and rewritten this thing like ten times and I can get it working, but I can never seem to get it to go more than two levels deep. Any advice on this? I'm thinking some kind of recursion is in order... anyway I just thought I'd ask if anybody had any advice, I'll keep playin around with it.
What do you mean by 2 levels deep, exactly?

That usually means recursion is required, yes.

Cheers,
Alex
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Can somebody explain a "tokenizer" for me?

Post by josh »

I'd rather see it written without needing recursion.. its your call though, do whats simplest. FYI the "parse tree" concept works something like this:

*devnet cuts this off, right click in FF and hit 'view image' or access the URL: http://img178.imageshack.us/img178/4710 ... reesq1.jpg

Image

Each "token" is an object in a tree, if you tell a node to execute it tells it's children to execute, the parser has to also "look up" or "down" the tree, depending on what it's parsing, since almost all languages formal and informal have tokens which modify the parsing rules for their parent and/or child tokens
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Can somebody explain a "tokenizer" for me?

Post by Ambush Commander »

You're looking for some sort of stack. Ya know, something you can push and pop off of. To tell you more you'll have to tell us what kind of data-structure you're looking to put the iCal into later... will it be a bunch of nested associated arrays? Objects?
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Can somebody explain a "tokenizer" for me?

Post by Luke »

Well I need to eventually build objects, but certain properties are required for certain components, so you can't create a component and then add line by line, it has to be done all at once, when you know all properties and subcomponents.

This will throw an exception

Code: Select all

$alarm = new qCal_Component_Alarm(array(
    'trigger' => 'P2WT5H20M'
));
Because the "action" property is required

Code: Select all

$alarm = new qCal_Component_Alarm(array(
    'trigger' => 'P2WT5H20M',
    'action' => 'audio'
));
You can add sub-components whenever you want

Code: Select all

$todo = new qCal_Component_Todo(array(
    'summary' => 'A todo item',
    'description' => 'This is the todo item description. It is longer than the summary',
));
$alarm = new qCal_Component_Alarm(array(
    'trigger' => 'P2WT5H20M',
    'action' => 'audio'
));
$todo->attach($alarm);
I was playing around with a solution that used a queue inside the parser, but I just couldn't seem to make it work. Maybe I just need to take a break and give it a look tomorrow.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Can somebody explain a "tokenizer" for me?

Post by Ambush Commander »

Ok, that's fair. Sounds like nested arrays, with a stack of references to levels of the array, is the way to go.

Whenever you hit a BEGIN, create an array to represent it and then create a reference to it on top of the stack. You operate off of the top element of the stack. When the next BEGIN rolls around, the same process happens, but the top element changes, and all of the props get funneled to the appropriate array element.

Then, when you hit an END, pop the stack. The top element of the stack will now be the parent of what you were originally working on, and you can continue adding properties to that.

Be warned; the most natural code for converting this array into your objects is recursive, unless you use a stack again to keep track of your iterative process through the array (or use some sort of recursive iterator).

Cheers,
Edward
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Can somebody explain a "tokenizer" for me?

Post by Luke »

I just can't seem to figure this out. I'm not sure where I'm going wrong.

Code: Select all

       $lines = explode($this->line_terminator, $this->content);
        $stack = array();
        foreach ($lines as $line) {
            // begin a component
            if (preg_match('#^BEGIN:([a-z]+)$#i', $line, $matches)) {
                // create new array representing the new component
                $array = array(
                    'type' => $matches[1],
                    'properties' => array(),
                    'children' => array(),
                );
                $stack[] = $array;
            } elseif (strpos($line, "END:") === 0) {
                // end of component
                $component = array_pop($stack);
                if (empty($stack)) {
                    // this is the end, we're done!
                    // and it doesn't <span style='color:blue' title='I&#39;m naughty, are you naughty?'>smurf</span> work.
                    pre($component);
                } else {
                    // attach this component to its parent
                    $parent =& current($stack);
                    array_push($parent['children'], $component);
                }
            } else {
                // continue component
                if (preg_match('#^([a-z;-]+):([a-z]+)$#i', $line, $matches)) {
                    // if line is a property line, start a new property
                    //$array['properties'][$matches[1]] = $matches[2];
                } elseif (preg_match('#^\w(.+)$#', $line, $matches)) {
                    // if it is a continuation of a line, continue the last property
                    //$lastproperty = current($array['properties']);
                    //$lastproperty .= $matches[1];
                }
            }
        }
I can never seem to get it to build the array. It just returns the original vcalendar component with no children.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Can somebody explain a "tokenizer" for me?

Post by Ambush Commander »

I suspect current() is defective. Check what it's returning.
Post Reply