Kodify Syntax Highlighter (and Lx, a lexical analyzer) in JS

Posted: Sat Jan 10, 2009 3:50 am
by Chris Corbyn
Here's what it does:

http://w3style.co.uk/~d11wtq/kodify/demo/

It supports multiple languages, though I've only written one so far. It uses a lexical analyzer based on lex.

To explain the algorithm in simple terms, LxAnalyzer.lex() is called over and over until the EOF indicator is returned. Each time lex() is called it searches through the set of rules registered for the current state. When a rule matches, its associated callback routine is invoked, which interacts with the lexical analyzer's state stack and with Kodify's output methods. The ordering of rules is generally not important since the lexical analyzer selects the longest match; if two rules match the same length, the one declared first wins.
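As a rough illustration of that loop, here is a minimal standalone sketch. It is not the Lx source: makeScanner, demoRules, emit and tokenize are hypothetical names made up for this example, and states are plain strings rather than the integer IDs Lx uses.

```javascript
// Minimal sketch of a lex-style scan loop; NOT the actual Lx code.
// makeScanner, demoRules, emit and tokenize are hypothetical names.
var EOF = 0;

function makeScanner(input, rules) {
  var s = { input: input, out: [], stack: ["INITIAL"] };

  // One call scans one token: try every rule registered for the current
  // state and keep the longest match (the earlier rule wins a tie).
  s.lex = function lex() {
    if (s.input.length === 0) {
      return EOF;
    }
    var current = s.stack[s.stack.length - 1];
    var best = null;
    var bestText = "";
    for (var i = 0; i < rules.length; ++i) {
      if (rules[i].state !== current) {
        continue;
      }
      var m = rules[i].pattern.exec(s.input);
      if (m && m.index === 0 && m[0].length > bestText.length) {
        best = rules[i];
        bestText = m[0];
      }
    }
    if (!best) {
      // Fallback in the spirit of a default rule: consume one character.
      bestText = s.input.charAt(0);
      best = { action: emit("char") };
    }
    s.input = s.input.substring(bestText.length);
    return best.action(bestText, s);
  };

  return s;
}

// Builds an action that records the match and returns a non-zero token ID.
function emit(kind) {
  return function (text, s) {
    s.out.push([kind, text]);
    return 1;
  };
}

var demoRules = [
  { state: "INITIAL", pattern: /^if/, action: emit("keyword") },
  { state: "INITIAL", pattern: /^[a-z]+/, action: emit("word") },
  { state: "INITIAL", pattern: /^"/, action: function (text, s) {
    s.stack.push("STRING"); // opening quote: switch state
    return emit("quote")(text, s);
  } },
  { state: "STRING", pattern: /^[^"]+/, action: emit("string") },
  { state: "STRING", pattern: /^"/, action: function (text, s) {
    s.stack.pop(); // closing quote: back to the previous state
    return emit("quote")(text, s);
  } }
];

function tokenize(input) {
  var s = makeScanner(input, demoRules);
  while (s.lex() !== EOF) { /* the actions do the work */ }
  return s.out;
}
```

With these rules, tokenize('if iffy "a b"') classifies "if" as a keyword but "iffy" as a word, because the longer match wins regardless of rule order.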

You can also view the source, highlighted by Kodify itself, here:
Lx: http://w3style.co.uk/~d11wtq/kodify-source/Lx.html
Kodify: http://w3style.co.uk/~d11wtq/kodify-source/Kodify.html

Lexical Analyzer (standalone project of mine)

Fully unit tested to operate in any ECMAScript environment. Tests written in ECMAUnit.

[js]
/**
 * Lx : ECMAScript Lexical Analyzer with a lex like API.
 *
 * License: LGPLv3
 *
 * @author Chris Corbyn <chris@w3style.co.uk>
 * @version 0.0.0
 */

/**
 * References the currently running LxAnalyzer.
 *
 * @see {@link LxAnalyzer}
 */
var Lx;

/**
 * Default callback routine for LxDefaultRule.
 *
 * Pre-defined only for optimization.
 *
 * If overridden, the action invoked by the default rule will be the new
 * action.
 */
var LxDefaultAction = function LxDefaultAction() {
  Lx.Echo();
  return Lx.Text.charCodeAt(0);
};

/**
 * The default matching rule used internally by Lx.
 *
 * Pre-defined only for optimization.
 *
 * If overridden any unmatched tokens will be checked by the new rule.
 */
var LxDefaultRule = {

  /** Required property "pattern" specifying what to match */
  pattern : /^[\x00-\xFF]/,

  /** Required property "action" specifying the routine to invoke */
  action : LxDefaultAction

};

/**
 * An action which does nothing.
 *
 * Pre-defined for optimization.
 *
 * If overridden, rules applied will have the new action by default.
 */
var LxEmptyAction = function LxEmptyAction() {};

/**
 * The entire lexical analyzer class.
 *
 * This class contains all functionality for scanning.  When running it is
 * also accessible via the global instance Lx.
 *
 * - Configuration methods are camelCase, starting with a lowercase letter.
 * - Scanning routine methods are CamelCase starting with an uppercase letter.
 * - Scanning properties are CamelCase starting with an uppercase letter.
 *
 * @constructor
 */
var LxAnalyzer = function LxAnalyzer() {

  /** The input stream (String) */
  this.In = '';

  /** The output stream (String) */
  this.Out = '';

  /** The current start condition (state ID) */
  this.START = 0;

  /** The initial start condition (state ID) */
  this.INITIAL = 0;

  /** The EOF token ID */
  this.EOF = 0;

  /** The matched text during a scan */
  this.Text = '';

  /** The matched Text length during a scan */
  this.Leng = 0;

  /** The current line number (only if Lx.countLines() is specified) */
  this.LineNo = 1;

  /** The value of the matched token */
  this.Lval = {};

  /** @private */
  var _TID = 256;

  /** @private */
  var _SID = 0;

  /** @private */
  var _rules = {
    0 : []
  };

  /** @private */
  var _wantsMore = false;

  /** @private */
  var _stateStack = [];

  /** @private */
  var _minInputSize = 32;

  /** @private */
  var self = this;

  /** For consistency between actions using Lx and token specification */
  Lx = self;

  // -- Public methods

  /**
   * FSA optimization setting for minimum input fragment size.
   *
   * The scanner will first test if a rule matches inside the first s chars
   * of the input source.
   *
   * Tokens are permitted to be longer (for example long strings), but the
   * first s chars in the token then must fit the pattern.
   *
   * Default value 32 should work fine, raising it will increase the chance of
   * matching very long tokens at the expense of speed.
   *
   * If you try to match an entire string with a rule of say,
   * /"[^"]*"/ then matching will fail for long strings (and rightly so). A
   * more optimized (and flexible) way to match such strings is to use state
   * switching.
   *
   * Lx.addExclusiveState("IN_STRING");
   *
   * Lx.addRule('"', Lx.INITIAL).action = function() {
   *   // Opening "
   *   Lx.PushState(Lx.IN_STRING);
   * };
   *
   * Lx.addRule(/[^"]+/, Lx.IN_STRING).action = function() {
   *   // String content
   * };
   *
   * Lx.addRule('"', Lx.IN_STRING).action = function() {
   *   // Closing "
   *   Lx.PopState();
   * };
   *
   * @param {Integer} s
   */
  this.setMinInputSize = function setMinInputSize(s) {
    _minInputSize = s;
  };

  /**
   * Defines a new exclusive state, accessible as a property of the currently
   * running analyzer.
   *
   * Exclusive states differ from inclusive in the tokens they match.  When
   * the analyzer is in an exclusive state it can only match tokens which are
   * in that state.  In an inclusive state the analyzer will match tokens with
   * no specified state along with tokens in its own state.
   *
   * @param {String} stateName The name of the state (tip: use UPPERCASE)
   *
   * @return The new state ID
   * @type Integer
   */
  this.addExclusiveState = function addExclusiveState(stateName) {
    if (typeof self[stateName] == "undefined") {
      self[stateName] = ++_SID;
      _rules[_SID] = [];
    }
    return self[stateName];
  };

  /**
   * Defines a new token ID with the given name, accessible as a property of
   * the current running analyzer.
   *
   * Defining a token does nothing by itself.  It must then be returned by
   * the action associated with a rule.
   *
   * @param {String} tokenName The name of the token (tip: use UPPERCASE)
   *
   * @return The new token ID
   * @type Integer
   *
   * @see {@link #addRule}
   */
  this.addToken = function addToken(tokenName) {
    if (typeof self[tokenName] == "undefined") {
      self[tokenName] = ++_TID;
    }
    return self[tokenName];
  };

  /**
   * Define a new rule matching the given pattern.
   *
   * If states is passed as a parameter this rule will only be active when the
   * analyzer is in one of the given states.  The states parameter may be the
   * state ID, or an Array of state IDs.
   *
   * @param {Object} pattern A String or RegExp to match
   * @param {Object} states The Integer state ID, or an Array of state IDs
   *
   * @return The new rule (contains a parameter named "action")
   * @type Object
   *
   * @see {@link #Echo}
   * @see {@link #Begin}
   * @see {@link #PushState}
   * @see {@link #PopState}
   * @see {@link #TopState}
   * @see {@link #Reject}
   * @see {@link #More}
   * @see {@link #Less}
   * @see {@link #Unput}
   * @see {@link #Input}
   * @see {@link #Terminate}
   *
   * @see {@link #addToken}
   * @see {@link #addExclusiveState}
   */
  this.addRule = function addRule(pattern, states) {
    if (!states) {
      states = [0];
    }

    if (!(states instanceof Array)) {
      states = [states];
    }

    var rule = {
      pattern : _optimizePattern(pattern),
      action : LxEmptyAction
    };

    var ruleContainer;
    for (var i = 0, len = states.length; i < len; ++i) {
      if (typeof _rules[states[i]] == "undefined") {
        throw "State ID " + states[i] + " does not exist";
      }
      ruleContainer = _rules[states[i]];
      ruleContainer[ruleContainer.length] = rule;
    }
    return rule;
  };

  /**
   * Find the next input token, advancing through the input.
   *
   * If no user-specified token is matched, the character code of the next
   * character is returned instead.
   *
   * @return The ID of the found token, or 0 (zero) for EOF.
   * @type Integer
   *
   * @see {@link #wrap}
   * @see {@link #addToken}
   */
  this.lex = function lex() {
    Lx = self;

    var tokenId;
    while (!tokenId && self.In.length > 0) {
      tokenId = _lexScan();
    }
    return !tokenId ? self.EOF : tokenId;
  };

  /**
   * Returns true if all input has been read, or false if not.
   *
   * This routine should always be called when {@link #lex} returns 0 since
   * the scanner may want to switch to a new input source.
   *
   * @return True if finished, false if not.
   * @type Boolean
   */
  this.wrap = function wrap() {
    return self.In.length == 0;
  };

  // -- Scanning routines

  /**
   * Tell the analyzer to retain whatever is in the Lx.Text property and append
   * the next found token to it instead of overwriting it.
   *
   * The value of Lx.Leng must not be modified.
   */
  this.More = function More() {
    _wantsMore = true;
  };

  /**
   * Tell the analyzer to put all but the first n characters back into the
   * input stream (Lx.In).
   *
   * Leng and Text are adjusted accordingly.
   *
   * @param {Integer} n Number of leading chars to keep in Lx.Text
   */
  this.Less = function Less(n) {
    if (n > self.Text.length) {
      throw "Cannot put back " + n + " characters from a " +
        self.Text.length + " token";
    }
    self.In = self.Text.substr(n) + self.In;
    self.Leng = n;
    self.Text = self.Text.substring(0, self.Leng);
  };

  /**
   * Place character c at the start of the input stream (Lx.In) so that it will
   * be scanned next.
   *
   * @param {String} c The character to place back on the input stream
   */
  this.Unput = function Unput(c) {
    self.In = c + self.In;
  };

  /**
   * Read the next character in the input stream and seek through the stream.
   *
   * @return The next character in the input stream, or 0 if it is empty
   * @type String
   */
  this.Input = function Input() {
    if (self.In.length == 0) {
      return 0;
    }

    var c = self.In.charAt(0);
    self.In = self.In.substring(1);
    return c;
  };

  /**
   * Append the contents of Lx.Text to the output stream (Lx.Out).
   */
  this.Echo = function Echo() {
    self.Out += self.Text;
  };

  /**
   * Switch the start condition to the given state.
   *
   * The next time {@link #lex} is invoked it will scan in the new state.
   *
   * @param {Integer} state The new state ID
   *
   * @see {@link #addExclusiveState}
   */
  this.Begin = function Begin(state) {
    if (!(state in _rules)) {
      throw "There is no state ID [" + state + "]";
    }
    self.START = state;
  };

  /**
   * Push the current state (Lx.START) onto the state stack and switch to the
   * new state via {@link #Begin}.
   *
   * @param {Integer} state The new state
   *
   * @see {@link #addExclusiveState}
   * @see {@link #PopState}
   * @see {@link #TopState}
   */
  this.PushState = function PushState(state) {
    _stateStack[_stateStack.length] = self.START;
    self.Begin(state);
  };

  /**
   * Pops the top off the state stack and switches to it via {@link #Begin}.
   *
   * @see {@link #addExclusiveState}
   * @see {@link #PushState}
   * @see {@link #TopState}
   */
  this.PopState = function PopState() {
    self.Begin(self.TopState());
    delete _stateStack[_stateStack.length - 1];
    --_stateStack.length;
  };

  /**
   * Returns the current top of the state stack without modifying the stack.
   *
   * Throws if the state stack is empty.
   *
   * @return The state ID at the top of the state stack.
   * @type Integer
   *
   * @see {@link #addExclusiveState}
   * @see {@link #PushState}
   * @see {@link #PopState}
   */
  this.TopState = function TopState() {
    if (_stateStack.length == 0) {
      throw "Cannot read state stack since it is empty";
    }
    return (typeof _stateStack[_stateStack.length - 1] != "undefined")
      ? _stateStack[_stateStack.length - 1]
      : self.INITIAL
      ;
  };

  /**
   * Restart with new input, resetting the scanner (except for the START state).
   *
   * @param {String} input
   */
  this.Restart = function Restart(input) {
    self.In = input;
    self.Out = '';
    self.Text = '';
    self.Leng = 0;
    self.LineNo = 1;
    self.Lval = {};
    _wantsMore = false;
    _stateStack = [];
  };

  // -- Private methods

  /** @private */
  var _optimizePattern = function _optimizePattern(re) {
    if (typeof re.valueOf() == "string") {
      return re.valueOf();
    }

    var regexString = re.toString();
    var pattern = regexString.substring(
      regexString.indexOf('/') + 1,
      regexString.lastIndexOf('/')
    );
    var flags = regexString.substring(regexString.lastIndexOf('/') + 1);
    if (!flags) {
      return new RegExp(pattern.replace(/^(?!\^)(.*)/, "^(?:$1)"));
    } else {
      return new RegExp(pattern.replace(/^(?!\^)(.*)/, "^(?:$1)"), flags);
    }
  };

  /** @private */
  var _scanByRegExp = function _scanByRegExp(re) {
    var match = '';
    var matches;

    //FSA optimization check with re.test()
    if (re.test(self.In.substring(0, _minInputSize))
      && (matches = re.exec(self.In))
      && matches.index == 0) {
      match = matches[0];
    }

    return match;
  };

  /** @private */
  var _scanByString = function _scanByString(string) {
    var match = '';

    if (self.In.substring(0, string.length) == string) {
      match = string;
    }

    return match;
  };

  /** @private */
  var _lexScan = function _lexScan() {
    var bestLength = 0;
    var bestMatch = '';
    var bestRule;

    //Inner function with access to local variables
    var scan = function scan(rule) {
      var match;
      if (typeof rule.pattern != "string") { //TODO: Cheaper test than typeof?
        match = _scanByRegExp(rule.pattern);
      } else /* optimize */ if (bestLength < rule.pattern.length) {
        match = _scanByString(rule.pattern);
      }

      if (match && match.length > bestLength) {
        bestLength = match.length;
        bestRule = rule;
        bestMatch = match;
      }
    };

    //Test each rule
    for (var i = 0, len = _rules[self.START].length; i < len; ++i) {
      scan(_rules[self.START][i]);
    }

    //If none match, use the default rule
    if (!bestRule) {
      scan(LxDefaultRule);
      bestRule = LxDefaultRule;
    }

    //Adjust Text and Leng
    if (_wantsMore) {
      self.Text += bestMatch;
      self.Leng += bestMatch.length;
    } else {
      self.Text = bestMatch;
      self.Leng = bestMatch.length;
    }

    _wantsMore = false;

    self.Lval = bestRule;

    //Advance through the input
    self.In = self.In.substring(bestMatch.length);

    //Return whatever the action specifies
    return bestRule.action();
  };

};
[/js]

Kodify Source (using Lx from above)

[js]
/**
 * Kodify : JavaScript Code Beautifier using a real lexical scanning approach.
 *
 * License: LGPLv3
 *
 * @author Chris Corbyn <chris@w3style.co.uk>
 * @version 0.0.0
 */

/** Config setting for the class needed for kodify to operate */
var KodifyClassName = "kodify";

/**
 * A programming language specification, wrapping the "Lx" project.
 *
 * @constructor
 *
 * @param {String} name The name of the language
 *
 * @see {@link Kodify#lang}
 */
var KodifyLanguage = function KodifyLanguage(name) {

  /** The name of this language */
  this.name = name;

  /** @private */
  var _scanner = new LxAnalyzer();

  /** @private */
  var _currentRule;

  /** @private */
  var _flags = {};

  /** @private */
  var self = this;

  /**
   * Declare a new flag named flagName having value v.
   *
   * @param {String} flagName The name of the flag
   * @param {Object} v The value to set
   * @return The current instance for method chaining
   *
   * @see {@link #flag}
   */
  this.addflag = function addflag(flagName, v) {
    _flags[flagName] = v;
    return self;
  };

  /**
   * Check the value of a flag named flagName, or change its value to v.
   *
   * If the second parameter v is passed the value of the flag is changed to v.
   *
   * @param {String} flagName The name of the flag
   * @param {Object} v An optional value to set
   * @return The value of the flag named flagName
   *
   * @see {@link #addflag}
   */
  this.flag = function flag(flagName, v) {
    if (typeof v != "undefined") {
      _flags[flagName] = v;
    }
    return _flags[flagName];
  };

  /**
   * Declare a named state in the lexical analyzer.
   *
   * @param {String} s The name of the state
   * @return The current instance for method chaining
   *
   * @see {@link LxAnalyzer#addExclusiveState}
   */
  this.state = function state(s) {
    _scanner.addExclusiveState(s);
    return self;
  };

  /**
   * Create a new rule in the lexical analyzer.
   *
   * @param {Object} pattern A RegExp or a String
   * @param {Object} states A state ID or an Array of state IDs
   * @return The current instance for method chaining
   *
   * @see {@link LxAnalyzer#addRule}
   */
  this.rule = function rule(pattern, states) {
    _currentRule = _scanner.addRule(pattern, states);
    return self;
  };

  /**
   * Specify the action to run when the last declared rule is matched.
   *
   * @param {Function} callback
   * @return The current instance for method chaining
   *
   * @see {@link LxAnalyzer#addRule}
   */
  this.onmatch = function onmatch(callback) {
    _currentRule.action = callback;
    return self;
  };

  /**
   * Replace the contents of "element" with the beautified version.
   *
   * @param {Element} element An element from the DOM
   */
  this.beautify = function beautify(element) {
    Kodify.language = self;
    Kodify.scanner = _scanner;
    Kodify.builder = new KodifyBuilder(element);

    _scanner.Begin(_scanner.INITIAL);
    _scanner.Restart(element.textContent
      ? element.textContent
      : element.innerText
    );
    while ((0 != _scanner.lex())) ;

    Kodify.builder.commit();
  };

};

/**
 * The Builder class for managing the creation of beautified content.
 *
 * @constructor
 *
 * @param {Element} element The target element to write to
 */
var KodifyBuilder = function KodifyBuilder(element) {

  /** @private */
  var _target = element;

  /** @private */
  var _context = document.createElement("span");

  /** @private */
  var self = this;

  /**
   * Append a node to the current context, applying className to it.
   *
   * @param {String} text The text to append
   * @param {String} className The class name to apply to the text
   */
  this.append = function append(text, className) {
    var s = document.createElement("span");
    if (className) {
      s.className = className;
    }
    var t = document.createTextNode(text);
    s.appendChild(t);
    _context.appendChild(s);
  };

  /**
   * Commit changes to the target element.
   */
  this.commit = function commit() {
    _target.innerHTML = '';
    _target.appendChild(_context);
  };

};

/**
 * The globally accessible Kodify instance.
 *
 * A singleton re-used for each block of code to be beautified.
 */
var Kodify = {

  /** The {@link KodifyBuilder} instance */
  builder : {},

  /** The {@link LxAnalyzer} instance of the current language */
  scanner : {},

  /** The currently scanning {@link KodifyLanguage} instance */
  language : {},

  /** Loaded language specifications as {@link KodifyLanguage} objects */
  languageList : {},

  /**
   * Define a new language and return the specification builder.
   *
   * @param {String} langName The name of the language
   * @return An instance of {@link KodifyLanguage}
   */
  lang : function lang(langName) {
    if (!(langName in this.languageList)) {
      this.languageList[langName] = new KodifyLanguage(langName);
    }
    return this.languageList[langName];
  },

  /**
   * Get or set a flag used in the scanning process.
   *
   * If the second parameter v is passed the value is changed to v.
   *
   * @param {String} flagName The name of the flag
   * @param {Object} v A value to set
   * @return The value of the flag named flagName
   *
   * @see {@link KodifyLanguage#flag}
   */
  flag : function flag(flagName, v) {
    return this.language.flag(flagName, v);
  },

  /**
   * Apply the class name "cls" to the matched text during scanning.
   *
   * @param {String} cls The class name to apply
   */
  className : function className(cls) {
    this.builder.append(this.scanner.Text, cls);
  },

  /**
   * Pass the matched text to the builder without applying any class.
   */
  unstyled : function unstyled() {
    this.builder.append(this.scanner.Text);
  },

  /**
   * Beautify all code blocks on the current page.
   *
   * Anything that has the class name of "kodify" will be scanned.
   */
  beautify : function beautify() {
    var targets = this.getElementsByClassName(KodifyClassName);
    for (var i = 0, ilen = targets.length; i < ilen; ++i) {
      var classParts = targets[i].className.split(/\s+/);
      for (var j = 0, jlen = classParts.length; j < jlen; ++j) {
        if (classParts[j] in this.languageList) {
          this.languageList[classParts[j]].beautify(targets[i]);
          break;
        }
      }
    }
  },

  /**
   * A document.getElementsByClassName() wrapper for non-supporting browsers.
   *
   * @param {String} className The required class name on the element
   * @return The list of matched Elements
   */
  getElementsByClassName : function getElementsByClassName(className) {
    if (document.getElementsByClassName) {
      return document.getElementsByClassName(className);
    }

    //For Internet Explorer 6
    var everything = document.all;
    var elements = [];
    var re = new RegExp("\\b" + className + "\\b", "i");
    for (var i = 0, len = everything.length; i < len; ++i) {
      if (everything[i].className && everything[i].className.match(re)) {
        elements[elements.length] = everything[i];
      }
    }
    return elements;
  }

};

//Make sure kodify runs when the page is loaded
window.onload = function () {
  Kodify.beautify();
};
[/js]

Re: Kodify Syntax Highlighter (and Lx, a lexical analyzer) in JS

Posted: Sat Jan 10, 2009 1:34 pm
by Weirdan
Chris Corbyn wrote:Fully unit tested to operate in any ECMAScript environment. Tests written in ECMAUnit.
Can we see them too?

Re: Kodify Syntax Highlighter (and Lx, a lexical analyzer) in JS

Posted: Sat Jan 10, 2009 2:30 pm
by Chris Corbyn
Weirdan wrote:
Fully unit tested to operate in any ECMAScript environment. Tests written in ECMAUnit.
Can we see them too?
Sure :)

http://github.com/d11wtq/lx/tree/optimi ... ts/unit/js

I'd post them all here but they're all in separate class files.

The README file under the base directory of that project explains how you can run them.

Re: Kodify Syntax Highlighter (and Lx, a lexical analyzer) in JS

Posted: Sat Jan 10, 2009 3:00 pm
by Chris Corbyn
You know, my use of window.onload was not a clever idea! Assigning to it replaces any handler that is already there, so it could interfere with other rich content on the web page. I'll fix that up to use the proper event registration methods.
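For what it's worth, one common way to do that looks roughly like the sketch below. This is only an illustration, not the fix that ended up in Kodify: addLoadListener is a hypothetical helper name, and the target is passed in as a parameter so the snippet does not depend on a browser window being present.

```javascript
// Hypothetical helper (not part of Kodify): register a "load" callback
// without overwriting handlers that other scripts have installed.
function addLoadListener(target, callback) {
  if (typeof target.addEventListener === "function") {
    // Standard DOM event registration
    target.addEventListener("load", callback, false);
  } else if (target.attachEvent) {
    // Older Internet Explorer event model
    target.attachEvent("onload", callback);
  } else {
    // Last resort: chain onto whatever handler is already there
    var previous = target.onload;
    target.onload = function () {
      if (typeof previous === "function") {
        previous.apply(this, arguments);
      }
      callback.apply(this, arguments);
    };
  }
}

// In a page, the last lines of Kodify could then become something like:
//   addLoadListener(window, function () { Kodify.beautify(); });
```

The chaining branch only matters for very old browsers, but it shows the core idea: keep the existing handler and call it before your own.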

Re: Kodify Syntax Highlighter (and Lx, a lexical analyzer) in JS

Posted: Fri Jan 16, 2009 12:16 am
by nor0101
That is downright sweet - I want the mouseover braces highlighting feature in my editor!
I'll definitely read through this when I get a minute. It might provoke me into finally writing that syntax highlighting plugin for Prolog in BBEdit...

Re: Kodify Syntax Highlighter (and Lx, a lexical analyzer) in JS

Posted: Fri Jan 16, 2009 1:25 am
by Chris Corbyn
nor0101 wrote:That is downright sweet - I want the mouseover braces highlighting feature in my editor!
I'll definitely read through this when I get a minute. It might provoke me into finally writing that syntax highlighting plugin for Prolog in BBEdit...
I have kodify.org launching soon which is how I plan on getting contributors to write language definition files and themes for it ( http://kodify.w3style.co.uk/theme-builder - half built, in dire need of refactoring).

If you have FF, Opera or Safari you can frustrate yourself by creating a theme and then discovering I haven't implemented a "save" option yet :P Backend stuff.