Sunlight Documentation

Usage
Nesting languages
1. Manually nesting languages
2. Embedded languages
Plugins
1. Menu plugin
2. Documentation link plugin
API
jQuery plugin
Known Issues
- Won’t fix
- Will eventually fix

Usage

NOTE: it is highly recommended that you use this library under standards compliant mode (e.g. you have something like this <!doctype html> at the top of your HTML file). See the known issues for details. The library still works perfectly, but some optional features may not be rendered correctly, like line numbers.

Installation

To install Sunlight, reference the core Sunlight file, the desired language files (or the combined version of them) and the CSS file.

<link rel="stylesheet" type="text/css" href="/path/to/sunlight.default.css" />
<script type="text/javascript" src="/path/to/sunlight-min.js"></script>
<script type="text/javascript" src="/path/to/sunlight.csharp-min.js"></script>

Preparing the HTML

The element containing the code you want to highlight must have a special class pertaining to the code’s language. The class name should be in the format sunlight-highlight-{language}, where {language} is the name of the language for the code to highlight. For example, to highlight a block of C♯, we would use sunlight-highlight-csharp, as shown below.

<pre class="sunlight-highlight-csharp">
public object DoStuff() {
	return new object();
}
</pre>

If the language specified in the class name is not registered, then the code block will be rendered as plaintext. You can also manually render plaintext using the sunlight-highlight-plaintext class.

Highlighting

Once your HTML is properly set up, all you need to do is invoke the highlighter.

<script type="text/javascript">
	Sunlight.highlightAll();
</script>

If a line is too long, the code block will automatically horizontally scroll. This is governed by the overflow-x CSS property, and does not work in IE6. I have no intentions of caring about that.

Nesting languages

A rather cool feature of Sunlight is its ability to nest languages. For example, you could nest PHP within HTML, and still have both languages highlighted. There are two ways to accomplish this.

Manually nesting languages

The manual way involves modifying the snippet of code you want to highlight by surrounding nested languages with an element with the appropriate sunlight class, and Sunlight will do the rest. Note: you must either use Sunlight.highlightAll() or highlighter.highlightNode() for manually nested languages to work.

<pre class="sunlight-highlight-php">&lt;?php
	//a PHP string with embedded MySQL query; note how the MySql is highlighted
	$query = '<code class="sunlight-highlight-mysql">
		SELECT
			u.username,
			u.user_password AS `secret`,
			p.latitude AS `coordinates`,
			LENGTH(p.geoid)
		FROM users u
		INNER JOIN places p
			ON p.id = u.place_id
		WHERE u.id &gt; 10</code>';
		
	print_r($query);
?&gt;</pre>

becomes

<?php
	//a PHP string with embedded MySQL query; note how the MySql is highlighted
	$query = '
		SELECT
			u.username,
			u.user_password AS `secret`,
			p.latitude AS `coordinates`,
			LENGTH(p.geoid)
		FROM users u
		INNER JOIN places p
			ON p.id = u.place_id
		WHERE u.id > 10';
		
	print_r($query);
?>

Embedded languages

The other way of nesting languages in Sunlight requires no work, and is called embedded languages. In this case, the language definition itself contains rules for when to switch to an embedded language. The default Sunlight distribution has several languages that support this feature:

Scala to XML, to an arbitrary depth
XML to PHP (started by a <?php tag)
XML to CSS (started by a <style> tag)
XML to JavaScript (started by a <script> tag)
XML to C# (started by a <% tag)

You can view an example on the demo page. Note that for embedded languages to work, you must register the languages you need to switch to; otherwise the context switch will not be made. For example, to highlight PHP within XML, you would need to include both sunlight.xml.js and sunlight.php.js.

To create your own rules for embedded languages, you must define a property on the language definition with a rule on when to switch to the embedded language, and a rule on when to switch back to the parent language. As an example, here are the embedded language rules for switching from XML to PHP and back:

//sunlight.xml.js
{
	embeddedLanguages: {
		php: {
			//the "switchTo" function is called before the current character is processed
			//if true, this indicates to the parser to switch to PHP mode
			switchTo: function(context) {
				var peek4 = context.reader.peek(4);
				return context.reader.current() === "<" && (peek4 === "?php" || /^\?(?!xml)/.test(peek4));
			},
			
			//the "switchBack" function is called after the current character is processed
			//if true, this indicates to the parser to switch back to XML mode from PHP
			switchBack: function(context) {
				var prevToken = context.token(context.count() - 1);
				return prevToken && prevToken.name === "closeTag";
			}
		}
	}
}
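
To see which inputs the switchTo rule above accepts, it can be exercised outside the highlighter. The sketch below copies the rule verbatim and feeds it a hypothetical stub context (stubContext is invented for this example and only implements current() and peek(), on the assumption that peek(n) returns the n characters after the current one).

```javascript
// Hypothetical stand-in for the parse context: the pointer sits on the first character.
function stubContext(text) {
	return {
		reader: {
			current: () => text.charAt(0),
			peek: (count = 1) => text.substring(1, 1 + count)
		}
	};
}

// The switchTo rule from sunlight.xml.js, copied from the snippet above.
function switchToPhp(context) {
	const peek4 = context.reader.peek(4);
	return context.reader.current() === "<" && (peek4 === "?php" || /^\?(?!xml)/.test(peek4));
}

const results = {
	longTag: switchToPhp(stubContext("<?php echo 1; ?>")),   // full PHP open tag
	shortTag: switchToPhp(stubContext("<?= $x ?>")),         // short tag, not followed by "xml"
	xmlDeclaration: switchToPhp(stubContext("<?xml version=\"1.0\"?>")),
	plainElement: switchToPhp(stubContext("<div>"))
};
```

Under these assumptions, longTag and shortTag are true, while xmlDeclaration and plainElement are false: the negative lookahead is what keeps an XML declaration from being treated as PHP.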

Plugins

Sunlight provides a way to extend or enhance the core functionality by the use of plugins. Plugins are handled by eight events, which fire at various times during the highlighting process. When each event is fired, a callback is executed. The context of the callback (the this object) is the current highlighter instance. The callback is passed one argument, a context, which has different values based on which event is being fired.

//description of the context argument for each event
{
	beforeHighlightNode: {
		//the DOM node that is about to be highlighted
		node: {}
	},
	
	afterHighlightNode: {
		//the wrapper DOM node (only set for block-level elements)
		container: {},
		
		//the wrapper DOM node around the element that contains the code (only set for block-level elements)
		codeContainer: {},
		
		//the DOM node that was highlighted
		node: {},
		
		//the count of how many nodes have been highlighted (globally) so far
		count: 1
	},
	
	beforeHighlight: {
		//the raw, unhighlighted code
		code: "code",
		
		//the language definition
		language: {},
		
		//the analyzer context from a partial parse (e.g. for nested languages)
		previousContext: {}
	},
	
	afterHighlight: {
		//the analyzer context
		analyzerContext: {}
	},
	
	beforeTokenize: {
		//the raw, unhighlighted code
		code: "code",
		
		//the language definition
		language: {}
	},
	
	afterTokenize: {
		//the raw, unhighlighted code
		code: "code",
		
		//the parser context
		parserContext: {}
	},
	
	beforeAnalyze: {
		//the analyzer context
		analyzerContext: {}
	},
	
	afterAnalyze: {
		//the analyzer context
		analyzerContext: {}
	}
}

You can hook into any of these events by calling Sunlight.bind().

Sunlight.bind("afterHighlightNode", function(context) {
	/* do interesting stuff */
});
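
As a rough illustration of the callback contract described above (the highlighter as the this object, plus a single context argument), here is a sketch that runs without the library. createStubSunlight and its fire method are hypothetical stand-ins invented for this example; only bind() and the afterHighlightNode context fields come from the text above.

```javascript
// Hypothetical stand-in for the Sunlight object; only bind() mirrors the documented API.
function createStubSunlight() {
	const handlers = {};
	return {
		bind(eventName, callback) {
			(handlers[eventName] = handlers[eventName] || []).push(callback);
		},
		// Invented for this sketch: invokes each callback with the highlighter as `this`.
		fire(eventName, highlighter, context) {
			(handlers[eventName] || []).forEach((callback) => callback.call(highlighter, context));
		}
	};
}

// A tiny plugin that records the running count reported after each node is highlighted.
function installCounterPlugin(sunlight, log) {
	sunlight.bind("afterHighlightNode", function (context) {
		log.push({ highlighter: this, count: context.count });
	});
}

const stub = createStubSunlight();
const log = [];
const fakeHighlighter = { name: "fake highlighter" };
installCounterPlugin(stub, log);
stub.fire("afterHighlightNode", fakeHighlighter, { node: {}, count: 1 });
```

Writing the plugin as a function that receives the Sunlight object makes it easy to test against a stub like this one before loading it on a real page.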

Menu plugin

The menu plugin shows a menu in the upper-right corner of block-level code snippets. The menu contains links to collapse or expand the current code block and to view the raw text, as well as a link back to the main Sunlight site. Here is an example of the menu plugin in action:

var highlighter = new Sunlight.Highlighter({ 
	//you can set autoCollapse to true to have the block collapsed by default
	//autoCollapse: true,
	showMenu: true
});
highlighter.highlightNode(document.getElementById("menu-example"));

The menu plugin is disabled in IE6 due to lack of support for getComputedStyle. This plugin is included with the default Sunlight distribution; you can load it by including the plugins/sunlight-plugin.menu-min.js file.

Documentation link plugin

This plugin renders certain tokens as hyperlinks to the language’s documentation. Supported languages are:

Lua (functions)
Perl (functions)
PHP (functions and language constructs)
Python (functions)
Ruby (functions)

This plugin is included in the default distribution. To install it, include the plugins/sunlight-plugin.doclinks-min.js file, and then enable it with the enableDoclinks option.

<?php
	//ctype_digit, is_int and unset all link to php.net documentation
	
	foreach ($_GET as $key => $value) {
		if (ctype_digit($value)) {
			$value = (int)$value;
		}
		
		if (!is_int($value)) {
			unset($_GET[$key]);
		}
	}
?>

API

The `Highlighter` Object

The Highlighter object provides several different methods for highlighting. But first you need to create one.

var highlighter = new Sunlight.Highlighter(); //yay!

To highlight a chunk of text, call highlight(). It returns a context object that contains information about the parsing process, as well as an array of DOM nodes representing the code. Using the DOM nodes you could, for example, generate the raw HTML.

//first argument is the text to highlight, second is the language id
var context = highlighter.highlight("var foo = new Date();", "javascript");
var nodes = context.getNodes(); //array of DOM nodes

//the following will convert it to an HTML string
var dummyElement = document.createElement("div");
for (var i = 0; i < nodes.length; i++) {
	dummyElement.appendChild(nodes[i]);
}

var rawHtml = dummyElement.innerHTML;

To highlight a DOM node, pass the element to highlightNode():

var nodeToHighlight = document.getElementById("code-to-highlight");
highlighter.highlightNode(nodeToHighlight);

Note that this will recursively highlight any nested nodes that match a sunlight class name.

Customizing the highlighter

The Highlighter accepts an optional argument containing various options.

You can also set the options globally via the globalOptions property, which will become the default unless you override them in each Highlighter instance.

//set options globally
Sunlight.globalOptions.optionName = value;

//set options for this highlighter instance
var options = { /* options */ };
var highlighter = new Sunlight.Highlighter(options);

//OR
Sunlight.highlightAll(options);

Class prefix

You can change the value of the class prefix that Sunlight uses when generating all of its DOM nodes by changing the value of the classPrefix property. The default class prefix is "sunlight-". Note that the CSS classes will need to be updated if the class prefix is changed.

TAB width

You can change the width of a TAB character for each code block using the tabWidth property. The default width is four spaces.

This option has significance regarding whitespace. TAB characters, as everyone should know, take up one byte in the file but expand to a different number of spaces depending on the editor. This option allows you to change the value of the expansion. However, it expands to a certain number of spaces based on the relative position of the TAB byte in the line. For example, with the default width of four, a TAB byte starting in the first column would take up four spaces, whereas a TAB byte starting in the third column would only take up two spaces. If you don’t understand any of this, you’re probably one of those people who think using TAB characters in source code is foolish. I pity you, and you should learn about how they work.

	This line is indented by one TAB
  	This line is indented by two spaces and one TAB
    This line is indented by four spaces
	And yet they all line up. It's a mystery!

Now, the same code using a tab width of 7:

	This line is indented by one TAB
  	This line is indented by two spaces and one TAB
    This line is indented by four spaces
	And yet they all line up. It's a mystery!
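
The tab-stop arithmetic behind these examples can be sketched in a few lines: a TAB advances to the next multiple of the tab width. This is an illustration of the general technique, not Sunlight's actual implementation; expandTabs is a hypothetical helper.

```javascript
// Expands each TAB to the number of spaces needed to reach the next tab stop.
function expandTabs(line, tabWidth) {
	let out = "";
	for (const ch of line) {
		if (ch === "\t") {
			// Distance from the current column to the next multiple of tabWidth.
			out += " ".repeat(tabWidth - (out.length % tabWidth));
		} else {
			out += ch;
		}
	}
	return out;
}

const oneTab = expandTabs("\tcode", 4);           // TAB in the first column: 4 spaces
const twoSpacesAndTab = expandTabs("  \tcode", 4); // TAB in the third column: 2 spaces
const oneTabWide = expandTabs("\tcode", 7);
const twoSpacesAndTabWide = expandTabs("  \tcode", 7);
```

With a width of 4 both lines end up indented by four columns, matching a line indented by four literal spaces. With a width of 7 both are indented by seven columns, so they no longer line up with the four-space line.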

Showing whitespace

You can use the showWhitespace option to show or hide whitespace (spaces and TABs). This is not a super useful option, but it can help, for example, to illustrate how TABs work. Here is the previous example concerning tabWidth:

	This line is indented by one TAB
  	This line is indented by two spaces and one TAB
    This line is indented by four spaces
	And yet they all line up. It's a mystery!

Line numbers

Sunlight can also generate line numbers for each code block. There are three different options for displaying line numbers:

lineNumbers: true: display line numbers
lineNumbers: false: don’t display line numbers
lineNumbers: "automatic": display line numbers only if the node is a block-level element (default)

You can also specify which line number to start from as shown below:

var options = { 
	lineNumbers: true,
	lineNumberStart: 397
};

var highlighter = new Sunlight.Highlighter(options);
highlighter.highlightNode(document.getElementById("line-number-example"));

Line numbers are displayed as hyperlink fragments. Standards compliant mode is required for the hyperlinks to be rendered correctly in IE. These IE limitations will not be addressed.

Line highlighting

In addition to displaying line numbers, Sunlight can also highlight pre-defined lines. To highlight certain lines, specify each line number in an array with the lineHighlight option.

var highlighter = new Sunlight.Highlighter({ 
	lineNumbers: true, 
	lineHighlight: [2, 4, 5, 6]
});

highlighter.highlightNode(document.getElementById("line-highlight-example"));
//this is line 7

Max Height

To set the maximum height of the code block, set the maxHeight option to something other than false. If the value is an integer, then it will assume the height should be measured in pixels.

This option uses inline styles to set vertical overflow and maximum height, so it will not work in browsers that do not support those CSS properties (e.g. IE6).

Similar to the line number option, only block-level elements will be affected by this option.

Version

You can access the name of the current version of Sunlight. It’s pretty useless and I don’t know why you’d ever want to do it. But don’t let my incredulity stop you.

var highlighter = new Sunlight.Highlighter();
console.log(highlighter.version); //"1.22.0"

Registering Languages

Before you can actually highlight anything you need to register the language of the code you’re trying to highlight. This is normally done by including a language file (e.g. sunlight.csharp.js) but can be done directly as well.

var languageDefinition = { /* see below for details */ };
Sunlight.registerLanguage("languageName", languageDefinition);

Basic Language Definition

{
	//the language's keywords
	keywords: [],
	
	//strings, comments, etc.
	scopes: {},
	
	//i.e. +, &&, >>=, etc.
	operators: [],
	
	//should match a single character
	identFirstLetter: /regex/,
	
	//should match a single character
	identAfterFirstLetter: /regex/,
	
	//rules for coloring named idents, e.g. a class name
	namedIdentRules: {},
	
	//any custom tokens that don't fall under keywords or operators
	customTokens: {},
	
	//any custom parsing rules, e.g. the regex literal in JavaScript
	customParseRules: {},
	
	//custom analyzer for generating the HTML
	analyzer: myCustomAnalyzer,
	
	//whether keywords, tokens, etc. are case insensitive, default is false
	caseInsensitive: false,
	
	//regular expression that matches punctuation
	punctuation: /[^\w\s]/,
	
	//parse rule that parses a number
	numberParser: function(context) {},
	
	//regular expression defining what characters to not parse
	//this will short circuit the parsing process and no further parsing
	//will be done
	doNotParse: /\s/,
	
	//dictionary for storing arbitrary stateful objects during parsing/analysis
	contextItems: {}
};
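
As a concrete instance of the skeleton above, here is a minimal sketch of a definition for a made-up language. The language name "toy" and its keywords, scopes and operators are invented for illustration; only the property names and Sunlight.registerLanguage() come from this documentation.

```javascript
// Hypothetical definition for a made-up language called "toy".
const toyLanguage = {
	keywords: ["let", "if", "else", "return"],

	scopes: {
		// double-quoted strings; \" and \\ are escape sequences
		string: [["\"", "\"", ["\\\"", "\\\\"]]],
		// comments run from # to the end of the line
		comment: [["#", "\n", null, true]]
	},

	// longer operators are listed before their prefixes
	operators: ["==", "=", "+", "-"],

	identFirstLetter: /[A-Za-z_]/,
	identAfterFirstLetter: /\w/,
	caseInsensitive: false
};

// Register only when the library is actually loaded (e.g. on a page that includes it).
if (typeof Sunlight !== "undefined") {
	Sunlight.registerLanguage("toy", toyLanguage);
}
```

Once registered, a block marked with the class sunlight-highlight-toy would be highlighted using this definition.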

Defining keywords

Keywords should be an array of strings.

//JavaScript's keywords
[ 
	"break", "case", "catch", "continue", "default", "delete", "do", "else",	
	"finally", "for", "function", "if", "in", "instanceof", "new", "return",
	"switch", "this", "throw", "try", "typeof", "var", "void", "while", "with",
	"true", "false", "null"
];

Defining scopes

Scopes are defined as a map from the scope name to the scope definition. The scope definition is an array of arrays. In each array, index 0 is the scope opener, index 1 is the scope closer, index 2 is an array of escape sequences (optional) and index 3 is a boolean indicating whether the closer is zero width, meaning it should not be included in the token’s value (optional, defaults to false).

{
	string: [
		[
			//opened by a double quote
			"\"", 
			//closed by a double quote
			"\"", 
			//double quotes and backslashes are escaped by a backslash
			["\\\"", "\\\\"]
		], [
			//opened by a single quote
			"'", 
			//closed by a single quote
			"'", 
			//single quotes and backslashes are escaped by a backslash
			["\\\'", "\\\\"]
		]
	],
	
	comment: [
		[
			//opened by two forward slashes
			"//",
			//closed by a line break
			"\n",
			//no escape sequences
			null,
			//don't include the line break in the value
			true
		]
	]
};

There is an alternate way of defining the scope closer which allows more flexibility. Instead of just a string, it can be defined as an object with properties regex, which is a regular expression matching the end of the scope, and length, which is the length of the closer.

//Java's annotation scope definition
{
	annotation: [ ["@", { length: 1, regex: /[\s\(]/ }, null, true] ]
}
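
To make the four array slots concrete, here is a sketch of how a scope rule with a string closer could be applied to raw text. readScope is a hypothetical helper written for this example, not Sunlight's parser, and it does not handle the object form of the closer.

```javascript
// Reads one scope starting at `start`, or returns null if the opener does not match.
// rule = [opener, closer, escapeSequences (optional), zeroWidthCloser (optional)]
function readScope(text, start, rule) {
	const [opener, closer, escapes, zeroWidthCloser] = rule;
	if (!text.startsWith(opener, start)) {
		return null;
	}

	let pos = start + opener.length;
	let value = opener;
	while (pos < text.length) {
		// Escape sequences are consumed whole so they cannot close the scope.
		const escape = (escapes || []).find((sequence) => text.startsWith(sequence, pos));
		if (escape) {
			value += escape;
			pos += escape.length;
			continue;
		}

		if (text.startsWith(closer, pos)) {
			// A zero-width closer ends the scope but stays out of the value.
			if (!zeroWidthCloser) {
				value += closer;
				pos += closer.length;
			}
			return { value: value, end: pos };
		}

		value += text.charAt(pos);
		pos++;
	}

	// Unterminated scope: everything up to the end of the input belongs to it.
	return { value: value, end: pos };
}

const stringScope = readScope("\"a\\\"b\" + c", 0, ["\"", "\"", ["\\\"", "\\\\"]]);
const commentScope = readScope("// hi\nx", 0, ["//", "\n", null, true]);
```

The string scope keeps its closing quote, while the comment scope stops just before the line break because its closer is zero width.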

Defining ident rules

Idents are variable names, function names, etc. They are defined by two regular expressions.

{
	//the first character of an ident must be a letter or an underscore
	identFirstLetter: /[A-Za-z_]/,
	
	//anything after that can be a letter, number or underscore
	identAfterFirstLetter: /\w/
};

Defining operators

Operators are defined as an array of strings. The only difference between an operator and a keyword (to sunlight) is that operators are an exact regular expression match and keywords end with a word boundary.

The order in which operators are defined is important, because matching stops at the first operator that matches. If + were listed before +=, it would match first and Sunlight would stop looking, so += should come before + in the array.

//JavaScript's operators
[
	//arithmetic
	"++", "+=", "+",
	"--", "-=", "-",
		  "*=", "*",
		  "/=", "/",
		  "%=", "%",

	//boolean
	"&&", "||",

	//bitwise
	"|=",   "|",
	"&=",   "&",
	"^=",   "^",
	">>>=", ">>>", ">>=", ">>",
	"<<=", "<<",

	//inequality
	"<=", "<",
	">=", ">",
	"===", "==", "!==", "!=",

	//unary
	"!", "~",

	//other
	"?", ":", ".", "="
];
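
The effect of operator order can be shown with a small first-match lookup. matchOperator is a hypothetical helper for illustration; it only assumes that the operators are tried in array order and the first hit wins.

```javascript
// Returns the first operator in the list that matches the text at `pos`, or null.
function matchOperator(text, pos, operators) {
	for (const operator of operators) {
		if (text.startsWith(operator, pos)) {
			return operator;
		}
	}
	return null;
}

const source = "a += 1";
const goodOrder = matchOperator(source, 2, ["+=", "+"]); // longer operator listed first
const badOrder = matchOperator(source, 2, ["+", "+="]);  // prefix listed first
```

With the longer operator first the result is "+=". With the prefix first the result is "+", which would leave a stray "=" to be matched separately.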

Case sensitivity

Some languages (e.g. CSS) are not case sensitive. You can indicate this by setting caseInsensitive: true.

Punctuation Parsing

Punctuation can be customized using a regular expression. By default, punctuation is considered anything that is not whitespace and not a letter, number or underscore.

If you want to treat punctuation as just text, you can use a regular expression that doesn’t match anything: /(?!x)x/.

Number Parsing

By default, numbers are considered integers (e.g. 12), floats (e.g. 12.25), hex (e.g. 0x1A) and scientific notation (e.g. 1e3). The default number parsing algorithm is pretty loose, i.e. it will consider any number followed by a letter a number. You can customize this by injecting your own number parsing function, which is basically a custom parse rule.

This is the default number parse rule:

function(context) {
	var current = context.reader.current(), 
		number, 
		line = context.reader.getLine(), 
		column = context.reader.getColumn(),
		allowDecimal = true;

	if (!/\d/.test(current)) {
		//is it a decimal followed by a number?
		if (current !== "." || !/\d/.test(context.reader.peek())) {
			return null;
		}

		//decimal without leading zero
		number = current + context.reader.read();
		allowDecimal = false;
	} else {
		number = current;
		if (current === "0" && context.reader.peek() !== ".") {
			//hex or octal
			allowDecimal = false;
		}
	}

	//easy way out: read until it's not a number or letter
	//this will work for hex (0xef), octal (012), decimal and scientific notation (1e3)
	//anything else and you're on your own

	var peek;
	while ((peek = context.reader.peek()) !== context.reader.EOF) {
		if (!/[A-Za-z0-9]/.test(peek)) {
			if (peek === "." && allowDecimal && /\d$/.test(context.reader.peek(2))) {
				number += context.reader.read();
				allowDecimal = false;
				continue;
			}
			
			break;
		}

		number += context.reader.read();
	}

	return context.createToken("number", number, line, column);

}

Advanced language definition

Because sunlight was developed with flexibility in mind, it provides a few ways to extend a language definition to fit pretty much any language with minimal fuss. Well, sometimes maximal fuss depending on the language (I’m looking at you, C♯, with your hard-to-parse generic definitions).

A prerequisite for using any of the below APIs is to understand how sunlight parses text. Basically, using predetermined parse rules (which can also be customized, see the section on custom parse rules), it transforms the raw text into an array of tokens. Each token looks like this:

{
	name: "tokenName",    //e.g. "keyword"
	value: "tokenValue",  //e.g. "if"
	language: "languageName", //e.g. "php"
	line: 1,
	column: 1
};

After the raw text is tokenized, it is then analyzed. The analysis is performed by iterating over each token, and invoking the appropriate function on the analyzer. The analyzer can be injected for further customization.

Named ident rules

In addition to idents, sunlight has a built-in way of highlighting what it calls named idents. Basically these are class names in most languages, but they can be anything you want. For example, the XML language definition uses named idents to highlight attribute names.

There are four types of named ident rules. Three of them are for convenience. Those three are:

follows rules
precedes rules
between rules

follows and precedes rules are basically identical in that they both examine the tokens sequentially. The only difference is that follows rules go backward, and precedes rules go forward. Both are defined as an array of rules, where each rule is an array of objects with keys token (a string) and values (an array of strings).

between rules are defined as an array of objects. Each object has keys opener and closer, both of which are objects with keys token and values. between rules match any ident that falls between the opener and closer.

{
	namedIdentRules: {
		follows: [
			//this will match "MyClass" in "new MyClass()"
			[{ token: "keyword", values: ["new"] }, sunlight.util.whitespace]
		],
		
		precedes: [
			//this will match "String" in "String[]"
			[
				sunlight.util.whitespace, 
				{ token: "punctuation", values: ["["] }, 
				sunlight.util.whitespace,
				{ token: "punctuation", values: ["]"] }
			]
		],
		
		between: [
			//this will match "Cloneable" and "Kissable" in 
			//"class MyClass implements Cloneable, Kissable {"
			{ 
				opener: { token: "keyword", values: ["implements"] }, 
				closer: { token: "punctuation", values: ["{"] }
			}
		]
	}
};
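
To illustrate what "going backward" means for a follows rule, here is a sketch of how one rule could be checked against a token list. matchesFollows is a hypothetical helper, and the shape assumed for sunlight.util.whitespace (an optional token named "default") is an assumption made for this example.

```javascript
// Assumed shape of sunlight.util.whitespace: an optional whitespace token.
const whitespace = { token: "default", optional: true };

// Walks backward from the token before `index`, checking the rule's last requirement first.
function matchesFollows(tokens, index, requirements) {
	let tokenIndex = index - 1;
	for (let i = requirements.length - 1; i >= 0; i--) {
		const expected = requirements[i];
		const actual = tokens[tokenIndex];
		const matches = actual !== undefined &&
			actual.name === expected.token &&
			(!expected.values || expected.values.includes(actual.value));

		if (matches) {
			tokenIndex--;
		} else if (!expected.optional) {
			return false;
		}
	}
	return true;
}

const followsNew = [{ token: "keyword", values: ["new"] }, whitespace];

// Tokens for "new MyClass" and "var MyClass"; the ident is at index 2 in both.
const newTokens = [
	{ name: "keyword", value: "new" },
	{ name: "default", value: " " },
	{ name: "ident", value: "MyClass" }
];
const varTokens = [
	{ name: "keyword", value: "var" },
	{ name: "default", value: " " },
	{ name: "ident", value: "MyClass" }
];

const matchesAfterNew = matchesFollows(newTokens, 2, followsNew);
const matchesAfterVar = matchesFollows(varTokens, 2, followsNew);
```

Here matchesAfterNew is true and matchesAfterVar is false. A precedes rule would do the same walk in the other direction, starting at the token after the ident.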

The last type of ident rule is a custom rule. Custom rules allow the most flexibility. They give you direct access to the tokens, and from there you can do whatever you want. Custom rules are defined as an array of functions. Each function should return true or false and is passed a single argument, the analyzer context:

{
	//array of token objects
	tokens: [],
	
	//the index of the token being analyzed
	index: 0,
	
	//the language definition
	language: {},
	
	//used internally for parsing nested languages
	//you probably shouldn't mess with this
	continuation: someFunction,
	
	//adds the specified DOM node to the collection
	addNode: function(node) {},
	
	//creates a text node from the given token (this handles encoding, tab width, etc.)
	createTextNode: function(token) {},
	
	//returns an array of DOM nodes
	getNodes: function() {},
	
	//clears the node array (not recommended unless you know what you're doing)
	resetNodes: function() {},
	
	//the current highlighter instance's options
	options: {},
	
	//dictionary for arbitrary item storage
	items: {}
};

Below is a custom rule used in the C♯ language definition to detect aliases:

function(context) {
	//previous non-ws token must be "using" and next non-ws token must be "="
	//(either helper may return undefined, e.g. at the start or end of the token list)
	var prevToken = sunlight.util.getPreviousNonWsToken(context.tokens, context.index);
	if (!prevToken || prevToken.name !== "keyword" || prevToken.value !== "using") {
		return false;
	}

	var nextToken = sunlight.util.getNextNonWsToken(context.tokens, context.index);
	if (!nextToken || nextToken.name !== "operator" || nextToken.value !== "=") {
		return false;
	}

	return true;
}

And this is how it works (MyClass is a named ident):

using System.Linq;
using MyClass = System.Collections.ICollection;

Check the C♯ language definition for more examples of custom named ident rules.

Custom tokens

Custom tokens are basically keyword extensions. They behave in exactly the same way, but exist solely to provide a better breakdown between keyword-like tokens. For example, PHP has language constructs, functions and keywords. Language constructs (like echo, isset, etc.) are usually colored as keywords even though they often behave like functions.

Custom tokens are defined as a map from the name of the token to an object with keys values and boundary. boundary is a regex that Sunlight uses to determine when the token ends.

//PHP custom token excerpt
{
	customTokens: {
		languageConstruct: { 
			values: [
				"isset", "array", "unset", "list", "echo", "include_once", "include",
				"require_once", "require", "print", "empty", "return", "die", "eval",
				"exit"
			],
			boundary: "\\b"
		},
		
		constant: {
			values: [
				"__CLASS__", "__DIR__", "__FILE__", "__LINE__", "__FUNCTION__", 
				"__METHOD__", "__NAMESPACE__"
			],
			boundary: "\\b"
		},
		
		openTag: {
			values: ["<?php"],
			boundary: "\\s"
		}
	}
}
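
The role of boundary can be illustrated by combining each value with its boundary into a regular expression anchored at the current position. This is a sketch under that assumption, not Sunlight's actual matching code; escapeRegex and matchCustomToken are hypothetical helpers.

```javascript
// Escapes characters that have a special meaning in regular expressions.
function escapeRegex(text) {
	return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the first value that matches at the start of `text` and is followed by the boundary.
function matchCustomToken(text, tokenDefinition) {
	for (const value of tokenDefinition.values) {
		const regex = new RegExp("^" + escapeRegex(value) + tokenDefinition.boundary);
		if (regex.test(text)) {
			return value;
		}
	}
	return null;
}

const languageConstruct = { values: ["include_once", "include"], boundary: "\\b" };
const openTag = { values: ["<?php"], boundary: "\\s" };

const construct = matchCustomToken("include 'a.php';", languageConstruct);
const notConstruct = matchCustomToken("included", languageConstruct);
const tag = matchCustomToken("<?php echo 1;", openTag);
const notTag = matchCustomToken("<?phpinfo", openTag);
```

In this sketch "included" is rejected because no word boundary follows "include", and "<?phpinfo" is rejected because the open tag must be followed by whitespace.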

Custom parse rules

Custom parse rules can be used to perform your own language-specific parsing. Sunlight covers most of the basics by parsing keywords, operators, idents, etc. but occasionally you’ll need more control over the tokenizing process.

Custom parse rules are defined as an array of functions. Each function returns either a single token or an array of tokens, or null if the parse rule was not satisfied. The function is passed a single argument which is the parse context, defined as follows:

{
	//wrapper around the raw text
	reader: {
		//returns the next {count} characters without advancing the internal pointer
		peek: function(count) {},
		
		//returns the next {count} characters and advances the internal pointer
		read: function(count) {},
		
		//gets the current line number
		getLine: function() {},
		
		//gets the current column
		getColumn: function() {},
		
		//returns a boolean indicating whether the entire string has been read
		isEof: function() {},
		
		//returns a boolean indicating whether the internal pointer is at the end of a line
		isEol: function() {},
		
		//returns a boolean indicating whether the internal pointer is at the start of a line
		isSol: function() {},
		
		//returns a boolean indicating whether the internal pointer is at the start of a line
		//disregarding whitespace
		isSolWs: function() {},
		
		//the "end of string" constant, called EOF because EOS looks stupid
		EOF: undefined,
		
		//gets a string comprised of the current character to the end of the input string
		substring: function() {},
		
		//gets a string comprised of the next character to the end of the input string
		peekSubstring: function() {},
		
		//returns the character at the internal pointer's position
		current: function() {}
	},
	
	//the language definition
	language: language,
	
	//gets the token at the specified index, or undefined
	token: function(index) {},
	
	//gets all the tokens that have already been parsed
	getAllTokens: function() {},
	
	//gets the number of parsed tokens
	count: function() {},
	
	//the current Highlighter instance's options
	options: {},
	
	//contiguous unparsed characters (usually whitespace) are aggregated into
	//a single token, and are buffered here
	defaultData: {
		text: "",
		line: 1,
		column: 1
	},
	
	//creates and returns a token
	createToken: function(name, value, line, column) {},
	
	//user-defined dictionary for arbitrary storage
	items: {}
}

Below is an example of a custom parse rule from the PHP language definition. It handles the heredoc/nowdoc special string syntax. Another example is the regex literal handling in the JavaScript language definition. C♯ also contains several examples of how to handle contextual keywords (like get, set and value).

function(context) {
	if (context.reader.current() !== "<" || context.reader.peek(2) !== "<<") {
		return null;
	}
	
	var value = "<<<";
	var line = context.reader.getLine();
	var column = context.reader.getColumn();
	context.reader.read(2);
	
	var ident = "", isNowdoc = false;
	var peek = context.reader.peek();
	while (peek !== context.reader.EOF && peek !== "\n") {
		value += context.reader.read();
		
		if (peek !== "'") {
			//ignore NOWDOC apostrophes
			ident += context.reader.current();
		} else {
			isNowdoc = true;
		}
		
		peek = context.reader.peek();
	}
	
	if (peek !== context.reader.EOF) {
		//read the newline
		value += context.reader.read();
		
		//read until "\n{ident};"
		while (context.reader.peek() !== context.reader.EOF) {
			if (context.reader.peek(ident.length + 2) === "\n" + ident + ";") {
				break;
			}
			
			value += context.reader.read();
		}
		
		if (context.reader.peek() !== context.reader.EOF) {
			value += context.reader.read(ident.length + 1); //don't read the semicolon
		}
	}
	
	return context.createToken(isNowdoc ? "nowdoc" : "heredoc", value, line, column);
}
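
A custom parse rule can be tried outside the browser with a small mock of the parse context. The sketch below is an assumption-laden stand-in, not Sunlight's reader: createMockContext is hypothetical, implements only the members the rule needs, and assumes that after read() the pointer sits on the last character read. The variableRule and its "variable" token name are also invented for illustration.

```javascript
// Hypothetical mock of the parse context; supports single-line input only.
function createMockContext(text) {
	let index = 0;
	const reader = {
		EOF: undefined,
		current: () => text.charAt(index),
		// Next `count` characters after the current one, or EOF if none remain.
		peek: (count = 1) => {
			const value = text.substring(index + 1, index + 1 + count);
			return value === "" ? reader.EOF : value;
		},
		// Same characters as peek(), but also advances the pointer past them.
		read: (count = 1) => {
			const value = reader.peek(count);
			if (value !== reader.EOF) {
				index += value.length;
			}
			return value;
		},
		getLine: () => 1,
		getColumn: () => index + 1
	};

	return {
		reader: reader,
		createToken: (name, value, line, column) => ({ name, value, line, column })
	};
}

// Illustrative rule: "$" followed by one or more word characters becomes a "variable" token.
function variableRule(context) {
	if (context.reader.current() !== "$") {
		return null;
	}

	const line = context.reader.getLine();
	const column = context.reader.getColumn();
	let value = "$";
	let peek = context.reader.peek();
	while (peek !== context.reader.EOF && /\w/.test(peek)) {
		value += context.reader.read();
		peek = context.reader.peek();
	}

	return value.length > 1 ? context.createToken("variable", value, line, column) : null;
}

const variableToken = variableRule(createMockContext("$foo = 1"));
const noToken = variableRule(createMockContext("foo = 1"));
```

As in the heredoc rule above, the rule returns null when it does not apply, so the next rule gets a chance to run.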

Injecting a custom analyzer

After tokenizing, sunlight analyzes the tokens by iterating over each one and calling the appropriate function on an analyzer. That loop looks like this (simplified):

for (var i = 0, func; i < tokens.length; i++) {
	context.index = i;
	func = "handle_" + context.tokens[i].name;
	
	analyzer[func] ? analyzer[func](context) : analyzer.handleToken(context);
}

The default behavior is to create a DOM node that looks like this: <span class="sunlight-{tokenName}">{tokenValue}</span>. If you don’t like that, you can extend the default analyzer. The PHP analyzer does this so that it can transform function names into hyperlinks:

var addFunctionLink = function(context) {
	var word = context.tokens[context.index].value;
	var suffix = context.tokens[context.index].name;
	var link = document.createElement("a");
	link.className = "sunlight-" + suffix;
	link.setAttribute("href", "http://php.net/" + word);
	link.appendChild(context.createTextNode(word));
	context.addNode(link);
}

var phpAnalyzer = sunlight.createAnalyzer();
phpAnalyzer.handle_languageConstruct = addFunctionLink;
phpAnalyzer.handle_function = addFunctionLink;

//then inject it into the language definition
var languageDefinition = {
	analyzer: phpAnalyzer
};
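
The dispatch by handler name can be tried without a DOM. In the sketch below the analyzer pushes HTML strings into a hypothetical output array instead of calling addNode(), and no HTML encoding is performed; analyze and both handlers are illustrative stand-ins rather than Sunlight's own analyzer.

```javascript
// Calls handle_{tokenName} when the analyzer defines it, otherwise falls back to handleToken.
function analyze(tokens, analyzer) {
	const context = { tokens: tokens, index: 0, output: [] };
	for (let i = 0; i < tokens.length; i++) {
		context.index = i;
		const func = "handle_" + tokens[i].name;
		if (typeof analyzer[func] === "function") {
			analyzer[func](context);
		} else {
			analyzer.handleToken(context);
		}
	}
	return context.output.join("");
}

const stringAnalyzer = {
	// Default behavior: wrap the token value in a span named after the token.
	handleToken: function (context) {
		const token = context.tokens[context.index];
		context.output.push("<span class=\"sunlight-" + token.name + "\">" + token.value + "</span>");
	},

	// Specialized behavior for "function" tokens: render a documentation link.
	handle_function: function (context) {
		const token = context.tokens[context.index];
		context.output.push("<a class=\"sunlight-function\" href=\"http://php.net/" + token.value + "\">" + token.value + "</a>");
	}
};

const html = analyze([
	{ name: "function", value: "strlen" },
	{ name: "punctuation", value: "(" }
], stringAnalyzer);
```

Only the "function" token takes the specialized path; the punctuation token has no handle_punctuation method, so it falls through to handleToken.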

Utilities

Sunlight has several utility functions that make writing language definitions easier.

/**
 * Determines if the array contains the value
 *
 * @param {Array}   array           The haystack
 * @param           value           The needle
 * @param {boolean} caseInsensitive Set to true to enable case insensitivity
 * @returns {boolean}
 */
Sunlight.util.contains = function(array, value, caseInsensitive);

/**
 * Gets the last character if the value is a string, or the last element
 * in the array
 *
 * @param arrayOrString An array or a string
 */
Sunlight.util.last = function(arrayOrString);

/**
 * Creates a hash map from the given array
 *
 * @param   {Array}   wordMap         An array of strings to hash
 * @param   {string}  boundary        A regular expression representing the boundary of 
 *                                    each string (e.g. "\\b")
 * @param   {boolean} caseInsensitive Indicates if the words are case insensitive (defaults 
 *                                    to false)
 * @returns {object} Each string in the array is hashed by its first letter. The value
 *                   is transformed into an object with properties value (the original value)
 *                   and a regular expression to match the word.
 */
Sunlight.util.createHashMap = function(wordMap, boundary, caseInsensitive);

/**
 * Determines if a word in the word map matches the current context.
 * This should be used from a custom parse rule.
 *
 * @param {object}  context   The parse context
 * @param {object}  wordMap   A hashmap returned by createHashMap
 * @param {string}  tokenName The name of the token to create
 * @param {boolean} doNotRead Whether or not to advance the internal pointer
 * @returns {object} A token returned from context.createToken
 */
Sunlight.util.matchWord = function(context, wordMap, tokenName, doNotRead);

/**
 * Creates a between rule
 *
 * @param {int}     startIndex      The index at which to start examining the tokens
 * @param {object}  opener          { token: "tokenName", values: ["token", "values"] }
 * @param {object}  closer          { token: "tokenName", values: ["token", "values"] }
 * @param {boolean} caseInsensitive Indicates whether the token values are case insensitive
 * @returns {function} Accepts an array of tokens as the single parameter and returns a boolean
 */
Sunlight.util.createBetweenRule = function(startIndex, opener, closer, caseInsensitive);

/**
 * Creates a follows or precedes rule
 *
 * @param {int}     index           The index at which to start examining the tokens
 * @param {int}     direction       1 for follows, -1 for precedes
 * @param {array}   tokenReqs       Array of token requirements, same as namedIdentRules.follows
 * @param {boolean} caseInsensitive Indicates whether the token values are case insensitive
 * @returns {function} Accepts an array of tokens as the single parameter and returns a boolean
 */
Sunlight.util.createProceduralRule = function(index, direction, tokenReqs, caseInsensitive);

/**
 * Gets the previous non-whitespace token. This is not safe for looping.
 *
 * @param {array} tokens Array of tokens
 * @param {int}   index  The index at which to start
 * @returns {object} The token or undefined
 */
Sunlight.util.getPreviousNonWsToken = function(tokens, index);

/**
 * Gets the next non-whitespace token. This is not safe for looping.
 *
 * @param {array} tokens Array of tokens
 * @param {int}   index  The index at which to start
 * @returns {object} The token or undefined
 */
Sunlight.util.getNextNonWsToken = function(tokens, index);

/**
 * Gets the previous token while the matcher returns true
 *
 * @param {array}    tokens  Array of tokens
 * @param {int}      index   The index at which to start
 * @param {function} matcher Predicate for determining if the token matches
 * @returns {object} The token or undefined
 */
Sunlight.util.getPreviousWhile = function(tokens, index, matcher);

/**
 * Gets the next token while the matcher returns true
 *
 * @param {array}    tokens  Array of tokens
 * @param {int}      index   The index at which to start
 * @param {function} matcher Predicate for determining if the token matches
 * @returns {object} The token or undefined
 */
Sunlight.util.getNextWhile = function(tokens, index, matcher);
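
The four token-walking helpers above can be pictured as one loop run in two directions. The sketch below is a hypothetical stand-in, not Sunlight's source. It assumes that the search starts one token away from index, that the first token rejected by the matcher is returned, and that whitespace tokens are named "default" (as Sunlight.util.whitespace suggests).

```javascript
// Hypothetical stand-ins; start offset and return value are assumptions.
function walkWhile(tokens, index, direction, matcher) {
	var offset = 1;
	var token;
	// Step away from index while the matcher accepts the token
	while ((token = tokens[index + direction * offset++])) {
		if (!matcher(token)) {
			return token;
		}
	}
	// Ran off either end of the array
	return undefined;
}

function isWhitespace(token) {
	return token.name === "default";
}

function getPreviousNonWsToken(tokens, index) {
	return walkWhile(tokens, index, -1, isWhitespace);
}

function getNextNonWsToken(tokens, index) {
	return walkWhile(tokens, index, 1, isWhitespace);
}

var tokens = [
	{ name: "keyword", value: "new" },
	{ name: "default", value: " " },
	{ name: "ident", value: "Foo" },
	{ name: "default", value: " " },
	{ name: "punctuation", value: "(" }
];

var previous = getPreviousNonWsToken(tokens, 2); // the "new" keyword token
var next = getNextNonWsToken(tokens, 2);         // the "(" punctuation token
var none = getPreviousNonWsToken(tokens, 0);     // undefined: nothing precedes index 0
```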

/**
 * An object to be used in named ident rules to indicate optional whitespace
 */
Sunlight.util.whitespace = { token: "default", optional: true };

/**
 * Array of default string escape sequences
 */
Sunlight.util.escapeSequences = ["\\n", "\\t", "\\r", "\\\\", "\\v", "\\f"];

/**
 * The EOL character ("\r" on IE, "\n" otherwise)
 */
Sunlight.util.eol = "\n";

/**
 * Gets the computed style of the element
 *
 * @param {object} element A DOM element
 * @param {string} style   The name of the CSS style to retrieve
 * @returns {string}
 */
Sunlight.util.getComputedStyle = function(element, style);

/**
 * Escapes a string for use in a regular expression
 *
 * @param {string} s The string to escape
 * @returns {string}
 */
Sunlight.util.regexEscape = function(s);
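
Escaping matters whenever a literal operator or keyword is turned into a pattern. The function below is a hypothetical stand-in that shows one common way to do this; the exact set of characters Sunlight escapes is not stated here, so treat the character class as an assumption.

```javascript
// Hypothetical stand-in; the escaped character set is an assumption.
function regexEscape(s) {
	// Prefix every regex metacharacter with a backslash
	return s.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");
}

var operator = "++";
// Without escaping, new RegExp("^++") would throw a SyntaxError
var pattern = new RegExp("^" + regexEscape(operator));
var matchesOperator = pattern.test("++i"); // true
var escaped = regexEscape("a.b");          // "a\\.b", i.e. the characters a \ . b
```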

jQuery plugin

(function($, window){
	
	$.fn.sunlight = function(options) {
		var highlighter = new window.Sunlight.Highlighter(options);
		this.each(function() {
			highlighter.highlightNode(this);
		});
		
		return this;
	};
	
}(jQuery, this));

//e.g. $("code").sunlight();

Known issues

Won’t fix

Line numbers are not correctly colored in IE in quirks mode
Trailing newlines are not correctly detected in IE in quirks mode
Horizontal scrolling doesn’t work in IE6 (doesn’t support overflow-x)
The menu plugin is disabled in IE6

Will eventually fix

Highlight CSS property values
Python/Java/Lisp exponents, e.g. 1e3-10
Perl heredocs do not allow spaces before identifiers
Perl heredocs do not correctly detect empty (newline) identifiers
Nested languages inside custom scopes are not supported
The Haskell implementation is a bit weak
