Rules File Format

Note: I'm really bad at regular expressions, so the following examples may not be the best way to do things. Definitely don't use these examples to learn regular expressions.

Structure

Lines have between 1 and 3 fields, separated by tabs:

Match [TAB] Replace [TAB] Processing Instruction

Comments

Lines starting with # are ignored.
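
As a rough illustration of the structure described above, here is a minimal Python sketch of how one line of a rules file could be split into its fields. The tool itself is written in C#, so this is not its actual code; the names Rule and parse_rule are made up for the example, and skipping blank lines is an assumption.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    match: str                   # field 1: the regular expression
    replace: str                 # field 2: replacement text ("" means delete)
    instruction: Optional[str]   # field 3: processing instruction, e.g. "REPEAT"


def parse_rule(line: str) -> Optional[Rule]:
    """Parse one rules-file line; return None for comments (and, by assumption, blank lines)."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")  # fields are separated by tabs
    match = fields[0]
    replace = fields[1] if len(fields) > 1 else ""
    instruction = fields[2] if len(fields) > 2 else None
    return Rule(match, replace, instruction)
```

A line with only a Match field comes back with an empty replacement, which is what makes it a "delete" rule.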

Rules that just delete something

Example:
The following rule, which has nothing except the Match expression, removes all HTML table elements:


<TABLE.*>

This "Match" expression is a regular expression as used by .NET.
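
To see what a delete rule like this actually matches, here is a small sketch using Python's re module as a stand-in for .NET (the two behave the same for this pattern as far as I know, assuming default options where "." does not match a newline). The sample HTML is made up.

```python
import re

# A delete rule is just a replacement with an empty string.
one_line = '<p>Intro</p>\n<TABLE border="1"><TR><TD>cell</TD></TR></TABLE>\n<p>End</p>'
result_one_line = re.sub(r"<TABLE.*>", "", one_line)

# ".*" stops at the end of the line, so a table spread over several
# lines only loses its opening tag.
multi_line = "<TABLE>\n<TR></TR>\n</TABLE>"
result_multi_line = re.sub(r"<TABLE.*>", "", multi_line)

print(repr(result_one_line))    # '<p>Intro</p>\n\n<p>End</p>'
print(repr(result_multi_line))  # '\n<TR></TR>\n</TABLE>'
```

So the rule removes a whole table only when the table sits on a single line; the match is also greedy, so it runs to the last ">" on that line.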

Rules which replace something

A replacement rule has the match, followed by a tab, followed by a replacement expression.

Example:
Here, we want to change hyperlinks to simple spans:


<A class="gloss link".*?>(.*?)</A>    <span class="gloss">$1</span>

Notice that .NET regular expressions use $1 to refer to the first parenthesized group, $2 for the second, etc.

This "Replace" expression is whatever is allowed by .NET, with the following changes:

  • \n is replaced by a newline
  • \s is replaced by a space
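
Here is a hedged Python sketch of how a replacement rule with these two substitutions could be applied. It is not the tool's C# code: expand_replacement and apply_rule are hypothetical names, and because Python's re module writes group references as \g<1> rather than $1, the sketch converts them. Other backslashes in the replacement are not handled.

```python
import re


def expand_replacement(replace: str) -> str:
    """Apply the two documented substitutions, then adapt $N for Python."""
    text = replace.replace("\\n", "\n").replace("\\s", " ")
    # .NET uses $1, $2, ...; Python's re.sub expects \g<1>, \g<2>, ...
    return re.sub(r"\$(\d+)", r"\\g<\1>", text)


def apply_rule(text: str, match: str, replace: str) -> str:
    """Replace every occurrence of match in text with the expanded replacement."""
    return re.sub(match, expand_replacement(replace), text)


html = 'See <A class="gloss link" href="#g1">term</A>.'
out = apply_rule(
    html,
    r'<A class="gloss link".*?>(.*?)</A>',
    '<span class="gloss">$1</span>',
)
print(out)  # See <span class="gloss">term</span>.
```

The href attribute in the sample is invented; the rule itself is the one from the example above.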

Special Match values

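By adding to the C# code, you can manipulate a matched expression however you want. The name of the special operation goes where the replacement expression would normally be. I've only added one, "SentenceCap", which does capitalization by looking at punctuation marks.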

#Do some C# processing on everything found between elements
>[^<]+</    SentenceCap
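
The C# source for SentenceCap isn't shown here, so the following Python sketch is only a guess at the general shape: a function is called on each match and its return value replaces the matched text. The capitalization logic (first letter of the match, plus any letter after ".", "!" or "?" and whitespace) is an assumption, not the tool's actual behavior.

```python
import re


def sentence_cap(text: str) -> str:
    """Hypothetical stand-in: capitalize the first letter and letters that start a sentence."""
    return re.sub(
        r"(^\W*|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


html = "<p>hello world. more text</p>"
# Each match of the rule's expression is handed to the function.
out = re.sub(r">[^<]+</", lambda m: sentence_cap(m.group(0)), html)
print(out)  # <p>Hello world. More text</p>
```

In .NET the same idea is usually expressed by passing a MatchEvaluator delegate to Regex.Replace.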

Processing Instructions

A processing instruction tells 2p to do something special with this line. There's currently only one: REPEAT is used to say "keep doing this replacement until no more changes are possible". For example, the following removes all the blank lines in the file:

#remove empty lines
\n\r(\n\r)+    \n    REPEAT
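
A minimal Python sketch of what REPEAT means, under the assumption that it simply reruns the replacement until the text stops changing. apply_repeat is a hypothetical name, and the max_passes guard is my own addition so a rule that never settles can't loop forever.

```python
import re


def apply_repeat(text: str, pattern: str, replacement: str, max_passes: int = 1000) -> str:
    """Rerun the replacement until a pass changes nothing (or max_passes is hit)."""
    for _ in range(max_passes):
        new_text = re.sub(pattern, replacement, text)
        if new_text == text:
            break  # no more changes are possible
        text = new_text
    return text


# One pass only removes the innermost empty element; repeating removes them all.
nested = "x<b><b></b></b>y"
single_pass = re.sub(r"<b></b>", "", nested)
repeated = apply_repeat(nested, r"<b></b>", "")
print(single_pass)  # x<b></b>y
print(repeated)     # xy
```

The nested-tag input is made up to show why a single pass is sometimes not enough: removing one match can create a new one.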

Images