Rules File Format
Note: I'm really bad at regular expressions, so the following examples may not be the best way to do things. Definitely don't use these examples to learn regular expressions.
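Each rule is a single line with between 1 and 3 fields, separated by tabs: a Match expression, an optional Replace expression, and an optional processing instruction.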
Lines starting with # are ignored.
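As a rough illustration of the line format, here is a minimal Python sketch of a parser. It is not the tool's actual code (that is written in C#). The names `Rule` and `parse_rule_line` are made up for this example, the field order (match, replace, instruction) is inferred from the examples on this page, and skipping blank lines is an assumption.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    match: str                  # regular expression to search for
    replace: str                # replacement text; empty means "delete the match"
    instruction: Optional[str]  # e.g. "REPEAT", or None when the field is absent


def parse_rule_line(line: str) -> Optional[Rule]:
    """Parse one line of a rules file; return None for comments and blank lines."""
    line = line.rstrip("\r\n")
    # Lines starting with '#' are ignored (blank lines too, by assumption).
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) > 3:
        raise ValueError(f"expected 1 to 3 tab-separated fields, got {len(fields)}")
    match = fields[0]
    replace = fields[1] if len(fields) > 1 else ""
    instruction = fields[2] if len(fields) > 2 else None
    return Rule(match, replace, instruction)
```

A line with only one field comes back with an empty `replace`, which is how a delete-only rule would fall out of this representation.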
Rules that just delete something
The following, which doesn't have anything except the Match expression, removes all HTML table elements:
This "Match" expression is a regular expression as used by .NET.
Rules which replace something
A replacement rule has the match, followed by a tab, followed by a replacement expression.
Here, we want to change hyperlinks to simple spans:
<A class="gloss link".*?>(.*?)</A>	<span class="gloss">$1</span>
Notice that .NET regular expressions use $1 to refer to the text captured by the first parenthesized group, $2 for the second, etc.
This "Replace" expression is whatever is allowed by .NET, with the following changes:
- \n is replaced by a newline
- \s is replaced by a space
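To make the Replace behaviour concrete, here is a small Python approximation. It is only a sketch: the real tool uses .NET's regex engine, and Python's `re` module does not understand `$1` on its own, so the hypothetical helper `expand_replacement` expands `$1`, `$2`, ... by hand along with the two extras listed above. Other .NET substitutions (named groups, `$&`, and so on) are not handled here.

```python
import re

# Tokens recognised in a Replace expression: $<number>, backslash-n, backslash-s.
_TOKEN = re.compile(r"\$(\d+)|\\n|\\s")


def expand_replacement(template: str, m: re.Match) -> str:
    """Expand one Replace expression for a single match."""
    def token(t: re.Match) -> str:
        if t.group(1) is not None:
            # $1, $2, ...: text of that group ("" if it did not take part)
            return m.group(int(t.group(1))) or ""
        # The two additions: \n becomes a newline, \s becomes a space.
        return "\n" if t.group(0) == r"\n" else " "
    return _TOKEN.sub(token, template)


def apply_rule(text: str, match_expr: str, replace_expr: str = "") -> str:
    """Replace every match; an empty Replace expression deletes the match."""
    return re.sub(match_expr, lambda m: expand_replacement(replace_expr, m), text)


html = '<A class="gloss link" href="x">word</A>'
print(apply_rule(html,
                 r'<A class="gloss link".*?>(.*?)</A>',
                 r'<span class="gloss">$1</span>'))
# prints: <span class="gloss">word</span>
```

The template is scanned in a single pass so that captured text containing a literal backslash is never re-interpreted as `\n` or `\s`.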
Special Match values
By adding to the C# code, you can manipulate a match however you want. I've only added one special value so far, "SentenceCap", which capitalizes sentences by looking at punctuation marks.
#Do some c# processing on everything found between elements
>[^<]+</	SentenceCap
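One way to picture a special value is as a named function that receives each match and returns its replacement. The Python sketch below is a guess at the idea, not the actual C# code: `SPECIAL_VALUES`, `apply_rule` and the behaviour of `sentence_cap` are all assumptions. In particular, this page only says that SentenceCap looks at punctuation marks, so the exact rule used here (capitalize the first letter, and any letter following `.`, `!` or `?` plus whitespace) is invented for illustration.

```python
import re


def sentence_cap(text: str) -> str:
    """Hypothetical SentenceCap: capitalize the first letter of the text and
    the first letter after sentence-ending punctuation."""
    return re.sub(r"(^[^A-Za-z]*|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)


# Special values mapped to the function that handles them (assumed design).
SPECIAL_VALUES = {"SentenceCap": sentence_cap}


def apply_rule(text: str, match_expr: str, replace_expr: str) -> str:
    handler = SPECIAL_VALUES.get(replace_expr)
    if handler is not None:
        # Hand each whole match to the named function.
        return re.sub(match_expr, lambda m: handler(m.group(0)), text)
    # Otherwise treat the field as literal text (no $1 handling in this sketch).
    return re.sub(match_expr, lambda m: replace_expr, text)


print(apply_rule("<p>hello world. goodbye.</p>", r">[^<]+</", "SentenceCap"))
# prints: <p>Hello world. Goodbye.</p>
```

Because the match `>[^<]+</` includes the surrounding `>` and `</`, the sketch skips leading non-letters before capitalizing.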
A processing instruction tells 2p to do something special with this line. There's currently only one: REPEAT is used to say "keep doing this replacement until no more changes are possible". For example, the following removes all the blank lines in the file:
#remove empty lines
\n\r(\n\r)+	\n	REPEAT
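The idea behind REPEAT can be sketched as a loop that re-applies the replacement until the text stops changing. This is a Python illustration under assumptions, not the tool's implementation: `apply_repeat` is a made-up name, the replacement is treated as literal text, and the `max_passes` guard is an addition so that a rule which keeps re-creating its own match cannot loop forever.

```python
import re


def apply_repeat(text: str, match_expr: str, replacement: str,
                 max_passes: int = 1000) -> str:
    """Apply one replacement over and over until no more changes are possible."""
    pattern = re.compile(match_expr)
    for _ in range(max_passes):
        new_text = pattern.sub(lambda m: replacement, text)
        if new_text == text:
            return text  # nothing changed on this pass, so we are done
        text = new_text
    raise RuntimeError("rule did not settle; it may re-create its own match")


# A single pass over four spaces leaves two behind; repeating finishes the job.
print(repr(re.sub("  ", " ", "a    b")))        # prints: 'a  b'
print(repr(apply_repeat("a    b", "  ", " ")))  # prints: 'a b'
```

The double-space example is used here instead of the line-ending rule because it shows clearly why one pass is sometimes not enough.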