Sentence Breaking
The WorldServer sentence breaker engine is rule-based. It uses a set of rules, each defined by a regular expression, to decide where to break text into sentences, and a set of rule exceptions that block a break where a rule would otherwise apply. The engine uses the Perl 5 regular expression classes from the Jakarta ORO package. For more information, consult the documentation for the Perl5Util class, available at http://jakarta.apache.org/oro/index.html.
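The interplay between break rules and exceptions can be illustrated with a minimal sketch. It is not the WorldServer implementation: it uses java.util.regex as a stand-in for the Jakarta ORO classes, and the class name, the single break rule, and the abbreviation exception are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceBreakSketch {
    // Hypothetical break rule: sentence-ending punctuation followed by whitespace.
    private static final Pattern BREAK_RULE = Pattern.compile("[.!?]\\s+");

    // Hypothetical exception: a known abbreviation directly before the candidate break.
    private static final Pattern EXCEPTION =
            Pattern.compile("\\b(?:Mr|Mrs|Dr|etc)\\.\\s+$");

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = BREAK_RULE.matcher(text);
        int start = 0;
        while (m.find()) {
            // Text from the last accepted break up to and including this candidate break.
            String candidate = text.substring(start, m.end());
            if (EXCEPTION.matcher(candidate).find()) {
                continue; // the exception blocks this break; keep scanning
            }
            sentences.add(candidate.trim());
            start = m.end();
        }
        // Whatever follows the last accepted break forms the final sentence.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }
}
```

In this sketch, the text "Dr. Smith arrived. He sat down." is split after "arrived." but not after "Dr.", because the exception matches at the first candidate break.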
The rules are stored in a standard properties file named SentenceBreaker.properties, located in the WEB-INF/classes/config directory inside the WAR file. WorldServer maintains a separate instance of the engine for each language. SentenceBreaker.properties contains the base definitions of the sentence breaking rules. Language-specific files can add new rules, override parts of the base rules, or replace the complete rule set. For example, SentenceBreaker_fr.properties may contain additional rules and exceptions that apply only to French.
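One way to picture the relationship between the base file and a language-specific file is the defaults mechanism of java.util.Properties. The sketch below assumes that layering model purely for illustration; how WorldServer actually merges the files is not described here, and the class name and the property keys used in the usage note are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class LayeredRulesSketch {
    // Returns a view in which language-specific entries take precedence
    // and any key they do not define falls back to the base entries.
    public static Properties layer(String baseText, String languageText) {
        try {
            Properties base = new Properties();
            base.load(new StringReader(baseText));

            // Passing 'base' to the constructor makes it the fallback (defaults) table.
            Properties merged = new Properties(base);
            merged.load(new StringReader(languageText));
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a base text that defines break.rule and exception.list, and a French text that redefines exception.list and adds extra.rule, getProperty returns the base value for break.rule, the French value for exception.list, and the new value for extra.rule.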
In addition to the rules themselves, the properties file contains macros: named fragments of regular expressions that recur across rules and can be referenced from them.
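Macro substitution of this kind can be sketched as a simple text expansion performed before a rule is compiled. The ${NAME} reference syntax, the class name, and the macro names below are assumptions made for illustration; the actual notation used in SentenceBreaker.properties may differ.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MacroExpansionSketch {
    // Hypothetical reference syntax: ${NAME}
    private static final Pattern MACRO_REF = Pattern.compile("\\$\\{(\\w+)\\}");

    // Replaces every macro reference in 'rule' with the macro's regular expression text.
    public static String expand(String rule, Map<String, String> macros) {
        Matcher m = MACRO_REF.matcher(rule);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = macros.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("Undefined macro: " + m.group(1));
            }
            // quoteReplacement keeps backslashes and '$' in the macro body literal.
            m.appendReplacement(out, Matcher.quoteReplacement(body));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, with a macro END defined as [.!?] and a macro SPACE defined as \s+, the rule text ${END}${SPACE} expands to [.!?]\s+. This sketch performs a single pass, so macros that reference other macros are not expanded.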