org.apache.lucene.analysis.miscellaneous
Class WordDelimiterFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter
- All Implemented Interfaces:
- Closeable
public final class WordDelimiterFilter
- extends org.apache.lucene.analysis.TokenFilter
Splits words into subwords and performs optional transformations on subword groups.
Words are split into subwords with the following rules:
- split on intra-word delimiters (by default, all non-alphanumeric characters).
- "Wi-Fi" -> "Wi", "Fi"
- split on case transitions
- "PowerShot" -> "Power", "Shot"
- split on letter-number transitions
- "SD500" -> "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored
- "//hello---there, 'dude'" -> "hello", "there", "dude"
- trailing "'s" are removed for each subword
- "O'Neil's" -> "O", "Neil"
- Note: this step isn't performed in a separate filter because of possible subword combinations.
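The split rules above can be sketched in plain Java. This is an illustrative reimplementation of the documented behavior, not Lucene's actual code; the class and method names are hypothetical, and only a token-final "'s" is stemmed in this sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class WordDelimiterSketch {

    /** Splits a token by the documented rules: intra-word delimiters
     *  (non-alphanumeric characters), case transitions, and
     *  letter-number transitions; a trailing "'s" is stemmed first. */
    public static List<String> split(String token) {
        // stemEnglishPossessive: drop a trailing "'s"
        if (token.endsWith("'s")) {
            token = token.substring(0, token.length() - 2);
        }
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char prev = 0;
        for (char c : token.toCharArray()) {
            boolean alnum = Character.isLetterOrDigit(c);
            boolean boundary =
                !alnum                                                       // intra-word delimiter
                || (Character.isLowerCase(prev) && Character.isUpperCase(c)) // case transition
                || (Character.isLetter(prev) && Character.isDigit(c))        // letter -> number
                || (Character.isDigit(prev) && Character.isLetter(c));       // number -> letter
            if (boundary && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            if (alnum) {
                current.append(c);
            }
            prev = c;
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }
}
```

Leading and trailing delimiters fall out naturally: non-alphanumeric characters are never appended, so "//hello---there, 'dude'" yields only "hello", "there", "dude".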
The combinations parameter affects how subwords are combined:
- combinations="0" causes no subword combinations.
- "PowerShot" -> 0:"Power", 1:"Shot" (0 and 1 are the token positions)
- combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run.
- "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"
- "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
- "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
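The combinations="1" rule can be sketched with a small helper (a hypothetical illustration of the documented behavior, assuming the token has already been split into subwords):

```java
import java.util.ArrayList;
import java.util.List;

public class CatenateSketch {

    /** Emits "position:token" pairs for a list of subwords: every subword
     *  at its own position, plus each maximum run of two or more
     *  non-numeric subwords catenated at the position of the run's last
     *  subword, as combinations="1" does. */
    public static List<String> combine(List<String> subwords) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int runLength = 0;
        for (int pos = 0; pos < subwords.size(); pos++) {
            String word = subwords.get(pos);
            boolean numeric = word.chars().allMatch(Character::isDigit);
            if (numeric) {
                // A number subword ends the current run of word subwords.
                if (runLength > 1) {
                    out.add((pos - 1) + ":" + run);
                }
                run.setLength(0);
                runLength = 0;
                out.add(pos + ":" + word);
            } else {
                out.add(pos + ":" + word);
                run.append(word);
                runLength++;
            }
        }
        if (runLength > 1) {
            out.add((subwords.size() - 1) + ":" + run);
        }
        return out;
    }
}
```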
One use for WordDelimiterFilter is to help match words with different subword delimiters.
For example, if the source text contained "wi-fi", one may want the queries "wifi", "WiFi", "wi-fi", and "wi+fi" to all match.
One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default)
in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word
delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).
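The index/query asymmetry described above can be illustrated with a toy simulation (not Lucene's actual matching; positions are ignored, and a lowercasing step is assumed to follow the filter in both analyzers):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MatchSketch {

    // Index-time analysis of "wi-fi" with combinations="1":
    // the subwords plus the catenated run, lowercased.
    static final Set<String> INDEXED_TERMS = Set.of("wi", "fi", "wifi");

    /** Query-time analysis with combinations="0": subwords only.
     *  Splitting is simplified to non-alphanumeric delimiters after
     *  lowercasing, so case transitions are erased rather than split on. */
    static List<String> queryTokens(String query) {
        List<String> tokens = new ArrayList<>();
        for (String t : query.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** The toy query matches when every query token is an indexed term. */
    static boolean matches(String query) {
        return INDEXED_TERMS.containsAll(queryTokens(query));
    }
}
```

Because index-time analysis also emitted the catenated form "wifi", each variant spelling of the query resolves to tokens that are all present in the index.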
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource:
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.TokenFilter:
input
Constructor Summary
WordDelimiterFilter(org.apache.lucene.analysis.TokenStream in,
byte[] charTypeTable,
int generateWordParts,
int generateNumberParts,
int catenateWords,
int catenateNumbers,
int catenateAll,
int splitOnCaseChange,
int preserveOriginal,
int splitOnNumerics,
int stemEnglishPossessive,
org.apache.lucene.analysis.CharArraySet protWords)
WordDelimiterFilter(org.apache.lucene.analysis.TokenStream in,
int generateWordParts,
int generateNumberParts,
int catenateWords,
int catenateNumbers,
int catenateAll,
int splitOnCaseChange,
int preserveOriginal,
int splitOnNumerics,
int stemEnglishPossessive,
org.apache.lucene.analysis.CharArraySet protWords)
Methods inherited from class org.apache.lucene.analysis.TokenFilter:
close, end
Methods inherited from class org.apache.lucene.util.AttributeSource:
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
LOWER
public static final int LOWER
- See Also:
- Constant Field Values
UPPER
public static final int UPPER
- See Also:
- Constant Field Values
DIGIT
public static final int DIGIT
- See Also:
- Constant Field Values
SUBWORD_DELIM
public static final int SUBWORD_DELIM
- See Also:
- Constant Field Values
ALPHA
public static final int ALPHA
- See Also:
- Constant Field Values
ALPHANUM
public static final int ALPHANUM
- See Also:
- Constant Field Values
WordDelimiterFilter
public WordDelimiterFilter(org.apache.lucene.analysis.TokenStream in,
byte[] charTypeTable,
int generateWordParts,
int generateNumberParts,
int catenateWords,
int catenateNumbers,
int catenateAll,
int splitOnCaseChange,
int preserveOriginal,
int splitOnNumerics,
int stemEnglishPossessive,
org.apache.lucene.analysis.CharArraySet protWords)
- Parameters:
- in - Token stream to be filtered.
- charTypeTable
- generateWordParts - If 1, causes parts of words to be generated: "PowerShot" => "Power" "Shot"
- generateNumberParts - If 1, causes number subwords to be generated: "500-42" => "500" "42"
- catenateWords - If 1, causes maximum runs of word parts to be catenated: "wi-fi" => "wifi"
- catenateNumbers - If 1, causes maximum runs of number parts to be catenated: "500-42" => "50042"
- catenateAll - If 1, causes all subword parts to be catenated: "wi-fi-4000" => "wifi4000"
- splitOnCaseChange - If 1, causes "PowerShot" to be two tokens ("Power-Shot" remains two parts regardless)
- preserveOriginal - If 1, includes original words in subwords: "500-42" => "500" "42" "500-42"
- splitOnNumerics - If 1, causes "j2se" to be three tokens: "j" "2" "se"
- stemEnglishPossessive - If 1, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
- protWords - If not null, the set of tokens to protect from being delimited
WordDelimiterFilter
public WordDelimiterFilter(org.apache.lucene.analysis.TokenStream in,
int generateWordParts,
int generateNumberParts,
int catenateWords,
int catenateNumbers,
int catenateAll,
int splitOnCaseChange,
int preserveOriginal,
int splitOnNumerics,
int stemEnglishPossessive,
org.apache.lucene.analysis.CharArraySet protWords)
- Parameters:
- in - Token stream to be filtered.
- generateWordParts - If 1, causes parts of words to be generated: "PowerShot", "Power-Shot" => "Power" "Shot"
- generateNumberParts - If 1, causes number subwords to be generated: "500-42" => "500" "42"
- catenateWords - If 1, causes maximum runs of word parts to be catenated: "wi-fi" => "wifi"
- catenateNumbers - If 1, causes maximum runs of number parts to be catenated: "500-42" => "50042"
- catenateAll - If 1, causes all subword parts to be catenated: "wi-fi-4000" => "wifi4000"
- splitOnCaseChange - If 1, causes "PowerShot" to be two tokens ("Power-Shot" remains two parts regardless)
- preserveOriginal - If 1, includes original words in subwords: "500-42" => "500" "42" "500-42"
- splitOnNumerics - If 1, causes "j2se" to be three tokens: "j" "2" "se"
- stemEnglishPossessive - If 1, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
- protWords - If not null, the set of tokens to protect from being delimited
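The interplay of the catenateAll and preserveOriginal flags can be sketched in isolation (a hypothetical helper, not a constructor call; splitting is simplified to non-alphanumeric delimiters, and the emission order shown is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class FlagSketch {

    /** Emits the tokens for one input token: the original first when
     *  preserveOriginal is 1, then the subword parts, then the fully
     *  catenated form when catenateAll is 1. */
    static List<String> emit(String token, int catenateAll, int preserveOriginal) {
        List<String> out = new ArrayList<>();
        if (preserveOriginal == 1) {
            out.add(token);
        }
        // Simplified split: non-alphanumeric delimiters only.
        List<String> parts = new ArrayList<>();
        for (String p : token.split("[^A-Za-z0-9]+")) {
            if (!p.isEmpty()) {
                parts.add(p);
            }
        }
        out.addAll(parts);
        if (catenateAll == 1) {
            out.add(String.join("", parts));
        }
        return out;
    }
}
```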
incrementToken
public boolean incrementToken()
throws IOException
- Specified by:
incrementToken
in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset()
throws IOException
- Overrides:
reset
in class org.apache.lucene.analysis.TokenFilter
- Throws:
IOException
Copyright © 2009-2012. All Rights Reserved.