org.apache.lucene.analysis.miscellaneous
Class WordDelimiterFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter
- All Implemented Interfaces:
- Closeable
public final class WordDelimiterFilter
- extends org.apache.lucene.analysis.TokenFilter
Splits words into subwords and performs optional transformations on subword groups.
Words are split into subwords with the following rules:
- split on intra-word delimiters (by default, all non-alphanumeric characters).
- "Wi-Fi" -> "Wi", "Fi"
- split on case transitions
- "PowerShot" -> "Power", "Shot"
- split on letter-number transitions
- "SD500" -> "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored
- "//hello---there, 'dude'" -> "hello", "there", "dude"
- trailing "'s" are removed for each subword
- "O'Neil's" -> "O", "Neil"
- Note: this step isn't performed in a separate filter because of possible subword combinations.
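The split rules above can be sketched in plain Java. This is an illustrative reimplementation of the documented behavior, not Lucene's actual code; the class and method names are hypothetical, and only a token-final "'s" is stemmed in this sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class WordDelimiterSketch {

    /** Splits a token by the documented rules: intra-word delimiters
     *  (non-alphanumeric characters), case transitions, and
     *  letter-number transitions; a trailing "'s" is stemmed first. */
    public static List<String> split(String token) {
        // stemEnglishPossessive: drop a trailing "'s"
        if (token.endsWith("'s")) {
            token = token.substring(0, token.length() - 2);
        }
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char prev = 0;
        for (char c : token.toCharArray()) {
            boolean alnum = Character.isLetterOrDigit(c);
            boolean boundary =
                !alnum                                                       // intra-word delimiter
                || (Character.isLowerCase(prev) && Character.isUpperCase(c)) // case transition
                || (Character.isLetter(prev) && Character.isDigit(c))        // letter -> number
                || (Character.isDigit(prev) && Character.isLetter(c));       // number -> letter
            if (boundary && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            if (alnum) {
                current.append(c);
            }
            prev = c;
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }
}
```

Leading and trailing delimiters fall out naturally: non-alphanumeric characters are never appended, so "//hello---there, 'dude'" yields only "hello", "there", "dude".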
The combinations parameter affects how subwords are combined:
- combinations="0" causes no subword combinations.
- "PowerShot" -> 0:"Power", 1:"Shot" (0 and 1 are the token positions)
- combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run.
- "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"
- "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
- "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
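The combinations="1" rule can be sketched with a small helper (a hypothetical illustration of the documented behavior, assuming the token has already been split into subwords):

```java
import java.util.ArrayList;
import java.util.List;

public class CatenateSketch {

    /** Emits "position:token" pairs for a list of subwords: every subword
     *  at its own position, plus each maximum run of two or more
     *  non-numeric subwords catenated at the position of the run's last
     *  subword, as combinations="1" does. */
    public static List<String> combine(List<String> subwords) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int runLength = 0;
        for (int pos = 0; pos < subwords.size(); pos++) {
            String word = subwords.get(pos);
            boolean numeric = word.chars().allMatch(Character::isDigit);
            if (numeric) {
                // A number subword ends the current run of word subwords.
                if (runLength > 1) {
                    out.add((pos - 1) + ":" + run);
                }
                run.setLength(0);
                runLength = 0;
                out.add(pos + ":" + word);
            } else {
                out.add(pos + ":" + word);
                run.append(word);
                runLength++;
            }
        }
        if (runLength > 1) {
            out.add((subwords.size() - 1) + ":" + run);
        }
        return out;
    }
}
```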
One use for WordDelimiterFilter is to help match words with different subword delimiters.
For example, if the source text contained "wi-fi", one may want the queries "wifi", "WiFi", "wi-fi", and "wi+fi" to all match.
One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default)
in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word
delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).
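The index/query asymmetry described above can be illustrated with a toy simulation (not Lucene's actual matching; positions are ignored, and a lowercasing step is assumed to follow the filter in both analyzers):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MatchSketch {

    // Index-time analysis of "wi-fi" with combinations="1":
    // the subwords plus the catenated run, lowercased.
    static final Set<String> INDEXED_TERMS = Set.of("wi", "fi", "wifi");

    /** Query-time analysis with combinations="0": subwords only.
     *  Splitting is simplified to non-alphanumeric delimiters after
     *  lowercasing, so case transitions are erased rather than split on. */
    static List<String> queryTokens(String query) {
        List<String> tokens = new ArrayList<>();
        for (String t : query.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** The toy query matches when every query token is an indexed term. */
    static boolean matches(String query) {
        return INDEXED_TERMS.containsAll(queryTokens(query));
    }
}
```

Because index-time analysis also emitted the catenated form "wifi", each variant spelling of the query resolves to tokens that are all present in the index.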
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource:
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.TokenFilter:
input
Constructor Summary
WordDelimiterFilter(org.apache.lucene.analysis.TokenStream in,
byte[] charTypeTable,
int generateWordParts,
int generateNumberParts,
int catenateWords,
int catenateNumbers,
int catenateAll,
int splitOnCaseChange,
int preserveOriginal,
int splitOnNumerics,
int stemEnglishPossessive,
org.apache.lucene.analysis.CharArraySet protWords)
WordDelimiterFilter(org.apache.lucene.analysis.TokenStream in,
int generateWordParts,
int generateNumberParts,
int catenateWords,
int catenateNumbers,
int catenateAll,
int splitOnCaseChange,
int preserveOriginal,
int splitOnNumerics,
int stemEnglishPossessive,
org.apache.lucene.analysis.CharArraySet protWords)
Methods inherited from class org.apache.lucene.analysis.TokenFilter:
close, end
Methods inherited from class org.apache.lucene.util.AttributeSource:
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
LOWER
public static final int LOWER
- See Also:
- Constant Field Values
UPPER
public static final int UPPER
- See Also:
- Constant Field Values
DIGIT
public static final int DIGIT
- See Also:
- Constant Field Values
SUBWORD_DELIM
public static final int SUBWORD_DELIM
- See Also:
- Constant Field Values
ALPHA
public static final int ALPHA
- See Also:
- Constant Field Values
ALPHANUM
public static final int ALPHANUM
- See Also:
- Constant Field Values
WordDelimiterFilter
public WordDelimiterFilter(org.apache.lucene.analysis.TokenStream in,
byte[] charTypeTable,
int generateWordParts,
int generateNumberParts,
int catenateWords,
int catenateNumbers,
int catenateAll,
int splitOnCaseChange,
int preserveOriginal,
int splitOnNumerics,
int stemEnglishPossessive,
org.apache.lucene.analysis.CharArraySet protWords)
- Parameters:
- in - Token stream to be filtered.
- charTypeTable
- generateWordParts - If 1, causes parts of words to be generated: "PowerShot" => "Power" "Shot"
- generateNumberParts - If 1, causes number subwords to be generated: "500-42" => "500" "42"
- catenateWords - If 1, causes maximum runs of word parts to be catenated: "wi-fi" => "wifi"
- catenateNumbers - If 1, causes maximum runs of number parts to be catenated: "500-42" => "50042"
- catenateAll - If 1, causes all subword parts to be catenated: "wi-fi-4000" => "wifi4000"
- splitOnCaseChange - If 1, causes "PowerShot" to be two tokens ("Power-Shot" remains two parts regardless)
- preserveOriginal - If 1, includes original words in subwords: "500-42" => "500" "42" "500-42"
- splitOnNumerics - If 1, causes "j2se" to be three tokens: "j" "2" "se"
- stemEnglishPossessive - If 1, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
- protWords - If not null, the set of tokens to protect from being delimited
WordDelimiterFilter
public WordDelimiterFilter(org.apache.lucene.analysis.TokenStream in,
int generateWordParts,
int generateNumberParts,
int catenateWords,
int catenateNumbers,
int catenateAll,
int splitOnCaseChange,
int preserveOriginal,
int splitOnNumerics,
int stemEnglishPossessive,
org.apache.lucene.analysis.CharArraySet protWords)
- Parameters:
- in - Token stream to be filtered.
- generateWordParts - If 1, causes parts of words to be generated: "PowerShot", "Power-Shot" => "Power" "Shot"
- generateNumberParts - If 1, causes number subwords to be generated: "500-42" => "500" "42"
- catenateWords - If 1, causes maximum runs of word parts to be catenated: "wi-fi" => "wifi"
- catenateNumbers - If 1, causes maximum runs of number parts to be catenated: "500-42" => "50042"
- catenateAll - If 1, causes all subword parts to be catenated: "wi-fi-4000" => "wifi4000"
- splitOnCaseChange - If 1, causes "PowerShot" to be two tokens ("Power-Shot" remains two parts regardless)
- preserveOriginal - If 1, includes original words in subwords: "500-42" => "500" "42" "500-42"
- splitOnNumerics - If 1, causes "j2se" to be three tokens: "j" "2" "se"
- stemEnglishPossessive - If 1, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
- protWords - If not null, the set of tokens to protect from being delimited
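The interplay of the catenateAll and preserveOriginal flags can be sketched in isolation (a hypothetical helper, not a constructor call; splitting is simplified to non-alphanumeric delimiters, and the emission order shown is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class FlagSketch {

    /** Emits the tokens for one input token: the original first when
     *  preserveOriginal is 1, then the subword parts, then the fully
     *  catenated form when catenateAll is 1. */
    static List<String> emit(String token, int catenateAll, int preserveOriginal) {
        List<String> out = new ArrayList<>();
        if (preserveOriginal == 1) {
            out.add(token);
        }
        // Simplified split: non-alphanumeric delimiters only.
        List<String> parts = new ArrayList<>();
        for (String p : token.split("[^A-Za-z0-9]+")) {
            if (!p.isEmpty()) {
                parts.add(p);
            }
        }
        out.addAll(parts);
        if (catenateAll == 1) {
            out.add(String.join("", parts));
        }
        return out;
    }
}
```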
incrementToken
public boolean incrementToken()
throws IOException
- Specified by:
incrementToken
in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset()
throws IOException
- Overrides:
reset
in class org.apache.lucene.analysis.TokenFilter
- Throws:
IOException
Copyright © 2009-2012. All Rights Reserved.