elasticsearch之使用正規表示式自定義分詞邏輯

一、Pattern Analyzer簡介

elasticsearch在索引和搜尋之前都需要對輸入的文字進行分詞，elasticsearch提供的pattern analyzer使得我們可以通過正規表示式的簡單方式來定義分隔符，從而達到自定義分詞的處理邏輯；

內建的的pattern analyzer的名字為pattern，其使用的模式是W+，即除了字母和數位之外的所有非單詞字元；

analyzers.add(new PreBuiltAnalyzerProviderFactory("pattern", CachingStrategy.ELASTICSEARCH,
            () -> new PatternAnalyzer(Regex.compile("\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/, null), true,
            CharArraySet.EMPTY_SET)));

作為全域性的pattern analyzer，我們可以直接使用

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "dog",
      "start_offset" : 45,
      "end_offset" : 48,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "s",
      "start_offset" : 49,
      "end_offset" : 50,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "word",
      "position" : 11
    }
  ]
}

二、自定義Pattern Analyzer

我們可以通過以下方式自定pattern analyzer，並設定分隔符為所有的空格符號；

PUT my_pattern_test_space_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_test_space_analyzer": {
          "type":      "pattern",
          "pattern":   "[\\p{Space}]", 
          "lowercase": true
        }
      }
    }
  }
}

我們使用自定義的pattern analyzer測試一下效果

POST my_pattern_test_space_analyzer/_analyze
{
  "analyzer": "my_pattern_test_space_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}


{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown-foxes",
      "start_offset" : 12,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "bone.",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "word",
      "position" : 9
    }
  ]
}

三、常用的Java中的正規表示式

elasticsearch的Pattern Analyzer使用的Java Regular Expressions，只有瞭解Java中一些常用的正規表示式才能更好的自定義pattern analyzer；

單字元定義

x	        The character x
\\	        The backslash character
\0n	        The character with octal value 0n (0 <= n <= 7)
\0nn	    The character with octal value 0nn (0 <= n <= 7)
\0mnn	    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh	    The character with hexadecimal value 0xhh
\uhhhh	    The character with hexadecimal value 0xhhhh
\x{h...h}	The character with hexadecimal value 0xh...h (Character.MIN_CODE_POINT  <= 0xh...h <=  Character.MAX_CODE_POINT)
\t	        The tab character ('\u0009')
\n	        The newline (line feed) character ('\u000A')
\r	        The carriage-return character ('\u000D')
\f	        The form-feed character ('\u000C')
\a	        The alert (bell) character ('\u0007')
\e	        The escape character ('\u001B')
\cx	        The control character corresponding to x

字元分組

[abc]	        a, b, or c (simple class)
[^abc]	        Any character except a, b, or c (negation)
[a-zA-Z]	    a through z or A through Z, inclusive (range)
[a-d[m-p]]	    a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]	d, e, or f (intersection)
[a-z&&[^bc]]	a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]	a through z, and not m through p: [a-lq-z](subtraction)

預定義的字元分組

.	Any character (may or may not match line terminators)
\d	A digit: [0-9]
\D	A non-digit: [^0-9]
\h	A horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]
\H	A non-horizontal whitespace character: [^\h]
\s	A whitespace character: [ \t\n\x0B\f\r]
\S	A non-whitespace character: [^\s]
\v	A vertical whitespace character: [\n\x0B\f\r\x85\u2028\u2029]
\V	A non-vertical whitespace character: [^\v]
\w	A word character: [a-zA-Z_0-9]
\W	A non-word character: [^\w]

POSIX字元分組

\p{Lower}	A lower-case alphabetic character: [a-z]
\p{Upper}	An upper-case alphabetic character:[A-Z]
\p{ASCII}	All ASCII:[\x00-\x7F]
\p{Alpha}	An alphabetic character:[\p{Lower}\p{Upper}]
\p{Digit}	A decimal digit: [0-9]
\p{Alnum}	An alphanumeric character:[\p{Alpha}\p{Digit}]
\p{Punct}	Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}	A visible character: [\p{Alnum}\p{Punct}]
\p{Print}	A printable character: [\p{Graph}\x20]
\p{Blank}	A space or a tab: [ \t]
\p{Cntrl}	A control character: [\x00-\x1F\x7F]
\p{XDigit}	A hexadecimal digit: [0-9a-fA-F]
\p{Space}	A whitespace character: [ \t\n\x0B\f\r]

以下我們通過正規表示式[\p{Punct}|\p{Space}]可以找出字串中的標點符號；

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("[\\p{Punct}|\\p{Space}]");
        Matcher matcher = p.matcher("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.");
        while(matcher.find()){
            System.out.println("find "+matcher.group()
                    +" position: "+matcher.start()+"-"+matcher.end());
        }
    }
}


find   position: 3-4
find   position: 5-6
find   position: 11-12
find - position: 17-18
find   position: 23-24
find   position: 30-31
find   position: 35-36
find   position: 39-40
find   position: 44-45
find ' position: 48-49
find   position: 50-51
find . position: 55-56

四、 Pattern Analyzer的實現

PatternAnalyzer會根據具體的設定資訊,使用PatternTokenizer、LowerCaseFilter、StopFilter來組合構建TokenStreamComponents

PatternAnalyzer.java 

protected TokenStreamComponents createComponents(String s) {
    final Tokenizer tokenizer = new PatternTokenizer(pattern, -1);
    TokenStream stream = tokenizer;
    if (lowercase) {
        stream = new LowerCaseFilter(stream);
    }
    if (stopWords != null) {
        stream = new StopFilter(stream, stopWords);
    }
    return new TokenStreamComponents(tokenizer, stream);
}

PatternTokenizer裡的incrementToken會對輸入的文字進行分詞處理；由於PatternAnalyzer裡初始化PatternTokenizer裡的incrementToken會對輸入的文字進行分詞處理的時候對group設定為-1，所以這裡走else分支，最終提取命中符號之間的單詞；

PatternTokenizer.java

  @Override
  public boolean incrementToken() {
    if (index >= str.length()) return false;
    clearAttributes();
    if (group >= 0) {
    
      // match a specific group
      while (matcher.find()) {
        index = matcher.start(group);
        final int endIndex = matcher.end(group);
        if (index == endIndex) continue;       
        termAtt.setEmpty().append(str, index, endIndex);
        offsetAtt.setOffset(correctOffset(index), correctOffset(endIndex));
        return true;
      }
      
      index = Integer.MAX_VALUE; // mark exhausted
      return false;
      
    } else {
    
      // String.split() functionality
      while (matcher.find()) {
        if (matcher.start() - index > 0) {
          // found a non-zero-length token
          termAtt.setEmpty().append(str, index, matcher.start());
          offsetAtt.setOffset(correctOffset(index), correctOffset(matcher.start()));
          index = matcher.end();
          return true;
        }
        
        index = matcher.end();
      }
      
      if (str.length() - index == 0) {
        index = Integer.MAX_VALUE; // mark exhausted
        return false;
      }
      
      termAtt.setEmpty().append(str, index, str.length());
      offsetAtt.setOffset(correctOffset(index), correctOffset(str.length()));
      index = Integer.MAX_VALUE; // mark exhausted
      return true;
    }
  }