Tokenizers in Elasticsearch
Introduction
Tokenizers are a core component of the text analysis process in Elasticsearch. They are used to break down text into smaller pieces, called tokens, which can then be indexed and searched. Tokenizers can be customized and combined in various ways to suit different use cases, from simple whitespace tokenization to complex pattern-based tokenization.
Basic Concepts
Before diving into the different types of tokenizers, it's important to understand some basic concepts:
- Token: A single unit of text, such as a word or a number.
- Tokenizer: A component that divides text into tokens.
- Analyzer: A combination of character filters, a tokenizer, and token filters that together process text at index time and at search time, as sketched below.
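As a quick illustration of how these pieces fit together, the _analyze API can combine a tokenizer with token filters in a single request. This is only a minimal sketch, pairing the built-in standard tokenizer (described in the next section) with the built-in lowercase token filter:
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Elasticsearch is a powerful search engine."
}
The response contains the same tokens the standard tokenizer would produce on its own, but lowercased by the filter.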
Types of Tokenizers
Elasticsearch provides a variety of tokenizers. Here are some of the most commonly used ones:
1. Standard Tokenizer
The standard tokenizer divides text into tokens based on word boundaries, which are defined by Unicode text segmentation rules. It's suitable for most languages.
Example:
POST _analyze
{
  "tokenizer": "standard",
  "text": "Elasticsearch is a powerful search engine."
}
{ "tokens": [ {"token": "Elasticsearch", "start_offset": 0, "end_offset": 13, "type": "", "position": 0}, {"token": "is", "start_offset": 14, "end_offset": 16, "type": " ", "position": 1}, {"token": "a", "start_offset": 17, "end_offset": 18, "type": " ", "position": 2}, {"token": "powerful", "start_offset": 19, "end_offset": 27, "type": " ", "position": 3}, {"token": "search", "start_offset": 28, "end_offset": 34, "type": " ", "position": 4}, {"token": "engine", "start_offset": 35, "end_offset": 41, "type": " ", "position": 5} ] }
2. Whitespace Tokenizer
This tokenizer divides text based on whitespace characters (spaces, tabs, newlines, etc.). Unlike the standard tokenizer, it does not strip punctuation, so punctuation stays attached to the neighboring token. It's useful for languages where tokens are clearly separated by spaces.
Example:
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Elasticsearch is a powerful search engine."
}
{ "tokens": [ {"token": "Elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0}, {"token": "is", "start_offset": 14, "end_offset": 16, "type": "word", "position": 1}, {"token": "a", "start_offset": 17, "end_offset": 18, "type": "word", "position": 2}, {"token": "powerful", "start_offset": 19, "end_offset": 27, "type": "word", "position": 3}, {"token": "search", "start_offset": 28, "end_offset": 34, "type": "word", "position": 4}, {"token": "engine", "start_offset": 35, "end_offset": 41, "type": "word", "position": 5} ] }
3. Keyword Tokenizer
The keyword tokenizer treats the entire input as a single token. This is useful for structured data like email addresses or URLs.
Example:
POST _analyze
{
  "tokenizer": "keyword",
  "text": "user@example.com"
}
{ "tokens": [ {"token": "user@example.com", "start_offset": 0, "end_offset": 16, "type": "word", "position": 0} ] }
4. Pattern Tokenizer
This tokenizer uses a regular expression to split the text; the default pattern is \W+ (any run of non-word characters). It's highly customizable and can handle more complex tokenization requirements. To use a non-default pattern with the _analyze API, define the tokenizer inline, as in the example below.
Example:
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","
  },
  "text": "Elasticsearch,Logstash,Kibana"
}
{ "tokens": [ {"token": "Elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0}, {"token": "Logstash", "start_offset": 14, "end_offset": 22, "type": "word", "position": 1}, {"token": "Kibana", "start_offset": 23, "end_offset": 29, "type": "word", "position": 2} ] }
Custom Tokenizers
In addition to using the built-in tokenizers as-is, Elasticsearch allows you to define custom tokenizers in the index settings by configuring the parameters of a tokenizer type. Custom tokenizers can be tailored to handle specific text processing requirements.
Example:
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_custom_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      }
    }
  }
}
In this example, a custom tokenizer named my_custom_tokenizer is created from the pattern tokenizer type; the regular expression \W+ splits the text on any run of non-word characters.
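To actually use such a tokenizer, reference it from a custom analyzer and apply that analyzer to a field in the mapping. The sketch below extends the previous settings; the analyzer name my_custom_analyzer, the lowercase filter, and the title field are illustrative choices, not requirements:
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_custom_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "my_custom_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
Text indexed into the title field is then split on non-word characters and lowercased, and the same analysis is applied to query text at search time.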
Conclusion
Tokenizers are a crucial part of text analysis in Elasticsearch. Understanding and utilizing the different types of tokenizers can greatly enhance your search engine's ability to process and analyze text. Whether you are using a built-in tokenizer or defining a custom one, Elasticsearch provides the flexibility to meet your specific needs.