Elasticsearch Analyzers
Introduction
In Elasticsearch, an analyzer breaks text down into the tokens that are stored in the inverted index and matched at search time. Analyzers play a crucial role in full-text search by converting raw input text into a structured form that can be queried efficiently.
Core Components of Analyzers
An analyzer in Elasticsearch is composed of three main components:
- Character Filters: These preprocess the text before it is tokenized. They can remove or replace certain characters.
- Tokenizer: This splits the input text into tokens or terms.
- Token Filters: These perform additional processing on the tokens generated by the tokenizer, such as lowercasing, removing stop words, or stemming.
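These components can also be combined on the fly, without defining a named analyzer, by passing them directly to the _analyze API (covered in more detail below). The following sketch strips HTML tags, tokenizes with the standard tokenizer, and lowercases the result; the sample text is purely illustrative:

POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>The QUICK Brown Fox</p>"
}

This produces the tokens the, quick, brown, and fox.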
Built-in Analyzers
Elasticsearch offers several built-in analyzers that cover common use cases. Some of the most commonly used built-in analyzers include:
- Standard Analyzer: The default analyzer; it splits text on Unicode word boundaries and lowercases the resulting tokens.
- Simple Analyzer: Tokenizes text by non-letter characters and lowercases tokens.
- Whitespace Analyzer: Tokenizes text based on whitespace characters.
- Stop Analyzer: Similar to the Simple Analyzer but also removes stop words.
- Keyword Analyzer: Treats the entire input as a single token without any tokenization.
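A quick way to see how these analyzers differ is to run the same sample text through the _analyze API (described below). For instance, the whitespace analyzer keeps punctuation attached to tokens and preserves case, while the simple analyzer splits on any non-letter character and lowercases:

POST /_analyze
{
  "analyzer": "whitespace",
  "text": "It's a TEST!"
}

This returns the tokens It's, a, and TEST! unchanged; with "analyzer": "simple", the same text yields it, s, a, and test.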
Custom Analyzers
In addition to built-in analyzers, Elasticsearch allows you to create custom analyzers tailored to specific requirements. A custom analyzer can be defined by specifying its character filters, tokenizer, and token filters.
Example: Custom Analyzer
Below is an example of how to define a custom analyzer in Elasticsearch:
PUT /my_index { "settings": { "analysis": { "analyzer": { "my_custom_analyzer": { "type": "custom", "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase", "stop", "porter_stem"] } } } } }
In this example, the custom analyzer named my_custom_analyzer is defined with:
- An HTML Strip character filter to remove HTML tags.
- A standard tokenizer to split text into tokens.
- Three token filters: lowercase (lowercases each token), stop (removes common stop words such as "the"), and porter_stem (reduces tokens to their Porter stems, e.g. jumps to jump).
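To observe the html_strip character filter in isolation, you can pair it with the keyword tokenizer (which emits the entire input as a single token) in an _analyze request; the sample text here is illustrative:

POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "keyword",
  "text": "<b>Hello</b> world"
}

The response contains a single token, Hello world, with the markup removed.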
Testing Analyzers
After defining analyzers, it's essential to test them to ensure they behave as expected. Elasticsearch provides the _analyze API for this purpose.
Example: Testing an Analyzer
The following request tests the my_custom_analyzer defined earlier:
POST /my_index/_analyze { "analyzer": "my_custom_analyzer", "text": "The quick brown fox jumps over the lazy dog." }
The response will show the tokens generated by the analyzer:
{ "tokens": [ { "token": "quick", "start_offset": 4, "end_offset": 9, "type": "", "position": 1 }, { "token": "brown", "start_offset": 10, "end_offset": 15, "type": " ", "position": 2 }, { "token": "fox", "start_offset": 16, "end_offset": 19, "type": " ", "position": 3 }, { "token": "jump", "start_offset": 20, "end_offset": 25, "type": " ", "position": 4 }, { "token": "over", "start_offset": 26, "end_offset": 30, "type": " ", "position": 5 }, { "token": "lazi", "start_offset": 35, "end_offset": 39, "type": " ", "position": 6 }, { "token": "dog", "start_offset": 40, "end_offset": 43, "type": " ", "position": 7 } ] }
Using Analyzers in Mappings
Analyzers are typically specified in the mappings of an index to define how fields should be analyzed. You can set a specific analyzer for a field when creating or updating an index mapping.
Example: Setting an Analyzer in Mappings
Below is an example of setting my_custom_analyzer as the analyzer for a field in the index mapping:
PUT /my_index { "mappings": { "properties": { "content": { "type": "text", "analyzer": "my_custom_analyzer" } } } }
Conclusion
Analyzers are a powerful feature in Elasticsearch that enable effective text analysis and full-text search. By understanding and utilizing both built-in and custom analyzers, you can enhance the search capabilities of your Elasticsearch applications.