Token Filters in Elasticsearch

Introduction

Token filters are a crucial part of the text analysis process in Elasticsearch. They are applied to the tokens generated by a tokenizer and can modify, remove, or add tokens. Token filters can perform a variety of tasks such as converting tokens to lowercase, removing stop words, stemming tokens, and much more.

Common Token Filters

Lowercase Token Filter

The lowercase token filter converts all tokens to lowercase. This is useful for case-insensitive searching.

Example:

{
  "settings": {
    "analysis": {
      "filter": {
        "lowercase_filter": {
          "type": "lowercase"
        }
      }
    }
  }
}
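Because lowercase is also available as a built-in filter, you can try it directly with the _analyze API. A minimal sketch (the sample text is arbitrary):

POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Quick BROWN Fox"
}

The response tokens come back as quick, brown, and fox.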

Stop Token Filter

The stop token filter removes common stop words from the token stream. Stop words are high-frequency words (such as "the", "and", or "of") that often add little value to search relevance.

Example:

{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
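The _english_ value selects the predefined English stop word list. You can also supply your own list instead; the filter name and word list below are just an illustration:

{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stop": {
          "type": "stop",
          "stopwords": ["the", "a", "an", "of"]
        }
      }
    }
  }
}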

Stemmer Token Filter

The stemmer token filter reduces tokens to their root form, so that related terms with different endings (for example, "foxes" and "fox") match one another.

Example:

{
  "settings": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "name": "light_english"
        }
      }
    }
  }
}
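The name parameter selects the stemming algorithm; light_english is a lighter alternative to the default english stemmer. A quick way to see its effect is the _analyze API with an inline filter definition (the sample text is arbitrary):

POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "name": "light_english" }
  ],
  "text": "Foxes jumping"
}

You should see stemmed forms such as fox in the output.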

Custom Token Filters

In addition to built-in token filters, Elasticsearch allows you to create custom token filters to suit specific needs. Below is an example of how to create a custom synonym token filter.

Example:

{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "car, automobile",
            "quick, fast"
          ]
        }
      }
    }
  }
}
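For larger synonym sets, the filter can also load rules from a file on each node via synonyms_path. The path below is a hypothetical file, resolved relative to the Elasticsearch config directory:

{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  }
}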

Applying Token Filters in Analyzers

Token filters are applied within analyzers in Elasticsearch. A custom analyzer combines a tokenizer with zero or more token filters, applied in the order they are listed. Below is an example of an analyzer that uses the built-in lowercase filter together with the english_stop filter, which is defined in the same settings block so the example is self-contained.

Example:

{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "custom_english_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      }
    }
  }
}
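To put the analyzer to work, reference it from a field mapping when creating the index. A minimal sketch, assuming a hypothetical index my_index with a title field (the analysis settings from above are repeated so the request is self-contained):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "custom_english_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_english_analyzer"
      }
    }
  }
}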

Testing Token Filters

You can use the _analyze API to test how token filters affect your text. This can be helpful for debugging and fine-tuning your analyzers. Note that filter names defined in index settings (such as english_stop) are only available when you call _analyze on that index, so the standalone request below defines the stop filter inline.

Example:

POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stop", "stopwords": "_english_" }
  ],
  "text": "The Quick Brown Foxes"
}

Output:

{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 16,
      "end_offset": 21,
      "type": "",
      "position": 3
    }
  ]
}
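If the analyzer is already defined on an index (such as the hypothetical my_index above), you can test it by name against that index instead of defining everything inline:

POST /my_index/_analyze
{
  "analyzer": "custom_english_analyzer",
  "text": "The Quick Brown Foxes"
}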

Conclusion

Token filters are a powerful Elasticsearch feature that lets you manipulate the tokens generated by a tokenizer in various ways. By understanding and combining the different token filters, you can significantly improve the relevance and accuracy of your search application.