Char Filters in Elasticsearch

Introduction

In Elasticsearch, character filters are text analysis components that preprocess the raw character stream before it reaches the tokenizer. A character filter can add, remove, or change characters, for example to strip markup, normalize text, or replace special characters. An analyzer can have zero or more character filters, and they are applied in the order in which they are listed.
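To make the ordering concrete, here is a minimal Python sketch of the idea. It is not Elasticsearch code: the function names (analyze, replace_ampersand, simple_tokenizer) are hypothetical, and the tokenizer is a crude stand-in for the standard tokenizer. It only shows that the tokenizer sees the text after every char filter has run.

```python
import re


def analyze(text, char_filters, tokenizer):
    """Run char filters in order, then hand the result to the tokenizer."""
    for char_filter in char_filters:
        text = char_filter(text)
    # The tokenizer never sees the original text, only the filtered version.
    return tokenizer(text)


def replace_ampersand(text):
    # Hypothetical char filter: spell out "&" so it survives tokenization.
    return text.replace("&", " and ")


def simple_tokenizer(text):
    # Crude stand-in for a real tokenizer: split on non-word characters.
    return re.findall(r"\w+", text)


print(analyze("salt&pepper", [replace_ampersand], simple_tokenizer))
# prints ['salt', 'and', 'pepper']
```

Without the char filter, the same tokenizer would return only salt and pepper, because the ampersand is discarded during tokenization.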

Types of Char Filters

Elasticsearch provides three built-in character filters:

  • HTML Strip Char Filter
  • Mapping Char Filter
  • Pattern Replace Char Filter

HTML Strip Char Filter

The HTML Strip Char Filter (html_strip) strips HTML elements from the input text and decodes HTML entities, such as &amp;, into the characters they represent.

Example:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_strip": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard"
        }
      }
    }
  }
}

Input: <p>Hello <strong>World</strong></p>

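Output: Hello World (the <p> and </p> tags are replaced with line breaks, which the standard tokenizer ignores, so the resulting tokens are Hello and World)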
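The effect of stripping tags and decoding entities can be approximated in a few lines of Python. This is a rough sketch, not the actual html_strip implementation: the strip_html function is hypothetical, it removes every tag outright instead of inserting line breaks for block-level tags, and it does not handle comments or script content.

```python
import html
import re

# Matches anything that looks like an opening or closing tag.
TAG = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove tags, then decode entities such as &amp; into characters."""
    without_tags = TAG.sub("", text)
    return html.unescape(without_tags)


print(strip_html("<p>Hello <strong>World</strong></p>"))  # prints Hello World
print(strip_html("<b>Fish &amp; Chips</b>"))               # prints Fish & Chips
```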

Mapping Char Filter

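The Mapping Char Filter (mapping) replaces specific characters or character sequences with the replacements you define, written as "key => value" entries. Matching is greedy: when several keys match at the same position, the longest one wins. A replacement may be an empty string, which deletes the key from the text.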

Example:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": ["a => 1", "b => 2"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_mapping"],
          "tokenizer": "standard"
        }
      }
    }
  }
}

Input: abc

Output: 12c
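The following Python sketch imitates how "key => value" rules are applied, assuming the longest key that matches at a given position wins. The functions parse_mappings and apply_mapping are hypothetical names for illustration, and whitespace handling around keys and values is simplified compared with Elasticsearch.

```python
def parse_mappings(rules):
    """Turn rules like "a => 1" into a dict of key -> replacement."""
    table = {}
    for rule in rules:
        key, _, value = rule.partition("=>")
        key = key.strip()
        if key:  # ignore rules with an empty key
            table[key] = value.strip()
    return table


def apply_mapping(text, rules):
    table = parse_mappings(rules)
    # Try longer keys first so the longest match at each position wins.
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(apply_mapping("abc", ["a => 1", "b => 2"]))   # prints 12c
print(apply_mapping("abc", ["a => 1", "ab => X"]))  # prints Xc
```

The second call shows the greedy behavior: both a and ab match at the first position, and the longer key ab is the one applied.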

Pattern Replace Char Filter

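The Pattern Replace Char Filter (pattern_replace) uses a Java regular expression to match text and replaces each match with the specified replacement string. The replacement can reference capture groups as $1 to $9. A poorly written regular expression can run very slowly, so test patterns before applying them to large inputs.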

Example:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "([0-9]+)",
          "replacement": "NUMBER"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_pattern"],
          "tokenizer": "standard"
        }
      }
    }
  }
}

Input: My phone number is 12345

Output: My phone number is NUMBER
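The same substitution can be tried out in Python before putting a pattern into index settings. This is only an approximation: Elasticsearch uses Java regular expressions, whose syntax differs from Python's in some details, and the helper names below (to_python_replacement, pattern_replace) are hypothetical. The helper converts Java-style $1 group references into Python's form.

```python
import re


def to_python_replacement(replacement):
    """Convert Java-style $1..$9 group references to Python's \\g<1> form."""
    return re.sub(r"\$([1-9])", r"\\g<\1>", replacement)


def pattern_replace(text, pattern, replacement):
    """Replace every match of pattern in text, as a rough emulation."""
    return re.sub(pattern, to_python_replacement(replacement), text)


print(pattern_replace("My phone number is 12345", "([0-9]+)", "NUMBER"))
# prints My phone number is NUMBER

# Capture group reference: swap each dash between digits for an underscore.
print(pattern_replace("My credit card is 123-456-789", r"(\d+)-(?=\d)", "$1_"))
# prints My credit card is 123_456_789
```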

Conclusion

Char filters in Elasticsearch preprocess and normalize text before tokenization. Because they run before the tokenizer, they are the right place to strip markup, replace specific characters or sequences, and rewrite patterns that would otherwise be tokenized in an unwanted way.