Char Filters in Elasticsearch
Introduction
In Elasticsearch, character filters are text analysis components that preprocess text before it reaches the tokenizer. A character filter receives the original text as a stream of characters and can add, remove, or change characters, for example by stripping markup, replacing specific character sequences, or normalizing text. An analyzer can have zero or more character filters, which are applied in the order listed.
Types of Char Filters
Elasticsearch provides three built-in character filters:
- HTML Strip Char Filter
- Mapping Char Filter
- Pattern Replace Char Filter
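Any of these filters can be tried without creating an index by sending text to the _analyze API. The request below is a minimal sketch, not part of the original examples: it chains html_strip with an inline mapping filter (the "& => and" rule and the sample text are assumptions made for illustration) and uses the keyword tokenizer so that the whole filtered string comes back as a single token.

```console
# Hypothetical sample text and mapping rule, for illustration only
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip",
    { "type": "mapping", "mappings": ["& => and"] }
  ],
  "text": "<b>Tom &amp; Jerry</b>"
}
```

Character filters run in the order listed, so html_strip should first remove the <b> tags and decode &amp; to &, after which the mapping filter rewrites & to and. The expected single token is Tom and Jerry.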
HTML Strip Char Filter
The HTML Strip Char Filter strips HTML elements from the input text and replaces HTML entities with their decoded values (for example, &amp; becomes &).
Example:
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_strip": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
Input: <p>Hello <strong>World</strong></p>
Output: Hello World
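To check the analyzer defined above, the _analyze API can be pointed at the index. This is a sketch that assumes my_index was created with the request shown earlier. The second request is an extra illustration that swaps in the keyword tokenizer to expose the filtered text itself.

```console
# Assumes my_index exists with my_analyzer as defined above
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>Hello <strong>World</strong></p>"
}

# Ad hoc variant: the keyword tokenizer returns the filtered text as one token
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>Hello <strong>World</strong></p>"
}
```

The first request should return the tokens Hello and World. In the second, expect line breaks where the block-level <p> tags were, since html_strip replaces block-level elements with a newline rather than deleting them outright. The standard tokenizer simply discards that whitespace.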
Mapping Char Filter
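The Mapping Char Filter takes a list of key => value rules and replaces every occurrence of a key (a single character or a longer sequence) with its value. Matching is greedy: when several keys match at the same position, the longest one wins.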
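Example (if my_index already exists from the previous example, delete it first or use a different index name):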
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": ["a => 1", "b => 2"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_mapping"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
Input: abc
Output: 12c
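The mapping above can be tried directly with the _analyze API by defining the filter inline. In this sketch the ":)" rule and the sample text are additions for illustration. They show that a key can be a multi-character sequence, not just a single character.

```console
# Inline filter definition; the ":) => _happy_" rule is a hypothetical addition
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["a => 1", "b => 2", ":) => _happy_"]
    }
  ],
  "text": "abc :)"
}
```

With the keyword tokenizer the response should contain the single token 12c _happy_. For long rule lists, the filter also accepts a mappings_path parameter pointing to a file of rules in place of the inline mappings array.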
Pattern Replace Char Filter
The Pattern Replace Char Filter uses a Java regular expression to match text and replaces every match with the specified replacement string.
Example (as before, my_index must not already exist):
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "([0-9]+)",
          "replacement": "NUMBER"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_pattern"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
Input: My phone number is 12345
Output: My phone number is NUMBER
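The replacement string can also refer back to capture groups in the pattern using $1 through $9. The following sketch, with an illustrative pattern and sample text that are not part of the example above, replaces the hyphens between digit groups with underscores while keeping the digits.

```console
# Illustrative example: keep each digit run ($1) and turn the following hyphen into "_"
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "My credit card is 123-456-789"
}
```

The expected token is My credit card is 123_456_789. Backslashes in the pattern are doubled because the regular expression is written inside a JSON string.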
Conclusion
Char filters preprocess and normalize text before it reaches the tokenizer. Use html_strip to remove markup and decode entities, mapping to replace fixed characters or sequences, and pattern_replace for substitutions based on regular expressions. Several filters can be combined in one analyzer, in which case they run in the order listed.