How to delete the acute accent symbol from the tokenization process?

From the docs we have:

By default, every character maps to 0, which means that it is not considered a valid keyword and is treated as a separator.

But I have the acute accent symbol (U+00B4) in some words, and I need to just delete this symbol from the index.

How can I do this using charset_table?

Not sure exactly what you mean by deleting the symbol, but:
if you do not map that char in charset_table, it becomes a separator. If you instead want to collapse the token on that char (i.e. drop the char without splitting the word), you can use the ignore_chars index option.

i.e. if A is not in charset_table, then the token chAir is tokenized as the two tokens ch and ir;
if A is in ignore_chars, then chAir is tokenized as the single token chir.
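
For your case, here is a minimal sketch of an index definition that drops U+00B4 from tokens instead of splitting on it. The index name, source, and path are placeholders, and the charset_table is deliberately simplified; adapt them to your setup:

```
index products
{
    # hypothetical source and path, adjust to your configuration
    source        = products_src
    path          = /var/lib/manticore/data/products

    # simplified charset_table: digits and Latin letters are valid
    # keyword characters; any unmapped char acts as a separator
    charset_table = 0..9, A..Z->a..z, a..z

    # drop the acute accent (U+00B4) from tokens instead of
    # treating it as a separator, so e.g. "cha´ir" indexes as "chair"
    ignore_chars  = U+B4
}
```

Since ignore_chars is applied at tokenization time, the accent disappears from both indexed tokens and query terms, so a document containing "cha´ir" would match a search for "chair".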
