Preventing bigrams across sentence boundaries

Recently I've had renewed interest in using bigram_index on an index, ultimately so I can find common two-word phrases.

The problem is, it's indexing word pairs across sentence boundaries.

I tried using phrase_boundary (with phrase_boundary_step=10), but it doesn't seem to work. I think phrase_boundary only works without bigram_index, because it operates by manipulating keyword positions.
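For reference, this is roughly the configuration I'm describing (index and field names are just placeholders):

```
index my_index
{
    # generate bigrams for all adjacent keyword pairs
    bigram_index         = all

    # boundary characters and the position gap meant to break phrases
    phrase_boundary      = ., ?, !
    phrase_boundary_step = 10
}
```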

But when using bigram_index, the two words are entered as a single keyword in the dictionary, so the document matches anyway.

I guess the question is: should phrase_boundary affect a bigram_index index? And if not, is there an alternative way to prevent bigrams from being indexed across sentence boundaries?

For now the bigram_index tokenizer skips only blended tokens, and there is no config option to add the behavior you need.

However, you could file a feature request on GitHub for a new bigram_index option (e.g. sentence) that would restrict bigram generation at sentence boundaries.

For now you could also modify your search query and add the SENTENCE operator there, like (("test1 test2") test1 SENTENCE test2). I'm not sure whether this will give you a speedup at matching or not.
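Note that the SENTENCE operator only works if sentence boundaries were detected at indexing time, which requires index_sp = 1 in the index config. A sketch of the workaround (index and table names are placeholders):

```
index my_index
{
    bigram_index = all
    # enable sentence boundary detection so SENTENCE is usable
    index_sp     = 1
}
```

Then the query would look something like:

```
SELECT * FROM my_index
WHERE MATCH('(("test1 test2") test1 SENTENCE test2)');
```

The bigram part keeps the fast dictionary lookup, while the SENTENCE clause filters out matches that span a sentence boundary.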