Comparing Information Retrieval metrics with BM25 for Manticore Search vs Elasticsearch

We are evaluating Manticore Search (MS) against Elasticsearch (ES) for an Information Retrieval (IR) project. To help us decide, we first wanted a baseline study of retrieval quality with both MS and ES. We evaluate and compare using widely adopted retrieval metrics (NDCG@10 and others), as implemented in the BEIR project, on two datasets: TREC-COVID and NF-CORPUS.

All the setup, scripts and detailed results are available in this public GitHub repository. The README explains the different strategies we used for the comparison. Please let us know if anything requires more explanation and we can clarify. The primary question we are trying to answer is how we can get results with MS that are competitive with what we get from ES.

A summary of the results is below:

Results for trec-covid:

| dataset    | settings              | NDCG@10 |
|:-----------|:----------------------|--------:|
| trec-covid | MS (default)          | 0.29494 |
| trec-covid | MS (es-like)          | 0.59764 |
| trec-covid | ES                    | 0.68803 |
| trec-covid | ES (reported in BEIR) | 0.616   |

Results for nfcorpus:

| dataset  | settings              | NDCG@10 |
|:---------|:----------------------|--------:|
| nfcorpus | MS (default)          | 0.28791 |
| nfcorpus | MS (es-like)          | 0.31715 |
| nfcorpus | ES                    | 0.34281 |
| nfcorpus | ES (reported in BEIR) | 0.297   |
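
For readers less familiar with the metric, here is a minimal pure-Python sketch of how NDCG@10 can be computed for a ranked result list. It is not the BEIR code path. The function names are hypothetical, and it assumes linear gains (the relevance grade itself) with a log2 rank discount. Documents missing from the relevance judgments are treated as non-relevant.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: doc ids in the order the engine returned them.
    qrels: dict doc_id -> graded relevance (assumed to be non-negative integers).
    """
    # Unjudged documents contribute a gain of 0.
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking lists all judged documents by decreasing relevance.
    ideal_gains = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def mean_ndcg_at_k(run, all_qrels, k=10):
    """Average NDCG@k over all judged queries; run maps query_id -> ranked doc ids."""
    scores = [ndcg_at_k(run.get(qid, []), rels, k) for qid, rels in all_qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    qrels = {"q1": {"d1": 2, "d2": 1, "d3": 0}}
    run = {"q1": ["d3", "d2", "d1"]}  # worst possible ordering of the judged docs
    print(round(mean_ndcg_at_k(run, qrels), 4))
```

Because the score is normalized by the ideal ordering, a value such as 0.59764 versus 0.68803 reflects how close each engine's top 10 is to the best achievable ranking, averaged over queries.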

A few questions we had were:

  1. What options are we missing on our MS index for us to get competitive results - similar to ES?
  2. What options are we missing on our MS ranking options for us to get competitive results - similar to ES?
  3. We’ve observed the best results for MS with the default en stop words although that list is much larger than the list for English stop words in ES. How can we explain this behavior?

@Sergey Hi,
This question relates to our pilot of comparing and phasing out our Elasticsearch cluster and migrating to Manticore. Over the last few days we have been stuck with quality issues of MLT queries and it is a crucial factor in our comparison.
Please advise.

Thank you for letting us know. Your research looks super cool and useful. We’ll review your scripts to figure out what can be optimized in Manticore for higher rankings, but it may take some time. I can’t give an instant answer or advice, unfortunately.


Hey @Sergey , sorry to bother you again. Did you get a chance to check out the scripts/results?

Thanks!

Hi. Not yet, but it’s planned for this week.


@Sergey any luck? It is a major blocker for us.

Hi @Pavel_Nemirovsky

We have found 2 issues:

We are continuing to work on it and have already managed to increase the relevance (according to the tests @Narayan_Acharya has provided), but there’s still work to do.

  • another may be also a bug, but may be just a difference in the formulas

It turned out to be a bug. Here’s the issue on the subject. The bug is already fixed and there’s a PR. With the fix (and if we avoid the other issue), Manticore gives a better score than Elasticsearch in the same test for one dataset (TREC-COVID) and similar results for the other one.


Thanks for the update @Nick and @Sergey ! We appreciate you taking the time to investigate this :slight_smile:

When can we expect these changes to be available as part of a dev/nightly build for us to try out?

I hope we’ll be able to address them within the next 2 weeks.

Status update:

https://github.com/manticoresoftware/manticoresearch/issues/729 (about wrong field length related to stopword position) is fixed.
Another improvement related to the BM25F formula has also been made: branch tfidf-update was merged into master (manticoresoftware/manticoresearch@2ae9686).

So it should be fine now.

@Narayan_Acharya @Pavel_Nemirovsky can you guys retest please?

@Nick is also going to retest on our side, and we’ll then make a pull request to your repository https://github.com/dMetrics/ir-bm25-benchmark with the results we get.


@Sergey @Nick

I tried the manticoresearch dev docker image (image id 0ad4d8dd094e) along with the python client (built from source directly from the github repo, commit id 23ec425). A couple of minor changes to benchmark.manticore.evaluate were required to make it compatible with the latest python client. Reporting results for the ES-like Manticore settings below:

python -m benchmark.manticore.evaluate data/trec-covid test trec_covid_es_like

|    | metric   |     k=1 |     k=2 |     k=5 |    k=10 |
|---:|:---------|--------:|--------:|--------:|--------:|
|  0 | NDCG     | 0.53    | 0.48745 | 0.42577 | 0.35274 |
|  1 | MAP      | 0.00157 | 0.00231 | 0.00405 | 0.00564 |
|  2 | Recall   | 0.00157 | 0.00248 | 0.00499 | 0.00795 |
|  3 | P        | 0.6     | 0.53    | 0.452   | 0.362   |
|  4 | MRR      | 0.6     | 0.67    | 0.703   | 0.70967 |
|  5 | R_cap    | 0.6     | 0.53    | 0.452   | 0.362   |
|  6 | Hole     | 0.04    | 0.08    | 0.12    | 0.172   |
|  7 | Accuracy | 0.6     | 0.74    | 0.86    | 0.9     |

python -m benchmark.manticore.evaluate data/nfcorpus test nfcorpus_es_like

|    | metric   |     k=1 |     k=2 |     k=5 |    k=10 |
|---:|:---------|--------:|--------:|--------:|--------:|
|  0 | NDCG     | 0.43189 | 0.39606 | 0.35404 | 0.32029 |
|  1 | MAP      | 0.05647 | 0.0785  | 0.10289 | 0.11827 |
|  2 | Recall   | 0.05647 | 0.08331 | 0.12326 | 0.15094 |
|  3 | P        | 0.44892 | 0.38854 | 0.30341 | 0.23251 |
|  4 | MRR      | 0.44892 | 0.49071 | 0.51796 | 0.52513 |
|  5 | R_cap    | 0.44892 | 0.39938 | 0.3404  | 0.29457 |
|  6 | Hole     | 0.06502 | 0.0743  | 0.08421 | 0.08514 |
|  7 | Accuracy | 0.44892 | 0.53251 | 0.62848 | 0.68421 |

There is a slight bump in NDCG@10 for the nfcorpus_es_like index but a big drop in NDCG@10 for the trec_covid_es_like index compared to our previous results. I am guessing you have changes in mind for the index creation options we’ve used that would give equivalent or better results compared to Elastic. I’ll wait for your PR with the suggestions :slight_smile:


Did you reindex your data for that benchmark? The fix was made in the indexer tool, so you need to rebuild your indexes from scratch for the fix to take effect.

Yes, indexes were built from scratch again for the benchmarks reported in the previous comment.

We’ve created a PR with updated tests for a new Manticore version.


Thank you for the PR @Nick ! I’ve merged it and reported new results here and they are awesome :rocket: Thanks again, @Nick @Sergey @tomat , for all the help and addressing these issues promptly :slight_smile:

Thank you for helping sort the issues out!

Just in case this was not noticed: the new results for the default settings (no extra indexing options) are (much) poorer compared to the previous version. Although this does not block us (because we’re using the one with ES-like indexing options), we thought it would be worth pointing out in case it needs further investigation.

Thank you for pointing this out. We’ve checked this issue. The reason is that the default settings you use don’t include the ‘index_field_lengths’ setting, which is necessary because the bm25f ranker cannot be applied correctly without it. So the computed results are expected to be poor, regardless of the version used.
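
To illustrate why field lengths matter for this kind of ranker, here is a toy sketch of the textbook BM25 term score with its length-normalization factor. This is not Manticore's implementation. The function name and the numbers are hypothetical, and the case without stored lengths is modeled by the assumption that every document is treated as having average length.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of a single query term in a single document."""
    # Smoothed inverse document frequency (always positive in this variant).
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


if __name__ == "__main__":
    stats = dict(tf=2, avg_doc_len=200, n_docs=10_000, doc_freq=50)

    # With real lengths, a short document outranks a long one at equal term frequency.
    short_doc = bm25_term_score(doc_len=50, **stats)
    long_doc = bm25_term_score(doc_len=500, **stats)
    print("with lengths:   ", round(short_doc, 3), round(long_doc, 3))

    # Assumption: without stored lengths every document looks average, so the two tie.
    no_len_short = bm25_term_score(doc_len=200, **stats)
    no_len_long = bm25_term_score(doc_len=200, **stats)
    print("without lengths:", round(no_len_short, 3), round(no_len_long, 3))
```

Under that assumption the ranker loses its ability to prefer concise documents over long ones that mention the term equally often, which is one plausible way ranking quality can degrade when length data is missing.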
