Can I migrate from Sphinx Search?


#1

Hi.

First of all, thank you very much for your efforts in keeping this awesome software up to date, and for the excellent documentation.

I am considering switching from Sphinx Search to Manticore. In fact, I would have posted this thread earlier, but there is a user called barryhunter on the Sphinx Search forums who is awesome and who keeps Sphinx Search alive. Without his help, we could never have discovered the potential of Sphinx Search.

We are using Sphinx Search on a production server with the Foolz\SphinxQL\SphinxQL package (PHP). Now we have to do a migration, and I have to decide whether we opt for Manticore. Sphinx Search is giving us some problems with Ubuntu and the systemd scripts.

But I have to know whether our current sphinx.conf is compatible, and also whether the way we proceed is still valid in Manticore.

Summary: we use a dedicated server with 16 cores. Daily, we fill a table called products_tmp (100 GB). When this table is complete, we index that table with Sphinx Search and rename products_tmp to products. Each *.sps is about 705 MB, *.spa about 100 MB, and *.spd and *.spp about 360 MB.


As you can see in the details, the sphinx.conf may look a bit weird.
The thing is that our dedicated server has 16 cores, so we shard the index via a PHP script into a distributed index for faster reads.
mem_limit seems low, but that is because we index 8 shards at a time and we are at the limit of RAM usage.

echo pp0 pp1 pp2 pp3 pp4 pp5 pp6 pp7 | xargs -n1 -P8 /usr/bin/indexer --nohup --rotate --config /etc/sphinxsearch/sphinx.conf

rename 's/tmp/new/' /mnt/disk/sphinxsearch_data/*.tmp.* -v

echo pp8 pp9 pp10 pp11 pp12 pp13 pp14 pp15 | xargs -n1 -P8 /usr/bin/indexer --rotate --config /etc/sphinxsearch/sphinx.conf

With the lines above, we index the products_tmp table, and only when the last index has been processed do we rotate. That is why we have to use the "hack" of --nohup + rename, in order to "wait" until all the work has been done for pp15 (see the line below with if ($i == ($cores - 1))).

We currently can’t use seamless_rotate = 1 because of the high RAM peak and the resulting crash of the mysql service.

sphinx.conf (irrelevant parts omitted)
#!/usr/bin/php -q
<?php

$table = "products_tmp";
$cores = 16;

?>

source theshop
{
    type               = mysql
    [...]    
    sql_query_range  = \
    SELECT MIN(id),MAX(id) FROM <?php echo $table;?>    

    sql_query          = \
    SELECT id, name,price,brand,category,sku FROM <?php echo $table;?> WHERE id>=$start AND id<=$end
    [...]
    
    sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET SESSION query_cache_type=OFF
    
    
}

<?php for ($i=0; $i<$cores; $i++):?>
source ptp<?=$i?>: theshop
{
sql_query          = \
SELECT id, name,price,brand,category,sku FROM <?php echo $table;?> WHERE id>=$start AND id<=$end AND id % <?=$cores?> = <?=$i?>

<?php 
if ($i == ($cores - 1) && $table == "products_tmp") echo "sql_query_post_index = RENAME TABLE `products` TO `products_old`, `products_tmp` TO `products`".PHP_EOL;
?>
}
<?php endfor; ?>


index products_template
{
	type = plain
    source            = theshop
    path              = /mnt/disk/sphinxsearch_data/products
    stopwords		  = /home/ubuntu/sphinx_stopwords.txt
    wordforms 		  = /home/ubuntu/sphinx_wordform.txt
    stopword_step = 0
    stopwords_unstemmed = 1
    html_strip = 1
    blend_chars = -,&,U+002E
    blend_mode = trim_none,trim_both
    docinfo           = extern
    #mlock = 1    
    morphology = libstemmer_es    
    min_stemming_len = 4
    dict = keywords
    min_infix_len = 2
    #min_prefix_len = 2 
    min_word_len = 1
    index_exact_words=1
    expand_keywords= 0
    charset_table = # omitted for brevity

}

<?php for ($i=0; $i<$cores; $i++) { ?>
index pp<?=$i?>: products_template
{  
  source = ptp<?=$i.PHP_EOL?>
  path              = /mnt/disk/sphinxsearch_data/products_<?=$i?>

}
<?php } ?>

index products
{
  type = distributed
  <?php for ($i=0; $i<$cores; $i++) { ?>
  local = pp<?=$i.PHP_EOL?>
  <?php } ?>
        
}

indexer
{
    mem_limit    = 128M
    write_buffer = 4M
    lemmatizer_cache = 1024M
}
searchd
{
    listen            = 127.0.0.1:9306:mysql41
    log               = /var/log/sphinxsearch/searchd.log
    #query_log_format = sphinxql
    #query_log         = /var/log/sphinxsearch/query.log
    read_timeout      = 5
    max_children      = 30
    pid_file          = /var/run/sphinxsearch/searchd.pid   
    seamless_rotate   = 0
    preopen_indexes   = 1
    unlink_old        = 1
    binlog_path       = /var/lib/sphinxsearch/data 
    dist_threads = <?php echo $cores.PHP_EOL;?>
    #expansion_limit = 5
}

#2

I suspect you are coming from Sphinx 2.x. There’s no change needed in your config, and you can load the indexes created by Sphinx into Manticore. In the simplest case, you just have to replace the executables (searchd, indexer). If you are upgrading using a package, there are some permissions you will need to change; check https://manticoresearch.com/2018/08/13/migrating-from-sphinx/ or https://docs.manticoresearch.com/latest/html/installation.html#upgrading-from-sphinx-search.
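To make the permissions point concrete, a minimal sketch, assuming the Manticore packages run searchd as a `manticore` user (verify the user and exact steps against the linked migration guide); the paths are the ones from the sphinx.conf in this thread:

```shell
# Assumption: Manticore's searchd runs as user "manticore" after the package
# install; the old Sphinx files must be readable/writable by that user.
chown -R manticore:manticore /mnt/disk/sphinxsearch_data   # index files
chown -R manticore:manticore /var/lib/sphinxsearch/data    # binlog_path
chown -R manticore:manticore /var/run/sphinxsearch         # pid_file dir
chown -R manticore:manticore /var/log/sphinxsearch         # searchd log
```

Alternatively, point the Manticore config at its own default directories and copy the index files over.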

The Foolz package is compatible with Manticore; in fact, the latest version also has a dedicated helper for our Percolate Query feature.

lemmatizer_cache = 1024M

This matters only for the lemmatize* processors; it has no effect on stemmers. Also, off the record, 256M is enough to cache all the available lemmatizer dictionaries.

Regarding your rotation: if you have a problem with RAM, it is better to run the second batch with --nohup as well (and rename afterwards), then trigger the rotation separately at the end with a HUP signal.

kill -HUP `cat /var/run/sphinxsearch/searchd.pid`

How much of a RAM increase do you get with seamless rotation? From the info you provide, one index should use no more than 1 GB extra.


#3

Hi Globerada

I have to know whether our current sphinx.conf is compatible, and also whether the way we proceed is still valid in Manticore.

Yes, the config is compatible. I don’t remember us doing anything yet which would make it incompatible.

With regards to your use case in general, there is one thing I don’t understand, and I can give a few recommendations:

  1. The xargs command you use runs the indexers in parallel. At the same time, your logic is to “wait until all the work has been done for pp15”. But pp15 may finish earlier than the rest, and then the RENAME logic may cause issues.
  2. Instead of sending HUP, you may want to use the RELOAD INDEX … FROM command, or just RELOAD INDEXES (supported in Manticore, but missing in Sphinx). In many cases it is easier to do from the application than sending a signal, and RELOAD INDEX can make the process more controlled.
  3. Indexing from a tmp table and having to do "RENAME TABLE products TO products_old, products_tmp TO products" seems a little bit wrong in most cases, though it may really be necessary in some. I just want to make sure you have considered saving the last indexed id to a counters table, or indexing based on an “updated” column instead of the id. In both cases you can make your indexing lighter by using a main+delta scheme with kill-lists and reindexing only the documents that need rebuilding. Given that you are limited in resources, it may be useful.
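For reference, a sketch of what point 2 could look like over the SphinxQL connection (the shard name and path below are just the ones from this thread; check the Manticore documentation for the exact RELOAD INDEX semantics before relying on this):

```sql
-- Pick up new index files placed at the given path prefix
-- (the path shown is an assumption based on the config in this thread):
RELOAD INDEX pp0 FROM '/mnt/disk/sphinxsearch_data/products_0.new';

-- Or re-check every index at once, the SQL equivalent of sending a HUP:
RELOAD INDEXES;
```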

#4

Wow, thank you Adrian and Sergey for your fast reply!

Yes Adrian, you are totally right. We are using Sphinx 2.2.11-id64-release (95ae9a6). I know, a bit outdated, but if it works…

I don’t know what Percolate Query is; the first Google search throws me an Elasticsearch result. I will investigate in a while.

About the RAM: the machine has 57.59 GB. We are usually between 50 and 54 GB of memory usage (innodb_buffer_pool_size = 34G). The server is heavily optimized for processing 150 GB CSV files as fast as possible (products is an InnoDB table). I have attached the relevant parts of the MySQL configuration, in case it helps to understand the point.

About your note on adding --nohup to the second batch: I think it wouldn’t be possible, or at least not if I understand correctly how sql_query_post_index works. Notice that if I did what you suggest, there would be a window between sql_query_post_index and the rotation where the web app would be reading data from the “old” indexes (which, after the rename, correspond to products_old, while the real table is now products). I know it is a bit tricky, but I didn’t find an easier way.

@sergey

  1. Yes. I run xargs for speed. If I index them one by one, it takes a lot of time to complete!

At the same time, your logic is to “wait until all the work has been done for pp15”. But pp15 may finish earlier than the rest, and then the RENAME logic may cause issues.

Here I am taking advantage of the MySQL locks, so the rename is not done until all the indexing processes have finished. It seems to be working.

  2. Ohh, neat, so RELOAD INDEXES is something I must really take a look at! With my current version it is not available, so that’s another point in favor of the migration!

  3. Yes, real-time indexes are something I have considered at times, but the application was designed differently, and a lot of data may change on each import, so the level of fragmentation could be high. That is why the designers opted to rebuild the whole table on each import.

mysqld.cnf

[mysqld]

bulk_insert_buffer_size= 2G
#myisam_max_sort_file_size = 15G
#myisam_sort_buffer_size = 15G
table_open_cache= 4000
sort_buffer_size = 2G
read_rnd_buffer_size = 5M
read_buffer_size= 1M
query_cache_type = 0
#query_cache_limit = 5M
query_cache_size = 0
max_allowed_packet=32M
skip-name-resolve

innodb_flush_log_at_trx_commit = 0
sync_binlog=0
innodb_flush_method=O_DIRECT
innodb_buffer_pool_instances=16

innodb_flush_neighbors = 0
innodb_thread_concurrency=128
#https://nbsoftsolutions.com/blog/optimizing-innodb-bulk-insert
#https://www.speedemy.com/17-key-mysql-config-file-settings-mysql-5-7-proof/
innodb_change_buffering = all
innodb_change_buffer_max_size = 25
innodb_buffer_pool_size = 34G
innodb_log_file_size = 2G
#https://www.percona.com/blog/2011/07/09/how-to-change-innodb_log_file_size-safely/
skip-innodb_doublewrite

#https://www.saotn.org/mysql-innodb-performance-improvement/
innodb_read_io_threads = 8
innodb_write_io_threads = 8
#innodb_force_recovery=1