Out of memory (unable to allocate 3221225480 bytes) when indexing a table with 39 million rows

Mitch_Downey · March 13, 2022, 12:08am

I’ve been trying to solve an “out of memory” error for the past few weeks, tried many different things, and nothing has worked for me.

Can anyone provide some advice on how to solve this “out of memory” error? Obviously the server and/or Docker container needs more memory, but I find it hard to believe that Manticore just needs ever more RAM for larger tables. Is there a way to limit the amount of RAM the indexer process needs to use? Any help would be greatly appreciated.

The command I am running is:

docker exec -it podverse_manticore_prod gosu manticore indexer idx_episode --verbose;

The table I am attempting to index has over 39 million rows. The other tables I am indexing have less than 5 million rows, and those indexes are created by the same server and Docker container with no problem.

The error message:

using config file '/etc/manticoresearch/manticore.conf'...
indexing index 'idx_episode'...
collected 39093235 docs, 1779.6 MB
creating lookup: 39093.2 Kdocs, 100.0% done
sorted 282.1 Mhits, 100.0% done
FATAL: out of memory (unable to allocate 3221225480 bytes)

# BASE CONFIG

indexer {
    mem_limit = 256M
}

source base {
    type = pgsql
    sql_host = podverse_db
    sql_user = postgres
    sql_pass = mysecretpw
    sql_db = postgres
    sql_ranged_throttle = 50
}

# EPISODE TITLE SEARCH

source episode_csv: base {
    sql_field_string = title
    sql_attr_string = podverse_id
    sql_attr_uint = pastHourTotalUniquePageviews
    sql_attr_uint = pastDayTotalUniquePageviews
    sql_attr_uint = pastWeekTotalUniquePageviews
    sql_attr_uint = pastMonthTotalUniquePageviews
    sql_attr_uint = pastYearTotalUniquePageviews
    sql_attr_uint = pastAllTimeTotalUniquePageviews
    sql_attr_timestamp = created_date
    sql_query = \
        SELECT \
            id AS podverse_id, \
            title, \
            "pastHourTotalUniquePageviews", \
            "pastDayTotalUniquePageviews", \
            "pastWeekTotalUniquePageviews", \
            "pastMonthTotalUniquePageviews", \
            "pastYearTotalUniquePageviews", \
            "pastAllTimeTotalUniquePageviews", \
            extract(epoch from "pubDate") AS created_date \
        FROM "episodes"
}

source episode_base: base {
    sql_field_string = title
    sql_attr_string = podverse_id
    sql_attr_uint = pastHourTotalUniquePageviews
    sql_attr_uint = pastDayTotalUniquePageviews
    sql_attr_uint = pastWeekTotalUniquePageviews
    sql_attr_uint = pastMonthTotalUniquePageviews
    sql_attr_uint = pastYearTotalUniquePageviews
    sql_attr_uint = pastAllTimeTotalUniquePageviews
    sql_attr_timestamp = created_date
}

source episode: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes
    sql_range_step = 10000
    sql_query = \
        SELECT \
            int_id, \
            id AS podverse_id, \
            title, \
            "pastHourTotalUniquePageviews", \
            "pastDayTotalUniquePageviews", \
            "pastWeekTotalUniquePageviews", \
            "pastMonthTotalUniquePageviews", \
            "pastYearTotalUniquePageviews", \
            "pastAllTimeTotalUniquePageviews", \
            extract(epoch from "pubDate") AS created_date \
        FROM episodes \
        WHERE int_id >= $start AND int_id <= $end
}

index idx_episode {
    source = episode
    path = episodes_title
    dict = keywords
    expand_keywords = 1
    min_infix_len = 2
    charset_table = non_cjk, cjk
    rt_mem_limit = 4G
}

# SEARCHD CONFIG

searchd {
    listen = 9306:mysql41
    listen = 9308:http
    log = searchd.log
    query_log = query.log
    binlog_path =
    network_timeout = 5
    client_timeout = 300
    # max_children = 30
    persistent_connections_limit = 30
    pid_file = manticoresearch.pid
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    # mva_updates_pool = 1M
    max_packet_size = 8M
    max_filters = 256
    max_filter_values = 4096
    max_batch_queries = 32
    workers = threads
    collation_server = utf8_ci
}

The Docker Compose config for this service is:

  podverse_manticore:
    image: manticoresearch/manticore
    container_name: podverse_manticore_prod
    depends_on: 
      - podverse_db
    restart: always
    networks:
      - nginx-proxy
    ports:
      - 9306:9306
      - 9308:9308
    ulimits:
      nproc: 65535
      nofile:
         soft: 65535
         hard: 65535
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ./manticore/data:/var/lib/manticore
      - ./manticore/manticore.conf:/etc/manticoresearch/manticore.conf
    logging:
      driver: 'json-file'
      options:
        max-file: '1'
        max-size: '50m'

Source files in repo:

github.com

podverse/podverse-ops/blob/13b9cf8861164bf243fd9c150236e415b35d720c/manticore/manticore.conf

# BASE CONFIG

indexer {
    mem_limit = 256M
}

source base {
    type = pgsql
    sql_host = podverse_db
    sql_user = postgres
    sql_pass = mysecretpw
    sql_db = postgres
    sql_ranged_throttle = 50
}

# AUTHOR NAME SEARCH

source author_base: base {
    sql_field_string = name
    sql_attr_string = podverse_id

This file has been truncated. show original

github.com

podverse/podverse-ops/blob/13b9cf8861164bf243fd9c150236e415b35d720c/docker-compose.prod.db.yml#L30


      
            networks:
              - nginx-proxy
            environment:
              ENABLE_IPV6: "true"
            logging:
              driver: "json-file"
              options:
                max-file: "1"
                max-size: "50m"
          
          
podverse_db:
            image: postgres:11.5
            container_name: podverse_db_prod
            depends_on:
              - podverse_nginx_proxy
            env_file:
              - ./config/podverse-db-prod.env
            ports:
              - '5432:5432'
            networks:
              - nginx-proxy

Mitch_Downey · March 14, 2022, 1:32am

Alright, I figured out an approach that seems to be working for me, but I would guess it must be somewhat hacky, because the config I wrote is very non-DRY and has a fixed limit on the # of rows it can support. Basically I split the 80 million row index into 10 separate indexes, each index containing 20 million rows (and the extra indexes for the db to grow over time), then connected them using a local distributed index.

If anyone has suggested improvements I’d appreciate it, but for now this approach seems like it should work for our purposes.

Full config can be found here: podverse-ops/manticore.conf at 0e70bf795fd448dc050d00db204057d3446714da · podverse/podverse-ops · GitHub

# EPISODE TITLE SEARCH

source episode_base: base {
    sql_field_string = title
    sql_attr_string = podverse_id
    sql_attr_uint = pastHourTotalUniquePageviews
    sql_attr_uint = pastDayTotalUniquePageviews
    sql_attr_uint = pastWeekTotalUniquePageviews
    sql_attr_uint = pastMonthTotalUniquePageviews
    sql_attr_uint = pastYearTotalUniquePageviews
    sql_attr_uint = pastAllTimeTotalUniquePageviews
    sql_attr_timestamp = created_date
    
    sql_range_step = 10000
    sql_query = \
        SELECT \
            int_id, \
            id AS podverse_id, \
            title, \
            "pastHourTotalUniquePageviews", \
            "pastDayTotalUniquePageviews", \
            "pastWeekTotalUniquePageviews", \
            "pastMonthTotalUniquePageviews", \
            "pastYearTotalUniquePageviews", \
            "pastAllTimeTotalUniquePageviews", \
            extract(epoch from "pubDate") AS created_date \
        FROM episodes \
        WHERE int_id >= $start AND int_id <= $end
}

source episode_01: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 1 AND int_id < 20000001
}

source episode_02: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 20000001 AND int_id < 40000001
}

source episode_03: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 40000001 AND int_id < 60000001
}

source episode_04: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 60000001 AND int_id < 80000001
}

source episode_05: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 80000001 AND int_id < 100000001
}

source episode_06: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 100000001 AND int_id < 120000001
}

source episode_07: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 120000001 AND int_id < 140000001
}

source episode_08: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 140000001 AND int_id < 160000001
}

source episode_09: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 160000001 AND int_id < 180000001
}

source episode_10: episode_base {
    sql_query_range = SELECT MIN(int_id), MAX(int_id) FROM episodes WHERE int_id >= 180000001 AND int_id < 200000001
}

index idx_episode_base {
    dict = keywords
    expand_keywords = 1
    min_infix_len = 2
    charset_table = non_cjk, cjk
}

index idx_episode_01: idx_episode_base {
    source = episode_01
    path = episodes_title_01
}

index idx_episode_02: idx_episode_base {
    source = episode_02
    path = episodes_title_02
}

index idx_episode_03: idx_episode_base {
    source = episode_03
    path = episodes_title_03
}

index idx_episode_04: idx_episode_base {
    source = episode_04
    path = episodes_title_04
}

index idx_episode_05: idx_episode_base {
    source = episode_05
    path = episodes_title_05
}

index idx_episode_06: idx_episode_base {
    source = episode_06
    path = episodes_title_06
}

index idx_episode_07: idx_episode_base {
    source = episode_07
    path = episodes_title_07
}

index idx_episode_08: idx_episode_base {
    source = episode_08
    path = episodes_title_08
}

index idx_episode_09: idx_episode_base {
    source = episode_09
    path = episodes_title_09
}

index idx_episode_10: idx_episode_base {
    source = episode_10
    path = episodes_title_10
}

index idx_episode_dist {
    type = distributed
    local = idx_episode_01
    local = idx_episode_02
    local = idx_episode_03
    local = idx_episode_04
    local = idx_episode_05
    local = idx_episode_06
    local = idx_episode_07
    local = idx_episode_08
    local = idx_episode_09
    local = idx_episode_10
}

Sergey · March 14, 2022, 2:25am

Hi

I see you are already using mem_limit and it’s not helpful. It’s a known issue GitHub#72 ⁃ mem_limit not being honoured · Issue #72 · manticoresoftware/manticoresearch · GitHub .

What else you can try is:

smaller batches (sql_range_step)
disable min_infix_len
use RT index to avoid this issue completely since RT index respects rt_mem_limit

tomat · March 14, 2022, 7:27am

min_infix_len for dict=keywords does not follow mem_limit limit and for index with large dictionary could overflow box RAM

Mitch_Downey · March 17, 2022, 1:51am

Thanks so much for the info @Sergey and @tomat. Good to know I wasn’t just going crazy and the indexer doesn’t always follow the mem_limit.

The local distributed index approach I explained previously seems to be working fine for our purposes. I think the next step we’ll take to improving it is setting up a real-time index. Thanks again.