Indexing external text files

Hi,
I have 3.2GB of plain text documents stored as .txt files (converted from PDF), about 260,000 files in total. I do not wish to store them in the MariaDB database for various reasons.

Can I get Manticore to index these files directly off disk by giving it a reference to each file's location, using either a plain or a realtime table? Or do I need to use a realtime table, open and read each document myself, and shove the text in as part of an SQL insert into Manticore?

If the latter, this raises the question of parameterised queries, as the docs clearly contain many SQL-busting characters.

Thank you muchly. :)

p.s. I am looking to replace Elasticsearch (NastySearch I call it)

Hi. You can use:

  1. sql_file_field (Manticore Search Manual: Data creation and modification > Adding data from external storages > Fetching from databases > Processing fetched data)
  2. a script which outputs the docs in the XML file format (Manticore Search Manual: Data creation and modification > Adding data from external storages > Fetching from XML streams)
  3. or a script which populates your RT table. Use escaping appropriate to the client/interface you use (SQL, JSON, etc.).
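
For option 2, a minimal sketch of such a script in Python, assuming the xmlpipe2 layout described in the manual (a sphinx:docset wrapper, a sphinx:schema declaring the field, then one sphinx:document per file). The function name xmlpipe2_docset, the field name content, and the sequential document ids are illustrative choices, not anything Manticore requires. Text converted from PDF often contains form feeds and other control characters that are not valid XML, so the sketch replaces them with spaces before escaping.

```python
import re
from pathlib import Path
from xml.sax.saxutils import escape

# Control characters not allowed in XML 1.0; form feeds are common in PDF-to-text output.
_INVALID_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xmlpipe2_docset(paths, field="content"):
    """Yield an xmlpipe2 stream chunk by chunk, one document per text file.

    `field` is a hypothetical full-text field name; ids are assigned 1, 2, 3, ...
    """
    yield '<?xml version="1.0" encoding="utf-8"?>\n<sphinx:docset>\n'
    yield f'<sphinx:schema>\n<sphinx:field name="{field}"/>\n</sphinx:schema>\n'
    for doc_id, path in enumerate(paths, start=1):
        # Undecodable bytes become U+FFFD instead of aborting the whole run.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        # Blank out invalid control characters, then escape &, < and >.
        text = escape(_INVALID_XML.sub(" ", text))
        yield (
            f'<sphinx:document id="{doc_id}">\n'
            f"<{field}>{text}</{field}>\n"
            "</sphinx:document>\n"
        )
    yield "</sphinx:docset>\n"
```

A wrapper script would write these chunks to standard output, and the source in the Manticore configuration would run that script through xmlpipe_command. Because the function is a generator, only one file is held in memory at a time.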

OK. So, to make sure I understand: using option 1, my configuration would contain two entries for the field (let's call the field content):

sql_file_field = content
sql_field_string = content

My MariaDB would hold “/path/to/somefile.txt” in content, and Manticore would just suck in the somefile.txt file and index its contents?

… or do I just need the sql_file_field entry?

My MariaDB would hold “/path/to/somefile.txt” in content and manticore would just suck in somefile.txt file and index its contents?

Yes.

Thanks Sergey. I have decided to go with RT tables after all.
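
For anyone taking the same RT route, here is a rough Python sketch of one way to sidestep SQL escaping entirely: build a newline-delimited JSON payload and post it to Manticore's HTTP JSON interface, so that json.dumps takes care of quotes, backslashes, and control characters. Several things here are assumptions to check against your Manticore version: the /bulk endpoint on port 9308, the {"insert": {"index": ..., "id": ..., "doc": ...}} line format (newer releases may prefer "table" over "index"), and the table name docs and field name content, which are made up for illustration.

```python
import json
import urllib.request
from pathlib import Path


def bulk_insert_payload(paths, table="docs", field="content", first_id=1):
    """Build an NDJSON body with one insert action per text file.

    `table` and `field` are hypothetical names; ids count up from `first_id`.
    """
    lines = []
    for doc_id, path in enumerate(paths, start=first_id):
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        action = {"insert": {"index": table, "id": doc_id, "doc": {field: text}}}
        # json.dumps escapes quotes, backslashes and control characters.
        lines.append(json.dumps(action))
    return "\n".join(lines) + "\n"


def send_bulk(payload, url="http://127.0.0.1:9308/bulk"):
    """POST the payload; the URL is an assumed default, adjust to your setup."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With ~260,000 files it is probably worth calling bulk_insert_payload on batches of a few hundred paths at a time rather than building one multi-gigabyte request.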
