Store the data in a pointer. Convert to and from string_t and let duckdb handle the pointer swizzling.

With just count prefix in the string_t.

Exact match queries

QueryStandard method (s)Umbra-mol 20-byte prefix (s)Umbra-mol as string_t (s)Speedup (string_t vs standard method)
117.2380.4960.141122x
212.5550.4730.103122x
322.1960.3640.24092x
412.2450.3590.21756x

With counts + dalke fingerprints (12 byte prefix)

Exact matches:

QueryStandard method (s)Umbra-mol 20-byte prefix (s)Umbra-mol (12 byte count + dalke fp) (s)Postgres control (s)
117.2380.4960.2110.084
212.5550.4730.175233
322.1960.3640.6990.162
412.2450.3590.3510.900

Substructure matches:

QueryStandard method (s)Umbra-mol 10-byte prefix + dalke fp (s)Umbra-mol (12-byte count + dalke fp) (s)Postgres (s)
114.48719.958.904130.59
213.9619.470.7681.6
SELECT umbra_mol from molecule where umbra_is_substruct(umbra_mol, 'O=CNCCc1ccccc1');

133531 molecules. 9.061 seconds

SELECT rdkit_mol from molecule where is_substruct(rdkit_mol, 'O=CNCCc1ccccc1');

133531 molecules. 16.374 seconds

postgres

select rdkit_mol from compound_structures where rdkit_mol@>'O=CNCCc1ccccc1';

133531 molecules. 143 seconds.