Actually, using the MACCS keys is better than the counts prefix in that the number of collisions goes down rapidly.
chembl_33=# select maccs_fp(rdkit_mol) mfp, count(*) c from compound_structures group by mfp order by c desc;
mfp | c
----------------------------------------------+-----
\x000000020008400000ec7cbd11c5f8271b77fede3f | 531
\x000000020008600000e4159e12f5eaa81bfdaeff3f | 510
\x000000020008600002ec7dbf33e7faaf1bfffeff3f | 215
\x00000004000064020415040e2c3359c91bbe6eaf39 | 208
\x00400002100860000ae45cbf33f7fabb3bffeeff3f | 184
\x00000004000064020415040e2c3359c91bbf6eaf39 | 180
\x00400002100860000ae47dbf33e7febf3ffffeff3f | 153
\x0009000400004402041500922a7331e81abbe6f73f | 97
\x000000000000004200010022240202e000b267a53d | 97
\x000000000000004200010022240202e000ba67a53d | 96
\x00000000000004020011080c09115c891b9a6cad31 | 92
\x00400002100860000ae47dbf13e7febf3ffffeff3f | 87
\x00000000000004420011080c09115c891b9a6cad31 | 83
\x000000000000000000010022080002e000b067a53d | 82
\x00000000000060420001040e242202e9189a47af3c | 81
\x000000000000600204010c0e253359c91bbe6eab39 | 80
\x000000000000000000010022080002e000b867a53d | 76
\x000000000000004200010022240202e008ba67a53d | 71
\x00000000000060000084159e12d17c833bffecff3f | 70
\x000000000000600202801d9e33f37cabbbffefff3f | 68
\x00000000040024000414000e291359801ba86ea939 | 65
\x000000020008600000e4159f12f5eaa81bfdaeff3f | 64
\x000000000300883e290ae6fbf2fe7bb7ffffffff3f | 63
\x000000000000404200010002240202a0089a47a53c | 62
\x000000040000640204110c0e2d3359c91bbe6eab39 | 60
\x00000000200c0002002864ba32fe7ba7fbffffff3f | 59
\x00000000000060020001040e0c305cc81b9f6caf39 | 56
\x00000000000060420001040e242202e9189e47af3c | 56
...
Time: 799348.187 ms (13:19.348)
Note: Got 100k molecules for testing from chembl_33. Made sure they are the same molecules in duckdb and pg by exporting a selection of it, and creating a new table to make sure the same molecules in both.
For the queries, use the sqc data from dalke (https://hg.sr.ht/~dalke/sqc/browse?rev=tip) and BindingDB (Liu,T., Lin,Y., Wen,X., Jorrisen, R.N. and Gilson,M.K. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities Nucleic Acids Research 35:D198-D201 (2007).)
Extracted the top 100 most searched substructures data. The full data set was taking too long to test.
cat BindingDB_substructure.dat | sort -t ' ' -k 2 -n | tail -100 | awk -F' ' '{print $1}' > top100_substructure_queries.csv
Then confirm I am getting the same results from both databases:
In PG:
chembl_33=# select rdkit_mol@>'OC1=CC=C(C=CC2=CC(O)=CC(O)=C2)C=C1' as s, count(*) from hundredkmol group by s;
s | count
---+-------
f | 99914
t | 86
(2 rows)
Time: 779.601 ms
In Duckdb:
D select is_substruct(rdkit_mol, 'OC1=CC=C(C=CC2=CC(O)=CC(O)=C2)C=C1') as s, count(*) from molecule group by s;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬──────────────┐
│ s │ count_star() │
│ boolean │ int64 │
├─────────┼──────────────┤
│ false │ 99914 │
│ true │ 86 │
└─────────┴──────────────┘
Run Time (s): real 2.865 user 3.493166 sys 2.234336
chembl_33=# select rdkit_mol@>'CC' as s, count(*) from hundredkmol group by s;
s | count
---+-------
f | 1154
t | 98846
(2 rows)
Time: 773.831 ms
D select is_substruct(rdkit_mol, 'CC') as s, count(*) from molecule group by s;
┌─────────┬──────────────┐
│ s │ count_star() │
│ boolean │ int64 │
├─────────┼──────────────┤
│ false │ 1154 │
│ true │ 98846 │
└─────────┴──────────────┘
Run Time (s): real 1.971 user 1.970049 sys 0.000146
Have problems getting the duckdb python client to load the extension. Getting error, and not sure how to fix it. Might be a problem with the build process.
Used DuckDB to read the csv and then construct each SQL query with string formatting, and output that to a file. Did one for Postgres and one for duckdb. Now I can run the sql script and record the output to a file to record the experiment.
D with st as (select * from read_csv('top100_substructure_queries.csv', header=false)) select format('SELECT is_substruct(rdkit_mol, ''{}'') as s, count(*) from molecule group by s;', column0) from st;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ format('SELECT is_substruct(rdkit_mol, ''{}'') as s, count(*) from molecule group by s;', column0) │
│ varchar │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ SELECT is_substruct(rdkit_mol, 'O=CNCCc1ccccc1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'C1CCCC1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'CC(C)C[C@H](NC(=O)[C@@H]1CCCN1C(=O)[C@@H]([NH3+])C(C)C)C([O-])=O') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'O=C1OC2=C(C=CC=C2)C=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'OC(=O)C1=CC=CC=C1O') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'C1CCC2CCCC2C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'C1OC2=C(O1)C=CC=C2') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'CCNC') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'ClC1=CC=CC=C1Cl') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'COc1cc2c(Nc3cc(CC(=O)Nc4cccc(F)c4F)[nH]n3)ncnc2cc1OCCCN(CCO)CC(C)C') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'FC(F)(F)C1=CC=CC=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'N1C=CC2=C1N=CC=C2') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'N1N=CC2=CC=CC=C12') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'NC1=C2N=CNC2=NC=N1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'NC1=CC=NC(N)=N1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'O=C1NC2=NC=NC=C2C=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'OCCCC1=CNC2=CC=CC=C12') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'O=CNC1=NC=CC=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'N1C=CC=N1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'ONC') as s, count(*) from molecule group by s; │
│ · │
│ · │
│ · │
│ SELECT is_substruct(rdkit_mol, 'CI') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'OC') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'S1C2=CC=CC=C2N=C1C1=CC=CC=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'CN') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, '[H]C(=C([H])C1=CC(O)=CC(O)=C1)C2=CC=C(O)C=C2') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'O=S') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'N1C=CC=C1.C2=CC3=C(C=C2)C=CC=C3') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, '[H]CN') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'C1NCC2=C1C=CC3=C2C4=CC=CC=C4N3') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'N1C=CC=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'OC1=CC=C(C=CC2=CC(O)=CC(O)=C2)C=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'C1=CC=NC=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'CC12CCC3C(CCC4=CC(O)=CC=C34)C1CCC2O') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'CNC') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'NC') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'C1=CC=CC=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'C1=CC2=C(C=C1)C=CC=C2') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'O=C1OC2=CC=CC=C2C=C1') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'CC') as s, count(*) from molecule group by s; │
│ SELECT is_substruct(rdkit_mol, 'CC(C)C1(C)SC(Nc2ccccc2C(F)(F)F)=NC1=O') as s, count(*) from molecule group by s; │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 100 rows (40 shown) │
Tried MACCS key as prefix. Ran a bitwise AND between query and target. If there is no common bit, then bail out early.
Not very selective
D SELECT umbra_is_substruct(umbra_mol, 'O=CNCCc1ccccc1') as s, count(*) from molecule group by s;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬──────────────┐
│ s │ count_star() │
│ boolean │ int64 │
├─────────┼──────────────┤
│ false │ 95586 │
│ true │ 4414 │
└─────────┴──────────────┘
Run Time (s): real 2.556 user 2.526835 sys 0.024585
D SELECT is_substruct(rdkit_mol, 'O=CNCCc1ccccc1') as s, count(*) from molecule group by s;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬──────────────┐
│ s │ count_star() │
│ boolean │ int64 │
├─────────┼──────────────┤
│ false │ 95586 │
│ true │ 4414 │
└─────────┴──────────────┘
Run Time (s): real 2.482 user 2.481999 sys 0.000026
and
D SELECT umbra_is_substruct(umbra_mol, 'C1CCCC1') as s, count(*) from molecule group by s;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬──────────────┐
│ s │ count_star() │
│ boolean │ int64 │
├─────────┼──────────────┤
│ false │ 95416 │
│ true │ 4584 │
└─────────┴──────────────┘
Run Time (s): real 2.514 user 2.509029 sys 0.002755
D SELECT is_substruct(rdkit_mol, 'C1CCCC1') as s, count(*) from molecule group by s;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬──────────────┐
│ s │ count_star() │
│ boolean │ int64 │
├─────────┼──────────────┤
│ false │ 95416 │
│ true │ 4584 │
└─────────┴──────────────┘
Run Time (s): real 2.501 user 2.482198 sys 0.019992
In the second query, with phenyl as the query molecule, it does bailout about 300 times. The target molecules in those cases look strange. But the bailout does occur, just not very often.
Move onto the 54 bit dalke set.
54 bit dalke set
D SELECT is_substruct(rdkit_mol, 'C1CCCC1') as s, count(*) from molecule group by s;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬──────────────┐
│ s │ count_star() │
│ boolean │ int64 │
├─────────┼──────────────┤
│ false │ 95416 │
│ true │ 4584 │
└─────────┴──────────────┘
Run Time (s): real 2.484 user 3.065887 sys 1.900994
D SELECT umbra_is_substruct(umbra_mol, 'C1CCCC1') as s, count(*) from molecule group by s;
┌─────────┬──────────────┐
│ s │ count_star() │
│ boolean │ int64 │
├─────────┼──────────────┤
│ false │ 95416 │
│ true │ 4584 │
└─────────┴──────────────┘
Run Time (s): real 1.493 user 1.234225 sys 0.259693
logged whenever the short circuit was triggered:
wc -l umbra_substruct_log_file.txt
95416
The number logged and the number false is the same, which means every false was successfully short circuited.
Normal substructure search results:
s,count_star()
false,95586
true,4414
s,count_star()
false,95416
true,4584
s,count_star()
false,100000
s,count_star()
false,98791
true,1209
s,count_star()
false,99619
true,381
umbra substructure search results:
s,count_star()
false,95586
true,4414
s,count_star()
false,95416
true,4584
s,count_star()
false,100000
s,count_star()
false,98791
true,1209
s,count_star()
false,99619
true,381
rdkit postgres extension control:
s | count
---+-------
f | 95586
t | 4414
(2 rows)
s | count
---+-------
f | 95416
t | 4584
(2 rows)
s | count
---+--------
f | 100000
(1 row)
s | count
---+-------
f | 98791
t | 1209
(2 rows)
s | count
---+-------
f | 99619
t | 381
(2 rows)
normal substructure times:
Run Time (s): real 2.502 user 2.481379 sys 0.022054
Run Time (s): real 2.478 user 2.478177 sys 0.000980
Run Time (s): real 3.139 user 3.138265 sys 0.000003
Run Time (s): real 2.521 user 2.518485 sys 0.001001
Run Time (s): real 2.390 user 2.390861 sys 0.000004
umbra substructure times: seems a bit faster!
Run Time (s): real 2.094 user 2.069648 sys 0.025732
Run Time (s): real 1.113 user 1.113625 sys 0.000005
Run Time (s): real 1.173 user 1.172624 sys 0.000000
Run Time (s): real 1.377 user 1.376821 sys 0.000000
Run Time (s): real 1.810 user 1.808781 sys 0.000000
rdkit postgres extension control:
Time: 820.463 ms
Time: 905.434 ms
Time: 876.220 ms
Time: 786.153 ms
Time: 784.434 ms
Trying a more OLAP type query
This shows where duckdb excels – analytical queries
D SELECT count(*) FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN compound_properties cp ON cp.molregno=m.molregno
WHERE umbra_is_substruct(m.umbra_mol, 'O=CNCCc1ccccc1');
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 61541 │
└──────────────┘
Run Time (s): real 11.732 user 55.942848 sys 4.096991
chembl_33=# SELECT count(*) FROM compound_structures m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN compound_properties cp ON cp.molregno=m.molregno
WHERE m.rdkit_mol@>'O=CNCCc1ccccc1';
count
-------
61541
(1 row)
Time: 117963.246 ms (01:57.963)
There is a index on the rdkit column. Not sure if there is another index that is needed for substructure search
chembl_33=# \d compound_structures
Table "public.compound_structures"
Column | Type | Collation | Nullable | Default
--------------------+-------------------------+-----------+----------+---------
molregno | bigint | | not null |
molfile | text | | |
standard_inchi | character varying(4000) | | |
standard_inchi_key | character varying(27) | | not null |
canonical_smiles | character varying(4000) | | |
rdkit_mol | mol | | |
num_rings | integer | | |
num_bonds | integer | | |
amw | double precision | | |
num_atoms | integer | | |
prefix | text | | |
maccs_fp | bfp | | |
Indexes:
"pk_cmpdstr_molregno" PRIMARY KEY, btree (molregno)
"compound_structures_pk" UNIQUE, btree (molregno)
"idx_cmpdstr_smiles" btree (canonical_smiles)
"idx_cmpdstr_stdkey" btree (standard_inchi_key)
"molidx" gist (rdkit_mol)
"uk_cmpdstr_stdinchkey" UNIQUE CONSTRAINT, btree (standard_inchi_key)
Foreign-key constraints:
"fk_cmpdstr_molregno" FOREIGN KEY (molregno) REFERENCES molecule_dictionary(molregno) ON DELETE CASCADE
About 10x faster than postgres
But it makes the exact match slower. I think it is because now the query molecule for exact match is getting all these dalke fingerprints calculated, and or there is more bytes to push through the CPU because of the dalke fps
Trying to get exact match to be fast again
Three experiments:
- generate dalke fp on everything
- generate dalke fp only for substructure match and cache the generated dalke fp for substructure match
- generate only for substructure match and deserialize only necessary prefix for exact match
Experiment 1
Exact match queries:
Query | Standard method (s) | Umbra-mol 10-byte prefix + dalke fp (s) |
---|---|---|
1 | 16.06 | 5.26 |
2 | 9.77 | 5.21 |
3 | 22.17 | 5.44 |
4 | 12.35 | 5.336 |
before adding the dalke fp bits, all the exact match times of the UmbraMol were roughly < 0.6 seconds
Substructure queries:
Query 1:
standard method
SELECT count(*) FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN compound_properties cp ON cp.molregno=m.molregno
WHERE is_substruct(m.rdkit_mol, 'O=CNCCc1ccccc1');
UmbraMol with dalke fp
SELECT count(*) FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN compound_properties cp ON cp.molregno=m.molregno
WHERE umbra_is_substruct(m.umbra_mol, 'O=CNCCc1ccccc1');
Query 2:
SELECT a.standard_type, avg(a.value), count(a.value), a.relation, m.rdkit_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN compound_properties cp ON cp.molregno=m.molregno
WHERE is_substruct(m.rdkit_mol,'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
GROUP BY m.rdkit_mol, a.relation, a.standard_type;
Query | Standard method (s) | Umbra-mol 10-byte prefix + dalke fp (s) | Postgres (s) |
---|---|---|---|
1 | 13.47 | 11.89 | 128.59 |
2 | 13.96 | 5.56 | 0.6 |
Not sure why query 2 in postgres is so fast but query 1 is not. Need to look at explain analyze more.
Experiment 2
Exact match queries: It doesn’t look like not generating the dalke fp saves that much time. Perhaps that means the size of the header is now causing slow down
Query | Standard method (s) | Umbra-mol 10-byte prefix + dalke fp (s) |
---|---|---|
1 | 16.06 | 5.19 |
2 | 9.77 | 5.13 |
3 | 22.17 | 5.70 |
4 | 12.35 | 5.23 |
Substructure queries:
Query | Standard method (s) | Umbra-mol 10-byte prefix + dalke fp (s) | Postgres (s) |
---|---|---|---|
1 | 14.487 | 14.22 | 130.59 |
2 | 13.96 | 7.15 | 0.6 |
Got 2,573,287 cache hits on the first query, so it doesn’t look like generating the dalke fp and then caching it helps… Seems to make it slower!
Dalke fp seems to help alot on the second query
Experiment 3
Exact match queries:
Just deserializing what is needed on the spot seems to shave a couple seconds off, but doesn’t return to the performance that was in the original version.
Query | Standard method (s) | Umbra-mol 10-byte prefix + dalke fp (s) |
---|---|---|
1 | 16.06 | 3.73 |
2 | 9.77 | 3.55 |
3 | 22.17 | 4.11 |
4 | 12.35 | 3.59 |
I wonder if the increased header size with the dalke fp bits causes data to move in and out of disk and the buffer pool slower?
Query | Standard method (s) | Umbra-mol 10-byte prefix + dalke fp (s) | Postgres (s) |
---|---|---|---|
1 | 14.487 | 19.95 | 130.59 |
2 | 13.96 | 19.47 | 0.6 |
The substructure searches got really slow…and I don’t think I changed anything there.