Umbra-mols -- Umbra-style molecules

· chembl, chemistry, rdkit, databases, duckdb, umbra

Umbra-mols -- Umbra-style molecules - part 2

After sharing part 1, user dalke on hacker news gave me some helpful feedback and several new ideas to explore. Thanks again.

In part 1, I was inspired by Umbra-style strings, and I incorporated a prefix in front of the binary molecule encoding in order to drastically speed up exact match queries in duckdb with an RDKit extension.

One of the ideas for improvement is to get more out of the bytes than I currently am for the 20 bytes in the prefix that I’m using.

In this post, I do some analysis on the prefix and then optimize the prefix based off those findings.

Abstract

Analyzed the current version of the prefix in Umbra-style molecules, Umbra-mols, to get an idea of how often the short-circuit would not be triggered and deserialization would need to be done: 1
Made a rough measurement of how additional deserialization of the binary molecule slows query execution down 2
Explored alternative prefix schemes – jump to that: 3
Analyzed the max sizes of the easily accessible molecular attributes in chembl, pointing to an optimization: a 10 byte and 7 byte prefix achieved similar results to the 20 byte prefix: 4

Quick recap:

In Umbra, the authors introduce a string encoding format (Umbra-style/German-style strings) that enables a short-circuit to quickly return without more costly pointer dereferencing which helps to speed up filtering in database systems. If I understand correctly, the prefix alone is not supposed to be be sufficient for distinguishing every record from another. Rather it is to rapidly filter out records that simply cannot fulfill the condition, so that more expensive computations can be avoided. Inspired by this and RDKit and chemicalite, I applied some of the ideas from Umbra-strings to molecules.

Currently, the prefix in “Umbra-mols” uses four easily accessible molecular attributes to short-circuit in a comparison operation, and this leads to a big improvement in exact match queries on molecular data in duckdb with an RDKit extension, see part 1.

| 4 bytes - number of atoms | 4 bytes - number of bonds | 4 bytes - amw | 4 bytes - number of rings | 4 bytes - size of the binary molecule | n bytes - binary molecule |

The first 16 bytes makes up the prefix that is checked, and if two molecules have different prefixes, the function can short circuit and avoid deserializing the binary molecule for further checks to determine if the two molecules are the “same” (depends on what the search is sensitive to).

In Part 1 I find this method achieves ~26-60x speedup compared to an initial implementation without this prefix.

Taking a closer look at the “counts prefix”

I’ll call this the “counts” prefix. How good is this prefix? How often do we hit the short-circuit, achieving a speedup, and how often not?

I wanted to get some baseline measurements so that I have something to compare it to when trying other ideas.

First, I want to see how many duplicate molecules are in the table to get a baseline.

There are 2,372,674 molecules in the table, 2,372,528 distinct by the chembl SMILES (146 duplicates), and 2,372,527 distinct RDKit canonical SMILES (147 duplicates).

D select count(*) from molecule;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      2372674 │
└──────────────┘
Run Time (s): real 0.001 user 0.004088 sys 0.000006

D select count(distinct(smiles)) from molecule;
┌────────────────────────┐
│ count(DISTINCT smiles) │
│         int64          │
├────────────────────────┤
│                2372528 │
└────────────────────────┘
Run Time (s): real 0.197 user 0.710163 sys 0.084169

D select count(distinct(mol)) from molecule;
┌─────────────────────┐
│ count(DISTINCT mol) │
│        int64        │
├─────────────────────┤
│             2372527 │
└─────────────────────┘
Run Time (s): real 0.652 user 2.213165 sys 0.457163

I used the Postgres RDKit extension to calculate the header values (number of atoms, etc.), and then queried to find the counts of compound structures in chembl_33 that have the same number of num_rings, num_atoms, amw, and num_bonds.

Below, I show the counts of replicated prefixes, collisions, in descending order.

There are ~245,562 records (nearly 10%) that have at least one other molecule that match its prefix (and, not shown, ~28,000 that have at least 10 other molecules matching its prefix). There are also some groups that have lots of replicates (>200, even one 511).

That means for many molecules it is not possible to resolve if they are different by the prefix alone; they require the deserialization of the binary molecule to answer that. Furthermore, for some molecules, if you hit one of these hot spots, your query could need to deserialize 511 times.

D select * from (select num_rings, num_atoms, amw::integer, num_bonds, count(*) as count from molecule group by num_rings, num_atoms, amw::integer, num_bonds order by count desc) as g where g.count > 1;
┌───────────┬───────────┬──────────────────────┬───────────┬───────┐
│ num_rings │ num_atoms │ CAST(amw AS INTEGER) │ num_bonds │ count │
│   int32   │   int32   │        int32         │   int32   │ int64 │
├───────────┼───────────┼──────────────────────┼───────────┼───────┤
│         4 │       193 │                 1348 │        42 │   511 │
│         3 │        39 │                  308 │         4 │   330 │
│         3 │        42 │                  322 │         5 │   314 │
│         3 │        35 │                  282 │         3 │   311 │
│         3 │        31 │                  252 │         2 │   300 │
│         3 │        41 │                  310 │         4 │   294 │
│         3 │        36 │                  294 │         3 │   292 │
│         3 │        38 │                  296 │         4 │   292 │
│         3 │        32 │                  268 │         2 │   286 │
│         3 │        37 │                  280 │         3 │   280 │
│         3 │        36 │                  281 │         3 │   280 │
│         3 │        35 │                  278 │         3 │   279 │
│         3 │        37 │                  293 │         3 │   274 │
│         3 │        38 │                  309 │         4 │   272 │
│         3 │        34 │                  266 │         3 │   272 │
│         3 │        34 │                  266 │         2 │   271 │
│         3 │        47 │                  338 │         5 │   269 │
│         3 │        38 │                  296 │         3 │   266 │
│         3 │        41 │                  323 │         4 │   266 │
│         3 │        40 │                  311 │         4 │   262 │
│         · │         · │                   ·  │         · │     · │
│         · │         · │                   ·  │         · │     · │
│         · │         · │                   ·  │         · │     · │
│         5 │        76 │                  549 │         4 │     2 │
│         8 │        77 │                  636 │         6 │     2 │
│         4 │        30 │                  336 │         1 │     2 │
│         3 │        59 │                  476 │         6 │     2 │
│         5 │        52 │                  543 │         4 │     2 │
│         5 │        55 │                  571 │         4 │     2 │
│         3 │        71 │                  494 │         8 │     2 │
│         2 │        65 │                  444 │         3 │     2 │
│         3 │        55 │                  391 │         6 │     2 │
│         5 │        52 │                  465 │         8 │     2 │
│         4 │        43 │                  426 │         1 │     2 │
│         5 │        56 │                  379 │         7 │     2 │
│         2 │       103 │                  652 │        11 │     2 │
│         4 │        40 │                  401 │         6 │     2 │
│         4 │        46 │                  411 │         8 │     2 │
│         2 │        61 │                  400 │         7 │     2 │
│         2 │        19 │                  183 │         1 │     2 │
│         3 │        48 │                  290 │         3 │     2 │
│         3 │        49 │                  330 │         6 │     2 │
│         1 │        38 │                  324 │         4 │     2 │
├───────────┴───────────┴──────────────────────┴───────────┴───────┤
│ 245562 rows (40 shown)                                 5 columns │
└──────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.094 user 0.312892 sys 0.017117

I picked a SMILES from this group, belonging to molregno=1387483 which is CSCC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)CNC(=O)[C@@H](Cc1ccccc1)NC(=O)[C@@H](Cc1ccccc1)NC(=O)[C@@H](CCC(N)=O)NC(=O)[C@@H](CCC(N)=O)NC(=O)[C@@H]1CCCN1C(=O)[C@@H](CCCCN)NC(=O)[C@H]1CCCN1C(=O)[C@H](N)CCCN=C(N)N)C(N)=O and ran this query with the Umbra-mol.

Note: truncated the output to fit better, so it says 40 shown, but I deleted some rows to not take up so much room.

D select molregno, canonical_smiles, mol, umbra_mol from molecule where umbra_is_exact_match(umbra_mol,'CSCC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)CNC(=O)[C@@H](Cc1ccccc1)NC(=O)[C@@H](Cc1ccccc1)NC(=O)[C@@H](CCC(N)=O)NC(=O)[C@@H](CCC(N)=O)NC(=O)[C@@H]1CCCN1C(=O)[C@@H](CCCCN)NC(=O)[C@H]1CCCN1C(=O)[C@H](N)CCCN=C(N)N)C(N)=O');
┌──────────┬──────────────────────┬──────────────────────┬────────────────────────────────────┐
│ molregno │   canonical_smiles   │         mol          │             umbra_mol              │
│  int64   │       varchar        │         mol          │              umbramol              │
├──────────┼──────────────────────┼──────────────────────┼────────────────────────────────────┤
│  1387483 │ CSCC[C@H](NC(=O)[C…  │ CSCC[C@H](NC(=O)[C…  │ CSCC[C@H](NC(=O)[C@H](CC(C)C)NC(…  │
│  1387484 │ CSCC[C@H](NC(=O)[C…  │ CSCC[C@H](NC(=O)[C…  │ CSCC[C@H](NC(=O)[C@@H](CC(C)C)NC…  │
│  1387487 │ CSCC[C@H](NC(=O)[C…  │ CSCC[C@H](NC(=O)[C…  │ CSCC[C@H](NC(=O)[C@@H](CC(C)C)NC…  │
│      ·   │          ·           │          ·           │                 ·                  │
│      ·   │          ·           │          ·           │                 ·                  │
│      ·   │          ·           │          ·           │                 ·                  │
├──────────┴──────────────────────┴──────────────────────┴────────────────────────────────────┤
│ 510 rows (40 shown)                                                               4 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 1.651 user 4.568367 sys 0.814046

Indeed, this query runs slower than, for example Query 1 from part 1, and this is insensitive to stereochemistry, so there are many hits. I logged each time the short circuit was not triggered and deserialization of the binary molecule was required, and I got 512 lines (not sure why there is an extra row though). These results seem to support that the more deserialization we need to do in the query execution, the slower it will be. But, it’s still not bad – the standard implementation which deserializes the binary molecule takes 40 seconds. Note, there are several more checks that happen with the RDKit molecule object, after deserializing, if the prefix cannot conclude that the molecules are different. So that means for each of these 511, there could be several steps of more expensive computation happening at each one.

Anyhow, seeing the number of collisions gives me some indication of how much filtering the prefix of counts help in filtering molecules, and a place to work from to see if alternative prefixes could be at least similar in its filtering capabilities or better, and ideally smaller than the current 20 byte prefix.

Squashed prefix: uses less space but more collisions and deserialization

I tried a prefix that would trade off space for more collisions by summing the counts and storing it in a single 2 byte value, and then I ran the queries to see how more collisions would affect the execution time of the query. This would make a header of 4 bytes (the summed values, and the binary molecule size which is used for deserialization).

We can see that there will be a big increase in conflicts, but let’s see how it affects the execution time.

D select * from (select (num_rings+num_atoms+amw::integer+num_bonds) as ra, count(*) as count from molecule group by ra order by count desc) as g where g.count > 1;
┌───────┬───────┐
│  ra   │ count │
│ int32 │ int64 │
├───────┼───────┤
│   389 │  9192 │
│   393 │  8946 │
│   406 │  8808 │
│   424 │  8769 │
│   402 │  8758 │
│   407 │  8641 │
│   394 │  8633 │
│   411 │  8612 │
│   397 │  8589 │
│   398 │  8572 │
│   380 │  8544 │
│   376 │  8482 │
│   385 │  8362 │
│   429 │  8357 │
│   415 │  8353 │
│   384 │  8349 │
│   388 │  8254 │

I ran the queries with this 4-byte “squashed” prefix from my examples. The squashed prefix is about ~1.7x as slow in Query 1 as the non-squashed prefix, and even worse on other queries. I also logged to a file whenever there was a prefix that was a match and thus deserialization of the binary molecule was necesssary. I found that in Query 1, the short circuit was not triggered 3,121 times as opposed to 487 with the separated values prefix (~6.4x more deserialization).

Query	Standard method (s)	Umbra-mol 20-byte prefix (s)	Umbra-mol squashed 4-byte prefix (s)
1	17.238	0.496	0.889
2	12.555	0.473	0.493
3	22.196	0.364	1.735
4	12.245	0.359	0.900

Although there are still many collisions, the execution time is much better than the standard method of deserializing every record, but there is a noticeable user experience lag between the original prefix and the squashed prefix.

Trying prefixes made from canonical smiles

Since comparing the SMILES between the two molecules is the final check in the exact match algorithm (part 1), why make better use of the bytes in the prefix to store something that might filter out more?

Again, the goal is not perfect filtering by having zero collisions (unless we can achieve that without spending too much time or space to calculate a hash or prefix for that, then that would be great).

On one extreme, we could put the SMILES all in the prefix. That would give a very good filtering because there would be few collisions, but we would have a variable length prefix that could become huge. For example, this is a SMILES in chembl: CC(C)CCC[C@@](C)(O)[C@H]1CC[C@H]2[C@@H]3C[C@H](O[C@@H]4O[C@H](C)[C@@H](O)[C@H](O[C@@H]5OC[C@@H](O[C@@H]6O[C@H](C)[C@@H](O[C@@H]7OC[C@H](O)[C@H](O)[C@H]7O)[C@H](O)[C@H]6O[C@@H]6O[C@H](C)[C@H](O)[C@H](O)[C@H]6O)[C@H](O)[C@H]5O[C@@H]5O[C@H](C)[C@@H](O)[C@H](O)[C@H]5O)[C@H]4O)[C@H]4C[C@@H](OS(=O)(=O)O)CC[C@]4(C)C3=CC[C@@]21C

At that point, it’s probably better to not even bother with prefixes, and just have the user create a TEXT column with the canonoical SMILES and filter on that. Doing so would also take advantage of duckdb’s powerful string implementation as well as its compression methods.

But, this requires the user to have to add more data to their db for the SMILES column, and perhaps require them to understand the internals of duckdb more – that it’s very good with strings and compression. I would like to abstract that away if possible so that the UmbraMol type is useable out of the box.

More complex hash functions could be used on the SMILES string, but it’s extra complexity in creating the prefix. The RegistrationHash from RDKit was one suggestion of an idea to explore, but I didn’t find an implementation of it in the C++ RDKit API. As for other hash functions, I don’t want to bring in other dependencies to calculate the hash, and I don’t want to code up my own low-collision hash.

I certainly could be overestimating the cost of computating a hash, or canonical SMILES.

But along the lines of using the canonical SMILES, I explored taking slices of the canonicalSMILES and using that as a prefix. I tried to see how many collisions would occur with 8 bytes taken out of the SMILES string.

I tried the following:

take the first 8 bytes of the canonical smiles and use that as a prefix
take the first 4 and last 4 bytes
split into three parts – take the first 2 bytes, middle 4 bytes, and last 2 bytes
split into 4 equal parts of 2 bytes each

The idea of taking slices at different points of the SMILES is that hopefully I could introduce some distinguishing features among similar SMILES so that I would not get a collision.

Unfortunately, they all had lots of collisions, and I abandoned the approach. As we saw in the previous experiment, increasing the amount of deserializations of the binary molecule will have a noticeable impact on the performance. Ultimately, I felt like I was trying to brute force some clever sampling scheme. I’m no good at math, but perhaps someone more skilled in math could figure out a clever sampling scheme to reduce the probability of a collision.

Taking the first 8 bytes as a prefix results in lots of collisions.

D update molecule set prefix=left(mol::TEXT, 8);
D update molecule set prefix_two_split=concat(left(mol::TEXT, 4), right(mol::TEXT,4));

D SELECT prefix, count(*) as c from molecule group by prefix order by c desc;
┌──────────┬────────┐
│  prefix  │   c    │
│ varchar  │ int64  │
├──────────┼────────┤
│ COc1ccc( │ 127784 │
│ COc1cccc │  54504 │
│ CCOC(=O) │  49159 │
│ CC(C)(C) │  30680 │
│ O=C(Nc1c │  30619 │
│ Cc1ccc(C │  28144 │
│ CCCCCCCC │  27587 │
│ COc1ccc2 │  25252 │
│    ·     │      · │
│    ·     │      · │
│    ·     │      · │
├──────────┴────────┤
│    21365 rows     │
│    (40 shown)     │
└───────────────────┘
Run Time (s): real 0.021 user 0.089599 sys 0.001947

Taking the first four and last four bytes still results in a lot of molecules having the same prefix.

D SELECT prefix_two_split, count(*) as c from molecule group by prefix_two_split order by c desc;
┌──────────────────┬───────┐
│ prefix_two_split │   c   │
│     varchar      │ int64 │
├──────────────────┼───────┤
│ COc1)cc1         │ 73051 │
│ Cc1c)cc1         │ 46462 │
│ COc1c1OC         │ 40504 │
│ O=C()cc1         │ 28492 │
│ COc12)c1         │ 20259 │
│ O=C()CC1         │ 19656 │
│ Cc1c2)c1         │ 19023 │
│    ·             │     · │
│    ·             │     · │
│    ·             │     · │
├──────────────────┴───────┤
│  47074 rows (40 shown)   │
└──────────────────────────┘
Run Time (s): real 0.030 user 0.145595 sys 0.000979

A prefix constructed from taking the first 2 bytes, middle 4 bytes, and last 2 bytes still has lots of replicates. It’s better than the squashed prefix, but not better than the 20 byte “counts” prefix.

D update molecule SET prefix_three_split=concat(left(mol::TEXT, 2), substr(mol::TEXT, length(mol::TEXT)//2, 4) ,right(mol::TEXT,2));

D SELECT prefix_three_split, count(*) as c from molecule group by prefix_three_split order by c desc;
┌────────────────────┬───────┐
│ prefix_three_split │   c   │
│      varchar       │ int64 │
├────────────────────┼───────┤
│ CC(=O)c1           │  6794 │
│ CO(=O)c1           │  5983 │
│ CCC(=Oc1           │  4767 │
│ Cc(=O)c1           │  4606 │
│ O=ccccc1           │  4512 │
│ COC(=Oc1           │  3882 │
│ CCccc(c1           │  3692 │
│    ·               │     · │
│    ·               │     · │
│    ·               │     · │
├────────────────────┴───────┤
│   336588 rows (40 shown)   │
└────────────────────────────┘
Run Time (s): real 0.081 user 0.339158 sys 0.032252

Split into four equal parts 2 bytes each is better, but still much more collisions that the “counts” prefix.

D update molecule SET prefix_three_split=concat(left(mol::TEXT, 2), substr(mol::TEXT, (length(mol::TEXT)//2)//2, 2),substr(mol::TEXT, length(mol::TEXT)//2+(length(mol::TEXT)-length(mol::TEXT)//2)//2, 2) ,right(mol::TEXT,2));

D SELECT prefix_four_split, count(*) as c from molecule group by prefix_four_split order by c desc;
┌───────────────────┬───────┐
│ prefix_four_split │   c   │
│      varchar      │ int64 │
├───────────────────┼───────┤
│ CCccccc1          │  3105 │
│ O=ccccc1          │  2785 │
│ COc(ccc1          │  2062 │
│ O=ccccC1          │  1779 │
│ COccccc1          │  1706 │
│ CCc(ccc1          │  1537 │
│ COC(ccc1          │  1518 │
│ CO(=ccc1          │  1394 │
│ CO(Cccc1          │  1362 │
│ O=ccc1c1          │  1066 │
│    ·              │     · │
│    ·              │     · │
│    ·              │     · │
├───────────────────┴───────┤
│  599864 rows (40 shown)   │
└───────────────────────────┘
Run Time (s): real 0.116 user 0.454655 sys 0.051427

I also tried increasing the prefix made from a sample of the SMILES string to 12 bytes – 6 bytes from the front, 6 from the back. It was not an improvement.

D update molecule set prefix_12b_two_split=concat(left(mol::TEXT, 6), right(mol::TEXT,6));

D SELECT prefix_12b_two_split, count(*) as c from molecule group by prefix_12b_two_split order by c desc;
┌──────────────────────┬───────┐
│ prefix_12b_two_split │   c   │
│       varchar        │ int64 │
├──────────────────────┼───────┤
│ COc1cc)cc1OC         │ 18373 │
│ COc1ccc2)cc1         │ 15665 │
│ COc1ccC)c1OC         │ 13882 │
│ COc1ccC2)cc1         │  9836 │
│ Cc1cccc2)cc1         │  9437 │
│ COc1cc(OC)c1         │  8243 │
│ COc1cc=O)cc1         │  7534 │
│ COc1ccn2)cc1         │  6396 │
│ Cc1cccC2)cc1         │  5658 │
│      ·               │     · │
│      ·               │     · │
│      ·               │     · │
├──────────────────────┴───────┤
│    251328 rows (40 shown)    │
└──────────────────────────────┘
Run Time (s): real 0.059 user 0.265381 sys 0.021503

At this point, I felt like I was trying to brute-force a hash function and I was feeling pretty dumb about this. So I stopped with this approach.

Can we squeeze the “count prefix” into smaller number of bytes?

The “counts” prefix currently consists of number of atoms, number of rings, number of bonds, amw, and the size of the binary molecule in bytes.

I took a look at the max number of each count in chembl to get an idea of how big of an int the prefix should support.

There is a long right tail for each of the categories.

D select median(num_atoms), max(num_atoms),median(num_rings), max(num_rings),median(num_bonds), max(num_bonds),median(amw::integer), max(amw::integer) from molecule;
┌───────────────────┬────────────────┬───────────────────┬────────────────┬───────────────────┬────────────────┬──────────────────────────────┬───────────────────────────┐
│ median(num_atoms) │ max(num_atoms) │ median(num_rings) │ max(num_rings) │ median(num_bonds) │ max(num_bonds) │ median(CAST(amw AS INTEGER)) │ max(CAST(amw AS INTEGER)) │
│      double       │     int32      │      double       │     int32      │      double       │     int32      │            double            │           int32           │
├───────────────────┼────────────────┼───────────────────┼────────────────┼───────────────────┼────────────────┼──────────────────────────────┼───────────────────────────┤
│              50.0 │           1398 │               3.0 │             85 │               5.0 │            360 │                        393.0 │                     12546 │
└───────────────────┴────────────────┴───────────────────┴────────────────┴───────────────────┴────────────────┴──────────────────────────────┴───────────────────────────┘
Run Time (s): real 0.144 user 0.224646 sys 0.080628

# and the max binary molecule size. switched to RDKit postgres to calculate this value

chembl_33=# select max(octet_length(mol_to_pkl(rdkit_mol))) from compound_structures;
  max
-------
 15762
(1 row)

If I wrote the query correctly and I’m interpreting the results correctly, 99% of the molecules for the count values would fit in a single byte – they are below 255. The amw would need a 16-bit integer, as well
the binary molecule size. The binary molecule size is important to capture correctly, because it is needed for deserialization and re-sizing the buffer.

For the other values, just putting a ceiling on the value if an incoming molecule has values bigger than the int allows would still probably suffice as a prefix filter, since we do not need to absolutely get to zero collisions – we are not trying to develop a cryptographic hash.

D select quantile_cont(num_atoms, 0.99), quantile_cont(num_bonds, 0.99), quantile_cont(num_rings, 0.99), quantile_cont(amw::integer, 0.99) from molecule;
┌────────────────────────────────┬────────────────────────────────┬────────────────────────────────┬───────────────────────────────────────────┐
│ quantile_cont(num_atoms, 0.99) │ quantile_cont(num_bonds, 0.99) │ quantile_cont(num_rings, 0.99) │ quantile_cont(CAST(amw AS INTEGER), 0.99) │
│             double             │             double             │             double             │                  double                   │
├────────────────────────────────┼────────────────────────────────┼────────────────────────────────┼───────────────────────────────────────────┤
│                          200.0 │                           37.0 │                            8.0 │                                    1446.0 │
└────────────────────────────────┴────────────────────────────────┴────────────────────────────────┴───────────────────────────────────────────┘
Run Time (s): real 0.190 user 0.182839 sys 0.164021

# and the binary molecule size

chembl_33=#  select percentile_cont(0.99) within group (order by octet_length(mol_to_pkl(rdkit_mol))) from compound_structures;
-[ RECORD 1 ]---+-----
percentile_cont | 1310

Time: 124568.271 ms (02:04.568)

What I takeaway from this is that if I want to be fairly safe and handle pretty much any molecule in the input, I can use 16-bit unsigned integers which covers a range of 0 to 65,535. This would cover up to the max values that can be found in chembl. Using 2 bytes for each field of the prefix would get the prefix down to 10 bytes – 8 bytes for the molecule properties, and 2 bytes for the binary molecule size.

But I think it can be reasonable to just cover up to 99% of the cases (according the molecules in chembl), An 8-bit unsigned integer for most of the molecular properties in the prefix would cover 99% of the cases. That would result in a 7 byte prefix. Any molecule that comes in and has a property > 255, can just be capped at 255 and put in that bin. This should have a minimal increase on the number of collisions since the median values of many of the values are far below 255, and the collision would only occur if all other fields of the prefix match another one. Seems ok, but maybe I’m wrong.

| 1 byte - number of atoms | 1 byte - number of bonds | 2 bytes - amw | 1 bytes - number of rings | 2 bytes - size of the binary molecule |

I made the 10 byte and 7 byte prefixes. For example, the serialized Umbra-mol now looks like this:

10 byte prefix: 06 00 06 00 4e 00 01 00 83 00
binary molecule: ef be ad de 00 00 00 00 0f 00 00 00 00 00 00 00 00 00 00 00 06 00 00 00 06 00 00 00 80 01 06 40 68 00 00 00 03 03 01 06 40 68 00 00 00 03 03 01 06 40 68 00 00 00 03 03 01 06 40 68 00 00 00 03 03 01 06 40 68 00 00 00 03 03 01 06 40 68 00 00 00 03 03 01 0b 00 01 68 0c 01 02 68 0c 02 03 68 0c 03 04 68 0c 04 05 68 0c 05 00 68 0c 14 01 00 00 00 06 00 05 04 03 02 01 17 04 00 00 00 00 00 00 00 16

7 byte prefix: 06 05 56 00 00 6d 00
binary molecule: ef be ad de 00 00 00 00 0f 00 00 00 00 00 00 00 00 00 00 00 06 00 00 00 05 00 00 00 80 01 06 00 60 00 00 00 01 03 06 00 60 00 00 00 02 02 06 00 60 00 00 00 02 02 06 00 60 00 00 00 02 02 06 00 60 00 00 00 02 02 06 00 60 00 00 00 01 03 0b 00 01 00 01 02 00 02 03 00 03 04 00 04 05 00 14 00 00 00 00 17 04 00 00 00 00 00 00 00 16

Then I tested it on the queries I did in my previous post. Not a well thought out benchmark, just something I did before and could roughly compare to. I may have had some background YouTube running while running queries.

The 7 and 10 byte prefix performs similarly as the larger 20 byte prefix. When the speeds seem faster, I’m not sure if it’s noise. Running it a few times seems like the values are a bit faster on average, but I don’t know if it is statistically significant, and it’s not really noticeable from a user experience point of view, I think.

Query	Standard method (s)	Umbra-mol 20-byte prefix (s)	Umbra-mol 10-byte prefix (s)	Umbra-mol 7-byte prefix (s)
1	17.238	0.496	0.311	0.363
2	12.555	0.473	0.273	0.272
3	22.196	0.364	0.592	0.648
4	12.245	0.359	0.350	0.380

Can we improve this with information that we already have in the header, like the binary molecule size?

We also have the binary molecule size in the header, but currently do not consider it when computing exact match. Here, I used the RDKit Postgres extension to get the length of the binary molecule in bytes. I then checked if we would see fewer collisions if we consider the binary size along with the other values.

The result of the query is here:

 num_rings | num_atoms | amw  | num_bonds | bmolsize | count
-----------+-----------+------+-----------+----------+-------
         4 |       193 | 1348 |        42 |     1197 |   510
         3 |        31 |  252 |         2 |      320 |    90
         3 |        41 |  323 |         4 |      378 |    87
         3 |        32 |  251 |         2 |      321 |    81
         3 |        32 |  268 |         2 |      333 |    81
         3 |        33 |  267 |         3 |      332 |    73
         5 |        44 |  386 |         2 |      450 |    73
         3 |        35 |  265 |         3 |      332 |    72
         3 |        40 |  311 |         5 |      363 |    71
         2 |        41 |  314 |         6 |      358 |    70

It looks like it can reduce some of the collisions that were in the 200-300 replicate name down to < 100. I think it would improve the short-circuit hits, and speed up the queries.

However, I’m not sure how reliable it would be to include the binary molecule size in consideration because if RDKit changes its format, the binary molecule size could change even though the molecule stays the same. The other values values are intrinsic to the molecule, but the binary format is not.

This could lead to false negatives if the binary molecule format changes and we use the new version while we have something stored in the database from a previous format. If the new format has a different size, the short-circuit code will detect that the sizes are different, and conclude the molecules are not the same, when they could be.

So, I won’t use that in the prefix.

I also considered putting a short SMILES into the prefix and that seems to help reduce collisions. The trade off would be to add an extra computation to generate the RDKit canonical SMILES and increasing the prefix size. But there’s some other ideas to try out, so I decided not to pursue that route right now.

D select * from (select num_rings,num_atoms, amw::integer, num_bonds,prefix, count(*) as count from molecule group by num_rings, num_atoms, amw::integer, num_bonds, prefix order by count desc) as g where g.count > 1;
┌───────────┬───────────┬──────────────────────┬───────────┬──────────┬───────┐
│ num_rings │ num_atoms │ CAST(amw AS INTEGER) │ num_bonds │  prefix  │ count │
│   int32   │   int32   │        int32         │   int32   │ varchar  │ int64 │
├───────────┼───────────┼──────────────────────┼───────────┼──────────┼───────┤
│         4 │       193 │                 1348 │        42 │ CSCC[C@H │   509 │
│         3 │        39 │                  308 │         4 │ COc1ccc( │    70 │
│         3 │        35 │                  282 │         3 │ COc1ccc( │    58 │
│         3 │        38 │                  296 │         4 │ COc1ccc( │    55 │
│         2 │        41 │                  314 │         6 │ COc1ccc( │    52 │
│         3 │        38 │                  309 │         4 │ COc1ccc( │    52 │
│         3 │        40 │                  311 │         5 │ COc1ccc( │    52 │
│         3 │        40 │                  324 │         5 │ COc1ccc( │    51 │
│         3 │        46 │                  356 │         6 │ COc1ccc( │    49 │
│         2 │        37 │                  284 │         5 │ COc1ccc( │    49 │
│         3 │        36 │                  281 │         3 │ COc1ccc( │    49 │
│         3 │        39 │                  312 │         4 │ COc1ccc( │    48 │
│         4 │        49 │                  388 │         5 │ COc1ccc( │    47 │
│         3 │       110 │                  623 │        25 │ CCCCCCCC │    47 │
│         3 │        38 │                  248 │         0 │ C=C1C(=O │    45 │