duckdb_rdkit
duckdb_rdkit
Experiments in building a cheminformatics extension for an OLAP database
Background information
Two main types of database workloads
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
Background information
Two main types of database workloads
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
- access few or single row(s) vs access many or all rows
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
Background information
Two main types of database workloads
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
- access few or single row(s) vs access many or all rows
- update amount in an account vs average account balance
- access few or single row(s) vs access many or all rows
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
Background information
Two main types of database workloads
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
- access few or single row(s) vs access many or all rows
- update amount in an account vs average account balance
- update amount in an account vs average account balance
- the techniques used in building these systems are very different:
- OLTP optimizes for finding individual rows
- OLAP optimizes for full column scans
- i.e. different strategies to indexing, storage formats, query execution
- OLTP optimizes for finding individual rows
- access few or single row(s) vs access many or all rows
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
Background information
Two main types of database workloads
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
- access few or single row(s) vs access many or all rows
- update amount in an account vs average account balance
- update amount in an account vs average account balance
- the techniques used in building these systems are very different:
- OLTP optimizes for finding individual rows
- OLAP optimizes for full column scans
- i.e. different strategies to indexing, storage formats, query execution
- OLTP optimizes for finding individual rows
- access few or single row(s) vs access many or all rows
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
duckdb is an in-process OLAP database and query execution engine
Background information
Two main types of database workloads
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
- access few or single row(s) vs access many or all rows
- update amount in an account vs average account balance
- update amount in an account vs average account balance
- the techniques used in building these systems are very different:
- OLTP optimizes for finding individual rows
- OLAP optimizes for full column scans
- i.e. different strategies to indexing, storage formats, query execution
- OLTP optimizes for finding individual rows
- access few or single row(s) vs access many or all rows
- Online Transactional Processing (OLTP) & Online Analytical Processing (OLAP)
- duckdb is an in-process OLAP database and query execution engine
- RDKit is a cheminformatics toolkit for working with molecules on computers
Implementing the duckdb_rdkit extension
Exact match and substructure match
- is this molecule the same as that molecule?
- is this fragment found in that molecule?
Implementing the duckdb_rdkit extension
Exact match and substructure match
- is this molecule the same as that molecule?
- is this fragment found in that molecule?
The Mol object
~~~graph-easy --as=boxart
[ CCO (SMILES) ] <-> [ RDKit Mol object (in-memory) ] <-> [ Serialize to binary (on-disk)]
~~~
- The RDKit Mol object contains a lot of information about the molecule for cheminformatics work
Implementing the duckdb_rdkit extension
Exact match and substructure match
- is this molecule the same as that molecule?
- is this fragment found in that molecule?
The Mol object
~~~graph-easy --as=boxart
[ CCO (SMILES) ] <-> [ RDKit Mol object (in-memory) ] <-> [ Serialize to binary (on-disk)]
~~~
- The RDKit Mol object contains a lot of information about the molecule for cheminformatics work
Initial attempt at exact match in duckdb_rdkit:
- adapted the code for comparing molecules from RDKit extensions for Postgres & SQLite
- very poor performance
Query | Standard method (s) |
---|---|
1 | 17.238 |
2 | 12.555 |
3 | 22.196 |
4 | 12.245 |
- chembl 33, ~2.3 million molecules
- default duckdb settings: AMD Ryzen 5 4500U CPU, 16GB RAM, Samsung PM991 SSD
Implementing the duckdb_rdkit extension
Improving exact match and substructure match performance
Fast forward after many experiments and iterations…
- Inspired by Umbra-style strings (Neumann, T., Freitag M. CIDR 2020)
- “Umbra Mol”
Implementing the duckdb_rdkit extension
Short-circuiting for exact match
- Examined the code for comparing molecules
bool mol_cmp(const RDKit::ROMol &m1, const RDKit::ROMol &m2) {
int res = m1.getNumAtoms() - m2.getNumAtoms();
if (res) {
return false;
}
res = m1.getNumBonds() - m2.getNumBonds();
if (res) {
return false;
}
res = int(RDKit::Descriptors::calcAMW(m1, false) -
RDKit::Descriptors::calcAMW(m2, false) + .5);
if (res) {
return false;
}
res = m1.getRingInfo()->numRings() - m2.getRingInfo()->numRings();
if (res) {
return false;
}
// more expensive checks like substructure match and creating canonical
// smiles. Omitted for brevity
}
- short-circuits with cheaper comparisons
- but requires deserialization of the binary to the in-memory RDKit Mol
Implementing the duckdb_rdkit extension
Short-circuiting for exact match
- Store the counts in a prefix in front of the binary molecule
~~~graph-easy --as=boxart
[num_atoms | num_bonds | amw | num_rings | binary rdkit molecule]
~~~
Implementing the duckdb_rdkit extension
Short-circuiting for exact match
- Store the counts in a prefix in front of the binary molecule
~~~graph-easy --as=boxart
[num_atoms | num_bonds | amw | num_rings | binary rdkit molecule]
~~~
- Compare prefix binary directly
- return false if prefixes don’t match
- no deserialization to RDKit Mol
- if prefixes match, deserialize and do the rest of the checks
- return false if prefixes don’t match
~~~graph-easy --as=boxart
[ CCO (SMILES) ] <-> [ RDKit Mol object (in-memory) ] <-> [ Serialize to binary (on-disk)]
~~~
Implementing the duckdb_rdkit extension
Short-circuiting for exact match
In initial experiments, I used 4B each
Analyzed chembl 33 to optimize size of prefix
Implementing the duckdb_rdkit extension
Short-circuiting for exact match
In initial experiments, I used 4B each
Analyzed chembl 33 to optimize size of prefix
D select quantile_cont(num_atoms, 0.99), quantile_cont(num_bonds, 0.99), quantile_cont(num_rings, 0.99), quantile_cont(amw::integer, 0.99) from molecule;
┌────────────────────────────────┬────────────────────────────────┬────────────────────────────────┬───────────────────────────────────────────┐
│ quantile_cont(num_atoms, 0.99) │ quantile_cont(num_bonds, 0.99) │ quantile_cont(num_rings, 0.99) │ quantile_cont(CAST(amw AS INTEGER), 0.99) │
│ double │ double │ double │ double │
├────────────────────────────────┼────────────────────────────────┼────────────────────────────────┼───────────────────────────────────────────┤
│ 200.0 │ 37.0 │ 8.0 │ 1446.0 │
└────────────────────────────────┴────────────────────────────────┴────────────────────────────────┴───────────────────────────────────────────┘
Run Time (s): real 0.190 user 0.182839 sys 0.164021
Implementing the duckdb_rdkit extension
Short-circuiting for exact match
In initial experiments, I used 4B each
Analyzed chembl 33 to optimize size of prefix
D select quantile_cont(num_atoms, 0.99), quantile_cont(num_bonds, 0.99), quantile_cont(num_rings, 0.99), quantile_cont(amw::integer, 0.99) from molecule;
┌────────────────────────────────┬────────────────────────────────┬────────────────────────────────┬───────────────────────────────────────────┐
│ quantile_cont(num_atoms, 0.99) │ quantile_cont(num_bonds, 0.99) │ quantile_cont(num_rings, 0.99) │ quantile_cont(CAST(amw AS INTEGER), 0.99) │
│ double │ double │ double │ double │
├────────────────────────────────┼────────────────────────────────┼────────────────────────────────┼───────────────────────────────────────────┤
│ 200.0 │ 37.0 │ 8.0 │ 1446.0 │
└────────────────────────────────┴────────────────────────────────┴────────────────────────────────┴───────────────────────────────────────────┘
Run Time (s): real 0.190 user 0.182839 sys 0.164021
in some cases, 99% of the data can be represented with a few bits
- number of atoms, 7 bits (oops - but still good: 97th percentile is 119 atoms)
- number of bonds, 6 bits
- number of rings, 3 bits
- amw, 11 bits
total: 27 bits (~4B as opposed to 20B)
Implementing the duckdb_rdkit extension
Short-circuiting for exact match
Query | Standard method (s) | Umbra-mol (s) | speedup |
---|---|---|---|
1 | 17.238 | 0.496 | 34.75x |
2 | 12.555 | 0.473 | 26.54x |
3 | 22.196 | 0.364 | 60.98x |
4 | 12.245 | 0.359 | 34.11x |
- Avoiding deserialization is very good
Implementing the duckdb_rdkit extension
Substructure filter: dalke_fp
- Substructure comparison is computationally expensive: complicated graph calculations
Implementing the duckdb_rdkit extension
Substructure filter: dalke_fp
- Substructure comparison is computationally expensive: complicated graph calculations
Substructure filter: let’s say we have a bit to mark the presence of N
- Query: NC, Target: OC – not a substructure
Implementing the duckdb_rdkit extension
Substructure filter: dalke_fp
- Substructure comparison is computationally expensive: complicated graph calculations
Substructure filter: let’s say we have a bit to mark the presence of N
- Query: NC, Target: OC – not a substructure
- Query: CC, Target: NCC – could be substructure, need full substructure test
- Query: NC, Target: OC – not a substructure
Implementing the duckdb_rdkit extension
Substructure filter: dalke_fp
- Substructure comparison is computationally expensive: complicated graph calculations
Substructure filter: let’s say we have a bit to mark the presence of N
- Query: NC, Target: OC – not a substructure
- Query: CC, Target: NCC – could be substructure, need full substructure test
- Can only bail out for the first case: If the query has the feature, but the target does not, cannot be a substructure match
- Need a good set of features
- Query: NC, Target: OC – not a substructure
Implementing the duckdb_rdkit extension
Substructure filter: dalke_fp
- Substructure comparison is computationally expensive: complicated graph calculations
Substructure filter: let’s say we have a bit to mark the presence of N
- Query: NC, Target: OC – not a substructure
- Query: CC, Target: NCC – could be substructure, need full substructure test
- Can only bail out for the first case: If the query has the feature, but the target does not, cannot be a substructure match
- Need a good set of features
- Query: NC, Target: OC – not a substructure
Andrew Dalke shared his work on developing a substructure filter (2012):
- Long story short: 55 bit fingerprint (
dalke_fp
) (1 if feature present, 0 if not) that we can use to short-circuit by checking if the bit is on in the query, but off in the target
- Long story short: 55 bit fingerprint (
Implementing the duckdb_rdkit extension
Substructure filter: dalke_fp
- Substructure comparison is computationally expensive: complicated graph calculations
Substructure filter: let’s say we have a bit to mark the presence of N
- Query: NC, Target: OC – not a substructure
- Query: CC, Target: NCC – could be substructure, need full substructure test
- Can only bail out for the first case: If the query has the feature, but the target does not, cannot be a substructure match
- Need a good set of features
- Query: NC, Target: OC – not a substructure
Andrew Dalke shared his work on developing a substructure filter (2012):
- Long story short: 55 bit fingerprint (
dalke_fp
) (1 if feature present, 0 if not) that we can use to short-circuit by checking if the bit is on in the query, but off in the target
- Long story short: 55 bit fingerprint (
- Incorporated into the prefix of Umbra Mol for speeding up substructure matches
- Details here:
- http://www.dalkescientific.com/writings/diary/archive/2012/06/11/optimizing_substructure_keys.html
- https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02078.html
Implementing the duckdb_rdkit extension
Store pointer to binary molecule instead of inlining
- During experiments with dalke_fp substructure filter, saw performance regression
- Tracked it down to the struct getting too big
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| binary rdkit molecule (chembl 33 avg: ~455B)]
~~~
Implementing the duckdb_rdkit extension
Store pointer to binary molecule instead of inlining
- During experiments with dalke_fp substructure filter, saw performance regression
- Tracked it down to the struct getting too big
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| binary rdkit molecule (chembl 33 avg: ~455B)]
~~~
- Storing a pointer to the binary molecule can shrink the struct significantly
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| pointer to binary molecule (8B -- 64 bit CPU)]
~~~
Implementing the duckdb_rdkit extension
Store pointer to binary molecule instead of inlining
- During experiments with dalke_fp substructure filter, saw performance regression
- Tracked it down to the struct getting too big
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| binary rdkit molecule (chembl 33 avg: ~455B)]
~~~
- Storing a pointer to the binary molecule can shrink the struct significantly
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| pointer to binary molecule (8B -- 64 bit CPU)]
~~~
- Pointer swizzling is required (virtual memory address vs disk offset)
- Long story short: Made it look like a
string_t
in duckdb and duckdb’s internals handles the rest
- Long story short: Made it look like a
Implementing the duckdb_rdkit extension
Umbra-Mol results
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| pointer to binary molecule (8B -- 64 bit CPU)]
~~~
Implementing the duckdb_rdkit extension
Umbra-Mol results
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| pointer to binary molecule (8B -- 64 bit CPU)]
~~~
- Default duckdb & Postgres settings
- Postgres running in docker
- gist index on molecules in Postgres
- AMD Ryzen 5 4500U CPU, 16GB RAM, Samsung PM991 SSD
Implementing the duckdb_rdkit extension
Umbra-Mol results
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| pointer to binary molecule (8B -- 64 bit CPU)]
~~~
- Default duckdb & Postgres settings
- Postgres running in docker
- gist index on molecules in Postgres
- AMD Ryzen 5 4500U CPU, 16GB RAM, Samsung PM991 SSD
Exact match
Query | Standard method (s) | Umbra-mol v2 (s) | speedup (Umbra-mol vs standard method) | Postgres control (s) |
---|---|---|---|---|
1 | 17.238 | 0.179 | 96x | 0.084 |
2 | 12.555 | 0.145 | 87x | 233 |
3 | 13.027 | 0.263 | 50x | 2.47 |
4 | 12.245 | 0.255 | 48x | 6.185 |
Implementing the duckdb_rdkit extension
Umbra-Mol results
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| pointer to binary molecule (8B -- 64 bit CPU)]
~~~
- Default duckdb & Postgres settings
- Postgres running in docker
- gist index on molecules in Postgres
- AMD Ryzen 5 4500U CPU, 16GB RAM, Samsung PM991 SSD
Exact match
Query | Standard method (s) | Umbra-mol v2 (s) | speedup (Umbra-mol vs standard method) | Postgres control (s) |
---|---|---|---|---|
1 | 17.238 | 0.179 | 96x | 0.084 |
2 | 12.555 | 0.145 | 87x | 233 |
3 | 13.027 | 0.263 | 50x | 2.47 |
4 | 12.245 | 0.255 | 48x | 6.185 |
- results in parentheses are results from a further optimization: simplified code to reduce pointer chasing. Mostly helps exact match.
Implementing the duckdb_rdkit extension
Umbra-Mol results
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| pointer to binary molecule (8B -- 64 bit CPU)]
~~~
- Default duckdb & Postgres settings
- Postgres running in docker
- gist index on molecules in Postgres
- AMD Ryzen 5 4500U CPU, 16GB RAM, Samsung PM991 SSD
Exact match
Query | Standard method (s) | Umbra-mol v2 (s) | speedup (Umbra-mol vs standard method) | Postgres control (s) |
---|---|---|---|---|
1 | 17.238 | 0.179 (0.082) | 96x (210x) | 0.084 |
2 | 12.555 | 0.145 (0.064) | 87x (196x) | 233 |
3 | 13.027 | 0.263 (0.256) | 50x (51x) | 2.47 |
4 | 12.245 | 0.255 (0.241) | 48x (51x) | 6.185 |
- results in parentheses are results from a further optimization: simplified code to reduce pointer chasing. Mostly helps exact match.
Implementing the duckdb_rdkit extension
Umbra-Mol results
~~~graph-easy --as=boxart
[counts (4B) | dalke_fp (8B)| pointer to binary molecule (8B -- 64 bit CPU)]
~~~
- Default duckdb & Postgres settings
- Postgres running in docker
- gist index on molecules in Postgres
- AMD Ryzen 5 4500U CPU, 16GB RAM, Samsung PM991 SSD
Exact match
Query | Standard method (s) | Umbra-mol v2 (s) | speedup (Umbra-mol vs standard method) | Postgres control (s) |
---|---|---|---|---|
1 | 17.238 | 0.179 (0.082) | 96x (210x) | 0.084 |
2 | 12.555 | 0.145 (0.064) | 87x (196x) | 233 |
3 | 13.027 | 0.263 (0.256) | 50x (51x) | 2.47 |
4 | 12.245 | 0.255 (0.241) | 48x (51x) | 6.185 |
- results in parentheses are results from a further optimization: simplified code to reduce pointer chasing. Mostly helps exact match.
Substructure match
Query | Standard method (s) | Umbra-mol v2 (s) | speedup (Umbra-mol vs standard method) | Postgres control (s) |
---|---|---|---|---|
1 | 23.388 | 0.267 | 88x | 0.741 |
2 | 14.094 | 5.93 | 2x | 98 |
3 | 14.294 | 0.553 | 26x | 12.114 |
4 | 13.994 | 6.804 (2.352) | 2x (6x) | 1237 (20 min) |
Wrapping up
- Not a comprehensive benchmark. It’s just 4 queries that I made up
- Not saying this is better than Postgres + RDKit
- type of workload is important (i.e. updating individual record)
- type of workload is important (i.e. updating individual record)
- Currently, not many features, and very experimental
- github.com/bodowd/duckdb_rdkit
Acknowledgments
- Thomas Leyer
- Fuad Abdallah
- Andrew Dalke
bing.odowd@bayer.com
Supplementary slides
Exact match queries
Query 1
-- umbra mol
select molregno,canonical_smiles, rdkit_mol, umbra_mol from molecule where umbra_is_exact_match(umbra_mol,'Cc1cn([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)c(=O)[nH]c1=O');
-- standard
select molregno,canonical_smiles, rdkit_mol, umbra_mol from molecule where is_exact_match(rdkit_mol,'Cc1cn([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)c(=O)[nH]c1=O');
-- postgres
select molregno,canonical_smiles, rdkit_mol from compound_structures where rdkit_mol@='Cc1cn([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)c(=O)[nH]c1=O';
Supplementary slides
Exact match queries
Query 2
-- umbra mol
select molregno,canonical_smiles, rdkit_mol, umbra_mol from molecule where umbra_is_exact_match(umbra_mol,'CCC');
-- standard
select molregno,canonical_smiles, rdkit_mol, umbra_mol from molecule where is_exact_match(rdkit_mol,'CCC');
-- postgres
select molregno,canonical_smiles, rdkit_mol from compound_structures where rdkit_mol@='CCC';
Supplementary slides
Exact match queries
Query 3
-- umbra mol
SELECT pbd.prediction_method, a.value, a.units, a.type, a.relation, m.umbra_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE umbra_is_exact_match(m.umbra_mol, 'COc1cc(/C=C/C(=O)OCCCCCCN(C)CCCCOC(=O)c2c3ccccc3cc3ccccc23)cc(OC)c1OC');
-- standard
SELECT pbd.prediction_method, a.value, a.units, a.type, a.relation, m.rdkit_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE is_exact_match(m.rdkit_mol, 'COc1cc(/C=C/C(=O)OCCCCCCN(C)CCCCOC(=O)c2c3ccccc3cc3ccccc23)cc(OC)c1OC');
-- postgres
SELECT pbd.prediction_method, a.value, a.units, a.type, a.relation, m.rdkit_mol FROM compound_structures m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE m.rdkit_mol@='COc1cc(/C=C/C(=O)OCCCCCCN(C)CCCCOC(=O)c2c3ccccc3cc3ccccc23)cc(OC)c1OC';
Supplementary slides
Exact match queries
Query 4
-- umbra mol
SELECT avg(a.value), stddev(a.value), a.units, a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.umbra_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE umbra_is_exact_match(m.umbra_mol, 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
GROUP BY m.umbra_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;
-- standard
SELECT avg(a.value), stddev(a.value), a.units, a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.rdkit_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE is_exact_match(m.rdkit_mol, 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
GROUP BY m.rdkit_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;
-- postgres
SELECT avg(a.value), stddev(a.value), a.units, a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.rdkit_mol FROM compound_structures m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE m.rdkit_mol@='CC(=O)Nc1nnc(S(N)(=O)=O)s1'
GROUP BY m.rdkit_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;
Supplementary slides
Substructure match queries
Query 1
-- umbra mol
SELECT pbd.prediction_method, a.value, a.units, a.type, a.relation, m.umbra_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE umbra_is_substruct(m.umbra_mol, 'COc1cc2c(Nc3cc(CC(=O)Nc4cccc(F)c4F)[nH]n3)ncnc2cc1OCCCN(CCO)CC(C)C');
-- standard
SELECT pbd.prediction_method, a.value, a.units, a.type, a.relation, m.rdkit_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE is_substruct(m.rdkit_mol, 'COc1cc2c(Nc3cc(CC(=O)Nc4cccc(F)c4F)[nH]n3)ncnc2cc1OCCCN(CCO)CC(C)C');
-- postgres
SELECT pbd.prediction_method, a.value, a.units, a.type, a.relation, m.rdkit_mol FROM compound_structures m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE m.rdkit_mol@='COc1cc2c(Nc3cc(CC(=O)Nc4cccc(F)c4F)[nH]n3)ncnc2cc1OCCCN(CCO)CC(C)C';
Supplementary slides
Substructure match queries
Query 2
-- umbra mol
SELECT count(*) FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE umbra_is_substruct(m.umbra_mol, 'O=CNCCc1ccccc1');
-- standard
SELECT count(*) FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE is_substruct(m.rdkit_mol, 'O=CNCCc1ccccc1');
-- postgres
SELECT count(*) FROM compound_structures m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
WHERE m.rdkit_mol@>'O=CNCCc1ccccc1';
Supplementary slides
Substructure match queries
Query 3
-- umbra mol
SELECT avg(a.value), stddev(a.value), a.units,a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.umbra_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE umbra_is_substruct(m.umbra_mol, 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
GROUP BY m.umbra_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;
-- normal
SELECT avg(a.value), stddev(a.value), a.units,a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.rdkit_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE is_substruct(m.rdkit_mol, 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')
GROUP BY m.rdkit_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;
-- postgres
SELECT avg(a.value), stddev(a.value), a.units,a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.rdkit_mol FROM compound_structures m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE m.rdkit_mol@>'CC(=O)Nc1nnc(S(N)(=O)=O)s1'
GROUP BY m.rdkit_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;
Supplementary slides
Substructure match queries
Query 4
-- umbra mol
SELECT avg(a.value), stddev(a.value), a.units, a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.umbra_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE umbra_is_substruct(m.umbra_mol, 'N1C=CC=N1')
GROUP BY m.umbra_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;
-- standard
SELECT avg(a.value), stddev(a.value), a.units, a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.rdkit_mol FROM molecule m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE is_substruct(m.rdkit_mol, 'N1C=CC=N1')
GROUP BY m.rdkit_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;
-- postgres
SELECT avg(a.value), stddev(a.value), a.units, a.type, count(a.value), a.relation, bs.site_name, ys.assay_organism, m.rdkit_mol FROM compound_structures m
INNER JOIN activities a ON a.molregno=m.molregno
INNER JOIN predicted_binding_domains pbd ON pbd.activity_id=a.activity_id
INNER JOIN binding_sites bs ON pbd.site_id=bs.tid
INNER JOIN assays ys ON ys.tid=bs.tid
WHERE m.rdkit_mol@>'N1C=CC=N1'
GROUP BY m.rdkit_mol, a.relation, a.units, a.type, bs.site_name, ys.assay_organism;