duckdb_rdkit: RDKit extension in duckdb

duckdb_rdkit: RDKit extension in duckdb


Very brief intro to OLTP and OLAP

Online Transactional Processing (OLTP)

Online Analytical Processing (OLAP)


Very brief intro to OLTP and OLAP


duckdb


duckdb

import duckdb
duckdb.sql("SELECT standard_value, standard_units FROM '../chembl_33_files/activities.parquet' WHERE standard_value < 10 AND standard_units = 'nM'")

┌────────────────┬────────────────┐
 standard_value  standard_units 
     double         varchar     
├────────────────┼────────────────┤
            6.0  nM             
            2.0  nM             
            2.0  nM             
             ·   ·              
             ·   ·              
             ·   ·              
            6.0  nM             
            0.5  nM             
            2.7  nM             
├────────────────┴────────────────┤
  ? rows (>9999 rows, 6 shown)   
└─────────────────────────────────┘

Cheminformatics workloads and RDKit


Implementation of duckdb_rdkit


Implementation of duckdb_rdkit

Registering the Mol type in duckdb
~~~graph-easy --as=boxart
[ CCO (SMILES) ]  <-> [ RDKit Mol object (in-memory) ]  <-> [ Serialize to binary (on-disk)]
~~~
Implement basic molecule format functions

Implementing is_exact_match

Cc1ccccc1 is toluene

c1ccccc1C is toluene

Implementing is_exact_match

QueryStandard method (s)
117.238
212.555
322.196
412.245

Implementing is_exact_match

Initial attempt

Implementing is_exact_match: Umbra Mol

~~~graph-easy --as=boxart
[length (4B) | prefix (4B) | offset or pointer (8B)]
~~~

Implementing is_exact_match: Umbra Mol

~~~graph-easy --as=boxart
[length (4B) | prefix (4B) | offset or pointer (8B)]
~~~

Implementing is_exact_match: Umbra Mol

bool mol_cmp(const RDKit::ROMol &m1, const RDKit::ROMol &m2) {
  int res = m1.getNumAtoms() - m2.getNumAtoms();

  if (res) {
    return false;
  }

  res = m1.getNumBonds() - m2.getNumBonds();
  if (res) {
    return false;
  }

  res = int(RDKit::Descriptors::calcAMW(m1, false) -
            RDKit::Descriptors::calcAMW(m2, false) + .5);
  if (res) {
    return false;
  }

  res = m1.getRingInfo()->numRings() - m2.getRingInfo()->numRings();
  if (res) {
    return false;
  }

  // if m1 is substruct of m2 and m2 is substruct of m1, likely to be the same
  // molecule
  RDKit::MatchVectType matchVect;
  bool recursion_possible = false;
  bool do_chiral_match = false;
  bool ss1 = RDKit::SubstructMatch(m1, m2, matchVect, recursion_possible,
                                   do_chiral_match);
  bool ss2 = RDKit::SubstructMatch(m2, m1, matchVect, recursion_possible,
                                   do_chiral_match);
  if (ss1 && !ss2) {
    return false;
  } else if (!ss1 && ss2) {
    return false;
  }

  // the above can still fail in some chirality cases
  std::string smi1 = RDKit::MolToSmiles(m1, do_chiral_match);
  std::string smi2 = RDKit::MolToSmiles(m2, do_chiral_match);
  return smi1 == smi2;
}

Implementing is_exact_match: Umbra Mol

~~~graph-easy --as=boxart
[Number of atoms | Number of bonds| ... | binary RDKit molecule]
~~~
struct umbra_mol_t {
  uint32_t num_atoms;
  uint32_t num_bonds;
  uint32_t amw;
  uint32_t num_rings;
  uint32_t bmol_size;
  std::string bmol;
}


bool umbra_mol_cmp(umbra_mol_t m1, umbra_mol_t m2) {
  // check the prefix
  // if any of these values are not equal between the two molecules,
  // there is no way the molecules are the same
  if (m1.num_atoms != m2.num_atoms || m1.num_bonds != m2.num_bonds ||
      m1.amw != m2.amw || m1.num_rings != m2.num_rings) {
    return false;
  }

  // otherwise, run a full check on the molecule objects
  std::unique_ptr<RDKit::ROMol> left_mol(new RDKit::ROMol());
  std::unique_ptr<RDKit::ROMol> right_mol(new RDKit::ROMol());


  // deserialization step
  RDKit::MolPickler::molFromPickle(m1.bmol, *left_mol);
  RDKit::MolPickler::molFromPickle(m2.bmol, *right_mol);
  return mol_cmp(*left_mol, *right_mol);
}


Implementing is_exact_match: Umbra Mol

Umbra Mol results
QueryStandard method (s)Umbra-mol (s)speedup
117.2380.49634.75x
212.5550.47326.54x
322.1960.36460.98x
412.2450.35934.11x

Umbra Mol version 2


Umbra Mol version 2

improving the “count” prefix
D select quantile_cont(num_atoms, 0.99), quantile_cont(num_bonds, 0.99), quantile_cont(num_rings, 0.99), quantile_cont(amw::integer, 0.99) from molecule;
┌────────────────────────────────┬────────────────────────────────┬────────────────────────────────┬───────────────────────────────────────────┐
│ quantile_cont(num_atoms, 0.99) │ quantile_cont(num_bonds, 0.99) │ quantile_cont(num_rings, 0.99) │ quantile_cont(CAST(amw AS INTEGER), 0.99) │
│             double             │             double             │             double             │                  double                   │
├────────────────────────────────┼────────────────────────────────┼────────────────────────────────┼───────────────────────────────────────────┤
│                          200.0 │                           37.0 │                            8.0 │                                    1446.0 │
└────────────────────────────────┴────────────────────────────────┴────────────────────────────────┴───────────────────────────────────────────┘
Run Time (s): real 0.190 user 0.182839 sys 0.164021

# and the binary molecule size

chembl_33=#  select percentile_cont(0.99) within group (order by octet_length(mol_to_pkl(rdkit_mol))) from compound_structures;
-[ RECORD 1 ]---+-----
percentile_cont | 1310

Time: 124568.271 ms (02:04.568)

Umbra Mol version 2

improving the “count” prefix

Using the chembl analysis as a guide, here is how the new counts prefix looks like:

total: 27 bits (~4B as opposed to 20B)


Umbra Mol version 2

dalke_fp as a substructure filter

Umbra Mol version 2

dalke_fp as a substructure filter
                        all mols

                has O               no O

         has N     no N        has CC   no CC


Umbra Mol version 2

dalke_fp as a substructure filter

(Sorry: it’s a tree but it fell over)

~~~graph-easy --as=boxart
[all molecules] -> [has O]
[all molecules] -> [does not have O]
[has O] -> [has N]
[has O] -> [does not have N]
[does not have O] -> [has Cl]
[does not have O] -> [does not have Cl]
[has Cl] -> [...]
[does not have Cl] -> [and so on ...]
~~~

Umbra Mol version 2

dalke_fp as a substructure filter

Umbra Mol version 2

dalke_fp as a substructure filter

The prefix in UmbraMol now looks like this:

~~~graph-easy --as=boxart
[count prefix (4B) | dalke_fp (8B)|... ]
~~~


Umbra Mol version 2

storing a pointer to binary molecule instead of inlining

Umbra Mol version 2

storing a pointer to binary molecule instead of inlining

Umbra Mol version 2

storing a pointer to binary molecule instead of inlining

Umbra Mol version 2

storing a pointer to binary molecule instead of inlining
~~~graph-easy --as=boxart
[Umbra Mol ] -> [string_t] -> [duckdb internal pointer swizzling, etc.] -> [disk]
~~~
~~~graph-easy --as=boxart
[disk] -> [duckdb internal pointer swizzling, etc.] -> [string_t] -> [Umbra Mol] -> [cheminformatics stuff]
~~~

Umbra Mol version 2

storing a pointer to binary molecule instead of inlining

  std::string buffer;
  // total size = prefix size + binary molecule size
  buffer.reserve(total_size);
  // the counts prefix - 4 bytes
  buffer.append(reinterpret_cast<const char *>(&prefix),
                umbra_mol_t::COUNT_PREFIX_BYTES);
  // dalke_fp part of the prefix - 8 btyes
  buffer.append(reinterpret_cast<const char *>(&dalke_fp),
                umbra_mol_t::DALKE_FP_PREFIX_BYTES);
  // the binary molecule from RDKit
  buffer.append(binary_mol);


Umbra Mol version 2

storing a pointer to binary molecule instead of inlining
struct {
    uint32_t length;
    char prefix[4];
    char *ptr;
} pointer;
  struct {
    uint32_t length;
    char prefix[PREFIX_LENGTH];
    const char *ptr;
  } value;


Umbra Mol version 2

storing a pointer to binary molecule instead of inlining
  // This constructor is used to convert a `string_t` to an `umbra_mol_t`:
  umbra_mol_t(string_t buffer) {
    value.length = buffer.GetSize();
    // first 12 bytes are the prefix
    memset(value.prefix, 0, PREFIX_LENGTH);
    memcpy(&value.prefix, buffer.GetData(), PREFIX_LENGTH);
    // string_t::GetData() gets the pointer to the string
    value.ptr = buffer.GetData();

    D_ASSERT(value.ptr == buffer.GetData());
  }


Umbra Mol version 2


Umbra Mol version 2 experiments

Exact match
QueryStandard method (s)Umbra-mol part 2 (s)speedup (Umbra-mol vs standard method)Postgres control (s)
117.2380.17996x0.084
212.5550.14587x233
313.0270.26350x2.47
412.2450.25548x6.185
Substructure match
QueryStandard method (s)Umbra-mol part 2 (s)speedup (Umbra-mol vs standard method)Postgres control (s)
123.3880.26788x0.741
214.0945.932x98
314.2940.55326x12.114
413.9946.8042x1237 (20 min)