I'm having a look at this paper:
Molecular autoencoder lets us interpolate and do gradient-based optimization of compounds https://arxiv.org/pdf/1610.02415.pdf
The paper takes an input SMILES string (a text representation of a molecule) and maps it, using a variational autoencoder, into a continuous latent space.
Example SMILES string for hexan-3-ol: "CCCC(O)CC"
In the paper they pad short strings to 120 characters with spaces.
The paper encodes the string into a latent representation of the SMILES string using a stack of 1D convolutional layers.
It then uses three Gated Recurrent Unit (GRU) layers to map positions in the latent space back into a SMILES string.
The problem I have in understanding this paper is determining what the input and output structures look like.
The paper is a bit vague on input and output structure. From the use of the 1D conv nets, I suspect that the input is a vectorised representation, something like:
'C' = 1
'O' = 2
'(' = 3
')' = 4
' ' = 0 # for padding
# so the hexan-3-ol SMILES above would be
[1, 1, 1, 1, 3, 2, 4, 1, 1, 0, ...padding to fixed length]
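A minimal sketch of the integer encoding I'm describing, using only the toy character mapping above (not the paper's actual 35-character set):

```python
# Toy character-to-index mapping; the real charset in the paper has 35 entries.
char_to_int = {' ': 0, 'C': 1, 'O': 2, '(': 3, ')': 4}

def encode(smiles, max_length=120):
    # Map each character to its index, then pad with 0 (space) to max_length.
    codes = [char_to_int[c] for c in smiles]
    return codes + [0] * (max_length - len(codes))

print(encode("CCCC(O)CC")[:12])  # [1, 1, 1, 1, 3, 2, 4, 1, 1, 0, 0, 0]
```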
On the output, the paper says:
The last layer of the RNN decoder defines a probability distribution over all possible characters at each position in the SMILES string
So for the max SMILES length of 120 used in the paper, with the 35 possible SMILES characters, does that mean the output is a [120x35] array?
Carrying that logic forward, does it suggest the input is instead a flattened [120*35] array, bearing in mind it's an autoencoder?
My problem with that is the 1D convolution, which uses a kernel length of 9; that wouldn't be sufficient to reach even the next character in the sequence if the input is a flattened [120*35] array.
Thanks for your help...
The definition of SMILES is more complicated than you may expect, as it is a linear representation of a graph.
In short, a letter designates an atom, such as C = carbon and O = oxygen. The graph can be branched with parentheses, i.e. C(C)C would form a "Y" structure. Finally, cycles can be created with ring closures represented as numbers, i.e. "C1CCC1" forms a square (the two atoms marked with 1 are bonded to each other).
Note that this description is not complete, but should be a good grounding.
If a string is a valid SMILES string, simply appending it to another valid SMILES string will most often make another valid string, e.g. "C1CC1" + "C1CC1" => "C1CC1C1CC1" is valid.
Often, one can extract a linear portion of a SMILES string and "embed" it in another, and a valid SMILES string is formed.
What I believe the autoencoder is learning is how to do these transformations. A simple example would be replacing halides (chlorine, bromine, iodine) attached to the ring example above:
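The concrete example appears to have been lost from the original post; a hedged sketch of what such a halide substitution might look like in plain string space (string substitution only; a real workflow would validate each result with RDKit) is:

```python
# Hypothetical scaffold with one variable halide position; the scaffold
# string and positions are illustrative, not taken from the paper.
base = "C1CC1{X}"  # cyclopropane ring with a substituent appended

# Swapping the variable part while the constant part stays fixed:
variants = [base.format(X=halide) for halide in ["Cl", "Br", "I"]]
print(variants)  # ['C1CC1Cl', 'C1CC1Br', 'C1CC1I']
```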
The autoencoder learns the constant part and the variable part, but in linear string space. This is not perfect, and as you'll notice in the paper, when exploring the continuous latent space they need to find the closest valid SMILES string.
If you want to explore SMILES strings, all of the ones used in the paper were generated using RDKit, which, in full disclosure, I help maintain. Hopefully this helps.
You can find the source code here:
I have been playing around with it, and the input and output structures are MxN matrices, with M the max length of the SMILES string (120 in this case) and N the size of the character set. Each row is a vector of zeroes, except for a 1 at the column where the character at that row's position matches the corresponding character in the character set. To decode the output matrix back into a SMILES string, you go row by row and match each one-hot position back to your character set.
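The MxN one-hot layout described above can be sketched like this (toy charset and a short max length for readability; the paper uses the full 35-character set and M = 120):

```python
import numpy as np

charset = [' ', 'C', 'O', '(', ')']  # space first, used for padding
max_length = 12                      # the paper uses 120

def to_one_hot(smiles):
    # Build the M x N matrix: one row per position, one column per character.
    x = np.zeros((max_length, len(charset)))
    padded = smiles.ljust(max_length)  # pad with spaces to max_length
    for i, c in enumerate(padded):
        x[i, charset.index(c)] = 1
    return x

def from_one_hot(x):
    # Decode row by row: argmax gives the charset position of each character.
    return ''.join(charset[j] for j in x.argmax(axis=1)).strip()

x = to_one_hot("CCCC(O)CC")
print(x.shape)          # (12, 5)
print(from_one_hot(x))  # CCCC(O)CC
```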
A problem with this encoding is that it takes up a lot of memory. Using the Keras image-iterator approach, you can do the following:
First encode all SMILES into a 'sparse' format, which is a list of character set positions for each SMILES string in your set.
Now you have a defined character set over all SMILES strings (charset), and each SMILES is a list of numbers representing the position of each of its characters in that charset. You can then use an iterator to do the one-hot encoding on the fly while training a Keras model with the fit_generator function.
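The sparse pre-encoding step described above might look like this (toy SMILES list; the charset-building convention of putting the padding space first matches the iterator below):

```python
# Build a charset over all SMILES, space first for padding, then store
# each SMILES as a list of character positions in that charset.
smiles_list = ["CCCC(O)CC", "C1CC1"]

charset = [' '] + sorted(set(''.join(smiles_list)))
char_to_index = {c: i for i, c in enumerate(charset)}

sparse = [[char_to_index[c] for c in s] for s in smiles_list]
print(charset)  # [' ', '(', ')', '1', 'C', 'O']
print(sparse)   # [[4, 4, 4, 4, 1, 5, 2, 4, 4], [4, 3, 4, 4, 3]]
```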
```python
import numpy as np
import threading

class SmilesIterator(object):
    def __init__(self, X, charset, max_length, batch_size=256,
                 shuffle=False, seed=None):
        self.X = X
        self.charset = charset
        self.max_length = max_length
        self.N = len(X)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.batch_index = 0
        self.total_batches_seen = 0
        self.lock = threading.Lock()
        self.index_generator = self._flow_index(len(X), batch_size,
                                                shuffle, seed)

    def reset(self):
        self.batch_index = 0

    def __iter__(self):
        return self

    def _flow_index(self, N, batch_size, shuffle=False, seed=None):
        self.reset()
        while True:
            if self.batch_index == 0:
                index_array = np.arange(N)
                if shuffle:
                    if seed is not None:
                        np.random.seed(seed + self.total_batches_seen)
                    index_array = np.random.permutation(N)
            current_index = (self.batch_index * batch_size) % N
            if N >= current_index + batch_size:
                current_batch_size = batch_size
                self.batch_index += 1
            else:
                current_batch_size = N - current_index
                self.batch_index = 0
            self.total_batches_seen += 1
            yield (index_array[current_index:current_index + current_batch_size],
                   current_index, current_batch_size)

    def next(self):
        with self.lock:
            index_array, current_index, current_batch_size = next(self.index_generator)
        # One-hot encoding is not under lock and can be done in parallel.
        # Reserve room for the one-hot encoded batch:
        # (batch, max_length, charset_length)
        batch_x = np.zeros((current_batch_size, self.max_length, len(self.charset)))
        for i, j in enumerate(index_array):
            batch_x[i] = self._one_hot(self.X[j])
        return (batch_x, batch_x)  # fit_generator expects (input, target)

    def _one_hot(self, sparse_smile):
        ss = []
        counter = 0
        for s in sparse_smile:
            cur = [0] * len(self.charset)
            cur[s] = 1
            ss.append(cur)
            counter += 1
        # Pad the remaining rows up to max_length;
        # make sure space ' ' is first in the charset.
        for i in range(counter, self.max_length):
            cur = [0] * len(self.charset)
            cur[0] = 1
            ss.append(cur)
        return np.array(ss)
```