First, we optimized the sequences for the three most common features: C, H, and E.
Feature C, however, is already present in almost every sequence, with an average score
of 100% throughout the run; thus, only features E and H are shown. During each step (epoch),
2-5 logs were collected and the average score is shown.
Despite the fluctuation in the average score per step, we can observe a general upward trend,
indicating an increasing probability of the feature being present in the sequence.
The probability of feature H being present gradually increases from 63% to 75-85% at
steps 6-9.
The pi-helix is one of the rarest features and is found in only around 15% of known proteins.
The optimization process for this feature ended up being the hardest, with an average score of 0% for a number of epochs.
In our previous experiments, we used a score of 0.8 or higher as the threshold for adding a sequence to the dataset.
However, because this feature is so rare, we lowered the threshold to 0.1 for this experiment. In other words, if there was even a 10% chance of the sequence containing a pi-helix, we added it to the dataset.
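This selection rule is a simple probability cutoff; a minimal sketch is given below, where `classifier_probs` and `dataset` are illustrative placeholders for the classifier call and the growing training set, not our actual function names.

```python
# Minimal sketch of the selection rule described above, assuming a
# `classifier_probs` callable that returns the probability of the
# target feature; the names are illustrative, not our actual API.
FEATURE_THRESHOLD = 0.1  # lowered from 0.8 for the rare pi-helix feature

def maybe_add_to_dataset(sequence, classifier_probs, dataset,
                         threshold=FEATURE_THRESHOLD):
    """Append the generated sequence to the training set when the
    classifier assigns it at least `threshold` probability of
    containing the target feature."""
    score = classifier_probs(sequence)
    if score >= threshold:
        dataset.append(sequence)
    return score
```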
After some time, the generator starts producing sequences that exceed a 20% chance of belonging to the pi-helix class.
At that point, the dataset approaches a million sequences, which becomes problematic to keep in working memory.
Future optimization of our code might include discarding old sequences while continually adding new, better ones.
Alternatively, memory problems can be avoided by using a flexible threshold that increases once the generator outputs more successful sequences.
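A minimal sketch of both mitigations is given below; the buffer size and the adaptive-threshold parameters are illustrative assumptions rather than values used in our experiments.

```python
from collections import deque

# Sketch of the two mitigations mentioned above; the cap size and the
# adaptive-threshold parameters are illustrative assumptions.
MAX_DATASET_SIZE = 100_000  # hypothetical cap on sequences kept in memory

# (a) Fixed-size buffer: the oldest sequences are discarded automatically
# as new, better-scoring ones are appended.
dataset = deque(maxlen=MAX_DATASET_SIZE)

# (b) Flexible threshold: raise the bar once most recent sequences pass it.
def update_threshold(threshold, recent_scores, target_rate=0.5,
                     step=0.05, ceiling=0.8):
    """Increase the acceptance threshold when more than `target_rate`
    of recently generated sequences already exceed it, up to `ceiling`."""
    pass_rate = sum(s >= threshold for s in recent_scores) / len(recent_scores)
    if pass_rate > target_rate:
        threshold = min(threshold + step, ceiling)
    return threshold
```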
To see whether the network is able to learn several features at once, we performed
simultaneous enrichment for the three most common features. The initial score for feature C was already 100%,
and thus no improvement can be observed there.
While collecting more logs per epoch would smooth out the fluctuation, we can observe
an increase of 30-40% for feature E and up to 20% for feature H.
Next, we optimized for a combination of rare and common features.
Specifically, two rare features (T, G) were enriched simultaneously with a more common feature (E).
It is known that GAN models are prone to mode collapse, in which the generator simply repeats
one output over and over. In the future, we aim to add a similarity penalty to the generated sequences
(Linder et al. 2020) that would force the generator to create diverse sequences.
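One possible form of such a penalty, sketched below as an assumption rather than the exact formulation of Linder et al. (2020), is to penalize the mean pairwise cosine similarity of the soft sequences in a generated batch and add the result to the generator loss.

```python
import torch
import torch.nn.functional as F

# Sketch under our own assumptions, not the exact formulation of
# Linder et al. 2020: penalize the mean pairwise cosine similarity
# between the soft sequences in a batch.
def similarity_penalty(soft_seqs, weight=0.1):
    """soft_seqs: (batch, length, alphabet) generator output after softmax."""
    flat = F.normalize(soft_seqs.flatten(start_dim=1), dim=1)  # unit vectors
    sim = flat @ flat.T                                        # pairwise cosine similarity
    batch = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()                # drop self-similarity
    mean_sim = off_diag / (batch * (batch - 1))
    return weight * mean_sim  # higher similarity -> larger penalty added to the loss
```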
In the meantime, we used a simple edit distance metric, which measures sequence similarity,
to monitor how different the generated sequences are before and after optimization. The following graph
shows this metric for the optimization of feature H, before and after optimization.
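For reference, the metric itself is a standard Levenshtein distance averaged over pairs of generated sequences; a minimal sketch is given below, with illustrative helper names rather than the functions used in our code.

```python
# Sketch of the diversity check: the mean pairwise edit (Levenshtein)
# distance over a sample of generated sequences.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mean_pairwise_distance(sequences):
    """Average edit distance over all pairs (higher means more diverse)."""
    pairs = [(a, b) for i, a in enumerate(sequences) for b in sequences[i + 1:]]
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)
```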