Pre-Training

  1. Feedback Net Training

  2. GAN Training

Feedback Net Training

The overall problem of Q8 prediction remains a challenge in the field. A recent advance, AlphaFold, established a benchmark during CASP14 with a Global Distance Test (GDT) score of 92.4 out of 100. Given the complexity of this problem, we omit the exact positioning of the structure for the purposes of this publication and focus only on whether each structure appears in a given sequence or not.
The Feedback Network was implemented as a multi-label classifier. More specifically, the last layer of the network consists of eight nodes, each with a sigmoid activation, and the network is trained with the binary cross-entropy loss. Given the sequential, language-like nature of the problem, we implemented an architecture that combines an embedding layer with bidirectional LSTMs.
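The output layer described above can be sketched in a few lines. The following is a minimal illustration, not the actual implementation: it applies an independent sigmoid to eight logits (one per DSSP secondary-structure class, assumed here to be H, G, I, E, B, T, S, C) and computes the binary cross-entropy loss against a multi-label target. The logits and labels are hypothetical toy values.

```python
import numpy as np

def sigmoid(z):
    # squash each logit into (0, 1): one independent probability per class
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # mean BCE over all labels; eps guards against log(0)
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# toy logits for one sequence, 8 secondary-structure classes
logits = np.array([2.0, -1.5, -3.0, 1.0, -2.0, 0.5, -1.0, 2.5])
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 1], dtype=float)

probs = sigmoid(logits)
loss = binary_cross_entropy(y_true, probs)
preds = (probs >= 0.5).astype(int)  # threshold each class independently
```

Unlike a softmax output, each class probability is independent, so a sequence can be flagged as containing several structure types at once.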
While we included accuracy in the evaluation process, it is not the most informative metric for an unbalanced dataset. If we simply labeled every sequence with the majority classes C, H, and E, we would be right in most cases and obtain high accuracy, but such labeling would be useless as a feedback mechanism. Thus, precision, recall, and hinge loss were used to evaluate the success of the multi-label predictions.
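Per-class precision and recall make the class imbalance visible in a way aggregate accuracy does not. A minimal sketch of the computation, on hypothetical toy data (three of the eight classes shown for brevity):

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred):
    # y_true, y_pred: (n_samples, n_classes) binary indicator matrices
    tp = np.sum((y_pred == 1) & (y_true == 1), axis=0)
    fp = np.sum((y_pred == 1) & (y_true == 0), axis=0)
    fn = np.sum((y_pred == 0) & (y_true == 1), axis=0)
    precision = tp / np.maximum(tp + fp, 1)  # guard against division by zero
    recall = tp / np.maximum(tp + fn, 1)
    return precision, recall

# toy example: 4 sequences, 3 classes
y_true = np.array([[1, 0, 1],
                   [1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [1, 1, 1],
                   [0, 1, 0],
                   [0, 0, 1]])
p, r = per_class_precision_recall(y_true, y_pred)
```

A rare class such as pi-helix shows up here as a row with few positives, where a single missed label can collapse recall even while overall accuracy stays high.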

Feedback metrics

Overall, all metrics reached sufficiently high performance. The only problematic class was pi-helix (I), as it comprises less than 0.3% of the training dataset.

GAN Training

The Generative Adversarial Network was pre-trained on over 23,000 real sequences. The hyperparameters were chosen based on the training losses of the generator and the discriminator, a converging gradient penalty, and the resemblance of the generated outputs to the real sequences.
In particular, notice the shift in the biochemical properties of the generated sequences compared to the real ones. These properties were analyzed with protein analysis software, using metrics the model has never seen, yet the model matches almost all parameters after 20 epochs.
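One simple property comparison of this kind is amino-acid composition. The following is a standard-library sketch, not the protein analysis software used in the study, and the sequence sets are hypothetical stand-ins: it computes residue frequencies for a "real" and a "generated" set and reports the largest frequency shift between them.

```python
from collections import Counter

def aa_composition(sequences):
    # fraction of each amino-acid letter across a set of sequences
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# hypothetical toy stand-ins for real and generated sequence sets
real = ["MKTAYIAKQR", "MKLVTAYIAK"]
generated = ["MKTAYLAKQR", "MKLVTAYIAR"]

real_comp = aa_composition(real)
gen_comp = aa_composition(generated)

# largest absolute per-residue frequency shift between the two sets
shift = max(abs(real_comp.get(aa, 0) - gen_comp.get(aa, 0))
            for aa in set(real_comp) | set(gen_comp))
```

A small maximum shift indicates the generator is matching the residue distribution of the training data; the same comparison extends to other computed properties such as molecular weight or isoelectric point.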