The dataset of protein sequences used for both Feedback and GAN pre-training was obtained from
the Protein Secondary Structure dataset on Kaggle.
The full dataset contains 95,915 samples of varying length, with up to 800 amino acids per sequence.
We used only sequences of up to 75 amino acids (i.e., 225 letters in the DNA encoding), which left over 23,000 training samples.
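The length filtering described above can be sketched as follows; the function below stands in for the real preprocessing step, and the toy list replaces the actual Kaggle CSV:

```python
# Sketch of the length-filtering step. The toy sequences below are
# placeholders for the real dataset; only the <= 75 amino-acid cutoff
# comes from the write-up.
def filter_short_sequences(sequences, max_len=75):
    """Keep only sequences with at most max_len amino acids."""
    return [s for s in sequences if len(s) <= max_len]

# Toy example: one short, one too long, one at 40 residues.
sequences = ["MKV", "A" * 80, "GAVL" * 10]
short = filter_short_sequences(sequences)  # keeps "MKV" and "GAVL" * 10
```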
To limit the number of characters the GAN has to learn, the protein samples in this dataset were translated into DNA sequences. This
translation is one-to-one (with a few exceptions that are insignificant for this task), so no information is lost
in the process (see
Approach).
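A minimal sketch of such a one-to-one translation: each amino acid is mapped to a single fixed codon, so the mapping is invertible. The specific codon chosen for each amino acid below is a valid but arbitrary representative of that amino acid's codon family, not necessarily the one used in the project:

```python
# One fixed codon per amino acid makes the protein -> DNA translation
# invertible, so no information is lost. Codon choices are arbitrary
# representatives (assumption, not taken from the write-up).
AA_TO_CODON = {
    "A": "GCT", "R": "CGT", "N": "AAT", "D": "GAT", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTT", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCT",
    "S": "TCT", "T": "ACT", "W": "TGG", "Y": "TAT", "V": "GTT",
}
CODON_TO_AA = {codon: aa for aa, codon in AA_TO_CODON.items()}

def to_dna(protein):
    """Translate an amino-acid string into a DNA string (3 letters per residue)."""
    return "".join(AA_TO_CODON[aa] for aa in protein)

def to_protein(dna):
    """Invert the translation by reading the DNA string codon by codon."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))
```

With this scheme a 75-residue protein becomes exactly a 225-letter DNA string, matching the figures quoted above.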
Overall, the dataset is unbalanced. This, however, reflects how these features are distributed in real life.
The dominant features of the dataset (C, H, E) are also the dominant features in nature.
That is, sequences readily form these structures.
The rarest features, such as the pi-helix (I), barely appear in the dataset and are biochemically very improbable.
We checked whether the presence of any feature strongly indicates the presence of others. In other words, given that X appears in a sequence, how often does Y appear as well?
In the table below, the y-axis holds the conditioning features, and the x-axis holds the features whose frequency we want to check.
As expected, C, H, and E most often appear together. The rare features show an interesting correlation as well:
although C is the most common feature in the dataset, I is still more likely to appear together with H than with C.
For our problem this means that the choice of feature combination may noticeably influence the rate of training.
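The co-occurrence check described above can be sketched as follows. The structure strings here are toy stand-ins; in the real dataset each sequence carries an 8-state secondary-structure annotation:

```python
# Sketch of the conditional co-occurrence table: for each pair (X, Y),
# the fraction of sequences containing X that also contain Y. The toy
# structure strings are placeholders (assumption), not dataset samples.
FEATURES = "CHETSGBI"  # 8-state secondary-structure codes

def cooccurrence(structures):
    """Return table[x][y] = P(Y in sequence | X in sequence)."""
    counts = {x: {y: 0 for y in FEATURES} for x in FEATURES}
    totals = {x: 0 for x in FEATURES}
    for s in structures:
        present = set(s) & set(FEATURES)
        for x in present:
            totals[x] += 1
            for y in present:
                counts[x][y] += 1
    return {
        x: {y: counts[x][y] / totals[x] for y in FEATURES}
        for x in FEATURES
        if totals[x] > 0
    }

table = cooccurrence(["CCHHHEE", "CCCC", "HHHII"])
```

Reading a row of `table` gives the conditional frequencies for one conditioning feature, which is exactly the y-axis/x-axis layout described for the table above.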