Treating biological sequences as natural language, a case study on sub-cellular protein localization
Extracting meaning from biological sequences such as DNA, RNA, and chains of amino acids is a task that traditionally requires a large amount of expert knowledge. Progress in these areas is slow because of the computational intractability inherent in biological sequences. If the level of expertise needed to solve important problems in biology could be lowered or removed, the pace of biological breakthroughs might increase. As a small step in this direction, this thesis focuses on the challenge of sub-cellular protein localization. By treating sub-cellular protein localization as a Natural Language Processing task, the need for any biological understanding can be removed entirely. The method requires no hand-engineered features and operates at character-level granularity. Modifications are made to an existing deep convolutional network that was designed to perform a range of Natural Language Processing tasks such as sentiment analysis and topic classification. While this model does not achieve state-of-the-art performance, it is competitive with the other models evaluated in this thesis. These findings are encouraging for two reasons. First, they show that a totally biologically naive method performs competitively with hand-engineered methods. Second, it is hoped that the current intense research focus on Natural Language Processing within deep learning will greatly increase the viability of the method presented in this thesis in the coming years.
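To make the idea of character-level input concrete, the following is a minimal sketch of how a protein sequence could be turned into a one-hot matrix without any hand-engineered features. It is an illustration under stated assumptions, not the preprocessing used in this thesis: the 20-letter alphabet, the `max_len` padding length, and the names `AMINO_ACIDS`, `CHAR_TO_INDEX`, and `one_hot_encode` are all hypothetical.

```python
# Assumed alphabet: the 20 standard amino acids in one-letter code.
# The alphabet actually used by the thesis may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len=1000):
    """Encode a protein sequence as a max_len x 20 one-hot matrix.

    Each residue is treated as a single character. Sequences longer than
    max_len are truncated; shorter ones are padded with all-zero rows.
    Characters outside the assumed alphabet also become all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, ch in enumerate(sequence.upper()[:max_len]):
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


# Usage: a short, made-up peptide padded to 12 positions.
encoded = one_hot_encode("MKTAYIAKQR", max_len=12)
```

A matrix of this kind is the sort of input a character-level convolutional network consumes, in the same way such a network would consume the characters of a sentence in a text classification task.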