The plight of the wheat bioinformatician: when your genome reference sequence changes more often than Taylor Swift’s boyfriend
I did my PhD at the University of Liverpool, performing mutant identification and epigenetic studies in wheat. These analyses relied heavily on bioinformatics. As I had no prior bioinformatics experience, having previously worked on identifying treatment response biomarkers for human chronic lymphocytic leukemia, it was a steep learning curve for me, particularly in my first year. However, throughout the PhD, and now in the early stages of my first postdoc, it has surprisingly not been the endless syntax error messages from the command line that have provided the biggest headache! Rather, the major source of exasperation has been working on a genome with an incomplete reference sequence that is constantly evolving.
In 2010, during the first year of my PhD, I was fortunate enough to have access to the first wheat assembly published by Brenchley et al. in Nature. This assembly was primitive compared with what we have today, with relatively short contigs, little ordering and incomplete allocation of contigs to the individual sub-genomes of wheat. However, it was the first of its kind and more than adequate for mapping analyses in wheat. It also allowed me to develop valuable skills in dealing with an incomplete genome sequence, skills which remain highly desirable as more orphan crop genomes are sequenced. While I was writing my thesis and publications, in the latter stages of my PhD, an updated wheat assembly was published by the IWGSC in Science. This assembly was an improvement on its predecessor, with longer contigs and flow-sorted chromosomes allowing clear definition of the sub-genomes of wheat. Later, a methodology known as POPSEQ would order these contigs into chromosomal pseudomolecules, simplifying and facilitating many bioinformatics analyses. This release was much welcomed by researchers, including myself, but it raised questions about any manuscripts that were near completion at the time. Would this render current work out of date? Would reviewers immediately demand the use of the new reference genome?
I saw both sides of the coin at this time: one manuscript was accepted even though it was based on an analysis using the old reference genome, while another analysis had to be largely re-worked and re-written using the new resources. It seemed as though, fairly, reviewers would only require the implementation of the new reference sequence if it would clearly have a major benefit or impact on the manuscript. Though perhaps a little soul destroying for the lead author on the paper that had to be re-analyzed and re-written!
I wondered how we were to compare studies that had been carried out using the different reference genomes. Would there be comparative studies or alignments between the references? Little did I know that these questions would be a staple throughout my subsequent postdoctoral work, which I am now undertaking at the Earlham Institute in Norwich (EI, formerly TGAC). These questions are now more relevant than ever, since we are welcoming two more reference sequence releases in quick succession, each a further incremental step towards a complete reference genome with still longer assembled contigs. As a bioinformatician, I have dedicated many an hour to translating positions from one reference to another and debating the use of one reference over another based on the requirements of the project. Looking at the two new reference offerings on release: one reference is from EI, with a largely transparent and open source method of construction, currently lacking contig order but providing gene annotations; the other is from the IWGSC, using a more secretive assembly approach (NRgene) and providing contig order but as yet no annotation. Of course we hope that the ordering for the EI assembly and the annotation for the IWGSC assembly will appear quickly (and may even already be available when this blog is published), but as a researcher you must analyze your data as quickly and efficiently as possible, and so you are inclined to use the most accessible reference at the time, i.e. the most annotated one or the one available in genome browsers. This is becoming a bit of a gamble at the project onset, particularly in 2016. I feel as though I spent my whole PhD waiting for a set of long, ordered, annotated contigs, like waiting for a bus, and now I am overwhelmed by options: all the buses have arrived at once, but which one to get on?
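For anyone curious what "translating positions from one reference to another" looks like in practice, here is a minimal Python sketch of the core idea. It is only an illustration under some generous assumptions: that you already have a set of ungapped alignment blocks between the old and new assemblies, and that each block has the same length on both references. The `Block` structure, the `lift_position` function and the contig names are all made up for this example and do not come from any particular tool or from either wheat assembly.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """One ungapped alignment block between two assemblies (hypothetical format)."""
    old_contig: str
    old_start: int   # 0-based, inclusive, on the old reference
    old_end: int     # 0-based, exclusive, on the old reference
    new_contig: str
    new_start: int   # 0-based leftmost coordinate of the block on the new reference
    strand: str      # "+" if same orientation, "-" if reverse-complemented


def lift_position(blocks: List[Block], contig: str, pos: int) -> Optional[Tuple[str, int]]:
    """Translate a 0-based position on the old reference to the new reference.

    Returns (new_contig, new_pos), or None if the position falls outside every
    block, i.e. it has no counterpart we know of in the new assembly.
    """
    for b in blocks:
        if b.old_contig == contig and b.old_start <= pos < b.old_end:
            offset = pos - b.old_start
            if b.strand == "+":
                return b.new_contig, b.new_start + offset
            # Reverse strand: the first base of the old block maps to the
            # last base of the block on the new reference.
            length = b.old_end - b.old_start
            return b.new_contig, b.new_start + (length - 1 - offset)
    return None


# Invented example blocks, purely for illustration.
blocks = [
    Block("old_contig_A", 100, 200, "new_scaffold_1", 1000, "+"),
    Block("old_contig_B", 0, 50, "new_scaffold_1", 5000, "-"),
]

print(lift_position(blocks, "old_contig_A", 150))
print(lift_position(blocks, "old_contig_B", 0))
print(lift_position(blocks, "old_contig_A", 250))
```

The last call returns `None`, which is the case that causes the real headaches: a position of interest on the old reference that simply has no aligned counterpart on the new one. Real data also bring gapped and overlapping alignments, which this sketch deliberately ignores.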
I guess in the words of Taylor Swift, “Everything has changed” but there is no “Bad Blood” and hopefully going forward less “Blank spaces!”