Well, depending on what sort of dataset you need for your research, today is your lucky day (actually, late last week was your lucky day, but I’m buried in work and way behind).
Download links for lots of apple genome related data are now available from GDR: The Genome Database for Rosaceae.
One thing you’ll be surprised not to find is a link to download the pseudomolecules. (For those reading this who don’t have to work with genomic data, pseudomolecules are continuous DNA sequences of entire chromosomes. Each chromosome is technically a single, incredibly complex molecule, not unlike water or carbon dioxide.) From GDR:
Because of the heterozygous nature of the Malus x domestica genome, pseudomolecules corresponding to the chromosomes are not presented. To preserve haplotype information, this assembly consists of sets of assembled contigs organized into metacontigs (clusters) and anchored to the chromosomes using 1,643 genetic markers. There is no consensus sequence from assembled contigs. Please see the whole genome publication in Nature Genetics for further details.
Now there is a GFF file that lists the locations and orientations of all the genes on these (non-existent) pseudomolecules, as well as the locations and orientations of the sequence contigs (smaller pieces of DNA which are available for download). So in theory it should be possible to reassemble the smaller contig-sized DNA sequences into pseudomolecules….
…I don’t recommend trying it. I’ve spent a number of hours over the last few days trying to do exactly that, and the sad fact of the matter is that overlapping contigs have significant differences in their sequences, so there is no simple way to squash them together, even if you don’t care about heterozygosity data. The best I’ve managed was a homemade set of pseudomolecules where 24% of the annotated apple genes had correct start and stop codons. (The coding portion of a gene starts with the three-letter codon “ATG” and ends with one of three codons: “TGA”, “TAA” or “TAG”. When more than three-quarters of the genes annotated on the genome aren’t following that rule, it means there is something very wrong with your annotations, your sequence, or both.)
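For anyone who wants to run the same sanity check on their own attempt, here is a minimal Python sketch of the start/stop codon test described above. It is an illustration only, not the script behind the 24% figure. The function names are hypothetical, and it assumes each gene is given as the 1-based, inclusive start and end of its coding region plus a strand, in the style of GFF coordinates. Real GFF gene features usually include UTRs and introns, so you would need to pull the coordinates from the CDS features first.

```python
# Sketch of a start/stop codon sanity check (all names are hypothetical).
STOP_CODONS = {"TGA", "TAA", "TAG"}
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(chrom_seq, start, end, strand):
    """Cut out a region using 1-based, inclusive coordinates.

    Regions on the "-" strand are reverse complemented so the
    result always reads in the direction the gene is transcribed.
    """
    region = chrom_seq[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


def has_valid_ends(coding_seq):
    """True if the sequence starts with ATG and ends with a stop codon."""
    coding_seq = coding_seq.upper()
    return (
        len(coding_seq) >= 6
        and coding_seq[:3] == "ATG"
        and coding_seq[-3:] in STOP_CODONS
    )


def fraction_valid(chrom_seq, regions):
    """Fraction of (start, end, strand) regions with valid start/stop codons."""
    if not regions:
        return 0.0
    valid = sum(
        has_valid_ends(extract_region(chrom_seq, start, end, strand))
        for start, end, strand in regions
    )
    return valid / len(regions)


# Toy example: one forward-strand gene and one reverse-strand gene.
toy_chrom = "CC" + "ATGAAATAG" + "GG" + reverse_complement("ATGCCCTGA") + "TT"
toy_regions = [
    (3, 11, "+"),   # forward-strand gene, correctly placed
    (14, 22, "-"),  # reverse-strand gene, correctly placed
    (1, 9, "+"),    # coordinates shifted by two bases
    (3, 11, "-"),   # right coordinates, wrong strand
]
print(fraction_valid(toy_chrom, toy_regions))
```

In the toy data, two of the four regions pass, so a badly placed or wrongly oriented annotation shows up straight away as a drop in that fraction.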
So have fun with the apple genome, but don’t expect the lack of pseudomolecules to be easily overcome. But if, on the other hand, you DO manage to overcome it, drop me a line using the new “Contact me” tab at the top of the website, and let me know how you managed it!
Thanks to Aaron for pointing this new sequence data out to me.