Discrepancies in BUSCO results between scaffold-scale and chromosome-scale assemblies used as reference
December 2019
This study lists discrepancies in BUSCO results obtained on the Perca flavescens genome assembly before and after scaffolding using Hi-C data.
- Genome assembly metrics
- BUSCO command lines and result summaries
- BUSCO metrics comparison
- Available files
Genome assembly metrics
Here are some statistics about the perch genome assembly before and after the last scaffolding step.
Before Hi-C scaffolding After Hi-C scaffolding Number of contigs 879 Number of scaffolds 269 Total size of contigs 877,025,633 Total size of scaffolds 877,440,133 Longest contig 18,280,501 Longest scaffold 44,580,961 Shortest contig 160 Shortest scaffold 160 N50 contig length 4,304,620 N50 scaffold length 37,412,490 L50 contig count 60 L50 scaffold count 11 N90 contig length 590 N90 scaffold length 31,135,000 L90 contig count 870 L90 scaffold count 21The total assembly size difference is due to the length of the spacers introduced between contigs. After the Hi-C scaffolding step, the perch genome is assembled into 24 chromosome-scale scaffolds with only 1.22% of the total genome which still unanchored into those 24 chromosome-scale scaffolds. (see the Hi-C map).
BUSCO command lines and result summaries
BUSCO v3.0.2 was run on both genome assembly versions using the following command lines.
python $BUSCO-DIRECTORY/scripts/run_BUSCO.py -i Perca_flavescens_contigs.fa -o busco_before_hiC -l $BUSCO-DIRECTORY/datasets/actinopterygii_odb9 -m genome -c 6 --limit 10 -sp zebrafish python $BUSCO-DIRECTORY/scripts/run_BUSCO.py -i Perca_flavescens_scaffolds.fa -o busco_after_hi-C -l $BUSCO-DIRECTORY/datasets/actinopterygii_odb9 -m genome -c 6 --limit 10 -sp zebrafish
Resulting BUSCO metrics
Before Hi-C scaffolding After Hi-C scaffolding C:97.8%[S:95.4%,D:2.4%],F:1.0%,M:1.2%,n:4584 C:97.6%[S:95.2%,D:2.4%],F:0.9%,M:1.5%,n:4584 4482 Complete BUSCOs (C) 4472 Complete BUSCOs (C) 4371 Complete and single-copy BUSCOs (S) 4363 Complete and single-copy BUSCOs (S) 111 Complete and duplicated BUSCOs (D) 109 Complete and duplicated BUSCOs (D) 47 Fragmented BUSCOs (F) 41 Fragmented BUSCOs (F) 55 Missing BUSCOs (M) 71 Missing BUSCOs (M) 4584 Total BUSCO groups searched 4584 Total BUSCO groups searched
BUSCO metrics comparison
Results obtained before and after scaffolding are of course very close but the metrics degradation (except the fragmented BUSCOs) puzzled us. We focused our attention on the missing BUSCO gene lists and we observed that among the 55 before scaffolding missing BUSCO genes and the 71 missing after scaffolding, only 40 were common. 15 BUSCO genes are missing before scaffolding and found after and more surprisingly 31 BUSCO genes are found before scaffolding and missing after.
BUSCO genes found before and missing after scaffolding
Why are some BUSCO genes no longer found in the assembly after scaffolding? To answer this question, we suspected that some contigs may have been fragmented during the scaffolding process. Indeed, the 3D-DNA pipeline used to scaffold the contigs performs a misjoin detection and correction step which could lead to contig fragmentation. After analysis, it appears that none of the 31 genomic regions supporting the BUSCOs before scaffolding (according to the full table tsv file) were concerned by the 3D-DNA misjoin correction step. This means that the 31 genomic regions supporting the BUSCOs before scaffolding are integrally conserved in the assembly after scaffolding. We therefore analyzed the steps performed by BUSCO to search for BUSCO-matching genes in the genome and noted that the 31 genomic regions supporting BUSCOs before scaffolding were not reported in the tBLASTn results filescoordinates_busco_after_hi-C.tsv
or coordinates_busco_after_hi-C_missing_and_frag_rerun.tsv
of
alignments performed on the assembly after scaffolding.
Example with EOG090C00C9
Final results for EOG090C00C9
grep EOG090C00C9 run_busco*/full_table* run_busco_after_hi-C/full_table_busco_after_hi-C.tsv:EOG090C00C9 Missing run_busco_before_hi-C/full_table_busco_before_hiC.tsv:EOG090C00C9 Complete utg149_pilon_pilon_pilon 6327270 6414496 2009.5 1501
EOG090C00C9
is found in the assembly before scaffolding in the contig utg149_pilon_pilon_pilon
but missing in the assembly after scaffolding.
tBLASTn results for EOG090C00C9
grep EOG090C00C9 run_busco*/blast_output/coordinates_busco_*_hi-C.tsv run_busco_after_hi-C/blast_output/coordinates_busco_after_hi-C.tsv:EOG090C00C9 HiC_scaffold_8 17788149 17862567 run_busco_before_hi-C/blast_output/coordinates_busco_before_hi-C.tsv:EOG090C00C9 utg149_pilon_pilon_pilon 6327270 6430002
In the assembly before scaffolding, we recover the genomic region reported in the final result table. In the assembly
after scaffolding, we see that the reported genomic region belongs to the chromosome-scale scaffold
HiC_scaffold_8
. But if we look for the contig utg149_pilon_pilon_pilon
in our scaffolding assembly file,
we see that the contig does not belong to this scaffold but belongs to HiC_scaffold_4
.
grep utg149_pilon_pilon_pilon Perca_flavescens.final.assembly.tsv HiC_scaffold_4 utg149_pilon_pilon_pilon - 12004819 21323521
Finally, the genomic region HiC_scaffold_8:17788149-17862567
reported after the tBLASTn does not match with the
EOG090C00C9
BUSCO-matching gene. This could have been different if the procedure dedicated to select the best
candidate genomic region did not chose this HiC_scaffold_8
region. If we look at the tBLASTn results against the
assembly after scaffolding, we recover the same HSPs than those allowing the identification of EOG090C00C9
in the
contig utg149_pilon_pilon_pilon
from the assembly before scaffolding.
tBLASTn hits for EOG090C00C9
Here is the set of HSPs leading to the selection of the candidate region utg149_pilon_pilon_pilon:6327270-6430002
from the assembly before scaffolding.
cat run_busco_before_hi-C/blast_output/tblastn_busco_before_hi-C.tsv | awk '$1=="EOG090C00C9"&&$2=="utg149_pilon_pilon_pilon"' EOG090C00C9 utg149_pilon_pilon_pilon 51.68 238 102 3 1 228 6412462 6411758 7e-62 239 EOG090C00C9 utg149_pilon_pilon_pilon 97.18 71 2 0 1358 1428 6347747 6347535 4e-34 148 EOG090C00C9 utg149_pilon_pilon_pilon 65.42 107 28 2 967 1072 6358582 6358286 1e-28 130 EOG090C00C9 utg149_pilon_pilon_pilon 83.93 56 9 0 1301 1356 6348316 6348149 1e-20 104 EOG090C00C9 utg149_pilon_pilon_pilon 76.27 59 14 0 1242 1300 6349143 6348967 1e-19 100 EOG090C00C9 utg149_pilon_pilon_pilon 92.00 50 4 0 1067 1116 6355164 6355015 4e-19 99.4 EOG090C00C9 utg149_pilon_pilon_pilon 64.71 68 21 1 391 455 6377522 6377319 7e-18 95.5 EOG090C00C9 utg149_pilon_pilon_pilon 63.16 57 21 0 917 973 6359890 6359720 2e-14 84.3 EOG090C00C9 utg149_pilon_pilon_pilon 65.38 52 18 0 1428 1479 6344965 6344810 8e-14 82.0 EOG090C00C9 utg149_pilon_pilon_pilon 67.92 53 17 0 1117 1169 6354020 6353862 2e-13 80.9 EOG090C00C9 utg149_pilon_pilon_pilon 48.75 80 34 2 721 793 6362060 6361821 3e-13 80.1 EOG090C00C9 utg149_pilon_pilon_pilon 70.13 77 23 0 844 920 6360339 6360109 2e-12 77.4 EOG090C00C9 utg149_pilon_pilon_pilon 63.46 52 19 0 791 842 6360970 6360815 2e-12 77.4 EOG090C00C9 utg149_pilon_pilon_pilon 83.33 36 6 0 1170 1205 6352090 6351983 2e-11 73.9 EOG090C00C9 utg149_pilon_pilon_pilon 55.10 49 19 1 346 391 6380863 6380717 2e-06 57.4 EOG090C00C9 utg149_pilon_pilon_pilon 43.94 66 21 4 490 539 6375514 6375317 2e-04 51.2 EOG090C00C9 utg149_pilon_pilon_pilon 50.98 51 20 2 539 584 6373181 6373029 2e-04 50.8
In the assembly after scaffolding, we recover the "same set of HSPs" in the chromosome-scale scaffold HiC_scaffold_4
but the HiC_scaffold_4:14911060-14950493
genomic region is not considered as the best candidate.
cat run_busco_after_hi-C/blast_output/tblastn_busco_after_hi-C.tsv| awk '$1=="EOG090C00C9" && $2=="HiC_scaffold_4" && $10<14979000 && $9>14911000' EOG090C00C9 HiC_scaffold_4 51.68 238 102 3 1 228 14911060 14911764 7e-62 239 EOG090C00C9 HiC_scaffold_4 97.18 71 2 0 1358 1428 14975775 14975987 4e-34 148 EOG090C00C9 HiC_scaffold_4 65.42 107 28 2 967 1072 14964940 14965236 1e-28 130 EOG090C00C9 HiC_scaffold_4 83.93 56 9 0 1301 1356 14975206 14975373 1e-20 104 EOG090C00C9 HiC_scaffold_4 76.27 59 14 0 1242 1300 14974379 14974555 1e-19 100 EOG090C00C9 HiC_scaffold_4 92.00 50 4 0 1067 1116 14968358 14968507 4e-19 99.4 EOG090C00C9 HiC_scaffold_4 64.71 68 21 1 391 455 14946000 14946203 7e-18 95.5 EOG090C00C9 HiC_scaffold_4 63.16 57 21 0 917 973 14963632 14963802 2e-14 84.3 EOG090C00C9 HiC_scaffold_4 65.38 52 18 0 1428 1479 14978557 14978712 8e-14 82.0 EOG090C00C9 HiC_scaffold_4 67.92 53 17 0 1117 1169 14969502 14969660 2e-13 80.9 EOG090C00C9 HiC_scaffold_4 48.75 80 34 2 721 793 14961462 14961701 3e-13 80.1 EOG090C00C9 HiC_scaffold_4 70.13 77 23 0 844 920 14963183 14963413 2e-12 77.4 EOG090C00C9 HiC_scaffold_4 63.46 52 19 0 791 842 14962552 14962707 2e-12 77.4 EOG090C00C9 HiC_scaffold_4 83.33 36 6 0 1170 1205 14971432 14971539 2e-11 73.9 EOG090C00C9 HiC_scaffold_4 55.10 49 19 1 346 391 14942659 14942805 2e-06 57.4 EOG090C00C9 HiC_scaffold_4 43.94 66 21 4 490 539 14948008 14948205 2e-04 51.2 EOG090C00C9 HiC_scaffold_4 50.98 51 20 2 539 584 14950341 14950493 2e-04 50.8
Actually, the ranking procedure of the genomic regions identified by tBLASTn only considers a single region by reference
sequence, the region of the lowest e-value HSP. Consequently, selecting a single region by reference sequence
increases the probability to miss the best candidate region as the size of the reference sequences increases.
In our case, the selected region of HiC_scaffold_4
is the region
HiC_scaffold_4:30368188-30368874
which effectively includes the HSP with the best e-value but does not include
any other HSP within 50 Kb upstream and downstream.
cat run_busco_after_hi-C/blast_output/tblastn_busco_after_hi-C.tsv| awk '$1=="EOG090C00C9" && $2=="HiC_scaffold_4" && $10<30410000 && $9>30318000' EOG090C00C9 HiC_scaffold_4 69.87 229 61 1 1 221 30368188 30368874 1e-86 319
If we delete this tBLASTn hit in the tblastn_busco_after_hi-C.tsv
file, the candidate region becomes
HiC_scaffold_4:14906060-14983712
which is the same region as the one supporting EOG090C00C9
in the
assembly before scaffolding. Maybe this selection of the best candidate region through the best HSP is too strict.
BUSCOs missing before and found after scaffolding
In the 15 BUSCO genes missing before and found after scaffolding, we observe that after scaffolding:- 2 BUSCO genes are supported by a region overlapping two contigs therefore retrieved thanks to the scaffolding process
- 2 BUSCO genes are supported by a region not listed in the best candidate genomic regions after the ranking procedure in the assembly before scaffolding. Respectively, 12 and 16 reference sequences have a lower e-value and the maximum number of selected genomic region, although increased from 3 to 10, is not high enough
- 11 BUSCO genes are supported by a region listed in the best candidate genomic regions after the ranking procedure in the assembly before scaffolding. Surprisingly, the gene models generated by Augutus do not lead to predict the same proteins whereas the selected genomic regions only differs by few bases at the start or end of the sequence. After noting that those 11 genomic regions are in the reverse direction after scaffolding, BUSCO is restarted on a reverse-complement counterpart of these 11 regions. The results are identical to those obtained after scaffolding, these 11 BUSCO genes have indeed been identified in the reverse complemented contigs. Of course in our assemblies, many other BUSCO genes have been identified on both contigs and scaffolds of these contigs in reverse direction. The reason why the strand influences Augustus predictions for some contigs and not for others remains unexplained.
Available files
Description | File |
---|---|
BUSCO results before Hi-C scaffolding | busco_before_hi-C.tgz |
BUSCO results after Hi-C scaffolding | busco_after_hi-C.tgz |