Discrepancies in BUSCO results between scaffold-scale and chromosome-scale assemblies used as reference

December 2019

This study lists discrepancies in BUSCO results obtained on the Perca flavescens genome assembly before and after scaffolding using Hi-C data.

Genome assembly metrics

Here are some statistics about the perch genome assembly before and after the last scaffolding step.

Before Hi-C scaffolding                             After Hi-C scaffolding
Number of contigs              879                  Number of scaffolds              269
Total size of contigs  877,025,633                  Total size of scaffolds  877,440,133
Longest contig          18,280,501                  Longest scaffold          44,580,961
Shortest contig                160                  Shortest scaffold                160
N50 contig length        4,304,620                  N50 scaffold length       37,412,490
L50 contig count                60                  L50 scaffold count                11
N90 contig length              590                  N90 scaffold length       31,135,000
L90 contig count               870                  L90 scaffold count                21

The total assembly size difference is due to the length of the spacers introduced between contigs. After the Hi-C scaffolding step, the perch genome is assembled into 24 chromosome-scale scaffolds with only 1.22% of the total genome which still unanchored into those 24 chromosome-scale scaffolds. (see the Hi-C map).

BUSCO command lines and result summaries

BUSCO v3.0.2 was run on both genome assembly versions using the following command lines.

python $BUSCO-DIRECTORY/scripts/run_BUSCO.py -i Perca_flavescens_contigs.fa -o busco_before_hiC -l $BUSCO-DIRECTORY/datasets/actinopterygii_odb9 -m genome -c 6 --limit 10 -sp zebrafish
python $BUSCO-DIRECTORY/scripts/run_BUSCO.py -i Perca_flavescens_scaffolds.fa -o busco_after_hi-C -l $BUSCO-DIRECTORY/datasets/actinopterygii_odb9 -m genome -c 6 --limit 10 -sp zebrafish

Resulting BUSCO metrics

Before Hi-C scaffolding                             After Hi-C scaffolding
C:97.8%[S:95.4%,D:2.4%],F:1.0%,M:1.2%,n:4584        C:97.6%[S:95.2%,D:2.4%],F:0.9%,M:1.5%,n:4584
4482 Complete BUSCOs (C)                            4472 Complete BUSCOs (C)
4371 Complete and single-copy BUSCOs (S)            4363 Complete and single-copy BUSCOs (S)
111  Complete and duplicated BUSCOs (D)             109  Complete and duplicated BUSCOs (D)
47   Fragmented BUSCOs (F)                          41   Fragmented BUSCOs (F)
55   Missing BUSCOs (M)                             71   Missing BUSCOs (M)
4584 Total BUSCO groups searched                    4584 Total BUSCO groups searched

BUSCO metrics comparison

Results obtained before and after scaffolding are of course very close but the metrics degradation (except the fragmented BUSCOs) puzzled us. We focused our attention on the missing BUSCO gene lists and we observed that among the 55 before scaffolding missing BUSCO genes and the 71 missing after scaffolding, only 40 were common. 15 BUSCO genes are missing before scaffolding and found after and more surprisingly 31 BUSCO genes are found before scaffolding and missing after.

BUSCO genes found before and missing after scaffolding

Why are some BUSCO genes no longer found in the assembly after scaffolding? To answer this question, we suspected that some contigs may have been fragmented during the scaffolding process. Indeed, the 3D-DNA pipeline used to scaffold the contigs performs a misjoin detection and correction step which could lead to contig fragmentation. After analysis, it appears that none of the 31 genomic regions supporting the BUSCOs before scaffolding (according to the full table tsv file) were concerned by the 3D-DNA misjoin correction step. This means that the 31 genomic regions supporting the BUSCOs before scaffolding are integrally conserved in the assembly after scaffolding. We therefore analyzed the steps performed by BUSCO to search for BUSCO-matching genes in the genome and noted that the 31 genomic regions supporting BUSCOs before scaffolding were not reported in the tBLASTn results files coordinates_busco_after_hi-C.tsv or coordinates_busco_after_hi-C_missing_and_frag_rerun.tsv of alignments performed on the assembly after scaffolding.

Example with EOG090C00C9

Final results for EOG090C00C9

grep EOG090C00C9 run_busco*/full_table*
run_busco_after_hi-C/full_table_busco_after_hi-C.tsv:EOG090C00C9	Missing
run_busco_before_hi-C/full_table_busco_before_hiC.tsv:EOG090C00C9	Complete	utg149_pilon_pilon_pilon	6327270	6414496	2009.5	1501

EOG090C00C9 is found in the assembly before scaffolding in the contig utg149_pilon_pilon_pilon but missing in the assembly after scaffolding.

tBLASTn results for EOG090C00C9

grep EOG090C00C9 run_busco*/blast_output/coordinates_busco_*_hi-C.tsv
run_busco_after_hi-C/blast_output/coordinates_busco_after_hi-C.tsv:EOG090C00C9	HiC_scaffold_8	17788149	17862567
run_busco_before_hi-C/blast_output/coordinates_busco_before_hi-C.tsv:EOG090C00C9	utg149_pilon_pilon_pilon	6327270	6430002

In the assembly before scaffolding, we recover the genomic region reported in the final result table. In the assembly after scaffolding, we see that the reported genomic region belongs to the chromosome-scale scaffold HiC_scaffold_8. But if we look for the contig utg149_pilon_pilon_pilon in our scaffolding assembly file, we see that the contig does not belong to this scaffold but belongs to HiC_scaffold_4.

grep utg149_pilon_pilon_pilon Perca_flavescens.final.assembly.tsv 
HiC_scaffold_4	utg149_pilon_pilon_pilon	-	12004819	21323521

Finally, the genomic region HiC_scaffold_8:17788149-17862567 reported after the tBLASTn does not match with the EOG090C00C9 BUSCO-matching gene. This could have been different if the procedure dedicated to select the best candidate genomic region did not chose this HiC_scaffold_8 region. If we look at the tBLASTn results against the assembly after scaffolding, we recover the same HSPs than those allowing the identification of EOG090C00C9 in the contig utg149_pilon_pilon_pilon from the assembly before scaffolding.

tBLASTn hits for EOG090C00C9

Here is the set of HSPs leading to the selection of the candidate region utg149_pilon_pilon_pilon:6327270-6430002 from the assembly before scaffolding.

cat run_busco_before_hi-C/blast_output/tblastn_busco_before_hi-C.tsv | awk '$1=="EOG090C00C9"&&$2=="utg149_pilon_pilon_pilon"'
EOG090C00C9	utg149_pilon_pilon_pilon	51.68	238	102	3	1	228	6412462	6411758	7e-62	  239
EOG090C00C9	utg149_pilon_pilon_pilon	97.18	71	2	0	1358	1428	6347747	6347535	4e-34	  148
EOG090C00C9	utg149_pilon_pilon_pilon	65.42	107	28	2	967	1072	6358582	6358286	1e-28	  130
EOG090C00C9	utg149_pilon_pilon_pilon	83.93	56	9	0	1301	1356	6348316	6348149	1e-20	  104
EOG090C00C9	utg149_pilon_pilon_pilon	76.27	59	14	0	1242	1300	6349143	6348967	1e-19	  100
EOG090C00C9	utg149_pilon_pilon_pilon	92.00	50	4	0	1067	1116	6355164	6355015	4e-19	99.4
EOG090C00C9	utg149_pilon_pilon_pilon	64.71	68	21	1	391	455	6377522	6377319	7e-18	95.5
EOG090C00C9	utg149_pilon_pilon_pilon	63.16	57	21	0	917	973	6359890	6359720	2e-14	84.3
EOG090C00C9	utg149_pilon_pilon_pilon	65.38	52	18	0	1428	1479	6344965	6344810	8e-14	82.0
EOG090C00C9	utg149_pilon_pilon_pilon	67.92	53	17	0	1117	1169	6354020	6353862	2e-13	80.9
EOG090C00C9	utg149_pilon_pilon_pilon	48.75	80	34	2	721	793	6362060	6361821	3e-13	80.1
EOG090C00C9	utg149_pilon_pilon_pilon	70.13	77	23	0	844	920	6360339	6360109	2e-12	77.4
EOG090C00C9	utg149_pilon_pilon_pilon	63.46	52	19	0	791	842	6360970	6360815	2e-12	77.4
EOG090C00C9	utg149_pilon_pilon_pilon	83.33	36	6	0	1170	1205	6352090	6351983	2e-11	73.9
EOG090C00C9	utg149_pilon_pilon_pilon	55.10	49	19	1	346	391	6380863	6380717	2e-06	57.4
EOG090C00C9	utg149_pilon_pilon_pilon	43.94	66	21	4	490	539	6375514	6375317	2e-04	51.2
EOG090C00C9	utg149_pilon_pilon_pilon	50.98	51	20	2	539	584	6373181	6373029	2e-04	50.8

In the assembly after scaffolding, we recover the "same set of HSPs" in the chromosome-scale scaffold HiC_scaffold_4 but the HiC_scaffold_4:14911060-14950493 genomic region is not considered as the best candidate.

cat run_busco_after_hi-C/blast_output/tblastn_busco_after_hi-C.tsv| awk '$1=="EOG090C00C9" && $2=="HiC_scaffold_4" && $10<14979000 && $9>14911000'
EOG090C00C9	HiC_scaffold_4	51.68	238	102	3	1	228	14911060	14911764	7e-62	  239
EOG090C00C9	HiC_scaffold_4	97.18	71	2	0	1358	1428	14975775	14975987	4e-34	  148
EOG090C00C9	HiC_scaffold_4	65.42	107	28	2	967	1072	14964940	14965236	1e-28	  130
EOG090C00C9	HiC_scaffold_4	83.93	56	9	0	1301	1356	14975206	14975373	1e-20	  104
EOG090C00C9	HiC_scaffold_4	76.27	59	14	0	1242	1300	14974379	14974555	1e-19	  100
EOG090C00C9	HiC_scaffold_4	92.00	50	4	0	1067	1116	14968358	14968507	4e-19	99.4
EOG090C00C9	HiC_scaffold_4	64.71	68	21	1	391	455	14946000	14946203	7e-18	95.5
EOG090C00C9	HiC_scaffold_4	63.16	57	21	0	917	973	14963632	14963802	2e-14	84.3
EOG090C00C9	HiC_scaffold_4	65.38	52	18	0	1428	1479	14978557	14978712	8e-14	82.0
EOG090C00C9	HiC_scaffold_4	67.92	53	17	0	1117	1169	14969502	14969660	2e-13	80.9
EOG090C00C9	HiC_scaffold_4	48.75	80	34	2	721	793	14961462	14961701	3e-13	80.1
EOG090C00C9	HiC_scaffold_4	70.13	77	23	0	844	920	14963183	14963413	2e-12	77.4
EOG090C00C9	HiC_scaffold_4	63.46	52	19	0	791	842	14962552	14962707	2e-12	77.4
EOG090C00C9	HiC_scaffold_4	83.33	36	6	0	1170	1205	14971432	14971539	2e-11	73.9
EOG090C00C9	HiC_scaffold_4	55.10	49	19	1	346	391	14942659	14942805	2e-06	57.4
EOG090C00C9	HiC_scaffold_4	43.94	66	21	4	490	539	14948008	14948205	2e-04	51.2
EOG090C00C9	HiC_scaffold_4	50.98	51	20	2	539	584	14950341	14950493	2e-04	50.8

Actually, the ranking procedure of the genomic regions identified by tBLASTn only considers a single region by reference sequence, the region of the lowest e-value HSP. Consequently, selecting a single region by reference sequence increases the probability to miss the best candidate region as the size of the reference sequences increases. In our case, the selected region of HiC_scaffold_4 is the region HiC_scaffold_4:30368188-30368874 which effectively includes the HSP with the best e-value but does not include any other HSP within 50 Kb upstream and downstream.

cat run_busco_after_hi-C/blast_output/tblastn_busco_after_hi-C.tsv| awk '$1=="EOG090C00C9" && $2=="HiC_scaffold_4" && $10<30410000 && $9>30318000'
EOG090C00C9	HiC_scaffold_4	69.87	229	61	1	1	221	30368188	30368874	1e-86	  319

If we delete this tBLASTn hit in the tblastn_busco_after_hi-C.tsv file, the candidate region becomes HiC_scaffold_4:14906060-14983712 which is the same region as the one supporting EOG090C00C9 in the assembly before scaffolding. Maybe this selection of the best candidate region through the best HSP is too strict.

BUSCOs missing before and found after scaffolding

In the 15 BUSCO genes missing before and found after scaffolding, we observe that after scaffolding:

Available files

Description File
BUSCO results before Hi-C scaffolding busco_before_hi-C.tgz
BUSCO results after Hi-C scaffolding busco_after_hi-C.tgz