In total, 3701 protein-coding genes (excluding the Proline-Proline-Glutamic acid (PPE) and Proline-Glutamic acid (PE) protein gene families) and the rDNA genes were annotated. To estimate the copy number per genome of the assembled contigs, we followed the statistical method developed by Nederbragt et al. (2010), using the assembly information contained in the 454AlignmentInfo.tsv file generated by Newbler. The mauve
v2.3.1 software package was used for genome comparison (Darling et al., 2004), using the default options followed by manual inspection. The reference genomes used for comparison (from the EBI database) were: H37Rv (AL123456), KZN4207 (CP001662), CCDC5079 (CP001641), CCDC5180 (CP001642), CDC1551 (AE000516), F11 (CP000717) and H37Ra (CP000611). The annotated chromosome of the UT205 strain was deposited in the EBI-ENA database (http://www.ebi.ac.uk/ena/home) under the accession number HE608151. All differences found were subsequently analysed in detail with the artemis software. The comparison of predicted proteins was carried out with
the GGSEARCH tool from the fasta36 package (Pearson & Lipman, 1988), comparing each amino acid sequence with that of the corresponding ortholog.

Whole genome sequencing resulted in 375 462 reads with a total of 155 436 474 bases. A total of 97.98% of the reads (4 288 599 assembled bases) were included in the assembly. The N50 of the assembly was 81 913 bases, meaning that 50% of the assembled genome was contained in contigs of 81 kbp or larger. This calculation was carried out on the total genome assembled by Newbler. The average and largest contig lengths were 30 573 and 192 340 bp, respectively. The average contig sequencing depth was 38.9× and 99% of the assembled genome had a minimum coverage of 20×. Contig reordering with the ABACAS tool generated
a single molecule with most of the contigs included. Only 20 small contigs representing 17 396 bp were excluded, including those containing PE-PGRS and PPE genes, the 13E12 repeat protein and transposases, as well as the pks12 and Rv1319c genes, both of which have gaps within the assembly. The gaps (Ns) fall into repetitive elements such as IS6110, IS1081 and 13E12, or within genes such as PPE, PE-PGRS, pks12, cysA3, sseC1, Rv1319c and some transposases. In total, 3701 CDS sequences were transferred and manually curated. The rRNAs were transferred with the RATT tool and manually inspected. The tRNAs were predicted with the tRNAscan software (Lowe & Eddy, 1997), then compared with the reference genome and, if necessary, manually curated. To identify and quantify the repetitive elements/contigs present in the genome of the UT205 isolate, we tested the contig read depth with the R routine described by Nederbragt et al. (2010), demonstrating a high correlation between the contig-specific read depth and the number of copies present in the genome. As shown in Fig.