IBISS: Interactive Bovine In Silico SNP Database

The Clustering Process

In the first round of clustering, sequences are grouped together on the basis of sequence similarity. Thus a cluster (cl#) should include all transcripts from a single gene, or from multigene families with very closely related sequences.

In the second round of clustering, the clusters are further subdivided to generate contigs (ct#). The alignment of sequences within each contig is refined, and one or more consensus sequences (cn#) are generated for each contig. Each consensus sequence represents a model mRNA; different cn# sequences from the same contig (ct#) represent different splice variants. The different contigs (ct#) within a cluster (cl#) may represent additional distinct transcripts from the same gene or, if the cluster contained transcripts from a multigene family, slightly variant genes.

The Primary consensus sequence (cn#) for a contig (ct#) is its longest consensus sequence. If consensus sequences are the same length, the one with the greatest number of constituent sequences is ranked best. If both length and number of constituent sequences are equal, the program ranks the consensus sequences by the number of good bases (i.e., A, T, C, or G). All other consensus sequences from the same contig are designated Alternate consensus sequences.
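
The ranking rule described above can be sketched as a three-part sort key. This is only an illustration under stated assumptions, not the program IBISS actually runs: the `Consensus` class, its field names, and the example sequences are hypothetical, and any base other than A, T, C, or G (for example N) is assumed to count as not good.

```python
from dataclasses import dataclass

GOOD_BASES = set("ACGT")  # "good" bases as defined in the text


@dataclass
class Consensus:
    name: str            # hypothetical label such as "cn1"
    sequence: str        # consensus bases; assumed to use N or similar for bad positions
    n_constituents: int  # number of sequences contributing to this consensus


def rank_key(cn):
    """Sort key: length first, then constituent count, then number of good bases."""
    good = sum(1 for base in cn.sequence.upper() if base in GOOD_BASES)
    return (len(cn.sequence), cn.n_constituents, good)


def split_primary_alternates(consensuses):
    """Return (primary, alternates) for the consensus sequences of one contig."""
    ranked = sorted(consensuses, key=rank_key, reverse=True)
    return ranked[0], ranked[1:]


# Made-up consensus sequences for a single contig.
contig = [
    Consensus("cn1", "ACGTNNACGT", 4),  # length 10, 4 constituents, 8 good bases
    Consensus("cn2", "ACGTACGTAC", 4),  # length 10, 4 constituents, 10 good bases
    Consensus("cn3", "ACGTACG", 9),     # shorter, so it ranks last despite 9 constituents
]

primary, alternates = split_primary_alternates(contig)
print(primary.name, [cn.name for cn in alternates])
```

In this made-up contig, cn1 and cn2 tie on length and on constituent count, so the good-base count decides between them; cn3 has the most constituent sequences but is shorter, so length takes precedence.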

Thus, the number of clusters (cl#) is an estimate of the number of genes represented in the data set. The number of primary consensus sequences is an overestimate of the number of genes and an underestimate of the number of splice variants. The total number of consensus sequences is likely to be an overestimate of the number of splice variants contained in the data set, as many may be due to regions of bad sequence or to artifacts generated during the construction of the libraries.
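
As a rough illustration of how these three counts relate, the sketch below tallies them for a small nested mapping of clusters to contigs to consensus sequences. The mapping layout and all identifiers in it are hypothetical, and the code assumes each contig has exactly one primary consensus sequence, as described above.

```python
# Hypothetical data set: cluster -> contig -> list of consensus sequence names.
dataset = {
    "cl1": {"ct1": ["cn1", "cn2"], "ct2": ["cn3"]},
    "cl2": {"ct3": ["cn4", "cn5", "cn6"]},
}


def summarize(data):
    """Return (clusters, primary consensus sequences, total consensus sequences)."""
    n_clusters = len(data)
    # One primary consensus per contig, so this equals the number of contigs.
    n_primary = sum(len(contigs) for contigs in data.values())
    n_consensus = sum(
        len(names) for contigs in data.values() for names in contigs.values()
    )
    return n_clusters, n_primary, n_consensus


n_clusters, n_primary, n_consensus = summarize(dataset)
print(n_clusters, n_primary, n_consensus)
```

Because every cluster contains at least one contig and every contig at least one consensus sequence, the three counts can only stay equal or grow from left to right, which matches the ordering of the estimates given in the text.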