Existing studies have challenged the current definition of named bacterial species, especially in the case of highly recombinogenic bacteria. This has led to considering the use of computational procedures to examine potential bacterial clusters that are not identified by species naming. This paper describes the use of sequence data obtained from MLST databases as input for a k-means algorithm extended to deal with housekeeping gene sequences as a metric of similarity for the clustering process. An implementation of the k-means algorithm has been developed based on an existing source code implementation, and it has been evaluated against MLST data. Results point out to potential bacterial clusters that are close to more than one different named species and thus may become candidates for alternative classifications accounting for genotypic information. The use of hierarchical clustering with sequence comparison as similarity metric has the potential to find clusters different from named species by using a more informed cluster formation strategy than a conventional nominal variant of the algorithm.
In contrast to eukaryotic organisms, divergence between different species in bacteria is not completely irreversible, and different named species are not ecologically distinct. In consequence, a named bacterial species is more like a genus than a species [
In spite of the effort in classifying bacteria, previous studies [
This paper describes a new software tool that uses sequence data obtained from MLST databases as input for well-known clustering algorithms. Concretely, the k-means algorithm is extended to deal with housekeeping gene sequences as a metric of similarity for the clustering process. The k-means algorithm does not assume any kind of probability distribution on the data it segments, so it provides an adequate starting point for research in clustering applied to MLST databases, which are unbalanced in their coverage of bacteria, since they are basically built by non-planned contributions from researchers worldwide, as occurs with many other biological databases.
Prokaryotes reproduce asexually and are thus in principle unable to conform to Ernst Mayr’s popular Biological Species Concept “groups of actually or potentially interbreeding natural populations which are reproductively isolated from other such groups”. So currently the practical need for identification and naming is addressed by some broadly agreed but provisional bacterial species definition. A theory for a unifying species concept that would explain patterns of microbial diversity in ecological and population genetic terms could then be found in the future from the examination and continuous analysis of bulks of biological data using among others, bioinformatics techniques.
Related bacteria were at the beginning identified by careful analysis of phenotypes (typically, metabolic traits), but nowadays sequence similarity approaches are used as a source of potential units or as a way of finding supporting evidence for named species. Operationally, molecular de nitions are often used, and species are usually expected to share at least 70% binding in standardized DNA–DNA hybridization [
The criterion of a threshold of DNA–DNA hybridization says that strains that show approximately 70% or greater DNA–DNA relatedness are considered to belong to the same species and those that have less than this value are different species. DNA–DNA hybridization is considered to be the best method for assigning bacteria to taxonomic units. However, the technique is time consuming, expensive and requires considerable human effort. This has led to using lighter techniques for the same purpose, being identification with 16S rRNA the most widespread today, and in fact, the increase of registered species is attributed to the use of that technique [
The use of 16S rRNA sequences to study bacterial species has been considered as especially appropriate for the following reasons [ Its presence in almost all bacteria. The function of the 16S rRNA gene over time has not changed, suggesting that random sequence changes are a more accurate measure of time (evolution). The 16S rRNA gene (1500 bp) is large enough for bio-informatics analysis.
As mentioned by Hanage et al. [
There are other important concepts related with the study of bacterial species. Some of them which are present in MLST databases are briefly discussed here. An important consideration regarding methodology is that a very low percentage of bacterial species are cultivable with today’s techniques. So, laboratory studies cannot cover the diversity of bacterial units, and molecular techniques suggest that there could be billions of potential species [
A strain is a subset of a bacterial species differing from other bacteria of the same species by some minor but identifiable difference. Strain have been defined as “populations of organisms that descends from a single organism or pure culture isolate. Strains within a species may differ slightly from one another in many ways.” Strains are often created in the laboratory by mutagenizing existing strains or wild type examples of bacterial species.
A serovar is a strain differentiated by serological means. For example, individual strains of Salmonella are often distinguished and distinguishable by serological means. Serovars allow organisms to be classified at the sub-species level. An isolate is a microbial or viral sample that has been obtained from an infected individual, rather than grown in a laboratory. In chemistry and bacteriology, the verb isolate means to obtain a pure chemical, bacteriological or viral sample. The noun ‘isolate’ refers to the sample itself. MLST databases register isolate information, thus the data included is determined by that particular data gathering practice.
The definition of bacterial species has been recently challenged, especially in the case of highly recombinogenic bacteria [
The problem of determining bacterial species is also related to how far we should go in the division into subgroups. Cohan et al. [
Clustering is the partitioning of objects into different subgroups, with the subsets called clusters. Ideally, the data in each subset share some common trait(s) which are relevant to the problem at hand. Clustering can be considered the most important unsupervised learning technique; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. Clustering algorithms can be classified along diverse dimensions [ Exclusive Clustering, being k-means a typical representative. Overlapping Clustering, being fuzzy c-means (first described by Dunn [ Hierarchical Clustering, as the CobWeb clustering method. Probabilistic Clustering, as the EM clustering method.
In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it could not be included in another cluster. In the second type, fuzzy sets or other kind of gradual relatedness measure is used to cluster data, so that each object may belong to two or more clusters with different degrees of membership or closeness. Hierarchical clustering produces clusters contained in more general ones. Finally, probabilistic clustering uses some assumed probability distributions for the formation of clusters.
In our approach, the k-means has been selected as a first step towards using other methods. The benefit of using k-means is that it does not impose requirements on the population to be clustered (as the EM algorithm) and allows for clear groupings of bacteria that are close w.r.t. their genotypes. Although fuzzy clustering seems a good candidate for clusters with imprecise boundaries, its multiple classification makes its outcomes more difficult to interpret, and in our current research we are interested in generating first non-overlapping groupings, that in future work may be contrasted with overlapping ones.
In phylogeny, clustering is mostly used for finding 16S rRNA groups to create phylogenetic trees. But it can be used for other genomic traits also [
Genotypic data has been used as the basis to characterize bacterial isolates. Concretely, multilocus sequence typing (MLST) is a procedure for characterizing isolates of bacterial species using the sequences of internal fragments of a limited number of house-keeping genes [
In our context, fuzzy clustering appears to be coherent with the idea that lateral gene transfer and asexual reproduction in bacteria produce a higher level of variety in the species concept. Cohan et al. [
“[
However, the raw material for defining membership functions would be that of sequence similarity, and the scores produced by alignment algorithms still lack a theory–based interpretation. The estimation of significance in terms of similarity of an alignment is based on estimating the probability that the score computed or a higher one can occur simply by chance, given the probabilistic models for the sequences [
This later challenge also is found with probability–based clustering algorithms. For example, the expectation-maximization (EM) algorithm [
Hierarchical clustering appears as a reasonable approach given that at surface, they are coherent with existing approaches that use phylogenetic trees [
Clustering algorithms have been used for different classification or discovery purposes that use biological data as input. Particularly, in [
The BURST algorithm [
The k-means algorithm [
where there are
The algorithm can be applied to numerical or nominal data, or a combination of both. A common form of the algorithm uses partitions the input points into k initial sets, either at random or using some heuristic data. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and algorithm repeated by alternate application of these two steps until convergence, which is obtained when the points no longer switch clusters (or when centroids are no longer changed).
The k-means algorithm assumes a vector space describing the instances and provides a generic arithmetic formulation in which the instances xj are considered to be numerical. Variants dealing with nominal data reduce to the numerical variant by considering value coincidence to score one, zero otherwise. A straightforward approach for considering MLST data would be that of using a special metric for object similarity. The following equation reflects this change.
Concretely, the changes are twofold:
The arithmetic distance in the vector space is substituted by an average estimation of the similarity of the sequences at the different loci considered. Instead of computing a centroid
The alignScore(x
It should deal with actual objects in the instance space considered, to guarantee that genotipic comparisons are biologically meaningful.
It should be flexible to consider vectors or weighted vectors of sequence fragments, thus allowing for different configurations.
The second requirement makes the algorithm flexible enough to work with different configurations:
Several loci, as in MLST databases which typically describe bacteria with alleles in seven housekeeping genes. A single locus (often with a longer sequence), as is common with 16S rRNA analysis. Any combination of the above, including concatenations of sequences as used by Hanage et al. [
Sequence alignment algorithms as the Needleman-Wunsch one [
In principle, the scores returned by the clustering algorithm may be in any range, since the computation does not depend on the range but in the discernibility of distances. Nonetheless, the scores returned by common alignment algorithm implementations are not normalized in any interval, which may require some transformation in the implementation of the algorithm, as it will be described later.
The selection of cluster centroids is a critical aspect in the design of the algorithm, since it affects the performance and possibilities of the convergence of the iterative process of the k-means algorithm. Instead of computing the averages of vector elements, there is a search process, which again uses sequence similarity. The following formula specifies the criterion for selecting the centroid of cluster
That is, the sum of the distances of each element in the cluster to the rest of the elements is computed, and the minimum general distance is used to determine the new centroid.
The data gathered for testing the software and analyzing results has been downloaded from MLST databases
The data downloaded is stored in denormalized form in two tables with the following general structure.
ISOLATES( id,
Where id identifies the isolate, and there are a number of other identification fields including origin, curator and the like, then there is a specification of species, and a list of columns depending on the database that store the allele types for the housekeeping genes that characterize that particular database. The serogroup, clonal complex and other fields can be as alternative criteria for grouping bacterial units.
ALLELES(locusid, id, seq)
Where the pair (locusid, id) identify the allele, and seq contains the sequence. The data in the relational schema is then converted to the ARFF format to make it usable with the Weka machine learning framework (described in the next section). This is accomplished by a utility program named MLST2ARFF.
The download process is implemented with a small utility program called MLSTDownloader that gets data to a local relational store. It uses the MLST SOAP API.
This is an implementation of a variant of the SimpleKMeans algorithm provided in Weka4 that uses global alignment of sequences of housekeeping genes to drive the clustering process. Weka (Waikato Environment for Knowledge Analysis) is an open source suite of machine learning software written in Java, developed at the University of Waikato.
The main classes with the implementation are the following:
SimpleKMeans4mlst.java, the modified SimpleKMeans. NucleotidePairAligner.java, a utility class for getting global alignment scores.
The BioJava implementation of the Needleman-Wunsch algorithm (class NeedlemanWunsch) has been used for the similarity scoring.
This section reports on the evaluation of the algorithm on the approach presented for MLST databases.
The Neisseria MLST database has been used as the case study since they are among the most recombinogenic bacteria, and there is good evidence for relatively frequent localized recombination between the named Neisseria species [
The modified k-means algorithm using sequence alignment scores has been contrasted with the use of the conventional k–means provided in the Weka libraries using allele attributes as nominal values. The use of nominal values for the classification is related to the eBURST algorithm.
In general, the within cluster sum of squared error results in lower values in the modified version, but this is not necessarily related to the identification of ecologically significant clusters.
A moderately extensive exploration has been carried out with the isolates identified with the ids from 1 to 400 in the Neisseria MLST database, again asking for ten clusters. The SimpleKMeans4mlst algorithm produced the following clustering.
Serotype 2a appears to determine the first cluster, which correspond to a particular virulent variant of N. meningitidis. Cluster 2 has grouped N. Lactamica with some noise. This is consistent with the findings in [
A very relevant finding is that sequence types (ST) are only discriminating the same or similar clusters in the cases mentioned in the above table. This suggest that using sequences in turn of allele numbers provide matches with biological characteristics not covered by STs.
From the analysis above, it seems that serogroups or serotypes are determining the clusters to a great extent. This is confirming the classifications at the sub-species level.
Regarding clonal complexes, cluster 0 is basically determined by ST-11 complex/ET-37 complex, and clusters 6–9 by ST-41/44 complex/Lineage 3, ST-32 complex/ET-5 complex, ST-11 complex/ET-37 complex, ST-22 complex respectively (the latter case blurred with ST-23 complex/Cluster A3). However, the others are not clearly determined by any clonal complex, while there are attributes that may to some extent be determinant, especially if more clusters are requested to the algorithm, thus having the opportunities to split them.
The original SimpleKMeans with the same data and input parameters produces clusters less balanced. A larger cluster of 125 instances concentrates N. Lactamica (13 isolates), N. gonorrhoeae (1 isolate) and N. Meningitidis C (59 isolates), but also dispersed isolates from other serogroups.
The obvious limitation is that of the size of the databases used for the evaluation, which were constrained by the computational cost of the algorithm’s implementation. In addition to that, the exploration of the clusters needs to be informed by theoretical models [
Regarding the similarity measure, the significance of the global alignments should be explored, to prevent low significant to be acting as noise in the clustering process.
The established naming of bacterial species is nowadays challenged by molecular level methods that suggest that there is a much rich diversity of organisms than in the non-microbial realm [
This work has explored the possibility of extending existing clustering algorithms to deal with sequence data as a tool for identifying potential bacterial groups that may be biologically significative. Concretely, an extension of the k-means algorithm has been reported, along with its implementation. Data from MLST databases have been used for the testing.
There are several avenues for future research related to the clustering of bacteria for the identification of significant units. These include the following:
Improving the described extended k-means implementation. Concretely, performance can be critically improved by caching sequence alignment scores, which are repeated several times and that are by themselves computationally expensive. Evaluating the outcomes of extended clustering algorithms, by using them systematically with MLST databases. However, the evaluation would need to analyze the result in contrast with some biological-ecological interpretation of other data items. Extending the analysis to other clustering schemes, as those that have been mentioned in the paper.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.