Techniques of data exploration subject the data to either supervised or unsupervised analysis. Unsupervised analytical methods include clustering, whereas supervised approaches aim at classification.
Clustering methods
All clustering methods aim to form clusters in which the distance between data points in the same cluster is minimal while the distance between data points in different clusters is maximal.
The most common approaches to clustering are:
• Hierarchical clustering
• K-Means clustering
• Self-Organizing Maps
• Partitioning Around Medoids
• Principal Component Analysis
a. Hierarchical clustering
This method progressively combines the most similar objects, placing dissimilar ones in separate, more distant clusters. This is similar to constructing a phylogenetic tree. The result of the clustering is depicted as a dendrogram, with similar clusters in adjacent groups.
Clusters are constructed progressively. The basic level of clustering is formed by individual genes or experiments; higher levels are obtained by progressively relaxing the similarity criterion so that larger groups merge.
Clusters that merge far from the root, near the leaves, contain highly similar members, while clusters that only join near the root are more dissimilar. You can cluster using either a bottom-up or a top-down approach.
Bottom-up method - Measure the distances between the data points using the selected distance measure and group the most similar points into initial clusters. The inter-cluster distances are then measured and used to merge the initial clusters further. The steps are repeated until a single cluster comprising all the data points is formed.
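As a minimal sketch of this bottom-up procedure, the following Python snippet uses SciPy's agglomerative clustering; the expression matrix is made-up illustrative data (rows as genes, columns as experiments):

```python
# A minimal sketch of bottom-up (agglomerative) clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
expression = rng.normal(size=(10, 4))   # hypothetical: 10 genes x 4 experiments

# linkage() measures pairwise distances, merges the two closest clusters,
# and repeats until a single cluster remains, as described above.
tree = linkage(expression, method="average", metric="euclidean")

# Cut the tree to recover, for example, three flat clusters.
labels = fcluster(tree, t=3, criterion="maxclust")
print(labels)
```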
Top-down approach - The entire data set is initially considered a single cluster, which is divided into two sub-clusters using the mean or centroid value of the cluster. The step is repeated until the clusters finally contain a single gene or expression value. This approach is faster but less accurate than the bottom-up method.
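A rough sketch of the top-down idea, implemented here as repeated 2-means splitting (the function name divisive and the recursion scheme are illustrative choices, not a standard API):

```python
# An illustrative top-down sketch: start with all genes in one cluster and
# repeatedly bisect with 2-means until every cluster holds a single gene.
import numpy as np
from sklearn.cluster import KMeans

def divisive(data, indices=None):
    """Recursively bisect the data; returns a nested list of gene indices."""
    if indices is None:
        indices = np.arange(len(data))
    if len(indices) <= 1:                      # a single gene: stop splitting
        return indices.tolist()
    halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data[indices])
    return [divisive(data, indices[halves == 0]),
            divisive(data, indices[halves == 1])]

rng = np.random.default_rng(1)
print(divisive(rng.normal(size=(6, 4))))      # made-up 6 genes x 4 experiments
```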
This type of clustering does not require the number of clusters to be specified in advance. However, some information is lost: n objects have n(n-1)/2 pairwise distances, while the dendrogram from hierarchical clustering has only n-1 inner nodes. The visualization also does not fix the arrangement of the clusters; since each inner node can be flipped, there are 2^(n-1) possible ways of drawing a dendrogram for the same clustering.
b. K-Means Clustering
This is a very simple type of clustering in which the number of clusters must be specified in advance. This choice is arbitrary and left to the researcher. Once the number of clusters (k) is selected, the algorithm tries to assign the available N data points to the k clusters. The centroid, or mean expression value, of each cluster is computed, and the distance from each individual data point to each centroid is used to place the point in the best possible cluster. The assignment and centroid-update steps are repeated until the cluster memberships stop changing, and the result is used for further analysis.
Each object is assigned to the nearest centroid, using the Euclidean distance in most cases. Once the clusters are defined, the new centroid values are calculated as the Mean Expression Values (MEVs) of the clusters.
The drawback of this technique is the random choice of the initial cluster centers, so the quality of the result depends on the initialization. Starting from several initializations and keeping the best run gives a better clustering.
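A minimal K-Means sketch with scikit-learn is shown below; the n_init parameter tries several random initializations and keeps the best run, which addresses the problem just described. The data are made up for illustration:

```python
# A minimal K-Means sketch with scikit-learn on made-up data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
expression = rng.normal(size=(50, 4))        # 50 genes x 4 experiments

k = 3                                        # number of clusters, pre-specified
# n_init runs the algorithm from several random initializations and keeps
# the run with the lowest within-cluster sum of squares.
model = KMeans(n_clusters=k, n_init=25, random_state=0).fit(expression)

print(model.labels_[:10])                    # cluster assignment per gene
print(model.cluster_centers_)                # mean expression value per cluster
```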
c. Self-Organizing Maps
These essentially follow the same principle as K-Means, but the algorithm represents the clusters as nodes of a two-dimensional grid. It randomly selects a data point and moves the nodes (cluster centers) towards it, with the closest nodes moving the most. The magnitude of the moves is kept small and decreases as training proceeds. Similarity between clusters drives the entire process.
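A small self-organizing map sketch using the third-party MiniSom package (installable with pip install minisom) could look as follows; the grid size and training length are arbitrary illustrative choices:

```python
# A small SOM sketch using the third-party MiniSom package.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(3)
expression = rng.normal(size=(100, 4))       # made-up: 100 genes x 4 experiments

# A 3x3 grid of nodes; each node acts as a cluster center on the 2-D grid.
som = MiniSom(x=3, y=3, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(expression, num_iteration=500)

# Each gene maps to its best-matching node (grid coordinates).
print([som.winner(g) for g in expression[:5]])
```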
d. Partitioning Around Medoids (PAM)
K-Means clustering is simple and fast, but it is tied to the Euclidean distance metric. PAM is a modified approach that can be used with any distance measure. The algorithm minimizes the sum of distances from the various data points to their respective cluster centers, with the additional restriction that in PAM each cluster center (medoid) must itself be one of the data points. Otherwise the same iterative scheme as in K-Means is employed.
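A bare-bones PAM-style sketch in NumPy is given below. It works from a precomputed distance matrix, so any distance measure can be plugged in; the swap step is a simplified greedy version of full PAM, and the function name pam is an illustrative choice:

```python
# An illustrative PAM-style sketch in NumPy, working from a precomputed
# distance matrix so that any distance measure can be used.
import numpy as np

def pam(dist, k, n_iter=100):
    """dist: (n, n) distance matrix; returns medoid indices and labels."""
    n = dist.shape[0]
    medoids = np.random.default_rng(0).choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)   # assign to nearest medoid
        # For each cluster, pick the member with the smallest total distance
        # to the other members as the new medoid.
        new = np.array([members[np.argmin(dist[np.ix_(members, members)].sum(axis=0))]
                        for members in (np.where(labels == c)[0] for c in range(k))])
        if np.array_equal(new, medoids):
            break
        medoids = new
    labels = np.argmin(dist[:, medoids], axis=1)       # final assignment
    return medoids, labels

# Example with the Manhattan distance, which plain K-Means cannot use directly.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))                           # made-up expression data
dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)
medoids, labels = pam(dist, k=3)
print(medoids, labels)
```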
e. Principal Component Analysis (PCA)
The method considers every gene as a vector. The PCA algorithm finds the directions (principal components) along which the data vary the most, so that a small number of dimensions can represent the variation in the data.
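A short PCA sketch with scikit-learn, using made-up data for illustration:

```python
# A short PCA sketch with scikit-learn on made-up data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
expression = rng.normal(size=(100, 6))       # 100 genes x 6 experiments

pca = PCA(n_components=2).fit(expression)
projected = pca.transform(expression)        # each gene in the 2-D PC space

print(pca.explained_variance_ratio_)         # share of variance per component
```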
What to do with time course data?
This requires a modified approach. The differences between expression values at consecutive time points are included as additional observations in the data set. The augmented data matrix is then subjected to the usual clustering methods such as K-Means or PAM.
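A sketch of this augmentation, clustering the enlarged matrix with K-Means on made-up data:

```python
# Augmenting a made-up time-course matrix with consecutive differences,
# then clustering the enlarged matrix with K-Means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
timecourse = rng.normal(size=(40, 5))        # 40 genes x 5 time points

deltas = np.diff(timecourse, axis=1)         # differences between consecutive points
augmented = np.hstack([timecourse, deltas])  # 40 genes x 9 observations

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(augmented)
print(labels[:10])
```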
Though these methods are easy to apply, the significance of the result can be doubtful: even 'random' data can produce seemingly meaningful clusters by chance. The choice of method should therefore depend on the needs and purpose of the research.