An efficient way to perform clustering is to first select a subset of important features.

We first want to remove unimportant features that contribute noise; we can then reduce the data size for more useful clustering. One approach for unsupervised data is to use an unbiased method such as Principal Component Analysis (PCA) to extract important features and perform dimension reduction. As we covered in A0.9, PCA gives us a set of uncorrelated directions that are ordered by their variance.
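As a minimal sketch of this idea, the following NumPy-only example (the data here are synthetic, not mRNA features) computes principal components via the SVD of the centered data matrix and confirms that the resulting directions come out ordered by the variance they explain:

```python
# Minimal PCA sketch via SVD, assuming only NumPy is available.
# The synthetic data are illustrative, not real hairpin measurements.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # 200 samples, 5 features
X[:, 0] += 2.0 * X[:, 1]             # correlated pair -> one dominant direction

Xc = X - X.mean(axis=0)              # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values are returned in descending order, so the variance
# along each principal direction is also descending.
var = s**2 / (len(X) - 1)
ratio = var / var.sum()
print(ratio)                         # first PC captures the most variance
```

Rows of `Vt` are the principal directions; in practice one would keep only the first few and project the data onto them.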

If we assume that these directions are indeed the most important for robust clustering, then we can discard variables that have a large component in low-variance directions. The choice of PCA is also strategic: PCA gives a low-dimensional summary of the data, helps detect outliers, and can be used to visualize the data in an interpretable manner. Using the first few PCs that capture most of the variation in the data, we can apply k-means clustering in the PCA sub-space. k-means tries to minimize the within-cluster variance and is computationally inexpensive. If we wanted instead to use a set of pre-determined features based on biological knowledge, we could consider variables that are relatively easy to measure, such as (1) the length of the hairpin stem, (2) the ratio of purines to pyrimidines in the hairpin loop, (3) the distance from the 3′ end of the mRNA, and (4) the distance from the 5′ end, and then run k-means clustering on these four features. We use k-means rather than a hierarchical clustering algorithm because we have little reason to believe that the underlying data has a hierarchical structure, nor do we necessarily want to recover a hierarchy.
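The PCA-then-k-means pipeline above can be sketched as follows, assuming scikit-learn is available; the number of retained PCs (2), the number of clusters (3), and the synthetic three-group data are all illustrative choices, not values prescribed by the problem:

```python
# Sketch: k-means in a reduced PCA sub-space (scikit-learn assumed).
# Synthetic stand-in for hairpin features: three groups shifted apart.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 6)) for c in (0.0, 3.0, 6.0)])

# Standardize so no single feature dominates the variance, then
# project onto the first two principal components.
X_std = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(X_std)

# Cluster in the 2-D PC space; k-means minimizes within-cluster variance.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)
print(km.cluster_centers_.shape)   # centroids live in PC space
```

The centroids returned here are coordinates in PC space; mapping them back through the PCA loadings is what makes them interpretable in terms of the original features.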

A possible pitfall is that this method of feature selection can lead to a high clustering error, since we may lose valuable information by discarding potentially useful features. Nevertheless, k-means clustering will generate centroids that are easy to understand and to use in subsequent biological studies.