An efficient way of performing clustering would be to first select a subset of
important features. Removing unimportant features that mostly contribute noise
also reduces the size of the data, which makes the clustering more useful. One
approach here would be to use an unbiased method for unlabeled data, such as
Principal Component Analysis (PCA), to extract important features and perform
dimension reduction. As we covered in A0.9, PCA gives us a set of uncorrelated
directions that are ordered by their variance. If we assume that the
high-variance directions are indeed the most important for robust clustering,
then we can discard variables that have a large component in low-variance
directions. The choice of PCA is also strategic because PCA gives a
low-dimensional summary of the data, helps detect outliers, and can be used to
visualize the data in an interpretable manner. Using the first few PCs that
capture most of the variation in the data, we can then run k-means clustering
in the PCA subspace. k-means clustering tries to minimize the within-cluster
variance and is computationally inexpensive.
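The PCA-then-k-means pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the data are synthetic (two groups that differ in 2 of 10 variables), the function names pca_project and kmeans are hypothetical, and only NumPy is used. For real data, library routines such as scikit-learn's PCA and KMeans would be the more usual choice.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)  # center each variable
    # SVD of the centered data: rows of Vt are the principal directions,
    # sorted by decreasing singular value (i.e. decreasing variance).
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # fraction of variance per PC
    scores = Xc @ Vt[:n_components].T
    return scores, explained

def kmeans(Z, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        new = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment against the final centroids.
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centroids

# Synthetic data (an assumption for illustration): 100 samples, 10 variables,
# where the second half of the samples is shifted in the first two variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
X[50:, :2] += 8.0

scores, explained = pca_project(X, n_components=2)
labels, centroids = kmeans(scores, k=2)
print("variance explained by PC1, PC2:", explained[:2])
print("cluster sizes:", np.bincount(labels))
```

The number of PCs to keep and the number of clusters k are choices the analyst has to make; the values above are arbitrary for this toy example.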

If we wanted to use a set of pre-determined features based on biological
knowledge, we could consider variables that are relatively easy to measure,
such as (1) the length of the hairpin stem, (2) the ratio of purines to
pyrimidines in the hairpin loop, (3) the distance from the 3′-end of the mRNA,
and (4) the distance from the 5′-end of the mRNA, and then use k-means
clustering on these four features. We are using k-means instead of a
hierarchical clustering algorithm because we have little basis to believe that
the underlying data has a hierarchical structure, nor do we necessarily want to
recover a hierarchy. A possible pitfall is that this method of feature
selection can lead to a high clustering error, as we may lose valuable
information by throwing away potentially useful features. Nevertheless, k-means
clustering will generate centroids that are easy to understand and use for
subsequent biological studies.
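As a rough illustration of how the four features could be assembled before clustering, here is a minimal sketch. Everything about the data layout is an assumption made for this example: the function name hairpin_features, the 0-based coordinates, the convention that a hairpin is stored as 5′ stem arm, loop, 3′ stem arm, and the toy sequence itself. The z-score step is also an addition: k-means relies on Euclidean distances, so features measured in nucleotides would otherwise dominate a ratio that is close to 1.

```python
from statistics import mean, pstdev

PURINES = set("AG")
PYRIMIDINES = set("CU")

def hairpin_features(mrna, stem_start, stem_len, loop_len):
    """Return the four features for one hairpin.

    Assumed (hypothetical) layout: 5' stem arm | loop | 3' stem arm,
    with the 5' arm starting at 0-based index `stem_start`.
    """
    loop_start = stem_start + stem_len
    loop = mrna[loop_start:loop_start + loop_len]
    hairpin_end = loop_start + loop_len + stem_len  # one past the 3' arm
    pur = sum(base in PURINES for base in loop)
    pyr = sum(base in PYRIMIDINES for base in loop)
    if pyr == 0:
        # The ratio is undefined here; how to handle it is left open.
        raise ValueError("loop contains no pyrimidines")
    return [
        stem_len,                  # (1) length of the hairpin stem
        pur / pyr,                 # (2) purine/pyrimidine ratio in the loop
        len(mrna) - hairpin_end,   # (3) distance from the 3' end
        stem_start,                # (4) distance from the 5' end
    ]

def standardize(rows):
    """Z-score each feature column so all four are on a comparable scale."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

# Toy mRNA with two hairpins (made up for illustration only).
mrna = ("AUG" + "GCGCG" + "AAAU" + "CGCGC" + "UUA"
        + "GGCC" + "AUUC" + "GGCC" + "AAUAA")
hairpins = [
    {"stem_start": 3, "stem_len": 5, "loop_len": 4},
    {"stem_start": 20, "stem_len": 4, "loop_len": 4},
]

features = [hairpin_features(mrna, **h) for h in hairpins]
z = standardize(features)
for raw, scaled in zip(features, z):
    print(raw, "->", [round(v, 2) for v in scaled])
```

The standardized rows can then be passed to any k-means routine. To read the resulting centroids in biological units, they can be mapped back by multiplying by each feature's standard deviation and adding its mean.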