My Master’s Dissertation, Revisited Series – Part 2:
Posted by sanstechbytes on April 30, 2012
In part 1, we had a quick overview of data mining and text (data) mining, along with some of the terminology used. In this part, let's explore clustering, some of the clustering algorithms, and text clustering using text mining.
In clustering, the objective is to partition an unstructured set of objects into clusters or groups. We want objects to be as similar as possible to objects in the same cluster and as dissimilar as possible to objects in other clusters.
Clustering vs. Categorization
By automatic categorization, we mean letting a machine decide to which of a set of predefined categories a text belongs. In clustering, the machine decides how a given text set should be partitioned. Categorization is suitable when one wants to categorize new texts according to a known categorization; clustering, when one wants to discover new structures not previously known. Both methods may give interesting results on an unknown text set: categorization sorts the texts according to a well-known structure, while clustering displays the structure of the particular set. This series deals with clustering of texts.
Let's look at an example visualization of clusters for customer groups, based, for simplicity, on their age and income. I think the diagrams (Courtesy: Java Data Mining: Strategy, Standard, and Practice by Hornick, Venkayala, and Marcade) are pretty much self-explanatory.
Customer clusters for the above set of data:
Cluster Model – Histograms
Clustering is an unsupervised learning method. The result (the clustering, the partition) is based solely on the object representation, the similarity measure, and the clustering algorithm. If these correspond to the user's understanding, the result may well be an intuitive and useful clustering. One must keep in mind, though, that clustering algorithms always produce clusterings, even when this is not justified, and that in most cases there exist many relevant clusterings of a set of complex objects.
There are several different approaches to the computation of clusters. Clustering algorithms may be characterized as:

Hierarchical — Groups data objects into a hierarchy of clusters. The hierarchy can be formed top-down or bottom-up. Hierarchical methods rely on a distance function to measure the similarity between clusters.

Partitioning — Partitions data objects into a given number of clusters. The clusters are formed in order to optimize an objective criterion such as distance.

Locality-based — Groups neighboring data objects into clusters based on local conditions.

Grid-based — Divides the input space into hyper-rectangular cells, discards the low-density cells, and then combines adjacent high-density cells to form clusters.
Let's look at some algorithms, starting with the basic and popular k-means algorithm for clustering.
k-Means Algorithm:
The k-means algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters, provided there are enough distinct cases. In the basic version of k-means, we first choose K initial centroids, where K is a user-specified parameter, viz., the number of clusters desired. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We repeat the assignment and update steps until no point changes clusters, or equivalently, until the centroids remain the same.
Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. The distance metric is either Euclidean, Cosine, or Fast Cosine distance. Data points are assigned to the nearest cluster according to the distance metric used.
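For illustration, here is a minimal sketch of two of these measures, Euclidean and cosine distance, in plain Python (this is my own illustrative code, not ODM's implementation; the function names are mine):

```python
# Illustrative sketch (not ODM code): two of the distance measures
# mentioned above, using only the Python standard library.
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (0.0, 3.0), (4.0, 0.0)
print(euclidean_distance(p, q))  # 5.0
print(cosine_distance(p, q))     # 1.0 (orthogonal vectors)
```

Note that the two metrics can disagree: cosine distance cares only about direction, so two documents of very different lengths but similar term proportions are "close" under cosine even when they are far apart in Euclidean terms.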
Basic k-means is formally described below:
1. Select K points as initial centroids.
2. repeat:
3. Form K clusters by assigning each point to its closest centroid.
4. Recompute the centroid of each cluster.
5. until: Centroids do not change.
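The steps above can be sketched as a short, self-contained Python implementation (a minimal illustration assuming points are numeric tuples; real systems would use an optimized library, and this is not the ODM implementation):

```python
# A minimal, illustrative implementation of the basic k-means steps above
# (pure Python; not an optimized or production implementation).
import math
import random

def kmeans(points, k, max_iters=100, seed=42):
    rng = random.Random(seed)
    # Step 1: select K points as initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 5: stop when the centroids do not change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

With two well-separated groups of three points each, the algorithm converges to the natural 3/3 split in a couple of iterations regardless of which initial centroids are sampled.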
The operation of the above k-means algorithm is illustrated in the figure below, which shows how, starting from three centroids, the final clusters are found in four assignment-update steps. In these and other figures displaying k-means clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the points to those centroids. The centroids are indicated by the "+" symbol; all points belonging to the same cluster have the same marker shape.
(Courtesy: Introduction to Data Mining by Tan, Steinbach, and Kumar)
Now, let's look at the enhanced k-means algorithm supported by Oracle Data Mining (ODM) for clustering.
Enhanced k-means:

Builds models in a hierarchical manner. The algorithm builds the model top-down using binary splits, with refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-means algorithm. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree evolves. The whole tree is returned.

Grows the tree one node at a time (unbalanced approach). Based on a user setting available in either of the programming interfaces, the node with the largest variance is split to increase the size of the tree until the desired number of clusters is reached. The maximum number of clusters is specified in the build setting for clustering models, CLUS_NUM_CLUSTERS.

Provides probabilistic scoring and assignment of data to clusters.

Returns, for each cluster, a centroid (cluster prototype), histograms (one for each attribute), and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes or the mean and variance for numerical attributes.
This approach to k-means avoids the need for building multiple k-means models and provides clustering results that are consistently superior to those of traditional k-means.
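Oracle's exact implementation is internal, but the variance-driven, bisecting strategy described above can be sketched roughly as follows (an illustrative approximation, not ODM code; all names are mine): repeatedly split the cluster with the largest variance using 2-means until the desired number of clusters is reached.

```python
# Illustrative sketch of a bisecting k-means strategy similar in spirit to
# the hierarchical, unbalanced growth described above (not Oracle's code).
import math
import random

def two_means(points, seed=0, max_iters=50):
    """Split one set of points into two clusters with basic 2-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)
    for _ in range(max_iters):
        clusters = ([], [])
        for p in points:
            i = 0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            clusters[i].append(p)
        new = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return clusters

def variance(points):
    """Total squared distance of a cluster's points from its centroid."""
    centroid = tuple(sum(x) / len(points) for x in zip(*points))
    return sum(math.dist(p, centroid) ** 2 for p in points)

def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        # Grow the tree one node at a time: split the cluster with
        # the largest variance (the "unbalanced" approach).
        worst = max(clusters, key=variance)
        clusters.remove(worst)
        left, right = two_means(worst)
        clusters.extend([list(left), list(right)])
    return clusters

points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20), (5, 21)]
result = bisecting_kmeans(points, k=3)
print(sorted(len(c) for c in result))  # [2, 2, 2]
```

One design consequence of this strategy: because only one node is split per step, intermediate trees with fewer clusters come for free, which is why the hierarchical approach avoids building multiple independent k-means models.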
I used the Oracle Data Mining (ODM) APIs, which implement the enhanced k-means algorithm described above, for k-means clustering. ODM also provides an implementation of the O-Cluster algorithm for clustering.
Time and Space Complexity:
Since only the centroids and the data points are stored, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time required (time complexity) is modest too: O(I * K * m * n), where I is the number of iterations required for convergence.
'I' is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Hence, k-means is linear in m, the number of points, provided K, the number of clusters, is significantly less than m.
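A quick back-of-the-envelope check of the O(I * K * m * n) figure (the sizes below are made up purely for illustration):

```python
# Back-of-the-envelope check of the time complexity discussed above,
# counting approximate point-attribute distance operations.
# The sizes are hypothetical, chosen only to show linear growth in m.
def kmeans_cost(m, n, K, I):
    """Approximate operation count for k-means: O(I * K * m * n)."""
    return I * K * m * n

base = kmeans_cost(m=100_000, n=50, K=10, I=20)
double = kmeans_cost(m=200_000, n=50, K=10, I=20)
print(double / base)  # 2.0 -> doubling the points doubles the work
```

Doubling m doubles the cost while K, n, and I stay fixed, which is what "linear in the number of points" means in practice.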
Text Clustering
Text clustering is a specific task in text mining, the process of deriving high-quality information from text. It is used for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering.
The main applications of clustering in text mining are:
■ Simple clustering. This refers to the creation of clusters of text features. For example: grouping the hits returned by a search engine.
■ Taxonomy generation. This refers to the generation of hierarchical groupings. For example: a cluster that includes text about car manufacturers is the parent of child clusters that include text about car models.
■ Topic extraction. This refers to the extraction of the most typical features of a group. For example: the most typical characteristics of documents in each document topic.
Let's talk about simple clustering a bit. Suppose you have a call center web application, and your call center reps enter comments on your site about a particular offering (product), or about the call center experience in general; those comments are recorded in your database tables. To make sense of this unstructured data, i.e. to learn which group your customers belong to (HAPPY, SATISFIED, UNHAPPY, etc.) or a customer's level of acceptance of a particular product, and to come up with a marketing strategy to improve the call center experience or product usability, you want to mine the comments (text) and apply the k-means algorithm to group them. Remember, we're talking about hundreds of thousands of customers and comments, and the data is unstructured (comments can contain letters, numbers, special characters, images, etc.). Traditional hand-written code to retrieve and group the data is out of contention at such a scale.
To apply a clustering algorithm such as k-means to the above scenario, the text must first undergo a special preprocessing step known as term extraction or feature extraction. This process breaks the text down into units (terms) that can be mined. Text terms may be keywords or other document-derived features. Parameters like semantic similarity, similar words, stop words, and weighting schemes like TF-IDF may be considered while applying a term extraction algorithm like NMF. You may want to refer to the Text Mining section in part 1 of this series for some theory.
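As a rough illustration of what term extraction with TF-IDF weighting might look like (a simplified sketch, not Oracle Text or the ODM term extraction classes; the stop-word list and sample comments are made up):

```python
# Illustrative term extraction sketch (not the ODM implementation):
# tokenize comments, drop stop words, and weight terms with TF-IDF.
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "was", "very"}  # hypothetical list

def tokenize(text):
    """Lowercase the text and keep only alphabetic, non-stop-word terms."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

comments = [
    "The product is great and support was great",
    "Very unhappy, the product failed",
    "Support was helpful and the call was quick",
]
vectors = tfidf(comments)
print(vectors[1]["unhappy"] > 0)  # True: a rare term gets a positive weight
```

The resulting term vectors are exactly the kind of numeric representation a distance-based algorithm like k-means can cluster, which is why term extraction has to happen before the clustering step.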
I used the ODM APIs. The Oracle enhanced k-means algorithm (described above) supports text mining. The Oracle Data Miner graphical tool performs term extraction transparently when you create or apply a text mining model. The Oracle Data Mining Java API provides term extraction classes for use in Java applications. If you are using the PL/SQL API, you can use a set of Oracle Text table functions to extract the terms. For screenshots of the clustering model and the term extraction model, please watch out for part 3 in this series.
Suppose you want to predict if customers will increase spending with an affinity card. You want to include the comments field from a customer survey as one of the predictors. Before building the classification model, you want to create a reduced set of text features to improve the predictive power of the comments. To do this, you will create a feature extraction model.