As I had mentioned in one of my previous posts that I would talk about my Master’s dissertation in my future posts, I’m happy that now is the time. The dissertation was titled “Text Clustering in a Call Centre Application using Data Mining”. My primary motivation was to explore data mining concepts and in particular clustering and see how it can connect to real-world problems in CRM domain. Being a java developer, I found it easier to quickly use Oracle Data Mining (ODM) API built on top of Java Data Mining (JSR Spec – http://www.jcp.org/en/jsr/detail?id=247) for my proof-of-concept.

In this three-part series of “My Master’s Dissertation, Revisited”, I’ll give a quick overview of data mining, text mining(in particular) and some terminology used in this series, in the first part. In part two of this series, I’ll talk about Text Clustering, enhanced k-means algorithm for clustering, and the sample that I used. I’ll wrap up this series by explaining as to how I applied k-means model, the code that I customized, and the results with a link to the sample code zip file.

I’ll be focusing on data mining concepts and functions relevant to my dissertation in this series. For info on other mining functions or more details, please refer to my previous post on the topic.

### What is Data Mining? Why is it used?

Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis and simple querying. Data Mining is, in plain terms, making sense of the data. The key properties of data mining are:

#### i) Automatic discovery of patterns **: **

Data mining is accomplished by building models. A model uses an algorithm to act on a set of data. The notion of automatic discovery refers to the execution of the data mining models.

#### ii) Prediction of likely outcomes:

A model might predict income based on education and other demographic factors. A data mining technique can generate a rule might specify that a person who has a bachelor’s degree and lives in a certain neighborhood is likely to have an income greater than the regional average.

#### iii) Creation of actionable information:

A town planner might use a model that predicts income based on demographics to develop a plan for low-income housing. A car leasing agency might a use model that identifies customer segments to design a promotion targeting high-value customers.

#### iv) Focus on large data sets and databases:

When you want to retrieve data from thousands or millions of records that have tens or hundreds of columns (attributes or dimensions) like transactional data (eg: customer data) from table, data mining models make lot of sense.

### What it can do and what it cannot?

Data mining discovers hidden information in your data, but it cannot tell you the value of the information to your organization. You might already be aware of important patterns as a result of working with your data over time. Data mining can confirm or qualify such empirical observations in addition to finding new patterns that may not be immediately discernible through simple observation.

### Steps in Data Mining Process:

The steps in the process of data mining can diagrammatically be represented as:

#### 1. Problem Definition:

This initial phase of a data mining project focuses on understanding the project objectives and requirements.

For example, your business problem might be: “How can I sell more of my product to customers?” You might translate this into a data mining problem such as: “Which customers are most likely to purchase the product?” A model that predicts who is most likely to purchase the product must be built on data that describes the customers who have purchased the product in the past.

#### 2. Data Gathering and Preparation:

The data understanding phase involves data collection and exploration. As you take a closer look at the data, you can determine how well it addresses the business problem. You might decide to remove some of the data or add additional data. This is also the time to identify data quality problems and to scan for patterns in the data. For example, you might transform a DATE_OF_BIRTH column to AGE; you might insert the average income in cases where the INCOME column is null.

#### 3. Model Building and Evaluation:

In this phase, you select and apply various modeling techniques and calibrate the parameters to optimal values. If the algorithm requires data transformations, you will need to step back to the previous phase to implement them.

In preliminary model building, it often makes sense to work with a reduced set of data (fewer rows in the case table), since the final case table might contain thousands or millions of cases. Based on the extent to which the model has to be improved, we can increase the number of attributes.

#### 4. Knowledge Deployment:

Knowledge deployment is the use of data mining within a target environment. In the deployment phase, insight and actionable information can be derived from data.

Deployment can involve scoring (the application of models to new data), the extraction of model details (for example the rules of a decision tree), or the integration of data mining models within applications, data warehouse infrastructure, or query and reporting tools. For example, a sales representative could run a model that predicts the likelihood of fraud within the context of an online sales transaction.

### Data Mining Functions

Data mining functions can be broadly classified into Supervised and Unsupervised mining functions. Each data mining function uses one or more algorithms to build a model.

The building of a supervised model involves training, a process whereby the software analyzes many cases where the target value is already known. In the training process, the model “learns” the logic for making the prediction. For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by mining functions analyzing the characteristics of many customers who are known to have responded or not responded to a promotion in the past.

The building of an unsupervised model doesn’t involve training, as there is no previously-known result to guide the algorithm in building the model. For example, clustering models use descriptive data mining techniques, but they can be applied to classify cases according to their cluster assignments. Anomaly detection, although unsupervised, is typically used to predict whether a data point is typical among a set of cases.

Let’s have a quick look at how can these functions help a data miner achieve and the business context in which they’re used.

#### Association:

The goal of the Associations mining function is to find items that are consistently associated with each other in a meaningful way. For example, you can analyze purchase transactions to discover combinations of goods that are often purchased together. The Associations mining function answers the question: If certain items are present in a transaction, what other item or items are likely to be present in the same transaction?

#### Classification:

The process of automatically creating a model of classes from a set of records that contain class labels. The classification technique analyzes records that are already known to belong to a certain class, and creates a profile for a member of that class from the common characteristics of the records. You can then use a data mining scoring tool to apply this Classification model to new records, that is, records that have not yet been classified. This enables you to predict if the new records belong to that particular class.

Commercial applications of this mining function include credit card scoring, ranking of customers for directed mailing, the prediction of credit risk in new customers, and attrition prediction.

#### Regression:

Regression is similar to classification except for the type of the predicted value. Classification predicts a class label, regression predicts a numeric value. Regression also can determine the input fields that are most relevant to predict the target field values. The predicted value might not be identical to any value contained in the data that is used to build the model. An example application is customer ranking by expected profit.

#### Clustering:

The Clustering mining function includes algorithms like k-means, O-cluster etc. It groups similar customers together. At the same time it maximizes the differences between the different customer groups that are formed in this way.

The groups that are found are known as clusters. Each cluster tells a specific story about customer identity or behavior, for example, about their demographic background, or about their preferred products or product combinations. In this way, customers that are similar are grouped together in homogeneous groups that are then available for marketing or for other business processes.

In the commercial environment, clustering is used in the areas like: Cross-marketing, Cross-selling, Customizing marketing plans for different customer types, Deciding which media approach to use, Understanding shopping goals, and Many other areas.

Clustering can also be used for anomaly detection. Once the data has been segmented into clusters, you might find that some cases do not fit well into any clusters. These cases are anomalies or outliers.

The prerequisites to understand text clustering and in particular k-means algorithm as well as some other mining functions are: Understanding of Numerical, Categorical Attributes, Data Transformation techniques, Euclidean Distance, Document Matrix, Vector Space Model (VSM), Text Mining. See section ‘Some Terminology’ below.

### Some Terminology:

**Numerical Attribute:** An attribute whose values are numbers. The numeric value can be either an integer or a real number.

**Categorical Attribute:** An attribute where the values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either non-ordered (nominal) like state, gender, and so on, or ordered (ordinal) such as high, medium, or low temperatures.

**Binning:** This transformation maps a set of input values to a smaller set of bins. The input values may be discrete or continuous.

**Normalization:** It maps numerical values to a particular numerical range, typically [0…1]. There are several types of normalization (e.g., z-score, min-max, and shift-scale).

**Explosion:** This transformation applies only to discrete data such as strings or numbers representing individual categories. The goal is to transform such attributes into numerical attributes.

**Support:** The ratio of the number of occurrences of R, given all occurrences of all rules.

**Confidence:** The confidence of a rule X->Y, is the ratio of the number of occurrences of Y given X, among all other occurrences given X.

**Euclidean Distance:** The Euclidean distance or Euclidean metric is the “ordinary” distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. In Cartesian coordinates, if p = (p1, p2,…, pn) and q = (q1, q2,…, qn) are two points in Euclidean n-space, then the distance from p to q, or from q to p is given by:

**d(p, q) = d(q, p) = √ (q1 – p1)2 + (q2 – p2)2 + ….. + (qn – pn)2 = √∑( qi – pi)2**

**For N dimensions**

In general, for an n-dimensional space, the distance is:

**d(p, q) = √ (p1 – q1)2 + (p2 – q2)2 + … + (pi – qi)2 + … + (pn – qn)2 = √∑( qi – pi)2**

**Vector Space Model:** In the vector space model each text in a set of texts is represented by a vector in a high-dimensional space, with as many dimensions as the number of different words in the set. Each text gets weights (values) in the indices (dimensions) based on what words appear in them. These weights model how important the corresponding word is deemed to be to explain the content of the text. They are dependent on whether (and how often) the word appears in the document and in the entire set. Texts whose vectors are close to each other in this space are considered being similar in content.

## Text Mining

Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. ‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Unstructured data like ppt, audio, images, call center notes etc can be converted into numerical attribute using Text Mining. The term extraction process might involve application of NMF (Non-Negative Matrix Factorization) algorithms and treats the text in each row of the original table as a separate document. Each document is transformed to a set of terms that have a numeric value and a text label.

Most of the text-mining methods are grounded in the term-based VSM to calculate the distance or similarity between documents. Let’s look at some definitions for using document representation of vector.

A document is commonly represented as a vector of terms in a VSM. The basis of the vector collection. In a simple way, we can use a word to be a term. Yet, morphological variants like ’actor’, ’action’ and ’acting’ are so closely related that they are usually conflated into a single word stem, e.g., ’act’ by stemming [4], [5]. After stemming, two word stems are treated as unrelated if they are different. For example, the stem of ’suggestion’ and ’advice’ are usually considered unrelated despite of their apparent relationship. Since different word stems are considered unrelated, i.e., independent, the base vectors in VSM are orthogonal to each other.

A set of documents X in traditional term-based VSM is represented by X = {X1, X2…… Xn}. One document is represented by Xj = {xj1, xj2,…..xj,m}. Terms in the vocabulary extracted from the corresponding document collection are represented by {t1, t2…tm}.

Where xjii is the weight of the term tii in document Xj and is usually determined by the number of times ti appears in Xj (known as the term frequency). Other weighting scheme like TF-iDF can also be applied.

Based on the definition of VSM, distance between two text documents X1 and X2 can be easily computed. There are many methods to measure this distance, such as: Cosine similarity and Minkowski distance including Euclidean distance, Manhattan distance and Maximum distance. Here, we give the definition of Euclidean distance which is effective and frequently used in text clustering.

Euclidean distance of two documents X1 and X2 in the VSM, is defined as:

**D(X1, X2) = √( X1 – X2 ) ( X1 – X2 ) T**

**= √i=1∑m( x1i-x2i)2**

The smaller the value of d(X1, X2) is, the more similar the two documents are. From above equation, we can see that this distance definition does not take into account any patterns of term correlation that exist in the real-world data.