Sanstech

Ideas, Knowledge, Technology, Computer Science, Experience associated with my work and some geeky stuff I progressively encounter during my journey towards enlightenment. Read on…

  • RSS RSS Feed

    • Cloud Computing
      It’s been really long, since I last wrote a tech post. In this post, I’m just sharing few useful links to get started on Cloud Computing, whether you’re a developer, quality engineer, business leader, or a project manager intending to get started with Cloud Computing. I’m currently designing systems and services, for a platform that’s […]
    • The Pragmatic Programmer
      I finished reading The Pragmatic Programmer by Andrew Hunt and David Thomas. It’s not a new book in the market but I was curious to read this. The technology topics covered, are not any different from those found in most software engineering books, but the way they’re presented using Pragmatic Philosophy Approach, is remarkable. Code […]
    • 2013 in review
      The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog. Here’s an excerpt: A San Francisco cable car holds 60 people. This blog was viewed about 1,200 times in 2013. If it were a cable car, it would take about 20 trips to carry that many people. Click here to see the […]
    • Goodbye, Ness!
      It had to happen sometime. I thought Feb 2013 was the right time. I quit Ness after a long 5 years and 4 months of stay, in Feb. I joined FICO (formerly, Fair Isaac) last Feb.  While I get an opportunity to work with many varied stakeholders like Scientists, Architect, Product Management, Peer Developers, PMO, Technical Publications and also […]
    • Meta: information retrieval from Tweets
      I pick significant problems randomly sometimes and enjoy solving them, or at least attempt designing api :-). Here’s one such problem! Problem: How’d you go about finding meta information about a person’s tweets? NOTE: a) Tweet == Twitter updates b) Meta information –> Loosely defined. You can interpret it anyway you want –> Frequency, topics, follower […]
  • Twitter Updates

Archive for the ‘Research’ Category

Understanding Big Data

Posted by sanstechbytes on September 13, 2012

It’s been a while, since I last posted!

To keep this rolling (I’m hardly getting any time to post my own articles or stuff about my experiments these days 😦  ), I just wanted to share this ebook on Big Data titled  Understanding Big Data: Analytics for Enterprise Class and Streaming Data.

Cheers!

Advertisements

Posted in Data Mining, Research | Leave a Comment »

My Master’s Dissertation, Revisited Series – Part 3

Posted by sanstechbytes on May 1, 2012

In this final part of the series, let us look at the choices of implementation that I made and the results of my work. I thought I would rather share that as a stuff, part of the download.

Although I liked world-leading data mining open-source software like WEKARapid Miner, I don’t recall why I stuck to exploring more of a trial version of commercial Oracle Data Mining Suite. We used Oracle 10g at work; we use SAS Enterprise Miner for data mining needs. Intuitively, it should have been with the intention of seeing how Oracle Data Mining fitted the bill in that context. It turns out that last statement was too much of a speculative future then.

The repository for all my work along with other useful artifacts can be downloaded here(around 93 MB, I think!!!).

When you’ve downloaded RAR file –

For my explanation of how I’ve used ODM API’s, refer to dissertationrepository\results\My Dissertation POC Demo.docx.

For detailed steps on installation of database, creating user, sample schemas etc.:  refer to Oracle Data Mining Administrator’s Guide can be located in RAR  file at: dissertationrepository\pdf\datamining_admin_guide.pdf.

For installing softwares like odminer and importing java project: refer to dissertationrepository\software. Perhaps, oracle database can be downloaded from here.

For screenshots and results of my POC work,  refer to:  dissertationrepository\results\My Dissertation POC Demo.docx

For other sample projects, books, ideas, papers etc: refer to dissertationrepository\extra

REFERENCES

1. Campos, M.M., Milenova, B.L., “O-Cluster: Scalable Clustering of Large High

Dimensional Data Sets”, Oracle Data Mining Technologies, 10 Van De Graaff

Drive, Burlington, MA 01803.

2. Oracle Data Miner –  http://www.oracle.com/technology/products/bi/odm/

3. JSR Specifications – http://jcp.org/en/jsr/detail?id_247

4. Introduction to Data Mining by Vipin Kumar, Michael Steinbach, Pang-ning Tan, Addison Wesley, 2006.

5. Java Data Mining: Standard, Theory and Practice: A Practical Guide for Architecture, Design and Implementation by Mark Hornick, Erik Marcade, and Sunil Venkayala, Morgan Kauffman Publishers, 2007

6. Collective Intelligence in Action by Satnam Alag, Manning Publications, 2007

7. Using text mining to understand the call center customers’ claims http://library.witpress.com/pages/PaperInfo.asp?PaperID=16715 by G. M. Caputo, V. M. Bastos & N. F. F. Ebecken.

8. Oracle Database 11g / Data Mining API Documentation

9.  KD Nuggets www.kdnuggets.com

10. ACM www.acm.org

Posted in Data Mining, Research | Leave a Comment »

My Master’s Dissertation, Revisited Series – Part 2:

Posted by sanstechbytes on April 30, 2012

In part 1, we had a quick overview of data mining and text (data) mining and some terminology used. Let’s explore clustering in this part, some of the clustering algorithms and text clustering using text mining.

In clustering, the objective is to partition unstructured set of objects to clusters or groups. We often want that the objects to be as similar as possible to objects in the same cluster and as dissimilar to objects from other clusters as possible.

Clustering vs. Categorization

By automatic categorization, we mean to let a machine decide to which of a set of predefined categories a text belongs. In clustering, the machine decides how a given text set should be partitioned. Categorization is suitable when one wants to categorize new texts according to a known categorization, clustering when one wants to discover new structures not previously known. Both methods may give interesting results on an unknown text set; categorization sorts them according to a well-known structure, clustering displays the structure of the particular set. This thesis deals with clustering of texts.

Let’s look at an example for visualization of Clusters for Customer groups based on their age and income, for simplicity. I think the diagrams (Courtesy: Java Data Mining : Strategy, Standard and Practice book by Hornick, Venkayala, Marcade) are pretty much self-explanatory.

Customer Clusters for above set of data:

Cluster Model – Histograms

 

Clustering is an unsupervised learning method. The result (the clustering, the partition) is based solely on the  object representation, the similarity measure and the clustering algorithm. If these correspond to the users understanding the result might well be an intuitive and useful clustering. One must keep in mind, though, that clustering algorithms always produce clusterings, even when this is not justified, and that there in most cases exist many relevant clusterings of a set of complex objects.

There are several different approaches to the computation of clusters. Clustering algorithms may be characterized as:

  • Hierarchical — Groups data objects into a hierarchy of clusters. The hierarchy can be formed top-down or bottom-up. Hierarchical methods rely on a distance function to measure the similarity between clusters.

  • Partitioning — Partitions data objects into a given number of clusters. The clusters are formed in order to optimize an objective criterion such as distance.

  • Locality-based — Groups neighboring data objects into clusters based on local conditions.

  • Grid-based — Divides the input space into hyper-rectangular cells, discards the low-density cells, and then combines adjacent high-density cells to form clusters.

Let’s look at some algorithms. Let’s explore basic and popular k-Means algorithm for clustering.

k-Means Algorithm:

The k-Means algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters provided there are enough distinct cases. In the basic version of the k-means, we first choose K initial centroids, where K is a cluster specified parameter, viz., the number of clusters desired. Each point is then assigned to the closest centroid, and each collections of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the the cluster. We repeat the assignment and update steps until no point changes clusters, or equivalently, until the centroids remain the same.

Distance-based algorithms rely on distance metric (function) to measure the similarity between data points. The distance metric is either Euclidean, Cosine, or Fast Cosine distance. Data points are assigned to the nearest cluster according to the distance metric used.

Basic k-means is formally described as below:

1. Select K points as initial Centroids

2. repeat:

3.    Form K clusters by assigining each point to it’s closest centroid..

4.    Recompute the centroid of each cluster.

5. until: Centroids do not change

The operation of above k-means algorithm is illustrated in below figure, which shows, how starting from three centroids, the final clusters are found in four assignment-update steps. In these and other figures displaying k-means clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the points to those centroids. The centroids are indicated by the “+” symbol; all points belonging to the same cluster have the same marker shape.

(Courtesy: Introduction to Data Mining: Tan, Vipin Kumar, Steinbach)

Now, let’s look at enhanced k-Means algorithm supported by ODM for clustering.

Enhanced k-means –

  • Builds models in a hierarchical manner. The algorithm builds a model top down using binary splits and refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-means algorithm. The centroid of the inner nodes in the hierarchy is updated to reflect changes as the tree evolves. The whole tree is returned.

  • Grows the tree one node at a time (unbalanced approach). Based on a user setting available in either of the programming interfaces, the node with the largest variance is split to increase the size of the tree until the desired number of clusters is reached. The maximum number of clusters is specified in the build setting for clustering models, CLUS_NUM_CLUSTERS.

  • Provides probabilistic scoring and assignment of data to clusters.

  • Returns, for each cluster, a centroid (cluster prototype), histograms (one for each attribute), and a  rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes or the mean and variance for numerical attributes.

This approach to k-means avoids the need for building multiple k-means models and provides clustering results that are consistently superior to the traditional k-means.

I used Oracle Data Mining (ODM) API’s that use enhanced k-means algorithm described above, for k-means clustering. O-Cluster algorithm implementation is also provided in ODM for clustering.

Time and Space Complexity:

Since only the centroids and data points are stored, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time required (time complexity) is modest too, and can be determined to be: O(I * K * m * n), where I is the number of iterations required for convergence.

‘I’ is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Hence, K-means is linear in m, the number of points, provided K, no of clusters, is significantly less than m.

Text Clustering 

Text Clustering is a specific task in text mining that refers to the process of deriving high-quality information from text and is used for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.

The main applications of clustering in text mining are:
■  Simple clustering. This refers to the creation of clusters of text features .  For example: grouping the hits returned by a search engine.

■  Taxonomy generation. This refers to the generation of hierarchical groupings. For example: a cluster that includes text about car manufacturers is the parent of child clusters that include text about car models.

■  Topic extraction. This refers to the extraction of the most typical features of a group. For example: the most typical characteristics of documents in each document topic.

Let’s talk about Simple Clustering a bit. Suppose you’ve a call center web application and your call center reps enter the comments about a particular offering (product) or in general about the call center experience on your site, and those comments are recorded in your database tables. To make sense of this unstructured data,  i.e. to know which group your customers belong to – HAPPY, SATISFIED, UNHAPPY etc or customer’s level of acceptance about a particular product, and come up with some marketing strategy to improve  customer call center experience or product usability, you want to mine the comments (text) and apply k-means algorithm to group comments. Remember, we’re talking about hundreds of thousands of customers and comments and add to that unstructured data (Comments can contain alphabets, numbers, special chars, image etc). Any traditional smart code to retrieve data and group them, is not apt here and is out of contention, for such a scale.

To apply clustering algorithm for above scenario, say k-means, you need to have the text undergo a special pre-processing step known as Term Extraction or Feature Extraction. This process breaks the text down into units (terms) that can be mined. Text terms may be keywords or other document-derived features . The parameters like Semantic similarity, similar words stop words  and weighting schemes like TF-iDF may be considered while applying a text extraction algorithm like NMF (refer to part 1). You may want to refer to Text Mining section in part 1 of this series, for some theory.

I used ODM API’s. Oracle enhanced k-means algorithm (see below for this) supports text mining. Oracle Data Miner graphical tool performs term extraction transparently when you create or apply a text mining model. The Oracle Data Mining Java API provides term extraction classes for use in Java applications. If you are using the PL/SQL API, you can use a set of Oracle Text table functions to extract the terms. For screenshots of clustering model and Term Extraction model, please watch out for part 3 in this series.

Suppose you want to predict if customers will increase spending with an affinity card. You want to include the comments field from a customer survey as one of the predictors. Before building the classification model, you want to create a reduced set of text features to improve the predictive power of the comments. To do this, you will create a feature extraction model.

In the next part, we’ll look at ODM API’s and my dissertation repository for code, results, links and pdfs.

Posted in Data Mining, Research | Leave a Comment »

My Master’s Dissertation, Revisited Series – Part 1:

Posted by sanstechbytes on April 28, 2012

As I had mentioned in one of my previous posts that I would talk about my Master’s dissertation in my future posts, I’m happy that now is the time. The dissertation was titled “Text Clustering in a Call Centre Application using Data Mining”. My primary motivation was to explore data mining concepts and in particular clustering and see how it can connect to real-world problems in CRM domain. Being a java developer, I found it easier to quickly use Oracle Data Mining (ODM) API built on top of Java Data Mining (JSR Spec – http://www.jcp.org/en/jsr/detail?id=247) for my proof-of-concept.

In this three-part series of “My Master’s Dissertation, Revisited”, I’ll give a quick overview of data mining, text mining(in particular) and some terminology used in this series, in the first part. In part two of this series, I’ll talk about Text Clustering, enhanced k-means algorithm for clustering, and the sample that I used. I’ll wrap up this series by explaining as to how I applied k-means model, the code that I customized, and the results with a link to the sample code zip file.

I’ll be focusing on data mining concepts and functions relevant to my dissertation in this series. For info on other mining functions or more details, please refer to my previous post on the topic.

What is Data Mining? Why is it used?

Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis and simple querying.  Data Mining is, in plain terms, making sense of the data. The key properties of data mining are:

i) Automatic discovery of patterns

Data mining is accomplished by building models. A model uses an algorithm to act on a set of data. The notion of automatic discovery refers to the execution of the data mining models.

ii) Prediction of likely outcomes:

A model might predict income based on education and other demographic factors. A data mining technique can generate a rule might specify that a person who has a bachelor’s degree and lives in a certain neighborhood is likely to have an income greater than the regional average.

iii) Creation of actionable information:

A town planner might use a model that predicts income based on demographics to develop a plan for low-income housing. A car leasing agency might a use model that identifies customer segments to design a promotion targeting high-value customers.

iv) Focus on large data sets and databases:

When you want to retrieve data from thousands or millions of records that have tens or hundreds of columns (attributes or dimensions) like transactional data (eg: customer data) from table, data mining models make lot of sense.

What it can do and what it cannot?

Data mining discovers hidden information in your data, but it cannot tell you the value of the information to your organization.  You might already be aware of important patterns as a result of working with your data over time. Data mining can confirm or qualify such empirical observations in addition to finding new patterns that may not be immediately discernible through simple observation.

Steps in Data Mining Process:

The steps in the process of data mining can diagrammatically be represented as:

1. Problem Definition:

This initial phase of a data mining project focuses on understanding the project objectives and requirements.

For example, your business problem might be: “How can I sell more of my product to customers?” You might translate this into a data mining problem such as: “Which customers are most likely to purchase the product?” A model that predicts who is most likely to purchase the product must be built on data that describes the customers who have purchased the product in the past.

2. Data Gathering and Preparation:

The data understanding phase involves data collection and exploration. As you take a closer look at the data, you can determine how well it addresses the business problem. You might decide to remove some of the data or add additional data. This is also the time to identify data quality problems and to scan for patterns in the data. For example, you might transform a DATE_OF_BIRTH column to AGE; you might insert the average income in cases where the INCOME column is null.

3. Model Building and Evaluation:

In this phase, you select and apply various modeling techniques and calibrate the parameters to optimal values. If the algorithm requires data transformations, you will need to step back to the previous phase to implement them.

In preliminary model building, it often makes sense to work with a reduced set of data (fewer rows in the case table), since the final case table might contain thousands or millions of cases. Based on the extent to which the model has to be improved, we can increase the number of attributes.

4. Knowledge Deployment:

Knowledge deployment is the use of data mining within a target environment. In the deployment phase, insight and actionable information can be derived from data.

Deployment can involve scoring (the application of models to new data), the extraction of model details (for example the rules of a decision tree), or the integration of data mining models within applications, data warehouse infrastructure, or query and reporting tools. For example, a sales representative could run a model that predicts the likelihood of fraud within the context of an online sales transaction.

Data Mining Functions

Data mining functions can be broadly classified into Supervised and Unsupervised mining functions. Each data mining function uses one or more algorithms to build a model.

The building of a supervised model involves training, a process whereby the software analyzes many cases where the target value is already known. In the training process, the model “learns” the logic for making the prediction. For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by mining functions analyzing the characteristics of many customers who are known to have responded or not responded to a promotion in the past.

The building of an unsupervised model doesn’t involve training, as there is no previously-known result to guide the algorithm in building the model. For example, clustering models use descriptive data mining techniques, but they can be applied to classify cases according to their cluster assignments. Anomaly detection, although unsupervised, is typically used to predict whether a data point is typical among a set of cases.

Let’s have a quick look at how can these functions help a data miner achieve and the business context in which they’re used.

Association:

The goal of the Associations mining function is to find items that are consistently associated with each other in a meaningful way. For example, you can analyze purchase transactions to discover combinations of goods that are often purchased together. The Associations mining function answers the question: If certain items are present in a transaction, what other item or items are likely to be present in the same transaction?

Classification:

The process of automatically creating a model of classes from a set of records that contain class labels. The classification technique analyzes records that are already known to belong to a certain class, and creates a profile for a member of that class from the common characteristics of the records. You can then use a data mining scoring tool to apply this Classification model to new records, that is, records that have not yet been classified. This enables you to predict if the new records belong to that particular class.

Commercial applications of this mining function include credit card scoring, ranking of customers for directed mailing, the prediction of credit risk in new customers, and attrition prediction.

Regression:

Regression is similar to classification except for the type of the predicted value. Classification predicts a class label, regression predicts a numeric value. Regression also can determine the input fields that are most relevant to predict the target field values. The predicted value might not be identical to any value contained in the data that is used to build the model. An example application is customer ranking by expected profit.

Clustering:

The Clustering mining function includes algorithms like k-means, O-cluster etc. It groups similar customers together. At the same time it maximizes the differences between the different customer groups that are formed in this way.

The groups that are found are known as clusters. Each cluster tells a specific story about customer identity or behavior, for example, about their demographic background, or about their preferred products or product combinations. In this way, customers that are similar are grouped together in homogeneous groups that are then available for marketing or for other business processes.

In the commercial environment, clustering is used in the areas like: Cross-marketing, Cross-selling, Customizing marketing plans for different customer types, Deciding which media approach to use, Understanding shopping goals, and Many other areas.

Clustering can also be used for anomaly detection. Once the data has been segmented into clusters, you might find that some cases do not fit well into any clusters. These cases are anomalies or outliers.

The prerequisites to understand text clustering and in particular k-means algorithm as well as some other mining functions are: Understanding of Numerical, Categorical Attributes, Data Transformation techniques, Euclidean Distance, Document Matrix, Vector Space Model (VSM), Text Mining. See section ‘Some Terminology’ below.

Some Terminology:

Numerical Attribute: An attribute whose values are numbers. The numeric value can be either an integer or a real number.

Categorical Attribute: An attribute where the values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either non-ordered (nominal) like state, gender, and so on, or ordered (ordinal) such as high, medium, or low temperatures.

Binning:  This transformation maps a set of input values to a smaller set of bins. The input values may be discrete or continuous.

Normalization: It maps numerical values to a particular numerical range, typically [0…1]. There are several types of normalization (e.g., z-score, min-max, and shift-scale).

Explosion: This transformation applies only to discrete data such as strings or numbers representing individual categories. The goal is to transform such attributes into numerical attributes.

Support: The ratio of the number of occurrences of R, given all occurrences of all rules.

Confidence: The confidence of a rule X->Y, is the ratio of the number of occurrences of Y given X, among all other occurrences given X.

Euclidean Distance: The Euclidean distance or Euclidean metric is the “ordinary” distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. In Cartesian coordinates, if p = (p1, p2,…, pn) and q = (q1, q2,…, qn) are two points in Euclidean n-space, then the distance from p to q, or from q to p is given by:

d(p, q) = d(q, p) = √ (q1 – p1)2 + (q2 – p2)2  + ….. + (qn – pn)2  =  √∑( qi – pi)2

For N dimensions

In general, for an n-dimensional space, the distance is:

d(p, q) = √ (p1 – q1)2 + (p2 – q2)2  + … + (pi – qi)2  + … + (pn – qn)2  =  √∑( qi – pi)2

Vector Space Model: In the vector space model each text in a set of texts is represented by a vector in a high-dimensional space, with as many dimensions as the number of different words in the set. Each text gets weights (values) in the indices (dimensions) based on what words appear in them. These weights model how important the corresponding word is deemed to be to explain the content of the text. They are dependent on whether (and how often) the word appears in the document and in the entire set. Texts whose vectors are close to each other in this space are considered being similar in content.

Text Mining

Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. ‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment  analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Unstructured data like ppt, audio, images, call center notes etc can be converted into numerical attribute using Text Mining. The term extraction process might involve application of NMF (Non-Negative Matrix Factorization) algorithms and treats the text in each row of the original table as a separate document. Each document is transformed to a set of terms that have a numeric value and a text label.

Most of the text-mining methods are grounded in the term-based VSM to calculate the distance or similarity between documents. Let’s look at some definitions for using document representation of vector.

A document is commonly represented as a vector of terms in a VSM. The basis of the vector collection. In a simple way, we can use a word to be a term. Yet, morphological variants like ’actor’, ’action’ and ’acting’ are so closely related that they are usually conflated into a single word stem, e.g., ’act’ by stemming [4], [5]. After stemming, two word stems are treated as unrelated if they are different. For example, the stem of ’suggestion’ and ’advice’ are usually considered unrelated despite of their apparent relationship. Since different word stems are considered unrelated, i.e., independent, the base vectors in VSM are orthogonal to each other.

A set of documents X in traditional term-based VSM is represented by X = {X1, X2…… Xn}. One document is represented by Xj = {xj1, xj2,…..xj,m}. Terms in the vocabulary extracted from the corresponding document collection are represented by {t1, t2…tm}.

Where xjii is the weight of the term tii in document Xj and is usually determined by the number of times ti appears in Xj (known as the term frequency). Other weighting scheme like TF-iDF can also be applied.

Based on the definition of VSM, distance between two text documents X1 and X2 can be easily computed. There are many methods to measure this distance, such as: Cosine similarity and Minkowski distance including Euclidean distance, Manhattan distance and Maximum distance. Here, we give the definition of Euclidean distance which is effective and frequently used in text clustering.

Euclidean distance of two documents X1 and X2  in the VSM, is defined as:

D(X1, X2) = √( X1 – X2 ) ( X1 – X2 )   T

     = √i=1∑m( x1i-x2i)2

The smaller the value of d(X1, X2) is, the more similar the two documents are. From above equation, we can see that this distance definition does not take into account any patterns of term correlation that exist in  the real-world data.

Posted in Data Mining, Research | 1 Comment »

Researching Research…

Posted by sanstechbytes on March 31, 2012

In a world of rapidly developing technologies, the citizens that shape up the technology landscape require a combination of at least a two or more skills:

a)  Hard (Technical) Skills
b)  Soft Skills (Leadership, Motivation, Emoti. Maturity, Communication etc.)
c)  Ability to acquire new skills.

I strongly believe that the third skill  above, is significantly enhanced by research and have experienced it myself. Given the complexity of nature of interaction of components, different subsystems and speed with which the technology landscape evolves or changes, the ability to acquire new skills to relate to that landscape, is a key attribute for anyone.

What is research after all? We often hear the word ‘research’ being used in R&D organizations and academia, and also not so often in Service Organizations. In a very literal sense, in R&D organizations and academia, research mostly refers to the first-class research that’s driven by the the sheer new idea, and the efforts converging to a well-defined charter and eventually, resulting in filing of a patent, in publications, in research journals etc. It’s mostly used when there’s a talk about the work based on some innovative idea. As Mihaly Csikszentmihalyi puts it, In a very literal sense, when the work done during research draws attention of or is submitted to, the body constituted by a group of people registered as experts in the that field, that have the authority to approve or disapprove that work. In Service Organizations, it’s used in the context of prototyping task with lot of unknowns to start with, in the context of troubleshooting an issue or improving performance of a system. One of the myths, is that, research is an activity with no realistic goal or objective.

As Uday Khedkar of IIT Bombay, puts it in one of his presentations on the topic: “Research could involve any of the below or none of the below activities in our world like – Building a new device, Writing a new software, Repairing a device or Debugging a software, Drawing a conclusion from a lot of data, Proving a theorem, Formulating a theorem”. With different meanings attributed to the word ‘research’,  research, in essence, is a game of innovative ideas that are significant. The significance of ideas could lie in any of – Beauty, Utility, Enhancement of Knowledge.

The ingredients of good research are:
a) Innovation
– An idea that changes one’s behavior contrary to invention that doesn’t change one’s behavior. Idea is the most generated entity in the world. But, if doesn’t change one’s behavior, although it
embarks a new way of thinking and highlights a new way of building things, it remains as an invention.

b) Aesthetics
As Henry Poincare puts it, “Scientists study science not because it is useful, but because it is
beautiful. Here I do not talk about the beauty of appearance or beauty of qualities . . . Here I talk about that profound beauty which comes from a harmonious order of parts . . .”. Take painting or any design activity.
c) Other important aspects viz., Completeness
– Rigour
– Empirical demonstration
– Effective communication

Rigour removes imprecision and adds concreteness. It makes an idea very immune to personal interpretation. Empirical demonstration is through an evidence acquired by observation and experimentation. It opens up new channels for innovation. At an abstract level, it’s about capturing the interaction of  systems with each other; and drawing conclusions based on consistency in the type and nature of interactions. Humans beings can be considered to be made up of some ‘systems’ as well!  When research is not effectively communicated, the innovative ideas, research is not termed as real research by the field.

In Service Organizations, the rigour is often lacking in the way people deliver software, when they give excuse of client deadlines, to hide their inability to learn to work with rigour for delivering a mediocre quality code or software.

When we talk about the process of research, we talk about parameters like The Spirit of Inquiry, Breadth vs Depth, Ability to Abstract and Modularize, Moving from Confusion Phase to start with to Conviction Phase in the end, Research in Academia vs. Research in Industry,

The Ten Commandments of Creativity in Research
1. Seek extension of an earlier known solution
2. If you have a solution, find a problem
3. Find out the right questions to ask
4. Seek generality by removing specificities
5. Seek symmetry by testing for duality
6. Observe patterns in ideas
7. Distill the essence, refine your ideas
8. Distinguish the relevant from the irrelevant
9. Build levels of abstractions and migrate between them
10. Mix deep thinking with routine mechanical work.

Research is often seen as a major source of ideas and by implication, of innovation. One can  internalize ‘research’ nature with practice. Research makes a researcher, a better learner. Although grooming this nature is a leadership attribute for one, to build efficient and quality teams and build robust and long-lasting systems, it’s also of paramount criticality to develop the skill or the ability to switch to different mindset in order to thrive in a dynamic nature of business, especially in service organizations.

Recommended Reading:
1. An Article to clear up some Misconceptions about the nature of Research
2. Richard Hamming on Research
3. Mihaly Csikszentmihalyi on Creativity and Flow

Posted in Research | Leave a Comment »

Performance Engineering

Posted by sanstechbytes on January 26, 2012

I’m invested in MIT OpenCourseWare. That reminds me to go back to MIT site to look for stuff that I like to learn with a first-principles based approach! ;-).

Below are some of the videos of the lectures taught under 6.172 – Performance Engineering of Software Systems and I find ’em  very useful. Check out this MIT Course Page playlist for more related lectures.

Distributed Systems:

Basic Performance Engineering:

Posted in Research | Leave a Comment »

Applying Computational Thinking: A Consumer’s Perspective

Posted by sanstechbytes on November 11, 2011

Abstract

In this white paper, I give a brief background on the challenges we face, when we, the consumers, set out to consume something in the process of accomplishing a task and the need for computational thinking, thereby arising. I describe ways to effectively address our inability in coming up with an efficient strategy in such situations and in particular, address our inability in making an optimal decision for a sub-problem of Bad Buy. I’ve outlined some steps that can be incorporated in our decisioning framework to aid us in making an Optimal Buy. With a singular focus on the topic of Computational Thinking, I’ve illustrated application of CT techniques for situations mostly relevant to ‘Consumer’ in day-to-day life. In the end, I draw some conclusions based on my observations and my own experience applying computational thinking, as a consumer. A ‘Consumer’ is not just referred to the one that buys items; without loss of generality, Consumer is anyone that consumes data.

Background

The abilities of an individual, including the abilities for abstract thought, understanding, communication, reasoning, learning, planning, emotional intelligence and problem solving define degree of his/her intelligence. The path to excellence in any field consists of journey through many layers of abstraction of knowledge needed to use and enhance these abilities. This applies to sheer accomplishment of a task as well. in this journey of accomplishing a goal, let’s say for all practical purposes, that each and every skill (to perform any task) is acquired in a layered manner through different abstraction layers with the level of difficulty increasing at each layer. In schools, some gifted students naturally build upon the above fundamental abilities, move through each and every layer of abstraction easily and some not-so-gifted struggle a bit to do so but end up being there, through hard work and practice, while majority of the students struggle to move to the more and more difficult layers of abstraction itself. For any given generation at any given point in time, vast majority of students are at a layer of abstraction with lower level of difficulty compared to a minority (or say < 40% of students) that are at a layer of abstraction with higher level of difficulty. Industries want people that are in these layers of abstraction where the real-world action is, when they graduate.

We’re not talking about just the chosen fields to draw these observations. Regardless of one’s years of schooling, everyone thinks computationally in completing a task because this kind of thinking is already embedded inside of us either because it was passed to us by our ancestors, or we’ve observed that it works and have incorporated it or by virtue of our study and practice. A task can be: Going to work in field, Answering multiple phone calls, Designing a system, Washing Clothes, Boarding a bus, Writing a program etc. Everything is algorithmic in nature.

Keywords

Pattern, Algorithm, Computation, Time, Space, Consumer, Buy, Market, e-Commerce

 

The Problem

There’re two types of problems I’m trying to address. One is the problem of a sub-optimal buy; and the other, the poor information consumption.

The information overload has impacted all spheres of our lives. Add to this our own vulnerabilities to deal with it, we suddenly find it difficult to make an optimal decision or consume rich information. We find ourselves in loops.

In all our outward dealings related to buying an item or consuming information, this has resulted in increasing the complexity of the whole process of consumption. We often end up making a sub-optimal buy or a bad buy.

Let’s look at it, from the perspective of trends in Education Sector. The lack of ability to think computationally, among graduates has led to the poor supply of man-power to industries in various sectors. The prevalent education system is the major contributor to such a situation. Researchers and Visionaries have increasingly felt the need for introducing appropriate courses at schools or for introducing mechanism to educate masses concerned with developing new way of thinking known as Computational Thinking, to bridge the academia-industry gap as an effort in tackling this issue. They’ve also simulated and have experimented with results on Computational Thinking, with an appropriate course material, at high-school and pre-university levels that can significantly help the cause. However, a course on Computational Thinking is still not taught at most schools.

Understanding Computational Thinking

Computational Thinking is a way of solving problems, designing systems, and understanding human behavior that draws on concepts fundamental to computer science. Computational thinking is thinking in terms of abstractions, invariably multiple layers of abstraction at once. Computational thinking is about the automation of these abstractions. The automaton could be an algorithm, a Turing machine, a tangible device, a software system—or the human brain. (Carnegie Mellon, n.d.) [Bold added for emphasis.]

Computational Thinking is the amalgam of thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent. In simple terms, it’s the thinking that involves techniques like problem decomposition, pattern recognition, pattern generalization to define abstractions or models, algorithm design, and data analysis and visualization. As software engineers, we use these techniques; we acquire these skills with practice and build reliable computing systems. While CT draws on concepts from Computer Science, it’s relevance is increasingly being recognized in other areas of active study including algorithmic medicine, computational archaeology, computational economics, computational finance, computation and journalism, computational law, computational social science, and digital humanities.

Specific computational thinking techniques include: problem decomposition, pattern recognition, pattern generalization to define abstractions or models, algorithm design, and data analysis and visualization. In the next section, I’ve made an attempt to describe each of these techniques with examples, to develop better understanding of CT and in a way, address the said problems.

How Computational Thinking Solves Your Problem?

Can we apply computational thinking in daily life? Is Computational Thinking already a subsystem in a big system of our thought processes? Yes and Yes. Although engineers, scientists, architects, or in general those involved in accomplishing a task requiring abstraction and automation, use CT in their daily lives,  some of the examples given below still relate to events in everyone’s life, including those mentioned above.

1. Pattern Recognition:

The ability to notice similarities or common differences that will help us make predictions or lead us to shortcuts. Pattern recognition is frequently the basis for solving problems and designing algorithm.

Example:

Many a times we in IT are under enormous pressure to meet certain deadlines. Most of us usually search on Google or seek help from colleagues or think about solutions based on existing code or existing artifacts. It happened with few of my colleagues. All of us including Yours Truly, in our team didn’t have adequate knowledge of XSL syntax (we could generally use help from existing code or google). There’s a requirement to sort a bunch of records based on date and it was inside an XSL file. My peers got no help from existing code and spent roughly one day thinking about various approaches to solve, which they weren’t willing to use.

When they discussed with me, I predicted that since XSLT is a declarative language used for reorganizing document content, and a way to generate multiple results– such as HTML, WAP, and SVG–from the same content and as any programming language that is used for the tasks mentioned in previous line, provides routines for search and sort operations, XSLT might provide a sort routine. With XSLT, it’s mostly tag based and I keyed in and phew, that works! We solved the problem in 2 minutes. Searching and Sorting are some of the key operations supported by all programming languages that are used for tasks like above. I recognized the pattern and predicted. Five minutes, and we got our solution! You don’t need work experience all the time. It’s the thinking you need.

2. Pattern Generalization and Abstraction:

It refers to the ability to filter out information that is not necessary to solve a certain type of problem and generalize the information that is necessary. Pattern generalization and abstraction allows us to represent an idea or a process in general terms (e.g., variables) so that we can use it to solve other problems that are similar in nature.

Example:

All the people are inherently aware of the fact that due to middlemen (brokers), they always buy items at a heightened price than the original price. It’s a pattern that’s been observed since the mankind began trading. Why don’t people generalize this pattern and try to find ways to solve this problem and to make an efficient buy? They don’t have to be literate to be educated. Even with the internet accessible from mobile, desktop, laptop and reachable to masses, many people do not innovatively think and save time and money using Internet!

Think about buying an item at a lesser price than is sold at many places. Be it book, apparel, music CD, gadget etc. If you think computationally, you come up with abstractions about different intermediary businesses involved between the producer (manufacturer) and the consumer (end user). You also observe the pattern that each intermediary business makes some money in the form of commission and the final price the item is sold at is obviously higher than it was actually priced by the manufacturer. Instead of the B2B (Business 2 Business) model, if you use the B2C (Business 2 Consumer) model, through online shopping for example, there’s a saving in both money and time. Think Amazon.com, Think Flipkart.com etc.

Suppose you went to buy book from a Shop on Avenue Road in Bangalore (This is the place in Bangalore where books are sold usually at discounted rate of >= 20%). This is still the B2B model. In B2C model, you usually shop online from Amazon.com or Flipkart.com; you spend about 2 minutes in making online transaction at the same price. You save money and save the time required to go and buy the book from a shop. Even if you won’t buy online, you’ll avoid spending time at the shop looking at many items, if you do your homework online. You save time anyways.

3. Decomposition:

The ability to break down a task into minute details so that we can clearly explain a process to another person or to a computer, or even to just write notes for ourselves. Decomposing a problem frequently leads to pattern recognition and generalization, and thus the ability to design an algorithm.

Example:

Suppose you’re buying a Laptop that you want for software development or you want for high performance computing. If you’re a software engineer or you’re someone who’s work needs awareness of, or is associated with internals of a workstation, you probably have fair amount of knowledge on what Laptop components are critical for your needs. If you’re not aware of critical nature of components for your needs, you would go to an expert, or ask questions in online forums (the latter is more computational. If you’re thinking computationally, you decompose your Laptop into different components that make up the laptop and you would think the one with highest clock rate, latest chipset, highest RAM, Solid State Disk would give optimum performance. You visualize the performance results in your workstation based on the past experience patterns. Then, you google for products on internet and compare them on one page. You identify the range which meets your requirements of memory and data processing speed. Then you look at the individual brands and additional. You don’t straightaway look up for different brands on google and then are caught up mixing brands and unclear requirements ending up with overlap of wants and requirements. Decomposition saves time and gives a clear picture to you about your configuration and look and feel for your laptop. From the perspective of someone not so knowledgeable about PC components, this Laptop Buying Task can be considered to be solved algorithmically and more on this below.

4. Algorithm Design:

The ability to develop a step-by-step strategy for solving a problem. Algorithm design is often based on the decomposition of a problem and the identification of patterns that help to solve the problem. In computer science as well as in mathematics, algorithms are often written abstractly, utilizing variables in place of specific numbers.

Example:

Let’s consider a scenario when a layman wants to make an optimal decision in buying a mobile. Let’s also visualize the scenario when the so called computer-savvy person wants to do so, without applying computational thinking techniques. By ‘optimal’, I mean the type of thinking that results in deciding a great mobile for his budget and yet something that could offer rich visual and audio experience without compromising on the core requirements like call quality, decent battery backup etc, having invested minimum time in the process. By ‘layman’, I mean someone who can’t appreciate the technical specifications like 5.1 Channel Output, file formats like MP3, MP4, HD etc, Screen Resolution, and Memory (in-built and expandable) size etc, himself. Ultimately, everyone really wants to make an optimal decision!

With so much of competition in mobile phone market, there are plenty of products in both low-end and high-end segments on the offer by different mobile handset manufacturers. This makes the job of making optimal decision a little challenging. There is also lot of middlemen, ‘n’ number of commission-sucking middlemen in increasing order of commission sucking ability(this particularly refers to the second-hand mobile sellers), shops, malls, mobile showrooms, online stores, where mobiles are sold. From business perspective, the sellers need to keep themselves busy, and can’t afford to spend good time explaining tech specs to laymen. Laymen waste a lot of time on looking at lot of products (enticing) that are cheaper, that may turn out to be costlier after some time in terms of the experience they expected but weren’t able to express while buying a mobile and making. In the end, they make a bad buy for their budget.

Let’s talk about computer-savvy persons now. As it’s proven that “Computers / Internet addresses the human vulnerabilities of being lonely and fearful of intimacy”, when computer-savvy person wants to make a decision he’ll google for sites with listing of mobiles, actually many sites, he’ll start randomly scanning different products that are enticing, might read reviews of course(which is good to a degree). Think about the time he actually spends before he comes up with his “practical” requirements and fixing his budget“. He would have wasted lot of time.

Let’s assume that the layman or the computer-savvy person referred to above thinks computationally. Sometimes budget and requirements can have cyclic relationship from CT perspective. Ideally, there’s only a linear uni-directional relationship. Then:

As a first step, he notes down his principles of buying that include: Simplicity/Craze (No, I mean it, although the word ‘Principle’ is not usually referred to, in the context of characteristics like craze etc.), Rough handling, Battery Longevity, Visual Experience, Something pleasing to eyes and at the same time captured for memory, Audio Experience, Digital Life needs (Read documents, PDFs, Email, Surf web, Chat, Widgets, Games etc).

You might be thinking everyone thinks about these things. True, their thinking is not systematic, rather jumbled and one principle overlaps another in their buying process without CT. Remember, the objective is to make an optimal decision. Remember, you might make a great buy spending a lot of time but it’s not just about money. It’s very much about time as well. You might get lucky and make a great buy by spending little time and well within your budget.

1) In the next step, he translates these principles into requirements in terms of technical specifications like:

– Rough handling (robust outer build),

– Battery Longevity (type of battery Li-Ne etc),

– 5.1/ 7.1 channel output (Great Audio experience),

– Busy Life Business (Dual SIM / Triple SIM)

– HD/HTML5/etc for rich Visual experience,

– Android/Bada (widgets/apps),

– Visual *memorable* Experience (Camera 3.2 MP Etc),

– Digital Life Needs (3G, WiFi, Bluetooth, Memory Card with 8 Gb/16Gb etc (Read docs, Watch videos etc)). In case of laymen, they might seek counsel from sellers or some expert to translate their principles into requirements.

2) In the next step, he fixes his budget and allows a buffer of some money like 1k etc.

3) He uses Flipkart.com or any other site that allows for comparison of mobiles on price, tech spec etc (In case of laymen, he’ll seek sellers’ help at the shop).

4) He comes up with a shortlist of products to meet his requirements.

5) He checks if anyone in his radar (friends’ circle) has purchased a mobile that’s in his shortlist.

6) He reads reviews online or watches internals of mobile on YouTube.com or visits Chroma store(store where they let you touch the items for sale and get a feel of the item that you want to buy) etc (in case of layman).

7) If there turn out to be 2 or more matches meeting his all his requirements (including his budget of course), he uses intuition, look and feel to decide which mobile he finally wants to buy. He’s made an optimal decision and is happy with the time spent in his mobile buying process.

8) If there’re no matches, if he can wait for few months when newer mobiles meeting his requirements could be released to the market, he would wait.

9) Anyone that’s crazy will aim at high-price mobile although he mayn’t use most of the features in the mobile at all, for some reasons. It’s not an optimal decision from a computational thinking perspective. The one who thinks computationally, says, “I don’t need to use those features now, I’ll buy a low-price mobile for now and after 2 years, when I would need those features, I would buy the same mobile at probably 2/3 the price that the crazy person has bought now”. Computational Thinker users power of web and technology to assess the trends and situation and make his decision.

It’s a sequence of steps and it’s algorithmic.

Conclusion

From my own experience and observation, lot of us do not think computationally, and spend lot of time in building our Decisioning Framework and also in realizing or living the life we actually want to live. It’s quite impossible or rather extremely difficult to permanently fix some of the human vulnerabilities like being lonely and being fearful of intimacy, trying to explore something enticing, trying to explore something that can grab the attention of your senses. When there’s an information overload and there’s so much to consume digitally and through other means, computational thinking should serve as a quintessential skill to solve problems or dilemmas arising out of such situations or in general, accomplish any task. In these uncertain times and fluctuating economic conditions, CT is a skill that most people should realize the importance of, for ‘Economy Living’.

In summary, Computational thinking is about using Abstraction and using Automation. We’re all increasingly feeling the want to buy time for ourselves, especially those of us leading urban lifestyles. If we think computationally, we’ll save time (always) and money (in some cases) as well. It’s synonymous to thinking in Computer Science where time (time)-complexity is more important than space (money)-complexity in most cases, these days.

References:

1.Google University: Exploring Computational Thinking

2.Computational Thinking: ACM Publication Jeannette M. Wing PDF

3.Carnegie Mellon University

4. MIT Open Courseware Lecture – What do Computer Scientists do?

Posted in Research | 2 Comments »

Reliable Processing of HTTP requests

Posted by sanstechbytes on July 6, 2011

I thought of sharing an interesting update from my circle of friends. Keerthidhar Dongre, an ex-colleague and a friend of mine, had filed a patent in December of 2008, on his claims about Reliable Processing of HTTP requests.
His patent has been accepted by USPTO.The full-text for this patent US007975047  can be read here. An amazing feat, indeed!

Posted in Research | Leave a Comment »