Ideas, Knowledge, Technology, Computer Science, Experience associated with my work and some geeky stuff I progressively encounter during my journey towards enlightenment. Read on…

  • RSS RSS Feed

    • The Pragmatic Programmer
      I finished reading The Pragmatic Programmer by Andrew Hunt and David Thomas. It’s not a new book in the market but I was curious to read this. The technology topics covered, are not any different from those found in most software engineering books, but the way they’re presented using Pragmatic Philosophy Approach, is remarkable. Code […]
    • 2013 in review
      The stats helper monkeys prepared a 2013 annual report for this blog. Here’s an excerpt: A San Francisco cable car holds 60 people. This blog was viewed about 1,200 times in 2013. If it were a cable car, it would take about 20 trips to carry that many people. Click here to see the […]
    • Goodbye, Ness!
      It had to happen sometime. I thought Feb 2013 was the right time. I quit Ness after a long 5 years and 4 months of stay, in Feb. I joined FICO (formerly, Fair Isaac) last Feb.  While I get an opportunity to work with many varied stakeholders like Scientists, Architect, Product Management, Peer Developers, PMO, Technical Publications and also […]
    • Meta: information retrieval from Tweets
      I pick significant problems randomly sometimes and enjoy solving them, or at least attempt designing api :-). Here’s one such problem! Problem: How’d you go about finding meta information about a person’s tweets? NOTE: a) Tweet == Twitter updates b) Meta information –> Loosely defined. You can interpret it anyway you want –> Frequency, topics, follower […]
    • Understanding Big Data
      It’s been a while, since I last posted! To keep this rolling (I’m hardly getting any time to post my own articles or stuff about my experiments these days 😦  ), I just wanted to share this ebook on Big Data titled  Understanding Big Data: Analytics for Enterprise Class and Streaming Data. Cheers!
  • Twitter Updates

The Pragmatic Programmer

Posted by sanstechbytes on March 16, 2014

I finished reading The Pragmatic Programmer by Andrew Hunt and David Thomas. It’s not a new book in the market but I was curious to read this.

The technology topics covered, are not any different from those found in most software engineering books, but the way they’re presented using Pragmatic Philosophy Approach, is remarkable. Code snippets used are not specific to a particular paradigm. It’s a rare book in the sense that the practical experience and wisdom of the authors is obvious in the way they’ve dealt with different software engineering topics. Any programmer can easily relate to the advice and analogies given by the authors throughout the text.  As an experienced developer, I could discover ‘right’ ways of approaching many finer aspects at various phases of SDLC, although I had learnt and have been practicing most of them over the last few years.

I’ll brief on some of the points elucidated by the authors –

When you’re asked about estimates, various ways to give estimate are: that you say, “I’ll get back to you by ….” Giving estimates in units of Days, if duration is in the range of 1-15 days, in Weeks, if it’s in in the range of 3-8 weeks, in Months, if it’s in the range of 8-30 Weeks and think hard before quoting an estimate, if it’s beyond 30+ Weeks. Talk to people in team, who have implemented similar features to help yourself give realistic estimates. 

When you’re designing a system, you adhere to Design by Contract methodology involving the need of Pre- and Post conditions. You take care of Orthogonality by designing systems that’re independent of each other, in their implementation. Database code should be orthogonal to user interface code. You separate ‘what’ from ‘how’ – design meta-data driven frameworks, sub-systems, to easily accommodate any architectural level changes in future.

When writing code, you use DRY principle (Don’t Repeat Yourself); avoid duplication caused by Imposition (Constrained  by design), Inadvertency (you don’t realize, you’re duplicating), Lack of Patience (you get lazy), Lack of active and frequent communication among developers that work in same team and write code for same module. Use Automation as far as possible, Code Generators, Test Frameworks, Modeling Tools etc. You evaluate the time and space complexity (Big-Oh – O(n)) before implementing the algos, choosing the data structures. When tuning for performance, don’t do premature optimization; the optimization has be done iteratively, improving upon the results obtained from each iteration. Most of the times, developers may say that, “We’re always under pressure to deliver code. So, we won’t be in a situation to adhere to these principles.” But, I would say, “If we inculcate the habit of writing good code right from the time we begin our careers, over a period of time, we’ll be able to deliver high-quality code, even under most demanding situations.”

When you’re debugging an issue, exploit the IDE’s like Eclipse, and other editors, to quicken. The quintessential need to effectively utilize widely used tools in many OS environments to enhance our productivity levels is the need of the hour. For routine work, have scripts for many tasks like starting / stopping a bunch of servers, consoles, setting ENV variables; opening your default no. of browser tabs (having a launcher file).

I must say that, the pragmatic philosophy is more relevant these days, than it was ever before. In the end, the authors have listed a very exhaustive list of internet resources for different tools, text references etc. I thoroughly enjoyed reading this and it’s a 5 out of 5, for me.     

Posted in Programming | Leave a Comment »

2013 in review

Posted by sanstechbytes on January 1, 2014

The stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

A San Francisco cable car holds 60 people. This blog was viewed about 1,200 times in 2013. If it were a cable car, it would take about 20 trips to carry that many people.

Click here to see the complete report.

Posted in Uncategorized | Leave a Comment »

Goodbye, Ness!

Posted by sanstechbytes on June 19, 2013

It had to happen sometime. I thought Feb 2013 was the right time.

I quit Ness after a long 5 years and 4 months of stay, in Feb. I joined FICO (formerly, Fair Isaac) last Feb.  While I get an opportunity to work with many varied stakeholders like Scientists, Architect, Product Management, Peer Developers, PMO, Technical Publications and also get to work on a product that does comprehensive mining of insurance *historical* claims data, I can’t forget the  experience at Ness that played an important part in getting me there. At Ness, I was lucky to be able to switch to teams where I could be offered a good work-life balance in the sense that I could pursue my Masters in Software Systems(MS) from BITS, Pilani during 2008 and 2010, at least for a couple of years compromising too little on opportunities for vertical and horizontal growth. MS was a very good learning experience for me and perhaps, the knowledge gained and the skills acquired helped me make it through good no of interview rounds with the best of the companies like Google, Amazon, Yahoo etc. I’ve made good friends with few people at Ness that I would like to keep in touch with, for lifetime!  There’re highs and lows in that journey during which I was supervised by 3 different managers for 3 different products in CRM, Life Sciences and Digital Media Supply Chain Management domains.

Interestingly, me and my friend, Mahantesh were having a conversation after a long 6 months, while he’s in India for few weeks. It’s something related to my job search, job search in general, breaking promises, Indian IT job market etc.

As Mahantesh and I continued our conversation, he asked, “You had rejected 3 offers. Then, you got an offer from HCL for a Lead Role to work on UID project. You also mentioned that HCL would give an opportunity to lead a team of engineers that develop software potentially used by 1 billion customers, something consumed by the common man in India (ideally!). This is Desha Seve (serving the nation, in English). Why did you reject it…?.”

I replied, “(in Kannada) Modalu Matru Seve, Aamele Matru Bhoomi Seve, meaning First serve your Mother and then serve your Mother Land. Ideally, I wouldn’t compromise on either; I would not uphold one against the other, because both are superior to heaven (A Sanskrit shloka – “Janani Janma Bhoomischa Swargadapi Gariyasi…” uttered by Lord Rama in Ramayana, a Hindu Epic, means that).  Unfortunately, I was not finding opportunity to serve both mother and motherland at the same time with a very high degree of satisfaction and sense of pride and achievement each day, every day. I don’t claim I’ve found that now. It’s too good to be true anywhere. Ultimately, it’s a balanced deal that matters. After a reasonably long passive job search, I settled for a job that currently offers acceptable levels of sense of pride and satisfaction. Additionally, to provide a very decent standard of living to my parents in the costliest Indian city, I need decent amount of money; Money can buy me a decent amount time for myself and family; I need decent comfort standards for living. Perhaps, this ‘decent’ can be subjective. What’s ‘decent’ to me, may be ‘indecent’ to someone else or maybe, ‘most decent’ to some other person. With due respect to HCL, for its contribution towards generation of jobs and Indian Economy, I rejected HCL Offer, too. What inspired me to reject the offer was based on the fact that I don’t exist without my mother. She exists and hence, I do. If she didn’t, I wouldn’t. This ‘I’ would not be the same ‘I’, if I were born to someone else, too. You can never repay a mother’s debt. He nodded in agreement.  We both knew such is the plight of most young Indian professionals. You can never write an algorithm to justify this. There’re infinite conditions and infinite ways in which the logic is executed, to prove this.

Then, we got into little bit of Mythological references and spiritual aspects related to these discussions. I said, “In order to practise true Dharma, even if I lie and kill people, I’m not a liar and am not a criminal. If one practises true Dharma, it leads him/her to state of Self-Realization, the ultimate state of being, at some point in time in his life time. The definition of true Dharma is too hard to grasp by the common man. It literally is so complicated to comprehend when we’re in dilemma or when we need to take a decision under really intricate and complex situations. That’s why we’ve gurus, swamijis, Popes, Seers, Pontiffs, and Mullahs to propagate the knowledge and educate common men, the practice of true Dharma. I’m not an authority on the subject…”  But, see Legend below for my attempt to interpret closer meaning of true Dharma.

He wanted us to dwell more on this with mythological references, related to another dilemma that many job seekers face in the recent times. He knew the rationale behind it, as everyone does. He asked, “Why you broke promise to HR or Recruiters in the first company that you had accepted offer from, and that you had told them you would definitely join them after 3 months?”

I replied, “Otherwise they wouldn’t offer me, in the first place. We tell recruiters or the HR that we’ll definitely join your organization after we’ve accepted their offer, to avoid all the unpleasant talk back and forth during notice period. They indulge in this kind of impractical talk, rather than just releasing the offer letter, which stands void, if candidate doesn’t join them. This is so true when the notice period is like 3 months or so.  Very few employers would accept the candidate joining them after 3 months from the day the offer was made. The dynamic nature of the industry, the volatile nature of assignments in most private companies, the democratic nature of applying for a job (this is needed!),   IT companies, in particular have attributed to such shift in mindset amongst job seekers and employers.  As I retrieved the related info from ‘mythology’ part of my brain, it reminded me of a story in Mahabharata, a Hindu epic, where Lord Krishna would have promised that he would not lift any weapon during the famous Kurukshetra War between Kouravas and Pandavas. Actually, during the war, when Arjuna, the only capable Pandava, to compete and defeat Bheeshma and make enough wounds on his body with his arrows that the latter would not be able to lift weapons anymore (Bheeshma couldn’t be killed by anyone because he had the boon of “icchamarani“, someone who dies on his own accord – whenever he wants; none could kill him), becomes weak enough to not do so, Krishna breaks his promise, and readies himself to kill Bheeshma with his Sudarshana Chakra (exercises Veto power; Lord Krishna, an incarnation of Lord Vishnu is the supreme Lord. He’s bigger than the biggest and smaller than the smallest). At this point in time, Arjuna realizes his true Dharma as a Kshatriya; realizes that Bheeshma is on the wrong side of true Dharma. He interrupts the Lord and pledges before Him that he’ll lift his weapons to fight against Bheeshma and that he would not become too weak to not perform his true Dharma ever again. “

I said, “With due respect to the Companies that follow true Dharma (invest in development of their human resource, follow professional standards in their Employment Practices etc), let me generalize this. Companies are like Bheeshma; I’m not trying to actually compare both of them. Many parallels can, however, be drawn. It’s an attempt to do so. Bheeshma knew that he was on the wrong side. He thought that because he’d got his bread from Kouravas, his only Dharma was to protect Kouravas. Bheeshma was only concerned with the protection of Kouravas, just because they give him his bread. Bheeshma forgot that if following true Dharma calls for being disloyal to someone, it’s fair enough to be disloyal. More often, Companies know that they’re not following professional standards, they’re not actually investing in development of their people; most of the times, they’re dancing to the tunes of investors, different stakeholders within their workforce. They can revoke offers made as per their whims and fancies, and they can fire employees whenever they wish citing reasons like restructuring or reorg even when they’re not doing badly. That’s nature of the job market. Sometimes, it’s genuine; some other times, it’s not. Company can exist as long it wishes to; even it can exist until it’s deregistered.  Company dies only whenever they (shareholders) want. Why cry and stop applying for other company jobs? If I’m applying to another job, after I’ve accepted an offer from one company, then I’m basically seeking a better path of true Dharma; seeking a path that leads me to Self-Realization, my ultimate pursuit in life. If I don’t get a better offer, it means that somehow, I’m not yet qualified enough to take that route to Self-Realization. Or, I’m not lucky enough to get that offer. When you’re following true Dharma, you mayn’t be successful at each intermediate stage; you may undergo lot of hardships. In the end, you’ll emerge as the winner. In my case, I was strongly convinced about my true Dharma, before I broke my promise to the first offer I got. I’ve no regrets about the decision I had made then. “

To be continued….


True DharmaEveryone is born with a certain nature. The individual’s personality is shaped by many external factors, apart from the value system that he inculcates and the traits he inherits from his family. Education, peer influences, social setup, environment have their own influences on his thinking and his actions. He can choose the path in life by virtue of his physical and mental attributes. This path can take many deviations in one’s journey as the individual’s goals, aspirations and motivations evolve over a period of time. True Dharma is taking deviations but being strongly convinced about those deviations, for the good of the things. The principles of Practice of True Dharma should override everything, every other principle on this earth. Being Loyal is acceptable, but not at the cost of true Dharma.

Iccha Marani:  Iccha – Want, Marana – death; Marani – The one that dies.

Posted in Personal | 2 Comments »

Meta: information retrieval from Tweets

Posted by sanstechbytes on December 1, 2012

I pick significant problems randomly sometimes and enjoy solving them, or at least attempt designing api :-). Here’s one such problem!


How’d you go about finding meta information about a person’s tweets?

NOTE: a) Tweet == Twitter updates b) Meta information –> Loosely defined. You can interpret it anyway you want –> Frequency, topics, followers whatever appeals to you.


Meta information about a person’s tweets largely includes information that’s based on one or more of:
a) Content of the Tweet (Video, Text, Image, URL’s etc)
b) User Attribute (Age, Gender, Location, Followers, Following, Topics)
c) User Action (Retweet, Reply, Expand, Tag, Favorite, Follow, View Conversation, Browse Content (Click on URL and tags in a tweet)).

My approach is presented in terms of Design Considerations, Notes; Domain Model, Object Model, Persistence Model and Implementation Notes in the sections below.

Design Considerations:
The schema  design decisions are made to accommodate high scalability needs of twitter data that’s in (usually in the range GB’s or TB’s). For the purposes of analytics, the database operations are mostly read-only. The data is a live one. Also, for frequent reads and in general, large scale needs, the tables aren’t normalized and the redundant data can be easily spread across different tables.

Caching is implemented on the server side, for frequently accessed read operations. I’ve used Object Oriented approach using Java in my API design.

Design Notes:
The source of twitter data, can be :
a) Twitter dataset file in ZIP/XLS format (GB’s or TB’s of data provided by Twitter)
b) Twitter data stored in database tables on the public cloud.
c) Twitter data stored in our database tables (assuming we’re providing Twitter Type Infrastructure and Data Analytics).
d) Twitter response in XML format.

In case of (a), the dataset file can be read and mapped to our Java Object Model by parsing the files, tokenizing and using cache service for retrieval of the XLS data. In case of both (b) &(c), we can map our Object Model using an ORM framework to the data model. Implementation choices include Spring, Hibernate, JPA etc.

In case of (d), marshalling and unmarshalling of XML (using JAXB), is done to extract meta data.

I’m dealing with case (c) in my approach for the design of the persistence model. It’s assumed that the content of data, the capture of user actions(clicks) and user tweet profile attributes can be easily passed (using web 2.0 technologies like Jquery, HTML5, Ext-Js, DWR etc, to Java API which then talks to persistence layer to persist data in tables).

Domain Model:
Below is the domain model to represent business entities or real-world objects.

Object Model:
The object model for the domain model above is represented below using Class Diagrams.
Class Diagrams (without arrrow marks showcasing OO relationship though –  explained later)

Class <User>
-userId: String
-name: String
-gender: char
-age: int
-location: String
-tweets: List
-followers: List
-following: List
-lists: Set+addFollower(User user): boolean
+getFollowers(): List
+addFollowing(User user): boolean
+getFollowing(): List
+addToList(String listName): boolean
+getLists(): Set

+tweet(String content): boolean
+tweets(): List<Tweet>

+retweet(String content): boolean
+favorite(String content): boolean
+reply(String content): boolean
+delete(String content): boolean
+expand(String content): boolean
+viewConversation(String convn): void

+getUserId(): String
+setUserId(String userId): void
+getName(): String
+setName(String userId): void
+getGender(): char
+setGender(char gender): void
+getAge(): int
+setAge(int age): void
+getLocation(): String
+setLocation(String location): void

Class <Tweet>
-tweetId: String
-maxLength: int
-contentType: ContentType
-content: String
-tags: List
-postedBy: String
-favoriteFreq: int
-replyFreq: int
-expandFreq: int
-createDateTime: TimeStamp
+setRetweetFreq(): void
+getRetweetFreq(): int+setPostedBy(String userId): void
+getPostedBy(): String

+setContentType(ContentType ContentType): void
+getContentType(): ContentType

+setFavoriteFreq(): void
+getFavoriteFreq(): int
+setReplyFreq(): void
+getReplyFreq(): int
+setExpandFreq(): void
+getExpandFreq(): int

+setViewConversationFreq(): void
+getViewConversationFreq(): int

+addTag(Tag tag): boolean
+getTags(): List

Class <UserTweetCache>
-cacheInstance: UserTweetCache
-userTweetMap: Map<string, set<tweet=””>>
+lookUpTweet(String userId, String twtStr): Tweet
+getUserTweetMap(): Map<string, set<tweet=””>>
+getCacheInstance(): UserTweetCache


Class <ContentType>
-TweetContentType: enum{ TEXT, VIDEO, IMAGE, URL, COMPOSITE}
+getValue(TweetContentType): String


Class <Tag>
-id: String
-name: String
-source: String+setTagId(String tagid): void
+getTagId(): String
+setSource(String source): void
+getSource(): String
+setName(String source): void
+getName(): String


Class <UserListHelper>
-name: String
+addUserToList(String userId, String listName): boolean
+lists: Set<String>


Class <UserList>
-name: String
-userListMap: Map<string, set<user=””>>
-addUser(String userId): User
+setName(String listName): void
+getName(): String


interface <MetaData>
+extractMetaData(Set tweets): TweetAnalysisMetaData


+extractMetaData(Set<User> users): String


+extractMetaData(Set<Tweet> tweets): String


+extractMetaData(Set<Tweet> tweets): String


class <TextAnalyzer>
+computeTermFreq: void
+applyIDF(): void
+tokenize(String tweetContent): List


class TweetContentMagnitudeVectorImpl
+normalize(TweetContentMagnitudeVector vec): void


class TweetContentMagnitudeVector
+getTFIDFValue(): double


class TextClusterer
+clusterTweets(Set<Tweet>): void
+applyKMeansSetting(KMeansSetting kmeanSetting): void
+computeSimilarityMatrix(): String[][]
+printClusteStatistics(): void


class UserTweetPersistenceManager
+persistUser(User user): boolean
+persistTweet(Tweet user): boolean
+persistTag(Tag tag): boolean
-retrieveUser(String tweetId): boolean
-retrieveTweet(String userId): boolean


class TweetQueryResult
+getQueryCount(): int
+getRelevantTweets(): Set


class TweetQueryParameter
+QueryParameter.Parameter: enum {KEY, APIUSER, START, INDEX, LIMIT, SORTBY}
+QueryType.Typer: enum {UPDATE, SELECT, DELETE}+setParameter(QueryParameter queryParam): void
+getParameter(QueryParameter queryParam): void
+setType(QueryType queryType): void
+getType(): QueryType


class TweetAnalysisMetaData
+displayMetaDataProfile(MetadataType): void+displayUserAttributeBasedMetaDataProfile(MetadataType.USER_ATTRIBUTE): void
+displayUserActionBasedMetaDataProfile(MetadataType.USER_ACTION): void


Implementation Notes:
To avoid frequent calls to DB, Cache (UserTweetCache) is implemented to retrieve tweets and users from a Map which is updated for every new tweet or user and whose reference is got through UserTweetCache singleton instance. If using Map could be a concern for Memory Leaks, size can be specified for Map. Also, LinkedList can be a viable option with LRU type of Cache implementation (discard the LRU LinkedList objects from the cache).

To ensure data integrity for much of the data (we’re not really concerned if some of the unimportant data from the perspective of Cache, gets updated, which is not reflected in Cache), only after the records are successfully deleted or updated in DB, is the UserTweetCache Tweet / User object  updated accordingly.


Persistence Model:
Schema Design:
1. User

userid            varchar2(50)
gender           varchar2(2)
age                int(3)
location         varchar2(30)

2. Tweet

tweetid           varchar2(50)
content           varchar2(2)
createdatetime TimeStamp(19)
favFreq           int
replFreq          int
retweetFreq    int
expandFreq     int
viewConvFreq int

3. Tag

tagid           varchar2(20)
tagtext        varchar2(50)

4. UserAction

actionId           int(2)
description       varchar2(20)
details             varchar2(40)

5. User_Tweet_Tag

userId                 varchar2(20)
tweetId               varchar2(20)
tagId                   varchar2(40)
createDate         TimeStamp(19)

6. User_Tweet_UserAction

userId                 varchar2(20)
tweetId               varchar2(20)
actionId               int


sourceid             varchar2(20)
tweetId               varchar2(20)
tagId                   varchar2(20)
weight                 double(22)
createDate        TimeStamp(19)


Saving Data:
To persist data from the update based on user action (tweet, retweet, reply, view, expand, favorite etc), UserTweetPersistenceManager encapsulates TweetQueryParameter, QueryType and TweetQueryResult objects that interact with ORM API’s to persist data in DB.

If update is successful, the UserTweetCache is updated with the modified Tweet object.
If update is a failure, the UserTweetCache is not updated with the modified Tweet object.
All the error conditions are handled to ensure data integrity of the Cache.


Retrieving Data:
Set and Set are populated with corresponding ResultSet objects retrieved using UserTweetPersistenceManager – retrieveUsers(tweetId), retrieveTweets(userId) etc.

To retrieve user-attribute based metadata, the Set<User> is passed to the MetaData extractor infrastructure.
To retrieve content-based metadata, Set<Tweet> is passed to the MetaData extractor infrastructure.

To retrieve user action based metadata, Set<Tweet> is passed to the MetaData extractor infrastructure.

The Tweet data for a user, or the User data for a tweet can be fetched by doing a join operation amongst User_Tweet_Tag, User, Tweet, User_Tweet_UserAction tables.

TweetAnalysisMetaData displays the metadata profile or statistics based on the type of MetaData (MetaDataType.USERACTION, MetaDataType.CONTENT, MetaDataType.USERATTRIBUTE or aggregating all of them).

TweetAnalysisMetaData also displays cluster statistics or profile for predictive modeling or data mining tasks. The mechanism used for Text Clustering and Text Mining is similar to that explained on my blog

To be Continued…

Posted in Data Mining, Java, Random | Tagged: , , , | 4 Comments »

Understanding Big Data

Posted by sanstechbytes on September 13, 2012

It’s been a while, since I last posted!

To keep this rolling (I’m hardly getting any time to post my own articles or stuff about my experiments these days 😦  ), I just wanted to share this ebook on Big Data titled  Understanding Big Data: Analytics for Enterprise Class and Streaming Data.


Posted in Data Mining, Research | Leave a Comment »

Java Design Patterns: Template Pattern in the Real World

Posted by sanstechbytes on July 31, 2012

If you’re declaring some abstract methods in your class and making the sub classes provide concrete implementation for them, you’re indeed using Template pattern. Template pattern formalizes the idea of defining an algorithm in a class, but leaving some of the details to be implemented in sub classes.

In Java, Arrays.sort(Object[] o) internally invokes mergesort, that in turn, does a .compareTo() check for object comparison. .compareTo() implementation for the object type being compared, will be invoked.  AbstractCollection is one example, too.

Suppose, you’re designing a Sort API that you intend to ship as a jar file to your customer. Your customer would like the flexibility of adding newer or better sort implementations (specialized sorts are needed in a large scale system, applications) in their code. Often, you wouldn’t want to invoke default sort implementation of Java for large data. For instance, your customer wants to sort a very large array of 0’s and 1’s. They would write their own sort logic yielding in-place, O(n) time-complexity, rather than use default Arrays.sort(..) with O( n log n) in time. Obviously, they don’t want you to modify code for them every time. You want details of sort to be implemented in the sub classes that your customer writes. You would come up with something like this below:

public abstract class Sorter {
 public void sort(Object[] o) {

 abstract void sort1(Object[] o);

class CustomSorter1 extends Sorter {

 public CustomSorter1() {


 public void sort(Object[] o) {

 //Your sorting logic goes here



Suppose, you’re developing a Graphics project for school. One of the tasks is to, drawing different triangles based on the input points. You would have subclasses like IsosCelesTriangle, EquilateralTriangle,  an abstract class Triangle and Graphics and Point classes.  Graphics and Point will be composite objects of Triangle.  drawLine() is common to all Triangle types. draw2ndLine() methods will be different for each Traingle type and implemented by sub classes.

Suppose, you’re building a Tourist Trip Reservation application. In every package trip, the activities are defined for each day of a Trip. Depending on the package type, the activities might vary. If you apply Template Pattern here: PackageTrip class is the abstract class. PackageA, PackageB etc are sub classes. performTrip() will be called by PackageA, PackageB etc. It would internally invoke specific Package implementations.

Template patterns is a one of the very useful and frequently used design patterns. Think about scenarios where you could apply in your project, now!

Posted in Java | Tagged: | Leave a Comment »

A Complete Singleton in Java

Posted by sanstechbytes on June 26, 2012

Singleton pattern is a design pattern that restricts the instantiation of a class to one object. In Java, the implemention should be done to ensure that only one instance of the class exists per class loader, per JVM. The objective in this post is not to explain much theory for scenarios where you apply Singleton etc or it’s implementation variants (read this for those details) but to explain what it takes to make a Singleton, a Complete Singleton. We all know we can use many variants of the so called Singleton implementations in our code. I wonder if all of us really know if our implementation will still be a ‘Singleton’ in situations where we try to invoke clone() and/or make the class Serializable. The class will no longer be a Singleton.

Let’s consider first the clone() scenario and later, serialization scenario. Let’s say your Singleton implementation is as below:

public final class Singleton implements Serializable {

 // static Singleton reference is initialised to refer to Singleton instance, when this class is first loaded
 // so that this instance can be accessible without creating an object and without having to worry about thread
 // synchronization in case of, lazy loading (where we do a null check inside getInstance()...

public static final Singleton instance= new Singleton(); // make it private as an optimization step.

 // Suppresses instantiation outside this class
 private Singleton() {


 // One instance per class loader of this class, per JVM
 public static Singleton getInstance(){
return instance;


Suppose the client API invokes method like below:

public class Client {
 Singleton instance = Singleton.getInstance();
 Singleton clone = (Singleton) instance.clone();

So,  clone() call creates another instance of Singleton in heap!!! But, we wanted to ensure ONLY one instance at any time!!! Singleton loses Singleton-ness property!!!!

To prevent this, you should override clone() like below:

// Prevents Singleton.clone() call by the client
public Object clone() throws CloneNotSupportedException {
  throw new CloneNotSupportedException();

When we do so, the client code will give a compile-time error (CloneNotSupportedException is a checked exception). Half Done!

Now, let’s say our Singleton implements Serializable(in above class). The above class would no longer be a singleton if the words “implements Serializable” were added to its declaration. It doesn’t matter whether the class uses the default serialized form or a custom serialized form, nor does it matter whether the class provides an explicit readObject() method. Any readObject method, whether explicit or default, returns a newly created instance, which will not be the same instance that was created at class initialization time. As Joshua Bloch mentions in his Effective Java: Prior to the 1.2 release, it was impossible to write a
serializable singleton class. In the 1.2 release, the readResolve() feature was added to the serialization facility. If the class of an object being deserialized defines a readResolve method with the proper declaration, this method is invoked on the newly created object after it is deserialized. The object reference returned by this method is then returned in lieu of the newly created object. In most uses of this feature, no reference to the newly created object is retained; the object is effectively stillborn, immediately becoming eligible for garbage collection.

private Object readResolve() throws ObjectStreamException {
 // Return the one true Singleton and let the garbage collector
 // take care of the Singleton impersonator.
 return instance;

Note that since the class is final, the contract says that the readResolve() has to be a private method. This method ignores the deserialized object, simply returning the distinguished Singleton instance created when the class was initialized. Therefore the serialized form of a Singleton instance need not contain any real data; all instance fields should be marked transient. This applies not only to this Singleton, but to all singletons.

So here’s the complete singleton code, a complete Singleton Code!!!

public final class Singleton implements Serializable {
 // static Singleton reference is initialised to refer to Singleton instance, when this class is first loaded
 // so that this instance can be accessible without creating an object and without having to worry about thread
 // synchronization in case of, lazy loading (where we do a null check inside getInstance()...
 public static final Singleton instance= new Singleton(); // make it private as an optimization step.</p>
 // Suppresses instantiation outside this class
 private Singleton() {


 // One instance per class loader of this class, per JVM
 public static Singleton getInstance(){
  return instance;
 // Prevents Singleton.clone() call by the client
 public Object clone() throws CloneNotSupportedException {
  throw new CloneNotSupportedException();

 private Object readResolve() throws ObjectStreamException {
  // Return the one true Singleton and let the garbage collector
  // take care of the Singleton impersonator.
  return instance;

Here’s the Lazy Loading variant of the Singleton implementaion using Double-Checked Locking. Based on the Java Memory Model issues, there could be a Just-In-Time (JIT) run-time compiler Threat with this kind of implementation, so explains, Peter Haggar, of Practical Java™ Programming Language Guide fame, here.

public final class Singleton implements Serializable {

public static Singleton instance=null; // make it private as an  optimization step.

// Suppresses instantiation outside this class
private Singleton() {


//One instance per class loader of this class, per JVM
public static Singleton getInstance(){
if(instance == null) {
synchronized(Singleton.class) {
if(instance == null)
return instance;

// Prevents Singleton.clone() call by the client
public Object clone() throws CloneNotSupportedException {
throw new CloneNotSupportedException();

private Object readResolve() throws ObjectStreamException {
// Return the one true Singleton and let the garbage collector
// take care of the Singleton impersonator.
return instance;


Posted in Java | Tagged: | 1 Comment »

My Master’s Dissertation, Revisited Series – Part 3

Posted by sanstechbytes on May 1, 2012

In this final part of the series, let us look at the choices of implementation that I made and the results of my work. I thought I would rather share that as a stuff, part of the download.

Although I liked world-leading data mining open-source software like WEKARapid Miner, I don’t recall why I stuck to exploring more of a trial version of commercial Oracle Data Mining Suite. We used Oracle 10g at work; we use SAS Enterprise Miner for data mining needs. Intuitively, it should have been with the intention of seeing how Oracle Data Mining fitted the bill in that context. It turns out that last statement was too much of a speculative future then.

The repository for all my work along with other useful artifacts can be downloaded here(around 93 MB, I think!!!).

When you’ve downloaded RAR file –

For my explanation of how I’ve used ODM API’s, refer to dissertationrepository\results\My Dissertation POC Demo.docx.

For detailed steps on installation of database, creating user, sample schemas etc.:  refer to Oracle Data Mining Administrator’s Guide can be located in RAR  file at: dissertationrepository\pdf\datamining_admin_guide.pdf.

For installing softwares like odminer and importing java project: refer to dissertationrepository\software. Perhaps, oracle database can be downloaded from here.

For screenshots and results of my POC work,  refer to:  dissertationrepository\results\My Dissertation POC Demo.docx

For other sample projects, books, ideas, papers etc: refer to dissertationrepository\extra


1. Campos, M.M., Milenova, B.L., “O-Cluster: Scalable Clustering of Large High

Dimensional Data Sets”, Oracle Data Mining Technologies, 10 Van De Graaff

Drive, Burlington, MA 01803.

2. Oracle Data Miner –

3. JSR Specifications –

4. Introduction to Data Mining by Vipin Kumar, Michael Steinbach, Pang-ning Tan, Addison Wesley, 2006.

5. Java Data Mining: Standard, Theory and Practice: A Practical Guide for Architecture, Design and Implementation by Mark Hornick, Erik Marcade, and Sunil Venkayala, Morgan Kauffman Publishers, 2007

6. Collective Intelligence in Action by Satnam Alag, Manning Publications, 2007

7. Using text mining to understand the call center customers’ claims by G. M. Caputo, V. M. Bastos & N. F. F. Ebecken.

8. Oracle Database 11g / Data Mining API Documentation

9.  KD Nuggets

10. ACM

Posted in Data Mining, Research | Leave a Comment »

My Master’s Dissertation, Revisited Series – Part 2:

Posted by sanstechbytes on April 30, 2012

In part 1, we had a quick overview of data mining and text (data) mining and some terminology used. Let’s explore clustering in this part, some of the clustering algorithms and text clustering using text mining.

In clustering, the objective is to partition unstructured set of objects to clusters or groups. We often want that the objects to be as similar as possible to objects in the same cluster and as dissimilar to objects from other clusters as possible.

Clustering vs. Categorization

By automatic categorization, we mean to let a machine decide to which of a set of predefined categories a text belongs. In clustering, the machine decides how a given text set should be partitioned. Categorization is suitable when one wants to categorize new texts according to a known categorization, clustering when one wants to discover new structures not previously known. Both methods may give interesting results on an unknown text set; categorization sorts them according to a well-known structure, clustering displays the structure of the particular set. This thesis deals with clustering of texts.

Let’s look at an example for visualization of Clusters for Customer groups based on their age and income, for simplicity. I think the diagrams (Courtesy: Java Data Mining : Strategy, Standard and Practice book by Hornick, Venkayala, Marcade) are pretty much self-explanatory.

Customer Clusters for above set of data:

Cluster Model – Histograms


Clustering is an unsupervised learning method. The result (the clustering, the partition) is based solely on the  object representation, the similarity measure and the clustering algorithm. If these correspond to the users understanding the result might well be an intuitive and useful clustering. One must keep in mind, though, that clustering algorithms always produce clusterings, even when this is not justified, and that there in most cases exist many relevant clusterings of a set of complex objects.

There are several different approaches to the computation of clusters. Clustering algorithms may be characterized as:

  • Hierarchical — Groups data objects into a hierarchy of clusters. The hierarchy can be formed top-down or bottom-up. Hierarchical methods rely on a distance function to measure the similarity between clusters.

  • Partitioning — Partitions data objects into a given number of clusters. The clusters are formed in order to optimize an objective criterion such as distance.

  • Locality-based — Groups neighboring data objects into clusters based on local conditions.

  • Grid-based — Divides the input space into hyper-rectangular cells, discards the low-density cells, and then combines adjacent high-density cells to form clusters.

Let’s look at some algorithms. Let’s explore basic and popular k-Means algorithm for clustering.

k-Means Algorithm:

The k-Means algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters provided there are enough distinct cases. In the basic version of the k-means, we first choose K initial centroids, where K is a cluster specified parameter, viz., the number of clusters desired. Each point is then assigned to the closest centroid, and each collections of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the the cluster. We repeat the assignment and update steps until no point changes clusters, or equivalently, until the centroids remain the same.

Distance-based algorithms rely on distance metric (function) to measure the similarity between data points. The distance metric is either Euclidean, Cosine, or Fast Cosine distance. Data points are assigned to the nearest cluster according to the distance metric used.

Basic k-means is formally described as below:

1. Select K points as initial Centroids

2. repeat:

3.    Form K clusters by assigining each point to it’s closest centroid..

4.    Recompute the centroid of each cluster.

5. until: Centroids do not change

The operation of above k-means algorithm is illustrated in below figure, which shows, how starting from three centroids, the final clusters are found in four assignment-update steps. In these and other figures displaying k-means clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the points to those centroids. The centroids are indicated by the “+” symbol; all points belonging to the same cluster have the same marker shape.

(Courtesy: Introduction to Data Mining: Tan, Vipin Kumar, Steinbach)

Now, let’s look at enhanced k-Means algorithm supported by ODM for clustering.

Enhanced k-means –

  • Builds models in a hierarchical manner. The algorithm builds a model top down using binary splits and refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-means algorithm. The centroid of the inner nodes in the hierarchy is updated to reflect changes as the tree evolves. The whole tree is returned.

  • Grows the tree one node at a time (unbalanced approach). Based on a user setting available in either of the programming interfaces, the node with the largest variance is split to increase the size of the tree until the desired number of clusters is reached. The maximum number of clusters is specified in the build setting for clustering models, CLUS_NUM_CLUSTERS.

  • Provides probabilistic scoring and assignment of data to clusters.

  • Returns, for each cluster, a centroid (cluster prototype), histograms (one for each attribute), and a  rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes or the mean and variance for numerical attributes.

This approach to k-means avoids the need for building multiple k-means models and provides clustering results that are consistently superior to the traditional k-means.

I used Oracle Data Mining (ODM) API’s that use enhanced k-means algorithm described above, for k-means clustering. O-Cluster algorithm implementation is also provided in ODM for clustering.

Time and Space Complexity:

Since only the centroids and data points are stored, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time required (time complexity) is modest too, and can be determined to be: O(I * K * m * n), where I is the number of iterations required for convergence.

‘I’ is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Hence, K-means is linear in m, the number of points, provided K, no of clusters, is significantly less than m.

Text Clustering 

Text Clustering is a specific task in text mining that refers to the process of deriving high-quality information from text and is used for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.

The main applications of clustering in text mining are:
■  Simple clustering. This refers to the creation of clusters of text features .  For example: grouping the hits returned by a search engine.

■  Taxonomy generation. This refers to the generation of hierarchical groupings. For example: a cluster that includes text about car manufacturers is the parent of child clusters that include text about car models.

■  Topic extraction. This refers to the extraction of the most typical features of a group. For example: the most typical characteristics of documents in each document topic.

Let’s talk about Simple Clustering a bit. Suppose you’ve a call center web application and your call center reps enter the comments about a particular offering (product) or in general about the call center experience on your site, and those comments are recorded in your database tables. To make sense of this unstructured data,  i.e. to know which group your customers belong to – HAPPY, SATISFIED, UNHAPPY etc or customer’s level of acceptance about a particular product, and come up with some marketing strategy to improve  customer call center experience or product usability, you want to mine the comments (text) and apply k-means algorithm to group comments. Remember, we’re talking about hundreds of thousands of customers and comments and add to that unstructured data (Comments can contain alphabets, numbers, special chars, image etc). Any traditional smart code to retrieve data and group them, is not apt here and is out of contention, for such a scale.

To apply clustering algorithm for above scenario, say k-means, you need to have the text undergo a special pre-processing step known as Term Extraction or Feature Extraction. This process breaks the text down into units (terms) that can be mined. Text terms may be keywords or other document-derived features . The parameters like Semantic similarity, similar words stop words  and weighting schemes like TF-iDF may be considered while applying a text extraction algorithm like NMF (refer to part 1). You may want to refer to Text Mining section in part 1 of this series, for some theory.

I used ODM API’s. Oracle enhanced k-means algorithm (see below for this) supports text mining. Oracle Data Miner graphical tool performs term extraction transparently when you create or apply a text mining model. The Oracle Data Mining Java API provides term extraction classes for use in Java applications. If you are using the PL/SQL API, you can use a set of Oracle Text table functions to extract the terms. For screenshots of clustering model and Term Extraction model, please watch out for part 3 in this series.

Suppose you want to predict if customers will increase spending with an affinity card. You want to include the comments field from a customer survey as one of the predictors. Before building the classification model, you want to create a reduced set of text features to improve the predictive power of the comments. To do this, you will create a feature extraction model.

In the next part, we’ll look at ODM API’s and my dissertation repository for code, results, links and pdfs.

Posted in Data Mining, Research | Leave a Comment »

My Master’s Dissertation, Revisited Series – Part 1:

Posted by sanstechbytes on April 28, 2012

As I had mentioned in one of my previous posts that I would talk about my Master’s dissertation in my future posts, I’m happy that now is the time. The dissertation was titled “Text Clustering in a Call Centre Application using Data Mining”. My primary motivation was to explore data mining concepts and in particular clustering and see how it can connect to real-world problems in CRM domain. Being a java developer, I found it easier to quickly use Oracle Data Mining (ODM) API built on top of Java Data Mining (JSR Spec – for my proof-of-concept.

In this three-part series of “My Master’s Dissertation, Revisited”, I’ll give a quick overview of data mining, text mining(in particular) and some terminology used in this series, in the first part. In part two of this series, I’ll talk about Text Clustering, enhanced k-means algorithm for clustering, and the sample that I used. I’ll wrap up this series by explaining as to how I applied k-means model, the code that I customized, and the results with a link to the sample code zip file.

I’ll be focusing on data mining concepts and functions relevant to my dissertation in this series. For info on other mining functions or more details, please refer to my previous post on the topic.

What is Data Mining? Why is it used?

Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis and simple querying.  Data Mining is, in plain terms, making sense of the data. The key properties of data mining are:

i) Automatic discovery of patterns

Data mining is accomplished by building models. A model uses an algorithm to act on a set of data. The notion of automatic discovery refers to the execution of the data mining models.

ii) Prediction of likely outcomes:

A model might predict income based on education and other demographic factors. A data mining technique can generate a rule might specify that a person who has a bachelor’s degree and lives in a certain neighborhood is likely to have an income greater than the regional average.

iii) Creation of actionable information:

A town planner might use a model that predicts income based on demographics to develop a plan for low-income housing. A car leasing agency might a use model that identifies customer segments to design a promotion targeting high-value customers.

iv) Focus on large data sets and databases:

When you want to retrieve data from thousands or millions of records that have tens or hundreds of columns (attributes or dimensions) like transactional data (eg: customer data) from table, data mining models make lot of sense.

What it can do and what it cannot?

Data mining discovers hidden information in your data, but it cannot tell you the value of the information to your organization.  You might already be aware of important patterns as a result of working with your data over time. Data mining can confirm or qualify such empirical observations in addition to finding new patterns that may not be immediately discernible through simple observation.

Steps in Data Mining Process:

The steps in the process of data mining can diagrammatically be represented as:

1. Problem Definition:

This initial phase of a data mining project focuses on understanding the project objectives and requirements.

For example, your business problem might be: “How can I sell more of my product to customers?” You might translate this into a data mining problem such as: “Which customers are most likely to purchase the product?” A model that predicts who is most likely to purchase the product must be built on data that describes the customers who have purchased the product in the past.

2. Data Gathering and Preparation:

The data understanding phase involves data collection and exploration. As you take a closer look at the data, you can determine how well it addresses the business problem. You might decide to remove some of the data or add additional data. This is also the time to identify data quality problems and to scan for patterns in the data. For example, you might transform a DATE_OF_BIRTH column to AGE; you might insert the average income in cases where the INCOME column is null.

3. Model Building and Evaluation:

In this phase, you select and apply various modeling techniques and calibrate the parameters to optimal values. If the algorithm requires data transformations, you will need to step back to the previous phase to implement them.

In preliminary model building, it often makes sense to work with a reduced set of data (fewer rows in the case table), since the final case table might contain thousands or millions of cases. Based on the extent to which the model has to be improved, we can increase the number of attributes.

4. Knowledge Deployment:

Knowledge deployment is the use of data mining within a target environment. In the deployment phase, insight and actionable information can be derived from data.

Deployment can involve scoring (the application of models to new data), the extraction of model details (for example the rules of a decision tree), or the integration of data mining models within applications, data warehouse infrastructure, or query and reporting tools. For example, a sales representative could run a model that predicts the likelihood of fraud within the context of an online sales transaction.

Data Mining Functions

Data mining functions can be broadly classified into Supervised and Unsupervised mining functions. Each data mining function uses one or more algorithms to build a model.

The building of a supervised model involves training, a process whereby the software analyzes many cases where the target value is already known. In the training process, the model “learns” the logic for making the prediction. For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by mining functions analyzing the characteristics of many customers who are known to have responded or not responded to a promotion in the past.

The building of an unsupervised model doesn’t involve training, as there is no previously-known result to guide the algorithm in building the model. For example, clustering models use descriptive data mining techniques, but they can be applied to classify cases according to their cluster assignments. Anomaly detection, although unsupervised, is typically used to predict whether a data point is typical among a set of cases.

Let’s have a quick look at how can these functions help a data miner achieve and the business context in which they’re used.


The goal of the Associations mining function is to find items that are consistently associated with each other in a meaningful way. For example, you can analyze purchase transactions to discover combinations of goods that are often purchased together. The Associations mining function answers the question: If certain items are present in a transaction, what other item or items are likely to be present in the same transaction?


The process of automatically creating a model of classes from a set of records that contain class labels. The classification technique analyzes records that are already known to belong to a certain class, and creates a profile for a member of that class from the common characteristics of the records. You can then use a data mining scoring tool to apply this Classification model to new records, that is, records that have not yet been classified. This enables you to predict if the new records belong to that particular class.

Commercial applications of this mining function include credit card scoring, ranking of customers for directed mailing, the prediction of credit risk in new customers, and attrition prediction.


Regression is similar to classification except for the type of the predicted value. Classification predicts a class label, regression predicts a numeric value. Regression also can determine the input fields that are most relevant to predict the target field values. The predicted value might not be identical to any value contained in the data that is used to build the model. An example application is customer ranking by expected profit.


The Clustering mining function includes algorithms like k-means, O-cluster etc. It groups similar customers together. At the same time it maximizes the differences between the different customer groups that are formed in this way.

The groups that are found are known as clusters. Each cluster tells a specific story about customer identity or behavior, for example, about their demographic background, or about their preferred products or product combinations. In this way, customers that are similar are grouped together in homogeneous groups that are then available for marketing or for other business processes.

In the commercial environment, clustering is used in the areas like: Cross-marketing, Cross-selling, Customizing marketing plans for different customer types, Deciding which media approach to use, Understanding shopping goals, and Many other areas.

Clustering can also be used for anomaly detection. Once the data has been segmented into clusters, you might find that some cases do not fit well into any clusters. These cases are anomalies or outliers.

The prerequisites to understand text clustering and in particular k-means algorithm as well as some other mining functions are: Understanding of Numerical, Categorical Attributes, Data Transformation techniques, Euclidean Distance, Document Matrix, Vector Space Model (VSM), Text Mining. See section ‘Some Terminology’ below.

Some Terminology:

Numerical Attribute: An attribute whose values are numbers. The numeric value can be either an integer or a real number.

Categorical Attribute: An attribute where the values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either non-ordered (nominal) like state, gender, and so on, or ordered (ordinal) such as high, medium, or low temperatures.

Binning:  This transformation maps a set of input values to a smaller set of bins. The input values may be discrete or continuous.

Normalization: It maps numerical values to a particular numerical range, typically [0…1]. There are several types of normalization (e.g., z-score, min-max, and shift-scale).

Explosion: This transformation applies only to discrete data such as strings or numbers representing individual categories. The goal is to transform such attributes into numerical attributes.

Support: The ratio of the number of occurrences of R, given all occurrences of all rules.

Confidence: The confidence of a rule X->Y, is the ratio of the number of occurrences of Y given X, among all other occurrences given X.

Euclidean Distance: The Euclidean distance or Euclidean metric is the “ordinary” distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. In Cartesian coordinates, if p = (p1, p2,…, pn) and q = (q1, q2,…, qn) are two points in Euclidean n-space, then the distance from p to q, or from q to p is given by:

d(p, q) = d(q, p) = √ (q1 – p1)2 + (q2 – p2)2  + ….. + (qn – pn)2  =  √∑( qi – pi)2

For N dimensions

In general, for an n-dimensional space, the distance is:

d(p, q) = √ (p1 – q1)2 + (p2 – q2)2  + … + (pi – qi)2  + … + (pn – qn)2  =  √∑( qi – pi)2

Vector Space Model: In the vector space model each text in a set of texts is represented by a vector in a high-dimensional space, with as many dimensions as the number of different words in the set. Each text gets weights (values) in the indices (dimensions) based on what words appear in them. These weights model how important the corresponding word is deemed to be to explain the content of the text. They are dependent on whether (and how often) the word appears in the document and in the entire set. Texts whose vectors are close to each other in this space are considered being similar in content.

Text Mining

Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. ‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment  analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Unstructured data like ppt, audio, images, call center notes etc can be converted into numerical attribute using Text Mining. The term extraction process might involve application of NMF (Non-Negative Matrix Factorization) algorithms and treats the text in each row of the original table as a separate document. Each document is transformed to a set of terms that have a numeric value and a text label.

Most of the text-mining methods are grounded in the term-based VSM to calculate the distance or similarity between documents. Let’s look at some definitions for using document representation of vector.

A document is commonly represented as a vector of terms in a VSM. The basis of the vector collection. In a simple way, we can use a word to be a term. Yet, morphological variants like ’actor’, ’action’ and ’acting’ are so closely related that they are usually conflated into a single word stem, e.g., ’act’ by stemming [4], [5]. After stemming, two word stems are treated as unrelated if they are different. For example, the stem of ’suggestion’ and ’advice’ are usually considered unrelated despite of their apparent relationship. Since different word stems are considered unrelated, i.e., independent, the base vectors in VSM are orthogonal to each other.

A set of documents X in traditional term-based VSM is represented by X = {X1, X2…… Xn}. One document is represented by Xj = {xj1, xj2,…..xj,m}. Terms in the vocabulary extracted from the corresponding document collection are represented by {t1, t2…tm}.

Where xjii is the weight of the term tii in document Xj and is usually determined by the number of times ti appears in Xj (known as the term frequency). Other weighting scheme like TF-iDF can also be applied.

Based on the definition of VSM, distance between two text documents X1 and X2 can be easily computed. There are many methods to measure this distance, such as: Cosine similarity and Minkowski distance including Euclidean distance, Manhattan distance and Maximum distance. Here, we give the definition of Euclidean distance which is effective and frequently used in text clustering.

Euclidean distance of two documents X1 and X2  in the VSM, is defined as:

D(X1, X2) = √( X1 – X2 ) ( X1 – X2 )   T

     = √i=1∑m( x1i-x2i)2

The smaller the value of d(X1, X2) is, the more similar the two documents are. From above equation, we can see that this distance definition does not take into account any patterns of term correlation that exist in  the real-world data.

Posted in Data Mining, Research | 1 Comment »