Sanstech

Ideas, Knowledge, Technology, Computer Science, Experience associated with my work and some geeky stuff I progressively encounter during my journey towards enlightenment. Read on…

  • RSS RSS Feed

    • Cloud Computing
      It’s been really long, since I last wrote a tech post. In this post, I’m just sharing few useful links to get started on Cloud Computing, whether you’re a developer, quality engineer, business leader, or a project manager intending to get started with Cloud Computing. I’m currently designing systems and services, for a platform that’s […]
    • The Pragmatic Programmer
      I finished reading The Pragmatic Programmer by Andrew Hunt and David Thomas. It’s not a new book in the market but I was curious to read this. The technology topics covered, are not any different from those found in most software engineering books, but the way they’re presented using Pragmatic Philosophy Approach, is remarkable. Code […]
    • 2013 in review
      The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog. Here’s an excerpt: A San Francisco cable car holds 60 people. This blog was viewed about 1,200 times in 2013. If it were a cable car, it would take about 20 trips to carry that many people. Click here to see the […]
    • Goodbye, Ness!
      It had to happen sometime. I thought Feb 2013 was the right time. I quit Ness after a long 5 years and 4 months of stay, in Feb. I joined FICO (formerly, Fair Isaac) last Feb.  While I get an opportunity to work with many varied stakeholders like Scientists, Architect, Product Management, Peer Developers, PMO, Technical Publications and also […]
    • Meta: information retrieval from Tweets
      I pick significant problems randomly sometimes and enjoy solving them, or at least attempt designing api :-). Here’s one such problem! Problem: How’d you go about finding meta information about a person’s tweets? NOTE: a) Tweet == Twitter updates b) Meta information –> Loosely defined. You can interpret it anyway you want –> Frequency, topics, follower […]
  • Twitter Updates

    Error: Twitter did not respond. Please wait a few minutes and refresh this page.

Archive for the ‘Java’ Category

Meta: information retrieval from Tweets

Posted by sanstechbytes on December 1, 2012

I pick significant problems randomly sometimes and enjoy solving them, or at least attempt designing api :-). Here’s one such problem!

Problem:

How’d you go about finding meta information about a person’s tweets?

NOTE: a) Tweet == Twitter updates b) Meta information –> Loosely defined. You can interpret it anyway you want –> Frequency, topics, followers whatever appeals to you.

Solution:

Meta information about a person’s tweets largely includes information that’s based on one or more of:
a) Content of the Tweet (Video, Text, Image, URL’s etc)
b) User Attribute (Age, Gender, Location, Followers, Following, Topics)
c) User Action (Retweet, Reply, Expand, Tag, Favorite, Follow, View Conversation, Browse Content (Click on URL and tags in a tweet)).

My approach is presented in terms of Design Considerations, Notes; Domain Model, Object Model, Persistence Model and Implementation Notes in the sections below.

Design Considerations:
The schema  design decisions are made to accommodate high scalability needs of twitter data that’s in (usually in the range GB’s or TB’s). For the purposes of analytics, the database operations are mostly read-only. The data is a live one. Also, for frequent reads and in general, large scale needs, the tables aren’t normalized and the redundant data can be easily spread across different tables.

Caching is implemented on the server side, for frequently accessed read operations. I’ve used Object Oriented approach using Java in my API design.

Design Notes:
The source of twitter data, can be :
a) Twitter dataset file in ZIP/XLS format (GB’s or TB’s of data provided by Twitter)
b) Twitter data stored in database tables on the public cloud.
c) Twitter data stored in our database tables (assuming we’re providing Twitter Type Infrastructure and Data Analytics).
d) Twitter response in XML format.

In case of (a), the dataset file can be read and mapped to our Java Object Model by parsing the files, tokenizing and using cache service for retrieval of the XLS data. In case of both (b) &(c), we can map our Object Model using an ORM framework to the data model. Implementation choices include Spring, Hibernate, JPA etc.

In case of (d), marshalling and unmarshalling of XML (using JAXB), is done to extract meta data.

I’m dealing with case (c) in my approach for the design of the persistence model. It’s assumed that the content of data, the capture of user actions(clicks) and user tweet profile attributes can be easily passed (using web 2.0 technologies like Jquery, HTML5, Ext-Js, DWR etc, to Java API which then talks to persistence layer to persist data in tables).

Domain Model:
Below is the domain model to represent business entities or real-world objects.

Object Model:
The object model for the domain model above is represented below using Class Diagrams.
Class Diagrams (without arrrow marks showcasing OO relationship though –  explained later)

Class <User>
-userId: String
-name: String
-gender: char
-age: int
-location: String
-tweets: List
-followers: List
-following: List
-lists: Set+addFollower(User user): boolean
+getFollowers(): List
+addFollowing(User user): boolean
+getFollowing(): List
+addToList(String listName): boolean
+getLists(): Set

+tweet(String content): boolean
+tweets(): List<Tweet>

+retweet(String content): boolean
+favorite(String content): boolean
+reply(String content): boolean
+delete(String content): boolean
+expand(String content): boolean
+viewConversation(String convn): void

+getUserId(): String
+setUserId(String userId): void
+getName(): String
+setName(String userId): void
+getGender(): char
+setGender(char gender): void
+getAge(): int
+setAge(int age): void
+getLocation(): String
+setLocation(String location): void

Class <Tweet>
-tweetId: String
-maxLength: int
-contentType: ContentType
-content: String
-tags: List
-postedBy: String
-favoriteFreq: int
-replyFreq: int
-expandFreq: int
-createDateTime: TimeStamp
+setRetweetFreq(): void
+getRetweetFreq(): int+setPostedBy(String userId): void
+getPostedBy(): String

+setContentType(ContentType ContentType): void
+getContentType(): ContentType

+setFavoriteFreq(): void
+getFavoriteFreq(): int
+setReplyFreq(): void
+getReplyFreq(): int
+setExpandFreq(): void
+getExpandFreq(): int

+setViewConversationFreq(): void
+getViewConversationFreq(): int

+addTag(Tag tag): boolean
+getTags(): List

Class <UserTweetCache>
-cacheInstance: UserTweetCache
-userTweetMap: Map<string, set<tweet=””>>
+lookUpTweet(String userId, String twtStr): Tweet
+getUserTweetMap(): Map<string, set<tweet=””>>
+getCacheInstance(): UserTweetCache

 

Class <ContentType>
-TweetContentType: enum{ TEXT, VIDEO, IMAGE, URL, COMPOSITE}
+getValue(TweetContentType): String

 

Class <Tag>
-id: String
-name: String
-source: String+setTagId(String tagid): void
+getTagId(): String
+setSource(String source): void
+getSource(): String
+setName(String source): void
+getName(): String

 

Class <UserListHelper>
-name: String
+addUserToList(String userId, String listName): boolean
+lists: Set<String>

 

Class <UserList>
-name: String
-userListMap: Map<string, set<user=””>>
-addUser(String userId): User
+setName(String listName): void
+getName(): String

 

interface <MetaData>
+extractMetaData(Set tweets): TweetAnalysisMetaData
+MetaDataType: enum {USERACTION, CONTENT,USER_ATTRIBUTE }

 

UserAttributeBasedMetaData
+extractMetaData(Set<User> users): String

 

ContentBasedMetaData
+extractMetaData(Set<Tweet> tweets): String

 

UserActionBasedMetaData
+extractMetaData(Set<Tweet> tweets): String

 

class <TextAnalyzer>
+computeTermFreq: void
+applyIDF(): void
+tokenize(String tweetContent): List
..

 

class TweetContentMagnitudeVectorImpl
+normalize(TweetContentMagnitudeVector vec): void
..

 

class TweetContentMagnitudeVector
+getTFIDFValue(): double
..

 

class TextClusterer
+clusterTweets(Set<Tweet>): void
+applyKMeansSetting(KMeansSetting kmeanSetting): void
+computeSimilarityMatrix(): String[][]
+printClusteStatistics(): void

 

class UserTweetPersistenceManager
+persistUser(User user): boolean
+persistTweet(Tweet user): boolean
+persistTag(Tag tag): boolean
-retrieveUser(String tweetId): boolean
-retrieveTweet(String userId): boolean

 

class TweetQueryResult
+getQueryCount(): int
+getRelevantTweets(): Set

 

class TweetQueryParameter
+QueryParameter.Parameter: enum {KEY, APIUSER, START, INDEX, LIMIT, SORTBY}
+QueryType.Typer: enum {UPDATE, SELECT, DELETE}+setParameter(QueryParameter queryParam): void
+getParameter(QueryParameter queryParam): void
+setType(QueryType queryType): void
+getType(): QueryType

 

class TweetAnalysisMetaData
+displayMetaDataProfile(MetadataType): void+displayUserAttributeBasedMetaDataProfile(MetadataType.USER_ATTRIBUTE): void
+displayContentBasedMetaDataProfile(MetadataType.CONTENT):void
+displayUserActionBasedMetaDataProfile(MetadataType.USER_ACTION): void

 

Implementation Notes:
To avoid frequent calls to DB, Cache (UserTweetCache) is implemented to retrieve tweets and users from a Map which is updated for every new tweet or user and whose reference is got through UserTweetCache singleton instance. If using Map could be a concern for Memory Leaks, size can be specified for Map. Also, LinkedList can be a viable option with LRU type of Cache implementation (discard the LRU LinkedList objects from the cache).

To ensure data integrity for much of the data (we’re not really concerned if some of the unimportant data from the perspective of Cache, gets updated, which is not reflected in Cache), only after the records are successfully deleted or updated in DB, is the UserTweetCache Tweet / User object  updated accordingly.

 

Persistence Model:
Schema Design:
1. User

userid            varchar2(50)
gender           varchar2(2)
age                int(3)
location         varchar2(30)

2. Tweet

tweetid           varchar2(50)
content           varchar2(2)
createdatetime TimeStamp(19)
favFreq           int
replFreq          int
retweetFreq    int
expandFreq     int
viewConvFreq int

3. Tag

tagid           varchar2(20)
tagtext        varchar2(50)

4. UserAction

actionId           int(2)
description       varchar2(20)
details             varchar2(40)

5. User_Tweet_Tag

userId                 varchar2(20)
tweetId               varchar2(20)
tagId                   varchar2(40)
createDate         TimeStamp(19)

6. User_Tweet_UserAction

userId                 varchar2(20)
tweetId               varchar2(20)
actionId               int

7.Tweet_Tag

sourceid             varchar2(20)
tweetId               varchar2(20)
tagId                   varchar2(20)
weight                 double(22)
createDate        TimeStamp(19)

 

Saving Data:
To persist data from the update based on user action (tweet, retweet, reply, view, expand, favorite etc), UserTweetPersistenceManager encapsulates TweetQueryParameter, QueryType and TweetQueryResult objects that interact with ORM API’s to persist data in DB.

If update is successful, the UserTweetCache is updated with the modified Tweet object.
If update is a failure, the UserTweetCache is not updated with the modified Tweet object.
All the error conditions are handled to ensure data integrity of the Cache.

 

Retrieving Data:
Set and Set are populated with corresponding ResultSet objects retrieved using UserTweetPersistenceManager – retrieveUsers(tweetId), retrieveTweets(userId) etc.

To retrieve user-attribute based metadata, the Set<User> is passed to the MetaData extractor infrastructure.
To retrieve content-based metadata, Set<Tweet> is passed to the MetaData extractor infrastructure.

To retrieve user action based metadata, Set<Tweet> is passed to the MetaData extractor infrastructure.

The Tweet data for a user, or the User data for a tweet can be fetched by doing a join operation amongst User_Tweet_Tag, User, Tweet, User_Tweet_UserAction tables.

TweetAnalysisMetaData displays the metadata profile or statistics based on the type of MetaData (MetaDataType.USERACTION, MetaDataType.CONTENT, MetaDataType.USERATTRIBUTE or aggregating all of them).

TweetAnalysisMetaData also displays cluster statistics or profile for predictive modeling or data mining tasks. The mechanism used for Text Clustering and Text Mining is similar to that explained on my blog https://sanstechbytes.wordpress.com/2012/04/28/my-masters-dissertation-thesis-revisited-series-part-1/.

To be Continued…
Advertisements

Posted in Data Mining, Java, Random | Tagged: , , , | 4 Comments »

Java Design Patterns: Template Pattern in the Real World

Posted by sanstechbytes on July 31, 2012

If you’re declaring some abstract methods in your class and making the sub classes provide concrete implementation for them, you’re indeed using Template pattern. Template pattern formalizes the idea of defining an algorithm in a class, but leaving some of the details to be implemented in sub classes.

In Java, Arrays.sort(Object[] o) internally invokes mergesort, that in turn, does a .compareTo() check for object comparison. .compareTo() implementation for the object type being compared, will be invoked.  AbstractCollection is one example, too.

Suppose, you’re designing a Sort API that you intend to ship as a jar file to your customer. Your customer would like the flexibility of adding newer or better sort implementations (specialized sorts are needed in a large scale system, applications) in their code. Often, you wouldn’t want to invoke default sort implementation of Java for large data. For instance, your customer wants to sort a very large array of 0’s and 1’s. They would write their own sort logic yielding in-place, O(n) time-complexity, rather than use default Arrays.sort(..) with O( n log n) in time. Obviously, they don’t want you to modify code for them every time. You want details of sort to be implemented in the sub classes that your customer writes. You would come up with something like this below:


public abstract class Sorter {
 public void sort(Object[] o) {
 sort1(o);
 }

 abstract void sort1(Object[] o);
 ...
}

class CustomSorter1 extends Sorter {

 public CustomSorter1() {

 }

 public void sort(Object[] o) {

 //Your sorting logic goes here

 }

}

Suppose, you’re developing a Graphics project for school. One of the tasks is to, drawing different triangles based on the input points. You would have subclasses like IsosCelesTriangle, EquilateralTriangle,  an abstract class Triangle and Graphics and Point classes.  Graphics and Point will be composite objects of Triangle.  drawLine() is common to all Triangle types. draw2ndLine() methods will be different for each Traingle type and implemented by sub classes.

Suppose, you’re building a Tourist Trip Reservation application. In every package trip, the activities are defined for each day of a Trip. Depending on the package type, the activities might vary. If you apply Template Pattern here: PackageTrip class is the abstract class. PackageA, PackageB etc are sub classes. performTrip() will be called by PackageA, PackageB etc. It would internally invoke specific Package implementations.

Template patterns is a one of the very useful and frequently used design patterns. Think about scenarios where you could apply in your project, now!

Posted in Java | Tagged: | Leave a Comment »

A Complete Singleton in Java

Posted by sanstechbytes on June 26, 2012

Singleton pattern is a design pattern that restricts the instantiation of a class to one object. In Java, the implemention should be done to ensure that only one instance of the class exists per class loader, per JVM. The objective in this post is not to explain much theory for scenarios where you apply Singleton etc or it’s implementation variants (read this for those details) but to explain what it takes to make a Singleton, a Complete Singleton. We all know we can use many variants of the so called Singleton implementations in our code. I wonder if all of us really know if our implementation will still be a ‘Singleton’ in situations where we try to invoke clone() and/or make the class Serializable. The class will no longer be a Singleton.

Let’s consider first the clone() scenario and later, serialization scenario. Let’s say your Singleton implementation is as below:


public final class Singleton implements Serializable {

 // static Singleton reference is initialised to refer to Singleton instance, when this class is first loaded
 // so that this instance can be accessible without creating an object and without having to worry about thread
 // synchronization in case of, lazy loading (where we do a null check inside getInstance()...

public static final Singleton instance= new Singleton(); // make it private as an optimization step.

 // Suppresses instantiation outside this class
 private Singleton() {

 }

 // One instance per class loader of this class, per JVM
 public static Singleton getInstance(){
return instance;
 }

}

Suppose the client API invokes method like below:


public class Client {
 ...
 Singleton instance = Singleton.getInstance();
 Singleton clone = (Singleton) instance.clone();
 ...
}

So,  clone() call creates another instance of Singleton in heap!!! But, we wanted to ensure ONLY one instance at any time!!! Singleton loses Singleton-ness property!!!!

To prevent this, you should override clone() like below:

// Prevents Singleton.clone() call by the client
public Object clone() throws CloneNotSupportedException {
  throw new CloneNotSupportedException();
}

When we do so, the client code will give a compile-time error (CloneNotSupportedException is a checked exception). Half Done!

Now, let’s say our Singleton implements Serializable(in above class). The above class would no longer be a singleton if the words “implements Serializable” were added to its declaration. It doesn’t matter whether the class uses the default serialized form or a custom serialized form, nor does it matter whether the class provides an explicit readObject() method. Any readObject method, whether explicit or default, returns a newly created instance, which will not be the same instance that was created at class initialization time. As Joshua Bloch mentions in his Effective Java: Prior to the 1.2 release, it was impossible to write a
serializable singleton class. In the 1.2 release, the readResolve() feature was added to the serialization facility. If the class of an object being deserialized defines a readResolve method with the proper declaration, this method is invoked on the newly created object after it is deserialized. The object reference returned by this method is then returned in lieu of the newly created object. In most uses of this feature, no reference to the newly created object is retained; the object is effectively stillborn, immediately becoming eligible for garbage collection.

private Object readResolve() throws ObjectStreamException {
 // Return the one true Singleton and let the garbage collector
 // take care of the Singleton impersonator.
 return instance;
}

Note that since the class is final, the contract says that the readResolve() has to be a private method. This method ignores the deserialized object, simply returning the distinguished Singleton instance created when the class was initialized. Therefore the serialized form of a Singleton instance need not contain any real data; all instance fields should be marked transient. This applies not only to this Singleton, but to all singletons.

So here’s the complete singleton code, a complete Singleton Code!!!

public final class Singleton implements Serializable {
&nbsp;
<pre>
 // static Singleton reference is initialised to refer to Singleton instance, when this class is first loaded
 // so that this instance can be accessible without creating an object and without having to worry about thread
 // synchronization in case of, lazy loading (where we do a null check inside getInstance()...
 public static final Singleton instance= new Singleton(); // make it private as an optimization step.</p>
 // Suppresses instantiation outside this class
 private Singleton() {

 }

 // One instance per class loader of this class, per JVM
 public static Singleton getInstance(){
  return instance;
 }
 // Prevents Singleton.clone() call by the client
 public Object clone() throws CloneNotSupportedException {
  throw new CloneNotSupportedException();
 }

 private Object readResolve() throws ObjectStreamException {
  // Return the one true Singleton and let the garbage collector
  // take care of the Singleton impersonator.
  return instance;
 }

Here’s the Lazy Loading variant of the Singleton implementaion using Double-Checked Locking. Based on the Java Memory Model issues, there could be a Just-In-Time (JIT) run-time compiler Threat with this kind of implementation, so explains, Peter Haggar, of Practical Java™ Programming Language Guide fame, here.

</p>
public final class Singleton implements Serializable {

public static Singleton instance=null; // make it private as an  optimization step.

// Suppresses instantiation outside this class
private Singleton() {

}

//One instance per class loader of this class, per JVM
public static Singleton getInstance(){
if(instance == null) {
synchronized(Singleton.class) {
if(instance == null)
return instance;
}
}
 }

// Prevents Singleton.clone() call by the client
public Object clone() throws CloneNotSupportedException {
throw new CloneNotSupportedException();
}

private Object readResolve() throws ObjectStreamException {
// Return the one true Singleton and let the garbage collector
// take care of the Singleton impersonator.
return instance;
}

}

Posted in Java | Tagged: | 1 Comment »

Number to Text Representation Problem

Posted by sanstechbytes on April 22, 2012

Problem:

Given a number (up to a billion), display the string (text) representation of it. For eg.: 123456789 should be represented as “One Hundred and Twenty Three Million Four Hundred and Fifty Six thousand Seven Hundred and Eighty Nine. ‘and’ should only be used after hundred.

Solution:

Approach 1: In a OO approach, a number can be represented by Number class. Each Number object has a list of Triplet objects (composition). Triplet represents a group of max 3 digits and can’t exist without a Number.  With this approach, it’s easy to accommodate different text representation systems like British English International Unit (million etc), British English Indian Units (lac, crore).

I wrote the classes (see below) for both approaches including JUnit Tests. I’ve captured the outputs for sample runs in the project folder that you can download here.

Approach 2: Assume Text representation in British English. Think of converting 123456789 into “One Hundred and Twenty Three Million Four Hundred and Fifty Six thousand Seven Hundred and Eighty Nine.

So, convert(123456789) = convert(123) + “million” + convert(456) + “thousand” + convert(789). [Courtesy: Gayle Laakmann of careercup.com]. This makes use of the technique of recursion. A lot of new code has to be written to extend this behavior though. There’re too many special cases to handle.

Code: Approach 1:


package numbertextoop;

import java.util.LinkedList;
import java.util.List;

import utils.NumberTextUtils;

/**
* This class represents a given number. Can be used to display in British
* English phrase the number Eg: 1234 in British English phrase: One Thousand
* Two Hundred and Thirty Four This class can accommodate text representation
* types like British, Indian etc with the constructor provided 04/14/2012 @author
* sanjeev 04/19/2012 @author sanjeev
*
* @version 1.1
*/

public class Number {

private int num;
private List&amp;lt;Triplet&amp;gt; triplets;

/**
* Assume default text representation type as British
*/
public Number(int i) {

this.num = i;

/**
* Handle negative number
*/
if (i &amp;lt; 0) {
i = -1 * i;
}

int length = Integer.toString(i).length();

if (length % NumberTextUtils.THREE == 0) {
length = length / NumberTextUtils.THREE;
} else if (length % NumberTextUtils.THREE &amp;gt; 0) {
length = length / NumberTextUtils.THREE + 1;
}

this.triplets = new LinkedList&amp;lt;Triplet&amp;gt;();
buildTriplets(i, length);
}

/**
* Allows for specifying text representation system, British, Indian etc
* text representationType could be enum - TextRepresentationType.BRITISH,
* TextRepresentationType.INDIAN. In Case of 'Indian', constants like Lac,
* Crore, should be added appropriately in NumberTextUtils.java
*/
public Number(int i, String representationType) {

}

/**
* @param num
* @param length
*            Builds triplet objects and arranges the position (hundredth,
*            tenth, unit) of the digits forming a triplet
*/
private void buildTriplets(int num, int length) {
while (length != 0) {
int tripletValue = num % 1000;

Triplet triplet = new Triplet(tripletValue);
triplet = triplet.matchTripletPlaces(tripletValue);

System.out.println(&amp;quot;triplet&amp;quot; + triplet.getHundredthPlace() + &amp;quot;.&amp;quot;
+ triplet.getTenthPlace() + &amp;quot;:&amp;quot; + triplet.getUnitPlace());
((LinkedList&amp;lt;Triplet&amp;gt;) this.triplets).addFirst(triplet);

num = num / NumberTextUtils.THOUSAND;
length--;
}
}

/**
* String representation of this number
*
*/
public String toString() {

/**
* When num = 0, returns &amp;quot;Zero&amp;quot; or if accessed thro
*
*/
if (this.num == 0) {
return NumberTextUtils.ZERO_TEXT;
}

StringBuilder numTextBuilder = new StringBuilder();

/**
* Negative number
*
*/
if (this.num &amp;lt; 0) {
numTextBuilder.append(&amp;quot;Negative &amp;quot;);
}

return constructEnglishText(this.triplets, numTextBuilder).toString();
}

/**
* Constructs the British English text from this group of triplets formed
* from
*
* @param num
* @param triplets
*            List
*/
private StringBuilder constructEnglishText(List&amp;lt;Triplet&amp;gt; triplets,
StringBuilder numTextBuilder) {
int mapIndex = triplets.size();

for (Triplet triplet : triplets) {
numTextBuilder = numTextBuilder.append(triplet.toString()).append(
&amp;quot; &amp;quot;);

/**
* When num = 1,000,000: print only one million
*/
if (triplet.getHundredthPlace() == 0
&amp;amp;&amp;amp; triplet.getTenthPlace() == 0
&amp;amp;&amp;amp; triplet.getUnitPlace() == 0
&amp;amp;&amp;amp; mapIndex &amp;lt; this.triplets.size()) {
numTextBuilder = numTextBuilder.append(&amp;quot;&amp;quot;);
} else {
numTextBuilder = numTextBuilder.append(
NumberTextUtils.tripletMap().get(mapIndex)).append(&amp;quot; &amp;quot;);
}

mapIndex--;
}

return numTextBuilder;
}

public int getNum() {
return num;
}

public String getText() {
return this.toString();
}

public List&amp;lt;Triplet&amp;gt; getTriplets() {
return triplets;
}

}

 

package numbertextoop;

import utils.NumberTextUtils;

/**
* This class represents a group of digits up to a maximum of 3 that form a
* given number. 04/14/12 @author sanjeev 04/19/12 @author sanjeev
*
* @version 1.1
*/
public class Triplet {

private int hundredthPlace;
private int tenthPlace;
private int unitPlace;
private int value;

public Triplet(int tripletGroupNum) {
this.value = tripletGroupNum;
}

/**
* Arranges the place of digits for hundredth, tenth and unit positions in
* this triplet
*
* @param tripletValue
* @return Triplet
*/
public Triplet matchTripletPlaces(int tripletValue) {
this.setHundredthPlace(tripletValue / NumberTextUtils.HUNDRED);

tripletValue = tripletValue % NumberTextUtils.HUNDRED;
this.setTenthPlace(tripletValue / NumberTextUtils.TEN);

tripletValue = tripletValue % NumberTextUtils.TEN;
this.setUnitPlace(tripletValue);

return this;
}

/**
* Returns the string representation of this triplet
*
* @see java.lang.Object#toString()
*/
public String toString() {
return constructTripletText().toString();
}

/**
* @return tripletText StringBuilder
*/
private StringBuilder constructTripletText() {

StringBuilder tripletText = new StringBuilder();

tripletText = appendHundredthPlaceText(tripletText);

/** 'Teen' candidate check: 13 -&amp;gt; Thirteen, 11 -&amp;gt; Eleven */
StringBuilder teenCandidateBuilder = new StringBuilder();
String teenCandidate = teenCandidateBuilder.append(
String.valueOf(this.tenthPlace)).append(
String.valueOf(this.unitPlace)).toString();
int teenCandidateValue = Integer.parseInt(teenCandidate);

tripletText = appendTenthPlaceText(tripletText, teenCandidateValue);

tripletText = appendUnitPlaceText(tripletText, teenCandidateValue);

return tripletText;
}

/**
* Unit place text candidates Unit place : Five, Zero, Two etc
*/
private StringBuilder appendUnitPlaceText(StringBuilder tripletText,
int teenCandidate) {
/** Print unit place text: 1 -&amp;gt; one, 9 -&amp;gt; nine etc */
if (this.unitPlace != 0) {
if (isATeenNumber(teenCandidate)) {
return tripletText;
}

tripletText = appendDigitText(tripletText, this.unitPlace);
}

return tripletText;
}

/**
* Tenth place and 'Teen' text candidates Tenth place : Five, Eleven,
* Thirteen etc.
*/
private StringBuilder appendTenthPlaceText(StringBuilder tripletText,
int teenCandidate) {
if (this.tenthPlace != 0) {
if (isATeenNumber(teenCandidate)) {
tripletText = tripletText
.append(NumberTextUtils.teens[this.unitPlace]);
} else {
tripletText = tripletText
.append(NumberTextUtils.tens[this.tenthPlace - 1]);
}
tripletText.append(&amp;quot; &amp;quot;);
}

return tripletText;
}

/**
* Hundredth place text candidates Hundredth place : Two Hundred, Two
* Hundred and etc.
*/
private StringBuilder appendHundredthPlaceText(StringBuilder tripletText) {
if (this.hundredthPlace != 0) {
if (this.tenthPlace == 0 &amp;amp;&amp;amp; this.unitPlace == 0) {
tripletText = appendDigitText(tripletText, this.hundredthPlace)
.append(&amp;quot; Hundred &amp;quot;);
} else {
tripletText = appendDigitText(tripletText, this.hundredthPlace)
.append(&amp;quot; Hundred and &amp;quot;);
}
}

return tripletText;
}

/**
* @param tripletText
* @param digit
* @return
*/
public StringBuilder appendDigitText(StringBuilder tripletText, int digit) {
tripletText.append(NumberTextUtils.digits[digit]);
return tripletText;
}

/**
* @param teenCandidate
* @return true if the number belongs to [11, 12, 13, 14, 15, 16, 17, 18,
*         19]
*/
public boolean isATeenNumber(int teenCandidate) {
return (teenCandidate &amp;gt;= NumberTextUtils.ELEVEN &amp;amp;&amp;amp; teenCandidate &amp;lt; NumberTextUtils.TWENTY);
}

public int getValue() {
return value;
}

public int getHundredthPlace() {
return hundredthPlace;
}

public void setHundredthPlace(int hundredthPlace) {
this.hundredthPlace = hundredthPlace;
}

public int getTenthPlace() {
return tenthPlace;
}

public void setTenthPlace(int tenthPlace) {
this.tenthPlace = tenthPlace;
}

public int getUnitPlace() {
return unitPlace;
}

public void setUnitPlace(int unitPlace) {
this.unitPlace = unitPlace;
}

}

Code: Approach 2:

package numbertextalgo;

import utils.NumberTextUtils;

/**
* This class represents a given number in British English phrase Eg: 1234 in
* British English phrase: One Thousand Two Hundred and Thirty Four
*
* @author sanjeev, 14-Apr-2012
* @version 1.0
*/

public class NumberText {

public NumberText() {

}

public static String numberToString(int number) {
if (number == 0) {
return NumberTextUtils.ZERO_TEXT;
} else if (number &amp;lt; 0) {
return NumberTextUtils.NEGATIVE_TEXT + numberToString(-1 * number);
}

int count = 0;
String str = &amp;quot;&amp;quot;;

while (number &amp;gt; 0) {
if (number % NumberTextUtils.THOUSAND != 0) {
str = numberToString100(number % NumberTextUtils.THOUSAND)
+ NumberTextUtils.bigs[count] + &amp;quot; &amp;quot; + str;
number /= NumberTextUtils.THOUSAND;
count++;
}
}

return str;
}

private static String numberToString100(int number) {
String str = &amp;quot;&amp;quot;;

/** Convert hundredth place */
if (number &amp;gt;= NumberTextUtils.HUNDRED) {
str += NumberTextUtils.digits[number / NumberTextUtils.HUNDRED - 1]
+ &amp;quot; Hundred and &amp;quot;;
number %= NumberTextUtils.HUNDRED;
}

/** Convert NumberTextUtils.tens place */
if (number &amp;gt;= NumberTextUtils.ELEVEN
&amp;amp;&amp;amp; number &amp;lt;= NumberTextUtils.NINETEEN) {
return str + NumberTextUtils.teens[number - NumberTextUtils.ELEVEN]
+ &amp;quot; &amp;quot;;
} else if (number == NumberTextUtils.TEN
|| number &amp;gt;= NumberTextUtils.TWENTY) {
str += NumberTextUtils.tens[number / NumberTextUtils.TEN - 1] + &amp;quot; &amp;quot;;
number %= NumberTextUtils.TEN;
}

/** Convert ones place */
if (number &amp;gt;= NumberTextUtils.ONE &amp;amp;&amp;amp; number &amp;lt;= NumberTextUtils.NINE) {
str += NumberTextUtils.digits[number - 1] + &amp;quot; &amp;quot;;
}

return str;
}
}

 

package utils;

import java.util.HashMap;
import java.util.Map;

/**
* This class is used as a utility class for representing a number in text
* 04/14/12 @author sanjeev 04/19/12 @author sanjeev
*
* @version 1.1
*/
public class NumberTextUtils {

public static String[] digits = new String[] { &amp;quot;Zero&amp;quot;, &amp;quot;One&amp;quot;, &amp;quot;Two&amp;quot;,
&amp;quot;Three&amp;quot;, &amp;quot;Four&amp;quot;, &amp;quot;Five&amp;quot;, &amp;quot;Six&amp;quot;, &amp;quot;Seven&amp;quot;, &amp;quot;Eight&amp;quot;, &amp;quot;Nine&amp;quot; };
public static String[] teens = new String[] { &amp;quot;&amp;quot;, &amp;quot;Eleven&amp;quot;, &amp;quot;Twelve&amp;quot;,
&amp;quot;Thirteen&amp;quot;, &amp;quot;Fourteen&amp;quot;, &amp;quot;Fifteen&amp;quot;, &amp;quot;Sixteen&amp;quot;, &amp;quot;Seventeen&amp;quot;,
&amp;quot;Eighteen&amp;quot;, &amp;quot;Nineteen&amp;quot; };
public static String[] tens = new String[] { &amp;quot;Ten&amp;quot;, &amp;quot;Twenty&amp;quot;, &amp;quot;Thirty&amp;quot;,
&amp;quot;Forty&amp;quot;, &amp;quot;Fifty&amp;quot;, &amp;quot;Sixty&amp;quot;, &amp;quot;Seventy&amp;quot;, &amp;quot;Eighty&amp;quot;, &amp;quot;Ninety&amp;quot; };
public static String[] bigs = new String[] { &amp;quot;&amp;quot;, &amp;quot;Thousand&amp;quot;, &amp;quot;Million&amp;quot;,
&amp;quot;Billion&amp;quot; };

public static final String NEGATIVE_TEXT = &amp;quot;Negative&amp;quot;;
public static final String ZERO_TEXT = &amp;quot;Zero&amp;quot;;

public static final int ONE = 1;
public static final int THREE = 3;
public static final int NINE = 9;
public static final int TEN = 10;
public static final int TWENTY = 20;
public static final int ELEVEN = 11;
public static final int NINETEEN = 19;
public static final int HUNDRED = 100;
public static final int THOUSAND = 1000;

public static Map&amp;lt;Integer, String&amp;gt; tripletGroupMap = new HashMap&amp;lt;Integer, String&amp;gt;();

static {
tripletGroupMap.put(1, &amp;quot;&amp;quot;);
tripletGroupMap.put(2, &amp;quot;Thousand&amp;quot;);
tripletGroupMap.put(3, &amp;quot;Million&amp;quot;);
tripletGroupMap.put(4, &amp;quot;Billion&amp;quot;);
}

public static Map&amp;lt;Integer, String&amp;gt; tripletMap() {
return tripletGroupMap;
}
} 

 

package numbertextoop;

/**
* Test Class with main() for Number.java
*
* @author sanjeev
*
*/
public class TestNumber {

public static void main(String[] args) {
int num = 1000000000;
Number number = new Number(num);

sop(num + &amp;quot; in British English Phrase using OOP: &amp;quot; + number.getText());
}

private static void sop(Object o) {
System.out.println(o);
}

}

Posted in Java | Leave a Comment »

Document Search

Posted by sanstechbytes on December 29, 2011

I recently started working on a trivial problem of search by making use of symbol table with the future goal of enhancing the API designed, for Large Scale and Memory requirements. I just wanted to share this experience in the meantime.

The Problem:
Find the list of documents that contain a particular string. The documents are stored as simple text files in a flat structure having a single directory in a Library. Keep in mind that searching for a word will not be a one-time activity. There can be thousands of calls to the search API exposed to all the clients at the same time. Assume that the public document libraries are not getting modified i.e.:
a. No new documents are being added to the library.
b. No existing documents are being modified.
c. No existing documents are being deleted.
Define a data-structure to hold the necessary information about the documents so that the above API can respond back in minimum time-complexity. Write an implementation for populating information in the data-structure you have defined above.

The Solution:
Given that there can be thousands of calls to one API to retrieve similar info from the same data, we can accept the burden of pre-processing the data. This pre-processing can be performed at the time of server start-up. For the sake of simplicity, let’s consider that the string searched for, in the documents, contains complete words,it can contain duplicate words. If the string contains one word, then our API should retrieve a set of documents that contain the word. If the string contains more than one word, then the intersection of sets of documents that contain individual words will give us the set of documents .

A data structure that maintains key and value pair could be very much optimal, when the purpose of search is to, say, for instance, find definition, find relevant pages, find the song to download, find type and value, find process transactions. For the above problem, we could maintain a pre-processed dictionary with key as Keyword and value as set of documents that contain the keyword searched for, in the documents.

I’ve considered simple search mechanism, excluding the complexities like distributing documents on different machines, assigning the task of alphabetical arragement of keywords to a particular machine, merging the list of documents with sorted keywords etc. As you’re thinking now, the complexities refer to the Large Scale and Memory needs which are the inherent while searching millions of documents. We’re also thinking about doing mergesort, using Hadoop etc. As I work on this in my free time, I want to enhance this API to develop for Large Scale and Memory needs. As I’m talking about Java implementation, we can implement Singleton Design pattern by creating a single instance of the Search API to service all the client requests, per WAR, per classloader, per JVM and ensuring that single instance can’t be cloned.  Note that I’ve used MyST as a wrapper that uses TreeMap API (implements height balancing Red-Black Tree. Worst-case O(log n) time for insert and search operations). HashMap API approach that offers O(c) for most cases and O(n) worst-case for insert and search operations can be considered. If we’re thinking about sorting the documents based on keywords, (for large scale and memory needs), I think, we could readily use TreeMap that maintains the sorted order of it’s keys.

Below is the code:

package search;
import java.io.File;
/**
 *
 * @author Sanjeev Kulkarni
 * Test Client for FileSearcher.java
 */
public class TestFileSearcher {
/**
 * @param args
 */
 public static void main(String[] args) {
 FileSearcher fileSearcher = new FileSearcher();
 String keyWord = "www.kulkarni.org";
 MySET set = fileSearcher.search(keyWord);
 sop("Files containing  " + keyWord + " - : ");
 if (set.isEmpty()) {
 sop("Keyword not found in any of the files");
 return;
 }
 for (File file : set) {
 sop( file.getName() +" at " + file.getAbsolutePath());
 }
 }
/**
 * Utility that prints output to the console
 */
 private static void sop(Object o) {
 System.out.println(o);
 }
}

package search;

import java.io.File;

interface ISearcher {
    /*
     * * @param keyWord - a single word to be searched in the available
     * documents * @returns List - a list of files in which the word is
     * present
     */

    // It can't contain duplicate elements.
    // It has to be a SET since each file (Document) has a unique name
    MySET search(String keyWord);

}

package search;
import java.io.File;

public class FileSearcher implements ISearcher {

    private MyST> myST = null;

    public MySET search(String keyWord) {
        FileSearchFactory fileSearchfactory = FileSearchFactory.getInstance();
        myST = fileSearchfactory.st;
        MySET set = new MySET();
        if (myST.contains(keyWord)) {
            set = myST.get(keyWord);
        }
        return set;
    }

}

package search;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

/**
 *
 * @author Sanjeev Kulkarni
 * @version 1.0 Singleton implementation for indexing files based on keyword, to
 *          provide single instance of Symbol Table(MyST class) for all
 *          concurrent client requests. When the first client queries for a
 *          particular keyword, file indexing is done and then search in
 *          performed using O(log n). Hashtable offering O(c) - constant time
 *          operation can be considered. Given that - The files(public document
 *          libraries) are not getting modified, i.e. a. No new documents are
 *          being added to the libraries. b. No existing documents are being
 *          modified and c. No existing documents are being deleted, from the
 *          second client request onwards, only search is performed using Binary
 *          Search taking O(log n).
 *
 */

public class FileSearchFactory {

    private static FileSearchFactory instance = null;
    protected MyST> st = null;
    private static final int SIZE = 3;

    private FileSearchFactory() throws FileSearchException {
        // key = word, value = set of files containing that word
        // This is done only once, per ClassLoader, per Application.
        // 'n' clients get handle to one instance of FileSearchFactory
        // loaded in one class loader, in one WAR deployment
        indexFilesForKeyWords();
    }

    private void indexFilesForKeyWords() throws FileSearchException {
        st = new MyST>();
        // Read in files from the directory
        // For demo, below hard-coded paths are used

        String[] filePaths = new String[SIZE];
        filePaths[0] = "D:/devzone/projects/search/doc1.txt";
        filePaths[1] = "D:/devzone/projects/search/doc2.txt";
        filePaths[2] = "D:/devzone/projects/search/doc3.txt";

        // create inverted index of all files
        System.out.println("Indexing files..");
        for (String filePath : filePaths) {
            System.out.println("  " + filePath);
            File file = new File(filePath);
            Scanner in = null;

            try {
                in = new Scanner(file);
            } catch (FileNotFoundException fnfe) {
                throw new FileSearchException(
                        "File wasn't found while indexing files");
            }

            while (in.hasNext()) {
                String word = in.next();
                if (!st.contains(word))
                    st.put(word, new MySET());
                MySET set = st.get(word);
                set.add(file);
            }
        }
    }

    public static synchronized FileSearchFactory getInstance() {
        // lazy initialization
        if (instance == null) {
            try {
                instance = new FileSearchFactory();
            } catch (FileSearchException e) {
                System.out.println(e.getMessage());
            }
        }
        return instance;
    }

    // preventing client's invoking Object.clone(), to ensure complete singleton
    // implementation
    public Object clone() throws CloneNotSupportedException {
        throw new CloneNotSupportedException();
    }

}

/**
 *
 * @author Sanjeev Kulkarni
 * @version 1.0
 *  Custom exception for messages in file search
 */
public class FileSearchException extends Exception {

    public FileSearchException(){
        super();
    }

    public FileSearchException(String msg){
        super(msg);
    }

    //add machines id's

}

package search;

import java.util.Iterator;
import java.util.SortedSet;
import java.util.TreeSet;

public class MySET> implements Iterable {

    private TreeSet set;

    /**
     * Create an empty set.
     */
    public MySET() {
        set = new TreeSet();
    }

    /**
     * Is this set empty?
     */
    public boolean isEmpty() {
        return set.isEmpty();
    }

    /**
     * Add the key to this set.
     */
    public void add(Key key) {
        set.add(key);
    }

    /**
     * Does this set contain the given key?
     */
    public boolean contains(Key key) {
        return set.contains(key);
    }

    /**
     * Delete the given key from this set.
     */
    public void delete(Key key) {
        set.remove(key);
    }

    /**
     * Return the number of keys in this set.
     */
    public int size() {
        return set.size();
    }

    /**
     * Return an Iterator for this set.
     */
    public Iterator iterator() {
        return set.iterator();
    }

    /**
     * Return the key in this set with the maximum value.
     */
    public Key max() {
        return set.last();
    }

    /**
     * Return the key in this set with the minimum value.
     */
    public Key min() {
        return set.first();
    }

    /**
     * Return the smallest key in this set >= k.
     */
    public Key ceil(Key k) {
        SortedSet tail = set.tailSet(k);
        if (tail.isEmpty())
            return null;
        else
            return tail.first();
    }

    /**
     * Return the largest key in this set <= k.
     */
    public Key floor(Key k) {
        if (set.contains(k))
            return k;

        // does not include key if present (!)
        SortedSet head = set.headSet(k);
        if (head.isEmpty())
            return null;
        else
            return head.last();
    }

    /**
     * Return the union of this set with that set.
     */
    public MySET union(MySET that) {
        MySET c = new MySET();
        for (Key x : this) {
            c.add(x);
        }
        for (Key x : that) {
            c.add(x);
        }
        return c;
    }

    /**
     * Return the intersection of this set with that set.
     */
    public MySET intersects(MySET that) {
        MySET c = new MySET();
        if (this.size() < that.size()) {
            for (Key x : this) {
                if (that.contains(x))
                    c.add(x);
            }
        } else {
            for (Key x : that) {
                if (this.contains(x))
                    c.add(x);
            }
        }
        return c;
    }

}

package search;

import java.util.Iterator;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 *
 * @author Sanjeev Kulkarni
 * @version 1.0
 * Symbol Table with Key = keyword, Value = SET of Files that
 * contain this keyword
 * @param 
 * @param 
 */

public class MyST, Value> implements Iterable {
    private TreeMap st;

    /**
     * Create an empty symbol table.
     */
    public MyST() {
        st = new TreeMap();
    }

    /**
     * Put key-value pair into the symbol table. Remove key from table if
     * value is null.
     */
    public void put(Key key, Value val) {
        if (val == null) st.remove(key);
        else             st.put(key, val);
    }

    /**
     * Return the value paired with given key; null if key is not in table.
     */
    public Value get(Key key) {
        return st.get(key);
    }

    /**
     * Delete the key (and paired value) from table.
     * Return the value paired with given key; null if key is not in table.
     */
    public Value delete(Key key) {
        return st.remove(key);
    }

    /**
     * Is the key in the table?
     */
    public boolean contains(Key key) {
        return st.containsKey(key);
    }

    /**
     * How many keys are in the table?
     */
    public int size() {
        return st.size();
    }

    /**
     * Return an Iterator for the keys in the table.
     * To iterate over all of the keys in the symbol table st, use the
     * foreach notation: for (Key key : st).
     */
    public Iterator iterator() {
        return st.keySet().iterator();
    }

    /**
     * Return an Iterable for the keys in the table.
     * To iterate over all of the keys in the symbol table st, use the
     * foreach notation: for (Key key : st.keys()).
     */
    public Iterable keys() {
        return st.keySet();
    }

    /**
     * Return the smallest key in the table.
     */
    public Key min() {
        return st.firstKey();
    }

    /**
     * Return the largest key in the table.
     */
    public Key max() {
        return st.lastKey();
    }

    /**
     * Return the smallest key in the table >= k.
     */
    public Key ceil(Key k) {
        SortedMap tail = st.tailMap(k);
        if (tail.isEmpty()) return null;
        else return tail.firstKey();
    }

    /**
     * Return the largest key in the table <= k.
     */
    public Key floor(Key k) {
        if (st.containsKey(k)) return k;
        // does not include key if present (!)
        SortedMap head = st.headMap(k);
        if (head.isEmpty()) return null;
        else return head.lastKey();
    }
}

Posted in Java | Leave a Comment »