Ideas, Knowledge, Technology, Computer Science, Experience associated with my work and some geeky stuff I progressively encounter during my journey towards enlightenment. Read on…

  • RSS RSS Feed

    • Cloud Computing
      It’s been really long, since I last wrote a tech post. In this post, I’m just sharing few useful links to get started on Cloud Computing, whether you’re a developer, quality engineer, business leader, or a project manager intending to get started with Cloud Computing. I’m currently designing systems and services, for a platform that’s […]
    • The Pragmatic Programmer
      I finished reading The Pragmatic Programmer by Andrew Hunt and David Thomas. It’s not a new book in the market but I was curious to read this. The technology topics covered, are not any different from those found in most software engineering books, but the way they’re presented using Pragmatic Philosophy Approach, is remarkable. Code […]
    • 2013 in review
      The stats helper monkeys prepared a 2013 annual report for this blog. Here’s an excerpt: A San Francisco cable car holds 60 people. This blog was viewed about 1,200 times in 2013. If it were a cable car, it would take about 20 trips to carry that many people. Click here to see the […]
    • Goodbye, Ness!
      It had to happen sometime. I thought Feb 2013 was the right time. I quit Ness after a long 5 years and 4 months of stay, in Feb. I joined FICO (formerly, Fair Isaac) last Feb.  While I get an opportunity to work with many varied stakeholders like Scientists, Architect, Product Management, Peer Developers, PMO, Technical Publications and also […]
    • Meta: information retrieval from Tweets
      I pick significant problems randomly sometimes and enjoy solving them, or at least attempt designing api :-). Here’s one such problem! Problem: How’d you go about finding meta information about a person’s tweets? NOTE: a) Tweet == Twitter updates b) Meta information –> Loosely defined. You can interpret it anyway you want –> Frequency, topics, follower […]
  • Twitter Updates

    Error: Twitter did not respond. Please wait a few minutes and refresh this page.

Meta: information retrieval from Tweets

Posted by sanstechbytes on December 1, 2012

I pick significant problems randomly sometimes and enjoy solving them, or at least attempt designing api :-). Here’s one such problem!


How’d you go about finding meta information about a person’s tweets?

NOTE: a) Tweet == Twitter updates b) Meta information –> Loosely defined. You can interpret it anyway you want –> Frequency, topics, followers whatever appeals to you.


Meta information about a person’s tweets largely includes information that’s based on one or more of:
a) Content of the Tweet (Video, Text, Image, URL’s etc)
b) User Attribute (Age, Gender, Location, Followers, Following, Topics)
c) User Action (Retweet, Reply, Expand, Tag, Favorite, Follow, View Conversation, Browse Content (Click on URL and tags in a tweet)).

My approach is presented in terms of Design Considerations, Notes; Domain Model, Object Model, Persistence Model and Implementation Notes in the sections below.

Design Considerations:
The schema  design decisions are made to accommodate high scalability needs of twitter data that’s in (usually in the range GB’s or TB’s). For the purposes of analytics, the database operations are mostly read-only. The data is a live one. Also, for frequent reads and in general, large scale needs, the tables aren’t normalized and the redundant data can be easily spread across different tables.

Caching is implemented on the server side, for frequently accessed read operations. I’ve used Object Oriented approach using Java in my API design.

Design Notes:
The source of twitter data, can be :
a) Twitter dataset file in ZIP/XLS format (GB’s or TB’s of data provided by Twitter)
b) Twitter data stored in database tables on the public cloud.
c) Twitter data stored in our database tables (assuming we’re providing Twitter Type Infrastructure and Data Analytics).
d) Twitter response in XML format.

In case of (a), the dataset file can be read and mapped to our Java Object Model by parsing the files, tokenizing and using cache service for retrieval of the XLS data. In case of both (b) &(c), we can map our Object Model using an ORM framework to the data model. Implementation choices include Spring, Hibernate, JPA etc.

In case of (d), marshalling and unmarshalling of XML (using JAXB), is done to extract meta data.

I’m dealing with case (c) in my approach for the design of the persistence model. It’s assumed that the content of data, the capture of user actions(clicks) and user tweet profile attributes can be easily passed (using web 2.0 technologies like Jquery, HTML5, Ext-Js, DWR etc, to Java API which then talks to persistence layer to persist data in tables).

Domain Model:
Below is the domain model to represent business entities or real-world objects.

Object Model:
The object model for the domain model above is represented below using Class Diagrams.
Class Diagrams (without arrrow marks showcasing OO relationship though –  explained later)

Class <User>
-userId: String
-name: String
-gender: char
-age: int
-location: String
-tweets: List
-followers: List
-following: List
-lists: Set+addFollower(User user): boolean
+getFollowers(): List
+addFollowing(User user): boolean
+getFollowing(): List
+addToList(String listName): boolean
+getLists(): Set

+tweet(String content): boolean
+tweets(): List<Tweet>

+retweet(String content): boolean
+favorite(String content): boolean
+reply(String content): boolean
+delete(String content): boolean
+expand(String content): boolean
+viewConversation(String convn): void

+getUserId(): String
+setUserId(String userId): void
+getName(): String
+setName(String userId): void
+getGender(): char
+setGender(char gender): void
+getAge(): int
+setAge(int age): void
+getLocation(): String
+setLocation(String location): void

Class <Tweet>
-tweetId: String
-maxLength: int
-contentType: ContentType
-content: String
-tags: List
-postedBy: String
-favoriteFreq: int
-replyFreq: int
-expandFreq: int
-createDateTime: TimeStamp
+setRetweetFreq(): void
+getRetweetFreq(): int+setPostedBy(String userId): void
+getPostedBy(): String

+setContentType(ContentType ContentType): void
+getContentType(): ContentType

+setFavoriteFreq(): void
+getFavoriteFreq(): int
+setReplyFreq(): void
+getReplyFreq(): int
+setExpandFreq(): void
+getExpandFreq(): int

+setViewConversationFreq(): void
+getViewConversationFreq(): int

+addTag(Tag tag): boolean
+getTags(): List

Class <UserTweetCache>
-cacheInstance: UserTweetCache
-userTweetMap: Map<string, set<tweet=””>>
+lookUpTweet(String userId, String twtStr): Tweet
+getUserTweetMap(): Map<string, set<tweet=””>>
+getCacheInstance(): UserTweetCache


Class <ContentType>
-TweetContentType: enum{ TEXT, VIDEO, IMAGE, URL, COMPOSITE}
+getValue(TweetContentType): String


Class <Tag>
-id: String
-name: String
-source: String+setTagId(String tagid): void
+getTagId(): String
+setSource(String source): void
+getSource(): String
+setName(String source): void
+getName(): String


Class <UserListHelper>
-name: String
+addUserToList(String userId, String listName): boolean
+lists: Set<String>


Class <UserList>
-name: String
-userListMap: Map<string, set<user=””>>
-addUser(String userId): User
+setName(String listName): void
+getName(): String


interface <MetaData>
+extractMetaData(Set tweets): TweetAnalysisMetaData


+extractMetaData(Set<User> users): String


+extractMetaData(Set<Tweet> tweets): String


+extractMetaData(Set<Tweet> tweets): String


class <TextAnalyzer>
+computeTermFreq: void
+applyIDF(): void
+tokenize(String tweetContent): List


class TweetContentMagnitudeVectorImpl
+normalize(TweetContentMagnitudeVector vec): void


class TweetContentMagnitudeVector
+getTFIDFValue(): double


class TextClusterer
+clusterTweets(Set<Tweet>): void
+applyKMeansSetting(KMeansSetting kmeanSetting): void
+computeSimilarityMatrix(): String[][]
+printClusteStatistics(): void


class UserTweetPersistenceManager
+persistUser(User user): boolean
+persistTweet(Tweet user): boolean
+persistTag(Tag tag): boolean
-retrieveUser(String tweetId): boolean
-retrieveTweet(String userId): boolean


class TweetQueryResult
+getQueryCount(): int
+getRelevantTweets(): Set


class TweetQueryParameter
+QueryParameter.Parameter: enum {KEY, APIUSER, START, INDEX, LIMIT, SORTBY}
+QueryType.Typer: enum {UPDATE, SELECT, DELETE}+setParameter(QueryParameter queryParam): void
+getParameter(QueryParameter queryParam): void
+setType(QueryType queryType): void
+getType(): QueryType


class TweetAnalysisMetaData
+displayMetaDataProfile(MetadataType): void+displayUserAttributeBasedMetaDataProfile(MetadataType.USER_ATTRIBUTE): void
+displayUserActionBasedMetaDataProfile(MetadataType.USER_ACTION): void


Implementation Notes:
To avoid frequent calls to DB, Cache (UserTweetCache) is implemented to retrieve tweets and users from a Map which is updated for every new tweet or user and whose reference is got through UserTweetCache singleton instance. If using Map could be a concern for Memory Leaks, size can be specified for Map. Also, LinkedList can be a viable option with LRU type of Cache implementation (discard the LRU LinkedList objects from the cache).

To ensure data integrity for much of the data (we’re not really concerned if some of the unimportant data from the perspective of Cache, gets updated, which is not reflected in Cache), only after the records are successfully deleted or updated in DB, is the UserTweetCache Tweet / User object  updated accordingly.


Persistence Model:
Schema Design:
1. User

userid            varchar2(50)
gender           varchar2(2)
age                int(3)
location         varchar2(30)

2. Tweet

tweetid           varchar2(50)
content           varchar2(2)
createdatetime TimeStamp(19)
favFreq           int
replFreq          int
retweetFreq    int
expandFreq     int
viewConvFreq int

3. Tag

tagid           varchar2(20)
tagtext        varchar2(50)

4. UserAction

actionId           int(2)
description       varchar2(20)
details             varchar2(40)

5. User_Tweet_Tag

userId                 varchar2(20)
tweetId               varchar2(20)
tagId                   varchar2(40)
createDate         TimeStamp(19)

6. User_Tweet_UserAction

userId                 varchar2(20)
tweetId               varchar2(20)
actionId               int


sourceid             varchar2(20)
tweetId               varchar2(20)
tagId                   varchar2(20)
weight                 double(22)
createDate        TimeStamp(19)


Saving Data:
To persist data from the update based on user action (tweet, retweet, reply, view, expand, favorite etc), UserTweetPersistenceManager encapsulates TweetQueryParameter, QueryType and TweetQueryResult objects that interact with ORM API’s to persist data in DB.

If update is successful, the UserTweetCache is updated with the modified Tweet object.
If update is a failure, the UserTweetCache is not updated with the modified Tweet object.
All the error conditions are handled to ensure data integrity of the Cache.


Retrieving Data:
Set and Set are populated with corresponding ResultSet objects retrieved using UserTweetPersistenceManager – retrieveUsers(tweetId), retrieveTweets(userId) etc.

To retrieve user-attribute based metadata, the Set<User> is passed to the MetaData extractor infrastructure.
To retrieve content-based metadata, Set<Tweet> is passed to the MetaData extractor infrastructure.

To retrieve user action based metadata, Set<Tweet> is passed to the MetaData extractor infrastructure.

The Tweet data for a user, or the User data for a tweet can be fetched by doing a join operation amongst User_Tweet_Tag, User, Tweet, User_Tweet_UserAction tables.

TweetAnalysisMetaData displays the metadata profile or statistics based on the type of MetaData (MetaDataType.USERACTION, MetaDataType.CONTENT, MetaDataType.USERATTRIBUTE or aggregating all of them).

TweetAnalysisMetaData also displays cluster statistics or profile for predictive modeling or data mining tasks. The mechanism used for Text Clustering and Text Mining is similar to that explained on my blog

To be Continued…

4 Responses to “Meta: information retrieval from Tweets”

  1. geek stuff said

    geek stuff…

    […]Meta: information retrieval from Tweets « Sanstech[…]…

  2. immortal technique…

    […]Meta: information retrieval from Tweets « Sanstech[…]…

  3. well, thank you very much for sharing the great post!!! lista de emails lista de emails lista de emails lista de emails lista de emails

  4. Hello Web Admin, I noticed that your On-Page SEO is is missing a few factors, for one you do not use all three H tags in your post, also I notice that you are not using bold or italics properly in your SEO optimization. On-Page SEO means more now than ever since the new Google update: Panda. No longer are backlinks and simply pinging or sending out a RSS feed the key to getting Google PageRank or Alexa Rankings, You now NEED On-Page SEO. So what is good On-Page SEO?First your keyword must appear in the title.Then it must appear in the URL.You have to optimize your keyword and make sure that it has a nice keyword density of 3-5% in your article with relevant LSI (Latent Semantic Indexing). Then you should spread all H1,H2,H3 tags in your article.Your Keyword should appear in your first paragraph and in the last sentence of the page. You should have relevant usage of Bold and italics of your keyword.There should be one internal link to a page on your blog and you should have one image with an alt tag that has your keyword….wait there’s even more Now what if i told you there was a simple WordPress plugin that does all the On-Page SEO, and automatically for you? That’s right AUTOMATICALLY, just watch this 4minute video for more information at. Seo Plugin

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: