Sanstech

Ideas, Knowledge, Technology, Computer Science, Experience associated with my work and some geeky stuff I progressively encounter during my journey towards enlightenment. Read on…

  • RSS RSS Feed

    • The Pragmatic Programmer
      I finished reading The Pragmatic Programmer by Andrew Hunt and David Thomas. It’s not a new book in the market but I was curious to read this. The technology topics covered, are not any different from those found in most software engineering books, but the way they’re presented using Pragmatic Philosophy Approach, is remarkable. Code […]
    • 2013 in review
      The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog. Here’s an excerpt: A San Francisco cable car holds 60 people. This blog was viewed about 1,200 times in 2013. If it were a cable car, it would take about 20 trips to carry that many people. Click here to see the […]
    • Goodbye, Ness!
      It had to happen sometime. I thought Feb 2013 was the right time. I quit Ness after a long 5 years and 4 months of stay, in Feb. I joined FICO (formerly, Fair Isaac) last Feb.  While I get an opportunity to work with many varied stakeholders like Scientists, Architect, Product Management, Peer Developers, PMO, Technical Publications and also […]
    • Meta: information retrieval from Tweets
      I pick significant problems randomly sometimes and enjoy solving them, or at least attempt designing api :-). Here’s one such problem! Problem: How’d you go about finding meta information about a person’s tweets? NOTE: a) Tweet == Twitter updates b) Meta information –> Loosely defined. You can interpret it anyway you want –> Frequency, topics, follower […]
    • Understanding Big Data
      It’s been a while, since I last posted! To keep this rolling (I’m hardly getting any time to post my own articles or stuff about my experiments these days 😦  ), I just wanted to share this ebook on Big Data titled  Understanding Big Data: Analytics for Enterprise Class and Streaming Data. Cheers!
  • Twitter Updates

Document Search

Posted by sanstechbytes on December 29, 2011

I recently started working on a trivial problem of search by making use of symbol table with the future goal of enhancing the API designed, for Large Scale and Memory requirements. I just wanted to share this experience in the meantime.

The Problem:
Find the list of documents that contain a particular string. The documents are stored as simple text files in a flat structure having a single directory in a Library. Keep in mind that searching for a word will not be a one-time activity. There can be thousands of calls to the search API exposed to all the clients at the same time. Assume that the public document libraries are not getting modified i.e.:
a. No new documents are being added to the library.
b. No existing documents are being modified.
c. No existing documents are being deleted.
Define a data-structure to hold the necessary information about the documents so that the above API can respond back in minimum time-complexity. Write an implementation for populating information in the data-structure you have defined above.

The Solution:
Given that there can be thousands of calls to one API to retrieve similar info from the same data, we can accept the burden of pre-processing the data. This pre-processing can be performed at the time of server start-up. For the sake of simplicity, let’s consider that the string searched for, in the documents, contains complete words,it can contain duplicate words. If the string contains one word, then our API should retrieve a set of documents that contain the word. If the string contains more than one word, then the intersection of sets of documents that contain individual words will give us the set of documents .

A data structure that maintains key and value pair could be very much optimal, when the purpose of search is to, say, for instance, find definition, find relevant pages, find the song to download, find type and value, find process transactions. For the above problem, we could maintain a pre-processed dictionary with key as Keyword and value as set of documents that contain the keyword searched for, in the documents.

I’ve considered simple search mechanism, excluding the complexities like distributing documents on different machines, assigning the task of alphabetical arragement of keywords to a particular machine, merging the list of documents with sorted keywords etc. As you’re thinking now, the complexities refer to the Large Scale and Memory needs which are the inherent while searching millions of documents. We’re also thinking about doing mergesort, using Hadoop etc. As I work on this in my free time, I want to enhance this API to develop for Large Scale and Memory needs. As I’m talking about Java implementation, we can implement Singleton Design pattern by creating a single instance of the Search API to service all the client requests, per WAR, per classloader, per JVM and ensuring that single instance can’t be cloned.  Note that I’ve used MyST as a wrapper that uses TreeMap API (implements height balancing Red-Black Tree. Worst-case O(log n) time for insert and search operations). HashMap API approach that offers O(c) for most cases and O(n) worst-case for insert and search operations can be considered. If we’re thinking about sorting the documents based on keywords, (for large scale and memory needs), I think, we could readily use TreeMap that maintains the sorted order of it’s keys.

Below is the code:

package search;
import java.io.File;
/**
 *
 * @author Sanjeev Kulkarni
 * Test Client for FileSearcher.java
 */
public class TestFileSearcher {
/**
 * @param args
 */
 public static void main(String[] args) {
 FileSearcher fileSearcher = new FileSearcher();
 String keyWord = "www.kulkarni.org";
 MySET set = fileSearcher.search(keyWord);
 sop("Files containing  " + keyWord + " - : ");
 if (set.isEmpty()) {
 sop("Keyword not found in any of the files");
 return;
 }
 for (File file : set) {
 sop( file.getName() +" at " + file.getAbsolutePath());
 }
 }
/**
 * Utility that prints output to the console
 */
 private static void sop(Object o) {
 System.out.println(o);
 }
}

package search;

import java.io.File;

interface ISearcher {
    /*
     * * @param keyWord - a single word to be searched in the available
     * documents * @returns List - a list of files in which the word is
     * present
     */

    // It can't contain duplicate elements.
    // It has to be a SET since each file (Document) has a unique name
    MySET search(String keyWord);

}

package search;
import java.io.File;

public class FileSearcher implements ISearcher {

    private MyST> myST = null;

    public MySET search(String keyWord) {
        FileSearchFactory fileSearchfactory = FileSearchFactory.getInstance();
        myST = fileSearchfactory.st;
        MySET set = new MySET();
        if (myST.contains(keyWord)) {
            set = myST.get(keyWord);
        }
        return set;
    }

}

package search;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

/**
 *
 * @author Sanjeev Kulkarni
 * @version 1.0 Singleton implementation for indexing files based on keyword, to
 *          provide single instance of Symbol Table(MyST class) for all
 *          concurrent client requests. When the first client queries for a
 *          particular keyword, file indexing is done and then search in
 *          performed using O(log n). Hashtable offering O(c) - constant time
 *          operation can be considered. Given that - The files(public document
 *          libraries) are not getting modified, i.e. a. No new documents are
 *          being added to the libraries. b. No existing documents are being
 *          modified and c. No existing documents are being deleted, from the
 *          second client request onwards, only search is performed using Binary
 *          Search taking O(log n).
 *
 */

public class FileSearchFactory {

    private static FileSearchFactory instance = null;
    protected MyST> st = null;
    private static final int SIZE = 3;

    private FileSearchFactory() throws FileSearchException {
        // key = word, value = set of files containing that word
        // This is done only once, per ClassLoader, per Application.
        // 'n' clients get handle to one instance of FileSearchFactory
        // loaded in one class loader, in one WAR deployment
        indexFilesForKeyWords();
    }

    private void indexFilesForKeyWords() throws FileSearchException {
        st = new MyST>();
        // Read in files from the directory
        // For demo, below hard-coded paths are used

        String[] filePaths = new String[SIZE];
        filePaths[0] = "D:/devzone/projects/search/doc1.txt";
        filePaths[1] = "D:/devzone/projects/search/doc2.txt";
        filePaths[2] = "D:/devzone/projects/search/doc3.txt";

        // create inverted index of all files
        System.out.println("Indexing files..");
        for (String filePath : filePaths) {
            System.out.println("  " + filePath);
            File file = new File(filePath);
            Scanner in = null;

            try {
                in = new Scanner(file);
            } catch (FileNotFoundException fnfe) {
                throw new FileSearchException(
                        "File wasn't found while indexing files");
            }

            while (in.hasNext()) {
                String word = in.next();
                if (!st.contains(word))
                    st.put(word, new MySET());
                MySET set = st.get(word);
                set.add(file);
            }
        }
    }

    public static synchronized FileSearchFactory getInstance() {
        // lazy initialization
        if (instance == null) {
            try {
                instance = new FileSearchFactory();
            } catch (FileSearchException e) {
                System.out.println(e.getMessage());
            }
        }
        return instance;
    }

    // preventing client's invoking Object.clone(), to ensure complete singleton
    // implementation
    public Object clone() throws CloneNotSupportedException {
        throw new CloneNotSupportedException();
    }

}

/**
 *
 * @author Sanjeev Kulkarni
 * @version 1.0
 *  Custom exception for messages in file search
 */
public class FileSearchException extends Exception {

    public FileSearchException(){
        super();
    }

    public FileSearchException(String msg){
        super(msg);
    }

    //add machines id's

}

package search;

import java.util.Iterator;
import java.util.SortedSet;
import java.util.TreeSet;

public class MySET> implements Iterable {

    private TreeSet set;

    /**
     * Create an empty set.
     */
    public MySET() {
        set = new TreeSet();
    }

    /**
     * Is this set empty?
     */
    public boolean isEmpty() {
        return set.isEmpty();
    }

    /**
     * Add the key to this set.
     */
    public void add(Key key) {
        set.add(key);
    }

    /**
     * Does this set contain the given key?
     */
    public boolean contains(Key key) {
        return set.contains(key);
    }

    /**
     * Delete the given key from this set.
     */
    public void delete(Key key) {
        set.remove(key);
    }

    /**
     * Return the number of keys in this set.
     */
    public int size() {
        return set.size();
    }

    /**
     * Return an Iterator for this set.
     */
    public Iterator iterator() {
        return set.iterator();
    }

    /**
     * Return the key in this set with the maximum value.
     */
    public Key max() {
        return set.last();
    }

    /**
     * Return the key in this set with the minimum value.
     */
    public Key min() {
        return set.first();
    }

    /**
     * Return the smallest key in this set >= k.
     */
    public Key ceil(Key k) {
        SortedSet tail = set.tailSet(k);
        if (tail.isEmpty())
            return null;
        else
            return tail.first();
    }

    /**
     * Return the largest key in this set <= k.
     */
    public Key floor(Key k) {
        if (set.contains(k))
            return k;

        // does not include key if present (!)
        SortedSet head = set.headSet(k);
        if (head.isEmpty())
            return null;
        else
            return head.last();
    }

    /**
     * Return the union of this set with that set.
     */
    public MySET union(MySET that) {
        MySET c = new MySET();
        for (Key x : this) {
            c.add(x);
        }
        for (Key x : that) {
            c.add(x);
        }
        return c;
    }

    /**
     * Return the intersection of this set with that set.
     */
    public MySET intersects(MySET that) {
        MySET c = new MySET();
        if (this.size() < that.size()) {
            for (Key x : this) {
                if (that.contains(x))
                    c.add(x);
            }
        } else {
            for (Key x : that) {
                if (this.contains(x))
                    c.add(x);
            }
        }
        return c;
    }

}

package search;

import java.util.Iterator;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 *
 * @author Sanjeev Kulkarni
 * @version 1.0
 * Symbol Table with Key = keyword, Value = SET of Files that
 * contain this keyword
 * @param 
 * @param 
 */

public class MyST, Value> implements Iterable {
    private TreeMap st;

    /**
     * Create an empty symbol table.
     */
    public MyST() {
        st = new TreeMap();
    }

    /**
     * Put key-value pair into the symbol table. Remove key from table if
     * value is null.
     */
    public void put(Key key, Value val) {
        if (val == null) st.remove(key);
        else             st.put(key, val);
    }

    /**
     * Return the value paired with given key; null if key is not in table.
     */
    public Value get(Key key) {
        return st.get(key);
    }

    /**
     * Delete the key (and paired value) from table.
     * Return the value paired with given key; null if key is not in table.
     */
    public Value delete(Key key) {
        return st.remove(key);
    }

    /**
     * Is the key in the table?
     */
    public boolean contains(Key key) {
        return st.containsKey(key);
    }

    /**
     * How many keys are in the table?
     */
    public int size() {
        return st.size();
    }

    /**
     * Return an Iterator for the keys in the table.
     * To iterate over all of the keys in the symbol table st, use the
     * foreach notation: for (Key key : st).
     */
    public Iterator iterator() {
        return st.keySet().iterator();
    }

    /**
     * Return an Iterable for the keys in the table.
     * To iterate over all of the keys in the symbol table st, use the
     * foreach notation: for (Key key : st.keys()).
     */
    public Iterable keys() {
        return st.keySet();
    }

    /**
     * Return the smallest key in the table.
     */
    public Key min() {
        return st.firstKey();
    }

    /**
     * Return the largest key in the table.
     */
    public Key max() {
        return st.lastKey();
    }

    /**
     * Return the smallest key in the table >= k.
     */
    public Key ceil(Key k) {
        SortedMap tail = st.tailMap(k);
        if (tail.isEmpty()) return null;
        else return tail.firstKey();
    }

    /**
     * Return the largest key in the table <= k.
     */
    public Key floor(Key k) {
        if (st.containsKey(k)) return k;
        // does not include key if present (!)
        SortedMap head = st.headMap(k);
        if (head.isEmpty()) return null;
        else return head.lastKey();
    }
}

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: