Top K words in a document

Ref: http://stackoverflow.com/questions/185697/the-most-efficient-way-to-find-top-k-frequent-words-in-a-big-word-sequence

  1. Basic solution

    a. use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

    b. sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.

    c. After sorting, we just take the first K words. This takes O(K) time.

    the total time is O(n+nlg(n)+K)

  2. use heap instead of complete sort

    b'. build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

    c' extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

    To summarize, this solution cost time O(n+k*lg(n)).

  3. use bucket sort to reduce the time complexity

  4. large number of words: Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity

  5. use trie and heap

    ref: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/

  6. Optimized: use quick select O(n) time and O(k) space

    ref: http://www.zrzahid.com/top-k-or-k-most-frequent-words-in-a-document/

    public String[] topKWordsSelect(final String stream, final int k) {
    final Map<String, Integer> frequencyMap = new HashMap<String, Integer>();

    final String[] words = stream.toLowerCase().trim().split(" ");
    for (final String word : words) {
        int freq = 1;
        if (frequencyMap.containsKey(word)) {
            freq = frequencyMap.get(word) + 1;
        }

        // update the frequency map
        frequencyMap.put(word, freq);
    }

    // Find kth largest frequency which is same as (n-k)th smallest frequency
    final int[] frequencies = new int[frequencyMap.size()];
    int i = 0;
    for (final int value : frequencyMap.values()) {
        frequencies[i++] = value;
    }
    final int kthLargestFreq = kthSmallest(frequencies, 0, i - 1, i - k);

    // extract the top K
    final String[] topK = new String[k];
    i = 0;
    for (final java.util.Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
        if (entry.getValue().intValue() >= kthLargestFreq) {
            topK[i++] = entry.getKey();
            if (i == k) {
                break;
            }
        }
    }

    return topK;
}