Thursday, December 11, 2014

Nine Chapter: 9. Big Data

1. Most frequent IP Addresses
Find the most frequent IP address in web log. The log size may over 100 G.

Naive Method:
Use a HashMap to count the frequencies for each IP address.
The key is the IP address, while the value is the counts of the IP address. 

Complexity:
Time: O(N * Read)
Memory: O(N').

Optimization 1: 
Only 2^32 distinct IP addresses -> 4G. 
For the longest Ip address, 255.255.255.255, it needs 15 characters, appropriately 16 bytes to store an IP address at worst case. So the need of total memory is 64 GB. 

The problem here is it might be too large to fit into memory. What if the memory size we have is only 100 M?

Optimization 2:
  • Split the 100 G file into 1000 files, each file has 100 MB.
  • The route function is hash(IP) % 1000.
  • Load a 100 MB file each time, get the most frequent IP, and store the result into file
Complexity:
Time : O(2 * N * read + N * write)
Space: O(N' / 1000)

Disadvantages:
Need more disk space, too much write.

Optimization 3: Map Reduce
http://www.ninechapter.com/problem/19/

将日志文件划分成适度大小的M份存放到处理节点。
每个map节点所完成的工作:统计访问百度的ip的出现频度(比较像统计词频,使用字典树),并完成相同ip的合并(combine)。
map节点将其计算的中间结果partition到R个区域,并告知master存储位置,所有map节点工作完成之后,reduce节点首先读入数据,然后以中间结果的key排序,对于相同key的中间结果调用用户的reduce函数,输出。
扫描reduce节点输出的R个文件一遍,可获得访问网站度次数最多的ip。
面试官角度:
该问题是经典的Map-Reduce问题。一般除了问最常访问,还可能会问最常访问的K个IP。一般来说,遇到这个问题要先回答Map-Reduce的解法。因为这是最常见的解法也是一般面试官的考点。如果面试官水平高一点,要进一步问你有没有其他解法的话,该问题有很多概率算法。能够在极少的空间复杂度内,扫描一遍log即可得到Top k Frequent Items(在一定的概率内)。有兴趣的读者,可以搜搜“Sticky Sampling”,”Lossy Counting”这两个算法。

2. Top K most frequent IP address:
Idea: Use min-heap. 

  • Build a Min Heap MH of the first k elements (arr[0] to arr[k-1]) of the given array. O(k)
  • For each element, after the kth element (arr[k] to arr[n-1]), compare it with root of MH
    • If the element is greater than the root then make it root and call heapify for MH
    • Else ignore it.
    • The step 2 is O((n-k)*logk)
  • Finally, MH has k largest elements and root of the MH is the kth largest element.
Time Complexity: 
O(k + (n-k)Logk) without sorted output. 
If sorted output is needed then O(k + (n-k)Logk + kLogk)








No comments:

Post a Comment