Find the most frequent IP address in web log. The log size may over 100 G.
Use a HashMap to count the frequencies for each IP address.
The key is the IP address, while the value is the counts of the IP address.
Time: O(N * Read)
Only 2^32 distinct IP addresses -> 4G.
For the longest Ip address, 255.255.255.255, it needs 15 characters, appropriately 16 bytes to store an IP address at worst case. So the need of total memory is 64 GB.
The problem here is it might be too large to fit into memory. What if the memory size we have is only 100 M?
- Split the 100 G file into 1000 files, each file has 100 MB.
- The route function is hash(IP) % 1000.
- Load a 100 MB file each time, get the most frequent IP, and store the result into file
Time : O(2 * N * read + N * write)
Space: O(N' / 1000)
Need more disk space, too much write.
Optimization 3: Map Reduce
2. Top K most frequent IP address:
Idea: Use min-heap.
- Build a Min Heap MH of the first k elements (arr to arr[k-1]) of the given array. O(k)
- For each element, after the kth element (arr[k] to arr[n-1]), compare it with root of MH
- If the element is greater than the root then make it root and call heapify for MH
- Else ignore it.
- The step 2 is O((n-k)*logk)
- Finally, MH has k largest elements and root of the MH is the kth largest element.
O(k + (n-k)Logk) without sorted output.
If sorted output is needed then O(k + (n-k)Logk + kLogk)