Monday, August 24, 2015

Leetcode: Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Understand the problem:
The naive approach is to use a sliding window, with size of 10. Put the visited string into a map. If the substring is visited again, add into the result. 

Code (Java):
public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<String>();
        if (s == null || s.length() < 10) {
            return result;
        }
        
        Map<String, Integer> map = new HashMap<String, Integer>();
        
        int lo = 0;
        int hi = 9;
        
        while (hi < s.length()) {
            String substring = s.substring(lo, hi + 1);
            if (map.containsKey(substring) && map.get(substring) == 1) {
                result.add(substring);
                map.put(substring, map.get(substring) + 1);
            } else {
                map.put(substring, 1);
            }
            
            lo++;
            hi++;
        }
        
        return result;
    }
}

The code got memory limit exceeding error. Why? Because the space complexity is O(10 * n). We need to think of a better approach with less memory cost. 

Use bit manipulation:
Since the string contains only 4 characters, we could encode the characters using 2 bits each, i.e, 
'A' -- 00
'C' -- 01
'G' -- 10
'T'  -- 11
And since the length of the substring is only 10 characters, the total number of bits needed are only 20. Therefore we could encode a string into a integer. 

Code (Java):
public class Solution {
    private Map<Character, Integer> codingMap = new HashMap<Character, Integer>();
    
    private int encode(String s) {
        int value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = (value << 2) + codingMap.get(s.charAt(i));
        }
        
        return value;
    }
    
    private void fill() {
        codingMap.put('A', 0);
        codingMap.put('C', 1);
        codingMap.put('G', 2);
        codingMap.put('T', 3);
    }
    
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<String>();
        if (s == null || s.length() < 10) {
            return result;
        }
        
        fill();
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        
        int lo = 0;
        int hi = 9;
        
        while (hi < s.length()) {
            String substring = s.substring(lo, hi + 1);
            int hashCode = encode(substring);
            if (map.containsKey(hashCode) && map.get(hashCode) == 1) {
                result.add(substring);
                map.put(hashCode, map.get(hashCode) + 1);
            } else {
                if (!map.containsKey(hashCode)) {
                    map.put(hashCode, 1);
                } else {
                    map.put(hashCode, map.get(hashCode) + 1);
                }
            }
            
            lo++;
            hi++;
        }
        
        return result;
    }
}

Analysis:
Now let's analyze the space cost. Since in Java, each character takes 2 bytes. For the previous solution using string, a 10-character substring takes 20 byte. For using the integer, which takes only 4 bytes. So the new solution saves the memory by 1/5. 

No comments:

Post a Comment