All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Understand the problem:
The naive approach is to use a sliding window, with size of 10. Put the visited string into a map. If the substring is visited again, add into the result.
Code (Java):
public class Solution { public List<String> findRepeatedDnaSequences(String s) { List<String> result = new ArrayList<String>(); if (s == null || s.length() < 10) { return result; } Map<String, Integer> map = new HashMap<String, Integer>(); int lo = 0; int hi = 9; while (hi < s.length()) { String substring = s.substring(lo, hi + 1); if (map.containsKey(substring) && map.get(substring) == 1) { result.add(substring); map.put(substring, map.get(substring) + 1); } else { map.put(substring, 1); } lo++; hi++; } return result; } }
The code got memory limit exceeding error. Why? Because the space complexity is O(10 * n). We need to think of a better approach with less memory cost.
Use bit manipulation:
Since the string contains only 4 characters, we could encode the characters using 2 bits each, i.e,
'A' -- 00
'C' -- 01
'G' -- 10
'T' -- 11
And since the length of the substring is only 10 characters, the total number of bits needed are only 20. Therefore we could encode a string into a integer.
Code (Java):
public class Solution { private Map<Character, Integer> codingMap = new HashMap<Character, Integer>(); private int encode(String s) { int value = 0; for (int i = 0; i < s.length(); i++) { value = (value << 2) + codingMap.get(s.charAt(i)); } return value; } private void fill() { codingMap.put('A', 0); codingMap.put('C', 1); codingMap.put('G', 2); codingMap.put('T', 3); } public List<String> findRepeatedDnaSequences(String s) { List<String> result = new ArrayList<String>(); if (s == null || s.length() < 10) { return result; } fill(); Map<Integer, Integer> map = new HashMap<Integer, Integer>(); int lo = 0; int hi = 9; while (hi < s.length()) { String substring = s.substring(lo, hi + 1); int hashCode = encode(substring); if (map.containsKey(hashCode) && map.get(hashCode) == 1) { result.add(substring); map.put(hashCode, map.get(hashCode) + 1); } else { if (!map.containsKey(hashCode)) { map.put(hashCode, 1); } else { map.put(hashCode, map.get(hashCode) + 1); } } lo++; hi++; } return result; } }
Analysis:
Now let's analyze the space cost. Since in Java, each character takes 2 bytes. For the previous solution using string, a 10-character substring takes 20 byte. For using the integer, which takes only 4 bytes. So the new solution saves the memory by 1/5.
No comments:
Post a Comment