All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Understand the problem:
The naive approach is to use a sliding window, with size of 10. Put the visited string into a map. If the substring is visited again, add into the result.
Code (Java):
public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
List<String> result = new ArrayList<String>();
if (s == null || s.length() < 10) {
return result;
}
Map<String, Integer> map = new HashMap<String, Integer>();
int lo = 0;
int hi = 9;
while (hi < s.length()) {
String substring = s.substring(lo, hi + 1);
if (map.containsKey(substring) && map.get(substring) == 1) {
result.add(substring);
map.put(substring, map.get(substring) + 1);
} else {
map.put(substring, 1);
}
lo++;
hi++;
}
return result;
}
}
The code got memory limit exceeding error. Why? Because the space complexity is O(10 * n). We need to think of a better approach with less memory cost.
Use bit manipulation:
Since the string contains only 4 characters, we could encode the characters using 2 bits each, i.e,
'A' -- 00
'C' -- 01
'G' -- 10
'T' -- 11
And since the length of the substring is only 10 characters, the total number of bits needed are only 20. Therefore we could encode a string into a integer.
Code (Java):
public class Solution {
private Map<Character, Integer> codingMap = new HashMap<Character, Integer>();
private int encode(String s) {
int value = 0;
for (int i = 0; i < s.length(); i++) {
value = (value << 2) + codingMap.get(s.charAt(i));
}
return value;
}
private void fill() {
codingMap.put('A', 0);
codingMap.put('C', 1);
codingMap.put('G', 2);
codingMap.put('T', 3);
}
public List<String> findRepeatedDnaSequences(String s) {
List<String> result = new ArrayList<String>();
if (s == null || s.length() < 10) {
return result;
}
fill();
Map<Integer, Integer> map = new HashMap<Integer, Integer>();
int lo = 0;
int hi = 9;
while (hi < s.length()) {
String substring = s.substring(lo, hi + 1);
int hashCode = encode(substring);
if (map.containsKey(hashCode) && map.get(hashCode) == 1) {
result.add(substring);
map.put(hashCode, map.get(hashCode) + 1);
} else {
if (!map.containsKey(hashCode)) {
map.put(hashCode, 1);
} else {
map.put(hashCode, map.get(hashCode) + 1);
}
}
lo++;
hi++;
}
return result;
}
}
Analysis:
Now let's analyze the space cost. Since in Java, each character takes 2 bytes. For the previous solution using string, a 10-character substring takes 20 byte. For using the integer, which takes only 4 bytes. So the new solution saves the memory by 1/5.
No comments:
Post a Comment