I have a question and I hope I can formulate it in an understandable way. I am working in R and trying to write a script that filters a DNA dataset and extracts specific names together with their sequences, based on either the first 4 characters of the name or the full name, and saves the result as a new FASTA file.
The dataset I am working with contains 7754 observations of 2 variables (name and seq).
For better understanding, here is a smaller example dataset that I use to explain each step I am doing now:
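library(tidyr)    # separate()
library(stringr)  # str_sub()
library(dplyr)    # %>%, filter(), mutate(), select()

name <- c("B*07:02:01:01", "B*07:02:01:02", "B*07:02:01:03", "B*07:98", "B*07:99", "B*08:01:01:01", "B*08:01:01:02", "B*08:01:01:03", "B*15:27:01",
          "B*15:27:02", "B*15:27:03")
seq <- c("AAAA", "TTTT", "GGGG", "ATCG", "AATG", "ATTC", "GTCT", "TCTT", "TAGTC", "CATG", "TGCA")
df3 <- data.frame(name, seq)
# Split the name at ":"; shorter names such as "B*07:98" are padded with NA on the right
df4 <- separate(data = df3, col = name, into = c("name", "protein", "mutation_coding", "mutation_noncoding"), sep = ":", fill = "right")
# First four characters of each name, e.g. "B*07"
common_identifier <- unique(str_sub(df4$name, 1, 4))
# write.csv() returns NULL, so its result is not stored in a list
for (element in common_identifier) {
  df4 %>%
    filter(grepl(element, name, fixed = TRUE)) %>%  # fixed = TRUE: "*" is matched literally, not as a regex quantifier
    mutate(name = element) %>%
    select(name, protein, mutation_coding, mutation_noncoding, seq) %>%
    write.csv(paste0("../saving path", gsub("\\*", "", element), "_list.csv"), row.names = FALSE)
}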
First of all, I am saving CSV files, which is annoying because I then have to convert the created files into FASTA files by hand. Do you know a way to save them directly as FASTA files?
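To make the target format concrete, here is a minimal sketch of what writing FASTA directly could look like in base R. The helper `write_fasta()` and the temporary output path are hypothetical names made up for illustration, and the sketch assumes a data frame with exactly the two columns `name` and `seq`, with each sequence written on a single line. Packages such as seqinr (`write.fasta`) or Biostrings (`writeXStringSet`) also provide FASTA writers, but they are not used here.

```r
# Hypothetical helper (not from any package): writes a data frame with
# columns `name` and `seq` as FASTA, one ">name" header line per sequence.
write_fasta <- function(df, path) {
  # rbind() stacks headers over sequences; as.vector() reads the matrix
  # column by column, giving >name1, seq1, >name2, seq2, ...
  lines <- as.vector(rbind(paste0(">", df$name), as.character(df$seq)))
  writeLines(lines, path)
  invisible(path)
}

# Small subset of the example data
df3 <- data.frame(
  name = c("B*07:02:01:01", "B*07:98", "B*08:01:01:01"),
  seq  = c("AAAA", "ATCG", "ATTC"),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".fasta")  # placeholder path for this sketch
write_fasta(df3, out_file)
fasta_lines <- readLines(out_file)
```

Because the full name goes straight into the header line, nothing has to be reassembled afterwards.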
Second, in this script I am basically splitting the name into parts to look for a specific pattern (the first four characters), and later I put the name back together to get the actual name back. Do you know an easier way to do this? Or is there a more elegant way of looking for the patterns "B*07", "B*08" and "B*15" without splitting the name first?
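One possible direction, sketched under the assumption that the group is always identified by the first four characters of the name: take the prefix with `substr()` and group on it, so the full names are never split. The variable names `prefix`, `groups` and `b07` are made up for this illustration.

```r
name <- c("B*07:02:01:01", "B*07:02:01:02", "B*07:02:01:03", "B*07:98", "B*07:99",
          "B*08:01:01:01", "B*08:01:01:02", "B*08:01:01:03",
          "B*15:27:01", "B*15:27:02", "B*15:27:03")
seq <- c("AAAA", "TTTT", "GGGG", "ATCG", "AATG", "ATTC", "GTCT", "TCTT",
         "TAGTC", "CATG", "TGCA")
df3 <- data.frame(name, seq, stringsAsFactors = FALSE)

# First four characters of each full name, e.g. "B*07".
# substr() works on positions, so the "*" needs no escaping.
prefix <- substr(df3$name, 1, 4)

# One data frame per prefix; the full names stay untouched.
groups <- split(df3, prefix)

# Picking out a single group by literal prefix, without a regex.
b07 <- df3[startsWith(df3$name, "B*07"), ]
```

`startsWith()` compares literally, which avoids the problem that `*` is a quantifier in regular expressions; with `grepl()` the same effect needs `fixed = TRUE` or an escaped pattern such as `"^B\\*07"`.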
Thank you in advance, I hope that was kind of a clear question :)
Cheers...