Wednesday, 22 September 2021

Filter columns by specific name and save as a FASTA file

I have a question and I hope I can formulate it in an understandable way. I am working in R and trying to write a script that filters a DNA dataset: it should extract specific names together with their sequences, based either on the first 4 characters of the name or on the full name, and save the result as a new FASTA file.

The dataset I am working with contains 7754 observations of 2 variables (name and seq).

For better understanding, here is a smaller dataset I created, with which I can explain each step I am doing now:

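library(dplyr)    # %>%, filter(), mutate(), select()
library(tidyr)    # separate()
library(stringr)  # str_sub()

name <- c("B*07:02:01:01","B*07:02:01:02", "B*07:02:01:03","B*07:98", "B*07:99","B*08:01:01:01","B*08:01:01:02","B*08:01:01:03","B*15:27:01",
      "B*15:27:02","B*15:27:03")
seq <- c("AAAA","TTTT","GGGG","ATCG","AATG","ATTC","GTCT","TCTT","TAGTC","CATG","TGCA")

df3 <- data.frame(name, seq)

# Split the name at ":" into its fields; names with fewer fields are padded with NA
df4 <- separate(data = df3, col = name, into = c("name","protein","mutation_coding","mutation_noncoding"), sep = ":", fill = "right")

# Distinct four-character prefixes, e.g. "B*07" (str_sub positions start at 1)
common_identifier <- unique(str_sub(df4$name, 1, 4))

out <- list()
for (element in common_identifier) {

  # fixed = TRUE matches the prefix literally; otherwise "*" is read as a regex quantifier
  out[[element]] <- df4 %>%
    filter(grepl(element, name, fixed = TRUE)) %>%
    mutate(name = element) %>%
    select(name, protein, mutation_coding, mutation_noncoding, seq)

  # Write after storing: write.csv() returns NULL, so piping into it would leave `out` empty
  write.csv(out[[element]], paste0("../saving path", gsub("\\*", "", element), "_list.csv"), row.names = FALSE)

}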

First of all, I am saving the output as CSV files, which is annoying because I then have to convert the created files into FASTA files by hand. Do you know a way to save directly as a FASTA file?
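One possible way to skip the CSV step, sketched in base R only: a FASTA file is plain text with a ">" header line followed by the sequence, so the two columns can be interleaved and written with writeLines(). The helper name write_fasta, the output directory out_dir and the file names (B07.fasta and so on) are assumptions made for this illustration, not part of the original script. Packages such as seqinr (write.fasta) offer ready-made writers, which may be preferable for real data.

```r
# Toy data from the question
name <- c("B*07:02:01:01", "B*07:02:01:02", "B*07:02:01:03", "B*07:98", "B*07:99",
          "B*08:01:01:01", "B*08:01:01:02", "B*08:01:01:03",
          "B*15:27:01", "B*15:27:02", "B*15:27:03")
seq <- c("AAAA", "TTTT", "GGGG", "ATCG", "AATG", "ATTC", "GTCT", "TCTT",
         "TAGTC", "CATG", "TGCA")
df3 <- data.frame(name, seq, stringsAsFactors = FALSE)

# Hypothetical helper: write a data frame with columns `name` and `seq` as FASTA
write_fasta <- function(df, path) {
  headers <- paste0(">", as.character(df$name))
  # rbind() stacks headers over sequences; as.vector() reads the matrix column
  # by column, which yields header1, seq1, header2, seq2, ...
  lines <- as.vector(rbind(headers, as.character(df$seq)))
  writeLines(lines, path)
}

# Group the full rows by the first four characters of the name ("B*07", "B*08", ...)
groups <- split(df3, substr(df3$name, 1, 4))

out_dir <- tempdir()  # assumption: replace with the real output folder
for (id in names(groups)) {
  # Drop the literal "*" so the file name is e.g. "B07.fasta"
  file_name <- paste0(gsub("*", "", id, fixed = TRUE), ".fasta")
  write_fasta(groups[[id]], file.path(out_dir, file_name))
}
```

Because the rows are grouped without splitting the name first, each header keeps the full original name, such as ">B*07:02:01:01".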

Second, in this script I'm basically splitting the name into parts to look for a specific pattern (the first four characters), and later I put the name back together to recover the actual name. Do you know an easier way to do this? Or is there a more elegant way of looking for the patterns "B*07", "B*08" and "B*15" without splitting the name first?
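A minimal sketch of matching the prefix directly on the unsplit name, assuming the prefix is always the first four characters. The shortened `name` vector and the variable names below are only for illustration. The main pitfall is that "*" is a regex quantifier, so it must either be escaped or matched literally.

```r
# A few names from the question, left unsplit
name <- c("B*07:02:01:01", "B*07:98", "B*08:01:01:01", "B*08:01:01:02", "B*15:27:01")

# startsWith() compares literally, so the "*" needs no escaping
is_b07 <- startsWith(name, "B*07")

# Regex alternative: escape the asterisk and anchor the pattern at the start
is_b08 <- grepl("^B\\*08", name)

# Several prefixes at once: compare the first four characters against a set
wanted <- c("B*07", "B*15")
is_wanted <- substr(name, 1, 4) %in% wanted

b07_names    <- name[is_b07]
b08_names    <- name[is_b08]
wanted_names <- name[is_wanted]
```

The same logical vectors can be used to subset a data frame, for example df3[startsWith(df3$name, "B*07"), ], which keeps the full name and its sequence together.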

Thank you in advance, I hope that was kind of a clear question :)

Cheers...
