I am looking for a very specific pattern in a database of proteins. The structure I am looking for is C1XXXC2-X(n)-C3XXXC4, where C = cysteine and X = any residue other than cysteine.
For example: MALEAQMTLRMFVLVAMASTVHVLSSSFSEDLGTVPLSKVFRSETRFTLIQSLRALLSRQLEAEVHQPEIGHPGFSDETSSRTGKRGGLGRCIHNCMNSRGGLNFIQCKTMCS
As you can see, this sequence has ONLY four C's, and they are in the pattern that I mentioned above. There cannot be any more C's in the rest of the sequence!
I was using this command: grep "[A-Z]"C...C"[A-Z]"C...C"[A-Z]*", but it also returns sequences with more than four C's.
Thank you in advance for any help.
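One possible approach, offered as a sketch rather than a definitive answer: the command above lets extra C's through because both `.` and `[A-Z]` match C, and because the pattern is not anchored, so grep accepts any line that merely contains the motif somewhere. Replacing those wildcards with `[^C]` and anchoring with `^` and `$` forces the whole line to contain exactly four C's. The sketch assumes one sequence per line (no FASTA header lines, no wrapped sequences), uppercase residues, and n >= 1 for the X(n) spacer. The names `pattern`, `find_four_cys`, and `sequences.txt` are hypothetical.

```shell
# Exactly four C's laid out as C-X(3)-C-X(n)-C-X(3)-C, matched against the whole line.
# [^C] matches any character except C, so no extra cysteine can appear anywhere.
# Assumption: n >= 1. Change [^C]+ to [^C]* if the two CXXXC motifs may be adjacent.
pattern='^[^C]*C[^C]{3}C[^C]+C[^C]{3}C[^C]*$'

# Hypothetical helper: reads sequences (one per line) on stdin, prints the matches.
find_four_cys() {
    grep -E "$pattern"
}

# Usage, assuming sequences.txt holds one sequence per line:
#   find_four_cys < sequences.txt
```

If the database is in FASTA format, the sequences would first need to be joined onto single lines, since grep works line by line and a header line could otherwise match by accident.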