This is a quick script that converts a stack of PDB files in the ./PDBs
directory into one or more FASTA files in the ./FASTAs
directory. To run it, drop your PDB files into the ./PDBs directory and call the function defined below.
The output will end up looking like this:
!tree
.
├── FASTAs
│   ├── 1UBQ_A.fasta
│   ├── 3I3C_A.fasta
│   ├── 3I3C_B.fasta
│   ├── 3I3C_C.fasta
│   ├── 3I3C_D.fasta
│   ├── 7JNY_A.fasta
│   └── all_PDBs.fasta
├── PDB_to_FASTA_Converter.ipynb
└── PDBs
    ├── 1ubq.pdb
    ├── 3i3c.pdb
    └── 7jny.pdb
2 directories, 11 files
For this test case, the input should look something like this:
.
├── FASTAs
├── PDB_to_FASTA_Converter.ipynb
└── PDBs
    ├── 1ubq.pdb
    ├── 3i3c.pdb
    └── 7jny.pdb
2 directories, 4 files
As long as the desired PDB files are in the PDBs directory, you should be good to go.
from Bio import SeqIO
from pathlib import Path
import re
This function converts every PDB file in the ./PDBs directory into one FASTA file per chain, or into a single combined FASTA file. A few notes:
- Set the single_output flag to True to compile the sequences from all input PDBs into a single FASTA file (all_PDBs.fasta).
- Existing files in the ./FASTAs/ directory are overwritten. Change the 'w' flag to 'x' in the built-in open calls to change this behavior.
def build_fastas(single_output = False):
    # create (or truncate) the combined FASTA file if necessary
    if single_output:
        with open('./FASTAs/all_PDBs.fasta', 'w'): pass
    # traverse PDB directory and extract sequence information
    for pdb_file in Path('./PDBs/').iterdir():
        # "pdb-seqres" yields one record per chain listed in the SEQRES records
        for record in SeqIO.parse(str(pdb_file), "pdb-seqres"):
            # replace non-alphanumeric characters, e.g. the ':' in '1UBQ:A'
            name = re.sub('[^a-zA-Z0-9]', '_', record.id)
            seq = str(record.seq)
            if not single_output:
                # write unique FASTAs, one file per chain
                with open(f'./FASTAs/{name}.fasta', 'w') as f:
                    f.write(f">{name}\n")
                    f.write(seq + '\n')
            else:
                # append to the single combined FASTA
                with open('./FASTAs/all_PDBs.fasta', 'a') as f:
                    f.write(f">{name}\n")
                    f.write(seq)
                    f.write('\n\n')
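The difference between the 'w' and 'x' modes mentioned in the notes can be seen in isolation with a small sketch. The write_fasta helper, its overwrite parameter, and the short placeholder sequence are assumptions made for illustration only; they are not part of the converter above.

```python
import tempfile
from pathlib import Path


def write_fasta(path, name, seq, overwrite=True):
    """Write one FASTA record (hypothetical helper, not part of the notebook)."""
    # 'w' truncates an existing file; 'x' raises FileExistsError instead
    mode = 'w' if overwrite else 'x'
    with open(path, mode) as f:
        f.write(f">{name}\n{seq}\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "1UBQ_A.fasta"
    write_fasta(target, "1UBQ_A", "MQIF")   # creates the file
    write_fasta(target, "1UBQ_A", "MQIF")   # 'w': silently overwrites it
    try:
        write_fasta(target, "1UBQ_A", "MQIF", overwrite=False)
        refused = False
    except FileExistsError:
        refused = True                      # 'x': refuses to overwrite

print("overwrite refused:", refused)
```

With 'x', a second run of the converter would stop at the first file that already exists, so the try/except shown here (or clearing the output directory first) would be needed to handle that case.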
Run the following cell to execute the converter.
build_fastas()
# or
# build_fastas(single_output = True)
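To check the result, the generated files can be read back. The following is a minimal sketch using only the standard library; read_fasta is a hypothetical helper, and the demo file contents are placeholder sequences, not real converter output. It tolerates optional whitespace around the header name and blank lines between records.

```python
import tempfile
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith('>'):
            # a new header closes the previous record, if any
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


# Self-contained demo on a temporary file laid out like a combined FASTA
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "all_PDBs.fasta"
    demo.write_text(">1UBQ_A\nMQIF\n\n> 3I3C_A \nACDE\n\n")
    demo_records = read_fasta(demo)

print(demo_records)
```

In the notebook, calling read_fasta('./FASTAs/all_PDBs.fasta') after build_fastas(single_output = True) should return one entry per chain.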