
Convert PDBs to FASTAs with BioPython

This is a quick script that converts a stack of PDB files in the ./PDBs directory into one or multiple FASTA files in the ./FASTAs directory. To run it, drop PDB files into the ./PDBs directory and call the build_fastas function defined below.

The output will end up looking like this:

In [5]:
!tree
.
├── FASTAs
│   ├── 1UBQ_A.fasta
│   ├── 3I3C_A.fasta
│   ├── 3I3C_B.fasta
│   ├── 3I3C_C.fasta
│   ├── 3I3C_D.fasta
│   ├── 7JNY_A.fasta
│   └── all_PDBs.fasta
├── PDB_to_FASTA_Converter.ipynb
└── PDBs
    ├── 1ubq.pdb
    ├── 3i3c.pdb
    └── 7jny.pdb

2 directories, 11 files

For this test case, the input should look something like this:

.
├── FASTAs
├── PDB_to_FASTA_Converter.ipynb
└── PDBs
    ├── 1ubq.pdb
    ├── 3i3c.pdb
    └── 7jny.pdb

2 directories, 4 files

As long as the desired PDB files are in the PDBs directory, you should be good to go.

Dependencies

The only external dependency is BioPython (specifically the SeqIO module). To install:

  • with conda: conda install -c conda-forge biopython
  • with pip: pip install biopython
In [6]:
from Bio import SeqIO

from pathlib import Path
import re

Method

This method will convert all of the PDBs in the PDBs directory into a FASTA file or multiple FASTA files. A few notes:

  • Set single_output=True to compile the sequences from all input PDBs into a single FASTA file (all_PDBs.fasta).
  • The "pdb-atom" parser may be a better choice than "pdb-seqres" depending on how Rosetta PDB outputs are formatted: "pdb-seqres" reads the SEQRES header records, while "pdb-atom" derives the sequence from the residues present in the coordinates. Try changing this if the output files are empty or incorrect.
  • This method will overwrite FASTAs with the same name if they're already in the ./FASTAs/ directory. Change the 'w' mode to 'x' in the built-in open to raise a FileExistsError instead.
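
To make the overwrite note concrete, here is a minimal sketch of how the 'w' and 'x' modes differ. The helper name write_fasta and its overwrite parameter are hypothetical and not part of the converter below; the sketch only assumes the standard behavior of the built-in open.

```python
def write_fasta(path, name, seq, overwrite=True):
    """Write one FASTA record; return True if written, False if skipped.

    Hypothetical helper for illustration only.
    """
    # 'w' truncates an existing file; 'x' refuses to open one that exists
    mode = "w" if overwrite else "x"
    try:
        with open(path, mode) as f:
            f.write(f">{name}\n{seq}\n")
    except FileExistsError:
        # raised only in 'x' mode when the file is already present
        return False
    return True
```

With overwrite=False, a second call for the same path leaves the first file untouched and returns False, which may be the safer default when the FASTAs directory already holds curated files.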
In [7]:
def build_fastas(single_output=False):
    fasta_dir = Path('./FASTAs/')

    # create (or empty) the combined FASTA file if necessary
    if single_output:
        with open(fasta_dir / 'all_PDBs.fasta', 'w'):
            pass

    # traverse PDB directory and extract sequence information
    for pdb_file in Path('./PDBs/').iterdir():

        # each record is one chain, with an id such as "1UBQ:A"
        for record in SeqIO.parse(str(pdb_file), "pdb-seqres"):
            name = re.sub('[^a-zA-Z0-9]', '_', record.id)
            seq = str(record.seq)

            # write one FASTA per chain, overwriting any existing file
            if not single_output:
                with open(fasta_dir / f'{name}.fasta', 'w') as f:
                    f.write(f">{name}\n")
                    f.write(seq + '\n')

            # append to the combined FASTA
            else:
                with open(fasta_dir / 'all_PDBs.fasta', 'a') as f:
                    f.write(f">{name}\n")
                    f.write(seq + '\n\n')
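
If some inputs (for example Rosetta outputs) lack SEQRES records, one option is to try "pdb-seqres" first and fall back to "pdb-atom" when no records come back. The sketch below is an assumption about how that could be wired up, not part of the method above: read_chains is a hypothetical helper, and its parse argument is a hypothetical hook that defaults to Bio.SeqIO.parse and exists only so the logic can be exercised without real PDB files.

```python
def read_chains(pdb_path, parse=None):
    """Return a list of (record_id, sequence) tuples for one PDB file.

    Hypothetical helper: tries "pdb-seqres" first, then "pdb-atom".
    """
    if parse is None:
        # imported lazily so the default parser is only needed when used
        from Bio import SeqIO
        parse = SeqIO.parse

    for fmt in ("pdb-seqres", "pdb-atom"):
        records = [(rec.id, str(rec.seq)) for rec in parse(str(pdb_path), fmt)]
        if records:
            # stop at the first parser that yields at least one chain
            return records
    return []
```

Note that the two parsers can disagree: "pdb-atom" only sees residues that have coordinates, so unresolved regions may be missing or marked as gaps relative to the SEQRES sequence.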

Run

Run the following cell to execute the converter.

In [8]:
build_fastas()
# or
# build_fastas(single_output = True)
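
After running the converter, a quick sanity check is to count the records in each output file. The following is a rough sketch using only the standard library; the function names count_fasta_records and summarize are hypothetical, and the default directory assumes the ./FASTAs layout shown at the top of this post.

```python
from pathlib import Path

def count_fasta_records(fasta_path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(fasta_path) as f:
        return sum(1 for line in f if line.startswith(">"))

def summarize(fasta_dir="./FASTAs"):
    """Map each .fasta file name in fasta_dir to its record count."""
    return {
        p.name: count_fasta_records(p)
        for p in sorted(Path(fasta_dir).glob("*.fasta"))
    }

# Example usage after build_fastas() has run:
# print(summarize())
```

Each per-chain file should report exactly one record, while all_PDBs.fasta should report one record per chain across all input PDBs.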