PDB File Parser
PDB file parser for molecular structure analysis using pdbreader.
This module provides functionality to parse PDB (Protein Data Bank) files and extract atomic coordinates and molecular information using the pdbreader library.
- class hbat.core.pdb_parser.PDBParser
Bases: object
Parser for PDB format files using pdbreader.
This class handles parsing of PDB (Protein Data Bank) format files and converts them into HBAT’s internal atom and residue representations. Uses the pdbreader library for robust PDB format handling.
- __init__() → None
Initialize PDB parser.
Creates a new parser instance with empty atom and residue lists.
- parse_file(filename: str) → bool
Parse a PDB file.
Reads and parses a PDB format file, extracting all ATOM and HETATM records and converting them to HBAT’s internal representation.
- parse_lines(lines: List[str]) → bool
Parse PDB format lines.
Parses PDB format content provided as a list of strings, useful for processing in-memory PDB data.
- get_atoms_by_residue(res_name: str) → List[Atom]
Get all atoms from residues with a specific name.
- get_hydrogen_atoms() → List[Atom]
Get all hydrogen atoms.
- Returns:
List of all hydrogen and deuterium atoms
- Return type:
List[Atom]
- has_hydrogens() → bool
Check if structure contains hydrogen atoms.
Determines if the structure has a reasonable number of hydrogen atoms compared to heavy atoms, indicating explicit hydrogen modeling.
- Returns:
True if structure appears to contain explicit hydrogens
- Return type:
bool
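The hydrogen check described above can be pictured as a ratio test. The sketch below is only an illustration: the AtomRecord class, the function name, and the min_ratio threshold are assumptions made for this example, and HBAT's actual cutoff is not stated in this reference.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AtomRecord:
    """Hypothetical stand-in for HBAT's Atom; only the element symbol is used."""
    element: str


def looks_protonated(atoms: List[AtomRecord], min_ratio: float = 0.25) -> bool:
    """Return True when hydrogens (H or D) are plentiful relative to heavy atoms.

    min_ratio is an illustrative threshold, not HBAT's documented value.
    """
    hydrogens = sum(1 for atom in atoms if atom.element.upper() in {"H", "D"})
    heavy = len(atoms) - hydrogens
    if heavy == 0:
        # No heavy atoms: fall back to "any hydrogen at all".
        return hydrogens > 0
    return hydrogens / heavy >= min_ratio


water_without_h = [AtomRecord("O")]
water_with_h = [AtomRecord("O"), AtomRecord("H"), AtomRecord("H")]
```

A ratio test of this kind avoids treating a structure with a handful of stray hydrogens as fully protonated.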
- get_residue_list() → List[Residue]
Get list of all residues.
- Returns:
List of all residues in the structure
- Return type:
List[Residue]
- get_chain_ids() → List[str]
Get list of unique chain IDs.
- Returns:
List of unique chain identifiers in the structure
- Return type:
List[str]
- get_statistics() → Dict[str, Any]
Get basic statistics about the structure.
Provides counts of atoms, residues, chains, and element composition.
- Returns:
Dictionary containing structure statistics
- Return type:
Dict[str, Any]
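As a rough illustration of the kind of summary described above, the following sketch counts atoms, residues, chains, and elements from a list of simple records. The AtomRecord class and the dictionary keys are assumptions chosen for this example; the keys returned by get_statistics() are not listed in this reference.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class AtomRecord:
    """Hypothetical stand-in for HBAT's Atom with only the fields needed here."""
    element: str
    res_name: str
    res_seq: int
    chain_id: str


def summarize(atoms: List[AtomRecord]) -> Dict[str, Any]:
    """Count atoms, residues, chains, and elements (illustrative key names)."""
    # A residue is identified by chain, sequence number, and name.
    residues = {(a.chain_id, a.res_seq, a.res_name) for a in atoms}
    chains = sorted({a.chain_id for a in atoms})
    return {
        "atoms": len(atoms),
        "residues": len(residues),
        "chains": len(chains),
        "chain_ids": chains,
        "elements": dict(Counter(a.element for a in atoms)),
    }


atoms = [
    AtomRecord("N", "GLY", 1, "A"),
    AtomRecord("C", "GLY", 1, "A"),
    AtomRecord("O", "HOH", 101, "B"),
]
stats = summarize(atoms)
```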
- get_bonds() → List[Bond]
Get list of all bonds.
- Returns:
List of all bonds in the structure
- Return type:
List[Bond]
Bond Detection Methods
CONECT Record Parsing
- PDBParser._parse_conect_records(conect_data: Any) → None
Parse CONECT records to extract explicit bond information from PDB files.
CONECT records provide explicit bond connectivity information in PDB files. This method is called first during parsing, before the three-step bond detection, giving CONECT records effective priority in bond assignment.
Format:
CONECT records specify which atoms are bonded to each other using atom serial numbers. The pdbreader library provides this data as a DataFrame with:
parent: The serial number of the central atom
bonds: List of serial numbers of atoms bonded to the parent
Algorithm:
Iterate through each CONECT record row
Extract parent atom serial number from parent field
Extract list of bonded atom serial numbers from bonds field
For each bonded atom in the list:
    Verify both parent and bonded atom exist in _atom_serial_map
    Retrieve Atom objects for both atoms
    Calculate geometric distance between atoms
    Create Bond object with:
        bond_type="explicit" (marks as explicitly defined in PDB)
        detection_method=BondDetectionMethods.CONECT_RECORDS
        Calculated distance value
    Check _bond_exists() to avoid duplicates
    Append bond to self.bonds list
Handle any parsing errors gracefully with error messages
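The steps above can be sketched roughly as follows. This is a minimal illustration, not HBAT's implementation: the rows are plain dictionaries standing in for the pdbreader DataFrame, AtomRecord and BondRecord are hypothetical classes, and the string "conect_records" stands in for BondDetectionMethods.CONECT_RECORDS.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class AtomRecord:
    """Hypothetical minimal atom: serial number plus coordinates."""
    serial: int
    x: float
    y: float
    z: float


@dataclass
class BondRecord:
    """Hypothetical minimal bond mirroring the fields named in the text."""
    atom1_serial: int
    atom2_serial: int
    bond_type: str
    detection_method: str
    distance: float


def bonds_from_conect(rows: List[dict],
                      atoms_by_serial: Dict[int, AtomRecord]) -> List[BondRecord]:
    """Turn rows with 'parent' and 'bonds' keys into explicit bonds."""
    bonds: List[BondRecord] = []
    seen: Set[Tuple[int, int]] = set()  # plays the role of a duplicate check
    for row in rows:
        parent = atoms_by_serial.get(row["parent"])
        if parent is None:
            continue  # parent serial not present in the structure
        for partner_serial in row["bonds"]:
            partner = atoms_by_serial.get(partner_serial)
            if partner is None:
                continue  # bonded serial not present in the structure
            key = (min(parent.serial, partner.serial),
                   max(parent.serial, partner.serial))
            if key in seen:
                continue  # CONECT lists each bond from both ends
            seen.add(key)
            distance = math.dist((parent.x, parent.y, parent.z),
                                 (partner.x, partner.y, partner.z))
            bonds.append(BondRecord(key[0], key[1], "explicit",
                                    "conect_records", distance))
    return bonds


atoms_by_serial = {1: AtomRecord(1, 0.0, 0.0, 0.0), 2: AtomRecord(2, 1.5, 0.0, 0.0)}
rows = [{"parent": 1, "bonds": [2, 99]}, {"parent": 2, "bonds": [1]}]
explicit_bonds = bonds_from_conect(rows, atoms_by_serial)
```

In the example, serial 99 is ignored because no such atom exists, and the second row adds nothing because the 1-2 bond was already recorded.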
Priority and Relationship:
CONECT records are processed before _detect_bonds_three_step()
The three-step method respects existing CONECT bonds via _bond_exists() check
This prevents duplicate bond creation
Three-step method fills in missing bonds not in CONECT records
Common Usage:
Most PDB files do not include complete CONECT records. CONECT is typically used for:
Heteroatoms (ligands, cofactors, metals)
Non-standard residues
Disulfide bonds (CYS-CYS bridges)
Modified nucleotides/amino acids
Standard protein backbone and sidechain bonds are usually not in CONECT, requiring the three-step detection method.
- Parameters:
conect_data (Any) – CONECT records DataFrame from pdbreader with ‘parent’ and ‘bonds’ columns
- Returns:
None (bonds stored in self.bonds list)
- Return type:
None
Note
Bonds created from CONECT records are tagged with bond_type="explicit" to distinguish them from algorithmically detected bonds.
- PDBParser._detect_bonds_three_step() → None
Detect covalent bonds using a three-step hierarchical approach.
This method provides a robust and efficient bond detection strategy that combines high-accuracy residue-based lookup with fallback distance-based methods. The three steps are progressively invoked based on the success of previous steps.
Method Overview:
The bond detection follows a hierarchical strategy optimized for both accuracy and performance:
Residue Lookup (CCD-based): Uses Chemical Component Dictionary (CCD) data to identify bonds based on known residue topology
Intra-residue Distance: Distance-based detection limited to atoms within the same residue for improved performance
Spatial Grid: Full distance-based detection using spatial grid partitioning for O(n) complexity instead of O(n²)
Algorithm:
Skip if structure has fewer than 2 atoms
Step 1: Call _detect_bonds_from_residue_lookup()
Step 2: If residue bonds < 25% of atom count, call _detect_bonds_within_residues()
Step 3: If total bonds < 25% of atom count, call _detect_bonds_with_spatial_grid()
Build bond adjacency map for fast lookups via _build_bond_adjacency_map()
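The control flow above can be sketched as a small driver. The function below is an illustration under stated assumptions: the three callables are hypothetical stand-ins for the detection steps, each returning the number of bonds it added, and the returned list of step names exists only to make the behaviour visible.

```python
from typing import Callable, List


def detect_bonds_three_step(
    atom_count: int,
    residue_lookup: Callable[[], int],
    within_residues: Callable[[], int],
    spatial_grid: Callable[[], int],
) -> List[str]:
    """Run the fallback steps in order and report which ones were invoked."""
    steps_run: List[str] = []
    if atom_count < 2:
        return steps_run  # nothing to bond

    total = residue_lookup()
    steps_run.append("residue_lookup")

    # Step 2 only when the lookup found fewer bonds than 25% of the atom count.
    if total < atom_count / 4:
        total += within_residues()
        steps_run.append("within_residues")

    # Step 3 only when the running total is still below 25% of the atom count.
    if total < atom_count / 4:
        total += spatial_grid()
        steps_run.append("spatial_grid")

    # The adjacency map would be built here, after all bonds are known.
    return steps_run


well_covered = detect_bonds_three_step(100, lambda: 95, lambda: 0, lambda: 0)
partial = detect_bonds_three_step(100, lambda: 10, lambda: 40, lambda: 0)
no_topology = detect_bonds_three_step(100, lambda: 0, lambda: 10, lambda: 90)
```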
- Returns:
None (bonds stored in self.bonds list)
- Return type:
None
See also
_detect_bonds_from_residue_lookup() - Step 1: CCD-based detection
_detect_bonds_within_residues() - Step 2: Intra-residue distance detection
_detect_bonds_with_spatial_grid() - Step 3: Full spatial grid detection
_build_bond_adjacency_map() - Builds adjacency map for fast lookups
Note
All three steps avoid creating duplicate bonds by checking _bond_exists() before appending to the bonds list.
- PDBParser._detect_bonds_from_residue_lookup() → int
Detect bonds using residue bond information from Chemical Component Dictionary (CCD).
This is Step 1 of the three-step bond detection hierarchy. Uses known residue topology from the CCD database to identify bonds based on atom names, providing the highest accuracy for standard residues.
Algorithm:
Iterate through all residues in the structure
Query CCD database via get_residue_bonds(residue.name) for bond topology
Skip residue if no CCD data is available (modified/non-standard residues)
Create atom name → Atom object mapping for the residue
For each bond in CCD data:
    Extract atom names (atom1, atom2) from bond info
    Check if both atoms exist in the current residue instance
    Calculate geometric distance between atoms
    Create Bond with detection_method=BondDetectionMethods.RESIDUE_LOOKUP
    Check _bond_exists() to avoid duplicates
    Append to self.bonds and increment counter
Return total count of bonds found
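A template lookup of this kind can be sketched as below. The TOPOLOGY table is a tiny hand-written stand-in for the data that get_residue_bonds() retrieves from the CCD, and the function name and return shape are assumptions made for this example.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]

# Hand-written stand-in for CCD bond data; only water is included here.
TOPOLOGY: Dict[str, List[Tuple[str, str]]] = {
    "HOH": [("O", "H1"), ("O", "H2")],
}


def bonds_from_topology(res_name: str,
                        atoms: Dict[str, Coord]) -> List[Tuple[str, str, float]]:
    """Return (atom1, atom2, distance) for template bonds present in the residue."""
    template = TOPOLOGY.get(res_name)
    if template is None:
        return []  # no template: a distance-based fallback would be needed
    found = []
    for name1, name2 in template:
        # Atoms missing from the model (often hydrogens) are skipped.
        if name1 in atoms and name2 in atoms:
            found.append((name1, name2, math.dist(atoms[name1], atoms[name2])))
    return found


water = {"O": (0.0, 0.0, 0.0), "H1": (0.96, 0.0, 0.0)}  # H2 is not modelled
water_bonds = bonds_from_topology("HOH", water)
unknown_bonds = bonds_from_topology("XYZ", water)
```

Because bonds are matched by atom name rather than by geometry, a distorted or low-resolution model does not change the result.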
Limitations:
Only works for residues with CCD data
Modified residues may not have CCD entries
Non-standard ligands require fallback to distance-based methods
- Returns:
Number of bonds successfully detected using CCD lookup
- Return type:
int
See also
hbat.constants.get_residue_bonds() - Retrieves bond data from CCD
- PDBParser._detect_bonds_within_residues() → None
Detect bonds within individual residues using distance-based approach.
This is Step 2 of the three-step bond detection hierarchy. Performs distance-based bond detection limited to atoms within the same residue, providing a performance-optimized fallback when CCD data is insufficient.
When Used:
Invoked only if Step 1 (CCD lookup) finds fewer than 25% of expected bonds (heuristic: residue_bonds_found < len(atoms) / 4).
Algorithm:
Iterate through all residues in the structure
Skip residues with fewer than 2 atoms
For each residue, check all atom pairs within that residue (O(m²) where m = residue size):
    Calculate distance between atoms i and j (where j > i)
    Skip if distance > ParametersDefault.MAX_BOND_DISTANCE (fast rejection)
    Call _are_atoms_bonded_with_distance() to check bonding criteria
    If bonded, create Bond with detection_method=BondDetectionMethods.DISTANCE_BASED
    Check _bond_exists() to avoid duplicates
    Append to self.bonds
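The pairwise loop above might look like the following sketch. The residue dictionary layout, the function name, and the is_bonded callable (a stand-in for _are_atoms_bonded_with_distance()) are assumptions for illustration, and the numeric limits are placeholders rather than HBAT's parameter values.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Coord = Tuple[float, float, float]


def intra_residue_pairs(
    residues: Dict[str, Dict[str, Coord]],
    max_distance: float,
    is_bonded: Callable[[float], bool],
) -> List[Tuple[str, str, str]]:
    """Return (residue id, atom name i, atom name j) for bonded pairs."""
    pairs = []
    for res_id, atoms in residues.items():
        if len(atoms) < 2:
            continue  # a single atom cannot form an intra-residue bond
        # combinations() yields each unordered pair once (j > i).
        for (name_i, pos_i), (name_j, pos_j) in combinations(atoms.items(), 2):
            distance = math.dist(pos_i, pos_j)
            if distance > max_distance:
                continue  # fast rejection before the detailed criteria
            if is_bonded(distance):
                pairs.append((res_id, name_i, name_j))
    return pairs


residues = {
    "A:1": {"N": (0.0, 0.0, 0.0), "CA": (1.46, 0.0, 0.0), "C": (2.0, 1.4, 0.0)},
    "A:2": {"N": (3.3, 1.4, 0.0)},
}
found = intra_residue_pairs(residues, 2.5, lambda d: 0.5 <= d <= 1.9)
```

In this example the C atom of A:1 and the N atom of A:2 sit 1.3 Å apart, yet no bond is reported because the loop never compares atoms from different residues.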
Bonding Criteria:
Uses Van der Waals radii with covalent cutoff factor:
Bond exists if: ParametersDefault.MIN_BOND_DISTANCE ≤ distance ≤ vdw_cutoff
Where:
    vdw_cutoff = (vdw₁ + vdw₂) × ParametersDefault.COVALENT_CUTOFF_FACTOR
    vdw₁, vdw₂: Van der Waals radii from AtomicData.VDW_RADII
    Minimum distance: ParametersDefault.MIN_BOND_DISTANCE
    Maximum distance: ParametersDefault.MAX_BOND_DISTANCE
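The criteria above reduce to a short predicate. In the sketch below the radii are Bondi values in Å, while the three constants are placeholder numbers chosen for illustration only; the actual values of ParametersDefault.MIN_BOND_DISTANCE, MAX_BOND_DISTANCE, and COVALENT_CUTOFF_FACTOR are not given in this reference.

```python
# Bondi van der Waals radii in Å for a few common elements.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

# Placeholder values for illustration; not HBAT's documented defaults.
MIN_BOND_DISTANCE = 0.5
MAX_BOND_DISTANCE = 2.5
COVALENT_CUTOFF_FACTOR = 0.6


def are_bonded(element1: str, element2: str, distance: float) -> bool:
    """Apply the minimum, maximum, and scaled van der Waals cutoffs."""
    if distance < MIN_BOND_DISTANCE or distance > MAX_BOND_DISTANCE:
        return False
    vdw_cutoff = (VDW_RADII[element1] + VDW_RADII[element2]) * COVALENT_CUTOFF_FACTOR
    return distance <= vdw_cutoff


carbon_carbon_single = are_bonded("C", "C", 1.54)   # cutoff is 3.40 * 0.6 = 2.04
carbon_carbon_far = are_bonded("C", "C", 2.30)
water_oh = are_bonded("O", "H", 0.96)               # cutoff is 2.72 * 0.6 = 1.632
overlapping = are_bonded("C", "C", 0.30)
```

Scaling the summed radii lets a single factor adapt the cutoff to each element pair instead of keeping a table of per-pair bond lengths.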
Limitations:
Does not detect inter-residue bonds (e.g., peptide bonds, disulfides)
Requires Step 3 (spatial grid) to find cross-residue connectivity
- Returns:
None (bonds appended to self.bonds list)
- Return type:
None
See also
_are_atoms_bonded_with_distance() - Bonding criteria implementation
- PDBParser._detect_bonds_with_spatial_grid() → None
Optimized bond detection using spatial grid partitioning for full structure.
This is Step 3 of the three-step bond detection hierarchy. Performs comprehensive distance-based bond detection across the entire structure using spatial grid optimization to achieve O(n) complexity instead of O(n²).
When Used:
Invoked only if Steps 1 and 2 combined find fewer than 25% of expected bonds (heuristic: total_bonds < len(atoms) / 4). Catches all remaining bonds including critical inter-residue bonds (peptide bonds, disulfides).
Algorithm:
Create spatial grid with cell size = ParametersDefault.MAX_BOND_DISTANCE
Assign each atom to a grid cell based on coordinates:
    grid_x = int(atom.x / grid_size)
    grid_y = int(atom.y / grid_size)
    grid_z = int(atom.z / grid_size)
For each grid cell and its 26 neighbors (3x3x3 cube):
    Iterate through atom pairs between current cell and neighbor cell
    Skip already-processed pairs using processed_pairs set
    Call _check_bond_between_atoms() to evaluate bonding
The _check_bond_between_atoms method:
    Calculates distance between atoms
    Fast-rejects if distance > ParametersDefault.MAX_BOND_DISTANCE
    Calls _are_atoms_bonded_with_distance() for bonding criteria
    Creates Bond with detection_method=BondDetectionMethods.DISTANCE_BASED
    Appends to self.bonds (no duplicate check needed due to processed_pairs)
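The grid search above can be sketched as follows. This is an illustration rather than HBAT's code: it works on bare coordinate tuples, reports pairs closer than the cell size instead of applying the full bonding criteria, and uses math.floor for cell indices where the text uses int().

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Set, Tuple

Coord = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def close_pairs(coords: List[Coord], cell_size: float) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose distance is at most cell_size."""
    # Bucket atom indices by grid cell.
    grid: Dict[Cell, List[int]] = defaultdict(list)
    for index, (x, y, z) in enumerate(coords):
        cell = (math.floor(x / cell_size),
                math.floor(y / cell_size),
                math.floor(z / cell_size))
        grid[cell].append(index)

    processed: Set[Tuple[int, int]] = set()
    close: Set[Tuple[int, int]] = set()
    for (cx, cy, cz), members in grid.items():
        # The cell itself plus its 26 neighbours (3x3x3 cube).
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = grid.get((cx + dx, cy + dy, cz + dz))
            if not neighbours:
                continue
            for i in members:
                for j in neighbours:
                    if i == j:
                        continue
                    key = (min(i, j), max(i, j))
                    if key in processed:
                        continue  # pair already evaluated from the other cell
                    processed.add(key)
                    if math.dist(coords[i], coords[j]) <= cell_size:
                        close.add(key)
    return close


coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (-1.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
grid_pairs = close_pairs(coords, 2.0)
```

Because the cell size equals the maximum distance of interest, any two atoms within that distance must lie in the same or adjacent cells, so distant atoms such as the fourth point are never compared.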
Performance:
Complexity: O(n) average case (depends on atom density)
Grid optimization: Only checks atoms in neighboring cells
Worst case: O(n²) if all atoms in same grid cell (highly unlikely)
Memory: O(n) for grid structure and processed pairs set
Bonding Criteria:
Same as Step 2, uses Van der Waals radii with covalent cutoff factor:
Bond exists if: ParametersDefault.MIN_BOND_DISTANCE ≤ distance ≤ vdw_cutoff
Where:
    vdw_cutoff = (vdw₁ + vdw₂) × ParametersDefault.COVALENT_CUTOFF_FACTOR
    vdw₁, vdw₂: Van der Waals radii from AtomicData.VDW_RADII
    Minimum distance: ParametersDefault.MIN_BOND_DISTANCE
    Maximum distance: ParametersDefault.MAX_BOND_DISTANCE
- Returns:
None (bonds appended to self.bonds list)
- Return type:
None
See also
_check_bond_between_atoms() - Individual bond evaluation
_are_atoms_bonded_with_distance() - Bonding criteria
- PDBParser._build_bond_adjacency_map() → None
Build fast bond lookup adjacency map for efficient neighbor queries.
This is the final step in bond detection, called after all bonds have been identified. Creates a bidirectional adjacency list mapping each atom to its bonded neighbors, enabling O(1) lookup of bonded atoms.
The adjacency map is used throughout HBAT for:
Finding which hydrogen is bonded to which donor
Traversing connected atoms in rings
Following peptide chain connectivity
Quickly checking if atoms share bonds
Algorithm:
Clear existing self._bond_adjacency dictionary
For each bond in self.bonds:
    Initialize empty lists for both atoms if not present
    Add atom2_serial to atom1_serial’s neighbor list
    Add atom1_serial to atom2_serial’s neighbor list (bidirectional)
Result: _bond_adjacency[atom_serial] returns list of bonded atom serials
Data Structure:
_bond_adjacency: Dict[int, Set[int]] = {
    atom_serial_1: {bonded_atom_1, bonded_atom_2, ...},
    atom_serial_2: {bonded_atom_3, bonded_atom_4, ...},
    ...
}
Performance:
Build time: O(b) where b = number of bonds
Lookup time: O(1) per atom
Memory: O(b) storage
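A bidirectional adjacency map of this shape can be built in a single pass over the bonds. The sketch below assumes bonds are given as plain (serial, serial) tuples and stores neighbours in sets, following the Dict[int, Set[int]] annotation above; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_adjacency(bonds: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Map each atom serial to the serials of the atoms it is bonded to."""
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for serial1, serial2 in bonds:
        adjacency[serial1].add(serial2)
        adjacency[serial2].add(serial1)  # store the reverse direction too
    return dict(adjacency)


adjacency = build_adjacency([(1, 2), (2, 3), (2, 4)])
neighbours_of_2 = adjacency[2]
```

With the map in place, finding the heavy atom attached to a hydrogen is a single dictionary lookup rather than a scan over all bonds.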
- Returns:
None (builds self._bond_adjacency dictionary)
- Return type:
None