PDB File Parser

PDB file parser for molecular structure analysis using pdbreader.

This module provides functionality to parse PDB (Protein Data Bank) files and extract atomic coordinates and molecular information using the pdbreader library.

class hbat.core.pdb_parser.PDBParser

Bases: object

Parser for PDB format files using pdbreader.

This class handles parsing of PDB (Protein Data Bank) format files and converts them into HBAT’s internal atom and residue representations. Uses the pdbreader library for robust PDB format handling.

__init__() → None

Initialize PDB parser.

Creates a new parser instance with empty atom and residue lists.

parse_file(filename: str) → bool

Parse a PDB file.

Reads and parses a PDB format file, extracting all ATOM and HETATM records and converting them to HBAT’s internal representation.

Parameters: filename (str) – Path to the PDB file to parse
Returns: True if parsing completed successfully, False otherwise
Return type: bool
Raises: IOError – If the file cannot be read

parse_lines(lines: List[str]) → bool

Parse PDB format lines.

Parses PDB format content provided as a list of strings, which is useful for processing in-memory PDB data.

Parameters: lines (List[str]) – List of PDB format lines
Returns: True if parsing completed successfully, False otherwise
Return type: bool

get_atoms_by_element(element: str) → List[Atom]

Get all atoms of a specific element.

Parameters: element (str) – Element symbol (e.g., ‘C’, ‘N’, ‘O’)
Returns: List of atoms matching the element
Return type: List[Atom]

get_atoms_by_residue(res_name: str) → List[Atom]

Get all atoms from residues with a specific name.

Parameters: res_name (str) – Residue name (e.g., ‘ALA’, ‘GLY’)
Returns: List of atoms from matching residues
Return type: List[Atom]

get_hydrogen_atoms() → List[Atom]

Get all hydrogen atoms, including deuterium.

Returns: List of all hydrogen and deuterium atoms
Return type: List[Atom]

has_hydrogens() → bool

Check whether the structure contains explicit hydrogen atoms.

Determines whether the structure has a reasonable number of hydrogen atoms compared to heavy atoms, which indicates explicit hydrogen modeling.

Returns: True if the structure appears to contain explicit hydrogens
Return type: bool
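The documentation does not state the threshold behind this check, so the following is only a sketch of how such a heuristic could work. The function name `looks_protonated` and the `min_ratio` value are hypothetical and are not taken from HBAT.

```python
from typing import Iterable


def looks_protonated(elements: Iterable[str], min_ratio: float = 0.5) -> bool:
    """Guess whether hydrogens were modelled explicitly.

    `min_ratio` (hydrogens per heavy atom) is a made-up threshold chosen
    for illustration; it is not HBAT's value.
    """
    symbols = [e.strip().upper() for e in elements]
    hydrogens = sum(1 for e in symbols if e in ("H", "D"))
    heavy = len(symbols) - hydrogens
    if heavy == 0:
        return hydrogens > 0
    return hydrogens >= min_ratio * heavy


# Heavy atoms only, as in many crystal structures.
xray_like = looks_protonated(["N", "C", "C", "O"])
# The same backbone fragment with three hydrogens attached.
nmr_like = looks_protonated(["N", "H", "C", "H", "H", "C", "O"])
```

A ratio test of this kind tolerates a few stray hydrogens (for example on a ligand) without treating the whole structure as protonated.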

get_residue_list() → List[Residue]

Get list of all residues.

Returns: List of all residues in the structure
Return type: List[Residue]

get_chain_ids() → List[str]

Get list of unique chain IDs.

Returns: List of unique chain identifiers in the structure
Return type: List[str]

get_statistics() → Dict[str, Any]

Get basic statistics about the structure.

Provides counts of atoms, residues, chains, and element composition.

Returns: Dictionary containing structure statistics
Return type: Dict[str, Any]

get_bonds() → List[Bond]

Get list of all bonds.

Returns: List of all bonds in the structure
Return type: List[Bond]

get_bonds_for_atom(serial: int) → List[Bond]

Get all bonds involving a specific atom.

Parameters: serial (int) – Atom serial number
Returns: List of bonds involving this atom
Return type: List[Bond]

get_bonded_atoms(serial: int) → List[int]

Get serial numbers of atoms bonded to the specified atom.

Parameters: serial (int) – Atom serial number
Returns: List of bonded atom serial numbers
Return type: List[int]

get_bond_detection_statistics() → Dict[str, int]

Get statistics about bond detection methods used.

Returns: Dictionary with counts of bonds detected by each method
Return type: Dict[str, int]

Bond Detection Methods

CONECT Record Parsing

PDBParser._parse_conect_records(conect_data: Any) → None

Parse CONECT records to extract explicit bond information from PDB files.

CONECT records provide explicit bond connectivity information in PDB files. This method is called first during parsing, before the three-step bond detection, giving CONECT records effective priority in bond assignment.

Format:

CONECT records specify which atoms are bonded to each other using atom serial numbers. The pdbreader library provides this data as a DataFrame with:

parent: The serial number of the central atom
bonds: List of serial numbers of atoms bonded to the parent

Algorithm:

Iterate through each CONECT record row
Extract parent atom serial number from parent field
Extract list of bonded atom serial numbers from bonds field
For each bonded atom in the list:
1. Verify both parent and bonded atom exist in _atom_serial_map
2. Retrieve Atom objects for both atoms
3. Calculate geometric distance between atoms
4. Create Bond object with:
  - bond_type="explicit" (marks as explicitly defined in PDB)
  - detection_method=BondDetectionMethods.CONECT_RECORDS
  - Calculated distance value
5. Check _bond_exists() to avoid duplicates
6. Append bond to self.bonds list
Handle any parsing errors gracefully with error messages

Priority and Relationship:

CONECT records are processed before _detect_bonds_three_step()
The three-step method respects existing CONECT bonds via _bond_exists() check
This prevents duplicate bond creation
Three-step method fills in missing bonds not in CONECT records

Common Usage:

Most PDB files do not include complete CONECT records. CONECT is typically used for:

Heteroatoms (ligands, cofactors, metals)
Non-standard residues
Disulfide bonds (CYS-CYS bridges)
Modified nucleotides/amino acids

Standard protein backbone and sidechain bonds are usually not in CONECT, requiring the three-step detection method.

Parameters: conect_data (Any) – CONECT records DataFrame from pdbreader with ‘parent’ and ‘bonds’ columns
Returns: None (bonds stored in self.bonds list)
Return type: None

Note

Bonds created from CONECT records are tagged with bond_type="explicit" to distinguish them from algorithmically detected bonds.
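The CONECT algorithm described above can be sketched in a few lines. This is a self-contained illustration, not HBAT's implementation: the `Atom` and `Bond` dataclasses, the `bonds_from_conect` function, the `"conect_records"` string, and the use of plain dictionaries in place of pdbreader's DataFrame rows are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Atom:  # hypothetical stand-in for HBAT's Atom
    serial: int
    x: float
    y: float
    z: float


@dataclass(frozen=True)
class Bond:  # hypothetical stand-in for HBAT's Bond
    atom1_serial: int
    atom2_serial: int
    bond_type: str
    detection_method: str
    distance: float


def bonds_from_conect(rows: Iterable[dict], atoms: Dict[int, Atom]) -> List[Bond]:
    """Turn CONECT-like rows ({'parent': int, 'bonds': [int, ...]}) into bonds."""
    bonds: List[Bond] = []
    seen: Set[Tuple[int, int]] = set()  # plays the role of _bond_exists()
    for row in rows:
        parent = row["parent"]
        for partner in row["bonds"]:
            # Skip serials that do not resolve to a parsed atom.
            if parent not in atoms or partner not in atoms:
                continue
            key = (min(parent, partner), max(parent, partner))
            if key in seen:
                continue  # the same bond is often listed from both ends
            seen.add(key)
            a, b = atoms[parent], atoms[partner]
            distance = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
            bonds.append(Bond(key[0], key[1], "explicit", "conect_records", distance))
    return bonds


atoms = {1: Atom(1, 0.0, 0.0, 0.0), 2: Atom(2, 1.5, 0.0, 0.0)}
# Serial 99 does not exist, and the 1-2 bond appears in both rows.
rows = [{"parent": 1, "bonds": [2, 99]}, {"parent": 2, "bonds": [1]}]
conect_bonds = bonds_from_conect(rows, atoms)
```

Normalising each pair to (lower serial, higher serial) is one simple way to make the duplicate check independent of the direction in which a bond is listed.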

PDBParser._detect_bonds_three_step() → None

Detect covalent bonds using three-step hierarchical approach.

This method combines high-accuracy residue-based lookup with distance-based fallbacks. Step 1 always runs; Steps 2 and 3 run only when the preceding steps have found too few bonds.

Method Overview:

The bond detection follows a hierarchical strategy optimized for both accuracy and performance:

Residue Lookup (CCD-based): Uses Chemical Component Dictionary (CCD) data to identify bonds based on known residue topology
Intra-residue Distance: Distance-based detection limited to atoms within the same residue for improved performance
Spatial Grid: Full distance-based detection using spatial grid partitioning for O(n) average-case complexity instead of O(n²)

Algorithm:

Skip if structure has fewer than 2 atoms
Step 1: Call _detect_bonds_from_residue_lookup()
Step 2: If residue bonds < 25% of atom count, call _detect_bonds_within_residues()
Step 3: If total bonds < 25% of atom count, call _detect_bonds_with_spatial_grid()
Build bond adjacency map for fast lookups via _build_bond_adjacency_map()

Returns: None (bonds stored in self.bonds list)
Return type: None

See also

_detect_bonds_from_residue_lookup() - Step 1: CCD-based detection
_detect_bonds_within_residues() - Step 2: Intra-residue distance detection
_detect_bonds_with_spatial_grid() - Step 3: Full spatial grid detection
_build_bond_adjacency_map() - Builds adjacency map for fast lookups

Note

Steps 1 and 2 avoid creating duplicate bonds by checking _bond_exists() before appending to the bonds list. Step 3 instead tracks the atom pairs it has already evaluated in its processed_pairs set.
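The fallback logic of the three steps can be illustrated with a small driver function. This is a sketch under stated assumptions: the three detection steps are passed in as hypothetical callables that each return the number of bonds they added, whereas the documented methods store bonds on the parser itself.

```python
from typing import Callable, List


def detect_bonds_three_step(
    atom_count: int,
    residue_lookup: Callable[[], int],
    within_residues: Callable[[], int],
    spatial_grid: Callable[[], int],
) -> List[str]:
    """Run the fallbacks in order and report which steps ran."""
    steps_run: List[str] = []
    if atom_count < 2:
        return steps_run  # nothing to bond

    threshold = atom_count / 4  # the 25% heuristic
    total = residue_lookup()
    steps_run.append("residue_lookup")

    if total < threshold:
        total += within_residues()
        steps_run.append("within_residues")

    if total < threshold:
        total += spatial_grid()
        steps_run.append("spatial_grid")

    return steps_run


# A standard protein: the lookup alone finds plenty of bonds.
well_covered = detect_bonds_three_step(100, lambda: 95, lambda: 0, lambda: 0)
# Unknown residues: the lookup finds nothing, so both fallbacks run.
no_ccd_data = detect_bonds_three_step(100, lambda: 0, lambda: 10, lambda: 80)
```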

PDBParser._detect_bonds_from_residue_lookup() → int

Detect bonds using residue bond information from Chemical Component Dictionary (CCD).

This is Step 1 of the three-step bond detection hierarchy. Uses known residue topology from the CCD database to identify bonds based on atom names, providing the highest accuracy for standard residues.

Algorithm:

Iterate through all residues in the structure
Query CCD database via get_residue_bonds(residue.name) for bond topology
Skip residue if no CCD data is available (modified/non-standard residues)
Create atom name → Atom object mapping for the residue
For each bond in CCD data:
1. Extract atom names (atom1, atom2) from bond info
2. Check if both atoms exist in the current residue instance
3. Calculate geometric distance between atoms
4. Create Bond with detection_method=BondDetectionMethods.RESIDUE_LOOKUP
5. Check _bond_exists() to avoid duplicates
6. Append to self.bonds and increment counter
Return total count of bonds found

Limitations:

Only works for residues with CCD data
Modified residues may not have CCD entries
Non-standard ligands require fallback to distance-based methods

Returns: Number of bonds successfully detected using CCD lookup
Return type: int

See also

hbat.constants.get_residue_bonds() - Retrieves bond data from CCD
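The name-based lookup can be illustrated as follows. The `RESIDUE_BONDS` table is a hypothetical stand-in for the CCD data returned by get_residue_bonds() and lists only a few heavy-atom bonds of glycine; the function name and the tuple-based bond representation are likewise assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the CCD lookup; real CCD entries hold more bonds.
RESIDUE_BONDS: Dict[str, List[Tuple[str, str]]] = {
    "GLY": [("N", "CA"), ("CA", "C"), ("C", "O"), ("C", "OXT")],
}


def bonds_from_topology(
    res_name: str, atom_serials: Dict[str, int]
) -> List[Tuple[int, int]]:
    """Map template bonds (by atom name) onto the atoms present in one residue."""
    template = RESIDUE_BONDS.get(res_name)
    if template is None:
        return []  # no topology known: leave it to the distance-based steps
    found = []
    for name1, name2 in template:
        # A template bond is only usable if both atoms were actually modelled.
        if name1 in atom_serials and name2 in atom_serials:
            found.append((atom_serials[name1], atom_serials[name2]))
    return found


glycine = {"N": 1, "CA": 2, "C": 3, "O": 4}  # no OXT in this residue
gly_bonds = bonds_from_topology("GLY", glycine)
unknown_bonds = bonds_from_topology("XYZ", {"C1": 10, "O1": 11})
```

Because the match is made on atom names rather than coordinates, this step is insensitive to distorted geometry but finds nothing for residues without a template.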

PDBParser._detect_bonds_within_residues() → None

Detect bonds within individual residues using distance-based approach.

This is Step 2 of the three-step bond detection hierarchy. Performs distance-based bond detection limited to atoms within the same residue, providing a performance-optimized fallback when CCD data is insufficient.

When Used:

Invoked only if Step 1 (CCD lookup) finds fewer than 25% of expected bonds (heuristic: residue_bonds_found < len(atoms) / 4).

Algorithm:

Iterate through all residues in the structure
Skip residues with fewer than 2 atoms
For each residue, check all atom pairs within that residue (O(m²) where m = residue size):
1. Calculate distance between atoms i and j (where j > i)
2. Skip if distance > ParametersDefault.MAX_BOND_DISTANCE (fast rejection)
3. Call _are_atoms_bonded_with_distance() to check bonding criteria
4. If bonded, create Bond with detection_method=BondDetectionMethods.DISTANCE_BASED
5. Check _bond_exists() to avoid duplicates
6. Append to self.bonds

Bonding Criteria:

Uses Van der Waals radii with covalent cutoff factor:

Bond exists if: ParametersDefault.MIN_BOND_DISTANCE ≤ distance ≤ vdw_cutoff
Where: vdw_cutoff = (vdw₁ + vdw₂) × ParametersDefault.COVALENT_CUTOFF_FACTOR
vdw₁, vdw₂: Van der Waals radii from AtomicData.VDW_RADII
Minimum distance: ParametersDefault.MIN_BOND_DISTANCE
Maximum distance: ParametersDefault.MAX_BOND_DISTANCE

Limitations:

Does not detect inter-residue bonds (e.g., peptide bonds, disulfides)
Requires Step 3 (spatial grid) to find cross-residue connectivity

Returns: None (bonds appended to self.bonds list)
Return type: None

See also

_are_atoms_bonded_with_distance() - Bonding criteria implementation
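The bonding criterion and the pairwise loop within a residue can be sketched as below. All numeric constants are illustrative assumptions (the documentation names the parameters but does not give their values), and atoms are represented as simple (element, coordinates) tuples rather than HBAT's Atom objects.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

# Illustrative values only; HBAT's own constants may differ.
VDW_RADII: Dict[str, float] = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}
COVALENT_CUTOFF_FACTOR = 0.6
MIN_BOND_DISTANCE = 0.5
MAX_BOND_DISTANCE = 2.5

AtomRecord = Tuple[str, Tuple[float, float, float]]  # (element, coordinates)


def are_bonded(elem1: str, elem2: str, distance: float) -> bool:
    """Apply MIN_BOND_DISTANCE <= distance <= vdw_cutoff."""
    vdw_cutoff = (VDW_RADII[elem1] + VDW_RADII[elem2]) * COVALENT_CUTOFF_FACTOR
    return MIN_BOND_DISTANCE <= distance <= min(vdw_cutoff, MAX_BOND_DISTANCE)


def bonds_in_residue(atoms: List[AtomRecord]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, of atoms judged to be bonded."""
    pairs = []
    for (i, (e1, p1)), (j, (e2, p2)) in combinations(enumerate(atoms), 2):
        distance = math.dist(p1, p2)
        if distance > MAX_BOND_DISTANCE:
            continue  # fast rejection before the element-specific test
        if are_bonded(e1, e2, distance):
            pairs.append((i, j))
    return pairs


# Water-like geometry: both O-H distances are about 0.96, H...H about 1.52.
water = [("O", (0.0, 0.0, 0.0)), ("H", (0.96, 0.0, 0.0)), ("H", (-0.24, 0.93, 0.0))]
water_bonds = bonds_in_residue(water)
```

With these assumed values the two O-H pairs fall under the cutoff of (1.52 + 1.20) × 0.6 ≈ 1.63, while the H...H pair exceeds its cutoff of 1.44 and is rejected.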

PDBParser._detect_bonds_with_spatial_grid() → None

Optimized bond detection using spatial grid partitioning for full structure.

This is Step 3 of the three-step bond detection hierarchy. Performs comprehensive distance-based bond detection across the entire structure, using spatial grid partitioning to achieve O(n) average-case complexity instead of the O(n²) of an all-pairs comparison.

When Used:

Invoked only if Steps 1 and 2 combined find fewer than 25% of expected bonds (heuristic: total_bonds < len(atoms) / 4). Catches all remaining bonds including critical inter-residue bonds (peptide bonds, disulfides).

Algorithm:

Create spatial grid with cell size = ParametersDefault.MAX_BOND_DISTANCE
Assign each atom to a grid cell based on coordinates:
- grid_x = int(atom.x / grid_size)
- grid_y = int(atom.y / grid_size)
- grid_z = int(atom.z / grid_size)
For each grid cell and its 26 neighbors (3x3x3 cube):
1. Iterate through atom pairs between current cell and neighbor cell
2. Skip already-processed pairs using processed_pairs set
3. Call _check_bond_between_atoms() to evaluate bonding
The _check_bond_between_atoms method:
1. Calculates distance between atoms
2. Fast-rejects if distance > ParametersDefault.MAX_BOND_DISTANCE
3. Calls _are_atoms_bonded_with_distance() for bonding criteria
4. Creates Bond with detection_method=BondDetectionMethods.DISTANCE_BASED
5. Appends to self.bonds (no duplicate check needed due to processed_pairs)

Performance:

Complexity: O(n) average case (depends on atom density)
Grid optimization: Only checks atoms in neighboring cells
Worst case: O(n²) if all atoms in same grid cell (highly unlikely)
Memory: O(n) for grid structure and processed pairs set

Bonding Criteria:

Same as Step 2: uses Van der Waals radii with a covalent cutoff factor:

Bond exists if: ParametersDefault.MIN_BOND_DISTANCE ≤ distance ≤ vdw_cutoff
Where: vdw_cutoff = (vdw₁ + vdw₂) × ParametersDefault.COVALENT_CUTOFF_FACTOR
vdw₁, vdw₂: Van der Waals radii from AtomicData.VDW_RADII
Minimum distance: ParametersDefault.MIN_BOND_DISTANCE
Maximum distance: ParametersDefault.MAX_BOND_DISTANCE

Returns: None (bonds appended to self.bonds list)
Return type: None

See also

_check_bond_between_atoms() - Individual bond evaluation
_are_atoms_bonded_with_distance() - Bonding criteria
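The grid partitioning can be demonstrated with a generic neighbour search. This sketch is not HBAT's code: `close_pairs` is a hypothetical function that works on bare coordinate tuples and applies a single distance cutoff in place of the element-specific bonding criteria.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def close_pairs(points: List[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose separation is at most `cutoff`."""
    # Bin every point into a cubic cell whose edge equals the cutoff.
    grid: Dict[Cell, List[int]] = defaultdict(list)
    for index, (x, y, z) in enumerate(points):
        grid[(int(x / cutoff), int(y / cutoff), int(z / cutoff))].append(index)

    pairs: Set[Tuple[int, int]] = set()
    for (cx, cy, cz), members in grid.items():
        # Visit the cell itself plus its 26 neighbours (a 3x3x3 block).
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = grid.get((cx + dx, cy + dy, cz + dz), [])
            for i in members:
                for j in neighbours:
                    # Requiring i < j evaluates each unordered pair once,
                    # standing in for a processed-pairs set.
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs


points = [
    (1.0, 0.0, 0.0),
    (3.0, 0.0, 0.0),    # lands in the next cell along x
    (-0.9, 0.3, 0.0),
    (10.0, 10.0, 10.0),
    (11.2, 10.0, 10.0),
]
grid_pairs = close_pairs(points, cutoff=2.5)
```

Setting the cell edge equal to the cutoff guarantees that any two points within the cutoff sit in the same or adjacent cells, which is why checking the 3x3x3 block is sufficient.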

PDBParser._build_bond_adjacency_map() → None

Build fast bond lookup adjacency map for efficient neighbor queries.

This is the final step in bond detection, called after all bonds have been identified. Creates a bidirectional adjacency list mapping each atom to its bonded neighbors, enabling O(1) lookup of bonded atoms.

The adjacency map is used throughout HBAT for:

Finding which hydrogen is bonded to which donor
Traversing connected atoms in rings
Following peptide chain connectivity
Quickly checking if atoms share bonds

Algorithm:

Clear existing self._bond_adjacency dictionary
For each bond in self.bonds:
1. Initialize an empty neighbor collection for each atom if not present
2. Add atom2_serial to atom1_serial’s neighbors
3. Add atom1_serial to atom2_serial’s neighbors (bidirectional)
Result: _bond_adjacency[atom_serial] returns the bonded atom serials

Data Structure:

_bond_adjacency: Dict[int, Set[int]] = {
    atom_serial_1: {bonded_atom_1, bonded_atom_2, ...},
    atom_serial_2: {bonded_atom_3, bonded_atom_4, ...},
    ...
}

Performance:

Build time: O(b) where b = number of bonds
Lookup time: O(1) per atom
Memory: O(b) storage

Returns: None (builds self._bond_adjacency dictionary)
Return type: None
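A minimal sketch of building such a map follows, using the Set-based layout shown under Data Structure. Bonds are represented here as plain (serial, serial) tuples and `build_adjacency` is a hypothetical name; neither is taken from HBAT.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_adjacency(bonds: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Map each atom serial to the set of serials it is bonded to."""
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for serial1, serial2 in bonds:
        adjacency[serial1].add(serial2)
        adjacency[serial2].add(serial1)  # store both directions
    return dict(adjacency)


# N(1)-CA(2), CA(2)-C(3), C(3)-O(4) from a backbone fragment
adjacency = build_adjacency([(1, 2), (2, 3), (3, 4)])
neighbours_of_ca = sorted(adjacency[2])
```

Storing each bond in both directions doubles the number of entries but lets a neighbor query start from either atom without scanning the bond list.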
