User Guide#
This comprehensive guide covers all major features of MolR for molecular structure analysis.
Structure Representation#
Understanding the Structure Class#
MolR uses a Structure of Arrays (SoA) design for efficient memory usage and vectorized operations:
import molr
import numpy as np
structure = molr.Structure.from_pdb("protein.pdb")
# All atom properties are NumPy arrays
print(f"Coordinates shape: {structure.coord.shape}")
print(f"Atom names: {structure.atom_name}")
print(f"Elements: {structure.element}")
print(f"Residue IDs: {structure.res_id}")
Key Properties#
The Structure class provides access to standard molecular properties:
# Atomic properties
structure.atom_name # Atom names (N, CA, C, O, etc.)
structure.element # Chemical elements (C, N, O, etc.)
structure.coord # 3D coordinates (n_atoms, 3)
structure.charge # Partial charges
structure.occupancy # Crystallographic occupancy
structure.b_factor # Temperature factors
# Residue properties
structure.res_name # Residue names (ALA, GLY, etc.)
structure.res_id # Residue numbers
structure.chain_id # Chain identifiers
structure.insertion_code # PDB insertion codes
# Structural hierarchy
structure.n_atoms # Number of atoms
structure.get_center() # Geometric center
File I/O Operations#
Loading Structures#
MolR supports multiple file formats and loading methods:
# From PDB file
structure = molr.Structure.from_pdb("protein.pdb")
# From mmCIF file
structure = molr.Structure.from_mmcif("structure.cif")
# From string data
pdb_data = open("protein.pdb").read()
structure = molr.Structure.from_pdb_string(pdb_data)
# Multi-model files (trajectories)
ensemble = molr.StructureEnsemble.from_pdb("trajectory.pdb")
Handling Multi-Model Files#
MolR automatically detects whether a file contains single or multiple models:
# Single model returns Structure
single_model = molr.Structure.from_pdb("1crn.pdb")
# Multi-model returns StructureEnsemble
ensemble = molr.StructureEnsemble.from_pdb("multi_model.pdb")
# Access individual models
first_model = ensemble[0] # Returns Structure
last_model = ensemble[-1]
# Iterate over models
for i, model in enumerate(ensemble):
print(f"Model {i}: {model.n_atoms} atoms")
Bond Detection System#
Hierarchical Bond Detection#
MolR uses a sophisticated hierarchical system with multiple providers:
from molr.bond_detection import DefaultBondDetector
# Default detector
detector = DefaultBondDetector()
# Detect bonds
bonds = detector.detect_bonds(structure)
print(f"Detected {len(bonds)} bonds")
Bond Provider Details#
The DefaultBondDetector combines multiple detection methods:
File-based bonds: From PDB CONECT records and mmCIF chemical bonds
Template-based: Standard residue topologies (amino acids, nucleotides)
Distance-based: Fallback using Van der Waals radii
# Access different bond sources
file_bonds = structure.file_bonds # From original file
# Complete bond set
all_bonds = structure.bonds
Working with Bonds#
The BondList class provides efficient bond storage and analysis:
bonds = structure.detect_bonds()
# Basic bond information
print(f"Number of bonds: {len(bonds)}")
# Iterate over bonds
for i in range(len(bonds)):
atom1_idx, atom2_idx = bonds.get_bond(i)
atom1_name = structure.atom_name[atom1_idx]
atom2_name = structure.atom_name[atom2_idx]
print(f"Bond {i}: {atom1_name}-{atom2_name}")
# Get neighbors of an atom
neighbors = bonds.get_neighbors(atom_idx=100)
# Bond connectivity matrix
connectivity = bonds.to_connectivity_matrix(structure.n_atoms)
Selection Language#
Basic Selections#
MolR provides a powerful selection language inspired by MDAnalysis and VMD:
# Select by atom name
ca_atoms = structure.select("name CA")
backbone = structure.select("backbone")
# Select by residue
his_residues = structure.select("resname HIS")
protein = structure.select("protein")
# Select by element
carbons = structure.select("element C")
nitrogens = structure.select("element N")
# Select by chain
chain_a = structure.select("chain A")
Boolean Operations#
Combine selections with logical operators:
# AND operation
his_ca = structure.select("resname HIS and name CA")
# OR operation
aromatics = structure.select("resname HIS or resname PHE or resname TYR")
# NOT operation
non_hydrogen = structure.select("not element H")
# Complex combinations
active_site = structure.select("(resname HIS TYR CYS) and (chain A) and backbone")
Spatial Selections#
Select atoms based on spatial relationships:
# Within distance of selection
near_ligand = structure.select("within 5.0 of (resname LIG)")
# Around a point
center_atoms = structure.select("around 8.0 of (10.0, 15.0, 20.0)")
# Center of geometry based
ligand = structure.select("resname LIG")
binding_site = structure.select("protein and within 6.0 of cog (resname LIG)")
Residue-Based Selections#
Use the byres
modifier to select entire residues:
# Select entire residues containing CA atoms within 5Å of ligand
binding_residues = structure.select("byres (name CA and within 5.0 of (resname LIG))")
# Select residues with any atom near the binding site
contact_residues = structure.select("byres (protein and within 4.0 of (resname LIG))")
Spatial Indexing and Queries#
KDTree Integration#
MolR uses scipy’s KDTree for efficient spatial queries:
# Built-in KDTree is automatically created and cached
atom_idx = 100
radius = 5.0
# Find neighbors within radius (O(log n) complexity)
neighbors = structure.get_neighbors_within(atom_idx, radius)
print(f"Found {len(neighbors)} neighbors")
Advanced Spatial Queries#
# Atoms within sphere around point
center = [10.0, 15.0, 20.0]
sphere_atoms = structure.get_atoms_within_sphere(center, radius=8.0)
# Atoms within sphere around center of geometry
ligand_mask = structure.select("resname LIG")
nearby_atoms = structure.get_atoms_within_cog_sphere(ligand_mask, radius=10.0)
# Contact analysis between selections
protein_mask = structure.select("protein")
ligand_mask = structure.select("resname LIG")
contacts = structure.get_atoms_between_selections(
protein_mask, ligand_mask, max_distance=4.0
)
Structure Manipulation#
Creating Substructures#
Extract parts of structures based on selections:
# Create new structure from selection
protein_mask = structure.select("protein")
protein_only = structure[protein_mask]
# Chain extraction
chain_a_mask = structure.chain_id == "A"
chain_a = structure[chain_a_mask]
# Combine multiple criteria
ca_atoms_mask = structure.select("name CA")
ca_structure = structure[ca_atoms_mask]
Coordinate Transformations#
# Translation
structure.translate([10.0, 0.0, 0.0])
# Center at origin
structure.center_at_origin()
# Custom transformations
import numpy as np
# Rotate around z-axis
angle = np.pi / 4
rotation_matrix = np.array([
[np.cos(angle), -np.sin(angle), 0],
[np.sin(angle), np.cos(angle), 0],
[0, 0, 1]
])
structure.coord = structure.coord @ rotation_matrix.T
Adding Annotations#
Extend structures with custom properties:
# Add custom annotation
structure.add_annotation("hydrophobicity", dtype=np.float32, default_value=0.0)
# Set values
hydrophobic_mask = structure.select("resname ALA VAL LEU ILE PHE TRP")
structure.hydrophobicity[hydrophobic_mask] = 1.0
# Use in selections
hydrophobic_atoms = structure.select("hydrophobicity > 0.5")
Performance Optimization#
Memory Management#
MolR uses lazy initialization for optional annotations:
# Annotations are loaded only when accessed
structure = molr.Structure.from_pdb("large_protein.pdb")
# These properties are always available (core data)
coords = structure.coord # Always loaded
names = structure.atom_name # Always loaded
# These are loaded on first access (optional data)
charges = structure.charge # Loaded when first accessed
bfactors = structure.b_factor # Loaded when first accessed
Efficient Selections#
# Use numpy operations for simple selections
ca_mask = structure.atom_name == "CA" # Fast numpy comparison
# Selection language for complex queries
complex_mask = structure.select("resname HIS and within 5.0 of (resname LIG)")
# Cache selections for reuse
protein_mask = structure.select("protein")
# Reuse protein_mask multiple times instead of re-selecting
Working with Large Structures#
# For very large structures (>100k atoms), consider subsetting first
structure = molr.Structure.from_pdb("large_complex.pdb")
# Extract region of interest first
roi_mask = structure.select("chain A and within 20.0 of (resname LIG)")
roi_structure = structure[roi_mask]
# Then perform expensive operations on smaller subset
bonds = roi_structure.detect_bonds()
neighbors = roi_structure.get_neighbors_within(atom_idx=10, radius=5.0)
Error Handling and Validation#
Common Issues and Solutions#
try:
structure = molr.Structure.from_pdb("protein.pdb")
except FileNotFoundError:
print("PDB file not found")
except ValueError as e:
print(f"Invalid PDB format: {e}")
# Validate structure
if structure.n_atoms == 0:
print("Warning: No atoms loaded")
# Check for missing coordinates
if np.any(np.isnan(structure.coord)):
print("Warning: NaN coordinates detected")
Best Practices#
Use appropriate file formats: PDB for simple structures, mmCIF for large complexes
Cache expensive operations: Store bond detection and selection results
Subset large structures: Work with regions of interest when possible
Validate input data: Check for missing atoms, coordinates, or bonds
Use spatial indexing: Leverage built-in KDTree for neighbor searches
Combine selections efficiently: Use boolean operations instead of multiple passes