Pipeline Architecture

SPXQuery uses a flexible, resumable pipeline architecture that processes SPHEREx data through four distinct stages.

Four-Stage Pipeline

The pipeline executes in this order:

1. Query Stage

Query the IRSA SPHEREx archive using TAP (Table Access Protocol).

What it does:

Searches for observations matching your source coordinates (RA/Dec)
Filters by spectral bands (D1-D6)
Resolves datalink URLs for each observation
Saves query results and metadata

Output:

results/query_summary.yaml - Observation metadata, time span, data size

Key features:

Automatic coordinate matching within search radius
Band filtering (query specific bands or all)
URL caching to avoid repeated datalink queries

2. Download Stage

Download FITS files from IRSA with optional cutout support.

What it does:

Downloads spectral images via HTTP
Applies cutout parameters if specified (reduces file size by 90%)
Organizes files by band (data/D1/, data/D2/, etc.)
Tracks download progress with parallel workers

Output:

data/D*/ - FITS files organized by spectral band
Download progress logging

Key features:

Parallel downloads (configurable workers, default: 4)
Skip existing files to enable resume
Retry logic with exponential backoff
Progress tracking for large datasets

3. Processing Stage

Extract aperture photometry from FITS files.

What it does:

Parses Multi-Extension FITS (MEF) structure
Extracts flux using circular aperture photometry (fixed or FWHM-based sizing)
Estimates background using annulus or window method
Subtracts zodiacal background from ZODI extension
Handles pixel flags for quality assessment
Repairs variance for flagged pixels with valid flux
Computes flux uncertainties from variance maps

Output:

results/photometry.json - Per-observation photometry results
Photometry metadata (aperture size, background estimation method)

Key features:

Adaptive apertures: FWHM-based sizing with PSF extraction (optional)
Dual background methods: Annulus (traditional) or window (crowded fields)
Variance repair: Automatic handling of NaN variance for flagged pixels
Zodiacal light subtraction (from ZODI extension)
Pixel flag tracking (FLAGS extension)
Spectral WCS handling for wavelength extraction
Parallel processing (configurable workers, default: 10)

4. Visualization Stage

Generate publication-quality plots with quality control.

What it does:

Creates combined spectral and temporal plots
Applies quality filtering (SNR threshold, bad pixel flags)
Marks rejected measurements with visual indicators
Generates light curve CSV file

Output:

results/combined_plot.png - Multi-panel visualization
results/lightcurve.csv - Time-series photometry data

Key features:

Quality control: good measurements (filled circles) vs. rejected (gray crosses)
Customizable colormaps, marker sizes, and figure parameters
Respects user’s matplotlibrc settings
Optional magnitude vs. flux plotting

Pipeline Execution Modes

SPXQuery supports three execution modes:

One-Click Execution

Run all stages automatically:

from spxquery.core.pipeline import run_pipeline

run_pipeline(
    ra=304.69,
    dec=42.44,
    output_dir="output",
    cutout_size="200px"
)

Step-by-Step Execution

Run individual stages with dependency checking:

from spxquery import SPXQueryPipeline, Source, QueryConfig

source = Source(ra=304.69, dec=42.44, name="my_source")
config = QueryConfig(source=source, output_dir="output")
pipeline = SPXQueryPipeline(config)

# Run stages individually
pipeline.run_query()
pipeline.run_download(skip_existing=True)
pipeline.run_processing()
pipeline.run_visualization()

The pipeline automatically checks dependencies - you cannot run processing before completing download.

Resumable Execution

The pipeline saves state after each stage to {source_name}.yaml. Resume from interruptions:

# Load configuration from saved state
config = QueryConfig.from_saved_state(
    source_name="my_source",
    output_dir="output"
)

pipeline = SPXQueryPipeline(config)
pipeline.resume()  # Automatically runs remaining stages

What gets saved:

Completed stages
Query results (observations, time span, data size)
Downloaded file paths
Photometry results
All configuration parameters

Stage Dependencies

The pipeline enforces these dependencies:

query: No dependencies (always runs first)
download: Requires query
processing: Requires query + download
visualization: Requires query + download + processing

If you try to run a stage without its dependencies, the pipeline will raise an error.

Customizing Pipeline Stages

You can customize which stages to run:

# Only query and download, skip processing
pipeline = SPXQueryPipeline(config, pipeline_stages=["query", "download"])
pipeline.run_full_pipeline()

This is useful for:

Downloading data for later analysis
Re-running specific stages with different parameters
Integrating SPXQuery into custom workflows

State Persistence

State files ({source_name}.yaml) contain:

# Pipeline state
stage: complete
completed_stages:
  - query
  - download
  - processing
  - visualization
pipeline_stages:
  - query
  - download
  - processing
  - visualization

# Query results
query_results:
  observations: [...]
  time_span_days: 33.1
  total_size_gb: 0.51

# Downloaded files and photometry
downloaded_files: [...]
photometry_results: [...]

This enables:

Resume after interruptions (network failures, crashes)
Audit trail of completed work
Configuration recovery (auto-load parameters from saved state)

Error Handling

The pipeline handles common errors gracefully:

Network failures: Retry logic with exponential backoff (configurable)
Missing files: Skip and continue processing remaining files
Invalid FITS: Log error and skip observation
Photometry failures: Mark as bad and continue
Interrupted execution: Resume from last completed stage

Errors are logged to help diagnose issues without stopping the entire pipeline.