Concepts

Phaser was designed with only a few concepts to make managing data pipelines as simple as possible.

Pipelines

A Pipeline is the main organizing unit and defines the structure of the work to be done. It runs one or more Phases and handles I/O, marshalling source data and checkpoint data between them. A Pipeline will

  • load source data from files or a previous phase

  • save checkpoint data between phases

  • save outputs

  • marshal inputs and outputs between phases

  • report errors or warnings as summaries

  • capture and handle errors, according to the error policy

Errors and warnings are output to a file in the working directory by default.
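
The Pipeline responsibilities listed above can be sketched in plain Python. This is a minimal illustration of the idea and not Phaser's API: run_pipeline, strip_whitespace, drop_blank_names and the checkpoint file naming are all hypothetical names chosen for this example.

```python
import csv
import tempfile
from pathlib import Path


def run_pipeline(rows, phases, working_dir):
    """Run each phase in order, saving a checkpoint CSV after every phase.

    Hypothetical helper: it only illustrates the load/checkpoint/marshal loop.
    """
    working_dir = Path(working_dir)
    for index, phase in enumerate(phases):
        # The output of one phase becomes the input of the next.
        rows = phase(rows)
        checkpoint = working_dir / f"checkpoint_{index}_{phase.__name__}.csv"
        with checkpoint.open("w", newline="") as handle:
            fieldnames = list(rows[0].keys()) if rows else []
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return rows


def strip_whitespace(rows):
    # Stand-in "phase": trims every value in every row.
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def drop_blank_names(rows):
    # Stand-in "phase": removes rows whose name is empty.
    return [row for row in rows if row["name"]]


with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(
        [{"name": " Ada "}, {"name": "  "}],
        [strip_whitespace, drop_blank_names],
        tmp,
    )
    checkpoints = sorted(path.name for path in Path(tmp).iterdir())
```

Writing a checkpoint after each phase means a failed run can be inspected phase by phase, which is the motivation for treating checkpoint data as a Pipeline concern rather than something each phase handles itself.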

Phases

Each Phase runs one or more steps, each with its own data transformation or validation logic. The Phase itself handles routine work in a robust way:

  • transform column headers to preferred name/case

  • routine parsing and data typing
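
As a rough illustration of the header handling described above, the sketch below maps raw headers to preferred names while ignoring case, spacing and punctuation. It is plain Python written for this example; canonical_header and rename_headers are hypothetical names, and Phaser's own matching rules may differ.

```python
def _squash(text):
    # Compare headers on letters and digits only, case-insensitively.
    return "".join(ch for ch in text.lower() if ch.isalnum())


def canonical_header(header, preferred_names):
    """Return the preferred spelling of a header, or the stripped original."""
    key = _squash(header)
    for preferred in preferred_names:
        if _squash(preferred) == key:
            return preferred
    return header.strip()


def rename_headers(row, preferred_names):
    """Return a new row whose keys use the preferred header names."""
    return {canonical_header(key, preferred_names): value for key, value in row.items()}


row = rename_headers(
    {" Employee ID ": "42", "PAY_RATE": "17.50"},
    ["employee_id", "pay_rate"],
)
```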

Columns

A Phase can be configured with the data that it expects, defined as Columns.

When columns are passed to a Phase, their data formats and constraints are enforced at the beginning of the Phase. Many checks that would otherwise have to be written as step functions can therefore be declared instead.
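
To make the difference between declaring and programming a check concrete, here is a small sketch of a declarative column definition. It assumes nothing about Phaser's actual Column classes: ColumnSpec, enforce_columns and the min_value option are hypothetical and exist only in this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical declaration of what one column should contain."""

    name: str
    cast: Callable[[str], Any] = str
    required: bool = True
    min_value: Optional[float] = None

    def check(self, row):
        raw = row.get(self.name)
        if raw is None or raw == "":
            if self.required:
                raise ValueError(f"missing required column '{self.name}'")
            return None
        value = self.cast(raw)  # routine parsing and typing
        if self.min_value is not None and value < self.min_value:
            raise ValueError(f"'{self.name}' must be >= {self.min_value}, got {value}")
        return value


def enforce_columns(row, columns):
    """Apply every declared column to a row before any step sees it."""
    checked = dict(row)
    for column in columns:
        checked[column.name] = column.check(row)
    return checked


columns = [
    ColumnSpec("employee_id", cast=int),
    ColumnSpec("pay_rate", cast=float, min_value=0.0),
]
typed = enforce_columns({"employee_id": "42", "pay_rate": "17.50"}, columns)
```

With declarations like these, the type conversion and range check run once, up front, and no step needs to repeat them.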

Columns and features available so far:

Steps

Most of the work unique to your data pipeline project can be done in steps, which run inside a Phase to give the work structure and make it easier to debug:

  • Steps are meant to be written as pure functions so that each one can be tested individually, using simple Pythonic ways to pass in row data and verify results

  • Steps can drop rows with bad data

  • Steps can access context information

  • Steps can create warnings or errors

  • Pre-baked steps are available to check the uniqueness of values and do common transforms
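
The step behaviours listed above can be sketched as follows. This is an illustration in plain Python, not Phaser's step API: DropRow, parse_pay_rate, run_step and the dictionary used as context are hypothetical stand-ins. The step returns a new row instead of mutating its input, and the context is the only side channel it touches.

```python
class DropRow(Exception):
    """Raised by a step to signal that the current row should be discarded."""


def parse_pay_rate(row, context):
    """Example step: type the pay_rate field, dropping rows that lack it."""
    if not row.get("pay_rate"):
        raise DropRow("pay_rate is missing")
    rate = float(row["pay_rate"])
    if rate > context["max_expected_rate"]:
        # A warning is recorded, but the row is still kept.
        context["warnings"].append(f"unusually high pay_rate: {rate}")
    return {**row, "pay_rate": rate}


def run_step(step, rows, context):
    """Apply one step to every row, counting the rows it drops."""
    kept = []
    for row in rows:
        try:
            kept.append(step(row, context))
        except DropRow:
            context["dropped"] += 1
    return kept


context = {"max_expected_rate": 100.0, "warnings": [], "dropped": 0}
rows = [
    {"name": "Ada", "pay_rate": "17.50"},
    {"name": "Bob", "pay_rate": ""},
    {"name": "Cy", "pay_rate": "250"},
]
result = run_step(parse_pay_rate, rows, context)
```

Because the step is an ordinary function of a row and a context, a unit test can call it directly with a hand-written dictionary and compare the returned row, without constructing a Phase or a Pipeline.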

Checkpoint files

Comparison to other tools and libraries