Stonks Project Analysis

Introduction

The "Stonks" project serves as the primary, hands-on case study for the book "Building a Pragmatic Data Platform with dbt and Snowflake." It brings the Pragmatic Data Platform (PDP) architecture to life with a complete, working sample implementation. While simplified for educational purposes, it is realistic enough to demonstrate how the principles and patterns discussed throughout the book translate into a real-world data engineering project.

The project simulates a common scenario: analyzing personal investment data from a financial brokerage, in this case, Interactive Brokers (IB). It ingests raw data from CSV files containing trades, cash transactions, positions, and other financial activities. It then transforms this data through the PDP layers to ultimately produce a portfolio_analysis data mart, ready for consumption by BI tools or other data applications.

By following the project's development, the reader can see firsthand how to:

  • Structure a dbt project according to PDP principles.
  • Use the custom macros from the PDP dbt package to automate common, complex tasks.
  • Implement robust data storage patterns that ensure auditability and maintain a complete history.
  • Apply sophisticated business logic in a structured and maintainable way.
  • Build user-centric, reliable data products in the form of dimensional models.

PDP Layers Overview

The Stonks project is organized into the distinct layers of the Pragmatic Data Platform. Data flows through these layers in sequence, from ingestion to delivery; each layer is summarized here and described in detail in the model analysis that follows.

1. Ingestion Layer: The Automated Gateway

As detailed in Chapters 8 and 9, this layer is handled by operational macros in the ingestion/ directory. It's responsible for getting raw data from source systems into the data platform's landing zone in Snowflake.

  • Infrastructure as Code: The setup_interactive_brokers.sql script uses the pragmatic_data.inout_setup_sql macro to define and create all necessary Snowflake objects from a YAML config, ensuring a repeatable, version-controlled environment.
  • Pattern-Based Ingestion: Each load_IB_....sql script wraps the pragmatic_data.run_CSV_ingestion macro, demonstrating a powerful, configuration-driven playbook that dramatically reduces boilerplate code.
  • Operational Control: The load_all_IB.sql script acts as a master controller, orchestrating the entire ingestion process in a single command, ideal for automation.
  • Idempotency and Reliability: The process is safe to re-run; already-loaded files are tracked for each landing table, preventing duplicate loads and data corruption.

2. Storage Layer: The Stable Foundation

The models/01_storage directory implements the patterns from Chapters 10 and 11. It's the PDP's foundation, storing a perfect, immutable, and auditable history of the source data over time.

  • The STG » HIST » VER Pattern: This is the cornerstone of the layer. Every source entity is consistently processed through this three-stage pipeline using PDP macros for uniformity and reliability.
  • Hashing and Historization: The use of pdp_hash to create business key hashes (_HKEY) and change-detection hashes (_HDIFF) is the technical core, providing stable surrogate keys and a reliable mechanism for tracking changes.
  • Handling Embedded Entities: The security subfolder is a standout example of creating a complete historical picture of an entity even when its attributes are fragmented across multiple source files.
  • dbt Sources: The layer declares landing tables as dbt sources (in source_interactive_brokers.yml) to manage the dependency graph from the start.
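The hashing described above can be sketched in plain Snowflake SQL. This is a conceptual equivalent, not the actual `pdp_hash` implementation; the source reference and column names (`account_id`, `symbol`, `quantity`, `price`) are illustrative assumptions:

```sql
-- Conceptual sketch of what a pdp_hash-style macro produces.
-- Column names are illustrative, not the project's actual schema.
select
    src.*,
    -- business key hash (_HKEY): stable identifier for the entity
    md5(coalesce(account_id, '') || '|' || coalesce(symbol, ''))  as position_hkey,
    -- change-detection hash (_HDIFF): changes when any tracked attribute changes
    md5(coalesce(account_id, '') || '|' || coalesce(symbol, '') || '|'
        || coalesce(to_varchar(quantity), '') || '|'
        || coalesce(to_varchar(price), ''))                       as position_hdiff
from {{ source('interactive_brokers', 'positions') }} as src
```

Comparing `_HDIFF` values between incoming rows and the stored history is what makes change detection a cheap equality check rather than a column-by-column comparison.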

3. Refined Layer: Creating Business-Centric Information

The models/02_refined layer, detailed in Chapters 12 and 13, transforms the source-aligned data into clean, integrated, and reusable business concepts, applying business logic to turn data into valuable information.

  • Master Data Creation: The REFH_IB_SECURITIES model serves as the "golden record" for securities, integrating multiple sources into a single, authoritative, and versioned history.
  • Event-Sourcing with a Recursive CTE: The REFH_IB_POSITIONS_CALCULATED model reconstructs a complete position history from a transaction log, demonstrating a powerful pattern for creating highly accurate and auditable records.
  • Time-Series Analysis: The TS_IB_REPORTED_POSITIONS_DAILY_VALUES model uses the pragmatic_data.time_join macro to perform a time-aware (ASOF) join, transforming sparse data into a dense, analysis-ready time series.
  • Master Date Dimension: The MDD_DATE_CALENDAR model uses dbt_utils.date_spine to create a standard date dimension, a best practice for any analytical modeling.

4. Delivery Layer

This is the final layer, designed to serve data consumers. As described in Chapters 14 and 15, its models are purpose-built for specific use cases like BI dashboards or application data sources.

  • Dimensional Modeling: The primary output is the portfolio_analysis data mart, which includes conformed dimensions (DIM_DATE, DIM_SECURITIES) and fact tables (FACT_POSITION_TRANSACTIONS).
  • Denormalized Reporting Models: The project builds RPT_POSITIONS_DAILY_VALUES, a wide, pre-joined table optimized for performance and ease of use in BI tools.
  • Application-Specific Models: RPT_POSITIONS_CURRENT_VALUES is a lean, specialized model designed to power a specific application, like the included Streamlit app.
  • Governance and Trust: Models in this layer use dbt Contracts to enforce data types and constraints, ensuring data products are reliable and trustworthy.

Detailed Model Analysis

Ingestion Layer (ingestion/)

This folder contains the operational macros to set up and run the data loading process. They are executed via dbt run-operation.

setup_interactive_brokers.sql

Purpose: To create all necessary Snowflake infrastructure (stages, file formats, landing tables) for ingesting the Interactive Brokers CSV files.

Code Analysis: This script uses the pragmatic_data.inout_setup_sql macro, which reads a simple YAML configuration to define the landing zone's database, schema, file format, and stage. This centralizes the entire setup in a single, readable configuration file.

load_*.sql scripts

Purpose: To execute the ingestion process for specific data entities (e.g., load_IB_Trades.sql).

Code Analysis: These scripts use the pragmatic_data.run_CSV_ingestion macro. This macro reads the table-specific YAML configuration and executes the COPY INTO commands to load new data from the configured Snowflake stage into the corresponding landing table.
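Under the hood, the statement generated by the macro has the familiar shape of a Snowflake `COPY INTO` command. The object names below are placeholders, not the project's actual configuration:

```sql
-- Illustrative shape of the generated load statement.
-- Database, schema, stage, and format names are placeholders.
copy into LANDING.IB.TRADES
from @LANDING.IB.CSV_STAGE/trades/
file_format = (format_name = 'LANDING.IB.CSV_FORMAT')
on_error = 'skip_file';
```

Because Snowflake's load metadata remembers which files a table has already consumed, re-running the same `COPY INTO` is idempotent by default.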

load_all_IB.sql

Purpose: A convenient, all-in-one macro to run ingestion for all data types in the correct sequence.

Code Analysis: A wrapper macro that contains a series of {% do load_IB_...() %} calls, executing each individual ingestion macro. This is the typical entry point for a full source ingestion.
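Based on the description above, the wrapper looks roughly like this; the individual macro names are illustrative stand-ins for the project's actual per-entity load macros:

```sql
{% macro load_all_IB() %}
    {# Run each entity's ingestion in sequence.
       The macro names below are illustrative examples. #}
    {% do load_IB_Trades() %}
    {% do load_IB_CashTransactions() %}
    {% do load_IB_Positions() %}
{% endmacro %}
```

It is then invoked with a single command, `dbt run-operation load_all_IB`, which makes it easy to schedule from any orchestrator.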

Storage Layer (models/01_storage/)

This layer stores a complete history of the ingested data using the STG » HIST » VER pattern.

The STG » HIST » VER Pattern

  • STG (Staging): models that clean and prepare the raw landing data, adding business keys and change-detection hashes.
  • HIST (History): incremental models that append every new version of each record, building an immutable, auditable history.
  • VER (Versioned): lightweight views that expose the most recent version of each entity from its history.

interactive_brokers/cash_transactions/

This directory processes financial transactions like dividends and interest. It's a classic example of the core storage pattern.

  • STG_IB_CASH_TRANSACTIONS.sql: Cleans and prepares raw data, adding metadata such as business key hashes (_HKEY) and change-detection hashes (_HDIFF) using the pragmatic_data.stage macro.
  • HIST_IB_CASH_TRANSACTIONS.sql: Incrementally builds an append-only history of all transactions using the pragmatic_data.save_history macro.
  • VER_IB_CASH_TRANSACTIONS.sql: Provides a simple view of the most recent version of each transaction using the pragmatic_data.current_from_history macro.
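While the exact signature of `current_from_history` is package-specific, the VER view it produces is conceptually a "latest version per key" query along these lines (column names are illustrative assumptions):

```sql
-- Conceptual form of a VER model: keep only the most recent
-- version of each entity from the append-only history.
select *
from {{ ref('HIST_IB_CASH_TRANSACTIONS') }}
qualify row_number() over (
    partition by cash_transaction_hkey   -- one row per business key
    order by load_ts desc                -- newest version wins
) = 1
```

Keeping VER as a view over HIST means the current picture is always derived from, and consistent with, the full history.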

interactive_brokers/security/

This directory demonstrates a more advanced use case: creating a unified entity from information embedded across multiple source files (trades, positions, etc.). Each source's security data is processed through its own STG » HIST » VER pipeline before being merged in the Refined Layer.

Refined Layer (models/02_refined/)

This layer transforms the technical storage models into valuable, integrated business concepts.

interactive_brokers/securities/

  • MHIST_IB_SECURITIES.sql: Creates a "merged history" by unioning the VER security models from the Storage layer and applying the pragmatic_data.save_history_with_multiple_versions macro to handle records from different sources on the same day.
  • REFH_IB_SECURITIES.sql: Builds the final, refined history of securities. It uses pragmatic_data.versions_from_history_with_multiple_versions to create a clean, point-in-time correct history of each security, making it the "golden record".

interactive_brokers/positions_calculated/

  • REF_IB_POSITIONS_TRANSACTIONS.sql: Creates a unified stream of all transactions affecting position quantities (buys, sells, transfers, corporate actions).
  • REFH_IB_POSITIONS_CALCULATED.sql: The core of the business logic. It uses a recursive CTE to process the transaction stream from the model above, reconstructing a complete, event-sourced position history from scratch.
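The event-sourcing step can be sketched as a recursive CTE that replays the transaction stream one event at a time. The column names (`quantity_delta`, `transaction_ts`) are illustrative; the real model likely carries more state per row (e.g. cost basis), which is why recursion is used instead of a simple window sum:

```sql
with recursive transactions as (
    select
        position_hkey,
        transaction_ts,
        quantity_delta,
        row_number() over (
            partition by position_hkey order by transaction_ts
        ) as seq_no
    from {{ ref('REF_IB_POSITIONS_TRANSACTIONS') }}
),
position_history (position_hkey, transaction_ts, seq_no, quantity) as (
    -- anchor: the first transaction opens each position
    select position_hkey, transaction_ts, seq_no, quantity_delta
    from transactions
    where seq_no = 1

    union all

    -- recursive step: apply the next transaction to the running state
    select t.position_hkey, t.transaction_ts, t.seq_no,
           ph.quantity + t.quantity_delta
    from position_history as ph
    join transactions as t
      on t.position_hkey = ph.position_hkey
     and t.seq_no = ph.seq_no + 1
)
select * from position_history
```

Each output row is fully derived from the rows before it, so the resulting history is reproducible and auditable from the raw transaction log alone.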

interactive_brokers/position_values/

  • TS_IB_REPORTED_POSITIONS_DAILY_VALUES.sql: Enriches reported position values with security and position attributes at the correct point in time. It uses the pragmatic_data.time_join macro to perform a time-aware (ASOF) join, creating a sparse time-series with data only on days when values were reported.
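A point-in-time (ASOF) join like the one `time_join` performs can be expressed conceptually as "take the latest version effective on or before the reporting date". The column names below are illustrative assumptions:

```sql
-- Conceptual point-in-time enrichment: for each reported value,
-- pick the security version in effect on the report date.
select
    pv.report_date,
    pv.security_hkey,
    pv.market_value,
    sec.security_name
from {{ ref('REFH_IB_POSITIONS_REPORTED') }} as pv
left join {{ ref('REFH_IB_SECURITIES') }} as sec
  on sec.security_hkey = pv.security_hkey
 and sec.effective_from <= pv.report_date
qualify row_number() over (
    partition by pv.security_hkey, pv.report_date
    order by sec.effective_from desc      -- latest version wins
) = 1
```

Recent Snowflake versions also offer a native `ASOF JOIN` with a `MATCH_CONDITION` clause that expresses the same semantics more directly.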

interactive_brokers/dividends/

  • REFH_IB_DIVIDENDS.sql: Creates a refined history of dividend payments by filtering the main cash transaction log for dividend-related descriptions.
  • AGG_IB_DIVIDENDS.sql: Aggregates the filtered dividend history by security, date, and currency, preparing it for use in a fact table by summarizing dividend events.
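The aggregation described above is a straightforward `GROUP BY` at the fact table's intended grain; column names are illustrative assumptions:

```sql
-- Conceptual aggregation of dividend events to the
-- security / date / currency grain of the fact table.
select
    security_hkey,
    payment_date,
    currency,
    sum(amount) as dividend_amount,
    count(*)    as dividend_event_count
from {{ ref('REFH_IB_DIVIDENDS') }}
group by security_hkey, payment_date, currency
```

Fixing the grain here keeps the downstream fact table additive and free of double counting.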

interactive_brokers/portfolios/

  • REF_IB_PORTFOLIOS.sql: Creates a master list of all unique portfolios (accounts) by scanning multiple VER_ tables and creating a distinct list of `PORTFOLIO_HKEY`s. This serves as the source for the portfolio dimension.

interactive_brokers/positions_reported/

  • REFH_IB_POSITIONS_REPORTED.sql: Creates a refined historical view of positions as reported by the source system. It's crucial for reconciliation against the calculated position history to ensure data quality.

master_data/

  • MDD_DATE_CALENDAR.sql: A canonical Master Date Dimension. It uses dbt_utils.date_spine to generate a complete calendar table, which is then enriched with attributes like year, month, and day of the week for analytical use.
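A minimal sketch of such a model, using the documented `dbt_utils.date_spine` signature (which emits a `date_day` column); the date range is an illustrative assumption:

```sql
with date_spine as (
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2015-01-01' as date)",
        end_date="cast('2030-01-01' as date)"
    ) }}
)
select
    date_day,
    year(date_day)         as year_number,
    month(date_day)        as month_number,
    dayofweekiso(date_day) as iso_day_of_week
from date_spine
```

Enriching the spine in one canonical place means every mart shares the same calendar attributes.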

Delivery Layer (models/03_delivery/)

This layer creates the final data products for end-users, primarily in the marts/portfolio_analysis/ directory.

Dimensions

  • DIM_DATE.sql: A standard date dimension, exposing the MDD_DATE_CALENDAR model for easy use.
  • SCD_SECURITIES.sql & DIM_SECURITIES.sql: These models create a Type 2 Slowly Changing Dimension to track changes to security attributes over time. SCD_SECURITIES uses the pragmatic_data.self_completing_dimension macro to build a dynamic dimension from the refined history, while DIM_SECURITIES provides a simple view of only current records.
  • DIM_PORTFOLIOS.sql: A simple dimension created from the refined list of portfolios.
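The split between the full SCD and the current-only dimension can be sketched as a simple filter on the SCD's validity columns; the open-ended convention (`valid_to is null`) is an assumption about how the project marks current rows:

```sql
-- Conceptual DIM_SECURITIES: only the current row of each security
-- from the Type 2 SCD. The validity convention is an assumption.
select *
from {{ ref('SCD_SECURITIES') }}
where valid_to is null
```

Consumers that need history join the fact table to the SCD; those that only need "as of now" use the lean current view.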

Facts & Reporting

  • FACT_POSITION_TRANSACTIONS.sql: A fact table containing all position-changing transactions, linked to the conformed dimensions via surrogate keys.
  • FACT_DIVIDENDS.sql: A fact table for dividend events.
  • RPT_POSITIONS_DAILY_VALUES.sql: A wide, denormalized reporting model optimized for BI tools. It joins the sparse time-series data with the date dimension to create a dense daily history of position values, ready for easy analysis.
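Densifying the sparse series can be sketched as crossing the calendar with the known positions and carrying the last reported value forward; column names are illustrative assumptions, and range filtering of the calendar is omitted for brevity:

```sql
-- Conceptual densification: one row per position per calendar day,
-- carrying the last known market value forward across gaps.
select
    d.date_day,
    p.position_hkey,
    last_value(v.market_value) ignore nulls over (
        partition by p.position_hkey
        order by d.date_day
        rows between unbounded preceding and current row
    ) as market_value
from {{ ref('DIM_DATE') }} as d
cross join (
    select distinct position_hkey
    from {{ ref('TS_IB_REPORTED_POSITIONS_DAILY_VALUES') }}
) as p
left join {{ ref('TS_IB_REPORTED_POSITIONS_DAILY_VALUES') }} as v
  on v.report_date   = d.date_day
 and v.position_hkey = p.position_hkey
```

Materializing this dense table shifts the gap-filling cost from every dashboard query to a single dbt run.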

Application Models (apps/)

  • RPT_POSITIONS_CURRENT_VALUES.sql: A lean, specialized model that provides only the most recent information for each position, designed to efficiently power a front-end application.