
Building a Live Data Simulator for Data Engineering Practice

How I built an automated, self-updating AdventureWorksLT database on Azure for practicing Change Data Capture and streaming pipelines

Intro

If you’ve ever tried to learn Change Data Capture (CDC), streaming pipelines, or incremental data loading, you’ve probably hit the same frustrating wall I did: static sample databases don’t teach you the hard parts. In this post, I’ll share how I built a Terraform project that deploys a continuously changing AdventureWorksLT database on Azure, which you can use to practice real-world data engineering scenarios. And the best part? It costs no more than a coffee per month.

GitHub Repository: https://github.com/stefanko-ch/AdventureworksLTDataSimulator

The Problem with Static Data

Let’s be honest: most data engineering tutorials use datasets that never change. You download a CSV, load it into a database, and that’s it. Great for learning SQL basics, terrible for learning how production data systems actually work.

When you’re building real data pipelines, you need to handle:

  • Continuous data changes - new records arriving every minute
  • Updates to existing records - customers changing addresses, orders being modified
  • Deletions - cancelled orders, removed products
  • Time-based patterns - business hours, seasonal trends
  • Data quality issues - duplicates, missing values, edge cases

The classic AdventureWorksLT sample database from Microsoft is excellent for learning SQL and database schema design. But for practicing CDC, streaming, or incremental loads? It just sits there, unchanged, waiting for you to manually insert test data.

I wanted something in between: a database that behaves like a production system, without the complexity of actually running one.

The Solution: Automated Data Simulation

I created an open-source Terraform project that deploys a self-updating AdventureWorksLT database on Azure SQL. The data simulation logic lives in 6 SQL stored procedures that generate realistic business operations. But here’s the challenge: Azure SQL Database doesn’t have SQL Agent to schedule these procedures automatically.

That’s where Azure Logic Apps come in: they act as the external scheduler to trigger the stored procedures every 5 minutes.

Why Logic Apps for scheduling? I evaluated several options:

  • Azure Functions: Requires managing code, deployment packages, and runtime versions. Overkill for simple scheduled SQL calls.
  • Azure Data Factory: More expensive for this use case, and designed for data movement rather than simple stored procedure execution.
  • Databricks Jobs: Way too expensive (~$0.07/DBU-hour) for running simple SQL stored procedures every 5 minutes.
  • Azure Automation: Could work, but requires PowerShell/Python runbooks and is more complex to configure than Logic Apps.
  • Azure Logic Apps: Perfect fit - built-in SQL connector, simple scheduling, consumption pricing (~$0.000025 per action), and fully definable in Terraform.

The winner: in my tests the Logic Apps cost roughly $0.18/month for this workload, thanks to consumption pricing of about $0.000025 per action, making them by far the most cost-effective option.

Here’s what the stored procedures do (triggered by Logic Apps):

| Logic App | Stored Procedure | What It Does | Frequency |
|---|---|---|---|
| Generate Orders | usp_Sim_GenerateNewOrders | Creates 100 new orders with line items | Every 5 minutes |
| Ship Orders | usp_Sim_ShipPendingOrders | Ships ~50% of pending orders | Every 5 minutes (offset +1 min) |
| Update Customers | usp_Sim_UpdateCustomerInfo | Changes phone numbers for 20 customers | Every 5 minutes (offset +2 min) |
| Add Customers | usp_Sim_GenerateNewCustomers | Creates 10-20 new customers with addresses | Every 5 minutes (offset +3 min) |
| Cancel Orders | usp_Sim_CancelRandomOrders | Cancels ~10% of pending orders | Every 5 minutes (offset +4 min) |

The staggered timing (offset by 1 minute each) spreads the load and creates a more realistic pattern of continuous activity.

Key Features:

  • 100% Infrastructure as Code - No manual SQL scripts or clicking in Azure Portal
  • Automatic database user creation - dbwriter and dbreader users with proper permissions
  • Secure credential storage - All passwords stored in Azure Key Vault
  • Cost-effective - Runs on Basic tier (~$5-10/month)
  • Easy cleanup - Simple terraform destroy when you’re done
  • Customizable - Change intervals, add new simulation procedures, or disable simulations

Architecture Overview

The system consists of three main Azure components working together:

Azure Logic Apps (5 Scheduled Tasks)

Every 5 minutes, staggered by 1 minute each to distribute load:

| Time | Logic App | Action | Volume |
|---|---|---|---|
| :00, :05, :10… | Generate Orders | Creates new sales orders with line items | 100 orders |
| :01, :06, :11… | Ship Orders | Updates pending orders to shipped status | ~50% of pending |
| :02, :07, :12… | Update Customers | Modifies customer phone numbers | 20 customers |
| :03, :08, :13… | New Customers | Adds new customers with addresses | 10-20 new |
| :04, :09, :14… | Cancel Orders | Deletes random pending orders | ~10% of pending |

Azure SQL Database

AdventureWorksLT-Live (Basic Tier - DTU-based)

6 Stored Procedures:

  • SalesLT.usp_Sim_GenerateNewOrders - Creates realistic orders with products
  • SalesLT.usp_Sim_ShipPendingOrders - Ships approximately 50% of pending orders
  • SalesLT.usp_Sim_UpdateCustomerInfo - Updates customer contact information
  • SalesLT.usp_Sim_GenerateNewCustomers - Creates new customers with addresses
  • SalesLT.usp_Sim_CancelRandomOrders - Cancels orders (simulates business operations)
  • SalesLT.usp_Sim_GetDatabaseStats - Returns statistics about simulation activity

SQL Users:

  • dbwriter - Execute permissions for Logic Apps
  • dbreader - Read-only access for your data pipelines

Azure Key Vault

Stores all credentials securely:

  • SQL Server admin password
  • dbwriter password
  • dbreader password
  • Connection strings

The beauty of this design is that everything is automated. The Terraform configuration:

  1. Creates the Azure SQL Database with AdventureWorksLT sample data
  2. Deploys all 6 stored procedures that simulate business operations (one way to wire this up is sketched right after this list)
  3. Creates SQL users (dbwriter, dbreader) with appropriate permissions
  4. Stores credentials in Azure Key Vault
  5. Sets up 5 Logic Apps with staggered schedules
  6. Configures API connections between Logic Apps and SQL Database
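
Terraform has no native resource for running T-SQL against Azure SQL, so steps 2 and 3 need a small piece of glue. One common pattern is a null_resource with a local-exec provisioner that runs sqlcmd once the database exists. The sketch below is an assumption about how this can be wired up (the script path, the sql_admin_password variable, and the use of sqlcmd are all illustrative), not necessarily how the repository implements it:

# Hypothetical sketch: push the simulation procedures and SQL users with sqlcmd.
# Assumes sqlcmd is installed locally and the T-SQL lives in sql/deploy_simulation_objects.sql.
resource "null_resource" "deploy_sql_objects" {
  # Re-run whenever the deployment script changes
  triggers = {
    script_hash = filesha256("${path.module}/sql/deploy_simulation_objects.sql")
  }

  provisioner "local-exec" {
    command = <<EOT
      sqlcmd -S ${azurerm_mssql_server.main.fully_qualified_domain_name} \
             -d ${azurerm_mssql_database.adventureworks_live.name} \
             -U sqladmin \
             -P '${var.sql_admin_password}' \
             -i ${path.module}/sql/deploy_simulation_objects.sql
    EOT
  }

  # The database (and its AdventureWorksLT sample data) must exist first
  depends_on = [azurerm_mssql_database.adventureworks_live]
}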

How It Works

For detailed setup instructions, see the complete guide in the GitHub repository. Below, I’ll highlight the key technical components.

Terraform Infrastructure

The entire infrastructure is defined in Terraform, following Azure naming best practices:

# Resource naming conventions
resource_group_name = "rg-adventureworks-simulator"
sql_server_name     = "sql-adventureworks-sim"  # Must be globally unique
key_vault_name      = "kv-awsim"                # Must be globally unique
database_name       = "AdventureWorksLT-Live"

The SQL Database is deployed with the AdventureWorksLT sample:

resource "azurerm_mssql_database" "adventureworks_live" {
  name      = "AdventureWorksLT-Live"
  server_id = azurerm_mssql_server.main.id
  sku_name  = "Basic"
  
  # This is the magic - Azure deploys AdventureWorksLT schema and data
  sample_name = "AdventureWorksLT"
  
  lifecycle {
    prevent_destroy = false
  }
}

Stored Procedures

Each stored procedure simulates a specific business operation. Here’s a simplified example of the order generation procedure:

CREATE OR ALTER PROCEDURE SalesLT.usp_Sim_GenerateNewOrders
    @OrderCount INT = 100
AS
BEGIN
    SET NOCOUNT ON;
    
    DECLARE @i INT = 0;
    DECLARE @CustomerID INT;
    DECLARE @AddressID INT;
    
    WHILE @i < @OrderCount
    BEGIN
        -- Pick a random customer
        SELECT TOP 1 @CustomerID = CustomerID 
        FROM SalesLT.Customer 
        ORDER BY NEWID();
        
        -- Get customer's address
        SELECT TOP 1 @AddressID = ca.AddressID 
        FROM SalesLT.CustomerAddress ca 
        WHERE ca.CustomerID = @CustomerID 
        ORDER BY NEWID();
        
        -- Create order header (DueDate and ShipMethod have no defaults, so they must be supplied)
        INSERT INTO SalesLT.SalesOrderHeader (
            OrderDate, DueDate, CustomerID, ShipToAddressID,
            ShipMethod, Status, ModifiedDate
        )
        VALUES (
            GETDATE(), DATEADD(DAY, 7, GETDATE()), @CustomerID, @AddressID,
            'GROUND',  -- ShipMethod (simplified)
            1,         -- Status = Pending
            GETDATE()
        );
        
        -- Add 1-5 random products to order
        -- (simplified - actual procedure includes price calculations)
        
        SET @i = @i + 1;
    END
END

The procedures use randomization (NEWID()) to create varied, realistic data patterns—not the same test data over and over.

Logic Apps Orchestration

Logic Apps are defined in Terraform with staggered schedules:

locals {
  simulation_procedures = {
    "generate-orders" = {
      name          = "logic-sim-generate-orders"
      procedure     = "SalesLT.usp_Sim_GenerateNewOrders"
      offset_minute = 0  # Runs at 0,5,10,15...
      parameters    = { "OrderCount" = 100 }
    }
    "ship-orders" = {
      name          = "logic-sim-ship-orders"
      procedure     = "SalesLT.usp_Sim_ShipPendingOrders"
      offset_minute = 1  # Runs at 1,6,11,16...
      parameters    = {}
    }
    # ... more procedures
  }
}

Each Logic App runs a simple workflow:

  1. Wait for the scheduled time (recurrence trigger)
  2. Execute the stored procedure via SQL connection
  3. Log the result

The 1-minute offsets prevent all procedures from running simultaneously, creating more realistic load patterns.
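
For reference, here is a minimal sketch of how the workflow and its recurrence trigger can be declared in Terraform, iterating over the locals map above. The resource layout, the azurerm_resource_group.main reference, and the fixed anchor timestamp are assumptions for illustration; the actual SQL "execute stored procedure" action runs through the managed SQL connector and API connection, which I have left out here for brevity:

# One Logic App per simulation procedure
resource "azurerm_logic_app_workflow" "simulation" {
  for_each            = local.simulation_procedures
  name                = each.value.name
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
}

# Recurrence trigger: every N minutes, anchored at a fixed timestamp plus the
# per-app offset, so offset 0 fires at :00/:05/:10, offset 1 at :01/:06/:11, ...
resource "azurerm_logic_app_trigger_recurrence" "simulation" {
  for_each     = local.simulation_procedures
  name         = "recurrence"
  logic_app_id = azurerm_logic_app_workflow.simulation[each.key].id
  frequency    = "Minute"
  interval     = var.simulation_interval_minutes
  start_time   = timeadd("2024-01-01T00:00:00Z", "${each.value.offset_minute}m")
}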

SQL Users and Permissions

Terraform automatically creates two SQL users:

-- Writer user (for Logic Apps)
CREATE USER [dbwriter] WITH PASSWORD = 'SecurePassword123!';
ALTER ROLE db_datareader ADD MEMBER [dbwriter];
ALTER ROLE db_datawriter ADD MEMBER [dbwriter];
GRANT EXECUTE TO [dbwriter];

-- Reader user (for your data pipelines)
CREATE USER [dbreader] WITH PASSWORD = 'SecurePassword456!';
ALTER ROLE db_datareader ADD MEMBER [dbreader];

Passwords are generated randomly and stored in Azure Key Vault, where your applications can retrieve them securely.
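
In Terraform, that pattern typically looks something like the sketch below. The secret name matches the az keyvault commands shown later in this post; the azurerm_key_vault.main reference and the password policy are assumptions for illustration:

# Generate the dbreader password...
resource "random_password" "dbreader" {
  length           = 24
  special          = true
  override_special = "!#$%&*()-_=+"
}

# ...and store it in Key Vault so pipelines can retrieve it securely.
resource "azurerm_key_vault_secret" "sql_reader_password" {
  name         = "sql-reader-password"
  value        = random_password.dbreader.result
  key_vault_id = azurerm_key_vault.main.id
}

Keep in mind that the generated value also ends up in the Terraform state file, so the state itself should be protected accordingly.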

Getting Started

Prerequisites

  • Terraform >= 1.0
  • Azure CLI
  • An Azure subscription
  • Azure AD user account (for Key Vault access)

Quick Deploy

Clone the repository and configure:

git clone https://github.com/stefanko-ch/AdventureworksLTDataSimulator.git
cd AdventureworksLTDataSimulator/terraform

# Copy and edit configuration
cp terraform.tfvars.example terraform.tfvars
nano terraform.tfvars  # or your favorite editor

Required variables in terraform.tfvars:

subscription_id = "your-azure-subscription-id"
admin_email     = "your.email@example.com"  # For Key Vault access

# Optional - customize these if needed
location            = "North Europe"
resource_group_name = "rg-adventureworks-simulator"
sql_server_name     = "sql-aw-sim-unique123"  # Must be globally unique
key_vault_name      = "kv-awsim-unique"       # Must be globally unique
simulation_enabled  = true
simulation_interval_minutes = 5

Deploy with Terraform:

# Login to Azure
az login

# Initialize Terraform
terraform init

# Review the plan
terraform plan

# Deploy (takes ~5 minutes)
terraform apply

Terraform will output connection details:

Terraform output

Connect to Your Database

Retrieve credentials from Key Vault:

# Reader password (for your data pipelines)
az keyvault secret show \
  --vault-name kv-awsim-unique \
  --name sql-reader-password \
  --query value -o tsv

# Writer password (for Logic Apps - already configured)
az keyvault secret show \
  --vault-name kv-awsim-unique \
  --name sql-writer-password \
  --query value -o tsv

Connection string example:

Server=sql-aw-sim-unique123.database.windows.net,1433;
Database=AdventureWorksLT-Live;
User ID=dbreader;
Password=<from-keyvault>;
Encrypt=True;
TrustServerCertificate=False;
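
If the consumer of this database is itself managed with Terraform (another pipeline project, for example), you can skip the CLI and read the secret with a data source instead. A small sketch, reusing the example names from above:

# Look up the Key Vault deployed by the simulator...
data "azurerm_key_vault" "simulator" {
  name                = "kv-awsim-unique"
  resource_group_name = "rg-adventureworks-simulator"
}

# ...read the dbreader password from it...
data "azurerm_key_vault_secret" "reader_password" {
  name         = "sql-reader-password"
  key_vault_id = data.azurerm_key_vault.simulator.id
}

# ...and assemble the connection string for a downstream pipeline.
locals {
  reader_connection_string = "Server=sql-aw-sim-unique123.database.windows.net,1433;Database=AdventureWorksLT-Live;User ID=dbreader;Password=${data.azurerm_key_vault_secret.reader_password.value};Encrypt=True;TrustServerCertificate=False;"
}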

Monitor Simulation Activity

Query the database to see activity:

-- Check recent orders
SELECT TOP 100 *
FROM SalesLT.SalesOrderHeader
ORDER BY ModifiedDate DESC;

-- Count orders by status
SELECT 
    CASE Status 
        WHEN 1 THEN 'Pending'
        WHEN 5 THEN 'Shipped'
        WHEN 6 THEN 'Cancelled'
    END AS StatusName,
    COUNT(*) AS OrderCount
FROM SalesLT.SalesOrderHeader
GROUP BY Status;

-- See how data is growing
SELECT 
    CAST(OrderDate AS DATE) AS OrderDay,
    COUNT(*) AS OrderCount
FROM SalesLT.SalesOrderHeader
WHERE OrderDate >= DATEADD(day, -7, GETDATE())
GROUP BY CAST(OrderDate AS DATE)
ORDER BY OrderDay;

Or you can check the simulation statistics directly from your terminal with sqlcmd:

# Run the stats procedure, pulling the admin password from Key Vault
sqlcmd -S <your-sql-server>.database.windows.net \
       -d AdventureWorksLT-Live \
       -U sqladmin \
       -P "$(az keyvault secret show --vault-name <your-keyvault-name> --name sql-admin-password --query value -o tsv)" \
       -Q "EXEC SalesLT.usp_Sim_GetStatus;"

The output looks like this:

Simulation check

Cost Management

I ran this demo for several days in my test environment. As you can see below, the total came to less than 0.44 Swiss francs, which in my case works out to about 4.25 Swiss francs per month.

Daily Costs

This longer-running test also surfaced a few other insights: because new data arrives every few minutes, queries gradually take longer and consume more CPU, and the allocated storage grows over time.

CPU growth

Is this a problem? I don’t think so. Just destroy the environment with terraform destroy and redeploy it when you need it again.

Ways to save money:

  1. Pause simulations when not in use (a sketch of how this flag maps onto the Logic Apps follows this list):

     simulation_enabled = false

     Then run terraform apply to disable the Logic Apps.

  2. Reduce simulation frequency:

    simulation_interval_minutes = 15  # Instead of 5
    
  3. Delete when done:

    terraform destroy
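
For the curious: one simple way the simulation_enabled flag can map onto the infrastructure is the workflow’s enabled argument, extending the workflow resource sketched earlier (again an assumption for illustration, not necessarily the repository’s exact mechanism):

resource "azurerm_logic_app_workflow" "simulation" {
  for_each            = local.simulation_procedures
  name                = each.value.name
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  # false = the Logic Apps stay deployed but stop firing (and stop billing actions)
  enabled = var.simulation_enabled
}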
    

Conclusion

Building data pipelines for static datasets teaches you syntax. Building pipelines for changing data teaches you data engineering.

This AdventureWorksLT Data Simulator gives you a safe, cheap, disposable environment to practice:

  • Change Data Capture (CDC)
  • Incremental loading patterns
  • Streaming pipelines
  • Data quality handling
  • Schema evolution
  • Slowly Changing Dimensions (SCD)

The entire setup costs less than a fancy coffee per month and can be deployed/destroyed in minutes. No manual configuration, no clicking through Azure Portal, just Infrastructure as Code.

Try it out: https://github.com/stefanko-ch/AdventureworksLTDataSimulator

If you build something cool with it, find bugs, or have ideas for improvements, I’d love to hear about it. Open an issue or submit a PR!


Title image generated with Google Gemini
