Building a Live Data Simulator for Data Engineering Practice
Intro
If you’ve ever tried to learn Change Data Capture (CDC), streaming pipelines, or incremental data loading, you’ve probably hit the same frustrating wall I did: static sample databases don’t teach you the hard parts. In this post, I’ll share how I built a Terraform project that deploys a continuously changing AdventureWorksLT database on Azure, which you can use to practice real-world data engineering scenarios. And the best part? It costs less than a coffee per month.
GitHub Repository: https://github.com/stefanko-ch/AdventureworksLTDataSimulator
The Problem with Static Data
Let’s be honest: most data engineering tutorials use datasets that never change. You download a CSV, load it into a database, and that’s it. Great for learning SQL basics, terrible for learning how production data systems actually work.
When you’re building real data pipelines, you need to handle:
- Continuous data changes - new records arriving every minute
- Updates to existing records - customers changing addresses, orders being modified
- Deletions - cancelled orders, removed products
- Time-based patterns - business hours, seasonal trends
- Data quality issues - duplicates, missing values, edge cases
The classic AdventureWorksLT sample database from Microsoft is excellent for learning SQL and database schema design. But for practicing CDC, streaming, or incremental loads? It just sits there, unchanged, waiting for you to manually insert test data.
I wanted something different: a database that behaves like a production system without the complexity of actually running one.
The Solution: Automated Data Simulation
I created an open-source Terraform project that deploys a self-updating AdventureWorksLT database on Azure SQL. The data simulation logic lives in 6 SQL stored procedures that generate realistic business operations. But here’s the challenge: Azure SQL Database doesn’t have SQL Agent to schedule these procedures automatically.
That’s where Azure Logic Apps come in: they act as the external scheduler to trigger the stored procedures every 5 minutes.
Why Logic Apps for scheduling? I evaluated several options:
- Azure Functions: Requires managing code, deployment packages, and runtime versions. Overkill for simple scheduled SQL calls.
- Azure Data Factory: More expensive for this use case, and designed for data movement rather than simple stored procedure execution.
- Databricks Jobs: Way too expensive (~$0.07/DBU-hour) for running simple SQL stored procedures every 5 minutes.
- Azure Automation: Could work, but requires PowerShell/Python runbooks and is more complex to configure than Logic Apps.
- Azure Logic Apps: Perfect fit - built-in SQL connector, simple scheduling, consumption pricing (~$0.000025 per action), and fully declarable in Terraform.
The winner: at roughly $0.000025 per action, the whole workload (5 apps × 8,640 executions/month) comes to about $1 per month, making Logic Apps the most cost-effective option by far.
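As a sanity check on that estimate, here is the arithmetic spelled out. Treat it as a rough floor: consumption-plan billing also counts trigger evaluations and connector actions separately, so the real bill can be somewhat higher.

```python
# Rough Logic Apps consumption-cost estimate for this workload.
# Assumption: one billable action per run (real bills also count
# trigger checks and connector calls).
APPS = 5
RUNS_PER_HOUR = 60 // 5          # one run every 5 minutes
HOURS_PER_MONTH = 24 * 30
PRICE_PER_ACTION = 0.000025      # consumption-plan price per action (USD)

runs_per_month = RUNS_PER_HOUR * HOURS_PER_MONTH   # 8,640 per app
monthly_cost = APPS * runs_per_month * PRICE_PER_ACTION
print(f"{runs_per_month} runs/app, ${monthly_cost:.2f}/month")  # 8640 runs/app, $1.08/month
```

Even if the per-run action count is a few times higher in practice, the order of magnitude stays at single-digit dollars.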
Here’s what the stored procedures do (triggered by Logic Apps):
| Logic App | Stored Procedure | What It Does | Frequency |
|---|---|---|---|
| Generate Orders | usp_Sim_GenerateNewOrders | Creates 100 new orders with line items | Every 5 minutes |
| Ship Orders | usp_Sim_ShipPendingOrders | Ships ~50% of pending orders | Every 5 minutes (offset +1 min) |
| Update Customers | usp_Sim_UpdateCustomerInfo | Changes phone numbers for 20 customers | Every 5 minutes (offset +2 min) |
| Add Customers | usp_Sim_GenerateNewCustomers | Creates 10-20 new customers with addresses | Every 5 minutes (offset +3 min) |
| Cancel Orders | usp_Sim_CancelRandomOrders | Cancels ~10% of pending orders | Every 5 minutes (offset +4 min) |
The staggered timing (offset by 1 minute each) spreads the load and creates a more realistic pattern of continuous activity.
Key Features:
- 100% Infrastructure as Code - No manual SQL scripts or clicking in the Azure Portal
- Automatic database user creation - `dbwriter` and `dbreader` users with proper permissions
- Secure credential storage - All passwords stored in Azure Key Vault
- Cost-effective - Runs on Basic tier (~$5-10/month)
- Easy cleanup - Simple `terraform destroy` when you’re done
- Customizable - Change intervals, add new simulation procedures, or disable simulations
Architecture Overview
The system consists of three main Azure components working together:
Azure Logic Apps (5 Scheduled Tasks)
Every 5 minutes, staggered by 1 minute each to distribute load:
| Time | Logic App | Action | Volume |
|---|---|---|---|
| :00, :05, :10… | Generate Orders | Creates new sales orders with line items | 100 orders |
| :01, :06, :11… | Ship Orders | Updates pending orders to shipped status | ~50% of pending |
| :02, :07, :12… | Update Customers | Modifies customer phone numbers | 20 customers |
| :03, :08, :13… | New Customers | Adds new customers with addresses | 10-20 new |
| :04, :09, :14… | Cancel Orders | Deletes random pending orders | ~10% of pending |
Azure SQL Database
AdventureWorksLT-Live (Basic Tier - DTU-based)
6 Stored Procedures:
- `SalesLT.usp_Sim_GenerateNewOrders` - Creates realistic orders with products
- `SalesLT.usp_Sim_ShipPendingOrders` - Ships approximately 50% of pending orders
- `SalesLT.usp_Sim_UpdateCustomerInfo` - Updates customer contact information
- `SalesLT.usp_Sim_GenerateNewCustomers` - Creates new customers with addresses
- `SalesLT.usp_Sim_CancelRandomOrders` - Cancels orders (simulates business operations)
- `SalesLT.usp_Sim_GetDatabaseStats` - Returns statistics about simulation activity
SQL Users:
- `dbwriter` - Execute permissions for Logic Apps
- `dbreader` - Read-only access for your data pipelines
Azure Key Vault
Stores all credentials securely:
- SQL Server admin password
- `dbwriter` password
- `dbreader` password
- Connection strings
The beauty of this design is that everything is automated. The Terraform configuration:
- Creates the Azure SQL Database with AdventureWorksLT sample data
- Deploys all 6 stored procedures that simulate business operations
- Creates SQL users (`dbwriter`, `dbreader`) with appropriate permissions
- Stores credentials in Azure Key Vault
- Sets up 5 Logic Apps with staggered schedules
- Configures API connections between Logic Apps and SQL Database
How It Works
For detailed setup instructions, see the complete guide in the GitHub repository. Below, I’ll highlight the key technical components.
Terraform Infrastructure
The entire infrastructure is defined in Terraform, following Azure naming best practices:
# Resource naming conventions
resource_group_name = "rg-adventureworks-simulator"
sql_server_name = "sql-adventureworks-sim" # Must be globally unique
key_vault_name = "kv-awsim" # Must be globally unique
database_name = "AdventureWorksLT-Live"
The SQL Database is deployed with the AdventureWorksLT sample:
resource "azurerm_mssql_database" "adventureworks_live" {
name = "AdventureWorksLT-Live"
server_id = azurerm_mssql_server.main.id
sku_name = "Basic"
# This is the magic - Azure deploys AdventureWorksLT schema and data
sample_name = "AdventureWorksLT"
lifecycle {
prevent_destroy = false
}
}
Stored Procedures
Each stored procedure simulates a specific business operation. Here’s a simplified example of the order generation procedure:
CREATE OR ALTER PROCEDURE SalesLT.usp_Sim_GenerateNewOrders
@OrderCount INT = 100
AS
BEGIN
SET NOCOUNT ON;
DECLARE @i INT = 0;
DECLARE @CustomerID INT;
DECLARE @AddressID INT;
WHILE @i < @OrderCount
BEGIN
-- Pick a random customer
SELECT TOP 1 @CustomerID = CustomerID
FROM SalesLT.Customer
ORDER BY NEWID();
-- Get customer's address
SELECT TOP 1 @AddressID = ca.AddressID
FROM SalesLT.CustomerAddress ca
WHERE ca.CustomerID = @CustomerID
ORDER BY NEWID();
-- Create order header
INSERT INTO SalesLT.SalesOrderHeader (
OrderDate, CustomerID, ShipToAddressID,
Status, ModifiedDate
)
VALUES (
GETDATE(), @CustomerID, @AddressID,
1, -- Status = Pending
GETDATE()
);
-- Add 1-5 random products to order
-- (simplified - actual procedure includes price calculations)
SET @i = @i + 1;
END
END
The procedures use randomization (NEWID()) to create varied, realistic data patterns—not the same test data over and over.
Logic Apps Orchestration
Logic Apps are defined in Terraform with staggered schedules:
locals {
simulation_procedures = {
"generate-orders" = {
name = "logic-sim-generate-orders"
procedure = "SalesLT.usp_Sim_GenerateNewOrders"
offset_minute = 0 # Runs at 0,5,10,15...
parameters = { "OrderCount" = 100 }
}
"ship-orders" = {
name = "logic-sim-ship-orders"
procedure = "SalesLT.usp_Sim_ShipPendingOrders"
offset_minute = 1 # Runs at 1,6,11,16...
parameters = {}
}
# ... more procedures
}
}
Each Logic App runs a simple workflow:
- Wait for the scheduled time (recurrence trigger)
- Execute the stored procedure via SQL connection
- Log the result
The 1-minute offsets prevent all procedures from running simultaneously, creating more realistic load patterns.
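For intuition, the recurrence pattern can be sketched with a small helper (hypothetical, not part of the project) that computes the minute marks within an hour at which each app fires:

```python
def firing_minutes(interval: int, offset: int) -> list[int]:
    """Minute marks within one hour at which a recurrence fires."""
    return [m for m in range(60) if m % interval == offset % interval]

# One entry per Logic App, offsets 0-4 as in the Terraform locals above.
schedule = {
    name: firing_minutes(5, off)
    for off, name in enumerate([
        "generate-orders", "ship-orders", "update-customers",
        "new-customers", "cancel-orders",
    ])
}
print(schedule["ship-orders"][:3])  # [1, 6, 11]
```

With a 5-minute interval, each offset claims its own 12 slots per hour, so no two apps ever fire in the same minute.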
SQL Users and Permissions
Terraform automatically creates two SQL users:
-- Writer user (for Logic Apps)
CREATE USER [dbwriter] WITH PASSWORD = 'SecurePassword123!';
ALTER ROLE db_datareader ADD MEMBER [dbwriter];
ALTER ROLE db_datawriter ADD MEMBER [dbwriter];
GRANT EXECUTE TO [dbwriter];
-- Reader user (for your data pipelines)
CREATE USER [dbreader] WITH PASSWORD = 'SecurePassword456!';
ALTER ROLE db_datareader ADD MEMBER [dbreader];
Passwords are generated randomly and stored in Azure Key Vault, where your applications can retrieve them securely.
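For example, a Python pipeline could fetch the reader password at runtime. This is a minimal sketch assuming the `azure-identity` and `azure-keyvault-secrets` packages and an authenticated session (e.g. via `az login`):

```python
def vault_url(vault_name: str) -> str:
    # Key Vault URLs follow a fixed, predictable pattern.
    return f"https://{vault_name}.vault.azure.net"

def fetch_secret(vault_name: str, secret_name: str) -> str:
    # Imported lazily so the sketch runs even without the Azure SDKs installed.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient
    client = SecretClient(vault_url=vault_url(vault_name),
                          credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value

# e.g. password = fetch_secret("kv-awsim-unique", "sql-reader-password")
print(vault_url("kv-awsim-unique"))  # https://kv-awsim-unique.vault.azure.net
```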
Getting Started
Prerequisites
You’ll need an Azure subscription, plus Terraform and the Azure CLI installed locally.
Quick Deploy
Clone the repository and configure:
git clone https://github.com/stefanko-ch/AdventureworksLTDataSimulator.git
cd AdventureworksLTDataSimulator/terraform
# Copy and edit configuration
cp terraform.tfvars.example terraform.tfvars
nano terraform.tfvars # or your favorite editor
Required variables in terraform.tfvars:
subscription_id = "your-azure-subscription-id"
admin_email = "[email protected]" # For Key Vault access
# Optional - customize these if needed
location = "North Europe"
resource_group_name = "rg-adventureworks-simulator"
sql_server_name = "sql-aw-sim-unique123" # Must be globally unique
key_vault_name = "kv-awsim-unique" # Must be globally unique
simulation_enabled = true
simulation_interval_minutes = 5
Deploy with Terraform:
# Login to Azure
az login
# Initialize Terraform
terraform init
# Review the plan
terraform plan
# Deploy (takes ~5 minutes)
terraform apply
Terraform will output the connection details for the newly created resources.
Connect to Your Database
Retrieve credentials from Key Vault:
# Reader password (for your data pipelines)
az keyvault secret show \
--vault-name kv-awsim-unique \
--name sql-reader-password \
--query value -o tsv
# Writer password (for Logic Apps - already configured)
az keyvault secret show \
--vault-name kv-awsim-unique \
--name sql-writer-password \
--query value -o tsv
Connection string example:
Server=sql-aw-sim-unique123.database.windows.net,1433;
Database=AdventureWorksLT-Live;
User ID=dbreader;
Password=<from-keyvault>;
Encrypt=True;
TrustServerCertificate=False;
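If your pipeline is in Python, a small helper can assemble an equivalent ODBC-style string (e.g. for pyodbc). This is a sketch, and the driver name (`ODBC Driver 18 for SQL Server`) is an assumption about your local setup:

```python
def build_conn_str(server: str, database: str, user: str, password: str) -> str:
    # ODBC-style connection string for Azure SQL; encryption is mandatory.
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server=tcp:{server},1433;"
        f"Database={database};"
        f"Uid={user};Pwd={password};"
        "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
    )

cs = build_conn_str("sql-aw-sim-unique123.database.windows.net",
                    "AdventureWorksLT-Live", "dbreader", "<from-keyvault>")
print(cs.split(";")[1])  # Server=tcp:sql-aw-sim-unique123.database.windows.net,1433
```

At runtime you would substitute the real password fetched from Key Vault rather than hard-coding it.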
Monitor Simulation Activity
Query the database to see activity:
-- Check recent orders
SELECT TOP 100 *
FROM SalesLT.SalesOrderHeader
ORDER BY ModifiedDate DESC;
-- Count orders by status
SELECT
CASE Status
WHEN 1 THEN 'Pending'
WHEN 5 THEN 'Shipped'
WHEN 6 THEN 'Cancelled'
END AS StatusName,
COUNT(*) AS OrderCount
FROM SalesLT.SalesOrderHeader
GROUP BY Status;
-- See how data is growing
SELECT
CAST(OrderDate AS DATE) AS OrderDay,
COUNT(*) AS OrderCount
FROM SalesLT.SalesOrderHeader
WHERE OrderDate >= DATEADD(day, -7, GETDATE())
GROUP BY CAST(OrderDate AS DATE)
ORDER BY OrderDay;
Or you can query the results directly from your terminal.
# Get the admin password from Key Vault
sqlcmd -S <your-sql-server>.database.windows.net \
-d AdventureWorksLT-Live \
-U sqladmin \
-P "$(az keyvault secret show --vault-name <your-keyvault-name> --name sql-admin-password --query value -o tsv)" \
-Q "EXEC SalesLT.usp_Sim_GetDatabaseStats;"
The procedure returns a summary of the simulation activity in the database.
Cost Management
I ran this demo for several days in my test environment. Over that period it cost less than 0.44 Swiss francs, which works out to roughly 4.25 Swiss francs per month.
This long-running test also surfaced other insights: because new data arrives every few minutes, queries gradually take more time and CPU, and allocated storage grows steadily over time.
Is this a problem? I don’t think so. Just destroy the environment with `terraform destroy` and redeploy it when you need a fresh start.
Ways to save money:
- Pause simulations when not in use: set `simulation_enabled = false`, then run `terraform apply` to disable the Logic Apps.
- Reduce simulation frequency: `simulation_interval_minutes = 15` instead of 5.
- Delete when done: `terraform destroy`.
Conclusion
Building data pipelines for static datasets teaches you syntax. Building pipelines for changing data teaches you data engineering.
This AdventureWorksLT Data Simulator gives you a safe, cheap, disposable environment to practice:
- Change Data Capture (CDC)
- Incremental loading patterns
- Streaming pipelines
- Data quality handling
- Schema evolution
- Slowly Changing Dimensions (SCD)
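As a taste of one of these patterns, here is a minimal watermark-based incremental extract in Python. The rows are stand-ins for the result of a `ModifiedDate`-filtered query against `SalesLT.SalesOrderHeader`; a real pipeline would persist the watermark between runs:

```python
from datetime import datetime

def incremental_extract(rows, watermark):
    """Return rows modified after the watermark, plus the advanced watermark."""
    fresh = [r for r in rows if r["ModifiedDate"] > watermark]
    new_watermark = max((r["ModifiedDate"] for r in fresh), default=watermark)
    return fresh, new_watermark

# Stand-ins for SalesLT.SalesOrderHeader rows.
orders = [
    {"SalesOrderID": 1, "ModifiedDate": datetime(2024, 1, 1, 10, 0)},
    {"SalesOrderID": 2, "ModifiedDate": datetime(2024, 1, 1, 10, 5)},
    {"SalesOrderID": 3, "ModifiedDate": datetime(2024, 1, 1, 10, 12)},
]

batch, wm = incremental_extract(orders, datetime(2024, 1, 1, 10, 3))
print([r["SalesOrderID"] for r in batch])  # [2, 3]
```

Because the simulator keeps inserting and updating orders every few minutes, each run of a loop like this picks up a genuinely new batch, which is exactly what static sample data can’t give you.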
The entire setup costs less than a fancy coffee per month and can be deployed/destroyed in minutes. No manual configuration, no clicking through Azure Portal, just Infrastructure as Code.
Try it out: https://github.com/stefanko-ch/AdventureworksLTDataSimulator
If you build something cool with it, find bugs, or have ideas for improvements, I’d love to hear about it. Open an issue or submit a PR!
Title image generated with Google Gemini