Advent Automation 2025

Day 12B: Great Expectations Cloud Integration

Enterprise data quality validation using GE Cloud for cybersecurity logs

Part of: Day 12: Cybersecurity Data Quality Framework


Overview

Day 12B demonstrates production-ready integration with Great Expectations Cloud, the enterprise SaaS platform for data quality validation. This implementation uses native GE expectations, cloud-hosted Data Docs, and team collaboration features.

Why Day 12B?

While Day 12A shows you can build GE-style validation from scratch, Day 12B demonstrates proper use of the actual enterprise tool. This is important when a team needs shared dashboards, managed infrastructure, and multi-user collaboration rather than a self-hosted, single-developer setup.


Key Differences from Day 12A

| Feature | Day 12A (Custom) | Day 12B (GE Cloud) |
|---------------|-----------------------|-----------------------------|
| Setup | No external account | Requires GE Cloud account |
| Data Docs | Local HTML files | Cloud-hosted dashboard |
| Expectations | Custom Python classes | Native GE expectations |
| Collaboration | Single developer | Multi-user with permissions |
| Versioning | Git only | GE Cloud + Git |
| Monitoring | Custom logging | GE Cloud monitoring UI |
| Deployment | Self-hosted | Managed cloud service |

Quick Start

Prerequisites

  1. Great Expectations Cloud Account
    • Sign up at https://greatexpectations.io/cloud
    • Free tier available for testing
    • Note your Organization ID and create an Access Token
  2. Python Environment
    python3 --version  # 3.11+ recommended
    pip install -r day12b_requirements.txt
    

Setup (15 minutes)

Step 1: Configure GE Cloud Credentials

# Copy environment template
cp day12b/.env.example ../config/.env

# Edit config/.env and add your credentials:
DAY12B_GE_CLOUD_ORG_ID=your-org-id-here
DAY12B_GE_CLOUD_ACCESS_TOKEN=your-access-token-here

Getting your GE Cloud credentials:

  1. Login to https://app.greatexpectations.io
  2. Navigate to Settings → Access Tokens
  3. Click “Create Token” with “Data Context” permissions
  4. Copy your Organization ID from the URL or Settings
  5. Paste both values into config/.env
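
How day12b_CONFIG_ge_cloud.py actually loads these values is not shown here; as a minimal stand-in, a standard-library-only parser (`load_env_file` is a hypothetical helper, not part of the project) could look like:

```python
import os
from pathlib import Path

def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict,
    skipping blank lines and # comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Example: read credentials and expose them to the GE Cloud client.
# creds = load_env_file("../config/.env")
# os.environ.update(creds)
```

In practice a library such as python-dotenv does the same job; the sketch only shows what "loading config/.env" amounts to.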

Step 2: Initialize GE Cloud Connection

cd day12b
python3 day12b_SETUP_cloud.py

Expected output:

================================================================================
DAY 12B - CONNECTING TO GREAT EXPECTATIONS CLOUD
================================================================================
📡 Connecting to GE Cloud (Org: abc123...)
✅ Successfully connected to GE Cloud!

📊 Setting up Cloud Datasource...
✅ Created datasource: day12b_security_logs_cloud
✅ Added data asset: security_events
📊 Loaded 1000 records from day12_security_events.csv

🔍 Verifying GE Cloud Setup...
✅ Found 1 datasource(s):
   - day12b_security_logs_cloud
✅ GE Cloud setup verified successfully!

================================================================================
SETUP COMPLETE
================================================================================
Next steps:
1. Run: python3 day12b_CREATE_expectations.py
2. View Data Docs at: https://app.greatexpectations.io
================================================================================

Step 3: Create Expectation Suite

python3 day12b_CREATE_expectations.py

This creates 8 native GE expectations.

Step 4: Run Validation

python3 day12b_RUN_validation_cloud.py
echo $?  # Check exit code: 0=pass, 1=fail, 2=error
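
The runner's internals aren't shown here, but the documented exit-code convention (0=pass, 1=fail, 2=error) can be sketched as a small wrapper around any validation callable:

```python
def exit_code_for(run_validation):
    """Map a validation callable onto the documented exit-code convention:
    0 = all expectations passed, 1 = at least one failed, 2 = runtime error
    (e.g. the GE Cloud connection dropped)."""
    try:
        success = run_validation()  # expected to return True/False
    except Exception:
        return 2
    return 0 if success else 1
```

Distinguishing "data failed validation" (1) from "validation never ran" (2) lets CI pipelines alert differently on each case.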

Step 5: View Results in GE Cloud

  1. Open https://app.greatexpectations.io
  2. Navigate to Data Docs
  3. View validation results with interactive UI
  4. Share reports with team members

GE Cloud Features Demonstrated

1. Cloud-Hosted Data Docs

Unlike Day 12A’s local HTML files, GE Cloud provides a hosted, shareable Data Docs dashboard with an interactive UI.

2. Native GE Expectations

Uses official GE expectations instead of custom code:

# Day 12A (custom)
validator.expect_column_values_to_not_be_null('event_id', threshold=0.02)

# Day 12B (native GE)
validator.expect_column_values_to_not_be_null(
    column="event_id",
    mostly=0.98,  # GE's parameter name
    meta={"severity": "critical"}
)
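
The `mostly` parameter passes an expectation when the observed fraction of conforming values meets the threshold. The arithmetic can be illustrated without GE itself; the sample values below are hypothetical, and the helper only mirrors the pass/fail rule, not GE's implementation:

```python
def passes_mostly(values, mostly):
    """Return True when the fraction of non-null values is >= mostly,
    mirroring how expect_column_values_to_not_be_null evaluates the
    `mostly` threshold."""
    non_null = sum(1 for v in values if v is not None)
    return non_null / len(values) >= mostly

sample = ["e1", "e2", None, "e4", "e5", "e6", "e7", "e8", "e9", "e10"]
# 9 of 10 values are non-null -> 0.9, so mostly=0.98 fails
print(passes_mostly(sample, mostly=0.98))  # False
print(passes_mostly(sample, mostly=0.90))  # True
```

This is why Day 12A's `threshold=0.02` (max 2% nulls) and Day 12B's `mostly=0.98` (min 98% non-null) express the same tolerance from opposite directions.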

3. Checkpoint Management

GE Cloud checkpoints bundle a data asset, an expectation suite, and follow-up actions into a reusable validation run that can be scheduled and monitored from the Cloud UI.

4. API-First Design

All operations are available via the API, so datasource setup, suite creation, and validation runs can be scripted and automated end to end.


File Structure

day12b/
├── day12b_CONFIG_ge_cloud.py         # GE Cloud connection config
├── day12b_SETUP_cloud.py             # Initialize GE Cloud datasource
├── day12b_CREATE_expectations.py     # Build native GE expectation suite
├── day12b_RUN_validation_cloud.py    # Run validation via Cloud
├── day12b_requirements.txt           # Dependencies (GE Cloud SDK)
├── .env.example                      # Configuration template
├── logs/                             # Local execution logs
│   └── validation_results_cloud_*.json
└── README_12B.md                     # This file

Note: Reuses synthetic data from Day 12A - no need to regenerate.


Validation Results

Same synthetic data as Day 12A, but results viewed in GE Cloud UI:

Expected Outcome: some expectations fail by design.

Why Failures are Expected: The synthetic data includes intentional quality issues to demonstrate that the validator catches real problems. In production, you’d:

  1. Fix data source issues
  2. Adjust thresholds based on acceptable limits
  3. Set up alerts for critical failures

Integration Examples

CI/CD Pipeline (GitHub Actions)

name: Data Quality Validation

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM
  workflow_dispatch:

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r day12b/day12b_requirements.txt

      - name: Run GE Cloud Validation
        env:
          DAY12B_GE_CLOUD_ORG_ID: ${{ secrets.DAY12B_GE_CLOUD_ORG_ID }}
          DAY12B_GE_CLOUD_ACCESS_TOKEN: ${{ secrets.DAY12B_GE_CLOUD_ACCESS_TOKEN }}
        run: |
          cd day12b
          python3 day12b_RUN_validation_cloud.py

Python Script Integration

from day12b.day12b_SETUP_cloud import day12b_get_cloud_context
from day12b.day12b_RUN_validation_cloud import day12b_run_validation

# Connect to GE Cloud
context = day12b_get_cloud_context()

# Run validation
results = day12b_run_validation(context)

# Check results
if results.success:
    print("✅ Data quality checks passed")
else:
    print("❌ Data quality issues detected")
    # Block pipeline, send alert, etc.

Airflow DAG

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def run_ge_cloud_validation():
    from day12b.day12b_RUN_validation_cloud import day12b_run_cloud_validation
    exit_code = day12b_run_cloud_validation()
    if exit_code != 0:
        raise ValueError("Data quality validation failed")

dag = DAG(
    'security_data_quality',
    start_date=datetime(2025, 1, 1),
    schedule_interval='0 6 * * *',  # Daily 6 AM
    catchup=False  # Don't backfill runs for past dates
)

validate_task = PythonOperator(
    task_id='validate_security_logs',
    python_callable=run_ge_cloud_validation,
    dag=dag
)

Comparison: When to Use Which

Use Day 12A (Custom Framework) When: you need zero external accounts, a self-hosted deployment, and full control over the validation logic in a single-developer, Git-only workflow.

Use Day 12B (GE Cloud) When: you need cloud-hosted Data Docs, multi-user collaboration with permissions, and a managed monitoring UI.

Use Both When: you want the managed service for day-to-day monitoring while keeping the custom framework as a dependency-free fallback.


Troubleshooting

“Could not connect to GE Cloud”

Check:

  1. Credentials in config/.env are correct
  2. Access token has “Data Context” permissions
  3. Organization ID matches your GE Cloud account
  4. Network allows HTTPS to app.greatexpectations.io

Test connection:

import great_expectations as gx
context = gx.get_context(
    mode="cloud",
    cloud_organization_id="your-org-id",
    cloud_access_token="your-token"
)
print(context.list_datasources())  # Should not error

“Datasource already exists”

Solution: The setup script handles this automatically. If you want to recreate:

context = day12b_get_cloud_context()
context.delete_datasource("day12b_security_logs_cloud")
# Then re-run day12b_SETUP_cloud.py

“Validation results show 0%”

Check:

  1. Data file exists at ../day12/data/day12_security_events.csv
  2. Run Day 12A data generator first: cd ../day12 && python3 day12_GENERATOR_synthetic_data.py
  3. Verify data loaded: context.list_datasources() should show assets
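
A quick standard-library check can confirm the CSV exists and contains data rows before re-running validation; `check_data_file` is a hypothetical helper, and the path is the one documented above:

```python
import csv
from pathlib import Path

def check_data_file(path):
    """Return the data-row count of the CSV at `path`, raising with a
    helpful message when the file is missing or empty."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"{p} not found - run the Day 12A generator first")
    with p.open(newline="") as f:
        rows = sum(1 for _ in csv.reader(f)) - 1  # subtract the header row
    if rows <= 0:
        raise ValueError(f"{p} contains no data rows")
    return rows

# Example:
# check_data_file("../day12/data/day12_security_events.csv")
```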

Next Steps

For Learning:

  1. Explore GE Cloud UI at https://app.greatexpectations.io
  2. Modify expectations in day12b_CREATE_expectations.py
  3. Try different data sources (SQL, S3, Snowflake)
  4. Set up Slack/email notifications

For Production:

  1. Replace synthetic data with real security logs
  2. Calibrate thresholds based on production patterns
  3. Set up scheduled checkpoints in GE Cloud
  4. Configure action-based workflows (alert on failure)
  5. Enable team access and permissions

For Portfolio:

  1. Take screenshots of GE Cloud Data Docs
  2. Record video walkthrough of validation process
  3. Write blog post comparing Day 12A vs 12B
  4. Add to resume: “Great Expectations Cloud - Enterprise data quality platform”

Resources


Created: December 12, 2025 | Version: 1.0 | Status: ✅ Ready for GE Cloud Setup