Enterprise data quality validation using GE Cloud for cybersecurity logs
Part of: Day 12: Cybersecurity Data Quality Framework
Day 12B demonstrates production-ready integration with Great Expectations Cloud, the enterprise SaaS platform for data quality validation. This implementation uses native GE expectations, cloud-hosted Data Docs, and team collaboration features.
While Day 12A shows you can build GE-style validation from scratch, Day 12B proves you can use the actual enterprise tool properly. This is important for:
| Feature | Day 12A (Custom) | Day 12B (GE Cloud) |
|---|---|---|
| Setup | No external account | Requires GE Cloud account |
| Data Docs | Local HTML files | Cloud-hosted dashboard |
| Expectations | Custom Python classes | Native GE expectations |
| Collaboration | Single developer | Multi-user with permissions |
| Versioning | Git only | GE Cloud + Git |
| Monitoring | Custom logging | GE Cloud monitoring UI |
| Deployment | Self-hosted | Managed cloud service |
python3 --version # 3.11+ recommended
pip install -r day12b_requirements.txt
Step 1: Configure GE Cloud Credentials
# Copy environment template
cp day12b/.env.example ../config/.env
# Edit config/.env and add your credentials:
DAY12B_GE_CLOUD_ORG_ID=your-org-id-here
DAY12B_GE_CLOUD_ACCESS_TOKEN=your-access-token-here
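Before running the setup script, it can help to confirm that both variables are filled in and no longer hold the template placeholders. The following is a minimal sketch of such a check, not part of the Day 12B scripts. The helper names `parse_env_text` and `missing_credentials` are hypothetical, and the `config/.env` path is an assumption that may need adjusting to wherever you copied the template.

```python
import os
from pathlib import Path

# Variable names and placeholder values are taken from the template above.
REQUIRED_VARS = ("DAY12B_GE_CLOUD_ORG_ID", "DAY12B_GE_CLOUD_ACCESS_TOKEN")
PLACEHOLDERS = {"your-org-id-here", "your-access-token-here"}


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(values: dict[str, str]) -> list[str]:
    """Return required variables that are absent, empty, or still placeholders."""
    return [
        name
        for name in REQUIRED_VARS
        if not values.get(name) or values[name] in PLACEHOLDERS
    ]


if __name__ == "__main__":
    env_path = Path("config/.env")  # assumed location; adjust if yours differs
    file_values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    # Variables already exported in the shell take precedence over the file.
    exported = {k: v for k, v in os.environ.items() if k in REQUIRED_VARS}
    problems = missing_credentials({**file_values, **exported})
    print("Missing or placeholder credentials:", problems or "none")
```

The parser is deliberately small and uses only the standard library; a package such as python-dotenv would be a reasonable substitute if it is already among your dependencies.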
Getting your GE Cloud credentials: both values come from your GE Cloud account (https://app.greatexpectations.io); save them in config/.env.
Step 2: Initialize GE Cloud Connection
cd day12b
python3 day12b_SETUP_cloud.py
Expected output:
================================================================================
DAY 12B - CONNECTING TO GREAT EXPECTATIONS CLOUD
================================================================================
📡 Connecting to GE Cloud (Org: abc123...)
✅ Successfully connected to GE Cloud!
📊 Setting up Cloud Datasource...
✅ Created datasource: day12b_security_logs_cloud
✅ Added data asset: security_events
📊 Loaded 1000 records from day12_security_events.csv
🔍 Verifying GE Cloud Setup...
✅ Found 1 datasource(s):
- day12b_security_logs_cloud
✅ GE Cloud setup verified successfully!
================================================================================
SETUP COMPLETE
================================================================================
Next steps:
1. Run: python3 day12b_CREATE_expectations.py
2. View Data Docs at: https://app.greatexpectations.io
================================================================================
Step 3: Create Expectation Suite
python3 day12b_CREATE_expectations.py
This creates 8 native GE expectations:
- expect_table_row_count_to_be_between - Reasonable event count
- expect_table_columns_to_match_set - Required columns present
- expect_column_values_to_not_be_null - Completeness checks
- expect_column_values_to_be_in_set - Categorical validation
- expect_column_values_to_be_between - Risk score bounds
- expect_column_values_to_match_regex - PII anonymization check
- expect_column_values_to_be_of_type - Type validation
Step 4: Run Validation
python3 day12b_RUN_validation_cloud.py
echo $? # Check exit code: 0=pass, 1=fail, 2=error
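The 0/1/2 exit-code convention separates "the data failed validation" from "validation could not run at all". The sketch below shows one way a script could produce those codes. It is an illustration under assumptions, not the code in day12b_RUN_validation_cloud.py, and `exit_code_for` is a hypothetical helper.

```python
import sys
from typing import Callable

EXIT_PASS, EXIT_FAIL, EXIT_ERROR = 0, 1, 2


def exit_code_for(run_validation: Callable[[], bool]) -> int:
    """Map a validation callable onto the 0=pass, 1=fail, 2=error convention."""
    try:
        passed = run_validation()
    except Exception as exc:  # validation could not run (credentials, network, ...)
        print(f"Validation error: {exc}", file=sys.stderr)
        return EXIT_ERROR
    return EXIT_PASS if passed else EXIT_FAIL
```

A script would typically end with something like `sys.exit(exit_code_for(my_check))`, where `my_check` is whatever callable returns the overall pass/fail flag. Keeping code 2 distinct lets a scheduler retry on infrastructure errors without treating them as data quality failures.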
Step 5: View Results in GE Cloud
Unlike Day 12A’s local HTML, GE Cloud provides:
Uses official GE expectations instead of custom code:
# Day 12A (custom)
validator.expect_column_values_to_not_be_null('event_id', threshold=0.02)
# Day 12B (native GE)
validator.expect_column_values_to_not_be_null(
    column="event_id",
    mostly=0.98,  # GE's parameter name
    meta={"severity": "critical"}
)
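The two calls above express the same tolerance from opposite directions: the Day 12A `threshold=0.02` allows up to 2% nulls, while `mostly=0.98` requires at least 98% non-null values. The plain-Python sketch below illustrates that relationship. It is not GE's implementation, both function names are hypothetical, and treating an empty column as passing is an assumption made here for simplicity.

```python
def threshold_to_mostly(threshold: float) -> float:
    """Convert a maximum allowed failure fraction into a minimum pass fraction."""
    return 1.0 - threshold


def passes_not_null(values: list, mostly: float = 1.0) -> bool:
    """Return True if at least `mostly` of the values are non-null."""
    if not values:
        return True  # assumption: an empty column passes vacuously
    non_null = sum(value is not None for value in values)
    return non_null / len(values) >= mostly


# 100 event IDs with 2 nulls: exactly 98% non-null, so the check passes.
event_ids = ["evt"] * 98 + [None] * 2
print(passes_not_null(event_ids, mostly=threshold_to_mostly(0.02)))
```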
GE Cloud checkpoints enable:
All operations available via API for automation:
day12b/
├── day12b_CONFIG_ge_cloud.py # GE Cloud connection config
├── day12b_SETUP_cloud.py # Initialize GE Cloud datasource
├── day12b_CREATE_expectations.py # Build native GE expectation suite
├── day12b_RUN_validation_cloud.py # Run validation via Cloud
├── day12b_requirements.txt # Dependencies (GE Cloud SDK)
├── .env.example # Configuration template
├── logs/ # Local execution logs
│ └── validation_results_cloud_*.json
└── README_12B.md # This file
Note: Reuses synthetic data from Day 12A - no need to regenerate.
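The logs/ directory in the layout above collects one validation_results_cloud_*.json file per run. A small helper for picking up the newest one might look like the sketch below. Only the filename pattern comes from the layout; the function names are hypothetical, and nothing is assumed about the JSON contents beyond it being valid JSON.

```python
import json
from pathlib import Path


def latest_result_file(log_dir):
    """Return the most recently modified validation_results_cloud_*.json, or None."""
    candidates = sorted(
        Path(log_dir).glob("validation_results_cloud_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def load_latest_result(log_dir="logs"):
    """Load the newest result file as parsed JSON, or None if there is none."""
    path = latest_result_file(log_dir)
    if path is None:
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Sorting by modification time rather than by filename avoids depending on how the timestamp in the filename is formatted.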
Same synthetic data as Day 12A, but results viewed in GE Cloud UI:
Expected Outcome:
Why Failures are Expected: The synthetic data includes intentional quality issues to demonstrate the validator catches real problems. In production, you’d:
name: Data Quality Validation
on:
  schedule:
    - cron: '0 6 * * *' # Daily at 6 AM
  workflow_dispatch:
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r day12b/day12b_requirements.txt
      - name: Run GE Cloud Validation
        env:
          # Secret names are assumed to mirror the environment variable names
          DAY12B_GE_CLOUD_ORG_ID: ${{ secrets.DAY12B_GE_CLOUD_ORG_ID }}
          DAY12B_GE_CLOUD_ACCESS_TOKEN: ${{ secrets.DAY12B_GE_CLOUD_ACCESS_TOKEN }}
        run: |
          cd day12b
          python3 day12b_RUN_validation_cloud.py
from day12b.day12b_SETUP_cloud import day12b_get_cloud_context
from day12b.day12b_RUN_validation_cloud import day12b_run_validation
# Connect to GE Cloud
context = day12b_get_cloud_context()
# Run validation
results = day12b_run_validation(context)
# Check results
if results.success:
    print("✅ Data quality checks passed")
else:
    print("❌ Data quality issues detected")
    # Block pipeline, send alert, etc.
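When the checks fail, an alert is more useful if it names the expectations that failed. The sketch below assumes the validation result has already been converted to a plain dictionary with a top-level `results` list, where each entry carries a `success` flag and an `expectation_config`. That shape matches GE's JSON output as far as I know, but the exact keys vary between GE versions (`expectation_type` in older releases, `type` in newer ones), so treat the field names as assumptions. `failed_expectations` is a hypothetical helper.

```python
def failed_expectations(result: dict) -> list[str]:
    """List failed expectations as 'name(column)' strings for alerting."""
    failures = []
    for item in result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config") or {}
        # Key name differs between GE versions; fall back to 'unknown'.
        name = config.get("expectation_type") or config.get("type") or "unknown"
        column = (config.get("kwargs") or {}).get("column")
        failures.append(f"{name}({column})" if column else name)
    return failures
```

The returned list can be joined into the body of a Slack message or an exception raised to block the pipeline.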
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def run_ge_cloud_validation():
    from day12b.day12b_RUN_validation_cloud import day12b_run_cloud_validation
    exit_code = day12b_run_cloud_validation()
    if exit_code != 0:
        raise ValueError("Data quality validation failed")

dag = DAG(
    'security_data_quality',
    start_date=datetime(2025, 1, 1),
    schedule_interval='0 6 * * *'  # Daily 6 AM
)

validate_task = PythonOperator(
    task_id='validate_security_logs',
    python_callable=run_ge_cloud_validation,
    dag=dag
)
Check:
- The credentials in config/.env are correct
- Your account is accessible at app.greatexpectations.io
Test connection:
import great_expectations as gx
context = gx.get_context(
    mode="cloud",
    cloud_organization_id="your-org-id",
    cloud_access_token="your-token"
)
print(context.list_datasources()) # Should not error
Solution: The setup script handles this automatically. If you want to recreate:
context = day12b_get_cloud_context()
context.delete_datasource("day12b_security_logs_cloud")
# Then re-run day12b_SETUP_cloud.py
Check:
- The data file exists at ../day12/data/day12_security_events.csv
- If it is missing, regenerate it: cd ../day12 && python3 day12_GENERATOR_synthetic_data.py
- context.list_datasources() should show assets
- Re-run day12b_CREATE_expectations.py
Created: December 12, 2025 Version: 1.0 Status: ✅ Ready for GE Cloud Setup