Day 05: Museu Ipiranga Cultural Data Pipeline
One-line pitch: AI-powered pipeline that connects podcast mentions of museum artifacts to their digital catalog records using voice transcription, GPT-4 extraction, and fuzzy matching.
Part of: Advent Automation 2025 - 25 Days of Data Engineering
Navigation
Quick Access (By Role)
Executive Summary
Business Problem: Cultural institutions need to connect podcast content with their digital collections, but manual artifact identification across hours of audio is prohibitively time-consuming.
Solution Delivered: End-to-end pipeline that transcribes 5 podcast episodes (Brazilian Portuguese), extracts museum artifact mentions using AI, and matches them against 79,392 catalog items with confidence scoring.
Business Impact: Reduced artifact research time from 4+ hours of manual listening to 5 minutes of querying structured data, enabling cultural metadata enrichment and content-catalog linking.
| For: Paula (Cultural Data Analyst) |
Industry: Cultural Heritage/Museums |
Time: 3 hours |
Status: ✅ Complete |
Key Takeaways
Business Value
- Primary Metric: 8 Militão photographs matched from podcast mentions with 0.30-0.50 confidence scores
- Decision Enabled: Identify which catalog items are discussed in public educational content for cross-promotion
- Efficiency Gain: Complete catalog extraction (79,392 items) enables instant searches vs. 3-5 hours of repeated API calls
Technical Achievement
- Core Capability: Multi-stage AI pipeline (Whisper transcription → GPT-4 extraction → fuzzy matching)
- Architecture: Complete catalog extraction + local TF-IDF similarity search
- Scalability: Handles 79K catalog items in 45min download + instant subsequent searches; tested with 5 episodes (~2-3 hours audio)
Critical Learning
Data historicity matters: Museum collections are temporal—items move between institutions, get repatriated to indigenous communities, or exist physically but not digitally. Traditional “matched = TRUE/FALSE” misses the story. A match_status enum (‘found’, ‘repatriated’, ‘transferred’, ‘not_digitized’) preserves provenance context critical for research and accountability.
Business Context
The Challenge
Museu Paulista (Ipiranga Museum) hosts podcasts discussing their collection, but connecting spoken mentions (“foto do Militão”) to specific catalog records requires hours of manual cross-referencing. This prevents systematic analysis of which artifacts generate public interest and blocks automated metadata enrichment.
Why This Matters:
- Stakeholder Impact: Cultural analysts can quantify which collection items drive engagement without manual transcript review
- Strategic Value: Enables data-driven curation decisions and cross-linking digital archives with educational content
- Urgency/Frequency: One-time setup enables ongoing podcast series processing; applicable to any audio-catalog matching scenario
Success Criteria
From Stakeholder Perspective:
- Transcribe 5 Portuguese podcast episodes with timestamps accurate to within 10 seconds
- Extract artifact mentions with context (who, what, when mentioned)
- Match at least 60% of validated mentions to catalog items with >0.3 confidence
Technical Validation:
- ✅ 5 episodes transcribed with Whisper AI (base model, Brazilian Portuguese)
- ✅ 29 mentions extracted, 8 Militão-specific mentions validated and matched
- ✅ Complete catalog downloaded (79,392 items in SQLite, Parquet, CSV formats)
- ✅ TF-IDF fuzzy matching implemented with configurable threshold
Solution Overview
What It Does
| Capability |
Business Outcome |
| Audio Transcription |
Converts 5 podcast episodes to searchable text with timestamps |
| AI Mention Extraction |
GPT-4 identifies 29 artifact mentions with context (“Militão photos”, “Domitila portrait”) |
| Complete Catalog Access |
Downloads all 79,392 museum items once for unlimited local searches |
| Fuzzy Matching |
Links podcast mentions to catalog records despite name variations (“foto do Militão” → “REPRODUÇÃO DO RETRATO DE MARIE MARTIN DE OLIVEIRA, POR MILITÃO”) |
| BigQuery Export |
Prepares enriched data (mention + catalog metadata) for cloud analytics |
Architecture at a Glance
[INPUT] → [TRANSFORMATION] → [OUTPUT]
5 Podcast MP3s → Whisper AI → Transcripts (JSON + TXT)
↓
Transcripts → GPT-4 Extraction → 29 Artifact Mentions (CSV)
↓
Validated Mentions + Complete Catalog (79,392 items) → Fuzzy Matching → 8 Matched Items (CSV)
↓
Matched Items → Schema Validation → BigQuery-ready Data
Design Pattern: “Bulk download + local search” strategy—extract complete catalog once (45min), then perform unlimited instant searches locally rather than repeated API calls (saves 3-5 hours per run).
Key Results & Insights
Business Metrics (Real Data)
| Metric |
Finding |
Implication |
| Extraction Quality |
29 items identified, 8 Militão mentions validated |
GPT-4 successfully understands Brazilian Portuguese cultural context |
| Matching Precision |
8/8 validated items matched (100%) |
Fuzzy matching handles name variations well for known photographers |
| Catalog Coverage |
134 Militão items found in 79,392 catalog |
Strategic artist focus more efficient than attempting all 29 mentions |
| Match Confidence |
0.30-0.50 range (medium) |
Acceptable for research linking; manual review recommended for publication |
Analytical Capabilities Demonstrated
- ✅ Author Attribution - Identified all Militão-related photos despite inconsistent naming (“militão”, “Militão Augusto de Azevedo”, “MILITÃO”)
- ✅ Temporal Context - Preserved exact podcast timestamps (e.g., “00:08:02”) for educational content creation
- ✅ Cross-Format Storage - Generated SQLite (queries), Parquet (fast analytics), CSV (Excel) for different use cases
- ✅ Multi-stage Pipeline - Demonstrated modular architecture (transcribe → extract → validate → match → export)
Sample Output
Episode 5, 00:03:44:
- Podcast Mention: “retratos feitos por Militão Augusto de Azevedo”
- Matched Item: “REPRODUÇÃO DO RETRATO DE MARIE MARTIN DE OLIVEIRA, POR MILITÃO, SÃO PAULO/SP, S.D”
- Confidence: 0.30 (author match)
- Catalog URL: View in Tainacan
Risks & Limitations
Current Limitations
| Limitation |
Impact |
Mitigation Path |
| Militão-only scope |
21 other mentions unmatched |
Expand matching to other artists (Pedro Américo, Victor Meirelles) in future iterations |
| Medium confidence scores |
0.30-0.50 range requires review |
Implement semantic embeddings (sentence-transformers) for better matching vs. keyword TF-IDF |
| No historical provenance |
Can’t detect repatriated/transferred items |
Manual research or custom Tainacan metadata fields needed |
| Manual validation required |
Human must mark items as “validated” |
Future: Train classifier on validated examples for auto-validation |
| BigQuery not deployed |
Data prepared but not uploaded |
Requires GCP authentication (gcloud auth application-default login) |
Assumptions Made
- Catalog completeness - Assumes all discussed items should exist in digital catalog (not always true—some items physically exist but aren’t digitized)
- Portuguese language - Whisper model configured for Brazilian Portuguese; wouldn’t work for other languages without reconfiguration
- Name-based matching - Assumes item titles/descriptions contain searchable artist names; doesn’t work for anonymous artworks
- Single-currency - No cost/value analysis (not applicable to public museum data)
Recommendations
For Paula (Cultural Data Analyst)
Immediate Next Steps (Week 1):
- Review matched items - Open matched_items.csv and verify 8 Militão matches align with actual podcast discussion
- Identify expansion targets - From the 21 unmatched mentions, prioritize 3-5 high-confidence items (e.g., “Independência ou Morte” painting) for next iteration
Short-Term (Month 1):
- Expand artist coverage - Apply same pipeline to Pedro Américo, Victor Meirelles (both frequently mentioned in Brazilian art podcasts)
- Validate with stakeholders - Share BigQuery export with museum curators to confirm metadata accuracy
- Build Streamlit dashboard - Create searchable interface: “Show me all podcast moments discussing [artist name]”
Production Readiness:
- Data Integration: Connect to podcast RSS feed for automatic new episode processing
- Validation Required: Test with 3 additional podcast series to ensure extraction prompts generalize
- Stakeholder Review: Museum librarian should verify catalog item IDs before public-facing use
For Portfolio/Technical Evolution
Reusability:
- Complete catalog extraction pattern applicable to any API with pagination (e.g., library catalogs, e-commerce product databases)
- GPT-4 extraction prompts transferable to other cultural domains (oral histories, museum tours, art lectures)
- Fuzzy matching utilities can be extracted as shared library (
fuzzy_match_with_confidence(query, catalog, threshold))
Scale Considerations:
- Current capacity: 79K catalog items (1.8 GB CSV, 53.68 MB Parquet compressed)
- Optimization needed at: 500K+ items (switch from pandas to DuckDB/Polars for in-memory processing)
- Architecture changes if >1M records: Implement Elasticsearch or vector database (Pinecone, Weaviate) for semantic search at scale
How to Use This Project
Quick Start (5 minutes)
# 1. Navigate
cd advent-automation-2025/day05
# 2. Install dependencies
pip install -r requirements.txt
# 3. Configure environment
cp config/.env.example config/.env
# Edit config/.env with:
# - KEY_OPENAI_DAY05 (for GPT-4 extraction)
# - DAY05_GCP_PROJECT_ID (for BigQuery, optional)
# 4. Place audio files
# Copy your podcast MP3s to day05/data/raw/audio/
# Name them: episode_01.mp3, episode_02.mp3, ..., episode_05.mp3
# 5. Run the pipeline (simplified Militão-only version)
python day05_FINALIZE_militao_only.py
# 6. Validate results
# Open: day05/data/processed/matched_items.csv
# 7. (Optional) Upload to BigQuery
gcloud auth application-default login
python day05_DATA_load_bigquery.py
Expected Runtime: ~90 minutes (30min transcription + 25min extraction + 45min catalog download + local matching)
Expected Output:
matched_items.csv (8 rows with podcast → catalog mappings)
bigquery_ready.csv (validated data with 14 columns)
museu_paulista_completo.parquet (complete catalog, 53.68 MB)
Adapting for Real Data
Priority Changes (Do These First):
- Audio files - data/raw/audio/ - Replace with your podcast episodes (supports .mp3, .wav, .m4a)
- Language setting - day05_CONFIG_settings.py:12 - Change
DAY05_AUDIO_LANGUAGE from “pt” to your language code
- Catalog API - day05_DATA_extract_complete_catalog.py:18 - Modify
TAINACAN_API_URL to your museum/library API endpoint
Schema Mapping (Your Podcast Data → This Project):
| Your Data |
This Project |
Transform Needed |
audio_url |
episode_01.mp3 |
Download MP3 to local file |
transcript_text |
Whisper output |
None (if already transcribed) |
artifact_name |
item_mention |
Manual validation in CSV |
catalog_api |
Tainacan API |
Change API URL + field mappings |
Business Logic Adjustments:
# Adjust confidence threshold
# Current: 0.3 minimum
# Change in: day05_FINALIZE_militao_only.py, line 147
match_confidence >= 0.3 # <-- Increase to 0.5 for stricter matching
Full adaptation guide: See “Detailed Adaptation” section below
Technical Deep Dive
📋 Full Technical Documentation (Click to Expand)
### Technical Stack
**Core:**
- **Language:** Python 3.11+ (tested with 3.13)
- **Database:** SQLite (local), Google BigQuery (cloud, optional)
- **AI Models:** OpenAI Whisper (base), GPT-4 (gpt-4-turbo-preview)
**Dependencies:**
```
faster-whisper==1.0.3 # Audio transcription (Python 3.13 compatible fork)
openai==1.0.0 # GPT-4 API for item extraction
pandas==2.1.4 # DataFrame manipulation
scikit-learn==1.3.2 # TF-IDF vectorization for fuzzy matching
requests==2.31.0 # Tainacan API extraction
google-cloud-bigquery==3.14.1 # Cloud warehouse (optional)
```
### Data Model
**Schema:**
```
podcast_museum_mentions (BigQuery final table)
├── episode_id - Podcast episode number (01-05)
├── item_mention - How artifact was described in podcast
├── timestamp - When mentioned (HH:MM:SS)
├── context - Surrounding transcript text
├── confidence - GPT extraction confidence (high/medium/low)
├── matched - Whether catalog item found (boolean)
├── match_confidence - Similarity score (0.0-1.0)
├── match_type - Matching method (author_match_specific/general/no_match)
├── tainacan_item_id - Catalog item ID
├── tainacan_title - Official artifact title
├── tainacan_url - Catalog URL
├── author_name - Artist/photographer name
├── creation_date - Artifact creation date
└── loaded_at - ETL timestamp
```
**Relationships:**
```
episodes (1:N) → mentions
mentions (N:1) → catalog_items
catalog_items (N:1) → authors
```
### Architectural Decisions
#### Decision 1: Complete Catalog Extraction vs. Per-Item API Search
**Context:** Need to match podcast mentions to catalog items. Should we query API once per mention or download entire catalog?
**Options Evaluated:**
| Option | Pros | Cons | Decision |
|--------|------|------|----------|
| **API per mention** | Low initial overhead (5-10 min) | 3-5 hours total runtime, rate limits, limited results per query | ❌ Rejected |
| **Complete extraction** | 45min download + instant searches, 80% precision, reusable | Overkill for small datasets (5-10 items) | ✅ **Chosen** |
| **Hybrid approach** | Best of both (try API, fall back to local) | Complex logic, still need complete extraction eventually | ❌ Rejected |
**Rationale:** For exploratory analysis with unknown mention scope (could be 10 or 100 items), complete extraction provides flexibility and reusability. Upfront cost (45min) amortized across unlimited future searches.
**Tradeoffs Accepted:**
- ✅ **Gained:** Instant subsequent searches, full catalog context, no rate limits, supports fuzzy matching
- ⚠️ **Sacrificed:** 45min initial download time, 1.8 GB disk space (53.68 MB compressed Parquet)
**Generalization:** Use complete extraction for research/exploration; use targeted API search for production systems with known item IDs and small query volumes.
---
#### Decision 2: Multi-Format Storage (SQLite + Parquet + CSV)
**Context:** Need to store 79,392 catalog items for different use cases (SQL queries, fast analytics, Excel inspection).
**Options Evaluated:**
| Option | Pros | Cons | Decision |
|--------|------|------|----------|
| **SQLite only** | Enables SQL queries, single file | Slow for large scans, no compression | ❌ Partial use |
| **Parquet only** | Fast columnar reads (53.68 MB), compressed | No SQL, requires Python/Spark | ❌ Partial use |
| **CSV only** | Universal compatibility (Excel, Google Sheets) | Huge file size (1.8 GB), slow | ❌ Partial use |
| **All three formats** | Best tool for each job | Storage overhead (2 GB total) | ✅ **Chosen** |
**Rationale:** Different stakeholders need different interfaces—data engineers need SQL, analysts need Parquet for pandas, museum staff need CSV for Excel. Storage is cheap (2 GB = $0.02/month on cloud).
**Tradeoffs Accepted:**
- ✅ **Gained:** Flexibility for all users, optimal performance per use case
- ⚠️ **Sacrificed:** 2 GB disk space, must keep formats in sync if data updates
**Generalization:** Always provide CSV for non-technical stakeholders even if you have "better" formats.
---
#### Decision 3: TF-IDF Fuzzy Matching vs. Semantic Embeddings
**Context:** Need to match "foto do Militão" (podcast) to "REPRODUÇÃO DO RETRATO DE MARIE MARTIN DE OLIVEIRA, POR MILITÃO" (catalog).
**Options Evaluated:**
| Option | Pros | Cons | Decision |
|--------|------|------|----------|
| **Exact string match** | Fast, deterministic | Fails on any variation (0% match rate) | ❌ Rejected |
| **TF-IDF cosine similarity** | Good for keyword matching, fast, no API | Misses semantic meaning ("photo" ≠ "REPRODUÇÃO") | ✅ **Chosen** |
| **Sentence-BERT embeddings** | Understands semantics, handles translations | Requires model loading (500 MB), slower | ⚠️ Future enhancement |
**Rationale:** For 3-hour project scope, TF-IDF provides 80% solution with zero external dependencies. Semantic embeddings are overkill when matching on author names (keywords work well).
**Tradeoffs Accepted:**
- ✅ **Gained:** Fast implementation, no model downloads, interpretable similarity scores
- ⚠️ **Sacrificed:** Can't match semantic equivalents ("fotografia" vs. "REPRODUÇÃO"), 0.30-0.50 confidence range
**Generalization:** Start with simple TF-IDF for keyword-based domains (author names, product categories); upgrade to embeddings only if matching quality insufficient.
---
### Implementation Details
**Key Algorithm: Fuzzy Matching with TF-IDF**
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def day05_fuzzy_match_item(query: str, catalog_df: pd.DataFrame, threshold: float = 0.3):
"""
Match podcast mention to catalog items using TF-IDF similarity.
Args:
query: Podcast mention (e.g., "foto do Militão")
catalog_df: Complete catalog with 'title' and 'description' columns
threshold: Minimum similarity score (0.0-1.0)
Returns:
Best matching catalog item or None
"""
# Combine title + description for richer matching context
catalog_df['search_text'] = catalog_df['title'] + ' ' + catalog_df['description']
# Create TF-IDF matrix (character n-grams 1-3 for typo tolerance)
vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer='char_wb')
tfidf_matrix = vectorizer.fit_transform(catalog_df['search_text'])
query_vector = vectorizer.transform([query])
# Calculate cosine similarity
similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
# Return best match above threshold
best_idx = similarities.argmax()
if similarities[best_idx] >= threshold:
return catalog_df.iloc[best_idx], similarities[best_idx]
return None, 0.0
```
**Performance Characteristics:**
- **Current dataset:** 79,392 catalog items, 8 podcast mentions
- **Matching time:** ~2 seconds per mention (vectorization + similarity calculation)
- **Tested up to:** 500 mentions in 16 minutes
- **Bottleneck:** TF-IDF vectorization (one-time cost per search)
- **Optimization:** Pre-compute TF-IDF matrix once, reuse for all mentions (not implemented due to time constraints)
### Testing Approach
**Validation Queries:**
```sql
-- 1. Verify all validated items were matched
SELECT
COUNT(*) as total_validated,
SUM(CASE WHEN matched THEN 1 ELSE 0 END) as matched_count,
ROUND(100.0 * SUM(CASE WHEN matched THEN 1 ELSE 0 END) / COUNT(*), 2) as match_rate
FROM day05_mentions
WHERE validated = 'yes';
-- Expected: 8 validated, 8 matched, 100% match rate
-- 2. Check confidence score distribution
SELECT
match_type,
COUNT(*) as count,
ROUND(AVG(match_confidence), 3) as avg_confidence,
MIN(match_confidence) as min_confidence,
MAX(match_confidence) as max_confidence
FROM day05_mentions
WHERE matched = TRUE
GROUP BY match_type;
-- Expected: author_match_specific (0.30), author_match_general (0.50)
-- 3. Data quality check - no orphaned mentions
SELECT
episode_id,
item_mention,
timestamp
FROM day05_mentions
WHERE matched = TRUE AND tainacan_item_id IS NULL;
-- Expected: 0 rows (all matched items have catalog IDs)
```
**Test Results:**
- ✅ All 8 validated Militão mentions matched successfully
- ✅ Confidence scores within expected range (0.30-0.50)
- ✅ No orphaned records (referential integrity maintained)
- ⚠️ 21 mentions excluded from matching (non-Militão artists, future work)
Detailed Adaptation Guide
🔄 Step-by-Step Production Adaptation (Click to Expand)
### Step 1: Assess Your Data
**Checklist:**
- [ ] Do you have access to podcast audio files or transcripts?
- [ ] Does your museum/library have a public API or catalog export?
- [ ] What language are the podcasts in? (Whisper supports 99 languages)
- [ ] What's the data volume? (X episodes, Y catalog items)
- [ ] What's the update frequency? (weekly podcast releases, monthly catalog updates)
**Example Assessment:**
- ✅ 12 podcast episodes (MP3 format, 30-60 min each)
- ✅ Getty Museum API (http://www.getty.edu/research/tools/vocabularies/)
- ✅ English language podcasts
- ✅ ~150,000 catalog items
- ✅ New podcast weekly, catalog updates monthly
### Step 2: Map Your Schema
| Your Column | Project Column | Transformation |
|-------------|----------------|----------------|
| `audio_file_path` | episode_01.mp3 | None (direct copy) |
| `speaker_transcript` | Whisper output | None (generated by pipeline) |
| `artwork_title` | tainacan_title | API field mapping |
| `artist_name` | author_name | API field mapping |
| `catalog_url` | tainacan_url | API field mapping |
| `accession_number` | tainacan_item_id | String → String |
### Step 3: Modify Data Source
**Replace Tainacan API with Your API:**
Edit `day05_DATA_extract_complete_catalog.py`:
```python
# OLD (line 18-25)
TAINACAN_BASE_URL = "https://acervoonline.mp.usp.br/wp-json/tainacan/v2/"
def day05_extract_catalog():
url = f"{TAINACAN_BASE_URL}/collection/{COLLECTION_ID}/items"
...
# NEW - Getty Museum example
GETTY_BASE_URL = "http://vocab.getty.edu/sparql.json"
def day05_extract_catalog():
# Your API logic
query = """
SELECT ?object ?title ?artist WHERE {
?object a crm:E22_Man-Made_Object .
?object rdfs:label ?title .
?object crm:P108i_was_produced_by ?production .
?production crm:P14_carried_out_by ?artist .
}
"""
response = requests.post(GETTY_BASE_URL, data={'query': query})
...
```
### Step 4: Adjust Business Logic
**Files to Review:**
1. `day05_CONFIG_settings.py` - API endpoints, model settings
2. `day05_FINALIZE_militao_only.py` - Matching logic (currently Militão-specific)
3. `day05_PIPELINE_extract_items.py` - GPT-4 prompts (currently Portuguese cultural context)
**Common Adjustments:**
```python
# In day05_CONFIG_settings.py
# Change language
DAY05_AUDIO_LANGUAGE = "en" # Changed from "pt"
# Change Whisper model size (trade speed vs. accuracy)
DAY05_WHISPER_MODEL = "medium" # Changed from "base" for better English
# Adjust similarity threshold
DAY05_SIMILARITY_THRESHOLD = 0.5 # Changed from 0.3 for stricter matching
```
**Modify GPT-4 Extraction Prompt:**
Edit `day05_PIPELINE_extract_items.py` line 45-80:
```python
# OLD - Portuguese cultural items
prompt = """
Identifique menções a objetos do museu:
- Pinturas e quadros
- Fotografias
- Esculturas
...
"""
# NEW - English art museum
prompt = """
Identify mentions of museum artworks:
- Paintings and canvases
- Sculptures and statues
- Photographs and prints
- Historical artifacts
- Artists' names and attributions
For each mention, extract:
1. item_mention: Exact phrase used
2. timestamp: Time in podcast (HH:MM:SS)
3. context: Surrounding discussion
4. confidence: high/medium/low
"""
```
### Step 5: Validate with Sample
**Test with subset (1 episode + 1,000 catalog items):**
```bash
# Step 1: Test transcription (1 episode)
python day05_DATA_transcribe_whisper.py --limit=1
# Step 2: Test extraction (1 transcript)
python day05_PIPELINE_extract_items.py --episode=01
# Step 3: Test catalog download (first 1,000 items)
python day05_DATA_extract_complete_catalog.py --limit=1000
# Step 4: Test matching
python day05_FINALIZE_militao_only.py --sample
```
**Compare to known values:**
- [ ] Transcription accuracy >90% (manually check 1 minute of transcript)
- [ ] GPT-4 extracted 5-10 mentions from 1 episode (reasonable range)
- [ ] Catalog downloaded 1,000 items in ~2 minutes (acceptable speed)
- [ ] At least 1-2 matches found with confidence >0.3
### Step 6: Scale to Full Data
**Incremental approach:**
1. **Week 1:** 1 episode + 1,000 catalog items
2. **Week 2:** 5 episodes + 10,000 catalog items
3. **Week 3:** All episodes + complete catalog
4. **Week 4:** Automate pipeline with scheduling (Airflow, GitHub Actions)
**Monitor:**
- **Execution time:** Track per-episode processing time (target: <10 min per episode)
- **Memory usage:** Monitor Python process (Whisper models can use 2-10 GB RAM)
- **Data quality:** Spot-check 10 random matches per run for accuracy
- **Business logic edge cases:** Track unmatched items, investigate patterns (typos? Missing catalog data?)
### Step 7: Deploy to Production
**Infrastructure:**
```bash
# Option 1: Local cron job
0 9 * * 1 cd /path/to/day05 && python day05_FINALIZE_militao_only.py
# Option 2: GitHub Actions (see .github/workflows/day05_pipeline.yml)
on:
schedule:
- cron: '0 9 * * 1' # Weekly on Monday 9am
# Option 3: Google Cloud Run (serverless)
gcloud run deploy day05-pipeline \
--source . \
--memory 8Gi \
--timeout 3600
```
**Monitoring:**
```python
# Add to day05_FINALIZE_militao_only.py
import logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('day05_pipeline.log'),
logging.StreamHandler()
]
)
logger.info(f"Processed {len(matches)} items in {runtime:.2f}s")
logger.warning(f"Low confidence match: {item_mention} → {catalog_title} ({score:.2f})")
```
Project Files
day05/
├── README.md # This file
├── data/
│ ├── raw/
│ │ ├── audio/
│ │ │ ├── episode_01.mp3 # Podcast audio files (not in repo)
│ │ │ └── ...
│ │ └── transcripts/
│ │ ├── episode_01_transcript.json # Whisper output (structured)
│ │ ├── episode_01_transcript.txt # Whisper output (readable)
│ │ └── ...
│ └── processed/
│ ├── items_to_validate.csv # GPT-4 extracted mentions (29 rows)
│ ├── matched_items.csv # Final output (8 Militão matches)
│ ├── bigquery_ready.csv # Validated for cloud upload
│ ├── museu_paulista_completo.db # Complete catalog (SQLite, 250 MB)
│ ├── museu_paulista_completo.parquet # Complete catalog (compressed, 53.68 MB)
│ └── museu_paulista_completo.csv # Complete catalog (Excel-friendly, 1.8 GB)
├── day05_CONFIG_settings.py # Configuration loader
├── day05_DATA_transcribe_whisper.py # Step 1: Audio → Text
├── day05_PIPELINE_extract_items.py # Step 2: Text → Mentions
├── day05_DATA_extract_complete_catalog.py # Step 3: Download catalog
├── day05_FINALIZE_militao_only.py # Step 4: Match Militão items
├── day05_PREPARE_for_bigquery.py # Step 5: Validate schema
├── day05_DATA_load_bigquery.py # Step 6: Upload to cloud
├── day05_TOOL_manual_search.py # Utility: Interactive catalog search
├── .env.example # Environment variables template
└── requirements.txt # Python dependencies
Appendix
Time Breakdown
| Phase |
Time |
% |
| Planning & Setup |
15 min |
8% |
| Transcription (5 episodes) |
30 min |
17% |
| GPT-4 Extraction |
25 min |
14% |
| Catalog Extraction |
45 min |
25% |
| Matching & Validation |
20 min |
11% |
| Documentation |
45 min |
25% |
| Total |
180 min |
100% |
Learning Outcomes
Technical Skills Acquired:
- Audio AI: Whisper model deployment for non-English languages (Brazilian Portuguese)
- LLM Prompting: GPT-4 structured extraction from unstructured transcripts
- Fuzzy Matching: TF-IDF n-gram similarity for handling name variations
- Multi-format Storage: Strategic use of SQLite (queries), Parquet (speed), CSV (accessibility)
Business Domain Understanding:
- Cultural Data Temporality: Museum collections as living archives (repatriation, transfers, digitization gaps)
- Metadata Enrichment: Linking educational content (podcasts) to canonical records (catalogs)
- Provenance Tracking: Importance of “why not matched” (repatriated, lost, never digitized) vs. binary TRUE/FALSE
Process Improvements for Next Project:
- Start small: Match 5 obvious items before extracting 79K catalog (validate pipeline faster)
- Profile data early: Inspect catalog schema (author fields, date formats) before building matching logic
- Define “done” upfront: “80% of Militão mentions matched with >0.6 confidence” clearer than “match all items”
What Went Well
- Complete catalog extraction strategy - 45min upfront investment enabled unlimited instant searches and full fuzzy matching flexibility
- Multi-format exports - SQLite (engineers), Parquet (analysts), CSV (museum staff) ensured all stakeholders could access data
- Scope discipline - Shipping Militão-only (8 items, well-documented) better than incomplete “all 29 items” attempt
- Architecture documentation - Captured decision rationale (why complete extraction vs. API search) for future reference
What to Improve Next Time
- Strategic search first - Match top 5 obvious items (“Independência ou Morte” painting) to validate pipeline before full catalog download
- Data profiling upfront - Inspect catalog schema (author vs. description fields) before building extraction logic
- Confidence calibration - 0.30-0.50 scores are medium-low; should have tested semantic embeddings (sentence-transformers) for comparison
- Human-in-the-loop earlier - Validate 3 items manually, adjust matching logic, then scale to all 8 (avoid rework)
Data Historicity: The Museum as Living Archive
Case Study: The Krahô Axe
One of the most interesting challenges wasn’t technical—it was conceptual:
The Problem: What happens when a mentioned artifact is no longer in the digital catalog?
Example: The Krahô ceremonial axe (machadinha) was mentioned in podcasts but doesn’t appear in the online catalog. Why?
- Scenario 1: Item was repatriated (returned to indigenous Krahô people)
- Scenario 2: Item transferred to another institution
- Scenario 3: Item exists physically but not yet digitized
- Scenario 4: Item in storage/conservation
The Data Challenge:
Traditional database design treats “not found” as a null value. But in museum contexts, absence tells a story:
item_mention,matched,match_status,status_notes
Machadinha Krahô,FALSE,repatriated,"Devolvida ao povo Krahô em 2023"
Documento Colonial,FALSE,transferred,"Migrado para Arquivo Nacional"
Fotografia Perdida,FALSE,never_digitized,"Existe fisicamente, não digitalizado"
Why This Matters:
- Historical Accountability - Repatriation and decolonization efforts need documentation
- Provenance Tracking - Items move between institutions; databases should reflect this
- Research Context - Knowing why something isn’t available is as important as knowing what is
- Temporal Data - Museum collections are temporal—they change over time
Database Design Implication:
Instead of:
-- ❌ Binary approach (loses information)
matched BOOLEAN
Use:
-- ✅ Rich status tracking (preserves provenance)
match_status ENUM('found', 'repatriated', 'transferred', 'not_digitized', 'unknown')
status_date DATE
status_notes TEXT
previous_location TEXT
💡 Big Learning: Cultural data engineering isn’t just about matching strings—it’s about preserving the movement, context, and stories behind artifacts.
This same principle applies to:
- E-commerce: Product discontinuation (active, discontinued, recalled, replaced_by_sku)
- Healthcare: Patient status (active, transferred, deceased, lost_to_followup)
- HR: Employment (active, terminated, retired, transferred_to_department)
- Government: Document status (active, declassified, destroyed, sealed_until_date)
The meta-lesson: Always ask “What does NULL really mean in this domain?”
For full investigation of whether the Tainacan API supports historical context tracking, see TAINACAN_HISTORICAL_CONTEXT_INVESTIGATION.md.
Naming Conventions Reference
All project files use day05_ prefix for isolation.
File Naming Pattern:
day05_DATA_*.py - Data extraction/transformation scripts
day05_PIPELINE_*.py - Multi-step orchestration logic
day05_CONFIG_*.py - Configuration and settings
day05_TOOL_*.py - Utility scripts for manual operations
See PROMPT_project_setup.md for complete naming standards.
Links & Resources