Overview
Tagtaly's data collection pipeline runs hourly, scanning RSS feeds from 5 major news sources. Each article's metadata is extracted, cleaned, deduplicated, and stored in our SQLite database. This document explains how.
Data Sources
Tagtaly tracks 5 primary news sources chosen for their coverage volume, editorial quality, and representativeness of UK/US news:
| Source | Region | Average Articles/Day | RSS Feed Type |
|---|---|---|---|
| BBC News | UK | 80-120 | Multiple category feeds |
| The Guardian | UK | 120-180 | Multiple section feeds |
| Sky News | UK | 60-100 | Single main feed |
| The Independent | UK | 40-80 | Multiple section feeds |
| Washington Post | US | 100-150 | Multiple section feeds |
These sources were selected because they: (1) publish high-quality journalism, (2) have significant editorial reach, (3) represent different editorial perspectives, (4) maintain reliable RSS feeds, (5) cover both UK and US news.
Collection Process
Step 1: Feed Fetching
The news_collector.py script runs hourly via GitHub Actions. It fetches each RSS feed and extracts article metadata:
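The fetching code itself isn't reproduced here. As a minimal stdlib sketch of the extraction step (the real collector likely uses a feed-parsing library; the `parse_feed` helper, the inline sample feed, and the field names are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of the RSS 2.0 shape these feeds share.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item>
    <title>Example headline</title>
    <link>https://example.com/story</link>
    <pubDate>Mon, 06 Jan 2025 09:00:00 GMT</pubDate>
    <description>Summary text.</description>
  </item>
</channel></rss>"""

def parse_feed(xml_text: str, source_name: str) -> list[dict]:
    """Extract per-article metadata from one RSS feed."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "headline": item.findtext("title"),
            "url": item.findtext("link"),
            "published": item.findtext("pubDate"),
            "summary": item.findtext("description"),
            "source": source_name,
        })
    return articles
```

Each feed yields a list of plain dicts, which the later cleaning and deduplication steps can consume uniformly regardless of source.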
Step 2: Deduplication
Multiple feeds may carry the same story (overlapping category feeds, wire syndication, aggregation services). To avoid double-counting, Tagtaly hashes each article URL with MD5 to create a unique ID:
This ensures each URL is stored exactly once, no matter how many feeds include it.
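A minimal sketch of this ID scheme (the `article_id` function name is an assumption):

```python
import hashlib

def article_id(url: str) -> str:
    """Derive a stable, unique 32-character hex ID from the article URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()
```

Because MD5 is deterministic, re-fetching the same URL always produces the same ID, which is what lets the database layer skip duplicates by primary key.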
Step 3: Data Cleaning
Raw feed data contains inconsistencies. Each article is cleaned:
- Remove HTML tags from summaries
- Normalize dates to ISO 8601 format
- Validate URLs (reject malformed links)
- Trim whitespace from all text fields
- Skip articles with missing headlines or URLs
Step 4: Storage
Articles are inserted into SQLite with this schema:
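The exact schema isn't reproduced here; a plausible sketch consistent with the fields described in this document (column names are assumptions):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id        TEXT PRIMARY KEY,   -- MD5 hash of the article URL
    source    TEXT NOT NULL,      -- outlet name, e.g. 'BBC News'
    headline  TEXT NOT NULL,
    url       TEXT NOT NULL,
    summary   TEXT,
    published TEXT                -- ISO 8601 timestamp
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# INSERT OR IGNORE skips rows whose primary key already exists,
# so re-fetching the same URL never creates a duplicate row.
row = ("abc123", "BBC News", "Headline", "https://example.com/x", "", "2025-01-06T09:00:00")
conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?)", row)
conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?)", row)
```

`INSERT OR IGNORE` is one way SQLite can implement "skip if the same URL is fetched again" without an explicit existence check.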
The id field (MD5 hash) serves as the primary key. If the same URL is fetched again, the INSERT is skipped, preventing duplicates.
Data Volume & Growth
- Raw articles collected: 500-600 per day
- After deduplication: 420-500 unique per day
- Database size: ~1-2 MB per month
- Retention: All articles (no deletion)
Error Handling & Resilience
Feed Timeouts
If a feed is slow or offline, the collector continues with other sources rather than failing entirely:
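One way to sketch this continue-on-failure behaviour (the `collect_all` helper and its signature are assumptions; the real script presumably wraps its HTTP fetch in a similar try/except):

```python
import logging

def collect_all(feeds: dict, fetch_one) -> dict:
    """Fetch every feed; a slow or offline source is logged and skipped."""
    results = {}
    for name, url in feeds.items():
        try:
            results[name] = fetch_one(url)
        except Exception as exc:  # timeout, DNS failure, HTTP error, ...
            logging.warning("Skipping feed %s (%s): %s", name, url, exc)
    return results
```

Catching per-feed and continuing means an outage at one outlet costs that hour's articles from that outlet only, not the entire collection run.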
Database Locks
SQLite can have lock contention. If the database is locked, the collection script retries with exponential backoff:
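A sketch of the retry pattern described above (the `with_retry` wrapper, attempt count, and delays are assumptions):

```python
import random
import sqlite3
import time

def with_retry(operation, attempts: int = 5, base_delay: float = 0.1):
    """Run a DB operation, retrying on 'database is locked' with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise  # unrelated error, or out of retries
            # 0.1s, 0.2s, 0.4s, ... plus jitter to avoid lock-step retries
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The jitter term matters when two writers back off simultaneously; without it they can keep colliding at the same intervals.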
Malformed Data
Some feeds contain invalid data (missing headlines, broken URLs). Articles with critical missing fields are logged and skipped:
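A minimal sketch of that log-and-skip check (the `is_usable` helper and the exact required-field set are assumptions):

```python
import logging

REQUIRED_FIELDS = ("headline", "url")

def is_usable(article: dict) -> bool:
    """Log and reject articles missing critical fields."""
    missing = [f for f in REQUIRED_FIELDS if not article.get(f)]
    if missing:
        logging.warning("Skipping article (missing %s): %r", ", ".join(missing), article)
        return False
    return True
```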
Quality Checks
Freshness Validation
Articles published more than 30 days ago are retained for historical analysis but flagged as "archived".
Source Validation
Each article's source is validated against our known sources list. Unknown sources trigger a warning.
URL Validation
URLs are checked for:
- Valid HTTP/HTTPS protocol
- No malformed characters
- Reachability (basic connectivity check against the host)
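The first two checks can be sketched with the stdlib URL parser (the `is_valid_url` helper is an assumption; the reachability probe is omitted here because it needs a network call):

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Check protocol and basic structure; reachability is tested separately."""
    try:
        parts = urlparse(url)
    except ValueError:  # e.g. malformed characters in the host portion
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```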
Performance & Optimization
Collection Speed
Fetching 5 feeds + processing + deduplication typically completes in 2-3 minutes of each hourly run, leaving the rest of the hour for other pipeline steps (analysis, visualization).
Database Indexing
Critical queries are indexed for speed:
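The index definitions aren't shown; a plausible sketch of what they look like (index names and the indexed columns are assumptions, chosen to match the recent-articles and per-outlet queries this document implies):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE articles (
    id TEXT PRIMARY KEY, source TEXT, published TEXT, headline TEXT)""")
# Speed up the two most common query patterns:
conn.execute("CREATE INDEX IF NOT EXISTS idx_published ON articles(published)")  # recent-articles scans
conn.execute("CREATE INDEX IF NOT EXISTS idx_source ON articles(source)")        # per-outlet breakdowns
```

The `id` primary key is indexed automatically by SQLite, so no extra index is needed for duplicate detection.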
Batch Inserts
Articles are inserted in batches (100 per transaction) rather than individually, reducing transaction overhead:
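A sketch of batched insertion with `executemany` (the `insert_batch` helper and the two-column table are assumptions to keep the example small):

```python
import sqlite3

def insert_batch(conn: sqlite3.Connection, rows: list, batch_size: int = 100) -> None:
    """Insert rows in batches of `batch_size`, one transaction per batch."""
    sql = "INSERT OR IGNORE INTO articles (id, headline) VALUES (?, ?)"
    for start in range(0, len(rows), batch_size):
        with conn:  # opens a transaction, commits on success
            conn.executemany(sql, rows[start:start + batch_size])
```

Using the connection as a context manager gives one commit per batch instead of one per row, which is where the transaction-overhead saving comes from.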
Privacy & Ethical Considerations
Data We Collect
Only publicly published article metadata:
- Headline (public)
- Summary (public)
- Publication URL (public)
- Publication time (public)
- Source outlet name (public)
We do NOT collect: Author names, user data, private information, cookies, IP addresses, or any personal data.
Attribution
Every article links directly to the original publication. We're an analysis layer, not a content provider. Users always access original articles at source outlets.
Future Improvements
Planned enhancements to the collection pipeline:
- Additional sources: International outlets (Reuters, AP, Bloomberg)
- Web scraping: Outlets without RSS feeds
- Social media: Twitter/X trending detection
- Real-time processing: WebSocket feeds instead of hourly batches
Questions?
For technical questions about data collection, contact admin@tagtaly.com.