Overview

Tagtaly's data collection pipeline runs hourly, scanning RSS feeds from 5 major news sources. Each article is extracted, processed, deduplicated, and stored in our SQLite database. This document explains how.

Data Sources

Tagtaly tracks 5 primary news sources chosen for their coverage volume, editorial quality, and representativeness of UK/US news:

Source           Region  Avg. Articles/Day  RSS Feed Type
BBC News         UK      80-120             Multiple category feeds
The Guardian     UK      120-180            Multiple section feeds
Sky News         UK      60-100             Single main feed
The Independent  UK      40-80              Multiple section feeds
Washington Post  US      100-150            Multiple section feeds

These sources were selected because they: (1) publish high-quality journalism, (2) have significant editorial reach, (3) represent different editorial perspectives, (4) maintain reliable RSS feeds, (5) cover both UK and US news.

Collection Process

Step 1: Feed Fetching

The news_collector.py script runs hourly via GitHub Actions. It fetches each RSS feed and extracts article metadata:

    for feed_url in RSS_FEEDS:
        response = requests.get(feed_url, timeout=10)
        feed = feedparser.parse(response.content)
        for entry in feed.entries:
            article = {
                'headline': entry.get('title', ''),
                'source': SOURCE_NAMES[feed_url],     # outlet name looked up per feed URL
                'url': entry.get('link', ''),
                'summary': entry.get('summary', ''),  # .get() avoids errors on sparse feeds
                'published_date': entry.get('published') or datetime.now(),
                'fetched_at': datetime.now()
            }

Step 2: Deduplication

Multiple outlets may republish the same story (AP wire, aggregation services). To avoid double-counting, Tagtaly uses MD5 hashing of the article URL to create unique IDs:

    import hashlib

    url = "https://example.com/story"
    article_id = hashlib.md5(url.encode()).hexdigest()  # 32-character hex digest

This ensures each unique story appears once, regardless of how many outlets link to it.
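In practice, the hash-based check can be sketched as a small helper that keeps only the first occurrence of each URL (illustrative only; `dedupe` is not a function from news_collector.py):

```python
import hashlib

def article_id(url: str) -> str:
    """Derive a stable 32-character ID from an article URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def dedupe(urls):
    """Keep the first occurrence of each URL, dropping repeats."""
    seen = set()
    unique = []
    for url in urls:
        uid = article_id(url)
        if uid not in seen:
            seen.add(uid)
            unique.append(url)
    return unique
```

Because the ID depends only on the URL, the same story fetched in two different hourly runs hashes to the same value.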

Deduplication Impact: On average, 15-20% of collected articles are duplicates. Without deduplication, the same story would inflate volume metrics artificially.

Step 3: Data Cleaning

Raw feed data contains inconsistencies, so each article is cleaned before storage.
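The specific cleaning rules are not documented here; a plausible sketch (the exact rules are assumptions) would unescape HTML entities, strip residual tags, and normalize whitespace:

```python
import html
import re

def clean_headline(raw: str) -> str:
    """Illustrative cleaning pass: unescape HTML entities, drop any
    residual tags, and collapse runs of whitespace. The exact rules
    Tagtaly applies are assumed, not taken from news_collector.py."""
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", "", text)      # strip leftover HTML tags
    return re.sub(r"\s+", " ", text).strip() # collapse whitespace
```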

Step 4: Storage

Articles are inserted into SQLite with this schema:

    CREATE TABLE articles (
        id TEXT PRIMARY KEY,       -- MD5 hash of URL
        headline TEXT NOT NULL,
        source TEXT NOT NULL,
        url TEXT NOT NULL UNIQUE,
        published_date TEXT,
        summary TEXT,
        fetched_at TEXT NOT NULL,
        topic TEXT,
        sentiment TEXT,
        sentiment_score REAL
    );

The id field (MD5 hash) serves as the primary key. If the same URL is fetched again, the INSERT is skipped, preventing duplicates.
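That skip-on-duplicate behaviour can be expressed with SQLite's INSERT OR IGNORE, which silently drops a row whose primary key already exists (a sketch; the actual query text used by news_collector.py is not shown here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, headline TEXT)")

row = ("5d41...", "Example headline")
# Inserting the same primary key twice: the second insert is
# silently skipped rather than raising an IntegrityError.
conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?)", row)
conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?)", row)

count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```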

Data Volume & Growth

Daily Statistics:
  • Raw articles collected: 500-600
  • After deduplication: 420-500 unique
  • Database size: ~1-2 MB per month
  • Retention: All articles (no deletion)

Error Handling & Resilience

Feed Timeouts

If a feed is slow or offline, the collector continues with other sources rather than failing entirely:

    try:
        response = requests.get(url, timeout=10)
    except requests.Timeout:
        logger.warning(f"Feed timeout: {url}")
        continue  # Move to next feed

Database Locks

SQLite can have lock contention. If the database is locked, the collection script retries with exponential backoff:

    for attempt in range(3):
        try:
            cursor.execute(INSERT_QUERY)
            break
        except sqlite3.OperationalError as e:
            if 'database is locked' in str(e):
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
            else:
                raise

Malformed Data

Some feeds contain invalid data (missing headlines, broken URLs). Articles with critical missing fields are logged and skipped:

    if not article.get('headline') or not article.get('url'):
        logger.error(f"Skipping malformed article: {article}")
        continue

Quality Checks

Freshness Validation

Articles published more than 30 days ago are retained for historical analysis but flagged as "archived".
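A minimal sketch of the 30-day check (the function name and flag handling are assumptions, not taken from the pipeline):

```python
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER = timedelta(days=30)

def is_archived(published_date: datetime, now: datetime) -> bool:
    """Flag articles published more than 30 days before `now`
    as archived. Articles are flagged, never deleted."""
    return now - published_date > ARCHIVE_AFTER
```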

Source Validation

Each article's source is validated against our known sources list. Unknown sources trigger a warning.
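A sketch of this check against the five sources listed above (the helper name and logger setup are illustrative):

```python
import logging

logger = logging.getLogger("tagtaly.collector")

KNOWN_SOURCES = {
    "BBC News", "The Guardian", "Sky News",
    "The Independent", "Washington Post",
}

def validate_source(source: str) -> bool:
    """Warn (but don't drop the article) when a source is unknown."""
    if source not in KNOWN_SOURCES:
        logger.warning("Unknown source: %s", source)
        return False
    return True
```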

URL Validation

URLs are checked for basic structural validity before storage.
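The specific criteria are not enumerated here; one plausible structural check (an assumed sketch using the standard library) requires an http(s) scheme and a non-empty host:

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Assumed criteria: http(s) scheme and a non-empty hostname."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```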

Performance & Optimization

Collection Speed

Fetching 5 feeds + processing + deduplication typically completes in 2-3 minutes. This runs hourly, leaving 57 minutes for other pipeline steps (analysis, visualization).
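To see where that 2-3 minutes goes, per-feed fetch times can be measured with a simple wrapper (an illustrative helper, not part of the pipeline):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return its result."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    print(f"{label}: {elapsed:.2f}s")
    return result
```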

Database Indexing

Critical queries are indexed for speed:

    CREATE INDEX idx_source ON articles(source);
    CREATE INDEX idx_fetched_date ON articles(fetched_at);
    CREATE INDEX idx_topic ON articles(topic);
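To confirm an index is actually used by a query, SQLite's EXPLAIN QUERY PLAN can be inspected (illustrative, using only the standard library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id TEXT PRIMARY KEY, source TEXT, "
    "fetched_at TEXT, topic TEXT)"
)
conn.execute("CREATE INDEX idx_source ON articles(source)")

# For an equality filter on an indexed column, the plan detail
# should mention idx_source rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM articles WHERE source = ?",
    ("BBC News",),
).fetchall()
```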

Batch Inserts

Articles are inserted in batches (100 per transaction) rather than individually, reducing transaction overhead:

    batch = []
    for article in articles:
        batch.append(article)
        if len(batch) >= 100:
            cursor.executemany(INSERT_QUERY, batch)
            batch = []
    # Flush any remaining articles (fewer than 100)
    if batch:
        cursor.executemany(INSERT_QUERY, batch)

Privacy & Ethical Considerations

Data We Collect

Only publicly published article metadata: headlines, source names, article URLs, summaries, and publication dates.

We do NOT collect: Author names, user data, private information, cookies, IP addresses, or any personal data.

Attribution

Every article links directly to the original publication. We're an analysis layer, not a content provider. Users always access original articles at source outlets.

Future Improvements

Further enhancements to the collection pipeline are planned.

Questions?

For technical questions about data collection, contact admin@tagtaly.com.
