Overview

Tagtaly's data collection pipeline runs hourly, scanning RSS feeds from 5 major news sources. Each article is extracted, processed, deduplicated, and stored in our SQLite database. This document explains how.

Data Sources

Tagtaly tracks 5 primary news sources chosen for their coverage volume, editorial quality, and representativeness of UK/US news:

Source           Region  Avg. Articles/Day  RSS Feed Type
BBC News         UK      80-120             Multiple category feeds
The Guardian     UK      120-180            Multiple section feeds
Sky News         UK      60-100             Single main feed
The Independent  UK      40-80              Multiple section feeds
Washington Post  US      100-150            Multiple section feeds

These sources were selected because they: (1) publish high-quality journalism, (2) have significant editorial reach, (3) represent different editorial perspectives, (4) maintain reliable RSS feeds, (5) cover both UK and US news.

Collection Process

Step 1: Feed Fetching

The news_collector.py script runs hourly via GitHub Actions. It fetches each RSS feed and extracts article metadata:

    for feed_url in RSS_FEEDS:
        response = requests.get(feed_url, timeout=10)
        feed = feedparser.parse(response.content)
        for entry in feed.entries:
            article = {
                'headline': entry.get('title', ''),
                'source': SOURCE_NAMES[feed_url],     # outlet name looked up per feed URL
                'url': entry.get('link', ''),
                'summary': entry.get('summary', ''),  # .get() avoids errors on sparse feeds
                'published_date': entry.get('published') or datetime.now(),
                'fetched_at': datetime.now()
            }

Step 2: Deduplication

Multiple outlets may republish the same story (AP wire, aggregation services). To avoid double-counting, Tagtaly uses MD5 hashing of the article URL to create unique IDs:

    import hashlib

    url = "https://example.com/story"
    article_id = hashlib.md5(url.encode()).hexdigest()  # 32-character hex digest

This ensures each unique story appears once, regardless of how many outlets link to it.
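In practice, the hash-based check can be sketched as a small helper that keeps only the first occurrence of each URL (illustrative only; `dedupe` is not a function from news_collector.py):

```python
import hashlib

def article_id(url: str) -> str:
    """Derive a stable 32-character ID from an article URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def dedupe(urls):
    """Keep the first occurrence of each URL, dropping repeats."""
    seen = set()
    unique = []
    for url in urls:
        uid = article_id(url)
        if uid not in seen:
            seen.add(uid)
            unique.append(url)
    return unique
```

Because the ID depends only on the URL, the same story fetched in two different hourly runs hashes to the same value.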

Deduplication Impact: On average, 15-20% of collected articles are duplicates. Without deduplication, the same story would inflate volume metrics artificially.

Step 3: Data Cleaning

Raw feed data contains inconsistencies, so each article is cleaned before storage.
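The specific cleaning rules are not documented here; a plausible sketch (the exact rules are assumptions) would unescape HTML entities, strip residual tags, and normalize whitespace:

```python
import html
import re

def clean_headline(raw: str) -> str:
    """Illustrative cleaning pass: unescape HTML entities, drop any
    residual tags, and collapse runs of whitespace. The exact rules
    Tagtaly applies are assumed, not taken from news_collector.py."""
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", "", text)      # strip leftover HTML tags
    return re.sub(r"\s+", " ", text).strip() # collapse whitespace
```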

Step 4: Storage

Articles are inserted into SQLite with this schema:

    CREATE TABLE articles (
        id TEXT PRIMARY KEY,       -- MD5 hash of URL
        headline TEXT NOT NULL,
        source TEXT NOT NULL,
        url TEXT NOT NULL UNIQUE,
        published_date TEXT,
        summary TEXT,
        fetched_at TEXT NOT NULL,
        topic TEXT,
        sentiment TEXT,
        sentiment_score REAL
    );

The id field (MD5 hash) serves as the primary key. If the same URL is fetched again, the INSERT is skipped, preventing duplicates.
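That skip-on-duplicate behaviour can be expressed with SQLite's INSERT OR IGNORE, which silently drops a row whose primary key already exists (a sketch; the actual query text used by news_collector.py is not shown here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, headline TEXT)")

row = ("5d41...", "Example headline")
# Inserting the same primary key twice: the second insert is
# silently skipped rather than raising an IntegrityError.
conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?)", row)
conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?)", row)

count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```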

Data Volume & Growth

Daily Statistics:
  • Raw articles collected: 500-600
  • After deduplication: 420-500 unique
  • Database size: ~1-2 MB per month
  • Retention: All articles (no deletion)

Error Handling & Resilience

Feed Timeouts

If a feed is slow or offline, the collector continues with other sources rather than failing entirely:

    try:
        response = requests.get(url, timeout=10)
    except requests.Timeout:
        logger.warning(f"Feed timeout: {url}")
        continue  # Move to next feed

Database Locks

SQLite can have lock contention. If the database is locked, the collection script retries with exponential backoff:

    for attempt in range(3):
        try:
            cursor.execute(INSERT_QUERY)
            break
        except sqlite3.OperationalError as e:
            if 'database is locked' in str(e):
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
            else:
                raise

Malformed Data

Some feeds contain invalid data (missing headlines, broken URLs). Articles with critical missing fields are logged and skipped:

    if not article.get('headline') or not article.get('url'):
        logger.error(f"Skipping malformed article: {article}")
        continue

Quality Checks

Freshness Validation

Articles published more than 30 days ago are retained for historical analysis but flagged as "archived".
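A minimal sketch of the 30-day check (the function name and flag handling are assumptions, not taken from the pipeline):

```python
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER = timedelta(days=30)

def is_archived(published_date: datetime, now: datetime) -> bool:
    """Flag articles published more than 30 days before `now`
    as archived. Articles are flagged, never deleted."""
    return now - published_date > ARCHIVE_AFTER
```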

Source Validation

Each article's source is validated against our known sources list. Unknown sources trigger a warning.
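A sketch of this check against the five sources listed above (the helper name and logger setup are illustrative):

```python
import logging

logger = logging.getLogger("tagtaly.collector")

KNOWN_SOURCES = {
    "BBC News", "The Guardian", "Sky News",
    "The Independent", "Washington Post",
}

def validate_source(source: str) -> bool:
    """Warn (but don't drop the article) when a source is unknown."""
    if source not in KNOWN_SOURCES:
        logger.warning("Unknown source: %s", source)
        return False
    return True
```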

URL Validation

URLs are checked for basic structural validity before storage.
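The specific criteria are not enumerated here; one plausible structural check (an assumed sketch using the standard library) requires an http(s) scheme and a non-empty host:

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Assumed criteria: http(s) scheme and a non-empty hostname."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```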

Performance & Optimization

Collection Speed

Fetching 5 feeds + processing + deduplication typically completes in 2-3 minutes. This runs hourly, leaving 57 minutes for other pipeline steps (analysis, visualization).
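To see where that 2-3 minutes goes, per-feed fetch times can be measured with a simple wrapper (an illustrative helper, not part of the pipeline):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return its result."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    print(f"{label}: {elapsed:.2f}s")
    return result
```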

Database Indexing

Critical queries are indexed for speed:

    CREATE INDEX idx_source ON articles(source);
    CREATE INDEX idx_fetched_date ON articles(fetched_at);
    CREATE INDEX idx_topic ON articles(topic);
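To confirm an index is actually used by a query, SQLite's EXPLAIN QUERY PLAN can be inspected (illustrative, using only the standard library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id TEXT PRIMARY KEY, source TEXT, "
    "fetched_at TEXT, topic TEXT)"
)
conn.execute("CREATE INDEX idx_source ON articles(source)")

# For an equality filter on an indexed column, the plan detail
# should mention idx_source rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM articles WHERE source = ?",
    ("BBC News",),
).fetchall()
```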

Batch Inserts

Articles are inserted in batches (100 per transaction) rather than individually, reducing transaction overhead:

    batch = []
    for article in articles:
        batch.append(article)
        if len(batch) >= 100:
            cursor.executemany(INSERT_QUERY, batch)
            batch = []
    # Flush any remaining articles (fewer than 100)
    if batch:
        cursor.executemany(INSERT_QUERY, batch)

Privacy & Ethical Considerations

Data We Collect

Only publicly published article metadata: headlines, source names, article URLs, summaries, and publication dates.

We do NOT collect: Author names, user data, private information, cookies, IP addresses, or any personal data.

Attribution

Every article links directly to the original publication. We're an analysis layer, not a content provider. Users always access original articles at source outlets.

Future Improvements

Further enhancements to the collection pipeline are planned.

Questions?

For technical questions about data collection, contact admin@tagtaly.com.
