Overview

Visual content drives engagement. Tagtaly ensures every article has a quality image for dashboard displays and social sharing, and currently reaches 99.2% image coverage (497 of 501 articles). This document explains the 9-method extraction system, CDN bypass techniques, and responsive image handling.

The Challenge

Finding article images is harder than it seems:

Original Coverage: 64% (183 of 501 articles had images)
After Optimization: 99.2% (497 of 501 articles have images)
Improvement: +35.2 percentage points

The 9-Method Priority System

Tagtaly tries 9 extraction methods in order. When one succeeds, it stops (avoiding redundant fetches):

Priority  Method                         Coverage  Speed
1         RSS native image tag           40%       Instant
2         Open Graph og:image            35%       1-2 sec
3         Twitter Card twitter:image     25%       1-2 sec
4         Schema.org JSON-LD image       20%       1-2 sec
5         Alternate image meta tag       15%       1-2 sec
6         Source-specific CDN patterns   10%       Instant
7         Standard HTML img tags         8%        1-2 sec
8         Lazy-loaded data-src           5%        1-2 sec
9         Responsive srcset              3%        1-2 sec
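
The "first success wins" ordering can be sketched as a simple loop over extractor functions. This is a minimal illustration rather than Tagtaly's actual code: `first_image` and the three stand-in extractors are hypothetical names, and the real methods would read the feed or the article page.

```python
from typing import Callable, Iterable, List, Optional

# An extractor takes an article URL and returns an image URL, or None on failure.
Extractor = Callable[[str], Optional[str]]


def first_image(article_url: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Run extractors in priority order and return the first image URL found."""
    for extract in extractors:
        try:
            image_url = extract(article_url)
        except Exception:
            # A failing method should not abort the chain; try the next one.
            continue
        if image_url:
            return image_url  # Stop here: lower-priority methods never run.
    return None


calls: List[str] = []  # records which stand-in extractors were invoked


def from_rss(url: str) -> Optional[str]:
    calls.append("rss")
    return None  # pretend the feed carried no image


def from_open_graph(url: str) -> Optional[str]:
    calls.append("og")
    return "https://example.com/og.jpg"


def from_twitter_card(url: str) -> Optional[str]:
    calls.append("twitter")
    return "https://example.com/tw.jpg"


result = first_image(
    "https://example.com/story",
    [from_rss, from_open_graph, from_twitter_card],
)
```

Because the loop returns as soon as Open Graph succeeds, the Twitter Card stand-in is never called, which is the behavior that avoids redundant fetches.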

Method 1: RSS Native Image

Fastest method. Some RSS feeds include `<image>` tags with article images:

    <image>
      <url>https://example.com/image.jpg</url>
      <width>1200</width>
      <height>630</height>
    </image>

Coverage: BBC, Sky News (~40%)
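
Reading that tag needs no page fetch, which is why this method is instant. The sketch below uses the standard library's XML parser and assumes the `<image>` element sits directly inside an `<item>`, as in the snippet above; `rss_item_image` and `SAMPLE_ITEM` are illustrative names, and real feeds may place images elsewhere.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical feed item, shaped like the <image> snippet shown above.
SAMPLE_ITEM = """
<item>
  <title>Example story</title>
  <link>https://example.com/story</link>
  <image>
    <url>https://example.com/image.jpg</url>
    <width>1200</width>
    <height>630</height>
  </image>
</item>
"""


def rss_item_image(item_xml: str) -> Optional[str]:
    """Return the <image><url> text of one RSS item, or None if it is absent."""
    item = ET.fromstring(item_xml.strip())
    url = item.findtext("image/url")
    if url and url.strip():
        return url.strip()
    return None


image_url = rss_item_image(SAMPLE_ITEM)
```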

Methods 2-5: Metadata Extraction

When RSS fails, scrape article HTML for meta tags:

    # Open Graph (most common)
    <meta property="og:image" content="https://...">

    # Twitter Card
    <meta name="twitter:image" content="https://...">

    # JSON-LD Schema
    "image": "https://..."

Combined Coverage: Most outlets (~35-40%)
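
A rough sketch of methods 2-4 follows. To stay dependency-free it swaps BeautifulSoup for the standard library's `html.parser`; `MetaImageParser` and `metadata_image` are hypothetical names, and method 5 (the alternate meta tag) is left out because the source does not say which tag it reads. JSON-LD is handled only in its simplest form, a top-level object whose `image` is a string.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class MetaImageParser(HTMLParser):
    """Collect og:image, twitter:image, and JSON-LD image values from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.found = {}            # tag kind -> first image URL seen
        self._in_jsonld = False
        self._jsonld_chunks = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            # Open Graph uses property=..., Twitter Cards use name=...
            key = attr.get("property") or attr.get("name")
            if key in ("og:image", "twitter:image") and attr.get("content"):
                self.found.setdefault(key, attr["content"])
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._jsonld_chunks))
            except json.JSONDecodeError:
                return  # malformed JSON-LD: ignore and keep parsing
            image = doc.get("image") if isinstance(doc, dict) else None
            if isinstance(image, str):
                self.found.setdefault("json-ld", image)


def metadata_image(html: str) -> Optional[str]:
    """Return the highest-priority metadata image in the page, or None."""
    parser = MetaImageParser()
    parser.feed(html)
    for key in ("og:image", "twitter:image", "json-ld"):  # methods 2, 3, 4
        if key in parser.found:
            return parser.found[key]
    return None
```

Priority is applied after parsing, so the order in which the tags appear in the page does not matter.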

Method 6: Source-Specific CDN Patterns

Each outlet serves images from its own CDN, with recognizable URL patterns: BBC uses `images.bbc.co.uk`, and the Guardian uses `media.guim.co.uk`. For outlets with access restrictions, use proxy services:

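    from urllib.parse import quote

    # Guardian bypass via weserv.nl proxy
    original_url = "https://media.guim.co.uk/abc123/image.jpg"
    # Percent-encode the original URL so it survives as one query value
    proxied_url = f"https://images.weserv.nl/?url={quote(original_url, safe='')}"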

Coverage: Guardian, Washington Post, NPR (~10-15%)
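
One way to apply such rules per source is a small lookup keyed on the image host. The rule table here is an assumption for illustration (only the Guardian host from the example above is listed), and `route_image_url` is a hypothetical helper, not a Tagtaly function.

```python
from urllib.parse import quote, urlsplit

# Hypothetical rule table: hosts whose images are fetched through the proxy.
PROXIED_HOSTS = {"media.guim.co.uk"}
PROXY_PREFIX = "https://images.weserv.nl/?url="


def route_image_url(image_url: str) -> str:
    """Return a proxied URL for restricted hosts, otherwise the URL unchanged."""
    host = urlsplit(image_url).hostname or ""
    if host in PROXIED_HOSTS:
        # Percent-encode so the original URL is passed as a single query value.
        return PROXY_PREFIX + quote(image_url, safe="")
    return image_url
```

Since the decision depends only on the URL string, no network request is needed, which matches the "Instant" speed listed for this method.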

Methods 7-9: HTML Parsing

When metadata fails, parse article HTML for images:

    from bs4 import BeautifulSoup
    import requests

    response = requests.get(article_url, timeout=5)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Standard images
    images = soup.find_all('img')

    # Lazy-loaded images
    lazy_images = soup.find_all('img', {'data-src': True})

    # Responsive images in srcset
    for img in images:
        if img.get('srcset'):
            # Pick largest resolution
            srcset = img['srcset'].split(',')

Coverage: Fallback for remaining articles (~15-20%)
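
The "pick largest resolution" step for `srcset` could be completed along these lines. This is a simplified sketch: `largest_from_srcset` is a hypothetical helper, and splitting on commas assumes the candidate URLs themselves contain no commas.

```python
from typing import Optional


def largest_from_srcset(srcset: str) -> Optional[str]:
    """Return the URL of the largest candidate in a srcset attribute value."""
    best_url, best_size = None, -1.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        url = parts[0]
        # A candidate without a descriptor is treated as 1x density.
        descriptor = parts[1] if len(parts) > 1 else "1x"
        try:
            # "800w" is a width in pixels, "2x" a pixel density;
            # both are a number followed by a one-letter unit.
            size = float(descriptor[:-1])
        except ValueError:
            continue  # unrecognized descriptor: ignore this candidate
        if size > best_size:
            best_url, best_size = url, size
    return best_url


widest = largest_from_srcset("small.jpg 400w, large.jpg 1200w, medium.jpg 800w")
```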

Image Quality & Sizing

Preferred Dimensions

For dashboard and social sharing, images should be:

Target Specifications:
  • Aspect ratio: 16:9 or 1.2:1 (landscape)
  • Minimum width: 800px (for quality)
  • File size: <2 MB (for load speed)
  • Format: JPEG, PNG, or WebP

Validation

Before storing, images are validated.
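
The individual checks are not listed here, so the sketch below simply assumes they mirror the target specifications above: an accepted format, a minimum width of 800px, a landscape shape, and a size under 2 MB. `ImageInfo` and `validation_errors` are hypothetical names, and the aspect-ratio target is reduced to "wider than tall".

```python
from dataclasses import dataclass
from typing import List

MIN_WIDTH_PX = 800
MAX_BYTES = 2 * 1024 * 1024  # "<2 MB", read here as 2 MiB (an assumption)
ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp"}


@dataclass
class ImageInfo:
    width: int
    height: int
    size_bytes: int
    content_type: str


def validation_errors(info: ImageInfo) -> List[str]:
    """Return the reasons an image misses the targets (empty list means valid)."""
    errors = []
    if info.content_type.lower() not in ALLOWED_TYPES:
        errors.append("unsupported format")
    if info.width < MIN_WIDTH_PX:
        errors.append("too narrow")
    if info.width <= info.height:
        errors.append("not landscape")
    if info.size_bytes >= MAX_BYTES:
        errors.append("file too large")
    return errors


good = validation_errors(ImageInfo(1200, 630, 300_000, "image/jpeg"))
bad = validation_errors(ImageInfo(640, 640, 3_000_000, "image/gif"))
```

Returning the list of failed checks, rather than a single boolean, makes it easier to log why an image was rejected before falling through to the next method.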

Handling Edge Cases

Blocked/Restricted Images

Some outlets block external access. Tagtaly uses three strategies:

  1. Proxy services: weserv.nl, imgproxy, or similar
  2. User-Agent headers: Spoof browser request
  3. Fallback placeholder: Use generic news image
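
The three strategies can be chained as in the sketch below. The order follows the list above, which is an assumption about how they are actually combined. `resolve_blocked_image`, `PLACEHOLDER_URL`, and the `fetch_ok` callback (which reports whether a URL downloads successfully with the given headers) are all hypothetical, and the User-Agent string is only an example of a browser-like value.

```python
from typing import Callable, Dict
from urllib.parse import quote

PROXY_PREFIX = "https://images.weserv.nl/?url="
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}
PLACEHOLDER_URL = "https://example.com/placeholder.jpg"  # hypothetical asset

# fetch_ok(url, headers) returns True when the image can be downloaded.
FetchCheck = Callable[[str, Dict[str, str]], bool]


def resolve_blocked_image(image_url: str, fetch_ok: FetchCheck) -> str:
    """Return a usable URL for a restricted image, or the placeholder."""
    # 1. Proxy service
    proxied = PROXY_PREFIX + quote(image_url, safe="")
    if fetch_ok(proxied, {}):
        return proxied
    # 2. Direct request with a browser-like User-Agent
    if fetch_ok(image_url, BROWSER_HEADERS):
        return image_url
    # 3. Generic placeholder
    return PLACEHOLDER_URL
```

Passing the download check in as a callback keeps the fallback logic separate from the HTTP client, so it can be exercised without network access.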

Corrupted/Invalid Images

If all 9 methods fail, Tagtaly uses a source-branded placeholder image. This guarantees that every article displays a visual, although a few are placeholders rather than actual article images.

Performance Optimization

Caching

Image URLs are cached to avoid redundant fetches:

    # Cache successful extractions: maps article URL -> image URL
    image_cache = {}

    # For duplicate articles (same URL), use the cached image
    if article_url in image_cache:
        image_url = image_cache[article_url]
    else:
        image_url = extract_image(article_url)
        if image_url:
            # Store the result so later duplicates skip extraction
            image_cache[article_url] = image_url

Timeout Handling

HTML scraping has a 5-second timeout per article. If a page takes longer, the request is abandoned and the pipeline moves on to the next article, so slow websites cannot block the run.
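
A minimal sketch of that skip-on-timeout behavior, using the standard library's `urllib` instead of `requests` so it stays self-contained. `fetch_html` and `scrape_all` are hypothetical names, and the `opener` parameter exists only so the function can be tried without network access. Note that this kind of timeout applies to individual blocking socket operations, not to the total download time.

```python
import urllib.request
from typing import Callable, Dict, Iterable, Optional

TIMEOUT_SECONDS = 5


def fetch_html(url: str, opener: Callable = urllib.request.urlopen) -> Optional[bytes]:
    """Return the page body, or None if the request fails or times out."""
    try:
        with opener(url, timeout=TIMEOUT_SECONDS) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return None


def scrape_all(urls: Iterable[str], opener: Callable = urllib.request.urlopen) -> Dict[str, Optional[bytes]]:
    """Fetch every URL; a slow or failing site yields None instead of stalling."""
    return {url: fetch_html(url, opener) for url in urls}
```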

Batch Processing

Image extraction runs in parallel for multiple articles (up to 10 concurrent), reducing total extraction time from minutes to seconds.
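
Bounded parallelism of this kind is commonly written with a thread pool, since the work is dominated by network waits. The sketch below assumes an `extract_image(url)` function like the one in the caching example; `extract_many` is a hypothetical wrapper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional

MAX_WORKERS = 10  # "up to 10 concurrent"


def extract_many(
    urls: Iterable[str],
    extract_image: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Run extract_image over urls on a thread pool; failures become None."""

    def safe_extract(url: str) -> Optional[str]:
        try:
            return extract_image(url)
        except Exception:
            return None  # one bad article must not sink the whole batch

    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # map() yields results in input order, regardless of completion order.
        results = list(pool.map(safe_extract, url_list))
    return dict(zip(url_list, results))
```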

Results & Statistics

Coverage by Source:
  • BBC: 100% (RSS native images)
  • Guardian: 100% (CDN proxy bypass)
  • Sky News: 100% (RSS + HTML)
  • Independent: 98% (meta + HTML)
  • Washington Post: 99% (JSON-LD + proxy)

Overall Performance:
  • Total articles processed: 501
  • Images found: 497 (99.2%)
  • Placeholder fallbacks: 4 (0.8%)
  • Average extraction time: 1.2 seconds per article

Future Improvements

Questions?

For questions about image extraction, contact admin@tagtaly.com.
