jhengy/content-aggregator

Web Content Summarizer

A Python tool that aggregates and summarizes web content using AI.

Features

| Feature | Description | Status |
| --- | --- | --- |
| Web Scraping | Extract articles from websites and blogs | |
| AI Summarization | Generate concise summaries using Gemini models | |
| RSS Feed Support | Process content from RSS/Atom feeds | |
| PDF Processing | Extract text content from PDF documents | |
| CI/CD Integration | Automated daily summaries via GitHub Actions | |
| Date Filtering | Filter content by publication date | Partially working: RSS sources only for now |
| Dynamic Content | Handle JavaScript-rendered pages using Playwright | |
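
Since date filtering currently works only for RSS sources, here is a minimal sketch of how such a filter could look. It is not the project's actual implementation: the function name `is_recent`, the `pubDate` key, and the one-day window are assumptions made for illustration. It relies on RSS 2.0 publication dates being RFC 822 strings, which the standard library can parse.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_recent(pub_date: str, now: datetime, max_age_days: int = 1) -> bool:
    """Return True if an RFC 822 date string is at most max_age_days older than now."""
    try:
        published = parsedate_to_datetime(pub_date)
    except (TypeError, ValueError):
        return False  # unparseable dates are treated as not recent
    if published.tzinfo is None:
        # Assumption: dates without a usable offset are interpreted as UTC.
        published = published.replace(tzinfo=timezone.utc)
    return now - published <= timedelta(days=max_age_days)


# Hypothetical feed items, shaped like parsed RSS entries.
now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
items = [
    {"title": "new post", "pubDate": "Fri, 10 Jan 2025 08:00:00 +0000"},
    {"title": "old post", "pubDate": "Wed, 01 Jan 2025 08:00:00 +0000"},
]
recent = [item["title"] for item in items if is_recent(item["pubDate"], now)]
print(recent)
```

Passing `now` explicitly keeps the filter deterministic in tests. Non-RSS sources have no guaranteed date field, which is likely why they are harder to support.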

Setup

Installation

# Clone repository
git clone https://github.com/yourusername/content-aggregator.git
cd content-aggregator

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install with dependencies
pip install -e .
playwright install chromium
playwright install-deps

Configuration

  1. Create .env file:
    cp .env.example .env
  2. Edit .env with your Gemini API details:
    GEMINI_API_KEY=your_api_key_here
    GEMINI_MODEL_SUMMARIZE=gemini-2.0-flash-exp
    GEMINI_MODEL_DATE_EXTRACT=gemini-2.0-flash-exp
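
As an illustration of how these three settings might be consumed, the sketch below reads them from a mapping and fails early when one is missing. The helper `load_settings` is hypothetical, and the sketch assumes the values from .env have already been loaded into the process environment; how the project actually loads them is not shown here.

```python
import os

REQUIRED = ("GEMINI_API_KEY", "GEMINI_MODEL_SUMMARIZE", "GEMINI_MODEL_DATE_EXTRACT")


def load_settings(env=os.environ) -> dict:
    """Return the three Gemini settings, raising RuntimeError if any is unset or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED}


# Usage with an explicit mapping, so the example runs without a real .env file.
settings = load_settings(
    {
        "GEMINI_API_KEY": "your_api_key_here",
        "GEMINI_MODEL_SUMMARIZE": "gemini-2.0-flash-exp",
        "GEMINI_MODEL_DATE_EXTRACT": "gemini-2.0-flash-exp",
    }
)
print(settings["GEMINI_MODEL_SUMMARIZE"])
```

Checking all variables up front gives one clear error at startup instead of a failure partway through a run.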

Usage

Basic Usage

# Run aggregator and generate issue
scripts/run.sh

CLI Commands

| Command | Description | Example |
| --- | --- | --- |
| run | Default aggregation process | content-aggregator run |

Testing

# Install with development dependencies
pip install -e '.[dev]'

# Run all tests
pytest tests/ -v -s
# Run internal tests
pytest tests/ -v -m "not external"
# Run external tests (these require network calls to URLs and external services such as Gemini)
pytest tests/ -v -m external

# Generate coverage report
pytest --cov=content_aggregator --cov-report=html -s

Automated Daily Summaries

The GitHub Actions workflow:

  • Runs daily (off-peak time)
  • Processes configured content sources
  • Creates GitHub issues with summaries
  • Stores JSON results and summaries as artifacts

Output files will be created in:

  • outputs/results_*.json: Full results in JSON format
  • outputs/results_*_summary.txt: Executive summary text file
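
To work with these files after a run, something like the following sketch could locate and load the newest JSON result. The helper `latest_results` is hypothetical, and it makes no assumption about the JSON schema, which is not documented here; it relies only on the `outputs/results_*.json` naming pattern above.

```python
import json
from pathlib import Path


def latest_results(output_dir: str = "outputs"):
    """Return (path, parsed JSON) for the most recently modified results file, or None."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return None
    # The pattern matches results_*.json only, so *_summary.txt files are skipped.
    candidates = list(directory.glob("results_*.json"))
    if not candidates:
        return None
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    with newest.open(encoding="utf-8") as fh:
        return newest, json.load(fh)


found = latest_results()
if found is None:
    print("No results found in outputs/")
else:
    path, data = found
    print(f"Loaded {path.name}")
```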

CI/CD Requirements

For GitHub Actions execution, ensure these repository settings:

  1. Under Settings > Actions > General:
    • Workflow permissions: "Read and write permissions"
    • Check "Allow GitHub Actions to create and approve pull requests"
  2. Add these secrets:
    • GEMINI_API_KEY
    • GEMINI_MODEL_SUMMARIZE
    • GEMINI_MODEL_DATE_EXTRACT

Challenges

  • different sources expose post links in different ways

    • tricky to filter out links that are not posts; excluding irrelevant links saves model cost
  • non-deterministic output within and across models

    • same model can output different results with the same prompt
    • different models produce different output formats depending on the prompt, which makes responses hard to parse consistently
  • reliability

    • speed issue: slowness, causing timeouts
    • rate limiting causing errors
    • retry logic
  • quality of output

    • not always accurate, hallucinations can happen
    • not always relevant
    • not always useful
    • not always interesting
    • not always surprising
  • cost

    • expensive to host and run on your own
    • expensive to run on a cloud provider for better models
  • extracting the publication date from a post

    • the LLM can hallucinate dates
  • extracting blog content from a URL

    • needs support for dynamic (JavaScript-rendered) content
  • too many models to choose from
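
The reliability items above (timeouts, rate limiting, retry logic) are commonly handled with exponential backoff. The sketch below is one possible shape, not the project's actual code: `call_with_retry`, its parameters, and the simulated `flaky_summarize` function are all hypothetical.

```python
import random
import time


def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Jitter spreads retries out so parallel callers do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Simulated model call that times out twice before succeeding.
attempts = {"count": 0}


def flaky_summarize():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "summary text"


# A no-op sleep keeps the example fast; omit it to get real delays.
result = call_with_retry(flaky_summarize, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Catching every `Exception` keeps the sketch short. Real code would more likely retry only on timeout and rate-limit errors from the client library in use and let other errors fail immediately.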

Identify the best model for the task

TODO

  • model dependent features
    • input
      • accept image
      • accept wide range of file types
        • accept video
    • output
      • generation
        • visualizations
      • linkages to something outside the article
    • summarization and extraction directly from a web URL, skipping the web-scraping step before passing content to the LLM
      • to what extent can an AI model extract and summarize content given only the URL? (signal-to-noise ratio)
