From e4cb5a14b3c46b402d957616cb6b944bc8088ae1 Mon Sep 17 00:00:00 2001
From: Robin Singh <43510291+imrobinsingh@users.noreply.github.com>
Date: Tue, 17 Mar 2026 02:05:44 +0530
Subject: [PATCH] =?UTF-8?q?feat(skill):=20add=20data-scraper-agent=20?=
 =?UTF-8?q?=E2=80=94=20AI-powered=20public=20data=20collection=20for=20any?=
 =?UTF-8?q?=20source=20(#503)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* feat(skill): add data-scraper-agent skill

Workflow skill for building AI-powered public data collection agents.
Covers any scraping target: job boards, prices, news, GitHub, sports, events.

- Full architecture guide (config.yaml, scraper/, ai/, storage/)
- Gemini Flash free tier client with 4-model fallback chain
- Batch API pattern (5 items/call) — stays within free tier
- Feedback learning loop from user decisions
- Notion / Sheets / Supabase storage templates
- GitHub Actions cron schedule (100% free)
- Anti-patterns table, free tier limits reference, quality checklist
- Real-world examples and reference implementation (job-hunt-agent)

* fix(skill): address PR #503 review violations in data-scraper-agent

- Read batch_size from config.yaml instead of hardcoded constant
- Branch main.py on storage.provider; label example as Notion-only
- Replace undefined sync_feedback() with load_feedback() + comment
- Add commented Playwright browser install step to CI workflow
- Add permissions: contents: write; remove silent `git push || true`
- Remove external unvetted repo link from Reference Implementation
- Move import json to top of pipeline.py block (was after usage)
- Guard context.md read with exists() check; fall back to empty string
- Replace deprecated datetime.utcnow() with datetime.now(timezone.utc)
- Remove duplicate config.yaml entry from project directory template
---
 skills/data-scraper-agent/SKILL.md | 764 +++++++++++++++++++++++++++++
 1 file changed, 764 insertions(+)
 create mode 100644 skills/data-scraper-agent/SKILL.md

diff --git a/skills/data-scraper-agent/SKILL.md b/skills/data-scraper-agent/SKILL.md
new file mode 100644
index 00000000..72a6548d
--- /dev/null
+++ b/skills/data-scraper-agent/SKILL.md
@@ -0,0 +1,764 @@
+---
+name: data-scraper-agent
+description: Build a fully automated AI-powered data collection agent for any public source — job boards, prices, news, GitHub, sports, anything. Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs 100% free on GitHub Actions. Use when the user wants to monitor, collect, or track any public data automatically.
+origin: community
+---
+
+# Data Scraper Agent
+
+Build a production-ready, AI-powered data collection agent for any public data source.
+Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.
+
+**Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase**
+
+## When to Activate
+
+- User wants to scrape or monitor any public website or API
+- User says "build a bot that checks...", "monitor X for me", "collect data from..."
+- User wants to track jobs, prices, news, repos, sports scores, events, listings
+- User asks how to automate data collection without paying for hosting
+- User wants an agent that gets smarter over time based on their decisions
+
+## Core Concepts
+
+### The Three Layers
+
+Every data scraper agent has three layers:
+
+```
+COLLECT → ENRICH → STORE
+  │           │        │
+Scraper    AI (LLM)  Database
+runs on    scores/   Notion /
+schedule   summarises Sheets /
+           & classifies Supabase
+```
+
+### Free Stack
+
+| Layer | Tool | Why |
+|---|---|---|
+| **Scraping** | `requests` + `BeautifulSoup` | No cost, covers 80% of public sites |
+| **JS-rendered sites** | `playwright` (free) | When HTML scraping fails |
+| **AI enrichment** | Gemini Flash via REST API | 500 req/day, 1M tokens/day — free |
+| **Storage** | Notion API | Free tier, great UI for review |
+| **Schedule** | GitHub Actions cron | Free for public repos |
+| **Learning** | JSON feedback file in repo | Zero infra, persists in git |
+
+### AI Model Fallback Chain
+
+Build agents to auto-fallback across Gemini models on quota exhaustion:
+
+```
+gemini-2.0-flash-lite (30 RPM) →
+gemini-2.0-flash (15 RPM) →
+gemini-2.5-flash (10 RPM) →
+gemini-flash-lite-latest (fallback)
+```
+
+### Batch API Calls for Efficiency
+
+Never call the LLM once per item. Always batch:
+
+```python
+# BAD: 33 API calls for 33 items
+for item in items:
+    result = call_ai(item)  # 33 calls → hits rate limit
+
+# GOOD: 7 API calls for 33 items (batch size 5)
+for batch in chunks(items, size=5):
+    results = call_ai(batch)  # 7 calls → stays within free tier
+```
+
+---
+
+## Workflow
+
+### Step 1: Understand the Goal
+
+Ask the user:
+
+1. **What to collect:** "What data source? URL / API / RSS / public endpoint?"
+2. **What to extract:** "What fields matter? Title, price, URL, date, score?"
+3. **How to store:** "Where should results go? Notion, Google Sheets, Supabase, or local file?"
+4. **How to enrich:** "Do you want AI to score, summarise, classify, or match each item?"
+5. **Frequency:** "How often should it run? Every hour, daily, weekly?"
+
+Common examples to prompt:
+- Job boards → score relevance to resume
+- Product prices → alert on drops
+- GitHub repos → summarise new releases
+- News feeds → classify by topic + sentiment
+- Sports results → extract stats to tracker
+- Events calendar → filter by interest
+
+---
+
+### Step 2: Design the Agent Architecture
+
+Generate this directory structure for the user:
+
+```
+my-agent/
+├── config.yaml              # User customises this (keywords, filters, preferences)
+├── profile/
+│   └── context.md           # User context the AI uses (resume, interests, criteria)
+├── scraper/
+│   ├── __init__.py
+│   ├── main.py              # Orchestrator: scrape → enrich → store
+│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
+│   └── sources/
+│       ├── __init__.py
+│       └── source_name.py   # One file per data source
+├── ai/
+│   ├── __init__.py
+│   ├── client.py            # Gemini REST client with model fallback
+│   ├── pipeline.py          # Batch AI analysis
+│   ├── jd_fetcher.py        # Fetch full content from URLs (optional)
+│   └── memory.py            # Learn from user feedback
+├── storage/
+│   ├── __init__.py
+│   └── notion_sync.py       # Or sheets_sync.py / supabase_sync.py
+├── data/
+│   └── feedback.json        # User decision history (auto-updated)
+├── .env.example
+├── setup.py                 # One-time DB/schema creation
+├── enrich_existing.py       # Backfill AI scores on old rows
+├── requirements.txt
+└── .github/
+    └── workflows/
+        └── scraper.yml      # GitHub Actions schedule
+```
+
+---
+
+### Step 3: Build the Scraper Source
+
+Template for any data source:
+
+```python
+# scraper/sources/my_source.py
+"""
+[Source Name] — scrapes [what] from [where].
+Method: [REST API / HTML scraping / RSS feed]
+"""
+import requests
+from bs4 import BeautifulSoup
+from datetime import datetime, timezone
+from scraper.filters import is_relevant
+
+HEADERS = {
+    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
+}
+
+
+def fetch() -> list[dict]:
+    """
+    Returns a list of items with consistent schema.
+    Each item must have at minimum: name, url, date_found.
+    """
+    results = []
+
+    # ---- REST API source ----
+    resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)
+    if resp.status_code == 200:
+        for item in resp.json().get("results", []):
+            if not is_relevant(item.get("title", "")):
+                continue
+            results.append(_normalise(item))
+
+    return results
+
+
+def _normalise(raw: dict) -> dict:
+    """Convert raw API/HTML data to the standard schema."""
+    return {
+        "name": raw.get("title", ""),
+        "url": raw.get("link", ""),
+        "source": "MySource",
+        "date_found": datetime.now(timezone.utc).date().isoformat(),
+        # add domain-specific fields here
+    }
+```
+
+**HTML scraping pattern:**
+```python
+soup = BeautifulSoup(resp.text, "lxml")
+for card in soup.select("[class*='listing']"):
+    title = card.select_one("h2, h3").get_text(strip=True)
+    link = card.select_one("a")["href"]
+    if not link.startswith("http"):
+        link = f"https://example.com{link}"
+```
+
+**RSS feed pattern:**
+```python
+import xml.etree.ElementTree as ET
+root = ET.fromstring(resp.text)
+for item in root.findall(".//item"):
+    title = item.findtext("title", "")
+    link = item.findtext("link", "")
+```
+
+---
+
+### Step 4: Build the Gemini AI Client
+
+```python
+# ai/client.py
+import os, json, time, requests
+
+_last_call = 0.0
+
+MODEL_FALLBACK = [
+    "gemini-2.0-flash-lite",
+    "gemini-2.0-flash",
+    "gemini-2.5-flash",
+    "gemini-flash-lite-latest",
+]
+
+
+def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:
+    """Call Gemini with auto-fallback on 429. Returns parsed JSON or {}."""
+    global _last_call
+
+    api_key = os.environ.get("GEMINI_API_KEY", "")
+    if not api_key:
+        return {}
+
+    elapsed = time.time() - _last_call
+    if elapsed < rate_limit:
+        time.sleep(rate_limit - elapsed)
+
+    models = [model] + [m for m in MODEL_FALLBACK if m != model] if model else MODEL_FALLBACK
+    _last_call = time.time()
+
+    for m in models:
+        url = f"https://generativelanguage.googleapis.com/v1beta/models/{m}:generateContent?key={api_key}"
+        payload = {
+            "contents": [{"parts": [{"text": prompt}]}],
+            "generationConfig": {
+                "responseMimeType": "application/json",
+                "temperature": 0.3,
+                "maxOutputTokens": 2048,
+            },
+        }
+        try:
+            resp = requests.post(url, json=payload, timeout=30)
+            if resp.status_code == 200:
+                return _parse(resp)
+            if resp.status_code in (429, 404):
+                time.sleep(1)
+                continue
+            return {}
+        except requests.RequestException:
+            return {}
+
+    return {}
+
+
+def _parse(resp) -> dict:
+    try:
+        text = (
+            resp.json()
+            .get("candidates", [{}])[0]
+            .get("content", {})
+            .get("parts", [{}])[0]
+            .get("text", "")
+            .strip()
+        )
+        if text.startswith("```"):
+            text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
+        return json.loads(text)
+    except (json.JSONDecodeError, KeyError):
+        return {}
+```
+
+---
+
+### Step 5: Build the AI Pipeline (Batch)
+
+```python
+# ai/pipeline.py
+import json
+import yaml
+from pathlib import Path
+from ai.client import generate
+
+def analyse_batch(items: list[dict], context: str = "", preference_prompt: str = "") -> list[dict]:
+    """Analyse items in batches. Returns items enriched with AI fields."""
+    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
+    model = config.get("ai", {}).get("model", "gemini-2.5-flash")
+    rate_limit = config.get("ai", {}).get("rate_limit_seconds", 7.0)
+    min_score = config.get("ai", {}).get("min_score", 0)
+    batch_size = config.get("ai", {}).get("batch_size", 5)
+
+    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
+    print(f"  [AI] {len(items)} items → {len(batches)} API calls")
+
+    enriched = []
+    for i, batch in enumerate(batches):
+        print(f"  [AI] Batch {i + 1}/{len(batches)}...")
+        prompt = _build_prompt(batch, context, preference_prompt, config)
+        result = generate(prompt, model=model, rate_limit=rate_limit)
+
+        analyses = result.get("analyses", [])
+        for j, item in enumerate(batch):
+            ai = analyses[j] if j < len(analyses) else {}
+            if ai:
+                score = max(0, min(100, int(ai.get("score", 0))))
+                if min_score and score < min_score:
+                    continue
+                enriched.append({**item, "ai_score": score, "ai_summary": ai.get("summary", ""), "ai_notes": ai.get("notes", "")})
+            else:
+                enriched.append(item)
+
+    return enriched
+
+
+def _build_prompt(batch, context, preference_prompt, config):
+    priorities = config.get("priorities", [])
+    items_text = "\n\n".join(
+        f"Item {i+1}: {json.dumps({k: v for k, v in item.items() if not k.startswith('_')})}"
+        for i, item in enumerate(batch)
+    )
+
+    return f"""Analyse these {len(batch)} items and return a JSON object.
+
+# Items
+{items_text}
+
+# User Context
+{context[:800] if context else "Not provided"}
+
+# User Priorities
+{chr(10).join(f"- {p}" for p in priorities)}
+
+{preference_prompt}
+
+# Instructions
+Return: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": "<why this matches or doesn't>"}} for each item in order]}}
+Be concise. Score 90+=excellent match, 70-89=good, 50-69=ok, <50=weak."""
+```
+
+---
+
+### Step 6: Build the Feedback Learning System
+
+```python
+# ai/memory.py
+"""Learn from user decisions to improve future scoring."""
+import json
+from pathlib import Path
+
+FEEDBACK_PATH = Path(__file__).parent.parent / "data" / "feedback.json"
+
+
+def load_feedback() -> dict:
+    if FEEDBACK_PATH.exists():
+        try:
+            return json.loads(FEEDBACK_PATH.read_text())
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {"positive": [], "negative": []}
+
+
+def save_feedback(fb: dict):
+    FEEDBACK_PATH.parent.mkdir(parents=True, exist_ok=True)
+    FEEDBACK_PATH.write_text(json.dumps(fb, indent=2))
+
+
+def build_preference_prompt(feedback: dict, max_examples: int = 15) -> str:
+    """Convert feedback history into a prompt bias section."""
+    lines = []
+    if feedback.get("positive"):
+        lines.append("# Items the user LIKED (positive signal):")
+        for e in feedback["positive"][-max_examples:]:
+            lines.append(f"- {e}")
+    if feedback.get("negative"):
+        lines.append("\n# Items the user SKIPPED/REJECTED (negative signal):")
+        for e in feedback["negative"][-max_examples:]:
+            lines.append(f"- {e}")
+    if lines:
+        lines.append("\nUse these patterns to bias scoring on new items.")
+    return "\n".join(lines)
+```
+
+**Integration with your storage layer:** after each run, query your DB for items with positive/negative status and call `save_feedback()` with the extracted patterns.
+
+---
+
+### Step 7: Build Storage (Notion example)
+
+```python
+# storage/notion_sync.py
+import os
+from notion_client import Client
+from notion_client.errors import APIResponseError
+
+_client = None
+
+def get_client():
+    global _client
+    if _client is None:
+        _client = Client(auth=os.environ["NOTION_TOKEN"])
+    return _client
+
+def get_existing_urls(db_id: str) -> set[str]:
+    """Fetch all URLs already stored — used for deduplication."""
+    client, seen, cursor = get_client(), set(), None
+    while True:
+        resp = client.databases.query(database_id=db_id, page_size=100, **{"start_cursor": cursor} if cursor else {})
+        for page in resp["results"]:
+            url = page["properties"].get("URL", {}).get("url", "")
+            if url: seen.add(url)
+        if not resp["has_more"]: break
+        cursor = resp["next_cursor"]
+    return seen
+
+def push_item(db_id: str, item: dict) -> bool:
+    """Push one item to Notion. Returns True on success."""
+    props = {
+        "Name": {"title": [{"text": {"content": item.get("name", "")[:100]}}]},
+        "URL": {"url": item.get("url")},
+        "Source": {"select": {"name": item.get("source", "Unknown")}},
+        "Date Found": {"date": {"start": item.get("date_found")}},
+        "Status": {"select": {"name": "New"}},
+    }
+    # AI fields
+    if item.get("ai_score") is not None:
+        props["AI Score"] = {"number": item["ai_score"]}
+    if item.get("ai_summary"):
+        props["Summary"] = {"rich_text": [{"text": {"content": item["ai_summary"][:2000]}}]}
+    if item.get("ai_notes"):
+        props["Notes"] = {"rich_text": [{"text": {"content": item["ai_notes"][:2000]}}]}
+
+    try:
+        get_client().pages.create(parent={"database_id": db_id}, properties=props)
+        return True
+    except APIResponseError as e:
+        print(f"[notion] Push failed: {e}")
+        return False
+
+def sync(db_id: str, items: list[dict]) -> tuple[int, int]:
+    existing = get_existing_urls(db_id)
+    added = skipped = 0
+    for item in items:
+        if item.get("url") in existing:
+            skipped += 1; continue
+        if push_item(db_id, item):
+            added += 1; existing.add(item["url"])
+        else:
+            skipped += 1
+    return added, skipped
+```
+
+---
+
+### Step 8: Orchestrate in main.py
+
+```python
+# scraper/main.py
+import os, sys, yaml
+from pathlib import Path
+from dotenv import load_dotenv
+
+load_dotenv()
+
+from scraper.sources import my_source          # add your sources
+
+# NOTE: This example uses Notion. If storage.provider is "sheets" or "supabase",
+# replace this import with storage.sheets_sync or storage.supabase_sync and update
+# the env var and sync() call accordingly.
+from storage.notion_sync import sync
+
+SOURCES = [
+    ("My Source", my_source.fetch),
+]
+
+def ai_enabled():
+    return bool(os.environ.get("GEMINI_API_KEY"))
+
+def main():
+    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
+    provider = config.get("storage", {}).get("provider", "notion")
+
+    # Resolve the storage target identifier from env based on provider
+    if provider == "notion":
+        db_id = os.environ.get("NOTION_DATABASE_ID")
+        if not db_id:
+            print("ERROR: NOTION_DATABASE_ID not set"); sys.exit(1)
+    else:
+        # Extend here for sheets (SHEET_ID) or supabase (SUPABASE_TABLE) etc.
+        print(f"ERROR: provider '{provider}' not yet wired in main.py"); sys.exit(1)
+
+    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
+    all_items = []
+
+    for name, fetch_fn in SOURCES:
+        try:
+            items = fetch_fn()
+            print(f"[{name}] {len(items)} items")
+            all_items.extend(items)
+        except Exception as e:
+            print(f"[{name}] FAILED: {e}")
+
+    # Deduplicate by URL
+    seen, deduped = set(), []
+    for item in all_items:
+        if (url := item.get("url", "")) and url not in seen:
+            seen.add(url); deduped.append(item)
+
+    print(f"Unique items: {len(deduped)}")
+
+    if ai_enabled() and deduped:
+        from ai.memory import load_feedback, build_preference_prompt
+        from ai.pipeline import analyse_batch
+
+        # load_feedback() reads data/feedback.json written by your feedback sync script.
+        # To keep it current, implement a separate feedback_sync.py that queries your
+        # storage provider for items with positive/negative statuses and calls save_feedback().
+        feedback = load_feedback()
+        preference = build_preference_prompt(feedback)
+        context_path = Path(__file__).parent.parent / "profile" / "context.md"
+        context = context_path.read_text() if context_path.exists() else ""
+        deduped = analyse_batch(deduped, context=context, preference_prompt=preference)
+    else:
+        print("[AI] Skipped — GEMINI_API_KEY not set")
+
+    added, skipped = sync(db_id, deduped)
+    print(f"Done — {added} new, {skipped} existing")
+
+if __name__ == "__main__":
+    main()
+```
+
+---
+
+### Step 9: GitHub Actions Workflow
+
+```yaml
+# .github/workflows/scraper.yml
+name: Data Scraper Agent
+
+on:
+  schedule:
+    - cron: "0 */3 * * *"  # every 3 hours — adjust to your needs
+  workflow_dispatch:        # allow manual trigger
+
+permissions:
+  contents: write   # required for the feedback-history commit step
+
+jobs:
+  scrape:
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: "pip"
+
+      - run: pip install -r requirements.txt
+
+      # Uncomment if Playwright is enabled in requirements.txt
+      # - name: Install Playwright browsers
+      #   run: python -m playwright install chromium --with-deps
+
+      - name: Run agent
+        env:
+          NOTION_TOKEN: ${{ secrets.NOTION_TOKEN }}
+          NOTION_DATABASE_ID: ${{ secrets.NOTION_DATABASE_ID }}
+          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+        run: python -m scraper.main
+
+      - name: Commit feedback history
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+          git add data/feedback.json || true
+          git diff --cached --quiet || git commit -m "chore: update feedback history"
+          git push
+```
+
+---
+
+### Step 10: config.yaml Template
+
+```yaml
+# Customise this file — no code changes needed
+
+# What to collect (pre-filter before AI)
+filters:
+  required_keywords: []      # item must contain at least one
+  blocked_keywords: []       # item must not contain any
+
+# Your priorities — AI uses these for scoring
+priorities:
+  - "example priority 1"
+  - "example priority 2"
+
+# Storage
+storage:
+  provider: "notion"         # notion | sheets | supabase | sqlite
+
+# Feedback learning
+feedback:
+  positive_statuses: ["Saved", "Applied", "Interested"]
+  negative_statuses: ["Skip", "Rejected", "Not relevant"]
+
+# AI settings
+ai:
+  enabled: true
+  model: "gemini-2.5-flash"
+  min_score: 0               # filter out items below this score
+  rate_limit_seconds: 7      # seconds between API calls
+  batch_size: 5              # items per API call
+```
+
+---
+
+## Common Scraping Patterns
+
+### Pattern 1: REST API (easiest)
+```python
+resp = requests.get(url, params={"q": query}, headers=HEADERS, timeout=15)
+items = resp.json().get("results", [])
+```
+
+### Pattern 2: HTML Scraping
+```python
+soup = BeautifulSoup(resp.text, "lxml")
+for card in soup.select(".listing-card"):
+    title = card.select_one("h2").get_text(strip=True)
+    href = card.select_one("a")["href"]
+```
+
+### Pattern 3: RSS Feed
+```python
+import xml.etree.ElementTree as ET
+root = ET.fromstring(resp.text)
+for item in root.findall(".//item"):
+    title = item.findtext("title", "")
+    link = item.findtext("link", "")
+    pub_date = item.findtext("pubDate", "")
+```
+
+### Pattern 4: Paginated API
+```python
+page = 1
+while True:
+    resp = requests.get(url, params={"page": page, "limit": 50}, timeout=15)
+    data = resp.json()
+    items = data.get("results", [])
+    if not items:
+        break
+    for item in items:
+        results.append(_normalise(item))
+    if not data.get("has_more"):
+        break
+    page += 1
+```
+
+### Pattern 5: JS-Rendered Pages (Playwright)
+```python
+from playwright.sync_api import sync_playwright
+
+with sync_playwright() as p:
+    browser = p.chromium.launch()
+    page = browser.new_page()
+    page.goto(url)
+    page.wait_for_selector(".listing")
+    html = page.content()
+    browser.close()
+
+soup = BeautifulSoup(html, "lxml")
+```
+
+---
+
+## Anti-Patterns to Avoid
+
+| Anti-pattern | Problem | Fix |
+|---|---|---|
+| One LLM call per item | Hits rate limits instantly | Batch 5 items per call |
+| Hardcoded keywords in code | Not reusable | Move all config to `config.yaml` |
+| Scraping without rate limit | IP ban | Add `time.sleep(1)` between requests |
+| Storing secrets in code | Security risk | Always use `.env` + GitHub Secrets |
+| No deduplication | Duplicate rows pile up | Always check URL before pushing |
+| Ignoring `robots.txt` | Legal/ethical risk | Respect crawl rules; use public APIs when available |
+| JS-rendered sites with `requests` | Empty response | Use Playwright or look for the underlying API |
+| `maxOutputTokens` too low | Truncated JSON, parse error | Use 2048+ for batch responses |
+
+---
+
+## Free Tier Limits Reference
+
+| Service | Free Limit | Typical Usage |
+|---|---|---|
+| Gemini Flash Lite | 30 RPM, 1500 RPD | ~56 req/day at 3-hr intervals |
+| Gemini 2.0 Flash | 15 RPM, 1500 RPD | Good fallback |
+| Gemini 2.5 Flash | 10 RPM, 500 RPD | Use sparingly |
+| GitHub Actions | Unlimited (public repos) | ~20 min/day |
+| Notion API | Unlimited | ~200 writes/day |
+| Supabase | 500MB DB, 2GB transfer | Fine for most agents |
+| Google Sheets API | 300 req/min | Works for small agents |
+
+---
+
+## Requirements Template
+
+```
+requests==2.31.0
+beautifulsoup4==4.12.3
+lxml==5.1.0
+python-dotenv==1.0.1
+pyyaml==6.0.2
+notion-client==2.2.1   # if using Notion
+# playwright==1.40.0   # uncomment for JS-rendered sites
+```
+
+---
+
+## Quality Checklist
+
+Before marking the agent complete:
+
+- [ ] `config.yaml` controls all user-facing settings — no hardcoded values
+- [ ] `profile/context.md` holds user-specific context for AI matching
+- [ ] Deduplication by URL before every storage push
+- [ ] Gemini client has model fallback chain (4 models)
+- [ ] Batch size ≤ 5 items per API call
+- [ ] `maxOutputTokens` ≥ 2048
+- [ ] `.env` is in `.gitignore`
+- [ ] `.env.example` provided for onboarding
+- [ ] `setup.py` creates DB schema on first run
+- [ ] `enrich_existing.py` backfills AI scores on old rows
+- [ ] GitHub Actions workflow commits `feedback.json` after each run
+- [ ] README covers: setup in < 5 minutes, required secrets, customisation
+
+---
+
+## Real-World Examples
+
+```
+"Build me an agent that monitors Hacker News for AI startup funding news"
+"Scrape product prices from 3 e-commerce sites and alert when they drop"
+"Track new GitHub repos tagged with 'llm' or 'agents' — summarise each one"
+"Collect Chief of Staff job listings from LinkedIn and Cutshort into Notion"
+"Monitor a subreddit for posts mentioning my company — classify sentiment"
+"Scrape new academic papers from arXiv on a topic I care about daily"
+"Track sports fixture results and keep a running table in Google Sheets"
+"Build a real estate listing watcher — alert on new properties under ₹1 Cr"
+```
+
+---
+
+## Reference Implementation
+
+A complete working agent built with this exact architecture would scrape 4+ sources,
+batch Gemini calls, learn from Applied/Rejected decisions stored in Notion, and run
+100% free on GitHub Actions. Follow Steps 1–9 above to build your own.