From e4cb5a14b3c46b402d957616cb6b944bc8088ae1 Mon Sep 17 00:00:00 2001
From: Robin Singh <43510291+imrobinsingh@users.noreply.github.com>
Date: Tue, 17 Mar 2026 02:05:44 +0530
Subject: [PATCH] =?UTF-8?q?feat(skill):=20add=20data-scraper-agent=20?=
 =?UTF-8?q?=E2=80=94=20AI-powered=20public=20data=20collection=20for=20any?=
 =?UTF-8?q?=20source=20(#503)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* feat(skill): add data-scraper-agent skill

Workflow skill for building AI-powered public data collection agents.
Covers any scraping target: job boards, prices, news, GitHub, sports, events.

- Full architecture guide (config.yaml, scraper/, ai/, storage/)
- Gemini Flash free tier client with 4-model fallback chain
- Batch API pattern (5 items/call) — stays within free tier
- Feedback learning loop from user decisions
- Notion / Sheets / Supabase storage templates
- GitHub Actions cron schedule (100% free)
- Anti-patterns table, free tier limits reference, quality checklist
- Real-world examples and reference implementation (job-hunt-agent)

* fix(skill): address PR #503 review violations in data-scraper-agent

- Read batch_size from config.yaml instead of hardcoded constant
- Branch main.py on storage.provider; label example as Notion-only
- Replace undefined sync_feedback() with load_feedback() + comment
- Add commented Playwright browser install step to CI workflow
- Add permissions: contents: write; remove silent `git push || true`
- Remove external unvetted repo link from Reference Implementation
- Move import json to top of pipeline.py block (was after usage)
- Guard context.md read with exists() check; fall back to empty string
- Replace deprecated datetime.utcnow() with datetime.now(timezone.utc)
- Remove duplicate config.yaml entry from project directory template
---
 skills/data-scraper-agent/SKILL.md | 764 +++++++++++++++++++++++++++++
 1 file changed, 764 insertions(+)
 create mode 100644 skills/data-scraper-agent/SKILL.md

diff --git a/skills/data-scraper-agent/SKILL.md b/skills/data-scraper-agent/SKILL.md
new file mode 100644
index 00000000..72a6548d
--- /dev/null
+++ b/skills/data-scraper-agent/SKILL.md
@@ -0,0 +1,764 @@
+---
+name: data-scraper-agent
+description: Build a fully automated AI-powered data collection agent for any public source — job boards, prices, news, GitHub, sports, anything. Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs 100% free on GitHub Actions. Use when the user wants to monitor, collect, or track any public data automatically.
+origin: community
+---
+
+# Data Scraper Agent
+
+Build a production-ready, AI-powered data collection agent for any public data source.
+Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.
+
+**Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase**
+
+## When to Activate
+
+- User wants to scrape or monitor any public website or API
+- User says "build a bot that checks...", "monitor X for me", "collect data from..."
+- User wants to track jobs, prices, news, repos, sports scores, events, listings
+- User asks how to automate data collection without paying for hosting
+- User wants an agent that gets smarter over time based on their decisions
+
+## Core Concepts
+
+### The Three Layers
+
+Every data scraper agent has three layers:
+
+```
+COLLECT  →   ENRICH     →  STORE
+   │            │            │
+Scraper      AI (LLM)     Database
+runs on      scores/      Notion /
+schedule     summarises   Sheets /
+             & classifies Supabase
+```
+
+### Free Stack
+
+| Layer | Tool | Why |
+|---|---|---|
+| **Scraping** | `requests` + `BeautifulSoup` | No cost, covers 80% of public sites |
+| **JS-rendered sites** | `playwright` (free) | When HTML scraping fails |
+| **AI enrichment** | Gemini Flash via REST API | 500 req/day, 1M tokens/day — free |
+| **Storage** | Notion API | Free tier, great UI for review |
+| **Schedule** | GitHub Actions cron | Free for public repos |
+| **Learning** | JSON feedback file in repo | Zero infra, persists in git |
+
+### AI Model Fallback Chain
+
+Build agents to auto-fallback across Gemini models on quota exhaustion:
+
+```
+gemini-2.0-flash-lite (30 RPM) →
+gemini-2.0-flash (15 RPM) →
+gemini-2.5-flash (10 RPM) →
+gemini-flash-lite-latest (fallback)
+```
+
+### Batch API Calls for Efficiency
+
+Never call the LLM once per item. Always batch:
+
+```python
+# BAD: 33 API calls for 33 items
+for item in items:
+    result = call_ai(item)  # 33 calls → hits rate limit
+
+# GOOD: 7 API calls for 33 items (batch size 5)
+for batch in chunks(items, size=5):
+    results = call_ai(batch)  # 7 calls → stays within free tier
+```
+
+---
+
+## Workflow
+
+### Step 1: Understand the Goal
+
+Ask the user:
+
+1. **What to collect:** "What data source? URL / API / RSS / public endpoint?"
+2. **What to extract:** "What fields matter? Title, price, URL, date, score?"
+3. **How to store:** "Where should results go? Notion, Google Sheets, Supabase, or local file?"
+4. **How to enrich:** "Do you want AI to score, summarise, classify, or match each item?"
+5. **Frequency:** "How often should it run? Every hour, daily, weekly?"
+
+Common examples to prompt:
+- Job boards → score relevance to resume
+- Product prices → alert on drops
+- GitHub repos → summarise new releases
+- News feeds → classify by topic + sentiment
+- Sports results → extract stats to tracker
+- Events calendar → filter by interest
+
+---
+
+### Step 2: Design the Agent Architecture
+
+Generate this directory structure for the user:
+
+```
+my-agent/
+├── config.yaml              # User customises this (keywords, filters, preferences)
+├── profile/
+│   └── context.md           # User context the AI uses (resume, interests, criteria)
+├── scraper/
+│   ├── __init__.py
+│   ├── main.py              # Orchestrator: scrape → enrich → store
+│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
+│   └── sources/
+│       ├── __init__.py
+│       └── source_name.py   # One file per data source
+├── ai/
+│   ├── __init__.py
+│   ├── client.py            # Gemini REST client with model fallback
+│   ├── pipeline.py          # Batch AI analysis
+│   ├── jd_fetcher.py        # Fetch full content from URLs (optional)
+│   └── memory.py            # Learn from user feedback
+├── storage/
+│   ├── __init__.py
+│   └── notion_sync.py       # Or sheets_sync.py / supabase_sync.py
+├── data/
+│   └── feedback.json        # User decision history (auto-updated)
+├── .env.example
+├── setup.py                 # One-time DB/schema creation
+├── enrich_existing.py       # Backfill AI scores on old rows
+├── requirements.txt
+└── .github/
+    └── workflows/
+        └── scraper.yml      # GitHub Actions schedule
+```
+
+---
+
+### Step 3: Build the Scraper Source
+
+Template for any data source:
+
+```python
+# scraper/sources/my_source.py
+"""
+[Source Name] — scrapes [what] from [where].
+Method: [REST API / HTML scraping / RSS feed]
+"""
+import requests
+from bs4 import BeautifulSoup
+from datetime import datetime, timezone
+from scraper.filters import is_relevant
+
+HEADERS = {
+    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
+}
+
+
+def fetch() -> list[dict]:
+    """
+    Returns a list of items with consistent schema.
+    Each item must have at minimum: name, url, date_found.
+    """
+    results = []
+
+    # ---- REST API source ----
+    resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)
+    if resp.status_code == 200:
+        for item in resp.json().get("results", []):
+            if not is_relevant(item.get("title", "")):
+                continue
+            results.append(_normalise(item))
+
+    return results
+
+
+def _normalise(raw: dict) -> dict:
+    """Convert raw API/HTML data to the standard schema."""
+    return {
+        "name": raw.get("title", ""),
+        "url": raw.get("link", ""),
+        "source": "MySource",
+        "date_found": datetime.now(timezone.utc).date().isoformat(),
+        # add domain-specific fields here
+    }
+```
+
+**HTML scraping pattern:**
+```python
+soup = BeautifulSoup(resp.text, "lxml")
+for card in soup.select("[class*='listing']"):
+    heading, anchor = card.select_one("h2, h3"), card.select_one("a[href]")
+    if heading is None or anchor is None:
+        continue  # skip cards missing a title or link
+    title = heading.get_text(strip=True)
+    link = anchor["href"]
+    if not link.startswith("http"):
+        link = f"https://example.com{link}"
+```
+
+**RSS feed pattern:**
+```python
+import xml.etree.ElementTree as ET
+root = ET.fromstring(resp.text)
+for item in root.findall(".//item"):
+    title = item.findtext("title", "")
+    link = item.findtext("link", "")
+```
+
+---
+
+### Step 4: Build the Gemini AI Client
+
+```python
+# ai/client.py
+import os, json, time, requests
+
+_last_call = 0.0
+
+MODEL_FALLBACK = [
+    "gemini-2.0-flash-lite",
+    "gemini-2.0-flash",
+    "gemini-2.5-flash",
+    "gemini-flash-lite-latest",
+]
+
+
+def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:
+    """Call Gemini with auto-fallback on 429.
+    Returns parsed JSON or {}."""
+    global _last_call
+
+    api_key = os.environ.get("GEMINI_API_KEY", "")
+    if not api_key:
+        return {}
+
+    elapsed = time.time() - _last_call
+    if elapsed < rate_limit:
+        time.sleep(rate_limit - elapsed)
+
+    models = [model] + [m for m in MODEL_FALLBACK if m != model] if model else MODEL_FALLBACK
+    _last_call = time.time()
+
+    for m in models:
+        url = f"https://generativelanguage.googleapis.com/v1beta/models/{m}:generateContent?key={api_key}"
+        payload = {
+            "contents": [{"parts": [{"text": prompt}]}],
+            "generationConfig": {
+                "responseMimeType": "application/json",
+                "temperature": 0.3,
+                "maxOutputTokens": 2048,
+            },
+        }
+        try:
+            resp = requests.post(url, json=payload, timeout=30)
+            if resp.status_code == 200:
+                return _parse(resp)
+            if resp.status_code in (429, 404):
+                time.sleep(1)
+                continue
+            return {}
+        except requests.RequestException:
+            return {}
+
+    return {}
+
+
+def _parse(resp) -> dict:
+    try:
+        text = (
+            resp.json()
+            .get("candidates", [{}])[0]
+            .get("content", {})
+            .get("parts", [{}])[0]
+            .get("text", "")
+            .strip()
+        )
+        if text.startswith("```"):
+            text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
+        return json.loads(text)
+    # IndexError/AttributeError cover an empty "candidates" list or a null field
+    except (json.JSONDecodeError, KeyError, IndexError, AttributeError):
+        return {}
+```
+
+---
+
+### Step 5: Build the AI Pipeline (Batch)
+
+```python
+# ai/pipeline.py
+import json
+import yaml
+from pathlib import Path
+from ai.client import generate
+
+def analyse_batch(items: list[dict], context: str = "", preference_prompt: str = "") -> list[dict]:
+    """Analyse items in batches.
+    Returns items enriched with AI fields."""
+    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
+    model = config.get("ai", {}).get("model", "gemini-2.5-flash")
+    rate_limit = config.get("ai", {}).get("rate_limit_seconds", 7.0)
+    min_score = config.get("ai", {}).get("min_score", 0)
+    batch_size = config.get("ai", {}).get("batch_size", 5)
+
+    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
+    print(f"  [AI] {len(items)} items → {len(batches)} API calls")
+
+    enriched = []
+    for i, batch in enumerate(batches):
+        print(f"  [AI] Batch {i + 1}/{len(batches)}...")
+        prompt = _build_prompt(batch, context, preference_prompt, config)
+        result = generate(prompt, model=model, rate_limit=rate_limit)
+
+        analyses = result.get("analyses", [])
+        for j, item in enumerate(batch):
+            ai = analyses[j] if j < len(analyses) else {}
+            if ai:
+                score = max(0, min(100, int(ai.get("score", 0))))
+                if min_score and score < min_score:
+                    continue
+                enriched.append({**item, "ai_score": score, "ai_summary": ai.get("summary", ""), "ai_notes": ai.get("notes", "")})
+            else:
+                enriched.append(item)
+
+    return enriched
+
+
+def _build_prompt(batch, context, preference_prompt, config):
+    priorities = config.get("priorities", [])
+    items_text = "\n\n".join(
+        f"Item {i+1}: {json.dumps({k: v for k, v in item.items() if not k.startswith('_')})}"
+        for i, item in enumerate(batch)
+    )
+
+    return f"""Analyse these {len(batch)} items and return a JSON object.
+
+# Items
+{items_text}
+
+# User Context
+{context[:800] if context else "Not provided"}
+
+# User Priorities
+{chr(10).join(f"- {p}" for p in priorities)}
+
+{preference_prompt}
+
+# Instructions
+Return: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": ""}} for each item in order]}}
+Be concise.
+Score 90+=excellent match, 70-89=good, 50-69=ok, <50=weak."""
+```
+
+---
+
+### Step 6: Build the Feedback Learning System
+
+```python
+# ai/memory.py
+"""Learn from user decisions to improve future scoring."""
+import json
+from pathlib import Path
+
+FEEDBACK_PATH = Path(__file__).parent.parent / "data" / "feedback.json"
+
+
+def load_feedback() -> dict:
+    if FEEDBACK_PATH.exists():
+        try:
+            return json.loads(FEEDBACK_PATH.read_text())
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {"positive": [], "negative": []}
+
+
+def save_feedback(fb: dict):
+    FEEDBACK_PATH.parent.mkdir(parents=True, exist_ok=True)
+    FEEDBACK_PATH.write_text(json.dumps(fb, indent=2))
+
+
+def build_preference_prompt(feedback: dict, max_examples: int = 15) -> str:
+    """Convert feedback history into a prompt bias section."""
+    lines = []
+    if feedback.get("positive"):
+        lines.append("# Items the user LIKED (positive signal):")
+        for e in feedback["positive"][-max_examples:]:
+            lines.append(f"- {e}")
+    if feedback.get("negative"):
+        lines.append("\n# Items the user SKIPPED/REJECTED (negative signal):")
+        for e in feedback["negative"][-max_examples:]:
+            lines.append(f"- {e}")
+    if lines:
+        lines.append("\nUse these patterns to bias scoring on new items.")
+    return "\n".join(lines)
+```
+
+**Integration with your storage layer:** after each run, query your DB for items with positive/negative status and call `save_feedback()` with the extracted patterns.
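
As a rough sketch of that integration step, the helper below folds reviewed rows into the feedback history. It assumes each row is a plain dict with `name` and `status` keys (a hypothetical shape; adapt it to whatever your storage query returns) and takes the status lists from the `feedback` section of `config.yaml`. `merge_feedback` is an illustrative name, not part of the modules shown in this skill.

```python
def merge_feedback(feedback: dict, rows: list[dict],
                   positive_statuses: list[str],
                   negative_statuses: list[str]) -> dict:
    """Return a new feedback dict with reviewed rows folded in, without duplicates."""
    merged = {
        "positive": list(feedback.get("positive", [])),
        "negative": list(feedback.get("negative", [])),
    }
    for row in rows:
        name = (row.get("name") or "").strip()
        status = row.get("status", "")
        if not name:
            continue
        if status in positive_statuses:
            bucket = "positive"
        elif status in negative_statuses:
            bucket = "negative"
        else:
            continue  # e.g. "New": the user has not decided yet
        if name not in merged[bucket]:
            merged[bucket].append(name)
    return merged


# Hypothetical rows, shaped as a storage query might return them.
rows = [
    {"name": "Senior Data Engineer", "status": "Applied"},
    {"name": "Unpaid internship", "status": "Rejected"},
    {"name": "Backend Developer", "status": "New"},
]
updated = merge_feedback(
    {"positive": [], "negative": []},
    rows,
    positive_statuses=["Saved", "Applied", "Interested"],
    negative_statuses=["Skip", "Rejected", "Not relevant"],
)
print(updated)
```

In a real agent the result would be passed to `save_feedback()` from `ai/memory.py`, so the next run's `build_preference_prompt()` sees the new decisions.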
+ +--- + +### Step 7: Build Storage (Notion example) + +```python +# storage/notion_sync.py +import os +from notion_client import Client +from notion_client.errors import APIResponseError + +_client = None + +def get_client(): + global _client + if _client is None: + _client = Client(auth=os.environ["NOTION_TOKEN"]) + return _client + +def get_existing_urls(db_id: str) -> set[str]: + """Fetch all URLs already stored — used for deduplication.""" + client, seen, cursor = get_client(), set(), None + while True: + resp = client.databases.query(database_id=db_id, page_size=100, **{"start_cursor": cursor} if cursor else {}) + for page in resp["results"]: + url = page["properties"].get("URL", {}).get("url", "") + if url: seen.add(url) + if not resp["has_more"]: break + cursor = resp["next_cursor"] + return seen + +def push_item(db_id: str, item: dict) -> bool: + """Push one item to Notion. Returns True on success.""" + props = { + "Name": {"title": [{"text": {"content": item.get("name", "")[:100]}}]}, + "URL": {"url": item.get("url")}, + "Source": {"select": {"name": item.get("source", "Unknown")}}, + "Date Found": {"date": {"start": item.get("date_found")}}, + "Status": {"select": {"name": "New"}}, + } + # AI fields + if item.get("ai_score") is not None: + props["AI Score"] = {"number": item["ai_score"]} + if item.get("ai_summary"): + props["Summary"] = {"rich_text": [{"text": {"content": item["ai_summary"][:2000]}}]} + if item.get("ai_notes"): + props["Notes"] = {"rich_text": [{"text": {"content": item["ai_notes"][:2000]}}]} + + try: + get_client().pages.create(parent={"database_id": db_id}, properties=props) + return True + except APIResponseError as e: + print(f"[notion] Push failed: {e}") + return False + +def sync(db_id: str, items: list[dict]) -> tuple[int, int]: + existing = get_existing_urls(db_id) + added = skipped = 0 + for item in items: + if item.get("url") in existing: + skipped += 1; continue + if push_item(db_id, item): + added += 1; 
+            existing.add(item["url"])
+        else:
+            skipped += 1
+    return added, skipped
+```
+
+---
+
+### Step 8: Orchestrate in main.py
+
+```python
+# scraper/main.py
+import os, sys, yaml
+from pathlib import Path
+from dotenv import load_dotenv
+
+load_dotenv()
+
+from scraper.sources import my_source  # add your sources
+
+# NOTE: This example uses Notion. If storage.provider is "sheets" or "supabase",
+# replace this import with storage.sheets_sync or storage.supabase_sync and update
+# the env var and sync() call accordingly.
+from storage.notion_sync import sync
+
+SOURCES = [
+    ("My Source", my_source.fetch),
+]
+
+def ai_enabled():
+    return bool(os.environ.get("GEMINI_API_KEY"))
+
+def main():
+    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
+    provider = config.get("storage", {}).get("provider", "notion")
+
+    # Resolve the storage target identifier from env based on provider
+    if provider == "notion":
+        db_id = os.environ.get("NOTION_DATABASE_ID")
+        if not db_id:
+            print("ERROR: NOTION_DATABASE_ID not set"); sys.exit(1)
+    else:
+        # Extend here for sheets (SHEET_ID) or supabase (SUPABASE_TABLE) etc.
+        print(f"ERROR: provider '{provider}' not yet wired in main.py"); sys.exit(1)
+
+    all_items = []
+
+    for name, fetch_fn in SOURCES:
+        try:
+            items = fetch_fn()
+            print(f"[{name}] {len(items)} items")
+            all_items.extend(items)
+        except Exception as e:
+            print(f"[{name}] FAILED: {e}")
+
+    # Deduplicate by URL
+    seen, deduped = set(), []
+    for item in all_items:
+        if (url := item.get("url", "")) and url not in seen:
+            seen.add(url); deduped.append(item)
+
+    print(f"Unique items: {len(deduped)}")
+
+    if ai_enabled() and deduped:
+        from ai.memory import load_feedback, build_preference_prompt
+        from ai.pipeline import analyse_batch
+
+        # load_feedback() reads data/feedback.json written by your feedback sync script.
+        # To keep it current, implement a separate feedback_sync.py that queries your
+        # storage provider for items with positive/negative statuses and calls save_feedback().
+        feedback = load_feedback()
+        preference = build_preference_prompt(feedback)
+        context_path = Path(__file__).parent.parent / "profile" / "context.md"
+        context = context_path.read_text() if context_path.exists() else ""
+        deduped = analyse_batch(deduped, context=context, preference_prompt=preference)
+    elif not ai_enabled():
+        # Only report a missing key; an empty scrape is not a configuration problem
+        print("[AI] Skipped — GEMINI_API_KEY not set")
+
+    added, skipped = sync(db_id, deduped)
+    print(f"Done — {added} new, {skipped} existing")
+
+if __name__ == "__main__":
+    main()
+```
+
+---
+
+### Step 9: GitHub Actions Workflow
+
+```yaml
+# .github/workflows/scraper.yml
+name: Data Scraper Agent
+
+on:
+  schedule:
+    - cron: "0 */3 * * *"  # every 3 hours — adjust to your needs
+  workflow_dispatch:  # allow manual trigger
+
+permissions:
+  contents: write  # required for the feedback-history commit step
+
+jobs:
+  scrape:
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: "pip"
+
+      - run: pip install -r requirements.txt
+
+      # Uncomment if Playwright is enabled in requirements.txt
+      # - name: Install Playwright browsers
+      #   run: python -m playwright install chromium --with-deps
+
+      - name: Run agent
+        env:
+          NOTION_TOKEN: ${{ secrets.NOTION_TOKEN }}
+          NOTION_DATABASE_ID: ${{ secrets.NOTION_DATABASE_ID }}
+          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+        run: python -m scraper.main
+
+      - name: Commit feedback history
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+          git add data/feedback.json || true
+          git diff --cached --quiet || git commit -m "chore: update feedback history"
+          git push
+```
+
+---
+
+### Step 10: config.yaml Template
+
+```yaml
+# Customise this file — no code changes needed
+
+# What to
+# collect (pre-filter before AI)
+filters:
+  required_keywords: []  # item must contain at least one
+  blocked_keywords: []   # item must not contain any
+
+# Your priorities — AI uses these for scoring
+priorities:
+  - "example priority 1"
+  - "example priority 2"
+
+# Storage
+storage:
+  provider: "notion"  # notion | sheets | supabase | sqlite
+
+# Feedback learning
+feedback:
+  positive_statuses: ["Saved", "Applied", "Interested"]
+  negative_statuses: ["Skip", "Rejected", "Not relevant"]
+
+# AI settings
+ai:
+  enabled: true
+  model: "gemini-2.5-flash"
+  min_score: 0            # filter out items below this score
+  rate_limit_seconds: 7   # seconds between API calls
+  batch_size: 5           # items per API call
+```
+
+---
+
+## Common Scraping Patterns
+
+### Pattern 1: REST API (easiest)
+```python
+resp = requests.get(url, params={"q": query}, headers=HEADERS, timeout=15)
+items = resp.json().get("results", [])
+```
+
+### Pattern 2: HTML Scraping
+```python
+soup = BeautifulSoup(resp.text, "lxml")
+for card in soup.select(".listing-card"):
+    title = card.select_one("h2").get_text(strip=True)
+    href = card.select_one("a")["href"]
+```
+
+### Pattern 3: RSS Feed
+```python
+import xml.etree.ElementTree as ET
+root = ET.fromstring(resp.text)
+for item in root.findall(".//item"):
+    title = item.findtext("title", "")
+    link = item.findtext("link", "")
+    pub_date = item.findtext("pubDate", "")
+```
+
+### Pattern 4: Paginated API
+```python
+page = 1
+while True:
+    resp = requests.get(url, params={"page": page, "limit": 50}, timeout=15)
+    data = resp.json()
+    items = data.get("results", [])
+    if not items:
+        break
+    for item in items:
+        results.append(_normalise(item))
+    if not data.get("has_more"):
+        break
+    page += 1
+```
+
+### Pattern 5: JS-Rendered Pages (Playwright)
+```python
+from playwright.sync_api import sync_playwright
+
+with sync_playwright() as p:
+    browser = p.chromium.launch()
+    page = browser.new_page()
+    page.goto(url)
+    page.wait_for_selector(".listing")
+    html = page.content()
+    browser.close()
+
+soup = BeautifulSoup(html, "lxml")
+```
+
+---
+
+## Anti-Patterns to Avoid
+
+| Anti-pattern | Problem | Fix |
+|---|---|---|
+| One LLM call per item | Hits rate limits instantly | Batch 5 items per call |
+| Hardcoded keywords in code | Not reusable | Move all config to `config.yaml` |
+| Scraping without rate limit | IP ban | Add `time.sleep(1)` between requests |
+| Storing secrets in code | Security risk | Always use `.env` + GitHub Secrets |
+| No deduplication | Duplicate rows pile up | Always check URL before pushing |
+| Ignoring `robots.txt` | Legal/ethical risk | Respect crawl rules; use public APIs when available |
+| JS-rendered sites with `requests` | Empty response | Use Playwright or look for the underlying API |
+| `maxOutputTokens` too low | Truncated JSON, parse error | Use 2048+ for batch responses |
+
+---
+
+## Free Tier Limits Reference
+
+| Service | Free Limit | Typical Usage |
+|---|---|---|
+| Gemini Flash Lite | 30 RPM, 1500 RPD | ~56 req/day at 3-hr intervals |
+| Gemini 2.0 Flash | 15 RPM, 1500 RPD | Good fallback |
+| Gemini 2.5 Flash | 10 RPM, 500 RPD | Use sparingly |
+| GitHub Actions | Unlimited (public repos) | ~20 min/day |
+| Notion API | Unlimited | ~200 writes/day |
+| Supabase | 500MB DB, 2GB transfer | Fine for most agents |
+| Google Sheets API | 300 req/min | Works for small agents |
+
+---
+
+## Requirements Template
+
+```
+requests==2.31.0
+beautifulsoup4==4.12.3
+lxml==5.1.0
+python-dotenv==1.0.1
+pyyaml==6.0.2
+notion-client==2.2.1   # if using Notion
+# playwright==1.40.0   # uncomment for JS-rendered sites
+```
+
+---
+
+## Quality Checklist
+
+Before marking the agent complete:
+
+- [ ] `config.yaml` controls all user-facing settings — no hardcoded values
+- [ ] `profile/context.md` holds user-specific context for AI matching
+- [ ] Deduplication by URL before every storage push
+- [ ] Gemini client has model fallback chain (4 models)
+- [ ] Batch size ≤ 5 items per API call
+- [ ] `maxOutputTokens` ≥ 2048
+- [ ] `.env` is in `.gitignore`
+- [ ] `.env.example` provided for onboarding
+- [ ] `setup.py` creates DB schema on first run
+- [ ] `enrich_existing.py` backfills AI scores on old rows
+- [ ] GitHub Actions workflow commits `feedback.json` after each run
+- [ ] README covers: setup in < 5 minutes, required secrets, customisation
+
+---
+
+## Real-World Examples
+
+```
+"Build me an agent that monitors Hacker News for AI startup funding news"
+"Scrape product prices from 3 e-commerce sites and alert when they drop"
+"Track new GitHub repos tagged with 'llm' or 'agents' — summarise each one"
+"Collect Chief of Staff job listings from LinkedIn and Cutshort into Notion"
+"Monitor a subreddit for posts mentioning my company — classify sentiment"
+"Scrape new academic papers from arXiv on a topic I care about daily"
+"Track sports fixture results and keep a running table in Google Sheets"
+"Build a real estate listing watcher — alert on new properties under ₹1 Cr"
+```
+
+---
+
+## Reference Implementation
+
+A complete working agent built with this exact architecture would scrape 4+ sources,
+batch Gemini calls, learn from Applied/Rejected decisions stored in Notion, and run
+100% free on GitHub Actions. Follow Steps 1–10 above to build your own.
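
The source template in Step 3 imports `is_relevant` from `scraper/filters.py`, a module the skill lists in the directory layout but does not show. A minimal sketch follows, assuming the `required_keywords` / `blocked_keywords` semantics from the `config.yaml` template (at least one required keyword, no blocked keyword) and case-insensitive substring matching. Passing the keyword lists as arguments is a simplification for illustration; the template calls `is_relevant(title)` with a single argument, so a real module would presumably load the lists from `config.yaml` instead.

```python
from collections.abc import Sequence


def is_relevant(text: str,
                required_keywords: Sequence[str] = (),
                blocked_keywords: Sequence[str] = ()) -> bool:
    """Rule-based pre-filter: a cheap keyword check that runs before any AI call."""
    haystack = text.lower()
    # Reject as soon as any blocked keyword appears.
    if any(kw.lower() in haystack for kw in blocked_keywords):
        return False
    # With no required keywords configured, everything not blocked passes.
    if not required_keywords:
        return True
    return any(kw.lower() in haystack for kw in required_keywords)


print(is_relevant("Senior Python Engineer", ["python"], ["intern"]))  # True
print(is_relevant("Python Intern", ["python"], ["intern"]))           # False
print(is_relevant("Anything goes"))                                   # True
```

Checking blocked keywords first keeps the rule simple: a blocked term always wins, whatever else the text contains.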