One channel, end to end — every API call, every table write, every failure mode we actually hit.
Run it yourself with npm run harvest -- 홍석천 (single-channel mode).
@gaypig1111 → channels.list?forHandle=gaypig1111 → UCIP3hSJruPL4dIi95lsuCZA (410K subs).
UC ids pass straight through with zero API cost — which is why TARGETS prefers them.
channels DB write #1INSERT INTO channels (channel_id, channel_name, focus, notes) VALUES ($1, $2, $3, $4) ON CONFLICT (channel_id) DO UPDATE SET -- idempotent: re-running never duplicates channel_name = EXCLUDED.channel_name, focus = EXCLUDED.focus, notes = EXCLUDED.notes;
Every write in this pipeline is an upsert against a natural unique key — here channel_id.
That single discipline is what makes the whole harvester resumable and safe to re-run daily.
The most expensive YouTube call in the app (100 of the 10,000 daily units). One call per channel per run. For 홍석천이원일 it returned 50 videos — only 20 of which are ≥ 3 minutes (the rest are Shorts).
contentDetails.duration →
.match() on undefined threw, the catch returned [], and the whole channel got
"0 eligible videos" (zeroed 성시경 in 3 of 4 scheduled runs).item.contentDetails?.duration ?? 'PT0S' — one premiere can no longer blank a channel.// include-words OVERRIDE excludes — decide which error you'd rather make: // a false keep costs 1 Gemini call; a false skip silently loses a map point. INCLUDE: /맛집|식당|먹을텐데|또간집|노포|밥집|뷔페|횟집|술집|안주|포차|회식|국밥|해장국/ EXCLUDE: music·talk /부를텐데|만날텐데|레시피|LIVE/ · marts /마트|편의점|꿀템/ home-cooking /집에서.{0,16}(구웠|구워|만들)/ · outdoor /한라산|등반|올레길/ overseas /도쿄|오사카|뉴욕|일본|미국|베트남|하노이|해외/ per-channel: 성시경 → includeOnly /먹을텐데/ // 11/12 of its uploads are music/talk
For 홍석천이원일 all 20 long-form titles pass — including "이원일의 레시피!… 피자 맛집" which survives only because include-words win. Before this filter existed, ~60–70% of a day's Gemini quota went to music shows and mart hauls.
-- ① did this video already yield restaurants? SELECT 1 FROM restaurants WHERE video_id = $1; -- ② or did the LLM already confirm it has none? (zero-yield tombstone) SELECT extraction_empty_at FROM videos WHERE video_id = $1;
videos.extraction_empty_at is set only when the LLM responded and found nothing
(a quota-dead call sets nothing, so those stay retryable).runPipeline, ×10 videos)src/pipeline.ts orchestrates 5 steps; concurrency 2 videos at a time, exponential backoff between retries.
videos videos.list · 1 unit · DB write #2INSERT INTO videos (video_id, title, channel_title, description, published_at) VALUES ($1,$2,$3,$4,$5) ON CONFLICT (video_id) DO UPDATE SET … , updated_at = now();
transcripts 0 quota (scrape) · DB write #3INSERT INTO transcripts (video_id, transcript_data, language_code, has_korean) VALUES ($1, $2::jsonb, $3, $4) ON CONFLICT (video_id, language_code) DO UPDATE SET transcript_data = EXCLUDED.transcript_data;
No official API exists — this scrapes YouTube's caption endpoint. It's the flakiest step:
datacenter IPs get blocked (hence optional PROXY_URL for EC2), and it's cached in the DB so re-runs are free.
has_korean gates whether the LLM step is worth running.
comment_threads 1 unit/page (100 threads) · DB write #4INSERT INTO comment_threads (video_id, youtube_thread_id, top_level_comment, replies) VALUES ($1,$2,$3::jsonb,$4::jsonb) ON CONFLICT (youtube_thread_id) DO UPDATE SET …;
Why comments? Viewers name the restaurant ("여기 ○○집이에요!") and correct transcript typos — they're free disambiguation signal for the LLM. 홍석천이원일 videos carry hundreds of threads each.
prompt = title + description + transcript + comments
+ "return ONLY establishments actually REVIEWED — not dishes, not name-drops"
→ [{"name":"무교동북어국집","area":"무교동"}, …] // JSON array of {name, area}
Output is post-filtered: placeholder names (/이름\s*없|간판/) are dropped before enrichment.
The result carries llmResponded — distinguishing "the model said zero" from "the call died".
null → looked identical to
"no restaurants" → quota-era runs persisted 13 food videos as empty (incl. one with the restaurant name in the title).retryDelay and retry;
per-day 429 → latch isDailyQuotaExhausted() and the whole harvest fail-fasts in seconds instead of grinding.// ordered query plan — most specific first, validation on the loose end 0. overseas guard: area ~ /일본|도쿄|뉴욕|하노이…/ → skip entirely (Naver = Korea only) 1. "무교동북어국집 무교동" // name + full area 2. "무교동북어국집 무교동"[city] // name + first token of area 3. "무교동북어국집" + validate // bare name — hit's address MUST contain an area token 4. errata spell-fix, last resort // it "corrected" 퍼옹부이 → 포옹부인 (nonsense) coords: mapx/mapy ÷ 10,000,000 → lat/lng; else geocode the address
restaurants, or tombstone → videos DB write #5INSERT INTO restaurants (video_id, restaurant_name, area, naver_link, address, road_address, latitude, longitude, category, description, last_enriched_at) VALUES (…, now()) ON CONFLICT (video_id, restaurant_name) DO UPDATE SET … ; -- same place re-reviewed = update, not dup -- OR, when llmResponded ∧ names.length = 0: UPDATE videos SET extraction_empty_at = now() WHERE video_id = $1; -- never re-pay for this video
Un-geocoded rows (overseas, or genuinely not on Naver) stay with latitude IS NULL —
stored but invisible to the map. Nothing is deleted; provenance (which video said it) is always kept.
SELECT r.*, v.title AS video_title, v.channel_title FROM restaurants r LEFT JOIN videos v ON r.video_id = v.video_id WHERE r.latitude IS NOT NULL ORDER BY r.created_at DESC;
The viewer never talks to the pipeline — only to the table. That's why the map fills in live while a harvest runs.
| Table | Natural unique key | Written by | Role |
|---|---|---|---|
channels | channel_id | Stage A2 | Which channels we track + focus/notes |
videos | video_id | C1 (+ C6 tombstone) | Metadata · extraction_empty_at = zero-yield tombstone |
transcripts | (video_id, language_code) | C2 | Cached transcript jsonb · has_korean gate |
comment_threads | youtube_thread_id | C3 | LLM disambiguation signal (jsonb top comment + replies) |
restaurants | (video_id, restaurant_name) | C6 | The product. lat/lng NULL = extracted but unmappable |
Three invariants to stay meticulous about:
(1) every write is an upsert on a natural key — re-runs are always safe;
(2) raw inputs (transcript, comments) are cached in the DB so only Gemini/Naver steps ever need re-paying;
(3) a row in restaurants always traces back to the exact video that claimed it.
| Resource | Cost | Daily budget | Binding? |
|---|---|---|---|
| YouTube search.list (B1) | 100 units × 1 | 10,000 units | no |
| YouTube videos/commentThreads (B2, C1, C3) | ~1–10 units/video | no | |
| Transcript scrape (C2) | 0 units (rate-limited by IP) | rarely | |
| Gemini Flash (C4) | 1–3 req/video | free tier: small fixed RPD, resets 16:00 KST | YES — the bottleneck |
| Naver Search/Geocode (C5) | ≤3 req/restaurant | 25,000/day | no (after retry fix) |
Everything in Stage B exists to protect the Gemini column: filter before you pay, tombstone after you've paid once.
Three terms that get blurred together — this repo happens to contain a clean specimen of each.
A prototype's deliverable is the answer, not the code. The Puppeteer transcript scraper answered
"can we get Korean transcripts at all?" — then was thrown away for youtube-transcript-plus.
The research script answered "which channels actually yield restaurants?" — its output lives on as
channels.md notes ("100% hit rate", "1 video → 20 restaurants") while the runs themselves are forgotten.
You're allowed to write ugly prototypes because you've promised to delete them.
npm run tracerOne video → all 5 steps → real YouTube API, real Gemini, real Naver, real Postgres → a real coordinate. No mocks anywhere — that's what makes it a tracer and not a demo: it proves the architecture under real conditions, like a tracer round showing where your fire actually lands. Crucially, tracer code is not disposable: everything we hardened this week (retry latches, tombstones, area validation) was bolted onto that same skeleton. The skeleton never changed; it just grew armor.
"Viable" = a person can open the map and find a restaurant a YouTuber actually visited — that works today (86+ geocoded points, popups linking Naver + the source video). "Minimum" = the honest gaps: search matching is a stub, no dedup across channels reviewing the same place, free-tier pacing. An MVP's job is to generate product learning (do people want this? what do they search for?) — distinct from the tracer's technical learning.
Idempotent upserts, fail-fast quota latches, zero-yield tombstones, geocode validation, squatter-checked channel ids — none of these add features; they add trust in the data. The meticulous per-channel loop going forward:
① verify UC id (squatter check) → ② sample 50 titles, tune filter →
③ npm run harvest -- <channel> → ④ audit new rows (coords in right region? categories food?) →
⑤ check /map → repeat for the next channel.
Generated 2026-06-13 · reflects the code as of the 홍석천이원일 module · API docs · live map