Pipeline Flow. 한 채널이 지도 위 점이 되기까지

Worked example: 홍석천이원일

One channel, end to end — every API call, every table write, every failure mode we actually hit. Run it yourself with npm run harvest -- 홍석천 (single-channel mode).

① channel resolved ② videos listed & filtered ③ per-video pipeline ×10 ④ restaurants on /map

Stage A — Channel registration

Resolve the channel identity YouTube API · 1 unit

scripts/harvest_channels.ts → getChannelId() → src/youtube.ts

@gaypig1111 → channels.list?forHandle=gaypig1111 → UCIP3hSJruPL4dIi95lsuCZA (410K subs). UC ids pass straight through with zero API cost — which is why TARGETS prefers them.

Failure we hit: handles can be squatted. @deudly = a 7-subscriber fake; the real 더들리 is @dudely_08. A handle-only target would silently harvest a stranger's channel.

Mitigation: resolve once, verify subscriber count against expectations, then pin the UC id in TARGETS.

Upsert into `channels` DB write #1

upsertChannel() → src/db.ts

INSERT INTO channels (channel_id, channel_name, focus, notes)
VALUES ($1, $2, $3, $4)
ON CONFLICT (channel_id) DO UPDATE SET   -- idempotent: re-running never duplicates
  channel_name = EXCLUDED.channel_name, focus = EXCLUDED.focus, notes = EXCLUDED.notes;

Every write in this pipeline is an upsert against a natural unique key — here channel_id. That single discipline is what makes the whole harvester resumable and safe to re-run daily.

Stage B — Video discovery & triage

List recent uploads search.list · 100 units ⚠️

findVideosByChannelId(channelId, FETCH_BUFFER=50)

The most expensive YouTube call in the app (100 of the 10,000 daily units). One call per channel per run. For 홍석천이원일 it returned 50 videos — only 20 of which are ≥ 3 minutes (the rest are Shorts).

Fetch durations, drop Shorts videos.list · 1 unit

getVideoDetailsByIds() → MIN_DURATION_SEC = 180

Failure we hit: scheduled premieres have no contentDetails.duration → .match() on undefined threw, the catch returned [], and the whole channel got "0 eligible videos" (zeroed 성시경 in 3 of 4 scheduled runs).

Mitigation: item.contentDetails?.duration ?? 'PT0S' — one premiere can no longer blank a channel.

Title prefilter — don't pay for off-topic videos 0 API calls

isLikelyRestaurantVideo(title, target)

// include-words OVERRIDE excludes — decide which error you'd rather make:
// a false keep costs 1 Gemini call; a false skip silently loses a map point.
INCLUDE: /맛집|식당|먹을텐데|또간집|노포|밥집|뷔페|횟집|술집|안주|포차|회식|국밥|해장국/
EXCLUDE: music·talk /부를텐데|만날텐데|레시피|LIVE/ · marts /마트|편의점|꿀템/
         home-cooking /집에서.{0,16}(구웠|구워|만들)/ · outdoor /한라산|등반|올레길/
         overseas /도쿄|오사카|뉴욕|일본|미국|베트남|하노이|해외/
per-channel: 성시경 → includeOnly /먹을텐데/   // 11/12 of its uploads are music/talk

For 홍석천이원일 all 20 long-form titles pass — including "이원일의 레시피!… 피자 맛집" which survives only because include-words win. Before this filter existed, ~60–70% of a day's Gemini quota went to music shows and mart hauls.

Skip what's already done DB read

alreadyHarvested() — two checks

-- ① did this video already yield restaurants?
SELECT 1 FROM restaurants WHERE video_id = $1;
-- ② or did the LLM already confirm it has none? (zero-yield tombstone)
SELECT extraction_empty_at FROM videos WHERE video_id = $1;

Failure we hit: without ②, confirmed-empty videos were re-extracted ×3 retries every run forever — the 한라산 hiking video alone ate ~15 Gemini calls over 5 days re-confirming it contains no restaurants.

Mitigation: videos.extraction_empty_at is set only when the LLM responded and found nothing (a quota-dead call sets nothing, so those stay retryable).

Stage C — Per-video pipeline (`runPipeline`, ×10 videos)

src/pipeline.ts orchestrates 5 steps; concurrency 2 videos at a time, exponential backoff between retries.

Ingest metadata → `videos` videos.list · 1 unit · DB write #2

ingestVideoById() → upsertVideo()

INSERT INTO videos (video_id, title, channel_title, description, published_at)
VALUES ($1,$2,$3,$4,$5)
ON CONFLICT (video_id) DO UPDATE SET … , updated_at = now();

Extract transcript → `transcripts` 0 quota (scrape) · DB write #3

extractAndSaveTranscript() — youtube-transcript-plus, Korean first

INSERT INTO transcripts (video_id, transcript_data, language_code, has_korean)
VALUES ($1, $2::jsonb, $3, $4)
ON CONFLICT (video_id, language_code) DO UPDATE SET transcript_data = EXCLUDED.transcript_data;

No official API exists — this scrapes YouTube's caption endpoint. It's the flakiest step: datacenter IPs get blocked (hence optional PROXY_URL for EC2), and it's cached in the DB so re-runs are free. has_korean gates whether the LLM step is worth running.

Fetch comments → `comment_threads` 1 unit/page (100 threads) · DB write #4

fetchAndPersistCommentsByVideoId() — paginated

INSERT INTO comment_threads (video_id, youtube_thread_id, top_level_comment, replies)
VALUES ($1,$2,$3::jsonb,$4::jsonb)
ON CONFLICT (youtube_thread_id) DO UPDATE SET …;

Why comments? Viewers name the restaurant ("여기 ○○집이에요!") and correct transcript typos — they're free disambiguation signal for the LLM. 홍석천이원일 videos carry hundreds of threads each.

LLM extraction — names + areas Gemini Flash · 1 req (free tier: limited/day)

extractRestaurantsFromVideo() → src/llm.ts generateContent()

prompt = title + description + transcript + comments
       + "return ONLY establishments actually REVIEWED — not dishes, not name-drops"
→ [{"name":"무교동북어국집","area":"무교동"}, …]   // JSON array of {name, area}

Output is post-filtered: placeholder names (/이름\s*없|간판/) are dropped before enrichment. The result carries llmResponded — distinguishing "the model said zero" from "the call died".

Failure we hit: free-tier 429s were swallowed as null → looked identical to "no restaurants" → quota-era runs persisted 13 food videos as empty (incl. one with the restaurant name in the title).

Mitigation: per-minute 429 → honor the server's retryDelay and retry; per-day 429 → latch isDailyQuotaExhausted() and the whole harvest fail-fasts in seconds instead of grinding.

Naver enrichment — name → 좌표 Naver Local Search ≤3 req + Geocode fallback

enrichRestaurantsWithNaver() → getCoordinatesForRestaurant()

// ordered query plan — most specific first, validation on the loose end
0. overseas guard: area ~ /일본|도쿄|뉴욕|하노이…/ → skip entirely (Naver = Korea only)
1. "무교동북어국집 무교동"        // name + full area
2. "무교동북어국집 무교동"[city]   // name + first token of area
3. "무교동북어국집" + validate     // bare name — hit's address MUST contain an area token
4. errata spell-fix, last resort  // it "corrected" 퍼옹부이 → 포옹부인 (nonsense)
coords: mapx/mapy ÷ 10,000,000 → lat/lng; else geocode the address

Failures we hit: ① Katz's Deli (뉴욕) "geocoded" to 의정부; a Tokyo tonkatsu shop matched a health-goods store in 서초 — Naver silently returns Korean lookalikes for overseas places. ② Naver 429s permanently left Korean restaurants un-geocoded. ③ bare-name fallback matched a Seoul 호떡 stall for a 충주 stall.

Mitigations: ① overseas guard (step 0). ② retry 429/5xx + 150ms spacing — a re-enrich pass then fixed 4 of 5 Korean failures instantly. ③ the validate rule on bare-name hits — it rejected the wrong 전통호떡 live during repair.

Persist → `restaurants`, or tombstone → `videos` DB write #5

upsertManyRestaurants() / markVideoExtractionEmpty()

INSERT INTO restaurants (video_id, restaurant_name, area, naver_link, address,
                         road_address, latitude, longitude, category, description, last_enriched_at)
VALUES (…, now())
ON CONFLICT (video_id, restaurant_name) DO UPDATE SET … ;  -- same place re-reviewed = update, not dup

-- OR, when llmResponded ∧ names.length = 0:
UPDATE videos SET extraction_empty_at = now() WHERE video_id = $1;  -- never re-pay for this video

Un-geocoded rows (overseas, or genuinely not on Naver) stay with latitude IS NULL — stored but invisible to the map. Nothing is deleted; provenance (which video said it) is always kept.

Read path — the map GET /restaurants

src/server.ts → public/map.html (20s polling, clear-then-draw)

SELECT r.*, v.title AS video_title, v.channel_title
FROM restaurants r LEFT JOIN videos v ON r.video_id = v.video_id
WHERE r.latitude IS NOT NULL ORDER BY r.created_at DESC;

The viewer never talks to the pipeline — only to the table. That's why the map fills in live while a harvest runs.

The 5 tables — who writes what

Table	Natural unique key	Written by	Role
`channels`	channel_id	Stage A2	Which channels we track + focus/notes
`videos`	video_id	C1 (+ C6 tombstone)	Metadata · `extraction_empty_at` = zero-yield tombstone
`transcripts`	(video_id, language_code)	C2	Cached transcript jsonb · `has_korean` gate
`comment_threads`	youtube_thread_id	C3	LLM disambiguation signal (jsonb top comment + replies)
`restaurants`	(video_id, restaurant_name)	C6	The product. lat/lng NULL = extracted but unmappable

Three invariants to stay meticulous about: (1) every write is an upsert on a natural key — re-runs are always safe; (2) raw inputs (transcript, comments) are cached in the DB so only Gemini/Naver steps ever need re-paying; (3) a row in restaurants always traces back to the exact video that claimed it.

Quota economics per channel run (홍석천이원일, 10 videos)

Resource	Cost	Daily budget	Binding?
YouTube search.list (B1)	100 units × 1	10,000 units	no
YouTube videos/commentThreads (B2, C1, C3)	~1–10 units/video		no
Transcript scrape (C2)	0 units (rate-limited by IP)		rarely
Gemini Flash (C4)	1–3 req/video	free tier: small fixed RPD, resets 16:00 KST	YES — the bottleneck
Naver Search/Geocode (C5)	≤3 req/restaurant	25,000/day	no (after retry fix)

Everything in Stage B exists to protect the Gemini column: filter before you pay, tombstone after you've paid once.

Where we are: prototype → tracer bullet → MVP

Three terms that get blurred together — this repo happens to contain a clean specimen of each.

Prototype — disposable code that answers a question

legacy/ puppeteer experiments · scripts/research_channels.ts

A prototype's deliverable is the answer, not the code. The Puppeteer transcript scraper answered "can we get Korean transcripts at all?" — then was thrown away for youtube-transcript-plus. The research script answered "which channels actually yield restaurants?" — its output lives on as channels.md notes ("100% hit rate", "1 video → 20 restaurants") while the runs themselves are forgotten. You're allowed to write ugly prototypes because you've promised to delete them.

Tracer bullet — a thin end-to-end slice you keep and flesh out

src/pipeline.ts · `npm run tracer`

One video → all 5 steps → real YouTube API, real Gemini, real Naver, real Postgres → a real coordinate. No mocks anywhere — that's what makes it a tracer and not a demo: it proves the architecture under real conditions, like a tracer round showing where your fire actually lands. Crucially, tracer code is not disposable: everything we hardened this week (retry latches, tombstones, area validation) was bolted onto that same skeleton. The skeleton never changed; it just grew armor.

MVP — the smallest thing users can use, built to learn from

/map + the 16:30 daily self-harvest

"Viable" = a person can open the map and find a restaurant a YouTuber actually visited — that works today (86+ geocoded points, popups linking Naver + the source video). "Minimum" = the honest gaps: search matching is a stub, no dedup across channels reviewing the same place, free-tier pacing. An MVP's job is to generate product learning (do people want this? what do they search for?) — distinct from the tracer's technical learning.

Now — hardening: "works once" → "works unattended, daily, correctly"

This week's work, and the modular channel runbook

Idempotent upserts, fail-fast quota latches, zero-yield tombstones, geocode validation, squatter-checked channel ids — none of these add features; they add trust in the data. The meticulous per-channel loop going forward:

① verify UC id (squatter check) → ② sample 50 titles, tune filter → ③ npm run harvest -- <channel> → ④ audit new rows (coords in right region? categories food?) → ⑤ check /map → repeat for the next channel.

Generated 2026-06-13 · reflects the code as of the 홍석천이원일 module · API docs · live map