Pipeline Flow. 한 채널이 지도 위 점이 되기까지

← 지도로

Worked example: 홍석천이원일

One channel, end to end — every API call, every table write, every failure mode we actually hit. Run it yourself with npm run harvest -- 홍석천 (single-channel mode).

① channel resolved ② videos listed & filtered ③ per-video pipeline ×10 ④ restaurants on /map

Stage A — Channel registration

A1

Resolve the channel identity YouTube API · 1 unit

scripts/harvest_channels.ts → getChannelId() → src/youtube.ts

@gaypig1111channels.list?forHandle=gaypig1111UCIP3hSJruPL4dIi95lsuCZA (410K subs). UC ids pass straight through with zero API cost — which is why TARGETS prefers them.

Failure we hit: handles can be squatted. @deudly = a 7-subscriber fake; the real 더들리 is @dudely_08. A handle-only target would silently harvest a stranger's channel.
Mitigation: resolve once, verify subscriber count against expectations, then pin the UC id in TARGETS.
A2

Upsert into channels DB write #1

upsertChannel() → src/db.ts
INSERT INTO channels (channel_id, channel_name, focus, notes)
VALUES ($1, $2, $3, $4)
ON CONFLICT (channel_id) DO UPDATE SET   -- idempotent: re-running never duplicates
  channel_name = EXCLUDED.channel_name, focus = EXCLUDED.focus, notes = EXCLUDED.notes;

Every write in this pipeline is an upsert against a natural unique key — here channel_id. That single discipline is what makes the whole harvester resumable and safe to re-run daily.

Stage B — Video discovery & triage

B1

List recent uploads search.list · 100 units ⚠️

findVideosByChannelId(channelId, FETCH_BUFFER=50)

The most expensive YouTube call in the app (100 of the 10,000 daily units). One call per channel per run. For 홍석천이원일 it returned 50 videos — only 20 of which are ≥ 3 minutes (the rest are Shorts).

B2

Fetch durations, drop Shorts videos.list · 1 unit

getVideoDetailsByIds() → MIN_DURATION_SEC = 180
Failure we hit: scheduled premieres have no contentDetails.duration.match() on undefined threw, the catch returned [], and the whole channel got "0 eligible videos" (zeroed 성시경 in 3 of 4 scheduled runs).
Mitigation: item.contentDetails?.duration ?? 'PT0S' — one premiere can no longer blank a channel.
B3

Title prefilter — don't pay for off-topic videos 0 API calls

isLikelyRestaurantVideo(title, target)
// include-words OVERRIDE excludes — decide which error you'd rather make:
// a false keep costs 1 Gemini call; a false skip silently loses a map point.
INCLUDE: /맛집|식당|먹을텐데|또간집|노포|밥집|뷔페|횟집|술집|안주|포차|회식|국밥|해장국/
EXCLUDE: music·talk /부를텐데|만날텐데|레시피|LIVE/ · marts /마트|편의점|꿀템/
         home-cooking /집에서.{0,16}(구웠|구워|만들)/ · outdoor /한라산|등반|올레길/
         overseas /도쿄|오사카|뉴욕|일본|미국|베트남|하노이|해외/
per-channel: 성시경 → includeOnly /먹을텐데/   // 11/12 of its uploads are music/talk

For 홍석천이원일 all 20 long-form titles pass — including "이원일의 레시피!… 피자 맛집" which survives only because include-words win. Before this filter existed, ~60–70% of a day's Gemini quota went to music shows and mart hauls.

B4

Skip what's already done DB read

alreadyHarvested() — two checks
-- ① did this video already yield restaurants?
SELECT 1 FROM restaurants WHERE video_id = $1;
-- ② or did the LLM already confirm it has none? (zero-yield tombstone)
SELECT extraction_empty_at FROM videos WHERE video_id = $1;
Failure we hit: without ②, confirmed-empty videos were re-extracted ×3 retries every run forever — the 한라산 hiking video alone ate ~15 Gemini calls over 5 days re-confirming it contains no restaurants.
Mitigation: videos.extraction_empty_at is set only when the LLM responded and found nothing (a quota-dead call sets nothing, so those stay retryable).

Stage C — Per-video pipeline (runPipeline, ×10 videos)

src/pipeline.ts orchestrates 5 steps; concurrency 2 videos at a time, exponential backoff between retries.

C1

Ingest metadata → videos videos.list · 1 unit · DB write #2

ingestVideoById() → upsertVideo()
INSERT INTO videos (video_id, title, channel_title, description, published_at)
VALUES ($1,$2,$3,$4,$5)
ON CONFLICT (video_id) DO UPDATE SET … , updated_at = now();
C2

Extract transcript → transcripts 0 quota (scrape) · DB write #3

extractAndSaveTranscript() — youtube-transcript-plus, Korean first
INSERT INTO transcripts (video_id, transcript_data, language_code, has_korean)
VALUES ($1, $2::jsonb, $3, $4)
ON CONFLICT (video_id, language_code) DO UPDATE SET transcript_data = EXCLUDED.transcript_data;

No official API exists — this scrapes YouTube's caption endpoint. It's the flakiest step: datacenter IPs get blocked (hence optional PROXY_URL for EC2), and it's cached in the DB so re-runs are free. has_korean gates whether the LLM step is worth running.

C3

Fetch comments → comment_threads 1 unit/page (100 threads) · DB write #4

fetchAndPersistCommentsByVideoId() — paginated
INSERT INTO comment_threads (video_id, youtube_thread_id, top_level_comment, replies)
VALUES ($1,$2,$3::jsonb,$4::jsonb)
ON CONFLICT (youtube_thread_id) DO UPDATE SET …;

Why comments? Viewers name the restaurant ("여기 ○○집이에요!") and correct transcript typos — they're free disambiguation signal for the LLM. 홍석천이원일 videos carry hundreds of threads each.

C4

LLM extraction — names + areas Gemini Flash · 1 req (free tier: limited/day)

extractRestaurantsFromVideo() → src/llm.ts generateContent()
prompt = title + description + transcript + comments
       + "return ONLY establishments actually REVIEWED — not dishes, not name-drops"[{"name":"무교동북어국집","area":"무교동"}, …]   // JSON array of {name, area}

Output is post-filtered: placeholder names (/이름\s*없|간판/) are dropped before enrichment. The result carries llmResponded — distinguishing "the model said zero" from "the call died".

Failure we hit: free-tier 429s were swallowed as null → looked identical to "no restaurants" → quota-era runs persisted 13 food videos as empty (incl. one with the restaurant name in the title).
Mitigation: per-minute 429 → honor the server's retryDelay and retry; per-day 429 → latch isDailyQuotaExhausted() and the whole harvest fail-fasts in seconds instead of grinding.
C5

Naver enrichment — name → 좌표 Naver Local Search ≤3 req + Geocode fallback

enrichRestaurantsWithNaver() → getCoordinatesForRestaurant()
// ordered query plan — most specific first, validation on the loose end
0. overseas guard: area ~ /일본|도쿄|뉴욕|하노이…/ → skip entirely (Naver = Korea only)
1. "무교동북어국집 무교동"        // name + full area
2. "무교동북어국집 무교동"[city]   // name + first token of area
3. "무교동북어국집" + validate     // bare name — hit's address MUST contain an area token
4. errata spell-fix, last resort  // it "corrected" 퍼옹부이 → 포옹부인 (nonsense)
coords: mapx/mapy ÷ 10,000,000 → lat/lng; else geocode the address
Failures we hit: ① Katz's Deli (뉴욕) "geocoded" to 의정부; a Tokyo tonkatsu shop matched a health-goods store in 서초 — Naver silently returns Korean lookalikes for overseas places. ② Naver 429s permanently left Korean restaurants un-geocoded. ③ bare-name fallback matched a Seoul 호떡 stall for a 충주 stall.
Mitigations: ① overseas guard (step 0). ② retry 429/5xx + 150ms spacing — a re-enrich pass then fixed 4 of 5 Korean failures instantly. ③ the validate rule on bare-name hits — it rejected the wrong 전통호떡 live during repair.
C6

Persist → restaurants, or tombstone → videos DB write #5

upsertManyRestaurants() / markVideoExtractionEmpty()
INSERT INTO restaurants (video_id, restaurant_name, area, naver_link, address,
                         road_address, latitude, longitude, category, description, last_enriched_at)
VALUES (…, now())
ON CONFLICT (video_id, restaurant_name) DO UPDATE SET … ;  -- same place re-reviewed = update, not dup

-- OR, when llmResponded ∧ names.length = 0:
UPDATE videos SET extraction_empty_at = now() WHERE video_id = $1;  -- never re-pay for this video

Un-geocoded rows (overseas, or genuinely not on Naver) stay with latitude IS NULL — stored but invisible to the map. Nothing is deleted; provenance (which video said it) is always kept.

D

Read path — the map GET /restaurants

src/server.ts → public/map.html (20s polling, clear-then-draw)
SELECT r.*, v.title AS video_title, v.channel_title
FROM restaurants r LEFT JOIN videos v ON r.video_id = v.video_id
WHERE r.latitude IS NOT NULL ORDER BY r.created_at DESC;

The viewer never talks to the pipeline — only to the table. That's why the map fills in live while a harvest runs.

The 5 tables — who writes what

TableNatural unique keyWritten byRole
channelschannel_idStage A2Which channels we track + focus/notes
videosvideo_idC1 (+ C6 tombstone)Metadata · extraction_empty_at = zero-yield tombstone
transcripts(video_id, language_code)C2Cached transcript jsonb · has_korean gate
comment_threadsyoutube_thread_idC3LLM disambiguation signal (jsonb top comment + replies)
restaurants(video_id, restaurant_name)C6The product. lat/lng NULL = extracted but unmappable

Three invariants to stay meticulous about: (1) every write is an upsert on a natural key — re-runs are always safe; (2) raw inputs (transcript, comments) are cached in the DB so only Gemini/Naver steps ever need re-paying; (3) a row in restaurants always traces back to the exact video that claimed it.

Quota economics per channel run (홍석천이원일, 10 videos)
ResourceCostDaily budgetBinding?
YouTube search.list (B1)100 units × 110,000 unitsno
YouTube videos/commentThreads (B2, C1, C3)~1–10 units/videono
Transcript scrape (C2)0 units (rate-limited by IP)rarely
Gemini Flash (C4)1–3 req/videofree tier: small fixed RPD, resets 16:00 KSTYES — the bottleneck
Naver Search/Geocode (C5)≤3 req/restaurant25,000/dayno (after retry fix)

Everything in Stage B exists to protect the Gemini column: filter before you pay, tombstone after you've paid once.

Where we are: prototype → tracer bullet → MVP

Three terms that get blurred together — this repo happens to contain a clean specimen of each.

Prototype — disposable code that answers a question

legacy/ puppeteer experiments · scripts/research_channels.ts

A prototype's deliverable is the answer, not the code. The Puppeteer transcript scraper answered "can we get Korean transcripts at all?" — then was thrown away for youtube-transcript-plus. The research script answered "which channels actually yield restaurants?" — its output lives on as channels.md notes ("100% hit rate", "1 video → 20 restaurants") while the runs themselves are forgotten. You're allowed to write ugly prototypes because you've promised to delete them.

Tracer bullet — a thin end-to-end slice you keep and flesh out

src/pipeline.ts · npm run tracer

One video → all 5 steps → real YouTube API, real Gemini, real Naver, real Postgres → a real coordinate. No mocks anywhere — that's what makes it a tracer and not a demo: it proves the architecture under real conditions, like a tracer round showing where your fire actually lands. Crucially, tracer code is not disposable: everything we hardened this week (retry latches, tombstones, area validation) was bolted onto that same skeleton. The skeleton never changed; it just grew armor.

MVP — the smallest thing users can use, built to learn from

/map + the 16:30 daily self-harvest

"Viable" = a person can open the map and find a restaurant a YouTuber actually visited — that works today (86+ geocoded points, popups linking Naver + the source video). "Minimum" = the honest gaps: search matching is a stub, no dedup across channels reviewing the same place, free-tier pacing. An MVP's job is to generate product learning (do people want this? what do they search for?) — distinct from the tracer's technical learning.

Now — hardening: "works once" → "works unattended, daily, correctly"

This week's work, and the modular channel runbook

Idempotent upserts, fail-fast quota latches, zero-yield tombstones, geocode validation, squatter-checked channel ids — none of these add features; they add trust in the data. The meticulous per-channel loop going forward:

① verify UC id (squatter check)② sample 50 titles, tune filter③ npm run harvest -- <channel>④ audit new rows (coords in right region? categories food?)⑤ check /map → repeat for the next channel.

Generated 2026-06-13 · reflects the code as of the 홍석천이원일 module · API docs · live map