diff --git a/README.md b/README.md
index b834a52..7ddcaab 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # duriin_api
 
-Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
+Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
 
 ## Setup
 
@@ -8,7 +8,7 @@ Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC
 ```bash
 npm install
 ```
-2. Edit `config.json` with your API keys, tickers, RSS feeds, Google News settings, and schedules.
+2. Edit `config.json` with your API keys, tickers, and schedules.
 3. Start the server:
 ```bash
 npm start
@@ -20,172 +20,70 @@ The server listens on the host and port defined in `config.json`.
 
 On startup the server:
 
-1. Opens the SQLite database.
-2. Registers the article and status routes.
+1. Opens the SQLite database and runs any pending migrations.
+2. Registers routes.
 3. Starts the HTTP server.
-4. Immediately runs all ingestion sources once.
-5. Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.
+4. Launches continuous background loops for each source, content backfill, and embedding backfill.
 
 When a new article is inserted:
 
 - the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
-- `content` and `image` start as `null`
-- full article extraction runs asynchronously after insert
-- vector embeddings are generated later, after title, description, and content are all available
+- `content` starts as `null`
+- content backfill workers pick it up asynchronously — plain HTTP first, Playwright fallback for JS-heavy sites
+- vector embeddings are generated after title, description, and content are all available
+- only articles with content + embedding are exposed via the API
+
+Content backfill prioritises recent articles (`pub_date_effective DESC`) so newest content surfaces first regardless of ingestion order.
+
+Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.
 
 ## API overview
 
-All exposed endpoints are `GET` endpoints.
+All endpoints are `GET`.
 
 ### `GET /`
 
-Simple health check.
-
-**Response**
-```json
-{ "ok": true }
-```
-
-Use this to confirm the server is running, not to inspect ingestion state.
+Health check. Returns `{ "ok": true }`.
 
 ### `GET /articles`
 
-Returns articles from the `articles` table. Only articles that are considered **usable** are exposed — meaning they have non-empty `content`, a stored embedding, and are not index/category pages. Behavior changes based on the query params you send.
+Returns usable articles — non-empty `content`, stored embedding, not an index/category page.
 
 #### Query params
 
-##### `keyword`
+| Param | Description |
+|---|---|
+| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords — e.g. `keyword=bitcoin&keyword=ethereum` |
+| `keyword_mode` | How multiple keywords are combined — `and` (default) or `or` |
+| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
+| `from` | `pub_date >= from` (ISO-8601) |
+| `to` | `pub_date <= to` (ISO-8601) |
+| `limit` | Rows to return. Default `20`, max `100` |
+| `offset` | Pagination offset. Default `0` |
+| `order` | Sort order — see below. Not applied to `semantic` or `similar_to_article` results (those are sorted by distance) |
+| `semantic` | Semantic search by meaning via embedding similarity |
+| `similar_to_article` | Vector similarity search using another article's embedding |
 
-Plain keyword search.
+#### `order` values
 
-- matches `title`, `description`, and `content`
-- uses SQL `LIKE`
-- works like substring matching, not semantic search
-- best when you want literal words or phrases to appear in the article text
+| Value | Sort |
+|---|---|
+| `newest` | `pub_date_effective DESC` (default) |
+| `oldest` | `pub_date_effective ASC` |
+| `ingested_newest` | `ingested_at DESC` |
+| `ingested_oldest` | `ingested_at ASC` |
 
-Example:
-```http
-GET /articles?keyword=earnings
-```
+#### Search modes
 
-##### `source`
+- If `semantic` is present — semantic nearest-neighbor search. Query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
+- Else if `similar_to_article` is present — finds articles similar to the given article ID. Returns `404` if that article has no embedding.
+- Otherwise — normal filtered list mode. All params apply.
 
-Exact match on the stored `source` field.
+`keyword` and `source`, `from`, `to` also work as post-filters on `semantic` and `similar_to_article` results.
 
-Example:
-```http
-GET /articles?source=rss
-```
+`include_embedding` is explicitly rejected on this endpoint.
 
-##### `from`
-
-Only returns rows where `pub_date >= from`.
-
-Example:
-```http
-GET /articles?from=2025-01-01T00:00:00.000Z
-```
-
-##### `to`
-
-Only returns rows where `pub_date <= to`.
-
-Example:
-```http
-GET /articles?to=2025-01-31T23:59:59.999Z
-```
-
-##### `limit`
-
-Number of rows to return.
-
-- default: `20`
-- max: `100`
-
-Example:
-```http
-GET /articles?limit=10
-```
-
-##### `offset`
-
-Pagination offset.
-
-- default: `0`
-
-Example:
-```http
-GET /articles?limit=10&offset=20
-```
-
-##### `similar_to_article`
-
-Runs vector similarity search instead of normal list mode.
-
-- value must be an existing article ID
-- the server looks up that article's embedding
-- nearest-neighbor search runs in `sqlite-vec`
-- the source article is excluded from the result set
-- each result includes a `distance` field
-- lower `distance` means more similar
-- returns `404` if the article has no stored embedding
-
-Example:
-```http
-GET /articles?similar_to_article=123&limit=5
-```
-
-Not found response:
-```json
-{ "error": "Embedding not found for article" }
-```
-
-##### `semantic`
-
-Semantic search by meaning, not exact wording.
-
-- use this when you want conceptually related results
-- unlike `keyword`, the words do not need to appear literally in the article text
-- the query text is normalized before embedding
-- query embeddings are cached in SQLite
-- on cache miss, the server requests an embedding from OpenRouter
-- nearest article matches are returned from the embedding index
-- each result includes a `distance` field
-- lower `distance` means a closer semantic match
-- returns `400` if `semantic` is empty
-
-Example:
-```http
-GET /articles?semantic=ai chip demand&limit=10
-```
-
-Bad request response:
-```json
-{ "error": "Semantic query must not be empty" }
-```
-
-##### `include_embedding`
-
-Explicitly rejected on `/articles`.
-
-Response:
-```json
-{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
-```
-
-#### General behavior
-
-- If `semantic` is present, semantic search is used.
-- Else if `similar_to_article` is present, similarity search is used.
-- Otherwise normal list/search mode is used.
-- `keyword` is literal keyword matching.
-- `semantic` is semantic matching by meaning.
-- Normal list/search results are ordered by `COALESCE(pub_date, ingested_at) DESC, id DESC`.
-- `from` and `to` are compared against stored publication timestamps, so ISO-8601 values are the safest input.
-- `source` must match the stored source name exactly.
-- `keyword` is substring matching, not full-text search.
-
-#### Normal list/search response shape
+#### Response shape
 
 ```json
 [
@@ -194,113 +92,60 @@ Response:
     "title": "...",
     "description": "...",
     "content": "...",
-    "image": "...",
     "url": "...",
     "normalized_title": "...",
-    "source": "rss",
+    "source": "rss:BBC",
     "pub_date": "2025-01-01T12:34:56.000Z",
     "ingested_at": "2025-01-01T12:35:10.000Z"
   }
 ]
 ```
 
-#### Similarity/topic search response shape
-
-```json
-[
-  {
-    "id": 456,
-    "title": "...",
-    "description": "...",
-    "content": "...",
-    "image": "...",
-    "url": "...",
-    "normalized_title": "...",
-    "source": "rss",
-    "pub_date": "2025-01-02T09:00:00.000Z",
-    "ingested_at": "2025-01-02T09:00:10.000Z",
-    "distance": 0.1234
-  }
-]
-```
-
-#### Combined example
-
-```http
-GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
-```
+Semantic and similarity results also include `"distance": 0.1234`.
 
 ### `GET /articles/:id`
 
-Returns one article by numeric ID.
-
-**Behavior**
-
-- Looks up the article directly in SQLite.
-- Same usability filter as the list endpoint — returns `404` if the article exists but is not usable.
-- Returns the same article fields as normal `/articles` list mode.
-- Does not return embedding data.
-- Returns `404` if the ID does not exist.
-
-**Example**
-```http
-GET /articles/123
-```
-
-**Not found response**
-```json
-{ "error": "Article not found" }
-```
+Returns one article by numeric ID. Same usability filter as the list endpoint — returns `404` if the article exists but has no content or embedding.
 
 ### `GET /status`
 
-Returns ingestion and archive summary information.
+Returns archive summary. Cached for 30 seconds.
 
 **Response fields**
 
-- `total`: total number of rows in `articles` across all sources
-- `usable`: articles that have content, an embedding, and are not index pages
-- `lastIngestionBySource`: in-memory timestamps of the last successful batch run per source
-- `bySource`: per-source breakdown, each with `total` and `usable` counts
+- `total` — total rows across all sources
+- `usable` — articles with content + embedding, not index pages
+- `lastIngestionBySource` — in-memory timestamps of the last successful batch per source (resets on restart)
+- `bySource` — per-source `{ total, usable }`
+- `embeddingModels` — active embedding models with article count and detected dimensions
 
-**Important detail**
+### `GET /sources`
 
-`lastIngestionBySource` is kept in memory, so it resets when the process restarts.
+Returns the full source catalog from `sources.json` enriched with live DB stats.
 
-**Example response**
-```json
-{
-  "total": 10234,
-  "usable": 8700,
-  "lastIngestionBySource": {
-    "rss": "2025-01-02T10:00:00.000Z",
-    "gdelt": "2025-01-02T10:05:00.000Z"
-  },
-  "bySource": {
-    "alphavantage": { "total": 120, "usable": 98 },
-    "edgar": { "total": 88, "usable": 70 },
-    "finnhub": { "total": 400, "usable": 360 },
-    "gdelt": { "total": 2100, "usable": 1800 },
-    "rss": { "total": 7526, "usable": 6372 }
-  }
-}
-```
+**Per-source fields**
+
+- `id`, `label`, `websites`, `backfill`, `feeds` — from `sources.json` (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
+- `counts` — aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
+- `byFeed` — same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
+- `domains` — current domain fetch policy per website: `policy` (auto / browser_only / blocked), failure/success counts, `expiresAt`
+
+Use `domains[].policy` to diagnose why a source has high `skipped` or `failed` counts — `blocked` means backfill has given up on that domain temporarily.
 
 ## Article field notes
 
-- `image` stores the extracted main image as ultra-compressed base64 WebP.
-- `normalized_title` is stored for matching and indexing.
-- `source` may be a shared source like `rss`, `googlenews`, `gdelt`, `edgar`, `alphavantage`, or `finnhub`.
-- `pub_date` is normalized to ISO-8601 when it can be parsed.
-- `ingested_at` is the insert timestamp set by the server.
+- `pub_date` is normalized to ISO-8601 when parseable; `null` otherwise.
+- `pub_date_effective` is `COALESCE(pub_date, ingested_at)` — used for sorting.
+- `ingested_at` is the server-side insert timestamp.
+- `normalized_title` is stored for deduplication and indexing.
+- `source` format is `: