enhance article query capabilities by supporting multiple keywords and dynamic ordering

2026-04-21 11:42:21 +01:00 · 2026-04-21 11:42:21 +01:00 · cb819e77ee
commit cb819e77ee
parent 8805d3a3fc
2 changed files with 134 additions and 240 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # duriin_api
-Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
+Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
 ## Setup
@@ -8,7 +8,7 @@ Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC
   ```bash
   npm install
   ```
-2. Edit `config.json` with your API keys, tickers, RSS feeds, Google News settings, and schedules.
+2. Edit `config.json` with your API keys, tickers, and schedules.
 3. Start the server:
   ```bash
   npm start
@@ -20,172 +20,70 @@ The server listens on the host and port defined in `config.json`.
 On startup the server:
-1. Opens the SQLite database.
+1. Opens the SQLite database and runs any pending migrations.
-2. Registers the article and status routes.
+2. Registers routes.
 3. Starts the HTTP server.
-4. Immediately runs all ingestion sources once.
+4. Launches continuous background loops for each source, content backfill, and embedding backfill.
 5. Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.
 When a new article is inserted:
 - the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
-- `content` and `image` start as `null`
-- full article extraction runs asynchronously after insert
-- vector embeddings are generated later, after title, description, and content are all available
+- `content` starts as `null`
+- content backfill workers pick it up asynchronously — plain HTTP first, Playwright fallback for JS-heavy sites
+- vector embeddings are generated after title, description, and content are all available
 - only articles with content + embedding are exposed via the API
 Content backfill prioritises recent articles (`pub_date_effective DESC`) so newest content surfaces first regardless of ingestion order.
 Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.
 ## API overview
-All exposed endpoints are `GET` endpoints.
+All endpoints are `GET`.
 ### `GET /`
-Simple health check.
+Health check. Returns `{ "ok": true }`.
 **Response**
 ```json
 { "ok": true }
 ```
 Use this to confirm the server is running, not to inspect ingestion state.
 ### `GET /articles`
-Returns articles from the `articles` table. Only articles that are considered **usable** are exposed — meaning they have non-empty `content`, a stored embedding, and are not index/category pages. Behavior changes based on the query params you send.
+Returns usable articles — non-empty `content`, stored embedding, not an index/category page.
 #### Query params
-##### `keyword`
-Plain keyword search.
-- matches `title`, `description`, and `content`
-- uses SQL `LIKE`
-- works like substring matching, not semantic search
-- best when you want literal words or phrases to appear in the article text
-Example:
-```http
-GET /articles?keyword=earnings
-```
-##### `source`
-Exact match on the stored `source` field.
-Example:
-```http
-GET /articles?source=rss
-```
-##### `from`
-Only returns rows where `pub_date >= from`.
-Example:
-```http
-GET /articles?from=2025-01-01T00:00:00.000Z
-```
-##### `to`
-Only returns rows where `pub_date <= to`.
-Example:
-```http
-GET /articles?to=2025-01-31T23:59:59.999Z
-```
-##### `limit`
-Number of rows to return.
-- default: `20`
-- max: `100`
-Example:
-```http
-GET /articles?limit=10
-```
-##### `offset`
-Pagination offset.
-- default: `0`
-Example:
-```http
-GET /articles?limit=10&offset=20
-```
-##### `similar_to_article`
-Runs vector similarity search instead of normal list mode.
-- value must be an existing article ID
-- the server looks up that article's embedding
-- nearest-neighbor search runs in `sqlite-vec`
-- the source article is excluded from the result set
-- each result includes a `distance` field
-- lower `distance` means more similar
-- returns `404` if the article has no stored embedding
-Example:
-```http
-GET /articles?similar_to_article=123&limit=5
-```
-Not found response:
-```json
-{ "error": "Embedding not found for article" }
-```
-##### `semantic`
-Semantic search by meaning, not exact wording.
-- use this when you want conceptually related results
-- unlike `keyword`, the words do not need to appear literally in the article text
-- the query text is normalized before embedding
-- query embeddings are cached in SQLite
-- on cache miss, the server requests an embedding from OpenRouter
-- nearest article matches are returned from the embedding index
-- each result includes a `distance` field
-- lower `distance` means a closer semantic match
-- returns `400` if `semantic` is empty
-Example:
-```http
-GET /articles?semantic=ai chip demand&limit=10
-```
-Bad request response:
-```json
-{ "error": "Semantic query must not be empty" }
-```
-##### `include_embedding`
-Explicitly rejected on `/articles`.
-Response:
-```json
-{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
-```
-#### General behavior
-- If `semantic` is present, semantic search is used.
-- Else if `similar_to_article` is present, similarity search is used.
-- Otherwise normal list/search mode is used.
-- `keyword` is literal keyword matching.
-- `semantic` is semantic matching by meaning.
-- Normal list/search results are ordered by `COALESCE(pub_date, ingested_at) DESC, id DESC`.
-- `from` and `to` are compared against stored publication timestamps, so ISO-8601 values are the safest input.
-- `source` must match the stored source name exactly.
-- `keyword` is substring matching, not full-text search.
-#### Normal list/search response shape
+| Param | Description |
+|---|---|
+| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords — e.g. `keyword=bitcoin&keyword=ethereum` |
+| `keyword_mode` | How multiple keywords are combined — `and` (default) or `or` |
+| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
+| `from` | `pub_date >= from` (ISO-8601) |
+| `to` | `pub_date <= to` (ISO-8601) |
+| `limit` | Rows to return. Default `20`, max `100` |
+| `offset` | Pagination offset. Default `0` |
+| `order` | Sort order — see below. Not applied to `semantic` or `similar_to_article` results (those are sorted by distance) |
+| `semantic` | Semantic search by meaning via embedding similarity |
+| `similar_to_article` | Vector similarity search using another article's embedding |
+#### `order` values
+| Value | Sort |
+|---|---|
+| `newest` | `pub_date_effective DESC` (default) |
+| `oldest` | `pub_date_effective ASC` |
+| `ingested_newest` | `ingested_at DESC` |
+| `ingested_oldest` | `ingested_at ASC` |
+#### Search modes
+- If `semantic` is present — semantic nearest-neighbor search. Query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
+- Else if `similar_to_article` is present — finds articles similar to the given article ID. Returns `404` if that article has no embedding.
+- Otherwise — normal filtered list mode. All params apply.
+`keyword` and `source`, `from`, `to` also work as post-filters on `semantic` and `similar_to_article` results.
+`include_embedding` is explicitly rejected on this endpoint.
+#### Response shape
 ```json
 [
@@ -194,113 +92,60 @@ Response:
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
-    "source": "rss",
+    "source": "rss:BBC",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
 ]
 ```
-#### Similarity/topic search response shape
-```json
-[
-  {
-    "id": 456,
-    "title": "...",
-    "description": "...",
-    "content": "...",
-    "image": "...",
-    "url": "...",
-    "normalized_title": "...",
-    "source": "rss",
-    "pub_date": "2025-01-02T09:00:00.000Z",
-    "ingested_at": "2025-01-02T09:00:10.000Z",
-    "distance": 0.1234
-  }
-]
-```
-#### Combined example
-```http
-GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
-```
+Semantic and similarity results also include `"distance": 0.1234`.
 ### `GET /articles/:id`
-Returns one article by numeric ID.
+Returns one article by numeric ID. Same usability filter as the list endpoint — returns `404` if the article exists but has no content or embedding.
 **Behavior**
 - Looks up the article directly in SQLite.
 - Same usability filter as the list endpoint — returns `404` if the article exists but is not usable.
 - Returns the same article fields as normal `/articles` list mode.
 - Does not return embedding data.
 - Returns `404` if the ID does not exist.
 **Example**
 ```http
 GET /articles/123
 ```
 **Not found response**
 ```json
 { "error": "Article not found" }
 ```
 ### `GET /status`
-Returns ingestion and archive summary information.
+Returns archive summary. Cached for 30 seconds.
 **Response fields**
-- `total`: total number of rows in `articles` across all sources
-- `usable`: articles that have content, an embedding, and are not index pages
-- `lastIngestionBySource`: in-memory timestamps of the last successful batch run per source
-- `bySource`: per-source breakdown, each with `total` and `usable` counts
+- `total` — total rows across all sources
+- `usable` — articles with content + embedding, not index pages
+- `lastIngestionBySource` — in-memory timestamps of the last successful batch per source (resets on restart)
+- `bySource` — per-source `{ total, usable }`
 - `embeddingModels` — active embedding models with article count and detected dimensions
-**Important detail**
-`lastIngestionBySource` is kept in memory, so it resets when the process restarts.
-**Example response**
-```json
-{
-  "total": 10234,
-  "usable": 8700,
-  "lastIngestionBySource": {
-    "rss": "2025-01-02T10:00:00.000Z",
-    "gdelt": "2025-01-02T10:05:00.000Z"
-  },
-  "bySource": {
-    "alphavantage": { "total": 120, "usable": 98 },
-    "edgar": { "total": 88, "usable": 70 },
-    "finnhub": { "total": 400, "usable": 360 },
-    "gdelt": { "total": 2100, "usable": 1800 },
-    "rss": { "total": 7526, "usable": 6372 }
-  }
-}
-```
+### `GET /sources`
+Returns the full source catalog from `sources.json` enriched with live DB stats.
+**Per-source fields**
+
+- `id`, `label`, `websites`, `backfill`, `feeds` — from `sources.json` (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
+- `counts` — aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
+- `byFeed` — same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
+- `domains` — current domain fetch policy per website: `policy` (auto / browser_only / blocked), failure/success counts, `expiresAt`
+
+Use `domains[].policy` to diagnose why a source has high `skipped` or `failed` counts — `blocked` means backfill has given up on that domain temporarily.
 ## Article field notes
-- `image` stores the extracted main image as ultra-compressed base64 WebP.
-- `normalized_title` is stored for matching and indexing.
-- `source` may be a shared source like `rss`, `googlenews`, `gdelt`, `edgar`, `alphavantage`, or `finnhub`.
-- `pub_date` is normalized to ISO-8601 when it can be parsed.
-- `ingested_at` is the insert timestamp set by the server.
+- `pub_date` is normalized to ISO-8601 when parseable; `null` otherwise.
+- `pub_date_effective` is `COALESCE(pub_date, ingested_at)` — used for sorting.
+- `ingested_at` is the server-side insert timestamp.
+- `normalized_title` is stored for deduplication and indexing.
+- `source` format is `<feed_type>:<label>` for GDELT and RSS (e.g. `gdelt:Bloomberg Markets`, `rss:TechCrunch`), or just the source name for other feeds (`alphavantage`, `edgar`, `finnhub`).
 ## Notes
-- SQLite archive file defaults to `./archive.sqlite`.
-- Deduplication is enforced on `url`; normalized titles are stored and indexed for matching but are not unique.
-- `googleNews` accepts `queries`, `topics`, `language`, and `country`, and resolves Google redirect URLs to publisher URLs before ingestion.
-- Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content.
-- Embeddings are generated asynchronously with OpenRouter `perplexity/pplx-embed-v1-0.6b` and indexed in `sqlite-vec` for similarity search.
-- Topic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss.
-- SEC requests use the configured `User-Agent`.
+- SQLite archive defaults to `./archive.sqlite`.
+- Deduplication is enforced on `url`.
+- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
+- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
+- Embeddings use OpenRouter and are indexed in `sqlite-vec` for ANN search.
+- Query embeddings are cached in SQLite to avoid redundant API calls.
+- SEC requests use the `User-Agent` from `config.json`.
 - Duplicate URLs are skipped rather than inserted again.
--- a/src/routes/articles.js
+++ b/src/routes/articles.js
@@ -12,9 +12,15 @@ function buildArticlesQuery(query) {
  const includeEmbedding = String(query.include_embedding || '').toLowerCase() === 'true';
  if (query.keyword) {
-    conditions.push('(title LIKE ? OR description LIKE ? OR content LIKE ?)');
-    const keyword = `%${query.keyword}%`;
-    params.push(keyword, keyword, keyword);
+    const keywords = [].concat(query.keyword).map((k) => k.trim()).filter(Boolean);
+    const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
+    const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');
+    conditions.push(`(${clauses.join(` ${mode} `)})`);
+    for (const kw of keywords) {
+      const like = `%${kw}%`;
+      params.push(like, like, like);
+    }
  }
  if (query.source) {
@@ -36,6 +42,14 @@ function buildArticlesQuery(query) {
  conditions.push('is_index_page = 0');
  conditions.push('has_embedding = 1');
+  const ORDERS = {
+    newest: 'pub_date_effective DESC, id DESC',
+    oldest: 'pub_date_effective ASC, id ASC',
+    ingested_newest: 'ingested_at DESC, id DESC',
+    ingested_oldest: 'ingested_at ASC, id ASC',
+  };
+  const orderBy = ORDERS[query.order] || ORDERS.newest;
  const whereClause = `WHERE ${conditions.join(' AND ')}`;
  const limit = Number.parseInt(query.limit, 10);
  const offset = Number.parseInt(query.offset, 10);
@@ -48,7 +62,7 @@ function buildArticlesQuery(query) {
      SELECT id, title, description, content, ${includeEmbedding ? 'embedding,' : ''} url, normalized_title, source, pub_date, ingested_at
      FROM articles
      ${whereClause}
-      ORDER BY pub_date_effective DESC, id DESC
+      ORDER BY ${orderBy}
      LIMIT ? OFFSET ?
    `,
    params,
@@ -64,23 +78,58 @@ function shouldExcludeIndexPages(query) {
  return String(query.exclude_index_pages || '').toLowerCase() !== 'false';
 }
-function mapNeighborsToArticles(neighbors, excludeIndexPages, limit) {
+function mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query = {}) {
  const ids = neighbors.map((row) => row.articleId);
  if (ids.length === 0) {
    return [];
  }
  const placeholders = ids.map(() => '?').join(', ');
+  const conditions = [];
+  const params = [...ids];
+  conditions.push(`id IN (${placeholders})`);
+  conditions.push("content IS NOT NULL AND content != ''");
+  conditions.push('has_embedding = 1');
+  if (excludeIndexPages) conditions.push('is_index_page = 0');
+  if (query.source) {
+    conditions.push('source = ?');
+    params.push(query.source);
+  }
+  if (query.from) {
+    conditions.push('pub_date >= ?');
+    params.push(query.from);
+  }
+  if (query.to) {
+    conditions.push('pub_date <= ?');
+    params.push(query.to);
+  }
+  if (query.keyword) {
+    const keywords = [].concat(query.keyword).map((k) => k.trim()).filter(Boolean);
+    const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
+    const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');
+    conditions.push(`(${clauses.join(` ${mode} `)})`);
+    for (const kw of keywords) {
+      const like = `%${kw}%`;
+      params.push(like, like, like);
+    }
+  }
  const articles = db.prepare(`
    SELECT id, title, description, content, url, normalized_title, source, pub_date, ingested_at
    FROM articles
-    WHERE id IN (${placeholders})
-      AND content IS NOT NULL AND content != ''
-      AND has_embedding = 1
-      ${excludeIndexPages ? 'AND is_index_page = 0' : ''}
-  `).all(...ids);
+    WHERE ${conditions.join(' AND ')}
+  `).all(...params);
+
  const byId = new Map(articles.map((article) => [article.id, article]));
  // preserve distance ordering from the vector search
  return neighbors
    .map((row) => {
      const article = byId.get(row.articleId);
@@ -113,7 +162,7 @@ async function articleRoutes(fastify) {
        Math.min(limit * 5, 500)
      );
-      return mapNeighborsToArticles(neighbors, excludeIndexPages, limit);
+      return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
    }
    if (query.similar_to_article) {
@@ -130,7 +179,7 @@ async function articleRoutes(fastify) {
        return { error: 'Embedding not found for article' };
      }
-      return mapNeighborsToArticles(neighbors, excludeIndexPages, limit);
+      return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
    }
    const { sql, params } = buildArticlesQuery(query);
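
Review note: the keyword logic added in this commit appears twice, once in `buildArticlesQuery` and once in `mapNeighborsToArticles`. In both places a whitespace-only `keyword` still passes the `if (query.keyword)` check and appears to produce an empty `()` group in the `WHERE` clause. The sketch below shows one way the two copies could share a helper that also skips that case. `buildKeywordClause` and its return shape are hypothetical and not part of the commit. It assumes the default query parser delivers a string for a single `keyword` and an array for repeated ones.

```javascript
// Hypothetical shared helper (not in the commit): builds the keyword predicate
// and its bound parameters, or returns null when no usable keyword remains.
function buildKeywordClause(keyword, keywordMode) {
  // Accept a single string, an array of strings, or undefined.
  const keywords = [].concat(keyword ?? [])
    .map((k) => String(k).trim())
    .filter(Boolean);

  // Assumption: a blank keyword should mean "no keyword filter",
  // rather than an empty "()" group in the SQL.
  if (keywords.length === 0) return null;

  // Same rule as the commit: anything other than "or" falls back to AND.
  const mode = String(keywordMode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
  const clause = '(title LIKE ? OR description LIKE ? OR content LIKE ?)';
  const sql = `(${keywords.map(() => clause).join(` ${mode} `)})`;

  // Three placeholders per keyword: title, description, content.
  const params = keywords.flatMap((kw) => {
    const like = `%${kw}%`;
    return [like, like, like];
  });

  return { sql, params };
}

const built = buildKeywordClause(['bitcoin', ' ethereum '], 'OR');
console.log(built.sql);
console.log(built.params);
```

A caller would push `built.sql` into `conditions` and spread `built.params` into `params` only when the result is not `null`, which keeps the behavior of the list and vector-search paths identical.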