enhance article query capabilities by supporting multiple keywords and dynamic ordering

2026-04-21 11:42:21 +01:00
parent 8805d3a3fc
commit cb819e77ee
2 changed files with 134 additions and 240 deletions
@@ -1,6 +1,6 @@
 # duriin_api

-Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
+Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.

 ## Setup

@@ -8,7 +8,7 @@ Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC
   ```bash
   npm install
   ```
-2. Edit `config.json` with your API keys, tickers, RSS feeds, Google News settings, and schedules.
+2. Edit `config.json` with your API keys, tickers, and schedules.
 3. Start the server:
   ```bash
   npm start
@@ -20,172 +20,70 @@ The server listens on the host and port defined in `config.json`.

 On startup the server:

-1. Opens the SQLite database.
-2. Registers the article and status routes.
+1. Opens the SQLite database and runs any pending migrations.
+2. Registers routes.
 3. Starts the HTTP server.
-4. Immediately runs all ingestion sources once.
-5. Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.
+4. Launches continuous background loops for each source, content backfill, and embedding backfill.

 When a new article is inserted:

 - the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
-- `content` and `image` start as `null`
-- full article extraction runs asynchronously after insert
-- vector embeddings are generated later, after title, description, and content are all available
+- `content` starts as `null`
+- content backfill workers pick it up asynchronously — plain HTTP first, Playwright fallback for JS-heavy sites
+- vector embeddings are generated after title, description, and content are all available
+- only articles with content + embedding are exposed via the API
+
+Content backfill prioritises recent articles (`pub_date_effective DESC`) so newest content surfaces first regardless of ingestion order.
+
+Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.

 ## API overview

-All exposed endpoints are `GET` endpoints.
+All endpoints are `GET`.

 ### `GET /`

-Simple health check.
-
-**Response**
-```json
-{ "ok": true }
-```
-
-Use this to confirm the server is running, not to inspect ingestion state.
+Health check. Returns `{ "ok": true }`.

 ### `GET /articles`

-Returns articles from the `articles` table. Only articles that are considered **usable** are exposed — meaning they have non-empty `content`, a stored embedding, and are not index/category pages. Behavior changes based on the query params you send.
+Returns usable articles — non-empty `content`, stored embedding, not an index/category page.

 #### Query params

-##### `keyword`
+| Param | Description |
+|---|---|
+| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords — e.g. `keyword=bitcoin&keyword=ethereum` |
+| `keyword_mode` | How multiple keywords are combined — `and` (default) or `or` |
+| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
+| `from` | `pub_date >= from` (ISO-8601) |
+| `to` | `pub_date <= to` (ISO-8601) |
+| `limit` | Rows to return. Default `20`, max `100` |
+| `offset` | Pagination offset. Default `0` |
+| `order` | Sort order — see below. Not applied to `semantic` or `similar_to_article` results (those are sorted by distance) |
+| `semantic` | Semantic search by meaning via embedding similarity |
+| `similar_to_article` | Vector similarity search using another article's embedding |

-Plain keyword search.
+#### `order` values

-- matches `title`, `description`, and `content`
-- uses SQL `LIKE`
-- works like substring matching, not semantic search
-- best when you want literal words or phrases to appear in the article text
+| Value | Sort |
+|---|---|
+| `newest` | `pub_date_effective DESC` (default) |
+| `oldest` | `pub_date_effective ASC` |
+| `ingested_newest` | `ingested_at DESC` |
+| `ingested_oldest` | `ingested_at ASC` |

-Example:
-```http
-GET /articles?keyword=earnings
-```
+#### Search modes

-##### `source`
+- If `semantic` is present — semantic nearest-neighbor search. Query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
+- Else if `similar_to_article` is present — finds articles similar to the given article ID. Returns `404` if that article has no embedding.
+- Otherwise — normal filtered list mode. All params apply.

-Exact match on the stored `source` field.
+`keyword` (with `keyword_mode`), `source`, `from`, and `to` also work as post-filters on `semantic` and `similar_to_article` results.

-Example:
-```http
-GET /articles?source=rss
-```
+`include_embedding` is explicitly rejected on this endpoint.

-##### `from`
-
-Only returns rows where `pub_date >= from`.
-
-Example:
-```http
-GET /articles?from=2025-01-01T00:00:00.000Z
-```
-
-##### `to`
-
-Only returns rows where `pub_date <= to`.
-
-Example:
-```http
-GET /articles?to=2025-01-31T23:59:59.999Z
-```
-
-##### `limit`
-
-Number of rows to return.
-
-- default: `20`
-- max: `100`
-
-Example:
-```http
-GET /articles?limit=10
-```
-
-##### `offset`
-
-Pagination offset.
-
-- default: `0`
-
-Example:
-```http
-GET /articles?limit=10&offset=20
-```
-
-##### `similar_to_article`
-
-Runs vector similarity search instead of normal list mode.
-
-- value must be an existing article ID
-- the server looks up that article's embedding
-- nearest-neighbor search runs in `sqlite-vec`
-- the source article is excluded from the result set
-- each result includes a `distance` field
-- lower `distance` means more similar
-- returns `404` if the article has no stored embedding
-
-Example:
-```http
-GET /articles?similar_to_article=123&limit=5
-```
-
-Not found response:
-```json
-{ "error": "Embedding not found for article" }
-```
-
-##### `semantic`
-
-Semantic search by meaning, not exact wording.
-
-- use this when you want conceptually related results
-- unlike `keyword`, the words do not need to appear literally in the article text
-- the query text is normalized before embedding
-- query embeddings are cached in SQLite
-- on cache miss, the server requests an embedding from OpenRouter
-- nearest article matches are returned from the embedding index
-- each result includes a `distance` field
-- lower `distance` means a closer semantic match
-- returns `400` if `semantic` is empty
-
-Example:
-```http
-GET /articles?semantic=ai chip demand&limit=10
-```
-
-Bad request response:
-```json
-{ "error": "Semantic query must not be empty" }
-```
-
-##### `include_embedding`
-
-Explicitly rejected on `/articles`.
-
-Response:
-```json
-{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
-```
-
-#### General behavior
-
-- If `semantic` is present, semantic search is used.
-- Else if `similar_to_article` is present, similarity search is used.
-- Otherwise normal list/search mode is used.
-- `keyword` is literal keyword matching.
-- `semantic` is semantic matching by meaning.
-- Normal list/search results are ordered by `COALESCE(pub_date, ingested_at) DESC, id DESC`.
-- `from` and `to` are compared against stored publication timestamps, so ISO-8601 values are the safest input.
-- `source` must match the stored source name exactly.
-- `keyword` is substring matching, not full-text search.
-
-#### Normal list/search response shape
+#### Response shape

 ```json
 [
@@ -194,113 +92,60 @@ Response:
    "title": "...",
    "description": "...",
    "content": "...",
-    "image": "...",
    "url": "...",
    "normalized_title": "...",
-    "source": "rss",
+    "source": "rss:BBC",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
 ]
 ```

-#### Similarity/topic search response shape
-
-```json
-[
-  {
-    "id": 456,
-    "title": "...",
-    "description": "...",
-    "content": "...",
-    "image": "...",
-    "url": "...",
-    "normalized_title": "...",
-    "source": "rss",
-    "pub_date": "2025-01-02T09:00:00.000Z",
-    "ingested_at": "2025-01-02T09:00:10.000Z",
-    "distance": 0.1234
-  }
-]
-```
-
-#### Combined example
-
-```http
-GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
-```
+Semantic and similarity results also include `"distance": 0.1234`.

 ### `GET /articles/:id`

-Returns one article by numeric ID.
-
-**Behavior**
-
-- Looks up the article directly in SQLite.
-- Same usability filter as the list endpoint — returns `404` if the article exists but is not usable.
-- Returns the same article fields as normal `/articles` list mode.
-- Does not return embedding data.
-- Returns `404` if the ID does not exist.
-
-**Example**
-```http
-GET /articles/123
-```
-
-**Not found response**
-```json
-{ "error": "Article not found" }
-```
+Returns one article by numeric ID. Same usability filter as the list endpoint — returns `404` if the article exists but has no content or embedding.

 ### `GET /status`

-Returns ingestion and archive summary information.
+Returns archive summary. Cached for 30 seconds.

 **Response fields**

-- `total`: total number of rows in `articles` across all sources
-- `usable`: articles that have content, an embedding, and are not index pages
-- `lastIngestionBySource`: in-memory timestamps of the last successful batch run per source
-- `bySource`: per-source breakdown, each with `total` and `usable` counts
+- `total` — total rows across all sources
+- `usable` — articles with content + embedding, not index pages
+- `lastIngestionBySource` — in-memory timestamps of the last successful batch per source (resets on restart)
+- `bySource` — per-source `{ total, usable }`
+- `embeddingModels` — active embedding models with article count and detected dimensions

-**Important detail**
+### `GET /sources`

-`lastIngestionBySource` is kept in memory, so it resets when the process restarts.
+Returns the full source catalog from `sources.json` enriched with live DB stats.

-**Example response**
-```json
-{
-  "total": 10234,
-  "usable": 8700,
-  "lastIngestionBySource": {
-    "rss": "2025-01-02T10:00:00.000Z",
-    "gdelt": "2025-01-02T10:05:00.000Z"
-  },
-  "bySource": {
-    "alphavantage": { "total": 120, "usable": 98 },
-    "edgar": { "total": 88, "usable": 70 },
-    "finnhub": { "total": 400, "usable": 360 },
-    "gdelt": { "total": 2100, "usable": 1800 },
-    "rss": { "total": 7526, "usable": 6372 }
-  }
-}
-```
+**Per-source fields**
+
+- `id`, `label`, `websites`, `backfill`, `feeds` — from `sources.json` (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
+- `counts` — aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
+- `byFeed` — same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
+- `domains` — current domain fetch policy per website: `policy` (auto / browser_only / blocked), failure/success counts, `expiresAt`
+
+Use `domains[].policy` to diagnose why a source has high `skipped` or `failed` counts — `blocked` means backfill has given up on that domain temporarily.

 ## Article field notes

-- `image` stores the extracted main image as ultra-compressed base64 WebP.
-- `normalized_title` is stored for matching and indexing.
-- `source` may be a shared source like `rss`, `googlenews`, `gdelt`, `edgar`, `alphavantage`, or `finnhub`.
-- `pub_date` is normalized to ISO-8601 when it can be parsed.
-- `ingested_at` is the insert timestamp set by the server.
+- `pub_date` is normalized to ISO-8601 when parseable; `null` otherwise.
+- `pub_date_effective` is `COALESCE(pub_date, ingested_at)` — used for sorting.
+- `ingested_at` is the server-side insert timestamp.
+- `normalized_title` is stored for deduplication and indexing.
+- `source` format is `<feed_type>:<label>` for GDELT and RSS (e.g. `gdelt:Bloomberg Markets`, `rss:TechCrunch`), or just the source name for other feeds (`alphavantage`, `edgar`, `finnhub`).

 ## Notes

-- SQLite archive file defaults to `./archive.sqlite`.
-- Deduplication is enforced on `url`; normalized titles are stored and indexed for matching but are not unique.
-- `googleNews` accepts `queries`, `topics`, `language`, and `country`, and resolves Google redirect URLs to publisher URLs before ingestion.
-- Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content.
-- Embeddings are generated asynchronously with OpenRouter `perplexity/pplx-embed-v1-0.6b` and indexed in `sqlite-vec` for similarity search.
-- Topic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss.
-- SEC requests use the configured `User-Agent`.
-- Duplicate URLs are skipped rather than inserted again.
+- SQLite archive defaults to `./archive.sqlite`.
+- Deduplication is enforced on `url`.
+- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
+- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
+- Embeddings use OpenRouter and are indexed in `sqlite-vec` for ANN search.
+- Query embeddings are cached in SQLite to avoid redundant API calls.
+- SEC requests use the `User-Agent` from `config.json`.
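
The query parameters documented in the README changes above can be exercised from a client roughly as follows. This is a minimal sketch: the helper name `buildArticlesUrl` and the base URL `http://localhost:3000` are assumptions for illustration and do not appear in the repository. It repeats the `keyword` param once per keyword, which is the shape the README describes.

```javascript
// Hypothetical client-side helper (not part of the repository):
// builds a GET /articles URL with repeated `keyword` params.
function buildArticlesUrl(baseUrl, { keywords = [], keywordMode, order, limit } = {}) {
  const url = new URL('/articles', baseUrl);

  // append() keeps every value, so two keywords become keyword=a&keyword=b.
  for (const kw of keywords) {
    url.searchParams.append('keyword', kw);
  }
  if (keywordMode) url.searchParams.set('keyword_mode', keywordMode);
  if (order) url.searchParams.set('order', order);
  if (limit !== undefined) url.searchParams.set('limit', String(limit));

  return url.toString();
}

// The host and port below are assumed; the real values come from config.json.
const exampleUrl = buildArticlesUrl('http://localhost:3000', {
  keywords: ['bitcoin', 'ethereum'],
  keywordMode: 'or',
  order: 'oldest',
  limit: 10,
});
console.log(exampleUrl);
```

Using `append` rather than `set` matters here, since `set` would overwrite the first keyword with the second.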
@@ -12,9 +12,15 @@ function buildArticlesQuery(query) {
  const includeEmbedding = String(query.include_embedding || '').toLowerCase() === 'true';

  if (query.keyword) {
-    conditions.push('(title LIKE ? OR description LIKE ? OR content LIKE ?)');
-    const keyword = `%${query.keyword}%`;
-    params.push(keyword, keyword, keyword);
+    const keywords = [].concat(query.keyword).map((k) => String(k).trim()).filter(Boolean);
+
+    // Skip whitespace-only input: an empty clause list would emit the invalid SQL "()".
+    if (keywords.length > 0) {
+      const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
+      const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');
+
+      conditions.push(`(${clauses.join(` ${mode} `)})`);
+      for (const kw of keywords) {
+        const like = `%${kw}%`;
+        params.push(like, like, like);
+      }
+    }
  }

  if (query.source) {
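
To make the effect of the keyword hunk above easier to see, here is a standalone sketch of the same clause-building logic. The function name `buildKeywordClause` and its return shape are hypothetical; the commit inlines this logic rather than extracting a helper.

```javascript
// Hypothetical standalone version of the keyword clause logic in this hunk.
// Returns null when no usable keyword remains after trimming.
function buildKeywordClause(keywordParam, keywordMode) {
  // A single ?keyword=x arrives as a string, repeated params as an array.
  const keywords = [].concat(keywordParam ?? [])
    .map((k) => String(k).trim())
    .filter(Boolean);
  if (keywords.length === 0) return null;

  // Anything other than "or" (case-insensitive) falls back to AND.
  const mode = String(keywordMode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
  const clause = keywords
    .map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)')
    .join(` ${mode} `);

  // Three bound values per keyword, one for each searched column.
  const params = keywords.flatMap((kw) => {
    const like = `%${kw}%`;
    return [like, like, like];
  });

  return { sql: `(${clause})`, params };
}

const keywordClause = buildKeywordClause(['bitcoin', 'ethereum'], 'or');
console.log(keywordClause.sql);
console.log(keywordClause.params);
```

Because the values are bound rather than interpolated, keywords cannot inject SQL, but `%` and `_` inside a keyword still act as `LIKE` wildcards since no `ESCAPE` clause is used.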
@@ -36,6 +42,14 @@ function buildArticlesQuery(query) {
  conditions.push('is_index_page = 0');
  conditions.push('has_embedding = 1');

+  const ORDERS = {
+    newest: 'pub_date_effective DESC, id DESC',
+    oldest: 'pub_date_effective ASC, id ASC',
+    ingested_newest: 'ingested_at DESC, id DESC',
+    ingested_oldest: 'ingested_at ASC, id ASC',
+  };
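+  // Own-property check so inherited keys such as "constructor" cannot reach the ORDER BY clause.
+  const orderBy = Object.prototype.hasOwnProperty.call(ORDERS, query.order) ? ORDERS[query.order] : ORDERS.newest;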
+
  const whereClause = `WHERE ${conditions.join(' AND ')}`;
  const limit = Number.parseInt(query.limit, 10);
  const offset = Number.parseInt(query.offset, 10);
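
An `ORDER BY` expression cannot be passed as a bound parameter, which is presumably why the hunk above maps `order` through a fixed table instead of using the raw query value. The sketch below isolates that lookup; `resolveOrderBy` is a hypothetical name, and the own-property check is an added precaution rather than something taken from the commit.

```javascript
// Allowlist of sort expressions, copied from the hunk above.
const ORDERS = {
  newest: 'pub_date_effective DESC, id DESC',
  oldest: 'pub_date_effective ASC, id ASC',
  ingested_newest: 'ingested_at DESC, id DESC',
  ingested_oldest: 'ingested_at ASC, id ASC',
};

// Hypothetical helper: unknown or missing values fall back to "newest".
// hasOwnProperty guards against inherited keys such as "constructor".
function resolveOrderBy(order) {
  return Object.prototype.hasOwnProperty.call(ORDERS, order)
    ? ORDERS[order]
    : ORDERS.newest;
}

console.log(resolveOrderBy('ingested_oldest'));
console.log(resolveOrderBy('constructor'));
```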
@@ -48,7 +62,7 @@ function buildArticlesQuery(query) {
      SELECT id, title, description, content, ${includeEmbedding ? 'embedding,' : ''} url, normalized_title, source, pub_date, ingested_at
      FROM articles
      ${whereClause}
-      ORDER BY pub_date_effective DESC, id DESC
+      ORDER BY ${orderBy}
      LIMIT ? OFFSET ?
    `,
    params,
@@ -64,23 +78,58 @@ function shouldExcludeIndexPages(query) {
  return String(query.exclude_index_pages || '').toLowerCase() !== 'false';
 }

-function mapNeighborsToArticles(neighbors, excludeIndexPages, limit) {
+function mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query = {}) {
  const ids = neighbors.map((row) => row.articleId);
  if (ids.length === 0) {
    return [];
  }

  const placeholders = ids.map(() => '?').join(', ');
+  const conditions = [];
+  const params = [...ids];
+
+  conditions.push(`id IN (${placeholders})`);
+  conditions.push("content IS NOT NULL AND content != ''");
+  conditions.push('has_embedding = 1');
+
+  if (excludeIndexPages) conditions.push('is_index_page = 0');
+
+  if (query.source) {
+    conditions.push('source = ?');
+    params.push(query.source);
+  }
+
+  if (query.from) {
+    conditions.push('pub_date >= ?');
+    params.push(query.from);
+  }
+
+  if (query.to) {
+    conditions.push('pub_date <= ?');
+    params.push(query.to);
+  }
+
+  if (query.keyword) {
+    const keywords = [].concat(query.keyword).map((k) => String(k).trim()).filter(Boolean);
+
+    // Skip whitespace-only input: an empty clause list would emit the invalid SQL "()".
+    if (keywords.length > 0) {
+      const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
+      const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');
+
+      conditions.push(`(${clauses.join(` ${mode} `)})`);
+      for (const kw of keywords) {
+        const like = `%${kw}%`;
+        params.push(like, like, like);
+      }
+    }
+  }
+
  const articles = db.prepare(`
    SELECT id, title, description, content, url, normalized_title, source, pub_date, ingested_at
    FROM articles
-    WHERE id IN (${placeholders})
-      AND content IS NOT NULL AND content != ''
-      AND has_embedding = 1
-      ${excludeIndexPages ? 'AND is_index_page = 0' : ''}
-  `).all(...ids);
+    WHERE ${conditions.join(' AND ')}
+  `).all(...params);
+
  const byId = new Map(articles.map((article) => [article.id, article]));

+  // preserve distance ordering from the vector search
  return neighbors
    .map((row) => {
      const article = byId.get(row.articleId);
@@ -113,7 +162,7 @@ async function articleRoutes(fastify) {
        Math.min(limit * 5, 500)
      );

-      return mapNeighborsToArticles(neighbors, excludeIndexPages, limit);
+      return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
    }

    if (query.similar_to_article) {
@@ -130,7 +179,7 @@ async function articleRoutes(fastify) {
        return { error: 'Embedding not found for article' };
      }

-      return mapNeighborsToArticles(neighbors, excludeIndexPages, limit);
+      return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
    }

    const { sql, params } = buildArticlesQuery(query);
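
The post-filtering in `mapNeighborsToArticles` can be pictured with in-memory data. This is a sketch under assumptions: the rest of the `.map` callback is not visible in the diff, so the way `distance` is attached and the `limit` is applied here is a guess, and `mergeByDistance` is a hypothetical name.

```javascript
// Hypothetical illustration of merging vector-search neighbors with the
// rows that survived the SQL post-filters, keeping distance order.
// neighbors: [{ articleId, distance }] already sorted by ascending distance.
// articles:  rows returned by the filtered SELECT (any order).
function mergeByDistance(neighbors, articles, limit) {
  const byId = new Map(articles.map((article) => [article.id, article]));

  return neighbors
    .filter((row) => byId.has(row.articleId)) // drop neighbors removed by filters
    .slice(0, limit)                          // apply limit after filtering
    .map((row) => ({ ...byId.get(row.articleId), distance: row.distance }));
}

const merged = mergeByDistance(
  [
    { articleId: 3, distance: 0.1 },
    { articleId: 1, distance: 0.2 },
    { articleId: 2, distance: 0.3 },
  ],
  [{ id: 2, title: 'B' }, { id: 1, title: 'A' }], // article 3 was filtered out
  1
);
console.log(merged);
```

Since filters run after the nearest-neighbor lookup, a narrow `source` or `keyword` filter can leave fewer rows than `limit`; the route's over-fetch of `Math.min(limit * 5, 500)` neighbors reduces that risk but does not remove it.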