ImBenji cb819e77ee enhance article query capabilities by supporting multiple keywords and dynamic ordering

2026-04-21 11:42:21 +01:00

5.9 KiB

duriin_api

Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.

Setup

Install dependencies:
```
npm install
```
Edit config.json with your API keys, tickers, and schedules.
Start the server:
```
npm start
```

The server listens on the host and port defined in config.json.

How the data pipeline works

On startup the server:

1. Opens the SQLite database and runs any pending migrations.
2. Registers routes.
3. Starts the HTTP server.
4. Launches continuous background loops for each source, content backfill, and embedding backfill.

When a new article is inserted:

- The record is written immediately with title, description, url, source, and timestamps.
- `content` starts as null.
- Content backfill workers pick it up asynchronously, trying plain HTTP first and falling back to Playwright for JS-heavy sites.
- Vector embeddings are generated once title, description, and content are all available.
- Only articles with both content and an embedding are exposed via the API.
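The lifecycle above implies a readiness check before an article is served. The following is a minimal sketch of such a check, not the server's actual code: the function name and the `embedding` and `isIndexPage` fields are assumptions made for illustration.

```javascript
// Hypothetical sketch: `embedding` and `isIndexPage` are assumed field names,
// not necessarily the columns used by the real schema.
function isUsable(article) {
  // Content must be a non-empty string (it starts as null after ingestion).
  const hasContent =
    typeof article.content === "string" && article.content.trim().length > 0;
  // An embedding must have been stored by the embedding backfill loop.
  const hasEmbedding = article.embedding != null;
  // Index/category pages are excluded even when they have content.
  return hasContent && hasEmbedding && !article.isIndexPage;
}
```

A freshly ingested row fails this check until both backfill loops have processed it.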

Content backfill prioritises recent articles (pub_date_effective DESC) so newest content surfaces first regardless of ingestion order.

Per-domain fetch policies are tracked automatically: domains that repeatedly fail plain fetch are upgraded to browser-only, and domains that fail both plain and browser fetches are blocked temporarily.
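One way to picture the per-domain policy is as a small state machine. The sketch below is illustrative only: the thresholds, the block duration, and every function name are assumptions, since the real values are not documented here. Only the policy names (`auto`, `browser_only`, `blocked`) come from this README.

```javascript
// Assumed values for illustration; the server's real thresholds may differ.
const PLAIN_FAILURES_BEFORE_BROWSER_ONLY = 3;
const BROWSER_FAILURES_BEFORE_BLOCK = 3;
const BLOCK_DURATION_MS = 60 * 60 * 1000;

function newDomainState() {
  return { policy: "auto", plainFailures: 0, browserFailures: 0, expiresAt: null };
}

// Record one failed fetch and return the updated state (the input is not mutated).
function recordFailure(state, method, now = Date.now()) {
  const next = { ...state };
  if (method === "plain") next.plainFailures += 1;
  else next.browserFailures += 1;

  const plainExhausted = next.plainFailures >= PLAIN_FAILURES_BEFORE_BROWSER_ONLY;
  const browserExhausted = next.browserFailures >= BROWSER_FAILURES_BEFORE_BLOCK;

  if (plainExhausted && browserExhausted) {
    // Both fetch methods keep failing: block the domain for a while.
    next.policy = "blocked";
    next.expiresAt = now + BLOCK_DURATION_MS;
  } else if (next.policy === "auto" && plainExhausted) {
    // Plain HTTP keeps failing: skip it and go straight to the browser.
    next.policy = "browser_only";
  }
  return next;
}

// A block is temporary: once it has expired the domain is treated as "auto" again.
function effectivePolicy(state, now = Date.now()) {
  if (state.policy === "blocked" && state.expiresAt !== null && now >= state.expiresAt) {
    return "auto";
  }
  return state.policy;
}

// Which fetch methods a backfill worker would try, in order, under each policy.
function fetchMethodsFor(policy) {
  if (policy === "blocked") return [];
  if (policy === "browser_only") return ["browser"];
  return ["plain", "browser"];
}
```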

API overview

All endpoints are GET.

`GET /`

Health check. Returns { "ok": true }.

`GET /articles`

Returns usable articles — non-empty content, stored embedding, not an index/category page.

Query params

| Param | Description |
| --- | --- |
| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords, e.g. `keyword=bitcoin&keyword=ethereum` |
| `keyword_mode` | How multiple keywords are combined: `and` (default) or `or` |
| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
| `from` | `pub_date >= from` (ISO-8601) |
| `to` | `pub_date <= to` (ISO-8601) |
| `limit` | Rows to return. Default `20`, max `100` |
| `offset` | Pagination offset. Default `0` |
| `order` | Sort order (see `order` values below). Not applied to `semantic` or `similar_to_article` results, which are sorted by distance |
| `semantic` | Semantic search by meaning via embedding similarity |
| `similar_to_article` | Vector similarity search using another article's embedding |
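Because `keyword` is repeated rather than comma-separated, it helps to build the query string with `URLSearchParams.append`. The helper below is a hypothetical client-side sketch; its name and the `http://localhost:3000` base URL are assumptions, so substitute the host and port from your config.json.

```javascript
// Hypothetical client helper; not part of the server.
function buildArticlesUrl(baseUrl, { keywords = [], ...params } = {}) {
  const url = new URL("/articles", baseUrl);
  // append() emits one keyword=... pair per entry instead of overwriting.
  for (const kw of keywords) url.searchParams.append("keyword", kw);
  // All other params appear at most once, so set() is sufficient.
  for (const [name, value] of Object.entries(params)) {
    if (value !== undefined && value !== null) {
      url.searchParams.set(name, String(value));
    }
  }
  return url.toString();
}

// Assumed base URL for illustration only.
const exampleUrl = buildArticlesUrl("http://localhost:3000", {
  keywords: ["bitcoin", "ethereum"],
  keyword_mode: "or",
  from: "2025-01-01T00:00:00Z",
  limit: 50,
});
console.log(exampleUrl);
```

The resulting URL can be passed to `fetch` or any other HTTP client.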

`order` values

| Value | Sort |
| --- | --- |
| `newest` | `pub_date_effective DESC` (default) |
| `oldest` | `pub_date_effective ASC` |
| `ingested_newest` | `ingested_at DESC` |
| `ingested_oldest` | `ingested_at ASC` |
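A dynamic `ORDER BY` is usually implemented as a whitelist lookup so that user input never reaches the SQL string directly. The sketch below shows that pattern using the values from the table above. The function name and the fallback to `newest` for unknown values are assumptions; the server may reject unknown values instead.

```javascript
// Values copied from the `order` table; the lookup function itself is hypothetical.
const ORDER_CLAUSES = {
  newest: "pub_date_effective DESC",
  oldest: "pub_date_effective ASC",
  ingested_newest: "ingested_at DESC",
  ingested_oldest: "ingested_at ASC",
};

function orderClause(order) {
  // hasOwnProperty guards against inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(ORDER_CLAUSES, order)
    ? ORDER_CLAUSES[order]
    : ORDER_CLAUSES.newest; // assumed fallback: behave like the default
}
```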

Search modes

- If `semantic` is present: semantic nearest-neighbor search. The query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
- Else if `similar_to_article` is present: finds articles similar to the given article ID. Returns 404 if that article has no embedding.
- Otherwise: normal filtered list mode. All params apply.

`keyword`, `source`, `from`, and `to` also work as post-filters on `semantic` and `similar_to_article` results.
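The precedence between the three modes, and the way multiple keywords combine, can be sketched as follows. Both function names are hypothetical, and case-insensitive substring matching is an assumption; the server's actual matching rules are not specified here.

```javascript
// Mode precedence as described above: semantic, then similar_to_article, then list.
function selectSearchMode(query) {
  if (query.semantic !== undefined) return "semantic";
  if (query.similar_to_article !== undefined) return "similar";
  return "list";
}

// Keyword filter over title, description, and content.
// Assumption: matching is a case-insensitive substring test.
function matchesKeywords(article, keywords, mode = "and") {
  if (keywords.length === 0) return true;
  const haystack = [article.title, article.description, article.content]
    .filter(Boolean) // skip null or empty fields
    .join("\n")
    .toLowerCase();
  const hit = (kw) => haystack.includes(kw.toLowerCase());
  // "and" requires every keyword to match, "or" requires at least one.
  return mode === "or" ? keywords.some(hit) : keywords.every(hit);
}
```

In the semantic and similarity modes, a filter like `matchesKeywords` would run after the nearest-neighbor lookup, which is why a filtered result can contain fewer rows than `limit`.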

include_embedding is explicitly rejected on this endpoint.

Response shape

[
  {
    "id": 123,
    "title": "...",
    "description": "...",
    "content": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss:BBC",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
]

Semantic and similarity results also include "distance": 0.1234.

`GET /articles/:id`

Returns one article by numeric ID. Same usability filter as the list endpoint — returns 404 if the article exists but has no content or embedding.

`GET /status`

Returns archive summary. Cached for 30 seconds.

Response fields

- `total`: total rows across all sources
- `usable`: articles with content + embedding, not index pages
- `lastIngestionBySource`: in-memory timestamps of the last successful batch per source (resets on restart)
- `bySource`: per-source `{ total, usable }`
- `embeddingModels`: active embedding models with article count and detected dimensions
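The 30-second cache on `/status` can be pictured as a small time-to-live wrapper around the summary query. This is a generic sketch of that technique rather than the server's implementation; the `cached` helper and its injectable clock are assumptions made to keep the example testable.

```javascript
// Hypothetical helper: memoises `compute` for `ttlMs` milliseconds.
// `now` is injectable so the behaviour can be tested without waiting.
function cached(compute, ttlMs, now = Date.now) {
  let value;
  let expiresAt = -Infinity; // forces a computation on first use
  return () => {
    const t = now();
    if (t >= expiresAt) {
      value = compute();
      expiresAt = t + ttlMs;
    }
    return value;
  };
}

// Usage: wrap an expensive summary function with a 30-second TTL.
const getStatus = cached(() => ({ computedAt: Date.now() }), 30_000);
```

Repeated calls inside the window return the same object, so clients polling `/status` may see figures up to 30 seconds old.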

`GET /sources`

Returns the full source catalog from sources.json enriched with live DB stats.

Per-source fields

- `id`, `label`, `websites`, `backfill`, `feeds`: from sources.json (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
- `counts`: aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
- `byFeed`: same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
- `domains`: current domain fetch policy per website, with `policy` (`auto` / `browser_only` / `blocked`), failure/success counts, and `expiresAt`

Use domains[].policy to diagnose why a source has high skipped or failed counts — blocked means backfill has given up on that domain temporarily.
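Clients that consume `feeds` may want to separate dead feeds from live ones. The sketch below assumes that a dead feed is stored as the literal `[FAILED]` prefix followed by optional whitespace and then the URL. The exact separator is not specified above, and the function name is hypothetical.

```javascript
// Assumption: entries look like "[FAILED] https://..." or "[FAILED]https://...".
function parseFeedEntry(entry) {
  const PREFIX = "[FAILED]";
  const failed = entry.startsWith(PREFIX);
  // Strip the marker and any whitespace that follows it.
  const url = failed ? entry.slice(PREFIX.length).trim() : entry;
  return { url, failed };
}
```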

Article field notes

- `pub_date` is normalized to ISO-8601 when parseable; null otherwise.
- `pub_date_effective` is `COALESCE(pub_date, ingested_at)` and is used for sorting.
- `ingested_at` is the server-side insert timestamp.
- `normalized_title` is stored for deduplication and indexing.
- `source` format is `<feed_type>:<label>` for GDELT and RSS (e.g. `gdelt:Bloomberg Markets`, `rss:TechCrunch`), or just the source name for other feeds (`alphavantage`, `edgar`, `finnhub`).
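Two of these notes translate directly into small client-side helpers. The sketch below mirrors the `COALESCE` rule and splits a `source` string into its parts. The function names and the returned object shape are assumptions, and splitting on the first colon only is a design choice so that labels containing a colon stay intact.

```javascript
// Mirrors COALESCE(pub_date, ingested_at): fall back when pub_date is null.
function pubDateEffective(article) {
  return article.pub_date ?? article.ingested_at;
}

// "rss:BBC" -> { feedType: "rss", label: "BBC" }
// "edgar"   -> { feedType: "edgar", label: null }
function parseSource(source) {
  const i = source.indexOf(":");
  if (i === -1) return { feedType: source, label: null };
  return { feedType: source.slice(0, i), label: source.slice(i + 1) };
}
```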

Notes

- SQLite archive defaults to ./archive.sqlite.
- Deduplication is enforced on `url`.
- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
- Embeddings use OpenRouter and are indexed in sqlite-vec for ANN search.
- Query embeddings are cached in SQLite to avoid redundant API calls.
- SEC requests use the User-Agent from config.json.
