
duriin_api

Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.

Setup

  1. Install dependencies:
    npm install
    
  2. Edit config.json with your API keys, tickers, and schedules.
  3. Start the server:
    npm start
    

The server listens on the host and port defined in config.json.

How the data pipeline works

On startup the server:

  1. Opens the SQLite database and runs any pending migrations.
  2. Registers routes.
  3. Starts the HTTP server.
  4. Launches continuous background loops for each source, content backfill, embedding backfill, and event clustering.

When a new article is inserted:

  • the record is written immediately with title, description, url, source, and timestamps
  • content starts as null
  • content backfill workers pick it up asynchronously — plain HTTP first, Playwright fallback for JS-heavy sites
  • vector embeddings are generated after title, description, and content are all available
  • the clustering worker assigns the article to an event once it has an embedding
  • only articles with content + embedding are exposed via the API
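The lifecycle above can be sketched as a small stage function. This is an illustration only, not the server's code: the stage names and the `embedding` and `event_id` properties are assumptions made for the example.

```javascript
// Hypothetical lifecycle stages for an article row; names are illustrative only.
function articleStage(article) {
  const hasContent =
    typeof article.content === "string" && article.content.trim() !== "";
  if (!hasContent) return "awaiting_content"; // content starts as null
  if (!article.embedding) return "awaiting_embedding"; // generated once text is available
  if (article.event_id == null) return "awaiting_clustering"; // not yet assigned to an event
  return "clustered";
}

// An article is exposed via the API once it has both content and an embedding.
function isExposed(article) {
  const stage = articleStage(article);
  return stage === "awaiting_clustering" || stage === "clustered";
}
```

A freshly inserted row (content `null`) is therefore hidden until both backfill steps have completed.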

Content backfill prioritises recent articles (pub_date_effective DESC) so newest content surfaces first regardless of ingestion order.

Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.
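A rough sketch of how such a policy transition might look. The thresholds, the block duration, the `plainFailures` and `browserFailures` counters, and the rule that an expired block returns to `auto` are all assumptions; the README only documents the three policy names and the `expiresAt` field.

```javascript
// Assumed thresholds; the real values are not documented in this README.
const PLAIN_FAILURE_LIMIT = 3;
const BROWSER_FAILURE_LIMIT = 3;
const BLOCK_DURATION_MS = 6 * 60 * 60 * 1000;

// Returns the next policy for a domain given its failure counters.
function nextPolicy(domain, now = Date.now()) {
  // A temporary block lapses once expiresAt has passed (assumed behaviour).
  if (domain.policy === "blocked") {
    return now >= domain.expiresAt
      ? { policy: "auto", expiresAt: null }
      : { policy: "blocked", expiresAt: domain.expiresAt };
  }
  const plainFailed = domain.plainFailures >= PLAIN_FAILURE_LIMIT;
  const browserFailed = domain.browserFailures >= BROWSER_FAILURE_LIMIT;
  // Both fetch strategies keep failing: block the domain temporarily.
  if (plainFailed && browserFailed) {
    return { policy: "blocked", expiresAt: now + BLOCK_DURATION_MS };
  }
  // Only plain HTTP keeps failing: skip it and go straight to the browser.
  if (plainFailed) {
    return { policy: "browser_only", expiresAt: null };
  }
  return { policy: domain.policy, expiresAt: null };
}
```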

API overview

All endpoints are GET.

GET /

Health check. Returns { "ok": true }.

GET /articles

Returns usable articles — non-empty content, stored embedding, not an index/category page.

Query params

| Param | Description |
| --- | --- |
| keyword | Keyword matched against title, description, and content. Repeat the param for multiple keywords, e.g. keyword=bitcoin&keyword=ethereum |
| keyword_mode | How multiple keywords are combined: and (default) or or |
| source | Exact match on the stored source field (e.g. rss:BBC, gdelt:Al Jazeera) |
| from | pub_date >= from (ISO-8601) |
| to | pub_date <= to (ISO-8601) |
| limit | Rows to return. Default 20, max 100 |
| offset | Pagination offset. Default 0 |
| order | Sort order, see below. Not applied to semantic or similar_to_article results (those are sorted by distance) |
| semantic | Semantic search by meaning via embedding similarity |
| similar_to_article | Vector similarity search using another article's embedding |

order values

| Value | Sort |
| --- | --- |
| newest | pub_date_effective DESC (default) |
| oldest | pub_date_effective ASC |
| ingested_newest | ingested_at DESC |
| ingested_oldest | ingested_at ASC |
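One way to turn an `order` value into a sort clause is a whitelist lookup, so that user input never reaches the SQL string directly. This is a sketch, not the server's implementation. Falling back to `newest` for unknown values is an assumption, and the server may instead reject them.

```javascript
// Whitelist of the documented order values and their sort clauses.
const ORDER_CLAUSES = {
  newest: "pub_date_effective DESC",
  oldest: "pub_date_effective ASC",
  ingested_newest: "ingested_at DESC",
  ingested_oldest: "ingested_at ASC",
};

// Unknown or missing values fall back to the documented default (assumed behaviour).
function orderClause(order) {
  return Object.hasOwn(ORDER_CLAUSES, order)
    ? ORDER_CLAUSES[order]
    : ORDER_CLAUSES.newest;
}
```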

Search modes

  • If semantic is present — semantic nearest-neighbor search. Query is embedded via OpenRouter and matched against the article index. Results include a distance field (lower = closer).
  • Else if similar_to_article is present — finds articles similar to the given article ID. Returns 404 if that article has no embedding.
  • Otherwise — normal filtered list mode. All params apply.
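The precedence between the three modes can be expressed as a short function. This is a sketch under one assumption: "present" is taken to mean defined and non-empty, which the README does not spell out.

```javascript
// "Present" is treated here as defined and non-empty (an assumption).
function isPresent(value) {
  return value !== undefined && value !== null && value !== "";
}

// Documented precedence: semantic, then similar_to_article, then list mode.
function resolveSearchMode(query) {
  if (isPresent(query.semantic)) return "semantic";
  if (isPresent(query.similar_to_article)) return "similar_to_article";
  return "list";
}

// order only affects list mode; the other two modes sort by distance.
function orderApplies(query) {
  return resolveSearchMode(query) === "list";
}
```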

keyword, source, from, and to also apply as post-filters to semantic and similar_to_article results.

include_embedding is explicitly rejected on this endpoint.
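A client-side helper for assembling these query strings might look like the following. The helper name, its option names, and the base URL used in the usage note are illustrative assumptions; the real host and port come from config.json.

```javascript
// Build a /articles URL from the documented query params.
// baseUrl is an assumption; use the host and port from config.json.
function buildArticlesUrl(baseUrl, options = {}) {
  const { keywords = [], keywordMode, source, from, to,
          limit, offset, order, semantic, similarToArticle } = options;
  const params = new URLSearchParams();
  // Multiple keywords are sent by repeating the same param.
  for (const keyword of keywords) params.append("keyword", keyword);
  if (keywordMode) params.set("keyword_mode", keywordMode);
  if (source) params.set("source", source);
  if (from) params.set("from", from);
  if (to) params.set("to", to);
  if (limit != null) params.set("limit", String(limit));
  if (offset != null) params.set("offset", String(offset));
  if (order) params.set("order", order);
  if (semantic) params.set("semantic", semantic);
  if (similarToArticle != null) params.set("similar_to_article", String(similarToArticle));
  const url = new URL("/articles", baseUrl);
  url.search = params.toString();
  return url.toString();
}
```

For example, `buildArticlesUrl("http://localhost:3000", { keywords: ["bitcoin", "ethereum"], keywordMode: "or", limit: 5 })` yields `http://localhost:3000/articles?keyword=bitcoin&keyword=ethereum&keyword_mode=or&limit=5`.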

Response shape

[
  {
    "id": 123,
    "title": "...",
    "description": "...",
    "content": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss:BBC",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
]

Semantic and similarity results also include "distance": 0.1234.

GET /articles/:id

Returns one article by numeric ID. Same usability filter as the list endpoint — returns 404 if the article exists but has no content or embedding.

GET /events

Without id — returns a paginated list of events. With id — returns a single event and its articles.

Query params

| Param | Description |
| --- | --- |
| id | Event ID. If present, returns that event with its articles instead of the list |
| limit | Rows to return (list mode only). Default 20, max 100 |
| offset | Pagination offset (list mode only). Default 0 |
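The limit and offset defaults could be applied along these lines. This is a sketch only: whether the server clamps an oversized limit to 100 or rejects the request is not stated here, so the clamping below is an assumption.

```javascript
// Normalise limit/offset query values to the documented defaults and maximum.
// Clamping (rather than rejecting) out-of-range values is an assumption.
function parsePagination(query, { defaultLimit = 20, maxLimit = 100 } = {}) {
  const limit = Number.parseInt(query.limit, 10);
  const offset = Number.parseInt(query.offset, 10);
  return {
    limit: Number.isInteger(limit) && limit > 0
      ? Math.min(limit, maxLimit)
      : defaultLimit,
    offset: Number.isInteger(offset) && offset >= 0 ? offset : 0,
  };
}
```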

List response shape

[
  { "id": 1, "title": "...", "pub_date": "2025-01-01T12:34:56.000Z" }
]

Single event response shape

{
  "id": 1,
  "title": "...",
  "pub_date": "2025-01-01T12:34:56.000Z",
  "articles": [
    {
      "id": 123,
      "title": "...",
      "description": "...",
      "content": "...",
      "url": "...",
      "normalized_title": "...",
      "source": "rss:BBC",
      "pub_date": "2025-01-01T12:34:56.000Z",
      "ingested_at": "2025-01-01T12:35:10.000Z"
    }
  ]
}

Returns 404 if the event ID does not exist.

GET /status

Returns archive summary. Cached for 30 seconds.
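A 30-second cache of this kind can be sketched as a small time-based memoizer. The helper below is hypothetical and only illustrates the idea; it is not taken from the server code.

```javascript
// Wrap an expensive computation so it is recomputed at most once per ttlMs.
// The injectable clock (now) exists only to make the sketch easy to test.
function cached(compute, ttlMs = 30_000, now = () => Date.now()) {
  let value;
  let expiresAt = -Infinity;
  return () => {
    const t = now();
    if (t >= expiresAt) {
      value = compute();
      expiresAt = t + ttlMs;
    }
    return value;
  };
}
```

A consequence for clients: counts returned by this endpoint may lag the database by up to 30 seconds.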

Response fields

  • total — total rows across all sources
  • usable — articles with content + embedding, not index pages
  • lastIngestionBySource — in-memory timestamps of the last successful batch per source (resets on restart)
  • bySource — per-source { total, usable }
  • embeddingModels — active embedding models with article count and detected dimensions

GET /sources

Returns the full source catalog from sources.json enriched with live DB stats.

Per-source fields

  • id, label, websites, backfill, feeds — from sources.json (feed URLs preserve the [FAILED] prefix if the feed has been marked dead)
  • counts — aggregated { total, ready, skipped, failed, pending, untried, usable } across all feed types for this source
  • byFeed — same breakdown split by feed prefix (rss, gdelt, etc.)
  • domains — current domain fetch policy per website: policy (auto / browser_only / blocked), failure/success counts, expiresAt

Use domains[].policy to diagnose why a source has high skipped or failed counts — blocked means backfill has given up on that domain temporarily.
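A small helper along these lines could list the blocked domains from a parsed /sources response. The `domain` property used for the hostname is an assumption, since the exact field name is not documented here.

```javascript
// Collect every blocked domain across all sources.
// Assumes each domains[] entry carries its hostname in a "domain" property.
function blockedDomains(sources) {
  const blocked = [];
  for (const source of sources) {
    for (const entry of source.domains ?? []) {
      if (entry.policy === "blocked") {
        blocked.push({
          source: source.id,
          domain: entry.domain,
          expiresAt: entry.expiresAt,
        });
      }
    }
  }
  return blocked;
}
```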

Article field notes

  • pub_date is normalized to ISO-8601 when parseable; null otherwise.
  • pub_date_effective is COALESCE(pub_date, ingested_at) — used for sorting.
  • ingested_at is the server-side insert timestamp.
  • normalized_title is stored for deduplication and indexing.
  • source format is <feed_type>:<label> for GDELT and RSS (e.g. gdelt:Bloomberg Markets, rss:TechCrunch), or just the source name for other feeds (alphavantage, edgar, finnhub).
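On the client side, the source format and the effective date rule can be mirrored with two small helpers. The function names are illustrative, not part of the API.

```javascript
// Split a stored source value into feed type and label.
// "gdelt:Bloomberg Markets" has both parts; "edgar" has no label.
function parseSource(source) {
  const separator = source.indexOf(":");
  if (separator === -1) return { feedType: source, label: null };
  return {
    feedType: source.slice(0, separator),
    label: source.slice(separator + 1),
  };
}

// JavaScript equivalent of COALESCE(pub_date, ingested_at).
function pubDateEffective(article) {
  return article.pub_date ?? article.ingested_at;
}
```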

Notes

  • SQLite archive defaults to ./archive.sqlite.
  • Deduplication is enforced on url.
  • GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
  • Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
  • Embeddings use OpenRouter and are indexed in sqlite-vec for ANN search.
  • Query embeddings are cached in SQLite to avoid redundant API calls.
  • SEC requests use the User-Agent from config.json.
  • Event clustering groups articles by embedding similarity (cosine distance ≤ config.clustering.distanceThreshold, default 0.25) and time proximity (within config.clustering.windowHours, default 72). Articles outside the time window are never grouped together even if embeddings are close.
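The grouping criterion in the last note can be sketched as a pairwise check. This is an illustration of the two documented conditions only. How the worker actually compares an article against an existing event (for example against one member or an aggregate) is not described here, and the `pubDate` and `embedding` property names are assumptions.

```javascript
// Cosine distance = 1 - cosine similarity.
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Two articles may share an event only if they are close in time AND in meaning.
function canGroup(a, b, { distanceThreshold = 0.25, windowHours = 72 } = {}) {
  const hoursApart =
    Math.abs(Date.parse(a.pubDate) - Date.parse(b.pubDate)) / 3_600_000;
  // The time window is a hard limit, checked before embedding distance.
  if (hoursApart > windowHours) return false;
  return cosineDistance(a.embedding, b.embedding) <= distanceThreshold;
}
```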