duriin_api
Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
Setup
- Install dependencies: npm install
- Edit config.json with your API keys, tickers, and schedules.
- Start the server: npm start
The server listens on the host and port defined in config.json.
How the data pipeline works
On startup the server:
- Opens the SQLite database and runs any pending migrations.
- Registers routes.
- Starts the HTTP server.
- Launches continuous background loops for each source, content backfill, and embedding backfill.
When a new article is inserted:
- the record is written immediately with title, description, url, source, and timestamps
- content starts as null
- content backfill workers pick it up asynchronously: plain HTTP first, with a Playwright fallback for JS-heavy sites
- vector embeddings are generated after title, description, and content are all available
- only articles with content + embedding are exposed via the API
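The lifecycle above can be summarized as a small state function. This is a minimal sketch rather than the server's actual code: the stage names and the hasEmbedding flag are hypothetical, chosen only to mirror the steps in the list.

```javascript
// Hypothetical stage names; the article object shape is assumed for illustration.
function pipelineStage(article) {
  const hasContent =
    typeof article.content === "string" && article.content.trim() !== "";
  if (!hasContent) return "awaiting_content"; // content starts as null
  if (!article.hasEmbedding) return "awaiting_embedding"; // embedding follows content
  return "usable"; // content + embedding: exposed via the API
}

const freshRow = { title: "Example", description: "Summary", content: null, hasEmbedding: false };
console.log(pipelineStage(freshRow)); // awaiting_content
console.log(pipelineStage({ ...freshRow, content: "Full text" })); // awaiting_embedding
console.log(pipelineStage({ ...freshRow, content: "Full text", hasEmbedding: true })); // usable
```

The index/category page exclusion mentioned for the list endpoint is left out here to keep the sketch short.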
Content backfill prioritises recent articles (pub_date_effective DESC) so newest content surfaces first regardless of ingestion order.
Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.
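One way to picture the per-domain policy tracking is as a small state machine over the three policy values reported by /sources (auto, browser_only, blocked). The sketch below is illustrative only: the failure thresholds, the block duration, and the function names are assumptions, not values taken from the server.

```javascript
// Hypothetical thresholds and block duration; the real values are not documented here.
const PLAIN_FAILURE_LIMIT = 3;
const BROWSER_FAILURE_LIMIT = 3;
const BLOCK_DURATION_MS = 60 * 60 * 1000;

function newDomainState() {
  return { policy: "auto", plainFailures: 0, browserFailures: 0, expiresAt: null };
}

// Record one fetch outcome ("plain" or "browser") and return the updated state.
function recordFetch(state, method, ok, now = Date.now()) {
  // A block that has expired falls back to automatic handling.
  if (state.policy === "blocked" && now >= state.expiresAt) {
    state = newDomainState();
  }
  const next = { ...state };
  const counter = method === "plain" ? "plainFailures" : "browserFailures";
  next[counter] = ok ? 0 : next[counter] + 1;

  const plainBroken = next.plainFailures >= PLAIN_FAILURE_LIMIT;
  const browserBroken = next.browserFailures >= BROWSER_FAILURE_LIMIT;
  if (plainBroken && browserBroken) {
    next.policy = "blocked"; // both fetch methods keep failing
    next.expiresAt = now + BLOCK_DURATION_MS;
  } else if (plainBroken) {
    next.policy = "browser_only"; // plain HTTP keeps failing
  }
  return next;
}

let domainState = newDomainState();
for (let i = 0; i < 3; i++) domainState = recordFetch(domainState, "plain", false, 0);
console.log(domainState.policy); // browser_only
for (let i = 0; i < 3; i++) domainState = recordFetch(domainState, "browser", false, 0);
console.log(domainState.policy, domainState.expiresAt); // blocked 3600000
```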
API overview
All endpoints are GET.
GET /
Health check. Returns { "ok": true }.
GET /articles
Returns usable articles — non-empty content, stored embedding, not an index/category page.
Query params
| Param | Description |
|---|---|
| keyword | Keyword matched against title, description, and content. Repeat the param for multiple keywords, e.g. keyword=bitcoin&keyword=ethereum |
| keyword_mode | How multiple keywords are combined: "and" (default) or "or" |
| source | Exact match on the stored source field (e.g. rss:BBC, gdelt:Al Jazeera) |
| from | pub_date >= from (ISO-8601) |
| to | pub_date <= to (ISO-8601) |
| limit | Rows to return. Default 20, max 100 |
| offset | Pagination offset. Default 0 |
| order | Sort order, see below. Not applied to semantic or similar_to_article results (those are sorted by distance) |
| semantic | Semantic search by meaning via embedding similarity |
| similar_to_article | Vector similarity search using another article's embedding |
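As an illustration of how these params combine on the client side, here is a minimal sketch that builds a request URL with the standard URL and URLSearchParams APIs. The helper name buildArticlesUrl and the base address http://localhost:3000 are assumptions; use the host and port from your config.json.

```javascript
// Hypothetical client helper; only the query parameter names come from the table above.
function buildArticlesUrl(baseUrl, { keywords = [], ...rest } = {}) {
  const url = new URL("/articles", baseUrl);
  // Repeat the keyword param once per keyword.
  for (const kw of keywords) url.searchParams.append("keyword", kw);
  // Copy the remaining params, skipping unset values.
  for (const [key, value] of Object.entries(rest)) {
    if (value !== undefined && value !== null) {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

const exampleUrl = buildArticlesUrl("http://localhost:3000", {
  keywords: ["bitcoin", "ethereum"],
  keyword_mode: "or",
  from: "2025-01-01T00:00:00Z",
  limit: 50,
});
console.log(exampleUrl);
```

The resulting string can be passed to fetch or any HTTP client; reserved characters in values are percent-encoded automatically.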
order values
| Value | Sort |
|---|---|
| newest | pub_date_effective DESC (default) |
| oldest | pub_date_effective ASC |
| ingested_newest | ingested_at DESC |
| ingested_oldest | ingested_at ASC |
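The table maps naturally onto a lookup from order value to SQL ORDER BY clause. The sketch below shows that idea; falling back to the default for unknown values is an assumption, since this document does not say how the server treats them.

```javascript
// Whitelist of order values taken from the table above.
const ORDER_CLAUSES = {
  newest: "pub_date_effective DESC",
  oldest: "pub_date_effective ASC",
  ingested_newest: "ingested_at DESC",
  ingested_oldest: "ingested_at ASC",
};

// Assumption: unknown or missing values fall back to the documented default.
function orderClause(order) {
  return Object.prototype.hasOwnProperty.call(ORDER_CLAUSES, order)
    ? ORDER_CLAUSES[order]
    : ORDER_CLAUSES.newest;
}

console.log(orderClause("ingested_oldest")); // ingested_at ASC
console.log(orderClause(undefined)); // pub_date_effective DESC
```

Resolving the clause through a fixed lookup, rather than interpolating the query string, keeps user input out of the SQL text.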
Search modes
- If semantic is present: semantic nearest-neighbor search. The query is embedded via OpenRouter and matched against the article index. Results include a distance field (lower = closer).
- Else if similar_to_article is present: finds articles similar to the given article ID. Returns 404 if that article has no embedding.
- Otherwise: normal filtered list mode. All params apply.
keyword, source, from, and to also work as post-filters on semantic and similar_to_article results.
include_embedding is explicitly rejected on this endpoint.
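The mode selection and post-filtering described above could look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, keyword matching is shown as a case-insensitive substring test (the real matching rule is not specified here), and rows with a null pub_date are dropped when from or to is set, mirroring SQL comparison semantics.

```javascript
// Pick the search mode using the precedence documented above.
function resolveSearchMode(query) {
  if (query.semantic) return "semantic";
  if (query.similar_to_article) return "similar";
  return "list";
}

// Assumed matching rule: case-insensitive substring over title, description, content.
function matchesKeywords(row, keywords, mode) {
  if (keywords.length === 0) return true;
  const haystack = [row.title, row.description, row.content]
    .filter(Boolean)
    .join("\n")
    .toLowerCase();
  const hit = (kw) => haystack.includes(kw.toLowerCase());
  return mode === "or" ? keywords.some(hit) : keywords.every(hit);
}

// Apply keyword, source, from, and to to rows already ranked by distance.
function postFilter(rows, { keywords = [], keywordMode = "and", source, from, to } = {}) {
  return rows.filter((row) => {
    if (!matchesKeywords(row, keywords, keywordMode)) return false;
    if (source !== undefined && row.source !== source) return false;
    const ts = row.pub_date ? Date.parse(row.pub_date) : NaN;
    if (from !== undefined && !(ts >= Date.parse(from))) return false;
    if (to !== undefined && !(ts <= Date.parse(to))) return false;
    return true;
  });
}

// Made-up rows for illustration only.
const rankedRows = [
  { id: 1, title: "Bitcoin rallies", description: "", content: "Ethereum follows",
    source: "rss:BBC", pub_date: "2025-01-02T00:00:00.000Z", distance: 0.12 },
  { id: 2, title: "Bitcoin dips", description: "", content: "Markets react",
    source: "gdelt:Al Jazeera", pub_date: null, distance: 0.2 },
];

console.log(resolveSearchMode({ semantic: "rate cuts", similar_to_article: "5" })); // semantic
console.log(postFilter(rankedRows, { keywords: ["bitcoin", "ethereum"] }).map((r) => r.id)); // [ 1 ]
console.log(postFilter(rankedRows, { keywords: ["bitcoin", "ethereum"], keywordMode: "or" }).map((r) => r.id)); // [ 1, 2 ]
```

Because filtering preserves the input order, the distance ranking survives the post-filter step.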
Response shape
[
{
"id": 123,
"title": "...",
"description": "...",
"content": "...",
"url": "...",
"normalized_title": "...",
"source": "rss:BBC",
"pub_date": "2025-01-01T12:34:56.000Z",
"ingested_at": "2025-01-01T12:35:10.000Z"
}
]
Semantic and similarity results also include "distance": 0.1234.
GET /articles/:id
Returns one article by numeric ID. Same usability filter as the list endpoint — returns 404 if the article exists but has no content or embedding.
GET /status
Returns archive summary. Cached for 30 seconds.
Response fields
- total: total rows across all sources
- usable: articles with content + embedding, not index pages
- lastIngestionBySource: in-memory timestamps of the last successful batch per source (resets on restart)
- bySource: per-source { total, usable }
- embeddingModels: active embedding models with article count and detected dimensions
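The 30-second caching of this endpoint can be pictured as a simple time-to-live wrapper around the summary query. The following is a generic sketch, not the server's code; cacheFor and the injected clock are hypothetical names used so the expiry behaviour is easy to follow.

```javascript
// Wrap a compute function so its result is reused for ttlMs milliseconds.
function cacheFor(ttlMs, compute, now = () => Date.now()) {
  let entry = null; // { value, expiresAt } once populated
  return function get() {
    const t = now();
    if (entry === null || t >= entry.expiresAt) {
      entry = { value: compute(), expiresAt: t + ttlMs };
    }
    return entry.value;
  };
}

// Usage with a fake clock and a made-up summary object.
let fakeNow = 0;
let computeCalls = 0;
const getStatus = cacheFor(30_000, () => ({ total: 100 + computeCalls++ }), () => fakeNow);

console.log(getStatus().total); // 100 (computed)
fakeNow = 29_999;
console.log(getStatus().total); // 100 (served from cache)
fakeNow = 30_000;
console.log(getStatus().total); // 101 (cache expired, recomputed)
```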
GET /sources
Returns the full source catalog from sources.json enriched with live DB stats.
Per-source fields
- id, label, websites, backfill, feeds: from sources.json (feed URLs preserve the [FAILED] prefix if the feed has been marked dead)
- counts: aggregated { total, ready, skipped, failed, pending, untried, usable } across all feed types for this source
- byFeed: same breakdown split by feed prefix (rss, gdelt, etc.)
- domains: current domain fetch policy per website, with policy (auto / browser_only / blocked), failure/success counts, and expiresAt
Use domains[].policy to diagnose why a source has high skipped or failed counts — blocked means backfill has given up on that domain temporarily.
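A small client-side helper can make that diagnosis quicker. The sketch below assumes the /sources response is an array of source objects, each carrying the id, counts, and domains fields listed above, with domains being an array; the exact response envelope is not shown in this document, so adjust accordingly. The sample data is made up.

```javascript
// List sources that have at least one blocked domain, with their skip/fail counts.
function sourcesWithBlockedDomains(sources) {
  return sources
    .filter((s) => (s.domains ?? []).some((d) => d.policy === "blocked"))
    .map((s) => ({
      id: s.id,
      skipped: s.counts?.skipped ?? 0,
      failed: s.counts?.failed ?? 0,
    }));
}

// Made-up sample shaped like the fields described above.
const sampleSources = [
  { id: "example-a", counts: { skipped: 40, failed: 12 }, domains: [{ policy: "blocked" }] },
  { id: "example-b", counts: { skipped: 1, failed: 0 }, domains: [{ policy: "auto" }] },
];
console.log(sourcesWithBlockedDomains(sampleSources));
// [ { id: 'example-a', skipped: 40, failed: 12 } ]
```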
Article field notes
- pub_date is normalized to ISO-8601 when parseable; null otherwise.
- pub_date_effective is COALESCE(pub_date, ingested_at), used for sorting.
- ingested_at is the server-side insert timestamp.
- normalized_title is stored for deduplication and indexing.
- source format is <feed_type>:<label> for GDELT and RSS (e.g. gdelt:Bloomberg Markets, rss:TechCrunch), or just the source name for other feeds (alphavantage, edgar, finnhub).
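These field rules can be expressed as three small helpers. They are a JavaScript approximation for illustration: the function names are hypothetical, Date.parse stands in for whatever date parser the server actually uses, and the server may well compute pub_date_effective in SQL instead.

```javascript
// Normalize a raw date string to ISO-8601, or null when it cannot be parsed.
function normalizePubDate(raw) {
  if (raw == null) return null;
  const ms = Date.parse(raw);
  return Number.isNaN(ms) ? null : new Date(ms).toISOString();
}

// In-memory equivalent of COALESCE(pub_date, ingested_at).
function effectivePubDate(article) {
  return article.pub_date ?? article.ingested_at;
}

// Split "<feed_type>:<label>" at the first colon; plain names have no label.
function parseSource(source) {
  const idx = source.indexOf(":");
  if (idx === -1) return { feedType: source, label: null };
  return { feedType: source.slice(0, idx), label: source.slice(idx + 1) };
}

console.log(normalizePubDate("Wed, 01 Jan 2025 12:34:56 GMT")); // 2025-01-01T12:34:56.000Z
console.log(normalizePubDate("not a date")); // null
console.log(effectivePubDate({ pub_date: null, ingested_at: "2025-01-01T12:35:10.000Z" }));
// 2025-01-01T12:35:10.000Z
console.log(parseSource("gdelt:Bloomberg Markets")); // { feedType: 'gdelt', label: 'Bloomberg Markets' }
console.log(parseSource("edgar")); // { feedType: 'edgar', label: null }
```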
Notes
- SQLite archive defaults to ./archive.sqlite.
- Deduplication is enforced on url.
- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
- Embeddings use OpenRouter and are indexed in sqlite-vec for ANN search.
- Query embeddings are cached in SQLite to avoid redundant API calls.
- SEC requests use the User-Agent from config.json.
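To illustrate the per-window GDELT streaming mentioned above, here is a minimal sketch that walks a long time range one window at a time, so only a single window's articles are held in memory. The window size and the fetchWindow and insertArticles callbacks are hypothetical placeholders, not the server's actual functions.

```javascript
// Yield { from, to } windows covering [startMs, endMs), oldest first.
function* timeWindows(startMs, endMs, windowMs) {
  for (let from = startMs; from < endMs; from += windowMs) {
    yield { from, to: Math.min(from + windowMs, endMs) };
  }
}

// Fetch and store one window before moving on to the next.
async function ingestBacklog(startMs, endMs, windowMs, fetchWindow, insertArticles) {
  let inserted = 0;
  for (const win of timeWindows(startMs, endMs, windowMs)) {
    const articles = await fetchWindow(win); // hypothetical fetcher
    inserted += await insertArticles(articles); // hypothetical DB writer, returns a count
  }
  return inserted;
}

// Three windows over a 25-unit range with a window size of 10.
console.log([...timeWindows(0, 25, 10)]);
// [ { from: 0, to: 10 }, { from: 10, to: 20 }, { from: 20, to: 25 } ]
```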