duriin_api
Node.js Fastify server that ingests news articles from RSS, GDELT, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and Google News into a local SQLite archive.
Setup
- Install dependencies: npm install
- Edit config.json with your API keys, tickers, and schedules.
- Start the server: npm start
The server listens on the host and port defined in config.json.
How the data pipeline works
On startup the server:
- Opens the SQLite database and runs any pending migrations.
- Registers routes.
- Starts the HTTP server.
- Launches continuous background loops for each source, content backfill, embedding backfill, and event clustering.
When a new article is inserted:
- the record is written immediately with title, description, url, source, and timestamps
- content starts as null
- content backfill workers pick it up asynchronously: plain HTTP first, Playwright fallback for JS-heavy sites
- vector embeddings are generated after title, description, and content are all available
- the clustering worker assigns the article to an event once it has an embedding
- only articles with content + embedding are exposed via the API
Content backfill prioritises recent articles (pub_date_effective DESC) so newest content surfaces first regardless of ingestion order.
Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.
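The policy escalation described above can be pictured as a small state machine. The sketch below is only an illustration: the failure thresholds, the block duration, and the counter field names (plainFailures, browserFailures) are assumptions, not values taken from this project. Only the policy names (auto, browser_only, blocked) and expiresAt come from this README.

```javascript
// Hypothetical thresholds; the real values are not documented here.
const PLAIN_FAILURES_BEFORE_BROWSER_ONLY = 3;
const BROWSER_FAILURES_BEFORE_BLOCK = 3;
const BLOCK_DURATION_MS = 6 * 60 * 60 * 1000;

function newDomainState() {
  return { policy: "auto", plainFailures: 0, browserFailures: 0, expiresAt: null };
}

// Returns the next state after one fetch attempt.
// method: "plain" | "browser"; ok: whether the fetch produced content.
function recordFetch(state, method, ok, now = Date.now()) {
  const next = { ...state };
  // An expired block falls back to "auto" with fresh counters.
  if (next.policy === "blocked" && next.expiresAt !== null && now >= next.expiresAt) {
    Object.assign(next, newDomainState());
  }
  if (ok) {
    // A success resets the failure counter for that fetch method.
    if (method === "plain") next.plainFailures = 0;
    else next.browserFailures = 0;
    return next;
  }
  if (method === "plain") {
    next.plainFailures += 1;
    if (next.policy === "auto" && next.plainFailures >= PLAIN_FAILURES_BEFORE_BROWSER_ONLY) {
      next.policy = "browser_only";
    }
  } else {
    next.browserFailures += 1;
    if (next.browserFailures >= BROWSER_FAILURES_BEFORE_BLOCK) {
      next.policy = "blocked";
      next.expiresAt = now + BLOCK_DURATION_MS;
    }
  }
  return next;
}
```

Keeping the transition a pure function of (state, outcome, time) makes the temporary nature of a block easy to test without touching the network.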
API overview
All endpoints are GET.
GET /
Health check. Returns { "ok": true }.
GET /articles
Returns usable articles — non-empty content, stored embedding, not an index/category page.
Query params
| Param | Description |
|---|---|
| keyword | Keyword matched against title, description, and content. Repeat the param for multiple keywords, e.g. keyword=bitcoin&keyword=ethereum |
| keyword_mode | How multiple keywords are combined: "and" (default) or "or" |
| source | Exact match on the stored source field (e.g. rss:BBC, gdelt:Al Jazeera) |
| from | pub_date >= from (ISO-8601) |
| to | pub_date <= to (ISO-8601) |
| limit | Rows to return. Default 20, max 100 |
| offset | Pagination offset. Default 0 |
| order | Sort order, see below. Not applied to semantic or similar_to_article results (those are sorted by distance) |
| semantic | Semantic search by meaning via embedding similarity |
| similar_to_article | Vector similarity search using another article's embedding |
order values
| Value | Sort |
|---|---|
| newest | pub_date_effective DESC (default) |
| oldest | pub_date_effective ASC |
| ingested_newest | ingested_at DESC |
| ingested_oldest | ingested_at ASC |
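As an illustration of how these query params combine, here is a minimal sketch of a URL builder. The function name, the option names, and the base URL are hypothetical; only the param names come from the tables above. Repeated keyword params are produced with URLSearchParams.append.

```javascript
// Builds a GET /articles URL from the documented query params.
// `baseUrl` is a placeholder: use the host and port from config.json.
function buildArticlesUrl(baseUrl, opts = {}) {
  const { keywords = [], keywordMode, source, from, to, limit, offset, order } = opts;
  const url = new URL("/articles", baseUrl);
  // Each keyword becomes its own `keyword=` entry.
  for (const kw of keywords) url.searchParams.append("keyword", kw);
  if (keywordMode !== undefined) url.searchParams.set("keyword_mode", keywordMode);
  if (source !== undefined) url.searchParams.set("source", source);
  if (from !== undefined) url.searchParams.set("from", from);
  if (to !== undefined) url.searchParams.set("to", to);
  if (limit !== undefined) url.searchParams.set("limit", String(limit));
  if (offset !== undefined) url.searchParams.set("offset", String(offset));
  if (order !== undefined) url.searchParams.set("order", order);
  return url.toString();
}

// Example: articles mentioning either keyword, newest first.
const exampleUrl = buildArticlesUrl("http://localhost:3000", {
  keywords: ["bitcoin", "ethereum"],
  keywordMode: "or",
  limit: 50,
  order: "newest",
});
```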
Search modes
- If semantic is present: semantic nearest-neighbor search. The query is embedded via OpenRouter and matched against the article index. Results include a distance field (lower = closer).
- Else if similar_to_article is present: finds articles similar to the given article ID. Returns 404 if that article has no embedding.
- Otherwise: normal filtered list mode. All params apply.
keyword, source, from, and to also work as post-filters on semantic and similar_to_article results.
include_embedding is explicitly rejected on this endpoint.
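The precedence between the three modes can be sketched as follows. This is not the server's actual handler: the function name and the returned object are hypothetical, and how the server reports the include_embedding rejection (status code, message) is not documented here, so the sketch simply throws.

```javascript
// Decides which search mode a parsed query object selects.
function resolveSearchMode(query) {
  if ("include_embedding" in query) {
    // The README only says this param is rejected; the exact response is an assumption.
    throw new Error("include_embedding is not supported on /articles");
  }
  // semantic wins over similar_to_article; both are sorted by distance, not by `order`.
  if (query.semantic !== undefined) return { mode: "semantic", sortedBy: "distance" };
  if (query.similar_to_article !== undefined) return { mode: "similar", sortedBy: "distance" };
  // Plain list mode honours `order`, defaulting to newest.
  return { mode: "list", sortedBy: query.order || "newest" };
}
```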
Response shape
[
{
"id": 123,
"title": "...",
"description": "...",
"content": "...",
"url": "...",
"normalized_title": "...",
"source": "rss:BBC",
"pub_date": "2025-01-01T12:34:56.000Z",
"ingested_at": "2025-01-01T12:35:10.000Z"
}
]
Semantic and similarity results also include "distance": 0.1234.
GET /articles/:id
Returns one article by numeric ID. Same usability filter as the list endpoint — returns 404 if the article exists but has no content or embedding.
GET /events
Without id — returns a paginated list of events. With id — returns a single event and its articles.
Query params
| Param | Description |
|---|---|
| id | Event ID. If present, returns that event with its articles instead of the list |
| limit | Rows to return (list mode only). Default 20, max 100 |
| offset | Pagination offset (list mode only). Default 0 |
List response shape
[
{ "id": 1, "title": "...", "pub_date": "2025-01-01T12:34:56.000Z" }
]
Single event response shape
{
"id": 1,
"title": "...",
"pub_date": "2025-01-01T12:34:56.000Z",
"articles": [
{
"id": 123,
"title": "...",
"description": "...",
"content": "...",
"url": "...",
"normalized_title": "...",
"source": "rss:BBC",
"pub_date": "2025-01-01T12:34:56.000Z",
"ingested_at": "2025-01-01T12:35:10.000Z"
}
]
}
Returns 404 if the event ID does not exist.
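For paging through the event list, a client can derive limit and offset from a page index. The helper below is a hypothetical client-side sketch; it clamps the page size to the documented range (default 20, max 100) rather than relying on how the server treats out-of-range values.

```javascript
// Returns the query string for a zero-based page of GET /events.
function pageQuery(page, pageSize = 20) {
  const size = Number.isFinite(pageSize) ? Math.trunc(pageSize) : 20;
  const limit = Math.min(Math.max(1, size), 100); // documented max is 100
  const index = Number.isFinite(page) ? Math.max(0, Math.trunc(page)) : 0;
  const offset = index * limit;
  return new URLSearchParams({ limit: String(limit), offset: String(offset) }).toString();
}
```

For example, pageQuery(2, 50) yields the params for rows 100 to 149 of the list.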
GET /status
Returns archive summary. Cached for 30 seconds.
Response fields
- total: total rows across all sources
- usable: articles with content + embedding, not index pages
- lastIngestionBySource: in-memory timestamps of the last successful batch per source (resets on restart)
- bySource: per-source { total, usable }
- embeddingModels: active embedding models with article count and detected dimensions
GET /sources
Returns the full source catalog from sources.json enriched with live DB stats.
Per-source fields
- id, label, websites, backfill, feeds: from sources.json (feed URLs preserve the [FAILED] prefix if the feed has been marked dead)
- counts: aggregated { total, ready, skipped, failed, pending, untried, usable } across all feed types for this source
- byFeed: same breakdown split by feed prefix (rss, gdelt, etc.)
- domains: current domain fetch policy per website, with policy (auto / browser_only / blocked), failure/success counts, and expiresAt
Use domains[].policy to diagnose why a source has high skipped or failed counts — blocked means backfill has given up on that domain temporarily.
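A small client-side helper can surface the blocked domains from a /sources response. This sketch assumes the response is an array of source objects and that each domains entry carries a domain name field; the name `domain` is a guess, while id, domains, policy, and expiresAt come from the field list above.

```javascript
// Lists every domain currently blocked, with the source it belongs to.
function blockedDomains(sources) {
  const rows = [];
  for (const source of sources) {
    for (const d of source.domains || []) {
      if (d.policy === "blocked") {
        // `d.domain` is a hypothetical field name for the website's hostname.
        rows.push({ sourceId: source.id, domain: d.domain, expiresAt: d.expiresAt });
      }
    }
  }
  return rows;
}
```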
Article field notes
- pub_date is normalized to ISO-8601 when parseable; null otherwise.
- pub_date_effective is COALESCE(pub_date, ingested_at), used for sorting.
- ingested_at is the server-side insert timestamp.
- normalized_title is stored for deduplication and indexing.
- source format is <feed_type>:<label> for GDELT and RSS (e.g. gdelt:Bloomberg Markets, rss:TechCrunch), or just the source name for other feeds (alphavantage, edgar, finnhub).
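These field rules are easy to mirror on the client. The sketch below reproduces the COALESCE fallback and splits a source string on its first colon; the function names are hypothetical, and treating a colon-free source as a bare feed name is an assumption based on the examples above.

```javascript
// Mirrors COALESCE(pub_date, ingested_at).
function pubDateEffective(article) {
  return article.pub_date ?? article.ingested_at;
}

// Sorts a copy of the list newest-first, like order=newest.
function sortNewestFirst(articles) {
  return [...articles].sort(
    (a, b) => Date.parse(pubDateEffective(b)) - Date.parse(pubDateEffective(a))
  );
}

// Splits "<feed_type>:<label>" on the first colon; bare names have no label.
function parseSource(source) {
  const idx = source.indexOf(":");
  if (idx === -1) return { feedType: source, label: null };
  return { feedType: source.slice(0, idx), label: source.slice(idx + 1) };
}
```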
Intelligence layer
A second process (intelligence/index.js) runs alongside the archive server and builds structured knowledge about tracked companies from ingested events.
npm run intelligence
In Docker it runs as a separate service (intelligence) sharing the same image and data volume.
How it works
- Queue feeder: continuously scans archive.sqlite for articles that have content, an embedding, and an event assignment. Inserts them into article_queue in intelligence.sqlite.
- Augor worker: pulls one pending article at a time from the queue. The article is a trigger; the unit of work is the event it belongs to. Fetches all articles in the event, matches them against tracked company embeddings via cosine similarity, calls the LLM once per matched company, writes structured knowledge and predictions, then marks all sibling articles in the event as processed.
- Consolidation worker: slow loop (default 60s). For each tracked company, reads all event_knowledge rows, builds a flat list of claims, and calls the LLM to normalize and deduplicate them into canonical grouped facts stored in company_facts. Preserves first_seen_at across cycles. Prunes facts that have only been confirmed once and haven't been seen in 90 days.
- Graph worker: slow loop (default 90s). Reads all company_facts rows of type relationship, parses the claim format, resolves whether the target entity is a tracked company, and upserts edges into company_relationships. Inserts reciprocal edges automatically (supplier ↔ customer, etc.) when both endpoints are tracked.
- Signal worker: slow loop (default 120s). Picks the tracked company with the oldest (or missing) signal that has at least 3 recent predictions within a 90-day window. Calls the LLM with the company's facts, predictions, and relationships to produce a structured trade signal (buy / sell / hold / hold_monitor) with confidence, timeframe, risk level, risk factors, and key drivers.
- Column migrations: run on startup to safely add new columns to existing databases without data loss.
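The consolidation worker's pruning rule (confirmed only once and not seen in 90 days) can be expressed as a single predicate. This is a sketch under assumptions: confirmation_count is a documented column, but last_seen_at is a hypothetical field name for "when the fact was last seen".

```javascript
const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000;

// True when a fact has a single confirmation and has gone stale.
function shouldPruneFact(fact, now = Date.now()) {
  const lastSeen = Date.parse(fact.last_seen_at); // hypothetical column name
  return fact.confirmation_count <= 1 && now - lastSeen > NINETY_DAYS_MS;
}
```

Facts with two or more confirmations are kept regardless of age under this reading of the rule.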
Output tables (intelligence.sqlite)
| Table | Contents |
|---|---|
| article_queue | Per-article processing status (pending / processed / skipped) |
| tracked_companies | Companies to watch, with names, tickers, and aliases |
| company_embeddings | Pre-generated embeddings for each company (generated on startup via OpenRouter) |
| event_knowledge | Extracted relationships, themes, and factors per event+company |
| event_predictions | Forward-looking predictions (market share, stock price, competitive position) with event_date from the source articles |
| company_facts | Deduplicated, canonical facts per company accumulated across all events. Each fact has a confirmation_count and a confidence tier (low / medium / high / very_high) |
| company_relationships | Cross-company relationship graph derived from company_facts. Includes reciprocal edges and confirmation_count |
| trade_signals | Generated investment signals per company: signal type, confidence, timeframe, risk level, risk factors, summary, and key drivers |
| worker_events | Lifecycle timestamps for each worker iteration. Pruned hourly to stay bounded. Used to compute per-worker processing rates in the admin panel |
| cursors | Key-value state store. Currently used by the queue feeder to persist its last-processed article ID across restarts |
Company matching
Uses cosine similarity between company embeddings and article embeddings stored in archive.sqlite. A company is considered relevant to an event if any article in the event has similarity ≥ config.intelligence.similarityThreshold (default 0.35). Both use the same OpenRouter embedding model (openRouter.embeddingModel).
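The matching rule can be sketched in a few lines. The function names are hypothetical and embeddings are assumed to be plain numeric arrays of equal length; the default threshold of 0.35 and the "any article in the event" rule come from the paragraph above.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A company is relevant if any article in the event meets the threshold.
function isCompanyRelevant(companyEmbedding, articleEmbeddings, threshold = 0.35) {
  return articleEmbeddings.some((e) => cosineSimilarity(companyEmbedding, e) >= threshold);
}
```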
LLM
Uses openRouter.llmModel via the OpenRouter API. One call per matched company per event. Output is structured JSON — relationships, themes, factors, and predictions.
Config keys (in config.json)
| Key | Purpose |
|---|---|
| duriin_db | Path to archive.sqlite (relative to config file, or absolute) |
| intelligence_db | Path to intelligence.sqlite |
| openRouter.llmModel | Chat model used for extraction |
| openRouter.embeddingModel | Embedding model (shared with archive server) |
| intelligence.similarityThreshold | Cosine similarity cutoff for company matching (default 0.35) |
| workers.augorLoopDelayMs | Delay between augor iterations when queue is empty (default 1500) |
| workers.queueFeederBatchSize | Articles pulled per feeder batch (default 100) |
| workers.consolidationLoopDelayMs | Delay between consolidation cycles (default 60000) |
| workers.graphWorkerLoopDelayMs | Delay between graph worker cycles (default 90000) |
| workers.signalLoopDelayMs | Delay between signal generation cycles (default 120000) |
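One way to read the defaults in this table is as a merge over whatever config.json provides. The sketch below is illustrative only: the helper and constant names are hypothetical and it does not claim to match how the intelligence process actually loads its config. The key names and default values are taken from the table.

```javascript
// Defaults as documented in the config table.
const INTELLIGENCE_DEFAULTS = {
  intelligence: { similarityThreshold: 0.35 },
  workers: {
    augorLoopDelayMs: 1500,
    queueFeederBatchSize: 100,
    consolidationLoopDelayMs: 60000,
    graphWorkerLoopDelayMs: 90000,
    signalLoopDelayMs: 120000,
  },
};

// Returns a copy of `config` with any missing intelligence/worker keys filled in.
function withIntelligenceDefaults(config) {
  return {
    ...config,
    intelligence: { ...INTELLIGENCE_DEFAULTS.intelligence, ...(config.intelligence || {}) },
    workers: { ...INTELLIGENCE_DEFAULTS.workers, ...(config.workers || {}) },
  };
}
```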
Admin panel
The intelligence data is visible in the admin panel (/admin) under the Intelligence tab. The tab has four views:
- Knowledge: raw extracted relationships, themes, and factors per event+company. Filterable by company and type, sortable by ingestion order or event date.
- Predictions: forward-looking LLM predictions per event+company, with direction, magnitude, timeframe, and rationale.
- Signals: generated trade signals per company showing signal type (buy / sell / hold / hold_monitor), confidence, timeframe, risk level, risk factors, key drivers, and a generated-at timestamp.
- Graph: interactive D3 force-directed network diagram of the cross-company relationship graph. Nodes are draggable and zoomable. Multiple relationship types between the same pair of companies are merged into a single edge with a combined label and max confirmation count. Edge thickness reflects confirmation count. Hover an edge to see the relationship type and count. Click a tracked company node to see its top facts in a sidebar. Toggle untracked entities on/off with the checkbox in the legend. Use the Expand button to fill the full viewport.
A SQL tab allows raw queries against either database. Multiple statements separated by ; are supported — each runs independently and results render as separate blocks.
Notes
- SQLite archive defaults to ./archive.sqlite.
- Deduplication is enforced on url.
- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
- Embeddings use OpenRouter and are indexed in sqlite-vec for ANN search.
- Query embeddings are cached in SQLite to avoid redundant API calls.
- SEC requests use the User-Agent from config.json.
- Event clustering groups articles by embedding similarity (cosine distance ≤ config.clustering.distanceThreshold, default 0.25) and time proximity (within config.clustering.windowHours, default 72). Articles outside the time window are never grouped together even if embeddings are close.
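The two clustering conditions can be illustrated with a pairwise check. This is a simplification under assumptions: the real worker may compare an article against an event centroid or representative rather than against individual articles, and the function and field names here are hypothetical. The thresholds mirror the documented defaults.

```javascript
// Cosine distance = 1 - cosine similarity.
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // zero vectors never match
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Both conditions must hold: close in time AND close in embedding space.
function canShareEvent(a, b, { distanceThreshold = 0.25, windowHours = 72 } = {}) {
  const hoursApart = Math.abs(Date.parse(a.pub_date_effective) - Date.parse(b.pub_date_effective)) / 3600000;
  if (hoursApart > windowHours) return false; // time window is a hard cutoff
  return cosineDistance(a.embedding, b.embedding) <= distanceThreshold;
}
```

Checking the time window first means distant articles are rejected without computing any vector distance.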