duriin_api
Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
Setup
- Install dependencies:
npm install
- Edit config.json with your API keys, tickers, RSS feeds, Google News settings, and schedules.
- Start the server:
npm start
The server listens on the host and port defined in config.json.
How the data pipeline works
On startup the server:
- Opens the SQLite database.
- Registers the article and status routes.
- Starts the HTTP server.
- Immediately runs all ingestion sources once.
- Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.
When a new article is inserted:
- the record is written immediately with title, description, url, source, and timestamps
- content and image start as null
- full article extraction runs asynchronously after insert
- vector embeddings are generated later, after title, description, and content are all available
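The lifecycle above can be sketched as a small readiness check. This is an illustrative sketch only, not the server's actual code: newArticleRecord and isReadyForEmbedding are hypothetical names, and only the field names come from the description above.

```javascript
// Hypothetical helper: the row as it looks immediately after insert.
function newArticleRecord({ title, description, url, source, pubDate = null }) {
  return {
    title,
    description,
    url,
    source,
    pub_date: pubDate,
    ingested_at: new Date().toISOString(),
    content: null, // filled in later by asynchronous extraction
    image: null,   // filled in later by asynchronous extraction
  };
}

// Hypothetical check: embeddings wait until all three text fields exist.
function isReadyForEmbedding(article) {
  const hasText = (v) => typeof v === "string" && v.trim().length > 0;
  return hasText(article.title) && hasText(article.description) && hasText(article.content);
}

const row = newArticleRecord({
  title: "Q4 earnings beat",
  description: "Summary",
  url: "https://example.com/a",
  source: "rss",
});
console.log(isReadyForEmbedding(row)); // false: content is still null
row.content = "Full article text";
console.log(isReadyForEmbedding(row)); // true
```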
API overview
All exposed endpoints are GET endpoints.
GET /
Simple health check.
Response
{ "ok": true }
Use this to confirm the server is running, not to inspect ingestion state.
GET /articles
Returns articles from the articles table. Only articles that are considered usable are exposed — meaning they have non-empty content, a stored embedding, and are not index/category pages. Behavior changes based on the query params you send.
Query params
keyword
Plain keyword search.
- matches title, description, and content
- uses SQL LIKE
- works like substring matching, not semantic search
- best when you want literal words or phrases to appear in the article text
Example:
GET /articles?keyword=earnings
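As a rough illustration of substring matching with SQL LIKE, the sketch below builds a parameterized clause over the three searched columns. The helper name buildKeywordClause and the decision to escape the % and _ wildcards are assumptions; the server may build its query differently.

```javascript
// Hypothetical helper: build a parameterized LIKE clause for keyword search.
function buildKeywordClause(keyword) {
  // Escape LIKE wildcards so the keyword is matched literally. This is an
  // assumption; the actual server may pass the keyword through unescaped.
  const escaped = keyword.replace(/[\\%_]/g, (ch) => "\\" + ch);
  const pattern = `%${escaped}%`;
  const columns = ["title", "description", "content"];
  const sql = columns.map((c) => `${c} LIKE ? ESCAPE '\\'`).join(" OR ");
  // One bound parameter per column, all carrying the same pattern.
  return { sql: `(${sql})`, params: columns.map(() => pattern) };
}

const clause = buildKeywordClause("earnings");
console.log(clause.sql);
console.log(clause.params); // three copies of "%earnings%"
```

Binding the pattern as a parameter, rather than concatenating it into the SQL string, keeps user input out of the statement text.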
source
Exact match on the stored source field.
Example:
GET /articles?source=rss
from
Only returns rows where pub_date >= from.
Example:
GET /articles?from=2025-01-01T00:00:00.000Z
to
Only returns rows where pub_date <= to.
Example:
GET /articles?to=2025-01-31T23:59:59.999Z
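Because from and to are compared against stored timestamps, generating both bounds with Date.prototype.toISOString avoids format mismatches. A minimal sketch; monthRange is a hypothetical client-side helper, not part of the API.

```javascript
// Hypothetical helper: inclusive UTC bounds for a calendar month.
function monthRange(year, month /* 1-12 */) {
  const from = new Date(Date.UTC(year, month - 1, 1));
  // One millisecond before the first instant of the following month.
  const to = new Date(Date.UTC(year, month, 1) - 1);
  return { from: from.toISOString(), to: to.toISOString() };
}

const range = monthRange(2025, 1);
console.log(range.from); // 2025-01-01T00:00:00.000Z
console.log(range.to);   // 2025-01-31T23:59:59.999Z
```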
limit
Number of rows to return.
- default: 20
- max: 100
Example:
GET /articles?limit=10
offset
Pagination offset.
- default: 0
Example:
GET /articles?limit=10&offset=20
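The documented defaults and cap could be applied along these lines. This is a sketch under assumptions: normalizePaging is a hypothetical name, and whether the server clamps an out-of-range value (as here) or rejects it is not stated in this README.

```javascript
// Hypothetical normalization of the limit and offset query params.
function normalizePaging(query) {
  const limit = Number.parseInt(query.limit, 10);
  const offset = Number.parseInt(query.offset, 10);
  return {
    // Default 20, capped at 100 as documented; the floor of 1 is an assumption.
    limit: Number.isNaN(limit) ? 20 : Math.min(Math.max(limit, 1), 100),
    // Default 0; clamping negative offsets to 0 is an assumption.
    offset: Number.isNaN(offset) ? 0 : Math.max(offset, 0),
  };
}

console.log(normalizePaging({}));                             // { limit: 20, offset: 0 }
console.log(normalizePaging({ limit: "500", offset: "20" })); // { limit: 100, offset: 20 }
```

With a fixed limit, page n (counting from 0) corresponds to offset = n * limit, so the example above with limit=10 and offset=20 is the third page.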
similar_to_article
Runs vector similarity search instead of normal list mode.
- value must be an existing article ID
- the server looks up that article's embedding
- nearest-neighbor search runs in sqlite-vec
- the source article is excluded from the result set
- each result includes a distance field
- lower distance means more similar
- returns 404 if the article has no stored embedding
Example:
GET /articles?similar_to_article=123&limit=5
Not found response:
{ "error": "Embedding not found for article" }
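To make the behavior concrete, here is a brute-force, in-memory stand-in for the lookup described above. It is not how sqlite-vec works internally, and the Euclidean metric is an assumption, since this README does not say which distance the index uses. The names l2Distance and similarTo are hypothetical.

```javascript
// Euclidean distance between two equal-length vectors.
function l2Distance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Hypothetical in-memory stand-in for the nearest-neighbor lookup.
// `embeddings` is a Map from article ID to vector.
function similarTo(articleId, embeddings, limit = 5) {
  const query = embeddings.get(articleId);
  if (!query) {
    return { status: 404, body: { error: "Embedding not found for article" } };
  }
  const results = [...embeddings]
    .filter(([id]) => id !== articleId) // exclude the source article
    .map(([id, vec]) => ({ id, distance: l2Distance(query, vec) }))
    .sort((x, y) => x.distance - y.distance) // lower distance first
    .slice(0, limit);
  return { status: 200, body: results };
}

const vectors = new Map([[1, [0, 0]], [2, [3, 4]], [3, [1, 0]]]);
console.log(similarTo(1, vectors).body.map((r) => r.id)); // [ 3, 2 ]
console.log(similarTo(99, vectors).status);               // 404
```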
semantic
Semantic search by meaning, not exact wording.
- use this when you want conceptually related results
- unlike keyword, the words do not need to appear literally in the article text
- the query text is normalized before embedding
- query embeddings are cached in SQLite
- on cache miss, the server requests an embedding from OpenRouter
- nearest article matches are returned from the embedding index
- each result includes a distance field
- lower distance means a closer semantic match
- returns 400 if semantic is empty
Example:
GET /articles?semantic=ai%20chip%20demand&limit=10
Bad request response:
{ "error": "Semantic query must not be empty" }
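The normalize-then-cache flow described above might look roughly like this. Everything here is a sketch under stated assumptions: the exact normalization rules are not documented (trim, lowercase, and whitespace collapsing are guesses), a Map stands in for the SQLite cache, and the injected embed function stands in for the OpenRouter request, which would be asynchronous in practice.

```javascript
// Hypothetical normalization: trim, lowercase, collapse runs of whitespace.
function normalizeQuery(text) {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Hypothetical cache wrapper. `cache` stands in for the SQLite cache table and
// `embed` stands in for the embedding provider; both are injected.
function getQueryEmbedding(text, cache, embed) {
  const key = normalizeQuery(text);
  if (key === "") throw new Error("Semantic query must not be empty");
  if (cache.has(key)) return cache.get(key); // cache hit
  const vector = embed(key);                 // cache miss: ask the provider
  cache.set(key, vector);
  return vector;
}

let providerCalls = 0;
const fakeEmbed = (q) => { providerCalls += 1; return [q.length]; };
const queryCache = new Map();
getQueryEmbedding("AI  chip demand", queryCache, fakeEmbed);
getQueryEmbedding(" ai chip demand ", queryCache, fakeEmbed);
console.log(providerCalls); // 1: the second call hit the cache
```

Normalizing before the cache lookup is what lets differently typed but equivalent queries share one cached embedding.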
include_embedding
Explicitly rejected on /articles.
Response:
{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
General behavior
- If semantic is present, semantic search is used.
- Else if similar_to_article is present, similarity search is used.
- Otherwise normal list/search mode is used.
- keyword is literal keyword matching.
- semantic is semantic matching by meaning.
- Normal list/search results are ordered by COALESCE(pub_date, ingested_at) DESC, id DESC.
- from and to are compared against stored publication timestamps, so ISO-8601 values are the safest input.
- source must match the stored source name exactly.
- keyword is substring matching, not full-text search.
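The precedence between the three modes can be written as a small dispatcher. selectMode is a hypothetical name used for illustration; the real route handler may be structured differently.

```javascript
// Hypothetical dispatcher mirroring the documented precedence:
// semantic first, then similar_to_article, then plain list/search.
function selectMode(query) {
  if (query.semantic !== undefined) return "semantic";
  if (query.similar_to_article !== undefined) return "similar";
  return "list";
}

console.log(selectMode({ semantic: "ai chips", similar_to_article: "123" })); // semantic
console.log(selectMode({ similar_to_article: "123", limit: "5" }));           // similar
console.log(selectMode({ keyword: "earnings" }));                             // list
```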
Normal list/search response shape
[
  {
    "id": 123,
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
]
Similarity/semantic search response shape
[
  {
    "id": 456,
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss",
    "pub_date": "2025-01-02T09:00:00.000Z",
    "ingested_at": "2025-01-02T09:00:10.000Z",
    "distance": 0.1234
  }
]
Combined example
GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
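When building such URLs from code, URLSearchParams takes care of percent-encoding, which matters for timestamps and for multi-word semantic queries. A minimal client-side sketch; articlesUrl is a hypothetical helper and http://localhost:3000 is a placeholder, since the real host and port come from config.json.

```javascript
// Hypothetical client helper; the base URL passed in is a placeholder.
function articlesUrl(baseUrl, params) {
  const url = new URL("/articles", baseUrl);
  for (const [key, value] of Object.entries(params)) {
    // Skip unset params so they fall back to the server defaults.
    if (value !== undefined && value !== null) {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

console.log(
  articlesUrl("http://localhost:3000", {
    keyword: "earnings",
    source: "rss",
    from: "2025-01-01T00:00:00.000Z",
    limit: 10,
    offset: 0,
  })
);
```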
GET /articles/:id
Returns one article by numeric ID.
Behavior
- Looks up the article directly in SQLite.
- Same usability filter as the list endpoint: returns 404 if the article exists but is not usable.
- Returns the same article fields as normal /articles list mode.
- Does not return embedding data.
- Returns 404 if the ID does not exist.
Example
GET /articles/123
Not found response
{ "error": "Article not found" }
GET /status
Returns ingestion and archive summary information.
Response fields
- total: total number of rows in articles across all sources
- usable: articles that have content, an embedding, and are not index pages
- lastIngestionBySource: in-memory timestamps of the last successful batch run per source
- bySource: per-source breakdown, each with total and usable counts
Important detail
lastIngestionBySource is kept in memory, so it resets when the process restarts.
Example response
{
  "total": 10234,
  "usable": 8700,
  "lastIngestionBySource": {
    "rss": "2025-01-02T10:00:00.000Z",
    "gdelt": "2025-01-02T10:05:00.000Z"
  },
  "bySource": {
    "alphavantage": { "total": 120, "usable": 98 },
    "edgar": { "total": 88, "usable": 70 },
    "finnhub": { "total": 400, "usable": 360 },
    "gdelt": { "total": 2100, "usable": 1800 },
    "rss": { "total": 7526, "usable": 6372 }
  }
}
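A consumer of this endpoint might want the usable share per source, for example to spot a source whose extraction is failing. The helper below is a hypothetical client-side sketch that relies only on the bySource shape shown above.

```javascript
// Hypothetical client-side helper: usable share per source from a /status body.
function usableShareBySource(status) {
  const shares = {};
  for (const [source, counts] of Object.entries(status.bySource)) {
    // Guard against division by zero for sources with no rows yet.
    shares[source] = counts.total === 0 ? 0 : counts.usable / counts.total;
  }
  return shares;
}

const sample = {
  bySource: {
    finnhub: { total: 400, usable: 360 },
    gdelt: { total: 2100, usable: 1800 },
  },
};
console.log(usableShareBySource(sample).finnhub); // 0.9
```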
Article field notes
- image stores the extracted main image as ultra-compressed base64 WebP.
- normalized_title is stored for matching and indexing.
- source may be a shared source like rss, googlenews, gdelt, edgar, alphavantage, or finnhub.
- pub_date is normalized to ISO-8601 when it can be parsed.
- ingested_at is the insert timestamp set by the server.
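To display the image field in a browser, the base64 payload can be wrapped in a data URI. This sketch assumes the field holds raw base64 without a data: prefix, which this README does not confirm, so the helper passes already-prefixed values through unchanged. imageToDataUri is a hypothetical name.

```javascript
// Hypothetical helper. Assumes `image` is raw base64 WebP data; values that
// already carry a "data:" prefix are returned as they are.
function imageToDataUri(image) {
  if (!image) return null; // image stays null until extraction finishes
  return image.startsWith("data:") ? image : `data:image/webp;base64,${image}`;
}

console.log(imageToDataUri(null));       // null
console.log(imageToDataUri("UklGRg==")); // data:image/webp;base64,UklGRg==
```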
Notes
- SQLite archive file defaults to ./archive.sqlite.
- Deduplication is enforced on url; normalized titles are stored and indexed for matching but are not unique.
- googleNews accepts queries, topics, language, and country, and resolves Google redirect URLs to publisher URLs before ingestion.
- Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content.
- Embeddings are generated asynchronously with OpenRouter perplexity/pplx-embed-v1-0.6b and indexed in sqlite-vec for similarity search.
- Semantic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss.
- SEC requests use the configured User-Agent.
- Duplicate URLs are skipped rather than inserted again.
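The skip-on-duplicate rule can be illustrated with an in-memory stand-in. In the real archive the uniqueness check lives in SQLite; the Set and the insertIfNew helper below are hypothetical and only model the observable behavior: the same URL is never stored twice, while repeated titles are allowed.

```javascript
// In-memory stand-in for a uniqueness check on `url`.
function insertIfNew(seenUrls, article) {
  if (seenUrls.has(article.url)) return false; // duplicate URL: skip
  seenUrls.add(article.url);
  return true;
}

const seen = new Set();
console.log(insertIfNew(seen, { url: "https://example.com/a", title: "Same title" })); // true
console.log(insertIfNew(seen, { url: "https://example.com/a", title: "Same title" })); // false
console.log(insertIfNew(seen, { url: "https://example.com/b", title: "Same title" })); // true
```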