duriin_api
Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, GDELT, and configured publisher crawlers into a local SQLite archive.
Setup
- Install dependencies:
npm install
- Edit config.json with your API keys, tickers, RSS feeds, crawler settings, and schedules.
- Start the server:
npm start
The server listens on the host and port defined in config.json.
How the data pipeline works
On startup the server:
- Opens the SQLite database.
- Registers the article and status routes.
- Starts the HTTP server.
- Immediately runs all ingestion sources once.
- Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.
When a new article is inserted:
- the record is written immediately with title, description, url, source, and timestamps
- content and image start as null
- full article extraction runs asynchronously after insert
- vector embeddings are generated later, after title, description, and content are all available
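The lifecycle above can be sketched as a small readiness check. This is a minimal illustration, not the server's actual code: the helper name isReadyForEmbedding and the choice to treat blank strings as missing are assumptions; only the field names come from the description above.

```javascript
// Hypothetical helper: decides whether an article row has everything
// it needs before an embedding can be generated.
function isReadyForEmbedding(article) {
  // Assumption: null, undefined, and blank strings all count as "not available".
  const present = (value) => typeof value === "string" && value.trim() !== "";
  return (
    present(article.title) &&
    present(article.description) &&
    present(article.content)
  );
}

// A freshly inserted row: content and image start as null.
const fresh = {
  title: "Q4 results",
  description: "Summary of the quarter",
  url: "https://example.com/a",
  source: "rss",
  content: null,
  image: null,
};

// The same row after asynchronous extraction has filled in the body.
const extracted = { ...fresh, content: "Full article text" };

console.log(isReadyForEmbedding(fresh)); // false
console.log(isReadyForEmbedding(extracted)); // true
```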
API overview
All exposed endpoints are GET endpoints.
GET /
Simple health check.
Response
{ "ok": true }
Use this to confirm the server is running, not to inspect ingestion state.
GET /articles
Returns articles from the articles table. Behavior changes based on the query params you send.
Query params
keyword
Plain keyword search.
- matches title, description, and content
- uses SQL LIKE
- works like substring matching, not semantic search
- best when you want literal words or phrases to appear in the article text
Example:
GET /articles?keyword=earnings
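To make the substring behavior concrete, here is a rough sketch of how a keyword could be turned into a parameterized LIKE clause over the three columns. The helper name buildKeywordClause and the wildcard escaping are assumptions for illustration; this README only states that SQL LIKE is applied to title, description, and content.

```javascript
// Hypothetical helper: builds a parameterized LIKE clause for a keyword.
function buildKeywordClause(keyword) {
  // Escape backslash, "%" and "_" so they match literally inside LIKE.
  const escaped = keyword.replace(/[\\%_]/g, (ch) => "\\" + ch);
  // Wrapping in "%" on both sides gives substring matching.
  const pattern = `%${escaped}%`;
  return {
    sql:
      "(title LIKE ? ESCAPE '\\' OR description LIKE ? ESCAPE '\\' " +
      "OR content LIKE ? ESCAPE '\\')",
    params: [pattern, pattern, pattern],
  };
}

const clause = buildKeywordClause("earnings");
console.log(clause.sql);
console.log(clause.params); // [ '%earnings%', '%earnings%', '%earnings%' ]
```

SQLite's LIKE is case-insensitive for ASCII characters by default, so under that default a keyword of earnings would also match Earnings.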
source
Exact match on the stored source field.
Example:
GET /articles?source=rss
from
Only returns rows where pub_date >= from.
Example:
GET /articles?from=2025-01-01T00:00:00.000Z
to
Only returns rows where pub_date <= to.
Example:
GET /articles?to=2025-01-31T23:59:59.999Z
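Since from and to are compared against stored timestamps, it helps to generate them programmatically. The sketch below builds an inclusive UTC month range in the same ISO-8601 form as the examples above; the helper name monthRange is hypothetical.

```javascript
// Hypothetical helper: from/to values covering one whole UTC month.
// month is 1-based (1 = January).
function monthRange(year, month) {
  // First instant of the month.
  const from = new Date(Date.UTC(year, month - 1, 1));
  // One millisecond before the first instant of the next month.
  const to = new Date(Date.UTC(year, month, 1) - 1);
  return { from: from.toISOString(), to: to.toISOString() };
}

const range = monthRange(2025, 1);
console.log(range.from); // 2025-01-01T00:00:00.000Z
console.log(range.to); // 2025-01-31T23:59:59.999Z
```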
limit
Number of rows to return.
- default: 20
- max: 100
Example:
GET /articles?limit=10
offset
Pagination offset.
- default: 0
Example:
GET /articles?limit=10&offset=20
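For page-based clients, limit and offset can be derived from a page number. This is a client-side sketch only; the helper name pageParams is hypothetical, and capping the page size at 100 locally simply mirrors the documented maximum rather than describing how the server treats larger values.

```javascript
const DEFAULT_LIMIT = 20; // documented default for limit
const MAX_LIMIT = 100; // documented maximum for limit

// Hypothetical helper: converts a 1-based page number into limit/offset.
function pageParams(page, pageSize = DEFAULT_LIMIT) {
  const limit = Math.min(Math.max(1, pageSize), MAX_LIMIT);
  const offset = (Math.max(1, page) - 1) * limit;
  return { limit, offset };
}

console.log(pageParams(3, 10)); // { limit: 10, offset: 20 }
console.log(pageParams(1)); // { limit: 20, offset: 0 }
```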
similar_to_article
Runs vector similarity search instead of normal list mode.
- value must be an existing article ID
- the server looks up that article's embedding
- nearest-neighbor search runs in sqlite-vec
- the source article is excluded from the result set
- each result includes a distance field
- lower distance means more similar
- returns 404 if the article has no stored embedding
Example:
GET /articles?similar_to_article=123&limit=5
Not found response:
{ "error": "Embedding not found for article" }
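The behavior described above can be illustrated with a brute-force nearest-neighbor search in plain JavaScript. This stands in for sqlite-vec purely for explanation: the in-memory Map, the two-dimensional toy vectors, and the Euclidean metric are all assumptions, and the real index may use a different distance function.

```javascript
// Toy stand-in for the embedding index: article ID -> vector.
const embeddings = new Map([
  [1, [1, 0]],
  [2, [0.9, 0.1]],
  [3, [0, 1]],
]);

// Euclidean distance between two vectors of equal length.
function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// Returns the nearest neighbors of an article, or null when the article
// has no stored embedding (the API answers 404 in that case).
function similarTo(articleId, limit) {
  const query = embeddings.get(articleId);
  if (!query) return null;
  return [...embeddings]
    .filter(([id]) => id !== articleId) // exclude the source article
    .map(([id, vector]) => ({ id, distance: euclidean(query, vector) }))
    .sort((a, b) => a.distance - b.distance) // lower distance first
    .slice(0, limit);
}

console.log(similarTo(1, 5).map((row) => row.id)); // [ 2, 3 ]
console.log(similarTo(99, 5)); // null
```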
semantic
Semantic search by meaning, not exact wording.
- use this when you want conceptually related results
- unlike keyword, the words do not need to appear literally in the article text
- the query text is normalized before embedding
- query embeddings are cached in SQLite
- on cache miss, the server requests an embedding from OpenRouter
- nearest article matches are returned from the embedding index
- each result includes a distance field
- lower distance means a closer semantic match
- returns 400 if semantic is empty
Example:
GET /articles?semantic=ai%20chip%20demand&limit=10
Bad request response:
{ "error": "Semantic query must not be empty" }
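The normalize-then-cache flow can be sketched as follows. Everything here is illustrative: the normalization rules (trim, lowercase, collapse whitespace) are a guess at what "normalized" means, a Map stands in for the SQLite cache table, and requestEmbedding is a stub rather than a real OpenRouter call.

```javascript
// Hypothetical normalization: trim, lowercase, collapse runs of whitespace.
function normalizeQuery(text) {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// In-memory Map standing in for the SQLite query-embedding cache.
const cache = new Map();
let providerCalls = 0;

// Stub for the remote embedding request; resolves to a fake vector.
function requestEmbedding(text) {
  providerCalls += 1;
  return Promise.resolve([text.length, 0, 0]);
}

// Returns a promise for the embedding, calling the provider only on a miss.
// Caching the promise means concurrent identical queries share one request.
function embedQuery(rawText) {
  const key = normalizeQuery(rawText);
  if (!cache.has(key)) cache.set(key, requestEmbedding(key));
  return cache.get(key);
}

embedQuery("AI  chip demand");
embedQuery("  ai chip demand ");
console.log(providerCalls); // 1, because both queries normalize to the same key
```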
include_embedding
Explicitly rejected on /articles.
Response:
{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
General behavior
- If semantic is present, semantic search is used.
- Else if similar_to_article is present, similarity search is used.
- Otherwise normal list/search mode is used.
- keyword is literal keyword matching. semantic is semantic matching by meaning.
- Normal list/search results are ordered by COALESCE(pub_date, ingested_at) DESC, id DESC.
- from and to are compared against stored publication timestamps, so ISO-8601 values are the safest input.
- source must match the stored source name exactly.
- keyword is substring matching, not full-text search.
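The mode precedence and the default ordering can be expressed in a few lines. This is a sketch under assumptions: the names selectMode and byNewest are hypothetical, and the comparator relies on timestamps being stored as UTC ISO-8601 strings, which sort chronologically when compared as text.

```javascript
// Hypothetical helper: picks the search mode from the query parameters.
function selectMode(query) {
  if (query.semantic !== undefined) return "semantic";
  if (query.similar_to_article !== undefined) return "similar";
  return "list";
}

// Comparator mirroring COALESCE(pub_date, ingested_at) DESC, id DESC.
function byNewest(a, b) {
  const keyA = a.pub_date ?? a.ingested_at;
  const keyB = b.pub_date ?? b.ingested_at;
  if (keyA !== keyB) return keyA < keyB ? 1 : -1;
  return b.id - a.id;
}

const rows = [
  { id: 1, pub_date: "2025-01-01T12:00:00.000Z", ingested_at: "2025-01-01T12:01:00.000Z" },
  { id: 2, pub_date: null, ingested_at: "2025-01-02T08:00:00.000Z" },
  { id: 3, pub_date: "2025-01-01T12:00:00.000Z", ingested_at: "2025-01-01T12:05:00.000Z" },
];

console.log(selectMode({ semantic: "ai chips", similar_to_article: "123" })); // semantic
console.log(rows.sort(byNewest).map((row) => row.id)); // [ 2, 3, 1 ]
```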
Normal list/search response shape
[
  {
    "id": 123,
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
]
Similarity/semantic search response shape
[
  {
    "id": 456,
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss",
    "pub_date": "2025-01-02T09:00:00.000Z",
    "ingested_at": "2025-01-02T09:00:10.000Z",
    "distance": 0.1234
  }
]
Combined example
GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
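When building such URLs in code, the standard URL and URLSearchParams classes handle encoding of spaces and colons. The sketch below is client-side only; the helper name articlesUrl and the base address http://localhost:3000 are assumptions, since the real host and port come from config.json.

```javascript
// Hypothetical helper: builds an /articles URL from a params object.
function articlesUrl(baseUrl, params) {
  const url = new URL("/articles", baseUrl);
  for (const [key, value] of Object.entries(params)) {
    // Skip unset parameters so server defaults apply.
    if (value !== undefined && value !== null) {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

const combined = articlesUrl("http://localhost:3000", {
  keyword: "earnings",
  source: "rss",
  from: "2025-01-01T00:00:00.000Z",
  limit: 10,
  offset: 0,
});
console.log(combined);
// http://localhost:3000/articles?keyword=earnings&source=rss&from=2025-01-01T00%3A00%3A00.000Z&limit=10&offset=0
```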
GET /articles/:id
Returns one article by numeric ID.
Behavior
- Looks up the article directly in SQLite.
- Returns the same article fields as normal /articles list mode.
- Does not return embedding data.
- Returns 404 if the ID does not exist.
Example
GET /articles/123
Not found response
{ "error": "Article not found" }
GET /status
Returns ingestion and archive summary information.
Response fields
- totalArticles: total number of rows in articles
- countsBySource: article counts grouped by source name
- lastIngestionBySource: in-memory timestamps of the last successful batch run per source
- contentFetchCoverage.total: total article count used for coverage math
- contentFetchCoverage.withContent: rows whose content is present and non-empty
- contentFetchCoverage.withImage: rows whose image is present and non-empty
- contentFetchCoverage.withEmbedding: rows that have an embedding in article_embeddings
- contentFetchCoverage.contentRatio: withContent / total
- contentFetchCoverage.imageRatio: withImage / total
- contentFetchCoverage.embeddingRatio: withEmbedding / total
Important detail
lastIngestionBySource is kept in memory, so it resets when the process restarts.
Example response
{
  "totalArticles": 10234,
  "countsBySource": {
    "alphavantage": 120,
    "edgar": 88,
    "finnhub": 400,
    "gdelt": 2100,
    "rss": 7526
  },
  "lastIngestionBySource": {
    "rss": "2025-01-02T10:00:00.000Z",
    "gdelt": "2025-01-02T10:05:00.000Z"
  },
  "contentFetchCoverage": {
    "withContent": 9000,
    "withImage": 6500,
    "withEmbedding": 8700,
    "total": 10234,
    "contentRatio": 0.8794,
    "imageRatio": 0.6351,
    "embeddingRatio": 0.8501
  }
}
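The ratio fields follow directly from the counts. The sketch below recomputes them from the example numbers; rounding to four decimal places and returning 0 for an empty archive are assumptions chosen to match the example, not confirmed server behavior.

```javascript
// Hypothetical helper: derives the ratio fields from the raw counts.
function coverage(counts) {
  const ratio = (part) =>
    counts.total === 0 ? 0 : Number((part / counts.total).toFixed(4));
  return {
    ...counts,
    contentRatio: ratio(counts.withContent),
    imageRatio: ratio(counts.withImage),
    embeddingRatio: ratio(counts.withEmbedding),
  };
}

const result = coverage({
  withContent: 9000,
  withImage: 6500,
  withEmbedding: 8700,
  total: 10234,
});
console.log(result.contentRatio); // 0.8794
console.log(result.imageRatio); // 0.6351
console.log(result.embeddingRatio); // 0.8501
```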
Article field notes
- image stores the extracted main image as ultra-compressed base64 WebP.
- normalized_title is stored for matching and indexing.
- source may be a shared source like rss, gdelt, edgar, alphavantage, or finnhub, or a crawler-derived source name for a configured publisher.
- pub_date is normalized to ISO-8601 when it can be parsed.
- ingested_at is the insert timestamp set by the server.
Notes
- SQLite archive file defaults to ./archive.sqlite.
- Deduplication is enforced on url; normalized titles are stored and indexed for matching but are not unique.
- newsCrawler reuses rssFeeds as the publisher catalog, derives one crawler source per feed label, and supports disabledLabels plus per-label overrides for seeds and allowed hosts.
- Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content.
- Embeddings are generated asynchronously with OpenRouter perplexity/pplx-embed-v1-0.6b and indexed in sqlite-vec for similarity search.
- Semantic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss.
- SEC requests use the configured User-Agent.
- Duplicate URLs are skipped rather than inserted again.
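The URL-based deduplication rule can be illustrated with an in-memory stand-in for the archive. This is not the server's storage code: the class name ArchiveSketch is hypothetical, and a Map keyed by URL plays the role that a uniqueness constraint on url would play in SQLite.

```javascript
// In-memory stand-in for the articles table, keyed by URL.
class ArchiveSketch {
  constructor() {
    this.byUrl = new Map();
    this.nextId = 1;
  }

  // Inserts an article unless its URL is already present.
  // Returns the stored row, or null when the URL is a duplicate.
  insert(article) {
    if (this.byUrl.has(article.url)) return null; // duplicate URL: skipped
    const row = {
      id: this.nextId++,
      ...article,
      content: null, // filled in later by asynchronous extraction
      image: null,
      ingested_at: new Date().toISOString(),
    };
    this.byUrl.set(article.url, row);
    return row;
  }
}

const archive = new ArchiveSketch();
const first = archive.insert({ title: "Fed holds rates", url: "https://example.com/fed", source: "rss" });
const again = archive.insert({ title: "Fed holds rates", url: "https://example.com/fed", source: "gdelt" });
// Same title under a different URL is still accepted: titles are not unique.
const other = archive.insert({ title: "Fed holds rates", url: "https://example.org/fed", source: "gdelt" });

console.log(first.id, again, other.id); // 1 null 2
```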