src add Google News integration and enhance crawler capabilities 2026-04-18 14:05:29 +01:00
.dockerignore add Docker configuration and news crawler implementation 2026-04-16 22:54:27 +01:00
.gitignore add Docker configuration and news crawler implementation 2026-04-16 22:54:27 +01:00
config.json add Google News integration and enhance crawler capabilities 2026-04-18 14:05:29 +01:00
docker-compose.yml remove config module import from content.js and add rebuild-api.sh script for Docker management 2026-04-17 03:18:21 +01:00
Dockerfile add Google News integration and enhance crawler capabilities 2026-04-18 06:43:07 +01:00
Dockerfile.base add Google News integration and enhance crawler capabilities 2026-04-18 06:43:07 +01:00
gdelt-credentials.json add Google News integration and enhance crawler capabilities 2026-04-18 06:35:12 +01:00
package-lock.json add Google News integration and enhance crawler capabilities 2026-04-18 06:35:12 +01:00
package.json add Google News integration and enhance crawler capabilities 2026-04-18 06:35:12 +01:00
README.md add Google News integration and enhance crawler capabilities 2026-04-18 06:35:12 +01:00
rebuild-api.sh add Google News integration and enhance crawler capabilities 2026-04-18 06:43:07 +01:00
server.js add browser crawling capabilities and enhance configuration options 2026-04-17 16:53:18 +01:00
sources.json add Google News integration and enhance crawler capabilities 2026-04-18 06:35:12 +01:00

duriin_api

Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.

Setup

  1. Install dependencies:
    npm install
    
  2. Edit config.json with your API keys, tickers, RSS feeds, Google News settings, and schedules.
  3. Start the server:
    npm start
    

The server listens on the host and port defined in config.json.

How the data pipeline works

On startup the server:

  1. Opens the SQLite database.
  2. Registers the article and status routes.
  3. Starts the HTTP server.
  4. Immediately runs all ingestion sources once.
  5. Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.

When a new article is inserted:

  • the record is written immediately with title, description, url, source, and timestamps
  • content and image start as null
  • full article extraction runs asynchronously after insert
  • vector embeddings are generated later, after title, description, and content are all available
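The lifecycle above implies a simple readiness rule for embeddings. The following is a minimal sketch of that rule, not the server's actual code: `isNonEmptyText` and `readyForEmbedding` are hypothetical helper names, and treating whitespace-only strings as missing is an assumption.

```javascript
// Hypothetical helpers illustrating the "all three fields available" rule.
function isNonEmptyText(value) {
  return typeof value === 'string' && value.trim().length > 0;
}

function readyForEmbedding(article) {
  return ['title', 'description', 'content'].every((field) => isNonEmptyText(article[field]));
}

// A freshly inserted row: content and image start as null.
const fresh = { title: 'Q4 results', description: 'Earnings beat', content: null, image: null };
// The same row after asynchronous extraction has filled in the body.
const extracted = { ...fresh, content: 'Full article text' };

console.log(readyForEmbedding(fresh));     // false
console.log(readyForEmbedding(extracted)); // true
```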

API overview

All exposed endpoints are GET endpoints.

GET /

Simple health check.

Response

{ "ok": true }

Use this to confirm the server is running, not to inspect ingestion state.

GET /articles

Returns articles from the articles table. Behavior changes based on the query params you send.

Query params

keyword

Plain keyword search.

  • matches title, description, and content
  • uses SQL LIKE
  • works like substring matching, not semantic search
  • best when you want literal words or phrases to appear in the article text
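A keyword filter of this kind could be built roughly as follows. This is a sketch under assumptions: `escapeLike` and `buildKeywordClause` are hypothetical names, and escaping the LIKE wildcards is a choice made here that the real server may not make. Only the column names and the use of LIKE come from the description above.

```javascript
// Escape LIKE wildcards so the keyword is matched literally
// (an assumption; the real server may pass the keyword through unescaped).
function escapeLike(keyword) {
  return keyword.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

// Build a WHERE fragment that matches the keyword as a substring of
// title, description, or content, with bound parameters.
function buildKeywordClause(keyword) {
  const pattern = `%${escapeLike(keyword)}%`;
  return {
    sql: "(title LIKE ? ESCAPE '\\' OR description LIKE ? ESCAPE '\\' OR content LIKE ? ESCAPE '\\')",
    params: [pattern, pattern, pattern],
  };
}

console.log(buildKeywordClause('earnings').params[0]); // %earnings%
```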

Example:

GET /articles?keyword=earnings

source

Exact match on the stored source field.

Example:

GET /articles?source=rss

from

Only returns rows where pub_date >= from.

Example:

GET /articles?from=2025-01-01T00:00:00.000Z

to

Only returns rows where pub_date <= to.

Example:

GET /articles?to=2025-01-31T23:59:59.999Z

limit

Number of rows to return.

  • default: 20
  • max: 100

Example:

GET /articles?limit=10

offset

Pagination offset.

  • default: 0
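The limit and offset rules can be sketched as a small normalization step. `normalizePaging` is a hypothetical name, and clamping an oversized limit to 100 (rather than rejecting the request) is an assumption; the README only states the default and the maximum.

```javascript
// Apply the documented defaults (limit 20, offset 0) and the limit cap of 100.
// Invalid or missing values fall back to the defaults (an assumption).
function normalizePaging(query) {
  const limit = Number.parseInt(query.limit, 10);
  const offset = Number.parseInt(query.offset, 10);
  return {
    limit: Number.isInteger(limit) && limit > 0 ? Math.min(limit, 100) : 20,
    offset: Number.isInteger(offset) && offset >= 0 ? offset : 0,
  };
}

console.log(normalizePaging({}));                            // { limit: 20, offset: 0 }
console.log(normalizePaging({ limit: '500' }));              // { limit: 100, offset: 0 }
console.log(normalizePaging({ limit: '10', offset: '20' })); // { limit: 10, offset: 20 }
```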

Example:

GET /articles?limit=10&offset=20

similar_to_article

Runs vector similarity search instead of normal list mode.

  • value must be an existing article ID
  • the server looks up that article's embedding
  • nearest-neighbor search runs in sqlite-vec
  • the source article is excluded from the result set
  • each result includes a distance field
  • lower distance means more similar
  • returns 404 if the article has no stored embedding
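To make the bullet points concrete, here is an in-memory sketch of the same behavior. It is not how the server does it: the real search runs inside sqlite-vec, while this version is a brute-force scan over a Map. `l2Distance` and `similarTo` are hypothetical names, and Euclidean distance is an assumed metric, since the README does not say which one the index uses.

```javascript
// Euclidean (L2) distance between two equal-length vectors (assumed metric).
function l2Distance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i += 1) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// embeddings: Map of article ID -> vector.
function similarTo(articleId, embeddings, limit = 20) {
  const query = embeddings.get(articleId);
  if (!query) {
    return { status: 404, body: { error: 'Embedding not found for article' } };
  }
  const results = [...embeddings]
    .filter(([id]) => id !== articleId) // exclude the source article
    .map(([id, vector]) => ({ id, distance: l2Distance(query, vector) }))
    .sort((x, y) => x.distance - y.distance) // lower distance first
    .slice(0, limit);
  return { status: 200, body: results };
}

const sampleEmbeddings = new Map([
  [123, [1, 0]],
  [456, [0.9, 0.1]],
  [789, [0, 1]],
]);
console.log(similarTo(123, sampleEmbeddings, 5).body.map((r) => r.id)); // [ 456, 789 ]
```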

Example:

GET /articles?similar_to_article=123&limit=5

Not found response:

{ "error": "Embedding not found for article" }

semantic

Semantic search by meaning, not exact wording.

  • use this when you want conceptually related results
  • unlike keyword, the words do not need to appear literally in the article text
  • the query text is normalized before embedding
  • query embeddings are cached in SQLite
  • on cache miss, the server requests an embedding from OpenRouter
  • nearest article matches are returned from the embedding index
  • each result includes a distance field
  • lower distance means a closer semantic match
  • returns 400 if semantic is empty
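The normalize, cache, then fetch flow might look like the sketch below. Everything in it is illustrative: `normalizeQuery` and `getQueryEmbedding` are hypothetical names, the normalization rules (trim, lowercase, collapse whitespace) are assumed, the cache is a plain Map standing in for the SQLite cache, and `embed` is an injected function standing in for the OpenRouter call.

```javascript
// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function normalizeQuery(text) {
  return String(text ?? '').trim().toLowerCase().replace(/\s+/g, ' ');
}

// cache: any Map-like store; embed: async function that calls the embedding provider.
async function getQueryEmbedding(text, cache, embed) {
  const key = normalizeQuery(text);
  if (key === '') {
    const err = new Error('Semantic query must not be empty');
    err.statusCode = 400;
    throw err;
  }
  if (cache.has(key)) return cache.get(key); // cache hit: no provider call
  const vector = await embed(key);           // cache miss: ask the provider
  cache.set(key, vector);
  return vector;
}
```

With this shape, two queries that differ only in case or spacing share one cache entry, so only the first one reaches the provider.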

Example:

GET /articles?semantic=ai%20chip%20demand&limit=10

Bad request response:

{ "error": "Semantic query must not be empty" }

include_embedding

Explicitly rejected on /articles.

Response:

{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }

General behavior

  • If semantic is present, semantic search is used.
  • Else if similar_to_article is present, similarity search is used.
  • Otherwise normal list/search mode is used.
  • keyword matches literal text; semantic matches by meaning.
  • Normal list/search results are ordered by COALESCE(pub_date, ingested_at) DESC, id DESC.
  • from and to are compared against stored publication timestamps, so ISO-8601 values are the safest input.
  • source must match the stored source name exactly.
  • keyword is substring matching, not full-text search.
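The precedence rules above can be sketched as a small dispatcher. `selectMode` is a hypothetical name, and checking include_embedding before the other parameters is an assumption; the README states only that the parameter is rejected and that semantic takes priority over similar_to_article.

```javascript
// Decide which query mode a request falls into, based on its query params.
function selectMode(query) {
  if ('include_embedding' in query) {
    // Assumed to be checked first; the README only says it is rejected.
    return {
      mode: 'rejected',
      error: 'Embeddings are not returned directly. Use similar_to_article for vector search.',
    };
  }
  if (query.semantic !== undefined) return { mode: 'semantic' };
  if (query.similar_to_article !== undefined) return { mode: 'similar' };
  return { mode: 'list' };
}

console.log(selectMode({ semantic: 'ai chips', similar_to_article: '123' }).mode); // semantic
console.log(selectMode({ similar_to_article: '123' }).mode);                       // similar
console.log(selectMode({ keyword: 'earnings' }).mode);                             // list
```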

Normal list/search response shape

[
  {
    "id": 123,
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
]

Similarity/semantic search response shape

[
  {
    "id": 456,
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss",
    "pub_date": "2025-01-02T09:00:00.000Z",
    "ingested_at": "2025-01-02T09:00:10.000Z",
    "distance": 0.1234
  }
]

Combined example

GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
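When building such a URL from client code, letting `URL` and `URLSearchParams` handle the encoding avoids problems with spaces and timestamps. The sketch below assumes a server at `http://localhost:3000`; the real host and port come from config.json. `buildArticlesUrl` is a hypothetical helper name.

```javascript
// Build an /articles URL, skipping params that are not set.
function buildArticlesUrl(baseUrl, params = {}) {
  const url = new URL('/articles', baseUrl);
  for (const [name, value] of Object.entries(params)) {
    if (value !== undefined && value !== null) {
      url.searchParams.set(name, String(value)); // values are percent-encoded for us
    }
  }
  return url.toString();
}

const combinedUrl = buildArticlesUrl('http://localhost:3000', {
  keyword: 'earnings',
  source: 'rss',
  from: '2025-01-01T00:00:00.000Z',
  limit: 10,
  offset: 0,
});
console.log(combinedUrl);
```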

GET /articles/:id

Returns one article by numeric ID.

Behavior

  • Looks up the article directly in SQLite.
  • Returns the same article fields as normal /articles list mode.
  • Does not return embedding data.
  • Returns 404 if the ID does not exist.

Example

GET /articles/123

Not found response

{ "error": "Article not found" }
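A client can treat the 404 as "no such article" rather than as a failure. The following is a minimal sketch: `getArticle` is a hypothetical helper, the base URL is whatever host and port config.json defines, and `fetchImpl` is injectable so the function can be exercised without a running server (it defaults to the global `fetch` available in recent Node.js versions).

```javascript
// Fetch one article by ID; resolve to null when the server answers 404.
async function getArticle(baseUrl, id, fetchImpl = fetch) {
  const response = await fetchImpl(new URL(`/articles/${encodeURIComponent(id)}`, baseUrl));
  if (response.status === 404) return null;
  if (!response.ok) throw new Error(`Unexpected status ${response.status}`);
  return response.json();
}

// Usage (assuming a server on localhost:3000):
// getArticle('http://localhost:3000', 123).then((article) => console.log(article));
```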

GET /status

Returns ingestion and archive summary information.

Response fields

  • totalArticles: total number of rows in articles
  • countsBySource: article counts grouped by source name
  • lastIngestionBySource: in-memory timestamps of the last successful batch run per source
  • contentFetchCoverage.total: total article count used for coverage math
  • contentFetchCoverage.withContent: rows whose content is present and non-empty
  • contentFetchCoverage.withImage: rows whose image is present and non-empty
  • contentFetchCoverage.withEmbedding: rows that have an embedding in article_embeddings
  • contentFetchCoverage.contentRatio: withContent / total
  • contentFetchCoverage.imageRatio: withImage / total
  • contentFetchCoverage.embeddingRatio: withEmbedding / total

Important detail

lastIngestionBySource is kept in memory, so it resets when the process restarts.

Example response

{
  "totalArticles": 10234,
  "countsBySource": {
    "alphavantage": 120,
    "edgar": 88,
    "finnhub": 400,
    "gdelt": 2100,
    "rss": 7526
  },
  "lastIngestionBySource": {
    "rss": "2025-01-02T10:00:00.000Z",
    "gdelt": "2025-01-02T10:05:00.000Z"
  },
  "contentFetchCoverage": {
    "withContent": 9000,
    "withImage": 6500,
    "withEmbedding": 8700,
    "total": 10234,
    "contentRatio": 0.8794,
    "imageRatio": 0.6351,
    "embeddingRatio": 0.8501
  }
}
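The ratio fields follow directly from the counts. As a check on the example above, this sketch recomputes them; `coverage` is a hypothetical name, and both the rounding to four decimals (inferred from the example values) and the zero result for an empty archive are assumptions.

```javascript
// Round to four decimal places, matching the precision seen in the example.
function round4(x) {
  return Math.round(x * 1e4) / 1e4;
}

// Derive the ratio fields from the raw counts; an empty archive yields 0.
function coverage({ total, withContent, withImage, withEmbedding }) {
  const ratio = (n) => (total > 0 ? round4(n / total) : 0);
  return {
    withContent,
    withImage,
    withEmbedding,
    total,
    contentRatio: ratio(withContent),
    imageRatio: ratio(withImage),
    embeddingRatio: ratio(withEmbedding),
  };
}

const exampleCoverage = coverage({ total: 10234, withContent: 9000, withImage: 6500, withEmbedding: 8700 });
console.log(exampleCoverage.contentRatio, exampleCoverage.imageRatio, exampleCoverage.embeddingRatio);
// 0.8794 0.6351 0.8501
```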

Article field notes

  • image stores the extracted main image as a heavily compressed, base64-encoded WebP.
  • normalized_title is stored for matching and indexing.
  • source may be a shared source like rss, googlenews, gdelt, edgar, alphavantage, or finnhub.
  • pub_date is normalized to ISO-8601 when it can be parsed.
  • ingested_at is the insert timestamp set by the server.
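The pub_date rule can be illustrated with the built-in Date parser. This is a sketch only: `normalizePubDate` is a hypothetical name, and returning null for unparseable input is an assumption, since the README does not say what is stored in that case.

```javascript
// Convert a feed-style date string to ISO-8601, or null if it cannot be parsed.
function normalizePubDate(raw) {
  if (raw === undefined || raw === null || raw === '') return null;
  const parsed = new Date(raw);
  return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
}

console.log(normalizePubDate('Wed, 01 Jan 2025 12:34:56 GMT')); // 2025-01-01T12:34:56.000Z
console.log(normalizePubDate('not a date'));                    // null
```

Storing every timestamp in this one format also keeps the from and to comparisons consistent.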

Notes

  • SQLite archive file defaults to ./archive.sqlite.
  • Deduplication is enforced on url; normalized titles are stored and indexed for matching but are not unique.
  • googleNews accepts queries, topics, language, and country, and resolves Google redirect URLs to publisher URLs before ingestion.
  • Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content.
  • Embeddings are generated asynchronously with OpenRouter perplexity/pplx-embed-v1-0.6b and indexed in sqlite-vec for similarity search.
  • Semantic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss.
  • SEC requests use the configured User-Agent.
  • Duplicate URLs are skipped rather than inserted again.