# duriin_api

Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.

## Setup

1. Install dependencies:
   ```bash
   npm install
   ```
2. Edit `config.json` with your API keys, tickers, and schedules.
3. Start the server:
   ```bash
   npm start
   ```

The server listens on the host and port defined in `config.json`.

## How the data pipeline works

On startup the server:

1. Opens the SQLite database and runs any pending migrations.
2. Registers routes.
3. Starts the HTTP server.
4. Launches continuous background loops for each source, content backfill, embedding backfill, and event clustering.

When a new article is inserted:

- the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
- `content` starts as `null`
- content backfill workers pick it up asynchronously — plain HTTP first, Playwright fallback for JS-heavy sites
- vector embeddings are generated after title, description, and content are all available
- the clustering worker assigns the article to an event once it has an embedding
- only articles with content + embedding are exposed via the API
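
The plain-HTTP-first, browser-fallback order described above could be sketched roughly as follows. This is an illustration rather than the server's actual worker: `fetchContent`, `plainFetch`, and `browserFetch` are hypothetical names, and both fetchers are injected so the sketch does not depend on Playwright.

```javascript
// Hypothetical sketch: try the cheap plain HTTP fetcher first, then the
// browser-based one. Each fetcher is assumed to resolve to extracted text,
// or to throw / resolve to an empty value when it cannot get content.
async function fetchContent(url, { plainFetch, browserFetch }) {
  const attempts = [
    ["plain", plainFetch],
    ["browser", browserFetch],
  ];
  for (const [via, fetcher] of attempts) {
    try {
      const text = await fetcher(url);
      if (typeof text === "string" && text.trim().length > 0) {
        return { content: text, via };
      }
    } catch (err) {
      // Treat a thrown error like an empty result and try the next fetcher.
    }
  }
  // Neither fetcher produced content; the article stays without content.
  return { content: null, via: null };
}
```

Returning `via` alongside the content makes it possible to record which fetcher succeeded for a given domain.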

Content backfill prioritises recent articles (`pub_date_effective DESC`) so newest content surfaces first regardless of ingestion order.

Per-domain fetch policies are tracked automatically: domains that repeatedly fail plain fetch are upgraded to browser-only, and domains that fail both plain and browser fetches are blocked temporarily.
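
One way to picture the policy transitions is as a small state machine. The sketch below is an assumption-laden illustration, not the server's code: the thresholds, the block duration, and the names `newDomainState` and `recordOutcome` are all hypothetical. Only the policy values (`auto`, `browser_only`, `blocked`) and `expiresAt` come from this README.

```javascript
// Hypothetical thresholds; the real values are not documented here.
const PLAIN_FAILURES_BEFORE_UPGRADE = 3;
const BROWSER_FAILURES_BEFORE_BLOCK = 3;
const BLOCK_DURATION_MS = 60 * 60 * 1000;

function newDomainState() {
  return { policy: "auto", plainFailures: 0, browserFailures: 0, expiresAt: null };
}

// Returns the next state after one fetch outcome.
// `via` is "plain" or "browser"; `ok` says whether the fetch produced content.
function recordOutcome(state, via, ok, now = Date.now()) {
  const next = { ...state };
  // An expired block reverts to the default policy with fresh counters.
  if (next.policy === "blocked" && next.expiresAt !== null && now >= next.expiresAt) {
    Object.assign(next, newDomainState());
  }
  if (ok) {
    if (via === "plain") next.plainFailures = 0;
    else next.browserFailures = 0;
    return next;
  }
  if (via === "plain") next.plainFailures += 1;
  else next.browserFailures += 1;

  // Repeated plain failures: stop trying plain HTTP for this domain.
  if (next.policy === "auto" && next.plainFailures >= PLAIN_FAILURES_BEFORE_UPGRADE) {
    next.policy = "browser_only";
  }
  // Browser fetches failing as well: block the domain for a while.
  if (next.policy === "browser_only" && next.browserFailures >= BROWSER_FAILURES_BEFORE_BLOCK) {
    next.policy = "blocked";
    next.expiresAt = now + BLOCK_DURATION_MS;
  }
  return next;
}
```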

## API overview

All endpoints are `GET`.

### `GET /`

Health check. Returns `{ "ok": true }`.

### `GET /articles`

Returns usable articles: those with non-empty `content` and a stored embedding that are not index/category pages.

#### Query params

| Param | Description |
|---|---|
| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords — e.g. `keyword=bitcoin&keyword=ethereum` |
| `keyword_mode` | How multiple keywords are combined — `and` (default) or `or` |
| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
| `from` | `pub_date >= from` (ISO-8601) |
| `to` | `pub_date <= to` (ISO-8601) |
| `limit` | Rows to return. Default `20`, max `100` |
| `offset` | Pagination offset. Default `0` |
| `order` | Sort order — see below. Not applied to `semantic` or `similar_to_article` results (those are sorted by distance) |
| `semantic` | Semantic search by meaning via embedding similarity |
| `similar_to_article` | Vector similarity search using another article's embedding |

#### `order` values

| Value | Sort |
|---|---|
| `newest` | `pub_date_effective DESC` (default) |
| `oldest` | `pub_date_effective ASC` |
| `ingested_newest` | `ingested_at DESC` |
| `ingested_oldest` | `ingested_at ASC` |
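
As a sketch of how a client might assemble these parameters, the helper below builds a request URL and emits array values as repeated parameters, which is the form `keyword` expects. `buildArticlesUrl` is a hypothetical helper, and `http://localhost:3000` is a placeholder; use the host and port from your `config.json`.

```javascript
// Build a /articles URL from a plain object of query params.
// Array values (e.g. several keywords) become repeated parameters.
function buildArticlesUrl(baseUrl, params = {}) {
  const url = new URL("/articles", baseUrl);
  for (const [key, value] of Object.entries(params)) {
    if (value === undefined || value === null) continue;
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) url.searchParams.append(key, String(v));
  }
  return url.toString();
}

// Example: articles mentioning either keyword, oldest first.
const url = buildArticlesUrl("http://localhost:3000", {
  keyword: ["bitcoin", "ethereum"],
  keyword_mode: "or",
  from: "2025-01-01T00:00:00Z",
  limit: 50,
  order: "oldest",
});
console.log(url);
```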

#### Search modes

- If `semantic` is present — semantic nearest-neighbor search. Query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
- Else if `similar_to_article` is present — finds articles similar to the given article ID. Returns `404` if that article has no embedding.
- Otherwise — normal filtered list mode. All params apply.

`keyword`, `source`, `from`, and `to` also work as post-filters on `semantic` and `similar_to_article` results.
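
The precedence between modes, and the idea of post-filtering vector results, can be sketched like this. Both functions are hypothetical illustrations: the real handler may treat empty values, keyword matching, and null `pub_date` values differently. Case-insensitive substring matching is an assumption.

```javascript
// Mirrors the documented precedence: semantic, then similar_to_article,
// then plain list mode. "Present" is taken to mean "not undefined".
function pickSearchMode(query) {
  if (query.semantic !== undefined) return "semantic";
  if (query.similar_to_article !== undefined) return "similar";
  return "list";
}

// Apply keyword/source/from/to to an already-ranked result list.
// Result order (by distance) is preserved.
function postFilter(results, { keywords = [], keywordMode = "and", source, from, to } = {}) {
  const needles = keywords.map((k) => k.toLowerCase());
  return results.filter((a) => {
    if (source !== undefined && a.source !== source) return false;
    // Date.parse(null) is NaN, so articles without pub_date fail date filters.
    if (from !== undefined && !(Date.parse(a.pub_date) >= Date.parse(from))) return false;
    if (to !== undefined && !(Date.parse(a.pub_date) <= Date.parse(to))) return false;
    if (needles.length === 0) return true;
    const haystack = [a.title, a.description, a.content].join(" ").toLowerCase();
    const hit = (n) => haystack.includes(n);
    return keywordMode === "or" ? needles.some(hit) : needles.every(hit);
  });
}
```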

`include_embedding` is explicitly rejected on this endpoint.

#### Response shape

```json
[
  {
    "id": 123,
    "title": "...",
    "description": "...",
    "content": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss:BBC",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
]
```

Semantic and similarity results also include `"distance": 0.1234`.

### `GET /articles/:id`

Returns one article by numeric ID. Same usability filter as the list endpoint — returns `404` if the article exists but has no content or embedding.

### `GET /events`

Returns a single event and its articles.

#### Query params

| Param | Description |
|---|---|
| `id` | Event ID (required) |

#### Response shape

```json
{
  "id": 1,
  "title": "...",
  "created_at": "2025-01-01T12:35:10.000Z",
  "articles": [
    {
      "id": 123,
      "title": "...",
      "description": "...",
      "content": "...",
      "url": "...",
      "normalized_title": "...",
      "source": "rss:BBC",
      "pub_date": "2025-01-01T12:34:56.000Z",
      "ingested_at": "2025-01-01T12:35:10.000Z"
    }
  ]
}
```

Returns `404` if the event ID does not exist.
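
A client might wrap this endpoint as follows. This is a minimal sketch: `getEvent` is a hypothetical helper, `baseUrl` stands for the host and port from `config.json`, and `fetchImpl` defaults to the global `fetch` available in Node.js 18+.

```javascript
// Fetch one event with its articles. Resolves to null when the server
// answers 404, and throws on any other non-2xx status.
async function getEvent(baseUrl, id, fetchImpl = fetch) {
  const url = new URL("/events", baseUrl);
  url.searchParams.set("id", String(id));
  const res = await fetchImpl(url.toString());
  if (res.status === 404) return null;
  if (!res.ok) {
    throw new Error(`GET /events failed with status ${res.status}`);
  }
  return res.json();
}
```

Accepting `fetchImpl` as a parameter keeps the helper easy to exercise against a stub without a running server.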

### `GET /status`

Returns archive summary. Cached for 30 seconds.

**Response fields**

- `total` — total rows across all sources
- `usable` — articles with content + embedding, not index pages
- `lastIngestionBySource` — in-memory timestamps of the last successful batch per source (resets on restart)
- `bySource` — per-source `{ total, usable }`
- `embeddingModels` — active embedding models with article count and detected dimensions
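
The 30-second caching means that polling more often than that returns the same summary. How the server implements it is not documented here; a time-based cache of this general kind could look like the sketch below, where `cached` and its parameters are hypothetical.

```javascript
// Minimal time-based cache: `compute` runs at most once per `ttlMs`.
// `now` is injectable so the behaviour can be checked without waiting.
function cached(compute, ttlMs, now = Date.now) {
  let value;
  let expiresAt = -Infinity;
  return () => {
    const t = now();
    if (t >= expiresAt) {
      value = compute();
      expiresAt = t + ttlMs;
    }
    return value;
  };
}

// Example: a summary function that is recomputed every 30 seconds at most.
let computations = 0;
const getStatus = cached(() => ({ total: 0, computedCount: ++computations }), 30_000);
```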

### `GET /sources`

Returns the full source catalog from `sources.json` enriched with live DB stats.

**Per-source fields**

- `id`, `label`, `websites`, `backfill`, `feeds` — from `sources.json` (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
- `counts` — aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
- `byFeed` — same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
- `domains` — current domain fetch policy per website: `policy` (auto / browser_only / blocked), failure/success counts, `expiresAt`

Use `domains[].policy` to diagnose why a source has high `skipped` or `failed` counts — `blocked` means backfill has given up on that domain temporarily.
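
For example, a small script could list every domain that is not on the default policy. This sketch assumes the endpoint returns an array of sources and that each `domains` entry is an object; the `domain` field name is hypothetical, while `policy` and `expiresAt` are the fields described above.

```javascript
// Collect domains whose policy is browser_only or blocked, per source.
function findDegradedDomains(sources) {
  const rows = [];
  for (const source of sources) {
    for (const d of source.domains ?? []) {
      if (d.policy === "blocked" || d.policy === "browser_only") {
        rows.push({
          source: source.id,
          domain: d.domain, // hypothetical field name
          policy: d.policy,
          expiresAt: d.expiresAt ?? null,
        });
      }
    }
  }
  return rows;
}
```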

## Article field notes

- `pub_date` is normalized to ISO-8601 when parseable; `null` otherwise.
- `pub_date_effective` is `COALESCE(pub_date, ingested_at)` — used for sorting.
- `ingested_at` is the server-side insert timestamp.
- `normalized_title` is stored for deduplication and indexing.
- `source` format is `<feed_type>:<label>` for GDELT and RSS (e.g. `gdelt:Bloomberg Markets`, `rss:TechCrunch`), or just the source name for other feeds (`alphavantage`, `edgar`, `finnhub`).
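
On the client side, two of these rules are easy to mirror. The helpers below are illustrative only; `pubDateEffective` and `parseSource` are hypothetical names, and splitting on the first colon is an assumption about labels that themselves contain colons.

```javascript
// Client-side equivalent of COALESCE(pub_date, ingested_at).
function pubDateEffective(article) {
  return article.pub_date ?? article.ingested_at;
}

// Split a `source` value into feed type and label.
// Sources without a colon (e.g. "edgar") have no label.
function parseSource(source) {
  const i = source.indexOf(":");
  if (i === -1) return { feedType: source, label: null };
  return { feedType: source.slice(0, i), label: source.slice(i + 1) };
}
```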

## Notes

- SQLite archive defaults to `./archive.sqlite`.
- Deduplication is enforced on `url`.
- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
- Embeddings use OpenRouter and are indexed in `sqlite-vec` for ANN search.
- Query embeddings are cached in SQLite to avoid redundant API calls.
- SEC requests use the `User-Agent` from `config.json`.
- Event clustering groups articles by embedding similarity (cosine distance ≤ `config.clustering.distanceThreshold`, default `0.25`) and time proximity (within `config.clustering.windowHours`, default `72`). Articles outside the time window are never grouped together even if embeddings are close.
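
The two clustering conditions can be illustrated with a pairwise check. This is a sketch under assumptions, not the clustering worker itself: it does not show how an article is compared against an existing event, the function names are hypothetical, and using an effective publication timestamp for the time window is a guess. The defaults match the documented `distanceThreshold` and `windowHours`.

```javascript
// Cosine distance = 1 - cosine similarity, for equal-length numeric vectors.
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// True when two articles satisfy both the time-window and distance rules.
// Each article is assumed to carry `embedding` (number[]) and `timestamp` (ISO-8601).
function canShareEvent(a, b, { distanceThreshold = 0.25, windowHours = 72 } = {}) {
  const hoursApart = Math.abs(Date.parse(a.timestamp) - Date.parse(b.timestamp)) / 3600000;
  // The time window is a hard limit, regardless of embedding similarity.
  if (hoursApart > windowHours) return false;
  return cosineDistance(a.embedding, b.embedding) <= distanceThreshold;
}
```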