# duriin_api

Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.

## Setup

1. Install dependencies:
```bash
npm install
```
2. Edit `config.json` with your API keys, tickers, and schedules.
3. Start the server:
```bash
npm start
```

The server listens on the host and port defined in `config.json`.

## How the data pipeline works

On startup the server:

1. Opens the SQLite database and runs any pending migrations.
2. Registers routes.
3. Starts the HTTP server.
4. Launches continuous background loops for each source, content backfill, embedding backfill, and event clustering.

When a new article is inserted:

- the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
- `content` starts as `null`
- content backfill workers pick it up asynchronously, trying plain HTTP first and falling back to Playwright for JS-heavy sites
- vector embeddings are generated after title, description, and content are all available
- the clustering worker assigns the article to an event once it has an embedding
- only articles with both content and an embedding are exposed via the API

Content backfill prioritises recent articles (`pub_date_effective DESC`), so the newest content surfaces first regardless of ingestion order.

Per-domain fetch policies are tracked automatically: domains that repeatedly fail plain fetch are upgraded to browser-only, and domains that fail both are blocked temporarily.

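The per-domain policy escalation described above can be pictured as a small state machine. The sketch below is only an illustration: the failure thresholds, the block duration, the state field names, and the function names `nextPolicy` and `isFetchable` are hypothetical and not taken from the server's code. Only the policy names `auto`, `browser_only`, and `blocked` come from this README.

```javascript
// Hypothetical thresholds and duration; the real values are not documented here.
const PLAIN_FAILURE_LIMIT = 3;
const BROWSER_FAILURE_LIMIT = 3;
const BLOCK_DURATION_MS = 6 * 60 * 60 * 1000; // assumed: 6 hours

// Returns the next state for a domain after one failed fetch.
// `state` is { policy, plainFailures, browserFailures, expiresAt }.
// `failedWith` is 'plain' or 'browser'.
function nextPolicy(state, failedWith, now = Date.now()) {
  const next = { ...state };
  if (failedWith === 'plain') next.plainFailures += 1;
  if (failedWith === 'browser') next.browserFailures += 1;

  if (next.policy === 'auto' && next.plainFailures >= PLAIN_FAILURE_LIMIT) {
    // Plain HTTP keeps failing: stop trying it and go straight to the browser.
    next.policy = 'browser_only';
  } else if (
    next.policy === 'browser_only' &&
    next.browserFailures >= BROWSER_FAILURE_LIMIT
  ) {
    // Both strategies have failed: block the domain until `expiresAt`.
    next.policy = 'blocked';
    next.expiresAt = now + BLOCK_DURATION_MS;
  }
  return next;
}

// A blocked domain becomes eligible again once its block has expired.
function isFetchable(state, now = Date.now()) {
  return state.policy !== 'blocked' || now >= state.expiresAt;
}
```

Keeping the block temporary (via `expiresAt`) means a domain that recovers is retried later instead of being excluded permanently.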
## API overview

All endpoints are `GET`.

### `GET /`

Health check. Returns `{ "ok": true }`.

### `GET /articles`

Returns usable articles: those with non-empty `content` and a stored embedding that are not index/category pages.

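The usability filter can be read as a simple predicate over an article row. This is a minimal sketch under assumptions: `hasEmbedding` and `isIndexPage` are hypothetical field names, and how the server actually detects index or category pages is not described in this README.

```javascript
// Returns true when an article would count as "usable".
// `hasEmbedding` and `isIndexPage` are hypothetical flags used for illustration.
function isUsable(article) {
  const hasContent =
    typeof article.content === 'string' && article.content.trim() !== '';
  return hasContent && article.hasEmbedding === true && !article.isIndexPage;
}
```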
#### Query params

| Param | Description |
|---|---|
| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords, e.g. `keyword=bitcoin&keyword=ethereum` |
| `keyword_mode` | How multiple keywords are combined: `and` (default) or `or` |
| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
| `from` | `pub_date >= from` (ISO-8601) |
| `to` | `pub_date <= to` (ISO-8601) |
| `limit` | Rows to return. Default `20`, max `100` |
| `offset` | Pagination offset. Default `0` |
| `order` | Sort order (see `order` values below). Not applied to `semantic` or `similar_to_article` results, which are sorted by distance |
| `semantic` | Query text for semantic search by meaning via embedding similarity |
| `similar_to_article` | ID of an article whose embedding is used for vector similarity search |

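As an illustration of how these params combine, here is a small client-side sketch that builds a request URL. The base URL `http://localhost:3000` and the helper name `buildArticlesUrl` are assumptions; substitute the host and port from your `config.json`. The `limit` clamp simply mirrors the documented maximum of `100`.

```javascript
// Assumed base URL; use the host and port from your config.json.
const BASE_URL = 'http://localhost:3000';

// Builds a GET /articles URL from a plain options object (hypothetical helper).
function buildArticlesUrl(opts = {}) {
  const { keywords = [], keywordMode, source, from, to, limit, offset, order } = opts;
  const params = new URLSearchParams();

  // Multiple keywords are sent by repeating the `keyword` param.
  for (const kw of keywords) params.append('keyword', kw);
  if (keywordMode) params.set('keyword_mode', keywordMode);
  if (source) params.set('source', source);
  if (from) params.set('from', from);
  if (to) params.set('to', to);
  if (limit !== undefined) params.set('limit', String(Math.min(limit, 100)));
  if (offset !== undefined) params.set('offset', String(offset));
  if (order) params.set('order', order);

  return `${BASE_URL}/articles?${params}`;
}

const exampleUrl = buildArticlesUrl({
  keywords: ['bitcoin', 'ethereum'],
  keywordMode: 'or',
  source: 'rss:BBC',
  limit: 50,
});
console.log(exampleUrl);
// http://localhost:3000/articles?keyword=bitcoin&keyword=ethereum&keyword_mode=or&source=rss%3ABBC&limit=50
```

Using `URLSearchParams` takes care of percent-encoding values such as the `:` in `rss:BBC`.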
#### `order` values

| Value | Sort |
|---|---|
| `newest` | `pub_date_effective DESC` (default) |
| `oldest` | `pub_date_effective ASC` |
| `ingested_newest` | `ingested_at DESC` |
| `ingested_oldest` | `ingested_at ASC` |

#### Search modes

- If `semantic` is present: semantic nearest-neighbor search. The query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
- Else if `similar_to_article` is present: finds articles similar to the article with the given ID. Returns `404` if that article has no embedding.
- Otherwise: normal filtered list mode. All other params apply.

`keyword`, `source`, `from`, and `to` also work as post-filters on `semantic` and `similar_to_article` results.

`include_embedding` is explicitly rejected on this endpoint.

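The precedence between the three modes can be summarised in a few lines. This is a sketch of the documented behaviour rather than the server's actual handler; the function names `selectSearchMode` and `effectiveSort` and the mode labels are made up for illustration.

```javascript
// Picks the search mode from parsed query params, in the documented order:
// `semantic` wins over `similar_to_article`, which wins over plain listing.
function selectSearchMode(query) {
  if (query.semantic !== undefined) return 'semantic';
  if (query.similar_to_article !== undefined) return 'similar';
  return 'list';
}

// `order` only affects list mode; both vector modes sort by distance.
function effectiveSort(query) {
  return selectSearchMode(query) === 'list'
    ? query.order ?? 'newest' // documented default
    : 'distance';
}
```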
#### Response shape

```json
[
  {
    "id": 123,
    "title": "...",
    "description": "...",
    "content": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss:BBC",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
]
```

Semantic and similarity results also include `"distance": 0.1234`.

### `GET /articles/:id`

Returns one article by numeric ID. The same usability filter as the list endpoint applies: returns `404` if the article exists but has no content or embedding.

### `GET /events`

Returns a single event and its articles.

#### Query params

| Param | Description |
|---|---|
| `id` | Event ID (required) |

#### Response shape

```json
{
  "id": 1,
  "title": "...",
  "created_at": "2025-01-01T12:35:10.000Z",
  "articles": [
    {
      "id": 123,
      "title": "...",
      "description": "...",
      "content": "...",
      "url": "...",
      "normalized_title": "...",
      "source": "rss:BBC",
      "pub_date": "2025-01-01T12:34:56.000Z",
      "ingested_at": "2025-01-01T12:35:10.000Z"
    }
  ]
}
```

Returns `404` if the event ID does not exist.

### `GET /status`

Returns an archive summary. Cached for 30 seconds.

**Response fields**

- `total`: total rows across all sources
- `usable`: articles with content and an embedding that are not index pages
- `lastIngestionBySource`: in-memory timestamps of the last successful batch per source (resets on restart)
- `bySource`: per-source `{ total, usable }`
- `embeddingModels`: active embedding models with article count and detected dimensions

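A 30-second cache like the one mentioned above can be built with a small time-to-live wrapper. The sketch below is one possible approach, not the server's implementation: `cached` and `computeStatus` are hypothetical names, and the injectable `clock` parameter exists only to make the behaviour easy to test.

```javascript
// Wraps `compute` so its result is reused until `ttlMs` milliseconds have passed.
function cached(compute, ttlMs, clock = Date.now) {
  let value;
  let expiresAt = -Infinity; // forces a computation on the first call
  return () => {
    const now = clock();
    if (now >= expiresAt) {
      value = compute();
      expiresAt = now + ttlMs;
    }
    return value;
  };
}

// Hypothetical stand-in for the aggregate queries behind /status.
const computeStatus = () => ({ total: 0, usable: 0 });

// Recomputed at most once every 30 seconds.
const getStatus = cached(computeStatus, 30_000);
```

Because the summary is an aggregate over the whole archive, serving a slightly stale copy trades a little freshness for far fewer full-table queries.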
### `GET /sources`

Returns the full source catalog from `sources.json` enriched with live DB stats.

**Per-source fields**

- `id`, `label`, `websites`, `backfill`, `feeds`: from `sources.json` (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
- `counts`: aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
- `byFeed`: same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
- `domains`: current domain fetch policy per website, including `policy` (`auto` / `browser_only` / `blocked`), failure/success counts, and `expiresAt`

Use `domains[].policy` to diagnose why a source has high `skipped` or `failed` counts; `blocked` means backfill has temporarily given up on that domain.

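The diagnosis described above can be scripted against the response. This sketch assumes the endpoint returns an array of source objects shaped like the field list; the helper name `sourcesWithPolicy` and the sample data are hypothetical.

```javascript
// Lists sources that have at least one domain in the given policy state,
// together with their skipped/failed counts.
function sourcesWithPolicy(sources, policy = 'blocked') {
  return sources
    .filter((s) => (s.domains ?? []).some((d) => d.policy === policy))
    .map((s) => ({
      id: s.id,
      skipped: s.counts?.skipped ?? 0,
      failed: s.counts?.failed ?? 0,
    }));
}

// Hypothetical sample shaped like the per-source fields listed above.
const sampleSources = [
  { id: 'a', counts: { skipped: 40, failed: 12 }, domains: [{ policy: 'blocked' }] },
  { id: 'b', counts: { skipped: 1, failed: 0 }, domains: [{ policy: 'auto' }] },
];

console.log(sourcesWithPolicy(sampleSources));
// [ { id: 'a', skipped: 40, failed: 12 } ]
```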
## Article field notes

- `pub_date` is normalized to ISO-8601 when parseable; `null` otherwise.
- `pub_date_effective` is `COALESCE(pub_date, ingested_at)`, used for sorting.
- `ingested_at` is the server-side insert timestamp.
- `normalized_title` is stored for deduplication and indexing.
- `source` format is `<feed_type>:<label>` for GDELT and RSS (e.g. `gdelt:Bloomberg Markets`, `rss:TechCrunch`), or just the source name for other feeds (`alphavantage`, `edgar`, `finnhub`).

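Clients that consume these fields may find two small helpers useful. They are illustrative only: `pubDateEffective` mirrors the `COALESCE` expression on the client side, and `parseSource` splits the `source` string at the first colon. Both function names and the returned `{ feedType, label }` shape are assumptions, not part of the API.

```javascript
// Client-side equivalent of COALESCE(pub_date, ingested_at).
function pubDateEffective(article) {
  return article.pub_date ?? article.ingested_at;
}

// Splits "rss:TechCrunch" into its feed type and label.
// Sources without a prefix (e.g. "edgar") get a null label.
function parseSource(source) {
  const i = source.indexOf(':');
  if (i === -1) return { feedType: source, label: null };
  return { feedType: source.slice(0, i), label: source.slice(i + 1) };
}
```

Splitting at the first colon only keeps labels that themselves contain a colon intact.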
## Notes

- SQLite archive defaults to `./archive.sqlite`.
- Deduplication is enforced on `url`.
- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
- Embeddings use OpenRouter and are indexed in `sqlite-vec` for ANN search.
- Query embeddings are cached in SQLite to avoid redundant API calls.
- SEC requests use the `User-Agent` from `config.json`.
- Event clustering groups articles by embedding similarity (cosine distance ≤ `config.clustering.distanceThreshold`, default `0.25`) and time proximity (within `config.clustering.windowHours`, default `72`). Articles outside the time window are never grouped together even if embeddings are close.

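The clustering rule in the last note combines two independent checks, which the following sketch makes explicit. It is a simplified pairwise test under assumptions: the real worker may compare against an event centroid or use a different timestamp, and the names `cosineDistance`, `canShareEvent`, `embedding`, and `pubDate` are hypothetical. Only the two default values come from this README.

```javascript
// Cosine distance = 1 - cosine similarity, for two equal-length vectors.
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Defaults documented for config.clustering.
const clusteringDefaults = { distanceThreshold: 0.25, windowHours: 72 };

// Two articles may share an event only if they pass BOTH checks.
// Assumes each article carries an `embedding` array and an ISO-8601 `pubDate`.
function canShareEvent(a, b, cfg = clusteringDefaults) {
  const hoursApart =
    Math.abs(Date.parse(a.pubDate) - Date.parse(b.pubDate)) / 3_600_000;
  if (hoursApart > cfg.windowHours) return false; // time window is a hard limit
  return cosineDistance(a.embedding, b.embedding) <= cfg.distanceThreshold;
}
```

Checking the time window first is cheap and reflects the rule that close embeddings never override it.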