# duriin_api

Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.

## Setup

1. Install dependencies:
```bash
npm install
```
2. Edit `config.json` with your API keys, tickers, RSS feeds, Google News settings, and schedules.
3. Start the server:
```bash
npm start
```

The server listens on the host and port defined in `config.json`.

## How the data pipeline works

On startup the server:

1. Opens the SQLite database.
2. Registers the article and status routes.
3. Starts the HTTP server.
4. Immediately runs all ingestion sources once.
5. Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.

When a new article is inserted:

- the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
- `content` and `image` start as `null`
- full article extraction runs asynchronously after insert
- vector embeddings are generated later, after title, description, and content are all available

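
The insert lifecycle above can be sketched with an in-memory stand-in for the `articles` table. This is only an illustration, not the server's code: `createArchive`, `insertArticle`, `readyForEmbedding`, and the `hasEmbedding` flag are hypothetical names, and the real server writes to SQLite rather than a `Map`.

```javascript
// In-memory stand-in for the `articles` table; the real server uses SQLite.
function createArchive() {
  const byUrl = new Map();
  let nextId = 1;

  // Insert a new article, or return null when the URL is already stored.
  function insertArticle({ title, description, url, source, pubDate = null }) {
    if (byUrl.has(url)) return null; // duplicate URLs are skipped
    const article = {
      id: nextId++,
      title,
      description,
      url,
      source,
      pub_date: pubDate,
      ingested_at: new Date().toISOString(),
      content: null, // filled in later by asynchronous extraction
      image: null, // filled in later by asynchronous extraction
      hasEmbedding: false, // hypothetical flag, set by the embedding backfill
    };
    byUrl.set(url, article);
    return article;
  }

  return { insertArticle };
}

// An article qualifies for embedding once all three text fields are present.
function readyForEmbedding(article) {
  return [article.title, article.description, article.content].every(
    (field) => typeof field === "string" && field.trim() !== ""
  );
}
```

A freshly inserted row fails `readyForEmbedding` until extraction fills in `content`, which mirrors why embeddings lag behind inserts.
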
## API overview

All exposed endpoints are `GET` endpoints.

### `GET /`

Simple health check.

**Response**
```json
{ "ok": true }
```

Use this to confirm the server is running; use `GET /status` to inspect ingestion state.

### `GET /articles`

Returns articles from the `articles` table. Only **usable** articles are exposed: those that have non-empty `content`, a stored embedding, and are not index/category pages. The endpoint's behavior depends on the query params you send.

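
As a rough sketch of the usability rule, the three conditions can be written as a single predicate. The field names `has_embedding` and `is_index_page` are assumptions made for illustration; the API does not expose them, and the server may store this information differently.

```javascript
// Hypothetical row shape: `has_embedding` and `is_index_page` are illustrative
// names, not fields confirmed by the API.
function isUsable(row) {
  const hasContent = typeof row.content === "string" && row.content.trim() !== "";
  return hasContent && row.has_embedding === true && row.is_index_page !== true;
}
```
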
#### Query params

##### `keyword`

Plain keyword search.

- matches `title`, `description`, and `content`
- uses SQL `LIKE`
- works like substring matching, not semantic search
- best when you want literal words or phrases to appear in the article text

Example:
```http
GET /articles?keyword=earnings
```

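
One way such a `LIKE` filter could be assembled is shown below. The server's actual query is not documented here, so treat this as a sketch: `buildKeywordClause` is a hypothetical helper, and escaping `%` and `_` so they match literally is an assumption rather than confirmed behavior.

```javascript
// Build a parameterized WHERE fragment for substring matching.
// Assumption: `%`, `_`, and `\` in the keyword are escaped so they match literally.
function buildKeywordClause(keyword) {
  const escaped = keyword.replace(/[\\%_]/g, (ch) => `\\${ch}`);
  const pattern = `%${escaped}%`; // wildcards on both sides give substring matching
  const columns = ["title", "description", "content"];
  const sql = columns.map((col) => `${col} LIKE ? ESCAPE '\\'`).join(" OR ");
  return { sql: `(${sql})`, params: columns.map(() => pattern) };
}
```

Note that SQLite's `LIKE` is case-insensitive for ASCII characters by default, so `earnings` would also match `Earnings`.
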
##### `source`

Exact match on the stored `source` field.

Example:
```http
GET /articles?source=rss
```

##### `from`

Only returns rows where `pub_date >= from`.

Example:
```http
GET /articles?from=2025-01-01T00:00:00.000Z
```

##### `to`

Only returns rows where `pub_date <= to`.

Example:
```http
GET /articles?to=2025-01-31T23:59:59.999Z
```

##### `limit`

Number of rows to return.

- default: `20`
- max: `100`

Example:
```http
GET /articles?limit=10
```

##### `offset`

Pagination offset.

- default: `0`

Example:
```http
GET /articles?limit=10&offset=20
```

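
On the client side, `limit` and `offset` are often derived from a page number. The helper below is a minimal sketch under stated assumptions: `normalizeLimit` and `pageParams` are hypothetical names, and clamping an out-of-range `limit` (rather than rejecting it) is an assumption about what is convenient for a client, not a statement about how the server treats such values.

```javascript
const DEFAULT_LIMIT = 20;
const MAX_LIMIT = 100;

// Assumption: out-of-range limits are clamped rather than rejected.
function normalizeLimit(raw) {
  const n = Number.parseInt(raw, 10);
  if (!Number.isInteger(n) || n < 1) return DEFAULT_LIMIT;
  return Math.min(n, MAX_LIMIT);
}

// Convert a 1-based page number into `limit`/`offset` query params.
function pageParams(page, pageSize = DEFAULT_LIMIT) {
  const limit = normalizeLimit(pageSize);
  const safePage = Math.max(1, Math.trunc(page) || 1);
  return { limit, offset: (safePage - 1) * limit };
}
```

For instance, `pageParams(3, 10)` produces the `limit=10&offset=20` pair used in the example above.
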
##### `similar_to_article`

Runs vector similarity search instead of normal list mode.

- value must be an existing article ID
- the server looks up that article's embedding
- nearest-neighbor search runs in `sqlite-vec`
- the source article is excluded from the result set
- each result includes a `distance` field
- lower `distance` means more similar
- returns `404` if the article has no stored embedding

Example:
```http
GET /articles?similar_to_article=123&limit=5
```

Not found response:
```json
{ "error": "Embedding not found for article" }
```

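
The lookup described above can be illustrated with a brute-force nearest-neighbor search in plain JavaScript. This is a sketch of the idea only: the server delegates the search to `sqlite-vec`, the `embeddings` map and the `similarTo` function are hypothetical, and Euclidean distance is an assumed metric since the configured one is not stated.

```javascript
// Euclidean (L2) distance; the metric actually configured is not stated.
function l2Distance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// `embeddings` is a Map of article ID -> number[] (hypothetical in-memory index).
function similarTo(embeddings, articleId, limit = 20) {
  const source = embeddings.get(articleId);
  if (!source) {
    return { status: 404, body: { error: "Embedding not found for article" } };
  }
  const results = [...embeddings]
    .filter(([id]) => id !== articleId) // exclude the source article
    .map(([id, vector]) => ({ id, distance: l2Distance(source, vector) }))
    .sort((x, y) => x.distance - y.distance) // lower distance first
    .slice(0, limit);
  return { status: 200, body: results };
}
```
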
##### `semantic`

Semantic search by meaning, not exact wording.

- use this when you want conceptually related results
- unlike `keyword`, the words do not need to appear literally in the article text
- the query text is normalized before embedding
- query embeddings are cached in SQLite
- on cache miss, the server requests an embedding from OpenRouter
- nearest article matches are returned from the embedding index
- each result includes a `distance` field
- lower `distance` means a closer semantic match
- returns `400` if `semantic` is empty

Example (spaces in the query must be URL-encoded):
```http
GET /articles?semantic=ai%20chip%20demand&limit=10
```

Bad request response:
```json
{ "error": "Semantic query must not be empty" }
```

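
The normalize-then-cache flow can be sketched as follows. Several details here are assumptions: the normalization rules (trim, lowercase, collapse whitespace) are a guess at what "normalized" means, `fetchEmbedding` is a hypothetical async function standing in for the OpenRouter request, and a `Map` stands in for the SQLite cache table.

```javascript
// Assumed normalization: trim, lowercase, collapse whitespace. The server's
// actual rules are not documented here.
function normalizeQuery(query) {
  return String(query).trim().toLowerCase().replace(/\s+/g, " ");
}

// `fetchEmbedding` is a hypothetical async function (text -> number[]) that
// stands in for the OpenRouter request; the Map stands in for the SQLite cache.
function createQueryEmbedder(fetchEmbedding) {
  const cache = new Map();
  return function embedQuery(query) {
    const key = normalizeQuery(query);
    if (key === "") {
      return Promise.reject(new Error("Semantic query must not be empty"));
    }
    if (!cache.has(key)) {
      const pending = Promise.resolve(fetchEmbedding(key));
      pending.catch(() => cache.delete(key)); // do not cache failures
      cache.set(key, pending);
    }
    return cache.get(key);
  };
}
```

Because the cache key is the normalized text, queries that differ only in case or spacing share one embedding request.
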
##### `include_embedding`

Explicitly rejected on `/articles`.

Response:
```json
{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
```

#### General behavior

- If `semantic` is present, semantic search is used.
- Else if `similar_to_article` is present, similarity search is used.
- Otherwise normal list/search mode is used.
- `keyword` is literal substring matching (SQL `LIKE`), not full-text search; `semantic` matches by meaning.
- Normal list/search results are ordered by `COALESCE(pub_date, ingested_at) DESC, id DESC`.
- `from` and `to` are compared against stored publication timestamps, so ISO-8601 values are the safest input.
- `source` must match the stored source name exactly.

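
The mode precedence and the list ordering above can be expressed in a few lines. Both functions are hypothetical illustrations rather than the server's code. The comparator assumes timestamps are ISO-8601 UTC strings in one consistent format, so that string comparison matches chronological order.

```javascript
// Pick the search mode using the precedence rules: semantic, then similarity,
// then plain list mode.
function selectMode(query) {
  if (query.semantic !== undefined) return "semantic";
  if (query.similar_to_article !== undefined) return "similar";
  return "list";
}

// JavaScript equivalent of `COALESCE(pub_date, ingested_at) DESC, id DESC`.
function compareArticles(a, b) {
  const keyA = a.pub_date ?? a.ingested_at;
  const keyB = b.pub_date ?? b.ingested_at;
  if (keyA !== keyB) return keyA < keyB ? 1 : -1; // newer timestamp first
  return b.id - a.id; // higher id first on ties
}
```

Under these rules a request that sends both `semantic` and `similar_to_article` runs in semantic mode.
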
#### Normal list/search response shape

```json
[
  {
    "id": 123,
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss",
    "pub_date": "2025-01-01T12:34:56.000Z",
    "ingested_at": "2025-01-01T12:35:10.000Z"
  }
]
```

#### Similarity/semantic search response shape

```json
[
  {
    "id": 456,
    "title": "...",
    "description": "...",
    "content": "...",
    "image": "...",
    "url": "...",
    "normalized_title": "...",
    "source": "rss",
    "pub_date": "2025-01-02T09:00:00.000Z",
    "ingested_at": "2025-01-02T09:00:10.000Z",
    "distance": 0.1234
  }
]
```

#### Combined example

```http
GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
```

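
When building such URLs from code, the standard `URL` and `URLSearchParams` APIs take care of the encoding. The helper below is a small sketch: `buildArticlesUrl` is a hypothetical name, and `http://localhost:3000` in the usage note is only a placeholder for whatever host and port `config.json` defines.

```javascript
// `baseUrl` is whatever host/port `config.json` defines; the value in the
// usage note is a placeholder.
function buildArticlesUrl(baseUrl, params = {}) {
  const url = new URL("/articles", baseUrl);
  for (const [name, value] of Object.entries(params)) {
    if (value === undefined || value === null) continue; // omit unset params
    // Dates are sent as ISO-8601 strings, the format recommended for from/to.
    const text = value instanceof Date ? value.toISOString() : String(value);
    url.searchParams.set(name, text);
  }
  return url.toString();
}
```

Calling `buildArticlesUrl("http://localhost:3000", { semantic: "ai chip demand", limit: 10 })` encodes the spaces in the query, so a multi-word `semantic` value is safe to send.
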
### `GET /articles/:id`

Returns one article by numeric ID.

**Behavior**

- Looks up the article directly in SQLite.
- Applies the same usability filter as the list endpoint: returns `404` if the article exists but is not usable.
- Returns the same article fields as normal `/articles` list mode.
- Does not return embedding data.
- Returns `404` if the ID does not exist.

**Example**
```http
GET /articles/123
```

**Not found response**
```json
{ "error": "Article not found" }
```

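
A client can treat both `404` cases (unknown ID, or an article that is not yet usable) the same way. The sketch below assumes Node.js 18 or later for the global `fetch`; `getArticle` and its `fetchImpl` parameter are hypothetical, with `fetchImpl` included only so the function can be exercised without a running server.

```javascript
// Fetch one article; resolves to null when the server answers 404.
async function getArticle(baseUrl, id, fetchImpl = fetch) {
  const url = new URL(`/articles/${encodeURIComponent(id)}`, baseUrl);
  const response = await fetchImpl(url);
  if (response.status === 404) return null; // missing or not usable
  if (!response.ok) throw new Error(`Unexpected status ${response.status}`);
  return response.json();
}
```

Since an article can become usable later, once extraction and embedding finish, a `null` result for a known ID may be worth retrying.
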
### `GET /status`

Returns ingestion and archive summary information.

**Response fields**

- `total`: total number of rows in `articles` across all sources
- `usable`: articles that have content, an embedding, and are not index pages
- `lastIngestionBySource`: in-memory timestamps of the last successful batch run per source
- `bySource`: per-source breakdown, each with `total` and `usable` counts

**Important detail**

`lastIngestionBySource` is kept in memory, so it resets when the process restarts.

**Example response**
```json
{
  "total": 10234,
  "usable": 8700,
  "lastIngestionBySource": {
    "rss": "2025-01-02T10:00:00.000Z",
    "gdelt": "2025-01-02T10:05:00.000Z"
  },
  "bySource": {
    "alphavantage": { "total": 120, "usable": 98 },
    "edgar": { "total": 88, "usable": 70 },
    "finnhub": { "total": 400, "usable": 360 },
    "gdelt": { "total": 2100, "usable": 1800 },
    "rss": { "total": 7526, "usable": 6372 }
  }
}
```

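
The gap between `total` and `usable` shows how many rows are still waiting for content or embeddings (or were filtered out as index pages). A small sketch of deriving that from a `/status` response follows; `summarizeStatus` is a hypothetical helper, not part of the API.

```javascript
// Derive per-source backlog and usable share from a `/status` response body.
function summarizeStatus(status) {
  const perSource = {};
  for (const [source, counts] of Object.entries(status.bySource)) {
    perSource[source] = {
      pending: counts.total - counts.usable, // rows not exposed by /articles
      usableShare: counts.total === 0 ? 0 : counts.usable / counts.total,
    };
  }
  return { pending: status.total - status.usable, perSource };
}
```

Applied to the example response above, this reports 1534 rows overall that are not yet usable, 1154 of them from `rss`.
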
## Article field notes

- `image` stores the extracted main image as ultra-compressed base64 WebP.
- `normalized_title` is stored for matching and indexing.
- `source` may be a shared source like `rss`, `googlenews`, `gdelt`, `edgar`, `alphavantage`, or `finnhub`.
- `pub_date` is normalized to ISO-8601 when it can be parsed.
- `ingested_at` is the insert timestamp set by the server.

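
To display the `image` field in a browser, it can be wrapped in a data URI. This sketch assumes the field holds raw base64 without a `data:` prefix, which the notes above do not state explicitly; `imageDataUri` is a hypothetical helper.

```javascript
// Assumption: `image` holds raw base64 without a `data:` prefix.
function imageDataUri(article) {
  if (!article.image) return null; // extraction has not produced an image
  return `data:image/webp;base64,${article.image}`;
}
```
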
## Notes

- SQLite archive file defaults to `./archive.sqlite`.
- Deduplication is enforced on `url`: duplicate URLs are skipped rather than inserted again. Normalized titles are stored and indexed for matching but are not unique.
- `googleNews` accepts `queries`, `topics`, `language`, and `country`, and resolves Google redirect URLs to publisher URLs before ingestion.
- Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content.
- Embeddings are generated asynchronously with OpenRouter `perplexity/pplx-embed-v1-0.6b` and indexed in `sqlite-vec` for similarity search.
- Semantic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss.
- SEC requests use the configured `User-Agent`.