enhance article query capabilities by supporting multiple keywords and dynamic ordering

ImBenji 2026-04-21 11:42:21 +01:00
parent 8805d3a3fc
commit cb819e77ee
2 changed files with 134 additions and 240 deletions

301
README.md

@ -1,6 +1,6 @@
# duriin_api # duriin_api
Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive. Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
## Setup ## Setup
@ -8,7 +8,7 @@ Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC
```bash ```bash
npm install npm install
``` ```
2. Edit `config.json` with your API keys, tickers, RSS feeds, Google News settings, and schedules. 2. Edit `config.json` with your API keys, tickers, and schedules.
3. Start the server: 3. Start the server:
```bash ```bash
npm start npm start
@ -20,172 +20,70 @@ The server listens on the host and port defined in `config.json`.
On startup the server: On startup the server:
1. Opens the SQLite database. 1. Opens the SQLite database and runs any pending migrations.
2. Registers the article and status routes. 2. Registers routes.
3. Starts the HTTP server. 3. Starts the HTTP server.
4. Immediately runs all ingestion sources once. 4. Launches continuous background loops for each source, content backfill, and embedding backfill.
5. Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.
When a new article is inserted: When a new article is inserted:
- the record is written immediately with `title`, `description`, `url`, `source`, and timestamps - the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
- `content` and `image` start as `null` - `content` starts as `null`
- full article extraction runs asynchronously after insert - content backfill workers pick it up asynchronously — plain HTTP first, Playwright fallback for JS-heavy sites
- vector embeddings are generated later, after title, description, and content are all available - vector embeddings are generated after title, description, and content are all available
- only articles with content + embedding are exposed via the API
Content backfill prioritises recent articles (`pub_date_effective DESC`) so newest content surfaces first regardless of ingestion order.
Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.
## API overview ## API overview
All exposed endpoints are `GET` endpoints. All endpoints are `GET`.
### `GET /` ### `GET /`
Simple health check. Health check. Returns `{ "ok": true }`.
**Response**
```json
{ "ok": true }
```
Use this to confirm the server is running, not to inspect ingestion state.
### `GET /articles` ### `GET /articles`
Returns articles from the `articles` table. Only articles that are considered **usable** are exposed — meaning they have non-empty `content`, a stored embedding, and are not index/category pages. Behavior changes based on the query params you send. Returns usable articles — non-empty `content`, stored embedding, not an index/category page.
#### Query params #### Query params
##### `keyword` | Param | Description |
|---|---|
| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords — e.g. `keyword=bitcoin&keyword=ethereum` |
| `keyword_mode` | How multiple keywords are combined — `and` (default) or `or` |
| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
| `from` | `pub_date >= from` (ISO-8601) |
| `to` | `pub_date <= to` (ISO-8601) |
| `limit` | Rows to return. Default `20`, max `100` |
| `offset` | Pagination offset. Default `0` |
| `order` | Sort order — see below. Not applied to `semantic` or `similar_to_article` results (those are sorted by distance) |
| `semantic` | Semantic search by meaning via embedding similarity |
| `similar_to_article` | Vector similarity search using another article's embedding |
Plain keyword search. #### `order` values
- matches `title`, `description`, and `content` | Value | Sort |
- uses SQL `LIKE` |---|---|
- works like substring matching, not semantic search | `newest` | `pub_date_effective DESC` (default) |
- best when you want literal words or phrases to appear in the article text | `oldest` | `pub_date_effective ASC` |
| `ingested_newest` | `ingested_at DESC` |
| `ingested_oldest` | `ingested_at ASC` |
Example: #### Search modes
```http
GET /articles?keyword=earnings
```
##### `source` - If `semantic` is present — semantic nearest-neighbor search. Query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
- Else if `similar_to_article` is present — finds articles similar to the given article ID. Returns `404` if that article has no embedding.
- Otherwise — normal filtered list mode. All params apply.
Exact match on the stored `source` field. `keyword`, `source`, `from`, and `to` also work as post-filters on `semantic` and `similar_to_article` results.
Example: `include_embedding` is explicitly rejected on this endpoint.
```http
GET /articles?source=rss
```
##### `from` #### Response shape
Only returns rows where `pub_date >= from`.
Example:
```http
GET /articles?from=2025-01-01T00:00:00.000Z
```
##### `to`
Only returns rows where `pub_date <= to`.
Example:
```http
GET /articles?to=2025-01-31T23:59:59.999Z
```
##### `limit`
Number of rows to return.
- default: `20`
- max: `100`
Example:
```http
GET /articles?limit=10
```
##### `offset`
Pagination offset.
- default: `0`
Example:
```http
GET /articles?limit=10&offset=20
```
##### `similar_to_article`
Runs vector similarity search instead of normal list mode.
- value must be an existing article ID
- the server looks up that article's embedding
- nearest-neighbor search runs in `sqlite-vec`
- the source article is excluded from the result set
- each result includes a `distance` field
- lower `distance` means more similar
- returns `404` if the article has no stored embedding
Example:
```http
GET /articles?similar_to_article=123&limit=5
```
Not found response:
```json
{ "error": "Embedding not found for article" }
```
##### `semantic`
Semantic search by meaning, not exact wording.
- use this when you want conceptually related results
- unlike `keyword`, the words do not need to appear literally in the article text
- the query text is normalized before embedding
- query embeddings are cached in SQLite
- on cache miss, the server requests an embedding from OpenRouter
- nearest article matches are returned from the embedding index
- each result includes a `distance` field
- lower `distance` means a closer semantic match
- returns `400` if `semantic` is empty
Example:
```http
GET /articles?semantic=ai chip demand&limit=10
```
Bad request response:
```json
{ "error": "Semantic query must not be empty" }
```
##### `include_embedding`
Explicitly rejected on `/articles`.
Response:
```json
{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
```
#### General behavior
- If `semantic` is present, semantic search is used.
- Else if `similar_to_article` is present, similarity search is used.
- Otherwise normal list/search mode is used.
- `keyword` is literal keyword matching.
- `semantic` is semantic matching by meaning.
- Normal list/search results are ordered by `COALESCE(pub_date, ingested_at) DESC, id DESC`.
- `from` and `to` are compared against stored publication timestamps, so ISO-8601 values are the safest input.
- `source` must match the stored source name exactly.
- `keyword` is substring matching, not full-text search.
#### Normal list/search response shape
```json ```json
[ [
@ -194,113 +92,60 @@ Response:
"title": "...", "title": "...",
"description": "...", "description": "...",
"content": "...", "content": "...",
"image": "...",
"url": "...", "url": "...",
"normalized_title": "...", "normalized_title": "...",
"source": "rss", "source": "rss:BBC",
"pub_date": "2025-01-01T12:34:56.000Z", "pub_date": "2025-01-01T12:34:56.000Z",
"ingested_at": "2025-01-01T12:35:10.000Z" "ingested_at": "2025-01-01T12:35:10.000Z"
} }
] ]
``` ```
#### Similarity/topic search response shape Semantic and similarity results also include `"distance": 0.1234`.
```json
[
{
"id": 456,
"title": "...",
"description": "...",
"content": "...",
"image": "...",
"url": "...",
"normalized_title": "...",
"source": "rss",
"pub_date": "2025-01-02T09:00:00.000Z",
"ingested_at": "2025-01-02T09:00:10.000Z",
"distance": 0.1234
}
]
```
#### Combined example
```http
GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
```
### `GET /articles/:id` ### `GET /articles/:id`
Returns one article by numeric ID. Returns one article by numeric ID. Same usability filter as the list endpoint — returns `404` if the article exists but has no content or embedding.
**Behavior**
- Looks up the article directly in SQLite.
- Same usability filter as the list endpoint — returns `404` if the article exists but is not usable.
- Returns the same article fields as normal `/articles` list mode.
- Does not return embedding data.
- Returns `404` if the ID does not exist.
**Example**
```http
GET /articles/123
```
**Not found response**
```json
{ "error": "Article not found" }
```
### `GET /status` ### `GET /status`
Returns ingestion and archive summary information. Returns archive summary. Cached for 30 seconds.
**Response fields** **Response fields**
- `total`: total number of rows in `articles` across all sources - `total` — total rows across all sources
- `usable`: articles that have content, an embedding, and are not index pages - `usable` — articles with content + embedding, not index pages
- `lastIngestionBySource`: in-memory timestamps of the last successful batch run per source - `lastIngestionBySource` — in-memory timestamps of the last successful batch per source (resets on restart)
- `bySource`: per-source breakdown, each with `total` and `usable` counts - `bySource` — per-source `{ total, usable }`
- `embeddingModels` — active embedding models with article count and detected dimensions
**Important detail** ### `GET /sources`
`lastIngestionBySource` is kept in memory, so it resets when the process restarts. Returns the full source catalog from `sources.json` enriched with live DB stats.
**Example response** **Per-source fields**
```json
{ - `id`, `label`, `websites`, `backfill`, `feeds` — from `sources.json` (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
"total": 10234, - `counts` — aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
"usable": 8700, - `byFeed` — same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
"lastIngestionBySource": { - `domains` — current domain fetch policy per website: `policy` (auto / browser_only / blocked), failure/success counts, `expiresAt`
"rss": "2025-01-02T10:00:00.000Z",
"gdelt": "2025-01-02T10:05:00.000Z" Use `domains[].policy` to diagnose why a source has high `skipped` or `failed` counts — `blocked` means backfill has given up on that domain temporarily.
},
"bySource": {
"alphavantage": { "total": 120, "usable": 98 },
"edgar": { "total": 88, "usable": 70 },
"finnhub": { "total": 400, "usable": 360 },
"gdelt": { "total": 2100, "usable": 1800 },
"rss": { "total": 7526, "usable": 6372 }
}
}
```
## Article field notes ## Article field notes
- `image` stores the extracted main image as ultra-compressed base64 WebP. - `pub_date` is normalized to ISO-8601 when parseable; `null` otherwise.
- `normalized_title` is stored for matching and indexing. - `pub_date_effective` is `COALESCE(pub_date, ingested_at)` — used for sorting.
- `source` may be a shared source like `rss`, `googlenews`, `gdelt`, `edgar`, `alphavantage`, or `finnhub`. - `ingested_at` is the server-side insert timestamp.
- `pub_date` is normalized to ISO-8601 when it can be parsed. - `normalized_title` is stored for deduplication and indexing.
- `ingested_at` is the insert timestamp set by the server. - `source` format is `<feed_type>:<label>` for GDELT and RSS (e.g. `gdelt:Bloomberg Markets`, `rss:TechCrunch`), or just the source name for other feeds (`alphavantage`, `edgar`, `finnhub`).
## Notes ## Notes
- SQLite archive file defaults to `./archive.sqlite`. - SQLite archive defaults to `./archive.sqlite`.
- Deduplication is enforced on `url`; normalized titles are stored and indexed for matching but are not unique. - Deduplication is enforced on `url`.
- `googleNews` accepts `queries`, `topics`, `language`, and `country`, and resolves Google redirect URLs to publisher URLs before ingestion. - GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
- Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content. - Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
- Embeddings are generated asynchronously with OpenRouter `perplexity/pplx-embed-v1-0.6b` and indexed in `sqlite-vec` for similarity search. - Embeddings use OpenRouter and are indexed in `sqlite-vec` for ANN search.
- Topic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss. - Query embeddings are cached in SQLite to avoid redundant API calls.
- SEC requests use the configured `User-Agent`. - SEC requests use the `User-Agent` from `config.json`.
- Duplicate URLs are skipped rather than inserted again.
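
As a usage sketch for the `keyword`, `keyword_mode`, and `order` parameters documented in the README changes above, the snippet below builds a request URL with a repeated `keyword` parameter. The helper name `buildArticlesUrl` and the base URL `http://localhost:3000` are assumptions for illustration; the real host and port come from `config.json`, and the sketch assumes the server's query-string parser turns repeated keys into an array.

```javascript
// Sketch: build a /articles URL with repeated `keyword` params.
// ASSUMPTION: `buildArticlesUrl` and the base URL are hypothetical, not part of the repo.
function buildArticlesUrl(baseUrl, { keywords = [], keywordMode, order, limit } = {}) {
  const url = new URL('/articles', baseUrl);
  // append() (not set()) keeps every keyword as its own `keyword=` pair
  for (const kw of keywords) {
    url.searchParams.append('keyword', kw);
  }
  if (keywordMode) url.searchParams.set('keyword_mode', keywordMode);
  if (order) url.searchParams.set('order', order);
  if (limit !== undefined) url.searchParams.set('limit', String(limit));
  return url.toString();
}

const exampleUrl = buildArticlesUrl('http://localhost:3000', {
  keywords: ['bitcoin', 'ethereum'],
  keywordMode: 'or',
  order: 'oldest',
  limit: 10,
});
console.log(exampleUrl);
// http://localhost:3000/articles?keyword=bitcoin&keyword=ethereum&keyword_mode=or&order=oldest&limit=10
```

With `keyword_mode=or` an article matching either keyword is returned; omitting it falls back to `and`, per the parameter table.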


@ -12,9 +12,15 @@ function buildArticlesQuery(query) {
const includeEmbedding = String(query.include_embedding || '').toLowerCase() === 'true'; const includeEmbedding = String(query.include_embedding || '').toLowerCase() === 'true';
if (query.keyword) { if (query.keyword) {
conditions.push('(title LIKE ? OR description LIKE ? OR content LIKE ?)'); const keywords = [].concat(query.keyword).map((k) => k.trim()).filter(Boolean);
const keyword = `%${query.keyword}%`; const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
params.push(keyword, keyword, keyword); const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');
conditions.push(`(${clauses.join(` ${mode} `)})`);
for (const kw of keywords) {
const like = `%${kw}%`;
params.push(like, like, like);
}
} }
if (query.source) { if (query.source) {
@ -36,6 +42,14 @@ function buildArticlesQuery(query) {
conditions.push('is_index_page = 0'); conditions.push('is_index_page = 0');
conditions.push('has_embedding = 1'); conditions.push('has_embedding = 1');
const ORDERS = {
newest: 'pub_date_effective DESC, id DESC',
oldest: 'pub_date_effective ASC, id ASC',
ingested_newest: 'ingested_at DESC, id DESC',
ingested_oldest: 'ingested_at ASC, id ASC',
};
const orderBy = ORDERS[query.order] || ORDERS.newest;
const whereClause = `WHERE ${conditions.join(' AND ')}`; const whereClause = `WHERE ${conditions.join(' AND ')}`;
const limit = Number.parseInt(query.limit, 10); const limit = Number.parseInt(query.limit, 10);
const offset = Number.parseInt(query.offset, 10); const offset = Number.parseInt(query.offset, 10);
@ -48,7 +62,7 @@ function buildArticlesQuery(query) {
SELECT id, title, description, content, ${includeEmbedding ? 'embedding,' : ''} url, normalized_title, source, pub_date, ingested_at SELECT id, title, description, content, ${includeEmbedding ? 'embedding,' : ''} url, normalized_title, source, pub_date, ingested_at
FROM articles FROM articles
${whereClause} ${whereClause}
ORDER BY pub_date_effective DESC, id DESC ORDER BY ${orderBy}
LIMIT ? OFFSET ? LIMIT ? OFFSET ?
`, `,
params, params,
@ -64,23 +78,58 @@ function shouldExcludeIndexPages(query) {
return String(query.exclude_index_pages || '').toLowerCase() !== 'false'; return String(query.exclude_index_pages || '').toLowerCase() !== 'false';
} }
function mapNeighborsToArticles(neighbors, excludeIndexPages, limit) { function mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query = {}) {
const ids = neighbors.map((row) => row.articleId); const ids = neighbors.map((row) => row.articleId);
if (ids.length === 0) { if (ids.length === 0) {
return []; return [];
} }
const placeholders = ids.map(() => '?').join(', '); const placeholders = ids.map(() => '?').join(', ');
const conditions = [];
const params = [...ids];
conditions.push(`id IN (${placeholders})`);
conditions.push("content IS NOT NULL AND content != ''");
conditions.push('has_embedding = 1');
if (excludeIndexPages) conditions.push('is_index_page = 0');
if (query.source) {
conditions.push('source = ?');
params.push(query.source);
}
if (query.from) {
conditions.push('pub_date >= ?');
params.push(query.from);
}
if (query.to) {
conditions.push('pub_date <= ?');
params.push(query.to);
}
if (query.keyword) {
const keywords = [].concat(query.keyword).map((k) => k.trim()).filter(Boolean);
const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');
conditions.push(`(${clauses.join(` ${mode} `)})`);
for (const kw of keywords) {
const like = `%${kw}%`;
params.push(like, like, like);
}
}
const articles = db.prepare(` const articles = db.prepare(`
SELECT id, title, description, content, url, normalized_title, source, pub_date, ingested_at SELECT id, title, description, content, url, normalized_title, source, pub_date, ingested_at
FROM articles FROM articles
WHERE id IN (${placeholders}) WHERE ${conditions.join(' AND ')}
AND content IS NOT NULL AND content != '' `).all(...params);
AND has_embedding = 1
${excludeIndexPages ? 'AND is_index_page = 0' : ''}
`).all(...ids);
const byId = new Map(articles.map((article) => [article.id, article])); const byId = new Map(articles.map((article) => [article.id, article]));
// preserve distance ordering from the vector search
return neighbors return neighbors
.map((row) => { .map((row) => {
const article = byId.get(row.articleId); const article = byId.get(row.articleId);
@ -113,7 +162,7 @@ async function articleRoutes(fastify) {
Math.min(limit * 5, 500) Math.min(limit * 5, 500)
); );
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit); return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
} }
if (query.similar_to_article) { if (query.similar_to_article) {
@ -130,7 +179,7 @@ async function articleRoutes(fastify) {
return { error: 'Embedding not found for article' }; return { error: 'Embedding not found for article' };
} }
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit); return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
} }
const { sql, params } = buildArticlesQuery(query); const { sql, params } = buildArticlesQuery(query);
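
Two edge cases in the changes above may be worth guarding, assuming the route passes the raw query object through without schema validation (the diff does not show any). First, a `keyword` made only of whitespace is truthy but yields an empty keyword list after `trim()` and `filter(Boolean)`, so the pushed condition becomes `()`, which SQLite rejects as a syntax error. Second, `ORDERS[query.order]` also resolves inherited keys such as `constructor`, which would interpolate a non-SQL value into `ORDER BY`. The sketch below shows one possible guard for each; the helper names `resolveOrder` and `buildKeywordCondition` are hypothetical and not part of the commit.

```javascript
// Sketch only: helper names are hypothetical, not part of the commit.

// Whitelist copied from the diff above.
const ORDERS = {
  newest: 'pub_date_effective DESC, id DESC',
  oldest: 'pub_date_effective ASC, id ASC',
  ingested_newest: 'ingested_at DESC, id DESC',
  ingested_oldest: 'ingested_at ASC, id ASC',
};

// Own-property check: inherited keys like "constructor" fall back to the default.
function resolveOrder(order) {
  return Object.prototype.hasOwnProperty.call(ORDERS, order)
    ? ORDERS[order]
    : ORDERS.newest;
}

// Returns null when no usable keyword remains, so the caller can skip the condition
// instead of pushing an empty "()" into the WHERE clause.
function buildKeywordCondition(rawKeyword, rawMode) {
  const keywords = [].concat(rawKeyword ?? [])
    .map((k) => String(k).trim())
    .filter(Boolean);
  if (keywords.length === 0) return null;

  const mode = String(rawMode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
  const clause = '(title LIKE ? OR description LIKE ? OR content LIKE ?)';
  const sql = `(${keywords.map(() => clause).join(` ${mode} `)})`;
  // Three bound parameters per keyword: title, description, content.
  const params = keywords.flatMap((kw) => {
    const like = `%${kw}%`;
    return [like, like, like];
  });
  return { sql, params };
}
```

Because the keyword logic appears in both `buildArticlesQuery` and `mapNeighborsToArticles`, a shared helper along these lines would also remove the duplication. Note that `%` and `_` inside a keyword still act as `LIKE` wildcards in this sketch, as they do in the committed code.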