enhance article query capabilities by supporting multiple keywords and dynamic ordering
parent 8805d3a3fc
commit cb819e77ee
2 changed files with 134 additions and 240 deletions
301
README.md
|
|
@ -1,6 +1,6 @@
|
|||
# duriin_api
|
||||
|
||||
Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
|
||||
Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
|
||||
|
||||
## Setup
|
||||
|
||||
|
|
@ -8,7 +8,7 @@ Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC
|
|||
```bash
|
||||
npm install
|
||||
```
|
||||
2. Edit `config.json` with your API keys, tickers, RSS feeds, Google News settings, and schedules.
|
||||
2. Edit `config.json` with your API keys, tickers, and schedules.
|
||||
3. Start the server:
|
||||
```bash
|
||||
npm start
|
||||
|
|
@ -20,172 +20,70 @@ The server listens on the host and port defined in `config.json`.
|
|||
|
||||
On startup the server:
|
||||
|
||||
1. Opens the SQLite database.
|
||||
2. Registers the article and status routes.
|
||||
1. Opens the SQLite database and runs any pending migrations.
|
||||
2. Registers routes.
|
||||
3. Starts the HTTP server.
|
||||
4. Immediately runs all ingestion sources once.
|
||||
5. Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.
|
||||
4. Launches continuous background loops for each source, content backfill, and embedding backfill.
|
||||
|
||||
When a new article is inserted:
|
||||
|
||||
- the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
|
||||
- `content` and `image` start as `null`
|
||||
- full article extraction runs asynchronously after insert
|
||||
- vector embeddings are generated later, after title, description, and content are all available
|
||||
- `content` starts as `null`
- content backfill workers pick it up asynchronously: plain HTTP first, Playwright fallback for JS-heavy sites
- vector embeddings are generated after title, description, and content are all available
- only articles with content + embedding are exposed via the API
|
||||
|
||||
Content backfill prioritises recent articles (`pub_date_effective DESC`) so newest content surfaces first regardless of ingestion order.
|
||||
|
||||
Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.
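
The policy escalation described above could look roughly like the sketch below. It is only an illustration: the thresholds, the block duration, the state shape, and the `recordFailure` helper are assumptions, not values taken from this repository. Only the policy names (`auto`, `browser_only`, `blocked`) come from the README.

```javascript
// Hypothetical thresholds; the real values are not stated in this README.
const PLAIN_FAILURE_LIMIT = 3;
const BROWSER_FAILURE_LIMIT = 3;
const BLOCK_DURATION_MS = 60 * 60 * 1000; // one hour, an assumed value

// Returns a new domain state after a failed fetch attempt.
// `fetcher` is 'plain' or 'browser'; `now` is a millisecond timestamp.
function recordFailure(state, fetcher, now = Date.now()) {
  const next = { ...state };
  if (fetcher === 'plain') next.plainFailures += 1;
  else next.browserFailures += 1;

  const plainExhausted = next.plainFailures >= PLAIN_FAILURE_LIMIT;
  const browserExhausted = next.browserFailures >= BROWSER_FAILURE_LIMIT;

  if (plainExhausted && browserExhausted) {
    // Both fetchers keep failing: give up on the domain for a while.
    next.policy = 'blocked';
    next.expiresAt = now + BLOCK_DURATION_MS;
  } else if (plainExhausted) {
    // Plain HTTP keeps failing: skip it and go straight to the browser.
    next.policy = 'browser_only';
  }
  return next;
}

let state = { policy: 'auto', plainFailures: 0, browserFailures: 0, expiresAt: null };
for (let i = 0; i < 3; i += 1) state = recordFailure(state, 'plain', 0);
const afterPlainFailures = state.policy; // 'browser_only'
for (let i = 0; i < 3; i += 1) state = recordFailure(state, 'browser', 0);
console.log(afterPlainFailures, state.policy, state.expiresAt);
```

A real tracker would presumably also reset counters on success and clear `blocked` once `expiresAt` has passed; both are left out here to keep the sketch short.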
|
||||
|
||||
## API overview
|
||||
|
||||
All exposed endpoints are `GET` endpoints.
|
||||
All endpoints are `GET`.
|
||||
|
||||
### `GET /`
|
||||
|
||||
Simple health check.
|
||||
|
||||
**Response**
|
||||
```json
|
||||
{ "ok": true }
|
||||
```
|
||||
|
||||
Use this to confirm the server is running, not to inspect ingestion state.
|
||||
Health check. Returns `{ "ok": true }`.
|
||||
|
||||
### `GET /articles`
|
||||
|
||||
Returns articles from the `articles` table. Only articles that are considered **usable** are exposed — meaning they have non-empty `content`, a stored embedding, and are not index/category pages. Behavior changes based on the query params you send.
|
||||
Returns usable articles — non-empty `content`, stored embedding, not an index/category page.
|
||||
|
||||
#### Query params
|
||||
|
||||
##### `keyword`
|
||||
| Param | Description |
|---|---|
| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords, e.g. `keyword=bitcoin&keyword=ethereum` |
| `keyword_mode` | How multiple keywords are combined: `and` (default) or `or` |
| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
| `from` | `pub_date >= from` (ISO-8601) |
| `to` | `pub_date <= to` (ISO-8601) |
| `limit` | Rows to return. Default `20`, max `100` |
| `offset` | Pagination offset. Default `0` |
| `order` | Sort order, see below. Not applied to `semantic` or `similar_to_article` results (those are sorted by distance) |
| `semantic` | Semantic search by meaning via embedding similarity |
| `similar_to_article` | Vector similarity search using another article's embedding |
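
On the client side, repeated `keyword` params can be produced with the standard `URLSearchParams` API. The helper below is a small sketch; its name and option names are hypothetical and not part of this project.

```javascript
// Hypothetical client helper: builds the query string for GET /articles.
function buildArticlesQueryString({ keywords = [], keywordMode, order, limit, offset } = {}) {
  const params = new URLSearchParams();
  // append() (not set()) keeps every keyword as its own repeated param.
  for (const kw of keywords) params.append('keyword', kw);
  if (keywordMode) params.set('keyword_mode', keywordMode);
  if (order) params.set('order', order);
  if (limit !== undefined) params.set('limit', String(limit));
  if (offset !== undefined) params.set('offset', String(offset));
  return params.toString();
}

const qs = buildArticlesQueryString({
  keywords: ['bitcoin', 'ethereum'],
  keywordMode: 'or',
  order: 'oldest',
  limit: 10,
});
console.log(`/articles?${qs}`);
// /articles?keyword=bitcoin&keyword=ethereum&keyword_mode=or&order=oldest&limit=10
```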
|
||||
|
||||
Plain keyword search.
|
||||
#### `order` values
|
||||
|
||||
- matches `title`, `description`, and `content`
|
||||
- uses SQL `LIKE`
|
||||
- works like substring matching, not semantic search
|
||||
- best when you want literal words or phrases to appear in the article text
|
||||
| Value | Sort |
|---|---|
| `newest` | `pub_date_effective DESC` (default) |
| `oldest` | `pub_date_effective ASC` |
| `ingested_newest` | `ingested_at DESC` |
| `ingested_oldest` | `ingested_at ASC` |
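
Because the chosen sort is interpolated into SQL, it is worth resolving `order` against a fixed whitelist. The sketch below mirrors the table above; the `resolveOrderBy` name is hypothetical, and the own-property check is a suggested hardening rather than something confirmed to exist in the codebase.

```javascript
// Fixed whitelist: only these strings can ever reach the ORDER BY clause.
const ORDERS = Object.freeze({
  newest: 'pub_date_effective DESC, id DESC',
  oldest: 'pub_date_effective ASC, id ASC',
  ingested_newest: 'ingested_at DESC, id DESC',
  ingested_oldest: 'ingested_at ASC, id ASC',
});

// Hypothetical helper: unknown or inherited keys fall back to the default.
function resolveOrderBy(order) {
  return Object.prototype.hasOwnProperty.call(ORDERS, order)
    ? ORDERS[order]
    : ORDERS.newest;
}

console.log(resolveOrderBy('oldest'));      // pub_date_effective ASC, id ASC
console.log(resolveOrderBy('constructor')); // pub_date_effective DESC, id DESC
```

A plain `ORDERS[order] || ORDERS.newest` lookup would return an inherited function for values such as `constructor`, which is why the sketch checks own properties first.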
|
||||
|
||||
Example:
|
||||
```http
|
||||
GET /articles?keyword=earnings
|
||||
```
|
||||
#### Search modes
|
||||
|
||||
##### `source`
|
||||
- If `semantic` is present — semantic nearest-neighbor search. Query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
|
||||
- Else if `similar_to_article` is present — finds articles similar to the given article ID. Returns `404` if that article has no embedding.
|
||||
- Otherwise — normal filtered list mode. All params apply.
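
The precedence between the three modes can be summarized in a few lines. This is a sketch only: `pickSearchMode` is a hypothetical name, and treating a param as "present" when it is not `undefined` is an assumption about how the route checks it.

```javascript
// Hypothetical helper: decides which search mode a parsed query selects.
function pickSearchMode(query) {
  if (query.semantic !== undefined) return 'semantic';          // highest precedence
  if (query.similar_to_article !== undefined) return 'similar'; // vector lookup by article ID
  return 'list';                                                // normal filtered list
}

console.log(pickSearchMode({ semantic: 'ai chip demand', similar_to_article: '123' })); // semantic
console.log(pickSearchMode({ similar_to_article: '123' }));                             // similar
console.log(pickSearchMode({ keyword: 'earnings' }));                                   // list
```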
|
||||
|
||||
Exact match on the stored `source` field.
|
||||
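`keyword`, `source`, `from`, and `to` also work as post-filters on `semantic` and `similar_to_article` results. They are applied after the nearest-neighbor lookup, so a filtered request can return fewer than `limit` rows.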
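
The keyword filter itself can be pictured as one parameterized `LIKE` group per keyword, joined by `AND` or `OR`. The function below is a self-contained sketch of that idea; `buildKeywordCondition` and its return shape are hypothetical, and the `null` return for blank input is a suggested guard.

```javascript
// Sketch: turn one or many keywords into a parameterized LIKE condition.
function buildKeywordCondition(keyword, keywordMode) {
  // Accept a single string or an array (repeated param); drop blank entries.
  const keywords = [].concat(keyword ?? []).map((k) => String(k).trim()).filter(Boolean);
  if (keywords.length === 0) return null; // nothing to filter on

  const joiner = String(keywordMode || '').toLowerCase() === 'or' ? ' OR ' : ' AND ';
  const clause = '(title LIKE ? OR description LIKE ? OR content LIKE ?)';

  const params = [];
  for (const kw of keywords) {
    const like = `%${kw}%`; // substring match, bound as a parameter
    params.push(like, like, like);
  }
  return { sql: `(${keywords.map(() => clause).join(joiner)})`, params };
}

const cond = buildKeywordCondition(['bitcoin', 'ethereum'], 'or');
console.log(cond.sql);
console.log(cond.params);
```

Keyword text is only ever bound through `?` placeholders, so it never reaches the SQL string. `%` and `_` inside a keyword still act as `LIKE` wildcards in this sketch.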
|
||||
|
||||
Example:
|
||||
```http
|
||||
GET /articles?source=rss
|
||||
```
|
||||
`include_embedding` is explicitly rejected on this endpoint.
|
||||
|
||||
##### `from`
|
||||
|
||||
Only returns rows where `pub_date >= from`.
|
||||
|
||||
Example:
|
||||
```http
|
||||
GET /articles?from=2025-01-01T00:00:00.000Z
|
||||
```
|
||||
|
||||
##### `to`
|
||||
|
||||
Only returns rows where `pub_date <= to`.
|
||||
|
||||
Example:
|
||||
```http
|
||||
GET /articles?to=2025-01-31T23:59:59.999Z
|
||||
```
|
||||
|
||||
##### `limit`
|
||||
|
||||
Number of rows to return.
|
||||
|
||||
- default: `20`
|
||||
- max: `100`
|
||||
|
||||
Example:
|
||||
```http
|
||||
GET /articles?limit=10
|
||||
```
|
||||
|
||||
##### `offset`
|
||||
|
||||
Pagination offset.
|
||||
|
||||
- default: `0`
|
||||
|
||||
Example:
|
||||
```http
|
||||
GET /articles?limit=10&offset=20
|
||||
```
|
||||
|
||||
##### `similar_to_article`
|
||||
|
||||
Runs vector similarity search instead of normal list mode.
|
||||
|
||||
- value must be an existing article ID
|
||||
- the server looks up that article's embedding
|
||||
- nearest-neighbor search runs in `sqlite-vec`
|
||||
- the source article is excluded from the result set
|
||||
- each result includes a `distance` field
|
||||
- lower `distance` means more similar
|
||||
- returns `404` if the article has no stored embedding
|
||||
|
||||
Example:
|
||||
```http
|
||||
GET /articles?similar_to_article=123&limit=5
|
||||
```
|
||||
|
||||
Not found response:
|
||||
```json
|
||||
{ "error": "Embedding not found for article" }
|
||||
```
|
||||
|
||||
##### `semantic`
|
||||
|
||||
Semantic search by meaning, not exact wording.
|
||||
|
||||
- use this when you want conceptually related results
|
||||
- unlike `keyword`, the words do not need to appear literally in the article text
|
||||
- the query text is normalized before embedding
|
||||
- query embeddings are cached in SQLite
|
||||
- on cache miss, the server requests an embedding from OpenRouter
|
||||
- nearest article matches are returned from the embedding index
|
||||
- each result includes a `distance` field
|
||||
- lower `distance` means a closer semantic match
|
||||
- returns `400` if `semantic` is empty
|
||||
|
||||
Example:
|
||||
```http
|
||||
GET /articles?semantic=ai chip demand&limit=10
|
||||
```
|
||||
|
||||
Bad request response:
|
||||
```json
|
||||
{ "error": "Semantic query must not be empty" }
|
||||
```
|
||||
|
||||
##### `include_embedding`
|
||||
|
||||
Explicitly rejected on `/articles`.
|
||||
|
||||
Response:
|
||||
```json
|
||||
{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
|
||||
```
|
||||
|
||||
#### General behavior
|
||||
|
||||
- If `semantic` is present, semantic search is used.
|
||||
- Else if `similar_to_article` is present, similarity search is used.
|
||||
- Otherwise normal list/search mode is used.
|
||||
- `keyword` is literal keyword matching.
|
||||
- `semantic` is semantic matching by meaning.
|
||||
- Normal list/search results are ordered by `COALESCE(pub_date, ingested_at) DESC, id DESC`.
|
||||
- `from` and `to` are compared against stored publication timestamps, so ISO-8601 values are the safest input.
|
||||
- `source` must match the stored source name exactly.
|
||||
- `keyword` is substring matching, not full-text search.
|
||||
|
||||
#### Normal list/search response shape
|
||||
#### Response shape
|
||||
|
||||
```json
|
||||
[
|
||||
|
|
@ -194,113 +92,60 @@ Response:
|
|||
"title": "...",
|
||||
"description": "...",
|
||||
"content": "...",
|
||||
"image": "...",
|
||||
"url": "...",
|
||||
"normalized_title": "...",
|
||||
"source": "rss",
|
||||
"source": "rss:BBC",
|
||||
"pub_date": "2025-01-01T12:34:56.000Z",
|
||||
"ingested_at": "2025-01-01T12:35:10.000Z"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
#### Similarity/topic search response shape
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"id": 456,
|
||||
"title": "...",
|
||||
"description": "...",
|
||||
"content": "...",
|
||||
"image": "...",
|
||||
"url": "...",
|
||||
"normalized_title": "...",
|
||||
"source": "rss",
|
||||
"pub_date": "2025-01-02T09:00:00.000Z",
|
||||
"ingested_at": "2025-01-02T09:00:10.000Z",
|
||||
"distance": 0.1234
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
#### Combined example
|
||||
|
||||
```http
|
||||
GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
|
||||
```
|
||||
Semantic and similarity results also include `"distance": 0.1234`.
|
||||
|
||||
### `GET /articles/:id`
|
||||
|
||||
Returns one article by numeric ID.
|
||||
|
||||
**Behavior**
|
||||
|
||||
- Looks up the article directly in SQLite.
|
||||
- Same usability filter as the list endpoint — returns `404` if the article exists but is not usable.
|
||||
- Returns the same article fields as normal `/articles` list mode.
|
||||
- Does not return embedding data.
|
||||
- Returns `404` if the ID does not exist.
|
||||
|
||||
**Example**
|
||||
```http
|
||||
GET /articles/123
|
||||
```
|
||||
|
||||
**Not found response**
|
||||
```json
|
||||
{ "error": "Article not found" }
|
||||
```
|
||||
Returns one article by numeric ID. Same usability filter as the list endpoint — returns `404` if the article exists but has no content or embedding.
|
||||
|
||||
### `GET /status`
|
||||
|
||||
Returns ingestion and archive summary information.
|
||||
Returns archive summary. Cached for 30 seconds.
|
||||
|
||||
**Response fields**
|
||||
|
||||
- `total`: total number of rows in `articles` across all sources
|
||||
- `usable`: articles that have content, an embedding, and are not index pages
|
||||
- `lastIngestionBySource`: in-memory timestamps of the last successful batch run per source
|
||||
- `bySource`: per-source breakdown, each with `total` and `usable` counts
|
||||
- `total`: total rows across all sources
- `usable`: articles with content + embedding, not index pages
- `lastIngestionBySource`: in-memory timestamps of the last successful batch per source (resets on restart)
- `bySource`: per-source `{ total, usable }`
- `embeddingModels`: active embedding models with article count and detected dimensions
|
||||
|
||||
**Important detail**
|
||||
### `GET /sources`
|
||||
|
||||
`lastIngestionBySource` is kept in memory, so it resets when the process restarts.
|
||||
Returns the full source catalog from `sources.json` enriched with live DB stats.
|
||||
|
||||
**Example response**
|
||||
```json
|
||||
{
|
||||
"total": 10234,
|
||||
"usable": 8700,
|
||||
"lastIngestionBySource": {
|
||||
"rss": "2025-01-02T10:00:00.000Z",
|
||||
"gdelt": "2025-01-02T10:05:00.000Z"
|
||||
},
|
||||
"bySource": {
|
||||
"alphavantage": { "total": 120, "usable": 98 },
|
||||
"edgar": { "total": 88, "usable": 70 },
|
||||
"finnhub": { "total": 400, "usable": 360 },
|
||||
"gdelt": { "total": 2100, "usable": 1800 },
|
||||
"rss": { "total": 7526, "usable": 6372 }
|
||||
}
|
||||
}
|
||||
```
|
||||
**Per-source fields**
|
||||
|
||||
- `id`, `label`, `websites`, `backfill`, `feeds`: from `sources.json` (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
- `counts`: aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
- `byFeed`: same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
- `domains`: current domain fetch policy per website, with `policy` (`auto` / `browser_only` / `blocked`), failure/success counts, and `expiresAt`
|
||||
|
||||
Use `domains[].policy` to diagnose why a source has high `skipped` or `failed` counts — `blocked` means backfill has given up on that domain temporarily.
|
||||
|
||||
## Article field notes
|
||||
|
||||
- `image` stores the extracted main image as ultra-compressed base64 WebP.
|
||||
- `normalized_title` is stored for matching and indexing.
|
||||
- `source` may be a shared source like `rss`, `googlenews`, `gdelt`, `edgar`, `alphavantage`, or `finnhub`.
|
||||
- `pub_date` is normalized to ISO-8601 when it can be parsed.
|
||||
- `ingested_at` is the insert timestamp set by the server.
|
||||
- `pub_date` is normalized to ISO-8601 when parseable; `null` otherwise.
- `pub_date_effective` is `COALESCE(pub_date, ingested_at)`, used for sorting.
- `ingested_at` is the server-side insert timestamp.
- `normalized_title` is stored for deduplication and indexing.
- `source` format is `<feed_type>:<label>` for GDELT and RSS (e.g. `gdelt:Bloomberg Markets`, `rss:TechCrunch`), or just the source name for other feeds (`alphavantage`, `edgar`, `finnhub`).
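
For API consumers, two of the field rules above translate into small helpers. Both function names are hypothetical; the sketch only restates the `COALESCE` rule and the `<feed_type>:<label>` format in JavaScript.

```javascript
// Client-side equivalent of COALESCE(pub_date, ingested_at).
function pubDateEffective(article) {
  return article.pub_date ?? article.ingested_at;
}

// Splits a stored source into its feed type and optional label.
function parseSource(source) {
  const idx = source.indexOf(':');
  if (idx === -1) return { feedType: source, label: null }; // e.g. "edgar"
  return { feedType: source.slice(0, idx), label: source.slice(idx + 1) };
}

console.log(pubDateEffective({ pub_date: null, ingested_at: '2025-01-01T12:35:10.000Z' }));
// 2025-01-01T12:35:10.000Z
console.log(parseSource('gdelt:Bloomberg Markets'));
// { feedType: 'gdelt', label: 'Bloomberg Markets' }
console.log(parseSource('edgar'));
// { feedType: 'edgar', label: null }
```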
|
||||
|
||||
## Notes
|
||||
|
||||
- SQLite archive file defaults to `./archive.sqlite`.
|
||||
- Deduplication is enforced on `url`; normalized titles are stored and indexed for matching but are not unique.
|
||||
- `googleNews` accepts `queries`, `topics`, `language`, and `country`, and resolves Google redirect URLs to publisher URLs before ingestion.
|
||||
- Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content.
|
||||
- Embeddings are generated asynchronously with OpenRouter `perplexity/pplx-embed-v1-0.6b` and indexed in `sqlite-vec` for similarity search.
|
||||
- Topic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss.
|
||||
- SEC requests use the configured `User-Agent`.
|
||||
- Duplicate URLs are skipped rather than inserted again.
|
||||
- SQLite archive defaults to `./archive.sqlite`.
- Deduplication is enforced on `url`.
- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
- Embeddings use OpenRouter and are indexed in `sqlite-vec` for ANN search.
- Query embeddings are cached in SQLite to avoid redundant API calls.
- SEC requests use the `User-Agent` from `config.json`.
|
||||
|
|
|
|||
|
|
@ -12,9 +12,15 @@ function buildArticlesQuery(query) {
|
|||
const includeEmbedding = String(query.include_embedding || '').toLowerCase() === 'true';
|
||||
|
||||
if (query.keyword) {
|
||||
conditions.push('(title LIKE ? OR description LIKE ? OR content LIKE ?)');
|
||||
const keyword = `%${query.keyword}%`;
|
||||
params.push(keyword, keyword, keyword);
|
||||
    // Accept a single keyword or a repeated param; drop blank entries.
    const keywords = [].concat(query.keyword).map((k) => String(k).trim()).filter(Boolean);
    const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
    const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');

    // Skip the condition when every keyword was blank; an empty "()" is invalid SQL.
    if (clauses.length > 0) {
      conditions.push(`(${clauses.join(` ${mode} `)})`);
    }
    for (const kw of keywords) {
      const like = `%${kw}%`;
      params.push(like, like, like);
    }
  }
|
||||
|
||||
if (query.source) {
|
||||
|
|
@ -36,6 +42,14 @@ function buildArticlesQuery(query) {
|
|||
conditions.push('is_index_page = 0');
|
||||
conditions.push('has_embedding = 1');
|
||||
|
||||
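  // Whitelist of sort orders; only these fixed strings are interpolated into the SQL.
  const ORDERS = {
    newest: 'pub_date_effective DESC, id DESC',
    oldest: 'pub_date_effective ASC, id ASC',
    ingested_newest: 'ingested_at DESC, id DESC',
    ingested_oldest: 'ingested_at ASC, id ASC',
  };
  // Own-property check so inherited keys such as "constructor" fall back to the default.
  const orderBy = Object.prototype.hasOwnProperty.call(ORDERS, query.order) ? ORDERS[query.order] : ORDERS.newest;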
|
||||
|
||||
const whereClause = `WHERE ${conditions.join(' AND ')}`;
|
||||
const limit = Number.parseInt(query.limit, 10);
|
||||
const offset = Number.parseInt(query.offset, 10);
|
||||
|
|
@ -48,7 +62,7 @@ function buildArticlesQuery(query) {
|
|||
SELECT id, title, description, content, ${includeEmbedding ? 'embedding,' : ''} url, normalized_title, source, pub_date, ingested_at
|
||||
FROM articles
|
||||
${whereClause}
|
||||
ORDER BY pub_date_effective DESC, id DESC
|
||||
ORDER BY ${orderBy}
|
||||
LIMIT ? OFFSET ?
|
||||
`,
|
||||
params,
|
||||
|
|
@ -64,23 +78,58 @@ function shouldExcludeIndexPages(query) {
|
|||
return String(query.exclude_index_pages || '').toLowerCase() !== 'false';
|
||||
}
|
||||
|
||||
function mapNeighborsToArticles(neighbors, excludeIndexPages, limit) {
|
||||
function mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query = {}) {
|
||||
const ids = neighbors.map((row) => row.articleId);
|
||||
if (ids.length === 0) {
|
||||
return [];
|
||||
}
|
||||
|
||||
const placeholders = ids.map(() => '?').join(', ');
|
||||
const conditions = [];
|
||||
const params = [...ids];
|
||||
|
||||
conditions.push(`id IN (${placeholders})`);
|
||||
conditions.push("content IS NOT NULL AND content != ''");
|
||||
conditions.push('has_embedding = 1');
|
||||
|
||||
if (excludeIndexPages) conditions.push('is_index_page = 0');
|
||||
|
||||
if (query.source) {
|
||||
conditions.push('source = ?');
|
||||
params.push(query.source);
|
||||
}
|
||||
|
||||
if (query.from) {
|
||||
conditions.push('pub_date >= ?');
|
||||
params.push(query.from);
|
||||
}
|
||||
|
||||
if (query.to) {
|
||||
conditions.push('pub_date <= ?');
|
||||
params.push(query.to);
|
||||
}
|
||||
|
||||
if (query.keyword) {
|
||||
    // Accept a single keyword or a repeated param; drop blank entries.
    const keywords = [].concat(query.keyword).map((k) => String(k).trim()).filter(Boolean);
    const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
    const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');

    // Skip the condition when every keyword was blank; an empty "()" is invalid SQL.
    if (clauses.length > 0) {
      conditions.push(`(${clauses.join(` ${mode} `)})`);
    }
    for (const kw of keywords) {
      const like = `%${kw}%`;
      params.push(like, like, like);
    }
  }
|
||||
|
||||
const articles = db.prepare(`
|
||||
SELECT id, title, description, content, url, normalized_title, source, pub_date, ingested_at
|
||||
FROM articles
|
||||
WHERE id IN (${placeholders})
|
||||
AND content IS NOT NULL AND content != ''
|
||||
AND has_embedding = 1
|
||||
${excludeIndexPages ? 'AND is_index_page = 0' : ''}
|
||||
`).all(...ids);
|
||||
WHERE ${conditions.join(' AND ')}
|
||||
`).all(...params);
|
||||
|
||||
const byId = new Map(articles.map((article) => [article.id, article]));
|
||||
|
||||
// preserve distance ordering from the vector search
|
||||
return neighbors
|
||||
.map((row) => {
|
||||
const article = byId.get(row.articleId);
|
||||
|
|
@ -113,7 +162,7 @@ async function articleRoutes(fastify) {
|
|||
Math.min(limit * 5, 500)
|
||||
);
|
||||
|
||||
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit);
|
||||
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
|
||||
}
|
||||
|
||||
if (query.similar_to_article) {
|
||||
|
|
@ -130,7 +179,7 @@ async function articleRoutes(fastify) {
|
|||
return { error: 'Embedding not found for article' };
|
||||
}
|
||||
|
||||
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit);
|
||||
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
|
||||
}
|
||||
|
||||
const { sql, params } = buildArticlesQuery(query);
|
||||
|
|
|
|||