enhance article query capabilities by supporting multiple keywords and dynamic ordering
parent
8805d3a3fc
commit
cb819e77ee
2 changed files with 134 additions and 240 deletions
301
README.md
|
|
@ -1,6 +1,6 @@
|
||||||
# duriin_api
|
# duriin_api
|
||||||
|
|
||||||
Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
|
Node.js Fastify server that ingests news articles from RSS, SEC EDGAR 8-K filings, Alpha Vantage News Sentiment, Finnhub company news, and GDELT into a local SQLite archive.
|
||||||
|
|
||||||
## Setup
|
## Setup
|
||||||
|
|
||||||
|
|
@ -8,7 +8,7 @@ Node.js Fastify server that ingests news articles from RSS, Google News RSS, SEC
|
||||||
```bash
|
```bash
|
||||||
npm install
|
npm install
|
||||||
```
|
```
|
||||||
2. Edit `config.json` with your API keys, tickers, RSS feeds, Google News settings, and schedules.
|
2. Edit `config.json` with your API keys, tickers, and schedules.
|
||||||
3. Start the server:
|
3. Start the server:
|
||||||
```bash
|
```bash
|
||||||
npm start
|
npm start
|
||||||
|
|
@ -20,172 +20,70 @@ The server listens on the host and port defined in `config.json`.
|
||||||
|
|
||||||
On startup the server:
|
On startup the server:
|
||||||
|
|
||||||
1. Opens the SQLite database.
|
1. Opens the SQLite database and runs any pending migrations.
|
||||||
2. Registers the article and status routes.
|
2. Registers routes.
|
||||||
3. Starts the HTTP server.
|
3. Starts the HTTP server.
|
||||||
4. Immediately runs all ingestion sources once.
|
4. Launches continuous background loops for each source, content backfill, and embedding backfill.
|
||||||
5. Starts the cron scheduler for recurring ingestions, content backfill, and embedding backfill.
|
|
||||||
|
|
||||||
When a new article is inserted:
|
When a new article is inserted:
|
||||||
|
|
||||||
- the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
|
- the record is written immediately with `title`, `description`, `url`, `source`, and timestamps
|
||||||
- `content` and `image` start as `null`
|
- `content` starts as `null`
|
||||||
- full article extraction runs asynchronously after insert
|
- content backfill workers pick it up asynchronously — plain HTTP first, Playwright fallback for JS-heavy sites
|
||||||
- vector embeddings are generated later, after title, description, and content are all available
|
- vector embeddings are generated after title, description, and content are all available
|
||||||
|
- only articles with content + embedding are exposed via the API
|
||||||
|
|
||||||
|
Content backfill prioritises recent articles (`pub_date_effective DESC`) so newest content surfaces first regardless of ingestion order.
|
||||||
|
|
||||||
|
Per-domain fetch policies are tracked automatically — domains that repeatedly fail plain fetch are upgraded to browser-only, domains that fail both are blocked temporarily.
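
The policy escalation described above can be sketched as a small state transition. This is only an illustration: the policy names `auto`, `browser_only`, and `blocked` and the `expiresAt` field appear in this README, but the thresholds, the block duration, and the function names below are hypothetical, not the project's actual values.

```javascript
// Hypothetical thresholds and duration; the project's real values are not shown here.
const PLAIN_FAILURE_LIMIT = 3;
const BROWSER_FAILURE_LIMIT = 3;
const BLOCK_DURATION_MS = 60 * 60 * 1000;

function initialDomainState() {
  return { policy: 'auto', plainFailures: 0, browserFailures: 0, expiresAt: null };
}

// fetcher is 'plain' or 'browser'; ok is whether that fetch succeeded.
function recordFetchResult(state, fetcher, ok, now = Date.now()) {
  let next = { ...state };

  // A temporary block that has expired starts over from a clean state.
  if (next.policy === 'blocked' && next.expiresAt !== null && now >= next.expiresAt) {
    next = initialDomainState();
  }

  const counter = fetcher === 'plain' ? 'plainFailures' : 'browserFailures';
  if (ok) {
    next[counter] = 0; // a success clears the streak for that fetcher
    return next;
  }

  next[counter] += 1;
  const plainExhausted = next.plainFailures >= PLAIN_FAILURE_LIMIT;
  const browserExhausted = next.browserFailures >= BROWSER_FAILURE_LIMIT;

  if (plainExhausted && browserExhausted) {
    // Both fetchers keep failing: block the domain for a while.
    next.policy = 'blocked';
    next.expiresAt = now + BLOCK_DURATION_MS;
  } else if (plainExhausted) {
    // Plain HTTP keeps failing: stop trying it and go straight to the browser.
    next.policy = 'browser_only';
  }
  return next;
}
```

Keeping the transition a pure function of `(state, result, now)` makes it easy to test without a real network or clock.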
|
||||||
|
|
||||||
## API overview
|
## API overview
|
||||||
|
|
||||||
All exposed endpoints are `GET` endpoints.
|
All endpoints are `GET`.
|
||||||
|
|
||||||
### `GET /`
|
### `GET /`
|
||||||
|
|
||||||
Simple health check.
|
Health check. Returns `{ "ok": true }`.
|
||||||
|
|
||||||
**Response**
|
|
||||||
```json
|
|
||||||
{ "ok": true }
|
|
||||||
```
|
|
||||||
|
|
||||||
Use this to confirm the server is running, not to inspect ingestion state.
|
|
||||||
|
|
||||||
### `GET /articles`
|
### `GET /articles`
|
||||||
|
|
||||||
Returns articles from the `articles` table. Only articles that are considered **usable** are exposed — meaning they have non-empty `content`, a stored embedding, and are not index/category pages. Behavior changes based on the query params you send.
|
Returns usable articles — non-empty `content`, stored embedding, not an index/category page.
|
||||||
|
|
||||||
#### Query params
|
#### Query params
|
||||||
|
|
||||||
##### `keyword`
|
| Param | Description |
|---|---|
| `keyword` | Keyword matched against `title`, `description`, and `content`. Repeat the param for multiple keywords, e.g. `keyword=bitcoin&keyword=ethereum` |
| `keyword_mode` | How multiple keywords are combined: `and` (default) or `or` |
| `source` | Exact match on the stored `source` field (e.g. `rss:BBC`, `gdelt:Al Jazeera`) |
| `from` | `pub_date >= from` (ISO-8601) |
| `to` | `pub_date <= to` (ISO-8601) |
| `limit` | Rows to return. Default `20`, max `100` |
| `offset` | Pagination offset. Default `0` |
| `order` | Sort order; see below. Not applied to `semantic` or `similar_to_article` results (those are sorted by distance) |
| `semantic` | Semantic search by meaning via embedding similarity |
| `similar_to_article` | Vector similarity search using another article's embedding |
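
As a client-side illustration, a request that repeats `keyword` can be built with the standard `URLSearchParams` API. This is a minimal sketch: the helper name `buildArticlesUrl` is made up for the example, and the base URL `http://localhost:3000` is an assumption, since the real host and port come from `config.json`.

```javascript
// Build a /articles URL that repeats `keyword` once per term.
// BASE_URL is a placeholder; the real host and port come from config.json.
const BASE_URL = 'http://localhost:3000';

function buildArticlesUrl({ keywords = [], keywordMode, order, limit } = {}) {
  const params = new URLSearchParams();
  for (const kw of keywords) {
    params.append('keyword', kw); // append (not set) keeps every occurrence
  }
  if (keywordMode) params.set('keyword_mode', keywordMode);
  if (order) params.set('order', order);
  if (limit !== undefined) params.set('limit', String(limit));
  return `${BASE_URL}/articles?${params.toString()}`;
}

const exampleUrl = buildArticlesUrl({
  keywords: ['bitcoin', 'ethereum'],
  keywordMode: 'or',
  order: 'oldest',
  limit: 10,
});
console.log(exampleUrl);
// http://localhost:3000/articles?keyword=bitcoin&keyword=ethereum&keyword_mode=or&order=oldest&limit=10
```

Using `append` rather than `set` matters here: `set` would overwrite the previous `keyword` and only the last term would reach the server.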
|
||||||
|
|
||||||
Plain keyword search.
|
#### `order` values
|
||||||
|
|
||||||
- matches `title`, `description`, and `content`
|
| Value | Sort |
|
||||||
- uses SQL `LIKE`
|
|---|---|
|
||||||
- works like substring matching, not semantic search
|
| `newest` | `pub_date_effective DESC` (default) |
|
||||||
- best when you want literal words or phrases to appear in the article text
|
| `oldest` | `pub_date_effective ASC` |
|
||||||
|
| `ingested_newest` | `ingested_at DESC` |
|
||||||
|
| `ingested_oldest` | `ingested_at ASC` |
|
||||||
|
|
||||||
Example:
|
#### Search modes
|
||||||
```http
|
|
||||||
GET /articles?keyword=earnings
|
|
||||||
```
|
|
||||||
|
|
||||||
##### `source`
|
- If `semantic` is present — semantic nearest-neighbor search. Query is embedded via OpenRouter and matched against the article index. Results include a `distance` field (lower = closer).
|
||||||
|
- Else if `similar_to_article` is present — finds articles similar to the given article ID. Returns `404` if that article has no embedding.
|
||||||
|
- Otherwise — normal filtered list mode. All params apply.
|
||||||
|
|
||||||
Exact match on the stored `source` field.
|
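`keyword`, `source`, `from`, and `to` also work as post-filters on `semantic` and `similar_to_article` results. They are applied after the nearest-neighbor lookup, so a heavily filtered query can return fewer than `limit` rows.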
|
||||||
|
|
||||||
Example:
|
`include_embedding` is explicitly rejected on this endpoint.
|
||||||
```http
|
|
||||||
GET /articles?source=rss
|
|
||||||
```
|
|
||||||
|
|
||||||
##### `from`
|
#### Response shape
|
||||||
|
|
||||||
Only returns rows where `pub_date >= from`.
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```http
|
|
||||||
GET /articles?from=2025-01-01T00:00:00.000Z
|
|
||||||
```
|
|
||||||
|
|
||||||
##### `to`
|
|
||||||
|
|
||||||
Only returns rows where `pub_date <= to`.
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```http
|
|
||||||
GET /articles?to=2025-01-31T23:59:59.999Z
|
|
||||||
```
|
|
||||||
|
|
||||||
##### `limit`
|
|
||||||
|
|
||||||
Number of rows to return.
|
|
||||||
|
|
||||||
- default: `20`
|
|
||||||
- max: `100`
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```http
|
|
||||||
GET /articles?limit=10
|
|
||||||
```
|
|
||||||
|
|
||||||
##### `offset`
|
|
||||||
|
|
||||||
Pagination offset.
|
|
||||||
|
|
||||||
- default: `0`
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```http
|
|
||||||
GET /articles?limit=10&offset=20
|
|
||||||
```
|
|
||||||
|
|
||||||
##### `similar_to_article`
|
|
||||||
|
|
||||||
Runs vector similarity search instead of normal list mode.
|
|
||||||
|
|
||||||
- value must be an existing article ID
|
|
||||||
- the server looks up that article's embedding
|
|
||||||
- nearest-neighbor search runs in `sqlite-vec`
|
|
||||||
- the source article is excluded from the result set
|
|
||||||
- each result includes a `distance` field
|
|
||||||
- lower `distance` means more similar
|
|
||||||
- returns `404` if the article has no stored embedding
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```http
|
|
||||||
GET /articles?similar_to_article=123&limit=5
|
|
||||||
```
|
|
||||||
|
|
||||||
Not found response:
|
|
||||||
```json
|
|
||||||
{ "error": "Embedding not found for article" }
|
|
||||||
```
|
|
||||||
|
|
||||||
##### `semantic`
|
|
||||||
|
|
||||||
Semantic search by meaning, not exact wording.
|
|
||||||
|
|
||||||
- use this when you want conceptually related results
|
|
||||||
- unlike `keyword`, the words do not need to appear literally in the article text
|
|
||||||
- the query text is normalized before embedding
|
|
||||||
- query embeddings are cached in SQLite
|
|
||||||
- on cache miss, the server requests an embedding from OpenRouter
|
|
||||||
- nearest article matches are returned from the embedding index
|
|
||||||
- each result includes a `distance` field
|
|
||||||
- lower `distance` means a closer semantic match
|
|
||||||
- returns `400` if `semantic` is empty
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```http
|
|
||||||
GET /articles?semantic=ai chip demand&limit=10
|
|
||||||
```
|
|
||||||
|
|
||||||
Bad request response:
|
|
||||||
```json
|
|
||||||
{ "error": "Semantic query must not be empty" }
|
|
||||||
```
|
|
||||||
|
|
||||||
##### `include_embedding`
|
|
||||||
|
|
||||||
Explicitly rejected on `/articles`.
|
|
||||||
|
|
||||||
Response:
|
|
||||||
```json
|
|
||||||
{ "error": "Embeddings are not returned directly. Use similar_to_article for vector search." }
|
|
||||||
```
|
|
||||||
|
|
||||||
#### General behavior
|
|
||||||
|
|
||||||
- If `semantic` is present, semantic search is used.
|
|
||||||
- Else if `similar_to_article` is present, similarity search is used.
|
|
||||||
- Otherwise normal list/search mode is used.
|
|
||||||
- `keyword` is literal keyword matching.
|
|
||||||
- `semantic` is semantic matching by meaning.
|
|
||||||
- Normal list/search results are ordered by `COALESCE(pub_date, ingested_at) DESC, id DESC`.
|
|
||||||
- `from` and `to` are compared against stored publication timestamps, so ISO-8601 values are the safest input.
|
|
||||||
- `source` must match the stored source name exactly.
|
|
||||||
- `keyword` is substring matching, not full-text search.
|
|
||||||
|
|
||||||
#### Normal list/search response shape
|
|
||||||
|
|
||||||
```json
|
```json
|
||||||
[
|
[
|
||||||
|
|
@ -194,113 +92,60 @@ Response:
|
||||||
"title": "...",
|
"title": "...",
|
||||||
"description": "...",
|
"description": "...",
|
||||||
"content": "...",
|
"content": "...",
|
||||||
"image": "...",
|
|
||||||
"url": "...",
|
"url": "...",
|
||||||
"normalized_title": "...",
|
"normalized_title": "...",
|
||||||
"source": "rss",
|
"source": "rss:BBC",
|
||||||
"pub_date": "2025-01-01T12:34:56.000Z",
|
"pub_date": "2025-01-01T12:34:56.000Z",
|
||||||
"ingested_at": "2025-01-01T12:35:10.000Z"
|
"ingested_at": "2025-01-01T12:35:10.000Z"
|
||||||
}
|
}
|
||||||
]
|
]
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Similarity/topic search response shape
|
Semantic and similarity results also include `"distance": 0.1234`.
|
||||||
|
|
||||||
```json
|
|
||||||
[
|
|
||||||
{
|
|
||||||
"id": 456,
|
|
||||||
"title": "...",
|
|
||||||
"description": "...",
|
|
||||||
"content": "...",
|
|
||||||
"image": "...",
|
|
||||||
"url": "...",
|
|
||||||
"normalized_title": "...",
|
|
||||||
"source": "rss",
|
|
||||||
"pub_date": "2025-01-02T09:00:00.000Z",
|
|
||||||
"ingested_at": "2025-01-02T09:00:10.000Z",
|
|
||||||
"distance": 0.1234
|
|
||||||
}
|
|
||||||
]
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Combined example
|
|
||||||
|
|
||||||
```http
|
|
||||||
GET /articles?keyword=earnings&source=rss&from=2025-01-01T00:00:00.000Z&limit=10&offset=0
|
|
||||||
```
|
|
||||||
|
|
||||||
### `GET /articles/:id`
|
### `GET /articles/:id`
|
||||||
|
|
||||||
Returns one article by numeric ID.
|
Returns one article by numeric ID. Same usability filter as the list endpoint — returns `404` if the article exists but has no content or embedding.
|
||||||
|
|
||||||
**Behavior**
|
|
||||||
|
|
||||||
- Looks up the article directly in SQLite.
|
|
||||||
- Same usability filter as the list endpoint — returns `404` if the article exists but is not usable.
|
|
||||||
- Returns the same article fields as normal `/articles` list mode.
|
|
||||||
- Does not return embedding data.
|
|
||||||
- Returns `404` if the ID does not exist.
|
|
||||||
|
|
||||||
**Example**
|
|
||||||
```http
|
|
||||||
GET /articles/123
|
|
||||||
```
|
|
||||||
|
|
||||||
**Not found response**
|
|
||||||
```json
|
|
||||||
{ "error": "Article not found" }
|
|
||||||
```
|
|
||||||
|
|
||||||
### `GET /status`
|
### `GET /status`
|
||||||
|
|
||||||
Returns ingestion and archive summary information.
|
Returns archive summary. Cached for 30 seconds.
|
||||||
|
|
||||||
**Response fields**
|
**Response fields**
|
||||||
|
|
||||||
- `total`: total number of rows in `articles` across all sources
|
- `total` — total rows across all sources
|
||||||
- `usable`: articles that have content, an embedding, and are not index pages
|
- `usable` — articles with content + embedding, not index pages
|
||||||
- `lastIngestionBySource`: in-memory timestamps of the last successful batch run per source
|
- `lastIngestionBySource` — in-memory timestamps of the last successful batch per source (resets on restart)
|
||||||
- `bySource`: per-source breakdown, each with `total` and `usable` counts
|
- `bySource` — per-source `{ total, usable }`
|
||||||
|
- `embeddingModels` — active embedding models with article count and detected dimensions
|
||||||
|
|
||||||
**Important detail**
|
### `GET /sources`
|
||||||
|
|
||||||
`lastIngestionBySource` is kept in memory, so it resets when the process restarts.
|
Returns the full source catalog from `sources.json` enriched with live DB stats.
|
||||||
|
|
||||||
**Example response**
|
**Per-source fields**
|
||||||
```json
|
|
||||||
{
|
- `id`, `label`, `websites`, `backfill`, `feeds` — from `sources.json` (feed URLs preserve the `[FAILED]` prefix if the feed has been marked dead)
|
||||||
"total": 10234,
|
- `counts` — aggregated `{ total, ready, skipped, failed, pending, untried, usable }` across all feed types for this source
|
||||||
"usable": 8700,
|
- `byFeed` — same breakdown split by feed prefix (`rss`, `gdelt`, etc.)
|
||||||
"lastIngestionBySource": {
|
- `domains` — current domain fetch policy per website: `policy` (auto / browser_only / blocked), failure/success counts, `expiresAt`
|
||||||
"rss": "2025-01-02T10:00:00.000Z",
|
|
||||||
"gdelt": "2025-01-02T10:05:00.000Z"
|
Use `domains[].policy` to diagnose why a source has high `skipped` or `failed` counts — `blocked` means backfill has given up on that domain temporarily.
|
||||||
},
|
|
||||||
"bySource": {
|
|
||||||
"alphavantage": { "total": 120, "usable": 98 },
|
|
||||||
"edgar": { "total": 88, "usable": 70 },
|
|
||||||
"finnhub": { "total": 400, "usable": 360 },
|
|
||||||
"gdelt": { "total": 2100, "usable": 1800 },
|
|
||||||
"rss": { "total": 7526, "usable": 6372 }
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Article field notes
|
## Article field notes
|
||||||
|
|
||||||
- `image` stores the extracted main image as ultra-compressed base64 WebP.
|
- `pub_date` is normalized to ISO-8601 when parseable; `null` otherwise.
|
||||||
- `normalized_title` is stored for matching and indexing.
|
- `pub_date_effective` is `COALESCE(pub_date, ingested_at)` — used for sorting.
|
||||||
- `source` may be a shared source like `rss`, `googlenews`, `gdelt`, `edgar`, `alphavantage`, or `finnhub`.
|
- `ingested_at` is the server-side insert timestamp.
|
||||||
- `pub_date` is normalized to ISO-8601 when it can be parsed.
|
- `normalized_title` is stored for deduplication and indexing.
|
||||||
- `ingested_at` is the insert timestamp set by the server.
|
- `source` format is `<feed_type>:<label>` for GDELT and RSS (e.g. `gdelt:Bloomberg Markets`, `rss:TechCrunch`), or just the source name for other feeds (`alphavantage`, `edgar`, `finnhub`).
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- SQLite archive file defaults to `./archive.sqlite`.
|
- SQLite archive defaults to `./archive.sqlite`.
|
||||||
- Deduplication is enforced on `url`; normalized titles are stored and indexed for matching but are not unique.
|
- Deduplication is enforced on `url`.
|
||||||
- `googleNews` accepts `queries`, `topics`, `language`, and `country`, and resolves Google redirect URLs to publisher URLs before ingestion.
|
- GDELT ingestion streams per-window to avoid accumulating the full 6-year backlog in memory at once.
|
||||||
- Article body extraction runs asynchronously after insertion, with scheduled retries for rows still missing content.
|
- Content backfill uses separate concurrency pools for plain HTTP and Playwright (browser) fetches.
|
||||||
- Embeddings are generated asynchronously with OpenRouter `perplexity/pplx-embed-v1-0.6b` and indexed in `sqlite-vec` for similarity search.
|
- Embeddings use OpenRouter and are indexed in `sqlite-vec` for ANN search.
|
||||||
- Topic search caches normalized query embeddings in SQLite and falls back to OpenRouter on cache miss.
|
- Query embeddings are cached in SQLite to avoid redundant API calls.
|
||||||
- SEC requests use the configured `User-Agent`.
|
- SEC requests use the `User-Agent` from `config.json`.
|
||||||
- Duplicate URLs are skipped rather than inserted again.
|
|
||||||
|
|
|
||||||
|
|
@ -12,9 +12,15 @@ function buildArticlesQuery(query) {
|
||||||
const includeEmbedding = String(query.include_embedding || '').toLowerCase() === 'true';
|
const includeEmbedding = String(query.include_embedding || '').toLowerCase() === 'true';
|
||||||
|
|
||||||
if (query.keyword) {
|
if (query.keyword) {
|
||||||
conditions.push('(title LIKE ? OR description LIKE ? OR content LIKE ?)');
|
const keywords = [].concat(query.keyword).map((k) => k.trim()).filter(Boolean);
|
||||||
const keyword = `%${query.keyword}%`;
|
const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
|
||||||
params.push(keyword, keyword, keyword);
|
const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');
|
||||||
|
|
||||||
|
conditions.push(`(${clauses.join(` ${mode} `)})`);
|
||||||
|
for (const kw of keywords) {
|
||||||
|
const like = `%${kw}%`;
|
||||||
|
params.push(like, like, like);
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
if (query.source) {
|
if (query.source) {
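
Review note on the keyword hunk above: when `keyword` contains only whitespace (for example `?keyword=%20`), `query.keyword` is truthy but the trimmed list is empty, so the code appears to push an empty `()` group into the `WHERE` clause. A minimal sketch of a guard, written as a standalone helper; the name `buildKeywordCondition` is hypothetical and not part of this commit:

```javascript
// Sketch: build the keyword condition as a standalone helper.
// Returns null when no usable keyword remains, so the caller can skip the
// condition instead of pushing an empty "()" group into the WHERE clause.
function buildKeywordCondition(rawKeyword, rawMode) {
  // A single value or an array (repeated param) both become an array here.
  const keywords = [].concat(rawKeyword ?? [])
    .map((k) => String(k).trim())
    .filter(Boolean);
  if (keywords.length === 0) return null;

  const mode = String(rawMode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
  const clause = '(title LIKE ? OR description LIKE ? OR content LIKE ?)';
  const sql = `(${keywords.map(() => clause).join(` ${mode} `)})`;

  // Three placeholders per keyword: title, description, content.
  const params = keywords.flatMap((kw) => {
    const like = `%${kw}%`;
    return [like, like, like];
  });
  return { sql, params };
}
```

A caller would push `sql` and spread `params` only when the result is not `null`. Note that `%` and `_` inside a keyword still act as `LIKE` wildcards in this sketch, as they do in the committed code.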
|
||||||
|
|
@ -36,6 +42,14 @@ function buildArticlesQuery(query) {
|
||||||
conditions.push('is_index_page = 0');
|
conditions.push('is_index_page = 0');
|
||||||
conditions.push('has_embedding = 1');
|
conditions.push('has_embedding = 1');
|
||||||
|
|
||||||
|
const ORDERS = {
|
||||||
|
newest: 'pub_date_effective DESC, id DESC',
|
||||||
|
oldest: 'pub_date_effective ASC, id ASC',
|
||||||
|
ingested_newest: 'ingested_at DESC, id DESC',
|
||||||
|
ingested_oldest: 'ingested_at ASC, id ASC',
|
||||||
|
};
|
||||||
|
const orderBy = ORDERS[query.order] || ORDERS.newest;
|
||||||
|
|
||||||
const whereClause = `WHERE ${conditions.join(' AND ')}`;
|
const whereClause = `WHERE ${conditions.join(' AND ')}`;
|
||||||
const limit = Number.parseInt(query.limit, 10);
|
const limit = Number.parseInt(query.limit, 10);
|
||||||
const offset = Number.parseInt(query.offset, 10);
|
const offset = Number.parseInt(query.offset, 10);
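
Review note on the `ORDERS` lookup above: `ORDERS` is a plain object literal, so a value such as `order=constructor` resolves to an inherited, truthy property instead of falling back to `ORDERS.newest`, and that value would then be interpolated into `ORDER BY`. A minimal sketch of an own-property check; the helper name `resolveOrderBy` is hypothetical:

```javascript
// Same whitelist as in the hunk above.
const ORDERS = {
  newest: 'pub_date_effective DESC, id DESC',
  oldest: 'pub_date_effective ASC, id ASC',
  ingested_newest: 'ingested_at DESC, id DESC',
  ingested_oldest: 'ingested_at ASC, id ASC',
};

// Only keys defined directly on ORDERS are accepted; anything else
// (unknown values, inherited names, a missing param) falls back to newest.
function resolveOrderBy(order) {
  return Object.prototype.hasOwnProperty.call(ORDERS, order)
    ? ORDERS[order]
    : ORDERS.newest;
}
```

Creating the map with `Object.create(null)` would be another way to get the same effect.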
|
||||||
|
|
@ -48,7 +62,7 @@ function buildArticlesQuery(query) {
|
||||||
SELECT id, title, description, content, ${includeEmbedding ? 'embedding,' : ''} url, normalized_title, source, pub_date, ingested_at
|
SELECT id, title, description, content, ${includeEmbedding ? 'embedding,' : ''} url, normalized_title, source, pub_date, ingested_at
|
||||||
FROM articles
|
FROM articles
|
||||||
${whereClause}
|
${whereClause}
|
||||||
ORDER BY pub_date_effective DESC, id DESC
|
ORDER BY ${orderBy}
|
||||||
LIMIT ? OFFSET ?
|
LIMIT ? OFFSET ?
|
||||||
`,
|
`,
|
||||||
params,
|
params,
|
||||||
|
|
@ -64,23 +78,58 @@ function shouldExcludeIndexPages(query) {
|
||||||
return String(query.exclude_index_pages || '').toLowerCase() !== 'false';
|
return String(query.exclude_index_pages || '').toLowerCase() !== 'false';
|
||||||
}
|
}
|
||||||
|
|
||||||
function mapNeighborsToArticles(neighbors, excludeIndexPages, limit) {
|
function mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query = {}) {
|
||||||
const ids = neighbors.map((row) => row.articleId);
|
const ids = neighbors.map((row) => row.articleId);
|
||||||
if (ids.length === 0) {
|
if (ids.length === 0) {
|
||||||
return [];
|
return [];
|
||||||
}
|
}
|
||||||
|
|
||||||
const placeholders = ids.map(() => '?').join(', ');
|
const placeholders = ids.map(() => '?').join(', ');
|
||||||
|
const conditions = [];
|
||||||
|
const params = [...ids];
|
||||||
|
|
||||||
|
conditions.push(`id IN (${placeholders})`);
|
||||||
|
conditions.push("content IS NOT NULL AND content != ''");
|
||||||
|
conditions.push('has_embedding = 1');
|
||||||
|
|
||||||
|
if (excludeIndexPages) conditions.push('is_index_page = 0');
|
||||||
|
|
||||||
|
if (query.source) {
|
||||||
|
conditions.push('source = ?');
|
||||||
|
params.push(query.source);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (query.from) {
|
||||||
|
conditions.push('pub_date >= ?');
|
||||||
|
params.push(query.from);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (query.to) {
|
||||||
|
conditions.push('pub_date <= ?');
|
||||||
|
params.push(query.to);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (query.keyword) {
|
||||||
|
const keywords = [].concat(query.keyword).map((k) => k.trim()).filter(Boolean);
|
||||||
|
const mode = String(query.keyword_mode || '').toLowerCase() === 'or' ? 'OR' : 'AND';
|
||||||
|
const clauses = keywords.map(() => '(title LIKE ? OR description LIKE ? OR content LIKE ?)');
|
||||||
|
|
||||||
|
conditions.push(`(${clauses.join(` ${mode} `)})`);
|
||||||
|
for (const kw of keywords) {
|
||||||
|
const like = `%${kw}%`;
|
||||||
|
params.push(like, like, like);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
const articles = db.prepare(`
|
const articles = db.prepare(`
|
||||||
SELECT id, title, description, content, url, normalized_title, source, pub_date, ingested_at
|
SELECT id, title, description, content, url, normalized_title, source, pub_date, ingested_at
|
||||||
FROM articles
|
FROM articles
|
||||||
WHERE id IN (${placeholders})
|
WHERE ${conditions.join(' AND ')}
|
||||||
AND content IS NOT NULL AND content != ''
|
`).all(...params);
|
||||||
AND has_embedding = 1
|
|
||||||
${excludeIndexPages ? 'AND is_index_page = 0' : ''}
|
|
||||||
`).all(...ids);
|
|
||||||
const byId = new Map(articles.map((article) => [article.id, article]));
|
const byId = new Map(articles.map((article) => [article.id, article]));
|
||||||
|
|
||||||
|
// preserve distance ordering from the vector search
|
||||||
return neighbors
|
return neighbors
|
||||||
.map((row) => {
|
.map((row) => {
|
||||||
const article = byId.get(row.articleId);
|
const article = byId.get(row.articleId);
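
The rest of this callback is not shown in the hunk. As an illustration of the "preserve distance ordering" step, the sketch below re-attaches filtered rows to the neighbor list in its original order. The helper name `orderByDistance` and the `distance` property on each neighbor row are assumptions; only `articleId` is visible in this diff.

```javascript
// Sketch: keep the vector-search order while dropping rows the SQL filters removed.
// `neighbors` is assumed to be sorted by ascending distance already.
function orderByDistance(neighbors, articles, limit) {
  const byId = new Map(articles.map((article) => [article.id, article]));
  return neighbors
    .map((row) => {
      const article = byId.get(row.articleId);
      // Neighbors that did not survive the filters have no matching article.
      return article ? { ...article, distance: row.distance } : null;
    })
    .filter(Boolean)
    .slice(0, limit);
}
```

Iterating over `neighbors` rather than over the SQL rows is what keeps the ranking: the `IN (...)` query returns rows in no guaranteed order.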
|
||||||
|
|
@ -113,7 +162,7 @@ async function articleRoutes(fastify) {
|
||||||
Math.min(limit * 5, 500)
|
Math.min(limit * 5, 500)
|
||||||
);
|
);
|
||||||
|
|
||||||
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit);
|
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
|
||||||
}
|
}
|
||||||
|
|
||||||
if (query.similar_to_article) {
|
if (query.similar_to_article) {
|
||||||
|
|
@ -130,7 +179,7 @@ async function articleRoutes(fastify) {
|
||||||
return { error: 'Embedding not found for article' };
|
return { error: 'Embedding not found for article' };
|
||||||
}
|
}
|
||||||
|
|
||||||
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit);
|
return mapNeighborsToArticles(neighbors, excludeIndexPages, limit, query);
|
||||||
}
|
}
|
||||||
|
|
||||||
const { sql, params } = buildArticlesQuery(query);
|
const { sql, params } = buildArticlesQuery(query);
|
||||||
|
|
|
||||||