Vespa (search engine): podcast index
After playing around with the Podcast Index and its amazing SQLite dump, I got stuck and annoyed: the search experience using SQLite and FTS5 is not that nice, and I never found a good way to do fuzzy search for podcasts. So I decided to use Vespa instead, which I had played around with before and liked.
I will not go over Vespa but instead focus on how to:
- move a SQLite dump to Vespa
- set up the Vespa schemas
- set up the Vespa rankings (we will focus further on this in follow-up posts)
Set up Vespa
So first, the schema of the SQLite db is as follows; there is only one table, called podcasts:
CREATE TABLE podcasts (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    lastUpdate INTEGER,
    link TEXT NOT NULL,
    lastHttpStatus INTEGER,
    dead INTEGER,
    contentType TEXT NOT NULL,
    itunesId INTEGER,
    originalUrl TEXT NOT NULL,
    itunesAuthor TEXT NOT NULL,
    itunesOwnerName TEXT NOT NULL,
    explicit INTEGER,
    imageUrl TEXT NOT NULL,
    itunesType TEXT NOT NULL,
    generator TEXT NOT NULL,
    newestItemPubdate INTEGER,
    language TEXT NOT NULL,
    oldestItemPubdate INTEGER,
    episodeCount INTEGER,
    popularityScore INTEGER,
    priority INTEGER,
    createdOn INTEGER,
    updateFrequency INTEGER,
    chash TEXT NOT NULL,
    host TEXT NOT NULL,
    newestEnclosureUrl TEXT NOT NULL,
    podcastGuid TEXT NOT NULL,
    description TEXT NOT NULL,
    category1 TEXT NOT NULL,
    category2 TEXT NOT NULL,
    category3 TEXT NOT NULL,
    category4 TEXT NOT NULL,
    category5 TEXT NOT NULL,
    category6 TEXT NOT NULL,
    category7 TEXT NOT NULL,
    category8 TEXT NOT NULL,
    category9 TEXT NOT NULL,
    category10 TEXT NOT NULL,
    newestEnclosureDuration INTEGER
);
After downloading the dump from here, open it read-only and inspect the schema:
sqlite3 --readonly podcastindex.db
.schema
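If you want a quick sanity check on the size of the dump, you can also count the rows from inside the sqlite3 shell (the exact count depends on which version of the dump you downloaded):

SELECT COUNT(*) FROM podcasts;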
I will use Go (just because I like it) for this, but it should be easy to ask your favorite LLM to translate it to any programming language.
Vespa uses a typed schema definition. If you feel we're glossing over the basics of Vespa, that is indeed the case: Vespa has good documentation, even though it is sometimes hard to find the right content. What we will focus on is how we construct our search, how we set up our schema, and so on.
I will not keep all the fields from the SQLite schema, just the ones I care about, in order to reduce size and simplify things as much as possible:
id INTEGER PRIMARY KEY,
url TEXT NOT NULL UNIQUE,
title TEXT NOT NULL,
lastUpdate INTEGER,
link TEXT NOT NULL,
lastHttpStatus INTEGER,
dead INTEGER,
contentType TEXT NOT NULL,
itunesId INTEGER,
originalUrl TEXT NOT NULL,
itunesAuthor TEXT NOT NULL,
itunesOwnerName TEXT NOT NULL,
explicit INTEGER,
imageUrl TEXT NOT NULL,
itunesType TEXT NOT NULL,
generator TEXT NOT NULL,
newestItemPubdate INTEGER,
language TEXT NOT NULL,
oldestItemPubdate INTEGER,
episodeCount INTEGER,
popularityScore INTEGER,
priority INTEGER,
createdOn INTEGER,
updateFrequency INTEGER,
chash TEXT NOT NULL,
host TEXT NOT NULL,
newestEnclosureUrl TEXT NOT NULL,
podcastGuid TEXT NOT NULL,
description TEXT NOT NULL,
Based upon the fields in the SQLite db, this is the initial schema we kick it off with:
schema podcast {
    document podcast {
        # Since this is the id, I want it in memory and thus use attribute;
        # this also makes it available for operations such as sorting,
        # filtering and ranking.
        # Since it is an ID I want exact matching; I make the exact-terminator
        # explicit, but it is the same as the default.
        # Attribute link: https://docs.vespa.ai/en/attributes.html
        field id type string {
            indexing: attribute | summary
            match {
                exact
                exact-terminator: "@@"
            }
        }
        field url type string {
            indexing: summary
        }
        field title type string {
            indexing: attribute | summary | index
            match: text
        }
        field last_update type long {
            indexing: summary
        }
        field link type string {
            indexing: summary
        }
        field last_http_status type int {}
        field dead type bool {
            indexing: attribute
            rank: filter
        }
        field content_type type string {}
        field original_url type string {}
        field itunes_author type string {}
        field itunes_owner_name type string {}
        field explicit type bool {}
        field image_url type string {
            indexing: summary
        }
        field itunes_type type string {}
        field generator type string {}
        field newest_item_pubdate type long {
            indexing: attribute
            attribute: fast-search
        }
        field language type string {}
        field oldest_item_pubdate type long {}
        field episode_count type int {}
        field popularity_score type double {
            indexing: summary | attribute
        }
        field priority type int {}
        field created_on type long {}
        field update_frequency type int {}
        field chash type string {}
        field podcast_host type string {}
        field newest_enclosure_url type string {}
        field podcast_guid type string {}
        field description type string {
            indexing: summary | attribute | index
            attribute: fast-search
            match: text
        }
    }
    # Vespa seems to first do a filter on the fields that we rank on;
    # it seems we cannot have the filtering without the ranking?
    fieldset prefix {
        fields: title_array, title_prefix
    }
    field title_prefix type string {
        indexing: input title | attribute | summary
    }
    # This will be used for prefix search on the podcast name
    # https://docs.vespa.ai/en/text-matching.html#prefix-match
    field title_array type array<string> {
        indexing: input title | trim | split " +" | attribute | summary
        stemming: best
        attribute: fast-search
    }
    # Document summaries are memory-only operations if all fields are attributes.
    document-summary minimal {
        summary title {}
        summary description {}
        summary newest_item_pubdate {}
        summary url {}
        summary popularity_score {}
    }
    # The match phase is done on the content nodes; we rank the content based upon recency.
    rank-profile podcast-search {
        inputs {
            query(q) string
        }
        function freshness() {
            expression: exp(-1 * age(newest_item_pubdate)/(3600*24*7)) + attribute(popularity_score)/9
        }
        # https://docs.vespa.ai/en/reference/schema-reference.html#match-phase
        # Attribute that decides which documents are a match if the match
        # phase estimates that there will be more than max-hits hits.
        match-phase {
            attribute: newest_item_pubdate
            order: descending
            max-hits: 1000
        }
        # https://docs.vespa.ai/en/tutorials/hybrid-search.html
        # Add time factor for aging content
        # https://docs.vespa.ai/en/nativerank.html#putting-our-features-together-into-a-ranking-expression
        # Figure out how to use the chunks in the first phase
        # https://pyvespa.readthedocs.io/en/latest/examples/multilingual-multi-vector-reps-with-cohere-cloud.html
        first-phase {
            expression: bm25(title) + bm25(description) * 0.2 + freshness()
        }
        match-features {
            bm25(title)
            bm25(description)
            freshness()
            query(q)
        }
    }
}
The schema contains some inline comments, but let's go over it a bit more (see the Vespa schema reference). We only selected a subset of the fields. My first impression of Vespa is that it is more low-level than, for example, Postgres in the way schemas are set up. It is also important to notice that Vespa is not a relational database or a document store but a search engine, and most of the internals are focused on search.
The first field we define is the ID, which we keep as a string. Important to notice is that we explicitly set the matching condition to exact, since we do not want text search over the IDs, only actual matches. We also set the field to summary and attribute. This was one of the most unclear things in the schema definition for me, but the best explanation can be found here, in the table below. Further reading can also be found here:
| Expression | Description |
| --- | --- |
| attribute | For structured data: Keep this field in memory in a forward structure. This makes the field available for grouping, sorting and ranking. Attributes may also be searched by complete match (word or exact), or (for numerical fields) by range. Optionally a B-tree in memory can also be created by adding the fast-search option; this improves performance if the attribute is a strong criterion in queries (i.e. filters out many documents). Reference / attribute details |
| index | For unstructured text: Create a text index for this field. Text matching and all text ranking features become available. Indexes are disk backed and do not need to fit in memory. Reference / index details |
| summary | Include this field in the document summary in search result sets. Reference / document summary details |
The difference can also be expressed as:
- attribute indexing is used for database-style string matching without linguistic processing and where the exact string contents are matched
- index indexing is used with linguistic processing such as tokenization and stemming.
So if you want to use BM25 and similar ranking functions for a field, you will need to add index. This is not of interest for the ID field where we have structured SQL-like queries only.
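To make the distinction concrete, here is roughly how the two behave at query time (a sketch using the Vespa CLI; the id is the one for Lenny's Podcast, which we will use as an example later in this post):

# Exact match against the attribute field: the full id string must match.
vespa query 'select title from podcast where id contains "1539262"'

# Tokenized match against the index field: matches documents whose title
# contains the token "podcast" after linguistic processing.
vespa query 'select title from podcast where title contains "podcast"'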
We set up the id field as follows:
field id type string {
    indexing: attribute | summary
    match {
        exact
        exact-terminator: "@@"
    }
}
Since we want to filter on it, we keep it as an attribute (an attribute field is in memory at all times) and include it as a summary field in order to return it.
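Since the id we set here doubles as the document id when we feed the data later, a single document can also be fetched directly through the document API instead of via search (a sketch, assuming the feed format used further down):

vespa document get id:podcast:podcast::1539262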
We then make a small optimization, but one that could potentially reduce the CPU load a lot, and set rank: filter:
field dead type bool {
    indexing: attribute
    rank: filter
}
This creates a bitvector field that can be used for searching. If other indexes are available, the query planner selects the best one to optimize query performance.
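As a sketch of what such a filter could look like in a query (the search term is arbitrary), the dead field can be combined with any other condition:

vespa query 'select title from podcast where title contains "history" and dead = false'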
The next interesting part is title_array, where we create a synthetic field:
field title_array type array<string> {
    indexing: input title | trim | split " +" | attribute | summary
    stemming: best
    attribute: fast-search
}
We take the title column, trim it, then split it. This is done in order to do prefix search on individual terms in the title; I will later show how we do this and also handle spelling errors. The field is not set as an index field, because regular index fields do not support prefix matching. By default, no index is added for attribute fields. By adding attribute: fast-search, a B-tree-like index is added to the attribute, which can significantly speed up search.
The title_array field only matches each word individually, and we cannot use the title field since it is both an attribute and an index field. To also allow matching against a title with multiple words, we create a synthetic field from the title field called title_prefix:
field title_prefix type string {
    indexing: input title | attribute | summary
}
We then create a document summary:
document-summary minimal {
    summary title {}
    summary description {}
    summary newest_item_pubdate {}
    summary url {}
    summary popularity_score {}
}
This is a logical grouping of the fields we want returned in results. Since this is a search engine, there is a distinction between fields used internally and fields exposed externally.
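To request this particular summary at query time, you pass the presentation.summary parameter; a minimal sketch with the Vespa CLI:

vespa query 'select * from podcast where title contains "podcast"' 'presentation.summary=minimal'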
Key to understanding Vespa is understanding what it is not: it is not a relational database, it is a search engine.
Next up is probably the most interesting part: the ranking functions. Before we look at the ranking function we build, let's spend some time looking at search as a ranking problem and how it breaks down into different phases. We will do this through the lens of Vespa, but the concepts hold more generally. In general, it is a pipeline where each step handles less and less data but potentially spends more compute per candidate, in order to get a better ranking:
- Matching based upon the incoming filter requirements, such as IDs or language. These are strict requirements that must be fulfilled.
- First reranking: done on the content nodes (more about the Vespa setup can be found here; in short, stateful content nodes store the data while stateless container nodes handle compute).
- Second reranking (also done on the content nodes).
- Global reranking, combining the results from all content nodes.
This is a good resource to read more about it. All the options for the rank-profile can be found here.
rank-profile podcast-search {
    inputs {
        query(q) string
    }
    function freshness() {
        expression: exp(-1 * age(newest_item_pubdate)/(3600*24*7)) + attribute(popularity_score)/9
    }
    # https://docs.vespa.ai/en/reference/schema-reference.html#match-phase
    # Attribute that decides which documents are a match if the match
    # phase estimates that there will be more than max-hits hits.
    match-phase {
        attribute: newest_item_pubdate
        order: descending
        max-hits: 1000
    }
    # https://docs.vespa.ai/en/tutorials/hybrid-search.html
    # Add time factor for aging content
    # https://docs.vespa.ai/en/nativerank.html#putting-our-features-together-into-a-ranking-expression
    # Figure out how to use the chunks in the first phase
    # https://pyvespa.readthedocs.io/en/latest/examples/multilingual-multi-vector-reps-with-cohere-cloud.html
    first-phase {
        expression: bm25(title) + bm25(description) * 0.2 + freshness()
    }
    match-features {
        bm25(title)
        bm25(description)
        freshness()
        query(q)
    }
}
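The profile above only has a first phase; the second and global reranking steps from the list earlier would slot in as second-phase and global-phase blocks. A hypothetical extension (not used in this post) could look like this:

rank-profile podcast-search-extended inherits podcast-search {
    # Hypothetical second phase: rerank the top hits per content node
    # with a more expensive expression. firstPhase is the score from
    # the first phase.
    second-phase {
        rerank-count: 100
        expression: firstPhase + bm25(title)
    }
    # Hypothetical global phase: runs on the stateless container after
    # merging results from all content nodes. bm25(title) and freshness()
    # are available here because the parent exports them as match-features.
    global-phase {
        rerank-count: 40
        expression: bm25(title) + freshness()
    }
}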
We will dive deeper into the ranking in a follow-up post.
To get Vespa set up and running, we will use Docker:
docker run --detach --name vespa-podcast --hostname vespa-podcast \
--publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19112:19112 --publish 127.0.0.1:19071:19071 \
vespaengine/vespa
We assume the following folder structure:
search/
├── schemas/
│   └── podcast.sd
└── services.xml
The services.xml file looks like the following (use it as an example for recreating this blog post):
<?xml version="1.0" encoding="UTF-8"?>
<services version="1.0" xmlns:deploy="vespa" xmlns:preprocess="properties">
    <container id="search" version="1.0">
        <document-api/>
        <search/>
        <document-processing/>
        <nodes>
            <node hostalias="node1"/>
        </nodes>
    </container>
    <content id="podcast" version="1.0">
        <min-redundancy>2</min-redundancy>
        <documents>
            <document type="podcast" mode="index"/>
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1"/>
        </nodes>
    </content>
</services>
You can then use the Vespa CLI from the search folder:
vespa deploy
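If the CLI is not already pointing at the local Docker container, configure the target first; the --wait flag blocks until the application is up:

vespa config set target local
vespa deploy --wait 300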
Dump the SQLite database data
Feel free to dump the SQLite db any way you want. Below is a small Go script to do it.
package main

import (
	"database/sql"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strconv"

	_ "github.com/mattn/go-sqlite3"
)
type Podcast struct {
	ID                 string  `json:"id"`
	URL                string  `json:"url"`
	Title              string  `json:"title"`
	LastUpdate         int64   `json:"last_update"`
	Link               string  `json:"link"`
	LastHTTPStatus     int     `json:"last_http_status"`
	Dead               bool    `json:"dead"`
	ContentType        string  `json:"content_type"`
	OriginalURL        string  `json:"original_url"`
	ITunesAuthor       string  `json:"itunes_author"`
	ITunesOwnerName    string  `json:"itunes_owner_name"`
	Explicit           bool    `json:"explicit"`
	ImageURL           string  `json:"image_url"`
	ITunesType         string  `json:"itunes_type"`
	Generator          string  `json:"generator"`
	NewestItemPubDate  int64   `json:"newest_item_pubdate"`
	Language           string  `json:"language"`
	OldestItemPubDate  int64   `json:"oldest_item_pubdate"`
	EpisodeCount       int     `json:"episode_count"`
	PopularityScore    float64 `json:"popularity_score"`
	Priority           int     `json:"priority"`
	CreatedOn          int64   `json:"created_on"`
	UpdateFrequency    int     `json:"update_frequency"`
	Chash              string  `json:"chash"`
	PodcastHost        string  `json:"podcast_host"`
	NewestEnclosureURL string  `json:"newest_enclosure_url"`
	PodcastGUID        string  `json:"podcast_guid"`
	Description        string  `json:"description"`
}

type VespaPut struct {
	Put    string  `json:"put"`
	Fields Podcast `json:"fields"`
}
func main() {
	if len(os.Args) != 4 {
		log.Fatal("usage: <path to db file> <output path> <number of podcasts>")
	}
	dbPath := os.Args[1]
	outputPath := os.Args[2]
	nbrOfPodcasts, err := strconv.Atoi(os.Args[3])
	if err != nil {
		log.Fatal("strconv: ", err)
	}

	db, err := sql.Open("sqlite3", dbPath)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	podcasts, err := fetchPodcasts(db, nbrOfPodcasts)
	if err != nil {
		log.Fatal("fetchPodcasts: ", err)
	}

	file, err := os.Create(outputPath)
	if err != nil {
		log.Fatal("create: ", err)
	}
	defer file.Close()

	enc := json.NewEncoder(file)
	for _, p := range podcasts {
		// Vespa document id format: id:<namespace>:<document-type>::<user-id>.
		putID := fmt.Sprintf("id:podcast:podcast::%s", p.ID)
		record := VespaPut{Put: putID, Fields: p}
		if err := enc.Encode(record); err != nil {
			log.Fatal("encode: ", err)
		}
	}
}
func fetchPodcasts(db *sql.DB, limit int) ([]Podcast, error) {
	rows, err := db.Query(`
		SELECT
			id,
			url,
			title,
			lastUpdate,
			link,
			lastHttpStatus,
			dead,
			contentType,
			originalUrl,
			itunesAuthor,
			itunesOwnerName,
			explicit,
			imageUrl,
			itunesType,
			generator,
			newestItemPubdate,
			language,
			oldestItemPubdate,
			episodeCount,
			popularityScore,
			priority,
			createdOn,
			updateFrequency,
			chash,
			host,
			newestEnclosureUrl,
			podcastGuid,
			description
		FROM podcasts
		ORDER BY id
		LIMIT ?;
	`, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []Podcast
	for rows.Next() {
		var (
			idInt              int64
			url                string
			title              string
			lastUpdate         int64
			link               string
			lastHTTPStatus     int
			deadInt            int
			contentType        string
			originalURL        string
			itunesAuthor       string
			itunesOwnerName    string
			explicitInt        int
			imageURL           string
			itunesType         string
			generator          string
			newestItemPubDate  int64
			language           string
			oldestItemPubDate  int64
			episodeCount       int
			popularityScoreInt int
			priority           int
			createdOn          int64
			updateFrequency    int
			chash              string
			host               string
			newestEnclosureURL string
			podcastGUID        string
			description        string
		)
		if err := rows.Scan(
			&idInt,
			&url,
			&title,
			&lastUpdate,
			&link,
			&lastHTTPStatus,
			&deadInt,
			&contentType,
			&originalURL,
			&itunesAuthor,
			&itunesOwnerName,
			&explicitInt,
			&imageURL,
			&itunesType,
			&generator,
			&newestItemPubDate,
			&language,
			&oldestItemPubDate,
			&episodeCount,
			&popularityScoreInt,
			&priority,
			&createdOn,
			&updateFrequency,
			&chash,
			&host,
			&newestEnclosureURL,
			&podcastGUID,
			&description,
		); err != nil {
			return nil, err
		}
		out = append(out, Podcast{
			ID:                 strconv.FormatInt(idInt, 10),
			URL:                url,
			Title:              title,
			LastUpdate:         lastUpdate,
			Link:               link,
			LastHTTPStatus:     lastHTTPStatus,
			Dead:               deadInt != 0,
			ContentType:        contentType,
			OriginalURL:        originalURL,
			ITunesAuthor:       itunesAuthor,
			ITunesOwnerName:    itunesOwnerName,
			Explicit:           explicitInt != 0,
			ImageURL:           imageURL,
			ITunesType:         itunesType,
			Generator:          generator,
			NewestItemPubDate:  newestItemPubDate,
			Language:           language,
			OldestItemPubDate:  oldestItemPubDate,
			EpisodeCount:       episodeCount,
			PopularityScore:    float64(popularityScoreInt),
			Priority:           priority,
			CreatedOn:          createdOn,
			UpdateFrequency:    updateFrequency,
			Chash:              chash,
			PodcastHost:        host,
			NewestEnclosureURL: newestEnclosureURL,
			PodcastGUID:        podcastGUID,
			Description:        description,
		})
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}
	return out, nil
}
The script is slightly longer than strictly needed in order to remap the field names (this could also be done by aliasing them in the query). Key things to notice:
- The output is JSONL.
- Each row has the shape
{"put":"id:podcast:podcast::6","fields":{ .... }}
where fields holds the actual data we care about, and the put value encodes the namespace, the document type, and the document id.
To run it:
go run . /Users/niklashansson/Documents/github/pod/backend/podcastindex_feeds.db dump.jsonl 1000000
After this, I end up with a 5.8 GB file.
Ingest the data into Vespa
vespa feed dump.jsonl
The ingest will take a little while, so let it run. It felt somewhat slow to me, but that might be due to the Docker setup and how the CLI feeds the data, since it is roughly 4.6 million podcasts (only).
This returned:
feed: got status 400 ({"pathId":"/document/v1/podcast/podcast/docid/1825377","message":"In document 'id:podcast:podcast::1825377': Could not parse field 'description' of type string: The string field value contains illegal code point 0x3"}) for put id:podcast:podcast::1825377: not retryable
feed: got status 400 ({"pathId":"/document/v1/podcast/podcast/docid/2783072","message":"In document 'id:podcast:podcast::2783072': Could not parse field 'itunes_author' of type string: The string field value contains illegal code point 0x3"}) for put id:podcast:podcast::2783072: not retryable
feed: got status 400 ({"pathId":"/document/v1/podcast/podcast/docid/5535724","message":"In document 'id:podcast:podcast::5535724': Could not parse field 'itunes_author' of type string: The string field value contains illegal code point 0x3"}) for put id:podcast:podcast::5535724: not retryable
feed: got status 400 ({"pathId":"/document/v1/podcast/podcast/docid/6420537","message":"In document 'id:podcast:podcast::6420537': Could not parse field 'itunes_author' of type string: The string field value contains illegal code point 0x8"}) for put id:podcast:podcast::6420537: not retryable
{
    "feeder.operation.count": 4625512,
    "feeder.seconds": 308.485,
    "feeder.ok.count": 4625508,
    "feeder.ok.rate": 14994.269,
    "feeder.error.count": 0,
    "feeder.inflight.count": 0,
    "http.request.count": 4625512,
    "http.request.bytes": 3448131187,
    "http.request.MBps": 11.178,
    "http.exception.count": 0,
    "http.response.count": 4625512,
    "http.response.bytes": 414304888,
    "http.response.MBps": 1.343,
    "http.response.error.count": 4,
    "http.response.latency.millis.min": 1,
    "http.response.latency.millis.avg": 34,
    "http.response.latency.millis.max": 191,
    "http.response.code.counts": {
        "200": 4625508,
        "400": 4
    }
}
This indicates that we failed on 4 podcasts. That will not matter for this blog post, so we will just skip ahead.
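One way to sanity-check that the documents actually landed in the index is a match-all query with zero hits, looking at the totalCount field of the response (a sketch; note that totalCount can be an estimate for some query types):

vespa query 'select * from podcast where true limit 0'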
Query our Vespa search engine
The easiest way to query Vespa is to use the Vespa CLI.
Let’s start by trying the prefix matching. We will use Lenny’s Podcast as our example.
As a first example, we can try to match against individual words within the podcast name, i.e. Lenny's and Podcast.
vespa query 'select title from podcast where title_array contains ({maxEditDistance: 4, prefix: true} "Lenny'\''s") and id contains "1539262"'
vespa query 'select title from podcast where title_array contains ({maxEditDistance: 4, prefix: true} "Podcast") and id contains "1539262"'
However, if we want to match against a prefix that contains both words, the title_array field will not work, since we match against an array of individually split terms.
vespa query 'select title from podcast where title_array contains ({maxEditDistance: 4, prefix: true} "Lenny'\''s Podcast") and id contains "1539262"'
A good learning here is that the title_array field will not help if the user has typed Lenny's Pod, since the title is split into separate words. Thus we want to match on both the split terms and the full title in order to get the best possible hits (depending upon your use case of course, but at least for mine).
Instead, we need to do:
vespa query 'select title from podcast where prefix contains ({maxEditDistance: 4, prefix: true} "Lenny'\''s Podcast") and id contains "1539262"'
This works since it uses the prefix fieldset for the search, which groups both the title_array and the title_prefix fields.
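To exercise the rank profile rather than just filtering, you can also pass it explicitly together with the query(q) input it declares; a minimal sketch:

vespa query 'select title from podcast where title contains "podcast"' \
    'ranking=podcast-search' \
    'input.query(q)=podcast'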
That is all for this post. In a follow-up, we will dive deeper into the ranking function.