Skip to main content
Version: nightly 🚧

推荐

kumosearch 可以使用向量搜索 ,根据用户行为生成推荐。这涉及构建机器学习模型来生成嵌入 ,将它们存储在 kumosearch 中,然后在 kumosearch 中进行最近邻搜索。

本文将讨论如何使用 Starspace ML 模型来生成嵌入。Transformers4Rec 也是另一个适用于此场景的 ML 模型。

我们将使用电商产品数据集来演示,这也适用于推荐文章或电影等其他领域。

Step 1: 准备数据

我们将使用 Starspace 来构建我们的 ML 模型。Starspace 训练数据集的格式为每行表示一个用户交互过的一组项目。

apple orange banana broccoli mango
cereals soda bread nuts cookies
tissue detergent butter cheese milk eggs
ice_cream milk pancake_mix muffins

在上面的示例中,第一行表示某个用户与产品 appleorangebananabroccolimango 进行了交互(查看、购买、添加到购物车等)。另一个用户(或同一用户)与产品 cerealsodabreadnutscookies 进行了交互。

tip

在本示例中,我们使用 product_name 是为了便于阅读。在实际生产环境中,建议使用产品的 ID 或 SKU 作为训练数据集的一部分。

Step 2: 设置环境

安装系统依赖项

确保有 C++11 编译器(gcc-4.6.3 或更高版本、Visual Studio 2015 或 clang-3.3 或更高版本)。在 macOS 上,需要安装 XCode;在 Linux 发行版上,使用发行版的包管理器安装 build-essential 包。

安装 Starspace

git clone https://github.com/facebookresearch/Starspace.git
cd Starspace

安装 Boost

Boost 是 Starspace 需要的一个库。进入 Starspace 目录后,运行以下命令:

curl -LO https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/boost_1_82_0.tar.gz
tar -xzvf boost_1_82_0.tar.gz

编译 Starspace

进入 Starspace 目录后,运行以下命令:

make -e BOOST_DIR=boost_1_82_0 && \
make embed_doc -e BOOST_DIR=boost_1_82_0

要验证 Starspace 是否正常工作,运行 ./starspace,应看到类似以下的输出:

$ ./starspace
Usage: need to specify whether it is train or test.

"starspace train ..." or "starspace test ..."
...

Step 3: 训练模型

使用步骤 1 中的准备的训练数据集,将文件命名为 session-data.txt,运行以下命令来训练模型:

./starspace train \
-trainFile <path/to/session-data.txt> \
-model productsModel \
-label '' \
-trainMode 1 \
-epoch 25 \
-dim 100

该命令运行后,将生成两个文件:一个名为 productsModel 的二进制文件和一个名为 productsModel.tsv 的包含模型权重的 TSV 文件。

随着用户继续访问网站/应用程序,收集新的用户数据,建议定期运行此训练步骤。

Step 4: 生成嵌入

首先从我们的训练数据集中提取所有唯一的产品:

export unique_items=$(tr ' ' '\n' < session-data.txt | sed '/^$/d' | sort -u)
export output_jsonl_file="products-with-embeddings.jsonl"

对于每个产品,生成嵌入并存储在 JSONL 文件中:

echo -n > ${output_jsonl_file}

while read -r item; do
embedding=$(echo "${item}" | ./embed_doc productsModel | tail -1 | tr ' ' ',')
echo "{\"id\":\"${item}\",\"embedding\":[${embedding%?}]}" >> "${output_jsonl_file}"
done <<< "${unique_items}"

这将生成一个类似如下内容的 JSONL 文件:

{"id":"apple","embedding":[-0.129717,0.173566,0.105385,0.0413297,-0.0290213,-0.0255852,0.0825889,-0.0261474,-0.0672213,-0.020061,-0.0227523,-0.232531,0.126667,0.053292,0.0092951,-0.117847,-0.0203866,0.067803,0.0669588,-0.0958568,-0.126915,0.120737,0.0547092,0.00512978,0.0257105,-0.0784047,-0.0348654,-0.125596,0.087177,0.132318,0.151595,-0.0326471,-0.169206,-0.00846743,0.184474,-0.148861,0.0110634,-0.0613974,0.0422888,-0.137809,0.0259965,-0.0851748,0.0202873,-0.120347,0.182447,0.110794,-0.200759,0.130639,-0.157653,-0.0173171,0.101569,-0.224391,-0.0160616,-0.0754992,-0.0967191,0.00498547,-0.144638,0.046945,-0.11333,-0.0533871,0.0118368,0.0670858,-0.0714718,-0.0474113,0.0123388,0.0553516,-0.163903,0.0201541,-0.0880148,0.0344916,-0.0213696,0.111026,0.0823451,-0.0152207,0.0427815,0.00890293,-0.163427,-0.165768,0.0409641,0.0668304,0.0868884,-0.0690655,-0.120059,-0.157864,-0.12657,0.0895369,-0.0551588,-0.138711,-0.0834502,0.0384778,-0.122425,0.00729352,-0.108975,-0.201364,-0.0596544,-0.0512629,-0.0172166,-0.147633,0.048211,0.0167111]}
{"id":"banana","embedding":[-0.151976,0.167556,0.0984403,0.0582992,-0.0267645,-0.0632901,0.0818063,-0.0577236,-0.0661825,-0.0224044,-0.0083418,-0.235444,0.106222,0.098582,-0.0422992,-0.16124,-0.0754309,0.0859816,0.0505005,-0.0773229,-0.0878463,0.126327,0.0319473,0.0662375,0.0288876,-0.099176,-0.0356668,-0.118937,0.085241,0.145321,0.127992,-0.0275212,-0.164231,0.007687,0.15164,-0.149566,0.0513335,-0.0522685,0.00915292,-0.127394,0.0438007,-0.0371664,0.0524856,-0.126597,0.187275,0.0891057,-0.229951,0.138657,-0.146845,0.0245155,0.0622531,-0.22075,-0.0497431,-0.0837679,-0.092076,0.00150625,-0.158956,-0.00107306,-0.104141,-0.0596481,0.00658634,0.0868983,-0.0158821,-0.0623965,0.0369001,0.0743422,-0.218009,0.0531221,-0.033627,0.036802,-0.0232749,0.149437,0.0692087,0.0290572,-0.00513245,-0.0166902,-0.162802,-0.15936,0.0567595,0.101776,0.125398,-0.114981,-0.0962633,-0.110599,-0.0872082,0.0987341,-0.0343689,-0.0974114,-0.0465906,0.00473119,-0.133105,0.0337405,-0.0637639,-0.194511,-0.0586302,0.0310114,0.004405,-0.108879,-0.0131596,0.0469659]}

我们现在可以将此 JSONL 文件导入到 kumosearch 中。

Step 5: 进行索引

对生成的嵌入进行索引。

export KUMOSEARCH_API_KEY=xyz
export KUMOSEARCH_URL='https://xyz.a1.kumosearch.net'

创建一个索引表:

curl "${KUMOSEARCH_URL}/collections" \
-X POST \
-H "Content-Type: application/json" \
-H "X-KUMOSEARCH-API-KEY: ${KUMOSEARCH_API_KEY}" \
-d '{
"name": "products",
"fields": [
{"name": "name", "type": "string", "optional": true},
{"name": "description", "type": "string", "optional": true},
{"name": "price", "type": "float", "optional": true},
{"name": "categories", "type": "string[]", "optional": true},
{"name": "embedding", "type": "float[]", "num_dim": 100 }
]
}'

注意,我们设置了 num_dim: 100,这与训练 Starspace 模型时设置的 -dim 100 参数一致。

将带有嵌入的 JSONL 文件导入索引表:

curl -H "X-KUMOSEARCH-API-KEY: ${KUMOSEARCH_API_KEY}" \
-X POST \
-T products-with-embeddings.jsonl \
"${KUMOSEARCHURL}/collections/products/documents/import?action=emplace"

在此处,我们只插入了每个产品的嵌入。您还可以使用相同的 id 字段分别导入其他值,例如名称、描述、价格、类别等,以填充产品记录的其他部分。

Step 6: 生成推荐

假设用户登陆我们的网站/应用程序并与以下产品进行了交互:

mango broccoli milk

要根据用户数据生成推荐,首先生成嵌入:

export embedding=$(echo "mango broccoli milk" | ./embed_doc productsModel | tail -1 | tr ' ' ',')
echo ${embedding}
-0.0862846,0.127956,0.0558543,0.0745331,0.02449,-0.131018,0.0886827,-0.0571893,-0.0398686,-0.0116799,-0.0164978,-0.173818,0.0478985,0.109211,-0.0826394,-0.177671,-0.219366,0.180478,-0.0140154,-0.0237589,-0.010896,0.115979,-0.044924,0.129452,-0.0111529,-0.0978542,-0.121468,-0.0700872,-0.0190036,0.116127,0.0617186,-0.0463324,-0.172141,0.0302211,0.0610366,-0.0831281,0.04558,-0.00370933,-0.107602,-0.0394414,0.0334175,0.0429023,0.133572,-0.124658,0.225743,-0.0156787,-0.284864,0.148183,-0.0508378,0.175489,-0.0417769,-0.0920536,-0.0443016,-0.0838343,-0.0694042,-0.0333535,-0.108574,-0.0894618,-0.022049,-0.0500605,-0.0234268,0.00732048,0.0817547,0.00764651,0.0285933,0.100818,-0.229398,0.0508415,0.117766,-0.0289333,-0.0493134,0.167664,0.0696889,0.115228,-0.0609508,-0.12562,-0.0450054,-0.0648439,0.0817176,0.169663,0.133255,-0.111001,-0.0467052,-0.0373238,0.005385,0.111311,-0.0171787,0.0311545,0.0474074,-0.0301008,-0.0555648,0.0776044,-0.0287841,-0.162136,-0.0511268,0.174767,-0.0169033,-0.0223623,-0.140496,0.154727

接着,将为用户数据生成的嵌入作为 查询向量 发送到 kumosearch 进行最近邻搜索,得到要推荐给用户的产品列表:

curl "${KUMOSEARCH_URL}/multi_search" \
-X POST \
-H "Content-Type: application/json" \
-H "X-KUMOSEARCH-API-KEY: ${KUMOSEARCH_API_KEY}" \
-d '{
"searches": [
{
"q": "*",
"collection": "products",
"vector_query": "embedding:(['${embedding%?}'], k:10, distance_threshold: 1)"
}
]
}' | jq '.results[0].hits[] | .document.id'

这将返回以下结果。在过滤掉用户已看到的项目后,在用户界面显示这些推荐。

"broccoli"
"mango"
"banana"
"apple"
"orange"
"tissue"
"detergent"
"cheese"
"milk"
"butter"
tip

在本示例中,我们使用产品名称是为了便于阅读。在实际生产环境中,建议使用产品的 ID 或 SKU,并生成相应的嵌入。

sku_1 sku_4 sku_5
sku_5 sku_8 sku_1 sku_2 sku_10
sku_5 sku_1 sku_4 sku_21 sku_22