43 - Amazon S3 Vectors¶
Amazon S3 Vectors provides cost-optimized native vector storage on S3 for similarity search and RAG. The hierarchy is vector bucket → vector index → vectors, where each vector is a tuple of (key, float32[], metadata).
AWS SDK for pandas wraps the s3vectors boto3 service and exposes 14 functions on wr.s3 covering the full bucket / index / data lifecycle, with DataFrame-friendly I/O and optional on-the-fly embedding via Amazon Bedrock.
[1]:
import getpass
import pandas as pd
import awswrangler as wr
[4]:
bucket_name = getpass.getpass("Enter a vector bucket name:")
index_name = "tutorial"
# Match the embedding model's output dimension. Titan v2 supports 256, 512, or 1024.
dimension = 256
Creating resources¶
A vector bucket holds many indexes; an index has a fixed dimension and distance metric and holds the actual vectors. All vectors written to an index must match its dimension.
[5]:
bucket_arn = wr.s3.create_vector_bucket(name=bucket_name)
print(f"Vector bucket ARN: {bucket_arn}")
Vector bucket ARN: arn:aws:s3vectors:eu-west-1:123456789012:bucket/test-aws-sdk-pandas-s3-vectors-1
[6]:
index_arn = wr.s3.create_vector_index(
vector_bucket=bucket_name,
name=index_name,
dimension=dimension,
distance_metric="cosine",
)
print(f"Vector index ARN: {index_arn}")
Vector index ARN: arn:aws:s3vectors:eu-west-1:123456789012:bucket/test-aws-sdk-pandas-s3-vectors-1/index/tutorial
Discovery¶
list_vector_buckets, get_vector_bucket, list_vector_indexes, and get_vector_index describe the vector buckets in an account and the indexes in a bucket. List operations paginate internally, so each call returns the complete listing.
[7]:
print("Buckets:", [b["vectorBucketName"] for b in wr.s3.list_vector_buckets()])
print("This bucket:", wr.s3.get_vector_bucket(name=bucket_name))
print("Indexes:", [i["indexName"] for i in wr.s3.list_vector_indexes(vector_bucket=bucket_name)])
wr.s3.get_vector_index(name=index_name, vector_bucket=bucket_name)
Buckets: ['test-aws-sdk-pandas-s3-vectors-1']
This bucket: {'vectorBucketName': 'test-aws-sdk-pandas-s3-vectors-1', 'vectorBucketArn': 'arn:aws:s3vectors:eu-west-1:123456789012:bucket/test-aws-sdk-pandas-s3-vectors-1', 'creationTime': datetime.datetime(2026, 5, 10, 0, 47, 57, tzinfo=tzlocal()), 'encryptionConfiguration': {'sseType': 'AES256'}}
Indexes: ['tutorial']
[7]:
{'vectorBucketName': 'test-aws-sdk-pandas-s3-vectors-1',
'indexName': 'tutorial',
'indexArn': 'arn:aws:s3vectors:eu-west-1:123456789012:bucket/test-aws-sdk-pandas-s3-vectors-1/index/tutorial',
'creationTime': datetime.datetime(2026, 5, 10, 0, 48, tzinfo=tzlocal()),
'dataType': 'float32',
'dimension': 256,
'distanceMetric': 'cosine',
'encryptionConfiguration': {'sseType': 'AES256'}}
Writing vectors¶
put_vectors_from_df is the DataFrame entry point. The most natural workflow is to pass text_column together with a Bedrock embedding model: awswrangler then calls Bedrock for each row and writes the resulting vectors, storing the remaining columns (here genre and year) as metadata.
Prerequisite: the calling identity needs the bedrock:InvokeModel permission and Bedrock model access for the chosen embedding model in the current region (request access in the Bedrock console). Supported model ID prefixes: amazon.titan-embed-text-*, cohere.embed-*.
If you already have embeddings (computed by your own model or a different SDK), pass them as a column instead via vector_column:
wr.s3.put_vectors_from_df(
df=my_df,
key_column="id",
vector_column="embedding", # precomputed list[float] / np.ndarray per row
vector_bucket=bucket_name,
index=index_name,
)
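As a rough sketch of what such a DataFrame could look like, the block below builds `my_df` with one float32 vector per row and checks that every vector matches the index dimension before anything is written. The random vectors and the `doc-*` keys are hypothetical placeholders for real embeddings, and the length check is our own precaution, not something the library requires you to run.

```python
import numpy as np
import pandas as pd

dimension = 256  # must equal the dimension the index was created with

# Hypothetical stand-in for real embeddings: random float32 vectors.
rng = np.random.default_rng(seed=0)
my_df = pd.DataFrame(
    {
        "id": ["doc-1", "doc-2", "doc-3"],
        "embedding": [
            rng.standard_normal(dimension).astype(np.float32) for _ in range(3)
        ],
    }
)

# Collect the keys of any rows whose vector length differs from the index dimension.
lengths = my_df["embedding"].map(len)
bad_keys = my_df.loc[lengths != dimension, "id"].tolist()
if bad_keys:
    raise ValueError(f"vectors with wrong dimension for keys: {bad_keys}")
```

Running a check like this locally reports all offending keys at once instead of failing partway through a write.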
[9]:
df = pd.DataFrame(
{
"id": ["m-1", "m-2", "m-3", "m-4", "m-5"],
"title": [
"A wildlife photographer documents the rebirth of a forest after a wildfire.",
"An investigative reporter uncovers fraud at a multinational bank.",
"A coming-of-age comedy about siblings inheriting a struggling family bakery.",
"A heart-warming tale of an elderly widower and a stray dog.",
"A documentary tracing the history of jazz from New Orleans to Tokyo.",
],
"genre": ["documentary", "drama", "comedy", "drama", "documentary"],
"year": [2022, 2019, 2024, 2023, 2018],
}
)
df
[9]:
| | id | title | genre | year |
|---|---|---|---|---|
| 0 | m-1 | A wildlife photographer documents the rebirth ... | documentary | 2022 |
| 1 | m-2 | An investigative reporter uncovers fraud at a ... | drama | 2019 |
| 2 | m-3 | A coming-of-age comedy about siblings inheriti... | comedy | 2024 |
| 3 | m-4 | A heart-warming tale of an elderly widower and... | drama | 2023 |
| 4 | m-5 | A documentary tracing the history of jazz from... | documentary | 2018 |
[10]:
wr.s3.put_vectors_from_df(
df=df,
key_column="id",
text_column="title",
bedrock_model_id="amazon.titan-embed-text-v2:0",
bedrock_model_kwargs={"dimensions": dimension},
vector_bucket=bucket_name,
index=index_name,
)
Querying by text¶
query_vectors runs an approximate-nearest-neighbour search. Pass query_text + bedrock_model_id to embed the query on the fly, or query_vector to query with a precomputed embedding (shown later).
[11]:
wr.s3.query_vectors(
query_text="a touching story about loneliness and companionship",
bedrock_model_id="amazon.titan-embed-text-v2:0",
bedrock_model_kwargs={"dimensions": dimension},
top_k=3,
return_distance=True,
return_metadata=True,
vector_bucket=bucket_name,
index=index_name,
)
[11]:
| | key | distance | metadata |
|---|---|---|---|
| 0 | m-4 | 0.560630 | {'year': 2023, 'genre': 'drama'} |
| 1 | m-2 | 0.701137 | {'genre': 'drama', 'year': 2019} |
| 2 | m-5 | 0.751206 | {'year': 2018, 'genre': 'documentary'} |
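To help read the distance column, here is a minimal sketch of cosine distance, assuming the usual convention of 1 minus cosine similarity. The tutorial does not state the exact formula the service uses, so treat this as an illustration of the metric rather than the service's implementation. `cosine_distance` is a helper defined here, not part of awswrangler.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Same direction (different length), orthogonal, and opposite vectors.
same = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```

Under this convention, smaller values mean closer matches: vectors pointing the same way score near 0, unrelated ones near 1, and opposite ones near 2.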
Filtering on metadata¶
Filters use MongoDB-style operators ($eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $exists, $and, $or). They are evaluated during the search, not as a post-filter.
[12]:
wr.s3.query_vectors(
query_text="music history",
bedrock_model_id="amazon.titan-embed-text-v2:0",
bedrock_model_kwargs={"dimensions": dimension},
top_k=5,
filter={"$and": [{"genre": {"$eq": "documentary"}}, {"year": {"$gte": 2020}}]},
return_metadata=True,
vector_bucket=bucket_name,
index=index_name,
)
[12]:
| | key | distance | metadata |
|---|---|---|---|
| 0 | m-1 | 0.776892 | {'year': 2022, 'genre': 'documentary'} |
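To see why only m-1 qualifies, the filter can be evaluated locally against the tutorial's metadata. The evaluator below is a simplified sketch for reasoning about filters and is not how the service implements them. `matches` and `_check` are hypothetical helpers, and the handling of missing fields for `$ne` and `$nin` is an assumption.

```python
from typing import Any

# Ordering operators; they only match when the field is present.
_ORDERED = {
    "$gt": lambda v, a: v > a,
    "$gte": lambda v, a: v >= a,
    "$lt": lambda v, a: v < a,
    "$lte": lambda v, a: v <= a,
}


def _check(metadata: dict[str, Any], field: str, op: str, arg: Any) -> bool:
    """Evaluate a single operator against one metadata field."""
    present = field in metadata
    value = metadata.get(field)
    if op == "$exists":
        return present == bool(arg)
    if op == "$eq":
        return present and value == arg
    if op == "$ne":  # assumption: a missing field counts as "not equal"
        return not present or value != arg
    if op == "$in":
        return present and value in arg
    if op == "$nin":  # assumption: a missing field counts as "not in"
        return not present or value not in arg
    if op in _ORDERED:
        return present and _ORDERED[op](value, arg)
    raise ValueError(f"unsupported operator: {op}")


def matches(metadata: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Return True if the metadata satisfies every clause of the filter."""
    for name, cond in flt.items():
        if name == "$and":
            ok = all(matches(metadata, sub) for sub in cond)
        elif name == "$or":
            ok = any(matches(metadata, sub) for sub in cond)
        else:
            ok = all(_check(metadata, name, op, arg) for op, arg in cond.items())
        if not ok:
            return False
    return True


# Metadata as written by the tutorial (genre and year per key).
records = {
    "m-1": {"genre": "documentary", "year": 2022},
    "m-2": {"genre": "drama", "year": 2019},
    "m-3": {"genre": "comedy", "year": 2024},
    "m-4": {"genre": "drama", "year": 2023},
    "m-5": {"genre": "documentary", "year": 2018},
}
flt = {"$and": [{"genre": {"$eq": "documentary"}}, {"year": {"$gte": 2020}}]}
matching_keys = [key for key, meta in records.items() if matches(meta, flt)]
print(matching_keys)  # ['m-1']
```

m-5 is also a documentary, but its year (2018) fails the `$gte` clause, which is why the query returns a single row even with top_k=5.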
Working with vectors directly¶
get_vectors retrieves vectors by key, optionally returning the embedding data and/or metadata. This is useful for inspection, re-querying, or exporting subsets.
[13]:
fetched = wr.s3.get_vectors(
keys=["m-1", "m-3"],
return_data=True,
return_metadata=True,
vector_bucket=bucket_name,
index=index_name,
)
fetched
[13]:
| | key | vector | metadata |
|---|---|---|---|
| 0 | m-3 | [-0.03277716785669327, 0.10634636878967285, 0.... | {'year': 2024, 'genre': 'comedy'} |
| 1 | m-1 | [-0.12095249444246292, 0.10887987911701202, 0.... | {'year': 2022, 'genre': 'documentary'} |
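Because metadata comes back as one dict per row, it is often handier to spread it into ordinary columns. The sketch below does this with plain pandas on `fetched_example`, a hypothetical stand-in that mirrors the columns shown above (with shortened vectors) so it runs without AWS access.

```python
import pandas as pd

# Hypothetical stand-in with the same columns as the result above.
fetched_example = pd.DataFrame(
    {
        "key": ["m-3", "m-1"],
        "vector": [[-0.03, 0.11], [-0.12, 0.11]],
        "metadata": [
            {"year": 2024, "genre": "comedy"},
            {"year": 2022, "genre": "documentary"},
        ],
    }
)

# Expand each metadata dict into columns, aligned on the original index.
metadata_cols = pd.DataFrame(
    fetched_example["metadata"].tolist(), index=fetched_example.index
)
flat = pd.concat([fetched_example.drop(columns="metadata"), metadata_cols], axis=1)
print(flat[["key", "year", "genre"]])
```

The same two lines should work on any result that carries a `metadata` column of dicts, assuming every row has metadata present.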
Querying with a precomputed vector¶
Any list[float] / np.ndarray of the right dimension works. Here we re-use a vector we just fetched — in practice, this is where you’d plug in embeddings from your own model.
[14]:
wr.s3.query_vectors(
query_vector=fetched.iloc[0]["vector"],
top_k=3,
return_distance=True,
return_metadata=True,
vector_bucket=bucket_name,
index=index_name,
)
[14]:
| | key | distance | metadata |
|---|---|---|---|
| 0 | m-3 | 0.000384 | {'genre': 'comedy', 'year': 2024} |
| 1 | m-2 | 0.733959 | {'genre': 'drama', 'year': 2019} |
| 2 | m-4 | 0.755624 | {'year': 2023, 'genre': 'drama'} |
Deleting vectors by key¶
[15]:
wr.s3.delete_vectors(
keys=["m-3"],
vector_bucket=bucket_name,
index=index_name,
)
Bulk export¶
list_vectors walks the entire index. With use_threads enabled (True or, as in the cell below, an explicit thread count) it parallelises across up to 16 segments under the hood. Pass return_data=True and/or return_metadata=True to include the embeddings and metadata alongside the keys.
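To make the idea of segment-parallel listing concrete, here is a toy simulation that makes no AWS calls. `list_segment` is a hypothetical stand-in for "list one segment of the index", and the slicing scheme is invented for illustration. How awswrangler actually divides and schedules the work is not shown in this tutorial.

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend index contents.
ALL_KEYS = [f"m-{i}" for i in range(1, 11)]


def list_segment(segment_index: int, segment_count: int) -> list[str]:
    """Hypothetical stand-in: return the keys that fall into one segment."""
    return ALL_KEYS[segment_index::segment_count]


segment_count = 4
with ThreadPoolExecutor(max_workers=segment_count) as pool:
    # Each worker lists a disjoint segment; the results are then concatenated.
    parts = pool.map(lambda i: list_segment(i, segment_count), range(segment_count))
    keys = [key for part in parts for key in part]

print(sorted(keys) == sorted(ALL_KEYS))
```

Each key appears in exactly one segment, so the combined result covers the whole index once, though not in the original order.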
[16]:
wr.s3.list_vectors(
return_metadata=True,
vector_bucket=bucket_name,
index=index_name,
use_threads=4,
)
[16]:
| | key | metadata |
|---|---|---|
| 0 | m-1 | {'year': 2022, 'genre': 'documentary'} |
| 1 | m-5 | {'genre': 'documentary', 'year': 2018} |
| 2 | m-4 | {'year': 2023, 'genre': 'drama'} |
| 3 | m-2 | {'genre': 'drama', 'year': 2019} |
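A quick sanity check after an export is to compare the exported keys with the keys originally written. The sketch below hard-codes both sets from this tutorial's data so it runs offline; in a live session you would derive them from the DataFrames instead, for example `set(df["id"])`.

```python
# Keys written earlier in the tutorial and keys returned by the export above.
written_keys = {"m-1", "m-2", "m-3", "m-4", "m-5"}
exported_keys = {"m-1", "m-5", "m-4", "m-2"}

# Keys that were written but are no longer in the index, and vice versa.
missing = sorted(written_keys - exported_keys)
unexpected = sorted(exported_keys - written_keys)
print("missing:", missing)        # missing: ['m-3']
print("unexpected:", unexpected)  # unexpected: []
```

The only missing key is m-3, which matches the earlier delete_vectors call. The export's row order differs from the write order, so a set comparison is safer than comparing lists.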
Cleanup¶
[17]:
wr.s3.delete_vector_index(name=index_name, vector_bucket=bucket_name)
wr.s3.delete_vector_bucket(name=bucket_name)
