43 - Amazon S3 Vectors¶

Amazon S3 Vectors provides cost-optimized native vector storage on S3 for similarity search and RAG. The hierarchy is vector bucket → vector index → vectors, where each vector is a tuple of (key, float32[], metadata).

AWS SDK for pandas wraps the s3vectors boto3 service and exposes 14 functions on wr.s3 covering the full bucket / index / data lifecycle, with DataFrame-friendly I/O and optional on-the-fly embedding via Amazon Bedrock.

[1]:

import getpass

import pandas as pd

import awswrangler as wr

[4]:

bucket_name = getpass.getpass("Enter a vector bucket name:")
index_name = "tutorial"
# Match the embedding model's output dimension. Titan v2 supports 256, 512, or 1024.
dimension = 256

Creating resources¶

A vector bucket holds many indexes; an index has a fixed dimension and distance metric and holds the actual vectors. All vectors written to an index must match its dimension.
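
Because the dimension is fixed per index, it can be worth checking precomputed embeddings locally before writing them. The following is a minimal sketch rather than part of awswrangler: `check_dimension`, the `sample` DataFrame, and the `embedding` column name are hypothetical and exist only for illustration.

```python
import pandas as pd


def check_dimension(df: pd.DataFrame, vector_column: str, dimension: int) -> pd.DataFrame:
    """Return the rows whose vector length differs from the index dimension."""
    lengths = df[vector_column].map(len)
    return df[lengths != dimension]


# Hypothetical sample: row "b" has 512 values, the others have 256.
sample = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "embedding": [[0.0] * 256, [0.0] * 512, [0.0] * 256],
    }
)
bad_rows = check_dimension(sample, vector_column="embedding", dimension=256)
print(bad_rows["id"].tolist())  # ['b']
```

An empty result means every row matches the index dimension.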

[5]:

bucket_arn = wr.s3.create_vector_bucket(name=bucket_name)
print(f"Vector bucket ARN: {bucket_arn}")

Vector bucket ARN: arn:aws:s3vectors:eu-west-1:123456789012:bucket/test-aws-sdk-pandas-s3-vectors-1

[6]:

index_arn = wr.s3.create_vector_index(
    vector_bucket=bucket_name,
    name=index_name,
    dimension=dimension,
    distance_metric="cosine",
)
print(f"Vector index ARN: {index_arn}")

Vector index ARN: arn:aws:s3vectors:eu-west-1:123456789012:bucket/test-aws-sdk-pandas-s3-vectors-1/index/tutorial

Discovery¶

list_vector_buckets, get_vector_bucket, list_vector_indexes, and get_vector_index describe the vector buckets in an account and the indexes in a bucket. The list operations paginate internally, so a single call returns every result.

[7]:

print("Buckets:", [b["vectorBucketName"] for b in wr.s3.list_vector_buckets()])
print("This bucket:", wr.s3.get_vector_bucket(name=bucket_name))
print("Indexes:", [i["indexName"] for i in wr.s3.list_vector_indexes(vector_bucket=bucket_name)])
wr.s3.get_vector_index(name=index_name, vector_bucket=bucket_name)

Buckets: ['test-aws-sdk-pandas-s3-vectors-1']
This bucket: {'vectorBucketName': 'test-aws-sdk-pandas-s3-vectors-1', 'vectorBucketArn': 'arn:aws:s3vectors:eu-west-1:123456789012:bucket/test-aws-sdk-pandas-s3-vectors-1', 'creationTime': datetime.datetime(2026, 5, 10, 0, 47, 57, tzinfo=tzlocal()), 'encryptionConfiguration': {'sseType': 'AES256'}}
Indexes: ['tutorial']

[7]:

{'vectorBucketName': 'test-aws-sdk-pandas-s3-vectors-1',
 'indexName': 'tutorial',
 'indexArn': 'arn:aws:s3vectors:eu-west-1:123456789012:bucket/test-aws-sdk-pandas-s3-vectors-1/index/tutorial',
 'creationTime': datetime.datetime(2026, 5, 10, 0, 48, tzinfo=tzlocal()),
 'dataType': 'float32',
 'dimension': 256,
 'distanceMetric': 'cosine',
 'encryptionConfiguration': {'sseType': 'AES256'}}
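
To illustrate what "paginate internally" means, here is a rough sketch of a token-following loop. It does not show awswrangler's actual implementation. The `nextToken` and `vectorBuckets` field names are assumptions about the underlying service response, and `fake_list_vector_buckets` is a made-up stand-in, so the example runs without AWS access.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def paginate(call: Callable[..., Dict[str, Any]], result_key: str, **kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Yield items from every page by following the `nextToken` field."""
    token = None
    while True:
        page_kwargs = dict(kwargs)
        if token is not None:
            page_kwargs["nextToken"] = token
        page = call(**page_kwargs)
        yield from page.get(result_key, [])
        token = page.get("nextToken")
        if not token:
            break


# Fake two-page API used only to exercise the loop (not a real AWS call).
_PAGES = {
    None: {"vectorBuckets": [{"vectorBucketName": "bucket-a"}], "nextToken": "t1"},
    "t1": {"vectorBuckets": [{"vectorBucketName": "bucket-b"}]},
}


def fake_list_vector_buckets(nextToken: Optional[str] = None) -> Dict[str, Any]:
    return _PAGES[nextToken]


names = [b["vectorBucketName"] for b in paginate(fake_list_vector_buckets, "vectorBuckets")]
print(names)  # ['bucket-a', 'bucket-b']
```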

Writing vectors¶

put_vectors_from_df is the DataFrame entry point. The most natural workflow is to pass text_column together with a Bedrock embedding model (bedrock_model_id): awswrangler then calls Bedrock for each row and writes the resulting vectors, storing the remaining columns as metadata.

Prerequisite: the calling identity needs bedrock:InvokeModel permission and Bedrock model access for the chosen embedding model in the current region (request access in the Bedrock console). Supported model id prefixes: amazon.titan-embed-text-*, cohere.embed-*.

If you already have embeddings (computed by your own model or a different SDK), pass them as a column instead via vector_column:

wr.s3.put_vectors_from_df(
    df=my_df,
    key_column="id",
    vector_column="embedding",  # precomputed list[float] / np.ndarray per row
    vector_bucket=bucket_name,
    index=index_name,
)

[9]:

df = pd.DataFrame(
    {
        "id": ["m-1", "m-2", "m-3", "m-4", "m-5"],
        "title": [
            "A wildlife photographer documents the rebirth of a forest after a wildfire.",
            "An investigative reporter uncovers fraud at a multinational bank.",
            "A coming-of-age comedy about siblings inheriting a struggling family bakery.",
            "A heart-warming tale of an elderly widower and a stray dog.",
            "A documentary tracing the history of jazz from New Orleans to Tokyo.",
        ],
        "genre": ["documentary", "drama", "comedy", "drama", "documentary"],
        "year": [2022, 2019, 2024, 2023, 2018],
    }
)
df

[9]:

	id	title	genre	year
0	m-1	A wildlife photographer documents the rebirth ...	documentary	2022
1	m-2	An investigative reporter uncovers fraud at a ...	drama	2019
2	m-3	A coming-of-age comedy about siblings inheriti...	comedy	2024
3	m-4	A heart-warming tale of an elderly widower and...	drama	2023
4	m-5	A documentary tracing the history of jazz from...	documentary	2018

[10]:

wr.s3.put_vectors_from_df(
    df=df,
    key_column="id",
    text_column="title",
    bedrock_model_id="amazon.titan-embed-text-v2:0",
    bedrock_model_kwargs={"dimensions": dimension},
    vector_bucket=bucket_name,
    index=index_name,
)
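
For intuition about the per-row Bedrock call, the sketch below builds the kind of request body that Titan Text Embeddings V2 accepts. It is an illustration under assumptions, not awswrangler's internal code: `titan_v2_request_body` and `embed_with_bedrock` are hypothetical helpers, and the second one is defined but not executed here because it needs AWS credentials and model access.

```python
import json
from typing import List


def titan_v2_request_body(text: str, dimensions: int) -> str:
    """Build the JSON request body for one Titan Text Embeddings V2 call."""
    return json.dumps({"inputText": text, "dimensions": dimensions})


def embed_with_bedrock(texts: List[str], model_id: str, dimensions: int) -> List[List[float]]:
    """Embed each text with one InvokeModel call (needs AWS credentials; not run here)."""
    import boto3  # imported lazily so the rest of the sketch runs offline

    client = boto3.client("bedrock-runtime")
    vectors = []
    for text in texts:
        response = client.invoke_model(modelId=model_id, body=titan_v2_request_body(text, dimensions))
        vectors.append(json.loads(response["body"].read())["embedding"])
    return vectors


body = titan_v2_request_body("A documentary tracing the history of jazz.", dimensions=256)
print(body)
```

The `dimensions` value plays the same role as `bedrock_model_kwargs={"dimensions": dimension}` in the cell above: it has to agree with the dimension of the target index.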

Querying by text¶

query_vectors runs an approximate-nearest-neighbour search. Pass query_text + bedrock_model_id to embed the query on the fly, or query_vector to query with a precomputed embedding (shown later).

[11]:

wr.s3.query_vectors(
    query_text="a touching story about loneliness and companionship",
    bedrock_model_id="amazon.titan-embed-text-v2:0",
    bedrock_model_kwargs={"dimensions": dimension},
    top_k=3,
    return_distance=True,
    return_metadata=True,
    vector_bucket=bucket_name,
    index=index_name,
)

[11]:

	key	distance	metadata
0	m-4	0.560630	{'year': 2023, 'genre': 'drama'}
1	m-2	0.701137	{'genre': 'drama', 'year': 2019}
2	m-5	0.751206	{'year': 2018, 'genre': 'documentary'}
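
Rows come back ordered by increasing distance, so the first row is the closest match. As a rough guide to reading the distance column, the sketch below assumes that cosine distance is defined as 1 minus cosine similarity; that definition is an assumption here, not something this tutorial states about the service. `cosine_distance` is a hypothetical local helper.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
print(cosine_distance(query, [2.0, 0.0]))   # same direction -> 0.0
print(cosine_distance(query, [0.0, 3.0]))   # orthogonal -> 1.0
print(cosine_distance(query, [-1.0, 0.0]))  # opposite direction -> 2.0
```

Under that definition the magnitude of a vector does not matter, only its direction.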

Filtering on metadata¶

Filters use MongoDB-style operators ($eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $exists, $and, $or). They are evaluated during the search, not as a post-filter.

[12]:

wr.s3.query_vectors(
    query_text="music history",
    bedrock_model_id="amazon.titan-embed-text-v2:0",
    bedrock_model_kwargs={"dimensions": dimension},
    top_k=5,
    filter={"$and": [{"genre": {"$eq": "documentary"}}, {"year": {"$gte": 2020}}]},
    return_metadata=True,
    vector_bucket=bucket_name,
    index=index_name,
)

[12]:

	key	distance	metadata
0	m-1	0.776892	{'year': 2022, 'genre': 'documentary'}
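
To make the operator semantics concrete, here is a small local evaluator applied to the tutorial's metadata. It is only a sketch of how such filters can be read, not the service's implementation: `matches` is a hypothetical helper, and the handling of bare values and missing keys is an assumption marked in the comments.

```python
from typing import Any, Dict

_COMPARATORS = {
    "$eq": lambda value, arg: value == arg,
    "$ne": lambda value, arg: value != arg,
    "$gt": lambda value, arg: value > arg,
    "$gte": lambda value, arg: value >= arg,
    "$lt": lambda value, arg: value < arg,
    "$lte": lambda value, arg: value <= arg,
    "$in": lambda value, arg: value in arg,
    "$nin": lambda value, arg: value not in arg,
}


def matches(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies the filter document `flt`."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            # Assumption: a bare value is shorthand for {"$eq": value}.
            ops = condition if isinstance(condition, dict) else {"$eq": condition}
            for op, arg in ops.items():
                if op == "$exists":
                    if (key in metadata) != bool(arg):
                        return False
                # Assumption: a missing key fails every comparison operator.
                elif key not in metadata or not _COMPARATORS[op](metadata[key], arg):
                    return False
    return True


records = {
    "m-1": {"genre": "documentary", "year": 2022},
    "m-2": {"genre": "drama", "year": 2019},
    "m-3": {"genre": "comedy", "year": 2024},
    "m-4": {"genre": "drama", "year": 2023},
    "m-5": {"genre": "documentary", "year": 2018},
}
flt = {"$and": [{"genre": {"$eq": "documentary"}}, {"year": {"$gte": 2020}}]}
selected = [key for key, metadata in records.items() if matches(metadata, flt)]
print(selected)  # ['m-1']
```

Only m-1 is both a documentary and from 2020 or later, which agrees with the single row returned by the query above.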

Working with vectors directly¶

get_vectors retrieves vectors by key, optionally returning the embedding data and/or metadata. This is useful for inspection, re-querying, or exporting subsets. Rows may come back in a different order from keys, as in the output below.

[13]:

fetched = wr.s3.get_vectors(
    keys=["m-1", "m-3"],
    return_data=True,
    return_metadata=True,
    vector_bucket=bucket_name,
    index=index_name,
)
fetched

[13]:

	key	vector	metadata
0	m-3	[-0.03277716785669327, 0.10634636878967285, 0....	{'year': 2024, 'genre': 'comedy'}
1	m-1	[-0.12095249444246292, 0.10887987911701202, 0....	{'year': 2022, 'genre': 'documentary'}

Querying with a precomputed vector¶

Any list[float] / np.ndarray of the right dimension works. Here we re-use a vector we just fetched: fetched.iloc[0] is m-3 in the output above, so m-3 comes back first with a near-zero distance. In practice, this is where you’d plug in embeddings from your own model.

[14]:

wr.s3.query_vectors(
    query_vector=fetched.iloc[0]["vector"],
    top_k=3,
    return_distance=True,
    return_metadata=True,
    vector_bucket=bucket_name,
    index=index_name,
)

[14]:

	key	distance	metadata
0	m-3	0.000384	{'genre': 'comedy', 'year': 2024}
1	m-2	0.733959	{'genre': 'drama', 'year': 2019}
2	m-4	0.755624	{'year': 2023, 'genre': 'drama'}

Deleting vectors by key¶

[15]:

wr.s3.delete_vectors(
    keys=["m-3"],
    vector_bucket=bucket_name,
    index=index_name,
)

Bulk export¶

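list_vectors walks the entire index. With use_threads enabled it parallelises across up to 16 segments under the hood; the cell below passes an integer (use_threads=4), which awswrangler functions conventionally treat as an explicit thread count. Pass return_data=True / return_metadata=True to include the full payload.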
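
The segment-based parallelism can be pictured roughly as follows. This is a sketch under assumptions, not awswrangler's code: it assumes the underlying list operation can be asked for one segment out of N (a segment index plus a segment count), and `list_segment` is a made-up stand-in that slices a local list, so the way it splits keys says nothing about how the service partitions an index.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

ALL_KEYS = [f"m-{i}" for i in range(1, 11)]  # pretend index contents


def list_segment(segment_index: int, segment_count: int) -> List[str]:
    """Hypothetical stand-in for listing one segment of the index."""
    return ALL_KEYS[segment_index::segment_count]


def list_all(segment_count: int = 4, max_workers: int = 4) -> List[str]:
    """Read every segment in its own thread and concatenate the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(lambda i: list_segment(i, segment_count), range(segment_count)))
    return [key for part in parts for key in part]


all_keys = list_all()
print(len(all_keys))  # 10
```

Each key appears in exactly one segment, so the concatenated result covers the index once, although not necessarily in key order.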

[16]:

wr.s3.list_vectors(
    return_metadata=True,
    vector_bucket=bucket_name,
    index=index_name,
    use_threads=4,
)

[16]:

	key	metadata
0	m-1	{'year': 2022, 'genre': 'documentary'}
1	m-5	{'genre': 'documentary', 'year': 2018}
2	m-4	{'year': 2023, 'genre': 'drama'}
3	m-2	{'genre': 'drama', 'year': 2019}
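
The metadata column holds one dict per row. For downstream analysis it is often handier to spread it into ordinary columns. A minimal sketch with pandas follows; the `listed` DataFrame is rebuilt by hand from two rows of the output above so the example runs without AWS access.

```python
import pandas as pd

# Hand-built copy of two rows from the list_vectors output above.
listed = pd.DataFrame(
    {
        "key": ["m-1", "m-5"],
        "metadata": [
            {"year": 2022, "genre": "documentary"},
            {"genre": "documentary", "year": 2018},
        ],
    }
)

# Expand each metadata dict into its own columns and attach them to the keys.
flat = pd.concat(
    [listed[["key"]], pd.json_normalize(listed["metadata"].tolist())],
    axis=1,
)
print(flat.columns.tolist())  # ['key', 'year', 'genre']
```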

Cleanup¶

[17]:

wr.s3.delete_vector_index(name=index_name, vector_bucket=bucket_name)
wr.s3.delete_vector_bucket(name=bucket_name)