AWS SDK for pandas

38 - OpenSearch Serverless

Amazon OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service.

Create collection

A collection in Amazon OpenSearch Serverless is a logical grouping of one or more indexes that represent an analytics workload.

Each collection must have an assigned encryption policy, a network policy, and a matching data access policy that grants permissions on its resources.

[ ]:
# Install the optional modules first
!pip install 'awswrangler[opensearch]'
[1]:
import awswrangler as wr
[8]:
data_access_policy = [
    {
        "Rules": [
            {
                "ResourceType": "index",
                "Resource": [
                    "index/my-collection/*",
                ],
                "Permission": [
                    "aoss:*",
                ],
            },
            {
                "ResourceType": "collection",
                "Resource": [
                    "collection/my-collection",
                ],
                "Permission": [
                    "aoss:*",
                ],
            },
        ],
        "Principal": [
            wr.sts.get_current_identity_arn(),
        ],
    }
]
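The policy above hard-codes the collection name in two places. A minimal sketch of a helper that builds the same structure for any collection name is shown below. The function name `build_data_access_policy` and the example principal ARN are hypothetical and are not part of AWS SDK for pandas.

```python
def build_data_access_policy(collection_name, principals):
    """Return a data access policy granting full access to one collection.

    Hypothetical helper: it mirrors the structure of the policy shown above.
    """
    return [
        {
            "Rules": [
                {
                    # Grants access to every index inside the collection
                    "ResourceType": "index",
                    "Resource": [f"index/{collection_name}/*"],
                    "Permission": ["aoss:*"],
                },
                {
                    # Grants access to the collection itself
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection_name}"],
                    "Permission": ["aoss:*"],
                },
            ],
            "Principal": list(principals),
        }
    ]


# Placeholder principal ARN (made-up account ID and role name)
example_policy = build_data_access_policy(
    "my-collection",
    ["arn:aws:iam::123456789012:role/example-role"],
)
```

In the notebook, the principal list would contain `wr.sts.get_current_identity_arn()` instead of the placeholder ARN. Note that `aoss:*` grants every permission; a narrower permission list may be preferable outside of a tutorial.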

AWS SDK for pandas can create default network and encryption policies based on the user input.

By default, the network policy allows public access to the collection, and the encryption policy encrypts the collection using an AWS-managed KMS key.

Create a collection and the corresponding data access, network, and encryption policies:

[10]:
collection = wr.opensearch.create_collection(
    name="my-collection",
    data_policy=data_access_policy,
)

collection_endpoint = collection["collectionEndpoint"]

The call blocks and returns only once the collection and its corresponding policies have been created and are active.
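To illustrate what waiting for an active collection involves, here is a generic polling sketch. It is not the library's implementation, which is not shown in this tutorial. The helper `wait_until_active` and its parameters are hypothetical; the status strings `CREATING`, `ACTIVE`, and `FAILED` are the collection statuses reported by the OpenSearch Serverless API. The status source is injected as a callable so the sketch runs without AWS access.

```python
import time


def wait_until_active(get_status, timeout_seconds=600.0, poll_seconds=5.0, sleep=time.sleep):
    """Poll get_status() until it returns "ACTIVE".

    Raises RuntimeError if the status becomes "FAILED" and TimeoutError
    if the deadline passes first. Hypothetical helper for illustration.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "ACTIVE":
            return status
        if status == "FAILED":
            raise RuntimeError("collection creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"collection still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated status sequence, standing in for repeated API calls
_statuses = iter(["CREATING", "CREATING", "ACTIVE"])
final_status = wait_until_active(lambda: next(_statuses), sleep=lambda _: None)
```

With `wr.opensearch.create_collection` this loop is not needed, since the call already waits; the sketch only makes the behavior explicit.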

To create a collection that is encrypted with a customer-managed KMS key, attached to a VPC, or both, provide the KMS key ARN and/or the VPC endpoints:

[ ]:
kms_key_arn = "arn:aws:kms:..."
vpc_endpoint = "vpce-..."

collection = wr.opensearch.create_collection(
    name="my-secure-collection",
    data_policy=data_access_policy,
    kms_key_arn=kms_key_arn,
    vpc_endpoints=[vpc_endpoint],
)
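For orientation, the sketch below builds encryption and network policy documents in the JSON shape that OpenSearch Serverless uses, for both the default case and the KMS/VPC case. The helper names are hypothetical, and the exact documents that AWS SDK for pandas generates are an assumption here; in particular, including a `dashboard` rule in the network policy is a guess.

```python
def build_encryption_policy(collection_name, kms_key_arn=None):
    """Encryption policy document (a single JSON object). Hypothetical helper."""
    policy = {
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}
        ]
    }
    if kms_key_arn is None:
        # Default case: no customer key supplied
        policy["AWSOwnedKey"] = True
    else:
        policy["AWSOwnedKey"] = False
        policy["KmsARN"] = kms_key_arn
    return policy


def build_network_policy(collection_name, vpc_endpoints=None):
    """Network policy document (a JSON array of entries). Hypothetical helper."""
    resource = [f"collection/{collection_name}"]
    entry = {
        "Rules": [
            {"ResourceType": "collection", "Resource": resource},
            # Assumption: dashboard access follows the same rule as the collection
            {"ResourceType": "dashboard", "Resource": resource},
        ]
    }
    if vpc_endpoints:
        entry["AllowFromPublic"] = False
        entry["SourceVPCEs"] = list(vpc_endpoints)
    else:
        # Default case: public access
        entry["AllowFromPublic"] = True
    return [entry]


default_encryption = build_encryption_policy("my-collection")
default_network = build_network_policy("my-collection")
```

Passing the `kms_key_arn` and `vpc_endpoint` values from the cell above to these helpers yields the non-default variants, with `AWSOwnedKey` and `AllowFromPublic` both set to `False`.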

Connect

Connect to the collection endpoint:

[12]:
client = wr.opensearch.connect(host=collection_endpoint)

Create index

To create an index, run:

[13]:
index = "my-index-1"

wr.opensearch.create_index(
    client=client,
    index=index,
)
[13]:
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'my-index-1'}

Index documents

To index documents:

[25]:
wr.opensearch.index_documents(
    client,
    documents=[{"_id": "1", "name": "John"}, {"_id": "2", "name": "George"}, {"_id": "3", "name": "Julia"}],
    index=index,
)
Indexing: 100% (3/3)|####################################|Elapsed Time: 0:00:12
[25]:
{'success': 3, 'errors': []}
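The returned dictionary reports how many documents were indexed and lists any errors. A minimal sketch of a check on that result follows; the helper `check_index_result` is hypothetical and relies only on the `success` and `errors` keys shown in the output above.

```python
def check_index_result(result, expected_count):
    """Raise if any document failed to index; return the success count otherwise.

    Hypothetical helper based on the result shape shown above.
    """
    errors = result.get("errors", [])
    if errors:
        # Show only the first few errors to keep the message readable
        raise RuntimeError(f"{len(errors)} document(s) failed to index: {errors[:3]}")
    if result.get("success") != expected_count:
        raise RuntimeError(
            f"expected {expected_count} indexed documents, got {result.get('success')}"
        )
    return result["success"]


# The literal result from the cell above
indexed = check_index_result({"success": 3, "errors": []}, expected_count=3)
```

In the notebook, the first argument would be the return value of `wr.opensearch.index_documents`.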

It is also possible to index pandas DataFrames:

[26]:
import pandas as pd

df = pd.DataFrame(
    [{"_id": "1", "name": "John", "tags": ["foo", "bar"]}, {"_id": "2", "name": "George", "tags": ["foo"]}]
)

wr.opensearch.index_df(
    client,
    df=df,
    index="index-df",
)
Indexing: 100% (2/2)|####################################|Elapsed Time: 0:00:12
[26]:
{'success': 2, 'errors': []}

AWS SDK for pandas also supports indexing JSON and CSV documents.

For more examples, refer to the 031 - OpenSearch tutorial.

Delete index

To delete an index, run:

[ ]:
wr.opensearch.delete_index(client=client, index=index)