Running RAG Pipelines on AWS

Retrieval-Augmented Generation (RAG) has moved from an experimental pattern to the standard architecture for deploying generative AI in the enterprise. Large language models (LLMs) have impressive general reasoning capabilities, but they are frozen in time and lack access to proprietary, real-time corporate data. On AWS, building a RAG pipeline is no longer just a matter of connecting a database to an LLM; it means orchestrating a multi-stage data factory that handles ingestion, embedding, storage, and retrieval with production-grade reliability.

In a production environment, RAG pipelines must also contend with "data gravity": the challenge of moving and processing large volumes of enterprise data while maintaining security and low latency. AWS offers a managed ecosystem through Amazon Bedrock, Amazon OpenSearch Serverless, and AWS Lambda that absorbs much of the infrastructure heavy lifting. For the architect, however, the complexity lies in the "glue": how you chunk documents, how you keep vector indexes fresh, and how you apply semantic re-ranking so the LLM receives the most relevant context possible.

The shift toward "Agentic RAG" on AWS further emphasizes the need for modularity. Instead of a linear flow, modern pipelines use specialized agents to decompose queries, retrieve data from multiple sources (like S3 for PDFs and Aurora for structured data), and synthesize answers. This approach minimizes hallucinations and ensures that the model’s response is grounded in a "single source of truth" hosted within your VPC.

Architecture and Core Concepts

A production-grade RAG architecture on AWS is divided into two distinct lifecycles: the Asynchronous Ingestion Path and the Synchronous Retrieval Path. The ingestion path is responsible for converting unstructured data into a searchable vector format, while the retrieval path handles the user query and response generation.

In this architecture, Amazon OpenSearch Serverless (AOSS) acts as the vector engine, providing a highly available, auto-scaling environment for vector similarity searches. The Amazon Bedrock API provides access to both embedding models (like Titan Text Embeddings V2) and generation models (like Claude 3.5 Sonnet). By using AWS Lambda as the orchestrator, you ensure a serverless execution environment that scales horizontally with user demand.
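
The retrieval code later in this article assumes the ingestion path has already chunked, embedded, and indexed your documents. The sketch below shows a minimal version of that path: fixed-size chunking with overlap, embedding via Titan Text Embeddings V2, and bulk indexing into OpenSearch Serverless. The index name, field names (text, embedding), and chunking parameters are illustrative assumptions, and the OpenSearch client is constructed the same way as in the retrieval code shown later.

python
import boto3
import json
from opensearchpy import helpers

REGION = "us-east-1"
bedrock = boto3.client("bedrock-runtime", region_name=REGION)

def chunk_text(text, chunk_size=1000, overlap=200):
    """Naive fixed-size chunking with overlap; swap in a semantic or layout-aware splitter as needed."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(text):
    """Generate a vector for one chunk with Titan Text Embeddings V2."""
    response = bedrock.invoke_model(
        body=json.dumps({"inputText": text}),
        modelId="amazon.titan-embed-text-v2:0",
    )
    return json.loads(response["body"].read())["embedding"]

def index_document(client, index_name, document_text):
    """Chunk, embed, and bulk-index one document into the vector index."""
    actions = [
        {"_index": index_name, "_source": {"text": chunk, "embedding": embed(chunk)}}
        for chunk in chunk_text(document_text)
    ]
    helpers.bulk(client, actions)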

Implementation: The Orchestration Logic

The following Python example demonstrates how to implement the retrieval component of the pipeline using the boto3 SDK. This script handles the embedding of a user query, performs a vector search against OpenSearch Serverless, and invokes a Bedrock model with the retrieved context.

python
import boto3
import json
import os
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# Configuration
REGION = "us-east-1"
SERVICE = "aoss"
BEDROCK_RUNTIME = boto3.client("bedrock-runtime", region_name=REGION)

def get_embedding(text):
    """Generate vector embeddings using Amazon Bedrock Titan."""
    body = json.dumps({"inputText": text})
    response = BEDROCK_RUNTIME.invoke_model(
        body=body, 
        modelId="amazon.titan-embed-text-v2:0"
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("embedding")

def query_vector_db(query_vector, index_name, host):
    """Perform a k-Nearest Neighbor (k-NN) search in OpenSearch Serverless."""
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, 
                       REGION, SERVICE, session_token=credentials.token)
    
    client = OpenSearch(
        hosts=[{'host': host, 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection
    )
    
    query = {
        "size": 3,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": 3
                }
            }
        }
    }
    return client.search(index=index_name, body=query)

def generate_answer(query, context):
    """Invoke Claude 3.5 Sonnet with retrieved context."""
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer based only on context:"
    
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [{"role": "user", "content": prompt}]
    })
    
    response = BEDROCK_RUNTIME.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=body
    )
    return json.loads(response.get("body").read())["content"][0]["text"]

def handler(event, context):
    query_text = event['query']
    vector_host = os.environ["AOSS_ENDPOINT"]  # e.g. "your-aoss-endpoint.us-east-1.aoss.amazonaws.com"
    
    # 1. Embed user query
    vector = get_embedding(query_text)
    
    # 2. Retrieve context
    search_results = query_vector_db(vector, "docs-index", vector_host)
    context_str = " ".join([hit['_source']['text'] for hit in search_results['hits']['hits']])
    
    # 3. Generate response
    answer = generate_answer(query_text, context_str)
    return {"statusCode": 200, "body": answer}
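
To exercise the handler outside of Lambda, you can call it directly with a sample event. The event shape below (a top-level query key) is simply what this example handler expects; it assumes your AWS credentials and the AOSS_ENDPOINT environment variable are already configured.

python
# Local smoke test for the handler above (hypothetical query text).
if __name__ == "__main__":
    test_event = {"query": "What is our data retention policy?"}
    print(handler(test_event, None))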

Best Practices for Vector Storage on AWS

Choosing the right vector store depends on your data volume, latency requirements, and existing infrastructure.

| Service | Best Use Case | Indexing Latency | Scaling Mechanism |
| --- | --- | --- | --- |
| OpenSearch Serverless | Large-scale unstructured data, logs, and text | Low (near real-time) | Automatic OCU scaling |
| Amazon Aurora (pgvector) | Hybrid queries (SQL + vector) on existing DBs | Medium | Instance-based (vertical/horizontal) |
| Amazon MemoryDB | Ultra-low latency requirements (<10 ms) | Very low | In-memory sharding |
| Amazon Kendra | Turnkey RAG with built-in connectors (SharePoint, S3) | High | Capacity Units |
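
If you choose OpenSearch Serverless, the retrieval code above assumes a k-NN index already exists. A minimal sketch of creating one follows; the index name, field names, HNSW settings, and the 1024-dimension value (Titan V2's default output size) are assumptions to align with your own ingestion pipeline.

python
# Minimal sketch: create a k-NN index for 1024-dimension Titan V2 embeddings.
# Assumes `client` is an authenticated OpenSearch client, built as in query_vector_db above.
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"}
            }
        }
    }
}
client.indices.create(index="docs-index", body=index_body)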

Performance and Cost Optimization

In a RAG pipeline, costs are driven primarily by three factors: vector database uptime, embedding tokens, and generation tokens. One effective optimization is semantic caching: by caching responses to common queries in Amazon ElastiCache, you can skip both the vector search and the LLM call for repeated (or semantically similar) questions, paying only for the query embedding.
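
A minimal sketch of semantic caching is shown below. It keeps each answered query's embedding alongside its response and returns the cached answer when a new query's embedding is close enough; the in-process list stands in for ElastiCache, and the 0.95 similarity threshold is an illustrative assumption to tune against your own traffic.

python
import math

_cache = []  # list of (embedding, answer) pairs; in production this would live in ElastiCache

def _cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cached_answer(query_embedding, threshold=0.95):
    """Return a previously generated answer if a semantically similar query was already served."""
    for cached_embedding, answer in _cache:
        if _cosine(query_embedding, cached_embedding) >= threshold:
            return answer
    return None

def remember(query_embedding, answer):
    """Store the (embedding, answer) pair for future cache hits."""
    _cache.append((query_embedding, answer))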

To optimize performance, tune Top-K retrieval. Retrieving too many documents (high K) inflates the LLM's prompt token count and latency; retrieving too few (low K) starves the model of context and hurts answer quality. Adding a reranking step after the initial vector search (for example, a reranker model such as Cohere Rerank, available through Amazon Bedrock) helps filter out noise so that only the most relevant two or three chunks reach the generator.
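
As a lightweight stand-in for a dedicated reranker, you can at least trim the retrieved set using the similarity scores OpenSearch already returns. The 0.7 score floor and the cap of three chunks below are illustrative assumptions, not recommended defaults.

python
def select_context(search_results, min_score=0.7, max_chunks=3):
    """Keep only the highest-scoring chunks before building the prompt.
    A dedicated reranker would re-score hits against the query; this sketch
    simply filters and sorts on the k-NN scores returned by OpenSearch."""
    hits = search_results["hits"]["hits"]
    relevant = [h for h in hits if h["_score"] >= min_score]
    relevant.sort(key=lambda h: h["_score"], reverse=True)
    return " ".join(h["_source"]["text"] for h in relevant[:max_chunks])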

Monitoring and Production Patterns

A RAG pipeline is only as good as its retrieval accuracy. In production, you must monitor for "Hallucination Rates" and "Retrieval Precision." Using Amazon Bedrock Guardrails, you can implement content filters that block PII or sensitive topics before they reach the model or the user.
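
Attaching a guardrail to the generation call is a small change to the invoke_model request. The sketch below is a guarded variant of generate_answer from the earlier script; the guardrail identifier and version are placeholders for a guardrail you have already created in Bedrock.

python
def generate_answer_guarded(query, context, guardrail_id, guardrail_version="1"):
    """Same generation call as generate_answer, with a Bedrock Guardrail attached.
    guardrail_id and guardrail_version are placeholders for a guardrail in your account."""
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer based only on context:"
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [{"role": "user", "content": prompt}]
    })
    response = BEDROCK_RUNTIME.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=body,
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
    )
    return json.loads(response["body"].read())["content"][0]["text"]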

Utilize AWS X-Ray to trace the latency of each segment—embedding, searching, and generating. If the invoke_model call for embeddings is taking too long, consider moving to a smaller embedding dimension (e.g., 256 instead of 1024) if the accuracy trade-off is acceptable for your use case.
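
Titan Text Embeddings V2 accepts a dimensions parameter in the request body, so shrinking the vectors is confined to the embedding call; the index mapping's dimension must be changed to match. The function below is a lower-dimension variant of get_embedding from the earlier script.

python
def get_embedding_small(text, dimensions=256):
    """Request lower-dimensional Titan V2 embeddings to reduce latency, storage, and search cost.
    The vector index must be created with a matching "dimension" value."""
    body = json.dumps({
        "inputText": text,
        "dimensions": dimensions,  # Titan V2 supports 256, 512, or 1024
        "normalize": True,
    })
    response = BEDROCK_RUNTIME.invoke_model(
        body=body,
        modelId="amazon.titan-embed-text-v2:0",
    )
    return json.loads(response["body"].read())["embedding"]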

Conclusion

Running RAG pipelines on AWS requires a shift from viewing LLMs as standalone entities to seeing them as components within a larger data ecosystem. By leveraging Amazon Bedrock for managed model access and OpenSearch Serverless for scalable vector search, you can build systems that are both powerful and maintainable. The key to success lies in the refinement of the ingestion pipeline—specifically chunking and metadata enrichment—and the implementation of robust monitoring to catch hallucinations before they impact the end user. As the landscape evolves toward multi-agent systems, the modularity of your AWS architecture will be its greatest asset.
