Accounts, tooling, and access you need before provisioning anything.
Install and authenticate the AWS CLI
Install the AWS CLI v2 and configure an IAM principal allowed to create ECS, RDS, ElastiCache, S3, ECR, IAM, CloudWatch Logs and Secrets Manager resources. Pick a single region for the whole deployment (this guide uses af-south-1).
Configure and verify the CLI
aws --version
aws configure # enter key, secret, region (af-south-1), output (json)
aws sts get-caller-identity
Note
Commands in this guide are PowerShell (the AETHER installer runs on Windows). On Linux/macOS, translate loops (foreach ($x in a,b) { } becomes for x in a b; do … done) and replace the backtick line continuation (`) with a backslash (\).
Warning
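af-south-1 (Cape Town) is an opt-in region and is disabled by default. Enable it for the account before creating anything (console: account menu → AWS Regions, or aws account enable-region --region-name af-south-1); requests to a region that is not enabled are rejected. In the console, set the region selector (top-right) to Africa (Cape Town) or you will not see your resources.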
Authenticate and pull the AETHER images from GHCR
AETHER images are published to GitHub Container Registry at ghcr.io/rizaanlakay/afroai. Authenticate Docker with a GitHub Personal Access Token (classic) that has the read:packages scope, then pull the four component images. You re-tag and push them to your private ECR in Phase 3.
Images are at ghcr.io/rizaanlakay/afroai/{service}:<tag> — use the latest release tag. Docker Desktop must be installed and running. You never build from source.
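The login and pull can be sketched as follows. This is a minimal sketch, assuming the four image names match the component names used later in this guide (agent-web, agent-service, agent-api, agent-worker) and that the token is held in a hypothetical GHCR_PAT environment variable; <github-user> and <tag> are placeholders to fill in.

```powershell
# Assumptions: $env:GHCR_PAT holds a classic PAT with the read:packages scope;
# the image names are inferred from the component names, not confirmed by the publisher.
$tag = "<tag>"   # replace with the latest release tag

# Pass the token on stdin so it does not appear in the process list or shell history.
$env:GHCR_PAT | docker login ghcr.io -u "<github-user>" --password-stdin

foreach ($svc in "agent-web", "agent-service", "agent-api", "agent-worker") {
    # Braces around ${svc} keep PowerShell from misreading the colon that follows.
    docker pull "ghcr.io/rizaanlakay/afroai/${svc}:$tag"
}
```

Running docker images afterwards should list all four images under the chosen tag.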
Choose your sizing — production vs. low-cost evaluation
Every provisioning step below shows a production size and a test / free-tier size. For a working evaluation, use the test sizes everywhere — the whole footprint then sits in or near the AWS Free Tier (the one exception is the message broker; see Phase 2). When you are done testing, Phase 7 has a one-shot script to stop everything so you are not billed while idle.
Tip
Production target for ~1,000 concurrent conversations: 3+ agent-web and agent-service tasks, 2+ agent-api/agent-worker, a Multi-AZ RDS (db.r6g.xlarge+), ElastiCache with a replica, and Postgres connection pooling (Phase 6). The chat path is SignalR over a Redis backplane, so web/service scale linearly.
2. Provision data & infrastructure
Create the managed PostgreSQL, Redis, object storage, message broker, and secrets.
Create RDS for PostgreSQL
Create a PostgreSQL 16/17 instance. AETHER stores both the application schema and the RAG embeddings (3072-dim, pgvector) here. The master user is the admin (afroai_admin) — not the app login. The app's web-user role is created later in Phase 5.
Boolean flags take no value. Use --no-publicly-accessible / --no-multi-az — not --publicly-accessible false (that errors). The master username must be letters/digits/underscore only (no hyphens).
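A create call consistent with the flag rules above might look like the following. Treat it as a sketch for the test size: the instance class, storage size, and backup retention are illustrative assumptions, and <MASTER_PWD> is a placeholder.

```powershell
# Assumptions: db.t3.micro / 20 GiB / 1-day backups are illustrative test values.
# RDS rejects master passwords containing / @ " or spaces.
$masterPwd = "<MASTER_PWD>"

aws rds create-db-instance `
    --db-instance-identifier afroai-pg `
    --engine postgres `
    --db-instance-class db.t3.micro `
    --allocated-storage 20 `
    --master-username afroai_admin `
    --master-user-password $masterPwd `
    --backup-retention-period 1 `
    --no-publicly-accessible `
    --no-multi-az

# Block until the instance reports 'available' (this usually takes several minutes).
aws rds wait db-instance-available --db-instance-identifier afroai-pg
```

No --engine-version is passed, so the region's default PostgreSQL version is used.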
Warning
Do not pin --engine-version to a minor that may not exist in your region (e.g. 17.2 fails). Omit it to get the default, or list options: aws rds describe-db-engine-versions --engine postgres --query "DBEngineVersions[?starts_with(EngineVersion,'17')].EngineVersion" --output table.
Note
pgvector is allow-listed by default on RDS (rds.allowed_extensions = *) — no custom parameter group needed. The initializer creates the extension automatically. (The allow-list gotcha only bites Azure/GCP.)
Make RDS reachable for the one-time DB initialization
The instance is created private. To run the AETHER DB-init tool (or psql) from your workstation in Phase 5, temporarily make it publicly reachable and open port 5432 to your IP.
Enable public access + grab the endpoint
aws rds modify-db-instance --db-instance-identifier afroai-pg --publicly-accessible --apply-immediately
# After it returns to 'available', the endpoint:
aws rds describe-db-instances --db-instance-identifier afroai-pg --query "DBInstances[0].Endpoint.Address" --output text
Open 5432 to your IP (find the instance's VPC security group in the console)
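The same rule can be added from the CLI instead of the console. The sketch below assumes the instance has a single VPC security group and uses checkip.amazonaws.com to discover the workstation's public IPv4 address; both are assumptions, so verify the values before applying the rule.

```powershell
# Look up the (first) security group attached to the instance.
$sg = aws rds describe-db-instances --db-instance-identifier afroai-pg `
    --query "DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId" --output text

# Assumption: this service returns your public IPv4 address as plain text.
$ip = (Invoke-RestMethod -Uri "https://checkip.amazonaws.com").Trim()

# Allow PostgreSQL from this single address only (/32).
aws ec2 authorize-security-group-ingress --group-id $sg --protocol tcp --port 5432 --cidr "$ip/32"

# After the init, remove the rule again:
# aws ec2 revoke-security-group-ingress --group-id $sg --protocol tcp --port 5432 --cidr "$ip/32"
```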
This exposes the database to the internet (restricted to your IP). It is fine for a one-time init, but revert it afterwards (--no-publicly-accessible + remove the rule). In production, run the init from inside the VPC (bastion or ECS exec) instead.
Create ElastiCache for Redis
Redis is the distributed cache (afroai: prefix) and the SignalR backplane (afroai-signalr: prefix). Keep transit encryption on so the app connects with ssl=true.
The connection string is <primary-endpoint>:6379,ssl=true,abortConnect=false. Only the deployed services reach Redis (in-VPC) — do not expose it publicly. ElastiCache cannot be stopped; to zero its cost when idle, delete and recreate it (see Phase 7).
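One way to create the cache and assemble that connection string is sketched here. The replication group id, node type, and single-node layout are illustrative test values, and the group lands in the default VPC security group unless you pass your own.

```powershell
# Assumptions: afroai-redis, cache.t3.micro, and one node are illustrative test values.
aws elasticache create-replication-group `
    --replication-group-id afroai-redis `
    --replication-group-description "AETHER cache and SignalR backplane" `
    --engine redis `
    --cache-node-type cache.t3.micro `
    --num-cache-clusters 1 `
    --transit-encryption-enabled

# Block until the group is usable.
aws elasticache wait replication-group-available --replication-group-id afroai-redis

# Read the primary endpoint and build the connection string the app expects.
$redisHost = aws elasticache describe-replication-groups --replication-group-id afroai-redis `
    --query "ReplicationGroups[0].NodeGroups[0].PrimaryEndpoint.Address" --output text
$redisConn = "${redisHost}:6379,ssl=true,abortConnect=false"
$redisConn
```

The printed value is what goes into Redis__ConnectionString in Phase 4.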
Create one S3 bucket + a scoped IAM user
AETHER stores documents, generated artifacts, and knowledge files via the S3 API. It uses a single bucket for all object types (configured by Minio:BucketName). Create one bucket and an IAM user scoped to just that bucket.
1 — Create the bucket (name must be globally unique)
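Steps 1 and 2 can be sketched together: create the bucket, then write the s3-policy.json file that the put-user-policy command below reads. The action list is an assumption about what an S3-backed object store client typically needs (list, read, write, delete); tighten or extend it to match your requirements.

```powershell
$bucket = "afroai-artifacts"   # must be globally unique; add a suffix if the name is taken

# Outside us-east-1 the LocationConstraint must match the region.
aws s3api create-bucket --bucket $bucket --region af-south-1 `
    --create-bucket-configuration LocationConstraint=af-south-1

# 2 - Write a policy scoped to this one bucket (assumed action set; adjust as needed).
$policy = @"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
"@

# ASCII encoding avoids the UTF-16 default of Windows PowerShell, which the CLI cannot parse.
Set-Content -Path s3-policy.json -Value $policy -Encoding ascii
```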
3 — Create the user, attach the policy, make an access key
aws iam create-user --user-name afroai-s3
aws iam put-user-policy --user-name afroai-s3 --policy-name afroai-s3-access --policy-document file://s3-policy.json
aws iam create-access-key --user-name afroai-s3 # copy AccessKeyId + SecretAccessKey (shown once)
Warning
S3 bucket names are global. If afroai-artifacts is taken, add a suffix (e.g. your account id) and use that name everywhere — both ARNs in the policy and the Minio__BucketName env var in Phase 4.
Note
AETHER's MinIO SDK needs the region for S3 SigV4 signing — you set Minio__Region=af-south-1 in Phase 4. The endpoint is s3.af-south-1.amazonaws.com with Minio__Secure=true.
Provision the RabbitMQ broker (CloudAMQP recommended)
The worker consumes jobs over RabbitMQ (MassTransit transport). On AWS, the only managed RabbitMQ is Amazon MQ, but its smallest RabbitMQ instance is mq.m5.large (there is no free tier and mq.t3.micro is not offered for RabbitMQ). Instance options vary by region and over time; confirm what is available with aws mq describe-broker-instance-options --engine-type RABBITMQ. For a low-cost deployment, use CloudAMQP's free plan instead.
Option A — CloudAMQP (free, recommended)
1. Sign up at cloudamqp.com
2. Create instance: plan 'Little Lemur (Free)', region closest to af-south-1
3. Open the instance, copy the AMQP URL (amqps://user:pass@host/vhost)
4. Use that URL verbatim as the afroai/queue secret in the next step
Option B — Amazon MQ (managed, NOT free; delete when done)
AETHER's worker is built for the RabbitMQ transport — SQS / Service Bus are not compatible and would require rebuilding the binaries. The queue value is always an amqps:// URL. AETHER configures the bus with cfg.Host(new Uri(url)), so CloudAMQP's amqps://user:pass@host/vhost works as-is.
Danger
Amazon MQ brokers cannot be stopped — billing only stops when you delete-broker. At ~$0.30+/hr for mq.m5.large that is ~$220/mo if left running. CloudAMQP free has nothing to stop.
Store secrets in AWS Secrets Manager
AETHER reads these sensitive values; store each as a secret and reference it from the ECS task definitions in Phase 4. The env-var column shows the .NET configuration key (: becomes __).
| Secret name | Env var (task definition) | What it is |
|---|---|---|
| afroai/db-connection | ConnectionStrings__DefaultConnection | Full Postgres connection string incl. the web-user password |
| afroai/queue | ConnectionStrings__queue | RabbitMQ / CloudAMQP amqps:// URL |
| afroai/openai-key | KernelMemory__AI__OpenAI__ApiKey | Your OpenAI API key (chat, embeddings, RAG, images) |
| afroai/orchestrator-key | Services__OrchestratorApiKey | Internal API key (Web/Service↔API auth); you generate it |
| afroai/mcp-key | Mcp__CredentialEncryptionKey | AES-256 key, base64 of exactly 32 bytes; you generate it |
| afroai/minio-access-key | Minio__AccessKey | Access key id of the afroai-s3 IAM user |
| afroai/minio-secret-key | Minio__SecretKey | Secret access key of the afroai-s3 IAM user |
KernelMemory's RAG store reuses the same database — set KernelMemory__Services__Postgres__ConnectionString to the same value as afroai/db-connection in Phase 4.
1 — Generate the two internal keys (32-byte base64; reuse across services)
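Key generation and storage can be sketched as below. New-Base64Key is a hypothetical helper name introduced for illustration; it draws 32 bytes from the operating system's cryptographic random number generator and base64-encodes them, which yields a 44-character string.

```powershell
# Hypothetical helper: returns 32 cryptographically random bytes as base64.
function New-Base64Key {
    $bytes = New-Object byte[] 32
    $rng = [System.Security.Cryptography.RandomNumberGenerator]::Create()
    try { $rng.GetBytes($bytes) } finally { $rng.Dispose() }
    [Convert]::ToBase64String($bytes)
}

$orchKey = New-Base64Key
$mcpKey  = New-Base64Key

# Store each key once; every service then references the same secret.
aws secretsmanager create-secret --name afroai/orchestrator-key --secret-string $orchKey
aws secretsmanager create-secret --name afroai/mcp-key --secret-string $mcpKey
```

The remaining secrets in the table follow the same create-secret pattern with their own values.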
Services:OrchestratorApiKey and Mcp:CredentialEncryptionKey must be byte-for-byte identical across agent-web, agent-service, and agent-api. Generate once, reuse — a mismatch makes the API reject every orchestrator call and leaves MCP credentials undecryptable.
Note
Endpoints, model names, region, and bucket name are not secrets — set Redis:ConnectionString, Minio:Endpoint/Secure/Region/BucketName, Services:AgentService, and KernelMemory:AI:OpenAI:TextModel/EmbeddingModel as plain environment variables in Phase 4.
3. Push images to ECR
Mirror the AETHER images into your private registry.
Authenticate Docker to ECR, then re-tag each image already pulled in Phase 1 and push. Note ${r} braces — in PowerShell a bare $r: is misread because of the colon.
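The login, re-tag, and push sequence can be sketched as follows. The afroai/ repository prefix and the component image names are assumptions made for illustration; <tag> is the release tag pulled in Phase 1.

```powershell
$tag     = "<tag>"          # the same tag pulled in Phase 1
$region  = "af-south-1"
$account = aws sts get-caller-identity --query Account --output text
$r       = "${account}.dkr.ecr.${region}.amazonaws.com"

# ECR login: the user name is always the literal AWS.
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $r

foreach ($svc in "agent-web", "agent-service", "agent-api", "agent-worker") {
    # Assumption: one repository per component under a hypothetical afroai/ prefix.
    aws ecr create-repository --repository-name "afroai/$svc" --region $region

    # Without the braces, "$svc:$tag" and "$r/..." next to a colon would be misparsed.
    docker tag  "ghcr.io/rizaanlakay/afroai/${svc}:$tag" "${r}/afroai/${svc}:$tag"
    docker push "${r}/afroai/${svc}:$tag"
}
```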
4. Deploy to ECS Fargate
ECS Fargate cluster, IAM roles, task definitions, networking, and the services.
Create the ECS cluster (+ service-linked role)
Create a Fargate cluster. On a brand-new account the ECS service-linked role may not exist yet, which makes cluster creation fail with "Unable to assume the service linked role" — create it first.
Create the role (harmless if it already exists) then the cluster
aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com # 'has been taken' = already there, fine
aws ecs create-cluster --cluster-name afroai --capacity-providers FARGATE
Create the task execution role + log group
Fargate needs an execution role to pull from ECR, read your secrets, and write logs — and a CloudWatch log group to write into. Missing either is the #1 cause of tasks that never start ("unable to pull secrets" / "log group does not exist").
1 — Create the role with the right trust + policies
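A sketch of the role, its policies, and the log group follows. The role name ecsTaskExecutionRole and the log group /ecs/afroai come from this guide; the inline policy name and the afroai/* resource pattern are assumptions chosen to match the secret names above.

```powershell
# Trust policy: lets ECS tasks assume the role.
$trust = @'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
'@
Set-Content -Path ecs-trust.json -Value $trust -Encoding ascii
aws iam create-role --role-name ecsTaskExecutionRole --assume-role-policy-document file://ecs-trust.json

# AWS-managed policy: pull from ECR and write to CloudWatch Logs.
aws iam attach-role-policy --role-name ecsTaskExecutionRole `
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

# Inline policy (assumed name): read only the afroai/* secrets.
$account = aws sts get-caller-identity --query Account --output text
$secrets = @"
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "secretsmanager:GetSecretValue",
    "Resource": "arn:aws:secretsmanager:af-south-1:${account}:secret:afroai/*"
  }]
}
"@
Set-Content -Path ecs-secrets.json -Value $secrets -Encoding ascii
aws iam put-role-policy --role-name ecsTaskExecutionRole --policy-name afroai-read-secrets `
    --policy-document file://ecs-secrets.json

# The log group that every task definition's awslogs config points at.
aws logs create-log-group --log-group-name /ecs/afroai
```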
Registering a task definition does not validate that the role exists — it just stores the ARN string. If the role is missing, the failure only shows at service-launch time. Confirm with aws iam get-role --role-name ecsTaskExecutionRole.
Register the task definitions
One task definition per component. All four set executionRoleArn to the role above, networkMode: awsvpc, requiresCompatibilities: ["FARGATE"], and an awslogs log config pointing at /ecs/afroai. Plain config goes in environment; sensitive values reference the secret ARNs in secrets (valueFrom). Below is the agent-web example — the others are subsets.
agent-web environment + secrets (excerpt; cpu 1024 / memory 2048)
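An excerpt of what such a task definition might look like is sketched below, written to a file and registered from PowerShell. Container port 8080, the image path, and the selection of variables shown are assumptions for illustration; every <...> placeholder must be replaced, and the secret valueFrom entries take the full secret ARNs.

```powershell
# Assumptions: port 8080, the image path, and this subset of settings are illustrative.
$taskDef = @'
{
  "family": "agent-web",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole",
  "containerDefinitions": [{
    "name": "agent-web",
    "image": "<ACCOUNT_ID>.dkr.ecr.af-south-1.amazonaws.com/afroai/agent-web:<tag>",
    "essential": true,
    "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
    "environment": [
      { "name": "ASPNETCORE_ENVIRONMENT", "value": "Production" },
      { "name": "Redis__ConnectionString", "value": "<primary-endpoint>:6379,ssl=true,abortConnect=false" },
      { "name": "Minio__Endpoint", "value": "s3.af-south-1.amazonaws.com" },
      { "name": "Minio__Secure", "value": "true" },
      { "name": "Minio__Region", "value": "af-south-1" },
      { "name": "Minio__BucketName", "value": "afroai-artifacts" },
      { "name": "KernelMemory__AI__OpenAI__EmbeddingModel", "value": "text-embedding-3-large" }
    ],
    "secrets": [
      { "name": "ConnectionStrings__DefaultConnection", "valueFrom": "<ARN of afroai/db-connection>" },
      { "name": "Services__OrchestratorApiKey", "valueFrom": "<ARN of afroai/orchestrator-key>" },
      { "name": "Mcp__CredentialEncryptionKey", "valueFrom": "<ARN of afroai/mcp-key>" }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/afroai",
        "awslogs-region": "af-south-1",
        "awslogs-stream-prefix": "agent-web"
      }
    }
  }]
}
'@
Set-Content -Path agent-web.json -Value $taskDef -Encoding ascii
aws ecs register-task-definition --cli-input-json file://agent-web.json
```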
Per-component differences:
agent-service = db + queue + openai + orchestrator-key + mcp-key (no Redis/Minio).
agent-api = db + orchestrator-key + Services__AgentService.
agent-worker = db + queue + openai + minio (no HTTP port; use DOTNET_ENVIRONMENT, not ASPNETCORE_ENVIRONMENT; its image is built from Dockerfile.worker, which includes Python for the code sandbox).
Tip
The OpenAI model is set by KernelMemory__AI__OpenAI__TextModel; keep EmbeddingModel = text-embedding-3-large (3072-dim, matches the schema).
Networking — Cloud Map, the ALB, and security groups
Internal services find each other via Cloud Map DNS (agent-service.afroai); agent-web is published through an internet-facing ALB. Default-VPC subnets are public, so tasks use assignPublicIp=ENABLED to reach ECR / OpenAI / CloudAMQP (no NAT gateway needed).
1 — Cloud Map namespace + a discovery service for agent-service
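The namespace and discovery service can be sketched as follows, assuming the default VPC is used. Namespace creation is asynchronous, so the sketch polls until the namespace appears; the 60-second TTL is an illustrative value.

```powershell
# Arguments containing commas are quoted so PowerShell does not split them into arrays.
$vpc = aws ec2 describe-vpcs --filters "Name=isDefault,Values=true" --query "Vpcs[0].VpcId" --output text

# Returns an OperationId, not the namespace id.
aws servicediscovery create-private-dns-namespace --name afroai --vpc $vpc

# Poll until the namespace exists, then capture its id.
do {
    Start-Sleep -Seconds 5
    $ns = aws servicediscovery list-namespaces --query "Namespaces[?Name=='afroai'].Id" --output text
} until ($ns)

# A-record discovery service: tasks register as agent-service.afroai.
aws servicediscovery create-service --name agent-service --namespace-id $ns `
    --dns-config "DnsRecords=[{Type=A,TTL=60}]"
```

The Arn in the create-service output is the registry ARN that the agent-service ECS service needs.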
Set the target-group health check to / (the landing page), not /health. AETHER's health endpoints are only mapped in Development, so /health 404s in production and the task is killed in a loop. --matcher HttpCode=200-399 tolerates the page returning a redirect.
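A sketch of the ALB, target group, and listener with those health-check settings follows. It assumes the default VPC, a container port of 8080, and hypothetical resource names (afroai-alb, afroai-web); adjust them to your setup.

```powershell
$vpc = aws ec2 describe-vpcs --filters "Name=isDefault,Values=true" --query "Vpcs[0].VpcId" --output text
$subnets = (aws ec2 describe-subnets --filters "Name=vpc-id,Values=$vpc" `
    --query "Subnets[].SubnetId" --output text) -split "\s+"

# Security group for the ALB: HTTP from anywhere (test setup only).
$albSg = aws ec2 create-security-group --group-name afroai-alb --description "AETHER ALB" `
    --vpc-id $vpc --query GroupId --output text
aws ec2 authorize-security-group-ingress --group-id $albSg --protocol tcp --port 80 --cidr 0.0.0.0/0

$alb = aws elbv2 create-load-balancer --name afroai-alb --type application --scheme internet-facing `
    --subnets $subnets --security-groups $albSg `
    --query "LoadBalancers[0].LoadBalancerArn" --output text

# Raise the idle timeout from the 60 s default to 300 s for long-running requests.
aws elbv2 modify-load-balancer-attributes --load-balancer-arn $alb `
    --attributes "Key=idle_timeout.timeout_seconds,Value=300"

# Fargate tasks in awsvpc mode register by IP; health check hits / and accepts 200-399.
$tg = aws elbv2 create-target-group --name afroai-web --protocol HTTP --port 8080 --vpc-id $vpc `
    --target-type ip --health-check-path / --matcher HttpCode=200-399 `
    --query "TargetGroups[0].TargetGroupArn" --output text

aws elbv2 create-listener --load-balancer-arn $alb --protocol HTTP --port 80 `
    --default-actions "Type=forward,TargetGroupArn=$tg"
```

The task security group must additionally allow the container port from the ALB security group.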
Note
HTTP-only ALB is fine for testing — the auth cookie uses SameAsRequest, so it works over HTTP. For production add an ACM certificate + a 443 listener (HTTPS). The default 60s ALB idle timeout is raised to 300s above so long image-generation requests are not cut off.
Create the ECS services
Create one service per component. agent-service registers in Cloud Map; agent-web attaches to the ALB target group with a startup grace period; agent-worker and agent-api need no ingress. Use --desired-count 1 for a test.
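The four create-service calls can be sketched as below. The subnet ids, security group id, target group ARN, and registry ARN are placeholders from the networking step; container port 8080 and the 120-second grace period are illustrative assumptions.

```powershell
# Assumptions: all <...> values come from the networking step; 8080 and 120 s are illustrative.
$net = "awsvpcConfiguration={subnets=[<subnet-a>,<subnet-b>],securityGroups=[<sg-tasks>],assignPublicIp=ENABLED}"

# agent-service registers itself in Cloud Map.
aws ecs create-service --cluster afroai --service-name agent-service --task-definition agent-service `
    --desired-count 1 --launch-type FARGATE --network-configuration $net `
    --service-registries "registryArn=<registry-arn>"

# agent-web attaches to the ALB target group, with a grace period before health checks count.
aws ecs create-service --cluster afroai --service-name agent-web --task-definition agent-web `
    --desired-count 1 --launch-type FARGATE --network-configuration $net `
    --load-balancers "targetGroupArn=<target-group-arn>,containerName=agent-web,containerPort=8080" `
    --health-check-grace-period-seconds 120

# agent-api and agent-worker need no ingress wiring.
foreach ($svc in "agent-api", "agent-worker") {
    aws ecs create-service --cluster afroai --service-name $svc --task-definition $svc `
        --desired-count 1 --launch-type FARGATE --network-configuration $net
}
```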
Run agent-web at a single replica for now: ASP.NET Core Data Protection keys default to local disk and are not shared across tasks, so multiple replicas break cookies/antiforgery. Scaling web past 1 needs a shared key ring (Redis) — see Phase 6.
5. Initialize the AfroAI database
Create the schema, seed reference data, and the app login role.
Apply the schema + seed data
Connect to RDS as afroai_admin (via the public access from Phase 2, or a bastion) and apply the bundled schema and seed. These ship with the installer and reflect a known-good database. psql is included with pgAdmin under runtime\psql.exe.
Apply schema + seed (download both from the installer's /Database page)
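Applying the two files with psql might look like the sketch below. The file names schema.sql and seed.sql are hypothetical stand-ins for the two downloads, and the sketch assumes a database named AfroAI already exists on the instance and that psql is on the PATH.

```powershell
# Assumptions: schema.sql / seed.sql are hypothetical file names; the AfroAI database exists.
$endpoint = aws rds describe-db-instances --db-instance-identifier afroai-pg `
    --query "DBInstances[0].Endpoint.Address" --output text

$env:PGPASSWORD = "<MASTER_PWD>"   # master password of afroai_admin
$env:PGSSLMODE  = "require"        # encrypt the connection to RDS

foreach ($file in "schema.sql", "seed.sql") {
    # ON_ERROR_STOP makes psql abort on the first failing statement instead of continuing.
    psql -h $endpoint -p 5432 -U afroai_admin -d AfroAI -v ON_ERROR_STOP=1 -f $file
}

# Do not leave the password in the session environment.
Remove-Item Env:PGPASSWORD
```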
Create the app login role, set the admin password, and grant privileges
-- run as afroai_admin on the AfroAI database
CREATE ROLE "web-user" LOGIN PASSWORD '<WEB_USER_PWD>';
GRANT USAGE, CREATE ON SCHEMA public TO "web-user";
GRANT ALL ON ALL TABLES IN SCHEMA public TO "web-user";
GRANT ALL ON ALL SEQUENCES IN SCHEMA public TO "web-user";
ALTER DEFAULT PRIVILEGES FOR ROLE afroai_admin IN SCHEMA public GRANT ALL ON TABLES TO "web-user";
ALTER DEFAULT PRIVILEGES FOR ROLE afroai_admin IN SCHEMA public GRANT ALL ON SEQUENCES TO "web-user";
Danger
The app connects as web-user (in the afroai/db-connection secret). The schema and tables are owned by afroai_admin, so web-user needs the GRANTs above — without them every page fails with "permission denied for table …".
Tip
The seed includes the reference data the agent-creation UI needs (categories, languages, tones, creativity levels, response lengths) and an initial admin user. Update that user's password before going live. The live Initialize Database tool is an alternative for a from-scratch database.
6. Scale & harden
Autoscaling, connection pooling, shared keys, backups, and observability for production.
Configure ECS service autoscaling
Add Application Auto Scaling: scale agent-web/agent-service on CPU or ALB request count, and agent-worker on queue depth.
Register a scalable target and a CPU target-tracking policy
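For agent-service, the registration and policy can be sketched as follows. The capacity bounds (1 to 6 tasks), the 60% CPU target, and the cooldowns are illustrative assumptions, as is the policy name.

```powershell
$resource = "service/afroai/agent-service"

# Allow Application Auto Scaling to move the desired count between 1 and 6 tasks.
aws application-autoscaling register-scalable-target --service-namespace ecs `
    --resource-id $resource --scalable-dimension ecs:service:DesiredCount `
    --min-capacity 1 --max-capacity 6

# Target tracking: keep average service CPU near 60% (illustrative target).
$cpuPolicy = @'
{
  "TargetValue": 60.0,
  "PredefinedMetricSpecification": { "PredefinedMetricType": "ECSServiceAverageCPUUtilization" },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 120
}
'@
Set-Content -Path cpu-policy.json -Value $cpuPolicy -Encoding ascii

aws application-autoscaling put-scaling-policy --policy-name afroai-cpu-target `
    --service-namespace ecs --resource-id $resource --scalable-dimension ecs:service:DesiredCount `
    --policy-type TargetTrackingScaling `
    --target-tracking-scaling-policy-configuration file://cpu-policy.json
```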
Before scaling agent-web beyond one task, configure shared Data Protection keys (persist to Redis) and enable ALB stickiness + WebSocket support, or SignalR chat and cookies break across replicas.
Add RDS Proxy for connection pooling
EF Core opens many connections under load; many Fargate tasks multiply that. Put RDS Proxy in front of PostgreSQL and point ConnectionStrings__DefaultConnection at the proxy endpoint to avoid exhausting max_connections.
Tip
The KernelMemory RAG tables (km- prefix) grow with ingested knowledge — schedule backups and watch storage.
Enable observability and backups
AETHER emits OpenTelemetry traces/metrics (via Aspire ServiceDefaults). Forward them to CloudWatch / AWS Distro for OpenTelemetry, and enable automated backups.
A resource that 404s only in production is almost always a Linux case-sensitivity issue (the container filesystem is case-sensitive; Windows dev is not). The error page hides the real exception — the CloudWatch log has the stack trace.
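Backups and log access can be set up from the CLI along these lines. The 7-day backup window and 30-day log retention are illustrative assumptions; the log group name is the one used throughout this guide.

```powershell
# Keep automated RDS backups for 7 days (illustrative value).
aws rds modify-db-instance --db-instance-identifier afroai-pg `
    --backup-retention-period 7 --apply-immediately

# Cap log retention so the log group does not grow without bound (illustrative value).
aws logs put-retention-policy --log-group-name /ecs/afroai --retention-in-days 30

# Stream recent container logs to find the stack trace behind a generic error page.
aws logs tail /ecs/afroai --since 15m --follow
```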
7. Shut down and restart
Shut everything down (stop idle billing)
The big variable cost is Fargate (per running task) and the RDS instance. Scale every service to zero and stop the database. Run this whenever you finish a session.
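A shutdown script along those lines might look like this. It is a sketch that assumes the cluster, service, and instance names used earlier in this guide.

```powershell
# Scale every service to zero tasks; task definitions and services stay in place.
foreach ($svc in "agent-web", "agent-service", "agent-api", "agent-worker") {
    aws ecs update-service --cluster afroai --service $svc --desired-count 0 `
        --query "service.desiredCount" --output text
}

# Stop the database (compute billing stops; storage is still billed).
aws rds stop-db-instance --db-instance-identifier afroai-pg
```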
Cannot be stopped, only deleted: ElastiCache, the ALB, and (if used) Amazon MQ bill hourly even when idle. A t3.micro cache plus an ALB typically adds up to a few tens of dollars per month at list prices (check current af-south-1 pricing), which is usually acceptable to leave running. To zero them too, delete them and recreate them on startup (Phase 2 for the cache and broker, Phase 4 for the ALB). RDS can stay stopped for up to 7 days before AWS auto-starts it; storage is still billed while stopped.
Start everything back up
Bring the database back first, wait for it, then scale the services to their running counts.
Start RDS, wait, then scale services back to 1 (or your production counts)
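The startup sequence can be sketched as follows, using the names from earlier phases. Services are started in an order that brings agent-service up before agent-web, and each one is allowed to stabilise before the next starts; the count of 1 is the test size.

```powershell
# Bring the database back and block until it is available.
aws rds start-db-instance --db-instance-identifier afroai-pg
aws rds wait db-instance-available --db-instance-identifier afroai-pg

# agent-service first, agent-web last.
foreach ($svc in "agent-service", "agent-api", "agent-worker", "agent-web") {
    aws ecs update-service --cluster afroai --service $svc --desired-count 1 `
        --query "service.desiredCount" --output text

    # Wait until the running count matches the desired count before moving on.
    aws ecs wait services-stable --cluster afroai --services $svc
}
```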
Start agent-service before agent-web so the orchestrator is ready when the UI comes up. If you deleted ElastiCache / Amazon MQ on shutdown, recreate them (Phase 2) and refresh the afroai/queue secret + Redis env before scaling the services up.