
AWS IAM for Data Engineers: Least-Privilege Policies That Actually Work
Quick answer: Every AWS data service (Glue, Lambda, Redshift, EMR) needs an IAM role with only the permissions it requires. That means specifying exact S3 bucket ARNs instead of s3:*, exact Glue database names instead of glue:*, and using conditions to restrict by region or IP. This article includes 3 copy-paste-ready IAM policy JSON examples for the most common data engineering scenarios.
Last updated: April 2025
IAM Fundamentals for Data Engineers
IAM isn't glamorous, but it's the reason your Glue job can read from S3, your Lambda can write to DynamoDB, and your Redshift cluster can access external data. Every AWS service interaction is an API call, and every API call is authorized (or denied) by IAM. Getting IAM right means your pipelines work reliably. Getting it wrong means either "Access Denied" errors at 2 AM or -- worse -- overly permissive policies that expose your data lake to every service in the account.
The four IAM concepts data engineers use daily:
- Users: Human identities with long-term credentials. Use for console access; avoid for service-to-service communication.
- Roles: Temporary identity that services assume. Every Glue job, Lambda function, and ECS task should run under its own role.
- Policies: JSON documents that define what actions are allowed or denied on which resources. Attached to users or roles.
- STS AssumeRole: How a service in Account A gets temporary credentials to access resources in Account B. Critical for cross-account data access (the trust policy sketch after this list shows the other half of this handshake).
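A role has two halves: a permissions policy (what the role can do) and a trust policy (who can assume it). As a minimal sketch, the trust policy below lets the Glue service assume a role; swap the service principal (e.g., lambda.amazonaws.com) for other services:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGlueToAssumeThisRole",
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
Every permissions policy shown below attaches to a role whose trust policy looks roughly like this.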
Policy Example 1: Glue Job with S3 and Catalog Access
This policy grants a Glue ETL job read access to a source S3 bucket, write access to a target S3 bucket, and read/write access to specific Glue Catalog databases. For streamlined role creation, see our IAM role creation guide with AWS Policy Generator.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSourceBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::raw-data-source-bucket",
        "arn:aws:s3:::raw-data-source-bucket/*"
      ]
    },
    {
      "Sid": "WriteTargetBucket",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::processed-data-target-bucket/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase", "glue:GetTable",
        "glue:GetTables", "glue:GetPartitions",
        "glue:CreateTable", "glue:UpdateTable",
        "glue:BatchCreatePartition"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:123456789012:catalog",
        "arn:aws:glue:us-east-1:123456789012:database/analytics_db",
        "arn:aws:glue:us-east-1:123456789012:table/analytics_db/*"
      ]
    },
    {
      "Sid": "CloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup", "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws-glue/*"
    }
  ]
}
Notice: no s3:*, no Resource: "*". Each action is scoped to the exact bucket and Glue database this job needs. If the job changes to read from a new bucket, you add a specific resource ARN -- you don't widen the policy.
Policy Example 2: Lambda with DynamoDB and S3
A Lambda function that reads events from S3, processes them, and writes results to a DynamoDB table. The function also needs to log to CloudWatch.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadS3Events",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::event-data-bucket/incoming/*"
    },
    {
      "Sid": "WriteDynamoDB",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem", "dynamodb:UpdateItem",
        "dynamodb:BatchWriteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/ProcessedEvents"
    },
    {
      "Sid": "BasicLambdaExecution",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup", "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:*"
    }
  ]
}
This Lambda can't read from any other S3 bucket, can't delete DynamoDB items, and can't invoke other Lambda functions. That's the point.
Policy Example 3: Cross-Account Redshift Access
Account A (data producer) has an S3 bucket with analytics data. Account B (data consumer) has a Redshift cluster that needs to COPY data from Account A's bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AssumeRoleInProducerAccount",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::111111111111:role/S3DataShareRole"
    },
    {
      "Sid": "RedshiftCopyFromS3",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject", "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::producer-analytics-bucket",
        "arn:aws:s3:::producer-analytics-bucket/shared/*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    }
  ]
}
Account A's bucket policy must also grant access to Account B's role ARN. The Condition block restricts access to a specific region -- a useful guardrail that prevents accidental cross-region data movement. For broader access control patterns, see our data access control strategies guide.
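As a sketch, Account A's bucket policy might look like the following, assuming the consumer is account 222222222222 and its Redshift role is named RedshiftCopyRole (both are illustrative placeholders):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowConsumerRoleRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::222222222222:role/RedshiftCopyRole"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::producer-analytics-bucket",
        "arn:aws:s3:::producer-analytics-bucket/shared/*"
      ]
    }
  ]
}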
S3 Access Points for Multi-Team Patterns
When multiple teams need access to the same S3 bucket with different permissions, S3 Access Points simplify the bucket policy. Instead of one complex bucket policy with 15 principal ARNs and conditions, create an access point per team. Each access point has its own policy that scopes access to specific prefixes.
Example: the marketing team gets an access point that allows read-only access to s3://data-lake/marketing/*. The finance team gets an access point for s3://data-lake/finance/*. Neither team can see the other's data. This is cleaner than trying to manage it all in one bucket policy.
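As a sketch, the marketing team's access point policy might look like this, assuming an access point named marketing-ap and a team role named MarketingAnalystRole (both illustrative):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MarketingReadOnly",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/MarketingAnalystRole"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:us-east-1:123456789012:accesspoint/marketing-ap/object/marketing/*"
    }
  ]
}
A common companion pattern is a bucket policy that delegates access control to the account's access points via the s3:DataAccessPointAccount condition key, so per-team rules live entirely in the access point policies.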
Policy Conditions: The Underused Power Feature
Conditions let you add guardrails beyond just actions and resources:
- aws:RequestedRegion: Restrict actions to specific AWS regions. Prevents accidental resource creation in us-west-2 when your team works in us-east-1.
- aws:SourceIp: Allow API calls only from your corporate IP range. Useful for human users, not recommended for services.
- s3:prefix: Limit S3 ListBucket to specific key prefixes. A team that needs /marketing/ doesn't need to list the entire bucket (a combined example follows this list).
- aws:PrincipalTag: Match against tags on the calling principal. Tag-based access control scales better than maintaining long lists of role ARNs.
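As a sketch, here is a statement combining two of these keys: ListBucket scoped to the marketing/ prefix and pinned to one region (the bucket name is illustrative):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListMarketingPrefixOnly",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::data-lake",
      "Condition": {
        "StringLike": { "s3:prefix": "marketing/*" },
        "StringEquals": { "aws:RequestedRegion": "us-east-1" }
      }
    }
  ]
}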
Common Mistakes
- Using "Resource": "*" everywhere. This grants the action on every resource in your account. It's the IAM equivalent of chmod 777. Always specify resource ARNs.
- Attaching AdministratorAccess to services. We've seen Glue jobs running with AdministratorAccess because "it was easier during development." That Glue job can now delete your production databases, modify IAM roles, and spin up EC2 instances. Never do this.
- Not using conditions. A policy that allows s3:PutObject on a bucket is fine. A policy that also restricts it to your region and requires encryption (s3:x-amz-server-side-encryption) is better; see the sketch after this list.
- Forgetting CloudWatch Logs permissions. Every service needs log access. Without it, your Glue jobs and Lambdas run but you can't see what they did. Include log permissions in every service role.
- Ignoring the 6,144-character policy limit. Managed policies max out at 6,144 characters (whitespace excluded). If your policy is approaching this limit, split it into multiple managed policies (up to 10 per role) or use wildcards more strategically.
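To make the encryption condition concrete, here is a sketch of a Deny statement that rejects uploads missing the server-side-encryption header (bucket name illustrative). Because explicit Deny always wins, no Allow elsewhere can override it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedPuts",
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::processed-data-target-bucket/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}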
IAM Access Analyzer
IAM Access Analyzer is a free AWS tool that identifies overly permissive policies. Enable it in your account and it'll flag policies that grant access to external principals (other accounts, public access) or use wildcard resources where specific ARNs would be safer.
It also has a policy generation feature: point it at CloudTrail logs for a specific role, and it generates a least-privilege policy based on the actual API calls that role made over the past 90 days. This is invaluable for tightening permissions on existing roles without breaking anything. For S3 architecture that these policies protect, see our S3 data lake architecture guide.
Service Control Policies (SCPs)
If you manage multiple AWS accounts via AWS Organizations, SCPs set guardrails that apply across all accounts. Common data engineering SCPs include: deny resource creation outside approved regions, deny disabling of CloudTrail, deny public S3 bucket creation, and require encryption on all S3 objects. SCPs are the organizational safety net -- even if someone attaches AdministratorAccess to a role, the SCP Deny still blocks the restricted actions.
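As a sketch, a region-restriction SCP might look like the following, assuming us-east-1 is the only approved region. The NotAction list exempts global services and usually needs to be extended for your environment:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "sts:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    }
  ]
}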
Key Takeaways
- Every service gets its own role with only the permissions it needs. No shared roles across Glue jobs, Lambdas, and ECS tasks.
- Specify resource ARNs, never "*". A Glue job that reads one bucket shouldn't have access to every bucket in the account.
- Use conditions (region, encryption, IP, tags) to add guardrails beyond basic action/resource scoping.
- Cross-account access = bucket policy + IAM role + STS AssumeRole. All three pieces are required.
- IAM Access Analyzer generates least-privilege policies from CloudTrail. Use it to tighten existing roles without guessing.
- Explicit Deny always wins. If you need to absolutely block an action, use a Deny statement -- no Allow anywhere can override it.
- The 6,144-character limit is real. Plan for it with multiple managed policies or strategic wildcards.
Frequently Asked Questions
Q: What is the principle of least privilege in AWS IAM?
Least privilege means granting only the minimum permissions required. A Glue job reading from one S3 bucket should have s3:GetObject on that specific bucket ARN -- not s3:* on "Resource": "*". This limits the blast radius if credentials are compromised or code has bugs.
Q: How do I set up cross-account S3 access?
Three components: (1) a bucket policy on the source bucket granting access to the consuming account's role ARN, (2) an IAM role in the consuming account whose trust policy lets the consuming service assume it, and (3) a permissions policy on that role with s3:GetObject scoped to the source bucket. The consuming service calls sts:AssumeRole to get temporary credentials.
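For component (2), a minimal sketch of the consuming role's trust policy, which lets the Redshift service assume the role when running COPY:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRedshiftToAssumeThisRole",
      "Effect": "Allow",
      "Principal": { "Service": "redshift.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}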
Q: What is the maximum IAM policy size?
Inline policies: 2,048 characters for a user, 5,120 for a group, and 10,240 for a role. Managed policies: 6,144 characters (whitespace excluded). You can attach up to 10 managed policies per role, giving you an effective limit of ~61,440 characters. If you're hitting these limits, consider using IAM policy variables, S3 access points, or tag-based conditions to reduce policy verbosity.
Q: How does IAM policy evaluation work?
Explicit Deny always wins. AWS evaluates: Organization SCPs first, then resource-based policies, then identity-based policies. If any policy at any level explicitly denies the action, the request is denied regardless of Allow statements elsewhere. This makes Deny the most reliable way to enforce restrictions across an organization.
