Set Timeouts Intentionally

Protect against runaway AWS Lambda costs by tuning function and downstream timeouts based on real execution data and retry behavior.

Why it matters

Lambda bills for duration in 1 ms increments, up to 15 minutes per invocation.
A timeout left at the 3-second default can cut off valid work, and a timeout set much higher than the work needs carries its own costs:

  • Slow downstream services can cause long, expensive executions.
  • Misbehaving code can hang for minutes before Lambda finally stops it.
  • You pay for time spent waiting on network calls that will never succeed.

Right-sized timeouts protect you from runaway costs and surface problems faster.
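To put a rough number on the cost side, the sketch below estimates the duration charge for a batch of invocations that all run for the same length of time. The price per GB-second is an assumption (the commonly published x86 on-demand rate; check the current price for your region and architecture), and the helper name estimateDurationCostUsd is hypothetical. It ignores the per-request charge and any free tier.

```typescript
// Assumed price per GB-second (x86, on-demand); varies by region and architecture.
const ASSUMED_PRICE_PER_GB_SECOND_USD = 0.0000166667;

// Hypothetical helper: duration cost for `invocations` runs of `durationMs` each.
function estimateDurationCostUsd(
  invocations: number,
  durationMs: number,
  memoryMb: number,
  pricePerGbSecond: number = ASSUMED_PRICE_PER_GB_SECOND_USD,
): number {
  const gbSeconds = invocations * (durationMs / 1000) * (memoryMb / 1024);
  return gbSeconds * pricePerGbSecond;
}

// One million invocations at 1024 MB that all hang on a dead dependency:
const hungAt900s = estimateDurationCostUsd(1_000_000, 900_000, 1024);
const cutOffAt5s = estimateDurationCostUsd(1_000_000, 5_000, 1024);
console.log(hungAt900s.toFixed(2), cutOffAt5s.toFixed(2));
```

Under these assumptions the same failure costs roughly 180 times more with a 900-second timeout than with a 5-second one, because the bill scales linearly with how long each stuck invocation is allowed to run.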

Understand the timeout knob

  • Each Lambda function has a timeout setting from 1 second to 900 seconds (15 minutes).
  • When the timeout is reached, Lambda stops the invocation and reports a timeout error; you are still billed for the time used up to that point.
  • Timeouts should reflect your expected execution time plus a small buffer, not the maximum allowed.

Think of the timeout as a cost and reliability guardrail: short enough to stop bad behavior, long enough for valid work to finish.

Pick timeouts from data, not guesses

Before changing timeouts, look at real execution behavior:

  • Check the p50, p90, and p99 statistics of the Duration metric in CloudWatch.
  • Identify external calls (databases, HTTP APIs) that dominate execution time.
  • Separate fast paths (simple reads) from slow paths (big reports, batch work).

Then:

  • Set the timeout slightly above the worst normal-case duration (for example, p99 + 20–30%).
  • Use different functions (and timeouts) for “quick” vs “slow” operations when it makes sense.
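The "p99 plus a buffer" rule can be sketched as a small calculation over exported duration samples. This is an illustration only: the helper names percentile and suggestTimeoutSeconds are hypothetical, the 25% default buffer is just a point inside the 20–30% range above, and the samples are assumed to be durations in milliseconds that you have already pulled from your metrics or logs.

```typescript
// Nearest-rank percentile over a list of durations in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Hypothetical helper: p99 plus a buffer, rounded up to whole seconds
// and clamped to Lambda's allowed range of 1 to 900 seconds.
function suggestTimeoutSeconds(samplesMs: number[], bufferRatio = 0.25): number {
  const p99 = percentile(samplesMs, 99);
  const seconds = Math.ceil((p99 * (1 + bufferRatio)) / 1000);
  return Math.min(900, Math.max(1, seconds));
}

console.log(suggestTimeoutSeconds([120, 150, 180, 200, 2400]));
```

With only a handful of samples, the p99 is simply the slowest observation, so treat the result as a starting point and recompute it once you have a representative window of traffic.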

Example: HTTP call with timeout

When calling external services, set the client-side timeout below your Lambda timeout so a stuck call fails inside the function while there is still time to handle the error.

import axios from "axios";

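const LAMBDA_TIMEOUT_MS = 5_000; // Must match the function's configured timeout (5 seconds)
const HEADROOM_MS = 1_000; // Time reserved for error handling and logging
const DOWNSTREAM_TIMEOUT_MS = LAMBDA_TIMEOUT_MS - HEADROOM_MS; // 4 seconds: less than Lambda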

export const handler = async () => {
  try {
    const response = await axios.get("https://api.example.com/data", {
      timeout: DOWNSTREAM_TIMEOUT_MS,
    });
    return response.data;
  } catch (error) {
    // Fail fast instead of hanging for the full Lambda timeout
    console.error("Downstream call failed", error);
    throw error;
  }
};

Key idea: let the client time out first, then fail or retry, instead of burning the entire Lambda timeout on a stuck network call.
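A hard-coded client timeout assumes the call starts at the very beginning of the invocation. One possible refinement is to derive the budget from the time Lambda reports as remaining, using the context object's getRemainingTimeInMillis() method. The sketch below shows that idea; the ContextLike interface, the downstreamBudgetMs helper, and the 500 ms reserve are assumptions made for illustration.

```typescript
// Minimal shape of the Lambda context method used here.
interface ContextLike {
  getRemainingTimeInMillis(): number;
}

// Hypothetical helper: cap a downstream timeout so that at least `reserveMs`
// is left for error handling and logging before Lambda stops the function.
function downstreamBudgetMs(
  context: ContextLike,
  maxTimeoutMs: number,
  reserveMs = 500,
): number {
  const available = context.getRemainingTimeInMillis() - reserveMs;
  if (available <= 0) {
    throw new Error("Not enough time left to start a downstream call");
  }
  return Math.min(maxTimeoutMs, available);
}
```

Inside a handler that receives the context as its second argument, the result could be passed as the timeout option of the HTTP client, so a call that starts late in the invocation gets a shorter budget than one that starts early.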

Use timeouts with retries and backoff

Timeouts work best alongside retries with backoff:

  • Use shorter timeouts + limited retries for transient errors (like 5xx responses).
  • Add exponential backoff and jitter so many Lambdas don’t hammer a slow dependency at once.
  • For asynchronous flows (queues/streams), rely on built-in retry behavior where possible instead of huge timeouts.

Different event sources (for example, SQS, EventBridge, Kinesis, and DynamoDB Streams) have their own retry, DLQ, and failure semantics. Tightening timeouts can change how often events are retried or sent to a DLQ, so review the behavior for each source before and after changes. For SQS, for instance, AWS recommends a queue visibility timeout of at least six times the function timeout.

This pattern improves reliability without leaving functions running (and billing) for long periods.
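As a rough illustration of limited retries with exponential backoff and jitter, the sketch below uses the "full jitter" variant, where each delay is a random value below an exponentially growing ceiling. The names backoffDelayMs and withRetries, and the default base, cap, and attempt count, are assumptions. For brevity it retries on every error; real code would retry only transient failures and keep the total of attempts and delays inside the Lambda timeout.

```typescript
// Full-jitter backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 2_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: run `operation` up to `maxAttempts` times, waiting a
// jittered delay between attempts and rethrowing the last error if all fail.
async function withRetries<T>(
  operation: (attempt: number) => Promise<T>,
  maxAttempts = 3,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await wait(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Each call made inside operation should still carry its own short client timeout, so that one stuck attempt cannot consume the budget meant for the retries that follow it.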

Watch for timeout signals

After tightening timeouts, monitor:

  • Duration, Errors, and Throttles CloudWatch metrics for jumps.
  • “Task timed out after X seconds” messages in logs.
  • For async and stream-based invocations, AsyncEventAge and IteratorAge for signs that events are backing up.
  • Any spike in dead-letter queue messages or failed events.
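If you export log lines for a time window, the timeout message can be counted with a few lines of code. The sketch below is one way to turn that count into a rate; the helper name timeoutRate is hypothetical, and it assumes each line contains the message text quoted above with a numeric value for the seconds.

```typescript
// Matches the message Lambda writes when an invocation hits its timeout,
// for example "... Task timed out after 3.00 seconds".
const TIMEOUT_PATTERN = /Task timed out after (\d+(?:\.\d+)?) seconds/;

// Hypothetical helper: fraction of invocations that timed out, given the
// log lines and the total invocation count for the same time window.
function timeoutRate(logLines: string[], invocations: number): number {
  if (invocations <= 0) return 0;
  const timeouts = logLines.filter((line) => TIMEOUT_PATTERN.test(line)).length;
  return timeouts / invocations;
}
```

Comparing this rate before and after a timeout change gives a quick signal of whether the new value is cutting off normal work or only catching the stuck invocations it was meant to stop.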

If you see frequent timeouts:

  • Confirm your timeout isn’t too aggressive for the real workload.
  • Look for slow dependencies or unbounded loops.
  • Split long-running work into smaller steps or async workflows (for example, with SQS or Step Functions).

Best practices checklist

  • Timeouts are explicitly set for every function (no relying on defaults).
  • Timeout values are based on observed durations, not guesses.
  • Downstream client timeouts are shorter than Lambda timeouts.
  • Critical async flows use retries with backoff, not “just increase the timeout”.
  • Long-running or wait-heavy workflows are moved to Step Functions or queues, not single 15-minute Lambdas.