AWS CloudFront Bot & Crawler Detection

If your site is served through AWS CloudFront (including sites hosted on AWS Amplify), stream CloudFront's standard access logs to Kitbase through Amazon Data Firehose. This captures every request, including crawlers that never run JavaScript, without adding any code to your site. Firehose POSTs batches of log records; Kitbase reads each CloudFront log's fields (c-ip, cs-user-agent, cs-method, x-host-header, cs-uri-stem, cs-referer), classifies the actor, and stores bot/crawler requests. Human requests are not stored.

POST https://ingest.kitbase.dev/ingest/v1/cloudfront?environment=<ENV_NAME>

Amplify / your origin ──► CloudFront ──standard logs v2──► Amazon Data Firehose ──► Kitbase
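To make the flow concrete, here is a minimal sketch of how one Firehose delivery could be unpacked on the receiving side. Firehose's HTTP endpoint protocol wraps each batch in a JSON envelope with a requestId, a timestamp, and a list of records whose data is base64-encoded. The function name, the sample values, and the assumption that each record decodes to one JSON log object per line are illustrative only; this is not Kitbase's actual parser.

```python
import base64
import json


def decode_firehose_batch(body: str) -> list[dict]:
    """Unpack the log rows from one Firehose HTTP endpoint delivery."""
    envelope = json.loads(body)
    rows = []
    for record in envelope.get("records", []):
        payload = base64.b64decode(record["data"]).decode("utf-8")
        # Assumption: a record holds one JSON log object per line.
        for line in payload.splitlines():
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# Invented sample row; the values are placeholders, not real traffic.
sample_row = {
    "c-ip": "203.0.113.7",
    "cs-method": "GET",
    "cs-uri-stem": "/pricing",
}
encoded = base64.b64encode(json.dumps(sample_row).encode("utf-8")).decode("ascii")
sample_body = json.dumps({
    "requestId": "example-request-id",
    "timestamp": 1700000000000,
    "records": [{"data": encoded}],
})

rows = decode_firehose_batch(sample_body)
print(rows[0]["cs-uri-stem"])
```

Each decoded row is then a plain dictionary keyed by CloudFront field name, which is the shape the classification step would work from.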

Amplify sites need your own CloudFront

AWS Amplify already serves through a CloudFront distribution, but it isn't one you can attach logging to. Put your own CloudFront distribution in front of the Amplify app, with the Amplify app as its origin, then enable logging on that distribution.

Privacy — we only keep the bots

Forwarding every request doesn't mean every request is stored. Human visitors' signals are used only to classify the request in memory and are then discarded — only bot and crawler requests are persisted. For those, the raw IP is stored only when IP logging is enabled for the environment; otherwise it's used to derive geolocation (country, region, city) and then dropped.
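The retention rule above can be sketched as a small function. This is only an illustration of the stated policy: the function name, the record shape, and the idea that classification and geolocation arrive as ready-made inputs are all assumptions, not Kitbase's implementation.

```python
from typing import Optional


def to_stored_record(row: dict, is_bot: bool, ip_logging_enabled: bool,
                     geo: dict) -> Optional[dict]:
    """Return what would be persisted for one request, or None for humans."""
    if not is_bot:
        # Human traffic: classified in memory, then discarded.
        return None
    # Start from the log row without the raw client IP.
    stored = {key: value for key, value in row.items() if key != "c-ip"}
    # Geolocation (country, region, city) derived from the IP is kept.
    stored["geo"] = geo
    if ip_logging_enabled and "c-ip" in row:
        # The raw IP is kept only when IP logging is on for the environment.
        stored["c-ip"] = row["c-ip"]
    return stored
```

With IP logging off, a bot request keeps its derived location but not the address it came from; a human request yields nothing to store at all.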

Setup

You need three things: a CloudFront distribution, a Firehose delivery stream pointed at Kitbase, and CloudFront standard logging v2 wired to that stream.

1. CloudFront distribution: front your origin (the Amplify app's *.amplifyapp.com domain, or any origin) with a CloudFront distribution. If you serve a custom domain (e.g. home.example.com), add it to both CloudFront (with an ACM cert in us-east-1) and, for Amplify origins, to the Amplify app, so the forwarded Host header resolves.
2. Amazon Data Firehose stream (Direct PUT) with an HTTP endpoint destination:
   - Endpoint URL: https://ingest.kitbase.dev/ingest/v1/cloudfront?environment=Production (the environment query parameter is the Kitbase environment name).
   - Access key: set it to your secret API key, sk_kitbase_<your_secret_api_key>. Firehose sends it in the X-Amz-Firehose-Access-Key header; Kitbase uses it to authenticate the stream and resolve your project. Use the secret key, not the browser-exposed SDK key.
   - Buffering: 60 s / 1 MB is a good default (near-real-time).
   - Backup: enable failed-data-only S3 backup so nothing is lost if the endpoint is briefly unavailable.
3. CloudFront standard logging v2 → your Firehose stream. Set Output format to JSON, and select these fields (CloudFront uses the W3C cs(Header) spelling): timestamp, c-ip, sc-status, cs-method, cs-uri-stem, cs-uri-query, cs(Host), cs(User-Agent), cs(Referer), sc-content-type.

That's it — CloudFront delivers access logs to Firehose, Firehose batches and POSTs them to Kitbase, and bot/crawler requests show up in your dashboard.
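If you would rather script the Firehose step than click through the console, the settings above map roughly onto the request shape of boto3's create_delivery_stream call. The sketch below only builds the arguments; the helper name, the stream name, and the ARNs are placeholders, and the parameter names should be checked against the current Firehose API reference before use.

```python
from urllib.parse import urlencode

KITBASE_ENDPOINT = "https://ingest.kitbase.dev/ingest/v1/cloudfront"


def build_delivery_stream_args(stream_name: str, environment: str,
                               secret_key: str, role_arn: str,
                               backup_bucket_arn: str) -> dict:
    """Build keyword arguments for a Direct PUT stream that POSTs to Kitbase."""
    url = f"{KITBASE_ENDPOINT}?{urlencode({'environment': environment})}"
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",
        "HttpEndpointDestinationConfiguration": {
            "EndpointConfiguration": {
                "Url": url,
                "Name": "Kitbase",
                # Sent by Firehose in the X-Amz-Firehose-Access-Key header.
                "AccessKey": secret_key,
            },
            # 60 s / 1 MB, the near-real-time default suggested above.
            "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 1},
            "RoleARN": role_arn,
            # Only batches that fail delivery are written to S3.
            "S3BackupMode": "FailedDataOnly",
            "S3Configuration": {
                "RoleARN": role_arn,
                "BucketARN": backup_bucket_arn,
            },
        },
    }


# Placeholder values; substitute your own names, key, and ARNs.
args = build_delivery_stream_args(
    stream_name="kitbase-cloudfront-logs",
    environment="Production",
    secret_key="sk_kitbase_<your_secret_api_key>",
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
    backup_bucket_arn="arn:aws:s3:::example-firehose-backup",
)
```

With boto3 installed and credentials configured, the result would be passed as boto3.client("firehose").create_delivery_stream(**args). Keeping the builder separate from the API call makes the configuration easy to inspect before anything is created.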

Tip — standard logging v2, not real-time logs

Use CloudFront standard logging v2 (delivered to Firehose), not real-time logs. Real-time logs require a Kinesis Data Stream, which adds a baseline hourly cost; standard logging v2 has no such baseline — you pay only for Firehose ingestion, which the AWS free tier covers at low volume.

Response

200 OK — Firehose's required acknowledgement, { "requestId": "…", "timestamp": <ms> }. Log rows without a usable client IP and User-Agent are ignored. A non-2xx response makes Firehose retry the batch and, after retries, park it in the stream's S3 backup.
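For reference, the acknowledgement Firehose expects can be sketched as follows. Firehose's HTTP endpoint protocol requires the response to echo the requestId from the delivery and to carry a millisecond timestamp. The function name is hypothetical, and this shows only the shape of the reply, not how Kitbase builds it.

```python
import json
import time


def build_firehose_ack(request_body: str) -> tuple[int, str]:
    """Build the status code and JSON body that acknowledge one delivery."""
    request_id = json.loads(request_body)["requestId"]
    ack = {
        # Must match the requestId Firehose sent with the batch.
        "requestId": request_id,
        # Time the batch was processed, in epoch milliseconds.
        "timestamp": int(time.time() * 1000),
    }
    return 200, json.dumps(ack)


status, body = build_firehose_ack(
    '{"requestId": "example-request-id", "timestamp": 1700000000000, "records": []}'
)
```

Anything other than a 2xx status at this point is what triggers the retry and S3 backup behaviour described above.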

Next steps

API reference — endpoint details and response schema.
All platforms — setup guides for other frameworks and hosts.
