Event Pipeline
Real-time event processing system for a Cairo-based startup
The Context
The company was relying on a nightly batch ETL process to move user events into the analytics and machine-learning feature stores. By the time the data was available, it was already 6+ hours stale. Marketing couldn't execute same-day campaigns, and the recommendation models were always training on yesterday's behavior. We needed a system that could validate, clean, and distribute events as they happened, without dropping data.
Architecture & Execution
The pipeline is built around Apache Kafka as the resilient backbone. The ingestion API does minimal structural validation and immediately pushes to Kafka to keep throughput high and latency low. The stream processors are independent Python workers that pull from Kafka, apply business logic, deduplicate events with a sliding-window Bloom filter in Redis (chosen to keep the memory footprint small), and route the cleaned payloads to the appropriate sinks.
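Below is a minimal sketch of one such worker, assuming kafka-python and redis-py; the topic name, consumer group, and sink functions are illustrative placeholders, not the production code.

    # Sketch of a stream-processor worker: consume, dedupe, enrich, route.
    import json

    from kafka import KafkaConsumer
    import redis

    r = redis.Redis(host="localhost", port=6379)

    consumer = KafkaConsumer(
        "user-events",                       # hypothetical topic name
        bootstrap_servers="localhost:9092",
        group_id="stream-processors",        # hypothetical consumer group
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        enable_auto_commit=False,
    )

    def is_duplicate(event_id: str) -> bool:
        # Placeholder for the Bloom-filter check described above; assumes the
        # RedisBloom module. See the rotating-filter sketch further below.
        return bool(r.execute_command("BF.EXISTS", "events:bloom", event_id))

    def enrich(event: dict) -> dict:
        # Placeholder for the PostgreSQL enrichment step.
        event["enriched"] = True
        return event

    def route(event: dict) -> None:
        # Placeholder for multi-sink distribution (warehouse, feature store, ...).
        print("routing", event["id"])

    for message in consumer:
        event = message.value
        if is_duplicate(event["id"]):
            continue
        route(enrich(event))
        consumer.commit()  # commit offsets only after the event reached its sinks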
Post-Mortem Lessons
Getting the number of Kafka partitions right is a dark art. We started with 12, hit a bottleneck, jumped to 48, and eventually settled on 24 after properly profiling how messages were distributed across partitions.
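For reference, one rough way to profile that distribution is to compare beginning and end offsets per partition; the sketch below assumes kafka-python, and the topic name is illustrative.

    # Approximate per-partition message counts to spot skew before resizing.
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    partitions = consumer.partitions_for_topic("user-events") or set()
    tps = [TopicPartition("user-events", p) for p in sorted(partitions)]

    begin = consumer.beginning_offsets(tps)
    end = consumer.end_offsets(tps)

    for tp in tps:
        # end - begin = retained messages in this partition
        print(f"partition {tp.partition}: {end[tp] - begin[tp]} messages")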
The biggest incident we had wasn't a crash; it was a schema change that passed ingestion validation but completely broke the parsing logic of three downstream consumers. Schema registries are not optional.
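To illustrate the kind of check a registry enforces, here is a producer-side validation sketch using the jsonschema package as a stand-in for a proper registry (in practice you would expect something like Avro with a hosted schema registry); the schema and its fields are illustrative.

    # Validate events against a versioned schema before they enter the pipeline.
    from jsonschema import validate, ValidationError

    EVENT_SCHEMA_V2 = {
        "type": "object",
        "required": ["id", "user_id", "event_type", "timestamp"],
        "properties": {
            "id": {"type": "string"},
            "user_id": {"type": "string"},
            "event_type": {"type": "string"},
            "timestamp": {"type": "number"},
        },
        "additionalProperties": False,  # reject fields consumers don't expect
    }

    def is_valid_event(event: dict) -> bool:
        try:
            validate(instance=event, schema=EVENT_SCHEMA_V2)
            return True
        except ValidationError:
            return False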
Bloom filters for deduplication are magical until you need to periodically clear them to prevent false positive rates from climbing. We implemented rotating daily filters with a 24-hour overlap buffer.
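A sketch of that rotation is below, assuming Redis with the RedisBloom module (BF.RESERVE / BF.ADD / BF.EXISTS); key names, capacity, and error rate are illustrative.

    # Rotating daily Bloom filters: write to today's filter, check today's and
    # yesterday's, and let expiry handle cleanup instead of an explicit clear job.
    from datetime import datetime, timedelta

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def _filter_key(day: datetime) -> str:
        return f"dedupe:bloom:{day:%Y-%m-%d}"

    def seen_before(event_id: str, now: datetime) -> bool:
        today = _filter_key(now)
        yesterday = _filter_key(now - timedelta(days=1))

        # Create today's filter on first use; a ~48h TTL keeps yesterday's
        # filter alive alongside it, giving the 24-hour overlap buffer.
        if not r.exists(today):
            r.execute_command("BF.RESERVE", today, 0.001, 10_000_000)
            r.expire(today, 48 * 3600)

        if r.execute_command("BF.EXISTS", today, event_id) or \
           r.execute_command("BF.EXISTS", yesterday, event_id):
            return True

        r.execute_command("BF.ADD", today, event_id)
        return False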
System Diagram
Mobile/Web Clients
↓
Ingestion API (Validation)
↓
Kafka Topic (Partitioned)
↓
Stream Processors
├─ Deduplication (Redis)
└─ Enrichment (PostgreSQL)
↓
Multi-Sink Distribution