TaskQ
Distributed task scheduler — senior thesis project
The Context
For my senior thesis at Cairo University, I wanted to parallelize large batch computations across commodity machines on a local network. I was frustrated by long-running batch jobs written as single-threaded scripts, so I set out to build a fault-tolerant system that let a user submit a directed acyclic graph (DAG) of tasks to a cluster and walk away.
Architecture & Execution
TaskQ used a centralized coordinator to manage task distribution. A client submitted a dependency graph of Python functions. The coordinator parsed the DAG to find tasks whose dependencies were already satisfied and dispatched them to worker nodes over a ZeroMQ message bus. I implemented a rudimentary work-stealing algorithm between workers and used SQLite on the coordinator to persist scheduler state for crash recovery. If a worker missed three consecutive heartbeats, the coordinator reassigned its in-flight tasks.
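The core of the coordinator's scheduling loop can be sketched in a few lines. This is an illustrative reconstruction, not the original TaskQ code: `ready_tasks` and the example DAG are hypothetical names, and the real system persisted state to SQLite between waves.

```python
def ready_tasks(dag, completed):
    """Return tasks whose dependencies have all finished.

    dag maps task name -> set of task names it depends on.
    """
    return {
        task for task, deps in dag.items()
        if task not in completed and deps <= completed
    }

# A toy dependency graph (illustrative only).
dag = {
    "load": set(),
    "clean": {"load"},
    "featurize": {"clean"},
    "train_a": {"featurize"},
    "train_b": {"featurize"},
    "report": {"train_a", "train_b"},
}

done = set()
waves = []
while len(done) < len(dag):
    wave = ready_tasks(dag, done)
    waves.append(sorted(wave))  # each wave can run in parallel
    done |= wave

# "train_a" and "train_b" land in the same wave, so the coordinator
# can dispatch them to different workers simultaneously.
```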
Post-Mortem Lessons
Coordination overhead is the silent killer of distributed systems. For tasks that took less than 500ms to execute natively, the time spent scheduling, serializing data, and handling network I/O completely erased the gains from parallelism.
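A back-of-the-envelope model makes the break-even point concrete. This is a simplified sketch (the function name and the fixed per-task overhead figure are assumptions for illustration): if every dispatched task pays a fixed scheduling, serialization, and network cost, sub-second tasks can easily run slower than a serial loop.

```python
def parallel_speedup(task_ms, n_tasks, n_workers, overhead_ms):
    """Naive model: serial runtime divided by parallel runtime,
    where each dispatched task pays a fixed coordination overhead."""
    serial = task_ms * n_tasks
    parallel = (task_ms + overhead_ms) * n_tasks / n_workers
    return serial / parallel

# With ~400 ms of per-task overhead and 4 workers:
# 100 ms tasks are *slower* in parallel (speedup < 1),
# while 5 s tasks still get close to the ideal 4x.
fast = parallel_speedup(task_ms=100, n_tasks=100, n_workers=4, overhead_ms=400)
slow = parallel_speedup(task_ms=5000, n_tasks=100, n_workers=4, overhead_ms=400)
```

The model ignores queueing and stragglers, but it captures why batching many small tasks into one dispatch unit was the obvious fix I never got around to implementing.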
Clock synchronization across different nodes is dramatically harder than you expect. Timestamps that were off by just a few hundred milliseconds made debugging race conditions nearly impossible.
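One remedy I only learned about afterwards: logical clocks sidestep wall-clock skew entirely by ordering events causally. A minimal Lamport clock sketch (not part of TaskQ; shown as what would have made cross-node logs debuggable):

```python
class LamportClock:
    """Logical clock: orders events causally without synchronized wall clocks."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the logical time.
        self.time += 1
        return self.time

    def send(self):
        # Stamp an outgoing message with the sender's logical time.
        return self.tick()

    def recv(self, msg_time):
        # On receipt, jump past the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time

coordinator, worker = LamportClock(), LamportClock()
t_send = coordinator.send()     # coordinator event
t_recv = worker.recv(t_send)    # worker event, causally after the send
```

Tagging every heartbeat and state update with a logical timestamp would have given a consistent event ordering even with hundreds of milliseconds of clock skew between nodes.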
ZeroMQ is brilliant for simple messaging patterns, but fighting its core connection-management assumptions to implement complex recovery logic consumed weeks of my life. I should have used a broker with built-in delivery guarantees, like RabbitMQ.
The system worked on paper and during the academic demo. It did not survive production-like loads for more than ten minutes. It taught me more about what *not* to do than what to do.
[ Coordinator Node ]
- DAG analysis
- Critical-path scheduling
↓
[ ZeroMQ Message Bus ]
↓
[ Worker Nodes (1..N) ]
- Task sandboxes
- State updates