Archived — Backend

TaskQ

Distributed task scheduler — senior thesis project

Started: September 2019
Repository: Private / Internal

The Context

For my senior thesis at Cairo University, I wanted to figure out how to parallelize massive, independent computation tasks across commodity machines in a local network. I was frustrated by long-running batch jobs that were written as single-threaded scripts and wanted to build a fault-tolerant system that let a user throw a directed acyclic graph (DAG) of tasks at a cluster and walk away.

Architecture & Execution

TaskQ used a centralized coordinator model to manage task distribution. A client submitted a dependency graph of Python functions. The coordinator parsed the DAG to find runnable tasks — those with no unmet dependencies — and dispatched them to the worker nodes over a ZeroMQ message bus. I implemented a rudimentary work-stealing algorithm and used SQLite on the coordinator to persist state for crash recovery. If a worker missed three consecutive heartbeats, the coordinator would reassign its tasks.
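The two mechanisms above — picking runnable tasks out of the DAG and reclaiming work from dead workers — can be sketched in a few lines. This is an illustrative reconstruction, not TaskQ's actual code; names like `ready_tasks`, `check_workers`, and `HEARTBEAT_LIMIT` are hypothetical.

```python
from collections import defaultdict

def ready_tasks(deps, done):
    """Return tasks whose dependencies have all completed.

    deps: {task: set of tasks it depends on}
    done: set of completed task names
    """
    return [t for t, ds in deps.items() if t not in done and ds <= done]

HEARTBEAT_LIMIT = 3          # missed heartbeats before a worker is declared dead
missed = defaultdict(int)    # worker -> consecutive missed heartbeats

def check_workers(assignments, heartbeat_seen):
    """Reassign tasks from workers that missed too many heartbeats.

    assignments: {worker: [tasks currently assigned]}
    heartbeat_seen: set of workers that responded this round
    Returns the list of tasks to put back on the queue.
    """
    requeue = []
    for worker in list(assignments):
        if worker in heartbeat_seen:
            missed[worker] = 0
            continue
        missed[worker] += 1
        if missed[worker] >= HEARTBEAT_LIMIT:
            requeue.extend(assignments.pop(worker))
    return requeue
```

For a graph like `{"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}`, only `a` is runnable at the start; once `a` completes, `b` and `c` become runnable in parallel, which is the property the scheduler exploited.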

Post-Mortem Lessons

01

Coordination overhead is the silent killer of distributed systems. For tasks that took less than 500ms to execute natively, the time spent scheduling, serializing data, and handling network I/O completely erased the gains from parallelism.
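A back-of-the-envelope model makes the lesson concrete. If per-task overhead is paid serially at the coordinator, the achievable speedup depends only on the ratio of task time to overhead — the 150 ms overhead figure below is illustrative, not a measurement from TaskQ.

```python
def speedup(task_s, overhead_s, workers):
    """Speedup over serial execution when per-task overhead is serialized.

    Serial time for n tasks:   n * task_s
    Parallel time:             n * overhead_s + n * task_s / workers
    The n cancels, so speedup = task_s / (overhead_s + task_s / workers).
    """
    return task_s / (overhead_s + task_s / workers)

# With 4 workers and 150 ms of overhead per task (illustrative numbers):
# a 10 s task still parallelizes well, a 400 ms task barely breaks even,
# and a 100 ms task is actually SLOWER than running it serially.
for t in (10.0, 0.4, 0.1):
    print(f"{t:>5} s task -> {speedup(t, 0.15, 4):.2f}x")
```

This is why the ~2x theoretical speedup collapsed to 1.3x: the batch mix included many short tasks whose overhead share dominated.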

02

Clock synchronization across different nodes is dramatically harder than you expect. Timestamps that were off by just a few hundred milliseconds made debugging race conditions nearly impossible.
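One standard mitigation — which TaskQ did not use — is to stop trusting wall clocks for ordering and stamp events with a Lamport logical clock instead. A minimal, purely illustrative sketch:

```python
class LamportClock:
    """Logical clock: orders causally related events across machines
    without requiring synchronized wall clocks."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock and return the new timestamp."""
        self.time += 1
        return self.time

    def update(self, received):
        """On receiving a message stamped `received`, jump past it so the
        receipt is always ordered after the send."""
        self.time = max(self.time, received) + 1
        return self.time
```

Had the coordinator stamped every dispatched task and each worker called `update()` on receipt, log lines from different nodes would have sorted consistently regardless of clock skew — exactly the property missing during those race-condition hunts.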

03

ZeroMQ is brilliant for simple messaging patterns, but fighting its core connection-management assumptions to implement complex recovery logic consumed weeks of my life. I should have used a full-featured broker like RabbitMQ and let it handle delivery guarantees for me.

04

The system worked on paper and during the academic demo. It did not survive production-like loads for more than ten minutes. It taught me more about what *not* to do than what to do.

Worker Nodes: 4
Theoretical Speedup: ~2x
Actual Speedup: 1.3x
Thesis Grade: A-
Core Technologies: Python, ZeroMQ, SQLite
[ Coordinator Node ]
 - DAG analysis
 - Critical-path scheduling
          ↓
[ ZeroMQ Message Bus ]
          ↓
[ Worker Nodes (1..N) ]
 - Task sandboxes
 - State updates