Yasin Engin Go Backend
Tolerex - Fault-Tolerant Storage System in Go
A distributed storage lab for practicing replication, failover, secure node traffic, and operational thinking.
Problem
Storage systems need predictable behavior when a node fails, a follower falls behind, or a client repeats a request. Tolerex focuses on the core engineering problem: keeping the service available while making replication, health checks, and recovery visible enough to debug.
Architecture
The system is organized around a leader node, follower nodes, a client-facing API, a heartbeat loop, and a persistent log/checkpoint path. The design keeps health monitoring separate from request handling so failures can be detected without blocking normal traffic.
Technologies
- Go for service implementation and concurrency control.
- gRPC for explicit service contracts between components.
- mTLS for authenticated node-to-node communication.
- Disk-backed checkpoints and logs for restart behavior.
What I Built
- Leader/follower replication flow with heartbeat-based health checks.
- Failure detection path that can trigger a controlled failover.
- Failure detection path that can trigger a controlled failover.
- Persistence layer for data and log checkpoints.
- Basic observability points for metrics, logs, and recovery timing.
What I Learned
- Separating failure detection from client request paths makes behavior easier to test.
- Distributed systems code needs simple, visible state transitions more than clever abstractions.
- Secure transport should be designed early because it affects local development, certificates, and deployment habits.
Future Improvements
- Add repeatable chaos tests for leader failure, follower lag, and network partitions.
- Expose Prometheus metrics and a small dashboard for failover timing.
- Document benchmark scenarios with dataset size, request pattern, and recovery target.