Overview
Atlas Project
At Stellus I built Atlas, a blockchain API platform where developers deploy, manage, and interact with smart contracts across EVM networks.
We needed an asynchronous system that absorbs extreme blockchain volatility and RPC rate limits without propagating latency, a transaction pipeline that reliably executes on-chain operations despite slow block confirmations and network fragmentation, and a cryptographic custody mechanism that secures EVM private keys with zero margin for error.
I owned the entire backend and protocol layer from inception, with no existing infrastructure to build on.
System Architecture
System Flow
Atlas is built around three independent execution flows: a modular authentication system supporting both OAuth and Web3 wallet signatures; an async smart contract deployment pipeline that never blocks the HTTP layer; and a cache-first charting system that absorbs high-frequency dashboard queries without flooding the database or RPC providers.
A — Modular Authentication Flow
Auth flow — JWT + HttpOnly refresh cookie with Redis session validation and WebSocket notification on login
B — Smart Contract Deployment Flow
Deployment flow — async Celery task handles on-chain signing; WebSocket push confirms completion in real time
C — Contract Event & Charting Flow
Charting flow — Redis cache-first with Cache-Control headers; on miss, Postgres aggregates time-series and primes the cache
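The cache-first read path can be sketched in a few lines (names and the `db_query` callback are illustrative, not Atlas's actual code; the real queries aggregate time-series in Postgres):

```python
import json

def get_chart(redis_client, db_query, contract: str, window: str, ttl: int = 60):
    """Cache-first read: serve from Redis on a hit; on a miss, query the
    database and prime the cache with a TTL so repeat dashboard queries
    never touch Postgres or the RPC provider."""
    cache_key = f"chart:{contract}:{window}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no DB, no RPC
    data = db_query(contract, window)  # miss: aggregate in Postgres
    redis_client.setex(cache_key, ttl, json.dumps(data))  # prime the cache
    return data
```

With a short TTL, the database sees at most one aggregation query per contract/window pair per expiry interval, regardless of how many dashboards are polling.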
Impact
Shipped
The numbers reflect the operational state of the system during QA.
Technical Decisions
Solutions
Each of these was a failure mode or threat vector. None of them had an obvious default solution.
Async Architecture · Python · Celery · gevent
Deploying a contract via HTTP triggered a Web3.py call to an external RPC provider. Response times ranged from 200ms to 45s depending on congestion. While waiting, the FastAPI worker was blocked — unavailable to serve other requests. At scale, a few slow blockchain calls exhausted the worker pool and made the API unresponsive.
The solution was task queue isolation: the API enqueues a Celery task and immediately returns a job ID. Blockchain work runs asynchronously in a dedicated worker pool, so the HTTP tier is never blocked.
I used gevent as the Celery concurrency model. gevent patches Python I/O to be non‑blocking, so a worker waiting on an RPC response yields to other tasks instead of blocking a thread. This lets one worker handle many pending on‑chain calls without spawning OS threads, preserving performance at scale.
One subtlety: Web3.py's AsyncHTTPProvider is not thread-safe. Creating a new instance per task exhausts connection pools. The solution was a Singleton pattern enforced via asyncio.Lock(): one provider instance shared across all workers, with initialization guarded by the lock.
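A minimal sketch of the lock-guarded singleton (illustrative names; the real `_build` would return a Web3.py client wrapping `AsyncHTTPProvider`, while a plain object stands in here to keep the sketch dependency-free):

```python
import asyncio

class ProviderSingleton:
    _instance = None
    _lock = asyncio.Lock()

    @classmethod
    async def get(cls, rpc_url: str):
        # Fast path: already initialized, no lock contention.
        if cls._instance is None:
            async with cls._lock:
                # Double-check inside the lock: another coroutine may have
                # initialized the provider while we were waiting.
                if cls._instance is None:
                    cls._instance = cls._build(rpc_url)
        return cls._instance

    @staticmethod
    def _build(rpc_url: str):
        # Placeholder for the real provider construction.
        return {"rpc_url": rpc_url}
```

The double-checked pattern matters: without the inner `None` check, two coroutines that both pass the fast path would each build a provider, recreating the connection-pool exhaustion the singleton exists to prevent.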
Cryptography · Key Management · EVM Wallets
Atlas holds private keys for users. A private key gives absolute wallet control — if compromised, all assets are lost with no recovery. Key storage is therefore the highest‑stakes engineering decision.
Two attack surfaces. Data at rest: an attacker with database access must not be able to use stored key material. Brute force: an attacker holding an encrypted key and knowing the scheme must not be able to recover the plaintext in practical time.
AES-GCM secures data at rest with authenticated encryption: ciphertext is both encrypted and integrity‑verified, and dynamic nonces prevent pattern analysis.
PBKDF2 secures against brute force by deriving the AES key with 390,000 iterations and dynamic salts. Each guess requires 390,000 operations, making brute‑forcing impractical even with dedicated hardware.
The 390,000 iteration count follows OWASP’s 2023 guidance for PBKDF2‑HMAC‑SHA256, calibrated against modern GPU benchmarks.
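A hedged sketch of this envelope using the `cryptography` package (function names and the salt/nonce layout are illustrative, not Atlas's exact wire format): PBKDF2-HMAC-SHA256 with a random salt derives the AES key, and AES-GCM with a random 96-bit nonce encrypts and authenticates the private key.

```python
import os
from cryptography.hazmat.primitives.hashes import SHA256
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

ITERATIONS = 390_000

def encrypt_key(plaintext_key: bytes, passphrase: bytes) -> bytes:
    salt = os.urandom(16)   # dynamic salt: each encryption derives a fresh AES key
    nonce = os.urandom(12)  # dynamic nonce: same plaintext -> different ciphertext
    kdf = PBKDF2HMAC(algorithm=SHA256(), length=32, salt=salt, iterations=ITERATIONS)
    aes_key = kdf.derive(passphrase)
    ciphertext = AESGCM(aes_key).encrypt(nonce, plaintext_key, None)  # GCM tag appended
    return salt + nonce + ciphertext  # store all three together

def decrypt_key(blob: bytes, passphrase: bytes) -> bytes:
    salt, nonce, ciphertext = blob[:16], blob[16:28], blob[28:]
    kdf = PBKDF2HMAC(algorithm=SHA256(), length=32, salt=salt, iterations=ITERATIONS)
    aes_key = kdf.derive(passphrase)
    # Any bit flip in ciphertext or tag raises InvalidTag here instead of
    # silently returning corrupted key material.
    return AESGCM(aes_key).decrypt(nonce, ciphertext, None)
```

Storing the salt and nonce alongside the ciphertext is safe: neither is secret, and both must vary per encryption for the scheme's guarantees to hold.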
| Attack Vector | Mitigation | Why It Works |
|---|---|---|
| DB dump / data breach | AES-GCM encryption at rest | Ciphertext is useless without the derived key |
| Brute-force on ciphertext | PBKDF2 at 390,000 iterations | Each guess costs 390K hash operations — GPU cracking is impractical |
| Pattern analysis / replay | Dynamic nonces + dynamic salts | Same plaintext → different ciphertext every time |
| Ciphertext tampering | AES-GCM authentication tag | Any bit flip in ciphertext causes decryption to fail with an error |
Reliability Engineering · RPC Providers · HTTP 429
RPC providers (Alchemy, QuickNode) enforce rate limits. Exceeding them returns HTTP 429. Naively retrying immediately creates a retry storm — repeated failures that lock you out indefinitely.
Exponential backoff fixes this: each failure doubles the wait before the next attempt (1s, 2s, 4s, 8s…), giving the rate‑limit window time to reset.
But exponential backoff alone causes a thundering herd: multiple workers retry on the same schedule, creating another spike that trips the rate limit again.
Dynamic jitter solves it. Each worker adds randomness to its wait time, staggering retries across time. The jitter range scales with backoff duration, spreading retries more widely at longer waits when it matters most.
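The combined policy can be sketched as "full jitter" (an illustrative sketch, not Atlas's exact retry hook): the wait is drawn uniformly from zero up to the capped exponential window, so the spread of retry times grows with the backoff itself.

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Wait time before retry `attempt` (0-indexed).
    Exponential window (1s, 2s, 4s, 8s, ...) capped at `cap`, then a
    uniform draw over [0, window] so workers don't retry in lockstep."""
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, window)
```

Because the jitter range equals the full backoff window, retries after long waits are scattered across tens of seconds rather than landing together, which is exactly when desynchronization matters most.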