Hardware-Software Co-Design
This direction focuses on the tight feedback loop between algorithm design and hardware constraints. Many of my most impactful results have come from co-designing software with specific hardware targets, whether that means writing custom CUDA kernels, mapping generative models onto quantum processors, or building verification infrastructure for system-on-chip designs.
At Zephram, I developed multi-GPU CUDA kernels for energy-model inference that achieved a 600x speed-up over the prior state of the art. These kernels were purpose-built for the memory-access patterns and numerical-precision requirements of our energy-based generative models, demonstrating that algorithm-hardware co-optimization can unlock order-of-magnitude gains that generic GPU programming leaves on the table.
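The production kernels are specific to Zephram's models, but the compute pattern they accelerate can be sketched with a toy Ising-style energy model in NumPy. Everything below (model form, shapes, function names) is illustrative, not Zephram's code:

```python
import numpy as np

def batched_ising_energy(states, W, b):
    """Energy E(x) = -0.5 * x^T W x - b^T x for a batch of spin states.

    states: (batch, n) array of +/-1 spins
    W:      (n, n) symmetric coupling matrix with zero diagonal
    b:      (n,) bias vector

    Fusing the quadratic form into one dense contraction keeps memory
    access contiguous across the batch -- the same consideration that
    drives tiling and layout choices in a hand-written CUDA kernel.
    """
    interaction = -0.5 * np.einsum('bi,ij,bj->b', states, W, states)
    field = -states @ b
    return interaction + field

# Toy usage on a random 4-spin model
rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.standard_normal(n)
states = rng.choice([-1.0, 1.0], size=(8, n))
energies = batched_ising_energy(states, W, b)
```

The batched contraction is the part that maps naturally onto GPU hardware: each batch element is independent, and the working set per element is small and regular.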
At Quantinuum, this philosophy extended to quantum-classical hybrid systems. I engineered CUDA kernels for simulating UCCSD quantum circuits (20x speed-up), enabling the large-scale data collection needed for training transformer models on quantum chemistry tasks. The "Non-native Quantum Generative Optimization with Adversarial Autoencoders" paper (2024) directly addresses the hardware mapping problem: how to compile generative models onto quantum devices with non-native gate sets, using adversarial autoencoders to bridge the gap between ideal and hardware-constrained representations.
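As a rough illustration of what such kernels accelerate (not Quantinuum's implementation), here is a minimal NumPy statevector update for a single-qubit gate. A UCCSD circuit simulation reduces to long sequences of such updates over an exponentially large amplitude array, which is why they are bandwidth-bound and reward custom GPU kernels:

```python
import numpy as np

def apply_1q_gate(state, gate, target, n_qubits):
    """Apply a 2x2 gate to the `target` qubit of an n-qubit statevector.

    Reshaping exposes the target axis, so the gate application becomes a
    small matrix multiply broadcast over 2^(n-1) amplitude pairs -- the
    memory-bound inner loop that GPU kernels optimize.
    """
    psi = state.reshape([2] * n_qubits)
    psi = np.moveaxis(psi, target, 0)               # target axis to front
    psi = np.tensordot(gate, psi, axes=([1], [0]))  # 2x2 gate on that axis
    psi = np.moveaxis(psi, 0, target)               # restore axis order
    return psi.reshape(-1)

# Example: Hadamard on qubit 0 of |00> (qubit 0 = most significant bit
# in this indexing convention) yields (|00> + |10>) / sqrt(2)
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
state = np.zeros(4)
state[0] = 1.0
out = apply_1q_gate(state, H, target=0, n_qubits=2)
```

Real simulators fuse adjacent gates and partition the statevector across devices; this sketch only shows the core data movement.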
The D-Wave work (CLEO 2021) explored quantum annealing hardware for metasurface optimization — mapping continuous photonic design problems onto the Ising model structure native to D-Wave's quantum processing units.
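The mapping step can be sketched generically. The snippet below converts a QUBO objective, standing in here for a binarized photonic merit function (an assumption for illustration), into the Ising form that D-Wave hardware natively samples:

```python
import numpy as np

def qubo_to_ising(Q):
    """Convert minimize x^T Q x over x in {0,1}^n into Ising form
    minimize s^T J s + h^T s + offset over s in {-1,+1}^n,
    via the substitution x = (1 + s) / 2.

    D-Wave QPUs sample low-energy states of Ising Hamiltonians, so any
    binary-encoded design problem passes through a transform like this.
    """
    Q = (Q + Q.T) / 2            # symmetrize (x^T Q x is unchanged)
    J = Q / 4
    np.fill_diagonal(J, 0.0)     # diagonal terms become constants (s_i^2 = 1)
    h = Q.sum(axis=1) / 2        # linear field, includes diagonal contribution
    offset = (Q.sum() + np.trace(Q)) / 4
    return J, h, offset
```

In practice the hard part is upstream of this transform: choosing a binary encoding of the continuous metasurface parameters whose QUBO energies track the true photonic merit function.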
My earlier experience at ARM grounded this work in classical hardware-software co-design. There, I built verification software for ARMv8 instruction set coverage (a 250% improvement on a Chi-Square randomness metric) and developed a Python-to-RISC-V transpiler for generating Verilog ROM for SoC designs. This experience, spanning processor microarchitecture through verification methodology, informs how I approach the problem of making algorithms maximally efficient on target hardware.
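The verification tooling itself is ARM-internal, but the kind of metric involved can be illustrated with a plain Pearson chi-square uniformity check over a hypothetical opcode stream (all names and opcodes below are made up for the example):

```python
from collections import Counter

def chi_square_statistic(samples, categories):
    """Pearson chi-square statistic against a uniform expectation.

    In instruction-stream verification, `samples` would be the opcodes
    (or encoding fields) emitted by a random test generator; a lower
    statistic means the generator exercises the encoding space more
    evenly, improving coverage per simulated cycle.
    """
    counts = Counter(samples)
    expected = len(samples) / len(categories)
    return sum((counts.get(c, 0) - expected) ** 2 / expected
               for c in categories)

# Hypothetical 4-opcode generator: a skewed stream scores far worse
ops = ['ADD', 'SUB', 'LDR', 'STR']
uniform = ops * 25                                        # perfectly even
skewed = ['ADD'] * 70 + ['SUB'] * 10 + ['LDR'] * 10 + ['STR'] * 10
```

A real flow would test joint distributions over operands, addressing modes, and hazard patterns, not just opcode marginals, but the statistic is the same.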
The common principle is that the best performance comes not from optimizing algorithms in isolation, but from understanding the hardware execution model and designing algorithms that exploit it.
Selected works
- Non-native Quantum Generative Optimization with Adversarial Autoencoders
  Wilson, B., Wurtz, J., Mkhitaryan, V., Bezick, M., Wang, S.T., Kais, S., Shalaev, V., Boltasseva, A.
  arXiv:2407.13830 (2024)
- Metasurface Design Optimization via D-Wave based Sampling
  Wilson, B., Kudyshev, Z., Kildishev, A., Shalaev, V., Kais, S., Boltasseva, A.
  CLEO (2021)
- Multi-GPU CUDA Kernels for Energy-Model Inference
  Custom CUDA kernels for multi-GPU energy-model inference at Zephram; 600x speed-up over the previous state of the art.
- CUDA-accelerated Quantum Circuit Simulation
  Multi-GPU CUDA kernels for simulating UCCSD quantum circuits at Quantinuum; 20x speed-up for data collection pipelines.
- ARMv8 SoC Verification Infrastructure
  Verification software in Python, XML, and C++ for ARMv8 instruction set coverage at ARM; 250% improvement in Chi-Square test randomness.