XTX has one of the largest HPC clusters in the world, which the research technology team have built by writing software. We are looking for someone to join in a senior capacity to work with our experienced team to help coordinate designing our datacentre, choosing the right hardware, tuning operating systems, storage, and networks, managing orchestration and observability to writing the software which manages fair and efficient distribution of work on our compute cluster. We are a full stack team that works side-by-side with our researchers to make the most performant, reliable and transparent system we can. The infrastructure that we build and maintain allows XTX to trade globally with daily volumes of over $250bn per day across a wide range of asset classes. RESPONSIBILITIES
The right candidate will: Contribute to all components of our HPC infrastructure and code, and work on growing one of the biggest private compute clusters anywhere. Write software that runs on a compute cluster that grows continually but currently has ~1PB memory, ~110K CPU cores, >25,000 GPU’s, 200+PB of mixed SSD and HDD storage, connected by a high-performance network. Enter an environment where improvements can usually be made very quickly, and where the results of those changes are both immediately visible and can make a large impact to the quantitative research function at the heart of our business. Examples of current projects being undertaken by the group are: Implementing an inhouse-written filesystem Changing our host monitoring infrastructure Changing our observability system to a new lightweight high performance metrics visualisation system ESSENTIAL ATTRIBUTES
Previous direct experience in a similar setting with large-scale-compute, although previous experience in financial services is not necessary. We expect between 5-10 years of experience in a relevant setting. Strong coding skills, preferably with recent exposure to python and at least one statically typed language. You will likely have a good STEM degree and/or top-notch technical credentials, and a drive to achieve. Clear working knowledge and hands-on experience with computer networks; design, configuration, monitoring, automation, approaches to loss + congestion control, understanding of underlying hardware, IP/Ethernet and InfiniBand, host tuning. Strong knowledge of large-scale infrastructure management, using code as a tool to facilitate hardware performance evaluation, automated builds, patching, application deployment, domain environment (DHCP/DNS/data locality etc.), observability (monitoring and alerting) and storage. A desire to solve complex problems optimally from the ground up, not just always reassembling components written by others. A working knowledge of large-scale distributed systems. Understanding of one or more machine learning frameworks and compute offload devices, like GPUs, is an advantage. BENEFITS
Onsite gym, sauna, and fitness classes at no charge Extensive medical benefits including an on-site doctor and therapist at no charge Breakfast and lunch provided daily Various supports for caregivers, including emergency dependent care 25 days paid holiday per year + statutory holiday and paid sick days
#J-18808-Ljbffr