USC’s Information Sciences Institute (ISI), a unit of the university’s Viterbi School of Engineering, is a world leader in the research and development of advanced artificial intelligence, information processing, computing, and communications technologies. ISI’s 400 faculty, professional staff and graduate students carry out extraordinary information sciences research at three distinct locations: Marina Del Rey, CA; Arlington, VA; and Waltham, MA.
*This position is located in Waltham, MA and will require occasional travel to Marina Del Rey, CA.*
ISI is seeking a Lead HPC Developer interested in helping us develop a shared compute cluster to support our language understanding research. A successful candidate will:
Collaborate with technical leadership in the design, development, installation, and maintenance of software for Linux and HPC cluster systems and ensure its scalability and fault-tolerance needs are met.
Take primary responsibility for planning, implementation, availability, performance, security, maintenance, and repair of cluster infrastructure.
Support best practices across the environment.
The candidate will coordinate closely with both the research groups using the cluster and ISI’s information technology department. They will foster collaboration between researchers in an inclusive environment that values differences by building and maintaining collaborative relationships with team members, peers, and organizational leaders.
Drives the day-to-day operations for the Linux and HPC cluster systems by monitoring computing resource performance, managing configurations, and addressing security administration.
Applies revisions to system firmware and software; engages and collaborates with vendors to assist support activities as required.
Leads the development of new HPC software deployment plans, custom scripts, and testing procedures to ensure operational reliability for researchers; trains technical staff in the use of new software and hardware, either developed or acquired.
Oversees the maintenance and management of HPC researcher accounts for staff and research groups; leads the installation, modification, and maintenance of various research software applications for access on HPC clusters; acts as a trusted technical advisor for researcher support and documentation on software applications and programs.
Designs, installs, configures, and performs document management for cluster infrastructure, including operating systems, job schedulers, resource managers, provisioning managers, configuration managers, SAN devices, network devices, and other components.
Investigates, debugs, and addresses researcher inquiries and requests efficiently through a customer issue ticketing system. Implements customer-focused resolutions efficiently; communicates complex technical concepts in a simple, straightforward manner to address a broad range of stakeholders.
Education: Bachelor’s degree in a relevant field such as computer science, computer information systems, etc. OR equivalent combined education, training, and experience.
Minimum Experience: 5 years of professional experience with at least 3 years of experience in high-performance computing cluster support and Linux system administration.
Applicants selected for this position will require access to ITAR materials. According to U.S. government regulations, ONLY U.S. citizens OR lawful permanent residents (green card) are eligible for ITAR access.
Multi-vendor management, security, and network/Internet protocols.
Administering, monitoring, and maintaining secure Linux/UNIX operating systems (CentOS/RHEL, Ubuntu).
Experience with HPC system software: cluster management tools and job schedulers (e.g., SGE, Slurm).
Experience with the planning and design of the hardware that supports an HPC cluster, including both CPU and GPU processing.
Proficiency with low-latency/high-bandwidth interconnected infrastructure such as 10GigE.
Knowledge of HPC storage (FC, SAS) principles, file systems (ZFS, etc.), and compute node storage (NFS).
Proficiency in fundamental programming/scripting skills (Bash, Python, or similar languages).
Configuration management tools such as Salt, Ansible, or Puppet (experience in non-production environments is acceptable).
Ability to identify, troubleshoot, and resolve problems and manage system performance.
Ability to drive technical leadership and management of complex large-scale computing system projects.
Experience establishing processes for maintaining system performance and managing best-in-class standards.
Working knowledge of machine learning algorithms and software frameworks (TensorFlow, PyTorch, Keras, CUDA, cuDNN, Caffe, Theano, etc.).
Virtualization infrastructures (VMware).
Container technologies (Docker, Singularity).
Cloud computing (AWS, Azure).
The University of Southern California values diversity and is committed to equal opportunity in employment.
Minimum Education: Bachelor's degree, or combined work experience and education as equivalent.
Minimum Experience: 5 years.
Minimum Field of Expertise: Relevant work experience providing strong technical knowledge of programming and analysis, and senior or lead experience.
USC’s Viterbi School of Engineering has been one of the economic engines in Southern California and a vital hub in the California economy. The technical advances and ideas generated by the Viterbi faculty and research community have resulted in countless innovations, many becoming the foundations for new companies, products and services. The thousands of students graduating each year bring new ideas and vitality to companies in California and beyond. With an annual research budget exceeding $205M, more than 46 research centers and institutes, more than 180 faculty members, 7,800 students and over 60,000 impassioned alumni world-wide, the Viterbi School is addressing some of the world’s great challenges.