MENLO PARK — On April 21, 2020, Facebook AI and AWS engineers announced a collaboration on new libraries for large-scale, elastic, and fault-tolerant AI model training, as well as high-performance PyTorch model deployment. The libraries are designed to help the AI community deploy models efficiently at scale and to support cutting-edge research as model architectures grow increasingly complex.
One of the key releases is TorchServe, an open-source framework that simplifies deploying PyTorch models for high-performance inference. TorchServe provides multi-model serving, metrics for monitoring, logging, and RESTful inference endpoints, offering an efficient path to deploying PyTorch models at scale. The framework is cloud-agnostic and runs in a wide range of environments, making it a versatile tool for developers.
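As a rough illustration of that workflow, the sketch below packages a trained model into a TorchServe archive, starts the server, and queries the REST endpoint. The model name, file names, and sample image are hypothetical placeholders; the commands assume `torchserve` and `torch-model-archiver` are installed (e.g. via pip) and a trained weights file is available.

```shell
# Package the model into a .mar archive (file names here are placeholders).
torch-model-archiver \
  --model-name my_classifier \
  --version 1.0 \
  --serialized-file my_classifier.pth \
  --handler image_classifier \
  --export-path model_store

# Start TorchServe, pointing it at the model store.
torchserve --start --model-store model_store --models my_classifier.mar

# Send an image to the RESTful inference endpoint (default port 8080).
curl http://127.0.0.1:8080/predictions/my_classifier -T sample.jpg

# Stop the server when done.
torchserve --stop
```

The same running server can host multiple models at once, each reachable under its own `/predictions/<model-name>` path, which is what the multi-model serving feature refers to.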
Another major release is the integration of TorchElastic with Kubernetes. It lets developers train machine learning models on clusters that can scale up or down dynamically without disrupting in-flight training jobs. TorchElastic's fault-tolerant design keeps training running even when servers fail or network issues arise, making it well suited to scalable, distributed model training.
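A minimal sketch of an elastic launch with the TorchElastic of that era looks like the following. It assumes an etcd instance is reachable for rendezvous, and `train.py` is a hypothetical distributed training script; the node counts and endpoint address are illustrative.

```shell
# Launch an elastic job that tolerates between 1 and 4 nodes.
# Workers re-rendezvous through etcd if nodes join or fail,
# so training continues without a full restart.
python -m torchelastic.distributed.launch \
  --nnodes=1:4 \
  --nproc_per_node=2 \
  --rdzv_id=my_job_123 \
  --rdzv_backend=etcd \
  --rdzv_endpoint=etcd-host:2379 \
  train.py
```

The `--nnodes=1:4` range is what makes the job elastic: the cluster can shrink to one node or grow to four, and the rendezvous layer reforms the worker group rather than aborting the run.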
Both libraries are included in the PyTorch 1.5 release and will be maintained by Facebook and AWS together with the broader PyTorch community. The companies expect them to significantly advance flexible, large-scale AI model training and deployment.