Preferred Networks: Deep Learning Supercomputer

2nd Generation Intel® Xeon® Scalable processors and Intel® Optane™ persistent memory enable a faster data pipeline.

At a Glance:

  • Preferred Networks (PFN) develops artificial intelligence solutions for industrial and domestic robotics, Industrial Internet of Things (IIoT), manufacturing systems and other industries.

  • Traditional SSDs could not meet the I/O throughput requirements of PFN’s new custom-designed deep learning accelerator, so PFN turned to Supermicro’s SuperServer hardware with Intel® Xeon® Platinum 8260M processors and Intel® Optane™ persistent memory to enable a balanced node with fast access and high capacity for training data.

Executive Summary
Preferred Networks (PFN) uses Intel® Xeon® Platinum 8260M processors and Intel® Optane™ persistent memory to create a high-performance data pipeline that keeps the custom deep learning training accelerator in its new MN-3 HPC system busy. Located in Tokyo, Preferred Networks is a deep learning company that deploys high performance computing (HPC) clusters to build and train algorithms used in domestic and industrial applications. Its latest system, MN-3, integrates a custom-designed deep learning accelerator the company engineered. Intel Optane persistent memory provides the capacity and speed needed to feed data to the accelerator, maintaining high training performance.

Traditional SSDs could not meet the I/O throughput requirements of the new architecture, so Preferred Networks turned to Intel® Xeon® Platinum 8260M processors and Intel® Optane™ persistent memory to enable a balanced node with fast access and high capacity for training data.

Challenge
Preferred Networks develops artificial intelligence (AI) solutions for industrial and domestic robotics, Industrial Internet of Things (IIoT), manufacturing systems, and other industries. It is a leader in the robotics revolution.1

The company’s Research and Development (R&D) team uses HPC systems designed specifically to create and train algorithms for automated functions, such as:

  • Predictive analytics for industrial machines to optimize their use and maintenance for increased productivity
  • Controlling a robot to easily navigate in a home, recognize objects out of place, pick them up, and put them where they belong
  • Other autonomous operations based on computer vision

Preferred Networks’ largest R&D supercomputers, MN-1 and MN-2, include more than 2,500 GPUs in total. Yet Preferred Networks needed to further accelerate computations to support the many projects its engineering team is working on.

Solution
“We believe more computational power makes our engineers and researchers more effective,” vice president of Computing Infrastructure Yusuke Doi said. “By keeping a leadership position in our computational capabilities, we can better compete in our industry and provide advanced solutions to our customers.”

So, Preferred Networks designed a unique custom accelerator called MN-Core.2 MN-Core is a custom processor based on a four-die multi-chip package designed specifically for PFN’s own R&D projects. The quadruple-chip package, specialized for deep learning training tasks, is at the center of the design for a new supercomputing cluster, MN-3. However, due to the dramatic increase in computing performance, the company ran into I/O bottlenecks when it began to design and evaluate the data-loading path for the training system.

Many of Preferred Networks’ projects are computer vision problems. The training data set, consisting of millions of JPEG image files, is archived on a large external storage system. It is not practical to store the entire data set directly in system memory, even though that would give the fastest access. For training, the data is first copied onto high-performance NVMe SSDs in the nodes.

2nd Generation Intel® Xeon® Scalable Processors and Intel® Optane™ Persistent Memory Enable Up to 3.5X Faster Data Pipeline3
“We first benchmarked node performance with the Intel Xeon 8260M processors,” engineer Tianqi Xu of Preferred Networks explained. “During the I/O phase, the processor must get the JPEG files out of block storage and into memory, decode them, and then perform model-specific augmentations. With the 2nd Generation Intel® Xeon® Scalable processors and current GPUs, the node was well balanced for I/O, computing, and storage.”
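
As an illustration only, the following Python sketch mirrors the shape of such an I/O phase: it reads a JPEG from block storage, decodes it into memory, and applies simple augmentations. It is not Preferred Networks’ actual code; the Pillow and NumPy libraries and the specific augmentation choices are assumptions for the example.

  import numpy as np
  from PIL import Image

  def load_and_augment(jpeg_path, size=224):
      """Illustrative I/O phase: decode one JPEG and augment it."""
      # Read the compressed file from block storage and decode it
      img = Image.open(jpeg_path).convert("RGB")
      # Placeholder model-specific augmentations: resize + random flip
      img = img.resize((size, size))
      pixels = np.asarray(img, dtype=np.uint8)
      if np.random.rand() < 0.5:
          pixels = pixels[:, ::-1, :]  # horizontal flip along the width axis
      return pixels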

But with terabytes of data to move during training and the I/O challenges discovered in the data path, a traditional storage hierarchy with SSDs would not be able to keep up with the custom accelerator. The accelerator would be starved for data. Preferred Networks needed high-capacity storage at DIMM-like speeds in the node. Engineers worked directly with Intel to understand how the high memory bandwidth of 2nd Gen Intel Xeon Scalable processors and support for high-capacity Intel Optane persistent memory could create a very fast and very large data pipeline.

Once Preferred Networks became aware of Intel Optane persistent memory’s capability to speed up its AI pipeline, the company initiated a proof of concept to verify that the design would support high-capacity storage. Intel continues to advise the company as it moves forward with its AI technology efforts.

Leveraging a New Hierarchy of Storage with Intel® Optane™ Technology
Intel Optane persistent memory is a high-density, byte-addressable, 3D memory technology in a DIMM form factor that delivers a unique combination of large capacity, low latency, low power, and data persistence. The persistent memory modules integrate a new layer into the memory/storage hierarchy of an HPC system, offering DIMM-like speeds of byte-addressable data access with terabytes of capacity on the memory bus. Most 2nd Generation Intel Xeon Scalable processors support Intel Optane persistent memory modules. A node with the Intel Xeon 8260M processors can support up to 3 TB of Intel Optane persistent memory.

Intel Optane persistent memory can operate in different modes (memory mode, app-direct mode, and storage over app-direct). In memory mode, the CPU uses the Intel Optane persistent memory as system memory and uses the installed DRAM DIMMs as a cache. In app-direct mode, software is made aware of both types of memory and directs data reads and writes to DRAM or to Intel Optane persistent memory based on suitability. This offers larger capacity and higher performance for Preferred Networks’ training processes.

“In memory mode, the entire memory domain would reside in the persistent memory,” Xu added, “which means we wouldn’t get optimal use of the entire three terabytes. Additionally, deep learning data access patterns are very random. DRAM as cache doesn’t work effectively for those accesses. We needed direct control over the persistent memory, so we developed custom code to control it in app-direct mode.”
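
App-direct usage is commonly exposed to applications as a DAX (direct access) file system backed by the persistent memory, which software can memory-map for byte-addressable loads and stores. The Python sketch below shows that general pattern only; the mount point and region size are assumptions, and this is not the custom code Xu describes.

  import mmap
  import os
  import numpy as np

  # Assumed path: a file on a DAX-mounted file system backed by Intel
  # Optane persistent memory configured in app-direct mode.
  PMEM_FILE = "/mnt/pmem0/random_access_region.bin"
  REGION_BYTES = 1 << 30  # 1 GiB region, for illustration

  fd = os.open(PMEM_FILE, os.O_CREAT | os.O_RDWR)
  os.ftruncate(fd, REGION_BYTES)
  region = mmap.mmap(fd, REGION_BYTES)  # loads and stores reach the modules

  # A NumPy view gives fine-grained access, which suits the very random
  # data access patterns of deep learning training.
  data = np.frombuffer(region, dtype=np.uint8)
  offsets = np.random.randint(0, REGION_BYTES, size=1024)
  samples = data[offsets]  # random byte reads served from persistent memory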

In addition to this custom code, Preferred Networks developed a custom library to take advantage of the large capacity, low latency, and byte-addressability of Intel Optane persistent memory. To optimize performance across the entire data pipeline and the custom accelerator, the company included a staging phase that pre-processes the JPEG images by converting them to raw pixel data and loading the data set into Intel Optane persistent memory, as sketched below.
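
A minimal sketch of such a staging phase, under the same memory-mapping assumptions as above, follows; the function name, file path, and fixed-size record layout are hypothetical and stand in for the library the company describes.

  import mmap
  import os
  import numpy as np
  from PIL import Image

  def stage_dataset(jpeg_paths, pmem_file="/mnt/pmem0/staged.bin", size=224):
      """Hypothetical staging phase: decode JPEGs to raw RGB pixels and
      pack them into a persistent-memory-backed array for training."""
      record_bytes = size * size * 3  # bytes per decoded RGB image
      total_bytes = record_bytes * len(jpeg_paths)
      fd = os.open(pmem_file, os.O_CREAT | os.O_RDWR)
      os.ftruncate(fd, total_bytes)
      region = mmap.mmap(fd, total_bytes)
      staged = np.frombuffer(region, dtype=np.uint8)
      staged = staged.reshape(len(jpeg_paths), size, size, 3)
      for i, path in enumerate(jpeg_paths):
          img = Image.open(path).convert("RGB").resize((size, size))
          staged[i] = np.asarray(img, dtype=np.uint8)  # no JPEG decode at train time
      region.flush()  # persist the staged data set
      return staged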

Result
The company is manufacturing its MN-Core accelerator and launching MN-3 around it. MN-3 is initially a cluster of up to 48 nodes, which the company will expand into a half-precision exascale supercomputer. The Intel Xeon 8260M processors allow MN-3 to optimize pre-processing performance to stage the data set and to handle the post-processing phase that manages the results.

Early benchmarking of the data pipeline with Preferred Networks’ MN-Core accelerator, Intel Xeon 8260M processors, and Intel Optane persistent memory is returning up to 3.5X faster data throughput compared to the same system with NVMe SSDs.4 In addition to being fast, the system is highly energy efficient: MN-3 ranked #1 on the June 2020 Green500 list.5 Preferred Networks expects to grow the system by as much as 20X over five years to reach exascale performance for deep learning training.

Solution Summary
Preferred Networks has been using HPC clusters for deep learning training to support its customers. Needing more performance, the company built its own deep learning accelerator and, around it, the first stage of a new cluster named MN-3. Traditional SSDs could not meet the I/O throughput requirements of the new architecture, so Preferred Networks turned to Intel Xeon 8260M processors and Intel Optane persistent memory to enable a balanced node with fast access and high capacity for training data. The new system design is expected to deliver up to 3.5X faster performance, according to Preferred Networks.

Solution Ingredients

  • 48-node deep learning training cluster with custom accelerator
  • Two 24-core Intel Xeon 8260M processors per node
  • 3 TB of Intel Optane persistent memory per node (153.6 TB total)

Spotlight on Supermicro
Supermicro’s SuperServer hardware was deployed at Preferred Networks. The SuperServer platform offers high levels of performance and efficiency and supports 2nd Gen Intel Xeon Scalable processors.

Supermicro (Nasdaq: SMCI), a leading innovator in high-performance, high-efficiency server technology, is a premier provider of advanced Server Building Block Solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, HPC and Embedded Systems worldwide.

Explore Related Products and Solutions

Intel® Xeon® Scalable Processors

Drive actionable insight, count on hardware-based security, and deploy dynamic service delivery with Intel® Xeon® Scalable processors.

Learn more

Intel® Optane™ Persistent Memory

Extract more actionable insights from data – from cloud and databases, to in-memory analytics, and content delivery networks.

Learn more

Notices and Disclaimers

Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at https://www.intel.com. // Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit https://www.intel.com/benchmarks. // Performance results are based on testing as of the dates set forth in the configurations and may not reflect all publicly available security updates. See the configuration disclosure for details. No product or component can be absolutely secure. // The cost reduction scenarios described are intended as examples of how a given Intel®-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. // Intel does not control or audit third-party benchmark data or the websites referenced in this document. You should visit the referenced websites and confirm whether the referenced data are accurate. // In some test cases, results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided for informational purposes. Any differences in your system hardware, software, or configuration may affect actual performance.

Product and Performance Information

3 Benchmark information provided by Preferred Networks.
4 Benchmark information provided by Preferred Networks, which measured throughput with the following steps: data read (from ndarray format), ImageNet augmentation (crop, resize, flip), and memory layout for the Preferred Networks accelerator (e.g., data copy).