Latest Version of ONNX Runtime Released, with Optimizations for BERT Inference

On January 21, Microsoft announced that it has open sourced optimization technology that improves the performance of inference with the natural language processing model BERT (Bidirectional Encoder Representations from Transformers). The technology is released as part of “ONNX Runtime,” the inference engine Microsoft provides.

BERT is a natural language processing model announced in 2018. While it is popular as a powerful language model, running BERT inference at scale and in near real-time is computationally expensive.

In November 2019, the Microsoft Azure AI research team announced that it was able to serve more than one million BERT inferences per second within Bing's latency limits. With this release, a further optimized version of that work has been introduced into the machine learning inference engine “ONNX Runtime.”

The open sourced technology is a C++ reimplementation of the BERT model, originally developed to understand web search queries, rewritten to improve response time. ONNX Runtime, developed by Microsoft, accelerates and optimizes machine learning inference. AI developers can use it to run large-scale transformer models on CPU or GPU hardware with high performance. Microsoft open sourced ONNX Runtime at the end of 2018.
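
As an illustration, the following is a minimal sketch of running a transformer model through ONNX Runtime's Python API. The model file name ("bert.onnx"), the input tensor names, and the sequence length are assumptions; they depend on how the model was exported, and the `providers` argument requires a reasonably recent onnxruntime build.

```python
# Minimal sketch: running a BERT-style ONNX model with ONNX Runtime.
# "bert.onnx" and the input names below are hypothetical; check your
# exported model for the actual input signature.
import numpy as np
import onnxruntime as ort

# Create an inference session. "CPUExecutionProvider" runs on CPU;
# with the onnxruntime-gpu package, "CUDAExecutionProvider" targets the GPU.
session = ort.InferenceSession("bert.onnx", providers=["CPUExecutionProvider"])

# Dummy token IDs for a batch of one sequence of length 128.
batch_size, seq_len = 1, 128
inputs = {
    "input_ids": np.ones((batch_size, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch_size, seq_len), dtype=np.int64),
}

# Passing None as the output list returns all model outputs.
outputs = session.run(None, inputs)
print(outputs[0].shape)
```

The same session object works unchanged across hardware: swapping the execution provider list is how ONNX Runtime is pointed at CPU or GPU, which is the portability the article describes.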

The latest ONNX Runtime, v1.1.1, is available on the project website.

ONNX Runtime
https://github.com/microsoft/onnxruntime