Intel's Neural Network Compression Framework (NNCF) is a toolkit for compressing and quantizing neural networks so that they can run more efficiently on resource-constrained hardware. Examples of NNCF quantization workflows are here.
Quantizing Zero-Shot Models
NNCF requires some calibration data so that the activations within the model can be profiled and sampled.
By default, when running in a notebook, NNCF displays calibration statistics in a widget, so it is best to install `ipywidgets` (`pip install ipywidgets`) before running.
Once the model has been quantized, we can use it via OpenVINO's compiled model functionality.
Server
OpenVINO provides an inference server (OpenVINO Model Server) whose API is compatible with TensorFlow Serving. It is distributed as a Docker image that can be configured to read models from a model repository.
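A sketch of launching the server from its Docker image; the local `models` directory, the `my_model` name, and the port are all example values you would replace with your own layout.

```shell
# Serve a model from a local model repository mounted into the container.
docker run -d --rm -p 9000:9000 \
  -v "$(pwd)/models:/models" \
  openvino/model_server:latest \
  --model_name my_model \
  --model_path /models/my_model \
  --port 9000
```

The model repository follows the usual versioned layout (e.g. `models/my_model/1/` containing the IR files), and clients can then connect over gRPC on the exposed port using TensorFlow Serving-compatible requests.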