NNCF

Intel's Neural Network Compression Framework (NNCF) is a toolkit for compressing and quantizing neural networks so that they run more efficiently on resource-constrained hardware. Examples of NNCF quantization workflows are available here.

Quantizing Zero-Shot Models

NNCF requires some calibration data so that the activations within the model can be profiled and sampled.

 
import nncf
import torch
import openvino as ov
import pandas as pd
import numpy as np
 
from transformers import AutoTokenizer, TensorType
from torch.utils.data import Dataset, DataLoader
from typing import Dict
 
 
tokenizer = AutoTokenizer.from_pretrained("MoritzLaurer/deberta-v3-large-zeroshot-v1.1-all-33")
 
model = ov.Core().read_model("testmodel/model_optimized.xml")
 
template = "This example is about {}"
 
# Read data from CSV file into Pandas dataframe
df = pd.read_csv('input_examples').sample(n=64)
 
# Define a custom dataset class
class MyDataset(Dataset):
    def __init__(self, df):
        self.df = df
    
    # Override the __getitem__ method to return one row as a dict
    def __getitem__(self, index):
        return self.df.iloc[index].to_dict()
    
    # Override the __len__ method to return the number of samples in the dataset
    def __len__(self):
        return len(self.df)
 
# Create a Dataset instance
my_dataset = MyDataset(df)
 
# Create a DataLoader to iterate over the dataset
data_loader = DataLoader(my_dataset, batch_size=1, shuffle=True)
 
def transform_fn(data_item):
    # The DataLoader (batch_size=1) yields each field as a list of length one,
    # so take the single text and label out before tokenizing
    text = data_item['text'][0]
    hypothesis = template.format(data_item['label'][0])

    tokens = tokenizer(
        text, hypothesis, return_tensors=TensorType.NUMPY, padding=True, truncation=True
    )

    input_keys = ['input_ids', 'attention_mask']
    return {k: v.tolist() for k, v in tokens.items() if k in input_keys}
 
 
 
calibration_dataset = nncf.Dataset(data_loader, transform_fn)
 
 
quantized_model = nncf.quantize(model, calibration_dataset, model_type=nncf.ModelType.TRANSFORMER)

By default, when running in a notebook, NNCF shows statistics-collection progress in a widget, so it is best to install ipywidgets before running.
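The quantized model can be written back to OpenVINO IR so the calibration step does not have to be repeated on every run. A minimal sketch, assuming a recent OpenVINO release (which provides ov.save_model) and an example output path:

# Persist the quantized model as IR; the .bin weights file is written
# alongside the .xml. The output path here is just an example.
ov.save_model(quantized_model, "testmodel/model_quantized.xml")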

Once the model has been quantized, we can run inference on it through OpenVINO's compiled model functionality:

 
cm = ov.compile_model(quantized_model)
 
text = "I love you very much"
 
labels = [template.format(y) for y in ('positive', 'neutral', 'negative')]
 
queries = [text] * len(labels)
 
inputs = tokenizer(queries, labels, return_tensors=TensorType.NUMPY, padding=True, truncation=True)
input_keys = ['input_ids', 'attention_mask']
X = {k: v.tolist() for k, v in inputs.items() if k in input_keys}
 
y = cm(X)['logits']
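The result is one row of NLI logits per candidate label. A small sketch of turning these into a predicted label, assuming the entailment class sits at index 0 of the logits (verify this against the source model's id2label mapping before relying on it):

# Score each candidate label by its entailment logit and apply a softmax
# across the candidates. Index 0 as the entailment class is an assumption.
entailment = y[:, 0]
scores = np.exp(entailment) / np.exp(entailment).sum()
labels_plain = ['positive', 'neutral', 'negative']
print(labels_plain[int(np.argmax(scores))], scores)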

Server

OpenVINO provides an inference server (OpenVINO Model Server) whose API is compatible with TensorFlow Serving. A Docker image is available which can be configured to read models from a model repository.
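A minimal sketch of arranging such a repository for the quantized model saved earlier, assuming the server's documented <model_name>/<version>/ directory convention and the example paths used above:

import shutil
from pathlib import Path

# OpenVINO Model Server expects models laid out as
# <repository>/<model_name>/<version>/<IR files>, e.g. models/zeroshot/1/
repo = Path("models/zeroshot/1")
repo.mkdir(parents=True, exist_ok=True)

for f in ("model_quantized.xml", "model_quantized.bin"):
    shutil.copy(Path("testmodel") / f, repo / f)

The server container can then be pointed at the models directory and queried through the TensorFlow Serving-compatible REST or gRPC APIs.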