NNCF
Intel's Neural Network Compression Framework (NNCF) is a toolkit for compressing and quantizing neural networks so that they can run more efficiently on resource-constrained hardware. Examples of NNCF quantization workflows are here.
Quantizing Zero-Shot Models
NNCF requires some calibration data so that the activations within the model can be profiled and sampled.
import nncf
import torch
import openvino as ov
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, TensorType
from torch.utils.data import Dataset, DataLoader
from typing import Dict
tokenizer = AutoTokenizer.from_pretrained("MoritzLaurer/deberta-v3-large-zeroshot-v1.1-all-33")
model = ov.Core().read_model("testmodel/model_optimized.xml")
template = "This example is about {}"
# Read data from CSV file into Pandas dataframe
df = pd.read_csv('input_examples').sample(n=64)
# Define a custom dataset class
class MyDataset(Dataset):
    def __init__(self, df):
        self.df = df

    # Override the __getitem__ method to return one row as a dict
    def __getitem__(self, index):
        return self.df.iloc[index].to_dict()

    # Override the __len__ method to return the number of samples in the dataset
    def __len__(self):
        return len(self.df)
# Create a Dataset instance
my_dataset = MyDataset(df)
# Create a DataLoader to iterate over the dataset
data_loader = DataLoader(my_dataset, batch_size=1, shuffle=True)
def transform_fn(data_item):
    # data_item from the DataLoader is a dict of lists (batch_size=1),
    # so pair each text with its formatted label when tokenizing
    tokens = tokenizer(
        data_item['text'],
        [template.format(label) for label in data_item['label']],
        return_tensors=TensorType.NUMPY, padding=True, truncation=True,
    )
    input_keys = ['input_ids', 'attention_mask']
    return {k: v.tolist() for k, v in tokens.items() if k in input_keys}
calibration_dataset = nncf.Dataset(data_loader, transform_fn)
quantized_model = nncf.quantize(model, calibration_dataset, model_type=nncf.ModelType.TRANSFORMER)
By default, when running in a notebook, NNCF reports progress statistics in a widget, so it is best to install ipywidgets (for example with pip install ipywidgets) before running.
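The quantized model can then be written back out as OpenVINO IR files for later use; a minimal sketch, assuming a hypothetical output path of testmodel/model_quantized.xml:
# Serialize the quantized model to IR (.xml/.bin); the output path is an assumption
ov.save_model(quantized_model, "testmodel/model_quantized.xml")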
Once the model has been quantized, we can use it via the compiled-model functionality:
cm = ov.compile_model(quantized_model)
text = "I love you very much"
labels = [template.format(y) for y in ('positive', 'neutral', 'negative')]
queries = [text] * len(labels)
inputs = tokenizer(queries, labels, return_tensors=TensorType.NUMPY, padding=True, truncation=True)
input_keys = ['input_ids', 'attention_mask']
X = {k: v.tolist() for k, v in inputs.items() if k in input_keys}
y = cm(X)['logits']
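How the logits are interpreted depends on the model's NLI head. As a rough sketch, assuming the entailment score is in column 0 (this should be verified against the model's id2label mapping), the best candidate label can be picked with a softmax over the per-label entailment scores:
# Assumption: column 0 holds the entailment logit; check id2label to confirm
entailment = y[:, 0]
probs = np.exp(entailment) / np.exp(entailment).sum()
print(dict(zip(['positive', 'neutral', 'negative'], probs)))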
Server
OpenVINO provides an inference server (OpenVINO Model Server) which is API-compatible with TensorFlow Serving. There is a Docker image which can be configured to read models from a model repository.
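Because the server speaks the TensorFlow Serving REST protocol, it can be queried with a plain HTTP client. A minimal sketch, reusing the tokenized inputs X from above and assuming the server is running locally on port 8000 with the model served under the hypothetical name zeroshot:
import requests

# Host, port, and model name are assumptions and depend on how the container was started
url = "http://localhost:8000/v1/models/zeroshot:predict"
payload = {"inputs": X}  # X already contains plain Python lists, so it is JSON-serializable
response = requests.post(url, json=payload)
print(response.json()["outputs"])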