Large Multimodal Models (LMMs) are essentially LLMs extended to accept other modalities of input, such as images or audio, alongside text. This class of models is growing quickly; the most widely used is probably GPT-4, but there are a number of other LMM offerings, both closed and open source, including LLaVA and Qwen-VL.
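To make this concrete, below is a minimal sketch of prompting an open-source LMM with an image and a text question. It assumes the Hugging Face `transformers` library and the `llava-hf/llava-1.5-7b-hf` checkpoint; the image URL is a placeholder.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Load an example image (placeholder URL).
url = "https://example.com/cat.png"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA 1.5 expects an <image> token where the image features are inserted.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

# The model conditions its text generation on both the image and the prompt.
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```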
Earlier multimodal models such as LayoutLM predate contrastive image-text encoders like CLIP and instead combine text embeddings with document layout information.
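For contrast, here is a brief sketch of what CLIP-style image-text encoding looks like, assuming the `openai/clip-vit-base-patch32` checkpoint from `transformers`; the image path and candidate labels are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("document.png")  # placeholder path
texts = ["an invoice", "a photo of a cat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# CLIP embeds images and texts into a shared space; logits_per_image
# holds the image-text similarity scores for each candidate label.
print(outputs.logits_per_image.softmax(dim=1))
```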
See more models tagged with LMM and multimodal.