An Introduction to Vision Language Models

AI applications nowadays generate not only text, but also images, audio, and video.

Vision language models use a similar transformer architecture to text-based models. We will look at the comparison first.

Here, an image patch is just a small square chunk of an image, and each chunk is made up of pixels.

As you know, each pixel holds RGB (Red, Green, and Blue) values.

Let's try to understand the journey from patch to embedding in a simple way.

As we said, a patch is just a square piece of an image. Consider a 4 × 4 patch, that is, a square of 4 × 4 pixels taken from the image.

Since each pixel has 3 values, a 4 × 4 patch has 4 × 4 × 3 = 48 numbers.
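
The count above can be checked with a minimal Python sketch. The 8 × 8 toy image, its made-up pixel values, the nested-list layout and the helper name extract_patch are all assumptions for illustration. A real pipeline would normally use an array library instead of plain lists.

```python
# A toy 8 x 8 RGB image stored as nested lists: image[row][col] = [R, G, B].
# The pixel values are made up purely for illustration.
HEIGHT, WIDTH, PATCH = 8, 8, 4

image = [
    [[(r * WIDTH + c) % 256, (r * 3) % 256, (c * 5) % 256] for c in range(WIDTH)]
    for r in range(HEIGHT)
]

def extract_patch(img, top, left, size):
    """Return the size x size square of pixels whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]

patch = extract_patch(image, 0, 0, PATCH)

# Count every number in the patch: rows x columns x 3 colour channels.
num_values = sum(len(pixel) for row in patch for pixel in row)
print(num_values)  # 4 * 4 * 3 = 48
```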

Next, the patch is flattened, i.e. we take the 3D block and write all of its numbers in a single list (row).

As an example, let's take a small 2 × 2 RGB patch. Each pixel has 3 values: Red, Green, and Blue (RGB).

So there are 4 pixels in the patch and 4 × 3 = 12 numbers.

Now a single patch looks like

[ pixel1, pixel2, pixel3, pixel4 ]

[ ( R1 , G1 , B1 ) , ( R2 , G2 , B2 ), ( R3 , G3 , B3 ), ( R4 , G4 , B4 ) ]

So after flattening, the above 3D block becomes

[ R1, G1, B1, R2, G2, B2, R3, G3, B3, R4, G4, B4 ].
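
The flattening step can be sketched in a few lines of Python. The example intensities and the helper name flatten_patch are assumptions for illustration. The row-by-row, pixel-by-pixel order matches the list shown above.

```python
# A 2 x 2 RGB patch as nested lists: patch[row][col] = [R, G, B].
# The numbers are arbitrary example intensities in the 0-255 range.
patch = [
    [[255, 0, 0], [0, 255, 0]],      # pixel1, pixel2
    [[0, 0, 255], [128, 128, 128]],  # pixel3, pixel4
]

def flatten_patch(p):
    """Write every channel value of every pixel into one flat list, row by row."""
    return [value for row in p for pixel in row for value in pixel]

flat = flatten_patch(patch)
print(flat)       # [255, 0, 0, 0, 255, 0, 0, 0, 255, 128, 128, 128]
print(len(flat))  # 12
```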

This vector contains just raw pixel intensities. The values don't carry any meaning yet.

Now we apply a neural transformation (a linear layer) to this vector. During training, the layer learns to respond to edges, textures, colors, and patterns, which is basically feature extraction.

This is where "learning" actually begins.

Once the linear layer's transformation is applied, the flattened vector becomes an embedding, called a patch embedding.

This patch embedding now behaves like a token.
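
To make the linear layer concrete, here is a minimal sketch in plain Python that computes y = W x + b for one flattened 2 × 2 patch. The embedding size of 4, the random weights, the scaling of pixel values to the 0-1 range and the helper name linear are assumptions for illustration. In a trained model the weights are learned, and the embedding is much larger.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

PATCH_VALUES = 12  # flattened 2 x 2 RGB patch
EMBED_DIM = 4      # hypothetical embedding size, chosen small for readability

# Flattened patch, scaled from 0-255 to 0-1.
flat_patch = [v / 255 for v in [255, 0, 0, 0, 255, 0, 0, 0, 255, 128, 128, 128]]

# Weights and bias of the linear layer. They are random here; in a real model
# they are learned during training.
weights = [[random.uniform(-0.1, 0.1) for _ in range(PATCH_VALUES)]
           for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

def linear(x, w, b):
    """Compute y = W x + b for one input vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(w, b)]

patch_embedding = linear(flat_patch, weights, bias)
print(len(patch_embedding))  # 4: one vector, used like a token embedding
```

Every patch of the image goes through the same layer, so an image cut into N patches yields N embeddings, much like a sentence yields a sequence of token embeddings.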

In a text-based LLM, we get the token ID from the vocabulary.
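
For contrast, the text side can be sketched as a lookup. The toy vocabulary, the embedding table values and the helper name token_to_embedding are hypothetical. The point is only that a text token is first mapped to an ID, which then selects a row of an embedding table, whereas a patch has no vocabulary ID and gets its embedding from the linear layer.

```python
# Hypothetical toy vocabulary: each known token maps to an integer ID.
vocab = {"a": 0, "cat": 1, "sat": 2, "<unk>": 3}

# Hypothetical embedding table: row i is the vector for token ID i.
embedding_table = [
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.4, 0.5],
    [0.0, 0.0],
]

def token_to_embedding(token):
    """Look up the token ID in the vocabulary, then fetch its embedding row."""
    token_id = vocab.get(token, vocab["<unk>"])
    return token_id, embedding_table[token_id]

print(token_to_embedding("cat"))  # (1, [0.7, -0.2])
```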
