New SOTA Image Captioning: ClipCap

We’ve seen AI generate images from other images using GANs. Then, there were models able to generate questionable images using text. In early 2021, DALL-E was published, beating all previous attempts to generate images from text input using CLIP, a model that links images with text as a guide. A very similar task called image captioning may sound really simple but is, in fact, just as complex. It is the ability of a machine to generate a natural description of an image.
It’s easy to simply tag the objects you see in the image but it is quite another challenge to understand what’s happening in a single 2-dimensional picture, and this new model does it extremely well!


References

►Read the full article: https://www.louisbouchard.ai/clipcap/
►Paper: Mokady, R., Hertz, A. and Bermano, A.H., 2021. ClipCap: CLIP Prefix for Image Captioning. https://arxiv.org/abs/2111.09734
►Code: https://github.com/rmokady/CLIP_prefix_caption
►Colab Demo: https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

We've seen AI generate images from other images using GANs. Then, there were models able to generate questionable images using text. In early 2021, DALL-E was published, beating all previous attempts to generate images from text input using CLIP, a model that links images with text as a guide. A very similar task called image captioning may sound really simple but is, in fact, just as complex. It is the ability of a machine to generate a natural description of an image. Indeed, it's almost as difficult, as the machine needs to understand both the image and the text it generates, just like in text-to-image synthesis.

It's easy to simply tag the objects you see in the image, which can be done with a regular classification model, but it's quite another challenge to understand what's happening in a single two-dimensional picture. Humans can do it quite easily since we can interpolate from our past experience; we can even put ourselves in the place of the person in the picture and quickly get what's going on. This is a whole other challenge for a machine that only sees pixels. Yet these researchers published an amazing new model that does this extremely well.

In order to publish such a great paper about image captioning, the researchers needed to run many, many experiments. Plus, their code is fully available on GitHub, which means it is reproducible. These are two of the strong points of this episode's sponsor, Weights & Biases. If you want to publish papers in big conferences or journals and do not want to be part of the 75% of researchers that do not share their code, I'd strongly suggest using Weights & Biases. It changed my life as a researcher and my work in my company. Weights & Biases will automatically track each run: the hyperparameters, the GitHub version, the hardware and OS used, the Python version, the packages installed, and the training script. Everything you need for your code to be reproducible, without you even trying. It just needs one line of code telling it what to track, once, and that's it. Please don't be like most researchers that keep their code a secret (I assume mostly because it is hardly reproducible) and try out Weights & Biases with the first link below.

As the researchers explicitly said, image captioning is a fundamental task in vision-language understanding, and I entirely agree. The results are fantastic, but what's even cooler is how it works.
So let's dive into the model and its inner workings a little. Before doing so, let's quickly review what image captioning is. Image captioning is where an algorithm predicts a textual description of a scene inside an image. Here it will be done by a machine, and in this case, a machine learning algorithm. This algorithm will only have access to the image as input and will need to output such a textual description of what is happening in the image.

In this case, the researchers used CLIP to achieve this task. If you are not familiar with how CLIP works or why it's so amazing, I'd strongly invite you to watch one of the many videos I made covering it. In short, CLIP links images to text by encoding both types of data into one similar representation where they can be compared. This is just like comparing movies with books using a short summary of the piece: given only such a summary, you can tell what it's about and compare the two, but you have no idea whether it came from a movie or a book. In this case, the movies are images and the books are text descriptions. CLIP creates its own summary of each, allowing a simple comparison between both pieces using a distance calculation between the two encodings.
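This comparison in the shared space boils down to cosine similarity between normalized embeddings. Here is a toy sketch with made-up three-dimensional vectors standing in for CLIP's real embeddings (which come from its trained image and text encoders and are much larger, e.g. 512-dimensional):

```python
import numpy as np

def cosine_similarity(a, b):
    # CLIP-style comparison: normalize both encodings, then take a dot product
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Toy stand-ins for CLIP embeddings; real ones come from CLIP's encoders
image_embedding = np.array([0.9, 0.1, 0.3])
caption_a = np.array([0.8, 0.2, 0.4])    # points in a similar direction: a fitting caption
caption_b = np.array([-0.5, 0.9, -0.1])  # points elsewhere: an unrelated caption

scores = {name: cosine_similarity(image_embedding, emb)
          for name, emb in [("caption_a", caption_a), ("caption_b", caption_b)]}
best = max(scores, key=scores.get)  # the caption whose encoding is closest to the image's
```

The caption whose summary lands closest to the image's summary wins the comparison, regardless of which modality each encoding came from.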
You can already see how CLIP seems perfect for this task, but it requires a bit more work to fit our needs. Here, CLIP will simply be used as a tool to compare text inputs with image inputs, so we still need to generate the text that could potentially describe the image.

Instead of comparing text to images using CLIP's encodings, they encode the image using CLIP's network and use the generated encoding as a way to guide a text generation process performed by another model. Such a task can be performed by any language model, like GPT-3, which could improve their results, but the researchers opted for its predecessor, GPT-2, a smaller and more intuitive version of the powerful OpenAI model. They are basically conditioning the text generation of GPT-2 using CLIP's encoding. So CLIP is already trained, and they also used a pre-trained version of GPT-2 that they further train using the CLIP encoding as a guide to orient the text generation. It's not that simple, since they still need to adapt the CLIP encoding into a representation that GPT-2 can understand.
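A minimal sketch of this adaptation step, using NumPy with made-up, untrained weights. The dimensions are assumptions matching the common base models (a 512-dimensional CLIP embedding, 768-dimensional GPT-2 word embeddings, a prefix of 10 vectors); ClipCap's actual trained mapping network is in the linked repository:

```python
import numpy as np

rng = np.random.default_rng(0)

CLIP_DIM, GPT_DIM, PREFIX_LEN = 512, 768, 10  # assumed sizes, not from the paper text

# A one-hidden-layer MLP mapper. In ClipCap these weights are learned during
# training; here they are random placeholders just to show the shapes involved.
W1 = rng.standard_normal((CLIP_DIM, 1024)) * 0.02
W2 = rng.standard_normal((1024, GPT_DIM * PREFIX_LEN)) * 0.02

def map_clip_to_prefix(clip_embedding):
    """Turn one CLIP image encoding into a sequence of 'prefix' vectors
    shaped like GPT-2 word embeddings."""
    hidden = np.tanh(clip_embedding @ W1)
    flat = hidden @ W2
    # Reshape the flat output into PREFIX_LEN pseudo-word embeddings
    return flat.reshape(PREFIX_LEN, GPT_DIM)

fake_clip_embedding = rng.standard_normal(CLIP_DIM)  # stand-in for a real CLIP encoding
prefix = map_clip_to_prefix(fake_clip_embedding)
# prefix now has shape (10, 768): ten vectors that can be prepended to the
# caption's token embeddings, conditioning GPT-2's generation on the image
```

The key idea is only the output shape: the image is translated into a handful of vectors that look, to GPT-2, like the embeddings of words it would normally read.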
But it isn't that complicated either. The model simply learns to transfer the CLIP encoding into multiple vectors with the same dimensions as a typical word embedding. This step of learning how to match CLIP's outputs to GPT-2's inputs is the only step taught during training, as both GPT-2 and CLIP are already trained and are powerful models for their respective tasks. So you can see this as a third model, called a mapping network, with the sole responsibility of translating one's language into the other, which is still a challenging task.

If you are curious about the actual architecture of such a mapping network, they tried both a simple multi-layer perceptron, or MLP, and a transformer architecture, confirming that the latter is more powerful for learning a meticulous set of embeddings that are more appropriate for the task when using powerful pre-trained language models. If you are not familiar with transformers, you should take five minutes to watch the video I made covering them, as you will only stumble upon this type of network more and more often in the near future.

The model is very simple and extremely powerful. Just imagine having CLIP merged with GPT-3 in such a way. We could use such a model to describe movies automatically or create better applications for blind and visually impaired people. That's extremely exciting for real-world applications.

Of course, this was just an overview of this new model, and you can find more detail about the implementation in the paper linked in the description below. I hope you enjoyed the video, and if so, please take a second to share it with a friend that could find this interesting. It will mean a lot and help this channel grow. Thank you for watching, and stay tuned for my next video, the last one of the year and quite an exciting one!
