What is Data-Centric AI?

What makes GPT-3 and DALL·E powerful is exactly the same thing: data.
Data is crucial in our field, and our models are extremely data-hungry. These large models, whether language models like GPT-3 or image models like DALL·E, all require the same thing: way too much data.
The more data you have, the better it is, so you need to scale up those models, especially for real-world applications.
But bigger models can use bigger datasets to improve only if the data is of high quality.
Feeding in images that do not represent the real world is of no use and can even worsen the model's ability to generalize. This is where data-centric AI comes into play…
Learn more in the video:

References

►Read the full article: https://www.louisbouchard.ai/data-centric-ai/
►Data-centric AI: https://snorkel.ai/data-centric-ai
►Weak supervision: https://snorkel.ai/weak-supervision/
►Programmatic labeling: https://snorkel.ai/programmatic-labeling/
►Curated list of resources for Data-centric AI: https://github.com/hazyresearch/data-centric-ai
►Learn more about Snorkel: https://snorkel.ai/company/
►From Model-centric to Data-centric AI – Andrew Ng: https://youtu.be/06-AZXmwHjo
►Software 2.0: https://hazyresearch.stanford.edu/blog/2020-02-28-software2
►Paper 1: Ratner, A.J., De Sa, C.M., Wu, S., Selsam, D. and Ré, C., 2016. Data programming: Creating large training sets, quickly. Advances in Neural Information Processing Systems, 29.
►Paper 2: Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S. and Ré, C., 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3), p. 269.
►Paper 3: Ré, C., 2018. Software 2.0 and Snorkel: Beyond hand-labeled data. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
Video Transcript
What makes GPT-3 and DALL·E powerful is exactly the same thing: data. Data is crucial in our field, and our models are extremely data-hungry. These large models, whether language models like GPT-3 or image models like DALL·E, all require the same thing: way too much data. Unfortunately, the more data you have, the better it is, so you need to scale up those models, especially for real-world applications. Bigger models can use bigger datasets to improve only if the data is of high quality. Feeding in images that do not represent the real world is of no use and can even worsen the model's ability to generalize. This is where data-centric AI comes into play.

Data-centric AI, also referred to as Software 2.0, is just a fancy way of saying that we optimize our data to maximize the model's performance, instead of the model-centric approach, where you tweak the model's parameters on a fixed dataset. Of course, both need to be done to get the best results possible, but data is by far the bigger player here. In this video, made in partnership with Snorkel, I will cover what data-centric AI is and review some big advancements in the field. You will quickly understand why data is so important in machine learning, which is Snorkel's mission. Taking a quote from their blog post linked below: "Teams will often spend time writing new models instead of understanding their problem and its expression in data more deeply. Writing a new model is a beautiful refuge to hide from the mess of understanding the real problems." This is what this video aims to combat. In one sentence, the goal of data-centric AI is to encode knowledge from our data into the model by maximizing the data's quality and the model's performance.

It all started in 2016 at Stanford with a paper called "Data Programming: Creating Large Training Sets, Quickly," introducing a paradigm for labeling training datasets programmatically rather than by hand.
This was an eternity ago in terms of AI research age. As you know, the best approaches to date use supervised learning, a process in which models train on data and labels and learn to reproduce the labels when given the data. For example, you'd feed a model many images of ducks and cats with their respective labels and ask the model to find out what is in the picture, then use backpropagation to train the model based on how well it succeeds. If you are unfamiliar with backpropagation, I invite you to pause the video to watch my one-minute explanation and return where you left off.

As datasets get bigger and bigger, it becomes increasingly difficult to curate them and remove hurtful data so that the model can focus only on relevant data. You don't want to train your model to detect a cat when it's a skunk; it could end badly. When I refer to data, keep in mind that it can be any sort of data: tabular, images, text, videos, etc. Now that you can easily download a model for any task, the shift to data improvement and optimization is inevitable. Model availability, the scale of recent datasets, and the data dependencies these models have are why such a paradigm for labeling training datasets programmatically becomes essential.
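The idea of labeling data programmatically can be sketched in a few lines. This is a toy illustration only, not Snorkel's actual API: the label constants, function names, and keyword heuristics below are all made up for the example.

```python
# Toy sketch of programmatic labeling: instead of hand-annotating every
# example, we write small heuristic "labeling functions" that each label the
# subset of the data they can cover and abstain on everything else.

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_contains_free(text):
    """Messages shouting about free stuff are likely spam."""
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_meeting(text):
    """Work-related vocabulary suggests a legitimate message."""
    return NOT_SPAM if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    """Excessive punctuation is a weak spam signal."""
    return SPAM if text.count("!") >= 3 else ABSTAIN

labeling_functions = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]

unlabeled = [
    "FREE entry!!! Win a prize now!!!",
    "Reminder: team meeting at 3pm",
    "Lunch tomorrow?",
]

# Each labeling function votes on each example; together the votes form a
# noisy, programmatically generated training set.
votes = [[lf(x) for lf in labeling_functions] for x in unlabeled]
for text, row in zip(unlabeled, votes):
    print(row, "<-", text)
```

Writing three such functions takes minutes, yet they can label thousands of examples at once, which is exactly why this paradigm scales where hand annotation does not.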
Now, the main problem comes with having labels for our data. It's easy to gather thousands of images of cats and dogs, but it's much harder to know which images have a dog and which have a cat, and even harder to have their exact locations in the image, for segmentation tasks, for example.
The first paper introduces a data programming framework where the user, either an ML engineer or a data scientist, expresses weak supervision strategies as labeling functions, using a generative model that labels subsets of the data. It found that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable. In short, they show how improving the data without much additional work, while keeping the model the same, improves results, which is a now evident but essential stepping stone. It's a really interesting foundational paper in this field and worth the read.
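To make the combination step concrete: the paper learns each labeling function's accuracy and correlations with a generative model, but the crudest possible stand-in, a majority vote over non-abstaining functions, already shows what "combining noisy labelers" means. The vote matrix and labels below are invented for illustration.

```python
# Minimal stand-in for the paper's generative model: combine the noisy votes
# of several labeling functions by majority vote, ignoring abstentions.
# Data programming instead *learns* how much to trust each function; this
# sketch only shows the combination step conceptually.

ABSTAIN = -1

def majority_vote(votes):
    """Return the most common non-abstain label, or ABSTAIN if none voted."""
    counts = {}
    for v in votes:
        if v != ABSTAIN:
            counts[v] = counts.get(v, 0) + 1
    if not counts:
        return ABSTAIN  # no labeling function covered this example
    return max(counts, key=counts.get)

# Rows: one example each; columns: one labeling function each (1=spam, 0=not).
vote_matrix = [
    [1, ABSTAIN, 1],        # two spam votes       -> labeled 1
    [ABSTAIN, 0, ABSTAIN],  # single non-spam vote -> labeled 0
    [ABSTAIN] * 3,          # nobody voted         -> left unlabeled
]

labels = [majority_vote(row) for row in vote_matrix]
print(labels)  # [1, 0, -1]
```

The gap between this sketch and the paper is the whole point of data programming: a learned model can discount a labeling function that is often wrong, where majority vote treats every function as equally reliable.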
The second paper we cover here is called "Snorkel: Rapid Training Data Creation with Weak Supervision." This paper, published a year later and also from Stanford University, presents a flexible interface layer for writing labeling functions based on experience. Continuing on the idea that training data is increasingly large and difficult to label, causing a bottleneck in models' performance, they introduce Snorkel, a system that implements the previous paper in an end-to-end system. It allowed knowledge experts, the people who best understand the data, to easily define labeling functions to automatically label data instead of doing hand annotation, building models up to 2.8 times faster while also increasing predictive performance by an average of 45.5 percent.

So again, instead of writing labels, the users, or knowledge experts, write labeling functions. These functions simply give the models insights on patterns to look for, or anything the expert would use to classify the data, helping the model follow the same process. Then the system applies the newly written labeling functions over our unlabeled data and learns a generative model to combine the output labels into probabilistic labels, which are then used to train our final deep neural network. Snorkel does all this by itself, facilitating this whole process for the first time.
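The "combine outputs into probabilistic labels" step can be sketched naive-Bayes style: each non-abstaining function contributes its log-odds to a score, weighted by how accurate it is. In Snorkel those accuracies are learned by the generative model; in this sketch they are simply assumed numbers, and the whole example is an illustration rather than Snorkel's real pipeline.

```python
import math

# Hedged sketch of turning labeling-function votes into probabilistic labels.
# Snorkel *learns* each function's accuracy; here the accuracies are assumed,
# and votes are combined naive-Bayes style under an independence assumption.

ABSTAIN = -1

def probabilistic_label(votes, accuracies):
    """P(y=1 | votes) for independent labelers with known accuracies."""
    log_odds = 0.0
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # an abstaining function contributes no evidence
        weight = math.log(acc / (1.0 - acc))
        log_odds += weight if vote == 1 else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))  # sigmoid

accuracies = [0.9, 0.7, 0.6]  # assumed, not learned, in this sketch

print(probabilistic_label([1, 1, ABSTAIN], accuracies))  # confident spam
print(probabilistic_label([1, 0, ABSTAIN], accuracies))  # conflict, leans spam
print(probabilistic_label([ABSTAIN] * 3, accuracies))    # no evidence -> 0.5

# These soft labels, rather than hard 0/1 labels, are what the final deep
# network is trained on, so its loss reflects how certain each label is.
```

Note how the conflicting example lands between 0.5 and the confident one: the more accurate function's vote dominates, which is the behavior a learned generative model provides automatically.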
Our last paper, also from Stanford and another year later, introduces Software 2.0. This one-page paper is once again pushing forward with the same deep learning, data-centric approach: using labeling functions to produce training labels for large unlabeled datasets and train our final model, which is particularly useful for huge internet-scraped datasets like the ones used in Google applications such as Google Ads, Gmail, YouTube, etc., tackling the lack of hand-labeled data.

Of course, this is just an overview of the progress and direction of data-centric AI, and I strongly invite you to read the information in the description below to get a complete view of data-centric AI, where it comes from, and where it's heading. I also want to thank Snorkel for sponsoring this video, and I invite you to check out their website for more information. If you haven't heard of Snorkel before, you've still already used their approach in many products like YouTube, Google Ads, Gmail, and other big applications.

Thank you for watching the video until the end.
