Today we're sharing a “multimodal” algorithm: NÜWA (Nuwa).

The paper opens with its results: NÜWA reaches state-of-the-art (SOTA) performance across 8 classic visual generation tasks.

According to the paper, NÜWA also “completely crushes” OpenAI's DALL-E at text-to-image generation.

The comparisons against other algorithms are crazy!

NÜWA results

Let's first take a look at how NÜWA performs on the eight classic visual generation tasks.


Text-To-Image (T2I)

The text-to-image task is to generate a picture that matches a text description.

For example:

A dog with goggles staring at the camera.

There are more results:

The images NÜWA generates don't look out of place at all. Judging by the results in the paper, they are very realistic!

The results are amazing.

Sketch-To-Image (S2I)

The sketch-to-image task is to generate a picture that follows the layout of a sketch.

For example:

Draw a rough outline in a picture, and the algorithm automatically “imagines” the rest.

This result is a real eye-opener. If it works in practice the way the paper shows, it is really strong.

This algorithm could be used in many interesting scenarios.

Image Completion (I2I)

Image completion: if part of a picture is missing, the algorithm can automatically “imagine” the missing part.

Good heavens, is this giving you some bold ideas?

The occlusion here is fairly mild, and there are more detailed examples too.

Even when the picture is damaged this badly, the algorithm can still “imagine” the rest. I'm looking forward to the code.

Image Manipulation (TI2I)

Image manipulation means editing a picture according to a text description.

For example:

Take a picture of a grassland, then add a description:

a horse is running on the grassland

The corresponding picture, with a horse running on the grassland, is then generated.

The level of understanding here is amazing.

This reminds me of the spoof works made by Photoshop masters.

With this algorithm, we can try it ourselves, ha ha.


And that's not all. Besides the four image generation tasks above, NÜWA can also generate video!

There are four corresponding video generation tasks:

  • Text-To-Video (T2V)
  • Sketch-To-Video (S2V)
  • Video Prediction (V2V)
  • Video Manipulation (TV2V)

You can play with both pictures and videos.

NÜWA principle

The overall architecture of the NÜWA model consists of an adaptive encoder that supports multiple kinds of conditions and a pre-trained decoder, which can make use of image and video information at the same time.

For the image completion, video prediction, image manipulation and video manipulation tasks, the partial input image or video can be fed directly to the decoder.

Both the encoder and the decoder are built on a 3D Nearby Self-Attention mechanism (3DNA), which takes locality along both the spatial and the temporal axes into account. The definition is as follows:

    Y = 3DNA(X, C; W)

W denotes the learnable weights, and X and C are both 3D representations of text, image or video data.
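To make the “3D representation” idea concrete, here is a minimal NumPy sketch. As far as I understand the paper, text, images and videos are all laid out on the same height × width × time grid of token vectors, with unused axes set to size 1. The sizes and variable names below are made up for illustration and are not taken from the NÜWA code.

```python
import numpy as np

d = 8  # token embedding size (hypothetical value)

# Assumed layout for every modality: (height, width, time, channels).
text = np.zeros((1, 1, 16, d))   # 16 text tokens: only the third axis is used
image = np.zeros((4, 4, 1, d))   # a 4x4 grid of image tokens: a single "frame"
video = np.zeros((4, 4, 10, d))  # 10 frames, each a 4x4 grid of tokens

for name, arr in [("text", text), ("image", image), ("video", video)]:
    print(name, arr.shape)
```

Because all three share one layout, the same attention module can in principle take any of them as X or as C.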

3DNA considers the full nearby information and dynamically generates a 3D nearby attention block for each token. The attention matrix also shows that the attended region (blue) of 3DNA is smoother than that of 3D block-sparse attention and 3D axial-sparse attention.
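The idea of a per-token nearby block can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not the authors' implementation: X and C are assumed to share the same grid, there is a single head, no causal masking, and the function names (`nearby_indices`, `nearby_attention_3d`) and the `extent` parameter are hypothetical.

```python
import numpy as np

def nearby_indices(pos, shape, extent):
    """Flat indices of the grid cells within `extent` of `pos` on each axis."""
    h, w, s = shape
    ranges = [
        range(max(0, p - e), min(n, p + e + 1))  # clip the block at the borders
        for p, n, e in zip(pos, shape, extent)
    ]
    return np.array([
        (i * w + j) * s + k  # C-order flat index into an (h, w, s) grid
        for i in ranges[0] for j in ranges[1] for k in ranges[2]
    ])

def nearby_attention_3d(x, c, wq, wk, wv, extent=(1, 1, 1)):
    """Each token of x attends only to the tokens of c in its 3D neighbourhood.

    x, c: arrays of shape (h, w, s, d) on the same grid (a simplification).
    """
    h, w, s, d = x.shape
    q = x.reshape(-1, d) @ wq
    k = c.reshape(-1, d) @ wk
    v = c.reshape(-1, d) @ wv
    out = np.zeros((q.shape[0], v.shape[1]))
    for flat in range(q.shape[0]):
        pos = np.unravel_index(flat, (h, w, s))
        idx = nearby_indices(pos, (h, w, s), extent)
        scores = k[idx] @ q[flat] / np.sqrt(q.shape[1])
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        out[flat] = weights @ v[idx]
    return out.reshape(h, w, s, -1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 3, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
y = nearby_attention_3d(x, x, wq, wk, wv)
print(y.shape)  # (4, 4, 3, 8)
```

Passing `c = x` gives self-attention; passing a different array of the same shape gives cross-attention. An interior token with `extent=(1, 1, 1)` sees a 3×3×3 block of 27 neighbours instead of the whole grid, which is where the savings over full attention come from.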

For more details, see the paper itself:

Paper address:

NÜWA code

NÜWA's code is not open source yet, but a GitHub repository has been set up.


The author says the code will be open-sourced soon:

The company has an open source approval process, and the code still has to be cleaned up, so you can star the repository first and be patient.

NÜWA, a multimodal pre-trained model jointly created by Microsoft Research Asia and Peking University, was unveiled at the first Microsoft summit.

Surely they won't leave us hanging on this one~


This has been a year of vigorous development for multimodal transformers. The papers at the various top conferences are full of multimodal transformers of all kinds.

Source: Jack Cui
