MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Generate an image from a text prompt using FLUX.1-dev while simultaneously producing dense predictions (saliency maps, segmentation maps, depth maps) from the frozen diffusion transformer's intermediate features via lightweight trained decoder heads.

Paper: MMDiff: Extending Diffusion Transformers for Multi-Modal Generation
Model: yagmurakarken/mmdiff

Task
Seed: 0 to 1000
Inference Steps: 4 to 50
Guidance Scale: 1 to 10
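
The three numeric ranges listed above appear to correspond, in order, to the Seed, Inference Steps, and Guidance Scale inputs named under Examples. Assuming that mapping holds, a request to the demo could be validated before it is submitted with a sketch like the one below. The class name `GenerationRequest`, the task identifiers in `TASKS`, and the helper `_check` are hypothetical names chosen for illustration; they are not part of the MMDiff code or of any library.

```python
from dataclasses import dataclass

# Assumed bounds, read off the demo's slider ranges; the source does not
# state which range belongs to which input, so this mapping is a guess.
SEED_RANGE = (0, 1000)
STEPS_RANGE = (4, 50)
GUIDANCE_RANGE = (1, 10)

# Hypothetical task identifiers, named after the three dense prediction
# types mentioned in the description (saliency, segmentation, depth).
TASKS = ("saliency", "segmentation", "depth")


def _check(name, value, bounds):
    """Raise ValueError if value lies outside the inclusive bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical container for one set of demo inputs."""
    prompt: str
    task: str
    seed: int
    inference_steps: int
    guidance_scale: float

    def __post_init__(self):
        # Validate every field once, at construction time.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.task not in TASKS:
            raise ValueError(f"task must be one of {TASKS}")
        _check("seed", self.seed, SEED_RANGE)
        _check("inference_steps", self.inference_steps, STEPS_RANGE)
        _check("guidance_scale", self.guidance_scale, GUIDANCE_RANGE)


if __name__ == "__main__":
    request = GenerationRequest(
        prompt="a red bicycle leaning against a brick wall",
        task="depth",
        seed=42,
        inference_steps=28,
        guidance_scale=3.5,
    )
    print(request)
```

Validating at construction time means an out-of-range value fails early with a readable message rather than reaching the model. The bounds are kept as module-level constants so they are easy to adjust if the actual demo limits differ.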
Examples
Text Prompt Task Seed Inference Steps Guidance Scale