MMDiff: Extending Diffusion Transformers for Multi-Modal Generation
Generate an image from a text prompt using FLUX.1-dev while simultaneously producing
dense predictions (saliency maps, segmentation maps, depth maps) from the frozen
diffusion transformer's intermediate features via lightweight trained decoder heads.
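
The idea above can be sketched in PyTorch as a minimal illustration, not the project's actual implementation. A small `nn.Sequential` stands in for the frozen FLUX.1-dev transformer, so no model weights are needed. The names `DenseHead` and `capture_features`, the choice of hooked layer, the head architecture, the 8x8 token grid, and the 5-class segmentation output are all hypothetical and chosen only for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseHead(nn.Module):
    """Hypothetical lightweight decoder: token features -> dense map."""

    def __init__(self, in_dim: int, out_channels: int, hidden: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(in_dim, hidden, kernel_size=1)
        self.out = nn.Conv2d(hidden, out_channels, kernel_size=3, padding=1)

    def forward(self, tokens, grid_hw, out_hw):
        b, n, c = tokens.shape
        h, w = grid_hw
        assert n == h * w, "token count must match the spatial grid"
        # (B, N, C) token sequence -> (B, C, H, W) feature map
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = F.gelu(self.proj(x))
        # Upsample to the requested output resolution before the final conv
        x = F.interpolate(x, size=out_hw, mode="bilinear", align_corners=False)
        return self.out(x)


def capture_features(backbone, block, *inputs):
    """Run the frozen backbone once and return the output of `block`."""
    captured = {}
    handle = block.register_forward_hook(
        lambda module, args, output: captured.update(feat=output)
    )
    try:
        with torch.no_grad():  # backbone is frozen, so no graph is needed
            backbone(*inputs)
    finally:
        handle.remove()
    return captured["feat"]


torch.manual_seed(0)
dim, grid = 32, (8, 8)  # assumed toy sizes, not FLUX.1-dev's real dimensions

# Stand-in for the diffusion transformer (assumption: NOT the real model)
backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
backbone.requires_grad_(False).eval()

tokens = torch.randn(1, grid[0] * grid[1], dim)
feats = capture_features(backbone, backbone[0], tokens)

# One trainable head per task; only these parameters would receive gradients
heads = {
    "saliency": DenseHead(dim, out_channels=1),
    "segmentation": DenseHead(dim, out_channels=5),  # 5 classes is arbitrary
    "depth": DenseHead(dim, out_channels=1),
}
preds = {name: head(feats, grid, (64, 64)) for name, head in heads.items()}
for name, pred in preds.items():
    print(name, tuple(pred.shape))
```

In a real setup the forward hook would presumably be registered on one or more blocks of the actual transformer during the denoising pass, and only the head parameters would be handed to the optimizer.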