How an Agent Built a 3D Paris Gallery by Chaining Two Hugging Face Spaces
An AI coding agent built a working interactive 3D web gallery of 6 Paris monuments by chaining two existing Hugging Face Spaces on its own, with no person writing the integration code. Hugging Face engineer Mishig Davaadorj published the walkthrough on June 9, 2026, showing the agent generate images with Ideogram 4, convert each into a 3D Gaussian splat with VAST-AI TripoSplat, and assemble a browser viewer. It worked because each Space ships a plain-text agents.md file telling any agent exactly how to call it.
Key Takeaways
- An AI agent assembled a full 3D gallery of Paris landmarks by stitching together two separate Hugging Face Spaces, so a person never wrote the code connecting them.
- The pipeline pairs Ideogram 4 for image generation with VAST-AI TripoSplat for turning a single photo into a 3D model, then builds a browser viewer around the results.
- The agent worked across both services because each Gradio Space exposes a plain-text agents.md file describing its API endpoints, file uploads, and sign-in token.
- The agent did the cleanup work too: it fixed upside-down output, auto-framed each monument, and compressed the 3D files to about a third of their size for faster loading.
- The same recipe was reused for galleries of Egyptian and Japanese monuments, each needing only a one-sentence prompt to produce 6 new 3D models and a viewer.
- The author frames the lesson as a building-block economy, where agents are strongest when gluing together proven parts rather than building everything from scratch.
Stats & Key Facts
- 6 Paris monuments were generated, converted to 3D, and assembled into a single gallery viewer.
- 2 separate Hugging Face Spaces were chained together with no hand-written integration code.
- About 3 times smaller: the size of each file after the .ply output was compressed into the .ksplat format.
- About 1 sentence of human instruction was enough to spin up each additional country gallery.
- 2 more galleries, Egypt and Japan, each produced 6 new monuments through the identical pipeline.
- 1 image per monument is all TripoSplat needs to build a 3D Gaussian splat.
An AI Agent Built a 3D Paris Gallery With No Hand-Written Glue Code
The walkthrough centers on a finished, working web app the agent produced end to end.
An AI coding agent assembled an interactive 3D web gallery of Paris monuments without a person writing the code that connects the underlying services. Hugging Face engineer Mishig Davaadorj published the walkthrough on June 9, 2026, and the result is hosted as a Space at mishig/monuments-de-paris.
The agent took plain text prompts and turned each one into a rotating 3D landmark you can spin in a browser. The set of 6 includes the Panthéon, the Opéra, the Arc de Triomphe, the Sacré-Cœur, and the Eiffel Tower, the last rendered as a small diorama on a plinth. Human input stayed at the level of taste, with notes such as "make it zoomed out," while the agent handled the building.
Ideogram 4 and TripoSplat: The Two Spaces the Agent Chained
The pipeline runs across two existing services, each doing one job.
- Ideogram 4 turns each text prompt into a clean image of a landmark, isolated on a dark background for easier 3D reconstruction.
- VAST-AI TripoSplat takes a single image and builds a 3D Gaussian splat from it, outputting a .ply file.
- The agent ran the 2 steps in sequence for all 6 monuments, feeding each generated image straight into the 3D step.
- Both services come from different teams, yet the agent chained them without any custom connector written by a person.
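The two-step chain described above can be sketched in a few lines of Python. This is an illustrative sketch, not the agent's actual code: `build_gallery`, `fake_generate_image`, and `fake_image_to_splat` are hypothetical names, and the two stand-in functions only rename files so the example runs offline. In a real pipeline they would call the Ideogram 4 and TripoSplat Spaces.

```python
from typing import Callable, Dict, List


def build_gallery(
    prompts: Dict[str, str],
    generate_image: Callable[[str], str],
    image_to_splat: Callable[[str], str],
) -> List[dict]:
    """Run the two steps in sequence for every monument."""
    results = []
    for name, prompt in prompts.items():
        image_path = generate_image(prompt)      # step 1: text prompt -> image
        splat_path = image_to_splat(image_path)  # step 2: image -> .ply splat
        results.append({"name": name, "image": image_path, "splat": splat_path})
    return results


# Hypothetical stand-ins so the sketch runs without network access.
def fake_generate_image(prompt: str) -> str:
    return prompt.lower().replace(" ", "_") + ".png"


def fake_image_to_splat(image_path: str) -> str:
    return image_path.rsplit(".", 1)[0] + ".ply"


gallery = build_gallery(
    {"Eiffel Tower": "Eiffel Tower diorama", "Arc de Triomphe": "Arc de Triomphe"},
    fake_generate_image,
    fake_image_to_splat,
)
```

Passing the two steps in as callables keeps the loop independent of either service, which mirrors why the same recipe can be pointed at a different list of monuments.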
The agents.md File That Lets Any Agent Operate a Space
The reason a generic agent knew how to drive both services is a single plain-text file.
Every Gradio Space on Hugging Face now exposes a plain-text file called agents.md. It tells an agent exactly how to use the Space, including the API schema, the endpoint to start a call, the template to poll for a result, how to upload file inputs, and the Bearer token sign-in through an HF_TOKEN.
Because that information sits in one readable file, the agent did not need a custom SDK or hand-written glue code. It read the instructions, found the right endpoints, and called each Space directly. The author argues agents will pick a documented service over a model they have to set up by hand, the same way open-source libraries win adoption when they are easy to call.
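To make the start-then-poll pattern concrete, here is a minimal Python sketch of what an agent might construct after reading such a file. Several details are assumptions rather than facts from the walkthrough: the `/gradio_api/call/<api_name>` path, the server-sent-event framing with a `complete` event, the Space URL, and the helper names are all illustrative, and each Space's own agents.md is the authority on the real values. The functions only build and parse data, so nothing is sent over the network.

```python
import json
from typing import Any, Dict, List, Optional


def build_call_request(
    space_url: str, api_name: str, inputs: List[Any], hf_token: str
) -> Dict[str, Any]:
    """Describe the HTTP request that starts a call (nothing is sent here)."""
    return {
        "method": "POST",
        # Assumed path pattern; read the real one from the Space's agents.md.
        "url": f"{space_url.rstrip('/')}/gradio_api/call/{api_name}",
        "headers": {
            "Authorization": f"Bearer {hf_token}",  # token taken from HF_TOKEN
            "Content-Type": "application/json",
        },
        "body": json.dumps({"data": inputs}),
    }


def build_poll_url(space_url: str, api_name: str, event_id: str) -> str:
    """Fill in the (assumed) polling template with the returned event id."""
    return f"{space_url.rstrip('/')}/gradio_api/call/{api_name}/{event_id}"


def parse_sse_result(stream_text: str) -> Optional[Any]:
    """Return the JSON payload of the first 'complete' event, else None."""
    event = None
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event == "complete":
            return json.loads(line[len("data:"):].strip())
    return None


request = build_call_request(
    "https://example-space.hf.space/", "predict", ["a landmark"], "hf_placeholder"
)
```

An agent would send `request` with any HTTP client, read the event id from the response, then fetch `build_poll_url(...)` and hand the response text to `parse_sse_result`.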
Gaussian Splats, .ply Files, and .ksplat Compression Explained
A few technical terms carry the story, so here is what they mean in plain language.
- A 3D Gaussian splat is a modern way to store a 3D scene as a cloud of colored points rather than as solid surfaces.
- TripoSplat outputs each splat as a .ply file, a standard format for this kind of point data.
- The agent compressed each .ply into the .ksplat format, which is about 3 times smaller, so the gallery loads faster in a browser.
- Smaller files matter for web apps because visitors load the 3D scene over the internet rather than from their own machine.
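A quick back-of-the-envelope calculation shows why a roughly 3x reduction matters for load time. The 30 MB file size and 20 Mbit/s connection below are made-up illustrative numbers, not figures from the walkthrough, and the helper names are hypothetical.

```python
def compressed_size_mb(ply_mb: float, ratio: float = 3.0) -> float:
    """Estimate the .ksplat size given the .ply size and a compression ratio."""
    return ply_mb / ratio


def download_seconds(size_mb: float, mbit_per_s: float) -> float:
    """Time to download size_mb megabytes on a link of mbit_per_s megabits/s."""
    return size_mb * 8 / mbit_per_s


ply_mb = 30.0                               # hypothetical raw .ply size
ksplat_mb = compressed_size_mb(ply_mb)      # 10.0 MB at the ~3x ratio
before = download_seconds(ply_mb, 20.0)     # 12.0 seconds
after = download_seconds(ksplat_mb, 20.0)   # 4.0 seconds
```

Under those assumptions, each monument loads in about a third of the time, and the saving is repeated for every model in the gallery.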
Cleanup the Agent Handled on Its Own
Beyond the two API calls, the agent did the unglamorous fix-up work itself.
- It detected that TripoSplat output came out upside down and corrected the orientation.
- It auto-framed each monument so every model sits centered and at a sensible zoom.
- It compressed the files to the lighter .ksplat format for quicker loading.
- It built a Three.js viewer with drag-to-rotate and scroll-to-switch navigation, then deployed the finished Space as a static app.
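The walkthrough does not say how the agent implemented the orientation fix or the auto-framing, so the following Python sketch shows one common approach under stated assumptions: flipping is done as a 180-degree rotation about the X axis, and framing places the camera far enough back that a bounding sphere fits inside the field of view. The function names, the 50-degree default field of view, and the 1.2 margin are all hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def flip_upright(points: List[Point]) -> List[Point]:
    """Rotate 180 degrees about the X axis: (x, y, z) -> (x, -y, -z)."""
    return [(x, -y, -z) for x, y, z in points]


def frame(
    points: List[Point], fov_degrees: float = 50.0, margin: float = 1.2
) -> Tuple[Point, float]:
    """Return the bounding-box center and a camera distance that fits the model."""
    xs, ys, zs = zip(*points)
    center = tuple((min(axis) + max(axis)) / 2 for axis in (xs, ys, zs))
    # Radius of a sphere around the center that contains every point.
    radius = max(math.dist(p, center) for p in points)
    # The sphere fits the view when distance >= radius / sin(fov / 2).
    distance = margin * radius / math.sin(math.radians(fov_degrees) / 2)
    return center, distance


upright = flip_upright([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)])
center, distance = frame(upright, fov_degrees=60.0, margin=1.0)
```

A viewer would then point the camera at `center` from `distance` away, so every monument appears centered at a comparable zoom regardless of its raw scale.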
Reusing the Recipe for Egypt and Japan Galleries
Once the pipeline existed, making new versions took almost no effort.
The author then asked for galleries of Egyptian and Japanese landmarks. Each one needed only about a single sentence of instruction to produce 6 new monuments, their 3D splats, and a working viewer through the identical pipeline.
The Egypt set covers the Great Pyramid, the Sphinx, Abu Simbel, the Mask of Tutankhamun, Karnak, and the Colossi of Memnon. The Japan set covers Tokyo Tower, Himeji Castle, Kinkaku-ji, Osaka Castle, the Great Buddha of Kamakura, and the Itsukushima torii gate. The repeatability is the point: the hard part was building the first pipeline, and after that each new gallery was a prompt.
The Building-Block Economy for Multimedia Apps
The author ties the demo to a broader shift in how AI software gets made.
The framing is a building-block economy, where AI is passable at building everything from scratch but strong at gluing together proven pieces. The same dynamic long applied to code libraries, and the author argues it now extends to multimedia, since the agent reuses 2 ready-made services rather than writing model code.
The practical effect is that the cost of a new multimedia app drops toward the cost of describing it. Turning a prompt into a rotating 3D monument used to be a project on its own. With standardized, documented Spaces, the author notes, it became a single step in a pipeline.
Frequently Asked Questions
What did the AI agent actually build?
It built an interactive 3D web gallery of 6 Paris monuments that visitors can spin and switch through in a browser. The agent generated the images, converted them to 3D models, and assembled the viewer itself.
What is the agents.md file and why does it matter?
It is a plain-text file every Gradio Space on Hugging Face exposes that tells an AI agent how to call the Space, including its API endpoints, file uploads, and sign-in token. It lets a generic agent operate the service without any custom SDK or hand-written integration code.
What are the two Spaces the agent chained together?
Ideogram 4 turns a text prompt into a clean image of a landmark, and VAST-AI TripoSplat turns a single image into a 3D Gaussian splat. The agent ran them in sequence for each of the 6 monuments.
What is a Gaussian splat and what is the .ksplat format?
A Gaussian splat stores a 3D scene as a cloud of colored points rather than solid surfaces. The agent compressed the raw .ply output into the .ksplat format, which is about 3 times smaller so the gallery loads faster online.
Was the same approach reused for other galleries?
Yes. The author produced galleries of Egyptian and Japanese monuments, each requiring only about a single-sentence prompt to generate 6 new monuments, their 3D splats, and a viewer through the identical pipeline.
The demo shows agents at their strongest when they glue together proven, well-documented services instead of building from scratch. With standards like agents.md in place, producing a new multimedia app moves closer to describing it in a sentence.