🇪🇺Tech.eu
June 29, 2026
General AI

Robotics has a data problem. Macrodata Labs wants to solve it

Overview

The AI industry has spent the past several years learning a critical lesson: better data often matters as much as better models. While advances in large language models have been powered by increasing... After helping build some of the world's most widely used open AI datasets at Hugging Face, Guilherme Penedo and Hynek Kydlíček launched Macrodata Labs to bring the same data-first approach to robotics.

Key Takeaways

  • Robotics has a data problem, and Macrodata Labs wants to solve it.

    After helping build some of the world's most widely used open AI datasets at Hugging Face, Guilherme Penedo and Hynek Kydlíček launched Macrodata Labs to bring the same data-first approach to robotics.

  • Macrodata Labs recently emerged from stealth, launching Refiner, an open-source framework and cloud platform for processing robotics datasets.

    The company raised $4 million in pre-seed funding in June this year to build infrastructure for the robotics data loop.

  • From building LLM datasets to building robotics infrastructure: Macrodata Labs was founded by Guilherme Penedo and Hynek Kydlíček, who formed the core team behind several of Hugging Face's largest open LLM dataset efforts.

    They created widely used datasets such as FineWeb, FineWeb2, FinePDFs, and FineTranslations, which have been used by teams at NVIDIA, Google, AI2, and Z.

  • We worked together on projects such as FineWeb, which processes large portions of the internet and turns the data into high-quality training datasets.

    FineWeb became one of the most widely used open datasets for language model training, and we later expanded that work into other areas, including PDFs and multilingual datasets.

  • Physical-world data is larger, messier, more fragmented, and far more difficult to transform into useful training datasets than text.

Stats & Key Facts

  • The company raised $4 million in pre-seed funding in June this year to build infrastructure for the robotics data loop.

By Cate Lawrence

The AI industry has spent the past several years learning a critical lesson: better data often matters as much as better models. While advances in large language models have been powered by increasingly sophisticated datasets and data pipelines, robotics has yet to undergo the same transformation. Robotics teams are working with vast quantities of video, sensor data, and demonstrations, but much of the infrastructure needed to process, annotate, and improve that data remains immature.

Macrodata Labs believes that closing that gap could become one of the most important challenges in robotics AI. Macrodata Labs recently emerged from stealth, launching Refiner, an open-source framework and cloud platform for processing robotics datasets. The company raised $4 million in pre-seed funding in June this year to build infrastructure for the robotics data loop.

For more details please read the original article at Tech.eu.

Originally published by Tech.eu