Busting the big tech — big data narratives

DeepTech startups' AI strategy: data-centric or model-centric?

Or why you most probably don’t need a lot of data to go to market

Neurons Lab


Illustration from https://unsplash.com/photos/RlOAwXt2fEA

“During the gold rush it’s a good time to be in the pick and shovel business” © Mark Twain

This quote describes the current state of affairs in the AI world very well. The data collected and digitized over decades has finally found its shovels: computing devices (NVIDIA GPUs, etc.), rented computational power (AWS, Azure, GCP, etc.), and proven-to-work algorithms wrapped in comfortable APIs (from TensorFlow / PyTorch to managed solutions on AWS). The dominant thinking paradigm in the market today is so-called “data-centric AI”, popularized by one of the main players and educators in the field, Andrew Ng.

We can mostly agree that building a data pipeline is the only way to go for digital-first products, but what about deep tech, where data comes from the physical world? Medical devices and IoT sensors in the field and in the lab don’t give us the courtesy of a “big data” game: experiments are slow and expensive, measurement errors are common, and physical constraints apply. This can be a blocker for the “data-centric AI” paradigm, but it shouldn’t be. In this article, we review alternative strategies suited to a fast-paced startup environment and to physical-world applications.

The reality of deep tech startups

The difference between technology types and the place of deep tech among them, illustration from https://medium.com/bcg-digital-ventures/the-right-time-for-deep-tech-dcb317fc3636

I’ve discussed the difference between digital-first and physical-first AI companies in more detail in my previous blog posts. As a short summary, in deep tech the main differences (and challenges) fall into the following areas:

  • Business model. Early-stage innovations in energy, health tech, and life sciences aim not at linear improvement of an existing process, but at more complex changes with potentially exponential growth. This is also what VCs and deep tech investors are hunting for.
  • Technology and data. Since the original technology is at the core, we mostly have limited data from lab or field experiments. The data is physical-first (coming from medical devices, sensors, laboratories), slow and expensive to obtain, and often affected by measurement errors.
  • People and skills. Deep tech companies are often founded and led by scientists and domain experts. This shifts the initial focus from business processes and customer patterns to the technology and the technological process, which also requires a different AI strategy.

And of course, we still operate in a startup environment that pushes for competition, speed, iterative releases, and product-market fit. When we try to map the “data-centric AI” paradigm onto such a deep tech environment, we see multiple mismatches: standard AI solutions are good for linear improvements and predictions within a process but cannot aim for non-linear, novel outcomes; in the absence of huge datasets they overfit to small data and generalize badly to novel situations. Last but not least, they don’t take into account the constraints of the physical world (chemical properties of molecules, measurement errors of physical devices, etc.). See previous articles on the topic for more details and comparisons.

The small-data, model-first AI approach

The good news is that we did go to the Moon without any sophisticated AI or cloud big data solutions. We relied on centuries of accumulated knowledge across the sciences and on extracted laws of the universe. Why don’t we do the same in our startups? Need automated, intelligent, data-driven decision-making? Let’s do it the classic, “model-first” way.

Rule-based heuristics

Sleep patterns from wearable devices that can be exploited in rule-based systems; illustration from http://www.iet.unipi.it/m.cimino/pub/hdl.handle.net_11568_872373.pdf

Deep tech startups are born in labs, in the brains and hands of the brightest experts in the industry. Observing data for 10+ hours per day builds mental models of its patterns in their heads. Why don’t we rely on those patterns when building our MVP? And if we don’t have such patterns yet, why not rely on work already published by other experts? For example, to track different heart conditions from wearables, we don’t need any data or ML to start: we can rely on known vital-sign patterns (as shown above), implement them as rules in a product, and test it in the field to get feedback, as in the sketch below.
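To make this concrete, here is a minimal sketch of such a rule-based screener. The thresholds and flag names are illustrative assumptions rather than clinical criteria; a real product would encode validated guidelines.

```python
# A minimal sketch of a rule-based vital-signs screener.
# Thresholds are illustrative assumptions, not medical advice.

from dataclasses import dataclass
from typing import List


@dataclass
class VitalSample:
    heart_rate_bpm: float   # resting heart rate from a wearable PPG/ECG sensor
    spo2_percent: float     # blood oxygen saturation


def screen(sample: VitalSample) -> List[str]:
    """Return the flags raised by simple, expert-defined rules."""
    flags = []
    if sample.heart_rate_bpm > 100:
        flags.append("possible tachycardia (resting HR > 100 bpm)")
    if sample.heart_rate_bpm < 50:
        flags.append("possible bradycardia (resting HR < 50 bpm)")
    if sample.spo2_percent < 92:
        flags.append("low blood oxygen saturation (< 92%)")
    return flags


if __name__ == "__main__":
    print(screen(VitalSample(heart_rate_bpm=110, spo2_percent=96)))
    # ['possible tachycardia (resting HR > 100 bpm)']
```

The whole “model” fits in a dozen lines, needs no training data, and can be shipped and validated in the field immediately; the feedback it collects later becomes the seed of a proper dataset.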

Mathematical modeling

PV model circuit and coefficients of the mathematical model, illustration from https://www.researchgate.net/profile/Sayed-Zaki/publication/348518366_Deep-learning-based_method_for_faults_classification_of_PV_system/links/6005985245851553a0522f12/Deep-learning-based-method-for-faults-classification-of-PV-system.pdf

A slightly more rigorous approach is to leverage published papers that applied classic mathematical modeling in the past. This can include differential equations, statistical models, and geometric patterns that researchers have found and published as results of their work. We can re-use their published formulas and coefficients, check how they behave on our limited datasets, and calibrate the parameters if needed, as in the sketch below.
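For instance, here is a minimal sketch of calibrating the coefficients of a published formula against a handful of lab measurements, using a simplified single-diode photovoltaic equation in the spirit of the illustration above. The measurement values, starting coefficients, and bounds are invented for illustration.

```python
# A minimal sketch of calibrating a published model on a small lab dataset.
# Model: simplified single-diode PV cell, I = I_ph - I_0 * (exp(V / (n * V_T)) - 1),
# with series/shunt resistance ignored for brevity. The data below is made up.

import numpy as np
from scipy.optimize import curve_fit

V_T = 0.02585  # thermal voltage at ~300 K, in volts


def pv_current(v, i_ph, i_0, n):
    """Cell current as a function of voltage for the simplified diode model."""
    return i_ph - i_0 * (np.exp(v / (n * V_T)) - 1.0)


# A handful of (voltage [V], current [A]) measurements from the lab (illustrative).
v_meas = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
i_meas = np.array([5.01, 5.00, 5.00, 4.98, 4.86, 2.12])

# Start from coefficients reported in the literature and calibrate them to our data.
p0 = [5.0, 1e-7, 1.2]  # photocurrent I_ph, saturation current I_0, ideality factor n
bounds = ([0.0, 1e-12, 1.0], [10.0, 1e-2, 2.0])
(i_ph, i_0, n), _ = curve_fit(pv_current, v_meas, i_meas, p0=p0, bounds=bounds)

print(f"calibrated: I_ph = {i_ph:.2f} A, I_0 = {i_0:.1e} A, n = {n:.2f}")
```

A few dozen measurements and a published equation are often enough to get a usable predictive model, with no big data pipeline in sight.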

Active learning / Reinforcement learning

Illustration from https://www.sciencedirect.com/science/article/abs/pii/S2352152X21001158

To learn from data efficiently, you need to collect as many different scenarios and their outcomes as possible in order to predict them. What if that is not feasible, and the environment changes over time, so that collecting data doesn’t even make much sense? In this case, we can learn on the fly, without waiting for data to be collected and become irrelevant over time. For example, for the energy management of hybrid batteries in electric vehicles, such an approach is used to minimize energy loss and to increase both the electrical and thermal safety of the whole system. A toy version of such an online learning loop is sketched below.
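Here is a minimal sketch of learning on the fly with tabular Q-learning. The toy environment (a discretized battery state of charge, two power sources, and a made-up reward for avoiding energy loss and deep discharge) is purely illustrative and is not the method from the cited paper.

```python
# A minimal sketch of online learning with tabular Q-learning on a toy
# battery-management environment. Dynamics and rewards are invented.

import random

N_SOC_LEVELS = 10            # discretized battery state of charge: 0..9
ACTIONS = [0, 1]             # 0 = draw from battery, 1 = draw from engine/grid
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = {(s, a): 0.0 for s in range(N_SOC_LEVELS) for a in ACTIONS}


def step(soc, action):
    """Toy dynamics: the battery drains when used and slowly recharges otherwise."""
    if action == 0:
        next_soc = max(soc - 1, 0)
        reward = 1.0 if soc > 2 else -5.0   # deep discharge is penalized
    else:
        next_soc = min(soc + 1, N_SOC_LEVELS - 1)
        reward = -0.5                        # engine/grid energy is costly
    return next_soc, reward


soc = N_SOC_LEVELS - 1
for t in range(10_000):
    # epsilon-greedy action selection
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(soc, a)])
    next_soc, reward = step(soc, action)
    # Q-learning update from a single online interaction, no stored dataset
    best_next = max(Q[(next_soc, a)] for a in ACTIONS)
    Q[(soc, action)] += ALPHA * (reward + GAMMA * best_next - Q[(soc, action)])
    soc = next_soc

# learned policy: which power source to prefer at each state of charge
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_SOC_LEVELS)})
```

The agent never sees a historical dataset: it improves from each interaction, which is exactly what we want when the environment drifts faster than we could collect and label data.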

Open-source data and pre-trained models

Illustration from https://hackingmaterials.lbl.gov/matminer/

Last but not least, we may be lucky enough to leverage open datasets and even released models trained on them. In “digital” machine learning, thanks to million-scale open-source datasets for vision and language, we can already leverage open-source data and models for numerous tasks. The same is happening with the physical world as well. For example, open databases like the Materials Project and Citrination already fuel novel findings in optimal battery design; a sketch of such a workflow is shown below.
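As an illustration, here is a minimal sketch of bootstrapping a model from open materials data with the matminer toolkit shown above. The dataset name (“expt_gap”) and its column names (“formula”, “gap expt”) are my best-guess assumptions based on the matminer documentation, so treat them as placeholders.

```python
# A minimal sketch of starting from open data instead of collecting our own,
# using matminer (https://hackingmaterials.lbl.gov/matminer/).
# Dataset and column names are assumptions; adjust to the actual dataset.

from matminer.datasets import load_dataset
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Open dataset of experimentally measured band gaps (downloaded on first use).
df = load_dataset("expt_gap")

# Turn the chemical formula string into composition-based descriptors.
df = StrToComposition().featurize_dataframe(df, "formula", ignore_errors=True)
featurizer = ElementProperty.from_preset("magpie")
df = featurizer.featurize_dataframe(df, "composition", ignore_errors=True)
df = df.dropna()

X = df.drop(columns=["formula", "composition", "gap expt"])
y = df["gap expt"]

scores = cross_val_score(RandomForestRegressor(n_estimators=100), X, y,
                         cv=5, scoring="neg_mean_absolute_error")
print(f"cross-validated MAE: {-scores.mean():.2f} eV")
```

No in-house data collection is needed to get a first baseline; proprietary measurements can be added later to fine-tune where the open data falls short.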

Of course, all the above-described methods can be combined. For example, in remote patient monitoring startups we can use pre-trained models for facial tracking, apply mathematical models to extract vital signs, and re-use published statistical models to predict conditions, providing additional value to patients and engaging them more in the treatment process. A rough sketch of such a combined pipeline follows.
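Here is a rough sketch of that combination: a pre-trained face detector (an OpenCV Haar cascade), a simple mathematical model of remote photoplethysmography (mean green-channel signal plus an FFT), and a rule on the resulting heart-rate estimate. The camera index, window length, and threshold are illustrative assumptions, and real remote-PPG systems are considerably more involved.

```python
# A rough sketch: pre-trained face detector + simple rPPG math + a rule.
# Window sizes, thresholds and the camera index are illustrative assumptions.

import cv2
import numpy as np

FPS = 30                      # assumed camera frame rate
WINDOW_SEC = 10               # signal window used for the frequency analysis

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def green_signal_from_camera(n_frames):
    """Collect the mean green value over the detected face region per frame."""
    cap = cv2.VideoCapture(0)
    values = []
    while len(values) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.3, 5)
        if len(faces):
            x, y, w, h = faces[0]
            values.append(frame[y:y + h, x:x + w, 1].mean())  # green channel
    cap.release()
    return np.array(values)


def estimate_heart_rate(signal, fps=FPS):
    """Pick the dominant frequency in the plausible heart-rate band (0.7-4 Hz)."""
    signal = signal - signal.mean()
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    power = np.abs(np.fft.rfft(signal)) ** 2
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return 60.0 * freqs[band][np.argmax(power[band])]  # beats per minute


if __name__ == "__main__":
    sig = green_signal_from_camera(FPS * WINDOW_SEC)
    if len(sig) < FPS * 2:
        raise SystemExit("not enough frames captured from the camera")
    bpm = estimate_heart_rate(sig)
    print(f"estimated heart rate: {bpm:.0f} bpm",
          "(elevated)" if bpm > 100 else "")
```

None of the three pieces required collecting a labeled dataset; each reuses either a pre-trained model or published domain knowledge.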

So I don’t need big data at all then?

A unified data infrastructure architecture, illustration from https://future.a16z.com/emerging-architectures-modern-data-infrastructure/. Looks fundamental, but the question is — do you really need to start with that?

I didn’t say that. Small-data approaches have their own limitations and might not work as accurately or as cost-efficiently as their big brothers. At Neurons Lab we see an evolution in our customers’ products: fast iterations and the search for product-market fit are easier with small-data approaches; however, once the fit is found, at least some elements of a big data pipeline become necessary. When a market pattern is found, it’s smart to capitalize on it: build data pipelines, train bigger models to capture more varied patterns, and scale the resulting solutions to more customers with efficient DevOps. All these pieces are the shovels of the gold rush, but before digging for the gold, it has to be found first ;)

Conclusions

In this article, I’ve described some arguments in favor of model-centric AI as the basis of the data and modeling strategy for deep tech startups in search of their product-market fit. This partly holds true for bigger organizations as well, which have already found their market pattern and collected much more target data for modeling. However, I’d still recommend looking into the domain-aware machine learning paradigm, since standard statistical models, however complex, aren’t able to learn physical and chemical laws, phase changes, or quantum uncertainty unless you “help” them a bit by injecting some domain knowledge by design.

This is a methodology we widely employ with deep tech startups at Neurons Lab. You can reach out to us at info@neurons-lab.com, or contact Alex, the author of this article and our Head of AI/ML and Partner, directly at alex.honchar@neurons-lab.com.

