The blogosphere is filled with posts explaining what data science is not, but very few explain what it is. More surprisingly, most of those articles focus on modeling approaches (AI being the latest darling) or technology (Big Data is a favorite). In this piece, I’ll try my best to provide a clear explanation of what data science is and why it is so important.
Data science is about solving problems
The goal of data science is to make evidence-based decisions using our most powerful tool: the scientific method, which combines data and theory to generate evidence. This is a fundamental point, and the reason why I prefer the term evidence-based to data-driven: a data-driven approach does not require theory to turn data into evidence. It can therefore end up identifying spurious relationships in the data that cannot be explained by any theory, existing or created for that purpose.
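To make the risk of spurious relationships concrete, here is a minimal sketch on purely synthetic data. It generates a random target and many random candidate series that are unrelated to it by construction, then reports the strongest correlation found. The function names, the sample size and the number of candidates are my own illustrative choices, not a standard recipe.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_samples=20, n_candidates=1000, seed=0):
    """Search pure noise for the candidate most correlated with a noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent of the target by construction.
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


best_r = strongest_spurious_correlation()
print(f"Strongest correlation found in pure noise: {best_r:.2f}")
```

With few observations and many candidates, a search like this will typically surface a correlation that looks strong even though no relationship exists. Only theory can tell us that it should be discarded.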
This process requires a good understanding of both the functional domain and modeling concepts. The result of this detective work is a formal representation of the problem at hand and its solution. This part needs to take into account what data is required and what is available. If the gap between the two is too large, we might have to wait until enough new and relevant data is collected.
Modeling techniques and technology are tools
Once this first step is completed, we must choose the modeling technique that works best on our data. Most people forget that modeling techniques usually rest on underlying assumptions, for example about how the data is distributed. These assumptions need to be validated.
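As an illustration of what validating an assumption can look like, the sketch below runs a rough symmetry check based on sample skewness, on synthetic data. It assumes the technique under consideration expects roughly symmetric, normal-like data. The helper names and the 0.5 tolerance are hypothetical choices of mine, and a formal statistical test would be preferable in real work.

```python
import random


def skewness(values):
    """Sample skewness: close to 0 for symmetric data such as a normal sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def looks_symmetric(values, tolerance=0.5):
    """Rough check only; the tolerance is an arbitrary illustrative threshold."""
    return abs(skewness(values)) < tolerance


rng = random.Random(42)
normal_like = [rng.gauss(0, 1) for _ in range(5000)]    # symmetric by construction
skewed = [rng.expovariate(1.0) for _ in range(5000)]    # long right tail

print("normal-like sample looks symmetric:", looks_symmetric(normal_like))
print("exponential sample looks symmetric:", looks_symmetric(skewed))
```

A check like this does not prove that an assumption holds. It only flags cases where it is clearly violated and another technique, or a transformation of the data, should be considered.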
Most modeling problems result in a forecast (a point estimate). Here, a key question is whether the forecast makes sense or not. A good example is weather forecasting: as a dynamic system, weather is more or less chaotic over time, which results in shorter or longer forecasting horizons, beyond which any forecast is useless. Estimating uncertainty, in addition to a point estimate, is key to assessing how meaningful a forecast is.
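The idea of a forecasting horizon can be sketched with a toy ensemble forecast. I use the logistic map, a textbook chaotic system, as a stand-in for weather. That choice is my assumption, as are the parameter values and function names. Each ensemble member starts from a slightly different initial state. The ensemble mean plays the role of the point estimate and the spread is the uncertainty.

```python
import random
import statistics


def logistic_step(x, r=3.9):
    """One step of the logistic map, which behaves chaotically for r = 3.9."""
    return r * x * (1.0 - x)


def ensemble_forecast(x0, horizon=40, members=200, noise=1e-4, seed=1):
    """Return a (mean, spread) pair for each step of the forecast horizon."""
    rng = random.Random(seed)
    # Tiny perturbations stand in for measurement error on the initial state.
    states = [x0 + rng.uniform(-noise, noise) for _ in range(members)]
    forecast = []
    for _ in range(horizon):
        states = [logistic_step(x) for x in states]
        forecast.append((statistics.mean(states), statistics.pstdev(states)))
    return forecast


forecast = ensemble_forecast(0.4)
for step in (1, 10, 20, 40):
    mean, spread = forecast[step - 1]
    print(f"horizon {step:2d}: estimate {mean:.3f} +/- {spread:.3f}")
```

At short horizons the spread is tiny and the point estimate is informative. At long horizons the spread becomes comparable to the full range of the system, which is a sign that the forecast no longer carries information.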
Another common assumption is that more data is better. This is not always the case, and choosing the right data (the right size, the right features, the right time period, …) is very important. This choice will in turn dictate the technology: a single computer, the cloud, big data technology…
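Here is a small sketch of why the right time period can matter more than volume. The synthetic series changes regime partway through, and a naive mean forecast is computed once on the full history and once on a recent window. The data, the window size of 30 and the function name are illustrative assumptions of mine.

```python
import random
import statistics

rng = random.Random(7)
old_regime = [rng.gauss(10, 1) for _ in range(200)]  # behaviour before the change
new_regime = [rng.gauss(20, 1) for _ in range(51)]   # behaviour after the change

# Hold out the last observation as the value we try to forecast.
history = old_regime + new_regime[:-1]
actual_next = new_regime[-1]


def mean_forecast(series, window=None):
    """Forecast the next value as the mean of the last `window` points (all if None)."""
    data = series if window is None else series[-window:]
    return statistics.mean(data)


error_all = abs(mean_forecast(history) - actual_next)
error_recent = abs(mean_forecast(history, window=30) - actual_next)

print(f"error using all {len(history)} points: {error_all:.2f}")
print(f"error using the last 30 points: {error_recent:.2f}")
```

In this constructed case the smaller, more relevant sample gives the better forecast, because most of the full history describes a regime that no longer applies.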
By defining the goal of data science as problem solving, and considering modeling techniques and technology as means to this end, we can stop pitting competing definitions of what data science is or is not against each other. This has a profound impact on how we should hire data scientists. But this will be the subject of another article. Stay tuned!