Article

Data Science Versus Data Engineering

Organizations must realize that data scientists and data engineers have different skill sets and both are valuable.

June 5, 2017 4 min read

In the third installment of this blog, we told you that “the analytic discovery process has more in common with research and development (R&D) than with software engineering.” But – symmetrically, if confusingly - what comes after the discovery process typically has more in common with software engineering than with R&D.

The objective of our analytic project will very often be to produce a predictive model that we can use, for example, to predict the level of demand for different products next week, to understand which customers are most at risk of churning, to forecast whether free cash flow next quarter will be above – or below – plan, etc.

For that model to be useful in a large organization, we may need to apply it to large volumes of data - some of it freshly minted - so that we can, for example, continuously predict product demand at the store / SKU level. In other cases, we may need to re-run the model in near real-time and on an event-driven basis, for example if we are building a recommendation engine to try and cross-sell and up-sell products to customers on the web, based on both the historical preferences that they have expressed in previous interactions with the website and on what they are browsing right now. And in almost all cases, the output of the model – a sales forecast, a probability-to-churn score, or a list of recommended products – will need to be fed back into one of the transactional systems that we use to run the business in order that we can take some useful action, based on the insight that the model has provided us.

To deliver any value to the business, then, we may need to take a model built in the lab from brown paper and string and use it to crunch terabytes, or petabytes, of data on a weekly, daily - or even hourly basis. Or we may need to simultaneously perform thousands of complex calculations on smaller data-sets - and to send the results back to an operational system within only a few hundred milliseconds. And achieving those sorts of levels of performance and scalability will require that we build a well-engineered system on top of a set of robust and well-integrated technologies.

Our system may have to ingest data from several sources, integrate them, transform the raw data into “features” that are the input to the model, crunch those features using one-or-more algorithms – and send the resulting output somewhere else. When you hear data engineers talking about building “machine learning pipelines”, it is this fetch-integrate-transform-crunch-send process that they are referring to.

Building, tuning and optimizing these pipelines at web and Internet of Things scale is a complex engineering challenge – and one that often requires a different set of skills from those required to design-and-prove the prototype model that demonstrates the feasibility and utility of the original concept. Some organizations put data scientists and data engineers into multi-disciplinary teams to address this challenge; others focus their data scientists on the discovery part of the process – and their data engineers on operationalizing the most promising of the models developed in the laboratory by the data scientists. Both of these approaches can work, but it is important to ensure that you have the right balance of both sets of skills. Over-emphasize creativity and innovation and you risk creating lots of brilliant prototypes that are too complex to implement in production; over-emphasize robust engineering and you risk decreasing marginal returns, as the team focusses on squeezing the last drop from an existing pipeline, rather than considering a completely new approach and process.

Of course, not every analytic discovery project will automatically result in a complex implementation project. As we pointed out in a previous blog, the “failure” rate for analytic projects is relatively high – so we may go several times around the CRISP-DM cycle before we settle on an approach worth implementing. And sometimes our ability may be merely to understand. For example, a bricks and mortar Retailer might want to identify and to understand different shopping missions – and “implementation” might then be about making changes to ranging and merchandising strategies, rather than about deploying a complex, real-time software solution.

Whilst there are several ways of employing the insight from an analytic discovery project, the one thing that they all have in common is this: change. As a former boss once said to one of us: old business process + expensive new technology = expensive old business process. And achieving meaningful and significant change in large and complex organizations is never merely about data, analytics and engineering – it’s also about organizational buy-in, culture and good change management. Whilst data scientists and data engineers often have different backgrounds and different skill sets, one thing that they often have in common - and that they may take for granted in others – is a belief that the data know best. Since plenty of other stakeholders see the business through a variety of entirely different lenses, securing the organizational buy-in required to action the insight derived from an analytic project is often as complex a process as the most sophisticated machine learning pipeline. Involve those other stakeholders often and early if you want to discover something worth learning in your data – and if you want that learning to change the way that you do business.

Tags

Articles

Martin has over 27-years of experience in the IT industry and has twice been listed in dataIQ’s “Data 100” as one of the most influential people in data-driven business. Before joining Teradata, Martin held data leadership roles at a major UK Retailer and a large conglomerate. Since joining Teradata, Martin has worked globally with over 250 organisations to help them realise increased business value from their data. He has helped organisations develop data and analytic strategies aligned with business objectives; designed and delivered complex technology benchmarks; pioneered the deployment of “big data” technologies; and led the development of Teradata’s AI/ML strategy. Originally a physicist, Martin has a postgraduate certificate in computing and continues to study statistics.

View all posts by Martin Willcox

Dr. Frank Säuberlich leads the Data Science & Data Innovation unit of Teradata Germany. It is part of his repsonsibilities to make the latest market and technology developments available to Teradata customers. Currently, his main focus is on topics such as predictive analytics, machine learning and artificial intelligence.
Following his studies of business mathematics, Frank Säuberlich worked as a research assistant at the Institute for Decision Theory and Corporate Research at the University of Karlsruhe (TH), where he was already dealing with data mining questions.

His professional career included the positions of a senior technical consultant at SAS Germany and of a regional manager customer analytics at Urban Science International. Frank has been with Teradata since 2012. He began as an expert in advanced analytics and data science in the International Data Science team. Later on, he became Director Data Science (International).

His professional career included the positions of a senior technical consultant at SAS Germany and of a regional manager customer analytics at Urban Science International.

Frank Säuberlich has been with Teradata since 2012. He began as an expert in advanced analytics and data science in the International Data Science team. Later on, he became Director Data Science (International).

View all posts by Dr. Frank Säuberlich

Stay in the know

Subscribe to get weekly insights delivered to your inbox.

Business Email*

Country*

Yes

I consent that Teradata Corporation, as provider of this website, may occasionally send me Teradata Marketing Communications emails with information regarding products, data analytics, and event and webinar invitations. I understand that I may unsubscribe at any time by following the unsubscribe link at the bottom of any email I receive.

address1

Your privacy is important. Your personal information will be collected, stored, and processed in accordance with the Teradata Global Privacy Statement.

Data Science Versus Data Engineering

About Martin Willcox

About Dr. Frank Säuberlich