On the Challenges of Hiring for Data Science Roles
I had coffee recently with an applicant to kaam.work’s data science jobs marketplace — her name is Helen.
Helen’s profile was typical of our applicants: late 20s, degree in, hungry to grow, and working for a blue-chip name (an investment bank in this case) but unsatisfied. She had spent several years in back-office data engineering roles, architecting big data solutions using Hadoop and creating data-ingestion structures and pipelines to serve as the foundation for business reports and decision support. Helen applied to work with kaam.work platform companies as a data scientist, a role adjacent to her prior experience.
In what follows, we’ll outline how Helen’s case neatly captures three different challenges in the field of data science hiring:
- The siren call of “data science” has led to muddiness in what “data science” really means
…and, stemming from that muddiness:
- Job applicants, keen to progress in responsibility and compensation, focus on the “sexiest” techniques without having mastered the foundations
- Prospective employers struggle to verify hard skills efficiently and effectively
In the coming two weeks, we’ll explore these points in detail. In particular, the ramifications differ depending on where a company is in its digital transformation journey. For example, a growth-stage digital-native startup will encounter different challenges with “Helens” than a traditional company in a legacy industry seeking to digitize its operations and business model. Today, I’ll give the CliffsNotes version.
The term “data science” is now so broadly and commonly used that it encompasses at least four very different, and all very valuable, roles, each comprising part of the digitization puzzle. Those four roles are:
- Data engineer: builds the foundations. This role designs databases as well as the flow and transformation of data between different systems, commonly called a data pipeline. An example would be building a pipeline that flows marketing and website behavior data into the same database or data lake that houses finance data on manufacturing costs.
- Data analyst: enables the business to make rapid decisions. This role is the first-level value-add on the foundations built by the data engineer. A data analyst, often also called a business analyst, is highly adept at using SQL and related query languages to extract and wrangle data into views and basic insights meaningful to business owners. For example, a data analyst might work closely with a channel marketing manager to assess the performance of different campaign permutations on a running basis. The work involves lots of shorter, back-and-forth tasks focused on decision enablement. Generally, a data analyst has a good grasp of math, but their need for statistical tools is *generally* constrained to data cleaning, simple descriptive statistics, and linear regression.
- Data scientist: builds repeatable, value-add models using predictive statistical techniques, e.g. logistic regression, latent variable analysis, neural networks. This role typically works on multi-week projects exploring large data sets. These data sets are mined for hidden patterns that can be used to predict business variables capable of driving extreme operational value, at times even whole percentage points of operating income. Data scientists often have extensive training in quantitative fields, e.g. statistics, applied mathematics, physics, chemistry, economics. A strong data scientist will be capable of testing and automating statistical models as well as explaining them in plain English to business owners. Data scientists are usually capable coders, though when it comes to coding they are rarely as strong as machine learning engineers. At the same time, a strong data scientist can typically hand-tune a statistical model to a greater degree than a machine learning engineer. Depending on the context and business scale, even a small “lift” achieved by such hand-tuning can be worth millions of USD/EUR in yield.
- Machine learning or AI engineer (“MLE”): programs computers to parse large data sets for patterns without human intervention. An MLE usually has core training in computer science and understands predictive statistical models well. He or she will establish computer routines in which a machine-learning model self-improves and delivers its output to a production-code environment, where the model’s outputs interface directly with business logic.
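To make the data analyst’s day-to-day work a little more concrete, here is a minimal sketch of the kind of SQL-driven campaign comparison described above. Everything in it is an assumption made for illustration: the `campaign_events` table, its columns, the helper name `campaign_conversion_rates`, and the sample figures are hypothetical, and Python’s built-in `sqlite3` module stands in for whatever database a real team would use.

```python
import sqlite3


def campaign_conversion_rates(rows):
    """Return {campaign: conversion_rate} for (campaign, visits, conversions) rows.

    Hypothetical helper: table and column names are illustrative only.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE campaign_events "
        "(campaign TEXT, visits INTEGER, conversions INTEGER)"
    )
    conn.executemany("INSERT INTO campaign_events VALUES (?, ?, ?)", rows)

    # Basic cleaning (drop rows with no visits) plus a simple descriptive
    # statistic (conversion rate per campaign), best campaign first.
    query = """
        SELECT campaign,
               1.0 * SUM(conversions) / SUM(visits) AS conversion_rate
        FROM campaign_events
        WHERE visits > 0
        GROUP BY campaign
        ORDER BY conversion_rate DESC
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


# Made-up sample data: (campaign, visits, conversions)
sample_rows = [
    ("email_a", 200, 10),
    ("email_a", 300, 20),
    ("social_b", 400, 8),
    ("social_b", 0, 0),  # an empty row that the cleaning step filters out
]

rates = campaign_conversion_rates(sample_rows)
for campaign, rate in rates.items():
    print(f"{campaign}: {rate:.1%}")
```

Note how little statistics is involved: the value here comes from extracting and cleaning the right data quickly, which is exactly the foundation the more advanced roles also depend on.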
It’s important to understand these roles as describing core activities—responsibilities often bleed over somewhat depending on whom you can hire in a super tight labor market. But to be clear, if you hire a data scientist when you really need someone to do data engineering work, you’ll both be starting from scratch again in 2-3 months.
Let’s return to Helen for a moment. Helen’s experience in data engineering includes occasional application of statistical models to optimize data pipeline flows. That provides a foundation for learning more data science, but it does not make her a data scientist!
The type of data engineering Helen has done overlaps heavily with data analysis. In fact, Helen often tests her data-engineering work using the same analytical techniques a data analyst might use in day-to-day work. But for Helen to be hired as a standalone data scientist, a rarer and often highly compensated role, she must master a broad spectrum of statistical techniques, starting with basic applications to data extraction and wrangling. Someone capable of running advanced statistical models but incapable of extracting and cleaning their own data will not go far. Unfortunately, garbage in/garbage out applies here more than ever.
And precisely here lies the problem from the employer’s side as well. Assume Helen has been diligently studying statistical models (interesting stuff, after all!). She then applies for a data scientist role at a traditional company in the printing business that is just launching a digitization initiative. The company, just at the beginning of its digital transformation and light on digitally native staff, will have difficulty establishing that, while Helen can readily describe different types of models, she cannot easily extract the data needed to drive those models. How then does a prospective employer efficiently decide which skills to verify and how to verify them?
What has worked for companies out there? Are there tools you use regularly?