Data drudgery persists for data scientists

Cleaning and preparation still accounts for nearly half the workload of data scientists, finds Anaconda survey

(Image: Stockfresh)

1 July 2020

The hassles of data intake and cleaning, problems with biased models and data privacy, and difficulty finding experience and technical skills – all these ranked among the biggest challenges facing data scientists and software engineers in data-science disciplines, according to a newly released survey.

Anaconda, maker of the Python distribution of the same name for scientific computing applications, conducted its 2020 State Of Data Science survey with 2,360 respondents from 100 countries, slightly fewer than half of them from the US.

Despite all the advances in recent years in data science work environments, data drudgery remains a major part of the data scientist’s workday. According to self-reported estimates by the respondents, data loading and cleaning took up 19% and 26% of their time, respectively – 45% combined, or almost half of the total. Model selection, training/scoring, and deployment took up about 34% in total (around 11% for each of those tasks individually).

When it came to moving data science work into production, the biggest overall obstacle – for data scientists, developers, and sysadmins alike – was meeting IT security standards for their organisation. At least some of that is in line with the difficulty of deploying any new app at scale, but the lifecycles for machine learning and data science apps pose their own challenges, like keeping multiple open source application stacks patched against vulnerabilities.

Another issue cited by the respondents was the gap between skills taught in institutions and the skills needed in enterprise settings. Most universities offer classes in statistics, machine learning theory, and Python programming, and most students load up on such courses. But enterprises find themselves most in need of data management skills that are taught only rarely or not at all, and advanced math skills that students do not often develop. Students themselves felt that a lack of experience (40%) and a lack of technical skills (26%) were the biggest barriers to jobs in the field, shortcomings that (according to Anaconda) could be better addressed by strong internship programs that “go beyond providing a résumé enhancement and hands-on-keyboard technical skills.”

One finding in the report should not surprise anyone: Python remains king of the languages used in the data science space. R comes in a distant second, while JavaScript, Java, C/C++, and C# trail behind. Julia, a rising contender in the data science world, wasn’t listed in the running, and it’s unclear whether that was because it didn’t figure into enough respondents’ answers or because the survey didn’t mention it.

IDG News Service
