What is a Data Scientist?
I’ve been trying to answer this question for well over a year. As an academic turned entrepreneur, I was intrigued by the title data scientist. We were building Data & Sons at the time, and identifying core customers was a key part of the design process. It seemed that people who both developed and used data would be natural sellers and buyers on our marketplace. It sounded like data scientists might just be this type of person.
Answering that question took a surprisingly long time. After spending over a year reading, researching, and talking with people across Fortune 500 companies, startups, data-science-centric social media, and data science training programs, I think I have a working solution. A prototype data scientist definition, if you will. Given the number of posts on DataTau, Medium, and Reddit asking this same question, I think taking the time to put together a solid working definition is value added for lots of people in the data science field (profession, community, industry?), especially for people interested in joining the profession.
So first, what’s data science? As an organizational scientist, I learned and applied the traditional scientific method: review/observation, theory/hypotheses development, collect data, test hypotheses using statistical analysis, and hope to find something publishable. The idea is that the data you collected (your sample) was generalizable to the overall population. So if you found results that supported your hypothesis in a sample of 600 people, you would argue this would be the case in the greater population when you published the study.
Then along came big data. You would no longer need a sample because you could plausibly have the entire population. Instead of 600 people, you now had 4 million if you were Facebook. No need to mess with theory and hypothesis development: you simply ran statistical analysis of the population and the results told you everything you needed to know. This is what led Wired editor Chris Anderson to observe that Theory is Dead in 2008. It is in this context that Jim Gray coined the term data science. Data science was accumulating enough data that you could skip theory and hypothesis development and rely on the statistical relationships you found in the data. Data science is essentially a science hack.
So does that make data scientists science hackers? I’m going to say no for one primary reason: you need more skills for data science than you do for traditional science. So sure, it’s a hack of the scientific method, but it takes more dedicated learning, experience, and effort to be able to hack that process. Not a very good hack if it requires more effort. It is possessing these skills that I think makes someone a data scientist. Therefore:
Data scientists are professionals competent in statistical analysis, computer programming, and applied problem solving in their domain of interest. The Venn diagram below illustrates how possessing different combinations of these skills makes people good at different data-centric jobs. Because there are so many people running around calling themselves data scientists today, I think the diagram also does a good job of illustrating who is not a data scientist. Let’s review each.

Statistical Competence. I put this at the top of the Venn diagram because understanding statistics is at the core of data science (or really any other data-centric role). The whole point is to skip theorizing and rely on statistical relationships. If you cannot find these relationships in your data, you cannot play in data science. This also means you will need to be proficient in R, SPSS, SAS, or Stata, and likely some of the method- or model-specific software packages.
Applied Problem Solving. I think there are lots of people out there who have statistical competence and/or computer programming skills, with “data scientist” in their current job title. I would, however, argue that they are not data scientists. Why? Remember that the first part of the scientific process is review/observation, which is studying and trying to develop a basic understanding of some subject or phenomenon before you start asking your own research questions. What do people already know about this subject? What don’t we know yet? While big data may take away the need for developing new theory and hypotheses, you still need to know what it is you are studying. If you don’t, you’re going to spend a lot of time and resources to get obvious answers to stupid questions. There’s no faster way to get marginalized in an organization than making more money than most people in the room and presenting them with a detailed research project that tells them exactly what they already knew five years ago.
A data scientist has to know what questions to ask. This requires that you develop a thorough understanding of whatever you are examining with data science (e.g. business, public policy, educational outcomes, etc.). The practitioners (business people, policy wonks, educators, etc.) know their subject area, but they often do not understand the tools data scientists bring to the table and thus have no idea what to tell you to do. In the 2017 Kaggle State of Data Science Survey, the fifth most cited barrier at work (30.2% of respondents) was “Lack of a Clear Question to Answer.” If you don’t know what questions to ask, you cannot have “scientist” in your job title. Inquiry, whether done through thoughtful theory development or studying massive amounts of data, is at the heart of ALL science. All inquiry starts with asking the right questions.
Computer Programming. Large amounts of well organized, accurate, and authentic data are the world’s most valuable resource. This means you are unlikely to just come across it anytime soon, so you’ll need to develop it yourself. You will also need to do this on a repeated basis (i.e. not a one-time data collection). This may be a few times a year, once a day, or continuously in real time. To collect and analyze data on a repeated basis, you’ll need to build a system that (1) acquires and updates data; (2) organizes that data from different sources into a coherent structure; (3) passes that data into some sort of statistical analysis; (4) presents results in a clear manner (often as a visualization); and (5) does all of this on an automated basis. It’s this last part (the automation) that separates data scientists from people proficient in statistical analysis who can accomplish tasks 1-4. Most academic researchers (PhD types like me) are highly proficient in tasks 1-4, but are completely clueless when asked to repeat that process on an ongoing basis. Automating that process requires being able to tell a computer to do it, and that requires proficiency in Python, SQL, C++, and/or some other programming language. While strong in statistical analysis and applied problem solving, I would not identify as a data scientist until I had improved upon my current Python and SQL skills...unless of course you had a lot of money to throw at me.
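To make the five tasks concrete, here is a minimal Python sketch of what such a system could look like. Everything in it is an assumption made for illustration: the store sales and visitor figures are made up and hard-coded, the function names (acquire, organize, analyze, present, run_pipeline) are hypothetical, and the "statistical analysis" is just a mean conversion rate. A real system would pull from live sources and run a far richer analysis.

```python
import statistics
from datetime import date


def acquire():
    """Task 1: get the latest data (hard-coded, made-up numbers for illustration)."""
    sales = [
        {"store": "A", "units": 120},
        {"store": "B", "units": 80},
        {"store": "C", "units": 100},
    ]
    visits = [
        {"store": "A", "visitors": 400},
        {"store": "B", "visitors": 320},
        {"store": "C", "visitors": 250},
    ]
    return sales, visits


def organize(sales, visits):
    """Task 2: merge the two sources into one table keyed on store."""
    visitors_by_store = {row["store"]: row["visitors"] for row in visits}
    return [
        {
            "store": row["store"],
            "units": row["units"],
            "visitors": visitors_by_store[row["store"]],
        }
        for row in sales
        if row["store"] in visitors_by_store  # skip stores missing from one source
    ]


def analyze(table):
    """Task 3: a deliberately simple statistical summary."""
    rates = [row["units"] / row["visitors"] for row in table]
    return {
        "stores": len(table),
        "total_units": sum(row["units"] for row in table),
        "mean_conversion": statistics.mean(rates),
    }


def present(results, run_date):
    """Task 4: turn the numbers into something a person can read."""
    return (
        f"Report for {run_date.isoformat()}: {results['stores']} stores, "
        f"{results['total_units']} units sold, "
        f"mean conversion {results['mean_conversion']:.1%}"
    )


def run_pipeline(run_date=None):
    """Task 5: one entry point that runs tasks 1-4 with no human in the loop."""
    run_date = run_date or date.today()
    sales, visits = acquire()
    table = organize(sales, visits)
    return present(analyze(table), run_date)


if __name__ == "__main__":
    print(run_pipeline())
```

The point of the sketch is the last function: because run_pipeline needs no manual steps, a scheduler such as cron could call it once a day or once an hour, which is the automation piece that tasks 1-4 alone do not give you.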
The reality is that the job market for data scientists is very, very hot right now. I realize there are and will continue to be more and more people calling themselves data scientists who do not possess all three of the skills identified. I do think the three skills provide a good educational progression for becoming a data scientist: start with stats, move to programming, and then gain a solid understanding of the area where you are going to apply your craft. Likely, you will be marketable with a solid statistics background (Data Incubator and Insight Data Science both exist to train you up on the programming side while getting you hired), you will be highly desired as someone with both statistics and programming skills, and once you have several years of experience in a particular industry, you will be extremely sought after and courted as a full-fledged data scientist.
