Data Science Jobs for Computer Science Students, Grads, and Software Engineers
Data Science is rapidly expanding from a sort of specialized offshoot of software engineering to its own discipline outright. As a discipline, it draws on Computer Science, statistics, and mathematics in large quantities to achieve its main objectives. Unfortunately for hiring managers, this breadth is not reflected in a typical undergraduate or even graduate degree. Although universities are slowly beginning to introduce Data Science programs to address this skills shortage, the Data Scientists of today are largely folks who have managed to supplement their formal education with a great deal of self-learning to fill in the gaps.
The good news for self-learners is that as Data Science matures and organizations develop their Data Science teams and capabilities, the opportunities for specialization increase. Whereas members of Data Science teams were previously required to be “unicorns” with extensive knowledge in all possible areas of Data Science, many of today’s larger teams have members who specialize in one area or another. And as a Computer Science graduate or engineer, you can go a long way in this field.
Opportunities in Data Curation
Data Science is a relatively nascent field, and there remains no unified definition of it, but all data scientists are concerned with two very broad and different areas: data analytics and data curation. Analytics is the extraction of useful knowledge from data. This is probably the activity that people most often associate with data scientists: crunching numbers and producing actionable insights and predictive models from data. But there is a whole other side of the coin that often goes ignored in discussions of Data Science: somebody somewhere along the way has to figure out how to collect, manage, preserve, document, transform, and access the data effectively and efficiently for analytics to be possible at all. In academia, these activities are often referred to as Data Curation, and professional Data Science teams without strong capabilities in this area don’t get very much done.
In fact, in some organizations there has historically been an outsized emphasis on analytics relative to curation, and that’s good news for Computer Science students, graduates, and engineers in 2017. A lot of Ph.D. statisticians and physicists get hired to Data Science teams for their skills on the analytics side, and often these teams have a serious shortage of ability in data curation.
In the professional world, common titles for a specialist in data curation include Data Engineer, Data Developer, Business Intelligence Developer, Big Data Specialist, or sometimes just Data Scientist. People with a Computer Science background have a great head start for this kind of career path. Where physicists or statisticians might not have extensive training in data structures, schemas, or Entity-Relationship models, these are all cornerstones of a Computer Science education. With this knowledge in your back pocket, you can move directly to learning about the technologies that modern firms use for data curation.
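To make the schema and Entity-Relationship ideas concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two entities (customer and purchase) and their one-to-many relationship are hypothetical, invented purely for illustration; a real data model would of course be much larger.

```python
import sqlite3

# Hypothetical model: one customer can have many purchases (one-to-many).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relationship

conn.executescript("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE purchase (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount      REAL NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO purchase (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 5.5), (2, 12.0)],
)

# Follow the relationship with a join and aggregate per customer.
totals = conn.execute("""
    SELECT c.name, SUM(p.amount)
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name
""").fetchall()

print(totals)
```

The foreign key is the relational counterpart of the line connecting two entities in an Entity-Relationship diagram, and the join is how a query walks along that line.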
As a Data Engineer, you should strive to have a deep understanding of at least the following things:
- The relational model and its various implementations (SQL Server, Oracle Database, MySQL, etc.).
- NoSQL databases including:
  - The Document Store model and MongoDB in particular, which is the most popular NoSQL database at the time of this writing according to db-engines.com.
  - Wide column databases like Cassandra.
  - Key-value stores like Redis.
  - As many other NoSQL models as you can handle: this Wikipedia article is a good place to start.
- The MapReduce programming model and its implementation in Apache Hadoop.
- Cloud computing platforms like Amazon Web Services and Microsoft Azure.
Of course, the specific technologies mentioned above will likely be out of date by the time I hit “publish” on this post, but the concepts and ideas will not. Data curation specialists are experts in the creation, capture, modeling, management, documentation, storage, transformation, and retrieval of data, and need to be familiar with all of the tools that successful organizations use to accomplish those tasks. People with formal education and training in data structures and Computer Science are in a great position to become experts in these areas and lead data curation activities in Data Science teams.
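As one example of a concept that outlives any particular tool, the MapReduce programming model can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce steps, not how Hadoop is actually implemented or invoked; the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document (or block) would be mapped on a
    # different machine; here everything runs in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data beats opinions"])
print(counts)
```

Because the map and reduce functions only ever see one document or one key at a time, a framework is free to spread that work across many machines, which is the point of the model.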
Opportunities in Data Analytics
Again, analytics is probably the area that most people will most closely associate with the term Data Scientist, and learning analytics will require more filling in of gaps in formal education for a CS major than it would for a statistics major or someone with a Ph.D. in a highly quantitative field. The good news for CS majors is that you’ll have an easier time getting started with the tools of data analytics than, say, a physicist or a pure mathematician: as a programmer, learning how to use a new tool from examples and documentation is second nature to you.
With that in mind, you may as well get started by getting acquainted with the tools. Most Data Scientists that I know make extensive use of both R and Python, with some preferring one over the other. If you already know one of those languages, then that language is probably the best place to start. If you’re going to use R, the best environment that I know of for doing Data Science in R is RStudio, an IDE with great support for interactive programming, writing scripts, producing graphics, and lots of other analytics tasks. If you want to use Python, you’ll need to install a lot of libraries to turn it into a Data Science language; luckily, the folks at Continuum Analytics have built a Python distribution called Anaconda that comes with the main statistical computing libraries for doing Data Science, as well as a great tool called Jupyter Notebook, which facilitates computation in a readable, interactive notebook format. Incidentally, Anaconda and Jupyter Notebook are the main tools that we will be using at BrainStation in our upcoming Data Science course.
A natural place to go from here is to learn how to build a predictive model using machine learning. There are various levels of abstraction and understanding that you can strive for here. For a deep understanding of how the various models and algorithms work, you’ll need the equivalent of at least a few undergraduate math courses: at least two semesters of calculus, at least one semester of linear algebra, and probably a course in probability. Data Science teams should have members with at least this level of understanding of what’s going on. With that said, it takes one line of code to perform a linear regression in R, and not much more to use more sophisticated models than that. Abstracting from the particular details of how each type of algorithm works, there is a higher-level paradigm that governs how models are built with machine learning: models are trained, tested, tuned, validated, and deployed, and data scientists should understand each step in this process. It is possible to achieve a high-level understanding of how to apply machine learning principles and techniques to build a predictive model without necessarily understanding the particular details of each specific algorithm. A good place to start here is the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson, which introduces machine learning techniques from the perspective of trying to make the best predictions from real-world data rather than trying to understand the minute details of any specific algorithm or model. Participating in and researching Kaggle competitions is also a great way to simply get your feet wet and start model building.
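To give a feel for the train-and-test part of that workflow, here is a minimal sketch in plain Python with no third-party libraries. The data is synthetic (generated from a made-up line, y = 3x + 2, plus noise), and the helper names are my own; in practice you would reach for a library rather than hand-rolling least squares, and the tuning, validation, and deployment steps are not shown.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between predictions and observed values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): y = 3x + 2 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

# Hold out the last 25% of the points; the model never sees them during training.
split = int(0.75 * len(xs))
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

# Train on one part of the data, then measure error on the held-out part.
slope, intercept = fit_line(train_x, train_y)
test_mse = mean_squared_error(test_x, test_y, slope, intercept)

print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.2f}")
```

The key habit is in the split: error measured on data the model was fitted to flatters the model, so the number worth reporting is the one computed on the held-out points.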
Opportunities for Programmers
Tools are being built to bring extremely sophisticated machine learning capabilities to organizations that don’t necessarily have a team of Ph.D. mathematicians on hand. Toolkits like Keras and h2o.ai are already making it so that anyone with a little bit of coding experience can build production-quality machine learning applications easily. This creates a huge opportunity for programmers: the challenge of the future will not be in building the models but in integrating these kinds of ready-made toolkits into their organization’s production stack, and that takes Computer Science knowledge and programming experience over anything else. For this reason, there’s no better time than now for engineers and programmers to start learning all they can about machine learning and Data Science.