Research suggests that companies that incorporate these kinds of practices into their management workflow are more productive and profitable than their competitors. The same holds for user base growth and, therefore, overall company revenue. For example, Spotify, a data-driven company best known for incorporating machine learning into its ‘Discover Weekly’ feature, officially has twice as many paying customers as Apple Music.
At the same time, companies that provide software solutions for external clients are also aiming to expand their business by offering solutions that involve these technologies. As a result, data scientist roles have grown 650% since 2012, and machine learning engineer jobs have increased almost tenfold. This is a clear sign that both of these new roles are here to stay, which is why every software engineer would benefit from developing at least a basic knowledge of this field, which promises to drive the future.
This blog post briefly answers some of the most important questions a professional should consider before deciding to continue on this learning path.
What are machine learning and data science?
Data science is a mix of several disciplines that have existed for a long time, although combining them used to be cumbersome or difficult. Thanks to improvements in data storage capabilities and cloud computing, this combination has become a reality. Its goal is to extract insights and meaningful information from unstructured data, much of which is now generated every second by all kinds of devices (especially IoT devices). It combines statistics, programming and data handling, but most importantly it requires enough domain knowledge to understand and manipulate the information so that the intended analysis is meaningful to the business.
Machine learning is often paired with data science because of the way it uses data. The idea is not only to analyze data, but to generate a computational model that can solve a complex problem without anyone explicitly programming the logic that arrives at the conclusion. This model can also keep improving as more data is fed to the system. Machine learning is commonly subdivided into two main types of algorithms that solve different types of problems:
Supervised learning: Allows us to predict values after learning from a known data set that already contains the value to be predicted. Bank systems that detect fraudulent transactions or determine the credit risk of an account holder use this type of algorithm.
Unsupervised learning: Allows us to identify clusters in data that has no predefined labels. Voice detection in virtual assistants such as Siri and Cortana uses this kind of learning, as the system needs to isolate the user’s voice from the background noise.
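To make the difference between the two types more concrete, here is a minimal sketch in plain Python. It is only an illustration: the transactions, labels and function names are hypothetical, and the two techniques chosen (a 1-nearest-neighbour classifier and a two-cluster k-means loop) are simple stand-ins for the much more capable algorithms used in real systems.

```python
import math

# Supervised learning: every training example carries the label we want to predict.
# Hypothetical transactions: (amount in dollars, hour of day) -> known label.
labeled_transactions = [
    ((12.0, 14), "legitimate"),
    ((25.0, 10), "legitimate"),
    ((18.0, 19), "legitimate"),
    ((950.0, 3), "fraud"),
    ((1200.0, 2), "fraud"),
]


def predict_label(features, training_data):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda example: math.dist(features, example[0]))
    return nearest[1]


# Unsupervised learning: there are no labels, the algorithm only groups similar values.
def two_means(values, iterations=10):
    """Split 1-D values into two clusters with a basic k-means loop (k = 2)."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for value in values:
            # Assign each value to the closer of the two current centers.
            index = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            clusters[index].append(value)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


print(predict_label((1000.0, 4), labeled_transactions))
print(two_means([1.0, 1.2, 0.8, 9.5, 10.1, 9.9]))
```

The supervised function needs the labels in order to answer at all, while the clustering function discovers the two groups on its own without being told what they mean.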
What kind of problems do data science and machine learning help us solve?
The trend for data science and machine learning is not just a marketing buzzword. It has been in development since the early 2000s and has many uses in daily life today. Some of the most basic examples are algorithms that show product recommendations to users based on their shopping history, or music recommendations based on their listening history. More complex implementations include algorithms designed to reduce customer churn, prevent bank fraud or predict when machines in a production line need preventive maintenance based on sensor data.
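As a rough illustration of the recommendation idea, the sketch below suggests products that were bought by customers with overlapping shopping histories. The customers, products and the `recommend` function are hypothetical, and real recommendation systems rely on far richer models; this only shows the basic intuition.

```python
from collections import Counter

# Hypothetical shopping histories: one set of purchased products per customer.
purchase_history = {
    "ana": {"laptop", "mouse", "keyboard"},
    "ben": {"laptop", "mouse", "monitor"},
    "cho": {"mouse", "keyboard"},
    "dev": {"monitor", "webcam"},
}


def recommend(customer, history, limit=2):
    """Suggest products bought by customers who share purchases with `customer`."""
    own_items = history[customer]
    scores = Counter()
    for other, items in history.items():
        if other == customer:
            continue
        overlap = len(own_items & items)
        if overlap == 0:
            continue  # no purchases in common, so no evidence of similar taste
        # Products the customer does not own yet are weighted by the overlap.
        for item in items - own_items:
            scores[item] += overlap
    # Sort by score (highest first), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:limit]]


print(recommend("cho", purchase_history))
```

Weighting by overlap means that customers with more purchases in common have more influence on the suggestions.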
One curious case is how Walmart, thanks to its 460 terabytes of its own data, has been able to identify and stock the products that will experience high demand before a natural disaster such as a hurricane. The company learned that not only basic necessities such as food and toiletries were in high demand, but also that beer and strawberry Pop-Tarts (the latter see a sevenfold increase in sales) were the top-selling pre-hurricane products. This has not only helped people avoid incidents during natural disasters, but has also helped Walmart increase its profits.
However, one of the most prominent (but not so honorable) mentions of data science in the media during 2018 was the Facebook and Cambridge Analytica scandal, related to the 2016 US presidential election. According to the media, the information of more than 50 million users was harvested without their explicit consent in order to find patterns that could be used to target specific groups with ads that would make them lean towards a specific political choice. This example alone shows how far data science can reach and how it can have both a positive and a negative impact on the daily lives of millions of people.
Now that I am interested in learning data science and machine learning, which development tools can I take advantage of?
1. Python + Jupyter Notebook: Python is a very popular multi-paradigm programming language that is easy to learn, and many tools have emerged around it that help to analyze data and easily create machine learning models. One of those tools is the Jupyter Notebook (http://jupyter.org/), which lets you create illustrative documents that contain live code and is used to clean, transform and visualize data as well as to create machine learning models. There are many good tutorials on the web that, with the help of libraries such as pandas (https://pandas.pydata.org/), NumPy (http://www.numpy.org/) and TensorFlow (https://www.tensorflow.org/), make it possible to get hands-on experience quickly and easily.
3. Azure Machine Learning Studio (https://studio.azureml.net/): A very user-friendly drag-and-drop tool used to create production-ready predictive analysis solutions that can easily be consumed by custom applications. It offers all the capabilities needed to create a machine learning model, such as integration with different data sources, transformations for data preparation, model visualization, built-in ML algorithms and one-click operationalization. It also has a Web API interface that allows quick and easy integration into existing projects. The advantage of this ML studio is that it is completely free to use for learning purposes and that you don’t need an Azure subscription to experiment with most of its features.
4. Open data libraries: Last but not least, data lies at the heart of these analyses, which is why it is very important to know where it can come from. One place is https://www.data.gov/, where we can find open data provided by the US government. It contains information on many disciplines and most probably has information closely related to the business goals of any company. Another repository with examples that can be used for learning is the University of California Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). It contains data sets that are very popular in machine learning and data science courses and tutorials on the internet.
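Once a data set has been downloaded, the first task is usually to clean and transform it. The sketch below shows that step using only Python's standard library, so it runs without installing anything. The sensor readings, column names and function name are hypothetical, and in a notebook the same work would more typically be done with pandas.

```python
import csv
import io
import statistics

# Hypothetical sensor readings; a real data set would be downloaded from one of
# the repositories above and opened as a file instead of an inline string.
RAW_CSV = """machine,temperature_c
press-1,71.5
press-1,
press-2,68.0
press-2,not_available
press-2,70.0
"""


def mean_temperature_by_machine(csv_text):
    """Drop rows with missing or non-numeric readings, then average per machine."""
    readings = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            value = float(row["temperature_c"])
        except (TypeError, ValueError):
            continue  # cleaning step: skip empty or malformed values
        readings.setdefault(row["machine"], []).append(value)
    # Transformation step: reduce each machine's readings to a single mean.
    return {machine: statistics.mean(values) for machine, values in readings.items()}


print(mean_temperature_by_machine(RAW_CSV))
```

Silently skipping bad rows keeps the example short; in a real analysis it is worth counting how many rows were dropped, since that number says a lot about the quality of the source.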
What is next in the future of data science and machine learning?
If one thing is certain, it is that the demand for data scientists and engineers who are familiar with machine learning algorithms will not decrease in the coming years. Implementations of this type of predictive model will be everywhere, and soon even jobs that require highly qualified technical decision makers, such as engineers or even doctors, will be substituted by automated decision makers. Also, with the ever-growing adoption of IoT (Internet of Things), most of our devices will be interconnected, and every human on the planet will become a source of huge data exhaust. According to an International Data Corporation (IDC) study, the world’s data is more than doubling every two years. Whoever is able to analyze this data and extract meaningful information will be the winner of the never-ending data race, which has only just started. Data-driven businesses, but most importantly data scientists and software engineers, will be the forces driving the future of the industry. That is why it is up to us as developers to make sure that these technologies help mankind overcome the challenges it faces and that they are not used to negatively impact the lives of individuals through unethical practices.
"Big Data: The Management Revolution - Harvard Business Review." https://hbr.org/2012/10/big-data-the-management-revolution. Accessed 2 Jul. 2018.
"Big Data Conversation with Spotify | Direct2DellEMC." 13 Apr. 2017, https://blog.dellemc.com/en-us/big-data-conversation-spotify/. Accessed 2 Jul. 2018.
"Spotify seeks more personalized playlists after Discover ... - TechCrunch." 25 May. 2016, https://techcrunch.com/2016/05/25/playlists-not-blogs/. Accessed 2 Jul. 2018.
"LinkedIn's Fastest-Growing Jobs Today Are In Data Science ... - Forbes." 11 Dec. 2017, https://www.forbes.com/sites/louiscolumbus/2017/12/11/linkedins-fastest-growing-jobs-today-are-in-data-science-machine-learning/. Accessed 2 Jul. 2018.
"Article Data Science vs. Big Data vs. Data Analytics - Simplilearn." https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article. Accessed 2 Jul. 2018.
"Data Science: The Big Picture | Pluralsight." 15 Sep. 2017, https://www.pluralsight.com/courses/data-science-big-picture. Accessed 2 Jul. 2018.
"Machine Learning Courses | Pluralsight." https://www.pluralsight.com/courses/understanding-machine-learning. Accessed 2 Jul. 2018