What you need to learn to work with Big Data
LinkedIn statistics show that machine learning, data science and big data lead the list of the fastest-growing jobs.
Big Data is not a new concept that suddenly emerged from nowhere. It is a term used to describe the tools applied when processing large and complex business data sets. These tools became popular as growing computing capacity made it possible to deal with huge volumes of information.
In the past, practically every company had a statistics department. Later this function turned into business analysis, and it is now turning into data analysis, which deals with more detailed and deeper information.
If you dream of a career in data science, you have to improve your mathematics
A trend is growing among large IT companies: they organize training for employees who work with Big Data or are involved in similar projects. More and more newcomers ask for advice: what do you need to learn to become a data scientist?
Today's Ukrainian data science market is highly focused on finding strong professionals with a good knowledge of data science tools. At the same time, large data processing projects now often require more than knowledge of tools: developing algorithms from scratch, recognizing patterns in large data sets and, as a result, being exceptionally strong in mathematics. A large share of business case solutions is based on probability theory and mathematical statistics.
As experience shows, weak mathematics is the main obstacle for those who set out on a career in data science, because trying to master machine learning without deep and systematic prior knowledge of probability theory, mathematical statistics, linear algebra and mathematical analysis is unproductive. Processing large data volumes often goes beyond a standard task and requires some mathematical creativity.
Hence, a data scientist is, first of all, an excellent mathematician.
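To give a sense of the kind of statistical reasoning meant here, below is a minimal sketch of a permutation test, one simple way to check whether the difference between two group averages could be explained by chance alone. The function name `permutation_test`, its parameters and the sample numbers are hypothetical and chosen only for illustration; a real project would more likely rely on an established statistics library.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Hypothetical helper for illustration: it repeatedly shuffles the pooled
    values and counts how often a random split produces a difference at
    least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / n_resamples


# Made-up numbers, e.g. order values under two versions of a web page.
version_a = [10, 11, 12, 13, 14, 15]
version_b = [1, 2, 3, 4, 5, 6]
p_value = permutation_test(version_a, version_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests that the gap between the groups is unlikely to be a random fluctuation. Deciding whether that gap matters for the business is still up to the analyst.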
Besides, the job requires knowledge of fundamental disciplines such as mathematical analysis, linear algebra, mathematical statistics and numerical methods, as well as the disciplines derived from them: algorithms, programming and econometrics, which itself combines statistics, programming and economic theory. All of these study the properties of data and the patterns in it, along with methods of data processing, analysis and modelling, which are extremely important for data science. These are the key areas of expertise for a good data scientist.
In addition to all the above, it is essential to get familiar with application packages such as SAS Miner and SPSS and to master the relevant programming languages such as Python and R. Combined, they make it possible to find the most appropriate data science solutions. This is the necessary elementary "basis" that any technical university provides.
Required knowledge and specifics of work in Data Science
Many postgraduates looking for a career in data science make similar mistakes. The first is to start learning the subject through an online data science course on Coursera or a similar platform. A course can certainly give you an idea of what data science is and some practical skills, but it is not enough to gain expert knowledge of mathematics and mathematical statistics. As a result, you learn how to apply a packaged solution by writing some code and plugging in libraries, but you remain unable to interpret the data and understand the essence of the problem you face. The only remedy is good fundamental knowledge of mathematics.
The second mistake is that good data scientists often either know little about business or do not feel like studying it properly, even though the problem itself comes from the business. Before stating and verifying a hypothesis, before looking for any patterns that explain a phenomenon, a data scientist should understand the business essence of the problem. This mistake is not unique to data scientists. It applies to any professional getting to know a new job.
Traditionally, data science roles are divided into several groups. The first group includes professionals who prepare data, design the data schema and build the back end and the transformation pipelines. This involves a lot of work with infrastructure (DevOps) and cloud solutions. On the whole, to get started with Big Data, pay attention to the following recommendations:
- Master a programming language, either Python or Scala/Java. Python is generally more popular in data science and many companies use it for automation.
- Get a good knowledge of Linux and of the data processing technologies used in most big data projects (Apache Spark and Apache Kafka).
- Become a strong database specialist (acquire expert knowledge of OLAP/OLTP as well as NoSQL).
- Get familiar with cloud computing services. This is a must. In my own experience, the big data projects I was involved in were mostly based on AWS.
- Having at least a basic idea of DevOps technologies (Terraform, Ansible, Fabric, Puppet) and of containerization (Docker/Kubernetes) is an important advantage.
Of course, the list is not exhaustive, and you can learn a lot once you start working.
Machine learning, in turn, deals with both data and the development of algorithms. Here you'll have to remember everything you've learned in linear algebra and statistics, as machine learning relies heavily on all kinds of operations with matrices. Besides, you should be able to apply various frameworks and work with libraries.
- Machine learning offers a broader choice of programming languages: Python, Scala and R.
- Machine learning projects are often based on cloud solutions, so you need to understand what cloud computing is (consider learning more about platforms like AWS, Azure or Google Cloud).
- Learn how to work with at least one machine learning library (Spark ML, scikit-learn or TensorFlow).
- Be aware that you'll also have to work with notebooks: Jupyter, Zeppelin, Databricks.
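To illustrate how closely machine learning is tied to matrix operations, here is a minimal sketch that fits a straight line to a few points by solving the normal equations (XᵀX)w = Xᵀy. It uses only the Python standard library so that the matrix steps stay visible. The helper names `transpose`, `matmul` and `fit_line` and the sample points are hypothetical, and in practice one of the libraries listed above would do this work.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x via the normal equations."""
    X = [[1.0, x] for x in xs]   # design matrix: a column of ones and a column of x
    y = [[v] for v in ys]        # targets as a column vector
    Xt = transpose(X)
    A = matmul(Xt, X)            # 2x2 matrix X^T X
    b = matmul(Xt, y)            # 2x1 vector X^T y

    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("x values must not all be identical")

    # Solve the 2x2 system A w = b with the explicit inverse of A.
    intercept = (A[1][1] * b[0][0] - A[0][1] * b[1][0]) / det
    slope = (A[0][0] * b[1][0] - A[1][0] * b[0][0]) / det
    return intercept, slope


# Made-up points that lie exactly on the line y = 1 + 2x.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)
```

The same idea, with more columns in the design matrix and a numerically stable solver, underlies the linear models found in most machine learning libraries.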
Finally, and most importantly: follow an inspiring example. Look for information about the work of leading IT companies and high-level professionals, master the new algorithms they use and try to apply them in a project of your own.
Apart from mathematics and all the related technologies mentioned above, the ability to understand the business side of the job is equally important in data science. In fact, you are expected to become a subject matter expert. To implement a decision-making algorithm correctly, you need a clear idea of the different business approaches and of the data you work with. There are many useful resources on the internet that will help you learn how to develop and improve such a model at an appropriate level. However, to make your model as accurate as possible, you'll have to run a good number of experiments and explore your business area as thoroughly as you can.