What is Data Science - Definition, Application and Importance

By Mukunda Varma | 16th October, 2020 | 8 min Read

Data Science is simply the science of data.

Data science is defined as an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Did that seem confusing? Let’s break it down to understand better:

  • First, we collect past data (history or records) related to the problem we want to solve (the task or application).
  • This data is arranged in a structured manner to find out what kinds of events occurred in the past and how they occurred.
  • After gaining the necessary insights from the structured data, we use them to predict the outcomes of future events.

Before going deeper into data science, let us have a look at why data science is important.

Why is Data Science important?

“The world’s most valuable resource is no longer oil, but data.” — The Economist

The quote shows how important data is at present. Let us get straight into a few stats. According to a 2019 Forbes report, roughly 2.5 quintillion bytes of data are created on the internet every day. Nearly every activity happening in the world is being transformed into data in one way or another.

The concept of Big Data is getting bigger and bigger. With such vast amounts of data available, more and more companies are looking to use this data to gain valuable insights and improve their businesses.

Since most of this data is unstructured, data science tools and methods are becoming more important by the day. Data Scientist has been named the number one job in the U.S. by Glassdoor for four years in a row. LinkedIn listed data scientist as one of the most promising jobs in 2017 and 2018. Reports from the U.S. Bureau of Labor Statistics estimate that around 11.5 million new data science jobs will be created by the end of 2026.

Let’s dive into the applications.

Applications of Data Science

From getting relevant results in your Google search to predicting the growth of cancers, the applications of data science are endless. With such a vast amount of data on the internet, isn’t it amazing how Google still shows you such accurate results? Imagine how much value you could bring to your company if you knew the likes and behaviours of your customers. How much money would you save if you could detect fraudulent bank transactions well in advance and abort them? How many hundreds of lives could be saved if you could detect various cancers and other dreadful diseases in their early stages? These are some of the most important applications of data science.

Truly amazing, isn’t it? Now that you know about the importance of data science and some of its real-life applications, aren’t you eager to dig deeper into this field?

Let us get into the details of what data science involves and the various steps in the process.

Process of Data Science

At a high level, six main steps are involved in data science:

  1. Data Extraction
  2. Data Preprocessing
  3. Data Analysis and Visualization
  4. Modelling and Prediction
  5. Evaluation and Fine Tuning
  6. Data Visualization

To develop a better understanding of these six steps, let us take a real-life example and see how each step plays a part in solving the problem.

Let us assume we want to predict heart failure in a person based on his or her various health parameters (the dataset was collected from here):

  1. Data Extraction

    This step involves extracting and collecting all the relevant data and putting it in one place. Data is generally collected from sensors (IoT devices), historical records, surveys, websites (by scraping), and other raw sources.

    But don’t worry, you need not travel to different places or access IoT devices to obtain your desired datasets. A lot of data is already available online and can be downloaded for free. Google has a search engine made specifically for finding datasets. Another easy way of collecting data is web scraping: select a website that you consider trustworthy and that satisfies your needs, and scrape the data you want.

    Although any publicly available data can technically be scraped, be sure to check the website’s policies before you start scraping.
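One part of a website's policy that can be checked programmatically is its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the rules, the `my-scraper` user agent, and the example.com URLs are all assumptions for illustration. A site's terms of service may impose further restrictions that this check does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from
# the site's /robots.txt path before scraping.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) scraper may fetch each URL.
allowed = parser.can_fetch("my-scraper", "https://example.com/datasets/heart.csv")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/records.csv")

print("datasets page allowed:", allowed)
print("private page allowed:", blocked)
```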

  2. Data Preprocessing

    Data collection can be tricky. Raw data is generally unstructured and needs to be brought into proper shape before it can be used for analysis.

    The data can have a lot of discrepancies caused by errors in the method of extraction or by human error. For example, the collected data might contain duplicate records, missing values, or extreme values that might distort our results. All such discrepancies have to be resolved before analysis to make sure our results are accurate.

    Let’s see this in our case. This is how a sample of our data looks after data extraction:

    Sample of raw data after data extraction

    The data shown in the figure is used to analyze whether a person’s heart is prone to failure based on several factors.

    These factors include age, sex, smoking habits, and various other health conditions. As mentioned above, the raw data is unstructured and needs to be processed before analysis.

    • From the above image, we can find null values (NaN, also known as missing values) in the 2nd and 3rd rows. These null values are generally replaced with the mean or median value of that column, or sometimes the entire row is removed.
    • Similarly, the age of the third person is listed as 245, which is not possible. This means the data in that row is wrong and must be removed.
    • Also, the last two rows contain the same data and are duplicates. We can exclude one of the two rows.

    After performing all these operations, we will have data that is clean and can be used for analysis. In the above example, the final data looks similar to the image below.

    sample of processed data after data pre-processing
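The three cleaning operations described above can be sketched in plain Python as follows. The rows, the column names, and the `MAX_PLAUSIBLE_AGE` cutoff are invented for illustration and are not the article's actual dataset; in practice a data-frame library would typically be used for this.

```python
from statistics import median

# Hypothetical raw records (values invented for illustration).
raw_rows = [
    {"age": 75, "sex": 1, "smoking": 0, "ejection_fraction": 20},
    {"age": 55, "sex": 1, "smoking": None, "ejection_fraction": 38},
    {"age": 245, "sex": 0, "smoking": 1, "ejection_fraction": None},
    {"age": 60, "sex": 0, "smoking": 0, "ejection_fraction": 35},
    {"age": 60, "sex": 0, "smoking": 0, "ejection_fraction": 35},
]

MAX_PLAUSIBLE_AGE = 120  # assumed cutoff for an impossible age


def clean(rows):
    columns = list(rows[0])

    # 1. Drop rows whose age is missing or impossible.
    plausible = [
        dict(r) for r in rows
        if r["age"] is not None and 0 < r["age"] <= MAX_PLAUSIBLE_AGE
    ]

    # 2. Replace missing values with the median of the known values in that column.
    for col in columns:
        known = [r[col] for r in plausible if r[col] is not None]
        fill_value = median(known)
        for r in plausible:
            if r[col] is None:
                r[col] = fill_value

    # 3. Remove exact duplicates, keeping the first occurrence.
    seen = set()
    unique = []
    for r in plausible:
        key = tuple(r[col] for col in columns)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```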
  3. Data Analysis

    Data analysis can be considered the heart of data science. It means using various methods and tools to analyze your data and gain meaningful insights from it.

    Before starting with the various analysis methods, it is important to list the kind of results you expect. The kind of insights you get from your data depends on the type of analysis performed. Data analysis can be of four types:

    • Descriptive Analysis -- what happened?
    • Diagnostic Analysis -- why did it happen?
    • Predictive Analysis -- what is likely to happen?
    • Prescriptive Analysis -- what should we do to make something happen?

    While all these types share a lot of similarities, each serves a different purpose and provides different insights. In our case, we will be doing the first three types.

    We’ll first look at the mean, median, and other general summary statistics:

    summary statistics of each feature of the data set

    From the above summary, we can see that the median value (50%) of age is 60, which means half of the patients in the dataset are 60 or younger and suggests that the heart failure patients recorded here are concentrated around that age.

    Similarly, the mean of sex (1 indicating male and 0 indicating female) is 0.64, which means that about 64 in every 100 people in the dataset are male. We can draw similar inferences for the other factors as well.
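A minimal sketch of how such summary statistics can be computed with Python's standard `statistics` module is shown here. The two columns are small invented samples, not the article's data. Note how the mean of a 0/1 column is simply the proportion of 1s.

```python
from statistics import mean, median

# Hypothetical columns (values invented for illustration).
columns = {
    "age": [75, 55, 65, 50, 60, 60, 45],
    "sex": [1, 1, 0, 1, 0, 1, 1],  # 1 = male, 0 = female
}


def describe(values):
    """Return a few descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
    }


summary = {name: describe(values) for name, values in columns.items()}

for name, stats in summary.items():
    print(name, stats)
```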

    General descriptive analysis can be:

    Pie chart of sex in the dataset
    Pie chart of smoking habit in the dataset

    Diagnostic analysis:

    Histogram of patients’ age versus death
    Probability distribution of creatinine phosphokinase

    These are a few examples of the analysis we could do. Many more comparisons and patterns can be derived from the data.
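As one example of a diagnostic comparison, the sketch below groups patients by age decade and computes the share who died in each group, which is roughly the information an age-versus-death histogram conveys. The `(age, death_event)` pairs are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (age, death_event) pairs; 1 means the patient died.
patients = [
    (45, 0), (52, 0), (58, 1), (61, 0), (63, 1),
    (67, 1), (72, 1), (75, 0), (80, 1),
]


def death_rate_by_decade(records):
    """Group patients by age decade and return the death rate per group."""
    totals = defaultdict(int)
    deaths = defaultdict(int)
    for age, died in records:
        decade = (age // 10) * 10
        totals[decade] += 1
        deaths[decade] += died
    return {decade: deaths[decade] / totals[decade] for decade in sorted(totals)}


rates = death_rate_by_decade(patients)
for decade, rate in rates.items():
    print(f"{decade}-{decade + 9}: {rate:.2f}")
```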

  4. Modelling/Prediction

    This is the most fascinating part of a data science problem. The data analysis step, which gives us valuable insights into the data, is indeed very important. But in most cases, it is not enough.

    For example, say you collected historical data about the heart failures that occurred in a set of patients over the past 10 years. You did a very detailed analysis of all the causes of failure and described in detail the impact of each factor on the failure of the heart. But now what? How is your analysis worth the effort if it cannot predict and help prevent the next heart failure?

    That is how important prediction is. You look at all the patterns and trends in the data, take into consideration all the factors that have a negative impact on the functioning of the heart and the level of influence each factor might have on heart failure, and form a mathematical or statistical model that accounts for all of them. We then use this model to predict whether there is a chance of failure or not.
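To make the idea of a statistical model concrete, here is a minimal sketch of logistic regression trained by gradient descent in plain Python. It is an assumed stand-in chosen for illustration, not the model used in this article, and the training data (age and ejection fraction, both divided by 100) is invented.

```python
import math

# Hypothetical training data: (age / 100, ejection_fraction / 100) -> death_event.
features = [
    (0.45, 0.60), (0.50, 0.55), (0.55, 0.50), (0.60, 0.38),
    (0.65, 0.30), (0.70, 0.25), (0.75, 0.20), (0.80, 0.20),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Estimated probability of heart failure for one patient."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train(xs, ys, learning_rate=1.0, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(xs, ys):
            error = predict_proba(x, weights, bias) - target
            for j, value in enumerate(x):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


weights, bias = train(features, labels)

older_low_ef = predict_proba((0.80, 0.20), weights, bias)
younger_high_ef = predict_proba((0.45, 0.60), weights, bias)
print("older patient, low ejection fraction:", round(older_low_ef, 3))
print("younger patient, high ejection fraction:", round(younger_high_ef, 3))
```

The learned weights express the level of influence each factor has on the prediction, which is the role the model plays in the paragraph above.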

  5. Evaluation and Fine Tuning

    Well, to be very frank, no one gets the model right the very first time. Only through multiple rounds of evaluation and improvement can one refine the model. Evaluation generally involves two aspects:

    1. Methods used

      Two methods are used for evaluation:

      • The first one is the holdout method, in which you separate the data into three sets:
        • Training data - data used to train the model
        • Validation data - data used to provide an unbiased evaluation of the model while fine-tuning the model parameters
        • Testing data - data used to evaluate the final model
      • The second method is called cross-validation. In this method, the data is randomly split into n subsets of equal size. Then:
        • In the first round, the first subset is used as the validation set and the remaining (n-1) subsets are used as training data.
        • Then, the second subset is used as the validation set and the remaining subsets as training data. We repeat this process n times, and the mean of the accuracies is taken as the result.

      The first method is generally used when the dataset is large enough, while the second method is used when only limited data is available.
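Both evaluation methods can be sketched in a few lines of plain Python. The split fractions, the fold count, and the `dummy_evaluate` function are assumptions for illustration; in a real project `evaluate` would train a model on the training records and return its accuracy on the validation records.

```python
import random


def holdout_split(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the records and split them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


def k_fold_splits(records, n_folds=5, seed=0):
    """Yield (training, validation) pairs; each fold is the validation set once."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, folds[i]


def cross_validate(records, evaluate, n_folds=5):
    """Average the score returned by evaluate(training, validation) over all folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_splits(records, n_folds)]
    return sum(scores) / len(scores)


records = list(range(20))  # stand-in for 20 patient records

train, validation, test = holdout_split(records)
print(len(train), len(validation), len(test))


def dummy_evaluate(training, validation_set):
    # Placeholder score: the share of records held out for validation.
    return len(validation_set) / (len(training) + len(validation_set))


mean_score = cross_validate(records, dummy_evaluate, n_folds=5)
print(mean_score)
```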

    2. Metrics used

      While accuracy is the most commonly used evaluation metric, it is not the right metric in all cases.

      For example, the datasets used for credit card fraud detection are very skewed. You generally have around 2,000 or 3,000 fraudulent transactions in a dataset of 1,000,000 (10 lakh) records. So even if your model labels every fraudulent transaction as genuine, with 3,000 frauds you still end up with an accuracy of 99.7% (997,000/1,000,000). Such a model is useless despite the high score. In this case you need to consider metrics such as recall and the F-measure, which take into account the fraudulent transactions that the model fails to flag.

      In our case, the sample is not so skewed, so we can go with the accuracy metric. Let’s see how accurate our model is on the first try (the k-means algorithm was used in the first try):
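The arithmetic of the fraud example can be checked with a short sketch. It assumes 3,000 fraudulent transactions among 1,000,000 records and a model that labels every transaction as genuine, then compares accuracy with recall and the F-measure (F1), treating fraud as the positive class.

```python
def evaluate(actual, predicted, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 3,000 frauds (label 1) among 1,000,000 transactions, as in the example.
actual = [1] * 3_000 + [0] * 997_000
# A model that calls every transaction genuine (label 0).
predicted = [0] * 1_000_000

scores = evaluate(actual, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Accuracy looks excellent while recall and F1 are zero, which is why accuracy alone is misleading on skewed data.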

      Accuracy of the k means algorithm

      As mentioned above, the first try is never enough. On changing the algorithm and other inputs, the results were as follows:

      Accuracy of COPOD algorithm

      We see that there is a slight increase in accuracy. Similarly, after several other optimizations, we arrived at a final accuracy of 79%, which is reasonably high:

      Final accuracy and confusion matrix of the LightGBM algorithm
  6. Data Visualization

    We all know how important visualization is. No project is complete without data visualization. A visual (an image or a graph, for instance) has a much greater impact on your audience than raw numbers. It helps you communicate with your audience and users more effectively and makes the results much easier to understand.

    Although we have listed this step as the last in the process, visualization can be used at any step.

    Some of the visualizations we did in our analysis:

    Diabetes pie chart
    Histogram of ejection fraction
    Correlation matrix
