Basic Concepts in Data Science
Author – Trupti Jadhav
Data Science, often called the most in-demand job of the century, has become a dream job for many of us. But for some it looks like a challenging maze, and they do not know where to start.
So here I will explain some basic concepts used in Data Science which are important and widely used in industry.
- Why do we use descriptive statistics?
Descriptive statistics present quantitative descriptions in a manageable form, such as charts, tables, and summary statistics like the mean, median, mode, standard deviation, and range of the data.
They provide basic information about the variables in a dataset and highlight potential relationships between them. They also facilitate data visualization, allowing the data to be presented in a meaningful and understandable way, which in turn simplifies interpretation of the dataset in question.
There are four major types of descriptive statistics:
- Measures of Frequency, such as count, percent, and frequency
- Measures of Central Tendency, such as mean, median, and mode
- Measures of Dispersion or Variation, such as range, variance, and standard deviation
- Measures of Position, such as percentile ranks and quartile ranks
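The four types above can be sketched with Python's standard `statistics` module; the sample data here is purely illustrative:

```python
import statistics as st

# Hypothetical sample of daily sales figures (illustrative data).
data = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]

# Measures of central tendency
mean = st.mean(data)
median = st.median(data)
mode = st.mode(data)

# Measures of dispersion or variation
rng = max(data) - min(data)     # range
var = st.variance(data)         # sample variance
sd = st.stdev(data)             # sample standard deviation

# Measures of position: quartiles (the 25th, 50th, and 75th percentiles)
q1, q2, q3 = st.quantiles(data, n=4)

print(f"mean={mean}, median={median}, mode={mode}, range={rng}")
```

Frequency measures (counts and percentages) follow the same pattern, e.g. with `collections.Counter`.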
- Why do we use inferential statistics?
Inferential statistics are used to estimate an unknown parameter of a population based on a sample; the estimate is thus an informed prediction for that parameter. Inferential statistics do not describe the sample itself; they describe the population from which the sample was drawn.
Inferential statistics are methods for drawing conclusions and measuring how accurate the conclusions are (based on a sample from the population).
The common methods in inferential statistics:
The most common methodologies in inferential statistics are hypothesis tests, confidence intervals, and regression analysis. Interestingly, these inferential methods can produce summary values similar to those of descriptive statistics, such as the mean and standard deviation.
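As a sketch of one such method, here is a 95% confidence interval for a population mean computed from a hypothetical sample, using Python's standard library and the normal approximation (a t-distribution would be more exact for small samples):

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

# Hypothetical sample drawn from a larger population (illustrative values).
sample = [101, 98, 105, 97, 110, 102, 99, 104, 100, 103]

n = len(sample)
xbar = mean(sample)             # sample mean: our point estimate
s = stdev(sample)               # sample standard deviation

# 95% confidence interval: point estimate +/- z * standard error
z = NormalDist().inv_cdf(0.975)         # ~1.96 for a 95% interval
margin = z * s / sqrt(n)
lo, hi = xbar - margin, xbar + margin

print(f"mean = {xbar:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The interval quantifies how accurate the estimate is: with repeated sampling, about 95% of intervals built this way would contain the true population mean.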
- What is predictive modelling?
Predictive modeling, also called predictive analytics, is a mathematical process that seeks to predict future events or outcomes by analysing patterns that are likely to forecast future results. The goal of predictive modeling is to answer this question: “Based on known past behaviour, what is most likely to happen in the future?”
The steps in predictive modelling:
- Once data has been collected, the analyst selects and trains statistical models using historical data. Although it may be tempting to think that big data makes predictive models more accurate, statistical theory shows that, after a certain point, feeding more data into a predictive analytics model does not improve accuracy. The old saying "All models are wrong, but some are useful" is often cited as a caution against relying solely on predictive models to determine future action.
- In many use cases, including weather predictions, multiple models are run simultaneously and results are aggregated to create one final prediction. This approach is known as ensemble modeling. As additional data becomes available, the statistical analysis will either be validated or revised.
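A minimal sketch of ensemble modeling, assuming three hypothetical models whose predictions are aggregated by simple averaging:

```python
# Three hypothetical models; each maps an input to a predicted value.
def model_a(x):
    return 2.0 * x + 1.0    # e.g. one linear fit

def model_b(x):
    return 1.8 * x + 2.0    # e.g. a second, slightly different fit

def model_c(x):
    return 2.2 * x + 0.5    # e.g. a third model

def ensemble_predict(x, models=(model_a, model_b, model_c)):
    """Run all models and aggregate their outputs into one final prediction."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)   # simple averaging

print(ensemble_predict(10))
```

Real ensembles often use weighted averaging or voting, but the idea is the same: combining several models tends to cancel out their individual errors.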
- What are the types of predictive models?
- Ordinary Least Squares (OLS)
- Generalized Linear Models (GLM)
- Logistic Regression
- Random Forests
- Decision Trees
- Neural Networks
- Multivariate Adaptive Regression Splines (MARS)
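As an illustration of the first technique on the list, here is a minimal Ordinary Least Squares fit for a single predictor, using the closed-form slope and intercept (the data is illustrative):

```python
# Illustrative data: y is roughly 2x plus noise.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# OLS closed form for one predictor:
# slope = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

print(f"y = {slope:.3f} * x + {intercept:.3f}")
```

The other models on the list generalize this idea: GLMs and logistic regression change the link between predictors and outcome, while trees, forests, and neural networks drop the linearity assumption entirely.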
- How does machine learning improve upon traditional OLS regression models?
As we read above, inferential statistics is a discipline that aims to understand the underlying probability distribution of a phenomenon within a specific population. In supervised machine learning, the goal instead is to generate a prediction for a new element of the population.
Inferential statistics relies on assumptions: the first step of the statistical method is to choose a model with unknown parameters for the underlying law governing the observed property. Correlations and other statistical tools then help us determine values for the parameters of this model. Hence, if your assumptions about the data are wrong, the computed parameters will make no sense and your model will never fit the data with enough accuracy.
But there are an infinite number of possible families of distributions and no recipe to come up with the good one. Usually, we conduct a descriptive analysis to identify the shape of the distribution of our data. But what if the data has more than two features? How do we visualize this data to make a model proposition? What if we cannot identify the specific shape of the model? What if the subtle difference between two families of models cannot be distinguished by the human eye? In fact, the stage of modeling is the most difficult part of the inferential statistics methodology.
This is also what we do in machine learning when we decide that the relationship in our data is linear and then run a linear regression. But ML amounts to more than this. Learning methods enable us to identify subtle correlations in datasets where exploratory analysis could not properly determine the shape of the underlying model. We do not want to give an explicit formula for the distribution of our data; rather, we want the algorithm to figure out the pattern on its own, directly from the data. Learning methods thus let us shed the assumptions attached to the statistical methodology.
Hence machine learning algorithms can achieve better predictive accuracy than ordinary regression, because they do not depend on a correct prior choice of the underlying distribution of the data.
- What is multivariate statistics, and how does it work in data science?
Multivariate analysis deals with the statistical analysis of data collected on more than one variable at a time. Multivariate techniques are popular because they help organizations turn data into knowledge and thereby improve their decision making.
Most of the Multivariate analysis techniques are extensions of univariate (analysis of single variable) and bivariate analysis (techniques used to analyse two variables). Other multivariate techniques are solely created to deal with multivariate problems, such as discriminant analysis or factor analysis, etc.
To be considered truly multivariate, all the variables must be random and interrelated in such a way that their different effects cannot meaningfully be interpreted separately.
The building block of the multivariate analysis is the variate. It is defined as the weighted sum of the variables, where the weights are defined by the multivariate techniques. The variate of n weighted variables (X1 to Xn) can be written as :
Variate = X1*W1 + X2*W2 + X3*W3 + … + Xn*Wn
where X1, X2…Xn are the observed variables and
W1, W2, W3…Wn are the weights.
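The formula above can be written directly as code; the observed values and weights here are illustrative (in practice the weights are determined by the multivariate technique):

```python
def variate(X, W):
    """Weighted sum of observed variables: X1*W1 + X2*W2 + ... + Xn*Wn."""
    assert len(X) == len(W), "need one weight per variable"
    return sum(x * w for x, w in zip(X, W))

X = [1.2, 3.4, 0.5]     # observed variables X1..X3 (illustrative)
W = [0.5, 0.2, 0.3]     # weights W1..W3 (illustrative)

print(variate(X, W))    # 0.5*1.2 + 0.2*3.4 + 0.3*0.5
```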
But what are these variates used for?
These variates capture the multivariate features of the analysis, thus in each technique, the variate acts as the focal point of the analysis.
For example, in multiple regression, the variate is determined in such a manner that the correlation between the dependent variable and the independent variables is maximum.
The multivariate analysis involves dealing with data that has multiple variables and so the entries in these variables might have different scales. Hence ‘measurement’ of the data becomes essential. Measurement is important in accurately representing the concept of interest and is instrumental in the selection of the appropriate multivariate technique.
Data can be classified into two categories:
1. Non-metric (qualitative) – non-metric measurements are made on nominal or ordinal scales.
2. Metric (quantitative) – metric measurements are made on interval or ratio scales.
Determining whether each variable is non-metric or metric is essential, as it can change the whole analysis. This identification must be done by a human, because to a computer everything is just numbers.
In data science, when you are dealing with a large number of predictors and complex data on different measurement scales, you should perform multivariate analysis. Its main advantage is that, because it considers more than one independent variable influencing the variability of the dependent variable, the conclusions drawn are more accurate, more realistic, and closer to real-life situations.
Trupti Jadhav is a Data Scientist with a postgraduate degree and an M.Phil. in Statistics. She has worked with Bank of America, Absolutdata, SAS Global Services, IDeaS (SAS), and First Indian Corporation to help resolve business problems with data science.
She has vast experience in Clustering, Prediction, Segmentation, Recommendation Engines, Natural Language Processing, Sentiment Analysis, Warranty Analytics, Risk Scorecards, Machine Learning and Artificial Intelligence (ML, DL & AI), etc. She is currently working as a Data Scientist at IBM India Pvt Ltd in the 'Cognitive Business and Decision Science and Advanced Analytics' department.
In addition to Data Science, her area of expertise and interest along with certifications include SAS, PMP, ITIL v3 Expert, Lean Six Sigma Green Belt and Black Belt.
Trupti has proven ability to deliver client-facing data science projects across domains like Banking & Financial Services, Insurance, Telecom, Pharmaceutical, Media, Retail, Automobiles, Healthcare, Logistics, Energy and Utility, Hospitality, Mortgage, Weather while leading teams in customer-facing roles for projects across the globe.