- What is Data Science?
Ans: Data science is the discipline of using data to understand different phenomena. It includes business understanding, statistics, data analytics, programming, and machine learning or deep learning algorithms. The Venn diagram below shows how these areas overlap:
- What is Machine Learning?
Ans: Machine learning consists of a group of statistical models designed to train a machine to do a specific task without explicit instructions. Based on the kind of dataset and the type of data (continuous or discrete), it can be further divided into three types:
- Supervised Machine Learning:
In this type we train the machine on the existing inputs and outputs (labels) of a dataset. It can be further divided into two types: regression (numeric labels) and classification (categorical labels).
Ex- Suppose I want to predict today’s gold price based on past data. This is a regression problem, because the label is a number.
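The gold price example can be sketched in a few lines. This is a minimal illustration, assuming made-up prices and a single input feature (the day index); the data and the helper name fit_line are hypothetical and not taken from any real dataset.

```python
# Hypothetical past data: day index -> gold price (made-up numbers).
days = [1, 2, 3, 4, 5]
prices = [100.0, 102.0, 104.0, 106.0, 108.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": learn the relationship from inputs (days) and labels (prices).
slope, intercept = fit_line(days, prices)

# "Prediction": estimate the price for the next, unseen day.
predicted_day_6 = slope * 6 + intercept
print(round(predicted_day_6, 2))
```

In practice a library such as scikit-learn would usually replace the hand-written fit. For a classification task the label would be a category (for example "price goes up" or "price goes down") instead of a number.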
- Unsupervised Machine Learning:
In this type we try to find hidden patterns, trends, or relationships in the dataset. It is called unsupervised because the data has no labels or output values. It can be further divided into many types, such as clustering, dimensionality reduction, association, and anomaly detection.
Ex- Suppose you have a dataset of 1000 orders and you want to find the relationships or associations between these orders to design a recommendation engine.
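A rough sketch of the order example: count which items appear together in the same order, then recommend the most frequent companion. The orders and the helper name recommend are hypothetical, and simple pair counting stands in for a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each order is the set of items bought together.
orders = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count how often each pair of items appears in the same order.
# No labels are involved: the pattern comes from the data alone.
pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1


def recommend(item, top_n=1):
    """Return the items most often bought together with `item`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]


print(recommend("bread"))
```

With these four orders, bread and butter appear together three times, so butter is the item recommended for bread.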
- Reinforcement Machine Learning:
In this type we train the machine by giving it rewards or punishments based on its actions or outputs. You can think of it as a human who learns new things from their environment: when they do good things we reward them, and when they do bad things we punish them.
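The reward-and-punishment idea can be sketched with an epsilon-greedy agent choosing between two actions (a simple "multi-armed bandit" setup). The environment, action names, and reward values here are assumptions made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical environment: each action always yields a fixed reward.
# A positive value is a reward, a negative value is a punishment.
REWARDS = {"good_action": 1.0, "bad_action": -1.0}

values = {action: 0.0 for action in REWARDS}  # estimated value per action
counts = {action: 0 for action in REWARDS}    # times each action was tried
EPSILON = 0.1                                 # probability of exploring


def update(action):
    """Take an action, observe its reward, and update the running average."""
    reward = REWARDS[action]
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]


# Try every action once so each estimate is based on real feedback.
for action in REWARDS:
    update(action)

for _ in range(100):
    if random.random() < EPSILON:
        action = random.choice(list(REWARDS))  # explore a random action
    else:
        action = max(values, key=values.get)   # exploit the best estimate
    update(action)

best_action = max(values, key=values.get)
print(best_action, values)
```

The agent is never told which action is correct. It only sees the reward after acting, and its value estimates steer it toward the rewarded action.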
- What are the different subtypes of machine learning algorithms?
- Explain the different steps to design a machine learning model.
- Why are data manipulation and cleaning so important, and what are the different ways to do data cleaning?
Ans: It is often said that a data scientist spends around 60% of their time on data cleaning. This is because the model is designed on data, the results come from data, and, most importantly, we can’t do any analysis without data. So data cleaning becomes crucial for a data analyst or a data scientist: if the data itself is inaccurate or incomplete, any analysis built on it makes no sense.
Ex- Suppose there is an employee dataset of an MNC where the salary currency depends on the region or location of the employee. This means Indian salaries are in INR and US salaries are in USD ($), so the data is not standardized. Now let’s also assume a few of the rows are empty.
In this scenario we can’t analyze the dataset as it is, because it has missing values and the data is not standardized.
So, data cleaning is basically the process of manipulating, transforming, and standardizing data so that it can be analyzed. The most important things to do in data cleaning are mentioned below:
- Handling missing values
- Removing null values
- Removing outliers
- Removing skewness
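The employee salary example can be sketched as follows. The records, the exchange rates, and the helper name clean are all hypothetical; the rates in particular are illustrative round numbers, not real market rates.

```python
# Hypothetical employee records; salaries are in the local currency.
employees = [
    {"name": "Asha", "region": "India", "salary": 830000.0, "currency": "INR"},
    {"name": "Bob", "region": "US", "salary": 90000.0, "currency": "USD"},
    {"name": "Chen", "region": "US", "salary": None, "currency": "USD"},
]

# Assumed, illustrative exchange rates to USD (not real market rates).
RATE_TO_USD = {"USD": 1.0, "INR": 0.01}


def clean(records):
    """Drop rows with a missing salary and express every salary in USD."""
    cleaned = []
    for row in records:
        if row["salary"] is None:
            continue  # missing value: skip the row (imputation is an alternative)
        salary_usd = row["salary"] * RATE_TO_USD[row["currency"]]
        cleaned.append({
            "name": row["name"],
            "region": row["region"],
            "salary_usd": round(salary_usd, 2),
        })
    return cleaned


cleaned = clean(employees)
print(cleaned)
```

After cleaning, every remaining row has a salary in the same currency, so values can be compared or averaged across regions.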
- How to handle missing values in data cleaning?
Ans: Handling missing values is one of the most critical parts of machine learning. There are different ways to deal with missing values, which are listed below:
- Imputation: Imputation, as the name suggests, fills the missing place with a value derived from other data in the dataset. It can be further divided into different types as follows:
1.1. Mean/Median/Mode Imputation: In this we replace the missing value with the mean, median, or mode of the column. You can do this directly with the sklearn.impute module (for example, its SimpleImputer class).
1.2. Regression Imputation: This is more of a prediction method, where we replace the missing value with the value predicted by a regression line fitted on the rows that are not missing.
- Removing the column with missing values: This is very uncommon; we do it only if a large share of the data (for example, 80%) is missing from a particular column. Even then, it is not generally recommended.
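Both imputation styles can be sketched with the Python standard library. The columns and numbers are hypothetical. In scikit-learn, SimpleImputer covers the first style through its strategy parameter ("mean", "median", "most_frequent"); this sketch avoids the dependency to stay self-contained.

```python
import statistics

# Hypothetical column with missing entries marked as None.
ages = [25, 30, None, 30, 40, None]
observed = [v for v in ages if v is not None]


def impute(values, fill):
    """Replace every missing entry (None) with the given fill value."""
    return [fill if v is None else v for v in values]


# Mean / median / mode imputation.
mean_filled = impute(ages, statistics.mean(observed))
median_filled = impute(ages, statistics.median(observed))
mode_filled = impute(ages, statistics.mode(observed))

# Regression imputation: predict a missing salary from years of experience.
# Rows are hypothetical (experience, salary) pairs.
rows = [(1, 30.0), (2, 40.0), (3, None), (4, 60.0), (5, 70.0)]
known = [(x, y) for x, y in rows if y is not None]

mean_x = statistics.mean([x for x, _ in known])
mean_y = statistics.mean([y for _, y in known])
# Least-squares line fitted only on the rows where salary is known.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in known)
         / sum((x - mean_x) ** 2 for x, _ in known))
intercept = mean_y - slope * mean_x

regression_filled = [
    (x, y if y is not None else slope * x + intercept) for x, y in rows
]

print(mean_filled)
print(regression_filled)
```

Mean, median, and mode imputation ignore the other columns, while regression imputation uses a related column (here, experience) to estimate the missing value.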
- What are the most popular data science platforms?
Ans: To do analysis or to build a model you need a good and stable platform. The most popular data science platforms are listed below:
- Anaconda Navigator (best for beginners)
- IBM SPSS
- Azure Databricks
8. What do you understand by NLP?
Ans: NLP stands for Natural Language Processing. It is a part of Artificial Intelligence that works with human language, taking text or voice (or images of text) as input. It is one of the most widely used technologies these days. Well-known examples are:
- Speech Recognition
- Character Recognition
- Spelling Correction
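As a small illustration of spelling correction, the sketch below picks the closest word from a known vocabulary using difflib from the Python standard library. The vocabulary and the helper name correct are hypothetical, and real spell checkers use far larger dictionaries and language models.

```python
import difflib

# Hypothetical vocabulary of correctly spelled words.
VOCABULARY = ["machine", "learning", "science", "data", "model"]


def correct(word, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word


print(correct("machne"))
print(correct("lerning"))
```

The cutoff of 0.6 is the similarity threshold below which a candidate is rejected; a word with no close match is returned unchanged.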
9. What are the most important packages you might come across while building a machine learning model?
Ans: Python has a good collection of packages for machine learning; a few of them are listed below:
- Pandas (Data Manipulation)
- NumPy (Mathematical Operations)
- Scikit-learn, imported as sklearn (Machine learning algorithms and evaluation)
- SciPy (Scientific Computing)
- Matplotlib (Data visualization)
- Seaborn (Data visualization)
- TensorFlow (Neural Network)
- Keras (Neural Network)
- NLTK (Natural Language Processing)
10. What are the different data science and machine learning use cases?
Ans: Data science and machine learning are among the most popular technical concepts, used in almost all industries these days. The most famous use cases are listed below:
- Time series forecasting
- Google Assistant
- Recommendation Engines in Amazon/ Netflix
- Text Mining
- Weather Predictions