I’m providing a set of data science questions and answers. The answers are meant to be clear and brief enough for anyone to understand, whatever your field. If you would like to go further, you can watch the video explanation.
So let’s get started!
1. What is the difference between Exploratory Data Analysis (EDA) and Confirmatory Data Analysis (CDA)?
Confirmatory Data Analysis is the analysis of data to test the validity of an existing theory or hypothesis. EDA, on the other hand, is an analysis that aims to generate new knowledge by analysing data sets and looking for trends within the observations.
Read more on the differences here
2. Briefly explain what happens in these phases of data analysis:
- Knowledge Acquisition
- Study Design
- Data Collection
- Quality Control
- Anomaly Detection
- Data Imputation
Knowledge Acquisition covers the techniques for transforming knowledge from its existing form into a form that can be used in a knowledge-based system. During this phase, the rules of transformation are defined.
Study Design, also called research design, is the set of procedures and methods applied in collecting and analyzing the variables specified in a given piece of research.
Data Collection is the process of gathering information from various sources.
Quality Control, also called data quality control, involves improving the quality of the data and ensuring that it conforms to the quality requirements. It includes procedures such as error detection, duplicate handling and anomaly detection.
Anomaly Detection, also called outlier detection, is the process of identifying items in the data that do not fit the overall pattern of the data, that is, observations that look suspicious or differ significantly from the rest of the data set.
Data Imputation is the replacement of missing data with substitute values.
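To make the last two phases concrete, here is a minimal sketch using only the Python standard library. The function names, the toy readings, the median fill and the 3.5 cut-off on a median-based score are all assumptions chosen for illustration; they are one common option, not the only way to do it.

```python
from statistics import median

def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose median-based score exceeds the threshold.

    The score is 0.6745 * |x - median| / MAD, where MAD is the median
    absolute deviation. The 3.5 cut-off is a common rule of thumb.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

readings = [10, 12, None, 13, 12, 95]   # toy data with one gap and one odd value
filled = impute_missing(readings)
suspicious = flag_outliers(filled)
print(filled, suspicious)
```

The median is used for the fill because, unlike the mean, it is not pulled towards the extreme reading.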
3. Also explain briefly these other phases of data analysis:
- Data Engineering
- Dimensionality Reduction
- Feature Selection
- Method and Hyperparameter Selection
- Evaluation
- Integration of Results (Fusion)
- Decision
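Data Engineering is a broad term for the practical work of preparing data for analysis, such as building the pipelines that collect, store, clean and transform it.
Dimensionality Reduction: Here, data in a high-dimensional space is represented in a lower-dimensional space, for example using a small set of principal components.
Feature Selection is the process of selecting the subset of features that is most relevant to the analysis task and discarding redundant or uninformative ones. Keeping the features with the highest variance is one simple criterion.
Method and Hyperparameter Selection refers to choosing the analysis method (model) and the values of its hyperparameters, which are settings fixed before fitting rather than learned from the data. In Bayesian statistics, a hyperparameter is specifically a parameter of a prior distribution.
Evaluation covers the activities used to assess how well the chosen method performs, for example by measuring its error on data that was not used to fit it.
Integration of Results (Data Fusion): These are the techniques for integrating data from multiple sources into a more consistent, informative and useful form.
Decision is the action taken as a result of the insight gained from the analyzed data.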
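As a rough illustration of how hyperparameter selection and evaluation fit together, the sketch below picks the number of neighbours k for a very simple nearest-neighbour regressor by comparing errors on held-out data. The regressor, the toy data, the split and the candidate values of k are all assumptions made up for this example.

```python
def knn_predict(x0, train, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mean_squared_error(data, train, k):
    """Evaluation step: average squared error of the k-NN predictions on data."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)

# Toy data: y is roughly 2x with a small deterministic wiggle.
points = [(x, 2 * x + (1 if x % 2 else -1)) for x in range(20)]

# Hold out every third point for validation; fit on the rest.
validation = points[::3]
train = [p for i, p in enumerate(points) if i % 3 != 0]

# Hyperparameter selection: keep the k with the lowest validation error.
candidates = [1, 3, 5]
scores = {k: mean_squared_error(validation, train, k) for k in candidates}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

The key design choice is that the error is measured on points the method did not see during fitting, so the chosen k is not simply the one that memorises the training data best.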
4. What is a random variable?
A random variable is a variable whose values are the outcomes of a random experiment.
For example, when flipping a coin, the outcome could be either a head (H) or a tail (T). This outcome can therefore be represented by a random variable, say X, where X ∈ {H, T}.
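The coin example can be simulated in a few lines. This is only a sketch: the function name, the seed and the number of flips are arbitrary choices for illustration.

```python
import random

def flip_coin(rng):
    """Return one realisation of X, which takes a value in {"H", "T"}."""
    return rng.choice(["H", "T"])

rng = random.Random(0)  # fixed seed so the run is repeatable
flips = [flip_coin(rng) for _ in range(1000)]

# With a fair coin, the relative frequency of heads should be close to 0.5.
freq_heads = flips.count("H") / len(flips)
print(freq_heads)
```

Each call to `flip_coin` is one run of the random experiment, and the list `flips` is a sample of observed values of X.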
More about random experiments.
5. Explain Regression Analysis
Regression is a data analytics technique used to estimate the relationship between two sets of variables (dependent and independent). It is performed by assuming the form of the relationship and then determining the coefficients of the function.
6. Differentiate between Linear and Non-Linear Regression
In linear regression, the dependent variable is modelled as a linear function of the coefficients (for a single predictor, a straight line), while in non-linear regression the model is a non-linear function of its coefficients.
7. Explain the Least-Squares Method of Linear Regression
Least squares is a method of solving a regression problem by minimizing the sum of the squared vertical distances (residuals) between each observation and the fitted regression line.
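For a single predictor, the minimisation has a well-known closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the means. The sketch below applies it to made-up data; the function name and the numbers are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.0, 8.1, 9.9]   # toy observations, roughly y = 2x
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

Any other choice of intercept and slope would give a larger sum of squared residuals on these points.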
8. What is the Coefficient of Determination?
The coefficient of determination, R², is a measure of how well the regression model represents the data set. It gives the proportion of the variance of the dependent variable that is explained by the independent variable.
In other words, it is the ratio of the explained variation to the total variation. For a least-squares fit with an intercept, it takes a value between 0 and 1 inclusive, that is, 0 ≤ R² ≤ 1.
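A minimal sketch of the calculation follows. It uses the equivalent form 1 minus (residual variation / total variation), which for a least-squares line with an intercept equals the explained-to-total ratio. The observations and predictions are made-up values for illustration.

```python
def r_squared(ys, predictions):
    """Coefficient of determination: 1 - residual variation / total variation."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

ys = [2.2, 4.1, 6.0, 8.1, 9.9]             # toy observations
predictions = [2.18, 4.12, 6.06, 8.0, 9.94]  # illustrative fitted values
score = r_squared(ys, predictions)
print(score)
```

A model that predicts every observation exactly scores 1, while a model that always predicts the mean of the observations scores 0.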
9. What do you understand by ‘Nadaraya Method’?
The Nadaraya Method, better known as Nadaraya-Watson kernel regression, is a non-parametric technique for estimating an unknown regression function. It is suitable for situations where the data comes from a joint p.d.f., f(x, y).
The regression model for the Nadaraya method is:
Y_i = m(x_i) + e_i, for i = 1, . . . , n
where m(.) is the unknown regression function and e_i is the error term.
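Assuming the method meant here is the Nadaraya-Watson kernel estimator, m(x) is estimated as a weighted average of the observed responses, with weights that shrink as the observations get further from x. The sketch below uses a Gaussian kernel and a bandwidth of 1.0; both are illustrative assumptions, as are the function names and the toy data.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the weighting kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def nadaraya_watson(x0, xs, ys, bandwidth=1.0):
    """Estimate m(x0) as a kernel-weighted average of the observed ys."""
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2]   # toy observations
estimate = nadaraya_watson(2.0, xs, ys, bandwidth=1.0)
print(estimate)
```

A smaller bandwidth makes the estimate follow nearby points more closely, while a larger one produces a smoother curve.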
10. What is Factor Analysis?
Factor Analysis is a technique used to draw inferences about unobservable quantities (latent factors) that cannot be measured directly. Its objective is to describe the correlations between the measured variables in a data set in terms of a few underlying factors.