I’m providing a set of data science questions and answers. The answers are meant to be clear and brief enough for anyone to understand, whatever your field. However, if you would like to go further, you can watch the video explanation.

So let’s get started!

**1. What is the difference between Exploratory Data Analysis (EDA) and Confirmatory Data Analysis (CDA)?**

Confirmatory Data Analysis (CDA) analyses data to test the validity of an already existing theory or hypothesis. EDA, on the other hand, aims to generate new knowledge by analysing data sets and looking for trends within the observations.

Read more on the differences here

**2. Briefly explain what happens in these phases of data analysis:**

- **Knowledge Acquisition**
- **Study Design**
- **Data Collection**
- **Quality Control**
- **Anomaly Detection**
- **Data Imputation**

**Knowledge Acquisition** covers the techniques involved in transforming knowledge from an existing form into a form that can be used in a knowledge-based system. During this phase, the rules of transformation are defined.

**Study Design**, also called research design, is the set of procedures and methods applied in collecting and analyzing the variables specified in a given piece of research.

**Data Collection** is the process of collecting information from various sources.

**Quality Control**, also called data quality control, involves improving the quality of the data and ensuring that the data conforms to the quality requirements. Quality control involves procedures such as error detection, duplicate handling and anomaly detection.

**Anomaly Detection**, also called outlier detection, is the process of identifying items in the data that do not fit the overall nature of the data, that is, observations that look suspicious or significantly different from the rest of the data set.
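One common way to flag such observations is the interquartile range (IQR) rule. The sketch below is only an illustration: the function name `iqr_outliers`, the multiplier `k = 1.5` and the sample readings are assumptions made for this example, not part of any particular method described above.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95, 10, 12]
print(iqr_outliers(readings))  # [95]
```

Other rules (for example, z-scores) would work too; the IQR rule is used here because it is less affected by the outlier itself.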

**Data Imputation** has to do with replacing the missing data with alternative values.
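As a minimal sketch, mean imputation replaces each missing entry with the average of the observed entries. The helper name `impute_mean`, the use of `None` to mark missing values and the sample data are assumptions for illustration; other strategies (median, model-based) are equally valid.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing values.
ages = [20, None, 30, 40, None]
print(impute_mean(ages))  # [20, 30, 30, 40, 30]
```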

**3. Also briefly explain these other phases of data analysis:**

- **Data Engineering**
- **Dimensionality Reduction**
- **Feature Selection**
- **Method and Hyperparameter Selection**
- **Evaluation**
- **Integration of Results (Fusion)**
- **Decision**

**Data Engineering** is a broad term for the practical side of data analysis: building and maintaining the systems that collect, store, transform and deliver data so that it can be analysed.

**Dimensionality Reduction:** Here, high-dimensional data is represented in a lower dimension, for example by using a small set of principal components.
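To make this concrete, here is a rough sketch that reduces 2-D points to 1-D by projecting them onto the first principal component. It uses power iteration on the covariance matrix, which is just one way to find that component; the function name, the iteration count and the sample points are assumptions for this example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, 1-D scores) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each point is now described by a single number: its coordinate along v.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical 2-D points lying roughly along the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (5.0, 9.8)]
direction, scores = first_principal_component(points)
print(direction)  # roughly proportional to (1, 2)
print(scores)     # one value per point instead of two
```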

**Feature Selection** is the process of selecting the features of the data that are most useful for the analysis. One simple criterion is to keep the features that have the most variance.
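A variance-based selection can be sketched as follows. The function name `top_variance_features`, the column names and the values are hypothetical, and in practice features are usually put on a comparable scale before their variances are compared.

```python
import statistics

def top_variance_features(columns, k):
    """Return the names of the k columns with the largest sample variance."""
    ranked = sorted(columns,
                    key=lambda name: statistics.variance(columns[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical data set stored as column name -> list of values.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "constant_flag": [1, 1, 1, 1, 1],
    "score": [3, 4, 3, 4, 3],
}
print(top_variance_features(data, 2))  # ['height_cm', 'score']
```

The constant column carries no variance at all, so it is the first one to be dropped.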

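**Method and Hyperparameter Selection** refers to choosing the analysis method and its hyperparameters, that is, the settings fixed before the method is fitted to the data. In Bayesian statistics, a hyperparameter is a parameter of a prior distribution.

**Evaluation** covers the activities used to assess how well the chosen method and its results perform, for example by measuring the error on data that was not used for fitting.

**Integration of Results (Data Fusion):** These are the techniques involved in integrating data from multiple sources into a more consistent, informative and useful form.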

**Decision** is some action taken as a result of insight gained from the analyzed data

**4. What is a random variable?**

A random variable is a variable whose values are the outcomes of a random experiment.

For example, when flipping a coin, the result could be either a head (H) or a tail (T). Therefore, this outcome can be represented using a random variable, say X, where X ∈ {H, T}.
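The coin example can be simulated in a few lines. This is only a sketch: the helper `flip_coin`, the fixed seed and the number of flips are choices made for illustration, and the coin is assumed to be fair.

```python
import random

def flip_coin(rng):
    """One realisation of the random variable X, with X in {"H", "T"}."""
    return rng.choice(["H", "T"])

rng = random.Random(0)  # fixed seed so the run is repeatable
flips = [flip_coin(rng) for _ in range(1000)]
print(flips[:5])                       # the first few observed values of X
print(flips.count("H") / len(flips))   # close to 0.5 for a fair coin
```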

More about random experiments.

**5. Explain Regression Analysis**

Regression is a data analytics technique used to estimate the relationship between two sets of variables (dependent and independent). It is performed by assuming the nature of the relationship and then determining the coefficients of the function.

**6. Differentiate between Linear and Non-Linear Regression**

In linear regression, a linear relationship is assumed to exist between the dependent and independent variables, while in non-linear regression a non-linear relationship is assumed.

**7. Explain the Least-Squares Method of Linear Regression**

Least-Squares is a method of solving a regression problem by minimizing the sum of the squared distances between each observation and the fitted regression line.
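For a single independent variable, the least-squares line has a well-known closed-form solution, sketched below. The function name `fit_line` and the sample data are assumptions for this example; the "distance" being squared is the vertical gap between each observed y and the line.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated exactly from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```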

**8. What is the Coefficient of Determination?**

The Coefficient of Determination, R², is a measure of how well the regression model represents the data set. It gives the proportion of the variance in the dependent variable that is explained by the independent variable.

In other words, it is the ratio of the explained variation to the total variation. It takes a value between 0 and 1, that is, *0 ≤ R² ≤ 1*.
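A small sketch of the calculation is shown below. It uses the form 1 − (residual variation / total variation), which equals the explained-to-total ratio for a least-squares fit that includes an intercept. The function name `r_squared` and the sample values are assumptions for illustration.

```python
def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

observed = [2.0, 4.0, 6.0, 8.0]
perfect = [2.0, 4.0, 6.0, 8.0]   # a model that reproduces every observation
mean_only = [5.0] * 4            # a model that always predicts the mean

print(r_squared(observed, perfect))    # 1.0
print(r_squared(observed, mean_only))  # 0.0
```

The two extremes show why the range is closed at both ends: a perfect fit gives 1, and a model no better than the mean gives 0.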

**9. What do you understand by ‘Nadaraya Method’?**

The Nadaraya Method is a technique for estimating an unknown regression function and is suitable for situations where the data comes from a joint p.d.f. f(x, y).

The regression model for the Nadaraya method is:

$$Y_i = m(x_i) + e_i, \quad i = 1, \dots, n$$

where m(·) is the unknown regression function to be estimated.
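The text above does not spell out the estimator, so the following is a sketch of the usual Nadaraya-Watson form, in which m(x) is estimated as a kernel-weighted average of the observed y values. The Gaussian kernel, the bandwidth value, the function names and the sample data are all assumptions for this example.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the weighting kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def nadaraya_watson(x0, xs, ys, bandwidth):
    """Estimate m(x0) as a kernel-weighted average of the observed ys."""
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Hypothetical observations (x_i, y_i).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 4.2, 8.8, 16.1]
print(nadaraya_watson(2.0, xs, ys, bandwidth=0.5))
```

Observations close to `x0` receive large weights and distant ones receive small weights; a smaller bandwidth makes the estimate follow the data more closely, while a larger one smooths it.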

**10. What is Factor Analysis?**

Factor Analysis is a technique used to draw inferences about unobservable quantities that cannot be measured directly. The objective of Factor Analysis is to describe the correlation between measured variables in a data set in terms of a few underlying factors.
