Data Science Questions and Answers (Questions 1 to 10)

I’m providing a set of data science questions and answers. The answers are meant to be clear and brief enough for anyone to understand, whatever their field. However, if you would like to go further, you can watch the video explanation.

So let’s get started!

1. What is the difference between Exploratory Data Analysis (EDA) and Confirmatory Data Analysis (CDA)?

Confirmatory Data Analysis is an analysis of data to test the validity of an already existing theory or hypothesis. EDA, on the other hand, is an analysis that aims to generate new knowledge by analysing data sets and attempting to find trends within the observations.
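To make the contrast concrete, here is a minimal sketch in Python. The two groups of measurements are hypothetical values made up for illustration, and the permutation test is just one possible confirmatory method that I have chosen for the example. The exploratory step only summarises the data, while the confirmatory step tests a hypothesis stated in advance (that the two group means differ).

```python
import random
import statistics

# Hypothetical measurements for two groups (illustrative values only).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
group_b = [6.9, 7.1, 6.5, 7.4, 6.8, 7.0]

# Exploratory step: summarise the data and look for patterns.
summary = {
    "mean_a": statistics.mean(group_a),
    "mean_b": statistics.mean(group_b),
    "stdev_a": statistics.stdev(group_a),
    "stdev_b": statistics.stdev(group_b),
}


def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Fraction of random relabellings whose mean gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        gap = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    return hits / n_shuffles


# Confirmatory step: test the pre-stated hypothesis that the group means differ.
p_value = permutation_p_value(group_a, group_b)
```

A small p-value would count as evidence against the assumption that both groups come from the same distribution.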

Read more on the differences here

 

2. Briefly explain what happens in these phases of data analysis:
  • Knowledge Acquisition
  • Study Design
  • Data Collection
  • Quality Control
  • Anomaly Detection
  • Data Imputation

Knowledge Acquisition comprises the techniques involved in transforming knowledge from an existing form into a form that can be used in a knowledge-based system. During this phase, the rules of transformation are defined.

Study Design, also called research design, comprises the procedures and methods applied in collecting and analyzing the variables specified in a given piece of research.

Data Collection is the process of collecting information from various sources.

Quality Control, also called data quality control, involves improving the quality of the data and ensuring that the data conforms to the quality requirements. It involves procedures such as error detection, duplicate handling and anomaly detection.

Anomaly Detection, also called outlier detection, is the process of identifying items in the data that do not fit the overall pattern of the data, that is, observations that look suspicious or differ significantly from the rest of the data set.
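As a minimal sketch of one common way to flag such observations, the example below uses the interquartile range (IQR) rule. The rule, the multiplier of 1.5 and the sample readings are my own assumptions for illustration; many other detection methods exist.

```python
import statistics


def find_outliers_iqr(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings; 95 is far from the rest of the data.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
outliers = find_outliers_iqr(readings)
```

Whether a flagged value is a real error or a genuine but rare observation still has to be judged in the context of the study.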

Data Imputation is the process of replacing missing data with substitute values.
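A simple illustration is mean imputation, where each missing value is replaced by the mean of the observed values. This is only one of several possible strategies, and the data below is hypothetical, with `None` standing for a missing entry.

```python
import statistics


def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [4.0, None, 6.0, 8.0, None]
imputed = impute_with_mean(ages)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is one reason other strategies are often preferred in practice.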

 

3. Also briefly explain these other phases of data analysis:
  • Data Engineering
  • Dimensionality Reduction
  • Feature Selection
  • Method and Hyperparameter Selection
  • Evaluation
  • Integration of Results (Fusion)
  • Decision

Data Engineering is a broad term for the practical side of data analysis, such as collecting, storing and preparing data so that it can be analysed.

Dimensionality Reduction: Here, high-dimensional data is represented in fewer dimensions, for example by using a small set of principal components.
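As a rough sketch of the idea, the example below reduces 2-D points to 1-D by projecting them onto their first principal component. To keep it self-contained, it uses the closed-form angle of the main axis of a 2x2 covariance matrix, which only works in two dimensions. The function name and the data points are assumptions made for this example; a real analysis would normally use a linear algebra library.

```python
import math


def pca_first_component(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the direction of largest variance (valid for the 2x2 case only).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # One score per point replaces the original pair of coordinates.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores


# Hypothetical 2-D observations.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]
direction, scores = pca_first_component(points)
```

Each point is now described by a single number (its score) instead of two coordinates.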

Feature Selection is the process of selecting the subset of features in the data that is most useful for the analysis. One simple criterion is to keep the features that have the most variance.
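A variance-based selection can be sketched as follows. The feature names, the values and the threshold are all hypothetical, and variance is only one possible selection criterion.

```python
import statistics


def select_by_variance(features, threshold):
    """Return the names of features whose sample variance exceeds the threshold."""
    return [
        name
        for name, column in features.items()
        if statistics.variance(column) > threshold
    ]


# Hypothetical data set: "always_one" never changes, so it carries no information.
features = {
    "age": [23, 35, 47, 51],
    "always_one": [1, 1, 1, 1],
    "income_k": [30, 52, 41, 78],
}
selected = select_by_variance(features, threshold=0.0)
```

Note that variance depends on the scale of each feature, so features are usually brought to a comparable scale before applying a non-zero threshold.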

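Method and Hyperparameter Selection refers to choosing the analysis method (the model) and its hyperparameters, that is, settings that are fixed before fitting rather than learned from the data. In Bayesian statistics, a hyperparameter is a parameter of a prior distribution.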
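In machine learning practice, this phase often means trying several candidate settings and keeping the one with the lowest error on held-out data. The sketch below does this for the number of neighbours k in a tiny k-nearest-neighbours regressor. The model, the data and the candidate values are assumptions chosen only to illustrate the idea.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_mse(train, validation, k):
    """Mean squared error of the k-NN predictions on the validation points."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in validation]
    return sum(errors) / len(errors)


# Hypothetical training and validation data following y = x.
train = [(float(x), float(x)) for x in range(10)]
validation = [(2.4, 2.4), (5.6, 5.6), (7.5, 7.5)]

# Candidate hyperparameter values; keep the one with the lowest validation error.
candidate_ks = [1, 2, 3, 5]
best_k = min(candidate_ks, key=lambda k: validation_mse(train, validation, k))
```

The same loop works for any model and any hyperparameter, as long as the validation data is kept separate from the data used for fitting.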

Evaluation covers the activities used to assess how well the chosen method performs, for example by measuring its error on data that was not used for fitting.

Integration of Results (Data Fusion): These are the techniques involved in integrating data from multiple sources into a more consistent, informative and useful form.

Decision is an action taken as a result of insight gained from the analyzed data.

 

4. What is a random variable?

A random variable is a variable whose values are the outcomes of a random experiment.

For example, when flipping a coin, the outcome could be either a head (H) or a tail (T). This outcome can therefore be represented using a random variable, say X, where X ∈ {H, T}.
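The coin example can be simulated with a few lines of Python. The seed, the number of flips and the numeric coding (1 for a head, 0 for a tail) are my own choices for this sketch. The numeric coding reflects the fact that many textbooks define a random variable as a number assigned to each outcome.

```python
import random


def flip_coin(rng):
    """One run of the random experiment: returns 'H' or 'T'."""
    return rng.choice(["H", "T"])


rng = random.Random(42)  # fixed seed so the run is repeatable
outcomes = [flip_coin(rng) for _ in range(10_000)]

# Numeric coding of the outcome: 1 for a head, 0 for a tail.
coded = [1 if outcome == "H" else 0 for outcome in outcomes]
heads_fraction = sum(coded) / len(coded)
```

For a fair coin, the fraction of heads should be close to 0.5 when the number of flips is large.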

More about random experiments.

 

5. Explain Regression Analysis

Regression is a data analytics technique used to estimate the relationship between two sets of variables (dependent and independent). It is performed by assuming the form of the relationship and then determining the coefficients of the function.

6. Differentiate between Linear and Non-Linear Regression

In linear regression, a linear relationship is assumed to exist between the variables, while in non-linear regression a non-linear relationship is assumed.

 

7. Explain the Least-Squares Method of Linear Regression

Least-Squares is a method of solving a regression problem by minimizing the sum of the squared distances (the residuals) between each observation and the fitted regression line.
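For a single independent variable, the line that minimizes this sum has a well-known closed form: the slope is the sum of the products of the deviations of x and y from their means, divided by the sum of the squared deviations of x, and the intercept follows from the two means. The sketch below applies that formula to hypothetical data; the function name is my own.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations lying roughly on a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
```

The fitted line always passes through the point made up of the mean of x and the mean of y.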

 

8. What is the Coefficient of Determination?

The coefficient of determination, R², is a measure of how well the regression model represents the data set. It gives the proportion of the variance of the dependent variable that is explained by the independent variable.

In other words, it is the ratio of the explained variation to the total variation. For a least-squares fit with an intercept, it takes values between 0 and 1, that is, 0 ≤ R² ≤ 1.
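As a small sketch, the coefficient can be computed as one minus the ratio of the residual sum of squares to the total sum of squares, which for a least-squares fit with an intercept equals the ratio of explained to total variation. The function name and the data are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


observed = [2.0, 4.0, 6.0, 8.0]

# A model that reproduces every observation explains all of the variation.
perfect_fit = r_squared(observed, observed)

# A model that always predicts the mean explains none of it.
mean_only = r_squared(observed, [5.0, 5.0, 5.0, 5.0])
```

The two cases above mark the ends of the range: a perfect fit and a model that is no better than the mean.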

 

9. What do you understand by ‘Nadaraya Method’?

The Nadaraya Method (commonly known as the Nadaraya-Watson estimator) is a non-parametric technique for estimating an unknown regression function. It is suitable for situations where the data comes from a joint p.d.f. f(x, y).

The regression model for the Nadaraya method is:

Y_i = m(x_i) + e_i,  for i = 1, . . . , n

where m(.) is the unknown regression function and e_i is the error term.
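A minimal sketch of the usual Nadaraya-Watson estimate of m(x) is shown below: a weighted average of the observed y values, where observations with x_i close to x get larger weights. The Gaussian kernel, the bandwidth value and the data are assumptions made for this example and are not taken from the original answer.

```python
import math


def gaussian_kernel(u):
    """Unnormalised Gaussian kernel; the constant cancels in the weighted average."""
    return math.exp(-0.5 * u * u)


def nadaraya_watson(x, xs, ys, bandwidth):
    """Kernel-weighted average of ys, with weights based on the distance of each xs value to x."""
    weights = [gaussian_kernel((x - xi) / bandwidth) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Hypothetical observations (x_i, y_i).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]
estimate = nadaraya_watson(2.0, xs, ys, bandwidth=0.5)
```

A smaller bandwidth makes the estimate follow the nearest observations more closely, while a larger one produces a smoother curve.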

 

10. What is Factor Analysis?

Factor Analysis is a technique used to draw inferences about unobservable quantities that cannot be measured directly. Its objective is to describe the correlations between measured variables in a data set in terms of a few underlying factors.

 

kindsonthegenius

Kindson Munonye is currently completing his doctoral program in Software Engineering at the Budapest University of Technology and Economics.
