
Databricks-Certified-Professional-Data-Scientist Databricks Certified Professional Data Scientist Exam Questions and Answers

Questions 4

RMSE is a useful metric for evaluating which types of models?

Options:

A.

Logistic regression

B.

Naive Bayes classifier

C.

Linear regression

D.

All of the above

Questions 5

Select the correct option from the below

Options:

A.

If you're trying to predict or forecast a target value, then you need to look into supervised learning.

B.

If you've chosen supervised learning with a discrete target value like Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black, then look into classification.

C.

If the target value can take on a number of values, say any value from 0.00 to 100.00, or -999 to 999, or +∞ to -∞, then you need to look into unsupervised learning.

D.

If you're not trying to predict a target value, then you need to look into unsupervised learning.

E.

Are you trying to fit your data into some discrete groups? If so, and that's all you need, you should look into clustering.

Questions 6

Let A denote the event 'student is female' and let B denote the event 'student is French'. In a class of 100 students, suppose 60 are French, and suppose that 10 of the French students are female. Find the probability that if I pick a French student, it will be a girl; that is, find P(A|B).

Options:

A.

1/3

B.

2/3

C.

1/6

D.

2/6
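
The conditional probability in this question follows from the definition P(A|B) = P(A and B) / P(B). A minimal sketch of the arithmetic, using only the counts stated in the question (the variable names are illustrative assumptions):

```python
from fractions import Fraction

# Counts taken from the question text.
total_students = 100
french = 60
french_and_female = 10

# Definition of conditional probability: P(A|B) = P(A and B) / P(B)
p_b = Fraction(french, total_students)
p_a_and_b = Fraction(french_and_female, total_students)
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # 1/6
```

The class size cancels out, so the result is simply the share of French students who are female.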

Questions 7

While working with Netflix, the movie rating website, you have developed a recommender system whose rating predictions are consistently exactly 1 higher than the ratings given in the dataset for every user-item pair. There are n items in the dataset. What will be the calculated RMSE of your recommender system on the dataset?

Options:

A.

1

B.

2

C.

0

D.

n/2
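
RMSE is the square root of the mean of the squared errors, so a constant offset gives the same value regardless of how many items there are. A small sketch with made-up ratings (the `rmse` helper and the sample data are assumptions for illustration, not part of the question):

```python
import math

def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ratings; every prediction is exactly 1 higher than the truth.
actual = [3, 5, 1, 4, 2]
predicted = [r + 1 for r in actual]

print(rmse(predicted, actual))  # 1.0
```

Every squared error is 1, so the mean is 1 and its square root is 1, whatever the value of n.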

Questions 8

A data scientist wants to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate method for this project?

Options:

A.

Linear regression

B.

K-means clustering

C.

Logistic regression

D.

Apriori algorithm

Questions 9

If E1 and E2 are two events, how do you represent the conditional probability that E2 occurs given that E1 has occurred?

Options:

A.

P(E1)/P(E2)

B.

P(E1+E2)/P(E1)

C.

P(E2)/P(E1)

D.

P(E2)/P(E1+E2)

Questions 10

You are working in an e-commerce organization, designing and evaluating a recommender system. Which of the following metrics will always have the largest value?

Options:

A.

Root Mean Square Error

B.

Sum of Errors

C.

Mean Absolute Error

D.

Both 1 and 2

E.

Information is not good enough.
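
To compare the three metrics named in the options, it can help to compute them side by side. The sketch below uses two made-up error vectors (assumptions, not exam data); `error_metrics` is a hypothetical helper. RMSE is never smaller than MAE, while the sum of signed errors can land above or below both depending on the data.

```python
import math

def error_metrics(errors):
    """Return (sum of errors, mean absolute error, RMSE) for signed errors."""
    n = len(errors)
    total = sum(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return total, mae, rmse

# Errors with mixed signs cancel in the plain sum.
mixed = error_metrics([1, -1, 2, -2])
# Errors that all share a sign accumulate in the plain sum.
same_sign = error_metrics([3, 3, 3, 3])

print(mixed)
print(same_sign)
```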

Questions 11

Select the correct objectives of principal component analysis

Options:

A.

To reduce the dimensionality of the data set

B.

To identify new meaningful underlying variables

C.

To discover the dimensionality of the data set

D.

Only 1 and 2

E.

All 1, 2 and 3

Questions 12

Suppose that the probability that a pedestrian will be hit by a car while crossing the road at a pedestrian crossing without paying attention to the traffic light is to be computed. Let H be a discrete random variable taking one value from (Hit, Not Hit). Let L be a discrete random variable taking one value from (Red, Yellow, Green).

Realistically, H will be dependent on L. That is, P(H = Hit) and P(H = Not Hit) will take different values depending on whether L is red, yellow or green. A person is, for example, far more likely to be hit by a car when trying to cross while the lights for cross traffic are green than if they are red. In other words, for any given possible pair of values for H and L, one must consider the joint probability distribution of H and L to find the probability of that pair of events occurring together if the pedestrian ignores the state of the light.

Here is a table showing the conditional probabilities of being hit, depending on the state of the lights. (Note that the columns in this table must add up to 1 because the probability of being hit or not hit is 1 regardless of the state of the light.)

[Exhibit image for Question 12 not reproduced]

Options:

A.

The marginal probability P(H=Hit) is the sum along the H=Hit row of this joint distribution table, as this is the probability of being hit when the lights are red OR yellow OR green.

B.

The marginal probability P(H=Not Hit) is the sum of the H=Not Hit row.

C.

The marginal probability P(H=Not Hit) is the sum of the H=Hit row.

Questions 13

In which of the following scenarios should you apply Bayes' theorem?

Options:

A.

The sample space is partitioned into a set of mutually exclusive events {A1, A2, ..., An}.

B.

Within the sample space, there exists an event B, for which P(B) > 0.

C.

The analytical goal is to compute a conditional probability of the form: P(Ak | B ).

D.

In all above cases

Questions 14

Select the correct statement which applies to K-Nearest Neighbors

Options:

A.

No Assumption about the data

B.

Computationally expensive

C.

Require less memory

D.

Works with Numeric Values

Questions 15

RMSE is a good measure of accuracy, but only to compare forecasting errors of different models for a______, as it is scale-dependent.

Options:

A.

Between Variables

B.

Particular Variable

C.

Among all the variables

D.

All of the above are correct

Questions 16

Assume some output variable "y" is a linear combination of some independent input variables "X" plus some independent noise "e". The way the independent variables are combined is defined by a parameter vector B: y = XB + e, where X is an m x n matrix, B is a vector of n unknowns, and y is a vector of m values. Assuming that m is not equal to n and the columns of X are linearly independent, which expression correctly solves for B?

[Exhibit image for Question 16 not reproduced]

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D
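
The option expressions are only available as images, so they cannot be checked here. As background, writing the design matrix as X and the outputs as y, the standard least-squares solution when the columns of X are linearly independent is given by the normal equations, B = (X^T X)^(-1) X^T y. The sketch below is a pure-Python illustration restricted to a two-column X; the function name and the sample data are assumptions.

```python
def solve_normal_equations(X, y):
    """Least-squares B = (X^T X)^(-1) X^T y for a design matrix with 2 columns."""
    # Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]].
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    # Entries of the vector X^T y = [u, v].
    u = sum(row[0] * yi for row, yi in zip(X, y))
    v = sum(row[1] * yi for row, yi in zip(X, y))
    # Closed-form inverse of a 2x2 matrix applied to [u, v].
    det = a * d - b * b
    return [(d * u - b * v) / det, (a * v - b * u) / det]

# Hypothetical noise-free data generated from y = 1 + 2x (m = 4 rows, n = 2 columns).
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]

B = solve_normal_equations(X, y)
print(B)  # [1.0, 2.0]
```

With more rows than columns, X itself has no inverse, which is why the solution goes through the square matrix X^T X.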

Questions 17

What is the best way to evaluate the quality of the model found by an unsupervised algorithm like k-means clustering, given metrics for the cost of the clustering (how well it fits the data) and its stability (how similar the clusters are across multiple runs over the same data)?

Options:

A.

The lowest cost clustering subject to a stability constraint

B.

The lowest cost clustering

C.

The most stable clustering subject to a minimal cost constraint

D.

The most stable clustering

Questions 18

Suppose a man told you he had a nice conversation with someone on the train. Not knowing anything about this conversation, the probability that he was speaking to a woman is 50% (assuming the train had an equal number of men and women and the speaker was as likely to strike up a conversation with a man as with a woman). Now suppose he also told you that his conversational partner had long hair. It is now more likely he was speaking to a woman, since women are more likely to have long hair than men. ____________ can be used to calculate the probability that the person was a woman.

Options:

A.

SVM

B.

MLE

C.

Bayes' theorem

D.

Logistic Regression
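
The question gives the 50% prior but no numbers for how common long hair is, so the update cannot be computed from the text alone. The sketch below applies Bayes' theorem with ASSUMED likelihoods (75% of women and 15% of men have long hair), purely to illustrate how the prior is revised; `posterior` is a hypothetical helper.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_complement):
    """Bayes' theorem for a binary hypothesis H given evidence E.

    prior: P(H); likelihood: P(E|H); likelihood_complement: P(E|not H).
    """
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence

p_woman = Fraction(1, 2)              # prior stated in the question
p_long_given_woman = Fraction(3, 4)   # ASSUMED, not given in the question
p_long_given_man = Fraction(3, 20)    # ASSUMED, not given in the question

p_woman_given_long = posterior(p_woman, p_long_given_woman, p_long_given_man)
print(p_woman_given_long)  # 5/6
```

Under these assumed likelihoods the probability rises from 1/2 to 5/6; different assumptions would give a different posterior.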

Questions 19

What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?

Options:

A.

Expected value

B.

Variance

C.

Linear regression

D.

Quantiles

Questions 20

Select the correct statement which applies to logistic regression

Options:

A.

Computationally inexpensive, easy to implement, knowledge representation easy to interpret

B.

May have low accuracy

C.

Works with Numeric values

Questions 21

Scenario: Suppose that Bob can decide to go to work by one of three modes of transportation: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.

Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. Which of the following methods will the boss use to estimate the probability that Bob drove to work?

Options:

A.

Naive Bayes

B.

Linear regression

C.

Random decision forests

D.

None of the above
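
The estimate described in the scenario can be worked through with Bayes' theorem, using the lateness probabilities and the uniform prior stated above. This is a sketch of the arithmetic only; the dictionary names are illustrative assumptions.

```python
from fractions import Fraction

# Prior P(mode) and likelihood P(late | mode), both taken from the scenario.
priors = {"car": Fraction(1, 3), "bus": Fraction(1, 3), "train": Fraction(1, 3)}
p_late = {"car": Fraction(1, 2), "bus": Fraction(1, 5), "train": Fraction(1, 100)}

# Total probability of being late, summed over all modes.
evidence = sum(priors[m] * p_late[m] for m in priors)

# Posterior P(mode | late) for each mode.
posteriors = {m: priors[m] * p_late[m] / evidence for m in priors}

print(posteriors["car"])         # 50/71
print(float(posteriors["car"]))  # about 0.704
```

Because the prior is uniform, it cancels, and the posterior for the car is 0.5 / (0.5 + 0.2 + 0.01).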

Questions 22

Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively and whether or not the candidate is an incumbent.

Above is an example of

Options:

A.

Linear Regression

B.

Logistic Regression

C.

Recommendation system

D.

Maximum likelihood estimation

E.

Hierarchical linear models

Questions 23

Which of the following is not a correct application of classification?

Options:

A.

credit scoring

B.

tumor detection

C.

image recognition

D.

drug discovery

Questions 24

What is the considerable difference between L1 and L2 regularization?

Options:

A.

L1 regularization has more accuracy of the resulting model

B.

Size of the model can be much smaller in L1 regularization than that produced by L2-regularization

C.

L2-regularization can be of vital importance when the application is deployed in resource-tight environments such as cell-phones.

D.

All of the above are correct

Questions 25

Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several variables that may be......

Options:

A.

Numerical

B.

Categorical

C.

Both 1 and 2 are correct

D.

Neither 1 nor 2 is correct

Questions 26

Which of the following is a correct example of the target variable in regression (supervised learning)?

Options:

A.

Nominal values like true, false

B.

Reptile, fish, mammal, amphibian, plant, fungi

C.

Infinite number of numeric values, such as 0.100, 42.001, 1000.743..

D.

All of the above

Questions 27

Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year. Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. Which of the following will you use to calculate the probability that it will rain on the day of Marie’s wedding?

Options:

A.

Naive Bayes

B.

Logistic Regression

C.

Random Decision Forests

D.

All of the above
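
The numbers in this scenario are enough to work out the posterior probability of rain with Bayes' theorem. The sketch below assumes a 365-day year for the base rate; the variable names are illustrative.

```python
from fractions import Fraction

# Base rate: 5 rainy days per year (ASSUMING a 365-day year).
p_rain = Fraction(5, 365)

# Forecast accuracy figures taken from the question.
p_forecast_given_rain = Fraction(9, 10)
p_forecast_given_dry = Fraction(1, 10)

# Total probability that rain is forecast on a given day.
p_forecast = (p_forecast_given_rain * p_rain
              + p_forecast_given_dry * (1 - p_rain))

# Bayes' theorem: P(rain | forecast of rain).
p_rain_given_forecast = p_forecast_given_rain * p_rain / p_forecast

print(p_rain_given_forecast)  # 1/9
```

Even with a fairly accurate forecaster, the low base rate of rain keeps the posterior at roughly 11%.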

Questions 28

Projecting a multi-dimensional dataset onto which vector has the greatest variance?

Options:

A.

first principal component

B.

first eigenvector

C.

not enough information given to answer

D.

second eigenvector

E.

second principal component

Questions 29

You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?

Options:

A.

Identify additional measures to add to the analysis

B.

Remove one of the measures

C.

Decrease the number of clusters

D.

Increase the number of clusters

Questions 30

Refer to exhibit

[Exhibit image for Question 30 not reproduced]

You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made: (1) Multicollinearity is not an issue among the variables. (2) Only three variables (A, B, and C) have significant correlation with sales. You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional data. What is a way that you could try to increase the R2 of the model without artificially inflating it?

Options:

A.

Create clusters based on the data and use them as model inputs

B.

Force all 15 variables into the model as independent variables

C.

Create interaction variables based only on variables A, B, and C

D.

Break variables A, B, and C into their own univariate models

Questions 31

Which technique would you use to solve the problem statement below? "What is the probability that an individual customer will not repay the loan amount?"

Options:

A.

Classification

B.

Clustering

C.

Linear Regression

D.

Logistic Regression

E.

Hypothesis testing

Questions 32

You are building a classifier off of a very high-dimensional data set, similar to the one shown in the image, with 5000 variables (lots of columns, not that many rows). It can handle both dense and sparse input. Which technique is most suitable, and why?

[Exhibit image for Question 32 not reproduced]

Options:

A.

Logistic regression with L1 regularization, to prevent overfitting

B.

Naive Bayes, because Bayesian methods act as regularizers

C.

k-nearest neighbors, because it uses local neighborhoods to classify examples

D.

Random forest, because it is an ensemble method

Questions 33

Under which circumstance do you need to implement N-fold cross-validation after creating a regression model?

Options:

A.

The data is unformatted.

B.

There is not enough data to create a test set.

C.

There are missing values in the data.

D.

There are categorical variables in the model.

Questions 34

You have collected hundreds of parameters about thousands of websites, e.g. daily hits, average time on the website, number of unique visitors, number of returning visitors, etc. Now you have to find the most important parameters that best describe a website. Which of the following techniques will you use?

Options:

A.

PCA (Principal component analysis)

B.

Linear Regression

C.

Logistic Regression

D.

Clustering

Questions 35

What describes a true property of the logistic regression method?

Options:

A.

It handles missing values well.

B.

It works well with discrete variables that have many distinct values.

C.

It is robust with redundant variables and correlated variables.

D.

It works well with variables that affect the outcome in a discontinuous way.

Questions 36

You are working on a classification model for a book written by HadoopExam Learning Resources and have decided to build a text classification model for determining whether this book is about Hadoop or cloud computing. You have to select the proper features (feature selection). To cut down on the size of the feature space, you will use the mutual information of each word with the label of hadoop or cloud to select the 1000 best features to use as input to a Naive Bayes model. When you compare the performance of a model built with the 250 best features to a model built with the 1000 best features, you notice that the model with only 250 features performs slightly better on your test data.

What would help you choose better features for your model?

Options:

A.

Include least mutual information with other selected features as a feature selection criterion

B.

Include the number of times each of the words appears in the book in your model

C.

Decrease the size of our training data

D.

Evaluate a model that only includes the top 100 words

Questions 37

Refer to the Exhibit.

[Exhibit image for Question 37 not reproduced]

In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data?

Options:

A.

Tree A

B.

Tree B

C.

Tree C

D.

Tree D

Questions 38

Select the correct sequence of steps for developing a machine learning application

A) Analyze the input data

B) Prepare the input data

C) Collect data

D) Train the algorithm

E) Test the algorithm

F) Use it

Options:

A.

A, B, C, D, E, F

B.

C, B, A, D, E, F

C.

C, A, B, D, E, F

D.

C, B, A, D, E, F

Questions 39

Select the correct statement which applies to Principal component analysis (PCA)

Options:

A.

Is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables.

B.

Is a mathematical procedure that transforms a number of (possibly) correlated variables into a (higher) number of uncorrelated variables

C.

Increase the dimensionality of the data set.

D.

1 and 3 are correct

E.

1 and 2 are correct

Questions 40

Select the correct statement regarding the naive Bayes classification

Options:

A.

it only requires a small amount of training data to estimate the parameters

B.

Independent variables can be assumed

C.

only the variances of the variables for each class need to be determined

D.

For each class, the entire covariance matrix needs to be determined

Questions 41

A problem statement is given below.

Hospital records show that of patients suffering from a certain disease, 75% die of it. What is the probability that of 6 randomly selected patients, 4 will recover?

Which of the following models will you use to solve it?

Options:

A.

Binomial

B.

Poisson

C.

Normal

D.

Any of the above
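
The problem describes a fixed number of independent trials with two outcomes each, which is the setting of the binomial distribution. Under that reading, the probability of exactly 4 recoveries out of 6 can be sketched as follows (`binomial_pmf` is a hypothetical helper; `math.comb` needs Python 3.8 or later):

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# 75% of patients die, so a single patient recovers with probability 1/4.
p_recover = 1 - Fraction(3, 4)

p_four_recover = binomial_pmf(4, 6, p_recover)
print(p_four_recover)         # 135/4096
print(float(p_four_recover))  # about 0.033
```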

Exam Name: Databricks Certified Professional Data Scientist Exam
Last Update: May 7, 2026
Questions: 138
