
Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Questions 4

A data engineer is designing a Lakeflow Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped.

Which Lakeflow Declarative Pipelines configuration implements this requirement using Python?

Options:

A.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL")
        .expect_or_drop("valid_amount", "amount > 0")
    )

B.

@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

C.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect("valid_customer", "customer_id IS NOT NULL")
        .expect("valid_amount", "amount > 0")
    )

D.

@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

Questions 5

A data engineer is configuring a Lakeflow Declarative Pipeline to process CDC (Change Data Capture) data from a source. The source events sometimes arrive out of order, and multiple updates may occur with the same update_timestamp but with different update_sequence_id .

What should the data engineer do to ensure events are sequenced correctly?

Options:

A.

Set track_history_column_list to [event_timestamp, event_id] in AUTO CDC APIs.

B.

Use dropDuplicates() to remove out-of-order and duplicate records in LDP.

C.

Use SEQUENCE BY STRUCT(event_timestamp, update_sequence_id) in AUTO CDC APIs.

D.

Use a window function to sort update_sequence_id within the same partition, i.e., update_timestamp, in the LDP pipeline.

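For reference, a minimal sketch of how a composite sequencing key can be expressed in the Python CDC API, so that update_sequence_id breaks ties among events sharing the same update_timestamp (the SQL AUTO CDC equivalent is SEQUENCE BY STRUCT(...)). Table and column names are assumed, and the exact helper names should be checked against your DLT runtime:

import dlt
from pyspark.sql import functions as F

# Target streaming table that will hold the sequenced CDC output (assumed name).
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="cdc_events_bronze",   # assumed bronze CDC feed
    keys=["customer_id"],         # assumed primary key
    # Order events first by timestamp, then by sequence id, so ties on
    # update_timestamp are resolved deterministically.
    sequence_by=F.struct("update_timestamp", "update_sequence_id"),
    stored_as_scd_type=1,
)
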
Questions 6

A table is registered with the following code:

Databricks-Certified-Professional-Data-Engineer Question 6

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders ?

Options:

A.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

B.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

C.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

D.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

E.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Questions 7

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector for Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

Options:

A.

“Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.

B.

No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

C.

“Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

D.

“Manage” permission should be set on a secret scope containing only those credentials that will be used by a given team.

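As a point of reference, granting a team group read-only access to a scope that holds only that team's credentials might look like the following sketch using the Databricks SDK for Python (scope and group names are assumptions; the legacy CLI command databricks secrets put-acl achieves the same thing):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import workspace

w = WorkspaceClient()

# Grant the sales-engineering group READ on the scope containing only the
# external-database credential for that team (names are illustrative).
w.secrets.put_acl(
    scope="sales-team-db-creds",
    principal="sales-engineering",
    permission=workspace.AclPermission.READ,
)
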
Questions 8

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

Options:

A.

userLookup.join(streamingDF, ["userid"], how="inner")

B.

streamingDF.join(userLookup, ["user_id"], how="outer")

C.

streamingDF.join(userLookup, ["user_id"], how="left")

D.

streamingDF.join(userLookup, ["userid"], how="inner")

E.

userLookup.join(streamingDF, ["user_id"], how="right")

Questions 9

Which statement is true regarding the retention of job run history?

Options:

A.

It is retained until you export or delete job run logs

B.

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3

C.

It is retained for 60 days, during which time you can export notebook run results to HTML

D.

It is retained for 60 days, after which logs are archived

E.

It is retained for 90 days or until the run-id is re-used through custom run configuration

Questions 10

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.

The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.

Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

Options:

A.

Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.

B.

Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.

C.

Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.

D.

Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.

E.

Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.

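Purely as an illustration of the view-aliasing idea (all object and column names are hypothetical), the original table name and schema can be preserved on top of a renamed, extended table roughly like this:

# Hypothetical names: sales_agg_v2 is the new table with renamed/additional
# fields; sales_agg keeps the legacy name and schema for existing consumers.
spark.sql("""
    CREATE OR REPLACE VIEW sales_agg AS
    SELECT
        cust_id     AS customer_id,
        order_total AS total_amount
    FROM sales_agg_v2
""")
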
Questions 11

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

A.

importlib.resource path

B.

sys.path

C.

os-path

D.

pypi.path

E.

pylib.source

Questions 12

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Databricks-Certified-Professional-Data-Engineer Question 12

Which statement describes the execution and results of running the above query multiple times?

Options:

A.

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

B.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

C.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

D.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

E.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.

Questions 13

A data engineer is tasked with ensuring that a Delta table in Databricks continuously retains deleted files for 15 days (instead of the default 7 days), in order to permanently comply with the organization’s data retention policy.

Which code snippet correctly sets this retention period for deleted files?

Options:

A.

spark.sql( " ALTER TABLE my_table SET TBLPROPERTIES ( ' delta.deletedFileRetentionDuration ' = ' interval 15 days ' ) " )

B.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, " /mnt/data/my_table " )

deltaTable.deletedFileRetentionDuration = " interval 15 days "

C.

spark.sql( " VACUUM my_table RETAIN 15 HOURS " )

D.

spark.conf.set( " spark.databricks.delta.deletedFileRetentionDuration " , " 15 days " )

Questions 14

A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.

The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.

Which piece of information is critical to this decision?

Options:

A.

Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.

B.

Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.

C.

Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.

D.

Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.

Questions 15

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT (*) FROM table -

Which of the following describes how results are generated each time the dashboard is updated?

Options:

A.

The total count of rows is calculated by scanning all data files

B.

The total count of rows will be returned from cached results unless REFRESH is run

C.

The total count of records is calculated from the Delta transaction logs

D.

The total count of records is calculated from the parquet file metadata

E.

The total count of records is calculated from the Hive metastore

Questions 16

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response confirming that the job run request has been submitted successfully includes a field named run_id.

Which statement describes what the number alongside this field represents?

Options:

A.

The job_id is returned in this field.

B.

The job_id and the number of times the job has been run are concatenated and returned.

C.

The number of times the job definition has been run in the workspace.

D.

The globally unique ID of the newly triggered run.

Questions 17

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Options:

A.

Regex

B.

Julia

C.

pyspark.ml.feature

D.

Scala Datasets

E.

C++

Questions 18

A data engineer is tasked with building a nightly batch ETL pipeline that processes very large volumes of raw JSON logs from a data lake into Delta tables for reporting. The data arrives in bulk once per day, and the pipeline takes several hours to complete. Cost efficiency is important, but performance and reliability of completing the pipeline are the highest priorities.

Which type of Databricks cluster should the data engineer configure?

Options:

A.

A lightweight single-node cluster with low worker node count to reduce costs.

B.

A high-concurrency cluster designed for interactive SQL workloads.

C.

An all-purpose cluster always kept running to ensure low-latency job startup times.

D.

A job cluster configured to autoscale across multiple workers during the pipeline run.

Questions 19

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards completes in 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

Options:

A.

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.

B.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.

C.

Schedule a job to execute the pipeline once an hour on a new job cluster.

D.

Configure a job that executes every time new data lands in a given directory.

Questions 20

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in that directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Databricks-Certified-Professional-Data-Engineer Question 20

Which response correctly fills in the blank to meet the specified requirements?

Databricks-Certified-Professional-Data-Engineer Question 20

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E

Questions 21

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

A.

When a database is being created, make sure that the LOCATION keyword is used.

B.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

C.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.

D.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

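For context, an external (unmanaged) Delta table is produced by supplying an explicit storage path when the table is created or written; a minimal sketch with assumed names and paths:

# Assumed table name and cloud path; the LOCATION clause (or an explicit
# 'path' option on a DataFrame write) is what makes the table external.
spark.sql("""
    CREATE TABLE sales_external (id INT, amount DOUBLE)
    USING DELTA
    LOCATION 'abfss://container@account.dfs.core.windows.net/tables/sales_external'
""")
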
Questions 22

A company wants to implement Lakehouse Federation across multiple data sources but is concerned about data consistency and ensuring that all teams access the same authoritative version of their data.

Which statement is applicable for Lakehouse Federations to maintain data consistency?

Options:

A.

Federation provides read-only access that reflects the current state of source systems.

B.

Federation implements change data capture (CDC) from all sources.

C.

A separate data synchronization service must be deployed.

D.

Federation creates local copies that must be manually refreshed.

Questions 23

Given the following PySpark code snippet in a Databricks notebook:

filtered_df = spark.read.format("delta").load("/mnt/data/large_table") \
    .filter("event_date > '2024-01-01'")

filtered_df.count()

The data engineer notices from the Query Profiler that the scan operator for filtered_df is reading almost all files, despite the filter being applied.

What is the probable reason for poor data skipping?

Options:

A.

The Delta table lacks optimization that enables dynamic file pruning.

B.

The filter is executed only after the full data scan, preventing data skipping.

C.

The event_date column is outside the table’s partitioning and Z-ordering scheme.

D.

The filter condition involves a data type excluded from data skipping support.

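If event_date were brought into the table's layout strategy, one illustrative remedy is to cluster the existing files on that column so per-file min/max statistics become selective for the filter (path reused from the snippet above):

# Illustrative only: Z-order the table on event_date so data skipping can
# prune files for filters on that column.
spark.sql("OPTIMIZE delta.`/mnt/data/large_table` ZORDER BY (event_date)")
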
Questions 24

A senior data engineer is planning large-scale data workflows. The task is to identify the considerations that form a foundation for creating scalable data models for managing large datasets. The team has listed Delta Lake capabilities and wants to determine which feature should not be considered as a core factor.

Which key feature can be ignored while evaluating Delta Lake?

Options:

A.

Delta Lake’s ability to process data in both batch and streaming modes seamlessly, providing flexibility in ingestion and processing.

B.

Delta Lake works with various data formats (Parquet, JSON, CSV) and integrates well with Spark and Databricks tools.

C.

Delta Lake optimizes metadata handling, efficiently managing billions of files and facilitating scalability to petabyte-scale datasets.

D.

Delta Lake provides limited support for monitoring and troubleshooting data pipelines, so relevant partner tools have to be identified and set up for enhanced operational efficiency.

Questions 25

A data engineer wants to refactor the following DLT code, which includes multiple table definitions with very similar code:

Databricks-Certified-Professional-Data-Engineer Question 25

In an attempt to programmatically create these tables using a parameterized table definition, the data engineer writes the following code.

Databricks-Certified-Professional-Data-Engineer Question 25

The pipeline runs an update with this refactored code, but generates a different DAG showing incorrect configuration values for tables.

How can the data engineer fix this?

Options:

A.

Convert the list of configuration values to a dictionary of table settings, using table names as keys.

B.

Convert the list of configuration values to a dictionary of table settings, using a different input to the for loop.

C.

Load the configuration values for these tables from a separate file, located at a path provided by a pipeline parameter.

D.

Wrap the loop inside another table definition, using generalized names and properties to replace with those from the inner table

Questions 26

What is the first line of a Databricks Python notebook when viewed in a text editor?

Options:

A.

%python

B.

% Databricks notebook source

C.

-- Databricks notebook source

D.

//Databricks notebook source

Questions 27

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Databricks-Certified-Professional-Data-Engineer Question 27

Which command should be removed from the notebook before scheduling it as a job?

Options:

A.

Cmd 2

B.

Cmd 3

C.

Cmd 4

D.

Cmd 5

E.

Cmd 6

Questions 28

A data engineer, while designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock symbol group, must ensure the function is efficient and scalable.

Which approach will solve the problem with minimum overhead while preserving data integrity?

Options:

A.

Use a SCALAR_ITER Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.

B.

Use a SCALAR Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.

C.

Use applyInPandas() on a Spark DataFrame that receives all rows for each stock symbol as a Pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.

D.

Use a grouped_agg Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.

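As a rough sketch of the grouped-map pattern (the input DataFrame df and its columns symbol, trade_time, and volume are assumed), applyInPandas hands every row for one stock symbol to an ordinary pandas function, where per-group state is just a local variable:

import pandas as pd

def compute_indicator(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows for a single stock symbol; state such as a running
    # total lives in local variables scoped to this group's invocation.
    pdf = pdf.sort_values("trade_time")
    pdf["cum_volume"] = pdf["volume"].cumsum()   # example stateful calculation
    return pdf

result = (
    df.groupBy("symbol")
      .applyInPandas(
          compute_indicator,
          schema="symbol string, trade_time timestamp, volume double, cum_volume double",
      )
)
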
Questions 29

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each micro-batch of data is processed in less than 3 seconds; at least 12 times per minute, a micro-batch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution. Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

Options:

A.

Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.

B.

Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.

C.

Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to the maximum allowable threshold should minimize this cost.

D.

Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.

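For reference, a sketch of the batch-style alternative (placeholders for paths, and df standing in for the existing streaming DataFrame): scheduling the job every 10 minutes and letting each run drain whatever is available avoids a continuously running stream that polls storage with empty micro-batches. On recent runtimes, Trigger.AvailableNow supersedes the older trigger-once option:

# Illustrative only; <checkpoint-path> is a placeholder.
(df.writeStream
    .option("checkpointLocation", "<checkpoint-path>")
    .trigger(availableNow=True)   # process all available input, then stop
    .toTable("bronze_events"))
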
Questions 30

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales .

Databricks-Certified-Professional-Data-Engineer Question 30

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Options:

A.

Both commands will succeed. Executing show tables will show that countries_af and sales_af have been registered as views.

B.

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.

C.

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable representing a PySpark DataFrame.

D.

Both commands will fail. No new variables, tables, or views will be created.

E.

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable containing a list of strings.

Questions 31

Given the following error traceback (from display(df.select(3*"heartrate"))), which shows AnalysisException: cannot resolve 'heartrateheartrateheartrate', which statement describes the error being raised?

Options:

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.

Questions 32

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

df has the following schema: device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:

df.withWatermark( " event_time " , " 10 minutes " )

.groupBy(

________,

" device_id "

)

.agg(

avg( " temp " ).alias( " avg_temp " ),

avg( " humidity " ).alias( " avg_humidity " )

)

.writeStream

.format( " delta " )

.saveAsTable( " sensor_avg " )

Which line of code correctly fills in the blank within the code block to complete this task?

Options:

A.

window( " event_time " , " 5 minutes " ).alias( " time " )

B.

to_interval( " event_time " , " 5 minutes " ).alias( " time " )

C.

" event_time "

D.

lag( " event_time " , " 5 minutes " ).alias( " time " )

Questions 33

A data engineer is designing a Lakeflow Spark Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped. Which Lakeflow Spark Declarative Pipelines configuration implements this requirement using Python?

Options:

A.

@dlt.table
def silver_orders():
    return dlt.read_stream("bronze_orders") \
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL") \
        .expect_or_drop("valid_amount", "amount > 0")

B.

@dlt.table
def silver_orders():
    return dlt.read_stream("bronze_orders") \
        .expect("valid_customer", "customer_id IS NOT NULL") \
        .expect("valid_amount", "amount > 0")

C.

@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

D.

@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

Questions 34

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df . The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

" device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT "

Code block:

Databricks-Certified-Professional-Data-Engineer Question 34

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

to_interval( " event_time " , " 5 minutes " ).alias( " time " )

B.

window( " event_time " , " 5 minutes " ).alias( " time " )

C.

" event_time "

D.

window( " event_time " , " 10 minutes " ).alias( " time " )

E.

lag( " event_time " , " 10 minutes " ).alias( " time " )

Questions 35

What is true for Delta Lake?

Options:

A.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

B.

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.

C.

Z-ORDER can only be applied to numeric values stored in Delta Lake tables.

D.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Questions 36

A data engineer is using Auto Loader to read incoming JSON data as it arrives. They have configured Auto Loader to quarantine invalid JSON records but notice that over time, some records are being quarantined even though they are well-formed JSON .

The code snippet is:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("badRecordsPath", "/tmp/somewhere/badRecordsPath")
    .schema("a int, b int")
    .load("/Volumes/catalog/schema/raw_data/"))

What is the cause of the missing data?

Options:

A.

At some point, the upstream data provider switched everything to multi-line JSON.

B.

The badRecordsPath location is accumulating many small files.

C.

The source data is valid JSON but does not conform to the defined schema in some way.

D.

The engineer forgot to set the option "cloudFiles.quarantineMode" = "rescue".

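One common alternative, sketched here with assumed paths and hints, is to let Auto Loader infer and evolve the schema (pinning known columns with schema hints) rather than enforcing a narrow explicit schema that well-formed records may fail to match:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/_schemas/raw_data")  # assumed path
    .option("cloudFiles.schemaHints", "a int, b int")   # pin types for known columns
    .option("badRecordsPath", "/tmp/somewhere/badRecordsPath")
    .load("/Volumes/catalog/schema/raw_data/"))
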
Questions 37

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

Options:

A.

Use Repos to make a pull request and use the Databricks REST API to update the current branch to dev-2.3.9

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

Questions 38

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

" device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT "

Code block:

Databricks-Certified-Professional-Data-Engineer Question 38

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

withWatermark( " event_time " , " 10 minutes " )

B.

awaitArrival( " event_time " , " 10 minutes " )

C.

await( " event_time + ‘10 minutes ' " )

D.

slidingWindow( " event_time " , " 10 minutes " )

E.

delayWrite( " event_time " , " 10 minutes " )

Questions 39

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

Options:

A.

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

B.

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

C.

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.

D.

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

E.

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Questions 40

A departing platform owner currently holds ownership of multiple catalogs and controls storage credentials and external locations. A data engineer has been asked to ensure continuity: transfer catalog ownership to the platform team group, delegate ongoing privilege management, and retain the ability to receive and share data via Delta Sharing.

Which role must be in place to perform these actions across the metastore?

Options:

A.

Metastore Admin, because metastore admins can transfer ownership and manage privileges across all metastore objects, including shares and recipients.

B.

Account Admin, because account admins can only create metastores but cannot change ownership of catalogs.

C.

Workspace Admin, because workspace admins can transfer ownership of any Unity Catalog object.

D.

Catalog Owner, because catalog owners can transfer any object in any catalog in the metastore.

Questions 41

A data engineer is implementing Unity Catalog governance for a multi-team environment. Data scientists need interactive clusters for basic data exploration tasks, while automated ETL jobs require dedicated processing.

How should the data engineer configure cluster isolation policies to enforce least privilege and ensure Unity Catalog compliance?

Options:

A.

Use only DEDICATED access mode for both interactive workloads and automated jobs to maximize security isolation.

B.

Allow all users to create any cluster type and rely on manual configuration to enable Unity Catalog access modes.

C.

Configure all clusters with NO ISOLATION_SHARED access mode since Unity Catalog works with any cluster configuration.

D.

Create compute policies with STANDARD access mode for interactive workloads and DEDICATED access mode for automated jobs.

Questions 42

While reviewing a query's execution in the Databricks Query Profiler, a data engineer observes that the Top Operators panel shows a Sort operator with high Time Spent and Memory Peak metrics. The Spark UI also reports frequent data spilling.

How should the data engineer address this issue?

Options:

A.

Switch to a broadcast join to reduce memory usage.

B.

Repartition the DataFrame to a single partition before sorting.

C.

Convert the sort operation to a filter operation.

D.

Increase the number of shuffle partitions to better distribute data.

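For reference, the number of partitions produced by a shuffle or sort can be raised per session so each task handles a smaller slice and is less likely to spill (the value below is only an example; adaptive query execution may also manage this automatically on recent runtimes):

# Example only: increase shuffle parallelism for this session.
spark.conf.set("spark.sql.shuffle.partitions", "400")
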
Questions 43

A data engineer is designing an append-only pipeline that needs to handle both batch and streaming data in Delta Lake. The team wants to ensure that the streaming component can efficiently track which data has already been processed.

Which configuration should be set to enable this?

Options:

A.

overwriteSchema

B.

partitionBy

C.

checkpointLocation

D.

mergeSchema

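For context, a minimal sketch (paths and names are placeholders) showing where the checkpoint location is supplied; the checkpoint is what allows the stream to record which data has already been processed across restarts:

(spark.readStream
    .format("delta")
    .load("/mnt/source/append_only_events")   # assumed append-only source
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/append_only_events")  # progress tracking
    .outputMode("append")
    .toTable("events_silver"))
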
Questions 44

A data engineer needs to install the PyYAML Python package within an air-gapped Databricks environment . The workspace has no direct internet access to PyPI. The engineer has downloaded the .whl file locally and wants it available automatically on all new clusters.

Which approach should the data engineer use?

Options:

A.

Upload the PyYAML .whl file to the user home directory and create a cluster-scoped init script to install it.

B.

Upload the PyYAML .whl file to a Unity Catalog Volume, ensure it’s allow-listed, and create a cluster-scoped init script that installs it from that path.

C.

Set up a private PyPI repository and install via pip index URL.

D.

Add the .whl file to Databricks Git Repos and assume automatic installation.

Questions 45

Which statement regarding stream-static joins and static Delta tables is correct?

Options:

A.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

B.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.

C.

The checkpoint directory will be used to track state information for the unique keys present in the join.

D.

Stream-static joins cannot use static Delta tables because of consistency issues.

E.

The checkpoint directory will be used to track updates to the static Delta table.

Questions 46

A data engineer has configured their Databricks Asset Bundle with multiple targets in databricks.yml and deployed it to the production workspace. Now, to validate the deployment, they need to invoke a job named my_project_job specifically within the prod target context. Assuming the job is already deployed, they need to trigger its execution while ensuring the target-specific configuration is respected.

Which command will trigger the job execution?

Options:

A.

databricks execute my_project_job -e prod

B.

databricks job run my_project_job --env prod

C.

databricks run my_project_job -t prod

D.

databricks bundle run my_project_job -t prod

Questions 47

Given the following error traceback:

AnalysisException: cannot resolve 'heartrateheartrateheartrate' given input columns:
[spark_catalog.database.table.device_id, spark_catalog.database.table.heartrate,
spark_catalog.database.table.mrn, spark_catalog.database.table.time]

The code snippet was:

display(df.select(3*"heartrate"))

Which statement describes the error being raised?

Options:

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.

Questions 48

A data engineer deploys a multi-task Databricks job that orchestrates three notebooks. One task intermittently fails with Exit Code 1 but succeeds on retry. The engineer needs to collect detailed logs for the failing attempts, including stdout/stderr and cluster lifecycle context, and share them with the platform team.

What steps the data engineer needs to follow using built-in tools?

Options:

A.

Use the notebook interactive debugger to re-run the entire multi-task job, and capture step-through traces for the failing task.

B.

Download worker logs directly from the Spark UI and ignore driver logs, as worker logs contain stdout/stderr for all tasks and cluster events.

C.

Export the notebook run results to HTML; this bundle includes complete stdout, stderr, and cluster event history across all tasks.

D.

From the job run details page, export the job's logs or configure log delivery; then retrieve the compute driver logs and event logs from the compute details page to correlate stdout/stderr with cluster events.

Questions 49

A data engineering team is migrating off its legacy Hadoop platform. As part of the process, they are evaluating storage formats for performance comparison. The legacy platform uses ORC and RCFile formats. After converting a subset of data to Delta Lake, they noticed significantly better query performance. Upon investigation, they discovered that queries reading from Delta tables leveraged a Shuffle Hash Join, whereas queries on legacy formats used Sort Merge Joins. The queries reading Delta Lake data also scanned less data.

Which reason could be attributed to the difference in query performance?

Options:

A.

Delta Lake enables data skipping and file pruning using a vectorized Parquet reader.

B.

The queries against the Delta Lake tables were able to leverage the dynamic file pruning optimization.

C.

Shuffle Hash Joins are always more efficient than Sort Merge Joins.

D.

The queries against the ORC tables leveraged the dynamic data skipping optimization but not the dynamic file pruning optimization.

Questions 50

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.

If task A fails during a scheduled run, which statement describes the results of this run?

Options:

A.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

B.

Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.

C.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.

D.

Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.

E.

Tasks B and C will be skipped; task A will not commit any changes because of stage failure.

Questions 51

A data architect is designing a Databricks solution to efficiently process data for different business requirements.

In which scenario should a data engineer use a materialized view compared to a streaming table ?

Options:

A.

Implementing a CDC (Change Data Capture) pipeline that needs to detect and respond to database changes within seconds.

B.

Ingesting data from Apache Kafka topics with sub-second processing requirements for immediate alerting.

C.

Precomputing complex aggregations and joins from multiple large tables to accelerate BI dashboard performance.

D.

Processing high-volume, continuous clickstream data from a website to monitor user behavior in real-time.

Questions 52

A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline was deployed, the data engineering team noticed latency issues during certain times of the day.

A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays:

Which limitation will the team face while diagnosing this problem?

Options:

A.

New fields will not be computed for historic records.

B.

Updating the table schema will invalidate the Delta transaction log metadata.

C.

Updating the table schema requires a default value provided for each file added.

D.

Spark cannot capture the topic partition fields from the kafka source.

Questions 53

A data engineer, while designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock symbol group, must ensure the function is efficient and scalable. Which approach will solve the problem with minimum overhead while preserving data integrity?

Options:

A.

Use a scalar_iter Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.

B.

Use a scalar Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.

C.

Use applyInPandas on a Spark DataFrame so that each stock symbol group is received as a pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.

D.

Use a grouped-aggregate Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.

Questions 54

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Options:

A.

The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.

B.

The orders table will contain only the most recent 2 hours of records and no duplicates will be present.

C.

All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.

D.

Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.

Questions 55

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.

Which consideration will impact the decisions made by the engineer while migrating this workload?

Options:

A.

All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.

B.

Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.

C.

Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.

D.

Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.

Questions 56

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Databricks-Certified-Professional-Data-Engineer Question 56

Which statement describes this implementation?

Options:

A.

The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.

B.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

C.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

D.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

E.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

Questions 57

A Delta table of weather records is partitioned by date and has the below schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

Options:

A.

All records are cached to an operational database and then the filter is applied

B.

The Parquet file footers are scanned for min and max statistics for the latitude column

C.

All records are cached to attached storage and then the filter is applied

D.

The Delta log is scanned for min and max statistics for the latitude column

E.

The Hive metastore is scanned for min and max statistics for the latitude column

Questions 58

Which statement describes Delta Lake Auto Compaction?

Options:

A.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

B.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

C.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

D.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

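For reference, auto compaction and optimized writes are typically switched on per table through Delta table properties; a sketch with an assumed table name:

# Illustrative: enable optimized writes (fewer, larger files at write time)
# and auto compaction (post-write compaction of small files).
spark.sql("""
    ALTER TABLE sales_bronze SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")
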
Questions 59

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create .

Databricks-Certified-Professional-Data-Engineer Question 59

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

Options:

A.

Three new jobs named " Ingest new data " will be defined in the workspace, and they will each run once daily.

B.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.

C.

Three new jobs named " Ingest new data " will be defined in the workspace, but no jobs will be executed.

D.

One new job named " Ingest new data " will be defined in the workspace, but it will not be executed.

E.

The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.

Questions 60

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows: Note that proposed changes are in bold.

Databricks-Certified-Professional-Data-Engineer Question 60

Which step must also be completed to put the proposed query into production?

Options:

A.

Increase the shuffle partitions to account for additional aggregates

B.

Specify a new checkpointLocation

C.

Run REFRESH TABLE delta.`/item_agg`

D.

Remove .option('mergeSchema', 'true') from the streaming write

Exam Name: Databricks Certified Data Engineer Professional Exam
Last Update: Apr 30, 2026
Questions: 202
