The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching
column names and inserting null values where column names do not appear in both DataFrames. Find the error.
Sample of DataFrame transactionsDfMonday:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
Sample of DataFrame transactionsDfTuesday:
+-------+-------------+---------+-----+
|storeId|transactionId|productId|value|
+-------+-------------+---------+-----+
|     25|            1|        1|    4|
|      2|            2|        2|    7|
|      3|            4|        2| null|
|   null|            5|        2| null|
+-------+-------------+---------+-----+
Code block:
sc.union([transactionsDfMonday, transactionsDfTuesday])
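For reference, a hedged correction: sc is the SparkContext, and its union method operates on RDDs, not DataFrames. A DataFrame-level merge that matches columns by name and fills missing columns with nulls could look like the sketch below (unionByName with allowMissingColumns requires Spark 3.1+):
transactionsDfMonday.unionByName(transactionsDfTuesday, allowMissingColumns=True)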
Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?
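One plausible answer, assuming the standard DataFrameReader API:
spark.read.parquet("/FileStore/imports.parquet")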
Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?
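One plausible answer, assuming the standard PySpark API:
transactionsDf.select("storeId").distinct()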
The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater than or equal to 20 and smaller than or
equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__((__2__.__3__) __4__ (__5__))
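One plausible way to fill the blanks, assuming the standard column API (col and between):
from pyspark.sql.functions import col
transactionsDf.select((col("storeId").between(20, 30)) & (col("productId") == 2))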
Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?
Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?
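One plausible answer, chaining withColumnRenamed:
itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")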
Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?
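One plausible answer: unpersist removes a cached DataFrame from memory and disk, and blocking=True makes the removal happen immediately rather than lazily:
transactionsDf.unpersist(blocking=True)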
The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame
itemsDf is much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to
accomplish this.
__1__.__2__(__3__, __4__, __5__)
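One plausible way to fill the blanks: a semi join returns only the columns of the left DataFrame, and broadcasting the small itemsDf avoids shuffling the large transactionsDf (a sketch, not necessarily the literal answer option):
from pyspark.sql.functions import broadcast
transactionsDf.join(broadcast(itemsDf), "transactionId", "semi")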
Which of the following code blocks returns a single-column DataFrame of all entries in the Python list throughputRates, which contains only float-type values?
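One plausible answer: createDataFrame accepts a list of atomic values together with a DataType, yielding a single-column DataFrame:
from pyspark.sql.types import FloatType
spark.createDataFrame(throughputRates, FloatType())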
The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame
transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(avg('predError'))
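One plausible way to fill the blanks: sample(withReplacement, fraction) draws the random subset, and agg computes the aggregate:
from pyspark.sql.functions import avg
transactionsDf.sample(False, 0.15).agg(avg("predError"))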
In which order should the code blocks shown below be run to assign articlesDf a DataFrame that lists all items in column attributes, ordered by the number of times these items occur, from
most to least often?
Sample of DataFrame articlesDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]     |Sports Company Inc.|
+------+-----------------------------+-------------------+
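For orientation, one plausible sequence of such code blocks (a sketch of the intended logic, not the literal answer options):
from pyspark.sql.functions import col, explode
articlesDf = articlesDf.select(explode(col("attributes")))  # one row per attribute; output column is named col
articlesDf = articlesDf.groupBy("col").count()              # count occurrences per attribute
articlesDf = articlesDf.sort("count", ascending=False)      # most frequent first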
Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms where column season is of data type string and column wind_speed_ms is of data type
double?
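One plausible answer, relying on Spark's type inference from Python strings and floats (the sample rows here are made up for illustration):
spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])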
The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the
answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')
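One plausible way to fill the blanks, using lit for the constant column:
from pyspark.sql.functions import lit
transactionsDf.withColumn("associateId", lit(5)).drop("productId", "value")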
The code block shown below should return the number of columns in the CSV file stored at location filePath. From the CSV file, only lines that do not start with a # character should be read.
Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)
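One plausible way to fill the blanks: the comment option makes the CSV reader skip lines starting with the given character, and len over the columns attribute yields the column count:
len(spark.read.csv(filePath, comment="#").columns)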
The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code
block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame
transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(itemsDf, __2__).__3__(__4__)
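One plausible way to fill the blanks: a join on a compound condition, followed by a selection of the two requested columns (a sketch, not necessarily the literal answer option):
transactionsDf.join(itemsDf, (transactionsDf.productId == itemsDf.itemId) & (transactionsDf.storeId != itemsDf.itemId)).select("transactionId", "supplier")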
Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?
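One plausible answer: with 6 columns, dropping rows that have at least 3 missing values is the same as keeping rows with at least 4 non-null values, which is what the thresh parameter expresses:
transactionsDf.dropna(thresh=4)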
The code block displayed below contains at least one error. The code block should return a DataFrame with only one column, result. That column should include all values in column value from
DataFrame transactionsDf raised to the power of 5, and a null value for rows in which there is no value in column value. Find the error(s).
Code block:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    return x**5

spark.udf.register(pow_5, 'power_5_udf', T.LongType())
spark.sql('SELECT power_5_udf(value) FROM transactions')
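For reference, a hedged correction: spark.udf.register expects the name first and the function second, and the output column must be named result (the None guard below is an extra safety measure, not part of the original code):
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    return x**5 if x is not None else None  # avoid raising on null inputs

spark.udf.register('power_5_udf', pow_5, T.LongType())  # name first, then function
spark.sql('SELECT power_5_udf(value) AS result FROM transactions')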
Which of the following code blocks performs an inner join between DataFrame itemsDf and DataFrame transactionsDf, using columns itemId and transactionId as join keys, respectively?
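One plausible answer, with inner being the default join type:
itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.transactionId)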
The code block displayed below contains an error. The code block should write DataFrame transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the error.
Code block:
transactionsDf.write.partitionOn("storeId").parquet(filePath)
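For reference, a hedged correction: DataFrameWriter has no partitionOn method; the partitioning method is partitionBy:
transactionsDf.write.partitionBy("storeId").parquet(filePath)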
The code block shown below should return a DataFrame with columns transactionsId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the
code block to accomplish this.
transactionsDf.__1__(__2__)
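One plausible way to fill the blanks (the question spells the first column transactionsId, although the earlier samples show transactionId):
transactionsDf.select("transactionsId", "predError", "value", "f")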
Which of the following code blocks reads all CSV files in directory filePath into a single DataFrame, with column names defined in the CSV file headers?
Content of directory filePath:
_SUCCESS
_committed_2754546451699747124
_started_2754546451699747124
part-00000-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-298-1-c000.csv.gz
part-00001-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-299-1-c000.csv.gz
part-00002-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-300-1-c000.csv.gz
part-00003-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-301-1-c000.csv.gz
spark.option("header",True).csv(filePath)
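Note that the line above goes straight from the session to option, skipping the read attribute; a working version, assuming the files carry headers, could look like this:
spark.read.option("header", True).csv(filePath)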