A data scientist at an e-commerce company is working with user data from the company's subscriber database, stored in a DataFrame df_user. Before processing the data further, the data scientist wants to create another DataFrame, df_user_non_pii, that contains only the non-PII columns. The PII columns in df_user are first_name, last_name, email, and birthdate.
Which code snippet can be used to meet this requirement?
To remove specific columns from a PySpark DataFrame, the drop() method is used. This method
returns a new DataFrame without the specified columns. The correct syntax for dropping multiple
columns is to pass each column name as a separate argument to the drop() method.
Correct Usage:
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
This line of code will return a new DataFrame df_user_non_pii that excludes the specified PII
columns.
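When the PII column names are kept in a list rather than typed inline, the same call can be written by unpacking the list with *. The sketch below is illustrative only: the helper name drop_columns and the constant PII_COLUMNS are hypothetical, and df is assumed to be a PySpark DataFrame (the helper only relies on its drop(*cols) method).

```python
from typing import Iterable

# PII column names taken from the question text.
PII_COLUMNS = ["first_name", "last_name", "email", "birthdate"]


def drop_columns(df, columns: Iterable[str]):
    """Return a new DataFrame without the given columns.

    Hypothetical helper: `df` is assumed to be a PySpark DataFrame.
    The names are unpacked with * because drop() takes each column
    name as a separate argument. The input DataFrame is not modified.
    """
    return df.drop(*columns)


# Usage, assuming df_user is an existing PySpark DataFrame:
# df_user_non_pii = drop_columns(df_user, PII_COLUMNS)
```

Keeping the names in one list makes it easier to reuse the same PII definition elsewhere in a pipeline.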
Explanation of Options:
A. Correct. Uses the drop() method with each column name passed as a separate argument,
which is the standard usage in PySpark.
B. Correct as written. It resembles Option A and would fail only if the column names were unquoted or the snippet contained another syntax error (e.g., incorrect variable names). As written, it is identical to Option A.
C. Incorrect. dropfields() is not a method of the DataFrame class in PySpark. PySpark does provide
Column.dropFields(), but it removes fields from within a StructType column, not top-level DataFrame columns.
D. Incorrect. Passing a single string of comma-separated column names to dropfields() is not valid
syntax in PySpark.

