1.
DataFrame schemas are determined
Correct Answer
B. Eagerly
Explanation
DataFrame schemas are determined eagerly: the schema is resolved as soon as the DataFrame is created, rather than being deferred until an action runs. This lets Spark validate column names and types up front and report schema errors early, while the data itself is still evaluated lazily. Lazy schema determination would postpone that check until the schema was actually needed.
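As a small illustration (assuming a SparkSession named spark and a hypothetical JSON file), the schema is available as soon as the DataFrame is defined, before any action is called:
val df = spark.read.json("people.json")  // hypothetical path
df.printSchema()                         // schema is already known; no action such as count() is needed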
2.
Which of the following is the entry point of Spark SQL in Spark 2.0?
Correct Answer
B. SparkSession (spark)
Explanation
The correct answer is SparkSession (spark). In Spark 2.0, SparkSession became the single entry point for Spark SQL. It encapsulates the functionality of SparkContext, SQLContext, and HiveContext, and allows users to easily create DataFrames, execute SQL queries, and access the other Spark SQL features from one object.
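For illustration, a SparkSession is typically obtained like this (the application name and master are placeholder values):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("QuizExamples")   // placeholder application name
  .master("local[*]")        // local mode, for illustration only
  .getOrCreate()

val sc = spark.sparkContext  // the underlying SparkContext is reachable from the session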
3.
Which of the following is not a function of Spark Context?
Correct Answer
D. Entry point to Spark SQL
Explanation
Spark Context is the entry point for any Spark functionality and it provides access to various services, allows setting configurations, and enables checking the status of Spark applications. However, it is not responsible for serving as the entry point to Spark SQL. Spark SQL has its own entry point called SparkSession, which is used for working with structured data using SQL queries, DataFrame, and Dataset APIs.
4.
What does the following code print?
val lyrics = List("all", "that", "i", "know")
println(lyrics.size)
Correct Answer
A. 4
Explanation
The code creates a list called "lyrics" with 4 elements: "all", "that", "i", and "know". The "println" statement prints the size of the list, which is 4.
5.
Spark is 100x faster than MapReduce due to
Correct Answer
A. In-memory computing
Explanation
In-memory computing is the reason why Spark is 100x faster than MapReduce. By keeping the data in memory, Spark eliminates the need to read and write data from disk, which significantly speeds up data processing. This allows Spark to perform operations much faster than MapReduce, which relies heavily on disk I/O. By leveraging the power of in-memory computing, Spark is able to achieve impressive performance gains and process large datasets more efficiently.
6.
Which types of processing can Apache Spark handle?
Correct Answer
E. All of the above
Explanation
Apache Spark is a powerful data processing framework that can handle various types of processing tasks. It supports batch processing, which involves processing large volumes of data in a scheduled manner. It also supports stream processing, which involves processing real-time data as it arrives. Additionally, Apache Spark can handle graph processing, which involves analyzing and processing graph-based data structures. Lastly, it supports interactive processing, which involves querying and analyzing data interactively in real-time. Therefore, the correct answer is "All of the above" as Apache Spark is capable of handling all these types of processing.
7.
Choose the correct statement about RDD
Correct Answer
B. RDD is a distributed data structure
Explanation
RDD stands for Resilient Distributed Dataset, which is a fundamental data structure in Apache Spark. It is not a database or a programming paradigm. RDD is a distributed data structure that allows data to be processed in parallel across a cluster of computers. RDDs are fault-tolerant and can be cached in memory, which enables faster processing. They provide a high-level abstraction for distributed data processing and are a key component in Spark's computational model.
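As a minimal sketch (assuming an existing SparkContext named sc), an RDD distributes a collection across the cluster and is processed in parallel:
val rdd = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)  // distributed across 2 partitions
println(rdd.getNumPartitions)                             // 2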
8.
Identify the correct action
Correct Answer
A. Reduce
Explanation
The correct answer is "Reduce." In programming, the reduce function is used to combine all the elements in a collection into a single value. It applies a specified operation to each element and accumulates the result. This is useful when you want to perform calculations on a list of values and obtain a single output. The reduce function is commonly used for tasks such as calculating the sum or product of a list, finding the maximum or minimum value, or concatenating strings.
9.
How do you print schema of a dataframe?
Correct Answer
A. Df.printSchema()
Explanation
The correct answer is df.printSchema(). This is because the printSchema() function is a method in Spark DataFrame that prints the schema of the DataFrame in a tree format. It displays the column names and their corresponding data types, providing a concise overview of the structure of the DataFrame.
10.
Which of the following is true about DataFrame?
Correct Answer
B. DataFrames provide a more user friendly API than RDDs.
Explanation
The correct answer is "DataFrames provide a more user friendly API than RDDs." This is true because DataFrames provide a higher-level abstraction and a more structured and organized way to work with data compared to RDDs. DataFrames allow for easier manipulation and transformation of data using SQL-like queries and provide optimizations for performance. They also have a schema that provides compile-time type safety, ensuring that the data is correctly structured and typed.
11.
What does the following code print?
var number = {val x = 2 * 2; x + 40}
println(number)
Correct Answer
B. 44
Explanation
The given code assigns the variable "number" the result of a block expression. Inside the block, "x" is defined as 2 multiplied by 2, which is 4, and the block's last expression, x + 40, evaluates to 44. That value becomes the value of "number", so the "println" statement prints 44.
12.
Data transformations are executed
Correct Answer
B. Lazily
Explanation
Data transformations are executed lazily: calling a transformation only records it, and nothing runs until an action requests a result. This laziness lets Spark build up the full lineage (DAG) of transformations and optimize the execution plan as a whole, performing only the computation that is actually needed to produce the requested result.
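For example (assuming a SparkContext named sc), nothing is computed until the action on the last line is called:
val words = sc.parallelize(Seq("all", "that", "i", "know"))
val shouted = words.map(_.toUpperCase)   // transformation: only recorded, not executed
println(shouted.count())                 // action: triggers the actual computation and prints 4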
13.
How many Spark Contexts can be active per job?
Correct Answer
B. Only one
Explanation
The correct answer is "only one" because in Apache Spark, there can only be one active Spark Context per job. A Spark Context represents the entry point to the Spark cluster and coordinates the execution of tasks. Having multiple active Spark Contexts can lead to conflicts and inconsistencies in the execution environment. Therefore, it is recommended to have only one active Spark Context at a time.
14.
DataFrames and _____________ are abstractions for representing structured data
Correct Answer
B. Datasets
Explanation
Datasets are abstractions for representing structured data, along with DataFrames. Both DataFrames and Datasets are used in Apache Spark to handle structured data. While DataFrames provide a high-level API and are optimized for performance, Datasets provide a type-safe, object-oriented programming interface. Datasets combine the benefits of both DataFrames and RDDs, allowing for strong typing and providing a more efficient execution engine. Therefore, the correct answer is Datasets.
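A brief sketch (assuming a SparkSession named spark; the case class is invented for illustration) of the typed Dataset API:
case class Person(name: String, age: Int)   // hypothetical record type

import spark.implicits._
val ds = Seq(Person("Ada", 36), Person("Linus", 28)).toDS()   // Dataset[Person]
val adults = ds.filter(_.age >= 30)                           // field access is checked at compile time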
15.
Apache Spark has APIs in
Correct Answer
D. All of the above
Explanation
Apache Spark has APIs in Java, Scala, and Python. This means that developers can use any of these programming languages to interact with and manipulate data in Apache Spark. The availability of multiple APIs allows developers to choose the language they are most comfortable with, making it easier for them to work with Spark and perform tasks such as data analysis, machine learning, and distributed processing.
16.
Which of the following is not the feature of Spark?
Correct Answer
C. It is cost efficient
Explanation
Spark is known for its features like supporting in-memory computation, fault-tolerance, and compatibility with other file storage systems. However, it is not specifically known for being cost efficient. While Spark does offer high performance and scalability, the cost of running Spark can vary depending on factors such as cluster size and resource requirements. Therefore, the statement "it is cost efficient" is not a feature commonly associated with Spark.
17.
Which of the following is true about Scala type inference?
Correct Answer
B. The type of the variable is determined by looking at its value.
Explanation
Scala has a powerful type inference system that allows the type of a variable to be determined by looking at its value. This means that in many cases, the data type of a variable does not need to be explicitly mentioned. The compiler analyzes the value assigned to the variable and infers its type based on that. This feature of Scala makes the code more concise and reduces the need for explicit type declarations, leading to cleaner and more expressive code.
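For example, the compiler infers each variable's type from the value on the right-hand side:
val count = 42           // inferred as Int
val greeting = "hello"   // inferred as String
val ratio = 9.99         // inferred as Double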
18.
What will be the output:
val rawData = spark.read.textFile("PATH").rdd
val result = rawData.filter...
Correct Answer
C. Won't be executed
Explanation
The code snippet reads a text file with Spark, converts it to an RDD, and defines a filter transformation, but it never calls an action such as count or collect. Because transformations are lazy, they are only recorded in the lineage; without an action, no job is triggered and the work won't be executed.
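Purely for illustration, one hypothetical way the snippet could be completed; only the action on the last line makes Spark actually run the job (the predicate is invented, and PATH is a placeholder from the question):
val rawData = spark.read.textFile("PATH").rdd        // still lazy: nothing is read yet
val result = rawData.filter(line => line.nonEmpty)   // hypothetical predicate, still lazy
println(result.count())                              // action: only now is the file read and filtered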
19.
Datasets are only defined in Scala and ______
Correct Answer
B. Java
Explanation
Datasets are a feature in Apache Spark that provide the benefits of both RDDs and DataFrames. While they are primarily defined in Scala, they can also be used in Java. Therefore, the correct answer is Java.
20.
What does the following code print?
val simple = Map("r" -> "red", "g" -> "green")
println(simple("g"))
Correct Answer
B. Green
Explanation
The code creates a Map called "simple" with two key-value pairs: "r" -> "red" and "g" -> "green". The code then prints the value associated with the key "g" in the map, which is "green". Therefore, the code will print "green".
21.
What does the following code print?
val dd: Double = 9.99
dd = 10.01
println(dd)
Correct Answer
C. Error
Explanation
The code will produce an error because the variable "dd" is declared as a "val", which means it is immutable and cannot be reassigned a new value. Therefore, the attempt to assign a new value to "dd" will result in a compilation error.
22.
What does the following code print?
var aa: String = "hello"
aa = "pretty"
println(aa)
Correct Answer
B. Pretty
Explanation
The code initializes a variable "aa" with the value "hello". Then it assigns the value "pretty" to the variable "aa". Finally, it prints the value of "aa", which is "pretty".
23.
Identify the correct transformation
Correct Answer
D. All of the above
Explanation
The correct answer is "All of the above" because the question is asking to identify the correct transformation, and all three options - Map, Filter, and Join - are valid transformations in data processing. Map is used to transform each element in a dataset, Filter is used to select specific elements based on a condition, and Join is used to combine two datasets based on a common key. Therefore, all three transformations can be used depending on the specific requirements of the data processing task.
24.
Choose the correct statement
Correct Answer
B. Execution starts with the call of Action
Explanation
The correct answer is "Execution starts with the call of Action." In Spark, transformations are lazily evaluated, meaning they are not executed immediately when called. Instead, they create a plan of execution that is only triggered when an action is called. Actions are operations that trigger the execution of the transformations and produce a result or output. Therefore, the execution of a Spark program begins when an action is called, not when a transformation is called.
25.
What does Spark Engine do?
Correct Answer
D. All of the above
Explanation
The Spark Engine performs multiple tasks including scheduling, distributing data across a cluster, and monitoring data across the cluster. It is responsible for managing the execution of Spark applications, allocating resources, and coordinating tasks across the cluster. By handling these tasks, the Spark Engine enables efficient and parallel processing of large datasets, making it a powerful tool for big data analytics and processing.
26.
Caching is an optimizing technique?
Correct Answer
A. TRUE
Explanation
Caching is indeed an optimizing technique. It involves storing frequently accessed data or resources in a cache, which is a high-speed memory or storage system. By doing so, the system can retrieve the data or resources more quickly, reducing the need to access slower or more resource-intensive components. This can greatly improve the performance and efficiency of a system, making caching an effective optimization technique.
27.
Which of the following statements are correct
Correct Answer
D. All of the above
Explanation
All of the statements are correct. Spark is designed to run on top of Hadoop and can process data stored in HDFS. It can also use Yarn as a resource management layer, which allows for efficient allocation of resources and scheduling of tasks in a Hadoop cluster. Therefore, all three statements are true.
28.
Which DataFrame method will display the first few rows in tabular format?
Correct Answer
C. Show()
Explanation
The show() method on a DataFrame displays the first rows (20 by default) in tabular format.
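For instance (assuming a DataFrame named df):
df.show()    // first 20 rows in tabular form (default)
df.show(5)   // first 5 rows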
29.
What does the following code print?
var bb: Int = 10
bb = "funny"
println(bb)
Correct Answer
C. Error
Explanation
The code will print an error. This is because the variable "bb" is declared as an Int, but then it is assigned a string value "funny". This is a type mismatch and the code will not compile.
30.
Spark session variable was introduced in which Spark release?
Correct Answer
C. Spark 2.0
Explanation
The Spark session variable was introduced in Spark 2.0. This release of Spark introduced the concept of a Spark session, which is the entry point for interacting with Spark functionality and allows for managing various Spark configurations and settings. Prior to Spark 2.0, users had to create a SparkContext object to interact with Spark, but the introduction of the Spark session simplified the process and provided a more user-friendly interface.
31.
RDD cannot be created from data stored on
Correct Answer
B. Oracle
Explanation
RDD (Resilient Distributed Dataset) is a fundamental data structure in Apache Spark that allows for distributed processing of large datasets. In this context, the given correct answer states that an RDD cannot be created from data stored on an Oracle database. This is because RDDs are typically created from data sources that are supported by Spark, such as HDFS (Hadoop Distributed File System), S3 (Amazon Simple Storage Service), or LocalFS (local file system). Oracle is not listed among the supported data sources, hence an RDD cannot be directly created from data stored on an Oracle database.
32.
RDD is
Correct Answer
D. All of the above
Explanation
RDD (Resilient Distributed Dataset) is a fundamental data structure in Apache Spark. It is immutable, meaning that once created, its data cannot be modified. RDDs are also recomputable, which means that if a node fails, the RDD can be reconstructed from the lineage information. Finally, RDDs are fault-tolerant, as they automatically recover from failures. Therefore, the correct answer is "All of the above" as RDDs possess all these characteristics.
33.
What is action in Spark RDD?
Correct Answer
A. The ways to send result from executors to the driver
Explanation
The correct answer is "The ways to send result from executors to the driver." In Spark RDD, an action is an operation that triggers the execution of transformations and returns the result to the driver program. Actions are used to bring the data from RDDs back to the driver program or to perform some computation on the RDDs. They are responsible for executing the DAG (Directed Acyclic Graph) of computations that are created by transformations.
34.
Which of the following are DataFrame actions?
Correct Answer
E. All the above
Explanation
The given answer "All the above" is correct because all the mentioned options - count, first, take(n), and collect - are actions that can be performed on a DataFrame. These actions are used to retrieve or manipulate data from the DataFrame. The count action returns the number of rows in the DataFrame, the first action returns the first row, the take(n) action returns the first n rows, and the collect action retrieves all the rows from the DataFrame. Therefore, all the mentioned options are valid DataFrame actions.
35.
Dataframes are _____________
Correct Answer
A. Immutable
Explanation
Dataframes are immutable, meaning that once they are created, their contents cannot be changed. This ensures data integrity and prevents accidental modifications to the dataframe. If any changes need to be made to a dataframe, a new dataframe must be created with the desired modifications. This immutability property also allows for easier debugging and reproducibility, as the original dataframe remains unchanged throughout the data processing pipeline.
36.
Which of the following is not true for MapReduce and Spark?
Correct Answer
C. Both have their own file system
Explanation
Neither MapReduce nor Spark has its own file system. They rely on external file systems such as the Hadoop Distributed File System (HDFS) or any other compatible file system for storing and accessing data. MapReduce uses HDFS for data storage and retrieval, while Spark can work with various file systems including HDFS, Amazon S3, and local file systems.
37.
Which is not a component on the top of Spark Core?
Correct Answer
A. Spark RDD
Explanation
The correct answer is Spark RDD. Spark RDD is not a component on top of Spark Core. RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark, and it is the main component of Spark Core. Spark Streaming, MLlib, and GraphX are all built on top of Spark Core and provide additional functionality for real-time stream processing, machine learning, and graph processing respectively.
38.
SparkContext guides how to access the Spark cluster?
Correct Answer
A. TRUE
Explanation
The SparkContext is the entry point for accessing the Spark cluster. It is responsible for coordinating the execution of tasks and distributing data across the cluster. It provides methods for creating RDDs (Resilient Distributed Datasets) and performing operations on them. Therefore, it guides how to access the Spark cluster, making the answer TRUE.
39.
You cannot load a Dataset directly from a structured source
Correct Answer
A. True
Explanation
A Dataset is a typed abstraction, so it is not loaded directly from a structured source. Instead, the source is first read into a DataFrame of untyped rows, and that DataFrame is then converted into a Dataset by supplying an encoder (for example with .as[SomeCaseClass]). Therefore, the statement "You cannot load a Dataset directly from a structured source" is true.
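An illustrative sketch (assuming a SparkSession named spark; the file path and case class are invented):
case class Person(name: String, age: Long)   // hypothetical record type

import spark.implicits._
val df = spark.read.json("people.json")      // the structured source loads as a DataFrame
val ds = df.as[Person]                       // then converted to a Dataset via an encoder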
40.
Common DataFrame transformations include
Correct Answer
D. Both a and b
Explanation
The correct answer is "both a and b" because both "select" and "filter" are common DataFrame transformations. The "select" transformation is used to select specific columns from a DataFrame, while the "filter" transformation is used to filter rows based on a condition. Therefore, both a and b are valid options for common DataFrame transformations.
41.
What is transformation in Spark RDD?
Correct Answer
A. Takes RDD as input and produces one or more RDD as output.
Explanation
Transformation in Spark RDD refers to the operations that are performed on an RDD to create a new RDD. These operations are lazily evaluated, meaning they are not executed immediately but rather when an action is called. The transformation takes an RDD as input and produces one or more RDDs as output. Examples of transformations include map, filter, and reduceByKey. These transformations allow for the transformation of data in a distributed and parallel manner, enabling efficient data processing in Spark.
42.
What does the following code print?
println(5 < 6 && 10 == 10)
Correct Answer
A. True
Explanation
The code will print "true" because it is using the logical AND operator (&&) to check if both conditions are true. The first condition, 5 < 6, is true. The second condition, 10 == 10, is also true. Since both conditions are true, the overall result is true.
43.
Which file format provides optimized binary storage of structured data ?
Correct Answer
C. Parquet
Explanation
Parquet is a file format that provides optimized binary storage of structured data. It is designed to efficiently store and process large amounts of data. Parquet uses columnar storage, which allows for efficient compression and encoding techniques to be applied to individual columns, resulting in reduced storage space and improved query performance. This makes Parquet an ideal choice for big data processing frameworks like Apache Hadoop and Apache Spark.
44.
What does the following code print?
val numbers = List(11, 22, 33)
var total = 0
for (i <- numbers) {
total += i
}
println(total)
Correct Answer
C. 66
Explanation
The given code initializes a list of numbers [11, 22, 33] and a variable total with the value 0. It then iterates over each element in the list using a for loop and adds each element to the total. Finally, it prints the value of total, which is 66.
45.
Which Cluster Manager do Spark Support?
Correct Answer
D. All of the above
Explanation
Spark supports all of the above cluster managers, which include Standalone Cluster Manager, Mesos, and YARN. This means that Spark can be deployed and run on any of these cluster managers, providing flexibility and compatibility with different environments and infrastructures.
46.
Spark caches the RDD automatically in memory on its own
Correct Answer
B. FALSE
Explanation
Spark does not automatically cache the RDD in memory. Caching is an optional operation in Spark, and the user needs to explicitly instruct Spark to cache an RDD using the `cache()` or `persist()` methods. Caching an RDD allows for faster access to the data, as it is stored in memory and can be reused across multiple actions or transformations. However, if the user does not explicitly cache the RDD, Spark will not automatically cache it. Therefore, the given statement is false.
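For example (assuming an RDD named rdd), caching has to be requested explicitly:
rdd.cache()   // equivalent to persist(StorageLevel.MEMORY_ONLY) for an RDD
rdd.count()   // the first action materializes and caches the partitions
rdd.count()   // later actions reuse the cached data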
47.
The default storage level of cache() is?
Correct Answer
A. MEMORY_ONLY
Explanation
The default storage level of cache() is MEMORY_ONLY. This means that the RDD will be stored in memory as deserialized Java objects. This storage level provides fast access to the data but does not persist it on disk. If the memory is not sufficient to store the entire RDD, some partitions may be evicted and recomputed on the fly when needed.
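For comparison (assuming an RDD named rdd), persist() makes the storage level explicit; a level can be assigned to an RDD only once:
import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)   // the explicit form of what cache() does for an RDD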
48.
Spark is developed in
Correct Answer
A. Scala
Explanation
Spark is developed in Scala. Scala is a programming language that runs on the Java Virtual Machine (JVM) and combines object-oriented and functional programming concepts. Spark was originally written in Scala because Scala provides concise syntax and strong support for functional programming, making it well-suited for building distributed data processing systems like Spark. However, Spark also provides APIs in other languages like Java, Python, and R, allowing developers to use Spark with their preferred programming language.
49.
Spark's core is a batch engine
Correct Answer
A. TRUE
Explanation
Spark's core is a batch engine. This means that Spark is designed to process large amounts of data in batches rather than in real-time. It allows for efficient and parallel processing of data by dividing it into smaller chunks called batches. This batch processing approach is suitable for tasks such as data analytics, machine learning, and data transformations where processing large volumes of data at once is more efficient than processing individual records in real-time. Therefore, the statement "Spark's core is a batch engine" is true.
50.
How much faster can Apache Spark potentially run batch-processing programs when processed in memory than MapReduce can?
Correct Answer
C. 100 times faster
Explanation
Apache Spark can potentially run batch-processing programs 100 times faster than MapReduce when processed in memory. This is because Spark is designed to store data in memory, which allows for faster data processing and eliminates the need to read and write data from disk, as in the case of MapReduce. Additionally, Spark utilizes a directed acyclic graph (DAG) execution engine, which optimizes the execution plan and minimizes the overhead of data shuffling. These factors contribute to the significant speed improvement of Spark over MapReduce.