SEC-C BDA Quiz 1

1. Point out the correct statement:

Raw data is original source of data

Preprocessed data is original source of data

Raw data is the data obtained after processing steps

None of the Mentioned

Explanation

Raw data is the original source of data: unprocessed, unorganized information collected directly from the source, before any manipulation or analysis. Preprocessed data, by contrast, has already been cleaned, transformed, and organized for further analysis, so it cannot be the original source. The correct option is therefore "Raw data is original source of data".
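The distinction can be illustrated with a minimal sketch. The readings in `raw_records` and the helper `preprocess` are hypothetical and exist only for this example; real preprocessing depends on the data and the question being asked.

```python
# Hypothetical raw readings, exactly as they might arrive from a source.
raw_records = [" 23.5", "19.0 ", "n/a", "", "21.25"]


def preprocess(records):
    """Strip whitespace, drop unparseable entries, and convert to floats."""
    cleaned = []
    for value in records:
        text = value.strip()
        try:
            cleaned.append(float(text))
        except ValueError:
            # Entries such as "n/a" or "" cannot be parsed and are dropped.
            continue
    return cleaned


preprocessed = preprocess(raw_records)
print(preprocessed)  # [23.5, 19.0, 21.25]
```

The raw list is left untouched, so the original source stays available if the cleaning rules need to change later.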

2. Which of the following is performed by a data scientist?

Define the question

Create reproducible code

Challenge results

All of the Mentioned

Explanation

Data scientists perform all of the mentioned tasks. They define the question or problem they are trying to solve, create reproducible code to analyze and manipulate data, and challenge the results to ensure accuracy and reliability. By doing all of these tasks, data scientists are able to extract insights and make data-driven decisions.

3. Which of the following is preferred for text analytics?

R

Python

S

All of the mentioned

Explanation

The expected answer is R. R has a wide range of packages designed for natural language processing and text mining, covering tasks such as tokenization, stemming, sentiment analysis, and topic modeling. Its visualization capabilities make textual data easier to analyze and interpret, and it has strong community support and extensive online resources.

4. Which of the following is the most important language for data science?

Java

Ruby

R

None of the Mentioned

Explanation

Among the listed options, R is the most important language for data science because it was designed for statistical analysis and data manipulation. It has a wide range of packages for complex data analysis tasks and a large, active user community, so resources and support are easy to find. R also integrates well with other programming languages and tools commonly used in data science.

5. Which of the following is one of the key data science skills?

Statistics

Machine Learning

Data Visualization

All of the Mentioned

Explanation

All of the mentioned options are key data science skills. Statistics is essential for analyzing and interpreting data, Machine Learning is crucial for building predictive models and making data-driven decisions, and Data Visualization is important for effectively communicating insights and patterns from data. Therefore, all of these skills are fundamental in the field of data science.

6. Your company is attempting to build a Big Data environment. The vendors you are working with tell you that an additional $1m of capital expenditure is needed on top of the $10m made so far. You are worried that the existing environment will not provide all the capability you need, however. Do you:

Finalize the work you are doing with your current vendors because there isn't much left to do

Pause work while you consider what would be needed to gain the extra capability you need

Scrap the project as it seems it will not be fit for purpose

None of the above

Explanation

Pausing work while considering what would be needed to gain the extra capability is the most logical choice in this situation. The concern about the existing environment not providing all the necessary capability indicates that further evaluation and planning are required before making a decision. By pausing work, the company can assess the feasibility of meeting their requirements with the additional $1m investment and determine if any adjustments or changes need to be made to ensure the success of the Big Data environment project.

7. For decision-making, data must be:

Very accurate

Massive

Processed correctly

Collected from diverse sources

Explanation

To make informed decisions, the data must be processed correctly. That means organizing, cleaning, and transforming it so that errors and inconsistencies are eliminated. Correctly processed data yields meaningful insights and accurate conclusions, whereas poorly processed data is unreliable and leads to incorrect decisions. Accuracy, volume, and diversity of sources matter too, but they only pay off when the data is processed correctly.

8. You are operating a public health screening post at an airport and 200 people with a disease are identified. Three quarters of these are young, and two-thirds of all young people are diseased. There are as many non-diseased old people as there are young people in total. You now screen a new previously unseen individual – what is the chance they are old?

Impossible to say from the data given

Impossible to say without knowledge of the previously unseen individual's gender

55%

40%

Explanation

The answer is 55%. Of the 200 diseased people, three quarters (150) are young, so the remaining 50 are old. Those 150 are two-thirds of all young people, so there are 225 young people in total, 75 of them non-diseased. The number of non-diseased old people equals the total number of young people, 225, so there are 50 + 225 = 275 old people. The screened population is therefore 225 + 275 = 500, and the chance that a new individual is old is 275/500 = 55%, assuming they are drawn from the same population.
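The arithmetic for this question can be checked step by step with a short sketch. The variable names are illustrative, and the calculation assumes the new individual is drawn from the same population as the people already screened.

```python
from fractions import Fraction

diseased_total = 200

# Three quarters of the diseased are young; the rest are old.
diseased_young = Fraction(3, 4) * diseased_total   # 150
diseased_old = diseased_total - diseased_young     # 50

# Two-thirds of all young people are diseased.
young_total = diseased_young / Fraction(2, 3)      # 225

# As many non-diseased old people as young people in total.
healthy_old = young_total                          # 225

old_total = diseased_old + healthy_old             # 275
population = young_total + old_total               # 500

p_old = old_total / population
print(float(p_old))  # 0.55
```

Using `Fraction` keeps every intermediate count exact, so the final ratio is 11/20 with no rounding.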

9. Which of the following approaches should be used to ask a data analysis question?

Find only one solution for particular problem

Find out the question which is to be answered

Find out answer from dataset without asking question

None of the mentioned

Explanation

The correct approach to asking a data analysis question is to first identify the question that needs to be answered. This means understanding the problem at hand and determining what specific information or insights are required from the dataset. Once the question is clearly defined, appropriate analysis techniques can be applied to find the answer. The other options, finding only one solution or extracting an answer from the dataset without asking a question, do not fit this systematic approach.

10. ______ is the simplest class of analytics:

Descriptive

Predictive

Prescriptive

All of the mentioned

Explanation

Descriptive analytics is the simplest class of analytics because it only analyzes historical data to understand what has already happened. It summarizes and interprets data to identify patterns and trends. Unlike predictive and prescriptive analytics, it does not make predictions or prescribe future actions. Instead, it gives a clear picture of past events that serves as a foundation for further analysis and decision-making.
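As a minimal sketch of what descriptive analytics looks like in practice, the following summarizes a small set of historical figures. The `monthly_sales` values are hypothetical and serve only as an illustration.

```python
import statistics

# Hypothetical historical data: units sold in six consecutive months.
monthly_sales = [120, 135, 128, 150, 142, 160]

# Descriptive analytics: summarize what has already happened.
summary = {
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "min": min(monthly_sales),
    "max": max(monthly_sales),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Nothing here forecasts the next month or recommends an action; those would be predictive and prescriptive tasks respectively.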

11. Point out the correct statement:

Hadoop is an ideal environment for extracting and transforming small volumes of data

Hadoop stores data in HDFS and supports data compression/decompression

The Giraph framework is less useful than a MapReduce job to solve graph and machine learning

None of the mentioned

Explanation

Hadoop stores data in HDFS and supports data compression/decompression. HDFS is Hadoop's distributed file system, built to hold large volumes of data, and compression reduces the storage space needed and can improve processing efficiency. The other statements are wrong: Hadoop is aimed at large rather than small volumes of data, and Giraph is built specifically for graph processing.
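The storage benefit of compression can be illustrated with a small local sketch. This uses Python's standard `gzip` module on an in-memory byte string, not Hadoop's own compression codecs or HDFS, and the log line is made up for the example.

```python
import gzip

# Hypothetical, highly repetitive log data held in memory.
raw = b"2024-01-01 INFO request served in 12 ms\n" * 1000

compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

print(len(raw), len(compressed))
print(restored == raw)  # True: decompression recovers the original bytes
```

Repetitive data such as logs tends to compress well, which is why compression matters when the stored volume is large.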

12. Data by itself is not useful unless:

It is massive

It is processed to obtain information

It is collected from diverse sources

It is properly stated

Explanation

Data by itself is raw and unorganized. To derive meaningful insights or make informed decisions, it has to be processed and analyzed to extract information. Processing involves organizing, cleaning, and transforming the data into a more structured format, which makes it possible to identify patterns, trends, and relationships. Processing is therefore what turns data into useful information.

13. Point out the wrong statement:

Merging concerns combining datasets on the same observations to produce a result with more variables

Data visualization is the organization of information according to preset specifications

Subsetting can be used to select and exclude variables and observations

All of the Mentioned

Explanation

The wrong statement is "Data visualization is the organization of information according to preset specifications." Data visualization is the representation of data in graphical or visual form to provide insights and communicate patterns or trends. Organizing information according to preset specifications describes data formatting rather than visualization. The statements about merging and subsetting are accurate.
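The two accurate statements, on merging and subsetting, can be illustrated with a minimal sketch in plain Python. The `demographics` and `purchases` datasets and their variable names are hypothetical.

```python
# Hypothetical datasets keyed by the same observation id.
demographics = {1: {"age": 34}, 2: {"age": 51}, 3: {"age": 27}}
purchases = {1: {"spend": 120.0}, 2: {"spend": 75.5}, 3: {"spend": 210.0}}

# Merging: combine datasets on the same observations,
# producing more variables per observation.
merged = {
    obs_id: {**demographics[obs_id], **purchases[obs_id]}
    for obs_id in demographics.keys() & purchases.keys()
}

# Subsetting: select observations (age under 40) and
# select variables (keep "spend", exclude "age").
subset = {
    obs_id: {"spend": row["spend"]}
    for obs_id, row in merged.items()
    if row["age"] < 40
}

print(merged[2])  # {'age': 51, 'spend': 75.5}
print(subset)
```

Merging widens each observation with extra variables, while subsetting narrows the result by dropping both observations and variables.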

14. A salesman offers you a choice of three boxes, one containing a million dollars and two containing fifty dollars and tells you to pick one. He then shows you fifty dollars in one of the other two boxes and asks you if you want to change your choice to the remaining box that you have neither picked nor seen inside. What do you do?

Change to the other box

Stay with the one you picked originally

It doesn't matter, so do nothing

You don't have enough information to figure out whether you should change, so do nothing

Explanation

The correct answer is to change to the other box. This is the Monty Hall problem. Initially, there is a 1/3 chance that you picked the box with a million dollars and a 2/3 chance that you picked one with fifty dollars. Assuming the salesman knows the contents and always reveals a fifty-dollar box, his reveal does not change the 1/3 chance attached to your original pick, so the remaining unopened box holds the million dollars with probability 2/3. Switching therefore doubles your chance of winning.
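The 1/3 versus 2/3 split can be checked empirically with a short simulation. This is a sketch under the standard Monty Hall assumption that the salesman knows the contents and always opens a fifty-dollar box other than your pick. The function names `play` and `win_rate` are illustrative.

```python
import random


def play(switch, rng):
    """Play one round; return True if the final pick holds the million."""
    boxes = [0, 1, 2]
    prize = rng.choice(boxes)
    pick = rng.choice(boxes)
    # The salesman knowingly opens a fifty-dollar box that is not your pick.
    opened = rng.choice([b for b in boxes if b != pick and b != prize])
    if switch:
        # Move to the one box that is neither picked nor opened.
        pick = next(b for b in boxes if b != pick and b != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning over many simulated rounds."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("switch:", win_rate(True))
print("stay:  ", win_rate(False))
```

With enough trials the switching strategy should win in roughly two thirds of rounds and the staying strategy in roughly one third, matching the argument above.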