1.
What is the most basic need of data science?
Correct Answer
A. Collect
Explanation
The most basic need of data science is to collect data. Data collection is the first step in the data science process, as it involves gathering relevant and reliable data from various sources. Without data collection, there would be no data to analyze and derive insights from. Collecting data allows data scientists to have a foundation for their analysis and modeling, enabling them to make informed decisions and predictions based on the data collected.
2.
You have run a linear regression model against your data and have plotted the true outcome versus the predicted outcome. The R-squared of your model is 0.75. What is your assessment of the model?
Correct Answer
B. The R-squared is good. The model should perform well.
Explanation
R-squared is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. An R-squared of 0.75 indicates that the model explains 75% of the variance in the true outcome. Generally, an R-squared between 0.7 and 0.8 is considered good, suggesting that the model captures a substantial share of the variation in the data. The assessment in this case is therefore that the model should perform well.
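As an illustration, here is a minimal sketch (with made-up outcome values) of how R-squared is computed from true versus predicted outcomes:

```python
import numpy as np

# Hypothetical true and predicted outcomes (illustrative values only)
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.6, 10.5])

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")  # proportion of variance explained by the model
```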
3.
Data has been collected on visitors' viewing habits at a bank's website. Which technique is used to identify pages commonly viewed during the same visit to the website?
Correct Answer
B. Association Rules
Explanation
Association Rules is the technique used to identify pages commonly viewed during the same visit to the website. It analyzes the patterns and relationships between different pages visited by users and identifies the frequent co-occurrence of pages. This technique helps in understanding the behavior of visitors and can be used for various purposes such as recommending related pages or products, optimizing website layout, and improving user experience.
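A minimal sketch of the idea, using a handful of made-up bank-site sessions and plain Python counting rather than a dedicated association-rules library:

```python
from itertools import combinations
from collections import Counter

# Hypothetical website visits (each list = pages viewed in one session)
visits = [
    ["home", "rates", "loans"],
    ["home", "loans", "contact"],
    ["rates", "loans"],
    ["home", "rates"],
    ["loans", "contact"],
]

# Count how often each pair of pages is viewed within the same visit
pair_counts = Counter()
for pages in visits:
    for pair in combinations(sorted(set(pages)), 2):
        pair_counts[pair] += 1

# Pairs with the highest co-occurrence are candidates for association rules
for pair, count in pair_counts.most_common(3):
    support = count / len(visits)
    print(f"{pair}: viewed together in {count} visits (support {support:.2f})")
```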
4.
A data scientist is asked to implement an article recommendation feature for an online magazine. The magazine does not want to use client tracking technologies such as cookies or reading history. Therefore, only the style and subject matter of the current article are available for making recommendations. All of the magazine’s articles are stored in a database in a format suitable for analytics. Which method should the data scientist try first?
Correct Answer
A. K Means Clustering
Explanation
The data scientist should try K Means Clustering first. K Means Clustering is a method used to group similar data points together based on their characteristics. In this case, the data scientist can use K Means Clustering to group articles based on their style and subject matter. By doing so, they can recommend articles that are similar in style and subject matter to the current article being read by the user. This method does not require any client tracking technologies and can be implemented using the available data in the database.
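A minimal sketch of this approach, assuming the articles are available as plain text and using scikit-learn's TfidfVectorizer and KMeans; the article texts and the choice of k = 3 are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical article texts stored in the magazine's database
articles = [
    "Markets rally as central banks hold interest rates steady",
    "New fiction explores memory and identity in a coastal town",
    "Tech startups race to ship generative AI features",
    "Novelist's debut blends folklore with small-town drama",
    "Chipmakers invest heavily in AI accelerator hardware",
    "Bond yields fall after a surprise inflation report",
]

# Represent each article by the words it uses (style/subject only, no user data)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)

# Group similar articles; k=3 is an arbitrary illustrative choice
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Recommend other articles from the same cluster as the one currently being read
current = 2  # index of the article the reader has open
recommendations = [i for i, lbl in enumerate(labels) if lbl == labels[current] and i != current]
print("Recommend article indexes:", recommendations)
```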
5.
You submit a MapReduce job to a Hadoop cluster and notice that although the job was successfully submitted, it is not completed. What should you do?
Correct Answer
D. Ensure that the TaskTracker is running
Explanation
If a MapReduce job is successfully submitted but never completes, the likely culprit is the TaskTracker. In Hadoop 1.x (MRv1), the TaskTracker daemon on each worker node executes the individual map and reduce tasks assigned by the JobTracker; if it is not running, accepted jobs sit without making progress. Ensuring that the TaskTracker is running is therefore the first thing to check.
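A quick way to check, assuming shell access to the worker node, a Hadoop 1.x (MRv1) deployment, and that the JDK's jps tool is on the PATH (it lists running Java daemons, including TaskTracker):

```python
import subprocess

# `jps` lists the Java daemons running on this node; a healthy MRv1 worker
# should include a "TaskTracker" entry.
output = subprocess.run(["jps"], capture_output=True, text=True).stdout
if "TaskTracker" in output:
    print("TaskTracker is running")
else:
    print("TaskTracker is NOT running - restart it before resubmitting the job")
```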
6.
You have been assigned to run a logistic regression model for each of 100 countries, and all the data is currently stored in a PostgreSQL database. Which tool/library would you use to produce these models with the least effort?
Correct Answer
A. MADlib
Explanation
MADlib is the correct answer because it is specifically designed for large-scale machine learning tasks on relational databases. It provides a set of in-database algorithms, including logistic regression, that can be applied directly to the data stored in PostgreSQL. This means that there is no need to extract the data from the database and transfer it to another tool or library, reducing the effort required to produce the logistic regression models for the 100 countries.
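A hedged sketch of what the in-database call might look like, assuming a psycopg2 connection and hypothetical table and column names (customer_data, defaulted, income, age, country); the exact logregr_train argument list should be verified against the installed MADlib version's documentation:

```python
import psycopg2

# Hypothetical connection string and schema; MADlib runs inside PostgreSQL,
# so the data never leaves the database.
conn = psycopg2.connect("dbname=analytics user=ds")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT madlib.logregr_train(
            'customer_data',          -- source table (stays in PostgreSQL)
            'country_models',         -- output table holding one fitted model per group
            'defaulted',              -- dependent variable (boolean)
            'ARRAY[1, income, age]',  -- independent variables (1 = intercept)
            'country'                 -- grouping column: one logistic model per country
        );
    """)
```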
7.
Imagine you are trying to hire a Data Scientist for your team. In addition to technical ability and quantitative background, which additional essential trait would you look for in people applying for this position?
Correct Answer
A. Communication skill
Explanation
When hiring a Data Scientist, communication skills are crucial. This trait is essential because Data Scientists need to effectively communicate their findings and insights to both technical and non-technical stakeholders. They should be able to explain complex concepts in a clear and concise manner, as well as collaborate and work effectively with team members. Without strong communication skills, a Data Scientist may struggle to convey their ideas and findings, hindering the overall success and impact of their work.
8.
An analyst is searching a corpus of documents for the topic “solid-state disk.” In the Exhibit, Table A provides the inverse document frequency for each term across the corpus. Table B provides each term’s frequency in four documents selected from the corpus. Which of the four documents is most relevant to the analyst’s search?
Correct Answer
A. Document B
Explanation
Document B is the most relevant to the analyst's search because it has the highest combined TF-IDF score for the query terms. Relevance is computed by multiplying each query term's frequency in a document (Table B) by that term's inverse document frequency across the corpus (Table A) and summing the products; TF-IDF measures how important a term is to a document relative to the entire corpus. Weighted this way, Document B's term frequencies yield the largest score of the four documents, so it is the best match for "solid-state disk."
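A small sketch of the scoring, with made-up numbers standing in for Tables A and B:

```python
# Hypothetical values standing in for the exhibit's tables:
# Table A: inverse document frequency of each query term across the corpus
idf = {"solid-state": 2.3, "disk": 1.1}

# Table B: each term's frequency in four candidate documents
tf = {
    "Document A": {"solid-state": 1, "disk": 3},
    "Document B": {"solid-state": 4, "disk": 2},
    "Document C": {"solid-state": 0, "disk": 5},
    "Document D": {"solid-state": 2, "disk": 0},
}

# Relevance score = sum over query terms of (term frequency x IDF)
for doc, freqs in tf.items():
    score = sum(freqs[term] * idf[term] for term in idf)
    print(f"{doc}: TF-IDF score {score:.1f}")
# The document with the highest combined score is the most relevant.
```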
9.
Consider the training data set shown in the exhibit. What are the classification (Y = 0 or 1) and the probability of the classification for the tuple X(1, 0, 0) using a Naive Bayesian classifier?
Correct Answer
A. Classification Y = 0,Probability = 4/54
Explanation
A Naive Bayesian classifier assigns the tuple to the class with the larger value of P(Y) · P(X1 | Y) · P(X2 | Y) · P(X3 | Y), where each factor is estimated from counts in the training data shown in the exhibit.
For the tuple X = (1, 0, 0), compute both class scores:
P(Y = 0 | X) ∝ P(X1 = 1 | Y = 0) · P(X2 = 0 | Y = 0) · P(X3 = 0 | Y = 0) · P(Y = 0)
P(Y = 1 | X) ∝ P(X1 = 1 | Y = 1) · P(X2 = 0 | Y = 1) · P(X3 = 0 | Y = 1) · P(Y = 1)
Using the counts in the exhibit, the product for Y = 0 evaluates to 4/54 and exceeds the product for Y = 1, so the classifier assigns the tuple to class Y = 0.
Therefore, the correct option is:
Classification Y = 0, Probability = 4/54
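A minimal sketch of the same procedure on a made-up training set (not the exhibit's data), showing how the unnormalized class scores are computed and compared:

```python
# Hypothetical training tuples (X1, X2, X3, Y) standing in for the exhibit
train = [
    (1, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 0),
    (0, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 1),
]
query = (1, 0, 0)

def nb_score(y):
    """Unnormalized Naive Bayes product: P(Y=y) * prod_i P(Xi | Y=y)."""
    rows = [t for t in train if t[3] == y]
    prior = len(rows) / len(train)
    score = prior
    for i, value in enumerate(query):
        score *= sum(1 for t in rows if t[i] == value) / len(rows)
    return score

scores = {y: nb_score(y) for y in (0, 1)}
best = max(scores, key=scores.get)
print(f"Classification Y = {best}, unnormalized probability = {scores[best]:.4f}")
```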
10.
What provides the decision tree for predicting whether or not someone is good or bad credit risk? What would be the assigned probability, p(good), of a single male with no known savings?
Correct Answer
A. 0.83
Explanation
The decision tree in the exhibit provides the prediction for whether someone is a good or bad credit risk. Tracing the tree along the branches for a single male with no known savings leads to a leaf node whose proportion of good-credit outcomes is 0.83, so the assigned probability is p(good) = 0.83.
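A small illustration of reading a leaf probability from a fitted tree, using made-up encoded credit data rather than the exhibit's:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoded credit data: [is_single_male, has_known_savings]
X = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1], [1, 0]])
y = np.array([1, 1, 1, 0, 0, 1, 1, 0])  # 1 = good credit, 0 = bad credit

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# p(good) for a single male with no known savings is the class proportion
# at the leaf that this profile falls into
p_good = tree.predict_proba([[1, 0]])[0][1]
print(f"p(good) = {p_good:.2f}")
```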
11.
What describes the use of the UNION clause in a SQL statement?
Correct Answer
B. Operates on queries and potentially decreases the number of rows
Explanation
The UNION clause in a SQL statement combines the result sets of two or more SELECT statements into a single result set. It operates on queries and potentially decreases the number of rows because, unlike UNION ALL, it eliminates duplicate rows from the combined result.
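A quick illustration using SQLite (any SQL engine behaves the same way): the shared customer is returned once by UNION but twice by UNION ALL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE checking(customer TEXT);
    CREATE TABLE savings(customer TEXT);
    INSERT INTO checking VALUES ('alice'), ('bob');
    INSERT INTO savings VALUES ('bob'), ('carol');
""")

# UNION removes duplicate rows from the combined result...
union_rows = conn.execute(
    "SELECT customer FROM checking UNION SELECT customer FROM savings"
).fetchall()

# ...whereas UNION ALL keeps them
union_all_rows = conn.execute(
    "SELECT customer FROM checking UNION ALL SELECT customer FROM savings"
).fetchall()

print(len(union_rows), len(union_all_rows))  # 3 vs 4: 'bob' is deduplicated by UNION
```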
12.
When would you use a Wilcoxon Rank Sum test?
Correct Answer
A. When you cannot make an assumption about the distribution of the populations
Explanation
A Wilcoxon Rank Sum test, also known as the Mann-Whitney U test, is used when you cannot make an assumption about the distribution of the populations. It is a non-parametric alternative to the independent-samples t-test and does not require the assumption of normality, making it suitable for ordinal data or for distributions that are skewed or contain outliers. The test compares the ranks of the observations between two independent groups to determine whether there is a significant difference between them.
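A minimal sketch using SciPy's implementation of the test (mannwhitneyu), on made-up skewed samples:

```python
from scipy.stats import mannwhitneyu

# Hypothetical samples from two independent groups with skewed values
group_a = [12, 15, 14, 200, 13, 16, 11]
group_b = [22, 25, 24, 27, 23, 300, 26]

# Rank-based test: no normality assumption about either population
stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U statistic = {stat}, p-value = {p_value:.4f}")
```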
13.
In the MapReduce framework, what is the purpose of the Reduce function?
Correct Answer
A. It aggregates the results of the Map function and generates processed output.
Explanation
The Reduce function in the MapReduce framework is responsible for aggregating the results generated by the Map function. It takes the intermediate key-value pairs produced by the Map function and combines them based on the key, performing operations such as summing, averaging, or counting on the grouped values. This processed output is typically written to storage or used for further analysis. Distributing the input to multiple nodes is handled by the framework when it splits the input for the Map phase, not by the Reduce function.
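A toy word-count sketch of the three stages in plain Python, to make the division of labor concrete:

```python
from collections import defaultdict

documents = ["the map phase emits pairs", "the reduce phase aggregates pairs"]

# Map: emit (key, value) pairs for each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key (the framework does this between phases)
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate the values for each key into the final output
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)  # e.g. {'the': 2, 'pairs': 2, ...}
```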
14.
A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse contains data collected from many sources and transformed through a complex, multi-stage ETL process. What is a concern the data scientist should have about the data?
Correct Answer
A. It is too processed.
Explanation
The concern the data scientist should have about the data being "too processed" is that it may have undergone extensive transformations during the ETL process, which could potentially lead to loss or distortion of the original data. This could affect the accuracy and reliability of the model built using this data. It is important for a data scientist to have access to raw and unprocessed data to ensure the integrity of the model and to be able to perform necessary data cleaning and preprocessing steps.
15.
You are given a list of pre-defined association rules:
A) RENTER => BAD CREDIT
B) RENTER => GOOD CREDIT
C) HOME OWNER => BAD CREDIT
D) HOME OWNER => GOOD CREDIT
E) FREE HOUSING => BAD CREDIT
F) FREE HOUSING => GOOD CREDIT
For your next analysis you must limit your dataset based on rules with confidence greater than 60%. Which of the rules will be kept in the analysis?
Correct Answer
A. Rules B and D
Explanation
The rules kept in the analysis are B and D, because only these have confidence greater than 60%. The confidence of a rule X => Y is the fraction of records containing X that also contain Y, i.e., support(X and Y) / support(X). In this dataset, more than 60% of renters have good credit and more than 60% of home owners have good credit, so rules B (RENTER => GOOD CREDIT) and D (HOME OWNER => GOOD CREDIT) exceed the threshold, while the remaining rules do not.
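A small sketch of the confidence calculation on made-up customer records, chosen so that the GOOD CREDIT rules for renters and home owners clear the 60% threshold:

```python
# Hypothetical customer records: (housing_status, credit)
customers = [
    ("RENTER", "GOOD"), ("RENTER", "GOOD"), ("RENTER", "BAD"),
    ("HOME OWNER", "GOOD"), ("HOME OWNER", "GOOD"), ("HOME OWNER", "GOOD"), ("HOME OWNER", "BAD"),
    ("FREE HOUSING", "BAD"), ("FREE HOUSING", "GOOD"),
]

# confidence(X => Y) = count(X and Y) / count(X)
for housing in ("RENTER", "HOME OWNER", "FREE HOUSING"):
    with_x = [c for c in customers if c[0] == housing]
    for credit in ("GOOD", "BAD"):
        confidence = sum(1 for c in with_x if c[1] == credit) / len(with_x)
        keep = "keep" if confidence > 0.6 else "drop"
        print(f"{housing} => {credit} CREDIT: confidence {confidence:.2f} ({keep})")
```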