The editorial team at ProProfs Quizzes consists of a select group of subject experts, trivia writers, and quiz masters who have authored over 10,000 quizzes taken by more than 100 million users. This team includes our in-house seasoned quiz moderators and subject matter experts. Our editorial experts, spread across the world, are rigorously trained using our comprehensive guidelines to ensure that you receive the highest quality quizzes.
The AWS Certified Machine Learning - Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate's ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
Questions and Answers
1.
A Machine Learning Engineer wants to use Amazon SageMaker and the built-in XGBoost algorithm for model training. The training data is currently stored in CSV format, with the first 10 columns representing features and the 11th column representing the target label.
What should the ML Engineer do to prepare the data for use in an Amazon SageMaker training job?
A.
The target label should be changed to the first column. The data should be split into training, validation, and test sets. Finally, the datasets should be uploaded to Amazon S3
B.
The dataset should be uploaded directly to Amazon S3. Amazon SageMaker can then be used to split the data into training, validation, and test sets.
C.
The data should be split into training, validation, and test sets. The datasets should then be uploaded to Amazon S3.
D.
The target label should be changed to the first column. The dataset should then be uploaded to Amazon S3. Finally, Amazon SageMaker can be used to split the data into training, validation, and test sets.
Correct Answer
A. The target label should be changed to the first column. The data should be split into training, validation, and test sets. Finally, the datasets should be uploaded to Amazon S3
Explanation To prepare the data for use in an Amazon SageMaker training job, the ML Engineer should first change the target label to the first column. Then, the data should be split into training, validation, and test sets. Finally, the datasets should be uploaded to Amazon S3.
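As a rough sketch of this preparation step (the file name, bucket name, and split ratios are assumptions for illustration), the reordering, splitting, and upload could look like this:

```python
import boto3
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                      # hypothetical file: 10 feature columns + label in column 11
cols = [df.columns[10]] + list(df.columns[:10])   # move the target label to the first column
df = df[cols]

train, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42)

s3 = boto3.client("s3")
for name, split in [("train", train), ("validation", val), ("test", test)]:
    # SageMaker's built-in XGBoost expects CSV input with no header row and the label first
    split.to_csv(f"{name}.csv", header=False, index=False)
    s3.upload_file(f"{name}.csv", "example-ml-bucket", f"xgboost/{name}/{name}.csv")
```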
2.
A Machine Learning Specialist was given a dataset consisting of unlabeled data. The Specialist must create a model that can help the team classify the data into different buckets. What model should be used to complete this work?
A.
K-means clustering
B.
Random Cut Forest (RCF)
C.
XGBoost
D.
BlazingText
Correct Answer
A. K-means clustering
Explanation K-means clustering should be used to complete this work because it is a popular unsupervised learning algorithm that is used for clustering data. It is suitable for this task because the dataset consists of unlabeled data and the goal is to classify the data into different buckets. K-means clustering works by partitioning the data into k clusters based on their similarity. It iteratively assigns data points to the nearest cluster centroid and updates the centroids until convergence. This algorithm is widely used for data clustering and can help the Machine Learning Specialist in this task.
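A minimal scikit-learn sketch of clustering unlabeled data with k-means (the feature matrix `X` and the choice of five clusters are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)           # X: unlabeled numeric feature matrix
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
buckets = kmeans.fit_predict(X_scaled)                 # cluster index ("bucket") assigned to each record
```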
3.
A large JSON dataset for a project has been uploaded to a private Amazon S3 bucket. The Machine Learning Specialist wants to securely access and explore the data from an Amazon SageMaker notebook instance. A new VPC was created and assigned to the Specialist.
How can the privacy and integrity of the data stored in Amazon S3 be maintained while granting access to the Specialist for analysis?
A.
Launch the SageMaker notebook instance within the VPC with SageMaker-provided internet access enabled. Use an S3 ACL to open read privileges to the everyone group
B.
Launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data. Copy the JSON dataset from Amazon S3 into the ML storage volume on the SageMaker notebook instance and work against the local dataset.
C.
Launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data. Define a custom S3 bucket policy to only allow requests from your VPC to access the S3 bucket.
D.
Launch the SageMaker notebook instance within the VPC with SageMaker-provided internet access enabled. Generate an S3 pre-signed URL for access to data in the bucket.
Correct Answer
C. Launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data. Define a custom S3 bucket policy to only allow requests from your VPC to access the S3 bucket.
Explanation The correct answer is to launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data, and define a custom S3 bucket policy to only allow requests from the VPC to access the S3 bucket. This ensures that the data stored in Amazon S3 remains private and can only be accessed by the Specialist through the VPC and the designated notebook instance. The S3 VPC endpoint establishes a private connection between the VPC and S3, eliminating the need for internet access. The custom S3 bucket policy further restricts access to the bucket, ensuring the integrity of the data.
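A hedged sketch of such a bucket policy applied with boto3 (the bucket name and VPC endpoint ID are placeholders); it denies any S3 request that does not arrive through the specified endpoint:

```python
import json
import boto3

bucket = "example-ml-data-bucket"                      # placeholder bucket name
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAccessOutsideVpcEndpoint",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        # placeholder VPC endpoint ID
        "Condition": {"StringNotEquals": {"aws:SourceVpce": "vpce-0123456789abcdef0"}},
    }],
}
boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```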
4.
You work in the security department of your company’s IT division. Your company has decided to try to use facial recognition to improve security on their campus. You have been asked to design a system that augments your company’s building access security by scanning the faces of people entering their buildings and recognizing the person as either an employee/contractor/consultant, who is in the company’s database, or visitor, who is not in their database.
Across their many campus locations worldwide your company has over 750,000 employees and over 250,000 contractors and consultants. These workers are all registered in their HR database. Each of these workers has an image of their face stored in the HR database. You have decided to use Amazon Rekognition for your facial recognition solution. On occasion, the Rekognition model fails to recognize visitors to the buildings.
What could be the source of the problem?
A.
Face landmarks filters set to a max sharpness
B.
Bounding box and confidence score for face comparison threshold tolerances set to max values
C.
Confidence threshold tolerance set to the default
D.
Face collection contents
Correct Answer
D. Face collection contents
Explanation The source of the problem could be the face collection contents. Since the Rekognition model is failing to recognize visitors, it is possible that the faces of the visitors are not included in the face collection that the system is comparing against. The face collection should ideally contain images of both employees/contractors/consultants and visitors in order to accurately identify and differentiate between them.
5.
A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality in parts per million of contaminants for the next 2 days in the city. As this is a prototype, only daily data from the last year is available.
Which model is MOST likely to provide the best results in Amazon SageMaker?
A.
Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
B.
Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of data.
C.
Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
D.
Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of classifier.
Correct Answer
C. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
Explanation The Amazon SageMaker Linear Learner algorithm is most likely to provide the best results in this scenario. The task is to forecast air quality based on historical data, which is a regression problem. The Linear Learner algorithm is designed for regression tasks and can effectively learn patterns and make predictions based on the given time series data. Using a regressor predictor_type will allow the algorithm to accurately forecast the air quality in parts per million of contaminants for the next 2 days. The other options, such as k-Nearest-Neighbors and Random Cut Forest, may not be as suitable for this specific task.
6.
A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values.
What should the data scientist do to improve the training process?
A.
Increase the learning rate. Keep the batch size the same.
B.
Reduce the batch size. Decrease the learning rate
C.
Keep the batch size the same. Decrease the learning rate
D.
Do not change the learning rate. Increase the batch size.
Correct Answer
B. Reduce the batch size. Decrease the learning rate
Explanation It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum.
7.
A Machine Learning Specialist is using an Amazon SageMaker notebook instance in a private subnet of a corporate VPC. The ML Specialist has important data stored on the Amazon SageMaker notebook instance's Amazon EBS volume, and needs to take a snapshot of that EBS volume. However, the ML Specialist cannot find the Amazon SageMaker notebook instance's EBS volume or Amazon EC2 instance within the VPC.
Why is the ML Specialist unable to see the instance in the VPC?
A.
Amazon SageMaker notebook instances are based on the EC2 instances within the customer account but they run outside of VPCs.
B.
Amazon SageMaker notebook instances are based on the Amazon ECS service within customer accounts.
C.
Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts.
D.
Amazon SageMaker notebook instances are based on AWS ECS instances running within AWS service accounts.
Correct Answer
C. Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts.
8.
A data engineer needs to create a cost-effective data pipeline solution that ingests unstructured data from various sources and stores it for downstream analytics applications and ML. The solution should include a data store where the processed data is highly available for at least one year so that data analysts and data scientists can run analytics and ML workloads on the most recent data. For compliance reasons, the solution should include both processed and raw data. The raw data does not need to be accessed regularly, but when needed, should be accessible within 24 hours.
What solution should the data engineer deploy?
A.
Use Amazon S3 Standard for all raw data. Use Amazon S3 Glacier Deep Archive for all processed data.
B.
Use Amazon S3 Standard for the processed data that is within one year of processing. After one year, use Amazon S3 Glacier for the processed data. Use Amazon S3 Glacier Deep Archive for all raw data.
C.
Use Amazon Elastic File System (Amazon EFS) for processed data that is within one year of processing. After one year, use Amazon S3 Standard for the processed data. Use Amazon S3 Glacier Deep Archive for all raw data.
D.
Use Amazon S3 Standard for both the raw and processed data. After one year, use Amazon S3 Glacier Deep Archive for the raw data.
Correct Answer
B. Use Amazon S3 Standard for the processed data that is within one year of processing. After one year, use Amazon S3 Glacier for the processed data. Use Amazon S3 Glacier Deep Archive for all raw data.
Explanation The data engineer should deploy Amazon S3 Standard for the processed data that is within one year of processing. After one year, they should use Amazon S3 Glacier for the processed data. Additionally, they should use Amazon S3 Glacier Deep Archive for all raw data. This solution ensures that the processed data is highly available for at least one year, allowing data analysts and data scientists to run analytics and ML workloads on the most recent data. The use of Amazon S3 Glacier Deep Archive for raw data ensures compliance and accessibility within 24 hours when needed.
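One way to express this tiering is an S3 lifecycle configuration; the sketch below (the bucket name and prefixes are assumptions) transitions processed objects to Glacier after one year and raw objects to Glacier Deep Archive shortly after upload:

```python
import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="example-data-lake-bucket",                 # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "processed-to-glacier-after-1-year",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "raw-to-deep-archive",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}],
            },
        ]
    },
)
```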
9.
A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL.
Which storage scheme is MOST adapted to this scenario?
A.
Store datasets as files in Amazon S3.
B.
Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.
C.
Store datasets as tables in a multi-node Amazon Redshift cluster.
D.
Store datasets as global tables in Amazon DynamoDB.
Correct Answer
A. Store datasets as files in Amazon S3.
Explanation Storing datasets as files in Amazon S3 is the most adapted storage scheme for this scenario because it allows for scalability and cost-effectiveness. With S3, the Data Science team can easily store and retrieve large amounts of training data without worrying about capacity limitations. Additionally, S3 supports SQL-based querying using services like Amazon Athena, allowing for easy exploration of the data using SQL. This solution also aligns with the requirement of being able to create an arbitrary number of new datasets every day, as S3 can handle the storage of a large number of files.
10.
A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data.
Which solution requires the LEAST effort to be able to query this data?
A.
Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.
B.
Use AWS Glue to catalogue the data and Amazon Athena to run queries.
C.
Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.
D.
Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.
Correct Answer
B. Use AWS Glue to catalogue the data and Amazon Athena to run queries.
Explanation The correct answer is to use AWS Glue to catalogue the data and Amazon Athena to run queries. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. It can automatically discover and catalog data stored in Amazon S3, making it easier to query the data using SQL. Amazon Athena is an interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. This combination of AWS Glue and Amazon Athena requires the least effort as it eliminates the need for manual data transformation and provides a simple and efficient way to query the data.
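Once Glue has catalogued the S3 data, a query can be submitted to Athena with boto3; the database, table, and results location below are placeholders:

```python
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT machine_id, COUNT(*) AS events FROM sensor_logs GROUP BY machine_id",
    QueryExecutionContext={"Database": "manufacturing_db"},                  # placeholder Glue database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder results bucket
)
print(response["QueryExecutionId"])
```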
11.
A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below.
Given the dataset, the Specialist wants to convert the Day-Of_Week column to binary values.
What technique should be used to convert this column to binary values?
A.
Binarization
B.
One-hot encoding
C.
Tokenization
D.
Normalization transformation
Correct Answer
B. One-hot encoding
Explanation The technique that should be used to convert the Day-Of_Week column to binary values is one-hot encoding. One-hot encoding is a technique used to represent categorical variables as binary vectors. Each category is converted into a binary column, where a value of 1 represents the presence of that category and a value of 0 represents the absence. This is commonly used in machine learning algorithms to handle categorical data and allow them to be used in mathematical calculations.
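A small pandas example of one-hot encoding a day-of-week column (the column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"day_of_week": ["Mon", "Tue", "Sun", "Mon"]})
encoded = pd.get_dummies(df, columns=["day_of_week"])
# Result: one binary column per weekday, e.g. day_of_week_Mon = 1/0
```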
12.
A navigation and transportation company is using satellite images to model weather around the world in order to create optimal routes for its ships and planes. The company is using Amazon SageMaker training jobs to build and train its models.
However, during training, it takes too long to download the company’s 100 GB data from Amazon S3 to the training instance before the training starts.
What should the company do to speed up its training jobs while keeping the costs low?
A.
Increase the instance size for training
B.
Increase the batch size in the model
C.
Change the input mode to Pipe
D.
Create an Amazon EBS volume with the data on it and attach it to the training job
Correct Answer
C. Change the input mode to Pipe
Explanation Changing the input mode to Pipe would speed up the training jobs while keeping the costs low. By using Pipe mode, the company can stream the data directly from Amazon S3 to the training instance without the need to download the entire 100 GB data before training starts. This eliminates the time-consuming download process and allows for faster training. Additionally, it helps in reducing storage costs as there is no need to store the data on the training instance.
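A hedged sketch of enabling Pipe mode on a SageMaker estimator with the Python SDK (the region, role, image version, and S3 paths are placeholders):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

region = "us-east-1"                                                  # placeholder region
image = image_uris.retrieve("xgboost", region=region, version="1.5-1")

estimator = sagemaker.estimator.Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",       # placeholder execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                                                # stream data from S3 instead of downloading it first
    output_path="s3://example-bucket/models/",
)
estimator.fit({"train": TrainingInput("s3://example-bucket/train/", content_type="text/csv")})
```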
13.
A Machine Learning Engineer is creating and preparing data for a linear regression model. However, while preparing the data, the Engineer notices that about 20% of the numerical data contains missing values in the same two columns. The shape of the data is 500 rows by 4 columns, including the target column.
How could the Engineer handle the missing values in the data? (Select TWO.)
A.
Remove the rows containing the missing values
B.
Remove the columns containing the missing values
C.
Fill the missing values with zeros
D.
Impute the missing values using regression
E.
Add regularization to the model
Correct Answer(s)
C. Fill the missing values with zeros
D. Impute the missing values using regression
Explanation The Engineer can handle the missing values in two ways. Firstly, they can fill the missing values with zeros, which means replacing the missing values with the value of zero. Secondly, they can impute the missing values using regression, which involves using the other available data to predict and fill in the missing values based on a regression model. These two approaches help to ensure that the missing values are accounted for and do not negatively impact the linear regression model's performance.
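Both options can be sketched with pandas and scikit-learn; `df` is an assumed DataFrame whose feature columns contain the missing values:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer

# Option 1: fill the missing values with zeros
df_zeros = df.fillna(0)

# Option 2: impute each missing value with a regression model fit on the other features
imputer = IterativeImputer(random_state=42)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```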
14.
A Data Scientist at a retail company is using Amazon SageMaker to classify social media posts that mention the company into one of two categories: posts that require a response from the company, and posts that do not. The Data Scientist is using a training dataset of 10,000 posts, which contains the timestamp, author, and full text of each post.
However, the Data Scientist is missing the target labels that are required for training.
Which approach can the Data Scientist take to create valid target label data? (Select TWO.)
A.
Ask the social media handling team to review each post using Amazon SageMaker GroundTruth and provide the label
B.
Use the sentiment analysis natural language processing library to determine whether a post requires a response
C.
Use Amazon Mechanical Turk to publish Human Intelligence Tasks that ask Turk workers to label the posts
D.
Use the a priori probability distribution of the two classes. Then, use Monte-Carlo simulation to generate the labels
E.
Use K-Means to cluster posts into various groups, and pick the most frequent word in each group as its label
Correct Answer(s)
A. Ask the social media handling team to review each post using Amazon SageMaker GroundTruth and provide the label
C. Use Amazon Mechanical Turk to publish Human Intelligence Tasks that ask Turk workers to label the posts
Explanation The Data Scientist can ask the social media handling team to review each post using Amazon SageMaker GroundTruth and provide the label. This approach involves manual review and labeling of each post by the team, ensuring accurate target labels for training. Additionally, the Data Scientist can use Amazon Mechanical Turk to publish Human Intelligence Tasks that ask Turk workers to label the posts. This crowdsourcing approach allows for a larger pool of workers to label the posts, increasing efficiency and scalability in generating valid target label data.
15.
A company is running an Amazon SageMaker training job that will access data stored in its Amazon S3 bucket. A compliance policy requires that the data never be transmitted across the internet. How should the company set up the job?
A.
Launch the notebook instances in a public subnet and access the data through the public S3 endpoint
B.
Launch the notebook instances in a private subnet and access the data through a NAT gateway
C.
Launch the notebook instances in a public subnet and access the data through a NAT gateway
D.
Launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint.
Correct Answer
D. Launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint.
Explanation The company should launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint. This setup ensures that the data is not transmitted across the internet, as required by the compliance policy. By using a private subnet, the instances are not accessible from the public internet. The S3 VPC endpoint allows the instances to securely access the S3 bucket within the VPC, without the need for internet connectivity. This ensures that the data remains within the company's network and complies with the compliance policy.
16.
A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases.
Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A.
Oversampling using bootstrapping
B.
Undersampling
C.
Oversampling using SMOTE
D.
Class weight adjustment
Correct Answer
C. Oversampling using SMOTE
Explanation With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario.
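A minimal sketch with the imbalanced-learn library (the feature matrix `X` and label vector `y` are assumed):

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)   # synthetic minority (fraud) samples added
```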
17.
Your marketing department wishes to understand how their products are being represented in the various social media services in which they have active content streams. They would like insights into the reception of a current product line so they can plan for the roll out of a new product in the line in the near future. You have been tasked with creating a service that organizes the social media content by sentiment across all languages so that your marketing department can determine how best to introduce the new product.
How would you quickly and most efficiently design and build a service for your marketing team that gives insight into the social media sentiment?
A.
Use the scikit-learn python library to build a sentiment analysis service to provide insight data to the marketing team’s internal application platform. Build a dashboard into the application platform using React or Angular.
B.
Use the DetectSentiment Amazon Comprehend API as a service to provide insight data to the marketing team’s internal application platform. Build a dashboard into the application platform using React or Angular.
C.
Use the Amazon Lex API as a service to provide insight data to the marketing team’s internal application platform. Build a dashboard into the application platform using React or Angular.
D.
Use Amazon Translate, Amazon Comprehend, Amazon Kinesis, Amazon Athena, and Amazon QuickSight to build a natural-language-processing (NLP)-powered social media dashboard
Correct Answer
D. Use Amazon Translate, Amazon Comprehend, Amazon Kinesis, Amazon Athena, and Amazon QuickSight to build a natural-language-processing (NLP)-powered social media dashboard
Explanation The best way to quickly and efficiently design and build a service that gives insight into social media sentiment for the marketing team is to use Amazon Translate, Amazon Comprehend, Amazon Kinesis, Amazon Athena, and Amazon QuickSight. These services provide a comprehensive solution for natural language processing (NLP) and data analysis. Amazon Translate can be used to translate social media content into different languages, Amazon Comprehend can be used for sentiment analysis, Amazon Kinesis can be used for real-time data streaming, Amazon Athena can be used for querying and analyzing the data, and Amazon QuickSight can be used for visualizing the insights on a dashboard. This combination of services enables the marketing team to understand the sentiment of their products across different languages and make informed decisions for the roll out of a new product.
18.
A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset.
Which tool should be used to improve the validation accuracy?
A.
Amazon Comprehend syntax analysis and entity detection
B.
Amazon SageMaker BlazingText allow mode
C.
Natural Language Toolkit (NLTK) stemming and stop word removal
D.
Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizers
Correct Answer
D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizers
Explanation The Data Scientist believes that the poor validation accuracy may be due to a rich vocabulary and low average frequency of words in the dataset. In order to improve the accuracy, they should use Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizers. TF-IDF is a technique that assigns weights to words based on their frequency in a document and their rarity in the entire dataset. By using TF-IDF vectorizers, the Data Scientist can give more importance to the words that are both frequent in a document and rare in the dataset, which can help improve the accuracy of the sentiment analysis application.
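A short scikit-learn sketch of TF-IDF vectorization (the `reviews` list is an assumed corpus of raw text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", min_df=2)
X_tfidf = vectorizer.fit_transform(reviews)   # sparse matrix of TF-IDF weights, one row per document
```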
19.
A Machine Learning Specialist is preparing data for training on Amazon SageMaker. The Specialist is using one of the SageMaker built-in algorithms for the training. The dataset is stored in .CSV format and is transformed into a numpy.array, which appears to be negatively affecting the speed of the training.
What should the Specialist do to optimize the data for training on SageMaker?
A.
Use the SageMaker batch transform feature to transform the training data into a DataFrame
B.
Use AWS Glue to compress the data into the Apache Parquet format
C.
Transform the dataset into the RecordIO protobuf format
D.
Use the SageMaker hyperparameter optimization feature to automatically optimize the data
Correct Answer
C. Transform the dataset into the RecordIO protobuf format
Explanation The Specialist should transform the dataset into the RecordIO protobuf format. This format is optimized for high-performance, efficient data storage and retrieval, which can improve the speed of training on SageMaker.
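A hedged sketch of converting a NumPy array to the RecordIO protobuf format with the SageMaker Python SDK and uploading it to S3 (the bucket and key are placeholders; `X` and `y` are assumed feature and label arrays):

```python
import io
import boto3
import numpy as np
import sagemaker.amazon.common as smac

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X.astype(np.float32), y.astype(np.float32))  # features + labels
buf.seek(0)
boto3.resource("s3").Object("example-bucket", "train/data.protobuf").upload_fileobj(buf)
```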
20.
A social networking organization wants to analyze all the comments and likes from its users to flag offensive language on the site. The organization’s data science team wants to use a Long Short-Term Memory (LSTM) architecture to classify the raw sentences from the comments into one of two categories: offensive and non-offensive.
What should the team do to prepare the data for the LSTM?
A.
Convert the individual sentences into sequences of words. Use those as the input.
B.
Convert the individual sentences into numerical sequences starting from the number 1 for each word in a sentence. Use the sentences as the input.
C.
Vectorize the sentences. Transform them into numerical sequences. Use the sentences as the input.
D.
Vectorize the sentences. Transform them into numerical sequences with a padding. Use the sentences as the input.
Correct Answer
D. Vectorize the sentences. Transform them into numerical sequences with a padding. Use the sentences as the input.
Explanation To prepare the data for the LSTM, the team should vectorize the sentences by transforming them into numerical sequences. Additionally, padding should be applied to ensure that all sequences have the same length. This is important because LSTMs require fixed-length input. By vectorizing and padding the sentences, the data can be effectively processed by the LSTM model for classification.
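A sketch of this preparation with the Keras preprocessing utilities (the vocabulary size and sequence length are arbitrary choices for illustration):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(comments)                       # comments: list of raw comment strings
sequences = tokenizer.texts_to_sequences(comments)     # words mapped to integer indices
padded = pad_sequences(sequences, maxlen=100, padding="post")  # fixed-length input for the LSTM
```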
21.
Which probability distribution would describe the likelihood of flipping a coin "heads"?
A.
Bernoulli Distribution
B.
Normal Distribution
C.
Poisson Distribution
D.
Binomial Distribution
Correct Answer
D. Binomial Distribution
Explanation The likelihood of flipping a coin "heads" can be described by the Binomial Distribution. This distribution is used when there are two possible outcomes (in this case, heads or tails) and each flip is independent. The Binomial Distribution calculates the probability of a certain number of successes (in this case, heads) in a fixed number of trials (the number of times the coin is flipped).
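For instance, the probability of a given number of heads can be computed with SciPy (a single flip is the n=1 special case, which is the Bernoulli distribution):

```python
from scipy.stats import binom

p_seven_heads = binom.pmf(k=7, n=10, p=0.5)   # probability of exactly 7 heads in 10 fair flips
p_one_head = binom.pmf(k=1, n=1, p=0.5)       # single flip: 0.5
```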
22.
You work for a real estate company where you are building a machine learning model to predict the prices of houses. You are using a regression decision tree. As you train your model you see that it is overfitted to your training data and that it doesn’t generalize well to unseen data.
How can you improve your situation and get better training results in the most efficient way?
A.
Use a random forest by building multiple randomized decision trees and averaging their outputs to get the predictions of the housing prices.
B.
Gather additional training data that gives a more diverse representation of the housing price data.
C.
Use the “dropout” technique to penalize large weights and prevent overfitting.
D.
Use feature selection to eliminate irrelevant features and iteratively train your model until you eliminate the overfitting.
Correct Answer
A. Use a random forest by building multiple randomized decision trees and averaging their outputs to get the predictions of the housing prices.
Explanation Using a random forest by building multiple randomized decision trees and averaging their outputs can improve the situation and provide better training results. Random forests help to reduce overfitting by introducing randomness into the model. By building multiple decision trees with different subsets of the data and features, the model can learn from different perspectives and make more accurate predictions. Averaging the outputs of these trees helps to reduce the impact of individual overfitted trees and provides a more generalized prediction for unseen data.
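A minimal scikit-learn sketch of swapping the single decision tree for a random forest regressor (`X` and `y` are the assumed housing features and prices):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

forest = RandomForestRegressor(n_estimators=200, random_state=42)
scores = cross_val_score(forest, X, y, cv=5, scoring="neg_mean_absolute_error")
print(scores.mean())   # averaging many randomized trees reduces the variance of a single overfit tree
```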
23.
A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked.
Which services are integrated with Amazon SageMaker to track this information? (Select TWO.)
A.
AWS CloudTrail
B.
AWS Health
C.
AWS Trusted Advisor
D.
Amazon CloudWatch
E.
AWS Config
Correct Answer(s)
A. AWS CloudTrail
D. Amazon CloudWatch
Explanation The correct answer is AWS CloudTrail and Amazon CloudWatch. AWS CloudTrail is used to track API activity and monitor actions taken by users, including model deployments and endpoint invocations. Amazon CloudWatch is used to monitor resource utilization, such as GPU and CPU utilization on the deployed SageMaker endpoints. AWS Health, AWS Trusted Advisor, and AWS Config are not directly integrated with Amazon SageMaker for tracking this information.
24.
A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing.
The Data Scientist has been given the following requirements to the cloud solution:
Combine multiple data sources.
Reuse existing PySpark logic.
Run the solution on the existing schedule.
Minimize the number of servers that will need to be managed.
Which architecture should the Data Scientist use to build this solution?
A.
Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a persistent Amazon EMR cluster based on the existing schedule. Use the existing PySpark logic to run the ETL job on the EMR cluster. Output the results to a “processed” location in Amazon S3 that is accessible for downstream use.
B.
Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a “processed” location in Amazon S3 that is accessible for downstream use.
C.
Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and process the input data from Amazon S3. Write the Lambda logic in Python and implement the existing PySpark logic to perform the ETL process. Have the Lambda function output the results to a “processed” location in Amazon S3 that is accessible for downstream use.
D.
Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the stream to carry out the required transformations within the stream. Deliver the output results to a “processed” location in Amazon S3 that is accessible for downstream use.
Correct Answer
B. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a “processed” location in Amazon S3 that is accessible for downstream use.
Explanation The Data Scientist should use the architecture described in option 2. This option suggests writing the raw data to Amazon S3 and using AWS Glue ETL job to perform the ETL processing. By writing the ETL job in PySpark, the existing logic can be leveraged. A new AWS Glue trigger can be created to trigger the ETL job based on the existing schedule. The output target of the ETL job can be configured to write to a "processed" location in Amazon S3, which is accessible for downstream use. This architecture meets all the given requirements, including combining multiple data sources, reusing existing PySpark logic, running on the existing schedule, and minimizing the number of managed servers.
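An AWS Glue job that reuses PySpark logic typically starts from the standard Glue job skeleton; the sketch below uses placeholder S3 paths and a placeholder transformation step:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

raw = spark.read.json("s3://example-raw-bucket/input/")      # placeholder input location
processed = raw                                              # existing PySpark transformations would go here
processed.write.mode("overwrite").parquet("s3://example-bucket/processed/")

job.commit()
```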
25.
While working on a neural network project, a Machine Learning Specialist discovers that some features in the data have very high magnitude resulting in this data being weighted more in the cost function.
What should the Specialist do to ensure better convergence during backpropagation?
A.
Dimensionality reduction
B.
Data normalization
C.
Model regularization
D.
Data augmentation for the minority class
Correct Answer
B. Data normalization
Explanation Data normalization is the process of scaling the data to a standard range. In this case, the high magnitude of some features can cause the neural network to give more importance to those features, leading to slower convergence during backpropagation. By normalizing the data, the features will be on a similar scale, allowing the neural network to learn more effectively and converge faster. This helps to prevent any one feature from dominating the cost function and ensures better convergence during backpropagation.
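A one-line sketch with scikit-learn (the feature matrix `X` is assumed):

```python
from sklearn.preprocessing import StandardScaler

X_normalized = StandardScaler().fit_transform(X)   # each feature rescaled to zero mean and unit variance
```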
26.
A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%.
What should the ML engineer do to minimize bias due to missing values?
A.
Replace each missing value by the mean or median across non-missing values in the same row.
B.
Delete observations that contain missing values because these represent less than 5% of the data
C.
Replace each missing value by the mean or median across non-missing values in the same column.
D.
For each feature, approximate the missing values using supervised learning based on other features.
Correct Answer
D. For each feature, approximate the missing values using supervised learning based on other features.
Explanation Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research.
27.
You are a data scientist working for a cancer screening center. The center has gathered data on many patients that have been screened over the years. The data is obviously skewed toward true negative results, as most screened patients don’t have cancer. You are evaluating several machine learning models to decide which model best predicts true positives when using your cancer screening data. You have split your data into a 70/30 ratio of training set to test set. You now need to decide which metric to use to evaluate your models.
Which metric will most accurately determine the model best suited to solve your classification problem?
A.
ROC Curve
B.
Precision
C.
Recall
D.
PR Curve
Correct Answer
D. PR Curve
Explanation The PR Curve is the most suitable metric to determine the model best suited for the classification problem in this scenario. Since the data is skewed towards true negative results, precision and recall are more appropriate metrics than the ROC curve. Precision measures the proportion of true positive predictions out of all positive predictions, while recall measures the proportion of true positive predictions out of all actual positive cases. The PR Curve combines both precision and recall, providing a more accurate evaluation of the model's performance in identifying true positives.
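A short scikit-learn sketch of computing the precision-recall curve and its area from held-out predictions (`y_test` and `y_scores` are assumed true labels and predicted probabilities):

```python
from sklearn.metrics import auc, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
pr_auc = auc(recall, precision)   # area under the PR curve; informative when positives are rare
```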
28.
A Data Scientist wants to tune the hyperparameters of a machine learning model to improve the model’s F1 score.
What technique can be used to achieve this desired outcome on Amazon SageMaker? (Select TWO)
A.
Grid Search
B.
Random Search
C.
Breadth First Search
D.
Bayesian optimization
E.
Depth first search
Correct Answer(s)
B. Random Search
D. Bayesian optimization
Explanation Random Search and Bayesian optimization are two techniques that can be used to tune the hyperparameters of a machine learning model on Amazon SageMaker to improve the model's F1 score. Random Search involves randomly selecting combinations of hyperparameters from a predefined search space and evaluating their performance. Bayesian optimization, on the other hand, uses a probabilistic model to find the optimal set of hyperparameters by iteratively exploring the search space based on previous evaluations. Both techniques can help identify the best hyperparameter values that maximize the F1 score.
29.
A Machine Learning Specialist working for an online fashion company wants to build a data ingestion solution for the company's Amazon S3-based data lake.
The Specialist wants to create a set of ingestion mechanisms that will enable future capabilities comprised of:
Real-time analytics
Interactive analytics of historical data
Clickstream analytics
Product recommendations
Which services should the Specialist use?
A.
AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations.
B.
Amazon Athena as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for near-real time data insights; Amazon Kinesis Data Firehose for clickstream analytics; AWS Glue to generate personalized product recommendations.
C.
AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for historical data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations.
D.
Amazon Athena as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for historical data insights; Amazon DynamoDB streams for clickstream analytics; AWS Glue to generate personalized product recommendations.
Correct Answer
A. AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations.
Explanation The Specialist should use AWS Glue as the data catalog to manage the metadata of the data lake. They should use Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights, allowing them to process and analyze streaming data in real-time. They should also use Amazon Kinesis Data Firehose to deliver the clickstream data to Amazon ES for clickstream analytics. Lastly, they should use Amazon EMR to generate personalized product recommendations by processing and analyzing the data in the data lake.
30.
A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field.
For this use case, which is the most effective course of action to address test data samples with missing features?
A.
Drop the test samples with missing full review text fields, and then run through the test set.
B.
Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C.
Use an algorithm that handles missing data better than decision trees.
D.
Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
Correct Answer
B. Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
Explanation In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field.
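In pandas this fallback can be a single line; the column names below are assumptions based on the listed features:

```python
# Where the full review text is missing, fall back to the review summary
df["full_review"] = df["full_review"].fillna(df["full_review_summary"])
```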
31.
An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen.
Which combination of algorithms would provide the appropriate insights? (Select TWO)
A.
The factorization machines (FM) algorithm
B.
The Latent Dirichlet Allocation (LDA) algorithm
C.
The principal component analysis (PCA) algorithm
D.
The k-means algorithm
E.
The Random Cut Forest (RCF) algorithm
Correct Answer(s)
C. The principal component analysis (PCA) algorithm
D. The k-means algorithm
Explanation The principal component analysis (PCA) algorithm is suitable for this task as it can reduce the dimensionality of the data and identify the most important variables that contribute to the variance in the dataset. This can help in identifying patterns and relationships within the census information. The k-means algorithm can be used to cluster the data based on similarities, which can be useful in grouping provinces and cities with similar healthcare and social program needs. These algorithms together can provide valuable insights for determining healthcare and social program needs by province and city based on the census information.
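A scikit-learn sketch chaining the two algorithms (the `responses` matrix and the component/cluster counts are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),                              # compress ~500 census responses into top components
    KMeans(n_clusters=10, n_init=10, random_state=42),
)
segments = pipeline.fit_predict(responses)             # cluster label per citizen record
```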
32.
A Machine Learning Specialist has completed a proof of concept for a company using a small data sample and now the Specialist is ready to implement an end-to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS.
Which approach should the Specialist use for training a model using that data?
A.
Write a direct connection to the SQL database within the notebook and pull data in.
B.
Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook.
C.
Move the data to Amazon DynamoDB and set up a connection to DynamoDB within the notebook to pull data in.
D.
Move the data to Amazon ElastiCache using AWS DMS and set up a connection within the notebook to pull data in for fast access.
Correct Answer
B. Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook.
Explanation The Specialist should push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook. This approach allows for efficient and scalable storage of the historical training data in Amazon S3, which can then be easily accessed and used for training the model in Amazon SageMaker. It also ensures that the data is securely stored and can be easily shared and accessed by other services or users within the AWS environment.
33.
A Data Scientist wants to use the Amazon SageMaker hyperparameter tuning job to automatically tune a random forest model.
What API does the Amazon SageMaker SDK use to create and interact with the Amazon SageMaker hyperparameter tuning jobs?
A.
HyperparameterTunerJob()
B.
HyperparameterTuner()
C.
HyperparameterTuningJobs()
D.
Hyperparameter()
Correct Answer
B. HyperparameterTuner()
Explanation The Amazon SageMaker SDK uses the HyperparameterTuner() API to create and interact with the Amazon SageMaker hyperparameter tuning jobs. This API allows the data scientist to automate the tuning process for their random forest model, optimizing the hyperparameters to improve the model's performance.
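A hedged sketch of the API with the SageMaker Python SDK; the estimator, objective metric, and hyperparameter ranges are placeholders:

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                          # a previously configured SageMaker Estimator
    objective_metric_name="validation:f1",        # placeholder objective metric
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 12),
        "eta": ContinuousParameter(0.01, 0.3),
    },
    strategy="Bayesian",                          # or "Random"
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({"train": train_input, "validation": validation_input})
```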
34.
In AWS SageMaker, what feature allows you to distribute machine learning model training across multiple instances and is designed for large-scale distributed training?
A.
SageMaker Data Wrangler
B.
SageMaker Model Monitor
C.
SageMaker Multi-Model Endpoints
D.
SageMaker Distributed Training
Correct Answer
D. SageMaker Distributed Training
Explanation SageMaker Distributed Training is a feature within Amazon SageMaker that enables large-scale distributed training of machine learning models across multiple instances. This advanced capability is particularly useful for handling large datasets and complex model training scenarios, making it an essential tool for scaling machine learning workflows in AWS.
35.
A Data Scientist created a correlation matrix between nine variables and the target variable. The correlation coefficient between two of the numerical variables, variable 1 and variable 5, is -0.95.
How should the Data Scientist interpret the correlation coefficient?
A.
As variable 1 increases, variable 5 increases
B.
As variable 1 increases, variable 5 decreases
C.
Variable 1 does not have any influence on variable 5
D.
The data is not sufficient to make a well-informed interpretation
Correct Answer
B. As variable 1 increases, variable 5 decreases
Explanation The correlation coefficient of -0.95 indicates a strong negative correlation between variable 1 and variable 5. This means that as variable 1 increases, variable 5 tends to decrease. The closer the correlation coefficient is to -1, the stronger the negative correlation. Therefore, the Data Scientist can interpret that there is a strong inverse relationship between variable 1 and variable 5.
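The coefficient itself can be reproduced with NumPy (the column names are placeholders):

```python
import numpy as np

r = np.corrcoef(df["variable_1"], df["variable_5"])[0, 1]   # about -0.95: strong inverse relationship
```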
36.
A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter.
Which machine learning approach should be used to solve this problem?
A.
Logistic regression
B.
Random Cut Forest (RCF)
C.
Principal component analysis (PCA)
D.
Linear regression
Correct Answer
D. Linear regression
Explanation Linear regression is the appropriate machine learning approach to solve the problem of predicting the number of units of a particular part that should be produced each quarter. Linear regression is used for predicting a continuous numerical value, which aligns with the problem of predicting the quantity of units to be produced. Logistic regression, Random Cut Forest (RCF), and Principal Component Analysis (PCA) are not suitable in this case because they are used for different types of problems such as classification, anomaly detection, and dimensionality reduction, respectively.
37.
An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not.
Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?
A.
Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker built in supervised learning algorithm.
B.
Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C.
Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D.
Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
Correct Answer
D. Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
Explanation Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs to be used instead of Word2Vec.
38.
An analytics company wants to use a fully managed service that automatically scales to handle the transfer of its Apache web logs, syslogs, text and videos on their webserver to Amazon S3 with minimum transformation.
What service can be used for this process?
A.
Kinesis Data Streams
B.
Kinesis Firehose
C.
Kinesis Data Analytics
D.
Amazon Kinesis Video Streams
Correct Answer
B. Kinesis Firehose
Explanation Kinesis Firehose is the correct answer for this question. Kinesis Firehose is a fully managed service that automatically scales to handle the transfer of data, such as Apache web logs, syslogs, text, and videos, from various sources to Amazon S3. It requires minimum transformation, making it suitable for the given scenario where the analytics company wants to transfer their web logs, syslogs, text, and videos to Amazon S3 without extensive data manipulation.
39.
A video streaming company is looking to create a personalized experience for its customers on its platform. The company wants to provide recommended videos to stream based on what other similar users watched previously. To this end, it is collecting its platform’s clickstream data using an ETL pipeline and storing the logs and syslogs in Amazon S3.
What kind of algorithm should the company use to create the simplest solution in this situation?
A.
Regression
B.
Classification
C.
Recommender system
D.
Reinforcement learning
Correct Answer
C. Recommender system
Explanation The company should use a recommender system algorithm to create a personalized experience for its customers. A recommender system analyzes clickstream data and user behavior to provide recommendations based on what other similar users watched previously. This algorithm would be the simplest solution for the company to implement in order to provide recommended videos to stream on its platform.
40.
A real estate company wants to provide its customers with a more accurate prediction of the final sale price for houses they are considering in various cities. To do this, the company wants to use a fully connected neural network trained on data from the previous ten years of home sales, as well as other features.
What kind of machine learning problem does this situation represent?
A.
Regression
B.
Classification
C.
Recommender system
D.
Reinforcement learning
Correct Answer
A. Regression
Explanation This situation represents a regression problem. Regression is a type of machine learning problem where the goal is to predict a continuous numerical value. In this case, the real estate company wants to predict the final sale price of houses, which is a continuous variable. By using a fully connected neural network trained on previous home sales data, the company can make more accurate predictions for their customers.
41.
A Machine Learning Specialist is developing a custom video recommendation model for an application. The dataset used to train this model is very large with millions of data points and is hosted in an Amazon S3 bucket. The Specialist wants to avoid loading all of this data onto an Amazon SageMaker notebook instance because it would take hours to move and will exceed the attached 5 GB Amazon EBS volume on the notebook instance.
Which approach allows the Specialist to use all the data to train the model?
A.
Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
B.
Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to the instance. Train on a small amount of the data to verify the training code and hyperparameters. Go back to Amazon SageMaker and train using the full dataset.
C.
Use AWS Glue to train a model using a small subset of the data to confirm that the data will be compatible with Amazon SageMaker. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
D.
Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to train the full dataset.
Correct Answer
A. Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
Explanation To avoid loading the entire large dataset onto the limited storage of the SageMaker notebook instance, the Machine Learning Specialist should load a smaller subset of the data into the notebook and train locally to confirm that the training code executes and that the model parameters look reasonable. Once this is verified, they can initiate a SageMaker training job that reads the full dataset from the S3 bucket using Pipe input mode, which streams the data directly to the training container instead of copying it onto an attached volume. This lets the Specialist train on all the data without exceeding the notebook instance's storage limits.
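A minimal sketch of that final step with the SageMaker Python SDK, assuming a custom training image and hypothetical role ARN and S3 paths; Pipe input mode streams records from S3 to the training container rather than downloading the full dataset first.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role ARN

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/video-recs:latest",  # hypothetical image
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",                               # stream data instead of copying it to the instance
    output_path="s3://my-bucket/model-artifacts/",   # hypothetical bucket
    sagemaker_session=session,
)

# The full dataset stays in S3; SageMaker pipes it to the container at training time.
estimator.fit({"train": TrainingInput("s3://my-bucket/training-data/", input_mode="Pipe")})
```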
Rate this question:
42.
Which AWS service provides a managed environment for training and deploying machine learning models with built-in support for distributed training, automatic model tuning, and integration with other AWS services?
A.
AWS Glue
B.
Amazon Comprehend
C.
AWS SageMaker
D.
Amazon Lex
Correct Answer
C. AWS SageMaker
Explanation AWS SageMaker is the service that provides a managed environment for training and deploying machine learning models. It supports distributed training, automatic model tuning, and integrates with other AWS services. Unlike AWS Glue, which handles data integration and ETL, and Amazon Comprehend, which focuses on text analysis, SageMaker is specifically designed for end-to-end machine learning workflows, including model training, tuning, and deployment. Amazon Lex is for building chatbots, not model deployment.
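To make the managed training and automatic model tuning concrete, here is a minimal sketch using the SageMaker Python SDK; the role ARN, region, S3 paths, metric name, and parameter ranges are assumptions for illustration, not part of the question.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"          # hypothetical role ARN
image_uri = sagemaker.image_uris.retrieve("xgboost", "us-east-1", version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/tuning-output/",                         # hypothetical bucket
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Automatic model tuning: SageMaker runs multiple training jobs and searches
# the hyperparameter ranges for the best objective metric.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})  # hypothetical paths
```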
Rate this question:
43.
A manufacturing company wants to increase the longevity of its factory machines by predicting when a machine part is about to stop working, jeopardizing the health of the machine. The company's team of Data Scientists will build an ML model to accomplish this goal. The model will be trained on data made up of consumption metrics from similar factory machines; this data spans a time frame from one hour before a machine part broke down to five minutes after the part degraded.
What kind of machine learning algorithm should the company use to build this model?
A.
Amazon SageMaker DeepAR
B.
SciKit Learn Regression
C.
Convolutional neural network (CNN)
D.
Scikit Learn Random Forest
Correct Answer
A. Amazon SageMaker DeepAR
Explanation The company should use the Amazon SageMaker DeepAR algorithm to build the model. DeepAR is a time series forecasting algorithm that is specifically designed for predicting future values based on historical data. In this case, the algorithm can be trained on the consumption metrics of similar factory machines to predict when a machine part is about to stop working. The algorithm's ability to handle time series data and capture temporal dependencies makes it suitable for this task.
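A sketch of how the built-in DeepAR algorithm might be configured with the SageMaker Python SDK; the role ARN, S3 paths, sampling frequency, and forecast horizon are assumptions for illustration.

```python
import sagemaker
from sagemaker.estimator import Estimator

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # hypothetical role ARN
image_uri = sagemaker.image_uris.retrieve("forecasting-deepar", "us-east-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/deepar-output/",   # hypothetical bucket
)

# DeepAR consumes JSON Lines time series; key hyperparameters describe the series.
estimator.set_hyperparameters(
    time_freq="5min",          # assumed sampling interval of the consumption metrics
    context_length=12,         # history the model sees (one hour at 5-minute steps)
    prediction_length=12,      # forecast horizon
    epochs=100,
)

estimator.fit({
    "train": "s3://my-bucket/deepar/train/",   # hypothetical paths
    "test": "s3://my-bucket/deepar/test/",
})
```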
Rate this question:
44.
A Machine Learning Specialist uploads a dataset to an Amazon S3 bucket protected with server-side encryption using AWS KMS.
How should the ML Specialist define the Amazon SageMaker notebook instance so it can read the same dataset from Amazon S3?
A.
Define security group(s) to allow all HTTP inbound/outbound traffic and assign those security group(s) to the Amazon SageMaker notebook instance.
B.
Configure the Amazon SageMaker notebook instance to have access to the VPC. Grant permission in the KMS key policy to the notebook’s KMS role.
C.
Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant permission in the KMS key policy to that role.
D.
Assign the same KMS key used to encrypt data in Amazon S3 to the Amazon SageMaker notebook instance.
Correct Answer
C. Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant permission in the KMS key policy to that role.
Explanation The notebook instance makes its Amazon S3 calls using the IAM role attached to it, so that role needs read access to the dataset. Because the objects are encrypted with SSE-KMS, the KMS key policy must also allow that role to use the key for decryption; without it, the reads would be denied even though the S3 permissions are in place.
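To make the mechanics concrete, here is a minimal sketch (with hypothetical role, bucket, and key identifiers) of the KMS key policy statement that grants the notebook's execution role use of the key, plus the ordinary S3 read that then works from the notebook.

```python
import json
import boto3

notebook_role_arn = "arn:aws:iam::123456789012:role/SageMakerNotebookRole"  # hypothetical

# Statement to add to the KMS key policy so the notebook role can decrypt the dataset.
key_policy_statement = {
    "Sid": "AllowNotebookRoleToUseKey",
    "Effect": "Allow",
    "Principal": {"AWS": notebook_role_arn},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",
}
print(json.dumps(key_policy_statement, indent=2))

# From the notebook instance itself, the attached role is picked up automatically,
# so reading the SSE-KMS encrypted object is an ordinary GetObject call.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-dataset-bucket", Key="data/train.csv")  # hypothetical bucket/key
data = obj["Body"].read()
```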
45.
A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result.
The models should be evaluated based on the following criteria:
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs
After creating each binary classification model, the data scientist generates the corresponding confusion matrix.
Which confusion matrix represents the model that satisfies the requirements?
Correct Answer
D.
Explanation Options C and D both have a recall greater than 80% and a false positive rate under 10%, but D is the most cost effective, so it is the model that satisfies all three requirements.
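Because the confusion matrices themselves are not reproduced here, the evaluation logic can be sketched as a small helper that checks the three criteria for any candidate matrix; the example counts are hypothetical.

```python
def evaluate(tp, fp, fn, tn, fp_cost=5, fn_cost=1):
    """Return recall, false positive rate, and relative business cost."""
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    cost = fp * fp_cost + fn * fn_cost          # false positives are 5x more expensive
    meets_criteria = recall >= 0.80 and fpr <= 0.10
    return recall, fpr, cost, meets_criteria

# Hypothetical confusion matrix (tp, fp, fn, tn) just to show the calculation.
recall, fpr, cost, ok = evaluate(tp=85, fp=8, fn=15, tn=92)
print(f"recall={recall:.2f}, fpr={fpr:.2f}, cost={cost}, meets criteria={ok}")
# recall=0.85, fpr=0.08, cost=55, meets criteria=True
```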
Rate this question:
46.
A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements:
Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum.
Support event-driven ETL pipelines.
Provide a quick and easy way to understand metadata.
Which approach meets these requirements?
A.
Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data Catalog to search and discover metadata.
B.
Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an external Apache Hive metastore to search and discover metadata.
C.
Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job, and an AWS Glue Data Catalog to search and discover metadata.
D.
Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL job, and an external Apache Hive metastore to search and discover metadata.
Correct Answer
A. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data Catalog to search and discover metadata.
Explanation This approach meets the requirements because it utilizes AWS Glue, which is a fully managed extract, transform, and load (ETL) service. The AWS Glue crawler is used to automatically discover and catalog metadata about the data in the S3 data lake. An AWS Lambda function is used to trigger the AWS Glue ETL job, which allows for event-driven ETL pipelines. The AWS Glue Data Catalog is used to search and discover metadata, providing a quick and easy way to understand the data lake's metadata. This approach also aligns with the requirement of supporting querying old and new data through Amazon Athena and Amazon Redshift Spectrum.
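A minimal sketch of the event-driven piece, assuming an AWS Lambda function subscribed to S3 object-created events that starts the Glue ETL job via boto3; the job name and job arguments are hypothetical.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 object-created event; kicks off the Glue ETL job."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    response = glue.start_job_run(
        JobName="datalake-etl-job",                 # hypothetical Glue job name
        Arguments={
            "--source_bucket": bucket,              # custom job arguments (assumed)
            "--source_key": key,
        },
    )
    return {"JobRunId": response["JobRunId"]}
```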
Rate this question:
47.
A Machine Learning Specialist is using Apache Spark for pre-processing training data. As part of the Spark pipeline, the Specialist wants to use Amazon SageMaker for training a model and hosting it.
Which of the following would the Specialist do to integrate the Spark application with SageMaker? (Select THREE)
A.
Download the AWS SDK for the Spark environment.
B.
Install the SageMaker Spark library in the Spark environment.
C.
Use the appropriate estimator from the SageMaker Spark Library to train a model.
D.
Compress the training data into a ZIP file and upload it to a pre-defined Amazon S3 bucket.
E.
Use the SageMaker Model transform method to get inferences from the model hosted in SageMaker.
F.
Convert the DataFrame object to a CSV file, and use the CSV file as input for obtaining inferences from SageMaker.
Correct Answer(s)
B. Install the SageMaker Spark library in the Spark environment.
C. Use the appropriate estimator from the SageMaker Spark Library to train a model.
E. Use the SageMaker Model transform method to get inferences from the model hosted in SageMaker.
Explanation To integrate the Spark application with SageMaker, the Machine Learning Specialist would need to perform the following steps: 1) Install the SageMaker Spark library in the Spark environment, which allows for seamless integration between Spark and SageMaker. 2) Use the appropriate estimator from the SageMaker Spark Library to train a model, which provides a high-level API for training models on SageMaker using Spark. 3) Use the SageMaker Model transform method to get inferences from the model hosted in SageMaker, which allows for real-time inference on new data using the trained model.
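A hedged sketch of those three steps using the open-source sagemaker_pyspark library; the choice of the k-means estimator, the instance types, the role ARN, and the DataFrame source are assumptions, and other SageMaker Spark estimators follow the same fit/transform pattern.

```python
from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Spark session with the SageMaker Spark JARs on the classpath.
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
         .getOrCreate())

role_arn = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical

# Step 2: use an estimator from the SageMaker Spark library to train on SageMaker.
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole(role_arn),
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1,
)
estimator.setK(10)
estimator.setFeatureDim(50)          # assumed feature vector length

# `features_df` stands in for the DataFrame produced by the Spark pre-processing pipeline.
features_df = spark.read.parquet("s3://my-bucket/preprocessed/")   # hypothetical path
model = estimator.fit(features_df)

# Step 3: the returned SageMakerModel's transform method calls the hosted endpoint.
predictions = model.transform(features_df)
```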
Rate this question:
48.
A Machine Learning Specialist is working with a large cybersecurity company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies in the data as it is being ingested.
The company also wants to be able to save the results in its data lake for later processing and analysis.
What is the MOST efficient way to accomplish these tasks?
A.
Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3.
B.
Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to perform anomaly detection. Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake.
C.
Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3.
D.
Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform the new data. Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data.
Correct Answer
A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3.
Explanation The most efficient way to accomplish the tasks of ingesting and analyzing the data in real-time is by using Amazon Kinesis Data Firehose to ingest the data and Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. After detecting the anomalies, the results can be streamed to Amazon S3 using Kinesis Data Firehose. This approach allows for real-time analysis and storage of the results in a scalable and efficient manner.
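As a small illustration of the ingestion side of that pipeline, events can be written to a Kinesis Data Firehose delivery stream with boto3; the Kinesis Data Analytics application that applies the RANDOM_CUT_FOREST function, and the second Firehose stream that delivers results to Amazon S3, are configured separately. The stream name and event fields below are hypothetical.

```python
import json
import boto3

firehose = boto3.client("firehose")

def send_security_event(event: dict) -> None:
    """Write one security event to the Firehose delivery stream feeding the
    Kinesis Data Analytics application that runs RANDOM_CUT_FOREST."""
    firehose.put_record(
        DeliveryStreamName="security-events-ingest",        # hypothetical stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_security_event({"source_ip": "203.0.113.7", "bytes": 5120, "event_type": "login"})
```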
Rate this question:
49.
A security and networking company wants to use ML to flag certain IP addresses that have been known to send spam and phishing information. The company wants to build an ML model based on previous user feedback indicating whether specific IP addresses have been connected to a website designed for spam and phishing.
What is the simplest solution that the company can implement?
A.
Regression
B.
Classification
C.
Natural language processing (NLP)
D.
A rule-based solution should be used instead of ML
Correct Answer
D. A rule-based solution should be used instead of ML
Explanation A rule-based solution should be used instead of ML because the company already has explicit user feedback identifying which IP addresses are associated with spam and phishing websites. Flagging those addresses is a straightforward lookup against a list built from that feedback, so maintaining predefined rules (a blocklist) is simpler than training, deploying, and maintaining an ML model.
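The rule-based approach the answer describes can be as simple as a blocklist lookup built from the existing user feedback; the addresses below come from documentation ranges and are purely illustrative.

```python
# Blocklist assembled from previous user feedback (illustrative addresses only).
KNOWN_BAD_IPS = {"192.0.2.10", "198.51.100.23", "203.0.113.7"}

def flag_ip(ip_address: str) -> bool:
    """Return True if the IP address has been reported for spam or phishing."""
    return ip_address in KNOWN_BAD_IPS

print(flag_ip("203.0.113.7"))    # True  -> flag or block the request
print(flag_ip("198.51.100.99"))  # False -> allow
```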
Rate this question:
50.
A Data Scientist working for an autonomous vehicle company is building an ML model to detect and label people and various objects (for instance, cars and traffic signs) that may be encountered on a street. The Data Scientist has a dataset made up of labeled images, which will be used to train their machine learning model.
What kind of ML algorithm should be used?
A.
Image classification
B.
Instance segmentation
C.
Image localization
D.
Semantic segmentation
Correct Answer
B. Instance segmentation
Explanation Instance segmentation should be used in this scenario. Instance segmentation not only classifies objects in an image but also provides a pixel-level mask for each individual object. This is important in the context of autonomous vehicles as it allows for accurate detection and labeling of people and various objects on the street. Image classification would only classify the entire image, while image localization would only provide bounding boxes around objects. Semantic segmentation would classify pixels into different categories but would not differentiate between individual objects.
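For illustration only, instance segmentation inference can be sketched with torchvision's pre-trained Mask R-CNN (a framework choice the question does not specify); each detected instance gets a bounding box, a class label, a confidence score, and a pixel-level mask.

```python
import torch
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn,
    MaskRCNN_ResNet50_FPN_Weights,
)

# Pre-trained Mask R-CNN: detects objects and predicts a pixel mask per instance.
model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

# A dummy 3-channel image tensor stands in for a real street-scene photo.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    outputs = model([image])[0]

# Each detected instance gets a bounding box, class label, confidence score, and mask.
print(outputs["boxes"].shape, outputs["labels"].shape, outputs["masks"].shape)
```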
Rate this question:
Quiz Review Timeline
Our quizzes are rigorously reviewed, monitored and continuously updated by our expert board to maintain accuracy, relevance, and timeliness.