1.
A large repository of documents in IR is called:
Correct Answer
A. Corpus
Explanation
In Information Retrieval (IR), the large repository of documents over which retrieval is performed is called a corpus (also a collection). More generally, a corpus is a body of texts gathered for linguistic analysis or research, drawn from sources such as books, articles, or web pages. Analyzing a corpus reveals patterns of language use, which is useful in natural language processing, computational linguistics, and information retrieval itself.
2.
The posting list should be sorted by:
Correct Answer
B. DocID
Explanation
Each posting list should be sorted by DocID. When two lists share the same DocID order, they can be intersected or merged in a single pass with two pointers, in time proportional to the sum of the list lengths rather than their product. Sorted order also lets an index store the gaps between consecutive DocIDs instead of the DocIDs themselves, which compresses well.
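The two-pointer intersection can be sketched as follows. This is a minimal illustration, not a production implementation; the function name `intersect` and the DocIDs are made up for the example, and both lists are assumed to be sorted in ascending DocID order.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by ascending DocID."""
    answer = []
    i, j = 0, 0
    # Advance whichever pointer sits on the smaller DocID;
    # a DocID is in the result only when both pointers agree.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Hypothetical posting lists for two terms.
postings_a = [1, 2, 4, 11, 31, 45, 173, 174]
postings_b = [2, 31, 54, 101]
common = intersect(postings_a, postings_b)
```

Each pointer only moves forward, so the loop runs at most `len(p1) + len(p2)` times. Without a shared sort order, every DocID in one list would have to be searched for in the other.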
3.
For query optimization, when intersecting posting lists, we should:
Correct Answer
A. Process in the order of increasing document frequency
Explanation
When a conjunctive (AND) query involves several terms, it is more efficient to intersect their posting lists in order of increasing document frequency, that is, starting with the shortest lists. Every intermediate result is then no longer than the shortest list seen so far, so non-matching documents are eliminated early and the later intersections need fewer comparisons.
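A rough sketch of this ordering heuristic is shown below. The names `processing_order` and `intersect_many`, and the toy index, are assumptions made for illustration; document frequency is taken to be the length of a term's posting list.

```python
def intersect(p1, p2):
    """Two-pointer intersection of posting lists sorted by DocID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def processing_order(index, terms):
    """Order query terms by increasing document frequency."""
    return sorted(terms, key=lambda term: len(index[term]))


def intersect_many(index, terms):
    """AND together the posting lists of all terms, rarest term first."""
    ordered = processing_order(index, terms)
    if not ordered:
        return []
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # nothing can match any more, so stop early
        result = intersect(result, index[term])
    return result


# Hypothetical index: term -> posting list sorted by DocID.
toy_index = {
    "common": [1, 2, 3, 4, 5, 6],
    "medium": [2, 4, 6],
    "rare": [4],
}
order = processing_order(toy_index, ["common", "medium", "rare"])
matches = intersect_many(toy_index, ["common", "medium", "rare"])
```

Here the one-element list for `rare` is processed first, so the intermediate result never holds more than one DocID.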
4.
Term-document incidence matrix is:
Correct Answer
B. Sparse
Explanation
The term-document incidence matrix is sparse because the vast majority of its entries are zeros. Its rows represent terms and its columns represent documents, and each entry is 1 if the term occurs in the document and 0 otherwise. Since any one document contains only a small subset of the vocabulary, almost every entry is 0. Storing the full matrix for a large collection is therefore wasteful, and often infeasible; recording only the non-zero entries is much cheaper, which is the idea behind the inverted index.
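The sketch below builds an incidence matrix for a tiny made-up collection and compares it with an inverted index that records only the 1 entries. The documents and variable names are illustrative assumptions, and tokenization is a plain whitespace split. A three-document toy cannot be very sparse; in a realistic collection the fraction of non-zero entries is far smaller.

```python
# Hypothetical three-document collection, keyed by DocID.
docs = {
    1: "the cat sat on the mat",
    2: "dogs chase cats",
    3: "information retrieval uses an inverted index",
}

doc_ids = sorted(docs)
tokens = {d: set(docs[d].split()) for d in doc_ids}
vocab = sorted(set().union(*tokens.values()))

# Dense incidence matrix: one row per term, one column per document.
matrix = [[1 if term in tokens[d] else 0 for d in doc_ids] for term in vocab]

total_cells = len(vocab) * len(doc_ids)
nonzero_cells = sum(sum(row) for row in matrix)

# Inverted index: for each term, keep only the DocIDs where the entry is 1.
inverted = {term: [d for d in doc_ids if term in tokens[d]] for term in vocab}
stored_postings = sum(len(postings) for postings in inverted.values())
```

The inverted index stores exactly as many postings as the matrix has 1 entries, so its size grows with the number of term occurrences rather than with terms times documents.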
5.
Lemmatization is a technique for:
Correct Answer
C. Normalization
Explanation
Lemmatization is a normalization technique. It uses a vocabulary and morphological analysis to reduce the inflected forms of a word to its dictionary form, called the lemma, so that, for example, "am", "are", and "is" all map to "be". Variants of a word are then treated as the same term during indexing and querying. Normalization of this kind is a standard step in information retrieval, text mining, and other natural language processing tasks, because it lets a query match documents that use a different form of the same word.
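As a toy illustration only, lemmatization can be pictured as a lookup from inflected forms to lemmas. The `LEMMAS` table and the `lemmatize` function below are hypothetical; a real lemmatizer relies on a full lexicon and usually on part-of-speech information as well.

```python
# Hypothetical lookup table: inflected form -> lemma.
LEMMAS = {
    "am": "be",
    "are": "be",
    "is": "be",
    "cars": "car",
    "better": "good",
}


def lemmatize(token):
    """Return the lemma if the token is in the table, else the token itself."""
    return LEMMAS.get(token, token)


normalized = [lemmatize(t) for t in ["cars", "are", "better", "today"]]
```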
6.
A model of information retrieval in which we can pose any query in which search terms are combined with the operators AND, OR, and NOT:
Correct Answer
C. Boolean retrieval model
Explanation
The Boolean retrieval model lets us pose any query in which search terms are combined with the operators AND, OR, and NOT. It is based on Boolean logic: each document is viewed as a set of terms, and a document either satisfies the query expression or it does not. The result is an unranked set of exactly matching documents, with no notion of partial match or relevance score.
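One simple way to picture Boolean retrieval is with Python sets, where AND, OR, and NOT become intersection, union, and difference from the set of all DocIDs. The documents and terms below are made up for the example.

```python
# Hypothetical collection, keyed by DocID.
docs = {
    1: "apple banana",
    2: "apple cherry banana",
    3: "banana cherry",
    4: "apple",
}

all_docs = set(docs)

# Map each term to the set of DocIDs that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# apple AND banana AND NOT cherry
q1 = index["apple"] & index["banana"] & (all_docs - index["cherry"])

# apple OR cherry
q2 = index["apple"] | index["cherry"]

# NOT apple
q3 = all_docs - index["apple"]
```

Each query returns a plain set of DocIDs: a document is either in it or not, and nothing in the result says which match is better.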
7.
The number of times that a word or term occurs in a document is called:
Correct Answer
C. Term frequency
Explanation
Term frequency is the number of times a word or term occurs in a document. It is computed per document and is widely used in information retrieval and natural language processing as a signal of how important the term is within that document. It is distinct from document frequency, which counts how many documents in the collection contain the term.
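Counting term frequencies for a single document can be sketched with `collections.Counter`. The function name `term_frequencies` is an assumption for this example, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs in one document."""
    return Counter(text.lower().split())


tf = term_frequencies("To be or not to be")
```

A `Counter` returns 0 for terms it has not seen, which matches the convention that a term absent from a document has term frequency 0.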
8.
Stemming increases the size of the vocabulary.
Correct Answer
B. False
Explanation
Stemming does not increase the size of the vocabulary; it reduces it. Stemming maps different inflected forms of a word, such as "connect", "connected", and "connecting", to a single stem, so several distinct index terms collapse into one. Therefore, the statement that stemming increases the size of the vocabulary is false.
9.
A crude heuristic process that chops off the ends of the words to reduce inflectional forms of words and reduce the size of the vocabulary is called:
Correct Answer
D. Stemming
Explanation
Stemming is a crude heuristic process that reduces inflectional forms of words by chopping off the ends of the words. This reduces the size of the vocabulary by grouping together words that share the same stem. Unlike lemmatization, which uses a vocabulary and morphological analysis to return the dictionary form of a word, stemming simply strips suffixes according to fixed rules, so the resulting stem need not be a real word. Case folding refers to converting all letters to lowercase, while truecasing tries to decide the correct capitalization of each word, for example keeping capitals on proper nouns while lowercasing words that are capitalized only because they start a sentence. Therefore, the correct answer for this question is stemming.
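The following deliberately naive suffix stripper illustrates both the vocabulary reduction and the crudeness. It is not the Porter stemmer or any other published algorithm; the suffix list, the minimum stem length of three characters, and the function name `crude_stem` are all assumptions for the example.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule set, checked in this order


def crude_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = [
    "connect", "connected", "connecting", "connects",
    "jump", "jumps", "jumped", "jumping",
]
vocabulary_before = set(words)
vocabulary_after = {crude_stem(w) for w in words}

# The same rules also over-stem: "news" loses its final "s".
over_stemmed = crude_stem("news")
```

The eight distinct surface forms collapse to two stems, while "news" is wrongly reduced to "new", which is the kind of error a rule-based stemmer accepts in exchange for simplicity.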
10.
In information retrieval, extremely common words that would appear to be of little value in helping select documents and are excluded from the index vocabulary are called:
Correct Answer
A. Stop words
Explanation
Stop words are extremely common words that are excluded from the index vocabulary in information retrieval. Words such as "and," "the," and "is" occur in almost every document, so they do little to distinguish one document from another. Excluding them shrinks the index and lets the retrieval system focus on more meaningful terms.
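Stop-word removal at indexing time can be sketched as a simple filter. The stop list below is a small made-up sample rather than a standard list, and `index_terms` is a hypothetical helper name.

```python
# Hypothetical stop list; real systems use longer, curated lists.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def index_terms(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


terms = index_terms("The cat and the dog")
```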