
Web Systems Discussion Questions 3


[MIW]
Exercise 3.1 Give at least two different simple methods for estimating the size of the Web (its number of nodes and edges). Could the estimate be improved by sampling IP addresses (e.g., with Ping)?
Exercise 3.3 Prove that if the degree distribution of an n-node graph follows a power law with exponent γ ≥ 2, then the graph is sparse, i.e., its number of edges is O(n).
Exercise 3.4 In the power-law phenomena presented in the book, the exponents are all relatively small, never exceeding 10. Is this a general phenomenon? Why?

[IIR] Exercises 21-7 to 21-18; 21-19 to 21-22
Exercise 21-7 Besides clicking links on the currently viewed page x, a browser user may also return to the previous page by clicking the "Back" button. Can this back-tracking behavior be modeled with a Markov chain? How would one model repeated invocations of the Back button?

SOLUTION.
No, the user cannot be modeled as a Markov chain. This is because the user may hit the Back button multiple times, and the intended semantics is that the path is unwound back to that point; this dependence on history is not Markovian.
Exercise 21-10 Show that the PageRank of every page is at least α/N. What does this imply about the differences between the PageRank values of different pages as α approaches 1?
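A sketch of the bound (assuming, as in the chapter, that the teleport step, taken with probability α, chooses among all N pages uniformly):

```latex
% With teleport probability \alpha, every page v satisfies
\pi_v \;=\; \frac{\alpha}{N} \;+\; (1-\alpha)\sum_{u \to v}\frac{\pi_u}{\operatorname{outdeg}(u)}
\;\ge\; \frac{\alpha}{N},
% since the link-following term is nonnegative. That term is also at most
% (1-\alpha), so \alpha/N \le \pi_v \le \alpha/N + (1-\alpha): as \alpha \to 1
% all PageRank values are squeezed toward 1/N and the differences between
% pages vanish.
```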

Exercise 21-12 Suppose the Web graph is stored on disk as an adjacency list, in such a form that you may only query for the sorted out-neighbors of a page. You cannot load the whole Web graph into main memory, but you may read it in multiple passes. Write down the PageRank algorithm for this setting.
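A minimal sketch of the multi-pass idea: each PageRank iteration makes one sequential pass over the on-disk adjacency list, holding only the current and next score vectors (O(N) floats) in memory. The line format `src: dst1 dst2 ...`, the three-node graph, and the teleport probability 0.1 are illustrative assumptions; dangling pages (no out-links) are not handled here.

```python
import io

ADJ = "A: B C\nB: C\nC: B\n"  # stands in for the adjacency-list file on disk

def pagerank_stream(open_file, nodes, alpha=0.1, iters=100):
    scores = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: alpha / len(nodes) for v in nodes}  # teleport contribution
        for line in open_file():  # one sequential pass over the adjacency list
            src, _, rest = line.partition(":")
            outs = rest.split()
            for dst in outs:  # spread src's current score over its out-links
                nxt[dst] += (1 - alpha) * scores[src.strip()] / len(outs)
        scores = nxt
    return scores

pr = pagerank_stream(lambda: io.StringIO(ADJ), ["A", "B", "C"])
print({v: round(s, 4) for v, s in pr.items()})
```

Reopening the file anew for each pass (here via the `open_file` callable) is what keeps memory usage independent of the number of edges.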

Exercise 21-15 [***] Is the sports-specific PageRank of every page in S at least as large as its PageRank in the full Web graph?

SOLUTION.
What I think is that in ordinary PageRank the probability is distributed over a much larger collection of states than in topic-specific PageRank, so the topic-specific PageRank might be greater than the ordinary PageRank in some cases. PR: The answer is YES, but I realize that proving it requires a careful stochastic-domination argument, unless someone can think of an easier way.

Exercise 21-16 [***] Consider the case where every Web page has two topic-specific PageRank values: a sports PageRank πs and a politics PageRank πp, both computed with the same teleport probability α. For q ∈ [0, 1], consider a user whose interests split with probability q into sports and 1 − q into politics. Show that the steady-state distribution of the random walk whose teleport step goes to the sports pages with probability q and to the politics pages with probability 1 − q is exactly this user's personalized PageRank.

SOLUTION.
See Haveliwala (2002) and Haveliwala (2003).
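The linearity argument can be sketched as follows (writing P for the link-following transition matrix and v_s, v_p for the uniform teleport distributions over the sports and politics pages):

```latex
% The two topic-specific PageRanks are the unique solutions of
\pi_s = \alpha v_s + (1-\alpha)P^{\mathsf T}\pi_s,
\qquad
\pi_p = \alpha v_p + (1-\alpha)P^{\mathsf T}\pi_p .
% Taking q times the first equation plus (1-q) times the second:
q\pi_s + (1-q)\pi_p
  = \alpha\bigl(q v_s + (1-q)v_p\bigr)
  + (1-\alpha)P^{\mathsf T}\bigl(q\pi_s + (1-q)\pi_p\bigr),
% so q\pi_s + (1-q)\pi_p satisfies the fixed-point equation of the walk whose
% teleport distribution is q v_s + (1-q) v_p. That walk is ergodic and hence
% has a unique steady state, which is therefore exactly q\pi_s + (1-q)\pi_p.
```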

Exercise 21-22 For the Web graph shown in Figure 21-7, compute the PageRank, hub score, and authority score of every page. Rank the three pages by each of these three scores, indicating any ties.

PageRank: assume that at each step of the PageRank random walk we teleport with probability 0.1, choosing the teleport target uniformly at random.
Hub/authority: normalize the hub (authority) scores so that the largest hub (authority) value is 1.
Hint 1: using symmetry to simplify and solve the linear equations may be much easier than iterating;
Hint 2: for each measure, give the relative order of the three nodes.

Figure 21-7 The Web graph for Exercise 21-22

SOLUTION. Since the in-degree of A is 0, the steady-state visit rate (PageRank) of A is 0.1 · 1/3 = 1/30, coming from teleport alone. By symmetry, rank(B) = rank(C); thus rank(B) = rank(C) = (1 − 1/30)/2 = 29/60. Authority: there is no notion of teleport in HITS, so authority(A) = 0. Again authority(B) = authority(C); thus authority(A) = 0, authority(B) = authority(C) = 1. Hub: using the authority scores above and iterating once gives hub(A) = 2, hub(B) = 1, hub(C) = 1; normalized: hub(A) = 1, hub(B) = 1/2, hub(C) = 1/2.
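These numbers can be checked by iteration. The graph structure used below (A links to B and C, while B and C link to each other) is inferred from the symmetry argument in the solution, not reproduced from Figure 21-7 itself, so treat it as an assumption:

```python
# Assumed structure of Figure 21-7: A -> B, A -> C, B -> C, C -> B.
links = {"A": ["B", "C"], "B": ["C"], "C": ["B"]}
nodes = sorted(links)
N, alpha = len(links), 0.1  # teleport probability 0.1, as stated above

# PageRank by power iteration.
pr = {v: 1.0 / N for v in nodes}
for _ in range(200):
    new = {v: alpha / N for v in nodes}
    for u, outs in links.items():
        for v in outs:
            new[v] += (1 - alpha) * pr[u] / len(outs)
    pr = new
print({v: round(p, 4) for v, p in pr.items()})  # A = 1/30, B = C = 29/60

# HITS: authority = sum of in-linking hub scores, hub = sum of linked-to
# authority scores; normalize so the largest value in each vector is 1.
hub = {v: 1.0 for v in nodes}
for _ in range(50):
    auth = {v: sum(hub[u] for u in nodes if v in links[u]) for v in nodes}
    m = max(auth.values())
    auth = {v: a / m for v, a in auth.items()}
    hub = {v: sum(auth[w] for w in links[v]) for v in nodes}
    m = max(hub.values())
    hub = {v: h / m for v, h in hub.items()}
print(auth, hub)  # auth: A=0, B=C=1; hub: A=1, B=C=1/2
```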
Exercise 1-4 [*] For the following queries, can we still process them in O(x + y) time, where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what time complexity can we achieve, and how?
a. Brutus AND NOT Caesar
b. Brutus OR NOT Caesar

SOLUTION. a. Time is O(x + y). Instead of collecting documents that occur in both postings lists, collect those that occur in the first one and not in the second. b. Time is O(N), where N is the total number of documents in the collection, assuming we need to return a complete list of all documents satisfying the query. This is because the length of the results list is bounded only by N, not by the lengths of the postings lists.
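The part-a merge can be sketched as follows (the docIDs in the usage example are illustrative; both postings lists are assumed to be sorted):

```python
def and_not(xs, ys):
    """Merge for "xs AND NOT ys": one pass over two sorted docID lists,
    keeping docIDs present in xs but absent from ys -- O(x + y) time."""
    i = j = 0
    out = []
    while i < len(xs):
        if j == len(ys) or xs[i] < ys[j]:
            out.append(xs[i])  # xs[i] cannot appear later in ys
            i += 1
        elif xs[i] == ys[j]:
            i += 1  # present in both lists: excluded from the result
            j += 1
        else:
            j += 1  # advance ys until it catches up with xs[i]
    return out

print(and_not([1, 2, 4, 11, 31], [2, 31, 54]))  # -> [1, 4, 11]
```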

Exercise 2-10 [*] Consider the following fragment of a positional index, with the format term: doc: (position, position, ...); doc: (position, ...) ...
Gates: 1: (3); 2: (6); 3: (2,17); 4: (1);
IBM: 4: (3); 7: (14);
Microsoft: 1: (1); 2: (1,21); 3: (3); 5: (16,22,51);
The /k operator means that the query word1 /k word2 finds occurrences of word1 within k words of word2 (on either side), where k is a positive integer; k = 1 means word1 and word2 are adjacent.
a. Give all documents that match the query Gates /2 Microsoft;
b. Partition the values of k into sets such that k values in the same set return the same set of documents for the query Gates /k Microsoft, while k values in different sets return different document sets.

SOLUTION. a. 1,3
b. {1} gives {3}; {2, 3, 4} gives {1, 3}; {k : k ≥ 5} gives {1, 2, 3}.
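A small sketch that reproduces both answers; the dictionaries are a direct transcription of the positional-index fragment in the exercise, and the brute-force position check stands in for the real linear-time positional merge:

```python
# Positional index fragment from the exercise: term -> {docID: [positions]}.
gates = {1: [3], 2: [6], 3: [2, 17], 4: [1]}
microsoft = {1: [1], 2: [1, 21], 3: [3], 5: [16, 22, 51]}

def within_k(idx1, idx2, k):
    """Documents matching "word1 /k word2": some pair of positions
    differs by at most k."""
    return [doc for doc in sorted(set(idx1) & set(idx2))
            if any(abs(p1 - p2) <= k
                   for p1 in idx1[doc] for p2 in idx2[doc])]

print(within_k(gates, microsoft, 2))  # -> [1, 3]   (answer to part a)
print(within_k(gates, microsoft, 1))  # -> [3]
print(within_k(gates, microsoft, 5))  # -> [1, 2, 3]
```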

Exercise 2-12 [**] Consider the modification (Figure 2-12) of the basic postings-merge algorithm (Figure 1-6) that handles proximity queries. A naive algorithm for such queries could have complexity O(P·Lmax²), where P is the sum of the lengths of all postings lists (i.e., the sum of the document frequencies) and Lmax is the length of the longest document (in tokens):
a. Analyze the algorithm and explain why it satisfies this complexity bound.
b. What is the complexity of the modified algorithm, and why?
c. For some queries and data distributions, is there a more efficient algorithm? What would its complexity be?
SOLUTION. It is an O(P·k·Lmax) algorithm.

Keep a moving window of postings w2 for one of the lists. Add the next posting to the end of the window if there is no match. After adding a posting, check all postings in the other window to see whether they meet the proximity constraint; if yes, add them to the result set. Delete the first posting of the window if its distance to the last posting in the other window exceeds k. There can be at most k entries in the window, and each new posting is compared to at most k entries. Thus this algorithm is O(k(S1 + S2)), so proximity queries can be handled in linear time. c. If cf(t1) < k, then one can simply test the cf(t1) × cf(t2) possible posting pairs to see whether they are within k.
Exercise 4-8 Suppose we have a collection of modest size whose index can be constructed with the simple in-memory indexing algorithm of Figure 1-4 (Chapter 1). Work out the memory, disk, and time costs of that algorithm and compare them with blocked sort-based indexing.
SOLUTION. Let there be a collection with the following statistics: number of documents = 10,000; average number of tokens per document = 1,000; number of terms = 100,000; average bytes per token = 6; number of non-positional postings = 10 million; size of the collection ≈ 60 MB. A simple in-memory indexing algorithm like the one in Figure 1-4 proceeds in the following steps (assuming the system parameters of Table 4-1):
a. Parsing (to create initial postings): complexity O(n), where n = number of postings. Time taken = 10,000,000 × 10⁻⁷ s = 1 s.
b. Sorting the postings alphabetically: O(n log n). Time taken ≈ 10,000,000 × log₂(10,000,000) × 10⁻⁷ s ≈ 24 s.
c. Merging the postings of each term into postings lists: O(n). Time taken = 10,000,000 × 10⁻⁷ s = 1 s.
Total time = 26 s. Disk space required initially = 60 MB; memory required = 60 MB.

Blocked sort-based indexing. Assuming a memory restriction of 10 MB: number of blocks = 6; number of postings per block = 10⁷/6 ≈ 1.6 × 10⁶.
(i) Reading the collection: (2 disk seeks per block × 6 blocks × 5 ms) + (read/write transfer time for the 6 blocks) = 0.06 s + 0.6 s = 0.66 s.
(ii) Using a sorting algorithm of O(n log n): sort time = seek and read time for the 6 blocks + sorting time = 0.66 s + (6 blocks × 1.6 × 10⁶ × log₂(1.6 × 10⁶) × 10⁻⁷ s) ≈ 21 s.
(iii) With the vocabulary size, the reduction in the size of the postings database can be worked out. Vocabulary size of the collection = 100,000 (given); total size of the blocks after indexing = 4 × 100,000 + 4 × 10⁷ bytes ≈ 4 × 10⁷ bytes; writing time = 4 × 10⁷ × 10⁻⁷ s = 4 s.
(iv) Assuming 6 buffers are maintained while merging the various parts of the blocks: number of disk seeks = 6 × 6 × 2 (for read/write) = 72; total disk transfer time for merging = seek time + block r/w into buffers = (72 × 5 ms) + (4 × 10⁷ bytes × 10⁻⁷ s/byte × 2) = 0.36 s + 8 s = 8.36 s.
(v) Time for the actual merging of the sorted blocks = disk transfer time + processing time (O(n), where n = total number of postings) = 0.36 s + 10⁵ ops × 10⁻⁷ s/op ≈ 0.4 s.
Total time = (i) + (ii) + (iii) + (iv) + (v) ≈ 34 s, compared with 26 s for the in-memory algorithm. Memory required is 6 MB at maximum; disk space required is 60 MB.

Exercise 4-9 Suppose each machine in a MapReduce cluster has 100 GB of disk space, while the postings list of the term "the" is 200 GB in size. The MapReduce approach described in the text then cannot construct the index. How would you modify MapReduce so that it can handle this case?
SOLUTION. Partition by docID as well as by term for very frequent terms.

Exercise 6-18 Another measure of vector similarity is the Euclidean distance (also called L2 distance), computed as ‖q − d‖ = √(∑ᵢ(qᵢ − dᵢ)²). Given a query q and documents d1, d2, ..., we can rank the results by their Euclidean distance from the query. Show that if q and all the dᵢ are normalized to unit vectors, the ranking of documents produced by Euclidean distance is identical to that produced by cosine similarity.

SOLUTION.
∑(qᵢ − wᵢ)² = ∑qᵢ² − 2∑qᵢwᵢ + ∑wᵢ² = 2(1 − ∑qᵢwᵢ). Thus: ∑(qᵢ − vᵢ)² < ∑(qᵢ − wᵢ)² ⇔ 2(1 − ∑qᵢvᵢ) < 2(1 − ∑qᵢwᵢ) ⇔ ∑qᵢvᵢ > ∑qᵢwᵢ

Exercise 6-19 Compute the vector-space similarity between the query "digital cameras" and the document "digital cameras and video cameras", entering the results in the empty columns of the table below. Assume N = 10,000,000; use logarithmic term weighting (the wf columns) for both query and document; idf weighting for the query only; and cosine normalization for the document only. Treat "and" as a stop word. Enter term counts in the tf columns. What is the final similarity score?

Exercise 12-6 [*] Consider building a language model from the following training text:
the martian has landed on the latin pop sensation ricky martin
a. Under an MLE-estimated unigram probability model, what are P(the) and P(martian)?
b. Under an MLE-estimated bigram model, what are P(sensation|pop) and P(pop|the)?

SOLUTION. No solution.
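No official solution is given, but the MLE estimates follow directly from the counts in the 11-token training text:

```latex
% Unigram MLE (11 tokens; "the" occurs twice, "martian" once):
P(\text{the}) = \tfrac{2}{11}, \qquad P(\text{martian}) = \tfrac{1}{11}.
% Bigram MLE: "pop" occurs once, followed by "sensation";
% "the" occurs twice, followed by "martian" and "latin", never by "pop":
P(\text{sensation}\mid\text{pop}) = \tfrac{1}{1} = 1, \qquad
P(\text{pop}\mid\text{the}) = \tfrac{0}{2} = 0.
```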
Exercise 8-2 [*] F1 is defined as the harmonic mean of precision and recall. What is the advantage of using the harmonic mean rather than the arithmetic mean?

Since the arithmetic mean is closer to the higher of the two values of precision and recall, it is not a good representative of both values, whereas the harmonic mean is closer to min(precision, recall). For example, when all documents relevant to a query are returned, recall is 1 and the arithmetic mean exceeds 0.5 even though precision is very low and the purpose of the search engine is not served effectively.

Exercise 8-4 [*] What are the possible values of the interpolated precision at a recall level of 0?

0≤p≤1
Exercise 8-5 [**] Must there always exist a break-even point between precision and recall? Explain why one must exist, or give a counterexample.

precision = recall iff fp = fn or tp = 0. If the highest-ranked element is not relevant, then tp = 0 at the top of the ranking, and that is a trivial break-even point. If the highest-ranked element is relevant, consider going down the ranked list: if an item is R, then fn decreases and fp does not change; if an item is N, then fp increases and fn does not change. At the beginning of the list fp < fn; at the end of the list fp > fn. Thus there has to be a break-even point.

Exercise 8-6 [**] What is the relationship between the precision-recall break-even point and the F1 value at that point?

At the break-even point: F1 = P = R.

Exercise 8-9 [**] In a collection of 10,000 documents, a certain query has 8 relevant documents in total. Below are the relevance judgments (R = relevant, N = nonrelevant) for the top 20 ranked results returned by some system for this query; 6 of the 20 are relevant:
RRNNN NNNRN RNNNR NNNNR
a. What is the precision over the top 20 documents?
b. What is the F1 over the top 20 documents?
c. What is the interpolated precision at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Assuming the system returned exactly these 20 results, compute its MAP.
Now assume the system returned all 10,000 documents and the 20 documents above are just the top 20 of its results:
f. What is the largest possible MAP this system could have?
g. What is the smallest possible MAP it could have?
h. In a series of experiments, only the top 20 documents are judged by hand, and the result of (e) is used to approximate the MAP range from (f) to (g). For the example above, how large (in absolute terms) is the error of computing MAP via (e) rather than (f) or (g)?

a. MAP(System 1) = (1/4)(1 + 2/3 + 3/9 + 4/10) = 0.6; MAP(System 2) = (1/4)(1/2 + 2/5 + 3/6 + 4/7) ≈ 0.493. System 1 has the higher mean average precision. b. The text says that MAP provides a single-figure measure of quality across recall levels. For a good MAP score, it is essential to have more relevant documents among the first few (3-5) retrieved. c. R-precision(System 1) = 1/2; R-precision(System 2) = 1/4. This ranks the systems the same as MAP.
Exercise 14-2 [*] Show by example that a training document classified with the Rocchio method may be assigned a class different from its class in the training set.
Exercise 14-4 Show that the number of linear separators between two classes is either infinite or zero.

SOLUTION. If there is one linear separator, then there is an ε > 0 such that after moving it by ε in the direction of the closest point it is still a separator; hence there are infinitely many separators. Otherwise there are zero.
Exercise 14-10 [*] A naive implementation of computing the distance between a dense centroid vector and a sparse vector iterates over all M dimensions and therefore takes Θ(M) time. Using the identity ∑(xᵢ − μᵢ)² = 1.0 + ∑μᵢ² − 2∑xᵢμᵢ and assuming that ∑μᵢ² can be precomputed, give an algorithm with time complexity Θ(Mₐ), where Mₐ is the number of distinct terms in the test document.

No solution.
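No solution is given above; here is a minimal sketch of the Θ(Mₐ) computation. The dictionaries and weights are illustrative, and the test vector x is assumed to be unit-length (so ∑xᵢ² = 1, matching the identity in the exercise):

```python
def sq_distance(x_sparse, mu, mu_sq_norm):
    """Squared Euclidean distance between a unit-length sparse vector x
    (dict term -> weight) and a dense centroid mu, via
    sum_i (x_i - mu_i)^2 = 1.0 + sum_i mu_i^2 - 2 sum_i x_i mu_i.
    Only the M_a nonzero entries of x are touched; mu_sq_norm is
    precomputed once per centroid."""
    return 1.0 + mu_sq_norm - 2.0 * sum(w * mu.get(t, 0.0)
                                        for t, w in x_sparse.items())

# Illustrative centroid and unit-length test document:
mu = {"a": 0.5, "b": 0.5, "c": 0.0}
mu_sq = sum(v * v for v in mu.values())  # precomputed once
x = {"a": 1.0}
print(sq_distance(x, mu, mu_sq))  # 0.5, matching the dense computation
```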

Exercise 13.2 [*] Which of the following three documents have identical model representations, and which have different ones, under (i) the Bernoulli model and (ii) the multinomial model? Describe the representations that differ.
(1) He moved from London, Ontario, to London, England.
(2) He moved from London, England, to London, Ontario.
(3) He moved from England to London, Ontario.

No solution.
Exercise 13-6 [**] Assume that every document in the test set belongs to exactly one class and that the classifier also assigns exactly one class to each document. This is called a single-label problem (see Section 14.5). Show that in a single-label problem: (i) the number of false positives equals the number of false negatives; (ii) the microaveraged F1 equals precision.
Exercise 13-10 Your task is to classify words as English or not English. The words are generated from the following source distribution:

(word-source probability table not reproduced)
(i) Compute the parameters (priors and conditionals) of a multinomial NB classifier that uses the letters b, n, o, u, and z as features. Assume the training set reflects the probability distribution in the table perfectly, and make the feature-independence assumptions usual for a multinomial classifier. When computing the parameters, use smoothing in which computed-zero probabilities are smoothed to 0.01 while nonzero probabilities are left unchanged. (This simplistic smoothing clearly makes P(A) + P(¬A) > 1, but that need not be fixed here.)
(ii) How does the classifier above classify the word "zoo"?
(iii) Classify "zoo" using the multinomial classifier of (i), but this time without the positional-independence assumption; that is, estimate separate parameters for each position in the word. You only need to estimate the parameters required to classify "zoo".

SOLUTION.
(i) Priors: P(English) = 1/9; P(Not English) = 8/9. Conditionals (the ones needed below): P(z|English) = 1/6, P(o|English) = 1/3; P(z|Not English) = 1/3, P(o|Not English) = 1/6; computed-zero probabilities are smoothed to 0.01.
(ii) English: 1/9 · 1/6 · 1/3 · 1/3 = 1/486. Not English: 8/9 · 1/3 · 1/6 · 1/6 = 2/243. The classifier chooses the class "Not English."
(iii) Positional conditionals: P₁(z|English) = 1/2, P₂(o|English) = 1/2, P₃(o|English) = 1/2; the corresponding Not English probabilities are all smoothed zeros, i.e., 0.01 each.
English: 1/9 · 1/2 · 1/2 · 1/2 = 1/72. Not English: 8/9 · (1/100)³ = 8/9,000,000. The classifier chooses the class "English."
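The arithmetic can be double-checked with exact fractions. The conditional probabilities used here (P(z|English) = 1/6, P(o|English) = 1/3, and their Not-English counterparts) are reconstructed from a garbled table, so treat them as assumptions:

```python
from fractions import Fraction as F

# Part (ii): position-independent multinomial scores for "zoo".
eng = F(1, 9) * F(1, 6) * F(1, 3) * F(1, 3)    # P(E) * P(z|E) * P(o|E)^2
neng = F(8, 9) * F(1, 3) * F(1, 6) * F(1, 6)   # P(~E) * P(z|~E) * P(o|~E)^2
print(eng, neng, neng > eng)  # 1/486 2/243 True -> "Not English"

# Part (iii): positional parameters; Not-English values are smoothed zeros.
eng_pos = F(1, 9) * F(1, 2) ** 3               # = 1/72
neng_pos = F(8, 9) * F(1, 100) ** 3            # = 8/9,000,000
print(eng_pos > neng_pos)  # True -> "English"
```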

Exercise 13-11 What are the values of I(Ut; Cc) and X²(D, t, c) if term t and class c are completely independent? What are they if t and c are completely dependent?

SOLUTION.
Completely independent: I = 0 and chi-square close to 0. Completely dependent: I = 1.0 and chi-square very large.
Exercise 13-12 Why is the mutual-information feature-selection method appropriate for the Bernoulli model? How would it have to be modified to suit the multinomial model?

SOLUTION.
It is most appropriate for the Bernoulli model because it is based on binary occurrence information. A modification for the multinomial model would be to estimate the probabilities for tokens rather than for documents.
Exercise 15-1 [*] What is the minimum number of support vectors a data set can have (where the data set contains instances of each class)?

No solution.
SE 9.8 Use different clustering methods to cluster the following two-dimensional instances into 3 classes: (-4,-2), (-3,-2), (-2,-2), (-1,-2), (1,-1), (1,1), (2,3), (3,2), (3,4), (4,3). Discuss the differences between the methods: which of them produce the same result? How do these results differ from clustering performed by hand?
9.12 Can the purity evaluation measure be 0? If so, give an example; if not, explain why.
IIR Exercise 16-3 Replace every point d in the figure by two identical copies of d in the same class. (i) Is clustering the resulting set of 34 points easier, equally difficult, or more difficult than clustering the original 17 points? (ii) Compute purity, NMI, RI, and F5 for the clustering of the 34 points. Which of these measures increase after doubling the points, and which stay unchanged? (iii) Given your judgment in (i) and the measures in (ii), which measures are better suited for comparing the quality of the two clusterings?

(figure for Exercise 16-3 not reproduced)

Exercise 16-8 Show that I(Ω; C) ≤ [H(Ω) + H(C)]/2.

SOLUTION. See http://www.lans.ece.utexas.edu/~strehl/diss/node107.html.
I(X;Y) = H(X) + H(Y) − H(X,Y)
I(X;Y) ≤ H(X) and I(X;Y) ≤ H(Y)
Hence I(X;Y) ≤ min(H(X), H(Y)) ≤ 0.5 (H(X) + H(Y))
Exercise 16-11 (i) Give an example of a set of points together with 3 initial centroids (the initial centroids need not be members of the point set) for which 3-means clustering converges to a clustering containing an empty cluster. (ii) Can a clustering result containing an empty cluster ever be globally optimal with respect to RSS?
SOLUTION.

(i) Start with identical cluster centers. (ii) Yes; for example, if all documents are identical.
Exercise 17-7 a. Consider running 2-means clustering on a collection whose documents come from two different languages. What result would you expect? b. Would you expect the same result when using HAC algorithms?

SOLUTION.
a. Two clusters that roughly correspond to the two languages, unless the initial seeds were badly chosen. b. Complete-link clustering is likely to end up with one big cluster containing both languages and a second small cluster of outliers. Single-link clustering may intermix the two languages through chaining if there are a few documents (e.g., tables of numbers) that can serve as a bridge between the two languages. Centroid / GAAC are likely to find two clusters corresponding roughly to the two languages.

Exercise 17-10 Consider single-link clustering of N points on a line. Show that only N similarity computations are needed in total, and give the overall time complexity of single-link clustering of points on a line.

SOLUTION.
Each cluster must be a sequence of adjacent points, so distances between clusters are distances between adjacent points, and we only need the O(N) distances between adjacent points. We can sort the points in O(N log N), so the overall clustering has time complexity O(N log N).
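The argument can be sketched in code: on a line, single-link clusters are contiguous runs, so only the N−1 gaps between sorted neighbours ever enter the similarity computation, and cutting the k−1 largest gaps yields the k-cluster single-link result. The data points below are illustrative:

```python
def single_link_1d(points, k):
    """Single-link clustering of 1-D points into k clusters.
    Sorting dominates, so the whole procedure is O(N log N)."""
    pts = sorted(points)
    # Indices of the N-1 neighbour gaps, ordered by gap size (ascending).
    gaps = sorted(range(len(pts) - 1), key=lambda i: pts[i + 1] - pts[i])
    # Cutting the k-1 largest gaps is equivalent to repeatedly merging
    # the closest pair of clusters until k clusters remain.
    cuts = sorted(gaps[-(k - 1):]) if k > 1 else []
    clusters, start = [], 0
    for c in cuts:
        clusters.append(pts[start:c + 1])
        start = c + 1
    clusters.append(pts[start:])
    return clusters

print(single_link_1d([1, 2, 3, 10, 11, 12, 30], 3))
# -> [[1, 2, 3], [10, 11, 12], [30]]
```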

