Techniques for Reducing Cache Misses and Loop Optimization in Computer Architecture

Solutions to Exam 2 questions from the CPSC 614 course in Spring 2002. The questions cover techniques for reducing cache misses, analyzing loop dependencies, and optimizing cache organizations. The document also includes an analysis of the DAXPY loop and its execution on a VLIW processor.

Exam 2 Solutions
CPSC 614, Spring 2002

Name: Duncan M. Walker
Student ID: ________________

This exam is 50 minutes, closed book. Show your work for partial credit. STATE ANY ASSUMPTIONS.

1. (15 pts) List 3 techniques for reducing cache misses. For each technique, explain in a few sentences why it works, and its advantages and disadvantages. For each technique, explain what types of misses (among the 3 Cs) are being reduced by the technique.

There are seven techniques described in the text and slides:

• Larger Block Size
• Higher Associativity
• Victim Caches
• Pseudo-Associative Caches
• HW Prefetching of Instructions/Data
• Compiler-Controlled Prefetching
• Compiler Techniques to Reduce Misses (a short C sketch of one such optimization, loop interchange, appears after the Question 2 discussion below)

There is also the most obvious technique of making the cache bigger. To go into detail on the three most common techniques:

• Bigger cache – Since a program has temporal and spatial locality, more cache entries will reduce misses. Capacity misses are reduced by definition. Conflict misses are reduced because fewer memory locations map to a given cache line. The advantage of this approach is that it is very simple. The disadvantage is that a larger cache is more expensive and slower.

• Larger block size – Larger blocks provide more prefetching, so they reduce misses due to spatial locality. This reduces the number of compulsory misses, since entries are brought into the cache before they are needed. It can also be viewed as reducing conflict misses: if more lines with smaller blocks were used in a direct-mapped or set-associative cache, there would be a chance for another memory location to occupy a line instead of the desired value. The advantage of this technique is that it reduces the number of tags, making the cache smaller and potentially faster, and it may match better with a wide path to memory. But it requires more multiplexing to select the words within the larger blocks, and blocks that are too large start to fetch data that is not used. It also increases memory traffic and increases the miss penalty if an entire block must be fetched before use. Finally, the larger block size means fewer lines, which increases the number of conflict misses.

• Higher associativity – More blocks per set reduce the number of conflict misses. The advantage is that a relatively small amount of hardware significantly reduces the miss rate. It also reduces the number of bits needed to index the cache, so the index can still fit within the page offset (letting the TLB lookup overlap the cache access) even with a smaller page size. The disadvantage is that the tags must be compared and a multiplexer must then select the matching set member before the CPU can continue, increasing the cache delay.

2. (15 pts) What are the dependencies in the loop below? Is this loop parallel (i.e., can the loop iterations be executed in parallel)? If it is parallel, explain why. If it is not parallel, explain why not and, if possible, show how to make it parallel.

    for (I = 1; I <= 100; I++) {
        A[I] = A[I] + B[I];      /* S1 */
        D[I+1] = D[I] + E[I];    /* S2 */
        B[I+1] = C[I] + D[I+1];  /* S3 */
    }

In S1, B[I] is computed by S3 on the previous iteration, so it is a loop-carried dependency. In S2, D[I] is computed by S2 on the previous iteration, so it is a loop-carried dependency. In S3, D[I+1] is computed by S2 on the same iteration. Since there is a loop-carried dependency, the loop is not parallel.
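To make these cross-iteration flows explicit, the loop can be annotated with the statement and iteration that produces each value it reads (for I = 1, B[1] and D[1] are simply the arrays' initial values):

    for (I = 1; I <= 100; I++) {
        A[I]   = A[I] + B[I];      /* S1: B[I] written by S3 in iteration I-1   -> loop-carried     */
        D[I+1] = D[I] + E[I];      /* S2: D[I] written by S2 in iteration I-1   -> loop-carried     */
        B[I+1] = C[I] + D[I+1];    /* S3: D[I+1] written by S2 in this iteration -> not loop-carried */
    }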
The S3-to-S1 dependency can be handled by changing the loop order so that S3 is computed before S1. However, that will not remove the S2-to-S2 loop-carried dependency, since this dependency forms a cycle: S2 fills the D array with running sums of the previous elements, D[I+1] = D[1] + Σ_{J=1..I} E[J]. Based on the definitions in Chapter 4, a loop with a cycle of dependencies cannot be made parallel. As discussed in Chapter 4, even pipelining such recurrences is challenging, since each computation depends on all the previous ones. Parallelism in the sum operation can still be obtained by first adding the E[J] elements in pairs in parallel and then combining those partial sums with a binary tree. This takes log2(N) iterations for arrays of length N, or 7 in this case.
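As a minimal sketch of this binary-tree summation (the term[] array and its contents are illustrative placeholders, not values from the exam), the following program sums N = 100 terms in 7 dependent passes; within each pass the additions are mutually independent and could execute in parallel:

    #include <stdio.h>

    #define N 100

    int main(void) {
        double term[N + 1];        /* term[1..N]: the values to be summed       */
        double expected = 0.0;     /* sequential reference result, for checking */

        /* Illustrative inputs only: term[J] stands for whatever per-element   */
        /* value the surrounding program has already produced.                 */
        for (int j = 1; j <= N; j++) {
            term[j] = (double)j;
            expected += term[j];
        }

        /* Binary-tree reduction: each pass halves the number of live partial  */
        /* sums, and the additions within one pass do not depend on each other. */
        int passes = 0;
        for (int stride = 1; stride < N; stride *= 2) {
            for (int j = 1; j + stride <= N; j += 2 * stride)
                term[j] += term[j + stride];   /* combine adjacent partial sums */
            passes++;
        }

        /* For N = 100 this prints: sum = 5050, expected = 5050, passes = 7 */
        printf("sum = %.0f, expected = %.0f, passes = %d\n",
               term[1], expected, passes);
        return 0;
    }

A straightforward sequential sum needs N − 1 dependent additions, so for N = 100 the tree shortens the dependence chain from 99 additions to the 7 passes mentioned above.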
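Returning to the compiler techniques listed in Question 1: as an illustration of one of them (the array x and its dimensions below are made up for this sketch, not taken from the exam), loop interchange reduces misses by exploiting spatial locality. With a row-major C array, swapping the loop nest turns a large-stride traversal, where nearly every access brings in a new block, into a sequential one that uses every word of a block before it is replaced:

    #include <stdio.h>

    #define ROWS 512
    #define COLS 512

    static double x[ROWS][COLS];   /* hypothetical data; C stores this row-major */

    int main(void) {
        int i, j;

        /* Before interchange: the inner loop strides COLS doubles between       */
        /* accesses, so consecutive references usually fall in different blocks. */
        for (j = 0; j < COLS; j++)
            for (i = 0; i < ROWS; i++)
                x[i][j] = 2.0 * x[i][j];

        /* After interchange: the inner loop walks one row sequentially, so all  */
        /* the words of each fetched block are used before it is replaced.       */
        for (i = 0; i < ROWS; i++)
            for (j = 0; j < COLS; j++)
                x[i][j] = 2.0 * x[i][j];

        printf("%f\n", x[0][0]);   /* keep the computation observable */
        return 0;
    }

The two nests do the same work; only the memory access order, and therefore the miss rate, differs.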