Techniques for Reducing Cache Misses and Loop Optimization in Computer Architecture

Solutions to Exam 2 questions from the CPSC 614 course in Spring 2002. The questions cover techniques for reducing cache misses, analyzing loop dependencies, and optimizing cache organizations. The document also includes an analysis of the DAXPY loop and its execution on a VLIW processor.

Exam 2 Solutions
CPSC 614, Spring 2002

Name: Duncan M. Walker
Student ID: ________________

This exam is 50 minutes, closed book. Show your work for partial credit. STATE ANY ASSUMPTIONS.

1. (15 pts) List 3 techniques for reducing cache misses. For each technique, explain in a few sentences why it works, and its advantages and disadvantages. For each technique, explain what types of misses (among the 3 Cs) are being reduced by the technique.

There are seven techniques described in the text and slides:

• Larger Block Size
• Higher Associativity
• Victim Caches
• Pseudo-Associative Caches
• HW Prefetching of Instructions/Data
• Compiler-Controlled Prefetching
• Compiler Techniques to Reduce Misses (a short C sketch of one such optimization, loop interchange, appears after the Question 2 discussion below)

There is also the most obvious technique of making the cache bigger. To go into detail on the three most common techniques:

• Bigger cache – Since a program has temporal and spatial locality, more cache entries will reduce misses. Capacity misses are reduced by definition. Conflict misses are reduced because fewer memory locations map to a given cache line. The advantage of this approach is that it is very simple. The disadvantage is that a larger cache is more expensive and slower.

• Larger block size – Larger blocks provide more prefetching, so they reduce misses due to spatial locality. This reduces the number of compulsory misses, since entries are brought into the cache before they are needed. It can also be viewed as reducing conflict misses: if more lines with smaller blocks were used in a direct-mapped or set-associative cache, there would be a chance for another memory location to occupy a line instead of the desired value. The advantage of this technique is that it reduces the number of tags, making the cache smaller and potentially faster, and it may match better with a wide path to memory. But it requires more multiplexing to select the words within the larger blocks, and blocks that are too large start to fetch data that is not used. It also increases memory traffic and increases the miss penalty if an entire block must be fetched before use. Finally, the larger block size means fewer lines, which increases the number of conflict misses.

• Higher associativity – More blocks per set reduce the number of conflict misses. The advantage is that a relatively small amount of hardware significantly reduces the miss rate. It also reduces the number of bits needed to index the cache, so the index can still fit within the page offset (letting the TLB lookup overlap the cache access) even with a smaller page size. The disadvantage is that the tags must be compared and a multiplexer must then select the matching set member before the CPU can continue, increasing the cache delay.

2. (15 pts) What are the dependencies in the loop below? Is this loop parallel (i.e., can the loop iterations be executed in parallel)? If it is parallel, explain why. If it is not parallel, explain why not and, if possible, show how to make it parallel.

    for (I = 1; I <= 100; I++) {
        A[I] = A[I] + B[I];      /* S1 */
        D[I+1] = D[I] + E[I];    /* S2 */
        B[I+1] = C[I] + D[I+1];  /* S3 */
    }

In S1, B[I] is computed by S3 on the previous iteration, so it is a loop-carried dependency. In S2, D[I] is computed by S2 on the previous iteration, so it is a loop-carried dependency. In S3, D[I+1] is computed by S2 on the same iteration. Since there is a loop-carried dependency, the loop is not parallel.
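To make these cross-iteration flows explicit, the loop can be annotated with the statement and iteration that produces each value it reads (for I = 1, B[1] and D[1] are simply the arrays' initial values):

    for (I = 1; I <= 100; I++) {
        A[I]   = A[I] + B[I];      /* S1: B[I] written by S3 in iteration I-1   -> loop-carried     */
        D[I+1] = D[I] + E[I];      /* S2: D[I] written by S2 in iteration I-1   -> loop-carried     */
        B[I+1] = C[I] + D[I+1];    /* S3: D[I+1] written by S2 in this iteration -> not loop-carried */
    }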
The S3-to-S1 dependency can be handled by changing the loop order so that S3 is computed before S1. However, that will not remove the S2-to-S2 loop-carried dependency, since this dependency forms a cycle: S2 fills the D array with running sums of the previous elements, D[I+1] = D[1] + Σ_{J=1..I} E[J]. Based on the definitions in Chapter 4, a loop with a cycle of dependencies cannot be made parallel. As discussed in Chapter 4, even pipelining such recurrences is challenging, since each computation depends on all the previous ones. Parallelism in the sum operation can still be obtained by first adding the E[J] elements in pairs in parallel and then combining those partial sums with a binary tree. This takes log2(N) iterations for arrays of length N, or 7 in this case.
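As a minimal sketch of this binary-tree summation (the term[] array and its contents are illustrative placeholders, not values from the exam), the following program sums N = 100 terms in 7 dependent passes; within each pass the additions are mutually independent and could execute in parallel:

    #include <stdio.h>

    #define N 100

    int main(void) {
        double term[N + 1];        /* term[1..N]: the values to be summed       */
        double expected = 0.0;     /* sequential reference result, for checking */

        /* Illustrative inputs only: term[J] stands for whatever per-element   */
        /* value the surrounding program has already produced.                 */
        for (int j = 1; j <= N; j++) {
            term[j] = (double)j;
            expected += term[j];
        }

        /* Binary-tree reduction: each pass halves the number of live partial  */
        /* sums, and the additions within one pass do not depend on each other. */
        int passes = 0;
        for (int stride = 1; stride < N; stride *= 2) {
            for (int j = 1; j + stride <= N; j += 2 * stride)
                term[j] += term[j + stride];   /* combine adjacent partial sums */
            passes++;
        }

        /* For N = 100 this prints: sum = 5050, expected = 5050, passes = 7 */
        printf("sum = %.0f, expected = %.0f, passes = %d\n",
               term[1], expected, passes);
        return 0;
    }

A straightforward sequential sum needs N − 1 dependent additions, so for N = 100 the tree shortens the dependence chain from 99 additions to the 7 passes mentioned above.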
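Returning to the compiler techniques listed in Question 1: as an illustration of one of them (the array x and its dimensions below are made up for this sketch, not taken from the exam), loop interchange reduces misses by exploiting spatial locality. With a row-major C array, swapping the loop nest turns a large-stride traversal, where nearly every access brings in a new block, into a sequential one that uses every word of a block before it is replaced:

    #include <stdio.h>

    #define ROWS 512
    #define COLS 512

    static double x[ROWS][COLS];   /* hypothetical data; C stores this row-major */

    int main(void) {
        int i, j;

        /* Before interchange: the inner loop strides COLS doubles between       */
        /* accesses, so consecutive references usually fall in different blocks. */
        for (j = 0; j < COLS; j++)
            for (i = 0; i < ROWS; i++)
                x[i][j] = 2.0 * x[i][j];

        /* After interchange: the inner loop walks one row sequentially, so all  */
        /* the words of each fetched block are used before it is replaced.       */
        for (i = 0; i < ROWS; i++)
            for (j = 0; j < COLS; j++)
                x[i][j] = 2.0 * x[i][j];

        printf("%f\n", x[0][0]);   /* keep the computation observable */
        return 0;
    }

The two nests do the same work; only the memory access order, and therefore the miss rate, differs.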