Low price
CHF58.95
Usually ships within 2 to 4 weeks.
The book will be mostly reference and part cookbook. Careful descriptions of the hardware and software abstractions, best practices, and example source code will be included. Much of the source code will come in the form of reusable “microbenchmarks” or “microdemos” designed to expose specific hardware characteristics or highlight specific use cases. Best practices will be discussed and accompanied by source code. One idea that will be emphasized is the “EERS Principle” (Empirical Evidence Reigns Supreme): determining the fastest way to perform a given operation is best done empirically.
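To illustrate the empirical approach that the description calls the EERS Principle, here is a minimal host-side sketch in Python. It is not taken from the book, whose code targets CUDA; the function names (`sum_loop`, `sum_builtin`, `fastest`) and the input size are hypothetical choices made for illustration. The idea is simply to time every candidate on identical input and let the measurements pick the winner.

```python
import timeit


def sum_loop(values):
    """Candidate 1: explicit accumulation loop."""
    total = 0
    for v in values:
        total += v
    return total


def sum_builtin(values):
    """Candidate 2: the built-in sum()."""
    return sum(values)


def fastest(candidates, data, repeat=5, number=20):
    """Time each candidate on the same input.

    Returns (name_of_fastest, {name: seconds_per_call}).
    """
    timings = {}
    for name, fn in candidates.items():
        # Keep the minimum of several repeats to reduce scheduling noise.
        runs = timeit.repeat(lambda: fn(data), repeat=repeat, number=number)
        timings[name] = min(runs) / number
    best = min(timings, key=timings.get)
    return best, timings


data = list(range(10_000))
candidates = {"loop": sum_loop, "builtin": sum_builtin}

# Both candidates must compute the same result before speed is compared.
assert sum_loop(data) == sum_builtin(data)

best, timings = fastest(candidates, data)
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e6:.1f} microseconds per call")
print("fastest on this machine:", best)
```

The result depends on the machine and interpreter, which is the point: the ranking is measured rather than assumed.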
About the author
Nicholas Wilt has been programming professionally for more than twenty-five years in a variety of areas, including industrial machine vision, graphics, and low-level multimedia software. While at Microsoft, he served as the development lead for Direct3D 5.0 and 6.0, built the prototype for the Desktop Window Manager, and did early GPU computing work. At NVIDIA, he worked on CUDA from its inception, designing and often implementing most of CUDA’s low-level abstractions. Now at Amazon, Mr. Wilt is working on cloud computing technologies relating to GPUs.
Contents
Preface xxi
Acknowledgments xxiii
About the Author xxv
Part I: 1
Chapter 1: Background 3
1.1 Our Approach 5
1.2 Code 6
1.3 Administrative Items 7
1.4 Road Map 8
Chapter 2: Hardware Architecture 11
2.1 CPU Configurations 11
2.2 Integrated GPUs 17
2.3 Multiple GPUs 19
2.4 Address Spaces in CUDA 22
2.5 CPU/GPU Interactions 32
2.6 GPU Architecture 41
2.7 Further Reading 50
Chapter 3: Software Architecture 51
3.1 Software Layers 51
3.2 Devices and Initialization 59
3.3 Contexts 67
3.4 Modules and Functions 71
3.5 Kernels (Functions) 73
3.6 Device Memory 75
3.7 Streams and Events 76
3.8 Host Memory 79
3.9 CUDA Arrays and Texturing 82
3.10 Graphics Interoperability 86
3.11 The CUDA Runtime and CUDA Driver API 87
Chapter 4: Software Environment 93
4.1 nvcc – CUDA Compiler Driver 93
4.2 ptxas – the PTX Assembler 100
4.3 cuobjdump 105
4.4 nvidia-smi 106
4.5 Amazon Web Services 109
Part II: 119
Chapter 5: Memory 121
5.1 Host Memory 122
5.2 Global Memory 130
5.3 Constant Memory 156
5.4 Local Memory 158
5.5 Texture Memory 162
5.6 Shared Memory 162
5.7 Memory Copy 164
Chapter 6: Streams and Events 173
6.1 CPU/GPU Concurrency: Covering Driver Overhead 174
6.2 Asynchronous Memcpy 178
6.3 CUDA Events: CPU/GPU Synchronization 183
6.4 CUDA Events: Timing 186
6.5 Concurrent Copying and Kernel Processing 187
6.6 Mapped Pinned Memory 197
6.7 Concurrent Kernel Processing 199
6.8 GPU/GPU Synchronization: cudaStreamWaitEvent() 202
6.9 Source Code Reference 202
Chapter 7: Kernel Execution 205
7.1 Overview 205
7.2 Syntax 206
7.3 Blocks, Threads, Warps, and Lanes 211
7.4 Occupancy 220
7.5 Dynamic Parallelism 222
Chapter 8: Streaming Multiprocessors 231
8.1 Memory 233
8.2 Integer Support 241
8.3 Floating-Point Support 244
8.4 Conditional Code 267
8.5 Textures and Surfaces 269
8.6 Miscellaneous Instructions 270
8.7 Instruction Sets 275
Chapter 9: Multiple GPUs 287
9.1 Overview 287
9.2 Peer-to-Peer 288
9.3 UVA: Inferring Device from Address 291
9.4 Inter-GPU Synchronization 292
9.5 Single-Threaded Multi-GPU 294
9.6 Multithreaded Multi-GPU 299
Chapter 10: Texturing 305
10.1 Overview 305
10.2 Texture Memory 306
10.3 1D Texturing 314
10.4 Texture as a Read Path 317
10.5 Texturing with Unnormalized Coordinates 323
10.6 Texturing with Normalized Coordinates 331
10.7 1D Surface Read/Write 333
10.8 2D Texturing 335
10.9 2D Texturing: Copy Avoidance 338
10.10 3D Texturing 340
10.11 Layered Textures 342
10.12 Optimal Block Sizing and Performance 343
10.13 Texturing Quick References 345
Part III: 351
Chapter 11: Streaming Workloads 353
11.1 Device Memory 355
11.2 Asynchronous Memcpy 358
11.3 Streams 359
11.4 Mapped Pinned Memory 361
11.5 Performance and Summary 362
Chapter 12: Reduction 365
12.1 Overview 365
12.2 Two-Pass Reduction 367
12.3 Single-Pass Reduction 373
12.4 Reduction with Atomics 376
12.5 Arbitrary Block Sizes 377
12.6 Reduction Using Arbitrary Data Types 378
12.7 Predicate Reduction 382
12.8 Warp Reduction with Shuffle 382
Chapter 13: Scan 385
13.1 Definition and Variations 385
13.2 Overview 387
13.3 Scan and Circuit Design 390
13.4 CUDA Implementations 394
13.5 Warp Scans 407
13.6 Stream Compaction 414
13.7 References (Parallel Scan Algorithms) 418
13.8 Further Reading (Parallel Prefix Sum Circuits) 419
Chapter 14: N-Body 421
14.1 Introduction 423
14.2 Naïve Implementation 428
14.3 Shared Memory 432
14.4 Constant Memory 434
14.5 Warp Shuffle 436
14.6 Multiple GPUs and Scalability 438
14.7 CPU Optimizations 439
14.8 Conclusion 444
14.9 References and Further Reading 446
Chapter 15: Image Processing: Normalized Correlation 449
15.1 Overview 449
15.2 Naïve Texture-Texture Implementation 452
15.3 Template in Constant Memory 456
15.4 Image in Shared Memory 459
15.5 Further Optimizations 463
15.6 Source Code 465
15.7 Performance and Further Work 466
15.8 Further Reading 469
Appendix A: The CUDA Handbook Library 471
A.1 Timing 471
A.2 Threading 472
A.3 Driver API Facilities 474
A.4 Shmoos 475
A.5 Command Line Parsing 476
A.6 Error Handling 477
Glossary / TLA Decoder 481
Index 487