Parallel Programming in NVidia's Compute Unified Device Architecture

Summary
This report, "Parallel Programming in NVidia's Compute Unified Device Architecture", notes that the basic technology supporting computing around an internetwork has existed in something like its present form since the late 1960s…

Parallel Programming in NVidia's Compute Unified Device Architecture (CUDA)

Supervisor:
Student's Name
School of Chemistry
University of ----------

Table of Contents
Parallel Programming in NVidia's Compute Unified Device Architecture (CUDA)
The data-parallel processing approach
The history of GPU computing
The heterogeneous execution model
The GPU memory hierarchy
CUDA GPU thread structure
The future directions of GPU computing
References

The data-parallel processing approach

The data-parallel processing approach distributes arrays across multiple processors for computation. Once the data has been distributed, the processors perform the computations that produce the required results, with the data organized in parallel patterns for operations such as multiplication, addition, and subtraction. The main reasons for choosing a data-parallel approach are to reduce the computation time for large amounts of data and to obtain a better solution. Because it uses GPU memory, it can process large and complex data sets, such as the genetic variability of a cattle population in a given location. Some models take a long time to process such information, but parallel computing reduces processing time by restructuring the code.

To process data with this approach, one must decompose the data into small units and arrange it so that multiple processors running parallel algorithms can handle it. A parallel algorithm is implemented in programs written in languages that support parallelism, and this style of computing is practical only where multi-connected supercomputers or many-core processors are available. In each computation, the data passes through a program divided into three stages that make parallel computing possible: pipelining, data control, and data parallelism. Pipelining divides a computation into segments that carry out distinct functions; data parallelism has different but uniform functional units carry out multiple operations at once. Applications built on this model are efficient and process large amounts of data independently. They are developed on a server where the GPU development software resides; once a proprietary application is developed on that server, clients distribute it to other servers for use.

Errors in the measurement of topographic data can be minimized by quality control of the survey and are generally within acceptable tolerances. However, gross errors do sometimes occur, and these need to be eliminated either by quality control procedures at the data input stage or during the modeling procedure. Errors in the data are not as easy to eliminate and may therefore need to be handled by a sensitivity analysis. In such a dynamic environment, the servers expand or reduce their capacities according to their requirements. The data-parallelism approach is GPU-based computing in which shared resources, applications, and information are provided to computers and other devices on demand. The functioning of the application does not depend on the system's architecture; it runs mainly on server components. The developer thus gains a double benefit: creating proprietary software with a GPU application while not being required to reveal the source of the module or application created.
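To make the pattern concrete, here is a minimal, hypothetical CUDA sketch of the data-parallel decomposition described above: the array is the unit of decomposition, and each GPU thread computes one element of an element-wise addition. The kernel name, array sizes, and launch configuration are illustrative assumptions, not taken from the report.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Data-parallel decomposition: each thread owns one array element
// and performs the same operation (here, addition) on it.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // the grid may be larger than the array
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;                 // one million elements (assumed)
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);          // unified memory, for brevity
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // ceiling division
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);           // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Note that the same launch scales to any array size simply by adjusting the block count; that independence of each element's computation is what makes the approach data-parallel.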
Data parallelism allows convenient access to a pool of computing resources that can be configured with minimal interaction. It is revolutionizing computational science by allowing researchers to tap extremely powerful computing resources through the server. The key lesson is that, given a problem-decomposition decision, programmers typically have to select from a variety of algorithms. Some of these algorithms achieve different trade-offs while maintaining the same numerical accuracy; others sacrifice some level of accuracy to achieve much more scalable running times. The cutoff strategy is perhaps the most popular of these. Although cutoff is introduced here in the context of electrostatic potential map calculation, it is used in many domains, including ray tracing in graphics and collision detection in games. Computational thinking skills allow an algorithm designer to work around such roadblocks and reach a good solution.

The history of GPU computing

There has been remarkable advancement in graphics pipelines over the last thirty years, evolving from the large machines in use in the 1980s to today's small machines capable of producing one billion pixels per second. This progress followed the evolution of semiconductor technology, which led to new developments in hardware design and graphics algorithms. In 2001, graphics programming improved further with the introduction of an instruction set that enabled a 32-bit floating-point pixel shader, unifying the functions of different segments of a program. In 2005, a unified processor was introduced that allowed pixel and vertex shaders to execute on the same processor.

Heterogeneous execution models ensure time-sharing while computing or communicating among many users through multitasking and multiprogramming. Their introduction during the 1980s and increased usage in 2007 marked a significant technological shift in computing. The model materialized through a GPU and a control unit, each unit having 32K of core memory. Charges for storage, input/output, library programs, and FORTRAN rose to premium rates to reflect CRU charges for limited resources, and the dials could be adjusted to regulate the charges associated with usage of the system. Competitors joined in with other pricing schemes to ensure users paid a fair share. This mechanism is analogous to today's cloud computing services, which charge a fee for utilizing data center resources; in essence, cloud computing and computer time-sharing rely on the same concept, with some distinctions.

An application provider, for his part, can use GPL-compliant software on the service provider's server to develop an application and then sell that application to others on the web. Here again the application provider is not violating the reciprocal clause, since it is not he who is bound by the GPL license agreement but the service provider. Circumvention like this is quite rampant in the cloud computing space, especially with respect to software applications that require the user to agree to the GPL license agreement.

The heterogeneous execution model

The heterogeneous execution model uses both MPI and CUDA programming to carry out its functions. MPI is used for data transfer and communication between nodes.
For data execution, MPI enables data sent from one server to be received effectively by the receiving server, using explicit send and receive operations between processes.

Figure 1: Heterogeneous many-core architecture (http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf)

Encryption renders the contents of electronic transmissions unreadable by anyone who might intercept them. Modern encryption technologies provide a high degree of protection against the vast majority of potential attackers. Legitimate recipients can decrypt transmission contents using a piece of data called a "key", which the recipient typically possesses as a result of a previous interaction. Like passwords, keys must be kept secret and protected from social engineering, physical theft, insecure transmission, and the variety of other techniques hostile forces use to obtain them. Encryption does little good if the key that decrypts is available to attackers. Encryption also does not conceal everything about a network transmission: hackers can still gain useful information from the pattern of transmission, the lengths of messages, or their origin or destination addresses, and encryption does not prevent attackers from intercepting and changing the data in a transmission.

Reliability models provide information about the reliability of software, either by predicting it from design parameters or from test data. Predicting software reliability from design parameters is akin to ascertaining the level of defects in the software, so such models are referred to as "defect density" models. These models use code characteristics such as nesting of loops, lines of code, inputs/outputs, and external references to estimate the degree of defects in the software. Models that assess software from test data are akin to testing reliability growth, so they are referred to as "software reliability growth" models. These models use statistical correlations to find the relationship between known statistical functions and defect-detection data; good correlations are then used to predict the software's future behavior.

The heterogeneous execution model is then treated as a homogeneous Poisson process, implying that the computation varies, and time is measured as a component of program execution time. Program execution time is the actual time a processor spends executing program instructions. Computations are done in steps, and the process repeatedly partitions the data and exchanges it with neighbors. This strategy forces the system into one of two modes. In the first mode, all compute processes perform computation steps; during this time, the communication network is not used. In the second mode, all compute processes exchange halo data with their left and right neighbors; during this time, the computation hardware is not well utilized. Ideally, we would like to achieve better performance by utilizing both the communication network and the computation hardware all the time, as the sketch below suggests.
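To illustrate the two alternating modes just described, the following hypothetical MPI loop has each rank alternate between a local computation step (where a CUDA kernel would run on that rank's partition) and a halo exchange with its left and right neighbors. All names, sizes, and the step count are assumptions made for the sketch; the report gives no code.

```cuda
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;                        // width of one halo row (assumed)
    std::vector<float> local(n, (float)rank);  // boundary data to send
    std::vector<float> haloL(n), haloR(n);     // ghost rows received
    // MPI_PROC_NULL turns communication at the grid edges into a no-op.
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 10; ++step) {
        // Mode 1: all ranks compute; the network sits idle.
        // (A CUDA kernel launch on this rank's partition would go here.)

        // Mode 2: all ranks exchange halos; the compute hardware sits idle.
        // MPI_Sendrecv pairs each send with a receive to avoid deadlock.
        MPI_Sendrecv(local.data(), n, MPI_FLOAT, left,  0,
                     haloR.data(), n, MPI_FLOAT, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(local.data(), n, MPI_FLOAT, right, 1,
                     haloL.data(), n, MPI_FLOAT, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```

Overlapping the two modes, for example with non-blocking MPI_Isend/MPI_Irecv posted while a kernel computes interior points, is exactly the improvement the paragraph above calls for.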
Applications generally deploy across several servers; many need three to five, although some still require hundreds. Since this is a dynamic datacenter, several servers will be required, and a model will be put in place to link those applications, servers, and configurations. The model also helps the owners of the applications understand the application components so they can configure them in an appropriate, standard way. The process begins with a business analyst who drafts the application requirements. An architect then defines the application architecture and the deployment model to be applied; developers implement the application, and it is deployed into the environment as the model stipulates.

The GPU memory hierarchy

The GPU has many programmable processing cores that carry out the data-parallel stages of the graphics pipeline, with SIMD execution, as shown in the diagram below.

Figure 2: GPU memory hierarchy: a homogeneous collection of throughput-optimized programmable processing cores, augmented by fixed-function logic (Kirk and Hwu, 2014).

It is important to note that each singular term used to denote the different complex notations, such as point, sample, value, and signal, constitutes both a real part and an imaginary part. A typical 16-point signal can be decomposed through four distinct stages. The first stage of decomposition yields two signals of 8 points each, and the second stage yields four signals of 4 points each. The idea is to continue halving until a given number of signals, in this case 16, each containing a single point, remain. Each decomposition stage is interlaced, meaning that the two signals resulting from a decomposition hold the even- and odd-numbered samples respectively. An important feature of the decomposition process is that it makes it possible for the samples in the original signal to be reordered; however, this reordering must follow a specific pattern, determined by the binary equivalent of each sample index. The algorithm rearranges the order of the time-domain samples by counting in binaries that have been flipped from left to right (bit reversal).

After the bit-reversal sorting stage of the algorithm, the next step is finding the frequency spectra of the 1-point time-domain signals left at the end of the last decomposition phase. This is a very easy process, since the frequency spectrum of a 1-point signal is equal to the signal itself, so there is virtually nothing to be done at this stage; note, though, that the final 1-point signals are no longer time-domain signals but frequency spectra. Lastly, the frequency spectra are recombined in exactly the reverse of the order followed during the decomposition of the time domain. Since the bit-reversal formula is not applicable to this recombination, the reverse process is performed one stage at a time. The flow diagram used to represent this synthesis is referred to as a butterfly, and it forms the basic computational element of the transform, converting two complex points into two different complex points. Synthesizing the data from the separate single points involves three essential, concentric loops: the outermost loop runs through the stages, while the middle loop moves through each of the individual sub-transforms in the stage currently being worked on.
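The bit-reversed reordering described above is simple to state in code. The following self-contained sketch (with illustrative names; the report provides no implementation) reorders a 16-point signal by swapping each sample with the sample whose 4-bit index is its mirror image.

```cuda
#include <cstdio>
#include <utility>

// Reverse the low `bits` bits of x: for 4 bits, 0001 -> 1000 (1 -> 8).
unsigned reverseBits(unsigned x, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

int main()
{
    const unsigned N = 16, BITS = 4;   // a 16-point signal, as in the text
    float signal[N];
    for (unsigned i = 0; i < N; ++i) signal[i] = (float)i;

    // Swap each sample with its bit-reversed partner exactly once (j > i).
    for (unsigned i = 0; i < N; ++i) {
        unsigned j = reverseBits(i, BITS);
        if (j > i) std::swap(signal[i], signal[j]);
    }

    for (unsigned i = 0; i < N; ++i) printf("%g ", signal[i]);
    printf("\n");   // prints: 0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
    return 0;
}
```

After this reordering, each recombination stage can read its two inputs from adjacent positions, which is why the decomposition makes the in-place butterfly synthesis possible.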
The computing resources of the provider are pooled to serve many consumers through a multitenant model, with different physical and virtual resources assigned and reassigned according to customer requirements. There is location independence in that a customer cannot ascertain the exact location of the resources other than at a higher level of abstraction, such as the country. Resources are elastically provided and released to scale rapidly inward and outward commensurate with demand, so consumers can assume that the services available are unlimited and can be extended at any time in any quantity. Computing systems control and optimize the use of resources automatically by leveraging a metering mechanism at a level of abstraction appropriate to the service type; resource usage can be controlled, monitored, and reported, providing transparency for both the customer and the provider of the service.

CUDA GPU thread structure

The CUDA GPU thread structure has two main components: a host and a GPU kernel. Host code runs locally on a CPU, while GPU kernels are functions that run on GPU devices; kernel execution can be completely independent of host execution (Kirk and Hwu, 2014).

Figure 3: Thread structure

Execution starts with the code on the CPU host and then moves to other parts, as shown in the diagram below.

Figure 4: Execution structure (Kirk and Hwu, 2014).

At a certain point in the program, the host code invokes a GPU kernel on a GPU device; up to that point, execution has consisted of a single flow of control. The kernel is executed on a GPU grid composed of independent groups of threads called thread blocks. When the kernel completes its execution, the CPU continues to execute the original program; within the grid, the flow of control proceeds block by block to the last thread.

In GPU integration, programmers have to integrate their data and applications on a GPU computing platform, or may have to integrate multiple applications. All these integrated applications may have been developed on a server that uses GPL-compliant software, and once developed they are moved to other servers or distributed environments for use by end users. Here again there is no violation of the reciprocal clause of the GPL license, so the developers of these applications can build them on a server that stores the GPL-compliant software and then deploy them with impunity on other servers and distributed environments.

The case studies covered in chapters 10, 11, 12, and 13

In the chapter 10 case study, CUDA is used to analyze data stored in CSR (compressed sparse row) form, in which the nonzero values together with their column indices and row pointers are used to obtain the result, one thread per row. This representation is called a sparse matrix format because most of the numbers in the matrix are zero. In the manipulation, several interrelations are applied, which was straightforward since the matrix was square; the execution was efficient, and it used little memory.
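A minimal sketch of the CSR sparse matrix-vector product that the chapter 10 case study describes (cf. Bell and Garland, 2009) might look like the kernel below, with one thread per row. The names and launch parameters are assumptions for illustration, not the book's code.

```cuda
// y = A * x, with A stored in CSR form: val holds the nonzeros,
// colIdx their column numbers, and rowPtr[r]..rowPtr[r+1] delimits
// the slice of val/colIdx belonging to row r. One thread per row.
__global__ void spmvCsr(int rows,
                        const int *rowPtr, const int *colIdx,
                        const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += val[k] * x[colIdx[k]];   // only nonzeros are touched
        y[row] = sum;
    }
}

// Illustrative launch: spmvCsr<<<(rows + 255) / 256, 256>>>(...);
```

Because only the nonzero entries are stored and visited, both the memory footprint and the work per row stay proportional to the number of nonzeros, matching the low-memory, efficient execution noted above.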
In chapter 11, there is the application of a matrix in magnetic resonance imaging. An MRI system is made up of a data acquisition system, a computer and image processing systems, the main magnet system, a gradient system, and a radio-frequency source. The magnetic field should be very stable, without fluctuations in strength, and should have acceptable homogeneity with uniform strength. A high-strength, homogeneous magnetic field is necessary to receive a strong signal and ensure that MRI works as required, so achieving field homogeneity is essential. To optimize the homogeneity of the magnetic field, a process known as shimming is used; shimming can be done either passively or actively.

The technology relies on a selected nucleus within the body to generate the electromagnetic signal necessary for imaging. The preferred nucleus for MRI is the hydrogen nucleus, because of the high hydrogen presence in water, body fluids, and body fat, which provides a reliable basis for imaging. In addition, the density of hydrogen protons in human tissues varies as the chemical composition of the tissues changes; the difference in the MRI signal therefore depends on the difference in the hydrogen nuclei within the body tissues and physiological structures. The raw data is placed in k-space such that the data associated with low frequencies is put at the center while the high frequencies are placed around the center. The low-frequency signal carries most of the information about signal and contrast, while the high-frequency signals carry information on spatial resolution, i.e., sharpness.

In chapter 12, Visual Molecular Dynamics (VMD) is used as a case study. VMD involves animating, displaying, and analyzing molecular systems; it runs on a wide range of operating systems and hardware to produce 3-D objects. VMD calculates electrostatic potential maps over a large space, used for molecular orbital display, ion placement, gas imaging, and analysis of the calculations carried out. The electrostatic maps in this case are calculated using direct Coulomb summation, in which every atom's contribution is added up at each grid point.
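As an illustration of direct Coulomb summation, the hypothetical kernel below assigns one thread per grid point on a 2-D slice of the potential map and sums every atom's contribution to that point. The atom data layout, the folding of physical constants into the charge, and all names are assumptions made for this sketch.

```cuda
// Direct Coulomb summation over a 2-D potential-map slice.
// atoms is packed as {x, y, z, charge} per atom; each thread owns
// one grid point and accumulates every atom's contribution to it.
__global__ void directCoulomb(float *energyGrid, int gridW, int gridH,
                              float gridSpacing, float sliceZ,
                              const float4 *atoms, int numAtoms)
{
    int gx = blockIdx.x * blockDim.x + threadIdx.x;
    int gy = blockIdx.y * blockDim.y + threadIdx.y;
    if (gx >= gridW || gy >= gridH) return;

    float px = gx * gridSpacing;      // position of this grid point
    float py = gy * gridSpacing;
    float energy = 0.0f;

    for (int i = 0; i < numAtoms; ++i) {
        float dx = px - atoms[i].x;
        float dy = py - atoms[i].y;
        float dz = sliceZ - atoms[i].z;
        // potential ~ q / r; constants are assumed folded into the charge
        energy += atoms[i].w * rsqrtf(dx * dx + dy * dy + dz * dz);
    }
    energyGrid[gy * gridW + gx] = energy;
}
```

A cutoff variant, as mentioned earlier in the discussion of algorithm trade-offs, would skip atoms beyond a fixed radius, trading a small amount of accuracy for much more scalable running time.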
The case study in chapter 13 deals with parallel programming and computational thinking. Here, programming is done to generate a parallel program used in the manipulation of data; an algorithm is selected and used for matrix multiplication, division, and subtraction. Parallel programming relies on the built-in spatial sensitivity of the array to obtain some of the computational information from the data. In these techniques, several reduced data sets are obtained simultaneously by using an array of receiver coils, such as a four-channel phased-array coil, to enhance the sampling speed. The use of a receiver coil array can significantly reduce image acquisition time, but in a manner totally different from conventional fast data analysis. The most fundamental hardware requirement is an appropriate array of receiver coils; the number of elements in this array depends on the particular application, ranging from two to eight. The coils must also be arranged systematically to attain an acceptable signal-to-noise ratio, and their spatial sensitivities should remain reasonably constant throughout the data capture process, which is why a rigid, cage-like arrangement is used. Another important requirement concerns the number of receiver channels and the corresponding coil elements: as a rule of thumb, the number of coil elements should equal the number of receiver channels. It is also important to accurately establish the coding effects of the receiver sensitivities in order to achieve reliable reconstruction.

The future directions of GPU computing

There are new developments in GPU computing whereby the new CUDA is able to support developer productivity and modern software engineering practices. The new CUDA describes a set of development tools for designing, redesigning, and deploying solutions to GPU computing, a process that involves considerations at a number of levels and in several domains. The difference between the current and future approaches is that whereas the former emphasizes building applications in the right way to reduce computational time, the latter focuses on selecting the appropriate systems and their interactions so as to satisfy GPU computational requirements. In addition, while systems engineering addresses the operation and development of GPU computational applications, the future approach focuses on the development and functionality of evolving programs. With such potential, new GPU computing applications will be less costly, and the class of applications able to get reasonable performance at minimal development cost will expand significantly. We expect developers to notice immediately the reduction in application development, porting, and maintenance cost compared to previous CUDA systems. Existing applications developed with Thrust and similar high-level tools that automatically generate CUDA code will also likely get an immediate boost in their performance. While the benefit of hardware enhancements in memory architecture, kernel execution control, and compute-core performance will be visible in the associated SDK release, the true potential of these enhancements may take years to be fully exploited in the SDKs and runtimes. For example, the true potential of the hardware virtual-memory capability will likely be fully achieved only when a shared global address space runtime that supports direct GPU I/O and peer-to-peer data transfer for multi-GPU systems becomes widely available. We predict an exciting time for innovations from both industry and academia in programming tools and runtime environments for massively parallel computing in the next few years.

There is tremendous diversity in the possible and actual configurations of GPU infrastructure to support these models, which creates great potential for cost savings and new capabilities but also adds infrastructure complexity. GPU infrastructure models that underlie on-demand, utility, or grid models can seem mundane, as when an IT vendor makes equipment available to a customer via favorable leasing arrangements rather than traditional purchases. Taken to a more sophisticated extreme, financial models may allow customers to contract for a variable amount of computer power or storage, to be provided by a vendor for a fee that varies in proportion to the amount of power or capacity used. Such contracts may also include pricing that allows handling of surges in the load the systems must bear. A customer firm that pays for surge capacity only when it is required, instead of maintaining it all the time even though it is usually not needed, can realize substantial savings from this kind of financial arrangement.
The basic technology that supports computing around an internetwork has existed in something like its present form since the late 1960s. The technologies we use to access internetworks (PCs, email packages, and web browsers, for example) have been appearing and maturing over the last twenty or so years. Although internetworking infrastructure continues to evolve significantly in both of these areas, there is a third area in which internetworking technologies are evolving even more rapidly. Ultimately, GPU technologies must support all or nearly all the elements of business transactions that can occur in face-to-face transactions. If you are video-conferencing, for example, you need to be able to purchase guaranteed network bandwidth sufficient to make the conference approximate a productive face-to-face work experience; this is not yet possible everywhere. Consider another example: in business you need to be sure the party you are interacting with is who he says he is, so he cannot later say, "that was not me you contracted with." This "non-repudiation" requirement still presents difficulties in some internetworks.

References

Bell, N., & Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors. Proceedings of the ACM Conference on High Performance Computing Networking, Storage and Analysis.

Gropp, W., Lusk, E., & Skjellum, A. (1999). Using MPI: Portable parallel programming with the Message Passing Interface. Cambridge, MA: MIT Press, Scientific and Engineering Computation Series.

NVIDIA (2013). CUDA programming model: Overview. Retrieved 25 September 2014.

Kirk, D., & Hwu, W. (2014). Programming massively parallel processors: A hands-on approach.

NVIDIA (2012). CUDA Zone. Retrieved 25 September 2014.

Pharr, M. (2005). GPU Gems 2: Programming techniques for high-performance graphics and general-purpose computation. Reading, MA: Addison-Wesley.

Harris, M. (2007). Parallel prefix sum with CUDA. Available at http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

Satish, N., Harris, M., & Garland, M. (2009). Designing efficient sorting algorithms for manycore GPUs. Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium.

Stratton, J. A., Stone, S. S., & Hwu, W. W. (2008). MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. The 21st International Workshop on Languages and Compilers for Parallel Computing, July 30-31, Canada. Also available as Lecture Notes in Computer Science, 2008.

Xianchun, Y. (2014). Parallel programming. Retrieved 25 September 2014.