
Even more compute shaders

If you’re familiar with modern CPU design, you know that a CPU is not a scalar thing processing one instruction at a time. A modern CPU architecture like Zen has 10 issue ports, split between integer and floating point:

Zen has ten execution units across the floating point and integer blocks.

GPUs also take advantage of multiple issue ports, but not in the same way as CPUs. On a CPU, instructions get executed out-of-order, and some of them speculatively at that. This is not feasible on a GPU. The whole out-of-order machinery with register renaming requires even more registers, and GPUs already have tons of registers (a Vega10 GPU, for instance, has a whopping 16 MiB of registers on the die.) Not to mention that speculative execution increases power usage, which is already a heavily limiting factor for GPUs running very wide workloads. Finally, GPU programs don’t look like CPU programs to start with: out-of-order execution, speculation, prefetching and more are all great if you’re executing GCC, but not for ye olde pixel shader.

That said, it totally makes sense to issue memory requests while the SIMD is busy working on data, or to execute scalar instructions concurrently. So how can we get those advantages without going out-of-order? The advantage we have on a GPU over a CPU is that there is a ton of work in flight. Just as we take advantage of this for hiding memory latency, we can also exploit it for scheduling. On a CPU, we look ahead in a single instruction stream and try to find independent instructions. On a GPU, we already have tons of independent instruction streams. The easiest way to get instruction-level parallelism is to simply have different units for different instruction types, and issue accordingly. It turns out that’s exactly how GCN is set up, with a few execution ports per CU:

GCN has multiple execution ports per CU — scalar ALU/scalar memory, Branch/Message, vector ALU, LDS, Export/GDS and vector memory.

In total, there are six distinct execution ports, and the dispatcher can send one instruction to up to five of them per cycle. There are some special-case instructions which are handled in the dispatcher directly (like no-ops – there’s no use in sending them to a unit.) At each clock cycle, the dispatcher looks at the active waves, and the next instruction that is ready. It will then send it to the next available unit. For instance, let’s assume we have code like this executing:
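A minimal sketch of what such code might look like, written in GCN-style assembly (the exact opcodes and operands are illustrative assumptions, not taken from the original post):

v_add_f32 v0, v1, v2    ; vector add: issued to one of the SIMDs
s_cmp_eq_i32 s0, s1     ; scalar compare: issued to the scalar ALU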

If there are two waves ready, the dispatcher will issue the first v_add to the first SIMD. In the next cycle, it will issue the s_cmp from the first wave, and the v_add from the second wave. This way the scalar instruction overlaps with the execution of the vector instruction, and we get instruction level parallelism without any lookahead or expensive out-of-order machinery.

Let’s look at a more complete example, with multiple wavefronts executing a decent mix of scalar, vector, and memory instructions:

The top part shows how four wavefronts are scheduled (left to right are clock cycles.) In the first cycle, three independent instructions get issued to three units. Keep in mind that VALU runs for four cycles. The bottom part shows how much utilization the units see. As VALU instructions run for four cycles, all four SIMD units get used rather quickly; a good instruction mix ensures that all units are kept busy.

One last thing before we wrap this up is the handling of loads and stores. On a CPU, it’s all transparent; you can write a sequence like this:
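A minimal C stand-in for the kind of dependent load-then-use sequence meant here (the function and names are hypothetical, for illustration only):

/* The CPU tracks the dependency in hardware: the add cannot
   start until the load of *input has completed. */
int load_then_add(const int *input)
{
    int value = *input;  /* load from memory */
    return value + 1;    /* consumes the loaded value */
}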

This will just work, because the CPU “knows” that the load needs to finish before the operation can start by tracking this information. On a GPU, tracking which register is written by a load would require a lot of extra hardware. The solution the GPU folks came up with is moving this problem “one level up”, into the shader compiler. The compiler has the required knowledge, and inserts the waits for loads manually. In GCN ISA, a special instruction – s_waitcnt – is used to wait until a certain number of loads has finished. It’s not just waiting for everything, as this allows piping in multiple loads simultaneously and then consuming them one-by-one. The corresponding GCN ISA would look somewhat like this:
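A hedged sketch of what that might look like (GCN-style, illustrative rather than actual compiler output):

s_load_dword s0, s[4:5], 0x0   ; issue the scalar load
s_waitcnt lgkmcnt(0)           ; wait until outstanding scalar/LDS loads reach zero
s_add_i32 s0, s0, 1            ; now safe to consume the loaded value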

A good way to think of a (GCN) GPU is as a CPU running four threads per core (compute unit), where each thread can issue scalar, vector, and other instructions. It’s in-order, and the designers made a trade-off between hardware and software complexity. Instead of requiring expensive hardware, a GPU requires massively parallel software: not just to hide latency, but also to take advantage of all execution units. Instead of “automatic” tracking, it requires the compiler to insert extra operations, and it requires the application to provide enough parallelism to fully utilize it; but at the same time, it provides massive throughput and tons of execution units. It’s a nice example of how the properties of the code you’re executing can influence the design of the hardware.


What is the AMD Navi RDNA GPU release date?

The new RX 5700 XT and RX 5700 graphics cards launched with the Navi 10 GPU, in two different configurations, on July 7 this year. This was the first implementation of the RDNA graphics architecture, but definitely won’t be the last because we’ve got Navi 12 and Navi 14 GPUs arriving, potentially in October, to fill out the mainstream level of Radeon GPUs.

David Wang, senior VP of the Radeon Technologies Group, has explicitly stated that a second generation of RDNA will follow this one, probably next year. The gaming architecture roadmap slide indicates that the 7nm+ RDNA 2 design will arrive before 2021… so 2020, then.

There had already been speculation about a ‘big Navi’ GPU arriving next year, which would be designed to take on the best of Nvidia’s GeForce graphics cards, so the Navi 20 chip could well feature the RDNA 2 architecture.

Testing the AMD FX-8320: dedicated to those who never got their Steamroller.

“Live and learn.”


The story of the purchase goes like this. One not-so-good day the computer stopped turning on. Trial and error established that the AMD Phenom II X3 710 processor was to blame; what it lacked I don’t know (it looks like new), and it had served for five years. I decided to buy a stopgap in the form of an AMD Athlon II X2 260 and wait for Steamroller. As you can see, I never got it, and whether we ever will is unclear. I signed the petition asking AMD to release FX chips on the Steamroller architecture; signed it, and went off to buy a Piledriver.


I bought the boxed processor because of its supposedly decent heat-pipe cooler. I plan to put that cooler on the Athlon; for the FX there are more suitable coolers. What this cooler is capable of, we’ll find out later.

Test configuration:

Fans: rear fan Arctic F12 Pro PWM (1000 RPM)
Cooler:

  • AVC (the stock boxed cooler)
  • Cooler Master Hyper 212 (Scythe Slip Stream 120)
  • Thermalright Macho Rev.A(BW) (Scythe Glide Stream 140)

Thermal grease: Zalman ZM-STG2
Sound: Creative Sound Blaster X-Fi Titanium

Software:
Windows 7 Pro 64-bit SP1
NVIDIA graphics driver v314.22 (settings: AF 16x, AA off, texture filtering: high quality, vsync off)
OCCT 4.4.0 (settings: 64-bit, medium data set, 10 min)
HWiNFO64 v4.34
SpeedFan v4.49


CPU testing:


Games:



Resident Evil 6 is the only test that was run with AA enabled.

Judging by the tests, the eight cores get in each other’s way: in almost every test they lose, if only by a little, to the four cores obtained by enabling the “one core per compute unit” option.

Archivers:




The results impressed me. When you see the results of a real-world test rather than a built-in benchmark, you immediately see how much time you save and what you’re paying for. Dual-channel memory turned out to matter for archivers: minimal latencies don’t rescue single-channel. And you can forget about the old compression methods: in WinRAR’s case they’re less effective, in 7-Zip’s case slower.

Video encoding:

Summing up all the tests, more and more applications are trying to use every available core. Games are doing worse, but even there four cores are already the standard.

Cooler testing:

The bases of the Hyper 212 and the Macho were lapped flat.

I wanted to test them at an equal noise level, but that didn’t work out: the stock fan turned out to be not so simple.

It carries a thermal sensor that regulates the speed even if PWM is disabled.

The speed rises along with the temperature.

Testing was done at the processor’s stock frequency and voltage, with the side panel of the case removed.

Test results.
The first test I ran turned out to be invalid.

I had a nagging suspicion that something was wrong, and then a guess at what the problem might be. Since the Cooler Master couldn’t be installed (the fan was in the way), I had moved the memory, and the memory started working in single-channel mode. I decided to check my guess with the Macho.

Who would have thought that these tests would show a difference between memory running in dual and single channel, and such a huge one at that.

Summary:
The stock cooler’s results are, as expected, the worst; if you can ignore the noise, it copes with its job. And those aren’t even the maximum RPM it can run at: the speed climbed above six thousand. The fan howls in time with the processor load; at times it seems it’s not a fan but a blizzard howling outside the window.
I’m happy with the performance, especially when it comes to archiving and video encoding. With the new generation of consoles out, maybe games will finally learn to use all eight cores. So I should have bought the processor as soon as the Piledriver parts launched instead of waiting for Steamroller; then maybe I, too, would have gotten a processor in the metal box 🙂

P.S.
Added at the request of user fx8350.




What fx8350 wanted to glean from these tests I don’t know, so I’ll ask the customer to draw the conclusions himself 🙂
Conclusions:

The conclusions are as elementary as they come! All this synthetic stuff is useless marketing nonsense, in other words a scam. It gets written so that people buy up their products! The cores don’t work or load fully in practically a single benchmark, so these cores are purely for specialized applications. Intel’s processors are aimed at schoolkids’ brains anyway: play with high FPS and hold measuring contests! But in any case I bought the 8350 and so far I don’t regret it!

Manage subscription and billing settings

Note that as of July 30, 2020, we have a newer pricing plan. To learn more, see Overview of pricing.

The account Owner can perform many subscription self-service functions directly from the user interface:

  1. From one.newrelic.com, select the account dropdown.
  2. Select your choice of self-service options.
  3. When making subscription changes, be sure to save any changes, agree to New Relic’s Terms of Service and Supplemental Payment Terms as appropriate, and select Pay now.
  4. Optional: If you downgrade your subscription, complete New Relic’s survey.

Here is a summary of the available options from your account dropdown in the New Relic user interface:

View summary information

To view summary information about your subscription: From the account dropdown, select Account settings > Account > Summary. This includes:

Your account ID (which is not the same as your license key) is part of the URL after you sign in to New Relic.

To view or change your current subscription options: From the account dropdown, select Upgrade subscription/Change subscription.

From the account dropdown, select Account settings > Account > Subscription.

  • Upgrade or downgrade your pricing and subscription levels
  • Cancel your subscription or delete your account
  • Change your account’s tax location for billing purposes

If you need more help, contact your New Relic account representative, or contact New Relic’s Billing Department.

To view your subscription usage information: From the account dropdown, select Account settings > Usage.

View or update billing information

To view or update your New Relic account’s billing information: From the account dropdown, select Account settings > Account > Billing. Billing settings include:

  • Account contact information, including name, organization, address, phone, email, purchase order number, etc.
  • Billing history, including invoices and receipts
  • Credit card or other payment method

Using aprun

The aprun command is used to specify to ALPS the resources and placement parameters needed for your application at application launch. At a high level, aprun is similar to mpiexec or mpirun.

The following are the most commonly used options for aprun.

  • -n: Number of processing elements (PEs) required for the application
  • -N: Number of PEs to place per node
  • -S: Number of PEs to place per NUMA node
  • -d: Number of CPU cores required for each PE and its threads
  • -j: Number of CPUs to use per compute unit (bulldozer core-module)
  • -cc: Bind PEs to CPU cores
  • -r: Number of CPU cores to be used for core specialization
  • -ss: Enable strict memory containment per NUMA node
  • -q: Suppress all non-fatal messages from aprun
  • -R: Restart aprun on a minimum number of PEs

At a minimum, the user should provide the "-n" option to specify the number of PEs required to run the job.

Task placement

By default, MPI processes are assigned to cores in a packed manner. If, for example, you run a single-node pure-MPI job on 8 cores (i.e., aprun -n 8 ./myexe), the MPI ranks will be placed on cores 0-7. If you run a hybrid job with OpenMP, the -d parameter should be set to the number of OpenMP threads per MPI rank. In this case, the MPI ranks will be spaced by the value of -d. So, if you run a single-node job with 8 MPI ranks and 2 OpenMP threads per rank (aprun -n 8 -d 2 ./myexe), the MPI ranks will be placed on cores 0, 2, 4, 6, 8, 10, 12, and 14, and the OpenMP threads will be placed on cores 0 through 15.

For the vast majority of codes, it’s best to distribute MPI processes among the NUMA nodes to avoid bottlenecks at the caches, the PCI bus, main memory, etc. Note the core layout of an XE node as shown by /usr/bin/numactl:
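Reconstructed from the placement examples below and the XK note further down, the layout in question is (a sketch, not verbatim numactl output):

NUMA node 0: cores 0-7
NUMA node 1: cores 8-15
NUMA node 2: cores 16-23
NUMA node 3: cores 24-31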

There are a couple of ways to modify ALPS’ task placement behavior.

Simple task placement: -n, -N, -d, and -j

The simple way of evenly distributing the tasks on a node is with a combination of the -n, -N, and -d aprun parameters. For a pure-MPI job, specify the total number of MPI processes with -n, the number of MPI processes per node with -N, and then use -d to space them.

Example 1 (single-node job):

aprun -n 4 -N 4 -d 2 ./myexe

Result: MPI processes will be assigned to cores 0, 2, 4, and 6.

Example 2 (2-node job):

aprun -n 8 -N 4 -d 4 ./myexe

Result: MPI processes will be assigned to cores 0, 4, 8, and 12 on both nodes.

If you want to place one MPI rank on each of the 16 bulldozer core-modules in a node, simply use -N 16 -d 2. Alternatively, -j may be used instead of -d. The -j parameter specifies the number of CPUs to be allocated per compute unit, which is a bulldozer core-module on Blue Waters. As there are two integer cores per bulldozer core-module, the valid values for -j are 0 (use the system default), 1 (use one integer core per bulldozer), and 2 (use both integer cores in each bulldozer; this is the system default). So, using -N 16 -j 1 is another way to place one MPI rank on each bulldozer core-module. Note that using -d in combination with -j has a multiplicative effect:

Example 3 (single-node jobs):

aprun -n 4 -N 4 -d 1 -j 1
Result: MPI processes will be assigned to cores 0, 2, 4, and 6.

aprun -n 4 -N 4 -d 2 -j 1
Result: MPI processes will be assigned to cores 0, 4, 8, and 12.

aprun -n 4 -N 4 -d 4 -j 1
Result: MPI processes will be assigned to cores 0, 8, 16, and 24.

For hybrid jobs with OpenMP threads, set -d to the number of threads per MPI rank.

Example 4 (single-node job with OpenMP, using a bash PBS script):

Set the number of threads: export OMP_NUM_THREADS=8

aprun -n 4 -N 4 -d 8 ./myexe

core 0: MPI rank 0 (and main/master OpenMP thread for MPI rank 0)
cores 1-7: OpenMP threads for MPI rank 0
core 8: MPI rank 1 (and main/master OpenMP thread for MPI rank 1)
cores 9-15: OpenMP threads for MPI rank 1
core 16: MPI rank 2 (and main/master OpenMP thread for MPI rank 2)
cores 17-23: OpenMP threads for MPI rank 2
core 24: MPI rank 3 (and main/master OpenMP thread for MPI rank 3)
cores 25-31: OpenMP threads for MPI rank 3

Advanced task placement: -cc

Use the -cc parameter if you need more control over where the MPI processes are placed. The list following -cc specifies the cores to which MPI processes are bound. This list may be comma delimited (e.g., 2,4,6,8), contain ranges (e.g., 2-4,5-7), or both.


A typical -cc layout for using the 16 FPU units in an XE compute node would look like:
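A plausible sketch (the core list and executable name are assumptions, not the page’s original listing), binding one PE to one integer core of each of the 16 bulldozer modules:

aprun -n 16 -N 16 -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./myexe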

An example of using -cc for an MPI+OpenMP hybrid application with 4 MPI ranks per node and 4 OpenMP threads per rank on an XE compute node is:
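A plausible sketch, assuming aprun accepts colon-separated per-PE lists for -cc, that confines each rank and its threads to one NUMA node:

export OMP_NUM_THREADS=4
aprun -n 4 -N 4 -d 4 -cc 0-3:8-11:16-19:24-27 ./myexe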

XK compute nodes contain only NUMA nodes 0 and 1, with the same memory and core enumeration (cores 0-15). NUMA nodes 2 and 3 (the second processor socket) are vacant to provide space for the Nvidia Kepler GPU.

The -cc parameter may be used to bind multiple MPI tasks to the same core (e.g., aprun -cc 0,0,1,1 ./myexe). However, this is typically very undesirable, as doing so will result in an extreme load imbalance for any application that tries to keep the amount of work done by each MPI rank the same.

Important note: the bindings specified by using -cc apply to each node. This means that the only valid values for Blue Waters are 0-31. Values above 31 will be ignored (no error is given).

See the aprun man page for more information.

This example code can be compiled and run on a node(s) to show how the core placement changes with various arguments to aprun:
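A minimal stand-in (hypothetical, not the original Blue Waters sample) that serves the same purpose:

#define _GNU_SOURCE          /* needed for sched_getcpu() */
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Prints each rank's host and current core so the effect of
   different aprun placement flags can be observed directly. */
int main(int argc, char **argv)
{
    int rank;
    char host[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));
    printf("rank %d on %s, core %d\n", rank, host, sched_getcpu());
    MPI_Finalize();
    return 0;
}

Compile it with the Cray compiler wrapper (for example, cc placement.c -o placement) and launch it under the aprun variants above to watch the binding change.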

Restart for aprun resiliency

Aprun output

When an application exits, aprun sends to stdout: utime, stime, maxrss, inblocks, and outblocks. These are the user time, system time, maximum resident set size, block input operations, and block output operations. The values given are approximate: they are rounded aggregates, scaled by the number of resources used. For more information on these values, see the getrusage(2) man page.

An example of the output is:

Application 2243970 resources: utime

References

http://docs.cray.com/books/S-2496-4101/html-S-2496-4101/cnlexamples.html # contains many sample codes demonstrating the common HPC programming paradigms along with various aprun invocations and program output

Blue Waters is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) the State of Illinois, and the National Geospatial-Intelligence Agency.



How Do Graphics Cards Work?

Ever since 3dfx debuted the original Voodoo accelerator, no single piece of equipment in a PC has had as much of an impact on whether your machine could game as the humble graphics card. While other components absolutely matter, a top-end PC with 32GB of RAM, a $4,000 CPU, and PCIe-based storage will choke and die if asked to run modern AAA titles on a ten-year-old card at modern resolutions and detail levels. Graphics cards, aka GPUs (Graphics Processing Units), are critical to game performance, and we cover them extensively. But we don’t often dive into what makes a GPU tick and how the cards function.

By necessity, this will be a high-level overview of GPU functionality and cover information common to AMD, Nvidia, and Intel’s integrated GPUs, as well as any discrete cards Intel might build in the future based on the Xe architecture. It should also be common to the mobile GPUs built by Apple, Imagination Technologies, Qualcomm, ARM, and other vendors.

Why Don’t We Run Rendering With CPUs?

The first point I want to address is why we don’t use CPUs for rendering workloads in gaming in the first place. The honest answer to this question is that you can run rendering workloads directly on a CPU. Early 3D games that predate the widespread availability of graphics cards, like Ultima Underworld, ran entirely on the CPU. UU is a useful reference case for multiple reasons — it had a more advanced rendering engine than games like Doom, with full support for looking up and down, as well as then-advanced features like texture mapping. But this kind of support came at a heavy price — many people lacked a PC that could actually run the game.

Ultima Underworld. Image by GOG

In the early days of 3D gaming, many titles like Half-Life and Quake II featured a software renderer to allow players without 3D accelerators to play the title. But the reason we dropped this option from modern titles is simple: CPUs are designed to be general-purpose microprocessors, which is another way of saying they lack the specialized hardware and capabilities that GPUs offer. A modern CPU could easily handle titles that tended to stutter when running in software 18 years ago, but no CPU on Earth could easily handle a modern AAA game from today if run in that mode. Not, at least, without some drastic changes to the scene, resolution, and various visual effects.

As a fun example of this: The Threadripper 3990X is capable of running Crysis in software mode, albeit not all that well.

What’s a GPU?

A GPU is a device with a set of specific hardware capabilities that are intended to map well to the way that various 3D engines execute their code, including geometry setup and execution, texture mapping, memory access, and shaders. There’s a relationship between the way 3D engines function and the way GPU designers build hardware. Some of you may remember that AMD’s HD 5000 family used a VLIW5 architecture, while certain high-end GPUs in the HD 6000 family used a VLIW4 architecture. With GCN, AMD changed its approach to parallelism, in the name of extracting more useful performance per clock cycle.

Nvidia coined the term “GPU” with the launch of the original GeForce 256 and its support for performing hardware transform and lighting calculations on the GPU (this corresponded, roughly, to the launch of Microsoft’s DirectX 7). Integrating specialized capabilities directly into hardware was a hallmark of early GPU technology. Many of those specialized technologies are still employed (in very different forms). It’s more power-efficient and faster to have dedicated resources on-chip for handling specific types of workloads than it is to attempt to handle all of the work in a single array of programmable cores.

There are a number of differences between GPU and CPU cores, but at a high level, you can think about them like this. CPUs are typically designed to execute single-threaded code as quickly and efficiently as possible. Features like SMT / Hyper-Threading improve on this, but we scale multi-threaded performance by stacking more high-efficiency single-threaded cores side-by-side. AMD’s 64-core / 128-thread Epyc CPUs are the largest you can buy today. To put that in perspective, the lowest-end Pascal GPU from Nvidia has 384 cores, while the highest core-count x86 CPU on the market tops out at 64. A “core” in GPU parlance is a much smaller processor.

Note: You cannot compare or estimate relative gaming performance between AMD, Nvidia, and Intel simply by comparing the number of GPU cores. Within the same GPU family (for example, Nvidia’s GeForce GTX 10 series, or AMD’s RX 4xx or 5xx family), a higher GPU core count means that GPU is more powerful than a lower-end card. Comparisons based on FLOPS are suspect for reasons discussed here.

The reason you can’t draw immediate conclusions on GPU performance between manufacturers or core families based solely on core counts is that different architectures are more and less efficient. Unlike CPUs, GPUs are designed to work in parallel. Both AMD and Nvidia structure their cards into blocks of computing resources. Nvidia calls these blocks an SM (Streaming Multiprocessor), while AMD refers to them as a Compute Unit.

A Pascal Streaming Multiprocessor (SM).

Each block contains a group of cores, a scheduler, a register file, instruction cache, texture and L1 cache, and texture mapping units. The SM / CU can be thought of as the smallest functional block of the GPU. It doesn’t contain literally everything — video decode engines, render outputs required for actually drawing an image on-screen, and the memory interfaces used to communicate with onboard VRAM are all outside its purview — but when AMD refers to an APU as having 8 or 11 Vega Compute Units, this is the (equivalent) block of silicon they’re talking about. And if you look at a block diagram of a GPU, any GPU, you’ll notice that it’s the SM/CU that’s duplicated a dozen or more times in the image.

And here’s Pascal, full-fat edition.

The higher the number of SM/CU units in a GPU, the more work it can perform in parallel per clock cycle. Rendering is a type of problem that’s sometimes referred to as “embarrassingly parallel,” meaning it has the potential to scale upwards extremely well as core counts increase.


When we discuss GPU designs, we often use a format that looks something like this: 4096:160:64. The GPU core count is the first number. The larger it is, the faster the GPU, provided we’re comparing within the same family (GTX 970 versus GTX 980 versus GTX 980 Ti, RX 560 versus RX 580, and so on).

Texture Mapping and Render Outputs

There are two other major components of a GPU: texture mapping units and render outputs. The number of texture mapping units in a design dictates its maximum texel output and how quickly it can address and map textures onto objects. Early 3D games used very little texturing because the job of drawing 3D polygonal shapes was difficult enough. Textures aren’t actually required for 3D gaming, though the list of games that don’t use them in the modern age is extremely small.

The number of texture mapping units in a GPU is signified by the second figure in the 4096:160:64 metric. AMD, Nvidia, and Intel typically shift these numbers equivalently as they scale a GPU family up and down. In other words, you won’t really find a scenario where one GPU has a 4096:160:64 configuration while a GPU above or below it in the stack is a 4096:320:64 configuration. Texture mapping can absolutely be a bottleneck in games, but the next-highest GPU in the product stack will typically offer at least more GPU cores and texture mapping units (whether higher-end cards have more ROPs depends on the GPU family and the card configuration).

Render outputs (also sometimes called raster operations pipelines) are where the GPU’s output is assembled into an image for display on a monitor or television. The number of render outputs multiplied by the clock speed of the GPU controls the pixel fill rate. A higher number of ROPs means that more pixels can be output simultaneously. ROPs also handle antialiasing, and enabling AA — especially supersampled AA — can result in a game that’s fill-rate limited.
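As a hypothetical worked example (the figures are illustrative, not from any specific card): a GPU with 64 ROPs clocked at 1,500 MHz has a theoretical peak pixel fill rate of 64 × 1.5 GHz = 96 Gpixels/s.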

Memory Bandwidth, Memory Capacity

The last components we’ll discuss are memory bandwidth and memory capacity. Memory bandwidth refers to how much data can be copied to and from the GPU’s dedicated VRAM buffer per second. Many advanced visual effects (and higher resolutions more generally) require more memory bandwidth to run at reasonable frame rates because they increase the total amount of data being copied into and out of the GPU core.
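A hypothetical worked example (illustrative figures): a card with a 256-bit memory bus and memory running at an effective 8 Gbps per pin delivers (256 bits × 8 Gbps) / 8 bits per byte = 256 GB/s of peak bandwidth.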

In some cases, a lack of memory bandwidth can be a substantial bottleneck for a GPU. AMD’s APUs like the Ryzen 5 3400G are heavily bandwidth-limited, which means increasing your DDR4 clock rate can have a substantial impact on overall performance. The choice of game engine can also have a substantial impact on how much memory bandwidth a GPU needs to avoid this problem, as can a game’s target resolution.

The total amount of on-board memory is another critical factor in GPUs. If the amount of VRAM needed to run at a given detail level or resolution exceeds available resources, the game will often still run, but it’ll have to use the CPU’s main memory for storing additional texture data — and it takes the GPU vastly longer to pull data out of DRAM as opposed to its onboard pool of dedicated VRAM. This leads to massive stuttering as the game staggers between pulling data from a quick pool of local memory and general system RAM.

One thing to be aware of is that GPU manufacturers will sometimes equip a low-end or midrange card with more VRAM than is otherwise standard as a way to charge a bit more for the product. We can’t make an absolute prediction as to whether this makes the GPU more attractive because honestly, the results vary depending on the GPU in question. What we can tell you is that in many cases, it isn’t worth paying more for a card if the only difference is a larger RAM buffer. As a rule of thumb, lower-end GPUs tend to run into other bottlenecks before they’re choked by limited available memory. When in doubt, check reviews of the card and look for comparisons of whether a 2GB version is outperformed by the 4GB flavor or whatever the relevant amount of RAM would be. More often than not, assuming all else is equal between the two solutions, you’ll find the higher RAM loadout not worth paying for.

Check out our ExtremeTech Explains series for more in-depth coverage of today’s hottest tech topics.

