| Revisiting Hyper-Threading for Nehalem; Editorial | |
|---|---|
| Topic Started: Jun 24 2008, 11:05 PM (200 Views) | |
| virtualrain | Jun 24 2008, 11:05 PM Post #1 |
Administrator
|
Nehalem re-incorporates a form of Simultaneous Multithreading (SMT) that first appeared on Pentium 4 processors several years ago, branded as "Hyper-Threading" (HTT). HTT has been conspicuously absent from the Core2 generation of processors. The common theory is that this resulted from Intel's decision to let the Haifa team develop the Core2 based on their great success with the Pentium M, while the preceding Netburst architecture was swept aside (and HTT along with it). Since the concept of HTT may be new to many, or foggy even for those who once owned a P4 (or still do), I thought it might be prudent to write a brief editorial on HTT and how it works with current operating systems. By the way, while it's not clear what the brand name for Nehalem's SMT implementation will be, I will stick with the Hyper-Threading (HTT) nomenclature throughout this article.

What is Hyper-Threading?

Hyper-Threading is Intel's patented technique for simultaneous multithreading. It duplicates some key elements of the core, specifically those that store the execution state of a thread, but not those that actually do the execution. It thereby allows two threads to be processed concurrently on a single core, relying on the fact that a given execution unit is not always fully utilized. For example, an executing thread may stall while waiting for data from main memory after a cache miss, and another thread can use those idle periods to execute.

However, it's important to note that Hyper-Threading is not like having two full cores. Many elements, including the L1 and L2 caches and the execution engine itself, are shared between the two competing threads. Here is a list of the core elements that are duplicated, partitioned, and shared between the two threads on a core:

Duplicated:
Partitioned:
Shared:
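The idea that a second thread can fill the idle periods left by a stalled thread can be illustrated with a toy model. This is only a sketch under loud assumptions, not a description of how Nehalem or the P4 actually issues instructions: each thread is a hypothetical string of one-cycle slots, where `C` needs the single shared execution unit and `M` is a cycle spent waiting on memory. The function names are invented for this example.

```python
def serial_cycles(a: str, b: str) -> int:
    """Cycles needed when the two threads run one after the other."""
    return len(a) + len(b)


def smt_cycles(a: str, b: str) -> int:
    """Cycles needed when both threads share one execution unit (toy model)."""
    threads = (a, b)
    pos = [0, 0]   # index of the next slot for each thread
    prefer = 0     # thread that gets first pick of the unit this cycle
    cycles = 0
    while pos[0] < len(a) or pos[1] < len(b):
        cycles += 1
        unit_busy = False
        for t in (prefer, 1 - prefer):
            if pos[t] >= len(threads[t]):
                continue                 # this thread has already finished
            if threads[t][pos[t]] == "M":
                pos[t] += 1              # a memory wait needs no execution unit
            elif not unit_busy:
                pos[t] += 1              # a compute slot claims the shared unit
                unit_busy = True
        prefer = 1 - prefer              # alternate priority for fairness
    return cycles


if __name__ == "__main__":
    stalling = "CCMMMMCC"   # assumed workload: compute, cache miss, compute
    print(serial_cycles(stalling, stalling), smt_cycles(stalling, stalling))
    print(serial_cycles("CCCC", "CCCC"), smt_cycles("CCCC", "CCCC"))
```

In this model two purely compute-bound threads gain nothing from sharing a core, while threads that spend cycles waiting on memory overlap their stalls and finish sooner than they would back to back.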
According to Intel, the duplicated resources account for only a 5% increase in transistor count (and TDP).

HTT Performance:

In general, some applications can benefit by as much as 30% from HTT, making it a very economical performance option. Knowing which resources are duplicated and which are shared, we can better understand where performance issues may arise or economies may be gained. In particular, since the L1 and L2 caches are shared by both HTT threads, cache contention and thrashing can develop as the threads compete with each other for these important resources. Conversely, multi-threaded applications built around producer and consumer threads, where one thread produces data that another consumes, benefit significantly from Hyper-Threading because both threads work on the same data set, so sharing the cache is a real advantage.

A revealing set of benchmarks was conducted by FiringSquad on a set of P4-era processors in 2005. The processors tested include a Pentium Extreme Edition 840 (2 cores with HTT), a Pentium D 840 (2 cores), and a Pentium 4 540 (1 core with HTT). To summarize the results:
Therefore, while it's clear that an additional core is the best way to improve performance in multi-threaded applications, noticeable gains can be had from HTT technology, even in a game. The bottom line on performance is that the impact of HTT depends heavily on the particular nature of the processor load and, to some extent, on how the various threads are scheduled by the operating system.

Operating System Support:

HTT technology is abstracted so that the operating system sees a series of logical processors. Each HTT enabled physical core appears to the operating system as a pair of logical processors, each of which can have a thread assigned by the OS, allowing two concurrent threads of execution on a single physical core. Logic in the core manages which of the two concurrent threads executes at any given time (based on the idle cycles), but it is the operating system that assigns the threads to logical processors.

An operating system that is HTT aware can optimize performance by assigning new threads to logical processors on inactive cores rather than to logical processors on a core that already has an active thread. Conversely, an operating system that is not HTT aware can introduce performance penalties by loading a single physical core with concurrent threads while other cores sit idle.

For example, consider a Bloomfield CPU with 8 logical processors on 4 physical cores and two active threads. In the accompanying illustration, a shaded logical processor indicates an active thread; a non-shaded processor is inactive. Assuming no affinity has been set, the operating system is free to schedule the next available thread on any of the inactive logical processors.
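The HTT aware placement rule described above can be sketched in a few lines. This is an illustration only, not how any Windows scheduler is actually implemented. The function names are hypothetical, and the topology is an assumption: logical processors n and n+4 are taken to share a physical core, with logical processors 1 and 6 assumed busy.

```python
from typing import Dict, List, Optional, Set


def idle_core_candidates(core_of: Dict[int, int], active: Set[int]) -> List[int]:
    """Inactive logical processors whose physical core has no active thread."""
    busy_cores = {core_of[lp] for lp in active}
    return sorted(
        lp for lp in core_of
        if lp not in active and core_of[lp] not in busy_cores
    )


def pick_logical_processor(core_of: Dict[int, int], active: Set[int]) -> Optional[int]:
    """Prefer a fully idle core; fall back to any inactive logical processor."""
    preferred = idle_core_candidates(core_of, active)
    if preferred:
        return preferred[0]
    fallback = sorted(lp for lp in core_of if lp not in active)
    return fallback[0] if fallback else None


# ASSUMED numbering: logical processors n and n+4 share physical core (n-1) % 4.
ASSUMED_TOPOLOGY = {lp: (lp - 1) % 4 for lp in range(1, 9)}

if __name__ == "__main__":
    busy = {1, 6}  # assumed active logical processors
    print(idle_core_candidates(ASSUMED_TOPOLOGY, busy))
    print(pick_logical_processor(ASSUMED_TOPOLOGY, busy))
```

A scheduler that is not HTT aware corresponds to skipping the first step and choosing from the fallback list directly, which is how a new thread can end up sharing a core that is already busy.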
In a non-HTT aware operating system, the next thread might be arbitrarily assigned to logical processor 2 or 5, either of which would incur a performance penalty, since the new thread would then compete for shared processor resources with a concurrent thread on the same core while the last two cores sit idle. An HTT aware operating system will optimize performance by dispatching new threads onto inactive physical cores whenever possible (in this case logical processor 3, 4, 7, or 8).

Here's a summary of recent Microsoft Operating Systems and their awareness of HTT:
(A full list of HTT aware operating systems is provided by Intel.)

The most recent Windows operating systems are not only HTT aware but also expose an API that applications can use to optimize threading based on the logical processor configuration of the machine. This allows applications to determine, at run-time, the ideal thread distribution for the logical/physical processor layout they find.

It's also worth pointing out that users can override the operating system's thread scheduling by forcing affinity at the process level using Task Manager.

Summary

In summary, HTT can provide a noticeable increase in performance at little additional cost in terms of die area and power consumption. The relatively small duplication of some core elements has been shown to provide tangible benefits to a variety of tasks, including games and desktop media applications. Benefits of 10-30% can be expected from applications run on HTT aware operating systems such as XP, Vista, and recent Server flavors. With Nehalem reintroducing a form of HTT, Vista now supporting an HTT aware API for applications to optimize threading, and an increasing emphasis on parallel programming, this next generation of hardware and software promises to be a very interesting time for performance enthusiasts.

See the main blog article for a list of additional reading.

Edited by virtualrain, Jun 24 2008, 11:07 PM.
|
Visit Nehalem News | |
|
|
| bowman | Jun 25 2008, 06:21 AM Post #2 |
|
Advanced Member
|
Oh sweet. This will be so great. Just look at those gains in Quake 4.. Newer, better implementation and newer, better architecture. We should see even more gains as long as the applications make use of more than 4 threads. |
| |
|
|
| Morgoth | Jun 25 2008, 07:47 AM Post #3 |
|
Advanced Member
|
i love my ht ^^ |
| 500mhz stone age, 1ghz copper age, 1,50ghz bronze age, 2ghz middle ages, 2,50ghz renaissance, 3ghz industrial age, 4ghz atomic age, 6ghz Nano age | |
|
|
| bowman | Jun 28 2008, 04:57 AM Post #4 |
|
Advanced Member
|
http://www.xtremesystems.org/forums/showthread.php?s=&threadid=4952&highlight=Hyperthreading
Unfortunately the multi-threaded SMP client only wants 4 threads, not 2, 6 or 8. Apparently it doesn't like SMT either as it's actually executing a single task utilizing all four cores rather than just running four different clients. Time to experiment once it arrives then.. What's better, disabling SMT and running the SMP client or running 8 console clients? Edited by bowman, Jun 28 2008, 04:59 AM.
|
| |
|
|