WCSA Publicidade







Intel and Novell to collaborate on Moblin

Intel and Novell signed an agreement outlining their plan for collaboration. Novell also announced it will create a Moblin-based product for netbooks that it will take to market to a wide range of OEMs and ODMs. Additionally, Novell will establish Novell® Open Labs in Taiwan to foster the adoption of Moblin and will work with the Taiwan Moblin Enabling Center (MEC), a joint effort of Intel and the Taiwan Institute for Information Industry, to validate designs for Moblin compliance.

"Novell has taken a significant leadership role in the Moblin community since joining the effort late last year, and today's announcement will extend Novell's level of involvement," said Doug Fisher, vice president of Intel's Software and Services Group and general manager of the System Software Division. "The combination of Intel Atom processor-based platforms and Moblin-based Novell software will provide even more opportunities for OEMs, ODMs and the broader Moblin community to deliver excellent mobile Internet solutions."

Novell's contributions to the Moblin ecosystem include leading the open source development of key operating system features such as window, e-mail and media management.

"We are extending our involvement with Moblin because we believe that it provides a richer mobile Internet experience," said Ron Hovsepian, Novell president and CEO. "The emergence of such mobile computing platforms as netbooks presents a significant growth opportunity. We believe that Moblin-based Novell software on Intel-based platforms will offer OEMs and ODMs exceptional solutions for delivering a full Internet experience on such devices."

More details at - http://finance.yahoo.com/news/Intel-Novell-Extend-bw-15164933.html

07/05/2009 08:47 AM

Looking to buy vPro systems? Model numbers just updated on wiki.

The easiest way to be sure that you are ordering a PC with Intel vPro technology is to use the list of model numbers that we've cataloged. Hewlett-Packard, Dell, Lenovo, and FTS (desktop and workstation systems) model numbers were updated here in the last couple of weeks - other manufacturers are still in progress and will be updated as the info is available.

You can check them out here: Order an Activation-Ready PC

07/05/2009 06:18 AM

Sponsors of Tomorrow

I’m here in Toronto, getting ready to head over to participate in a panel discussion at the Empire Club of Canada on the topic of corporate social responsibility and the impact that social media is having on CSR communications and reporting. The topic, which didn’t even exist just a few years ago, struck me as just another example of the dramatic changes that are occurring even in the field of CSR because of technology. Changes that, as will be highlighted in our biggest advertising campaign in years, were made possible by Intel silicon.

On Monday, Intel will launch our new Sponsors of Tomorrow campaign, which looks at the role that Intel plays in changing the way we all live and work. Most importantly, in my opinion, it celebrates the minds and creativity of the people here who make that innovation possible every day. I still remember starting here at Intel and being taken aback when these engineers would literally walk into me in the hallways/elevators as if they didn’t see me. I imagined that they were busy concentrating on designing the next chip in their heads on the way to get their morning coffee. But really, I am continually amazed by and proud of what my co-workers here continue to make possible - a sentiment that is amusingly captured in the new “Intel star” ad that will begin running next week. Take a sneak peek and tell us what you think.

07/05/2009 04:13 AM

Virtually Everything...

"We are virtualizing".  I hear that at every customer, every day.  I am not sure where virtualization is on the hype curve, but I don't think it is anywhere near slowing down.  I am very glad to be past the "Dilbert" and "in flight magazine" era.  Customers seem to have a really solid command of what they want to virtualize and why they want to virtualize (not to imply that all the questions have been answered).

The latest Intel servers - Xeon 7400 processor series in the 4 socket family, and the incredible Xeon 5500 (Nehalem) processor series in the 2 socket family - deliver more than sufficient capacity for sweeping data center virtualization.  That is, very few enterprise applications are too big for a VM on one of these platforms.

I hear three reasons from customers for virtualization (in order of emphasis):

1) To improve efficiency.
2) To improve flexibility.
3) To improve reliability.

Virtualization has moved out of the lab and become a "best known method" for doing IT right.

Intel points to three focus areas for servers.

06/05/2009 06:58 PM

3 items to consider for your next energy efficient server

If your company needs new servers, this is a great time to be in the market.  Intel Xeon® 5500 (Nehalem) based servers that were introduced only a month ago have been arriving at customer sites all over the world, and they provide some very compelling performance and energy efficiency benefits.  Here are 3 key items to consider before buying your next server.  The actual order of importance of these items may vary depending upon your business needs.

1.  Performance.  This is still a primary reason why new servers are purchased.  The best way to measure performance is to actually run your applications on the server you are considering.  If that is not possible or feasible, the next best choice is to compare server performance using a suite of benchmarks.  Some of the more common benchmarks that IT departments use to compare server performance are:

a.       Virtualization performance using VMware VMmark: http://www.vmware.com/products/vmmark/results.html

b.      Energy efficiency using SPECpower_ssj2008: http://www.spec.org/power_ssj2008/

c.       Integer performance using SPECint_rate_base2006: http://www.spec.org/cpu2006/results/cpu2006.html#SPECint_rate

d.      Floating point performance using SPECfp_rate_base2006: http://www.spec.org/cpu2006/results/cpu2006.html#SPECfp_rate

e.       Web server performance using SPECweb2005: http://www.spec.org/web2005/results/

f.        Java performance using SPECjbb2005: http://www.spec.org/jbb2005/results/jbb2005.html

After looking at these benchmark results, one thing you'll notice is that the Xeon® 5500 processors provide phenomenal performance, often up to 2x the previous generation!

2.   Server Hardware Choices 

a.       Processor.  The processor is one of the most important choices in the server.  Performance, features, power envelope and price all need to be considered.  From a power perspective, there are three power envelopes available for Xeon® 5500 server processors (95W, 80W and 60W).  In addition, there are 130W Xeon® 5500 processors, but these are primarily being used for workstations.  If you are in a power-constrained environment, it may be worthwhile to consider buying a lower power processor to reduce energy consumption.  Depending upon the processor SKU you are interested in, it is possible to get the exact same performance/frequency with a processor that just consumes less power (e.g. the Xeon L5520 2.26GHz 60W instead of the Xeon E5520 2.26GHz 80W).  The L in front of the processor number refers to low voltage processors that consume less power.

b.      Power supply.  Choosing a power supply with a high efficiency rating is one of the easiest choices you can make to reduce power consumption.  Choose a power supply that is at least 80% efficient.  Some of the newer power supplies are 90% efficient or higher.  The higher the percentage, the better.

c.       Memory.  Every DIMM installed in the server consumes power.  In general, the fewer the DIMMs used, the less power that server will consume.  For a given memory capacity, such as 24GB, choose six 4GB DIMMs instead of twelve 2GB DIMMs.  The prices of 2GB and 4GB DIMMs are almost at per-bit parity, but the power consumption of the memory will be much less with fewer DIMMs installed.

d.      Add in boards.  Compare power consumption of add in boards such as 10GbE adapters, fibre channel adapters and other I/O cards.  Also, do you really need a fibre channel card these days?  FCoE (Fibre Channel over Ethernet) using a 10GbE adapter is definitely a cost effective and power efficient way to get access to your storage array.

3.       To virtualize or not to virtualize?  Virtualization is no longer just a buzzword.  Virtualization is being used by many companies across multiple diverse industries today.  Fundamentally, it is an excellent way to consolidate many applications onto a single server, thereby increasing the utilization, value and energy efficiency of every server purchased.  Definitely a top item to consider.

What about your business?  What items do you consider before purchasing servers to maximize energy efficient performance?

06/05/2009 12:10 PM

Living in the future thanks to the Sponsors of Tomorrow

I like to poke a little fun at my peer Josh Bancroft. When we're meeting new people I tend to introduce him as "living in the future", but it's true and not limited to just Josh. Josh and all of the community managers and advisors in the Intel Software Network live in the future. And it's only fitting for us that Intel's new ad campaign tag line is Sponsors of Tomorrow. Intel sponsors our journey into the future every minute of every day. So it's great to see the video spot (YouTube Video) and read the press reviews in the Wall Street Journal and New York Times.

Intel pays us to work with cutting edge technology and share our experience with others. It is our JOB to live in the future and help people see it as a new reality for how they work, play and communicate with others.

We use computers that let us seamlessly boot a variety of operating systems (Linux, Windows or Mac OS X) since the future does not run on one operating system. The future consists of multiple operating systems and applications that are loosely coupled through open standards.

We all appreciate the power of a powerful laptop and need it for our daily work but we use smart phones, net books, net tops and store our information in the cloud. In the future you'll have access to your information when you need it and how you want it.

When we relax with our families we have home media systems that get their content more from data streams than from broadcast streams. We have access to all our content on demand and across all form factors of devices; in the future everyone will.

I have to admit that living in the future does have its limitations and is not an entirely stable experience. The future is not set in stone and often causes some heartache.  For example, when Boxee was forced to rip Hulu out of its content stream we were all heartbroken. And when we get the Fail Whale we're cut off from our Twitter community of friends. But for the most part the future is looking good and we have a better quality of life.

The Intel Software Network team is working on something BIG for June that uses WiMax, a Tricaster and Parallel Programming... all three of which have Intel innovation that makes the magic possible. Keep an eye out for more details and you too can be part of the future world.

06/05/2009 11:26 AM

Sponsors of Tomorrow: Just another tagline?

Clever tagline indeed; but it means much more to me personally.  I joined Intel over a year ago and I can genuinely say that my experience here has been fantastic.  I have worked with the most amazing, smart, cool and quirky people within the walls here at 2200 Mission College Blvd in Santa Clara <--- this is where I sit in case you want to send fan mail.  : P

Okay, seriously, in my mind, this new ad campaign, Sponsors of Tomorrow, not only celebrates the people behind the Intel brand, but it communicates to everyday people who we are, what we do and how it affects our tomorrow. Over the last 40 years or so - way before my time - Intel has developed some really cool technologies that affect the lives of many, including mine; from cool gadgets like MIDs and Netbooks that allow me to update my Twitter status wherever I go to technologies like WiMAX that are transforming entire cities into wireless hot spots. If you know anything about me, you will know that my entire life is defined by "being connected" all the time and for that, I thank all of my awesome co-workers and others in the industry for making that happen.

Okay, enough about me. I would encourage you to check out the Sponsors of Tomorrow website. It is filled with some amazing facts about Intel, quizzes and I am sure you will have a blast. You can also check out what Intel employees are doing online by visiting our activity feed.

Or, if you have some free time while browsing the Internet at work, stick around and check out the below videos that will be airing on television next week. Yes, these are actors and not real Intel employees. We're too busy for Hollywood; and besides, you still have me and the crew here; and several others you still haven't met. We're nice and we don't bite ... well, at least I don't. There is, however, an area on the site to learn more about our very own "Rockstars" by clicking on See Our videos. I hope you enjoy.





Like what I have to say? You can follow me on Twitter or subscribe to our blog to stay in touch with the latest and greatest of our new ad campaign.

06/05/2009 07:52 PM

Reality Check - Windows XP Mode and Intel's chips

So, there have been quite a few stories recently about support for Windows 7’s new ‘Windows XP Mode’.

‘Windows XP Mode’ is a feature that will be available with some versions of Windows 7. The short version is this: it will let you run a copy of Windows XP SP3 on your Windows 7 PC or notebook within a virtual partition using hardware virtualisation. ‘Windows XP Mode’ will however have some cool bells and whistles including great integration into Windows 7 (copy and paste will work etc…). This is another very cool use of our VT technology.

Intel introduced its Virtualization Technology in 2005 and has shipped over 100 Million chips with the feature. Windows XP Mode is targeted for business customers. It is available on the mid to higher end versions of Windows 7 and is supported in hardware by many Intel processors. Intel vPro technology PCs are required to have an Intel VT capable CPU and Intel VT capable BIOS. They are the best platforms for testing and deploying Microsoft Windows Virtual PC and Windows XP Mode.

However, there have been a lot of articles berating the fact that consumers with Intel processors without VT will ‘lose out’ on the Windows XP Mode, or that it ‘won’t work’.

CNET for example mentions that there are at least 30 versions of consumer laptops using the VT’less T6400 version of the Core 2 Duo processor.

Here is an example of such a notebook: It comes with Windows Vista Home Premium. (As do most of them)

Here is the list of Windows 7 versions that will ship according to ZDNet: Home Premium is a middle SKU.

  • Windows 7 Starter Edition (for emerging market and netbook users)
  • Windows 7 Home Basic (for emerging market customers only)
  • Windows 7 Home Premium (the main “Media Center” equivalent)
  • Windows 7 Professional (the business SKU for home users and non-enterprise licensees)
  • Windows 7 Enterprise (for volume licensees)
  • Windows 7 Ultimate (for consumers who want/need business features)

And finally according to TheRegister.co.uk: The Windows XP Mode will only come with Windows 7 Professional and up.

So not having VT on these consumer laptops is not going to be an issue - because the consumer versions of Windows 7 (Starter, Home Basic, and Home Premium) do not include Windows XP Mode.

Storm in a teacup anyone?

UPDATE: This is from Microsoft’s website:

PressPass: What types of applications are suited for Windows XP Mode and Windows Virtual PC stand-alone? Woodgate: Windows XP Mode is best suited for older business and productivity applications such as accounting, inventory and similar applications. Windows XP Mode is not aimed at consumers because many consumer applications require extensive use of hardware interfaces such as 3-D graphics, audio, and TV tuners that do not work well under virtualization today.

07/05/2009 06:56 AM

Parallel Programming Talk - Listener Question: Radix Sort Solution

Welcome to Episode 29 of Parallel Programming Talk broadcast on May 5, 2009 hosted by Aaron Tersteeg and Dr. Clay Breshears.

The first show of every month is the listener question show. On this episode we discussed Radix Sort, the first problem from Threading Challenge 2009.

Download the MP3 of the show.

News:

The Intel Software Network has launched Teach Parallel, a show that discusses teaching parallel programming. Today's Teach Parallel guest is professor Dan Ernst from the University of Wisconsin, Eau Claire.  Dan has been a pioneer at retooling the computer science curriculum for multi-core platforms.  Dan will be discussing practical steps to take to weave parallelism into the undergraduate curriculum. Listen to Teach Parallel every other Tuesday at 10:00AM.

Intel® Parallel Studio is on schedule to go live in mid 2009. Sanjiv Shah will be joining us on May 12 to discuss Parallel Inspector. Learn more and download the beta today.

Threading Challenge 2009
The second challenge, 3SAT, is due May 8, 2009. Get working on your entries. Please see official rules for more information or visit the forum for this problem to get your questions answered.

Send in your questions.
The first Tuesday of every month we will pick a listener question and do our best to provide an answer.

Email your questions to parallelprogrammingtalk@intel.com

On the show today:

There has been a lot of discussion in the contest forum about the Threading Challenge #1: Radix Sort. (80 posts, 21 threads, ~8,500 views)

Here is a brief recap of the problem description:

Given a set of unsorted items with keys that can be considered as a binary representation of an integer, the bits within the key can be used to sort the set of items. This method of sorting is known as Radix Sort.

Write a program that includes a threaded version of a Radix Sort algorithm that sorts the keys read from an input file, then output the sorted keys to another file. The input and output file names shall be the first and second arguments on the command line of the application execution.

The first line of the input text file is the total number of keys (N) to be sorted; this is followed by N keys, one per line, in the file.  A key will be a seven-character string made up of printable characters not including the space character (ASCII 0x20). The number of keys within the file is less than 2^31 - 1.  Sorted output must be stored in a text file, one key per line.

Timing: If you put timing code into your application to time the sorting process and report the elapsed time, this time will be used for scoring.  If no timing code is added, the entire execution time (including time for input and output) will be used for scoring.

Take a moment to listen to Clay's comments on the Radix Sort, review the forums and the two guest posts by Asaf Shelly "All Sorts of Sorts" and Dmitry Vyukov "Another Sort of Sort".

Up Next on Parallel Programming Talk

On the May 12th episode of Parallel Programming Talk we'll be talking with Intel Engineer Sanjiv Shah about the Parallel Inspector module in the new Intel Parallel Studio tool.

06/05/2009 07:19 AM

Another Sorts of Sorts

Asaf Shelly posted an interesting blog entry regarding the first problem (radix sort) of the Intel Threading Contest 2009:
All Sorts of Sorts
There is also an active discussion going on in the comments. Since I had mentioned some aspects of my submission, I decided to post my write-up here (I've checked the Contest Rules; luckily Intel leaves me enough rights for this :) ). So here it goes:

Radix Sort

Radix sort is a sorting algorithm that sorts integers by processing individual digits. Because integers can represent strings of characters and specially formatted floating point numbers, radix sort is not limited to integers. Most digital computers internally represent all of their data as electronic representations of binary numbers, so processing the digits of integer representations by groups of binary digit representations is most convenient. Two classifications of radix sorts are least significant digit (LSD) radix sorts and most significant digit (MSD) radix sorts. LSD radix sorts process the integer representations starting from the least significant digit and move towards the most significant digit. MSD radix sorts work the other way around. The MSD sorting algorithm has particular application to parallel computing, as each of the subdivisions can be sorted independently of the rest.

Radix sort is not a comparison-based sort, so the theoretical lower bound of O(N lg N) for comparison sorts is not applicable. The computational complexity of radix sort is O(NK), where N is the number of values and K is the number of subdivisions. This complexity holds for the worst, best and mean cases. Space complexity is O(NK).
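For contrast with the MSD variant used in the rest of this write-up, here is a minimal sketch of an LSD radix sort. It assumes all keys have the same length; the names lsd_radix_sort and key_length are illustrative and are not taken from the contest submission. Each pass is a stable counting sort on one character position, so K passes over N keys give the O(NK) running time mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// LSD radix sort for keys that all have the same length (an assumption of
// this sketch). Each pass is a stable counting sort on one byte position,
// starting from the last (least significant) character.
void lsd_radix_sort(std::vector<std::string>& keys, std::size_t key_length)
{
    std::vector<std::string> buffer(keys.size());
    for (std::size_t pos = key_length; pos-- > 0;)
    {
        // count occurrences of each byte value at this position
        std::size_t count[257] = {};
        for (std::size_t i = 0; i != keys.size(); ++i)
            ++count[static_cast<unsigned char>(keys[i][pos]) + 1];
        // prefix sums turn the counts into starting offsets
        for (std::size_t b = 0; b != 256; ++b)
            count[b + 1] += count[b];
        // stable scatter into the buffer
        for (std::size_t i = 0; i != keys.size(); ++i)
            buffer[count[static_cast<unsigned char>(keys[i][pos])]++] = keys[i];
        keys.swap(buffer);
    }
}
```

Unlike the MSD variant, every pass here touches the whole data set, which is one reason the MSD form is the easier one to split into independent sub-problems.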

Single-Threaded Implementation

A naïve single-threaded implementation of MSD radix sort is quite straightforward:

#include <cstddef>
#include <vector>

typedef unsigned char byte;
// reconstructed: the element type was lost in extraction; keys are treated as byte strings
typedef std::vector<std::vector<byte> > data_t;
size_t const byte_values = 256;

void radix_sort(data_t& data, size_t position = 0)
{
    // recursion stop conditions
    if (data.size() <= 1 || position == data[0].size())
        return;
    // one bucket per possible byte value
    std::vector<data_t> radix (byte_values);
    // radix split
    for (size_t i = 0; i != data.size(); ++i)
    {
        size_t idx = data[i][position];
        radix[idx].push_back(data[i]);
    }
    size_t out_pos = 0;
    for (size_t i = 0; i != byte_values; ++i)
    {
        // recursive sort of lesser significant digits
        radix_sort(radix[i], position + 1);
        // copyback
        for (size_t j = 0; j != radix[i].size(); ++j, ++out_pos)
        {
            data[out_pos] = radix[i][j];
        }
    }
}

Parallelization

I use 2 types of parallelization. The first type is the parallelization of the radix split (intra-radix parallelization); this parallelization is especially useful for the initial radix split (most significant digit). Input data is split into several parts (fork), and each processor picks up a part and makes a radix split (parallel processing). When all parts have been split, the partial radix arrays are aggregated (join) and directed to the next level of recursion. This parallelization may also help with sorting of not-so-randomly distributed data.

The second type is parallelization on the inter-radix level. A processor completely sorts a whole array at the lower levels of recursion. This parallelization helps mitigate the overheads of thread synchronization.

Parallelization is guided at run-time, i.e. threads prefer to do inter-radix parallelization; however, if some threads are out of work they help other threads on the intra-radix level.

When the size of the input array falls below some threshold, a thread switches to single-threaded mode, i.e. no further sub-tasks are split off (this also helps mitigate synchronization overheads).

Here is pseudo-code of the parallel algorithm:

struct radix_desc
{
  // partial results
  std::vector<data_t>     radix [thread_count];
  size_t                  radix_pending_count;
  size_t                  position;
  //...
};
struct radix_task
{
  data_t                  input;

  //...
  void execute()
  {
    // partial radix split
    for (size_t i = 0; i != input.size(); ++i)
    {
      size_t idx = input[i][desc.position];
      desc.radix[thread_id][idx].push_back(input[i]);
    }
    if (0 == atomic_decrement(desc.radix_pending_count))
    {
      // spawn sub-tasks
      for (size_t i = 0; i != byte_values; ++i)
      {
        // aggregate partial results
        data_t result;
        for (size_t j = 0; j != thread_count; ++j)
        {
          result.insert(result.end(), desc.radix[j][i].begin(), desc.radix[j][i].end());
        }
        radix_desc* sub_desc = new radix_desc (...);
        spawn_some_subtasks(sub_desc, result);
      }
    }
  }
};
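The fork/join radix split described above can be sketched in standard C++ as follows. This is an illustration only: it uses std::thread (C++11) in place of the Win32 threads and custom scheduler of the submission, it creates fresh threads for one split instead of reusing workers, and the names parallel_radix_split and keys_t are made up for the example. It assumes thread_count is at least 1 and that every key is longer than position.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

typedef std::vector<std::string> keys_t;
std::size_t const byte_values = 256;

// One intra-radix split step: each thread buckets its own slice of the input
// by the byte at `position` (fork), then the per-thread buckets are
// concatenated per byte value (join).
std::vector<keys_t> parallel_radix_split(keys_t const& input,
                                         std::size_t position,
                                         std::size_t thread_count)
{
    // partial[t][b] holds the keys thread t found with byte value b
    std::vector<std::vector<keys_t> > partial(
        thread_count, std::vector<keys_t>(byte_values));
    std::vector<std::thread> workers;
    std::size_t const chunk = (input.size() + thread_count - 1) / thread_count;
    for (std::size_t t = 0; t != thread_count; ++t)
    {
        workers.emplace_back([&, t]
        {
            std::size_t const begin = std::min(t * chunk, input.size());
            std::size_t const end = std::min(begin + chunk, input.size());
            for (std::size_t i = begin; i != end; ++i)
            {
                std::size_t const idx =
                    static_cast<unsigned char>(input[i][position]);
                partial[t][idx].push_back(input[i]);
            }
        });
    }
    for (std::size_t t = 0; t != workers.size(); ++t)
        workers[t].join();
    // join: aggregate in thread order, so the input order within a bucket
    // is preserved
    std::vector<keys_t> radix(byte_values);
    for (std::size_t b = 0; b != byte_values; ++b)
        for (std::size_t t = 0; t != thread_count; ++t)
            radix[b].insert(radix[b].end(),
                            partial[t][b].begin(), partial[t][b].end());
    return radix;
}
```

Because every thread writes only to its own row of partial, the split phase needs no locking; the only synchronization point is the join.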

Scheduling

I've implemented a custom task-based scheduler on top of the Win32 threading API. In the main it's similar to a classical Cilk-style work-stealing scheduler, though I've made some improvements to it. In particular I've added system-topology awareness, hyper-threading awareness, affinity awareness, batch-spawn capability and manual task-depth control. All worker threads are strictly bound to EUs (execution units), and stealing is conducted based on the "distance" between EUs, i.e. a worker thread tries to steal from neighbor threads first, then from threads running on a different NUMA node (system-topology awareness). This allows efficient reuse of data in the shared L3 cache of the processors.

Sibling HT threads share a single work-stealing deque (HT awareness); this allows them to keep as close to each other as possible in terms of working sets. The resources of a single core (L1D cache, L1 DTLB, etc.) cannot accommodate 2 distinct radix sorts, so HT awareness allows sibling HT threads to work on a single radix sort, so to say. Assume the first HT thread completes a radix split and spawns a bunch of sub-tasks. Then it picks up some sub-task to process, while the second HT thread picks up another sub-task; the data for that other sub-task is already in the L1D cache (as well as in the L1 DTLB) of the core.

The scheduler is able to support affinity of tasks, though I didn't have enough time to exploit the feature.

When a thread completes a radix split it submits up to 96 (number of printable characters in US-ASCII) sub-tasks; the scheduler allows all the tasks to be submitted in a single enqueue operation. This reduces synchronization overheads to some degree.

When a thread submits new tasks to the scheduler it explicitly passes the so-called task depth as a parameter. Task depth relates to the task's level in the work DAG. When a thread pops a task from its own work-stealing deque it picks the task with the highest available level (the smallest piece of work); when a thread steals a task from a remote work-stealing deque it picks the task with the lowest available level (the biggest piece of work). This reduces the number of steal operations.
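The pop/steal policy can be illustrated with a deliberately simplified deque. This is a sketch under assumptions, not the scheduler from the submission: it is protected by a plain mutex rather than being lock-free, the task record and the work_deque class are hypothetical, and it relies on tasks being pushed in non-decreasing depth order so that the deepest task sits at the back.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical task record: `depth` is the task's level in the work DAG
// (larger depth = smaller piece of work).
struct task
{
    std::size_t depth;
    int id;
};

// Simplified, mutex-protected stand-in for a work-stealing deque. Tasks are
// expected to be pushed in non-decreasing depth order, so the deepest task
// sits at the back and the shallowest at the front.
class work_deque
{
public:
    void push(task const& t)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(t);
    }
    // owner side: take the deepest (smallest) task
    bool pop(task& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    // thief side: take the shallowest (biggest) task
    bool steal(task& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = tasks_.front();
        tasks_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<task> tasks_;
};
```

Stealing the shallowest task hands the thief the largest remaining sub-tree, so it stays busy longer before it has to steal again.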

Regarding Threading Building Blocks. Another possibility would be to use TBB's task scheduler. Using TBB would not affect the main logic of the program in any way, because it supports exactly the same task concept. On one hand, TBB would reduce the amount of written code (no need to implement a scheduler manually). On the other hand, TBB's scheduler is not system-topology aware, not HT aware, does not provide batch spawn capability, and does not provide manual control over task depths (not relevant w/o HT awareness). (TBB's scheduler is affinity aware to some degree, i.e. it supports task affinities but does not support thread affinities.) Also, TBB's scheduler has somewhat bigger task spawn/consume overheads: some 600 cycles, versus some 200 cycles for my scheduler (on my hardware). Since the contest is about raw performance I've decided to implement my own scheduler.

Single-threaded Optimizations

Avoiding copyback. A naïve radix sort implementation makes K (number of digits) copies of the whole data set in the copyback phase. In order to eliminate those copies I use the following optimization. On start I allocate an array for the sorted data:

struct output_cell
{
  int count_;
  uint32_t*   data_;
};
size_t const output_size = 96*128*128; 
output_cell* g_output = new output_cell [output_size];

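The 3 most significant digits of the value determine the index in that array:

size_t output_index(uint64_t val)
{
  // reconstructed line (lost in extraction): view the value as individual bytes
  unsigned char const* v = (unsigned char const*)&val;
  return ((size_t)v[3]) | ((size_t)v[2] << 7) | (((size_t)v[1] - 32) << 14);
}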

The 4 least significant digits of the value are stored in the inner array:

void store_result(uint64_t val, size_t position)
{
  size_t idx = output_index(val);
  uint32_t v = (uint32_t)(val >> 32);
  g_output[idx].data_[position] = v;
}

This way all copies of the data in the copyback phase are eliminated, sorted data are placed directly to the final destination.
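To make the indexing scheme concrete, here is a small sketch that computes the same kind of index directly from the first three characters of a key, instead of from the packed 64-bit value used above (whose byte layout is not spelled out in this write-up). The function name output_index_from_chars is hypothetical, and the sketch assumes 7-bit ASCII codes in the range 32..127.

```cpp
#include <cstddef>

std::size_t const output_size = 96 * 128 * 128;

// Maps the first three characters of a key to a slot in the output array.
// Assumes 7-bit ASCII codes 32..127, so the first character contributes
// 96 values and the next two contribute 128 values each.
std::size_t output_index_from_chars(char const* key)
{
    std::size_t const c0 = static_cast<unsigned char>(key[0]);
    std::size_t const c1 = static_cast<unsigned char>(key[1]);
    std::size_t const c2 = static_cast<unsigned char>(key[2]);
    return ((c0 - 32) << 14) | (c1 << 7) | c2;
}
```

The index grows with the lexicographic order of the three-character prefix, so walking the output array from slot 0 upward visits the prefixes in sorted order; the largest possible index is 96*128*128 - 1, which matches the array size.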

Counting sort. When values are reduced to 2 bytes (by 5 previous radix splits) I use counting sort (which is a special case of radix sort with a special intermediate representation of the values). Counting sort has the same computational complexity as radix sort, but has lower space complexity and can be implemented more efficiently. Since I expect very few values to be sorted with counting sort at a time (i.e. the counter array will be very sparse), I add a bitmask to optimize the search over the counter array.

Pseudo-code of the counting sort:

void counting_sort(uint16_t* begin, uint16_t* end, uint32_t* output, uint32_t prefix)
{
  uint32_t counter [256*256] = {};
  bitmask_t bitmask;
  for (uint16_t* pos = begin; pos != end; pos += 1)
  {
    uint16_t v = pos[0];
    counter[v] += 1;
    bitmask.set_bit(v);
  }
  for (uint16_t v; bitmask.get_and_reset_bit(v);)
  {
    do
    {
      uint32_t val = prefix;
      val |= v;
      output[0] = val;
      output += 1;
    }
    while (--counter[v]);
  }
}

bitmask_t::get_and_reset_bit() operation is implemented with the BSF instruction (_BitScanForward64() intrinsic). Bitmask optimization reduces computational complexity of the counting sort from 65536*N to 2*N.
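A minimal sketch of what such a bitmask could look like is given below. It is an assumption about the shape of bitmask_t, not the code from the submission: it uses a flat array of 1024 64-bit words with a forward-only cursor, the helper lowest_set_bit is a made-up name, and outside 64-bit MSVC it falls back to the GCC/Clang builtin __builtin_ctzll instead of _BitScanForward64().

```cpp
#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the lowest set bit of a non-zero 64-bit word.
inline unsigned lowest_set_bit(std::uint64_t word)
{
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward64(&index, word); // available on 64-bit MSVC targets
    return static_cast<unsigned>(index);
#else
    return static_cast<unsigned>(__builtin_ctzll(word)); // GCC/Clang builtin
#endif
}

// Flat bitmask over all 65536 possible 16-bit values (1024 words of 64 bits).
class bitmask_t
{
public:
    bitmask_t() : words_(), cursor_(0) {}
    void set_bit(std::uint16_t v)
    {
        words_[v >> 6] |= std::uint64_t(1) << (v & 63);
    }
    // Finds the lowest set bit, clears it and returns its index in `v`.
    // Returns false once the mask is empty. The cursor only moves forward,
    // so this sketch supports one "set everything, then drain" cycle.
    bool get_and_reset_bit(std::uint16_t& v)
    {
        while (cursor_ != word_count && words_[cursor_] == 0)
            ++cursor_;
        if (cursor_ == word_count)
            return false;
        unsigned const bit = lowest_set_bit(words_[cursor_]);
        words_[cursor_] &= words_[cursor_] - 1; // clear the lowest set bit
        v = static_cast<std::uint16_t>((cursor_ << 6) | bit);
        return true;
    }
private:
    static std::size_t const word_count = 65536 / 64;
    std::uint64_t words_[word_count];
    std::size_t cursor_;
};
```

Draining the mask yields the values in ascending order, which is exactly the order the counting sort needs when it writes its output.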

Counting sort is not parallelized in my implementation. Since input data is uniformly distributed, I expect this to not affect performance. Though this is a possible further optimization which will allow better handling of not-so-randomly distributed data.

Template code generation. I heavily use C++ template programming in order to allow efficient code generation. Value is represented by the following class:

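// template headers reconstructed after extraction damage; parameter names are assumptions
template<size_t digit_count> struct data_layout;
template<> struct data_layout<7> {typedef uint64_t value_t;};
template<> struct data_layout<6> {typedef uint64_t value_t;};
template<> struct data_layout<5> {typedef uint64_t value_t;};
template<> struct data_layout<4> {typedef uint32_t value_t;};
template<> struct data_layout<3> {typedef uint32_t value_t;};
template<> struct data_layout<2> {typedef uint16_t value_t;};
template<size_t digit_count>
struct value
{
  typedef typename data_layout<digit_count>::value_t value_t;
  value_t val;
  template<size_t other_digit_count>
  value& operator = (value<other_digit_count> const& r)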
  {
    val = (value_t)(r.val >> (8 * (sizeof(r) - sizeof(*this))));
    return *this;
  }
  char prefix() const
  {

  }
};

All functions and classes related to radix sorting are also template parametrized by number of digits, and act accordingly to particular value layout, location of the radix prefix in the value, etc.

Also radix task is template parametrized by parameters is_single_threaded and is_parent_single_threaded. When is_single_threaded==true, task allocates subtasks on the stack and executes them directly. When is_parent_single_threaded==true, task avoids atomic counting of pending siblings, since parent allocates sub-tasks on the stack they all will complete when parent completes.

Memory allocation. Efficient memory allocation is crucial for the single-threaded as well as the multi-threaded performance of the implementation (the standard Windows allocator uses a single mutex, which significantly reduces scalability). I implement a distributed region memory allocator: there is a pool of 2 MB pages per NUMA node, and a thread privatizes a page from that pool and then uses region allocation on the page. When a page is exhausted the thread privatizes another page, and so on. No memory is freed to the OS during the radix sort, though some memory is reused internally. I also implement a simple caching memory allocator for objects of a particular size; the allocator is based on a per-thread LIFO freelist. When an object is freed it's pushed onto the freelist; when an object must be allocated it's popped from the freelist.
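The two allocators can be sketched roughly as follows. This is a simplified illustration, not the allocator from the submission: the page comes from std::malloc rather than from a per-NUMA-node pool of 2 MB pages, there is no page replacement when the region runs out, and the class names region_allocator and freelist_cache are hypothetical. Each thread would own its own instances, so no locking is shown.

```cpp
#include <cstddef>
#include <cstdlib>

// Region (bump) allocator over a single page. In this sketch the page comes
// from std::malloc; a per-NUMA-node page pool is outside its scope.
class region_allocator
{
public:
    explicit region_allocator(std::size_t page_size)
        : begin_(static_cast<char*>(std::malloc(page_size)))
        , pos_(begin_)
        , end_(begin_ ? begin_ + page_size : begin_)
    {}
    ~region_allocator() { std::free(begin_); }
    // Returns `size` bytes (rounded up to a multiple of two pointers), or a
    // null pointer when the page is exhausted.
    void* allocate(std::size_t size)
    {
        std::size_t const align = sizeof(void*) * 2;
        size = (size + align - 1) / align * align;
        if (static_cast<std::size_t>(end_ - pos_) < size)
            return 0;
        void* result = pos_;
        pos_ += size;
        return result;
    }
private:
    region_allocator(region_allocator const&);            // non-copyable
    region_allocator& operator=(region_allocator const&); // non-copyable
    char* begin_;
    char* pos_;
    char* end_;
};

// Caching allocator for objects of one fixed size, based on a LIFO freelist.
// Freed objects are pushed onto the list and reused before the region is
// asked for fresh memory.
class freelist_cache
{
public:
    freelist_cache(region_allocator& region, std::size_t object_size)
        : region_(region)
        , object_size_(object_size < sizeof(node) ? sizeof(node) : object_size)
        , head_(0)
    {}
    void* allocate()
    {
        if (head_)
        {
            node* n = head_;
            head_ = n->next;
            return n;
        }
        return region_.allocate(object_size_);
    }
    void deallocate(void* p)
    {
        node* n = static_cast<node*>(p);
        n->next = head_;
        head_ = n;
    }
private:
    struct node { node* next; };
    region_allocator& region_;
    std::size_t object_size_;
    node* head_;
};
```

The LIFO order means the most recently freed object is handed out first, which is the one most likely to still be in cache.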

Tools

I was considering the Microsoft Visual C++ (MSVC) and Intel C++ (ICC) compilers. In 32-bit mode ICC showed an impressive 30% speedup over MSVC (even more with profile-guided optimizations). However, in 64-bit mode ICC showed a wicked 20% slowdown (with maximum possible optimizations turned on, including /QxHost, /Qunroll, etc.); profile-guided optimizations improved the situation somewhat, but ICC was still behind MSVC. I didn't have time to investigate the problem, so I decided to use MSVC for the final submission.

As a profiler I used AMD CodeAnalyst; it's a simple profiler that makes it easy to capture and analyze a profile of the program. Profiling was crucial for the single-threaded optimizations. It also allowed me to verify that the profile of the multi-threaded version is mostly identical to that of the single-threaded version, and that the overheads for synchronization and scheduling are no greater than several percent; all this is a good sign of successful parallelization. Another option would be to use Intel PTU; it's somewhat more complicated, but it would allow capturing processor performance events, which is crucial for single-threaded optimization (for example, it would show what causes excessive pipeline stalls: L1D cache misses or L1 DTLB misses).

Another great tool I used is Windows Task Manager. It allowed me to track virtual memory consumption, CPU utilization, working set and number of page faults. The goal was to keep virtual memory consumption within expected bounds (~1.5 * input data size in my case), 100% utilization of the CPUs in the parallel phase and 0 page faults (i.e. working set == virtual memory).

06/05/2009 05:46 AM
