CPU Tests: Simulation

Simulation and Science have a lot of overlap in the benchmarking world; for this distinction, however, we separate them into two segments based mostly on the utility of the resulting data. The benchmarks that fall under Science have a distinct use for the data they output; those in our Simulation section act more like synthetics, but at some level they are still trying to simulate a given environment.

DigiCortex v1.35: Link

DigiCortex is a pet project for the visualization of neuron and synapse activity in the brain. The software comes with a variety of benchmark modes, and we take the small benchmark which runs a 32k neuron/1.8B synapse simulation, similar to a small slug.

Results are given as a multiple of real-time capability, so anything above a value of one means the system is suitable for real-time work. The benchmark offers a 'no firing synapse' mode, which in essence stresses DRAM and bus speed, but we take the firing mode, which adds CPU work with every firing.

I reached out to the author of the software, who added several features to make it conducive to benchmarking. The software comes with a series of batch files for testing, and we run the ‘small 64-bit nogui’ version with a modified command line to allow for a benchmark warm-up before the actual testing.

The software originally shipped with a benchmark that recorded the first few cycles and output a result. On fast multi-threaded processors this made the benchmark last less than a few seconds, while slow dual-core processors could be running for almost an hour. There is also the issue of DigiCortex starting with a base neuron/synapse map in ‘off mode’, giving a high result in the first few cycles as none of the nodes are yet active. We found that performance settles into a steady state after a while (when the model is actively in use), so we asked the author to allow for a ‘warm-up’ phase and for the benchmark result to be the average over a second sample period.

For our test, we give the benchmark 20000 cycles to warm up and then take the data over the next 10000 cycles for the test – on a modern processor this takes 30 seconds and 150 seconds respectively. This is then repeated a minimum of 10 times, with the first three results rejected.

We also have an additional flag on the software to make the benchmark exit when complete (which is not the default behavior). The final results are output into a predefined file, which can be parsed for the result. The number of interest for us is the ability to simulate this system in real time, and results are given as a factor of this: hardware that can simulate at double real time is given a value of 2.0, for example.
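To give a concrete picture of the harness logic, below is a minimal Python sketch of the run-and-average loop. The batch file name, its command-line flags, the results file path, and the output format shown here are all hypothetical stand-ins; the real files ship with DigiCortex.

    import re
    import subprocess
    import statistics

    # Hypothetical names: the real batch file and results file ship with DigiCortex.
    BENCH_CMD = ["digicortex_small64_nogui.bat", "--warmup", "20000", "--sample", "10000"]
    RESULT_FILE = "digicortex_result.txt"

    def run_once() -> float:
        """Run one benchmark pass and parse the real-time factor from the results file."""
        subprocess.run(BENCH_CMD, check=True)
        with open(RESULT_FILE) as f:
            text = f.read()
        # Assumed output format: a line such as "Realtime factor: 2.00"
        match = re.search(r"Realtime factor:\s*([0-9.]+)", text)
        if match is None:
            raise RuntimeError("no result found in " + RESULT_FILE)
        return float(match.group(1))

    # A minimum of ten runs; the first three are rejected, the rest averaged.
    scores = [run_once() for _ in range(10)]
    print(f"DigiCortex real-time factor: {statistics.mean(scores[3:]):.2f}")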

The final result is a table that looks like this:

(3-1) DigiCortex 1.35 (32k Neuron, 1.8B Synapse)

The variety of results show that DigiCortex loves cache and single thread frequency, is not too fond of victim caches, but still likes threads and DRAM bandwidth.

Dwarf Fortress 0.44.12: Link

Another long-standing request for our benchmark suite has been Dwarf Fortress, a popular management/roguelike indie video game, first launched in 2006 and still regularly updated today, with a Steam launch planned for sometime in the future.

Emulating the ASCII interfaces of old, this title is a rather complex beast, which can generate environments subject to millennia of rule, famous faces, peasants, and key historical figures and events. The further you get into the game, depending on the size of the world, the slower it becomes as it has to simulate more famous people, more world events, and the natural way that humanoid creatures take over an environment. Like some kind of virus.

For our test we’re using DFMark. DFMark is a benchmark built by vorsgren on the Bay12Forums that gives two different modes built on DFHack: world generation and embark. These tests can be configured, but range anywhere from 3 minutes to several hours. After analyzing the test, we ended up going for three different world generation sizes:

  • Small, a 65x65 world with 250 years, 10 civilizations and 4 megabeasts
  • Medium, a 129x129 world with 550 years, 10 civilizations and 4 megabeasts
  • Large, a 257x257 world with 550 years, 40 civilizations and 10 megabeasts

I looked into the embark mode, but came to the conclusion that, due to the way people play embark, getting something close to real-world data would require several hours’ worth of embark tests. This would be prohibitive for the benchmark suite, so I decided to focus on world generation.

DFMark outputs the time to run any given test, so this is what we use for the output. We loop the small test as many times as possible in 10 minutes, the medium test as many times as possible in 30 minutes, and the large test as many times as possible in an hour.
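In outline, the time-boxed looping works like the Python sketch below. The DFMark command lines are placeholders, since the real tests are driven through DFHack, but the budget logic matches what we do.

    import subprocess
    import time

    def loop_test(cmd, budget_seconds):
        """Run a world-gen test repeatedly until the time budget is exhausted,
        then return the mean completion time of the finished runs."""
        times = []
        deadline = time.monotonic() + budget_seconds
        while time.monotonic() < deadline:
            start = time.monotonic()
            subprocess.run(cmd, check=True)  # DFMark also reports its own timing
            times.append(time.monotonic() - start)
        return sum(times) / len(times)

    # Hypothetical invocations; DFMark is actually configured through DFHack.
    small  = loop_test(["dfmark", "--worldgen", "65x65",   "--years", "250"], 10 * 60)
    medium = loop_test(["dfmark", "--worldgen", "129x129", "--years", "550"], 30 * 60)
    large  = loop_test(["dfmark", "--worldgen", "257x257", "--years", "550"], 60 * 60)
    print(small, medium, large)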

(3-2a) Dwarf Fortress 0.44.12 World Gen 65x65, 250 Yr
(3-2b) Dwarf Fortress 0.44.12 World Gen 129x129, 550 Yr
(3-2c) Dwarf Fortress 0.44.12 World Gen 257x257, 550 Yr

Interestingly, Intel's hardware likes Dwarf Fortress. It is primarily single-threaded, so a high IPC and a high frequency are what matter here.

Dolphin v5.0 Emulation: Link

Many emulators are bound by single-thread CPU performance, and general reports suggested that Haswell provided a significant boost to emulator performance. This benchmark runs a Wii program that ray traces a complex 3D scene inside the Dolphin Wii emulator. Performance on this benchmark is a good proxy for the speed of Dolphin's CPU emulation, which is an intensive single-core task using most aspects of a CPU. Results are given in seconds, where the Wii itself scores 1051 seconds.

The Dolphin software has the ability to output a log, and we obtained a version of the benchmark from a Dolphin developer that outputs the display into that log file. When finished, the benchmark will automatically try to close the Dolphin software (which is not normal behavior) and brings up a confirmation pop-up, which our benchmark script detects and removes. The log file is fairly verbose, so the benchmark script iterates through it line by line, looking for a regex match on the line containing the final completion time.
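The log scraping amounts to something like the sketch below. The log path and the exact line format are assumptions on our part, since the build we use came from a Dolphin developer; the line-by-line regex approach is the point here.

    import re

    LOG_FILE = "dolphin.log"  # assumed path
    # Assumed format of the completion line written by the developer build.
    TIME_RE = re.compile(r"took\s+([0-9]+(?:\.[0-9]+)?)\s*seconds")

    result = None
    with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
        for line in log:                        # the log is verbose, so scan line by line
            match = TIME_RE.search(line)
            if match:
                result = float(match.group(1))  # keep the last match: the final time

    print(f"Dolphin render test: {result} seconds" if result else "no result found")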

The final result is a table that looks like this:

(3-3) Dolphin 5.0 Render Test

Dolphin does still have one flaw: about one in every ten runs it will hang when the benchmark is complete and can only be removed from memory via a taskkill command or equivalent. I have not found a solution for this yet, and due to this issue Dolphin is one of the final tests in the benchmark run. If the issue occurs and I notice, I can close Dolphin and re-run the test by manually opening the benchmark in Dolphin, and allow the script to pick up the final dialog box when done.
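One way to automate that cleanup would be a watchdog along these lines. This is a sketch of an approach rather than what our script currently does, and the process image name is an assumption.

    import subprocess

    def run_with_watchdog(cmd, timeout_s=600):
        """Run Dolphin and force-kill it if it hangs past the timeout.
        Returns True if the run completed on its own."""
        try:
            subprocess.run(cmd, timeout=timeout_s, check=True)
            return True
        except subprocess.TimeoutExpired:
            # Force-kill the hung process by image name (Windows; assumed name).
            subprocess.run(["taskkill", "/F", "/IM", "Dolphin.exe"])
            return False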

Comments

  • ruthan - Monday, July 27, 2020 - link

    Well, lots of bla, bla, bla... I checked the graphs in the article; they are classic, just a few entries. There is a link to your benchmark database, but here I see some Crysis benchmark preselected, which is not part of the article, and it doesn't lead to some ultimate graphs of lots of CPUs. So it needs much more streamlining.

    I usually use old Geekbench for CPU tests, and there I can usually compare what I want... well, not with real applications and games, but it's quick too. Otherwise I usually have enough knowledge to know whether some CPU is good enough for some games or not, so I don't need some very old and very new comparisons. Something can be found at Phoronix.
    These benchmarks will always lose relevancy with new updates, unless all CPUs were in their own machines, updated, running and retested constantly, which could be quite a waste of power and money.
    Maybe the golden path is some simple multithreaded testing utility with 2 benchmarks, one for integers and one for floats.
  • Ian Cutress - Wednesday, August 5, 2020 - link

    When you're in Bench, check the drop-down menu on the left for the individual tests.
  • hnlog - Wednesday, July 29, 2020 - link

    > For our testing on the 2020 suite, we have secured three RTX 2080 Ti GPUs direct from NVIDIA.
    Congrats!
  • Koenig168 - Saturday, August 1, 2020 - link

    It would be more efficient to focus on the more popular CPUs. Some of the less popular SKUs which differ only by clock speed can have their performance extrapolated. Testing 900 CPUs sounds nice, but it quickly hits diminishing returns in terms of usefulness after the first few hundred.

    You might also wish to set some minimum performance standards using just a few tests. Any CPU which fails to meet those standards should be marked as "obsolete, upgrade already dude!" and be done with, rather than spending the full 30 to 40 hours testing each of them.

    Finally, you need to ask yourself "How often do I wish to redo this project, and how many resources will I be able to devote to it?" Bear in mind that with new drivers, games, etc., the database needs to be updated periodically to stay relevant. This will provide a realistic estimate of how many CPUs to include in the database.
  • Meteor2 - Monday, August 3, 2020 - link

    I think it's a labour of love...
  • TrevorX - Thursday, September 3, 2020 - link

    My suggestion would be to bench the highest performing Xeons that supported DDR3 RAM. Why? Because the cost of DDR3 RDIMMs is so amazingly cheap (as in, less than 10%) compared with DDR4. I personally have a Xeon E5-1660v2 @4.1GHz with 128GB DDR3 1866MHz RDIMMs that's the most rock stable PC I've ever had. Moving up to a DDR4 system with similar memory capacity would be eye-wateringly expensive. I currently have 466 tabs open in Chrome, Outlook, Photoshop, Word, several Excel spreadsheets, and I'm only using 31.3% of physical RAM. I don't game, so I would be genuinely interested in what actual benefit would be derived from an upgrade to Ryzen / Threadripper.

    Also very keen to see server/hypervisor testing of something like Xeon E5-2667v2 vs Xeon W-1270P or Xeon Silver 4215R for evaluation of on-prem virtualisation hosts. A lot of server workloads are being shifted to the cloud for very good reasons, but for smaller businesses it might be difficult to justify the monthly expense of cloud hosting (and Azure licensing) when they still have a perfectly serviceable 5yo server with plenty of legs left on it. It would be great to be able to see what performance and efficiency improvements can be had jumping between generations.
  • Tilmitt - Thursday, October 8, 2020 - link

    When is this going to be done?
  • Mil0 - Friday, October 16, 2020 - link

    Well they launched with 12 results if I count correctly, and currently there are 38 listed, that's close to 10/month. With the goal of 900, that would mean over 7 years (in which ofc more CPUs would be released)
  • Mil0 - Friday, October 16, 2020 - link

    Well they launched with 12 results if I count correctly, and currently there are 44 listed, that's about a dozen a month. With the goal of 900, that would mean 6 years (in which ofc more CPUs would be released)
  • Mil0 - Friday, October 16, 2020 - link

    Caching hid my previous comment from me, so instead of a follow-up there are now two pretty similar ones. However, in the meantime I found that Ian is actually posting updates on Twitter, which you can find here: https://twitter.com/IanCutress/status/131350328982...

    He actually did 36 CPUs in 2.5 months, so it should only take 5 years! :D
