Olxinos
Megacrafter127 Well, the two snippets above will give you identical or nearly identical results with a good optimizing compiler (template<gatetypeenum type> forces type to be a constant expression, hence known at compile time, so the compiler can, and usually will, optimize the switch away), but I assume you meant something along the lines of:
Code:
inline bool evalStatusOnUpdate(bool newstate, int activeInputCount, gatetypeenum type) {
    // AND, OR, etc. are values of the enum gatetypeenum;
    // connectedInputCount is assumed to be a member of the enclosing gate
    switch(type) {
        case AND: return activeInputCount == connectedInputCount;
        case OR: return activeInputCount > 0;
        //...
        default: error();
    }
}
In that case, you essentially remove a control structure. That can lead to interesting improvements if the processor couldn't accurately predict the branch/jump.
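For reference, the template<gatetypeenum type> variant I alluded to above isn't actually shown anywhere; a minimal sketch of what I mean (same names as the snippet, with the connected-input count passed explicitly so the sketch is self-contained):

```cpp
#include <cstdlib>

enum gatetypeenum { AND, OR /* , ... */ };

// Because `type` is a template parameter, it is a compile-time
// constant: the compiler resolves the switch during instantiation,
// and each specialization keeps only the code of its own case.
template<gatetypeenum type>
inline bool evalStatusOnUpdate(int activeInputCount, int connectedInputCount) {
    switch(type) {
        case AND: return activeInputCount == connectedInputCount;
        case OR:  return activeInputCount > 0;
        default:  std::abort(); // unreachable for the opcodes shown
    }
}
```

Calling `evalStatusOnUpdate<AND>(3, 3)` then compiles down to a plain comparison, with no dispatch left at run time.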
See for instance: "Why is it faster to process a sorted array than an unsorted array?", or this small toy program:
Code:
#include <vector>
#include <chrono>
#include <random>
#include <iostream>
#include <cstdlib>

using size_t = std::size_t;
using namespace std::chrono;
using clk_t = high_resolution_clock;

int main(int argc, char* argv[]) {
    std::random_device seeder;
    std::default_random_engine rng(seeder());
    std::uniform_int_distribution<unsigned int> dist;
    if(argc != 2) {
        std::cerr << "usage: ./a.out n" << std::endl;
        return 1;
    }
    const size_t n = atol(argv[1]);
    std::vector<unsigned int> random_table(n);
    for(auto& cell : random_table)
        cell = dist(rng);
    const auto start = clk_t::now();
    unsigned long counter = 0;
    for(auto cell : random_table) {
#   ifdef WITH_CONDITIONAL
        if(cell & 0x1)
            counter += 2;
#   else
        counter += (cell & 0x1) << 1;
#   endif // WITH_CONDITIONAL
    }
    const auto end = clk_t::now();
    auto us = duration_cast<microseconds>(end - start).count();
    std::cout << "Took " << (us / int(1e6)) << 's';
    us %= int(1e6);
    std::cout << (us / int(1e3)) << "ms";
    us %= int(1e3);
    std::cout << us << "us\n";
    return static_cast<int>(counter); // returned so the loop isn't optimized away
}
On my machine:
Code:
user$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.38)
Target: x86_64-apple-darwin16.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
user$ g++ foobar.cpp -std=c++11 -DWITH_CONDITIONAL -O3
user$ ./a.out 1000000000
Took 4s55ms492us
user$ g++ foobar.cpp -std=c++11 -O3
user$ ./a.out 1000000000
Took 1s178ms76us
(yeah, "g++" here is a symlink to Apple clang rather than GNU g++; Apple did that when they switched from g++ to clang a couple of years ago)
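For comparison with the linked question: the branchy version becomes fast again if you sort the data first, because the branch turns predictable. A quick illustrative sketch (not part of the benchmark above; the function name is mine):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Same branchy pattern as the Stack Overflow question: sum the
// elements that are >= 128. On random data the branch mispredicts
// roughly half the time; sorting first makes it almost perfectly
// predictable, which is where the famous speedup comes from.
std::uint64_t sum_large(std::vector<unsigned> data, bool sort_first) {
    if(sort_first)
        std::sort(data.begin(), data.end());
    std::uint64_t total = 0;
    for(unsigned cell : data)
        if(cell >= 128)   // the branch whose predictability matters
            total += cell;
    return total;
}
```

Both variants compute the same sum; only the branch-prediction behaviour (and hence the timing) differs.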
That's actually what I was referring to when I talked about mispredictions a couple of posts back. However, setting aside this possible optimisation, I don't see how pre-compilation would help (I disagree with what GRHayes said about the L1/L2 cache; more on that later).
The improvement looks somewhat interesting in the examples above, but that's really because everything else was pretty cheap. I don't think this would be the case if you were evaluating Starmade logic gates' states. Actually, let's take the code I posted a couple of posts ago and replace opcode with a fixed value:
Code:
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <random>

using instr_t = unsigned int;
using addr_t = unsigned int;
using size_t = std::size_t;
using boolean_t = bool;
using namespace std::chrono;
using clk_t = high_resolution_clock;

constexpr size_t OpCodeBits = 3;
constexpr instr_t OpCodeMask = instr_t(~(~instr_t(0) << OpCodeBits));
constexpr size_t AddrBits = 15;
constexpr size_t EnvSize = 1 << AddrBits;

enum op_t : instr_t {
    Mov,
    And,
    Or,
    Not,
    Xor,
    Nand,
    Nor,
    Nxor,
    LENGTH_
};
static_assert(LENGTH_ == 1 << OpCodeBits, "missing opcodes");

int main(int argc, char* argv[]) {
    std::random_device seeder;
    std::default_random_engine rng(seeder());
    std::uniform_int_distribution<instr_t> dist;
    if(argc != 2) {
        std::cerr << "usage: ./a.out n" << std::endl;
        return 1;
    }
    const size_t n = atol(argv[1]);
    std::vector<instr_t> random_code(n);
    for(instr_t& instruction : random_code)
        instruction = dist(rng);
    std::vector<std::size_t> permutation(n);
    for(size_t i = 0 ; i < n ; ++i)
        permutation[i] = i;
    std::shuffle(std::begin(permutation), std::end(permutation), rng);
    std::vector<boolean_t> environment(EnvSize);
    boolean_t eax = false;
    const auto start = clk_t::now();
    for(size_t idx : permutation) {
        const instr_t instruction = random_code[idx];
#   ifdef FIXED_VALUE
        const op_t opcode = And;
#   else
        const op_t opcode = static_cast<op_t>(instruction & OpCodeMask);
#   endif // FIXED_VALUE
        const addr_t address[2] =
            {static_cast<addr_t>((instruction >> OpCodeBits) % EnvSize)
            ,static_cast<addr_t>
                ((instruction >> (OpCodeBits + AddrBits)) % EnvSize)
            };
        std::vector<boolean_t>::reference op1 = environment[address[0]];
        const auto op2 = environment[address[1]];
        switch(opcode) {
            case Mov:
                op1 = eax;
                break;
            case And:
                eax = op1 && op2;
                break;
            case Or:
                eax = op1 || op2;
                break;
            case Not:
                eax = !op1;
                break;
            case Xor:
                eax = (op1 && !op2) || (!op1 && op2);
                break;
            case Nand:
                eax = !(op1 && op2);
                break;
            case Nor:
                eax = !(op1 || op2);
                break;
            case Nxor:
                eax = (op1 && op2) || (!op1 && !op2);
                break;
            default:
                assert(false);
        }
    }
    const auto end = clk_t::now();
    auto us = duration_cast<microseconds>(end - start).count();
    std::cout << "Took " << (us / int(1e6)) << 's';
    us %= int(1e6);
    std::cout << (us / int(1e3)) << "ms";
    us %= int(1e3);
    std::cout << us << "us\n";
    std::cout << std::boolalpha << eax << std::endl;
    return 0;
}
On my machine:
Code:
user$ g++ foo.cpp -std=c++11 -O3
user$ ./a.out 100000000
Took 5s301ms726us
true
user$ g++ foo.cpp -std=c++11 -O3 -DFIXED_VALUE
user$ ./a.out 100000000
Took 2s435ms980us
false
Btw, about the L1/L2 cache part: it's true that a compiled program may have more opportunities to make better use of the cache (mostly because it can optimize away "parasite" evaluations/fetches which would otherwise thrash it or cause cache misses, I guess). However, this depends more on what the program does and how you've arranged the data it processes, which hasn't much to do with compilation: if your program is a clusterfuck of indirections into extremely fragmented memory, compiling won't help you much in that regard.
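As an aside, the Xor and Nxor cases in the interpreter above are just (in)equality on bools; an equivalent, more direct formulation if you prefer it:

```cpp
// Equivalent to the interpreter's Xor/Nxor cases:
// (a && !b) || (!a && b)  ==  (a != b)
// (a && b)  || (!a && !b) ==  (a == b)
inline bool eval_xor(bool a, bool b)  { return a != b; }
inline bool eval_nxor(bool a, bool b) { return a == b; }
```

A decent compiler will usually generate the same code for both forms anyway, so this is purely about readability.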
Sure, you can argue that interpreters are more likely to be such messes of indirections everywhere (though not necessarily), and that compilers may tidy up your code a bit, but only to some extent. In fact, in my "toy logic interpreter" I added the permutation solely to simulate, to some extent, fragmented memory. If I use the identity permutation instead, for instance…
Code:
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <random>

using instr_t = unsigned int;
using addr_t = unsigned int;
using size_t = std::size_t;
using boolean_t = bool;
using namespace std::chrono;
using clk_t = high_resolution_clock;

constexpr size_t OpCodeBits = 3;
constexpr instr_t OpCodeMask = instr_t(~(~instr_t(0) << OpCodeBits));
constexpr size_t AddrBits = 15;
constexpr size_t EnvSize = 1 << AddrBits;

enum op_t : instr_t {
    Mov,
    And,
    Or,
    Not,
    Xor,
    Nand,
    Nor,
    Nxor,
    LENGTH_
};
static_assert(LENGTH_ == 1 << OpCodeBits, "missing opcodes");

int main(int argc, char* argv[]) {
    std::random_device seeder;
    std::default_random_engine rng(seeder());
    std::uniform_int_distribution<instr_t> dist;
    if(argc != 2) {
        std::cerr << "usage: ./a.out n" << std::endl;
        return 1;
    }
    const size_t n = atol(argv[1]);
    std::vector<instr_t> random_code(n);
    for(instr_t& instruction : random_code)
        instruction = dist(rng);
    std::vector<std::size_t> permutation(n);
    for(size_t i = 0 ; i < n ; ++i)
        permutation[i] = i;
    //std::shuffle(std::begin(permutation), std::end(permutation), rng); // identity permutation: no shuffling
    std::vector<boolean_t> environment(EnvSize);
    boolean_t eax = false;
    const auto start = clk_t::now();
    for(size_t idx : permutation) {
        const instr_t instruction = random_code[idx];
#   ifdef FIXED_VALUE
        const op_t opcode = And;
#   else
        const op_t opcode = static_cast<op_t>(instruction & OpCodeMask);
#   endif // FIXED_VALUE
        const addr_t address[2] =
            {static_cast<addr_t>((instruction >> OpCodeBits) % EnvSize)
            ,static_cast<addr_t>
                ((instruction >> (OpCodeBits + AddrBits)) % EnvSize)
            };
        std::vector<boolean_t>::reference op1 = environment[address[0]];
        const auto op2 = environment[address[1]];
        switch(opcode) {
            case Mov:
                op1 = eax;
                break;
            case And:
                eax = op1 && op2;
                break;
            case Or:
                eax = op1 || op2;
                break;
            case Not:
                eax = !op1;
                break;
            case Xor:
                eax = (op1 && !op2) || (!op1 && op2);
                break;
            case Nand:
                eax = !(op1 && op2);
                break;
            case Nor:
                eax = !(op1 || op2);
                break;
            case Nxor:
                eax = (op1 && op2) || (!op1 && !op2);
                break;
            default:
                assert(false);
        }
    }
    const auto end = clk_t::now();
    auto us = duration_cast<microseconds>(end - start).count();
    std::cout << "Took " << (us / int(1e6)) << 's';
    us %= int(1e6);
    std::cout << (us / int(1e3)) << "ms";
    us %= int(1e3);
    std::cout << us << "us\n";
    std::cout << std::boolalpha << eax << std::endl;
    return 0;
}
Then, on my machine:
Code:
user$ g++ foo.cpp -std=c++11 -O3
user$ ./a.out 100000000
Took 1s574ms160us
true
user$ g++ foo.cpp -std=c++11 -O3 -DFIXED_VALUE
user$ ./a.out 100000000
Took 0s173ms8us
false
Decent use of the cache gives significant improvements in the first case as well. You could argue, however, that the improvements are nearly an order of magnitude greater with the optimized-away switch, but the program has been so simplified in that case that it's not so surprising (it now only performs 100000000 ands on data which fits decently in the L1 cache… well yeah, of course that's fast). This also has more to do with branch misprediction than with actual use of the cache. In fact, if I hack the random generation to only give me And opcodes:
Code:
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <random>

using instr_t = unsigned int;
using addr_t = unsigned int;
using size_t = std::size_t;
using boolean_t = bool;
using namespace std::chrono;
using clk_t = high_resolution_clock;

constexpr size_t OpCodeBits = 3;
constexpr instr_t OpCodeMask = instr_t(~(~instr_t(0) << OpCodeBits));
constexpr size_t AddrBits = 15;
constexpr size_t EnvSize = 1 << AddrBits;

enum op_t : instr_t {
    Mov,
    And,
    Or,
    Not,
    Xor,
    Nand,
    Nor,
    Nxor,
    LENGTH_
};
static_assert(LENGTH_ == 1 << OpCodeBits, "missing opcodes");

int main(int argc, char* argv[]) {
    std::random_device seeder;
    std::default_random_engine rng(seeder());
    std::uniform_int_distribution<instr_t> dist;
    if(argc != 2) {
        std::cerr << "usage: ./a.out n" << std::endl;
        return 1;
    }
    const size_t n = atol(argv[1]);
    std::vector<instr_t> random_code(n);
    for(instr_t& instruction : random_code)
        instruction = (dist(rng) & ~OpCodeMask) | And; // force every opcode to And
    std::vector<std::size_t> permutation(n);
    for(size_t i = 0 ; i < n ; ++i)
        permutation[i] = i;
    //std::shuffle(std::begin(permutation), std::end(permutation), rng); // identity permutation: no shuffling
    std::vector<boolean_t> environment(EnvSize);
    boolean_t eax = false;
    const auto start = clk_t::now();
    for(size_t idx : permutation) {
        const instr_t instruction = random_code[idx];
#   ifdef FIXED_VALUE
        const op_t opcode = And;
#   else
        const op_t opcode = static_cast<op_t>(instruction & OpCodeMask);
#   endif // FIXED_VALUE
        const addr_t address[2] =
            {static_cast<addr_t>((instruction >> OpCodeBits) % EnvSize)
            ,static_cast<addr_t>
                ((instruction >> (OpCodeBits + AddrBits)) % EnvSize)
            };
        std::vector<boolean_t>::reference op1 = environment[address[0]];
        const auto op2 = environment[address[1]];
        switch(opcode) {
            case Mov:
                op1 = eax;
                break;
            case And:
                eax = op1 && op2;
                break;
            case Or:
                eax = op1 || op2;
                break;
            case Not:
                eax = !op1;
                break;
            case Xor:
                eax = (op1 && !op2) || (!op1 && op2);
                break;
            case Nand:
                eax = !(op1 && op2);
                break;
            case Nor:
                eax = !(op1 || op2);
                break;
            case Nxor:
                eax = (op1 && op2) || (!op1 && !op2);
                break;
            default:
                assert(false);
        }
    }
    const auto end = clk_t::now();
    auto us = duration_cast<microseconds>(end - start).count();
    std::cout << "Took " << (us / int(1e6)) << 's';
    us %= int(1e6);
    std::cout << (us / int(1e3)) << "ms";
    us %= int(1e3);
    std::cout << us << "us\n";
    std::cout << std::boolalpha << eax << std::endl;
    return 0;
}
Then:
Code:
user$ g++ foo.cpp -std=c++11 -O3
user$ ./a.out 100000000
Took 0s326ms996us
false
user$ g++ foo.cpp -std=c++11 -O3 -DFIXED_VALUE
user$ ./a.out 100000000
Took 0s171ms505us
false
Now the difference isn't that great anymore…
Well, to be fair, I'd have to inspect the generated assembly meticulously to make sure the compiler hasn't done something clever, but I'm not really used to reading assembly and it takes me time. So I simply checked quickly that the differences between the two generated asm files (obtained with the same commands plus -S -o [outputfile]) involved something looking like an additional jump table (the switch) when compiled without -DFIXED_VALUE. That was the case.
Obviously, this won't map perfectly to Starmade's case. My toy programs are simplistic and have flaws, and I'm not an expert in the field (only a humble computer science student with a master's degree and some interest in compilation). But I think these are decent guesstimates of the order of magnitude of the speedups you can get, i.e. not completely negligible, but not that great either.