Olxinos
Megacrafter127 Well, the two snippets above will give you identical or nearly identical results with a good optimizing compiler (template<gatetypeenum type> forces type to be a constant expression, hence known at compile time, so the compiler can, and usually will, optimize the switch away), but I assume you meant something along the lines of:
Code:
inline bool evalStatusOnUpdate(bool newstate, int activeInputCount, gatetypeenum type) {
    // AND, OR, etc. are values of the enum gatetypeenum;
    // connectedInputCount is assumed to be a member of the enclosing gate
    switch(type) {
        case AND: return activeInputCount == connectedInputCount;
        case OR: return activeInputCount > 0;
        //...
        default: error();
    }
}
In that case, you essentially remove a control structure. That can lead to interesting improvements if the processor couldn't accurately predict the branch/jump.
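For reference, the template<gatetypeenum type> variant I alluded to above isn't actually shown anywhere; a minimal sketch of what I mean (same names as the snippet, with the connected-input count passed explicitly so the sketch is self-contained):

```cpp
#include <cstdlib>

enum gatetypeenum { AND, OR /* , ... */ };

// Because `type` is a template parameter, it is a compile-time
// constant: the compiler resolves the switch during instantiation,
// and each specialization keeps only the code of its own case.
template<gatetypeenum type>
inline bool evalStatusOnUpdate(int activeInputCount, int connectedInputCount) {
    switch(type) {
        case AND: return activeInputCount == connectedInputCount;
        case OR:  return activeInputCount > 0;
        default:  std::abort(); // unreachable for the opcodes shown
    }
}
```

Calling `evalStatusOnUpdate<AND>(3, 3)` then compiles down to a plain comparison, with no dispatch left at run time.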
See for instance: "Why is it faster to process a sorted array than an unsorted array?", or this small toy program:
Code:
#include <vector>
#include <chrono>
#include <random>
#include <iostream>
#include <cstdlib>

using size_t = std::size_t;
using namespace std::chrono;
using clk_t = high_resolution_clock;

int main(int argc, char* argv[]) {
    std::random_device seeder;
    std::default_random_engine rng(seeder());
    std::uniform_int_distribution<unsigned int> dist;
    if(argc != 2) {
        std::cerr << "usage: ./a.out n" << std::endl;
        return 1;
    }
    const size_t n = atol(argv[1]);
    std::vector<unsigned int> random_table(n);
    for(auto& cell : random_table)
        cell = dist(rng);
    const auto start = clk_t::now();
    unsigned long counter = 0;
    for(auto cell : random_table) {
#   ifdef WITH_CONDITIONAL
        if(cell & 0x1)
            counter += 2;
#   else
        counter += (cell & 0x1) << 1;
#   endif // WITH_CONDITIONAL
    }
    const auto end = clk_t::now();
    auto us = duration_cast<microseconds>(end - start).count();
    std::cout << "Took " << (us / int(1e6)) << 's';
    us %= int(1e6);
    std::cout << (us / int(1e3)) << "ms";
    us %= int(1e3);
    std::cout << us << "us\n";
    return static_cast<int>(counter); // returned so the loop isn't optimized away
}
On my machine:
Code:
user$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.38)
Target: x86_64-apple-darwin16.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
user$ g++ foobar.cpp -std=c++11 -DWITH_CONDITIONAL -O3
user$ ./a.out 1000000000
Took 4s55ms492us
user$ g++ foobar.cpp -std=c++11 -O3
user$ ./a.out 1000000000
Took 1s178ms76us
(yeah, "g++" here is a symlink to Apple clang rather than GNU g++; Apple did that when they switched from g++ to clang a couple of years ago)
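For comparison with the linked question: the branchy version becomes fast again if you sort the data first, because the branch turns predictable. A quick illustrative sketch (not part of the benchmark above; the function name is mine):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Same branchy pattern as the Stack Overflow question: sum the
// elements that are >= 128. On random data the branch mispredicts
// roughly half the time; sorting first makes it almost perfectly
// predictable, which is where the famous speedup comes from.
std::uint64_t sum_large(std::vector<unsigned> data, bool sort_first) {
    if(sort_first)
        std::sort(data.begin(), data.end());
    std::uint64_t total = 0;
    for(unsigned cell : data)
        if(cell >= 128)   // the branch whose predictability matters
            total += cell;
    return total;
}
```

Both variants compute the same sum; only the branch-prediction behaviour (and hence the timing) differs.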
That's actually what I was referring to when I talked about mispredictions a couple of posts back. However, setting aside this possible optimisation, I don't see how pre-compilation would help (I disagree with what GRHayes said about the L1/L2 cache; more on that later).
The improvement looks somewhat interesting in the examples above, but that's really because everything else was pretty cheap. I don't think this would be the case if you were evaluating Starmade logic gates' states. Actually, let's take the code I posted a couple of posts ago and replace opcode with a fixed value:
Code:
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <random>

using instr_t = unsigned int;
using addr_t = unsigned int;
using size_t = std::size_t;
using boolean_t = bool;
using namespace std::chrono;
using clk_t = high_resolution_clock;

constexpr size_t OpCodeBits = 3;
constexpr instr_t OpCodeMask = instr_t(~(~instr_t(0) << OpCodeBits));
constexpr size_t AddrBits = 15;
constexpr size_t EnvSize = 1 << AddrBits;

enum op_t : instr_t {
    Mov,
    And,
    Or,
    Not,
    Xor,
    Nand,
    Nor,
    Nxor,
    LENGTH_
};
static_assert(LENGTH_ == 1 << OpCodeBits, "missing opcodes");

int main(int argc, char* argv[]) {
    std::random_device seeder;
    std::default_random_engine rng(seeder());
    std::uniform_int_distribution<instr_t> dist;
    if(argc != 2) {
        std::cerr << "usage: ./a.out n" << std::endl;
        return 1;
    }
    const size_t n = atol(argv[1]);
    std::vector<instr_t> random_code(n);
    for(instr_t& instruction : random_code)
        instruction = dist(rng);
    std::vector<std::size_t> permutation(n);
    for(size_t i = 0 ; i < n ; ++i)
        permutation[i] = i;
    std::shuffle(std::begin(permutation), std::end(permutation), rng);
    std::vector<boolean_t> environment(EnvSize);
    boolean_t eax = false;
    const auto start = clk_t::now();
    for(size_t idx : permutation) {
        const instr_t instruction = random_code[idx];
#   ifdef FIXED_VALUE
        const op_t opcode = And;
#   else
        const op_t opcode = static_cast<op_t>(instruction & OpCodeMask);
#   endif // FIXED_VALUE
        const addr_t address[2] =
            {static_cast<addr_t>((instruction >> OpCodeBits) % EnvSize)
            ,static_cast<addr_t>
                ((instruction >> (OpCodeBits + AddrBits)) % EnvSize)
            };
        std::vector<boolean_t>::reference op1 = environment[address[0]];
        const auto op2 = environment[address[1]];
        switch(opcode) {
            case Mov:
                op1 = eax;
                break;
            case And:
                eax = op1 && op2;
                break;
            case Or:
                eax = op1 || op2;
                break;
            case Not:
                eax = !op1;
                break;
            case Xor:
                eax = (op1 && !op2) || (!op1 && op2);
                break;
            case Nand:
                eax = !(op1 && op2);
                break;
            case Nor:
                eax = !(op1 || op2);
                break;
            case Nxor:
                eax = (op1 && op2) || (!op1 && !op2);
                break;
            default:
                assert(false);
        }
    }
    const auto end = clk_t::now();
    auto us = duration_cast<microseconds>(end - start).count();
    std::cout << "Took " << (us / int(1e6)) << 's';
    us %= int(1e6);
    std::cout << (us / int(1e3)) << "ms";
    us %= int(1e3);
    std::cout << us << "us\n";
    std::cout << std::boolalpha << eax << std::endl;
    return 0;
}
On my machine:
Code:
user$ g++ foo.cpp -std=c++11 -O3
user$ ./a.out 100000000
Took 5s301ms726us
true
user$ g++ foo.cpp -std=c++11 -O3 -DFIXED_VALUE
user$ ./a.out 100000000
Took 2s435ms980us
false
Btw, about the L1/L2 cache part: it's true that a compiled program may have more opportunities to make better use of the cache (mostly because it can optimize away "parasite" evaluations/fetches which would otherwise thrash it or cause cache misses, I guess). However, this depends more on what the program does and how you've arranged the data it processes, which hasn't much to do with compilation: if your program is a clusterfuck of indirections into extremely fragmented memory, compiling won't help you much in that regard.
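As an aside, the Xor and Nxor cases in the interpreter above are just (in)equality on bools; an equivalent, more direct formulation if you prefer it:

```cpp
// Equivalent to the interpreter's Xor/Nxor cases:
// (a && !b) || (!a && b)  ==  (a != b)
// (a && b)  || (!a && !b) ==  (a == b)
inline bool eval_xor(bool a, bool b)  { return a != b; }
inline bool eval_nxor(bool a, bool b) { return a == b; }
```

A decent compiler will usually generate the same code for both forms anyway, so this is purely about readability.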
Sure, you can argue that interpreters are more likely to be such messes of indirections everywhere (though not necessarily), and that compilers may tidy up your code a bit, but only to some extent. In fact, in my "toy logic interpreter" I added the permutation solely to simulate, to some extent, fragmented memory. If I use the identity permutation instead, for instance…
Code:
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <random>

using instr_t = unsigned int;
using addr_t = unsigned int;
using size_t = std::size_t;
using boolean_t = bool;
using namespace std::chrono;
using clk_t = high_resolution_clock;

constexpr size_t OpCodeBits = 3;
constexpr instr_t OpCodeMask = instr_t(~(~instr_t(0) << OpCodeBits));
constexpr size_t AddrBits = 15;
constexpr size_t EnvSize = 1 << AddrBits;

enum op_t : instr_t {
    Mov,
    And,
    Or,
    Not,
    Xor,
    Nand,
    Nor,
    Nxor,
    LENGTH_
};
static_assert(LENGTH_ == 1 << OpCodeBits, "missing opcodes");

int main(int argc, char* argv[]) {
    std::random_device seeder;
    std::default_random_engine rng(seeder());
    std::uniform_int_distribution<instr_t> dist;
    if(argc != 2) {
        std::cerr << "usage: ./a.out n" << std::endl;
        return 1;
    }
    const size_t n = atol(argv[1]);
    std::vector<instr_t> random_code(n);
    for(instr_t& instruction : random_code)
        instruction = dist(rng);
    std::vector<std::size_t> permutation(n);
    for(size_t i = 0 ; i < n ; ++i)
        permutation[i] = i;
    //std::shuffle(std::begin(permutation), std::end(permutation), rng); // identity permutation: no shuffling
    std::vector<boolean_t> environment(EnvSize);
    boolean_t eax = false;
    const auto start = clk_t::now();
    for(size_t idx : permutation) {
        const instr_t instruction = random_code[idx];
#   ifdef FIXED_VALUE
        const op_t opcode = And;
#   else
        const op_t opcode = static_cast<op_t>(instruction & OpCodeMask);
#   endif // FIXED_VALUE
        const addr_t address[2] =
            {static_cast<addr_t>((instruction >> OpCodeBits) % EnvSize)
            ,static_cast<addr_t>
                ((instruction >> (OpCodeBits + AddrBits)) % EnvSize)
            };
        std::vector<boolean_t>::reference op1 = environment[address[0]];
        const auto op2 = environment[address[1]];
        switch(opcode) {
            case Mov:
                op1 = eax;
                break;
            case And:
                eax = op1 && op2;
                break;
            case Or:
                eax = op1 || op2;
                break;
            case Not:
                eax = !op1;
                break;
            case Xor:
                eax = (op1 && !op2) || (!op1 && op2);
                break;
            case Nand:
                eax = !(op1 && op2);
                break;
            case Nor:
                eax = !(op1 || op2);
                break;
            case Nxor:
                eax = (op1 && op2) || (!op1 && !op2);
                break;
            default:
                assert(false);
        }
    }
    const auto end = clk_t::now();
    auto us = duration_cast<microseconds>(end - start).count();
    std::cout << "Took " << (us / int(1e6)) << 's';
    us %= int(1e6);
    std::cout << (us / int(1e3)) << "ms";
    us %= int(1e3);
    std::cout << us << "us\n";
    std::cout << std::boolalpha << eax << std::endl;
    return 0;
}
Then, on my machine:
Code:
user$ g++ foo.cpp -std=c++11 -O3
user$ ./a.out 100000000
Took 1s574ms160us
true
user$ g++ foo.cpp -std=c++11 -O3 -DFIXED_VALUE
user$ ./a.out 100000000
Took 0s173ms8us
false
Decent use of the cache gives significant improvements in the first case as well. You could argue, however, that the improvements are nearly an order of magnitude greater with the optimized-away switch, but the program has been so simplified in that case that it's not so surprising (it now only performs 100000000 ands on data which fits decently in the L1 cache… well yeah, of course that's fast). This also has more to do with branch misprediction than with actual use of the cache. In fact, if I hack the random generation to only give me And opcodes:
Code:
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <random>

using instr_t = unsigned int;
using addr_t = unsigned int;
using size_t = std::size_t;
using boolean_t = bool;
using namespace std::chrono;
using clk_t = high_resolution_clock;

constexpr size_t OpCodeBits = 3;
constexpr instr_t OpCodeMask = instr_t(~(~instr_t(0) << OpCodeBits));
constexpr size_t AddrBits = 15;
constexpr size_t EnvSize = 1 << AddrBits;

enum op_t : instr_t {
    Mov,
    And,
    Or,
    Not,
    Xor,
    Nand,
    Nor,
    Nxor,
    LENGTH_
};
static_assert(LENGTH_ == 1 << OpCodeBits, "missing opcodes");

int main(int argc, char* argv[]) {
    std::random_device seeder;
    std::default_random_engine rng(seeder());
    std::uniform_int_distribution<instr_t> dist;
    if(argc != 2) {
        std::cerr << "usage: ./a.out n" << std::endl;
        return 1;
    }
    const size_t n = atol(argv[1]);
    std::vector<instr_t> random_code(n);
    for(instr_t& instruction : random_code)
        instruction = (dist(rng) & ~OpCodeMask) | And; // force every opcode to And
    std::vector<std::size_t> permutation(n);
    for(size_t i = 0 ; i < n ; ++i)
        permutation[i] = i;
    //std::shuffle(std::begin(permutation), std::end(permutation), rng); // identity permutation: no shuffling
    std::vector<boolean_t> environment(EnvSize);
    boolean_t eax = false;
    const auto start = clk_t::now();
    for(size_t idx : permutation) {
        const instr_t instruction = random_code[idx];
#   ifdef FIXED_VALUE
        const op_t opcode = And;
#   else
        const op_t opcode = static_cast<op_t>(instruction & OpCodeMask);
#   endif // FIXED_VALUE
        const addr_t address[2] =
            {static_cast<addr_t>((instruction >> OpCodeBits) % EnvSize)
            ,static_cast<addr_t>
                ((instruction >> (OpCodeBits + AddrBits)) % EnvSize)
            };
        std::vector<boolean_t>::reference op1 = environment[address[0]];
        const auto op2 = environment[address[1]];
        switch(opcode) {
            case Mov:
                op1 = eax;
                break;
            case And:
                eax = op1 && op2;
                break;
            case Or:
                eax = op1 || op2;
                break;
            case Not:
                eax = !op1;
                break;
            case Xor:
                eax = (op1 && !op2) || (!op1 && op2);
                break;
            case Nand:
                eax = !(op1 && op2);
                break;
            case Nor:
                eax = !(op1 || op2);
                break;
            case Nxor:
                eax = (op1 && op2) || (!op1 && !op2);
                break;
            default:
                assert(false);
        }
    }
    const auto end = clk_t::now();
    auto us = duration_cast<microseconds>(end - start).count();
    std::cout << "Took " << (us / int(1e6)) << 's';
    us %= int(1e6);
    std::cout << (us / int(1e3)) << "ms";
    us %= int(1e3);
    std::cout << us << "us\n";
    std::cout << std::boolalpha << eax << std::endl;
    return 0;
}
Then:
Code:
user$ g++ foo.cpp -std=c++11 -O3
user$ ./a.out 100000000
Took 0s326ms996us
false
user$ g++ foo.cpp -std=c++11 -O3 -DFIXED_VALUE
user$ ./a.out 100000000
Took 0s171ms505us
false
Now the difference isn't that great anymore…
Well, to be fair, I'd have to inspect the generated assembly meticulously to make sure the compiler hasn't done something clever, but I'm not really used to reading assembly and it takes me time. So I simply checked quickly that the differences between the two generated asm files (obtained with the same commands plus -S -o [outputfile]) involved something looking like an additional jump table (the switch) when compiled without -DFIXED_VALUE. That was the case.
Obviously, this won't map perfectly to Starmade's case. My toy programs are simplistic and have flaws, and I'm not an expert in the field (only a humble computer science student with a master's degree and some interest in compilation). But I think these are decent guesstimates of the order of magnitude of the speedups you can get, i.e. not completely negligible, but not that great either.