How to improve how fast desktops/laptops/phones execute code
To improve:
Look for new versions of programs that have more optimizations.
Turn off features (such as shadows or animations) that you do not use.
Replace old hardware parts with ones that use new processes -- such as smaller transistors (extra cores + higher gigahertz/clocks) or hardware accelerators (such as matrix multipliers).
If you use open source (FLOSS), use vectorizers for your particular CPU — SSE (Streaming SIMD Extensions) for Intel/AMD, and NEON for Arm — to recompile your programs and OS.
If you use closed source, search for versions with optimizations to your particular CPU's opcodes.
Microsoft Windows has WoW64, which allows 64-bit Windows to execute both 32-bit (x86) and 64-bit (x64) programs.
MSVC (the Microsoft Visual C++ compiler) has auto-vectorization (which improves executables to use advanced CPU opcodes).
Clang/LLVM has auto-vectorization.
Intel Integrated Performance Primitives (IPP, for Microsoft Windows and Linux/Unix) uses cpuid
to choose the most performant code path (opcodes) which your CPU allows.
Solaris also uses cpuid
to choose the most performant code path (opcodes) which your CPU allows, and has protocols which allow users to force use of advanced opcodes.
C++'s "syntactic sugar" reduces code sizes (due to classes, templates, and the STL),
plus this syntax allows more room for compilers to auto-optimize code:
https://stackoverflow.com/questions/13676172/optimization-expectations-in-the-stl
https://devblogs.microsoft.com/cppblog/algorithm-optimizations-advanced-stl-part-2/
Linux/Unix allows you to recompile the whole OS to use the most advanced opcodes.
GCC/LLVM/Clang accept -march=native
to recompile programs to use the most advanced opcodes.
Arch Linux was produced to minimize the effort for you to recompile all of your packages (what the Linux ecosystem calls programs) to use your CPU’s most advanced opcodes.
New versions of Linux come with Multiarch (which is similar to Microsoft Windows’ WoW64).
New versions of GCC/LLVM/Clang use cpuid
to produce multiple code paths and choose the path which has your CPU’s most advanced opcodes.
Auto-parallelization produces threaded (multicore) code: it searches for code with lots of loops and distributes those loads across all local CPUs or GPUs:
https://www.intel.com/content/www/us/en/developer/articles/technical/automatic-parallelization-with-intel-compilers.html “Adding the -Qparallel (Windows*) or -parallel (Linux* or macOS*) option to the compile command is the only action required of the programmer. However, successful parallelization is subject to certain conditions”
https://gcc.gnu.org/wiki/AutoParInGCC (gcc or g++) "You can trigger it by 2 flags -floop-parallelize-all -ftree-parallelize-loops=4"
https://polly.llvm.org/docs/UsingPollyWithClang.html "To automatically detect parallel loops and generate OpenMP code for them you also need to add -mllvm -polly-parallel -lgomp to your CFLAGS. clang -O3 -mllvm -polly -mllvm -polly-parallel -lgomp file.c"
https://link.springer.com/chapter/10.1007/978-3-030-64616-5_38 "LLVM Based Parallelization of C Programs for GPU"
https://stackoverflow.com/questions/41553533/auto-parallelization-of-simple-do-loop-memory-reference-too-complex (auto-parallelization of a Fortran loop).
TensorFlow's "MapReduce" (https://www.tensorflow.org/federated/api_docs/python/tff/backends/mapreduce) distributes loads across clouds of CPUs/GPUs.
The sort of SW (programs) which gets the most benefits from MapReduce is artificial neural tissue, such as: