SHORT Overview of arithmetic and main memory performance

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PWR0  PWR_PKG_ENERGY
PWR3  PWR_DRAM_ENERGY
PMC0  FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE
PMC1  FP_ARITH_INST_RETIRED_SCALAR_SINGLE
PMC2  FP_FLOPS_RETIRED_FP32
MBOX0C0 CAS_COUNT_SCH0_RD
MBOX0C1 CAS_COUNT_SCH0_WR
MBOX0C2 CAS_COUNT_SCH1_RD
MBOX0C3 CAS_COUNT_SCH1_WR
MBOX1C0 CAS_COUNT_SCH0_RD
MBOX1C1 CAS_COUNT_SCH0_WR
MBOX1C2 CAS_COUNT_SCH1_RD
MBOX1C3 CAS_COUNT_SCH1_WR
MBOX2C0 CAS_COUNT_SCH0_RD
MBOX2C1 CAS_COUNT_SCH0_WR
MBOX2C2 CAS_COUNT_SCH1_RD
MBOX2C3 CAS_COUNT_SCH1_WR
MBOX3C0 CAS_COUNT_SCH0_RD
MBOX3C1 CAS_COUNT_SCH0_WR
MBOX3C2 CAS_COUNT_SCH1_RD
MBOX3C3 CAS_COUNT_SCH1_WR
MBOX4C0 CAS_COUNT_SCH0_RD
MBOX4C1 CAS_COUNT_SCH0_WR
MBOX4C2 CAS_COUNT_SCH1_RD
MBOX4C3 CAS_COUNT_SCH1_WR
MBOX5C0 CAS_COUNT_SCH0_RD
MBOX5C1 CAS_COUNT_SCH0_WR
MBOX5C2 CAS_COUNT_SCH1_RD
MBOX5C3 CAS_COUNT_SCH1_WR
MBOX6C0 CAS_COUNT_SCH0_RD
MBOX6C1 CAS_COUNT_SCH0_WR
MBOX6C2 CAS_COUNT_SCH1_RD
MBOX6C3 CAS_COUNT_SCH1_WR
MBOX7C0 CAS_COUNT_SCH0_RD
MBOX7C1 CAS_COUNT_SCH0_WR
MBOX7C2 CAS_COUNT_SCH1_RD
MBOX7C3 CAS_COUNT_SCH1_WR
MBOX8C0 CAS_COUNT_SCH0_RD
MBOX8C1 CAS_COUNT_SCH0_WR
MBOX8C2 CAS_COUNT_SCH1_RD
MBOX8C3 CAS_COUNT_SCH1_WR
MBOX9C0 CAS_COUNT_SCH0_RD
MBOX9C1 CAS_COUNT_SCH0_WR
MBOX9C2 CAS_COUNT_SCH1_RD
MBOX9C3 CAS_COUNT_SCH1_WR
MBOX10C0 CAS_COUNT_SCH0_RD
MBOX10C1 CAS_COUNT_SCH0_WR
MBOX10C2 CAS_COUNT_SCH1_RD
MBOX10C3 CAS_COUNT_SCH1_WR
MBOX11C0 CAS_COUNT_SCH0_RD
MBOX11C1 CAS_COUNT_SCH0_WR
MBOX11C2 CAS_COUNT_SCH1_RD
MBOX11C3 CAS_COUNT_SCH1_WR
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
CPI  FIXC1/FIXC0
Energy [J]  PWR0
Power [W] PWR0/time
Energy DRAM [J]  PWR3
Power DRAM [W] PWR3/time
SP [MFLOP/s]  1.0E-06*(PMC2)/time
Packed [MUOPS/s]   1.0E-06*(PMC0+((PMC2-(PMC0*4+PMC1))/8))/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX0C2+MBOX1C0+MBOX1C2+MBOX2C0+MBOX2C2+MBOX3C0+MBOX3C2+MBOX4C0+MBOX4C2+MBOX5C0+MBOX5C2+MBOX6C0+MBOX6C2+MBOX7C0+MBOX7C2+MBOX8C0+MBOX8C2+MBOX9C0+MBOX9C2+MBOX10C0+MBOX10C2+MBOX11C0+MBOX11C2)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX0C2+MBOX1C0+MBOX1C2+MBOX2C0+MBOX2C2+MBOX3C0+MBOX3C2+MBOX4C0+MBOX4C2+MBOX5C0+MBOX5C2+MBOX6C0+MBOX6C2+MBOX7C0+MBOX7C2+MBOX8C0+MBOX8C2+MBOX9C0+MBOX9C2+MBOX10C0+MBOX10C2+MBOX11C0+MBOX11C2)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX0C3+MBOX1C1+MBOX1C3+MBOX2C1+MBOX2C3+MBOX3C1+MBOX3C3+MBOX4C1+MBOX4C3+MBOX5C1+MBOX5C3+MBOX6C1+MBOX6C3+MBOX7C1+MBOX7C3+MBOX8C1+MBOX8C3+MBOX9C1+MBOX9C3+MBOX10C1+MBOX10C3+MBOX11C1+MBOX11C3)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX0C3+MBOX1C1+MBOX1C3+MBOX2C1+MBOX2C3+MBOX3C1+MBOX3C3+MBOX4C1+MBOX4C3+MBOX5C1+MBOX5C3+MBOX6C1+MBOX6C3+MBOX7C1+MBOX7C3+MBOX8C1+MBOX8C3+MBOX9C1+MBOX9C3+MBOX10C1+MBOX10C3+MBOX11C1+MBOX11C3)*64.0
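Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX0C2+MBOX1C0+MBOX1C2+MBOX2C0+MBOX2C2+MBOX3C0+MBOX3C2+MBOX4C0+MBOX4C2+MBOX5C0+MBOX5C2+MBOX6C0+MBOX6C2+MBOX7C0+MBOX7C2+MBOX8C0+MBOX8C2+MBOX9C0+MBOX9C2+MBOX10C0+MBOX10C2+MBOX11C0+MBOX11C2+MBOX0C1+MBOX0C3+MBOX1C1+MBOX1C3+MBOX2C1+MBOX2C3+MBOX3C1+MBOX3C3+MBOX4C1+MBOX4C3+MBOX5C1+MBOX5C3+MBOX6C1+MBOX6C3+MBOX7C1+MBOX7C3+MBOX8C1+MBOX8C3+MBOX9C1+MBOX9C3+MBOX10C1+MBOX10C3+MBOX11C1+MBOX11C3)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX0C2+MBOX1C0+MBOX1C2+MBOX2C0+MBOX2C2+MBOX3C0+MBOX3C2+MBOX4C0+MBOX4C2+MBOX5C0+MBOX5C2+MBOX6C0+MBOX6C2+MBOX7C0+MBOX7C2+MBOX8C0+MBOX8C2+MBOX9C0+MBOX9C2+MBOX10C0+MBOX10C2+MBOX11C0+MBOX11C2+MBOX0C1+MBOX0C3+MBOX1C1+MBOX1C3+MBOX2C1+MBOX2C3+MBOX3C1+MBOX3C3+MBOX4C1+MBOX4C3+MBOX5C1+MBOX5C3+MBOX6C1+MBOX6C3+MBOX7C1+MBOX7C3+MBOX8C1+MBOX8C3+MBOX9C1+MBOX9C3+MBOX10C1+MBOX10C3+MBOX11C1+MBOX11C3)*64.0
Operational intensity [FLOP/Byte] (PMC2)/((MBOX0C0+MBOX0C2+MBOX1C0+MBOX1C2+MBOX2C0+MBOX2C2+MBOX3C0+MBOX3C2+MBOX4C0+MBOX4C2+MBOX5C0+MBOX5C2+MBOX6C0+MBOX6C2+MBOX7C0+MBOX7C2+MBOX8C0+MBOX8C2+MBOX9C0+MBOX9C2+MBOX10C0+MBOX10C2+MBOX11C0+MBOX11C2+MBOX0C1+MBOX0C3+MBOX1C1+MBOX1C3+MBOX2C1+MBOX2C3+MBOX3C1+MBOX3C3+MBOX4C1+MBOX4C3+MBOX5C1+MBOX5C3+MBOX6C1+MBOX6C3+MBOX7C1+MBOX7C3+MBOX8C1+MBOX8C3+MBOX9C1+MBOX9C3+MBOX10C1+MBOX10C3+MBOX11C1+MBOX11C3)*64.0)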
Vectorization ratio [%] 100*(PMC0+((PMC2-(PMC0*4+PMC1))/8))/(PMC0+PMC1+((PMC2-(PMC0*4+PMC1))/8))


LONG
Formulas:
Power [W] = PWR_PKG_ENERGY/runtime
Power DRAM [W] = PWR_DRAM_ENERGY/runtime
SP [MFLOP/s] = 1.0E-06*FP_FLOPS_RETIRED_FP32/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+((FP_FLOPS_RETIRED_FP32-(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE))/8))/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_SINGLE/runtime
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_SCH0_RD + CAS_COUNT_SCH1_RD))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_SCH0_RD + CAS_COUNT_SCH1_RD))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_SCH0_WR + CAS_COUNT_SCH1_WR))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_SCH0_WR + CAS_COUNT_SCH1_WR))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_SCH0_RD + CAS_COUNT_SCH1_RD)+SUM(CAS_COUNT_SCH0_WR + CAS_COUNT_SCH1_WR))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_SCH0_RD + CAS_COUNT_SCH1_RD)+SUM(CAS_COUNT_SCH0_WR + CAS_COUNT_SCH1_WR))*64.0
Operational intensity [FLOP/Byte] = FP_FLOPS_RETIRED_FP32/((SUM(CAS_COUNT_SCH0_RD + CAS_COUNT_SCH1_RD)+SUM(CAS_COUNT_SCH0_WR + CAS_COUNT_SCH1_WR))*64.0)
Vectorization ratio [%] = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+((FP_FLOPS_RETIRED_FP32-(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE))/8))/(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_SCALAR_SINGLE+((FP_FLOPS_RETIRED_FP32-(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE))/8))
--
Profiling group to measure memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it can only be measured on a
per-socket basis. It also outputs the total data volume transferred from main
memory and the SSE scalar and packed single-precision FLOP rates. On Intel
Sierra Forest, it is not possible to count the SP AVX instructions directly, so
this group takes the single-precision FP operation count (FP_FLOPS_RETIRED_FP32),
subtracts the scalar and SSE FP operations, and divides the remainder by 8
(eight SP elements per 256-bit AVX instruction) to derive the SP AVX
instruction count.
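
The derivation used by the Packed and Vectorization ratio metrics can be
sketched outside of LIKWID as follows. This is a minimal illustration, not part
of the group definition: the function name derive_sp_metrics and its argument
names are hypothetical, and it assumes, as the METRICS formulas do, that every
FLOP not explained by scalar or SSE instructions comes from a 256-bit AVX
instruction with 8 SP elements.

```python
def derive_sp_metrics(packed_128b, scalar, flops_fp32, runtime_s):
    """Recompute the SP metrics of this group from raw counter values.

    packed_128b: FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE (PMC0)
    scalar:      FP_ARITH_INST_RETIRED_SCALAR_SINGLE      (PMC1)
    flops_fp32:  FP_FLOPS_RETIRED_FP32                    (PMC2)
    """
    # FLOPs left over after scalar (1 FLOP) and SSE packed (4 FLOPs) instructions
    avx_flops = flops_fp32 - (packed_128b * 4 + scalar)
    # Assumption: the remainder is 256-bit AVX work, 8 SP elements per instruction
    avx_instr = avx_flops / 8
    packed = packed_128b + avx_instr
    total = packed + scalar
    return {
        "sp_mflops": 1.0e-6 * flops_fp32 / runtime_s,
        "packed_muops": 1.0e-6 * packed / runtime_s,
        "scalar_muops": 1.0e-6 * scalar / runtime_s,
        "vectorization_pct": 100.0 * packed / total if total else 0.0,
    }


# Made-up counter values, for illustration only
metrics = derive_sp_metrics(
    packed_128b=1_000_000, scalar=2_000_000, flops_fp32=14_000_000, runtime_s=2.0
)
print(metrics)
```

With these made-up values, 8,000,000 FLOPs remain after subtracting the scalar
and SSE contributions, which corresponds to 1,000,000 derived AVX instructions.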
The operational intensity is calculated using the FP values of the cores and the
memory data volume of the whole socket. It is *not* correct for the whole node!
To calculate it manually for the whole node, compute
'SP [MFLOP/s] SUM'/'Memory bandwidth [MBytes/s] SUM'
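
Operational intensity is FLOP per byte, so a node-wide value is the summed FLOP
rate divided by the summed memory bandwidth. A minimal sketch of that manual
calculation, assuming the per-core 'SP [MFLOP/s]' values and the per-socket
'Memory bandwidth [MBytes/s]' values have already been copied out of the LIKWID
output (the function name and the sample numbers are hypothetical):

```python
def node_operational_intensity(sp_mflops_per_core, mem_bw_mbytes_per_socket):
    """Return node-wide FLOP/Byte from per-core FLOP rates and per-socket bandwidths."""
    total_mflops = sum(sp_mflops_per_core)          # 'SP [MFLOP/s] SUM'
    total_mbytes = sum(mem_bw_mbytes_per_socket)    # 'Memory bandwidth [MBytes/s] SUM'
    if total_mbytes == 0:
        raise ValueError("no memory traffic measured")
    # MFLOP/s divided by MBytes/s: the 1.0E-06 scaling and the time cancel out
    return total_mflops / total_mbytes


# Made-up values: four cores spread over two sockets
oi = node_operational_intensity(
    sp_mflops_per_core=[1200.0, 800.0, 1000.0, 1000.0],
    mem_bw_mbytes_per_socket=[10000.0, 6000.0],
)
print(oi)  # 0.25
```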
