[RFC] Use of host vector operations in host helper functions
Richard Henderson
2014-08-28 15:40:27 UTC
Most of the time, guest vector operations are rare enough that it doesn't
really matter that we implement them with a loop around integer operations.

But for target-alpha, there's one vector comparison operation that appears in
every guest string operation, and is used heavily enough that it's in the top
10 functions in the profile: cmpbge (compare bytes greater or equal).

I did some experiments, where I rewrote the function using gcc's "generic"
vector types and builtin operations. Irritatingly, gcc won't use a wider
vector insn to implement a narrower operation, so I needed to widen by hand in
order to get vectorization for SSE2, but:

----------------------------------------------------------------------
diff --git a/target-alpha/int_helper.c b/target-alpha/int_helper.c
index c023fa1..ec71c17 100644
--- a/target-alpha/int_helper.c
+++ b/target-alpha/int_helper.c
@@ -60,6 +60,42 @@ uint64_t helper_zap(uint64_t val, uint64_t mask)

uint64_t helper_cmpbge(uint64_t op1, uint64_t op2)
{
+#if 1
+ uint64_t r;
+
+ /* The cmpbge instruction is heavily used in the implementation of
+ every string function on Alpha. We can do much better than either
+ the default loop below, or even an unrolled version by using the
+ native vector support. */
+ {
+ typedef uint64_t Q __attribute__((vector_size(16)));
+ typedef uint8_t B __attribute__((vector_size(16)));
+
+ Q q1 = (Q){ op1, 0 };
+ Q q2 = (Q){ op2, 0 };
+
+ q1 = (Q)((B)q1 >= (B)q2);
+
+ r = q1[0];
+ }
+
+ /* Select only one bit from each byte. */
+ r &= 0x0101010101010101;
+
+ /* Collect the bits into the bottom byte. */
+ /* .......A.......B.......C.......D.......E.......F.......G.......H */
+ r |= r >> (8 - 1);
+
+ /* .......A......AB......BC......CD......DE......EF......FG......GH */
+ r |= r >> (16 - 2);
+
+ /* .......A......AB.....ABC....ABCD....BCDE....CDEF....DEFG....EFGH */
+ r |= r >> (32 - 4);
+
+ /* .......A......AB.....ABC....ABCD...ABCDE..ABCDEF.ABCDEFGABCDEFGH */
+ /* Return only the low 8 bits. */
+ return r & 0xff;
+#else
uint8_t opa, opb, res;
int i;

@@ -72,6 +108,7 @@ uint64_t helper_cmpbge(uint64_t op1, uint64_t op2)
}
}
return res;
+#endif
}

uint64_t helper_minub8(uint64_t op1, uint64_t op2)
----------------------------------------------------------------------

allows very good optimization on x86_64:

0000000000000120 <helper_cmpbge>:
120: 48 89 7c 24 e8 mov %rdi,-0x18(%rsp)
125: 48 b8 01 01 01 01 01 movabs $0x101010101010101,%rax
12c: 01 01 01
12f: f3 0f 7e 5c 24 e8 movq -0x18(%rsp),%xmm3
135: 48 89 74 24 e8 mov %rsi,-0x18(%rsp)
13a: f3 0f 7e 64 24 e8 movq -0x18(%rsp),%xmm4
140: f3 0f 7e c3 movq %xmm3,%xmm0
144: f3 0f 7e cc movq %xmm4,%xmm1
148: 66 0f 6f d1 movdqa %xmm1,%xmm2
14c: 66 0f d8 d0 psubusb %xmm0,%xmm2
150: 66 0f ef c0 pxor %xmm0,%xmm0
154: 66 0f 74 c2 pcmpeqb %xmm2,%xmm0
158: 66 0f 7f 44 24 e8 movdqa %xmm0,-0x18(%rsp)
15e: 48 8b 54 24 e8 mov -0x18(%rsp),%rdx
163: 48 21 c2 and %rax,%rdx
166: 48 89 d0 mov %rdx,%rax
169: 48 c1 e8 07 shr $0x7,%rax
16d: 48 09 d0 or %rdx,%rax
170: 48 89 c2 mov %rax,%rdx
173: 48 c1 ea 0e shr $0xe,%rdx
177: 48 09 c2 or %rax,%rdx
17a: 48 89 d0 mov %rdx,%rax
17d: 48 c1 e8 1c shr $0x1c,%rax
181: 48 09 d0 or %rdx,%rax
184: 0f b6 c0 movzbl %al,%eax
187: c3 retq

which is just about as good as you could hope for (modulo two extra movq insns).

Profiling a (guest) compilation of glibc, helper_cmpbge is reduced from 3% to
0.8% of emulation time, and from 7th to 11th in the ranking.

GCC doesn't do a half-bad job on other hosts either:

aarch64:
b4: 4f000400 movi v0.4s, #0x0
b8: 4ea01c01 mov v1.16b, v0.16b
bc: 4e081c00 mov v0.d[0], x0
c0: 4e081c21 mov v1.d[0], x1
c4: 6e213c00 cmhs v0.16b, v0.16b, v1.16b
c8: 4e083c00 mov x0, v0.d[0]
cc: 9200c000 and x0, x0, #0x101010101010101
d0: aa401c00 orr x0, x0, x0, lsr #7
d4: aa403800 orr x0, x0, x0, lsr #14
d8: aa407000 orr x0, x0, x0, lsr #28
dc: 53001c00 uxtb w0, w0
e0: d65f03c0 ret

Of course aarch64 *does* have an 8-byte vector size that gcc knows how to use.
If I adjust the patch above to use it, only the first two insns are eliminated
-- surely not a measurable difference.
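
For reference, the 8-byte variant only requires narrowing the two vector types;
a rough, untested sketch of the inner block:

    typedef uint64_t Q __attribute__((vector_size(8)));
    typedef uint8_t B __attribute__((vector_size(8)));

    Q q1 = (Q){ op1 };
    Q q2 = (Q){ op2 };

    q1 = (Q)((B)q1 >= (B)q2);
    r = q1[0];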

power7:
...
vcmpgtub 13,0,1
vcmpequb 0,0,1
xxlor 32,45,32
...


But I guess the larger question here is: how much of this should we accept?

(0) Ignore this and do nothing?

(1) No general infrastructure. Special case this one insn with #ifdef __SSE2__
and ignore anything else.

(2) Put in just enough infrastructure to know if compiler support for general
vectors is available, and then use it ad hoc when such functions are shown to
be high on the profile?

(3) Put in more infrastructure and allow it to be used to implement most guest
vector operations, possibly tidying their implementations?



r~
Alex Bennée
2014-09-13 16:02:38 UTC
Post by Richard Henderson
Most of the time, guest vector operations are rare enough that it doesn't
really matter that we implement them with a loop around integer operations.
But for target-alpha, there's one vector comparison operation that appears in
every guest string operation, and is used heavily enough that it's in the top
10 functions in the profile: cmpbge (compare bytes greater or equal).
For a helper function to top the profile is pretty impressive. I wonder
how it compares when you break it down by basic blocks?
Post by Richard Henderson
I did some experiments, where I rewrote the function using gcc's "generic"
vector types and builtin operations.
<snip>
b4: 4f000400 movi v0.4s, #0x0
b8: 4ea01c01 mov v1.16b, v0.16b
bc: 4e081c00 mov v0.d[0], x0
c0: 4e081c21 mov v1.d[0], x1
c4: 6e213c00 cmhs v0.16b, v0.16b, v1.16b
c8: 4e083c00 mov x0, v0.d[0]
cc: 9200c000 and x0, x0, #0x101010101010101
d0: aa401c00 orr x0, x0, x0, lsr #7
d4: aa403800 orr x0, x0, x0, lsr #14
d8: aa407000 orr x0, x0, x0, lsr #28
dc: 53001c00 uxtb w0, w0
e0: d65f03c0 ret
Of course aarch64 *does* have an 8-byte vector size that gcc knows how to use.
If I adjust the patch above to use it, only the first two insns are eliminated
-- surely not a measurable difference.
...
vcmpgtub 13,0,1
vcmpequb 0,0,1
xxlor 32,45,32
...
But I guess the larger question here is: how much of this should we accept?
(0) Ignore this and do nothing?
(1) No general infrastructure. Special case this one insn with #ifdef __SSE2__
and ignore anything else.
Not a big fan of special cases that are arch dependent.
Post by Richard Henderson
(2) Put in just enough infrastructure to know if compiler support for general
vectors is available, and then use it ad hoc when such functions are shown to
be high on the profile?
(3) Put in more infrastructure and allow it to be used to implement most guest
vector operations, possibly tidying their implementations?
<snip>

(4) Consider supporting generic vector operations in the TCG?

While making helper functions faster is good, I've wondered if there is
enough genericism across the various SIMD/vector operations that we could
add TCG ops to translate them? The ops could fall back to generic helper
functions using the GCC intrinsics if we know there is no decent
back-end support for them?
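
As a rough illustration of that fallback idea (the names here are hypothetical,
and passing the vectors through pointers sidesteps calling-convention issues):

#include <string.h>
#include <stdint.h>

typedef uint32_t vec32x4 __attribute__((vector_size(16)));

/* Hypothetical generic fallback helper: operate through pointers into
   CPUArchState and let GCC pick whatever vector instructions the host has. */
static void helper_generic_add_i32x4(void *d, const void *a, const void *b)
{
    vec32x4 va, vb;
    memcpy(&va, a, sizeof(va));
    memcpy(&vb, b, sizeof(vb));
    va += vb;
    memcpy(d, &va, sizeof(va));
}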
--
Alex Bennée
Kirill Batuzov
2014-10-16 08:56:49 UTC
When a TCG backend does not support some vector operation we need to emulate
that operation. Unlike the arguments of scalar operations, vector values are
hard to operate on directly or to pass as function arguments (because
a target may lack corresponding type support). To avoid this we will use
pointers to host memory locations holding the values of temporaries. These
memory locations for globals must be their canonical locations in CPUArchState
because moving them around is expensive and hard to implement.

Fortunately globals always have memory locations statically assigned to them.
They are addressed relative to AREG0. To express direct access to this memory
in TCG opcodes we need to know the global variable ENV (which corresponds to
this AREG0).

Add a field to TCGContext. Frontends can save ENV there during translate_init.
It will be used only for handling vector operations, so targets that do not use
vector support do not need to set it.
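
For instance, a frontend that already has a cpu_env global (as target-arm does)
would only need to record it once; a sketch:

void arm_translate_init(void)
{
    /* ... existing global and helper registration ... */

    /* Let TCG address vector globals relative to ENV/AREG0. */
    tcg_ctx.cpu_env = cpu_env;
}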

Signed-off-by: Kirill Batuzov <***@ispras.ru>
---
tcg/tcg.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 01dbede..83fb0d3 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -496,6 +496,7 @@ struct TCGContext {
tcg_insn_unit *code_ptr;
TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
+ TCGv_ptr cpu_env; /* used to access memory locations for vector globals */

GHashTable *helpers;
--
1.7.10.4
Kirill Batuzov
2014-10-16 08:56:52 UTC
To support 128-bit guest registers as globals we need to do two things:

1) create corresponding globals,
2) add sync_temp/discard to code that accesses these registers as memory
locations.

Note that the second part is not complete in this RFC yet, and mixing NEON with
VFP code can result in a miscompile.

Signed-off-by: Kirill Batuzov <***@ispras.ru>
---
target-arm/translate.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 8a2994f..22855d8 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -64,6 +64,7 @@ TCGv_ptr cpu_env;
/* We reuse the same 64-bit temporaries for efficiency. */
static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
static TCGv_i32 cpu_R[16];
+static TCGv_v128 cpu_Q[16];
static TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
static TCGv_i64 cpu_exclusive_addr;
static TCGv_i64 cpu_exclusive_val;
@@ -78,10 +79,14 @@ static TCGv_i64 cpu_F0d, cpu_F1d;

#include "exec/gen-icount.h"

-static const char *regnames[] =
+static const char *regnames_r[] =
{ "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
"r8", "r9", "r10", "r11", "r12", "r13", "r14", "pc" };

+static const char *regnames_q[] =
+ { "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",
+ "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15" };
+
/* initialize TCG globals. */
void arm_translate_init(void)
{
@@ -92,7 +97,12 @@ void arm_translate_init(void)
for (i = 0; i < 16; i++) {
cpu_R[i] = tcg_global_mem_new_i32(TCG_AREG0,
offsetof(CPUARMState, regs[i]),
- regnames[i]);
+ regnames_r[i]);
+ }
+ for (i = 0; i < 16; i++) {
+ cpu_Q[i] = tcg_global_mem_new_v128(TCG_AREG0,
+ offsetof(CPUARMState, vfp.regs[2 * i]),
+ regnames_q[i]);
}
cpu_CF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, CF), "CF");
cpu_NF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, NF), "NF");
@@ -1237,23 +1247,27 @@ neon_reg_offset (int reg, int n)
static TCGv_i32 neon_load_reg(int reg, int pass)
{
TCGv_i32 tmp = tcg_temp_new_i32();
+ tcg_gen_sync_temp_v128(cpu_Q[reg >> 1]);
tcg_gen_ld_i32(tmp, cpu_env, neon_reg_offset(reg, pass));
return tmp;
}

static void neon_store_reg(int reg, int pass, TCGv_i32 var)
{
+ tcg_gen_discard_v128(cpu_Q[reg >> 1]);
tcg_gen_st_i32(var, cpu_env, neon_reg_offset(reg, pass));
tcg_temp_free_i32(var);
}

static inline void neon_load_reg64(TCGv_i64 var, int reg)
{
+ tcg_gen_sync_temp_v128(cpu_Q[reg >> 1]);
tcg_gen_ld_i64(var, cpu_env, vfp_reg_offset(1, reg));
}

static inline void neon_store_reg64(TCGv_i64 var, int reg)
{
+ tcg_gen_discard_v128(cpu_Q[reg >> 1]);
tcg_gen_st_i64(var, cpu_env, vfp_reg_offset(1, reg));
}
--
1.7.10.4
Kirill Batuzov
2014-10-16 08:56:50 UTC
Currently every field of CPUArchState can be accessed from the TCG-generated code
as a memory location or as a global, but not both. In order to be able to mix
these two approaches we need to restore consistency between the value of a global
(possibly kept in a register) and the value in the corresponding memory location.

Introduce the sync_temp TCGOpcode, which instructs the register allocator to
save the value of a global into its memory location.

Signed-off-by: Kirill Batuzov <***@ispras.ru>
---
tcg/tcg-op.h | 10 ++++++++++
tcg/tcg-opc.h | 1 +
tcg/tcg.c | 12 ++++++++++++
3 files changed, 23 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 81291fd..ea2b14f 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -1808,6 +1808,16 @@ static inline void tcg_gen_discard_i64(TCGv_i64 arg)
#endif
}

+static inline void tcg_gen_discard_v128(TCGv_v128 arg)
+{
+ tcg_gen_op1_v128(INDEX_op_discard, arg);
+}
+
+static inline void tcg_gen_sync_temp_v128(TCGv_v128 arg)
+{
+ tcg_gen_op1_v128(INDEX_op_sync_temp, arg);
+}
+
static inline void tcg_gen_andc_i32(TCGv_i32 ret, TCGv_i32 arg1, TCGv_i32 arg2)
{
if (TCG_TARGET_HAS_andc_i32) {
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 042d442..0916d83 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -37,6 +37,7 @@ DEF(nop3, 0, 0, 3, TCG_OPF_NOT_PRESENT)
DEF(nopn, 0, 0, 1, TCG_OPF_NOT_PRESENT)

DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
+DEF(sync_temp, 0, 1, 0, TCG_OPF_NOT_PRESENT)
DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)

/* variable number of parameters */
diff --git a/tcg/tcg.c b/tcg/tcg.c
index d01f357..ff157b7 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1553,6 +1553,11 @@ static void tcg_liveness_analysis(TCGContext *s)
dead_temps[args[0]] = 1;
mem_temps[args[0]] = 0;
break;
+ case INDEX_op_sync_temp:
+ args--;
+ dead_temps[args[0]] = 1;
+ mem_temps[args[0]] = 1;
+ break;
case INDEX_op_end:
break;

@@ -2527,6 +2532,13 @@ static inline int tcg_gen_code_common(TCGContext *s,
case INDEX_op_discard:
temp_dead(s, args[0]);
break;
+ case INDEX_op_sync_temp:
+ /* We use it only for globals currently. */
+ assert(args[0] < s->nb_globals);
+ if (s->temps[args[0]].val_type == TEMP_VAL_REG) {
+ tcg_reg_free(s, s->temps[args[0]].reg);
+ }
+ break;
case INDEX_op_set_label:
tcg_reg_alloc_bb_end(s, s->reserved_regs);
tcg_out_label(s, args[0], s->code_ptr);
--
1.7.10.4
Kirill Batuzov
2014-10-16 08:56:48 UTC
Introduce TCG_TYPE_V128 and corresponding TCGv_v128 for TCG temps. Add wrapper
functions that work with temps of this new type.

Signed-off-by: Kirill Batuzov <***@ispras.ru>
---
tcg/tcg-op.h | 23 +++++++++++++++++++++++
tcg/tcg.c | 24 ++++++++++++++++++++++++
tcg/tcg.h | 28 ++++++++++++++++++++++++++++
3 files changed, 75 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 019dd9b..81291fd 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -345,6 +345,29 @@ static inline void tcg_gen_op6ii_i64(TCGOpcode opc, TCGv_i64 arg1,
*tcg_ctx.gen_opparam_ptr++ = arg6;
}

+static inline void tcg_gen_op1_v128(TCGOpcode opc, TCGv_v128 arg1)
+{
+ *tcg_ctx.gen_opc_ptr++ = opc;
+ *tcg_ctx.gen_opparam_ptr++ = GET_TCGV_V128(arg1);
+}
+
+static inline void tcg_gen_op2_v128(TCGOpcode opc, TCGv_v128 arg1,
+ TCGv_v128 arg2)
+{
+ *tcg_ctx.gen_opc_ptr++ = opc;
+ *tcg_ctx.gen_opparam_ptr++ = GET_TCGV_V128(arg1);
+ *tcg_ctx.gen_opparam_ptr++ = GET_TCGV_V128(arg2);
+}
+
+static inline void tcg_gen_op3_v128(TCGOpcode opc, TCGv_v128 arg1,
+ TCGv_v128 arg2, TCGv_v128 arg3)
+{
+ *tcg_ctx.gen_opc_ptr++ = opc;
+ *tcg_ctx.gen_opparam_ptr++ = GET_TCGV_V128(arg1);
+ *tcg_ctx.gen_opparam_ptr++ = GET_TCGV_V128(arg2);
+ *tcg_ctx.gen_opparam_ptr++ = GET_TCGV_V128(arg3);
+}
+
static inline void tcg_add_param_i32(TCGv_i32 val)
{
*tcg_ctx.gen_opparam_ptr++ = GET_TCGV_I32(val);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 7a84b87..d01f357 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -542,6 +542,12 @@ TCGv_i64 tcg_global_mem_new_i64(int reg, intptr_t offset, const char *name)
return MAKE_TCGV_I64(idx);
}

+TCGv_v128 tcg_global_mem_new_v128(int reg, intptr_t offset, const char *name)
+{
+ int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset, name);
+ return MAKE_TCGV_V128(idx);
+}
+
static inline int tcg_temp_new_internal(TCGType type, int temp_local)
{
TCGContext *s = &tcg_ctx;
@@ -612,6 +618,14 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
return MAKE_TCGV_I64(idx);
}

+TCGv_v128 tcg_temp_new_internal_v128(int temp_local)
+{
+ int idx;
+
+ idx = tcg_temp_new_internal(TCG_TYPE_V128, temp_local);
+ return MAKE_TCGV_V128(idx);
+}
+
static void tcg_temp_free_internal(int idx)
{
TCGContext *s = &tcg_ctx;
@@ -644,6 +658,11 @@ void tcg_temp_free_i64(TCGv_i64 arg)
tcg_temp_free_internal(GET_TCGV_I64(arg));
}

+void tcg_temp_free_v128(TCGv_v128 arg)
+{
+ tcg_temp_free_internal(GET_TCGV_V128(arg));
+}
+
TCGv_i32 tcg_const_i32(int32_t val)
{
TCGv_i32 t0;
@@ -1062,6 +1081,11 @@ char *tcg_get_arg_str_i64(TCGContext *s, char *buf, int buf_size, TCGv_i64 arg)
return tcg_get_arg_str_idx(s, buf, buf_size, GET_TCGV_I64(arg));
}

+char *tcg_get_arg_str_v128(TCGContext *s, char *buf, int buf_size, TCGv_v128 arg)
+{
+ return tcg_get_arg_str_idx(s, buf, buf_size, GET_TCGV_V128(arg));
+}
+
/* Find helper name. */
static inline const char *tcg_find_helper(TCGContext *s, uintptr_t val)
{
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 7285f71..01dbede 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -194,6 +194,7 @@ typedef struct TCGPool {
typedef enum TCGType {
TCG_TYPE_I32,
TCG_TYPE_I64,
+ TCG_TYPE_V128,
TCG_TYPE_COUNT, /* number of different types */

/* An alias for the size of the host register. */
@@ -286,6 +287,7 @@ typedef tcg_target_ulong TCGArg;
typedef struct TCGv_i32_d *TCGv_i32;
typedef struct TCGv_i64_d *TCGv_i64;
typedef struct TCGv_ptr_d *TCGv_ptr;
+typedef struct TCGv_v128_d *TCGv_v128;

static inline TCGv_i32 QEMU_ARTIFICIAL MAKE_TCGV_I32(intptr_t i)
{
@@ -302,6 +304,11 @@ static inline TCGv_ptr QEMU_ARTIFICIAL MAKE_TCGV_PTR(intptr_t i)
return (TCGv_ptr)i;
}

+static inline TCGv_v128 QEMU_ARTIFICIAL MAKE_TCGV_V128(intptr_t i)
+{
+ return (TCGv_v128)i;
+}
+
static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_I32(TCGv_i32 t)
{
return (intptr_t)t;
@@ -317,6 +324,11 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
return (intptr_t)t;
}

+static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V128(TCGv_v128 t)
+{
+ return (intptr_t)t;
+}
+
#if TCG_TARGET_REG_BITS == 32
#define TCGV_LOW(t) MAKE_TCGV_I32(GET_TCGV_I64(t))
#define TCGV_HIGH(t) MAKE_TCGV_I32(GET_TCGV_I64(t) + 1)
@@ -324,15 +336,18 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)

#define TCGV_EQUAL_I32(a, b) (GET_TCGV_I32(a) == GET_TCGV_I32(b))
#define TCGV_EQUAL_I64(a, b) (GET_TCGV_I64(a) == GET_TCGV_I64(b))
+#define TCGV_EQUAL_V128(a, b) (GET_TCGV_V128(a) == GET_TCGV_V128(b))
#define TCGV_EQUAL_PTR(a, b) (GET_TCGV_PTR(a) == GET_TCGV_PTR(b))

/* Dummy definition to avoid compiler warnings. */
#define TCGV_UNUSED_I32(x) x = MAKE_TCGV_I32(-1)
#define TCGV_UNUSED_I64(x) x = MAKE_TCGV_I64(-1)
+#define TCGV_UNUSED_V128(x) x = MAKE_TCGV_V128(-1)
#define TCGV_UNUSED_PTR(x) x = MAKE_TCGV_PTR(-1)

#define TCGV_IS_UNUSED_I32(x) (GET_TCGV_I32(x) == -1)
#define TCGV_IS_UNUSED_I64(x) (GET_TCGV_I64(x) == -1)
+#define TCGV_IS_UNUSED_V128(x) (GET_TCGV_V128(x) == -1)
#define TCGV_IS_UNUSED_PTR(x) (GET_TCGV_PTR(x) == -1)

/* call flags */
@@ -596,6 +611,19 @@ static inline TCGv_i64 tcg_temp_local_new_i64(void)
void tcg_temp_free_i64(TCGv_i64 arg);
char *tcg_get_arg_str_i64(TCGContext *s, char *buf, int buf_size, TCGv_i64 arg);

+TCGv_v128 tcg_global_mem_new_v128(int reg, intptr_t offset, const char *name);
+TCGv_v128 tcg_temp_new_internal_v128(int temp_local);
+static inline TCGv_v128 tcg_temp_new_v128(void)
+{
+ return tcg_temp_new_internal_v128(0);
+}
+static inline TCGv_v128 tcg_temp_local_new_v128(void)
+{
+ return tcg_temp_new_internal_v128(1);
+}
+void tcg_temp_free_v128(TCGv_v128 arg);
+char *tcg_get_arg_str_v128(TCGContext *s, char *buf, int buf_size, TCGv_v128 arg);
+
#if defined(CONFIG_DEBUG_TCG)
/* If you call tcg_clear_temp_count() at the start of a section of
* code which is not supposed to leak any TCG temporaries, then
--
1.7.10.4
Kirill Batuzov
2014-10-16 08:56:47 UTC
Post by Alex Bennée
(4) Consider supporting generic vector operations in the TCG?
I gave it a go and was quite happy with the result. I have implemented the add_i32x4
opcode which is addition of 128-bit vectors composed of four 32-bit integers
and used it to translate NEON vadd.i32 to SSE paddd instruction. I used ARM for
my guest because I'm familiar with this architecture and it is different from
my host.

I got a 3x speedup on my testcase:

mov r0, #0xb0000000
loop:
vadd.i32 q0, q0, q1
vadd.i32 q0, q0, q1
vadd.i32 q0, q0, q1
vadd.i32 q0, q0, q1
subs r0, r0, #1
bne loop

Evaluation results:

master: 25.398s
patched: 7.704s

Generated code:

IN:
0x00008298: f2200842 vadd.i32 q0, q0, q1
0x0000829c: f2200842 vadd.i32 q0, q0, q1
0x000082a0: f2200842 vadd.i32 q0, q0, q1
0x000082a4: f2200842 vadd.i32 q0, q0, q1
<...>

OP after optimization and liveness analysis:
ld_i32 tmp5,env,$0xfffffffffffffffc
movi_i32 tmp6,$0x0
brcond_i32 tmp5,tmp6,ne,$0x0
---- 0x8298
add_i32x4 q0,q0,q1

---- 0x829c
add_i32x4 q0,q0,q1

---- 0x82a0
add_i32x4 q0,q0,q1

---- 0x82a4
add_i32x4 q0,q0,q1
<...>

OUT: [size=196]
0x60442450: mov -0x4(%r14),%ebp
0x60442454: test %ebp,%ebp
0x60442456: jne 0x60442505
0x6044245c: movdqu 0x658(%r14),%xmm0
0x60442465: movdqu 0x668(%r14),%xmm1
0x6044246e: paddd %xmm1,%xmm0
0x60442472: paddd %xmm1,%xmm0
0x60442476: paddd %xmm1,%xmm0
0x6044247a: paddd %xmm1,%xmm0
0x6044247e: movdqu %xmm0,0x658(%r14)
<...>
Post by Richard Henderson
But for target-alpha, there's one vector comparison operation that appears in
every guest string operation, and is used heavily enough that it's in the top
10 functions in the profile: cmpbge (compare bytes greater or equal).
cmpbge can be translated as follows:

cmpge_i8x8 tmp0, arg1, arg2
select_msb_i8x8 res, tmp0

where cmpge is "compare greater or equal" with the following semantics:
res[i] = <111...11> if arg1[i] >= arg2[i]
res[i] = <000...00> if arg1[i] < arg2[i]
There is such an operation in NEON. In SSE we can emulate it with PCMPEQB,
PCMPGTB and POR.

select_msb is "select most significant bit"; the SSE instruction is PMOVMSKB.
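
Putting the two opcodes together for cmpbge, a hedged SSE2 sketch in intrinsics
(a real backend would emit the instructions directly; this uses the
psubusb/pcmpeqb trick from the gcc output earlier in the thread, since PCMPGTB
alone is a signed compare):

#include <emmintrin.h>
#include <stdint.h>

/* Illustrative only: unsigned per-byte ">=" (cmpge_i8x8 on the low 8 lanes)
   followed by an MSB select (select_msb_i8x8). */
static uint64_t cmpbge_sse2(uint64_t op1, uint64_t op2)
{
    __m128i a = _mm_cvtsi64_si128((int64_t)op1);
    __m128i b = _mm_cvtsi64_si128((int64_t)op2);
    /* a >= b (unsigned)  <=>  saturating (b - a) == 0 */
    __m128i ge = _mm_cmpeq_epi8(_mm_subs_epu8(b, a), _mm_setzero_si128());
    return _mm_movemask_epi8(ge) & 0xff;    /* one bit per byte, low 8 lanes */
}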
Post by Alex Bennée
While making helper functions faster is good, I've wondered if there is
enough genericism across the various SIMD/vector operations that we could
add TCG ops to translate them? The ops could fall back to generic helper
functions using the GCC intrinsics if we know there is no decent
back-end support for them?
From Valgrind's experience there is enough genericism. Valgrind can translate
SSE, AltiVec and NEON instructions to vector opcodes. Most of the opcodes are
reused between instruction sets.

But keep in mind that there are a lot of vector opcodes - many more than
scalar ones. You can see the full list in the Valgrind sources (VEX/pub/libvex_ir.h).

We can reduce the number of opcodes by turning the vector element size from part
of the opcode into a constant argument. But we will lose some of the flexibility
offered by the TARGET_HAS_opcode macro when a target supports some sizes but not
others. For example, SSE has a vector minimum for sizes i8x16, i16x8 and i32x4
but does not have one for size i64x2.

Some implementation details and concerns.

The most problematic issue was the fact that with vector registers we have one
entity that can be accessed as both a global variable and a memory location. I
solved it by introducing the sync_temp opcode, which instructs the register
allocator to save a global variable to its memory location if it is held in a
register. If a variable is not in a register, or memory is already coherent, no
store is issued, so the performance penalty is minimal. Still, this approach has
a serious drawback: we need to generate sync_temp explicitly. But I do not know
any better way to achieve consistency.

Note that as of this RFC I have not finished converting the ARM guest, so mixing
NEON with VFP code can cause a miscompile.

The second problem is that a backend may or may not support vector operations.
We do not want each frontend to check this on every operation, so I created a
wrapper that generates the vector opcode if it is supported and emulation code
otherwise.

For add_i32x4, emulation code is generated inline. I tried to make it a helper
but got a very significant performance loss (5x slowdown). I'm not sure about
the cause, but I suspect that memory was a bottleneck and the extra stores
required by the calling convention mattered a lot.

The existing constraints are good enough to express that vector registers and
general-purpose registers are different and cannot be used in place of each
other.

One unsolved problem is global aliasing. With general-purpose registers we have
no aliasing between globals. The only example I know of where registers can alias
is the x86 ah/ax/eax/rax case, and they are handled as one global. With vector
registers we have NEON, where a 128-bit Q register consists of two 64-bit
D registers, each consisting of two 32-bit S registers. I think I'll need
to add an alias list to each global, listing every other global it can clobber,
and then iterate over it in the optimizer. Fortunately this list will be static
and not very long.
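
A possible shape for that, purely as a sketch (the index values are invented
for illustration only):

/* Hypothetical static alias table: for each vector global, the indices of the
   other globals that a write to it clobbers, terminated by -1.  For NEON,
   writing Q0 clobbers its D0/D1 views and vice versa. */
enum { IDX_Q0 = 0, IDX_D0 = 16, IDX_D1 = 17 };

static const int q0_clobbers[] = { IDX_D0, IDX_D1, -1 };
static const int d0_clobbers[] = { IDX_Q0, -1 };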

Why I think all this is worth doing:

(1) Performance. A 200% speedup is a lot. My test was specifically crafted, and
real-life applications may not have that many vector operations on average, but
there is a specific class of applications where it will matter a lot - media
processing applications like ffmpeg.

(2) Some unification of common operations. Right now every target reimplements
common vector operations (like vector add/sub/mul/min/compare etc.). We can
do it once in the common TCG code.

Still, there are some cons, as I mentioned earlier. The need to support a lot of
opcodes is the most significant in the long run, I think. So before I commit my
time to converting more operations I'd like to hear your opinions on whether this
approach is acceptable and worth the effort.

Kirill Batuzov (7):
tcg: add support for 128bit vector type
tcg: store ENV global in TCGContext
tcg: add sync_temp opcode
tcg: add add_i32x4 opcode
target-arm: support access to 128-bit guest registers as globals
target-arm: use add_i32x4 opcode to handle vadd.i32 instruction
tcg/i386: add support for vector opcodes

target-arm/translate.c | 30 ++++++++++-
tcg/i386/tcg-target.c | 103 ++++++++++++++++++++++++++++++++---
tcg/i386/tcg-target.h | 24 ++++++++-
tcg/tcg-op.h | 141 ++++++++++++++++++++++++++++++++++++++++++++++++
tcg/tcg-opc.h | 13 +++++
tcg/tcg.c | 36 +++++++++++++
tcg/tcg.h | 34 ++++++++++++
7 files changed, 371 insertions(+), 10 deletions(-)
--
1.7.10.4
Kirill Batuzov
2014-10-16 08:56:53 UTC
Signed-off-by: Kirill Batuzov <***@ispras.ru>
---
target-arm/translate.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 22855d8..00ea5cf 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -5239,6 +5239,18 @@ static int disas_neon_data_insn(CPUARMState * env, DisasContext *s, uint32_t ins
return 1;
}

+ /* Use vector ops to handle what we can */
+ switch (op) {
+ case NEON_3R_VADD_VSUB:
+ if (!u && size == 2) {
+ tcg_gen_add_i32x4(cpu_Q[rd >> 1], cpu_Q[rn >> 1], cpu_Q[rm >> 1]);
+ return 0;
+ }
+ break;
+ default:
+ break;
+ }
+
for (pass = 0; pass < (q ? 4 : 2); pass++) {

if (pairwise) {
--
1.7.10.4
Kirill Batuzov
2014-10-16 08:56:54 UTC
To be able to generate vector operations in a TCG backend we need to do several
things.

1. We need to tell the register allocator about the target's vector registers.
In the case of x86 we'll use xmm0..xmm7. xmm7 is designated as a scratch
register; the others can be used by the register allocator.

2. We need a new constraint to indicate where to use vector registers. In this
commit constraint 'V' is introduced.

3. We need to be able to generate bare minimum: load, store and reg-to-reg
move. MOVDQU is used for loads and stores. MOVDQA is used for reg-to-reg
moves.

4. Finally we need to support any other opcodes we want. INDEX_op_add_i32x4 is
the only one for now. PADDD instruction handles it perfectly.

Signed-off-by: Kirill Batuzov <***@ispras.ru>
---
tcg/i386/tcg-target.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++---
tcg/i386/tcg-target.h | 24 +++++++++++-
2 files changed, 119 insertions(+), 8 deletions(-)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 4133dcf..f26750d 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -32,6 +32,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
#else
"%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
#endif
+#ifdef TCG_TARGET_HAS_REG128
+ "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
+#endif
};
#endif

@@ -61,6 +64,16 @@ static const int tcg_target_reg_alloc_order[] = {
TCG_REG_EDX,
TCG_REG_EAX,
#endif
+#ifdef TCG_TARGET_HAS_REG128
+ TCG_REG_XMM0,
+ TCG_REG_XMM1,
+ TCG_REG_XMM2,
+ TCG_REG_XMM3,
+ TCG_REG_XMM4,
+ TCG_REG_XMM5,
+ TCG_REG_XMM6,
+/* TCG_REG_XMM7, <- scratch register */
+#endif
};

static const int tcg_target_call_iarg_regs[] = {
@@ -247,6 +260,10 @@ static int target_parse_constraint(TCGArgConstraint *ct, const char **pct_str)
case 'I':
ct->ct |= TCG_CT_CONST_I32;
break;
+ case 'V':
+ ct->ct |= TCG_CT_REG;
+ tcg_regset_set32(ct->u.regs, 0, 0xff0000);
+ break;

default:
return -1;
@@ -301,6 +318,9 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
#define P_SIMDF3 0x10000 /* 0xf3 opcode prefix */
#define P_SIMDF2 0x20000 /* 0xf2 opcode prefix */

+#define P_SSE_660F (P_DATA16 | P_EXT)
+#define P_SSE_F30F (P_SIMDF3 | P_EXT)
+
#define OPC_ARITH_EvIz (0x81)
#define OPC_ARITH_EvIb (0x83)
#define OPC_ARITH_GvEv (0x03) /* ... plus (ARITH_FOO << 3) */
@@ -351,6 +371,11 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
#define OPC_GRP3_Ev (0xf7)
#define OPC_GRP5 (0xff)

+#define OPC_MOVDQU_M2R (0x6f | P_SSE_F30F) /* load 128-bit value */
+#define OPC_MOVDQU_R2M (0x7f | P_SSE_F30F) /* store 128-bit value */
+#define OPC_MOVDQA_R2R (0x6f | P_SSE_660F) /* reg-to-reg 128-bit mov */
+#define OPC_PADDD (0xfe | P_SSE_660F)
+
/* Group 1 opcode extensions for 0x80-0x83.
These are also used as modifiers for OPC_ARITH. */
#define ARITH_ADD 0
@@ -428,6 +453,9 @@ static void tcg_out_opc(TCGContext *s, int opc, int r, int rm, int x)
assert((opc & P_REXW) == 0);
tcg_out8(s, 0x66);
}
+ if (opc & P_SIMDF3) {
+ tcg_out8(s, 0xf3);
+ }
if (opc & P_ADDR32) {
tcg_out8(s, 0x67);
}
@@ -634,9 +662,22 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
static inline void tcg_out_mov(TCGContext *s, TCGType type,
TCGReg ret, TCGReg arg)
{
+ int opc;
if (arg != ret) {
- int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
- tcg_out_modrm(s, opc, ret, arg);
+ switch (type) {
+ case TCG_TYPE_V128:
+ ret -= TCG_REG_XMM0;
+ arg -= TCG_REG_XMM0;
+ tcg_out_modrm(s, OPC_MOVDQA_R2R, ret, arg);
+ break;
+ case TCG_TYPE_I32:
+ case TCG_TYPE_I64:
+ opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+ tcg_out_modrm(s, opc, ret, arg);
+ break;
+ default:
+ assert(0);
+ }
}
}

@@ -699,15 +740,39 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
TCGReg arg1, intptr_t arg2)
{
- int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
- tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
+ int opc;
+ switch (type) {
+ case TCG_TYPE_V128:
+ ret -= TCG_REG_XMM0;
+ tcg_out_modrm_offset(s, OPC_MOVDQU_M2R, ret, arg1, arg2);
+ break;
+ case TCG_TYPE_I32:
+ case TCG_TYPE_I64:
+ opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+ tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
+ break;
+ default:
+ assert(0);
+ }
}

static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
TCGReg arg1, intptr_t arg2)
{
- int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
- tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
+ int opc;
+ switch (type) {
+ case TCG_TYPE_V128:
+ arg -= TCG_REG_XMM0;
+ tcg_out_modrm_offset(s, OPC_MOVDQU_R2M, arg, arg1, arg2);
+ break;
+ case TCG_TYPE_I32:
+ case TCG_TYPE_I64:
+ opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+ tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
+ break;
+ default:
+ assert(0);
+ }
}

static inline void tcg_out_sti(TCGContext *s, TCGType type, TCGReg base,
@@ -1770,6 +1835,11 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
case INDEX_op_ld_i32:
tcg_out_ld(s, TCG_TYPE_I32, args[0], args[1], args[2]);
break;
+#ifdef TCG_TARGET_HAS_REG128
+ case INDEX_op_ld_v128:
+ tcg_out_ld(s, TCG_TYPE_V128, args[0], args[1], args[2]);
+ break;
+#endif

OP_32_64(st8):
if (const_args[0]) {
@@ -1802,6 +1872,11 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
tcg_out_st(s, TCG_TYPE_I32, args[0], args[1], args[2]);
}
break;
+#ifdef TCG_TARGET_HAS_REG128
+ case INDEX_op_st_v128:
+ tcg_out_st(s, TCG_TYPE_V128, args[0], args[1], args[2]);
+ break;
+#endif

OP_32_64(add):
/* For 3-operand addition, use LEA. */
@@ -2055,6 +2130,13 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
}
break;

+#ifdef TCG_TARGET_HAS_REG128
+ case INDEX_op_add_i32x4:
+ tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
+ break;
+#endif
+
+
case INDEX_op_mov_i32: /* Always emitted via tcg_out_mov. */
case INDEX_op_mov_i64:
case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi. */
@@ -2080,6 +2162,11 @@ static const TCGTargetOpDef x86_op_defs[] = {
{ INDEX_op_st16_i32, { "ri", "r" } },
{ INDEX_op_st_i32, { "ri", "r" } },

+#ifdef TCG_TARGET_HAS_REG128
+ { INDEX_op_ld_v128, { "V", "r" } },
+ { INDEX_op_st_v128, { "V", "r" } },
+#endif
+
{ INDEX_op_add_i32, { "r", "r", "ri" } },
{ INDEX_op_sub_i32, { "r", "0", "ri" } },
{ INDEX_op_mul_i32, { "r", "0", "ri" } },
@@ -2193,6 +2280,10 @@ static const TCGTargetOpDef x86_op_defs[] = {
{ INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
{ INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
#endif
+
+#ifdef TCG_TARGET_HAS_REG128
+ { INDEX_op_add_i32x4, { "V", "0", "V" } },
+#endif
{ -1 },
};

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 7a9980e..a6e4cd8 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -27,8 +27,14 @@
#define TCG_TARGET_INSN_UNIT_SIZE 1

#ifdef __x86_64__
-# define TCG_TARGET_REG_BITS 64
-# define TCG_TARGET_NB_REGS 16
+# define TCG_TARGET_HAS_REG128 1
+# ifdef TCG_TARGET_HAS_REG128
+# define TCG_TARGET_REG_BITS 64
+# define TCG_TARGET_NB_REGS 24
+# else
+# define TCG_TARGET_REG_BITS 64
+# define TCG_TARGET_NB_REGS 16
+# endif
#else
# define TCG_TARGET_REG_BITS 32
# define TCG_TARGET_NB_REGS 8
@@ -54,6 +60,16 @@ typedef enum {
TCG_REG_R13,
TCG_REG_R14,
TCG_REG_R15,
+#ifdef TCG_TARGET_HAS_REG128
+ TCG_REG_XMM0,
+ TCG_REG_XMM1,
+ TCG_REG_XMM2,
+ TCG_REG_XMM3,
+ TCG_REG_XMM4,
+ TCG_REG_XMM5,
+ TCG_REG_XMM6,
+ TCG_REG_XMM7,
+#endif
TCG_REG_RAX = TCG_REG_EAX,
TCG_REG_RCX = TCG_REG_ECX,
TCG_REG_RDX = TCG_REG_EDX,
@@ -130,6 +146,10 @@ extern bool have_bmi1;
#define TCG_TARGET_HAS_mulsh_i64 0
#endif

+#ifdef TCG_TARGET_HAS_REG128
+#define TCG_TARGET_HAS_add_i32x4 1
+#endif
+
#define TCG_TARGET_deposit_i32_valid(ofs, len) \
(((ofs) == 0 && (len) == 8) || ((ofs) == 8 && (len) == 8) || \
((ofs) == 0 && (len) == 16))
--
1.7.10.4
Kirill Batuzov
2014-10-16 08:56:51 UTC
Introduce INDEX_op_add_i32x4 opcode which adds two 128-bit variables as vectors
of four 32-bit integers.

Add a tcg_gen_add_i32x4 wrapper function that generates this opcode. If a TCG
target does not support it, the wrapper falls back to emulating the vector
operation as a series of scalar ones. Wrapper arguments should be globals unless
the frontend is sure that the backend has at least some support for vector
operations (by "some support" I mean loads, stores and moves).

Note that emulation of vector operation with scalar ones is done inline. An
attempt to do it as a helper resulted in a serious performance degradation.

Signed-off-by: Kirill Batuzov <***@ispras.ru>
---
tcg/tcg-op.h | 108 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tcg/tcg-opc.h | 12 +++++++
tcg/tcg.h | 5 +++
3 files changed, 125 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index ea2b14f..c5f777d 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -139,6 +139,15 @@ static inline void tcg_gen_ldst_op_i64(TCGOpcode opc, TCGv_i64 val,
*tcg_ctx.gen_opparam_ptr++ = offset;
}

+static inline void tcg_gen_ldst_op_v128(TCGOpcode opc, TCGv_v128 val,
+ TCGv_ptr base, TCGArg offset)
+{
+ *tcg_ctx.gen_opc_ptr++ = opc;
+ *tcg_ctx.gen_opparam_ptr++ = GET_TCGV_V128(val);
+ *tcg_ctx.gen_opparam_ptr++ = GET_TCGV_PTR(base);
+ *tcg_ctx.gen_opparam_ptr++ = offset;
+}
+
static inline void tcg_gen_op4_i32(TCGOpcode opc, TCGv_i32 arg1, TCGv_i32 arg2,
TCGv_i32 arg3, TCGv_i32 arg4)
{
@@ -1069,6 +1078,11 @@ static inline void tcg_gen_ld_i64(TCGv_i64 ret, TCGv_ptr arg2, tcg_target_long o
tcg_gen_ldst_op_i64(INDEX_op_ld_i64, ret, arg2, offset);
}

+static inline void tcg_gen_ld_v128(TCGv_v128 ret, TCGv_ptr arg2, tcg_target_long offset)
+{
+ tcg_gen_ldst_op_v128(INDEX_op_ld_v128, ret, arg2, offset);
+}
+
static inline void tcg_gen_st8_i64(TCGv_i64 arg1, TCGv_ptr arg2,
tcg_target_long offset)
{
@@ -1092,6 +1106,11 @@ static inline void tcg_gen_st_i64(TCGv_i64 arg1, TCGv_ptr arg2, tcg_target_long
tcg_gen_ldst_op_i64(INDEX_op_st_i64, arg1, arg2, offset);
}

+static inline void tcg_gen_st_v128(TCGv_v128 arg1, TCGv_ptr arg2, tcg_target_long offset)
+{
+ tcg_gen_ldst_op_v128(INDEX_op_st_v128, arg1, arg2, offset);
+}
+
static inline void tcg_gen_add_i64(TCGv_i64 ret, TCGv_i64 arg1, TCGv_i64 arg2)
{
tcg_gen_op3_i64(INDEX_op_add_i64, ret, arg1, arg2);
@@ -2780,6 +2799,8 @@ static inline void tcg_gen_qemu_st64(TCGv_i64 arg, TCGv addr, int mem_index)
tcg_gen_add_i32(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), TCGV_PTR_TO_NAT(B))
# define tcg_gen_addi_ptr(R, A, B) \
tcg_gen_addi_i32(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), (B))
+# define tcg_gen_movi_ptr(R, B) \
+ tcg_gen_movi_i32(TCGV_PTR_TO_NAT(R), (B))
# define tcg_gen_ext_i32_ptr(R, A) \
tcg_gen_mov_i32(TCGV_PTR_TO_NAT(R), (A))
#else
@@ -2791,6 +2812,93 @@ static inline void tcg_gen_qemu_st64(TCGv_i64 arg, TCGv addr, int mem_index)
tcg_gen_add_i64(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), TCGV_PTR_TO_NAT(B))
# define tcg_gen_addi_ptr(R, A, B) \
tcg_gen_addi_i64(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), (B))
+# define tcg_gen_movi_ptr(R, B) \
+ tcg_gen_movi_i64(TCGV_PTR_TO_NAT(R), (B))
# define tcg_gen_ext_i32_ptr(R, A) \
tcg_gen_ext_i32_i64(TCGV_PTR_TO_NAT(R), (A))
#endif /* UINTPTR_MAX == UINT32_MAX */
+
+/***************************************/
+/* 128-bit vector arithmetic. */
+
+static inline void *tcg_v128_swap_slot(int n)
+{
+ return &tcg_ctx.v128_swap[n * 16];
+}
+
+/* Find a memory location for 128-bit TCG variable. */
+static inline void tcg_v128_to_ptr(TCGv_v128 tmp, TCGv_ptr base, int slot,
+ TCGv_ptr *real_base, intptr_t *real_offset,
+ int is_read)
+{
+ int idx = GET_TCGV_V128(tmp);
+ assert(idx >= 0 && idx < tcg_ctx.nb_temps);
+ if (idx < tcg_ctx.nb_globals) {
+ /* Globals use their locations within CPUArchState. */
+ int env = GET_TCGV_PTR(tcg_ctx.cpu_env);
+ TCGTemp *ts_env = &tcg_ctx.temps[env];
+ TCGTemp *ts_arg = &tcg_ctx.temps[idx];
+
+ /* Sanity checks: global's memory locations must be addressed
+ relative to ENV. */
+ assert(ts_env->val_type == TEMP_VAL_REG &&
+ ts_env->reg == ts_arg->mem_reg &&
+ ts_arg->mem_allocated);
+
+ *real_base = tcg_ctx.cpu_env;
+ *real_offset = ts_arg->mem_offset;
+
+ if (is_read) {
+ tcg_gen_sync_temp_v128(tmp);
+ } else {
+ tcg_gen_discard_v128(tmp);
+ }
+ } else {
+ /* Temporaries use swap space in TCGContext. Since we already have
+ a 128-bit temporary we'll assume that the target supports 128-bit
+ loads and stores. */
+ *real_base = base;
+ *real_offset = slot * 16;
+ if (is_read) {
+ tcg_gen_st_v128(tmp, base, slot * 16);
+ }
+ }
+}
+
+static inline void tcg_gen_add_i32x4(TCGv_v128 res, TCGv_v128 arg1,
+ TCGv_v128 arg2)
+{
+ if (TCG_TARGET_HAS_add_i32x4) {
+ tcg_gen_op3_v128(INDEX_op_add_i32x4, res, arg1, arg2);
+ } else {
+ TCGv_ptr base = tcg_temp_new_ptr();
+ TCGv_ptr arg1p, arg2p, resp;
+ intptr_t arg1of, arg2of, resof;
+ int i;
+ TCGv_i32 tmp1, tmp2;
+
+ tcg_gen_movi_ptr(base, (unsigned long)&tcg_ctx.v128_swap[0]);
+
+ tcg_v128_to_ptr(arg1, base, 1, &arg1p, &arg1of, 1);
+ tcg_v128_to_ptr(arg2, base, 2, &arg2p, &arg2of, 1);
+ tcg_v128_to_ptr(res, base, 0, &resp, &resof, 0);
+
+ tmp1 = tcg_temp_new_i32();
+ tmp2 = tcg_temp_new_i32();
+
+ for (i = 0; i < 4; i++) {
+ tcg_gen_ld_i32(tmp1, arg1p, arg1of + i * 4);
+ tcg_gen_ld_i32(tmp2, arg2p, arg2of + i * 4);
+ tcg_gen_add_i32(tmp1, tmp1, tmp2);
+ tcg_gen_st_i32(tmp1, resp, resof + i * 4);
+ }
+
+ if (GET_TCGV_V128(res) >= tcg_ctx.nb_globals) {
+ tcg_gen_ld_v128(res, base, 0);
+ }
+
+ tcg_temp_free_i32(tmp1);
+ tcg_temp_free_i32(tmp2);
+ tcg_temp_free_ptr(base);
+ }
+}
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 0916d83..51d72d9 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -52,6 +52,12 @@ DEF(br, 0, 0, 1, TCG_OPF_BB_END)
# define IMPL64 TCG_OPF_64BIT
#endif

+#ifdef TCG_TARGET_HAS_REG128
+# define IMPL128 0
+#else
+# define IMPL128 TCG_OPF_NOT_PRESENT
+#endif
+
DEF(mov_i32, 1, 1, 0, TCG_OPF_NOT_PRESENT)
DEF(movi_i32, 1, 0, 1, TCG_OPF_NOT_PRESENT)
DEF(setcond_i32, 1, 2, 1, 0)
@@ -177,6 +183,12 @@ DEF(muls2_i64, 2, 2, 0, IMPL64 | IMPL(TCG_TARGET_HAS_muls2_i64))
DEF(muluh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_muluh_i64))
DEF(mulsh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_mulsh_i64))

+/* load/store */
+DEF(st_v128, 0, 2, 1, IMPL128)
+DEF(ld_v128, 1, 1, 1, IMPL128)
+/* 128-bit vector arith */
+DEF(add_i32x4, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i32x4))
+
/* QEMU specific */
#if TARGET_LONG_BITS > TCG_TARGET_REG_BITS
DEF(debug_insn_start, 0, 0, 2, TCG_OPF_NOT_PRESENT)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 83fb0d3..75ab2e4 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -119,6 +119,10 @@ typedef uint64_t TCGRegSet;
#define TCG_TARGET_HAS_rem_i64 0
#endif

+#ifndef TCG_TARGET_HAS_add_i32x4
+#define TCG_TARGET_HAS_add_i32x4 0
+#endif
+
/* For 32-bit targets, some sort of unsigned widening multiply is required. */
#if TCG_TARGET_REG_BITS == 32 \
&& !(defined(TCG_TARGET_HAS_mulu2_i32) \
@@ -544,6 +548,7 @@ struct TCGContext {
/* threshold to flush the translated code buffer */
size_t code_gen_buffer_max_size;
void *code_gen_ptr;
+ uint8_t v128_swap[16 * 3];

TBContext tb_ctx;
--
1.7.10.4
Alex Bennée
2014-10-16 10:03:20 UTC
Post by Kirill Batuzov
Post by Alex Bennée
(4) Consider supporting generic vector operations in the TCG?
I gave it a go and was quite happy with the result. I have implemented the add_i32x4
opcode which is addition of 128-bit vectors composed of four 32-bit integers
and used it to translate NEON vadd.i32 to SSE paddd instruction. I used ARM for
my guest because I'm familiar with this architecture and it is different from
my host.
<snip>
Post by Kirill Batuzov
OUT: [size=196]
0x60442450: mov -0x4(%r14),%ebp
0x60442454: test %ebp,%ebp
0x60442456: jne 0x60442505
0x6044245c: movdqu 0x658(%r14),%xmm0
0x60442465: movdqu 0x668(%r14),%xmm1
0x6044246e: paddd %xmm1,%xmm0
0x60442472: paddd %xmm1,%xmm0
0x60442476: paddd %xmm1,%xmm0
0x6044247a: paddd %xmm1,%xmm0
0x6044247e: movdqu %xmm0,0x658(%r14)
<...>
It certainly looks promising although as I suspect you know add is a
pretty easy target ;-)
Post by Kirill Batuzov
Post by Richard Henderson
But for target-alpha, there's one vector comparison operation that appears in
every guest string operation, and is used heavily enough that it's in the top
10 functions in the profile: cmpbge (compare bytes greater or equal).
cmpge_i8x8 tmp0, arg1, arg2
select_msb_i8x8 res, tmp0
res[i] = <111...11> if arg1[i] >= arg2[i]
res[i] = <000...00> if arg1[i] < arg2[i]
There is such operation in NEON. In SSE we can emulate it with PCMPEQB, PCMPGTB
and POR.
select_msb is "select most significant bit". SSE instruction PMOVMSKB.
Post by Alex Bennée
While making helper functions faster is good, I've wondered if there is
enough genericism across the various SIMD/vector operations that we could
add TCG ops to translate them? The ops could fall back to generic helper
functions using the GCC intrinsics if we know there is no decent
back-end support for them?
From Valgrind experience there are enough genericism. Valgrind can translate
SSE, AltiVec and NEON instructions to vector opcodes. Most of the opcodes are
reused between instruction sets.
Doesn't Valgrind have the advantage of same-arch->same-arch? (I've not
looked at its generated code in detail though.)
Post by Kirill Batuzov
But keep in mind - there are a lot of vector opcodes. Much much more than
scalar ones. You can see full list in Valgrind sources
(VEX/pub/libvex_ir.h).
I think we could only approach this in a piecemeal way, guided by
performance bottlenecks as we find them.
Post by Kirill Batuzov
We can reduce the amount of opcodes by converting vector element size from part
of an opcode to a constant argument. But we will lose some flexibility offered
by the TARGET_HAS_opcode macro when target has support for some sizes but not for
others. For example SSE has vector minimum for sizes i8x16, i16x8, i32x4 but
does not have one for size i64x2.
Some implementation details and concerns.
The most problematic issue was the fact that with vector registers we have one
entity that can be accessed as both global variable and memory location. I
solved it by introducing the sync_temp opcode that instructs register allocator to
save global variable to its memory location if it is on the register. If a
variable is not on a register or memory is already coherent - no store is issued,
so performance penalty for it is minimal. Still this approach has a serious
drawback: we need to generate sync_temp explicitly. But I do not know any better
way to achieve consistency.
I'm not sure I follow. I thought we only needed the memory access when
the backend can't support the vector width operations so shouldn't have
stuff in the vector registers?
Post by Kirill Batuzov
Note that as of this RFC I have not finished conversion of ARM guest so mixing
NEON with VFP code can cause a miscompile.
The second problem is that a backend may or may not support vector operations. We
do not want each frontend to check it on every operation. I created a wrapper that
generates vector opcode if it is supported or generates emulation code.
For add_i32x4 emulation code is generated inline. I tried to make it a helper
but got a very significant performance loss (5x slowdown). I'm not sure about
the cause but I suspect that memory was a bottleneck and extra stores needed
by calling conventions mattered a lot.
So the generic helper was more API heavy than the existing NEON helpers?
Post by Kirill Batuzov
The existing constraints are good enough to express that vector registers and
general purpose registers are different and can not be used instead of each
other.
One unsolved problem is global aliasing. With general purpose registers we have
no aliasing between globals. The only example I know where registers can alias
is the x86 ah/ax/eax/rax case. They are handled as one global. With vector
registers we have NEON where an 128-bit Q register consists of two 64-bit
D registers each consisting of two 32-bit S registers. I think I'll need
to add alias list to each global listing every other global it can clobber and
then iterate over it in the optimizer. Fortunately this list will be static and not
very long.
(1) Performance. 200% speedup is a lot. My test was specifically crafted and real
life applications may not have that much vector operations on average, but
there is a specific class of applications where it will matter a lot - media
processing applications like ffmpeg.
(2) Some unification of common operations. Right now every target reimplements
common vector operations (like vector add/sub/mul/min/compare etc.). We can
do it once in the common TCG code.
Still there are some cons I mentioned earlier. The need to support a lot of
opcodes is the most significant in the long run I think. So before I commit my
time to conversion of more operations I'd like to hear your opinions if this
approach is acceptable and worth spending efforts.
Overall I'm pretty keen to explore this further. If we can get the
backend interface right and make it an easier proposition to TCG-up
various vector operations when bottlenecks arise it will be a big win.

A lot will depend on where those bottlenecks are though. If for example
the media codecs all use very arch-specific special-sauce instructions
we might never claw back that much.

I'll have a look through the patches and comment there when I've gotten
my head round the back-end issues.

Thanks for coding this up ;-)
--
Alex Bennée
Kirill Batuzov
2014-10-16 11:07:25 UTC
Post by Alex Bennée
Post by Kirill Batuzov
From Valgrind experience there are enough genericism. Valgrind can translate
SSE, AltiVec and NEON instructions to vector opcodes. Most of the opcodes are
reused between instruction sets.
Doesn't Valgrind have the advantage of same-arch->same-arch (I've not
looked at it's generated code in detail though).
Yes, they have this advantage, but Valgrind tools look at the intermediate
code in an architecture-independent way. For the tools to work, opcode
semantics need to be preserved across different architectures. For
example, Iop_QAdd16Sx4 (addition with saturation) must have the same
meaning on ARM (the vqadd.s16 instruction) and on x86 (the paddsw
instruction). So in most cases where Valgrind uses the same opcode for
instructions from different architectures, QEMU can do the same.
Post by Alex Bennée
Post by Kirill Batuzov
But keep in mind - there are a lot of vector opcodes. Much much more than
scalar ones. You can see full list in Valgrind sources
(VEX/pub/libvex_ir.h).
I think we could only approach this is in a piecemeal way guided by
performance bottlenecks when we find them.
I'm not sure this will work. In my example the larger part of the speedup comes
from the fact that I could keep values in registers and did not need to save
and load them for each vadd.i32 instruction. To be able to do that on a
real-life application we need to support as large a fraction of its vector
instructions as possible. In short: the speedup does not come from faster
emulation of a single instruction but from the interaction between sequential
guest instructions.
Post by Alex Bennée
Post by Kirill Batuzov
We can reduce the amount of opcodes by converting vector element size from part
of an opcode to a constant argument. But we will lose some flexibility offered
by the TARGET_HAS_opcode macro when target has support for some sizes but not for
others. For example SSE has vector minimum for sizes i8x16, i16x8, i32x4 but
does not have one for size i64x2.
Some implementation details and concerns.
The most problematic issue was the fact that with vector registers we have one
entity that can be accessed as both global variable and memory location. I
solved it by introducing the sync_temp opcode that instructs register allocator to
save global variable to its memory location if it is on the register. If a
variable is not on a register or memory is already coherent - no store is issued,
so performance penalty for it is minimal. Still this approach has a serious
drawback: we need to generate sync_temp explicitly. But I do not know any better
way to achieve consistency.
I'm not sure I follow. I thought we only needed the memory access when
the backend can't support the vector width operations so shouldn't have
stuff in the vector registers?
Target support for vector operations is not binary ("support all" or
"support none"). In most cases a target will support some large subset, but
some guest vector operations will still need to be emulated. In that case
we'll need to access guest vector registers as memory locations.

Scalar operations which are not supported as opcodes are very uncommon, so
a helper with a large performance overhead is a reasonable option. I'd
like to avoid such heavy helpers for vector operations because
unsupported opcodes will be more common.

Another reason is the transition from existing code to vector opcodes.
During the transition we'll have a mix of old code (access as memory) and new
code (access as globals). Doing the transition in one go is unrealistic.
Post by Alex Bennée
Post by Kirill Batuzov
Note that as of this RFC I have not finished conversion of ARM guest so mixing
NEON with VFP code can cause a miscompile.
The second problem is that a backend may or may not support vector operations. We
do not want each frontend to check it on every operation. I created a wrapper that
generates vector opcode if it is supported or generates emulation code.
For add_i32x4 emulation code is generated inline. I tried to make it a helper
but got a very significant performance loss (5x slowdown). I'm not sure about
the cause but I suspect that memory was a bottleneck and extra stores needed
by calling conventions mattered a lot.
So the generic helper was more API heavy than the existing NEON helpers?
The existing NEON implementation generates emulation code inline too. That
is how I found that my helper was slow.
--
Kirill
Kirill Batuzov
2014-11-11 11:58:46 UTC
Post by Kirill Batuzov
Post by Alex Bennée
(4) Consider supporting generic vector operations in the TCG?
I gave it a go and was quite happy with the result. I have implemented the add_i32x4
opcode which is addition of 128-bit vectors composed of four 32-bit integers
and used it to translate NEON vadd.i32 to SSE paddd instruction.
<snip>
Post by Kirill Batuzov
(1) Performance. 200% speedup is a lot. My test was specifically crafted and real
life applications may not have that much vector operations on average, but
there is a specific class of applications where it will matter a lot - media
processing applications like ffmpeg.
(2) Some unification of common operations. Right now every target reimplements
common vector operations (like vector add/sub/mul/min/compare etc.). We can
do it once in the common TCG code.
Still there are some cons I mentioned earlier. The need to support a lot of
opcodes is the most significant in the long run I think. So before I commit my
time to conversion of more operations I'd like to hear your opinions if this
approach is acceptable and worth spending efforts.
tcg: add support for 128bit vector type
tcg: store ENV global in TCGContext
tcg: add sync_temp opcode
tcg: add add_i32x4 opcode
target-arm: support access to 128-bit guest registers as globals
target-arm: use add_i32x4 opcode to handle vadd.i32 instruction
tcg/i386: add support for vector opcodes
target-arm/translate.c | 30 ++++++++++-
tcg/i386/tcg-target.c | 103 ++++++++++++++++++++++++++++++++---
tcg/i386/tcg-target.h | 24 ++++++++-
tcg/tcg-op.h | 141 ++++++++++++++++++++++++++++++++++++++++++++++++
tcg/tcg-opc.h | 13 +++++
tcg/tcg.c | 36 +++++++++++++
tcg/tcg.h | 34 ++++++++++++
7 files changed, 371 insertions(+), 10 deletions(-)
Ping? Any more comments?
--
Kirill