
8273057: [vector] New VectorAPI "SelectiveStore" #115

Closed

Conversation

JoshuaZhuwj
Member

@JoshuaZhuwj JoshuaZhuwj commented Aug 27, 2021

Hi,

I want to propose a new VectorAPI "Selective Store/Load" and share my
implementation. Currently Alibaba's internal databases are in the
process of applying VectorAPI and they have requirements on "Selective
Store" for acceleration.

My proposed VectorAPI is declared as below [1]:

    int selectiveIntoArray($type$[] a, int offset, VectorMask<$Boxtype$> m);

The active elements (those with their respective bit set in the mask) are
stored contiguously into the array "a". Assuming N is the true count of the
mask, the elements from a[offset+N] up to (but not including) a[offset+laneCount]
are left unchanged. The return value is the number of elements stored into
the array, and "offset + return value" is the offset for the next iteration.
This API will be used in the following manner [2]:

    tld.conflict_cnt = 0;
    for (int i = 0; i < ARRAY_LENGTH; i += INT_PREFERRED_SPECIES.length()) {
      IntVector av = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_input1, i);
      IntVector bv = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_input2, i);
      IntVector cv = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_index, i);
      VectorMask<Integer> mask = av.compare(VectorOperators.NE, bv);
      tld.conflict_cnt += cv.selectiveIntoArray(tld.conflict_array, tld.conflict_cnt, mask);
    }
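
For clarity, here is a scalar sketch of the semantics intended for selectiveIntoArray (my own illustration for the int case, not part of the patch):

    // Scalar equivalent of the proposed selectiveIntoArray (illustrative sketch only).
    // Lanes whose mask bit is set are packed contiguously into "a" starting at "offset";
    // positions a[offset + trueCount] .. a[offset + laneCount - 1] are left untouched.
    static int selectiveIntoArrayScalar(int[] lanes, boolean[] maskBits, int[] a, int offset) {
        int stored = 0;
        for (int lane = 0; lane < lanes.length; lane++) {
            if (maskBits[lane]) {
                a[offset + stored] = lanes[lane];
                stored++;
            }
        }
        return stored; // the caller advances its offset by this return value
    }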

My patch includes the following changes:
  * Selective Store VectorAPI for Long & Int
  * Assembler: add x86 instructions "VPCOMPRESSD" and "VPCOMPRESSQ"
  * Instruction selection: vselective_store; kmask_truecount (true count of kregister)
  * Add node "StoreVectorSelective"
  * Add a new parameter "is_selective" in inline_vector_mem_masked_operation()
    in order to distinguish the masked version from the selective version
  * jtreg cases
  * JMH benchmark

TODO parts I will implement:
  * Selective Store for other types
  * Selective Load
  * Some potential optimizations, such as: when the mask is all-true, selectiveIntoArray() -> intoArray()

Test:
  * Passed VectorAPI jtreg cases.
  * Result of JMH benchmark to evaluate API's performance in Alibaba's real scenario.
      UseAVX=3; thread number = 8; conflict data percentage: 20% (that means 20% of mask bits are true)
      http://cr.openjdk.java.net/~jzhu/8273057/jmh_benchmark_result.pdf

[1] JoshuaZhuwj@69623f7#diff-13cc2d6ec18e487ddae05cda671bdb6bb7ffd42ff7bc51a2e00c8c5e622bd55dR4667
[2] JoshuaZhuwj@69623f7#diff-951d02bd72a931ac34bc85d1d4e656a14f8943e143fc9282b36b9c76c1893c0cR144
[3] failed to inline (intrinsic) by

return false; // Implementation limitation

Best Regards,
Joshua


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-vector pull/115/head:pull/115
$ git checkout pull/115

Update a local copy of the PR:
$ git checkout pull/115
$ git pull https://git.openjdk.java.net/panama-vector pull/115/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 115

View PR using the GUI difftool:
$ git pr show -t 115

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-vector/pull/115.diff


@bridgekeeper

bridgekeeper bot commented Aug 27, 2021

👋 Welcome back jzhu! A progress list of the required criteria for merging this PR into vectorIntrinsics+mask will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Aug 27, 2021
@mlbridge

mlbridge bot commented Aug 27, 2021

Webrevs

@mlbridge

mlbridge bot commented Aug 27, 2021

Mailing list message from Paul Sandoz on panama-dev:

Hi Joshua,

Thank you for exploring this area. I am impressed at the level of knowledge you have of HotSpot.

Instead of immediately diving into code I would first prefer to discuss the design to determine if this is the best way to support your use-case. I would like to explore if there are underlying primitives from which we can compose to support this use-case.

If possible I would like to leverage the existing mask load/store primitives we have, and if necessary make some changes, rather than add more.

We already have general mask accepting scatter/gather store/load. (I have always been a bit uncertain whether we have the right signatures for these methods, and whether they are necessary if we can use shuffles.)

To use the scatter store method today for your use-case we would have to:

- compute an int[] array from the set lanes of the mask, M say
- computing the ?prefix" mask from the set lanes of M, PM
- store the vector using the int[] array and PM.

Another alternative is to:

- compute a "compression" shuffle, S, from the set lanes of the mask, M
- apply S to the vector, producing a compressed vector CV
- compute the "prefix" mask from the set lanes of M, PM
- store CV using PM

In either case the loop index value is increased by the true count of M.

The primitive I am searching for might be a way to create a shuffle from a mask.

Let's say we could write:

int[] a = ...
IntVector v = ...
VectorMask m = ...

// The new primitive, create a shuffle from the mask that partitions vector elements
// according to the set and unset mask lane elements.
VectorShuffle s = m.toPartitioningShuffle();
// Partition the elements
IntVector cv = v.rearrange(s);

// This method is likely not optimal, yet!
// Another method could be added that is prefix(int length)
VectorMask pm = m.species().indexInRange(0, m.trueCount());

// Use existing masked store
cv.intoArray(a, offset, pm);
// Increase offset by number of stores
offset += m.trueCount();
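
Putting that together with the conflict-extraction loop from the proposal, the whole thing might look roughly like this (a sketch only; toPartitioningShuffle() is the hypothetical new primitive, and the array names are borrowed from the example above):

// Sketch: the conflict-extraction loop re-expressed with the composition above.
// toPartitioningShuffle() does not exist yet; everything else is existing API.
int conflictCnt = 0;
for (int i = 0; i < ARRAY_LENGTH; i += INT_PREFERRED_SPECIES.length()) {
    IntVector av = IntVector.fromArray(INT_PREFERRED_SPECIES, int_input1, i);
    IntVector bv = IntVector.fromArray(INT_PREFERRED_SPECIES, int_input2, i);
    IntVector cv = IntVector.fromArray(INT_PREFERRED_SPECIES, int_index, i);
    VectorMask<Integer> m = av.compare(VectorOperators.NE, bv);

    VectorShuffle<Integer> s = m.toPartitioningShuffle();     // hypothetical new primitive
    IntVector compressed = cv.rearrange(s);                   // selected lanes packed to the front
    VectorMask<Integer> pm = INT_PREFERRED_SPECIES.indexInRange(0, m.trueCount());
    compressed.intoArray(conflict_array, conflictCnt, pm);    // existing masked store
    conflictCnt += m.trueCount();                             // advance by the number stored
}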

Is it possible for C2 to detect the kind of shuffle pattern and masking to sufficiently optimize? Experts please chime in!

I think this is worth exploring further, the more we can optimize the primitives, and then potentially optimize patterns of those, the more flexible we are and can avoid adding more specific functionality.

Paul.


@mlbridge

mlbridge bot commented Aug 27, 2021

Mailing list message from John Rose on panama-dev:

I think I would rather see a vector-to-vector compress operation, than a vector-to-memory operation that also includes compression. Isn't that the real underlying primitive?


@JoshuaZhuwj
Member Author

JoshuaZhuwj commented Aug 30, 2021

Hi Paul,

Thanks a lot for your quick reply and for sharing your design thinking.

We already have general mask accepting scatter/gather store/load. (I have always been a bit uncertain whether we have the right signatures for these methods, and whether they are necessary if we can use shuffles.)

To use the scatter store method today for your use-case we would have to:

  • compute an int[] array from the set lanes of the mask, M say
  • computing the ?prefix" mask from the set lanes of M, PM
  • store the vector using the int[] array and PM.

Another alternative is to:

  • compute a "compression" shuffle, S, from the set lanes of the mask, M
  • apply S to the vector, producing a compressed vector CV
  • compute the "prefix" mask from the set lanes of M, PM
  • store CV using PM

In my opinion, whether we take the scatter-store approach or the shuffle approach,
the key to this use case is the same: how to generate the "compression" shuffle or "compression" index array from the set lanes of the mask.
(The "prefix" mask can be computed from the true count of the mask.)
But when composing existing primitives, how much performance is lost compared to native instruction support, or even to the scalar version, also matters.

When I help database developers apply the Vector API, they usually care about:
  * how much performance they could gain;
  * the Vector API's ease of use, to reduce code complexity and ensure portability;
  * whether intrinsics are inlined in their scenarios (failure cases bring performance degradation).

Four vectorized data-movement primitives (selective load, selective store, gather, and scatter)
are critical SIMD operations for in-memory databases.
Currently gather and scatter are supported in the Vector API.
That's why I am helping them implement selective ops and proposing "Selective Store/Load" for the Vector API.

The primitive I am searching for might be a way to create a shuffle from a mask.

One approach I know of for creating a shuffle from a mask is to leverage a precomputed, cache-resident table.
Take a four-bit mask as an example:

mask_index | mask |  index_array
    0      | 0000 |  
    1      | 0001 |  3
    2      | 0010 |  2
    3      | 0011 |  2, 3
    4      | 0100 |  1
    5      | 0101 |  1, 3
   ...
    14     | 1110 |  0, 1, 2
    15     | 1111 |  0, 1, 2, 3

The index_array could be treated as an index map or converted into a shuffle.
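
As an illustration only (not from my patch), a 4-lane table of that shape could be built and used roughly like this; I use the convention that bit b of the mask corresponds to lane b, so the lane order differs from the table above, but the idea is the same:

    // Sketch: precomputed permutation table indexed by the mask's bit pattern.
    static final int LANES = 4;
    static final int[][] COMPRESS_INDEX = new int[1 << LANES][];
    static {
        for (int mask = 0; mask < (1 << LANES); mask++) {
            int[] idx = new int[LANES];
            int k = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    idx[k++] = lane;        // selected lanes packed to the front
                }
            }
            for (; k < LANES; k++) {
                idx[k] = LANES - 1;         // filler; covered by the prefix mask later
            }
            COMPRESS_INDEX[mask] = idx;
        }
    }

    // Usage sketch: turn the row for this mask into a shuffle and rearrange.
    // int maskBits = (int) m.toLong();
    // VectorShuffle<Integer> s = VectorShuffle.fromArray(IntVector.SPECIES_128, COMPRESS_INDEX[maskBits], 0);
    // IntVector cv = v.rearrange(s);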

Let's say we could write:

int[] a = ...
IntVector v = ...
VectorMask m = ...

// The new primitive, create a shuffle from the mask that partitions vector elements
// according to the set and unset mask lane elements.
VectorShuffle s = m.toPartitioningShuffle();
// Partition the elements
IntVector cv = v.rearrange(s);

// This method is likely not optimal, yet!
// Another method could be added that is prefix(int length)
VectorMask pm = m.species().indexInRange(0, m.trueCount());

// Use existing masked store
cv.intoArray(a, offset, pm);
// Increase offset by number of stores
offset += m.trueCount();

Is it possible for C2 to detect the kind of shuffle pattern and masking to sufficiently optimize? Experts please chime in!

I think the pattern will be too complicated and the optimization may not be reliable.

Thanks,
Joshua

@JoshuaZhuwj
Member Author

I think I would rather see a vector-to-vector compress operation, than a vector-to-memory operation that also includes compression. Isn't that the real underlying primitive?

Agree. John, thanks a lot for your review comments. This will make the primitive more friendly.

@nsjian

nsjian commented Aug 31, 2021

I think I would rather see a vector-to-vector compress operation, than a vector-to-memory operation that also includes compression. Isn't that the real underlying primitive?

Agree. John, thanks a lot for your review comments. This will make the primitive more friendly.

Yes, if we have to introduce a new API, a vector-to-vector compress operation sounds more reasonable. The Arm SVE instruction COMPACT could also do such work.

@mlbridge

mlbridge bot commented Aug 31, 2021

Mailing list message from Paul Sandoz on panama-dev:

Yes, my suggestion is that a vector-to-vector compress might be a composition of mask -> partitioning shuffle -> rearrange, such that on supported architectures it reduces down to a single instruction. In combination with a store and prefix mask it may be possible to reduce further to single instruction accepting the source vector, mask, and a memory location.

That may be wishful thinking, however we have made significant improvements to the optimization of shuffles and masking, which gives me some hope.

I think we should give it some more thought (on the C2 heroics required or not, whether we can internally classify certain kinds of shuffle, etc.) before committing to a more specific/specialized operation, such as say:

- Vector.compress(VectorMask<>); or perhaps
- a new class of operators, rearrange operators, whose behaviors are documented with regards to mask and non-mask variants.
(It's tempting to create a special unary lanewise operator, whose non-mask variant returns the input. But that would be a misuse.)

Paul.


@JoshuaZhuwj
Member Author

Thanks to Paul, John Rose and Ningsheng for your comments.

A vector-to-vector compress operation is the friendlier primitive.
Both the Intel AVX3 instruction "COMPRESS" and the Arm SVE instruction "COMPACT" provide this capability.
Hence selective store could be implemented as:

    VectorMask<Integer> mask = ...;
    IntVector bv = av.compress(mask);
    VectorMask<Integer> prefixMask = prefix(mask.trueCount());
    bv.intoArray(array, offset, prefixMask);
    offset += mask.trueCount();

The vector-to-vector compress primitive, together with a store and a prefix mask, could be further optimized into the memory-destination version of compress on supported architectures.

On architectures that do not support vector-to-vector compress natively,
the intrinsic bails out and the Java path is taken.
The Java path could be composed of "mask -> shuffle -> rearrange", and C2 will then try to inline intrinsics again for these API calls.
In this way, we are also able to implement scatter/gather store/load in the same compositional manner to keep the design consistent.

Best Regards,
Joshua

@mlbridge

mlbridge bot commented Sep 2, 2021

Mailing list message from Paul Sandoz on panama-dev:

Hi Joshua,

I think we still have some exploring to do on the design, and for others to comment esp. with regards to C2 capabilities.

Here is another design alternative, in the spectrum between a partitioning/compressing shuffle derived from a mask [*] and a compress method:

VectorMask<Integer> mask = ...;
IntVector bv = av.rearrange(VectorOperator.COMPRESS, mask);
VectorMask<Integer> prefixMask = prefix(mask.trueCount());
bv.intoArray(array, offset, prefixMask);
offset += mask.trueCount();

We introduce a new kind of operator, Rearrange, and constants, such as VectorOperator.COMPRESS that specifies behavior of the non-mask and mask accepting rearrange methods. COMPRESS specifies that:

1) the non-mask rearrange is an identity operation; and
2) the mask accepting rearrange describes mask-based cross-lane movement:

It should be possible to create a shuffle from a Rearrange operator, with and without a mask, so the equivalent functionality can be applied to a shuffle-accepting rearrange, e.g., for COMPRESS:

rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), mask.prefix())
Or
rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), zero())
// Postfix of exceptional lanes in the shuffle, representing unset lanes

For this to be of benefit we would need to come up with other realistic Rearrange operators, even if we do not add them right now e.g. IOTA, REVERSE, PARTITION_TRUE, PARTITION_FALSE, INTERLEAVE.

However, the design is a little awkward since the mask may or may not contribute to cross-lane movement, and so the operator needs to state the equivalence.

In effect the Rearrange operator is a mechanism to refer to certain kinds of shuffle as a constant. Ideally I would still prefer if we could implicitly identify what would otherwise be rearrange operators based on the creation of shuffles with known content e.g. can C2 somehow tag a shuffle instance with an ID of COMPRESS with a dependency on the mask used for its creation?

...

FWIW another way to think about a partitioning/compression shuffle:

SPECIES.iota().compress(m);

Which is just a specific way of shuffling a shuffle. We could actually track the kinds of shuffle as final fields of the VectorShuffle implementation.

Paul.

[*] We could consider this independently

@mlbridge

mlbridge bot commented Sep 2, 2021

Mailing list message from John Rose on panama-dev:

On Sep 2, 2021, at 9:50 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

Hi Joshua,

I think we still have some exploring to do on the design, and for others to comment esp. with regards to C2 capabilities.

Here is another design alternative between the spectrum of a partitioning/compressing a shuffle from mask [*] and a compress method:

VectorMask<Integer> mask = ...;
IntVector bv = av.rearrange(VectorOperator.COMPRESS, mask);
VectorMask<Integer> prefixMask = prefix(mask.trueCount());
bv.intoArray(array, offset, prefixMask);
offset += mask.trueCount();

We introduce a new kind of operator, Rearrange, and constants, such as VectorOperator.COMPRESS that specifies behavior of the non-mask and mask accepting rearrange methods. COMPRESS specifies that:

1) the non-mask rearrange is an identity operation; and
2) the mask accepting rearrange describes mask-based cross-lane movement:

Be careful with adding bulk operations to VectorOperations.
Those are supposed to be only lanewise (scalar). It's not
clear to me what COMPRESS could mean, as a lanewise
operation.

One very interesting potential interaction between lanewise
ops and masked operations is reduction of a subset of lanes.
This feels a little like compress, although the lateral data
motion is absent. I'm thinking that generalized shuffle,
lateral compression/expansion, and segmented scan/reduce
may be the interesting primitives here.

The FIRST_NONZERO operation is the closest thing we have
to lateral motion today: The idea is that non-zero lane values
propagate from higher to lower positions, skipping over zero
lane values. We might try to get from FIRST_NONZERO to
compress by defining an enhanced segmented reduction
operation which takes into account masking (to define
segments that are to be individually reduced), and then
take additional wins from using other VOs such as MAX,
ADD, etc. I'm throwing this out as brainstorming; what
I get from this line of thought is that segmented scans
naturally deliver results in expanded form, while
segmented reductions have two natural forms of
output, expanded and compressed. So we need a
compress primitive, and/or an expand/compress
mode bit for segmented reduction.

(Matters of terminology: scan(op) produces the
vector of partial results that would be obtained
if a reduction(op) operation proceeded linearly
from lane 0 to lane VLENGTH-1, recording
an intermediate result in each lane, often
before lane value is accumulated. The scan
can take a carry-in and carry-out lane value
to allow multiple vectors to compose.
Both reductions and scans can usefully
be segmented, again with carry-in and carry-out.
Segmented operations are useful for fast
group-by modes of operation, and also for
vectorized parsing, such as var-ints or
even text. Compress and expand are independent
primitives, both driven by masks; such masks
can as noted above be related to segmentation
masks.)

It should be possible to create a shuffle from a Rearrange operator, with and without a mask, so the equivalent functionality can be applied to a shuffle-accepting rearrange, e.g., for COMPRESS:

rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), mask.prefix())
Or
rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), zero())
// Postfix of exceptional lanes in the shuffle, representing unset lanes

I'm not following this, because I'm having trouble imagining
ops like ADD and MAX here, as well as imagining COMPRESS
as a VO. (I must be missing something crucial here.)

For this to be of benefit we would need to come up with other realistic Rearrange operators, even if we do not add them right now e.g. IOTA, REVERSE, PARTITION_TRUE, PARTITION_FALSE, INTERLEAVE.

I guess what I'm missing is that this is a whole new kind of OP,
not a lanewise one. It probably should live on Shuffle, since
VOs are documented to be about lanewise ops only.

However, the design is a little awkward since the mask may or may not contribute to cross-lane movement, and so the operator needs to state the equivalence.

In effect the Rearrange operator is a mechanism to refer to certain kinds of shuffle as a constant.

Yes, and you are exploring the space of plausible constants.
I think they (or at least some of them) could also be factorized
as *families* of constants, indexed by masks.

Ideally I would still prefer if we could implicitly identify what would otherwise be rearrange operators based on the creation of shuffles with known content e.g. can C2 somehow tag a shuffle instance with an ID of COMPRESS with a dependency on the mask used for its creation?

Siting the compress (and expand!) functionality on shuffle instead
of vector is a nice story, but I think it's better to site it on general
vectors, as distinct from shuffling, because of its relation to
segmented data-parallel operations (which are distinct from
permutations, since they never swap the order of selected
lanes).

So, I see three primitive notions: Segmentation, compress/expand,
and permutation. You can bridge to and from permutation simply
by working with index vectors like iota, and perhaps (as sugar) lifting
selected vector operations to shuffles.

...

FWIW another way to think about a partitioning/compression shuffle:

SPECIES.iota().compress(m);

Yes, that's the bridging I just mentioned. Some applications
would expand as well as compress. (I know it's a one-way
street for SVE and maybe others but fear not; expand is
easier to implement than compress, using a plain
permutation.)

For example, if parsing a file of var-ints does
segmentation (based on (lane&0x80)=0x80 masking)
and subsequent compression to 32-bit integer lanes, the
reverse operation of un-parsing would use an expansion
of 32-bit integer lanes, to be reduced to bytes and stored
to the var-int file.

Which is just a specific way of shuffling a shuffle. We could actually track the kinds of shuffle as final fields of the VectorShuffle implementation.

Yes.

If a machine lacks a compress and/or expand operation,
we'd want Java paths that would compute appropriate
shuffles and then apply them, with constant folding
when appropriate. (Although masks are usually not
constant.)

Sketch of algorithm to derive a compression shuffle
from a mask:

Take VLENGTH=5 just for example and suppose we
want to compress <4:v, 3:w, 2:x, 1:y, 0:z> with mask
0b01001. The desired output is then <0,0,0,w,z>.

1. For each masked lane (in a fresh index vector)
compute the number of masked lanes at lower
indexes. For example, <2, 1, 1, 1, 0> from 0b01001.
This can be done in log time, and hardware can
sometimes help. The operation naturally yields
popcnt(m) (2 in the example) as a carry-out from
a log-scan. Note that the lowest lane always gets
zero. Variants of this algorithm might use a carry-in
to put something other than zero there. Note also
that the resulting index vector, used to shuffle
the input vector, performs an *expand* which is
the inverse of the compress we are seeking.

2. (Optional but useful.) Compute a complementary
second fresh index vector from the inverted mask,
added to the bit-count of the first mask. For example,
<2, 2, 1, 0, 0> from 0b10110 (~0b01001), and then
(adding the bitcount 2) <4, 4, 3, 2, 2>. Note that
this index vector would perform an expand, to
the left-hand (high) side of the vector, of the vector
we are trying to compress.

3. Under the mask, blend the index vector from step 1
with that of step 2, or with a constant vector of value
VLENGTH-1. For example, blending from 0b01001,
either <4, 1, 3, 2, 0> or (lacking step 2) <4, 1, 4, 4, 0>.

4. Create an in-memory buffer of the same size and
shape as the vector to be reduced. Use the blended
index to perform an indexed store from the input
vector to the memory buffer. This uses the hardware
memory store operation, on a temp buffer, to create
the desired compress. (It should be done in the VPU
not in the MU, of course, but sometimes this might
be the best way.)

5. Load the buffer back into a vector register, and
set the unselected lanes to whatever value is desired,
either zero, or (perhaps) a left-packed set of the
non-compressed values. (This is called the SAG
or "sheep and goats" operation, when applied to
a bit-vector, and it is very useful for other vectors
as well.)

Step 2 may be desirable even if you are not doing a SAG,
if the indexed store hardware will go slower due
to collisions on the last lane (index 4 in the example;
note that only that index gets repeated).

Note that steps 1 and 2 can be done in superscalar parallel.
If step 2 is done with a from-left scan, there is no need to
add in the correction value, of popcnt(m).
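
In plain scalar Java, the arithmetic of steps 1-3 on the running example looks like this (illustration only, not an implementation):

// Scalar illustration of steps 1-3 for VLENGTH = 5 and mask = 0b01001.
// Lane 0 is the least-significant mask bit; arrays below are written lane 0 first.
int vlen = 5;
int mask = 0b01001;                          // lanes 0 and 3 selected
int popcnt = Integer.bitCount(mask);         // = 2

int[] step1 = new int[vlen];                 // selected lanes below each lane
int[] step2 = new int[vlen];                 // unselected lanes below, plus popcnt
for (int lane = 0, sel = 0, unsel = 0; lane < vlen; lane++) {
    step1[lane] = sel;
    step2[lane] = unsel + popcnt;
    if ((mask & (1 << lane)) != 0) sel++; else unsel++;
}
// step1 = [0, 1, 1, 1, 2]   (i.e. <2, 1, 1, 1, 0> written high lane first)
// step2 = [2, 2, 3, 4, 4]   (i.e. <4, 4, 3, 2, 2> written high lane first)

int[] blended = new int[vlen];               // step 3: blend under the mask
for (int lane = 0; lane < vlen; lane++) {
    blended[lane] = ((mask & (1 << lane)) != 0) ? step1[lane] : step2[lane];
}
// blended = [0, 2, 3, 1, 4]  (i.e. <4, 1, 3, 2, 0>); an indexed store through these
// indices places the two selected lanes at positions 0 and 1, the compress we want.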

I don't know a great way to avoid a trip to a memory
buffer, if the hardware does not supply a compress
operation in the VPU. But here's an almost-great
way:

4b. Having computed the non-colliding permutation
vector E (in steps 2 and 3, as if for SAG), use math to
compute the inverse permutation vector C. E is an
expansion as noted above while its inverse C is the
corresponding compression. In the running example,
E is <4, 1, 3, 2, 0> and its inverse C will be <4, 2, 1, 3, 0>.

5b. Use the standard shuffle (permutation) operation
to reorder the input vector. Adjust uncompressed
fields appropriately if SAG is not desired (zero them
using the mask or whatever).

4b. Math details: Sort the merged E so as to transform
it into the iota vector, while performing *exactly the
same* swaps and reorderings on an iota vector. As
E sorts to iota, the other copy of iota reorders to C.
Perhaps the easiest way to do this is to create an index
vector A with double-width lanes, and pack each lane
A[i] with (E[i] * VLENGTH + i). (The "+i" is "+iota[i]".)
The double-width lanes of A are "packets" to be "routed"
to the "addresses" in the high half of the lane, and the
"payload" of each packet is the eventual index in C.
Butterfly network, aka binary radix sort, will do the job
in log time.

(Hey hardware vendors, wouldn't it be nice to supply
us users with a keyed sort operation, which takes a
vector of "addresses" and a different vector of "values"
and reorders the "values" according to the sorted
order of the "addresses"? That is the proper inverse
to shuffle/permute primitives. You'd build a keyed
routing network in silicon much like the ones that
implement shuffle/permute.)

5c. The step 5a can be merged into 4b in some cases,
if the original input vector lanes are small enough to
fit into the ?packets? in the A vector being sorted in
4b. In that case, there?s no need to finish with a
shuffle/permutation, since the actual data has already
been moved where it belongs. All you need to do
is strip off the "address headers" of the "packets",
by means of a reshaping truncation.

This shows that a possible primitive is the anti-shuffle.
(Anti-permutation.) Anti-shuffle is to shuffle as
store is to load. Store is harder than load because
where load has duplications store has collisions.
Collisions probably want to be handled with
reduction operations. (This takes me back to
my Connection Machine and C* days... The
C* operation p->n += x causes each processor's
x value to be applied as an increment to the n
value on each friend processor p, where p is
basically a pointer as in C. The operations
take place in an indeterminate order but
all x values are summed into each distinct n
variable. Also, "int y = p->n += x" captures
partial sums, so it's a scan. It's all "as if"
executed sequentially, but in fact executes
in log time, like vector reductions and
permutations.) An anti-shuffle is really
a 1-1 permutation (to the expanded "collision
buckets" in order), followed by a segmented
reduction, followed by a compress. So
(this is brainstorming!) the anti-shuffle
is really another way to dress up the other
primitives I'm discussing.

Ideas of the morning; HTH? :-)

@mlbridge

mlbridge bot commented Sep 2, 2021

Mailing list message from John Rose on panama-dev:

On Aug 31, 2021, at 9:44 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

Yes, my suggestion is that a vector-to-vector compress might be a composition of mask -> partitioning shuffle -> rearrange, such that on supported architectures it reduces down to a single instruction. In combination with a store and prefix mask it may be possible to reduce further to single instruction accepting the source vector, mask, and a memory location.

As I argued in my previous, it may be just as well
to think of compress as its own primitive, even if
under the covers it is implemented using shuffle.

I think it's worth thinking more about anti-shuffle,
what that would be like. Mathematical permutations
do not come in two kinds, but shuffles and anti-shuffles
are distinct because only the former duplicate and only
the latter collide.

@mlbridge

mlbridge bot commented Sep 2, 2021

Mailing list message from John Rose on panama-dev:

On Sep 2, 2021, at 11:33 AM, John Rose <john.r.rose at oracle.com> wrote:

The double-width lanes of A are "packets" to be "routed"
to the "addresses" in the high half of the lane, and the
"payload" of each packet is the eventual index in C.
Butterfly network, aka binary radix sort, will do the job
in log time.

I forgot to point out that some hardware has min/max
operations which can perform the conditional swap
efficiently. So log(VLENGTH) min/max swaps, plus
classic butterfly-like shuffles to bring the pairs into
proximity as needed, does the job as well.

Also, these "packets" were, back in the day, literal
bit-serial packets on the Connection Machine, with
up to 16-bit addresses. The microcode would worm
the data around simultaneously through a hypercube
network, getting the job done in log time. The
weakness of the scheme is, of course, implementing
the binary N-cube network in 3-space. Eventually
you run out of places to put the wires.

@mlbridge

mlbridge bot commented Sep 3, 2021

Mailing list message from Viswanathan, Sandhya on panama-dev:

Today we have rearrange, slice and unslice methods to do cross lane movements.
It looks to me that the best way is to extend this and provide compress and expand as primitives. It seems more natural from the programmer's perspective and makes it easy to do good code gen.
Doing it as a special-purpose rearrange with a mask could be confusing, as in the current API rearrange with a mask has a different meaning. It is also not clear how the backend implementation would then achieve the desired single-instruction code gen.
As John suggests, on architectures that don't support compress/expand, the underlying compress/expand could be implemented in terms of mask -> partitioning shuffle -> rearrange.

Best Regards,
Sandhya


@mlbridge

mlbridge bot commented Sep 3, 2021

Mailing list message from Paul Sandoz on panama-dev:

Thanks John, and Sandhya for also commenting.

You both rightly pointed out the weakness of using operators and rearrange :-) it does fit right.

John, your observation on order really stood out to me. I can see how a prefix-sum might behave with a mask describing the selection of lanes *and* compression of the result (no intermediate values, either from the input or zero).

In summary, from the discussion, compress/expand are:

- important conceptually, even if the same functionality could be composed from shuffles (such as used by an implementation); and

- at the right level to reliably optimize on supporting hardware.

So specification-wise we introduce expanding/compressing cross-lane operations. API-wise I prefer two distinct methods rather than one that accepts a boolean indicating expansion or compression. We can declare one intrinsic method in VectorSupport.

Paul.


@mlbridge

mlbridge bot commented Sep 3, 2021

Mailing list message from Paul Sandoz on panama-dev:

On Sep 3, 2021, at 1:11 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

Thanks John, and Sandhya for also commenting.

You both rightly pointed out the weakness of using operators and rearrange :-) it does fit right.

^
It does *not* fit right.

@mlbridge

mlbridge bot commented Sep 4, 2021

Mailing list message from John Rose on panama-dev:

On Sep 3, 2021, at 1:11 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

...
John, your observation on order really stood out to me. I can see how a prefix-sum might behave with a mask describing the selection of lanes *and* compression of the result (no intermediate values, either from the input or zero).

Yes, this whole thread is a very good exercise in what we
call "find the primitive". Which means taking some use
cases, and some hardware capabilities, and some known
optimization techniques, and deciding how to factor
down the use cases into uses of the latter capabilities and
techniques, such that (a) the use cases are well optimized
and (b) a naturally wide set of future use cases are supported.

(And by "naturally" I mean the chosen primitives directly
give access to the capabilities and techniques, so that
new use cases "grow out of" the existing primitives
without additional deep insights.)

In summary, from the discussion, compress/expand are:

- important conceptually, even if the same functionality could be composed from shuffles (such as used by an implementation); and

Compression can be composed from *colliding* shuffles,
which are a special kind of thing I call an anti-shuffle.
I suspect *that* is the really interesting primitive here.

- at the right level to reliably optimize on supporting hardware.

So specification-wise we introduce expanding/compressing cross-lane operations. API-wise I prefer two distinct methods rather than one that accepts a boolean indicating expansion or compression. We can declare one intrinsic method in VectorSupport.

Yes, a function and its inverse are usually not best accessed
by a boolean option that says "select the inverse". Partly
this is because the edge effects and exceptions (such as
collisions for anti-shuffle but not shuffle) make the types
of the function and its inverse subtly different.

@mlbridge

mlbridge bot commented Sep 4, 2021

Mailing list message from John Rose on panama-dev:

On Sep 3, 2021, at 1:11 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

I can see how a prefix-sum might behave with a mask describing the selection of lanes *and* compression of the result (no intermediate values, either from the input or zero).

Prefix sums (aka scans) are surprisingly powerful.

- scans can use any associative operator (not just sum: min, max, first-non-zero, etc.)
- scans can be implemented in log time (reduction "up-sweep" and computation "down-sweep" on a lane spanning tree)
- come in "inclusive" and "exclusive" form (aka Hillis/Steele and Blelloch; Blelloch calls his a "prescan")
- scale naturally to arbitrary lengths
- partition naturally into vectorizable sub-lengths (N.B. requires carry-in and carry-out feature!)
- provide a good primitive for "nested SIMD" [1]
- often provide useful "steering" information for subsequent loads/stores or shuffles/anti-shuffles [2]

[1] To perform a data-parallel/SIMD operation simultaneously on N SIMD data sets {S[i]}, first concatenate them all into a single SIMD data set G (unravel an array into a stream of values, for example), keeping elements in the same original data set contiguous in the new data set, and add a new boolean vector, a "boundary mask" which marks boundaries of the original data sets within the new data set. (This does not need a group field in [1..N], just a boolean, because groups are contiguous in the big set.) Do SIMD on the whole thing. When performing scans or reduces, use the segmented variation of the scan or reduce, which consults the boundary mask and prevents carry-ins and carry-outs across boundaries. Reductions and carry-outs from up-sweeps go into sparse SIMD data sets which can be quickly compressed to N-vectors, carrying the per-group results. Collecting per-group results is where compress shines. The procedure outlined here is very robust across group sizes: It basically works the same, and with the same efficiency, whether N=1 or N=|G|, and regardless of the statistics of the group sizes |S[i]|.

[2] When you have an ordering problem, like vectorized sort or compress, look first for a scan pre-pass that could use to steer a data-movement pass. I find this often clarifies the problem and suggests new vectorization opportunities.

@mlbridge

mlbridge bot commented Sep 4, 2021

Mailing list message from John Rose on panama-dev:

P.S. Some googly references that seem useful for me:

https://en.wikipedia.org/wiki/Prefix_sum
https://www.cs.princeton.edu/courses/archive/fall21/cos326/lec/21-02-parallel-prefix-scan.pdf
https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf (from Connection Machine days, still relevant if you translate terms)

You can find your own easily, of course. I suppose there are plenty of GPU people who have rediscovered this stuff recently. Appel traces the basics back to 1977.

So here's a basic tool for our toolkit: Watch out for segmented scans and reductions, even in disguise (say, as nested or grouped parallelism). Use them to turn brute-force iteration into log-N data-parallel operations. (Will the hardware reward your rewrite of the algorithm? One may hope... Sometimes it does.)

@JoshuaZhuwj
Member Author

JoshuaZhuwj commented Sep 6, 2021

In summary, from the discussion, compress/expand are:

  • important conceptually, even if the same functionality could be composed from shuffles (such as used by an implementation); and

  • at the right level to reliably optimize on supporting hardware.

So specification-wise we introduce expanding/compressing cross-lane operations. API-wise I prefer two distinct methods rather than one that accepts a boolean indicating expansion or compression. We can declare one intrinsic method in VectorSupport.

Thanks, Paul, for the deep design thinking.
John Rose's encompassing knowledge impresses me a lot.
Thanks also to Sandhya and Ningsheng for the comments.

I will refactor my code and implement these two cross-lane data movement primitives: mask-based compression & expansion.
They will work on general vectors and be declared as:

    $abstractvectortype$ compress(VectorMask<$Boxtype$> m);
    $abstractvectortype$ expand(VectorMask<$Boxtype$> m);

These two vector-to-vector operations, together with a store/load and a prefix mask, could be further optimized into a single memory-version instruction on supported architectures.
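
For illustration, with these two primitives a selective load could then be sketched as follows (my own sketch, combining the proposed expand with the existing masked load and a prefix mask; the array and offset names are just placeholders):

    // Sketch only: selective load built from the proposed expand primitive.
    // Loads mask.trueCount() contiguous elements and spreads them into the
    // lanes selected by the mask; the unselected lanes become zero.
    VectorMask<Integer> mask = ...;
    VectorMask<Integer> prefixMask = INT_PREFERRED_SPECIES.indexInRange(0, mask.trueCount());
    IntVector packed = IntVector.fromArray(INT_PREFERRED_SPECIES, array, offset, prefixMask);
    IntVector expanded = packed.expand(mask);   // proposed primitive
    offset += mask.trueCount();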

@jatin-bhateja
Member

In summary, from the discussion, compress/expand are:

  • important conceptually, even if the same functionality could be composed from shuffles (such as used by an implementation); and
  • at the right level to reliably optimize on supporting hardware.

So specification-wise we introduce expanding/compressing cross-lane operations. API-wise I prefer two distinct methods rather than one that accepts a boolean indicating expansion or compression. We can declare one intrinsic method in VectorSupport.

Thanks, Paul, for the deep design thinking.
John Rose's encompassing knowledge impresses me a lot.
Thanks also to Sandhya and Ningsheng for the comments.

I will refactor my code and implement these two cross-lane data movement primitives: mask-based compression & expansion.
They will work on general vectors and be declared as:

    $abstractvectortype$ compress(VectorMask<$Boxtype$> m);
    $abstractvectortype$ expand(VectorMask<$Boxtype$> m);

These two vector-to-vector operations, together with a store/load and a prefix mask, could be further optimized into a single memory-version instruction on supported architectures.

Hi @JoshuaZhuwj ,

While adding new macro-level APIs is appealing, we can also extend the following existing Vector APIs to accept another boolean flag "is_selective" under which compression/expansion triggers. In this use case it's difficult to infer COMPRESSION through the auto-vectorizer, though we made attempts in the past to infer complex loop patterns for VNNI instructions.

public static IntVector fromArray(VectorSpecies<Integer> species,
                                  int[] a,
                                  int offset,
                                  VectorMask<Integer> m)

public final void intoArray(int[] a,
                            int offset,
                            VectorMask<Integer> m)

This way we can also share common optimizations, as you suggested earlier, to convert a masked COMPRESS to an unmasked vector move for an ALLTRUE mask; some work [1][2] is already in place on this front.

Best Regards,
Jatin

[1] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L752
[2] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L771
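
At the Java level the degenerate case those transformations target looks like the following (a sketch, not code from this PR or from vectornode.cpp): with an all-true mask the masked store collapses to the plain store, and the same folding would apply to a compress under an ALLTRUE mask.

    // Sketch only: a masked store under an all-true mask is equivalent to the
    // unmasked store, which is the case the Ideal transformations above fold away.
    static void storeAllTrue(IntVector v, int[] a, int offset) {
        VectorMask<Integer> all = v.species().maskAll(true);  // ALLTRUE mask
        v.intoArray(a, offset, all);                          // same effect as v.intoArray(a, offset)
    }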

@JoshuaZhuwj
Copy link
Member Author

While adding new macro-level APIs is appealing, we could also extend the following existing Vector APIs to accept an additional boolean flag "is_selective" under which compression/expansion triggers.

public static IntVector fromArray(VectorSpecies<Integer> species,
                                  int[] a,
                                  int offset,
                                  VectorMask<Integer> m)

public final void intoArray(int[] a,
                            int offset,
                            VectorMask<Integer> m)

Per the design discussion in this thread, compared to a vector-to-memory operation, the vector-to-vector compress/expand operation is the friendlier primitive.
It can also be used to "bridge to and from permutation simply by working with index vectors like iota, and perhaps (as sugar) lifting selected vector operations to shuffles."
On some architectures, such as SVE, the memory-destination version is not supported natively either.
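
For illustration only (assuming the proposed compress plus the existing addIndex, toShuffle and rearrange APIs; not code from this PR), compressing an iota index vector under the mask yields the selected source-lane indices, which can then drive a permutation; this is the sense in which compress bridges to shuffles:

    // Sketch: compress an index vector to obtain the active source-lane indices,
    // then use them as a shuffle. Tail lanes of the shuffle are don't-care here.
    static IntVector compressViaIndices(IntVector v, VectorMask<Integer> m) {
        IntVector iota = (IntVector) IntVector.zero(v.species()).addIndex(1); // [0, 1, 2, ...]
        IntVector selected = iota.compress(m);                // proposed compress: packed active indices
        VectorShuffle<Integer> perm = selected.toShuffle();
        return v.rearrange(perm);                             // active elements moved to the front
    }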

In this use case it is difficult to infer COMPRESSION through the auto-vectorizer, though we have made attempts in the past to infer complex loop patterns for VNNI instructions.

Could you elaborate on it please? I do not follow this.

This way we can also share common optimizations, as you suggested earlier, to convert a masked COMPRESS to an unmasked vector move for an ALLTRUE mask; some work [1][2] is already in place on this front.

[1] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L752
[2] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L771

Yes. Since the compress/expand op is also mask-based, this piece of optimization is common. Maybe we can think of a way to share this optimization across the different kinds of masked operations?

@jatin-bhateja
Copy link
Member

While adding new macro-level APIs is appealing, we could also extend the following existing Vector APIs to accept an additional boolean flag "is_selective" under which compression/expansion triggers.

public static IntVector fromArray(VectorSpecies<Integer> species,
                                  int[] a,
                                  int offset,
                                  VectorMask<Integer> m)

public final void intoArray(int[] a,
                            int offset,
                            VectorMask<Integer> m)

Per the design discussion in this thread, compared to a vector-to-memory operation, the vector-to-vector compress/expand operation is the friendlier primitive.
It can also be used to "bridge to and from permutation simply by working with index vectors like iota, and perhaps (as sugar) lifting selected vector operations to shuffles."
On some architectures, such as SVE, the memory-destination version is not supported natively either.

In this use case it is difficult to infer COMPRESSION through the auto-vectorizer, though we have made attempts in the past to infer complex loop patterns for VNNI instructions.

Could you elaborate on it please? I do not follow this.

I meant that auto-vectorizing the following loop, which mimics compression semantics, could be tricky when it is difficult to ascertain the independence between memory references. In the following case 'j' could be an index in the middle of the array, and thus, if the distance between the memory references is less than the chosen vector width, it may result in an incorrect overwrite.

for (int i = 0; i < n; i++) {
    if (mask[i] > 1) {
        a[j++] = a[i];
    }
}
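
For contrast, a hand-written Vector API version (a sketch only, assuming the compress method proposed in this thread together with the existing compare, trueCount, indexInRange and masked intoArray APIs, with the usual jdk.incubator.vector imports) sidesteps that dependence problem, because each chunk is fully read before its compressed prefix is stored at j <= i:

    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Sketch only: explicit compression of a[] under mask[] > 1, matching the
    // scalar loop above. Reads happen before the masked store in each iteration.
    static int compressLoop(int[] a, int[] mask, int n) {
        int j = 0;
        int i = 0;
        for (; i <= n - SPECIES.length(); i += SPECIES.length()) {
            IntVector av = IntVector.fromArray(SPECIES, a, i);
            IntVector mv = IntVector.fromArray(SPECIES, mask, i);
            VectorMask<Integer> m = mv.compare(VectorOperators.GT, 1);
            IntVector packed = av.compress(m);                 // proposed cross-lane compress
            packed.intoArray(a, j, SPECIES.indexInRange(0, m.trueCount()));
            j += m.trueCount();
        }
        for (; i < n; i++) {                                   // scalar tail
            if (mask[i] > 1) {
                a[j++] = a[i];
            }
        }
        return j;
    }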

This way we can also share common optimizations, as you suggested earlier, to convert a masked COMPRESS to an unmasked vector move for an ALLTRUE mask; some work [1][2] is already in place on this front.
[1] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L752
[2] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L771

Yes. Since the compress/expand op is also mask-based, this piece of optimization is common. Maybe we can think of a way to share this optimization across the different kinds of masked operations?

Yes, agree.

@mlbridge
Copy link

mlbridge bot commented Sep 7, 2021

Mailing list message from Paul Sandoz on panama-dev:

Hi Joshua,

Thank you for your patience as we went through deeper design thinking.

Would it be possible for you to target this work to a new branch? e.g. vectorIntrinsics+compress.

I want to avoid complicating review and integration of the masking work for JEP 417. We are close to merging vectorIntrinsics+mask into vectorIntrinsics and then starting reviews of the masking support.

It may be possible that compress/expand become part of JEP 417, if we can get it ready in time. If so I can update the JEP accordingly and help do CSRs. Otherwise, we can target to the next round of incubation.

Paul.

@JoshuaZhuwj
Copy link
Member Author

Would it be possible for you to target this work to a new branch? e.g. vectorIntrinsics+compress.

I want to avoid complicating review and integration of the masking work for JEP 417. We are close to merging vectorIntrinsics+mask into vectorIntrinsics and then starting reviews of the masking support.

Yes, of course. Thank you, Paul. Could you please help me check out a new branch based on "vectorIntrinsics+mask" or "vectorIntrinsics" after merging?

It may be possible that compress/expand become part of JEP 417, if we can get it ready in time. If so I can update the JEP accordingly and help do CSRs. Otherwise, we can target to the next round of incubation.

I'm fine with the plan to place compress/expand in the next round of incubation. I'm afraid I cannot catch the last bus before the review of the masking support, as I have other work at hand.

@JoshuaZhuwj
Copy link
Member Author

In this use case it is difficult to infer COMPRESSION through the auto-vectorizer, though we have made attempts in the past to infer complex loop patterns for VNNI instructions.

Could you elaborate on it please? I do not follow this.

I meant that auto-vectorizing the following loop, which mimics compression semantics, could be tricky when it is difficult to ascertain the independence between memory references. In the following case 'j' could be an index in the middle of the array, and thus, if the distance between the memory references is less than the chosen vector width, it may result in an incorrect overwrite.

for (int i = 0; i < n; i++) {
    if (mask[i] > 1) {
        a[j++] = a[i];
    }
}

Agree. I think the current SWPointer comparison (used to detect possibly identical addresses) does not apply to this pattern.
In this case, beyond the SWPointer comparison, the more critical problem for auto-vectorization is that 'j' cannot be inferred from the induction variable 'i'.

Do you have a plan to support masked operations in the SLP optimization?

@mlbridge
Copy link

mlbridge bot commented Sep 8, 2021

Mailing list message from Paul Sandoz on panama-dev:

On Sep 8, 2021, at 12:03 AM, Joshua Zhu <jzhu at openjdk.java.net> wrote:

On Tue, 7 Sep 2021 16:35:19 GMT, Paul Sandoz <paul.sandoz at oracle.com> wrote:

Would it be possible for you to target this work to a new branch? e.g. vectorIntrinsics+compress.

I want to avoid complicating review and integration of the masking work for JEP 417. We are close to merging vectorIntrinsics+mask into vectorIntrinsics and then starting reviews of the masking support.

Yes, of course. Thank you, Paul. Could you please help me check out a new branch based on "vectorIntrinsics+mask" or "vectorIntrinsics" after merging?

Yes, once vectorIntrinsics+mask is merged into vectorIntrinsics I will create the branch vectorIntrinsics+compress off vectorIntrinsics.

It may be possible that compress/expand become part of JEP 417, if we can get it ready in time. If so I can update the JEP accordingly and help do CSRs. Otherwise, we can target to the next round of incubation.

I'm fine with the plan to place compress/expand in the next round of incubation. I'm afraid I cannot catch the last bus before the review of the masking support, as I have other work at hand.

Ok.

Paul.

@jatin-bhateja
Copy link
Member

In this use case it is difficult to infer COMPRESSION through the auto-vectorizer, though we have made attempts in the past to infer complex loop patterns for VNNI instructions.

Could you elaborate on it please? I do not follow this.

I meant that auto-vectorizing the following loop, which mimics compression semantics, could be tricky when it is difficult to ascertain the independence between memory references. In the following case 'j' could be an index in the middle of the array, and thus, if the distance between the memory references is less than the chosen vector width, it may result in an incorrect overwrite.

for (int i = 0; i < n; i++) {
    if (mask[i] > 1) {
        a[j++] = a[i];
    }
}

Agree. I think the current SWPointer comparison (used to detect possibly identical addresses) does not apply to this pattern.
In this case, beyond the SWPointer comparison, the more critical problem for auto-vectorization is that 'j' cannot be inferred from the induction variable 'i'.

A PhiNode will certainly be created for 'j', and it is indeed an induction variable, since its value is incremented on each iteration. However, it is not a secondary induction variable in this case, which makes any dependency analysis between array references indexed by the different subscripts 'i' and 'j' difficult.

Do you have a plan to support masked operations in the SLP optimization?

Given that the SLP algorithm does not work across basic blocks, a simplified approach would be to add an ideal transformation after SLP that replaces VectorBlend + VectorOperation with a masked operation wherever applicable.
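
In Vector API terms, the pattern such a post-SLP rewrite would match is the equivalence below (a sketch, not code from this PR): blending the unmasked result back into the original vector gives the same lanes as performing the lanewise operation under the mask.

    // Sketch only: the two forms below compute the same result, so an ideal
    // transformation could rewrite the blend form into the single masked operation.
    static IntVector blendForm(IntVector a, IntVector b, VectorMask<Integer> m) {
        return a.blend(a.lanewise(VectorOperators.ADD, b), m); // a + b where m is set, a elsewhere
    }

    static IntVector maskedForm(IntVector a, IntVector b, VectorMask<Integer> m) {
        return a.lanewise(VectorOperators.ADD, b, m);          // same result as one masked operation
    }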

@mlbridge
Copy link

mlbridge bot commented Oct 2, 2021

Mailing list message from Viswanathan, Sandhya on panama-dev:

Thanks a lot Joshua for bringing this use case.
I have a Java only draft implementation of compress/expand Vector API methods based on the discussion here.
I will send out a PR for review and further discussion. It would be nice to have support for it in JDK 18, if possible.

Best Regards,
Sandhya

-----Original Message-----
From: panama-dev <panama-dev-retn at openjdk.java.net> On Behalf Of Paul Sandoz
Sent: Wednesday, September 08, 2021 9:47 AM
To: Joshua Zhu <jzhu at openjdk.java.net>
Cc: panama-dev <panama-dev at openjdk.java.net>
Subject: Re: [vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"

On Sep 8, 2021, at 12:03 AM, Joshua Zhu <jzhu at openjdk.java.net> wrote:

On Tue, 7 Sep 2021 16:35:19 GMT, Paul Sandoz <paul.sandoz at oracle.com> wrote:

Would it be possible for you to target this work to a new branch? e.g. vectorIntrinsics+compress.

I want to avoid complicating review and integration of the masking work for JEP 417. We are close to merging vectorIntrinsics+mask into vectorIntrinsics and then starting reviews of the masking support.

Yes, of course. Thank you, Paul. Could you please help me check out a new branch based on "vectorIntrinsics+mask" or "vectorIntrinsics" after merging?

Yes, once vectorIntrinsics+mask is merged into vectorIntrinsics I will create the branch vectorIntrinsics+compress off vectorIntrinsics.

It may be possible that compress/expand become part of JEP 417, if we can get it ready in time. If so I can update the JEP accordingly and help do CSRs. Otherwise, we can target to the next round of incubation.

I'm fine with the plan to place compress/expand in the next round of incubation. I'm afraid I cannot catch the last bus before the review of the masking support, as I have other work at hand.

Ok.

Paul.

@JoshuaZhuwj
Copy link
Member Author

Thanks a lot Joshua for bringing this use case. I have a Java only draft implementation of compress/expand Vector API methods based on the discussion here. I will send out a PR for review and further discussion. It would be nice to have support for it in JDK 18, if possible.

Sandhya, sorry for the late reply; I'm off for the Chinese National Holiday. I had already implemented the compress Vector API (and its full compiler support) before the holidays and was waiting for the "vectorIntrinsics+compress" branch to be created for review and merging. I will submit my changes for review after my holidays.

@mlbridge
Copy link

mlbridge bot commented Oct 5, 2021

Mailing list message from Viswanathan, Sandhya on panama-dev:

Hi Joshua,

I started off the discussion last week as part of https://github.com/openjdk/panama-vector/pull/143.
The email thread, with additional inputs, is at:
https://mail.openjdk.java.net/pipermail/panama-dev/2021-October/015223.html

Once you are back from vacation, please do join the discussion and development.

Best Regards,
Sandhya

-----Original Message-----
From: panama-dev <panama-dev-retn at openjdk.java.net> On Behalf Of Joshua Zhu
Sent: Tuesday, October 05, 2021 7:48 AM
To: panama-dev at openjdk.java.net
Subject: Re: [vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"

On Sat, 2 Oct 2021 00:01:13 GMT, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:

Thanks a lot Joshua for bringing this use case. I have a Java only draft implementation of compress/expand Vector API methods based on the discussion here. I will send out a PR for review and further discussion. It would be nice to have support for it in JDK 18, if possible.

Sandhya, sorry for the late reply; I'm off for the Chinese National Holiday. I had already implemented the compress Vector API (and its full compiler support) before the holidays and was waiting for the "vectorIntrinsics+compress" branch to be created for review and merging. I will submit my changes for review after my holidays.

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/115

@JoshuaZhuwj
Copy link
Member Author

I started off the discussion last week as part of https://github.com/openjdk/panama-vector/pull/143. The email thread, with additional inputs, is at: https://mail.openjdk.java.net/pipermail/panama-dev/2021-October/015223.html

Once you are back from vacation, please do join the discussion and development.

Okay. I will send out what I already implemented to avoid possible repetitive work.
