8273057: [vector] New VectorAPI "SelectiveStore" #115
Conversation
Mailing list message from Paul Sandoz on panama-dev:

Hi Joshua,

Thank you for exploring this area. I am impressed at the level of knowledge you have of HotSpot. Instead of immediately diving into code I would first prefer to discuss the design, to determine if this is the best way to support your use case. I would like to explore whether there are underlying primitives from which we can compose support for this use case. If possible I would like to leverage the existing mask load/store primitives we have, and if necessary make some changes, rather than add more. We already have general mask-accepting scatter/gather store/load. (I have always been a bit uncertain whether we have the right signatures for these methods, and whether they are necessary if we can use shuffles.)

To use the scatter store method today for your use case we would have to:
- compute an int[] array from the set lanes of the mask, M say

Another alternative is to:
- compute a "compression" shuffle, S, from the set lanes of the mask, M

In either case the loop index value is increased by the true count of M. The primitive I am searching for might be a way to create a shuffle from a mask. Let's say we could write:

    int[] a = ...
    // The new primitive, create a shuffle from the mask that partitions vector elements
    // This method is likely not optimal, yet!
    // Use existing masked store

Is it possible for C2 to detect the kind of shuffle pattern and masking to sufficiently optimize? Experts please chime in! I think this is worth exploring further: the more we can optimize the primitives, and then potentially optimize patterns of those, the more flexible we are and can avoid adding more specific functionality.

Paul.
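To make the two alternatives concrete, here is a sketch in Java against the incubator API of that era (jdk.incubator.vector). The scalar index-computation loops are illustrative placeholders for whatever optimized primitive would produce them; the method names and species choice are mine, not Paul's:

    import jdk.incubator.vector.*;

    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    // Alternative 1: compute an int[] index map from the set lanes of the mask M,
    // then reuse the existing masked scatter store.
    static int storeViaScatter(IntVector v, VectorMask<Integer> m, int[] a, int offset) {
        int[] indexMap = new int[SPECIES.length()];
        int k = 0;
        for (int lane = 0; lane < SPECIES.length(); lane++) {
            if (m.laneIsSet(lane)) indexMap[lane] = k++;   // set lanes land in slots 0..k-1
        }
        v.intoArray(a, offset, indexMap, 0, m);            // existing masked scatter store
        return m.trueCount();                              // the loop index advances by this
    }

    // Alternative 2: compute a "compression" shuffle S from the set lanes of M,
    // rearrange, then store under a prefix mask covering the first k lanes.
    static int storeViaShuffle(IntVector v, VectorMask<Integer> m, int[] a, int offset) {
        int[] perm = new int[SPECIES.length()];
        int k = 0;
        for (int lane = 0; lane < SPECIES.length(); lane++) {
            if (m.laneIsSet(lane)) perm[k++] = lane;       // set lanes move to the front
        }
        for (int i = k; i < perm.length; i++) perm[i] = i; // harmless filler indexes
        VectorShuffle<Integer> s = VectorShuffle.fromArray(SPECIES, perm, 0);
        VectorMask<Integer> prefix = VectorMask.fromLong(SPECIES, (1L << k) - 1);
        v.rearrange(s).intoArray(a, offset, prefix);
        return k;
    }

In either variant the caller advances its write offset by the returned true count, exactly as described above.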
Mailing list message from John Rose on panama-dev: I think I would rather see a vector-to-vector compress operation, than a vector-to-memory operation that also includes compression. Isn't that the real underlying primitive?
Hi Paul, thanks a lot for your quick reply and for sharing your design thinking.

In my opinion, whether we take the first (scatter store) way or the second (shuffle) way, … When I help database developers apply the VectorAPI, they usually care about four vectorized data movement primitives: selective load, selective store, gather, and scatter.

What I know about creating a shuffle from a mask is to leverage a precomputed, cache-resident table. The index_array could be treated as an index map or converted into a shuffle.

I think the pattern will be too complicated and the optimization may not be reliable.

Thanks,
Joshua
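For reference, a minimal sketch of that table-driven idea (my illustration, not code from the patch), assuming an 8-lane int species so the mask's bit pattern can index a 256-entry table:

    import jdk.incubator.vector.*;

    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;  // 8 int lanes
    static final int[][] COMPRESS_TABLE = buildCompressTable();

    // For every possible 8-bit mask, precompute the permutation that moves the
    // set lanes to the front; 256 x 8 ints is small enough to stay cache resident.
    static int[][] buildCompressTable() {
        int lanes = SPECIES.length();
        int[][] table = new int[1 << lanes][lanes];
        for (int bits = 0; bits < table.length; bits++) {
            int k = 0;
            for (int lane = 0; lane < lanes; lane++) {
                if ((bits & (1 << lane)) != 0) table[bits][k++] = lane;
            }
            for (int i = k; i < lanes; i++) table[bits][i] = i;  // filler for unset slots
        }
        return table;
    }

    // The mask's bits select a row, usable as an index map or converted to a shuffle.
    static VectorShuffle<Integer> shuffleFromMask(VectorMask<Integer> m) {
        int bits = (int) m.toLong();
        return VectorShuffle.fromArray(SPECIES, COMPRESS_TABLE[bits], 0);
    }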
Agreed. John, thanks a lot for your review comments. This will make the primitive more friendly.

Yes, if we have to introduce a new API, a vector-to-vector compress operation sounds more reasonable. The Arm SVE instruction COMPACT could also do such work.
Mailing list message from Paul Sandoz on panama-dev:

Yes, my suggestion is that a vector-to-vector compress might be a composition of mask -> partitioning shuffle -> rearrange, such that on supported architectures it reduces down to a single instruction. In combination with a store and prefix mask it may be possible to reduce further to a single instruction accepting the source vector, mask, and a memory location. That may be wishful thinking; however, we have made significant improvements to the optimization of shuffles and masking, which gives me some hope.

I think we should give it some more thought (on the C2 heroics required or not, whether we can internally classify certain kinds of shuffle, etc.) before committing to a more specific/specialized operation, such as say:

- Vector.compress(VectorMask<E>)

or perhaps something similar.

Paul.
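As a hedged sketch of how that composition would look in a filter loop, reusing the shuffleFromMask helper from the table sketch above. The prefix mask sets the first trueCount lanes so the store writes only the compressed values; a mask.prefix() method is hypothetical here, so VectorMask.fromLong stands in for it:

    // Filter a[] into b[] (b assumed at least as long as a), keeping elements
    // below threshold; returns the number of elements kept.
    static int filterBelow(int[] a, int[] b, int threshold) {
        int j = 0;
        for (int i = 0; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            VectorMask<Integer> m = v.compare(VectorOperators.LT, threshold);
            int n = m.trueCount();
            VectorMask<Integer> prefix = VectorMask.fromLong(SPECIES, (1L << n) - 1);
            v.rearrange(shuffleFromMask(m)).intoArray(b, j, prefix);  // compress + store
            j += n;   // advance by the true count, as noted earlier in the thread
        }
        return j;
    }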
Thanks to Paul, John Rose and Ningsheng for your comments. A vector-to-vector compress operation is the more friendly primitive.

The vector-to-vector compress primitive, together with a store and prefix mask, could be optimized further into a memory-destination version of compress on supported architectures. For architectures that do not support vector-to-vector compress natively, the operation would need to be emulated.

Best Regards,
Joshua
Mailing list message from Paul Sandoz on panama-dev:

Hi Joshua,

I think we still have some exploring to do on the design, and for others to comment, esp. with regards to C2 capabilities. Here is another design alternative between the spectrum of a partitioning/compressing shuffle from a mask [*] and a compress method:

    VectorMask<Integer> mask = ...;

We introduce a new kind of operator, Rearrange, and constants, such as VectorOperator.COMPRESS, that specify the behavior of the non-mask and mask accepting rearrange methods. COMPRESS specifies that: 1) the non-mask rearrange is an identity operation; and 2) the mask-accepting rearrange compresses the lanes selected by the mask.

It should be possible to create a shuffle from a Rearrange operator, with and without a mask, so the equivalent functionality can be applied to a shuffle-accepting rearrange, e.g., for COMPRESS:

    rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), mask.prefix())

For this to be of benefit we would need to come up with other realistic Rearrange operators, even if we do not add them right now, e.g. IOTA, REVERSE, PARTITION_TRUE, PARTITION_FALSE, INTERLEAVE. However, the design is a little awkward since the mask may or may not contribute to cross-lane movement, and so the operator needs to state the equivalence. In effect the Rearrange operator is a mechanism to refer to certain kinds of shuffle as a constant. Ideally I would still prefer if we could implicitly identify what would otherwise be rearrange operators based on the creation of shuffles with known content, e.g. can C2 somehow tag a shuffle instance with an ID of COMPRESS with a dependency on the mask used for its creation?

FWIW, another way to think about a partitioning/compression shuffle:

    SPECIES.iota().compress(m);

Which is just a specific way of shuffling a shuffle. We could actually track the kinds of shuffle as final fields of the VectorShuffle implementation.

Paul.

[*] We could consider this independently
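Paul's operator idea, modeled as a tiny stand-alone sketch; every name here (RearrangeOp, the COMPRESS constant as an operator) is hypothetical, mirroring the sketch above rather than any real API, with lanes modeled as a plain int[]:

    // Hypothetical operator kinds from the sketch above.
    enum RearrangeOp { COMPRESS, IOTA, REVERSE, PARTITION_TRUE, PARTITION_FALSE, INTERLEAVE }

    // COMPRESS as specified: without a mask it is the identity operation; with a
    // mask it moves the selected lanes to the front, leaving the tail unchanged.
    static int[] rearrange(int[] lanes, RearrangeOp op, boolean[] mask) {
        if (op != RearrangeOp.COMPRESS) throw new UnsupportedOperationException("sketch");
        int[] out = lanes.clone();              // non-mask rearrange: identity
        if (mask == null) return out;
        int k = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) out[k++] = lanes[i];   // mask-accepting rearrange: compress
        }
        return out;                             // lanes k.. retain their prior values
    }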
Mailing list message from John Rose on panama-dev:

On Sep 2, 2021, at 9:50 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote: …

Be careful with adding bulk operations to VectorOperations. One very interesting potential interaction between lanewise operations … The FIRST_NONZERO operation is the closest thing we have … (Matters of terminology: scan(op) produces the vector of running partial results, while reduce(op) produces a single combined value.)

I'm not following this, because I'm having trouble imagining …

I guess what I'm missing is this is a whole new kind of OP, …

Yes, and you are exploring the space of plausible constants.

Siting the compress (and expand!) functionality on shuffle instead … So, I see three primitive notions: segmentation, compress/expand, …

Yes, that's the bridging I just mentioned. Some applications … For example, if parsing a file of var-ints does …

Yes. If a machine lacks a compress and/or expand operation, … Sketch of algorithm to derive a compression shuffle. Take VLENGTH=5 just for example, and suppose we …

1. For each masked lane (in a fresh index vector) …
2. (Optional but useful.) Compute a complementary …
3. Under the mask, blend the index vector from step 1 …
4. Create an in-memory buffer of the same size and …
5. Load the buffer back into a vector register, and …

Step 2 may be desirable even if you are not doing a SAG. Note that steps 1 and 2 can be done in superscalar parallel. I don't know a great way to avoid a trip to a memory …

4b. Having computed the non-colliding permutation …
5b. Use the standard shuffle (permutation) operation …

4b. Math details: Sort the merged E so as to transform … (Hey hardware vendors, wouldn't it be nice to supply …)

5c. The step 5a can be merged into 4b in some cases, …

This shows that a possible primitive is the anti-shuffle. Ideas of the morning; HTH :-)
Mailing list message from John Rose on panama-dev:

On Aug 31, 2021, at 9:44 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

> Yes, my suggestion is that a vector-to-vector compress might be a composition of mask -> partitioning shuffle -> rearrange, such that on supported architectures it reduces down to a single instruction. In combination with a store and prefix mask it may be possible to reduce further to a single instruction accepting the source vector, mask, and a memory location.

As I argued in my previous, it may be just as well to think of compress as its own primitive, even if under the covers it is implemented using shuffle.

I think it's worth thinking more about anti-shuffle, what that would be like.
Mailing list message from John Rose on panama-dev:

On Sep 2, 2021, at 11:33 AM, John Rose <john.r.rose at oracle.com> wrote: …

The double-width lanes of A are "packets" to be "routed" … I forgot to point out that some hardware has min/max … Also, these "packets" were, back in the day, literal …
Mailing list message from Viswanathan, Sandhya on panama-dev:

Today we have rearrange, slice and unslice methods to do cross-lane movements. …

Best Regards,
Sandhya

-----Original Message-----
(quoting Paul's message above on composing compress from mask -> shuffle -> rearrange, and John's reply on treating compress as its own primitive: "Mathematical permutations do not come in two kinds, but shuffles and anti-shuffles are distinct because only the former duplicate and only the latter collide.")
Mailing list message from Paul Sandoz on panama-dev:

Thanks John, and Sandhya for also commenting. You both rightly pointed out the weakness of using operators and rearrange :-) it does not fit right.

John, your observation on order really stood out to me. I can see how a prefix-sum might behave with a mask describing the selection of lanes *and* compression of the result (no intermediate values, either from the input or zero).

In summary, from the discussion, compress/expand are:

- important conceptually, even if the same functionality could be composed from shuffles (such as used by an implementation); and
- at the right level to reliably optimize on supporting hardware.

So specification-wise we introduce expanding/compressing cross-lane operations. API-wise I prefer two distinct methods rather than one that accepts a boolean indicating expansion or compression. We can declare one intrinsic method in VectorSupport.

Paul.
Mailing list message from John Rose on panama-dev:

On Sep 3, 2021, at 1:11 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote: …

Yes, this whole thread is a very good exercise in what we … (And by "naturally" I mean the chosen primitives directly …)

> In summary, from the discussion, compress/expand are:
> - important conceptually, even if the same functionality could be composed from shuffles (such as used by an implementation); and

Compression can be composed from *colliding* shuffles, …

> - at the right level to reliably optimize on supporting hardware.
> So specification-wise we introduce expanding/compressing cross-lane operations. API-wise I prefer two distinct methods rather than one that accepts a boolean indicating expansion or compression. We can declare one intrinsic method in VectorSupport.

Yes, a function and its inverse are usually not best accessed …
Mailing list message from John Rose on panama-dev:

On Sep 3, 2021, at 1:11 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

> I can see how a prefix-sum might behave with a mask describing the selection of lanes *and* compression of the result (no intermediate values, either from the input or zero).

Prefix sums (aka scans) are surprisingly powerful:

- scans can use any associative operator (not just sum: min, max, first-non-zero, etc.) [1]

To perform a data-parallel/SIMD operation simultaneously on N SIMD data sets {S[i]}, first concatenate them all into a single SIMD data set G (unravel an array into a stream of values, for example), keeping elements of the same original data set contiguous in the new data set, and add a new boolean vector, a "boundary mask", which marks the boundaries of the original data sets within the new data set. (This does not need a group field in [1..N], just a boolean, because groups are contiguous in the big set.) Do SIMD on the whole thing. When performing scans or reduces, use the segmented variation of the scan or reduce, which consults the boundary mask and prevents carry-ins and carry-outs across boundaries. Reductions and carry-outs from up-sweeps go into sparse SIMD data sets which can be quickly compressed to N-vectors, carrying the per-group results. Collecting per-group results is where compress shines.

The procedure outlined here is very robust across group sizes: it basically works the same, and with the same efficiency, whether N=1 or N=|G|, and regardless of the statistics of the group sizes |S[i]|. [2]

When you have an ordering problem, like vectorized sort or compress, look first for a scan pre-pass that could be used to steer a data-movement pass. I find this often clarifies the problem and suggests new vectorization opportunities.
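One concrete bridge between scans and compression, as a small illustrative sketch of my own: the destination slot of each selected lane is the exclusive prefix sum (a scan with ADD) of the mask viewed as 0/1 values.

    // mask       : [1, 0, 1, 1, 0, 0, 1, 0]
    // excl. scan : [0, 1, 1, 2, 3, 3, 3, 4]  <- destination of each set lane
    static int[] compressDestinations(boolean[] mask) {
        int[] dest = new int[mask.length];
        int running = 0;                  // exclusive scan: set lanes seen before i
        for (int i = 0; i < mask.length; i++) {
            dest[i] = running;
            if (mask[i]) running++;
        }
        return dest;                      // dest[i] is meaningful where mask[i] is set
    }

This is exactly the scan pre-pass steering a data-movement pass that John describes.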
Mailing list message from John Rose on panama-dev:

P.S. Some googly references that seem useful for me:

https://en.wikipedia.org/wiki/Prefix_sum

You can find your own easily, of course. I suppose there are plenty of GPU people who have rediscovered this stuff recently. Appel traces the basics back to 1977.

So here's a basic tool for our toolkit: watch out for segmented scans and reductions, even in disguise (say, as nested or grouped parallelism). Use them to turn brute-force iteration into log-N data-parallel operations. (Will the hardware reward your rewrite of the algorithm? One may hope. Sometimes it does.)
Thanks, Paul, for the deep design thinking. I will refactor my code and implement these two cross-lane data movement primitives: mask-based compression and expansion.

These two vector-to-vector operations, together with a store/load and a prefix mask, could be further optimized into the single memory-destination/source instruction on supported architectures.
Hi @JoshuaZhuwj, while adding new macro-level APIs is appealing, we can also extend the existing vector APIs to accept another boolean flag, "is_selective", under which compression/expansion triggers. In this use case it is difficult to infer COMPRESSION through the auto-vectorizer, though we made attempts in the past to infer complex loop patterns for the VNNI instruction.

This way we can also share common optimizations, as you suggested earlier, to convert a masked COMPRESS into an unmasked vector move for an ALLTRUE mask; some work [1][2] is already in place on this front.

Best Regards,

[1] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L752
Per the design discussion in this thread, compared to a vector-to-memory operation, a vector-to-vector compress/expand operation is the more friendly primitive.

Could you elaborate on it, please? I do not follow this.

Yes. Since the compress/expand op is also mask-based, this piece of optimization is common. Maybe we can think of a way to share this optimization across different kinds of masked operations?
I meant that auto-vectorizing the following loop, which mimics compression semantics, could be tricky if it is difficult to ascertain the independence between memory references. In a case like the one below, 'j' could be a middle index in the array, and thus, if the distance between memory references is less than the chosen vector width, vectorization may result in an incorrect overwrite.
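The loop itself was not preserved in this extract; a representative compression loop of the kind being described (illustrative only) looks like this:

    // 'j' advances data-dependently, so the distance between the b[j] write and
    // the a[i] reads is unknown at compile time; if b aliases a, or the distance
    // is smaller than the chosen vector width, blind vectorization would
    // overwrite elements that have not been read yet.
    static int compress(int[] a, int[] b, int threshold) {
        int j = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] < threshold) {
                b[j++] = a[i];
            }
        }
        return j;
    }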
Yes, agree.
Mailing list message from Paul Sandoz on panama-dev:

Hi Joshua,

Thank you for your patience as we went through deeper design thinking. Would it be possible for you to target this work to a new branch? e.g. vectorIntrinsics+compress. I want to avoid complicating review and integration of the masking work for JEP 417. We are close to merging vectorIntrinsics+mask into vectorIntrinsics and then starting reviews of the masking support.

It may be possible that compress/expand become part of JEP 417, if we can get it ready in time. If so I can update the JEP accordingly and help do CSRs. Otherwise, we can target the next round of incubation.

Paul.
Yes, of course. Thank you, Paul. Could you please help me check out a new branch based on "vectorIntrinsics+mask", or on "vectorIntrinsics" after the merge?

I'm fine with the plan to place compress/expand in the next round of incubation. I'm afraid I cannot catch the last bus before the review of the masking support, as I have other work at hand.
Agreed. I think the current SWPointer comparison (used to detect possibly identical addresses) does not apply to this pattern. Do you have a plan to support masked operations in the SLP optimization?
Mailing list message from Paul Sandoz on panama-dev:
Yes, once vectorIntrinsics+mask is merged into vectorIntrinsics I will create the branch vectorIntrinsics+compress off vectorIntrinsics.
Ok. Paul.
A PhiNode will be created for 'j' for sure, and it is indeed an induction variable, since its value is incremented with each iteration. Yes, it is not a secondary induction variable in this case, which makes any dependency analysis between array references indexed by the different subscripts 'i' and 'j' difficult.

Given that the SLP algorithm does not work across basic blocks, a simplified approach would be to add an ideal transformation post-SLP to replace VectorBlend + VectorOperation with a masked operation wherever applicable.
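At the Java level, the equivalence behind that transform can be stated directly. A sketch of my own (the real rewrite would operate on C2's ideal graph nodes, not on Java source):

    import jdk.incubator.vector.*;

    // blend(unmasked op, mask) == masked op: where m is unset, both keep v's lane.
    static IntVector maskedAddTwoWays(IntVector v, IntVector w, VectorMask<Integer> m) {
        IntVector viaBlend = v.blend(v.lanewise(VectorOperators.ADD, w), m);
        IntVector direct   = v.lanewise(VectorOperators.ADD, w, m);
        assert viaBlend.eq(direct).allTrue();  // same result; a post-SLP ideal
        return direct;                         // transform could rewrite one to the other
    }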
Mailing list message from Viswanathan, Sandhya on panama-dev:

Thanks a lot, Joshua, for bringing this use case. …

Best Regards,
Sandhya

-----Original Message-----
> Yes, once vectorIntrinsics+mask is merged into vectorIntrinsics I will create the branch vectorIntrinsics+compress off vectorIntrinsics.
>
> Ok. Paul.
Sandhya, sorry for the late reply, as I'm off for the Chinese National Holiday. I had already implemented the compress VectorAPI (and its full compiler support) before the holidays and was waiting for the creation of the "vectorIntrinsics+compress" branch for review and merging. I will submit my changes for review after my holidays.
Mailing list message from Viswanathan, Sandhya on panama-dev:

Hi Joshua,

I started off the discussion last week as part of https://github.com//pull/143. Once you are back from vacation, please do join the discussion and development.

Best Regards,
Sandhya

-----Original Message-----
On Sat, 2 Oct 2021 00:01:13 GMT, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
> Sandhya, sorry for the late reply, as I'm off for the Chinese National Holiday. […]

-------------
PR: https://git.openjdk.java.net/panama-vector/pull/115
Okay. I will send out what I have already implemented to avoid possible repetitive work.
Hi,

I want to propose a new VectorAPI "Selective Store/Load" and share my implementation. Currently Alibaba's internal databases are in the process of applying VectorAPI and they have requirements on "Selective Store" for acceleration.
My proposed VectorAPI is declared as below [1]:
The active elements (those with their respective bit set in the mask) are contiguously stored into the array "a". Assuming N is the true count of the mask, the elements from a[offset+N] up to a[offset+laneCount] are left unchanged. The return value represents the number of elements stored into the array, and "offset + return value" is the new offset for the next iteration.
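A scalar model of the proposed semantics as I read the description (the real declaration is in the linked commit [1]; the method name is assumed from the SelectiveIntoArray mention below):

    import jdk.incubator.vector.*;

    // Stores the active elements of v contiguously into a, starting at offset;
    // returns N, the true count of m; a[offset+N .. offset+laneCount-1] untouched.
    static int selectiveIntoArray(IntVector v, int[] a, int offset, VectorMask<Integer> m) {
        int n = offset;
        for (int lane = 0; lane < v.length(); lane++) {
            if (m.laneIsSet(lane)) {
                a[n++] = v.lane(lane);
            }
        }
        return n - offset;   // "offset + return value" is the next offset
    }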
This API will be used in the following manner [2]:
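An illustrative use in the database-filter style this proposal targets (my sketch; the real example is in the linked commit [2]):

    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    // Select all elements of a[] below threshold into b[], compacted.
    static int filter(int[] a, int[] b, int threshold) {
        int j = 0;
        for (int i = 0; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            VectorMask<Integer> m = v.compare(VectorOperators.LT, threshold);
            j += selectiveIntoArray(v, b, j, m);   // offset advances by the return value
        }
        return j;
    }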
My patch includes the following changes:
* Selective Store VectorAPI for Long & Int
* Assembler: add x86 instructions "VPCOMPRESSD" and "VPCOMPRESSQ"
* Instruction selection: vselective_store; kmask_truecount (true count of kregister)
* Add node "StoreVectorSelective"
* Add a new parameter "is_selective" in inline_vector_mem_masked_operation() in order to distinguish the masked version from the selective version
* jtreg cases
* JMH benchmark
TODO parts I will implement:
* Selective Store for other types
* Selective Load
* Some potential optimizations, such as: when the mask is allTrue, SelectiveIntoArray() -> IntoArray()
Test:
* Passed VectorAPI jtreg cases.
* Results of a JMH benchmark to evaluate the API's performance in Alibaba's real scenario:
UseAVX=3; thread number = 8; conflict data percentage: 20% (that means 20% of mask bits are true)
http://cr.openjdk.java.net/~jzhu/8273057/jmh_benchmark_result.pdf
[1] JoshuaZhuwj@69623f7#diff-13cc2d6ec18e487ddae05cda671bdb6bb7ffd42ff7bc51a2e00c8c5e622bd55dR4667
[2] JoshuaZhuwj@69623f7#diff-951d02bd72a931ac34bc85d1d4e656a14f8943e143fc9282b36b9c76c1893c0cR144
[3] failed to inline (intrinsic): panama-vector/src/hotspot/cpu/x86/x86.ad, line 1769 (commit 60aa8ca)
Best Regards,
Joshua
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-vector pull/115/head:pull/115
$ git checkout pull/115
Update a local copy of the PR:
$ git checkout pull/115
$ git pull https://git.openjdk.java.net/panama-vector pull/115/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 115
View PR using the GUI difftool:
$ git pr show -t 115
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/panama-vector/pull/115.diff