[LV][VPlan] Add initial support for CSA vectorization #121222

Open
wants to merge 4 commits into main from csa-noevl

Conversation

michaelmaitland
Contributor

This patch adds initial support for CSA (conditional scalar assignment) vectorization to LLVM.
This new class of vectorizable loops is characterized by an assignment to a scalar in a loop,
where the assignment is conditional from the perspective of its use.
An assignment is conditional in a loop if a value may or may not be assigned
in the loop body.

For example:

int t = init_val;
for (int i = 0; i < N; i++) {
  if (cond[i])
    t = a[i];
}
s = t; // use t

Using pseudo-LLVM code this can be vectorized as

vector.ph:
  ...
  %t = %init_val
  %init.mask = <all-false-vec>
  %init.data = <poison-vec> ; uninitialized
vector.body:
  ...
  %mask.phi = phi [%init.mask, %vector.ph], [%new.mask, %vector.body]
  %data.phi = phi [%init.data, %vector.ph], [%new.data, %vector.body]
  %cond.vec = <widened-cmp> ...
  %a.vec    = <widened-load> %a, %i
  %b        = <any-lane-active> %cond.vec
  %new.mask = select %b, %cond.vec, %mask.phi
  %new.data = select %b, %a.vec, %data.phi
  ...
middle.block:
  %s = <extract-last-active-lane> %new.mask, %new.data

On each iteration, we track whether any lane in the widened condition was active;
if so, we take the current condition and data vectors as the new mask and data vectors.
Then, at the end of the loop, the scalar can be extracted only once, from the last active lane.
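
As a rough illustration of what <extract-last-active-lane> computes, here is a scalar sketch (a hypothetical helper, not code from this patch) that returns the data element of the highest-numbered active lane, falling back to the initial scalar when no lane was ever active:

#include <cstddef>

// Scalar model of <extract-last-active-lane> over VF lanes.
template <typename T, std::size_t VF>
T extractLastActiveLane(const bool (&Mask)[VF], const T (&Data)[VF],
                        T InitVal) {
  for (std::size_t Lane = VF; Lane-- > 0;)
    if (Mask[Lane])
      return Data[Lane]; // last lane whose condition held
  return InitVal;        // condition never held; keep the initial value
}

Here Mask and Data play the role of %new.mask and %new.data as they leave the loop.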

This transformation works the same way for integer, pointer, and floating point
conditional assignment, since the transformation does not require inspection
of the data being assigned.

Vectorizing a CSA introduces recipes into the vector preheader, the vector body,
and the middle block. Recipes introduced into the preheader and middle block are
executed only once, while recipes in the vector body may be executed many times.
The more times the vector body is executed, the smaller the impact the preheader
and middle-block costs have on the overall cost of a CSA.
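
A minimal sketch of that amortization argument (illustrative only; none of these names exist in the patch):

// The one-time preheader and middle-block costs are amortized over the number
// of vector iterations, so their relative contribution shrinks as the vector
// trip count grows.
struct CSACostSketch {
  unsigned PreheaderCost;   // recipes executed once, before the loop
  unsigned BodyCostPerIter; // recipes executed on every vector iteration
  unsigned MiddleBlockCost; // final extract etc., executed once

  unsigned totalCost(unsigned VectorTripCount) const {
    return PreheaderCost + BodyCostPerIter * VectorTripCount + MiddleBlockCost;
  }
};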

A detailed explanation of the concept can be found here.

This patch is further tested in llvm/llvm-test-suite#155.

This patch contains only the non-EVL related code. It is based on the larger
patch #106560, which contained both the EVL and non-EVL related parts.

@llvmbot
Member

llvmbot commented Dec 27, 2024

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes


Patch is 232.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121222.diff

18 Files Affected:

  • (modified) llvm/include/llvm/Analysis/IVDescriptors.h (+67-1)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+9)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h (+27)
  • (modified) llvm/lib/Analysis/IVDescriptors.cpp (+59-1)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+5)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+5)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+4)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+37-4)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+22-2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+136-9)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+1-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+164-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+9-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+238)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+1-1)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/conditional-scalar-assignment.ll (+3021)
diff --git a/llvm/include/llvm/Analysis/IVDescriptors.h b/llvm/include/llvm/Analysis/IVDescriptors.h
index e8041e22b031ce..b085f4bd173afc 100644
--- a/llvm/include/llvm/Analysis/IVDescriptors.h
+++ b/llvm/include/llvm/Analysis/IVDescriptors.h
@@ -6,7 +6,8 @@
 //
 //===----------------------------------------------------------------------===//
 //
-// This file "describes" induction and recurrence variables.
+// This file "describes" induction, recurrence, and conditional scalar
+// assignment variables.
 //
 //===----------------------------------------------------------------------===//
 
@@ -423,6 +424,71 @@ class InductionDescriptor {
   SmallVector<Instruction *, 2> RedundantCasts;
 };
 
+/// A Conditional Scalar Assignment is an assignment from an initial
+/// scalar that may or may not occur.
+class ConditionalScalarAssignmentDescriptor {
+  /// If the conditional assignment occurs inside a loop, then Phi chooses
+  /// the value of the assignment from the entry block or the loop body block.
+  PHINode *Phi = nullptr;
+
+  /// The initial value of the ConditionalScalarAssignment. If the condition
+  /// guarding the assignment is not met, then the assignment retains this
+  /// value.
+  Value *InitScalar = nullptr;
+
+  /// The Instruction that performs the conditional assignment inside the loop.
+  Instruction *Assignment = nullptr;
+
+  /// Create a ConditionalScalarAssignmentDescriptor that models a valid
+  /// conditional scalar assignment with its members initialized correctly.
+  ConditionalScalarAssignmentDescriptor(PHINode *Phi, Instruction *Assignment,
+                                        Value *InitScalar)
+      : Phi(Phi), InitScalar(InitScalar), Assignment(Assignment) {}
+
+public:
+  /// Create a ConditionalScalarAssignmentDescriptor that models an invalid
+  /// ConditionalScalarAssignment.
+  ConditionalScalarAssignmentDescriptor() = default;
+
+  /// If \p Phi is the root of a ConditionalScalarAssignment, set \p Desc to
+  /// the ConditionalScalarAssignment rooted by \p Phi and return true.
+  /// Otherwise, return false, leaving \p Desc unmodified.
+  static bool isConditionalScalarAssignmentPhi(
+      PHINode *Phi, Loop *TheLoop,
+      ConditionalScalarAssignmentDescriptor &Desc);
+
+  operator bool() const { return isValid(); }
+
+  /// Returns whether SI is the Assignment in ConditionalScalarAssignment
+  static bool isConditionalScalarAssignmentSelect(
+      ConditionalScalarAssignmentDescriptor Desc, SelectInst *SI) {
+    return Desc.getAssignment() == SI;
+  }
+
+  /// Return whether this ConditionalScalarAssignmentDescriptor models a valid
+  /// ConditionalScalarAssignment.
+  bool isValid() const { return Phi && InitScalar && Assignment; }
+
+  /// Return the PHI that roots this ConditionalScalarAssignment.
+  PHINode *getPhi() const { return Phi; }
+
+  /// Return the initial value of the ConditionalScalarAssignment. This is the
+  /// value if the conditional assignment does not occur.
+  Value *getInitScalar() const { return InitScalar; }
+
+  /// The Instruction that is used after the loop
+  Instruction *getAssignment() const { return Assignment; }
+
+  /// Return the condition that this ConditionalScalarAssignment is conditional
+  /// upon.
+  Value *getCond() const {
+    if (auto *SI = dyn_cast_or_null<SelectInst>(Assignment))
+      return SI->getCondition();
+    return nullptr;
+  }
+};
+
 } // end namespace llvm
 
 #endif // LLVM_ANALYSIS_IVDESCRIPTORS_H
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index c6b846f96f1622..b41dbb6582f76f 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1852,6 +1852,10 @@ class TargetTransformInfo {
         : EVLParamStrategy(EVLParamStrategy), OpStrategy(OpStrategy) {}
   };
 
+  /// \returns true if the loop vectorizer should vectorize conditional
+  /// scalar assignments for the target.
+  bool enableConditionalScalarAssignmentVectorization() const;
+
   /// \returns How the target needs this vector-predicated operation to be
   /// transformed.
   VPLegalization getVPLegalizationStrategy(const VPIntrinsic &PI) const;
@@ -2305,6 +2309,7 @@ class TargetTransformInfo::Concept {
                              SmallVectorImpl<Use *> &OpsToSink) const = 0;
 
   virtual bool isVectorShiftByScalarCheap(Type *Ty) const = 0;
+  virtual bool enableConditionalScalarAssignmentVectorization() const = 0;
   virtual VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
   virtual bool hasArmWideBranch(bool Thumb) const = 0;
@@ -3130,6 +3135,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.isVectorShiftByScalarCheap(Ty);
   }
 
+  bool enableConditionalScalarAssignmentVectorization() const override {
+    return Impl.enableConditionalScalarAssignmentVectorization();
+  }
+
   VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const override {
     return Impl.getVPLegalizationStrategy(PI);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 5fa0c46ad292d8..0cb081d48991f2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -1030,6 +1030,8 @@ class TargetTransformInfoImplBase {
 
   bool isVectorShiftByScalarCheap(Type *Ty) const { return false; }
 
+  bool enableConditionalScalarAssignmentVectorization() const { return false; }
+
   TargetTransformInfo::VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const {
     return TargetTransformInfo::VPLegalization(
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index fbe80eddbae07a..48c9077d653278 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -269,6 +269,12 @@ class LoopVectorizationLegality {
   /// induction descriptor.
   using InductionList = MapVector<PHINode *, InductionDescriptor>;
 
+  /// ConditionalScalarAssignmentList contains the
+  /// ConditionalScalarAssignmentDescriptors for all the conditional scalar
+  /// assignments that were found in the loop, rooted by their phis.
+  using ConditionalScalarAssignmentList =
+      MapVector<PHINode *, ConditionalScalarAssignmentDescriptor>;
+
   /// RecurrenceSet contains the phi nodes that are recurrences other than
   /// inductions and reductions.
   using RecurrenceSet = SmallPtrSet<const PHINode *, 8>;
@@ -321,6 +327,18 @@ class LoopVectorizationLegality {
   /// Returns True if V is a Phi node of an induction variable in this loop.
   bool isInductionPhi(const Value *V) const;
 
+  /// Returns the conditional scalar assignments found in the loop.
+  const ConditionalScalarAssignmentList &
+  getConditionalScalarAssignments() const {
+    return ConditionalScalarAssignments;
+  }
+
+  /// Returns true if \p Phi is the root of a conditional scalar assignment
+  /// in the loop.
+  bool isConditionalScalarAssignmentPhi(PHINode *Phi) const {
+    return ConditionalScalarAssignments.count(Phi) != 0;
+  }
+
   /// Returns a pointer to the induction descriptor, if \p Phi is an integer or
   /// floating point induction.
   const InductionDescriptor *getIntOrFpInductionDescriptor(PHINode *Phi) const;
@@ -550,6 +568,12 @@ class LoopVectorizationLegality {
   void addInductionPhi(PHINode *Phi, const InductionDescriptor &ID,
                        SmallPtrSetImpl<Value *> &AllowedExit);
 
+  /// Updates the vectorization state by adding \p Phi to the
+  /// ConditionalScalarAssignment list.
+  void addConditionalScalarAssignmentPhi(
+      PHINode *Phi, const ConditionalScalarAssignmentDescriptor &Desc,
+      SmallPtrSetImpl<Value *> &AllowedExit);
+
   /// The loop that we evaluate.
   Loop *TheLoop;
 
@@ -594,6 +618,9 @@ class LoopVectorizationLegality {
   /// variables can be pointers.
   InductionList Inductions;
 
+  /// Holds the conditional scalar assignments
+  ConditionalScalarAssignmentList ConditionalScalarAssignments;
+
   /// Holds all the casts that participate in the update chain of the induction
   /// variables, and that have been proven to be redundant (possibly under a
   /// runtime guard). These casts can be ignored when creating the vectorized
diff --git a/llvm/lib/Analysis/IVDescriptors.cpp b/llvm/lib/Analysis/IVDescriptors.cpp
index f74ede4450ce52..259af79f6cdba5 100644
--- a/llvm/lib/Analysis/IVDescriptors.cpp
+++ b/llvm/lib/Analysis/IVDescriptors.cpp
@@ -6,7 +6,8 @@
 //
 //===----------------------------------------------------------------------===//
 //
-// This file "describes" induction and recurrence variables.
+// This file "describes" induction, recurrence, and conditional scalar
+// assignment variables.
 //
 //===----------------------------------------------------------------------===//
 
@@ -1570,3 +1571,60 @@ bool InductionDescriptor::isInductionPHI(
   D = InductionDescriptor(StartValue, IK_PtrInduction, Step);
   return true;
 }
+
+/// If \p Phi is the root of a ConditionalScalarAssignment matching one of
+/// these patterns:
+///   phi loop_inv, (select cmp, value, phi)
+///   phi loop_inv, (select cmp, phi, value)
+///   phi (select cmp, value, phi), loop_inv
+///   phi (select cmp, phi, value), loop_inv
+/// then set \p Desc accordingly and return true. Otherwise, return false and
+/// leave \p Desc unmodified.
+bool ConditionalScalarAssignmentDescriptor::isConditionalScalarAssignmentPhi(
+    PHINode *Phi, Loop *TheLoop, ConditionalScalarAssignmentDescriptor &Desc) {
+
+  // Must be a scalar.
+  Type *Type = Phi->getType();
+  if (!Type->isIntegerTy() && !Type->isFloatingPointTy() &&
+      !Type->isPointerTy())
+    return false;
+
+  // Match phi loop_inv, (select cmp, value, phi)
+  //    or phi loop_inv, (select cmp, phi, value)
+  //    or phi (select cmp, value, phi), loop_inv
+  //    or phi (select cmp, phi, value), loop_inv
+  if (Phi->getNumIncomingValues() != 2)
+    return false;
+  auto SelectInstIt = find_if(Phi->incoming_values(), [&Phi](const Use &U) {
+    return match(U.get(), m_Select(m_Value(), m_Specific(Phi), m_Value())) ||
+           match(U.get(), m_Select(m_Value(), m_Value(), m_Specific(Phi)));
+  });
+  if (SelectInstIt == Phi->incoming_values().end())
+    return false;
+  auto LoopInvIt = find_if(Phi->incoming_values(), [&](Use &U) {
+    return U.get() != *SelectInstIt && TheLoop->isLoopInvariant(U.get());
+  });
+  if (LoopInvIt == Phi->incoming_values().end())
+    return false;
+
+  // Phi or Sel must be used only outside the loop,
+  // excluding if Phi use Sel or Sel use Phi
+  auto IsOnlyUsedOutsideLoop = [&](Value *V, Value *Ignore) {
+    return all_of(V->users(), [Ignore, TheLoop](User *U) {
+      if (U == Ignore)
+        return true;
+      if (auto *I = dyn_cast<Instruction>(U))
+        return !TheLoop->contains(I);
+      return true;
+    });
+  };
+  Instruction *Select = cast<SelectInst>(SelectInstIt->get());
+  Value *LoopInv = LoopInvIt->get();
+  if (!IsOnlyUsedOutsideLoop(Phi, Select) ||
+      !IsOnlyUsedOutsideLoop(Select, Phi))
+    return false;
+
+  Desc = ConditionalScalarAssignmentDescriptor(Phi, Select, LoopInv);
+  return true;
+}
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index c62e40db0c5775..2468227be4a0da 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1373,6 +1373,11 @@ bool TargetTransformInfo::preferEpilogueVectorization() const {
   return TTIImpl->preferEpilogueVectorization();
 }
 
+bool TargetTransformInfo::enableConditionalScalarAssignmentVectorization()
+    const {
+  return TTIImpl->enableConditionalScalarAssignmentVectorization();
+}
+
 TargetTransformInfo::VPLegalization
 TargetTransformInfo::getVPLegalizationStrategy(const VPIntrinsic &VPI) const {
   return TTIImpl->getVPLegalizationStrategy(VPI);
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 49192bd6380223..e469c0225cb0a3 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -2361,6 +2361,11 @@ bool RISCVTTIImpl::isLegalMaskedExpandLoad(Type *DataTy, Align Alignment) {
   return true;
 }
 
+bool RISCVTTIImpl::enableConditionalScalarAssignmentVectorization() const {
+  return ST->hasVInstructions() &&
+         ST->getProcFamily() == RISCVSubtarget::SiFive7;
+}
+
 bool RISCVTTIImpl::isLegalMaskedCompressStore(Type *DataTy, Align Alignment) {
   auto *VTy = dyn_cast<VectorType>(DataTy);
   if (!VTy || VTy->isScalableTy())
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index bd90bfed6e2c95..9e1b2cb3f3043f 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -306,6 +306,10 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
     return TLI->isVScaleKnownToBeAPowerOfTwo();
   }
 
+  /// \returns true if the loop vectorizer should vectorize conditional
+  /// scalar assignments for the target.
+  bool enableConditionalScalarAssignmentVectorization() const;
+
   /// \returns How the target needs this vector-predicated operation to be
   /// transformed.
   TargetTransformInfo::VPLegalization
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index cb0b4641b6492b..2835d9f385ac5f 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -83,6 +83,10 @@ static cl::opt<bool> EnableHistogramVectorization(
     "enable-histogram-loop-vectorization", cl::init(false), cl::Hidden,
     cl::desc("Enables autovectorization of some loops containing histograms"));
 
+static cl::opt<bool> EnableConditionalScalarAssignment(
+    "enable-csa-vectorization", cl::init(false), cl::Hidden,
+    cl::desc("Control whether loop vectorization is enabled"));
+
 /// Maximum vectorization interleave count.
 static const unsigned MaxInterleaveFactor = 16;
 
@@ -749,6 +753,18 @@ bool LoopVectorizationLegality::setupOuterLoopInductions() {
   return llvm::all_of(Header->phis(), IsSupportedPhi);
 }
 
+void LoopVectorizationLegality::addConditionalScalarAssignmentPhi(
+    PHINode *Phi, const ConditionalScalarAssignmentDescriptor &Desc,
+    SmallPtrSetImpl<Value *> &AllowedExit) {
+  assert(Desc.isValid() &&
+         "Expected Valid ConditionalScalarAssignmentDescriptor");
+  LLVM_DEBUG(
+      dbgs() << "LV: found legal conditional scalar assignment opportunity"
+             << *Phi << "\n");
+  AllowedExit.insert(Phi);
+  ConditionalScalarAssignments.insert({Phi, Desc});
+}
+
 /// Checks if a function is scalarizable according to the TLI, in
 /// the sense that it should be vectorized and then expanded in
 /// multiple scalar calls. This is represented in the
@@ -866,14 +882,27 @@ bool LoopVectorizationLegality::canVectorizeInstrs() {
           continue;
         }
 
-        // As a last resort, coerce the PHI to a AddRec expression
-        // and re-try classifying it a an induction PHI.
+        // Try to coerce the PHI to an AddRec expression and re-try
+        // classifying it as an induction PHI.
         if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true) &&
             !IsDisallowedStridedPointerInduction(ID)) {
           addInductionPhi(Phi, ID, AllowedExit);
           continue;
         }
 
+        // Check if the PHI can be classified as a conditional scalar assignment
+        // PHI.
+        if (EnableConditionalScalarAssignment ||
+            (TTI->enableConditionalScalarAssignmentVectorization() &&
+             EnableConditionalScalarAssignment.getNumOccurrences() == 0)) {
+          ConditionalScalarAssignmentDescriptor Desc;
+          if (ConditionalScalarAssignmentDescriptor::
+                  isConditionalScalarAssignmentPhi(Phi, TheLoop, Desc)) {
+            addConditionalScalarAssignmentPhi(Phi, Desc, AllowedExit);
+            continue;
+          }
+        }
+
         reportVectorizationFailure("Found an unidentified PHI",
             "value that could not be identified as "
             "reduction is used outside the loop",
@@ -1844,11 +1873,15 @@ bool LoopVectorizationLegality::canFoldTailByMasking() const {
   for (const auto &Reduction : getReductionVars())
     ReductionLiveOuts.insert(Reduction.second.getLoopExitInstr());
 
+  SmallPtrSet<const Value *, 8> CSALiveOuts;
+  for (const auto &CSA : getConditionalScalarAssignments())
+    CSALiveOuts.insert(CSA.second.getAssignment());
+
   // TODO: handle non-reduction outside users when tail is folded by masking.
   for (auto *AE : AllowedExit) {
     // Check that all users of allowed exit values are inside the loop or
-    // are the live-out of a reduction.
-    if (ReductionLiveOuts.count(AE))
+    // are the live-out of a reduction or conditional scalar assignment.
+    if (ReductionLiveOuts.count(AE) || CSALiveOuts.count(AE))
       continue;
     for (User *U : AE->users()) {
       Instruction *UI = cast<Instruction>(U);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 650a4859780da2..6eba380ceb7a12 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -174,8 +174,8 @@ class VPBuilder {
         new VPInstruction(Opcode, Operands, WrapFlags, DL, Name));
   }
 
-  VPValue *createNot(VPValue *Operand, DebugLoc DL = {},
-                     const Twine &Name = "") {
+  VPInstruction *createNot(VPValue *Operand, DebugLoc DL = {},
+                           const Twine &Name = "") {
     return createInstruction(VPInstruction::Not, {Operand}, DL, Name);
   }
 
@@ -257,6 +257,26 @@ class VPBuilder {
         FPBinOp ? FPBinOp->getFastMathFlags() : FastMathFlags()));
   }
 
+  VPInstruction *createConditionalScalarAssignmentMaskPhi(VPValue *InitMask,
+                                                          DebugLoc DL,
+                                                          const Twine &Name) {
+    return createInstruction(VPInstruction::ConditionalScalarAssignmentMaskPhi,
+                             {InitMask}, DL, Name);
+  }
+
+  VPInstruction *createAnyOf(VPValue *Cond, DebugLoc DL, const Twine &Name) {
+    return createInstruction(VPInstruction::AnyOf, {Cond}, DL, Name);
+  }
+
+  VPInstruction *createConditionalScalarAssignmentMaskSel(VPValue *Cond,
+                                                          VPValue *MaskPhi,
+                                                          VPValue *AnyOf,
+                                                          DebugLoc DL,
+                                                          const Twine &Name) {
+    return createInstruction(VPInstruction::ConditionalScalarAssignmentMaskSel,
+                             {Cond, MaskPhi, AnyOf}, DL, Name);
+  }
+
   //===--------------------------------------------------------------------===//
   // RAII helpers.
   //===--------------------------------------------------------------------===//
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 355ff40ce770e7..7102e0437caf5c 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -174,6 +174,8 @@ const char LLVMLoopVectorizeFollowupEpilogue[] =
 STATISTIC(LoopsVectorized, "Number of loops vectorized");
 STATISTIC(LoopsAnalyzed, "Number of loops analyzed for vectorization");
 STATISTIC(LoopsEpilogueVectorized, "Number of epilogues vectorized");
+STATISTIC(ConditionalScalarAssignmentsVectorized,
+          "Number of conditional scalar assignments vectorized");
 
 static cl::opt<bool> EnableEpilogueVectorization(
     "enable-epilogue-vectorization", cl::init(true), cl::Hidden,
@@ -4635,6 +4637,9 @@ static bool willGenerateVect...
[truncated]

@llvmbot
Member

llvmbot commented Dec 27, 2024

@llvm/pr-subscribers-llvm-transforms



github-actions bot commented Dec 27, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor Author

@michaelmaitland michaelmaitland left a comment

@fhahn I have brought your recent comments over from #106560

// preheader and middle block. It also contains recipes that are not backed by
// underlying instructions in the original loop. This makes it difficult to
// model in the legacy cost model.
if (!Legal.getConditionalScalarAssignments().empty())
Contributor Author

@fhahn

You said:

Better check for the CSA recipes?

What is the reason to walk all recipes in the plan looking for CSA recipes when we can do this check in O(1) like this?

@@ -423,6 +424,71 @@ class InductionDescriptor {
SmallVector<Instruction *, 2> RedundantCasts;
};

/// A Conditional Scalar Assignment is an assignment from an initial
/// scalar that may or may not occur.
class ConditionalScalarAssignmentDescriptor {
Contributor Author

@fhahn

You said:

I don't think CSA is a very common term, would be good to have a more descriptive name if possible

Intel has used the term conditional scalar assignment, which I have abbreviated as CSA. I have documented the acronym in multiple places in the code in this patch:

/// A Conditional Scalar Assignment (CSA) is an assignment from an initial
/// scalar that may or may not occur.

// This file "describes" induction, recurrence, and conditional scalar
// assignment (CSA) variables.

STATISTIC(CSAsVectorized,
"Number of conditional scalar assignments vectorized");

I thought that names such as ConditionalScalarAssignmentDescriptor, createConditionalScalarAssignmentMaskPhi, and VPConditionalScalarAssignmentDescriptorExtractScalarRecipe were quite long.

Do you have any suggestion on what you'd like it to be named? Is expanding CSA to ConditionalScalarAssignment everywhere your preference?

For now, I've tried to be proactive and did some renaming as a fixup in this patch. Please let me know what you think.

@@ -0,0 +1,68 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
Contributor Author

@fhahn, you said:

I think I am missing something, but the non-EVL codegen doesn't seem to use any predicated memory accesses, so it should be able to do it with a generic target?

Yes, the non-EVL codegen can do it with a generic target, but the EVL-related RUN line requires us to be target specific. We would then have two test files with the same test cases, differing only in their RUN lines. I can make that change if you'd prefer. In my opinion it is better not to duplicate the test cases and to compare the differences in vectorization in one file, but if you disagree, I am content to concede.

@michaelmaitland
Contributor Author

ping

@michaelmaitland michaelmaitland force-pushed the csa-noevl branch 2 times, most recently from 63a40e2 to 95bc608 on January 6, 2025, 15:50
Contributor

@artagnon artagnon left a comment

Happy to see this moving along :)

br i1 %exitcond.not, label %exit, label %loop
}

; CHECK: VPlan 'Initial VPlan for VF={vscale x 1},UF>=1' {
Contributor

How did the auto-generated CHECK lines end up after the function? I think the VPlan-based testing can be made more extensive.

Contributor Author

Sorry, this was not actually auto generated. I updated the file to remove the first line. Would you like to see any other CHECK lines in this file? If so, what would you like to see specifically?

Contributor

Is it possible to also include an EVL test? I think VPlan-based testing can be made more useful by also forcing different VFs? I think we can also report instruction costs, since the cost-computation in the code is quite non-trivial.

Contributor Author

Is it possible to also include an EVL test?

Done

I think VPlan-based testing can be made more useful by also forcing different VFs?

This seems redundant to me. It should not matter much for the VPlan.

I think we can also report instruction costs, since the cost-computation in the code is quite non-trivial.

I'll add another test here for this. I think it could be a good idea to have this depend on forced VFs.

@michaelmaitland michaelmaitland force-pushed the csa-noevl branch 2 times, most recently from 22fa83d to 49c8989 on January 13, 2025, 20:00
br i1 %exitcond.not, label %exit, label %loop
}

; CHECK: VPlan 'Initial VPlan for VF={vscale x 1},UF>=1' {
Contributor

Is it possible to also include an EVL test? I think VPlan-based testing can be made more useful by also forcing different VFs? I think we can also report instruction costs, since the cost-computation in the code is quite non-trivial.

@michaelmaitland
Contributor Author

ping!

@ayalz
Collaborator

ayalz commented Jan 21, 2025

The motivating example of

int t = init_val;
for (int i = 0; i < N; i++) {
  if (cond[i])
    t = a[i];
}
s = t; // use t

suggests to vectorize this as

int t = init_val;
<VF x i1> vmask = 0;
<VF x ?> va;
for (int i = 0; i < N; i+=VF) {
  vmaski = cond[i:i+VF-1];
  vai = a[i:i+VF-1];
  if (any(vmaski)) {
    vmask = vmaski;
    va = vai;
  }
}
if (any(vmask)) {
  i = last(vmask);
  t = extract(va, i);
}
s = t; // use t

arguing that it's better to pass vmask as a live-out and sink looking for its last turned-on lane to after the loop, instead of looking for it inside the loop and passing i as live-out.

Continuing with this argument, better to also sink the loading of a[i] to after the loop, instead of loading vectorized va with mask inside the loop?

In general, there may be some function t=f(i) of i that produces the value t being conditionally overwritten, e.g., think of f(i) as a polynomial of i. Such a function f should arguably be sunk and computed once after the loop - based on figuring out the (at most one) relevant iteration i. The reduction becomes a "FindLast" reduction once this function is sunk. Sound reasonable?
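As a minimal sketch of that fully-sunk alternative (not from the patch), extending the pseudo-code above: only the mask and a chunk base index stay live in the loop, and the single scalar load of a[i] is sunk to after it; the base bookkeeping is an added assumption.

<VF x i1> vmask = 0;
int base = 0;                      // assumed bookkeeping: start index of the chunk vmask describes
for (int i = 0; i < N; i += VF) {
  vmaski = cond[i:i+VF-1];
  if (any(vmaski)) {
    vmask = vmaski;                // only the mask is kept live; no data vector
    base = i;
  }
}
int t = init_val;
if (any(vmask))
  t = a[base + last(vmask)];       // f(i) = a[i] evaluated once, after the loop
s = t;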

This is reminiscent of sinking the "x, y" of AnyOf to produce a boolean reduction followed by an "anyof ? x : y" function, rather than carrying x and y inside the loop, see @fhahn's commit bccb7ed.

Finally, note that LV already supports some sort of "CSA", introduced by @annamthomas some years ago in https://reviews.llvm.org/D52656; see test variant_val_store_to_inv_address_conditional in llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll - a conditional store to the same address could be converted into a conditional scalar assignment coupled with sinking the conditional store to after the loop.

@michaelmaitland
Contributor Author

michaelmaitland commented Jan 21, 2025

@ayalz thanks for the review!

Continuing with this argument, better to also sink the loading of a[i] to after the loop, instead of loading vectorized va with mask inside the loop?

I agree in the example you bring up. But consider a case where a[i] is needed unconditionally in the loop, for example if cond[i] is replaced with a[i]; then doing the load after the loop duplicates loads. Additionally, if t = f(a, b, c), we need to keep a, b, c live out of the loop, which may be tricky.
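For illustration, a hypothetical variant of the motivating loop in which the loaded value itself forms the condition (the predicate a[i] > 0 is an assumption; the point is only that cond[i] is replaced with a[i]). The wide load of a[i] is needed inside the loop anyway, so sinking the load past the loop would re-read data that was already loaded:

int t = init_val;
for (int i = 0; i < N; i++) {
  if (a[i] > 0)     // the condition now consumes a[i] directly
    t = a[i];       // same loaded value reused; no separate cond[] array
}
s = t;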

I think there is some room to do such an optimization, but I prefer to leave this as future work. WDYT?

The reduction becomes a "FindLast" reduction once this function is sunk. Sound reasonable?

According to #106560 (comment), there is no plan to support FindLast at the moment. Above I describe a scenario where it is not beneficial to do this sinking, so we would not be able to rely on FindLast in all instances.

Finally, note that LV already supports some sort of "CSA", introduced by @annamthomas some years ago in https://reviews.llvm.org/D52656; see test variant_val_store_to_inv_address_conditional in llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll - a conditional store to the same address could be converted into a conditional scalar assignment coupled with sinking the conditional store to after the loop.

This sounds like a good idea for future improvement of these loops.

@michaelmaitland
Contributor Author

Ping

@michaelmaitland
Contributor Author

ping!

This patch adds initial support for CSA vectorization LLVM. This new class
can be characterized by vectorization of assignment to a scalar in a loop,
such that the assignment is conditional from the perspective of its use.
An assignment is conditional in a loop if a value may or may not be assigned
in the loop body.

For example:

```
int t = init_val;
for (int i = 0; i < N; i++) {
  if (cond[i])
    t = a[i];
}
s = t; // use t
```

Using pseudo-LLVM code this can be vectorized as

```
vector.ph:
  ...
  %t = %init_val
  %init.mask = <all-false-vec>
  %init.data = <poison-vec> ; uninitialized
vector.body:
  ...
  %mask.phi = phi [%init.mask, %vector.ph], [%new.mask, %vector.body]
  %data.phi = phi [%init.data, %vector.ph], [%new.data, %vector.body]
  %cond.vec = <widened-cmp> ...
  %a.vec    = <widened-load> %a, %i
  %b        = <any-lane-active> %cond.vec
  %new.mask = select %b, %cond.vec, %mask.phi
  %new.data = select %b, %a.vec, %data.phi
  ...
middle.block:
  %s = <extract-last-active-lane> %new.mask, %new.data
```

On each iteration, we track whether any lane in the widened condition was active,
and if it was take the current mask and data as the new mask and data vector.
Then at the end of the loop, the scalar can be extracted only once.

This transformation works the same way for integer, pointer, and floating point
conditional assignment, since the transformation does not require inspection
of the data being assigned.

In the vectorization of a CSA, we will be introducing recipes into the vector
preheader, the vector body, and the middle block. Recipes that are introduced
into the preheader and middle block are executed only one time, and recipes
that are in the vector body will be possibly executed multiple times. The more
times that the vector body is executed, the less of an impact the preheader
and middle block cost have on the overall cost of a CSA.

A detailed explanation of the concept can be found [here](https://discourse.llvm.org/t/vectorization-of-conditional-scalar-assignment-csa/80964).

This patch is further tested in llvm/llvm-test-suite#155.

This patch contains only the non-EVL related code. It is based on the larger
patch of llvm#106560, which contained both EVL and non-EVL related parts.
Contributor

@fhahn fhahn left a comment

@ayalz thanks for the review!

Continuing with this argument, better to also sink the loading of a[i] to after the loop, instead of loading vectorized va with mask inside the loop?

I agree in the example you bring up. But consider a case where a[i] is needed unconditionally in the loop, for example if cond[i] is replaced with a[i]; then doing the load after the loop duplicates loads. Additionally, if t = f(a, b, c), we need to keep a, b, c live out of the loop, which may be tricky.

I think there is some room to do such an optimization, but I prefer to leave this as future work. WDYT?

The reduction becomes a "FindLast" reduction once this function is sunk. Sound reasonable?

IIUC the current patch will always load a wide value on each vector iteration, versus doing at most one extra scalar load outside the loop, which seems like it would be more profitable in general?

@ayalz's suggestion may help avoid additional recipes by re-using the existing ones (an example is sinking stores for reductions to an invariant address).

I added some comments inline to clarify some of the added recipes in the patch; perhaps some of them can be replaced by existing recipes to reduce the complexity.

Recipes should also avoid holding references to underlying IR or modifying existing IR, if possible.

According to #106560 (comment), there is no plan to support FindLast at the moment. Above I describe a scenario where it is not beneficial to do this sinking, so we would not be able to rely on FindLast in all instances.

Now that @Mel-Chen's recent changes landed, I'd hope that adding support for FindLast may be slightly easier, but I am not sure. Maybe @Mel-Chen has additional thoughts?

cast<PHINode>(getUnderlyingInstr()), *getOperand(0));
}

VPValue *NewData = nullptr;
Contributor

In general, recipes should manage any VPValues via their operands; otherwise replaceAllUsesWith won't work as expected.

VPValue *getVPNewData() { return NewData; }
};

class VPConditionalScalarAssignmentDataUpdateRecipe final
Contributor

It's difficult to tell what the semantics of the recipe are without comments; it's not clear to me why this cannot be just a select?


class VPConditionalScalarAssignmentExtractScalarRecipe final
: public VPSingleDefRecipe {
SmallVector<PHINode *> PhisToFix;
Contributor

Does the recipe need to hold references to IR Phis? Can it instead be an operand of the phis that need updating?

Value *AnyOf = State.get(getOperand(2), /*NeedsScalar=*/true);
Value *MaskSel =
State.Builder.CreateSelect(AnyOf, WidenedCond, MaskPhi, "csa.mask.sel");
cast<PHINode>(MaskPhi)->addIncoming(MaskSel, State.CFG.PrevBB);
Contributor

Why is this needed? A recipe's execution shouldn't modify outside the IR generated by the reicpe.

Can we instead update MaskPhi via the VPlan Def-use chains?

Comment on lines +2539 to +2540
return ST->hasVInstructions() &&
ST->getProcFamily() == RISCVSubtarget::SiFive7;
Contributor

It would be great to not limit this to a single processor family. IIUC there's nothing target-specific about the feature and the cost model should ideally accurately compute the cost, without needing a separate TTI hook?

@michaelmaitland
Contributor Author

Now that @Mel-Chen's recent changes landed, I'd hope that adding support for FindLast may be slightly easier, but I am not sure. Maybe @Mel-Chen has additional thoughts?

The FindLast approach only works when the values are monotonic; CSA works regardless of whether they are monotonic. Reductions kick in before CSA, so if we can vectorize using FindLast, we will vectorize it that way (and prefer it) over CSA.

@ayalz
Collaborator

ayalz commented Feb 10, 2025

@ayalz thanks for the review!

Continuing with this argument, better to also sink the loading of a[i] to after the loop, instead of loading vectorized va with mask inside the loop?

I agree in the example you bring up.

This is "The motivating example" rather than one I brought up.

But consider a case where a[i] is needed unconditionally in the loop, for example if cond[i] is replaced with a[i]; then doing the load after the loop duplicates loads. Additionally, if t = f(a, b, c), we need to keep a, b, c live out of the loop, which may be tricky.

If a, b, c are loop invariant they should be available just the same right after the loop?
There is indeed potential for reuse inside the vectorized loop, but even then maintaining a vector va live across iterations and live out of the loop may incur a cost (register pressure?).
Would be good to reason about concrete examples along with their associated costs.

I think there is some room to do such an optimization, but I prefer to leave this as future work. WDYT?

The reduction becomes a "FindLast" reduction once this function is sunk. Sound reasonable?

I think such patterns are essentially extensions of "FindLast" reduction and should be developed as such, rather than being considered distinct unrelated patterns.

According to #106560 (comment), there is no plan to support FindLast at the moment. Above I describe a scenario where it is not beneficial to do this sinking, so we would not be able to rely on FindLast in all instances.

Note that by replacing t = a[i] with t = i the above "CSA" example becomes a FindLast one, hopefully suggesting such a supportive plan?

Finally, note that LV already supports some sort of "CSA", introduced by @annamthomas some years ago in https://reviews.llvm.org/D52656; see test variant_val_store_to_inv_address_conditional in llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll - a conditional store to the same address could be converted into a conditional scalar assignment coupled with sinking the conditional store to after the loop.

This sounds like a good idea for future improvement of these loops.

-- one which would also seem to benefit from sinking f(i) = c[i] of %t4 to after the loop, based on a FindLast reduction.

@michaelmaitland
Contributor Author

michaelmaitland commented Feb 10, 2025

@ayalz, we are carrying future patches which should improve the performance of the loop by using mask logic instead of reductions inside the loop. I planned on posting it as a follow up since it is really more performant in cases where the target core has mask units that can handle it well. It looks something like this:

int t = init_val;
<VF x i1> vmask = 0;
<VF x ?> va;
for (int i = 0; i < N; i+=VF) {
  vmaski =  cond[i:i+VF-1];
  vmask = (vmsbf(vmaski) & vmask) | vmaski
  vai = a[i:i+VF-1]
  va = vmerge vmaski, vai, va
}
if any(vmask) {
  i = last(vmask)
  t = extract (va, i)
}
s = t; // use t

This is not the same as a FindLast inside the loop because there is no reduction performed on each loop iteration. Since this pattern is not an extension of "FindLast", I'm not sure it is a good idea to develop CSAs as in-loop reductions.

I think we're saying the same thing though, where we both want to see FindLast in the middle block. This is essentially what I'm doing with the ExtractRecipe. I wrote the ExtractRecipe since FindLast only supports monotonics. I still think that we need to keep the in-loop approach though, regardless of whether we use FindLast outside the loop or not?

I think such patterns are essentially extensions of "FindLast" reduction and should be developed as such, rather than being considered distinct unrelated patterns.

@Mel-Chen can you chime in here? Can FindLast handle non-monotonic cases? I think the reason we took the approach proposed in this patch was because FindLast only works for monotonic cases.

@Mel-Chen
Contributor

@Mel-Chen can you chime in here? Can FindLast handle non-monotonic cases? I think the reason we took the approach proposed in this patch was because FindLast only works for monotonic cases.

I think what @ayalz wants to discuss with you is the newly added RecurKind::FindLast, rather than the existing RecurKind::FindLastIV.

I still believe that the semantics you are implementing are a form of reduction. Therefore, reusing the existing reduction framework is appropriate. I don't think this change would affect how you ultimately generate the vectorized IR.

Ideally, the vectorizer should select the best vectorization approach in the following order:

  1. Try RecurKind::AnyOf (select between invariants)
  2. If not applicable, try RecurKind::FindLastIV (select from a monotonic increasing sequence)
  3. If not applicable, try RecurKind::FindLast (select from a set of variables)
  4. If none apply, then it cannot be vectorized.

@ayalz
Collaborator

ayalz commented Feb 11, 2025

@michaelmaitland - thanks for the intriguing discussion!

@ayalz, we are carrying future patches which should improve the performance of the loop by using mask logic instead of reductions inside the loop. I planned on posting it as a follow up since it is really more performant in cases where the target core has mask units that can handle it well. It looks something like this:

int t = init_val;
<VF x i1> vmask = 0;
<VF x ?> va;
for (int i = 0; i < N; i+=VF) {
  vmaski =  cond[i:i+VF-1];
  vmask = (vmsbf(vmaski) & vmask) | vmaski
  vai = a[i:i+VF-1]
  va = vmerge vmaski, vai, va
}
if any(vmask) {
  i = last(vmask)
  t = extract (va, i)
}
s = t; // use t

This is not the same as a FindLast inside the loop because there is no reduction performed on each loop iteration. Since this pattern is not an extension of "FindLast", I'm not sure it is a good idea to develop CSAs as in-loop reductions.

It may be helpful to distinguish between:

  1. the input pattern - as it appears in the IR given to the vectorizer;
  2. the canonical pattern - underlying essential construct recognized during vectorization, and
  3. the output pattern - as it appears in the IR generated by the vectorizer.

The canonical pattern for the motivating example, presented with the following input pattern:

  int t = init_val;
  for (int i = 0; i < N; i++) {
    if (cond[i])
      t = a[i];
  }
  s = t; // use t

is arguably that of a FindLast reduction. Indeed, an alternative input pattern, using a sentinel, is:

  int t = init_val;
  int last = -1;
  for (int i = 0; i < N; i++) {
    if (cond[i])
      last = i;
  }
  s = (last == -1) ? init_val : a[last];

Hopefully these two input patterns could be recognized and manipulated using common infrastructure including recipes, and benefit from similar output patterns. Now, regarding the latter - multiple options may indeed be considered, with the one having the best cost to be selected. The above example demonstrates one such option, where
vmask = (vmsbf(vmaski) & vmask) | vmaski
is an optimized version of
vmask = any(vmaski) ? vmaski : vmask
with a slight deviation - if any(vmaski) then any bit prior to the last active bit of vmaski may be turned on due to vmask, but such bits are irrelevant (pls correct if needed).
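A worked example may help (assuming VF = 4, lane 0 first, and RVV-style vmsbf "set-before-first" semantics):

  vmask  (previous)                = {1, 1, 0, 0}
  vmaski (this iteration)          = {0, 0, 1, 0}   ; lane 2 active
  vmsbf(vmaski)                    = {1, 1, 0, 0}   ; lanes before the first active lane
  (vmsbf(vmaski) & vmask) | vmaski = {1, 1, 1, 0}
  any(vmaski) ? vmaski : vmask     = {0, 0, 1, 0}

Both forms agree on the last active lane (lane 2), which is all the middle-block extract consults; the extra low bits in the first form are exactly the harmless deviation described above. When vmaski is all-zero, vmsbf(vmaski) is all-ones, so the formula reduces to vmask, matching the select form.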

As mentioned above, it may indeed be "better to pass vmask as a live-out and sink looking for its last turned-on lane to after
the loop, instead of looking for it inside the loop and passing i as live-out" - when considering how to best optimize a FindLast pattern; this includes how to best maintain vmask inside the loop.

Now, whether it is better to sink a single scalar load of a[last] out of the loop, or maintain a vector va inside the loop eventually holding it, via

  vai = a[i:i+VF-1]
  va = vmerge vmaski, vai, va

are also two options to choose from based on cost, unless one clearly outweighs the other.

I think we're saying the same thing though, where we both want to see FindLast in the middle block. This is essentially what I'm doing with the ExtractRecipe. I wrote the ExtractRecipe since FindLast only supports monotonics. I still think that we need to keep the in-loop approach though, regardless of whether we use FindLast outside the loop or not?

I think such patterns are essentially extensions of "FindLast" reduction and should be developed as such, rather than being considered distinct unrelated patterns.

@Mel-Chen can you chime in here? Can FindLast handle non-monotonic cases? I think the reason we took the approach proposed in this patch was because FindLast only works for monotonic cases.

FindLast conceptually computes the last iteration for which some condition holds, and the iteration count is conceptually monotone and increasing (wraparound at the end issue?). Computing FindLast using a sentinel value, however, requires that such a value exists. In its absence, an external "found" indicator can be used, or the condition can be checked for the resulting iteration, as @Mel-Chen pointed out in several TODOs of #67812. Note that decreasing IVs are derived from the canonical iteration count IV, and their derivative function could be sunk as well. In any case, if the current limitation of FindLast regarding monotone/sentinel is too restrictive, it would be good to work towards lifting it?
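For reference, a minimal sketch of the external "found" indicator mentioned above, usable when no sentinel index is available:

int last = 0;
int found = 0;                   // external indicator instead of a sentinel
for (int i = 0; i < N; i++) {
  if (cond[i]) {
    last = i;
    found = 1;
  }
}
s = found ? a[last] : init_val;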

@ayalz
Collaborator

ayalz commented Feb 11, 2025

@Mel-Chen can you chime in here? Can FindLast handle non-monotonic cases? I think the reason we took the approach proposed in this patch was because FindLast only works for monotonic cases.

I think what @ayalz wants to discuss with you is the newly added RecurKind::FindLast, rather than the existing RecurKind::FindLastIV.

I still believe that the semantics you are implementing are a form of reduction. Therefore, reusing the existing reduction framework is appropriate. I don't think this change would affect how you ultimately generate the vectorized IR.

Ideally, the vectorizer should select the best vectorization approach in the following order:

  1. Try RecurKind::AnyOf (select between invariants)
  2. If not applicable, try RecurKind::FindLastIV (select from a monotonic increasing sequence)
  3. If not applicable, try RecurKind::FindLast (select from a set of variables)
  4. If none apply, then it cannot be vectorized.

Curious about the need for both FindLast and FindLastIV. The desired output may be some function of the last IV found, and this function may be applied after the loop or folded into the loop along with the reduction if preferred (similar to two invariants per boolean AnyOf), but the canonical reduction pattern for both is presumably that of finding the last iteration?

Trying to reason about where the above ideal order may be important: every header phi, which represents a cross-iteration dependence, must be handled by the vectorizer, according to one of the following vectorizable categories:

  • FOR, acyclic: all users of the phi appear (or can move) below the instruction feeding the phi along the back-edge. Handled as Fixed Order Recurrence.
  • Induction: cyclic non-wrapping addition chain with invariant addend step, aka AddRec by SCEV.
  • Reduction: cyclic, non-induction (non-addition chain or addition chain with varying step). A {value(i)} sequence of values all of the same type (integer, floating point or boolean) is computed one per iteration i, independent of other iterations, and reduced to a single value of this type by repeatedly applying a binary operator to value(i) and the reduction phi.
    • Select Reduction: the reduction result may be an element of the sequence being reduced if this binary operator selects one of its two operands. This is the case for Min/Max reductions and arguably also for AnyOf - which can use a select value(i) ? value(i) : phi as its binary reduction operator. Note that if AnyOf returns false, all value(i)'s are false, so any value(i) can be considered selected.
    • Selected Index: the resulting value(i) of a select reduction can be complemented with its corresponding iteration i. This is the case for Min/Max-with-index and also for FindLast - corresponding to AnyOf-with-index, returning the last iteration i for which value(i) is true, if any.

There may be additional recurrences to consider?
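For concreteness, minimal loops that fall into the first three categories (illustrative sketches, not from the patch; the select-reduction and selected-index cases are essentially the AnyOf/FindLast loops discussed throughout this thread):

// Fixed Order Recurrence: the previous-iteration value of the phi is used.
int prev = a[0];
for (int i = 1; i < N; i++) {
  b[i] = prev + a[i];
  prev = a[i];
}

// Induction: non-wrapping addition chain with an invariant step.
for (int i = 0, j = 0; i < N; i++, j += 4)
  c[i] = j;

// Reduction: a per-iteration value folded into the phi with a binary operator.
int sum = 0;
for (int i = 0; i < N; i++)
  sum += a[i];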

@Mel-Chen
Contributor

Ideally, the vectorizer should select the best vectorization approach in the following order:

  1. Try RecurKind::AnyOf (select between invariants)
  2. If not applicable, try RecurKind::FindLastIV (select from a monotonic increasing sequence)
  3. If not applicable, try RecurKind::FindLast (select from a set of variables)
  4. If none apply, then it cannot be vectorized.

Curious about the need for both FindLast and FindLastIV. The desired output may be some function of the last IV found, and this function may be applied after the loop or folded into the loop along with the reduction if preferred (similar to two invariants per boolean AnyOf), but the canonical reduction pattern for both is presumably that of finding the last iteration?

The reason for distinguishing FindLastIV from FindLast is similar to the distinction between AnyOf and FindLast.

Under specific conditions—such as AnyOf for two invariants or FindLastIV for a monotonic sequence—we can achieve better vectorization performance by using the lower-cost AnyOf or FindLastIV instead of always using FindLast.

While it is possible to initially classify all non-min/max select reductions as FindLast and later refine them into AnyOf or FindLastIV, I don’t see any clear advantage in doing so.

Instead, performing a precise classification into AnyOf, FindLastIV, and FindLast during the legality check allows the target to decide, based on the specific reduction kind, whether to apply an in-loop or out-of-loop reduction once in-loop non-min/max select reductions are supported in the future.
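As a rough illustration of that classification (a sketch; the precise legality conditions live in the recurrence analysis):

// AnyOf: selects between loop-invariant values.
int r1 = 0;
for (int i = 0; i < N; i++)
  if (cond[i]) r1 = 3;

// FindLastIV: selects from a monotonically increasing sequence (here, i itself).
int r2 = -1;
for (int i = 0; i < N; i++)
  if (cond[i]) r2 = i;

// FindLast: selects from arbitrary per-iteration values; this is the CSA case.
int r3 = init_val;
for (int i = 0; i < N; i++)
  if (cond[i]) r3 = a[i];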

  • Reduction: cyclic, non-induction (non-addition chain or addition chain with varying step). A {value(i)} sequence of values all of the same type (integer, floating point or boolean) is computed one per iteration i, independent of other iterations, and reduced to a single value of this type by repeatedly applying a binary operator to value(i) and the reduction phi.

    • Select Reduction: the reduction result may be an element of the sequence being reduced if this binary operator selects one of its two operands. This is the case for Min/Max reductions and arguably also for AnyOf - which can use a select value(i) ? value(i) : phi as its binary reduction operator. Note that if AnyOf returns false, all value(i)'s are false, so any value(i) can be considered selected.
    • Selected Index: the resulting value(i) of a select reduction can be complemented with its corresponding iteration i. This is the case for Min/Max-with-index and also for FindLast - corresponding to AnyOf-with-index, returning the last iteration i for which value(i) is true, if any.

There may be additional recurrences to consider?

I'm currently considering adding MinMaxRecurrence to better support min/max with index idioms.

The current reduction analysis does not allow internal loop users outside the recurrence chain, nor does it allow the recurrence chain without external users. This does not align well with the conditions of min/max reduction in min/max with index, which is why I'm not satisfied with my previous implementation.

Other than the MinMaxRecurrence I'm planning, I haven't thought of any additional recurrences to consider.
