namespace Eigen {

/** \page TopicInsideEigenExample What happens inside Eigen, on a simple example

\eigenAutoToc

<hr>


Consider the following example program:

\code
#include <Eigen/Core>

int main()
{
  int size = 50;
  // VectorXf is a vector of floats, with dynamic size.
  Eigen::VectorXf u(size), v(size), w(size);
  u = v + w;
}
\endcode

The goal of this page is to understand how Eigen compiles it, assuming that SSE2 vectorization is enabled (GCC option -msse2).

\section WhyInteresting Why it's interesting

You might think that the above example program is so simple that compiling it shouldn't involve anything interesting. So before starting, let us explain what is nontrivial in compiling it correctly -- that is, in producing optimized code -- so that the complexity of Eigen, which we'll explain here, proves really useful.

Look at the line of code
\code
  u = v + w;   //   (*)
\endcode

The first important point about compiling it is that the arrays should be traversed only once, as in
\code
  for(int i = 0; i < size; i++) u[i] = v[i] + w[i];
\endcode
The problem is that if we make a naive C++ library where the VectorXf class has an operator+ returning a VectorXf, then the line of code (*) will amount to:
\code
  VectorXf tmp = v + w;
  VectorXf u = tmp;
\endcode
Obviously, the temporary \a tmp introduced here is useless. It has a very bad effect on performance, first because the creation of \a tmp requires a dynamic memory allocation in this context, and second because there are now two for loops:
\code
  for(int i = 0; i < size; i++) tmp[i] = v[i] + w[i];
  for(int i = 0; i < size; i++) u[i] = tmp[i];
\endcode
Traversing the arrays twice instead of once is terrible for performance, as it means that we do many redundant memory accesses.

The second important point about compiling the above program is to make correct use of SSE2 instructions. Notice that Eigen also supports AltiVec, and that all the discussion here applies to AltiVec as well.

SSE2, like AltiVec, is a set of instructions that perform computations on packets of 128 bits at once. Since a float is 32 bits, SSE2 instructions can handle 4 floats at once, which means that, if correctly used, they can make our computation go up to 4x faster.

However, in the above program we have chosen size=50, so our vectors consist of 50 floats, and 50 is not a multiple of 4. This means that we cannot hope to do all of that computation using SSE2 instructions. The next best thing, which we should aim for, is to handle the first 48 coefficients with SSE2 instructions, since 48 is the biggest multiple of 4 below 50, and then handle the 49th and 50th coefficients separately, without SSE2. Something like this:

\code
  for(int i = 0; i < 4*(size/4); i+=4) u.packet(i)  = v.packet(i) + w.packet(i);
  for(int i = 4*(size/4); i < size; i++) u[i] = v[i] + w[i];
\endcode

So let us look line by line at our example program, and let's follow Eigen as it compiles it.

\section ConstructingVectors Constructing vectors

Let's analyze the first line:

\code
  Eigen::VectorXf u(size), v(size), w(size);
\endcode

First of all, VectorXf is the following typedef:
\code
  typedef Matrix<float, Dynamic, 1> VectorXf;
\endcode

The class template Matrix is declared in src/Core/util/ForwardDeclarations.h with 6 template parameters, but the last 3 are automatically determined by the first 3. So you don't need to worry about them for now. Here, Matrix\<float, Dynamic, 1\> means a matrix of floats, with a dynamic number of rows and 1 column.

The Matrix class inherits from a base class, MatrixBase. Don't worry about it for now; it suffices to say that MatrixBase is what unifies matrices/vectors and all the expression types -- more on that below.

When we do
\code
  Eigen::VectorXf u(size);
\endcode
the constructor that is called is Matrix::Matrix(int), in src/Core/Matrix.h. Besides some assertions, all it does is construct the \a m_storage member, which is of type DenseStorage\<float, Dynamic, Dynamic, 1\>.

You may wonder: isn't it overengineering to have the storage in a separate class? The reason is that the Matrix class template covers all kinds of matrices and vectors: both fixed-size and dynamic-size. The storage method is not the same in these two cases. For fixed-size, the matrix coefficients are stored as a plain member array. For dynamic-size, the coefficients are stored as a pointer to a dynamically-allocated array. Because of this, we need to abstract storage away from the Matrix class. That's DenseStorage.
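
To make this concrete, here is a minimal, hypothetical sketch of the idea. The names and details below are invented for illustration and are not Eigen's actual DenseStorage, which additionally deals with alignment and with many more cases:
\code
// Hypothetical illustration only, not Eigen's real DenseStorage.
const int ToyDynamic = -1;

// Primary template: the size is known at compile-time,
// so the coefficients live in a plain member array -- no heap allocation at all.
template<typename T, int Size> struct ToyStorage
{
  T m_data[Size];
  const T* data() const { return m_data; }
};

// Partial specialization for dynamic size:
// the coefficients live in a dynamically-allocated array.
template<typename T> struct ToyStorage<T, ToyDynamic>
{
  T* m_data;
  int m_size;
  explicit ToyStorage(int size) : m_data(new T[size]), m_size(size) {}
  ~ToyStorage() { delete[] m_data; }        // (copying is omitted in this sketch)
  const T* data() const { return m_data; }
};
\endcode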

Let's now look at the DenseStorage constructor, in src/Core/DenseStorage.h. You can see that there are many partial template specializations of DenseStorage here, treating separately the cases where dimensions are Dynamic or fixed at compile-time. The partial specialization that we are looking at is:
\code
template<typename T, int _Cols> class DenseStorage<T, Dynamic, Dynamic, _Cols>
\endcode

Here, the constructor called is DenseStorage::DenseStorage(int size, int rows, int columns)
with size=50, rows=50, columns=1.

Here is this constructor:
\code
inline DenseStorage(int size, int rows, int) : m_data(internal::aligned_new<T>(size)), m_rows(rows) {}
\endcode

Here, the \a m_data member is the actual array of coefficients of the matrix. As you can see, it is dynamically allocated. Rather than calling new[] or malloc(), we have our own internal::aligned_new, defined in src/Core/util/Memory.h. If vectorization is enabled, it uses a platform-specific call to allocate a 128-bit-aligned array, which is very useful for vectorization with both SSE2 and AltiVec. If vectorization is disabled, it amounts to the standard new[].
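
The idea behind internal::aligned_new can be sketched roughly as follows. This is a simplified, hypothetical version for a POD scalar type like float; the real code in Memory.h supports several platforms and allocation strategies and handles errors and non-trivial types more carefully:
\code
#include <cstdlib>
#include <new>

// Rough, hypothetical sketch of the idea behind internal::aligned_new.
template<typename T>
T* toy_aligned_new(std::size_t size)
{
#ifdef TOY_VECTORIZE
  // Vectorization enabled: ask the platform for 16-byte (128-bit) aligned memory.
  void* ptr = 0;
  if(posix_memalign(&ptr, 16, size * sizeof(T)) != 0)
    throw std::bad_alloc();
  return static_cast<T*>(ptr);
#else
  // Vectorization disabled: this amounts to the standard new[].
  return new T[size];
#endif
}
\endcode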

Coming back to the DenseStorage constructor above: it also sets the \a m_rows member to \a size. Notice that there is no \a m_columns member: indeed, in this partial specialization of DenseStorage, we know the number of columns at compile-time, since the _Cols template parameter is different from Dynamic. Namely, in our case, _Cols is 1, which is to say that our vector is just a matrix with 1 column. Hence, there is no need to store the number of columns as a runtime variable.

When you call VectorXf::data() to get the pointer to the array of coefficients, it returns DenseStorage::data(), which returns the \a m_data member.

When you call VectorXf::size() to get the size of the vector, this is actually a method in the base class MatrixBase. It determines that the vector is a column-vector, since ColsAtCompileTime==1 (this comes from the template parameters in the typedef VectorXf). It deduces that the size is the number of rows, so it returns VectorXf::rows(), which returns DenseStorage::rows(), which returns the \a m_rows member, which was set to \a size by the constructor.
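
In other words, the call chain is size() -> rows() -> m_rows. Here is a loose sketch of that chain with hypothetical, heavily simplified classes (not the real MatrixBase/DenseStorage code):
\code
// Hypothetical, simplified sketch of the call chain described above.
struct ToyStorageXf
{
  float* m_data;
  int m_rows;                        // set to `size` by the constructor
  int rows() const { return m_rows; }
};

struct ToyVectorXf
{
  enum { ColsAtCompileTime = 1 };    // known at compile-time from the Matrix<float, Dynamic, 1> typedef
  ToyStorageXf m_storage;
  int rows() const { return m_storage.rows(); }
  int size() const { return rows(); }  // a column-vector: the size is just the number of rows
};
\endcode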

\section ConstructionOfSumXpr Construction of the sum expression

Now that our vectors are constructed, let's move on to the next line:

\code
u = v + w;
\endcode

The executive summary is that operator+ returns a "sum of vectors" expression, but doesn't actually perform the computation. It is the operator=, whose call occurs thereafter, that does the computation.

Let us now see what Eigen does when it sees this:

\code
v + w
\endcode

Here, v and w are of type VectorXf, which is a typedef for a specialization of Matrix (as we explained above), which is a subclass of MatrixBase. So what is being called is

\code
MatrixBase::operator+(const MatrixBase&)
\endcode

The return type of this operator is
\code
CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf>
\endcode
The CwiseBinaryOp class is our first encounter with an expression template. As we said, operator+ doesn't perform any computation by itself; it just returns an abstract "sum of vectors" expression. Since there are also "difference of vectors" and "coefficient-wise product of vectors" expressions, we unify them all as "coefficient-wise binary operations", which we abbreviate as "CwiseBinaryOp". "Coefficient-wise" means that the operation is performed coefficient by coefficient. "Binary" means that there are two operands -- we are adding two vectors to one another.
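
To get a feeling for what such an expression object is, here is a minimal, hypothetical sketch (invented names, float-only, and much simpler than the real CwiseBinaryOp): it stores its two operands and the functor, and computes a coefficient only when asked for one.
\code
// Hypothetical sketch of a lazy "coefficient-wise binary operation" expression.
template<typename Functor, typename Lhs, typename Rhs>
struct ToyCwiseBinaryOp
{
  ToyCwiseBinaryOp(const Lhs& lhs, const Rhs& rhs) : m_lhs(lhs), m_rhs(rhs) {}

  // No array of results is stored anywhere: each coefficient is computed on demand.
  float coeff(int index) const { return m_functor(m_lhs.coeff(index), m_rhs.coeff(index)); }

  const Lhs& m_lhs;   // reference to the left operand  (e.g. the vector v)
  const Rhs& m_rhs;   // reference to the right operand (e.g. the vector w)
  Functor m_functor;  // the operation to apply to each pair of coefficients, e.g. a "sum" functor
};
\endcode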

Now you might ask, what if we did something like

\code
v + w + u;
\endcode

The first v + w would return a CwiseBinaryOp as above, so in order for this to compile, we'd need to define an operator+ in the class CwiseBinaryOp as well... At this point it starts looking like a nightmare: are we going to have to define all operators in each of the expression classes (as you guessed, CwiseBinaryOp is only one of many)? This looks like a dead end!

The solution is that CwiseBinaryOp itself, like Matrix and all the other expression types, is a subclass of MatrixBase. So it is enough to define the operators once and for all in the class MatrixBase.

Since MatrixBase is the common base class of different subclasses, the aspects that depend on the subclass must be abstracted from MatrixBase. This is called polymorphism.

The classical approach to polymorphism in C++ is by means of virtual functions. This is dynamic polymorphism. Here we don't want dynamic polymorphism, because the whole design of Eigen is based around the assumption that all the complexity, all the abstraction, gets resolved at compile-time. This is crucial: if the abstraction can't get resolved at compile-time, Eigen's compile-time optimization mechanisms become useless, not to mention that abstraction resolved at runtime incurs an overhead of its own.

Here, what we want is to have a single class MatrixBase as the base of many subclasses, in such a way that each MatrixBase object (be it a matrix, or vector, or any kind of expression) knows at compile-time (as opposed to run-time) which particular subclass it is an object of (i.e. whether it is a matrix, or an expression, and what kind of expression).

The solution is the <a href="http://en.wikipedia.org/wiki/Curiously_Recurring_Template_Pattern">Curiously Recurring Template Pattern</a>. Let's take a break now. Feel free to read that Wikipedia page during the break if needed, but it won't be allowed during the exam.

In short, MatrixBase takes a template parameter \a Derived. Whenever we define a subclass Subclass, we actually make Subclass inherit from MatrixBase\<Subclass\>. The point is that different subclasses inherit from different MatrixBase types. Thanks to this, whenever we have an object of a subclass and call some MatrixBase method on it, we still know, even from inside the MatrixBase method, which particular subclass we're talking about.

This means that we can put almost all the methods and operators in the base class MatrixBase, and have only the bare minimum in the subclasses. If you look at the subclasses in Eigen, like for instance the CwiseBinaryOp class, they have very few methods. There are coeff() and sometimes coeffRef() methods for access to the coefficients, there are rows() and cols() methods returning the number of rows and columns, but there isn't much more than that. All the meat is in MatrixBase, so it only needs to be coded once for all kinds of expressions, matrices, and vectors.
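
Here is a bare-bones, hypothetical sketch of the pattern, unrelated to Eigen's actual classes, just to show the mechanism:
\code
// Hypothetical CRTP sketch: the base class knows its derived type at compile-time.
template<typename Derived>
struct ToyBase
{
  // A method written once in the base class...
  int twiceTheSize() const { return 2 * derived().size(); }

  // ...which can reach the derived object without any virtual function:
  const Derived& derived() const { return static_cast<const Derived&>(*this); }
};

struct ToyVector : ToyBase<ToyVector>   // ToyVector inherits ToyBase<ToyVector>
{
  int size() const { return 50; }
};

// usage: ToyVector x;  int n = x.twiceTheSize();   // resolved at compile-time, returns 100
\endcode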

So let's end this digression and come back to the piece of code from our example program that we were analyzing,

\code
v + w
\endcode

Now that MatrixBase is a good friend, let's write out in full the prototype of the operator+ that gets called here (this code is from src/Core/MatrixBase.h):

\code
template<typename Derived>
class MatrixBase
{
  // ...

  template<typename OtherDerived>
  const CwiseBinaryOp<internal::scalar_sum_op<typename internal::traits<Derived>::Scalar>, Derived, OtherDerived>
  operator+(const MatrixBase<OtherDerived> &other) const;

  // ...
};
\endcode

Here, of course, \a Derived and \a OtherDerived are VectorXf.

As we said, CwiseBinaryOp is also used for other operations such as subtraction, so it takes another template parameter determining the operation that will be applied to the coefficients. This template parameter is a functor, that is, a class that defines an operator() so that its instances behave like functions. Here, the functor used is internal::scalar_sum_op. It is defined in src/Core/Functors.h.
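
If the term "functor" is new to you, here is a minimal example of one (a hypothetical toy, not Eigen's scalar_sum_op, whose real definition is shown further below):
\code
// A functor: a class with an operator(), so its instances can be called like functions.
struct toy_sum_op
{
  float operator()(float a, float b) const { return a + b; }
};

// usage:
//   toy_sum_op op;
//   float three = op(1.0f, 2.0f);   // calls toy_sum_op::operator()
\endcode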

Let us now explain the internal::traits here. The internal::scalar_sum_op class takes one template parameter: the type of the numbers to handle. Here, of course, we want to pass the scalar type (a.k.a. numeric type) of VectorXf, which is \c float. How do we determine the scalar type of \a Derived? Throughout Eigen, all matrix and expression types define a typedef \a Scalar which gives their scalar type. For example, VectorXf::Scalar is a typedef for \c float. So here, if life were easy, we could find the numeric type of \a Derived as just
\code
typename Derived::Scalar
\endcode
Unfortunately, we can't do that here, as the compiler would complain that the type Derived hasn't yet been defined. So we use a workaround: in src/Core/util/ForwardDeclarations.h, we declared (not defined!) all our subclasses, like Matrix, and we also declared the following class template:
\code
template<typename T> struct internal::traits;
\endcode
In src/Core/Matrix.h, right \em before the definition of class Matrix, we define a partial specialization of internal::traits for T=Matrix\<any template parameters\>. In this specialization of internal::traits, we define the Scalar typedef. So when we actually define Matrix, it is legal to refer to "typename internal::traits\<Matrix\>::Scalar".
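
Reduced to its bare bones, the pattern looks like this (a hypothetical sketch with simplified names, not the real Eigen code):
\code
// Hypothetical sketch of the traits workaround described above.

// 1. In a forward-declarations header: declare (but don't define) everything.
template<typename T> struct toy_traits;
template<typename Scalar_> class ToyMatrix;

// 2. Right before the definition of ToyMatrix: specialize the traits for it.
template<typename Scalar_> struct toy_traits< ToyMatrix<Scalar_> >
{
  typedef Scalar_ Scalar;
};

// 3. The base class can now legally ask for the scalar type of its (still incomplete) Derived type.
template<typename Derived>
class ToyBase
{
public:
  typedef typename toy_traits<Derived>::Scalar Scalar;
};

// 4. Finally, define ToyMatrix itself, inheriting from the base.
template<typename Scalar_>
class ToyMatrix : public ToyBase< ToyMatrix<Scalar_> >
{
};
\endcode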

Anyway, we have declared our operator+. In our case, where \a Derived and \a OtherDerived are VectorXf, the above declaration amounts to:
\code
class MatrixBase<VectorXf>
{
  // ...

  const CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf>
  operator+(const MatrixBase<VectorXf> &other) const;

  // ...
};
\endcode

Let's now jump to src/Core/CwiseBinaryOp.h to see how this operator+ is defined. As you can see there, all it does is return a CwiseBinaryOp object, and this object just stores references to the left-hand-side and right-hand-side expressions -- here, these are the vectors \a v and \a w. Well, the CwiseBinaryOp object also stores an instance of the (empty) functor class, but you shouldn't worry about it, as that is a minor implementation detail.

Thus, operator+ hasn't performed any actual computation. To summarize, the operation \a v + \a w just returned an object of type CwiseBinaryOp which did nothing other than store references to \a v and \a w.

\section Assignment The assignment

<div class="warningbox">
<strong>PLEASE HELP US IMPROVE THIS SECTION.</strong>
This page reflects how %Eigen worked until 3.2. Since %Eigen 3.3 the assignment is more sophisticated, as it involves an Assignment expression and the creation of so-called evaluators, which are responsible for the evaluation of each kind of expression.
</div>

At this point, the expression \a v + \a w has been constructed, so, in the process of compiling the line of code
\code
u = v + w;
\endcode
we now enter the operator=.

Which operator= is being called here? The vector u is an object of class VectorXf, i.e. Matrix. In src/Core/Matrix.h, inside the definition of class Matrix, we see this:
\code
    template<typename OtherDerived>
    inline Matrix& operator=(const MatrixBase<OtherDerived>& other)
    {
      eigen_assert(m_storage.data()!=0 && "you cannot use operator= with a non initialized matrix (instead use set()");
      return Base::operator=(other.derived());
    }
\endcode
Here, Base is a typedef for MatrixBase\<Matrix\>. So, what is being called is the operator= of MatrixBase. Let's see its prototype in src/Core/MatrixBase.h:
\code
    template<typename OtherDerived>
    Derived& operator=(const MatrixBase<OtherDerived>& other);
\endcode
Here, \a Derived is VectorXf (since u is a VectorXf) and \a OtherDerived is CwiseBinaryOp. More specifically, as explained in the previous section, \a OtherDerived is:
\code
CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf>
\endcode
So the full prototype of the operator= being called is:
\code
VectorXf& MatrixBase<VectorXf>::operator=(const MatrixBase<CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf> > & other);
\endcode
This operator= literally reads "copying a sum of two VectorXf's into another VectorXf".

Let's now look at the implementation of this operator=. It resides in the file src/Core/Assign.h.

What we can see there is:
\code
template<typename Derived>
template<typename OtherDerived>
inline Derived& MatrixBase<Derived>
  ::operator=(const MatrixBase<OtherDerived>& other)
{
  return internal::assign_selector<Derived,OtherDerived>::run(derived(), other.derived());
}
\endcode

OK, so our next task is to understand internal::assign_selector :)

Here is its declaration (all of this is still in the same file, src/Core/Assign.h):
\code
template<typename Derived, typename OtherDerived,
         bool EvalBeforeAssigning = int(OtherDerived::Flags) & EvalBeforeAssigningBit,
         bool NeedToTranspose = Derived::IsVectorAtCompileTime
                && OtherDerived::IsVectorAtCompileTime
                && int(Derived::RowsAtCompileTime) == int(OtherDerived::ColsAtCompileTime)
                && int(Derived::ColsAtCompileTime) == int(OtherDerived::RowsAtCompileTime)
                && int(Derived::SizeAtCompileTime) != 1>
struct internal::assign_selector;
\endcode

So internal::assign_selector takes 4 template parameters, but the last 2 are automatically determined by the first 2.

EvalBeforeAssigning is here to enforce the EvalBeforeAssigningBit. As explained <a href="TopicLazyEvaluation.html">here</a>, certain expressions have this flag, which makes them automatically evaluate into temporaries before being assigned to another expression. This is the case of the Product expression, in order to avoid strange aliasing effects when doing "m = m * m;". However, of course, our CwiseBinaryOp expression here doesn't have the EvalBeforeAssigningBit: we said from the beginning that we didn't want a temporary to be introduced here. So if you go to src/Core/CwiseBinaryOp.h, you'll see that the Flags in internal::traits\<CwiseBinaryOp\> don't include the EvalBeforeAssigningBit. The Flags member of CwiseBinaryOp is then imported from internal::traits by the EIGEN_GENERIC_PUBLIC_INTERFACE macro. Anyway, here the template parameter EvalBeforeAssigning has the value \c false.

NeedToTranspose is here for the case where the user wants to copy a row-vector into a column-vector. We allow this as a special exception to the general rule that in assignments we require the dimensions to match. Anyway, here both the left-hand and right-hand sides are column vectors, in the sense that ColsAtCompileTime is equal to 1. So NeedToTranspose is \c false too.

So, here we are in the partial specialization:
\code
internal::assign_selector<Derived, OtherDerived, false, false>
\endcode

Here's how it is defined:
\code
template<typename Derived, typename OtherDerived>
struct internal::assign_selector<Derived,OtherDerived,false,false> {
  static Derived& run(Derived& dst, const OtherDerived& other) { return dst.lazyAssign(other.derived()); }
};
\endcode

OK, so now our next job is to understand how lazyAssign works :)

\code
template<typename Derived>
template<typename OtherDerived>
inline Derived& MatrixBase<Derived>
  ::lazyAssign(const MatrixBase<OtherDerived>& other)
{
  EIGEN_STATIC_ASSERT_SAME_MATRIX_SIZE(Derived,OtherDerived)
  eigen_assert(rows() == other.rows() && cols() == other.cols());
  internal::assign_impl<Derived, OtherDerived>::run(derived(),other.derived());
  return derived();
}
\endcode

What do we see here? Some assertions, and then the only interesting line is:
\code
  internal::assign_impl<Derived, OtherDerived>::run(derived(),other.derived());
\endcode

OK, so now we want to know what is inside internal::assign_impl.

Here is its declaration:
\code
template<typename Derived1, typename Derived2,
         int Vectorization = internal::assign_traits<Derived1, Derived2>::Vectorization,
         int Unrolling = internal::assign_traits<Derived1, Derived2>::Unrolling>
struct internal::assign_impl;
\endcode
Again, internal::assign_impl takes 4 template parameters, but the last 2 are automatically determined by the first 2.

These two parameters \a Vectorization and \a Unrolling are determined by a helper class internal::assign_traits. Its job is to determine which vectorization strategy to use (that is \a Vectorization) and which unrolling strategy to use (that is \a Unrolling).

We won't go into the details of how these strategies are chosen (this is in the implementation of internal::assign_traits at the top of the same file). Let's just say that here \a Vectorization has the value \a LinearVectorization, and \a Unrolling has the value \a NoUnrolling (the latter is obvious, since our vectors have dynamic size, so there's no way to unroll the loop at compile-time).

So the partial specialization of internal::assign_impl that we're looking at is:
\code
internal::assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
\endcode

Here is how it's defined:
\code
template<typename Derived1, typename Derived2>
struct internal::assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
{
  static void run(Derived1 &dst, const Derived2 &src)
  {
    const int size = dst.size();
    const int packetSize = internal::packet_traits<typename Derived1::Scalar>::size;
    const int alignedStart = internal::assign_traits<Derived1,Derived2>::DstIsAligned ? 0
                           : internal::first_aligned(&dst.coeffRef(0), size);
    const int alignedEnd = alignedStart + ((size-alignedStart)/packetSize)*packetSize;

    for(int index = 0; index < alignedStart; index++)
      dst.copyCoeff(index, src);

    for(int index = alignedStart; index < alignedEnd; index += packetSize)
    {
      dst.template copyPacket<Derived2, Aligned, internal::assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
    }

    for(int index = alignedEnd; index < size; index++)
      dst.copyCoeff(index, src);
  }
};
\endcode

Here's how it works. \a LinearVectorization means that the left-hand and right-hand side expressions can be accessed linearly, i.e. you can refer to their coefficients by one integer \a index, as opposed to having to refer to them by two integers \a row, \a column.

As we said at the beginning, vectorization works with blocks of 4 floats. Here, \a packetSize is 4.

There are two potential problems that we need to deal with:
\li first, vectorization works much better if the packets are 128-bit-aligned. This is especially important for write access. So when writing to the coefficients of \a dst, we want to group these coefficients by packets of 4 such that each of these packets is 128-bit-aligned. In general, this requires skipping a few coefficients at the beginning of \a dst. This is the purpose of \a alignedStart. We then copy these first few coefficients one by one, not by packets. However, in our case the \a dst expression is a VectorXf, and remember that when constructing the vectors we allocated aligned arrays. Thanks to \a DstIsAligned, Eigen knows this without having to do any runtime check, so \a alignedStart is zero and this part is avoided altogether.
\li second, the number of coefficients to copy is not in general a multiple of \a packetSize. Here, there are 50 coefficients to copy and \a packetSize is 4. So we'll have to copy the last 2 coefficients one by one, not by packets. Here, \a alignedEnd is 48 (see the small worked sketch just below this list).
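
Here is that worked sketch of the index bookkeeping for our example, with size = 50 and an already-aligned destination (an illustration of the arithmetic above, not Eigen code):
\code
#include <cassert>

int main()
{
  const int size = 50;
  const int packetSize = 4;                 // an SSE packet holds 4 floats
  const int alignedStart = 0;               // DstIsAligned: nothing to skip at the beginning
  const int alignedEnd = alignedStart + ((size - alignedStart) / packetSize) * packetSize;
  assert(alignedEnd == 48);                 // 48 coefficients copied by packets, the last 2 one by one
  return 0;
}
\endcode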

Now come the actual loops.

First, the vectorized part: the first 48 coefficients out of 50 will be copied by packets of 4:
\code
  for(int index = alignedStart; index < alignedEnd; index += packetSize)
  {
    dst.template copyPacket<Derived2, Aligned, internal::assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
  }
\endcode

What is copyPacket? It is defined in src/Core/Coeffs.h:
\code
template<typename Derived>
template<typename OtherDerived, int StoreMode, int LoadMode>
inline void MatrixBase<Derived>::copyPacket(int index, const MatrixBase<OtherDerived>& other)
{
  eigen_internal_assert(index >= 0 && index < size());
  derived().template writePacket<StoreMode>(index,
    other.derived().template packet<LoadMode>(index));
}
\endcode

OK, what are writePacket() and packet() here?

First, writePacket() here is a method on the left-hand side VectorXf. So we go to src/Core/Matrix.h to look at its definition:
\code
template<int StoreMode>
inline void writePacket(int index, const PacketScalar& x)
{
  internal::pstoret<Scalar, PacketScalar, StoreMode>(m_storage.data() + index, x);
}
\endcode
Here, \a StoreMode is \a #Aligned, indicating that we are doing a 128-bit-aligned write access, \a PacketScalar is a type representing an "SSE packet of 4 floats", and internal::pstoret is a function writing such a packet to memory. Their definitions are architecture-specific; we find them in src/Core/arch/SSE/PacketMath.h.

The line in src/Core/arch/SSE/PacketMath.h that determines the PacketScalar type (via a typedef in Matrix.h) is:
\code
template<> struct internal::packet_traits<float>  { typedef __m128  type; enum {size=4}; };
\endcode
Here, __m128 is an SSE-specific type. Notice that the enum \a size here is what was used to define \a packetSize above.

And here is the implementation of internal::pstore:
\code
template<> inline void internal::pstore(float*  to, const __m128&  from) { _mm_store_ps(to, from); }
\endcode
Here, _mm_store_ps is an SSE-specific intrinsic function, representing a single SSE instruction. The difference between internal::pstore and internal::pstoret is that internal::pstoret is a dispatcher handling both the aligned and unaligned cases; you find its definition in src/Core/GenericPacketMath.h:
\code
template<typename Scalar, typename Packet, int LoadMode>
inline void internal::pstoret(Scalar* to, const Packet& from)
{
  if(LoadMode == Aligned)
    internal::pstore(to, from);
  else
    internal::pstoreu(to, from);
}
\endcode

OK, that explains how writePacket() works. Now let's look into the packet() call. Remember that we are analyzing this line of code inside copyPacket():
\code
derived().template writePacket<StoreMode>(index,
    other.derived().template packet<LoadMode>(index));
\endcode

Here, \a other is our sum expression \a v + \a w. The .derived() is just a cast from MatrixBase to the subclass, which here is CwiseBinaryOp. So let's go to src/Core/CwiseBinaryOp.h:
\code
class CwiseBinaryOp
{
  // ...
    template<int LoadMode>
    inline PacketScalar packet(int index) const
    {
      return m_functor.packetOp(m_lhs.template packet<LoadMode>(index), m_rhs.template packet<LoadMode>(index));
    }
};
\endcode
Here, \a m_lhs is the vector \a v, and \a m_rhs is the vector \a w. So the packet() function here is Matrix::packet(). The template parameter \a LoadMode is \a #Aligned. So we're looking at
\code
class Matrix
{
  // ...
    template<int LoadMode>
    inline PacketScalar packet(int index) const
    {
      return internal::ploadt<Scalar, LoadMode>(m_storage.data() + index);
    }
};
\endcode
We let you look up the definition of internal::ploadt in GenericPacketMath.h and of internal::pload in src/Core/arch/SSE/PacketMath.h. They are very similar to internal::pstoret and internal::pstore above.
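
By analogy with internal::pstoret above, the load-side dispatcher presumably looks something like this (a sketch in the same spirit, not the verbatim Eigen code):
\code
template<typename Scalar, int LoadMode>
inline typename internal::packet_traits<Scalar>::type internal::ploadt(const Scalar* from)
{
  if(LoadMode == Aligned)
    return internal::pload(from);    // aligned load, e.g. _mm_load_ps for float on SSE
  else
    return internal::ploadu(from);   // unaligned load, e.g. _mm_loadu_ps
}
\endcode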

Let's go back to CwiseBinaryOp::packet(). Once the packets from the vectors \a v and \a w have been returned, what does this function do? It calls m_functor.packetOp() on them. What is m_functor? Here we must remember what particular template specialization of CwiseBinaryOp we're dealing with:
\code
CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf>
\endcode
So m_functor is an object of the empty class internal::scalar_sum_op<float>. As we mentioned above, don't worry about why we constructed an object of this empty class at all -- it's an implementation detail; the point is that some other functors need to store member data.

Anyway, internal::scalar_sum_op is defined in src/Core/Functors.h:
\code
template<typename Scalar> struct internal::scalar_sum_op EIGEN_EMPTY_STRUCT {
  inline const Scalar operator() (const Scalar& a, const Scalar& b) const { return a + b; }
  template<typename PacketScalar>
  inline const PacketScalar packetOp(const PacketScalar& a, const PacketScalar& b) const
  { return internal::padd(a,b); }
};
\endcode
As you can see, all packetOp() does is call internal::padd on the two packets. Here is the definition of internal::padd from src/Core/arch/SSE/PacketMath.h:
\code
template<> inline __m128  internal::padd(const __m128&  a, const __m128&  b) { return _mm_add_ps(a,b); }
\endcode
Here, _mm_add_ps is an SSE-specific intrinsic function, representing a single SSE instruction.

To summarize, the loop
\code
  for(int index = alignedStart; index < alignedEnd; index += packetSize)
  {
    dst.template copyPacket<Derived2, Aligned, internal::assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
  }
\endcode
has been compiled to the following: for each of the 12 ( = 48/4) packets, read the corresponding packet of 4 floats from the vector v and from the vector w using two _mm_load_ps SSE instructions, add them together using an _mm_add_ps instruction, then store the result using an _mm_store_ps instruction.
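
In other words, for our example the vectorized loop boils down to something like the following hand-written SSE code. This is an informal sketch of the resulting instruction pattern, assuming the three arrays are 16-byte aligned (as Eigen's allocator guarantees here); it is not literal compiler output:
\code
#include <xmmintrin.h>   // SSE intrinsics

void toy_vectorized_sum(float* u, const float* v, const float* w)
{
  for(int index = 0; index < 48; index += 4)
  {
    __m128 pv = _mm_load_ps(v + index);            // aligned load of 4 floats from v
    __m128 pw = _mm_load_ps(w + index);            // aligned load of 4 floats from w
    _mm_store_ps(u + index, _mm_add_ps(pv, pw));   // add the packets and store 4 floats into u
  }
}
\endcode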

There remains the second loop handling the last few (here, the last 2) coefficients:
\code
  for(int index = alignedEnd; index < size; index++)
    dst.copyCoeff(index, src);
\endcode
It works just like the loop we just explained, only simpler, because no SSE vectorization is involved here: copyPacket() becomes copyCoeff(), packet() becomes coeff(), writePacket() becomes coeffRef(). If you have followed us this far, you can probably understand this part by yourself.
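
For completeness, here is what the scalar path looks like, modeled on copyPacket() above (a sketch; the real copyCoeff may differ in details): it reads one coefficient from the source expression with coeff() and writes it into the destination through coeffRef().
\code
template<typename Derived>
template<typename OtherDerived>
inline void MatrixBase<Derived>::copyCoeff(int index, const MatrixBase<OtherDerived>& other)
{
  eigen_internal_assert(index >= 0 && index < size());
  derived().coeffRef(index) = other.derived().coeff(index);
}
\endcode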

We see that all the C++ abstraction of Eigen goes away during compilation and that we indeed are precisely controlling which assembly instructions we emit. Such is the beauty of C++! Since we have such precise control over the emitted assembly instructions, but such complex logic to choose the right instructions, we can say that Eigen really behaves like an optimizing compiler. If you prefer, you could say that Eigen behaves like a script for the compiler. In a sense, C++ template metaprogramming is scripting the compiler -- and it's been shown that this scripting language is Turing-complete. See <a href="http://en.wikipedia.org/wiki/Template_metaprogramming"> Wikipedia</a>.

*/

}