;
; jfdctflt.asm - floating-point FDCT (64-bit SSE)
;
; Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
; Copyright (C) 2009, 2016, D. R. Commander.
;
; Based on the x86 SIMD extension for IJG JPEG library
; Copyright (C) 1999-2006, MIYASAKA Masaru.
; For conditions of distribution and use, see copyright notice in jsimdext.inc
;
; This file should be assembled with NASM (Netwide Assembler),
; can *not* be assembled with Microsoft's MASM or any compatible
; assembler (including Borland's Turbo Assembler).
; NASM is available from http://nasm.sourceforge.net/ or
; http://sourceforge.net/project/showfiles.php?group_id=6208
;
; This file contains a floating-point implementation of the forward DCT
; (Discrete Cosine Transform). The following code is based directly on
; the IJG's original jfdctflt.c; see the jfdctflt.c for more details.
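;
; For orientation, a minimal scalar sketch of the 1-D butterfly that each
; pass in this file applies to four rows or columns at a time. It is
; reconstructed from the register comments in this file and is meant as an
; illustration only, not as a copy of jfdctflt.c. The array d[0..7] is a
; hypothetical name for one row or column of samples; the PD_* names refer
; to the constants defined in this file.
;
;   tmp0 = d[0] + d[7];    tmp7 = d[0] - d[7];
;   tmp1 = d[1] + d[6];    tmp6 = d[1] - d[6];
;   tmp2 = d[2] + d[5];    tmp5 = d[2] - d[5];
;   tmp3 = d[3] + d[4];    tmp4 = d[3] - d[4];
;
;   /* Even part */
;   tmp10 = tmp0 + tmp3;   tmp13 = tmp0 - tmp3;
;   tmp11 = tmp1 + tmp2;   tmp12 = tmp1 - tmp2;
;   d[0] = tmp10 + tmp11;  d[4] = tmp10 - tmp11;
;   z1 = (tmp12 + tmp13) * PD_0_707;
;   d[2] = tmp13 + z1;     d[6] = tmp13 - z1;
;
;   /* Odd part (tmp10..tmp12 are reused, as in the comments below) */
;   tmp10 = tmp4 + tmp5;
;   tmp11 = tmp5 + tmp6;
;   tmp12 = tmp6 + tmp7;
;   z5 = (tmp10 - tmp12) * PD_0_382;
;   z2 = tmp10 * PD_0_541 + z5;
;   z4 = tmp12 * PD_1_306 + z5;
;   z3 = tmp11 * PD_0_707;
;   z11 = tmp7 + z3;       z13 = tmp7 - z3;
;   d[5] = z13 + z2;       d[3] = z13 - z2;
;   d[1] = z11 + z4;       d[7] = z11 - z4;
;
; In the SSE code every scalar above becomes one xmm register holding the
; same quantity for four rows (pass 1) or four columns (pass 2).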

%include "jsimdext.inc"
%include "jdct.inc"

; --------------------------------------------------------------------------

%macro unpcklps2 2  ; %1=(0 1 2 3) / %2=(4 5 6 7) => %1=(0 1 4 5)
    shufps      %1, %2, 0x44
%endmacro

%macro unpckhps2 2  ; %1=(0 1 2 3) / %2=(4 5 6 7) => %1=(2 3 6 7)
    shufps      %1, %2, 0xEE
%endmacro

; --------------------------------------------------------------------------
    SECTION     SEG_CONST

    alignz      32
    GLOBAL_DATA(jconst_fdct_float_sse)

EXTN(jconst_fdct_float_sse):

PD_0_382 times 4 dd 0.382683432365089771728460
PD_0_707 times 4 dd 0.707106781186547524400844
PD_0_541 times 4 dd 0.541196100146196984399723
PD_1_306 times 4 dd 1.306562964876376527856643

    alignz      32

; --------------------------------------------------------------------------
    SECTION     SEG_TEXT
    BITS        64
;
; Perform the forward DCT on one block of samples.
;
; GLOBAL(void)
; jsimd_fdct_float_sse(FAST_FLOAT *data)
;

; r10 = FAST_FLOAT *data

%define wk(i)   rbp - (WK_NUM - (i)) * SIZEOF_XMMWORD  ; xmmword wk[WK_NUM]
%define WK_NUM  2

    align       32
    GLOBAL_FUNCTION(jsimd_fdct_float_sse)

EXTN(jsimd_fdct_float_sse):
    push        rbp
    mov         rax, rsp                     ; rax = original rbp
    sub         rsp, byte 4
    and         rsp, byte (-SIZEOF_XMMWORD)  ; align to 128 bits
    mov         [rsp], rax
    mov         rbp, rsp                     ; rbp = aligned rbp
    lea         rsp, [wk(0)]
    collect_args 1

    ; ---- Pass 1: process rows.
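    ;
    ; Sketch of the data flow, inferred from the register comments in this
    ; pass (an illustration, not upstream documentation). Each iteration
    ; loads a strip of four rows, transposes its two 4x4 halves so that
    ; every register holds one sample position for all four rows, and runs
    ; the 1-D DCT on the four rows in parallel. The results are stored
    ; without transposing back, so each 4x4 sub-block is left transposed
    ; in memory:
    ;
    ;   strip row 0: (data0 of rows 0-3) (data4 of rows 0-3)
    ;   strip row 1: (data1 of rows 0-3) (data5 of rows 0-3)
    ;   strip row 2: (data2 of rows 0-3) (data6 of rows 0-3)
    ;   strip row 3: (data3 of rows 0-3) (data7 of rows 0-3)
    ;
    ; Pass 2 transposes again on load, which restores the natural order
    ; before the column DCT results are written.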

    mov         rdx, r10                ; (FAST_FLOAT *)
    mov         rcx, DCTSIZE/4
.rowloop:

    movaps      xmm0, XMMWORD [XMMBLOCK(2,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm1, XMMWORD [XMMBLOCK(3,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm2, XMMWORD [XMMBLOCK(2,1,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm3, XMMWORD [XMMBLOCK(3,1,rdx,SIZEOF_FAST_FLOAT)]

    ; xmm0=(20 21 22 23), xmm2=(24 25 26 27)
    ; xmm1=(30 31 32 33), xmm3=(34 35 36 37)

    movaps      xmm4, xmm0              ; transpose coefficients(phase 1)
    unpcklps    xmm0, xmm1              ; xmm0=(20 30 21 31)
    unpckhps    xmm4, xmm1              ; xmm4=(22 32 23 33)
    movaps      xmm5, xmm2              ; transpose coefficients(phase 1)
    unpcklps    xmm2, xmm3              ; xmm2=(24 34 25 35)
    unpckhps    xmm5, xmm3              ; xmm5=(26 36 27 37)

    movaps      xmm6, XMMWORD [XMMBLOCK(0,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm7, XMMWORD [XMMBLOCK(1,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm1, XMMWORD [XMMBLOCK(0,1,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm3, XMMWORD [XMMBLOCK(1,1,rdx,SIZEOF_FAST_FLOAT)]

    ; xmm6=(00 01 02 03), xmm1=(04 05 06 07)
    ; xmm7=(10 11 12 13), xmm3=(14 15 16 17)

    movaps      XMMWORD [wk(0)], xmm4   ; wk(0)=(22 32 23 33)
    movaps      XMMWORD [wk(1)], xmm2   ; wk(1)=(24 34 25 35)

    movaps      xmm4, xmm6              ; transpose coefficients(phase 1)
    unpcklps    xmm6, xmm7              ; xmm6=(00 10 01 11)
    unpckhps    xmm4, xmm7              ; xmm4=(02 12 03 13)
    movaps      xmm2, xmm1              ; transpose coefficients(phase 1)
    unpcklps    xmm1, xmm3              ; xmm1=(04 14 05 15)
    unpckhps    xmm2, xmm3              ; xmm2=(06 16 07 17)

    movaps      xmm7, xmm6              ; transpose coefficients(phase 2)
    unpcklps2   xmm6, xmm0              ; xmm6=(00 10 20 30)=data0
    unpckhps2   xmm7, xmm0              ; xmm7=(01 11 21 31)=data1
    movaps      xmm3, xmm2              ; transpose coefficients(phase 2)
    unpcklps2   xmm2, xmm5              ; xmm2=(06 16 26 36)=data6
    unpckhps2   xmm3, xmm5              ; xmm3=(07 17 27 37)=data7

    movaps      xmm0, xmm7
    movaps      xmm5, xmm6
    subps       xmm7, xmm2              ; xmm7=data1-data6=tmp6
    subps       xmm6, xmm3              ; xmm6=data0-data7=tmp7
    addps       xmm0, xmm2              ; xmm0=data1+data6=tmp1
    addps       xmm5, xmm3              ; xmm5=data0+data7=tmp0

    movaps      xmm2, XMMWORD [wk(0)]   ; xmm2=(22 32 23 33)
    movaps      xmm3, XMMWORD [wk(1)]   ; xmm3=(24 34 25 35)
    movaps      XMMWORD [wk(0)], xmm7   ; wk(0)=tmp6
    movaps      XMMWORD [wk(1)], xmm6   ; wk(1)=tmp7

    movaps      xmm7, xmm4              ; transpose coefficients(phase 2)
    unpcklps2   xmm4, xmm2              ; xmm4=(02 12 22 32)=data2
    unpckhps2   xmm7, xmm2              ; xmm7=(03 13 23 33)=data3
    movaps      xmm6, xmm1              ; transpose coefficients(phase 2)
    unpcklps2   xmm1, xmm3              ; xmm1=(04 14 24 34)=data4
    unpckhps2   xmm6, xmm3              ; xmm6=(05 15 25 35)=data5

    movaps      xmm2, xmm7
    movaps      xmm3, xmm4
    addps       xmm7, xmm1              ; xmm7=data3+data4=tmp3
    addps       xmm4, xmm6              ; xmm4=data2+data5=tmp2
    subps       xmm2, xmm1              ; xmm2=data3-data4=tmp4
    subps       xmm3, xmm6              ; xmm3=data2-data5=tmp5

    ; -- Even part

    movaps      xmm1, xmm5
    movaps      xmm6, xmm0
    subps       xmm5, xmm7              ; xmm5=tmp13
    subps       xmm0, xmm4              ; xmm0=tmp12
    addps       xmm1, xmm7              ; xmm1=tmp10
    addps       xmm6, xmm4              ; xmm6=tmp11

    addps       xmm0, xmm5
    mulps       xmm0, [rel PD_0_707]    ; xmm0=z1

    movaps      xmm7, xmm1
    movaps      xmm4, xmm5
    subps       xmm1, xmm6              ; xmm1=data4
    subps       xmm5, xmm0              ; xmm5=data6
    addps       xmm7, xmm6              ; xmm7=data0
    addps       xmm4, xmm0              ; xmm4=data2

    movaps      XMMWORD [XMMBLOCK(0,1,rdx,SIZEOF_FAST_FLOAT)], xmm1
    movaps      XMMWORD [XMMBLOCK(2,1,rdx,SIZEOF_FAST_FLOAT)], xmm5
    movaps      XMMWORD [XMMBLOCK(0,0,rdx,SIZEOF_FAST_FLOAT)], xmm7
    movaps      XMMWORD [XMMBLOCK(2,0,rdx,SIZEOF_FAST_FLOAT)], xmm4

    ; -- Odd part

    movaps      xmm6, XMMWORD [wk(0)]   ; xmm6=tmp6
    movaps      xmm0, XMMWORD [wk(1)]   ; xmm0=tmp7

    addps       xmm2, xmm3              ; xmm2=tmp10
    addps       xmm3, xmm6              ; xmm3=tmp11
    addps       xmm6, xmm0              ; xmm6=tmp12, xmm0=tmp7

    mulps       xmm3, [rel PD_0_707]    ; xmm3=z3

    movaps      xmm1, xmm2              ; xmm1=tmp10
    subps       xmm2, xmm6
    mulps       xmm2, [rel PD_0_382]    ; xmm2=z5
    mulps       xmm1, [rel PD_0_541]    ; xmm1=MULTIPLY(tmp10,FIX_0_541196)
    mulps       xmm6, [rel PD_1_306]    ; xmm6=MULTIPLY(tmp12,FIX_1_306562)
    addps       xmm1, xmm2              ; xmm1=z2
    addps       xmm6, xmm2              ; xmm6=z4

    movaps      xmm5, xmm0
    subps       xmm0, xmm3              ; xmm0=z13
    addps       xmm5, xmm3              ; xmm5=z11

    movaps      xmm7, xmm0
    movaps      xmm4, xmm5
    subps       xmm0, xmm1              ; xmm0=data3
    subps       xmm5, xmm6              ; xmm5=data7
    addps       xmm7, xmm1              ; xmm7=data5
    addps       xmm4, xmm6              ; xmm4=data1

    movaps      XMMWORD [XMMBLOCK(3,0,rdx,SIZEOF_FAST_FLOAT)], xmm0
    movaps      XMMWORD [XMMBLOCK(3,1,rdx,SIZEOF_FAST_FLOAT)], xmm5
    movaps      XMMWORD [XMMBLOCK(1,1,rdx,SIZEOF_FAST_FLOAT)], xmm7
    movaps      XMMWORD [XMMBLOCK(1,0,rdx,SIZEOF_FAST_FLOAT)], xmm4

    add         rdx, 4*DCTSIZE*SIZEOF_FAST_FLOAT
    dec         rcx
    jnz         near .rowloop

    ; ---- Pass 2: process columns.

    mov         rdx, r10                ; (FAST_FLOAT *)
    mov         rcx, DCTSIZE/4
.columnloop:

    movaps      xmm0, XMMWORD [XMMBLOCK(2,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm1, XMMWORD [XMMBLOCK(3,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm2, XMMWORD [XMMBLOCK(6,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm3, XMMWORD [XMMBLOCK(7,0,rdx,SIZEOF_FAST_FLOAT)]

    ; xmm0=(02 12 22 32), xmm2=(42 52 62 72)
    ; xmm1=(03 13 23 33), xmm3=(43 53 63 73)

    movaps      xmm4, xmm0              ; transpose coefficients(phase 1)
    unpcklps    xmm0, xmm1              ; xmm0=(02 03 12 13)
    unpckhps    xmm4, xmm1              ; xmm4=(22 23 32 33)
    movaps      xmm5, xmm2              ; transpose coefficients(phase 1)
    unpcklps    xmm2, xmm3              ; xmm2=(42 43 52 53)
    unpckhps    xmm5, xmm3              ; xmm5=(62 63 72 73)

    movaps      xmm6, XMMWORD [XMMBLOCK(0,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm7, XMMWORD [XMMBLOCK(1,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm1, XMMWORD [XMMBLOCK(4,0,rdx,SIZEOF_FAST_FLOAT)]
    movaps      xmm3, XMMWORD [XMMBLOCK(5,0,rdx,SIZEOF_FAST_FLOAT)]

    ; xmm6=(00 10 20 30), xmm1=(40 50 60 70)
    ; xmm7=(01 11 21 31), xmm3=(41 51 61 71)

    movaps      XMMWORD [wk(0)], xmm4   ; wk(0)=(22 23 32 33)
    movaps      XMMWORD [wk(1)], xmm2   ; wk(1)=(42 43 52 53)

    movaps      xmm4, xmm6              ; transpose coefficients(phase 1)
    unpcklps    xmm6, xmm7              ; xmm6=(00 01 10 11)
    unpckhps    xmm4, xmm7              ; xmm4=(20 21 30 31)
    movaps      xmm2, xmm1              ; transpose coefficients(phase 1)
    unpcklps    xmm1, xmm3              ; xmm1=(40 41 50 51)
    unpckhps    xmm2, xmm3              ; xmm2=(60 61 70 71)

    movaps      xmm7, xmm6              ; transpose coefficients(phase 2)
    unpcklps2   xmm6, xmm0              ; xmm6=(00 01 02 03)=data0
    unpckhps2   xmm7, xmm0              ; xmm7=(10 11 12 13)=data1
    movaps      xmm3, xmm2              ; transpose coefficients(phase 2)
    unpcklps2   xmm2, xmm5              ; xmm2=(60 61 62 63)=data6
    unpckhps2   xmm3, xmm5              ; xmm3=(70 71 72 73)=data7

    movaps      xmm0, xmm7
    movaps      xmm5, xmm6
    subps       xmm7, xmm2              ; xmm7=data1-data6=tmp6
    subps       xmm6, xmm3              ; xmm6=data0-data7=tmp7
    addps       xmm0, xmm2              ; xmm0=data1+data6=tmp1
    addps       xmm5, xmm3              ; xmm5=data0+data7=tmp0

    movaps      xmm2, XMMWORD [wk(0)]   ; xmm2=(22 23 32 33)
    movaps      xmm3, XMMWORD [wk(1)]   ; xmm3=(42 43 52 53)
    movaps      XMMWORD [wk(0)], xmm7   ; wk(0)=tmp6
    movaps      XMMWORD [wk(1)], xmm6   ; wk(1)=tmp7

    movaps      xmm7, xmm4              ; transpose coefficients(phase 2)
    unpcklps2   xmm4, xmm2              ; xmm4=(20 21 22 23)=data2
    unpckhps2   xmm7, xmm2              ; xmm7=(30 31 32 33)=data3
    movaps      xmm6, xmm1              ; transpose coefficients(phase 2)
    unpcklps2   xmm1, xmm3              ; xmm1=(40 41 42 43)=data4
    unpckhps2   xmm6, xmm3              ; xmm6=(50 51 52 53)=data5

    movaps      xmm2, xmm7
    movaps      xmm3, xmm4
    addps       xmm7, xmm1              ; xmm7=data3+data4=tmp3
    addps       xmm4, xmm6              ; xmm4=data2+data5=tmp2
    subps       xmm2, xmm1              ; xmm2=data3-data4=tmp4
    subps       xmm3, xmm6              ; xmm3=data2-data5=tmp5

    ; -- Even part

    movaps      xmm1, xmm5
    movaps      xmm6, xmm0
    subps       xmm5, xmm7              ; xmm5=tmp13
    subps       xmm0, xmm4              ; xmm0=tmp12
    addps       xmm1, xmm7              ; xmm1=tmp10
    addps       xmm6, xmm4              ; xmm6=tmp11

    addps       xmm0, xmm5
    mulps       xmm0, [rel PD_0_707]    ; xmm0=z1

    movaps      xmm7, xmm1
    movaps      xmm4, xmm5
    subps       xmm1, xmm6              ; xmm1=data4
    subps       xmm5, xmm0              ; xmm5=data6
    addps       xmm7, xmm6              ; xmm7=data0
    addps       xmm4, xmm0              ; xmm4=data2

    movaps      XMMWORD [XMMBLOCK(4,0,rdx,SIZEOF_FAST_FLOAT)], xmm1
    movaps      XMMWORD [XMMBLOCK(6,0,rdx,SIZEOF_FAST_FLOAT)], xmm5
    movaps      XMMWORD [XMMBLOCK(0,0,rdx,SIZEOF_FAST_FLOAT)], xmm7
    movaps      XMMWORD [XMMBLOCK(2,0,rdx,SIZEOF_FAST_FLOAT)], xmm4

    ; -- Odd part

    movaps      xmm6, XMMWORD [wk(0)]   ; xmm6=tmp6
    movaps      xmm0, XMMWORD [wk(1)]   ; xmm0=tmp7

    addps       xmm2, xmm3              ; xmm2=tmp10
    addps       xmm3, xmm6              ; xmm3=tmp11
    addps       xmm6, xmm0              ; xmm6=tmp12, xmm0=tmp7

    mulps       xmm3, [rel PD_0_707]    ; xmm3=z3

    movaps      xmm1, xmm2              ; xmm1=tmp10
    subps       xmm2, xmm6
    mulps       xmm2, [rel PD_0_382]    ; xmm2=z5
    mulps       xmm1, [rel PD_0_541]    ; xmm1=MULTIPLY(tmp10,FIX_0_541196)
    mulps       xmm6, [rel PD_1_306]    ; xmm6=MULTIPLY(tmp12,FIX_1_306562)
    addps       xmm1, xmm2              ; xmm1=z2
    addps       xmm6, xmm2              ; xmm6=z4

    movaps      xmm5, xmm0
    subps       xmm0, xmm3              ; xmm0=z13
    addps       xmm5, xmm3              ; xmm5=z11

    movaps      xmm7, xmm0
    movaps      xmm4, xmm5
    subps       xmm0, xmm1              ; xmm0=data3
    subps       xmm5, xmm6              ; xmm5=data7
    addps       xmm7, xmm1              ; xmm7=data5
    addps       xmm4, xmm6              ; xmm4=data1

    movaps      XMMWORD [XMMBLOCK(3,0,rdx,SIZEOF_FAST_FLOAT)], xmm0
    movaps      XMMWORD [XMMBLOCK(7,0,rdx,SIZEOF_FAST_FLOAT)], xmm5
    movaps      XMMWORD [XMMBLOCK(5,0,rdx,SIZEOF_FAST_FLOAT)], xmm7
    movaps      XMMWORD [XMMBLOCK(1,0,rdx,SIZEOF_FAST_FLOAT)], xmm4

    add         rdx, byte 4*SIZEOF_FAST_FLOAT
    dec         rcx
    jnz         near .columnloop

    uncollect_args 1
    mov         rsp, rbp                ; rsp <- aligned rbp
    pop         rsp                     ; rsp <- original rbp
    pop         rbp
    ret

; For some reason, the OS X linker does not honor the request to align the
; segment unless we do this.
    align       32