Extreme DXT Compression MSPL - Cauldron Ltd.


  • Extreme DXT Compression Peter Uliiansky

    Cauldron, Ltd.

    Overview

    Simple highly optimized algorithm

    Uses SSE2 and SSSE3 for maximum performance

    Quality comparable to Real-Time DXT Compression algorithm

    Performance roughly 300% of the Real-Time DXT Compression algorithm

    What's identical to Real-Time DXT Compression

    o Only non-transparent compression scheme for DXT1

    o Only six intermediate alpha values compression scheme for DXT5

    o Uses bounding box method for representative color and alpha values

    Computes color and alpha indices by division (fixed point multiplication)

    o Uses lookup tables for color/alpha dividers

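    $$\mathrm{ColorIndex} = \frac{4\,\bigl((R - R_{\min}) + (G - G_{\min}) + (B - B_{\min})\bigr)}{(R_{\max} - R_{\min}) + (G_{\max} - G_{\min}) + (B_{\max} - B_{\min})}$$

    $$\mathrm{AlphaIndex} = \frac{8\,(A - A_{\min})}{A_{\max} - A_{\min}}$$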
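The two index formulas can be written out as a scalar reference. This is a minimal sketch, not the author's SIMD code: the `BlockBounds` struct and both function names are hypothetical, and the zero-range guard and the final clamp are assumptions added so the sketch is safe on its own (the presentation states the real code has no comparisons and insets the bounding box instead).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the per-channel bounding box of one 4x4 block. */
typedef struct {
    uint8_t r_min, g_min, b_min, a_min;
    uint8_t r_max, g_max, b_max, a_max;
} BlockBounds;

/* Natural color index 0..3: 0 is nearest the minimum, 3 nearest the maximum. */
static unsigned color_index(const BlockBounds *bb, uint8_t r, uint8_t g, uint8_t b)
{
    assert(r >= bb->r_min && g >= bb->g_min && b >= bb->b_min);
    unsigned range = (unsigned)(bb->r_max - bb->r_min)
                   + (unsigned)(bb->g_max - bb->g_min)
                   + (unsigned)(bb->b_max - bb->b_min);
    unsigned dist = (unsigned)(r - bb->r_min)
                  + (unsigned)(g - bb->g_min)
                  + (unsigned)(b - bb->b_min);
    if (range == 0)
        return 0;                   /* flat block: guard added for this sketch */
    unsigned idx = 4u * dist / range;
    return idx > 3u ? 3u : idx;     /* clamp added: dist == range would give 4 */
}

/* Natural alpha index 0..7, same convention as color_index. */
static unsigned alpha_index(const BlockBounds *bb, uint8_t a)
{
    assert(a >= bb->a_min);
    unsigned range = (unsigned)(bb->a_max - bb->a_min);
    if (range == 0)
        return 0;                   /* flat block: guard added for this sketch */
    unsigned idx = 8u * (unsigned)(a - bb->a_min) / range;
    return idx > 7u ? 7u : idx;     /* clamp added: a == a_max would give 8 */
}
```

The indices returned here are in natural order (increasing from minimum to maximum), not yet in the order a DXT block stores them.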

    Converts natural index ordering to DXT index ordering by lookup tables

    o Tightly packs natural indices first

    o Then converts four color indices at once/two alpha indices at once
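One way to picture the lookup-table conversion for color indices is sketched below. It assumes the maximum color is stored as color0 and the minimum as color1 (the four-color, non-transparent DXT1 mode), so natural indices 0, 1, 2, 3 map to DXT indices 1, 3, 2, 0. The table layout and all names are hypothetical; the source does not show how its own tables are organized.

```c
#include <assert.h>
#include <stdint.h>

/* Natural order runs min -> max; DXT1 order is color0 (max), color1 (min),
 * then the two interpolated colors starting from the color0 side. */
static const uint8_t kNaturalToDxt[4] = { 1, 3, 2, 0 };

/* 256-entry table: converts four packed 2-bit indices with one lookup. */
static uint8_t g_color_lut[256];

static void build_color_lut(void)
{
    for (unsigned packed = 0; packed < 256; ++packed) {
        unsigned out = 0;
        for (unsigned i = 0; i < 4; ++i) {
            unsigned natural = (packed >> (2 * i)) & 3u;
            out |= (unsigned)kNaturalToDxt[natural] << (2 * i);
        }
        g_color_lut[packed] = (uint8_t)out;
    }
}

/* Tightly pack four natural indices (pixel 0 in the low bits) into a byte. */
static uint8_t pack_row(const uint8_t natural[4])
{
    unsigned packed = 0;
    for (unsigned i = 0; i < 4; ++i) {
        assert(natural[i] < 4);
        packed |= (unsigned)natural[i] << (2 * i);
    }
    return (uint8_t)packed;
}

/* Convert one row of four natural indices to one DXT1 index byte. */
static uint8_t convert_row(const uint8_t natural[4])
{
    return g_color_lut[pack_row(natural)];
}
```

After a single call to `build_color_lut()`, each row of a block costs one table read. Alpha indices could be handled the same way with 3-bit entries, converting two at a time as the slide describes.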

    Just two functions (CompressImageDXT1, CompressImageDXT5)

    o Saves function call overhead

    No comparisons, jumps, loops (except height/width loops)

    Processes two 4x4 blocks at once

    o Better utilization of registers

    o Hides instruction latency in some places

    o No need to extract block first

    Constant/temporary data just 24 * 16 = 384 bytes

    Lookup tables just 3072 + 1024 + 256 + 1280 = 5632 bytes

    Although some parts of the DXT1 and DXT5 compression algorithms are identical, different instruction ordering is crucial for maximum performance

    Code is optimized for Core 2 Duo, so Pentium 4 performance is not optimal

    (Don't see much point in optimizing for Pentium 4 these days)

  • Color Compression Comparison

    [Images: Original image | Extreme DXT Compression | Real-Time DXT Compression]

    Alpha Compression Comparison

    [Images: Original image | Extreme DXT Compression | Real-Time DXT Compression]

  • Performance

    256x256 texture graphs show the maximum possible performance of the algorithms

    (all data used fits in the cache and is already loaded there)

    4096x4096 texture graphs show more realistic performance

    (source data does not fit in the cache or is not already loaded there)

    The 256x256 Lena image was used for the 256x256 texture performance tests

    The same image was tiled 16x16 to create the 4096x4096 texture for the 4096x4096 texture performance tests

    The blue channel was replicated to the alpha channel for the DXT5 tests

    The DXT1 compression creates correct results regardless of the alpha information in the source texture and never outputs transparent pixels
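The test-texture preparation described above (tiling the small image and copying blue into alpha) could be reproduced roughly as follows. This is only a sketch: the function names are hypothetical, and the R, G, B, A byte order within a pixel is an assumption, since the presentation does not state the source pixel format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tile a square 32-bit image `tiles` x `tiles` times into `dst`.
 * `src_size` is the source width/height in pixels; `dst` must hold
 * (src_size * tiles)^2 pixels of 4 bytes each. */
static void tile_image(const uint8_t *src, size_t src_size,
                       uint8_t *dst, size_t tiles)
{
    assert(src_size > 0 && tiles > 0);
    size_t src_row = src_size * 4;          /* bytes per source row */
    size_t dst_size = src_size * tiles;     /* destination width/height */
    for (size_t y = 0; y < dst_size; ++y) {
        const uint8_t *row = src + (y % src_size) * src_row;
        uint8_t *out = dst + y * dst_size * 4;
        for (size_t t = 0; t < tiles; ++t)
            memcpy(out + t * src_row, row, src_row);
    }
}

/* Copy the blue channel into alpha (assumes R, G, B, A byte order). */
static void blue_to_alpha(uint8_t *pixels, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i)
        pixels[4 * i + 3] = pixels[4 * i + 2];
}
```

With `src_size = 256` and `tiles = 16` this yields the 4096x4096 layout used for the larger benchmark.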

  • The Algorithm

    Read 4x4 pixel block (movdqa)

    Compute bounding box and store minimum (movdqa, pmaxub, pminub, pshufd)

    Compute and store range (movdqa, punpcklbw, psubw, movq)

    Inset bounding box and interleave max/min values (psrlw, psubw, paddw, punpcklwd)

    Shift and mask max/min values as needed in the DXT block (pmullw, pand, movdqa)

    Pack and store max/min values to the DXT block (mov, shr, or)

    Load 4x4 pixel block again, subtract minimum, prepare for the division

    (SSSE3: movdqa, psubb, pmaddubsw, phaddw)

    (SSE2: movdqa, psubb, pand, pmaddwd, psrlw, psllw, paddw, packssdw)
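The bounding-box step can be sketched with SSE2 intrinsics that correspond to the listed instructions (`pmaxub`, `pminub`, `pshufd`). The sketch below handles a single block and uses unaligned loads for safety, whereas the presentation describes aligned `movdqa` loads and two blocks per iteration; the function name and the output format are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Per-channel min and max of one 4x4 block of 32-bit pixels.
 * `pixels` points at the block's top-left pixel inside the image and
 * `stride` is the image row pitch in bytes, so no block extraction is needed. */
static void bounding_box(const uint8_t *pixels, size_t stride,
                         uint8_t min_out[4], uint8_t max_out[4])
{
    assert(stride >= 16);
    __m128i r0 = _mm_loadu_si128((const __m128i *)(pixels + 0 * stride));
    __m128i r1 = _mm_loadu_si128((const __m128i *)(pixels + 1 * stride));
    __m128i r2 = _mm_loadu_si128((const __m128i *)(pixels + 2 * stride));
    __m128i r3 = _mm_loadu_si128((const __m128i *)(pixels + 3 * stride));

    /* Reduce the four rows to one register of column-wise extremes. */
    __m128i mx = _mm_max_epu8(_mm_max_epu8(r0, r1), _mm_max_epu8(r2, r3));
    __m128i mn = _mm_min_epu8(_mm_min_epu8(r0, r1), _mm_min_epu8(r2, r3));

    /* Fold the four pixels: swap 64-bit halves, then adjacent dwords.
     * Afterwards every dword holds the same result (Max Max Max Max). */
    mx = _mm_max_epu8(mx, _mm_shuffle_epi32(mx, _MM_SHUFFLE(1, 0, 3, 2)));
    mn = _mm_min_epu8(mn, _mm_shuffle_epi32(mn, _MM_SHUFFLE(1, 0, 3, 2)));
    mx = _mm_max_epu8(mx, _mm_shuffle_epi32(mx, _MM_SHUFFLE(2, 3, 0, 1)));
    mn = _mm_min_epu8(mn, _mm_shuffle_epi32(mn, _MM_SHUFFLE(2, 3, 0, 1)));

    uint32_t mx32 = (uint32_t)_mm_cvtsi128_si32(mx);
    uint32_t mn32 = (uint32_t)_mm_cvtsi128_si32(mn);
    memcpy(max_out, &mx32, 4);
    memcpy(min_out, &mn32, 4);
}
```

Because the byte-wise min and max treat each channel independently, the same code serves both the color channels and alpha.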

    Register contents at each step, as drawn on the slide:

    Source pixels (four rows of the 4x4 block):
      Pixel03 Pixel02 Pixel01 Pixel00
      Pixel13 Pixel12 Pixel11 Pixel10
      Pixel23 Pixel22 Pixel21 Pixel20
      Pixel33 Pixel32 Pixel31 Pixel30

    Bounding box:
      Min Max Min Max
      Max Max Max Max
      Min Min Min Min

    Range:
      Range Range Range Range Range Range Range Range

    Interleaved max/min values:
      Min Max Min Max Min Max Min Max
      Min Max Min Max Min Max Min Max

    Values prepared for the division (DXT1):
      8(R+G+B)13 8(R+G+B)12 8(R+G+B)11 8(R+G+B)10 8(R+G+B)03 8(R+G+B)02 8(R+G+B)01 8(R+G+B)00
      8(R+G+B)33 8(R+G+B)32 8(R+G+B)31 8(R+G+B)30 8(R+G+B)23 8(R+G+B)22 8(R+G+B)21 8(R+G+B)20

    Values prepared for the division (DXT5):
      8A03 8(R+G+B)03 8A02 8(R+G+B)02 8A01 8(R+G+B)01 8A00 8(R+G+B)00
      8A13 8(R+G+B)13 8A12 8(R+G+B)12 8A11 8(R+G+B)11 8A10 8(R+G+B)10
      8A23 8(R+G+B)23 8A22 8(R+G+B)22 8A21 8(R+G+B)21 8A20 8(R+G+B)20
      8A33 8(R+G+B)33 8A32 8(R+G+B)32 8A31 8(R+G+B)31 8A30 8(R+G+B)30
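The inset and endpoint-packing steps can be illustrated in scalar form. This is a sketch under stated assumptions: the inset amount of `range >> 4` is borrowed from the Real-Time DXT Compression approach and is not given in this presentation, the maximum is assumed to be stored first as color0, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { INSET_SHIFT = 4 };  /* assumption: not stated in the presentation */

/* Pull both ends of one channel's bounding box inward by range >> INSET_SHIFT. */
static void inset_channel(uint8_t *min, uint8_t *max)
{
    assert(*min <= *max);
    uint8_t inset = (uint8_t)((*max - *min) >> INSET_SHIFT);
    *min = (uint8_t)(*min + inset);
    *max = (uint8_t)(*max - inset);
}

/* Reduce 8-bit channels to the 5:6:5 layout used by DXT color endpoints. */
static uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Write color0 = max and color1 = min to the first four bytes of a color
 * block, little-endian. Storing the larger value first selects the
 * four-color (non-transparent) DXT1 mode when the two values differ. */
static void store_endpoints(uint8_t block[8],
                            const uint8_t min_rgb[3], const uint8_t max_rgb[3])
{
    uint16_t c0 = pack_rgb565(max_rgb[0], max_rgb[1], max_rgb[2]);
    uint16_t c1 = pack_rgb565(min_rgb[0], min_rgb[1], min_rgb[2]);
    block[0] = (uint8_t)(c0 & 0xFF);
    block[1] = (uint8_t)(c0 >> 8);
    block[2] = (uint8_t)(c1 & 0xFF);
    block[3] = (uint8_t)(c1 >> 8);
}
```

The SIMD version described on the slide performs the same shifts and masks on all channels at once, using a multiply in place of per-channel left shifts.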

  • Prepare dividers according to the range (mov, add, or, movd, pshufd)

    Perform the division (fixed point multiplication) to get indices (pmulhw)

    Pack indices together and store them to the temporary buffer

    (SSSE3: packuswb, pshufb, pmaddubsw, pmaddwd, movdqa)

    (SSE2: pshuflw, pshufhw, pmaddwd, packssdw, movdqa)

    Convert packed indices to final DXT indices and store them to the DXT block (mov, or)
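Division by fixed-point multiplication with a divider table, as in the steps above, can be modeled in scalar code. The sketch below is an assumption-laden model rather than the author's scheme: it uses a 22-bit fraction and 32-bit table entries so that results match integer division exactly, while the real code works on 16-bit words with `pmulhw`. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    FIX_SHIFT = 22,        /* fractional bits of the stored reciprocal */
    MAX_RANGE = 3 * 255    /* largest possible R+G+B range */
};

/* g_divider[range] holds ceil(2^FIX_SHIFT / range). */
static uint32_t g_divider[MAX_RANGE + 1];

static void build_dividers(void)
{
    g_divider[0] = 0;      /* flat block: every index becomes 0 */
    for (uint32_t range = 1; range <= MAX_RANGE; ++range)
        g_divider[range] = ((1u << FIX_SHIFT) + range - 1) / range;
}

/* Computes floor(levels * dist / range) with a multiply and a shift.
 * Exact while levels * dist < 4096, which covers levels = 4 with color
 * ranges up to 765 and levels = 8 with alpha ranges up to 255. */
static uint32_t scaled_index(uint32_t dist, uint32_t range, uint32_t levels)
{
    assert(range <= MAX_RANGE && dist <= range);
    uint64_t product = (uint64_t)(levels * dist) * g_divider[range];
    return (uint32_t)(product >> FIX_SHIFT);
}
```

Rounding the reciprocal up keeps the product from falling just below an exact quotient, and the 22-bit fraction keeps the accumulated error too small to push a result over the next integer for the input ranges noted in the comment.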

    Register contents at each step, as drawn on the slide:

    Dividers (DXT1):
      ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider

    Dividers (DXT5):
      AlphaDivider ColorDivider AlphaDivider ColorDivider AlphaDivider ColorDivider AlphaDivider ColorDivider

    Indices after the division (DXT1):
      ColorIndex13 ColorIndex12 ColorIndex11 ColorIndex10 ColorIndex03 ColorIndex02 ColorIndex01 ColorIndex00
      ColorIndex33 ColorIndex32 ColorIndex31 ColorIndex30 ColorIndex23 ColorIndex22 ColorIndex21 ColorIndex20

    Indices after the division (DXT5):
      AlphaIndex03 ColorIndex03 AlphaIndex02 ColorIndex02 AlphaIndex01 ColorIndex01 AlphaIndex00 ColorIndex00
      AlphaIndex13 ColorIndex13 AlphaIndex12 ColorIndex12 AlphaIndex11 ColorIndex11 AlphaIndex10 ColorIndex10
      AlphaIndex23 ColorIndex23 AlphaIndex22 ColorIndex22 AlphaIndex21 ColorIndex21 AlphaIndex20 ColorIndex20
      AlphaIndex33 ColorIndex33 AlphaIndex32 ColorIndex32 AlphaIndex31 ColorIndex31 AlphaIndex30 ColorIndex30

    Packed indices (DXT1):
      ColorIndex3330 ColorIndex2320 ColorIndex1310 ColorIndex0300

    Packed indices (DXT5):
      AlphaIndex1310 ColorIndex1310 AlphaIndex0300 ColorIndex0300
      AlphaIndex3330 ColorIndex3330 AlphaIndex2320 ColorIndex2320

    Output DXT block layout:
      Set3 Set2 Set1 Set0 Min Max Set2 Set1 Set0 Min Max

  • /*************************************************************************************************************

    Extreme DXT Compression
    Copyright (C) 2008 Cauldron, Ltd.
    Written by Peter Uliiansky

    Microsoft Public License (Ms-PL)

    This license governs use of the accompanying software. If you use the software, you accept this license. If you do not accept the license, do not use the software.

    1. Definitions
    The terms "reproduce," "reproduction," "derivative works," and "distribution" have the same meaning here as under U.S. copyright law.
    A "contribution" is the original software, or any additions or changes to the software.
    A "contributor" is any person that distributes its contribution under this license.
    "Licensed patents" are a contributor's patent claims that read directly on its contribution.

    2. Grant of Rights
    (A) Copyright Grant- Subject to the terms of this license, including the license conditions and limitations in section 3, each contributor grants you a non-exclusive, worldwide, royalty-free copyright license to reproduce its contribution, prepare derivative works of its contribution, and distribute its contribution or any derivative works that you create.
    (B) Patent Grant- Subject to the terms of this license, including the license conditions and limitations in section 3, each contributor grants you a non-exclusive, worldwide, royalty-free license under its licensed patents to make, have made, use, sell, offer for sale, import, and/or otherwise dispose of its contribution in the software or derivative works of the contribution in the software.

    3. Conditions and Limitations
    (A) No Trademark License- This license does not grant you rights to use any contributors' name, logo, or trademarks.
    (B) If you bring a patent claim against any contributor over patents that you claim are infringed by the software, your patent license from such contributor to the software ends automatically.
(C) If you distribute any portion of the software, you must retain all copyright, patent, trademark, and attribution notices that are present i